Progress:
* UM-2 [QEMU upstream maintainership]
- More code review: now have a target-arm.next poised and ready to
send once 6.2 is released
* QEMU-420 [GICv4 emulation]
- Working on the ITS changes needed for GICv4 support (this turns
out to be a more tractable end to start from than the redistributor)
- I have a preliminary set of 25 or so patches to the ITS which
clean up the code and fix some pre-existing bugs that I found
while working on the GICv4 changes
- have implemented the new VMAPI, VMAPTI, VMAPP ITS commands
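For reference, a rough sketch of what these three commands carry, per
the GICv4 architecture spec (field names are abbreviated and
illustrative; this is a summary for the reader, not the emulation code):
<cut>
// Illustrative C++ summary of the GICv4 ITS command payloads (not QEMU code).
#include <cstdint>

struct VMAPP {            // map a vPE to a redistributor
    uint16_t vpeid;       // which virtual PE
    uint64_t rdbase;      // target redistributor
    uint64_t vpt_addr;    // virtual LPI pending table for this vPE
    bool     valid;       // false means unmap
};

struct VMAPTI {           // map (DeviceID, EventID) to a virtual LPI
    uint32_t devid, eventid;
    uint16_t vpeid;
    uint32_t vintid;      // virtual INTID to deliver to the vPE
    uint32_t doorbell;    // physical LPI to raise when the vPE is not resident
};

struct VMAPI {            // like VMAPTI, but the virtual INTID is the EventID
    uint32_t devid, eventid;
    uint16_t vpeid;
    uint32_t doorbell;
};
</cut>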
-- PMM
After llvm commit 3d549dddf75b6ff9e0ec8c053677750bde4226ea
Author: Sander de Smalen <sander.desmalen(a)arm.com>
[LV] Pass compare predicate to getCmpSelInstrCost.
the following benchmarks slowed down by more than 2%:
- 464.h264ref slowed down by 7% from 11115 to 11846 perf samples
The reproducer instructions below can be used to rebuild both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O2 -flto
- Hardware: NVIDIA TX1 4x Cortex-A57
This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org. Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2_LTO
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Reproduce builds:
<cut>
mkdir investigate-llvm-3d549dddf75b6ff9e0ec8c053677750bde4226ea
cd investigate-llvm-3d549dddf75b6ff9e0ec8c053677750bde4226ea
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach 3d549dddf75b6ff9e0ec8c053677750bde4226ea
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach ab31d003e16e483bff298ea2f28fec0f23e8eb79
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 3d549dddf75b6ff9e0ec8c053677750bde4226ea
Author: Sander de Smalen <sander.desmalen(a)arm.com>
Date: Mon Dec 6 11:14:27 2021 +0000
[LV] Pass compare predicate to getCmpSelInstrCost.
If the condition of a select is a compare, pass its predicate to
TTI::getCmpSelInstrCost to get a more accurate cost value instead
of passing BAD_ICMP_PREDICATE.
I noticed that the commit message from D90070 had a comment about the
vectorized select predicate possibly being composed of other compares with
different predicate values, but I wasn't able to construct an example
where this was an actual issue. If this is an issue, I guess we could
add another check that the block isn't predicated for any reason.
Reviewed By: dmgreen, fhahn
Differential Revision: https://reviews.llvm.org/D114646
---
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | 11 ++++++++---
llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll | 14 +++++++-------
2 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 050879144afd..c03e506b7474 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7570,8 +7570,12 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF,
Type *CondTy = SI->getCondition()->getType();
if (!ScalarCond)
CondTy = VectorType::get(CondTy, VF);
- return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy, CondTy,
- CmpInst::BAD_ICMP_PREDICATE, CostKind, I);
+
+ CmpInst::Predicate Pred = CmpInst::BAD_ICMP_PREDICATE;
+ if (auto *Cmp = dyn_cast<CmpInst>(SI->getCondition()))
+ Pred = Cmp->getPredicate();
+ return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy, CondTy, Pred,
+ CostKind, I);
}
case Instruction::ICmp:
case Instruction::FCmp: {
@@ -7581,7 +7585,8 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF,
ValTy = IntegerType::get(ValTy->getContext(), MinBWs[Op0AsInstruction]);
VectorTy = ToVectorTy(ValTy, VF);
return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy, nullptr,
- CmpInst::BAD_ICMP_PREDICATE, CostKind, I);
+ cast<CmpInst>(I)->getPredicate(), CostKind,
+ I);
}
case Instruction::Store:
case Instruction::Load: {
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll b/llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll
index 62b18f44fbc5..20d2dc0b7cda 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll
@@ -5,17 +5,17 @@ target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
target triple = "arm64-apple-ios5.0.0"
define void @selects_1(i32* nocapture %dst, i32 %A, i32 %B, i32 %C, i32 %N) {
-; CHECK: LV: Found an estimated cost of 5 for VF 2 For instruction: %cond = select i1 %cmp1, i32 10, i32 %and
-; CHECK: LV: Found an estimated cost of 5 for VF 2 For instruction: %cond6 = select i1 %cmp2, i32 30, i32 %and
-; CHECK: LV: Found an estimated cost of 5 for VF 2 For instruction: %cond11 = select i1 %cmp7, i32 %cond, i32 %cond6
+; CHECK: LV: Found an estimated cost of 1 for VF 2 For instruction: %cond = select i1 %cmp1, i32 10, i32 %and
+; CHECK: LV: Found an estimated cost of 1 for VF 2 For instruction: %cond6 = select i1 %cmp2, i32 30, i32 %and
+; CHECK: LV: Found an estimated cost of 1 for VF 2 For instruction: %cond11 = select i1 %cmp7, i32 %cond, i32 %cond6
-; CHECK: LV: Found an estimated cost of 13 for VF 4 For instruction: %cond = select i1 %cmp1, i32 10, i32 %and
-; CHECK: LV: Found an estimated cost of 13 for VF 4 For instruction: %cond6 = select i1 %cmp2, i32 30, i32 %and
-; CHECK: LV: Found an estimated cost of 13 for VF 4 For instruction: %cond11 = select i1 %cmp7, i32 %cond, i32 %cond6
+; CHECK: LV: Found an estimated cost of 1 for VF 4 For instruction: %cond = select i1 %cmp1, i32 10, i32 %and
+; CHECK: LV: Found an estimated cost of 1 for VF 4 For instruction: %cond6 = select i1 %cmp2, i32 30, i32 %and
+; CHECK: LV: Found an estimated cost of 1 for VF 4 For instruction: %cond11 = select i1 %cmp7, i32 %cond, i32 %cond6
; CHECK-LABEL: define void @selects_1(
; CHECK: vector.body:
-; CHECK: select <2 x i1>
+; CHECK: select <4 x i1>
entry:
%cmp26 = icmp sgt i32 %N, 0
</cut>
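To make the costing change concrete, here is a minimal standalone sketch
of the pattern the patch introduces, built from the hunk above (the
function name and parameter list are illustrative; the LLVM API calls are
the ones the patch uses):
<cut>
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Cost a select: forward the predicate of its compare condition when there
// is one, instead of always passing BAD_ICMP_PREDICATE.
InstructionCost costSelect(const TargetTransformInfo &TTI, SelectInst *SI,
                           Type *VectorTy, Type *CondTy,
                           TargetTransformInfo::TargetCostKind CostKind) {
  CmpInst::Predicate Pred = CmpInst::BAD_ICMP_PREDICATE;
  if (auto *Cmp = dyn_cast<CmpInst>(SI->getCondition()))
    Pred = Cmp->getPredicate(); // lets the target cost the cmp+select pair
  return TTI.getCmpSelInstrCost(SI->getOpcode(), VectorTy, CondTy, Pred,
                                CostKind, SI);
}
</cut>
The select-costs.ll changes above show the effect on AArch64: once the
target sees the real predicate, the per-select estimate drops from 5
(VF 2) and 13 (VF 4) to 1.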
Dear Linaro Toolchain Working Group,
clang-thumbv7-full-2stage has been red for 20 days.
Could you take it to the staging area and make it green again, please?
Thanks
Galina
After llvm commit bd4c6a476fd037fb07a1c484f75d93ee40713d3d
Author: David Blaikie <dblaikie(a)gmail.com>
Add missing header
the following benchmarks slowed down by more than 2%:
- 433.milc slowed down by 4% from 12427 to 12916 perf samples
The reproducer instructions below can be used to rebuild both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O2 -flto
- Hardware: NVIDIA TX1 4x Cortex-A57
This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org. Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2_LTO
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Reproduce builds:
<cut>
mkdir investigate-llvm-bd4c6a476fd037fb07a1c484f75d93ee40713d3d
cd investigate-llvm-bd4c6a476fd037fb07a1c484f75d93ee40713d3d
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach bd4c6a476fd037fb07a1c484f75d93ee40713d3d
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 7d4da4e1ab7f79e51db0d5c2a0f5ef1711122dd7
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit bd4c6a476fd037fb07a1c484f75d93ee40713d3d
Author: David Blaikie <dblaikie(a)gmail.com>
Date: Mon Nov 29 16:29:25 2021 -0800
Add missing header
---
llvm/lib/Demangle/DLangDemangle.cpp | 1 +
1 file changed, 1 insertion(+)
diff --git a/llvm/lib/Demangle/DLangDemangle.cpp b/llvm/lib/Demangle/DLangDemangle.cpp
index faf91b239490..f380aa90035e 100644
--- a/llvm/lib/Demangle/DLangDemangle.cpp
+++ b/llvm/lib/Demangle/DLangDemangle.cpp
@@ -17,6 +17,7 @@
#include "llvm/Demangle/StringView.h"
#include "llvm/Demangle/Utility.h"
+#include <cctype>
#include <cstring>
#include <limits>
</cut>
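For context, a missing standard header like this typically only bites on
platforms where no other header pulls it in transitively; a hedged sketch
of the failure mode (the call site is illustrative, not the actual
DLangDemangle code):
<cut>
#include <cctype>  // declares std::isdigit; relying on a transitive include
                   // compiles on some platforms/libraries and not others
#include <cstring>

// Illustrative helper: demanglers routinely test for digit-prefixed length
// fields, which is the kind of use that needs <cctype>.
static bool startsWithDigit(const char *S) {
  return S != nullptr && std::isdigit(static_cast<unsigned char>(*S)) != 0;
}
</cut>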
VirtIO Initiative ([STR-9])
===========================
- synced up on [AF_XDP task with Akashi-san]
- synced on rust-vmm
[AF_XDP task with Akashi-san]
<https://linaro.atlassian.net/browse/STR-68>
vhost-device maintainer effort ([UM-196])
- started looking at https://github.com/rust-vmm/vhost-device/pull/4
QEMU Upstream Work ([UM-2])
===========================
- posted [PULL for 6.2 0/8] more tcg, plugin, test and build fixes
Message-Id: <20211129171449.4176301-1-alex.bennee(a)linaro.org>
- commented on Re: Follow-up on the CXL discussion at OFTC Message-Id:
<20211119015207.62fhk5mjmvaj5nz4(a)intel.com> to see if I can unblock it
- posted [RFC PATCH] blog post: how to get your new feature
up-streamed Message-Id:
<20211126203319.3298089-1-alex.bennee(a)linaro.org>
- posted [PATCH for 6.2?] Revert "vga: don't abort when adding a
duplicate isa-vga device" Message-Id:
<20211202164929.1119036-1-alex.bennee(a)linaro.org>
Upstream MTTCG tests ([QEMU-52])
- posted [kvm-unit-tests PATCH v9 0/9] MTTCG sanity tests for ARM
Message-Id: <20211202115352.951548-1-alex.bennee(a)linaro.org>
[QEMU-52] <https://linaro.atlassian.net/browse/QEMU-52>
Other
=====
- wrote [RFC PATCH 0/2] insn plugin tweaks for measuring frequency
Message-Id: <20211203144421.1445232-1-alex.bennee(a)linaro.org>
- might make a good basis for a TCG plugins blog post
Completed Reviews [2/2]
=======================
[PATCH] tests/plugin/syscall.c: fix compiler warnings
Message-Id: <20211128011551.2115468-1-juro.bystricky(a)intel.com>
[PATCH for-6.2? 0/2] arm_gicv3: Fix handling of LPIs in list registers
Message-Id: <20211126163915.1048353-2-peter.maydell(a)linaro.org>
Current Review Queue
====================
TODO [PATCH-4.16 v2] xen/efi: Fix Grub2 boot on arm64
Message-Id: <20211104141206.25153-1-luca.fancellu(a)arm.com>
TODO [PATCH] cpu-models-x86.rst: Tidy up a couple of things
Message-Id: <20211015100718.17828-1-pbonzini(a)redhat.com>
TODO [PATCH 00/16] fdt: Make OF_BOARD a boolean option
Message-Id: <20211013010120.96851-1-sjg(a)chromium.org>
TODO [PATCH v4 00/41] linux-user: Streamline handling of SIGSEGV
Message-Id: <20211006172307.780893-1-richard.henderson(a)linaro.org>
--
Alex Bennée
Progress:
* UM-2 [QEMU upstream maintainership]
- Code review: worked through some of the backlog and accumulated
a list of series to take once the tree reopens for 7.0
- Wrote and sent some cleanup patches relating to the qemu-common.h
header file
- Fixed a bug where we miscalculated the length for TLB range
invalidations
* QEMU-420 [GICv4 emulation]
- Found the problem with PCI passthrough in my nested test setup:
apparently virtio PCI devices need an extra command line argument
to get them to honour the presence of an IOMMU. Everything is
now working and I've put some notes about the setup into
https://linaro.atlassian.net/browse/QEMU-447
- started to implement the GICv4 redistributor changes
-- PMM
VirtIO Initiative ([STR-9])
===========================
- [this week's sync], topics on AF_XDP, virtio-video and
virtio-watchdog
[upstream rust-vmm sync meeting]
<https://etherpad.opendev.org/p/rust-vmm-sync-2021>
QEMU Upstream Work ([UM-2])
===========================
- posted [PATCH for 6.2 v2 0/7] more tcg, plugin, test and build fixes
Message-Id: <20211125154144.2904741-1-alex.bennee(a)linaro.org>
Upstream MTTCG tests ([QEMU-52])
- posted [kvm-unit-tests PATCH v8 00/10] MTTCG sanity tests for ARM
Message-Id: <20211118184650.661575-1-alex.bennee(a)linaro.org>
[mttcg tests to current state and fixed up]
<https://github.com/stsquad/qemu/tree/mttcg/current-tests-v8>
Other
=====
- renewal feedback
Completed Reviews [2/2]
=======================
[PATCH v2 0/3] KVM: qemu patches for few KVM features I developed
Message-Id: <20211101132300.192584-1-mlevitsk(a)redhat.com>
[PATCH v2] hw/intc/arm_gicv3: Update cached state after LPI state changes
Message-Id: <20211124202005.989935-1-peter.maydell(a)linaro.org>
Absences
========
- off 2 days sick
Current Review Queue
====================
TODO [PATCH-4.16 v2] xen/efi: Fix Grub2 boot on arm64
Message-Id: <20211104141206.25153-1-luca.fancellu(a)arm.com>
TODO [PATCH] cpu-models-x86.rst: Tidy up a couple of things
Message-Id: <20211015100718.17828-1-pbonzini(a)redhat.com>
TODO [PATCH 00/16] fdt: Make OF_BOARD a boolean option
Message-Id: <20211013010120.96851-1-sjg(a)chromium.org>
TODO [PATCH v4 00/41] linux-user: Streamline handling of SIGSEGV
Message-Id: <20211006172307.780893-1-richard.henderson(a)linaro.org>
--
Alex Bennée
Progress:
* QEMU-420 [GICv4 emulation]
- Tracked down and fixed a bug in our ITS emulation which would
(intermittently?) result in a Linux guest reporting "irq 54:
nobody cared" and hanging, because we were not correctly
recalculating the highest priority pending interrupt when the
guest acknowledged a pending LPI. This fix will go into 6.2.
- Set up a test environment for GICv4 work -- because the major
feature of GICv4 is support for directly injecting interrupts
into a VM, the test setup needs to be nested virtualization,
where an outer L1 guest runs on pure emulated QEMU, the inner
L2 guest uses KVM (as provided by L1), and we pass a PCI device
(emulated by QEMU) through from L1 to L2. I think I have this
correctly set up now, but...
- ...the L2 guest hangs because it apparently never sees an
interrupt from the passed-through PCI device. This implies a
bug in our current GICv3 emulation somewhere: need to track
this down before starting in on GICv4 work.
- Separately, I found through code inspection a bug where we
do the wrong thing in the non-passthrough case when the L1 guest
sets a virtual interrupt for the L2 guest in the GIC list
registers and that interrupt has an ID > 1023 (ie it is an LPI).
We got this wrong both for acknowledging and ending an interrupt,
so the two bugs cancel each other out except that we don't set
the vCPU priority and so the L2 guest might get an unexpected
interrupt while it was servicing the LPI. Patches sent.
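A minimal sketch of the special-casing involved, with illustrative data
structures rather than QEMU's actual arm_gicv3 code:
<cut>
#include <cstdint>

// vINTIDs above 1023 in a list register are LPIs (actual LPI INTIDs start
// at 8192); LPIs have no active state to track.
constexpr uint32_t GIC_LPI_BOUNDARY = 1024;

struct VCpuGicState {
    uint8_t running_prio;  // virtual running priority
    // ... list registers, active-priority registers, etc. elided ...
};

// Acknowledge a virtual interrupt from a list register. The bug described
// above: ack and EOI both mishandled the missing active state for LPIs and
// the errors cancelled out -- except that the running priority was never
// raised, so the guest could take another interrupt mid-service.
void ack_virtual_irq(VCpuGicState *cs, uint32_t vintid, uint8_t prio) {
    cs->running_prio = prio;  // must happen for LPIs too
    if (vintid < GIC_LPI_BOUNDARY) {
        // only non-LPIs have an active state to set in the list register
        // ... elided ...
    }
}
</cut>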
-- PMM
VirtIO Initiative ([STR-9])
===========================
- posted Initial thoughts for test scenarios for AF_XDP epic
Message-Id: <87k0h5v6ju.fsf(a)linaro.org>
vhost-device maintainer effort ([UM-196])
- more review
- [did some more noodling with rust] to get comfortable with generics
[UM-196] <https://linaro.atlassian.net/browse/UM-196>
[vhost-device crate] <https://github.com/rust-vmm/vhost-device>
[did some more noodling with rust]
<https://gitlab.com/stsquad/softfloat.rs>
QEMU Upstream Work ([UM-2])
===========================
- posted [PULL for 6.2 0/7] misc build and test fixes Message-Id:
<20211116162515.4100231-1-alex.bennee(a)linaro.org>
- posted [RFC PATCH] tests/avocado: fix tcg_plugin mem access count
test Message-Id: <20211117095448.136558-1-alex.bennee(a)linaro.org>
- posted Re: [RFC PATCH] plugins/meson.build: fix linker issue with
weird paths (for v6.2?) Message-Id:
<20211117111924.179776-1-alex.bennee(a)linaro.org>
- posted Re: [PATCH v2 1/3] icount: preserve cflags when custom tb is
about to execute Message-Id: <87h7cbw1tx.fsf(a)linaro.org>
- posted [RFC PATCH] gdbstub: handle a potentially racing TaskState
Message-Id: <20211119145124.942390-1-alex.bennee(a)linaro.org>
[UM-2] <https://linaro.atlassian.net/browse/UM-2>
Upstream MTTCG tests ([QEMU-52])
- posted [kvm-unit-tests PATCH v3 0/3] GIC ITS tests Message-Id:
<20211112114734.3058678-1-alex.bennee(a)linaro.org>
- posted [kvm-unit-tests PATCH v8 00/10] MTTCG sanity tests for ARM
Message-Id: <20211118184650.661575-1-alex.bennee(a)linaro.org>
[QEMU-52] <https://linaro.atlassian.net/browse/QEMU-52>
[mttcg tests to current state and fixed up]
<https://github.com/stsquad/qemu/tree/mttcg/current-tests-v8>
Completed Reviews [2/2]
=======================
[PATCH v2 0/3] Some watchpoint-related patches
Message-Id: <163662450348.125458.5494710452733592356.stgit@pasha-ThinkPad-X280>
[PATCH 0/5] Update linux-headers + NOIRQ support for KVM gdbstub
Message-Id: <20211111110604.207376-1-pbonzini(a)redhat.com>
Absences
========
- none
Current Review Queue
====================
TODO [PATCH-4.16 v2] xen/efi: Fix Grub2 boot on arm64
Message-Id: <20211104141206.25153-1-luca.fancellu(a)arm.com>
TODO [PATCH] cpu-models-x86.rst: Tidy up a couple of things
Message-Id: <20211015100718.17828-1-pbonzini(a)redhat.com>
TODO [PATCH 00/16] fdt: Make OF_BOARD a boolean option
Message-Id: <20211013010120.96851-1-sjg(a)chromium.org>
TODO [PATCH v4 00/41] linux-user: Streamline handling of SIGSEGV
Message-Id: <20211006172307.780893-1-richard.henderson(a)linaro.org>
--
Alex Bennée
Progress (short week, 3 days)
* UM-2 [QEMU upstream maintainership]
- Still trying to sort out the regression of booting EL3 guest
code on the imx7 board. I got most of the way through prototyping
a cleanup which would fix this, but then spotted that the
highbank board has a more awkward-to-fix similar problem.
We're going to revert the PSCI emulation change for 6.2 so we
can take the time to get the cleanup right and land it in 7.0.
- Usual patch accumulation, review, etc during release cycle
-- PMM
VirtIO Initiative ([STR-9])
===========================
- project admin
[STR-9] <https://linaro.atlassian.net/browse/STR-9>
[upstream rust-vmm sync meeting]
<https://etherpad.opendev.org/p/rust-vmm-sync-2021>
[proposal] <https://github.com/rust-vmm/vhost-device/pull/57>
vhost-device maintainer effort ([UM-196])
- did a bunch of review on [vhost-device crate]
[UM-196] <https://linaro.atlassian.net/browse/UM-196>
[vhost-device crate] <https://github.com/rust-vmm/vhost-device>
QEMU Upstream Work ([UM-2])
===========================
[UM-2] <https://linaro.atlassian.net/browse/UM-2>
Upstream MTTCG tests ([QEMU-52])
- posted [kvm-unit-tests PATCH v3 0/3] GIC ITS tests Message-Id:
<20211112114734.3058678-1-alex.bennee(a)linaro.org>
- might as well flush the tree state as I left it
- posted [RFC PATCH] hw/intc: clean-up error reporting for failed
ITS cmd Message-Id:
<20211112170454.3158925-1-alex.bennee(a)linaro.org>
- re-based [mttcg tests to current state and fixed up]
[QEMU-52] <https://linaro.atlassian.net/browse/QEMU-52>
[mttcg tests to current state and fixed up]
<https://github.com/stsquad/qemu/tree/mttcg/current-tests-v8>
Completed Reviews [2/2]
=======================
[PATCH v2 0/3] Some watchpoint-related patches
Message-Id: <163662450348.125458.5494710452733592356.stgit@pasha-ThinkPad-X280>
[PATCH 0/5] Update linux-headers + NOIRQ support for KVM gdbstub
Message-Id: <20211111110604.207376-1-pbonzini(a)redhat.com>
Absences
========
- none
Current Review Queue
====================
TODO [PATCH] cpu-models-x86.rst: Tidy up a couple of things
Message-Id: <20211015100718.17828-1-pbonzini(a)redhat.com>
TODO [PATCH 00/16] fdt: Make OF_BOARD a boolean option
Message-Id: <20211013010120.96851-1-sjg(a)chromium.org>
TODO [PATCH v4 00/41] linux-user: Streamline handling of SIGSEGV
Message-Id: <20211006172307.780893-1-richard.henderson(a)linaro.org>
TODO [PATCH] softmmu: fix watchpoint processing in icount mode
Message-Id: <163101424137.678744.18360776310711795413.stgit@pasha-ThinkPad-X280>
--
Alex Bennée
Progress (short week, 3 days)
* UM-2 [QEMU upstream maintainership]
- recent changes to QEMU's PSCI emulation broke booting of guest code
at EL3 on the imx7 board, which was previously accidentally
relying on PSCI-emulation-via-SMC not getting in its way despite
being enabled. We need to make this board disable PSCI when the
guest code is booting to EL3, as the virt board does, but it's
trickier here because the CPU-creation code is hidden inside a
model of an SoC object. After some on-list discussion I have a
plan for how to restructure this, and need to write some code...
* QEMU-420 [GICv4 emulation]
- re-read the GIC architecture specification, acquired a better
understanding of the required work, and broke this epic down into
stories
- discussed with Leif how the ITS support should be landed in the
sbsa-ref board
Misc:
* higher-than-usual amount of meetings and meeting-prep this week
-- PMM
After llvm commit f411c1dd95092139c8b992260705ac0b75c8583f
Author: Peter Klausler <pklausler(a)nvidia.com>
[flang] Fix crash in semantic error recovery situation
the following benchmarks slowed down by more than 2%:
- 456.hmmer slowed down by 3% from 7600 to 7806 perf samples
The reproducer instructions below can be used to rebuild both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O2 -flto
- Hardware: NVIDIA TX1 4x Cortex-A57
This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org. Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2_LTO
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Reproduce builds:
<cut>
mkdir investigate-llvm-f411c1dd95092139c8b992260705ac0b75c8583f
cd investigate-llvm-f411c1dd95092139c8b992260705ac0b75c8583f
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach f411c1dd95092139c8b992260705ac0b75c8583f
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach c0b298fc213c1b33e97ca72fba58597365375875
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit f411c1dd95092139c8b992260705ac0b75c8583f
Author: Peter Klausler <pklausler(a)nvidia.com>
Date: Tue Nov 2 16:41:15 2021 -0700
[flang] Fix crash in semantic error recovery situation
A CHECK() in semantics is triggering when analyzing a program
with an undefined derived type pointer, because the CHECK expects
a new error message to have been issued in a function
but does not allow for the case that a diagnostic could have been
produced earlier. Adjust the predicate.
Differential Revision: https://reviews.llvm.org/D113307
---
flang/lib/Semantics/expression.cpp | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/flang/lib/Semantics/expression.cpp b/flang/lib/Semantics/expression.cpp
index 331b9b2cf5bc..8ee8c9a9c9ce 100644
--- a/flang/lib/Semantics/expression.cpp
+++ b/flang/lib/Semantics/expression.cpp
@@ -1916,7 +1916,7 @@ auto ExpressionAnalyzer::AnalyzeProcedureComponentRef(
"Base of procedure component reference is not a derived-type object"_err_en_US);
}
}
- CHECK(!GetContextualMessages().empty());
+ CHECK(context_.AnyFatalError());
return std::nullopt;
}
</cut>
Hello,
We have been using Linaro GCC 7.5-2019.12 for the A53.
As we move on to newer technology, there seems to be no support for
"-mcpu=cortex-a55".
Today, we use the aarch64-elf- toolchain.
What GCC do you suggest we start using for the A55?
Thanks,
Stefan
VirtIO Initiative ([STR-9])
===========================
- various rust-vmm discussions
- [upstream rust-vmm sync meeting]
- how to deal with vhost-device/vm-virtio split: [proposal]
- synced with ARM on their interests
- got update on Fwd: FW: [App-services] Slides from the
hypervisor-less virtio status meeting Message-Id:
<CAHDbmO2G4hUyfxtaxwnbxsrMk+P41zbL-7VNe=Aa6DshxC-5zQ(a)mail.gmail.com>
[STR-9] <https://linaro.atlassian.net/browse/STR-9>
[upstream rust-vmm sync meeting]
<https://etherpad.opendev.org/p/rust-vmm-sync-2021>
[proposal] <https://github.com/rust-vmm/vhost-device/pull/57>
QEMU Upstream Work ([UM-2])
===========================
- did some bug triage and investigated [555] and [690] which might
intersect with earlier changes I made
- spent time on the PR from hell [PULL 00/30] testing, gdbstub and
semihosting Message-Id:
<20210115130828.23968-1-alex.bennee(a)linaro.org>
[UM-2] <https://linaro.atlassian.net/browse/UM-2>
[555] <https://gitlab.com/qemu-project/qemu/-/issues/555>
[690] <https://gitlab.com/qemu-project/qemu/-/issues/690>
Other
=====
- TSC report preparation for QEMU and Stratos
Completed Reviews [1/1]
=======================
[XEN PATCH v7 00/51] xen: Build system improvements, now with out-of-tree build!
Message-Id: <20210824105038.1257926-1-anthony.perard(a)citrix.com>
Absences
========
Current Review Queue
====================
TODO [PATCH v2 00/48] tcg: optimize redundant sign extensions
Message-Id: <20211007195456.1168070-1-richard.henderson(a)linaro.org>
TODO [PATCH] cpu-models-x86.rst: Tidy up a couple of things
Message-Id: <20211015100718.17828-1-pbonzini(a)redhat.com>
TODO [PATCH 00/16] fdt: Make OF_BOARD a boolean option
Message-Id: <20211013010120.96851-1-sjg(a)chromium.org>
TODO [PATCH v4 00/41] linux-user: Streamline handling of SIGSEGV
Message-Id: <20211006172307.780893-1-richard.henderson(a)linaro.org>
--
Alex Bennée
Progress
* UM-2 [QEMU upstream maintainership]
+ worked through the big pile of email that had built up while
I was on holiday...
+ some long-delayed sysadmin tasks on my work machines now I have
an opportunity to go into the office and do things that would be
too risky with only remote access
+ triaged a bunch of Coverity issues
* QEMU-406 [QEMU support for MVE (M-profile Vector Extension; Helium)]
+ All work here has now gone upstream; closed!
-- PMM
After llvm commit fbc0c308d599fe3300ab6516650b65b41979446d
Author: Nikita Popov <nikita.ppv(a)gmail.com>
[BasicAA] Handle known bits as ranges
the following benchmarks slowed down by more than 2%:
- 464.h264ref slowed down by 7% from 10899 to 11610 perf samples
- 464.h264ref:libc.so.6 slowed down by 11% from 3538 to 3922 perf samples
The reproducer instructions below can be used to rebuild both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O2 -flto
- Hardware: NVIDIA TX1 4x Cortex-A57
This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org. Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2_LTO
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Reproduce builds:
<cut>
mkdir investigate-llvm-fbc0c308d599fe3300ab6516650b65b41979446d
cd investigate-llvm-fbc0c308d599fe3300ab6516650b65b41979446d
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach fbc0c308d599fe3300ab6516650b65b41979446d
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 30a3652b6ade43504087f6e3acd8dc879055f501
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit fbc0c308d599fe3300ab6516650b65b41979446d
Author: Nikita Popov <nikita.ppv(a)gmail.com>
Date: Mon Oct 25 15:47:21 2021 +0200
[BasicAA] Handle known bits as ranges
BasicAA currently tries to determine that the offset is positive by
checking whether all variable indices are positive based on known
bits, multiplied by a positive scale. However, this is incorrect
if the scale multiplication might overflow. In the modified test
case the original value is positive, but may be negative after a
left shift.
Fix this by converting known bits into a constant range and reusing
the range-based logic, which handles overflow correctly.
Differential Revision: https://reviews.llvm.org/D112611
---
llvm/lib/Analysis/BasicAliasAnalysis.cpp | 51 +++++-----------------
.../test/Analysis/BasicAA/assume-index-positive.ll | 4 +-
2 files changed, 12 insertions(+), 43 deletions(-)
diff --git a/llvm/lib/Analysis/BasicAliasAnalysis.cpp b/llvm/lib/Analysis/BasicAliasAnalysis.cpp
index 0305732ca5d5..8cf947c43bf4 100644
--- a/llvm/lib/Analysis/BasicAliasAnalysis.cpp
+++ b/llvm/lib/Analysis/BasicAliasAnalysis.cpp
@@ -318,15 +318,6 @@ struct CastedValue {
return N;
}
- KnownBits evaluateWith(KnownBits N) const {
- assert(N.getBitWidth() == V->getType()->getPrimitiveSizeInBits() &&
- "Incompatible bit width");
- if (TruncBits) N = N.trunc(N.getBitWidth() - TruncBits);
- if (SExtBits) N = N.sext(N.getBitWidth() + SExtBits);
- if (ZExtBits) N = N.zext(N.getBitWidth() + ZExtBits);
- return N;
- }
-
ConstantRange evaluateWith(ConstantRange N) const {
assert(N.getBitWidth() == V->getType()->getPrimitiveSizeInBits() &&
"Incompatible bit width");
@@ -1250,8 +1241,6 @@ AliasResult BasicAAResult::aliasGEP(
if (!DecompGEP1.VarIndices.empty()) {
APInt GCD;
- bool AllNonNegative = DecompGEP1.Offset.isNonNegative();
- bool AllNonPositive = DecompGEP1.Offset.isNonPositive();
ConstantRange OffsetRange = ConstantRange(DecompGEP1.Offset);
for (unsigned i = 0, e = DecompGEP1.VarIndices.size(); i != e; ++i) {
const VariableGEPIndex &Index = DecompGEP1.VarIndices[i];
@@ -1266,24 +1255,19 @@ AliasResult BasicAAResult::aliasGEP(
else
GCD = APIntOps::GreatestCommonDivisor(GCD, ScaleForGCD.abs());
- if (AllNonNegative || AllNonPositive) {
- KnownBits Known = Index.Val.evaluateWith(
- computeKnownBits(Index.Val.V, DL, 0, &AC, Index.CxtI, DT));
- bool SignKnownZero = Known.isNonNegative();
- bool SignKnownOne = Known.isNegative();
- AllNonNegative &= (SignKnownZero && Scale.isNonNegative()) ||
- (SignKnownOne && Scale.isNonPositive());
- AllNonPositive &= (SignKnownZero && Scale.isNonPositive()) ||
- (SignKnownOne && Scale.isNonNegative());
- }
+ ConstantRange CR =
+ computeConstantRange(Index.Val.V, true, &AC, Index.CxtI);
+ KnownBits Known =
+ computeKnownBits(Index.Val.V, DL, 0, &AC, Index.CxtI, DT);
+ CR = CR.intersectWith(
+ ConstantRange::fromKnownBits(Known, /* Signed */ true),
+ ConstantRange::Signed);
assert(OffsetRange.getBitWidth() == Scale.getBitWidth() &&
"Bit widths are normalized to MaxPointerSize");
- OffsetRange = OffsetRange.add(Index.Val
- .evaluateWith(computeConstantRange(
- Index.Val.V, true, &AC, Index.CxtI))
- .sextOrTrunc(OffsetRange.getBitWidth())
- .smul_fast(ConstantRange(Scale)));
+ OffsetRange = OffsetRange.add(
+ Index.Val.evaluateWith(CR).sextOrTrunc(OffsetRange.getBitWidth())
+ .smul_fast(ConstantRange(Scale)));
}
// We now have accesses at two offsets from the same base:
@@ -1300,21 +1284,6 @@ AliasResult BasicAAResult::aliasGEP(
(GCD - ModOffset).uge(V1Size.getValue()))
return AliasResult::NoAlias;
- // If we know all the variables are non-negative, then the total offset is
- // also non-negative and >= DecompGEP1.Offset. We have the following layout:
- // [0, V2Size) ... [TotalOffset, TotalOffer+V1Size]
- // If DecompGEP1.Offset >= V2Size, the accesses don't alias.
- if (AllNonNegative && V2Size.hasValue() &&
- DecompGEP1.Offset.uge(V2Size.getValue()))
- return AliasResult::NoAlias;
- // Similarly, if the variables are non-positive, then the total offset is
- // also non-positive and <= DecompGEP1.Offset. We have the following layout:
- // [TotalOffset, TotalOffset+V1Size) ... [0, V2Size)
- // If -DecompGEP1.Offset >= V1Size, the accesses don't alias.
- if (AllNonPositive && V1Size.hasValue() &&
- (-DecompGEP1.Offset).uge(V1Size.getValue()))
- return AliasResult::NoAlias;
-
if (V1Size.hasValue() && V2Size.hasValue()) {
// Compute ranges of potentially accessed bytes for both accesses. If the
// interseciton is empty, there can be no overlap.
diff --git a/llvm/test/Analysis/BasicAA/assume-index-positive.ll b/llvm/test/Analysis/BasicAA/assume-index-positive.ll
index 451592067f4b..a53fff2c6009 100644
--- a/llvm/test/Analysis/BasicAA/assume-index-positive.ll
+++ b/llvm/test/Analysis/BasicAA/assume-index-positive.ll
@@ -130,12 +130,12 @@ define void @symmetry([0 x i8]* %ptr, i32 %a, i32 %b, i32 %c) {
ret void
}
-; TODO: %ptr.neg and %ptr.shl may alias, as the shl renders the previously
+; %ptr.neg and %ptr.shl may alias, as the shl renders the previously
; non-negative value potentially negative.
define void @shl_of_non_negative(i8* %ptr, i64 %a) {
; CHECK-LABEL: Function: shl_of_non_negative
; CHECK: NoAlias: i8* %ptr.a, i8* %ptr.neg
-; CHECK: NoAlias: i8* %ptr.neg, i8* %ptr.shl
+; CHECK: MayAlias: i8* %ptr.neg, i8* %ptr.shl
%a.cmp = icmp sge i64 %a, 0
call void @llvm.assume(i1 %a.cmp)
%ptr.neg = getelementptr i8, i8* %ptr, i64 -2
</cut>
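The shape of the fix, condensed from the hunk above into a standalone
helper (the function name and parameters are illustrative; the
ConstantRange/KnownBits calls are the LLVM APIs the patch actually uses):
<cut>
#include "llvm/ADT/APInt.h"
#include "llvm/IR/ConstantRange.h"
#include "llvm/Support/KnownBits.h"
using namespace llvm;

ConstantRange scaledIndexRange(const ConstantRange &IndexRange,
                               const KnownBits &Known, const APInt &Scale,
                               unsigned OffsetBitWidth) {
  // Intersect the value-tracking range with the range implied by known
  // bits, preferring the smallest signed range.
  ConstantRange CR = IndexRange.intersectWith(
      ConstantRange::fromKnownBits(Known, /*IsSigned=*/true),
      ConstantRange::Signed);
  // Resize to the offset width, then multiply by the scale; smul_fast is a
  // conservative signed multiply, so an overflowing scale (e.g. a shl that
  // flips the sign) widens the range instead of being mis-judged positive.
  return CR.sextOrTrunc(OffsetBitWidth).smul_fast(ConstantRange(Scale));
}
</cut>
This is why the shl_of_non_negative test flips from NoAlias to MayAlias:
the old known-bits sign check assumed the scaled index stayed
non-negative, which an overflowing left shift violates.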
After llvm commit adf55ac6657693f7bfbe3087b599b4031a765a44
Author: Lang Hames <lhames(a)gmail.com>
[ORC] Call ExecutorProcessControl::disconnect in unit tests that require it.
the following hot functions slowed down by more than 10% (but their benchmarks slowed down by less than 2%):
- 400.perlbench:[.] S_find_byclass slowed down by 12% from 644 to 721 perf samples
The reproducer instructions below can be used to rebuild both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O2
- Hardware: NVIDIA TX1 4x Cortex-A57
This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org. Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Reproduce builds:
<cut>
mkdir investigate-llvm-adf55ac6657693f7bfbe3087b599b4031a765a44
cd investigate-llvm-adf55ac6657693f7bfbe3087b599b4031a765a44
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach adf55ac6657693f7bfbe3087b599b4031a765a44
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach f526ee5b8517b60620cd03bb3e5945ed69d6bfaa
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit adf55ac6657693f7bfbe3087b599b4031a765a44
Author: Lang Hames <lhames(a)gmail.com>
Date: Tue Oct 12 14:55:49 2021 -0700
[ORC] Call ExecutorProcessControl::disconnect in unit tests that require it.
Another follow-up to 2815ed57e3c and 19b4e3cfc6a. For unit tests that don't use
an ExecutionSession we need to call ExecutorProcessControl::disconnect directly
to wait for the dispatcher to shut down.
https://llvm.org/PR52153
---
.../ExecutionEngine/Orc/EPCGenericJITLinkMemoryManagerTest.cpp | 2 ++
llvm/unittests/ExecutionEngine/Orc/EPCGenericMemoryAccessTest.cpp | 2 ++
2 files changed, 4 insertions(+)
diff --git a/llvm/unittests/ExecutionEngine/Orc/EPCGenericJITLinkMemoryManagerTest.cpp b/llvm/unittests/ExecutionEngine/Orc/EPCGenericJITLinkMemoryManagerTest.cpp
index f2b157e424b6..a95435aec2a3 100644
--- a/llvm/unittests/ExecutionEngine/Orc/EPCGenericJITLinkMemoryManagerTest.cpp
+++ b/llvm/unittests/ExecutionEngine/Orc/EPCGenericJITLinkMemoryManagerTest.cpp
@@ -134,6 +134,8 @@ TEST(EPCGenericJITLinkMemoryManagerTest, AllocFinalizeFree) {
auto Err2 = MemMgr->deallocate(std::move(*FA));
EXPECT_THAT_ERROR(std::move(Err2), Succeeded());
+
+ cantFail(SelfEPC->disconnect());
}
} // namespace
diff --git a/llvm/unittests/ExecutionEngine/Orc/EPCGenericMemoryAccessTest.cpp b/llvm/unittests/ExecutionEngine/Orc/EPCGenericMemoryAccessTest.cpp
index 78024644ca8b..beb0fefa094a 100644
--- a/llvm/unittests/ExecutionEngine/Orc/EPCGenericMemoryAccessTest.cpp
+++ b/llvm/unittests/ExecutionEngine/Orc/EPCGenericMemoryAccessTest.cpp
@@ -93,6 +93,8 @@ TEST(EPCGenericMemoryAccessTest, MemWrites) {
{{pointerToJITTargetAddress(&Test_Buffer), TestMsg}});
EXPECT_THAT_ERROR(std::move(Err5), Succeeded());
EXPECT_EQ(StringRef(Test_Buffer, TestMsg.size()), TestMsg);
+
+ cantFail(SelfEPC->disconnect());
}
} // namespace
</cut>
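The resulting pattern for EPC-only unit tests, condensed from the hunks
above (the component under test is elided):
<cut>
#include "llvm/ExecutionEngine/Orc/ExecutorProcessControl.h"
#include "llvm/Support/Error.h"
using namespace llvm;
using namespace llvm::orc;

void epcOnlyTest() {
  // No ExecutionSession here, so nothing will call disconnect() for us.
  auto SelfEPC = cantFail(SelfExecutorProcessControl::Create());
  // ... exercise the EPC-based component under test ...
  // Explicitly disconnect so the dispatcher shuts down before returning.
  cantFail(SelfEPC->disconnect());
}
</cut>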
== This Week ==
* GCC
- Committed a clean-up patch to gimple-isel
- PR93183: Committed fix
- PR102376: Patch approved upstream
- PR83750: Patch approved upstream, but it regresses one test case.
== Next Week ==
- Continue with ongoing tasks
After llvm commit bc69dd62c04a70d29943c1c06c7effed150b70e1
Author: Alexey Bataev <a.bataev(a)outlook.com>
[SLP]Improve graph reordering.
the following benchmarks grew in size by more than 1%:
- 444.namd grew in size by 2% from 192302 to 195218 bytes
The reproducer instructions below can be used to rebuild both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -Os
- Hardware: APM Mustang 8x X-Gene1
This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org. Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_apm/llvm-master-aarch64-spec2k6-Os
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-…
Reproduce builds:
<cut>
mkdir investigate-llvm-bc69dd62c04a70d29943c1c06c7effed150b70e1
cd investigate-llvm-bc69dd62c04a70d29943c1c06c7effed150b70e1
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-master-… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach bc69dd62c04a70d29943c1c06c7effed150b70e1
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 5661317f864abf750cf893c6a4cc7a977be0995a
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit bc69dd62c04a70d29943c1c06c7effed150b70e1
Author: Alexey Bataev <a.bataev(a)outlook.com>
Date: Tue Aug 3 13:20:32 2021 -0700
[SLP]Improve graph reordering.
Reworked the reordering algorithm. Originally, the compiler just tried to
detect the most common order in the reorderable nodes (loads, stores,
extractelements, extractvalues) and then fully rebuilt the graph in the
best order. This was not efficient, since it required extra
memory and time for building/rebuilding the tree, and doubled the use of the
scheduling budget, which could lead to missed vectorization due to
exhausted scheduling resources.
The patch provides a 2-way approach to the graph reordering problem. At first, all
reordering is done in-place; it does not require tree
deleting/rebuilding, it just rotates the scalars/orders/reuses masks in
the graph node.
The first step (top-to-bottom) rotates the whole graph, similarly to the previous
implementation. The compiler counts the number of the most-used orders of
the graph nodes with the same vectorization factor and then rotates the
subgraph with the given vectorization factor to the most-used order, if
it is not empty. Then it repeats the same procedure for the subgraphs with
the smaller vectorization factor. We can do this because we still need
to reshuffle the smaller subgraph when building operands for the graph
nodes with larger vectorization factor; we can rotate just the subgraph,
not the whole graph.
The second step (bottom-to-top) scans through the leaves and tries to
detect the users of the leaves which can be reordered. If the leaves can
be reordered in the best fashion, they are reordered and their users too.
This allows us to remove double shuffles to the same ordering of the operands in
many cases and just reorder the user operations instead. Plus, it moves
the final shuffles closer to the top of the graph and in many cases
allows us to remove extra shuffles, because the same procedure is repeated
again and we can again merge some reordering masks and reorder user nodes
instead of the operands.
Also, the patch improves the cost model for gathering of loads, which improves
the x264 benchmark in some cases.
Gives about +2% on AVX512 + LTO (more expected for AVX/AVX2) for {625,525}x264,
+3% for 508.namd, and improves most other benchmarks.
The compile and link time are almost the same, though in some cases they
should be better (we're not doing extra instruction scheduling
anymore), and we may vectorize more code for the large basic blocks again
because of the saved scheduling budget.
Differential Revision: https://reviews.llvm.org/D105020
---
.../llvm/Transforms/Vectorize/SLPVectorizer.h | 3 +-
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | 1364 ++++++++++++++------
.../AArch64/transpose-inseltpoison.ll | 84 +-
.../Transforms/SLPVectorizer/AArch64/transpose.ll | 84 +-
llvm/test/Transforms/SLPVectorizer/X86/addsub.ll | 42 +-
.../Transforms/SLPVectorizer/X86/crash_cmpop.ll | 6 +-
llvm/test/Transforms/SLPVectorizer/X86/extract.ll | 6 +-
.../SLPVectorizer/X86/jumbled-load-multiuse.ll | 12 +-
.../Transforms/SLPVectorizer/X86/jumbled-load.ll | 22 +-
.../SLPVectorizer/X86/jumbled_store_crash.ll | 29 +-
.../SLPVectorizer/X86/reorder_repeated_ops.ll | 4 +-
.../SLPVectorizer/X86/split-load8_2-unord.ll | 4 +-
.../X86/vectorize-reorder-alt-shuffle.ll | 9 +-
.../SLPVectorizer/X86/vectorize-reorder-reuse.ll | 52 +-
14 files changed, 1119 insertions(+), 602 deletions(-)
diff --git a/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h b/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
index f416a592d683..5e8c29913cad 100644
--- a/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
+++ b/llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
@@ -95,8 +95,7 @@ private:
/// Try to vectorize a list of operands.
/// \returns true if a value was vectorized.
- bool tryToVectorizeList(ArrayRef<Value *> VL, slpvectorizer::BoUpSLP &R,
- bool AllowReorder = false);
+ bool tryToVectorizeList(ArrayRef<Value *> VL, slpvectorizer::BoUpSLP &R);
/// Try to vectorize a chain that may start at the operands of \p I.
bool tryToVectorize(Instruction *I, slpvectorizer::BoUpSLP &R);
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 9c0029484964..7400b3d8a503 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -21,6 +21,7 @@
#include "llvm/ADT/DenseSet.h"
#include "llvm/ADT/Optional.h"
#include "llvm/ADT/PostOrderIterator.h"
+#include "llvm/ADT/PriorityQueue.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SetOperations.h"
#include "llvm/ADT/SetVector.h"
@@ -535,13 +536,68 @@ static bool isSimple(Instruction *I) {
return true;
}
+/// Shuffles \p Mask in accordance with the given \p SubMask.
+static void addMask(SmallVectorImpl<int> &Mask, ArrayRef<int> SubMask) {
+ if (SubMask.empty())
+ return;
+ if (Mask.empty()) {
+ Mask.append(SubMask.begin(), SubMask.end());
+ return;
+ }
+ SmallVector<int> NewMask(SubMask.size(), UndefMaskElem);
+ int TermValue = std::min(Mask.size(), SubMask.size());
+ for (int I = 0, E = SubMask.size(); I < E; ++I) {
+ if (SubMask[I] >= TermValue || SubMask[I] == UndefMaskElem ||
+ Mask[SubMask[I]] >= TermValue)
+ continue;
+ NewMask[I] = Mask[SubMask[I]];
+ }
+ Mask.swap(NewMask);
+}
+
+/// Order may have elements assigned special value (size) which is out of
+/// bounds. Such indices only appear on places which correspond to undef values
+/// (see canReuseExtract for details) and used in order to avoid undef values
+/// have effect on operands ordering.
+/// The first loop below simply finds all unused indices and then the next loop
+/// nest assigns these indices for undef values positions.
+/// As an example below Order has two undef positions and they have assigned
+/// values 3 and 7 respectively:
+/// before: 6 9 5 4 9 2 1 0
+/// after: 6 3 5 4 7 2 1 0
+/// \returns Fixed ordering.
+static void fixupOrderingIndices(SmallVectorImpl<unsigned> &Order) {
+ const unsigned Sz = Order.size();
+ SmallBitVector UsedIndices(Sz);
+ SmallVector<int> MaskedIndices;
+ for (unsigned I = 0; I < Sz; ++I) {
+ if (Order[I] < Sz)
+ UsedIndices.set(Order[I]);
+ else
+ MaskedIndices.push_back(I);
+ }
+ if (MaskedIndices.empty())
+ return;
+ SmallVector<int> AvailableIndices(MaskedIndices.size());
+ unsigned Cnt = 0;
+ int Idx = UsedIndices.find_first_unset();
+ do {
+ AvailableIndices[Cnt] = Idx;
+ Idx = UsedIndices.find_next_unset(Idx);
+ ++Cnt;
+ } while (Idx > 0);
+ assert(Cnt == MaskedIndices.size() && "Non-synced masked/available indices.");
+ for (int I = 0, E = MaskedIndices.size(); I < E; ++I)
+ Order[MaskedIndices[I]] = AvailableIndices[I];
+}
+
namespace llvm {
static void inversePermutation(ArrayRef<unsigned> Indices,
SmallVectorImpl<int> &Mask) {
Mask.clear();
const unsigned E = Indices.size();
- Mask.resize(E, E + 1);
+ Mask.resize(E, UndefMaskElem);
for (unsigned I = 0; I < E; ++I)
Mask[Indices[I]] = I;
}
@@ -581,6 +637,22 @@ static Optional<int> getInsertIndex(Value *InsertInst, unsigned Offset) {
return Index;
}
+/// Reorders the list of scalars in accordance with the given \p Order and then
+/// the \p Mask. \p Order - is the original order of the scalars, need to
+/// reorder scalars into an unordered state at first according to the given
+/// order. Then the ordered scalars are shuffled once again in accordance with
+/// the provided mask.
+static void reorderScalars(SmallVectorImpl<Value *> &Scalars,
+ ArrayRef<int> Mask) {
+ assert(!Mask.empty() && "Expected non-empty mask.");
+ SmallVector<Value *> Prev(Scalars.size(),
+ UndefValue::get(Scalars.front()->getType()));
+ Prev.swap(Scalars);
+ for (unsigned I = 0, E = Prev.size(); I < E; ++I)
+ if (Mask[I] != UndefMaskElem)
+ Scalars[Mask[I]] = Prev[I];
+}
+
namespace slpvectorizer {
/// Bottom Up SLP Vectorizer.
@@ -645,13 +717,12 @@ public:
void buildTree(ArrayRef<Value *> Roots,
ArrayRef<Value *> UserIgnoreLst = None);
- /// Construct a vectorizable tree that starts at \p Roots, ignoring users for
- /// the purpose of scheduling and extraction in the \p UserIgnoreLst taking
- /// into account (and updating it, if required) list of externally used
- /// values stored in \p ExternallyUsedValues.
- void buildTree(ArrayRef<Value *> Roots,
- ExtraValueToDebugLocsMap &ExternallyUsedValues,
- ArrayRef<Value *> UserIgnoreLst = None);
+ /// Builds external uses of the vectorized scalars, i.e. the list of
+ /// vectorized scalars to be extracted, their lanes and their scalar users. \p
+ /// ExternallyUsedValues contains additional list of external uses to handle
+ /// vectorization of reductions.
+ void
+ buildExternalUses(const ExtraValueToDebugLocsMap &ExternallyUsedValues = {});
/// Clear the internal data structures that are created by 'buildTree'.
void deleteTree() {
@@ -659,8 +730,6 @@ public:
ScalarToTreeEntry.clear();
MustGather.clear();
ExternalUses.clear();
- NumOpsWantToKeepOrder.clear();
- NumOpsWantToKeepOriginalOrder = 0;
for (auto &Iter : BlocksSchedules) {
BlockScheduling *BS = Iter.second.get();
BS->clear();
@@ -674,103 +743,22 @@ public:
/// Perform LICM and CSE on the newly generated gather sequences.
void optimizeGatherSequence();
- /// \returns The best order of instructions for vectorization.
- Optional<ArrayRef<unsigned>> bestOrder() const {
- assert(llvm::all_of(
- NumOpsWantToKeepOrder,
- [this](const decltype(NumOpsWantToKeepOrder)::value_type &D) {
- return D.getFirst().size() ==
- VectorizableTree[0]->Scalars.size();
- }) &&
- "All orders must have the same size as number of instructions in "
- "tree node.");
- auto I = std::max_element(
- NumOpsWantToKeepOrder.begin(), NumOpsWantToKeepOrder.end(),
- [](const decltype(NumOpsWantToKeepOrder)::value_type &D1,
- const decltype(NumOpsWantToKeepOrder)::value_type &D2) {
- return D1.second < D2.second;
- });
- if (I == NumOpsWantToKeepOrder.end() ||
- I->getSecond() <= NumOpsWantToKeepOriginalOrder)
- return None;
-
- return makeArrayRef(I->getFirst());
- }
-
- /// Builds the correct order for root instructions.
- /// If some leaves have the same instructions to be vectorized, we may
- /// incorrectly evaluate the best order for the root node (it is built for the
- /// vector of instructions without repeated instructions and, thus, has less
- /// elements than the root node). This function builds the correct order for
- /// the root node.
- /// For example, if the root node is \<a+b, a+c, a+d, f+e\>, then the leaves
- /// are \<a, a, a, f\> and \<b, c, d, e\>. When we try to vectorize the first
- /// leaf, it will be shrink to \<a, b\>. If instructions in this leaf should
- /// be reordered, the best order will be \<1, 0\>. We need to extend this
- /// order for the root node. For the root node this order should look like
- /// \<3, 0, 1, 2\>. This function extends the order for the reused
- /// instructions.
- void findRootOrder(OrdersType &Order) {
- // If the leaf has the same number of instructions to vectorize as the root
- // - order must be set already.
- unsigned RootSize = VectorizableTree[0]->Scalars.size();
- if (Order.size() == RootSize)
- return;
- SmallVector<unsigned, 4> RealOrder(Order.size());
- std::swap(Order, RealOrder);
- SmallVector<int, 4> Mask;
- inversePermutation(RealOrder, Mask);
- Order.assign(Mask.begin(), Mask.end());
- // The leaf has less number of instructions - need to find the true order of
- // the root.
- // Scan the nodes starting from the leaf back to the root.
- const TreeEntry *PNode = VectorizableTree.back().get();
- SmallVector<const TreeEntry *, 4> Nodes(1, PNode);
- SmallPtrSet<const TreeEntry *, 4> Visited;
- while (!Nodes.empty() && Order.size() != RootSize) {
- const TreeEntry *PNode = Nodes.pop_back_val();
- if (!Visited.insert(PNode).second)
- continue;
- const TreeEntry &Node = *PNode;
- for (const EdgeInfo &EI : Node.UserTreeIndices)
- if (EI.UserTE)
- Nodes.push_back(EI.UserTE);
- if (Node.ReuseShuffleIndices.empty())
- continue;
- // Build the order for the parent node.
- OrdersType NewOrder(Node.ReuseShuffleIndices.size(), RootSize);
- SmallVector<unsigned, 4> OrderCounter(Order.size(), 0);
- // The algorithm of the order extension is:
- // 1. Calculate the number of the same instructions for the order.
- // 2. Calculate the index of the new order: total number of instructions
- // with order less than the order of the current instruction + reuse
- // number of the current instruction.
- // 3. The new order is just the index of the instruction in the original
- // vector of the instructions.
- for (unsigned I : Node.ReuseShuffleIndices)
- ++OrderCounter[Order[I]];
- SmallVector<unsigned, 4> CurrentCounter(Order.size(), 0);
- for (unsigned I = 0, E = Node.ReuseShuffleIndices.size(); I < E; ++I) {
- unsigned ReusedIdx = Node.ReuseShuffleIndices[I];
- unsigned OrderIdx = Order[ReusedIdx];
- unsigned NewIdx = 0;
- for (unsigned J = 0; J < OrderIdx; ++J)
- NewIdx += OrderCounter[J];
- NewIdx += CurrentCounter[OrderIdx];
- ++CurrentCounter[OrderIdx];
- assert(NewOrder[NewIdx] == RootSize &&
- "The order index should not be written already.");
- NewOrder[NewIdx] = I;
- }
- std::swap(Order, NewOrder);
- }
- assert(Order.size() == RootSize &&
- "Root node is expected or the size of the order must be the same as "
- "the number of elements in the root node.");
- assert(llvm::all_of(Order,
- [RootSize](unsigned Val) { return Val != RootSize; }) &&
- "All indices must be initialized");
- }
+ /// Reorders the current graph to the most profitable order starting from the
+ /// root node to the leaf nodes. The best order is chosen only from the nodes
+ /// of the same size (vectorization factor). Smaller nodes are considered
+ /// parts of subgraph with smaller VF and they are reordered independently. We
+ /// can make it because we still need to extend smaller nodes to the wider VF
+ /// and we can merge reordering shuffles with the widening shuffles.
+ void reorderTopToBottom();
+
+ /// Reorders the current graph to the most profitable order starting from
+ /// leaves to the root. It allows to rotate small subgraphs and reduce the
+ /// number of reshuffles if the leaf nodes use the same order. In this case we
+ /// can merge the orders and just shuffle user node instead of shuffling its
+ /// operands. Plus, even if the leaf nodes have different orders, it allows to
+ /// sink reordering in the graph closer to the root node and merge it later
+ /// during analysis.
+ void reorderBottomToTop();
/// \return The vector element size in bits to use when vectorizing the
/// expression tree ending at \p V. If V is a store, the size is the width of
@@ -793,6 +781,10 @@ public:
return MinVecRegSize;
}
+ unsigned getMinVF(unsigned Sz) const {
+ return std::max(2U, getMinVecRegSize() / Sz);
+ }
+
unsigned getMaximumVF(unsigned ElemWidth, unsigned Opcode) const {
unsigned MaxVF = MaxVFOption.getNumOccurrences() ?
MaxVFOption : TTI->getMaximumVF(ElemWidth, Opcode);
@@ -1621,12 +1613,29 @@ private:
/// \returns true if the scalars in VL are equal to this entry.
bool isSame(ArrayRef<Value *> VL) const {
- if (VL.size() == Scalars.size())
- return std::equal(VL.begin(), VL.end(), Scalars.begin());
- return VL.size() == ReuseShuffleIndices.size() &&
- std::equal(
- VL.begin(), VL.end(), ReuseShuffleIndices.begin(),
- [this](Value *V, int Idx) { return V == Scalars[Idx]; });
+ auto &&IsSame = [VL](ArrayRef<Value *> Scalars, ArrayRef<int> Mask) {
+ if (Mask.size() != VL.size() && VL.size() == Scalars.size())
+ return std::equal(VL.begin(), VL.end(), Scalars.begin());
+ return VL.size() == Mask.size() &&
+ std::equal(
+ VL.begin(), VL.end(), Mask.begin(),
+ [Scalars](Value *V, int Idx) { return V == Scalars[Idx]; });
+ };
+ if (!ReorderIndices.empty()) {
+ // TODO: implement matching if the nodes are just reordered, still can
+ // treat the vector as the same if the list of scalars matches VL
+ // directly, without reordering.
+ SmallVector<int> Mask;
+ inversePermutation(ReorderIndices, Mask);
+ if (VL.size() == Scalars.size())
+ return IsSame(Scalars, Mask);
+ if (VL.size() == ReuseShuffleIndices.size()) {
+ ::addMask(Mask, ReuseShuffleIndices);
+ return IsSame(Scalars, Mask);
+ }
+ return false;
+ }
+ return IsSame(Scalars, ReuseShuffleIndices);
}
/// A vector of scalars.
@@ -1701,6 +1710,12 @@ private:
}
}
+ /// Reorders operands of the node to the given mask \p Mask.
+ void reorderOperands(ArrayRef<int> Mask) {
+ for (ValueList &Operand : Operands)
+ reorderScalars(Operand, Mask);
+ }
+
/// \returns the \p OpIdx operand of this TreeEntry.
ValueList &getOperand(unsigned OpIdx) {
assert(OpIdx < Operands.size() && "Off bounds");
@@ -1760,19 +1775,14 @@ private:
return AltOp ? AltOp->getOpcode() : 0;
}
- /// Update operations state of this entry if reorder occurred.
- bool updateStateIfReorder() {
- if (ReorderIndices.empty())
- return false;
- InstructionsState S = getSameOpcode(Scalars, ReorderIndices.front());
- setOperations(S);
- return true;
- }
- /// When ReuseShuffleIndices is empty it just returns position of \p V
- /// within vector of Scalars. Otherwise, try to remap on its reuse index.
+ /// When ReuseReorderShuffleIndices is empty it just returns position of \p
+ /// V within vector of Scalars. Otherwise, try to remap on its reuse index.
int findLaneForValue(Value *V) const {
unsigned FoundLane = std::distance(Scalars.begin(), find(Scalars, V));
assert(FoundLane < Scalars.size() && "Couldn't find extract lane");
+ if (!ReorderIndices.empty())
+ FoundLane = ReorderIndices[FoundLane];
+ assert(FoundLane < Scalars.size() && "Couldn't find extract lane");
if (!ReuseShuffleIndices.empty()) {
FoundLane = std::distance(ReuseShuffleIndices.begin(),
find(ReuseShuffleIndices, FoundLane));
@@ -1856,7 +1866,7 @@ private:
TreeEntry *newTreeEntry(ArrayRef<Value *> VL, Optional<ScheduleData *> Bundle,
const InstructionsState &S,
const EdgeInfo &UserTreeIdx,
- ArrayRef<unsigned> ReuseShuffleIndices = None,
+ ArrayRef<int> ReuseShuffleIndices = None,
ArrayRef<unsigned> ReorderIndices = None) {
TreeEntry::EntryState EntryState =
Bundle ? TreeEntry::Vectorize : TreeEntry::NeedToGather;
@@ -1869,7 +1879,7 @@ private:
Optional<ScheduleData *> Bundle,
const InstructionsState &S,
const EdgeInfo &UserTreeIdx,
- ArrayRef<unsigned> ReuseShuffleIndices = None,
+ ArrayRef<int> ReuseShuffleIndices = None,
ArrayRef<unsigned> ReorderIndices = None) {
assert(((!Bundle && EntryState == TreeEntry::NeedToGather) ||
(Bundle && EntryState != TreeEntry::NeedToGather)) &&
@@ -1877,12 +1887,25 @@ private:
VectorizableTree.push_back(std::make_unique<TreeEntry>(VectorizableTree));
TreeEntry *Last = VectorizableTree.back().get();
Last->Idx = VectorizableTree.size() - 1;
- Last->Scalars.insert(Last->Scalars.begin(), VL.begin(), VL.end());
Last->State = EntryState;
Last->ReuseShuffleIndices.append(ReuseShuffleIndices.begin(),
ReuseShuffleIndices.end());
- Last->ReorderIndices.append(ReorderIndices.begin(), ReorderIndices.end());
- Last->setOperations(S);
+ if (ReorderIndices.empty()) {
+ Last->Scalars.assign(VL.begin(), VL.end());
+ Last->setOperations(S);
+ } else {
+ // Reorder scalars and build final mask.
+ Last->Scalars.assign(VL.size(), nullptr);
+ transform(ReorderIndices, Last->Scalars.begin(),
+ [VL](unsigned Idx) -> Value * {
+ if (Idx >= VL.size())
+ return UndefValue::get(VL.front()->getType());
+ return VL[Idx];
+ });
+ InstructionsState S = getSameOpcode(Last->Scalars);
+ Last->setOperations(S);
+ Last->ReorderIndices.append(ReorderIndices.begin(), ReorderIndices.end());
+ }
if (Last->State != TreeEntry::NeedToGather) {
for (Value *V : VL) {
assert(!getTreeEntry(V) && "Scalar already in tree!");
@@ -2431,14 +2454,6 @@ private:
}
};
- /// Contains orders of operations along with the number of bundles that have
- /// operations in this order. It stores only those orders that require
- /// reordering, if reordering is not required it is counted using \a
- /// NumOpsWantToKeepOriginalOrder.
- DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> NumOpsWantToKeepOrder;
- /// Number of bundles that do not require reordering.
- unsigned NumOpsWantToKeepOriginalOrder = 0;
-
// Analysis and block reference.
Function *F;
ScalarEvolution *SE;
@@ -2591,21 +2606,439 @@ void BoUpSLP::eraseInstructions(ArrayRef<Value *> AV) {
};
}
-void BoUpSLP::buildTree(ArrayRef<Value *> Roots,
- ArrayRef<Value *> UserIgnoreLst) {
- ExtraValueToDebugLocsMap ExternallyUsedValues;
- buildTree(Roots, ExternallyUsedValues, UserIgnoreLst);
+/// Reorders the given \p Reuses mask according to the given \p Mask. \p Reuses
+/// contains original mask for the scalars reused in the node. Procedure
+/// transform this mask in accordance with the given \p Mask.
+static void reorderReuses(SmallVectorImpl<int> &Reuses, ArrayRef<int> Mask) {
+ assert(!Mask.empty() && Reuses.size() == Mask.size() &&
+ "Expected non-empty mask.");
+ SmallVector<int> Prev(Reuses.begin(), Reuses.end());
+ Prev.swap(Reuses);
+ for (unsigned I = 0, E = Prev.size(); I < E; ++I)
+ if (Mask[I] != UndefMaskElem)
+ Reuses[Mask[I]] = Prev[I];
}
-void BoUpSLP::buildTree(ArrayRef<Value *> Roots,
- ExtraValueToDebugLocsMap &ExternallyUsedValues,
- ArrayRef<Value *> UserIgnoreLst) {
- deleteTree();
- UserIgnoreList = UserIgnoreLst;
- if (!allSameType(Roots))
+/// Reorders the given \p Order according to the given \p Mask. \p Order - is
+/// the original order of the scalars. Procedure transforms the provided order
+/// in accordance with the given \p Mask. If the resulting \p Order is just an
+/// identity order, \p Order is cleared.
+static void reorderOrder(SmallVectorImpl<unsigned> &Order, ArrayRef<int> Mask) {
+ assert(!Mask.empty() && "Expected non-empty mask.");
+ SmallVector<int> MaskOrder;
+ if (Order.empty()) {
+ MaskOrder.resize(Mask.size());
+ std::iota(MaskOrder.begin(), MaskOrder.end(), 0);
+ } else {
+ inversePermutation(Order, MaskOrder);
+ }
+ reorderReuses(MaskOrder, Mask);
+ if (ShuffleVectorInst::isIdentityMask(MaskOrder)) {
+ Order.clear();
return;
- buildTree_rec(Roots, 0, EdgeInfo());
+ }
+ Order.assign(Mask.size(), Mask.size());
+ for (unsigned I = 0, E = Mask.size(); I < E; ++I)
+ if (MaskOrder[I] != UndefMaskElem)
+ Order[MaskOrder[I]] = I;
+ fixupOrderingIndices(Order);
+}
+
+void BoUpSLP::reorderTopToBottom() {
+ // Maps VF to the graph nodes.
+ DenseMap<unsigned, SmallPtrSet<TreeEntry *, 4>> VFToOrderedEntries;
+ // ExtractElement gather nodes which can be vectorized and need to handle
+ // their ordering.
+ DenseMap<const TreeEntry *, OrdersType> GathersToOrders;
+ // Find all reorderable nodes with the given VF.
+ // Currently these are vectorized loads, extracts + some gathering of extracts.
+ for_each(VectorizableTree, [this, &VFToOrderedEntries, &GathersToOrders](
+ const std::unique_ptr<TreeEntry> &TE) {
+ // No need to reorder if need to shuffle reuses, still need to shuffle the
+ // node.
+ if (!TE->ReuseShuffleIndices.empty())
+ return;
+ if (TE->State == TreeEntry::Vectorize &&
+ isa<LoadInst, ExtractElementInst, ExtractValueInst, StoreInst,
+ InsertElementInst>(TE->getMainOp()) &&
+ !TE->isAltShuffle()) {
+ VFToOrderedEntries[TE->Scalars.size()].insert(TE.get());
+ } else if (TE->State == TreeEntry::NeedToGather &&
+ TE->getOpcode() == Instruction::ExtractElement &&
+ !TE->isAltShuffle() &&
+ isa<FixedVectorType>(cast<ExtractElementInst>(TE->getMainOp())
+ ->getVectorOperandType()) &&
+ allSameType(TE->Scalars) && allSameBlock(TE->Scalars)) {
+ // Check that gather of extractelements can be represented as
+ // just a shuffle of a single vector.
+ OrdersType CurrentOrder;
+ bool Reuse = canReuseExtract(TE->Scalars, TE->getMainOp(), CurrentOrder);
+ if (Reuse || !CurrentOrder.empty()) {
+ VFToOrderedEntries[TE->Scalars.size()].insert(TE.get());
+ GathersToOrders.try_emplace(TE.get(), CurrentOrder);
+ }
+ }
+ });
+
+ // Reorder the graph nodes according to their vectorization factor.
+ for (unsigned VF = VectorizableTree.front()->Scalars.size(); VF > 1;
+ VF /= 2) {
+ auto It = VFToOrderedEntries.find(VF);
+ if (It == VFToOrderedEntries.end())
+ continue;
+ // Try to find the most profitable order. We are just looking for the most
+ // used order and reorder scalar elements in the nodes according to this
+ // most used order.
+ const SmallPtrSetImpl<TreeEntry *> &OrderedEntries = It->getSecond();
+ // All operands are reordered and used only in this node - propagate the
+ // most used order to the user node.
+ DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> OrdersUses;
+ SmallPtrSet<const TreeEntry *, 4> VisitedOps;
+ for (const TreeEntry *OpTE : OrderedEntries) {
+ // No need to reorder these nodes, still need to extend and to use shuffle,
+ // just need to merge reordering shuffle and the reuse shuffle.
+ if (!OpTE->ReuseShuffleIndices.empty())
+ continue;
+ // Count number of orders uses.
+ const auto &Order = [OpTE, &GathersToOrders]() -> const OrdersType & {
+ if (OpTE->State == TreeEntry::NeedToGather)
+ return GathersToOrders.find(OpTE)->second;
+ return OpTE->ReorderIndices;
+ }();
+ // Stores actually store the mask, not the order, need to invert.
+ if (OpTE->State == TreeEntry::Vectorize && !OpTE->isAltShuffle() &&
+ OpTE->getOpcode() == Instruction::Store && !Order.empty()) {
+ SmallVector<int> Mask;
+ inversePermutation(Order, Mask);
+ unsigned E = Order.size();
+ OrdersType CurrentOrder(E, E);
+ transform(Mask, CurrentOrder.begin(), [E](int Idx) {
+ return Idx == UndefMaskElem ? E : static_cast<unsigned>(Idx);
+ });
+ fixupOrderingIndices(CurrentOrder);
+ ++OrdersUses.try_emplace(CurrentOrder).first->getSecond();
+ } else {
+ ++OrdersUses.try_emplace(Order).first->getSecond();
+ }
+ }
+ // Set order of the user node.
+ if (OrdersUses.empty())
+ continue;
+ // Choose the most used order.
+ ArrayRef<unsigned> BestOrder = OrdersUses.begin()->first;
+ unsigned Cnt = OrdersUses.begin()->second;
+ for (const auto &Pair : llvm::drop_begin(OrdersUses)) {
+ if (Cnt < Pair.second || (Cnt == Pair.second && Pair.first.empty())) {
+ BestOrder = Pair.first;
+ Cnt = Pair.second;
+ }
+ }
+ // Set order of the user node.
+ if (BestOrder.empty())
+ continue;
+ SmallVector<int> Mask;
+ inversePermutation(BestOrder, Mask);
+ SmallVector<int> MaskOrder(BestOrder.size(), UndefMaskElem);
+ unsigned E = BestOrder.size();
+ transform(BestOrder, MaskOrder.begin(), [E](unsigned I) {
+ return I < E ? static_cast<int>(I) : UndefMaskElem;
+ });
+ // Do an actual reordering, if profitable.
+ for (std::unique_ptr<TreeEntry> &TE : VectorizableTree) {
+ // Just do the reordering for the nodes with the given VF.
+ if (TE->Scalars.size() != VF) {
+ if (TE->ReuseShuffleIndices.size() == VF) {
+ // Need to reorder the reuses masks of the operands with smaller VF to
+ // be able to find the match between the graph nodes and scalar
+ // operands of the given node during vectorization/cost estimation.
+ assert(all_of(TE->UserTreeIndices,
+ [VF, &TE](const EdgeInfo &EI) {
+ return EI.UserTE->Scalars.size() == VF ||
+ EI.UserTE->Scalars.size() ==
+ TE->Scalars.size();
+ }) &&
+ "All users must be of VF size.");
+ // Update ordering of the operands with the smaller VF than the given
+ // one.
+ reorderReuses(TE->ReuseShuffleIndices, Mask);
+ }
+ continue;
+ }
+ if (TE->State == TreeEntry::Vectorize &&
+ isa<ExtractElementInst, ExtractValueInst, LoadInst, StoreInst,
+ InsertElementInst>(TE->getMainOp()) &&
+ !TE->isAltShuffle()) {
+ // Build correct orders for extract{element,value}, loads and
+ // stores.
+ reorderOrder(TE->ReorderIndices, Mask);
+ if (isa<InsertElementInst, StoreInst>(TE->getMainOp()))
+ TE->reorderOperands(Mask);
+ } else {
+ // Reorder the node and its operands.
+ TE->reorderOperands(Mask);
+ assert(TE->ReorderIndices.empty() &&
+ "Expected empty reorder sequence.");
+ reorderScalars(TE->Scalars, Mask);
+ }
+ if (!TE->ReuseShuffleIndices.empty()) {
+ // Apply reversed order to keep the original ordering of the reused
+ // elements to avoid extra reorder indices shuffling.
+ OrdersType CurrentOrder;
+ reorderOrder(CurrentOrder, MaskOrder);
+ SmallVector<int> NewReuses;
+ inversePermutation(CurrentOrder, NewReuses);
+ addMask(NewReuses, TE->ReuseShuffleIndices);
+ TE->ReuseShuffleIndices.swap(NewReuses);
+ }
+ }
+ }
+}
+
+void BoUpSLP::reorderBottomToTop() {
+ SetVector<TreeEntry *> OrderedEntries;
+ DenseMap<const TreeEntry *, OrdersType> GathersToOrders;
+ // Find all reorderable leaf nodes with the given VF.
+ // Currently these are vectorized loads, extracts without alternate operands +
+ // some gathering of extracts.
+ SmallVector<TreeEntry *> NonVectorized;
+ for_each(VectorizableTree, [this, &OrderedEntries, &GathersToOrders,
+ &NonVectorized](
+ const std::unique_ptr<TreeEntry> &TE) {
+ // No need to reorder if need to shuffle reuses, still need to shuffle the
+ // node.
+ if (!TE->ReuseShuffleIndices.empty())
+ return;
+ if (TE->State == TreeEntry::Vectorize &&
+ isa<LoadInst, ExtractElementInst, ExtractValueInst>(TE->getMainOp()) &&
+ !TE->isAltShuffle()) {
+ OrderedEntries.insert(TE.get());
+ } else if (TE->State == TreeEntry::NeedToGather &&
+ TE->getOpcode() == Instruction::ExtractElement &&
+ !TE->isAltShuffle() &&
+ isa<FixedVectorType>(cast<ExtractElementInst>(TE->getMainOp())
+ ->getVectorOperandType()) &&
+ allSameType(TE->Scalars) && allSameBlock(TE->Scalars)) {
+ // Check that gather of extractelements can be represented as
+ // just a shuffle of a single vector with a single user only.
+ OrdersType CurrentOrder;
+ bool Reuse = canReuseExtract(TE->Scalars, TE->getMainOp(), CurrentOrder);
+ if ((Reuse || !CurrentOrder.empty()) &&
+ !any_of(
+ VectorizableTree, [&TE](const std::unique_ptr<TreeEntry> &Entry) {
+ return Entry->State == TreeEntry::NeedToGather &&
+ Entry.get() != TE.get() && Entry->isSame(TE->Scalars);
+ })) {
+ OrderedEntries.insert(TE.get());
+ GathersToOrders.try_emplace(TE.get(), CurrentOrder);
+ }
+ }
+ if (TE->State != TreeEntry::Vectorize)
+ NonVectorized.push_back(TE.get());
+ });
+
+ // Checks if the operands of the users are reorderable and have only a
+ // single use.
+ auto &&CheckOperands =
+ [this, &NonVectorized](const auto &Data,
+ SmallVectorImpl<TreeEntry *> &GatherOps) {
+ for (unsigned I = 0, E = Data.first->getNumOperands(); I < E; ++I) {
+ if (any_of(Data.second,
+ [I](const std::pair<unsigned, TreeEntry *> &OpData) {
+ return OpData.first == I &&
+ OpData.second->State == TreeEntry::Vectorize;
+ }))
+ continue;
+ ArrayRef<Value *> VL = Data.first->getOperand(I);
+ const TreeEntry *TE = nullptr;
+ const auto *It = find_if(VL, [this, &TE](Value *V) {
+ TE = getTreeEntry(V);
+ return TE;
+ });
+ if (It != VL.end() && TE->isSame(VL))
+ return false;
+ TreeEntry *Gather = nullptr;
+ if (count_if(NonVectorized, [VL, &Gather](TreeEntry *TE) {
+ assert(TE->State != TreeEntry::Vectorize &&
+ "Only non-vectorized nodes are expected.");
+ if (TE->isSame(VL)) {
+ Gather = TE;
+ return true;
+ }
+ return false;
+ }) > 1)
+ return false;
+ if (Gather)
+ GatherOps.push_back(Gather);
+ }
+ return true;
+ };
+ // 1. Propagate order to the graph nodes, which use only reordered nodes.
+ // I.e., if the node has operands that are reordered, try to make at least
+ // one operand order in the natural order and reorder others + reorder the
+ // user node itself.
+ SmallPtrSet<const TreeEntry *, 4> Visited;
+ while (!OrderedEntries.empty()) {
+ // 1. Filter out only reordered nodes.
+ // 2. If the entry has multiple uses - skip it and jump to the next node.
+ MapVector<TreeEntry *, SmallVector<std::pair<unsigned, TreeEntry *>>> Users;
+ SmallVector<TreeEntry *> Filtered;
+ for (TreeEntry *TE : OrderedEntries) {
+ if (!(TE->State == TreeEntry::Vectorize ||
+ (TE->State == TreeEntry::NeedToGather &&
+ TE->getOpcode() == Instruction::ExtractElement)) ||
+ TE->UserTreeIndices.empty() || !TE->ReuseShuffleIndices.empty() ||
+ !all_of(drop_begin(TE->UserTreeIndices),
+ [TE](const EdgeInfo &EI) {
+ return EI.UserTE == TE->UserTreeIndices.front().UserTE;
+ }) ||
+ !Visited.insert(TE).second) {
+ Filtered.push_back(TE);
+ continue;
+ }
+ // Build a map between user nodes and their operands order to speedup
+ // search. The graph currently does not provide this dependency directly.
+ for (EdgeInfo &EI : TE->UserTreeIndices) {
+ TreeEntry *UserTE = EI.UserTE;
+ auto It = Users.find(UserTE);
+ if (It == Users.end())
+ It = Users.insert({UserTE, {}}).first;
+ It->second.emplace_back(EI.EdgeIdx, TE);
+ }
+ }
+ // Erase filtered entries.
+ for_each(Filtered,
+ [&OrderedEntries](TreeEntry *TE) { OrderedEntries.remove(TE); });
+ for (const auto &Data : Users) {
+ // Check that operands are used only in the User node.
+ SmallVector<TreeEntry *> GatherOps;
+ if (!CheckOperands(Data, GatherOps)) {
+ for_each(Data.second,
+ [&OrderedEntries](const std::pair<unsigned, TreeEntry *> &Op) {
+ OrderedEntries.remove(Op.second);
+ });
+ continue;
+ }
+ // All operands are reordered and used only in this node - propagate the
+ // most used order to the user node.
+ DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> OrdersUses;
+ SmallPtrSet<const TreeEntry *, 4> VisitedOps;
+ for (const auto &Op : Data.second) {
+ TreeEntry *OpTE = Op.second;
+ if (!OpTE->ReuseShuffleIndices.empty())
+ continue;
+ const auto &Order = [OpTE, &GathersToOrders]() -> const OrdersType & {
+ if (OpTE->State == TreeEntry::NeedToGather)
+ return GathersToOrders.find(OpTE)->second;
+ return OpTE->ReorderIndices;
+ }();
+ // Stores actually store the mask, not the order, need to invert.
+ if (OpTE->State == TreeEntry::Vectorize && !OpTE->isAltShuffle() &&
+ OpTE->getOpcode() == Instruction::Store && !Order.empty()) {
+ SmallVector<int> Mask;
+ inversePermutation(Order, Mask);
+ unsigned E = Order.size();
+ OrdersType CurrentOrder(E, E);
+ transform(Mask, CurrentOrder.begin(), [E](int Idx) {
+ return Idx == UndefMaskElem ? E : static_cast<unsigned>(Idx);
+ });
+ fixupOrderingIndices(CurrentOrder);
+ ++OrdersUses.try_emplace(CurrentOrder).first->getSecond();
+ } else {
+ ++OrdersUses.try_emplace(Order).first->getSecond();
+ }
+ if (VisitedOps.insert(OpTE).second)
+ OrdersUses.try_emplace({}, 0).first->getSecond() +=
+ OpTE->UserTreeIndices.size();
+ --OrdersUses[{}];
+ }
+ // If no orders - skip current nodes and jump to the next one, if any.
+ if (OrdersUses.empty()) {
+ for_each(Data.second,
+ [&OrderedEntries](const std::pair<unsigned, TreeEntry *> &Op) {
+ OrderedEntries.remove(Op.second);
+ });
+ continue;
+ }
+ // Choose the best order.
+ ArrayRef<unsigned> BestOrder = OrdersUses.begin()->first;
+ unsigned Cnt = OrdersUses.begin()->second;
+ for (const auto &Pair : llvm::drop_begin(OrdersUses)) {
+ if (Cnt < Pair.second || (Cnt == Pair.second && Pair.first.empty())) {
+ BestOrder = Pair.first;
+ Cnt = Pair.second;
+ }
+ }
+ // Set order of the user node (reordering of operands and user nodes).
+ if (BestOrder.empty()) {
+ for_each(Data.second,
+ [&OrderedEntries](const std::pair<unsigned, TreeEntry *> &Op) {
+ OrderedEntries.remove(Op.second);
+ });
+ continue;
+ }
+ // Erase operands from OrderedEntries list and adjust their orders.
+ VisitedOps.clear();
+ SmallVector<int> Mask;
+ inversePermutation(BestOrder, Mask);
+ SmallVector<int> MaskOrder(BestOrder.size(), UndefMaskElem);
+ unsigned E = BestOrder.size();
+ transform(BestOrder, MaskOrder.begin(), [E](unsigned I) {
+ return I < E ? static_cast<int>(I) : UndefMaskElem;
+ });
+ for (const std::pair<unsigned, TreeEntry *> &Op : Data.second) {
+ TreeEntry *TE = Op.second;
+ OrderedEntries.remove(TE);
+ if (!VisitedOps.insert(TE).second)
+ continue;
+ if (!TE->ReuseShuffleIndices.empty() && TE->ReorderIndices.empty()) {
+ // Just reorder reuses indices.
+ reorderReuses(TE->ReuseShuffleIndices, Mask);
+ continue;
+ }
+ // Gathers are processed separately.
+ if (TE->State != TreeEntry::Vectorize)
+ continue;
+ assert((BestOrder.size() == TE->ReorderIndices.size() ||
+ TE->ReorderIndices.empty()) &&
+ "Non-matching sizes of user/operand entries.");
+ reorderOrder(TE->ReorderIndices, Mask);
+ }
+ // For gathers just need to reorder its scalars.
+ for (TreeEntry *Gather : GatherOps) {
+ if (!Gather->ReuseShuffleIndices.empty())
+ continue;
+ assert(Gather->ReorderIndices.empty() &&
+ "Unexpected reordering of gathers.");
+ reorderScalars(Gather->Scalars, Mask);
+ OrderedEntries.remove(Gather);
+ }
+ // Reorder operands of the user node and set the ordering for the user
+ // node itself.
+ if (Data.first->State != TreeEntry::Vectorize ||
+ !isa<ExtractElementInst, ExtractValueInst, LoadInst>(
+ Data.first->getMainOp()) ||
+ Data.first->isAltShuffle())
+ Data.first->reorderOperands(Mask);
+ if (!isa<InsertElementInst, StoreInst>(Data.first->getMainOp()) ||
+ Data.first->isAltShuffle()) {
+ reorderScalars(Data.first->Scalars, Mask);
+ reorderOrder(Data.first->ReorderIndices, MaskOrder);
+ if (Data.first->ReuseShuffleIndices.empty() &&
+ !Data.first->ReorderIndices.empty() &&
+ !Data.first->isAltShuffle()) {
+ // Insert user node to the list to try to sink reordering deeper in
+ // the graph.
+ OrderedEntries.insert(Data.first);
+ }
+ } else {
+ reorderOrder(Data.first->ReorderIndices, Mask);
+ }
+ }
+ }
+}
+void BoUpSLP::buildExternalUses(
+ const ExtraValueToDebugLocsMap &ExternallyUsedValues) {
// Collect the values that we need to extract from the tree.
for (auto &TEPtr : VectorizableTree) {
TreeEntry *Entry = TEPtr.get();
@@ -2664,6 +3097,80 @@ void BoUpSLP::buildTree(ArrayRef<Value *> Roots,
}
}
+void BoUpSLP::buildTree(ArrayRef<Value *> Roots,
+ ArrayRef<Value *> UserIgnoreLst) {
+ deleteTree();
+ UserIgnoreList = UserIgnoreLst;
+ if (!allSameType(Roots))
+ return;
+ buildTree_rec(Roots, 0, EdgeInfo());
+}
+
+namespace {
+/// Tracks the state we can represent the loads in the given sequence.
+enum class LoadsState { Gather, Vectorize, ScatterVectorize };
+} // anonymous namespace
+
+/// Checks if the given array of loads can be represented as a vectorized,
+/// scatter or just simple gather.
+static LoadsState canVectorizeLoads(ArrayRef<Value *> VL, const Value *VL0,
+ const TargetTransformInfo &TTI,
+ const DataLayout &DL, ScalarEvolution &SE,
+ SmallVectorImpl<unsigned> &Order,
+ SmallVectorImpl<Value *> &PointerOps) {
+ // Check that a vectorized load would load the same memory as a scalar
+ // load. For example, we don't want to vectorize loads that are smaller
+ // than 8-bit. Even though we have a packed struct {<i2, i2, i2, i2>} LLVM
+ // treats loading/storing it as an i8 struct. If we vectorize loads/stores
+ // from such a struct, we read/write packed bits disagreeing with the
+ // unvectorized version.
+ Type *ScalarTy = VL0->getType();
+
+ if (DL.getTypeSizeInBits(ScalarTy) != DL.getTypeAllocSizeInBits(ScalarTy))
+ return LoadsState::Gather;
+
+ // Make sure all loads in the bundle are simple - we can't vectorize
+ // atomic or volatile loads.
+ PointerOps.clear();
+ PointerOps.resize(VL.size());
+ auto *POIter = PointerOps.begin();
+ for (Value *V : VL) {
+ auto *L = cast<LoadInst>(V);
+ if (!L->isSimple())
+ return LoadsState::Gather;
+ *POIter = L->getPointerOperand();
+ ++POIter;
+ }
+
+ Order.clear();
+ // Check the order of pointer operands.
+ if (llvm::sortPtrAccesses(PointerOps, ScalarTy, DL, SE, Order)) {
+ Value *Ptr0;
+ Value *PtrN;
+ if (Order.empty()) {
+ Ptr0 = PointerOps.front();
+ PtrN = PointerOps.back();
+ } else {
+ Ptr0 = PointerOps[Order.front()];
+ PtrN = PointerOps[Order.back()];
+ }
+ Optional<int> Diff =
+ getPointersDiff(ScalarTy, Ptr0, ScalarTy, PtrN, DL, SE);
+ // Check that the sorted loads are consecutive.
+ if (static_cast<unsigned>(*Diff) == VL.size() - 1)
+ return LoadsState::Vectorize;
+ Align CommonAlignment = cast<LoadInst>(VL0)->getAlign();
</cut>
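As a rough illustration of the pattern the reordering work above targets (a hedged sketch, not taken from the commit; the function and values are hypothetical), consider a bundle of loads whose pointer operands arrive in a jumbled order. Once the pointers are sorted and found to be consecutive, the bundle can be vectorized with a reorder mask instead of being gathered element by element:

/* Hypothetical example: the scalar code reads a[1], a[0], a[3], a[2].
 * Sorting the pointer operands shows the loads cover a[0..3]
 * contiguously, so the SLP vectorizer can emit one vector load plus a
 * <1,0,3,2> shuffle rather than four scalar loads. */
void jumbled(int *restrict a, int *restrict b) {
  b[0] = a[1] + 1;
  b[1] = a[0] + 1;
  b[2] = a[3] + 1;
  b[3] = a[2] + 1;
}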
And the rest of the week I flushed my maintainer queues ;-)
Other
=====
[update-ticket] <file:~/org/team.org::update-ticket>
Update [update-ticket] to work with cloud JIRA
Completed Reviews [8/8]
=======================
[PATCH 0/7] tests: docker images for hexagon, nios2, microblaze
Message-Id: <20211014224435.2539547-1-richard.henderson(a)linaro.org>
[PATCH] gdbstub: Switch to the thread receiving a signal
Message-Id: <20210930095111.23205-1-pavel(a)labath.sk>
[PATCH] replay: improve determinism of virtio-net
Message-Id: <162125666020.1252655.9997723318921206001.stgit@pasha-ThinkPad-X280>
[PATCH RESEND v3 0/2] add APIs to handle alternative sNaN propagation for fmax/fmin
Message-Id: <20211015065500.3850513-1-frank.chang(a)sifive.com>
[PATCH v3 0/5] plugins/cache: multicore cache modelling and minor tweaks
Message-Id: <20210722065428.134608-1-ma.mandourr(a)gmail.com>
[PATCH v2 0/2] plugins: add a drcov plugin
Message-Id: <163429165642.439576.16356288759891202632.stgit@pc-System-Product-Name>
[PATCH 0/3] KVM: qemu patches for few KVM features I developed
Message-Id: <20210914155214.105415-1-mlevitsk(a)redhat.com>
Absences
========
- Off Friday next week
Current Review Queue
====================
TODO [PATCH v2 00/48] tcg: optimize redundant sign extensions
Message-Id: <20211007195456.1168070-1-richard.henderson(a)linaro.org>
================================================================================================================================
TODO [PATCH] cpu-models-x86.rst: Tidy up a couple of things
Message-Id: <20211015100718.17828-1-pbonzini(a)redhat.com>
===================================================================================================================
TODO [PATCH 00/16] fdt: Make OF_BOARD a boolean option
Message-Id: <20211013010120.96851-1-sjg(a)chromium.org>
===========================================================================================================
TODO [PATCH v4 00/41] linux-user: Streamline handling of SIGSEGV
Message-Id: <20211006172307.780893-1-richard.henderson(a)linaro.org>
==================================================================================================================================
--
Alex Bennée
OK I've fixed up my JIRA and email tooling so this is a bit of a flush
of stale data from my org-mode.
VirtIO Initiative ([STR-9])
===========================
- posted Enabling hypervisor agnosticism for VirtIO backends
Message-Id: <87v94ldrqq.fsf(a)linaro.org>
- posted [a PR to do some cleanups to vm-virtio]
[STR-9] <https://projects.linaro.org/browse/STR-9>
[a PR to do some cleanups to vm-virtio]
<https://github.com/rust-vmm/vm-virtio/pull/103>
VirtIO RPMB ([STR-5])
=====================
- made more progress and now have PROGRAM_KEY/WRITE_COUNTER done -
feels like it's getting faster
[STR-5] <https://projects.linaro.org/browse/STR-5>
[Rust version of virtio-rpmb] <https://github.com/stsquad/virtio-rpmb>
[fixes for the C daemon]
<https://github.com/ruchi393/qemu/tree/vhost-user-rpmb-fixes>
[hacking branch] <https://github.com/stsquad/virtio-rpmb/tree/hacking>
Fix VirtIO spec as per Rucha's email
QEMU Upstream Work ([UM-2])
===========================
- posted [PATCH for 6.1-rc3 v1 0/4] gitlab and plugins pre-PR
Message-Id: <20210806141015.2487502-1-alex.bennee(a)linaro.org>
- prepared a potential [pull request for testing issues] but looks
like it will wait for 6.2
[UM-2] <https://projects.linaro.org/browse/UM-2>
[this is the last iteration before Monday]
<https://patchew.org/QEMU/20210709143005.1554-1-alex.bennee@linaro.org/>
[pull request for testing issues]
<https://github.com/stsquad/qemu/tree/pr/120821-for-6.1-rc4-1>
Completed Reviews [4/4]
=======================
[RFC PATCH 0/1] QEMU TCG plugin interface extensions
Message-Id: <20210821094527.491232-1-florian.hauschild(a)fs.ei.tum.de>
[PATCH 0/8] tcg: support 32-bit guest addresses as signed
Message-Id: <20211010174401.141339-1-richard.henderson(a)linaro.org>
[PATCH 0/3] KVM: qemu patches for few KVM features I developed
Message-Id: <20210914155214.105415-1-mlevitsk(a)redhat.com>
[PATCH 0/6] More record/replay acceptance tests
Message-Id: <162332427732.194926.7555369160312506539.stgit@pasha-ThinkPad-X280>
[PATCH 0/3] Gitlab-CI improvements
Message-Id: <20210730143809.717079-1-thuth(a)redhat.com>
========================================================================================
[PATCH v3 00/13] new plugin argument passing scheme
Message-Id: <20210722071236.139520-1-ma.mandourr(a)gmail.com>
==============================================================================================================
[PATCH] contrib/plugins: add a drcov plugin
Message-Id: <20211011111130.170178-1-arkaisp2021(a)gmail.com>
======================================================================================================
[RFC PATCH v2] Add a post for the new TCG cache modelling plugin
Message-Id: <20210617121707.764126-1-ma.mandourr(a)gmail.com>
===========================================================================================================================
Current Review Queue
====================
TODO [PATCH v2 00/48] tcg: optimize redundant sign extensions
Message-Id: <20211007195456.1168070-1-richard.henderson(a)linaro.org>
================================================================================================================================
TODO [PATCH] cpu-models-x86.rst: Tidy up a couple of things
Message-Id: <20211015100718.17828-1-pbonzini(a)redhat.com>
===================================================================================================================
TODO [PATCH 00/16] fdt: Make OF_BOARD a boolean option
Message-Id: <20211013010120.96851-1-sjg(a)chromium.org>
===========================================================================================================
TODO [PATCH v4 00/48]
Message-Id: <20211013024607.731881-1-richard.henderson(a)linaro.org>
=======================================================================================
--
Alex Bennée
After llvm commit 75127bce6de78b83b70b898a04473f213451f13e
Author: Qiongsi Wu <qwu(a)ibm.com>
[AIX][ZOS] Excluding merge-objc-interface.m from Tests
the following hot functions slowed down by more than 10% (but their benchmarks slowed down by less than 2%):
- 433.milc:[.] mult_su3_mat_vec slowed down by 16% from 1615 to 1871 perf samples
The reproducer instructions below can be used to re-build both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O3
- Hardware: NVidia TX1 4x Cortex-A57
This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org. Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O3
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Reproduce builds:
<cut>
mkdir investigate-llvm-75127bce6de78b83b70b898a04473f213451f13e
cd investigate-llvm-75127bce6de78b83b70b898a04473f213451f13e
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach 75127bce6de78b83b70b898a04473f213451f13e
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach d01ae990e1fd6561ed86dc8004a7147dd09fb13c
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 75127bce6de78b83b70b898a04473f213451f13e
Author: Qiongsi Wu <qwu(a)ibm.com>
Date: Fri Oct 8 13:58:32 2021 +0000
[AIX][ZOS] Excluding merge-objc-interface.m from Tests
Objective C is not supported on AIX or ZOS. This patch excludes the newly added `clang/test/Modules/merge-objc-interface.m` (added by https://reviews.llvm.org/D110280) from AIX and ZOS testing.
Many existing tests are already disabled by https://reviews.llvm.org/D109060.
Reviewed By: jsji
Differential Revision: https://reviews.llvm.org/D111406
---
clang/test/Modules/merge-objc-interface.m | 1 +
1 file changed, 1 insertion(+)
diff --git a/clang/test/Modules/merge-objc-interface.m b/clang/test/Modules/merge-objc-interface.m
index fba06294a26a..f62f541c1a29 100644
--- a/clang/test/Modules/merge-objc-interface.m
+++ b/clang/test/Modules/merge-objc-interface.m
@@ -1,3 +1,4 @@
+// UNSUPPORTED: -zos, -aix
// RUN: rm -rf %t
// RUN: split-file %s %t
// RUN: %clang_cc1 -emit-llvm -o %t/test.bc -F%t/Frameworks %t/test.m \
</cut>
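For context, the UNSUPPORTED line above is a lit directive: the test is skipped (reported as "Unsupported" rather than failed) when an entry matches the target triple. A hedged sketch of the same guard in a stand-alone test (the file and RUN line are hypothetical):

// UNSUPPORTED: -zos, -aix
// RUN: %clang_cc1 -fsyntax-only %s
/* With the directive above, lit skips this test on triples such as
 * powerpc-ibm-aix* or *-ibm-zos instead of running and failing it. */
int probe(void) { return 0; }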
After llvm commit 483db1c706864d0940206228dfe64bdcd17faa4e
Author: Muhammad Omair Javaid <omair.javaid(a)linaro.org>
[LLDB] Remove xfail decorator TestInferiorAssert.py AArch64/Linux
the following benchmarks slowed down by more than 2%:
- 433.milc slowed down by 4% from 13309 to 13838 perf samples
- 433.milc:[.] mult_su3_mat_vec slowed down by 17% from 2058 to 2409 perf samples
The reproducer instructions below can be used to re-build both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O3
- Hardware: NVidia TX1 4x Cortex-A57
This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org. Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O3
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Reproduce builds:
<cut>
mkdir investigate-llvm-483db1c706864d0940206228dfe64bdcd17faa4e
cd investigate-llvm-483db1c706864d0940206228dfe64bdcd17faa4e
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach 483db1c706864d0940206228dfe64bdcd17faa4e
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach d11ec6f67e45c630ab87bfb6010dcc93e89542fc
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 483db1c706864d0940206228dfe64bdcd17faa4e
Author: Muhammad Omair Javaid <omair.javaid(a)linaro.org>
Date: Mon Oct 11 14:34:41 2021 +0500
[LLDB] Remove xfail decorator TestInferiorAssert.py AArch64/Linux
TestInferiorAssert.py test_inferior_asserting_disassemble passes after
upgrading LLDB AArch64/Linux buildbot to Ubuntu Focal.
---
lldb/test/API/functionalities/inferior-assert/TestInferiorAssert.py | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/lldb/test/API/functionalities/inferior-assert/TestInferiorAssert.py b/lldb/test/API/functionalities/inferior-assert/TestInferiorAssert.py
index c533a1e29a12..5ac4eeb0514a 100644
--- a/lldb/test/API/functionalities/inferior-assert/TestInferiorAssert.py
+++ b/lldb/test/API/functionalities/inferior-assert/TestInferiorAssert.py
@@ -45,9 +45,7 @@ class AssertingInferiorTestCase(TestBase):
bugnumber="llvm.org/pr21793: need to implement support for detecting assertion / abort on Windows")
@expectedFailureAll(
oslist=["linux"],
- archs=[
- "aarch64",
- "arm"],
+ archs=["arm"],
triple=no_match(".*-android"),
bugnumber="llvm.org/pr25338")
@expectedFailureAll(bugnumber="llvm.org/pr26592", triple='^mips')
</cut>
After gcc commit 4a960d548b7d7d942f316c5295f6d849b74214f5
Author: Aldy Hernandez <aldyh(a)redhat.com>
Avoid invalid loop transformations in jump threading registry.
the following benchmarks slowed down by more than 2%:
- 471.omnetpp slowed down by 8% from 6348 to 6828 perf samples
The reproducer instructions below can be used to re-build both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: GCC + Glibc + GNU Linker
- Version: all components were built from their tip of trunk
- Target: arm-linux-gnueabihf
- Compiler flags: -O3 -marm
- Hardware: NVidia TK1 4x Cortex-A15
This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org. Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_gnu_tk1/gnu-master-arm-spec2k6-O3
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar…
Reproduce builds:
<cut>
mkdir investigate-gcc-4a960d548b7d7d942f316c5295f6d849b74214f5
cd investigate-gcc-4a960d548b7d7d942f316c5295f6d849b74214f5
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tk1-gnu-master-ar… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
cd gcc
# Reproduce first_bad build
git checkout --detach 4a960d548b7d7d942f316c5295f6d849b74214f5
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 29c92857039d0a105281be61c10c9e851aaeea4a
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 4a960d548b7d7d942f316c5295f6d849b74214f5
Author: Aldy Hernandez <aldyh(a)redhat.com>
Date: Thu Sep 23 10:59:24 2021 +0200
Avoid invalid loop transformations in jump threading registry.
My upcoming improvements to the forward jump threader make it thread
more aggressively. In investigating some "regressions", I noticed
that it has always allowed threading through empty latches and across
loop boundaries. As we have discussed recently, this should be avoided
until after loop optimizations have run their course.
Note that this wasn't much of a problem before because DOM/VRP
couldn't find these opportunities, but with a smarter solver, we trip
over them more easily.
Because the forward threader doesn't have an independent localized cost
model like the new threader (profitable_path_p), it is difficult to
catch these things at discovery. However, we can catch them at
registration time, with the added benefit that all the threaders
(forward and backward) can share the handcuffs.
This patch is an adaptation of what we do in the backward threader, but
it is not meant to catch everything we do there, as some of the
restrictions there are due to limitations of the different block
copiers (for example, the generic copier does not re-use existing
threading paths).
We could ideally remove the now redundant bits in profitable_path_p, but
I would prefer not to for two reasons. First, the backward threader uses
profitable_path_p as it discovers paths to avoid discovering paths in
unprofitable directions. Second, I would like to merge all the forward
cost restrictions into the profitability class in the backward threader,
not the other way around. Alas, that reshuffling will have to wait for
the next release.
As usual, there are quite a few tests that needed adjustments. It seems
we were quite happily threading improper scenarios. With most of them,
as can be seen in pr77445-2.c, we're merely shifting the threading to
after loop optimizations.
Tested on x86-64 Linux.
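To make the restriction concrete, here is a hedged sketch (not from the commit; the function is hypothetical) of the kind of path that is now cancelled at registration time. The two tests of flag are correlated, so threading from the entry test to the in-loop test looks profitable, but the path crosses the loop boundary and would disturb the loop before the loop optimizers run:

/* Hypothetical example: a jump thread from the entry test of "flag" to
 * the test inside the loop would copy blocks across the loop header,
 * effectively rotating/peeling the loop, so the registry rejects it. */
int sum(const int *a, int n, int flag) {
  int s = 0;
  if (flag)
    s = 1;
  for (int i = 0; i < n; i++) {
    if (flag) /* correlated with the entry test */
      s += a[i];
    else
      s -= a[i];
  }
  return s;
}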
gcc/ChangeLog:
* tree-ssa-threadupdate.c (jt_path_registry::cancel_invalid_paths):
New.
(jt_path_registry::register_jump_thread): Call
cancel_invalid_paths.
* tree-ssa-threadupdate.h (class jt_path_registry): Add
cancel_invalid_paths.
gcc/testsuite/ChangeLog:
* gcc.dg/tree-ssa/20030714-2.c: Adjust.
* gcc.dg/tree-ssa/pr66752-3.c: Adjust.
* gcc.dg/tree-ssa/pr77445-2.c: Adjust.
* gcc.dg/tree-ssa/ssa-dom-thread-18.c: Adjust.
* gcc.dg/tree-ssa/ssa-dom-thread-7.c: Adjust.
* gcc.dg/vect/bb-slp-16.c: Adjust.
---
gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c | 7 ++-
gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c | 19 ++++---
gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c | 4 +-
gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c | 4 +-
gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c | 4 +-
gcc/testsuite/gcc.dg/vect/bb-slp-16.c | 7 ---
gcc/tree-ssa-threadupdate.c | 67 ++++++++++++++++++-----
gcc/tree-ssa-threadupdate.h | 1 +
8 files changed, 78 insertions(+), 35 deletions(-)
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c b/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c
index eb663f2ff5b..9585ff11307 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c
@@ -32,7 +32,8 @@ get_alias_set (t)
}
}
-/* There should be exactly three IF conditionals if we thread jumps
- properly. */
-/* { dg-final { scan-tree-dump-times "if " 3 "dom2"} } */
+/* There should be exactly 4 IF conditionals if we thread jumps
+ properly. There used to be 3, but one thread was crossing
+ loops. */
+/* { dg-final { scan-tree-dump-times "if " 4 "dom2"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c b/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
index e1464e21170..922a331b217 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
@@ -1,5 +1,5 @@
/* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-dce2" } */
+/* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-thread3" } */
extern int status, pt;
extern int count;
@@ -32,10 +32,15 @@ foo (int N, int c, int b, int *a)
pt--;
}
-/* There are 4 jump threading opportunities, all of which will be
- realized, which will eliminate testing of FLAG, completely. */
-/* { dg-final { scan-tree-dump-times "Registering jump" 4 "thread1"} } */
+/* There are 2 jump threading opportunities (which don't cross loops),
+ all of which will be realized, which will eliminate testing of
+ FLAG, completely. */
+/* { dg-final { scan-tree-dump-times "Registering jump" 2 "thread1"} } */
-/* There should be no assignments or references to FLAG, verify they're
- eliminated as early as possible. */
-/* { dg-final { scan-tree-dump-not "if .flag" "dce2"} } */
+/* We used to remove references to FLAG by DCE2, but this was
+ depending on early threaders threading through loop boundaries
+ (which we shouldn't do). However, the late threading passes, which
+ run after loop optimizations, can successfully eliminate the
+ references to FLAG. Verify that there are no references by the late
+ threading passes. */
+/* { dg-final { scan-tree-dump-not "if .flag" "thread3"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c b/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
index f9fc212f49e..01a0f1f197d 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
@@ -123,8 +123,8 @@ enum STATES FMS( u8 **in , u32 *transitions) {
aarch64 has the highest CASE_VALUES_THRESHOLD in GCC. It's high enough
to change decisions in switch expansion which in turn can expose new
jump threading opportunities. Skip the later tests on aarch64. */
-/* { dg-final { scan-tree-dump "Jumps threaded: 1\[1-9\]" "thread1" } } */
-/* { dg-final { scan-tree-dump-times "Invalid sum" 4 "thread1" } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 9" "thread1" } } */
+/* { dg-final { scan-tree-dump-times "Invalid sum" 1 "thread1" } } */
/* { dg-final { scan-tree-dump-not "optimizing for size" "thread1" } } */
/* { dg-final { scan-tree-dump-not "optimizing for size" "thread2" } } */
/* { dg-final { scan-tree-dump-not "optimizing for size" "thread3" { target { ! aarch64*-*-* } } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
index 60d4f76f076..2d78d045516 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
@@ -21,5 +21,7 @@
condition.
All the cases are picked up by VRP1 as jump threads. */
-/* { dg-final { scan-tree-dump-times "Registering jump" 6 "thread1" } } */
+
+/* There used to be 6 jump threads found by thread1, but they all
+ depended on threading through distinct loops in ethread. */
/* { dg-final { scan-tree-dump-times "Threaded" 2 "vrp1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
index e3d4b311c03..16abcde5053 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
@@ -1,8 +1,8 @@
/* { dg-do compile } */
/* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-thread2-stats -fdump-tree-dom2-stats -fdump-tree-thread3-stats -fdump-tree-dom3-stats -fdump-tree-vrp2-stats -fno-guess-branch-probability" } */
-/* { dg-final { scan-tree-dump "Jumps threaded: 18" "thread1" } } */
-/* { dg-final { scan-tree-dump "Jumps threaded: 8" "thread3" { target { ! aarch64*-*-* } } } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 12" "thread1" } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 5" "thread3" { target { ! aarch64*-*-* } } } } */
/* { dg-final { scan-tree-dump-not "Jumps threaded" "dom2" } } */
/* aarch64 has the highest CASE_VALUES_THRESHOLD in GCC. It's high enough
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
index 664e93e9b60..e68a9b62535 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
@@ -1,8 +1,5 @@
/* { dg-require-effective-target vect_int } */
-/* See note below as to why we disable threading. */
-/* { dg-additional-options "-fdisable-tree-thread1" } */
-
#include <stdarg.h>
#include "tree-vect.h"
@@ -30,10 +27,6 @@ main1 (int dummy)
*pout++ = *pin++ + a;
*pout++ = *pin++ + a;
*pout++ = *pin++ + a;
- /* In some architectures like ppc64, jump threading may thread
- the iteration where i==0 such that we no longer optimize the
- BB. Another alternative to disable jump threading would be
- to wrap the read from `i' into a function returning i. */
if (arr[i] = i)
a = i;
else
diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c
index baac11280fa..2b9b8f81274 100644
--- a/gcc/tree-ssa-threadupdate.c
+++ b/gcc/tree-ssa-threadupdate.c
@@ -2757,6 +2757,58 @@ fwd_jt_path_registry::update_cfg (bool may_peel_loop_headers)
return retval;
}
+bool
+jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path)
+{
+ gcc_checking_assert (!path.is_empty ());
+ edge taken_edge = path[path.length () - 1]->e;
+ loop_p loop = taken_edge->src->loop_father;
+ bool seen_latch = false;
+ bool path_crosses_loops = false;
+
+ for (unsigned int i = 0; i < path.length (); i++)
+ {
+ edge e = path[i]->e;
+
+ if (e == NULL)
+ {
+ // NULL outgoing edges on a path can happen for jumping to a
+ // constant address.
+ cancel_thread (&path, "Found NULL edge in jump threading path");
+ return true;
+ }
+
+ if (loop->latch == e->src || loop->latch == e->dest)
+ seen_latch = true;
+
+ // The first entry represents the block with an outgoing edge
+ // that we will redirect to the jump threading path. Thus we
+ // don't care about that block's loop father.
+ if ((i > 0 && e->src->loop_father != loop)
+ || e->dest->loop_father != loop)
+ path_crosses_loops = true;
+
+ if (flag_checking && !m_backedge_threads)
+ gcc_assert ((path[i]->e->flags & EDGE_DFS_BACK) == 0);
+ }
+
+ if (cfun->curr_properties & PROP_loop_opts_done)
+ return false;
+
+ if (seen_latch && empty_block_p (loop->latch))
+ {
+ cancel_thread (&path, "Threading through latch before loop opts "
+ "would create non-empty latch");
+ return true;
+ }
+ if (path_crosses_loops)
+ {
+ cancel_thread (&path, "Path crosses loops");
+ return true;
+ }
+ return false;
+}
+
/* Register a jump threading opportunity. We queue up all the jump
threading opportunities discovered by a pass and update the CFG
and SSA form all at once.
@@ -2776,19 +2828,8 @@ jt_path_registry::register_jump_thread (vec<jump_thread_edge *> *path)
return false;
}
- /* First make sure there are no NULL outgoing edges on the jump threading
- path. That can happen for jumping to a constant address. */
- for (unsigned int i = 0; i < path->length (); i++)
- {
- if ((*path)[i]->e == NULL)
- {
- cancel_thread (path, "Found NULL edge in jump threading path");
- return false;
- }
-
- if (flag_checking && !m_backedge_threads)
- gcc_assert (((*path)[i]->e->flags & EDGE_DFS_BACK) == 0);
- }
+ if (cancel_invalid_paths (*path))
+ return false;
if (dump_file && (dump_flags & TDF_DETAILS))
dump_jump_thread_path (dump_file, *path, true);
diff --git a/gcc/tree-ssa-threadupdate.h b/gcc/tree-ssa-threadupdate.h
index 8b48a671212..d68795c9f27 100644
--- a/gcc/tree-ssa-threadupdate.h
+++ b/gcc/tree-ssa-threadupdate.h
@@ -75,6 +75,7 @@ protected:
unsigned long m_num_threaded_edges;
private:
virtual bool update_cfg (bool peel_loop_headers) = 0;
+ bool cancel_invalid_paths (vec<jump_thread_edge *> &path);
jump_thread_path_allocator m_allocator;
// True if threading through back edges is allowed. This is only
// allowed in the generic copier in the backward threader.
</cut>
[TCWG CI] Regression caused by gcc: tree-optimization/102570 - teach VN about internal functions:
commit 55a3be2f5255d69e740d61b2c5aaba1ccbc643b8
Author: Richard Biener <rguenther(a)suse.de>
tree-optimization/102570 - teach VN about internal functions
Results regressed to
# reset_artifacts:
-10
# build_abe binutils:
-9
# build_abe stage1:
-5
# build_abe qemu:
-2
# linux_n_obj:
18360
# First few build errors in logs:
# 00:01:21 ./include/linux/arm-smccc.h:460:40: error: ‘res.a0’ is used uninitialized [-Werror=uninitialized]
# 00:01:21 ./include/linux/arm-smccc.h:460:40: error: ‘res.a0’ is used uninitialized [-Werror=uninitialized]
# 00:01:21 ./include/linux/arm-smccc.h:460:40: error: ‘res.a0’ is used uninitialized [-Werror=uninitialized]
# 00:01:21 make[2]: *** [scripts/Makefile.build:288: arch/arm64/hyperv/hv_core.o] Error 1
# 00:01:22 crypto/asymmetric_keys/asymmetric_type.c:481:15: error: ‘restrict_method’ is used uninitialized [-Werror=uninitialized]
# 00:01:22 make[2]: *** [scripts/Makefile.build:288: crypto/asymmetric_keys/asymmetric_type.o] Error 1
# 00:01:22 ./include/trace/perf.h:38:25: error: ‘entry’ is used uninitialized [-Werror=uninitialized]
# 00:01:22 ./include/trace/perf.h:44:13: error: ‘__entry_size’ is used uninitialized [-Werror=uninitialized]
# 00:01:23 security/keys/encrypted-keys/encrypted.c:660:19: error: ‘mkey’ is used uninitialized [-Werror=uninitialized]
# 00:01:23 security/keys/encrypted-keys/encrypted.c:905:19: error: ‘epayload’ is used uninitialized [-Werror=uninitialized]
from
# reset_artifacts:
-10
# build_abe binutils:
-9
# build_abe stage1:
-5
# build_abe qemu:
-2
# linux_n_obj:
21404
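For context, a minimal hypothetical reproducer for this class of error
(unrelated to the kernel sources above), compiled with -O2
-Werror=uninitialized:

struct res { long a0; };

long get_a0 (void)
{
  struct res r;   /* never written */
  return r.a0;    /* error: 'r.a0' is used uninitialized */
}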
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_kernel/gnu-master-aarch64-next-allmodconfig
First_bad build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-next-al…
Last_good build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-next-al…
Baseline build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-next-al…
Even more details: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-next-al…
Reproduce builds:
<cut>
mkdir investigate-gcc-55a3be2f5255d69e740d61b2c5aaba1ccbc643b8
cd investigate-gcc-55a3be2f5255d69e740d61b2c5aaba1ccbc643b8
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-next-al… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-next-al… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-next-al… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_kernel-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
cd gcc
# Reproduce first_bad build
git checkout --detach 55a3be2f5255d69e740d61b2c5aaba1ccbc643b8
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 22d34a2a50651d01669b6fbcdb9677c18d2197c5
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 55a3be2f5255d69e740d61b2c5aaba1ccbc643b8
Author: Richard Biener <rguenther(a)suse.de>
Date: Mon Oct 4 10:57:45 2021 +0200
tree-optimization/102570 - teach VN about internal functions
We're now using internal functions for a lot of stuff, but VN support
for them was still missing, out of laziness. The following implements
the missing support and adds testcases for FRE and PRE (hoisting).
2021-10-04 Richard Biener <rguenther(a)suse.de>
PR tree-optimization/102570
* tree-ssa-sccvn.h (vn_reference_op_struct): Document
we are using clique for the internal function code.
* tree-ssa-sccvn.c (vn_reference_op_eq): Compare the
internal function code.
(print_vn_reference_ops): Print the internal function code.
(vn_reference_op_compute_hash): Hash it.
(copy_reference_ops_from_call): Record it.
(visit_stmt): Remove the restriction around internal function
calls.
(fully_constant_vn_reference_p): Use fold_const_call and handle
internal functions.
(vn_reference_eq): Compare call return types.
* tree-ssa-pre.c (create_expression_by_pieces): Handle
generating calls to internal functions.
(compute_avail): Remove the restriction around internal function
calls.
* gcc.dg/tree-ssa/ssa-fre-96.c: New testcase.
* gcc.dg/tree-ssa/ssa-pre-33.c: Likewise.
---
gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-96.c | 14 +++++
gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-33.c | 15 +++++
gcc/tree-ssa-pre.c | 27 +++++----
gcc/tree-ssa-sccvn.c | 91 ++++++++++++++++++------------
gcc/tree-ssa-sccvn.h | 3 +-
5 files changed, 103 insertions(+), 47 deletions(-)
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-96.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-96.c
new file mode 100644
index 00000000000..fd1d5713b5f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-96.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O -fdump-tree-fre1" } */
+
+_Bool f1(unsigned x, unsigned y, unsigned *res)
+{
+ _Bool t = __builtin_add_overflow(x, y, res);
+ unsigned res1;
+ _Bool t1 = __builtin_add_overflow(x, y, &res1);
+ *res -= res1;
+ return t==t1;
+}
+
+/* { dg-final { scan-tree-dump-times "ADD_OVERFLOW" 1 "fre1" } } */
+/* { dg-final { scan-tree-dump "return 1;" "fre1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-33.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-33.c
new file mode 100644
index 00000000000..3b3bd629bc2
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-pre-33.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-pre" } */
+
+_Bool f1(unsigned x, unsigned y, unsigned *res, int flag, _Bool *t)
+{
+ if (flag)
+ *t = __builtin_add_overflow(x, y, res);
+ unsigned res1;
+ _Bool t1 = __builtin_add_overflow(x, y, &res1);
+ *res -= res1;
+ return *t==t1;
+}
+
+/* We should hoist the .ADD_OVERFLOW to before the check. */
+/* { dg-final { scan-tree-dump-times "ADD_OVERFLOW" 1 "pre" } } */
diff --git a/gcc/tree-ssa-pre.c b/gcc/tree-ssa-pre.c
index 08755847f66..1cc1aae694f 100644
--- a/gcc/tree-ssa-pre.c
+++ b/gcc/tree-ssa-pre.c
@@ -2855,9 +2855,13 @@ create_expression_by_pieces (basic_block block, pre_expr expr,
unsigned int operand = 1;
vn_reference_op_t currop = &ref->operands[0];
tree sc = NULL_TREE;
- tree fn = find_or_generate_expression (block, currop->op0, stmts);
- if (!fn)
- return NULL_TREE;
+ tree fn = NULL_TREE;
+ if (currop->op0)
+ {
+ fn = find_or_generate_expression (block, currop->op0, stmts);
+ if (!fn)
+ return NULL_TREE;
+ }
if (currop->op1)
{
sc = find_or_generate_expression (block, currop->op1, stmts);
@@ -2873,12 +2877,19 @@ create_expression_by_pieces (basic_block block, pre_expr expr,
return NULL_TREE;
args.quick_push (arg);
}
- gcall *call = gimple_build_call_vec (fn, args);
+ gcall *call;
+ if (currop->op0)
+ {
+ call = gimple_build_call_vec (fn, args);
+ gimple_call_set_fntype (call, currop->type);
+ }
+ else
+ call = gimple_build_call_internal_vec ((internal_fn)currop->clique,
+ args);
gimple_set_location (call, expr->loc);
- gimple_call_set_fntype (call, currop->type);
if (sc)
gimple_call_set_chain (call, sc);
- tree forcedname = make_ssa_name (TREE_TYPE (currop->type));
+ tree forcedname = make_ssa_name (ref->type);
gimple_call_set_lhs (call, forcedname);
/* There's no CCP pass after PRE which would re-compute alignment
information so make sure we re-materialize this here. */
@@ -4004,10 +4015,6 @@ compute_avail (function *fun)
vn_reference_s ref1;
pre_expr result = NULL;
- /* We can value number only calls to real functions. */
- if (gimple_call_internal_p (stmt))
- continue;
-
vn_reference_lookup_call (as_a <gcall *> (stmt), &ref, &ref1);
/* There is no point to PRE a call without a value. */
if (!ref || !ref->result)
diff --git a/gcc/tree-ssa-sccvn.c b/gcc/tree-ssa-sccvn.c
index 416a5252144..0d942218279 100644
--- a/gcc/tree-ssa-sccvn.c
+++ b/gcc/tree-ssa-sccvn.c
@@ -70,6 +70,7 @@ along with GCC; see the file COPYING3. If not see
#include "tree-scalar-evolution.h"
#include "tree-ssa-loop-niter.h"
#include "builtins.h"
+#include "fold-const-call.h"
#include "tree-ssa-sccvn.h"
/* This algorithm is based on the SCC algorithm presented by Keith
@@ -212,7 +213,8 @@ vn_reference_op_eq (const void *p1, const void *p2)
TYPE_MAIN_VARIANT (vro2->type))))
&& expressions_equal_p (vro1->op0, vro2->op0)
&& expressions_equal_p (vro1->op1, vro2->op1)
- && expressions_equal_p (vro1->op2, vro2->op2));
+ && expressions_equal_p (vro1->op2, vro2->op2)
+ && (vro1->opcode != CALL_EXPR || vro1->clique == vro2->clique));
}
/* Free a reference operation structure VP. */
@@ -264,15 +266,18 @@ print_vn_reference_ops (FILE *outfile, const vec<vn_reference_op_s> ops)
&& TREE_CODE_CLASS (vro->opcode) != tcc_declaration)
{
fprintf (outfile, "%s", get_tree_code_name (vro->opcode));
- if (vro->op0)
+ if (vro->op0 || vro->opcode == CALL_EXPR)
{
fprintf (outfile, "<");
closebrace = true;
}
}
- if (vro->op0)
+ if (vro->op0 || vro->opcode == CALL_EXPR)
{
- print_generic_expr (outfile, vro->op0);
+ if (!vro->op0)
+ fprintf (outfile, internal_fn_name ((internal_fn)vro->clique));
+ else
+ print_generic_expr (outfile, vro->op0);
if (vro->op1)
{
fprintf (outfile, ",");
@@ -684,6 +689,8 @@ static void
vn_reference_op_compute_hash (const vn_reference_op_t vro1, inchash::hash &hstate)
{
hstate.add_int (vro1->opcode);
+ if (vro1->opcode == CALL_EXPR && !vro1->op0)
+ hstate.add_int (vro1->clique);
if (vro1->op0)
inchash::add_expr (vro1->op0, hstate);
if (vro1->op1)
@@ -769,11 +776,16 @@ vn_reference_eq (const_vn_reference_t const vr1, const_vn_reference_t const vr2)
if (vr1->type != vr2->type)
return false;
}
+ else if (vr1->type == vr2->type)
+ ;
else if (COMPLETE_TYPE_P (vr1->type) != COMPLETE_TYPE_P (vr2->type)
|| (COMPLETE_TYPE_P (vr1->type)
&& !expressions_equal_p (TYPE_SIZE (vr1->type),
TYPE_SIZE (vr2->type))))
return false;
+ else if (vr1->operands[0].opcode == CALL_EXPR
+ && !types_compatible_p (vr1->type, vr2->type))
+ return false;
else if (INTEGRAL_TYPE_P (vr1->type)
&& INTEGRAL_TYPE_P (vr2->type))
{
@@ -1270,6 +1282,8 @@ copy_reference_ops_from_call (gcall *call,
temp.type = gimple_call_fntype (call);
temp.opcode = CALL_EXPR;
temp.op0 = gimple_call_fn (call);
+ if (gimple_call_internal_p (call))
+ temp.clique = gimple_call_internal_fn (call);
temp.op1 = gimple_call_chain (call);
if (stmt_could_throw_p (cfun, call) && (lr = lookup_stmt_eh_lp (call)) > 0)
temp.op2 = size_int (lr);
@@ -1459,9 +1473,11 @@ fully_constant_vn_reference_p (vn_reference_t ref)
a call to a builtin function with at most two arguments. */
op = &operands[0];
if (op->opcode == CALL_EXPR
- && TREE_CODE (op->op0) == ADDR_EXPR
- && TREE_CODE (TREE_OPERAND (op->op0, 0)) == FUNCTION_DECL
- && fndecl_built_in_p (TREE_OPERAND (op->op0, 0))
+ && (!op->op0
+ || (TREE_CODE (op->op0) == ADDR_EXPR
+ && TREE_CODE (TREE_OPERAND (op->op0, 0)) == FUNCTION_DECL
+ && fndecl_built_in_p (TREE_OPERAND (op->op0, 0),
+ BUILT_IN_NORMAL)))
&& operands.length () >= 2
&& operands.length () <= 3)
{
@@ -1481,13 +1497,17 @@ fully_constant_vn_reference_p (vn_reference_t ref)
anyconst = true;
if (anyconst)
{
- tree folded = build_call_expr (TREE_OPERAND (op->op0, 0),
- arg1 ? 2 : 1,
- arg0->op0,
- arg1 ? arg1->op0 : NULL);
- if (folded
- && TREE_CODE (folded) == NOP_EXPR)
- folded = TREE_OPERAND (folded, 0);
+ combined_fn fn;
+ if (op->op0)
+ fn = as_combined_fn (DECL_FUNCTION_CODE
+ (TREE_OPERAND (op->op0, 0)));
+ else
+ fn = as_combined_fn ((internal_fn) op->clique);
+ tree folded;
+ if (arg1)
+ folded = fold_const_call (fn, ref->type, arg0->op0, arg1->op0);
+ else
+ folded = fold_const_call (fn, ref->type, arg0->op0);
if (folded
&& is_gimple_min_invariant (folded))
return folded;
@@ -5648,28 +5668,27 @@ visit_stmt (gimple *stmt, bool backedges_varying_p = false)
&& TREE_CODE (TREE_OPERAND (fn, 0)) == FUNCTION_DECL)
extra_fnflags = flags_from_decl_or_type (TREE_OPERAND (fn, 0));
}
- if (!gimple_call_internal_p (call_stmt)
- && (/* Calls to the same function with the same vuse
- and the same operands do not necessarily return the same
- value, unless they're pure or const. */
- ((gimple_call_flags (call_stmt) | extra_fnflags)
- & (ECF_PURE | ECF_CONST))
- /* If calls have a vdef, subsequent calls won't have
- the same incoming vuse. So, if 2 calls with vdef have the
- same vuse, we know they're not subsequent.
- We can value number 2 calls to the same function with the
- same vuse and the same operands which are not subsequent
- the same, because there is no code in the program that can
- compare the 2 values... */
- || (gimple_vdef (call_stmt)
- /* ... unless the call returns a pointer which does
- not alias with anything else. In which case the
- information that the values are distinct are encoded
- in the IL. */
- && !(gimple_call_return_flags (call_stmt) & ERF_NOALIAS)
- /* Only perform the following when being called from PRE
- which embeds tail merging. */
- && default_vn_walk_kind == VN_WALK)))
+ if (/* Calls to the same function with the same vuse
+ and the same operands do not necessarily return the same
+ value, unless they're pure or const. */
+ ((gimple_call_flags (call_stmt) | extra_fnflags)
+ & (ECF_PURE | ECF_CONST))
+ /* If calls have a vdef, subsequent calls won't have
+ the same incoming vuse. So, if 2 calls with vdef have the
+ same vuse, we know they're not subsequent.
+ We can value number 2 calls to the same function with the
+ same vuse and the same operands which are not subsequent
+ the same, because there is no code in the program that can
+ compare the 2 values... */
+ || (gimple_vdef (call_stmt)
+ /* ... unless the call returns a pointer which does
+ not alias with anything else. In which case the
+ information that the values are distinct are encoded
+ in the IL. */
+ && !(gimple_call_return_flags (call_stmt) & ERF_NOALIAS)
+ /* Only perform the following when being called from PRE
+ which embeds tail merging. */
+ && default_vn_walk_kind == VN_WALK))
changed = visit_reference_op_call (lhs, call_stmt);
else
changed = defs_to_varying (call_stmt);
diff --git a/gcc/tree-ssa-sccvn.h b/gcc/tree-ssa-sccvn.h
index 96100596d2e..8a1b649c726 100644
--- a/gcc/tree-ssa-sccvn.h
+++ b/gcc/tree-ssa-sccvn.h
@@ -106,7 +106,8 @@ typedef const struct vn_phi_s *const_vn_phi_t;
typedef struct vn_reference_op_struct
{
ENUM_BITFIELD(tree_code) opcode : 16;
- /* Dependence info, used for [TARGET_]MEM_REF only. */
+ /* Dependence info, used for [TARGET_]MEM_REF only. For internal
+ function calls clique is also used for the internal function code. */
unsigned short clique;
unsigned short base;
unsigned reverse : 1;
</cut>
After gcc commit a459ee44c0a74b0df0485ed7a56683816c02aae9
Author: Kyrylo Tkachov <kyrylo.tkachov(a)arm.com>
aarch64: Improve size heuristic for cpymem expansion
the following benchmarks grew in size by more than 1%:
- 458.sjeng grew in size by 4% from 105780 to 109944 bytes
- 459.GemsFDTD grew in size by 2% from 247504 to 251468 bytes
The reproducer instructions below can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: GCC + Glibc + GNU Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -Os -flto
- Hardware: APM Mustang 8x X-Gene1
This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org. Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_gnu_apm/gnu-master-aarch64-spec2k6-Os_LTO
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Reproduce builds:
<cut>
mkdir investigate-gcc-a459ee44c0a74b0df0485ed7a56683816c02aae9
cd investigate-gcc-a459ee44c0a74b0df0485ed7a56683816c02aae9
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
cd gcc
# Reproduce first_bad build
git checkout --detach a459ee44c0a74b0df0485ed7a56683816c02aae9
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 8f95e3c04d659d541ca4937b3df2f1175a1c5f05
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit a459ee44c0a74b0df0485ed7a56683816c02aae9
Author: Kyrylo Tkachov <kyrylo.tkachov(a)arm.com>
Date: Wed Sep 29 11:21:45 2021 +0100
aarch64: Improve size heuristic for cpymem expansion
Similar to my previous patch for setmem, this one does the same for the cpymem expansion.
We count the number of ops emitted and compare it against the alternative of just calling
the library function when optimising for size.
For the code:
void
cpy_127 (char *out, char *in)
{
__builtin_memcpy (out, in, 127);
}
void
cpy_128 (char *out, char *in)
{
__builtin_memcpy (out, in, 128);
}
we now emit a call to memcpy (with an extra MOV-immediate instruction for the size) instead of:
cpy_127(char*, char*):
ldp q0, q1, [x1]
stp q0, q1, [x0]
ldp q0, q1, [x1, 32]
stp q0, q1, [x0, 32]
ldp q0, q1, [x1, 64]
stp q0, q1, [x0, 64]
ldr q0, [x1, 96]
str q0, [x0, 96]
ldr q0, [x1, 111]
str q0, [x0, 111]
ret
cpy_128(char*, char*):
ldp q0, q1, [x1]
stp q0, q1, [x0]
ldp q0, q1, [x1, 32]
stp q0, q1, [x0, 32]
ldp q0, q1, [x1, 64]
stp q0, q1, [x0, 64]
ldp q0, q1, [x1, 96]
stp q0, q1, [x0, 96]
ret
which is a clear code size win. Speed optimisation heuristics remain unchanged.
2021-09-29 Kyrylo Tkachov <kyrylo.tkachov(a)arm.com>
* config/aarch64/aarch64.c (aarch64_expand_cpymem): Count number of
emitted operations and adjust heuristic for code size.
2021-09-29 Kyrylo Tkachov <kyrylo.tkachov(a)arm.com>
* gcc.target/aarch64/cpymem-size.c: New test.
---
gcc/config/aarch64/aarch64.c | 36 ++++++++++++++++++--------
gcc/testsuite/gcc.target/aarch64/cpymem-size.c | 29 +++++++++++++++++++++
2 files changed, 54 insertions(+), 11 deletions(-)
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index ac17c1c88fb..a9a1800af53 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -23390,7 +23390,8 @@ aarch64_copy_one_block_and_progress_pointers (rtx *src, rtx *dst,
}
/* Expand cpymem, as if from a __builtin_memcpy. Return true if
- we succeed, otherwise return false. */
+ we succeed, otherwise return false, indicating that a libcall to
+ memcpy should be emitted. */
bool
aarch64_expand_cpymem (rtx *operands)
@@ -23407,11 +23408,13 @@ aarch64_expand_cpymem (rtx *operands)
unsigned HOST_WIDE_INT size = INTVAL (operands[2]);
- /* Inline up to 256 bytes when optimizing for speed. */
+ /* Try to inline up to 256 bytes. */
unsigned HOST_WIDE_INT max_copy_size = 256;
- if (optimize_function_for_size_p (cfun))
- max_copy_size = 128;
+ bool size_p = optimize_function_for_size_p (cfun);
+
+ if (size > max_copy_size)
+ return false;
int copy_bits = 256;
@@ -23421,13 +23424,14 @@ aarch64_expand_cpymem (rtx *operands)
|| !TARGET_SIMD
|| (aarch64_tune_params.extra_tuning_flags
& AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))
- {
- copy_bits = 128;
- max_copy_size = max_copy_size / 2;
- }
+ copy_bits = 128;
- if (size > max_copy_size)
- return false;
+ /* Emit an inline load+store sequence and count the number of operations
+ involved. We use a simple count of just the loads and stores emitted
+ rather than rtx_insn count as all the pointer adjustments and reg copying
+ in this function will get optimized away later in the pipeline. */
+ start_sequence ();
+ unsigned nops = 0;
base = copy_to_mode_reg (Pmode, XEXP (dst, 0));
dst = adjust_automodify_address (dst, VOIDmode, base, 0);
@@ -23456,7 +23460,8 @@ aarch64_expand_cpymem (rtx *operands)
cur_mode = V4SImode;
aarch64_copy_one_block_and_progress_pointers (&src, &dst, cur_mode);
-
+ /* A single block copy is 1 load + 1 store. */
+ nops += 2;
n -= mode_bits;
/* Emit trailing copies using overlapping unaligned accesses - this is
@@ -23471,7 +23476,16 @@ aarch64_expand_cpymem (rtx *operands)
n = n_bits;
}
}
+ rtx_insn *seq = get_insns ();
+ end_sequence ();
+
+ /* A memcpy libcall in the worst case takes 3 instructions to prepare the
+ arguments + 1 for the call. */
+ unsigned libcall_cost = 4;
+ if (size_p && libcall_cost < nops)
+ return false;
+ emit_insn (seq);
return true;
}
diff --git a/gcc/testsuite/gcc.target/aarch64/cpymem-size.c b/gcc/testsuite/gcc.target/aarch64/cpymem-size.c
new file mode 100644
index 00000000000..4d488b74301
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/cpymem-size.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-options "-Os" } */
+
+#include <stdlib.h>
+
+/*
+** cpy_127:
+** mov x2, 127
+** b memcpy
+*/
+void
+cpy_127 (char *out, char *in)
+{
+ __builtin_memcpy (out, in, 127);
+}
+
+/*
+** cpy_128:
+** mov x2, 128
+** b memcpy
+*/
+void
+cpy_128 (char *out, char *in)
+{
+ __builtin_memcpy (out, in, 128);
+}
+
+/* { dg-final { check-function-bodies "**" "" "" } } */
+
</cut>
Hi Arthur,
Thanks for looking into this!
The flags to compile regexec.c were:
-O3 --target=aarch64-linux-gnu -fgnu89-inline
Clang was configured with (on x86_64-linux-gnu host):
cmake -G Ninja ../llvm/llvm '-DLLVM_ENABLE_PROJECTS=clang;lld' -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=True -DCMAKE_INSTALL_PREFIX=../llvm-install -DLLVM_TARGETS_TO_BUILD=AArch64
Please let me know if the above doesn’t work for you.
Regards,
--
Maxim Kuvyrkov
https://www.linaro.org
> On 29 Sep 2021, at 20:47, Arthur Eubanks <aeubanks(a)google.com> wrote:
>
> Do you know the flags passed to Clang to compile the sources? I tried compiling the preprocessed sources but ran into the below, and couldn't find the flags in any of the logs.
>
> In file included from regexec.c:93:
> In file included from ./perl.h:384:
> In file included from /home/tcwg-buildslave/workspace/tcwg_bmk_0/abe/builds/destdir/x86_64-pc-linux-gnu/aarch64-linux-gnu/libc/usr/include/sys/types.h:144:
> /home/tcwg-buildslave/workspace/tcwg_bmk_0/llvm-install/lib/clang/14.0.0/include/stddef.h:46:27: error: typedef redefinition with different types ('unsigned long' vs 'unsigned long long')
> typedef long unsigned int size_t;
> ^
> 1 error generated.
>
>
>
> And yeah just moving the code around could cause major performance regressions, I've had other patches do the same for various benchmarks, there's not much we can do about that if that's actually the root cause. If I can compile the file I can check if the optimization actually created worse IR or not.
>
>
> On Wed, Sep 29, 2021 at 5:59 AM Maxim Kuvyrkov <maxim.kuvyrkov(a)linaro.org> wrote:
> Hi Arthur,
>
> Pre-processed source is in the save-temps tarballs linked below; S_regmatch() is in regexec.i .
>
> The save-temps also have .s assembly file for before and after your patch, and the only code-gen difference is in S_reginclass() function — see the attached screenshot #1.
>
> Looking into profile of S_regmatch(), some of the extra cycles come from hot loop starting with "cbz w19,..." getting misaligned — before your patch it was starting at "2bce10", and after it starts at "2bce6c".
>
> Maybe the added instructions in S_reginclass() pushed the loop in S_regmatch() in an unfortunate way?
>
> --
> Maxim Kuvyrkov
> https://www.linaro.org
>
>> On 27 Sep 2021, at 20:05, Arthur Eubanks <aeubanks(a)google.com> wrote:
>>
>> Could I get the source file with S_regmatch()?
>>
>> On Mon, Sep 27, 2021 at 6:07 AM Maxim Kuvyrkov <maxim.kuvyrkov(a)linaro.org> wrote:
>> Hi Arthur,
>>
>> Your patch seems to be slowing down 400.perlbench by 6% — due to a slowdown of its hot function S_regmatch() by 14%.
>>
>> Could you take a look if this is easily fixable, please?
>>
>> Regards,
>>
>> --
>> Maxim Kuvyrkov
>> https://www.linaro.org
>>
>> > On 24 Sep 2021, at 15:07, ci_notify(a)linaro.org wrote:
>> >
>> > After llvm commit e7249e4acf3cf9438d6d9e02edecebd5b622a4dc
>> > Author: Arthur Eubanks <aeubanks(a)google.com>
>> >
>> > [SimplifyCFG] Ignore free instructions when computing cost for folding branch to common dest
>> >
>> > the following benchmarks slowed down by more than 2%:
>> > - 400.perlbench slowed down by 6% from 9730 to 10312 perf samples
>> > - 400.perlbench:[.] S_regmatch slowed down by 14% from 3660 to 4188 perf samples
>> >
>> > The reproducer instructions below can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
>> >
>> > For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
>> > - First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
>> > - Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
>> > - Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
>> >
>> > Configuration:
>> > - Benchmark: SPEC CPU2006
>> > - Toolchain: Clang + Glibc + LLVM Linker
>> > - Version: all components were built from their tip of trunk
>> > - Target: aarch64-linux-gnu
>> > - Compiler flags: -O3
>> > - Hardware: NVidia TX1 4x Cortex-A57
>> >
>> > This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org. Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.
>
> <2021-09-29_15-44-27.png><2021-09-29_15-53-20.png>
Progress
* UM-2 [QEMU upstream maintainership]
+ Worked through my code-review backlog
+ Noticed that we never got round to making our emulated GICv3
support redistributors in more than one contiguous region; this
prevents using more than 123 CPUs with the virt board (see the
sizing note below). Sent out a patchset which adds the necessary
handling.
+ Generally trying to tie off loose ends pre-holiday :-)
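(Sizing note, as referenced above: assuming the current virt memory map,
each GICv3 redistributor occupies two 64 KiB frames, i.e. 0x20000 bytes,
and the board reserves a single 0xF60000-byte redistributor region, so
0xF60000 / 0x20000 = 123 CPUs.)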
-- PMM
Identified regression caused by *linux:30f349097897c115345beabeecc5e710b479ff1e*:
commit 30f349097897c115345beabeecc5e710b479ff1e
Merge: 9c566611ac5c f76c87e8c337
Author: Linus Torvalds <torvalds(a)linux-foundation.org>
Merge tag 'pm-5.15-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Results regressed to (for first_bad == 30f349097897c115345beabeecc5e710b479ff1e)
# reset_artifacts:
-10
# build_abe binutils:
-9
# build_abe stage1:
-5
# build_abe qemu:
-2
# linux_n_obj:
21782
# First few build errors in logs:
from (for last_good == 9c566611ac5cc7b45af943632f7a9b1b6a642991)
# reset_artifacts:
-10
# build_abe binutils:
-9
# build_abe stage1:
-5
# build_abe qemu:
-2
# linux_n_obj:
29893
# linux build successful:
all
This commit has regressed these CI configurations:
- tcwg_kernel/gnu-release-arm-mainline-allmodconfig
Artifacts of last_good build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-release-arm-mainline-a…
Artifacts of first_bad build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-release-arm-mainline-a…
Even more details: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-release-arm-mainline-a…
Reproduce builds:
<cut>
mkdir investigate-linux-30f349097897c115345beabeecc5e710b479ff1e
cd investigate-linux-30f349097897c115345beabeecc5e710b479ff1e
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-release-arm-mainline-a… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-release-arm-mainline-a… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-release-arm-mainline-a… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_kernel-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /linux/ ./ ./bisect/baseline/
cd linux
# Reproduce first_bad build
git checkout --detach 30f349097897c115345beabeecc5e710b479ff1e
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 9c566611ac5cc7b45af943632f7a9b1b6a642991
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 30f349097897c115345beabeecc5e710b479ff1e
Merge: 9c566611ac5c f76c87e8c337
Author: Linus Torvalds <torvalds(a)linux-foundation.org>
Date: Wed Sep 8 16:38:25 2021 -0700
Merge tag 'pm-5.15-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"These are mostly ARM cpufreq driver updates, including one new
MediaTek driver that has just passed all of the reviews, with the
addition of a revert of a recent intel_pstate commit, some core
cpufreq changes and a DT-related update of the operating performance
points (OPP) support code.
Specifics:
- Add new cpufreq driver for the MediaTek MT6779 platform called
mediatek-hw along with corresponding DT bindings (Hector.Yuan).
- Add DCVS interrupt support to the qcom-cpufreq-hw driver (Thara
Gopinath).
- Make the qcom-cpufreq-hw driver set the dvfs_possible_from_any_cpu
policy flag (Taniya Das).
- Blocklist more Qualcomm platforms in cpufreq-dt-platdev (Bjorn
Andersson).
- Make the vexpress cpufreq driver set the CPUFREQ_IS_COOLING_DEV
flag (Viresh Kumar).
- Add new cpufreq driver callback to allow drivers to register with
the Energy Model in a consistent way and make several drivers use
it (Viresh Kumar).
- Change the remaining users of the .ready() cpufreq driver callback
to move the code from it elsewhere and drop it from the cpufreq
core (Viresh Kumar).
- Revert recent intel_pstate change adding HWP guaranteed performance
change notification support to it that led to problems, because the
notification in question is triggered prematurely on some systems
(Rafael Wysocki).
- Convert the OPP DT bindings to DT schema and clean them up while at
it (Rob Herring)"
* tag 'pm-5.15-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
Revert "cpufreq: intel_pstate: Process HWP Guaranteed change notification"
cpufreq: mediatek-hw: Add support for CPUFREQ HW
cpufreq: Add of_perf_domain_get_sharing_cpumask
dt-bindings: cpufreq: add bindings for MediaTek cpufreq HW
cpufreq: Remove ready() callback
cpufreq: sh: Remove sh_cpufreq_cpu_ready()
cpufreq: acpi: Remove acpi_cpufreq_cpu_ready()
cpufreq: qcom-hw: Set dvfs_possible_from_any_cpu cpufreq driver flag
cpufreq: blocklist more Qualcomm platforms in cpufreq-dt-platdev
cpufreq: qcom-cpufreq-hw: Add dcvs interrupt support
cpufreq: scmi: Use .register_em() to register with energy model
cpufreq: vexpress: Use .register_em() to register with energy model
cpufreq: scpi: Use .register_em() to register with energy model
dt-bindings: opp: Convert to DT schema
dt-bindings: Clean-up OPP binding node names in examples
ARM: dts: omap: Drop references to opp.txt
cpufreq: qcom-cpufreq-hw: Use .register_em() to register with energy model
cpufreq: omap: Use .register_em() to register with energy model
cpufreq: mediatek: Use .register_em() to register with energy model
cpufreq: imx6q: Use .register_em() to register with energy model
...
Documentation/cpu-freq/cpu-drivers.rst | 3 -
.../devicetree/bindings/cpufreq/cpufreq-dt.txt | 2 +-
.../bindings/cpufreq/cpufreq-mediatek-hw.yaml | 70 +++
.../bindings/cpufreq/cpufreq-mediatek.txt | 2 +-
.../devicetree/bindings/cpufreq/cpufreq-st.txt | 6 +-
.../bindings/cpufreq/nvidia,tegra20-cpufreq.txt | 2 +-
.../devicetree/bindings/devfreq/rk3399_dmc.txt | 2 +-
.../devicetree/bindings/gpu/arm,mali-bifrost.yaml | 2 +-
.../devicetree/bindings/gpu/arm,mali-midgard.yaml | 2 +-
.../bindings/interconnect/fsl,imx8m-noc.yaml | 4 +-
.../opp/allwinner,sun50i-h6-operating-points.yaml | 4 +
Documentation/devicetree/bindings/opp/opp-v1.yaml | 51 ++
.../devicetree/bindings/opp/opp-v2-base.yaml | 214 +++++++
Documentation/devicetree/bindings/opp/opp-v2.yaml | 475 ++++++++++++++++
Documentation/devicetree/bindings/opp/opp.txt | 622 ---------------------
Documentation/devicetree/bindings/opp/qcom-opp.txt | 2 +-
.../bindings/opp/ti-omap5-opp-supply.txt | 2 +-
.../devicetree/bindings/power/power-domain.yaml | 2 +-
.../translations/zh_CN/cpu-freq/cpu-drivers.rst | 2 -
arch/arm/boot/dts/omap34xx.dtsi | 1 -
arch/arm/boot/dts/omap36xx.dtsi | 1 -
drivers/base/arch_topology.c | 2 +
drivers/cpufreq/Kconfig.arm | 12 +
drivers/cpufreq/Makefile | 1 +
drivers/cpufreq/acpi-cpufreq.c | 14 +-
drivers/cpufreq/cpufreq-dt-platdev.c | 4 +
drivers/cpufreq/cpufreq-dt.c | 3 +-
drivers/cpufreq/cpufreq.c | 17 +-
drivers/cpufreq/imx6q-cpufreq.c | 2 +-
drivers/cpufreq/intel_pstate.c | 39 --
drivers/cpufreq/mediatek-cpufreq-hw.c | 308 ++++++++++
drivers/cpufreq/mediatek-cpufreq.c | 3 +-
drivers/cpufreq/omap-cpufreq.c | 2 +-
drivers/cpufreq/qcom-cpufreq-hw.c | 151 ++++-
drivers/cpufreq/scmi-cpufreq.c | 65 ++-
drivers/cpufreq/scpi-cpufreq.c | 3 +-
drivers/cpufreq/sh-cpufreq.c | 11 -
drivers/cpufreq/vexpress-spc-cpufreq.c | 25 +-
include/linux/cpufreq.h | 75 ++-
39 files changed, 1441 insertions(+), 767 deletions(-)
</cut>
Successfully identified regression in *linux* in CI configuration tcwg_kernel/llvm-release-aarch64-next-allnoconfig. So far, this commit has regressed CI configurations:
- tcwg_kernel/llvm-release-aarch64-next-allnoconfig
Culprit:
<cut>
commit 8633ef82f101c040427b57d4df7b706261420b94
Author: Javier Martinez Canillas <javierm(a)redhat.com>
Date: Fri Jun 25 15:13:59 2021 +0200
drivers/firmware: consolidate EFI framebuffer setup for all arches
The register_gop_device() function registers an "efi-framebuffer" platform
device to match against the efifb driver, to have an early framebuffer for
EFI platforms.
But there is already support to do exactly the same in the Generic System
Framebuffers (sysfb) driver. This used to be X86-only, but it has been
moved to drivers/firmware and can be reused by other architectures.
Also, besides registering an "efi-framebuffer", this driver can
register a "simple-framebuffer", allowing the simple{fb,drm} drivers
to be used on non-X86 EFI platforms. For example, on aarch64 these
drivers can only be used with DT, since there is no code to register a
"simple-framebuffer" platform device when booting with EFI.
For these reasons, let's remove the duplicated register_gop_device() code
and instead move the platform-specific logic that's there to the sysfb driver.
Signed-off-by: Javier Martinez Canillas <javierm(a)redhat.com>
Acked-by: Borislav Petkov <bp(a)suse.de>
Acked-by: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Signed-off-by: Thomas Zimmermann <tzimmermann(a)suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20210625131359.1804394-1-javi…
</cut>
Results regressed to (for first_bad == 8633ef82f101c040427b57d4df7b706261420b94)
# reset_artifacts:
-10
# build_abe binutils:
-9
# build_llvm:
-5
# build_abe qemu:
-2
# linux_n_obj:
600
# First few build errors in logs:
# 00:00:38 ld.lld: error: undefined symbol: screen_info
# 00:00:38 make: *** [vmlinux] Error 1
from (for last_good == d391c58271072d0b0fad93c82018d495b2633448)
# reset_artifacts:
-10
# build_abe binutils:
-9
# build_llvm:
-5
# build_abe qemu:
-2
# linux_n_obj:
601
# linux build successful:
all
# linux boot successful:
boot
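For context, "undefined symbol: screen_info" is the usual sign of a
reference outliving a definition that became config-gated; a hypothetical
sketch of the failure shape (not the actual kernel code, and the real
config gate may differ):

struct screen_info { unsigned char orig_video_isVGA; };

#ifdef CONFIG_SYSFB                     /* hypothetical gate, off in allnoconfig */
struct screen_info screen_info;         /* definition compiled out...  */
#endif

extern struct screen_info screen_info;  /* ...while the use remains    */

int uses_screen_info (void)
{
  return screen_info.orig_video_isVGA;  /* ld.lld: undefined symbol    */
}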
Artifacts of last_good build: https://ci.linaro.org/job/tcwg_kernel-llvm-bisect-llvm-release-aarch64-next…
Artifacts of first_bad build: https://ci.linaro.org/job/tcwg_kernel-llvm-bisect-llvm-release-aarch64-next…
Build top page/logs: https://ci.linaro.org/job/tcwg_kernel-llvm-bisect-llvm-release-aarch64-next…
Configuration details:
rr[linux_git]="https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git#ff11764…"
Reproduce builds:
<cut>
mkdir investigate-linux-8633ef82f101c040427b57d4df7b706261420b94
cd investigate-linux-8633ef82f101c040427b57d4df7b706261420b94
git clone https://git.linaro.org/toolchain/jenkins-scripts
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_kernel-llvm-bisect-llvm-release-aarch64-next… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_kernel-llvm-bisect-llvm-release-aarch64-next… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_kernel-llvm-bisect-llvm-release-aarch64-next… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_kernel-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /linux/ ./ ./bisect/baseline/
cd linux
# Reproduce first_bad build
git checkout --detach 8633ef82f101c040427b57d4df7b706261420b94
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach d391c58271072d0b0fad93c82018d495b2633448
../artifacts/test.sh
cd ..
</cut>
History of pending regressions and results: https://git.linaro.org/toolchain/ci/base-artifacts.git/log/?h=linaro-local/…
Artifacts: https://ci.linaro.org/job/tcwg_kernel-llvm-bisect-llvm-release-aarch64-next…
Build log: https://ci.linaro.org/job/tcwg_kernel-llvm-bisect-llvm-release-aarch64-next…
Full commit (up to 1000 lines):
<cut>
commit 8633ef82f101c040427b57d4df7b706261420b94
Author: Javier Martinez Canillas <javierm(a)redhat.com>
Date: Fri Jun 25 15:13:59 2021 +0200
drivers/firmware: consolidate EFI framebuffer setup for all arches
The register_gop_device() function registers an "efi-framebuffer" platform
device to match against the efifb driver, to have an early framebuffer for
EFI platforms.
But there is already support to do exactly the same in the Generic System
Framebuffers (sysfb) driver. This used to be X86-only, but it has been
moved to drivers/firmware and can be reused by other architectures.
Also, besides registering an "efi-framebuffer", this driver can
register a "simple-framebuffer", allowing the simple{fb,drm} drivers
to be used on non-X86 EFI platforms. For example, on aarch64 these
drivers can only be used with DT, since there is no code to register a
"simple-framebuffer" platform device when booting with EFI.
For these reasons, let's remove the duplicated register_gop_device() code
and instead move the platform-specific logic that's there to the sysfb driver.
Signed-off-by: Javier Martinez Canillas <javierm(a)redhat.com>
Acked-by: Borislav Petkov <bp(a)suse.de>
Acked-by: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Signed-off-by: Thomas Zimmermann <tzimmermann(a)suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20210625131359.1804394-1-javi…
---
arch/arm/include/asm/efi.h | 5 +--
arch/arm64/include/asm/efi.h | 5 +--
arch/riscv/include/asm/efi.h | 5 +--
drivers/firmware/Kconfig | 8 ++--
drivers/firmware/Makefile | 2 +-
drivers/firmware/efi/efi-init.c | 90 ---------------------------------------
drivers/firmware/efi/sysfb_efi.c | 76 ++++++++++++++++++++++++++++++++-
drivers/firmware/sysfb.c | 35 ++++++++++-----
drivers/firmware/sysfb_simplefb.c | 31 ++++++++++----
drivers/gpu/drm/tiny/Kconfig | 4 +-
include/linux/sysfb.h | 26 +++++------
11 files changed, 143 insertions(+), 144 deletions(-)
diff --git a/arch/arm/include/asm/efi.h b/arch/arm/include/asm/efi.h
index 9de7ab2ce05d..a6f3b179e8a9 100644
--- a/arch/arm/include/asm/efi.h
+++ b/arch/arm/include/asm/efi.h
@@ -17,6 +17,7 @@
#ifdef CONFIG_EFI
void efi_init(void);
+extern void efifb_setup_from_dmi(struct screen_info *si, const char *opt);
int efi_create_mapping(struct mm_struct *mm, efi_memory_desc_t *md);
int efi_set_mapping_permissions(struct mm_struct *mm, efi_memory_desc_t *md);
@@ -52,10 +53,6 @@ void efi_virtmap_unload(void);
struct screen_info *alloc_screen_info(void);
void free_screen_info(struct screen_info *si);
-static inline void efifb_setup_from_dmi(struct screen_info *si, const char *opt)
-{
-}
-
/*
* A reasonable upper bound for the uncompressed kernel size is 32 MBytes,
* so we will reserve that amount of memory. We have no easy way to tell what
diff --git a/arch/arm64/include/asm/efi.h b/arch/arm64/include/asm/efi.h
index 3578aba9c608..42d673a011c8 100644
--- a/arch/arm64/include/asm/efi.h
+++ b/arch/arm64/include/asm/efi.h
@@ -14,6 +14,7 @@
#ifdef CONFIG_EFI
extern void efi_init(void);
+extern void efifb_setup_from_dmi(struct screen_info *si, const char *opt);
#else
#define efi_init()
#endif
@@ -85,10 +86,6 @@ static inline void free_screen_info(struct screen_info *si)
{
}
-static inline void efifb_setup_from_dmi(struct screen_info *si, const char *opt)
-{
-}
-
#define EFI_ALLOC_ALIGN SZ_64K
/*
diff --git a/arch/riscv/include/asm/efi.h b/arch/riscv/include/asm/efi.h
index 6d98cd999680..7a8f0d45b13a 100644
--- a/arch/riscv/include/asm/efi.h
+++ b/arch/riscv/include/asm/efi.h
@@ -13,6 +13,7 @@
#ifdef CONFIG_EFI
extern void efi_init(void);
+extern void efifb_setup_from_dmi(struct screen_info *si, const char *opt);
#else
#define efi_init()
#endif
@@ -39,10 +40,6 @@ static inline void free_screen_info(struct screen_info *si)
{
}
-static inline void efifb_setup_from_dmi(struct screen_info *si, const char *opt)
-{
-}
-
void efi_virtmap_load(void);
void efi_virtmap_unload(void);
diff --git a/drivers/firmware/Kconfig b/drivers/firmware/Kconfig
index 71f3d97f0c39..af6719cc576b 100644
--- a/drivers/firmware/Kconfig
+++ b/drivers/firmware/Kconfig
@@ -254,9 +254,9 @@ config QCOM_SCM_DOWNLOAD_MODE_DEFAULT
config SYSFB
bool
default y
- depends on X86 || COMPILE_TEST
+ depends on X86 || ARM || ARM64 || RISCV || COMPILE_TEST
-config X86_SYSFB
+config SYSFB_SIMPLEFB
bool "Mark VGA/VBE/EFI FB as generic system framebuffer"
depends on SYSFB
help
@@ -264,10 +264,10 @@ config X86_SYSFB
bootloader or kernel can show basic video-output during boot for
user-guidance and debugging. Historically, x86 used the VESA BIOS
Extensions and EFI-framebuffers for this, which are mostly limited
- to x86.
+ to x86 BIOS or EFI systems.
This option, if enabled, marks VGA/VBE/EFI framebuffers as generic
framebuffers so the new generic system-framebuffer drivers can be
- used on x86. If the framebuffer is not compatible with the generic
+ used instead. If the framebuffer is not compatible with the generic
modes, it is advertised as fallback platform framebuffer so legacy
drivers like efifb, vesafb and uvesafb can pick it up.
If this option is not selected, all system framebuffers are always
diff --git a/drivers/firmware/Makefile b/drivers/firmware/Makefile
index ad78f78ffa8d..6ac637e422b9 100644
--- a/drivers/firmware/Makefile
+++ b/drivers/firmware/Makefile
@@ -19,7 +19,7 @@ obj-$(CONFIG_RASPBERRYPI_FIRMWARE) += raspberrypi.o
obj-$(CONFIG_FW_CFG_SYSFS) += qemu_fw_cfg.o
obj-$(CONFIG_QCOM_SCM) += qcom_scm.o qcom_scm-smc.o qcom_scm-legacy.o
obj-$(CONFIG_SYSFB) += sysfb.o
-obj-$(CONFIG_X86_SYSFB) += sysfb_simplefb.o
+obj-$(CONFIG_SYSFB_SIMPLEFB) += sysfb_simplefb.o
obj-$(CONFIG_TI_SCI_PROTOCOL) += ti_sci.o
obj-$(CONFIG_TRUSTED_FOUNDATIONS) += trusted_foundations.o
obj-$(CONFIG_TURRIS_MOX_RWTM) += turris-mox-rwtm.o
diff --git a/drivers/firmware/efi/efi-init.c b/drivers/firmware/efi/efi-init.c
index a552a08a1741..b19ce1a83f91 100644
--- a/drivers/firmware/efi/efi-init.c
+++ b/drivers/firmware/efi/efi-init.c
@@ -275,93 +275,3 @@ void __init efi_init(void)
}
#endif
}
-
-static bool efifb_overlaps_pci_range(const struct of_pci_range *range)
-{
- u64 fb_base = screen_info.lfb_base;
-
- if (screen_info.capabilities & VIDEO_CAPABILITY_64BIT_BASE)
- fb_base |= (u64)(unsigned long)screen_info.ext_lfb_base << 32;
-
- return fb_base >= range->cpu_addr &&
- fb_base < (range->cpu_addr + range->size);
-}
-
-static struct device_node *find_pci_overlap_node(void)
-{
- struct device_node *np;
-
- for_each_node_by_type(np, "pci") {
- struct of_pci_range_parser parser;
- struct of_pci_range range;
- int err;
-
- err = of_pci_range_parser_init(&parser, np);
- if (err) {
- pr_warn("of_pci_range_parser_init() failed: %d\n", err);
- continue;
- }
-
- for_each_of_pci_range(&parser, &range)
- if (efifb_overlaps_pci_range(&range))
- return np;
- }
- return NULL;
-}
-
-/*
- * If the efifb framebuffer is backed by a PCI graphics controller, we have
- * to ensure that this relation is expressed using a device link when
- * running in DT mode, or the probe order may be reversed, resulting in a
- * resource reservation conflict on the memory window that the efifb
- * framebuffer steals from the PCIe host bridge.
- */
-static int efifb_add_links(struct fwnode_handle *fwnode)
-{
- struct device_node *sup_np;
-
- sup_np = find_pci_overlap_node();
-
- /*
- * If there's no PCI graphics controller backing the efifb, we are
- * done here.
- */
- if (!sup_np)
- return 0;
-
- fwnode_link_add(fwnode, of_fwnode_handle(sup_np));
- of_node_put(sup_np);
-
- return 0;
-}
-
-static const struct fwnode_operations efifb_fwnode_ops = {
- .add_links = efifb_add_links,
-};
-
-static struct fwnode_handle efifb_fwnode;
-
-static int __init register_gop_device(void)
-{
- struct platform_device *pd;
- int err;
-
- if (screen_info.orig_video_isVGA != VIDEO_TYPE_EFI)
- return 0;
-
- pd = platform_device_alloc("efi-framebuffer", 0);
- if (!pd)
- return -ENOMEM;
-
- if (IS_ENABLED(CONFIG_PCI)) {
- fwnode_init(&efifb_fwnode, &efifb_fwnode_ops);
- pd->dev.fwnode = &efifb_fwnode;
- }
-
- err = platform_device_add_data(pd, &screen_info, sizeof(screen_info));
- if (err)
- return err;
-
- return platform_device_add(pd);
-}
-subsys_initcall(register_gop_device);
diff --git a/drivers/firmware/efi/sysfb_efi.c b/drivers/firmware/efi/sysfb_efi.c
index 9f035b15501c..f51865e1b876 100644
--- a/drivers/firmware/efi/sysfb_efi.c
+++ b/drivers/firmware/efi/sysfb_efi.c
@@ -1,6 +1,6 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/*
- * Generic System Framebuffers on x86
+ * Generic System Framebuffers
* Copyright (c) 2012-2013 David Herrmann <dh.herrmann(a)gmail.com>
*
* EFI Quirks Copyright (c) 2006 Edgar Hucek <gimli(a)dark-green.com>
@@ -19,7 +19,9 @@
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/mm.h>
+#include <linux/of_address.h>
#include <linux/pci.h>
+#include <linux/platform_device.h>
#include <linux/screen_info.h>
#include <linux/sysfb.h>
#include <video/vga.h>
@@ -267,7 +269,72 @@ static const struct dmi_system_id efifb_dmi_swap_width_height[] __initconst = {
{},
};
-__init void sysfb_apply_efi_quirks(void)
+static bool efifb_overlaps_pci_range(const struct of_pci_range *range)
+{
+ u64 fb_base = screen_info.lfb_base;
+
+ if (screen_info.capabilities & VIDEO_CAPABILITY_64BIT_BASE)
+ fb_base |= (u64)(unsigned long)screen_info.ext_lfb_base << 32;
+
+ return fb_base >= range->cpu_addr &&
+ fb_base < (range->cpu_addr + range->size);
+}
+
+static struct device_node *find_pci_overlap_node(void)
+{
+ struct device_node *np;
+
+ for_each_node_by_type(np, "pci") {
+ struct of_pci_range_parser parser;
+ struct of_pci_range range;
+ int err;
+
+ err = of_pci_range_parser_init(&parser, np);
+ if (err) {
+ pr_warn("of_pci_range_parser_init() failed: %d\n", err);
+ continue;
+ }
+
+ for_each_of_pci_range(&parser, &range)
+ if (efifb_overlaps_pci_range(&range))
+ return np;
+ }
+ return NULL;
+}
+
+/*
+ * If the efifb framebuffer is backed by a PCI graphics controller, we have
+ * to ensure that this relation is expressed using a device link when
+ * running in DT mode, or the probe order may be reversed, resulting in a
+ * resource reservation conflict on the memory window that the efifb
+ * framebuffer steals from the PCIe host bridge.
+ */
+static int efifb_add_links(struct fwnode_handle *fwnode)
+{
+ struct device_node *sup_np;
+
+ sup_np = find_pci_overlap_node();
+
+ /*
+ * If there's no PCI graphics controller backing the efifb, we are
+ * done here.
+ */
+ if (!sup_np)
+ return 0;
+
+ fwnode_link_add(fwnode, of_fwnode_handle(sup_np));
+ of_node_put(sup_np);
+
+ return 0;
+}
+
+static const struct fwnode_operations efifb_fwnode_ops = {
+ .add_links = efifb_add_links,
+};
+
+static struct fwnode_handle efifb_fwnode;
+
+__init void sysfb_apply_efi_quirks(struct platform_device *pd)
{
if (screen_info.orig_video_isVGA != VIDEO_TYPE_EFI ||
!(screen_info.capabilities & VIDEO_CAPABILITY_SKIP_QUIRKS))
@@ -281,4 +348,9 @@ __init void sysfb_apply_efi_quirks(void)
screen_info.lfb_height = temp;
screen_info.lfb_linelength = 4 * screen_info.lfb_width;
}
+
+ if (screen_info.orig_video_isVGA == VIDEO_TYPE_EFI && IS_ENABLED(CONFIG_PCI)) {
+ fwnode_init(&efifb_fwnode, &efifb_fwnode_ops);
+ pd->dev.fwnode = &efifb_fwnode;
+ }
}
diff --git a/drivers/firmware/sysfb.c b/drivers/firmware/sysfb.c
index 1337515963d5..2bfbb05f7d89 100644
--- a/drivers/firmware/sysfb.c
+++ b/drivers/firmware/sysfb.c
@@ -1,11 +1,11 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/*
- * Generic System Framebuffers on x86
+ * Generic System Framebuffers
* Copyright (c) 2012-2013 David Herrmann <dh.herrmann(a)gmail.com>
*/
/*
- * Simple-Framebuffer support for x86 systems
+ * Simple-Framebuffer support
* Create a platform-device for any available boot framebuffer. The
* simple-framebuffer platform device is already available on DT systems, so
* this module parses the global "screen_info" object and creates a suitable
@@ -16,12 +16,12 @@
* to pick these devices up without messing with simple-framebuffer drivers.
* The global "screen_info" is still valid at all times.
*
- * If CONFIG_X86_SYSFB is not selected, we never register "simple-framebuffer"
+ * If CONFIG_SYSFB_SIMPLEFB is not selected, never register "simple-framebuffer"
* platform devices, but only use legacy framebuffer devices for
* backwards compatibility.
*
* TODO: We set the dev_id field of all platform-devices to 0. This allows
- * other x86 OF/DT parsers to create such devices, too. However, they must
+ * other OF/DT parsers to create such devices, too. However, they must
* start at offset 1 for this to work.
*/
@@ -43,12 +43,10 @@ static __init int sysfb_init(void)
bool compatible;
int ret;
- sysfb_apply_efi_quirks();
-
/* try to create a simple-framebuffer device */
- compatible = parse_mode(si, &mode);
+ compatible = sysfb_parse_mode(si, &mode);
if (compatible) {
- ret = create_simplefb(si, &mode);
+ ret = sysfb_create_simplefb(si, &mode);
if (!ret)
return 0;
}
@@ -61,9 +59,24 @@ static __init int sysfb_init(void)
else
name = "platform-framebuffer";
- pd = platform_device_register_resndata(NULL, name, 0,
- NULL, 0, si, sizeof(*si));
- return PTR_ERR_OR_ZERO(pd);
+ pd = platform_device_alloc(name, 0);
+ if (!pd)
+ return -ENOMEM;
+
+ sysfb_apply_efi_quirks(pd);
+
+ ret = platform_device_add_data(pd, si, sizeof(*si));
+ if (ret)
+ goto err;
+
+ ret = platform_device_add(pd);
+ if (ret)
+ goto err;
+
+ return 0;
+err:
+ platform_device_put(pd);
+ return ret;
}
/* must execute after PCI subsystem for EFI quirks */
diff --git a/drivers/firmware/sysfb_simplefb.c b/drivers/firmware/sysfb_simplefb.c
index df892444ea17..b86761904949 100644
--- a/drivers/firmware/sysfb_simplefb.c
+++ b/drivers/firmware/sysfb_simplefb.c
@@ -1,6 +1,6 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/*
- * Generic System Framebuffers on x86
+ * Generic System Framebuffers
* Copyright (c) 2012-2013 David Herrmann <dh.herrmann(a)gmail.com>
*/
@@ -23,9 +23,9 @@
static const char simplefb_resname[] = "BOOTFB";
static const struct simplefb_format formats[] = SIMPLEFB_FORMATS;
-/* try parsing x86 screen_info into a simple-framebuffer mode struct */
-__init bool parse_mode(const struct screen_info *si,
- struct simplefb_platform_data *mode)
+/* try parsing screen_info into a simple-framebuffer mode struct */
+__init bool sysfb_parse_mode(const struct screen_info *si,
+ struct simplefb_platform_data *mode)
{
const struct simplefb_format *f;
__u8 type;
@@ -57,13 +57,14 @@ __init bool parse_mode(const struct screen_info *si,
return false;
}
-__init int create_simplefb(const struct screen_info *si,
- const struct simplefb_platform_data *mode)
+__init int sysfb_create_simplefb(const struct screen_info *si,
+ const struct simplefb_platform_data *mode)
{
struct platform_device *pd;
struct resource res;
u64 base, size;
u32 length;
+ int ret;
/*
* If the 64BIT_BASE capability is set, ext_lfb_base will contain the
@@ -105,7 +106,19 @@ __init int create_simplefb(const struct screen_info *si,
if (res.end <= res.start)
return -EINVAL;
- pd = platform_device_register_resndata(NULL, "simple-framebuffer", 0,
- &res, 1, mode, sizeof(*mode));
- return PTR_ERR_OR_ZERO(pd);
+ pd = platform_device_alloc("simple-framebuffer", 0);
+ if (!pd)
+ return -ENOMEM;
+
+ sysfb_apply_efi_quirks(pd);
+
+ ret = platform_device_add_resources(pd, &res, 1);
+ if (ret)
+ return ret;
+
+ ret = platform_device_add_data(pd, mode, sizeof(*mode));
+ if (ret)
+ return ret;
+
+ return platform_device_add(pd);
}
diff --git a/drivers/gpu/drm/tiny/Kconfig b/drivers/gpu/drm/tiny/Kconfig
index 5593128eeff9..d31be274a2bd 100644
--- a/drivers/gpu/drm/tiny/Kconfig
+++ b/drivers/gpu/drm/tiny/Kconfig
@@ -64,8 +64,8 @@ config DRM_SIMPLEDRM
buffer, size, and display format must be provided via device tree,
UEFI, VESA, etc.
- On x86 and compatible, you should also select CONFIG_X86_SYSFB to
- use UEFI and VESA framebuffers.
+ On x86 BIOS or UEFI systems, you should also select SYSFB_SIMPLEFB
+ to use UEFI and VESA framebuffers.
config TINYDRM_HX8357D
tristate "DRM support for HX8357D display panels"
diff --git a/include/linux/sysfb.h b/include/linux/sysfb.h
index 3e5355769dc3..b0dcfa26d07b 100644
--- a/include/linux/sysfb.h
+++ b/include/linux/sysfb.h
@@ -58,37 +58,37 @@ struct efifb_dmi_info {
#ifdef CONFIG_EFI
extern struct efifb_dmi_info efifb_dmi_list[];
-void sysfb_apply_efi_quirks(void);
+void sysfb_apply_efi_quirks(struct platform_device *pd);
#else /* CONFIG_EFI */
-static inline void sysfb_apply_efi_quirks(void)
+static inline void sysfb_apply_efi_quirks(struct platform_device *pd)
{
}
#endif /* CONFIG_EFI */
-#ifdef CONFIG_X86_SYSFB
+#ifdef CONFIG_SYSFB_SIMPLEFB
-bool parse_mode(const struct screen_info *si,
- struct simplefb_platform_data *mode);
-int create_simplefb(const struct screen_info *si,
- const struct simplefb_platform_data *mode);
+bool sysfb_parse_mode(const struct screen_info *si,
+ struct simplefb_platform_data *mode);
+int sysfb_create_simplefb(const struct screen_info *si,
+ const struct simplefb_platform_data *mode);
-#else /* CONFIG_X86_SYSFB */
+#else /* CONFIG_SYSFB_SIMPLE */
-static inline bool parse_mode(const struct screen_info *si,
- struct simplefb_platform_data *mode)
+static inline bool sysfb_parse_mode(const struct screen_info *si,
+ struct simplefb_platform_data *mode)
{
return false;
}
-static inline int create_simplefb(const struct screen_info *si,
- const struct simplefb_platform_data *mode)
+static inline int sysfb_create_simplefb(const struct screen_info *si,
+ const struct simplefb_platform_data *mode)
{
return -EINVAL;
}
-#endif /* CONFIG_X86_SYSFB */
+#endif /* CONFIG_SYSFB_SIMPLE */
#endif /* _LINUX_SYSFB_H */
</cut>
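For context, the common thread in the quoted sysfb changes is a switch from the one-shot platform_device_register_resndata() helper to a split allocate/add sequence, so that sysfb_apply_efi_quirks() can attach the efifb fwnode to the device before it is added. A minimal sketch of that pattern follows (it assumes the standard kernel platform-device API; the helper name and its parameters are illustrative, not from the patch):
<cut>
#include <linux/platform_device.h>
#include <linux/sysfb.h>

/*
 * Sketch: allocate the device first so quirks (e.g. pd->dev.fwnode)
 * can be applied before platform_device_add() runs, and unwind with
 * platform_device_put() on failure, mirroring the new sysfb_init().
 */
static int __init register_fb_device(const char *name,
				     const void *data, size_t len)
{
	struct platform_device *pd;
	int ret;

	pd = platform_device_alloc(name, 0);
	if (!pd)
		return -ENOMEM;

	sysfb_apply_efi_quirks(pd);	/* may set pd->dev.fwnode */

	ret = platform_device_add_data(pd, data, len);
	if (ret)
		goto err_put;

	ret = platform_device_add(pd);
	if (ret)
		goto err_put;

	return 0;

err_put:
	platform_device_put(pd);
	return ret;
}
</cut>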
Hi Greg,
This appears to have been a fluke. Boot-testing succeeded before the merge and failed after. Boot-testing on allmodconfig doesn’t seem to be stable, so we are going to disable it.
Regards,
--
Maxim Kuvyrkov
https://www.linaro.org
> On 18 Aug 2021, at 08:38, Greg Kroah-Hartman <gregkh(a)linuxfoundation.org> wrote:
>
> On Wed, Aug 18, 2021 at 05:22:07AM +0000, ci_notify(a)linaro.org wrote:
>> Successfully identified regression in *linux* in CI configuration tcwg_kernel/llvm-master-aarch64-lts-allmodconfig. So far, this commit has regressed CI configurations:
>> - tcwg_kernel/llvm-master-aarch64-lts-allmodconfig
>>
>> Culprit:
>> <cut>
>> commit 132a8267adabd645476b542b3b132c1b91988fe8
>> Author: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
>> Date: Thu Aug 12 13:22:21 2021 +0200
>>
>> Linux 5.10.58
>
> <snip>
>
> And what am I supposed to do with this information?
>
> --
> You received this message because you are subscribed to the Google Groups "Clang Built Linux" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to clang-built-linux+unsubscribe(a)googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/clang-built-linux/YRyczv2OCq51edQh%40kroa….
[TCWG CI] Regression caused by binutils: [gdb/testsuite] Add gdb.testsuite/dump-system-info.exp:
commit b4e4386a2e58ba6ce8d02b952f1bc6ceb8fc95d1
Author: Tom de Vries <tdevries(a)suse.de>
[gdb/testsuite] Add gdb.testsuite/dump-system-info.exp
Results regressed to
# reset_artifacts:
-10
# build_abe binutils:
-9
# build_abe stage1 -- --set gcc_override_configure=--disable-libsanitizer --set gcc_override_configure=--disable-multilib --set gcc_override_configure=--with-cpu=cortex-m4 --set gcc_override_configure=--with-mode=thumb --set gcc_override_configure=--with-float=hard:
-8
# build_abe newlib:
-6
# build_abe stage2 -- --patch linaro-local/vect-metric-branch --set gcc_override_configure=--disable-libsanitizer --set gcc_override_configure=--disable-multilib --set gcc_override_configure=--with-cpu=cortex-m4 --set gcc_override_configure=--with-mode=thumb --set gcc_override_configure=--with-float=hard:
-5
# true:
0
# benchmark -- -O3_VECT_mthumb artifacts/build-b4e4386a2e58ba6ce8d02b952f1bc6ceb8fc95d1/results_id:
1
from
# reset_artifacts:
-10
# build_abe binutils:
-9
# build_abe stage1 -- --set gcc_override_configure=--disable-libsanitizer --set gcc_override_configure=--disable-multilib --set gcc_override_configure=--with-cpu=cortex-m4 --set gcc_override_configure=--with-mode=thumb --set gcc_override_configure=--with-float=hard:
-8
# build_abe newlib:
-6
# build_abe stage2 -- --patch linaro-local/vect-metric-branch --set gcc_override_configure=--disable-libsanitizer --set gcc_override_configure=--disable-multilib --set gcc_override_configure=--with-cpu=cortex-m4 --set gcc_override_configure=--with-mode=thumb --set gcc_override_configure=--with-float=hard:
-5
# true:
0
# benchmark -- -O3_VECT_mthumb artifacts/build-baseline/results_id:
1
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_gnu_eabi_stm32/gnu_eabi-master-arm_eabi-coremark-O3_VECT
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu_eabi-bisect-tcwg_bmk_stm32-gnu_ea…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu_eabi-bisect-tcwg_bmk_stm32-gnu_ea…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu_eabi-bisect-tcwg_bmk_stm32-gnu_ea…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_gnu_eabi-bisect-tcwg_bmk_stm32-gnu_ea…
Reproduce builds:
<cut>
mkdir investigate-binutils-b4e4386a2e58ba6ce8d02b952f1bc6ceb8fc95d1
cd investigate-binutils-b4e4386a2e58ba6ce8d02b952f1bc6ceb8fc95d1
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu_eabi-bisect-tcwg_bmk_stm32-gnu_ea… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu_eabi-bisect-tcwg_bmk_stm32-gnu_ea… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu_eabi-bisect-tcwg_bmk_stm32-gnu_ea… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /binutils/ ./ ./bisect/baseline/
cd binutils
# Reproduce first_bad build
git checkout --detach b4e4386a2e58ba6ce8d02b952f1bc6ceb8fc95d1
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 3814a9e1fe77c01c7e872c25afa198537d4ac780
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit b4e4386a2e58ba6ce8d02b952f1bc6ceb8fc95d1
Author: Tom de Vries <tdevries(a)suse.de>
Date: Fri Sep 24 12:39:14 2021 +0200
[gdb/testsuite] Add gdb.testsuite/dump-system-info.exp
When interpreting the testsuite results, it's often relevant what kind of
machine the testsuite ran on. On a local machine one can just do
/proc/cpuinfo, but in case of running tests using a remote system
that distributes test runs to other remote systems that are not directly
accessible, that's not possible.
Fix this by dumping /proc/cpuinfo into the gdb.log, as well as lsb_release -a
and uname -a.
We could do this at the start of each test run, by putting it into unix.exp
or some such. However, this might be too verbose, so we choose to put it into
its own test-case, such that it get triggered in a full testrun, but not when
running one or a subset of tests.
We put the test-case into the gdb.testsuite directory, which is currently the
only place in the testsuite where we do not test gdb. [ Though perhaps this
could be put into a new gdb.info directory, since the test-case doesn't
actually test the testsuite. ]
Tested on x86_64-linux.
---
gdb/testsuite/gdb.testsuite/dump-system-info.exp | 48 ++++++++++++++++++++++++
1 file changed, 48 insertions(+)
diff --git a/gdb/testsuite/gdb.testsuite/dump-system-info.exp b/gdb/testsuite/gdb.testsuite/dump-system-info.exp
new file mode 100644
index 00000000000..bf181469bd5
--- /dev/null
+++ b/gdb/testsuite/gdb.testsuite/dump-system-info.exp
@@ -0,0 +1,48 @@
+# Copyright 2021 Free Software Foundation, Inc.
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+# The purpose of this test-case is to dump /proc/cpuinfo and similar system
+# info into gdb.log.
+
+# Check if /proc/cpuinfo is available.
+set res [remote_exec target "test -r /proc/cpuinfo"]
+set status [lindex $res 0]
+set output [lindex $res 1]
+
+if { $status == 0 && $output == "" } {
+ verbose -log "Cpuinfo available, dumping:"
+ remote_exec target "cat /proc/cpuinfo"
+} else {
+ verbose -log "Cpuinfo not available"
+}
+
+set res [remote_exec target "lsb_release -a"]
+set status [lindex $res 0]
+set output [lindex $res 1]
+
+if { $status == 0 } {
+ verbose -log "lsb_release -a availabe, dumping:\n$output"
+} else {
+ verbose -log "lsb_release -a not available"
+}
+
+set res [remote_exec target "uname -a"]
+set status [lindex $res 0]
+set output [lindex $res 1]
+
+if { $status == 0 } {
+ verbose -log "uname -a availabe, dumping:\n$output"
+} else {
+ verbose -log "uname -a not available"
+}
</cut>
After llvm commit e8e2edd8ca88f8b0a7dba141349b2aa83284f3af
Author: Erich Keane <erich.keane(a)intel.com>
Fix test from 8dd42f, capitalization in test
the following benchmarks slowed down by more than 2%:
- 464.h264ref slowed down by 3% from 10973 to 11249 perf samples
- 464.h264ref:[.] FastFullPelBlockMotionSearch slowed down by 12% from 1446 to 1619 perf samples
The reproducer instructions below can be used to re-build both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O3
- Hardware: NVidia TX1 4x Cortex-A57
This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org. Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O3
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-…
Reproduce builds:
<cut>
mkdir investigate-llvm-e8e2edd8ca88f8b0a7dba141349b2aa83284f3af
cd investigate-llvm-e8e2edd8ca88f8b0a7dba141349b2aa83284f3af
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach e8e2edd8ca88f8b0a7dba141349b2aa83284f3af
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 77d200a546136c2855063613ff4bca1f682fb23a
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit e8e2edd8ca88f8b0a7dba141349b2aa83284f3af
Author: Erich Keane <erich.keane(a)intel.com>
Date: Fri Sep 24 10:24:17 2021 -0700
Fix test from 8dd42f, capitalization in test
---
clang/test/CXX/drs/dr17xx.cpp | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/clang/test/CXX/drs/dr17xx.cpp b/clang/test/CXX/drs/dr17xx.cpp
index 42303c83ae3c..c8648908ebda 100644
--- a/clang/test/CXX/drs/dr17xx.cpp
+++ b/clang/test/CXX/drs/dr17xx.cpp
@@ -129,7 +129,7 @@ namespace dr1778 { // dr1778: 9
namespace dr1762 { // dr1762: 14
#if __cplusplus >= 201103L
float operator ""_E(const char *);
- // expected-error@+2 {{invalid suffix on literal; c++11 requires a space between literal and identifier}}
+ // expected-error@+2 {{invalid suffix on literal; C++11 requires a space between literal and identifier}}
// expected-warning@+1 {{user-defined literal suffixes not starting with '_' are reserved; no literal will invoke this operator}}
float operator ""E(const char *);
#endif
</cut>
Thanks, Stanislav,
FWIW, it will probably be easier for you to just rebuild the compiler; it is an x86_64-linux-gnu -> arm-linux-gnueabihf cross. The build log is at [1].
cmake -G Ninja ../llvm/llvm '-DLLVM_ENABLE_PROJECTS=clang;lld' -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=True -DCMAKE_INSTALL_PREFIX=../llvm-install -DLLVM_TARGETS_TO_BUILD=ARM
Then compile the pre-processed source with plain -O2 or -O3 optimisation settings.
[1] https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tk1-llvm-master-…
Regards,
--
Maxim Kuvyrkov
https://www.linaro.org
> On 24 Sep 2021, at 20:30, Mekhanoshin, Stanislav <Stanislav.Mekhanoshin(a)amd.com> wrote:
>
> [AMD Official Use Only]
>
> I have reverted the whole change. There was yet another perf regression report.
>
> Stas
>
> From: Mekhanoshin, Stanislav
> Sent: Thursday, September 23, 2021 11:48
> To: Maxim Kuvyrkov <maxim.kuvyrkov(a)linaro.org>
> Cc: linaro-toolchain <linaro-toolchain(a)lists.linaro.org>
> Subject: RE: [TCWG CI] 456.hmmer slowed down by 6% after llvm: Allow rematerialization of virtual reg uses
>
> Thanks. I see the reload. There shall not be extra pressure since that is the whole idea, make pressure less. However, I see more spills in that specific file, fast_algorithms.s if I get it right.
> Can I get the IR for it? Something to feed llc.
>
> Stas
>
> From: Maxim Kuvyrkov <maxim.kuvyrkov(a)linaro.org>
> Sent: Thursday, September 23, 2021 2:31
> To: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin(a)amd.com>
> Cc: linaro-toolchain <linaro-toolchain(a)lists.linaro.org>
> Subject: Re: [TCWG CI] 456.hmmer slowed down by 6% after llvm: Allow rematerialization of virtual reg uses
>
> [CAUTION: External Email]
>
> Thanks, Stanislav.
>
> I’ve looked into the profile dumps, and 456.hmmer’s hot loop gets several additional reloads. E.g., "ldr r1, [sp, #84]" generates 203 additional samples, which translates into 20 seconds of time just for that one instruction.
>
> See the attached profile dumps and the screenshot with the hot loop highlighted.
>
> Maybe your patch increases register pressure too much?
>
> Regards,
>
> --
> Maxim Kuvyrkov
> https://www.linaro.org
>
> > On 22 Sep 2021, at 22:35, Mekhanoshin, Stanislav <Stanislav.Mekhanoshin(a)amd.com> wrote:
> >
> > [AMD Official Use Only]
> >
> > There are actually a couple of things worth trying if that is easy:
> >
> > https://reviews.llvm.org/D109077
> > https://reviews.llvm.org/differential/diff/374324/
> >
> > Both may slightly change spill weights and then spilling pattern.
> >
> > Stas
> >
> > -----Original Message-----
> > From: Mekhanoshin, Stanislav
> > Sent: Wednesday, September 22, 2021 12:09
> > To: Maxim Kuvyrkov <maxim.kuvyrkov(a)linaro.org>
> > Cc: linaro-toolchain <linaro-toolchain(a)lists.linaro.org>
> > Subject: RE: [TCWG CI] 456.hmmer slowed down by 6% after llvm: Allow rematerialization of virtual reg uses
> >
> > I assume some of the newly rematerialized instructions caused perf drops. Probably some very specific ones. I would appreciate if you could point them to me.
> > In addition I believe I would need to have a linked or optimized bitcode to feed into llc.
> >
> > Stas
> >
> > -----Original Message-----
> > From: Maxim Kuvyrkov <maxim.kuvyrkov(a)linaro.org>
> > Sent: Wednesday, September 22, 2021 12:06
> > To: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin(a)amd.com>
> > Cc: linaro-toolchain <linaro-toolchain(a)lists.linaro.org>
> > Subject: Re: [TCWG CI] 456.hmmer slowed down by 6% after llvm: Allow rematerialization of virtual reg uses
> >
> > [CAUTION: External Email]
> >
> > Hi Stanislav,
> >
> > That's fair; I or someone from Linaro will try to analyze this and follow up here.
> >
> > On a more general note, what info would you like to see in these benchmarking regression reports?
> >
> > Thanks,
> >
> > --
> > Maxim Kuvyrkov
> > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.linar…
> >
> >
> >> On Sep 22, 2021, at 9:40 PM, Mekhanoshin, Stanislav <Stanislav.Mekhanoshin(a)amd.com> wrote:
> >>
> >> [AMD Official Use Only]
> >>
> >> Hm... I'd really like to help, but I do not think I can do anything with megabytes of code in an asm which I do not understand and tons of differences in 48 asm files.
> >> What I can see there is overall less spilling code which was the intent in the first place: hmmer has 4 less spill opcodes overall and sphinx has 27 less of them.
> >> I doubt I could say much more without someone pointing to the actual root cause.
> >>
> >> Stas
> >>
> >> -----Original Message-----
> >> From: Maxim Kuvyrkov <maxim.kuvyrkov(a)linaro.org>
> >> Sent: Wednesday, September 22, 2021 5:16
> >> To: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin(a)amd.com>
> >> Cc: linaro-toolchain <linaro-toolchain(a)lists.linaro.org>
> >> Subject: Re: [TCWG CI] 456.hmmer slowed down by 6% after llvm: Allow rematerialization of virtual reg uses
> >>
> >> [CAUTION: External Email]
> >>
> >> Hi Stanislav,
> >>
> >> Attached is a tarball with -save-temps output (pre-processed source and generated assembly) for first-bad run (your commit) and last-good run (immediate parent of your commit).
> >>
> >> --
> >> Maxim Kuvyrkov
> >> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.linar…
> >>
> >>> On 20 Sep 2021, at 23:15, Mekhanoshin, Stanislav <Stanislav.Mekhanoshin(a)amd.com> wrote:
> >>>
> >>> [AMD Official Use Only]
> >>>
> >>> Thanks for letting me know. Some regressions are inevitable, however do you happen to have any analysis and dumps? I myself do not understand ARM ISA well...
> >>>
> >>> Stas
> >>>
> >>> -----Original Message-----
> >>> From: Maxim Kuvyrkov <maxim.kuvyrkov(a)linaro.org>
> >>> Sent: Wednesday, September 15, 2021 5:52
> >>> To: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin(a)amd.com>
> >>> Cc: linaro-toolchain <linaro-toolchain(a)lists.linaro.org>
> >>> Subject: Re: [TCWG CI] 456.hmmer slowed down by 6% after llvm: Allow rematerialization of virtual reg uses
> >>>
> >>> [CAUTION: External Email]
> >>>
> >>> Hi Stanislav,
> >>>
> >>> FYI, your patch seems to be slowing down two of SPEC CPU2006 tests on 32-bit ARM at -O2 and -O3 optimization levels.
> >>>
> >>> --
> >>> Maxim Kuvyrkov
> >>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.linar…
> >>>
> >>
> >>
> >
> <image001.png>
I’m trying to import these files into DS-5. After unzipping the files, they still do not show up in the DS-5 search. Below is the error that I keep receiving:
Sent from my iPhone
After gcc commit 4a960d548b7d7d942f316c5295f6d849b74214f5
Author: Aldy Hernandez <aldyh(a)redhat.com>
Avoid invalid loop transformations in jump threading registry.
the following benchmarks grew in size by more than 1%:
- 450.soplex grew in size by 2% from 207260 to 211436 bytes
The reproducer instructions below can be used to re-build both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: GCC + Glibc + GNU Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -Os -flto
- Hardware: APM Mustang 8x X-Gene1
This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org. Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_gnu_apm/gnu-master-aarch64-spec2k6-Os_LTO
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa…
Reproduce builds:
<cut>
mkdir investigate-gcc-4a960d548b7d7d942f316c5295f6d849b74214f5
cd investigate-gcc-4a960d548b7d7d942f316c5295f6d849b74214f5
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_apm-gnu-master-aa… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
cd gcc
# Reproduce first_bad build
git checkout --detach 4a960d548b7d7d942f316c5295f6d849b74214f5
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 29c92857039d0a105281be61c10c9e851aaeea4a
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 4a960d548b7d7d942f316c5295f6d849b74214f5
Author: Aldy Hernandez <aldyh(a)redhat.com>
Date: Thu Sep 23 10:59:24 2021 +0200
Avoid invalid loop transformations in jump threading registry.
My upcoming improvements to the forward jump threader make it thread
more aggressively. In investigating some "regressions", I noticed
that it has always allowed threading through empty latches and across
loop boundaries. As we have discussed recently, this should be avoided
until after loop optimizations have run their course.
Note that this wasn't much of a problem before because DOM/VRP
couldn't find these opportunities, but with a smarter solver, we trip
over them more easily.
Because the forward threader doesn't have an independent localized cost
model like the new threader (profitable_path_p), it is difficult to
catch these things at discovery. However, we can catch them at
registration time, with the added benefit that all the threaders
(forward and backward) can share the handcuffs.
This patch is an adaptation of what we do in the backward threader, but
it is not meant to catch everything we do there, as some of the
restrictions there are due to limitations of the different block
copiers (for example, the generic copier does not re-use existing
threading paths).
We could ideally remove the now redundant bits in profitable_path_p, but
I would prefer not to for two reasons. First, the backward threader uses
profitable_path_p as it discovers paths to avoid discovering paths in
unprofitable directions. Second, I would like to merge all the forward
cost restrictions into the profitability class in the backward threader,
not the other way around. Alas, that reshuffling will have to wait for
the next release.
As usual, there are quite a few tests that needed adjustments. It seems
we were quite happily threading improper scenarios. With most of them,
as can be seen in pr77445-2.c, we're merely shifting the threading to
after loop optimizations.
Tested on x86-64 Linux.
gcc/ChangeLog:
* tree-ssa-threadupdate.c (jt_path_registry::cancel_invalid_paths):
New.
(jt_path_registry::register_jump_thread): Call
cancel_invalid_paths.
* tree-ssa-threadupdate.h (class jt_path_registry): Add
cancel_invalid_paths.
gcc/testsuite/ChangeLog:
* gcc.dg/tree-ssa/20030714-2.c: Adjust.
* gcc.dg/tree-ssa/pr66752-3.c: Adjust.
* gcc.dg/tree-ssa/pr77445-2.c: Adjust.
* gcc.dg/tree-ssa/ssa-dom-thread-18.c: Adjust.
* gcc.dg/tree-ssa/ssa-dom-thread-7.c: Adjust.
* gcc.dg/vect/bb-slp-16.c: Adjust.
---
gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c | 7 ++-
gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c | 19 ++++---
gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c | 4 +-
gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c | 4 +-
gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c | 4 +-
gcc/testsuite/gcc.dg/vect/bb-slp-16.c | 7 ---
gcc/tree-ssa-threadupdate.c | 67 ++++++++++++++++++-----
gcc/tree-ssa-threadupdate.h | 1 +
8 files changed, 78 insertions(+), 35 deletions(-)
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c b/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c
index eb663f2ff5b..9585ff11307 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/20030714-2.c
@@ -32,7 +32,8 @@ get_alias_set (t)
}
}
-/* There should be exactly three IF conditionals if we thread jumps
- properly. */
-/* { dg-final { scan-tree-dump-times "if " 3 "dom2"} } */
+/* There should be exactly 4 IF conditionals if we thread jumps
+ properly. There used to be 3, but one thread was crossing
+ loops. */
+/* { dg-final { scan-tree-dump-times "if " 4 "dom2"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c b/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
index e1464e21170..922a331b217 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c
@@ -1,5 +1,5 @@
/* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-dce2" } */
+/* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-thread3" } */
extern int status, pt;
extern int count;
@@ -32,10 +32,15 @@ foo (int N, int c, int b, int *a)
pt--;
}
-/* There are 4 jump threading opportunities, all of which will be
- realized, which will eliminate testing of FLAG, completely. */
-/* { dg-final { scan-tree-dump-times "Registering jump" 4 "thread1"} } */
+/* There are 2 jump threading opportunities (which don't cross loops),
+ all of which will be realized, which will eliminate testing of
+ FLAG, completely. */
+/* { dg-final { scan-tree-dump-times "Registering jump" 2 "thread1"} } */
-/* There should be no assignments or references to FLAG, verify they're
- eliminated as early as possible. */
-/* { dg-final { scan-tree-dump-not "if .flag" "dce2"} } */
+/* We used to remove references to FLAG by DCE2, but this was
+ depending on early threaders threading through loop boundaries
+ (which we shouldn't do). However, the late threading passes, which
+ run after loop optimizations , can successfully eliminate the
+ references to FLAG. Verify that ther are no references by the late
+ threading passes. */
+/* { dg-final { scan-tree-dump-not "if .flag" "thread3"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c b/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
index f9fc212f49e..01a0f1f197d 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
@@ -123,8 +123,8 @@ enum STATES FMS( u8 **in , u32 *transitions) {
aarch64 has the highest CASE_VALUES_THRESHOLD in GCC. It's high enough
to change decisions in switch expansion which in turn can expose new
jump threading opportunities. Skip the later tests on aarch64. */
-/* { dg-final { scan-tree-dump "Jumps threaded: 1\[1-9\]" "thread1" } } */
-/* { dg-final { scan-tree-dump-times "Invalid sum" 4 "thread1" } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 9" "thread1" } } */
+/* { dg-final { scan-tree-dump-times "Invalid sum" 1 "thread1" } } */
/* { dg-final { scan-tree-dump-not "optimizing for size" "thread1" } } */
/* { dg-final { scan-tree-dump-not "optimizing for size" "thread2" } } */
/* { dg-final { scan-tree-dump-not "optimizing for size" "thread3" { target { ! aarch64*-*-* } } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
index 60d4f76f076..2d78d045516 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-18.c
@@ -21,5 +21,7 @@
condition.
All the cases are picked up by VRP1 as jump threads. */
-/* { dg-final { scan-tree-dump-times "Registering jump" 6 "thread1" } } */
+
+/* There used to be 6 jump threads found by thread1, but they all
+ depended on threading through distinct loops in ethread. */
/* { dg-final { scan-tree-dump-times "Threaded" 2 "vrp1" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
index e3d4b311c03..16abcde5053 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c
@@ -1,8 +1,8 @@
/* { dg-do compile } */
/* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-thread2-stats -fdump-tree-dom2-stats -fdump-tree-thread3-stats -fdump-tree-dom3-stats -fdump-tree-vrp2-stats -fno-guess-branch-probability" } */
-/* { dg-final { scan-tree-dump "Jumps threaded: 18" "thread1" } } */
-/* { dg-final { scan-tree-dump "Jumps threaded: 8" "thread3" { target { ! aarch64*-*-* } } } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 12" "thread1" } } */
+/* { dg-final { scan-tree-dump "Jumps threaded: 5" "thread3" { target { ! aarch64*-*-* } } } } */
/* { dg-final { scan-tree-dump-not "Jumps threaded" "dom2" } } */
/* aarch64 has the highest CASE_VALUES_THRESHOLD in GCC. It's high enough
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
index 664e93e9b60..e68a9b62535 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-16.c
@@ -1,8 +1,5 @@
/* { dg-require-effective-target vect_int } */
-/* See note below as to why we disable threading. */
-/* { dg-additional-options "-fdisable-tree-thread1" } */
-
#include <stdarg.h>
#include "tree-vect.h"
@@ -30,10 +27,6 @@ main1 (int dummy)
*pout++ = *pin++ + a;
*pout++ = *pin++ + a;
*pout++ = *pin++ + a;
- /* In some architectures like ppc64, jump threading may thread
- the iteration where i==0 such that we no longer optimize the
- BB. Another alternative to disable jump threading would be
- to wrap the read from `i' into a function returning i. */
if (arr[i] = i)
a = i;
else
diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c
index baac11280fa..2b9b8f81274 100644
--- a/gcc/tree-ssa-threadupdate.c
+++ b/gcc/tree-ssa-threadupdate.c
@@ -2757,6 +2757,58 @@ fwd_jt_path_registry::update_cfg (bool may_peel_loop_headers)
return retval;
}
+bool
+jt_path_registry::cancel_invalid_paths (vec<jump_thread_edge *> &path)
+{
+ gcc_checking_assert (!path.is_empty ());
+ edge taken_edge = path[path.length () - 1]->e;
+ loop_p loop = taken_edge->src->loop_father;
+ bool seen_latch = false;
+ bool path_crosses_loops = false;
+
+ for (unsigned int i = 0; i < path.length (); i++)
+ {
+ edge e = path[i]->e;
+
+ if (e == NULL)
+ {
+ // NULL outgoing edges on a path can happen for jumping to a
+ // constant address.
+ cancel_thread (&path, "Found NULL edge in jump threading path");
+ return true;
+ }
+
+ if (loop->latch == e->src || loop->latch == e->dest)
+ seen_latch = true;
+
+ // The first entry represents the block with an outgoing edge
+ // that we will redirect to the jump threading path. Thus we
+ // don't care about that block's loop father.
+ if ((i > 0 && e->src->loop_father != loop)
+ || e->dest->loop_father != loop)
+ path_crosses_loops = true;
+
+ if (flag_checking && !m_backedge_threads)
+ gcc_assert ((path[i]->e->flags & EDGE_DFS_BACK) == 0);
+ }
+
+ if (cfun->curr_properties & PROP_loop_opts_done)
+ return false;
+
+ if (seen_latch && empty_block_p (loop->latch))
+ {
+ cancel_thread (&path, "Threading through latch before loop opts "
+ "would create non-empty latch");
+ return true;
+ }
+ if (path_crosses_loops)
+ {
+ cancel_thread (&path, "Path crosses loops");
+ return true;
+ }
+ return false;
+}
+
/* Register a jump threading opportunity. We queue up all the jump
threading opportunities discovered by a pass and update the CFG
and SSA form all at once.
@@ -2776,19 +2828,8 @@ jt_path_registry::register_jump_thread (vec<jump_thread_edge *> *path)
return false;
}
- /* First make sure there are no NULL outgoing edges on the jump threading
- path. That can happen for jumping to a constant address. */
- for (unsigned int i = 0; i < path->length (); i++)
- {
- if ((*path)[i]->e == NULL)
- {
- cancel_thread (path, "Found NULL edge in jump threading path");
- return false;
- }
-
- if (flag_checking && !m_backedge_threads)
- gcc_assert (((*path)[i]->e->flags & EDGE_DFS_BACK) == 0);
- }
+ if (cancel_invalid_paths (*path))
+ return false;
if (dump_file && (dump_flags & TDF_DETAILS))
dump_jump_thread_path (dump_file, *path, true);
diff --git a/gcc/tree-ssa-threadupdate.h b/gcc/tree-ssa-threadupdate.h
index 8b48a671212..d68795c9f27 100644
--- a/gcc/tree-ssa-threadupdate.h
+++ b/gcc/tree-ssa-threadupdate.h
@@ -75,6 +75,7 @@ protected:
unsigned long m_num_threaded_edges;
private:
virtual bool update_cfg (bool peel_loop_headers) = 0;
+ bool cancel_invalid_paths (vec<jump_thread_edge *> &path);
jump_thread_path_allocator m_allocator;
// True if threading through back edges is allowed. This is only
// allowed in the generic copier in the backward threader.
</cut>
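To make the threading restriction above concrete, here is a hypothetical C sketch in the spirit of the pr66752-3.c testcase adjusted by the commit (the function and names are illustrative, not from the commit). Before this patch, the early threaders could duplicate blocks across the loop boundary to eliminate the test of flag; the patch cancels such cross-loop and empty-latch paths until loop optimizations have finished, leaving the cleanup to the late threading passes.
<cut>
/* Hypothetical example: after the first iteration, "flag" is known
   to be zero, so a jump-threading pass can try to bypass the test
   by copying blocks -- but doing so before loop optimizations would
   thread across the loop boundary, which the patch now cancels.  */
int
count_while_flagged (const int *a, int n, int flag)
{
  int s = 0;
  for (int i = 0; i < n; i++)
    {
      if (flag)
	s += a[i];
      flag = 0;
    }
  return s;
}
</cut>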
Progress:
* UM-2 [QEMU upstream maintainership]
+ Still looking at the mess that is non-unique bus names. Worked
through exactly which devices and machine types are affected for
the i2c bus.
+ Sent a patchset which tries to make the "create a bus" function
names a bit more regular across different bus types.
* QEMU-406 [QEMU support for MVE (M-profile Vector Extension; Helium)]
+ Luis figured out why GDB was crashing when fed the MVE XML by
QEMU's gdbstub; this was a combination of QEMU giving GDB some
non-standard extra registers in its "vfp" XML feature and GDB
not being robust enough against those unexpected extras. Sent
out a patchset which cleans up QEMU's XML in this area and also
implements the extra XML for MVE. (This will only go into QEMU
once the GDB patches have landed and the XML format is nailed down.)
-- PMM
After gcc commit f92901a508305f291fcf2acae0825379477724de
Author: Richard Biener <rguenther(a)suse.de>
tree-optimization/65206 - dependence analysis on mixed pointer/array
the following benchmarks slowed down by more than 2%:
- 482.sphinx3 slowed down by 4% from 20816 to 21661 perf samples
The reproducer instructions below can be used to re-build both the "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-master-aa…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-master-aa…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-master-aa…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: GCC + Glibc + GNU Linker
- Version: all components were built from their tip of trunk
- Target: aarch64-linux-gnu
- Compiler flags: -O3
- Hardware: NVidia TX1 4x Cortex-A57
This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org. Our improvement plans include adding support for SPEC CPU2017 benchmarks and providing "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_gnu_tx1/gnu-master-aarch64-spec2k6-O3
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-master-aa…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-master-aa…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-master-aa…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-master-aa…
Reproduce builds:
<cut>
mkdir investigate-gcc-f92901a508305f291fcf2acae0825379477724de
cd investigate-gcc-f92901a508305f291fcf2acae0825379477724de
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-master-aa… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-master-aa… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-master-aa… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
cd gcc
# Reproduce first_bad build
git checkout --detach f92901a508305f291fcf2acae0825379477724de
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach abdf63d782cba82b5ecf264248518cbb065650ed
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit f92901a508305f291fcf2acae0825379477724de
Author: Richard Biener <rguenther(a)suse.de>
Date: Wed Sep 8 14:42:31 2021 +0200
tree-optimization/65206 - dependence analysis on mixed pointer/array
This adds the capability to analyze the dependence of mixed
pointer/array accesses. The example is from where using a masked
load/store creates the pointer-based access when an otherwise
unconditional access is array based. Other examples would include
accesses to an array mixed with accesses from inlined helpers
that work on pointers.
The idea is quite simple and old - analyze the data-ref indices
as if the reference was pointer-based. The following change does
this by changing dr_analyze_indices to work on the indices
sub-structure and storing an alternate indices substructure in
each data reference. That alternate set of indices is analyzed
lazily by initialize_data_dependence_relation when it fails to
match-up the main set of indices of two data references.
initialize_data_dependence_relation is refactored into a head
and a tail worker and changed to work on one of the indices
structures and thus away from using DR_* access macros which
continue to reference the main indices substructure.
There are quite some vectorization and loop distribution opportunities
unleashed in SPEC CPU 2017, notably 520.omnetpp_r, 548.exchange2_r,
510.parest_r, 511.povray_r, 521.wrf_r, 526.blender_r, 527.cam4_r and
544.nab_r see amendments in what they report with -fopt-info-loop while
the rest of the specrate set sees no changes there. Measuring runtime
for the set where changes were reported reveals nothing off-noise
besides 511.povray_r which seems to regress slightly for me
(on a Zen2 machine with -Ofast -march=native).
2021-09-08 Richard Biener <rguenther(a)suse.de>
PR tree-optimization/65206
* tree-data-ref.h (struct data_reference): Add alt_indices,
order it last.
* tree-data-ref.c (free_data_ref): Release alt_indices.
(dr_analyze_indices): Work on struct indices and get DR_REF as tree.
(create_data_ref): Adjust.
(initialize_data_dependence_relation): Split into head
and tail. When the base objects fail to match up try
again with pointer-based analysis of indices.
* tree-vectorizer.c (vec_info_shared::check_datarefs): Do
not compare the lazily computed alternate set of indices.
* gcc.dg/torture/20210916.c: New testcase.
* gcc.dg/vect/pr65206.c: Likewise.
---
gcc/testsuite/gcc.dg/torture/20210916.c | 20 ++++
gcc/testsuite/gcc.dg/vect/pr65206.c | 22 ++++
gcc/tree-data-ref.c | 174 +++++++++++++++++++++-----------
gcc/tree-data-ref.h | 9 +-
gcc/tree-vectorizer.c | 3 +-
5 files changed, 168 insertions(+), 60 deletions(-)
diff --git a/gcc/testsuite/gcc.dg/torture/20210916.c b/gcc/testsuite/gcc.dg/torture/20210916.c
new file mode 100644
index 00000000000..0ea6d45e463
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/20210916.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+
+typedef union tree_node *tree;
+struct tree_base {
+ unsigned : 1;
+ unsigned lang_flag_2 : 1;
+};
+struct tree_type {
+ tree main_variant;
+};
+union tree_node {
+ struct tree_base base;
+ struct tree_type type;
+};
+tree finish_struct_t, finish_struct_x;
+void finish_struct()
+{
+ for (; finish_struct_t->type.main_variant;)
+ finish_struct_x->base.lang_flag_2 = 0;
+}
diff --git a/gcc/testsuite/gcc.dg/vect/pr65206.c b/gcc/testsuite/gcc.dg/vect/pr65206.c
new file mode 100644
index 00000000000..3b6262622c0
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr65206.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_double } */
+/* { dg-additional-options "-fno-trapping-math -fno-allow-store-data-races" } */
+/* { dg-additional-options "-mavx" { target avx } } */
+
+#define N 1024
+
+double a[N], b[N];
+
+void foo ()
+{
+ for (int i = 0; i < N; ++i)
+ if (b[i] < 3.)
+ a[i] += b[i];
+}
+
+/* We get a .MASK_STORE because while the load of a[i] does not trap
+ the store would introduce store data races. Make sure we still
+ can handle the data dependence with zero distance. */
+
+/* { dg-final { scan-tree-dump-not "versioning for alias required" "vect" { target { vect_masked_store || avx } } } } */
+/* { dg-final { scan-tree-dump "vectorized 1 loops in function" "vect" { target { vect_masked_store || avx } } } } */
diff --git a/gcc/tree-data-ref.c b/gcc/tree-data-ref.c
index e061baa7c20..18307a554fc 100644
--- a/gcc/tree-data-ref.c
+++ b/gcc/tree-data-ref.c
@@ -99,6 +99,7 @@ along with GCC; see the file COPYING3. If not see
#include "internal-fn.h"
#include "vr-values.h"
#include "range-op.h"
+#include "tree-ssa-loop-ivopts.h"
static struct datadep_stats
{
@@ -1300,22 +1301,18 @@ base_supports_access_fn_components_p (tree base)
DR, analyzed in LOOP and instantiated before NEST. */
static void
-dr_analyze_indices (struct data_reference *dr, edge nest, loop_p loop)
+dr_analyze_indices (struct indices *dri, tree ref, edge nest, loop_p loop)
{
- vec<tree> access_fns = vNULL;
- tree ref, op;
- tree base, off, access_fn;
-
/* If analyzing a basic-block there are no indices to analyze
and thus no access functions. */
if (!nest)
{
- DR_BASE_OBJECT (dr) = DR_REF (dr);
- DR_ACCESS_FNS (dr).create (0);
+ dri->base_object = ref;
+ dri->access_fns.create (0);
return;
}
- ref = DR_REF (dr);
+ vec<tree> access_fns = vNULL;
/* REALPART_EXPR and IMAGPART_EXPR can be handled like accesses
into a two element array with a constant index. The base is
@@ -1338,8 +1335,8 @@ dr_analyze_indices (struct data_reference *dr, edge nest, loop_p loop)
{
if (TREE_CODE (ref) == ARRAY_REF)
{
- op = TREE_OPERAND (ref, 1);
- access_fn = analyze_scalar_evolution (loop, op);
+ tree op = TREE_OPERAND (ref, 1);
+ tree access_fn = analyze_scalar_evolution (loop, op);
access_fn = instantiate_scev (nest, loop, access_fn);
access_fns.safe_push (access_fn);
}
@@ -1370,16 +1367,16 @@ dr_analyze_indices (struct data_reference *dr, edge nest, loop_p loop)
analyzed nest, add it as an additional independent access-function. */
if (TREE_CODE (ref) == MEM_REF)
{
- op = TREE_OPERAND (ref, 0);
- access_fn = analyze_scalar_evolution (loop, op);
+ tree op = TREE_OPERAND (ref, 0);
+ tree access_fn = analyze_scalar_evolution (loop, op);
access_fn = instantiate_scev (nest, loop, access_fn);
if (TREE_CODE (access_fn) == POLYNOMIAL_CHREC)
{
- tree orig_type;
tree memoff = TREE_OPERAND (ref, 1);
- base = initial_condition (access_fn);
- orig_type = TREE_TYPE (base);
+ tree base = initial_condition (access_fn);
+ tree orig_type = TREE_TYPE (base);
STRIP_USELESS_TYPE_CONVERSION (base);
+ tree off;
split_constant_offset (base, &base, &off);
STRIP_USELESS_TYPE_CONVERSION (base);
/* Fold the MEM_REF offset into the evolutions initial
@@ -1424,7 +1421,7 @@ dr_analyze_indices (struct data_reference *dr, edge nest, loop_p loop)
base, memoff);
MR_DEPENDENCE_CLIQUE (ref) = MR_DEPENDENCE_CLIQUE (old);
MR_DEPENDENCE_BASE (ref) = MR_DEPENDENCE_BASE (old);
- DR_UNCONSTRAINED_BASE (dr) = true;
+ dri->unconstrained_base = true;
access_fns.safe_push (access_fn);
}
}
@@ -1436,8 +1433,8 @@ dr_analyze_indices (struct data_reference *dr, edge nest, loop_p loop)
build_int_cst (reference_alias_ptr_type (ref), 0));
}
- DR_BASE_OBJECT (dr) = ref;
- DR_ACCESS_FNS (dr) = access_fns;
+ dri->base_object = ref;
+ dri->access_fns = access_fns;
}
/* Extracts the alias analysis information from the memory reference DR. */
@@ -1463,6 +1460,8 @@ void
free_data_ref (data_reference_p dr)
{
DR_ACCESS_FNS (dr).release ();
+ if (dr->alt_indices.base_object)
+ dr->alt_indices.access_fns.release ();
free (dr);
}
@@ -1497,7 +1496,7 @@ create_data_ref (edge nest, loop_p loop, tree memref, gimple *stmt,
dr_analyze_innermost (&DR_INNERMOST (dr), memref,
nest != NULL ? loop : NULL, stmt);
- dr_analyze_indices (dr, nest, loop);
+ dr_analyze_indices (&dr->indices, DR_REF (dr), nest, loop);
dr_analyze_alias (dr);
if (dump_file && (dump_flags & TDF_DETAILS))
@@ -3066,41 +3065,30 @@ access_fn_components_comparable_p (tree ref_a, tree ref_b)
TREE_TYPE (TREE_OPERAND (ref_b, 0)));
}
-/* Initialize a data dependence relation between data accesses A and
- B. NB_LOOPS is the number of loops surrounding the references: the
- size of the classic distance/direction vectors. */
+/* Initialize a data dependence relation RES in LOOP_NEST. USE_ALT_INDICES
+ is true when the main indices of A and B were not comparable so we try again
+ with alternate indices computed on an indirect reference. */
struct data_dependence_relation *
-initialize_data_dependence_relation (struct data_reference *a,
- struct data_reference *b,
- vec<loop_p> loop_nest)
+initialize_data_dependence_relation (struct data_dependence_relation *res,
+ vec<loop_p> loop_nest,
+ bool use_alt_indices)
{
- struct data_dependence_relation *res;
+ struct data_reference *a = DDR_A (res);
+ struct data_reference *b = DDR_B (res);
unsigned int i;
- res = XCNEW (struct data_dependence_relation);
- DDR_A (res) = a;
- DDR_B (res) = b;
- DDR_LOOP_NEST (res).create (0);
- DDR_SUBSCRIPTS (res).create (0);
- DDR_DIR_VECTS (res).create (0);
- DDR_DIST_VECTS (res).create (0);
-
- if (a == NULL || b == NULL)
+ struct indices *indices_a = &a->indices;
+ struct indices *indices_b = &b->indices;
+ if (use_alt_indices)
{
- DDR_ARE_DEPENDENT (res) = chrec_dont_know;
- return res;
+ if (TREE_CODE (DR_REF (a)) != MEM_REF)
+ indices_a = &a->alt_indices;
+ if (TREE_CODE (DR_REF (b)) != MEM_REF)
+ indices_b = &b->alt_indices;
}
-
- /* If the data references do not alias, then they are independent. */
- if (!dr_may_alias_p (a, b, loop_nest.exists () ? loop_nest[0] : NULL))
- {
- DDR_ARE_DEPENDENT (res) = chrec_known;
- return res;
- }
-
- unsigned int num_dimensions_a = DR_NUM_DIMENSIONS (a);
- unsigned int num_dimensions_b = DR_NUM_DIMENSIONS (b);
+ unsigned int num_dimensions_a = indices_a->access_fns.length ();
+ unsigned int num_dimensions_b = indices_b->access_fns.length ();
if (num_dimensions_a == 0 || num_dimensions_b == 0)
{
DDR_ARE_DEPENDENT (res) = chrec_dont_know;
@@ -3125,9 +3113,9 @@ initialize_data_dependence_relation (struct data_reference *a,
the a and b accesses have a single ARRAY_REF component reference [0]
but have two subscripts. */
- if (DR_UNCONSTRAINED_BASE (a))
+ if (indices_a->unconstrained_base)
num_dimensions_a -= 1;
- if (DR_UNCONSTRAINED_BASE (b))
+ if (indices_b->unconstrained_base)
num_dimensions_b -= 1;
/* These structures describe sequences of component references in
@@ -3210,6 +3198,10 @@ initialize_data_dependence_relation (struct data_reference *a,
B: [3, 4] (i.e. s.e) */
while (index_a < num_dimensions_a && index_b < num_dimensions_b)
{
+ /* The alternate indices form always has a single dimension
+ with unconstrained base. */
+ gcc_assert (!use_alt_indices);
+
/* REF_A and REF_B must be one of the component access types
allowed by dr_analyze_indices. */
gcc_checking_assert (access_fn_component_p (ref_a));
@@ -3280,11 +3272,12 @@ initialize_data_dependence_relation (struct data_reference *a,
/* See whether FULL_SEQ ends at the base and whether the two bases
are equal. We do not care about TBAA or alignment info so we can
use OEP_ADDRESS_OF to avoid false negatives. */
- tree base_a = DR_BASE_OBJECT (a);
- tree base_b = DR_BASE_OBJECT (b);
+ tree base_a = indices_a->base_object;
+ tree base_b = indices_b->base_object;
bool same_base_p = (full_seq.start_a + full_seq.length == num_dimensions_a
&& full_seq.start_b + full_seq.length == num_dimensions_b
- && DR_UNCONSTRAINED_BASE (a) == DR_UNCONSTRAINED_BASE (b)
+ && (indices_a->unconstrained_base
+ == indices_b->unconstrained_base)
&& operand_equal_p (base_a, base_b, OEP_ADDRESS_OF)
&& (types_compatible_p (TREE_TYPE (base_a),
TREE_TYPE (base_b))
@@ -3323,7 +3316,7 @@ initialize_data_dependence_relation (struct data_reference *a,
both lvalues are distinct from the object's declared type. */
if (same_base_p)
{
- if (DR_UNCONSTRAINED_BASE (a))
+ if (indices_a->unconstrained_base)
full_seq.length += 1;
}
else
@@ -3332,8 +3325,41 @@ initialize_data_dependence_relation (struct data_reference *a,
/* Punt if we didn't find a suitable sequence. */
if (full_seq.length == 0)
{
- DDR_ARE_DEPENDENT (res) = chrec_dont_know;
- return res;
+ if (use_alt_indices
+ || (TREE_CODE (DR_REF (a)) == MEM_REF
+ && TREE_CODE (DR_REF (b)) == MEM_REF)
+ || may_be_nonaddressable_p (DR_REF (a))
+ || may_be_nonaddressable_p (DR_REF (b)))
+ {
+ /* Fully exhausted possibilities. */
+ DDR_ARE_DEPENDENT (res) = chrec_dont_know;
+ return res;
+ }
+
+ /* Try evaluating both DRs as dereferences of pointers. */
+ if (!a->alt_indices.base_object
+ && TREE_CODE (DR_REF (a)) != MEM_REF)
+ {
+ tree alt_ref = build2 (MEM_REF, TREE_TYPE (DR_REF (a)),
+ build1 (ADDR_EXPR, ptr_type_node, DR_REF (a)),
+ build_int_cst
+ (reference_alias_ptr_type (DR_REF (a)), 0));
+ dr_analyze_indices (&a->alt_indices, alt_ref,
+ loop_preheader_edge (loop_nest[0]),
+ loop_containing_stmt (DR_STMT (a)));
+ }
+ if (!b->alt_indices.base_object
+ && TREE_CODE (DR_REF (b)) != MEM_REF)
+ {
+ tree alt_ref = build2 (MEM_REF, TREE_TYPE (DR_REF (b)),
+ build1 (ADDR_EXPR, ptr_type_node, DR_REF (b)),
+ build_int_cst
+ (reference_alias_ptr_type (DR_REF (b)), 0));
+ dr_analyze_indices (&b->alt_indices, alt_ref,
+ loop_preheader_edge (loop_nest[0]),
+ loop_containing_stmt (DR_STMT (b)));
+ }
+ return initialize_data_dependence_relation (res, loop_nest, true);
}
if (!same_base_p)
@@ -3381,8 +3407,8 @@ initialize_data_dependence_relation (struct data_reference *a,
struct subscript *subscript;
subscript = XNEW (struct subscript);
- SUB_ACCESS_FN (subscript, 0) = DR_ACCESS_FN (a, full_seq.start_a + i);
- SUB_ACCESS_FN (subscript, 1) = DR_ACCESS_FN (b, full_seq.start_b + i);
+ SUB_ACCESS_FN (subscript, 0) = indices_a->access_fns[full_seq.start_a + i];
+ SUB_ACCESS_FN (subscript, 1) = indices_b->access_fns[full_seq.start_b + i];
SUB_CONFLICTS_IN_A (subscript) = conflict_fn_not_known ();
SUB_CONFLICTS_IN_B (subscript) = conflict_fn_not_known ();
SUB_LAST_CONFLICT (subscript) = chrec_dont_know;
@@ -3393,6 +3419,40 @@ initialize_data_dependence_relation (struct data_reference *a,
return res;
}
+/* Initialize a data dependence relation between data accesses A and
+ B. NB_LOOPS is the number of loops surrounding the references: the
+ size of the classic distance/direction vectors. */
+
+struct data_dependence_relation *
+initialize_data_dependence_relation (struct data_reference *a,
+ struct data_reference *b,
+ vec<loop_p> loop_nest)
+{
+ data_dependence_relation *res = XCNEW (struct data_dependence_relation);
+ DDR_A (res) = a;
+ DDR_B (res) = b;
+ DDR_LOOP_NEST (res).create (0);
+ DDR_SUBSCRIPTS (res).create (0);
+ DDR_DIR_VECTS (res).create (0);
+ DDR_DIST_VECTS (res).create (0);
+
+ if (a == NULL || b == NULL)
+ {
+ DDR_ARE_DEPENDENT (res) = chrec_dont_know;
+ return res;
+ }
+
+ /* If the data references do not alias, then they are independent. */
+ if (!dr_may_alias_p (a, b, loop_nest.exists () ? loop_nest[0] : NULL))
+ {
+ DDR_ARE_DEPENDENT (res) = chrec_known;
+ return res;
+ }
+
+ return initialize_data_dependence_relation (res, loop_nest, false);
+}
+
+
/* Frees memory used by the conflict function F. */
static void
diff --git a/gcc/tree-data-ref.h b/gcc/tree-data-ref.h
index 685f33d85ae..74f579c9f3f 100644
--- a/gcc/tree-data-ref.h
+++ b/gcc/tree-data-ref.h
@@ -166,14 +166,19 @@ struct data_reference
and runs to completion. */
bool is_conditional_in_stmt;
+ /* Alias information for the data reference. */
+ struct dr_alias alias;
+
/* Behavior of the memory reference in the innermost loop. */
struct innermost_loop_behavior innermost;
/* Subscripts of this data reference. */
struct indices indices;
- /* Alias information for the data reference. */
- struct dr_alias alias;
+ /* Alternate subscripts initialized lazily and used by data-dependence
+ analysis only when the main indices of two DRs are not comparable.
+ Keep last to keep vec_info_shared::check_datarefs happy. */
+ struct indices alt_indices;
};
#define DR_STMT(DR) (DR)->stmt
diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c
index 3aa3e2a6783..20daa31187d 100644
--- a/gcc/tree-vectorizer.c
+++ b/gcc/tree-vectorizer.c
@@ -507,7 +507,8 @@ vec_info_shared::check_datarefs ()
return;
gcc_assert (datarefs.length () == datarefs_copy.length ());
for (unsigned i = 0; i < datarefs.length (); ++i)
- if (memcmp (&datarefs_copy[i], datarefs[i], sizeof (data_reference)) != 0)
+ if (memcmp (&datarefs_copy[i], datarefs[i],
+ offsetof (data_reference, alt_indices)) != 0)
gcc_unreachable ();
}
</cut>
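For readers skimming the tree-data-ref.c change above: the new use_alt_indices path kicks in when the same object is reached both directly and through a pointer, so the two data references' main indices are not comparable. Below is a hypothetical sketch (invented names, not the patch's testcase) of the kind of loop this helps with, assuming GCC's usual lowering of the pointer access to a MEM_REF:
<cut>
/* Hypothetical example. 'a' is accessed once as an ARRAY_REF (a[i])
   and once through 'p', which forward propagation may turn into a
   MEM_REF based on &a. The two base objects then compare unequal, so
   initialize_data_dependence_relation used to give up with
   chrec_dont_know. With the patch, the ARRAY_REF side is lazily
   re-analyzed as MEM[&a[i]] (the new alt_indices form) and the
   comparison is retried, letting the dependence be computed.  */
extern int a[1024];

void
f (int n)
{
  int *p = a;
  for (int i = 0; i < n; ++i)
    {
      a[i] = 1;    /* ARRAY_REF of the declaration 'a'.  */
      p[i] += 2;   /* MEM_REF based on the address of 'a'.  */
    }
}
</cut>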
After llvm commit 669ddd1e9b1226432b003dbba05b99f8e992285b
Author: Arthur Eubanks <aeubanks(a)google.com>
Turn on the new pass manager by default
the following benchmarks grew in size by more than 1%:
- 403.gcc grew in size by 2% from 2586180 to 2648252 bytes
The reproducer instructions below can be used to rebuild both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their latest release branch
- Target: aarch64-linux-gnu
- Compiler flags: -Os -flto
- Hardware: APM Mustang 8x X-Gene1
This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org. We plan to add support for SPEC CPU2017 benchmarks and to provide "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_apm/llvm-release-aarch64-spec2k6-Os_LTO
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Reproduce builds:
<cut>
mkdir investigate-llvm-669ddd1e9b1226432b003dbba05b99f8e992285b
cd investigate-llvm-669ddd1e9b1226432b003dbba05b99f8e992285b
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach 669ddd1e9b1226432b003dbba05b99f8e992285b
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach b15cbaf5a03d0b32dbc32c37766e32ccf66e6c87
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 669ddd1e9b1226432b003dbba05b99f8e992285b
Author: Arthur Eubanks <aeubanks(a)google.com>
Date: Mon Jan 25 11:00:56 2021 -0800
Turn on the new pass manager by default
This turns on the new pass manager by default for the optimization pipeline in
Clang and ThinLTO in various LLD backends. This also makes uses of `opt
-instcombine` use the new pass manager (unless specifically opted out).
This does not affect the backend target-dependent codegen pipeline.
If this causes regressions, you can opt out of the new pass manager
either via the -DENABLE_EXPERIMENTAL_NEW_PASS_MANAGER=OFF CMake flag
while building LLVM, or via various compiler flags, e.g.
-flegacy-pass-manager for Clang or -Wl,--lto-legacy-pass-manager for
ELF LLD. Please file bugs for any regressions.
Major differences:
* The inliner works slightly differently
* -O1 does some amount of inlining
* LCSSA and LoopSimplify are run before all loop passes
* Loop unswitching is implemented slightly differently
* A new SpeculateAroundPHIs pass is added to the pipeline
https://lists.llvm.org/pipermail/llvm-dev/2021-January/148098.html
Reviewed By: asbirlea, ychen, MaskRay, echristo
Differential Revision: https://reviews.llvm.org/D95380
---
llvm/CMakeLists.txt | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/llvm/CMakeLists.txt b/llvm/CMakeLists.txt
index 1affc289e64b..f5298de9f7ca 100644
--- a/llvm/CMakeLists.txt
+++ b/llvm/CMakeLists.txt
@@ -688,8 +688,8 @@ else()
endif()
option(LLVM_ENABLE_PLUGINS "Enable plugin support" ${LLVM_ENABLE_PLUGINS_default})
-set(ENABLE_EXPERIMENTAL_NEW_PASS_MANAGER FALSE CACHE BOOL
- "Enable the experimental new pass manager by default.")
+set(ENABLE_EXPERIMENTAL_NEW_PASS_MANAGER TRUE CACHE BOOL
+ "Enable the new pass manager by default.")
include(HandleLLVMOptions)
</cut>
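The 403.gcc size growth above is consistent with the commit message's note that the inliner works slightly differently under the new pass manager. As a purely illustrative, hypothetical C++ sketch (invented names, nothing from the 403.gcc sources): small functions sitting near an inliner threshold may be inlined by one pass manager and not the other, and at -Os -flto that shifts code size either way.
<cut>
// Hypothetical C++ sketch. Trivial members like these are classic
// borderline inlining candidates; a change in inliner heuristics can
// duplicate their bodies into callers (growing code) or remove the
// call overhead (shrinking it), so overall binary size moves.
struct Counter {
  int value = 0;
  int get() const { return value; }  // borderline inlining candidate
  void bump() { ++value; }           // borderline inlining candidate
};

int sum_n(int n) {
  Counter c;
  for (int i = 0; i < n; ++i)
    c.bump();      // may or may not be inlined depending on heuristics
  return c.get();
}
</cut>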
After gcc commit 3c57e692357c79ee7623dfc1586652aee2aefb8f
Author: Patrick Palka <ppalka(a)redhat.com>
libstdc++: Add floating-point std::to_chars implementation
the following hot functions grew in size by more than 10% (but their benchmarks grew in size by less than 1%):
- 447.dealII:libstdc++.so.6.0.29 grew in size by 12% from 1245370 to 1391240 bytes
The reproducer instructions below can be used to rebuild both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
For your convenience, we have uploaded tarballs with pre-processed source and assembly files at:
- First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
- Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
- Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Configuration:
- Benchmark: SPEC CPU2006
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their latest release branch
- Target: arm-linux-gnueabihf
- Compiler flags: -Os -mthumb
- Hardware: APM Mustang 8x X-Gene1
This benchmarking CI is a work in progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org. We plan to add support for SPEC CPU2017 benchmarks and to provide "perf report/annotate" data behind these reports.
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_apm/llvm-release-arm-spec2k6-Os
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Reproduce builds:
<cut>
mkdir investigate-gcc-3c57e692357c79ee7623dfc1586652aee2aefb8f
cd investigate-gcc-3c57e692357c79ee7623dfc1586652aee2aefb8f
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
cd gcc
# Reproduce first_bad build
git checkout --detach 3c57e692357c79ee7623dfc1586652aee2aefb8f
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 5033506993ef92589373270a8e8dbbf50e3ebef1
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 3c57e692357c79ee7623dfc1586652aee2aefb8f
Author: Patrick Palka <ppalka(a)redhat.com>
Date: Thu Dec 17 23:11:34 2020 -0500
libstdc++: Add floating-point std::to_chars implementation
This implements the floating-point std::to_chars overloads for float,
double and long double. We use the Ryu library to compute the shortest
round-trippable fixed and scientific forms for float, double and long
double. We also use Ryu for performing explicit-precision fixed and
scientific formatting for float and double. For explicit-precision
formatting for long double we fall back to using printf. Hexadecimal
formatting for float, double and long double is implemented from
scratch.
The supported long double binary formats are binary64, binary80 (x86
80-bit extended precision), binary128 and ibm128.
Much of the complexity of the implementation is in computing the exact
output length before handing it off to Ryu (which doesn't do bounds
checking). In some cases it's hard to compute the output length
beforehand, so in these cases we instead compute an upper bound on the
output length and use a sufficiently-sized intermediate buffer only if
necessary.
Another source of complexity is in the general-with-precision formatting
mode, where we need to do zero-trimming of the string returned by Ryu,
and where we also take care to avoid having to format the number through
Ryu a second time when the general formatting mode resolves to fixed
(which we determine by doing a scientific formatting first and
inspecting the scientific exponent). We avoid going through Ryu twice
by instead transforming the scientific form to the corresponding fixed
form via in-place string manipulation.
This implementation is non-conforming in a couple of ways:
1. For the shortest hexadecimal formatting, we currently follow the
Microsoft implementation's decision to be consistent with the
output of printf's '%a' specifier at the expense of sometimes not
printing the shortest representation. For example, the shortest hex
form for the number 1.08p+0 is 2.1p-1, but we output the former
instead of the latter, as does printf.
2. The Ryu routine generic_binary_to_decimal that we use for performing
shortest formatting for large floating point types is implemented
using the __int128 type, but some targets with a large long double
type lack __int128 (e.g. i686), so we can't perform shortest
formatting of long double on such targets through Ryu. As a
temporary stopgap this patch makes the long double to_chars overloads
just dispatch to the double overloads on these targets, which means
we lose precision in the output. (We could potentially fix this by
writing a specialized version of Ryu's generic_binary_to_decimal
routine that uses uint64_t instead of __int128.) [Though I wonder if
there's a better way to work around the lack of __int128 on i686
specifically?]
3. Our shortest formatting for __ibm128 doesn't guarantee the round-trip
property if the difference between the high- and low-order exponent
is large. This is because we treat __ibm128 as if it has a
contiguous 105-bit mantissa by merging the mantissas of the high-
and low-order parts (using code extracted from glibc), so we
potentially lose precision from the low-order part. This seems to be
consistent with how glibc printf formats __ibm128.
libstdc++-v3/ChangeLog:
* config/abi/pre/gnu.ver: Add new exports.
* include/std/charconv (to_chars): Declare the floating-point
overloads for float, double and long double.
* src/c++17/Makefile.am (sources): Add floating_to_chars.cc.
* src/c++17/Makefile.in: Regenerate.
* src/c++17/floating_to_chars.cc: New file.
(to_chars): Define for float, double and long double.
* testsuite/20_util/to_chars/long_double.cc: New test.
---
libstdc++-v3/config/abi/pre/gnu.ver | 7 +
libstdc++-v3/include/std/charconv | 24 +
libstdc++-v3/src/c++17/Makefile.am | 1 +
libstdc++-v3/src/c++17/Makefile.in | 3 +-
libstdc++-v3/src/c++17/floating_to_chars.cc | 1563 ++++++++++++++++++++
.../testsuite/20_util/to_chars/long_double.cc | 199 +++
6 files changed, 1796 insertions(+), 1 deletion(-)
diff --git a/libstdc++-v3/config/abi/pre/gnu.ver b/libstdc++-v3/config/abi/pre/gnu.ver
index 4b4bd8ab6da..05e0a512247 100644
--- a/libstdc++-v3/config/abi/pre/gnu.ver
+++ b/libstdc++-v3/config/abi/pre/gnu.ver
@@ -2393,6 +2393,13 @@ GLIBCXX_3.4.29 {
# std::once_flag::_M_finish(bool)
_ZNSt9once_flag9_M_finishEb;
+ # std::to_chars(char*, char*, [float|double|long double])
+ _ZSt8to_charsPcS_[defg];
+ # std::to_chars(char*, char*, [float|double|long double], chars_format)
+ _ZSt8to_charsPcS_[defg]St12chars_format;
+ # std::to_chars(char*, char*, [float|double|long double], chars_format, int)
+ _ZSt8to_charsPcS_[defg]St12chars_formati;
+
} GLIBCXX_3.4.28;
# Symbols in the support library (libsupc++) have their own tag.
diff --git a/libstdc++-v3/include/std/charconv b/libstdc++-v3/include/std/charconv
index dd1ebdf8322..b57b0a16db2 100644
--- a/libstdc++-v3/include/std/charconv
+++ b/libstdc++-v3/include/std/charconv
@@ -702,6 +702,30 @@ namespace __detail
chars_format __fmt = chars_format::general) noexcept;
#endif
+ // Floating-point std::to_chars
+
+ // Overloads for float.
+ to_chars_result to_chars(char* __first, char* __last, float __value) noexcept;
+ to_chars_result to_chars(char* __first, char* __last, float __value,
+ chars_format __fmt) noexcept;
+ to_chars_result to_chars(char* __first, char* __last, float __value,
+ chars_format __fmt, int __precision) noexcept;
+
+ // Overloads for double.
+ to_chars_result to_chars(char* __first, char* __last, double __value) noexcept;
+ to_chars_result to_chars(char* __first, char* __last, double __value,
+ chars_format __fmt) noexcept;
+ to_chars_result to_chars(char* __first, char* __last, double __value,
+ chars_format __fmt, int __precision) noexcept;
+
+ // Overloads for long double.
+ to_chars_result to_chars(char* __first, char* __last, long double __value)
+ noexcept;
+ to_chars_result to_chars(char* __first, char* __last, long double __value,
+ chars_format __fmt) noexcept;
+ to_chars_result to_chars(char* __first, char* __last, long double __value,
+ chars_format __fmt, int __precision) noexcept;
+
_GLIBCXX_END_NAMESPACE_VERSION
} // namespace std
#endif // C++14
diff --git a/libstdc++-v3/src/c++17/Makefile.am b/libstdc++-v3/src/c++17/Makefile.am
index 37cdb53c076..2ec5ed621ca 100644
--- a/libstdc++-v3/src/c++17/Makefile.am
+++ b/libstdc++-v3/src/c++17/Makefile.am
@@ -51,6 +51,7 @@ endif
sources = \
floating_from_chars.cc \
+ floating_to_chars.cc \
fs_dir.cc \
fs_ops.cc \
fs_path.cc \
diff --git a/libstdc++-v3/src/c++17/Makefile.in b/libstdc++-v3/src/c++17/Makefile.in
index ccae721ab3f..9b36b7a916c 100644
--- a/libstdc++-v3/src/c++17/Makefile.in
+++ b/libstdc++-v3/src/c++17/Makefile.in
@@ -124,7 +124,7 @@ LTLIBRARIES = $(noinst_LTLIBRARIES)
libc__17convenience_la_LIBADD =
@ENABLE_DUAL_ABI_TRUE@am__objects_1 = cow-fs_dir.lo cow-fs_ops.lo \
@ENABLE_DUAL_ABI_TRUE@ cow-fs_path.lo
-am__objects_2 = floating_from_chars.lo fs_dir.lo fs_ops.lo fs_path.lo \
+am__objects_2 = floating_from_chars.lo floating_to_chars.lo fs_dir.lo fs_ops.lo fs_path.lo \
memory_resource.lo $(am__objects_1)
@ENABLE_DUAL_ABI_TRUE@am__objects_3 = cow-string-inst.lo
@ENABLE_EXTERN_TEMPLATE_TRUE@am__objects_4 = ostream-inst.lo \
@@ -440,6 +440,7 @@ headers =
sources = \
floating_from_chars.cc \
+ floating_to_chars.cc \
fs_dir.cc \
fs_ops.cc \
fs_path.cc \
diff --git a/libstdc++-v3/src/c++17/floating_to_chars.cc b/libstdc++-v3/src/c++17/floating_to_chars.cc
new file mode 100644
index 00000000000..dd83f5eea93
--- /dev/null
+++ b/libstdc++-v3/src/c++17/floating_to_chars.cc
@@ -0,0 +1,1563 @@
+// std::to_chars implementation for floating-point types -*- C++ -*-
+
+// Copyright (C) 2020 Free Software Foundation, Inc.
+//
+// This file is part of the GNU ISO C++ Library. This library is free
+// software; you can redistribute it and/or modify it under the
+// terms of the GNU General Public License as published by the
+// Free Software Foundation; either version 3, or (at your option)
+// any later version.
+
+// This library is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+// GNU General Public License for more details.
+
+// Under Section 7 of GPL version 3, you are granted additional
+// permissions described in the GCC Runtime Library Exception, version
+// 3.1, as published by the Free Software Foundation.
+
+// You should have received a copy of the GNU General Public License and
+// a copy of the GCC Runtime Library Exception along with this program;
+// see the files COPYING3 and COPYING.RUNTIME respectively. If not, see
+// <http://www.gnu.org/licenses/>.
+
+// Activate __glibcxx_assert within this file to shake out any bugs.
+#define _GLIBCXX_ASSERTIONS 1
+
+#include <charconv>
+
+#include <bit>
+#include <cfenv>
+#include <cassert>
+#include <cmath>
+#include <cstdio>
+#include <cstring>
+#include <langinfo.h>
+#include <optional>
+#include <string_view>
+#include <type_traits>
+
+// Determine the binary format of 'long double'.
+
+// We support the binary64, float80 (i.e. x86 80-bit extended precision),
+// binary128, and ibm128 formats.
+#define LDK_UNSUPPORTED 0
+#define LDK_BINARY64 1
+#define LDK_FLOAT80 2
+#define LDK_BINARY128 3
+#define LDK_IBM128 4
+
+#if __LDBL_MANT_DIG__ == __DBL_MANT_DIG__
+# define LONG_DOUBLE_KIND LDK_BINARY64
+#elif defined(__SIZEOF_INT128__)
+// The Ryu routines need a 128-bit integer type in order to do shortest
+// formatting of types larger than 64-bit double, so without __int128 we can't
+// support any large long double format. This is the case for e.g. i386.
+# if __LDBL_MANT_DIG__ == 64
+# define LONG_DOUBLE_KIND LDK_FLOAT80
+# elif __LDBL_MANT_DIG__ == 113
+# define LONG_DOUBLE_KIND LDK_BINARY128
+# elif __LDBL_MANT_DIG__ == 106
+# define LONG_DOUBLE_KIND LDK_IBM128
+# endif
+#endif
+#if !defined(LONG_DOUBLE_KIND)
+# define LONG_DOUBLE_KIND LDK_UNSUPPORTED
+#endif
+
+namespace
+{
+ namespace ryu
+ {
+#include "ryu/common.h"
+#include "ryu/digit_table.h"
+#include "ryu/d2s_intrinsics.h"
+#include "ryu/d2s_full_table.h"
+#include "ryu/d2fixed_full_table.h"
+#include "ryu/f2s_intrinsics.h"
+#include "ryu/d2s.c"
+#include "ryu/d2fixed.c"
+#include "ryu/f2s.c"
+
+#ifdef __SIZEOF_INT128__
+ namespace generic128
+ {
+ // Put the generic Ryu bits in their own namespace to avoid name conflicts.
+# include "ryu/generic_128.h"
+# include "ryu/ryu_generic_128.h"
+# include "ryu/generic_128.c"
+ } // namespace generic128
+
+ using generic128::floating_decimal_128;
+ using generic128::generic_binary_to_decimal;
+
+ int
+ to_chars(const floating_decimal_128 v, char* const result)
+ { return generic128::generic_to_chars(v, result); }
+#endif
+ } // namespace ryu
+
+ // A traits class that contains pertinent information about the binary
+ // format of each of the floating-point types we support.
+ template<typename T>
+ struct floating_type_traits
+ { };
+
+ template<>
+ struct floating_type_traits<float>
+ {
+ // We (and Ryu) assume float has the IEEE binary32 format.
+ static_assert(__FLT_MANT_DIG__ == 24);
+ static constexpr int mantissa_bits = 23;
+ static constexpr int exponent_bits = 8;
+ static constexpr bool has_implicit_leading_bit = true;
+ using mantissa_t = uint32_t;
+ using shortest_scientific_t = ryu::floating_decimal_32;
+
+ static constexpr uint64_t pow10_adjustment_tab[]
+ = { 0b0000000000011101011100110101100101101110000000000000000000000000 };
+ };
+
+ template<>
+ struct floating_type_traits<double>
+ {
+ // We (and Ryu) assume double has the IEEE binary64 format.
+ static_assert(__DBL_MANT_DIG__ == 53);
+ static constexpr int mantissa_bits = 52;
+ static constexpr int exponent_bits = 11;
+ static constexpr bool has_implicit_leading_bit = true;
+ using mantissa_t = uint64_t;
+ using shortest_scientific_t = ryu::floating_decimal_64;
+
+ static constexpr uint64_t pow10_adjustment_tab[]
+ = { 0b0000000000000000000000011000110101110111000001100101110000111100,
+ 0b0111100011110101011000011110000000110110010101011000001110011111,
+ 0b0101101100000000011100100100111100110110110100010001010101110000,
+ 0b0011110010111000101111110101100011101100010001010000000101100111,
+ 0b0001010000011001011100100001010000010101101000001101000000000000 };
+ };
+
+#if LONG_DOUBLE_KIND == LDK_BINARY64
+ // When long double is equivalent to double, we just forward the long double
+ // overloads to the double overloads, so we don't need to define a
+ // floating_type_traits<long double> specialization in this case.
+#elif LONG_DOUBLE_KIND == LDK_FLOAT80
+ template<>
+ struct floating_type_traits<long double>
+ {
+ static constexpr int mantissa_bits = 64;
+ static constexpr int exponent_bits = 15;
+ static constexpr bool has_implicit_leading_bit = false;
+ using mantissa_t = uint64_t;
+ using shortest_scientific_t = ryu::floating_decimal_128;
+
+ static constexpr uint64_t pow10_adjustment_tab[]
+ = { 0b0000000000000000000000000000110101011111110100010100110000011101,
+ 0b1001100101001111010011011111101000101111110001011001011101110000,
+ 0b0000101111111011110010001000001010111101011110111111010100011001,
+ 0b0011100000011111001101101011111001111100100010000101001111101001,
+ 0b0100100100000000100111010010101110011000110001101101110011001010,
+ 0b0111100111100010100000010011000010010110101111110101000011110100,
+ 0b1010100111100010011110000011011101101100010110000110101010101010,
+ 0b0000001111001111000000101100111011011000101000110011101100110010,
+ 0b0111000011100100101101010100001101111110101111001000010011111111,
+ 0b0010111000100110100100100010101100111010110001101010010111001000,
+ 0b0000100000010110000011001001000111000001111010100101101000001111,
+ 0b0010101011101000111100001011000010011101000101010010010000101111,
+ 0b1011111011101101110010101011010001111000101000101101011001100011,
+ 0b1010111011011011110111110011001010000010011001110100101101000101,
+ 0b0011000001110110011010010000011100100011001011001100001101010110,
+ 0b0100011111011000111111101000011110000010111110101001000000001001,
+ 0b1110000001110001001101101110011000100000001010000111100010111010,
+ 0b1110001001010011101000111000001000010100110000010110100011110000,
+ 0b0000011010110000110001111000011111000011001101001101001001000110,
+ 0b1010010111001000101001100101010110100100100010010010000101000010,
+ 0b1011001110000111100010100110000011100011111001110111001100000101,
+ 0b0110101001001000010110001000010001010101110101100001111100011001,
+ 0b1111100011110101011110011010101001010010100011000010110001101001,
+ 0b0100000100001000111101011100010011011111011001000000001100011000,
+ 0b1110111111000111100101110111110000000011001110011100011011011001,
+ 0b1100001100100000010001100011011000111011110000110011010101000011,
+ 0b1111111011100111011101001111111000010000001111010111110010000100,
+ 0b1110111001111110101111000101000000001010001110011010001000111010,
+ 0b1000010001011000101111111010110011111101110101101001111000111010,
+ 0b0100000111101001000111011001101000001010111011101001101111000100,
+ 0b0000011100110001000111011100111100110001101111111010110111100000,
+ 0b0000011101011100100110010011110101010100010011110010010111010000,
+ 0b0011011001100111110101111100001001101110101101001110110011110110,
+ 0b1011000101000001110100111001100100111100110011110000000001101000,
+ 0b1011100011110100001001110101010110111001000000001011101001011110,
+ 0b1111001010010010100000010110101010101011101000101000000000001100,
+ 0b1000001111100100111001110101100001010011111111000001000011110000,
+ 0b0001011101001000010000101101111000001110101100110011001100110111,
+ 0b1110011100000010101011011111001010111101111110100000011100000011,
+ 0b1001110110011100101010011110100010110001001110110000101011100110,
+ 0b1001101000100011100111010000011011100001000000110101100100001001,
+ 0b1010111000101000101101010111000010001100001010100011111100000100,
+ 0b0111101000100011000101101011111011100010001101110111001111001011,
+ 0b1110100111010110001110110110000000010110100011110000010001111100,
+ 0b1100010100011010001011001000111001010101011110100101011001000000,
+ 0b0000110001111001100110010110111010101101001101000000000010010101,
+ 0b0001110111101000001111101010110010010000111110111100000111110100,
+ 0b0111110111001001111000110001101101001010101110110101111110000100,
+ 0b0000111110111010101111100010111010011100010110011011011001000001,
+ 0b1010010100100100101110111111111000101100000010111111101101000110,
+ 0b1000100111111101100011001101000110001000000100010101010100001101,
+ 0b1100101010101000111100101100001000110001110010100000000010110101,
+ 0b1010000100111101100100101010010110100010000000110101101110000100,
+ 0b1011111011110001110000100100000000001010111010001101100000100100,
+ 0b0111101101100011001110011100000001000101101101111000100111011111,
+ 0b0100111010010011011001010011110100001100111010010101111111100011,
+ 0b0010001001011000111000001100110111110111110010100011000110110110,
+ 0b0101010110000000010000100000110100111011111101000100000111010010,
+ 0b0110000011011101000001010100110101101110011100110101000000001001,
+ 0b1101100110100000011000001111000100100100110001100110101010101100,
+ 0b0010100101010110010010001010101000011111111111001011001010001111,
+ 0b0111001010001111001100111001010101001000110101000011110000001000,
+ 0b0110010011001001001111110001010010001011010010001101110110110011,
+ 0b0110010100111011000100111000001001101011111001110010111110111111,
+ 0b0101110111001001101100110100101001110010101110011001101110001000,
+ 0b0100110101010111011010001100010111100011010011111001010100111000,
+ 0b0111000110110111011110100100010111000110000110110110110001111110,
+ 0b1000101101010100100100111110100011110110110010011001110011110101,
+ 0b1001101110101001010100111101101011000101000010110101101111110000,
+ 0b0100100101001011011001001011000010001101001010010001010110101000,
+ 0b0010100001001011100110101000010110000111000111000011100101011011,
+ 0b0110111000011001111101101011111010001000000010101000101010011110,
+ 0b1000110110100001111011000001111100001001000000010110010100100100,
+ 0b1001110100011111100111101011010000010101011100101000010010100110,
+ 0b0001010110101110100010101010001110110110100011101010001001111100,
+ 0b1010100101101100000010110011100110100010010000100100001110000100,
+ 0b0001000000010000001010000010100110000001110100111001110111101101,
+ 0b1100000000000000000000000000000000000000000000000000000000000000 };
+ };
+#elif LONG_DOUBLE_KIND == LDK_BINARY128
+ template<>
+ struct floating_type_traits<long double>
+ {
+ static constexpr int mantissa_bits = 112;
+ static constexpr int exponent_bits = 15;
+ static constexpr bool has_implicit_leading_bit = true;
+ using mantissa_t = unsigned __int128;
+ using shortest_scientific_t = ryu::floating_decimal_128;
+
+ static constexpr uint64_t pow10_adjustment_tab[]
+ = { 0b0000000000000000000000000000000000000000000000000100000010000000,
+ 0b1011001111110100000100010101101110011100100110000110010110011000,
+ 0b1010100010001101111111000000001101010010100010010000111011110111,
+ 0b1011111001110001111000011111000010110111000111110100101010100101,
+ 0b0110100110011110011011000011000010011001110001001001010011100011,
+ 0b0000011111110010101111101011101010000110011111100111001110100111,
+ 0b0100010101010110000010111011110100000010011001001010001110111101,
+ 0b1101110111000010001101100000110100000111001001101011000101011011,
+ 0b0100111011101101010000001101011000101100101110010010110000101011,
+ 0b0100000110111000000110101000010011101000110100010110000011101101,
+ 0b1011001101001000100001010001100100001111011101010101110001010110,
+ 0b1000000001000000101001110010110010001111101101010101001100000110,
+ 0b0101110110100110000110000001001010111110001110010000111111010011,
+ 0b1010001111100111000100011100100100111100100101000001011001000111,
+ 0b1010011000011100110101100111001011100101111111100001110100000100,
+ 0b1100011100100010100000110001001010000000100000001001010111011101,
+ 0b0101110000100011001111101101000000100110000010010111010001111010,
+ 0b0100111100011010110111101000100110000111001001101100000001111100,
+ 0b1100100100111110101011000100000101011010110111000111110100110101,
+ 0b0110010000010111010100110011000000111010000010111011010110000100,
+ 0b0101001001010010110111010111000101011100000111100111000001110010,
+ 0b1101111111001011101010110001000111011010111101001011010110100100,
+ 0b0001000100110000011111101011001101110010110110010000000011100100,
+ 0b0001000000000101001001001000000000011000100011001110101001001110,
+ 0b0010010010001000111010011011100001000110011011011110110100111000,
+ 0b0000100110101100000111100010100100011100110111011100001111001100,
+ 0b1011111010001110001100000011110111111111100000001011111111101100,
+ 0b0000011100001111010101110000100110111100101101110111101001000001,
+ 0b1100010001110110111100001001001101101000011100000010110101001011,
+ 0b0100101001101011111001011110101101100011011111011100101010101111,
+ 0b0001101001111001110000101101101100001011010001011110011101000010,
+ 0b1111000000101001101111011010110011101110100001011011001011100010,
+ 0b0101001010111101101100001111100010010110001101001000001101100100,
+ 0b0101100101011110001100101011111000111001111001001001101101100001,
+ 0b1111001101010010100100011011000110110010001111000111010001001101,
+ 0b0001110010011000000001000110110111011000011100001000011001110111,
+ 0b0100001011011011011011110011101100100101111111101100101000001110,
+ 0b0101011110111101010111100111101111000101111111111110100011011010,
+ 0b1110101010001001110100000010110111010111111010111110100110010110,
+ 0b1010001111100001001100101000110100001100011100110010000011010111,
+ 0b1111111101101111000100111100000101011000001110011011101010111001,
+ 0b1111101100001110100101111101011001000100000101110000110010100011,
+ 0b1001010110110101101101000101010001010000101011011111010011010000,
+ 0b0111001110110011101001100111000001000100001010110000010000001101,
+ 0b0101111100111110100111011001111001111011011110010111010011101010,
+ 0b1110111000000001100100111001100100110001011011001110101111110111,
+ 0b0001010001001101010111101010011111000011110001101101011001111111,
+ 0b0101000011100011010010001101100001011101011010100110101100100010,
+ 0b0001000101011000100101111100110110000101101101111000110001001011,
+ 0b0101100101001011011000010101000000010100011100101101000010011111,
+ 0b1000010010001011101001011010100010111011110100110011011000100111,
+ 0b1000011011100001010111010111010011101100100010010010100100101001,
+ 0b1001001001010111110101000010111010000000101111010100001010010010,
+ 0b0011011110110010010101111011000001000000000011011111000011111011,
+ 0b1011000110100011001110000001000100000001011100010111010010011110,
+ 0b0111101110110101110111110000011000000100011100011000101101101110,
+ 0b1001100101111011011100011110101011001111100111101010101010110111,
+ 0b1100110010010001100011001111010000000100011101001111011101001111,
+ 0b1000111001111010100101000010000100000001001100101010001011001101,
+ 0b0011101011110000110010100101010100110010100001000010101011111101,
+ 0b1100000000000110000010101011000000011101000110011111100010111111,
+ 0b0010100110000011011100010110111100010110101100110011101110001101,
+ 0b0010111101010011111000111001111100110111111100100011110001101110,
+ 0b1001110111001001101001001001011000010100110001000000100011010110,
+ 0b0011110101100111011011111100001000011001010100111100100101111010,
+ 0b0010001101000011000010100101110000010101101000100110000100001010,
+ 0b0010000010100110010101100101110011101111000111111111001001100001,
+ 0b0100111111011011011011100111111011000010011101101111011111110110,
+ 0b1111111111010110101011101000100101110100001110001001101011100111,
+ 0b1011111101000101110000111100100010111010100001010000010010110010,
+ 0b1111010101001011101011101010000100110110001110111100100110111111,
+ 0b1011001101000001001101000010101010010110010001100001011100011010,
+ 0b0101001011011101010001110100010000010001111100100100100001001101,
+ 0b0010100000111001100011000101100101000001111100111001101000000010,
+ 0b1011001111010101011001000100100110100100110111110100000110111000,
+ 0b0101011111010011100011010010111101110010100001111111100010001001,
+ 0b0010111011101100100000000000001111111010011101100111100001001101,
+ 0b1101000000000000000000000000000000000000000000000000000000000000 };
+ };
+#elif LONG_DOUBLE_KIND == LDK_IBM128
+ template<>
+ struct floating_type_traits<long double>
+ {
+ static constexpr int mantissa_bits = 105;
+ static constexpr int exponent_bits = 11;
+ static constexpr bool has_implicit_leading_bit = true;
+ using mantissa_t = unsigned __int128;
+ using shortest_scientific_t = ryu::floating_decimal_128;
+
+ static constexpr uint64_t pow10_adjustment_tab[]
+ = { 0b0000000000000000000000000000000000000000000000001000000100000000,
+ 0b0000000000000000000100000000000000000000001000000000000000000010,
+ 0b0000100000000000000000001001000000000000000001100100000000000000,
+ 0b0011000000000000000000000000000001110000010000000000000000000000,
+ 0b0000100000000000001000000000000000000000000000100000000000000000 };
+ };
+#endif
+
+ // An IEEE-style decomposition of a floating-point value of type T.
+ template<typename T>
+ struct ieee_t
+ {
+ typename floating_type_traits<T>::mantissa_t mantissa;
+ uint32_t biased_exponent;
+ bool sign;
+ };
+
+ // Decompose the floating-point value into its IEEE components.
+ template<typename T>
+ ieee_t<T>
+ get_ieee_repr(const T value)
+ {
+ constexpr int mantissa_bits = floating_type_traits<T>::mantissa_bits;
+ constexpr int exponent_bits = floating_type_traits<T>::exponent_bits;
+ constexpr int total_bits = mantissa_bits + exponent_bits + 1;
+
+ constexpr auto get_uint_t = [] {
+ if constexpr (total_bits <= 32)
+ return uint32_t{};
+ else if constexpr (total_bits <= 64)
+ return uint64_t{};
+#ifdef __SIZEOF_INT128__
+ else if constexpr (total_bits <= 128)
+ return (unsigned __int128){};
+#endif
+ };
+ using uint_t = decltype(get_uint_t());
+ uint_t value_bits = 0;
+ memcpy(&value_bits, &value, sizeof(value));
+
+ ieee_t<T> ieee_repr;
+ ieee_repr.mantissa = value_bits & ((uint_t{1} << mantissa_bits) - 1u);
+ ieee_repr.biased_exponent
+ = (value_bits >> mantissa_bits) & ((uint_t{1} << exponent_bits) - 1u);
+ ieee_repr.sign = (value_bits >> (mantissa_bits + exponent_bits)) & 1;
+ return ieee_repr;
+ }
+
+#if LONG_DOUBLE_KIND == LDK_IBM128
+ template<>
+ ieee_t<long double>
+ get_ieee_repr(const long double value)
+ {
+ // The layout of __ibm128 isn't compatible with the standard IEEE format.
+ // So we transform it into an IEEE-compatible format, suitable for
+ // consumption by the generic Ryu API, with an 11-bit exponent and 105-bit
+ // mantissa (plus an implicit leading bit). We use the exponent and sign
+ // of the high part, and we merge the mantissa of the high part with the
+ // mantissa (and the implicit leading bit) of the low part.
+ using uint_t = unsigned __int128;
+ uint_t value_bits = 0;
+ memcpy(&value_bits, &value, sizeof(value_bits));
+
+ const uint64_t value_hi = value_bits;
+ const uint64_t value_lo = value_bits >> 64;
+
+ uint64_t mantissa_hi = value_hi & ((1ull << 52) - 1);
+ unsigned exponent_hi = (value_hi >> 52) & ((1ull << 11) - 1);
+ const int sign_hi = (value_hi >> 63) & 1;
+
+ uint64_t mantissa_lo = value_lo & ((1ull << 52) - 1);
+ const unsigned exponent_lo = (value_lo >> 52) & ((1ull << 11) - 1);
+ const int sign_lo = (value_lo >> 63) & 1;
+
+ {
+ // The following code for adjusting the low-part mantissa to combine
+ // it with the high-part mantissa is taken from the glibc source file
+ // sysdeps/ieee754/ldbl-128ibm/printf_fphex.c.
+ mantissa_lo <<= 7;
+ if (exponent_lo != 0)
+ mantissa_lo |= (1ull << (52 + 7));
+ else
+ mantissa_lo <<= 1;
+
+ const int ediff = exponent_hi - exponent_lo - 53;
+ if (ediff > 63)
+ mantissa_lo = 0;
+ else if (ediff > 0)
+ mantissa_lo >>= ediff;
+ else if (ediff < 0)
+ mantissa_lo <<= -ediff;
+
+ if (sign_lo != sign_hi && mantissa_lo != 0)
+ {
+ mantissa_lo = (1ull << 60) - mantissa_lo;
+ if (mantissa_hi == 0)
+ {
+ mantissa_hi = 0xffffffffffffeLL | (mantissa_lo >> 59);
+ mantissa_lo = 0xfffffffffffffffLL & (mantissa_lo << 1);
+ exponent_hi--;
+ }
+ else
+ mantissa_hi--;
+ }
+ }
+
+ ieee_t<long double> ieee_repr;
+ ieee_repr.mantissa = ((uint_t{mantissa_hi} << 64)
+ | (uint_t{mantissa_lo} << 4)) >> 11;
+ ieee_repr.biased_exponent = exponent_hi;
+ ieee_repr.sign = sign_hi;
+ return ieee_repr;
+ }
+#endif
+
+ // Invoke Ryu to obtain the shortest scientific form for the given
+ // floating-point number.
+ template<typename T>
+ typename floating_type_traits<T>::shortest_scientific_t
+ floating_to_shortest_scientific(const T value)
+ {
+ if constexpr (std::is_same_v<T, float>)
+ return ryu::floating_to_fd32(value);
+ else if constexpr (std::is_same_v<T, double>)
+ return ryu::floating_to_fd64(value);
+#ifdef __SIZEOF_INT128__
+ else if constexpr (std::is_same_v<T, long double>)
+ {
+ constexpr int mantissa_bits
+ = floating_type_traits<T>::mantissa_bits;
+ constexpr int exponent_bits
+ = floating_type_traits<T>::exponent_bits;
+ constexpr bool has_implicit_leading_bit
+ = floating_type_traits<T>::has_implicit_leading_bit;
+
+ const auto [mantissa, exponent, sign] = get_ieee_repr(value);
+ return ryu::generic_binary_to_decimal(mantissa, exponent, sign,
+ mantissa_bits, exponent_bits,
+ !has_implicit_leading_bit);
+ }
+#endif
+ }
+
+ // This subroutine returns true if the shortest scientific form fd is a
+ // positive power of 10, and the floating-point number that has this shortest
+ // scientific form is smaller than this power of 10.
+ //
+ // For instance, the exactly-representable 64-bit number
+ // 99999999999999991611392.0 has the shortest scientific form 1e23, so its
+ // exact value is smaller than its shortest scientific form.
+ //
+ // For these powers of 10 the length of the fixed form is one digit less
+ // than what the scientific exponent suggests.
+ //
+ // This subroutine inspects a lookup table to detect when fd is such a
+ // "rounded up" power of 10.
+ template<typename T>
+ bool
+ is_rounded_up_pow10_p(const typename
+ floating_type_traits<T>::shortest_scientific_t fd)
+ {
+ if (fd.exponent < 0 || fd.mantissa != 1) [[likely]]
+ return false;
+
+ constexpr auto& pow10_adjustment_tab
+ = floating_type_traits<T>::pow10_adjustment_tab;
+ __glibcxx_assert(fd.exponent/64 < (int)std::size(pow10_adjustment_tab));
+ return (pow10_adjustment_tab[fd.exponent/64]
+ & (1ull << (63 - fd.exponent%64)));
+ }
+
+ int
+ get_mantissa_length(const ryu::floating_decimal_32 fd)
+ { return ryu::decimalLength9(fd.mantissa); }
+
+ int
+ get_mantissa_length(const ryu::floating_decimal_64 fd)
+ { return ryu::decimalLength17(fd.mantissa); }
+
+#ifdef __SIZEOF_INT128__
+ int
+ get_mantissa_length(const ryu::floating_decimal_128 fd)
+ { return ryu::generic128::decimalLength(fd.mantissa); }
+#endif
+} // anon namespace
+
+namespace std _GLIBCXX_VISIBILITY(default)
+{
+_GLIBCXX_BEGIN_NAMESPACE_VERSION
+
+// This subroutine of __floating_to_chars_* handles writing nan, inf and 0 in
+// all formatting modes.
+template<typename T>
+ static optional<to_chars_result>
+ __handle_special_value(char* first, char* const last, const T value,
+ const chars_format fmt, const int precision)
+ {
+ __glibcxx_assert(precision >= 0);
+
+ string_view str;
+ switch (__builtin_fpclassify(FP_NAN, FP_INFINITE, FP_NORMAL, FP_SUBNORMAL,
+ FP_ZERO, value))
+ {
+ case FP_INFINITE:
+ str = "-inf";
+ break;
+
+ case FP_NAN:
+ str = "-nan";
+ break;
+
+ case FP_ZERO:
+ break;
+
+ default:
+ case FP_SUBNORMAL:
+ case FP_NORMAL: [[likely]]
+ return nullopt;
+ }
+
+ if (!str.empty())
+ {
+ // We're formatting +-inf or +-nan.
+ if (!__builtin_signbit(value))
+ str.remove_prefix(strlen("-"));
+
+ if (last - first < (int)str.length())
+ return {{last, errc::value_too_large}};
+
+ memcpy(first, &str[0], str.length());
+ first += str.length();
+ return {{first, errc{}}};
+ }
+
+ // We're formatting 0.
+ __glibcxx_assert(value == 0);
+ const auto orig_first = first;
+ const bool sign = __builtin_signbit(value);
+ int expected_output_length;
+ switch (fmt)
+ {
+ case chars_format::fixed:
+ case chars_format::scientific:
+ case chars_format::hex:
+ expected_output_length = sign + 1;
+ if (precision)
+ expected_output_length += strlen(".") + precision;
+ if (fmt == chars_format::scientific)
+ expected_output_length += strlen("e+00");
+ else if (fmt == chars_format::hex)
+ expected_output_length += strlen("p+0");
+ if (last - first < expected_output_length)
+ return {{last, errc::value_too_large}};
+
+ if (sign)
+ *first++ = '-';
+ *first++ = '0';
+ if (precision)
+ {
+ *first++ = '.';
+ memset(first, '0', precision);
+ first += precision;
+ }
+ if (fmt == chars_format::scientific)
+ {
+ memcpy(first, "e+00", 4);
+ first += 4;
+ }
+ else if (fmt == chars_format::hex)
+ {
+ memcpy(first, "p+0", 3);
+ first += 3;
+ }
+ break;
+
+ case chars_format::general:
+ default: // case chars_format{}:
+ expected_output_length = sign + 1;
+ if (last - first < expected_output_length)
+ return {{last, errc::value_too_large}};
+
+ if (sign)
+ *first++ = '-';
+ *first++ = '0';
+ break;
+ }
+ __glibcxx_assert(first - orig_first == expected_output_length);
+ return {{first, errc{}}};
+ }
+
+// This subroutine of the floating-point to_chars overloads performs
+// hexadecimal formatting.
+template<typename T>
+ static to_chars_result
+ __floating_to_chars_hex(char* first, char* const last, const T value,
+ const optional<int> precision)
+ {
+ if (precision.has_value() && precision.value() < 0) [[unlikely]]
+ // A negative precision argument is treated as if it were omitted.
+ return __floating_to_chars_hex(first, last, value, nullopt);
+
+ __glibcxx_requires_valid_range(first, last);
+
+ constexpr int mantissa_bits = floating_type_traits<T>::mantissa_bits;
+ constexpr bool has_implicit_leading_bit
+ = floating_type_traits<T>::has_implicit_leading_bit;
+ constexpr int exponent_bits = floating_type_traits<T>::exponent_bits;
+ constexpr int exponent_bias = (1u << (exponent_bits - 1)) - 1;
+ using mantissa_t = typename floating_type_traits<T>::mantissa_t;
+ constexpr int mantissa_t_width = sizeof(mantissa_t) * __CHAR_BIT__;
+
+ if (auto result = __handle_special_value(first, last, value,
+ chars_format::hex,
+ precision.value_or(0)))
+ return *result;
+
+ // Extract the sign, mantissa and exponent from the value.
+ const auto [ieee_mantissa, biased_exponent, sign] = get_ieee_repr(value);
+ const bool is_normal_number = (biased_exponent != 0);
+
+ // Calculate the unbiased exponent.
+ const int32_t unbiased_exponent = (is_normal_number
+ ? biased_exponent - exponent_bias
+ : 1 - exponent_bias);
+
+ // Shift the mantissa so that its bitwidth is a multiple of 4.
+ constexpr unsigned rounded_mantissa_bits = (mantissa_bits + 3) / 4 * 4;
+ static_assert(mantissa_t_width >= rounded_mantissa_bits);
+ mantissa_t effective_mantissa
+ = ieee_mantissa << (rounded_mantissa_bits - mantissa_bits);
+ if (is_normal_number)
+ {
+ if constexpr (has_implicit_leading_bit)
+ // Restore the mantissa's implicit leading bit.
+ effective_mantissa |= mantissa_t{1} << rounded_mantissa_bits;
+ else
+ // The explicit mantissa bit should already be set.
+ __glibcxx_assert(effective_mantissa & (mantissa_t{1} << (mantissa_bits
+ - 1u)));
+ }
+
+ // Compute the shortest precision needed to print this value exactly,
+ // disregarding trailing zeros.
+ constexpr int full_hex_precision = (has_implicit_leading_bit
+ ? (mantissa_bits + 3) / 4
+ // With an explicit leading bit, we
+ // use the four leading nibbles as the
+ // hexit before the decimal point.
+ : (mantissa_bits - 4 + 3) / 4);
+ const int trailing_zeros = __countr_zero(effective_mantissa) / 4;
+ const int shortest_full_precision = full_hex_precision - trailing_zeros;
+ __glibcxx_assert(shortest_full_precision >= 0);
+
+ int written_exponent = unbiased_exponent;
+ const int effective_precision = precision.value_or(shortest_full_precision);
+ if (effective_precision < shortest_full_precision)
+ {
+ // When limiting the precision, we need to determine how to round the
+ // least significant printed hexit. The following branchless
+ // bit-level-parallel technique computes whether to round up the
+ // mantissa bit at index N (according to round-to-nearest rules) when
+ // dropping N bits of precision, for each index N in the bit vector.
+ // This technique is borrowed from the MSVC implementation.
+ using bitvec = mantissa_t;
+ const bitvec round_bit = effective_mantissa << 1;
+ const bitvec has_tail_bits = round_bit - 1;
+ const bitvec lsb_bit = effective_mantissa;
+ const bitvec should_round = round_bit & (has_tail_bits | lsb_bit);
+
+ const int dropped_bits = 4*(full_hex_precision - effective_precision);
+ // Mask out the dropped nibbles.
+ effective_mantissa >>= dropped_bits;
+ effective_mantissa <<= dropped_bits;
+ if (should_round & (mantissa_t{1} << dropped_bits))
+ {
+ // Round up the least significant nibble.
+ effective_mantissa += mantissa_t{1} << dropped_bits;
+ // Check and adjust for overflow of the leading nibble. When the
+ // type has an implicit leading bit, then the leading nibble
+ // before rounding is either 0 or 1, so it can't overflow.
+ if constexpr (!has_implicit_leading_bit)
+ {
+ // The only supported floating-point type with explicit
+ // leading mantissa bit is LDK_FLOAT80, i.e. x86 80-bit
+ // extended precision, and so we hardcode the below overflow
+ // check+adjustment for this type.
+ static_assert(mantissa_t_width == 64
+ && rounded_mantissa_bits == 64);
+ if (effective_mantissa == 0)
+ {
+ // We rounded up the least significant nibble and the
+ // mantissa overflowed, e.g f.fcp+10 with precision=1
+ // became 10.0p+10. Absorb this extra hexit into the
+ // exponent to obtain 1.0p+14.
+ effective_mantissa
+ = mantissa_t{1} << (rounded_mantissa_bits - 4);
+ written_exponent += 4;
+ }
+ }
+ }
+ }
+
+ // Compute the leading hexit and mask it out from the mantissa.
+ char leading_hexit;
+ if constexpr (has_implicit_leading_bit)
+ {
+ const unsigned nibble = effective_mantissa >> rounded_mantissa_bits;
+ __glibcxx_assert(nibble <= 2);
+ leading_hexit = '0' + nibble;
+ effective_mantissa &= ~(mantissa_t{0b11} << rounded_mantissa_bits);
+ }
+ else
+ {
+ const unsigned nibble = effective_mantissa >> (rounded_mantissa_bits-4);
+ __glibcxx_assert(nibble < 16);
+ leading_hexit = "0123456789abcdef"[nibble];
+ effective_mantissa &= ~(mantissa_t{0b1111} << (rounded_mantissa_bits-4));
+ written_exponent -= 3;
+ }
+
+ // Now before we start writing the string, determine the total length of
+ // the output string and perform a single bounds check.
+ int expected_output_length = sign + 1;
+ if (effective_precision != 0)
+ expected_output_length += strlen(".") + effective_precision;
+ const int abs_written_exponent = abs(written_exponent);
+ expected_output_length += (abs_written_exponent >= 10000 ? strlen("p+ddddd")
+ : abs_written_exponent >= 1000 ? strlen("p+dddd")
+ : abs_written_exponent >= 100 ? strlen("p+ddd")
+ : abs_written_exponent >= 10 ? strlen("p+dd")
+ : strlen("p+d"));
+ if (last - first < expected_output_length)
+ return {last, errc::value_too_large};
+
+ const auto saved_first = first;
+ // Write the negative sign and the leading hexit.
+ if (sign)
+ *first++ = '-';
+ *first++ = leading_hexit;
+
+ if (effective_precision > 0)
+ {
+ *first++ = '.';
+ int written_hexits = 0;
+ // Extract and mask out the leading nibble after the decimal point,
+ // write its corresponding hexit, and repeat until the mantissa is
+ // empty.
+ int nibble_offset = rounded_mantissa_bits;
+ if constexpr (!has_implicit_leading_bit)
+ // We already printed the entire leading hexit.
+ nibble_offset -= 4;
+ while (effective_mantissa != 0)
+ {
+ nibble_offset -= 4;
+ const unsigned nibble = effective_mantissa >> nibble_offset;
+ __glibcxx_assert(nibble < 16);
+ *first++ = "0123456789abcdef"[nibble];
+ ++written_hexits;
+ effective_mantissa &= ~(mantissa_t{0b1111} << nibble_offset);
+ }
+ __glibcxx_assert(nibble_offset >= 0);
+ __glibcxx_assert(written_hexits <= effective_precision);
+ // Since the mantissa is now empty, every hexit hereafter must be '0'.
+ if (int remaining_hexits = effective_precision - written_hexits)
+ {
+ memset(first, '0', remaining_hexits);
+ first += remaining_hexits;
+ }
+ }
+
+ // Finally, write the exponent.
+ *first++ = 'p';
+ if (written_exponent >= 0)
+ *first++ = '+';
+ const to_chars_result result = to_chars(first, last, written_exponent);
+ __glibcxx_assert(result.ec == errc{}
+ && result.ptr == saved_first + expected_output_length);
+ return result;
+ }
+
+template<typename T>
+ static to_chars_result
+ __floating_to_chars_shortest(char* first, char* const last, const T value,
+ chars_format fmt)
+ {
+ if (fmt == chars_format::hex)
+ return __floating_to_chars_hex(first, last, value, nullopt);
+
+ __glibcxx_assert(fmt == chars_format::fixed
+ || fmt == chars_format::scientific
</cut>
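As a quick orientation to the interface the libstdc++ commit above implements, here is a minimal C++17 usage sketch of the new floating-point overloads (standard std::to_chars semantics; the expected outputs follow the shortest-round-trip and hex behavior described in the commit message):
<cut>
#include <charconv>
#include <cstdio>

int main()
{
  char buf[64];

  // Shortest form that round-trips back through std::from_chars.
  auto r1 = std::to_chars(buf, buf + sizeof(buf), 0.1);
  std::printf("%.*s\n", int(r1.ptr - buf), buf);  // prints "0.1"

  // Explicit scientific format.
  auto r2 = std::to_chars(buf, buf + sizeof(buf), 1234.5,
                          std::chars_format::scientific);
  std::printf("%.*s\n", int(r2.ptr - buf), buf);  // prints "1.2345e+03"

  // Hex format with explicit precision; per the commit message the
  // shortest hex form deliberately matches printf's "%a" output.
  auto r3 = std::to_chars(buf, buf + sizeof(buf), 1.0f,
                          std::chars_format::hex, 3);
  std::printf("%.*s\n", int(r3.ptr - buf), buf);  // prints "1.000p+0"

  // If the buffer is too small, the result's ec member is
  // std::errc::value_too_large and ptr equals the end pointer.
  return 0;
}
</cut>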
[TCWG CI] Regression caused by binutils: PR28149, debug info with wrong file association:
commit 51298b330327a568358da069d9808f51c6cb1672
Author: Alan Modra <amodra(a)gmail.com>
PR28149, debug info with wrong file association
Results regressed to
# reset_artifacts:
-10
# build_abe binutils:
-9
# build_abe stage1:
-5
# build_abe qemu:
-2
# linux_n_obj:
6363
# First few build errors in logs:
from
# reset_artifacts:
-10
# build_abe binutils:
-9
# build_abe stage1:
-5
# build_abe qemu:
-2
# linux_n_obj:
7116
# linux build successful:
all
# linux boot successful:
boot
THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
This commit has regressed these CI configurations:
- tcwg_kernel/gnu-master-aarch64-lts-defconfig
First_bad build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-lts-def…
Last_good build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-lts-def…
Baseline build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-lts-def…
Even more details: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-lts-def…
Reproduce builds:
<cut>
mkdir investigate-binutils-51298b330327a568358da069d9808f51c6cb1672
cd investigate-binutils-51298b330327a568358da069d9808f51c6cb1672
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-lts-def… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-lts-def… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-aarch64-lts-def… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_kernel-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /binutils/ ./ ./bisect/baseline/
cd binutils
# Reproduce first_bad build
git checkout --detach 51298b330327a568358da069d9808f51c6cb1672
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 5cdb4f14426a99ec8fcba843fa503efdc55fa078
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 51298b330327a568358da069d9808f51c6cb1672
Author: Alan Modra <amodra(a)gmail.com>
Date: Fri Sep 17 09:08:15 2021 +0930
PR28149, debug info with wrong file association
gcc-11 and gcc-12 pass -gdwarf-5 to gas, in order to prime gas for
DWARF 5 level debug info. Unfortunately it seems there are cases
where the compiler does not emit a .file or .loc dwarf debug directive
before any machine instructions. (Note that the .file directive
typically emitted as the first line of assembly output doesn't count as
a dwarf debug directive. The dwarf .file has a file number before the
file name string.)
This patch delays allocation of file numbers for gas generated line
debug info until the end of assembly, thus avoiding any clashes with
compiler generated file numbers. Two fixes for test case source are
necessary: a .loc can't use a file number that hasn't already been
specified with .file.
A followup patch will remove all the gas generated line info on
seeing a .file directive.
PR 28149
* dwarf2dbg.c (num_of_auto_assigned): Delete.
(current): Update initialisation.
(set_or_check_view): Replace all accesses to view with u.view.
(dwarf2_consume_line_info): Likewise.
(dwarf2_directive_loc): Likewise. Assert that we aren't generating
line info.
(dwarf2_gen_line_info_1): Don't call set_or_check_view on
gas generated line entries.
(dwarf2_gen_line_info): Set and track filenames for gas generated
line entries. Simplify generation of labels.
(get_directory_table_entry): Use filename_cmp when comparing dirs.
(do_allocate_filenum): New function.
(dwarf2_where): Set u.filename and filenum to -1 for gas generated
line entries.
(dwarf2_directive_filename): Remove num_of_auto_assigned handling.
(process_entries): Update view field access. Call
do_allocate_filenum.
* dwarf2dbg.h (struct dwarf2_line_info): Add filename field in
union aliasing view.
* testsuite/gas/i386/dwarf2-line-3.s: Add .file directive.
* testsuite/gas/i386/dwarf2-line-4.s: Likewise.
* testsuite/gas/i386/dwarf2-line-4.d: Update expected output.
* testsuite/gas/i386/dwarf4-line-1.d: Likewise.
* testsuite/gas/i386/dwarf5-line-1.d: Likewise.
* testsuite/gas/i386/dwarf5-line-2.d: Likewise.
---
gas/dwarf2dbg.c | 152 ++++++++++++++++++---------------
gas/dwarf2dbg.h | 7 +-
gas/testsuite/gas/i386/dwarf2-line-3.s | 1 +
gas/testsuite/gas/i386/dwarf2-line-4.d | 5 +-
gas/testsuite/gas/i386/dwarf2-line-4.s | 1 +
gas/testsuite/gas/i386/dwarf4-line-1.d | 4 +-
gas/testsuite/gas/i386/dwarf5-line-1.d | 4 +-
gas/testsuite/gas/i386/dwarf5-line-2.d | 3 +-
8 files changed, 105 insertions(+), 72 deletions(-)
diff --git a/gas/dwarf2dbg.c b/gas/dwarf2dbg.c
index 9e3437b8948..c6303ba94a6 100644
--- a/gas/dwarf2dbg.c
+++ b/gas/dwarf2dbg.c
@@ -207,7 +207,6 @@ struct file_entry
static struct file_entry *files;
static unsigned int files_in_use;
static unsigned int files_allocated;
-static unsigned int num_of_auto_assigned;
/* Table of directories used by .debug_line. */
static char ** dirs = NULL;
@@ -233,7 +232,7 @@ static struct dwarf2_line_info current =
{
1, 1, 0, 0,
DWARF2_LINE_DEFAULT_IS_STMT ? DWARF2_FLAG_IS_STMT : 0,
- 0, NULL
+ 0, { NULL }
};
/* This symbol is used to recognize view number forced resets in loc
@@ -342,7 +341,7 @@ set_or_check_view (struct line_entry *e, struct line_entry *p,
/* First, compute !(E->label > P->label), to tell whether or not
we're to reset the view number. If we can't resolve it to a
constant, keep it symbolic. */
- if (!p || (e->loc.view == force_reset_view && force_reset_view))
+ if (!p || (e->loc.u.view == force_reset_view && force_reset_view))
{
viewx.X_op = O_constant;
viewx.X_add_number = 0;
@@ -367,9 +366,9 @@ set_or_check_view (struct line_entry *e, struct line_entry *p,
}
}
- if (S_IS_DEFINED (e->loc.view) && symbol_constant_p (e->loc.view))
+ if (S_IS_DEFINED (e->loc.u.view) && symbol_constant_p (e->loc.u.view))
{
- expressionS *value = symbol_get_value_expression (e->loc.view);
+ expressionS *value = symbol_get_value_expression (e->loc.u.view);
/* We can't compare the view numbers at this point, because in
VIEWX we've only determined whether we're to reset it so
far. */
@@ -404,16 +403,16 @@ set_or_check_view (struct line_entry *e, struct line_entry *p,
{
expressionS incv;
- if (!p->loc.view)
+ if (!p->loc.u.view)
{
- p->loc.view = symbol_temp_make ();
- gas_assert (!S_IS_DEFINED (p->loc.view));
+ p->loc.u.view = symbol_temp_make ();
+ gas_assert (!S_IS_DEFINED (p->loc.u.view));
}
memset (&incv, 0, sizeof (incv));
incv.X_unsigned = 1;
incv.X_op = O_symbol;
- incv.X_add_symbol = p->loc.view;
+ incv.X_add_symbol = p->loc.u.view;
incv.X_add_number = 1;
if (viewx.X_op == O_constant)
@@ -430,16 +429,16 @@ set_or_check_view (struct line_entry *e, struct line_entry *p,
}
}
- if (!S_IS_DEFINED (e->loc.view))
+ if (!S_IS_DEFINED (e->loc.u.view))
{
- symbol_set_value_expression (e->loc.view, &viewx);
- S_SET_SEGMENT (e->loc.view, expr_section);
- symbol_set_frag (e->loc.view, &zero_address_frag);
+ symbol_set_value_expression (e->loc.u.view, &viewx);
+ S_SET_SEGMENT (e->loc.u.view, expr_section);
+ symbol_set_frag (e->loc.u.view, &zero_address_frag);
}
/* Define and attempt to simplify any earlier views needed to
compute E's. */
- if (h && p && p->loc.view && !S_IS_DEFINED (p->loc.view))
+ if (h && p && p->loc.u.view && !S_IS_DEFINED (p->loc.u.view))
{
struct line_entry *h2;
/* Reverse the list to avoid quadratic behavior going backwards
@@ -459,7 +458,9 @@ set_or_check_view (struct line_entry *e, struct line_entry *p,
break;
set_or_check_view (r, r->next, NULL);
}
- while (r->next && r->next->loc.view && !S_IS_DEFINED (r->next->loc.view)
+ while (r->next
+ && r->next->loc.u.view
+ && !S_IS_DEFINED (r->next->loc.u.view)
&& (r = r->next));
/* Unreverse the list, so that we can go forward again. */
@@ -475,14 +476,14 @@ set_or_check_view (struct line_entry *e, struct line_entry *p,
view of the previous subsegment. */
if (r == h)
continue;
- gas_assert (S_IS_DEFINED (r->loc.view));
- resolve_expression (symbol_get_value_expression (r->loc.view));
+ gas_assert (S_IS_DEFINED (r->loc.u.view));
+ resolve_expression (symbol_get_value_expression (r->loc.u.view));
}
while (r != p && (r = r->next));
/* Now that we've defined and computed all earlier views that might
be needed to compute E's, attempt to simplify it. */
- resolve_expression (symbol_get_value_expression (e->loc.view));
+ resolve_expression (symbol_get_value_expression (e->loc.u.view));
}
}
@@ -518,10 +519,8 @@ dwarf2_gen_line_info_1 (symbolS *label, struct dwarf2_line_info *loc)
/* Subseg heads are chained to previous subsegs in
dwarf2_finish. */
- if (loc->view && lss->head)
- set_or_check_view (e,
- (struct line_entry *)lss->ptail,
- lss->head);
+ if (loc->filenum != -1u && loc->u.view && lss->head)
+ set_or_check_view (e, (struct line_entry *) lss->ptail, lss->head);
*lss->ptail = e;
lss->ptail = &e->next;
@@ -532,9 +531,6 @@ dwarf2_gen_line_info_1 (symbolS *label, struct dwarf2_line_info *loc)
void
dwarf2_gen_line_info (addressT ofs, struct dwarf2_line_info *loc)
{
- static unsigned int line = -1;
- static unsigned int filenum = -1;
-
symbolS *sym;
/* Early out for as-yet incomplete location information. */
@@ -552,20 +548,35 @@ dwarf2_gen_line_info (addressT ofs, struct dwarf2_line_info *loc)
symbols apply to assembler code. It is necessary to emit
duplicate line symbols when a compiler asks for them, because GDB
uses them to determine the end of the prologue. */
- if (debug_type == DEBUG_DWARF2
- && line == loc->line && filenum == loc->filenum)
- return;
+ if (debug_type == DEBUG_DWARF2)
+ {
+ static unsigned int line = -1;
+ static const char *filename = NULL;
+
+ if (line == loc->line)
+ {
+ if (filename == loc->u.filename)
+ return;
+ if (filename_cmp (filename, loc->u.filename) == 0)
+ {
+ filename = loc->u.filename;
+ return;
+ }
+ }
- line = loc->line;
- filenum = loc->filenum;
+ line = loc->line;
+ filename = loc->u.filename;
+ }
if (linkrelax)
{
- char name[120];
+ static int label_num = 0;
+ char name[32];
/* Use a non-fake name for the line number location,
so that it can be referred to by relocations. */
- sprintf (name, ".Loc.%u.%u", line, filenum);
+ sprintf (name, ".Loc.%u", label_num);
+ label_num++;
sym = symbol_new (name, now_seg, frag_now, ofs);
}
else
@@ -624,13 +635,15 @@ get_directory_table_entry (const char *dirname,
{
const char * pwd = file0_dirname ? file0_dirname : getpwd ();
- if (dwarf_level >= 5 && strcmp (dirname, pwd) != 0)
+ if (dwarf_level >= 5 && filename_cmp (dirname, pwd) != 0)
{
- /* In DWARF-5 the 0 entry in the directory table is expected to be
- the same as the DW_AT_comp_dir (which is set to the current build
- directory). Since we are about to create a directory entry that
- is not the same, allocate the current directory first.
- FIXME: Alternatively we could generate an error message here. */
+ /* In DWARF-5 the 0 entry in the directory table is
+ expected to be the same as the DW_AT_comp_dir (which
+ is set to the current build directory). Since we are
+ about to create a directory entry that is not the
+ same, allocate the current directory first.
+ FIXME: Alternatively we could generate an error
+ message here. */
(void) get_directory_table_entry (pwd, NULL, strlen (pwd),
true);
d = 1;
@@ -745,14 +758,30 @@ allocate_filenum (const char * pathname)
if (!assign_file_to_slot (i, file, dir))
return -1;
- num_of_auto_assigned++;
-
last_used = i;
last_used_dir_len = dir_len;
return i;
}
+/* Run through the list of line entries starting at E, allocating
+ file entries for gas generated debug. */
+
+static void
+do_allocate_filenum (struct line_entry *e)
+{
+ do
+ {
+ if (e->loc.filenum == -1u)
+ {
+ e->loc.filenum = allocate_filenum (e->loc.u.filename);
+ e->loc.u.view = NULL;
+ }
+ e = e->next;
+ }
+ while (e);
+}
+
/* Allocate slot NUM in the .debug_line file table to FILENAME.
If DIRNAME is not NULL or there is a directory component to FILENAME
then this will be stored in the directory table, if not already present.
@@ -929,17 +958,12 @@ dwarf2_where (struct dwarf2_line_info *line)
{
if (debug_type == DEBUG_DWARF2)
{
- const char *filename;
-
- memset (line, 0, sizeof (*line));
- filename = as_where (&line->line);
- line->filenum = allocate_filenum (filename);
- /* FIXME: We should check the return value from allocate_filenum. */
+ line->u.filename = as_where (&line->line);
+ line->filenum = -1u;
line->column = 0;
line->flags = DWARF2_FLAG_IS_STMT;
line->isa = current.isa;
line->discriminator = current.discriminator;
- line->view = NULL;
}
else
*line = current;
@@ -1018,7 +1042,7 @@ dwarf2_consume_line_info (void)
| DWARF2_FLAG_PROLOGUE_END
| DWARF2_FLAG_EPILOGUE_BEGIN);
current.discriminator = 0;
- current.view = NULL;
+ current.u.view = NULL;
}
/* Called for each (preferably code) label. If dwarf2_loc_mark_labels
@@ -1060,7 +1084,6 @@ dwarf2_directive_filename (void)
char *filename;
const char * dirname = NULL;
int filename_len;
- unsigned int i;
/* Continue to accept a bare string and pass it off. */
SKIP_WHITESPACE ();
@@ -1132,18 +1155,6 @@ dwarf2_directive_filename (void)
return NULL;
}
- if (num_of_auto_assigned)
- {
- /* Clear slots auto-assigned before the first .file <NUMBER>
- directive was seen. */
- if (files_in_use != (num_of_auto_assigned + 1))
- abort ();
- for (i = 1; i < files_in_use; i++)
- files[i].filename = NULL;
- files_in_use = 0;
- num_of_auto_assigned = 0;
- }
-
if (! allocate_filename_to_slot (dirname, filename, (unsigned int) num,
with_md5))
return NULL;
@@ -1191,6 +1202,11 @@ dwarf2_directive_loc (int dummy ATTRIBUTE_UNUSED)
return;
}
+ /* debug_type will be turned off by dwarf2_directive_filename, and
+ if we don't have a dwarf style .file then files_in_use will be
+ zero and the above error will trigger. */
+ gas_assert (debug_type == DEBUG_NONE);
+
current.filenum = filenum;
current.line = line;
current.discriminator = 0;
@@ -1333,7 +1349,7 @@ dwarf2_directive_loc (int dummy ATTRIBUTE_UNUSED)
S_SET_VALUE (sym, 0);
symbol_set_frag (sym, &zero_address_frag);
}
- current.view = sym;
+ current.u.view = sym;
}
else
{
@@ -1347,10 +1363,9 @@ dwarf2_directive_loc (int dummy ATTRIBUTE_UNUSED)
demand_empty_rest_of_line ();
dwarf2_any_loc_directive_seen = dwarf2_loc_directive_seen = true;
- debug_type = DEBUG_NONE;
/* If we were given a view id, emit the row right away. */
- if (current.view)
+ if (current.u.view)
dwarf2_emit_insn (0);
}
@@ -1984,7 +1999,7 @@ process_entries (segT seg, struct line_entry *e)
frag_ofs = S_GET_VALUE (lab);
if (last_frag == NULL
- || (e->loc.view == force_reset_view && force_reset_view
+ || (e->loc.u.view == force_reset_view && force_reset_view
/* If we're going to reset the view, but we know we're
advancing the PC, we don't have to force with
set_address. We know we do when we're at the same
@@ -2850,16 +2865,19 @@ dwarf2_finish (void)
struct line_subseg *lss = s->head;
struct line_entry **ptail = lss->ptail;
+ if (lss->head && SEG_NORMAL (s->seg))
+ do_allocate_filenum (lss->head);
+
/* Reset the initial view of the first subsection of the
section. */
- if (lss->head && lss->head->loc.view)
+ if (lss->head && lss->head->loc.u.view)
set_or_check_view (lss->head, NULL, NULL);
while ((lss = lss->next) != NULL)
{
/* Link the first view of subsequent subsections to the
previous view. */
- if (lss->head && lss->head->loc.view)
+ if (lss->head && lss->head->loc.u.view)
set_or_check_view (lss->head,
!s->head ? NULL : (struct line_entry *)ptail,
s->head ? s->head->head : NULL);
diff --git a/gas/dwarf2dbg.h b/gas/dwarf2dbg.h
index 14d770c40dd..700d9dec5cb 100644
--- a/gas/dwarf2dbg.h
+++ b/gas/dwarf2dbg.h
@@ -36,7 +36,12 @@ struct dwarf2_line_info
unsigned int isa;
unsigned int flags;
unsigned int discriminator;
- symbolS *view;
+ /* filenum == -1u chooses filename, otherwise view. */
+ union
+ {
+ symbolS *view;
+ const char *filename;
+ } u;
};
/* Implements the .file FILENO "FILENAME" directive. FILENO can be 0
diff --git a/gas/testsuite/gas/i386/dwarf2-line-3.s b/gas/testsuite/gas/i386/dwarf2-line-3.s
index 2085ef93940..e933719fbc3 100644
--- a/gas/testsuite/gas/i386/dwarf2-line-3.s
+++ b/gas/testsuite/gas/i386/dwarf2-line-3.s
@@ -7,6 +7,7 @@
main:
.cfi_startproc
nop
+ .file 1 "dwarf2-test.c"
.loc 1 1
ret
.cfi_endproc
diff --git a/gas/testsuite/gas/i386/dwarf2-line-4.d b/gas/testsuite/gas/i386/dwarf2-line-4.d
index c0c85f4639f..a01fd0540f3 100644
--- a/gas/testsuite/gas/i386/dwarf2-line-4.d
+++ b/gas/testsuite/gas/i386/dwarf2-line-4.d
@@ -33,11 +33,14 @@ Raw dump of debug contents of section \.z?debug_line:
The File Name Table \(offset 0x.*\):
Entry Dir Time Size Name
- 1 1 0 0 dwarf2-line-4.s
+ 1 0 0 0 dwarf2-test.c
+ 2 1 0 0 dwarf2-line-4.s
Line Number Statements:
+ \[0x.*\] Set File Name to entry 2 in the File Name Table
\[0x.*\] Extended opcode 2: set Address to 0x0
\[0x.*\] Special opcode 13: advance Address by 0 to 0x0 and Line by 8 to 9
+ \[0x.*\] Set File Name to entry 1 in the File Name Table
\[0x.*\] Advance Line by -8 to 1
\[0x.*\] Special opcode 19: advance Address by 1 to 0x1 and Line by 0 to 1
\[0x.*\] Advance PC by 1 to 0x2
diff --git a/gas/testsuite/gas/i386/dwarf2-line-4.s b/gas/testsuite/gas/i386/dwarf2-line-4.s
index 89bb62d9db7..7348f4be62c 100644
--- a/gas/testsuite/gas/i386/dwarf2-line-4.s
+++ b/gas/testsuite/gas/i386/dwarf2-line-4.s
@@ -7,6 +7,7 @@
main:
.cfi_startproc
nop
+ .file 1 "dwarf2-test.c"
.loc 1 1
ret
.cfi_endproc
diff --git a/gas/testsuite/gas/i386/dwarf4-line-1.d b/gas/testsuite/gas/i386/dwarf4-line-1.d
index 4f8321e9bfd..8199efbb0c2 100644
--- a/gas/testsuite/gas/i386/dwarf4-line-1.d
+++ b/gas/testsuite/gas/i386/dwarf4-line-1.d
@@ -36,12 +36,14 @@ Raw dump of debug contents of section \.z?debug_line:
Entry Dir Time Size Name
1 0 0 0 foo.c
2 0 0 0 foo.h
+ 3 1 0 0 dwarf4-line-1.s
Line Number Statements:
+ \[0x.*\] Set File Name to entry 2 in the File Name Table
\[0x.*\] Extended opcode 2: set Address to 0x0
\[0x.*\] Advance Line by 81 to 82
\[0x.*\] Copy
- \[0x.*\] Set File Name to entry 2 in the File Name Table
+ \[0x.*\] Set File Name to entry 3 in the File Name Table
\[0x.*\] Advance Line by -73 to 9
\[0x.*\] Special opcode 19: advance Address by 1 to 0x1 and Line by 0 to 9
\[0x.*\] Advance PC by 3 to 0x4
diff --git a/gas/testsuite/gas/i386/dwarf5-line-1.d b/gas/testsuite/gas/i386/dwarf5-line-1.d
index f57fc47d269..2c2cf5696c4 100644
--- a/gas/testsuite/gas/i386/dwarf5-line-1.d
+++ b/gas/testsuite/gas/i386/dwarf5-line-1.d
@@ -36,12 +36,14 @@ Raw dump of debug contents of section \.z?debug_line:
0 \(indirect line string, offset: 0x.*\): .*/gas/testsuite
1 \(indirect line string, offset: 0x.*\): .*/gas/testsuite/gas/i386
- The File Name Table \(offset 0x.*, lines 2, columns 3\):
+ The File Name Table \(offset 0x.*, lines 3, columns 3\):
Entry Dir MD5 Name
0 0 0xbbd69fc03ce253b2dbaab2522dd519ae \(indirect line string, offset: 0x.*\): core.c
1 0 0x0 \(indirect line string, offset: 0x.*\): types.h
+ 2 1 0x0 \(indirect line string, offset: 0x.*\): dwarf5-line-1.s
Line Number Statements:
+ \[0x.*\] Set File Name to entry 2 in the File Name Table
\[0x.*\] Extended opcode 2: set Address to 0x0
\[0x.*\] Special opcode 8: advance Address by 0 to 0x0 and Line by 3 to 4
\[0x.*\] Advance PC by 1 to 0x1
diff --git a/gas/testsuite/gas/i386/dwarf5-line-2.d b/gas/testsuite/gas/i386/dwarf5-line-2.d
index 2f96df510d0..85f98c8ab9c 100644
--- a/gas/testsuite/gas/i386/dwarf5-line-2.d
+++ b/gas/testsuite/gas/i386/dwarf5-line-2.d
@@ -36,9 +36,10 @@ Raw dump of debug contents of section \.z?debug_line:
0 \(indirect line string, offset: 0x.*\): .*/gas/testsuite
1 \(indirect line string, offset: 0x.*\): .*/gas/testsuite/gas/i386
- The File Name Table \(offset 0x.*, lines 1, columns 3\):
+ The File Name Table \(offset 0x.*, lines 2, columns 3\):
Entry Dir MD5 Name
0 0 0xbbd69fc03ce253b2dbaab2522dd519ae \(indirect line string, offset: 0x.*\): core.c
+ 1 1 0x0 \(indirect line string, offset: .*\): dwarf5-line-2.s
Line Number Statements:
\[0x.*\] Extended opcode 2: set Address to 0x0
</cut>
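For illustration, here is a compact C++ sketch of the allocation scheme the commit above describes; the names and data structures are hypothetical stand-ins (the real code overlaps the filename with the view symbol in a union inside struct dwarf2_line_info and walks a linked list of line entries):

#include <string>
#include <unordered_map>
#include <vector>

// Gas-generated line entries record only a filename plus the sentinel
// filenum == -1u; real numbers are handed out in a single pass at the
// end of assembly, so they can never clash with file numbers the
// compiler already assigned through .file directives.
struct LineEntry {
    unsigned filenum = -1u;   // -1u: filename not yet numbered
    std::string filename;     // meaningful only while filenum == -1u
};

void allocate_filenums(std::vector<LineEntry>& entries,
                       std::unordered_map<std::string, unsigned>& table,
                       unsigned& next_free)  // seeded past compiler slots
{
    for (auto& e : entries)
        if (e.filenum == -1u) {
            auto [it, inserted] = table.try_emplace(e.filename, next_free);
            if (inserted)
                ++next_free;
            e.filenum = it->second;
        }
}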
Progress (short week, 3 days)
* UM-2 [QEMU upstream maintainership]
+ more code review, notably the Apple Silicon hvf support, which is
nearly ready to go in
* QEMU-406 [QEMU support for MVE (M-profile Vector Extension; Helium)]
+ Sent out v2 of the "optimized code gen for MVE" patchset;
this now covers all the insns that have an easy optimized version.
+ Fixed a bug where we weren't correctly setting up FPSCR.LTPSIZE
when using QEMU's user-mode-only emulator
+ Wrote some code to add support for the (not yet finalized) gdbstub
XML that tells GDB that the guest CPU has MVE. This causes a GDB
with the MVE handling to crash, so one or the other of us has
got something wrong :-)
KVM Forum was this week, as a 2-day virtual conference. I felt the
programme was comparatively a bit small this year, but there were some
interesting talks. Also a BoF session on whether/how we should
consider adding Rust code to QEMU: I am pushing for (a) a clearer
medium-to-long-term vision of where we would be going and why we'd be
doing this and (b) more design-sketch type work of "what would XYZ in
rust look like", which would hopefully both (a) make the benefit/lack
thereof a bit more clear and (b) demonstrate that there are enough
people enthusiastic enough about the prospect to make it a success...
-- PMM
After llvm commit 1c3fcc8ae92ebfe9a9d1a21a288ad71ef7f98091
Author: Amy Kwan <amy.kwan1(a)ibm.com>
[libc++][NFC] Mark values in gdb pretty print comparison functions as live to prevent values being optimized out.
the following hot functions grew in size by more than 10% (but their benchmarks grew in size by less than 1%):
- 447.dealII,[.] contract<3> grew in size by 164%
Benchmark:
- Toolchain: Clang + Glibc + LLVM Linker
- Version: all components were built from their latest release branch
- Target: aarch64-linux-gnu
- Compiler flags: -Oz
- Hardware: APM Mustang 8x X-Gene1
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_apm/llvm-release-aarch64-spec2k6-Oz
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release…
Reproduce builds:
<cut>
mkdir investigate-llvm-1c3fcc8ae92ebfe9a9d1a21a288ad71ef7f98091
cd investigate-llvm-1c3fcc8ae92ebfe9a9d1a21a288ad71ef7f98091
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_apm-llvm-release… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach 1c3fcc8ae92ebfe9a9d1a21a288ad71ef7f98091
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach c8905f1bb304f1cfe297312ae0dda9946cb27594
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 1c3fcc8ae92ebfe9a9d1a21a288ad71ef7f98091
Author: Amy Kwan <amy.kwan1(a)ibm.com>
Date: Fri Sep 3 14:53:57 2021 -0400
[libc++][NFC] Mark values in gdb pretty print comparison functions as live to prevent values being optimized out.
It appears when testing LLVM 13 on Power, we run into failures with the
`libcxx/test/libcxx/gdb/gdb_pretty_printer_test.sh.cpp` test case optimizing
values out.
Despite some of the functions in the test already being marked with optnone,
adding the `MarkAsLive()` calls inside the pretty printer comparison functions
resolves the issue of the values being optimized out.
This patch aims to address https://llvm.org/PR51675.
Differential Revision: https://reviews.llvm.org/D109204
(cherry picked from commit 217c6d643124be312f4a99b203118744edb9d54c)
---
libcxx/test/libcxx/gdb/gdb_pretty_printer_test.sh.cpp | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/libcxx/test/libcxx/gdb/gdb_pretty_printer_test.sh.cpp b/libcxx/test/libcxx/gdb/gdb_pretty_printer_test.sh.cpp
index 2d8e9620089a..7c8d307d19fb 100644
--- a/libcxx/test/libcxx/gdb/gdb_pretty_printer_test.sh.cpp
+++ b/libcxx/test/libcxx/gdb/gdb_pretty_printer_test.sh.cpp
@@ -92,24 +92,28 @@ void MarkAsLive(Type &&) {}
template <typename TypeToPrint> void ComparePrettyPrintToChars(
TypeToPrint value,
const char *expectation) {
+ MarkAsLive(value);
StopForDebugger(&value, &expectation);
}
template <typename TypeToPrint> void ComparePrettyPrintToRegex(
TypeToPrint value,
const char *expectation) {
+ MarkAsLive(value);
StopForDebugger(&value, &expectation);
}
void CompareExpressionPrettyPrintToChars(
std::string value,
const char *expectation) {
+ MarkAsLive(value);
StopForDebugger(&value, &expectation);
}
void CompareExpressionPrettyPrintToRegex(
std::string value,
const char *expectation) {
+ MarkAsLive(value);
StopForDebugger(&value, &expectation);
}
</cut>
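The fix is easy to reproduce outside libc++'s test harness. A self-contained sketch of the pattern, assuming Clang's optnone attribute (the helper name mirrors the one in the test file):

// A never-optimized no-op "uses" its argument, so the optimizer has to
// keep the value materialized where a debugger can inspect it.
template <typename Type>
__attribute__((optnone)) void MarkAsLive(Type&&) {}

void Example() {
    int value = 42;
    MarkAsLive(value);  // without this, 'value' may be optimized out
    // a breakpoint placed here can still print 'value'
}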
After gcc commit c416c52bcdb120db5e8c53a51bd78c4360daf79b
Author: Nathan Sidwell <nathan(a)acm.org>
c++ ICE with nested requirement as default tpl parm[PR94827]
the following benchmarks slowed down by more than 2%:
- 456.hmmer slowed down by 4%
Benchmark:
- Toolchain: GCC + Glibc + GNU Linker
- Version: all components were built from their latest release branch
- Target: aarch64-linux-gnu
- Compiler flags: -O3 -flto
- Hardware: NVidia TX1 4x Cortex-A57
This commit has regressed these CI configurations:
- tcwg_bmk_gnu_tx1/gnu-release-aarch64-spec2k6-O3_LTO
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-release-a…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-release-a…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-release-a…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-release-a…
Reproduce builds:
<cut>
mkdir investigate-gcc-c416c52bcdb120db5e8c53a51bd78c4360daf79b
cd investigate-gcc-c416c52bcdb120db5e8c53a51bd78c4360daf79b
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-release-a… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-release-a… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_gnu-bisect-tcwg_bmk_tx1-gnu-release-a… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
cd gcc
# Reproduce first_bad build
git checkout --detach c416c52bcdb120db5e8c53a51bd78c4360daf79b
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach b1983f4582bbe060b7da83578acb9ed653681fc8
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit c416c52bcdb120db5e8c53a51bd78c4360daf79b
Author: Nathan Sidwell <nathan(a)acm.org>
Date: Thu Apr 30 08:23:16 2020 -0700
c++ ICE with nested requirement as default tpl parm[PR94827]
Template headers are not incrementally updated as we parse their parameters.
We maintain a dummy level until the closing > when we replace the dummy with
a real parameter set. requires processing was expecting a properly populated
arg_vec in current_template_parms, and then created a self-mapping of parameters
from that. But we don't need to do that; just teach map_arguments to look at
TREE_VALUE when args is NULL.
* constraint.cc (map_arguments): If ARGS is null, it's a
self-mapping of parms.
(finish_nested_requirement): Do not pass argified
current_template_parms to normalization.
(tsubst_nested_requirement): Don't assert no template parms.
---
gcc/cp/ChangeLog | 10 ++++++++++
gcc/cp/constraint.cc | 27 ++++++++++++++++-----------
gcc/testsuite/g++.dg/concepts/pr94827.C | 15 +++++++++++++++
3 files changed, 41 insertions(+), 11 deletions(-)
diff --git a/gcc/cp/ChangeLog b/gcc/cp/ChangeLog
index 1fa0e123cb1..3c57945cecf 100644
--- a/gcc/cp/ChangeLog
+++ b/gcc/cp/ChangeLog
@@ -1,3 +1,13 @@
+2020-04-30 Jason Merrill <jason(a)redhat.com>
+ Nathan Sidwell <nathan(a)acm.org>
+
+ PR c++/94827
+ * constraint.cc (map_arguments): If ARGS is null, it's a
+ self-mapping of parms.
+ (finish_nested_requirement): Do not pass argified
+ current_template_parms to normalization.
+ (tsubst_nested_requirement): Don't assert no template parms.
+
2020-04-30 Iain Sandoe <iain(a)sandoe.co.uk>
PR c++/94886
diff --git a/gcc/cp/constraint.cc b/gcc/cp/constraint.cc
index 866b0f51b05..85513fecf43 100644
--- a/gcc/cp/constraint.cc
+++ b/gcc/cp/constraint.cc
@@ -546,12 +546,16 @@ static tree
map_arguments (tree parms, tree args)
{
for (tree p = parms; p; p = TREE_CHAIN (p))
- {
- int level;
- int index;
- template_parm_level_and_index (TREE_VALUE (p), &level, &index);
- TREE_PURPOSE (p) = TMPL_ARG (args, level, index);
- }
+ if (args)
+ {
+ int level;
+ int index;
+ template_parm_level_and_index (TREE_VALUE (p), &level, &index);
+ TREE_PURPOSE (p) = TMPL_ARG (args, level, index);
+ }
+ else
+ TREE_PURPOSE (p) = TREE_VALUE (p);
+
return parms;
}
@@ -2005,8 +2009,6 @@ tsubst_compound_requirement (tree t, tree args, subst_info info)
static tree
tsubst_nested_requirement (tree t, tree args, subst_info info)
{
- gcc_assert (!uses_template_parms (args));
-
/* Ensure that we're in an evaluation context prior to satisfaction. */
tree norm = TREE_VALUE (TREE_TYPE (t));
tree result = satisfy_constraint (norm, args, info);
@@ -2953,12 +2955,15 @@ finish_compound_requirement (location_t loc, tree expr, tree type, bool noexcept
tree
finish_nested_requirement (location_t loc, tree expr)
{
+ /* Currently open template headers have dummy arg vectors, so don't
+ pass into normalization. */
+ tree norm = normalize_constraint_expression (expr, NULL_TREE, false);
+ tree args = current_template_parms
+ ? template_parms_to_args (current_template_parms) : NULL_TREE;
+
/* Save the normalized constraint and complete set of normalization
arguments with the requirement. We keep the complete set of arguments
around for re-normalization during diagnostics. */
- tree args = current_template_parms
- ? template_parms_to_args (current_template_parms) : NULL_TREE;
- tree norm = normalize_constraint_expression (expr, args, false);
tree info = build_tree_list (args, norm);
/* Build the constraint, saving its normalization as its type. */
diff --git a/gcc/testsuite/g++.dg/concepts/pr94827.C b/gcc/testsuite/g++.dg/concepts/pr94827.C
new file mode 100644
index 00000000000..f14ec2551a1
--- /dev/null
+++ b/gcc/testsuite/g++.dg/concepts/pr94827.C
@@ -0,0 +1,15 @@
+// PR 94827 ICE looking inside open template-parm level
+// { dg-do run { target c++17 } }
+// { dg-options -fconcepts }
+
+template <typename T,
+ bool X = requires { requires (sizeof(T)==1); } >
+ int foo(T) { return X; }
+
+int main() {
+ if (!foo('4'))
+ return 1;
+ if (foo (4))
+ return 2;
+ return 0;
+}
</cut>
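The essence of the map_arguments change can be sketched in a few lines of C++; everything here is a hypothetical simplification (the real code walks a TREE_LIST, pairing each parameter in TREE_VALUE with its mapped argument in TREE_PURPOSE):

#include <vector>

struct ParmNode {
    int value;    // the template parameter (TREE_VALUE analogue)
    int purpose;  // the argument it maps to (TREE_PURPOSE analogue)
};

void map_arguments(std::vector<ParmNode>& parms,
                   const std::vector<int>* args)
{
    for (auto& p : parms)
        p.purpose = args ? (*args)[p.value]  // normal case: look up arg
                         : p.value;          // no args: self-mapping
}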
After llvm commit f17d60d620283b5d53286056ceeaeb8c27b6530a
Author: Bjorn Pettersson <bjorn.a.pettersson(a)ericsson.com>
Inform pass manager when child loops are deleted
The reproducer instructions below can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggering benchmarking jobs if you don't have access to Linaro TCWG CI.
This commit has regressed these CI configurations:
- tcwg_bmk_llvm_tx1/llvm-release-aarch64-spec2k6-O2
First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-release…
Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-release…
Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-release…
Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-release…
Reproduce builds:
<cut>
mkdir investigate-llvm-f17d60d620283b5d53286056ceeaeb8c27b6530a
cd investigate-llvm-f17d60d620283b5d53286056ceeaeb8c27b6530a
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-release… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-release… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-release… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
cd llvm
# Reproduce first_bad build
git checkout --detach f17d60d620283b5d53286056ceeaeb8c27b6530a
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach f56129fe78d5c849971017976c71333b6b1a27c6
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit f17d60d620283b5d53286056ceeaeb8c27b6530a
Author: Bjorn Pettersson <bjorn.a.pettersson(a)ericsson.com>
Date: Fri Sep 3 20:50:33 2021 +0200
Inform pass manager when child loops are deleted
As part of the nontrivial unswitching we could end up removing child
loops. This patch add a notification to the pass manager when
that happens (using the markLoopAsDeleted callback).
Without this there could be stale LoopAccessAnalysis results cached
in the analysis manager. Those analysis results are cached based on
a Loop* as key. Since the BumpPtrAllocator used to allocate
Loop objects could be reset between different runs of, for
example, the loop-distribute pass (running on different functions),
a new Loop object could be created using the same Loop pointer.
And then when requiring the LoopAccessAnalysis for the loop we
got the stale (corrupt) result from the destroyed loop.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D109257
(fixes PR51754)
(cherry-picked from commit 0f0344dd1e3b53387bb396070916e67f4c426da6)
---
llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp | 43 +++++++++----
.../nontrivial-unswitch-markloopasdeleted.ll | 71 ++++++++++++++++++++++
2 files changed, 102 insertions(+), 12 deletions(-)
diff --git a/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp b/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
index b9cccc2af309..b1c105258027 100644
--- a/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
+++ b/llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
@@ -1587,10 +1587,12 @@ deleteDeadClonedBlocks(Loop &L, ArrayRef<BasicBlock *> ExitBlocks,
BB->eraseFromParent();
}
-static void deleteDeadBlocksFromLoop(Loop &L,
- SmallVectorImpl<BasicBlock *> &ExitBlocks,
- DominatorTree &DT, LoopInfo &LI,
- MemorySSAUpdater *MSSAU) {
+static void
+deleteDeadBlocksFromLoop(Loop &L,
+ SmallVectorImpl<BasicBlock *> &ExitBlocks,
+ DominatorTree &DT, LoopInfo &LI,
+ MemorySSAUpdater *MSSAU,
+ function_ref<void(Loop &, StringRef)> DestroyLoopCB) {
// Find all the dead blocks tied to this loop, and remove them from their
// successors.
SmallSetVector<BasicBlock *, 8> DeadBlockSet;
@@ -1640,6 +1642,7 @@ static void deleteDeadBlocksFromLoop(Loop &L,
}) &&
"If the child loop header is dead all blocks in the child loop must "
"be dead as well!");
+ DestroyLoopCB(*ChildL, ChildL->getName());
LI.destroy(ChildL);
return true;
});
@@ -1980,6 +1983,8 @@ static bool rebuildLoopAfterUnswitch(Loop &L, ArrayRef<BasicBlock *> ExitBlocks,
ParentL->removeChildLoop(llvm::find(*ParentL, &L));
else
LI.removeLoop(llvm::find(LI, &L));
+ // markLoopAsDeleted for L should be triggered by the caller (it is typically
+ // done by using the UnswitchCB callback).
LI.destroy(&L);
return false;
}
@@ -2019,7 +2024,8 @@ static void unswitchNontrivialInvariants(
SmallVectorImpl<BasicBlock *> &ExitBlocks, IVConditionInfo &PartialIVInfo,
DominatorTree &DT, LoopInfo &LI, AssumptionCache &AC,
function_ref<void(bool, bool, ArrayRef<Loop *>)> UnswitchCB,
- ScalarEvolution *SE, MemorySSAUpdater *MSSAU) {
+ ScalarEvolution *SE, MemorySSAUpdater *MSSAU,
+ function_ref<void(Loop &, StringRef)> DestroyLoopCB) {
auto *ParentBB = TI.getParent();
BranchInst *BI = dyn_cast<BranchInst>(&TI);
SwitchInst *SI = BI ? nullptr : cast<SwitchInst>(&TI);
@@ -2319,7 +2325,7 @@ static void unswitchNontrivialInvariants(
// Now that our cloned loops have been built, we can update the original loop.
// First we delete the dead blocks from it and then we rebuild the loop
// structure taking these deletions into account.
- deleteDeadBlocksFromLoop(L, ExitBlocks, DT, LI, MSSAU);
+ deleteDeadBlocksFromLoop(L, ExitBlocks, DT, LI, MSSAU, DestroyLoopCB);
if (MSSAU && VerifyMemorySSA)
MSSAU->getMemorySSA()->verifyMemorySSA();
@@ -2670,7 +2676,8 @@ static bool unswitchBestCondition(
Loop &L, DominatorTree &DT, LoopInfo &LI, AssumptionCache &AC,
AAResults &AA, TargetTransformInfo &TTI,
function_ref<void(bool, bool, ArrayRef<Loop *>)> UnswitchCB,
- ScalarEvolution *SE, MemorySSAUpdater *MSSAU) {
+ ScalarEvolution *SE, MemorySSAUpdater *MSSAU,
+ function_ref<void(Loop &, StringRef)> DestroyLoopCB) {
// Collect all invariant conditions within this loop (as opposed to an inner
// loop which would be handled when visiting that inner loop).
SmallVector<std::pair<Instruction *, TinyPtrVector<Value *>>, 4>
@@ -2958,7 +2965,7 @@ static bool unswitchBestCondition(
<< "\n");
unswitchNontrivialInvariants(L, *BestUnswitchTI, BestUnswitchInvariants,
ExitBlocks, PartialIVInfo, DT, LI, AC,
- UnswitchCB, SE, MSSAU);
+ UnswitchCB, SE, MSSAU, DestroyLoopCB);
return true;
}
@@ -2988,7 +2995,8 @@ unswitchLoop(Loop &L, DominatorTree &DT, LoopInfo &LI, AssumptionCache &AC,
AAResults &AA, TargetTransformInfo &TTI, bool Trivial,
bool NonTrivial,
function_ref<void(bool, bool, ArrayRef<Loop *>)> UnswitchCB,
- ScalarEvolution *SE, MemorySSAUpdater *MSSAU) {
+ ScalarEvolution *SE, MemorySSAUpdater *MSSAU,
+ function_ref<void(Loop &, StringRef)> DestroyLoopCB) {
assert(L.isRecursivelyLCSSAForm(DT, LI) &&
"Loops must be in LCSSA form before unswitching.");
@@ -3036,7 +3044,8 @@ unswitchLoop(Loop &L, DominatorTree &DT, LoopInfo &LI, AssumptionCache &AC,
// Try to unswitch the best invariant condition. We prefer this full unswitch to
// a partial unswitch when possible below the threshold.
- if (unswitchBestCondition(L, DT, LI, AC, AA, TTI, UnswitchCB, SE, MSSAU))
+ if (unswitchBestCondition(L, DT, LI, AC, AA, TTI, UnswitchCB, SE, MSSAU,
+ DestroyLoopCB))
return true;
// No other opportunities to unswitch.
@@ -3083,6 +3092,10 @@ PreservedAnalyses SimpleLoopUnswitchPass::run(Loop &L, LoopAnalysisManager &AM,
U.markLoopAsDeleted(L, LoopName);
};
+ auto DestroyLoopCB = [&U](Loop &L, StringRef Name) {
+ U.markLoopAsDeleted(L, Name);
+ };
+
Optional<MemorySSAUpdater> MSSAU;
if (AR.MSSA) {
MSSAU = MemorySSAUpdater(AR.MSSA);
@@ -3091,7 +3104,8 @@ PreservedAnalyses SimpleLoopUnswitchPass::run(Loop &L, LoopAnalysisManager &AM,
}
if (!unswitchLoop(L, AR.DT, AR.LI, AR.AC, AR.AA, AR.TTI, Trivial, NonTrivial,
UnswitchCB, &AR.SE,
- MSSAU.hasValue() ? MSSAU.getPointer() : nullptr))
+ MSSAU.hasValue() ? MSSAU.getPointer() : nullptr,
+ DestroyLoopCB))
return PreservedAnalyses::all();
if (AR.MSSA && VerifyMemorySSA)
@@ -3179,12 +3193,17 @@ bool SimpleLoopUnswitchLegacyPass::runOnLoop(Loop *L, LPPassManager &LPM) {
LPM.markLoopAsDeleted(*L);
};
+ auto DestroyLoopCB = [&LPM](Loop &L, StringRef /* Name */) {
+ LPM.markLoopAsDeleted(L);
+ };
+
if (MSSA && VerifyMemorySSA)
MSSA->verifyMemorySSA();
bool Changed =
unswitchLoop(*L, DT, LI, AC, AA, TTI, true, NonTrivial, UnswitchCB, SE,
- MSSAU.hasValue() ? MSSAU.getPointer() : nullptr);
+ MSSAU.hasValue() ? MSSAU.getPointer() : nullptr,
+ DestroyLoopCB);
if (MSSA && VerifyMemorySSA)
MSSA->verifyMemorySSA();
diff --git a/llvm/test/Transforms/SimpleLoopUnswitch/nontrivial-unswitch-markloopasdeleted.ll b/llvm/test/Transforms/SimpleLoopUnswitch/nontrivial-unswitch-markloopasdeleted.ll
new file mode 100644
index 000000000000..455a38535576
--- /dev/null
+++ b/llvm/test/Transforms/SimpleLoopUnswitch/nontrivial-unswitch-markloopasdeleted.ll
@@ -0,0 +1,71 @@
+; RUN: opt < %s -enable-loop-distribute -passes='loop-distribute,loop-mssa(simple-loop-unswitch<nontrivial>),loop-distribute' -o /dev/null -S -debug-pass-manager=verbose 2>&1 | FileCheck %s
+
+
+; Running loop-distribute will result in LoopAccessAnalysis being required and
+; cached in the LoopAnalysisManagerFunctionProxy.
+;
+; CHECK: Running analysis: LoopAccessAnalysis on Loop at depth 2 containing: %loop_a_inner<header><latch><exiting>
+
+
+; Then simple-loop-unswitch is removing/replacing some loops (resulting in
+; Loop objects used as key in the analyses cache is destroyed). So here we
+; want to see that any analysis results cached on the destroyed loop is
+; cleared. A special case here is that loop_a_inner is destroyed when
+; unswitching the parent loop.
+;
+; The bug solved and verified by this test case was related to the
+; SimpleLoopUnswitch not marking the Loop as removed, so we missed clearing
+; the analysis caches.
+;
+; CHECK: Running pass: SimpleLoopUnswitchPass on Loop at depth 1 containing: %loop_begin<header>,%loop_b,%loop_b_inner,%loop_b_inner_exit,%loop_a,%loop_a_inner,%loop_a_inner_exit,%latch<latch><exiting>
+; CHECK-NEXT: Clearing all analysis results for: loop_a_inner
+
+
+; When running loop-distribute the second time we can see that loop_a_inner
+; isn't analysed because the loop no longer exists (instead we find a new loop,
+; loop_a_inner.us). This kind of verifies that it was correct to remove the
+; loop_a_inner related analysis above.
+;
+; CHECK: Running analysis: LoopAccessAnalysis on Loop at depth 2 containing: %loop_a_inner.us<header><latch><exiting>
+
+
+define i32 @test6(i1* %ptr, i1 %cond1, i32* %a.ptr, i32* %b.ptr) {
+entry:
+ br label %loop_begin
+
+loop_begin:
+ %v = load i1, i1* %ptr
+ br i1 %cond1, label %loop_a, label %loop_b
+
+loop_a:
+ br label %loop_a_inner
+
+loop_a_inner:
+ %va = load i1, i1* %ptr
+ %a = load i32, i32* %a.ptr
+ br i1 %va, label %loop_a_inner, label %loop_a_inner_exit
+
+loop_a_inner_exit:
+ %a.lcssa = phi i32 [ %a, %loop_a_inner ]
+ br label %latch
+
+loop_b:
+ br label %loop_b_inner
+
+loop_b_inner:
+ %vb = load i1, i1* %ptr
+ %b = load i32, i32* %b.ptr
+ br i1 %vb, label %loop_b_inner, label %loop_b_inner_exit
+
+loop_b_inner_exit:
+ %b.lcssa = phi i32 [ %b, %loop_b_inner ]
+ br label %latch
+
+latch:
+ %ab.phi = phi i32 [ %a.lcssa, %loop_a_inner_exit ], [ %b.lcssa, %loop_b_inner_exit ]
+ br i1 %v, label %loop_begin, label %loop_exit
+
+loop_exit:
+ %ab.lcssa = phi i32 [ %ab.phi, %latch ]
+ ret i32 %ab.lcssa
+}
</cut>
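A hypothetical, self-contained C++ illustration of the bug class fixed here: results cached under an object's address go stale when the object is destroyed without telling the cache, because a later allocation can hand out the same address again:

#include <cstdio>
#include <string>
#include <unordered_map>

struct Loop { std::string name; };

// Analysis results keyed by Loop address, as in LoopAnalysisManager.
std::unordered_map<const Loop*, std::string> analysis_cache;

// The markLoopAsDeleted analogue: must run before the Loop is freed.
void on_loop_deleted(const Loop* l) { analysis_cache.erase(l); }

int main()
{
    Loop* a = new Loop{"loop_a_inner"};
    analysis_cache[a] = "result for loop_a_inner";
    on_loop_deleted(a);   // the fix: notify the cache before destruction
    delete a;
    Loop* b = new Loop{"loop_b"};  // may reuse a's address; without the
                                   // notification above, a lookup for b
                                   // would return loop_a_inner's result
    std::printf("cached entries: %zu\n", analysis_cache.size());
    delete b;
}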
Identified regression caused by *gcc:76b75018b3d053a890ebe155e47814de14b3c9fb*:
commit 76b75018b3d053a890ebe155e47814de14b3c9fb
Author: Jason Merrill <jason(a)redhat.com>
c++: implement C++17 hardware interference size
Results regressed to (for first_bad == 76b75018b3d053a890ebe155e47814de14b3c9fb)
# reset_artifacts:
-10
# true:
0
# build_abe binutils:
1
# build_abe stage1:
2
# build_abe linux:
3
# build_abe glibc:
4
# First few build errors in logs:
from (for last_good == 8ea292591e42aa4d52b4b7a00b86335bfd2e2e85)
# reset_artifacts:
-10
# true:
0
# build_abe binutils:
1
# build_abe stage1:
2
# build_abe linux:
3
# build_abe glibc:
4
# build_abe stage2:
5
# build_abe gdb:
6
# build_abe qemu:
7
This commit has regressed these CI configurations:
- tcwg_gnu_cross_build/master-aarch64
Artifacts of last_good build: https://ci.linaro.org/job/tcwg_gnu_cross_build-bisect-master-aarch64/2/arti…
Artifacts of first_bad build: https://ci.linaro.org/job/tcwg_gnu_cross_build-bisect-master-aarch64/2/arti…
Even more details: https://ci.linaro.org/job/tcwg_gnu_cross_build-bisect-master-aarch64/2/arti…
Reproduce builds:
<cut>
mkdir investigate-gcc-76b75018b3d053a890ebe155e47814de14b3c9fb
cd investigate-gcc-76b75018b3d053a890ebe155e47814de14b3c9fb
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_gnu_cross_build-bisect-master-aarch64/2/arti… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_gnu_cross_build-bisect-master-aarch64/2/arti… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_gnu_cross_build-bisect-master-aarch64/2/arti… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_gnu-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
cd gcc
# Reproduce first_bad build
git checkout --detach 76b75018b3d053a890ebe155e47814de14b3c9fb
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 8ea292591e42aa4d52b4b7a00b86335bfd2e2e85
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 76b75018b3d053a890ebe155e47814de14b3c9fb
Author: Jason Merrill <jason(a)redhat.com>
Date: Thu Jul 15 15:30:17 2021 -0400
c++: implement C++17 hardware interference size
The last missing piece of the C++17 standard library is the hardware
interference size constants. Much of the delay in implementing these has
been due to uncertainty about what the right values are, and even whether
there is a single constant value that is suitable; the destructive
interference size is intended to be used in structure layout, so program
ABIs will depend on it.
In principle, both of these values should be the same as the target's L1
cache line size. When compiling for a generic target that is intended to
support a range of target CPUs with different cache line sizes, the
constructive size should probably be the minimum size, and the destructive
size the maximum, unless you are constrained by ABI compatibility with
previous code.
From discussion on gcc-patches, I've come to the conclusion that the
solution to the difficulty of choosing stable values is to give up on it,
and instead encourage only uses where ABI stability is unimportant: in
particular, uses where the ABI is shared at most between translation units
built at the same time with the same flags.
To that end, I've added a warning for any use of the constant value of
std::hardware_destructive_interference_size in a header or module export.
Appropriate uses within a project can disable the warning.
A previous iteration of this patch included an -finterference-tune flag to
make the value vary with -mtune; this iteration makes that the default
behavior, which should be appropriate for all reasonable uses of the
variable. The previous default of "stable-ish" seems to me likely to have
been more of an attractive nuisance; since we can't promise actual
stability, we should instead make proper uses more convenient.
JF Bastien's implementation proposal is summarized at
https://github.com/itanium-cxx-abi/cxx-abi/issues/74
I implement this by adding new --params for the two sizes. Targets can
override these values in targetm.target_option.override() to support a range
of values for the generic target; otherwise, both will default to the L1
cache line size.
64 bytes still seems correct for all x86.
I'm not sure why he proposed 64/64 for generic 32-bit ARM, since the Cortex
A9 has a 32-byte cache line; I'd think 32/64 would make more sense.
He proposed 64/128 for generic AArch64, but since the A64FX now has a 256B
cache line, I've changed that to 64/256.
Other arch maintainers are invited to set ranges for their generic targets
if that seems better than using the default cache line size for both values.
With the above choice to reject stability as a goal, getting these values
"right" is now just a matter of what we want the default optimization to be,
and we can feel free to adjust them as CPUs with different cache lines
become more and less common.
gcc/ChangeLog:
* params.opt: Add destructive-interference-size and
constructive-interference-size.
* doc/invoke.texi: Document them.
* config/aarch64/aarch64.c (aarch64_override_options_internal):
Set them.
* config/arm/arm.c (arm_option_override): Set them.
* config/i386/i386-options.c (ix86_option_override_internal):
Set them.
gcc/c-family/ChangeLog:
* c.opt: Add -Winterference-size.
* c-cppbuiltin.c (cpp_atomic_builtins): Add __GCC_DESTRUCTIVE_SIZE
and __GCC_CONSTRUCTIVE_SIZE.
gcc/cp/ChangeLog:
* constexpr.c (maybe_warn_about_constant_value):
Complain about std::hardware_destructive_interference_size.
(cxx_eval_constant_expression): Call it.
* decl.c (cxx_init_decl_processing): Check
--param *-interference-size values.
libstdc++-v3/ChangeLog:
* include/std/version: Define __cpp_lib_hardware_interference_size.
* libsupc++/new: Define hardware interference size variables.
gcc/testsuite/ChangeLog:
* g++.dg/warn/Winterference.H: New file.
* g++.dg/warn/Winterference.C: New test.
* g++.target/aarch64/interference.C: New test.
* g++.target/arm/interference.C: New test.
* g++.target/i386/interference.C: New test.
---
gcc/c-family/c-cppbuiltin.c | 14 ++++++
gcc/c-family/c.opt | 5 ++
gcc/config/aarch64/aarch64.c | 22 +++++++++
gcc/config/arm/arm.c | 22 +++++++++
gcc/config/i386/i386-options.c | 6 +++
gcc/cp/constexpr.c | 33 +++++++++++++
gcc/cp/decl.c | 32 ++++++++++++
gcc/doc/invoke.texi | 65 +++++++++++++++++++++++++
gcc/params.opt | 16 ++++++
gcc/testsuite/g++.dg/warn/Winterference-2.C | 14 ++++++
gcc/testsuite/g++.dg/warn/Winterference.C | 6 +++
gcc/testsuite/g++.dg/warn/Winterference.H | 7 +++
gcc/testsuite/g++.target/aarch64/interference.C | 9 ++++
gcc/testsuite/g++.target/arm/interference.C | 9 ++++
gcc/testsuite/g++.target/i386/interference.C | 8 +++
libstdc++-v3/include/std/version | 3 ++
libstdc++-v3/libsupc++/new | 10 +++-
17 files changed, 279 insertions(+), 2 deletions(-)
diff --git a/gcc/c-family/c-cppbuiltin.c b/gcc/c-family/c-cppbuiltin.c
index 48cbefd8bf8..ce88e707127 100644
--- a/gcc/c-family/c-cppbuiltin.c
+++ b/gcc/c-family/c-cppbuiltin.c
@@ -741,6 +741,20 @@ cpp_atomic_builtins (cpp_reader *pfile)
builtin_define_with_int_value ("__GCC_ATOMIC_TEST_AND_SET_TRUEVAL",
targetm.atomic_test_and_set_trueval);
+ /* Macros for C++17 hardware interference size constants. Either both or
+ neither should be set. */
+ gcc_assert (!param_destruct_interfere_size
+ == !param_construct_interfere_size);
+ if (param_destruct_interfere_size)
+ {
+ /* FIXME The way of communicating these values to the library should be
+ part of the C++ ABI, whether macro or builtin. */
+ builtin_define_with_int_value ("__GCC_DESTRUCTIVE_SIZE",
+ param_destruct_interfere_size);
+ builtin_define_with_int_value ("__GCC_CONSTRUCTIVE_SIZE",
+ param_construct_interfere_size);
+ }
+
/* ptr_type_node can't be used here since ptr_mode is only set when
toplev calls backend_init which is not done with -E or pch. */
psize = POINTER_SIZE_UNITS;
diff --git a/gcc/c-family/c.opt b/gcc/c-family/c.opt
index c5fe90003f2..9c151d19870 100644
--- a/gcc/c-family/c.opt
+++ b/gcc/c-family/c.opt
@@ -722,6 +722,11 @@ Winit-list-lifetime
C++ ObjC++ Var(warn_init_list) Warning Init(1)
Warn about uses of std::initializer_list that can result in dangling pointers.
+Winterference-size
+C++ ObjC++ Var(warn_interference_size) Warning Init(1)
+Warn about nonsensical values of --param destructive-interference-size or
+constructive-interference-size.
+
Wimplicit
C ObjC Var(warn_implicit) Warning LangEnabledBy(C ObjC,Wall)
Warn about implicit declarations.
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 30d9a0b7a3d..36519ccc5a5 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -16540,6 +16540,28 @@ aarch64_override_options_internal (struct gcc_options *opts)
SET_OPTION_IF_UNSET (opts, &global_options_set,
param_l1_cache_line_size,
aarch64_tune_params.prefetch->l1_cache_line_size);
+
+ if (aarch64_tune_params.prefetch->l1_cache_line_size >= 0)
+ {
+ SET_OPTION_IF_UNSET (opts, &global_options_set,
+ param_destruct_interfere_size,
+ aarch64_tune_params.prefetch->l1_cache_line_size);
+ SET_OPTION_IF_UNSET (opts, &global_options_set,
+ param_construct_interfere_size,
+ aarch64_tune_params.prefetch->l1_cache_line_size);
+ }
+ else
+ {
+ /* For a generic AArch64 target, cover the current range of cache line
+ sizes. */
+ SET_OPTION_IF_UNSET (opts, &global_options_set,
+ param_destruct_interfere_size,
+ 256);
+ SET_OPTION_IF_UNSET (opts, &global_options_set,
+ param_construct_interfere_size,
+ 64);
+ }
+
if (aarch64_tune_params.prefetch->l2_cache_size >= 0)
SET_OPTION_IF_UNSET (opts, &global_options_set,
param_l2_cache_size,
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index f1e628253d0..6c6e77fab66 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -3669,6 +3669,28 @@ arm_option_override (void)
SET_OPTION_IF_UNSET (&global_options, &global_options_set,
param_l1_cache_line_size,
current_tune->prefetch.l1_cache_line_size);
+ if (current_tune->prefetch.l1_cache_line_size >= 0)
+ {
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_destruct_interfere_size,
+ current_tune->prefetch.l1_cache_line_size);
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_construct_interfere_size,
+ current_tune->prefetch.l1_cache_line_size);
+ }
+ else
+ {
+ /* For a generic ARM target, JF Bastien proposed using 64 for both. */
+ /* ??? Cortex A9 has a 32-byte cache line, so why not 32 for
+ constructive? */
+ /* More recent Cortex chips have a 64-byte cache line, but are marked
+ ARM_PREFETCH_NOT_BENEFICIAL, so they get these defaults. */
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_destruct_interfere_size, 64);
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_construct_interfere_size, 64);
+ }
+
if (current_tune->prefetch.l1_cache_size >= 0)
SET_OPTION_IF_UNSET (&global_options, &global_options_set,
param_l1_cache_size,
diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
index 2cb87cedec0..c0006b3674b 100644
--- a/gcc/config/i386/i386-options.c
+++ b/gcc/config/i386/i386-options.c
@@ -2579,6 +2579,12 @@ ix86_option_override_internal (bool main_args_p,
SET_OPTION_IF_UNSET (opts, opts_set, param_l2_cache_size,
ix86_tune_cost->l2_cache_size);
+ /* 64B is the accepted value for these for all x86. */
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_destruct_interfere_size, 64);
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_construct_interfere_size, 64);
+
/* Enable sw prefetching at -O3 for CPUS that prefetching is helpful. */
if (opts->x_flag_prefetch_loop_arrays < 0
&& HAVE_prefetch
diff --git a/gcc/cp/constexpr.c b/gcc/cp/constexpr.c
index 7772fe62d95..0c2498aee22 100644
--- a/gcc/cp/constexpr.c
+++ b/gcc/cp/constexpr.c
@@ -6075,6 +6075,37 @@ inline_asm_in_constexpr_error (location_t loc)
"%<constexpr%> function in C++20");
}
+/* We're getting the constant value of DECL in a manifestly constant-evaluated
+ context; maybe complain about that. */
+
+static void
+maybe_warn_about_constant_value (location_t loc, tree decl)
+{
+ static bool explained = false;
+ if (cxx_dialect >= cxx17
+ && warn_interference_size
+ && !global_options_set.x_param_destruct_interfere_size
+ && DECL_CONTEXT (decl) == std_node
+ && id_equal (DECL_NAME (decl), "hardware_destructive_interference_size")
+ && (LOCATION_FILE (input_location) != main_input_filename
+ || module_exporting_p ())
+ && warning_at (loc, OPT_Winterference_size, "use of %qD", decl)
+ && !explained)
+ {
+ explained = true;
+ inform (loc, "its value can vary between compiler versions or "
+ "with different %<-mtune%> or %<-mcpu%> flags");
+ inform (loc, "if this use is part of a public ABI, change it to "
+ "instead use a constant variable you define");
+ inform (loc, "the default value for the current CPU tuning "
+ "is %d bytes", param_destruct_interfere_size);
+ inform (loc, "you can stabilize this value with %<--param "
+ "hardware_destructive_interference_size=%d%>, or disable "
+ "this warning with %<-Wno-interference-size%>",
+ param_destruct_interfere_size);
+ }
+}
+
/* Attempt to reduce the expression T to a constant value.
On failure, issue diagnostic and return error_mark_node. */
/* FIXME unify with c_fully_fold */
@@ -6219,6 +6250,8 @@ cxx_eval_constant_expression (const constexpr_ctx *ctx, tree t,
r = *p;
break;
}
+ if (ctx->manifestly_const_eval)
+ maybe_warn_about_constant_value (loc, t);
if (COMPLETE_TYPE_P (TREE_TYPE (t))
&& is_really_empty_class (TREE_TYPE (t), /*ignore_vptr*/false))
{
diff --git a/gcc/cp/decl.c b/gcc/cp/decl.c
index bce62ad202a..c2065027369 100644
--- a/gcc/cp/decl.c
+++ b/gcc/cp/decl.c
@@ -4752,6 +4752,38 @@ cxx_init_decl_processing (void)
/* Show we use EH for cleanups. */
if (flag_exceptions)
using_eh_for_cleanups ();
+
+ /* Check that the hardware interference sizes are at least
+ alignof(max_align_t), as required by the standard. */
+ const int max_align = max_align_t_align () / BITS_PER_UNIT;
+ if (param_destruct_interfere_size)
+ {
+ if (param_destruct_interfere_size < max_align)
+ error ("%<--param destructive-interference-size=%d%> is less than "
+ "%d", param_destruct_interfere_size, max_align);
+ else if (param_destruct_interfere_size < param_l1_cache_line_size)
+ warning (OPT_Winterference_size,
+ "%<--param destructive-interference-size=%d%> "
+ "is less than %<--param l1-cache-line-size=%d%>",
+ param_destruct_interfere_size, param_l1_cache_line_size);
+ }
+ else if (param_l1_cache_line_size >= max_align)
+ param_destruct_interfere_size = param_l1_cache_line_size;
+ /* else leave it unset. */
+
+ if (param_construct_interfere_size)
+ {
+ if (param_construct_interfere_size < max_align)
+ error ("%<--param constructive-interference-size=%d%> is less than "
+ "%d", param_construct_interfere_size, max_align);
+ else if (param_construct_interfere_size > param_l1_cache_line_size)
+ warning (OPT_Winterference_size,
+ "%<--param constructive-interference-size=%d%> "
+ "is greater than %<--param l1-cache-line-size=%d%>",
+ param_construct_interfere_size, param_l1_cache_line_size);
+ }
+ else if (param_l1_cache_line_size >= max_align)
+ param_construct_interfere_size = param_l1_cache_line_size;
}
/* Enter an abi node in global-module context. returns a cookie to
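The lower bound checked here comes from the comment above: the standard requires both constants to be at least alignof(max_align_t). A quick way to see that bound on a given target (my sketch, not from the patch; on x86-64 glibc it typically prints 16):

// alignbound.cc -- print the minimum legal interference size
#include <cstddef>
#include <cstdio>

int main()
{
  // --param values below this are rejected by the error() paths above.
  std::printf("alignof(std::max_align_t) = %zu\n", alignof(std::max_align_t));
  return 0;
}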
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 23cc68f92b5..78cfc100ac2 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -9018,6 +9018,43 @@ that has already been done in the current function. Therefore,
seemingly insignificant changes in the source program can cause the
warnings produced by @option{-Winline} to appear or disappear.
+@item -Winterference-size
+@opindex Winterference-size
+Warn about use of C++17 @code{std::hardware_destructive_interference_size}
+without specifying its value with @option{--param destructive-interference-size}.
+Also warn about questionable values for that option.
+
+This variable is intended to be used for controlling class layout, to
+avoid false sharing in concurrent code:
+
+@smallexample
+struct independent_fields @{
+ alignas(std::hardware_destructive_interference_size) std::atomic<int> one;
+ alignas(std::hardware_destructive_interference_size) std::atomic<int> two;
+@};
+@end smallexample
+
+Here @samp{one} and @samp{two} are intended to be far enough apart
+that stores to one won't require accesses to the other to reload the
+cache line.
+
+By default, @option{--param destructive-interference-size} and
+@option{--param constructive-interference-size} are set based on the
+current @option{-mtune} option, typically to the L1 cache line size
+for the particular target CPU, sometimes to a range if tuning for a
+generic target. So all translation units that depend on ABI
+compatibility for the use of these variables must be compiled with
+the same @option{-mtune} (or @option{-mcpu}).
+
+If ABI stability is important, such as if the use is in a header for a
+library, you should probably not use the hardware interference size
+variables at all. Alternatively, you can force a particular value
+with @option{--param}.
+
+If you are confident that your use of the variable does not affect ABI
+outside a single build of your project, you can turn off the warning
+with @option{-Wno-interference-size}.
+
@item -Wint-in-bool-context
@opindex Wint-in-bool-context
@opindex Wno-int-in-bool-context
@@ -13938,6 +13975,34 @@ prefetch hints can be issued for any constant stride.
This setting is only useful for strides that are known and constant.
+@item destructive-interference-size
+@item constructive-interference-size
+The values for the C++17 variables
+@code{std::hardware_destructive_interference_size} and
+@code{std::hardware_constructive_interference_size}. The destructive
+interference size is the minimum recommended offset between two
+independent concurrently-accessed objects; the constructive
+interference size is the maximum recommended size of contiguous memory
+accessed together. Typically both will be the size of an L1 cache
+line for the target, in bytes. For a generic target covering a range of L1
+cache line sizes, typically the constructive interference size will be
+the small end of the range and the destructive size will be the large
+end.
+
+The destructive interference size is intended to be used for layout,
+and thus has ABI impact. The default value is not expected to be
+stable, and on some targets varies with @option{-mtune}, so use of
+this variable in a context where ABI stability is important, such as
+the public interface of a library, is strongly discouraged; if it is
+used in that context, users can stabilize the value using this
+option.
+
+The constructive interference size is less sensitive, as it is
+typically only used in a @samp{static_assert} to make sure that a type
+fits within a cache line.
+
+See also @option{-Winterference-size}.
+
@item loop-interchange-max-num-stmts
The maximum number of stmts in a loop to be interchanged.
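As the documentation above recommends for ABI-sensitive uses, a project can define its own constant instead of the std:: variable; the value is then fixed by the project rather than by -mtune. A sketch of that pattern (names are illustrative, not from the patch):

// my_layout.h -- project-owned replacement for the std:: constant
#include <atomic>
#include <cstddef>

// Chosen once for the project's ABI; independent of compiler version
// and tuning flags. 64 matches the common L1 line size assumption.
inline constexpr std::size_t my_destructive_interference_size = 64;

struct independent_fields
{
  alignas(my_destructive_interference_size) std::atomic<int> one;
  alignas(my_destructive_interference_size) std::atomic<int> two;
};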
diff --git a/gcc/params.opt b/gcc/params.opt
index 3a701e22c46..658ca028851 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -361,6 +361,22 @@ The maximum code size growth ratio when expanding into a jump table (in percent)
Common Joined UInteger Var(param_l1_cache_line_size) Init(32) Param Optimization
The size of L1 cache line.
+-param=destructive-interference-size=
+Common Joined UInteger Var(param_destruct_interfere_size) Init(0) Param Optimization
+The minimum recommended offset between two concurrently-accessed objects to
+avoid additional performance degradation due to contention introduced by the
+implementation. Typically the L1 cache line size, but can be larger to
+accommodate a variety of target processors with different cache line sizes.
+C++17 code might use this value in structure layout, but is strongly
+discouraged from doing so in public ABIs.
+
+-param=constructive-interference-size=
+Common Joined UInteger Var(param_construct_interfere_size) Init(0) Param Optimization
+The maximum recommended size of contiguous memory occupied by two objects
+accessed with temporal locality by concurrent threads. Typically the L1 cache
+line size, but can be smaller to accommodate a variety of target processors with
+different cache line sizes.
+
-param=l1-cache-size=
Common Joined UInteger Var(param_l1_cache_size) Init(64) Param Optimization
The size of L1 cache.
diff --git a/gcc/testsuite/g++.dg/warn/Winterference-2.C b/gcc/testsuite/g++.dg/warn/Winterference-2.C
new file mode 100644
index 00000000000..2af75c63f83
--- /dev/null
+++ b/gcc/testsuite/g++.dg/warn/Winterference-2.C
@@ -0,0 +1,14 @@
+// { dg-do compile { target c++20 } }
+// { dg-additional-options -fmodules-ts }
+
+module ;
+
+#include <new>
+
+export module foo;
+
+export {
+ struct A {
+ alignas(std::hardware_destructive_interference_size) int x; // { dg-warning Winterference-size }
+ };
+}
diff --git a/gcc/testsuite/g++.dg/warn/Winterference.C b/gcc/testsuite/g++.dg/warn/Winterference.C
new file mode 100644
index 00000000000..57c001bc032
--- /dev/null
+++ b/gcc/testsuite/g++.dg/warn/Winterference.C
@@ -0,0 +1,6 @@
+// Test that we warn about use of std::hardware_destructive_interference_size
+// in a header.
+// { dg-do compile { target c++17 } }
+
+// { dg-warning Winterference-size "" { target *-*-* } 0 }
+#include "Winterference.H"
diff --git a/gcc/testsuite/g++.dg/warn/Winterference.H b/gcc/testsuite/g++.dg/warn/Winterference.H
new file mode 100644
index 00000000000..36f0ad5f6d1
--- /dev/null
+++ b/gcc/testsuite/g++.dg/warn/Winterference.H
@@ -0,0 +1,7 @@
+#include <new>
+
+struct A
+{
+ alignas(std::hardware_destructive_interference_size) int i;
+ alignas(std::hardware_destructive_interference_size) int j;
+};
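Besides passing -Wno-interference-size on the command line, a project-internal header can plausibly opt out locally with GCC's generic diagnostic pragmas (an assumption on my part; the patch adds no test for pragma suppression):

// internal_layout.h -- header that knowingly accepts the tuning-dependent value
#include <new>

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Winterference-size"
struct B
{
  // ABI is shared only within this build, so the varying value is fine here.
  alignas(std::hardware_destructive_interference_size) int i;
  alignas(std::hardware_destructive_interference_size) int j;
};
#pragma GCC diagnostic pop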
diff --git a/gcc/testsuite/g++.target/aarch64/interference.C b/gcc/testsuite/g++.target/aarch64/interference.C
new file mode 100644
index 00000000000..0fc01655223
--- /dev/null
+++ b/gcc/testsuite/g++.target/aarch64/interference.C
@@ -0,0 +1,9 @@
+// Test C++17 hardware interference size constants
+// { dg-do compile { target c++17 } }
+
+#include <new>
+
+// Most AArch64 CPUs have an L1 cache line size of 64, but some recent ones use
+// 128 or even 256.
+static_assert(std::hardware_destructive_interference_size == 256);
+static_assert(std::hardware_constructive_interference_size == 64);
diff --git a/gcc/testsuite/g++.target/arm/interference.C b/gcc/testsuite/g++.target/arm/interference.C
new file mode 100644
index 00000000000..34fe8a52bff
--- /dev/null
+++ b/gcc/testsuite/g++.target/arm/interference.C
@@ -0,0 +1,9 @@
+// Test C++17 hardware interference size constants
+// { dg-do compile { target c++17 } }
+
+#include <new>
+
+// Recent ARM CPUs have a cache line size of 64. Older ones have
+// a size of 32, but I guess they're old enough that we don't care?
+static_assert(std::hardware_destructive_interference_size == 64);
+static_assert(std::hardware_constructive_interference_size == 64);
diff --git a/gcc/testsuite/g++.target/i386/interference.C b/gcc/testsuite/g++.target/i386/interference.C
new file mode 100644
index 00000000000..c7b910e3ada
--- /dev/null
+++ b/gcc/testsuite/g++.target/i386/interference.C
@@ -0,0 +1,8 @@
+// Test C++17 hardware interference size constants
+// { dg-do compile { target c++17 } }
+
+#include <new>
+
+// It is generally agreed that these are the right values for all x86.
+static_assert(std::hardware_destructive_interference_size == 64);
+static_assert(std::hardware_constructive_interference_size == 64);
diff --git a/libstdc++-v3/include/std/version b/libstdc++-v3/include/std/version
index f950bf0f0db..f41004b5911 100644
--- a/libstdc++-v3/include/std/version
+++ b/libstdc++-v3/include/std/version
@@ -140,6 +140,9 @@
#define __cpp_lib_filesystem 201703
#define __cpp_lib_gcd 201606
#define __cpp_lib_gcd_lcm 201606
+#ifdef __GCC_DESTRUCTIVE_SIZE
+# define __cpp_lib_hardware_interference_size 201703L
+#endif
#define __cpp_lib_hypot 201603
#define __cpp_lib_invoke 201411L
#define __cpp_lib_lcm 201606
diff --git a/libstdc++-v3/libsupc++/new b/libstdc++-v3/libsupc++/new
index 3349b13fd1b..7bc67a6cb02 100644
--- a/libstdc++-v3/libsupc++/new
+++ b/libstdc++-v3/libsupc++/new
@@ -183,9 +183,9 @@ inline void operator delete[](void*, void*) _GLIBCXX_USE_NOEXCEPT { }
} // extern "C++"
#if __cplusplus >= 201703L
-#ifdef _GLIBCXX_HAVE_BUILTIN_LAUNDER
namespace std
{
+#ifdef _GLIBCXX_HAVE_BUILTIN_LAUNDER
#define __cpp_lib_launder 201606
/// Pointer optimization barrier [ptr.launder]
template<typename _Tp>
@@ -205,8 +205,14 @@ namespace std
void launder(const void*) = delete;
void launder(volatile void*) = delete;
void launder(const volatile void*) = delete;
-}
#endif // _GLIBCXX_HAVE_BUILTIN_LAUNDER
+
+#ifdef __GCC_DESTRUCTIVE_SIZE
+# define __cpp_lib_hardware_interference_size 201703L
+ inline constexpr size_t hardware_destructive_interference_size = __GCC_DESTRUCTIVE_SIZE;
+ inline constexpr size_t hardware_constructive_interference_size = __GCC_CONSTRUCTIVE_SIZE;
+#endif // __GCC_DESTRUCTIVE_SIZE
+}
#endif // C++17
#if __cplusplus > 201703L
</cut>
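Since the libstdc++ change only defines the constants when __GCC_DESTRUCTIVE_SIZE is set, portable code should test the feature-test macro before using them. A sketch (the 64-byte fallback is my assumption, matching the x86 default above):

// cache_line.h -- guarded use of the C++17 interference constants
#include <cstddef>
#include <new>

#ifdef __cpp_lib_hardware_interference_size
inline constexpr std::size_t cache_line_size =
  std::hardware_destructive_interference_size;
#else
// Toolchain predates the constants (or the target never set the params);
// fall back to a common L1 line size.
inline constexpr std::size_t cache_line_size = 64;
#endif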
Identified regression caused by *gcc:76b75018b3d053a890ebe155e47814de14b3c9fb*:
commit 76b75018b3d053a890ebe155e47814de14b3c9fb
Author: Jason Merrill <jason(a)redhat.com>
c++: implement C++17 hardware interference size
Results regressed to (for first_bad == 76b75018b3d053a890ebe155e47814de14b3c9fb)
# reset_artifacts:
-10
# true:
0
# build_abe binutils:
1
# First few build errors in logs:
from (for last_good == 8ea292591e42aa4d52b4b7a00b86335bfd2e2e85)
# reset_artifacts:
-10
# true:
0
# build_abe binutils:
1
# build_abe bootstrap:
2
This commit has regressed these CI configurations:
- tcwg_gcc_bootstrap/master-aarch64-bootstrap
Artifacts of last_good build: https://ci.linaro.org/job/tcwg_gcc_bootstrap-bisect-master-aarch64-bootstra…
Artifacts of first_bad build: https://ci.linaro.org/job/tcwg_gcc_bootstrap-bisect-master-aarch64-bootstra…
Even more details: https://ci.linaro.org/job/tcwg_gcc_bootstrap-bisect-master-aarch64-bootstra…
Reproduce builds:
<cut>
mkdir investigate-gcc-76b75018b3d053a890ebe155e47814de14b3c9fb
cd investigate-gcc-76b75018b3d053a890ebe155e47814de14b3c9fb
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_gcc_bootstrap-bisect-master-aarch64-bootstra… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_gcc_bootstrap-bisect-master-aarch64-bootstra… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_gcc_bootstrap-bisect-master-aarch64-bootstra… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_gnu-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
cd gcc
# Reproduce first_bad build
git checkout --detach 76b75018b3d053a890ebe155e47814de14b3c9fb
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 8ea292591e42aa4d52b4b7a00b86335bfd2e2e85
../artifacts/test.sh
cd ..
</cut>
Identified regression caused by *gcc:76b75018b3d053a890ebe155e47814de14b3c9fb*:
commit 76b75018b3d053a890ebe155e47814de14b3c9fb
Author: Jason Merrill <jason(a)redhat.com>
c++: implement C++17 hardware interference size
Results regressed to (for first_bad == 76b75018b3d053a890ebe155e47814de14b3c9fb)
# reset_artifacts:
-10
# true:
0
# build_abe binutils:
1
# First few build errors in logs:
from (for last_good == 8ea292591e42aa4d52b4b7a00b86335bfd2e2e85)
# reset_artifacts:
-10
# true:
0
# build_abe binutils:
1
# build_abe gcc:
2
# build_abe linux:
4
# build_abe glibc:
5
# build_abe gdb:
6
This commit has regressed these CI configurations:
- tcwg_gnu_native_build/master-arm
Artifacts of last_good build: https://ci.linaro.org/job/tcwg_gnu_native_build-bisect-master-arm/2/artifac…
Artifacts of first_bad build: https://ci.linaro.org/job/tcwg_gnu_native_build-bisect-master-arm/2/artifac…
Even more details: https://ci.linaro.org/job/tcwg_gnu_native_build-bisect-master-arm/2/artifac…
Reproduce builds:
<cut>
mkdir investigate-gcc-76b75018b3d053a890ebe155e47814de14b3c9fb
cd investigate-gcc-76b75018b3d053a890ebe155e47814de14b3c9fb
# Fetch scripts
git clone https://git.linaro.org/toolchain/jenkins-scripts
# Fetch manifests and test.sh script
mkdir -p artifacts/manifests
curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_gnu_native_build-bisect-master-arm/2/artifac… --fail
curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_gnu_native_build-bisect-master-arm/2/artifac… --fail
curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_gnu_native_build-bisect-master-arm/2/artifac… --fail
chmod +x artifacts/test.sh
# Reproduce the baseline build (build all pre-requisites)
./jenkins-scripts/tcwg_gnu-build.sh @@ artifacts/manifests/build-baseline.sh
# Save baseline build state (which is then restored in artifacts/test.sh)
mkdir -p ./bisect
rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/
cd gcc
# Reproduce first_bad build
git checkout --detach 76b75018b3d053a890ebe155e47814de14b3c9fb
../artifacts/test.sh
# Reproduce last_good build
git checkout --detach 8ea292591e42aa4d52b4b7a00b86335bfd2e2e85
../artifacts/test.sh
cd ..
</cut>
Full commit (up to 1000 lines):
<cut>
commit 76b75018b3d053a890ebe155e47814de14b3c9fb
Author: Jason Merrill <jason(a)redhat.com>
Date: Thu Jul 15 15:30:17 2021 -0400
c++: implement C++17 hardware interference size
The last missing piece of the C++17 standard library is the hardware
intereference size constants. Much of the delay in implementing these has
been due to uncertainty about what the right values are, and even whether
there is a single constant value that is suitable; the destructive
interference size is intended to be used in structure layout, so program
ABIs will depend on it.
In principle, both of these values should be the same as the target's L1
cache line size. When compiling for a generic target that is intended to
support a range of target CPUs with different cache line sizes, the
constructive size should probably be the minimum size, and the destructive
size the maximum, unless you are constrained by ABI compatibility with
previous code.
From discussion on gcc-patches, I've come to the conclusion that the
solution to the difficulty of choosing stable values is to give up on it,
and instead encourage only uses where ABI stability is unimportant: in
particular, uses where the ABI is shared at most between translation units
built at the same time with the same flags.
To that end, I've added a warning for any use of the constant value of
std::hardware_destructive_interference_size in a header or module export.
Appropriate uses within a project can disable the warning.
A previous iteration of this patch included an -finterference-tune flag to
make the value vary with -mtune; this iteration makes that the default
behavior, which should be appropriate for all reasonable uses of the
variable. The previous default of "stable-ish" seems to me likely to have
been more of an attractive nuisance; since we can't promise actual
stability, we should instead make proper uses more convenient.
JF Bastien's implementation proposal is summarized at
https://github.com/itanium-cxx-abi/cxx-abi/issues/74
I implement this by adding new --params for the two sizes. Targets can
override these values in targetm.target_option.override() to support a range
of values for the generic target; otherwise, both will default to the L1
cache line size.
64 bytes still seems correct for all x86.
I'm not sure why he proposed 64/64 for generic 32-bit ARM, since the Cortex
A9 has a 32-byte cache line, so I'd think 32/64 would make more sense.
He proposed 64/128 for generic AArch64, but since the A64FX now has a 256B
cache line, I've changed that to 64/256.
Other arch maintainers are invited to set ranges for their generic targets
if that seems better than using the default cache line size for both values.
With the above choice to reject stability as a goal, getting these values
"right" is now just a matter of what we want the default optimization to be,
and we can feel free to adjust them as CPUs with different cache lines
become more and less common.
gcc/ChangeLog:
* params.opt: Add destructive-interference-size and
constructive-interference-size.
* doc/invoke.texi: Document them.
* config/aarch64/aarch64.c (aarch64_override_options_internal):
Set them.
* config/arm/arm.c (arm_option_override): Set them.
* config/i386/i386-options.c (ix86_option_override_internal):
Set them.
gcc/c-family/ChangeLog:
* c.opt: Add -Winterference-size.
* c-cppbuiltin.c (cpp_atomic_builtins): Add __GCC_DESTRUCTIVE_SIZE
and __GCC_CONSTRUCTIVE_SIZE.
gcc/cp/ChangeLog:
* constexpr.c (maybe_warn_about_constant_value):
Complain about std::hardware_destructive_interference_size.
(cxx_eval_constant_expression): Call it.
* decl.c (cxx_init_decl_processing): Check
--param *-interference-size values.
libstdc++-v3/ChangeLog:
* include/std/version: Define __cpp_lib_hardware_interference_size.
* libsupc++/new: Define hardware interference size variables.
gcc/testsuite/ChangeLog:
* g++.dg/warn/Winterference.H: New file.
* g++.dg/warn/Winterference.C: New test.
* g++.target/aarch64/interference.C: New test.
* g++.target/arm/interference.C: New test.
* g++.target/i386/interference.C: New test.
---
gcc/c-family/c-cppbuiltin.c | 14 ++++++
gcc/c-family/c.opt | 5 ++
gcc/config/aarch64/aarch64.c | 22 +++++++++
gcc/config/arm/arm.c | 22 +++++++++
gcc/config/i386/i386-options.c | 6 +++
gcc/cp/constexpr.c | 33 +++++++++++++
gcc/cp/decl.c | 32 ++++++++++++
gcc/doc/invoke.texi | 65 +++++++++++++++++++++++++
gcc/params.opt | 16 ++++++
gcc/testsuite/g++.dg/warn/Winterference-2.C | 14 ++++++
gcc/testsuite/g++.dg/warn/Winterference.C | 6 +++
gcc/testsuite/g++.dg/warn/Winterference.H | 7 +++
gcc/testsuite/g++.target/aarch64/interference.C | 9 ++++
gcc/testsuite/g++.target/arm/interference.C | 9 ++++
gcc/testsuite/g++.target/i386/interference.C | 8 +++
libstdc++-v3/include/std/version | 3 ++
libstdc++-v3/libsupc++/new | 10 +++-
17 files changed, 279 insertions(+), 2 deletions(-)
diff --git a/gcc/c-family/c-cppbuiltin.c b/gcc/c-family/c-cppbuiltin.c
index 48cbefd8bf8..ce88e707127 100644
--- a/gcc/c-family/c-cppbuiltin.c
+++ b/gcc/c-family/c-cppbuiltin.c
@@ -741,6 +741,20 @@ cpp_atomic_builtins (cpp_reader *pfile)
builtin_define_with_int_value ("__GCC_ATOMIC_TEST_AND_SET_TRUEVAL",
targetm.atomic_test_and_set_trueval);
+ /* Macros for C++17 hardware interference size constants. Either both or
+ neither should be set. */
+ gcc_assert (!param_destruct_interfere_size
+ == !param_construct_interfere_size);
+ if (param_destruct_interfere_size)
+ {
+ /* FIXME The way of communicating these values to the library should be
+ part of the C++ ABI, whether macro or builtin. */
+ builtin_define_with_int_value ("__GCC_DESTRUCTIVE_SIZE",
+ param_destruct_interfere_size);
+ builtin_define_with_int_value ("__GCC_CONSTRUCTIVE_SIZE",
+ param_construct_interfere_size);
+ }
+
/* ptr_type_node can't be used here since ptr_mode is only set when
toplev calls backend_init which is not done with -E or pch. */
psize = POINTER_SIZE_UNITS;
diff --git a/gcc/c-family/c.opt b/gcc/c-family/c.opt
index c5fe90003f2..9c151d19870 100644
--- a/gcc/c-family/c.opt
+++ b/gcc/c-family/c.opt
@@ -722,6 +722,11 @@ Winit-list-lifetime
C++ ObjC++ Var(warn_init_list) Warning Init(1)
Warn about uses of std::initializer_list that can result in dangling pointers.
+Winterference-size
+C++ ObjC++ Var(warn_interference_size) Warning Init(1)
+Warn about nonsensical values of --param destructive-interference-size or
+constructive-interference-size.
+
Wimplicit
C ObjC Var(warn_implicit) Warning LangEnabledBy(C ObjC,Wall)
Warn about implicit declarations.
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 30d9a0b7a3d..36519ccc5a5 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -16540,6 +16540,28 @@ aarch64_override_options_internal (struct gcc_options *opts)
SET_OPTION_IF_UNSET (opts, &global_options_set,
param_l1_cache_line_size,
aarch64_tune_params.prefetch->l1_cache_line_size);
+
+ if (aarch64_tune_params.prefetch->l1_cache_line_size >= 0)
+ {
+ SET_OPTION_IF_UNSET (opts, &global_options_set,
+ param_destruct_interfere_size,
+ aarch64_tune_params.prefetch->l1_cache_line_size);
+ SET_OPTION_IF_UNSET (opts, &global_options_set,
+ param_construct_interfere_size,
+ aarch64_tune_params.prefetch->l1_cache_line_size);
+ }
+ else
+ {
+ /* For a generic AArch64 target, cover the current range of cache line
+ sizes. */
+ SET_OPTION_IF_UNSET (opts, &global_options_set,
+ param_destruct_interfere_size,
+ 256);
+ SET_OPTION_IF_UNSET (opts, &global_options_set,
+ param_construct_interfere_size,
+ 64);
+ }
+
if (aarch64_tune_params.prefetch->l2_cache_size >= 0)
SET_OPTION_IF_UNSET (opts, &global_options_set,
param_l2_cache_size,
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index f1e628253d0..6c6e77fab66 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -3669,6 +3669,28 @@ arm_option_override (void)
SET_OPTION_IF_UNSET (&global_options, &global_options_set,
param_l1_cache_line_size,
current_tune->prefetch.l1_cache_line_size);
+ if (current_tune->prefetch.l1_cache_line_size >= 0)
+ {
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_destruct_interfere_size,
+ current_tune->prefetch.l1_cache_line_size);
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_construct_interfere_size,
+ current_tune->prefetch.l1_cache_line_size);
+ }
+ else
+ {
+ /* For a generic ARM target, JF Bastien proposed using 64 for both. */
+ /* ??? Cortex A9 has a 32-byte cache line, so why not 32 for
+ constructive? */
+ /* More recent Cortex chips have a 64-byte cache line, but are marked
+ ARM_PREFETCH_NOT_BENEFICIAL, so they get these defaults. */
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_destruct_interfere_size, 64);
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_construct_interfere_size, 64);
+ }
+
if (current_tune->prefetch.l1_cache_size >= 0)
SET_OPTION_IF_UNSET (&global_options, &global_options_set,
param_l1_cache_size,
diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
index 2cb87cedec0..c0006b3674b 100644
--- a/gcc/config/i386/i386-options.c
+++ b/gcc/config/i386/i386-options.c
@@ -2579,6 +2579,12 @@ ix86_option_override_internal (bool main_args_p,
SET_OPTION_IF_UNSET (opts, opts_set, param_l2_cache_size,
ix86_tune_cost->l2_cache_size);
+ /* 64B is the accepted value for these for all x86. */
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_destruct_interfere_size, 64);
+ SET_OPTION_IF_UNSET (&global_options, &global_options_set,
+ param_construct_interfere_size, 64);
+
/* Enable sw prefetching at -O3 for CPUS that prefetching is helpful. */
if (opts->x_flag_prefetch_loop_arrays < 0
&& HAVE_prefetch
diff --git a/gcc/cp/constexpr.c b/gcc/cp/constexpr.c
index 7772fe62d95..0c2498aee22 100644
--- a/gcc/cp/constexpr.c
+++ b/gcc/cp/constexpr.c
@@ -6075,6 +6075,37 @@ inline_asm_in_constexpr_error (location_t loc)
"%<constexpr%> function in C++20");
}
+/* We're getting the constant value of DECL in a manifestly constant-evaluated
+ context; maybe complain about that. */
+
+static void
+maybe_warn_about_constant_value (location_t loc, tree decl)
+{
+ static bool explained = false;
+ if (cxx_dialect >= cxx17
+ && warn_interference_size
+ && !global_options_set.x_param_destruct_interfere_size
+ && DECL_CONTEXT (decl) == std_node
+ && id_equal (DECL_NAME (decl), "hardware_destructive_interference_size")
+ && (LOCATION_FILE (input_location) != main_input_filename
+ || module_exporting_p ())
+ && warning_at (loc, OPT_Winterference_size, "use of %qD", decl)
+ && !explained)
+ {
+ explained = true;
+ inform (loc, "its value can vary between compiler versions or "
+ "with different %<-mtune%> or %<-mcpu%> flags");
+ inform (loc, "if this use is part of a public ABI, change it to "
+ "instead use a constant variable you define");
+ inform (loc, "the default value for the current CPU tuning "
+ "is %d bytes", param_destruct_interfere_size);
+ inform (loc, "you can stabilize this value with %<--param "
+ "hardware_destructive_interference_size=%d%>, or disable "
+ "this warning with %<-Wno-interference-size%>",
+ param_destruct_interfere_size);
+ }
+}
+
/* Attempt to reduce the expression T to a constant value.
On failure, issue diagnostic and return error_mark_node. */
/* FIXME unify with c_fully_fold */
@@ -6219,6 +6250,8 @@ cxx_eval_constant_expression (const constexpr_ctx *ctx, tree t,
r = *p;
break;
}
+ if (ctx->manifestly_const_eval)
+ maybe_warn_about_constant_value (loc, t);
if (COMPLETE_TYPE_P (TREE_TYPE (t))
&& is_really_empty_class (TREE_TYPE (t), /*ignore_vptr*/false))
{
diff --git a/gcc/cp/decl.c b/gcc/cp/decl.c
index bce62ad202a..c2065027369 100644
--- a/gcc/cp/decl.c
+++ b/gcc/cp/decl.c
@@ -4752,6 +4752,38 @@ cxx_init_decl_processing (void)
/* Show we use EH for cleanups. */
if (flag_exceptions)
using_eh_for_cleanups ();
+
+ /* Check that the hardware interference sizes are at least
+ alignof(max_align_t), as required by the standard. */
+ const int max_align = max_align_t_align () / BITS_PER_UNIT;
+ if (param_destruct_interfere_size)
+ {
+ if (param_destruct_interfere_size < max_align)
+ error ("%<--param destructive-interference-size=%d%> is less than "
+ "%d", param_destruct_interfere_size, max_align);
+ else if (param_destruct_interfere_size < param_l1_cache_line_size)
+ warning (OPT_Winterference_size,
+ "%<--param destructive-interference-size=%d%> "
+ "is less than %<--param l1-cache-line-size=%d%>",
+ param_destruct_interfere_size, param_l1_cache_line_size);
+ }
+ else if (param_l1_cache_line_size >= max_align)
+ param_destruct_interfere_size = param_l1_cache_line_size;
+ /* else leave it unset. */
+
+ if (param_construct_interfere_size)
+ {
+ if (param_construct_interfere_size < max_align)
+ error ("%<--param constructive-interference-size=%d%> is less than "
+ "%d", param_construct_interfere_size, max_align);
+ else if (param_construct_interfere_size > param_l1_cache_line_size)
+ warning (OPT_Winterference_size,
+ "%<--param constructive-interference-size=%d%> "
+ "is greater than %<--param l1-cache-line-size=%d%>",
+ param_construct_interfere_size, param_l1_cache_line_size);
+ }
+ else if (param_l1_cache_line_size >= max_align)
+ param_construct_interfere_size = param_l1_cache_line_size;
}
/* Enter an abi node in global-module context. returns a cookie to
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 23cc68f92b5..78cfc100ac2 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -9018,6 +9018,43 @@ that has already been done in the current function. Therefore,
seemingly insignificant changes in the source program can cause the
warnings produced by @option{-Winline} to appear or disappear.
+@item -Winterference-size
+@opindex Winterference-size
+Warn about use of C++17 @code{std::hardware_destructive_interference_size}
+without specifying its value with @option{--param destructive-interference-size}.
+Also warn about questionable values for that option.
+
+This variable is intended to be used for controlling class layout, to
+avoid false sharing in concurrent code:
+
+@smallexample
+struct independent_fields @{
+ alignas(std::hardware_destructive_interference_size) std::atomic<int> one;
+ alignas(std::hardware_destructive_interference_size) std::atomic<int> two;
+@};
+@end smallexample
+
+Here @samp{one} and @samp{two} are intended to be far enough apart
+that stores to one won't require accesses to the other to reload the
+cache line.
+
+By default, @option{--param destructive-interference-size} and
+@option{--param constructive-interference-size} are set based on the
+current @option{-mtune} option, typically to the L1 cache line size
+for the particular target CPU, sometimes to a range if tuning for a
+generic target. So all translation units that depend on ABI
+compatibility for the use of these variables must be compiled with
+the same @option{-mtune} (or @option{-mcpu}).
+
+If ABI stability is important, such as if the use is in a header for a
+library, you should probably not use the hardware interference size
+variables at all. Alternatively, you can force a particular value
+with @option{--param}.
+
+If you are confident that your use of the variable does not affect ABI
+outside a single build of your project, you can turn off the warning
+with @option{-Wno-interference-size}.
+
@item -Wint-in-bool-context
@opindex Wint-in-bool-context
@opindex Wno-int-in-bool-context
@@ -13938,6 +13975,34 @@ prefetch hints can be issued for any constant stride.
This setting is only useful for strides that are known and constant.
+@item destructive-interference-size
+@item constructive-interference-size
+The values for the C++17 variables
+@code{std::hardware_destructive_interference_size} and
+@code{std::hardware_constructive_interference_size}. The destructive
+interference size is the minimum recommended offset between two
+independent concurrently-accessed objects; the constructive
+interference size is the maximum recommended size of contiguous memory
+accessed together. Typically both will be the size of an L1 cache
+line for the target, in bytes. For a generic target covering a range of L1
+cache line sizes, typically the constructive interference size will be
+the small end of the range and the destructive size will be the large
+end.
+
+The destructive interference size is intended to be used for layout,
+and thus has ABI impact. The default value is not expected to be
+stable, and on some targets varies with @option{-mtune}, so use of
+this variable in a context where ABI stability is important, such as
+the public interface of a library, is strongly discouraged; if it is
+used in that context, users can stabilize the value using this
+option.
+
+The constructive interference size is less sensitive, as it is
+typically only used in a @samp{static_assert} to make sure that a type
+fits within a cache line.
+
+See also @option{-Winterference-size}.
+
@item loop-interchange-max-num-stmts
The maximum number of stmts in a loop to be interchanged.
diff --git a/gcc/params.opt b/gcc/params.opt
index 3a701e22c46..658ca028851 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -361,6 +361,22 @@ The maximum code size growth ratio when expanding into a jump table (in percent)
Common Joined UInteger Var(param_l1_cache_line_size) Init(32) Param Optimization
The size of L1 cache line.
+-param=destructive-interference-size=
+Common Joined UInteger Var(param_destruct_interfere_size) Init(0) Param Optimization
+The minimum recommended offset between two concurrently-accessed objects to
+avoid additional performance degradation due to contention introduced by the
+implementation. Typically the L1 cache line size, but can be larger to
+accommodate a variety of target processors with different cache line sizes.
+C++17 code might use this value in structure layout, but is strongly
+discouraged from doing so in public ABIs.
+
+-param=constructive-interference-size=
+Common Joined UInteger Var(param_construct_interfere_size) Init(0) Param Optimization
+The maximum recommended size of contiguous memory occupied by two objects
+accessed with temporal locality by concurrent threads. Typically the L1 cache
+line size, but can be smaller to accommodate a variety of target processors with
+different cache line sizes.
+
-param=l1-cache-size=
Common Joined UInteger Var(param_l1_cache_size) Init(64) Param Optimization
The size of L1 cache.
diff --git a/gcc/testsuite/g++.dg/warn/Winterference-2.C b/gcc/testsuite/g++.dg/warn/Winterference-2.C
new file mode 100644
index 00000000000..2af75c63f83
--- /dev/null
+++ b/gcc/testsuite/g++.dg/warn/Winterference-2.C
@@ -0,0 +1,14 @@
+// { dg-do compile { target c++20 } }
+// { dg-additional-options -fmodules-ts }
+
+module ;
+
+#include <new>
+
+export module foo;
+
+export {
+ struct A {
+ alignas(std::hardware_destructive_interference_size) int x; // { dg-warning Winterference-size }
+ };
+}
diff --git a/gcc/testsuite/g++.dg/warn/Winterference.C b/gcc/testsuite/g++.dg/warn/Winterference.C
new file mode 100644
index 00000000000..57c001bc032
--- /dev/null
+++ b/gcc/testsuite/g++.dg/warn/Winterference.C
@@ -0,0 +1,6 @@
+// Test that we warn about use of std::hardware_destructive_interference_size
+// in a header.
+// { dg-do compile { target c++17 } }
+
+// { dg-warning Winterference-size "" { target *-*-* } 0 }
+#include "Winterference.H"
diff --git a/gcc/testsuite/g++.dg/warn/Winterference.H b/gcc/testsuite/g++.dg/warn/Winterference.H
new file mode 100644
index 00000000000..36f0ad5f6d1
--- /dev/null
+++ b/gcc/testsuite/g++.dg/warn/Winterference.H
@@ -0,0 +1,7 @@
+#include <new>
+
+struct A
+{
+ alignas(std::hardware_destructive_interference_size) int i;
+ alignas(std::hardware_destructive_interference_size) int j;
+};
diff --git a/gcc/testsuite/g++.target/aarch64/interference.C b/gcc/testsuite/g++.target/aarch64/interference.C
new file mode 100644
index 00000000000..0fc01655223
--- /dev/null
+++ b/gcc/testsuite/g++.target/aarch64/interference.C
@@ -0,0 +1,9 @@
+// Test C++17 hardware interference size constants
+// { dg-do compile { target c++17 } }
+
+#include <new>
+
+// Most AArch64 CPUs have an L1 cache line size of 64, but some recent ones use
+// 128 or even 256.
+static_assert(std::hardware_destructive_interference_size == 256);
+static_assert(std::hardware_constructive_interference_size == 64);
diff --git a/gcc/testsuite/g++.target/arm/interference.C b/gcc/testsuite/g++.target/arm/interference.C
new file mode 100644
index 00000000000..34fe8a52bff
--- /dev/null
+++ b/gcc/testsuite/g++.target/arm/interference.C
@@ -0,0 +1,9 @@
+// Test C++17 hardware interference size constants
+// { dg-do compile { target c++17 } }
+
+#include <new>
+
+// Recent ARM CPUs have a cache line size of 64.  Older ones use 32,
+// but they are old enough that we do not expect them to matter here.
+static_assert(std::hardware_destructive_interference_size == 64);
+static_assert(std::hardware_constructive_interference_size == 64);
diff --git a/gcc/testsuite/g++.target/i386/interference.C b/gcc/testsuite/g++.target/i386/interference.C
new file mode 100644
index 00000000000..c7b910e3ada
--- /dev/null
+++ b/gcc/testsuite/g++.target/i386/interference.C
@@ -0,0 +1,8 @@
+// Test C++17 hardware interference size constants
+// { dg-do compile { target c++17 } }
+
+#include <new>
+
+// It is generally agreed that these are the right values for all x86.
+static_assert(std::hardware_destructive_interference_size == 64);
+static_assert(std::hardware_constructive_interference_size == 64);
diff --git a/libstdc++-v3/include/std/version b/libstdc++-v3/include/std/version
index f950bf0f0db..f41004b5911 100644
--- a/libstdc++-v3/include/std/version
+++ b/libstdc++-v3/include/std/version
@@ -140,6 +140,9 @@
#define __cpp_lib_filesystem 201703
#define __cpp_lib_gcd 201606
#define __cpp_lib_gcd_lcm 201606
+#ifdef __GCC_DESTRUCTIVE_SIZE
+# define __cpp_lib_hardware_interference_size 201703L
+#endif
#define __cpp_lib_hypot 201603
#define __cpp_lib_invoke 201411L
#define __cpp_lib_lcm 201606
diff --git a/libstdc++-v3/libsupc++/new b/libstdc++-v3/libsupc++/new
index 3349b13fd1b..7bc67a6cb02 100644
--- a/libstdc++-v3/libsupc++/new
+++ b/libstdc++-v3/libsupc++/new
@@ -183,9 +183,9 @@ inline void operator delete[](void*, void*) _GLIBCXX_USE_NOEXCEPT { }
} // extern "C++"

#if __cplusplus >= 201703L
-#ifdef _GLIBCXX_HAVE_BUILTIN_LAUNDER
namespace std
{
+#ifdef _GLIBCXX_HAVE_BUILTIN_LAUNDER
#define __cpp_lib_launder 201606
/// Pointer optimization barrier [ptr.launder]
template<typename _Tp>
@@ -205,8 +205,14 @@ namespace std
void launder(const void*) = delete;
void launder(volatile void*) = delete;
void launder(const volatile void*) = delete;
-}
#endif // _GLIBCXX_HAVE_BUILTIN_LAUNDER
+
+#ifdef __GCC_DESTRUCTIVE_SIZE
+# define __cpp_lib_hardware_interference_size 201703L
+ inline constexpr size_t hardware_destructive_interference_size = __GCC_DESTRUCTIVE_SIZE;
+ inline constexpr size_t hardware_constructive_interference_size = __GCC_CONSTRUCTIVE_SIZE;
+#endif // __GCC_DESTRUCTIVE_SIZE
+}
#endif // C++17

#if __cplusplus > 201703L
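Since __cpp_lib_hardware_interference_size is defined only when the
compiler provides __GCC_DESTRUCTIVE_SIZE, portable code should test the
feature-test macro before relying on the constants.  A minimal sketch
(the 64-byte fallback and the PerThreadSlot type are illustrative, not
part of the patch):

<cut>
// Sketch: use the C++17 constants when available, else a guessed fallback.
#include <new>
#include <cstddef>

#ifdef __cpp_lib_hardware_interference_size
inline constexpr std::size_t cache_line_hint =
    std::hardware_destructive_interference_size;
#else
// Illustrative fallback: 64 bytes is common on current targets, but it is
// only a guess.
inline constexpr std::size_t cache_line_hint = 64;
#endif

// Pad per-thread data so concurrent writers do not share a cache line.
struct alignas(cache_line_hint) PerThreadSlot {
  long counter = 0;
};

static_assert(alignof(PerThreadSlot) >= cache_line_hint);

int main() {
  PerThreadSlot s;
  return static_cast<int>(s.counter);  // 0
}
</cut>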
</cut>