Until now, we've only checked whether merging stores would cause a cycle via
the value argument, but the address and indexed offset arguments are also
capable of creating cycles in some situations.
The addresses are all base+offset with notionally the same base, but the base
SDNode may still be different (e.g. via an indexed load in one case, and an
ISD::ADD elsewhere). This allows cycles to creep in if one of these sources
depends on another.
The indexed offset is usually undef (representing a non-indexed store), but on
some architectures (e.g. 32-bit ARM-mode ARM) it can be an arbitrary value,
again allowing dependency cycles to creep in.
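A minimal sketch of the kind of check this implies, assuming the usual
SelectionDAG headers (the helper name and exact seeding are illustrative,
not the in-tree code; SDNode::hasPredecessorHelper is the real API):

  // Seed the upward search with *all* non-chain operands of every candidate
  // store: value, base address, and indexed offset (usually undef, but an
  // arbitrary value on e.g. 32-bit ARM-mode ARM).
  static bool mergeWouldCreateCycle(ArrayRef<StoreSDNode *> Candidates) {
    SmallPtrSet<const SDNode *, 32> Visited;
    SmallVector<const SDNode *, 8> Worklist;
    for (StoreSDNode *St : Candidates) {
      Worklist.push_back(St->getValue().getNode());
      Worklist.push_back(St->getBasePtr().getNode());
      Worklist.push_back(St->getOffset().getNode());
    }
    // If any candidate store is reachable by walking up from these operands,
    // one candidate feeds another and merging them would form a cycle.
    for (StoreSDNode *St : Candidates)
      if (SDNode::hasPredecessorHelper(St, Visited, Worklist, /*MaxSteps=*/8192))
        return true;
    return false;
  }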
llvm-svn: 345200
When implementing memsets today we often see this pattern:
$x0 = MOV 0xXYXYXYXYXYXYXYXY
store $x0, ...
$w1 = MOV 0xXYXYXYXY
store $w1, ...
We first create a 64bit constant in a 64bit register with all bytes the
same and then create a 32bit constant with all bytes the same in a 32bit
register. On many targets we could just access the lower half of the
64bit register instead.
- Ideally this would be handled by the ConstantHoist pass but it runs
too early when memset isn't expanded yet.
- The memset expansion code already had this optimization implemented;
however, SelectionDAG constant folding would fold the
"trunc(bigconstant)" pattern to "smallconstant".
- This patch makes the memset expansion mark the constant as Opaque and
stop DAGCombiner from constant folding in this situation. (Similar to
how ConstantHoisting marks things as Opaque to avoid folding
ADD/SUB/etc.)
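A rough sketch of the idea, assuming the usual SelectionDAG constant API
(the exact call site in the memset expansion differs):

  // Mark the splatted constant as opaque so DAGCombiner will not fold a
  // later truncate of it into a brand-new (smaller) immediate.
  SDValue Big = DAG.getConstant(SplatValue, dl, MVT::i64,
                                /*isTarget=*/false, /*isOpaque=*/true);
  // The 32-bit portion of the memset can then reuse the same register via a
  // truncate instead of materializing a second constant:
  SDValue Small = DAG.getNode(ISD::TRUNCATE, dl, MVT::i32, Big);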
Differential Revision: https://reviews.llvm.org/D53181
llvm-svn: 345102
I've included a fix to DAGCombiner::ForwardStoreValueToDirectLoad that I believe will prevent the previous miscompile.
Original commit message:
Theoretically this was done to simplify the number of isel patterns that were needed. But it also meant a substantial number of our isel patterns had to match an explicit bitcast. By making the vXi32/vXi16/vXi8 types legal for loads, DAG combiner should be able to change the load type to remove the bitcast.
I had to add some additional plain load instruction patterns and a few other special cases, but overall the isel table has been reduced in size by ~12000 bytes. So it looks like this promotion was hurting us more than helping.
I still have one crash in vector-trunc.ll that I'm hoping @RKSimon can help with. It seems to relate to using getTargetConstantFromNode on a load that was shrunk due to an extract_subvector combine after the constant pool entry was created. So we end up decoding more mask elements than the load size.
I'm hoping this patch will simplify the number of patterns needed to remove the and/or/xor promotion.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits, RKSimon
Differential Revision: https://reviews.llvm.org/D53306
llvm-svn: 344965
Introduce new versions that follow the IEEE semantics
to help with legalization that may need quieted inputs.
There are some regressions from inserting unnecessary
canonicalizes when these are matched from fast-math
fcmp + select; that should be fixed in a future commit.
llvm-svn: 344914
This is a late backend subset of the IR transform added with:
D52439
We can confirm that the conversion to a 'trunc' is correct by running:
$ opt -instcombine -data-layout="e"
(assuming the IR transforms are correct; change "e" to "E" for big-endian)
As discussed in PR39016:
https://bugs.llvm.org/show_bug.cgi?id=39016
...the pattern may emerge during legalization, so that's why we wait for an
insertelement to become a scalar_to_vector in the pattern matching here.
The DAG allows for fun variations that are not possible in IR. Result types for
extracts and scalar_to_vector don't necessarily match input types, so that means
we have to be a bit more careful in the transform (see code comments).
The tests show that we don't handle cases that require a shift (as we did in the
IR version). I've left that as a potential follow-up because I'm not sure if
that's a real concern at this late stage.
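For reference, a simplified sketch of the little-endian, index-0 case with
integer types (variable names are mine; the in-tree code has to reconcile the
mismatched types and bails out when a shift would be needed):

  // (i8 (extract_vector_elt (v8i8 (bitcast (v1i64 (scalar_to_vector X)))), 0))
  //   --> (i8 (truncate X))
  SDValue BC  = Extract.getOperand(0);   // the bitcast
  SDValue S2V = BC.getOperand(0);        // the scalar_to_vector
  SDValue X   = S2V.getOperand(0);       // the original scalar
  EVT ResVT = Extract.getValueType();
  if (ResVT.bitsLT(X.getValueType()))
    return DAG.getNode(ISD::TRUNCATE, SDLoc(Extract), ResVT, X);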
Differential Revision: https://reviews.llvm.org/D53201
llvm-svn: 344872
I want to add another pattern here that includes scalar_to_vector,
so this makes that patch smaller. I was hoping to remove the
hasOneUse() check because it shouldn't be necessary for common
codegen, but an AMDGPU test has a comment suggesting that the
extra check makes things better on one of those targets.
llvm-svn: 344320
Summary:
Extend analysis forwarding loads from preceding stores to work with
extended loads and truncated stores to the same address so long as the
load is fully subsumed by the store.
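The core idea, in a heavily simplified sketch (little-endian, load at exactly
the store's address; the real forwarding code also handles truncating stores,
the extension kinds, and the offset/endianness math; the helper name here is
hypothetical):

  // Returns the value the load would observe, or SDValue() if we can't tell.
  static SDValue forwardStoreToLoad(StoreSDNode *ST, LoadSDNode *LD,
                                    SelectionDAG &DAG) {
    if (ST->getBasePtr() != LD->getBasePtr())
      return SDValue();
    EVT LdMemVT = LD->getMemoryVT();
    if (!LdMemVT.bitsLE(ST->getMemoryVT()))
      return SDValue();      // load is not fully subsumed by the store
    // On little-endian, the load reads the low bytes of the stored value, so
    // truncating (or reusing) the stored value gives the forwarded result.
    // An extending load would additionally need the matching [sz]ext on top.
    return DAG.getAnyExtOrTrunc(ST->getValue(), SDLoc(LD), LdMemVT);
  }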
Hexagon's swp-epilog-phis.ll and swp-memrefs-epilog1.ll tests are
deleted as they no longer seem to be relevant.
Reviewers: RKSimon, rnk, kparzysz, javed.absar
Subscribers: sdardis, nemanjai, hiraditya, atanasyan, llvm-commits
Differential Revision: https://reviews.llvm.org/D49200
llvm-svn: 344142
We already do the following combines:
(bitcast int (and (bitcast fp X to int), 0x7fff...) to fp) -> fabs X
(bitcast int (xor (bitcast fp X to int), 0x8000...) to fp) -> fneg X
when the target has "bit preserving fp logic". This patch just extends that
to also combine:
(bitcast int (or (bitcast fp X to int), 0x8000...) to fp) -> fneg (fabs X)
Some targets have fnabs, and even those that don't can efficiently lower
both the fabs and the fneg.
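A sketch of the new case, mirroring the existing fabs/fneg handling and
guarded by the same target check quoted above (Logic, X and FPVT are just the
pieces of the matched bitcast pattern, not the exact in-tree variable names):

  // (bitcast int (or (bitcast fp X to int), SignBit) to fp) -> fneg (fabs X)
  ConstantSDNode *C = isConstOrConstSplat(Logic.getOperand(1));
  if (Logic.getOpcode() == ISD::OR && C && C->getAPIntValue().isSignMask()) {
    SDLoc DL(N);
    SDValue Abs = DAG.getNode(ISD::FABS, DL, FPVT, X);
    return DAG.getNode(ISD::FNEG, DL, FPVT, Abs);
  }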
Differential revision: https://reviews.llvm.org/D44548
llvm-svn: 344093
This change is proposed as a part of D44548, but we
need this independently to avoid regressions from improved
undef propagation in SimplifyDemandedVectorElts().
llvm-svn: 343940
And use that to transform fsub with zero constant operands.
The integer part isn't used yet, but it is proposed for use in
D44548, so adding both enhancements here makes that
patch simpler.
llvm-svn: 343865
This fixes a case of bad index calculation when merging mismatching
vector types. This changes the code to use the existing
extract_{subvector|element} plus a bitcast (instead of bitcasting first and
then creating a new extract_xxx), so we don't need to adjust any indices
in the first place.
rdar://44584718
Differential Revision: https://reviews.llvm.org/D52681
llvm-svn: 343493
The SINT_TO_FP<->UINT_TO_FP combines for non-negative integers should only occur for legal ops once LegalOperations = true.
No test case to hand; noticed when investigating PR38226 + PR38970.
llvm-svn: 343405
DAGCombine will try to fold two loads that feed a SELECT or SELECT_CC
to after the select, resulting in a select of the two addresses followed
by a single load.
If either of the loads depend on the other, this is not legal as it
could introduce cycles. However, it only checked this if the opcode
was a SELECT, and not for a SELECT_CC.
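The missing check, in sketch form (LLD/RLD stand for the two loads feeding
the select; the SELECT path already performed the equivalent test):

  if (TheSelect->getOpcode() == ISD::SELECT_CC) {
    // Folding into "load (select_cc cmp0, cmp1, addr0, addr1, cc)" is only
    // safe if neither load is reachable from the other; otherwise the new
    // load would become its own (transitive) predecessor.
    if (LLD->isPredecessorOf(RLD) || RLD->isPredecessorOf(LLD))
      return false;
  }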
Unfortunately, the only reproducer I have for this is for our
downstream target. I've tried getting it to trigger on an upstream one
but haven't been successful.
Patch thanks to Bevin Hansson.
llvm-svn: 342980
This is a preliminary step towards solving PR14613:
https://bugs.llvm.org/show_bug.cgi?id=14613
If we have an 'add' instruction that sets flags, we can use that to eliminate an
explicit compare instruction or some other instruction (cmn) that sets flags for
use in the later select.
As shown in the unchanged tests that use 'icmp ugt %x, %a', we're effectively
reversing an IR icmp canonicalization that replaces a variable operand with a
constant:
https://rise4fun.com/Alive/V1Q
But we're not using 'uaddo' in those cases via DAG transforms. This happens in
CGP after D8889 without checking target lowering to see if the op is supported.
So AArch already shows 'uaddo' codegen for the i8/i16/i32/i64 test variants with
"using_cmp_sum" in the title. That's the pattern that CGP matches as an unsigned
saturated add and converts to uaddo without checking target capabilities.
This patch is gated by isOperationLegalOrCustom(ISD::UADDO, VT), so we only
see AArch diffs for i32/i64 in the tests with "using_cmp_notval" in the title
(unlike x86 which sees improvements for all sizes because all sizes are 'custom').
But the AArch code (like x86) looks better when translated to 'uaddo' in all cases.
So someone that is involved with AArch may want to set i8/i16 to 'custom' for UADDO,
so this patch will fire on those tests.
Another possibility given the existing behavior: we could remove the legal-or-custom
check altogether because we're assuming that a UADDO sequence is canonical/optimal
before we ever reach here. But that seems like a bug to me. If the target doesn't
have an add-with-flags op, then it's not likely that we'll get optimal DAG combining
using a UADDO node. This is similar to the justification for why we don't
canonicalize IR to the overflow intrinsic sibling of UADDO
(llvm.uadd.with.overflow) in the first place.
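In sketch form, the gate plus the rewrite (X and A are the operands of the
matched add; the overflow result type is simplified to i1 here, whereas the
real code would ask the target for its boolean/setcc type):

  if (!TLI.isOperationLegalOrCustom(ISD::UADDO, VT))
    return SDValue();
  SDVTList VTs = DAG.getVTList(VT, MVT::i1);
  SDValue UAO = DAG.getNode(ISD::UADDO, DL, VTs, X, A);
  SDValue Sum = UAO.getValue(0);       // replaces the explicit add
  SDValue Ovf = UAO.getValue(1);       // replaces the cmp/cmn flag result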
Differential Revision: https://reviews.llvm.org/D51929
llvm-svn: 342886
This code handled SCALAR_TO_VECTOR being returned by the recursion, but the code that used to return SCALAR_TO_VECTOR was removed in 2015.
llvm-svn: 342856
This comment was misleading about why we restrict the combine to before type legalization. The reason given would only apply to before op legalization, but there is also a before-type-legalization reason that should be listed.
llvm-svn: 342851
This is an alternative to https://reviews.llvm.org/D37896. We can't decompose
multiplies generically without a target hook to tell us when it's profitable.
ARM and AArch64 may be able to remove some existing code that overlaps with
this transform.
This extends D52195 and may resolve PR34474:
https://bugs.llvm.org/show_bug.cgi?id=34474
(still an open question about transforming legal vector multiplies, but we
could open another bug report for those)
llvm-svn: 342844
x86 had 2 versions of peekThroughBitcast. DAGCombiner had 1. Plus, there was a one-off implementation of the one-use variant.
Move the x86 versions of the code to SelectionDAG, so we don't have different copies of the code.
No functional change intended.
I'm putting this next to isBitwiseNot() because I am planning to use it in there. Another option is next to the
helpers in the ISD namespace (eg, ISD::isConstantSplatVector()). But if there's no good reason for those to be
there, I'd prefer to pull other helpers over to SelectionDAG in follow-up steps.
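The consolidated helpers are conceptually just loops over ISD::BITCAST,
roughly (a sketch; see SelectionDAG for the real declarations):

  SDValue peekThroughBitcasts(SDValue V) {
    while (V.getOpcode() == ISD::BITCAST)
      V = V.getOperand(0);
    return V;
  }
  SDValue peekThroughOneUseBitcasts(SDValue V) {
    while (V.getOpcode() == ISD::BITCAST && V.getOperand(0).hasOneUse())
      V = V.getOperand(0);
    return V;
  }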
Differential Revision: https://reviews.llvm.org/D52285
llvm-svn: 342669
The test diff in not-and-simplify.ll is from a use in SimplifyDemandedBits,
and the test diff in add.ll is from a DAGCombiner transform.
llvm-svn: 342594
This is an alternative to D37896. I don't see a way to decompose multiplies
generically without a target hook to tell us when it's profitable.
ARM and AArch64 may be able to remove some duplicate code that overlaps with
this transform.
As a first step, we're only getting the most clear wins on the vector examples
requested in PR34474:
https://bugs.llvm.org/show_bug.cgi?id=34474
As noted in the code comment, it's likely that the x86 constraints are tighter
than necessary, but it may not always be a win to replace a pmullw/pmulld.
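A sketch of the decomposition for a splatted constant (N0/N1 are the
multiply's operands; profitability is deferred to the target hook, and the
shift-amount typing is simplified here):

  // x * (2^N + 1) --> (x << N) + x
  // x * (2^N - 1) --> (x << N) - x
  ConstantSDNode *C = isConstOrConstSplat(N1);
  if (C && (C->getAPIntValue() - 1).isPowerOf2()) {
    unsigned ShAmt = (C->getAPIntValue() - 1).logBase2();
    SDValue Shl = DAG.getNode(ISD::SHL, DL, VT, N0,
                              DAG.getConstant(ShAmt, DL, VT));
    return DAG.getNode(ISD::ADD, DL, VT, Shl, N0);
  }
  // The (C + 1).isPowerOf2() case is the same with ISD::SUB.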
Differential Revision: https://reviews.llvm.org/D52195
llvm-svn: 342554
This is a follow-up suggested in D51630 and originally proposed as an IR transform in D49040.
Copying the motivational statement by @evandro from that patch:
"This transformation helps some benchmarks in SPEC CPU2000 and CPU2006, such as 188.ammp,
447.dealII, 453.povray, and especially 300.twolf, as well as some proprietary benchmarks.
Otherwise, no regressions on x86-64 or A64."
I'm proposing to add only the minimum support for a DAG node here. Since we don't have an
LLVM IR intrinsic for cbrt, and there are no other DAG ways to create a FCBRT node yet, I
don't think we need to worry about DAG builder, legalization, a strict variant, etc. We
should be able to expand as needed when adding more functionality/transforms. For reference,
these are transform suggestions currently listed in SimplifyLibCalls.cpp:
// * cbrt(expN(X)) -> expN(x/3)
// * cbrt(sqrt(x)) -> pow(x,1/6)
// * cbrt(cbrt(x)) -> pow(x,1/9)
Also, given that we bail out on long double for now, there should not be any logical
differences between platforms (unless there's some platform out there that has pow()
but not cbrt()).
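For reference, the only way an FCBRT node gets created for now is from the
pow() combine, roughly as sketched below (the type and fast-math-flag checks
are abbreviated):

  // pow(X, 1/3), with approximate-functions allowed --> FCBRT(X)
  ConstantFPSDNode *ExpC = isConstOrConstSplatFP(N->getOperand(1));
  if (ExpC && ExpC->getValueAPF().isExactlyValue(1.0 / 3.0) &&
      N->getFlags().hasApproximateFuncs())
    return DAG.getNode(ISD::FCBRT, SDLoc(N), VT, N->getOperand(0));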
Differential Revision: https://reviews.llvm.org/D51753
llvm-svn: 342348
Add support for bitcasts from float type to an integer type of the same element bitwidth.
There may be cases where we need to support different widths (e.g. SSE __m128i is treated as v2i64), but I haven't seen cases of this in the wild yet.
llvm-svn: 341652
This was proposed as an IR transform in D49306, but it was not clearly justifiable as a canonicalization.
Here, we only do the transform when the target tells us that sqrt can be lowered with inline code.
This is the basic case. Some potential enhancements are in the TODO comments:
1. Generalize the transform for other exponents (allow more than 2 sqrt calcs if that's really cheaper).
2. If we have less fast-math-flags, generate code to avoid -0.0 and/or INF.
3. Allow the transform when optimizing/minimizing size (might require a target hook to get that right).
Note that by default, x86 converts single-precision sqrt calcs into sqrt reciprocal estimate with
refinement. That codegen is controlled by CPU attributes and can be manually overridden. We have plenty
of test coverage for that already, so I didn't bother to include extra testing for that here. AArch uses
its full-precision ops in all cases (not sure if that's the intended behavior or not, but that should
also be covered by existing tests).
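The basic case boils down to the following sketch (the inline-sqrt
profitability check and the fast-math-flag requirements are omitted here):

  // pow(X, 0.25) --> sqrt(sqrt(X))
  SDLoc DL(N);
  SDValue Sqrt = DAG.getNode(ISD::FSQRT, DL, VT, N->getOperand(0),
                             N->getFlags());
  return DAG.getNode(ISD::FSQRT, DL, VT, Sqrt, N->getFlags());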
Differential Revision: https://reviews.llvm.org/D51630
llvm-svn: 341481
Summary:
I'm not sure if this patch is correct or if it needs more qualifying somehow. Bitcast shouldn't change the size of the load so it should be ok? We already do something similar for stores. We'll change the type of a volatile store if the resulting store is Legal or Custom. I'm not sure we should be allowing Custom there...
I was playing around with converting X86 atomic loads/stores(except seq_cst) into regular volatile loads and stores during lowering. This would allow some special RMW isel patterns in X86InstrCompiler.td to be removed. But there's some floating point patterns in there that didn't work because we don't fold (f64 (bitconvert (i64 volatile load))) or (f32 (bitconvert (i32 volatile load))).
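A sketch of the DAG fold this enables for the scalar case (the alignment and
legality checks are abbreviated; the point is that the old bail-out on
volatile loads goes away, and the MachineMemOperand keeps the volatile flag):

  // fold (f64 (bitconvert (i64 volatile load))) -> (f64 volatile load)
  if (ISD::isNormalLoad(N0.getNode()) && N0.hasOneUse() &&
      TLI.isOperationLegalOrCustom(ISD::LOAD, VT)) {
    LoadSDNode *LN0 = cast<LoadSDNode>(N0);
    SDValue Load = DAG.getLoad(VT, SDLoc(N), LN0->getChain(),
                               LN0->getBasePtr(), LN0->getMemOperand());
    // Keep the users of the old load's chain result working.
    DAG.ReplaceAllUsesOfValueWith(N0.getValue(1), Load.getValue(1));
    return Load;
  }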
Reviewers: efriedma, atanasyan, arsenm
Reviewed By: efriedma
Subscribers: jvesely, arsenm, sdardis, kzhuravl, wdng, yaxunl, dstuttard, tpr, t-tye, arichardson, jrtc27, atanasyan, jfb, llvm-commits
Differential Revision: https://reviews.llvm.org/D50491
llvm-svn: 340797