Implement SSE4.1 Insert/Extract and simplify SSE/SSE2 intrinsic#16646
Implement SSE4.1 Insert/Extract and simplify SSE/SSE2 intrinsic#16646CarolEidt merged 3 commits intodotnet:masterfrom
Conversation
| // Select base type using argument type | ||
| HW_Flag_BaseTypeFromArg = 0x400, | ||
| // Select base type using the first argument type | ||
| HW_Flag_BaseTypeFromFirstArg = 0x400, |
There was a problem hiding this comment.
With this change, all the intrinsics (except MemoryStore) that return a scalar value (e.g., bool) should explicitly have flag HW_Flag_BaseTypeFromFirstArg or HW_Flag_BaseTypeFromSecondArg. cc @4creators
There was a problem hiding this comment.
Speaking of asserts (which we were a few hours ago on another thread), this seems like something that could be asserted somewhere - e.g. a pass over the table that validates these kinds of properties. There are probably other consistency checks across signature, category and flag that could be checked.
src/jit/compiler.h
Outdated
| GenTree* impNonConstFallback(NamedIntrinsic intrinsic, var_types simdType, var_types baseType); | ||
| static bool isImmHWIntrinsic(NamedIntrinsic intrinsic, GenTree* lastOp); | ||
| GenTree* addRangeCheckIfNeeded(NamedIntrinsic intrinsic, GenTree* lastOp, bool mustExpand); | ||
| bool signatureTypeSupproted(var_types retType, CORINFO_SIG_INFO* sig, HWIntrinsicFlag flags); |
There was a problem hiding this comment.
nit: Supproted->Supported
src/jit/hwintrinsicxarch.cpp
Outdated
| // | ||
| // Return Value: | ||
| // On 32-bit platforms, return true if the intrinsic does not use any 64-bit registers (r64) | ||
| bool Compiler::signatureTypeSupproted(var_types retType, CORINFO_SIG_INFO* sig, HWIntrinsicFlag flags) |
There was a problem hiding this comment.
These needs to have HWIntrinsic in the name somewhere.
src/jit/namedintrinsiclist.h
Outdated
| // 64-bit intrinsics | ||
| // Intrinsics that operate over 64-bit general purpose registers are not supported on 32-bit platform | ||
| HW_Flag_64BitOnly = 0x4000, | ||
| HW_Flag_firstArgMaybe64Bit = 0x8000, |
src/jit/hwintrinsiclistxarch.h
Outdated
| HARDWARE_INTRINSIC(SSE_ConvertToInt32, "ConvertToInt32", SSE, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_cvtss2si, INS_invalid}, HW_Category_SIMDScalar, HW_Flag_BaseTypeFromFirstArg) | ||
| HARDWARE_INTRINSIC(SSE_ConvertToInt64, "ConvertToInt64", SSE, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_cvtss2si, INS_invalid}, HW_Category_SIMDScalar, HW_Flag_BaseTypeFromFirstArg) | ||
| HARDWARE_INTRINSIC(SSE_ConvertToSingle, "ConvertToSingle", SSE, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movss, INS_invalid}, HW_Category_Helper, HW_Flag_BaseTypeFromFirstArg) | ||
| HARDWARE_INTRINSIC(SSE_ConvertScalarToVector128Single, "ConvertScalarToVector128Single", SSE, -1, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_cvtsi2ss, INS_invalid}, HW_Category_SIMDScalar, HW_Flag_SecondArgMaybe64Bit|HW_Flag_BaseTypeFromFirstArg) |
There was a problem hiding this comment.
This needs to be marked as CopyUpperBits
src/jit/hwintrinsiclistxarch.h
Outdated
| HARDWARE_INTRINSIC(SSE2_ConvertScalarToVector128Double, "ConvertScalarToVector128Double", SSE2, -1, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_cvtsi2sd, INS_invalid, INS_cvtsi2sd, INS_invalid, INS_cvtss2sd, INS_invalid}, HW_Category_Special, HW_Flag_NoFlag) | ||
| HARDWARE_INTRINSIC(SSE2_ConvertToVector128Int32, "ConvertToVector128Int32", SSE2, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_cvtps2dq, INS_cvtpd2dq}, HW_Category_SimpleSIMD, HW_Flag_BaseTypeFromArg) | ||
| HARDWARE_INTRINSIC(SSE2_ConvertToDouble, "ConvertToDouble", SSE2, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movsd}, HW_Category_Helper, HW_Flag_BaseTypeFromFirstArg) | ||
| HARDWARE_INTRINSIC(SSE2_ConvertToInt32, "ConvertToInt32", SSE2, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_mov_xmm2i, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_cvtsd2si}, HW_Category_SIMDScalar, HW_Flag_BaseTypeFromFirstArg|HW_Flag_MultiIns) |
There was a problem hiding this comment.
Why is this one (and the following) MultiIns?
There was a problem hiding this comment.
Because these intrinsics need special CodeGen (but can be table-driven in the importer). Probably, I should add a new flag?
There was a problem hiding this comment.
Probably, I should add a new flag
I think so. I don't believe we want to use MultiIns for anything other than intrinsics which generate multiple instructions
src/jit/hwintrinsiclistxarch.h
Outdated
| HARDWARE_INTRINSIC(SSE2_ConvertToUInt32, "ConvertToUInt32", SSE2, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_mov_xmm2i, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_SIMDScalar, HW_Flag_BaseTypeFromFirstArg|HW_Flag_MultiIns) | ||
| HARDWARE_INTRINSIC(SSE2_ConvertToUInt64, "ConvertToUInt64", SSE2, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_mov_xmm2i, INS_invalid, INS_invalid}, HW_Category_SIMDScalar, HW_Flag_BaseTypeFromFirstArg|HW_Flag_MultiIns) | ||
| HARDWARE_INTRINSIC(SSE2_ConvertToVector128Double, "ConvertToVector128Double", SSE2, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_cvtdq2pd, INS_invalid, INS_invalid, INS_invalid, INS_cvtps2pd, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_BaseTypeFromFirstArg) | ||
| HARDWARE_INTRINSIC(SSE2_ConvertScalarToVector128Double, "ConvertScalarToVector128Double", SSE2, -1, 16, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_cvtsi2sd, INS_invalid, INS_cvtsi2sd, INS_invalid, INS_cvtss2sd, INS_invalid}, HW_Category_Special, HW_Flag_SecondArgMaybe64Bit|HW_Flag_MultiIns) |
There was a problem hiding this comment.
You changed the corresponding SSE one off of Special, why not this one?
87b2e82 to
556065f
Compare
556065f to
81961d1
Compare
|
Implemented SSE4.1 @CarolEidt @tannergooding PTAL |
| ("InsertScalarTest.template",new string[] { "Sse2", "Sse2", "Insert", "Vector128", "Int16", "Vector128", "Int16", "1", "(short)2", "16", "(short)0", "(i == 1 ? result[i] != 2 : result[i] != 0)"}), | ||
| ("InsertScalarTest.template",new string[] { "Sse2", "Sse2", "Insert", "Vector128", "UInt16", "Vector128", "UInt16", "1", "(ushort)2","16", "(ushort)0", "(i == 1 ? result[i] != 2 : result[i] != 0)"}), | ||
| ("InsertScalarTest.template",new string[] { "Sse2", "Sse2", "Insert", "Vector128", "Int16", "Vector128", "Int16", "129","(short)2", "16", "(short)0", "(i == 1 ? result[i] != 2 : result[i] != 0)"}), | ||
| ("InsertScalarTest.template",new string[] { "Sse2", "Sse2", "Insert", "Vector128", "UInt16", "Vector128", "UInt16", "129","(ushort)2","16", "(ushort)0", "(i == 1 ? result[i] != 2 : result[i] != 0)"}), |
There was a problem hiding this comment.
Made two new templates for insert and extract.
@4creators I implemented SSE2 insert and extract in this PR as well, if you have done on them, I can remove here.
There was a problem hiding this comment.
@fiigii I have implemented them as well
There was a problem hiding this comment.
@4creators Sounds Good, could you submit your current work or open an issue to keep sync with us?
| HARDWARE_INTRINSIC(SSE41_Extract, "Extract", SSE41, -1, 16, 2, {INS_pextrb, INS_pextrb, INS_invalid, INS_invalid, INS_pextrd, INS_pextrd, INS_pextrq, INS_pextrq, INS_extractps, INS_invalid}, HW_Category_IMM, HW_Flag_FullRangeIMM|HW_Flag_BaseTypeFromFirstArg|HW_Flag_MultiIns|HW_Flag_NoRMWSemantics) | ||
| HARDWARE_INTRINSIC(SSE41_Floor, "Floor", SSE41, 9, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_roundps, INS_roundpd}, HW_Category_SimpleSIMD, HW_Flag_NoRMWSemantics) | ||
| HARDWARE_INTRINSIC(SSE41_FloorScalar, "FloorScalar", SSE41, 9, 16, -1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_roundss, INS_roundsd}, HW_Category_Special, HW_Flag_CopyUpperBits) | ||
| HARDWARE_INTRINSIC(SSE41_Insert, "Insert", SSE41, -1, 16, 3, {INS_pinsrb, INS_pinsrb, INS_invalid, INS_invalid, INS_pinsrd, INS_pinsrd, INS_pinsrq, INS_pinsrq, INS_insertps, INS_invalid}, HW_Category_IMM, HW_Flag_FullRangeIMM|HW_Flag_SecondArgMaybe64Bit) |
There was a problem hiding this comment.
@tannergooding Shall I give Insert a HW_Flag_NoRMWSemantics as well?
There was a problem hiding this comment.
It should only be on instructions that don't need hasDelayFreeSrc on op2.
I think pinsrb/pinsrd/pinsrq qualify here (since op2 is an integer reg and taget is an xmm reg).
However, insertps shouldn't have it, since op2 is also an xmm reg and we need to ensure that op2 is different from targetReg
There was a problem hiding this comment.
Okay, looks like Insert should not have it.
There was a problem hiding this comment.
You'll need to ensure that pinsrb/d/q are specially handled
|
@CarolEidt could you please take a look at this PR? |
|
@fiigii I am just getting on a plane. I should be able to review it in a few hours. Sorry for the delay. |
|
@CarolEidt No worries, thank you 😄 |
src/jit/emitxarch.cpp
Outdated
| #ifndef LEGACY_BACKEND | ||
| if (ins == INS_extractps) | ||
| { | ||
| mReg = id->idReg1(); |
There was a problem hiding this comment.
Why can't we handle this in codegen?
There was a problem hiding this comment.
Hmm..we can. The special case extractps uses ModRM:r/m for the dst register (store semantics).
I will handle this in codgen, thanks!
| case NI_SSE41_Extract: | ||
| if (intrinsicTree->gtSIMDBaseType == TYP_FLOAT) | ||
| { | ||
| info->internalIntCount += 1; |
There was a problem hiding this comment.
This is needed to get the value back in an XMM register, correct?
There was a problem hiding this comment.
Yes, the original extractps instruction writes the result (a floating-point number) into a GPR, but our API returns a float. So we need the int-register as the dst of extractps.
src/jit/instrsxarch.h
Outdated
| INST3( pextrb, "pextrb" , 0, IUM_WR, 0, 0, SSE3A(0x14), BAD_CODE, BAD_CODE) // Extract Byte | ||
| INST3( pextrd, "pextrd" , 0, IUM_WR, 0, 0, SSE3A(0x16), BAD_CODE, BAD_CODE) // Extract Dword | ||
| INST3( pextrq, "pextrq" , 0, IUM_WR, 0, 0, SSE3A(0x16), BAD_CODE, BAD_CODE) // Extract Qword | ||
| INST3( extractps, "extractps" , 0, IUM_WR, 0, 0, BAD_CODE, BAD_CODE, SSE3A(0x17)) // Extract float |
There was a problem hiding this comment.
nit: I believe most of these names have been copied out of the Arch manual... So this should probably be Extract Packed Floating-Point Values
src/jit/hwintrinsicxarch.cpp
Outdated
| if (impIsTableDrivenHWIntrinsic(category, flags)) | ||
| { | ||
| if (!varTypeIsSIMD(retType) || ((flags & HW_Flag_BaseTypeFromArg) != 0)) | ||
| if (category == HW_Category_MemoryStore || |
There was a problem hiding this comment.
NOTE: I had to move this block in my AVX PR (#16655) in order to deal with the intrinsics that have both generic and non-generic overloads.
CarolEidt
left a comment
There was a problem hiding this comment.
Looks good overall, but needs minor changes: comments & parentheses.
I mention a suggestion for a consistency check - but that's not something needed for this PR.
| // Select base type using argument type | ||
| HW_Flag_BaseTypeFromArg = 0x400, | ||
| // Select base type using the first argument type | ||
| HW_Flag_BaseTypeFromFirstArg = 0x400, |
There was a problem hiding this comment.
Speaking of asserts (which we were a few hours ago on another thread), this seems like something that could be asserted somewhere - e.g. a pass over the table that validates these kinds of properties. There are probably other consistency checks across signature, category and flag that could be checked.
src/jit/namedintrinsiclist.h
Outdated
| HW_Flag_SpecialCodeGen = 0x20000, | ||
|
|
||
| // No Read/Modify/Write Semantics | ||
| // the intrinsic does not have read/modify/write semantics and doesn't need |
There was a problem hiding this comment.
This is pre-existing, but this comment isn't complete.
I would qualify it with "the intrinsic doesn't have read/modify/write semantics in two-operand form." (or non-AVX, or whatever the appropriate characterization is).
| case InstructionSet_SSE: | ||
| case InstructionSet_SSE3: | ||
| case InstructionSet_SSSE3: | ||
| case InstructionSet_SSE41: |
There was a problem hiding this comment.
One more moved to the happy state! :-)
src/jit/hwintrinsicxarch.cpp
Outdated
| // flags - flags of the intrinsics | ||
| // | ||
| // Return Value: | ||
| // On 32-bit platforms, return true if the intrinsic does not use any 64-bit registers (r64) |
There was a problem hiding this comment.
This description doesn't reflect the name or the real purpose. I think it should say:
// Return Value:
// Returns true iff the given type signature is supported
//
// Notes:
// - This is only used on 32-bit systems to determine whether the signature uses no 64-bit registers.
// - The `retType` is passed to avoid another call to the type system, as it has already been retrieved.
src/jit/hwintrinsicxarch.cpp
Outdated
| if (impIsTableDrivenHWIntrinsic(category, flags)) | ||
| { | ||
| if (!varTypeIsSIMD(retType) || ((flags & HW_Flag_BaseTypeFromArg) != 0)) | ||
| if (category == HW_Category_MemoryStore || |
There was a problem hiding this comment.
This expression needs more parentheses to be fully-parenthesized.
src/jit/hwintrinsicxarch.cpp
Outdated
| assert(category == HW_Category_MemoryStore); | ||
| baseType = | ||
| getBaseTypeOfSIMDType(info.compCompHnd->getArgClass(sig, info.compCompHnd->getArgNext(sig->args))); | ||
| assert(category == HW_Category_MemoryStore || ((flags & HW_Flag_BaseTypeFromSecondArg) != 0)); |
src/jit/lsraxarch.cpp
Outdated
| assert(numArgs != 1); | ||
|
|
||
| if (info->srcCount >= 2) | ||
| if (info->srcCount >= 2 && (genActualType(intrinsicTree->TypeGet()) == genActualType(op2->TypeGet()))) |
There was a problem hiding this comment.
@tannergooding I think we can have a stricter check to avoid handling special cases like Insert.
There was a problem hiding this comment.
Never mind, we cannot have this check.
|
test Windows_NT x64 Checked jitincompletehwintrinsic test Windows_NT x86 Checked jitincompletehwintrinsic test Ubuntu x64 Checked jitincompletehwintrinsic test OSX10.12 x64 Checked jitincompletehwintrinsic |
|
CI seems to have some IO problems. All tests passed on my local machines. |
|
@tannergooding @CarolEidt Addressed all the feedback, does the current code look good to you? |
|
CC. @dotnet/dnceng, are the Mac machines in some bad state? |
| if ((ins == INS_pextrq || ins == INS_pinsrq) && !UseVEXEncoding()) | ||
| { | ||
| assert(UseSSE4()); | ||
| sz += 1; |
There was a problem hiding this comment.
separate issue: We really need to cleanup the code size estimation code (CC. @CarolEidt).
There was a problem hiding this comment.
Yeah, let me just remind you that mis-estimates can lead to odd CHK/RET codegen diffs (#13399).
|
I have logged issue #16696 due to failing OSX jobs |
|
@tannergooding I think there is something wrong in the lab. This is happening across multiple jenkins instances. |
|
|
||
| switch (intrinsicID) | ||
| { | ||
| case NI_SSE_ConvertScalarToVector128Single: |
There was a problem hiding this comment.
Just to confirm. This is now considered table driven, correct?
| case NI_SSE41_Extract: | ||
| if (baseType == TYP_FLOAT) | ||
| { | ||
| info->internalIntCount += 1; |
There was a problem hiding this comment.
Should we explicitly set the internal candidates, as we have done elsewhere?
There was a problem hiding this comment.
No necessary, this internal register can be any GPR.
tannergooding
left a comment
There was a problem hiding this comment.
This looks reasonable to me overall.
I would like to spend some time cleaning up this code (as it is a bit messy right now) after we get the currently approved intrinsics in.
Don't you mean to clean up the code-size estimation code? |
I meant the intrinsic code in general. I've gone over it a couple times and it looks like there is some room for cleanup/improvement in a few places. |
|
test OSX10.12 x64 Checked jitincompletehwintrinsic |
|
test OSX10.12 x64 Checked Innerloop Build and Test |
|
Can we merge this PR? |
I will go ahead and merge. I think the odds of breaking something on OSX with this change but not the other OS's is pretty small. |
|
It seems that we had a bad merge with PR #16669. @fiigii @tannergooding could one of you ensure consistent use of |
|
@CarolEidt I have changed |
This PR
r64instructions on 32-bit platforms and simplify SSE/SSE2 intrinsic code(close https://github.com/dotnet/coreclr/issues/16342).InsertandExtractand marks SSE4.1 as complete.@CarolEidt @tannergooding PTAL