-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Arm64 apple vm fixes for arg alignment. #46665
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ade1e53 to
c50a5d6
Compare
|
PTAL @janvorli , @jkotas , @sdmaclea , with these changes arm64 passes my small repro from sandreenko@678ce4b and does not show any new failures in my local run, generic test reflection is passing further but still failing, probably with an independent issue. |
janvorli
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is one more place to change, the https://github.com/dotnet/runtime/blob/master/src/coreclr/tools/aot/ILCompiler.ReadyToRun/Compiler/DependencyAnalysis/ReadyToRun/ArgIterator.cs. This one is basically a port of the callingconvention.h to managed code.
src/coreclr/vm/callingconvention.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit misleading, the size is not a target pointer size, but a floating point size. I know they are the same on arm, but they are not the same on arm64 and x64 and the equivalent functions for those two use / 16. I would prefer defining FLOAT_REGISTER_SIZE in the cgencpu.h files and using it here and in the same arm64 and x64 methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I would avoid the term FLOAT_REGISTER_SIZE on arm/arm64. I would prefer FLOAT_SIZE or SIZEOF_FLOAT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added FLOAT_REGISTER_SIZE,
I would prefer FLOAT_SIZE or SIZEOF_FLOAT.
would not it be confusing to have #define FLOAT_SIZE 16 on arm64?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I didn't expect the value to be 16.
- For Arm32, a float register i.e.
s0would be 4 bytes, but it would be one of many in theq0/v0register. - For arm64
s0is the lower 4 bytes of thev0SIMD register which is 16 bytes.
FLOAT_SIZE is definitely, worse, but I still think FLOAT_REGISTER_SIZE might be ambiguous for ARM architectures. VECTOR_REGISTER_SIZE might be better for ARM/ARM64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing is that we use m_idxFloatReg, m_cFloatReg fields in the surrounding code, thus the FLOAT_REGISTER_SIZE matches those.
src/coreclr/vm/callingconvention.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This variable name is misleading, as it looks as if it was the size of argument in bytes, but is it the size aligned to stack element size. The more I look at the code below, the more I feel like we should just keep using the cSlots except for the places where we place arguments on stack (we set the m_byteStackSize). This comment is meant for all architectures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does cSlots meant slots count? Does slot size depend on context, like it could be a general reg size, float reg size, stack slot size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cSlots meant slot count where slots were considered both register slots in the transition block and stack slots. Only on ARM32 the term has leaked into the floating point register count setting, which may be misleading. Now the stack slot concept is kind of broken by the Apple ARM64 calling convention, but the register slot still holds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have changed this code so we don't create cSlots and don't call StackElemSize unless we put byteArgSize on the stack.
src/coreclr/vm/callingconvention.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if this check is needed here. The ArgIteratorTemplate<ARGITERATOR_BASE>::GetNextOffset() is responsible for getting the right offset depending on the available registers and argument type and size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are cases where we get into this check, for example, JIT\Stress\ABI\pinvokes_d:
CoreCLR!ArgIteratorTemplate<ArgIteratorBase>::GetArgLoc+0x10c
CoreCLR!GenerateShuffleArrayPortable+0x27c
CoreCLR!GenerateShuffleArray+0x14
comes here with byteArgSize = 17 and we change its size to 8 here, if we don't do this we will get a wrong result.
would do it in a separate PR because it looks like the same amount of work there. |
Makes sense |
c50a5d6 to
302a379
Compare
janvorli
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM modulo the comments.
src/coreclr/vm/comdelegate.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should make this change and the one in the caller when we can only handle the case when the offsets are multiples of TARGET_POINTER_SIZE anyways. I believe it would require quite complex change to the shuffle thunk generation to make it work for non-aligned stuff, so it seems that doing it half way doesn't bring any benefit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is now exactly half-way and it is required to pass existing tests.
We have cases on arm64 apple when we return byteIndex that is not byteIndex % 8 == 0 but it is the same for source and destination, so we don't need to generate a move and don't hit the NYI assert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, ok, makes sense then.
|
@sandreenko Looks like this is breaking a pinvoke test on linux-arm64 & windows-arm64 |
16c8b29 to
69195a9
Compare
src/coreclr/vm/callingconvention.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A nit - can you please use the floatRegOfsInBytes constant here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, thanks for catching
…atforms. Before some platforms were using stackSlots, some curOfs in bytes.
Fix arm32. x86 fixes. use StackSize on ArgSizes Add `GetStackArgumentByteIndexFromOffset` and return back the old values for asserts. another fix
because it won't pass on arm64 apple.
It is not a complete fix for arm64 apple, but covers most cases.
db04d24 to
e6c5cc6
Compare
|
I have checked that the updated version still works on arm64 apple and fixes JIT/HardwareIntrinsics/General/Vector* tests. |
Contributes to #46456.
Move to byte sizes/offsets in VM so it works correctly with arguments that do not occupy TARGET_POINTER_SIZE stack slots on arm64.
It also unifies some platforms, for example, before
ArgIteratorBasewas using byte offsets for some and stack slot index for the others, now it uses byte offset for all.Use
TARGET_POINTER_SIZEinstead ofSTACK_ELEM_SIZEbecause:TARGET_POINTER_SIZEwas already used in such context;STACK_ELEM_SIZEand I did not come up with a better name;