Skip to content

Win64 AVX misalignment bug patch #1209#8317

Closed
carlkl wants to merge 1 commit intomsys2:masterfrom
carlkl:binutils-avx-win64
Closed

Win64 AVX misalignment bug patch #1209#8317
carlkl wants to merge 1 commit intomsys2:masterfrom
carlkl:binutils-avx-win64

Conversation

@carlkl
Copy link

@carlkl carlkl commented Apr 9, 2021

MINGW64 and UCRT64: msys2/MSYS2-packages#1209

Unconditionally replaces the opcodes for aligned AVX instructions with the opcodes of matching unaligned instructions in the assembler stage. There should be no loss of performance for aligned memory accesses on moderate modern processors. Unaligned memory access will not segfault anymore.

So-called not-temporal streaming instructions are not handled in this PR as they have no unaligned counterpart. These instructions are used much less commonly used. Falling back to SSE2 streaming instructions is the recommended approach for this use case.

@1480c1
Copy link
Contributor

1480c1 commented Apr 9, 2021

You might want to check your git settings, it seems you might not have it setup properly as it is picking up U-CAMPUS\kleffner (automatically MACHINE\username if you don't set an author) as the author

@1480c1
Copy link
Contributor

1480c1 commented Apr 9, 2021

Another note, your issue tag in the commit message is not pointing to the right repository, it's pointing to issue 1209 in this repo instead of msys2/MSYS2-packages#1209 you might want to move that to the commit message as well

@carlkl
Copy link
Author

carlkl commented Apr 9, 2021

@1480c1 , I used the wrong link for the issue in the comment line of the commit. No idea how to correct that.

@1480c1
Copy link
Contributor

1480c1 commented Apr 9, 2021

you can run git commit --amend if you use the command line and then git push -f

@carlkl carlkl force-pushed the binutils-avx-win64 branch from 68b88cd to 6563d8a Compare April 9, 2021 15:11
@lazka
Copy link
Member

lazka commented Apr 11, 2021

This needs to go through upstream (binutils or gcc).I don't know enough to merge this.

If you propose this upstream and get feedback we might be able to backport it before the next release though.

@lazka lazka closed this Apr 11, 2021
@carlkl
Copy link
Author

carlkl commented Apr 11, 2021

It is highly unlikely, that this patch will be accepted be GNU binutils. This patch is a pragmatic approch to cirumvent the unfortunate mismatch between the the Windows 64 ABI and gcc/binutils.

It is also unlikely that this mismatch will be handled anytime soon by the gcc developers. The corresonding issues are open for quite some time now.

Up to now it is not possible to safely use -mavx for gcc 64-bit on windows due to misalignment segfaults.

@lazka
Copy link
Member

lazka commented Apr 11, 2021

I understand that upstreaming things can be frustrating, but we have to make sure that our packages stay maintainable and don't diverge too much from upstream. And with this being a core package we have to be extra careful.

@mingwandroid
Copy link
Member

mingwandroid commented Apr 11, 2021

I have asked kai teitz for his opinion on this PR. IMHO not completely broken is better than broken. But let's see.

@lazka lazka reopened this Apr 11, 2021
@carlkl
Copy link
Author

carlkl commented Apr 11, 2021

I tested patching these opcodes from binutils 2.34 on. Practice has shown that using sed scripts is much more stable than using fragile patch files in this specific case. It can also shown, that the binary output compared to the default settings exclusively replaces aligned with unaligned instructions.

The questions which should be discussed is: do we want to perform this kind of hack?

@mingwandroid
Copy link
Member

Maybe behind an option?

@carlkl
Copy link
Author

carlkl commented Apr 11, 2021

An option for PKBUILD? It would be nice to have an option for ld.exe. However, in this case I have no ideo how to accomplish that.

@mingwandroid
Copy link
Member

We can ask people if they agree to the idea for advice. There are enough toolchain experts around here to help with the implementation too, i hope.

@katietz
Copy link

katietz commented Apr 11, 2021

Hmm, I thought there were recently patches done to gcc for better support of aligned variables for AVX.
Nevertheless this patch has some down-sides too. It will hit performance badly, as avx-512 on none aligned addresses on load is really slow ... but well, choosing here between pest and cholara ...
Patching binutils is here for sure the wrong approach. The proper change, if we need to go the route of opcodes, is to add a flag to gcc, which switches the generation of aligned vs. unaligned instructions. Binutils should be already able to create proper memonics for unaligned access ...

@carlkl
Copy link
Author

carlkl commented Apr 11, 2021

Are there GCC commits for better support of aligned or for unaligned variables? It would be interesting to hear, that unaligned addresses don't segfault with newer gcc's with aligned instructions, but I could not find anything about that.

It is said, that there is no significant performance loss on aligned adresses with unaligned compared to aligned instructions. And in case of unaligned addresses bad performance is better than segfaults. So in principle patching gcc is the better approach I agree. Would not help for written assembler code though.

@katietz
Copy link

katietz commented Apr 11, 2021

well, it might be that fix at bugzilla/PR target/99234 helped here

@katietz
Copy link

katietz commented Apr 12, 2021

I am not sure if it address it in all aspects, I doubt it to some extend, but it should bring things here in a better state also for alignment.

Anyway I can guide you through producing a proper binutils/gcc patch, if you like. I have some experience in that area

@carlkl
Copy link
Author

carlkl commented Apr 12, 2021

Since i have no idea how to achieve this goal with GCC patches, i will need all the help i can get.

@lazka lazka marked this pull request as draft April 13, 2021 15:48
@carlkl
Copy link
Author

carlkl commented Apr 20, 2021

Here is a quote from Intel:

https://community.intel.com/t5/Intel-ISA-Extensions/movups-and-movupd-movaps-and-movapd/td-p/804505

Maxim_L_Intel1
Employee
02-22-2012 06:25 PM

right, using MOVUPS for any floating point type double or single (and AES instructions too, btw) 
is OK and recommended, MOVDQU should be used with integer types - MOVUPS is as fast as 
MOVAPS for _aligned_ data starting with Nehalem (aka Core i7 / Xeon 5500 etc.)

in AVX, there is an interesting and important paradigm change however, as LD+OP instructions 
no longer generate exceptions

i.e. in SSE:
ADDPS xmm0, [rsp+10] is the equivalent of MOVAPS xmm1, [rsp+10]; ADDPS xmm0, xmm1;
while in AVX:
VADDPS xmm0, xmm0, [rsp+10] <=> VMOVUPS xmm1, [rsp+10]; VADDPS xmm0, xmm0, xmm1;

so, in AVX, to keep uniform exception behavior (more precisely exception-less behavior) 
that is independent on compilers code generation it is strongly recommended to avoid using 
VMOVAPS/VMOVDQA instructions and _mm[256]_load_xx() intrinsics and always use 
VMOVUPS/VMOVDQU instructions and _mm[256]_loadu_xx() intrinsics instead, 
it is neutral for performance and will never surprise you (or customer) with the 
exception (crash) if data passed to the instructions sometimes happen to be misaligned.

Having said that, for the best performance results, please keep aligning your data.

-Max

This PR was created according to this recommendation (last paragraph). When GCC creates no longer misaligned code by itself this patch could be removed again.

@carlkl
Copy link
Author

carlkl commented Apr 20, 2021

Having said this, if this PR will not be accepted as it is I will close it. Patches of different kind to GCC and/or binutils should land in another PR IMHO.

@katietz
Copy link

katietz commented Apr 21, 2021 via email

@carlkl
Copy link
Author

carlkl commented Apr 21, 2021

@katietz, did I missed something? There is no attachment as far I can see.

@katietz
Copy link

katietz commented Apr 21, 2021

Patch file for

gcc

@katietz
Copy link

katietz commented Apr 21, 2021

Hmm

@katietz
Copy link

katietz commented Apr 21, 2021

Could you send me an Email address... I can't attach the Patch file

@1480c1
Copy link
Contributor

1480c1 commented Apr 21, 2021

Perhaps you could try uploading it to gist.github.com and sending a link to it?

@carlkl
Copy link
Author

carlkl commented Apr 27, 2021

The patch works with a small change (see comment in the gist)
If someone wants to make tests: the ucrt64 packages (with and without applied 0001-add-m-no-align-vector-insn-option-for-i386.patch) are uploaded here: mingw-w64-ucrt-x86_64-gcc-10.3.0-1 (pw: ucrt64)

@MehdiChinoune
Copy link
Collaborator

MehdiChinoune commented Sep 23, 2021

Applying this patch doesn't fix the bug at all.

@carlkl
Copy link
Author

carlkl commented Sep 23, 2021

@MehdiChinoune, could you explain this a little bit more specific? Diid you apply the alternative @katietz gcc patch or the binutils opcode patch? Can you give an example where your bug applies? In theory the binutils patch should avoid Segfaults, not introduce new ones.

@MehdiChinoune
Copy link
Collaborator

@MehdiChinoune, could you explain this a little bit more specific? Diid you apply the alternative @katietz gcc patch or the binutils opcode patch? Can you give an example where your bug applies? In theory the binutils patch should avoid Segfaults, not introduce new ones.

I have applied the binutils opcode. nothing has changed still segfaults with avx and avx2

@carlkl
Copy link
Author

carlkl commented Sep 23, 2021

@MehdiChinoune, you first applied this patch to binutils, then (2) compiled binutils with this patch and then (3) installed your newly created binutils package via pacman to your Msys2 environment, right?

@MehdiChinoune
Copy link
Collaborator

@MehdiChinoune, you first applied this patch to binutils, then (2) compiled binutils with this patch and then (3) installed yout newly created binutils packge via pacman to your Msys2 environment, right?

Yes, Of course. I am not new to MSYS2/MinGW64.
but I have gcc-11 on my experimental envireonnement (MINGW-x86_64-v3) (https://github.com/msys2/MINGW-packages/discussions/9456)

@carlkl
Copy link
Author

carlkl commented Sep 23, 2021

However, without further informations it will not be possible to investigate further into the bug you found.

@MehdiChinoune
Copy link
Collaborator

However, without further informations it will not be possible to investigate further into the bug you found.

  • Package binutils with the patch
  • Took a simple example:
// compile command: gcc -O0 -m64 -mavx -o bug.exe bug.c
typedef double v4d __attribute__ ((vector_size (32), aligned (32)));

v4d f (v4d x)
{
  return x;
}

int main ()
{
  v4d x = { 1.0, 2.0, 3.0, 4.0, };
  v4d r = f (x);
}
  • Build it and see it segfaults or not

Is this difficult to do!

@carlkl
Copy link
Author

carlkl commented Sep 23, 2021

My test with the binutils patch shows this:

$ gcc -O0 -march=sandybridge -c bug.c
$ objdump.exe -D bug.o | grep vmov
  12:   c5 fd 10 03             vmovupd (%rbx),%ymm0
  16:   c5 fd 11 45 e0          vmovupd %ymm0,-0x20(%rbp)
  1f:   c5 fd 10 45 e0          vmovupd -0x20(%rbp),%ymm0
  24:   c5 fd 11 00             vmovupd %ymm0,(%rax)
  5f:   c5 fd 10 05 00 00 00    vmovupd 0x0(%rip),%ymm0        # 67 <main+0x34>
  67:   c5 fd 11 43 20          vmovupd %ymm0,0x20(%rbx)
  70:   c5 fd 10 43 20          vmovupd 0x20(%rbx),%ymm0
  75:   c5 fd 11 85 70 ff ff    vmovupd %ymm0,-0x90(%rbp)
  8c:   c5 fd 10 45 90          vmovupd -0x70(%rbp),%ymm0
  91:   c5 fd 11 03             vmovupd %ymm0,(%rbx)

I didn't encounter segfaults with bug.exe in case of patched binutils.

Unpatched I get:

$ objdump.exe -D bug.o | grep vmov
  12:   c5 fd 28 03             vmovapd (%rbx),%ymm0
  16:   c5 fd 29 45 e0          vmovapd %ymm0,-0x20(%rbp)
  1f:   c5 fd 28 45 e0          vmovapd -0x20(%rbp),%ymm0
  24:   c5 fd 29 00             vmovapd %ymm0,(%rax)
  5f:   c5 fd 28 05 00 00 00    vmovapd 0x0(%rip),%ymm0        # 67 <main+0x34>
  67:   c5 fd 29 43 20          vmovapd %ymm0,0x20(%rbx)
  70:   c5 fd 28 43 20          vmovapd 0x20(%rbx),%ymm0
  75:   c5 fd 29 85 70 ff ff    vmovapd %ymm0,-0x90(%rbp)
  8c:   c5 fd 28 45 90          vmovapd -0x70(%rbp),%ymm0
  91:   c5 fd 29 03             vmovapd %ymm0,(%rbx)

with segfaults

@MehdiChinoune
Copy link
Collaborator

My test with the binutils patch shows this:

$ gcc -O0 -march=sandybridge -c bug.c
$ objdump.exe -D bug.o | grep vmov
  12:   c5 fd 10 03             vmovupd (%rbx),%ymm0
  16:   c5 fd 11 45 e0          vmovupd %ymm0,-0x20(%rbp)
  1f:   c5 fd 10 45 e0          vmovupd -0x20(%rbp),%ymm0
  24:   c5 fd 11 00             vmovupd %ymm0,(%rax)
  5f:   c5 fd 10 05 00 00 00    vmovupd 0x0(%rip),%ymm0        # 67 <main+0x34>
  67:   c5 fd 11 43 20          vmovupd %ymm0,0x20(%rbx)
  70:   c5 fd 10 43 20          vmovupd 0x20(%rbx),%ymm0
  75:   c5 fd 11 85 70 ff ff    vmovupd %ymm0,-0x90(%rbp)
  8c:   c5 fd 10 45 90          vmovupd -0x70(%rbp),%ymm0
  91:   c5 fd 11 03             vmovupd %ymm0,(%rbx)

I didn't encounter segfaults with bug.exe in case of patched binutils.

Unpatched I get:

$ objdump.exe -D bug.o | grep vmov
  12:   c5 fd 28 03             vmovapd (%rbx),%ymm0
  16:   c5 fd 29 45 e0          vmovapd %ymm0,-0x20(%rbp)
  1f:   c5 fd 28 45 e0          vmovapd -0x20(%rbp),%ymm0
  24:   c5 fd 29 00             vmovapd %ymm0,(%rax)
  5f:   c5 fd 28 05 00 00 00    vmovapd 0x0(%rip),%ymm0        # 67 <main+0x34>
  67:   c5 fd 29 43 20          vmovapd %ymm0,0x20(%rbx)
  70:   c5 fd 28 43 20          vmovapd 0x20(%rbx),%ymm0
  75:   c5 fd 29 85 70 ff ff    vmovapd %ymm0,-0x90(%rbp)
  8c:   c5 fd 28 45 90          vmovapd -0x70(%rbp),%ymm0
  91:   c5 fd 29 03             vmovapd %ymm0,(%rbx)

with segfaults

I think you are still using binutils 2.36, right!
the second line should be:

-  sed -i '/vmovapd/s/0x6628/0x6610/g;/vmovaps/s/0x28/0x10/g;/vmovdqa/s/0x666f/0xf36f/g;/vmovdqa/s/0x666F/0xF36F/g' opcodes/i386-tbl.h
+  sed -i '/vmovapd/s/0x28/0x10/g;/vmovaps/s/0x28/0x10/g;/vmovdqa/s/0x666f/0xf36f/g;/vmovdqa/s/0x666F/0xF36F/g' opcodes/i386-tbl.h

binutils have "vmovapd", 0x28 not "vmovapd", 0x6628
The patch should be updated against binutils 2.37.

@carlkl
Copy link
Author

carlkl commented Sep 23, 2021

It is correct, the patch should be updated to the recent version of binutils that is used right now by Msys2. I test this I soon I can. Feel free to test yourself and try again.

binutils have "vmovapd", 0x28 not "vmovapd", 0x6628
The patch should be updated against binutils 2.37.

EDIT: binutils-2.3.7 i386-opc.tbl:

vmovapd, 0x6628
vmovaps, 0x28

The sed-based patch was created to be largely indepent from the binutils version. Nevertheless I will test again with binutils-2.3.7

@MehdiChinoune
Copy link
Collaborator

MehdiChinoune commented Sep 23, 2021

It is correct, the patch should be updated to the recent version of binutils that is used right now by Msys2. I test this I soon I can. Feel free to test yourself and try again.

binutils have "vmovapd", 0x28 not "vmovapd", 0x6628
The patch should be updated against binutils 2.37.

EDIT: binutils-2.3.7 i386-opc.tbl:

vmovapd, 0x6628
vmovaps, 0x28

The sed-based patch was created to be largely indepent from the binutils version. Nevertheless I will test again with binutils-2.3.7

Did you read my comment? The line that should be fixed is the one for opcodes/i386-tbl.h

Anyway, After applying the modified lines, Things look good. the example no longer segfaults.

@carlkl
Copy link
Author

carlkl commented Sep 23, 2021

I see; I will update the PR for binutils-2.3.7.

Good to here that it works for you with this simple example. I am looking forward for measures for real live examples.

@MehdiChinoune
Copy link
Collaborator

MehdiChinoune commented Sep 25, 2021

While the binutils patch fix that little example, other applications segfaults.
I have built qt5/qt6 with avx support (which was disabled with MINGW64+GCC because of this bug), It always segfaults unfortunately.
I will try the gcc patch and see if it fixea that qt issue.

@MehdiChinoune
Copy link
Collaborator

Qt5/Qt6 Applications no longer crashes after applying the gcc patch.

@carlkl
Copy link
Author

carlkl commented Oct 2, 2021

@MehdiChinoune, congrats to this success. As you tested Kai's patch do you want to create a new PR with reference to the original gist? Did you test with Kai's patch only or with the binutils patch as well?

@MehdiChinoune
Copy link
Collaborator

MehdiChinoune commented Oct 2, 2021

As you tested Kai's patch do you want to create a new PR with reference to the original gist?

I don't want to mess with something I don't understand. The compiler is a critical component of the subsystem.

Did you test with Kai's patch only or with the binutils patch as well?

without binutils patch.

@carlkl
Copy link
Author

carlkl commented Oct 2, 2021

If you want to maintain Kai's patch please let me know. (EDIT: Please ignore, I didn't read your last message before sending this comment.)
I'm about to close this PR to prepare a dedicated PR for gcc builds.

@MehdiChinoune
Copy link
Collaborator

MehdiChinoune commented Oct 2, 2021

If you want to maintain Kai's patch please let me know. I'm about to close this PR very soon.

Fell free to close it.

@MehdiChinoune
Copy link
Collaborator

@carlkl I think with #9998 merged, this could be closed.

@carlkl
Copy link
Author

carlkl commented Nov 5, 2021

patching opcodes in GCC's code generation is the better approach.

@pps83
Copy link

pps83 commented Apr 29, 2025

// compile command: gcc -O0 -m64 -mavx -o bug.exe bug.c
typedef double v4d __attribute__ ((vector_size (32), aligned (32)));

v4d f (v4d x)
{
  return x;
}

int main ()
{
  v4d x = { 1.0, 2.0, 3.0, 4.0, };
  v4d r = f (x);
}
  • Build it and see it segfaults or not

FYI, this code still segfaults when built/run under msys2.

@carlkl
Copy link
Author

carlkl commented Apr 29, 2025

@msys2 msys2 locked as off-topic and limited conversation to collaborators Apr 29, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants