Enable NEON optimization for cvRound on newer devices#4103
Enable NEON optimization for cvRound on newer devices#4103opencv-pushbot merged 1 commit intoopencv:masterfrom
Conversation
|
The optimization are the same vfp instruction but now enabled in more flags cases. I really suggest and put in evidence in documentation to enable this flags in arm and android apps on this targets because cvround it is inline so depend on app flags and not on what flags was used to build opencv in manager. This little thing could be one of the performance blocker on arm/android for third party developers. |
|
@bhack, in the case of iOS it happens automatically. in the case of Android there is tutorial: |
|
@vpisarev I don't know if we have the same problem also on SSE. Do you have an Intel android phone or tablet? We have other inline that depend on flags? |
|
@bhack, for SSE there is completely different branch, which should use SSE2 intrinsics |
|
@mtamburrano, what compiler do you use? Doesn't it have the |
|
hi @mshabunin, |
|
Does ARMv5TE architecture support NEON? This is strange. |
|
@mshabunin I don't know if generally will be safe for a target android application to use runtime detection on android: https://developer.android.com/ndk/guides/cpu-arm-neon.html |
|
But the variable to be checked is always also ARM_NEON_FP. |
|
@bhack, I've checked several toolchains and now I think we do not need NEON checks in VFP detection instruction:
$ ./android-ndk-r8b/toolchains/arm-linux-androideabi-4.4.3/prebuilt/linux-x86/bin/arm-linux-androideabi-g++ -mfloat-abi=hard -mfpu=neon -c test.cpp
test.cpp:7:2: warning: #warning "__ARM_NEON__ defined"
/tmp/ccEwu06a.s: Assembler messages:
/tmp/ccEwu06a.s:37: Error: bad instruction `vcvtr.s32.f64 s15,d16'
$ ./android-ndk-r10d/toolchains/arm-linux-androideabi-4.9/prebuilt/linux-x86_64/bin/arm-linux-androideabi-g++ -mfloat-abi=soft -c test.cpp
test.cpp:3:2: warning: #warning "__ARM_NEON_FP defined" [-Wcpp]
test.cpp: In function 'int cvRound(double)':
test.cpp:37:129: error: inconsistent operand constraints in an 'asm'
asm("vcvtr.s32.f64 %[temp], %P[value] \n vmov %[res], %[temp]" : [res] "=r" (res), [temp] "=w" (temp) : [value] "w" (value));Thus, I don't think NEON-related macros imply required VFPv3 instruction/register compiler support. Currently I'm sure Also, I tried to build android pack with suggested patch and it had failed: http://pullrequest.opencv.org/buildbot/builders/master_pack-android/builds/114/steps/shell/logs/stdio |
|
@mshabunin Do you have __ARM_PCS_VFP without -mfloat-abi=hard? |
@bhack, no, it is not defined. I checked some variants of VFP check with this script and test file and it seems that the best version is: #if defined __GNUC__ && defined __arm__ && (defined __ARM_PCS_VFP || defined __ARM_VFPV3__ || defined __ARM_NEON__) && !defined __SOFTFP__It will compile for all tested configurations (NDKr8d / NDKr10d / clang3.5-gnueabi / clang3.5-android * soft / softfp / hard) without errors. However, these tests do not cover AArch64 and I'm still not sure if any runtime check is it needed to run compiled code. |
|
@mshabunin I.e. Is your test covering -mfpu=neon-vfpv4 or vfpv4? |
|
Generally i think that on Android a developer want that its distributed application is optimized dynamically at runtime based on instruction availability. The problem is that things like cvround, that make a really big difference not optimized on this platforms, are implemented inline in the header. When a developer deploys an opencv app on Play marketplace he doesn't really know on what target optimized instruction set will be installed. But things like cvround are hereditarily given to the app by opencv headers. So it is generally out of control at runtime with the android mechanisms cited in the link of my previous message. |
No, only |
|
I think that with mfpu=neon-vfpv4 or vfpv4 (and others) the optimized instruction are supported. But i'dont know if you condition cover this ones. |
|
so, what's the conclusion? should we put this patch in? |
|
My opinion is that runtime detection is the best solution but i don't know if we want/could introduce that in opencv. I think that selecting some good default flags could be a good compromise. |
|
@vpisarev, current proposed condition does not compile for android-armeabi, so it should not be merged yet. There is needed at least @bhack, for NEON there is @mtamburrano, I do not see any direct way to ensure that compiler (GCC) supports VFPv3, probably your proposed method will be enough if you exclude legacy platforms with soft float abi. |
|
@mshabunin I think SOFTFP is still valid. Probably -mfloat-abi=soft is not valid because disable neon and vfp (but probably we already check that with other ifs) |
|
@bhack, you are right, but it is some kind of misleading macro naming: compiler with |
|
@mshabunin I've not tested extensively but ok.. I don't know if there are some compilers bugs :) |
|
@mshabunin I updated the patch with your suggestions, I don't know why __ARM_NEON_FP is always defined with all -mfloat-abi types (soft/softfp/hard) and even without -mfpu=neon. Instead __ARM_NEON seems enabled only when -mfpu=neon is enabled and the mfloat-abi type is not soft. By the way I've already tried opencv 2.4.8 included in the Nvidia dev pack, but it seems that the tegra optimization flag is also not raised there, and anyway if it is defined the tegra_round.hpp inclusion fails, I can't find this file anywhere |
|
@mtamburrano, it's OK. Can you, please, squash all commits into one? |
|
👍 |
cvRound seems was not optimized with NEON instructions on newer devices.
Tested on an NVIDIA SHIELD TABLET and with the flag CV_VFP on, the speed up is greater than 4x.
Still I'm not sure why TEGRA_OPTIMIZATION is not enabled, if forced with a define, the header tries to include a mysterious file "tegra_round.hpp", which is not found