common: apply two stage copy to aarch64 #3145
On aarch64, ZSTD_wildcopy uses a simple loop to copy memory 16 bytes at a time. An existing optimized two-stage copy achieves better performance. Applying it to aarch64 also gives a ~1% uplift on the silesia corpus.

Signed-off-by: Jun He <[email protected]>
Change-Id: Ic1253308e7a8a7df2d08963ba544e086c81ce8be
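For readers unfamiliar with the two formulations, here is a minimal sketch of the difference, not the actual zstd source. The names `copy16`, `wildcopy_simple`, and `wildcopy_two_stage` are hypothetical, and the sketch assumes non-overlapping buffers with enough slack after both `src` and `dst` to absorb over-reads and over-writes (up to 15 bytes for the simple loop, up to 31 bytes for the two-stage version).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy exactly 16 bytes. */
static void copy16(unsigned char *dst, const unsigned char *src) {
    memcpy(dst, src, 16);
}

/* Simple loop: one 16-byte copy per iteration.
 * May touch up to 15 bytes past the requested length. */
static void wildcopy_simple(unsigned char *dst, const unsigned char *src,
                            size_t length) {
    unsigned char *const end = dst + length;
    assert(length > 0);
    do {
        copy16(dst, src);
        dst += 16; src += 16;
    } while (dst < end);
}

/* Two-stage copy: the first 16-byte copy is peeled out of the loop so the
 * common short-copy case gets its own branch; longer copies then proceed
 * 32 bytes per iteration. May touch up to 31 bytes past the length. */
static void wildcopy_two_stage(unsigned char *dst, const unsigned char *src,
                               size_t length) {
    unsigned char *const end = dst + length;
    assert(length > 0);
    copy16(dst, src);
    if (length <= 16) return;   /* short copies exit here */
    dst += 16; src += 16;
    do {
        copy16(dst, src); dst += 16; src += 16;
        copy16(dst, src); dst += 16; src += 16;
    } while (dst < end);
}
```

Both functions produce the same first `length` bytes; they differ only in branch structure and in how far past the end they may write.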
Hi @Cyan4973, the change has been benchmarked on the Arm N1/A72/A57 platforms, and a similar uplift was observed.
I can't remember why this code was added here. It could be that, with From what I can see, the second formulation just separates the first branch from later ones, so that it can have its own statistics (as opposed to being merged with other loop iterations). Such a construction is expected to be rather good in the context of So I'm gonna make an educated guess here and state that this PR tends to improve the situation, on top of simplifying it by removing a weird and poorly documented corner case.
Thanks for the PR @JunHe77!