Improve compression speed on ZEN/ZEN2 #867

@wolfpld

The MSVC implementation of LZ4_NbCommonBytes compiles to the bsf instruction, which is suboptimal on the ZEN/ZEN2 microarchitectures, as indicated below (the code compiles to the register-to-register "R, R" variant):

[image]

The tzcnt instruction, which behaves similarly, has a better internal implementation:

[image]

Quoting Intel manuals:

The key difference between TZCNT and BSF instruction is that TZCNT provides operand size as output when source operand is zero while in the case of BSF instruction, if source operand is zero, the content of destination operand are undefined.
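
The zero-input difference quoted above can be modeled in portable C. This is only a software sketch of the documented semantics for a 64-bit operand, not a description of how either instruction is implemented in hardware; the names `tzcnt64_model` and `tzcnt64_model_demo` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software model of TZCNT on a 64-bit operand:
 * returns the number of trailing zero bits, and the operand size (64)
 * when the input is zero. BSF leaves its destination undefined in that
 * case, so it has no well-defined result to model for zero. */
static unsigned tzcnt64_model(uint64_t x)
{
    unsigned n = 0;
    if (x == 0) return 64;          /* TZCNT: operand size for zero input */
    while ((x & 1u) == 0) {         /* shift until the lowest set bit reaches bit 0 */
        x >>= 1;
        n++;
    }
    return n;
}

/* Small demonstration of the model on a few inputs. */
static void tzcnt64_model_demo(void)
{
    assert(tzcnt64_model(0) == 64);
    assert(tzcnt64_model(1) == 0);
    assert(tzcnt64_model(0x10000) == 16);
}
```

For any nonzero input the model gives the same bit index that bsf would produce, which is why the two instructions are interchangeable whenever zero is excluded.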

This case can be ignored, as the LZ4_count function never passes zero as the input parameter.
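
To illustrate why zero never reaches the bit scan, here is a simplified sketch of a match-length loop in the spirit of LZ4_count. It is not the actual LZ4 source: `load_le64`, `first_diff_byte`, and `count_match` are hypothetical names, and the byte-wise helpers stand in for the real word loads and bit-scan intrinsics. The scan is reached only after the XOR of the two words has been found nonzero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assemble 8 bytes into a little-endian 64-bit word (independent of host endianness). */
static uint64_t load_le64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--) v = (v << 8) | p[i];
    return v;
}

/* Index of the lowest nonzero byte of diff; stands in for (bit scan >> 3). */
static unsigned first_diff_byte(uint64_t diff)
{
    unsigned n = 0;
    assert(diff != 0);              /* precondition guaranteed by the caller */
    while ((diff & 0xFF) == 0) { diff >>= 8; n++; }
    return n;
}

/* Number of leading bytes on which a and b agree, up to len. */
static size_t count_match(const uint8_t *a, const uint8_t *b, size_t len)
{
    size_t n = 0;
    while (n + 8 <= len) {
        uint64_t diff = load_le64(a + n) ^ load_le64(b + n);
        if (diff != 0)              /* only a nonzero diff is ever scanned */
            return n + first_diff_byte(diff);
        n += 8;                     /* all 8 bytes equal: skip the whole word */
    }
    while (n < len && a[n] == b[n]) n++;   /* tail, byte by byte */
    return n;
}
```

When the two words are identical the loop simply advances by eight bytes, so the undefined-for-zero behavior of bsf is never exercised.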

Technically, tzcnt is part of the BMI1 instruction set, but this doesn't matter either:

On processors that do not support TZCNT, the instruction byte encoding is executed as BSF.
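
A sketch of what selecting tzcnt from C could look like. `_tzcnt_u64` is an MSVC intrinsic and `__builtin_ctzll` is a GCC/Clang builtin; the wrapper name `nb_common_bytes_le64` and the exact preprocessor conditions are assumptions made for illustration and are not taken from the linked commit.

```c
#include <assert.h>
#include <stdint.h>
#if defined(_MSC_VER) && defined(_M_X64)
#  include <intrin.h>
#endif

/* Hypothetical wrapper: number of matching low-order bytes in a nonzero
 * little-endian XOR difference, i.e. trailing zero bits divided by 8. */
static unsigned nb_common_bytes_le64(uint64_t diff)
{
    assert(diff != 0);   /* callers never pass zero, so BSF and TZCNT agree */
#if defined(_MSC_VER) && defined(_M_X64)
    /* Encoded as TZCNT; CPUs without BMI1 execute the same bytes as BSF. */
    return (unsigned)_tzcnt_u64(diff) >> 3;
#elif defined(__GNUC__)
    /* GCC/Clang builtin; undefined for zero, which the assert rules out. */
    return (unsigned)__builtin_ctzll(diff) >> 3;
#else
    /* Portable fallback: count zero bytes from the low end. */
    unsigned n = 0;
    while ((diff & 0xFF) == 0) { diff >>= 8; n++; }
    return n;
#endif
}
```

Because the input is never zero, the result is the same whichever instruction the hardware ends up executing; only the throughput differs.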

On Intel microarchitectures I don't expect any performance difference, if the instruction tables are to be believed.

On ZEN2, the share of profiling samples attributed to LZ4_NbCommonBytes dropped from 24% with bsf to 15% with tzcnt.

For completeness, here are the results of running fullbench in a noisy environment:

Before:

*** LZ4 speed analyzer v1.9.2 64-bits, by Yann Collet ***
 dickens :
Compression functions :
 1-LZ4_compress_default               :  10192446 ->  6431106 (63.10%),  419.1 MB/s
 2-LZ4_compress_default(small dst)    :  10192446 ->  6431106 (63.10%),  415.5 MB/s
 3-LZ4_compress_destSize              :  10192446 ->  6431106 (63.10%),  382.0 MB/s
 4-LZ4_compress_fast(0)               :  10192446 ->  6431106 (63.10%),  417.9 MB/s
 5-LZ4_compress_fast(1)               :  10192446 ->  6431106 (63.10%),  413.2 MB/s
 6-LZ4_compress_fast(2)               :  10192446 ->  6763934 (66.36%),  453.4 MB/s
 7-LZ4_compress_fast(17)              :  10192446 ->  9222511 (90.48%), 1190.5 MB/s
 8-LZ4_compress_fast_extState(0)      :  10192446 ->  6431106 (63.10%),  415.0 MB/s
 9-LZ4_compress_fast_continue(0)      :  10192446 ->  6428753 (63.07%),  422.5 MB/s
10-LZ4_compress_HC                    :  10192446 ->  4441354 (43.57%),   23.0 MB/s
12-LZ4_compress_HC_extStateHC         :  10192446 ->  4441354 (43.57%),   22.9 MB/s
14-LZ4_compress_HC_continue           :  10192446 ->  4432831 (43.49%),   23.0 MB/s
20-LZ4_compress_forceDict             :  10192446 ->  6428753 (63.07%),  382.9 MB/s
30-LZ4F_compressFrame                 :  10192446 ->  6430041 (63.09%),  409.2 MB/s
 ooffice :
Compression functions :
 1-LZ4_compress_default               :   6152192 ->  4339312 (70.53%),  532.3 MB/s
 2-LZ4_compress_default(small dst)    :   6152192 ->  4339312 (70.53%),  524.3 MB/s
 3-LZ4_compress_destSize              :   6152192 ->  4339312 (70.53%),  516.0 MB/s
 4-LZ4_compress_fast(0)               :   6152192 ->  4339312 (70.53%),  546.8 MB/s
 5-LZ4_compress_fast(1)               :   6152192 ->  4339312 (70.53%),  553.5 MB/s
 6-LZ4_compress_fast(2)               :   6152192 ->  4552666 (74.00%),  684.4 MB/s
 7-LZ4_compress_fast(17)              :   6152192 ->  5531363 (89.91%), 1845.7 MB/s
 8-LZ4_compress_fast_extState(0)      :   6152192 ->  4339312 (70.53%),  554.4 MB/s
 9-LZ4_compress_fast_continue(0)      :   6152192 ->  4338923 (70.53%),  517.8 MB/s
10-LZ4_compress_HC                    :   6152192 ->  3546025 (57.64%),   40.4 MB/s
12-LZ4_compress_HC_extStateHC         :   6152192 ->  3546025 (57.64%),   39.8 MB/s
14-LZ4_compress_HC_continue           :   6152192 ->  3543767 (57.60%),   39.5 MB/s
20-LZ4_compress_forceDict             :   6152192 ->  4338923 (70.53%),  532.3 MB/s
30-LZ4F_compressFrame                 :   6152192 ->  4339971 (70.54%),  550.4 MB/s

After:

*** LZ4 speed analyzer v1.9.2 64-bits, by Yann Collet ***
 dickens :
Compression functions :
 1-LZ4_compress_default               :  10192446 ->  6431106 (63.10%),  426.0 MB/s
 2-LZ4_compress_default(small dst)    :  10192446 ->  6431106 (63.10%),  418.1 MB/s
 3-LZ4_compress_destSize              :  10192446 ->  6431106 (63.10%),  400.6 MB/s
 4-LZ4_compress_fast(0)               :  10192446 ->  6431106 (63.10%),  419.3 MB/s
 5-LZ4_compress_fast(1)               :  10192446 ->  6431106 (63.10%),  424.7 MB/s
 6-LZ4_compress_fast(2)               :  10192446 ->  6763934 (66.36%),  471.0 MB/s
 7-LZ4_compress_fast(17)              :  10192446 ->  9222511 (90.48%), 1226.7 MB/s
 8-LZ4_compress_fast_extState(0)      :  10192446 ->  6431106 (63.10%),  419.9 MB/s
 9-LZ4_compress_fast_continue(0)      :  10192446 ->  6428753 (63.07%),  441.6 MB/s
10-LZ4_compress_HC                    :  10192446 ->  4441354 (43.57%),   23.0 MB/s
12-LZ4_compress_HC_extStateHC         :  10192446 ->  4441354 (43.57%),   23.0 MB/s
14-LZ4_compress_HC_continue           :  10192446 ->  4432831 (43.49%),   22.8 MB/s
20-LZ4_compress_forceDict             :  10192446 ->  6428753 (63.07%),  406.9 MB/s
30-LZ4F_compressFrame                 :  10192446 ->  6430041 (63.09%),  420.1 MB/s
 ooffice :
Compression functions :
 1-LZ4_compress_default               :   6152192 ->  4339312 (70.53%),  535.2 MB/s
 2-LZ4_compress_default(small dst)    :   6152192 ->  4339312 (70.53%),  551.9 MB/s
 3-LZ4_compress_destSize              :   6152192 ->  4339312 (70.53%),  551.0 MB/s
 4-LZ4_compress_fast(0)               :   6152192 ->  4339312 (70.53%),  533.8 MB/s
 5-LZ4_compress_fast(1)               :   6152192 ->  4339312 (70.53%),  531.9 MB/s
 6-LZ4_compress_fast(2)               :   6152192 ->  4552666 (74.00%),  691.2 MB/s
 7-LZ4_compress_fast(17)              :   6152192 ->  5531363 (89.91%), 1886.7 MB/s
 8-LZ4_compress_fast_extState(0)      :   6152192 ->  4339312 (70.53%),  565.5 MB/s
 9-LZ4_compress_fast_continue(0)      :   6152192 ->  4338923 (70.53%),  544.8 MB/s
10-LZ4_compress_HC                    :   6152192 ->  3546025 (57.64%),   40.5 MB/s
12-LZ4_compress_HC_extStateHC         :   6152192 ->  3546025 (57.64%),   40.5 MB/s
14-LZ4_compress_HC_continue           :   6152192 ->  3543767 (57.60%),   40.4 MB/s
20-LZ4_compress_forceDict             :   6152192 ->  4338923 (70.53%),  534.5 MB/s
30-LZ4F_compressFrame                 :   6152192 ->  4339971 (70.54%),  552.2 MB/s

Proposed change: wolfpld/tracy@3a302c1
