Improve compression speed on ZEN/ZEN2 #867

The MSVC implementation of `LZ4_NbCommonBytes` compiles to the `bsf` instruction, which is suboptimal on the ZEN/ZEN2 microarchitectures, as indicated below (the code compiles to the `R, R` variant):

![image](https://user-images.githubusercontent.com/600573/82132046-c3648c00-97db-11ea-8a5e-57294cd287e9.png)

The `tzcnt` instruction, which behaves similarly, has a better internal implementation:

![image](https://user-images.githubusercontent.com/600573/82132067-fad33880-97db-11ea-948c-fa048c9d0017.png)

Quoting the Intel manuals:

> The key difference between TZCNT and BSF instruction is that TZCNT provides operand size as output when source operand is zero while in the case of BSF instruction, if source operand is zero, the content of destination operand are undefined.

This difference can be ignored, because the `LZ4_count` function never passes zero as the input parameter.
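
To illustrate why zero never reaches the bit scan, here is a minimal sketch of the counting logic. It is not the actual LZ4 source: `load_le64`, `ctz64`, `nb_common_bytes` and `count_match` are hypothetical names, and `ctz64` is a portable stand-in for `bsf`/`tzcnt`. The point is that the two words are XORed first, and the bit scan only runs when that difference is nonzero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read 8 bytes as a little-endian 64-bit word,
 * independent of the host byte order. */
static uint64_t load_le64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Portable stand-in for bsf/tzcnt: index of the lowest set bit.
 * As with bsf, a zero input has no meaningful result, so the
 * precondition is asserted rather than handled. */
static unsigned ctz64(uint64_t x)
{
    unsigned n = 0;
    assert(x != 0);
    while ((x & 1u) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Number of leading bytes two words share, given their XOR. */
static unsigned nb_common_bytes(uint64_t diff)
{
    return ctz64(diff) >> 3;   /* bits -> bytes */
}

/* Simplified match counter: length of the common prefix of a and b,
 * looking at no more than `limit` bytes. */
static size_t count_match(const unsigned char *a, const unsigned char *b,
                          size_t limit)
{
    size_t n = 0;
    while (n + 8 <= limit) {
        uint64_t diff = load_le64(a + n) ^ load_le64(b + n);
        if (diff != 0)          /* only nonzero differences are scanned */
            return n + nb_common_bytes(diff);
        n += 8;                 /* all 8 bytes equal, keep going */
    }
    while (n < limit && a[n] == b[n])   /* byte-wise tail */
        n++;
    return n;
}
```

Equal words take the `n += 8` path, so the only difference between `tzcnt` and `bsf` (the result for a zero source) is never observed in this pattern.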

Technically, `tzcnt` is part of the BMI1 instruction set, but this doesn't matter either:

> On processors that do not support TZCNT, the instruction byte encoding is executed as BSF.

On Intel microarchitectures I don't expect any performance difference, if the instruction tables are to be believed.
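
As a rough sketch of the kind of substitution discussed here (not necessarily the exact change in the commit linked at the end), the MSVC x64 path could call the `_tzcnt_u64` intrinsic where it would otherwise use `_BitScanForward64`. The function name `trailing_zero_bytes` is made up for illustration, and the non-MSVC branches are only there so the example compiles elsewhere.

```c
#include <assert.h>
#include <stdint.h>

#if defined(_MSC_VER) && defined(_M_X64)
#  include <intrin.h>
#endif

/* Hypothetical name: number of low-order zero bytes in a nonzero word. */
static unsigned trailing_zero_bytes(uint64_t diff)
{
    assert(diff != 0);  /* callers are assumed to never pass zero */
#if defined(_MSC_VER) && defined(_M_X64)
    /* Emits tzcnt. On CPUs without BMI1 the same encoding executes
     * as bsf, which gives the same result for nonzero input. */
    return (unsigned)_tzcnt_u64(diff) >> 3;
#elif defined(__GNUC__) || defined(__clang__)
    /* GCC/Clang builtin; undefined for zero, like bsf. */
    return (unsigned)__builtin_ctzll(diff) >> 3;
#else
    /* Software fallback for other compilers. */
    unsigned n = 0;
    while ((diff & 1u) == 0) {
        diff >>= 1;
        n++;
    }
    return n >> 3;
#endif
}
```

Because the input is never zero, the result is the same whichever instruction the CPU ends up executing, so no runtime BMI1 check is needed for correctness in this sketch.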

On ZEN2, the share of profiling samples attributed to `LZ4_NbCommonBytes` dropped from 24% in the `bsf` case to 15% in the `tzcnt` case.

For completeness, here are the results of running `fullbench` in a noisy environment:

Before:

```
*** LZ4 speed analyzer v1.9.2 64-bits, by Yann Collet ***
 dickens :
Compression functions :
 1-LZ4_compress_default               :  10192446 ->  6431106 (63.10%),  419.1 MB/s
 2-LZ4_compress_default(small dst)    :  10192446 ->  6431106 (63.10%),  415.5 MB/s
 3-LZ4_compress_destSize              :  10192446 ->  6431106 (63.10%),  382.0 MB/s
 4-LZ4_compress_fast(0)               :  10192446 ->  6431106 (63.10%),  417.9 MB/s
 5-LZ4_compress_fast(1)               :  10192446 ->  6431106 (63.10%),  413.2 MB/s
 6-LZ4_compress_fast(2)               :  10192446 ->  6763934 (66.36%),  453.4 MB/s
 7-LZ4_compress_fast(17)              :  10192446 ->  9222511 (90.48%), 1190.5 MB/s
 8-LZ4_compress_fast_extState(0)      :  10192446 ->  6431106 (63.10%),  415.0 MB/s
 9-LZ4_compress_fast_continue(0)      :  10192446 ->  6428753 (63.07%),  422.5 MB/s
10-LZ4_compress_HC                    :  10192446 ->  4441354 (43.57%),   23.0 MB/s
12-LZ4_compress_HC_extStateHC         :  10192446 ->  4441354 (43.57%),   22.9 MB/s
14-LZ4_compress_HC_continue           :  10192446 ->  4432831 (43.49%),   23.0 MB/s
20-LZ4_compress_forceDict             :  10192446 ->  6428753 (63.07%),  382.9 MB/s
30-LZ4F_compressFrame                 :  10192446 ->  6430041 (63.09%),  409.2 MB/s
 ooffice :
Compression functions :
 1-LZ4_compress_default               :   6152192 ->  4339312 (70.53%),  532.3 MB/s
 2-LZ4_compress_default(small dst)    :   6152192 ->  4339312 (70.53%),  524.3 MB/s
 3-LZ4_compress_destSize              :   6152192 ->  4339312 (70.53%),  516.0 MB/s
 4-LZ4_compress_fast(0)               :   6152192 ->  4339312 (70.53%),  546.8 MB/s
 5-LZ4_compress_fast(1)               :   6152192 ->  4339312 (70.53%),  553.5 MB/s
 6-LZ4_compress_fast(2)               :   6152192 ->  4552666 (74.00%),  684.4 MB/s
 7-LZ4_compress_fast(17)              :   6152192 ->  5531363 (89.91%), 1845.7 MB/s
 8-LZ4_compress_fast_extState(0)      :   6152192 ->  4339312 (70.53%),  554.4 MB/s
 9-LZ4_compress_fast_continue(0)      :   6152192 ->  4338923 (70.53%),  517.8 MB/s
10-LZ4_compress_HC                    :   6152192 ->  3546025 (57.64%),   40.4 MB/s
12-LZ4_compress_HC_extStateHC         :   6152192 ->  3546025 (57.64%),   39.8 MB/s
14-LZ4_compress_HC_continue           :   6152192 ->  3543767 (57.60%),   39.5 MB/s
20-LZ4_compress_forceDict             :   6152192 ->  4338923 (70.53%),  532.3 MB/s
30-LZ4F_compressFrame                 :   6152192 ->  4339971 (70.54%),  550.4 MB/s
```

After:

```
*** LZ4 speed analyzer v1.9.2 64-bits, by Yann Collet ***
 dickens :
Compression functions :
 1-LZ4_compress_default               :  10192446 ->  6431106 (63.10%),  426.0 MB/s
 2-LZ4_compress_default(small dst)    :  10192446 ->  6431106 (63.10%),  418.1 MB/s
 3-LZ4_compress_destSize              :  10192446 ->  6431106 (63.10%),  400.6 MB/s
 4-LZ4_compress_fast(0)               :  10192446 ->  6431106 (63.10%),  419.3 MB/s
 5-LZ4_compress_fast(1)               :  10192446 ->  6431106 (63.10%),  424.7 MB/s
 6-LZ4_compress_fast(2)               :  10192446 ->  6763934 (66.36%),  471.0 MB/s
 7-LZ4_compress_fast(17)              :  10192446 ->  9222511 (90.48%), 1226.7 MB/s
 8-LZ4_compress_fast_extState(0)      :  10192446 ->  6431106 (63.10%),  419.9 MB/s
 9-LZ4_compress_fast_continue(0)      :  10192446 ->  6428753 (63.07%),  441.6 MB/s
10-LZ4_compress_HC                    :  10192446 ->  4441354 (43.57%),   23.0 MB/s
12-LZ4_compress_HC_extStateHC         :  10192446 ->  4441354 (43.57%),   23.0 MB/s
14-LZ4_compress_HC_continue           :  10192446 ->  4432831 (43.49%),   22.8 MB/s
20-LZ4_compress_forceDict             :  10192446 ->  6428753 (63.07%),  406.9 MB/s
30-LZ4F_compressFrame                 :  10192446 ->  6430041 (63.09%),  420.1 MB/s
 ooffice :
Compression functions :
 1-LZ4_compress_default               :   6152192 ->  4339312 (70.53%),  535.2 MB/s
 2-LZ4_compress_default(small dst)    :   6152192 ->  4339312 (70.53%),  551.9 MB/s
 3-LZ4_compress_destSize              :   6152192 ->  4339312 (70.53%),  551.0 MB/s
 4-LZ4_compress_fast(0)               :   6152192 ->  4339312 (70.53%),  533.8 MB/s
 5-LZ4_compress_fast(1)               :   6152192 ->  4339312 (70.53%),  531.9 MB/s
 6-LZ4_compress_fast(2)               :   6152192 ->  4552666 (74.00%),  691.2 MB/s
 7-LZ4_compress_fast(17)              :   6152192 ->  5531363 (89.91%), 1886.7 MB/s
 8-LZ4_compress_fast_extState(0)      :   6152192 ->  4339312 (70.53%),  565.5 MB/s
 9-LZ4_compress_fast_continue(0)      :   6152192 ->  4338923 (70.53%),  544.8 MB/s
10-LZ4_compress_HC                    :   6152192 ->  3546025 (57.64%),   40.5 MB/s
12-LZ4_compress_HC_extStateHC         :   6152192 ->  3546025 (57.64%),   40.5 MB/s
14-LZ4_compress_HC_continue           :   6152192 ->  3543767 (57.60%),   40.4 MB/s
20-LZ4_compress_forceDict             :   6152192 ->  4338923 (70.53%),  534.5 MB/s
30-LZ4F_compressFrame                 :   6152192 ->  4339971 (70.54%),  552.2 MB/s
```
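
Since the run was noisy, the individual rows are easier to compare as relative changes than as raw MB/s figures. The helper below is a small sketch for that arithmetic; `percent_change` and `print_row` are names made up here, and the numbers fed to them would be copied by hand from the two listings above.

```c
#include <stdio.h>

/* Relative change in percent between two throughput figures (MB/s). */
static double percent_change(double before, double after)
{
    return (after - before) / before * 100.0;
}

/* Print one benchmark row with its before/after values and the
 * signed relative change. */
static void print_row(const char *name, double before, double after)
{
    printf("%-36s %8.1f -> %8.1f MB/s (%+.1f%%)\n",
           name, before, after, percent_change(before, after));
}
```

For example, `print_row("dickens 1-LZ4_compress_default", 419.1, 426.0)` reports a gain of about 1.6%, while the `ooffice` row for `LZ4_compress_fast(0)` (546.8 before, 533.8 after) comes out negative, which is consistent with the noise mentioned above.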

Proposed change: https://github.com/wolfpld/tracy/commit/3a302c18bcd1b7956890dbb5edba73a5db93017f
