The MSVC implementation of LZ4_NbCommonBytes compiles to the bsf instruction, which is suboptimal on the Zen/Zen2 microarchitectures, as indicated below (the code compiles to the R, R variant):

The tzcnt instruction, which behaves similarly, has a better internal implementation:

Quoting Intel manuals:
The key difference between TZCNT and BSF instruction is that TZCNT provides operand size as output when source operand is zero while in the case of BSF instruction, if source operand is zero, the content of destination operand are undefined.
This case can be ignored, as the LZ4_count function never passes zero as the input parameter.
Technically, tzcnt is part of the BMI1 instruction set, but this doesn't matter either:
On processors that do not support TZCNT, the instruction byte encoding is executed as BSF.
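To illustrate the two points above, here is a minimal sketch, not the actual LZ4 or Tracy code. The names `nb_common_bytes_le64` and `count_common_prefix` are hypothetical, and the sketch assumes a little-endian 64-bit target. It shows a match-length loop that only asks for the trailing zero count of a nonzero XOR difference, which is why the zero-input difference between bsf and tzcnt never comes into play.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(_MSC_VER) && defined(_M_X64) && !defined(__clang__)
#  include <intrin.h>
#  include <immintrin.h>  /* _tzcnt_u64 */
#endif

/* Hypothetical helper: given a NONZERO xor difference of two 8-byte words
 * loaded on a little-endian machine, return how many leading bytes matched.
 * Precondition: diff != 0 (the caller below guarantees this). */
static unsigned nb_common_bytes_le64(uint64_t diff)
{
#if defined(__GNUC__) || defined(__clang__)
    return (unsigned)__builtin_ctzll(diff) >> 3;
#elif defined(_MSC_VER) && defined(_M_X64)
    /* Emits tzcnt; on CPUs without BMI1 the same encoding runs as bsf,
     * which gives the same result for nonzero input. */
    return (unsigned)_tzcnt_u64(diff) >> 3;
#else
    /* Portable fallback: count zero bytes from the low end. */
    unsigned n = 0;
    while ((diff & 0xFF) == 0) { diff >>= 8; n++; }
    return n;
#endif
}

/* Hypothetical match-length loop in the spirit of LZ4_count: compares
 * 8 bytes at a time and calls the helper only when the words differ. */
static size_t count_common_prefix(const uint8_t *a, const uint8_t *b, size_t len)
{
    size_t i = 0;
    while (i + 8 <= len) {
        uint64_t x, y;
        memcpy(&x, a + i, 8);
        memcpy(&y, b + i, 8);
        uint64_t diff = x ^ y;
        if (diff != 0)                 /* zero is never passed on */
            return i + nb_common_bytes_le64(diff);
        i += 8;
    }
    while (i < len && a[i] == b[i])    /* byte-wise tail */
        i++;
    return i;
}
```

For example, comparing `"abcXefgh"` with `"abcYefgh"` over 8 bytes yields a difference whose lowest set bit is bit 24, so the helper returns 3.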
On Intel uarchs I don't expect any performance difference, if instruction tables are to be believed.
On Zen2, the share of profiling samples attributed to LZ4_NbCommonBytes dropped from 24% in the bsf case to 15% in the tzcnt case.
For completeness, here are the results of running fullbench in a noisy environment:
Before:
*** LZ4 speed analyzer v1.9.2 64-bits, by Yann Collet ***
dickens :
Compression functions :
1-LZ4_compress_default : 10192446 -> 6431106 (63.10%), 419.1 MB/s
2-LZ4_compress_default(small dst) : 10192446 -> 6431106 (63.10%), 415.5 MB/s
3-LZ4_compress_destSize : 10192446 -> 6431106 (63.10%), 382.0 MB/s
4-LZ4_compress_fast(0) : 10192446 -> 6431106 (63.10%), 417.9 MB/s
5-LZ4_compress_fast(1) : 10192446 -> 6431106 (63.10%), 413.2 MB/s
6-LZ4_compress_fast(2) : 10192446 -> 6763934 (66.36%), 453.4 MB/s
7-LZ4_compress_fast(17) : 10192446 -> 9222511 (90.48%), 1190.5 MB/s
8-LZ4_compress_fast_extState(0) : 10192446 -> 6431106 (63.10%), 415.0 MB/s
9-LZ4_compress_fast_continue(0) : 10192446 -> 6428753 (63.07%), 422.5 MB/s
10-LZ4_compress_HC : 10192446 -> 4441354 (43.57%), 23.0 MB/s
12-LZ4_compress_HC_extStateHC : 10192446 -> 4441354 (43.57%), 22.9 MB/s
14-LZ4_compress_HC_continue : 10192446 -> 4432831 (43.49%), 23.0 MB/s
20-LZ4_compress_forceDict : 10192446 -> 6428753 (63.07%), 382.9 MB/s
30-LZ4F_compressFrame : 10192446 -> 6430041 (63.09%), 409.2 MB/s
ooffice :
Compression functions :
1-LZ4_compress_default : 6152192 -> 4339312 (70.53%), 532.3 MB/s
2-LZ4_compress_default(small dst) : 6152192 -> 4339312 (70.53%), 524.3 MB/s
3-LZ4_compress_destSize : 6152192 -> 4339312 (70.53%), 516.0 MB/s
4-LZ4_compress_fast(0) : 6152192 -> 4339312 (70.53%), 546.8 MB/s
5-LZ4_compress_fast(1) : 6152192 -> 4339312 (70.53%), 553.5 MB/s
6-LZ4_compress_fast(2) : 6152192 -> 4552666 (74.00%), 684.4 MB/s
7-LZ4_compress_fast(17) : 6152192 -> 5531363 (89.91%), 1845.7 MB/s
8-LZ4_compress_fast_extState(0) : 6152192 -> 4339312 (70.53%), 554.4 MB/s
9-LZ4_compress_fast_continue(0) : 6152192 -> 4338923 (70.53%), 517.8 MB/s
10-LZ4_compress_HC : 6152192 -> 3546025 (57.64%), 40.4 MB/s
12-LZ4_compress_HC_extStateHC : 6152192 -> 3546025 (57.64%), 39.8 MB/s
14-LZ4_compress_HC_continue : 6152192 -> 3543767 (57.60%), 39.5 MB/s
20-LZ4_compress_forceDict : 6152192 -> 4338923 (70.53%), 532.3 MB/s
30-LZ4F_compressFrame : 6152192 -> 4339971 (70.54%), 550.4 MB/s
After:
*** LZ4 speed analyzer v1.9.2 64-bits, by Yann Collet ***
dickens :
Compression functions :
1-LZ4_compress_default : 10192446 -> 6431106 (63.10%), 426.0 MB/s
2-LZ4_compress_default(small dst) : 10192446 -> 6431106 (63.10%), 418.1 MB/s
3-LZ4_compress_destSize : 10192446 -> 6431106 (63.10%), 400.6 MB/s
4-LZ4_compress_fast(0) : 10192446 -> 6431106 (63.10%), 419.3 MB/s
5-LZ4_compress_fast(1) : 10192446 -> 6431106 (63.10%), 424.7 MB/s
6-LZ4_compress_fast(2) : 10192446 -> 6763934 (66.36%), 471.0 MB/s
7-LZ4_compress_fast(17) : 10192446 -> 9222511 (90.48%), 1226.7 MB/s
8-LZ4_compress_fast_extState(0) : 10192446 -> 6431106 (63.10%), 419.9 MB/s
9-LZ4_compress_fast_continue(0) : 10192446 -> 6428753 (63.07%), 441.6 MB/s
10-LZ4_compress_HC : 10192446 -> 4441354 (43.57%), 23.0 MB/s
12-LZ4_compress_HC_extStateHC : 10192446 -> 4441354 (43.57%), 23.0 MB/s
14-LZ4_compress_HC_continue : 10192446 -> 4432831 (43.49%), 22.8 MB/s
20-LZ4_compress_forceDict : 10192446 -> 6428753 (63.07%), 406.9 MB/s
30-LZ4F_compressFrame : 10192446 -> 6430041 (63.09%), 420.1 MB/s
ooffice :
Compression functions :
1-LZ4_compress_default : 6152192 -> 4339312 (70.53%), 535.2 MB/s
2-LZ4_compress_default(small dst) : 6152192 -> 4339312 (70.53%), 551.9 MB/s
3-LZ4_compress_destSize : 6152192 -> 4339312 (70.53%), 551.0 MB/s
4-LZ4_compress_fast(0) : 6152192 -> 4339312 (70.53%), 533.8 MB/s
5-LZ4_compress_fast(1) : 6152192 -> 4339312 (70.53%), 531.9 MB/s
6-LZ4_compress_fast(2) : 6152192 -> 4552666 (74.00%), 691.2 MB/s
7-LZ4_compress_fast(17) : 6152192 -> 5531363 (89.91%), 1886.7 MB/s
8-LZ4_compress_fast_extState(0) : 6152192 -> 4339312 (70.53%), 565.5 MB/s
9-LZ4_compress_fast_continue(0) : 6152192 -> 4338923 (70.53%), 544.8 MB/s
10-LZ4_compress_HC : 6152192 -> 3546025 (57.64%), 40.5 MB/s
12-LZ4_compress_HC_extStateHC : 6152192 -> 3546025 (57.64%), 40.5 MB/s
14-LZ4_compress_HC_continue : 6152192 -> 3543767 (57.60%), 40.4 MB/s
20-LZ4_compress_forceDict : 6152192 -> 4338923 (70.53%), 534.5 MB/s
30-LZ4F_compressFrame : 6152192 -> 4339971 (70.54%), 552.2 MB/s
Proposed change: wolfpld/tracy@3a302c1