clamp cmov codegen#2336
Conversation
can't use min/max as can't forward _Pred to two places
can't use min/max as can't forward _Pred to two places
Yes I Hate Ternary
|
@AlexGuteniev Ah thanks, I didn't see the benchmark. You can approximate the latency better, by changing With With With I find the consistency of the branchless benchmark results remarkable and I think it shows why they should be preferred whenever possible. |
|
@lhecker , thanks, I've made an edit to the benhmark |
There's separate |
strega-nil-ms
left a comment
There was a problem hiding this comment.
Minor test notes, the code itself looks good
Co-authored-by: nicole mazzuca <[email protected]>
|
I've done some benchmarking of my own; the following benchmark.cpp (modeled on yours, @AlexGuteniev): benchmark.cpp#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#define DO_STRINGIZE(v) L"" #v
#define STRINGIZE(v) DO_STRINGIZE(v)
#ifndef BRANCHY
#error "forgot to /D BRANCHY"
#endif
struct BenchmarkHeader {
wchar_t const* branchy = STRINGIZE(BRANCHY);
wchar_t const* processor;
wchar_t const* processor_company;
};
struct BenchmarkSettings {
int size_of_vector = INT_MIN;
int runs = INT_MIN;
int minimum = INT_MIN;
int maximum = INT_MIN;
int clamped_minimum = INT_MIN;
int clamped_maximum = INT_MIN;
int trials = INT_MIN;
};
template <class T>
double benchmark(const T* const first, T const* const last, T& result, const BenchmarkHeader& header,
const BenchmarkSettings& settings) {
auto start = std::chrono::steady_clock::now();
for (int i = 0; i < settings.runs; ++i) {
for (const T* it = first; it != last; ++it) {
result += std::clamp<T>(*it, settings.clamped_minimum, settings.clamped_maximum);
}
}
auto end = std::chrono::steady_clock::now();
return std::chrono::duration_cast<std::chrono::duration<double>>(end - start).count() / settings.runs;
}
template <class T>
auto benchmark_type(const BenchmarkHeader& header, const BenchmarkSettings& settings, const wchar_t* type) {
std::vector<T> input(settings.size_of_vector);
T output = 0;
std::wcout << header.branchy << ',' << header.processor << ',' << header.processor_company << ',' << type << ','
<< settings.size_of_vector << ',' << settings.runs << ',' << settings.minimum << ',' << settings.maximum
<< ',' << settings.clamped_minimum << ',' << settings.clamped_maximum;
for (int i = 0; i < settings.trials; ++i) {
{
std::random_device rd;
auto dist = ([&settings]() {
if constexpr (std::is_integral_v<T>) {
return std::uniform_int_distribution<T>(settings.minimum, settings.maximum);
} else {
return std::uniform_real_distribution<T>(settings.minimum, settings.maximum);
}
})();
std::generate(std::begin(input), std::end(input), [&rd, &dist]() { return dist(rd); });
}
auto trial = benchmark(input.data(), input.data() + input.size(), output, header, settings);
std::wcout << ',' << trial;
}
std::wcout << '\n';
return output;
}
int wmain(const int argc, wchar_t** const argv) {
static constexpr std::pair<const wchar_t*, const wchar_t*(BenchmarkHeader::*)> header_parameters[] = {
{L"--processor", &BenchmarkHeader::processor},
{L"--company", &BenchmarkHeader::processor_company},
};
static constexpr std::pair<const wchar_t*, int(BenchmarkSettings::*)> settings_parameters[] = {
{L"--vector-size", &BenchmarkSettings::size_of_vector},
{L"--runs", &BenchmarkSettings::runs},
{L"--min", &BenchmarkSettings::minimum},
{L"--max", &BenchmarkSettings::maximum},
{L"--clamp-min", &BenchmarkSettings::clamped_minimum},
{L"--clamp-max", &BenchmarkSettings::clamped_maximum},
{L"--trials", &BenchmarkSettings::trials},
};
BenchmarkHeader header{};
BenchmarkSettings settings{};
for (int arg_i = 1; arg_i < argc; ++arg_i) {
bool found = false;
for (auto parm : header_parameters) {
if (wcscmp(argv[arg_i], parm.first) == 0) {
if (!(header.*parm.second)) {
++arg_i;
header.*parm.second = argv[arg_i];
} else {
std::wcerr << L"passed " << parm.first << L" multiple times\n";
return 1;
}
found = true;
}
}
for (auto parm : settings_parameters) {
if (wcscmp(argv[arg_i], parm.first) == 0) {
if (settings.*parm.second == INT_MIN) {
++arg_i;
settings.*parm.second = static_cast<int>(wcstol(argv[arg_i], nullptr, 10));
} else {
std::wcerr << L"passed " << parm.first << L"multiple times\n";
return 1;
}
found = true;
}
}
if (!found) {
std::wcerr << L"unrecognized argument " << argv[arg_i] << L'\n';
return 1;
}
}
for (auto parm : header_parameters) {
if (!(header.*parm.second)) {
std::wcerr << L"argument " << parm.first << " not passed\n";
return 1;
}
}
for (auto parm : settings_parameters) {
if (settings.*parm.second == INT_MIN) {
std::wcerr << L"argument " << parm.first << " not passed\n";
return 1;
}
}
int trials = 5;
std::wcout << L"Branch?,Processor,Processor Company,Type,N,Runs,Min,Max,Clamped Min,Clamped Max";
for (int i = 0; i < settings.trials; ++i) {
std::wcout << L',' << L"Trial " << (i + 1);
}
std::wcout << L'\n';
auto s = benchmark_type<short>(header, settings, L"short");
auto i = benchmark_type<int>(header, settings, L"int");
auto l = benchmark_type<long>(header, settings, L"long");
auto ll = benchmark_type<long long>(header, settings, L"long long");
auto f = benchmark_type<float>(header, settings, L"float");
auto d = benchmark_type<double>(header, settings, L"double");
std::wcout << L"\n\noutput:\n";
std::wcout << L"short,int,long,long long,float,double\n";
std::wcout << s << ',' << i << ',' << l << ',' << ll << ',' << f << ',' << d << '\n';
}And got the following raw data: raw dataBranch?,Processor,Processor Company,Type,N,Runs,Min,Max,Clamped Min,Clamped Max,Trial 1,Trial 2,Trial 3,Trial 4,Trial 5,Average
branchless,x64,intel,short,50,000,000,10,-1000,1000,-300,300,0.0307597,0.0307724,0.0309739,0.0309177,0.0315486,0.03099446
branchless,x64,intel,int,50,000,000,10,-1000,1000,-300,300,0.0261614,0.0262558,0.0266834,0.0265467,0.0261862,0.0263667
branchless,x64,intel,long,50,000,000,10,-1000,1000,-300,300,0.0325709,0.0323946,0.0322038,0.0322979,0.0326036,0.03241416
branchless,x64,intel,long long,50,000,000,10,-1000,1000,-300,300,0.0382545,0.038413,0.0378916,0.0382268,0.0380203,0.03816124
branchless,x64,intel,float,50,000,000,10,-1000,1000,-300,300,0.0581218,0.0568805,0.0568683,0.0571446,0.0570825,0.05721954
branchless,x64,intel,double,50,000,000,10,-1000,1000,-300,300,0.0627476,0.0614245,0.0614438,0.0616488,0.0618331,0.06181956
branchless,x64,intel,short,50,000,000,10,-1000,1000,-1000,1000,0.030247,0.0301774,0.0304275,0.0304509,0.0306174,0.03038404
branchless,x64,intel,int,50,000,000,10,-1000,1000,-1000,1000,0.026132,0.0263663,0.0261524,0.0260969,0.0262399,0.0261975
branchless,x64,intel,long,50,000,000,10,-1000,1000,-1000,1000,0.0356146,0.0322538,0.032345,0.0322297,0.0326115,0.03301092
branchless,x64,intel,long long,50,000,000,10,-1000,1000,-1000,1000,0.0382861,0.0439842,0.0382186,0.0385868,0.041021,0.04001934
branchless,x64,intel,float,50,000,000,10,-1000,1000,-1000,1000,0.0567044,0.0568411,0.0570752,0.05716,0.057088,0.05697374
branchless,x64,intel,double,50,000,000,10,-1000,1000,-1000,1000,0.0618878,0.0612712,0.061348,0.0615576,0.0614417,0.06150126
branchless,x64,intel,short,50,000,000,10,-1000,1000,0,1000,0.0304006,0.030353,0.0302131,0.0304818,0.0312518,0.03054006
branchless,x64,intel,int,50,000,000,10,-1000,1000,0,1000,0.0261579,0.0266862,0.0264275,0.0261135,0.0260856,0.02629414
branchless,x64,intel,long,50,000,000,10,-1000,1000,0,1000,0.0325495,0.032575,0.0324561,0.0323098,0.0326873,0.03251554
branchless,x64,intel,long long,50,000,000,10,-1000,1000,0,1000,0.0382069,0.0382973,0.0382747,0.0380733,0.0378058,0.0381316
branchless,x64,intel,float,50,000,000,10,-1000,1000,0,1000,0.0570075,0.0570859,0.0570695,0.0570084,0.0574737,0.057129
branchless,x64,intel,double,50,000,000,10,-1000,1000,0,1000,0.0619618,0.0612276,0.0611233,0.0630977,0.0614704,0.06177616
branchless,x64,intel,short,50,000,000,10,-1000,1000,-1000,0,0.0304815,0.030353,0.0316145,0.0302088,0.0302104,0.03057364
branchless,x64,intel,int,50,000,000,10,-1000,1000,-1000,0,0.0264647,0.0267136,0.0273937,0.0261126,0.0264696,0.02663084
branchless,x64,intel,long,50,000,000,10,-1000,1000,-1000,0,0.032258,0.0323605,0.0325388,0.0325769,0.036068,0.03316044
branchless,x64,intel,long long,50,000,000,10,-1000,1000,-1000,0,0.0380318,0.0399791,0.0381676,0.0379547,0.0378395,0.03839454
branchless,x64,intel,float,50,000,000,10,-1000,1000,-1000,0,0.0571292,0.0573249,0.0572185,0.0569971,0.0568268,0.0570993
branchless,x64,intel,double,50,000,000,10,-1000,1000,-1000,0,0.0635628,0.0614783,0.0615637,0.0612266,0.0615011,0.0618665
branch,x64,intel,short,50,000,000,10,-1000,1000,-300,300,0.131582,0.131115,0.131612,0.130697,0.130677,0.1311366
branch,x64,intel,int,50,000,000,10,-1000,1000,-300,300,0.141665,0.142679,0.141217,0.142365,0.142126,0.1420104
branch,x64,intel,long,50,000,000,10,-1000,1000,-300,300,0.138849,0.139846,0.138676,0.139194,0.139244,0.1391618
branch,x64,intel,long long,50,000,000,10,-1000,1000,-300,300,0.144001,0.147541,0.142996,0.141286,0.143063,0.1437774
branch,x64,intel,float,50,000,000,10,-1000,1000,-300,300,0.263469,0.263556,0.263109,0.263324,0.263803,0.2634522
branch,x64,intel,double,50,000,000,10,-1000,1000,-300,300,0.269243,0.268038,0.267364,0.268133,0.268234,0.2682024
branch,x64,intel,short,50,000,000,10,-1000,1000,-1000,1000,0.0240649,0.0238211,0.0239027,0.0240058,0.0244858,0.02405606
branch,x64,intel,int,50,000,000,10,-1000,1000,-1000,1000,0.0288381,0.026317,0.0260632,0.0259924,0.0262759,0.02669732
branch,x64,intel,long,50,000,000,10,-1000,1000,-1000,1000,0.0294482,0.0279469,0.0305684,0.03038,0.0291242,0.02949354
branch,x64,intel,long long,50,000,000,10,-1000,1000,-1000,1000,0.0328927,0.0343112,0.0333807,0.0329204,0.0335131,0.03340362
branch,x64,intel,float,50,000,000,10,-1000,1000,-1000,1000,0.0545511,0.0551195,0.0544184,0.0544971,0.0545425,0.05462572
branch,x64,intel,double,50,000,000,10,-1000,1000,-1000,1000,0.058983,0.0586856,0.0588727,0.0588711,0.0587617,0.05883482
branch,x64,intel,short,50,000,000,10,-1000,1000,0,1000,0.0238642,0.0239384,0.0238281,0.0239371,0.0242335,0.02396026
branch,x64,intel,int,50,000,000,10,-1000,1000,0,1000,0.026815,0.0263101,0.0292427,0.0269946,0.0263549,0.02714346
branch,x64,intel,long,50,000,000,10,-1000,1000,0,1000,0.0281906,0.0287167,0.0286152,0.0295415,0.0294489,0.02890258
branch,x64,intel,long long,50,000,000,10,-1000,1000,0,1000,0.0330974,0.0374702,0.0333578,0.0336516,0.0331444,0.03414428
branch,x64,intel,float,50,000,000,10,-1000,1000,0,1000,0.205144,0.205456,0.20791,0.203369,0.202325,0.2048408
branch,x64,intel,double,50,000,000,10,-1000,1000,0,1000,0.20852,0.207115,0.207549,0.207759,0.207784,0.2077454
branch,x64,intel,short,50,000,000,10,-1000,1000,-1000,0,0.152678,0.151846,0.152024,0.151943,0.152173,0.1521328
branch,x64,intel,int,50,000,000,10,-1000,1000,-1000,0,0.15962,0.159133,0.159042,0.158947,0.15809,0.1589664
branch,x64,intel,long,50,000,000,10,-1000,1000,-1000,0,0.1637,0.164116,0.163417,0.164494,0.166192,0.1643838
branch,x64,intel,long long,50,000,000,10,-1000,1000,-1000,0,0.161416,0.163301,0.160363,0.160596,0.160196,0.1611744
branch,x64,intel,float,50,000,000,10,-1000,1000,-1000,0,0.195259,0.196052,0.196406,0.195528,0.197289,0.1961068
branch,x64,intel,double,50,000,000,10,-1000,1000,-1000,0,0.198275,0.198893,0.198614,0.200076,0.198825,0.1989366Which I have bashed into the following:
You'll note that there are two cases where branchy wins - the "always in range" case, and the "always in the upper half of the range (for integers)" case. However, it only wins by <20% (normally 5-10%), and the branchless case wins by 3-5x in the cases it's better. Also, branchless is extremely consistent. The standard deviation of branchy is 0.08, while the standard deviation of branchless is 0.013. Given this, I am in favor of this change. Marking |
|
@strega-nil-ms Thanks for these exhaustive benchmarks! Coincidentally there was a Reddit post about cmov just recently and I really liked one comment in particular:
I found that very fitting and I'm glad to see this to be seemingly confirmed. 🙂 |
|
FYI @strega-nil-ms, I pushed minor changes after you approved. |
|
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed. |
🗜️ 🎉 🗜️ |
Co-authored-by: nicole mazzuca <[email protected]> Co-authored-by: Stephan T. Lavavej <[email protected]>
clamp in terms of min / max is faster ✨
but let's specialize for
std::lessand integers/pointers, to avoid pessimizing cases where cmov is not applicable or comparison is expensiveResolves #2334
Proved by benchmark that it worth doing for interegers, and for
/fp:fastfloats.benchmark.cpp
Benchmark results
Integer results are:
Type: ints branchy: 2.14611 branchless: 0.701921
For
/fp:fastfloating point results are:Type: reals branchy: 1.01462 branchless: 1.01654
For
/fp:precisefloating point results are:Type: reals branchy: 1.06364 branchless: 1.0455
Branchless clamp is slower when all inputs are in range:
Benchmark results
Integer results are:
Type: ints branchy: 0.600716 branchless: 0.721154
For
/fp:fastfloating point results are:Type: reals branchy: 1.02346 branchless: 1.03637
For
/fp:precisefloating point results are:Type: reals branchy: 1.02011 branchless: 1.14736