Skip to content

clamp cmov codegen#2336

Merged
StephanTLavavej merged 29 commits into
microsoft:mainfrom
AlexGuteniev:i_hate_ternary_operator
Jun 2, 2022
Merged

clamp cmov codegen#2336
StephanTLavavej merged 29 commits into
microsoft:mainfrom
AlexGuteniev:i_hate_ternary_operator

Conversation

@AlexGuteniev
Copy link
Copy Markdown
Contributor

@AlexGuteniev AlexGuteniev commented Nov 12, 2021

clamp in terms of min / max is faster ✨

but let's specialize for std::less and integers/pointers, to avoid pessimizing cases where cmov is not applicable or comparison is expensive

Resolves #2334

Proved by benchmark that it worth doing for interegers, and for /fp:fast floats.

benchmark.cpp
#include <chrono>
#include <iostream>
#include <random>

template<class T>
const T& clamp_branchless(const T& v, const T& minv, const T& maxv) {
	const T& t = v < minv ? minv : v;
	const T& r = maxv < t ? maxv : t;
	return r;
}

template<class T>
const T& clamp_branchy(const T& v, const T& minv, const T& maxv) {
	if (maxv < v) {
		return maxv;
	}
	if (v < minv) {
		return minv;
	}
	return v;
}


template<typename T, std::size_t N>
void benchmark(T(&array)[N], T& result, const char* descr, int R = 1000000) {
	auto t1 = std::chrono::steady_clock::now();
	for (int i = 0; i < R; i++) {
		for (int j = 0; j < N; j++) {
			result += clamp_branchless<T>(array[j], 300, 700);
		}
	}
	auto t2 = std::chrono::steady_clock::now();
	for (int i = 0; i < R; i++) {
		for (int j = 0; j < N; j++) {
			result += clamp_branchy<T>(array[j], 300, 700);
		}
	}
	auto t3 = std::chrono::steady_clock::now();

	auto branchless_time = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1);
	auto branchy_time = std::chrono::duration_cast<std::chrono::duration<double>>(t3 - t2);

	std::cout << "Type: " << descr << " branchy: " << branchy_time.count() << " branchless: " << branchless_time.count() << "\n";
}

int    int_in[1000];
int    int_out;
double real_in[1000];
double real_out;

int main() {
	std::mt19937 gen(53342);
	std::uniform_int_distribution<int> int_dis(0, 1000);
	for (int& element : int_in) {
		element = int_dis(gen);
	}

	std::uniform_real_distribution<double> real_dis(0, 1000);
	for (double& element : real_in) {
		element = real_dis(gen);
	}

	benchmark(int_in, int_out, "ints");
	benchmark(real_in, real_out, "reals");
}
Benchmark results

Integer results are:
Type: ints branchy: 2.14611 branchless: 0.701921

For /fp:fast floating point results are:
Type: reals branchy: 1.01462 branchless: 1.01654

For /fp:precise floating point results are:
Type: reals branchy: 1.06364 branchless: 1.0455

Branchless clamp is slower when all inputs are in range:

Benchmark results

Integer results are:
Type: ints branchy: 0.600716 branchless: 0.721154

For /fp:fast floating point results are:
Type: reals branchy: 1.02346 branchless: 1.03637

For /fp:precise floating point results are:
Type: reals branchy: 1.02011 branchless: 1.14736

Comment thread stl/inc/algorithm Outdated
@AlexGuteniev AlexGuteniev marked this pull request as ready for review November 12, 2021 20:15
@AlexGuteniev AlexGuteniev requested a review from a team as a code owner November 12, 2021 20:15
@StephanTLavavej StephanTLavavej added the performance Must go faster label Nov 12, 2021
Comment thread stl/inc/algorithm Outdated
Comment thread stl/inc/algorithm Outdated
Comment thread stl/inc/algorithm Outdated
Comment thread stl/inc/algorithm Outdated
Comment thread stl/inc/algorithm
Comment thread stl/inc/algorithm Outdated
@lhecker
Copy link
Copy Markdown
Member

lhecker commented Nov 14, 2021

@AlexGuteniev Ah thanks, I didn't see the benchmark.
To be entirely honest I believe that your benchmark has a flaw: It doesn't use the value after the write into the out array, due to which only the throughput is measured, but not the latency. In such cases the branchy version will win, thanks to the enormous pipeline depth of modern CPUs, which allows multiple iterations of the same loop to execute concurrently, hiding the latency of any branches. But I believe it's fairly likely that the result of functions like clamp are used right away more often than not, which makes latency just as important as throughput.

You can approximate the latency better, by changing out_* to be a single int or double and adding onto it.

With out_* being an array (input range 400-600):

Type: ints branchy: 0.423876 branchless: 0.598911
Type: reals branchy: 0.62541 branchless: 0.633118

With out_* being a scalar (input range 400-600):

Type: ints branchy: 0.623527 branchless: 0.526837
Type: reals branchy: 0.822351 branchless: 0.626051

With out_* being a scalar (input range 0-1000):

Type: ints branchy: 0.790643 branchless: 0.545034
Type: reals branchy: 0.845853 branchless: 0.623605

I find the consistency of the branchless benchmark results remarkable and I think it shows why they should be preferred whenever possible.

Comment thread stl/inc/algorithm
@AlexGuteniev
Copy link
Copy Markdown
Contributor Author

@lhecker , thanks, I've made an edit to the benhmark

@AlexGuteniev
Copy link
Copy Markdown
Contributor Author

But I believe it's fairly likely that the result of functions like clamp are used right away more often than not, which makes latency just as important as throughput.

There's separate std::ranges::clamp. Apparently for that one it is unlikely that results are used right away, as it should go through entire range before the next algorithm is invoked.

@StephanTLavavej StephanTLavavej self-assigned this Dec 1, 2021
@AlexGuteniev AlexGuteniev marked this pull request as draft December 2, 2021 00:42
@AlexGuteniev AlexGuteniev marked this pull request as ready for review December 3, 2021 11:33
Copy link
Copy Markdown
Contributor

@strega-nil-ms strega-nil-ms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor test notes, the code itself looks good

Comment thread tests/std/tests/GH_002334_branchless_clamp/test.cpp Outdated
Comment thread tests/std/tests/GH_002334_branchless_clamp/test.cpp Outdated
Comment thread tests/std/tests/GH_002334_branchless_clamp/test.cpp Outdated
@strega-nil-ms strega-nil-ms added the decision needed We need to choose something before working on this label May 26, 2022
@strega-nil-ms
Copy link
Copy Markdown
Contributor

strega-nil-ms commented May 26, 2022

I've done some benchmarking of my own; the following benchmark.cpp (modeled on yours, @AlexGuteniev):

benchmark.cpp
#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>

#define DO_STRINGIZE(v) L"" #v
#define STRINGIZE(v)    DO_STRINGIZE(v)
#ifndef BRANCHY
#error "forgot to /D BRANCHY"
#endif

struct BenchmarkHeader {
    wchar_t const* branchy = STRINGIZE(BRANCHY);
    wchar_t const* processor;
    wchar_t const* processor_company;
};

struct BenchmarkSettings {
    int size_of_vector  = INT_MIN;
    int runs            = INT_MIN;
    int minimum         = INT_MIN;
    int maximum         = INT_MIN;
    int clamped_minimum = INT_MIN;
    int clamped_maximum = INT_MIN;
    int trials          = INT_MIN;
};

template <class T>
double benchmark(const T* const first, T const* const last, T& result, const BenchmarkHeader& header,
    const BenchmarkSettings& settings) {

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < settings.runs; ++i) {
        for (const T* it = first; it != last; ++it) {
            result += std::clamp<T>(*it, settings.clamped_minimum, settings.clamped_maximum);
        }
    }
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::duration<double>>(end - start).count() / settings.runs;
}

template <class T>
auto benchmark_type(const BenchmarkHeader& header, const BenchmarkSettings& settings, const wchar_t* type) {
    std::vector<T> input(settings.size_of_vector);
    T output = 0;

    std::wcout << header.branchy << ',' << header.processor << ',' << header.processor_company << ',' << type << ','
               << settings.size_of_vector << ',' << settings.runs << ',' << settings.minimum << ',' << settings.maximum
               << ',' << settings.clamped_minimum << ',' << settings.clamped_maximum;
    for (int i = 0; i < settings.trials; ++i) {
        {
            std::random_device rd;
            auto dist = ([&settings]() {
                if constexpr (std::is_integral_v<T>) {
                    return std::uniform_int_distribution<T>(settings.minimum, settings.maximum);
                } else {
                    return std::uniform_real_distribution<T>(settings.minimum, settings.maximum);
                }
            })();

            std::generate(std::begin(input), std::end(input), [&rd, &dist]() { return dist(rd); });
        }

        auto trial = benchmark(input.data(), input.data() + input.size(), output, header, settings);
        std::wcout << ',' << trial;
    }

    std::wcout << '\n';

    return output;
}

int wmain(const int argc, wchar_t** const argv) {
    static constexpr std::pair<const wchar_t*, const wchar_t*(BenchmarkHeader::*)> header_parameters[] = {
        {L"--processor", &BenchmarkHeader::processor},
        {L"--company", &BenchmarkHeader::processor_company},
    };
    static constexpr std::pair<const wchar_t*, int(BenchmarkSettings::*)> settings_parameters[] = {
        {L"--vector-size", &BenchmarkSettings::size_of_vector},
        {L"--runs", &BenchmarkSettings::runs},
        {L"--min", &BenchmarkSettings::minimum},
        {L"--max", &BenchmarkSettings::maximum},
        {L"--clamp-min", &BenchmarkSettings::clamped_minimum},
        {L"--clamp-max", &BenchmarkSettings::clamped_maximum},
        {L"--trials", &BenchmarkSettings::trials},
    };

    BenchmarkHeader header{};
    BenchmarkSettings settings{};

    for (int arg_i = 1; arg_i < argc; ++arg_i) {
        bool found = false;
        for (auto parm : header_parameters) {
            if (wcscmp(argv[arg_i], parm.first) == 0) {
                if (!(header.*parm.second)) {
                    ++arg_i;
                    header.*parm.second = argv[arg_i];
                } else {
                    std::wcerr << L"passed " << parm.first << L" multiple times\n";
                    return 1;
                }
                found = true;
            }
        }
        for (auto parm : settings_parameters) {
            if (wcscmp(argv[arg_i], parm.first) == 0) {
                if (settings.*parm.second == INT_MIN) {
                    ++arg_i;
                    settings.*parm.second = static_cast<int>(wcstol(argv[arg_i], nullptr, 10));
                } else {
                    std::wcerr << L"passed " << parm.first << L"multiple times\n";
                    return 1;
                }
                found = true;
            }
        }
        if (!found) {
            std::wcerr << L"unrecognized argument " << argv[arg_i] << L'\n';
            return 1;
        }
    }

    for (auto parm : header_parameters) {
        if (!(header.*parm.second)) {
            std::wcerr << L"argument " << parm.first << " not passed\n";
            return 1;
        }
    }
    for (auto parm : settings_parameters) {
        if (settings.*parm.second == INT_MIN) {
            std::wcerr << L"argument " << parm.first << " not passed\n";
            return 1;
        }
    }

    int trials = 5;

    std::wcout << L"Branch?,Processor,Processor Company,Type,N,Runs,Min,Max,Clamped Min,Clamped Max";
    for (int i = 0; i < settings.trials; ++i) {
        std::wcout << L',' << L"Trial " << (i + 1);
    }
    std::wcout << L'\n';

    auto s  = benchmark_type<short>(header, settings, L"short");
    auto i  = benchmark_type<int>(header, settings, L"int");
    auto l  = benchmark_type<long>(header, settings, L"long");
    auto ll = benchmark_type<long long>(header, settings, L"long long");
    auto f  = benchmark_type<float>(header, settings, L"float");
    auto d  = benchmark_type<double>(header, settings, L"double");

    std::wcout << L"\n\noutput:\n";
    std::wcout << L"short,int,long,long long,float,double\n";
    std::wcout << s << ',' << i << ',' << l << ',' << ll << ',' << f << ',' << d << '\n';
}

And got the following raw data:

raw data
Branch?,Processor,Processor Company,Type,N,Runs,Min,Max,Clamped Min,Clamped Max,Trial 1,Trial 2,Trial 3,Trial 4,Trial 5,Average
branchless,x64,intel,short,50,000,000,10,-1000,1000,-300,300,0.0307597,0.0307724,0.0309739,0.0309177,0.0315486,0.03099446
branchless,x64,intel,int,50,000,000,10,-1000,1000,-300,300,0.0261614,0.0262558,0.0266834,0.0265467,0.0261862,0.0263667
branchless,x64,intel,long,50,000,000,10,-1000,1000,-300,300,0.0325709,0.0323946,0.0322038,0.0322979,0.0326036,0.03241416
branchless,x64,intel,long long,50,000,000,10,-1000,1000,-300,300,0.0382545,0.038413,0.0378916,0.0382268,0.0380203,0.03816124
branchless,x64,intel,float,50,000,000,10,-1000,1000,-300,300,0.0581218,0.0568805,0.0568683,0.0571446,0.0570825,0.05721954
branchless,x64,intel,double,50,000,000,10,-1000,1000,-300,300,0.0627476,0.0614245,0.0614438,0.0616488,0.0618331,0.06181956
branchless,x64,intel,short,50,000,000,10,-1000,1000,-1000,1000,0.030247,0.0301774,0.0304275,0.0304509,0.0306174,0.03038404
branchless,x64,intel,int,50,000,000,10,-1000,1000,-1000,1000,0.026132,0.0263663,0.0261524,0.0260969,0.0262399,0.0261975
branchless,x64,intel,long,50,000,000,10,-1000,1000,-1000,1000,0.0356146,0.0322538,0.032345,0.0322297,0.0326115,0.03301092
branchless,x64,intel,long long,50,000,000,10,-1000,1000,-1000,1000,0.0382861,0.0439842,0.0382186,0.0385868,0.041021,0.04001934
branchless,x64,intel,float,50,000,000,10,-1000,1000,-1000,1000,0.0567044,0.0568411,0.0570752,0.05716,0.057088,0.05697374
branchless,x64,intel,double,50,000,000,10,-1000,1000,-1000,1000,0.0618878,0.0612712,0.061348,0.0615576,0.0614417,0.06150126
branchless,x64,intel,short,50,000,000,10,-1000,1000,0,1000,0.0304006,0.030353,0.0302131,0.0304818,0.0312518,0.03054006
branchless,x64,intel,int,50,000,000,10,-1000,1000,0,1000,0.0261579,0.0266862,0.0264275,0.0261135,0.0260856,0.02629414
branchless,x64,intel,long,50,000,000,10,-1000,1000,0,1000,0.0325495,0.032575,0.0324561,0.0323098,0.0326873,0.03251554
branchless,x64,intel,long long,50,000,000,10,-1000,1000,0,1000,0.0382069,0.0382973,0.0382747,0.0380733,0.0378058,0.0381316
branchless,x64,intel,float,50,000,000,10,-1000,1000,0,1000,0.0570075,0.0570859,0.0570695,0.0570084,0.0574737,0.057129
branchless,x64,intel,double,50,000,000,10,-1000,1000,0,1000,0.0619618,0.0612276,0.0611233,0.0630977,0.0614704,0.06177616
branchless,x64,intel,short,50,000,000,10,-1000,1000,-1000,0,0.0304815,0.030353,0.0316145,0.0302088,0.0302104,0.03057364
branchless,x64,intel,int,50,000,000,10,-1000,1000,-1000,0,0.0264647,0.0267136,0.0273937,0.0261126,0.0264696,0.02663084
branchless,x64,intel,long,50,000,000,10,-1000,1000,-1000,0,0.032258,0.0323605,0.0325388,0.0325769,0.036068,0.03316044
branchless,x64,intel,long long,50,000,000,10,-1000,1000,-1000,0,0.0380318,0.0399791,0.0381676,0.0379547,0.0378395,0.03839454
branchless,x64,intel,float,50,000,000,10,-1000,1000,-1000,0,0.0571292,0.0573249,0.0572185,0.0569971,0.0568268,0.0570993
branchless,x64,intel,double,50,000,000,10,-1000,1000,-1000,0,0.0635628,0.0614783,0.0615637,0.0612266,0.0615011,0.0618665
branch,x64,intel,short,50,000,000,10,-1000,1000,-300,300,0.131582,0.131115,0.131612,0.130697,0.130677,0.1311366
branch,x64,intel,int,50,000,000,10,-1000,1000,-300,300,0.141665,0.142679,0.141217,0.142365,0.142126,0.1420104
branch,x64,intel,long,50,000,000,10,-1000,1000,-300,300,0.138849,0.139846,0.138676,0.139194,0.139244,0.1391618
branch,x64,intel,long long,50,000,000,10,-1000,1000,-300,300,0.144001,0.147541,0.142996,0.141286,0.143063,0.1437774
branch,x64,intel,float,50,000,000,10,-1000,1000,-300,300,0.263469,0.263556,0.263109,0.263324,0.263803,0.2634522
branch,x64,intel,double,50,000,000,10,-1000,1000,-300,300,0.269243,0.268038,0.267364,0.268133,0.268234,0.2682024
branch,x64,intel,short,50,000,000,10,-1000,1000,-1000,1000,0.0240649,0.0238211,0.0239027,0.0240058,0.0244858,0.02405606
branch,x64,intel,int,50,000,000,10,-1000,1000,-1000,1000,0.0288381,0.026317,0.0260632,0.0259924,0.0262759,0.02669732
branch,x64,intel,long,50,000,000,10,-1000,1000,-1000,1000,0.0294482,0.0279469,0.0305684,0.03038,0.0291242,0.02949354
branch,x64,intel,long long,50,000,000,10,-1000,1000,-1000,1000,0.0328927,0.0343112,0.0333807,0.0329204,0.0335131,0.03340362
branch,x64,intel,float,50,000,000,10,-1000,1000,-1000,1000,0.0545511,0.0551195,0.0544184,0.0544971,0.0545425,0.05462572
branch,x64,intel,double,50,000,000,10,-1000,1000,-1000,1000,0.058983,0.0586856,0.0588727,0.0588711,0.0587617,0.05883482
branch,x64,intel,short,50,000,000,10,-1000,1000,0,1000,0.0238642,0.0239384,0.0238281,0.0239371,0.0242335,0.02396026
branch,x64,intel,int,50,000,000,10,-1000,1000,0,1000,0.026815,0.0263101,0.0292427,0.0269946,0.0263549,0.02714346
branch,x64,intel,long,50,000,000,10,-1000,1000,0,1000,0.0281906,0.0287167,0.0286152,0.0295415,0.0294489,0.02890258
branch,x64,intel,long long,50,000,000,10,-1000,1000,0,1000,0.0330974,0.0374702,0.0333578,0.0336516,0.0331444,0.03414428
branch,x64,intel,float,50,000,000,10,-1000,1000,0,1000,0.205144,0.205456,0.20791,0.203369,0.202325,0.2048408
branch,x64,intel,double,50,000,000,10,-1000,1000,0,1000,0.20852,0.207115,0.207549,0.207759,0.207784,0.2077454
branch,x64,intel,short,50,000,000,10,-1000,1000,-1000,0,0.152678,0.151846,0.152024,0.151943,0.152173,0.1521328
branch,x64,intel,int,50,000,000,10,-1000,1000,-1000,0,0.15962,0.159133,0.159042,0.158947,0.15809,0.1589664
branch,x64,intel,long,50,000,000,10,-1000,1000,-1000,0,0.1637,0.164116,0.163417,0.164494,0.166192,0.1643838
branch,x64,intel,long long,50,000,000,10,-1000,1000,-1000,0,0.161416,0.163301,0.160363,0.160596,0.160196,0.1611744
branch,x64,intel,float,50,000,000,10,-1000,1000,-1000,0,0.195259,0.196052,0.196406,0.195528,0.197289,0.1961068
branch,x64,intel,double,50,000,000,10,-1000,1000,-1000,0,0.198275,0.198893,0.198614,0.200076,0.198825,0.1989366

Which I have bashed into the following:

Type CMin CMax Branch Branchless Branchless Faster
short -1000 1000 0.02405606 0.03038404 79.17%
int -1000 1000 0.02669732 0.0261975 101.91%
long -1000 1000 0.02949354 0.03301092 89.34%
long long -1000 1000 0.03340362 0.04001934 83.47%
float -1000 1000 0.05462572 0.05697374 95.88%
double -1000 1000 0.05883482 0.06150126 95.66%
short 0 1000 0.02396026 0.03054006 78.46%
int 0 1000 0.02714346 0.02629414 103.23%
long 0 1000 0.02890258 0.03251554 88.89%
long long 0 1000 0.03414428 0.0381316 89.54%
float 0 1000 0.2048408 0.057129 358.56%
double 0 1000 0.2077454 0.06177616 336.29%
short -300 300 0.1311366 0.03099446 423.10%
int -300 300 0.1420104 0.0263667 538.60%
long -300 300 0.1391618 0.03241416 429.32%
long long -300 300 0.1437774 0.03816124 376.76%
float -300 300 0.2634522 0.05721954 460.42%
double -300 300 0.2682024 0.06181956 433.85%
short -1000 0 0.1521328 0.03057364 497.59%
int -1000 0 0.1589664 0.02663084 596.93%
long -1000 0 0.1643838 0.03316044 495.72%
long long -1000 0 0.1611744 0.03839454 419.78%
float -1000 0 0.1961068 0.0570993 343.45%
double -1000 0 0.1989366 0.0618665 321.56%

You'll note that there are two cases where branchy wins - the "always in range" case, and the "always in the upper half of the range (for integers)" case. However, it only wins by <20% (normally 5-10%), and the branchless case wins by 3-5x in the cases it's better.

Also, branchless is extremely consistent. The standard deviation of branchy is 0.08, while the standard deviation of branchless is 0.013.

Given this, I am in favor of this change. Marking decision needed for the rest of the team to take a look

@lhecker
Copy link
Copy Markdown
Member

lhecker commented May 27, 2022

@strega-nil-ms Thanks for these exhaustive benchmarks!

Coincidentally there was a Reddit post about cmov just recently and I really liked one comment in particular:

A correctly-predicted branch is slightly faster than a cmov. An incorrectly-predicted branch is incredibly slower than a cmov.

I found that very fitting and I'm glad to see this to be seemingly confirmed. 🙂

Comment thread tests/std/tests/GH_002334_branchless_clamp/test.cpp Outdated
Comment thread tests/std/tests/GH_002334_branchless_clamp/test.cpp Outdated
Comment thread tests/std/tests/GH_002334_branchless_clamp/env.lst Outdated
Comment thread tests/std/tests/GH_002334_branchless_clamp/test.cpp Outdated
Comment thread tests/std/tests/GH_002334_branchless_clamp/test.cpp Outdated
Comment thread tests/std/tests/GH_002334_branchless_clamp/test.cpp
@StephanTLavavej
Copy link
Copy Markdown
Member

FYI @strega-nil-ms, I pushed minor changes after you approved.

@StephanTLavavej StephanTLavavej self-assigned this Jun 1, 2022
@StephanTLavavej
Copy link
Copy Markdown
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej removed the decision needed We need to choose something before working on this label Jun 2, 2022
@StephanTLavavej StephanTLavavej merged commit ae710e7 into microsoft:main Jun 2, 2022
@StephanTLavavej
Copy link
Copy Markdown
Member

🗜️ 🎉 🗜️

@AlexGuteniev AlexGuteniev deleted the i_hate_ternary_operator branch June 2, 2022 05:32
fsb4000 pushed a commit to fsb4000/STL that referenced this pull request Aug 13, 2022
Co-authored-by: nicole mazzuca <[email protected]>
Co-authored-by: Stephan T. Lavavej <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Must go faster

Projects

None yet

Development

Successfully merging this pull request may close these issues.

<algorithm>: clamp produces unoptimal assembly with branches

6 participants