/**
* @brief Low-level micro-benchmarks for building a performance-first mindset.
* @file less_slow.cpp
* @author Ash Vardanian
*
* There's no Easter bunny, no tooth fairy... and no free abstractions!
* Every abstraction—no matter how elegant—comes with tradeoffs. Sometimes
* the cost is in readability, like extra layers of indirection, and other
* times, it's in performance, with additional instructions or memory overhead.
*
* This project dives into such tradeoffs, helping engineers develop a deeper
* understanding of the costs associated with seemingly simple constructs.
* While the examples are written in C++20 and focus on GCC and Clang,
* targeting x86_64 and ARM64 architectures, the principles are universal.
*
* The benchmarks cover the costs of numerical operations, designing micro-
* kernels, parallelism, computational complexity, branch prediction, compiler
* limitations, and @b composing those with callbacks, coroutines, and ranges
* into more complex systems. Same principles apply to Rust, Python, and Go,
* so parts of the project have been reproduced in those languages.
*
* @see Rust Benchmarks: https://github.com/ashvardanian/less_slow.rs
* @see Python Benchmarks: https://github.com/ashvardanian/less_slow.py
*
* Most measurements were performed on Intel Sapphire Rapids CPUs on AWS,
* but the findings match across hardware platforms unless explicitly noted.
*
 * Worth noting that some examples may seem over-engineered, but they are
 * neither irrelevant nor impractical. They may be hard to recognize at first,
* but they universally appear in larger codebases, as a form of emergent
* complexity.
*
* Let's benchmark them all and dive into the implementation details that
* make those abstractions @b less_slow!
*/
#include <benchmark/benchmark.h>
namespace bm = benchmark;
#pragma region - Basics
#pragma region How to Benchmark and Randomness
/**
* Using Google Benchmark is simple. You define a C++ function and then
* register it using the provided C macros. The suite will invoke your
 * function, passing a `State` object that dynamically chooses the number
* of loops to run based on the time it takes to execute each cycle.
*
* For simplicity, let's start by benchmarking the most basic operation -
* the 32-bit integer addition, universally natively supported by every
* modern CPU, be it x86, ARM, or RISC-V, 32-bit or 64-bit, big-endian
* or little-endian.
*/
#include <cstdint> // `std::int32_t` and other sized integers
#include <cstdlib> // `std::rand`
static void i32_addition(bm::State &state) {
std::int32_t a = std::rand(), b = std::rand(), c = 0;
for (auto _ : state) c = a + b;
 // In some categories of projects, benchmarks are easier to convert
 // into tests than the other way around :)
if (c != a + b) state.SkipWithError("Incorrect sum!");
}
BENCHMARK(i32_addition);
/**
 * Trivial kernels operating on constant values are not the most
 * straightforward candidates for benchmarking. The compiler can easily
 * optimize them away, and the CPU can predict the result... reporting "0ns" -
 * zero nanoseconds per iteration. Unfortunately, no operation is that fast
 * in practice. A 3 GHz CPU completes 3 billion cycles every second, so even
 * a single-cycle operation takes about 0.33ns, not 0ns. If we drop the
 * @b `-O3` "Release build" optimization flag from the compilation settings,
 * we may see a non-zero value, but it won't represent real-world performance.
*
 * One way to avoid this is to implement the kernel in @b inline-assembly,
* interleaving it with the higher-level C++ code:
*/
#if defined(__GNUC__) && !defined(__clang__) //! GCC-style inline assembly; MSVC doesn't support it on 64-bit targets, and Clang is skipped here despite supporting the syntax!
#if defined(__x86_64__) || defined(__i386__) //? Works for both 64-bit and 32-bit x86 architectures
static void i32_addition_inline_asm(bm::State &state) {
// In inline assembly for x86 we are not explicitly naming the registers,
// so in the 32-bit and 64-bit modes different registers will be used:
// - For 32-bit (`__i386__`): Registers like `eax` and `ebx` will be used,
// and pointers will be 4 bytes wide.
// - For 64-bit targets (`__x86_64__`): Registers like `eax`, `r8d` will
// be used for 32-bit values, while pointers will be 8 bytes wide.
std::int32_t a = std::rand(), b = std::rand(), c = 0;
for (auto _ : state) {
asm volatile(
// Perform a 32-bit addition of `b` into `a`.
"addl %[b], %[a]\n\t"
// `[a] "=r"(c)` means treat `c` as the output for the result.
// `"0"(a)` means reuse the same register allocated to the first output operand for `a`.
// `[b] "r"(b)` means that `b` is a read-only operand that must reside in a register.
: [a] "=r"(c)
: "0"(a), [b] "r"(b)
// Tell the compiler that this code modifies the condition codes (CPU flags),
// so it cannot assume those flags are still valid after this assembly block.
: "cc");
}
if (c != a + b) state.SkipWithError("Incorrect sum!");
}
BENCHMARK(i32_addition_inline_asm);
#elif defined(__aarch64__) //? The following kernel is just for the 64-bit Arm
static void i32_addition_inline_asm(bm::State &state) {
// In inline assembly for AArch64 we use `%w` registers for 32-bit operations.
// That means `add %w[a], %w[a], %w[b]` will add the 32-bit subregisters
// of these named operands. Pointers remain 8 bytes wide, but here we only
// deal with 32-bit integers.
std::int32_t a = std::rand(), b = std::rand(), c = 0;
for (auto _ : state) {
asm volatile(
// Perform a 32-bit addition of `b` into `a`: `%w[a] := %w[a] + %w[b]`.
"add %w[a], %w[a], %w[b]\n\t"
// `[a] "=r"(c)` means treat `c` as the output for the result of the operation.
// `"0"(a)` says to reuse the same register allocated to the first output operand for `a`.
// `[b] "r"(b)` means that `b` is a read-only operand that must reside in a register.
: [a] "=r"(c)
: "0"(a), [b] "r"(b)
// Tell the compiler that this assembly modifies the condition flags (CPU flags),
// so it cannot rely on them remaining unaltered after this assembly block.
: "cc");
}
if (c != a + b) state.SkipWithError("Incorrect sum!");
}
BENCHMARK(i32_addition_inline_asm);
#endif // defined(__x86_64__) || defined(__i386__) || defined(__aarch64__)
#endif // defined(__GNUC__) && !defined(__clang__)
/**
* We can also put the assembly kernels into separate `.S` files and link them
* to our C++ target. Each approach has its technical tradeoffs:
*
* - Inline Assembly:
* Requires direct interaction with registers, which must be carefully managed
* using constraints and clobbers to ensure the compiler knows which registers
* are modified. While inline assembly enables tight coupling with C++ logic
* and access to local variables, it is less portable due to compiler-specific
* syntax and optimization variability. Debugging inline assembly can also be
* challenging as it is embedded in higher-level code.
*
* - Separate Assembly Files:
* Abstracts away register management through adherence to the platform's
* Application Binary Interface @b (ABI). This makes the assembly routines
* easier to debug, test, and reuse across projects. However, separate files
* require more boilerplate for function calls, stack management, and
* parameter passing. They are preferred for large or standalone routines
* that benefit from modularity and clear separation from C++ code.
*
* In this project, we provide assembly kernels for two platforms:
*
* - @b less_slow_aarch64.S - for the 64-bit ARM architecture.
* - @b less_slow_amd64.S - for the x86_64 architecture, with 64-bit extensions,
* originally introduced by AMD.
*/
#if !defined(_MSC_VER) && (defined(__x86_64__) || defined(__aarch64__) || defined(__i386__) || defined(_M_X64))
extern "C" std::int32_t i32_add_asm_kernel(std::int32_t a, std::int32_t b);
static void i32_addition_asm(bm::State &state) {
std::int32_t a = std::rand(), b = std::rand(), c = 0;
for (auto _ : state) c = i32_add_asm_kernel(a, b);
if (c != a + b) state.SkipWithError("Incorrect sum!");
}
BENCHMARK(i32_addition_asm);
#endif // !defined(_MSC_VER) && (defined(__x86_64__) || defined(__aarch64__) || defined(__i386__) || defined(_M_X64))
/**
* So far the results may be:
*
* - `i32_addition` - @b 0ns, as the compiler optimized the code away.
* - `i32_addition_inline_asm` - @b 0.2ns, a single instruction was inlined.
* - `i32_addition_asm` - @b 0.9ns, a new stack frame was created!
*
* Keep this in mind! Even with Link-Time Optimization @b (LTO) enabled,
* most of the time, compilers won't be able to inline your Assembly kernels.
*
* Don't want to switch to Assembly to fool the compilers? No problem!
* Another thing we can try - is generating random inputs on the fly with
* @b `std::rand()`, one of the most controversial operations in the
* C standard library.
*/
static void i32_addition_random(bm::State &state) {
std::int32_t c;
for (auto _ : state) c = std::rand() + std::rand();
(void)c; //? Silence "variable `c` set but not used" warning
}
BENCHMARK(i32_addition_random);
/**
* Running this will report @b 31ns or about 100 CPU cycles. Is integer
* addition really that expensive? It's used all the time, even when you are
* accessing @b `std::vector` elements and need to compute the memory address
* from the pointer and the index passed to the @b `operator[]` or `at()`
* functions. The answer is - no, it's not. The addition takes a single CPU
* cycle and is very fast.
*
* Chances are we just benchmarked something else... the @b `std::rand()`
* function. What if we could ask Google Benchmark to ignore the time spent
* in the `std::rand()` function? There are `PauseTiming` and `ResumeTiming`
* functions just for that!
*/
static void i32_addition_paused(bm::State &state) {
std::int32_t a, b, c;
for (auto _ : state) {
state.PauseTiming();
a = std::rand(), b = std::rand();
state.ResumeTiming();
bm::DoNotOptimize(c = a + b);
}
}
BENCHMARK(i32_addition_paused);
/**
 * However, the `PauseTiming` and `ResumeTiming` functions aren't free either.
* In the current implementation, they can easily take @b ~233ns, or around
* 150 CPU cycles. They are useless in this case, but there is an alternative!
*
* A typical pattern when implementing a benchmark is to initialize with a
* random value and then define a very cheap update policy that won't affect
* the latency much but will update the inputs. Increments, bit shifts, and bit
* rotations are common choices!
*
* It's also a good idea to use native @b CRC32 and @b AES instructions to
* produce a random state, as it's often done in StringZilla. Another common
* approach is to use integer multiplication, usually derived from the
 * Golden Ratio, as in Knuth's multiplicative hash (with `2654435761`).
*
* @see StringZilla: https://github.com/ashvardanian/stringzilla
*/
static void i32_addition_randomly_initialized(bm::State &state) {
std::int32_t a = std::rand(), b = std::rand(), c = 0;
for (auto _ : state) bm::DoNotOptimize(c = (++a) + (++b));
}
BENCHMARK(i32_addition_randomly_initialized);
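The multiplicative-hash update policy mentioned above can be sketched as follows. This is a hypothetical helper, not part of the benchmark suite; the only assumption is Knuth's golden-ratio-derived constant `2654435761` from the comment above, with unsigned overflow wrapping modulo 2^32, which is well-defined in C++.

```cpp
#include <cstdint>

// Cheap pseudo-random update: one multiplication by Knuth's constant.
// Unsigned arithmetic wraps modulo 2^32, so this is well-defined and
// fast enough not to dominate a micro-benchmark's measured latency.
inline std::uint32_t knuth_update(std::uint32_t x) { return x * 2654435761u; }
```

One could then replace `(++a) + (++b)` above with `knuth_update(a) + knuth_update(b)` to get better-spread inputs at a similar per-iteration cost.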
/**
* On x86, the `i32_addition_randomly_initialized` benchmark performs two
* @b `inc` instructions and one @b `add` instruction. This should take less
* than @b 0.4ns on a modern CPU. The first cycle increments `a` and `b`
* simultaneously on different Arithmetic Logic Units (ALUs) of the same core,
* while the second cycle performs the final accumulation. At least @b 97% of
* the benchmark time was spent in the `std::rand()` function... even in a
* single-threaded benchmark.
*
* This may seem like a trivial example, far removed from "real-world
* production systems" of "advanced proprietary software designed by the
* world's leading engineers." Sadly, issues like this persist in many
* benchmarks and sometimes influence multi-billion-dollar decisions 🤬
*
* @see Bad I/O benchmark examples: https://www.unum.cloud/blog/2022-03-22-ucsb
*
* How bad is it? Let's re-run the same two benchmarks, this time on all cores.
*/
#include <stdexcept> // `std::runtime_error` in `physical_cores`
#include <thread>    // `std::thread::hardware_concurrency`
#include <vector>    // `std::vector<BYTE>` buffer for the Windows branch
#if defined(__linux__)
#include <unistd.h> // `_SC_NPROCESSORS_ONLN`
#elif defined(__APPLE__)
#include <sys/sysctl.h> // `sysctlbyname` on macOS
#elif defined(_WIN32)
#define WIN32_LEAN_AND_MEAN
#define NOMINMAX
#include <Windows.h>
#include <WinBase.h>
#endif
/**
 * @brief Returns the number of physical cores available on the system,
 *        as opposed to the logical cores, which include hyper-threading.
 *        Note: the Linux fallback below uses `_SC_NPROCESSORS_ONLN`, which
 *        actually reports online @b logical processors.
 */
std::size_t physical_cores() {
#if defined(__linux__)
int nproc = sysconf(_SC_NPROCESSORS_ONLN);
return static_cast<std::size_t>(nproc);
#elif defined(__APPLE__)
int nproc = 0;
size_t len = sizeof(nproc);
sysctlbyname("hw.physicalcpu", &nproc, &len, nullptr, 0);
return static_cast<std::size_t>(nproc);
#elif defined(_WIN32)
// On Windows, both `std::thread::hardware_concurrency` and `GetSystemInfo`
 // return at most 64 cores, as limited by a single Windows processor group.
// However, starting with newer versions of Windows, applications can seamlessly
// span across multiple processor groups.
DWORD buffer_size = 0;
GetLogicalProcessorInformationEx(RelationProcessorCore, nullptr, &buffer_size);
if (buffer_size == 0) throw std::runtime_error("GetLogicalProcessorInformationEx failed to get buffer size");
using core_info_t = PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX;
std::vector<BYTE> buffer(buffer_size);
if (!GetLogicalProcessorInformationEx(RelationProcessorCore, reinterpret_cast<core_info_t>(buffer.data()),
&buffer_size))
throw std::runtime_error("GetLogicalProcessorInformationEx failed to get core info");
std::size_t core_count = 0;
for (DWORD buffer_progress = 0; buffer_progress < buffer_size;) {
core_info_t ptr = reinterpret_cast<core_info_t>(buffer.data() + buffer_progress);
if (ptr->Relationship == RelationProcessorCore) ++core_count;
buffer_progress += ptr->Size;
}
return core_count;
#else
return std::thread::hardware_concurrency();
#endif
}
BENCHMARK(i32_addition_random)->Threads(physical_cores());
BENCHMARK(i32_addition_randomly_initialized)->Threads(physical_cores());
/**
* The latency of the `std::rand` variant skyrocketed from @b 31ns in
* single-threaded mode to @b 10'974ns when running on multiple threads,
* while our optimized variant remained unaffected.
*
* This happens because `std::rand`, like many other LibC functions, relies
* on a global state protected by a mutex to ensure thread-safe access.
* This mutex-based synchronization becomes a severe bottleneck when multiple
* threads contend for the same global resource.
*
* Here's the relevant snippet from the GNU C Library (GlibC) implementation:
*
* long int __random (void) {
* int32_t retval;
* __libc_lock_lock (lock);
* (void) __random_r (&unsafe_state, &retval);
* __libc_lock_unlock (lock);
* return retval;
* }
* weak_alias (__random, random)
*
* This perfectly illustrates why experienced low-level engineers often avoid
* the "singleton" pattern, where a single shared global state introduces
* contention and kills performance under multi-threaded workloads.
*
* @see GlibC implementation:
* https://code.woboq.org/userspace/glibc/stdlib/random.c.html#291
* @see "Faster random integer generation with batching" by Daniel Lemire:
* https://lemire.me/blog/2024/08/17/faster-random-integer-generation-with-batching/
*/
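The global-lock contention above can be avoided by giving each thread its own generator state. A minimal sketch, assuming seeding each thread's engine once from `std::random_device` (one common policy, not the only one); the function name is hypothetical:

```cpp
#include <cstdint>
#include <random>

// Each thread owns its own engine, so no global mutex is contended -
// unlike the glibc `rand()` snippet above, where every call locks.
// `thread_local` initializes the engine once per thread, on first use.
inline std::uint32_t thread_local_rand() {
    thread_local std::mt19937 generator {std::random_device {}()};
    return static_cast<std::uint32_t>(generator());
}
```

With this design, the multi-threaded latency stays flat as thread counts grow, because there is no shared state to synchronize on.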
#pragma endregion // How to Benchmark and Randomness
#pragma region Parallelism and Computational Complexity
/**
* The most obvious way to speed up code is to parallelize it. Since 2002, CPU
* clock speeds have plateaued, with CPUs getting wider instead of faster,
* featuring more cores and hardware threads. However, not all algorithms can
* be parallelized easily. Some are inherently sequential, while others are
* simply too small to benefit from parallel execution.
*
* Let's begin with @b `std::sort`, one of the best-known and best-optimized
* algorithms in the C++ Standard Library.
*
* @see Docs for `std::sort`: https://en.cppreference.com/w/cpp/algorithm/sort
*
* A straightforward benchmarking strategy could involve applying a random
* shuffle before each sort. However, this would introduce variability, making
* the benchmark less predictable. Knowing that `std::sort` uses a Quick-Sort
* variant, we can instead reverse the array on each iteration — a classic
* worst-case scenario for this family of algorithms.
*
* We can also parameterize the benchmark with runtime-configurable values,
* such as the array size and whether to include the preprocessing step.
* Google Benchmark provides the @b `Args` function precisely for this purpose.
*/
#include <algorithm> // `std::sort`
#include <numeric> // `std::iota`
/**
* @brief A minimalistic `std::vector` replacement, wrapping an aligned
* allocation similar to `std::unique_ptr`.
* @see https://stackoverflow.com/a/79363156/2766161
*/
template <typename type_>
class aligned_array {
type_ *data_ = nullptr;
std::size_t size_ = 0;
std::size_t alignment_ = 0;
public:
aligned_array(std::size_t size, std::size_t alignment = 64) : size_(size), alignment_(alignment) {
// With `std::aligned_alloc` in C++17, an exception won't be raised, which may be preferred in
// some environments. MSVC was late to adopt it, and developers would often use a combination
// of lower-level `posix_memalign` and `_aligned_malloc`/`_aligned_free` across Unix and Windows.
data_ = (type_ *)::operator new(sizeof(type_) * size_, std::align_val_t(alignment_));
}
~aligned_array() noexcept { ::operator delete(data_, sizeof(type_) * size_, std::align_val_t(alignment_)); }
aligned_array(aligned_array const &) = delete;
aligned_array &operator=(aligned_array const &) = delete;
aligned_array(aligned_array &&) = delete;
aligned_array &operator=(aligned_array &&) = delete;
type_ *begin() const noexcept { return data_; }
type_ *end() const noexcept { return data_ + size_; }
type_ &operator[](std::size_t index) noexcept { return data_[index]; }
type_ operator[](std::size_t index) const noexcept { return data_[index]; }
};
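A quick sanity check of the aligned `operator new` used by `aligned_array` above — a hypothetical helper for illustration, assuming the C++17 `std::align_val_t` overloads:

```cpp
#include <cstddef>
#include <cstdint>
#include <new>

// Allocate `bytes` with the C++17 aligned `operator new`, verify the
// address is a multiple of the requested alignment, and free it.
bool aligned_allocation_is_honored(std::size_t bytes, std::size_t alignment) {
    void *ptr = ::operator new(bytes, std::align_val_t(alignment));
    bool aligned = reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
    ::operator delete(ptr, std::align_val_t(alignment));
    return aligned;
}
```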
static void sorting(bm::State &state) {
auto length = static_cast<std::size_t>(state.range(0));
auto include_preprocessing = static_cast<bool>(state.range(1));
aligned_array<std::uint32_t> array(length);
std::iota(array.begin(), array.end(), 1u);
for (auto _ : state) {
if (!include_preprocessing) state.PauseTiming();
// Reverse order is the most classical worst case, but not the only one.
std::reverse(array.begin(), array.end());
if (!include_preprocessing) state.ResumeTiming();
std::sort(array.begin(), array.end());
}
if (!std::is_sorted(array.begin(), array.end())) state.SkipWithError("Array is not sorted!");
}
BENCHMARK(sorting)->Args({3, false})->Args({3, true});
BENCHMARK(sorting)->Args({4, false})->Args({4, true});
BENCHMARK(sorting)->Args({1024, false})->Args({1024, true});
BENCHMARK(sorting)->Args({8196, false})->Args({8196, true});
/**
* This highlights how optimal control flow depends on input size:
* - On small inputs, it's faster to perform preprocessing.
* - On larger inputs, the overhead from preprocessing outweighs its benefits.
*
* Until C++17, the standard lacked built-in parallel algorithms.
* The C++17 standard introduced the @b `std::execution` namespace, including
* the @b `std::execution::par_unseq` policy for parallel, order-independent
* execution.
*
* To check for support, the @b `__cpp_lib_parallel_algorithm` standard
* feature testing macro can be used.
*
* @see Feature testing macros: https://en.cppreference.com/w/cpp/utility/feature_test
*/
#if !defined(USE_INTEL_TBB)
#define USE_INTEL_TBB 1
#endif // !defined(USE_INTEL_TBB)
#if defined(__cpp_lib_parallel_algorithm) && USE_INTEL_TBB
#include <execution> // `std::execution::par_unseq`
template <typename execution_policy_>
static void sorting_with_executors( //
bm::State &state, execution_policy_ &&policy) {
auto length = static_cast<std::size_t>(state.range(0));
aligned_array<std::uint32_t> array(length);
std::iota(array.begin(), array.end(), 1u);
for (auto _ : state) {
std::reverse(policy, array.begin(), array.end());
std::sort(policy, array.begin(), array.end());
}
if (!std::is_sorted(array.begin(), array.end())) state.SkipWithError("Array is not sorted!");
state.SetComplexityN(length);
// Want to report something else? Sure, go ahead:
//
// state.counters["temperature_on_mars"] = bm::Counter(-95.4);
//
// Just please, for the love of rockets, use the metric system.
// We've already lost one Mars climate orbiter to a tragic "feet vs. meters"
// debate. Let's not make NASA cry again. 🚀💥
}
BENCHMARK_CAPTURE(sorting_with_executors, seq, std::execution::seq)
->RangeMultiplier(4)
->Range(1l << 20, 1l << 28)
->MinTime(10)
->Complexity(bm::oNLogN)
->UseRealTime();
/**
* Memory leak observed in libstdc++ using oneTBB under specific conditions:
* @see Github issue: https://github.com/ashvardanian/less_slow.cpp/issues/17
*
* A workaround is implemented by limiting the number of iterations
* for this benchmark to a single run.
*
* This adjustment is applied to the benchmark below:
*/
BENCHMARK_CAPTURE(sorting_with_executors, par_unseq, std::execution::par_unseq)
->RangeMultiplier(4)
->Range(1l << 20, 1l << 28)
//! Revert from `Iterations` to `MinTime` once the leak is resolved!
//! ->MinTime(10)
->Iterations(1)
->Complexity(bm::oNLogN)
->UseRealTime();
/**
* Without @b `UseRealTime()`, CPU time is measured by default.
* This distinction matters: if your process sleeps, it no longer
* accumulates CPU time.
*
* The @b `Complexity` function specifies the asymptotic computational
* complexity of the benchmark. To estimate the scaling factor, we benchmark
* over a broad range of input sizes, from 2^20 (1 million)
* to 2^28 (256 million) entries — translating to 4 MB to 1 GB of data.
*
* This approach outputs both timings and inferred complexity estimates:
*
* sorting_with_executors/seq/1048576 5776408 ns 5776186 ns
* sorting_with_executors/seq/4194154 25323450 ns 2532153 ns
* sorting_with_executors/seq/16777216 109073782 ns 109071515 ns
* sorting_with_executors/seq/67108864 482794615 ns 482777617 ns
* sorting_with_executors/seq/268435456 2548725384 ns 2548695506 ns
* sorting_with_executors/seq_BigO 0.34 NlgN 0.34 NlgN
* sorting_with_executors/seq_RMS 8 % 8 %
*
* As demonstrated, scaling isn't strictly linear, especially for tasks
* that aren't fully data-parallel.
*
* @see "105 STL Algorithms in Less Than an Hour" by Jonathan Boccara at CppCon 2018:
* https://youtu.be/2olsGf6JIkU
* @see "The C++17 Parallel Algorithms Library and Beyond"
* by Bryce Adelstein Lelbach at CppCon 2016: https://youtu.be/Vck6kzWjY88
*/
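The fitted coefficient can be sanity-checked arithmetically against the largest sequential run from the table above — a small check for illustration, not part of the benchmark suite:

```cpp
#include <cmath>

// Compare the fitted model 0.34 * N * lg(N) nanoseconds with the
// measured wall time for the largest run: N = 2^28 at ~2.549 s.
bool model_matches_measurement() {
    double n = 268435456.0;                    // 2^28 elements
    double model_ns = 0.34 * n * std::log2(n); // ~2.556e9 ns
    double measured_ns = 2548725384.0;         // reported above
    return std::fabs(model_ns - measured_ns) / measured_ns < 0.01;
}
```

The model lands within ~0.3% of the measurement, which is why the reported RMS error of 8% across all sizes is a reasonable fit.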
#endif // defined(__cpp_lib_parallel_algorithm) && USE_INTEL_TBB
#if defined(_OPENMP)
/**
* An alternative to "Parallel STL" is to design a custom parallel solution
* using some thread pool or task scheduler. The most common approach is to
* divide the input into blocks that can be processed independently and then
* implement a tree-like parallel aggregation of partial results.
*
* Many thread pools exist, including the underlying Intel's @b oneTBB,
* as well as OpenMP. The latter is not a library, but a broadly adopted
* set of compiler directives, capable of parallelizing loops and sections.
*/
#include <omp.h> // `omp_get_max_threads`
std::size_t round_to_multiple(std::size_t value, std::size_t multiple) {
return ((value + multiple - 1) / multiple) * multiple;
}
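As a minimal illustration of OpenMP's directive style before the full sorting kernel below — a toy reduction, hypothetical and not part of the suite. It also degrades gracefully: without `-fopenmp`, the pragma is ignored and the loop runs serially with an identical result.

```cpp
#include <cstdint>

// Sum 0..n-1 in parallel: each thread accumulates a private partial
// sum, and the `reduction(+ : total)` clause combines them at the end.
std::int64_t parallel_range_sum(std::int64_t n) {
    std::int64_t total = 0;
#pragma omp parallel for reduction(+ : total)
    for (std::int64_t i = 0; i < n; ++i) total += i;
    return total;
}
```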
static void sorting_with_openmp(bm::State &state) {
auto const length = static_cast<std::size_t>(state.range(0));
auto const chunks = static_cast<std::size_t>(omp_get_max_threads());
// Let's round up chunk lengths to presumably 64-byte cache lines.
auto const chunk_length = round_to_multiple(length / chunks, 64 / sizeof(std::uint32_t));
auto const chunk_start_offset = [=](std::size_t i) -> std::size_t {
std::size_t offset = i * chunk_length;
return offset < length ? offset : length;
};
aligned_array<std::uint32_t> array(length);
std::iota(array.begin(), array.end(), 1u);
for (auto _ : state) {
std::reverse(array.begin(), array.end());
//! Remarkably, on Windows, OpenMP can't handle unsigned integers,
//! so we use `std::int64_t` over `std::size_t`.
#pragma omp parallel for
// Sort each chunk in parallel
for (std::int64_t i = 0; i < static_cast<std::int64_t>(chunks); i++) {
std::size_t start = chunk_start_offset(static_cast<std::size_t>(i));
std::size_t finish = chunk_start_offset(static_cast<std::size_t>(i) + 1);
std::sort(array.begin() + start, array.begin() + finish);
}
// Merge the blocks in a tree-like fashion doubling the size of the merged block each time
for (std::size_t merge_step = 1; merge_step < chunks; merge_step *= 2) {
#pragma omp parallel for
for (std::int64_t i = 0; i < static_cast<std::int64_t>(chunks); i += 2 * merge_step) {
std::size_t first_chunk_index = static_cast<std::size_t>(i);
std::size_t second_chunk_index = first_chunk_index + merge_step;
if (second_chunk_index >= chunks) continue; // No merge needed
// We use `inplace_merge` as opposed to `std::merge` to avoid extra memory allocations,
// but it may not be as fast: https://stackoverflow.com/a/21624819/2766161
auto start = chunk_start_offset(first_chunk_index);
auto mid = chunk_start_offset(second_chunk_index);
auto finish = chunk_start_offset(std::min(second_chunk_index + merge_step, chunks));
std::inplace_merge(array.begin() + start, array.begin() + mid, array.begin() + finish);
}
}
}
if (!std::is_sorted(array.begin(), array.end())) state.SkipWithError("Array is not sorted!");
state.SetComplexityN(length);
}
BENCHMARK(sorting_with_openmp)
->RangeMultiplier(4)
->Range(1l << 20, 1l << 28)
->MinTime(10)
->Complexity(bm::oNLogN)
->UseRealTime();
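The chunk-then-merge structure above can be distilled into a serial toy version — a sketch for illustration (hypothetical names, no OpenMP) that sorts fixed-width chunks and then merges pairs, doubling the merged width each pass, just like the tree merge in `sorting_with_openmp`:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Serial skeleton of the parallel sort: sort chunks, then repeatedly
// `std::inplace_merge` adjacent sorted runs of growing width.
std::vector<int> chunked_sort(std::vector<int> values, std::size_t chunk) {
    for (std::size_t start = 0; start < values.size(); start += chunk)
        std::sort(values.begin() + start,
                  values.begin() + std::min(start + chunk, values.size()));
    for (std::size_t width = chunk; width < values.size(); width *= 2)
        for (std::size_t start = 0; start + width < values.size(); start += 2 * width)
            std::inplace_merge(values.begin() + start, values.begin() + start + width,
                               values.begin() + std::min(start + 2 * width, values.size()));
    return values;
}
```

In the parallel version each `for` body of the merge pass is independent, which is exactly what the `#pragma omp parallel for` above exploits.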
#endif // defined(_OPENMP)
/**
* Detecting CUDA availability isn't trivial. NVIDIA's CUDA toolchain defines:
* - __NVCC__: Set only when using the NVCC compiler.
* - __CUDACC__: Set when NVCC compiles CUDA code (host or device).
* - __CUDA_ARCH__: Informs the CUDA version for compiling device (GPU) code.
*
* Since host compilers may not define these macros, we use @b `__has_include`
* to check for `<cuda_runtime.h>` as an indicator that CUDA is available.
*
* @see NVCC Identification Macros docs:
* https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#nvcc-identification-macro
*/
#if !defined(USE_NVIDIA_CCCL)
#if defined(__has_include)
#if __has_include(<cuda_runtime.h>)
#define USE_NVIDIA_CCCL 1
#else
#define USE_NVIDIA_CCCL 0
#endif // __has_include(<cuda_runtime.h>)
#else
#define USE_NVIDIA_CCCL 0
#endif // defined(__has_include)
#endif // !defined(USE_NVIDIA_CCCL)
#if USE_NVIDIA_CCCL
/**
* Unlike STL, Thrust provides some very handy abstractions for sorting
* one array by values in another. We are not going to use them here!
* And we will also avoid instantiating any CUDA @b `<algorithm>`-like
 * templates in this translation unit to better separate host-side and
 * device-side code and keep compilation times sane!
*
* @see Sorting in Thrust: https://nvidia.github.io/cccl/thrust/api_docs/algorithms/sorting
*/
#include <cuda_runtime.h> // `cudaError_t`
#include <thrust/device_vector.h> // `thrust::device_vector`
#include <thrust/host_vector.h> // `thrust::host_vector`
using namespace std::string_literals; // For `""s` literals
/**
* @brief Reverses the array and sorts it with Nvidia's Thrust from CCCL.
* @see Declared in `less_slow.cpp`, but defined in @b `less_slow.cu`!
*/
extern void reverse_and_sort_with_thrust(std::uint32_t *device_pointer, std::size_t array_length);
static void sorting_with_thrust(benchmark::State &state) {
const auto count = static_cast<std::size_t>(state.range(0));
// Typically, the data is first allocated on the "host" CPU side,
// initialized, and then transferred to the "device" GPU memory.
// In our specific case, we could have also used `thrust::sequence`.
thrust::host_vector<std::uint32_t> host_array(count);
std::iota(host_array.begin(), host_array.end(), 1u);
thrust::device_vector<std::uint32_t> device_array = host_array;
for (auto _ : state) {
reverse_and_sort_with_thrust(device_array.data().get(), count);
cudaError_t error = cudaDeviceSynchronize(); //! Block until the GPU has completed all tasks
if (error != cudaSuccess) state.SkipWithError("CUDA error after kernel launch: "s + cudaGetErrorString(error));
benchmark::DoNotOptimize(device_array.data());
}
state.SetComplexityN(count);
state.SetItemsProcessed(count * state.iterations());
state.SetBytesProcessed(count * state.iterations() * sizeof(std::uint32_t));
}
BENCHMARK(sorting_with_thrust)
->RangeMultiplier(4)
->Range(1ll << 20, 1ll << 28)
->MinTime(10)
->Complexity(benchmark::oN) // Not `oNLogN` - it's Radix Sort!
->UseRealTime();
/**
* Thrust, just like STL, is often convenient but not always the fastest.
* It may allocate temporary memory, perform extra copies, or use suboptimal
* algorithms. Thrust's underlying CUB provides more control and a lot of
 * functionality for device-wide, block-wide, and warp-wide operations.
*
* @see CUB docs: https://nvidia.github.io/cccl/cub/modules
*
* Sadly, CUB provides no `reverse` functionality, so we need to combine it
* with a Thrust call, scheduling them on the same job queue. We will also
* pre-allocate temporary memory for CUB's sorting algorithm, and will use
* device-side timers for more accurate measurements.
*/
extern std::size_t reverse_and_sort_with_cub_space(std::uint32_t *device_pointer, std::size_t array_length);
extern void reverse_and_sort_with_cub( //
std::uint32_t *device_pointer, std::size_t array_length, // Task
void *temporary_pointer, std::size_t temporary_length, // Space
cudaStream_t stream); // Order
static void sorting_with_cub(bm::State &state) {
auto count = static_cast<std::size_t>(state.range(0));
thrust::host_vector<std::uint32_t> host_array(count);
std::iota(host_array.begin(), host_array.end(), 1u);
thrust::device_vector<std::uint32_t> device_array = host_array;
// One of the interesting design choices of CUB is that you can call
// the target method with `NULL` arguments to infer the required temporary
// memory amount. The Radix Sort generally requires ~ 2N temporary memory.
//
// Another one is the naming of the operations - `SortKeys` instead of `Sort`.
    // There are also `SortPairs` and `SortPairsDescending` for key-value pairs.
std::size_t temporary_bytes = reverse_and_sort_with_cub_space(device_array.data().get(), count);
// Allocate temporary memory with `cudaMalloc` instead of using a `device_vector`,
// due to potential compilation issues on the host-side compiler.
std::byte *temporary_pointer = nullptr;
cudaError_t error = cudaMalloc(&temporary_pointer, temporary_bytes);
if (error != cudaSuccess) {
state.SkipWithError("Failed to allocate temporary memory: "s + cudaGetErrorString(error));
return;
}
// To schedule the Thrust and CUB operations on the same CUDA stream,
// we need to wrap it into a "policy" object.
cudaStream_t sorting_stream;
cudaStreamCreate(&sorting_stream);
auto policy = thrust::cuda::par.on(sorting_stream);
// Create CUDA events for timing
cudaEvent_t start_event, stop_event;
cudaEventCreate(&start_event);
cudaEventCreate(&stop_event);
for (auto _ : state) {
// Record the start event
cudaEventRecord(start_event, sorting_stream);
// Run the kernels
reverse_and_sort_with_cub( //
device_array.data().get(), count, // Task
temporary_pointer, temporary_bytes, // Space
sorting_stream); // Order
// Record the stop event
cudaEventRecord(stop_event, sorting_stream);
cudaError_t error = cudaEventSynchronize(stop_event); //! Block until the GPU has completed all tasks
if (error != cudaSuccess) state.SkipWithError("CUDA error after kernel launch: "s + cudaGetErrorString(error));
float milliseconds = 0.0f;
cudaEventElapsedTime(&milliseconds, start_event, stop_event);
state.SetIterationTime(milliseconds / 1000.0f);
}
    // ! All of the CUDA calls above and below can fail, so consider checking their error codes
// ! and using a different kernel launch mechanism: https://ashvardanian.com/posts/less-wrong-cuda-hello-world/
cudaEventDestroy(start_event);
cudaEventDestroy(stop_event);
cudaStreamDestroy(sorting_stream);
cudaFree(temporary_pointer);
state.SetComplexityN(count);
state.SetItemsProcessed(count * state.iterations());
state.SetBytesProcessed(count * state.iterations() * sizeof(std::uint32_t));
}
BENCHMARK(sorting_with_cub)
->RangeMultiplier(4)
->Range(1l << 20, 1l << 28)
->MinTime(10)
->Complexity(benchmark::oN) // Not `oNLogN` - it's Radix Sort!
->UseManualTime();
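To make the two-phase CUB convention concrete, here is a hedged sketch of what the companion `less_slow.cu` could define for the two `extern` functions above — an illustration of the pattern, not necessarily the file's actual contents. It assumes CCCL's `cub::DeviceRadixSort` and reserves room for the radix sort's alternate "ping-pong" buffer inside the same temporary allocation:

```cuda
// Sketch only: calling `SortKeys` with a `nullptr` temporary buffer makes CUB
// report the required byte count instead of sorting. CUB's radix sort also
// ping-pongs keys between two buffers (`cub::DoubleBuffer`), so we carve the
// alternate copy out of the tail of the same temporary allocation.
#include <cstddef>                          // `std::size_t`
#include <cstdint>                          // `std::uint32_t`
#include <cub/device/device_radix_sort.cuh> // `cub::DeviceRadixSort`
#include <thrust/execution_policy.h>        // `thrust::cuda::par`
#include <thrust/reverse.h>                 // `thrust::reverse`

std::size_t reverse_and_sort_with_cub_space(std::uint32_t *device_pointer, std::size_t array_length) {
    std::size_t temporary_bytes = 0;
    cub::DoubleBuffer<std::uint32_t> keys(device_pointer, nullptr);
    // With a `nullptr` temporary buffer, this only computes `temporary_bytes`:
    cub::DeviceRadixSort::SortKeys(nullptr, temporary_bytes, keys, array_length);
    return temporary_bytes + array_length * sizeof(std::uint32_t); // + alternate buffer
}

void reverse_and_sort_with_cub(                              //
    std::uint32_t *device_pointer, std::size_t array_length, // Task
    void *temporary_pointer, std::size_t temporary_length,   // Space
    cudaStream_t stream) {                                   // Order
    thrust::reverse(thrust::cuda::par.on(stream), device_pointer, device_pointer + array_length);
    std::size_t const alternate_bytes = array_length * sizeof(std::uint32_t);
    auto *alternate = reinterpret_cast<std::uint32_t *>( //
        static_cast<char *>(temporary_pointer) + temporary_length - alternate_bytes);
    cub::DoubleBuffer<std::uint32_t> keys(device_pointer, alternate);
    std::size_t cub_bytes = temporary_length - alternate_bytes;
    cub::DeviceRadixSort::SortKeys(temporary_pointer, cub_bytes, keys, array_length, 0, 32, stream);
    // The sorted result may land in either buffer - copy back if needed:
    if (keys.Current() != device_pointer)
        cudaMemcpyAsync(device_pointer, keys.Current(), alternate_bytes, cudaMemcpyDeviceToDevice, stream);
}
```

The `DoubleBuffer` overload is what keeps CUB's reported temporary size small: the pointer overload would instead require a caller-provided output array of the same length.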
/**
* Comparing CPU to GPU performance is not straightforward, but we can compare
* Thrust to CUB solutions. On NVIDIA H200 GPU, for the largest 1 GB buffer:
*
* - `sorting_with_thrust` takes @b 6.6 ms
* - `sorting_with_cub` takes @b 6 ms
*
 * Performance improvements of 10% to 50% are typical for CUB.
*/
#endif // USE_NVIDIA_CCCL
#pragma endregion // Parallelism and Computational Complexity
#pragma region Recursion
/**
* The `std::sort` and the underlying Quick-Sort are perfect research subjects
* for benchmarking and understanding how the computer works. Naively
 * implementing the Quick-Sort in C/C++ would still put us at a disadvantage
 * compared to the STL.
*
 * Most implementations we find in textbooks use recursion. Recursion is a
* beautiful concept, but it's not always the best choice for performance. Every
* nested call requires a new stack frame, and the stack is limited. Moreover,
* local variables need to be constructed and destructed, and the CPU needs to
* jump around in memory.
*
 * The alternative, as it often is in computing, is to compensate for the
 * runtime issue with memory. We can use an explicit stack data structure to
 * store the state of the algorithm and process it in a loop.
 *
 * The same ideas commonly appear in tree and graph algorithms.
*/
#include <utility> // `std::swap`
/**
* @brief Quick-Sort helper function for array partitioning, reused by both
* recursive and iterative implementations.
*/
template <typename element_type_>
struct quick_sort_partition {
using element_t = element_type_;
inline std::ptrdiff_t operator()(element_t *arr, std::ptrdiff_t const low, std::ptrdiff_t const high) noexcept {
element_t pivot = arr[high];
std::ptrdiff_t i = low - 1;
for (std::ptrdiff_t j = low; j <= high - 1; j++) {
if (arr[j] > pivot) continue;
i++;
std::swap(arr[i], arr[j]);
}
std::swap(arr[i + 1], arr[high]);
return i + 1;
}
};
/**
* @brief Quick-Sort implementation as a C++ function object, using recursion.
* Note, recursion and @b inlining are not compatible.
*/
template <typename element_type_>
struct quick_sort_recurse {
using element_t = element_type_;
using quick_sort_partition_t = quick_sort_partition<element_t>;
using quick_sort_recurse_t = quick_sort_recurse<element_t>;
void operator()(element_t *arr, std::ptrdiff_t low, std::ptrdiff_t high) noexcept {
if (low >= high) return;
auto pivot = quick_sort_partition_t {}(arr, low, high);
quick_sort_recurse_t {}(arr, low, pivot - 1);
quick_sort_recurse_t {}(arr, pivot + 1, high);
}
};
/**
* @brief Quick-Sort implementation as a C++ function object, with iterative
* deepening using a "stack" data-structure.
*
* Note, this implementation can be inlined, but can't be @b `noexcept`, due to
* a potential memory allocation in the `std::vector::resize` function.
*
* Fun fact: The `std::vector` is actually a better choice for a "stack" than
* the `std::stack`, as the latter builds on top of a `std::deque`, which is
* normally implemented as a sequence of individually allocated fixed-size arrays,
* with additional bookkeeping. In our logic we never need to pop from the middle
* or from the front, so a `std::vector` is a better choice.
*/
#include <vector> // `std::vector`
template <typename element_type_>
struct quick_sort_iterate {
using element_t = element_type_;
using quick_sort_partition_t = quick_sort_partition<element_t>;
std::vector<std::ptrdiff_t> stack;
void operator()(element_t *arr, std::ptrdiff_t low, std::ptrdiff_t high) noexcept(false) {
stack.resize((high - low + 1) * 2);
std::ptrdiff_t top = -1;
stack[++top] = low;
stack[++top] = high;
while (top >= 0) {
high = stack[top--];
low = stack[top--];
auto pivot = quick_sort_partition_t {}(arr, low, high);
// If there are elements on left side of pivot,
// then push left side to stack
if (low < pivot - 1) {
stack[++top] = low;
stack[++top] = pivot - 1;
}
// If there are elements on right side of pivot,
// then push right side to stack
if (pivot + 1 < high) {
stack[++top] = pivot + 1;
stack[++top] = high;
}
}
}
};
template <typename sorter_type_, std::size_t length_> //
static void recursion_cost(bm::State &state) {
using element_t = typename sorter_type_::element_t;
sorter_type_ sorter;
aligned_array<element_t> array(length_);
for (auto _ : state) {
for (std::size_t i = 0; i != length_; ++i) array[i] = length_ - i;
sorter(array.begin(), 0, static_cast<std::ptrdiff_t>(length_ - 1));
}
if (!std::is_sorted(array.begin(), array.end())) state.SkipWithError("Array is not sorted!");
state.SetComplexityN(length_);
}
using recursive_sort_i32s = quick_sort_recurse<std::int32_t>;
using iterative_sort_i32s = quick_sort_iterate<std::int32_t>;
BENCHMARK_TEMPLATE(recursion_cost, recursive_sort_i32s, 16);
BENCHMARK_TEMPLATE(recursion_cost, iterative_sort_i32s, 16);
BENCHMARK_TEMPLATE(recursion_cost, recursive_sort_i32s, 256);
BENCHMARK_TEMPLATE(recursion_cost, iterative_sort_i32s, 256);
BENCHMARK_TEMPLATE(recursion_cost, recursive_sort_i32s, 4096);
BENCHMARK_TEMPLATE(recursion_cost, iterative_sort_i32s, 4096);
/**
* If you try pushing the size further, the program will likely @b crash due
* to @b stack_overflow. The recursive version is limited by the stack size, while
* the iterative version can handle much larger inputs.
*
* As can be seen from our benchmarks, the STL implementation of `std::sort`
* is more efficient than our naive kernels, and it's only one of many expressive
* solutions in the @b <algorithm> header.
*/
#pragma endregion // Recursion
#pragma region Branch Prediction
/**
* The `if` statement and the seemingly innocent ternary operator `x ? a : b`
* can be surprisingly expensive in performance-critical code. This is
* especially noticeable when conditional execution operates at the byte level,
* as in text processing, parsing, searching, compression, encoding, and
* similar tasks.
*
* Modern CPUs have sophisticated branch predictors — some of the most complex
* hardware components in a processor. These predictors memorize recent branch
* patterns, enabling "speculative execution," where the CPU starts processing
* the next task (`i+1`) before fully completing the current one (`i`).
*
* While a single `if` in a hot-path is usually not a problem, real-world
* applications often involve thousands of branches. On most modern CPUs, up
* to @b 4096 branches can be memorized. Beyond that, branch mis-predictions
* occur, causing a severe slowdown due to pipeline stalls.
*
* Consider this example: The same snippet can run at @b 0.7ns per operation
* when branch predictions are accurate but slows down to @b 3.7ns when
* predictions fail.
*/
static void branch_cost(bm::State &state) {
auto count = static_cast<std::size_t>(state.range(0));
aligned_array<std::int32_t> random_values(count);
std::generate_n(random_values.begin(), count, &std::rand);
std::int32_t variable = 0;
std::size_t iteration = 0;
for (auto _ : state) {
std::int32_t random = random_values[(++iteration) & (count - 1)];
bm::DoNotOptimize( //
variable = // ! Fun fact: multiplication compiles to a jump,
(random & 1) // ! but replacing with a bitwise operation results in a conditional move.
? (variable + random)
: (variable ^ random));
}
}
BENCHMARK(branch_cost)->RangeMultiplier(4)->Range(256, 32 * 1024);
/**
* It's hard to reason if the above code should compile into a conditional move or a jump,