/**
* @brief Low-level micro-benchmarks for building a performance-first mindset.
* @file less_slow.cpp
* @author Ash Vardanian
*
* There's no Easter bunny, no tooth fairy... and no free abstractions!
* Every abstraction—no matter how elegant—comes with tradeoffs. Sometimes
* the cost is in readability, like extra layers of indirection, and other
* times, it's in performance, with additional instructions or memory overhead.
*
* This project dives into such tradeoffs, helping engineers develop a deeper
* understanding of the costs associated with seemingly simple constructs.
* While the examples are written in C++20 and focus on GCC and Clang,
* targeting x86_64 and ARM64 architectures, the principles are universal.
*
* The benchmarks cover the costs of numerical operations, designing micro-
* kernels, parallelism, computational complexity, branch prediction, compiler
* limitations, and @b composing those with callbacks, coroutines, and ranges
* into more complex systems. Same principles apply to Rust, Python, and Go,
* so parts of the project have been reproduced in those languages.
*
* @see Rust Benchmarks: https://github.com/ashvardanian/less_slow.rs
* @see Python Benchmarks: https://github.com/ashvardanian/less_slow.py
*
* Most measurements were performed on Intel Sapphire Rapids CPUs on AWS,
* but the findings match across hardware platforms unless explicitly noted.
*
 * Worth noting that some examples may seem over-engineered, but they are
 * neither irrelevant nor impractical. They may be hard to recognize at first,
* but they universally appear in larger codebases, as a form of emergent
* complexity.
*
* Let's benchmark them all and dive into the implementation details that
* make those abstractions @b less_slow!
*/
#include <benchmark/benchmark.h>
namespace bm = benchmark;
#pragma region - Basics
#pragma region How to Benchmark and Randomness
/**
* Using Google Benchmark is simple. You define a C++ function and then
* register it using the provided C macros. The suite will invoke your
 * function, passing a `State` object that dynamically chooses the number
* of loops to run based on the time it takes to execute each cycle.
*
* For simplicity, let's start by benchmarking the most basic operation -
* the 32-bit integer addition, universally natively supported by every
* modern CPU, be it x86, ARM, or RISC-V, 32-bit or 64-bit, big-endian
* or little-endian.
*/
#include <cstdint> // `std::int32_t` and other sized integers
#include <cstdlib> // `std::rand`
static void i32_addition(bm::State &state) {
std::int32_t a = std::rand(), b = std::rand(), c = 0;
for (auto _ : state) c = a + b;
 // In some categories of projects, benchmarks are easier to convert
 // into tests than the other way around :)
if (c != a + b) state.SkipWithError("Incorrect sum!");
}
BENCHMARK(i32_addition);
/**
 * Trivial kernels operating on constant values are not the most
 * straightforward candidates for benchmarking. The compiler can easily
 * optimize them away, and the CPU can predict the result... reporting "0ns" -
 * zero nanoseconds per iteration. Unfortunately, no operation is that fast
 * in practice. A 3 GHz CPU completes 3 billion cycles every second, so even
 * a single-cycle operation takes about 0.33ns, not 0ns. If we drop the
 * @b `-O3` "Release build" optimization flag from the compilation settings,
 * we may see a non-zero value, but it won't represent real-world performance.
*
 * One way to avoid this is to implement the kernel in @b inline-assembly,
* interleaving it with the higher-level C++ code:
*/
#if defined(__GNUC__) && !defined(__clang__) //! GCC-style inline assembly; MSVC doesn't support it on 64-bit targets, and Clang is skipped here despite supporting the syntax!
#if defined(__x86_64__) || defined(__i386__) //? Works for both 64-bit and 32-bit x86 architectures
static void i32_addition_inline_asm(bm::State &state) {
// In inline assembly for x86 we are not explicitly naming the registers,
// so in the 32-bit and 64-bit modes different registers will be used:
// - For 32-bit (`__i386__`): Registers like `eax` and `ebx` will be used,
// and pointers will be 4 bytes wide.
// - For 64-bit targets (`__x86_64__`): Registers like `eax`, `r8d` will
// be used for 32-bit values, while pointers will be 8 bytes wide.
std::int32_t a = std::rand(), b = std::rand(), c = 0;
for (auto _ : state) {
asm volatile(
// Perform a 32-bit addition of `b` into `a`.
"addl %[b], %[a]\n\t"
// `[a] "=r"(c)` means treat `c` as the output for the result.
// `"0"(a)` means reuse the same register allocated to the first output operand for `a`.
// `[b] "r"(b)` means that `b` is a read-only operand that must reside in a register.
: [a] "=r"(c)
: "0"(a), [b] "r"(b)
// Tell the compiler that this code modifies the condition codes (CPU flags),
// so it cannot assume those flags are still valid after this assembly block.
: "cc");
}
if (c != a + b) state.SkipWithError("Incorrect sum!");
}
BENCHMARK(i32_addition_inline_asm);
#elif defined(__aarch64__) //? The following kernel is just for the 64-bit Arm
static void i32_addition_inline_asm(bm::State &state) {
// In inline assembly for AArch64 we use `%w` registers for 32-bit operations.
// That means `add %w[a], %w[a], %w[b]` will add the 32-bit subregisters
// of these named operands. Pointers remain 8 bytes wide, but here we only
// deal with 32-bit integers.
std::int32_t a = std::rand(), b = std::rand(), c = 0;
for (auto _ : state) {
asm volatile(
// Perform a 32-bit addition of `b` into `a`: `%w[a] := %w[a] + %w[b]`.
"add %w[a], %w[a], %w[b]\n\t"
// `[a] "=r"(c)` means treat `c` as the output for the result of the operation.
// `"0"(a)` says to reuse the same register allocated to the first output operand for `a`.
// `[b] "r"(b)` means that `b` is a read-only operand that must reside in a register.
: [a] "=r"(c)
: "0"(a), [b] "r"(b)
// Tell the compiler that this assembly modifies the condition flags (CPU flags),
// so it cannot rely on them remaining unaltered after this assembly block.
: "cc");
}
if (c != a + b) state.SkipWithError("Incorrect sum!");
}
BENCHMARK(i32_addition_inline_asm);
#endif // defined(__x86_64__) || defined(__i386__) || defined(__aarch64__)
#endif // defined(__GNUC__) && !defined(__clang__)
/**
* We can also put the assembly kernels into separate `.S` files and link them
* to our C++ target. Each approach has its technical tradeoffs:
*
* - Inline Assembly:
* Requires direct interaction with registers, which must be carefully managed
* using constraints and clobbers to ensure the compiler knows which registers
* are modified. While inline assembly enables tight coupling with C++ logic
* and access to local variables, it is less portable due to compiler-specific
* syntax and optimization variability. Debugging inline assembly can also be
* challenging as it is embedded in higher-level code.
*
* - Separate Assembly Files:
* Abstracts away register management through adherence to the platform's
* Application Binary Interface @b (ABI). This makes the assembly routines
* easier to debug, test, and reuse across projects. However, separate files
* require more boilerplate for function calls, stack management, and
* parameter passing. They are preferred for large or standalone routines
* that benefit from modularity and clear separation from C++ code.
*
* In this project, we provide assembly kernels for two platforms:
*
* - @b less_slow_aarch64.S - for the 64-bit ARM architecture.
* - @b less_slow_amd64.S - for the x86_64 architecture, with 64-bit extensions,
* originally introduced by AMD.
*/
#if !defined(_MSC_VER) && (defined(__x86_64__) || defined(__aarch64__) || defined(__i386__) || defined(_M_X64))
extern "C" std::int32_t i32_add_asm_kernel(std::int32_t a, std::int32_t b);
static void i32_addition_asm(bm::State &state) {
std::int32_t a = std::rand(), b = std::rand(), c = 0;
for (auto _ : state) c = i32_add_asm_kernel(a, b);
if (c != a + b) state.SkipWithError("Incorrect sum!");
}
BENCHMARK(i32_addition_asm);
#endif // !defined(_MSC_VER) && (defined(__x86_64__) || defined(__aarch64__) || defined(__i386__) || defined(_M_X64))
/**
* So far the results may be:
*
* - `i32_addition` - @b 0ns, as the compiler optimized the code away.
* - `i32_addition_inline_asm` - @b 0.2ns, a single instruction was inlined.
* - `i32_addition_asm` - @b 0.9ns, a new stack frame was created!
*
* Keep this in mind! Even with Link-Time Optimization @b (LTO) enabled,
* most of the time, compilers won't be able to inline your Assembly kernels.
*
* Don't want to switch to Assembly to fool the compilers? No problem!
* Another thing we can try - is generating random inputs on the fly with
* @b `std::rand()`, one of the most controversial operations in the
* C standard library.
*/
static void i32_addition_random(bm::State &state) {
std::int32_t c;
for (auto _ : state) c = std::rand() + std::rand();
(void)c; //? Silence "variable `c` set but not used" warning
}
BENCHMARK(i32_addition_random);
/**
* Running this will report @b 31ns or about 100 CPU cycles. Is integer
* addition really that expensive? It's used all the time, even when you are
* accessing @b `std::vector` elements and need to compute the memory address
* from the pointer and the index passed to the @b `operator[]` or `at()`
* functions. The answer is - no, it's not. The addition takes a single CPU
* cycle and is very fast.
*
* Chances are we just benchmarked something else... the @b `std::rand()`
* function. What if we could ask Google Benchmark to ignore the time spent
* in the `std::rand()` function? There are `PauseTiming` and `ResumeTiming`
* functions just for that!
*/
static void i32_addition_paused(bm::State &state) {
std::int32_t a, b, c;
for (auto _ : state) {
state.PauseTiming();
a = std::rand(), b = std::rand();
state.ResumeTiming();
bm::DoNotOptimize(c = a + b);
}
}
BENCHMARK(i32_addition_paused);
/**
 * However, the `PauseTiming` and `ResumeTiming` functions aren't free either.
* In the current implementation, they can easily take @b ~233ns, or around
* 150 CPU cycles. They are useless in this case, but there is an alternative!
*
* A typical pattern when implementing a benchmark is to initialize with a
* random value and then define a very cheap update policy that won't affect
* the latency much but will update the inputs. Increments, bit shifts, and bit
* rotations are common choices!
*
* It's also a good idea to use native @b CRC32 and @b AES instructions to
* produce a random state, as it's often done in StringZilla. Another common
* approach is to use integer multiplication, usually derived from the
 * Golden Ratio, as in Knuth's multiplicative hash (with `2654435761`).
*
* @see StringZilla: https://github.com/ashvardanian/stringzilla
*/
static void i32_addition_randomly_initialized(bm::State &state) {
std::int32_t a = std::rand(), b = std::rand(), c = 0;
for (auto _ : state) bm::DoNotOptimize(c = (++a) + (++b));
}
BENCHMARK(i32_addition_randomly_initialized);
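The multiplicative-hash update policy mentioned above can be sketched as follows. This is a hypothetical helper, not part of the benchmark suite; the only assumption is Knuth's golden-ratio-derived constant `2654435761` from the comment above, with unsigned overflow wrapping modulo 2^32, which is well-defined in C++.

```cpp
#include <cstdint>

// Cheap pseudo-random update: one multiplication by Knuth's constant.
// Unsigned arithmetic wraps modulo 2^32, so this is well-defined and
// fast enough not to dominate a micro-benchmark's measured latency.
inline std::uint32_t knuth_update(std::uint32_t x) { return x * 2654435761u; }
```

One could then replace `(++a) + (++b)` above with `knuth_update(a) + knuth_update(b)` to get better-spread inputs at a similar per-iteration cost.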
/**
* On x86, the `i32_addition_randomly_initialized` benchmark performs two
* @b `inc` instructions and one @b `add` instruction. This should take less
* than @b 0.4ns on a modern CPU. The first cycle increments `a` and `b`
* simultaneously on different Arithmetic Logic Units (ALUs) of the same core,
* while the second cycle performs the final accumulation. At least @b 97% of
* the benchmark time was spent in the `std::rand()` function... even in a
* single-threaded benchmark.
*
* This may seem like a trivial example, far removed from "real-world
* production systems" of "advanced proprietary software designed by the
* world's leading engineers." Sadly, issues like this persist in many
* benchmarks and sometimes influence multi-billion-dollar decisions 🤬
*
* @see Bad I/O benchmark examples: https://www.unum.cloud/blog/2022-03-22-ucsb
*
* How bad is it? Let's re-run the same two benchmarks, this time on all cores.
*/
#include <stdexcept> // `std::runtime_error` in `physical_cores`
#include <thread>    // `std::thread::hardware_concurrency`
#include <vector>    // `std::vector<BYTE>` buffer for the Windows branch
#if defined(__linux__)
#include <unistd.h> // `_SC_NPROCESSORS_ONLN`
#elif defined(__APPLE__)
#include <sys/sysctl.h> // `sysctlbyname` on macOS
#elif defined(_WIN32)
#define WIN32_LEAN_AND_MEAN
#define NOMINMAX
#include <Windows.h>
#include <WinBase.h>
#endif
/**
 * @brief Returns the number of physical cores available on the system,
 *        as opposed to the logical cores, which include hyper-threading.
 *        Note: the Linux fallback below uses `_SC_NPROCESSORS_ONLN`, which
 *        actually reports online @b logical processors.
 */
std::size_t physical_cores() {
#if defined(__linux__)
int nproc = sysconf(_SC_NPROCESSORS_ONLN);
return static_cast<std::size_t>(nproc);
#elif defined(__APPLE__)
int nproc = 0;
size_t len = sizeof(nproc);
sysctlbyname("hw.physicalcpu", &nproc, &len, nullptr, 0);
return static_cast<std::size_t>(nproc);
#elif defined(_WIN32)
// On Windows, both `std::thread::hardware_concurrency` and `GetSystemInfo`
 // return at most 64 cores, as limited by a single Windows processor group.
// However, starting with newer versions of Windows, applications can seamlessly
// span across multiple processor groups.
DWORD buffer_size = 0;
GetLogicalProcessorInformationEx(RelationProcessorCore, nullptr, &buffer_size);
if (buffer_size == 0) throw std::runtime_error("GetLogicalProcessorInformationEx failed to get buffer size");
using core_info_t = PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX;
std::vector<BYTE> buffer(buffer_size);
if (!GetLogicalProcessorInformationEx(RelationProcessorCore, reinterpret_cast<core_info_t>(buffer.data()),
&buffer_size))
throw std::runtime_error("GetLogicalProcessorInformationEx failed to get core info");
std::size_t core_count = 0;
for (DWORD buffer_progress = 0; buffer_progress < buffer_size;) {
core_info_t ptr = reinterpret_cast<core_info_t>(buffer.data() + buffer_progress);
if (ptr->Relationship == RelationProcessorCore) ++core_count;
buffer_progress += ptr->Size;
}
return core_count;
#else
return std::thread::hardware_concurrency();
#endif
}
BENCHMARK(i32_addition_random)->Threads(physical_cores());
BENCHMARK(i32_addition_randomly_initialized)->Threads(physical_cores());
/**
* The latency of the `std::rand` variant skyrocketed from @b 31ns in
* single-threaded mode to @b 10'974ns when running on multiple threads,
* while our optimized variant remained unaffected.
*
* This happens because `std::rand`, like many other LibC functions, relies
* on a global state protected by a mutex to ensure thread-safe access.
* This mutex-based synchronization becomes a severe bottleneck when multiple
* threads contend for the same global resource.
*
* Here's the relevant snippet from the GNU C Library (GlibC) implementation:
*
* long int __random (void) {
* int32_t retval;
* __libc_lock_lock (lock);
* (void) __random_r (&unsafe_state, &retval);
* __libc_lock_unlock (lock);
* return retval;
* }
* weak_alias (__random, random)
*
* This perfectly illustrates why experienced low-level engineers often avoid
* the "singleton" pattern, where a single shared global state introduces
* contention and kills performance under multi-threaded workloads.
*
* @see GlibC implementation:
* https://code.woboq.org/userspace/glibc/stdlib/random.c.html#291
* @see "Faster random integer generation with batching" by Daniel Lemire:
* https://lemire.me/blog/2024/08/17/faster-random-integer-generation-with-batching/
*/
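The global-lock contention above can be avoided by giving each thread its own generator state. A minimal sketch, assuming seeding each thread's engine once from `std::random_device` (one common policy, not the only one); the function name is hypothetical:

```cpp
#include <cstdint>
#include <random>

// Each thread owns its own engine, so no global mutex is contended -
// unlike the glibc `rand()` snippet above, where every call locks.
// `thread_local` initializes the engine once per thread, on first use.
inline std::uint32_t thread_local_rand() {
    thread_local std::mt19937 generator {std::random_device {}()};
    return static_cast<std::uint32_t>(generator());
}
```

With this design, the multi-threaded latency stays flat as thread counts grow, because there is no shared state to synchronize on.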
#pragma endregion // How to Benchmark and Randomness
#pragma region Parallelism and Computational Complexity
/**
* The most obvious way to speed up code is to parallelize it. Since 2002, CPU
* clock speeds have plateaued, with CPUs getting wider instead of faster,
* featuring more cores and hardware threads. However, not all algorithms can
* be parallelized easily. Some are inherently sequential, while others are
* simply too small to benefit from parallel execution.
*
* Let's begin with @b `std::sort`, one of the best-known and best-optimized
* algorithms in the C++ Standard Library.
*
* @see Docs for `std::sort`: https://en.cppreference.com/w/cpp/algorithm/sort
*
* A straightforward benchmarking strategy could involve applying a random
* shuffle before each sort. However, this would introduce variability, making
* the benchmark less predictable. Knowing that `std::sort` uses a Quick-Sort
* variant, we can instead reverse the array on each iteration — a classic
* worst-case scenario for this family of algorithms.
*
* We can also parameterize the benchmark with runtime-configurable values,
* such as the array size and whether to include the preprocessing step.
* Google Benchmark provides the @b `Args` function precisely for this purpose.
*/
#include <algorithm> // `std::sort`
#include <numeric> // `std::iota`
/**
* @brief A minimalistic `std::vector` replacement, wrapping an aligned
* allocation similar to `std::unique_ptr`.
* @see https://stackoverflow.com/a/79363156/2766161
*/
template <typename type_>
class aligned_array {
type_ *data_ = nullptr;
std::size_t size_ = 0;
std::size_t alignment_ = 0;
public:
aligned_array(std::size_t size, std::size_t alignment = 64) : size_(size), alignment_(alignment) {
// With `std::aligned_alloc` in C++17, an exception won't be raised, which may be preferred in
// some environments. MSVC was late to adopt it, and developers would often use a combination
// of lower-level `posix_memalign` and `_aligned_malloc`/`_aligned_free` across Unix and Windows.
data_ = (type_ *)::operator new(sizeof(type_) * size_, std::align_val_t(alignment_));
}
~aligned_array() noexcept { ::operator delete(data_, sizeof(type_) * size_, std::align_val_t(alignment_)); }
aligned_array(aligned_array const &) = delete;
aligned_array &operator=(aligned_array const &) = delete;
aligned_array(aligned_array &&) = delete;
aligned_array &operator=(aligned_array &&) = delete;
type_ *begin() const noexcept { return data_; }
type_ *end() const noexcept { return data_ + size_; }
type_ &operator[](std::size_t index) noexcept { return data_[index]; }
type_ operator[](std::size_t index) const noexcept { return data_[index]; }
};
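A quick sanity check of the aligned `operator new` used by `aligned_array` above — a hypothetical helper for illustration, assuming the C++17 `std::align_val_t` overloads:

```cpp
#include <cstddef>
#include <cstdint>
#include <new>

// Allocate `bytes` with the C++17 aligned `operator new`, verify the
// address is a multiple of the requested alignment, and free it.
bool aligned_allocation_is_honored(std::size_t bytes, std::size_t alignment) {
    void *ptr = ::operator new(bytes, std::align_val_t(alignment));
    bool aligned = reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
    ::operator delete(ptr, std::align_val_t(alignment));
    return aligned;
}
```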
static void sorting(bm::State &state) {
auto length = static_cast<std::size_t>(state.range(0));
auto include_preprocessing = static_cast<bool>(state.range(1));
aligned_array<std::uint32_t> array(length);
std::iota(array.begin(), array.end(), 1u);
for (auto _ : state) {
if (!include_preprocessing) state.PauseTiming();
// Reverse order is the most classical worst case, but not the only one.
std::reverse(array.begin(), array.end());
if (!include_preprocessing) state.ResumeTiming();
std::sort(array.begin(), array.end());
}
if (!std::is_sorted(array.begin(), array.end())) state.SkipWithError("Array is not sorted!");
}
BENCHMARK(sorting)->Args({3, false})->Args({3, true});
BENCHMARK(sorting)->Args({4, false})->Args({4, true});
BENCHMARK(sorting)->Args({1024, false})->Args({1024, true});
BENCHMARK(sorting)->Args({8196, false})->Args({8196, true});
/**
* This highlights how optimal control flow depends on input size:
* - On small inputs, it's faster to perform preprocessing.
* - On larger inputs, the overhead from preprocessing outweighs its benefits.
*
* Until C++17, the standard lacked built-in parallel algorithms.
* The C++17 standard introduced the @b `std::execution` namespace, including
* the @b `std::execution::par_unseq` policy for parallel, order-independent
* execution.
*
* To check for support, the @b `__cpp_lib_parallel_algorithm` standard
* feature testing macro can be used.
*
* @see Feature testing macros: https://en.cppreference.com/w/cpp/utility/feature_test
*/
#if !defined(USE_INTEL_TBB)
#define USE_INTEL_TBB 1
#endif // !defined(USE_INTEL_TBB)
#if defined(__cpp_lib_parallel_algorithm) && USE_INTEL_TBB
#include <execution> // `std::execution::par_unseq`
template <typename execution_policy_>
static void sorting_with_executors( //
bm::State &state, execution_policy_ &&policy) {
auto length = static_cast<std::size_t>(state.range(0));
aligned_array<std::uint32_t> array(length);
std::iota(array.begin(), array.end(), 1u);
for (auto _ : state) {
std::reverse(policy, array.begin(), array.end());
std::sort(policy, array.begin(), array.end());
}
if (!std::is_sorted(array.begin(), array.end())) state.SkipWithError("Array is not sorted!");
state.SetComplexityN(length);
// Want to report something else? Sure, go ahead:
//
// state.counters["temperature_on_mars"] = bm::Counter(-95.4);
//
// Just please, for the love of rockets, use the metric system.
// We've already lost one Mars climate orbiter to a tragic "feet vs. meters"
// debate. Let's not make NASA cry again. 🚀💥
}
BENCHMARK_CAPTURE(sorting_with_executors, seq, std::execution::seq)
->RangeMultiplier(4)
->Range(1l << 20, 1l << 28)
->MinTime(10)
->Complexity(bm::oNLogN)
->UseRealTime();
/**
* Memory leak observed in libstdc++ using oneTBB under specific conditions:
* @see Github issue: https://github.com/ashvardanian/less_slow.cpp/issues/17
*
* A workaround is implemented by limiting the number of iterations
* for this benchmark to a single run.
*
* This adjustment is applied to the benchmark below:
*/
BENCHMARK_CAPTURE(sorting_with_executors, par_unseq, std::execution::par_unseq)
->RangeMultiplier(4)
->Range(1l << 20, 1l << 28)
//! Revert from `Iterations` to `MinTime` once the leak is resolved!
//! ->MinTime(10)
->Iterations(1)
->Complexity(bm::oNLogN)
->UseRealTime();
/**
* Without @b `UseRealTime()`, CPU time is measured by default.
* This distinction matters: if your process sleeps, it no longer
* accumulates CPU time.
*
* The @b `Complexity` function specifies the asymptotic computational
* complexity of the benchmark. To estimate the scaling factor, we benchmark
* over a broad range of input sizes, from 2^20 (1 million)
* to 2^28 (256 million) entries — translating to 4 MB to 1 GB of data.
*
* This approach outputs both timings and inferred complexity estimates:
*
* sorting_with_executors/seq/1048576 5776408 ns 5776186 ns
* sorting_with_executors/seq/4194154 25323450 ns 2532153 ns
* sorting_with_executors/seq/16777216 109073782 ns 109071515 ns
* sorting_with_executors/seq/67108864 482794615 ns 482777617 ns
* sorting_with_executors/seq/268435456 2548725384 ns 2548695506 ns
* sorting_with_executors/seq_BigO 0.34 NlgN 0.34 NlgN
* sorting_with_executors/seq_RMS 8 % 8 %
*
* As demonstrated, scaling isn't strictly linear, especially for tasks
* that aren't fully data-parallel.
*
* @see "105 STL Algorithms in Less Than an Hour" by Jonathan Boccara at CppCon 2018:
* https://youtu.be/2olsGf6JIkU
* @see "The C++17 Parallel Algorithms Library and Beyond"
* by Bryce Adelstein Lelbach at CppCon 2016: https://youtu.be/Vck6kzWjY88
*/
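The fitted coefficient can be sanity-checked arithmetically against the largest sequential run from the table above — a small check for illustration, not part of the benchmark suite:

```cpp
#include <cmath>

// Compare the fitted model 0.34 * N * lg(N) nanoseconds with the
// measured wall time for the largest run: N = 2^28 at ~2.549 s.
bool model_matches_measurement() {
    double n = 268435456.0;                    // 2^28 elements
    double model_ns = 0.34 * n * std::log2(n); // ~2.556e9 ns
    double measured_ns = 2548725384.0;         // reported above
    return std::fabs(model_ns - measured_ns) / measured_ns < 0.01;
}
```

The model lands within ~0.3% of the measurement, which is why the reported RMS error of 8% across all sizes is a reasonable fit.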
#endif // defined(__cpp_lib_parallel_algorithm) && USE_INTEL_TBB
#if defined(_OPENMP)
/**
* An alternative to "Parallel STL" is to design a custom parallel solution
* using some thread pool or task scheduler. The most common approach is to
* divide the input into blocks that can be processed independently and then
* implement a tree-like parallel aggregation of partial results.
*
* Many thread pools exist, including the underlying Intel's @b oneTBB,
* as well as OpenMP. The latter is not a library, but a broadly adopted
* set of compiler directives, capable of parallelizing loops and sections.
*/
#include <omp.h> // `omp_get_max_threads`
std::size_t round_to_multiple(std::size_t value, std::size_t multiple) {
return ((value + multiple - 1) / multiple) * multiple;
}
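As a minimal illustration of OpenMP's directive style before the full sorting kernel below — a toy reduction, hypothetical and not part of the suite. It also degrades gracefully: without `-fopenmp`, the pragma is ignored and the loop runs serially with an identical result.

```cpp
#include <cstdint>

// Sum 0..n-1 in parallel: each thread accumulates a private partial
// sum, and the `reduction(+ : total)` clause combines them at the end.
std::int64_t parallel_range_sum(std::int64_t n) {
    std::int64_t total = 0;
#pragma omp parallel for reduction(+ : total)
    for (std::int64_t i = 0; i < n; ++i) total += i;
    return total;
}
```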
static void sorting_with_openmp(bm::State &state) {
auto const length = static_cast<std::size_t>(state.range(0));
auto const chunks = static_cast<std::size_t>(omp_get_max_threads());
// Let's round up chunk lengths to presumably 64-byte cache lines.
auto const chunk_length = round_to_multiple(length / chunks, 64 / sizeof(std::uint32_t));
auto const chunk_start_offset = [=](std::size_t i) -> std::size_t {
std::size_t offset = i * chunk_length;
return offset < length ? offset : length;
};
aligned_array<std::uint32_t> array(length);
std::iota(array.begin(), array.end(), 1u);
for (auto _ : state) {
std::reverse(array.begin(), array.end());
//! Remarkably, on Windows, OpenMP can't handle unsigned integers,
//! so we use `std::int64_t` over `std::size_t`.
#pragma omp parallel for
// Sort each chunk in parallel
for (std::int64_t i = 0; i < static_cast<std::int64_t>(chunks); i++) {
std::size_t start = chunk_start_offset(static_cast<std::size_t>(i));
std::size_t finish = chunk_start_offset(static_cast<std::size_t>(i) + 1);
std::sort(array.begin() + start, array.begin() + finish);
}
// Merge the blocks in a tree-like fashion doubling the size of the merged block each time
for (std::size_t merge_step = 1; merge_step < chunks; merge_step *= 2) {
#pragma omp parallel for
for (std::int64_t i = 0; i < static_cast<std::int64_t>(chunks); i += 2 * merge_step) {
std::size_t first_chunk_index = static_cast<std::size_t>(i);
std::size_t second_chunk_index = first_chunk_index + merge_step;
if (second_chunk_index >= chunks) continue; // No merge needed
// We use `inplace_merge` as opposed to `std::merge` to avoid extra memory allocations,
// but it may not be as fast: https://stackoverflow.com/a/21624819/2766161
auto start = chunk_start_offset(first_chunk_index);
auto mid = chunk_start_offset(second_chunk_index);
auto finish = chunk_start_offset(std::min(second_chunk_index + merge_step, chunks));
std::inplace_merge(array.begin() + start, array.begin() + mid, array.begin() + finish);
}
}
}
if (!std::is_sorted(array.begin(), array.end())) state.SkipWithError("Array is not sorted!");
state.SetComplexityN(length);
}
BENCHMARK(sorting_with_openmp)
->RangeMultiplier(4)
->Range(1l << 20, 1l << 28)
->MinTime(10)
->Complexity(bm::oNLogN)
->UseRealTime();
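The chunk-then-merge structure above can be distilled into a serial toy version — a sketch for illustration (hypothetical names, no OpenMP) that sorts fixed-width chunks and then merges pairs, doubling the merged width each pass, just like the tree merge in `sorting_with_openmp`:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Serial skeleton of the parallel sort: sort chunks, then repeatedly
// `std::inplace_merge` adjacent sorted runs of growing width.
std::vector<int> chunked_sort(std::vector<int> values, std::size_t chunk) {
    for (std::size_t start = 0; start < values.size(); start += chunk)
        std::sort(values.begin() + start,
                  values.begin() + std::min(start + chunk, values.size()));
    for (std::size_t width = chunk; width < values.size(); width *= 2)
        for (std::size_t start = 0; start + width < values.size(); start += 2 * width)
            std::inplace_merge(values.begin() + start, values.begin() + start + width,
                               values.begin() + std::min(start + 2 * width, values.size()));
    return values;
}
```

In the parallel version each `for` body of the merge pass is independent, which is exactly what the `#pragma omp parallel for` above exploits.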
#endif // defined(_OPENMP)
/**
* Detecting CUDA availability isn't trivial. NVIDIA's CUDA toolchain defines:
* - __NVCC__: Set only when using the NVCC compiler.
* - __CUDACC__: Set when NVCC compiles CUDA code (host or device).
* - __CUDA_ARCH__: Informs the CUDA version for compiling device (GPU) code.
*
* Since host compilers may not define these macros, we use @b `__has_include`
* to check for `<cuda_runtime.h>` as an indicator that CUDA is available.
*
* @see NVCC Identification Macros docs:
* https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#nvcc-identification-macro
*/
#if !defined(USE_NVIDIA_CCCL)
#if defined(__has_include)
#if __has_include(<cuda_runtime.h>)
#define USE_NVIDIA_CCCL 1
#else
#define USE_NVIDIA_CCCL 0
#endif // __has_include(<cuda_runtime.h>)
#else
#define USE_NVIDIA_CCCL 0
#endif // defined(__has_include)
#endif // !defined(USE_NVIDIA_CCCL)
#if USE_NVIDIA_CCCL
/**
* Unlike STL, Thrust provides some very handy abstractions for sorting
* one array by values in another. We are not going to use them here!
* And we will also avoid instantiating any CUDA @b `<algorithm>`-like
 * templates in this translation unit to better separate host-side and
 * device-side code and keep compilation times sane!
*
* @see Sorting in Thrust: https://nvidia.github.io/cccl/thrust/api_docs/algorithms/sorting
*/
#include <cuda_runtime.h> // `cudaError_t`
#include <thrust/device_vector.h> // `thrust::device_vector`
#include <thrust/host_vector.h> // `thrust::host_vector`
using namespace std::string_literals; // For `""s` literals
/**
* @brief Reverses the array and sorts it with Nvidia's Thrust from CCCL.
* @see Declared in `less_slow.cpp`, but defined in @b `less_slow.cu`!
*/
extern void reverse_and_sort_with_thrust(std::uint32_t *device_pointer, std::size_t array_length);
static void sorting_with_thrust(benchmark::State &state) {
const auto count = static_cast<std::size_t>(state.range(0));
// Typically, the data is first allocated on the "host" CPU side,
// initialized, and then transferred to the "device" GPU memory.
// In our specific case, we could have also used `thrust::sequence`.
thrust::host_vector<std::uint32_t> host_array(count);
std::iota(host_array.begin(), host_array.end(), 1u);
thrust::device_vector<std::uint32_t> device_array = host_array;
for (auto _ : state) {
reverse_and_sort_with_thrust(device_array.data().get(), count);
cudaError_t error = cudaDeviceSynchronize(); //! Block until the GPU has completed all tasks
if (error != cudaSuccess) state.SkipWithError("CUDA error after kernel launch: "s + cudaGetErrorString(error));
benchmark::DoNotOptimize(device_array.data());
}
state.SetComplexityN(count);
state.SetItemsProcessed(count * state.iterations());
state.SetBytesProcessed(count * state.iterations() * sizeof(std::uint32_t));
}
BENCHMARK(sorting_with_thrust)
->RangeMultiplier(4)
->Range(1ll << 20, 1ll << 28)
->MinTime(10)
->Complexity(benchmark::oN) // Not `oNLogN` - it's Radix Sort!
->UseRealTime();
/**
* Thrust, just like STL, is often convenient but not always the fastest.
* It may allocate temporary memory, perform extra copies, or use suboptimal
* algorithms. Thrust's underlying CUB provides more control and a lot of
 * functionality for device-wide, block-wide, and warp-wide operations.
*
* @see CUB docs: https://nvidia.github.io/cccl/cub/modules
*
* Sadly, CUB provides no `reverse` functionality, so we need to combine it
* with a Thrust call, scheduling them on the same job queue. We will also
* pre-allocate temporary memory for CUB's sorting algorithm, and will use
* device-side timers for more accurate measurements.
*/
extern std::size_t reverse_and_sort_with_cub_space(std::uint32_t *device_pointer, std::size_t array_length);
extern void reverse_and_sort_with_cub( //
std::uint32_t *device_pointer, std::size_t array_length, // Task
void *temporary_pointer, std::size_t temporary_length, // Space
cudaStream_t stream); // Order
static void sorting_with_cub(bm::State &state) {
auto count = static_cast<std::size_t>(state.range(0));
thrust::host_vector<std::uint32_t> host_array(count);
std::iota(host_array.begin(), host_array.end(), 1u);
thrust::device_vector<std::uint32_t> device_array = host_array;
// One of the interesting design choices of CUB is that you can call
// the target method with `NULL` arguments to infer the required temporary
// memory amount. The Radix Sort generally requires ~ 2N temporary memory.
//
// Another one is the naming of the operations - `SortKeys` instead of `Sort`.
    // There are also `SortPairs` and `SortPairsDescending` for key-value pairs.
std::size_t temporary_bytes = reverse_and_sort_with_cub_space(device_array.data().get(), count);
// Allocate temporary memory with `cudaMalloc` instead of using a `device_vector`,
// due to potential compilation issues on the host-side compiler.
std::byte *temporary_pointer = nullptr;
cudaError_t error = cudaMalloc(&temporary_pointer, temporary_bytes);
if (error != cudaSuccess) {
state.SkipWithError("Failed to allocate temporary memory: "s + cudaGetErrorString(error));
return;
}
// To schedule the Thrust and CUB operations on the same CUDA stream,
// we need to wrap it into a "policy" object.
cudaStream_t sorting_stream;
cudaStreamCreate(&sorting_stream);
auto policy = thrust::cuda::par.on(sorting_stream);
// Create CUDA events for timing
cudaEvent_t start_event, stop_event;
cudaEventCreate(&start_event);
cudaEventCreate(&stop_event);
for (auto _ : state) {
// Record the start event
cudaEventRecord(start_event, sorting_stream);
// Run the kernels
reverse_and_sort_with_cub( //
device_array.data().get(), count, // Task
temporary_pointer, temporary_bytes, // Space
sorting_stream); // Order
// Record the stop event
cudaEventRecord(stop_event, sorting_stream);
cudaError_t error = cudaEventSynchronize(stop_event); //! Block until the GPU has completed all tasks
if (error != cudaSuccess) state.SkipWithError("CUDA error after kernel launch: "s + cudaGetErrorString(error));
float milliseconds = 0.0f;
cudaEventElapsedTime(&milliseconds, start_event, stop_event);
state.SetIterationTime(milliseconds / 1000.0f);
}
    // ! All of the CUDA calls above and below can fail, so consider checking their error codes
// ! and using a different kernel launch mechanism: https://ashvardanian.com/posts/less-wrong-cuda-hello-world/
cudaEventDestroy(start_event);
cudaEventDestroy(stop_event);
cudaStreamDestroy(sorting_stream);
cudaFree(temporary_pointer);
state.SetComplexityN(count);
state.SetItemsProcessed(count * state.iterations());
state.SetBytesProcessed(count * state.iterations() * sizeof(std::uint32_t));
}
BENCHMARK(sorting_with_cub)
->RangeMultiplier(4)
->Range(1l << 20, 1l << 28)
->MinTime(10)
->Complexity(benchmark::oN) // Not `oNLogN` - it's Radix Sort!
->UseManualTime();
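To make the two-phase CUB convention concrete, here is a hedged sketch of what the companion `less_slow.cu` could define for the two `extern` functions above — an illustration of the pattern, not necessarily the file's actual contents. It assumes CCCL's `cub::DeviceRadixSort` and reserves room for the radix sort's alternate "ping-pong" buffer inside the same temporary allocation:

```cuda
// Sketch only: calling `SortKeys` with a `nullptr` temporary buffer makes CUB
// report the required byte count instead of sorting. CUB's radix sort also
// ping-pongs keys between two buffers (`cub::DoubleBuffer`), so we carve the
// alternate copy out of the tail of the same temporary allocation.
#include <cstddef>                          // `std::size_t`
#include <cstdint>                          // `std::uint32_t`
#include <cub/device/device_radix_sort.cuh> // `cub::DeviceRadixSort`
#include <thrust/execution_policy.h>        // `thrust::cuda::par`
#include <thrust/reverse.h>                 // `thrust::reverse`

std::size_t reverse_and_sort_with_cub_space(std::uint32_t *device_pointer, std::size_t array_length) {
    std::size_t temporary_bytes = 0;
    cub::DoubleBuffer<std::uint32_t> keys(device_pointer, nullptr);
    // With a `nullptr` temporary buffer, this only computes `temporary_bytes`:
    cub::DeviceRadixSort::SortKeys(nullptr, temporary_bytes, keys, array_length);
    return temporary_bytes + array_length * sizeof(std::uint32_t); // + alternate buffer
}

void reverse_and_sort_with_cub(                              //
    std::uint32_t *device_pointer, std::size_t array_length, // Task
    void *temporary_pointer, std::size_t temporary_length,   // Space
    cudaStream_t stream) {                                   // Order
    thrust::reverse(thrust::cuda::par.on(stream), device_pointer, device_pointer + array_length);
    std::size_t const alternate_bytes = array_length * sizeof(std::uint32_t);
    auto *alternate = reinterpret_cast<std::uint32_t *>( //
        static_cast<char *>(temporary_pointer) + temporary_length - alternate_bytes);
    cub::DoubleBuffer<std::uint32_t> keys(device_pointer, alternate);
    std::size_t cub_bytes = temporary_length - alternate_bytes;
    cub::DeviceRadixSort::SortKeys(temporary_pointer, cub_bytes, keys, array_length, 0, 32, stream);
    // The sorted result may land in either buffer - copy back if needed:
    if (keys.Current() != device_pointer)
        cudaMemcpyAsync(device_pointer, keys.Current(), alternate_bytes, cudaMemcpyDeviceToDevice, stream);
}
```

The `DoubleBuffer` overload is what keeps CUB's reported temporary size small: the pointer overload would instead require a caller-provided output array of the same length.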
/**
* Comparing CPU to GPU performance is not straightforward, but we can compare
* Thrust to CUB solutions. On NVIDIA H200 GPU, for the largest 1 GB buffer:
*
* - `sorting_with_thrust` takes @b 6.6 ms
* - `sorting_with_cub` takes @b 6 ms
*
 * Performance improvements of 10% to 50% are typical for CUB.
*/
#endif // USE_NVIDIA_CCCL
#pragma endregion // Parallelism and Computational Complexity
#pragma region Recursion
/**
* The `std::sort` and the underlying Quick-Sort are perfect research subjects
* for benchmarking and understanding how the computer works. Naively
 * implementing the Quick-Sort in C/C++ would still put us at a disadvantage
 * compared to the STL.
*
 * Most implementations we find in textbooks use recursion. Recursion is a
* beautiful concept, but it's not always the best choice for performance. Every
* nested call requires a new stack frame, and the stack is limited. Moreover,
* local variables need to be constructed and destructed, and the CPU needs to
* jump around in memory.
*
 * The alternative, as it often is in computing, is to compensate for the
 * runtime issue with memory. We can use an explicit stack data structure to
 * store the state of the algorithm and process it in a loop.
 *
 * The same ideas commonly appear in tree and graph algorithms.
*/
#include <utility> // `std::swap`
/**
* @brief Quick-Sort helper function for array partitioning, reused by both
* recursive and iterative implementations.
*/
template <typename element_type_>
struct quick_sort_partition {
using element_t = element_type_;
inline std::ptrdiff_t operator()(element_t *arr, std::ptrdiff_t const low, std::ptrdiff_t const high) noexcept {
element_t pivot = arr[high];
std::ptrdiff_t i = low - 1;
for (std::ptrdiff_t j = low; j <= high - 1; j++) {
if (arr[j] > pivot) continue;
i++;
std::swap(arr[i], arr[j]);
}
std::swap(arr[i + 1], arr[high]);
return i + 1;
}
};
/**
* @brief Quick-Sort implementation as a C++ function object, using recursion.
* Note, recursion and @b inlining are not compatible.
*/
template <typename element_type_>
struct quick_sort_recurse {
using element_t = element_type_;
using quick_sort_partition_t = quick_sort_partition<element_t>;
using quick_sort_recurse_t = quick_sort_recurse<element_t>;
void operator()(element_t *arr, std::ptrdiff_t low, std::ptrdiff_t high) noexcept {
if (low >= high) return;
auto pivot = quick_sort_partition_t {}(arr, low, high);
quick_sort_recurse_t {}(arr, low, pivot - 1);
quick_sort_recurse_t {}(arr, pivot + 1, high);
}
};
/**
* @brief Quick-Sort implementation as a C++ function object, with iterative
* deepening using a "stack" data-structure.
*
* Note, this implementation can be inlined, but can't be @b `noexcept`, due to
* a potential memory allocation in the `std::vector::resize` function.
*
* Fun fact: The `std::vector` is actually a better choice for a "stack" than
* the `std::stack`, as the latter builds on top of a `std::deque`, which is
* normally implemented as a sequence of individually allocated fixed-size arrays,
* with additional bookkeeping. In our logic we never need to pop from the middle
* or from the front, so a `std::vector` is a better choice.
*/
#include <vector> // `std::vector`
template <typename element_type_>
struct quick_sort_iterate {
using element_t = element_type_;
using quick_sort_partition_t = quick_sort_partition<element_t>;
std::vector<std::ptrdiff_t> stack;
void operator()(element_t *arr, std::ptrdiff_t low, std::ptrdiff_t high) noexcept(false) {
stack.resize((high - low + 1) * 2);
std::ptrdiff_t top = -1;
stack[++top] = low;
stack[++top] = high;
while (top >= 0) {
high = stack[top--];
low = stack[top--];
auto pivot = quick_sort_partition_t {}(arr, low, high);
// If there are elements on left side of pivot,
// then push left side to stack
if (low < pivot - 1) {
stack[++top] = low;
stack[++top] = pivot - 1;
}
// If there are elements on right side of pivot,
// then push right side to stack
if (pivot + 1 < high) {
stack[++top] = pivot + 1;
stack[++top] = high;
}
}
}
};
template <typename sorter_type_, std::size_t length_> //
static void recursion_cost(bm::State &state) {
using element_t = typename sorter_type_::element_t;
sorter_type_ sorter;
aligned_array<element_t> array(length_);
for (auto _ : state) {
for (std::size_t i = 0; i != length_; ++i) array[i] = length_ - i;
sorter(array.begin(), 0, static_cast<std::ptrdiff_t>(length_ - 1));
}
if (!std::is_sorted(array.begin(), array.end())) state.SkipWithError("Array is not sorted!");
state.SetComplexityN(length_);
}
using recursive_sort_i32s = quick_sort_recurse<std::int32_t>;
using iterative_sort_i32s = quick_sort_iterate<std::int32_t>;
BENCHMARK_TEMPLATE(recursion_cost, recursive_sort_i32s, 16);
BENCHMARK_TEMPLATE(recursion_cost, iterative_sort_i32s, 16);
BENCHMARK_TEMPLATE(recursion_cost, recursive_sort_i32s, 256);
BENCHMARK_TEMPLATE(recursion_cost, iterative_sort_i32s, 256);
BENCHMARK_TEMPLATE(recursion_cost, recursive_sort_i32s, 4096);
BENCHMARK_TEMPLATE(recursion_cost, iterative_sort_i32s, 4096);
/**
* If you try pushing the size further, the program will likely @b crash due
* to @b stack_overflow. The recursive version is limited by the stack size, while
* the iterative version can handle much larger inputs.
*
* As can be seen from our benchmarks, the STL implementation of `std::sort`
* is more efficient than our naive kernels, and it's only one of many expressive
* solutions in the @b <algorithm> header.
*/
#pragma endregion // Recursion
#pragma region Branch Prediction
/**
* The `if` statement and the seemingly innocent ternary operator `x ? a : b`
* can be surprisingly expensive in performance-critical code. This is
* especially noticeable when conditional execution operates at the byte level,
* as in text processing, parsing, searching, compression, encoding, and
* similar tasks.
*
* Modern CPUs have sophisticated branch predictors — some of the most complex
* hardware components in a processor. These predictors memorize recent branch
* patterns, enabling "speculative execution," where the CPU starts processing
* the next task (`i+1`) before fully completing the current one (`i`).
*
* While a single `if` in a hot-path is usually not a problem, real-world
* applications often involve thousands of branches. On most modern CPUs, up
* to @b 4096 branches can be memorized. Beyond that, branch mis-predictions
* occur, causing a severe slowdown due to pipeline stalls.
*
* Consider this example: The same snippet can run at @b 0.7ns per operation
* when branch predictions are accurate but slows down to @b 3.7ns when
* predictions fail.
*/
static void branch_cost(bm::State &state) {
auto count = static_cast<std::size_t>(state.range(0));
aligned_array<std::int32_t> random_values(count);
std::generate_n(random_values.begin(), count, &std::rand);
std::int32_t variable = 0;
std::size_t iteration = 0;
for (auto _ : state) {
std::int32_t random = random_values[(++iteration) & (count - 1)];
bm::DoNotOptimize( //
variable = // ! Fun fact: multiplication compiles to a jump,
(random & 1) // ! but replacing with a bitwise operation results in a conditional move.
? (variable + random)
: (variable ^ random));
}
}
BENCHMARK(branch_cost)->RangeMultiplier(4)->Range(256, 32 * 1024);
/**
* It's hard to reason if the above code should compile into a conditional move or a jump,