0% found this document useful (0 votes)
27 views71 pages

GpuProgrammingWithC andCUDA

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
27 views71 pages

GpuProgrammingWithC andCUDA

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 71

Uncover effective techniques for writing efficient

GPU-parallel C++ applications


Paulo Motta
About the book
Written by Paulo Motta, a senior researcher with decades of experience, this comprehensive
GPU programming book is an essential guide for leveraging the power of parallelism to
accelerate your computations. The first section introduces the concept of parallelism and
provides practical advice on how to think about and utilize it effectively. Starting with a basic
GPU program, you then gain hands-on experience in managing the device. This foundational
knowledge is then expanded by parallelizing the program to illustrate how GPUs enhance
performance.

The second section explores GPU architecture and implementation strategies for parallel
algorithms, and offers practical insights into optimizing resource usage for efficient
execution. In the final section, you will explore advanced topics such as utilizing CUDA
streams. You will also learn how to package and distribute GPU-accelerated libraries for the
Python ecosystem, extending the reach and impact of your work.

Combining expert insight with real-world problem solving, this book is a valuable resource
for developers and researchers aiming to harness the full potential of GPU computing. The
blend of theoretical foundations, practical programming techniques, and advanced
optimization strategies it offers is sure to help you succeed in the fast-evolving field of GPU
programming.

Key Learnings
 Manage GPU devices and accelerate your applications
 Apply parallelism effectively using CUDA and C++
 Choose between existing libraries and custom GPU solutions
 Package GPU code into libraries for use with Python
 Explore advanced topics such as CUDA streams
 Implement optimization strategies for resource-efficient execution

Chapters
1. Introduction to Parallel Programming
2. Setting Up Your Development Environment
3. Hello CUDA
4. Hello Again, but in Parallel
5. A Closer Look into the World of GPUs
6. Parallel Algorithms with CUDA
7. Performance Strategies
8. Overlaying Multiple Operations
9. Exposing Your Code to Python
10. Exploring Existing GPU Models
Requirements for this book
You should be comfortable writing computer programs in C++, and basic knowledge of
operating systems will help to understand some of the more advanced concepts, given that we
have to manage device communication.

Software / hardware covered in the


Operating system requirements
book

NVIDIA GPU or access to a Cloud-based Ubuntu Linux 20 or later with


VM with NVIDIA GPU NVIDIA Video Driver

CUDA Toolkit 12

Docker 27.0

VS Code 1.92

CMake 3.16

g++ 9.4

Python 3.8

Nsight Compute 2023.3

In Chapter 2, we discuss options for configuring the development environment. Some of the
software that we need is installed automatically if you elect to use the Docker-based
development environment.

Code conventions
We are using the following convetions:

1. camelCase for names of functions, kernels and variables


2. CamelCase with uppercase letter at beginning for structs
3. snake_case for names of files
4. When using two functions to perform the same computation on the CPU and
GPU we use the same name with a Cpu/Kernel suffix like:
computeSomethingCpu / computeSomethingKernel
5. When we need to allocate buffers with similar names, we are using h_ preffix
for host side and d_for device side.
o float* h_A;
o float* d_A;
6. When comparing results we add a suffix to the name of the variable _CPU or
_GPU.
o float* h_C_GPU; // this is the C array or matrix calculated on the GPU
and copied back to the host
o float* h_C_CPU; // this is the C array or matrix calculated on the CPU

Folder names in RED colour

Files Names in Green colour


Chapter 2

27-08-2025 12:10 277 Dockerfile

27-08-2025 12:10 351 readme.md

readme.md
## The original file

As we evolve on the book we changed the Dockerfile that is under .devcontainer, but we kept here
the original one for reference.

The change was that originally, for simplicity purposes, we used an NVidia based docker image. Later
we changed to an Ubuntu based image in which we installed the CUDA Toolkit for added
functionality.

Dockerfile

FROM docker.io/nvidia/cuda:12.0.0-devel-ubuntu20.04

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && \

apt-get install -y \

xz-utils \

build-essential \

libssl-dev \

git \

cmake \

&& \

apt-get clean && \

rm -rf /var/lib/apt/lists/*
Chapter 3
27-08-2025 12:10 753 device_query.cu

27-08-2025 12:10 254 hello_world.cu

27-08-2025 12:10 660 readme.md

readme.md
# Hello World
Here is the command to compile from the terminal:
nvcc hello_world.cu -o hello_world -Xcompiler "-Wall" -lcudadevrt -lcudart_static -lstdc++

or if you are using the docker terminal:


docker run -it --rm --gpus all -v ./:/code nvidia/cuda:12.0.0-devel-ubuntu20.04 bash
cd code
nvcc -o hello_world hello_world.cu

## Device Query
Here is the command to compile from the terminal:
nvcc -o device_query device_query.cu -Xcompiler "-Wall" -lcudadevrt -lcudart_static -lstdc++

or if you are using the docker terminal:


docker run -it --rm --gpus all -v ./:/code nvidia/cuda:12.0.0-devel-ubuntu20.04 bash
cd code
nvcc -o device_query device_query.cu

hello_world.cu

#include <iostream>
__global__ void helloWorld() {
int tid = threadIdx.x + blockIdx.x * blockDim.x;
printf("Hello, World! Thread %d\n", tid);
}
int main() {
helloWorld<<<1, 10>>>();
cudaDeviceSynchronize();
return 0;
}
device_query.cu
#include <cuda_runtime.h>
#include <iostream>

int main() {

int deviceCount = 0;
cudaGetDeviceCount(&deviceCount);

std::cout << "Number of devices " << deviceCount << std::endl;

cudaDeviceProp deviceProp;
int dev = 0;
cudaGetDeviceProperties(&deviceProp, dev);

std::cout << "Device " << dev << ": " << deviceProp.name << std::endl;

std::cout << " CUDA Capability Major/Minor version number: " << deviceProp.major << "." <<
deviceProp.minor << std::endl;

std::cout << " Total amount of shared memory per block: " << deviceProp.sharedMemPerBlock <<
" bytes" << std::endl;

std::cout << " Maximum number of threads per block: " << deviceProp.maxThreadsPerBlock <<
std::endl;

return 0;

Chapter 4
27-08-2025 12:10 <DIR> 1_primes
27-08-2025 12:10 <DIR> 2_vector_add
27-08-2025 12:10 <DIR> 3_euclidean_distance
27-08-2025 12:10 256 readme.md

readme
# Introduction
<p>All our examples are numbered here to follow the order in which they appear on the book
chapter.</p>
To build the examples we may simply:
1. create a `build` folder inside of the corresponding code folder

2. run cmake ..

3. run make
1_primes
27-08-2025 12:10 95 CMakeLists.txt

27-08-2025 12:10 2,347 primes.cu

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(primes_gpu CUDA)

add_executable(primes primes.cu)

primes.cu
#include <cuda_runtime.h>
#include <iostream>
#include <chrono>

__global__ void checkPrimeKernel(long long start, long long end) {


int tid = threadIdx.x + blockIdx.x * blockDim.x;
long long num = start + (tid * 2);
bool isPrime = true;
if (num <= 1) {
isPrime = false;
return;
}
if (num == 2) {
isPrime = true;
return;
}
if (num % 2 == 0) {
isPrime = false;
return;
}
if (num > end) return;

for (long long i = 3; i * i <= num; i += 2) {


if (num % i == 0) {
isPrime = false;
break;
}
}

/*
* for study purposes we can print the verification of each number
*/
//printf("tid=%d %lld is prime? %d\n", tid, num, isPrime);
}
bool checkPrimeCpu(long long num) {

if (num <= 1) return false;


if (num == 2) return true;
if (num % 2 == 0) return false;
for (long long i = 3; i * i <= num; i += 2) {
if (num % i == 0) {
return false;
}
}
return true;
}

int main() {
long long start = 100'001LL; // must start with odd
long long end = 190'001LL;

int threadsPerBlock = 256;


int totalNumbers = (end - start) / 2 + 1;
int blocksPerGrid = (totalNumbers + threadsPerBlock - 1) / threadsPerBlock;

cudaEvent_t startEvent, stopEvent;


cudaEventCreate(&startEvent);
cudaEventCreate(&stopEvent);

cudaEventRecord(startEvent, 0);

checkPrimeKernel<<<blocksPerGrid, threadsPerBlock>>>(start, end);

cudaEventRecord(stopEvent, 0);
cudaEventSynchronize(stopEvent);

float gpuDuration = 0;
cudaEventElapsedTime(&gpuDuration, startEvent, stopEvent);

std::cout << "Time taken on GPU: " << gpuDuration << " ms" << std::endl;

cudaEventDestroy(startEvent);
cudaEventDestroy(stopEvent);

auto startTime = std::chrono::high_resolution_clock::now();

for (long long num = start; num <= end; num += 2) {


checkPrimeCpu(num);
}

auto endTime = std::chrono::high_resolution_clock::now();


std::chrono::duration<double, std::milli> cpuDuration = endTime - startTime;
std::cout << "Time taken on CPU: " << std::fixed << cpuDuration.count() << " ms" << std::endl;
std::cout << "speed up : " << cpuDuration.count() / gpuDuration << std::endl;

return 0;
}

2_vector_add
27-08-2025 12:10 103 CMakeLists.txt

27-08-2025 12:10 3,199 vector_add.cu

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(vector_add CUDA)
add_executable(vector_add vector_add.cu)

vector_add.cu
#include <iostream>
#include <cuda_runtime.h>
#include <chrono>

__global__ void vectorAddKernel(float *A, float *B, float *C, int N) {


int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < N) {
C[i] = A[i] + B[i];
}
}

void vectorAddCpu(float *A, float *B, float *C, int N) {


for (int i = 0; i < N; i++) {
C[i] = A[i] + B[i];
}
}

int main() {
int N = 100'000'000;
size_t size = N * sizeof(float);

float *h_A = (float *)malloc(size);


float *h_B = (float *)malloc(size);
float *h_C = (float *)malloc(size);

for (int i = 0; i < N; ++i) {


h_A[i] = static_cast<float>(i);
h_B[i] = static_cast<float>(i * 2);
}

float *d_A;
float *d_B;
float *d_C;
cudaMalloc((void **)&d_A, size);
cudaMalloc((void **)&d_B, size);
cudaMalloc((void **)&d_C, size);

int threadsPerBlock = 1024;


int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

cudaEvent_t startEvent, stopEvent;


cudaEventCreate(&startEvent);
cudaEventCreate(&stopEvent);

cudaEventRecord(startEvent, 0);
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

cudaEventRecord(stopEvent, 0);
cudaEventSynchronize(stopEvent);

float gpuCopyTime = 0;
cudaEventElapsedTime(&gpuCopyTime, startEvent, stopEvent);

std::cout<< std::fixed << "Time to copy data to GPU: " << gpuCopyTime << " ms" << std::endl;

cudaEventRecord(startEvent, 0);

vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

cudaEventRecord(stopEvent, 0);
cudaEventSynchronize(stopEvent);

float gpuExecutionTime = 0;
cudaEventElapsedTime(&gpuExecutionTime, startEvent, stopEvent);

std::cout<< std::fixed << "Time to execute on GPU: " << gpuExecutionTime << " ms" << std::endl;

cudaEventRecord(startEvent, 0);

cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

cudaEventRecord(stopEvent, 0);
cudaEventSynchronize(stopEvent);

float gpuRetrieveTime = 0;
cudaEventElapsedTime(&gpuRetrieveTime, startEvent, stopEvent);
std::cout<< std::fixed << "Time taken to copy results back GPU: " << gpuRetrieveTime << " ms" <<
std::endl << std::endl;

float gpuDuration = (gpuCopyTime + gpuExecutionTime + gpuRetrieveTime);


std::cout << "Time taken by GPU: " << gpuDuration << " ms" << std::endl;

cudaEventDestroy(startEvent);
cudaEventDestroy(stopEvent);

auto start = std::chrono::high_resolution_clock::now();

vectorAddCpu(h_A, h_B, h_C, N);

auto stop = std::chrono::high_resolution_clock::now();


std::chrono::duration<double, std::milli> cpuDuration = (stop - start);

std::cout << "Time taken by CPU: " << cpuDuration.count() << " ms" << std::endl;
std::cout << "========================================== " << std::endl;

std::cout << "speed up (execution time only): " << cpuDuration.count() / gpuExecutionTime <<
std::endl;
std::cout << "speed up (GPU total time): " << cpuDuration.count() / gpuDuration << std::endl;

cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
free(h_A);
free(h_B);
free(h_C);

return 0;
}

3_euclidean_distance
27-08-2025 12:10 127 CMakeLists.txt

27-08-2025 12:10 2,267 euclidean_distance.cu

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(euclidean_distance CUDA)

add_executable(euclidean_distance euclidean_distance.cu)
euclidean_distance.cu
#include <cuda_runtime.h>
#include <iostream>

struct Point {
float x;
float z;
float y;
};

__global__ void calculateEuclideanDistanceKernel(Point *lineA, Point *lineB, float *distances, int


numPoints) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;

if (idx < numPoints) {


float dx = lineA[idx].x - lineB[idx].x;
float dy = lineA[idx].y - lineB[idx].y;
float dz = lineA[idx].z - lineB[idx].z;

distances[idx] = sqrtf(dx * dx + dy * dy + dz * dz);

}
}

int main() {
int numPoints = 10'000'000;
size_t sizePoints = numPoints * sizeof(Point);
size_t sizeDistances = numPoints * sizeof(float);

Point *h_lineA = (Point *)malloc(sizePoints);


Point *h_lineB = (Point *)malloc(sizePoints);
float *h_distances = (float *)malloc(sizeDistances);

for (int i = 0; i < numPoints; i++) {


h_lineA[i].x = i * 1.0f;
h_lineA[i].y = i * 2.0f;
h_lineA[i].z = i * 3.0f;
h_lineB[i].x = i * 0.5f;
h_lineB[i].y = i * 1.5f;
h_lineB[i].z = i * 2.5f;
}

Point *d_lineA;
Point *d_lineB;
float *d_distances;
cudaMalloc((void **)&d_lineA, sizePoints);
cudaMalloc((void **)&d_lineB, sizePoints);
cudaMalloc((void **)&d_distances, sizeDistances);

cudaMemcpy(d_lineA, h_lineA, sizePoints, cudaMemcpyHostToDevice);


cudaMemcpy(d_lineB, h_lineB, sizePoints, cudaMemcpyHostToDevice);
int blockSize = 1024;
int gridSize = (numPoints + blockSize - 1) / blockSize;

cudaEvent_t start, stop;


cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);

calculateEuclideanDistanceKernel<<<gridSize, blockSize>>>(d_lineA, d_lineB, d_distances,


numPoints);

cudaEventRecord(stop);

cudaEventSynchronize(stop);

float gpuDuration = 0;
cudaEventElapsedTime(&gpuDuration, start, stop);

std::cout<< std::fixed << "Time taken: " << gpuDuration << " ms" << std::endl;

cudaMemcpy(h_distances, d_distances, sizeDistances, cudaMemcpyDeviceToHost);

cudaFree(d_lineA);
cudaFree(d_lineB);
cudaFree(d_distances);

free(h_lineA);
free(h_lineB);
free(h_distances);

cudaEventDestroy(start);
cudaEventDestroy(stop);

return 0;
}

Chapter 6
27-08-2025 12:10 <DIR> 1_matrix_add

27-08-2025 12:10 <DIR> 2_matrix_multiply

27-08-2025 12:10 <DIR> 3_trapezoidal_travel

27-08-2025 12:10 <DIR> 4_sort_array

27-08-2025 12:10 <DIR> 5_weighted_moving_average

27-08-2025 12:10 256 readme.md


readme.md
# Introduction

<p>All our examples are numbered here to follow the order in which they appear on the book
chapter.</p>

To build the examples we may simply:


1. create a `build` folder inside of the corresponding code folder
2. run cmake ..
3. run make

1_matrix_add
27-08-2025 12:10 103 CMakeLists.txt

27-08-2025 12:10 2,972 matrix_add.cu

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(matrix_add CUDA)

add_executable(matrix_add matrix_add.cu)

matrix_add.cu
#include <iostream>
#include <chrono>
#include <cuda_runtime.h>
#include <random>

#define N 7000

__global__ void matrixAddKernel(const double* A, const double* B, double* C, int width) {


int row = threadIdx.y + blockIdx.y * blockDim.y;
int col = threadIdx.x + blockIdx.x * blockDim.x;

if (row < width && col < width) {


int idx = row * width + col;
C[idx] = A[idx] + B[idx];
}
}

void matrixAddCpu(const double* A, const double* B, double* C, int width) {


for (int i = 0; i < width; i++) {
for (int j = 0; j < width; j++) {
C[i * width + j] = A[i * width + j] + B[i * width + j];
}
}
}

void initializeMatrix(double *matrix, int size) {


for (int i = 0; i < size; i++) {
matrix[i] = static_cast<double>(rand()) / RAND_MAX;
}
}

bool checkResults(double *A, double *B, int size) {


for (int i = 0; i < size; i++) {
if (fabs(A[i] - B[i]) > 1e-10) {
return false;
}
}
return true;
}

int main() {

srand(static_cast<unsigned int>(time(0)));

int matrixSize = N * N * sizeof(double);

double *h_A = (double*)malloc(matrixSize);


double *h_B = (double*)malloc(matrixSize);
double *h_C_CPU = (double*)malloc(matrixSize);
double *h_C_GPU = (double*)malloc(matrixSize);

initializeMatrix(h_A, N * N);
initializeMatrix(h_B, N * N);

double *d_A;
double *d_B;
double *d_C;
cudaMalloc((void**)&d_A, matrixSize);
cudaMalloc((void**)&d_B, matrixSize);
cudaMalloc((void**)&d_C, matrixSize);

auto startTime = std::chrono::high_resolution_clock::now();


matrixAddCpu(h_A, h_B, h_C_CPU, N);
auto endTime = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> cpuDuration = endTime - startTime;

std::cout << "Time taken by CPU: " << cpuDuration.count() << " ms" << std::endl;

cudaMemcpy(d_A, h_A, matrixSize, cudaMemcpyHostToDevice);


cudaMemcpy(d_B, h_B, matrixSize, cudaMemcpyHostToDevice);
dim3 blockDim(16, 16);
dim3 gridDim((N + blockDim.x - 1) / blockDim.x, (N + blockDim.y - 1) / blockDim.y);

cudaEvent_t start, stop;


cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);

matrixAddKernel<<<gridDim, blockDim>>>(d_A, d_B, d_C, N);

cudaEventRecord(stop);
cudaEventSynchronize(stop);

float gpuDuration = 0;
cudaEventElapsedTime(&gpuDuration, start, stop);

std::cout << "Time taken by GPU: " << gpuDuration << " ms\n";

cudaMemcpy(h_C_GPU, d_C, matrixSize, cudaMemcpyDeviceToHost);

if (checkResults(h_C_CPU, h_C_GPU, N * N)) {


std::cout << "Results match!" << std::endl;
} else {
std::cout << "Results do not match!" << std::endl;
}

free(h_A);
free(h_B);
free(h_C_CPU);
free(h_C_GPU);
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
cudaEventDestroy(start);
cudaEventDestroy(stop);

return 0;
}

2_matrix_multiply

27-08-2025 12:10 118 CMakeLists.txt

27-08-2025 12:10 4,046 matrix_multiply.cu


CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(matrix_multiply CUDA)

add_executable(matrix_multiply matrix_multiply.cu)

matrix_multiply.cu
#include <iostream>
#include <iomanip>
#include <cuda_runtime.h>
#include <chrono>

#define N 2000

__global__ void matrixMulKernel(double *A, double *B, double *C, int width) {
int row = threadIdx.y + blockIdx.y * blockDim.y;
int col = threadIdx.x + blockIdx.x * blockDim.x;

if (row < width && col < width) {


double value = 0;
for (int k = 0; k < width; k++) {
value += A[row * width + k] * B[k * width + col];
}
C[row * width + col] = value;
}
}

void matrixMulCpu(double *A, double *B, double *C, int width) {


for (int row = 0; row < width; row++) {
for (int col = 0; col < width; col++) {
double value = 0;
for (int k = 0; k < width; k++) {
value += A[row * width + k] * B[k * width + col];
}
C[row * width + col] = value;
}
}
}

void initializeMatrix(double *matrix, int size) {


for (int i = 0; i < size; i++) {
matrix[i] = static_cast<double>(rand()) / RAND_MAX;
}
}

bool checkResults(double *A, double *B, int size) {


for (int i = 0; i < size; i++) {
if (fabs(A[i] - B[i]) > 1e-10) {
return false;
}
}
return true;
}

int main() {
srand(static_cast<unsigned int>(time(0)));

int matrixSize = N * N * sizeof(double);

double *h_A = (double*)malloc(matrixSize);


double *h_B = (double*)malloc(matrixSize);
double *h_C_CPU = (double*)malloc(matrixSize);
double *h_C_GPU = (double*)malloc(matrixSize);

initializeMatrix(h_A, N * N);
initializeMatrix(h_B, N * N);

auto startCpu = std::chrono::high_resolution_clock::now();


matrixMulCpu(h_A, h_B, h_C_CPU, N);
auto stopCpu = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> cpuDuration = (stopCpu - startCpu);

std::cout << "Time taken by CPU: " << cpuDuration.count() << " ms" << std::endl;

double *d_A;
double *d_B;
double *d_C;
cudaMalloc((void**)&d_A, matrixSize);
cudaMalloc((void**)&d_B, matrixSize);
cudaMalloc((void**)&d_C, matrixSize);

cudaEvent_t start, stop;


cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);

cudaMemcpy(d_A, h_A, matrixSize, cudaMemcpyHostToDevice);


cudaMemcpy(d_B, h_B, matrixSize, cudaMemcpyHostToDevice);

cudaEventRecord(stop);
cudaEventSynchronize(stop);
float gpuTimeToCopy = 0;
cudaEventElapsedTime(&gpuTimeToCopy, start, stop);
std::cout << "GPU memory copy time: " << gpuTimeToCopy << " ms" << std::endl;

cudaEventRecord(start);
dim3 blockDim(16, 16);
dim3 gridDim((N + blockDim.x - 1) / blockDim.x, (N + blockDim.y - 1) / blockDim.y);

matrixMulKernel<<<gridDim, blockDim>>>(d_A, d_B, d_C, N);

cudaEventRecord(stop);

cudaEventSynchronize(stop);
float gpuDuration = 0;
cudaEventElapsedTime(&gpuDuration, start, stop);
std::cout << "GPU execution time: " << gpuDuration << " ms" << std::endl;

cudaEventRecord(start);

cudaMemcpy(h_C_GPU, d_C, matrixSize, cudaMemcpyDeviceToHost);

cudaEventRecord(stop);

cudaEventSynchronize(stop);
float gpuTimeToRetrieve = 0;
cudaEventElapsedTime(&gpuTimeToRetrieve, start, stop);
std::cout << "GPU memory retrieve time: " << gpuTimeToRetrieve << " ms" << std::endl;
std::cout << "GPU total time: " << (gpuDuration + gpuTimeToCopy + gpuTimeToRetrieve) << " ms"
<< std::endl;

double cpuTotalTime = cpuDuration.count();


double gpuTotalTime = (gpuDuration + gpuTimeToCopy + gpuTimeToRetrieve);
std::cout << "speed up: " << cpuTotalTime / gpuTotalTime << std::endl;

if (checkResults(h_C_CPU, h_C_GPU, N * N)) {


std::cout << "Results match!" << std::endl;
} else {
std::cout << "Results do not match!" << std::endl;
}

free(h_A);
free(h_B);
free(h_C_CPU);
free(h_C_GPU);
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
cudaEventDestroy(start);
cudaEventDestroy(stop);

return 0;
}
3_trapezoidal_travel
27-08-2025 12:10 218 CMakeLists.txt

27-08-2025 12:10 2,749 trapezoidal_travel_basic.cu

27-08-2025 12:10 3,248 trapezoidal_travel_gpu_reduce.cu

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(trapezoidal_travel CUDA)

add_executable(trapezoidal_travel_basic trapezoidal_travel_basic.cu)
add_executable(trapezoidal_travel_gpu_reduce trapezoidal_travel_gpu_reduce.cu)

matrix_multiply.cu
#include <iostream>
#include <chrono>
#include <cuda_runtime.h>
#include <cstdlib>

#define N 10'000'000 // Increased number of speed measurements for GPU efficiency


#define T 0.25 // Time interval between measurements (15 minutes)

__global__ void trapezoidalKernel(double *speeds, double *distances, int n) {


int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < n - 1) {
distances[i] = 0.5 * (speeds[i] + speeds[i + 1]) * T;
}
}

double trapezoidalCpu(double *speeds, int n) {


double totalDistance = 0.0;
for (int i = 0; i < n - 1; i++) {
totalDistance += 0.5 * (speeds[i] + speeds[i + 1]) * T;
}
return totalDistance;
}

void generateRandomSpeeds(double *speeds, int n) {


for (int i = 0; i < n; i++) {
speeds[i] = 40.0 + static_cast<double>(rand()) / (static_cast<double>(RAND_MAX / (80.0 -
40.0)));
}
}

int main() {
srand(static_cast<unsigned int>(time(0)));

double *h_speeds = (double*)malloc(N * sizeof(double));


double *h_distances = (double*)malloc( (N - 1) * sizeof(double));

generateRandomSpeeds(h_speeds, N);

auto startCPU = std::chrono::high_resolution_clock::now();


double cpuResult = trapezoidalCpu(h_speeds, N);
auto endCPU = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> cpuDuration = endCPU - startCPU;
std::cout << "CPU Distance: " << cpuResult << " km" << std::endl;
std::cout << "CPU Execution Time: " << cpuDuration.count() << " ms" << std::endl;

double *d_speeds, *d_distances;


cudaMalloc(&d_speeds, N * sizeof(double));
cudaMalloc(&d_distances, (N - 1) * sizeof(double));

int blockSize = 256;


int gridSize = (N + blockSize - 1) / blockSize;

cudaEvent_t startGPU, endGPU;


cudaEventCreate(&startGPU);
cudaEventCreate(&endGPU);
cudaEventRecord(startGPU);

cudaMemcpy(d_speeds, h_speeds, N * sizeof(double), cudaMemcpyHostToDevice);

trapezoidalKernel<<<gridSize, blockSize>>>(d_speeds, d_distances, N);

cudaDeviceSynchronize();

cudaMemcpy(h_distances, d_distances, (N - 1) * sizeof(double), cudaMemcpyDeviceToHost);

double gpuResult = 0.0;


for (int i = 0; i < N - 1; i++) {
gpuResult += h_distances[i];
}
cudaEventRecord(endGPU);
cudaEventSynchronize(endGPU);
float gpuDuration = 0.0;
cudaEventElapsedTime(&gpuDuration, startGPU, endGPU);

std::cout << "GPU Distance: " << gpuResult << " km" << std::endl;
std::cout << "GPU Execution Time: " << gpuDuration << " ms" << std::endl;

free(h_speeds);
free(h_distances);
cudaFree(d_speeds);
cudaFree(d_distances);
cudaEventDestroy(startGPU);
cudaEventDestroy(endGPU);
return 0;
}
trapezoidal_travel_gpu_reduce.cu
#include <iostream>
#include <chrono>
#include <cuda_runtime.h>
#include <cstdlib>

#define N 10'000'000 // Increased number of speed measurements for GPU efficiency


#define T 0.25 // Time interval between measurements (15 minutes)

__global__ void trapezoidalKernel(double *speeds, double *result, int n) {


extern __shared__ double sharedData[];
int tid = threadIdx.x;
int i = threadIdx.x + blockIdx.x * blockDim.x;

double localDistance = 0.0;


if (i < n - 1) {
localDistance = 0.5 * (speeds[i] + speeds[i + 1]) * T;
}
sharedData[tid] = localDistance;
__syncthreads();

for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {


if (tid < stride) {
sharedData[tid] += sharedData[tid + stride];
}
__syncthreads();
}

if (tid == 0) {
result[blockIdx.x] = sharedData[0];
}
}

double trapezoidalCpu(double *speeds, int n) {


double totalDistance = 0.0;
for (int i = 0; i < n - 1; i++) {
totalDistance += 0.5 * (speeds[i] + speeds[i + 1]) * T;
}
return totalDistance;
}

void generateRandomSpeeds(double *speeds, int n) {


for (int i = 0; i < n; i++) {
speeds[i] = 40.0 + static_cast<double>(rand()) / (static_cast<double>(RAND_MAX / (80.0 -
40.0)));
}
}

int main() {
srand(static_cast<unsigned int>(time(0)));
double *h_speeds = (double*)malloc(N * sizeof(double));
generateRandomSpeeds(h_speeds, N);

auto startCPU = std::chrono::high_resolution_clock::now();


double cpuResult = trapezoidalCpu(h_speeds, N);
auto endCPU = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> cpuDuration = endCPU - startCPU;
std::cout << "CPU Distance: " << cpuResult << " km" << std::endl;
std::cout << "CPU Execution Time: " << cpuDuration.count() << " ms" << std::endl;

int blockSize = 256;


int gridSize = (N + blockSize - 1) / blockSize;

double *d_speeds, *d_partialResults;


cudaMalloc(&d_speeds, N * sizeof(double));
cudaMalloc(&d_partialResults, gridSize * sizeof(double));

cudaMemcpy(d_speeds, h_speeds, N * sizeof(double), cudaMemcpyHostToDevice);

cudaEvent_t startGPU, endGPU;


cudaEventCreate(&startGPU);
cudaEventCreate(&endGPU);
cudaEventRecord(startGPU);

trapezoidalKernel<<<gridSize, blockSize, blockSize * sizeof(double)>>>(d_speeds, d_partialResults,


N);
cudaDeviceSynchronize();

double *h_partialResults = (double*)malloc(gridSize * sizeof(double));


cudaMemcpy(h_partialResults, d_partialResults, gridSize * sizeof(double),
cudaMemcpyDeviceToHost);

double gpuResult = 0.0;


for (int i = 0; i < gridSize; i++) {
gpuResult += h_partialResults[i];
}

cudaEventRecord(endGPU);
cudaEventSynchronize(endGPU);

float gpuDuration = 0.0;


cudaEventElapsedTime(&gpuDuration, startGPU, endGPU);

std::cout << "GPU Distance: " << gpuResult << " km" << std::endl;
std::cout << "GPU Execution Time: " << gpuDuration << " ms" << std::endl;

free(h_speeds);
free(h_partialResults);
cudaFree(d_speeds);
cudaFree(d_partialResults);
cudaEventDestroy(startGPU);
cudaEventDestroy(endGPU);

return 0;

4_sort_array
27-08-2025 12:10 103 CMakeLists.txt

27-08-2025 12:10 3,742 sort_array.cu

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(sort_array CUDA)

add_executable(sort_array sort_array.cu)

sort_array.cu
#include <iostream>
#include <vector>
#include <chrono>
#include <algorithm>
#include <cfloat>
#include <iomanip>
#include <cuda_runtime.h>

__global__ void oddEvenSortStepKernel(double *arr, int size, bool *swapped, bool isOddPhase) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int i = isOddPhase ? 2 * idx + 1 : 2 * idx;

if (i < size - 1) {
if (arr[i] > arr[i + 1]) {
double temp = arr[i];
arr[i] = arr[i + 1];
arr[i + 1] = temp;
*swapped = true;
}
}
}

void oddEvenSortGpu(double *arr, int size) {


double *d_arr;
bool *d_swapped;
bool h_swapped;
int threads = 256;
int blocks = (size + threads - 1) / threads;

cudaMalloc((void **)&d_arr, size * sizeof(double));


cudaMalloc((void **)&d_swapped, sizeof(bool));

cudaMemcpy(d_arr, arr, size * sizeof(double), cudaMemcpyHostToDevice);

do {
h_swapped = false;
cudaMemcpy(d_swapped, &h_swapped, sizeof(bool), cudaMemcpyHostToDevice);

// Odd phase
oddEvenSortStepKernel<<<blocks, threads>>>(d_arr, size, d_swapped, true);
cudaDeviceSynchronize();

// Even phase
oddEvenSortStepKernel<<<blocks, threads>>>(d_arr, size, d_swapped, false);
cudaDeviceSynchronize();

// Check if any swaps occurred


cudaMemcpy(&h_swapped, d_swapped, sizeof(bool), cudaMemcpyDeviceToHost);

} while (h_swapped);

cudaMemcpy(arr, d_arr, size * sizeof(double), cudaMemcpyDeviceToHost);

cudaFree(d_arr);
cudaFree(d_swapped);
}

void oddEvenSortCpu(double *arr, int size) {


bool swapped;

do {
swapped = false;

// Odd phase
for (int i = 1; i < size - 1; i += 2) {
if (arr[i] > arr[i + 1]) {
std::swap(arr[i], arr[i + 1]);
swapped = true;
}
}

// Even phase
for (int i = 0; i < size - 1; i += 2) {
if (arr[i] > arr[i + 1]) {
std::swap(arr[i], arr[i + 1]);
swapped = true;
}
}
} while (swapped);
}

int main() {
srand(static_cast<unsigned int>(time(0)));

int n = 100'000;
double *h_data = (double*)malloc(n * sizeof(double));
double *h_data_gpu = (double*)malloc(n * sizeof(double));

for (int i = 0; i < n; i++) {


h_data[i] = static_cast<double>(rand()) / RAND_MAX;
}

double* d_data;
cudaMalloc(&d_data, n * sizeof(double));
cudaMemcpy(d_data, h_data, n * sizeof(double), cudaMemcpyHostToDevice);

cudaEvent_t startGpu, stopGpu;


cudaEventCreate(&startGpu);
cudaEventCreate(&stopGpu);

cudaEventRecord(startGpu);

oddEvenSortGpu(d_data, n);

cudaEventRecord(stopGpu);

cudaEventSynchronize(stopGpu);
float gpuDuration;
cudaEventElapsedTime(&gpuDuration, startGpu, stopGpu);
std::cout << "GPU sorting time: " << gpuDuration << " ms" << std::endl;

cudaMemcpy(h_data_gpu, d_data, n * sizeof(double), cudaMemcpyDeviceToHost);

auto start = std::chrono::high_resolution_clock::now();


oddEvenSortCpu(h_data, n);
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> cpuDuration = end - start;
std::cout << "CPU sorting time: " << cpuDuration.count() << " ms" << std::endl;

double max_difference = 0;
for (int i = 0; i < n; i++) {
max_difference = std::max(max_difference, std::abs(h_data[i] - h_data_gpu[i]));
}
std::cout << "Max difference between CPU and GPU results: " << max_difference << std::endl;

free(h_data);
free(h_data_gpu);
cudaFree(d_data);
cudaEventDestroy(startGpu);
cudaEventDestroy(stopGpu);

return 0;
}

5_weighted_moving_average

27-08-2025 12:10 142 CMakeLists.txt

27-08-2025 12:10 5,657 weighted_moving_average.cu

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(weighted_moving_average CUDA)

add_executable(weighted_moving_average weighted_moving_average.cu)

weighted_moving_average.cu

#include <iostream>
#include <cuda_runtime.h>
#include <vector>
#include <cstdlib>
#include <chrono>

#define NUM_SENSORS 50'000'000


#define NUM_READINGS 3

void rotateIndices(int *indices) {


int newData = indices[0];
indices[0] = indices[1];
indices[1] = indices[2];
indices[2] = newData;
}

__global__ void smoothSensorsKernel(float *buffers, int *indices, float *output, float *weights) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx >= NUM_SENSORS) return;

float val = 0.0f;


for (int i=0; i < NUM_READINGS; i++){
val = weights[i] * buffers[indices[i] * NUM_READINGS + idx];
}

output[idx] = val;
}

void smoothSensorsCpu(float *buffers, int *indices, float *output, float *weights){


for (int i = 0; i < NUM_READINGS; i++) {
for (int j = 0; j < NUM_SENSORS; j++){
output[j] = weights[i] * buffers[indices[i] * NUM_READINGS + j];
}
}
}

bool checkResults(float *A, float *B, int size) {


for (int i = 0; i < size; i++) {
if (fabs(A[i] - B[i]) > 1e-10) {
std::cerr << "Mismatch at index " << i <<
" CPU: " << A[i] <<
" GPU: " << B[i] <<
" diff=" << abs(A[i] - B[i]) << std::endl;
return false;
}
}
return true;
}

void initializeBuffer(float *h_buffers, int bufferId)


{
for (int j = 0; j < NUM_SENSORS; j++) {
h_buffers[bufferId + j] = 1.0f + ((rand() % 100) / 1000.0f);
}
}

int main() {
srand(static_cast<unsigned int>(time(0)));

int bufferSize = NUM_SENSORS * sizeof(float);

float h_weights[NUM_READINGS] = {0.2f, 0.3f, 0.5f};


int h_indices[NUM_READINGS] = {0, 1, 2};
float *h_buffers = (float*) malloc(NUM_READINGS * bufferSize);
float *h_output_GPU = (float*) malloc(bufferSize);
float *h_output_CPU = (float*) malloc(bufferSize);

float *d_buffers;
float *d_output;
float *d_weights;
int *d_indices;

std::chrono::duration<double, std::milli> cpuGlobalTime;


int threads = 256;
int blocks = (NUM_SENSORS + threads - 1) / threads;
float gpuGlobalTime = 0;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaMalloc(&d_buffers, NUM_READINGS * bufferSize);


cudaMalloc(&d_output, bufferSize);
cudaMalloc(&d_indices, NUM_READINGS * sizeof(int));
cudaMalloc(&d_weights, NUM_READINGS * sizeof(float));

cudaMemcpy(d_weights, h_weights, NUM_READINGS * sizeof(float), cudaMemcpyHostToDevice);

for(int i = 0; i < 4; i++) {

std::cout << i << std::endl;

int newDataIdx = h_indices[0];

if (i == 0) {
for(int j = 0; j < NUM_READINGS; j++) {
initializeBuffer(h_buffers, j);
}
} else {
newDataIdx = h_indices[0];
rotateIndices(h_indices);
initializeBuffer(h_buffers, newDataIdx); //one new reading for all sensors
}

auto startCpu = std::chrono::high_resolution_clock::now();


smoothSensorsCpu(h_buffers, h_indices, h_output_CPU, h_weights);
auto stopCpu = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> cpuDuration = (stopCpu - startCpu);
cpuGlobalTime += cpuDuration;
std::cout << "Time taken by CPU: " << cpuDuration.count() << " ms" << std::endl;

cudaEventRecord(start);

cudaMemcpy(d_indices, h_indices, NUM_READINGS * sizeof(int), cudaMemcpyHostToDevice);


if (i == 0) {
cudaMemcpy(d_buffers, h_buffers, NUM_READINGS * bufferSize,
cudaMemcpyHostToDevice);
} else {
cudaMemcpy(&d_buffers[newDataIdx], &h_buffers[newDataIdx], bufferSize,
cudaMemcpyHostToDevice);
}

cudaEventRecord(stop);
cudaEventSynchronize(stop);
float gpuTimeToCopy = 0;
cudaEventElapsedTime(&gpuTimeToCopy, start, stop);
std::cout << "GPU memory copy time: " << gpuTimeToCopy << " ms" << std::endl;
cudaEventRecord(start);
smoothSensorsKernel<<<blocks, threads>>>(d_buffers, d_indices, d_output, d_weights);

cudaEventRecord(stop);

cudaEventSynchronize(stop);
float gpuDuration = 0;
cudaEventElapsedTime(&gpuDuration, start, stop);
std::cout << "GPU execution time: " << gpuDuration << " ms" << std::endl;

cudaEventRecord(start);
cudaMemcpy(h_output_GPU, d_output, bufferSize, cudaMemcpyDeviceToHost);
cudaEventRecord(stop);

cudaEventSynchronize(stop);
float gpuTimeToRetrieve = 0;
cudaEventElapsedTime(&gpuTimeToRetrieve, start, stop);
std::cout << "GPU memory retrieve time: " << gpuTimeToRetrieve << " ms" << std::endl;

float gpuTotalTime = (gpuDuration + gpuTimeToCopy + gpuTimeToRetrieve);


gpuGlobalTime += gpuTotalTime;
std::cout << "GPU total time: " << gpuTotalTime << " ms" << std::endl;

if (!checkResults(h_output_CPU, h_output_GPU, NUM_SENSORS)) {


std::cout << "Results do not match!" << std::endl;
}
}

std::cout << "=============================================" << std::endl;


std::cout << "CPU global time: " << cpuGlobalTime.count() << " ms" << std::endl;
std::cout << "GPU global time: " << gpuGlobalTime << " ms" << std::endl;

std::cout << "speed up over all executions: " << cpuGlobalTime.count() / gpuGlobalTime <<
std::endl;

cudaFree(d_buffers);
cudaFree(d_output);
cudaFree(d_indices);
cudaFree(d_weights);
free(h_output_CPU);
free(h_output_GPU);
free(h_buffers);

return 0;
}
Chapter 7
27-08-2025 12:10 315 CMakeLists.txt

27-08-2025 12:10 7,616 matrix_multiply_float.cu

27-08-2025 12:10 256 readme.md

readme.md
# Introduction

<p>All our examples are numbered here to follow the order in which they appear on the book
chapter.</p>

To build the examples we may simply:


1. create a `build` folder inside of the corresponding code folder
2. run cmake ..
3. run make

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(performance_strategies CUDA)

set(CMAKE_CXX_FLAGS_DEBUG "-g -O0 -Wall")


set(CMAKE_CUDA_FLAGS_DEBUG "-G -g -O0")

set(CMAKE_CXX_FLAGS_RELEASE "-O3")
set(CMAKE_CUDA_FLAGS_RELEASE "-O3 -arch=sm_75 -rdc=false")

add_executable(matrix_multiply_float matrix_multiply_float.cu)

matrix_multiply_float.cu
#include <chrono>
#include <cuda_runtime.h>
#include <iostream>
#include <iomanip>

#define N 2000 // Define matrix size


#define TILE 16 // Define the tile size

__global__ void matrixMulKernel(float *A, float *B, float *C, int width) {
__shared__ float Asub[TILE][TILE];
__shared__ float Bsub[TILE][TILE];

int tx = threadIdx.x;
int ty = threadIdx.y;
int row = ty + blockIdx.y * blockDim.y;
int col = tx + blockIdx.x * blockDim.x;

float sum = 0.0f;

for (int i = 0; i < (width + blockDim.x - 1) / blockDim.x; i++) {


if (row < width && (i * blockDim.x + tx) < width)
Asub[ty][tx] = A[row * width + i * blockDim.x + tx];
else
Asub[ty][tx] = 0.0f;

if (col < width && (i * blockDim.y + ty) < width)


Bsub[ty][tx] = B[(i * blockDim.y + ty) * width + col];
else
Bsub[ty][tx] = 0.0f;

__syncthreads();

#pragma unroll
for (int k = 0; k < blockDim.x; k++) {
sum = fmaf(Asub[ty][k], Bsub[k][tx], sum);
}

__syncthreads();
}

if (row < width && col < width)


C[row * width + col] = sum;
}

__global__ void matrixMulKernel_shared(float* A, float* B, float* C, int n) {


// Allocate shared memory
__shared__ float Asub[TILE][TILE];
__shared__ float Bsub[TILE][TILE];

int tx = threadIdx.x;
int ty = threadIdx.y;
int row = ty + blockIdx.y * blockDim.y;
int col = tx + blockIdx.x * blockDim.x;

float sum = 0.0f;

for (int i = 0; i < (n + blockDim.x - 1) / blockDim.x; i++) {


if (row < n && (i * blockDim.x + tx) < n)
Asub[ty][tx] = A[row * n + i * blockDim.x + tx];
else
Asub[ty][tx] = 0.0f;

if (col < n && (i * blockDim.y + ty) < n)


Bsub[ty][tx] = B[(i * blockDim.y + ty) * n + col];
else
Bsub[ty][tx] = 0.0f;

__syncthreads();

#pragma unroll
for (int k = 0; k < blockDim.x; k++) {
sum += Asub[ty][k] * Bsub[k][tx];
// Use fused multiply-add
//sum = fmaf(Asub[ty][k], Bsub[k][tx], sum);
}

__syncthreads();
}

if (row < n && col < n)


C[row * n + col] = sum;
}

__global__ void matrixMulKernel_naive(float *A, float *B, float *C, int width) {
int row = threadIdx.y + blockIdx.y * blockDim.y;
int col = threadIdx.x + blockIdx.x * blockDim.x;

if (row < width && col < width) {


float value = 0;
for (int k = 0; k < width; k++) {
value += A[row * width + k] * B[k * width + col];
}
C[row * width + col] = value;
}
}

__global__ void matrixMulKernel_row(float *A, float *B, float *C, int width) {
int row = threadIdx.x + blockIdx.x * blockDim.x;
if (row < width) {
for (int col = 0; col < width; col++) {
float sum = 0.0f;
for (int i = 0; i < width; i++) {
sum += A[row * width + i] * B[i * N + col];
}
C[row * width + col] = sum;
}
}
}

__global__ void matrixMulKernel_col(float *A, float *B, float *C, int width) {
int col = threadIdx.x + blockIdx.x * blockDim.x;
if (col < width) {
for (int row = 0; row < width; row++) {
float sum = 0.0f;
for (int i = 0; i < width; i++) {
sum += A[row * width + i] * B[i * N + col];
}
C[row * width + col] = sum;
}
}
}

void matrixMulCpu(float *A, float *B, float *C, int width) {


for (int row = 0; row < width; row++) {
for (int col = 0; col < width; col++) {
float value = 0;
for (int k = 0; k < width; k++) {
value += A[row * width + k] * B[k * width + col];
}
C[row * width + col] = value;
}
}
}

void initializeMatrix(float *matrix, int size) {


for (int i = 0; i < size; i++) {
matrix[i] = static_cast<float>(rand()) / RAND_MAX;
}
}

bool checkResults(float *A, float *B, int size) {


for (int i = 0; i < size; i++) {
if (fabs(A[i] - B[i]) > 1e-4) {
std::cout << "fabs(A[i])=" << fabs(A[i]) << std::endl;
std::cout << "B[i]=" << B[i] << std::endl;
std::cout << std::fixed << std::showpoint;
std::cout << std::setprecision(15);
std::cout << "fabs(A[i] - B[i])=" << fabs(A[i] - B[i]) << std::endl;
std::cout << "#####" << std::endl;
return false;
}
}
return true;
}

int main() {
srand(static_cast<unsigned int>(time(0)));

int matrixSize = N * N * sizeof(float);

float *h_A = (float*)malloc(matrixSize);


float *h_B = (float*)malloc(matrixSize);
float *h_C_CPU = (float*)malloc(matrixSize);
float *h_C_GPU = (float*)malloc(matrixSize);
initializeMatrix(h_A, N * N);
initializeMatrix(h_B, N * N);

auto startCpu = std::chrono::high_resolution_clock::now();


matrixMulCpu(h_A, h_B, h_C_CPU, N);
auto stopCpu = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> cpuDuration = (stopCpu - startCpu);

std::cout << "Time taken by CPU: " << cpuDuration.count() << " ms" << std::endl;

float *d_A;
float *d_B;
float *d_C;
cudaMalloc((void**)&d_A, matrixSize);
cudaMalloc((void**)&d_B, matrixSize);
cudaMalloc((void**)&d_C, matrixSize);

cudaEvent_t start, stop;


cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
cudaMemcpy(d_A, h_A, matrixSize, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, matrixSize, cudaMemcpyHostToDevice);

cudaEventRecord(stop);
cudaEventSynchronize(stop);
float gpuTimeToCopy = 0;
cudaEventElapsedTime(&gpuTimeToCopy, start, stop);
std::cout << "GPU memory copy time: " << gpuTimeToCopy << " ms" << std::endl;

cudaEventRecord(start);

dim3 blockDim(TILE, TILE);


dim3 gridDim((N + blockDim.x - 1) / blockDim.x, (N + blockDim.y - 1) / blockDim.y);

matrixMulKernel<<<gridDim, blockDim>>>(d_A, d_B, d_C, N);

// for those kernels we need a different launch configuration


// int threads = 256;
// int blocks = (N + threads - 1) / threads ;
// matrixMulKernel_row<<<blocks, threads>>>(d_A, d_B, d_C, N);
// matrixMulKernel_col<<<blocks, threads>>>(d_A, d_B, d_C, N);

cudaEventRecord(stop);

cudaEventSynchronize(stop);
float gpuTime = 0;
cudaEventElapsedTime(&gpuTime, start, stop);
std::cout << "GPU execution time: " << gpuTime << " ms" << std::endl;
cudaEventRecord(start);

cudaMemcpy(h_C_GPU, d_C, matrixSize, cudaMemcpyDeviceToHost);

cudaEventRecord(stop);

cudaEventSynchronize(stop);
float gpuTimeToRetrieve = 0;
cudaEventElapsedTime(&gpuTimeToRetrieve, start, stop);
std::cout << "GPU memory retrieve time: " << gpuTimeToRetrieve << " ms" << std::endl;
std::cout << "GPU total time: " << (gpuTime + gpuTimeToCopy + gpuTimeToRetrieve) << " ms" <<
std::endl;

double cpuTotalTime = cpuDuration.count();


double gpuTotalTime = (gpuTime + gpuTimeToCopy + gpuTimeToRetrieve);
std::cout << "speed up: " << cpuTotalTime / gpuTotalTime << std::endl;

if (checkResults(h_C_CPU, h_C_GPU, N * N)) {


std::cout << "Results match!" << std::endl;
} else {
std::cout << "Results do not match!" << std::endl;
}

free(h_A);
free(h_B);
free(h_C_CPU);
free(h_C_GPU);
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
cudaEventDestroy(start);
cudaEventDestroy(stop);

return 0;
}

Chapter 8
27-08-2025 12:10 <DIR> 1_vector_add

27-08-2025 12:10 <DIR> 2_mem_bandwidth

27-08-2025 12:10 <DIR> 3_multiplying_with_streams

27-08-2025 12:10 <DIR> 4_multigpu

27-08-2025 12:10 256 readme.md


readme
# Introduction

<p>All our examples are numbered here to follow the order in which they appear on the book
chapter.</p>

To build the examples we may simply:


1. create a `build` folder inside of the corresponding code folder
2. run cmake ..
3. run make

1_vector_add
27-08-2025 12:10 191 CMakeLists.txt

27-08-2025 12:10 1,640 vector_add.cu

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(overlay_things CUDA)

set(CMAKE_CXX_FLAGS_DEBUG "-g -O0 -Wall")


set(CMAKE_CUDA_FLAGS_DEBUG "-G -g -O0")

add_executable(vector_add vector_add.cu)

vector_add.cu

#include <iostream>
#include <cuda_runtime.h>

__global__ void vectorAddKernel(int *a, int *b, int *c, int N) {


int idx = threadIdx.x + blockIdx.x * blockDim.x;

if (idx < N) {
c[idx] = a[idx] * b[idx];
}

void vectorAddCpu(int *a, int *b, int *c, int N) {

for (int idx = 0; idx < N; idx++) {


c[idx] = a[idx] + b[idx];
}

void checkResult(int *cpuRes, int *gpuRes, int N) {


for (int i = 0; i < N; ++i) {
if (abs(cpuRes[i] - gpuRes[i]) > 1e-5) {
std::cerr << "Mismatch at index " << i <<
" CPU: " << cpuRes[i] <<
" GPU: " << gpuRes[i] <<
" diff=" << abs(cpuRes[i] - gpuRes[i]) << std::endl;
return;
}
}
std::cout << "Results match!" << std::endl;
}

int main() {
const int N = 130;
int h_a[N];
int h_b[N];
int h_c[N];
int cpuResult[N];
int *d_a;
int *d_b;
int *d_c;

for (int i = 0; i < N; i++) {


h_a[i] = i;
h_b[i] = i * 2;
}

vectorAddCpu(h_a, h_b, cpuResult, N);

cudaMalloc(&d_a, N * sizeof(int));
cudaMalloc(&d_b, N * sizeof(int));
cudaMalloc(&d_c, N * sizeof(int));

cudaMemcpy(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice);


cudaMemcpy(d_b, h_b, N * sizeof(int), cudaMemcpyHostToDevice);

int blockSize = 128;


int gridSize = (N + blockSize - 1) / blockSize;

vectorAddKernel<<<gridSize, blockSize>>>(d_a, d_b, d_c, N);

cudaMemcpy(h_c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);

checkResult(cpuResult, h_c, N);


cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);

return 0;
}

2_mem_bandwidth
27-08-2025 12:10 166 CMakeLists.txt

27-08-2025 12:10 1,675 memcpy_benchmark.cu

27-08-2025 12:10 1,245 mem_bandwidth.cu

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(overlay_things CUDA)

add_executable(mem_bandwidth mem_bandwidth.cu)
add_executable(memcpy_benchmark memcpy_benchmark.cu)

memcpy_benchmark.cu

#include <cuda_runtime.h>
#include <iostream>

int main() {
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

int memClockMHz = prop.memoryClockRate / 1000; // Convert to MHz


int busWidth = prop.memoryBusWidth; // In bits
float bandwidthInBytes = memClockMHz * busWidth / 8; // In bytes
float bandwidthInGB = bandwidthInBytes / 1024; // In GigaBytes
float bandwidthGBs = 2.0f * bandwidthInGB; // DDR memory is twice the clock rate

int device;
cudaGetDevice(&device);

int concurrentKernels = 0;
int asyncEngineCount = 0;

cudaDeviceGetAttribute(&concurrentKernels, cudaDevAttrConcurrentKernels, device);


cudaDeviceGetAttribute(&asyncEngineCount, cudaDevAttrAsyncEngineCount, device);

std::cout << "Device " << device << ":" << std::endl;
std::cout << " Concurrent Kernel Execution: " << (concurrentKernels ? "Yes" : "No") << std::endl;
std::cout << " Async Engine Count: " << asyncEngineCount << std::endl;

std::cout << "Memory Clock Speed: " << memClockMHz << " MHz" << std::endl;
std::cout << "Memory Bus Width: " << busWidth << " bits" << std::endl;
std::cout << "Estimated Memory Bandwidth: " << bandwidthGBs << " GB/s" << std::endl;

return 0;
}

mem_bandwidth.cu
#include <cuda_runtime.h>
#include <iostream>

#define SIZE_MB(x) (long(x) * 1024 * 1024) // Convert MB to bytes

void measureMemcpyBandwidth(long dataSize) {


float *h_data, *d_data;
cudaMallocHost(&h_data, dataSize);
cudaMalloc(&d_data, dataSize);

cudaEvent_t start, stop;


cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
cudaMemcpyAsync(d_data, h_data, dataSize, cudaMemcpyHostToDevice);
cudaEventRecord(stop);

cudaEventSynchronize(stop);

float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
float bandwidth = (dataSize / (milliseconds / 1000.0)) / (1024.0 * 1024.0 * 1024.0);

std::cout << "Data Size: " << dataSize / (1024 * 1024) << " MB, Time: " << milliseconds << " ms,
Bandwidth: " << bandwidth << " GB/s\n";

cudaFreeHost(h_data);
cudaFree(d_data);
}

int main() {
std::cout << "Testing different transfer sizes:\n";
measureMemcpyBandwidth(SIZE_MB(1));
measureMemcpyBandwidth(SIZE_MB(10));
measureMemcpyBandwidth(SIZE_MB(20));
measureMemcpyBandwidth(SIZE_MB(30));
measureMemcpyBandwidth(SIZE_MB(32));
measureMemcpyBandwidth(SIZE_MB(40));
measureMemcpyBandwidth(SIZE_MB(50));
measureMemcpyBandwidth(SIZE_MB(100));
measureMemcpyBandwidth(SIZE_MB(200));
measureMemcpyBandwidth(SIZE_MB(300));
measureMemcpyBandwidth(SIZE_MB(400));
measureMemcpyBandwidth(SIZE_MB(500));
measureMemcpyBandwidth(SIZE_MB(600));
measureMemcpyBandwidth(SIZE_MB(700));
measureMemcpyBandwidth(SIZE_MB(800));
measureMemcpyBandwidth(SIZE_MB(900));
measureMemcpyBandwidth(SIZE_MB(1024)); // 1GB

return 0;
}

3_multiplying_with_streams
27-08-2025 12:10 189 CMakeLists.txt

27-08-2025 12:10 3,178 vec_mat_mul_nostreams.cu

27-08-2025 12:10 6,867 vec_mat_mul_streams.cu

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(overlay_things CUDA)

add_executable(vec_mat_mul_streams vec_mat_mul_streams.cu)
add_executable(vec_mat_mul_nostreams vec_mat_mul_nostreams.cu)

vec_mat_mul_nostreams.cu

#include <cuda_runtime.h>
#include <iostream>
#include <vector>
#include <cassert>
#include <chrono>

#define BLOCK_SIZE 256

__global__ void vectorMatrixMulKernel(float* d_vec, float* d_mat, float* d_res, int rows, int cols) {
int row = threadIdx.x + blockIdx.x * blockDim.x;
if (row < rows) {
float sum = 0.0f;
for (int col = 0; col < cols; col++) {
sum += d_mat[row * cols + col] * d_vec[col];
}
d_res[row] = sum;
}
}

void vectorMatrixMulCpu(float *vec, float *mat, float *res, int rows, int cols) {
for (int i = 0; i < rows; i++) {
res[i] = 0;
for (int j = 0; j < cols; j++) {
res[i] += mat[i * cols + j] * vec[j];
}
}
}

void checkResult(float *cpuRes, float *gpuRes, int size) {


for (int i = 0; i < size; i++) {
if (fabs(cpuRes[i] - gpuRes[i]) > 1e-3) {
std::cerr << "Mismatch at " << i << " CPU: " << cpuRes[i] << " GPU: " << gpuRes[i] << " diff=" <<
abs(cpuRes[i] - gpuRes[i]) << std::endl;
return;
}
}
std::cout << "Results match!" << std::endl;
}

void initializeMatrix(float *matrix, int size) {


for (int i = 0; i < size; i++) {
matrix[i] = static_cast<float>(rand()) / RAND_MAX;
}
}

void compute(int cols) {


int rows = cols;

float *h_vec = (float*)malloc(rows * sizeof(float));


float *h_mat = (float*)malloc(rows * cols * sizeof(float));
initializeMatrix(h_vec, cols);
initializeMatrix(h_mat, rows * cols);

float *h_res_gpu = (float*)malloc(rows * sizeof(float));


float *h_res_cpu = (float*)malloc(rows * sizeof(float));

float *d_vec;
float *d_mat;
float *d_res;
cudaMalloc(&d_vec, cols * sizeof(float));
cudaMalloc(&d_mat, rows * cols * sizeof(float));
cudaMalloc(&d_res, rows * sizeof(float));

cudaEvent_t start, stop;


cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
cudaMemcpy(d_vec, h_vec, cols * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_mat, h_mat, rows * cols * sizeof(float), cudaMemcpyHostToDevice);

int blocks = (rows + BLOCK_SIZE - 1) / BLOCK_SIZE;


vectorMatrixMulKernel<<<blocks, BLOCK_SIZE>>>(d_vec, d_mat, d_res, rows, cols);
cudaMemcpy(h_res_gpu, d_res, rows * sizeof(float), cudaMemcpyDeviceToHost);

cudaEventRecord(stop);

cudaEventSynchronize(stop);

float gpuDuration = 0;
cudaEventElapsedTime(&gpuDuration, start, stop);
std::cout << "Time taken by GPU: " << gpuDuration << " ms, for matrix: " << cols << "x" << cols <<
std::endl;

auto startCpu = std::chrono::high_resolution_clock::now();


vectorMatrixMulCpu(h_vec, h_mat, h_res_cpu, rows, cols);
auto stopCpu = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> cpuDuration = (stopCpu - startCpu);

std::cout << "Time taken by CPU: " << cpuDuration.count() << " ms" << std::endl;

checkResult(h_res_cpu, h_res_gpu, rows);

cudaFree(d_vec);
cudaFree(d_mat);
cudaFree(d_res);

int main() {
srand(static_cast<unsigned int>(time(0)));

compute(16380);
compute(32760);

return 0;
}

vec_mat_mul_streams.cu
#include <cuda_runtime.h>
#include <iostream>
#include <vector>
#include <cassert>
#include <chrono>

#define BLOCK_SIZE 256


__global__ void vectorMatrixMulKernel(float* d_vec, float* d_mat, float* d_res, int rows, int cols) {
int row = threadIdx.x + blockIdx.x * blockDim.x;
if (row < rows) {
float sum = 0.0f;
for (int col = 0; col < cols; col++) {
sum += d_mat[row * cols + col] * d_vec[col];
}
d_res[row] = sum;
}
}

void vectorMatrixMulCpu(float* vec, float* mat, float* res, int rows, int cols) {
for (int i = 0; i < rows; i++) {
res[i] = 0;
for (int j = 0; j < cols; j++) {
res[i] += mat[i * cols + j] * vec[j];
}
}
}

void checkResult(float* cpuRes, float* gpuRes, int size) {


for (int i = 0; i < size; i++) {
if (fabs(cpuRes[i] - gpuRes[i]) > 1e-3) {
std::cerr << "Mismatch at " << i << " CPU: " << cpuRes[i] << " GPU: " << gpuRes[i] << " diff=" <<
abs(cpuRes[i] - gpuRes[i]) << std::endl;
return;
}
}
std::cout << "Results match!" << std::endl;
}

void initializeMatrix(float *matrix, int size) {


for (int i = 0; i < size; i++) {
matrix[i] = static_cast<float>(rand()) / RAND_MAX;
}
}

float compute(int chunkSize, int cols, bool computeCpuPart) {


int rows = cols;
float *h_vec = (float*)malloc(rows * sizeof(float));
float *h_res_cpu = (float*)malloc(rows * sizeof(float));
float *h_mat_pinned, *h_res_gpu;
cudaMallocHost(&h_mat_pinned, rows * cols * sizeof(float));
cudaMallocHost(&h_res_gpu, rows * sizeof(float));

initializeMatrix(h_vec, cols);
initializeMatrix(h_mat_pinned, rows * cols);

float *d_vec;
float *d_mat1;
float *d_mat2;
float *d_res1;
float *d_res2;
cudaMalloc(&d_vec, cols * sizeof(float));
cudaMalloc(&d_mat1, chunkSize * cols * sizeof(float));
cudaMalloc(&d_mat2, chunkSize * cols * sizeof(float));
cudaMalloc(&d_res1, chunkSize * sizeof(float));
cudaMalloc(&d_res2, chunkSize * sizeof(float));

cudaEvent_t start, stop;


cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);

cudaMemcpy(d_vec, h_vec, cols * sizeof(float), cudaMemcpyHostToDevice);

cudaStream_t stream1, stream2;


cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

int blocks = (chunkSize + BLOCK_SIZE - 1) / BLOCK_SIZE;

for (int i = 0; i < rows; i += chunkSize * 2) {


// Stream 1: Process first chunk
cudaMemcpyAsync(d_mat1, h_mat_pinned + i * cols, chunkSize * cols * sizeof(float),
cudaMemcpyHostToDevice, stream1);

vectorMatrixMulKernel<<<blocks, BLOCK_SIZE, 0, stream1>>>(d_vec, d_mat1, d_res1,


chunkSize, cols);

cudaMemcpyAsync(h_res_gpu + i, d_res1, chunkSize * sizeof(float), cudaMemcpyDeviceToHost,


stream1);

// Stream 2: Process second chunk


cudaMemcpyAsync(d_mat2, h_mat_pinned + (i + chunkSize) * cols, chunkSize * cols *
sizeof(float), cudaMemcpyHostToDevice, stream2);

vectorMatrixMulKernel<<<blocks, BLOCK_SIZE, 0, stream2>>>(d_vec, d_mat2, d_res2,


chunkSize, cols);

cudaMemcpyAsync(h_res_gpu + i + chunkSize, d_res2, chunkSize * sizeof(float),


cudaMemcpyDeviceToHost, stream2);
}
// Ensure first chunk has completed before reusing buffers
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);

cudaEventRecord(stop);

cudaEventSynchronize(stop);
float gpuDuration = 0;
cudaEventElapsedTime(&gpuDuration, start, stop);
std::cout << "Time taken by GPU: " << gpuDuration << " ms, for matrix: " << cols << "x" << cols << "
chunk size: " << chunkSize << std::endl;

if (computeCpuPart) {
auto startCpu = std::chrono::high_resolution_clock::now();
vectorMatrixMulCpu(h_vec, h_mat_pinned, h_res_cpu, rows, cols);
auto stopCpu = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> cpuDuration = (stopCpu - startCpu);

std::cout << "Time taken by CPU: " << cpuDuration.count() << " ms" << std::endl;

checkResult(h_res_cpu, h_res_gpu, rows);


}

cudaFreeHost(h_mat_pinned);
cudaFreeHost(h_res_gpu);
cudaFree(d_vec);
cudaFree(d_mat1);
cudaFree(d_mat2);
cudaFree(d_res1);
cudaFree(d_res2);
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);

return gpuDuration;
}

/*
* The average function will collect the gpuDuration for each chunkSize and matrix size
* over the iterations defined and calculate the percetage gain against the provided
* reference value of the execution time for the non streams application version.
*
* Note: you have to run the non streams version with the same matrix size.
*/
void average(int chunkSize, int cols, int iterations, float noStreamsReferenceTime) {
long dataSize = (chunkSize * cols);
long totalDataSize = (cols * cols);

double sum = 0.0f;


for(int i = 0; i < iterations; i++) {
sum += compute(chunkSize, cols, false);
}
double average = sum / iterations;

std::cout << " chunk: " << chunkSize


<< " data size: " << dataSize / 1024 / 1024 * 4 << "MB "
<< " total data size: " << totalDataSize / 1024 / 1024 * 4 << "MB "
<< " data partitions: " << (double)totalDataSize / (double)dataSize << " "
<< " average time: " << average << " ms"
<< " percentage gain: " << 100 - (average / noStreamsReferenceTime * 100) << "%"
<< std::endl;
}

int main() {
srand(static_cast<unsigned int>(time(0)));

compute( 273, 16380, true);


compute( 546, 16380, true);
compute( 819, 16380, true);
compute( 1638, 16380, true);
compute( 4095, 16380, true);
compute( 8190, 16380, true);

compute( 273, 32760, true);


compute( 546, 32760, true);
compute( 819, 32760, true);
compute( 1638, 32760, true);
compute( 4095, 32760, true);
compute( 8190, 32760, true);
compute(16380, 32760, true);

/*
* If you want to check an average of some executions against the values measured
* for the no streams version you can use the following sample calls, but updating
* the last value to the time measurement from your system.
*/
// average( 273, 32760, 10, 500.615);
// average( 546, 32760, 10, 500.615);
// average( 819, 32760, 10, 500.615);
// average( 1638, 32760, 10, 500.615);
// average( 4095, 32760, 10, 500.615);
// average( 8190, 32760, 10, 500.615);
// average(16380, 32760, 10, 500.615);

// average( 273, 16380, 10, 119.299);


// average( 546, 16380, 10, 119.299);
// average( 819, 16380, 10, 119.299);
// average( 1638, 16380, 10, 119.299);
// average( 4095, 16380, 10, 119.299);
// average( 8190, 16380, 10, 119.299);

return 0;
}

4_multigpu
27-08-2025 12:10 104 CMakeLists.txt

27-08-2025 12:10 3,846 multigpu.cu


CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(overlay_things CUDA)

add_executable(multigpu multigpu.cu)

multigpu.cu
#include <cuda_runtime.h>
#include <iostream>
#include <vector>
#include <chrono>
#include <cmath>

#define BLOCK_SIZE 256

__global__ void vectorMatrixMulKernel(float* d_vec, float* d_mat, float* d_res, int rows, int cols) {
int row = threadIdx.x + blockIdx.x * blockDim.x;
if (row < rows) {
float sum = 0.0f;
for (int col = 0; col < cols; col++) {
sum += d_mat[row * cols + col] * d_vec[col];
}
d_res[row] = sum;
}
}

void vectorMatrixMulCpu(float* vec, float* mat, float* res, int rows, int cols) {
for (int row = 0; row < rows; row++) {
float sum = 0.0f;
for (int col = 0; col < cols; col++) {
sum += mat[row * cols + col] * vec[col];
}
res[row] = sum;
}
}

void compareResults(float* cpuRes, float* gpuRes, int rows) {


for (int i = 0; i < rows; i++) {
if (fabs(cpuRes[i] - gpuRes[i]) > 1e-4) {
std::cerr << "Mismatch at " << i << " CPU: " << cpuRes[i] << " GPU: " << gpuRes[i] << " diff=" <<
abs(cpuRes[i] - gpuRes[i]) << std::endl;
return;
}
}
std::cout << "Results match!" << std::endl;
}

void initializeMatrix(float *matrix, int size) {


for (int i = 0; i < size; i++) {
matrix[i] = static_cast<float>(rand()) / RAND_MAX;
}
}

int main() {
srand(static_cast<unsigned int>(time(0)));

int deviceCount = 0;
cudaGetDeviceCount(&deviceCount);

int cols = 16380;


int rows = cols;
int chunk_size = rows / deviceCount;

float *h_vec = (float*)malloc(rows * sizeof(float));


float *h_res_cpu = (float*)malloc(rows * sizeof(float));
float *h_mat = (float*)malloc(rows * cols * sizeof(float));
float *h_res_gpu = (float*)malloc(rows * sizeof(float));

initializeMatrix(h_vec, cols);
initializeMatrix(h_mat, rows * cols);

float *d_vec[deviceCount];
float *d_mat[deviceCount];
float *d_res[deviceCount];
for (int device = 0; device < deviceCount; device++) {
cudaSetDevice(device);
cudaMalloc(&d_vec[device], rows * sizeof(float));
cudaMalloc(&d_mat[device], chunk_size * cols * sizeof(float));
cudaMalloc(&d_res[device], chunk_size * sizeof(float));

cudaMemcpy(d_vec[device], h_vec, rows * sizeof(float), cudaMemcpyHostToDevice);


}

auto gpuStart = std::chrono::high_resolution_clock::now();

for (int device = 0; device < deviceCount; device++) {


cudaSetDevice(device);
cudaMemcpy(d_mat[device], h_mat + device * chunk_size * cols, chunk_size * cols *
sizeof(float), cudaMemcpyHostToDevice);
}

for (int device = 0; device < deviceCount; device++) {


cudaSetDevice(device);
int blocks = (chunk_size + BLOCK_SIZE - 1) / BLOCK_SIZE;

vectorMatrixMulKernel<<<blocks, BLOCK_SIZE>>>(d_vec[device], d_mat[device], d_res[device],


chunk_size, cols);
}

for (int device = 0; device < deviceCount; device++) {


cudaSetDevice(device);

cudaMemcpy(h_res_gpu + device * chunk_size, d_res[device], chunk_size * sizeof(float),


cudaMemcpyDeviceToHost);
cudaFree(d_vec[device]);
cudaFree(d_mat[device]);
cudaFree(d_res[device]);
}

auto gpuEnd = std::chrono::high_resolution_clock::now();


std::chrono::duration<float, std::milli> gpuDuration = gpuEnd - gpuStart;
std::cout << "Time taken by GPU: " << gpuDuration.count() << " ms" << std::endl;

auto cpuStart = std::chrono::high_resolution_clock::now();


vectorMatrixMulCpu(h_vec, h_mat, h_res_cpu, rows, cols);
auto cpuEnd = std::chrono::high_resolution_clock::now();
std::chrono::duration<float, std::milli> cpuDuration = cpuEnd - cpuStart;
std::cout << "Time taken by CPU: " << cpuDuration.count() << " ms" << std::endl;

compareResults(h_res_cpu, h_res_gpu, rows);

return 0;
}

Chapter 9
27-08-2025 12:10 <DIR> include

27-08-2025 12:10 <DIR> python

27-08-2025 12:10 1,748 readme.md

27-08-2025 12:10 <DIR> src

readme.md
# Introduction

<p>This chapter presents a C++ library that uses the traditional steps for building, but we also have
three implementations of wrappers to expose the code to Python.
We also provide a run_tests.sh inside the `python` folder that runs the corresponding `test.py` file
inside each alternative of the wrappers. The specific commands are listed bellow:</p>

## To Build the C++ Library


To build we may simply:
1. create a `build` folder inside of `ch9` folder
2. run cmake ..
3. run make

## For CTypes
There is no need to build, just run the code with the Python interpreter. There are two files:
1. test_ctypes.py that uses the library
2. test.py that will be used on the script that runs all the wrappers

## For Wrapper
It will be built by setup.py wit the following command:
1. python3 setup.py build_ext --inplace

<p>We also have two test files here:</p>


1. test_vector_add_wrapper.py to test the wrapper individually with the Python interpreter
2. test.py that will be used on the script that runs all the wrappers

## For the NumPy Wrapper


It will be built by setup.py wit the following command:
1. python3 setup.py build_ext --inplace

<p>We also have two test files here:</p>


1. test_vector_add_np_wrapper.py to test the wrapper individually with the Python interpreter
2. test.py that will be used on the script that runs all the wrappers

## To Execute all Tests


First we have to run the setup.py command for the wrappers, after that we can run:
1. run_tests.sh inside the `python` folder, this script will collect the names of the subfolders and will
execute 10 times each `test.py` program. After each iteration it increases the inpute size so that we
can observe the impact on overall performance.

Include
vector_add.H
#ifndef VEC_ADD_H
#define VEC_ADD_H

extern "C" {
void vectorAdd(int *a, int *b, int *c, int N);
}
#endif

python
27-08-2025 12:10 <DIR> ctypes

27-08-2025 12:10 1,134 run_tests.sh

27-08-2025 12:10 <DIR> wrapper

27-08-2025 12:10 <DIR> wrapper_np


run_tests.sh
#!/bin/bash

# Get the current directory where the script is located


base_dir=$(pwd)

# List of subfolders (assuming there are exactly three subfolders)


subfolders=($(find "$base_dir" -maxdepth 1 -type d | tail -n +2 | head -n 3))

# Program to execute
program="test.py"

# Parameters to pass to the program


param=1000

for i in {1..5}; do
echo "*** Processing parameter value: $param"
# Loop over each subfolder
for subfolder in "${subfolders[@]}"; do
echo "## Processing folder: $subfolder"

# Change to the subfolder


cd "$subfolder" || { echo "Failed to enter $subfolder"; exit 1; }

for i in {1..10}; do
# Execute the program with parameters
echo "$i"
python3 "$program" "$param" || { echo "Failed to execute program in $subfolder"; exit 1; }
done

# Move back to the base directory


cd "$base_dir"

echo " "


done

# echo "Finished processing all subfolders for parameter $param"


echo "#############"

# Increment the parameter (multiply by 10 for the next iteration)


param=$((param * 10))
done
echo "All folders processed."

ctypes
27-08-2025 12:10 1,027 test.py

27-08-2025 12:10 563 test_ctypes.py


test.py
import ctypes
import numpy as np
import time
import argparse

lib = ctypes.CDLL("../../build/libvector_add.so")

lib.vectorAdd.argtypes = [
ctypes.POINTER(ctypes.c_int),
ctypes.POINTER(ctypes.c_int),
ctypes.POINTER(ctypes.c_int),
ctypes.c_int
]

def calculate(size):
N = int (size)
a = np.array(range(N), dtype=np.int32)
b = np.array(range(N, 2*N), dtype=np.int32)
c = np.zeros(N, dtype=np.int32)

startTime = time.time()
lib.vectorAdd(
a.ctypes.data_as(ctypes.POINTER(ctypes.c_int)),
b.ctypes.data_as(ctypes.POINTER(ctypes.c_int)),
c.ctypes.data_as(ctypes.POINTER(ctypes.c_int)),
N
)
endTime = time.time()

executionTimeInMs = (endTime - startTime) * 1000

print(f"Function execution time: {executionTimeInMs:.2f} ms")

def main():
parser = argparse.ArgumentParser(description="N")
parser.add_argument("N", help="The size of the array")

args = parser.parse_args()

calculate(args.N)

if __name__ == "__main__":
main()
test_ctypes.py
import ctypes
import numpy as np

lib = ctypes.CDLL("../../build/libvector_add.so")

lib.vectorAdd.argtypes = [
ctypes.POINTER(ctypes.c_int),
ctypes.POINTER(ctypes.c_int),
ctypes.POINTER(ctypes.c_int),
ctypes.c_int
]

N = 10000000
a = np.array(range(N), dtype=np.int32)
b = np.array(range(N, 2*N), dtype=np.int32)
c = np.zeros(N, dtype=np.int32)

lib.vectorAdd(
a.ctypes.data_as(ctypes.POINTER(ctypes.c_int)),
b.ctypes.data_as(ctypes.POINTER(ctypes.c_int)),
c.ctypes.data_as(ctypes.POINTER(ctypes.c_int)),
N
)

print("Result:", c)

wrapper

27-08-2025 12:10 411 pyproject_toml.txt

27-08-2025 12:10 272 setup.py

27-08-2025 12:10 648 test.py

27-08-2025 12:10 211 test_vector_add_wrapper.py

27-08-2025 12:10 1,917 vector_add_wrapper.c

pyproject_toml.txt
[build-system]
requires = ["setuptools", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "vector_add_wrapper"
version = "1.0"
description = "Python binding for CUDA vector addition"
authors = [{ name = "Paulo Motta", email = "[email protected]" }]
dependencies = []

[tool.setuptools]
packages = []

[tool.setuptools.ext_modules]
vector_add_wrapper = { sources = ["vector_add_wrapper.c"] }

setup.py
from setuptools import setup, Extension

module = Extension(
"vector_add_wrapper",
sources=["vector_add_wrapper.c"],
)

setup(
name="vector_add_wrapper",
version="1.0",
description="Python binding for CUDA vector addition",
ext_modules=[module],
)

test.py
import vector_add_wrapper as vaw
import random
import time
import argparse

def calculate(size):
N = int (size)

a = [random.randint(0, N) for _ in range(N)]


b = [random.randint(N, 2*N) for _ in range(N)]

startTime = time.time()
result = vaw.vectorAdd(a, b, N)
endTime = time.time()

executionTimeInMs = (endTime - startTime) * 1000

print(f"Function execution time: {executionTimeInMs:.2f} ms")

def main():
parser = argparse.ArgumentParser(description="N")
parser.add_argument("N", help="The size of the array")

args = parser.parse_args()

calculate(args.N)

if __name__ == "__main__":
main()

test_vector_add_wrapper.py
import vector_add_wrapper as vaw
import random

N = 10000000
a = [random.randint(0, N) for _ in range(N)]
b = [random.randint(N, 2*N) for _ in range(N)]

result = vaw.vectorAdd(a, b, N)
print("Result:", result)

vector_add_wrapper.c
#define PY_SSIZE_T_CLEAN
#include <Python.h>
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

typedef void (*vectorAddFunc)(int *, int *, int *, int);

static vectorAddFunc vectorAdd = NULL;

static PyObject* pyVectorAdd(PyObject* self, PyObject* args) {


PyObject *py_a, *py_b;
int N;

if (!PyArg_ParseTuple(args, "OOi", &py_a, &py_b, &N))


return NULL;

if (!PyList_Check(py_a) || !PyList_Check(py_b) || PyList_Size(py_a) != N || PyList_Size(py_b) != N)


{
PyErr_SetString(PyExc_ValueError, "Arguments must be lists of same size N.");
return NULL;
}

int *a = (int*)malloc(N * sizeof(int));


int *b = (int*)malloc(N * sizeof(int));
int *c = (int*)malloc(N * sizeof(int));
for (int i = 0; i < N; i++) {
a[i] = (int)PyLong_AsLong(PyList_GetItem(py_a, i));
b[i] = (int)PyLong_AsLong(PyList_GetItem(py_b, i));
}

vectorAdd(a, b, c, N);

PyObject* result = PyList_New(N);


for (int i = 0; i < N; i++) {
PyList_SetItem(result, i, PyLong_FromLong(c[i]));
}

free(a);
free(b);
free(c);

return result;
}

static PyMethodDef VectorAddMethods[] = {


{"vectorAdd", pyVectorAdd, METH_VARARGS, "Perform vector addition using CUDA"},
{NULL, NULL, 0, NULL}
};

static struct PyModuleDef vectoraddmodule = {


PyModuleDef_HEAD_INIT,
"vector_add_wrapper",
NULL,
-1,
VectorAddMethods
};

PyMODINIT_FUNC PyInit_vector_add_wrapper(void) {
void *handle = dlopen("../../build/libvector_add.so", RTLD_LAZY);
if (!handle) {
PyErr_SetString(PyExc_ImportError, "Could not load libvector_add.so");
return NULL;
}

vectorAdd = (vectorAddFunc)dlsym(handle, "vectorAdd");


if (!vectorAdd) {
PyErr_SetString(PyExc_ImportError, "Could not find vectorAdd in libvector_add.so");
return NULL;
}

return PyModule_Create(&vectoraddmodule);
}
wrapper_np
27-08-2025 12:10 375 setup.py

27-08-2025 12:10 705 test.py

27-08-2025 12:10 260 test_vector_add_np_wrapper.py

27-08-2025 12:10 1,585 vector_add_np_wrapper.c

setup.py
from setuptools import setup, Extension
import numpy

module = Extension(
"vector_add_np_wrapper",
include_dirs = [numpy.get_include()],
sources=["vector_add_np_wrapper.c"],
)

setup(
name="vector_add_np_wrapper",
version="1.0",
description="Python binding for CUDA vector addition",
ext_modules=[module],
include_dirs=[numpy.get_include()]
)

test.py
import vector_add_np_wrapper as vaw_np
import numpy as np
import time
import argparse

def calculate(size):
N = int (size)
a = np.random.randint(0, N, size=N, dtype=np.int32)
b = np.random.randint(N, 2*N, size=N, dtype=np.int32)
c = np.zeros(N, dtype=np.int32)

startTime = time.time()
vaw_np.vectorAdd(a, b, c, N)
endTime = time.time()

executionTimeInMs = (endTime - startTime) * 1000

print(f"Function execution time: {executionTimeInMs:.2f} ms")

def main():
parser = argparse.ArgumentParser(description="N")
parser.add_argument("N", help="The size of the array")

args = parser.parse_args()

calculate(args.N)

if __name__ == "__main__":
main()

test_vector_add_np_wrapper.py
import vector_add_np_wrapper as vaw_np
import numpy as np

N = 10000000
a = np.random.randint(0, N, size=N, dtype=np.int32)
b = np.random.randint(N, 2*N, size=N, dtype=np.int32)
c = np.zeros(N, dtype=np.int32)

vaw_np.vectorAdd(a, b, c, N)
print("Result:", c)

vector_add_np_wrapper.c
#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
#define PY_SSIZE_T_CLEAN
#include <Python.h>
#include <numpy/arrayobject.h>
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

typedef void (*vectorAddFunc)(int *, int *, int *, int);

static vectorAddFunc vectorAdd = NULL;

static PyObject* pyVectorAdd(PyObject* self, PyObject* args) {


PyArrayObject *a, *b, *c;
int N;

if (!PyArg_ParseTuple(args, "OOOi",
&a,
&b,
&c,
&N)) {
return NULL;
}
int *a_ptr = (int*)PyArray_DATA(a);
int *b_ptr = (int*)PyArray_DATA(b);
int *c_ptr = (int*)PyArray_DATA(c);

vectorAdd(a_ptr, b_ptr, c_ptr, N);

Py_RETURN_NONE;
}

static PyMethodDef VectorAddMethods[] = {


{"vectorAdd", pyVectorAdd, METH_VARARGS, "Perform vector addition using CUDA"},
{NULL, NULL, 0, NULL}
};

static struct PyModuleDef vectoraddmodule = {


PyModuleDef_HEAD_INIT,
"vector_add_np_wrapper",
NULL,
-1,
VectorAddMethods
};

PyMODINIT_FUNC PyInit_vector_add_np_wrapper(void) {
void *handle = dlopen("../../build/libvector_add.so", RTLD_LAZY);
if (!handle) {
PyErr_SetString(PyExc_ImportError, "Could not load libvector_add.so");
return NULL;
}

vectorAdd = (vectorAddFunc)dlsym(handle, "vectorAdd");


if (!vectorAdd) {
PyErr_SetString(PyExc_ImportError, "Could not find vectorAdd in libvector_add.so");
return NULL;
}

import_array();

return PyModule_Create(&vectoraddmodule);
}

SRC
vector_add.cu
#include <iostream>
#include <cuda_runtime.h>
#include "../include/vector_add.h"

__global__ void vectorAddKernel(int *a, int *b, int *c, int N) {


int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < N) {
c[idx] = a[idx] * b[idx];
}
}

void vectorAdd(int *a, int *b, int *c, int N) {


int *d_a;
int *d_b;
int *d_c;

cudaMalloc((void**)&d_a, N * sizeof(int));
cudaMalloc((void**)&d_b, N * sizeof(int));
cudaMalloc((void**)&d_c, N * sizeof(int));

cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice);


cudaMemcpy(d_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

int threadsPerBlock = 256;


int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, N);

cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);

cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);
}

Chapter 10
27-08-2025 12:10 <DIR> 1_cuBLAS

27-08-2025 12:10 <DIR> 2_thrust

27-08-2025 12:10 <DIR> 3_gtest

27-08-2025 12:10 <DIR> 4_pytest

27-08-2025 12:10 256 readme.md

readme.md
# Introduction

<p>All our examples are numbered here to follow the order in which they appear on the book
chapter.</p>

To build the examples we may simply:


1. create a `build` folder inside of the corresponding code folder
2. run cmake ..
3. run make
1_cuBLAS
27-08-2025 12:10 403 CMakeLists.txt

27-08-2025 12:10 4,545 main.cpp

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(MatrixMultiplyCUBLAS)

set(CMAKE_CXX_STANDARD 14)
set(CMAKE_BUILD_TYPE Release)

set(CUDA_TOOLKIT_ROOT_DIR /usr/local/cuda)

find_package(CUDA REQUIRED)

set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -O3")

include_directories(${CUDA_INCLUDE_DIRS})

add_executable(matrix_mul main.cpp)

target_link_libraries(matrix_mul ${CUDA_CUBLAS_LIBRARIES} ${CUDA_LIBRARIES})

main.cpp
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <chrono>
#include <cublas_v2.h>
#include <cuda_runtime.h>

void matrixMultCpu(float* A, float* B, float* C, int N) {


for (int row = 0; row < N; row++) {
for (int col = 0; col < N; col++) {
float sum = 0.0f;
for (int k = 0; k < N; k++) {
sum += A[row * N + k] * B[k * N + col];
}
C[row * N + col] = sum;
}
}
}

void checkResult(float* cpuRes, float* gpuRes, int size) {


for (int i = 0; i < size; i++) {
if (fabs(cpuRes[i] - gpuRes[i]) > 1e-3) {
std::cerr << "Mismatch at " << i << " CPU: " << cpuRes[i] << " GPU: " << gpuRes[i] << " diff=" <<
abs(cpuRes[i] - gpuRes[i]) << std::endl;
return;
}
}
std::cout << "Results match!" << std::endl;
}

int main() {
srand(static_cast<unsigned int>(time(0)));

int N = 1024;
size_t size = N * N * sizeof(float);

float* h_A = (float*)malloc(size);


float* h_B = (float*)malloc(size);
float* h_C = (float*)malloc(size);
float* h_C_cpu = (float*)malloc(size);

for (int i = 0; i < N * N; ++i) {


h_A[i] = static_cast<float>(rand()) / RAND_MAX;
h_B[i] = static_cast<float>(rand()) / RAND_MAX;
}

auto startCpu = std::chrono::high_resolution_clock::now();


matrixMultCpu(h_A, h_B, h_C_cpu, N);
auto endCpu = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> cpuDuration = endCpu - startCpu;
std::cout << "Time taken by CPU: " << cpuDuration.count() << " ms" << std::endl;

float *d_A;
float *d_B;
float *d_C;
cudaMalloc(&d_A, size);
cudaMalloc(&d_B, size);
cudaMalloc(&d_C, size);

auto startTotal = std::chrono::high_resolution_clock::now();

auto startCopy = std::chrono::high_resolution_clock::now();


cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
auto endCopy = std::chrono::high_resolution_clock::now();

auto startOverheadTime = std::chrono::high_resolution_clock::now();

cublasHandle_t handle;
cublasCreate(&handle);
float alpha = 1.0f;
float beta = 0.0f;
auto endOverheadTime = std::chrono::high_resolution_clock::now();

cudaEvent_t startEvent, stopEvent;


cudaEventCreate(&startEvent);
cudaEventCreate(&stopEvent);
cudaEventRecord(startEvent);

cublasSgemm(handle,
CUBLAS_OP_T, CUBLAS_OP_T, // transpose both matrices
N, N, N,
&alpha,
d_A, N,
d_B, N,
&beta,
d_C, N);

float* d_C_fixed;
cudaMalloc(&d_C_fixed, size);

cublasSgeam(handle,
CUBLAS_OP_T, CUBLAS_OP_N,
N, N, // dimensions of the output matrix
&alpha,
d_C, N, // input matrix (to be transposed)
&beta,
nullptr, N, // second matrix not used
d_C_fixed, N); // result goes here

cudaEventRecord(stopEvent);
cudaEventSynchronize(stopEvent);
float gpuComputeTime = 0;
cudaEventElapsedTime(&gpuComputeTime, startEvent, stopEvent);

auto startCopyBack = std::chrono::high_resolution_clock::now();


cudaMemcpy(h_C, d_C_fixed, size, cudaMemcpyDeviceToHost);
auto endCopyBack = std::chrono::high_resolution_clock::now();
auto endTotal = std::chrono::high_resolution_clock::now();

std::chrono::duration<double, std:: milli> copyTime = endCopy - startCopy;


std::chrono::duration<double, std:: milli> copyBackTime = endCopyBack - startCopyBack;
std::chrono::duration<double, std:: milli> totalTime = endTotal - startTotal;
std::chrono::duration<double, std:: milli> overheadTime = endOverheadTime -
startOverheadTime;

std::cout << "Device data copy time: " << copyTime.count() << " ms" << std::endl;
std::cout << "cuBLAS overhead time: " << overheadTime.count() << " ms" << std::endl;
std::cout << "Device compute time: " << gpuComputeTime << " ms" << std::endl;
std::cout << "Device data copy back time: " << copyBackTime.count() << " ms" << std::endl;
std::cout << "Total time taken by GPU: " << totalTime.count() << " ms" << std::endl;
checkResult(h_C_cpu, h_C, N * N);

cublasDestroy(handle);
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
cudaFree(d_C_fixed);
free(h_A);
free(h_B);
free(h_C);
free(h_C_cpu);

return 0;
}

2_thrust
27-08-2025 12:10 321 CMakeLists.txt

27-08-2025 12:10 2,470 main.cu

CMakeLists.txt
cmake_minimum_required(VERSION 3.12)
project(ThrustSortExample LANGUAGES CXX CUDA)

set(CMAKE_CXX_STANDARD 14)
set(CMAKE_BUILD_TYPE Release)

find_package(CUDA REQUIRED)

add_executable(thrust_sort main.cu)

set_target_properties(thrust_sort PROPERTIES
CUDA_SEPARABLE_COMPILATION ON
CUDA_ARCHITECTURES "native"
)

main.cu
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <typeinfo>
#include <vector>
template <typename T>
void thrustSort(size_t size) {
thrust::host_vector<T> h_vec(size);
thrust::host_vector<T> h_result(size);

std::cout << "Processing " << typeid(T).name() << " (" << size << " elements)" << std::endl;

for (auto& val : h_vec) {


val = static_cast<T>(rand()) / RAND_MAX;
}

auto cpuVec = h_vec;


auto cpuStart = std::chrono::high_resolution_clock::now();
std::sort(cpuVec.begin(), cpuVec.end());
auto cpuEnd = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> cpuDuration = cpuEnd - cpuStart;
std::cout << "Time taken by CPU: " << cpuDuration.count() << " ms" << std::endl;

auto gpuStart = std::chrono::high_resolution_clock::now();

auto startCopy = std::chrono::high_resolution_clock::now();


thrust::device_vector<T> d_vec = h_vec;
auto endCopy = std::chrono::high_resolution_clock::now();

auto startSort = std::chrono::high_resolution_clock::now();


thrust::sort(d_vec.begin(), d_vec.end());
auto endSort = std::chrono::high_resolution_clock::now();

auto startCopyBack = std::chrono::high_resolution_clock::now();


thrust::copy(d_vec.begin(), d_vec.end(), h_result.begin());
auto endCopyBack = std::chrono::high_resolution_clock::now();

auto gpuEnd = std::chrono::high_resolution_clock::now();

std::chrono::duration<double, std::milli> copyTime = endCopy - startCopy;


std::chrono::duration<double, std::milli> copyBackTime = endCopyBack - startCopyBack;
std::chrono::duration<double, std::milli> sortTime = endSort - startSort;
std::chrono::duration<double, std::milli> gpuTotalTime = gpuEnd - gpuStart;

std::cout << "GPU copy time: " << copyTime.count() << " ms" << std::endl;
std::cout << "GPU copy back time: " << copyBackTime.count() << " ms" << std::endl;
std::cout << "GPU sort time: " << sortTime.count() << " ms" << std::endl;
std::cout << "Total time taken by GPU: " << gpuTotalTime.count() << " ms" << std::endl;
}

int main() {
size_t N = 33'000'000;

srand(static_cast<unsigned int>(time(0)));
thrustSort<int>(N);
std::cout << std::endl;
thrustSort<double>(N);
std::cout << std::endl;
thrustSort<float>(N);

return 0;
}

3_gtest
27-08-2025 12:10 493 CMakeLists.txt

27-08-2025 12:10 <DIR> include

27-08-2025 12:10 <DIR> src

27-08-2025 12:10 <DIR> test

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(VectorAddTest LANGUAGES CXX CUDA)

enable_testing()

find_package(GTest REQUIRED)

add_executable(vector_add_test test/test_vector_add.cu src/main.cu)


target_link_libraries(vector_add_test GTest::GTest GTest::Main cuda)

add_test(NAME VectorAddTest COMMAND vector_add_test)

add_library(vector_add SHARED src/main.cu)


set_target_properties(vector_add PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
target_include_directories(vector_add PUBLIC include)

include
vector_add.h
#ifndef VEC_ADD_H
#define VEC_ADD_H
extern "C" {
void vectorAdd(float* h_A, float* h_B, float* h_C, int N);
}

#endif
SRC
main.cu
#include <cuda_runtime.h>
#include <iostream>
#include "../include/vector_add.h"

__global__ void vectorAddKernel(float* A, float* B, float* C, int N) {


int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < N) {
C[i] = A[i] + B[i];
}
}

void vectorAdd(float* h_A, float* h_B, float* h_C, int N) {


float *d_A;
float *d_B;
float *d_C;
size_t size = N * sizeof(float);

cudaMalloc(&d_A, size);
cudaMalloc(&d_B, size);
cudaMalloc(&d_C, size);

cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);


cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

int threadsPerBlock = 256;


int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
}
#include <cuda_runtime.h>
#include <iostream>
#include "../include/vector_add.h"

__global__ void vectorAddKernel(float* A, float* B, float* C, int N) {


int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < N) {
C[i] = A[i] + B[i];
}
}

void vectorAdd(float* h_A, float* h_B, float* h_C, int N) {


float *d_A;
float *d_B;
float *d_C;
size_t size = N * sizeof(float);

cudaMalloc(&d_A, size);
cudaMalloc(&d_B, size);
cudaMalloc(&d_C, size);

cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);


cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

int threadsPerBlock = 256;


int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
}

Test
test_vector_add.cu
#include <gtest/gtest.h>
#include "../include/vector_add.h"

TEST(VectorAddTest, SimpleAddition) {
int N = 5;
float A[N] = {1, 2, 3, 4, 5};
float B[N] = {10, 20, 30, 40, 50};
float C[N] = {0};

vectorAdd(A, B, C, N);

for (int i = 0; i < N; i++) {


EXPECT_FLOAT_EQ(C[i], A[i] + B[i]);
}
}

4_pytest
test_vector_add.py
import numpy as np
from vector_add import vectorAdd
def testVectorAdd():
a = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
b = np.array([5.0, 6.0, 7.0, 8.0], dtype=np.float32)
expected = a + b

result = vectorAdd(a, b)

for i in range(len(expected)):
assert abs(result[i] - expected[i]) < 1e-5, f"Mismatch at index {i}: got {result[i]}, expected
{expected[i]}"

np.testing.assert_allclose(result, expected, rtol=1e-5)

vector_add.py
import ctypes
import numpy as np

lib = ctypes.CDLL('../3_gtest/build/libvector_add.so')

lib.vectorAdd.argtypes = [
np.ctypeslib.ndpointer(dtype=np.float32, flags="C_CONTIGUOUS"),
np.ctypeslib.ndpointer(dtype=np.float32, flags="C_CONTIGUOUS"),
np.ctypeslib.ndpointer(dtype=np.float32, flags="C_CONTIGUOUS"),
ctypes.c_int
]

def vectorAdd(a: np.ndarray, b: np.ndarray) -> np.ndarray:


assert a.shape == b.shape
n = a.size
c = np.empty_like(a)
lib.vectorAdd(a, b, c, n)
return c

You might also like