0% found this document useful (0 votes)

27 views71 pages

GpuProgrammingWithC andCUDA

Uploaded by

prashanthvummethala78

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

27 views71 pages

GpuProgrammingWithC andCUDA

Uploaded by

prashanthvummethala78

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 71

Uncover effective techniques for writing efficient

GPU-parallel C++ applications

Paulo Motta
About the book
Written by Paulo Motta, a senior researcher with decades of experience, this comprehensive
GPU programming book is an essential guide for leveraging the power of parallelism to
accelerate your computations. The first section introduces the concept of parallelism and
provides practical advice on how to think about and utilize it effectively. Starting with a basic
GPU program, you then gain hands-on experience in managing the device. This foundational
knowledge is then expanded by parallelizing the program to illustrate how GPUs enhance
performance.

The second section explores GPU architecture and implementation strategies for parallel
algorithms, and offers practical insights into optimizing resource usage for efficient
execution. In the final section, you will explore advanced topics such as utilizing CUDA
streams. You will also learn how to package and distribute GPU-accelerated libraries for the
Python ecosystem, extending the reach and impact of your work.

Combining expert insight with real-world problem solving, this book is a valuable resource
for developers and researchers aiming to harness the full potential of GPU computing. The
blend of theoretical foundations, practical programming techniques, and advanced
optimization strategies it offers is sure to help you succeed in the fast-evolving field of GPU
programming.

Key Learnings
 Manage GPU devices and accelerate your applications
 Apply parallelism effectively using CUDA and C++
 Choose between existing libraries and custom GPU solutions
 Package GPU code into libraries for use with Python
 Explore advanced topics such as CUDA streams
 Implement optimization strategies for resource-efficient execution

Chapters
1. Introduction to Parallel Programming
2. Setting Up Your Development Environment
3. Hello CUDA
4. Hello Again, but in Parallel
5. A Closer Look into the World of GPUs
6. Parallel Algorithms with CUDA
7. Performance Strategies
8. Overlaying Multiple Operations
9. Exposing Your Code to Python
10. Exploring Existing GPU Models
Requirements for this book
You should be comfortable writing computer programs in C++, and basic knowledge of
operating systems will help to understand some of the more advanced concepts, given that we
have to manage device communication.

Software / hardware covered in the

Operating system requirements
book

NVIDIA GPU or access to a Cloud-based Ubuntu Linux 20 or later with

VM with NVIDIA GPU NVIDIA Video Driver

CUDA Toolkit 12

Docker 27.0

VS Code 1.92

CMake 3.16

g++ 9.4

Python 3.8

Nsight Compute 2023.3

In Chapter 2, we discuss options for configuring the development environment. Some of the
software that we need is installed automatically if you elect to use the Docker-based
development environment.

Code conventions
We are using the following convetions:

1. camelCase for names of functions, kernels and variables

2. CamelCase with uppercase letter at beginning for structs
3. snake_case for names of files
4. When using two functions to perform the same computation on the CPU and
GPU we use the same name with a Cpu/Kernel suffix like:
computeSomethingCpu / computeSomethingKernel
5. When we need to allocate buffers with similar names, we are using h_ preffix
for host side and d_for device side.
o float* h_A;
o float* d_A;
6. When comparing results we add a suffix to the name of the variable _CPU or
_GPU.
o float* h_C_GPU; // this is the C array or matrix calculated on the GPU
and copied back to the host
o float* h_C_CPU; // this is the C array or matrix calculated on the CPU

Folder names in RED colour

Files Names in Green colour

Chapter 2

27-08-2025 12:10 277 Dockerfile

27-08-2025 12:10 351 readme.md

readme.md
## The original file

As we evolve on the book we changed the Dockerfile that is under .devcontainer, but we kept here
the original one for reference.

The change was that originally, for simplicity purposes, we used an NVidia based docker image. Later
we changed to an Ubuntu based image in which we installed the CUDA Toolkit for added
functionality.

Dockerfile

FROM docker.io/nvidia/cuda:12.0.0-devel-ubuntu20.04

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && \

apt-get install -y \

xz-utils \

build-essential \

libssl-dev \

git \

cmake \

&& \

apt-get clean && \

rm -rf /var/lib/apt/lists/*
Chapter 3
27-08-2025 12:10 753 device_query.cu

27-08-2025 12:10 254 hello_world.cu

27-08-2025 12:10 660 readme.md

readme.md
# Hello World
Here is the command to compile from the terminal:
nvcc hello_world.cu -o hello_world -Xcompiler "-Wall" -lcudadevrt -lcudart_static -lstdc++

or if you are using the docker terminal:

docker run -it --rm --gpus all -v ./:/code nvidia/cuda:12.0.0-devel-ubuntu20.04 bash
cd code
nvcc -o hello_world hello_world.cu

## Device Query
Here is the command to compile from the terminal:
nvcc -o device_query device_query.cu -Xcompiler "-Wall" -lcudadevrt -lcudart_static -lstdc++

or if you are using the docker terminal:

docker run -it --rm --gpus all -v ./:/code nvidia/cuda:12.0.0-devel-ubuntu20.04 bash
cd code
nvcc -o device_query device_query.cu

hello_world.cu

#include <iostream>
__global__ void helloWorld() {
int tid = threadIdx.x + blockIdx.x * blockDim.x;
printf("Hello, World! Thread %d\n", tid);
}
int main() {
helloWorld<<<1, 10>>>();
cudaDeviceSynchronize();
return 0;
}
device_query.cu
#include <cuda_runtime.h>
#include <iostream>

int main() {

int deviceCount = 0;
cudaGetDeviceCount(&deviceCount);

std::cout << "Number of devices " << deviceCount << std::endl;

cudaDeviceProp deviceProp;
int dev = 0;
cudaGetDeviceProperties(&deviceProp, dev);

std::cout << "Device " << dev << ": " << deviceProp.name << std::endl;

std::cout << " CUDA Capability Major/Minor version number: " << deviceProp.major << "." <<
deviceProp.minor << std::endl;

std::cout << " Total amount of shared memory per block: " << deviceProp.sharedMemPerBlock <<
" bytes" << std::endl;

std::cout << " Maximum number of threads per block: " << deviceProp.maxThreadsPerBlock <<
std::endl;

return 0;

Chapter 4
27-08-2025 12:10 <DIR> 1_primes
27-08-2025 12:10 <DIR> 2_vector_add
27-08-2025 12:10 <DIR> 3_euclidean_distance
27-08-2025 12:10 256 readme.md

readme
# Introduction
All our examples are numbered here to follow the order in which they appear on the book
chapter.
To build the examples we may simply:
1. create a `build` folder inside of the corresponding code folder

2. run cmake ..

3. run make
1_primes
27-08-2025 12:10 95 CMakeLists.txt

27-08-2025 12:10 2,347 primes.cu

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(primes_gpu CUDA)

add_executable(primes primes.cu)

primes.cu
#include <cuda_runtime.h>
#include <iostream>
#include <chrono>

global void checkPrimeKernel(long long start, long long end) {

int tid = threadIdx.x + blockIdx.x * blockDim.x;
long long num = start + (tid * 2);
bool isPrime = true;
if (num <= 1) {
isPrime = false;
return;
}
if (num == 2) {
isPrime = true;
return;
}
if (num % 2 == 0) {
isPrime = false;
return;
}
if (num > end) return;

for (long long i = 3; i * i <= num; i += 2) {

if (num % i == 0) {
isPrime = false;
break;
}
}

/*
* for study purposes we can print the verification of each number
*/
//printf("tid=%d %lld is prime? %d\n", tid, num, isPrime);
}
bool checkPrimeCpu(long long num) {

if (num <= 1) return false;

if (num == 2) return true;
if (num % 2 == 0) return false;
for (long long i = 3; i * i <= num; i += 2) {
if (num % i == 0) {
return false;
}
}
return true;
}

int main() {
long long start = 100'001LL; // must start with odd
long long end = 190'001LL;

int threadsPerBlock = 256;

int totalNumbers = (end - start) / 2 + 1;
int blocksPerGrid = (totalNumbers + threadsPerBlock - 1) / threadsPerBlock;

cudaEvent_t startEvent, stopEvent;

cudaEventCreate(&startEvent);
cudaEventCreate(&stopEvent);

cudaEventRecord(startEvent, 0);

checkPrimeKernel<<<blocksPerGrid, threadsPerBlock>>>(start, end);

cudaEventRecord(stopEvent, 0);
cudaEventSynchronize(stopEvent);

float gpuDuration = 0;
cudaEventElapsedTime(&gpuDuration, startEvent, stopEvent);

std::cout << "Time taken on GPU: " << gpuDuration << " ms" << std::endl;

cudaEventDestroy(startEvent);
cudaEventDestroy(stopEvent);

auto startTime = std::chrono::high_resolution_clock::now();

for (long long num = start; num <= end; num += 2) {

checkPrimeCpu(num);
}

auto endTime = std::chrono::high_resolution_clock::now();

std::chrono::duration<double, std::milli> cpuDuration = endTime - startTime;
std::cout << "Time taken on CPU: " << std::fixed << cpuDuration.count() << " ms" << std::endl;
std::cout << "speed up : " << cpuDuration.count() / gpuDuration << std::endl;

return 0;
}

2_vector_add
27-08-2025 12:10 103 CMakeLists.txt

27-08-2025 12:10 3,199 vector_add.cu

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(vector_add CUDA)
add_executable(vector_add vector_add.cu)

vector_add.cu
#include <iostream>
#include <cuda_runtime.h>
#include <chrono>

global void vectorAddKernel(float A, float B, float *C, int N) {

int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < N) {
C[i] = A[i] + B[i];
}
}

void vectorAddCpu(float A, float B, float *C, int N) {

for (int i = 0; i < N; i++) {
C[i] = A[i] + B[i];
}
}

int main() {
int N = 100'000'000;
size_t size = N * sizeof(float);

float h_A = (float )malloc(size);

float *h_B = (float *)malloc(size);
float *h_C = (float *)malloc(size);

for (int i = 0; i < N; ++i) {

h_A[i] = static_cast<float>(i);
h_B[i] = static_cast<float>(i * 2);
}

float *d_A;
float *d_B;
float *d_C;
cudaMalloc((void **)&d_A, size);
cudaMalloc((void **)&d_B, size);
cudaMalloc((void **)&d_C, size);

int threadsPerBlock = 1024;

int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

cudaEvent_t startEvent, stopEvent;

cudaEventCreate(&startEvent);
cudaEventCreate(&stopEvent);

cudaEventRecord(startEvent, 0);
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

cudaEventRecord(stopEvent, 0);
cudaEventSynchronize(stopEvent);

float gpuCopyTime = 0;
cudaEventElapsedTime(&gpuCopyTime, startEvent, stopEvent);

std::cout<< std::fixed << "Time to copy data to GPU: " << gpuCopyTime << " ms" << std::endl;

cudaEventRecord(startEvent, 0);

vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

cudaEventRecord(stopEvent, 0);
cudaEventSynchronize(stopEvent);

float gpuExecutionTime = 0;
cudaEventElapsedTime(&gpuExecutionTime, startEvent, stopEvent);

std::cout<< std::fixed << "Time to execute on GPU: " << gpuExecutionTime << " ms" << std::endl;

cudaEventRecord(startEvent, 0);

cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

cudaEventRecord(stopEvent, 0);
cudaEventSynchronize(stopEvent);

float gpuRetrieveTime = 0;
cudaEventElapsedTime(&gpuRetrieveTime, startEvent, stopEvent);
std::cout<< std::fixed << "Time taken to copy results back GPU: " << gpuRetrieveTime << " ms" <<
std::endl << std::endl;

float gpuDuration = (gpuCopyTime + gpuExecutionTime + gpuRetrieveTime);

std::cout << "Time taken by GPU: " << gpuDuration << " ms" << std::endl;

cudaEventDestroy(startEvent);
cudaEventDestroy(stopEvent);

auto start = std::chrono::high_resolution_clock::now();

vectorAddCpu(h_A, h_B, h_C, N);

auto stop = std::chrono::high_resolution_clock::now();

std::chrono::duration<double, std::milli> cpuDuration = (stop - start);

std::cout << "Time taken by CPU: " << cpuDuration.count() << " ms" << std::endl;
std::cout << "========================================== " << std::endl;

std::cout << "speed up (execution time only): " << cpuDuration.count() / gpuExecutionTime <<
std::endl;
std::cout << "speed up (GPU total time): " << cpuDuration.count() / gpuDuration << std::endl;

cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
free(h_A);
free(h_B);
free(h_C);

return 0;
}

3_euclidean_distance
27-08-2025 12:10 127 CMakeLists.txt

27-08-2025 12:10 2,267 euclidean_distance.cu

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(euclidean_distance CUDA)

add_executable(euclidean_distance euclidean_distance.cu)
euclidean_distance.cu
#include <cuda_runtime.h>
#include <iostream>

struct Point {
float x;
float z;
float y;
};

global void calculateEuclideanDistanceKernel(Point lineA, Point lineB, float *distances, int

numPoints) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;

if (idx < numPoints) {

float dx = lineA[idx].x - lineB[idx].x;
float dy = lineA[idx].y - lineB[idx].y;
float dz = lineA[idx].z - lineB[idx].z;

distances[idx] = sqrtf(dx * dx + dy * dy + dz * dz);

}
}

int main() {
int numPoints = 10'000'000;
size_t sizePoints = numPoints * sizeof(Point);
size_t sizeDistances = numPoints * sizeof(float);

Point h_lineA = (Point )malloc(sizePoints);

Point *h_lineB = (Point *)malloc(sizePoints);
float *h_distances = (float *)malloc(sizeDistances);

for (int i = 0; i < numPoints; i++) {

h_lineA[i].x = i * 1.0f;
h_lineA[i].y = i * 2.0f;
h_lineA[i].z = i * 3.0f;
h_lineB[i].x = i * 0.5f;
h_lineB[i].y = i * 1.5f;
h_lineB[i].z = i * 2.5f;
}

Point *d_lineA;
Point *d_lineB;
float *d_distances;
cudaMalloc((void **)&d_lineA, sizePoints);
cudaMalloc((void **)&d_lineB, sizePoints);
cudaMalloc((void **)&d_distances, sizeDistances);

cudaMemcpy(d_lineA, h_lineA, sizePoints, cudaMemcpyHostToDevice);

cudaMemcpy(d_lineB, h_lineB, sizePoints, cudaMemcpyHostToDevice);
int blockSize = 1024;
int gridSize = (numPoints + blockSize - 1) / blockSize;

cudaEvent_t start, stop;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);

calculateEuclideanDistanceKernel<<<gridSize, blockSize>>>(d_lineA, d_lineB, d_distances,

numPoints);

cudaEventRecord(stop);

cudaEventSynchronize(stop);

float gpuDuration = 0;
cudaEventElapsedTime(&gpuDuration, start, stop);

std::cout<< std::fixed << "Time taken: " << gpuDuration << " ms" << std::endl;

cudaMemcpy(h_distances, d_distances, sizeDistances, cudaMemcpyDeviceToHost);

cudaFree(d_lineA);
cudaFree(d_lineB);
cudaFree(d_distances);

free(h_lineA);
free(h_lineB);
free(h_distances);

cudaEventDestroy(start);
cudaEventDestroy(stop);

return 0;
}

Chapter 6
27-08-2025 12:10 <DIR> 1_matrix_add

27-08-2025 12:10 <DIR> 2_matrix_multiply

27-08-2025 12:10 <DIR> 3_trapezoidal_travel

27-08-2025 12:10 <DIR> 4_sort_array

27-08-2025 12:10 <DIR> 5_weighted_moving_average

27-08-2025 12:10 256 readme.md

readme.md
# Introduction

All our examples are numbered here to follow the order in which they appear on the book
chapter.

To build the examples we may simply:

1. create a `build` folder inside of the corresponding code folder
2. run cmake ..
3. run make

1_matrix_add
27-08-2025 12:10 103 CMakeLists.txt

27-08-2025 12:10 2,972 matrix_add.cu

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(matrix_add CUDA)

add_executable(matrix_add matrix_add.cu)

matrix_add.cu
#include <iostream>
#include <chrono>
#include <cuda_runtime.h>
#include <random>

#define N 7000

global void matrixAddKernel(const double* A, const double* B, double* C, int width) {

int row = threadIdx.y + blockIdx.y * blockDim.y;
int col = threadIdx.x + blockIdx.x * blockDim.x;

if (row < width && col < width) {

int idx = row * width + col;
C[idx] = A[idx] + B[idx];
}
}

void matrixAddCpu(const double* A, const double* B, double* C, int width) {

for (int i = 0; i < width; i++) {
for (int j = 0; j < width; j++) {
C[i * width + j] = A[i * width + j] + B[i * width + j];
}
}
}

void initializeMatrix(double *matrix, int size) {

for (int i = 0; i < size; i++) {
matrix[i] = static_cast<double>(rand()) / RAND_MAX;
}
}

bool checkResults(double A, double B, int size) {

for (int i = 0; i < size; i++) {
if (fabs(A[i] - B[i]) > 1e-10) {
return false;
}
}
return true;
}

int main() {

srand(static_cast<unsigned int>(time(0)));

int matrixSize = N * N * sizeof(double);

double h_A = (double)malloc(matrixSize);

double *h_B = (double*)malloc(matrixSize);
double *h_C_CPU = (double*)malloc(matrixSize);
double *h_C_GPU = (double*)malloc(matrixSize);

initializeMatrix(h_A, N * N);
initializeMatrix(h_B, N * N);

double *d_A;
double *d_B;
double *d_C;
cudaMalloc((void**)&d_A, matrixSize);
cudaMalloc((void**)&d_B, matrixSize);
cudaMalloc((void**)&d_C, matrixSize);

auto startTime = std::chrono::high_resolution_clock::now();

matrixAddCpu(h_A, h_B, h_C_CPU, N);
auto endTime = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> cpuDuration = endTime - startTime;

std::cout << "Time taken by CPU: " << cpuDuration.count() << " ms" << std::endl;

cudaMemcpy(d_A, h_A, matrixSize, cudaMemcpyHostToDevice);

cudaMemcpy(d_B, h_B, matrixSize, cudaMemcpyHostToDevice);
dim3 blockDim(16, 16);
dim3 gridDim((N + blockDim.x - 1) / blockDim.x, (N + blockDim.y - 1) / blockDim.y);

cudaEvent_t start, stop;

cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);

matrixAddKernel<<<gridDim, blockDim>>>(d_A, d_B, d_C, N);

cudaEventRecord(stop);
cudaEventSynchronize(stop);

float gpuDuration = 0;
cudaEventElapsedTime(&gpuDuration, start, stop);

std::cout << "Time taken by GPU: " << gpuDuration << " ms\n";

cudaMemcpy(h_C_GPU, d_C, matrixSize, cudaMemcpyDeviceToHost);

if (checkResults(h_C_CPU, h_C_GPU, N * N)) {

std::cout << "Results match!" << std::endl;
} else {
std::cout << "Results do not match!" << std::endl;
}

free(h_A);
free(h_B);
free(h_C_CPU);
free(h_C_GPU);
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
cudaEventDestroy(start);
cudaEventDestroy(stop);

return 0;
}

2_matrix_multiply

27-08-2025 12:10 118 CMakeLists.txt

27-08-2025 12:10 4,046 matrix_multiply.cu

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(matrix_multiply CUDA)

add_executable(matrix_multiply matrix_multiply.cu)

matrix_multiply.cu
#include <iostream>
#include <iomanip>
#include <cuda_runtime.h>
#include <chrono>

#define N 2000

__global__ void matrixMulKernel(double *A, double *B, double *C, int width) {
int row = threadIdx.y + blockIdx.y * blockDim.y;
int col = threadIdx.x + blockIdx.x * blockDim.x;

if (row < width && col < width) {

double value = 0;
for (int k = 0; k < width; k++) {
value += A[row * width + k] * B[k * width + col];
}
C[row * width + col] = value;
}
}

void matrixMulCpu(double A, double B, double *C, int width) {

for (int row = 0; row < width; row++) {
for (int col = 0; col < width; col++) {
double value = 0;
for (int k = 0; k < width; k++) {
value += A[row * width + k] * B[k * width + col];
}
C[row * width + col] = value;
}
}
}

void initializeMatrix(double *matrix, int size) {

for (int i = 0; i < size; i++) {
matrix[i] = static_cast<double>(rand()) / RAND_MAX;
}
}

bool checkResults(double A, double B, int size) {

for (int i = 0; i < size; i++) {
if (fabs(A[i] - B[i]) > 1e-10) {
return false;
}
}
return true;
}

int main() {
srand(static_cast<unsigned int>(time(0)));

int matrixSize = N * N * sizeof(double);

double h_A = (double)malloc(matrixSize);

double *h_B = (double*)malloc(matrixSize);
double *h_C_CPU = (double*)malloc(matrixSize);
double *h_C_GPU = (double*)malloc(matrixSize);

initializeMatrix(h_A, N * N);
initializeMatrix(h_B, N * N);

auto startCpu = std::chrono::high_resolution_clock::now();

matrixMulCpu(h_A, h_B, h_C_CPU, N);
auto stopCpu = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> cpuDuration = (stopCpu - startCpu);

std::cout << "Time taken by CPU: " << cpuDuration.count() << " ms" << std::endl;

double *d_A;
double *d_B;
double *d_C;
cudaMalloc((void**)&d_A, matrixSize);
cudaMalloc((void**)&d_B, matrixSize);
cudaMalloc((void**)&d_C, matrixSize);

cudaEvent_t start, stop;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);

cudaMemcpy(d_A, h_A, matrixSize, cudaMemcpyHostToDevice);

cudaMemcpy(d_B, h_B, matrixSize, cudaMemcpyHostToDevice);

cudaEventRecord(start);
dim3 blockDim(16, 16);
dim3 gridDim((N + blockDim.x - 1) / blockDim.x, (N + blockDim.y - 1) / blockDim.y);

matrixMulKernel<<<gridDim, blockDim>>>(d_A, d_B, d_C, N);

cudaEventRecord(stop);

cudaEventSynchronize(stop);
float gpuDuration = 0;
cudaEventElapsedTime(&gpuDuration, start, stop);
std::cout << "GPU execution time: " << gpuDuration << " ms" << std::endl;

cudaEventRecord(start);

cudaMemcpy(h_C_GPU, d_C, matrixSize, cudaMemcpyDeviceToHost);

cudaEventRecord(stop);

cudaEventSynchronize(stop);
float gpuTimeToRetrieve = 0;
cudaEventElapsedTime(&gpuTimeToRetrieve, start, stop);
std::cout << "GPU memory retrieve time: " << gpuTimeToRetrieve << " ms" << std::endl;
std::cout << "GPU total time: " << (gpuDuration + gpuTimeToCopy + gpuTimeToRetrieve) << " ms"
<< std::endl;

double cpuTotalTime = cpuDuration.count();

double gpuTotalTime = (gpuDuration + gpuTimeToCopy + gpuTimeToRetrieve);
std::cout << "speed up: " << cpuTotalTime / gpuTotalTime << std::endl;

if (checkResults(h_C_CPU, h_C_GPU, N * N)) {

std::cout << "Results match!" << std::endl;
} else {
std::cout << "Results do not match!" << std::endl;
}

free(h_A);
free(h_B);
free(h_C_CPU);
free(h_C_GPU);
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
cudaEventDestroy(start);
cudaEventDestroy(stop);

return 0;
}
3_trapezoidal_travel
27-08-2025 12:10 218 CMakeLists.txt

27-08-2025 12:10 2,749 trapezoidal_travel_basic.cu

27-08-2025 12:10 3,248 trapezoidal_travel_gpu_reduce.cu

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(trapezoidal_travel CUDA)

add_executable(trapezoidal_travel_basic trapezoidal_travel_basic.cu)
add_executable(trapezoidal_travel_gpu_reduce trapezoidal_travel_gpu_reduce.cu)

matrix_multiply.cu
#include <iostream>
#include <chrono>
#include <cuda_runtime.h>
#include <cstdlib>

#define N 10'000'000 // Increased number of speed measurements for GPU efficiency

#define T 0.25 // Time interval between measurements (15 minutes)

global void trapezoidalKernel(double speeds, double distances, int n) {

int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < n - 1) {
distances[i] = 0.5 * (speeds[i] + speeds[i + 1]) * T;
}
}

double trapezoidalCpu(double *speeds, int n) {

double totalDistance = 0.0;
for (int i = 0; i < n - 1; i++) {
totalDistance += 0.5 * (speeds[i] + speeds[i + 1]) * T;
}
return totalDistance;
}

void generateRandomSpeeds(double *speeds, int n) {

for (int i = 0; i < n; i++) {
speeds[i] = 40.0 + static_cast<double>(rand()) / (static_cast<double>(RAND_MAX / (80.0 -
40.0)));
}
}

int main() {
srand(static_cast<unsigned int>(time(0)));

double h_speeds = (double)malloc(N * sizeof(double));

double *h_distances = (double*)malloc( (N - 1) * sizeof(double));

generateRandomSpeeds(h_speeds, N);

auto startCPU = std::chrono::high_resolution_clock::now();

double cpuResult = trapezoidalCpu(h_speeds, N);
auto endCPU = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> cpuDuration = endCPU - startCPU;
std::cout << "CPU Distance: " << cpuResult << " km" << std::endl;
std::cout << "CPU Execution Time: " << cpuDuration.count() << " ms" << std::endl;

double d_speeds, d_distances;

cudaMalloc(&d_speeds, N * sizeof(double));
cudaMalloc(&d_distances, (N - 1) * sizeof(double));

int blockSize = 256;

int gridSize = (N + blockSize - 1) / blockSize;

cudaEvent_t startGPU, endGPU;

cudaEventCreate(&startGPU);
cudaEventCreate(&endGPU);
cudaEventRecord(startGPU);

cudaMemcpy(d_speeds, h_speeds, N * sizeof(double), cudaMemcpyHostToDevice);

trapezoidalKernel<<<gridSize, blockSize>>>(d_speeds, d_distances, N);

cudaDeviceSynchronize();

cudaMemcpy(h_distances, d_distances, (N - 1) * sizeof(double), cudaMemcpyDeviceToHost);

double gpuResult = 0.0;

for (int i = 0; i < N - 1; i++) {
gpuResult += h_distances[i];
}
cudaEventRecord(endGPU);
cudaEventSynchronize(endGPU);
float gpuDuration = 0.0;
cudaEventElapsedTime(&gpuDuration, startGPU, endGPU);

std::cout << "GPU Distance: " << gpuResult << " km" << std::endl;
std::cout << "GPU Execution Time: " << gpuDuration << " ms" << std::endl;

free(h_speeds);
free(h_distances);
cudaFree(d_speeds);
cudaFree(d_distances);
cudaEventDestroy(startGPU);
cudaEventDestroy(endGPU);
return 0;
}
trapezoidal_travel_gpu_reduce.cu
#include <iostream>
#include <chrono>
#include <cuda_runtime.h>
#include <cstdlib>

#define N 10'000'000 // Increased number of speed measurements for GPU efficiency

#define T 0.25 // Time interval between measurements (15 minutes)

global void trapezoidalKernel(double speeds, double result, int n) {

extern __shared__ double sharedData[];
int tid = threadIdx.x;
int i = threadIdx.x + blockIdx.x * blockDim.x;

double localDistance = 0.0;

if (i < n - 1) {
localDistance = 0.5 * (speeds[i] + speeds[i + 1]) * T;
}
sharedData[tid] = localDistance;
__syncthreads();

for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {

if (tid < stride) {
sharedData[tid] += sharedData[tid + stride];
}
__syncthreads();
}

if (tid == 0) {
result[blockIdx.x] = sharedData[0];
}
}

double trapezoidalCpu(double *speeds, int n) {

double totalDistance = 0.0;
for (int i = 0; i < n - 1; i++) {
totalDistance += 0.5 * (speeds[i] + speeds[i + 1]) * T;
}
return totalDistance;
}

void generateRandomSpeeds(double *speeds, int n) {

for (int i = 0; i < n; i++) {
speeds[i] = 40.0 + static_cast<double>(rand()) / (static_cast<double>(RAND_MAX / (80.0 -
40.0)));
}
}

int main() {
srand(static_cast<unsigned int>(time(0)));
double *h_speeds = (double*)malloc(N * sizeof(double));
generateRandomSpeeds(h_speeds, N);

auto startCPU = std::chrono::high_resolution_clock::now();

int blockSize = 256;

int gridSize = (N + blockSize - 1) / blockSize;

double d_speeds, d_partialResults;

cudaMalloc(&d_speeds, N * sizeof(double));
cudaMalloc(&d_partialResults, gridSize * sizeof(double));

cudaMemcpy(d_speeds, h_speeds, N * sizeof(double), cudaMemcpyHostToDevice);

cudaEvent_t startGPU, endGPU;

cudaEventCreate(&startGPU);
cudaEventCreate(&endGPU);
cudaEventRecord(startGPU);

trapezoidalKernel<<<gridSize, blockSize, blockSize * sizeof(double)>>>(d_speeds, d_partialResults,

N);
cudaDeviceSynchronize();

double h_partialResults = (double)malloc(gridSize * sizeof(double));

cudaMemcpy(h_partialResults, d_partialResults, gridSize * sizeof(double),
cudaMemcpyDeviceToHost);

double gpuResult = 0.0;

for (int i = 0; i < gridSize; i++) {
gpuResult += h_partialResults[i];
}

cudaEventRecord(endGPU);
cudaEventSynchronize(endGPU);

float gpuDuration = 0.0;

cudaEventElapsedTime(&gpuDuration, startGPU, endGPU);

std::cout << "GPU Distance: " << gpuResult << " km" << std::endl;
std::cout << "GPU Execution Time: " << gpuDuration << " ms" << std::endl;

free(h_speeds);
free(h_partialResults);
cudaFree(d_speeds);
cudaFree(d_partialResults);
cudaEventDestroy(startGPU);
cudaEventDestroy(endGPU);

return 0;

4_sort_array
27-08-2025 12:10 103 CMakeLists.txt

27-08-2025 12:10 3,742 sort_array.cu

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(sort_array CUDA)

add_executable(sort_array sort_array.cu)

sort_array.cu
#include <iostream>
#include <vector>
#include <chrono>
#include <algorithm>
#include <cfloat>
#include <iomanip>
#include <cuda_runtime.h>

__global__ void oddEvenSortStepKernel(double *arr, int size, bool *swapped, bool isOddPhase) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int i = isOddPhase ? 2 * idx + 1 : 2 * idx;

if (i < size - 1) {
if (arr[i] > arr[i + 1]) {
double temp = arr[i];
arr[i] = arr[i + 1];
arr[i + 1] = temp;
*swapped = true;
}
}
}

void oddEvenSortGpu(double *arr, int size) {

double *d_arr;
bool *d_swapped;
bool h_swapped;
int threads = 256;
int blocks = (size + threads - 1) / threads;

cudaMalloc((void **)&d_arr, size * sizeof(double));

cudaMalloc((void **)&d_swapped, sizeof(bool));

cudaMemcpy(d_arr, arr, size * sizeof(double), cudaMemcpyHostToDevice);

do {
h_swapped = false;
cudaMemcpy(d_swapped, &h_swapped, sizeof(bool), cudaMemcpyHostToDevice);

// Odd phase
oddEvenSortStepKernel<<<blocks, threads>>>(d_arr, size, d_swapped, true);
cudaDeviceSynchronize();

// Even phase
oddEvenSortStepKernel<<<blocks, threads>>>(d_arr, size, d_swapped, false);
cudaDeviceSynchronize();

// Check if any swaps occurred

cudaMemcpy(&h_swapped, d_swapped, sizeof(bool), cudaMemcpyDeviceToHost);

} while (h_swapped);

cudaMemcpy(arr, d_arr, size * sizeof(double), cudaMemcpyDeviceToHost);

cudaFree(d_arr);
cudaFree(d_swapped);
}

void oddEvenSortCpu(double *arr, int size) {

bool swapped;

do {
swapped = false;

// Odd phase
for (int i = 1; i < size - 1; i += 2) {
if (arr[i] > arr[i + 1]) {
std::swap(arr[i], arr[i + 1]);
swapped = true;
}
}

// Even phase
for (int i = 0; i < size - 1; i += 2) {
if (arr[i] > arr[i + 1]) {
std::swap(arr[i], arr[i + 1]);
swapped = true;
}
}
} while (swapped);
}

int main() {
srand(static_cast<unsigned int>(time(0)));

int n = 100'000;
double *h_data = (double*)malloc(n * sizeof(double));
double *h_data_gpu = (double*)malloc(n * sizeof(double));

for (int i = 0; i < n; i++) {

h_data[i] = static_cast<double>(rand()) / RAND_MAX;
}

double* d_data;
cudaMalloc(&d_data, n * sizeof(double));
cudaMemcpy(d_data, h_data, n * sizeof(double), cudaMemcpyHostToDevice);

cudaEvent_t startGpu, stopGpu;

cudaEventCreate(&startGpu);
cudaEventCreate(&stopGpu);

cudaEventRecord(startGpu);

oddEvenSortGpu(d_data, n);

cudaEventRecord(stopGpu);

cudaEventSynchronize(stopGpu);
float gpuDuration;
cudaEventElapsedTime(&gpuDuration, startGpu, stopGpu);
std::cout << "GPU sorting time: " << gpuDuration << " ms" << std::endl;

cudaMemcpy(h_data_gpu, d_data, n * sizeof(double), cudaMemcpyDeviceToHost);

auto start = std::chrono::high_resolution_clock::now();

oddEvenSortCpu(h_data, n);
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> cpuDuration = end - start;
std::cout << "CPU sorting time: " << cpuDuration.count() << " ms" << std::endl;

double max_difference = 0;
for (int i = 0; i < n; i++) {
max_difference = std::max(max_difference, std::abs(h_data[i] - h_data_gpu[i]));
}
std::cout << "Max difference between CPU and GPU results: " << max_difference << std::endl;

free(h_data);
free(h_data_gpu);
cudaFree(d_data);
cudaEventDestroy(startGpu);
cudaEventDestroy(stopGpu);

return 0;
}

5_weighted_moving_average

27-08-2025 12:10 142 CMakeLists.txt

27-08-2025 12:10 5,657 weighted_moving_average.cu

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(weighted_moving_average CUDA)

add_executable(weighted_moving_average weighted_moving_average.cu)

weighted_moving_average.cu

#include <iostream>
#include <cuda_runtime.h>
#include <vector>
#include <cstdlib>
#include <chrono>

#define NUM_SENSORS 50'000'000

#define NUM_READINGS 3

void rotateIndices(int *indices) {

int newData = indices[0];
indices[0] = indices[1];
indices[1] = indices[2];
indices[2] = newData;
}

__global__ void smoothSensorsKernel(float *buffers, int *indices, float *output, float *weights) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx >= NUM_SENSORS) return;

float val = 0.0f;

for (int i=0; i < NUM_READINGS; i++){
val = weights[i] * buffers[indices[i] * NUM_READINGS + idx];
}

output[idx] = val;
}

void smoothSensorsCpu(float buffers, int indices, float output, float weights){

for (int i = 0; i < NUM_READINGS; i++) {
for (int j = 0; j < NUM_SENSORS; j++){
output[j] = weights[i] * buffers[indices[i] * NUM_READINGS + j];
}
}
}

bool checkResults(float A, float B, int size) {

for (int i = 0; i < size; i++) {
if (fabs(A[i] - B[i]) > 1e-10) {
std::cerr << "Mismatch at index " << i <<
" CPU: " << A[i] <<
" GPU: " << B[i] <<
" diff=" << abs(A[i] - B[i]) << std::endl;
return false;
}
}
return true;
}

void initializeBuffer(float *h_buffers, int bufferId)

{
for (int j = 0; j < NUM_SENSORS; j++) {
h_buffers[bufferId + j] = 1.0f + ((rand() % 100) / 1000.0f);
}
}

int main() {
srand(static_cast<unsigned int>(time(0)));

int bufferSize = NUM_SENSORS * sizeof(float);

float h_weights[NUM_READINGS] = {0.2f, 0.3f, 0.5f};

int h_indices[NUM_READINGS] = {0, 1, 2};
float *h_buffers = (float*) malloc(NUM_READINGS * bufferSize);
float *h_output_GPU = (float*) malloc(bufferSize);
float *h_output_CPU = (float*) malloc(bufferSize);

float *d_buffers;
float *d_output;
float *d_weights;
int *d_indices;

std::chrono::duration<double, std::milli> cpuGlobalTime;

int threads = 256;
int blocks = (NUM_SENSORS + threads - 1) / threads;
float gpuGlobalTime = 0;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaMalloc(&d_buffers, NUM_READINGS * bufferSize);

cudaMalloc(&d_output, bufferSize);
cudaMalloc(&d_indices, NUM_READINGS * sizeof(int));
cudaMalloc(&d_weights, NUM_READINGS * sizeof(float));

cudaMemcpy(d_weights, h_weights, NUM_READINGS * sizeof(float), cudaMemcpyHostToDevice);

for(int i = 0; i < 4; i++) {

std::cout << i << std::endl;

int newDataIdx = h_indices[0];

if (i == 0) {
for(int j = 0; j < NUM_READINGS; j++) {
initializeBuffer(h_buffers, j);
}
} else {
newDataIdx = h_indices[0];
rotateIndices(h_indices);
initializeBuffer(h_buffers, newDataIdx); //one new reading for all sensors
}

auto startCpu = std::chrono::high_resolution_clock::now();

smoothSensorsCpu(h_buffers, h_indices, h_output_CPU, h_weights);
auto stopCpu = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> cpuDuration = (stopCpu - startCpu);
cpuGlobalTime += cpuDuration;
std::cout << "Time taken by CPU: " << cpuDuration.count() << " ms" << std::endl;

cudaEventRecord(start);

cudaMemcpy(d_indices, h_indices, NUM_READINGS * sizeof(int), cudaMemcpyHostToDevice);

if (i == 0) {
cudaMemcpy(d_buffers, h_buffers, NUM_READINGS * bufferSize,
cudaMemcpyHostToDevice);
} else {
cudaMemcpy(&d_buffers[newDataIdx], &h_buffers[newDataIdx], bufferSize,
cudaMemcpyHostToDevice);
}

cudaEventRecord(stop);
cudaEventSynchronize(stop);
float gpuTimeToCopy = 0;
cudaEventElapsedTime(&gpuTimeToCopy, start, stop);
std::cout << "GPU memory copy time: " << gpuTimeToCopy << " ms" << std::endl;
cudaEventRecord(start);
smoothSensorsKernel<<<blocks, threads>>>(d_buffers, d_indices, d_output, d_weights);

cudaEventRecord(stop);

cudaEventSynchronize(stop);
float gpuDuration = 0;
cudaEventElapsedTime(&gpuDuration, start, stop);
std::cout << "GPU execution time: " << gpuDuration << " ms" << std::endl;

cudaEventRecord(start);
cudaMemcpy(h_output_GPU, d_output, bufferSize, cudaMemcpyDeviceToHost);
cudaEventRecord(stop);

cudaEventSynchronize(stop);
float gpuTimeToRetrieve = 0;
cudaEventElapsedTime(&gpuTimeToRetrieve, start, stop);
std::cout << "GPU memory retrieve time: " << gpuTimeToRetrieve << " ms" << std::endl;

float gpuTotalTime = (gpuDuration + gpuTimeToCopy + gpuTimeToRetrieve);

gpuGlobalTime += gpuTotalTime;
std::cout << "GPU total time: " << gpuTotalTime << " ms" << std::endl;

if (!checkResults(h_output_CPU, h_output_GPU, NUM_SENSORS)) {

std::cout << "Results do not match!" << std::endl;
}
}

std::cout << "=============================================" << std::endl;

std::cout << "CPU global time: " << cpuGlobalTime.count() << " ms" << std::endl;
std::cout << "GPU global time: " << gpuGlobalTime << " ms" << std::endl;

std::cout << "speed up over all executions: " << cpuGlobalTime.count() / gpuGlobalTime <<
std::endl;

cudaFree(d_buffers);
cudaFree(d_output);
cudaFree(d_indices);
cudaFree(d_weights);
free(h_output_CPU);
free(h_output_GPU);
free(h_buffers);

return 0;
}
Chapter 7
27-08-2025 12:10 315 CMakeLists.txt

27-08-2025 12:10 7,616 matrix_multiply_float.cu

27-08-2025 12:10 256 readme.md

readme.md
# Introduction

All our examples are numbered here to follow the order in which they appear on the book
chapter.

To build the examples we may simply:

1. create a `build` folder inside of the corresponding code folder
2. run cmake ..
3. run make

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(performance_strategies CUDA)

set(CMAKE_CXX_FLAGS_DEBUG "-g -O0 -Wall")

set(CMAKE_CUDA_FLAGS_DEBUG "-G -g -O0")

set(CMAKE_CXX_FLAGS_RELEASE "-O3")
set(CMAKE_CUDA_FLAGS_RELEASE "-O3 -arch=sm_75 -rdc=false")

add_executable(matrix_multiply_float matrix_multiply_float.cu)

matrix_multiply_float.cu
#include <chrono>
#include <cuda_runtime.h>
#include <iostream>
#include <iomanip>

#define N 2000 // Define matrix size

#define TILE 16 // Define the tile size

__global__ void matrixMulKernel(float *A, float *B, float *C, int width) {
__shared__ float Asub[TILE][TILE];
__shared__ float Bsub[TILE][TILE];

int tx = threadIdx.x;
int ty = threadIdx.y;
int row = ty + blockIdx.y * blockDim.y;
int col = tx + blockIdx.x * blockDim.x;

float sum = 0.0f;

for (int i = 0; i < (width + blockDim.x - 1) / blockDim.x; i++) {

if (row < width && (i * blockDim.x + tx) < width)
Asub[ty][tx] = A[row * width + i * blockDim.x + tx];
else
Asub[ty][tx] = 0.0f;

if (col < width && (i * blockDim.y + ty) < width)

Bsub[ty][tx] = B[(i * blockDim.y + ty) * width + col];
else
Bsub[ty][tx] = 0.0f;

__syncthreads();

#pragma unroll
for (int k = 0; k < blockDim.x; k++) {
sum = fmaf(Asub[ty][k], Bsub[k][tx], sum);
}

__syncthreads();
}

if (row < width && col < width)

C[row * width + col] = sum;
}

global void matrixMulKernel_shared(float* A, float* B, float* C, int n) {

// Allocate shared memory
__shared__ float Asub[TILE][TILE];
__shared__ float Bsub[TILE][TILE];

int tx = threadIdx.x;
int ty = threadIdx.y;
int row = ty + blockIdx.y * blockDim.y;
int col = tx + blockIdx.x * blockDim.x;

float sum = 0.0f;

for (int i = 0; i < (n + blockDim.x - 1) / blockDim.x; i++) {

if (row < n && (i * blockDim.x + tx) < n)
Asub[ty][tx] = A[row * n + i * blockDim.x + tx];
else
Asub[ty][tx] = 0.0f;

if (col < n && (i * blockDim.y + ty) < n)

Bsub[ty][tx] = B[(i * blockDim.y + ty) * n + col];
else
Bsub[ty][tx] = 0.0f;

__syncthreads();

#pragma unroll
for (int k = 0; k < blockDim.x; k++) {
sum += Asub[ty][k] * Bsub[k][tx];
// Use fused multiply-add
//sum = fmaf(Asub[ty][k], Bsub[k][tx], sum);
}

__syncthreads();
}

if (row < n && col < n)

C[row * n + col] = sum;
}

__global__ void matrixMulKernel_naive(float *A, float *B, float *C, int width) {
int row = threadIdx.y + blockIdx.y * blockDim.y;
int col = threadIdx.x + blockIdx.x * blockDim.x;

if (row < width && col < width) {

float value = 0;
for (int k = 0; k < width; k++) {
value += A[row * width + k] * B[k * width + col];
}
C[row * width + col] = value;
}
}

__global__ void matrixMulKernel_row(float *A, float *B, float *C, int width) {
int row = threadIdx.x + blockIdx.x * blockDim.x;
if (row < width) {
for (int col = 0; col < width; col++) {
float sum = 0.0f;
for (int i = 0; i < width; i++) {
sum += A[row * width + i] * B[i * N + col];
}
C[row * width + col] = sum;
}
}
}

__global__ void matrixMulKernel_col(float *A, float *B, float *C, int width) {
int col = threadIdx.x + blockIdx.x * blockDim.x;
if (col < width) {
for (int row = 0; row < width; row++) {
float sum = 0.0f;
for (int i = 0; i < width; i++) {
sum += A[row * width + i] * B[i * N + col];
}
C[row * width + col] = sum;
}
}
}

void matrixMulCpu(float A, float B, float *C, int width) {

for (int row = 0; row < width; row++) {
for (int col = 0; col < width; col++) {
float value = 0;
for (int k = 0; k < width; k++) {
value += A[row * width + k] * B[k * width + col];
}
C[row * width + col] = value;
}
}
}

void initializeMatrix(float *matrix, int size) {

for (int i = 0; i < size; i++) {
matrix[i] = static_cast<float>(rand()) / RAND_MAX;
}
}

bool checkResults(float A, float B, int size) {

for (int i = 0; i < size; i++) {
if (fabs(A[i] - B[i]) > 1e-4) {
std::cout << "fabs(A[i])=" << fabs(A[i]) << std::endl;
std::cout << "B[i]=" << B[i] << std::endl;
std::cout << std::fixed << std::showpoint;
std::cout << std::setprecision(15);
std::cout << "fabs(A[i] - B[i])=" << fabs(A[i] - B[i]) << std::endl;
std::cout << "#####" << std::endl;
return false;
}
}
return true;
}

int main() {
srand(static_cast<unsigned int>(time(0)));

int matrixSize = N * N * sizeof(float);

float h_A = (float)malloc(matrixSize);

float *h_B = (float*)malloc(matrixSize);
float *h_C_CPU = (float*)malloc(matrixSize);
float *h_C_GPU = (float*)malloc(matrixSize);
initializeMatrix(h_A, N * N);
initializeMatrix(h_B, N * N);

auto startCpu = std::chrono::high_resolution_clock::now();

matrixMulCpu(h_A, h_B, h_C_CPU, N);
auto stopCpu = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> cpuDuration = (stopCpu - startCpu);

std::cout << "Time taken by CPU: " << cpuDuration.count() << " ms" << std::endl;

float *d_A;
float *d_B;
float *d_C;
cudaMalloc((void**)&d_A, matrixSize);
cudaMalloc((void**)&d_B, matrixSize);
cudaMalloc((void**)&d_C, matrixSize);

cudaEvent_t start, stop;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
cudaMemcpy(d_A, h_A, matrixSize, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, matrixSize, cudaMemcpyHostToDevice);

cudaEventRecord(start);

dim3 blockDim(TILE, TILE);

dim3 gridDim((N + blockDim.x - 1) / blockDim.x, (N + blockDim.y - 1) / blockDim.y);

matrixMulKernel<<<gridDim, blockDim>>>(d_A, d_B, d_C, N);

// for those kernels we need a different launch configuration

// int threads = 256;
// int blocks = (N + threads - 1) / threads ;
// matrixMulKernel_row<<<blocks, threads>>>(d_A, d_B, d_C, N);
// matrixMulKernel_col<<<blocks, threads>>>(d_A, d_B, d_C, N);

cudaEventRecord(stop);

cudaEventSynchronize(stop);
float gpuTime = 0;
cudaEventElapsedTime(&gpuTime, start, stop);
std::cout << "GPU execution time: " << gpuTime << " ms" << std::endl;
cudaEventRecord(start);

cudaMemcpy(h_C_GPU, d_C, matrixSize, cudaMemcpyDeviceToHost);

cudaEventRecord(stop);

cudaEventSynchronize(stop);
float gpuTimeToRetrieve = 0;
cudaEventElapsedTime(&gpuTimeToRetrieve, start, stop);
std::cout << "GPU memory retrieve time: " << gpuTimeToRetrieve << " ms" << std::endl;
std::cout << "GPU total time: " << (gpuTime + gpuTimeToCopy + gpuTimeToRetrieve) << " ms" <<
std::endl;

double cpuTotalTime = cpuDuration.count();

double gpuTotalTime = (gpuTime + gpuTimeToCopy + gpuTimeToRetrieve);
std::cout << "speed up: " << cpuTotalTime / gpuTotalTime << std::endl;

if (checkResults(h_C_CPU, h_C_GPU, N * N)) {

std::cout << "Results match!" << std::endl;
} else {
std::cout << "Results do not match!" << std::endl;
}

free(h_A);
free(h_B);
free(h_C_CPU);
free(h_C_GPU);
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
cudaEventDestroy(start);
cudaEventDestroy(stop);

return 0;
}

Chapter 8
27-08-2025 12:10 <DIR> 1_vector_add

27-08-2025 12:10 <DIR> 2_mem_bandwidth

27-08-2025 12:10 <DIR> 3_multiplying_with_streams

27-08-2025 12:10 <DIR> 4_multigpu

27-08-2025 12:10 256 readme.md

readme
# Introduction

All our examples are numbered here to follow the order in which they appear on the book
chapter.

To build the examples we may simply:

1. create a `build` folder inside of the corresponding code folder
2. run cmake ..
3. run make

1_vector_add
27-08-2025 12:10 191 CMakeLists.txt

27-08-2025 12:10 1,640 vector_add.cu

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(overlay_things CUDA)

set(CMAKE_CXX_FLAGS_DEBUG "-g -O0 -Wall")

set(CMAKE_CUDA_FLAGS_DEBUG "-G -g -O0")

add_executable(vector_add vector_add.cu)

vector_add.cu

#include <iostream>
#include <cuda_runtime.h>

global void vectorAddKernel(int a, int b, int *c, int N) {

int idx = threadIdx.x + blockIdx.x * blockDim.x;

if (idx < N) {
c[idx] = a[idx] * b[idx];
}

void vectorAddCpu(int a, int b, int *c, int N) {

for (int idx = 0; idx < N; idx++) {

c[idx] = a[idx] + b[idx];
}

void checkResult(int cpuRes, int gpuRes, int N) {

for (int i = 0; i < N; ++i) {
if (abs(cpuRes[i] - gpuRes[i]) > 1e-5) {
std::cerr << "Mismatch at index " << i <<
" CPU: " << cpuRes[i] <<
" GPU: " << gpuRes[i] <<
" diff=" << abs(cpuRes[i] - gpuRes[i]) << std::endl;
return;
}
}
std::cout << "Results match!" << std::endl;
}

int main() {
const int N = 130;
int h_a[N];
int h_b[N];
int h_c[N];
int cpuResult[N];
int *d_a;
int *d_b;
int *d_c;

for (int i = 0; i < N; i++) {

h_a[i] = i;
h_b[i] = i * 2;
}

vectorAddCpu(h_a, h_b, cpuResult, N);

cudaMalloc(&d_a, N * sizeof(int));
cudaMalloc(&d_b, N * sizeof(int));
cudaMalloc(&d_c, N * sizeof(int));

cudaMemcpy(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice);

cudaMemcpy(d_b, h_b, N * sizeof(int), cudaMemcpyHostToDevice);

int blockSize = 128;

int gridSize = (N + blockSize - 1) / blockSize;

vectorAddKernel<<<gridSize, blockSize>>>(d_a, d_b, d_c, N);

cudaMemcpy(h_c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);

checkResult(cpuResult, h_c, N);

cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);

return 0;
}

2_mem_bandwidth
27-08-2025 12:10 166 CMakeLists.txt

27-08-2025 12:10 1,675 memcpy_benchmark.cu

27-08-2025 12:10 1,245 mem_bandwidth.cu

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(overlay_things CUDA)

add_executable(mem_bandwidth mem_bandwidth.cu)
add_executable(memcpy_benchmark memcpy_benchmark.cu)

memcpy_benchmark.cu

#include <cuda_runtime.h>
#include <iostream>

int main() {
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

int memClockMHz = prop.memoryClockRate / 1000; // Convert to MHz

int busWidth = prop.memoryBusWidth; // In bits
float bandwidthInBytes = memClockMHz * busWidth / 8; // In bytes
float bandwidthInGB = bandwidthInBytes / 1024; // In GigaBytes
float bandwidthGBs = 2.0f * bandwidthInGB; // DDR memory is twice the clock rate

int device;
cudaGetDevice(&device);

int concurrentKernels = 0;
int asyncEngineCount = 0;

cudaDeviceGetAttribute(&concurrentKernels, cudaDevAttrConcurrentKernels, device);

cudaDeviceGetAttribute(&asyncEngineCount, cudaDevAttrAsyncEngineCount, device);

std::cout << "Device " << device << ":" << std::endl;
std::cout << " Concurrent Kernel Execution: " << (concurrentKernels ? "Yes" : "No") << std::endl;
std::cout << " Async Engine Count: " << asyncEngineCount << std::endl;

std::cout << "Memory Clock Speed: " << memClockMHz << " MHz" << std::endl;
std::cout << "Memory Bus Width: " << busWidth << " bits" << std::endl;
std::cout << "Estimated Memory Bandwidth: " << bandwidthGBs << " GB/s" << std::endl;

return 0;
}

mem_bandwidth.cu
#include <cuda_runtime.h>
#include <iostream>

#define SIZE_MB(x) (long(x) * 1024 * 1024) // Convert MB to bytes

void measureMemcpyBandwidth(long dataSize) {

float *h_data, *d_data;
cudaMallocHost(&h_data, dataSize);
cudaMalloc(&d_data, dataSize);

cudaEvent_t start, stop;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
cudaMemcpyAsync(d_data, h_data, dataSize, cudaMemcpyHostToDevice);
cudaEventRecord(stop);

cudaEventSynchronize(stop);

float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
float bandwidth = (dataSize / (milliseconds / 1000.0)) / (1024.0 * 1024.0 * 1024.0);

std::cout << "Data Size: " << dataSize / (1024 * 1024) << " MB, Time: " << milliseconds << " ms,
Bandwidth: " << bandwidth << " GB/s\n";

cudaFreeHost(h_data);
cudaFree(d_data);
}

int main() {
std::cout << "Testing different transfer sizes:\n";
measureMemcpyBandwidth(SIZE_MB(1));
measureMemcpyBandwidth(SIZE_MB(10));
measureMemcpyBandwidth(SIZE_MB(20));
measureMemcpyBandwidth(SIZE_MB(30));
measureMemcpyBandwidth(SIZE_MB(32));
measureMemcpyBandwidth(SIZE_MB(40));
measureMemcpyBandwidth(SIZE_MB(50));
measureMemcpyBandwidth(SIZE_MB(100));
measureMemcpyBandwidth(SIZE_MB(200));
measureMemcpyBandwidth(SIZE_MB(300));
measureMemcpyBandwidth(SIZE_MB(400));
measureMemcpyBandwidth(SIZE_MB(500));
measureMemcpyBandwidth(SIZE_MB(600));
measureMemcpyBandwidth(SIZE_MB(700));
measureMemcpyBandwidth(SIZE_MB(800));
measureMemcpyBandwidth(SIZE_MB(900));
measureMemcpyBandwidth(SIZE_MB(1024)); // 1GB

return 0;
}

3_multiplying_with_streams
27-08-2025 12:10 189 CMakeLists.txt

27-08-2025 12:10 3,178 vec_mat_mul_nostreams.cu

27-08-2025 12:10 6,867 vec_mat_mul_streams.cu

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(overlay_things CUDA)

add_executable(vec_mat_mul_streams vec_mat_mul_streams.cu)
add_executable(vec_mat_mul_nostreams vec_mat_mul_nostreams.cu)

vec_mat_mul_nostreams.cu

#include <cuda_runtime.h>
#include <iostream>
#include <vector>
#include <cassert>
#include <chrono>

#define BLOCK_SIZE 256

__global__ void vectorMatrixMulKernel(float* d_vec, float* d_mat, float* d_res, int rows, int cols) {
int row = threadIdx.x + blockIdx.x * blockDim.x;
if (row < rows) {
float sum = 0.0f;
for (int col = 0; col < cols; col++) {
sum += d_mat[row * cols + col] * d_vec[col];
}
d_res[row] = sum;
}
}

void vectorMatrixMulCpu(float *vec, float *mat, float *res, int rows, int cols) {
for (int i = 0; i < rows; i++) {
res[i] = 0;
for (int j = 0; j < cols; j++) {
res[i] += mat[i * cols + j] * vec[j];
}
}
}

void checkResult(float cpuRes, float gpuRes, int size) {

for (int i = 0; i < size; i++) {
if (fabs(cpuRes[i] - gpuRes[i]) > 1e-3) {
std::cerr << "Mismatch at " << i << " CPU: " << cpuRes[i] << " GPU: " << gpuRes[i] << " diff=" <<
abs(cpuRes[i] - gpuRes[i]) << std::endl;
return;
}
}
std::cout << "Results match!" << std::endl;
}

void initializeMatrix(float *matrix, int size) {

for (int i = 0; i < size; i++) {
matrix[i] = static_cast<float>(rand()) / RAND_MAX;
}
}

void compute(int cols) {

int rows = cols;

float h_vec = (float)malloc(rows * sizeof(float));

float *h_mat = (float*)malloc(rows * cols * sizeof(float));
initializeMatrix(h_vec, cols);
initializeMatrix(h_mat, rows * cols);

float h_res_gpu = (float)malloc(rows * sizeof(float));

float *h_res_cpu = (float*)malloc(rows * sizeof(float));

float *d_vec;
float *d_mat;
float *d_res;
cudaMalloc(&d_vec, cols * sizeof(float));
cudaMalloc(&d_mat, rows * cols * sizeof(float));
cudaMalloc(&d_res, rows * sizeof(float));

cudaEvent_t start, stop;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
cudaMemcpy(d_vec, h_vec, cols * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_mat, h_mat, rows * cols * sizeof(float), cudaMemcpyHostToDevice);

int blocks = (rows + BLOCK_SIZE - 1) / BLOCK_SIZE;

vectorMatrixMulKernel<<<blocks, BLOCK_SIZE>>>(d_vec, d_mat, d_res, rows, cols);
cudaMemcpy(h_res_gpu, d_res, rows * sizeof(float), cudaMemcpyDeviceToHost);

cudaEventRecord(stop);

cudaEventSynchronize(stop);

float gpuDuration = 0;
cudaEventElapsedTime(&gpuDuration, start, stop);
std::cout << "Time taken by GPU: " << gpuDuration << " ms, for matrix: " << cols << "x" << cols <<
std::endl;

auto startCpu = std::chrono::high_resolution_clock::now();

vectorMatrixMulCpu(h_vec, h_mat, h_res_cpu, rows, cols);
auto stopCpu = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> cpuDuration = (stopCpu - startCpu);

std::cout << "Time taken by CPU: " << cpuDuration.count() << " ms" << std::endl;

checkResult(h_res_cpu, h_res_gpu, rows);

cudaFree(d_vec);
cudaFree(d_mat);
cudaFree(d_res);

int main() {
srand(static_cast<unsigned int>(time(0)));

compute(16380);
compute(32760);

return 0;
}

vec_mat_mul_streams.cu
#include <cuda_runtime.h>
#include <iostream>
#include <vector>
#include <cassert>
#include <chrono>

#define BLOCK_SIZE 256

void vectorMatrixMulCpu(float* vec, float* mat, float* res, int rows, int cols) {
for (int i = 0; i < rows; i++) {
res[i] = 0;
for (int j = 0; j < cols; j++) {
res[i] += mat[i * cols + j] * vec[j];
}
}
}

void checkResult(float* cpuRes, float* gpuRes, int size) {

void initializeMatrix(float *matrix, int size) {

for (int i = 0; i < size; i++) {
matrix[i] = static_cast<float>(rand()) / RAND_MAX;
}
}

float compute(int chunkSize, int cols, bool computeCpuPart) {

int rows = cols;
float *h_vec = (float*)malloc(rows * sizeof(float));
float *h_res_cpu = (float*)malloc(rows * sizeof(float));
float *h_mat_pinned, *h_res_gpu;
cudaMallocHost(&h_mat_pinned, rows * cols * sizeof(float));
cudaMallocHost(&h_res_gpu, rows * sizeof(float));

initializeMatrix(h_vec, cols);
initializeMatrix(h_mat_pinned, rows * cols);

float *d_vec;
float *d_mat1;
float *d_mat2;
float *d_res1;
float *d_res2;
cudaMalloc(&d_vec, cols * sizeof(float));
cudaMalloc(&d_mat1, chunkSize * cols * sizeof(float));
cudaMalloc(&d_mat2, chunkSize * cols * sizeof(float));
cudaMalloc(&d_res1, chunkSize * sizeof(float));
cudaMalloc(&d_res2, chunkSize * sizeof(float));

cudaEvent_t start, stop;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);

cudaMemcpy(d_vec, h_vec, cols * sizeof(float), cudaMemcpyHostToDevice);

cudaStream_t stream1, stream2;

cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

int blocks = (chunkSize + BLOCK_SIZE - 1) / BLOCK_SIZE;

for (int i = 0; i < rows; i += chunkSize * 2) {

// Stream 1: Process first chunk
cudaMemcpyAsync(d_mat1, h_mat_pinned + i * cols, chunkSize * cols * sizeof(float),
cudaMemcpyHostToDevice, stream1);

vectorMatrixMulKernel<<<blocks, BLOCK_SIZE, 0, stream1>>>(d_vec, d_mat1, d_res1,

chunkSize, cols);

cudaMemcpyAsync(h_res_gpu + i, d_res1, chunkSize * sizeof(float), cudaMemcpyDeviceToHost,

stream1);

// Stream 2: Process second chunk

cudaMemcpyAsync(d_mat2, h_mat_pinned + (i + chunkSize) * cols, chunkSize * cols *
sizeof(float), cudaMemcpyHostToDevice, stream2);

vectorMatrixMulKernel<<<blocks, BLOCK_SIZE, 0, stream2>>>(d_vec, d_mat2, d_res2,

chunkSize, cols);

cudaMemcpyAsync(h_res_gpu + i + chunkSize, d_res2, chunkSize * sizeof(float),

cudaMemcpyDeviceToHost, stream2);
}
// Ensure first chunk has completed before reusing buffers
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);

cudaEventRecord(stop);

cudaEventSynchronize(stop);
float gpuDuration = 0;
cudaEventElapsedTime(&gpuDuration, start, stop);
std::cout << "Time taken by GPU: " << gpuDuration << " ms, for matrix: " << cols << "x" << cols << "
chunk size: " << chunkSize << std::endl;

if (computeCpuPart) {
auto startCpu = std::chrono::high_resolution_clock::now();
vectorMatrixMulCpu(h_vec, h_mat_pinned, h_res_cpu, rows, cols);
auto stopCpu = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> cpuDuration = (stopCpu - startCpu);

std::cout << "Time taken by CPU: " << cpuDuration.count() << " ms" << std::endl;

checkResult(h_res_cpu, h_res_gpu, rows);

}

cudaFreeHost(h_mat_pinned);
cudaFreeHost(h_res_gpu);
cudaFree(d_vec);
cudaFree(d_mat1);
cudaFree(d_mat2);
cudaFree(d_res1);
cudaFree(d_res2);
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);

return gpuDuration;
}

/*
* The average function will collect the gpuDuration for each chunkSize and matrix size
* over the iterations defined and calculate the percetage gain against the provided
* reference value of the execution time for the non streams application version.
*
* Note: you have to run the non streams version with the same matrix size.
*/
void average(int chunkSize, int cols, int iterations, float noStreamsReferenceTime) {
long dataSize = (chunkSize * cols);
long totalDataSize = (cols * cols);

double sum = 0.0f;

for(int i = 0; i < iterations; i++) {
sum += compute(chunkSize, cols, false);
}
double average = sum / iterations;

std::cout << " chunk: " << chunkSize

<< " data size: " << dataSize / 1024 / 1024 * 4 << "MB "
<< " total data size: " << totalDataSize / 1024 / 1024 * 4 << "MB "
<< " data partitions: " << (double)totalDataSize / (double)dataSize << " "
<< " average time: " << average << " ms"
<< " percentage gain: " << 100 - (average / noStreamsReferenceTime * 100) << "%"
<< std::endl;
}

int main() {
srand(static_cast<unsigned int>(time(0)));

compute( 273, 16380, true);

compute( 546, 16380, true);
compute( 819, 16380, true);
compute( 1638, 16380, true);
compute( 4095, 16380, true);
compute( 8190, 16380, true);

compute( 273, 32760, true);

compute( 546, 32760, true);
compute( 819, 32760, true);
compute( 1638, 32760, true);
compute( 4095, 32760, true);
compute( 8190, 32760, true);
compute(16380, 32760, true);

/*
* If you want to check an average of some executions against the values measured
* for the no streams version you can use the following sample calls, but updating
* the last value to the time measurement from your system.
*/
// average( 273, 32760, 10, 500.615);
// average( 546, 32760, 10, 500.615);
// average( 819, 32760, 10, 500.615);
// average( 1638, 32760, 10, 500.615);
// average( 4095, 32760, 10, 500.615);
// average( 8190, 32760, 10, 500.615);
// average(16380, 32760, 10, 500.615);

// average( 273, 16380, 10, 119.299);

// average( 546, 16380, 10, 119.299);
// average( 819, 16380, 10, 119.299);
// average( 1638, 16380, 10, 119.299);
// average( 4095, 16380, 10, 119.299);
// average( 8190, 16380, 10, 119.299);

return 0;
}

4_multigpu
27-08-2025 12:10 104 CMakeLists.txt

27-08-2025 12:10 3,846 multigpu.cu

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(overlay_things CUDA)

add_executable(multigpu multigpu.cu)

multigpu.cu
#include <cuda_runtime.h>
#include <iostream>
#include <vector>
#include <chrono>
#include <cmath>

#define BLOCK_SIZE 256

void vectorMatrixMulCpu(float* vec, float* mat, float* res, int rows, int cols) {
for (int row = 0; row < rows; row++) {
float sum = 0.0f;
for (int col = 0; col < cols; col++) {
sum += mat[row * cols + col] * vec[col];
}
res[row] = sum;
}
}

void compareResults(float* cpuRes, float* gpuRes, int rows) {

for (int i = 0; i < rows; i++) {
if (fabs(cpuRes[i] - gpuRes[i]) > 1e-4) {
std::cerr << "Mismatch at " << i << " CPU: " << cpuRes[i] << " GPU: " << gpuRes[i] << " diff=" <<
abs(cpuRes[i] - gpuRes[i]) << std::endl;
return;
}
}
std::cout << "Results match!" << std::endl;
}

void initializeMatrix(float *matrix, int size) {

for (int i = 0; i < size; i++) {
matrix[i] = static_cast<float>(rand()) / RAND_MAX;
}
}

int main() {
srand(static_cast<unsigned int>(time(0)));

int deviceCount = 0;
cudaGetDeviceCount(&deviceCount);

int cols = 16380;

int rows = cols;
int chunk_size = rows / deviceCount;

float h_vec = (float)malloc(rows * sizeof(float));

float *h_res_cpu = (float*)malloc(rows * sizeof(float));
float *h_mat = (float*)malloc(rows * cols * sizeof(float));
float *h_res_gpu = (float*)malloc(rows * sizeof(float));

initializeMatrix(h_vec, cols);
initializeMatrix(h_mat, rows * cols);

float *d_vec[deviceCount];
float *d_mat[deviceCount];
float *d_res[deviceCount];
for (int device = 0; device < deviceCount; device++) {
cudaSetDevice(device);
cudaMalloc(&d_vec[device], rows * sizeof(float));
cudaMalloc(&d_mat[device], chunk_size * cols * sizeof(float));
cudaMalloc(&d_res[device], chunk_size * sizeof(float));

cudaMemcpy(d_vec[device], h_vec, rows * sizeof(float), cudaMemcpyHostToDevice);

}

auto gpuStart = std::chrono::high_resolution_clock::now();

for (int device = 0; device < deviceCount; device++) {

cudaSetDevice(device);
cudaMemcpy(d_mat[device], h_mat + device * chunk_size * cols, chunk_size * cols *
sizeof(float), cudaMemcpyHostToDevice);
}

for (int device = 0; device < deviceCount; device++) {

cudaSetDevice(device);
int blocks = (chunk_size + BLOCK_SIZE - 1) / BLOCK_SIZE;

vectorMatrixMulKernel<<<blocks, BLOCK_SIZE>>>(d_vec[device], d_mat[device], d_res[device],

chunk_size, cols);
}

for (int device = 0; device < deviceCount; device++) {

cudaSetDevice(device);

cudaMemcpy(h_res_gpu + device * chunk_size, d_res[device], chunk_size * sizeof(float),

cudaMemcpyDeviceToHost);
cudaFree(d_vec[device]);
cudaFree(d_mat[device]);
cudaFree(d_res[device]);
}

auto gpuEnd = std::chrono::high_resolution_clock::now();

std::chrono::duration<float, std::milli> gpuDuration = gpuEnd - gpuStart;
std::cout << "Time taken by GPU: " << gpuDuration.count() << " ms" << std::endl;

auto cpuStart = std::chrono::high_resolution_clock::now();

vectorMatrixMulCpu(h_vec, h_mat, h_res_cpu, rows, cols);
auto cpuEnd = std::chrono::high_resolution_clock::now();
std::chrono::duration<float, std::milli> cpuDuration = cpuEnd - cpuStart;
std::cout << "Time taken by CPU: " << cpuDuration.count() << " ms" << std::endl;

compareResults(h_res_cpu, h_res_gpu, rows);

return 0;
}

Chapter 9
27-08-2025 12:10 <DIR> include

27-08-2025 12:10 <DIR> python

27-08-2025 12:10 1,748 readme.md

27-08-2025 12:10 <DIR> src

readme.md
# Introduction

This chapter presents a C++ library that uses the traditional steps for building, but we also have
three implementations of wrappers to expose the code to Python.
We also provide a run_tests.sh inside the `python` folder that runs the corresponding `test.py` file
inside each alternative of the wrappers. The specific commands are listed bellow:

## To Build the C++ Library

To build we may simply:
1. create a `build` folder inside of `ch9` folder
2. run cmake ..
3. run make

## For CTypes
There is no need to build, just run the code with the Python interpreter. There are two files:
1. test_ctypes.py that uses the library
2. test.py that will be used on the script that runs all the wrappers

## For Wrapper
It will be built by setup.py wit the following command:
1. python3 setup.py build_ext --inplace

We also have two test files here:

1. test_vector_add_wrapper.py to test the wrapper individually with the Python interpreter
2. test.py that will be used on the script that runs all the wrappers

## For the NumPy Wrapper

It will be built by setup.py wit the following command:
1. python3 setup.py build_ext --inplace

We also have two test files here:

1. test_vector_add_np_wrapper.py to test the wrapper individually with the Python interpreter
2. test.py that will be used on the script that runs all the wrappers

## To Execute all Tests

First we have to run the setup.py command for the wrappers, after that we can run:
1. run_tests.sh inside the `python` folder, this script will collect the names of the subfolders and will
execute 10 times each `test.py` program. After each iteration it increases the inpute size so that we
can observe the impact on overall performance.

Include
vector_add.H
#ifndef VEC_ADD_H
#define VEC_ADD_H

extern "C" {
void vectorAdd(int *a, int *b, int *c, int N);
}
#endif

python
27-08-2025 12:10 <DIR> ctypes

27-08-2025 12:10 1,134 run_tests.sh

27-08-2025 12:10 <DIR> wrapper

27-08-2025 12:10 <DIR> wrapper_np

run_tests.sh
#!/bin/bash

# Get the current directory where the script is located

base_dir=$(pwd)

# List of subfolders (assuming there are exactly three subfolders)

subfolders=($(find "$base_dir" -maxdepth 1 -type d | tail -n +2 | head -n 3))

# Program to execute
program="test.py"

# Parameters to pass to the program

param=1000

for i in {1..5}; do
echo "*** Processing parameter value: $param"
# Loop over each subfolder
for subfolder in "${subfolders[@]}"; do
echo "## Processing folder: $subfolder"

# Change to the subfolder

cd "$subfolder" || { echo "Failed to enter $subfolder"; exit 1; }

for i in {1..10}; do
# Execute the program with parameters
echo "$i"
python3 "$program" "$param" || { echo "Failed to execute program in $subfolder"; exit 1; }
done

# Move back to the base directory

cd "$base_dir"

echo " "

done

# echo "Finished processing all subfolders for parameter $param"

echo "#############"

# Increment the parameter (multiply by 10 for the next iteration)

param=$((param * 10))
done
echo "All folders processed."

ctypes
27-08-2025 12:10 1,027 test.py

27-08-2025 12:10 563 test_ctypes.py

test.py
import ctypes
import numpy as np
import time
import argparse

lib = ctypes.CDLL("../../build/libvector_add.so")

lib.vectorAdd.argtypes = [
ctypes.POINTER(ctypes.c_int),
ctypes.POINTER(ctypes.c_int),
ctypes.POINTER(ctypes.c_int),
ctypes.c_int
]

def calculate(size):
N = int (size)
a = np.array(range(N), dtype=np.int32)
b = np.array(range(N, 2*N), dtype=np.int32)
c = np.zeros(N, dtype=np.int32)

startTime = time.time()
lib.vectorAdd(
a.ctypes.data_as(ctypes.POINTER(ctypes.c_int)),
b.ctypes.data_as(ctypes.POINTER(ctypes.c_int)),
c.ctypes.data_as(ctypes.POINTER(ctypes.c_int)),
N
)
endTime = time.time()

executionTimeInMs = (endTime - startTime) * 1000

print(f"Function execution time: {executionTimeInMs:.2f} ms")

def main():
parser = argparse.ArgumentParser(description="N")
parser.add_argument("N", help="The size of the array")

args = parser.parse_args()

calculate(args.N)

if __name__ == "__main__":
main()
test_ctypes.py
import ctypes
import numpy as np

lib = ctypes.CDLL("../../build/libvector_add.so")

lib.vectorAdd.argtypes = [
ctypes.POINTER(ctypes.c_int),
ctypes.POINTER(ctypes.c_int),
ctypes.POINTER(ctypes.c_int),
ctypes.c_int
]

N = 10000000
a = np.array(range(N), dtype=np.int32)
b = np.array(range(N, 2*N), dtype=np.int32)
c = np.zeros(N, dtype=np.int32)

lib.vectorAdd(
a.ctypes.data_as(ctypes.POINTER(ctypes.c_int)),
b.ctypes.data_as(ctypes.POINTER(ctypes.c_int)),
c.ctypes.data_as(ctypes.POINTER(ctypes.c_int)),
N
)

print("Result:", c)

wrapper

27-08-2025 12:10 411 pyproject_toml.txt

27-08-2025 12:10 272 setup.py

27-08-2025 12:10 648 test.py

27-08-2025 12:10 211 test_vector_add_wrapper.py

27-08-2025 12:10 1,917 vector_add_wrapper.c

pyproject_toml.txt
[build-system]
requires = ["setuptools", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "vector_add_wrapper"
version = "1.0"
description = "Python binding for CUDA vector addition"
authors = [{ name = "Paulo Motta", email = "[email protected]" }]
dependencies = []

[tool.setuptools]
packages = []

[tool.setuptools.ext_modules]
vector_add_wrapper = { sources = ["vector_add_wrapper.c"] }

setup.py
from setuptools import setup, Extension

module = Extension(
"vector_add_wrapper",
sources=["vector_add_wrapper.c"],
)

setup(
name="vector_add_wrapper",
version="1.0",
description="Python binding for CUDA vector addition",
ext_modules=[module],
)

test.py
import vector_add_wrapper as vaw
import random
import time
import argparse

def calculate(size):
N = int (size)

a = [random.randint(0, N) for _ in range(N)]

b = [random.randint(N, 2*N) for _ in range(N)]

startTime = time.time()
result = vaw.vectorAdd(a, b, N)
endTime = time.time()

executionTimeInMs = (endTime - startTime) * 1000

print(f"Function execution time: {executionTimeInMs:.2f} ms")

def main():
parser = argparse.ArgumentParser(description="N")
parser.add_argument("N", help="The size of the array")

args = parser.parse_args()

calculate(args.N)

if __name__ == "__main__":
main()

test_vector_add_wrapper.py
import vector_add_wrapper as vaw
import random

N = 10000000
a = [random.randint(0, N) for _ in range(N)]
b = [random.randint(N, 2*N) for _ in range(N)]

result = vaw.vectorAdd(a, b, N)
print("Result:", result)

vector_add_wrapper.c
#define PY_SSIZE_T_CLEAN
#include <Python.h>
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

typedef void (vectorAddFunc)(int , int , int , int);

static vectorAddFunc vectorAdd = NULL;

static PyObject* pyVectorAdd(PyObject* self, PyObject* args) {

PyObject *py_a, *py_b;
int N;

if (!PyArg_ParseTuple(args, "OOi", &py_a, &py_b, &N))

return NULL;

if (!PyList_Check(py_a) || !PyList_Check(py_b) || PyList_Size(py_a) != N || PyList_Size(py_b) != N)

{
PyErr_SetString(PyExc_ValueError, "Arguments must be lists of same size N.");
return NULL;
}

int a = (int)malloc(N * sizeof(int));

int *b = (int*)malloc(N * sizeof(int));
int *c = (int*)malloc(N * sizeof(int));
for (int i = 0; i < N; i++) {
a[i] = (int)PyLong_AsLong(PyList_GetItem(py_a, i));
b[i] = (int)PyLong_AsLong(PyList_GetItem(py_b, i));
}

vectorAdd(a, b, c, N);

PyObject* result = PyList_New(N);

for (int i = 0; i < N; i++) {
PyList_SetItem(result, i, PyLong_FromLong(c[i]));
}

free(a);
free(b);
free(c);

return result;
}

static PyMethodDef VectorAddMethods[] = {

{"vectorAdd", pyVectorAdd, METH_VARARGS, "Perform vector addition using CUDA"},
{NULL, NULL, 0, NULL}
};

static struct PyModuleDef vectoraddmodule = {

PyModuleDef_HEAD_INIT,
"vector_add_wrapper",
NULL,
-1,
VectorAddMethods
};

PyMODINIT_FUNC PyInit_vector_add_wrapper(void) {
void *handle = dlopen("../../build/libvector_add.so", RTLD_LAZY);
if (!handle) {
PyErr_SetString(PyExc_ImportError, "Could not load libvector_add.so");
return NULL;
}

vectorAdd = (vectorAddFunc)dlsym(handle, "vectorAdd");

if (!vectorAdd) {
PyErr_SetString(PyExc_ImportError, "Could not find vectorAdd in libvector_add.so");
return NULL;
}

return PyModule_Create(&vectoraddmodule);
}
wrapper_np
27-08-2025 12:10 375 setup.py

27-08-2025 12:10 705 test.py

27-08-2025 12:10 260 test_vector_add_np_wrapper.py

27-08-2025 12:10 1,585 vector_add_np_wrapper.c

setup.py
from setuptools import setup, Extension
import numpy

module = Extension(
"vector_add_np_wrapper",
include_dirs = [numpy.get_include()],
sources=["vector_add_np_wrapper.c"],
)

setup(
name="vector_add_np_wrapper",
version="1.0",
description="Python binding for CUDA vector addition",
ext_modules=[module],
include_dirs=[numpy.get_include()]
)

test.py
import vector_add_np_wrapper as vaw_np
import numpy as np
import time
import argparse

def calculate(size):
N = int (size)
a = np.random.randint(0, N, size=N, dtype=np.int32)
b = np.random.randint(N, 2*N, size=N, dtype=np.int32)
c = np.zeros(N, dtype=np.int32)

startTime = time.time()
vaw_np.vectorAdd(a, b, c, N)
endTime = time.time()

executionTimeInMs = (endTime - startTime) * 1000

print(f"Function execution time: {executionTimeInMs:.2f} ms")

def main():
parser = argparse.ArgumentParser(description="N")
parser.add_argument("N", help="The size of the array")

args = parser.parse_args()

calculate(args.N)

if __name__ == "__main__":
main()

test_vector_add_np_wrapper.py
import vector_add_np_wrapper as vaw_np
import numpy as np

N = 10000000
a = np.random.randint(0, N, size=N, dtype=np.int32)
b = np.random.randint(N, 2*N, size=N, dtype=np.int32)
c = np.zeros(N, dtype=np.int32)

vaw_np.vectorAdd(a, b, c, N)
print("Result:", c)

vector_add_np_wrapper.c
#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
#define PY_SSIZE_T_CLEAN
#include <Python.h>
#include <numpy/arrayobject.h>
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

typedef void (vectorAddFunc)(int , int , int , int);

static vectorAddFunc vectorAdd = NULL;

static PyObject* pyVectorAdd(PyObject* self, PyObject* args) {

PyArrayObject *a, *b, *c;
int N;

if (!PyArg_ParseTuple(args, "OOOi",
&a,
&b,
&c,
&N)) {
return NULL;
}
int *a_ptr = (int*)PyArray_DATA(a);
int *b_ptr = (int*)PyArray_DATA(b);
int *c_ptr = (int*)PyArray_DATA(c);

vectorAdd(a_ptr, b_ptr, c_ptr, N);

Py_RETURN_NONE;
}

static PyMethodDef VectorAddMethods[] = {

{"vectorAdd", pyVectorAdd, METH_VARARGS, "Perform vector addition using CUDA"},
{NULL, NULL, 0, NULL}
};

static struct PyModuleDef vectoraddmodule = {

PyModuleDef_HEAD_INIT,
"vector_add_np_wrapper",
NULL,
-1,
VectorAddMethods
};

PyMODINIT_FUNC PyInit_vector_add_np_wrapper(void) {
void *handle = dlopen("../../build/libvector_add.so", RTLD_LAZY);
if (!handle) {
PyErr_SetString(PyExc_ImportError, "Could not load libvector_add.so");
return NULL;
}

vectorAdd = (vectorAddFunc)dlsym(handle, "vectorAdd");

if (!vectorAdd) {
PyErr_SetString(PyExc_ImportError, "Could not find vectorAdd in libvector_add.so");
return NULL;
}

import_array();

return PyModule_Create(&vectoraddmodule);
}

SRC
vector_add.cu
#include <iostream>
#include <cuda_runtime.h>
#include "../include/vector_add.h"

global void vectorAddKernel(int a, int b, int *c, int N) {

int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < N) {
c[idx] = a[idx] * b[idx];
}
}

void vectorAdd(int a, int b, int *c, int N) {

int *d_a;
int *d_b;
int *d_c;

cudaMalloc((void**)&d_a, N * sizeof(int));
cudaMalloc((void**)&d_b, N * sizeof(int));
cudaMalloc((void**)&d_c, N * sizeof(int));

cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice);

cudaMemcpy(d_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

int threadsPerBlock = 256;

int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, N);

cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);

cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);
}

Chapter 10
27-08-2025 12:10 <DIR> 1_cuBLAS

27-08-2025 12:10 <DIR> 2_thrust

27-08-2025 12:10 <DIR> 3_gtest

27-08-2025 12:10 <DIR> 4_pytest

27-08-2025 12:10 256 readme.md

readme.md
# Introduction

All our examples are numbered here to follow the order in which they appear on the book
chapter.

To build the examples we may simply:

1. create a `build` folder inside of the corresponding code folder
2. run cmake ..
3. run make
1_cuBLAS
27-08-2025 12:10 403 CMakeLists.txt

27-08-2025 12:10 4,545 main.cpp

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(MatrixMultiplyCUBLAS)

set(CMAKE_CXX_STANDARD 14)
set(CMAKE_BUILD_TYPE Release)

set(CUDA_TOOLKIT_ROOT_DIR /usr/local/cuda)

find_package(CUDA REQUIRED)

set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -O3")

include_directories(${CUDA_INCLUDE_DIRS})

add_executable(matrix_mul main.cpp)

target_link_libraries(matrix_mul ${CUDA_CUBLAS_LIBRARIES} ${CUDA_LIBRARIES})

main.cpp
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <chrono>
#include <cublas_v2.h>
#include <cuda_runtime.h>

void matrixMultCpu(float* A, float* B, float* C, int N) {

for (int row = 0; row < N; row++) {
for (int col = 0; col < N; col++) {
float sum = 0.0f;
for (int k = 0; k < N; k++) {
sum += A[row * N + k] * B[k * N + col];
}
C[row * N + col] = sum;
}
}
}

void checkResult(float* cpuRes, float* gpuRes, int size) {

int main() {
srand(static_cast<unsigned int>(time(0)));

int N = 1024;
size_t size = N * N * sizeof(float);

float* h_A = (float*)malloc(size);

float* h_B = (float*)malloc(size);
float* h_C = (float*)malloc(size);
float* h_C_cpu = (float*)malloc(size);

for (int i = 0; i < N * N; ++i) {

h_A[i] = static_cast<float>(rand()) / RAND_MAX;
h_B[i] = static_cast<float>(rand()) / RAND_MAX;
}

auto startCpu = std::chrono::high_resolution_clock::now();

matrixMultCpu(h_A, h_B, h_C_cpu, N);
auto endCpu = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> cpuDuration = endCpu - startCpu;
std::cout << "Time taken by CPU: " << cpuDuration.count() << " ms" << std::endl;

float *d_A;
float *d_B;
float *d_C;
cudaMalloc(&d_A, size);
cudaMalloc(&d_B, size);
cudaMalloc(&d_C, size);

auto startTotal = std::chrono::high_resolution_clock::now();

auto startCopy = std::chrono::high_resolution_clock::now();

cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
auto endCopy = std::chrono::high_resolution_clock::now();

auto startOverheadTime = std::chrono::high_resolution_clock::now();

cublasHandle_t handle;
cublasCreate(&handle);
float alpha = 1.0f;
float beta = 0.0f;
auto endOverheadTime = std::chrono::high_resolution_clock::now();

cudaEvent_t startEvent, stopEvent;

cudaEventCreate(&startEvent);
cudaEventCreate(&stopEvent);
cudaEventRecord(startEvent);

cublasSgemm(handle,
CUBLAS_OP_T, CUBLAS_OP_T, // transpose both matrices
N, N, N,
&alpha,
d_A, N,
d_B, N,
&beta,
d_C, N);

float* d_C_fixed;
cudaMalloc(&d_C_fixed, size);

cublasSgeam(handle,
CUBLAS_OP_T, CUBLAS_OP_N,
N, N, // dimensions of the output matrix
&alpha,
d_C, N, // input matrix (to be transposed)
&beta,
nullptr, N, // second matrix not used
d_C_fixed, N); // result goes here

cudaEventRecord(stopEvent);
cudaEventSynchronize(stopEvent);
float gpuComputeTime = 0;
cudaEventElapsedTime(&gpuComputeTime, startEvent, stopEvent);

auto startCopyBack = std::chrono::high_resolution_clock::now();

cudaMemcpy(h_C, d_C_fixed, size, cudaMemcpyDeviceToHost);
auto endCopyBack = std::chrono::high_resolution_clock::now();
auto endTotal = std::chrono::high_resolution_clock::now();

std::chrono::duration<double, std:: milli> copyTime = endCopy - startCopy;

std::chrono::duration<double, std:: milli> copyBackTime = endCopyBack - startCopyBack;
std::chrono::duration<double, std:: milli> totalTime = endTotal - startTotal;
std::chrono::duration<double, std:: milli> overheadTime = endOverheadTime -
startOverheadTime;

std::cout << "Device data copy time: " << copyTime.count() << " ms" << std::endl;
std::cout << "cuBLAS overhead time: " << overheadTime.count() << " ms" << std::endl;
std::cout << "Device compute time: " << gpuComputeTime << " ms" << std::endl;
std::cout << "Device data copy back time: " << copyBackTime.count() << " ms" << std::endl;
std::cout << "Total time taken by GPU: " << totalTime.count() << " ms" << std::endl;
checkResult(h_C_cpu, h_C, N * N);

cublasDestroy(handle);
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
cudaFree(d_C_fixed);
free(h_A);
free(h_B);
free(h_C);
free(h_C_cpu);

return 0;
}

2_thrust
27-08-2025 12:10 321 CMakeLists.txt

27-08-2025 12:10 2,470 main.cu

CMakeLists.txt
cmake_minimum_required(VERSION 3.12)
project(ThrustSortExample LANGUAGES CXX CUDA)

set(CMAKE_CXX_STANDARD 14)
set(CMAKE_BUILD_TYPE Release)

find_package(CUDA REQUIRED)

add_executable(thrust_sort main.cu)

set_target_properties(thrust_sort PROPERTIES
CUDA_SEPARABLE_COMPILATION ON
CUDA_ARCHITECTURES "native"
)

main.cu
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <typeinfo>
#include <vector>
template <typename T>
void thrustSort(size_t size) {
thrust::host_vector<T> h_vec(size);
thrust::host_vector<T> h_result(size);

std::cout << "Processing " << typeid(T).name() << " (" << size << " elements)" << std::endl;

for (auto& val : h_vec) {

val = static_cast<T>(rand()) / RAND_MAX;
}

auto cpuVec = h_vec;

auto cpuStart = std::chrono::high_resolution_clock::now();
std::sort(cpuVec.begin(), cpuVec.end());
auto cpuEnd = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> cpuDuration = cpuEnd - cpuStart;
std::cout << "Time taken by CPU: " << cpuDuration.count() << " ms" << std::endl;

auto gpuStart = std::chrono::high_resolution_clock::now();

auto startCopy = std::chrono::high_resolution_clock::now();

thrust::device_vector<T> d_vec = h_vec;
auto endCopy = std::chrono::high_resolution_clock::now();

auto startSort = std::chrono::high_resolution_clock::now();

thrust::sort(d_vec.begin(), d_vec.end());
auto endSort = std::chrono::high_resolution_clock::now();

auto startCopyBack = std::chrono::high_resolution_clock::now();

thrust::copy(d_vec.begin(), d_vec.end(), h_result.begin());
auto endCopyBack = std::chrono::high_resolution_clock::now();

auto gpuEnd = std::chrono::high_resolution_clock::now();

std::chrono::duration<double, std::milli> copyTime = endCopy - startCopy;

std::chrono::duration<double, std::milli> copyBackTime = endCopyBack - startCopyBack;
std::chrono::duration<double, std::milli> sortTime = endSort - startSort;
std::chrono::duration<double, std::milli> gpuTotalTime = gpuEnd - gpuStart;

std::cout << "GPU copy time: " << copyTime.count() << " ms" << std::endl;
std::cout << "GPU copy back time: " << copyBackTime.count() << " ms" << std::endl;
std::cout << "GPU sort time: " << sortTime.count() << " ms" << std::endl;
std::cout << "Total time taken by GPU: " << gpuTotalTime.count() << " ms" << std::endl;
}

int main() {
size_t N = 33'000'000;

srand(static_cast<unsigned int>(time(0)));
thrustSort<int>(N);
std::cout << std::endl;
thrustSort<double>(N);
std::cout << std::endl;
thrustSort<float>(N);

return 0;
}

3_gtest
27-08-2025 12:10 493 CMakeLists.txt

27-08-2025 12:10 <DIR> include

27-08-2025 12:10 <DIR> src

27-08-2025 12:10 <DIR> test

CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(VectorAddTest LANGUAGES CXX CUDA)

enable_testing()

find_package(GTest REQUIRED)

add_executable(vector_add_test test/test_vector_add.cu src/main.cu)

target_link_libraries(vector_add_test GTest::GTest GTest::Main cuda)

add_test(NAME VectorAddTest COMMAND vector_add_test)

add_library(vector_add SHARED src/main.cu)

set_target_properties(vector_add PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
target_include_directories(vector_add PUBLIC include)

include
vector_add.h
#ifndef VEC_ADD_H
#define VEC_ADD_H
extern "C" {
void vectorAdd(float* h_A, float* h_B, float* h_C, int N);
}

#endif
SRC
main.cu
#include <cuda_runtime.h>
#include <iostream>
#include "../include/vector_add.h"

global void vectorAddKernel(float* A, float* B, float* C, int N) {

int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < N) {
C[i] = A[i] + B[i];
}
}

void vectorAdd(float* h_A, float* h_B, float* h_C, int N) {

float *d_A;
float *d_B;
float *d_C;
size_t size = N * sizeof(float);

cudaMalloc(&d_A, size);
cudaMalloc(&d_B, size);
cudaMalloc(&d_C, size);

cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

int threadsPerBlock = 256;

int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
}
#include <cuda_runtime.h>
#include <iostream>
#include "../include/vector_add.h"

global void vectorAddKernel(float* A, float* B, float* C, int N) {

int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < N) {
C[i] = A[i] + B[i];
}
}

void vectorAdd(float* h_A, float* h_B, float* h_C, int N) {

float *d_A;
float *d_B;
float *d_C;
size_t size = N * sizeof(float);

cudaMalloc(&d_A, size);
cudaMalloc(&d_B, size);
cudaMalloc(&d_C, size);

cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

int threadsPerBlock = 256;

int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
}

Test
test_vector_add.cu
#include <gtest/gtest.h>
#include "../include/vector_add.h"

TEST(VectorAddTest, SimpleAddition) {
int N = 5;
float A[N] = {1, 2, 3, 4, 5};
float B[N] = {10, 20, 30, 40, 50};
float C[N] = {0};

vectorAdd(A, B, C, N);

for (int i = 0; i < N; i++) {

EXPECT_FLOAT_EQ(C[i], A[i] + B[i]);
}
}

4_pytest
test_vector_add.py
import numpy as np
from vector_add import vectorAdd
def testVectorAdd():
a = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
b = np.array([5.0, 6.0, 7.0, 8.0], dtype=np.float32)
expected = a + b

result = vectorAdd(a, b)

for i in range(len(expected)):
assert abs(result[i] - expected[i]) < 1e-5, f"Mismatch at index {i}: got {result[i]}, expected
{expected[i]}"

np.testing.assert_allclose(result, expected, rtol=1e-5)

vector_add.py
import ctypes
import numpy as np

lib = ctypes.CDLL('../3_gtest/build/libvector_add.so')

lib.vectorAdd.argtypes = [
np.ctypeslib.ndpointer(dtype=np.float32, flags="C_CONTIGUOUS"),
np.ctypeslib.ndpointer(dtype=np.float32, flags="C_CONTIGUOUS"),
np.ctypeslib.ndpointer(dtype=np.float32, flags="C_CONTIGUOUS"),
ctypes.c_int
]

def vectorAdd(a: np.ndarray, b: np.ndarray) -> np.ndarray:

assert a.shape == b.shape
n = a.size
c = np.empty_like(a)
lib.vectorAdd(a, b, c, n)
return c

Rudin's Analysis Exercises Guide
100% (1)
Rudin's Analysis Exercises Guide
89 pages
Banach Spaces
100% (2)
Banach Spaces
34 pages
Real and Complex Analysis Homework Help
100% (1)
Real and Complex Analysis Homework Help
5 pages
An-Introductory-Course-in-Elementary-Number-Theory (Raji)
100% (1)
An-Introductory-Course-in-Elementary-Number-Theory (Raji)
204 pages
C 02 Vectors in Three Dimensions
No ratings yet
C 02 Vectors in Three Dimensions
84 pages
504 Topper 21 110 1 2 392language of Chemistry Up202312111727 1702295878 8623
No ratings yet
504 Topper 21 110 1 2 392language of Chemistry Up202312111727 1702295878 8623
13 pages
Solutions to Rudin's Mathematical Analysis
No ratings yet
Solutions to Rudin's Mathematical Analysis
172 pages
Introduction To Affine Group Schemes (William C. Waterhouse (Auth.) )
No ratings yet
Introduction To Affine Group Schemes (William C. Waterhouse (Auth.) )
166 pages
Vectors - 2D and 3D Vectors
100% (1)
Vectors - 2D and 3D Vectors
50 pages
PA1Q Datasheet Overview
100% (2)
PA1Q Datasheet Overview
141 pages
Mathematical Methods II Lecture Notes: Manuel Mañas Baena Luis Martínez Alonso
No ratings yet
Mathematical Methods II Lecture Notes: Manuel Mañas Baena Luis Martínez Alonso
249 pages
Von Neumann Architecture - 1945
No ratings yet
Von Neumann Architecture - 1945
7 pages
A Treatise On Advanced Calculus by Franklin Philip
100% (1)
A Treatise On Advanced Calculus by Franklin Philip
612 pages
Multivariable Calculus Lecture Notes
100% (1)
Multivariable Calculus Lecture Notes
296 pages
Numerical Study On The Boundary Value Problem by Using A Shooting Method
No ratings yet
Numerical Study On The Boundary Value Problem by Using A Shooting Method
6 pages
Problems and Solutions in Differential Geometry and Applications
No ratings yet
Problems and Solutions in Differential Geometry and Applications
125 pages
(1972) Differential Quadrature A Technique For The Rapid Solution of Nonlinear Partial Differential Equations
100% (2)
(1972) Differential Quadrature A Technique For The Rapid Solution of Nonlinear Partial Differential Equations
13 pages
Complex Variables and Functions Overview
No ratings yet
Complex Variables and Functions Overview
182 pages
J. H. Wells, L. R. Williams Auth. Embeddings and Extensions in Analysis PDF
100% (1)
J. H. Wells, L. R. Williams Auth. Embeddings and Extensions in Analysis PDF
116 pages
Octonions, Jordan Algebras and Exceptional Groups
No ratings yet
Octonions, Jordan Algebras and Exceptional Groups
212 pages
概率论教程（中文版－钟开莱）
No ratings yet
概率论教程（中文版－钟开莱）
382 pages
Mathematics Course Bibliography
No ratings yet
Mathematics Course Bibliography
29 pages
Stochastic Calculus Overview
No ratings yet
Stochastic Calculus Overview
217 pages
Commutative Algebra Textbook
No ratings yet
Commutative Algebra Textbook
266 pages
Approximate Calculation of Integrals (Dover Books On Mathematics) - V. I. Krylov (Author) Arthur H
No ratings yet
Approximate Calculation of Integrals (Dover Books On Mathematics) - V. I. Krylov (Author) Arthur H
386 pages
(Universitext) Robert B. Reisel (Auth.) - Elementary Theory of Metric Spaces - A Course in Constructing Mathematical Proofs-Springer-Verlag New York (1982)
No ratings yet
(Universitext) Robert B. Reisel (Auth.) - Elementary Theory of Metric Spaces - A Course in Constructing Mathematical Proofs-Springer-Verlag New York (1982)
129 pages
Discovering Discrete Dynamical Systems
No ratings yet
Discovering Discrete Dynamical Systems
131 pages
An Introduction To Optimal Transport and Wasserstein Gradient Flows
No ratings yet
An Introduction To Optimal Transport and Wasserstein Gradient Flows
19 pages
AMS Textbooks for Math Students
100% (1)
AMS Textbooks for Math Students
15 pages
Chen Tang PrimeNumberTheorem
No ratings yet
Chen Tang PrimeNumberTheorem
81 pages
Exactly Solvable Schrödinger Equations With Singularities: A Systematic Approach To Solving Complexified Potentials (Part1)
100% (1)
Exactly Solvable Schrödinger Equations With Singularities: A Systematic Approach To Solving Complexified Potentials (Part1)
36 pages
Idolikecfd Vol1 2ed v2p2
100% (1)
Idolikecfd Vol1 2ed v2p2
299 pages
Introduction to Multidimensional Geometry
100% (1)
Introduction to Multidimensional Geometry
52 pages
Eilenberg, Moore - Foundations of Relative Homological Algebra
No ratings yet
Eilenberg, Moore - Foundations of Relative Homological Algebra
40 pages
应用偏微分方程
No ratings yet
应用偏微分方程
438 pages
Read Online Here: Introduction To Abstract Algebra Solutions 2
No ratings yet
Read Online Here: Introduction To Abstract Algebra Solutions 2
5 pages
Partial Differential Equations: Methods and Applications
No ratings yet
Partial Differential Equations: Methods and Applications
6 pages
Measure Theory
100% (1)
Measure Theory
253 pages
Otto Schreier, Emanuel Sperner - Projective Geometry of N Dimensions (Volume Two of Introduction To Modern Algebra and Matrix Theory) - Chelsea Pub. Co (1961)
No ratings yet
Otto Schreier, Emanuel Sperner - Projective Geometry of N Dimensions (Volume Two of Introduction To Modern Algebra and Matrix Theory) - Chelsea Pub. Co (1961)
208 pages
Applied Nonautonomous and Random Dynamical Systems
No ratings yet
Applied Nonautonomous and Random Dynamical Systems
115 pages
Fixed Point Theorems Guide
No ratings yet
Fixed Point Theorems Guide
61 pages
(Earl A. Coddington) An Introduction To Ordinary D (B-Ok - Xyz) - 1
No ratings yet
(Earl A. Coddington) An Introduction To Ordinary D (B-Ok - Xyz) - 1
306 pages
(University Series in Higher Mathematics) Oscar Zariski, Pierre Samuel, I.S. Cohen - Commutative Algebra Volume I - Volume 1-D. Van Nostrand Co Inc (1958)
No ratings yet
(University Series in Higher Mathematics) Oscar Zariski, Pierre Samuel, I.S. Cohen - Commutative Algebra Volume I - Volume 1-D. Van Nostrand Co Inc (1958)
339 pages
Function Theory of Several Variables
100% (1)
Function Theory of Several Variables
7 pages
Exponential Integral Integrals Table
No ratings yet
Exponential Integral Integrals Table
20 pages
Hatcher A. - Algebraic Topology. Errata (Web Draft, Nov. 2004) PDF
No ratings yet
Hatcher A. - Algebraic Topology. Errata (Web Draft, Nov. 2004) PDF
6 pages
(Ebook) Introduction To Numerical Analysis - J.stoer, R.bulirsch
100% (1)
(Ebook) Introduction To Numerical Analysis - J.stoer, R.bulirsch
672 pages
2013 07 22-Python-CUDA
No ratings yet
2013 07 22-Python-CUDA
25 pages
Cloud Computing Unit-1 To Unit-5 Material
No ratings yet
Cloud Computing Unit-1 To Unit-5 Material
82 pages
If4093 Syllabus1
No ratings yet
If4093 Syllabus1
2 pages
Robots Create Real-Time 3D Maps
No ratings yet
Robots Create Real-Time 3D Maps
6 pages
Introduction To Modern 3D Graphics Programming With OpenGL (PDFDrive)
No ratings yet
Introduction To Modern 3D Graphics Programming With OpenGL (PDFDrive)
341 pages
History of Computer Graphics
No ratings yet
History of Computer Graphics
24 pages
GPU Insights for CPU Experts
100% (1)
GPU Insights for CPU Experts
70 pages
The Role of Game Engines in Game Development and T
No ratings yet
The Role of Game Engines in Game Development and T
26 pages
The Application of AI Technology in GPU Scheduling
No ratings yet
The Application of AI Technology in GPU Scheduling
7 pages
Aaltonen Sebastian GPU Based Clay
No ratings yet
Aaltonen Sebastian GPU Based Clay
70 pages
GPU-Based Ultrasound Simulation Advances
No ratings yet
GPU-Based Ultrasound Simulation Advances
10 pages
Module1 PP BDS701 Notes
No ratings yet
Module1 PP BDS701 Notes
27 pages
Parallel Programming Models Survey
No ratings yet
Parallel Programming Models Survey
18 pages
Parallel Implementation of Cryptographic Algorithm Aes Using Opencl On Gpu
No ratings yet
Parallel Implementation of Cryptographic Algorithm Aes Using Opencl On Gpu
5 pages
High Performance Computing in CST Studio Suite: Felix Wolfheimer
No ratings yet
High Performance Computing in CST Studio Suite: Felix Wolfheimer
18 pages
CS8076 - GPU Architecture and Programming
No ratings yet
CS8076 - GPU Architecture and Programming
244 pages
C# Graphics Programming Guide
100% (1)
C# Graphics Programming Guide
13 pages
Cs344 - Lesson 2 - GPU Hardware and Parallel Communication Patterns - Udacity
No ratings yet
Cs344 - Lesson 2 - GPU Hardware and Parallel Communication Patterns - Udacity
14 pages
E2E 24072024121741 Transcript
No ratings yet
E2E 24072024121741 Transcript
19 pages
Introduction To Graphics Hardware and Gpus Introduction To Graphics Hardware and Gpus
No ratings yet
Introduction To Graphics Hardware and Gpus Introduction To Graphics Hardware and Gpus
22 pages
EAS 520 UmassD Syllabus Sheer
No ratings yet
EAS 520 UmassD Syllabus Sheer
2 pages
HPC GPU BS CS Spring 2025 Imran Ashraf
No ratings yet
HPC GPU BS CS Spring 2025 Imran Ashraf
5 pages
Big Data Analytics Course Overview
No ratings yet
Big Data Analytics Course Overview
11 pages
Open-Source GPGPU for Domain-Specific Architectures
No ratings yet
Open-Source GPGPU for Domain-Specific Architectures
8 pages
Gpu Parallel Program Development Cuda
100% (2)
Gpu Parallel Program Development Cuda
477 pages
GPU-Accelerated CellProfiler Enhancements
No ratings yet
GPU-Accelerated CellProfiler Enhancements
7 pages
Lecture 1 Parallel and Scalable Machine Learning by HPC Morris Riedel
No ratings yet
Lecture 1 Parallel and Scalable Machine Learning by HPC Morris Riedel
50 pages
Introduction To 3D Programming in Delphi
No ratings yet
Introduction To 3D Programming in Delphi
23 pages
SD Times
No ratings yet
SD Times
45 pages
Overview of Reconfigurable Computing
No ratings yet
Overview of Reconfigurable Computing
23 pages

GpuProgrammingWithC andCUDA

Uploaded by

GpuProgrammingWithC andCUDA

Uploaded by

Uncover effective techniques for writing efficient

GPU-parallel C++ applications

Software / hardware covered in the

NVIDIA GPU or access to a Cloud-based Ubuntu Linux 20 or later with

Nsight Compute 2023.3

1. camelCase for names of functions, kernels and variables

Folder names in RED colour

Files Names in Green colour

27-08-2025 12:10 277 Dockerfile

27-08-2025 12:10 351 readme.md

RUN apt-get update && \

apt-get clean && \

27-08-2025 12:10 254 hello_world.cu

27-08-2025 12:10 660 readme.md

or if you are using the docker terminal:

or if you are using the docker terminal:

std::cout << "Number of devices " << deviceCount << std::endl;

27-08-2025 12:10 2,347 primes.cu

__global__ void checkPrimeKernel(long long start, long long end) {

for (long long i = 3; i * i <= num; i += 2) {

if (num <= 1) return false;

int threadsPerBlock = 256;

cudaEvent_t startEvent, stopEvent;

checkPrimeKernel<<<blocksPerGrid, threadsPerBlock>>>(start, end);

auto startTime = std::chrono::high_resolution_clock::now();

for (long long num = start; num <= end; num += 2) {

auto endTime = std::chrono::high_resolution_clock::now();

27-08-2025 12:10 3,199 vector_add.cu

__global__ void vectorAddKernel(float *A, float *B, float *C, int N) {

void vectorAddCpu(float *A, float *B, float *C, int N) {

float *h_A = (float *)malloc(size);

for (int i = 0; i < N; ++i) {

int threadsPerBlock = 1024;

cudaEvent_t startEvent, stopEvent;

vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

float gpuDuration = (gpuCopyTime + gpuExecutionTime + gpuRetrieveTime);

auto start = std::chrono::high_resolution_clock::now();

vectorAddCpu(h_A, h_B, h_C, N);

auto stop = std::chrono::high_resolution_clock::now();

27-08-2025 12:10 2,267 euclidean_distance.cu

__global__ void calculateEuclideanDistanceKernel(Point *lineA, Point *lineB, float *distances, int

if (idx < numPoints) {

distances[idx] = sqrtf(dx * dx + dy * dy + dz * dz);

Point *h_lineA = (Point *)malloc(sizePoints);

for (int i = 0; i < numPoints; i++) {

cudaMemcpy(d_lineA, h_lineA, sizePoints, cudaMemcpyHostToDevice);

cudaEvent_t start, stop;

calculateEuclideanDistanceKernel<<<gridSize, blockSize>>>(d_lineA, d_lineB, d_distances,

cudaMemcpy(h_distances, d_distances, sizeDistances, cudaMemcpyDeviceToHost);

27-08-2025 12:10 <DIR> 2_matrix_multiply

27-08-2025 12:10 <DIR> 3_trapezoidal_travel

27-08-2025 12:10 <DIR> 4_sort_array

27-08-2025 12:10 <DIR> 5_weighted_moving_average

27-08-2025 12:10 256 readme.md

To build the examples we may simply:

27-08-2025 12:10 2,972 matrix_add.cu

__global__ void matrixAddKernel(const double* A, const double* B, double* C, int width) {

if (row < width && col < width) {

void matrixAddCpu(const double* A, const double* B, double* C, int width) {

void initializeMatrix(double *matrix, int size) {

bool checkResults(double *A, double *B, int size) {

int matrixSize = N * N * sizeof(double);

double *h_A = (double*)malloc(matrixSize);

auto startTime = std::chrono::high_resolution_clock::now();

cudaMemcpy(d_A, h_A, matrixSize, cudaMemcpyHostToDevice);

cudaEvent_t start, stop;

matrixAddKernel<<<gridDim, blockDim>>>(d_A, d_B, d_C, N);

cudaMemcpy(h_C_GPU, d_C, matrixSize, cudaMemcpyDeviceToHost);

if (checkResults(h_C_CPU, h_C_GPU, N * N)) {

27-08-2025 12:10 118 CMakeLists.txt

27-08-2025 12:10 4,046 matrix_multiply.cu

if (row < width && col < width) {

void matrixMulCpu(double *A, double *B, double *C, int width) {

void initializeMatrix(double *matrix, int size) {

bool checkResults(double *A, double *B, int size) {

global void checkPrimeKernel(long long start, long long end) {

global void vectorAddKernel(float A, float B, float *C, int N) {

void vectorAddCpu(float A, float B, float *C, int N) {

float h_A = (float )malloc(size);

global void calculateEuclideanDistanceKernel(Point lineA, Point lineB, float *distances, int

Point h_lineA = (Point )malloc(sizePoints);

global void matrixAddKernel(const double* A, const double* B, double* C, int width) {

bool checkResults(double A, double B, int size) {

double h_A = (double)malloc(matrixSize);

void matrixMulCpu(double A, double B, double *C, int width) {

bool checkResults(double A, double B, int size) {

double h_A = (double)malloc(matrixSize);

global void trapezoidalKernel(double speeds, double distances, int n) {

double h_speeds = (double)malloc(N * sizeof(double));

double d_speeds, d_distances;

global void trapezoidalKernel(double speeds, double result, int n) {

double d_speeds, d_partialResults;

double h_partialResults = (double)malloc(gridSize * sizeof(double));

void smoothSensorsCpu(float buffers, int indices, float output, float weights){

bool checkResults(float A, float B, int size) {