GPU Programming with C++ and CUDA
The second section explores GPU architecture and implementation strategies for parallel
algorithms, and offers practical insights into optimizing resource usage for efficient
execution. In the final section, you will explore advanced topics such as utilizing CUDA
streams. You will also learn how to package and distribute GPU-accelerated libraries for the
Python ecosystem, extending the reach and impact of your work.
Combining expert insight with real-world problem solving, this book is a valuable resource
for developers and researchers aiming to harness the full potential of GPU computing. The
blend of theoretical foundations, practical programming techniques, and advanced
optimization strategies it offers is sure to help you succeed in the fast-evolving field of GPU
programming.
Key Learnings
Manage GPU devices and accelerate your applications
Apply parallelism effectively using CUDA and C++
Choose between existing libraries and custom GPU solutions
Package GPU code into libraries for use with Python
Explore advanced topics such as CUDA streams
Implement optimization strategies for resource-efficient execution
Chapters
1. Introduction to Parallel Programming
2. Setting Up Your Development Environment
3. Hello CUDA
4. Hello Again, but in Parallel
5. A Closer Look into the World of GPUs
6. Parallel Algorithms with CUDA
7. Performance Strategies
8. Overlaying Multiple Operations
9. Exposing Your Code to Python
10. Exploring Existing GPU Models
Requirements for this book
You should be comfortable writing programs in C++. Basic knowledge of operating systems will
also help you understand some of the more advanced concepts, since we have to manage device
communication.
CUDA Toolkit 12
Docker 27.0
VS Code 1.92
CMake 3.16
g++ 9.4
Python 3.8
In Chapter 2, we discuss options for configuring the development environment. Some of the
software that we need is installed automatically if you elect to use the Docker-based
development environment.
Code conventions
We use the following conventions:
readme.md
## The original file
As the book evolved, we changed the Dockerfile under `.devcontainer`, but we keep the original
one here for reference.
Originally, for simplicity, we used an NVIDIA-based Docker image. We later switched to an
Ubuntu-based image in which we installed the CUDA Toolkit, for added functionality.
Dockerfile
FROM docker.io/nvidia/cuda:12.0.0-devel-ubuntu20.04
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && \
    apt-get install -y \
xz-utils \
build-essential \
libssl-dev \
git \
cmake \
&& \
rm -rf /var/lib/apt/lists/*
Chapter 3
readme.md
# Hello World
Here is the command to compile from the terminal:
nvcc hello_world.cu -o hello_world -Xcompiler "-Wall" -lcudadevrt -lcudart_static -lstdc++
## Device Query
Here is the command to compile from the terminal:
nvcc -o device_query device_query.cu -Xcompiler "-Wall" -lcudadevrt -lcudart_static -lstdc++
hello_world.cu
#include <iostream>

__global__ void helloWorld() {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    printf("Hello, World! Thread %d\n", tid);
}

int main() {
    helloWorld<<<1, 10>>>();
    cudaDeviceSynchronize();
    return 0;
}
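The listing above does not check for errors, so a failed kernel launch would pass silently. A minimal error-checking variant (a sketch, not the book's listing) could look like this:

#include <cuda_runtime.h>
#include <iostream>

__global__ void helloWorld() {
    printf("Hello, World! Thread %d\n", threadIdx.x);
}

int main() {
    helloWorld<<<1, 10>>>();
    cudaError_t launchErr = cudaGetLastError();     // errors from the launch itself
    cudaError_t syncErr = cudaDeviceSynchronize();  // errors raised during execution
    if (launchErr != cudaSuccess || syncErr != cudaSuccess) {
        std::cerr << "CUDA error: "
                  << cudaGetErrorString(launchErr != cudaSuccess ? launchErr : syncErr)
                  << std::endl;
        return 1;
    }
    return 0;
}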
device_query.cu
#include <cuda_runtime.h>
#include <iostream>
int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    cudaDeviceProp deviceProp;
    int dev = 0;
    cudaGetDeviceProperties(&deviceProp, dev);
    std::cout << "Device " << dev << ": " << deviceProp.name << std::endl;
    std::cout << "  CUDA Capability Major/Minor version number: " << deviceProp.major << "." << deviceProp.minor << std::endl;
    std::cout << "  Total amount of shared memory per block: " << deviceProp.sharedMemPerBlock << " bytes" << std::endl;
    std::cout << "  Maximum number of threads per block: " << deviceProp.maxThreadsPerBlock << std::endl;
    return 0;
}
Chapter 4
readme.md
# Introduction
All our examples are numbered to follow the order in which they appear in the book chapter.
To build the examples:
1. create a `build` folder inside the corresponding code folder
2. run `cmake ..`
3. run `make`
1_primes
CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(primes_gpu CUDA)
add_executable(primes primes.cu)
primes.cu
#include <cuda_runtime.h>
#include <iostream>
#include <chrono>
/*
* For study purposes, we can print the verification of each number:
*/
//printf("tid=%d %lld is prime? %d\n", tid, num, isPrime);
}
bool checkPrimeCpu(long long num) {
int main() {
long long start = 100'001LL; // must start with an odd number
long long end = 190'001LL;
cudaEventRecord(startEvent, 0);
cudaEventRecord(stopEvent, 0);
cudaEventSynchronize(stopEvent);
float gpuDuration = 0;
cudaEventElapsedTime(&gpuDuration, startEvent, stopEvent);
std::cout << "Time taken on GPU: " << gpuDuration << " ms" << std::endl;
cudaEventDestroy(startEvent);
cudaEventDestroy(stopEvent);
return 0;
}
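The primes listing is excerpted here; the kernel and checkPrimeCpu bodies are omitted. As a rough sketch of the kind of trial-division kernel being timed (the names checkPrimesGpu and d_results are illustrative, not from the book):

#include <cuda_runtime.h>

// Hedged sketch: one thread per odd candidate in [start, start + 2 * count).
__global__ void checkPrimesGpu(long long start, long long count, bool *d_results) {
    long long tid = threadIdx.x + (long long)blockIdx.x * blockDim.x;
    if (tid >= count) return;
    long long num = start + 2 * tid;   // step over odd numbers only
    bool isPrime = (num > 1);
    for (long long d = 3; d * d <= num && isPrime; d += 2) {
        if (num % d == 0) isPrime = false;
    }
    d_results[tid] = isPrime;
}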
2_vector_add
CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(vector_add CUDA)
add_executable(vector_add vector_add.cu)
vector_add.cu
#include <iostream>
#include <cuda_runtime.h>
#include <chrono>
int main() {
int N = 100'000'000;
size_t size = N * sizeof(float);
float *d_A;
float *d_B;
float *d_C;
cudaMalloc((void **)&d_A, size);
cudaMalloc((void **)&d_B, size);
cudaMalloc((void **)&d_C, size);
cudaEventRecord(startEvent, 0);
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
cudaEventRecord(stopEvent, 0);
cudaEventSynchronize(stopEvent);
float gpuCopyTime = 0;
cudaEventElapsedTime(&gpuCopyTime, startEvent, stopEvent);
std::cout<< std::fixed << "Time to copy data to GPU: " << gpuCopyTime << " ms" << std::endl;
cudaEventRecord(startEvent, 0);
cudaEventRecord(stopEvent, 0);
cudaEventSynchronize(stopEvent);
float gpuExecutionTime = 0;
cudaEventElapsedTime(&gpuExecutionTime, startEvent, stopEvent);
std::cout<< std::fixed << "Time to execute on GPU: " << gpuExecutionTime << " ms" << std::endl;
cudaEventRecord(startEvent, 0);
cudaEventRecord(stopEvent, 0);
cudaEventSynchronize(stopEvent);
float gpuRetrieveTime = 0;
cudaEventElapsedTime(&gpuRetrieveTime, startEvent, stopEvent);
std::cout << std::fixed << "Time taken to copy results back from GPU: " << gpuRetrieveTime << " ms" << std::endl << std::endl;
cudaEventDestroy(startEvent);
cudaEventDestroy(stopEvent);
std::cout << "Time taken by CPU: " << cpuDuration.count() << " ms" << std::endl;
std::cout << "========================================== " << std::endl;
std::cout << "speed up (execution time only): " << cpuDuration.count() / gpuExecutionTime <<
std::endl;
std::cout << "speed up (GPU total time): " << cpuDuration.count() / gpuDuration << std::endl;
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
free(h_A);
free(h_B);
free(h_C);
return 0;
}
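The kernel itself is elided in this excerpt. A minimal element-wise addition kernel of the kind being timed (a sketch; names are illustrative):

#include <cuda_runtime.h>

// Hedged sketch: element-wise vector addition, one thread per element.
__global__ void vectorAddKernel(const float *a, const float *b, float *c, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

// Typical launch for N elements:
//   int threads = 256;
//   int blocks = (N + threads - 1) / threads;
//   vectorAddKernel<<<blocks, threads>>>(d_A, d_B, d_C, N);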
3_euclidean_distance
CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(euclidean_distance CUDA)
add_executable(euclidean_distance euclidean_distance.cu)
euclidean_distance.cu
#include <cuda_runtime.h>
#include <iostream>
struct Point {
    float x;
    float y;
    float z;
};
}
}
int main() {
int numPoints = 10'000'000;
size_t sizePoints = numPoints * sizeof(Point);
size_t sizeDistances = numPoints * sizeof(float);
Point *d_lineA;
Point *d_lineB;
float *d_distances;
cudaMalloc((void **)&d_lineA, sizePoints);
cudaMalloc((void **)&d_lineB, sizePoints);
cudaMalloc((void **)&d_distances, sizeDistances);
cudaEventRecord(start);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float gpuDuration = 0;
cudaEventElapsedTime(&gpuDuration, start, stop);
std::cout<< std::fixed << "Time taken: " << gpuDuration << " ms" << std::endl;
cudaFree(d_lineA);
cudaFree(d_lineB);
cudaFree(d_distances);
free(h_lineA);
free(h_lineB);
free(h_distances);
cudaEventDestroy(start);
cudaEventDestroy(stop);
return 0;
}
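The distance kernel is elided above. A sketch of a kernel computing distances between corresponding points (illustrative, not the verbatim listing):

#include <cuda_runtime.h>

struct Point { float x; float y; float z; };  // mirrors the struct in the listing

// Hedged sketch: distance between corresponding points of two lines.
__global__ void euclideanDistanceKernel(const Point *a, const Point *b,
                                        float *dist, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        float dx = a[idx].x - b[idx].x;
        float dy = a[idx].y - b[idx].y;
        float dz = a[idx].z - b[idx].z;
        dist[idx] = sqrtf(dx * dx + dy * dy + dz * dz);
    }
}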
Chapter 6
readme.md
# Introduction
All our examples are numbered to follow the order in which they appear in the book chapter.
1_matrix_add
CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(matrix_add CUDA)
add_executable(matrix_add matrix_add.cu)
matrix_add.cu
#include <iostream>
#include <chrono>
#include <cuda_runtime.h>
#include <random>
#define N 7000
int main() {
srand(static_cast<unsigned int>(time(0)));
initializeMatrix(h_A, N * N);
initializeMatrix(h_B, N * N);
double *d_A;
double *d_B;
double *d_C;
cudaMalloc((void**)&d_A, matrixSize);
cudaMalloc((void**)&d_B, matrixSize);
cudaMalloc((void**)&d_C, matrixSize);
std::cout << "Time taken by CPU: " << cpuDuration.count() << " ms" << std::endl;
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float gpuDuration = 0;
cudaEventElapsedTime(&gpuDuration, start, stop);
std::cout << "Time taken by GPU: " << gpuDuration << " ms\n";
free(h_A);
free(h_B);
free(h_C_CPU);
free(h_C_GPU);
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
cudaEventDestroy(start);
cudaEventDestroy(stop);
return 0;
}
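The addition kernel is elided above. A sketch of what such a kernel typically looks like over a flattened N x N array (illustrative):

#include <cuda_runtime.h>

// Hedged sketch: element-wise matrix addition over a flattened N x N array.
__global__ void matrixAddKernel(const double *A, const double *B, double *C, int total) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < total) {
        C[idx] = A[idx] + B[idx];
    }
}

// Typical launch covering all N * N elements:
//   matrixAddKernel<<<(N * N + 255) / 256, 256>>>(d_A, d_B, d_C, N * N);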
2_matrix_multiply
add_executable(matrix_multiply matrix_multiply.cu)
matrix_multiply.cu
#include <iostream>
#include <iomanip>
#include <cuda_runtime.h>
#include <chrono>
#define N 2000
__global__ void matrixMulKernel(double *A, double *B, double *C, int width) {
int row = threadIdx.y + blockIdx.y * blockDim.y;
int col = threadIdx.x + blockIdx.x * blockDim.x;
int main() {
srand(static_cast<unsigned int>(time(0)));
initializeMatrix(h_A, N * N);
initializeMatrix(h_B, N * N);
std::cout << "Time taken by CPU: " << cpuDuration.count() << " ms" << std::endl;
double *d_A;
double *d_B;
double *d_C;
cudaMalloc((void**)&d_A, matrixSize);
cudaMalloc((void**)&d_B, matrixSize);
cudaMalloc((void**)&d_C, matrixSize);
cudaEventRecord(start);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float gpuTimeToCopy = 0;
cudaEventElapsedTime(&gpuTimeToCopy, start, stop);
std::cout << "GPU memory copy time: " << gpuTimeToCopy << " ms" << std::endl;
cudaEventRecord(start);
dim3 blockDim(16, 16);
dim3 gridDim((N + blockDim.x - 1) / blockDim.x, (N + blockDim.y - 1) / blockDim.y);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float gpuDuration = 0;
cudaEventElapsedTime(&gpuDuration, start, stop);
std::cout << "GPU execution time: " << gpuDuration << " ms" << std::endl;
cudaEventRecord(start);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float gpuTimeToRetrieve = 0;
cudaEventElapsedTime(&gpuTimeToRetrieve, start, stop);
std::cout << "GPU memory retrieve time: " << gpuTimeToRetrieve << " ms" << std::endl;
std::cout << "GPU total time: " << (gpuDuration + gpuTimeToCopy + gpuTimeToRetrieve) << " ms"
<< std::endl;
free(h_A);
free(h_B);
free(h_C_CPU);
free(h_C_GPU);
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
cudaEventDestroy(start);
cudaEventDestroy(stop);
return 0;
}
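The multiplication kernel body is elided above; it presumably follows the classic inner-product form. A sketch for reference (not the verbatim listing):

#include <cuda_runtime.h>

// Hedged sketch: naive dense matrix multiply, one thread per output element.
__global__ void matrixMulKernelSketch(const double *A, const double *B, double *C, int width) {
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    if (row < width && col < width) {
        double sum = 0.0;
        for (int k = 0; k < width; k++) {
            sum += A[row * width + k] * B[k * width + col];
        }
        C[row * width + col] = sum;
    }
}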
3_trapezoidal_travel
CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(trapezoidal_travel CUDA)
add_executable(trapezoidal_travel_basic trapezoidal_travel_basic.cu)
add_executable(trapezoidal_travel_gpu_reduce trapezoidal_travel_gpu_reduce.cu)
trapezoidal_travel_basic.cu
#include <iostream>
#include <chrono>
#include <cuda_runtime.h>
#include <cstdlib>
int main() {
srand(static_cast<unsigned int>(time(0)));
generateRandomSpeeds(h_speeds, N);
cudaDeviceSynchronize();
std::cout << "GPU Distance: " << gpuResult << " km" << std::endl;
std::cout << "GPU Execution Time: " << gpuDuration << " ms" << std::endl;
free(h_speeds);
free(h_distances);
cudaFree(d_speeds);
cudaFree(d_distances);
cudaEventDestroy(startGPU);
cudaEventDestroy(endGPU);
return 0;
}
trapezoidal_travel_gpu_reduce.cu
#include <iostream>
#include <chrono>
#include <cuda_runtime.h>
#include <cstdlib>
if (tid == 0) {
result[blockIdx.x] = sharedData[0];
}
}
int main() {
srand(static_cast<unsigned int>(time(0)));
double *h_speeds = (double*)malloc(N * sizeof(double));
generateRandomSpeeds(h_speeds, N);
cudaEventRecord(endGPU);
cudaEventSynchronize(endGPU);
std::cout << "GPU Distance: " << gpuResult << " km" << std::endl;
std::cout << "GPU Execution Time: " << gpuDuration << " ms" << std::endl;
free(h_speeds);
free(h_partialResults);
cudaFree(d_speeds);
cudaFree(d_partialResults);
cudaEventDestroy(startGPU);
cudaEventDestroy(endGPU);
return 0;
}
4_sort_array
CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(sort_array CUDA)
add_executable(sort_array sort_array.cu)
sort_array.cu
#include <iostream>
#include <vector>
#include <chrono>
#include <algorithm>
#include <cfloat>
#include <iomanip>
#include <cuda_runtime.h>
__global__ void oddEvenSortStepKernel(double *arr, int size, bool *swapped, bool isOddPhase) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int i = isOddPhase ? 2 * idx + 1 : 2 * idx;
    if (i < size - 1) {
        if (arr[i] > arr[i + 1]) {
            double temp = arr[i];
            arr[i] = arr[i + 1];
            arr[i + 1] = temp;
            *swapped = true;
        }
    }
}
do {
    h_swapped = false;
    cudaMemcpy(d_swapped, &h_swapped, sizeof(bool), cudaMemcpyHostToDevice);
    // Odd phase
    oddEvenSortStepKernel<<<blocks, threads>>>(d_arr, size, d_swapped, true);
    cudaDeviceSynchronize();
    // Even phase
    oddEvenSortStepKernel<<<blocks, threads>>>(d_arr, size, d_swapped, false);
    cudaDeviceSynchronize();
    // Copy the flag back so the host loop knows whether another pass is needed
    cudaMemcpy(&h_swapped, d_swapped, sizeof(bool), cudaMemcpyDeviceToHost);
} while (h_swapped);
cudaFree(d_arr);
cudaFree(d_swapped);
}
do {
swapped = false;
// Odd phase
for (int i = 1; i < size - 1; i += 2) {
if (arr[i] > arr[i + 1]) {
std::swap(arr[i], arr[i + 1]);
swapped = true;
}
}
// Even phase
for (int i = 0; i < size - 1; i += 2) {
if (arr[i] > arr[i + 1]) {
std::swap(arr[i], arr[i + 1]);
swapped = true;
}
}
} while (swapped);
}
int main() {
srand(static_cast<unsigned int>(time(0)));
int n = 100'000;
double *h_data = (double*)malloc(n * sizeof(double));
double *h_data_gpu = (double*)malloc(n * sizeof(double));
double* d_data;
cudaMalloc(&d_data, n * sizeof(double));
cudaMemcpy(d_data, h_data, n * sizeof(double), cudaMemcpyHostToDevice);
cudaEventRecord(startGpu);
oddEvenSortGpu(d_data, n);
cudaEventRecord(stopGpu);
cudaEventSynchronize(stopGpu);
float gpuDuration;
cudaEventElapsedTime(&gpuDuration, startGpu, stopGpu);
std::cout << "GPU sorting time: " << gpuDuration << " ms" << std::endl;
double max_difference = 0;
for (int i = 0; i < n; i++) {
max_difference = std::max(max_difference, std::abs(h_data[i] - h_data_gpu[i]));
}
std::cout << "Max difference between CPU and GPU results: " << max_difference << std::endl;
free(h_data);
free(h_data_gpu);
cudaFree(d_data);
cudaEventDestroy(startGpu);
cudaEventDestroy(stopGpu);
return 0;
}
5_weighted_moving_average
CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(weighted_moving_average CUDA)
add_executable(weighted_moving_average weighted_moving_average.cu)
weighted_moving_average.cu
#include <iostream>
#include <cuda_runtime.h>
#include <vector>
#include <cstdlib>
#include <chrono>
__global__ void smoothSensorsKernel(float *buffers, int *indices, float *output, float *weights) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx >= NUM_SENSORS) return;
output[idx] = val;
}
int main() {
srand(static_cast<unsigned int>(time(0)));
float *d_buffers;
float *d_output;
float *d_weights;
int *d_indices;
if (i == 0) {
for(int j = 0; j < NUM_READINGS; j++) {
initializeBuffer(h_buffers, j);
}
} else {
newDataIdx = h_indices[0];
rotateIndices(h_indices);
initializeBuffer(h_buffers, newDataIdx); //one new reading for all sensors
}
cudaEventRecord(start);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float gpuTimeToCopy = 0;
cudaEventElapsedTime(&gpuTimeToCopy, start, stop);
std::cout << "GPU memory copy time: " << gpuTimeToCopy << " ms" << std::endl;
cudaEventRecord(start);
smoothSensorsKernel<<<blocks, threads>>>(d_buffers, d_indices, d_output, d_weights);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float gpuDuration = 0;
cudaEventElapsedTime(&gpuDuration, start, stop);
std::cout << "GPU execution time: " << gpuDuration << " ms" << std::endl;
cudaEventRecord(start);
cudaMemcpy(h_output_GPU, d_output, bufferSize, cudaMemcpyDeviceToHost);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float gpuTimeToRetrieve = 0;
cudaEventElapsedTime(&gpuTimeToRetrieve, start, stop);
std::cout << "GPU memory retrieve time: " << gpuTimeToRetrieve << " ms" << std::endl;
std::cout << "speed up over all executions: " << cpuGlobalTime.count() / gpuGlobalTime <<
std::endl;
cudaFree(d_buffers);
cudaFree(d_output);
cudaFree(d_indices);
cudaFree(d_weights);
free(h_output_CPU);
free(h_output_GPU);
free(h_buffers);
return 0;
}
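The body of smoothSensorsKernel is elided above. A sketch of a weighted moving average over a per-sensor circular buffer, under our own layout assumptions (the constants and the [reading][sensor] layout are illustrative, not the book's):

#include <cuda_runtime.h>

#define NUM_SENSORS 100000   // illustrative; the listing defines its own values
#define NUM_READINGS 16      // illustrative

// Hedged sketch: weighted moving average over a per-sensor circular buffer.
// Assumes buffers is laid out as [reading][sensor] and indices[] lists the
// readings in chronological order; weights has NUM_READINGS entries.
__global__ void smoothSensorsKernelSketch(const float *buffers, const int *indices,
                                          float *output, const float *weights) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= NUM_SENSORS) return;
    float val = 0.0f;
    for (int j = 0; j < NUM_READINGS; j++) {
        int reading = indices[j];
        val += buffers[reading * NUM_SENSORS + idx] * weights[j];
    }
    output[idx] = val;
}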
Chapter 7
readme.md
# Introduction
All our examples are numbered to follow the order in which they appear in the book chapter.
CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(performance_strategies CUDA)
set(CMAKE_CXX_FLAGS_RELEASE "-O3")
set(CMAKE_CUDA_FLAGS_RELEASE "-O3 -arch=sm_75 -rdc=false")
add_executable(matrix_multiply_float matrix_multiply_float.cu)
matrix_multiply_float.cu
#include <chrono>
#include <cuda_runtime.h>
#include <iostream>
#include <iomanip>
__global__ void matrixMulKernel(float *A, float *B, float *C, int width) {
__shared__ float Asub[TILE][TILE];
__shared__ float Bsub[TILE][TILE];
int tx = threadIdx.x;
int ty = threadIdx.y;
int row = ty + blockIdx.y * blockDim.y;
int col = tx + blockIdx.x * blockDim.x;
__syncthreads();
#pragma unroll
for (int k = 0; k < blockDim.x; k++) {
sum = fmaf(Asub[ty][k], Bsub[k][tx], sum);
}
__syncthreads();
}
int tx = threadIdx.x;
int ty = threadIdx.y;
int row = ty + blockIdx.y * blockDim.y;
int col = tx + blockIdx.x * blockDim.x;
__syncthreads();
#pragma unroll
for (int k = 0; k < blockDim.x; k++) {
sum += Asub[ty][k] * Bsub[k][tx];
// Use fused multiply-add
//sum = fmaf(Asub[ty][k], Bsub[k][tx], sum);
}
__syncthreads();
}
__global__ void matrixMulKernel_naive(float *A, float *B, float *C, int width) {
int row = threadIdx.y + blockIdx.y * blockDim.y;
int col = threadIdx.x + blockIdx.x * blockDim.x;
__global__ void matrixMulKernel_row(float *A, float *B, float *C, int width) {
int row = threadIdx.x + blockIdx.x * blockDim.x;
if (row < width) {
for (int col = 0; col < width; col++) {
float sum = 0.0f;
for (int i = 0; i < width; i++) {
sum += A[row * width + i] * B[i * width + col];
}
C[row * width + col] = sum;
}
}
}
__global__ void matrixMulKernel_col(float *A, float *B, float *C, int width) {
int col = threadIdx.x + blockIdx.x * blockDim.x;
if (col < width) {
for (int row = 0; row < width; row++) {
float sum = 0.0f;
for (int i = 0; i < width; i++) {
sum += A[row * width + i] * B[i * width + col];
}
C[row * width + col] = sum;
}
}
}
int main() {
srand(static_cast<unsigned int>(time(0)));
std::cout << "Time taken by CPU: " << cpuDuration.count() << " ms" << std::endl;
float *d_A;
float *d_B;
float *d_C;
cudaMalloc((void**)&d_A, matrixSize);
cudaMalloc((void**)&d_B, matrixSize);
cudaMalloc((void**)&d_C, matrixSize);
cudaEventRecord(start);
cudaMemcpy(d_A, h_A, matrixSize, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, matrixSize, cudaMemcpyHostToDevice);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float gpuTimeToCopy = 0;
cudaEventElapsedTime(&gpuTimeToCopy, start, stop);
std::cout << "GPU memory copy time: " << gpuTimeToCopy << " ms" << std::endl;
cudaEventRecord(start);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float gpuTime = 0;
cudaEventElapsedTime(&gpuTime, start, stop);
std::cout << "GPU execution time: " << gpuTime << " ms" << std::endl;
cudaEventRecord(start);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float gpuTimeToRetrieve = 0;
cudaEventElapsedTime(&gpuTimeToRetrieve, start, stop);
std::cout << "GPU memory retrieve time: " << gpuTimeToRetrieve << " ms" << std::endl;
std::cout << "GPU total time: " << (gpuTime + gpuTimeToCopy + gpuTimeToRetrieve) << " ms" <<
std::endl;
free(h_A);
free(h_B);
free(h_C_CPU);
free(h_C_GPU);
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
cudaEventDestroy(start);
cudaEventDestroy(stop);
return 0;
}
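The tile-loading code of the shared-memory kernel is elided above. For reference, a complete tiled multiply sketch, assuming width is a multiple of TILE (the book's listing may handle the general case differently):

#include <cuda_runtime.h>

#define TILE 16

// Hedged sketch: shared-memory tiled multiply, width assumed divisible by TILE,
// launched with blockDim = (TILE, TILE) and gridDim = (width/TILE, width/TILE).
__global__ void matrixMulTiledSketch(const float *A, const float *B, float *C, int width) {
    __shared__ float Asub[TILE][TILE];
    __shared__ float Bsub[TILE][TILE];
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int row = ty + blockIdx.y * TILE;
    int col = tx + blockIdx.x * TILE;
    float sum = 0.0f;
    for (int t = 0; t < width / TILE; t++) {
        Asub[ty][tx] = A[row * width + t * TILE + tx];   // stage one tile of A
        Bsub[ty][tx] = B[(t * TILE + ty) * width + col]; // stage one tile of B
        __syncthreads();
        #pragma unroll
        for (int k = 0; k < TILE; k++) {
            sum = fmaf(Asub[ty][k], Bsub[k][tx], sum);   // fused multiply-add
        }
        __syncthreads();
    }
    C[row * width + col] = sum;
}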
Chapter 8
readme.md
# Introduction
All our examples are numbered to follow the order in which they appear in the book chapter.
1_vector_add
CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(overlay_things CUDA)
add_executable(vector_add vector_add.cu)
vector_add.cu
#include <iostream>
#include <cuda_runtime.h>
if (idx < N) {
    c[idx] = a[idx] + b[idx];
}
}
int main() {
const int N = 130;
int h_a[N];
int h_b[N];
int h_c[N];
int cpuResult[N];
int *d_a;
int *d_b;
int *d_c;
cudaMalloc(&d_a, N * sizeof(int));
cudaMalloc(&d_b, N * sizeof(int));
cudaMalloc(&d_c, N * sizeof(int));
return 0;
}
2_mem_bandwidth
CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(overlay_things CUDA)
add_executable(mem_bandwidth mem_bandwidth.cu)
add_executable(memcpy_benchmark memcpy_benchmark.cu)
memcpy_benchmark.cu
#include <cuda_runtime.h>
#include <iostream>
int main() {
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int device;
cudaGetDevice(&device);
int concurrentKernels = 0;
int asyncEngineCount = 0;
std::cout << "Device " << device << ":" << std::endl;
std::cout << " Concurrent Kernel Execution: " << (concurrentKernels ? "Yes" : "No") << std::endl;
std::cout << " Async Engine Count: " << asyncEngineCount << std::endl;
std::cout << "Memory Clock Speed: " << memClockMHz << " MHz" << std::endl;
std::cout << "Memory Bus Width: " << busWidth << " bits" << std::endl;
std::cout << "Estimated Memory Bandwidth: " << bandwidthGBs << " GB/s" << std::endl;
return 0;
}
mem_bandwidth.cu
#include <cuda_runtime.h>
#include <iostream>
cudaEventRecord(start);
cudaMemcpyAsync(d_data, h_data, dataSize, cudaMemcpyHostToDevice);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
float bandwidth = (dataSize / (milliseconds / 1000.0)) / (1024.0 * 1024.0 * 1024.0);
std::cout << "Data Size: " << dataSize / (1024 * 1024) << " MB, Time: " << milliseconds << " ms,
Bandwidth: " << bandwidth << " GB/s\n";
cudaFreeHost(h_data);
cudaFree(d_data);
}
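// Note: the definition of SIZE_MB is not shown in this excerpt; it is presumably
// a small helper macro along the lines of (our assumption):
//   #define SIZE_MB(mb) ((size_t)(mb) * 1024 * 1024)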
int main() {
std::cout << "Testing different transfer sizes:\n";
measureMemcpyBandwidth(SIZE_MB(1));
measureMemcpyBandwidth(SIZE_MB(10));
measureMemcpyBandwidth(SIZE_MB(20));
measureMemcpyBandwidth(SIZE_MB(30));
measureMemcpyBandwidth(SIZE_MB(32));
measureMemcpyBandwidth(SIZE_MB(40));
measureMemcpyBandwidth(SIZE_MB(50));
measureMemcpyBandwidth(SIZE_MB(100));
measureMemcpyBandwidth(SIZE_MB(200));
measureMemcpyBandwidth(SIZE_MB(300));
measureMemcpyBandwidth(SIZE_MB(400));
measureMemcpyBandwidth(SIZE_MB(500));
measureMemcpyBandwidth(SIZE_MB(600));
measureMemcpyBandwidth(SIZE_MB(700));
measureMemcpyBandwidth(SIZE_MB(800));
measureMemcpyBandwidth(SIZE_MB(900));
measureMemcpyBandwidth(SIZE_MB(1024)); // 1GB
return 0;
}
3_multiplying_with_streams
CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(overlay_things CUDA)
add_executable(vec_mat_mul_streams vec_mat_mul_streams.cu)
add_executable(vec_mat_mul_nostreams vec_mat_mul_nostreams.cu)
vec_mat_mul_nostreams.cu
#include <cuda_runtime.h>
#include <iostream>
#include <vector>
#include <cassert>
#include <chrono>
__global__ void vectorMatrixMulKernel(float* d_vec, float* d_mat, float* d_res, int rows, int cols) {
    int row = threadIdx.x + blockIdx.x * blockDim.x;
    if (row < rows) {
        float sum = 0.0f;
        for (int col = 0; col < cols; col++) {
            sum += d_mat[row * cols + col] * d_vec[col];
        }
        d_res[row] = sum;
    }
}

void vectorMatrixMulCpu(float *vec, float *mat, float *res, int rows, int cols) {
    for (int i = 0; i < rows; i++) {
        res[i] = 0;
        for (int j = 0; j < cols; j++) {
            res[i] += mat[i * cols + j] * vec[j];
        }
    }
}
float *d_vec;
float *d_mat;
float *d_res;
cudaMalloc(&d_vec, cols * sizeof(float));
cudaMalloc(&d_mat, rows * cols * sizeof(float));
cudaMalloc(&d_res, rows * sizeof(float));
cudaEventRecord(start);
cudaMemcpy(d_vec, h_vec, cols * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_mat, h_mat, rows * cols * sizeof(float), cudaMemcpyHostToDevice);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float gpuDuration = 0;
cudaEventElapsedTime(&gpuDuration, start, stop);
std::cout << "Time taken by GPU: " << gpuDuration << " ms, for matrix: " << cols << "x" << cols <<
std::endl;
std::cout << "Time taken by CPU: " << cpuDuration.count() << " ms" << std::endl;
cudaFree(d_vec);
cudaFree(d_mat);
cudaFree(d_res);
int main() {
srand(static_cast<unsigned int>(time(0)));
compute(16380);
compute(32760);
return 0;
}
vec_mat_mul_streams.cu
#include <cuda_runtime.h>
#include <iostream>
#include <vector>
#include <cassert>
#include <chrono>
void vectorMatrixMulCpu(float* vec, float* mat, float* res, int rows, int cols) {
for (int i = 0; i < rows; i++) {
res[i] = 0;
for (int j = 0; j < cols; j++) {
res[i] += mat[i * cols + j] * vec[j];
}
}
}
initializeMatrix(h_vec, cols);
initializeMatrix(h_mat_pinned, rows * cols);
float *d_vec;
float *d_mat1;
float *d_mat2;
float *d_res1;
float *d_res2;
cudaMalloc(&d_vec, cols * sizeof(float));
cudaMalloc(&d_mat1, chunkSize * cols * sizeof(float));
cudaMalloc(&d_mat2, chunkSize * cols * sizeof(float));
cudaMalloc(&d_res1, chunkSize * sizeof(float));
cudaMalloc(&d_res2, chunkSize * sizeof(float));
cudaEventRecord(start);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float gpuDuration = 0;
cudaEventElapsedTime(&gpuDuration, start, stop);
std::cout << "Time taken by GPU: " << gpuDuration << " ms, for matrix: " << cols << "x" << cols << "
chunk size: " << chunkSize << std::endl;
if (computeCpuPart) {
auto startCpu = std::chrono::high_resolution_clock::now();
vectorMatrixMulCpu(h_vec, h_mat_pinned, h_res_cpu, rows, cols);
auto stopCpu = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> cpuDuration = (stopCpu - startCpu);
std::cout << "Time taken by CPU: " << cpuDuration.count() << " ms" << std::endl;
cudaFreeHost(h_mat_pinned);
cudaFreeHost(h_res_gpu);
cudaFree(d_vec);
cudaFree(d_mat1);
cudaFree(d_mat2);
cudaFree(d_res1);
cudaFree(d_res2);
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
return gpuDuration;
}
/*
 * The average function collects the gpuDuration for each chunkSize and matrix size
 * over the given number of iterations and calculates the percentage gain against
 * the provided reference value: the execution time of the non-streams version.
 *
 * Note: you have to run the non-streams version with the same matrix size.
 */
void average(int chunkSize, int cols, int iterations, float noStreamsReferenceTime) {
long dataSize = (chunkSize * cols);
long totalDataSize = (cols * cols);
int main() {
srand(static_cast<unsigned int>(time(0)));
/*
 * If you want to check an average over several executions against the values
 * measured for the no-streams version, you can use the following sample calls,
 * updating the last value to the time measured on your system.
 */
// average( 273, 32760, 10, 500.615);
// average( 546, 32760, 10, 500.615);
// average( 819, 32760, 10, 500.615);
// average( 1638, 32760, 10, 500.615);
// average( 4095, 32760, 10, 500.615);
// average( 8190, 32760, 10, 500.615);
// average(16380, 32760, 10, 500.615);
return 0;
}
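The double-buffered stream loop is elided above. A sketch of the overlap pattern (variable and kernel names follow the excerpts; the launch configuration and the assumption that rows is a multiple of 2 * chunkSize are ours):

// Hedged sketch of the elided double-buffered loop.
void multiplyChunkedSketch(const float *h_mat_pinned, float *h_res_gpu, float *d_vec,
                           float *d_mat1, float *d_mat2, float *d_res1, float *d_res2,
                           cudaStream_t stream1, cudaStream_t stream2,
                           int rows, int cols, int chunkSize) {
    int threads = 256;
    int blocks = (chunkSize + threads - 1) / threads;
    for (int offset = 0; offset < rows; offset += 2 * chunkSize) {
        // Stream 1 works on one chunk while stream 2 copies the next one in.
        cudaMemcpyAsync(d_mat1, h_mat_pinned + (size_t)offset * cols,
                        (size_t)chunkSize * cols * sizeof(float),
                        cudaMemcpyHostToDevice, stream1);
        vectorMatrixMulKernel<<<blocks, threads, 0, stream1>>>(d_vec, d_mat1, d_res1,
                                                               chunkSize, cols);
        cudaMemcpyAsync(h_res_gpu + offset, d_res1, chunkSize * sizeof(float),
                        cudaMemcpyDeviceToHost, stream1);

        cudaMemcpyAsync(d_mat2, h_mat_pinned + (size_t)(offset + chunkSize) * cols,
                        (size_t)chunkSize * cols * sizeof(float),
                        cudaMemcpyHostToDevice, stream2);
        vectorMatrixMulKernel<<<blocks, threads, 0, stream2>>>(d_vec, d_mat2, d_res2,
                                                               chunkSize, cols);
        cudaMemcpyAsync(h_res_gpu + offset + chunkSize, d_res2, chunkSize * sizeof(float),
                        cudaMemcpyDeviceToHost, stream2);
    }
    cudaDeviceSynchronize();   // wait for both streams to drain
}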
4_multigpu
CMakeLists.txt
add_executable(multigpu multigpu.cu)
multigpu.cu
#include <cuda_runtime.h>
#include <iostream>
#include <vector>
#include <chrono>
#include <cmath>
__global__ void vectorMatrixMulKernel(float* d_vec, float* d_mat, float* d_res, int rows, int cols) {
    int row = threadIdx.x + blockIdx.x * blockDim.x;
    if (row < rows) {
        float sum = 0.0f;
        for (int col = 0; col < cols; col++) {
            sum += d_mat[row * cols + col] * d_vec[col];
        }
        d_res[row] = sum;
    }
}

void vectorMatrixMulCpu(float* vec, float* mat, float* res, int rows, int cols) {
    for (int row = 0; row < rows; row++) {
        float sum = 0.0f;
        for (int col = 0; col < cols; col++) {
            sum += mat[row * cols + col] * vec[col];
        }
        res[row] = sum;
    }
}
int main() {
srand(static_cast<unsigned int>(time(0)));
int deviceCount = 0;
cudaGetDeviceCount(&deviceCount);
initializeMatrix(h_vec, cols);
initializeMatrix(h_mat, rows * cols);
std::vector<float*> d_vec(deviceCount);
std::vector<float*> d_mat(deviceCount);
std::vector<float*> d_res(deviceCount);
for (int device = 0; device < deviceCount; device++) {
cudaSetDevice(device);
cudaMalloc(&d_vec[device], cols * sizeof(float));
cudaMalloc(&d_mat[device], chunk_size * cols * sizeof(float));
cudaMalloc(&d_res[device], chunk_size * sizeof(float));
return 0;
}
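The per-device copies and kernel launches are elided above. A sketch of how each GPU might process its chunk (the host result buffer h_res and the launch configuration are our assumptions):

// Hedged sketch of the elided per-device steps. h_res is assumed to be a host
// buffer of rows floats.
for (int device = 0; device < deviceCount; device++) {
    cudaSetDevice(device);
    size_t rowOffset = (size_t)device * chunk_size;
    cudaMemcpy(d_vec[device], h_vec, cols * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_mat[device], h_mat + rowOffset * cols,
               (size_t)chunk_size * cols * sizeof(float), cudaMemcpyHostToDevice);
    int threads = 256;
    int blocks = (chunk_size + threads - 1) / threads;
    vectorMatrixMulKernel<<<blocks, threads>>>(d_vec[device], d_mat[device],
                                               d_res[device], chunk_size, cols);
    cudaMemcpy(h_res + rowOffset, d_res[device], chunk_size * sizeof(float),
               cudaMemcpyDeviceToHost);
}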
Chapter 9
readme.md
# Introduction
This chapter presents a C++ library that is built with the traditional steps, plus three wrapper
implementations that expose the code to Python.
We also provide a run_tests.sh inside the `python` folder that runs the corresponding `test.py`
file for each wrapper alternative. The specific commands are listed below:
## For CTypes
There is no need to build, just run the code with the Python interpreter. There are two files:
1. test_ctypes.py, which uses the library directly
2. test.py, which is used by the script that runs all the wrappers
## For Wrapper
It is built by setup.py with the following command:
1. python3 setup.py build_ext --inplace
include
vector_add.h
#ifndef VEC_ADD_H
#define VEC_ADD_H
extern "C" {
void vectorAdd(int *a, int *b, int *c, int N);
}
#endif
python
run_tests.sh
# Program to execute
program="test.py"
for i in {1..5}; do
echo "*** Processing parameter value: $param"
# Loop over each subfolder
for subfolder in "${subfolders[@]}"; do
echo "## Processing folder: $subfolder"
for i in {1..10}; do
# Execute the program with parameters
echo "$i"
python3 "$program" "$param" || { echo "Failed to execute program in $subfolder"; exit 1; }
done
ctypes
test.py
lib = ctypes.CDLL("../../build/libvector_add.so")
lib.vectorAdd.argtypes = [
ctypes.POINTER(ctypes.c_int),
ctypes.POINTER(ctypes.c_int),
ctypes.POINTER(ctypes.c_int),
ctypes.c_int
]
def calculate(size):
N = int(size)
a = np.array(range(N), dtype=np.int32)
b = np.array(range(N, 2*N), dtype=np.int32)
c = np.zeros(N, dtype=np.int32)
startTime = time.time()
lib.vectorAdd(
a.ctypes.data_as(ctypes.POINTER(ctypes.c_int)),
b.ctypes.data_as(ctypes.POINTER(ctypes.c_int)),
c.ctypes.data_as(ctypes.POINTER(ctypes.c_int)),
N
)
endTime = time.time()
def main():
parser = argparse.ArgumentParser(description="N")
parser.add_argument("N", help="The size of the array")
args = parser.parse_args()
calculate(args.N)
if __name__ == "__main__":
main()
test_ctypes.py
import ctypes
import numpy as np
lib = ctypes.CDLL("../../build/libvector_add.so")
lib.vectorAdd.argtypes = [
ctypes.POINTER(ctypes.c_int),
ctypes.POINTER(ctypes.c_int),
ctypes.POINTER(ctypes.c_int),
ctypes.c_int
]
N = 10000000
a = np.array(range(N), dtype=np.int32)
b = np.array(range(N, 2*N), dtype=np.int32)
c = np.zeros(N, dtype=np.int32)
lib.vectorAdd(
a.ctypes.data_as(ctypes.POINTER(ctypes.c_int)),
b.ctypes.data_as(ctypes.POINTER(ctypes.c_int)),
c.ctypes.data_as(ctypes.POINTER(ctypes.c_int)),
N
)
print("Result:", c)
wrapper
pyproject_toml.txt
[build-system]
requires = ["setuptools", "wheel"]
build-backend = "setuptools.build_meta"
[project]
name = "vector_add_wrapper"
version = "1.0"
description = "Python binding for CUDA vector addition"
authors = [{ name = "Paulo Motta", email = "[email protected]" }]
dependencies = []
[tool.setuptools]
packages = []
[tool.setuptools.ext_modules]
vector_add_wrapper = { sources = ["vector_add_wrapper.c"] }
setup.py
from setuptools import setup, Extension
module = Extension(
"vector_add_wrapper",
sources=["vector_add_wrapper.c"],
)
setup(
name="vector_add_wrapper",
version="1.0",
description="Python binding for CUDA vector addition",
ext_modules=[module],
)
test.py
import vector_add_wrapper as vaw
import random
import time
import argparse
def calculate(size):
N = int(size)
startTime = time.time()
result = vaw.vectorAdd(a, b, N)
endTime = time.time()
def main():
parser = argparse.ArgumentParser(description="N")
parser.add_argument("N", help="The size of the array")
args = parser.parse_args()
calculate(args.N)
if __name__ == "__main__":
main()
test_vector_add_wrapper.py
import vector_add_wrapper as vaw
import random
N = 10000000
a = [random.randint(0, N) for _ in range(N)]
b = [random.randint(N, 2*N) for _ in range(N)]
result = vaw.vectorAdd(a, b, N)
print("Result:", result)
vector_add_wrapper.c
#define PY_SSIZE_T_CLEAN
#include <Python.h>
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
vectorAdd(a, b, c, N);
free(a);
free(b);
free(c);
return result;
}
PyMODINIT_FUNC PyInit_vector_add_wrapper(void) {
void *handle = dlopen("../../build/libvector_add.so", RTLD_LAZY);
if (!handle) {
PyErr_SetString(PyExc_ImportError, "Could not load libvector_add.so");
return NULL;
}
return PyModule_Create(&vectoraddmodule);
}
wrapper_np
setup.py
from setuptools import setup, Extension
import numpy
module = Extension(
"vector_add_np_wrapper",
include_dirs = [numpy.get_include()],
sources=["vector_add_np_wrapper.c"],
)
setup(
name="vector_add_np_wrapper",
version="1.0",
description="Python binding for CUDA vector addition",
ext_modules=[module],
include_dirs=[numpy.get_include()]
)
test.py
import vector_add_np_wrapper as vaw_np
import numpy as np
import time
import argparse
def calculate(size):
N = int(size)
a = np.random.randint(0, N, size=N, dtype=np.int32)
b = np.random.randint(N, 2*N, size=N, dtype=np.int32)
c = np.zeros(N, dtype=np.int32)
startTime = time.time()
vaw_np.vectorAdd(a, b, c, N)
endTime = time.time()
def main():
parser = argparse.ArgumentParser(description="N")
parser.add_argument("N", help="The size of the array")
args = parser.parse_args()
calculate(args.N)
if __name__ == "__main__":
main()
test_vector_add_np_wrapper.py
import vector_add_np_wrapper as vaw_np
import numpy as np
N = 10000000
a = np.random.randint(0, N, size=N, dtype=np.int32)
b = np.random.randint(N, 2*N, size=N, dtype=np.int32)
c = np.zeros(N, dtype=np.int32)
vaw_np.vectorAdd(a, b, c, N)
print("Result:", c)
vector_add_np_wrapper.c
#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
#define PY_SSIZE_T_CLEAN
#include <Python.h>
#include <numpy/arrayobject.h>
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
if (!PyArg_ParseTuple(args, "OOOi",
&a,
&b,
&c,
&N)) {
return NULL;
}
int *a_ptr = (int*)PyArray_DATA(a);
int *b_ptr = (int*)PyArray_DATA(b);
int *c_ptr = (int*)PyArray_DATA(c);
Py_RETURN_NONE;
}
PyMODINIT_FUNC PyInit_vector_add_np_wrapper(void) {
void *handle = dlopen("../../build/libvector_add.so", RTLD_LAZY);
if (!handle) {
PyErr_SetString(PyExc_ImportError, "Could not load libvector_add.so");
return NULL;
}
import_array();
return PyModule_Create(&vectoraddmodule);
}
src
vector_add.cu
#include <iostream>
#include <cuda_runtime.h>
#include "../include/vector_add.h"
cudaMalloc((void**)&d_a, N * sizeof(int));
cudaMalloc((void**)&d_b, N * sizeof(int));
cudaMalloc((void**)&d_c, N * sizeof(int));
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);
}
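The body of the exported function is elided above. A sketch of the complete library entry point, consistent with the header and the allocations shown (the kernel name is illustrative, not the book's):

#include <cuda_runtime.h>
#include "../include/vector_add.h"

// Hedged sketch of the exported entry point: copy in, launch, copy out.
__global__ void vectorAddKernelSketch(const int *a, const int *b, int *c, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) c[idx] = a[idx] + b[idx];
}

extern "C" void vectorAdd(int *a, int *b, int *c, int N) {
    int *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, N * sizeof(int));
    cudaMalloc((void**)&d_b, N * sizeof(int));
    cudaMalloc((void**)&d_c, N * sizeof(int));
    cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
    vectorAddKernelSketch<<<(N + 255) / 256, 256>>>(d_a, d_b, d_c, N);
    cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}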
Chapter 10
readme.md
# Introduction
All our examples are numbered to follow the order in which they appear in the book chapter.
1_cuBLAS
CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(MatrixMultiplyCUBLAS)
set(CMAKE_CXX_STANDARD 14)
set(CMAKE_BUILD_TYPE Release)
set(CUDA_TOOLKIT_ROOT_DIR /usr/local/cuda)
find_package(CUDA REQUIRED)
include_directories(${CUDA_INCLUDE_DIRS})
add_executable(matrix_mul main.cpp)
main.cpp
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <chrono>
#include <cublas_v2.h>
#include <cuda_runtime.h>
int main() {
srand(static_cast<unsigned int>(time(0)));
int N = 1024;
size_t size = N * N * sizeof(float);
float *d_A;
float *d_B;
float *d_C;
cudaMalloc(&d_A, size);
cudaMalloc(&d_B, size);
cudaMalloc(&d_C, size);
cublasHandle_t handle;
cublasCreate(&handle);
float alpha = 1.0f;
float beta = 0.0f;
auto endOverheadTime = std::chrono::high_resolution_clock::now();
cublasSgemm(handle,
CUBLAS_OP_T, CUBLAS_OP_T, // transpose both matrices
N, N, N,
&alpha,
d_A, N,
d_B, N,
&beta,
d_C, N);
float* d_C_fixed;
cudaMalloc(&d_C_fixed, size);
cublasSgeam(handle,
CUBLAS_OP_T, CUBLAS_OP_N,
N, N, // dimensions of the output matrix
&alpha,
d_C, N, // input matrix (to be transposed)
&beta,
nullptr, N, // second matrix not used
d_C_fixed, N); // result goes here
cudaEventRecord(stopEvent);
cudaEventSynchronize(stopEvent);
float gpuComputeTime = 0;
cudaEventElapsedTime(&gpuComputeTime, startEvent, stopEvent);
std::cout << "Device data copy time: " << copyTime.count() << " ms" << std::endl;
std::cout << "cuBLAS overhead time: " << overheadTime.count() << " ms" << std::endl;
std::cout << "Device compute time: " << gpuComputeTime << " ms" << std::endl;
std::cout << "Device data copy back time: " << copyBackTime.count() << " ms" << std::endl;
std::cout << "Total time taken by GPU: " << totalTime.count() << " ms" << std::endl;
checkResult(h_C_cpu, h_C, N * N);
cublasDestroy(handle);
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
cudaFree(d_C_fixed);
free(h_A);
free(h_B);
free(h_C);
free(h_C_cpu);
return 0;
}
2_thrust
CMakeLists.txt
cmake_minimum_required(VERSION 3.12)
project(ThrustSortExample LANGUAGES CXX CUDA)
set(CMAKE_CXX_STANDARD 14)
set(CMAKE_BUILD_TYPE Release)
find_package(CUDA REQUIRED)
add_executable(thrust_sort main.cu)
set_target_properties(thrust_sort PROPERTIES
CUDA_SEPARABLE_COMPILATION ON
CUDA_ARCHITECTURES "native"
)
main.cu
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <typeinfo>
#include <vector>
template <typename T>
void thrustSort(size_t size) {
thrust::host_vector<T> h_vec(size);
thrust::host_vector<T> h_result(size);
std::cout << "Processing " << typeid(T).name() << " (" << size << " elements)" << std::endl;
std::cout << "GPU copy time: " << copyTime.count() << " ms" << std::endl;
std::cout << "GPU copy back time: " << copyBackTime.count() << " ms" << std::endl;
std::cout << "GPU sort time: " << sortTime.count() << " ms" << std::endl;
std::cout << "Total time taken by GPU: " << gpuTotalTime.count() << " ms" << std::endl;
}
int main() {
size_t N = 33'000'000;
srand(static_cast<unsigned int>(time(0)));
thrustSort<int>(N);
std::cout << std::endl;
thrustSort<double>(N);
std::cout << std::endl;
thrustSort<float>(N);
return 0;
}
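The timed body of thrustSort is elided above. The pipeline presumably follows this shape (a sketch, with the timing omitted):

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>

// Hedged sketch of the timed pipeline inside thrustSort<T>.
template <typename T>
void thrustSortSketch(const thrust::host_vector<T>& h_vec,
                      thrust::host_vector<T>& h_result) {
    thrust::device_vector<T> d_vec = h_vec;                     // host-to-device copy
    thrust::sort(d_vec.begin(), d_vec.end());                   // sort on the GPU
    thrust::copy(d_vec.begin(), d_vec.end(), h_result.begin()); // device-to-host copy
}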
3_gtest
CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(VectorAddTest LANGUAGES CXX CUDA)
enable_testing()
find_package(GTest REQUIRED)
include
vector_add.h
#ifndef VEC_ADD_H
#define VEC_ADD_H
extern "C" {
void vectorAdd(float* h_A, float* h_B, float* h_C, int N);
}
#endif
src
main.cu
#include <cuda_runtime.h>
#include <iostream>
#include "../include/vector_add.h"
cudaMalloc(&d_A, size);
cudaMalloc(&d_B, size);
cudaMalloc(&d_C, size);
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
}
test
test_vector_add.cu
#include <gtest/gtest.h>
#include "../include/vector_add.h"
TEST(VectorAddTest, SimpleAddition) {
int N = 5;
float A[N] = {1, 2, 3, 4, 5};
float B[N] = {10, 20, 30, 40, 50};
float C[N] = {0};
vectorAdd(A, B, C, N);
for (int i = 0; i < N; i++) {
    EXPECT_FLOAT_EQ(C[i], A[i] + B[i]);  // verify each element
}
}
4_pytest
test_vector_add.py
import numpy as np
from vector_add import vectorAdd
def testVectorAdd():
a = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
b = np.array([5.0, 6.0, 7.0, 8.0], dtype=np.float32)
expected = a + b
result = vectorAdd(a, b)
for i in range(len(expected)):
assert abs(result[i] - expected[i]) < 1e-5, f"Mismatch at index {i}: got {result[i]}, expected {expected[i]}"
vector_add.py
import ctypes
import numpy as np
lib = ctypes.CDLL('../3_gtest/build/libvector_add.so')
lib.vectorAdd.argtypes = [
np.ctypeslib.ndpointer(dtype=np.float32, flags="C_CONTIGUOUS"),
np.ctypeslib.ndpointer(dtype=np.float32, flags="C_CONTIGUOUS"),
np.ctypeslib.ndpointer(dtype=np.float32, flags="C_CONTIGUOUS"),
ctypes.c_int
]