Releases: NVIDIA/cccl
CCCL Python Libraries (v0.4.2)
These are the release notes for the cuda-cccl Python package version 0.4.2, dated December 9th, 2025. The previous release was v0.4.1.
cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features
Improvements and bug fixes
- Add explicit dependency on nvidia-nvvm (#6909)
CCCL Python Libraries (v0.4.1)
These are the release notes for the cuda-cccl Python package version 0.4.1, dated December 8th, 2025. The previous release was v0.4.0.
cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features
Improvements and bug fixes
- Fix issue with `get_dtype()` not working anymore for PyTorch arrays (#6882)
- Add fast path to extract PyTorch array pointer (#6884)
Breaking Changes
CCCL Python Libraries (v0.4.0)
These are the release notes for the cuda-cccl Python package version 0.4.0, dated December 3rd, 2025. The previous release was v0.3.4.
cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features
- Added `select` algorithm for filtering data (#6766)
- Support for nested structs (#6353)
- Added `DiscardIterator` (#6618)
- The `cccl-python` Python package can now be installed via conda (#6513)
Improvements and bug fixes
- Allow NumPy struct types as initial value for `ZipIterator` inputs (#6861)
- Allow using `ZipIterator` as an output in cuda.compute (#6518)
- Enable caching of advance/dereference methods for `ZipIterator` and `PermutationIterator` (#6753)
- Use wrapper with `void*` argument types for iterator advance/dereference signature (#6634)
- Fixes and improvements to function caching (#6758)
- Fix handling of wrapped cuda.jit functions (#6770)
- Use annotations if available to determine return type of transform op (#6760)
- Allow passing in `None` as init value for scan when using an iterator as input (#6499)
Breaking Changes
v3.1.3
What's Changed
🔄 Other Changes
- [Backport branch/3.1.x] Fix invalid reference type of `cuda::strided_iterator` by @github-actions[bot] in #6517
- [Backport branch/3.1.x] Fixes issue with select close to int_max by @github-actions[bot] in #6700
- Bump branch/3.1.x to 3.1.3. by @wmaxey in #6621
- Backport changes for XGBoost compatibility by @bdice in #6727
Full Changelog: v3.1.2...v3.1.3
v3.1.2
What's Changed
🔄 Other Changes
- [BACKPORT 3.1] Always include `<new>` when we need operator new for clang-cuda (#6310) by @miscco in #6445
- [Backport branch/3.1.x] Fix offset_iterator tests by @github-actions[bot] in #6446
- [BACKPORT 3.1] Add `_CCCL_DECLSPEC_EMPTY_BASES` to mdspan features (#6444) by @miscco in #6449
- Bump branch/3.1.x to 3.1.2. by @wmaxey in #6433
- [Backport 3.1] Fix clang 21 issues (#6404) by @davebayer in #6447
- [Backport branch/3.1.x] Ensure that `detect_wrong_difference` is a valid output iterator by @github-actions[bot] in #6453
- [Backport to 3.1] Fix `cub.bench.radix_sort.keys.base` regression on H200 (#6452) by @bernhardmgruber in #6458
- [Backport 3.1] Do not mark deduction guides as hidden (#6350) by @miscco in #6457
Full Changelog: v3.1.1...v3.1.2
python-0.3.4
These are the release notes for the cuda-cccl Python package version 0.3.4, dated November 5th, 2025. The previous release was v0.3.3.
cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features and improvements
- Introduced the `cuda.compute.segmented_sort` API.
Bug Fixes
Breaking Changes
v3.1.1
What's Changed
🔄 Other Changes
- Bump branch/3.1.x to 3.1.1. by @wmaxey in #6235
- [Backport branch/3.1.x] Fix `__compressed_movable_box` by @github-actions[bot] in #6248
- [Backport branch/3.1.x] Fix `__is_primary_std_template` for libc++ by @github-actions[bot] in #6249
- [Backport 3.1] Fix invalid refactoring of #4377 (#6246) by @miscco in #6265
- [Backport branch/3.1.x] Fix using `char` as the index type of `tabulate_output_iterator` by @github-actions[bot] in #6273
- [Backport 3.1] Fix missing qualifications for `__construct_at` (#6270) by @miscco in #6274
- [Backport branch/3.1.x] Fix missed constructor with compressed box by @github-actions[bot] in #6272
- [Backport 3.1] Fix `string_view` construction from `std::string_view` (#6291) by @davebayer in #6301
- [Backport 3.1] Include `<math.h>` in `<cuda/std/cmath>` headers unconditionally (#6333) by @davebayer in #6339
Full Changelog: v3.1.0...v3.1.1
python-0.3.3
These are the release notes for the cuda-cccl Python package version 0.3.3, dated October 21st, 2025. The previous release was v0.3.2.
cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features and improvements
- This is the first release that features Windows wheels published to PyPI. You can now `pip install cuda-cccl[cu12]` or `pip install cuda-cccl[cu13]` on Windows for Python versions 3.10, 3.11, 3.12, and 3.13.
Bug Fixes
Breaking Changes
python-0.3.2
These are the release notes for the cuda-cccl Python package version 0.3.2, dated October 17th, 2025. The previous release was v0.3.1.
cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features and improvements
- Allow passing in a device array or `None` as the initial value in scan.
Bug Fixes
Breaking Changes
v3.1.0
Highlights
New options for deterministic reductions in cub::DeviceReduce
Due to the non-associativity of floating-point addition, cub::DeviceReduce historically only guaranteed bitwise-identical results run-to-run on the same GPU.
Starting with CCCL 3.1, CUB formalizes three levels of determinism with different performance trade-offs:
- Not-guaranteed (new!) - a new single-pass reduction using atomics
- Run-to-run (status quo) - the existing two-pass implementation
- GPU-to-GPU (new!) - based on the reproducible reduction from @maddyscientist's GTC 2024 talk
// Pick your desired trade-off of performance and determinism
auto env = cuda::execution::require(cuda::execution::determinism::gpu_to_gpu);
// auto env = cuda::execution::require(cuda::execution::determinism::run_to_run);
// auto env = cuda::execution::require(cuda::execution::determinism::not_guaranteed);
cub::DeviceReduce::Sum(..., env);
| | Not-guaranteed (new!) | Run-to-run (status quo) | GPU-to-GPU (new!) |
|---|---|---|---|
| Determinism | Varies per run | Varies per GPU | Constant |
| Performance | Best | Better | Good |
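
The snippet above elides the remaining arguments. As a hedged sketch only: the single-phase signature below, and the position of the environment as the final argument, are assumptions to verify against the CUB documentation before use.

```cuda
#include <cub/cub.cuh>

// Hedged sketch: assumes a single-phase overload taking input, output,
// item count, and the execution environment, in that order.
void sum_gpu_to_gpu(const float* d_in, float* d_out, int num_items)
{
  // Request bitwise-identical results across different GPUs
  auto env = cuda::execution::require(cuda::execution::determinism::gpu_to_gpu);
  cub::DeviceReduce::Sum(d_in, d_out, num_items, env);
}
```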
More convenient single-phase CUB APIs
Nearly every CUB algorithm requires temporary storage for intermediate scratch space to carry out the algorithm.
Historically, it was the user's responsibility to query and allocate the necessary temporary storage through a two-phase call pattern that is cumbersome and error-prone if the arguments are not passed identically between the two invocations.
CCCL 3.1 adds new overloads of some CUB algorithms that accept a memory resource, so you can skip the temporary-storage query/allocate/free pattern.
Before
// Determine the required temporary storage size
void* d_temp_storage = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceScan::ExclusiveSum(d_temp_storage,
                              temp_storage_bytes,
                              d_input, ...);
// Allocate the required temporary storage
cudaMallocAsync(&d_temp_storage,
                temp_storage_bytes, stream);
// Run the actual scan
cub::DeviceScan::ExclusiveSum(d_temp_storage,
                              temp_storage_bytes,
                              d_input, ...);
// Free the temporary storage
cudaFreeAsync(d_temp_storage, stream);
After
// Pool mr uses cudaMallocAsync under the hood
cuda::device_memory_pool mr{cuda::devices[0]};
// Single call. Temp storage is handled by the pool.
cub::DeviceScan::ExclusiveSum(d_input,..., mr);
What's Changed
🚀 Thrust / CUB
- par_nosync now uses async allocations by default.
- New reduce_into algorithm (PR #4355).
- Added strided_iterator (PR #4014).
- thrust::device_vector now supports default-init and skip-init constructors (PR #4183).
- New overloads for cub::WarpReduce (PR #3884).
- Tuned cub::ThreadReduce (PR #3441).
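
A hedged sketch of the new device_vector constructors (PR #4183), assuming the skip-init tag is spelled `thrust::no_init`; check the Thrust documentation for the exact spelling:

```cuda
#include <thrust/device_vector.h>

int main()
{
  // Value-initializing constructor, as before: every element is zeroed,
  // which costs a fill kernel launch.
  thrust::device_vector<float> zeroed(1 << 20);

  // Skip-init constructor: storage is allocated but elements are left
  // uninitialized, useful when the buffer is about to be overwritten anyway.
  thrust::device_vector<float> scratch(1 << 20, thrust::no_init);
  return 0;
}
```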
libcu++
- Added host/device/managed mdspan and accessors.
- New pointer utilities: is_aligned, align_up, align_down, ptr_rebind.
- New math utilities: ceil_ilog2, power-of-two helpers, fast_mod_div.
- New PTX primitive cuda::ptx::elect.sync.
- New warp primitive cuda::device::warp_match_all.
- Added compile-time loop utility cuda::static_for.
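
A hedged sketch combining a few of these utilities; the header names and exact call shapes are assumptions taken from the bullet names above, so consult the libcu++ documentation before relying on them:

```cuda
#include <cuda/memory>   // cuda::is_aligned, cuda::align_up (assumed header)
#include <cuda/utility>  // cuda::static_for (assumed header)

__device__ void scale_aligned(float* p)
{
  // Pointer utilities: bump the pointer up to the next 16-byte boundary
  // if it is not already aligned.
  if (!cuda::is_aligned(p, 16)) {
    p = cuda::align_up(p, 16);
  }

  // Compile-time loop: the body is instantiated once per i in [0, 4),
  // with i usable as a constant expression.
  cuda::static_for<4>([&](auto i) {
    p[i] *= 2.0f;
  });
}
```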
📚 Libcudacxx
- Enable device assertions in CUDA debug mode `nvcc -G` by @fbusato in #4444
- Avoid EDG bug by moving diagnostic push & pop out of templates by @ericniebler in #4416
- Add host/device/managed mdspan and accessors by @fbusato in #3686
- Add cuda::ptx::elect.sync by @fbusato in #4445
- Add pointer utilities cuda::is_aligned, cuda::align_up, cuda::align_down, cuda::ptr_rebind by @fbusato in #5037
- Add cuda::ceil_ilog2 by @fbusato in #4485
- Add cuda::is_power_of_two, cuda::next_power_of_two, cuda::prev_power_of_two by @fbusato in #4627
- Add cuda::device::warp_match_all by @fbusato in #4746
- Add cuda::static_for by @fbusato in #4855
- Improve/cleanup cuda::annotated_ptr implementation by @fbusato in #4503
- Add cuda::fast_mod_div Fast Modulo Division by @fbusato in #5210
📝 Documentation
- Making extended API documentation slightly more uniform by @fbusato in #4965
- Add memory space note to `cuda::memory` documentation by @fbusato in #5151
- Better specify `lane_mask::all_active()` behavior by @fbusato in #5183
🔄 Other Changes
- [CUDAX] Add universal comparison across memory resources by @pciolkosz in #4168
- Implement `ranges::range_adaptor` by @miscco in #4066
- Avoiding looping over problem size in individual tests by @oleksandr-pavlyk in #4140
- Replace CUB `util_arch.cuh` macros with `inline constexpr` variables by @fbusato in #4165
- Improves test times for `DeviceSegmentedRadixSort` by @elstehle in #4156
- Simplify Thrust iterator functions by @bernhardmgruber in #4178
- Remove `_LIBCUDACXX_UNUSED_VAR` by @davebayer in #4174
- Remove `_CCCL_NO_IF_CONSTEXPR` by @davebayer in #4187
- Implement `__fp_native_type_t` by @davebayer in #4173
- Adds support for large number of segments and large number of items to `DeviceSegmentedRadixSort` by @elstehle in #3402
- Implement inclusive scan in cuda.parallel by @NaderAlAwar in #4147
- Remove `_CCCL_NO_NOEXCEPT_FUNCTION_TYPE` by @davebayer in #4190
- Fix `not_fn` by @miscco in https://github.com/NVIDIA/ccc...