Releases: NVIDIA/cccl
CCCL Python Libraries (v0.4.2)
These are the release notes for the cuda-cccl Python package version 0.4.2, dated December 9th, 2025. The previous release was v0.4.1.
cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features
Improvements and bug fixes
- Add explicit dependency on nvidia-nvvm (#6909)
CCCL Python Libraries (v0.4.1)
These are the release notes for the cuda-cccl Python package version 0.4.1, dated December 8th, 2025. The previous release was v0.4.0.
cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features
Improvements and bug fixes
- Fix issue with `get_dtype()` not working anymore for PyTorch arrays (#6882)
- Add fast path to extract PyTorch array pointer (#6884)
Breaking Changes
CCCL Python Libraries (v0.4.0)
These are the release notes for the cuda-cccl Python package version 0.4.0, dated December 3rd, 2025. The previous release was v0.3.4.
cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features
- Added `select` algorithm for filtering data (#6766)
- Support for nested structs (#6353)
- Added `DiscardIterator` (#6618)
- The `cccl-python` Python package can now be installed via conda (#6513)
Improvements and bug fixes
- Allow NumPy struct types as initial value for `ZipIterator` inputs (#6861)
- Allow using `ZipIterator` as an output in cuda.compute (#6518)
- Enable caching of advance/dereference methods for `ZipIterator` and `PermutationIterator` (#6753)
- Use wrapper with `void*` argument types for iterator advance/dereference signature (#6634)
- Fixes and improvements to function caching (#6758)
- Fix handling of wrapped cuda.jit functions (#6770)
- Use annotations if available to determine return type of transform op (#6760)
- Allow passing in `None` as init value for scan when using an iterator as input (#6499)
Breaking Changes
v3.1.3
What's Changed
🔄 Other Changes
- [Backport branch/3.1.x] Fix invalid reference type of `cuda::strided_iterator` by @github-actions[bot] in #6517
- [Backport branch/3.1.x] Fixes issue with select close to int_max by @github-actions[bot] in #6700
- Bump branch/3.1.x to 3.1.3. by @wmaxey in #6621
- Backport changes for XGBoost compatibility by @bdice in #6727
Full Changelog: v3.1.2...v3.1.3
v3.1.2
What's Changed
🔄 Other Changes
- [BACKPORT 3.1] Always include `<new>` when we need operator new for clang-cuda (#6310) by @miscco in #6445
- [Backport branch/3.1.x] Fix offset_iterator tests by @github-actions[bot] in #6446
- [BACKPORT 3.1] Add `_CCCL_DECLSPEC_EMPTY_BASES` to mdspan features (#6444) by @miscco in #6449
- Bump branch/3.1.x to 3.1.2. by @wmaxey in #6433
- [Backport 3.1] Fix clang 21 issues (#6404) by @davebayer in #6447
- [Backport branch/3.1.x] Ensure that `detect_wrong_difference` is a valid output iterator by @github-actions[bot] in #6453
- [Backport to 3.1] Fix `cub.bench.radix_sort.keys.base` regression on H200 (#6452) by @bernhardmgruber in #6458
- [Backport 3.1] Do not mark deduction guides as hidden (#6350) by @miscco in #6457
Full Changelog: v3.1.1...v3.1.2
python-0.3.4
These are the release notes for the cuda-cccl Python package version 0.3.4, dated November 5th, 2025. The previous release was v0.3.3.
cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features and improvements
- Introduced the `cuda.compute.segmented_sort` API.
Bug Fixes
Breaking Changes
v3.1.1
What's Changed
🔄 Other Changes
- Bump branch/3.1.x to 3.1.1. by @wmaxey in #6235
- [Backport branch/3.1.x] Fix `__compressed_movable_box` by @github-actions[bot] in #6248
- [Backport branch/3.1.x] Fix `__is_primary_std_template` for libc++ by @github-actions[bot] in #6249
- [Backport 3.1] Fix invalid refactoring of #4377 (#6246) by @miscco in #6265
- [Backport branch/3.1.x] Fix using `char` as the index type of `tabulate_output_iterator` by @github-actions[bot] in #6273
- [Backport 3.1] Fix missing qualifications for `__construct_at` (#6270) by @miscco in #6274
- [Backport branch/3.1.x] Fix missed constructor with compressed box by @github-actions[bot] in #6272
- [Backport 3.1] Fix `string_view` construction from `std::string_view` (#6291) by @davebayer in #6301
- [Backport 3.1] Include `<math.h>` in `<cuda/std/cmath>` headers unconditionally (#6333) by @davebayer in #6339
Full Changelog: v3.1.0...v3.1.1
python-0.3.3
These are the release notes for the cuda-cccl Python package version 0.3.3, dated October 21st, 2025. The previous release was v0.3.2.
cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features and improvements
- This is the first release that features Windows wheels published to PyPI. You can now `pip install cuda-cccl[cu12]` or `pip install cuda-cccl[cu13]` on Windows for Python versions 3.10, 3.11, 3.12, and 3.13.
Bug Fixes
Breaking Changes
python-0.3.2
These are the release notes for the cuda-cccl Python package version 0.3.2, dated October 17th, 2025. The previous release was v0.3.1.
cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features and improvements
- Allow passing in a device array or `None` as the initial value in scan.
Bug Fixes
Breaking Changes
v3.1.0
Highlights
New options for deterministic reductions in cub::DeviceReduce
Due to the non-associativity of floating-point addition, cub::DeviceReduce historically only guaranteed bitwise-identical results run-to-run on the same GPU.
Starting with CCCL 3.1, CUB formalizes three levels of determinism with different performance trade-offs:
- Not-guaranteed (new!) - a new single-pass reduction using atomics
- Run-to-run (status quo) - the existing two-pass implementation
- GPU-to-GPU (new!) - based on the reproducible reduction from @maddyscientist's GTC 2024 talk
// Pick your desired trade-off of performance and determinism
auto env = cuda::execution::require(cuda::execution::determinism::gpu_to_gpu);
// auto env = cuda::execution::require(cuda::execution::determinism::run_to_run);
// auto env = cuda::execution::require(cuda::execution::determinism::not_guaranteed);
cub::DeviceReduce::Sum(..., env);
| | Not-guaranteed (new!) | Run-to-run (status quo) | GPU-to-GPU (new!) |
|---|---|---|---|
| Determinism | Varies per run | Varies per GPU | Constant |
| Performance | Best | Better | Good |
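
The snippet above elides the remaining arguments. As a hedged sketch only: the single-phase signature below, and the position of the environment as the final argument, are assumptions to verify against the CUB documentation before use.

```cuda
#include <cub/cub.cuh>

// Hedged sketch: assumes a single-phase overload taking input, output,
// item count, and the execution environment, in that order.
void sum_gpu_to_gpu(const float* d_in, float* d_out, int num_items)
{
  // Request bitwise-identical results across different GPUs
  auto env = cuda::execution::require(cuda::execution::determinism::gpu_to_gpu);
  cub::DeviceReduce::Sum(d_in, d_out, num_items, env);
}
```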
More convenient single-phase CUB APIs
Nearly every CUB algorithm requires temporary storage for intermediate scratch space to carry out the algorithm.
Historically, it was the user's responsibility to query and allocate the necessary temporary storage through a two-phase call pattern that is cumbersome and error-prone if the arguments are not passed identically between the two invocations.
CCCL 3.1 adds new overloads of some CUB algorithms that accept a memory resource, so you can skip the temporary-storage query/allocate/free pattern.
Before
// Determine the required temporary storage size
void* d_temp_storage = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceScan::ExclusiveSum(d_temp_storage,
                              temp_storage_bytes,
                              d_input, ...);
// Allocate the required temporary storage
cudaMallocAsync(&d_temp_storage,
                temp_storage_bytes, stream);
// Run the actual scan
cub::DeviceScan::ExclusiveSum(d_temp_storage,
                              temp_storage_bytes,
                              d_input, ...);
// Free the temporary storage
cudaFreeAsync(d_temp_storage, stream);
After
// Pool mr uses cudaMallocAsync under the hood
cuda::device_memory_pool mr{cuda::devices[0]};
// Single call. Temp storage is handled by the pool.
cub::DeviceScan::ExclusiveSum(d_input,..., mr);
What's Changed
🚀 Thrust / CUB
- par_nosync now uses async allocations by default.
- New reduce_into algorithm (PR #4355).
- Added strided_iterator (PR #4014).
- thrust::device_vector now supports default-init and skip-init constructors (PR #4183).
- New overloads for cub::WarpReduce (PR #3884).
- Tuned cub::ThreadReduce (PR #3441).
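
A hedged sketch of the new device_vector constructors (PR #4183), assuming the skip-init tag is spelled `thrust::no_init`; check the Thrust documentation for the exact spelling:

```cuda
#include <thrust/device_vector.h>

int main()
{
  // Value-initializing constructor, as before: every element is zeroed,
  // which costs a fill kernel launch.
  thrust::device_vector<float> zeroed(1 << 20);

  // Skip-init constructor: storage is allocated but elements are left
  // uninitialized, useful when the buffer is about to be overwritten anyway.
  thrust::device_vector<float> scratch(1 << 20, thrust::no_init);
  return 0;
}
```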
libcu++
- Added host/device/managed mdspan and accessors.
- New pointer utilities: is_aligned, align_up, align_down, ptr_rebind.
- New math utilities: ceil_ilog2, power-of-two helpers, fast_mod_div.
- New PTX primitive cuda::ptx::elect.sync.
- New warp primitive cuda::device::warp_match_all.
- Added compile-time loop utility cuda::static_for.
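
A hedged sketch combining a few of these utilities; the header names and exact call shapes are assumptions taken from the bullet names above, so consult the libcu++ documentation before relying on them:

```cuda
#include <cuda/memory>   // cuda::is_aligned, cuda::align_up (assumed header)
#include <cuda/utility>  // cuda::static_for (assumed header)

__device__ void scale_aligned(float* p)
{
  // Pointer utilities: bump the pointer up to the next 16-byte boundary
  // if it is not already aligned.
  if (!cuda::is_aligned(p, 16)) {
    p = cuda::align_up(p, 16);
  }

  // Compile-time loop: the body is instantiated once per i in [0, 4),
  // with i usable as a constant expression.
  cuda::static_for<4>([&](auto i) {
    p[i] *= 2.0f;
  });
}
```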
📚 Libcudacxx
- Enable device assertions in CUDA debug mode `nvcc -G` by @fbusato in #4444
- Avoid EDG bug by moving diagnostic push & pop out of templates by @ericniebler in #4416
- Add host/device/managed mdspan and accessors by @fbusato in #3686
- Add cuda::ptx::elect.sync by @fbusato in #4445
- Add pointer utilities cuda::is_aligned, cuda::align_up, cuda::align_down, cuda::ptr_rebind by @fbusato in #5037
- Add cuda::ceil_ilog2 by @fbusato in #4485
- Add cuda::is_power_of_two, cuda::next_power_of_two, cuda::prev_power_of_two by @fbusato in #4627
- Add cuda::device::warp_match_all by @fbusato in #4746
- Add cuda::static_for by @fbusato in #4855
- Improve/cleanup cuda::annotated_ptr implementation by @fbusato in #4503
- Add cuda::fast_mod_div Fast Modulo Division by @fbusato in #5210
📝 Documentation
- Making extended API documentation slightly more uniform by @fbusato in #4965
- Add memory space note to `cuda::memory` documentation by @fbusato in #5151
- Better specify `lane_mask::all_active()` behavior by @fbusato in #5183
🔄 Other Changes
- [CUDAX] Add universal comparison across memory resources by @pciolkosz in #4168
- Implement `ranges::range_adaptor` by @miscco in #4066
- Avoiding looping over problem size in individual tests by @oleksandr-pavlyk in #4140
- Replace CUB `util_arch.cuh` macros with `inline constexpr` variables by @fbusato in #4165
- Improves test times for `DeviceSegmentedRadixSort` by @elstehle in #4156
- Simplify Thrust iterator functions by @bernhardmgruber in #4178
- Remove `_LIBCUDACXX_UNUSED_VAR` by @davebayer in #4174
- Remove `_CCCL_NO_IF_CONSTEXPR` by @davebayer in #4187
- Implement `__fp_native_type_t` by @davebayer in #4173
- Adds support for large number of segments and large number of items to `DeviceSegmentedRadixSort` by @elstehle in #3402
- Implement inclusive scan in cuda.parallel by @NaderAlAwar in #4147
- Remove `_CCCL_NO_NOEXCEPT_FUNCTION_TYPE` by @davebayer in #4190
- Fix `not_fn` by @miscco in https://github.com/NVIDIA/ccc...