Releases: NVIDIA/cccl

CCCL Python Libraries (v0.4.2)

09 Dec 16:58
c7d8855

These are the release notes for the cuda-cccl Python package version 0.4.2, dated December 9th, 2025. The previous release was v0.4.1.

cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.

Installation

Please refer to the installation instructions in the cuda.cccl documentation.

Improvements and bug fixes

  • Add explicit dependency on nvidia-nvvm (#6909)

CCCL Python Libraries (v0.4.1)

08 Dec 10:06
7c3300b

These are the release notes for the cuda-cccl Python package version 0.4.1, dated December 8th, 2025. The previous release was v0.4.0.

cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.

Installation

Please refer to the installation instructions in the cuda.cccl documentation.

Improvements and bug fixes

  • Fix get_dtype() failing for PyTorch arrays (#6882)
  • Add fast path to extract PyTorch array pointer (#6884)

CCCL Python Libraries (v0.4.0)

03 Dec 21:51
958dee5

These are the release notes for the cuda-cccl Python package version 0.4.0, dated December 3rd, 2025. The previous release was v0.3.4.

cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.

Installation

Please refer to the installation instructions in the cuda.cccl documentation.

Features

  • Added select algorithm for filtering data (#6766)
  • Support for nested structs (#6353)
  • Added DiscardIterator (#6618)
  • The cuda-cccl Python package can now be installed via conda (#6513)
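For readers unfamiliar with the new select algorithm: it performs stream compaction, copying only the input elements that satisfy a predicate while preserving their order. A NumPy sketch of the semantics (illustrative only, not the cuda.compute API):

```python
import numpy as np

data = np.array([3, -1, 4, -1, 5, -9, 2, 6])

# select keeps the elements for which the predicate holds,
# preserving their original relative order.
selected = data[data > 0]

print(selected)  # [3 4 5 2 6]
```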

Improvements and bug fixes

  • Allow NumPy struct types as the initial value for ZipIterator inputs (#6861)
  • Allow using ZipIterator as an output in cuda.compute (#6518)
  • Enable caching of advance/dereference methods for ZipIterator and PermutationIterator (#6753)
  • Use wrapper with void* argument types for iterator advance/dereference signature (#6634)
  • Fixes and improvements to function caching (#6758)
  • Fix handling of wrapped cuda.jit functions (#6770)
  • Use annotations if available to determine return type of transform op (#6760)
  • Allow passing in None as init value for scan when using an iterator as input (#6499)

v3.1.3

24 Nov 18:09
d69eb55

What's Changed

🔄 Other Changes

  • [Backport branch/3.1.x] Fix invalid reference type of cuda::strided_iterator by @github-actions[bot] in #6517
  • [Backport branch/3.1.x] Fixes issue with select close to int_max by @github-actions[bot] in #6700
  • Bump branch/3.1.x to 3.1.3. by @wmaxey in #6621
  • Backport changes for XGBoost compatibility by @bdice in #6727

Full Changelog: v3.1.2...v3.1.3

v3.1.2

13 Nov 19:14
4ee3a7b

What's Changed

🔄 Other Changes

  • [BACKPORT 3.1] Always include <new> when we need operator new for clang-cuda (#6310) by @miscco in #6445
  • [Backport branch/3.1.x] Fix offset_iterator tests by @github-actions[bot] in #6446
  • [BACKPORT 3.1] Add _CCCL_DECLSPEC_EMPTY_BASES to mdspan features (#6444) by @miscco in #6449
  • Bump branch/3.1.x to 3.1.2. by @wmaxey in #6433
  • [Backport 3.1] Fix clang 21 issues (#6404) by @davebayer in #6447
  • [Backport branch/3.1.x] Ensure that detect_wrong_difference is a valid output iterator by @github-actions[bot] in #6453
  • [Backport to 3.1] Fix cub.bench.radix_sort.keys.base regression on H200 (#6452) by @bernhardmgruber in #6458
  • [Backport 3.1] Do not mark deduction guides as hidden (#6350) by @miscco in #6457

Full Changelog: v3.1.1...v3.1.2

python-0.3.4

05 Nov 18:46
41c55b6

These are the release notes for the cuda-cccl Python package version 0.3.4, dated November 5th, 2025. The previous release was v0.3.3.

cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.

Installation

Please refer to the installation instructions in the cuda.cccl documentation.


v3.1.1

13 Nov 19:13
148102d

What's Changed

🔄 Other Changes

  • Bump branch/3.1.x to 3.1.1. by @wmaxey in #6235
  • [Backport branch/3.1.x] Fix __compressed_movable_box by @github-actions[bot] in #6248
  • [Backport branch/3.1.x] Fix __is_primary_std_template for libc++ by @github-actions[bot] in #6249
  • [Backport 3.1] Fix invalid refactoring of #4377 (#6246) by @miscco in #6265
  • [Backport branch/3.1.x] Fix using char as the index type of tabulate_output_iterator by @github-actions[bot] in #6273
  • [Backport 3.1]: Fix missing qualifications for __construct_at (#6270) by @miscco in #6274
  • [Backport branch/3.1.x] Fix missed constructor with compressed box by @github-actions[bot] in #6272
  • [Backport 3.1] Fix string_view construction from std::string_view (#6291) by @davebayer in #6301
  • [Backport 3.1] Include <math.h> in <cuda/std/cmath> headers unconditionally (#6333) by @davebayer in #6339

Full Changelog: v3.1.0...v3.1.1

python-0.3.3

21 Oct 21:44
Released by @tpn
ce86c08

These are the release notes for the cuda-cccl Python package version 0.3.3, dated October 21st, 2025. The previous release was v0.3.2.

cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.

Installation

Please refer to the installation instructions in the cuda.cccl documentation.

Features and improvements

  • This is the first release that features Windows wheels published to PyPI. You can now pip install cuda-cccl[cu12] or pip install cuda-cccl[cu13] on Windows for Python versions 3.10, 3.11, 3.12, and 3.13.


python-0.3.2

20 Oct 23:17
ed40f29

These are the release notes for the cuda-cccl Python package version 0.3.2, dated October 17th, 2025. The previous release was v0.3.1.

cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.

Installation

Please refer to the installation instructions in the cuda.cccl documentation.

Features and improvements

  • Allow passing in a device array or None as the initial value in scan.
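For context, in an exclusive scan the initial value seeds the output: the first output element is the initial value, and each later element adds the preceding input. A pure-Python sketch of the semantics (illustrative only, not the cuda.cccl API):

```python
from itertools import accumulate

data = [1, 2, 3, 4]
init = 10

# Exclusive scan: output[i] = init + sum(data[:i]); drop the final total.
exclusive = list(accumulate(data, initial=init))[:-1]

print(exclusive)  # [10, 11, 13, 16]
```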


v3.1.0

14 Oct 22:04
ecfd3ad

Highlights

New options for deterministic reductions in cub::DeviceReduce

Because floating-point addition is not associative, cub::DeviceReduce has historically guaranteed only bitwise-identical results run-to-run on the same GPU.
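To see why the grouping of additions matters, here is a minimal Python illustration using IEEE-754 double precision:

```python
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # the large terms cancel first, so the 1.0 survives
right = a + (b + c)  # 1.0 is absorbed into -1e16 (rounded away) before cancellation

print(left, right)  # 1.0 0.0
```

A parallel reduction sums partial results in whatever order the hardware schedules them, so without extra care the result can differ between runs.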

Starting with CCCL 3.1, CUB formalizes three levels of determinism with different performance trade-offs:

  • Not-guaranteed (new!) - a new single-pass reduction using atomics
  • Run-to-run (status quo) - the existing two-pass implementation
  • GPU-to-GPU (new!) - based on the reproducible reduction from @maddyscientist's GTC 2024 talk

// Pick your desired trade-off of performance and determinism
// auto env = cuda::execution::require(cuda::execution::determinism::not_guaranteed);
// auto env = cuda::execution::require(cuda::execution::determinism::run_to_run);
// auto env = cuda::execution::require(cuda::execution::determinism::gpu_to_gpu);
cub::DeviceReduce::Sum(..., env);

                Not-guaranteed (new!)   Run-to-run (status quo)   GPU-to-GPU (new!)
Determinism     Varies per run          Varies per GPU            Constant
Performance     Best                    Better                    Good

More convenient single-phase CUB APIs

Nearly every CUB algorithm requires temporary device storage for intermediate results.

Historically, it was the user's responsibility to query and allocate this storage through a two-phase call pattern, which is cumbersome and error-prone if the arguments differ between the two invocations.

CCCL 3.1 adds new overloads of some CUB algorithms that accept a memory resource, so you can skip the temp-storage query/allocate/free pattern.

Before

// Determine the required temporary storage size
void*  d_temp_storage     = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceScan::ExclusiveSum(d_temp_storage,
                              temp_storage_bytes,
                              d_input, ...);

// Allocate the required temporary storage
cudaMallocAsync(&d_temp_storage,
                temp_storage_bytes, stream);

// Run the actual scan
cub::DeviceScan::ExclusiveSum(d_temp_storage,
                              temp_storage_bytes,
                              d_input, ...);

// Free the temporary storage
cudaFreeAsync(d_temp_storage, stream);

After

// Pool mr uses cudaMallocAsync under the hood
cuda::device_memory_pool mr{cuda::devices[0]};

// Single call. Temp storage is handled by the pool.
cub::DeviceScan::ExclusiveSum(d_input,..., mr);

What's Changed


📝 Documentation

  • Making extended API documentation slightly more uniform by @fbusato in #4965
  • Add memory space note to cuda::memory documentation by @fbusato in #5151
  • Better specify lane_mask::all_active() behavior by @fbusato in #5183
