Using __constant__ memory in LTO FFT callback

I’m having trouble utilizing constant memory in an LTO FFT callback. It seems no matter what I do the value of my constant is 0. I’ve narrowed it down to there being two memory locations for my constant value. I’ve only declared it once, and used extern elsewhere to get it to compile, however when I print out the memory address of the constant in a kernel, I get the correct values and one address. But when I do the same in my callback, I get 0 for the value and a different memory address for the same constant variable. If need be I can create a standalone program and makefile to demonstrate the issue, but first I just wanted to know if there was some limitation on LTO FFT callbacks and using constant memory. I’m using CUDA 12.8, and otherwise my FFT LTO callback is working exactly as I would expect, it’s just the constant memory that doesn’t appear to be working.

Hi,

I believe the problem is that the callback code is placed in a different CUmodule/CUlibrary than the constant. Let me explain.

CUDA functions can only read/write/call symbols that reside in the same CUDA module. That’s why the legacy cuFFT callbacks must have been statically linked to work.

The LTO callbacks work differently. The code is linked into a module with the cuFFT code dynamically and loaded as a separate module.

The problem is that if you have a __constant__ MyType my_var symbol your application, you cannot access it from the callback.

If you try to redefine the __constant__ MyType my_var inside the callback, there will be another instance of the my_var constant variable createn inside the internal cuFFT module.

Using extern __constant__ MyType my_var will not help, too, because - again - those are 2 separate modules.

So, my assumption is that it will not work for now, unless an API exposing the internal module is introduced.

Best regards
David

PS: I might be wrong about some of my assumptions, it would be nice to get a response from a cuFFT team member :)

By the way - why don’t you pass the data via the void* callerInfo parameter?

Thanks for the responses. I’m trying to optimize my code for performance, and passing my data via the callerInfo (which is what I am currently doing to work around this) is much slower for my particular use case. Due to my memory access patterns a mix of textures and constant memory would provide the best performance. In my specific case I’m actually generating the FFT input data on the fly based on those inputs rather than the actual input array passed into the FFT.

I also have another use case where I want to perform some benchmarking for various sized FFTs, but the callback needs to know the FFT size (and another scalar that’s tunable.) Currently I’m using a “define” but that won’t work for a program that is looking to test various combinations of sizes. Interestingly enough, even a single value like fftSize passed in via callerInfo is quite a bit slower than a “define.” It’s possible this is because the compiler can perform better optimization knowing it at compile time, but since I can’t get “_constant_” to work I can’t be sure.

I suppose you could pass a pointer to a __constant__ variable via the callerInfo. I haven’t studied it carefully.

You can’t have “_constant_” in a struct, they have to be declared as a standalone global variable. Having a pointer to a constant seems like an interesting idea though I strongly suspect with all the restrictions already in place for constants that won’t work either. And even if it did, you would probably lose all the caching and broadcast benefits of the constant which is what I really need it for. I’ll give it a shot though when I’m back at my computer.

You may lose some benefit since the compiler will need to handle it via a generic state-space LD rather than constant state-space LDC. But you won’t lose the caching benefit or “broadcasting” benefit. The address will map into constant state-space and be handled by that hardware mechanism.

1 Like

I was finally able to get a pointer to a constant value working via the callerInfo sutrcture. I ended up having to write a kernel to copy my value into it once at start-up. No matter how many different ways I tried I could not get it to work using cudaMemcpy and cudaMemcpyFromSymbol. I’m not sure what I was doing wrong, but it would either seg fault or report an invalid argument. So, after hours of trying, I just wrote a kernel to do it and that worked.

My first attempt was to just try it on a single value (ie my second use case) as that was much simpler code to begin with. Unfortunately, it performed exactly the same as if I just utilized an fftSize variable from within my callerInfo (ie not the constant pointer version of the code.) I somewhat suspected this as I mentioned earlier that there may be some compiler optimizations going on with preprocessor defines. This was also verified by this post:

Speed of modulo operator in CUDA - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums

The fftSize is used to index into arrays of a batched FFT and therefore I need both a divide and a modulo operation to get the proper index into my data. (offset / fftSize, and offset % fftSize)

I’ll report back when I get my updates in to support a constant pointer for my input array and provide details on if I got the performance improvements I was expecting.

a relatively simple cudaMemcpyFromSymbol with a __constant__ variable won’t work.

You’d need to do something like this:

__constant__ int fftSize = 1234;
__device__ int *pfftSize = &fftSize;

...

int *hpfftSize;
cudaMemcpyFromSymbol(&hpfftSize, pfftSize, sizeof(int *));

Some colleagues suggested I try the following, which seemed to work:

  1. git clone the github cuda library samples
  2. change into the lto example directory: cuFFT/lto_callback_window_1d
  3. modify the file src/r2c_c2r_lto_callback_device.cu as follows below
  4. build and run the sample code

When I do that, I get expected output:

# cat src/r2c_c2r_lto_callback_device.cu
/* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *  * Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *  * Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *  * Neither the name of NVIDIA CORPORATION nor the names of its
 *    contributors may be used to endorse or promote products derived
 *    from this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
 * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
 * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
 * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
 * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
 * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */


/*
 * Example showing the use of LTO callbacks with CUFFT to perform
 * truncation with zero padding.
 *
*/

#include <cufftXt.h>
#include <cstdio>      // added

struct cb_params {
        unsigned window_size;
        unsigned signal_size;
};
__constant__ int my_val = 33;    // added
// This is the store callback routine. It filters high frequencies
// based on a truncation window specified by the user
__device__ cufftComplex windowing_callback(void*              input,
                                           unsigned long long idx,
                                           void*              info,
                                           void*              sharedmem) {

        const cb_params* params = static_cast<const cb_params*>(info);
        cufftComplex* cb_output = static_cast<cufftComplex*>(input);

        const unsigned sample   = idx % params->signal_size;
        printf("my_val=%d\n", my_val);   // added
        return (sample < params->window_size) ? cb_output[idx] : cufftComplex{0.f, 0.f};
}
# make
nvcc --std=c++11 --generate-code arch=compute_89,code=lto_89 -dc -fatbin src/r2c_c2r_lto_callback_device.cu -o build/r2c_c2r_lto_callback_device.fatbin
bin2c --name window_callback --type longlong build/r2c_c2r_lto_callback_device.fatbin > src/r2c_c2r_lto_callback_device_fatbin.h
g++ -I /usr/local/cuda/include -c src/r2c_c2r_lto_callback_example.cpp -o build/r2c_c2r_lto_callback_example.o
nvcc -I /usr/local/cuda/include --std=c++11  --generate-code arch=compute_89,code=[compute_89,sm_89] -c src/r2c_c2r_reference.cu -o build/r2c_c2r_reference.o
g++ -I /usr/local/cuda/include --std=c++11 -c src/common.cpp -o build/common.o
g++ -L /usr/local/cuda/lib64 build/r2c_c2r_lto_callback_example.o build/r2c_c2r_reference.o build/common.o -o bin/r2c_c2r_lto_callback_example -lcufft -lcudart
g++ -I /usr/local/cuda/include -DCUDA_ARCH=89 -DCUDA_PATH=/usr/local/cuda -DSOURCE_PATH=/root/bobc/junk/CUDALibrarySamples/cuFFT/lto_callback_window_1d/src -c src/r2c_c2r_lto_nvrtc_callback_example.cpp -o build/r2c_c2r_lto_nvrtc_callback_example.o
g++ -L /usr/local/cuda/lib64 build/r2c_c2r_lto_nvrtc_callback_example.o build/r2c_c2r_reference.o build/common.o -o bin/r2c_c2r_lto_nvrtc_callback_example -lcufft -lnvrtc -lcudart
nvcc -I /usr/local/cuda/include --std=c++11  --generate-code arch=compute_89,code=[compute_89,sm_89] -dc -c src/r2c_c2r_legacy_callback_example.cu -o build/r2c_c2r_legacy_callback_example.o
nvcc -L /usr/local/cuda/lib64  --generate-code arch=compute_89,code=[compute_89,sm_89] -o bin/r2c_c2r_legacy_callback_example build/r2c_c2r_legacy_callback_example.o build/r2c_c2r_reference.o build/common.o -lcufft_static -lcudart -lculibos
# bin/r2c_c2r_lto_callback_example
...
my_val=33
my_val=33
^C
#

(CUDA 13.0, L4 GPU)

I’m sure that’s not exactly what you are doing, but general usage seems to work in that case. Certainly when doing inter-module usage of __constant__ variables, some care is needed to provide for proper linkage, but you already seem to be aware of that. There doesn’t seem to be anything inherently unusable about __constant__ variables in a cufft callback routine.

I guess for completeness I should mention I had to modify the Makefile slightly. It includes specification for compilation targets as early as cc6.0. Since I was using CUDA 13.0 which has dropped support for compilation targets earlier than cc7.5, I had to remove the earlier compilation targets from the Makefile).

Thank you Robert, with your previous recommendations I was able to get it to work as follows:

In my main.cu (global):
_constant_ float c_rho[BATCH_SIZE];
_device_ float *c_rhoPtr = c_rho;

In my main.cu (in main function):

IFFT_CB_InfoT *callerInfo;

// Allocate memory for IFFT callback information
cudaMalloc(&callerInfo, sizeof(IFFT_CB_InfoT));

// Get host pointer for the rho constant memory
float rhoPtrHost*;*
cudaMemcpyFromSymbol(&rhoPtrHost, c_rhoPtr, sizeof(float));

// Copy the pointer to callerInfo
cudaMemcpy(&(callerInfo->rhoPtr), &rhoPtrHost, sizeof(float*), cudaMemcpyHostToDevice);

// Copy rho from host memory to device constant memory
cudaMemcpyToSymbol(c_rho, rho, sizeof(float)*BATCH_SIZE);

In callback.cu (in callback function)
const IFFT_CB_InfoT info = (IFFT_CB_InfoT)callerInfo;
float *rhoPtr = info->rhoPtr;
float rho = rhoPtr[offset];

In common.h:
typedef struct
{
float *rhoPtr;
} IFFT_CB_InfoT;

In terms of your most recent post, my constant memory actually needs to change between calls to the FFT, so I need visibility to the symbol in my main.cu which is really the crux of the problem. The solution above works perfectly for what I am looking to do.

Thanks again!