CUDA default threads per block is unintentionally set to 512 instead of 1024 #3281

@longjon

#62 (!) introduced a preprocessor check to allow Caffe to run on CUDA devices of compute capability < 2 by setting CAFFE_CUDA_NUM_THREADS (the number of threads per block used for most of Caffe's CUDA kernels) to 512 if needed.

According to my recent check and my reading of the CUDA programming guide, this code is not correct. __CUDA_ARCH__ is not defined in host code (see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#application-compatibility), so the check always fails there and CAFFE_CUDA_NUM_THREADS always ends up as 512. (Note that the preprocessor doesn't care that the macro isn't defined; inside an #if it just treats it as zero. See https://gcc.gnu.org/onlinedocs/cpp/If.html.)
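For illustration, here is a minimal sketch of how such a check behaves in host code. It is a hypothetical reconstruction, not Caffe's verbatim source: the names `kThreadsPerBlock` and `host_threads_per_block` are made up, and the threshold of 200 is an assumption derived from "compute capability < 2".

```cpp
// Hypothetical reconstruction of the kind of check described above.
// The constant name and the 200 threshold are assumptions, not Caffe's code.
//
// In host code __CUDA_ARCH__ is not defined, so the preprocessor evaluates
// the condition as `0 >= 200`, which is false, and the #else branch is taken.
#if __CUDA_ARCH__ >= 200
const int kThreadsPerBlock = 1024;
#else
const int kThreadsPerBlock = 512;
#endif

// Host-side helper standing in for whatever reads the constant when
// building a kernel launch configuration (hypothetical name).
inline int host_threads_per_block() { return kThreadsPerBlock; }
```

Compiled with an ordinary host compiler, or in the host pass of nvcc, `host_threads_per_block()` returns 512 regardless of the target device, because the launch configuration is computed on the host where the macro is never defined.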

I don't know if this has any performance implications; it might be fine to just leave threads/block at 512 and remove the dead code.

I'm noting this here since I don't have time to send a patch right now; feel free to do so, or I'll try to get to it in a few days.
