CUDA default threads per block is unintentionally set to 512 instead of 1024 #3281
Description
#62 (!) introduced a preprocessor check to allow Caffe to run on CUDA devices of compute capability < 2 by setting CAFFE_CUDA_NUM_THREADS (the number of threads per block used for most of Caffe's CUDA kernels) to 512 if needed.
According to my recent check and my reading of the CUDA programming guide, this code is not correct. __CUDA_ARCH__ is not defined in host code (see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#application-compatibility), so the check always fails there and CAFFE_CUDA_NUM_THREADS always ends up as 512. (The preprocessor doesn't care that the macro isn't defined; inside #if it just treats it as zero. See https://gcc.gnu.org/onlinedocs/cpp/If.html.)
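To make the failure mode concrete, here is a minimal sketch of the pattern as I understand it, not a copy of the Caffe source. The names `kSketchNumThreads` and `sketch_num_threads` are made up for illustration, and the `>= 200` threshold is my assumption about how "compute capability >= 2" would be spelled. It builds with a plain host C++ compiler, which, like the host pass of nvcc, does not define __CUDA_ARCH__:

```cpp
// Hypothetical stand-in for the check described above; the threshold 200
// and all names here are assumptions, not taken from the Caffe source.
// __CUDA_ARCH__ is undefined in host code, so #if evaluates it as 0
// and the #else branch is the one that gets compiled.
#if __CUDA_ARCH__ >= 200
const int kSketchNumThreads = 1024;
#else
const int kSketchNumThreads = 512;
#endif

// Compile-time confirmation that host code sees the fallback value.
static_assert(kSketchNumThreads == 512,
              "host compilation takes the #else branch");

// Host-side accessor, analogous to host code choosing a launch block size.
int sketch_num_threads() { return kSketchNumThreads; }
```

Since kernel launch dimensions are chosen in host code, the value seen there is the one that matters. Compiling with -Wundef (GCC/Clang) should flag the undefined macro in the #if.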
I don't know if this has any performance implications; it might be fine to just leave threads/block at 512 and remove the dead code.
I'm noting here since I don't have time to send a patch right now; feel free to do so, or I'll try to get to it in a few days.