Since this February, there has been a "(convolutional) neural network toolkit", CXXNET, built on mshadow, the "Lightweight CPU/GPU Matrix/Tensor Template Library in C++/CUDA". The toolkit can classify 400 images per second, i.e. about 35 million per day, on a GTX 780 GPU. It seems to be faster than Caffe, which can process 20 million per day on a K20 and 40 million per day on a K40.
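The per-second and per-day figures above can be related with a little arithmetic. The sketch below only converts the numbers quoted in this post; the function names are made up for illustration, and it assumes the GPU runs at a constant rate for a full 24 hours.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def per_day_from_rate(images_per_second: float) -> float:
    """Convert a sustained images/second rate into images/day."""
    return images_per_second * SECONDS_PER_DAY


def rate_from_per_day(images_per_day: float) -> float:
    """Convert an images/day figure back into images/second."""
    return images_per_day / SECONDS_PER_DAY


# Figures quoted in the post (hypothetical labels, not benchmark output).
cxxnet_gtx780_per_day = per_day_from_rate(400)  # 34,560,000, i.e. about 35 million
caffe_k20_rate = rate_from_per_day(20e6)        # roughly 231 images/second
caffe_k40_rate = rate_from_per_day(40e6)        # roughly 463 images/second

print(f"CXXNET on GTX 780: {cxxnet_gtx780_per_day:,.0f} images/day")
print(f"Caffe on K20:      {caffe_k20_rate:.0f} images/second")
print(f"Caffe on K40:      {caffe_k40_rate:.0f} images/second")
```

Note that the three figures come from three different GPUs, so the conversion only puts them in the same units; it does not make them a like-for-like benchmark.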
Because CXXNET is built on the tensor library, its code is also much more concise than Caffe's.