
[August 2015] Rejigging the marks... #56

@soumith

Description


The benchmarks this time around are interesting, with some fairly clear trends emerging for the near future.

Looking Back

First, some appreciation for where things are,

  • 9 months ago, we were ~3x slower on AlexNet and ~4x slower on Overfeat.
    Training that took 3 weeks now takes 1 week (on the same 1-GPU metric). That is a huge fundamental speedup, saving many man-hours otherwise spent waiting for experiments to finish.

Pushing these boundaries so fast, in such a short time-frame, is quite something. There are two teams who have made this happen:

  • NVIDIA, with their Maxwell cards that are fast as f**k
  • Nervana Systems (Scott Gray and team) who have pushed the CUDA kernels to the limits of the GPUs with efficiencies > 95%.

Now

The result of Nervana pushing the limits of compute is that others competing on speed had to play smarter. Nervana has pushed the limits so hard that the GPU can't sustain its boosted clock speeds for long and has to throttle down a bit.

Nervana had the flexibility to choose the ideal data layout for the task, and they used it to its maximum potential, combined with very low-level optimizations and hand-coded assembly.

The trend of the near-future

The CuDNN and Facebook teams did not have this kind of flexibility, because they were constrained to support existing frameworks such as Caffe and Torch, which froze themselves to the BDHW data layout. That is not the ideal data layout for convolutions in the spatial domain.
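To make the layout point concrete, here is a small numpy sketch (the shapes are hypothetical, chosen only for illustration) contrasting BDHW, i.e. NCHW, with a batch-innermost layout such as CHWN, the kind of layout a from-scratch kernel is free to pick:

```python
import numpy as np

# Hypothetical sizes: batch=128, depth (channels)=64, height=width=16.
# BDHW (a.k.a. NCHW) is the layout Caffe and Torch are frozen to.
x_bdhw = np.zeros((128, 64, 16, 16), dtype=np.float32)

# A kernel free to pick its own layout can instead use e.g. CHWN
# (batch innermost), so loads across the batch dimension are contiguous.
x_chwn = np.ascontiguousarray(x_bdhw.transpose(1, 2, 3, 0))

print(x_bdhw.shape)  # (128, 64, 16, 16)
print(x_chwn.shape)  # (64, 16, 16, 128)
```

The data is identical either way; only the memory order, and hence which access patterns are coalesced on the GPU, changes.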

Switching to FFT-based convolutions and optimizing the hell out of them was an obvious choice.
However, there has been skepticism that FFT-based convolutions take too much extra memory.
The Facebook convolutions (FBFFT, FBCuFFT) seemed to confirm this: they were fairly fast, but took an unreasonable amount of extra memory.

However, FFT-based convolutions don't necessarily need a lot of extra memory, especially if one writes the full FFT pipeline from scratch. In fact, Nicolas Vasilache from Facebook demonstrated that FFT-based convolutions don't need any extra memory with a single-threaded implementation, though he did not optimize it further to achieve competitive performance. He also showcased a tiling strategy for FFT-based convolutions that speeds them up quite a bit while also reducing the extra memory needed.
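To illustrate the tiling idea in one dimension, here is a minimal numpy sketch of overlap-add FFT convolution: the input is processed in fixed-size tiles, so only tile-sized FFT buffers are live at any moment instead of one FFT the size of the whole input. This is an illustrative sketch, not Facebook's actual implementation; all names and sizes are made up.

```python
import numpy as np

def fft_conv_tiled(signal, kernel, tile=64):
    """Full 1-D convolution via overlap-add: FFT each tile, multiply
    by the kernel's FFT, and add the overlapping tails back together."""
    k = len(kernel)
    out = np.zeros(len(signal) + k - 1)
    fft_len = tile + k - 1               # each tile's full-conv length
    K = np.fft.rfft(kernel, fft_len)     # kernel FFT, computed once
    for start in range(0, len(signal), tile):
        chunk = signal[start:start + tile]
        seg = np.fft.irfft(np.fft.rfft(chunk, fft_len) * K, fft_len)
        n = len(chunk) + k - 1
        out[start:start + n] += seg[:n]  # overlap-add the tile's tail
    return out

x = np.random.randn(1000)
h = np.random.randn(15)
out = fft_conv_tiled(x, h)
```

The working set here is O(tile + kernel) rather than O(input), which is the memory win; the same overlap-add idea extends to 2-D image tiles.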

NVIDIA with their R3 release of CuDNN show that their FFT based convolutions can be very competitive in speed with Nervana kernels, and faster in some cases. (See imagenet_winners in README.md on the main page for more details)

One has to remember that FFT-based convolutions take the same time to compute regardless of the kernel size (except in a tiling FFT strategy). So whether you have a 3x3 convolution layer or a 15x15 one, it takes the same time.
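The reason is that in the frequency domain, the kernel is zero-padded up to the input size before its FFT, so the FFTs and the pointwise product do identical work for any kernel size. A minimal numpy sketch (this computes a circular convolution, which is enough to show the cost structure):

```python
import numpy as np

def fft_conv2d(img, kern):
    """Circular 2-D convolution via FFT. The kernel is zero-padded to
    the image size, so a 3x3 and a 15x15 kernel cost the same:
    same-size FFTs, same-size pointwise product, same-size inverse FFT."""
    F_img = np.fft.rfft2(img)
    F_k = np.fft.rfft2(kern, s=img.shape)  # pad kernel up to image size
    return np.fft.irfft2(F_img * F_k, s=img.shape)

img = np.random.randn(64, 64)
out3 = fft_conv2d(img, np.random.randn(3, 3))    # same FFT sizes...
out15 = fft_conv2d(img, np.random.randn(15, 15)) # ...as this call
```

By contrast, a direct spatial-domain convolution does work proportional to the kernel area, so large kernels are where the FFT approach wins biggest.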

NVIDIA fused many of the CUDA kernels in their implementation to reduce the amount of extra memory needed by the FFT convolutions, and it is only a matter of time before they release completely fused kernels that barely need any extra memory.

CuDNN (R3) extra memory to train the ImageNet winners:

  Network     Extra memory
  AlexNet     324 MB
  VGG-A       2.6 GB
  Overfeat    2.59 GB
  GoogleNet   202 MB

The overall trend I see is that:

  • Nervana has already pushed spatial-domain convolutions to their limits and has little room left to speed things up further
  • FFT-based convolutions seem to have much more room to optimize
  • NVIDIA has switched its focus to optimizing FFT convolutions and already has very competitive performance, which will only improve over time.

p.s.: sorry for not finishing the Chainer benchmarks; I am having some issues running things, as my Chainer install seems to have strange CUDA issues. I will update the README with those results in a week or two when I get time. Overall, my first impression of Chainer is that I am a bit annoyed at the 15/20/30 seconds it takes to compile the compute graph. If I read the documentation hard enough, I'll probably find a debug mode that starts running faster; I haven't gotten that far yet!
