
alexnet example

Tobias Kind edited this page Jul 2, 2016 · 14 revisions

The AlexNet demo is a timing benchmark for AlexNet inference. The network was developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton and won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012; the accompanying paper was published at NIPS 2012. All three authors worked at the University of Toronto and later joined the Google Research team. The paper, Imagenet classification with deep convolutional neural networks, is highly cited, with around 6000 citations for a single paper. For comparison, an average computer science paper is cited around 10 times.

The code was released as cuda-convnet and cuda-convnet2, high-performance C++/CUDA implementations of convolutional neural networks; cuda-convnet2 also supports multi-GPU setups.


The code below shows examples and outputs from a CPU-only experiment (no vGPU support in VMs). A detailed discussion with Tesla K40 and GeForce Titan X results can be found on the dedicated TF benchmark page.

We can run a very short CPU-only demo with a batch size of 10 and 10 batches to see how it performs. The commands below come from a docker run; inside the Docker container we can then run:

cd tensorflow
python tensorflow/models/image/alexnet/alexnet_benchmark.py --batch_size 10  --num_batches 10
root@fb729273837c:/tensorflow# python tensorflow/models/image/alexnet/alexnet_benchmark.py --batch_size 10  --num_batches 10
conv1   [10, 55, 55, 64]
pool1   [10, 27, 27, 64]
conv2   [10, 27, 27, 192]
pool2   [10, 13, 13, 192]
conv3   [10, 13, 13, 384]
conv4   [10, 13, 13, 256]
conv5   [10, 13, 13, 256]
pool5   [10, 6, 6, 256]
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 8
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 8
2015-11-26 20:50:38.064129: Forward across 10 steps, 0.170 +/- 0.058 sec / batch
2015-11-26 20:50:54.550445: Forward-backward across 10 steps, 0.746 +/- 0.250 sec / batch
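From the summary lines above, per-image throughput can be estimated by dividing the batch size by the mean time per batch. This is a quick sanity check on the reported numbers, not part of the benchmark script:

```python
# Throughput estimate from the benchmark summary lines (batch size 10):
# "Forward across 10 steps, 0.170 +/- 0.058 sec / batch"
# "Forward-backward across 10 steps, 0.746 +/- 0.250 sec / batch"
batch_size = 10
forward_sec_per_batch = 0.170
fwd_bwd_sec_per_batch = 0.746

forward_ips = batch_size / forward_sec_per_batch
fwd_bwd_ips = batch_size / fwd_bwd_sec_per_batch
print("Forward: %.1f images/sec" % forward_ips)          # ~58.8 images/sec
print("Forward-backward: %.1f images/sec" % fwd_bwd_ips) # ~13.4 images/sec
```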

The longer demo, which also has reference data for GPUs, can be invoked in the Docker image by changing the batch size and the number of batches. We call it with the command-line options --batch_size 128 --num_batches 100:

python tensorflow/models/image/alexnet/alexnet_benchmark.py --batch_size 128  --num_batches 100

or simply

python tensorflow/models/image/alexnet/alexnet_benchmark.py

which will produce the following output:

root@fb729273837c:/tensorflow# python tensorflow/models/image/alexnet/alexnet_benchmark.py --batch_size 128  --num_batches 100
conv1   [128, 55, 55, 64]
pool1   [128, 27, 27, 64]
conv2   [128, 27, 27, 192]
pool2   [128, 13, 13, 192]
conv3   [128, 13, 13, 384]
conv4   [128, 13, 13, 256]
conv5   [128, 13, 13, 256]
pool5   [128, 6, 6, 256]
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 8
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 8
2015-11-26 21:24:30.956213: step 10, duration = 2.271
2015-11-26 21:24:53.889235: step 20, duration = 2.324
2015-11-26 21:25:17.470194: step 30, duration = 2.380
2015-11-26 21:25:40.652514: step 40, duration = 2.396
2015-11-26 21:26:03.827588: step 50, duration = 2.367
2015-11-26 21:26:26.532876: step 60, duration = 2.177
2015-11-26 21:26:49.768012: step 70, duration = 2.293
2015-11-26 21:27:12.705549: step 80, duration = 2.270
2015-11-26 21:27:35.671724: step 90, duration = 2.283
2015-11-26 21:27:56.601975: Forward across 100 steps, 2.285 +/- 0.239 sec / batch
2015-11-26 21:31:22.048707: step 10, duration = 9.731
2015-11-26 21:32:59.365643: step 20, duration = 9.663
2015-11-26 21:34:36.980601: step 30, duration = 9.600
2015-11-26 21:36:15.138785: step 40, duration = 10.290
2015-11-26 21:37:53.282469: step 50, duration = 10.078
2015-11-26 21:39:32.931147: step 60, duration = 9.891
2015-11-26 21:41:12.224409: step 70, duration = 9.843
2015-11-26 21:42:50.520298: step 80, duration = 9.710
2015-11-26 21:44:27.532264: step 90, duration = 9.859
2015-11-26 21:45:56.325980: Forward-backward across 100 steps, 9.721 +/- 0.993 sec / batch
root@fb729273837c:/tensorflow# 
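From the two summary lines in the output above, the relative cost of adding the backward pass on this CPU can be computed directly (simple arithmetic on the reported numbers):

```python
# Mean sec/batch from the benchmark summary lines above (batch size 128).
forward = 2.285           # "Forward across 100 steps, 2.285 +/- 0.239 sec / batch"
forward_backward = 9.721  # "Forward-backward across 100 steps, 9.721 +/- 0.993 sec / batch"

slowdown = forward_backward / forward
throughput_fwd = 128 / forward
print("Forward-backward is %.1fx slower than forward-only" % slowdown)  # ~4.3x
print("Forward throughput: %.1f images/sec" % throughput_fwd)           # ~56.0 images/sec
```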

The TensorFlow AlexNet forward run is the most efficient of all three TF benchmarks (MNIST, CIFAR10, and AlexNet), reaching up to 90% CPU utilization, as shown below. The benchmark is synthetic and has a very small memory footprint: no real external image data is read or processed.

(Figure: tensorflow-alexnet-forward-cpu-only)


The TensorFlow AlexNet forward-backward run is much slower: CPU utilization drops to about 30%, and the per-batch time increases roughly four-fold (9.7 s vs. 2.3 s per batch in the run above).

(Figure: tensorflow-alexnet-forward-backward-cpu-only)


For a more memory-intensive benchmark we can use a larger batch size, which occupies more RAM on the CPU and GPU. If swap memory is used, efficiency drops to almost zero; here PCIe-based SSDs, SSD RAIDs, or large terabyte-scale RAM disks can help.

python tensorflow/models/image/alexnet/alexnet_benchmark.py --batch_size 512  --num_batches 3  
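To get a feel for why a larger batch size inflates memory use, we can estimate the activation memory implied by the layer shapes the benchmark prints, scaled to a batch size of 512. This is a rough sketch assuming 4-byte float32 activations and counting only the printed layers; real usage also includes weights, gradients, and framework overhead:

```python
# Layer output shapes as printed by the benchmark, scaled to batch size 512.
# Assumes float32 (4 bytes/element); ignores weights, gradients, and workspace.
layers = [
    ("conv1", (512, 55, 55, 64)),
    ("pool1", (512, 27, 27, 64)),
    ("conv2", (512, 27, 27, 192)),
    ("pool2", (512, 13, 13, 192)),
    ("conv3", (512, 13, 13, 384)),
    ("conv4", (512, 13, 13, 256)),
    ("conv5", (512, 13, 13, 256)),
    ("pool5", (512, 6, 6, 256)),
]

total_bytes = 0
for name, (n, h, w, c) in layers:
    size = 4 * n * h * w * c  # bytes for this layer's activations
    total_bytes += size
    print("%s: %7.1f MB" % (name, size / 2.0**20))

print("total activations: %.1f MB" % (total_bytes / 2.0**20))  # ~1.1 GB
```

At batch size 512 the activations alone already exceed a gigabyte, which explains why only a few batches (--num_batches 3) are run for this configuration.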

LINKS

  • AlexNet paper - Imagenet classification with deep convolutional neural networks
  • Parallel GPUs - One weird trick for parallelizing convolutional neural networks by Alex Krizhevsky
  • Convnet bench - Convnet benchmarks by Soumith Chintala
  • CCV - another convolutional network library
  • Jetson TK1 - Nvidia Jetson TK1 Reviewed
  • JTK1 power - power draw of Jetson TK1 boards
  • Nvidia - ACCELERATED COMPUTING: THE PATH FORWARD - by Jen-Hsun Huang
