
cifar10 example

Tobias Kind edited this page Dec 4, 2017 · 34 revisions

The cifar10.py example is a classic example of image classification using convolutional neural networks. As explained on the CIFAR-10 website, this small dataset consists of 60,000 32x32 colour images in 10 classes (airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships and trucks).

[Image: catdog. Source: Wikipedia; CatDog: Peter Hannan Productions.]

Historically, it is interesting that TensorFlow also derives from an older Google project, cuda-convnet, a high-performance C++/CUDA implementation of convolutional neural networks. On a side note, there was also a cifar10 Kaggle competition that used the same dataset and achieved 95.53% accuracy with a different algorithm. Plus, the algorithm may help to invoke dreaming in electric sheep, a reference to the interesting book by Philip K. Dick and the movie Blade Runner.


Before we can start we need to clone the TensorFlow repository or download it locally with

git clone https://github.com/tensorflow/tensorflow.git

Then we can change into the tensorflow directory and start training:

python tensorflow/models/image/cifar10/cifar10_train.py

After starting, the program will create /tmp/cifar10_data and fill it with the files downloaded from the CIFAR-10 repository. On a side note, cifar10_multi_gpu_train.py will also run on a CPU, even without CUDA installed.

vm@ubuntu:/tmp/cifar10_data/cifar-10-batches-bin$ ls -l
total 180104
-rw-r--r-- 1 vm vm       61 Jun  4  2009 batches.meta.txt
-rw-r--r-- 1 vm vm 30730000 Jun  4  2009 data_batch_1.bin
-rw-r--r-- 1 vm vm 30730000 Jun  4  2009 data_batch_2.bin
-rw-r--r-- 1 vm vm 30730000 Jun  4  2009 data_batch_3.bin
-rw-r--r-- 1 vm vm 30730000 Jun  4  2009 data_batch_4.bin
-rw-r--r-- 1 vm vm 30730000 Jun  4  2009 data_batch_5.bin
-rw-r--r-- 1 vm vm       88 Jun  4  2009 readme.html
-rw-r--r-- 1 vm vm 30730000 Jun  4  2009 test_batch.bin
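
Each data_batch file above is exactly 30,730,000 bytes because it holds 10,000 fixed-size records. A minimal stdlib sketch of that record layout, assuming the format documented on the CIFAR-10 page (1 label byte followed by 3072 channel-major pixel bytes); the synthetic record stands in for a real slice of data_batch_1.bin:

```python
# Each CIFAR-10 binary record is 1 label byte + 32*32*3 = 3072 pixel
# bytes (all red values, then green, then blue). A batch file holds
# 10,000 records, hence the 10,000 * 3073 = 30,730,000 bytes seen above.
RECORD_BYTES = 1 + 32 * 32 * 3  # 3073

def parse_record(raw: bytes):
    """Split one 3073-byte record into (label, pixels)."""
    assert len(raw) == RECORD_BYTES
    label = raw[0]       # class index 0..9, e.g. 3 = cat
    pixels = raw[1:]     # channel-major: 1024 R, 1024 G, 1024 B values
    return label, pixels

# Synthetic record: label 3 ("cat") followed by an all-black image.
label, pixels = parse_record(bytes([3]) + bytes(3072))
print(label, len(pixels))  # 3 3072
```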

The script will also create /tmp/cifar10_train, which contains runtime checkpoints. This seemingly random information will become important down the line (once we wait and age a lot) and will reveal itself with a beautiful picture in this wiki. Stay with me.

vm@ubuntu:/tmp/cifar10_train$ ls -l
total 107100
-rw-r--r-- 1 vm vm      395 Nov 21 16:01 checkpoint
-rw-r--r-- 1 vm vm 66896835 Nov 21 16:02 events.out.tfevents.1448090579.ubuntu
-rw-r--r-- 1 vm vm  8549875 Nov 21 15:41 model.ckpt-100000
-rw-r--r-- 1 vm vm  8549875 Nov 21 15:51 model.ckpt-101000
-rw-r--r-- 1 vm vm  8549875 Nov 21 16:01 model.ckpt-102000
-rw-r--r-- 1 vm vm  8549875 Nov 21 15:21 model.ckpt-98000
-rw-r--r-- 1 vm vm  8549875 Nov 21 15:31 model.ckpt-99000
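
Exactly five model.ckpt-* files survive in the listing above because the trainer's Saver keeps only the most recent checkpoints (five is the TensorFlow default for max_to_keep) and deletes older ones. A stdlib sketch of that pruning logic (the function name is mine, not a TensorFlow API):

```python
import re

def checkpoints_to_keep(filenames, max_to_keep=5):
    """Return the max_to_keep newest checkpoints, judged by step number."""
    steps = []
    for name in filenames:
        m = re.fullmatch(r"model\.ckpt-(\d+)", name)
        if m:
            steps.append((int(m.group(1)), name))
    steps.sort(reverse=True)            # newest (highest step) first
    return [name for _, name in steps[:max_to_keep]]

# Checkpoints written every 1000 steps, as in the directory listing:
files = ["model.ckpt-%d" % s for s in range(95000, 103000, 1000)]
print(checkpoints_to_keep(files))
# keeps 102000 down to 98000, matching the five files shown above
```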

The performance curve below for an 8-thread (4-core) Core i7-2600K @ 4.2 GHz shows a different utilization pattern from the MNIST example, which had spikes. Here the CPU utilization in the VM is somewhat flat, but not at 100%. Outside testing shows that turbo boost above 4 GHz is consistently on.

[Image: tensorflow-cifar10-cpu-profile]

On a CPU alone the program will run practically endlessly. I expected 4 hours on a GPU (see benchmarks) and maybe 8 hours on a CPU-only TensorFlow. So far my hair is growing and getting gray: according to my task manager I have been sitting here for 3 days and 6 hours. Well, instead of twiddling my thumbs I read all the TensorFlow books (kidding) and learned about the independent but included application TensorBoard.

2015-11-21 17:11:43.034482: step 108870, loss = 0.59 (218.6 examples/sec; 0.586 sec/batch)
2015-11-21 17:11:48.960857: step 108880, loss = 0.75 (206.5 examples/sec; 0.620 sec/batch)
2015-11-21 17:11:55.013029: step 108890, loss = 0.81 (215.8 examples/sec; 0.593 sec/batch)
2015-11-21 17:12:00.956251: step 108900, loss = 0.71 (221.9 examples/sec; 0.577 sec/batch)
2015-11-21 17:12:07.660601: step 108910, loss = 0.57 (204.3 examples/sec; 0.627 sec/batch)
2015-11-21 17:12:13.678137: step 108920, loss = 0.70 (218.0 examples/sec; 0.587 sec/batch)
2015-11-21 17:12:19.716222: step 108930, loss = 0.70 (210.3 examples/sec; 0.609 sec/batch)
2015-11-21 17:12:25.684618: step 108940, loss = 0.57 (216.6 examples/sec; 0.591 sec/batch)
...
2015-11-22 09:40:54.862983: step 207180, loss = 0.23 (219.1 examples/sec; 0.584 sec/batch)
2015-11-22 09:41:00.791769: step 207190, loss = 0.33 (213.8 examples/sec; 0.599 sec/batch)
2015-11-22 09:41:07.006214: step 207200, loss = 0.23 (222.0 examples/sec; 0.576 sec/batch)
2015-11-22 09:41:13.921191: step 207210, loss = 0.28 (215.5 examples/sec; 0.594 sec/batch)
2015-11-22 09:41:19.909367: step 207220, loss = 0.20 (230.6 examples/sec; 0.555 sec/batch)
2015-11-22 09:41:25.851072: step 207230, loss = 0.24 (216.8 examples/sec; 0.590 sec/batch)
2015-11-22 09:41:31.821360: step 207240, loss = 0.18 (205.4 examples/sec; 0.623 sec/batch)
...
2015-11-23 20:25:51.276225: step 390310, loss = 0.11 (213.2 examples/sec; 0.600 sec/batch)
2015-11-23 20:25:57.075866: step 390320, loss = 0.12 (222.5 examples/sec; 0.575 sec/batch)
2015-11-23 20:26:02.874600: step 390330, loss = 0.10 (226.9 examples/sec; 0.564 sec/batch)
2015-11-23 20:26:08.801202: step 390340, loss = 0.10 (211.3 examples/sec; 0.606 sec/batch)
...
2015-11-24 18:22:30.846715: step 518730, loss = 0.10 (211.4 examples/sec; 0.606 sec/batch)
2015-11-24 18:22:36.841336: step 518740, loss = 0.10 (209.6 examples/sec; 0.611 sec/batch)
2015-11-24 18:22:42.813667: step 518750, loss = 0.11 (220.2 examples/sec; 0.581 sec/batch)
2015-11-24 18:22:48.861087: step 518760, loss = 0.09 (217.1 examples/sec; 0.590 sec/batch)
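
These log lines are regular enough to parse; a small stdlib sketch that averages the examples/sec values instead of eyeballing single lines (the regex is my own, matched to the excerpt above, not a TensorFlow API):

```python
import re

# Matches lines like:
# 2015-11-21 17:11:43.034482: step 108870, loss = 0.59 (218.6 examples/sec; 0.586 sec/batch)
LINE = re.compile(r"step (\d+), loss = ([\d.]+) \(([\d.]+) examples/sec")

log = """\
2015-11-21 17:11:43.034482: step 108870, loss = 0.59 (218.6 examples/sec; 0.586 sec/batch)
2015-11-21 17:11:48.960857: step 108880, loss = 0.75 (206.5 examples/sec; 0.620 sec/batch)
"""

rates = [float(m.group(3)) for m in LINE.finditer(log)]
print(sum(rates) / len(rates))  # mean examples/sec over the excerpt
```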

Using TensorBoard we will get real graphs and real information instead of raw timestamps, step 108940, or loss = 0.57 (my loss, your loss?). The TensorBoard application can be started for the cifar10 example via the command:

tensorboard --logdir /tmp/cifar10_train/

which will lead to the terminal output below. Now we recognize why it is so important to debug and understand the inner guts of each and every program. Because we tightly monitored all TensorFlow I/O (could be Stuxnet, right?), we already knew that the /tmp/cifar10_train/ directory had been created and that it contains a number of event and checkpoint files, so we have the advantage of simply passing that directory to TensorBoard.

vm@ubuntu:~/tensorflow$ tensorboard --logdir /tmp/cifar10_train/
Starting TensorBoard on port 6006
(You can navigate to http://localhost:6006)
127.0.0.1 - - [21/Nov/2015 19:40:22] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [21/Nov/2015 19:40:22] "GET /lib/css/global.css HTTP/1.1" 200 -

With that we can open the address http://localhost:6006 in a local web browser such as Edge (wait a minute), Chrome or Firefox. There will be a number of issues unless everything is installed correctly, but we can now investigate what our cifar10.py program is actually doing. Very nice. Really. Much appreciated. Below we can see the CIFAR-10 TensorBoard.

[Image: cifar10-tensorboard-200k]

From time to time, we can call cifar10_eval.py to get the precision of the calculation, i.e. how often the top prediction matches the true label of the image. Don't be afraid if there is some warning salad; just grep the precision value from the output.

python cifar10_eval.py

2015-11-23 22:33:43.662213: precision @ 1 = 0.855
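
"Precision @ 1" is plain top-1 accuracy: the fraction of test images whose highest-scoring class equals the true label. A minimal pure-Python sketch (function name and toy numbers are illustrative, not taken from cifar10_eval.py):

```python
def precision_at_1(logits, labels):
    """logits: per-image lists of class scores; labels: true class indices."""
    hits = sum(1 for scores, y in zip(logits, labels)
               if max(range(len(scores)), key=scores.__getitem__) == y)
    return hits / len(labels)

# Tiny illustration: 4 images, 3 classes, the last prediction is wrong.
logits = [[0.1, 0.7, 0.2], [0.9, 0.05, 0.05], [0.2, 0.3, 0.5], [0.6, 0.3, 0.1]]
labels = [1, 0, 2, 2]
print(precision_at_1(logits, labels))  # 0.75
```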

Training result: According to the official source code, training with cifar10_train.py achieves ~86% accuracy after 100K steps (256 epochs of data) as judged by cifar10_eval.py. See the comparison of Tesla and Pascal (1080Ti) chips at the benchmark page. Below we can see the tremendous 10-fold speed-up obtained over the years, while prices for comparable consumer CUDA cards have also fallen 10-fold.

# Source: cifar10_train.py
Speed: With batch_size 128.
System        | Step Time (sec/batch) | Accuracy                      | Price 
------------------------------------------------------------------------------------
1x Tesla K20m | 0.35-0.60             | ~86% at 60K  steps (5 hours)  | $3,200 (2013)
1x Tesla K40m | 0.25-0.35             | ~86% at 100K steps (4 hours)  | $3,200
4x Tesla K20m | 0.10                  | ~84% at 30K  steps (NA)       | $12,800
2x 1080Ti FE  | 0.011                 | ~86% at 60k  steps (26 min)   | $1,500 (2017)
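
The "256 epochs" figure and the examples/sec numbers in the training log both follow from simple arithmetic on the table's constants; a quick stdlib sketch, assuming the standard 50,000-image CIFAR-10 training set and the batch size 128 from the table header:

```python
TRAIN_IMAGES = 50_000  # CIFAR-10 training set size
BATCH_SIZE = 128       # "Speed: With batch_size 128." above

def epochs(steps):
    """Full passes over the training set after a given number of steps."""
    return steps * BATCH_SIZE / TRAIN_IMAGES

print(epochs(100_000))  # 256.0 epochs at 100K steps, as stated above

# sec/batch and examples/sec are two views of the same throughput:
print(round(BATCH_SIZE / 0.586, 1))  # ~218.4, matching the CPU log lines
```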

Links

  • Electric sheep dreams - Google's AI Can Dream, and Here's What it Looks Like by Caroline Reid

  • cifar10@Kaggle - the cifar-10 Kaggle competition winner with 95.53% accuracy

  • cifar10@qiita - "I tried installing cifar10 and TensorFlow (GPU version) on Ubuntu" (in Japanese)

  • cifar@tensorboard - Investigating variable input/output on Tesla GPUs

  • cifar10@TF - The tutorial which explains all additional ideas and commands
