tf benchmarks
TensorFlow (TF) benchmarks will tell us whether the Terminator comes sooner or later. If you read the news, artificial intelligence is the biggest threat to humanity since cockroaches invaded the earth (together with those darn squirrels).
Based on the ML principle "don't believe benchmarks you did not falsify yourself", I will try to benchmark my own systems. Unfortunately the initial Windows snubbing by TensorFlow limited my benchmarks to VMs (update: TF now works under Windows). Didn't Bill Gates recently save Google? They could be a bit more graceful. Or was it Apple? And no GPUs in virtual machines. And did I mention the missing AMD and OpenCL support already? Anyway, here are some collections of benchmarks from different sources.
Benchmarks can be geared towards accuracy, standard deviations/errors, memory consumption, performance and parallel scaling. It becomes clear very soon that using TensorFlow with CPUs only may be cheap, but it is not very fast. See for example the cifar10 results with dual-GPU setups below. TensorFlow may not be the fastest or most accurate framework, but it scales well with the use of additional GPUs.
This collection of tensorflow cifar10 performance numbers covers GPU and CPU performance from November 2015 onward; sources are listed below. Some explanation: the cifar10 program tries to distinguish camels from pigeons, cats from dogs etc. For examples/sec, higher is better; for sec/batch, lower is better. The list cannot be sorted, thanks to the ... markdown language.
| Num | Platform | examples/sec | sec/batch | Price | Perf/$ |
|---|---|---|---|---|---|
| -1 | 2x GTX 1080 TI | 12486.5 | 0.010 | price: $1600 | 7.80 |
| 0 | GTX 1080 | 1780.0 | 0.072 | price: $814 | 2.19 |
| 1 | GTX 1070 | 1733.1 | 0.074 | price: $449 | 3.85 |
| 2 | 2x Geforce TitanX | 796.7 | 0.161 | price: $2060 | 0.38 |
| 3 | 1x Geforce TitanX | 550.1 | 0.233 | price: $1030 | 0.53 |
| 4 | i7-3770K & GTX 970 | 641.4 | 0.200 | price: $630 CPU+GPU | 1.01 |
| 5 | Xeon E5-2670 + 1 GPU | 325.2 | 0.394 | Amazon g2.2xlarge | 0.12 |
| 6 | Xeon E5-2670 + 4 GPUs | 337.9 | 0.379 | Amazon g2.8xlarge | 0.06 |
| 7 | Tesla K40 | 350.0 | 0.250 | price: $3000 | 0.16 |
| 8 | Tesla K20 | NA | 0.350 | price: $2000 | NA |
| 9 | Core i7-2600K 4.2 GHz | 230.8 | 0.555 | price: $330 | 0.69 |
| 10 | 4x NVIDIA Tesla K20M | NA | 0.100 | price: $10996 | NA |
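The Perf/$ column is simply examples/sec divided by price. A quick sketch to recompute it (row data taken from the table above, values match up to rounding):

```python
# Recompute the Perf/$ column: examples per second per dollar spent.
rows = [
    ("2x GTX 1080 Ti", 12486.5, 1600),
    ("GTX 1080", 1780.0, 814),
    ("GTX 1070", 1733.1, 449),
]
for name, examples_per_sec, price in rows:
    print(f"{name}: {examples_per_sec / price:.2f} examples/sec per $")
```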
One interesting conclusion from the table above is that the Tesla K40 carries a premium price tag, but its performance does not hold up to the three-fold higher price. The desktop computer with a Core i7-2600K is the cheapest, but also the slowest with CPU only, around 54 times slower than the fastest entry!
The fastest computer contains 2x GTX 1080 Ti Founders Edition with 11 GB RAM each, but also uses a fast i7-6900K CPU and an Intel server SSD. The data needs to be force-fed from the CPU into the GPU, so SSD + fast CPU + a modern GPU with lots of GPU RAM all help. However, the cifar10 example currently does not scale very well: the same computer with just one GTX 1080 Ti Founders Edition (11 GB) actually runs at double the per-card performance.
Now the table above is just an academic exercise: full system costs are at least double these prices, and of course energy costs will eat away personal budgets very quickly. A Titan X under full power consumes around 250 Watts, meaning four cards under full load will draw 1 kW. So if the computer runs at full load for 20 hours a day, it will consume 20 kWh (kilowatt hours). Energy costs will be 20 kWh × $0.05/kWh = $1.00 per day in the US. For 360 days = 360 bucks. That's just the GPUs, not the whole system.
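The energy arithmetic above in a few lines (all values taken from the text):

```python
# GPU energy cost estimate: four Titan X cards at full load.
watts_per_gpu = 250          # approximate full-load draw of one Titan X
num_gpus = 4
hours_per_day = 20
price_per_kwh = 0.05         # US rate used in the text, in $/kWh

kwh_per_day = watts_per_gpu * num_gpus / 1000 * hours_per_day   # 20 kWh
cost_per_day = kwh_per_day * price_per_kwh                      # ~$1.00
cost_per_year = cost_per_day * 360                              # ~$360
print(cost_per_day, cost_per_year)
```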
Source -1:
Core i7-6900K (8 core 3.2 Ghz), 2x NVIDIA® GeForce® GTX 1080 Ti Founders Edition 11GB GDDR5X (1xHDMI, 3xDP) each 3584 CUDA Cores; total 7168 CUDA cores; 64 GByte RAM PC4-17000 2133MHz DDR4; 480GB Intel® SSD Pro 5400s; System price: $4500
Source
Source 0:
Zotac GTX 1080 AMP Extreme, 2560 CUDA cores, 1771 MHz core clock, 10000 MHz mem clock. i7 930 3.8 GHz boost clock.
step 100000, loss = 0.72 (1780.0 examples/sec; 0.072 sec/batch); time: 2h 5m.
Source
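The training log lines quoted in these sources all follow the same pattern, so throughput can be pulled out with a small parser (a hypothetical helper of my own, not part of the TF code):

```python
import re

# Parse a cifar10_train.py progress line into its numeric fields.
LOG_RE = re.compile(
    r"step (\d+), loss = ([\d.]+) "
    r"\(([\d.]+) examples/sec; ([\d.]+) sec/batch\)"
)

def parse_log_line(line):
    m = LOG_RE.search(line)
    if m is None:
        return None
    step, loss, eps, spb = m.groups()
    return {"step": int(step), "loss": float(loss),
            "examples_per_sec": float(eps), "sec_per_batch": float(spb)}

print(parse_log_line(
    "step 100000, loss = 0.72 (1780.0 examples/sec; 0.072 sec/batch)"))
```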
Source 1:
Asus GTX 1070 Strix, 1920 CUDA cores, 1860 MHz core clock, 8000 MHz mem clock.
i7 6700k 4.2 GHz boost clock.
Source
Source 2 and 3:
Geforce TitanX, Core Clock 1127 MHz ; 3072 CUDA Cores; no CPU info.
Source
Source 4:
GeForce GTX 970 (4 Gbyte RAM and 1.253 GHz clock rate, 7Ghz memory clock rate) with 1664 CUDA Cores
CPU: i7-3770K CPU @ 3.50GHz CPU has 8 threads and 4 cores
Source
Source 5 & 6:
On a g2.2xlarge: step 100, loss = 4.50 (325.2 examples/sec; 0.394 sec/batch)
1x NVIDIA GRID K520 GPU with 1,536 CUDA cores, plus 8 cores of an Intel Xeon E5-2670
On a g2.8xlarge: step 100, loss = 4.49 (337.9 examples/sec; 0.379 sec/batch)
4x NVIDIA GRID K520 GPUs, each with 1,536 CUDA cores and 32 threads of Intel Xeon E5-2670
Unfortunately, it doesn't seem to be able to use the 4 GPU cards :(
Source and EC2 defs
Source 7 & 8:
On a single Tesla K40, cifar10_train.py processes a single batch of 128 images
in 0.25-0.35 sec (i.e. 350 - 600 images/sec). The model reaches ~86%
accuracy after 100K steps in 8 hours of training time. (source)
With batch_size 128.
| System | Step Time (sec/batch) | Accuracy |
|---|---|---|
| 1 Tesla K20m | 0.35-0.60 | ~86% at 60K steps (5 hours) |
| 1 Tesla K40m | 0.25-0.35 | ~86% at 100K steps (4 hours) |
Source
Source 9:
Core i7-2600K 4.2 GHz (2011) CPU only running Oracle Virtual Box with Ubuntu 13 (EOL)
Source 10:
Four Tesla K20m each with 2496 CUDA Cores
source
This is the alexnet inference network benchmark. The standard TF benchmark conditions are a batch size of 128 measured across 100 steps. The benchmark works on CPUs as well as GPUs. It contains a forward and a forward-backward pass. The forward pass is highly efficient (80-90% core utilization) on CPUs, the forward-backward pass less so (30-50%). The benchmark measures outcomes in ms/batch (milliseconds per batch; lower is better).
The benchmark is invoked by calling:
```
$ python tensorflow/models/image/alexnet/alexnet_benchmark.py
```
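The ms/batch metric is just average wall-clock time over a fixed number of steps. A generic timing harness in that spirit (a sketch; the real benchmark also discards warm-up steps and reports a standard deviation):

```python
import time

def ms_per_batch(step_fn, num_steps=100, burn_in=10):
    """Average wall-clock milliseconds per call of step_fn."""
    for _ in range(burn_in):          # warm-up calls are not timed
        step_fn()
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    return (time.perf_counter() - start) / num_steps * 1000.0

# Example with a dummy workload instead of an AlexNet forward pass:
print(f"{ms_per_batch(lambda: sum(range(10000))):.3f} ms/batch")
```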
| Num | Platform | fwd ms/bat | fwd-bw ms/bat | Price $ |
|---|---|---|---|---|
| 0 | GeForce GTX 1080 Ti | 25 | 76 | 800 |
| 1 | Titan X | 70 | 244 | 1000 |
| 2 | Tesla K40c | 145 | 480 | 3000 |
| 3 | GeForce TITAN X | 91 | 301 | 1000 |
| 4 | Geforce GTX TITAN X | 100 | 328 | 1000 |
| 5 | i7-2600K 4.2 GHz | 2456 | 9981 | 330 |
| 6 | i7-6900K 3.2 GHz | 932 | 2864 | 900 |
Total run time for the GeForce GTX 1080 Ti is 16 seconds(!) and for the Core i7-6900K (8-core) CPU it is 7 minutes. That is a 26-fold speed advantage for the GPU.
The Sandy-Bridge Core i7-2600K CPU has only 4 cores and 8 threads running at 4.2 GHz with a maximum DDR3 memory bandwidth of 21 GB/sec. In comparison, the Geforce Titan X has around 3072 CUDA cores running at 1.1 GHz and a maximum memory bandwidth of 336.5 GB/sec. So while the clock-speed advantage for the CPU is about 4:1 and its price advantage is 3:1, its core-count disadvantage is 1:768(!) and its memory bandwidth disadvantage is 1:16. The overall performance disadvantage for the CPU is 1:40 — basically the single Geforce Titan X GPU is 40 times faster than the CPU.
The performance-per-dollar column above is probably not very useful, because with a 38-fold performance-per-dollar advantage you will also be 40-fold slower. Plus, this benchmark is in the millisecond range and has to be considered a synthetic micro-benchmark. The most expensive CUDA bottleneck is the CPU-GPU transfer, basically efficiently moving gigabytes of data from the CPU to the GPU for processing. We can see that in the cifar10 benchmark, where the CUDA speed advantage shrinks considerably compared to a similarly priced Xeon CPU.
For those compute centers that are not next to a nuclear power plant, performance per Watt may play a role, so more efficient designs such as the Jetson TK1 or good old FPGAs may be interesting. But then again, you will be 10-fold slower and pay a 10-fold premium for saving a bunch of electrons (and the earth).
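The ratios quoted above follow directly from the spec sheets (numbers from the text):

```python
# CPU: Core i7-2600K, GPU: GeForce Titan X.
cpu_cores, gpu_cores = 4, 3072
cpu_ghz, gpu_ghz = 4.2, 1.1
cpu_bw, gpu_bw = 21.0, 336.5        # memory bandwidth in GB/s

print(gpu_cores // cpu_cores)       # core-count ratio: 768
print(round(cpu_ghz / gpu_ghz, 1))  # clock advantage for the CPU: ~3.8, i.e. roughly 4:1
print(round(gpu_bw / cpu_bw))       # bandwidth advantage for the GPU: ~16
```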
It might be interesting to note that the tensorflow based alexnet benchmark is still 10-fold slower than the torch and neon deep learning kits.
- Source 0: Core i7-6900K (8 core, 3.2 GHz), 2x NVIDIA® GeForce® GTX 1080 Ti Founders Edition 11GB GDDR5X (1xHDMI, 3xDP), each 3584 CUDA Cores; total 7168 CUDA cores; 64 GByte RAM PC4-17000 2133MHz DDR4; 480GB Intel® SSD Pro 5400s; System price: $4500 Source
- Source 1 & 2: Titan X and Tesla K40c from tensorflow's alexnet_benchmark.py
- Source 3: 2 x GeForce GTX TITAN X; major: 5 minor: 2; memoryClockRate (GHz) 1.2155; Total memory: 12.00GiB; 3072 CUDA cores Source
- Source 4: GeForce GTX TITAN X; major: 5 minor: 2; memoryClockRate (GHz) 1.076; Total memory: 12.00GiB; 3072 CUDA cores Source
- Source 5: i7-2600K 4.2 GHz (4 cores, 8 threads) Source: own
- Source 6: Core i7-6900K (8 core, 3.2 GHz), 2x NVIDIA® GeForce® GTX 1080 Ti Founders Edition 11GB GDDR5X (1xHDMI, 3xDP), each 3584 CUDA Cores; total 7168 CUDA cores; 64 GByte RAM PC4-17000 2133MHz DDR4; 480GB Intel® SSD Pro 5400s; System price: $4500 Source
This is the mnist LeNet-5-like convolutional MNIST model example. It achieves a test error of 0.8% (lower is better) and a validation error of 0.9% (lower is better). The TensorFlow tutorials can be downloaded from here: TF models. Loading the data onto the GPU takes around 1 minute. Training for 10 epochs with a batch size of 64 on a GeForce GTX 1080 Ti takes an additional 54 seconds. Only one GPU is utilized by this example. When running TF mnist on an 8-core CPU only, the runtime increases to 7 minutes, so the GPU example is roughly 4 times faster (even though the per-batch time is 10-fold faster). TF CPU-only was not compiled for SSE4.1, SSE4.2, AVX, AVX2 or FMA.
The benchmark is invoked by calling:
```
$ python tutorials/image/mnist/convolutional.py
```
| Num | Platform | epoch time | run time (min) | Price $ |
|---|---|---|---|---|
| 1 | GeForce GTX 1080 Ti | 5.7 ms | 1:54 min | $800 |
| 2 | Core i7-6900K | 52.4 ms | 7:36 min | $900 |
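The roughly 4x GPU speedup follows directly from the run times in the table:

```python
# Convert the run times from the table to seconds and compare.
gpu_seconds = 1 * 60 + 54       # 1:54 min on the GTX 1080 Ti
cpu_seconds = 7 * 60 + 36       # 7:36 min on the i7-6900K
print(cpu_seconds / gpu_seconds)   # 4.0
```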
Source 1:
Core i7-6900K (8 core 3.2 Ghz), 2x NVIDIA® GeForce® GTX 1080 Ti Founders Edition 11GB GDDR5X (1xHDMI, 3xDP) each 3584 CUDA Cores; total 7168 CUDA cores; 64 GByte RAM PC4-17000 2133MHz DDR4; 480GB Intel® SSD Pro 5400s; System price: $4500
Source
Source 2:
Core i7-6900K (8 core 3.2 Ghz), 2x NVIDIA® GeForce® GTX 1080 Ti Founders Edition 11GB GDDR5X (1xHDMI, 3xDP) each 3584 CUDA Cores; total 7168 CUDA cores; 64 GByte RAM PC4-17000 2133MHz DDR4; 480GB Intel® SSD Pro 5400s; System price: $4500
Source
This is the implementation of the continuous bag-of-words (CBOW) and skip-gram architectures for computing vector representations of words. The TensorFlow GPU implementation is missing; other implementations in Theano or Keras may use GPU acceleration. Hence the current benchmark only runs on the CPU. For 15 epochs in the example, the accuracy is 36.5%.
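To illustrate what the skip-gram architecture trains on: each word predicts its neighbors within a context window, so the training data is a list of (center, context) pairs. A minimal sketch of that pair generation (function name and window size are my own choices, not taken from the TF example):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for skip-gram."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                       # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "quick", "brown", "fox"], window=1))
```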
| Num | Platform | run time (min) | Price $ |
|---|---|---|---|
| 1 | Core i7-6900K (8 core) | 23:17 min | $900 |
| 2 | Core i7-2600K (4 core) | 36:04 min | $300 |
LINKS:
ConvNet Benchmarks - benchmarking convolutional neural network implementations such as Nervana or caffe
TF performance@Quora - Why is the (first release) tensorflow performance poor?
Pick a DL algorithm - deep learning frameworks surveyed by VentureBeat
Samsung Veles - Benchmarks for Samsung Veles deep learning
Comparing DeepLearning Kits - Comparative Study of Caffe, Neon, Theano, and Torch for Deep Learning
AI gains - Looking back at performance gains in deep learning systems by Soumith Chintala from Facebook AI
BIDMach - Benchmarks from a CPU/GPU machine learning package
LambdaLabs - Benchmarks from RTX 2080 Ti, RTX 2080, GTX 1080 Ti, Titan V, Tesla V100 (Oct 2018) via LambdaLabs [XLS]