INTRODUCTION TO
DEEP LEARNING WITH GPUS
July 2015
AGENDA
1 What is Deep Learning?
2 Deep Learning software
3 Deep Learning deployment
What is Deep Learning?
DEEP LEARNING & AI
Deep Learning has become the most popular approach to developing Artificial Intelligence (AI): machines that perceive and understand the world.
The focus is currently on specific perceptual tasks, and there are many successes.
Today, some of the world's largest internet companies, as well as the foremost research institutions, are using GPUs for deep learning in research and production.
[Figure: CUDA for Deep Learning]
PRACTICAL DEEP LEARNING EXAMPLES
Image Classification, Object Detection, Localization,
Action Recognition, Scene Understanding
Speech Recognition, Speech Translation,
Natural Language Processing
Pedestrian Detection, Traffic Sign Recognition
Breast Cancer Cell Mitosis Detection,
Volumetric Brain Image Segmentation
TRADITIONAL MACHINE PERCEPTION
HAND-TUNED FEATURES
Pipeline: Raw data -> Feature extraction -> Classifier/detector -> Result
Classifier/detector choices and the results they produce:
- SVM, shallow neural net
- HMM, shallow neural net: Speaker ID, speech transcription
- Clustering, HMM, LDA, LSA: Topic classification, machine translation, sentiment analysis
DEEP LEARNING APPROACH
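Train: [Figure: example images labeled Dog, Cat, Raccoon, and Honey badger pass through the network; the errors in its predictions are fed back to adjust it]
Deploy: [Figure: the trained network labels a new image "Dog"]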
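The train/deploy split above can be sketched in a few lines of Python. This is a minimal illustration only: a single-layer perceptron stands in for the deep network, and the feature values, labels, and function names are hypothetical rather than taken from the slides.

```python
# Minimal sketch: a single-layer perceptron stands in for a deep network.
# All data and names here are hypothetical illustrations.

def predict(weights, bias, features):
    """Deploy step: return 1 ("dog") or 0 ("cat") for one feature vector."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def train(samples, labels, epochs=20, lr=0.1):
    """Train step: nudge the weights whenever a prediction is wrong."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            error = label - predict(weights, bias, features)  # 0 when correct
            weights = [w + lr * error * x for w, x in zip(weights, features)]
            bias += lr * error
    return weights, bias


# Toy, linearly separable data: two made-up features per image.
samples = [[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]]
labels = [1, 1, 0, 0]

weights, bias = train(samples, labels)             # training uses the errors
print([predict(weights, bias, s) for s in samples])
print(predict(weights, bias, [3.0, 0.5]))          # deployment on a new input
```

Training needs labels and an error signal; deployment only runs the forward computation with the learned weights.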
SOME DEEP LEARNING USE CASES
Jeff Dean, Google, GTC 2015
ARTIFICIAL NEURAL NETWORK (ANN)
A collection of simple, trainable mathematical units that
collectively learn complex functions
[Figure: a biological neuron beside an artificial neuron with inputs x1, x2, x3, weights w1, w2, w3, and output y. From Stanford cs231n lecture notes]
$y = F(w_1 x_1 + w_2 x_2 + w_3 x_3)$
$F(x) = \max(0, x)$
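The two formulas above translate directly into code. The sketch below is a plain-Python illustration; the function names and the example weights are assumptions chosen for the demonstration.

```python
def relu(x):
    """F(x) = max(0, x)."""
    return max(0.0, x)


def neuron(inputs, weights):
    """y = F(w1*x1 + w2*x2 + w3*x3), for any number of inputs."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return relu(weighted_sum)


x = [1.0, 2.0, 3.0]
print(neuron(x, [0.5, -1.0, 1.0]))   # 0.5 - 2.0 + 3.0 = 1.5
print(neuron(x, [0.5, -1.0, 0.25]))  # 0.5 - 2.0 + 0.75 = -0.75, clipped to 0.0
```

Training a network amounts to adjusting the weights so that outputs like these move toward the desired values.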
ARTIFICIAL NEURAL NETWORK (ANN)
A collection of simple, trainable mathematical units that
collectively learn complex functions
[Figure: a network with an input layer, hidden layers, and an output layer]
Given sufficient training data, an artificial neural network can approximate very complex functions mapping raw data to output decisions.
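As a rough illustration of how layers compose, the sketch below pushes one input through a hidden layer and an output layer. The layer sizes and weight values are made up for the example; a real network would learn them from training data.

```python
def relu(x):
    return max(0.0, x)


def identity(x):
    return x


def dense(inputs, weights, biases, activation):
    """One layer: each row of `weights` holds the weights of one unit."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Feed the output of each layer into the next one."""
    for weights, biases, activation in layers:
        inputs = dense(inputs, weights, biases, activation)
    return inputs


# Hypothetical network: 2 inputs -> 3 hidden units (ReLU) -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]], [0.0, 0.0, 0.0], relu),
    ([[1.0, 2.0, 1.0]], [0.5], identity),
]

print(forward([2.0, 1.0], layers))  # hidden = [1.0, 1.5, 0.0], output = [4.5]
```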
DEEP NEURAL NETWORK (DNN)
Input -> Raw data -> Low-level features -> Mid-level features -> High-level features -> Result
Application components:
- Task objective: e.g. identify face
- Training data: 10-100M images
- Network architecture: ~10 layers, 1B parameters
- Learning algorithm: ~30 Exaflops, ~30 GPU days
DEEP LEARNING ADVANTAGES
Robust
- No need to design the features ahead of time; features are automatically learned to be optimal for the task at hand
- Robustness to natural variations in the data is automatically learned
Generalizable
- The same neural net approach can be used for many different applications and data types
Scalable
- Performance improves with more data; the method is massively parallelizable
CONVOLUTIONAL NEURAL NETWORK (CNN)
Inspired by the human visual cortex
Learns a hierarchy of visual features
Local pixel-level features are scale and translation invariant
Learns the essence of visual objects and generalizes well
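The translation point can be illustrated with a small sketch: one filter is slid over the whole image, so it responds to a local pattern wherever that pattern appears, and taking the maximum response removes the dependence on position. The kernel, the pattern, and the helper names below are hypothetical, and the example does not address scale invariance.

```python
def conv2d(image, kernel):
    """'Valid' sliding-window filter (cross-correlation, as used in CNNs)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def global_max(feature_map):
    """Strongest filter response anywhere in the feature map."""
    return max(max(row) for row in feature_map)


def place(pattern, size, top, left):
    """Return a size x size zero image with `pattern` pasted at (top, left)."""
    image = [[0] * size for _ in range(size)]
    for di, row in enumerate(pattern):
        for dj, value in enumerate(row):
            image[top + di][left + dj] = value
    return image


pattern = [[1, 0], [0, 1]]    # a small diagonal stroke
kernel = [[1, -1], [-1, 1]]   # hand-written detector for that stroke

image_a = place(pattern, 5, 0, 0)  # stroke in the top-left corner
image_b = place(pattern, 5, 2, 3)  # same stroke, shifted down and right

print(conv2d(image_a, kernel)[0][0], conv2d(image_b, kernel)[2][3])
print(global_max(conv2d(image_a, kernel)), global_max(conv2d(image_b, kernel)))
```

The peak response moves with the stroke, while the pooled maximum stays the same. In a trained CNN the kernels are learned rather than written by hand.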
RECURRENT NEURAL NETWORK (RNN)
DNNS DOMINATE IN PERCEPTUAL TASKS
Slide credit: Yann LeCun, Facebook & NYU
WHY IS DEEP LEARNING HOT NOW?
Three Driving Factors
Big Data Availability
New DL Techniques
GPU acceleration
350 million images uploaded per day
2.5 petabytes of customer data hourly
100 hours of video uploaded every minute
GPUs and Deep Learning
GPUs THE PLATFORM FOR DEEP LEARNING
Image Recognition Challenge
1.2M training images, 1000 object categories
[Chart: GPU entries per year, 2010-2014, on a 0-120 axis; data labels 60 and 110]
[Figure: example images labeled person, car, bird, helmet, frog, motorcycle, dog, chair, hammer, flower pot, power drill]
GPU-ACCELERATED DEEP LEARNING
GPUS MAKE DEEP LEARNING ACCESSIBLE
"Now You Can Build Google's $1M Artificial Brain on the Cheap"
Deep learning with COTS HPC systems. A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro. ICML 2013.

              GOOGLE DATACENTER          STANFORD AI LAB
Servers       1,000 CPU servers          3 GPU-accelerated servers
Processors    2,000 CPUs, 16,000 cores   12 GPUs, 18,432 cores
Power         600 kWatts                 4 kWatts
Cost          $5,000,000                 $33,000
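To put the comparison above into ratios, the short sketch below divides the Google datacenter figures by the Stanford AI Lab figures. The numbers are copied from the slide; the dictionary layout and key names are just one convenient, hypothetical way to hold them.

```python
# Figures as quoted on the slide.
google = {"servers": 1000, "cores": 16000, "power_kw": 600, "cost_usd": 5_000_000}
stanford = {"servers": 3, "cores": 18432, "power_kw": 4, "cost_usd": 33_000}

# Ratio of the CPU system to the GPU system for each quantity.
ratios = {key: google[key] / stanford[key] for key in google}

for key, value in ratios.items():
    print(f"{key}: {value:.1f}x")
```

On these figures the GPU system draws 150 times less power and costs roughly 150 times less, with a comparable core count.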
WHY ARE GPUs GOOD FOR DEEP LEARNING?
Neural Networks and GPUs:
- Inherently parallel
- Matrix operations
- FLOPS
- Bandwidth
GPUs deliver:
- same or better prediction accuracy
- faster results
- smaller footprint
- lower power
- lower cost
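One way to see the "matrix operations" and "inherently parallel" points: applying a fully-connected layer to a batch of inputs is a single matrix product in which every output element can be computed independently. The sketch below is plain Python for clarity, not GPU code, and the sizes and values are hypothetical.

```python
def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix.

    Each output element depends only on one row of `a` and one column of `b`,
    so all m * n elements could be computed at the same time.
    """
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def matmul_flops(m, k, n):
    """Multiply-add count: two floating-point operations per inner step."""
    return 2 * m * k * n


batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]             # 3 samples x 2 features
weights = [[1.0, 0.0, -1.0, 0.5], [0.5, 1.0, 1.0, 0.0]]  # 2 features x 4 units

print(matmul(batch, weights))
print(matmul_flops(3, 2, 4))           # 48 operations for this toy layer
print(matmul_flops(256, 4096, 4096))   # a larger, hypothetical layer and batch
```

The operation count grows with the product of the three dimensions, which is why both arithmetic throughput (FLOPS) and memory bandwidth matter.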
GPU ACCELERATION
Training A Deep, Convolutional Neural Network
Model:
- ILSVRC12 winning model: SuperVision
- 7 layers: 5 convolutional layers + 2 fully-connected
- ReLU, pooling, drop-out, response normalization
- Implemented with Caffe
- Training time is for 20 iterations
Hardware and libraries:
- Dual 10-core Ivy Bridge CPUs
- 1 Tesla K40 GPU
- CPU times utilized Intel MKL BLAS library
- GPU acceleration from CUDA matrix libraries (cuBLAS)

Batch Size    Training Time (CPU)   Training Time (GPU)   GPU Speed Up
64 images     64 s                  7.5 s                 8.5X
128 images    124 s                 14.5 s                8.5X
256 images    257 s                 28.5 s                9.0X
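The speed-up column can be recomputed from the two timing columns, as in the sketch below. The timings are copied from the table; the throughput figure assumes each of the 20 iterations processes exactly one batch, which the slide does not state explicitly.

```python
ITERATIONS = 20  # each timing covers 20 training iterations

# (batch size, CPU seconds, GPU seconds), copied from the table.
measurements = [(64, 64.0, 7.5), (128, 124.0, 14.5), (256, 257.0, 28.5)]


def speedup(cpu_seconds, gpu_seconds):
    return cpu_seconds / gpu_seconds


def images_per_second(batch_size, seconds, iterations=ITERATIONS):
    # Assumption: one batch is processed per iteration.
    return batch_size * iterations / seconds


for batch_size, cpu_s, gpu_s in measurements:
    print(
        f"{batch_size:>3} images: {speedup(cpu_s, gpu_s):.2f}x speed-up, "
        f"GPU throughput {images_per_second(batch_size, gpu_s):.0f} images/s"
    )
```

The recomputed ratios (about 8.53, 8.55, and 9.02) agree with the table to within rounding.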
DL software landscape
HOW TO WRITE APPLICATIONS USING DL
END USER APPLICATIONS: Speech Understanding, Image Analysis, Language Processing
Deep Learning Frameworks (industry-standard or research frameworks)
Libraries (key compute-intensive, commonly used building blocks)
System Software (drivers)
Hardware (which can accelerate DL building blocks)
HOW NVIDIA IS HELPING DL STACK
END USER APPLICATIONS: Speech Understanding, Image Analysis, Language Processing
DIGITS
Deep Learning Frameworks (industry-standard or research frameworks): GPU-accelerated DL frameworks (Caffe, Torch, Theano)
Libraries (key compute-intensive, commonly used building blocks): performance libraries (cuDNN, cuBLAS), highly optimized
System Software (drivers): CUDA, best parallel programming toolkit
Hardware (which can accelerate DL building blocks): GPU, world's best DL hardware
GPU-ACCELERATED
DEEP LEARNING FRAMEWORKS
                 CAFFE                  TORCH                  THEANO                 KALDI
Domain           Deep Learning          Scientific Computing   Math Expression        Speech Recognition
                 Framework              Framework              Compiler               Toolkit
cuDNN            2.0                    2.0                    2.0                    --
Multi-GPU        via DIGITS 2           In Progress            In Progress            (nnet2)
Multi-CPU                                                                             (nnet2)
License          BSD-2                  GPL                    BSD                    Apache 2.0
Interface(s)     Command line,          Lua, Python,           Python                 C++, Shell scripts
                 Python, MATLAB         MATLAB
Embedded (TK1)
http://developer.nvidia.com/deeplearning
All three frameworks covered in the associated Intro to DL hands-on lab
CUDNN V2 - PERFORMANCE
v3 coming soon
CPU is 16 core Haswell E5-2698 at 2.3 GHz, with 3.6 GHz Turbo
GPU is NVIDIA Titan X
HOW GPU ACCELERATION WORKS
Application Code
Compute-intensive functions (5% of code, ~80% of run-time) -> GPU
Rest of sequential CPU code -> CPU
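A rough way to reason about the split above is Amdahl's law: if about 80% of the run-time is moved to the GPU and sped up by some factor, the remaining 20% still runs at CPU speed and limits the overall gain. The slide does not name Amdahl's law, and the 10x kernel speed-up below is a hypothetical value chosen only for illustration.

```python
def overall_speedup(accelerated_fraction, kernel_speedup):
    """Amdahl's law: whole-application speed-up when only part is accelerated."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / kernel_speedup)


# ~80% of run-time offloaded (from the slide); 10x on that part is an assumption.
print(round(overall_speedup(0.8, 10.0), 2))           # about 3.57
# Upper bound when the offloaded part takes no time at all.
print(round(overall_speedup(0.8, float("inf")), 2))   # about 5.0
```

This is why the offloaded functions need to cover as much of the run-time as possible.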
CUDNN ROUTINES
Convolutions - 80-90% of the execution time
Pooling - Spatial smoothing
Activations - Pointwise non-linear function
https://developer.nvidia.com/cudnn
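To make the pooling and activation routines listed above concrete, the sketch below applies a pointwise ReLU and a 2x2 max pooling to a small feature map. This is plain Python for illustration only; it does not call cuDNN, and the function names and input values are hypothetical.

```python
def relu_map(feature_map):
    """Activation: apply a pointwise non-linear function to every element."""
    return [[max(0, value) for value in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Pooling: keep the largest value in each non-overlapping 2x2 window."""
    return [
        [
            max(feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1])
            for j in range(0, len(feature_map[0]) - 1, 2)
        ]
        for i in range(0, len(feature_map) - 1, 2)
    ]


feature_map = [
    [1, -2, 3, 0],
    [-1, 5, -3, 2],
    [0, -4, -1, -2],
    [2, 1, -5, 4],
]

activated = relu_map(feature_map)
print(activated)                # negatives clipped to 0
print(max_pool_2x2(activated))  # [[5, 3], [2, 4]]
```

Pooling halves each spatial dimension here, which reduces the work for the layers that follow.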
DIGITS
Interactive Deep Learning GPU Training System
Data Scientists & Researchers:
- Quickly design the best deep neural network (DNN) for your data
- Visually monitor DNN training quality in real time
- Manage training of many DNNs in parallel on multi-GPU systems
DIGITS 2 - Accelerate training of a single DNN using multiple GPUs
https://developer.nvidia.com/digits
DL deployment
DEEP LEARNING DEPLOYMENT WORKFLOW
DEEP LEARNING LAB SERIES SCHEDULE
7/22 Class #1 - Introduction to Deep Learning
7/29 Office Hours for Class #1
8/5 Class #2 - Getting Started with DIGITS interactive training system for image classification
8/12 Office Hours for Class #2
8/19 Class #3 - Getting Started with the Caffe Framework
8/26 Office Hours for Class #3
9/2 Class #4 - Getting Started with the Theano Framework
9/9 Office Hours for Class #4
9/16 Class #5 - Getting Started with the Torch Framework
9/23 Office Hours for Class #5
More information available at developer.nvidia.com/deep-learning-courses
HANDS-ON LAB
1. Create an account at nvidia.qwiklab.com
2. Go to Introduction to Deep Learning lab at bit.ly/dlnvlab1
3. Start the lab and enjoy!
Only requires a supported browser, no NVIDIA GPU necessary!
Lab is free until end of Deep Learning Lab series