ACCELERATING THE END-TO-END DEEP LEARNING WORKFLOW
Deepshikha Kumari, Data Scientist II, Deep Learning
AGENDA
1. AI Use Cases for Industry
2. End-to-End Deep Learning Workflow
   Training Pipeline
   a. NGC
   b. Transfer Learning
   c. Automatic Mixed Precision
   d. Code walkthrough
   Inference Pipeline
   a. TensorRT (FP16)
   b. TensorRT (INT8)
   c. Custom plugin support
   d. DeepStream
INTELLIGENT VIDEO ANALYTICS (IVA) FOR EFFICIENCY AND SAFETY
Access Control | Public Transit | Industrial Inspection | Traffic Engineering
Retail Analytics | Logistics | Critical Infrastructure | Public Safety
DEEP LEARNING IN PRODUCTION
Speech Recognition
Recommender Systems
Autonomous Driving
Real-time Object Recognition
Robotics
Real-time Language Translation
Many More…
NGC
WHY CONTAINERS?
Benefits of containers:
• Simplify deployment of GPU-accelerated software, eliminating time-consuming software integration work
• Isolate individual deep learning frameworks and applications
• Share, collaborate, and test applications across different environments
VIRTUAL MACHINE vs. CONTAINER
Not so similar.
[Diagram: a VM stack (App → Bins/Libs → Guest OS → Hypervisor → Host Operating System → Server Infrastructure) alongside a container stack (App → Bins/Libs → Docker Engine → Host Operating System → Server Infrastructure); containers share the host OS rather than running a guest OS per application.]
NVIDIA CONTAINER RUNTIME
[Link]
• Colloquially called "nvidia-docker"
• Docker containers are hardware-agnostic and platform-agnostic
• NVIDIA GPUs are specialized hardware that require the NVIDIA driver
• Docker does not natively support NVIDIA GPUs with containers
• NVIDIA Container Runtime makes the images agnostic of the NVIDIA driver
Docker Terms
Definitions
Image
Docker images are the basis of containers. An Image is an ordered collection of root filesystem changes
and the corresponding execution parameters for use within a container runtime. An image typically
contains a union of layered filesystems stacked on top of each other. An image does not have state and it
never changes.
Container
A container is a runtime instance of a docker image.
A Docker container consists of
● A Docker image
● Execution environment
● A standard set of instructions
[Link]
PRUNING
1. Reduce model size and increase throughput
2. Incrementally retrain the model after pruning to recover accuracy
Prune (tlt-prune), then retrain (tlt-train)
Selecting Unnecessary Neurons
• Data-driven operations
• Non-data-driven operations
• Handling element-wise operations with multiple inputs
pruned_model = [Link](model, t)
A sketch of the selection idea follows below.
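As an illustration only (not the TLT implementation), here is a minimal NumPy sketch of magnitude-based filter selection: filters whose normalized L2 norm falls below a threshold t are marked as prunable. The layer shape and threshold are arbitrary placeholders.

import numpy as np

# Hypothetical conv layer weights: (num_filters, channels, kH, kW).
weights = np.random.randn(64, 32, 3, 3).astype(np.float32)

# Non-data-driven selection: rank filters by L2 norm and
# mark those below a threshold t as prunable.
t = 0.5
norms = np.sqrt((weights.reshape(weights.shape[0], -1) ** 2).sum(axis=1))
norms = norms / norms.max()          # normalize so t is scale-independent
keep = norms >= t                    # boolean mask of filters to retain

pruned_weights = weights[keep]       # smaller layer; retrain afterwards to recover accuracy
print(f"kept {keep.sum()} of {len(keep)} filters")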
SCENE ADAPTATION
• Same network adapting to different angles and vantage points
• Same network adapting to new data (e.g., camera location, vantage point, person with a blue shirt)
• Train with new data from another vantage point, camera location, or added attribute
TLT
TENSORFLOW
The Automatic Mixed Precision feature is available both in native TensorFlow and inside the TensorFlow container on the NVIDIA NGC container registry:
export TF_ENABLE_AUTO_MIXED_PRECISION=1
As an alternative, the environment variable can be set inside the TensorFlow Python script:
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
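A minimal sketch of the in-script approach, assuming TensorFlow 1.14+ or an NGC TensorFlow container; the toy model and random data are placeholders, only the environment variable line matters.

import os
# Must be set before the TensorFlow graph/session is created.
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'

import numpy as np
import tensorflow as tf

# Placeholder model and data; AMP rewrites eligible ops to FP16 automatically.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

x = np.random.rand(512, 784).astype(np.float32)
y = np.random.randint(0, 10, size=(512,))
model.fit(x, y, batch_size=128, epochs=1)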
PYTORCH
The Automatic Mixed Precision feature is available in the Apex repository on GitHub. To enable it, add these two lines of code to your existing training script:
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
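A minimal sketch of how those two lines slot into an existing loop, assuming NVIDIA Apex is installed and a CUDA GPU is available; the linear model, optimizer, and random data are placeholders.

import torch
from apex import amp  # NVIDIA Apex

# Placeholder model and optimizer.
model = torch.nn.Linear(784, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Wrap model and optimizer once, before the training loop.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for step in range(10):
    inputs = torch.randn(64, 784, device="cuda")
    targets = torch.randint(0, 10, (64,), device="cuda")

    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)

    # Scale the loss so FP16 gradients do not underflow.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()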
MXNET
The Automatic Mixed Precision feature is available both in native MXNet (1.5 or later) and inside the MXNet container (19.04 or later) on the NVIDIA NGC container registry. To enable the feature, add the following lines of code to your existing training script:
amp.init()
amp.init_trainer(trainer)
with amp.scale_loss(loss, trainer) as scaled_loss:
    autograd.backward(scaled_loss)
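A minimal sketch of those lines in context, assuming MXNet 1.5+ with mxnet.contrib.amp; the Gluon network, trainer settings, and random data are placeholders.

import mxnet as mx
from mxnet import autograd, gluon
from mxnet.contrib import amp

# Enable AMP before building the network and trainer.
amp.init()

ctx = mx.gpu() if mx.context.num_gpus() > 0 else mx.cpu()
net = gluon.nn.Dense(10)
net.initialize(ctx=ctx)

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})
amp.init_trainer(trainer)  # patch the trainer for dynamic loss scaling

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
data = mx.nd.random.uniform(shape=(64, 784), ctx=ctx)
label = mx.nd.random.randint(0, 10, shape=(64,), ctx=ctx).astype('float32')

with autograd.record():
    loss = loss_fn(net(data), label)
    with amp.scale_loss(loss, trainer) as scaled_loss:
        autograd.backward(scaled_loss)
trainer.step(data.shape[0])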
AUTOMATIC MIXED PRECISION IN TENSORFLOW
Up to 3X speedup
TensorFlow Medium post: Automatic Mixed Precision in TensorFlow for Faster AI Training on NVIDIA GPUs
All models can be found at [Link], except for ssd-rn50-fpn-640, which is here: [Link]. All performance collected on 1xV100-16GB, except bert-squadqa on 1xV100-32GB.
Speedup is the ratio of time to train for a fixed number of epochs in single precision and with Automatic Mixed Precision. The number of epochs for each model matched the literature or common practice (it was also confirmed that both training sessions achieved the same model accuracy).
Batch sizes: rn50 (v1.5): 128 for FP32, 256 for AMP+XLA; ssd-rn50-fpn-640: 8 for FP32, 16 for AMP+XLA; NCF: 1M for FP32 and AMP+XLA; bert-squadqa: 4 for FP32, 10 for AMP+XLA; GNMT: 128 for FP32, 192 for AMP.
AUTOMATIC MIXED PRECISION IN PYTORCH
[Link]
● Plot shows ResNet-50 results with and without Automatic Mixed Precision (AMP): roughly 2X higher throughput with AMP enabled (mixed precision) than with FP32.
● More AMP-enabled model scripts coming soon: Mask R-CNN, GNMT, NCF, etc.
Source: [Link]
AUTOMATIC MIXED PRECISION IN MXNET
AMP speedup of ~1.5X to 2X in comparison with FP32
[Link]
NVIDIA TENSORRT
Programmable Inference Accelerator
[Diagram: trained models from deep learning frameworks feed the TensorRT Optimizer and Runtime, which target GPU platforms including Tesla V100, Tesla P4, Jetson TX2, DRIVE PX 2, and the NVIDIA DLA.]
[Link]/tensorrt
TENSORRT PERFORMANCE
40x faster CNNs on V100 vs. CPU-only, under 7 ms latency (ResNet-50); 140x faster language translation RNNs on V100 vs. CPU-only inference (OpenNMT).
[Charts: inference throughput (images/sec) and latency (ms) on ResNet-50 for CPU-only, V100 + TensorFlow, and V100 + TensorRT; inference throughput (sentences/sec) and latency (ms) on OpenNMT for CPU-only + Torch, V100 + Torch, and V100 + TensorRT.]
Inference throughput (images/sec) on ResNet-50. V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 [email protected] 3.5GHz Turbo (Broadwell) HT On. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, E5-2690 [email protected] 3.5GHz Turbo (Broadwell) HT On. CPU-Only: Intel Xeon-D 1587 Broadwell-E CPU and Intel DL SDK. Score doubled to comprehend Intel's stated claim of 2x performance improvement on Skylake with AVX512.
Inference throughput (sentences/sec) on OpenNMT 692M. V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 [email protected] 3.5GHz Turbo (Broadwell) HT On. V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB, E5-2690 [email protected] 3.5GHz Turbo (Broadwell) HT On. CPU-Only: Torch (FP32), batch size 1, Intel E5-2690 [email protected] 3.5GHz Turbo (Broadwell) HT On.
[Link]/tensorrt
TENSORRT DEPLOYMENT WORKFLOW
Step 1: Optimize trained model
Trained neural network → Import Model → TensorRT Optimizer → Serialize Engine → Optimized plans (Plan 1, Plan 2, Plan 3)
Step 2: Deploy optimized plans with runtime
Optimized plans → De-serialize Engine → TensorRT Runtime Engine → Deploy (data center, automotive, embedded)
MODEL IMPORTING
For AI researchers and data scientists
Example: importing a TensorFlow model
• Model Importer (Python/C++ API) for TensorFlow and other frameworks
• Network Definition API (Python/C++ API)
• Runtime inference through the C++ or Python API
[Link]/tensorrt
TENSORRT OPTIMIZATIONS
➢ Optimizations are completely automatic
➢ Performed with a single function call
• Layer & Tensor Fusion
• Weights & Activation Precision Calibration
• Kernel Auto-Tuning
• Dynamic Tensor Memory
LAYER & TENSOR FUSION
Un-optimized network vs. TensorRT-optimized network:
• Vertical fusion (e.g., conv + bias + ReLU fused into a single CBR kernel)
• Horizontal fusion (e.g., 1x1 CBR layers that share an input fused together)
• Layer elimination (e.g., concat)

Network       Layers before  Layers after
VGG19         43             27
Inception V3  309            113
ResNet-152    670            159
KERNEL AUTO-TUNING & DYNAMIC TENSOR MEMORY
Kernel Auto-Tuning
• 100s of specialized kernels
• Optimized for every GPU platform (Tesla V100, Jetson TX2, DRIVE PX2)
• Multiple parameters: batch size, input dimensions, filter dimensions, ...
Dynamic Tensor Memory
• Reduces memory footprint and improves memory re-use
• Manages memory allocation for each tensor only for the duration of its usage
EXAMPLE: DEPLOYING TENSORFLOW MODELS WITH TENSORRT
Deployment and Inference
Import, optimize, and deploy TensorFlow models using the TensorRT Python API.
Steps:
• Start with a frozen TensorFlow model
• Create a model parser
• Optimize the model and create a runtime engine
• Perform inference on new data using the optimized runtime engine
A minimal example follows below.
[Link]/tensorrt
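A minimal sketch of this flow using the TF-TRT converter API as shipped with TensorFlow 1.14/1.15-era releases; the SavedModel directories, precision mode, and batch size are placeholders, and the module path differs in TensorFlow 2.x.

from tensorflow.python.compiler.tensorrt import trt_convert as trt

input_saved_model_dir = "resnet50_saved_model"       # placeholder: trained TF model
output_saved_model_dir = "resnet50_saved_model_trt"  # placeholder: optimized output

# Build a TF-TRT converter: TensorRT-compatible subgraphs are replaced by
# TRT engine ops; unsupported ops fall back to native TensorFlow.
converter = trt.TrtGraphConverter(
    input_saved_model_dir=input_saved_model_dir,
    precision_mode="FP16",
    max_batch_size=8)
converter.convert()                       # optimize the graph
converter.save(output_saved_model_dir)    # write the optimized SavedModel

# The optimized SavedModel can then be loaded and run like any TF model
# to perform inference on new data.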
7 STEPS TO DEPLOYMENT WITH TENSORRT
Step 1: Convert the trained model into a TensorRT-compatible format
Step 2: Create a model parser
Step 3: Register inputs and outputs
Step 4: Optimize the model and create a runtime engine
Step 5: Serialize the optimized engine
Step 6: De-serialize the engine
Step 7: Perform inference
A minimal Python walkthrough follows below.
[Link]/tensorrt
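A minimal sketch of these steps with the TensorRT Python API, assuming a TensorRT 5/6-era install and an ONNX model file; "model.onnx", the batch size, and the workspace size are placeholders, and newer TensorRT versions use explicit-batch networks and builder configs instead of the flags shown here.

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Steps 1-3: parse the trained model (here ONNX); the parser registers inputs/outputs.
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()
parser = trt.OnnxParser(network, TRT_LOGGER)
with open("model.onnx", "rb") as f:           # placeholder model file
    if not parser.parse(f.read()):
        raise RuntimeError("ONNX parsing failed")

# Step 4: optimize the model and create a runtime engine.
builder.max_batch_size = 8
builder.max_workspace_size = 1 << 30          # 1 GB scratch space for kernel tactics
engine = builder.build_cuda_engine(network)

# Step 5: serialize the optimized engine to a plan file.
with open("model.plan", "wb") as f:
    f.write(engine.serialize())

# Steps 6-7: de-serialize the plan and create an execution context for inference.
runtime = trt.Runtime(TRT_LOGGER)
with open("model.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
# context.execute(batch_size, bindings) then runs inference on device buffers.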
TensorRT Inference with TensorFlow
TensorFlow: an end-to-end open source machine learning platform
● Powerful platform for research and experimentation
● Versatile, easy model building
● Robust ML production anywhere
● Most popular ML project on GitHub
● 41M downloads
NVIDIA TensorRT: platform for high-performance deep learning inference
● Optimize and deploy neural networks in production environments
● Maximize throughput for latency-critical apps with the optimizer and runtime
● Deploy responsive and memory-efficient apps with INT8 & FP16
● 300K downloads in 2018
TF-TRT = TF + TRT
TensorRT Inference with TensorFlow
AGENDA
● Benefits of using TF-TRT
● How to use it
● Customer experience: Clarifai
● How TF-TRT works
● Additional resources
Benefits of using TF-TRT
● Optimize TF inference while still using the TF ecosystem
● Simple API: up to 8x performance gain with little effort
● Fallback support: non-optimized layers that TensorRT does not support run in native TensorFlow
● Over 10 optimized models with published examples
● Performance optimizations coming soon: more NLP and object detection models

Model              TF FP32 (imgs/s)   TF-TRT INT8 (imgs/s)   Speedup
ResNet-50          399                3053                   7.7x
Inception V4       158                1128                   7.1x
MobileNet V1       1203               4975                   4.1x
NASNet Large       43                 162                    3.8x
VGG16              245                1568                   6.4x
SSD MobileNet V2   102                411                    4.0x
SSD Inception V2   82                 327                    4.0x

TensorFlow FP32 vs. TensorFlow-TensorRT INT8 on T4, largest possible batch size, no I/O.
NGC TensorFlow 19.07 with scripts: [Link]
FP16 accuracy
Models TF FP32 TF-TRT FP16
Mobilenet V2 74.08 74.07
NASNet Mobile 73.97 73.87
ResNet 50 V1.5 76.51 76.48
ResNet 50 V2 76.43 76.40
VGG 16 70.89 70.91
Inception V3 77.99 77.97
SSD Mobilenet v1 23.06 23.07
FP16 accuracy is within 0.1% of FP32 accuracy
Top-1 metric (%) for classification models. mAP for SSD detection models.
Complete data: [Link]
INT8 accuracy
Models TF FP32 TF-TRT INT8
Mobilenet V2 74.08 73.90
NASNet Mobile 73.97 73.55
ResNet 50 V1.5 76.51 76.23
ResNet 50 V2 76.43 76.30
VGG 16 70.89 70.78
Inception V3 77.99 77.85
INT8 accuracy is within 0.2% of FP32 accuracy except for NASNet Mobile within 0.5%.
Top-1 metric (%) for classification models.
Complete data: [Link]
TensorRT ONNX PARSER
Parser to import ONNX models into TensorRT
• Optimize and deploy models from ONNX-supported frameworks in production
• Apply TensorRT optimizations to any ONNX framework (Caffe2, Chainer, Microsoft Cognitive Toolkit, MXNet, PyTorch)
• C++ and Python APIs to import ONNX models
• New samples demonstrating the step-by-step process to get started
[Link]/tensorrt
INEFFICIENCY LIMITS INNOVATION
Difficulties with deploying data center inference:
• Single model only: some systems (ASR, NLP, recommenders) are overused while others are underutilized
• Single framework only: solutions can only support models from one framework
• Custom development: developers need to reinvent the plumbing for every application
NVIDIA TENSORRT INFERENCE SERVER
Production Data Center Inference Server
• Maximize real-time inference performance of GPUs
• Quickly deploy and manage multiple models per GPU per node
• Easily scale to heterogeneous GPUs and multi-GPU nodes (NVIDIA T4, Tesla V100, Tesla P4)
• Integrates with orchestration systems and auto-scalers via latency and health metrics
• Now open source for thorough customization and integration
FEATURES
Concurrent Model Execution
Multiple models (or multiple instances of the same model) may execute on the GPU simultaneously.
CPU Model Inference Execution
Framework-native models can execute inference requests on the CPU.
Metrics
Utilization, count, memory, and latency.
Custom Backend
A custom backend gives the user more flexibility by providing their own implementation of an execution engine through the use of a shared library.
Model Ensemble
A pipeline of one or more models and the connection of input and output tensors between those models (can be used with a custom backend).
Dynamic Batching
Inference requests can be batched up by the inference server to 1) the model-allowed maximum or 2) the user-defined latency SLA.
Multiple Model Format Support
PyTorch JIT (.pt), TensorFlow GraphDef/SavedModel, TensorFlow and TensorRT GraphDef, ONNX graph (ONNX Runtime), TensorRT plans, Caffe2 NetDef (ONNX import path).
CMake Build
Build the inference server from source, making it more portable to multiple OSes and removing the build dependency on Docker.
Streaming API
Built-in support for audio streaming input, e.g. for speech recognition.
A model configuration sketch follows below.
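A minimal sketch of how concurrent model instances and dynamic batching might be declared in a model's config.pbtxt, generated here from Python to keep the example self-contained; the model name, tensor names, and dimensions are illustrative placeholders rather than values from the slides.

import os

# Hypothetical model configuration for the inference server's model repository.
config_pbtxt = """
name: "resnet50_trt"
platform: "tensorrt_plan"
max_batch_size: 32
input [
  { name: "input", data_type: TYPE_FP32, dims: [ 3, 224, 224 ] }
]
output [
  { name: "probabilities", data_type: TYPE_FP32, dims: [ 1000 ] }
]
# Two copies of the model execute concurrently on each GPU.
instance_group [
  { count: 2, kind: KIND_GPU }
]
# Requests are batched up to the preferred sizes or until the queueing delay expires.
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 100
}
"""

os.makedirs("model_repository/resnet50_trt/1", exist_ok=True)
with open("model_repository/resnet50_trt/config.pbtxt", "w") as f:
    f.write(config_pbtxt)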
INFERENCE SERVER ARCHITECTURE
Available with monthly updates
Models supported:
● TensorFlow GraphDef/SavedModel
● TensorFlow and TensorRT GraphDef
● TensorRT plans
● Caffe2 NetDef (ONNX import)
● ONNX graph
● PyTorch JIT (.pt)
Multi-GPU support
Concurrent model execution
Server HTTP REST API/gRPC
Python/C++ client libraries
Additional resources
- GTC technical presentation: [Link]
- TF-TRT user guide: [Link]
- NVIDIA DLI course on TF-TRT: [Link]
- Monthly release notes: [Link]
- Google blog on TF-TRT inference: [Link] (running-tensorflow-inference-workloads-at-scale-with-tensorrt-5-and-nvidia-t4-gpus)
- NVIDIA Developer Blog: [Link]
AGENDA
1. Intelligent Video Analytics
2. DeepStream SDK
   a. What is the DeepStream SDK?
   b. Why the DeepStream SDK?
   c. What's new in DS 4.0?
   d. DeepStream building blocks
3. Getting started with the DeepStream SDK
   a. Where to start?
   b. Directory hierarchy
   c. Configuration file and pipeline details
   d. Running the application
4. Building with the DeepStream SDK
   a. Real-world use cases with demos
   b. Resources
INTELLIGENT VIDEO ANALYTICS (IVA) FOR EFFICIENCY AND SAFETY
Access Control | Public Transit | Industrial Inspection | Traffic Engineering
Retail Analytics | Logistics | Critical Infrastructure | Public Safety
WHAT IS DEEPSTREAM?
[Stack diagram: Applications and Services sit on top of the DEEPSTREAM SDK (hardware-accelerated plugins, Docker containers, reference applications and analytics recipes, IoT runtime, Kubernetes-on-GPU orchestration), which builds on CUDA-X (NVIDIA containers, RT, CUDA, Multimedia, TensorRT) and the NVIDIA computing platform from edge to cloud (Jetson | Tesla).]
WHY DEEPSTREAM?
The most comprehensive end-to-end development platform for IVA.
• Broader use cases and industries: build your own application for smart cities, retail analytics, industrial inspection, logistics, and more.
• Faster time to market: ready-to-use building blocks and IP simplify building your innovative product.
• Performance driven: low latency and exceptional performance optimized for NVIDIA GPUs for real-time edge analytics.
• Cloud integration: push-button IoT solution integration to build applications and services with cloud service providers.
• Faster time to progress: iterate and integrate through quick plug-and-play of popular pre-packaged plugins, or build your own.
DEEPSTREAM SDK
Plugins (built with open source, 3rd-party, and NVIDIA components): DNN inference/TensorRT plugins, communications plugins, video/image capture and processing plugins, 3rd-party library plugins.
Analytics (multi-camera, multi-sensor framework): DeepStream in containers, multi-GPU orchestration; tracking and analytics across large-scale, multi-camera deployments; streaming and batch analytics; event fabric.
Development tools: end-to-end reference applications, app building/configuration tools, end-to-end orchestration recipes and adaptation guides, plugin templates, custom IP integration.
DeepStream SDK foundation: TensorRT, Multimedia APIs/Video Codec SDK, imaging and dewarping library, metadata and messaging, multi-camera tracking library, NV containers, message bus clients.
Runs on Linux and CUDA. Perception infrastructure: Jetson, Tesla servers (edge and cloud). Analytics infrastructure: edge server, NGC, AWS, Azure.
REAL-TIME INSIGHTS, HIGHEST STREAM DENSITY
[Diagram: pixels to information to dashboard — the NVIDIA Metropolis Application Framework with the NVIDIA Edge Stack on an NVIDIA EGX server handles perception, with analytics, visualization, and cloud monitoring in NGC / any cloud.]
68 streams of 1080p per T4
SMART PARKING: PERCEPTION GRAPH
Pipeline for 360° feeds: RTSP → Decoder → Dewarp library (camera calibration, ROI calibration, dewarping) → Detection and classification (ROI: lines) → Tracker → Detection and classification (ROI: polygon) → Global positioning → Transmit metadata → Analytics server
Plugin groups: communications plugin, preprocessing plugins, detection, classification and tracking plugins, communications plugins
VIDEO: INTELLIGENT TRAFFIC SYSTEM
Perception → Analytics → Visualization
WAREHOUSE LOGISTICS: INVENTORY SORTING
Use case: detect and flag packages on a conveyor belt.
Solution: a DeepStream container on an NVIDIA IoT edge device connects to Azure IoT Central through the Azure IoT Edge runtime, sending telemetry data to business logic services.
THANK YOU!
QUESTIONS?