ACCELERATING THE END-TO-END DEEP LEARNING WORKFLOW
Deepshikha Kumari, Data Scientist II, Deep Learning
AGENDA
1. AI Use Cases for Industry
2. End-to-End Deep Learning Workflow
   Training Pipeline
   a. NGC
   b. Transfer Learning
   c. Automatic Mixed Precision
   d. Code walkthrough
   Inference Pipeline
   a. TensorRT (FP16)
   b. TensorRT (INT8)
   c. Custom plugin support
   d. DeepStream
INTELLIGENT VIDEO ANALYTICS (IVA) FOR EFFICIENCY AND SAFETY
Access Control | Public Transit | Industrial Inspection | Traffic Engineering
Retail Analytics | Logistics | Critical Infrastructure | Public Safety
DEEP LEARNING IN PRODUCTION
Speech Recognition
Recommender Systems
Autonomous Driving
Real-time Object Recognition
Robotics
Real-time Language Translation
Many More…
NGC
WHY CONTAINERS?
Benefits of containers:
• Simplify deployment of GPU-accelerated software, eliminating time-consuming software integration work
• Isolate individual deep learning frameworks and applications
• Share, collaborate, and test applications across different environments
VIRTUAL MACHINE vs. CONTAINER
Not so similar.
[Diagram: a VM stack (App → Bins/Libs → Guest OS → Hypervisor → Host Operating System → Server Infrastructure) alongside a container stack (App → Bins/Libs → Docker Engine → Host Operating System → Server Infrastructure); containers share the host OS rather than running a guest OS per application.]
NVIDIA CONTAINER RUNTIME
[Link]
• Colloquially called "nvidia-docker"
• Docker containers are hardware-agnostic and platform-agnostic
• NVIDIA GPUs are specialized hardware that require the NVIDIA driver
• Docker does not natively support NVIDIA GPUs with containers
• NVIDIA Container Runtime makes the images agnostic of the NVIDIA driver
Docker Terms
Definitions
Image
Docker images are the basis of containers. An Image is an ordered collection of root filesystem changes
and the corresponding execution parameters for use within a container runtime. An image typically
contains a union of layered filesystems stacked on top of each other. An image does not have state and it
never changes.
Container
A container is a runtime instance of a docker image.
A Docker container consists of
● A Docker image
● Execution environment
● A standard set of instructions
[Link]
PRUNING
1. Reduce model size and increase throughput
2. Incrementally retrain the model after pruning to recover accuracy
Prune (tlt-prune), then retrain (tlt-train)
Selecting Unnecessary Neurons
• Data-driven operations
• Non-data-driven operations
• Handling element-wise operations with multiple inputs
pruned_model = [Link](model, t)
A sketch of the selection idea follows below.
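As an illustration only (not the TLT implementation), here is a minimal NumPy sketch of magnitude-based filter selection: filters whose normalized L2 norm falls below a threshold t are marked as prunable. The layer shape and threshold are arbitrary placeholders.

import numpy as np

# Hypothetical conv layer weights: (num_filters, channels, kH, kW).
weights = np.random.randn(64, 32, 3, 3).astype(np.float32)

# Non-data-driven selection: rank filters by L2 norm and
# mark those below a threshold t as prunable.
t = 0.5
norms = np.sqrt((weights.reshape(weights.shape[0], -1) ** 2).sum(axis=1))
norms = norms / norms.max()          # normalize so t is scale-independent
keep = norms >= t                    # boolean mask of filters to retain

pruned_weights = weights[keep]       # smaller layer; retrain afterwards to recover accuracy
print(f"kept {keep.sum()} of {len(keep)} filters")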
SCENE ADAPTATION
• Same network adapting to different angles and vantage points
• Same network adapting to new data (e.g., camera location, vantage point, person with a blue shirt)
• Train with new data from another vantage point, camera location, or added attribute
TLT
TENSORFLOW
The Automatic Mixed Precision feature is available both in native TensorFlow and inside the TensorFlow container on the NVIDIA NGC container registry:
export TF_ENABLE_AUTO_MIXED_PRECISION=1
As an alternative, the environment variable can be set inside the TensorFlow Python script:
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
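A minimal sketch of the in-script approach, assuming TensorFlow 1.14+ or an NGC TensorFlow container; the toy model and random data are placeholders, only the environment variable line matters.

import os
# Must be set before the TensorFlow graph/session is created.
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'

import numpy as np
import tensorflow as tf

# Placeholder model and data; AMP rewrites eligible ops to FP16 automatically.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

x = np.random.rand(512, 784).astype(np.float32)
y = np.random.randint(0, 10, size=(512,))
model.fit(x, y, batch_size=128, epochs=1)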
PYTORCH
The Automatic Mixed Precision feature is available in the Apex repository on GitHub. To enable it, add these two lines of code to your existing training script:
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
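A minimal sketch of how those two lines slot into an existing loop, assuming NVIDIA Apex is installed and a CUDA GPU is available; the linear model, optimizer, and random data are placeholders.

import torch
from apex import amp  # NVIDIA Apex

# Placeholder model and optimizer.
model = torch.nn.Linear(784, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Wrap model and optimizer once, before the training loop.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for step in range(10):
    inputs = torch.randn(64, 784, device="cuda")
    targets = torch.randint(0, 10, (64,), device="cuda")

    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)

    # Scale the loss so FP16 gradients do not underflow.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()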
MXNET
The Automatic Mixed Precision feature is available both in native MXNet (1.5 or later) and inside the MXNet container (19.04 or later) on the NVIDIA NGC container registry. To enable the feature, add the following lines of code to your existing training script:
amp.init()
amp.init_trainer(trainer)
with amp.scale_loss(loss, trainer) as scaled_loss:
    autograd.backward(scaled_loss)
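A minimal sketch of those lines in context, assuming MXNet 1.5+ with mxnet.contrib.amp; the Gluon network, trainer settings, and random data are placeholders.

import mxnet as mx
from mxnet import autograd, gluon
from mxnet.contrib import amp

# Enable AMP before building the network and trainer.
amp.init()

ctx = mx.gpu() if mx.context.num_gpus() > 0 else mx.cpu()
net = gluon.nn.Dense(10)
net.initialize(ctx=ctx)

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})
amp.init_trainer(trainer)  # patch the trainer for dynamic loss scaling

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
data = mx.nd.random.uniform(shape=(64, 784), ctx=ctx)
label = mx.nd.random.randint(0, 10, shape=(64,), ctx=ctx).astype('float32')

with autograd.record():
    loss = loss_fn(net(data), label)
    with amp.scale_loss(loss, trainer) as scaled_loss:
        autograd.backward(scaled_loss)
trainer.step(data.shape[0])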
AUTOMATIC MIXED PRECISION IN TENSORFLOW
Up to 3X speedup
TensorFlow Medium post: Automatic Mixed Precision in TensorFlow for Faster AI Training on NVIDIA GPUs
All models can be found at [Link], except for ssd-rn50-fpn-640, which is here: [Link]. All performance collected on 1xV100-16GB, except bert-squadqa on 1xV100-32GB.
Speedup is the ratio of time to train for a fixed number of epochs in single precision and with Automatic Mixed Precision. The number of epochs for each model matched the literature or common practice (it was also confirmed that both training sessions achieved the same model accuracy).
Batch sizes: rn50 (v1.5): 128 for FP32, 256 for AMP+XLA; ssd-rn50-fpn-640: 8 for FP32, 16 for AMP+XLA; NCF: 1M for FP32 and AMP+XLA; bert-squadqa: 4 for FP32, 10 for AMP+XLA; GNMT: 128 for FP32, 192 for AMP.
AUTOMATIC MIXED PRECISION IN PYTORCH
[Link]
● Plot shows ResNet-50 results with and without Automatic Mixed Precision (AMP): roughly 2X higher throughput with AMP enabled (mixed precision) than with FP32.
● More AMP-enabled model scripts coming soon: Mask R-CNN, GNMT, NCF, etc.
Source: [Link]
AUTOMATIC MIXED PRECISION IN MXNET
AMP speedup of ~1.5X to 2X in comparison with FP32
[Link]
NVIDIA TENSORRT
Programmable Inference Accelerator
[Diagram: trained models from deep learning frameworks feed the TensorRT Optimizer and Runtime, which target GPU platforms including Tesla V100, Tesla P4, Jetson TX2, DRIVE PX 2, and the NVIDIA DLA.]
[Link]/tensorrt
TENSORRT PERFORMANCE
40x faster CNNs on V100 vs. CPU-only, under 7 ms latency (ResNet-50); 140x faster language translation RNNs on V100 vs. CPU-only inference (OpenNMT).
[Charts: inference throughput (images/sec) and latency (ms) on ResNet-50 for CPU-only, V100 + TensorFlow, and V100 + TensorRT; inference throughput (sentences/sec) and latency (ms) on OpenNMT for CPU-only + Torch, V100 + Torch, and V100 + TensorRT.]
Inference throughput (images/sec) on ResNet-50. V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 [email protected] 3.5GHz Turbo (Broadwell) HT On. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, E5-2690 [email protected] 3.5GHz Turbo (Broadwell) HT On. CPU-Only: Intel Xeon-D 1587 Broadwell-E CPU and Intel DL SDK. Score doubled to comprehend Intel's stated claim of 2x performance improvement on Skylake with AVX512.
Inference throughput (sentences/sec) on OpenNMT 692M. V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 [email protected] 3.5GHz Turbo (Broadwell) HT On. V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB, E5-2690 [email protected] 3.5GHz Turbo (Broadwell) HT On. CPU-Only: Torch (FP32), batch size 1, Intel E5-2690 [email protected] 3.5GHz Turbo (Broadwell) HT On.
[Link]/tensorrt
TENSORRT DEPLOYMENT WORKFLOW
Step 1: Optimize trained model
Trained neural network → Import Model → TensorRT Optimizer → Serialize Engine → Optimized plans (Plan 1, Plan 2, Plan 3)
Step 2: Deploy optimized plans with runtime
Optimized plans → De-serialize Engine → TensorRT Runtime Engine → Deploy (data center, automotive, embedded)
MODEL IMPORTING
For AI researchers and data scientists
Example: importing a TensorFlow model
• Model Importer (Python/C++ API) for TensorFlow and other frameworks
• Network Definition API (Python/C++ API)
• Runtime inference through the C++ or Python API
[Link]/tensorrt
TENSORRT OPTIMIZATIONS
➢ Optimizations are completely automatic
➢ Performed with a single function call
• Layer & Tensor Fusion
• Weights & Activation Precision Calibration
• Kernel Auto-Tuning
• Dynamic Tensor Memory
LAYER & TENSOR FUSION
Un-optimized network vs. TensorRT-optimized network:
• Vertical fusion (e.g., conv + bias + ReLU fused into a single CBR kernel)
• Horizontal fusion (e.g., 1x1 CBR layers that share an input fused together)
• Layer elimination (e.g., concat)

Network       Layers before  Layers after
VGG19         43             27
Inception V3  309            113
ResNet-152    670            159
KERNEL AUTO-TUNING & DYNAMIC TENSOR MEMORY
Kernel Auto-Tuning
• 100s of specialized kernels
• Optimized for every GPU platform (Tesla V100, Jetson TX2, DRIVE PX2)
• Multiple parameters: batch size, input dimensions, filter dimensions, ...
Dynamic Tensor Memory
• Reduces memory footprint and improves memory re-use
• Manages memory allocation for each tensor only for the duration of its usage
EXAMPLE: DEPLOYING TENSORFLOW MODELS WITH TENSORRT
Deployment and Inference
Import, optimize, and deploy TensorFlow models using the TensorRT Python API.
Steps:
• Start with a frozen TensorFlow model
• Create a model parser
• Optimize the model and create a runtime engine
• Perform inference on new data using the optimized runtime engine
A minimal example follows below.
[Link]/tensorrt
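A minimal sketch of this flow using the TF-TRT converter API as shipped with TensorFlow 1.14/1.15-era releases; the SavedModel directories, precision mode, and batch size are placeholders, and the module path differs in TensorFlow 2.x.

from tensorflow.python.compiler.tensorrt import trt_convert as trt

input_saved_model_dir = "resnet50_saved_model"       # placeholder: trained TF model
output_saved_model_dir = "resnet50_saved_model_trt"  # placeholder: optimized output

# Build a TF-TRT converter: TensorRT-compatible subgraphs are replaced by
# TRT engine ops; unsupported ops fall back to native TensorFlow.
converter = trt.TrtGraphConverter(
    input_saved_model_dir=input_saved_model_dir,
    precision_mode="FP16",
    max_batch_size=8)
converter.convert()                       # optimize the graph
converter.save(output_saved_model_dir)    # write the optimized SavedModel

# The optimized SavedModel can then be loaded and run like any TF model
# to perform inference on new data.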
7 STEPS TO DEPLOYMENT WITH TENSORRT
Step 1: Convert the trained model into a TensorRT-compatible format
Step 2: Create a model parser
Step 3: Register inputs and outputs
Step 4: Optimize the model and create a runtime engine
Step 5: Serialize the optimized engine
Step 6: De-serialize the engine
Step 7: Perform inference
A minimal Python walkthrough follows below.
[Link]/tensorrt
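A minimal sketch of these steps with the TensorRT Python API, assuming a TensorRT 5/6-era install and an ONNX model file; "model.onnx", the batch size, and the workspace size are placeholders, and newer TensorRT versions use explicit-batch networks and builder configs instead of the flags shown here.

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Steps 1-3: parse the trained model (here ONNX); the parser registers inputs/outputs.
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()
parser = trt.OnnxParser(network, TRT_LOGGER)
with open("model.onnx", "rb") as f:           # placeholder model file
    if not parser.parse(f.read()):
        raise RuntimeError("ONNX parsing failed")

# Step 4: optimize the model and create a runtime engine.
builder.max_batch_size = 8
builder.max_workspace_size = 1 << 30          # 1 GB scratch space for kernel tactics
engine = builder.build_cuda_engine(network)

# Step 5: serialize the optimized engine to a plan file.
with open("model.plan", "wb") as f:
    f.write(engine.serialize())

# Steps 6-7: de-serialize the plan and create an execution context for inference.
runtime = trt.Runtime(TRT_LOGGER)
with open("model.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
# context.execute(batch_size, bindings) then runs inference on device buffers.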
TensorRT Inference with TensorFlow
TensorFlow: an end-to-end open source machine learning platform
● Powerful platform for research and experimentation
● Versatile, easy model building
● Robust ML production anywhere
● Most popular ML project on GitHub
● 41M downloads
NVIDIA TensorRT: platform for high-performance deep learning inference
● Optimize and deploy neural networks in production environments
● Maximize throughput for latency-critical apps with the optimizer and runtime
● Deploy responsive and memory-efficient apps with INT8 & FP16
● 300K downloads in 2018
TF-TRT = TF + TRT
TensorRT Inference with TensorFlow
AGENDA
● Benefits of using TF-TRT
● How to use it
● Customer experience: Clarifai
● How TF-TRT works
● Additional resources
Benefits of using TF-TRT
● Optimize TF inference while still using the TF ecosystem
● Simple API: up to 8x performance gain with little effort
● Fallback support: non-optimized layers that TensorRT does not support run in native TensorFlow
● Over 10 optimized models with published examples
● Performance optimizations coming soon: more NLP and object detection models

Model              TF FP32 (imgs/s)   TF-TRT INT8 (imgs/s)   Speedup
ResNet-50          399                3053                   7.7x
Inception V4       158                1128                   7.1x
MobileNet V1       1203               4975                   4.1x
NASNet Large       43                 162                    3.8x
VGG16              245                1568                   6.4x
SSD MobileNet V2   102                411                    4.0x
SSD Inception V2   82                 327                    4.0x

TensorFlow FP32 vs. TensorFlow-TensorRT INT8 on T4, largest possible batch size, no I/O.
NGC TensorFlow 19.07 with scripts: [Link]
FP16 accuracy
Models TF FP32 TF-TRT FP16
Mobilenet V2 74.08 74.07
NASNet Mobile 73.97 73.87
ResNet 50 V1.5 76.51 76.48
ResNet 50 V2 76.43 76.40
VGG 16 70.89 70.91
Inception V3 77.99 77.97
SSD Mobilenet v1 23.06 23.07
FP16 accuracy is within 0.1% of FP32 accuracy
Top-1 metric (%) for classification models. mAP for SSD detection models.
Complete data: [Link]
INT8 accuracy
Models TF FP32 TF-TRT INT8
Mobilenet V2 74.08 73.90
NASNet Mobile 73.97 73.55
ResNet 50 V1.5 76.51 76.23
ResNet 50 V2 76.43 76.30
VGG 16 70.89 70.78
Inception V3 77.99 77.85
INT8 accuracy is within 0.2% of FP32 accuracy except for NASNet Mobile within 0.5%.
Top-1 metric (%) for classification models.
Complete data: [Link]
TensorRT ONNX PARSER
Parser to import ONNX models into TensorRT
• Optimize and deploy models from ONNX-supported frameworks in production
• Apply TensorRT optimizations to any ONNX framework (Caffe2, Chainer, Microsoft Cognitive Toolkit, MXNet, PyTorch)
• C++ and Python APIs to import ONNX models
• New samples demonstrating the step-by-step process to get started
[Link]/tensorrt
INEFFICIENCY LIMITS INNOVATION
Difficulties with deploying data center inference:
• Single model only: some systems (ASR, NLP, recommenders) are overused while others are underutilized
• Single framework only: solutions can only support models from one framework
• Custom development: developers need to reinvent the plumbing for every application
NVIDIA TENSORRT INFERENCE SERVER
Production Data Center Inference Server
• Maximize real-time inference performance of GPUs
• Quickly deploy and manage multiple models per GPU per node
• Easily scale to heterogeneous GPUs and multi-GPU nodes (NVIDIA T4, Tesla V100, Tesla P4)
• Integrates with orchestration systems and auto-scalers via latency and health metrics
• Now open source for thorough customization and integration
FEATURES
Concurrent Model Execution
Multiple models (or multiple instances of the same model) may execute on the GPU simultaneously.
CPU Model Inference Execution
Framework-native models can execute inference requests on the CPU.
Metrics
Utilization, count, memory, and latency.
Custom Backend
A custom backend gives the user more flexibility by providing their own implementation of an execution engine through the use of a shared library.
Model Ensemble
A pipeline of one or more models and the connection of input and output tensors between those models (can be used with a custom backend).
Dynamic Batching
Inference requests can be batched up by the inference server to 1) the model-allowed maximum or 2) the user-defined latency SLA.
Multiple Model Format Support
PyTorch JIT (.pt), TensorFlow GraphDef/SavedModel, TensorFlow and TensorRT GraphDef, ONNX graph (ONNX Runtime), TensorRT plans, Caffe2 NetDef (ONNX import path).
CMake Build
Build the inference server from source, making it more portable to multiple OSes and removing the build dependency on Docker.
Streaming API
Built-in support for audio streaming input, e.g. for speech recognition.
A model configuration sketch follows below.
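A minimal sketch of how concurrent model instances and dynamic batching might be declared in a model's config.pbtxt, generated here from Python to keep the example self-contained; the model name, tensor names, and dimensions are illustrative placeholders rather than values from the slides.

import os

# Hypothetical model configuration for the inference server's model repository.
config_pbtxt = """
name: "resnet50_trt"
platform: "tensorrt_plan"
max_batch_size: 32
input [
  { name: "input", data_type: TYPE_FP32, dims: [ 3, 224, 224 ] }
]
output [
  { name: "probabilities", data_type: TYPE_FP32, dims: [ 1000 ] }
]
# Two copies of the model execute concurrently on each GPU.
instance_group [
  { count: 2, kind: KIND_GPU }
]
# Requests are batched up to the preferred sizes or until the queueing delay expires.
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 100
}
"""

os.makedirs("model_repository/resnet50_trt/1", exist_ok=True)
with open("model_repository/resnet50_trt/config.pbtxt", "w") as f:
    f.write(config_pbtxt)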
INFERENCE SERVER ARCHITECTURE
Available with monthly updates
Models supported:
● TensorFlow GraphDef/SavedModel
● TensorFlow and TensorRT GraphDef
● TensorRT plans
● Caffe2 NetDef (ONNX import)
● ONNX graph
● PyTorch JIT (.pt)
Multi-GPU support
Concurrent model execution
Server HTTP REST API/gRPC
Python/C++ client libraries
Additional resources
- GTC technical presentation: [Link]
- TF-TRT user guide: [Link]
- NVIDIA DLI course on TF-TRT: [Link]
- Monthly release notes: [Link]
- Google blog on TF-TRT inference: [Link] (running-tensorflow-inference-workloads-at-scale-with-tensorrt-5-and-nvidia-t4-gpus)
- NVIDIA Developer Blog: [Link]
AGENDA
1. Intelligent Video Analytics
2. DeepStream SDK
   a. What is the DeepStream SDK?
   b. Why the DeepStream SDK?
   c. What's new in DS 4.0?
   d. DeepStream building blocks
3. Getting started with the DeepStream SDK
   a. Where to start?
   b. Directory hierarchy
   c. Configuration file and pipeline details
   d. Running the application
4. Building with the DeepStream SDK
   a. Real-world use cases with demos
   b. Resources
INTELLIGENT VIDEO ANALYTICS (IVA) FOR EFFICIENCY AND SAFETY
Access Control | Public Transit | Industrial Inspection | Traffic Engineering
Retail Analytics | Logistics | Critical Infrastructure | Public Safety
WHAT IS DEEPSTREAM?
[Stack diagram: Applications and Services sit on top of the DEEPSTREAM SDK (hardware-accelerated plugins, Docker containers, reference applications and analytics recipes, IoT runtime, Kubernetes-on-GPU orchestration), which builds on CUDA-X (NVIDIA containers, RT, CUDA, Multimedia, TensorRT) and the NVIDIA computing platform from edge to cloud (Jetson | Tesla).]
WHY DEEPSTREAM?
The most comprehensive end-to-end development platform for IVA.
• Broader use cases and industries: build your own application for smart cities, retail analytics, industrial inspection, logistics, and more.
• Faster time to market: ready-to-use building blocks and IP simplify building your innovative product.
• Performance driven: low latency and exceptional performance optimized for NVIDIA GPUs for real-time edge analytics.
• Cloud integration: push-button IoT solution integration to build applications and services with cloud service providers.
• Faster time to progress: iterate and integrate through quick plug-and-play of popular pre-packaged plugins, or build your own.
DEEPSTREAM SDK
Plugins (built with open source, 3rd-party, and NVIDIA components): DNN inference/TensorRT plugins, communications plugins, video/image capture and processing plugins, 3rd-party library plugins.
Analytics (multi-camera, multi-sensor framework): DeepStream in containers, multi-GPU orchestration; tracking and analytics across large-scale, multi-camera deployments; streaming and batch analytics; event fabric.
Development tools: end-to-end reference applications, app building/configuration tools, end-to-end orchestration recipes and adaptation guides, plugin templates, custom IP integration.
DeepStream SDK foundation: TensorRT, Multimedia APIs/Video Codec SDK, imaging and dewarping library, metadata and messaging, multi-camera tracking library, NV containers, message bus clients.
Runs on Linux and CUDA. Perception infrastructure: Jetson, Tesla servers (edge and cloud). Analytics infrastructure: edge server, NGC, AWS, Azure.
REAL-TIME INSIGHTS, HIGHEST STREAM DENSITY
[Diagram: pixels to information to dashboard — the NVIDIA Metropolis Application Framework with the NVIDIA Edge Stack on an NVIDIA EGX server handles perception, with analytics, visualization, and cloud monitoring in NGC / any cloud.]
68 streams of 1080p per T4
SMART PARKING: PERCEPTION GRAPH
Pipeline for 360° feeds: RTSP → Decoder → Dewarp library (camera calibration, ROI calibration, dewarping) → Detection and classification (ROI: lines) → Tracker → Detection and classification (ROI: polygon) → Global positioning → Transmit metadata → Analytics server
Plugin groups: communications plugin, preprocessing plugins, detection, classification and tracking plugins, communications plugins
VIDEO: INTELLIGENT TRAFFIC SYSTEM
Perception → Analytics → Visualization
WAREHOUSE LOGISTICS: INVENTORY SORTING
Use case: detect and flag packages on a conveyor belt.
Solution: a DeepStream container on an NVIDIA IoT edge device connects to Azure IoT Central through the Azure IoT Edge runtime, sending telemetry data to business logic services.
THANK YOU!
QUESTIONS?