Kaldi GPU Acceleration for ASR

Kaldi is a speech recognition framework that uses a combination of deep learning and machine learning algorithms. NVIDIA has accelerated parts of the Kaldi pipeline using GPUs, including moving the acoustic model to the GPU. This provides speedups over having the acoustic model on the CPU. Challenges included batching and parallelizing the dynamic language model decoding on the GPU.

Uploaded by

Shaheen Kader

KALDI GPU ACCELERATION

GTC - March 2019

AGENDA

1) Brief introduction to speech processing

2) What have we done?

3) How can I use it?

INTRODUCTION TO ASR
Translating Speech into Text

Speech Recognition: the process of taking a raw audio signal and transcribing it to text

Use of Automatic Speech Recognition has exploded in the last ten years:

Personal assistants, medical transcription, call center analytics, video search, etc.

[Figure: word lattice with states 0 to 4 and weighted arcs (nvidia:nvidia/1.0, ai:ai/1.24, speech:speech/1.63), with the transcription "NVIDIA is cool"]

SPEECH RECOGNITION
State of the Art

• Kaldi fuses known state-of-the-art techniques from speech recognition with deep learning

• Hybrid DL/ML approach continues to perform better than deep learning alone

• "Classical" ML Components:

• Mel-Frequency Cepstral Coefficients (MFCC) features – represent audio as a spectrum of a spectrum

• I-vectors – use factor analysis and Gaussian Mixture Models to learn a speaker embedding, which helps the acoustic model adapt to variability in speakers

• Predict phone states (HMM) – unlike "end-to-end" DL models, Kaldi acoustic models predict context-dependent phone substates as Hidden Markov Model (HMM) states

• Result is a system that, to date, is more robust than DL-only approaches and typically requires less data to train

KALDI
Speech Processing Framework

Kaldi is a speech processing framework out of Johns Hopkins University

Uses a combination of DL and ML algorithms for speech processing

Started in 2009 with the intent to reduce the time and cost needed to build ASR systems

http://kaldi-asr.org/

Maintained by Dan Povey

Considered state-of-the-art

KALDI SPEECH PROCESSING PIPELINE

[Figure: pipeline: Raw Audio -> Feature Extraction -> Acoustic Model -> Language Model -> Output ("NVIDIA is cool")]

Kaldi components: MFCC & Ivectors (feature extraction), NNET3 (acoustic model), Lattice Decoder (language model)

FURTHER READING

“Speech Recognition with Kaldi Lectures.” Dan Povey, www.danielpovey.com/kaldi-lectures.html

Deller, John R., et al. Discrete-Time Processing of Speech Signals. Wiley IEEE Press Imprint, 1999.

WHAT HAVE WE DONE?
PREVIOUS WORK

Partnership between Johns Hopkins University and NVIDIA in October 2017

Goal: Accelerate Inference processing using GPUs

Previously, Kaldi inference used the CPU for the entire pipeline

NVIDIA Progress reports:

GTC On Demand: DC8189, S81034

https://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php

INITIAL WORK

[Figure: pipeline diagram: Feature Extraction -> Acoustic Model -> Language Model -> Output]

First Step: Move Acoustic Model to GPU

GPU support was already implemented but not enabled; batched NNET3 was added by Dan Povey

Enabled Tensor Cores for NNET3 processing
INITIAL WORK

[Figure: pipeline diagram: Feature Extraction -> Acoustic Model -> Language Model -> Output]

Early on it was clear that we needed to target language model decoding

[Figure: pie chart of time spent per stage: acoustic model (GPU), language model (CPU), feature extraction (CPU); slice values 0.4%, 4.9%, 94.7%]
LANGUAGE MODEL CHALLENGES

Dynamic Problem:
Amount of parallelism changes significantly throughout the decode
Can have few or many candidates moving from frame to frame
Limited Parallelism:
Even when there are lots of candidates, the amount of parallelism is orders of magnitude smaller than required to saturate a large GPU
Solution:
1) Use graph processing techniques and a GPU-friendly data layout to maximize parallelism while load balancing across threads (see previous talks)
2) Process batches of decodes at a time in a single pipeline
3) Use multiple threads for multiple batched pipelines
CHALLENGES

Kaldi APIs are single threaded, single instance, and synchronous

Makes batching and multi-threading challenging

Solution:

Create a CUDA-enabled Decoder with asynchronous APIs

Master threads submit work and later wait for that work

Batching/Multi-threading occur transparently to the user

EXAMPLE DECODER USAGE
More Details: kaldi-src/cudadecoder/README

for ( … ) {
  …
  // Enqueue decode for unique "key"
  CudaDecoder.OpenDecodeHandle(key, wave_data);
  …
}

for ( … ) {
  …
  // Query results for "key"
  CudaDecoder.GetLattice(key, &lattice);
  …
}
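
The enqueue-then-query call shape can be illustrated with a toy, single-threaded stand-in. This is only a sketch: `ToyBatchedDecoder`, `Open`, and `Get` are hypothetical names rather than Kaldi APIs, no threads or GPU are involved, and a string length stands in for a decoded lattice. It shows work being accepted immediately and then processed in batches only when a result is requested.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy class, not part of Kaldi. It mimics the
// "enqueue by key, query by key later" interface on a single thread.
class ToyBatchedDecoder {
 public:
  explicit ToyBatchedDecoder(std::size_t max_batch_size)
      : max_batch_size_(max_batch_size == 0 ? 1 : max_batch_size) {}

  // Enqueue work under a unique key; returns without doing any decoding.
  void Open(const std::string& key, const std::string& wave_data) {
    pending_.emplace_back(key, wave_data);
  }

  // Fetch the result for `key`, running pending batches until that key
  // has been processed. Returns false if the key was never enqueued.
  bool Get(const std::string& key, std::size_t* result) {
    while (results_.count(key) == 0 && !pending_.empty()) ProcessOneBatch();
    auto it = results_.find(key);
    if (it == results_.end()) return false;
    *result = it->second;
    return true;
  }

  std::size_t batches_run() const { return batches_run_; }

 private:
  // Process up to max_batch_size_ queued items in one go.
  void ProcessOneBatch() {
    const std::size_t n = std::min(max_batch_size_, pending_.size());
    for (std::size_t i = 0; i < n; ++i) {
      // Stand-in for decoding: the "result" is just the input length.
      results_[pending_[i].first] = pending_[i].second.size();
    }
    pending_.erase(pending_.begin(),
                   pending_.begin() + static_cast<std::ptrdiff_t>(n));
    ++batches_run_;
  }

  std::size_t max_batch_size_;
  std::size_t batches_run_ = 0;
  std::vector<std::pair<std::string, std::string>> pending_;
  std::map<std::string, std::size_t> results_;
};
```

With a batch size of 2 and three enqueued keys, querying the third key triggers two batches. The real decoder instead overlaps this work across CPU and GPU threads, as described in kaldi-src/cudadecoder/README.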
GPU ACCELERATED WORKFLOW
BatchedThreadedCudaDecoder

[Figure: workflow diagram: master threads 1..N, a threaded CPU work pool (feature extraction, compute lattice), and a GPU work queue consumed by CUDA control threads that run the acoustic model (NNET3) and language model]

(1) Master threads open decode handles and add waveforms to the work pool
(2) Features are placed in the GPU work queue
(3) Batch of work is processed by a GPU pipeline thread
(4) Master queries results; will block for lattice generation
KALDI SPEECH PROCESSING PIPELINE
GPU Accelerated

[Figure: pipeline: Raw Audio -> Feature Extraction -> Acoustic Model -> Language Model -> Output, with an example word lattice (arcs nvidia:nvidia/1.0, ai:ai/1.24, speech:speech/1.63) and the transcription "NVIDIA is cool"]

BENCHMARK DETAILS
LibriSpeech

Model:
LibriSpeech - TDNN: https://github.com/kaldi-asr/kaldi/tree/master/egs/librispeech
Data: LibriSpeech - Clean/Other: http://www.openslr.org/12/
Hardware:
CPU: 2x Intel Xeon Platinum 8168
NVIDIA GPUs: V100, T4, or Xavier AGX
Benchmarks:
CPU: online2-wav-nnet3-latgen-faster.cc (modified for multi-threading)
Online decoding disabled
GPU: batched-wav-nnet3-cuda.cc
2 GPU control threads, batch=100
TESLA V100
World’s Most Advanced
Data Center GPU

5,120 CUDA cores

640 Tensor cores
7.8 FP64 TFLOPS
15.7 FP32 TFLOPS
125 Tensor TFLOPS
20MB SM RF
16MB Cache
32 GB HBM2 @ 900GB/s
300GB/s NVLink

TESLA T4
World’s most advanced
scale-out GPU

2,560 CUDA Cores

320 Turing Tensor Cores
65 FP16 TFLOPS
130 INT8 TOPS
260 INT4 TOPS
16GB | 320GB/s
70 W

JETSON AGX XAVIER
World’s first AI computer for
Autonomous Machines
AI Server Performance in
30W • 15W • 10W
512 Volta CUDA Cores • 2x NVDLA
8 core CPU
32 DL TOPS • 750 Gbps SerDes

KALDI PERFORMANCE
Determinized Lattice Output

1 GPU, LibriSpeech, beam=10, lattice-beam=7
Uses all available HW threads

Hardware         Perf (RTFx)   WER    Perf    Perf/$   Perf/watt
LibriSpeech Model, Libri Clean Data
2x Intel Xeon    381           5.5    1.0x    1.0x     1.0x
AGX Xavier       500           5.5    1.3x    13.1x    17.9x
Tesla T4         1635          5.5    4.3x    3.7x     3.7x
Tesla V100       3524          5.5    9.2x    5.5x     5.3x
LibriSpeech Model, Libri Other Data
2x Intel Xeon    377           14.0   1.0x    1.0x     1.0x
AGX Xavier       450           14.0   1.2x    11.9x    16.3x
Tesla T4         1439          14.0   3.8x    3.3x     3.3x
Tesla V100       2854          14.0   7.6x    4.5x     4.4x

2x Xeon*: 2x Intel Xeon Platinum 8168, 410W, ~$13000
Xavier: AGX Devkit, 30W, $1299
T4*: PCI-E, (70+410)W, ~$(2000+13000)
V100*: SXM, (300+410)W, ~$(9000+13000)

*Price/power not including system, memory, storage, etc.; price is an estimate

INCREASING VALUE
Amortizing System Cost

Adding more GPUs to a single system increases value

Less system cost overhead

Less system power overhead

Dense systems are the new norm:

DGX1V: 8 V100s in a single node

DGX-2: 16 V100s in a single node

SuperMicro 4U SuperServer 6049GP-TRT: 20 T4s in a single node

Kaldi Inferencing Speedup Relative to 2x Intel 8168

[Figure: bar chart of speedup (0x to 30x) for T4 and V100 with 1, 2, 4, and 8 GPUs]

GPUs   T4 (RTFx)   V100 (RTFx)
1      1635        3524
2      3371        7082
4      6368        10011
8      7906        9399

Kaldi Inferencing Performance Relative to 2x Intel 8168

[Figure: bar chart of relative performance per dollar and performance per watt (0x to 12x) for T4 and V100 with 1, 2, 4, and 8 GPUs]
PERFORMANCE LIMITERS

Cannot feed the beast

Feature Extraction and Determinization become bottlenecks

CPU has a hard time keeping up with GPU performance

Small kernel launch overhead

Kernels typically only run for a few microseconds

Launch latency can become dominant

Avoid this by using larger batch sizes (larger memory GPUs are crucial)

FUTURE WORK
GPU Accelerated Feature Extraction

[Figure: pipeline: Raw Audio -> Feature Extraction -> Acoustic Model -> Language Model -> Output, with an example word lattice and the transcription "NVIDIA is cool"]

Feature Extraction on GPU is a natural next step: algorithms map well to GPUs

Allows us to increase density and therefore value

FUTURE WORK
Native Multi-GPU Support

Native multi-GPU will naturally load balance work pools

[Figure: workflow diagram: master threads 1..N, a threaded CPU work pool (feature extraction, compute lattice), and a GPU work queue consumed by CUDA control threads that run the acoustic model (NNET3) and language model]
FUTURE WORK
Where We Want To Be

GPU Accelerated Multi-GPU Backend

[Figure: target workflow diagram: master threads 1..N, a GPU work queue, CUDA control threads running feature extraction, acoustic model (NNET3), and language model, and a threaded CPU work pool that computes lattices]
HOW CAN I USE IT?
HOW TO GET STARTED
2 Methods

1) Download Kaldi, Pull in PR, Build yourself

https://github.com/kaldi-asr/kaldi/pull/3114

2) Run NVIDIA GPU Cloud Container

Get up and running in less than 10 minutes!

THE NGC CONTAINER REGISTRY
Simple Access to GPU-Accelerated Software

Discover over 40 GPU-Accelerated Containers
Spanning deep learning, machine learning, HPC applications, HPC visualization, and more
Innovate in Minutes, Not Weeks
Pre-configured, ready-to-run
Run Anywhere
The top cloud providers, NVIDIA DGX Systems, PCs and workstations with select NVIDIA GPUs, and NGC-Ready systems

NGC CONTAINER
Free & Easy

Get an NGC account: https://ngc.nvidia.com/signup

# Log in to NGC, pull the container, and run it
%> docker login nvcr.io
%> docker pull nvcr.io/nvidia/kaldi:19.03-py3
%> docker run --rm -it nvcr.io/nvidia/kaldi:19.03-py3

# Prepare models and data
%> cd /workspace/nvidia-examples/librispeech
%> ./prepare_data.sh

# Run benchmarks
%> ./run_benchmark.sh
%> ./run_multigpu_benchmark.sh 4
BENCHMARK OUTPUT
NGC Container

BENCHMARK SUMMARY:
test_set: test_clean
Overall: Aggregate Total Time: 55.1701 Total Audio: 194525 RealTimeX: 3525.91
%WER 5.53 [ 2905 / 52576, 386 ins, 230 del, 2289 sub ]
%SER 51.30 [ 1344 / 2620 ]
Scored 2620 sentences, 0 not present in hyp.
test_set: test_other
Overall: Aggregate Total Time: 64.7724 Total Audio: 192296 RealTimeX: 2968.79
%WER 13.97 [ 7314 / 52343, 850 ins, 730 del, 5734 sub ]
%SER 73.94 [ 2173 / 2939 ]
Scored 2939 sentences, 0 not present in hyp.
Running test_clean on 4 GPUs with 24 threads per GPU
GPU: 0 RTF: 2469.55
GPU: 1 RTF: 2472.81
GPU: 2 RTF: 2519.33
GPU: 3 RTF: 2515.81
Total RTF: 9977.50 Average RTF: 2494.3750
NVIDIA GPUS ARE ON EVERY CLOUD
Over 30 Offerings Across USA and China

[Table: availability of K520, K80, P40, M60, P4, P100, T4, V100, and NGC across Alibaba Cloud, AWS, Baidu Cloud, Google Cloud, IBM Cloud, Microsoft Azure, Oracle Cloud, and Tencent Cloud]

CONTAINER FUTURE WORK

Add more models

Add scripts to help users run quickly on their own models

NUMA pinning

Continue to update Kaldi source with latest updates

KALDI CHANGES
Source Layout

https://github.com/kaldi-asr/kaldi/pull/3114

Added two new directories to source tree

cudadecoder/*:

Implements framework/library classes for use in applications

cudadecoder/README: Detailed documentation on how to use

cudadecoderbin/*:

Binary example using cuda-accelerated decoder

TUNING PERFORMANCE
Functional Parameters

determinize-lattice:
determinize lattice in CPU pool or not
If not determinized in the CPU pool, the master thread will determinize when GetLattice is called
beam:
width of beam during search
Smaller beam = faster but possibly less accurate
lattice-beam:
width of lattice beam before determinization
Smaller beam = smaller lattice, less I/O, less determinization time
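
The effect of the beam width can be sketched as simple cost-based pruning: candidates whose cost exceeds the best cost by more than the beam are dropped. This is a simplified illustration with a hypothetical function name, not the decoder's actual code, and it treats each candidate as a single cost value (lower is better).

```cpp
#include <algorithm>
#include <vector>

// Keep only candidates whose cost is within `beam` of the best (lowest) cost.
std::vector<double> PruneToBeam(const std::vector<double>& costs, double beam) {
  std::vector<double> kept;
  if (costs.empty()) return kept;
  const double best = *std::min_element(costs.begin(), costs.end());
  for (double c : costs) {
    if (c <= best + beam) kept.push_back(c);
  }
  return kept;
}
```

With costs {1.0, 4.5, 9.0, 12.5, 20.0}, a beam of 10 keeps three candidates and a beam of 7 keeps two: fewer survivors means less work per frame, at the risk of discarding the path that would have turned out best.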

TUNING PERFORMANCE
GPU Performance

cuda-control-threads:
number of concurrent CPU threads controlling a single GPU pipeline
Typically 2-4 is ideal (more = more GPU memory and smaller batch size)
cuda-worker-threads:
number of CPU threads in the CPU work pool; should use all CPU resources available
max-batch-size:
maximum batch size per pipeline (more = more GPU memory and fewer control threads)
Want as large as memory allows (<200 is currently possible)
batch-drain-size:
how far to drain a batch before refilling (batches NNET3)
Typically 20% of max-batch-size works well
cuda-use-tensor-cores:
Turn on Tensor Cores (FP16)
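
The rules of thumb above can be collected into a starting configuration. This is a sketch only: the `CudaDecoderTuning` struct and `SuggestTuning` function are hypothetical helpers, not Kaldi types, and the chosen values simply restate the slide's guidance (2 control threads, all CPU threads as workers, drain size at 20% of the batch size).

```cpp
// Hypothetical container for the tuning options named on this slide.
struct CudaDecoderTuning {
  int cuda_control_threads;
  int cuda_worker_threads;
  int max_batch_size;
  int batch_drain_size;
  bool cuda_use_tensor_cores;
};

// Derive a starting point from the available CPU threads and the largest
// batch size that fits in GPU memory (both supplied by the caller).
CudaDecoderTuning SuggestTuning(int cpu_threads, int max_batch_size) {
  CudaDecoderTuning t;
  t.cuda_control_threads = 2;               // slide: 2-4 is typically ideal
  t.cuda_worker_threads = cpu_threads;      // slide: use all CPU resources
  t.max_batch_size = max_batch_size;        // slide: as large as memory allows
  t.batch_drain_size = max_batch_size / 5;  // slide: about 20% of max batch
  t.cuda_use_tensor_cores = true;           // slide: FP16 Tensor Cores
  return t;
}
```

For the benchmark setting of batch=100, this gives a drain size of 20. These are starting values to tune from, since control threads and batch size trade off against each other through GPU memory.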
TUNING PERFORMANCE
Memory Utilization

max-outstanding-queue-length:
Length of GPU work queue; consumes CPU memory only
ntokens-preallocated:
Preallocated host memory to store output, CPU memory only
Will grow dynamically if needed
max-tokens-per-frame:
Maximum tokens in GPU memory per frame
Cannot resize, will reduce accuracy if it fills up
max-active:
maximum number of arcs retained in a given frame (keeping only the max-active best ones)
Less = faster & less accurate
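
The max-active limit can be sketched as keeping only the N lowest-cost candidates in a frame. This is a simplified illustration with a hypothetical function name, using `std::nth_element` as one convenient way to select the best N without a full sort of everything; it is not the decoder's actual implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Keep at most `max_active` lowest-cost candidates, returned in ascending
// order of cost.
std::vector<double> KeepMaxActive(std::vector<double> costs,
                                  std::size_t max_active) {
  if (costs.size() > max_active) {
    // Partition so the max_active smallest costs come first, then truncate.
    std::nth_element(costs.begin(),
                     costs.begin() + static_cast<std::ptrdiff_t>(max_active),
                     costs.end());
    costs.resize(max_active);
  }
  std::sort(costs.begin(), costs.end());
  return costs;
}
```

With costs {5, 1, 9, 3, 7} and a limit of 3, the survivors are {1, 3, 5}. A smaller limit bounds the per-frame work more tightly but can drop candidates that would later have won.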
AUTHORS
Hugo Braun is a Senior AI Developer Technology Engineer at NVIDIA. With a background in mathematics
and physics, he has been working on performance-oriented machine learning algorithms. His work at
NVIDIA focuses on the design and implementation of high-performance GPU algorithms, specializing in
deep learning and graph analytics. He holds an M.S. in Mathematics and Computer Science from Ecole
Polytechnique, France.

Justin Luitjens is a Senior Developer Technology Engineer at NVIDIA. He has spent the last 16 years
working on HPC applications with the last 8 focusing directly on CUDA acceleration at NVIDIA. He
holds a Ph.D. in Scientific Computing from the University of Utah, a Bachelor of Science in Computer
Science from Dakota State University and a Bachelor of Science in Mathematics for Information
Systems from Dakota State University.

Ryan Leary is a Senior Applied Research Scientist specializing in speech recognition and natural
language processing at NVIDIA. He has published research in peer-reviewed venues on machine
learning techniques tailored for scalability and performance as well as natural language processing for
health applications. He holds an M.S. in Electrical & Computer Engineering from Johns Hopkins
University, and a Bachelor of Science in Computer Science from Rensselaer Polytechnic Institute.
