Project report
Motivation:
The goal of this project is to implement a compact deep learning model to classify a set of 8-10
keywords accurately enough that it can be reliably deployed on a resource-constrained embedded
device. The model would then perform keyword inference locally, so that one or a combination of
those keywords can be used to perform a set of smart domestic tasks, e.g. switching lights on and off or
changing the temperature. This makes it unnecessary to perform inference for the set of local commands
on the server side, saving bandwidth and cloud resources for tasks which are non-local (requiring
internet connectivity) and bypassing network reliability issues for the local commands.
Methodology:
The speech recognition network is implemented using TensorFlow's speech recognition example script
at the following repository: https://github.com/tensorflow/tensorflow/tree/r1.15/tensorflow/examples/
speech_commands
The speech model is trained on words taken from Google's Speech Commands dataset (*), which
consists of 65,000 one-second-long audio clips, where each clip contains one of 30 different words
spoken by different subjects in realistic environments. Each word has a folder containing all of its one-
second wave file examples. 8-10 words from the dataset are used to train the speech model,
whereas examples from the words not selected are used to train an 'unknown' word label. A 'silence' label
is also trained to detect when no word is spoken. It should be noted that words falling outside these 30
words can also be trained using this script, as long as they have sufficient example clips and the clips are
formatted and organized as required.
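For illustration, the expected on-disk layout looks roughly like the following (the folder and file names here are only examples of the dataset's naming scheme):

    speech_dataset/
        yes/
            0a7c2a8d_nohash_0.wav
            ...
        no/
            ...
        _background_noise_/
            doing_the_dishes.wav
            ...

Custom words can be added as new folders of one-second .wav clips following the same pattern.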
Each second of raw audio captured from the device's microphone is converted to a 2D spectrogram of
features using an FFT algorithm and fed as input to the model. By default, each spectrogram consists of
49 rows and 40 columns, where each row is a 30-ms audio slice split into 40 frequency buckets using the
FFT. The 2D array for one second of audio data is built by running the FFT on 49 consecutive audio
slices, with each slice overlapping the last by 10 ms. Both the audio slice size and slice stride are tunable
parameters, and hence the spectrogram input shape can be changed.
[spectrogram diagram]
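A minimal sketch of how these numbers fit together, assuming the dataset's 16 kHz sample rate (the slice and stride values are the ones described above; the actual feature extraction is performed inside the example script):

    # Sketch: how many 30-ms slices fit into one second of 16 kHz audio
    sample_rate = 16000
    clip_samples = sample_rate * 1                 # one second of audio
    window_size = int(sample_rate * 30 / 1000)     # 480 samples per slice
    window_stride = int(sample_rate * 20 / 1000)   # 320 samples between slice starts (10 ms overlap)
    num_slices = 1 + (clip_samples - window_size) // window_stride
    num_bins = 40                                  # frequency buckets per slice
    print(num_slices, num_bins)                    # 49 x 40 spectrogram input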
The model then performs inference on the one-second spectrogram data and outputs logits for each
keyword it classifies. Since audio data is continuously streaming and multiple such inferences are
performed per second on overlapping streaming data, a command recognizer class
(recognize_commands.h) is used to recognize the spoken command by averaging the scores for all
words over an averaging window containing a minimum number of inferences (3). If the average
score for a word exceeds a threshold value, it is detected as the spoken word. The averaging
window size and detection threshold are both tunable parameters.
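A simplified Python sketch of this averaging-and-threshold idea follows (it mirrors the behaviour described above rather than the actual recognize_commands.h implementation; the default values are placeholders for the tunable parameters):

    from collections import deque
    import numpy as np

    class CommandRecognizer:
        """Averages per-word scores over a sliding time window and only
        reports a word when the averaged score clears a threshold."""

        def __init__(self, labels, window_ms=500, min_count=3, threshold=0.7):
            self.labels = labels
            self.window_ms = window_ms
            self.min_count = min_count
            self.threshold = threshold
            self.results = deque()  # (timestamp_ms, scores) pairs

        def process(self, timestamp_ms, scores):
            # Keep only the inferences that fall inside the averaging window.
            self.results.append((timestamp_ms, np.asarray(scores)))
            while self.results and timestamp_ms - self.results[0][0] > self.window_ms:
                self.results.popleft()
            if len(self.results) < self.min_count:
                return None  # not enough inferences to smooth yet
            mean_scores = np.mean([s for _, s in self.results], axis=0)
            best = int(np.argmax(mean_scores))
            if mean_scores[best] >= self.threshold:
                return self.labels[best]
            return None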
Speech recognition model:
The model architectures used to train the keyword classifier are defined in the file ‘models.py’ (*). I
added my convolutional model ‘cnn’, defined by the ‘create_cnn_model’ function in ‘models.py’,
with the aim of optimizing the tradeoff between model size and performance for small-footprint
deployment on embedded platforms. The model outline is given below:
(fingerprint_input)
v
[Conv2D layer]
v
[Relu]
v
[Conv2D layer]
v
[Relu]
v
[Maxpool layer]
v
[Output/logits layer]
The ‘cnn’ model consists of two successive convolutional layers with 64 and 48 kernels respectively.
The larger number of kernels and the single strides in the first convolutional layer are meant to extract
maximum information from the input, while the smaller number of filters and larger stride in the second
convolutional layer are meant to capture more discriminative features. I use a stride of 2 in one
dimension for the second convolutional layer's kernels as an alternative to a pooling layer, because a
pooling layer would normally downsize the output of the first convolutional layer to a greater degree
and therefore would not retain as much information. The second convolutional layer is followed
by a max-pooling layer which connects to the output layer to obtain the logits for each word. The
max-pooling layer is meant to enable further feature discrimination as well as downsize the second
convolutional layer's output so as to reduce the number of parameters connecting to the output layer.
Batch normalization layers are introduced after each convolutional layer to stabilize training.
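A rough sketch of this architecture in Keras form is shown below. The filter counts (64 and 48), the stride of 2 in one dimension, the batch normalization placement, the max-pooling layer and the final logits layer follow the description above; the kernel and pooling sizes are illustrative placeholders, and the actual model is defined in TensorFlow 1.x style in ‘create_cnn_model’:

    import tensorflow as tf

    def create_cnn_model_sketch(num_rows, num_cols, label_count):
        # Input: one-second spectrogram "fingerprint", reshaped to 2D plus a channel dimension.
        inputs = tf.keras.Input(shape=(num_rows, num_cols, 1))
        # First conv layer: 64 kernels, stride 1, to retain as much input detail as possible.
        x = tf.keras.layers.Conv2D(64, (8, 10), strides=(1, 1), padding='same')(inputs)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.ReLU()(x)
        # Second conv layer: 48 kernels, stride 2 along one dimension
        # (used instead of a pooling layer to keep more information).
        x = tf.keras.layers.Conv2D(48, (4, 4), strides=(2, 1), padding='same')(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.ReLU()(x)
        # Max-pooling to downsize before the dense logits layer.
        x = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(x)
        x = tf.keras.layers.Flatten()(x)
        logits = tf.keras.layers.Dense(label_count)(x)  # raw logits, no softmax
        return tf.keras.Model(inputs, logits)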
As I require a model with a small footprint, I did not introduce more layers into the model, keeping it
compact yet reliable. In addition to adjusting model parameters such as the configuration of each layer,
the number of training steps, the learning rate and the dropout rate, training input parameters such as the
amount and volume of background noise mixed in, the time shifting used to add distortion, the number of
frequency bins in each spectrogram window slice, and the size and stride of each spectrogram window
were also tweaked to train an optimal model.
I also added and tested a recurrent network model ‘rnn’, defined in ‘create_rnn_model’, to evaluate its
performance. It consists of 3 LSTM layers with 118 units in each layer. I compared both of my models
against the baseline ‘conv’ model, provided as the default model for training the keyword spotter in the
‘create_conv_model’ function.
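For comparison, a similar sketch of the ‘rnn’ variant (only the three LSTM layers of 118 units each come from the description above; the rest is illustrative):

    import tensorflow as tf

    def create_rnn_model_sketch(num_slices, num_bins, label_count):
        # Treat the spectrogram as a time series: one feature vector per 30-ms slice.
        inputs = tf.keras.Input(shape=(num_slices, num_bins))
        x = tf.keras.layers.LSTM(118, return_sequences=True)(inputs)
        x = tf.keras.layers.LSTM(118, return_sequences=True)(x)
        x = tf.keras.layers.LSTM(118)(x)                 # final layer returns the last state only
        logits = tf.keras.layers.Dense(label_count)(x)   # raw logits for each word label
        return tf.keras.Model(inputs, logits)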
Experimental setup:
All models are trained to classify 8 keywords, in consideration of feasible implementation in terms of
compute intensity and size. The window size for the audio input is set at 30 ms with a window stride
of 20 ms to create the spectrogram for 1 second of audio. 40 frequency/feature bins are used to
represent each spectrogram window. 10% of the training data is used to train the ‘unknown’ word label
and 10% to train the ‘silence’ label. 10% of the wave files are used as the test set and 10% as the
validation set.
Background noise (volume 0.1, with the maximum being 1) is introduced to 80% of the training samples.
The training audio is randomly shifted in time within a range of 100 ms. The first 23,000 training steps
have a learning rate of 0.001 and the following 7,000 steps have a smaller learning rate of 0.0001 for
fine-tuning. All training configuration parameters are kept constant across models; they are specified in
the ‘train.py’ file (*) and in the corresponding training commands in the README file, and an example
invocation is sketched below. The experimental results are evaluated in the next section.
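As an illustration only, the invocation would look roughly as follows (the wanted word list is a placeholder, and the flag names should be checked against the README and train.py for the exact version used):

    python train.py \
        --data_dir=/path/to/speech_dataset \
        --model_architecture=cnn \
        --wanted_words=yes,no,up,down,left,right,on,off \
        --window_size_ms=30 --window_stride_ms=20 \
        --silence_percentage=10 --unknown_percentage=10 \
        --testing_percentage=10 --validation_percentage=10 \
        --background_volume=0.1 --background_frequency=0.8 \
        --time_shift_ms=100 \
        --how_many_training_steps=23000,7000 \
        --learning_rate=0.001,0.0001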
Results:
The results of training and evaluating my ‘cnn’ model are presented below. I compare these results
against the baseline ‘conv’ model, which is provided to give good keyword spotting performance. The
‘conv’ model uses 940,000 weight parameters and requires over 800 million FLOPs to perform each
inference. While this should enable the ‘conv’ model to provide good performance, it is too compute-
intensive to run constantly on resource-constrained embedded devices and is therefore not feasible as a
keyword spotting model. In contrast, my model uses about 363,000 weight parameters and performs 14
million FLOPs per inference. Assuming a system running a maximum of 10 inferences per second, an
embedded MCU with reasonable compute resources can feasibly run a model performing up to about
20 Mops per inference (see Table 3 in [2]). Thus, my model will be able to run at interactive speeds on
limited-resource devices with reasonable compute performance.
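To make the margin concrete (treating the FLOP and operation counts above as comparable): at 10 inferences per second my model needs roughly 14 M x 10 = 140 M operations per second, within the 20 M x 10 = 200 M operations per second implied by the budget above, whereas the baseline ‘conv’ model would need over 800 M x 10 = 8,000 M operations per second.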
The model can be quantized for eight-bit deployment by running the training script with the ‘quantize’
flag set to ‘True’. The size of my model without quantization is about 1.4 MB, whereas it is about
370 KB with quantization. Quantization caused little degradation in test accuracy, and the quantized
model is small enough to fit within the constraints of embedded platforms; for instance, a number of
Cortex-M4/M7 boards have 1-2 MB of flash (Table 1 in [2]). However, SRAM may have to be
separately/specifically allocated or added to run this model.
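As a rough consistency check (assuming 32-bit weights before quantization and 8-bit weights after): 363,000 parameters x 4 bytes ≈ 1.45 MB and 363,000 parameters x 1 byte ≈ 0.36 MB, which lines up with the ~1.4 MB and ~370 KB figures above.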
Despite the small footprint of my model, its test accuracy is similar to that of the baseline model. Thus,
my model is well optimized for efficient keyword spotting on embedded systems, and the comparison
shows there is scope for further improving the performance-size tradeoff.
I did not include the results of my ‘rnn’ model in this evaluation as its performance is inferior to that
of my ‘cnn’ model (about 81% test accuracy). However, the performance results for the ‘rnn’ model
can be obtained by following the corresponding instructions in the README file. The inferior
performance of the ‘rnn’ model may be because recurrent networks are better suited to capturing
longer time-series dependencies than are present in a 1-second spectrogram input.
Streaming accuracy:
Since a keyword spotting application runs on a continuous stream of audio rather than on individual
clips, I also evaluate model performance on streaming data. This is done by applying the model
repeatedly at different offsets and averaging the results over a short window to produce a smoothed
prediction. I evaluate my model on streaming data by running the ‘test_streaming_accuracy’ file (*).
This uses the ‘RecognizeCommands’ class mentioned before to run through a long-form audio input,
try to spot words, and compare those predictions against a ground truth list of labels and times.
I need a long audio file to test streaming data, along with labels showing where each word was spoken. I
generate synthetic test data by running the ‘generate_streaming_test_wav.py’ file to create a 5-minute
.wav file, with configurations similar to those used for training and with words spoken roughly every
two seconds, and a text file containing the ground truth of when each word was spoken.
Then I run the test streaming accuracy script with my model on the generated audio file to evaluate
streaming performance. I use an averaging window of 500 ms to smooth prediction results (with a
minimum of 3 inferences within that window). The detection threshold for a prediction within that
window is set at 0.7 (the maximum is 1). The ‘time_tolerance_ms’ flag is set to 1500 ms, which is the
maximum time the model is given to recognize a word after it is spoken. The ‘suppression_ms’ flag is
set at 500 ms, and prevents subsequent word detections for the specified duration after an initial one is
found. Each of these parameters is configurable.
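The matching against ground truth can be pictured roughly as follows (a hypothetical Python illustration of the matched/wrong/false-positive accounting, not the script's actual implementation):

    def score_streaming(found, ground_truth, tolerance_ms=1500):
        """found / ground_truth: lists of (label, time_ms) pairs."""
        correct = wrong = false_positive = 0
        used = set()
        for label, t in found:
            # Look for an unmatched ground-truth word within the time tolerance.
            match = None
            for i, (gt_label, gt_t) in enumerate(ground_truth):
                if i not in used and abs(gt_t - t) <= tolerance_ms:
                    match = i
                    break
            if match is None:
                false_positive += 1       # detection with no nearby ground-truth word
            elif ground_truth[match][0] == label:
                used.add(match)
                correct += 1              # matched and correctly labelled
            else:
                used.add(match)
                wrong += 1                # matched a word but with the wrong label
        return correct, wrong, false_positive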
The streaming accuracy results of my model are as follows:
67% matched: 63% correct, 4% wrong, 0% false positives
Among the 67% of words my model recognized in the generated audio file (about 30% of the words
belong to the ‘unknown’ label), 63% were correctly classified, 4% were wrong and there were no
false positives. The baseline ‘conv’ model had similar performance. The incorrect classifications may
be caused by discrepancies arising from simulating streaming performance on synthetic data.
Nevertheless, the model has fairly good streaming performance on the synthetic audio file, and this is a
good demonstration that the model can be applied to streaming audio data to obtain reasonable
performance by using appropriate/optimized streaming parameters.
Conclusion: