Automatic Audio Defect Detection
Bachelor's Thesis
(Project Practical Course)
Bachelor of Science
in the bachelor's program
Computer Science
Submitted by:
Rudolf Mühlbauer, 0655329
Carried out at:
Department of Computational Perception
Supervision:
Univ.-Prof. Dr. Gerhard Widmer
Dr. Tim Pohle, Dipl.-Ing. Klaus Seyerlehner
Abstract
Contents

1 Introduction
1.1 Notations used in this document
2 Related Work
4 Implementation
4.1 The detection framework
4.2 Statistics
4.3 File traversal
5 Evaluation
5.1 Synthesis of defects
5.1.1 Synthetic pauses
5.1.2 Synthetic silences
5.1.3 Synthetic jumps
5.1.4 Synthetic noise
5.2 Evaluation requirements
5.3 Evaluation specification
5.4 Evaluation plan
5.5 Experiment 1: classification
7 Bibliography
1 Introduction
The transition from traditional, analog audio storage media (such as gramophone records or audio tapes) to modern, digitally stored audio files introduces errors. The storage of analog media degrades the information contained and introduces defects: a vinyl record, for example, degrades over time even without being played [16]. On analog media, playback itself can also greatly degrade the information, as is observable on cassette tapes [10].
When reading Digital Audio Compact Disc media for hard disk storage, mechanical defects on the CD's surface may introduce read errors if they exceed the error correction capability. In addition, there are the well-known problems of portable CD players, where vibrations introduce jumps and gaps during playback.
Digitally stored or transferred files are prone to bit-level errors. While with raw PCM-encoded audio data these would only result in incorrect samples, in formats with error detection (MP3 frame checksums, for example, cf. [14] and [11]) they would result in invalid frames.
There already exist tools to check MP3 collections for problems, but they concentrate on MP3 format errors such as missing tag information, broken file headers, or Variable Bit Rate (VBR) problems. Table 1.1 lists a selection of available software solutions.
Table 1.1: Available MP3 checking tools

Application    Reference
MP3 Diag       [22]
MP3Test        [23]
MP3val         [24]

Many noise removal and restoration applications are also available, mainly based on noise gating, multiband noise gating, or notch filtering. For studio applications these often require the user to train the algorithms with specific noise samples. Various commercial audio applications provide click and pop detection and restoration geared towards the digitization and post-production of vinyl as well as live recordings. Table 1.2 lists some available applications serving this purpose.
Table 1.2: Click/pop detection and restoration applications

Application                        Reference
Audacity                           [19]
GoldWave                           [20]
Izotope RX                         [21]
Pristine Sounds                    [26]
Resurrect                          [27]
SoundSoap                          [28]
Wavearts MR Noise                  [29]
Waves Audio Restoration Plugins    [30]
In this thesis a set of methods is presented to identify four common types of au-
dio degradations typically introduced by digitization and encoding: gaps, jumps,
noise bursts, and low sample rate. The algorithms provide a decision whether a
song contains defects of the respective type or not. Special attention has been
paid to the performance of the algorithms to enable the scanning of large collections in a feasible amount of time. The algorithms presented in this work try
to detect defects without prior knowledge about the musical structure of the audio files under test. Only low-level signal processing methods are used, without the usual prerequisites of traditional machine learning approaches, such as training. The specific defect types and their detection are described in
Chapter 3. Chapter 4 gives an overview of the implementation. The general or-
ganization of the source code is outlined and the auxiliary algorithms (such as
filesystem traversal) are explained. The implementation provides different out-
puts: it is possible to create waveform plots of apparent defects and export WAVE
files of these defects for later inspection. The performance of the implemented
algorithms is evaluated with three experiments and the results are discussed
in Chapter 5. Chapter 6 summarizes the results and explains strengths and
weaknesses of the algorithms and possible future improvements to them.
The goal was to create a push-button application to crawl large collections for common defects, something that did not exist up to that point.
1.1 Notations used in this document

Different typefaces are used depending on the context. Table 1.3 shows these typefaces.
Table 1.3: Typefaces used in this document

Example      Meaning
text         Normal text
mean         Matlab function
i            Matlab variable
X⃗            Vector
threshold    Configuration variable
MP3          File format
2 Related Work

Godsill and Rayner [5] present a method for click removal in degraded gramophone recordings. In their work, an autoregressive model is assumed for the signal; the model parameters are estimated from the corrupted audio data and used in a prediction error filter to calculate a detection signal. This detection signal is then thresholded to identify possible defects. Clicks are modeled using a transient noise model. Furthermore, they describe methods to repair erroneous audio by replacing the audio samples in question, based on different approaches including autoregressive estimation and median filtering. A Markov chain Monte Carlo (MCMC) method for similar applications is described in [3] and [4].
Vaseghi [13] covers many aspects of noise detection and reduction and includes detailed information about impulsive and transient noise. Statistical and model-based approaches to noise estimation and reduction are presented, as well as applications such as the restoration of gramophone records.
Much work has been done in the field of perceived audio quality to assess subjective audio quality. An overview is given in Herrero [8]. PEAQ [15], for example, provides methods to calculate a mean opinion score (MOS) that rates the quality of audio on a scale from 1 to 5. Neural networks have also been used to obtain objective measurements emulating human assessment (Mohamed [12]).
3 Defect types and detection methods

This chapter describes four common audio defects and the respective detection methods that have been developed and implemented within the defect detection framework. The covered defects are: gaps, jumps, noise bursts, and low sample rate.
3.1 Gaps

[Figure 3.1: gap detection pipeline — envelope estimation, thresholding, averaging, thresholding]
The gap detection algorithm finds silent parts in the audio, trying to distinguish between musically intentional pauses and real defects. Such defects can stem, for example, from the sources described in the introduction.
The proposed algorithm is depicted in Figure 3.1 and works as follows: the en-
velope of the signal is estimated by filtering the signal with an envelope follower
similar to the one suggested by Herter [9]. The filtered audio data is thresholded
to obtain a vector of logicals according to Equation 3.1 to find silent parts in the
signal.
thresholded(i) = 1   iff   filtered(i) ≤ threshold                    (3.1)
To exclude single outliers that might appear, the vector of logicals is then filtered with an averaging filter and thresholded again. The averaging is done by a moving-average (mean) filter over a short window.
[Figure: example waveform with a detected gap defect; x-axis: position in file in seconds]
Gaps detected at the very beginning or the end (the so-called border) of the
audio data are excluded. The remaining signal is scanned for regions below
a certain threshold to identify potential gaps. For such a region the so-called
prepower is calculated. The term prepower designates the cumulative power
of prepower_len samples (the prepower frame) just before the beginning of the
region (see Equation 3.2).
prepower = mean(abs(prepower_frame))                    (3.2)
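To make the pipeline concrete, the following is a minimal Python sketch of the gap detection idea (the thesis implementation is in Matlab). The envelope follower, the replacement of the averaging step by a minimum region length, and all parameter values are illustrative assumptions, not the thesis's exact choices.

```python
import math

def detect_gaps(x, threshold=0.01, release=0.99, min_len=500, prepower_len=500):
    """Sketch of gap detection: envelope follower, thresholding (Eq. 3.1),
    and a prepower check (Eq. 3.2). All parameter values are illustrative."""
    # Envelope follower: instant attack, exponential release
    env, e = [], 0.0
    for v in x:
        a = abs(v)
        e = a if a > e else e * release
        env.append(e)
    # Threshold into a vector of logicals (Eq. 3.1)
    silent = [ev <= threshold for ev in env]
    # Collect sufficiently long silent regions; a sentinel closes the last one
    gaps, start = [], None
    for i, s in enumerate(silent + [False]):
        if s and start is None:
            start = i
        elif not s and start is not None:
            if i - start >= min_len:
                # Prepower: mean absolute value just before the region (Eq. 3.2)
                pre = x[max(0, start - prepower_len):start]
                prepower = sum(abs(v) for v in pre) / max(len(pre), 1)
                if prepower > threshold:  # loud right before -> likely a defect
                    gaps.append((start, i))
            start = None
    return gaps
```

A sine tone interrupted by a block of zeros is reported as a single gap, while the quiet instants within each sine period are not.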
Gap detection accuracy would benefit from the distinction between pauses and
silence. Figure 3.3 shows the difference: while in the left plots the source also
stops (called pause), in the right one the source continues to progress, only the
channel is muted (called silence). This distinction would add additional plausi-
bility information and could be accomplished by using a beat-tracking algorithm.
No further research was done in this direction, but it might prove beneficial to the performance of the algorithm.
Figure 3.3: Gap Types: the left two plots show a pause situation: the signal
starts where it stopped. The right plots show a silence situation:
time progresses in the source.
3.2 Jumps

Figure 3.5 sketches what we will understand as jumps: a discontinuity in the signal progression. This might be a skip forward triggered by read errors from a Digital Audio Compact Disc, as well as any other sudden change in playback position.
As Figure 3.5 suggests, we are looking for sudden changes in signal characteristics. The basic procedure to find such discontinuities is depicted in Figure 3.4.
[Figure 3.4: jump detection pipeline, ending in a thresholding stage]
[Figure 3.5: sketch of a jump: a discontinuity in the signal progression]

This algorithm works as follows: for the input signal s⃗ an autoregressive model is used. In a sliding-window manner the model parameters of the signal are calculated by solving the Yule-Walker equations (cf. [17] and [13, pp. 209-238]).
With this model and its parameters a prediction of the signal is estimated by filtering the signal with the prediction filter. This prediction signal p⃗ is compared to the original signal by building the sample-wise difference d⃗(n) = |s⃗(n) − p⃗(n)|. The difference between the two, the residuals, is the prediction error. When the prediction error is high compared to a given threshold, a discontinuity is indicated.
Additionally it is necessary to consider the signal's standard deviation at the point of discontinuity. This is a heuristic to distinguish between noisy parts in the music (for example the sound of a cymbal) and jump defects. This heuristic performs poorly on some music genres, for example heavy rock music, due to the noise-like parts created by distorted instruments.
[Figure: waveform with prediction error, synthesized jump, and detected jump]
[Figure: waveform with prediction error, detected jump, and synthesized gap; x-axis: position in file in seconds]
3.3 Noise bursts

[Figure 3.8: noise burst detection pipeline — noise estimation]
Some of the audio files tested showed random noise bursts. The reasons for those defects may be as diverse as MP3 frame errors [14] or other transmission or coding errors. However, the observed defects showed similar characteristics: high energy and an almost random distribution of sample values across the full bandwidth. Figure 3.9 shows such a defect: a power plot at the top, a spectrogram in the middle, and the waveform at the bottom. The shown defect is clearly audible as loud noise.
Figure 3.9: Plot of noise burst: power plot, spectrogram and waveform
The general approach to detect this type of error is depicted in Figure 3.8. In a frame-wise manner the audio data is scanned for regions with high energy.
mean(X⃗) = ( sum_{i=1..N} i · X⃗(i) ) / ( sum_{i=1..N} X⃗(i) )                    (3.3)
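A minimal sketch of such a frame-wise energy scan follows (the thesis implementation is in Matlab). The choice of frame length, the flagging factor, and the running-mean reference are assumptions for illustration, not the thesis's exact criterion.

```python
def find_noise_bursts(x, frame_len=1024, factor=8.0):
    """Frame-wise energy scan (sketch): a frame is flagged when its power
    greatly exceeds the running mean of the preceding frame powers.
    frame_len and factor are illustrative values."""
    powers = []
    for i in range(0, len(x) - frame_len + 1, frame_len):
        frame = x[i:i + frame_len]
        powers.append(sum(v * v for v in frame) / frame_len)
    flagged = []
    for i, p in enumerate(powers):
        ref = powers[:i]
        if ref and p > factor * (sum(ref) / len(ref)):
            flagged.append(i)  # frame index of a suspected burst
    return flagged
```

For a quiet signal with one loud block, only the frame containing that block is flagged.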
3.4 Low sample rate

Bandwidth estimation
Even if the sample rate of the audio file allows it, the signal in that file might not utilize the full bandwidth. This can happen when vintage recordings are digitized or the file format uses insufficient bandwidth. A common source of this defect is an encoding in the MP3 format with a low sample rate. To detect such shortcomings, the bandwidth usage of the file is analyzed and compared to a configurable, normalized bandwidth.
To reduce the computational effort, only some frames are taken from the audio
data and analyzed. These frames consist of windowsize samples and are taken
every probe_dist samples in a sliding window manner.
s⃗ = abs(fft(f⃗))                    (3.4)
The cumulative spectral power C⃗(i) is the function:

C⃗(i) = sum_{j=1..i} s⃗(j)                    (3.5)
The index i_80% marks the point where the cumulative spectral power reaches 80% of the total spectral power:

C⃗(i_80%) ≥ 0.8 · sum_{j=1..windowsize/2} s⃗(j)                    (3.6)
The bandwidth usage of the whole audio file is then estimated by taking the overall maximum over all probed frames. This value is compared to the minimum bandwidth usage th and the reference sample rate of 22 050 samples per second as follows:

(i_80% · fs) / (22050 · windowsize) < th                    (3.7)
Experiments have shown that a value of 0.6 is reasonable for th. It properly detects audio data that was sampled at 22 050 samples per second, even if it is contained in a file with 44 100 samples per second. It also identifies vintage recordings as defective, which might not be what the user wants. However, there is no simple heuristic to distinguish these cases.
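Equations 3.4-3.7 can be sketched as follows in Python (the thesis implementation is in Matlab). The naive DFT, the single-frame analysis (instead of probing every probe_dist samples), and the function names are all illustrative assumptions.

```python
import math, cmath

def bandwidth_usage(frame, fs, ref_rate=22050):
    """Estimate normalized bandwidth usage of one frame, following the idea
    of Eq. 3.4-3.7 (sketch). frame holds `windowsize` samples."""
    n = len(frame)
    # Eq. 3.4: magnitude spectrum (naive DFT, positive frequencies only)
    s = []
    for k in range(n // 2):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        s.append(abs(acc))
    total = sum(s)
    # Eq. 3.5/3.6: smallest index whose cumulative power reaches 80%
    c, i80 = 0.0, n // 2 - 1
    for i, v in enumerate(s):
        c += v
        if c >= 0.8 * total:
            i80 = i
            break
    # Eq. 3.7: left-hand side, normalized against the 22 050 Hz reference
    return (i80 * fs) / (ref_rate * n)
```

A band-limited tone in a 44 100 Hz frame falls below the th = 0.6 criterion (flagged as low bandwidth usage), while a spectrally flat impulse does not.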
4 Implementation
[Figure: detection framework flow — FINDFILES_NEXT, DETECT_ALL]

4.1 The detection framework
The file traversal functions (see Section 4.3) provide the means to search the given path for audio files. For every file found, the detection algorithms are invoked. In addition to the detection, the statistics are updated and log messages are produced.
For every defect detected, depending on the configuration, the program can
export a plot showing the defect to PDF and FIG format. Also, a segment of
the signal around a detected defect can be exported to a WAVE file for later
inspection.
4.2 Statistics
All statistics are kept in a global structure. This allows simple performance
assessment and is tightly integrated in the test environment. Calculated fields
in this global structure allow easy interpretation as well as high-level analysis of
the gathered data.
For every type of defect, separate timing information is stored. This includes the detection time and, as a calculated field, the speedup (see Section 5.3 for more details).
4.3 File traversal

The file traversal module consists of two functions for traversing a folder structure and is invoked by the detection framework. It is implemented as a stateful generator, saving memory by generating the file list in-place while detecting rather than creating a file list first and processing this list later. The current state of the generator is kept in a global structure.
5 Evaluation

No ground truth was available for evaluation. To assess the performance of the detection algorithms a simple solution was implemented: defects are synthesized and added to the audio data. This way a ground truth is available, and the quality of the assessment depends only on the quality of the synthesis and the files under test. This chapter describes the synthesis of defects, the evaluation procedure, the evaluation metrics, and three experiments. The results of the experiments are presented and discussed.
5.1 Synthesis of defects

If configured accordingly, synthetic defects are added to the audio data of the file under test. Defects can be synthesized for the defect types gap, jump, and noise burst. Of course, a synthetic defect is only an approximation of defects observed in real audio data. The synthesis algorithm emulates some characteristics of real audio to a limited extent.
5.1.1 Synthetic pauses

For the distinction of pause and silence refer to Section 3.1. Pauses are created by inserting samples at the position of the defect. The inserted sample values are drawn from an N(0, 1) normal distribution with an adjustable gain to emulate noise. This gain can be set to zero to create absolute silence. As a result, the length of the whole audio signal is increased.
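A sketch of this synthesis step in Python (the thesis implementation is in Matlab; the function name and seeding are illustrative):

```python
import random

def insert_pause(x, pos, length, gain=0.0, seed=0):
    """Insert a synthetic pause at pos: `length` samples drawn from N(0, 1)
    and scaled by `gain` are inserted, lengthening the signal (sketch).
    With gain = 0 the pause is absolute silence."""
    rng = random.Random(seed)
    pause = [rng.gauss(0.0, 1.0) * gain for _ in range(length)]
    return x[:pos] + pause + x[pos:]
```

Note that, unlike the noise burst synthesis below, the original samples are preserved; the signal simply becomes longer.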
5.1.4 Synthetic noise

Noise synthesis is based on silence synthesis. The only difference is that the noise signal has a higher power. A block of noise is generated by inserting samples drawn from a uniform distribution in the range −0.9 to 0.9. The inserted noise is not additive: the original samples are replaced. The waveform depicted in Figure 5.1 shows a synthetic noise burst in the left part and a natural defect in the right part.
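The replacing (non-additive) character of the noise synthesis can be sketched as follows (Python sketch of the Matlab implementation; the function name and seeding are illustrative):

```python
import random

def insert_noise_burst(x, pos, length, amp=0.9, seed=0):
    """Replace (not add to) `length` samples at pos with uniform noise
    in [-amp, amp], as described above (sketch)."""
    rng = random.Random(seed)
    burst = [rng.uniform(-amp, amp) for _ in range(length)]
    return x[:pos] + burst + x[pos + length:]
```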
Figure 5.1: Plot of synthetic noise: power plot, spectrogram and waveform
5.2 Evaluation requirements

The purpose of the evaluation is clear: it must be possible to obtain metrics of the performance of the algorithms.
5.3 Evaluation specification

Speedup S: the ratio of audio playing time to defect detection time,

S = playing time / detection time

as defined in [18]. This metric provides information about the efficiency of the algorithm. It is defined as (R⁺ × R⁺) → R⁺.
5.4 Evaluation plan

The evaluation procedure can easily be executed for a directory containing audio files. The general algorithm is depicted in Figure 5.2.
[Figure 5.2: evaluation loop — load configuration; while files are available: synthesize defects, detect defects, calculate metrics; finally calculate overall metrics and end]
Three different experiments have been conducted for evaluation purposes. The
first experiment (“good_bad”) uses a small number of files with known defec-
tive files (the “bad” set) and files without defects (the “good” set). The second
experiment (“collection”) was made with a rather large collection of music files
with defect synthesis enabled. The third experiment is a parameter optimization
for the jump detection algorithm, aiming at simultaneous optimization of several
parameters. The results are documented in the following sections.
5.5 Experiment 1: classification

For these files all detection algorithms were enabled and defect synthesis was disabled. The evaluation only considers files marked as defective and therefore does not consider individual defects found in a file. The classification into "good" and "bad" is taken as the ground truth.
Table 5.1 shows the defect classification for every file in the "good_bad" set, while Table 5.2 shows efficiency information of the detection algorithms.
Counting the positive and negative outcomes of Table 5.1 on a per-file basis, the following values result:

P = 0.92                    (5.7)
R = 0.84                    (5.8)
F = 0.88                    (5.9)
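These values follow the standard precision/recall/F-score definitions (cf. Equations 5.1-5.3), which in sketch form are:

```python
def prf(tp, fp, fn):
    """Precision, recall and F-score from true-positive, false-positive
    and false-negative counts (sketch of the standard definitions)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = 2 * p * r / (p + r)
    return p, r, f
```

As a consistency check, P = 0.92 and R = 0.84 give F = 2 · 0.92 · 0.84 / (0.92 + 0.84) ≈ 0.88, matching Equation 5.9.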
The set "collection" is a large set of music files (1715 files in total) of diverse genres and quality. Some of these files contain defects, introduced by bad encoding, ripping of defective Digital Audio Compact Discs, or file errors, while most of the files are in good shape.
In this experiment defect synthesis was enabled, so not only the synthetic errors are detected but also the real errors of the audio files are counted. The calculation of recall and precision therefore suffers in correctness, but the major part of the detected defects should originate from synthetic defects. With synthesis enabled, the exact positions of the (synthetic) defects are known. That information is used as ground truth to calculate recall and precision.
For low sample rate detection no effectivity metrics are available since no defect synthesis exists for this type. Recall and precision for the detection of jumps are very poor; this can be attributed to both the implementation and the parametrization of the detection algorithm. The actual results are shown in Table 5.3. The experiment used audio data with a total playing length of approximately 7 045 minutes and finished in 2 206 minutes, yielding a total speedup of 3.19.
As an example, a parameter study was conducted for the detection of jumps. A set of files was scanned repeatedly for defects with different detection parameters. For each of these runs the F-score (see Equation 5.3) was calculated to assess the quality of the parametrization.
As input the five “good” files out of the “good_bad” set were used. For these
files synthetic jumps were generated.
The parameters threshold and sigma were swept through a reasonable range. Figure 5.3 shows the results of this optimization: the F-score for every parametrization is inscribed in the plot and shows a clear maximum around threshold = 45 and sigma = 20.
6 Conclusions and further work

Generally, the implemented algorithms give good hints as to which files in a collection might be defective, but the results are not good enough for fully automatic classification or to justify automatic deletion of apparently defective files.
The gap detection performs well in both run-time and effectivity. Further heuristics would be needed to increase the recall rate on speech and electronic music. On speech data the distinction between erroneous gaps and intended musical pauses fails; the implemented heuristics would either need more careful parametrization or additional plausibility checking. Electronic music fabricated from prerecorded audio segments tends to contain gaps between these segments, unlike the signals produced by real instruments.
Jump detection needs improvement in both efficiency and effectivity: the computational effort should be reduced to gain larger speedups. To accomplish that, several speed-limiting factors have to be considered. First, the sliding-window scheme must be rethought and optimal values for frame sizes and frame offsets chosen. The detection has poor recall rates when used on electronic or heavy rock music. Further heuristics should be implemented to support plausibility checking and to exclude false-positive defects due to genre-inherent sound properties.
Noise bursts are detected quite well with feasible computational effort. The recall rates are high when the algorithm is applied to "noisy" music such as heavy rock: the distortion of the instruments, as well as the loud drums, are interpreted as noise bursts. It is possible either to improve the parametrization or to include further heuristics to exclude such false positives. The precision of the algorithm is reasonable.
Low sample rate detection is implemented quite efficiently, delivers good precision and recall rates, and distinguishes well between high-fidelity and low-quality recordings. One unresolved problem is the classification of vintage recordings. Since the distinction between low-quality and vintage bandwidth usage is purely subjective, it is hard to check for plausibility. Machine learning approaches could resolve this issue, but they contradict the initial assumptions of the implementation.
7 Bibliography
[25] mpg123 - Fast console MPEG Audio Player and decoder library. http://www.mpg123.de. As seen on 2010/05/10.
[26] Pristine Sounds 2000. http://www.sonicspot.com/pristinesounds/pristinesounds.html. As seen on 2010/03/12.
[27] Resurrect your old recordings. http://wwwmaths.anu.edu.au/~briand/sound/. As seen on 2010/03/12.
[28] SoundSoap. http://xserve1.bias-inc.com:16080/products/soundsoap2/. As seen on 2010/03/12.
[29] Wavearts MR Noise. http://wavearts.com/products/plugins/mr-noise/. As seen on 2010/04/01.
[30] Waves Audio Restoration Plugins. http://www.waves.com/content.aspx?id=91#Restoration. As seen on 2010/04/01.
RUDOLF MÜHLBAUER
Nationality: Austrian
Place of birth: Burgkirchen
EDUCATION
• 1989 - 1993 Primary school (Volksschule) Neukirchen a.d.E
• 1993 - 1997 Secondary school (Hauptschule) Neukirchen a.d.E
• 1997 - 2002 HTL Braunau, telecommunications engineering (Nachrichtentechnik)
• Studies of computer science since September 2006