0% found this document useful (0 votes)
7 views15 pages

Pothole Detection Using Computer Vision and Learning

This paper discusses various techniques for pothole detection using computer vision and learning, focusing on real-time and offline identification methods. It presents two stereo-vision and two deep-learning-based approaches for automatic pothole detection, providing experimental evaluations that show deep learning methods outperform traditional techniques. The study also highlights the challenges of pothole detection due to varying road conditions and the need for diverse datasets for further research.

Uploaded by

prasadrao_k
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views15 pages

Pothole Detection Using Computer Vision and Learning

This paper discusses various techniques for pothole detection using computer vision and learning, focusing on real-time and offline identification methods. It presents two stereo-vision and two deep-learning-based approaches for automatic pothole detection, providing experimental evaluations that show deep learning methods outperform traditional techniques. The study also highlights the challenges of pothole detection due to varying road conditions and the need for diverse datasets for further research.

Uploaded by

prasadrao_k
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

3536 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 21, NO.

8, AUGUST 2020

Pothole Detection Using Computer


Vision and Learning
Amita Dhiman and Reinhard Klette

Abstract— Techniques for identifying potholes on road surfaces


aim at developing strategies for real-time or offline identification
of potholes, to support real-time control of a vehicle (for driver
assistance or autonomous driving) or offline data collection for
road maintenance. For these reasons, research around the world
has comprehensively explored strategies for the identification of
potholes on roads. This paper starts with a brief review of the
field; it classifies developed strategies into several categories. We,
then, present our contributions to this field by implementing
strategies for automatic identification of potholes. We developed
and studied two techniques based on stereo-vision analysis of
road environments ahead of the vehicle; we also designed two
models for deep-learning-based pothole detection. An experimen-
tal evaluation of those four designed methods is provided, and Fig. 1. Complex geometric shape of a pothole not supporting a precise
conclusions are drawn about particular benefits of these methods. definition of a “border”.

Index Terms— Pothole detection, stereo vision, deep learning.


to a Chicago Sun-Times analysis of city data, drivers filed
I. I NTRODUCTION
11,706 complaints about potholes with the city in the first two

D ISTRACTED driving, speeding or other driver errors are


main causes of accidents worldwide; however, bad road
conditions are also a significant cause. The condition of a
months of 2018 [1]. In the UK, about 50 cyclists are seriously
injured every year because of Britain’s poor roads [2]. In India,
3, 597 people died due to potholes [3].
road turns out to be dangerous due to number of reasons Potholes may cause significant costs. For example, in 2017,
such as flooding, rain, damages caused, e.g., by overloaded big different city councils in New Zealand have spend the follow-
vehicles, or poor physical maintenance of the road. Road con- ing in order to fix potholes: Christchurch 525, 000, Wellington
dition assessment involves identifying and analyzing distinct 12, 782, Invercargill 60, 000, and Dunedin 27, 000; see [4].
types of road surface distress, like potholes, cracks or texture Extensive research has been carried out for macro-scale
changes as being maintenance-relevant features. Macro-scale road issues, such as for estimating the road surface [5] (also
road features are defined by being of traffic relevance. For known as road manifold estimation), detection of obstacles
example, speed bumps are also traffic-relevant features; they that are protruding from the road [6], recognition of traffic
also require detection for driver assistance. isles [7], or pothole detection [8]. Automotive companies such
A pothole is a special case of road distress. It can be an as Tesla, Toyota, Ford, or BMW announced to be able to
arbitrarily shaped structural defect of a road, and a precise deliver autonomous cars by about 2020 [9]. However, road
identification of its “border” is typically impossible; see Fig. 1. pothole detection as a particular research subject still demands
They can be vaguely outlined, but their maximum depth can more research (as, certainly, many more related topics in this
be identified more precisely. Objects such as cars, persons, area).
cyclists, dogs or cats are of specifically defined shapes (and Mobile crowdsourcing based applications have been devel-
now detected by deep learning due to appearance properties); oped to report about road hazards, such as Santani et al. [12].
compared to this, we can certainly claim that the detection of In 2017, a study conducted in Taoyuan, Taiwan, used a data-
a pothole, being of arbitrary shape and of complex geometric analytic approach applying correlation and regression analy-
structure, is a challenging object-detection task. sis; [11] shows that areas, identified (by crowdsourcing) for
Potholes present a grave danger to human life. We just having a high frequency of road potholes, resulted in a higher
state a few facts from related studies worldwide. According number of traffic accidents. In 2018, one of the largest pizza
Manuscript received March 6, 2019; revised May 25, 2019; accepted chains in the U.S., dispensed a special grant to fix potholes at
July 12, 2019. Date of publication August 6, 2019; date of current version selected locations, as potholes caused irreversible damage to
July 29, 2020. The work of A. Dhiman was supported by a Ph.D. Scholarship pizzas during their delivery [13].
from Aukland University of Technology. The Associate Editor for this paper
was D. F. Wolf. (Corresponding author: Reinhard Klette.) Authors published already about three developed methods
The authors are with the School of Engineering, Computer and for pothole detection in conference papers [14]–[16]. This
Mathematical Sciences, EEE Department, Auckland University of Tech- paper builds on those reported materials. This paper provides
nology, Auckland 1010, New Zealand (e-mail: [email protected];
[email protected]). at first a review on techniques for pothole identification,
Digital Object Identifier 10.1109/TITS.2019.2931297 extending brief notes on related literature in those previous
1524-9050 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: GITAM University. Downloaded on April 25,2025 at 12:41:45 UTC from IEEE Xplore. Restrictions apply.
DHIMAN AND KLETTE: POTHOLE DETECTION USING COMPUTER VISION AND LEARNING 3537

conference papers. This paper also presents (with additional TABLE I


material) the three previously published methods, adds one P UBLIC R EPORTING ; L ISTED N AMES ON THE L EFT I DENTIFY THE W EB -
SITES OF T HOSE A PPLICATIONS ( E . G . www.fixmystreet.com).
more method, and provides a comparative evaluation of all C ITIZEN H OTLINE 1999 IS THE N AME OF AN I NNOVATIVE O PEN
four methods. For this evaluation, we use here (first time) a D ATA P LATFORM U SED IN TAIWAN ; THE R ELATED R ESEARCH
more diverse set of data. Potholes presents different challenges P UBLICATION WAS IN 2017
under different weather, lighting, road geometry or traffic con-
ditions. As there is no online benchmark dataset available for
pothole detection, we accumulated data from multiple sources,
and suggest to use those five different datasets, recorded under
different weather conditions, for future discussions of progress
in this field of pothole detection.
Contributions of this study are two different approaches
of pothole detection based either on 3D scene reconstruction
TABLE II
or on state-of-the-art deep learning techniques. The proposed
V IBRATION -BASED M ETHODS
strategies allow us identifications of potholes from a distance
in an accurate manner as supported by experiments. Evaluated
experiments demonstrate that state-of-the-art deep learning-
based methods significantly outperform the conventional 3D
scene reconstruction-based methods. We believe that this study
sheds new lights into the field of pothole detection by bridging
a gap caused by datasets recorded under varying illumination
conditions. This study is also an overview of previously
conducted attempts to detect defects on road surface.
This paper is structured as follows. Section II provides a
review of reported work on pothole detection. Section II-
A describes manual techniques, i.e. techniques which use costly hardware or software. Citizens can report a pothole by
a human as sensor for the detection of road anomalies. capturing its picture with their mobile devices and later by
Section II-B reviews techniques that use accelerometers or uploading or sending to a website or application, or by merely
gyroscopes as vibration detection systems to measure vibra- sending information about a pothole’s location. Some reported
tions that turn out in a vehicle whenever it strikes any systems are listed in Table I.
distress on a road. Section II-C presents basic strategies
of implying image or video processing techniques for road B. Vibration-Based Methods
distress detection. Section II-D reviews techniques based on Vibration-based methods include approaches of collecting
3-dimensional (3D) scene reconstruction. Section II-E lists abnormal vibrations [17] caused in the vehicles while driving
current work using learning strategies. Section III informs over road anomalies. Vibrations of the vehicle are collected
about methods proposed by the authors. This starts in using an accelerometer; see Table II. The main drawback of
Section III-A with a technique using single stereo-frame data, the vibration-based methods is that the vehicle has to drive
further improved in Section III-B by the use of multi-frame over the pothole in order to measure the vibrations caused by
stereo data based on visual odometry. Sections III-C and III-D the pothole on the road.
address then learning-based strategies, first by using trans- M. Ghadge et al. [18] used an accelerometer and GPS to
fer learning method based on Mask R-CNN, then by using analyze the conditions of roads to detect locations of potholes
transfer learning using YOLOv2. Section IV gives information and bumps using a machine learning approach, defined by
about the datasets used for comparative performance evalua- K-means clustering on training data and a random-forest
tion. Section V details the studied experiments. Section VI classifier for testing data. The data is divided first into two
concludes. clusters of “pothole” or “non-pothole”, and then a random-
forest classifier is used to validate the proposed result provided
II. L ITERATURE R EVIEW by the clustering algorithm. It is reported that clustering does
This review might contribute to the motivation of developing not perform well when clusters of different size and severity
automated road-surface anomaly detection systems for various are involved; size and severity of a pothole are the major
real-world environments. There have been much advancements properties considered in the system.
in this technical era recently. F. Seraj et al. [19] used a support vector machine (SVM)
for a machine-learning approach to classify road anomalies.
The proposed system uses accelerometer, gyroscope and a
A. Public Reporting Samsung galaxy as sensors for data collection; data labeling
This type of systems enhances civic engagements by gov- is performed manually (by a human) and then a high-pass
ernment, and facilitates the participation by citizens of the filter is used to remove the low-frequency components caused
country. These systems use the public as sensors [10]. The due to turns and accelerations. Ren et al. [20] used
main advantage of this method is that there is no need for K-means clustering to detect potholes based on data collected

Authorized licensed use limited to: GITAM University. Downloaded on April 25,2025 at 12:41:45 UTC from IEEE Xplore. Restrictions apply.
3538 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 21, NO. 8, AUGUST 2020

by using an accelerometer and GPS. The proposed system The proposed system considers shadow effects on the road
lacks accuracy regarding the isolation of potholes from other and aims to remove those effects of shadows using a shadow-
road anomalies. removal algorithm. The system is unable to perform in rainy
weather. The authors concluded that the system should be
further extended to perform also on video data as the system
C. 2D-Vision-Based Methods
was only tested on 2D images collected using an iPhone
Vision-based methods use 2-dimensional (2D) image or camera with 5 megapixel image resolution.
video data, captured using a digital camera, and process Bashkar and Manohar [44] propose a methodology of
this data using 2D images or video processing techniques detecting pothole’s mean depth by using SURF features on
[21], [22]. The choice of the applied image processing tech- uncalibrated stereo pairs of images (without employing dis-
niques is highly dependent on the application for which 2D parity images). A particular methodology has been developed
images are being processed. for this purpose, but appears to suffers from uncalibrated stereo
Koch and Brilakis [8] proposed a method aiming at a rectification; it is far from providing good results.
separation of defect and non-defect regions in an image using Ying et al. [46] proposed a system which can detect road
histogram shape based threshold. The authors consider the surface based on a feature detector which is shadow-occurence
shape of a pothole as being approximately elliptical based on optimized. This system uses a connected-component-analysis
a perspective view. The authors emphasize on using machine algorithm and other morphological algorithms and is demon-
learning in future work, and claim that the proposed work strated on images of datasets provided by KITTI [47] and
already results in 86% Accuracy along with 86% Recall ROMA [48].
and 82% Precision, with the common definitions of Thekkethala et al. [25] used two (stereoscopic) cameras
TP and applied stereo matching to estimate depth of a pothole
Precision = (1)
TP + FP on asphalt pavement surface. After performing binarization
TP and morphological operations, a skeleton of a pothole is
Recall = (2)
TP + FN estimated. The system is tested on 24 images and no estimates
TP + TN of depth have been provided. The system can detect skeletons
Accuracy = (3)
TP + TN + FP + FN of potholes of great depression. Authors did not estimate the
road manifold.
Precision · Recall
F1 = 2 · (4)
Precision + Recall
where TP is the number of true positives, FP of false positives, D. 3D Scene Reconstruction-Based Methods
TN of true negatives, and FN of false negatives. 3D scene reconstruction is the method of capturing the
Tedeschi and Benedetto [10] recently suggested a system shape, depth, and appearance of objects in the real world; it
for automatic pavement distress recognition (ADPR) which relies on 3D surface reconstruction which typically demands
is able to perform in real time by identifying road distress more computations than 2D vision. Rendering of surface
including fatigue cracks, longitudinal and traversal cracks, and elevations helps to understand accuracy during the design
potholes. The authors used a combination of technologies of of 3D vision systems. 3D scene reconstruction can be based
the OpenCV library and for the classification of the three on using various types of sensors, such as Kinect [28], stereo-
different types of road distresses, three classifiers have been vision cameras, or a 3D laser. Kinect sensors are mainly used
used based on local binary pattern (LBP) features; they in fields of (indoor) robotics or gaming.
achieved more than 70% for Precision, Recall, and the 3D lasers define an advanced road-survey technology; com-
F1-measure. pared to camera-based systems it still comes with higher
Authors discussed difficulties of defining the severity of costs; [30], [31] report survey cycles of (usually) once in four
considered kinds of road distresses. For texture classification years. A 3D laser uses a laser source to illuminate the surface
the authors used Haralick’s features [23] based on gray level and a scan camera for capturing the created light patterns. [32]
co-occurrence matrices (GLCMs) and then classified image applied the common laser-line projection; the recorded laser
regions using a tool from [24]. line deforms when it strikes an obstacle (and supports thus the
Ryu et al. [26] proposed a method to detect potholes both 3D reconstruction), but does not work well, e.g., on wet roads
for asphalt or concrete road surfaces using 2D images collected or potholes filled with water.
by a mounted optical device on a survey vehicle. The system Stereo-vision cameras are considered to be cost-effective
mainly works in three steps of image segmentation, candidate as compared to other sensors. Stereo vision aims at effective
region extraction and decision. The system fails to detect and accurate disparities, calculated from left-and-right image
potholes in darker images (image regions) due to shadows pairs, to be used for estimating depth or distance; see, for
(e.g. of trees or cars) present in real-world road recordings. example, [33]. Commonly, the canonical left-right calibrated
Powell and Satheeshkumar [27] present a method for the stereo camera setup is used while aiming at a reconstruction
detection of potholes by segmenting images into defected or dense 3D surfaces. A disparity map represents per-pixel cor-
non-defected regions. After extracting the texture information respondences for a rectified stereo pair. Figure 2 illustrates a
from defected regions, this texture information is compared recorded 3D scene with a calculated (color-encoded) disparity
with texture information obtained from non-defected regions. map.

Authorized licensed use limited to: GITAM University. Downloaded on April 25,2025 at 12:41:45 UTC from IEEE Xplore. Restrictions apply.
DHIMAN AND KLETTE: POTHOLE DETECTION USING COMPUTER VISION AND LEARNING 3539

TABLE III
E XAMPLES OF 3D R ECONSTRUCTION -BASED
M ETHODS AND U SED S ENSORS

Fig. 2. Recorded road scene with transparent color-encoded disparity map


calculated using stereo global matching algorithm [29]. The used color key
is shown on the right; disparity 250 encodes a distance very close to the host
vehicle and 0 encodes “very far away”.

Table III summarizes a few 3D reconstruction-based meth-


ods for detecting road distress.
Garbowski and Gajewski [41] presented a semi-automatic
pavement failure detection system (PFDS) which is a part of
the FEMat [42] road package (UDPhoto toolbox). It allows
a user to inspect the condition of road pavement based on
calculated clouds of 3D points. The presented system considers
a small region of interest (ROI) in reference to a larger region
of a road surface and is able to detect certain types of cracks
including “alligator cracks”, but not potholes.
Shen et al. [43] propose the use of Takata’s stereo-vision
Fig. 3. Reconstructed (and texture-mapped) 3D point cloud using calculated
disparities. Green arrows indicate locations of two potholes.
system for performing a road surface preview along the host
vehicle. Video data recorded with the used compact stereo-
vision sensor (with a baseline of 16.5 cm) is analyzed in
For briefly introducing the stereo-vision notation (later
an embedded system, already tested in various vehicles, also
needed in this paper), consider coordinates (x, y) ∈  of a
in combination with various driver assistance systems such
pixel in an image, where  denotes the image domain of
as forward collision warning (FCW), automatic emergency
the left image. A disparity map D :  → R+ 0 defines the breaking (AEB), or lane departure warning (LDW). Authors
translation of image coordinates in the left image into those
state that the proposed system achieves satisfactory accuracy;
of a detected corresponding pixels (x − D (x, y) , y) in the
they also state that it does not perform well when there is glare
right image.
on the road surface.
A 3D point (X, Y, Z ) in the 3D scene is mapped into an
Calculated disparities within detected road-surface image
image pixel at (x, y) following a perspective projection
segments support the estimation of a manifold, approximating
X Y the road-surface. Commonly the road surface is assumed to be
x = fx · + x c , y = f y · + yc (5)
Z Z planar (i.e., the manifold is thus a plane). But this planarity
where f x and f y are the focal lengths in x and y coordinate assumption is often not corresponding to actual uneven road
direction, and (x c , yc ) is the principal point in the image plane. surfaces. To simplify, the road manifold is often modeled in
Given a calibrated and rectified stereo camera pair, let driving direction by a profile, i.e. a curve whose parallel trans-
d = D(x, y) be the disparity value assigned to a left-image lation left-to-right creates the road manifold. A line creates a
pixel location (x, y), the Z -coordinate can be triangulated as plane, and a quadratic polynomial profile creates a quadratic
follows: road manifold.
Quadratic road manifolds are discussed by Ai et al. [6].
b
Z = fx · (6) For a consideration of twisting and bending surfaces of roads,
d see Mikhailiuk and Dahnoun [55]; their algorithm has been
Here, b is the length of the baseline which is the distance implemented on a Texas Instrument C6678 multi-core SoC
between left and right camera optical centers. This allows us digital signal processor.
that X and Y can also be recovered; the whole process is Zhang et al. [34] proposed an efficient algorithm to estimate
called triangulation. Figure 3 shows a cloud of 3D points, the size, depth, position and severity of potholes by modeling
reconstructed from a disparity map, following the triangulation the road surface as a quadratic manifold by using a random
process. sample census (RANSAC) approach. Pothole detection and

Authorized licensed use limited to: GITAM University. Downloaded on April 25,2025 at 12:41:45 UTC from IEEE Xplore. Restrictions apply.
3540 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 21, NO. 8, AUGUST 2020

TABLE IV
E XAMPLES OF CNN S FOR I MAGE S EGMENTATION ,
AND U SED D ATA S ETS

Fig. 4. Obtained elevation-difference map using a best-line fit in y-disparity


space. In the applied color key, blue corresponds to points being 5 cm or
segmentation are achieved by using a connected-component more below the road surface, and red to points being 5 cm or more above the
road surface).
labeling (CCL) algorithm.
Proposed methods may also follow a multi-sensor
However, the system is not able detect potholes under varying
approach [36]–[40]. Tseng et al. [45] developed an automated
illumination conditions.
survey robot which performed in simulated test-field environ-
The authors [52] have proposed a CNN based model mainly
ments to detect five types of distress, namely alligator cracks,
to classify a region on a road as pothole or non-pothole.
small patches, potholes, rectangular, and circular manhole
The author has collected the dataset using smartphone camera
covers.
mounted on the front windshield of the vehicle and the authors
have used preprocessed cropped frames with ROI to train the
E. Learning-Based Methods proposed model. The authors [53]have developed CNN based
model using thermal images to classify whether an image
For identifying objects in image data, various convolutional
has pothole or not. The thermal images are recorded using
neural networks (CNNs) have been proposed, see Table IV,
a thermal camera.
such as Chen et al. [56], RefineNet [57], PSPNet [58], or
the “large-kernel-matters” proposal in [59]. For identifying III. P ROPOSED AND T ESTED M ETHODS
an object at pixel level, fully convolutional neural networks
(FCNs) have been proposed; for example, see Long et al. [60], This section presents four different methods for pothole
or SegNet by Badrinaraynan et al. [61]. detection which are proposed by the authors. Comparative
Regarding road damage detection, Zhang et al. [67] propose performance evaluations will be given later.
a CrackNet to predict class scores for all the pixels in a
considered damage. Song et al. [62] use a CNN approach to A. Single-Frame Stereo-Vision-Based Method - SV1
detect potholes. The authors used a smartphone as a sensor Using [34], we identify potholes by detecting areas within
to acquire movement information and the Inception V3 [63] the road which are evaluated as being “below the road
classifier; they adapted the final fully-connected layer in the surface”. Thus, for identifying a pothole, first we need to
CNN to the given task. estimate the road manifold. There is a variety of techniques
Maeda et al. [64] used a CNN, trained by using a vast for modeling a road manifold; we applied the following two
dataset of road images collected in Japan, to detect road- methods for obtaining a road manifold.
surface damage. Some authors used SSD Inception V2 [65] 1) y-Disparity Line Fitting Model: A calculated dense
or SSD MobileNet [66] to identify different sorts of road disparity map D is used to compute the y-disparity map
damages. Detected road damages are identified by generating [68], [69] as follows:
bounding boxes (i.e., not at pixel level).
V (y, d) = card{x : 1 ≤ x ≤ Ncols ∧ D(x, y) = d} (7)
Staniek [49] uses stereo vision cameras for the acquisition
of road surface in the form of clouds of 3D points. The where Ncols is the width of the input images in pixels, and
author emphasizes on solving stereo matching by using a d ∈ [0, dmax ) is a disparity bounded by the maximum value
Hopfield neural network, which is a special case of a recurrent dmax . V (y, d) gives the number of pixels sharing the same
artificial neural network. The author achieved 66% accuracy disparity d in the y-th row of the disparity map.
when evaluating matching pixels (for 50 image pairs) using a Assuming that recorded images show “large” parts of the
CoVar method [54] for evaluation.. road surface, the lower envelope of data in the y-disparity map
The authors [50] have proposed a model to detect potholes is approximated by a straight line d = f (y) using a random
based on YOLOv2 architecture. However, their reported archi- sample census (RANSAC)-based approach. The estimated line
tecture differs from our proposed model in LM2. Also the d = f (y) is applied to the disparity map D for identifying a
tested frames basically show not much more than potholes, road manifold. If |D(x, y) − f (y)| is smaller than a threshold
while the real road scene is much complex. The CNN model (say, 1) then the pixel at (x, y) is considered to show a 3D
proposed by the authors [51] to detect potholes has been point on the road.
trained on a CPU and experiments shows that CNN based An elevation-difference map, obtained his way by the signed
model perform better than Conventional SVM based approach. elevation difference to the road manifold, is shown in Fig. 4.

Authorized licensed use limited to: GITAM University. Downloaded on April 25,2025 at 12:41:45 UTC from IEEE Xplore. Restrictions apply.
DHIMAN AND KLETTE: POTHOLE DETECTION USING COMPUTER VISION AND LEARNING 3541

Fig. 6. Scene rendered in image-disparity space after dominant plane


detection; the planarity of road pixels is assumed.

Fig. 5. Integral disparities define non-linearly distributed depth layers in 3D


space. Courtesy of Waqar Khan [35].

The road surface is fairly planar in this scene close to the


middle of the road, but bends downward to the right, towards
the curb. Two potholes are visible in the lower right, not yet
part of the general situation on the right of the road.
Note that disparities are not linearly related to depth; see,
e.g., Fig. 5. Thus, this y-disparity-based approach can only
be approximately correct in regions relatively close to the
ego-vehicle (i.e. the vehicle the camera was operating in). Fig. 7. Elevation-difference map after 3D plane fitting in disparity space;
2) 3D Plane Fitting in Disparity Space: An (assumed) the used color key is again shown on the right.
planar road manifold was approximated in [14] by directly
considering disparities, not via a time-consuming 3D scene Given a certain noise level, to consider a point to be an inlier
reconstruction way. A plane in 3D Euclidean space is repre- with respect to a plane hypothesis (n̂, δ̂), we use a threshold
sented by a0 X + a1 Y + a2 Z + a3 = 0, where a0 , . . . , a3 ∈ R εmax ∈ R+ if
are the plane’s coefficients. The planarity in 3D space is
reserved in image-disparity space (x, y, d) due to perspective |ε(p; n̂, δ̂)| ≤ εmax (12)
(i.e. pinhole) projection. A RANSAC process, based on Eq. (12), is carried out to locate
To define a plane uniquely, we use the normal-offset para- the dominating plane in the scene.
metrization. Thus we use a unit 3-vector n and an offset The process starts with a minimum set of points drawn
parameter δ ∈ R+ , where from the population, and a hypothesis is obtained by finding
|a3 | the best fit plane of the randomly selected samples. To verify
δ= (8) and support our hypothesis, first we find all the inliers from
a0 2 + a1 2 + a2 2 the population. If the hypothesis is supported by significantly
many inliers, say 50% of the population, then it is considered
and a model candidate. Such process is repeated for a predefined
δ number of iterations. In the end, the candidate model which is
n= · (a0 , a1 , a2 ) (9)
a3 supported by highest number of inliers is considered as winner
of the selection process. Following the described RANSAC
A point p = (x, y, d) is up to δ on the plane (n, δ) if and process, a dominating plane is found in a robust manner. See
only if it satisfies the following equation: Fig. 6 for an example.
This dominating plane can then again be used for find-
p n − δ = 0 (10) ing pixels that are below the assumed planar road sur-
face, analogously to the case of a y-disparity-based surface
The signed distance of an off-plane point p is defined by
model. Following (11), we obtain an elevation-difference map.
ε(p; n, δ) = p n − δ (11) See Fig. 7.
After carefully comparing results, using either the
where the point is considered to be above the plane if ε > 1, y-disparity based road-surface approximation or the
with an up-vector defined by n, and a point is below the plane plane-approximation in disparity space for road manifold
if ε < 1. estimation, we decided for the latter one. Also compare

Authorized licensed use limited to: GITAM University. Downloaded on April 25,2025 at 12:41:45 UTC from IEEE Xplore. Restrictions apply.
3542 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 21, NO. 8, AUGUST 2020

Figs. 4 and 7. Thus, method SV1 is defined by plane-


approximation in disparity space. To carry out further
investigations, pixels more than one unit below, or ε < 1, are
considered to be pothole candidates. We also use 8-adjacency
connectedness analysis [70], [71] to remove regions larger
than a reasonable size.
Method SV1 considers individual frames (i.e. stereo pairs of
images). The detected (planar) road manifold does not consider
any bending of road surface (e.g., towards the curbs).

B. Multi-Frame Fusion-Based Method - SV2


The digital elevation model (DEM) approach in [6] provided Fig. 8. Visualization of a segment of a DEM providing a close look at a
a motivation for a multi-frame fusion in [15] for improved pothole, using cell resolution of 1 × 1 cm2 .
digital elevation models. For multi-frame integration, we first
solve camera poses with respect to a reference frame using 2) Visual Odometry: To accumulate 3D data measured in
a visual odometry (VO) technique. After point clouds from different frames, their poses has to be recovered with respect to
different frames are aligned to the reference frame, we further the chosen reference coordinate system. For example, consider
transform them into a road-centered space, using a rigid trans- that a reference coordinate system is taken for Frame t, and 3D
formation solved by means of principal component analysis data measured for Frame t up to Frame t + m (e.g., m = 5)
(PCA). This section describes thus an improved strategy over need to be mapped uniformly into the reference coordinate
the single-frame pothole detection method, described in the system of Frame t. To recover the self-motion of the camera
previous subsection. from a video sequence, we implemented a 3-stage VO hybrid
1) Digital Elevation Model: It is common to represent model as described in this subsection.
a DEM by a regular grid of squares each labeled by a First stage. An efficient perspective-from-n-point (PnP)
height difference to a zero-height plane. Figures. 4 and 7 algorithm [72] is used, along with acquired 3D-to-2D map-
showed elevation differences with respect to the perspective ping, to compute an initial pose (R, t) consisting of R ∈
(i.e. pinhole) transform. Now we basically aim at a uniform S O(3), a rotation matrix, and t ∈ R3 , a translation vector.
height-difference representation across the recorded scene. Second stage. During this stage, we adopt the SURF feature
The construction of a DEM M is as follows. For each detector and descriptor extractor [73] and camera intrinsics to
point (X, Y, Z ) ∈ R within a range of interest, defined by derive an essential matrix, to calculate Sampson distance for
the rectangle X ∈ [X min , X max ] and Z ∈ [Z min , Z max ] ahead each pair of matched key points (see [74] for details). Subject
of the ego-vehicle and an assumed height threshold Ymax , to the filtered correspondences, the reprojection error is min-
an accumulating cell (i, j ) ∈ Z2 is given by imized using the Levenberg-Marquardt (LM) method [75] in
    the sum-of-squares form:
Z − Z min X − X min  ⎛ ⎡ ⎤ ⎞ 
i= , j= (13) P  Xi   2
W W   x 
ϕRPE (R, t) = π ⎝R ⎣ Yi ⎦ + t⎠ − i  (14)
 y 
where W is a chosen size of the used grid such that every i  Zi i 
cell spans an area of W × W in the X Z -plane, and [a] is the
where P denotes a set of sparse 3D-to-2D correspondences,
nearest integer to real number a. The value M(i, j ) of cell
and π : R3 → R2 symbolizes the pinhole projection function.
(i, j ) is decided by all the points assigned to cell (i, j ). Here
Third stage. During this stage, we warp the left image I L
we build M(i, j ) by the averaged signed distance to the road
of the last stereo frame (i.e. Frame t + m in our example
plane, of all points in cell (i, j ).
above) to the current frame using a pose hypothesis being
Now we use a PCA technique to find a rigid transformation
tuned, and iteratively reducing the sum-of-squares difference
that normalizes the point cloud in the given range of interest,
in intensities. We use an LM algorithm to approach a local
such that the Z -axis aligns to the primal axis of the road, and
minima of (15), starting with the pose previously minimized
the Y -axis is parallel to the normal vector n of the plane. After
subject to (14). The objective function here is defined as
the transformation is applied to the point cloud, we define the
follows:
transformed space as the road-centered space (RCS), on which ⎡ ⎛ ⎡ ⎡ ⎤ ⎤⎞ ⎤2
the DEM is constructed. Q Xi
So far this is all done for a single stereo frame only. Figure 8 ϕINT (R, t) = ⎣ I L ⎝π ⎣R ⎣ Yi ⎦ + t⎦⎠ − I L (x i , yi )⎦
shows a 3D visualization of a segment of a calculated DEM. i Zi
There is a regular grid, bended by following the calculated (15)
values M(i, j ). The used color key for elevation differences
is also shown at the bottom of the figure. Note that we could where Q is a set of pixels with valid depth data.
also show the DEM as a regular colored grid if deciding for 3) Weight Assignment and Multi-Frame Fusion: We adopt
a straight top-down view. weighted averaging and a weight is derived, for each data

Authorized licensed use limited to: GITAM University. Downloaded on April 25,2025 at 12:41:45 UTC from IEEE Xplore. Restrictions apply.
DHIMAN AND KLETTE: POTHOLE DETECTION USING COMPUTER VISION AND LEARNING 3543

high-level features at later layers. The network accepts an


image of 1, 024 × 1, 024. The image is padded with zeroes
if it is not given as per the assumed aspect ratio. FPN is
another improved feature extractor, a second pyramid that
allows features at every level to have access to both lower-
and higher-level features.
Mask R-CNN is composed of a two-step framework. The
first step scans the whole image for generating proposals.
The second step classifies the proposals, generates bounding
boxes, and also masks of an object.
The main modules of Mask R-CNN are as follows:
Fig. 9. From left to right: Road surfaces from 20 accumulated frames before
and after hole filling, identified local minima (as highlighted by red marks). 1) Region Proposal Network: RPN is a lightweight neural
network that scans over the backbone’s feature map, using a
point, by evaluating its disparity. Given the observed left image sliding window to generate anchors that are typically boxes
I L , the block-wise correlation is defined as distributed over the image area. The sliding window operation
   is handled by its convolutional nature, and this is very fast
 I L ( p) − μx y I L ( p) − μx y on a GPU. The output of the RPN is an anchor class and a
C(x, y) = (16) bounding-box refinement.
σx y σx y
p∈A(x,y) 2) Refined Bounding Boxes: To precisely map bounding
where A(x, y) denotes neighbors centering at pixel (x, y), μx y boxes to the regions of an image, Mask R-CNN also improves
and σx y are local mean and standard deviation, respectively, the RoIAllign layer of the network for pixel-level segmenta-
calculated from A(x, y) in image I L , and μx y and σx y are tion. It removes the harsh quantization of the RoIPool layer
those calculated from the reconstructed image I L . to properly encapsulate the extracted features with the input.
For a good disparity estimate in (x, y), the correlation 3) Instance Masks: The mask branch is a CNN that accepts
C(x, y) will be close to 1, while an inaccurate estimate will positive regions as input generated by the classifier, and
lead to a low coefficient, as low as −1. We use a normalized predicts a low resolution 28 × 28 soft mask for it. A soft
indicator W (x, y) = (C(x, y) + 1)/2 to weight each point mask differs from a binary mask as these are represented by
during the accumulation process. float numbers and hold more details.
In Fig. 9, accumulations over 5, 10, and 20 frames of a The main working steps of Mask R-CNN are as follows:
tested sequence are rendered. If depth data from more frames • Retraining an RPN end-to-end for a region proposal net-
are integrated, the resulting DEM becomes denser and presents work, initialized by a pre-trained CNN image classifier.
less missing cells. IoU > 0.7 and < 0.3 define positive or negative samples,
Method SV2 is defined by the described accumulation respectively. To formally apply Intersection over union
approach. Potholes are detected in accumulated DEMs as in IoU, a ground truth box G and a predicted bounding
method SV1 (e.g. with connected-component analysis). box P are used, and define
Comparing SV2 to the single frame approach SV1 described IoU = |G ∩ P|
in Subsection III-A.2, the multi-frame DEM approach not only (17)
|G ∪ P|
models the road manifold in a more reliable way, but also
where G ∩ P is the intersection, G ∪ P the union, and |.|
provides more accurate geometric measures such as depth and
the of the resulting sets.
size of each pothole. Morphological opening operations in SV2
• A small n × n window slides over the convolved feature
follow, as for SV1, ideas published by Z. Zhang [34] with some
map of the entire image.
slight modifications. We modeled the road manifold for SV2
• Anchor is produced to predict the multiple regions,
(within a local context) by using a second order polynomial
at each sliding position.
curve as profile, to consider also twisting or bending of the
• Train an object detection model by using the proposals
surface of the road.
obtained by RPN.
• Fine tune the layer of Mask R-CNN, according to the
C. Using Transfer Learning With Mask R-CNN - LM1 object class name.
Inspired by the results of CrackNet [67], we applied transfer For transfer learning, we used weights trained on the
learning [77] using Mask R-CNN [78]. It predicts a soft mask COCO [85] dataset and adapted the weights to identify pot-
to delineate the boundary of each instance at pixel level. holes. We used 247 images for training dataset, where 50 of the
The origin of Mask R-CNN is region-based convolutional CCSAD’s Urban Sequence 1, 100 of DLR, 48 Japan
neural network (R-CNN) [79], published in 2014. The R-CNN and 49 ofSunny dataset collectively and validation dataset
incrementally improved into Fast R-CNN [81] and then Faster comprised of 50 frames are also selected using same datasets.
R-CNN [82]. For test dataset, we used 50 images from CCSAD’s
The backbone of Mask R-CNN implementation uses sequences 2 and very challenging PNW dataset. We also
ResNet101 [83] and FPN [84]. A ResNet is a standard feature used data augmentation techniques in order to compensate for
extractor which detects low-level features at early layers, and less data. For LM1 we use left-right flips for augmentation.

Authorized licensed use limited to: GITAM University. Downloaded on April 25,2025 at 12:41:45 UTC from IEEE Xplore. Restrictions apply.
3544 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 21, NO. 8, AUGUST 2020

TABLE V D. Using Transfer Learning With YOLOv2 - LM2


E VALUATION M EASURES ( IN %) FOR VALIDATION D ATASET U SING T ECH -
NIQUE LM1 AS S HOWN IN F IG . 12
For real time detection of potholes, we used transfer learning
using another object detector- You only look once (YOLOv2).
YOLOv2 [86] based on regression algorithm, predicts classes
and bounding boxes for the whole image. A single CNN is
used for both classification and localization of an object in
YOLO.
Our YOLOv2 experiments is based on 22 convolutional
layers and 5 maxpool layers.
The input image gets divided into a c × c grid cell. The
purpose of that grid cell is to find that object, whose center
falls into particular grid cell. In YOLOv2 the input image
size is 416 × 416 with five maxpool layers. So, grid cell size
We excluded 6 frames from CCSAD’s Urban Streets is (416 × 416)/ pow(2, 5) = 13. These grid cells produces
sequences 1 for comparison purpose with SV2. We started N bounding boxes along with their confidence scores. Next
with learning rate of 0.01 and it resulted in very large update step is to perform non-max suppression, which is a process
to weights. As we used keras with TensorFlow, the optimizers of removing bounding boxes with low object probability and
are implemented differently. Therefore, we set learning rate as highest shared area.
0.001 and did not lower it further. We train the network using The bounding box consists of 5 numeric predictions: confi-
stochastic gradient descent and a learning momentum of 0.9. dence, x, y, width, height. where confidence score represents
We used a batch size of 2 and for 30 epochs it took 14 hours Intersection over Union IoU between predicted and ground
on a GeForce GTX 1080 GPU. truth box. x, y are coordinates center of box relative to grid
As we used images from four different datasets, the image cell, width and height are relative to input. During testing
dimensions were different. To keep an aspect ratio of uniform the confidence score represents how likely and accurate the
size 1, 024 × 1, 024, zero padding is added to the top and bounding box has the object and the bounding box with
bottom of an image. higher IoU is selected. The loss function in YOLO is mainly
We have two classes in our dataset, one for “background” comprises of: classification, localization and confidence loss.
and one for “pothole”. Transfer learning with Mask R-CNN We started with setting the input image subdivision value as
is a two-stage framework as follows: 8, but due to a resulting high memory requirement we changed
Stage 1: Classification and bounding box refinement. During it to 32. We used 64 images per batch. Initially, the learning
the first stage, the whole training image is scanned to generate rate was set to 0.01. However, after 1, 000 iterations the
anchor proposals by fine tuning RPN from end-to-end. RPN average loss kept on increasing.
is a lightweight neural network that scans over the backbone’s Therefore, we used 0.0001 for learning rate. We trained the
feature map, using a sliding window to generate anchors. LM2 network using TESLA K80 GPU. For around 8, 000 iter-
Anchors are typically bounding boxes in the image to predict ations it took approximately 6−7 hours and checked the mean
multiple regions while a small n × n window slides over the average precision mAP [87] value for different iterations at
convolved feature map of the entire training image. As the 5, 400, 6, 400, 7, 400. The mAP value of 5, 400 iterations was
sliding-window operation is convolutional in nature, so it is higher than other iteration weights. Hence, using Early Stop-
handled fast on a GPU. This stage generates a maximum of ping Point, the model is selected after 5, 400 iterations. After
256 anchors per image and bounding-box refinements. This 5, 400 iterations the average loss was same, hence we stopped
stage outputs a grid of anchors at different scales. training at 5, 400 iterations. The dataset for training and testing
Here IoU > 0.7 define positive samples, and IoU < 0.3 was same for LM1 and LM1. In LM2 we use random rotation
define negative samples, respectively. techniques for augmentation. However, the annotation format
The bounding box refinement step accepts a refined grid is different in case of LM2, which is bounding box not mask.
of anchors from the RPN and classifies the anchors precisely. Table VIII shows IoU measure values for a threshold of 0.5
It maps anchor bounding boxes into final boxes. Mask-RCNN for some of the selected frames of PNW dataset.1 Some of
refines the ROIAllign layer by removing a harsh quantization the frames are shown in Fig. 10.
of the RoIPool layer, to properly encapsulate the extracted We also tested 50 randomly selected PNW frames at different
features with the input. IoU values 0.5, 0.6 and 0.7 and the obtained average precision
Stage 2: Mask generation. values are 60%, 52% and 45%.
The mask branch is a CNN that accepts positive regions as
input, generated by the classifier during Stage 2, and generates IV. DATASETS
a low resolution 28 × 28 soft mask for it. We fine-tune the Authors of [64] state that “there is no uniform road damage
layer of Mask-RCNN, according to our object class “pothole”. dataset available openly, leading to the absence of a benchmark
Some frames from validation dataset are shown in Fig. 12 and for road damage detection”. Existing datasets on websites of
evaluation measures are listed in Table V. All the images from
validation datasets show potholes being identified with high 1 A short video sequence of results (using LM2) on the extensive PNW test

level of accuracy. dataset can be seen here: https://vimeo.com/337886918. This video is 29 fps.

Authorized licensed use limited to: GITAM University. Downloaded on April 25,2025 at 12:41:45 UTC from IEEE Xplore. Restrictions apply.
DHIMAN AND KLETTE: POTHOLE DETECTION USING COMPUTER VISION AND LEARNING 3545

Fig. 10. Detected “potholes” using LM2 method, shown in two columns with original image on the left and predicted results on the right.

The CCSAD Urban Streets dataset is an extensive


collection of road potholes. The data has been acquired
at 20 fps using two Basler Scout scA1300-32fm firewire
greyscale cameras mounted on the roof of the car. The image
resolution of CCSAD is 1096 × 822.
2. DLR. This dataset has been recorded while using
the integrated positioning system (IPS) [92], developed by
the German Aerospace Centre (DLR), installed on a car. The
collected dataset accounts 288 GB for with image dimension
as 1360 ∗ 1024.
3. Japan. This dataset comprises of 163, 664 road images
of dimension 600 × 600 collected in seven different cities
of Japan [64]. The dataset contains 9, 053 damaged-road
images and 15, 435 instances of damaged road surfaces such
as (mainly) cracks and (rarely) potholes. Images are captured
at an interval of one second under different weather and
lighting conditions.2
Fig. 11. Top left - CCSAD, Top right - DLR, middle left - Japan, middle
right- Sunny, bottom - PNW. 4. Sunny. Authors of [93] provided a dataset of 48, 913
images of size 3, 680×2, 760 recorded using a GoPro camera,
mounted inside a car on its windscreen. The camera was set
the KITTI [47] or .eisats.. [88] projects, or of Middlebury to a 0.5 second time lapse mode and car was moving at an
College [89] have been recorded in countries where pothole average speed of 40 km/h while scanning the road surface.
on roads typically occurs rarely, and they are not benchmarks The total data available is 2.70 GB.
for road pothole. Under these circumstances, we decided for 5. PNW. PNW is an extensive video recorded on the Pacific
the use of the following data (see Fig. 11 for examples): Northwest highway [94]. It shows the highway with patches
1. CCSAD. Hayet et al. [90] has introduced a dataset of of snow and water. We used 19, 784 extracted frames of
challenging sequences for autonomous driving (CCSAD) that dimension 1280 × 720 from this video.
is exceptionally utilitarian to execute strategies for detection
of road pothole in Mexico. The CCSAD dataset has been V. E XPERIMENTAL E VALUATIONS
split into four parts Colonial Town Streets, Urban
Using SV1, SV2, and LM1 methods, we detected potholes
Streets, Avenues and Small Roads, and Tunnel
at pixel level segmentation and obtained results are very
Network which accounts for 500 GB of data that incorpo-
rates calibrated and rectified pairs of stereo images, videos and 2 In late 2018, these data have been used in a competition, see
meta-data. bdc2018.mycityreport.net/overview/.

Authorized licensed use limited to: GITAM University. Downloaded on April 25,2025 at 12:41:45 UTC from IEEE Xplore. Restrictions apply.
3546 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 21, NO. 8, AUGUST 2020

Fig. 12. Detected “potholes” from validation dataset using LM1 method, shown in two columns with original image on the left and predicted results on the
right. Top to bottom, left to right: Ten frames in order as listed in Table V.

Fig. 13. Examples of false detections from test PNW dataset using LM1. The detection of pothole including false positives varies from 3070 to 3076 frames
shown in top two rows.

promising. Figure 12 shows that in validation dataset pothole is of arbitrary shape, under bright sunshine a tree is miss-
instances are correctly identified while a false positive has been classified as a pothole in this case (this could be excluded by
detected in the third image (from the bottom) - as a pothole identifying a ground manifold first).

Authorized licensed use limited to: GITAM University. Downloaded on April 25,2025 at 12:41:45 UTC from IEEE Xplore. Restrictions apply.
DHIMAN AND KLETTE: POTHOLE DETECTION USING COMPUTER VISION AND LEARNING 3547

Fig. 14. Pothole marked in red color presents a very complex situation as
it is filled with water and shadow of a tree.
TABLE VI
C OMPARATIVE E VALUATION OF THE P ROPOSED LM1, SV2 AND SV1 IN %

More examples of false detections using LM1 are shown


in Fig. 13 where frames in top two rows are from continuous
frame number as 3070 to 3076. Frames in this range have only
one pothole yet its been detected accurately in frame number
3073 (second row first frame). It is interesting to see that false
detection in this frame range is because pothole is filled with
water and tree shadow (see Fig. 14).
In order to measure accuracy of our models, we mainly
calculate common classification measures Precision and
Recall on a per-pixel basis. Precision measures the cor-
rectness among all positive pothole instances, recall measures Fig. 15. Left column shows SV1 and SV2 results, marked in red and green,
how many positive pothole instances are successfully printed respectively, with blue for ground truth. The right column shows results
among all positive pothole instances. – For method LM2, for LM1.
we use IoU as an evaluation measure. TABLE VII
We comparatively tested the SV1, SV2, and the proposed E VALUATION M EASURES FOR T EST D ATASET U SING LM1; IN %
LM1 technique. Results of LM1, are great improvement com-
pared to the same example-frames as shown in Fig. 15 when
using the SV1 or SV2 approach.
Table VI lists results for a few frames of the tested CCSAD
Urbban Sequence 1, comparing pixel-wise detected pot-
holes with the manually labeled ground truth. The table
shows that there is a case where the SV2 method provides
a slightly better result, but it also demonstrates the general
observation that the LM1 method outperforms the SV1 method
in a majority of cases, and typically by providing much better with water or snow. The overall precision and recall
results. for randomly chosen 50 frames from our testing dataset dataset
Table VII shows precision and recall values of testing is 88% and 84% respectively.
dataset PNW for some of the frames, see Fig. 16. This method Table VIII lists IoU values for some of the chosen frames,
miss classifies a pothole in PNW frame 720. The reason is see Fig. 10 for PNW testing dataset using LM2 method. The
that network identified a region as pothole which is a darker mean is 69% because to annotate pothole which is always
region on the side of pothole. However, LM1 method has great of irregular shape, using bounding box is very difficult. We
potential to identify potholes whether they are dry, or filled also considered about pothole instance count matching as an

Authorized licensed use limited to: GITAM University. Downloaded on April 25,2025 at 12:41:45 UTC from IEEE Xplore. Restrictions apply.
3548 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 21, NO. 8, AUGUST 2020

Fig. 16. Detected “potholes” using LM1 method, shown in two columns with original image on the left and predicted results on the right.

TABLE VIII Using the LM2 method, the developed model is applicable
E VALUATION M EASURES ( IN %) FOR LM2 R ESULTS S HOWN IN F IG . 12; for real-time scenarios. In Table VIII, IoU is greater than
N OTE T HAT BB 1 R EPRESENTS B OUNDING B OX 1, AND BB 2 B OUNDING
B OX 2 FOR D IFFERENT I NSTANCES OF P OTHOLES
50%, thus we may say that results are promising. The PNW test
frames get divided into the same number of grids as selected
during the training period, i.e. 13 × 13. The model can predict
multiple bounding boxes in each grid, so we keep the one with
the highest IoU value. This leads to an enforcement of spatial
diversity in making predictions.

VI. C ONCLUSION
The gravity of pothole related accidents can be understood
by increased numbers of accidents around the world due
to potholes. In this research, four different techniques are
proposed and tested against each other. Each technique has its
own benefits and can provide different pathways to a number
of applications. The LM1 model can identify a pothole under
challenging weather conditions with good precision and
recall whereas the LM2 model is capable of real-time
pothole identification. The SV2 approach can identify potholes
and road manifolds with very high accuracy when used with
stereo-vision cameras. The SV2 approach can also be used to
track a pothole from one frame to another, and is relatively
Fig. 17. The potholes marked in purple colors can be perceived as one easy to implement.
big pothole or can be counted separately(two potholes in left image, six The findings that we have presented here suggest that it
potholes in right image)[other potholes are not marked here, but considered
in experiments].
is very difficult to define the irregular shape of a pothole
which further makes it difficult to annotate ground truth. This,
in turn, causes a complex process of matching results with
evaluation measure. However, as shown in Fig. 17, more than ground truth. To date, there is no platform or benchmark
one pothole are adjacent to each other can be identified as one available for pothole identification. As a result of conducting
big pothole or multiple small potholes. In conclusion, we did this research, we also put forward six datasets specifically
not include pothole counts as an evaluation measure because for pothole identification, and discussed applications of two
it depends on individual counting. different areas of research such as computer vision and deep

Authorized licensed use limited to: GITAM University. Downloaded on April 25,2025 at 12:41:45 UTC from IEEE Xplore. Restrictions apply.
DHIMAN AND KLETTE: POTHOLE DETECTION USING COMPUTER VISION AND LEARNING 3549

learning. It would be fruitful to pursue further research in [24] I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal, Data Mining: Practical
order to combining the output of LM1 for annotating pothole Machine Learning Tools and Techniques, 3rd ed. San Mateo, CA, USA:
Morgan Kaufmann, 2016.
data and to use it to train more LM2-type models in order to [25] M. V. Thekkethala and S. Reshma, “Pothole detection and volume
increase detection accuracy for real-time purposes. estimation, using stereoscopic cameras,” in Proc. Int. Conf. Mixed
Design Integr. Circuits Syst., 2016, pp. 47–51.
ACKNOWLEDGMENT [26] S.-K. Ryu, T. Kim, and Y.-R. Kim, “Image-based pothole detection
system for ITS service and road management system,” Math. Problems
The authors acknowledge fruitful discussions with Hsiang- Eng., vol. 2015, 2015, Art. no. 968361.
Jen Chien, Auckland, New Zealand, which have been very [27] L. Powell and K. G. Satheeshkumar, “Automated road distress detec-
tion,” in Proc. Int. Conf. Emerg. Technol. Trends, 2016, pp. 1–6.
helpful at various steps of the reported work.
[28] A. Rasheed, K. Kamal, T. Zafar, S. Mathavan, and M. Rahman,
“Stabilization of 3D pavement images for pothole metrology using
R EFERENCES the Kalman filter,” in Proc. Int. Conf. Intell. Transp. Syst., 2015,
pp. 2671–2676.
[1] A. Heaton, “Potholes and more potholes: Is it just us,” Tech. Rep.,
Mar. 2018. [Online]. Available: https://medium.com [29] H. Hirschmüller, “Accurate and efficient stereo processing by semi-
[2] (2018). Pothole Facts. [Online]. Available: www.pothole.info/the-facts global matching and mutual information,” in Proc. Int. Conf. Comput.
[3] N. Dwivedi, “The pothole proposition,” Tech. Rep., Aug. 2018. [Online]. Vis. Pattern Recognit., pp. 807–814.
Available: https://medium.com [30] Q. Li, M. Yao, X. Yao, and B. Xu, “A real-time 3D scanning system
[4] (2018). Christchurch Report. [Online]. Available: www.stuff.co.nz/the- for pavement distortion inspection,” Meas. Sci. Technol., vol. 21, no. 8,
press/news/100847641/christchurch-the-pothole-capital-of-new-zealand/ pp. 015702-1–015702-8, 2010.
[5] H. Kong, J.-Y. Audibert, and J. Ponce, “General road detection from a [31] X. Yu and E. Salari, “Pavement pothole detection and severity mea-
single image,” Image Process., vol. 19, no. 8, pp. 2220–2221, Aug. 2010. surement using laser imaging,” in Proc. Int. Conf. Electro/Inf. Technol.,
[6] X. Ai, Y. Gao, J. G. Rarity, and N. Dahnoun, “Obstacle detection using 2011, pp. 1–5.
U-disparity on quadratic road surfaces,” in Proc. Int. Conf. Intell. Transp. [32] K. K. Vupparaboina, R. R. Tamboli, P. M. Shenu, and S. Jana, “Laser-
Syst., Oct. 2013, pp. 1352–1357. based detection and depth estimation of dry and water-filled potholes:
[7] F. Oniga, S. Nedevschi, M. M. Meinecke, and T. B. To, “Road surface A geometric approach,” in Proc. Nat. Conf. Commun., 2015, pp. 1–6.
and obstacle detection based on elevation maps from dense stereo,” in [33] R. Klette, Concise Computer Vision: An Introduction Into Theory and
Proc. Int. Conf. Intell. Transp. Syst., Oct. 2007, pp. 859–865. Algorithms. London, U.K.: Springer, 2014.
[8] C. Koch and I. Brilakis, “Pothole detection in asphalt pavement images,” [34] Z. Zhang, X. Ai, C. K. Chan, and N. Dahnoun, “An efficient algorithm
Adv. Eng. Inform., vol. 25, no. 3, pp. 507–515, 2011. for pothole detection using stereo vision,” in Proc. Int. Conf. Acoust.,
[9] (2018). Driverless Car Market Watch. [Online]. Available: Speech Signal Process., 2014, pp. 564–568.
http://www.driverless-future.com/?page_id=384 [35] W. Khan, “Accuracy of stereo-based object tracking in a driver assis-
[10] A. Tedeschi and F. Benedetto, “A real-time automatic pavement crack tance context,” Ph.D. dissertation, Dept. Comput. Sci., Auckland Univ.,
and pothole recognition system for mobile Android-based devices,” Adv. Auckland, New Zealand, 2013.
Eng. Inform., vol. 32, pp. 11–25, Apr. 2017.
[36] H. Youquan, W. Jian, Q. Hanxing, Z. Wei, and X. Jianfang, “A research
[11] B.-H. Lin and S.-F. Tseng, “A predictive analysis of citizen hotlines
of pavement potholes detection based on three-dimensional projec-
1999 and traffic accidents: A case study of taoyuan city,” in Proc. Int.
tion transformation,” in Proc. Int. Conf. Image Signal Process., 2011,
Conf. Big Data Smart Comput., Feb. 2017, pp. 374–376.
pp. 1805–1808.
[12] D. Santani et al., “Communisense: Crowdsourcing road hazards in
nairobi” in Proc. Int. Conf. Hum.-Comput. Interact. Mobile Devices [37] C. Zhang and A. Elaksher, “An unmanned aerial vehicle-based imag-
Services, Aug. 2015, pp. 445–456. ing system for 3D measurement of unpaved road surface distresses,”
[13] D. O’Carroll, “For the love of pizza, Domino’s is now fixing potholes Comput.-Aided Civil Infrastruct. Eng., vol. 27, no. 2, pp. 118–129, 2012.
in roads,” Wellington, New Zealand, Tech. Rep., Jun. 2018. [Online]. [38] Y.-W. Hsu, J. W. Perng, and Z.-H. Wu, “Design and implementation
Available: https://stuff.co.nz of an intelligent road detection system with multisensor integration,” in
[14] A. Dhiman, H.-J. Chien, and R. Klette, “Road surface distress detection Proc. Int. Conf. Mach. Learn. Cybern., 2016, pp. 219–225.
in disparity space,” in Proc. Int. Conf. Image Vis. Comput. New Zealand, [39] T. Naidoo, D. Joubert, T. Chiwewe, A. Tyatyantsi, B. Rancati, and
Dec. 2017, pp. 1–6. A. Mbizeni, “Visual surveying platform for the automated detection of
[15] A. Dhiman, H.-J. Chien, and R. Klette, “A multi-frame stereo vision- road surface distresses,” in Proc. Int. Conf. Sensors MEMS Electro-Optic
based road profiling technique for distress analysis,” in Proc. ISPAN, Syst., 2014, Art. no. 92570D.
Oct. 2018, pp. 7–14. [40] F. Orhan and P. E. Eren, “Road hazard detection and sharing with
[16] A. Dhiman, S. Sharma, and R. Klette, Identification of Road Potholes. multimodal sensor analysis on smartphones,” in Proc. Int. Conf. Next
Stratford, U.K.: MIND, 2019. Gener. Mobile Apps Services Technol., 2013, pp. 56–61.
[17] A. Mednis, G. Stardins, R. Zviedris, G. Kanonirs, and L. Selavo, “Real [41] T. Garbowski and T. Gajewski, “Semi-automatic inspection tool of
time pothole detection using Android smartphones with accelerometers,” pavement condition from three-dimensional profile scans,” Intell. Transp.
in Proc. Int. Conf. Distrib. Comput. Sensor Syst. Workshops, Jun. 2011, Syst., vol. 172, pp. 310–318, Jan. 2017.
pp. 1–6. [42] FEMat Project. Accessed: May 20, 2019. [Online]. Available:
[18] M. Ghadge, D. Pandey, and D. Kalbande, “Machine learning approach www.fematproject.pl/index.html
for predicting bumps on road,” in Proc. Int. Conf. Appl. Theor. Comput. [43] T. Shen, G. Schamp, and M. Haddad, “Stereo vision based road surface
Commun. Technol., Oct. 2015, pp. 481–485. preview,” in Proc. Int. Conf. Intell. Transp. Syst., 2014, pp. 1843–1849.
[19] F. Seraj, B. J. van der Zwaag, A. Dilo, T. Luarasi, and P. Havinga,
[44] V. A. Bashkar and G. T. Manohar, “Surface pothole depth estimation
“RoADS: A road pavement monitoring system for anomaly detection
using stereo mode of image processing,” Advance Res. Eng. Technol.,
using smart phones,” in Proc. Int. Workshop Modeling Social Media,
vol. 4, pp. 1169–1177, Jan. 2016.
Jan. 2014, pp. 128–146.
[20] J. Ren and D. Liu, “PADS: A reliable pothole detection system [45] Y.-H. Tseng, S.-C. Kanga, J.-R. Changb, and C.-H. Leea, “Strategies for
using machine learning,” in Proc. Int. Conf. Smart Comput. Commun., autonomous robots to inspect pavement distresses,” Autom. Construct.,
Jan. 2016, pp. 327–338. vol. 20, no. 8, pp. 1156–1172, 2011.
[21] K. Georgieva, C. Koch, and M. König, “Wavelet transform on multi- [46] Z. Ying, G. Li, X. Zang, R. Wang, and W. Wang, “A novel shadow-free
GPU for real-time pavement distress detection,” in Proc. Comput. Civil feature extractor for real-time road detection,” in Proc. Int. Conf. Pervas.
Eng., May 2015, pp. 99–106. Ubiquitous Comput., 2016, pp. 611–615.
[22] K. Doycheva, C. Koch, and M. König, “Implementing textural features [47] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics:
on GPUs for improved real-time pavement distress detection,” Real-Time The KITTI dataset,” Int. J. Robot. Res., vol. 32, no. 11, pp. 1231–1237,
Image Process., vol. 33, pp. 1–12, Sep. 2016. 2013.
[23] R. M. Haralick, K. Shanmugam, and I. Dinstein, “Textural features [48] T. Veit, J.-P. Tarel, P. Nicolle, and P. Charbonnier, “Evaluation of road
for image classification,” Syst. Man, vol. SMC-3, no. 6, pp. 610–621, marking feature extraction,” in Proc. Int. Conf. ITSC, Beijing, China,
Nov. 1973. 2008, pp. 174–181.

Authorized licensed use limited to: GITAM University. Downloaded on April 25,2025 at 12:41:45 UTC from IEEE Xplore. Restrictions apply.
3550 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 21, NO. 8, AUGUST 2020

[49] M. Staniek, “Neural networks in stereo vision evaluation of road [75] K. A. Levenberg, “A method for the solution of certain non-linear
pavement condition,” in Proc. Int. Symp. Non-Destructive Test. Civil problems in least squares,” Quart. Appl. Math., vol. 2, no. 2,
Eng., 2015, pp. 15–17. pp. 164–168, 1944.
[50] L. K. Suong and K. Jangwoo, “Detection of potholes using a deep [76] OpenMP. Accessed: Nov. 25, 2018. [Online]. Available:
convolutional neural network,” Universal Comput. Sci., vol. 24, no. 9, www.openmp.org/mp-documents/openmp-4.5.pdf
pp. 1244–1257, 2018. [77] S. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. Knowl.
[51] V. Pereira, S. Tamura, S. Hayamizu, and H. Fukai, “A deep learning- Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
based approach for road pothole detection in timor leste,” in Proc. Int. [78] H. Kaiming, G. Georgia, D. Piotr, and G. Ross, “Mask R-CNN,” CoRR,
Conf. Service Oper. Logistics, Informat., 2018, pp. 279–284. vol. abs/1703.06870, 2017.
[52] K. E. An, S. W. Lee, S.-K. Ryu, and D. Seo, “Detecting a pothole [79] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature
using deep convolutional neural network models for an adaptive shock hierarchies for accurate object detection and semantic segmentation,”
observing in a vehicle driving,” in Proc. Int. Conf. Consumer Electron., in Proc. CVPR, 2014, pp. 580–587.
2018, pp. 1–2. [80] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and
[53] Y. Bhatia, R. Rai, V. Gupta, N. Aggarwal, and A. Akula, “Con- A. W. M. Smeulders, “Selective search for object recognition,” Int. J.
volutional neural networks based potholes detection using thermal Comput. Vis., vol. 104, no. 2, pp. 154–171, Apr. 2013.
imaging,” King Saud Univ., Comput. Inf. Sci., to be published. doi: [81] R. Girshick, “Fast R-CNN,” in Proc. ICCV, 2015, pp. 1440–1448.
10.1016/j.jksuci.2019.02.004. [82] R. Shaoqing, H. Kaiming, G. Ross, and S. Jian, “Faster R-CNN:
[54] B. Cyganek and J. P. Siebert, An Introduction to 3D Computer Vision Towards real-time object detection with region proposal networks,”
Techniques and Algorithms. Hoboken, NJ, USA: Wiley, 2011. CoRR, vol. abs/1506.01497, 2015.
[83] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
[55] A. Mikhailiuk and N. Dahnoun, “Real-time pothole detection on
recognition,” in Proc. CVPR, 2018, pp. 770–778.
TMS320C6678 DSP,” in Proc. Int. Conf. Imag. Syst. Techn., 2016,
[84] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie,
pp. 123–128.
“Feature pyramid networks for object detection,” in Proc. CVPR, vol. 1,
[56] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, no. 2, 2017, p. 4.
“DeepLab: Semantic image segmentation with deep convolutional nets, [85] T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in
atrous convolution, and fully connected CRFs,” IEEE Trans. Pattern Proc. ECCV, 2014, pp. 740–755.
Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, Apr. 2017. [86] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” 2016,
[57] G. Lin, A. Milan, C. Shen, and I. D. Reid, “RefineNet: Multi-path arXiv:1612.08242. [Online]. Available: https://arxiv.org/abs/1612.08242
refinement networks for high-resolution semantic segmentation,” CoRR, [87] J. Hui, “mAP (mean average precision) for object detection,” Tech. Rep.,
vol. abs/1611.06612, 2016. Mar. 2018. [Online]. Available: https://medium.com/
[58] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing [88] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nesic,
network,” CoRR, vol. abs/1612.01105, 2016. X. Wang, and P. Westling, “High-resolution stereo datasets with
[59] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun, “Large kernel matters— subpixel-accurate ground truth,” in Proc. Int. Conf. GCPR, in Lecture
Improve semantic segmentation by global convolutional network,” Notes in Computer Science, vol. 8753, 2014, pp. 31–42.
CoRR, vol. abs/1703.02719, 2017. [89] T. Vaudrey, C. Rabe, R. Klette, and J. Milburn, “Differences between
[60] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks stereo and motion behavior on synthetic and real-world stereo
for semantic segmentation,” in Proc. CVPR, 2015, 3431–3440. sequences,” in Proc. Int. Conf. Image Vis. Comput. New Zealand, 2008,
[61] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep pp. 1–6.
convolutional encoder-decoder architecture for image segmentation,” [90] R. Guzmán, J.-B. Hayet, and R. Klette, “Towards ubiquitous autonomous
IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, driving: The CCSAD dataset,” in Proc. Int. Conf. Comput. Anal. Images
2017. Patterns, 2015, pp. 582–593.
[62] H. Song, K. Baek, and Y. Byun, “Pothole detection using machine [91] A. Börner et al., “IPS—A vision-aided navigation system,” in Proc. Adv.
learning,” Adv. Sci. Technol. Lett., vol. 150, pp. 151–155, Feb. 2018. Opt. Technol., vol. 6, no. 2, pp. 121–129, 2017.
[63] C. Szegedy, V. Vanhoucke, S. Ioffe, J. SHlens, and Z. Wojna, “Rethink- [92] D. Grießbach, D. Baumbach, and S. Zuev, “Stereo-vision-aided inertial
ing the inception architecture for computer vision,” in Proc. CVPR, 2016, navigation for unknown indoor and outdoor environments,” in Proc.
pp. 2818–2826. Indoor Positioning Indoor Navigat., 2014, pp. 709–716.
[64] H. Maeda, Y. Sekimoto, T. Seto, T. Kashiyama, and H. Omata, “Road [93] S. Nienaber, M. J. Booysen, and R. S. Kroon, “Detecting potholes using
damage detection using deep neural networks with images captured simple image processing techniques and real-world footage,” in Proc.
through a smartphone,” 2018, arXiv:1801.09454. [Online]. Available: Southern Afr. Transp. Conf., Jul. 2015.
https://arxiv.org/abs/1801.09454 [94] PNW Dataset. Accessed: May 25, 2019. [Online]. Available:
[65] J. Huang et al., “Speed/accuracy trade-offs for modern convolutional www.youtube.com/watch?v=BQo87tGRM74
object detectors,” in Proc. CVPR, 2017, pp. 7310–7311.
[66] A. G. Howard et al., “MobileNets: Efficient convolutional neural
networks for mobile vision applications,” 2017, arXiv:1704.04861. Amita Dhiman received the master’s degree in com-
[Online]. Available: https://arxiv.org/abs/1704.04861 puter science. She is currently pursuing the Ph.D.
[67] A. Zhang et al., “Automated pixel-level pavement crack detection on degree with the Auckland University of Technology.
3D asphalt surfaces using a deep-learning network,” J. Comput.-Aided Teaching Assistant with the Auckland University of
Civil Infrastruct. Eng., vol. 32, no. 10, pp. 805–819, 2017. Technology. She has coauthored papers in image
[68] R. Labayrade, D. Aubert, and J.-P. Tarel, “Real time obstacle detection processing, stereo vision, and deep learning.
in stereovision on non flat road geometry through ‘v-disparity’ represen-
tation,” in Proc. IEEE Intell. Vehicles Symp., Jun. 2002, pp. 646–651.
[69] N. H. Saleem, H.-J. Chien, M. Rezaei, and R. Klette, “Improved stixel
estimation based on transitivity analysis in disparity space,” in Proc. Int.
Conf. Comput. Anal. Images Patterns, 2017, pp. 28–40.
[70] J. Serra, Image Analysis and Mathematical Morphology. Orlando, FL,
USA: Academic, 1983. Reinhard Klette is a Professor with Auckland
[71] R. Klette and A. Rosenfeld, Digital Geometry. San Francisco, CA, USA: University of Technology. He has coauthored more
Morgan Kaufmann, 2003. than 300 publications in peer-reviewed journals or
[72] V. Lepetit, F. Moreno-Noguer, and P. Fua, “EPnP: An accurate O(n) conferences and books on computer vision, image
solution to the PnP problem,” Int. J. Comput. Vis., vol. 81, no. 2, processing, geometric algorithms, and panoramic
pp. 155–166, 2009. imaging. He is a fellow of the Royal Society of
[73] H. Bay, T. Tuytelaars, and L. Van Gool, “SURF: Speeded up robust New Zealand. He is on the Honorary Board of the
features,” in Proc. Eur. Conf. Comput. Vis., 2006, pp. 404–417. International Journal of Computer Vision. He was
[74] H.-J. Chien and R. Klette, “Regularised energy model for robust an Associate Editor of the IEEE PAMI from 2001 to
monocular egomotion estimation,” in Proc. Int. Joint Conf. Comput. Vis. 2008.
Imag. Comput. Graph., Theory Appl., vol. 6, 2011, pp. 361–368.

Authorized licensed use limited to: GITAM University. Downloaded on April 25,2025 at 12:41:45 UTC from IEEE Xplore. Restrictions apply.

You might also like