Object Recognition Using the OpenCV Haar Cascade Classifier on the iOS Platform
Examensarbete 15 hp
Januari 2013
Institutionen för informationsteknologi
Department of Information Technology
Abstract
Object recognition using the OpenCV Haar
cascade-classifier on the iOS platform
Staffan Reinius
Contents

1 Abbreviations
2 Introduction
   2.1 Objectives
   2.2 Limitations
3 Background
   3.1 Augmented Reality
       3.1.1 Augmented Reality applications in cars
   3.2 Object recognition
       3.2.1 Local invariant feature detectors
       3.2.2 Speeded Up Robust Features
       3.2.3 Haar classification
4 Methods
   4.1 Choosing recognition method based on performance and invariance properties
   4.2 Data collection and sample creation for Haar training
   4.3 Haar training
   4.4 Work diary
5 Results
   5.1 Performance and accuracy
   5.2 Implementation and system design
6 Discussion
7 Conclusions
   7.1 Assessment of the work process
   7.2 Future work
8 Acknowledgments
1 Abbreviations

AR       Augmented Reality
FLANN    Fast Library for Approximate Nearest Neighbors
FREAK    Fast Retina Keypoint
GENIVI   Geneva In-Vehicle Infotainment
IVI      In-Vehicle Infotainment
OpenCV   Open Source Computer Vision (library)
ORB      Oriented FAST and Rotated BRIEF
QR Code  Quick Response Code
SURF     Speeded Up Robust Features
2 Introduction

In-Vehicle Infotainment (IVI) systems are an expanding field in the automobile industry. Cars from the BMW Group released in ECE/US have the feature of letting the user connect a mobile device to the head unit of the car, interweaving the mobile device with the vehicle. The BMW IVI system is soon to be released to the Chinese market (summer 2012), supporting widely used Chinese mobile device applications. Such mobile-to-car interweaving allows, on the one hand, the user to interact with their mobile phone through the larger display of the IVI system, accessing telephony, internet services, news, music, navigation, etc., and, on the other hand, allows the mobile device to access virtually any data from the car head unit and system buses. This allows for infotainment applications in the other direction, providing information such as driving statistics, indicators, mileage and so on. The development of the BMW IVI system is conducted on a Linux-based open-source development platform delivered by the Geneva In-Vehicle Infotainment (GENIVI) Alliance. [1]

A number of automobile manufacturers, including the BMW Group, MG, Hyundai, Jaguar, Land Rover, Nissan, PSA Peugeot Citroën, Renault and SAIC Motor, use this Linux-based core service and middleware platform as an underlying framework. [2]
The aim of this project was to develop an object recognition module for an iPhone AR prototype application for the BMW Connected Drive Lab, located in Shanghai, China. The ambition was to implement this prototype so that it could be used as a basis for an interactive diagnostics tool, handbook or similar, allowing further information about the identified objects and graphics to be layered on screen. The application, at the prototype level, is a stand-alone tool not dependent on connection or communication with the existing IVI system. Future versions of this application could be integrated with the IVI system and present diagnostic data on the mobile device. Such an AR module could be useful in a number of tools accompanying a car, e.g. using the mobile device as a diagnostic tool (to check oil, washer fluid, tire pressure etc.), as an interactive car handbook or as a remote control for the IVI system.

The project was divided into two parts: the first focused on object recognition, and the second on the user interface interaction and graphical overlay onto the camera-provided images. The first part is presented in the current bachelor thesis, and the second part is covered by the bachelor thesis of Gabriel Tholsgård [3], with whom collaboration has been extensive. The current report describes the work of implementing image processing and applying object recognition to parts of the car dashboard provided in a video stream, and choosing an efficient approach that takes the relevant invariance properties into account.
Figure 1: These objects were chosen for object recognition and represent four buttons for climate control on the car dashboard.
2.1 Objectives

The goal of this project was to build an AR application for an iOS mobile device, using its built-in camera and projecting an OpenGL animation interface as an overlay on the detected objects. More specifically:

I. To construct a prototype module able to recognize four objects (fig. 1) on the car dashboard using OpenCV on the iOS platform.

II. To present an augmented reality OpenGL animation overlay on the camera image representing the detected object (approached in [3]). It should also be possible to interact with these OpenGL animations; they should work as buttons.

III. To combine the implementation of the two previous goals, achieving a generalized AR prototype.
2.2 Limitations

Within the scope of this thesis project it was not intended to construct a finished application (ready for the market). The finished project is a prototype for object recognition that displays animations as an overlay on the camera image. The object recognition task was limited to identifying four buttons on the car's climate panel (fig. 1). This prototype is not integrated with the Linux-based core service or middleware, but was intended to be a stand-alone iOS application, partly because there are no wireless sensors between the car head unit and the phone today, but more importantly to limit the work to fit the time frame of the project.
3 Background

3.1 Augmented Reality

AR is the idea of adding computer-generated data (e.g. graphics, sound, GPS) to real-time data, in contrast to Virtual Reality, where the real world is substituted by a simulated one. With object recognition, such an AR application, called vision-based, can become interactive by adding a user interface on top of detected objects in the camera image. Another common way of knowing what to display is the location-based approach, often using GPS [4].

In an early stage of this project the location-based approach was also considered, more precisely the possibility of getting the local coordinates of the telephone with respect to the car's head unit, combining this information with the tilt of the phone (accelerometer), and constructing a 3D map of coordinates, tilt and corresponding objects. Such a system would perform very well in terms of computational complexity (constant), but would be exhausting to construct, and each car model would need its own coordinate map. As mentioned earlier, there is no wireless communication between the phone and the head unit, and even if there were, the issue of getting the exact local 3D coordinates might be hard to solve; it would probably require multiple positioning sensors.

The vision-based approach was chosen, and the various aspects of AR are more thoroughly described in [3]. It was specified that this iOS AR application was to be made without using markers or AR-tags, and instead use object recognition methods more commonly used in other areas, in augmented reality terms called natural feature tracking [4].
3.1.1 Augmented Reality applications in cars

AR seems to be explored more and more in many areas: interactive commercial apps (QR-tags), games set in the real world, and GPS-based AR applications highlighting landmarks, to mention a few. Within the auto industry, ideas have been proposed on combining data from outside sensors with AR projections on the windshield, highlighting traffic signs, pedestrians and other traffic hazards, or on using the back-seat windows as computer screens highlighting landmarks and enabling interaction. From the driving seat it would obviously not be safe to interact with a touch-screen windshield; it could instead be a user scenario for passengers, and the same applies to interaction with a handheld device to control the infotainment system. The issue is mainly how to project large images on a windshield.

BMW makes use of AR today by providing a smaller Head-Up Display in the front window (projected for the driver) with some essential information such as speed and navigation information, which is an interesting feature mainly in terms
3.2.1 Local invariant feature detectors
Local invariant features should, in the ideal case, be identifiable after the image has been transformed in different ways. [5]

Let's look at the Harris corner detector as an example. Intuitively, a perpendicular corner would be found where the horizontal and vertical gradients (within a sub-window of pixels) sum up to a large value, since in that case there is contrast in both the horizontal and vertical directions; but a rotation of the image would hide this feature if the measurement were done perpendicularly. The Harris corner detector (in this example) compensates for such a rotation by applying eigenvalue decomposition to the second moment matrix:
"
#
Ix2 (x) Ix Iy (x)
M=
Ix Iy (x) Iy2 (x)
Ix and Iy are the respective sums of pixel intensity change in the x and y direction at point x. If M has two large eigenvalues, the window is centered around a corner. [5]
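As an illustrative sketch (not from the thesis; values and names are made up for the example), the two eigenvalues of the symmetric second moment matrix can be computed in closed form from its trace and determinant, giving the corner test just described:

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, c]],
// computed from the trace and determinant.
std::pair<double, double> eigenvalues2x2(double a, double b, double c) {
    double trace = a + c;
    double det = a * c - b * b;
    double disc = std::sqrt(trace * trace / 4.0 - det);
    return {trace / 2.0 + disc, trace / 2.0 - disc};
}

// Harris-style corner test: both eigenvalues of the second
// moment matrix must exceed a threshold.
bool isCorner(double sumIx2, double sumIxIy, double sumIy2, double threshold) {
    std::pair<double, double> ev = eigenvalues2x2(sumIx2, sumIxIy, sumIy2);
    return ev.first > threshold && ev.second > threshold;
}
```

A window with strong contrast in both directions (e.g. sumIx2 = sumIy2 = 100, sumIxIy = 0) passes the test, while an edge-like window with contrast in only one direction does not.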
In contrast to many typical template-based methods, local feature-based methods can exhibit many types of invariance, even to partial occlusion [5].

Object recognition techniques are often applied on gradient images for several reasons: gray-scale images are generally more robust to variation in illumination, and matrices of single brightness values are more efficient in terms of memory consumption and less demanding in terms of processing speed [5].

After extracting local features from two images, the sets of features are compared to find out which features in one image correspond to features in the other, and if a threshold (number of matching features) is exceeded, a match is found. [5]
3.2.2 Speeded Up Robust Features

The first choice of recognition method was SURF [6], a scale-invariant feature-based technique. It is an approach designed for efficiency that at the same time has a high level of invariance to the transformations and conditions expected in the given setting [5].

SURF uses box-type filters on integral images to approximate the second-order Gaussian derivatives of the Hessian matrix [5]. That is, SURF exploits the same principle of looking for two eigenvalues with the same sign (as described above for the Harris corner detector), only it is based on approximations of each entry of the Hessian matrix using one of three filters, fig. 2 (proposed in [6]):
"
#
Lxx (x, ) Lxy (x, )
H(x, ) =
(1)
Lxy (x, ) Lyy (x, )
If we let Dxx, Dyy and Dxy be approximations of Lxx, Lyy and Lxy respectively, the determinant of the Hessian matrix can be approximated as

det(H_{\mathrm{approx}}) = D_{xx} D_{yy} - (0.9\, D_{xy})^2 \quad (2)

for a Gaussian with \sigma = 1.2 (finest scale) and a filter (fig. 2) of 9 \times 9 pixels. Here, 0.9 represents a balancing of the relative weights with scale, computed as

\frac{|L_{xy}(1.2)|_F \, |D_{xx/yy}(9)|_F}{|L_{xx/yy}(1.2)|_F \, |D_{xy}(9)|_F} \approx 0.9 \quad (3)
Figure 2: These filters (kernels) approximate the Laplacian of Gaussians (Lxx, Lyy and Lxy) by applying weights of -1 to the white areas, 2 to the black areas and 0 to the gray areas.
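The determinant approximation in eq. (2) is a one-liner once the box-filter responses are known; the following sketch (illustrative, not the SURF reference implementation) uses the relative-weight balancing term w of roughly 0.9 reported in [6]:

```cpp
#include <cassert>
#include <cmath>

// Approximated Hessian determinant: Dxx, Dyy, Dxy are box-filter
// responses standing in for the true Gaussian second derivatives.
// w balances the relative weights of the filter responses.
double hessianDetApprox(double dxx, double dyy, double dxy) {
    const double w = 0.9;
    return dxx * dyy - (w * dxy) * (w * dxy);
}
```

A blob-like interest point is kept where this determinant response is a large positive local maximum, since both eigenvalues then have the same sign.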
3.2.3 Haar classification

Boosting builds a strong classifier from many weak classifiers, where a weak classifier only has to get the classification right in more than fifty percent of the cases. This buildup of a better classifier from many weak ones is done by increasing the weight (penalty) on misclassified samples, so that in the next iteration of training a hypothesis that gets those falsely classified samples right is selected. Finally, the convex combination of all hypotheses is computed (fig. 3).
Figure 3: Example illustrating boosting. (a) A hypothesis (line) is selected, misclassifying object 1; the weight of object 1 is increased, which will affect the choice of the next hypothesis (a cheap hypothesis will be selected). (b) The next hypothesis misclassifies object 2, and the weight is then divided between 1 and 2. (c) The next hypothesis misclassifies object 3, and the weight is now divided between 1, 2 and 3. (d) After picking a last hypothesis, the convex combination of all hypotheses is computed. This example is based on [9].
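The weight-update scheme above can be sketched in a few dozen lines. The following is an illustrative discrete AdaBoost over one-dimensional threshold stumps, not OpenCV's implementation; all types and names are made up for the example:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Stump { double threshold; int polarity; double alpha; };

// Discrete AdaBoost: each round picks the stump with the lowest
// weighted error, then increases the weights of the samples it
// misclassified so the next round focuses on them.
std::vector<Stump> trainAdaBoost(const std::vector<double>& x,
                                 const std::vector<int>& y,  // labels +1 / -1
                                 int rounds) {
    size_t n = x.size();
    std::vector<double> w(n, 1.0 / n);  // uniform initial weights
    std::vector<Stump> ensemble;
    for (int t = 0; t < rounds; ++t) {
        Stump best{0.0, 1, 0.0};
        double bestErr = 1.0;
        // Exhaustively try a stump at every sample value, both polarities.
        for (double thr : x) {
            for (int pol : {1, -1}) {
                double err = 0.0;
                for (size_t i = 0; i < n; ++i) {
                    int pred = (pol * (x[i] - thr) >= 0) ? 1 : -1;
                    if (pred != y[i]) err += w[i];
                }
                if (err < bestErr) { bestErr = err; best = {thr, pol, 0.0}; }
            }
        }
        bestErr = std::max(bestErr, 1e-10);  // avoid division by zero
        best.alpha = 0.5 * std::log((1.0 - bestErr) / bestErr);
        // Re-weight: misclassified samples get heavier, then normalize.
        double sum = 0.0;
        for (size_t i = 0; i < n; ++i) {
            int pred = (best.polarity * (x[i] - best.threshold) >= 0) ? 1 : -1;
            w[i] *= std::exp(-best.alpha * pred * y[i]);
            sum += w[i];
        }
        for (double& wi : w) wi /= sum;
        ensemble.push_back(best);
    }
    return ensemble;
}

// The final classifier is the sign of the weighted (convex) combination.
int classify(const std::vector<Stump>& ensemble, double v) {
    double score = 0.0;
    for (const Stump& s : ensemble)
        score += s.alpha * ((s.polarity * (v - s.threshold) >= 0) ? 1 : -1);
    return score >= 0 ? 1 : -1;
}
```

In the real Haar cascade the weak learners are thresholds on Haar-like feature responses rather than raw 1-D values, but the weight update and the final weighted vote are the same idea.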
There are four boosting methods available for Haar training in OpenCV: Real AdaBoost, Discrete AdaBoost, LogitBoost and Gentle AdaBoost.

That Haar classification uses a rejection cascade means that the final classifier consists of a cascade of (many) simple classifiers, and a region of interest must pass all stages of this cascade to pass. The order of the nodes is often arranged by complexity, so that many feature candidates are ruled out early, saving substantial computation time [8].

Figure 4: These Haar-wavelet-like features are computed by adding the light regions and subtracting the dark regions [8]. The image is originally from [10].
As input to these basic classifiers, which build up the cascade, come Haar-like features that are calculated according to fig. 4.

When the application is running, a search window of different sizes is swept over the image, computing sums of pixel values based on these features using integral images, and the trained rejection cascade is applied (see next chapter).
4 Methods

OpenCV is a computer vision library written in C and C++ that contains over 500 functions associated with vision, as well as a general-purpose machine-learning library. OpenCV was chosen since it is open source and free for academic and commercial use, and it is widely used and well documented [8]. Most importantly, OpenCV compiles on the iOS platform, along with the additional frameworks AVFoundation, ImageIO, libz, CoreVideo and CoreMedia.

Initially, many different approaches to object recognition were considered, since the OpenCV interface makes it easy to switch between methods. Developing for mobile devices increases the demand for efficiency, and the key criterion was of course to choose a well-performing technique in terms of efficiency and level of invariance, with more emphasis on efficiency, as live video would be processed and, furthermore, CPU and RAM are comparatively limited on the intended devices.

SURF and Haar-feature classification became the main candidates for recognition: SURF because it is designed for efficiency and meets all intended invariance requirements [5], and the Haar classifier since it is very efficient, though rotational invariance is slightly compromised [8] depending on what features are used during training. As described in more detail below, the Haar classifier was finally chosen. Both make use of integral images, i.e. sums of pixel values in a rectangular region of an image.
The SURF implementation performed well on a laptop but not on the iPhone 3GS; hence the OpenCV Haar classification technique became the chosen approach.

Figure 5: This image is a screenshot from a running application on the PC implementing SURF, detecting a writing block.
When porting to the iOS platform, a preexisting open-source iOS project developed by the group Aptogo [7], integrating OpenCV with iOS and including a precompiled OpenCV framework, was used. On the iOS platform there are several aspects of OpenCV that differ from a PC environment, especially in using the highgui module, which normally handles interaction with the operating system: accessing cameras, handling user events, displaying images and graphics in images, etc. On iOS no video preview is supported, and frames have to be pulled and displayed manually from the highgui VideoCapture class. This preexisting project was designed to be reusable by sub-classing its integration feature, and it allowed the only slightly modified OpenCV code from the laptop development to be used directly.
Through April and May, weekly meetings were held where the work progress was reported to the thesis project supervisor and deliverables for the coming week were set.
In the middle of April a SURF implementation was running on the device. At this point the objects in the car to be recognized had not been specified, and in the meantime a tablet/writing pad was used for testing, which differed a lot from the final objects that were chosen. This implementation performed well on the laptop but did not work well on the handheld device: it was only measured to process 0.5 frames per second. This was a major concern since the aim was to recognize four objects. An effort could have been put into optimizing the implementation, but instead other approaches were looked into, using a better performer (in terms of processing) based on a learned boosted rejection cascade; crucial for this decision was knowledge of the automatic sample generator for the Haar classifier (see 4.2).
The training (see 4.3) of the final classifiers took about three days to complete on two computers, running two parallel training jobs on each (one for each classifier). Before this, the sample-creation and training utilities were tested on the (just mentioned) writing pad (fig. 5) with fairly good results. But the objects that were finally picked for recognition (fig. 1) turned out to be harder to produce good classifiers from, probably because they were too small and lacked blocky features [8]. In the end, the set of classifiers was too tolerant. In retrospect, a larger and easier target than the small buttons, e.g. the entire climate control panel or similar, should have been chosen for the recognition task.

The last week was dedicated to merging the UI code with the object recognition code.
5 Results
This chapter focuses on implementation and the final outcome of the project, which
is discussed further in the next chapter.
Object               Method                fps   Accuracy
4 Buttons (fig. 1)   Haar Classification   0.6   Poor; finds lots of false positives outside the intended environment.
Tablet (fig. 5)      Haar Classification   1.8   OK from a straightforward angle; too sensitive to changes in rotation and viewing angle.
Tablet (fig. 5)      SURF                  0.5   Good; insensitive to changes in rotation and viewing angle, but unresponsive.
This table is not intended as a measurement of how the methods perform in general, but only presents the result, which is implementation- and platform-specific. It would probably not be meaningful to compare such distinctive methods on a general basis (since the environment differs from task to task).
Figure 7: A screenshot of the running application (with some texture error, see
discussion in next chapter).
cessViewController instantiates four CascadeClassifier objects, one for each XML cascade-classifier file that was trained. The method detectMultiScale is run on each cascade object with the following parameters: the image object (a cv::Mat), a scale factor (how much the search window size is changed between iterations), a vector of rectangles that is updated as a side effect if there is a match, and minimum neighbors (collections with fewer features are treated as noise) [11].
The vital part of the Haar classification is the following:

    // Create a path to the classifier
    NSString *cascadePath = [[NSBundle mainBundle] pathForResource:cascadeFilename
                                                            ofType:@"xml"];
    ...
    // Instantiate a vector to hold the coordinates of the matches
    std::vector<cv::Rect> objects;
    // Apply detection.
    cascade.detectMultiScale(mat, objects, 1.1, 3,
                             CV_HAAR_FIND_BIGGEST_OBJECT, cv::Size(10, 10));

When the processFrame:videoRect:videoOrientation: method has processed the image, it dispatches a vector of rectangles for where to draw the animations back to the main thread:

    // Draw on main queue
    dispatch_sync(dispatch_get_main_queue(), ^{
        [self displayObjects:objects
                forVideoRect:rect
            videoOrientation:videoOrientation];
    });
6 Discussion

The result in table 5.1 shows that the Haar classification implementation in the top row performed well in terms of speed: four cascades at a frame rate of 0.6 fps amounts to 2.4 cascade evaluations each second. This can be compared with the SURF implementation, with a frame rate of 0.5 frames per second, which implies that it would take about two seconds to process one image and around eight seconds to process four. One reason for the good processing speed of the Haar classifier (trained on the buttons) might be that the cascades were simple, but this is also the reason for the poor performance in regards to accuracy.
The first Haar classification implementation, trained on a tablet (second row), showed good performance in speed (1.8 fps) and acceptable performance in accuracy, and the cascade was more complex, probably because the object (the tablet) had more distinct features. This was what was wanted when the last classifiers (on the buttons, top row) were trained, but they underperformed in that they were oversensitive: the different objects were sometimes confused, and false positives (objects not trained on) could easily be found outside the control panel.
The Haar classifier is often used for detecting pedestrians, body parts or faces (it is sometimes called the face detector [11]), and such classifiers are included with OpenCV. But it is also said to work well on logos and similar objects with a typical viewing point [11], and it was partly this that motivated choosing the method. Furthermore, cascade classification with Haar-like features is said to work well for blocky features with characteristic views [8]. This could easily be misinterpreted: it could mean blocky as in sharp edges, or as larger blocks with similar gray scale. The number of training samples could have been increased if the objects had been more distinct, since then the samples could have been reduced in size. Furthermore, the classifiers would probably have been stronger if they had been trained with more advanced features.
When the decision to use Haar classification for the final version of the prototype was made, it was based on the result of the test made on the test object (the writing pad, entry two in the table), and the decision to use the final weeks of the project to train Haar classifiers was made before it was specified which objects were to be processed in the final version of the prototype. The buttons on the dashboard lacked sufficiently distinct features, and they turned out to be a hasty and poor choice of objects (this is discussed more in the next chapter).

But considering the overall result, the outcome of the use of a Haar classifier was in the end closest to the project specification.
As for the screenshot in the previous chapter depicting the running application, there are two possible reasons why there are texture errors in the animation of the dashboard buttons: one might be that the data of the texture file is incorrectly read in; the other is that the settings of the view rendering are incorrect. This is more thoroughly discussed in [3], and these issues should be relatively easy to fix.
7 Conclusions

The final object recognition module, with XML cascades trained on the buttons (fig. 1), performs below expectations. The overall experience is that even if the module finds the objects and often distinguishes the different keys, it often finds false positives outside the climate control panel, e.g. objects with white marks on a black background, like a computer keyboard.

The final classifiers (top row in 5.1) were intended to perform as well as the test classifier (second row in 5.1), but this was not achieved since the target objects turned out to be difficult to train on. For the purpose of this prototype it would have been better to choose some larger objects in the car for the recognition, for instance the whole climate panel or the head unit.
7.1 Assessment of the work process

A more complete design should have been drawn up from the start, allowing for smaller changes as the work proceeded. When merging the two projects, a lot was done in an ad-hoc fashion, which resulted in a final project that was not very modular or generic.
8 Acknowledgments

I am grateful to the people at the BMW ConnectedDrive Lab in Shanghai, to my supervisor Amen Hamdan, Senior Manager SW Development, to Philipp Holzer, Specialist in Human Computer Interaction, and to Alexis Trolin, Head of BMW Group ConnectedDrive Lab, for giving me the opportunity to do this internship at your very inspiring office.

I am also thankful to Associate Professor Anders Hast, topic reviewer, and Dr Björn Reinius for feedback on this thesis.

This project was made possible by a stipend from Bröderna Molanders stiftelse, and I would like to say a big thanks to the caretakers of this foundation.
References

[1] GENIVI, June 2012, http://www.genivi.org/faq

[2] Wuelfing B. CeBIT 2009: BMW and Partners Found GENIVI Open Source Platform, Linux Pro Magazine, March 2009. http://www.linuxpromagazine.com/Online/News/CeBIT-2009-BMW-and-Partners-Found-GENIVI-Open-Source-Platform

[3] Tholsgård G. 3D rendering and interaction in an augmented reality, bachelor thesis under preparation (December 2012), Uppsala University

[4] Azuma R. T. A Survey of Augmented Reality, Presence: Teleoperators and Virtual Environments 6, 4 (August 1997), 355-385. Hughes Research Laboratories. http://www.cs.unc.edu/~azuma/ARpresence.pdf

[5] Tuytelaars T., Mikolajczyk K. Local Invariant Feature Detectors: A Survey, Foundations and Trends in Computer Graphics and Vision, Vol. 3, No. 3 (2007), 177-280

[6] Bay H., Tuytelaars T., Van Gool L. SURF: Speeded Up Robust Features, Computer Vision and Image Understanding (CVIU), Vol. 110, No. 3, pp. 346-359 (2008). http://www.vision.ee.ethz.ch/~surf/eccv06.pdf

[7] Evans C. Notes on the OpenSURF Library, CSTR-09-001, University of Bristol, January 2009. http://www.cs.bris.ac.uk/Publications/Papers/2000970.pdf

[8] Bradski G., Kaehler A. Learning OpenCV: Computer Vision with the OpenCV Library, O'Reilly Media, September 2008

[9] Lecture by M. K. Warmuth, November 2011. http://www.youtube.com/watch?v=R3od76PZ08k&list=EC2A65507F7D725EFB&index=29&feature=plpp_video

[10] OpenCV Wiki, June 2012, http://code.opencv.org

[11] OpenCV Documentation, June 2012, http://opencv.willowgarage.com/documentation/cpp/objdetect_cascade_classification.html