No part of this digital document may be reproduced, stored in a retrieval system or transmitted in any form or
by any means. The publisher has taken reasonable care in the preparation of this digital document, but makes no
expressed or implied warranty of any kind and assumes no responsibility for any errors or omissions. No
liability is assumed for incidental or consequential damages in connection with or arising out of information
contained herein. This digital document is sold with the clear understanding that the publisher is not engaged in
rendering legal, medical or any other professional services.
COMPUTER SCIENCE, TECHNOLOGY AND APPLICATIONS
Additional books in this series can be found on Nova's website under the Series tab.
COMPUTER SCIENCE, TECHNOLOGY AND APPLICATIONS
COMPUTER VISION
SOTA R. YOSHIDA
EDITOR
Copyright © 2011 by Nova Science Publishers, Inc.
All rights reserved. No part of this book may be reproduced, stored in a retrieval system or
transmitted in any form or by any means: electronic, electrostatic, magnetic, tape, mechanical
photocopying, recording or otherwise without the written permission of the Publisher.
For permission to use material from this book please contact us:
Telephone 631-231-7269; Fax 631-231-8175
Web Site: http://www.novapublishers.com
Independent verification should be sought for any data, advice or recommendations contained in
this book. In addition, no responsibility is assumed by the publisher for any injury and/or damage
to persons or property arising from any methods, products, instructions, ideas or otherwise
contained in this publication.
This publication is designed to provide accurate and authoritative information with regard to the
subject matter covered herein. It is sold with the clear understanding that the Publisher is not
engaged in rendering legal or any other professional services. If legal or any other expert
assistance is required, the services of a competent person should be sought. FROM A
DECLARATION OF PARTICIPANTS JOINTLY ADOPTED BY A COMMITTEE OF THE
AMERICAN BAR ASSOCIATION AND A COMMITTEE OF PUBLISHERS.
Additional color graphics may be available in the e-book version of this book.
CONTENTS
Preface vii
Chapter 1 Some Applications of Computer Vision Systems
in Micromechanics 1
E. Kussul, O. Makeyev, T. Baidyk, A. Martín-Gonzalez
and G. Toledo-Ramirez
Chapter 2 A Survey of Face Recognition by the Genetic Algorithm 41
Fengzhi Dai, Yutaka Fujihara and Naoki Kushida
Chapter 3 The Attentive Co-Pilot: Robust Driver Assistance Relying
on Human-Like Signal Processing Principles 57
Thomas Paul Michalke and Robert Kastner
Chapter 4 Traffic Monitoring: A Practical Implementation
of a Real-Time Open Air Computer Vision System 187
Tomas Rodríguez
Chapter 5 Algebraic Topology for Computer Vision 239
Daniel Freedman and Chao Chen
Chapter 6 Spectral Imaging Versus Non-Spectral Imaging 269
Ismail Bogrekci
Chapter 7 Least Squares Fitting of Quadratic Curves and Surfaces 285
N. Chernov and H. Ma
Chapter 8 Ontology Based Image and Video Analysis 303
Christopher Town
Chapter 9 Computer Vision by Laser Metrology and Algorithms
of Artificial Intelligence 329
J. A. Muñoz-Rodríguez
Index 363
PREFACE
Computer vision is the science and technology of machines that see, where seeing in this
case means that the machine is able to extract information from an image that is necessary to
solve some task. This new book presents topical research in the study of computer vision,
including computer vision systems in micromechanics; genetic algorithm-based face
recognition; algebraic topology for computer vision; and computer vision by laser metrology and algorithms of artificial intelligence.
Chapter 1- For general purpose image recognition systems the authors developed three neural classifiers: LIRA, PCNC, and the Pairwise Coding Classifier. The authors tested the neural classifiers in various tasks of image recognition in the area of micromechanics. The first task was the texture recognition problem. Different methods of mechanical metal treatment give different surface textures. A computer vision system was made for the recognition of four different types of workpiece treatment: milled, polished with sandpaper, turned with a lathe, and polished with a file. The LIRA classifier was used to solve this problem. 20 samples of each texture were prepared for the experiments. Some randomly selected samples were used for classifier training, and the other samples were used for testing. In the case of 10 training samples a 99.8% recognition rate was obtained. The second task was connected with recognition of the shape of small screws. When small screws are manufactured on a computer numerical control (CNC) machine tool, the cylindrical part of the screw is turned with one cutter and the thread is made with another cutter. The mutual position of the cutters is sometimes not aligned precisely. In this case the shape of the thread will be incorrect. The recognition of the thread shape gives information about the mutual position of the cutters. This information allows the cutter positions to be corrected. The authors tested the LIRA classifier in the task of thread shape recognition. A recognition rate of 98.9% was obtained. The cost of micromechanical devices will depend not only on the cost of manufacture of components but also on the cost of device assembly, so assembly automation is very important. Many assembly procedures can be made with the aid of computer vision systems. Earlier the authors tested the LIRA classifier to obtain adaptive feedback in the pin-hole alignment task. Now they consider the problem of component selection from the depository, detection of their orientation and transportation to the assembly device. This is the third task that the authors describe in this chapter. This system will include a micromanipulator and camera, and will permit work with different micromechanical components without the development of new mechanical feeders. The manipulator, using the computer vision system, must recognize the
types of components, their position (coordinates) and their orientation. To solve this problem a special program was written and tested. The authors will describe the image database and the recognition results obtained in this task.
Chapter 2- Computer vision and recognition are playing an increasingly important role in modern intelligent control. Object detection is the first and most important step in object recognition. Traditionally, special objects can be detected and recognized by the template matching method, but the recognition speed has always been a problem. Likewise, for recognition by a neural network, training is always time-consuming. In this chapter, a genetic algorithm-based face recognition system is proposed. The genetic algorithm (GA) has been considered a robust and global searching method. Here, the chromosomes generated by the GA contain information (parameters) of the image, and the genetic operators are used to obtain the best match between the original image and the face of interest. The parameters are the coordinates (x, y) of the center of the face in the original image, the rate of scale and the angle θ of rotation. The purpose of this chapter is to summarize the GA-based method for face detection and recognition. Finally, the experimental results, a comparison with the traditional template matching method, and some other considerations are also given.
Chapter 3- Increasingly complex driver assistance functionalities are combined in today's vehicles. Typically, these functionalities run as independent modules bringing their own sensors, processing devices, and actuators. In general, no information fusion, i.e., cross talk between modules, takes place, as a consequence of unsolved questions in system design that come along with high system complexity. However, information fusion of available sensors and processing modules could lead to a new quality of driver assistance functionalities in terms of performance and robustness. Additionally, the growing number of driver assistance functionalities in today's vehicles will cause problems regarding the limited number of interaction channels to inform the driver of potential dangers - the so-called human machine interface (HMI). Already today, in specific traffic scenarios a contradicting HMI access of different driver assistance functionalities cannot be avoided. Furthermore, typical driver assistance functionalities on the market are based on highly specialized and optimized algorithms that show sound performance for a restricted number of clearly defined scenarios only. Also, the combination of several of these rigid systems as a means to reach the long-term goal of autonomous driving will not lead to robust system performance, taking the immense variety of traffic situations into account. As opposed to that, the following contribution presents a flexible, biologically inspired driver assistance system that adapts its modules and the data exchange between modules online, depending on the task. More specifically, the morphology of the brain as well as brain-like signal processing principles are mimicked in order to increase the robustness and flexibility of the system. Thereby, the authors aimed at reaching a generic system structure that supports several system tasks by modulating and parameterizing its system modules (e.g., detect fast objects, redetect once tracked and later lost objects, predict object trajectories, locate cars on the road, detect traffic signs). Put differently, the authors' main research focus is on the design of general mechanisms that lead to a certain observable behavior without being explicitly designed for this behavior. Following that paradigm, a tunable attention sub-system was developed that allows the pre-filtering of the visual environment for task-relevant information in order to provide the driver with assistance in dangerous situations. In order to include information on the scene context in the system, a robust unmarked road detection module as well as an approach for assessing the general type of environment (highway, country road, inner city) are included in the system.
The realized driver assistance system is tested online and in real-time on a prototype car. The
gathered results prove the applicability of the developed biologically inspired driver
assistance system in real-world scenarios. Extensive system evaluation shows that different
system properties are in close compliance with measurements gathered in psychophysical
studies of humans. Based on these results, it can be stated that the realized advanced driver
assistance system closely models important human information processing principles, allowing the use of the system as an Attentive Co-Pilot for human drivers.
Chapter 4- Open air is a challenging environment where few computer vision
applications succeed. The reasons behind these difficulties must be found in the uncontrolled
nature of open air applications. Computer vision is a passive technology. Typically a camera
mounted on a mast observes the scene while using "clever" algorithms to interpret the images. However, this interpretation is no easy task, since almost any type of moving or static object may appear in the scene. Also, there are many circumstances which are not under
direct control of the system: illumination changes caused by passing clouds, movements of
the sun or day / night transitions; presence of shadows and moving artificial lights; camera
artifacts resulting from blooming or smearing; overreaction of camera compensation devices;
occlusions; vibrations of the camera; weather elements such as rain, snow or fog; etc.
All these circumstances combined turn computer vision in open scenarios into a formidable task. Only those applications working under very constrained conditions and with very concrete objectives will have any opportunity to succeed. In this chapter the authors present a computer vision based traffic monitoring system able to detect individual vehicles in real time. The fully integrated system first obtains the main traffic variables of the vehicles: counting, speed and category; and then computes a complete set of statistical variables. The objective is to investigate alternative solutions to some of the difficulties impeding existing traffic systems from achieving balanced accuracy in every condition, most notably: day and night transitions, moving shadows, heavy vehicles, occlusions, slow traffic and congestion.
Chapter 5- Algebraic topology is generally considered one of the purest subfields of mathematics. However, over the last decade two interesting new lines of research have emerged, one focusing on algorithms for algebraic topology, and the other on applications of algebraic topology in engineering and science. Amongst the new areas in which the techniques have been applied are computer vision and image processing. In this chapter, the authors survey the results of these endeavours. Because algebraic topology is an area of mathematics with which most computer vision practitioners have no experience, the authors review the machinery behind the theories of homology and persistent homology; their review emphasizes intuitive explanations. In terms of applications to computer vision, the authors focus on four illustrative problems: shape signatures, natural image statistics, image denoising, and segmentation. Their hope is that this review will stimulate interest on the part of computer vision researchers to both use and extend the tools of this new field.
Chapter 6- Conventional approaches to computer vision have created many opportunities for different disciplines and resulted in many applications and practices in our lives. However, spectral imaging promises many more looking into the future. Before discussing what the difference between spectral and non-spectral imaging is, the author defines spectral and non-spectral imaging at the beginning of this chapter. Later on, image acquisition tools and techniques are discussed, followed by image processing techniques for spectral and non-spectral imaging. Finally, the
chapter discusses the advantages and disadvantages of spectral and non-spectral imaging.
Chapter 7- In computer vision one often fits ellipses and other conics to observed points on a plane, or ellipsoids/quadrics to spatial point clouds. The most accurate and robust fit is obtained by minimizing geometric (orthogonal) distances, but this problem has no closed form solution and most known algorithms are prohibitively slow. The authors revisit this issue based on recent advances by S. J. Ahn, D. Eberly, and themselves. Ahn has sorted out various approaches and identified the most efficient one. Eberly has developed a fast method of projecting points onto ellipses/ellipsoids (and gave a proof of its convergence). The authors extend Eberly's projection algorithm to other conics, as well as quadrics in space. They also demonstrate that Eberly's projection method combined with Ahn's most efficient approach (and using Taubin's algebraic fit for initialization) makes a highly efficient fitting scheme working well for all quadratic curves and surfaces.
Chapter 8- This chapter shows how ontologies can be used as the basis for an effective computational and representational framework for computer vision. A particular focus is on the role of ontologies as a means of representing structured prior information and integrating different kinds of information in an inference framework. Ontologies are presented as an effective means of relating semantic descriptors to their parametric representations in terms of the underlying data primitives. The efficacy of the proposed approach is demonstrated through the development and analysis of solutions to a range of challenging visual analysis tasks. First, the author considers the problem of content-based image retrieval (CBIR). The author presents a CBIR system that allows users to search image databases using an ontological query language. Queries are parsed using a probabilistic grammar and Bayesian networks to map high level concepts onto low level image descriptors, thereby bridging the semantic gap between users and the retrieval system. The author then extends the notion of ontological languages to video event detection. It is shown how effective high-level state and event recognition mechanisms can be learned from a set of annotated training sequences by incorporating syntactic and semantic constraints represented by an ontology.
Chapter 9- The author presents a review of his algorithms for computer vision based on laser metrology and artificial intelligence. This chapter includes laser metrology, image processing, contouring by neural networks, modelling of the mobile setup and vision parameters. To achieve computer vision, the object shape is reconstructed by a laser metrology method. To carry this out, the object is scanned by a laser line. Based on the deformation of the laser line, a network provides the object dimensions. The behavior of the laser line is obtained by image processing. The modelling of the relationship between the laser line and the object shape is performed by approximation networks. The architecture of the networks is built using data from images of a laser line projected onto objects whose dimensions are known. The approach of the neural networks is to perform the contouring without measurements on the optical setup. Also, the network provides the camera parameters and the setup parameters. Thus, the performance and accuracy are improved, because measurement errors are not passed to the contouring system. To describe the accuracy, a root mean square error is calculated using data produced by the networks and data provided by a contact method. This technique is tested with real objects and the experimental results are presented.
In: Computer Vision ISBN: 978-1-61209-399-4
Editor: Sota R. Yoshida © 2011 Nova Science Publishers, Inc.
Chapter 1
SOME APPLICATIONS OF COMPUTER VISION SYSTEMS IN MICROMECHANICS
E. Kussul, O. Makeyev, T. Baidyk, A. Martín-Gonzalez and G. Toledo-Ramirez
ABSTRACT
For general purpose image recognition systems we developed three neural classifiers: LIRA, PCNC, and the Pairwise Coding Classifier. We tested the neural classifiers in various tasks of image recognition in the area of micromechanics. The first task was the texture recognition problem. Different methods of mechanical metal treatment give different surface textures. A computer vision system was made for the recognition of four different types of workpiece treatment: milled, polished with sandpaper, turned with a lathe, and polished with a file. The LIRA classifier was used to solve this problem. 20 samples of each texture were prepared for the experiments. Some randomly selected samples were used for classifier training, and the other samples were used for testing. In the case of 10 training samples a 99.8% recognition rate was obtained. The second task was connected with recognition of the shape of small screws. When small screws are manufactured on a computer numerical control (CNC) machine tool, the cylindrical part of the screw is turned with one cutter and the thread is made with another cutter. The mutual position of the cutters is sometimes not aligned precisely. In this case the shape of the thread will be incorrect. The recognition of the thread shape gives information about the
mutual position of the cutters. This information allows us to correct the cutter positions. We tested the LIRA classifier in the task of thread shape recognition. A recognition rate of 98.9% was obtained. The cost of micromechanical devices will depend not only on the cost of manufacture of components but also on the cost of device assembly, so assembly automation is very important. Many assembly procedures can be made with the aid of computer vision systems. Earlier we tested the LIRA classifier to obtain adaptive feedback in the pin-hole alignment task. Now we consider the problem of component selection from the depository, detection of their orientation and transportation to the assembly device. This is the third task that we describe in this chapter. This system will include a micromanipulator and camera, and will permit us to work with different micromechanical components without the development of new mechanical feeders. The manipulator, using the computer vision system, must recognize the types of components, their position (coordinates) and their orientation. To solve this problem a special program was written and tested. We will describe the image database and the recognition results obtained in this task.
There are some problems specific to the industrial environment; for example, the metal surface can have dust on it.
The reason for choosing a system based on a neural network architecture for the current task was that such systems have already proved their efficacy in texture recognition due to their adaptability and robustness to texture variety [8].
We have chosen the LIRA neural classifier because we have already applied it in the flat
image recognition problem in microdevice assembly and the results were very promising [1].
We have also tested it on the handwritten digit recognition task, and its recognition rate on the MNIST database was 99.45% [1], which is among the best results obtained on this database.
The task of metal surface texture recognition is important to automate the assembly
processes in micromechanics [9]. To assemble a device it is necessary to recognize the position and orientation of the work pieces to be assembled [1]. It is useful to identify the surface of a work piece to recognize its position and orientation. For example, let the shaft have two polished cylinder surfaces for bearings, one of them milled with grooves for a dowel joint, and the other one turned with a lathe. It will be easier to obtain the orientation of the shaft if we can recognize both types of surface textures.
There are works on fast detection and classification of defects of mechanically treated
metal surfaces [7, 14]. In [14] a back propagation neural network based approach to surface
inspection of rolling bearing metal rings is proposed. In [7] a Genetic Algorithm (GA) based
approach to the detection and quantitative evaluation of minor flaws of metal surfaces of the multi-plate clutch is proposed. In both cases preprocessing of the initial images of the metal surfaces is needed. In [14] the brightness histogram serves as an input to the neural network. In [7] a model-based matching technique is used and the correlation between the initial image and the surface-strip model is used as a fitness function of the GA.
The only work on texture classification of mechanically treated metal surfaces that we know of is [3]. The authors propose to use a vibration-induced tactile sensor that they called the
Dynamic Touch Sensor (DTS) combined with one-layer Rosenblatt perceptron [15]. The DTS
produces signals based on the vibration induced by a sensor needle sliding across the metal
surface with fixed velocity and pressure. The motion path of the sensor is a part of a circle of
approximately 100 degrees. Such a motion path permits capturing information about the surface in both directions in one sweep; however, the system is very sensitive to changes in texture position and orientation. The spectral energy of the sensor signal was used as an input to the neural classifier. Metal surfaces were characterized by two characteristics: surface type and surface
roughness. Surface roughness is a measure of the average height of the surface irregularities
given in microinches. In testing the texture recognition system, six types of surfaces and six values of surface roughness were used. The obtained recognition rate varied from 74.16% in
recognition of two types of metal surfaces with roughness of 8 microinches to 100% in
recognition of three types of metal surfaces with roughness of 250 microinches. In our
experiments we achieved the recognition rate of 99.7% in recognition of four types of metal
surfaces with roughness of the order of 1 microinch. Moreover, our approach does not require the use of a complex mechanical sensor and is robust to changes in texture orientation and position.
To test our texture recognition system we created our own database of metal surface images. The four texture classes correspond to metal surfaces after: milling, polishing with sandpaper, turning with a lathe and polishing with a file (Figure 1). 20 grayscale images of 220 × 220 pixels were taken for each class. We randomly divided these 20 images into the training and test sets. It can be seen that different lighting conditions greatly affect the grayscale properties of the images. The textures may also be arbitrarily oriented and not centered perfectly. Metal surfaces may have minor defects and dust on them. All these image properties correspond to the conditions of a real industrial environment and make the texture recognition task more complicated. Two of the four texture classes, those that correspond to polishing with sandpaper and to polishing with a file, can sometimes hardly be distinguished with the naked eye (Figure 1, columns b and d).
Figure 1. Examples of metal surfaces after (columns): a) milling, b) polishing with sandpaper, c) turning
with lathe, d) polishing with file.
In the first stage of our experiments with this image database we worked with a texture recognition system based on the Random Subspace Classifier (RSC) [2]. The best recognition rate of 80% was obtained in recognition of three classes of metal surfaces. The class that corresponds to the metal surface after polishing with a file was excluded because of its similarity with the class that corresponds to the metal surface after polishing with sandpaper. The numbers of images in the training and test sets were 3 and 17, respectively. The use of the LIRA neural classifier in our texture recognition system resulted in a recognition rate of 83.39% in recognition of four classes of metal surfaces with only 2 images in the training set and 18 images in the test set.
The LIRA (Limited Receptive Area) [1] neural classifier was developed on the basis of
the Rosenblatt perceptron [15]. The three-layer Rosenblatt perceptron consists of the sensor
S-layer, associative A-layer and the reaction R-layer. The first S-layer corresponds to the
retina. In technical terms it corresponds to the input image. The second A-layer corresponds to
the feature extraction subsystem. The third R-layer represents the system's output. Each
neuron of this layer corresponds to one of the output classes.
The associative layer A is connected to the sensor layer S with randomly selected, non-trainable connections. The weights of these connections can be equal either to 1 (positive
connection) or to -1 (negative connection). The set of these connections can be considered as
a feature extractor.
The A-layer consists of 2-state neurons; their outputs can be equal either to 1 (active state) or
to 0 (non-active state). Each neuron of the A-layer is connected to all the neurons of the R-
layer. The weights of these connections are modified during the perceptron training.
We have made four major modifications to the original perceptron structure. These modifications concern the random procedure of arrangement of the S-layer connections, the
adaptation of the classifier to grayscale image recognition, the training procedure and the rule
of winner selection.
We propose two variants of the LIRA neural classifier: LIRA_binary and
LIRA_grayscale. The first one is meant for the recognition of binary (black and white)
images and the second one for the recognition of grayscale images. The structure of the
LIRA_grayscale neural classifier is presented in Figure 2.
The one-layer perceptron has very good convergence, but it demands the linear
separability of the classes in the parametric space. To obtain linear separability it is necessary
to transform the initial parametric space represented by pixel brightness to the parametric
space of larger dimension. In our case the connections between the S-layer and the A-layer
transform initial (WS · HS) - D space (WS and HS stand for width and height of the S-layer)
into N-dimension space represented by binary code vector. In our experiments WS = HS = 220
and N varied from 64,000 to 512,000. Such transformation improves the linear separability.
The coding procedure used in the LIRA classifier is the following.
In the case of the LIRA_binary neural classifier, A-layer neurons are connected to S-layer neurons directly (Figure 3). In this case the S-layer consists of 2-state neurons; their outputs can
be equal either to 1 (active state) or to 0 (non-active state). The weights of connections
between the S-layer and the A-layer are randomly selected and are equal either to 1 (positive
connection) or to -1 (negative connection). For example, in Figure 3 four connections, two
positive (marked with arrows) and two negative (marked with circles), are chosen within the
window h × w. The i-th neuron of the A-layer is active (ai = 1) only if all the positive connections with the retina correspond to active neurons of the S-layer and all the negative connections correspond to non-active neurons, and it is non-active (ai = 0) in the opposite case.
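To make the mechanism concrete, here is a minimal sketch of LIRA_binary feature extraction as described above. It is an illustrative reconstruction rather than the authors' implementation: the random placement of the window for each A-layer neuron, the toy image and all parameter values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_binary_masks(n_assoc, img_h, img_w, win_h, win_w, n_pos, n_neg):
    """For each A-layer neuron, choose a random window on the retina and
    random positive/negative connection points inside that window."""
    masks = []
    for _ in range(n_assoc):
        top = rng.integers(0, img_h - win_h + 1)
        left = rng.integers(0, img_w - win_w + 1)
        ys = rng.integers(top, top + win_h, n_pos + n_neg)
        xs = rng.integers(left, left + win_w, n_pos + n_neg)
        signs = np.array([1] * n_pos + [-1] * n_neg)
        masks.append((ys, xs, signs))
    return masks

def code_binary_image(image, masks):
    """A-layer neuron i is active only if every positive connection falls on an
    active S-layer neuron (pixel == 1) and every negative connection on a
    non-active one (pixel == 0)."""
    a = np.zeros(len(masks), dtype=np.uint8)
    for i, (ys, xs, signs) in enumerate(masks):
        pix = image[ys, xs]
        if np.all(pix[signs == 1] == 1) and np.all(pix[signs == -1] == 0):
            a[i] = 1
    return a

# toy usage on a random 20 x 20 binary image
img = (rng.random((20, 20)) > 0.5).astype(np.uint8)
masks = make_binary_masks(n_assoc=1000, img_h=20, img_w=20,
                          win_h=10, win_w=10, n_pos=2, n_neg=2)
print("active A-layer neurons:", int(code_binary_image(img, masks).sum()))
```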
To adapt the LIRA neural classifier for grayscale image recognition we have added the
additional 2-state neuron layer between the S-layer and the A-layer. We term it the I-layer
(intermediate layer, see Figure 2).
The input of each I-layer neuron is connected to one neuron of the S-layer and the output
is connected to the input of one neuron of the A-layer. All the I-layer neurons connected to
one A-layer neuron form the group of this A-layer neuron. The number of neurons in one
group corresponds to the number of positive and negative connections between one neuron of
the A-layer and the retina in the LIRA_binary structure. There are two types of I-layer
neurons: ON-neurons and OFF-neurons. The output of the ON-neuron i is equal to 1 if its input value is larger than the threshold θi, and it is equal to 0 in the opposite case. The output of the OFF-neuron j is equal to 1 if its input value is smaller than the threshold θj, and it is equal to 0 in the opposite case. The number of ON-neurons in each group corresponds to the
number of the positive connections of one A-layer neuron in the LIRA_binary structure. The
number of OFF-neurons in each group corresponds to the number of the negative
connections. For example, in Figure 2, the group of eight I-layer neurons, four ON-neurons
and four OFF-neurons, corresponds to one A-layer neuron. The rule of connection
arrangement between the retina and one group of the I-layer is the same as the rule of mask
design for one A-layer neuron in the LIRA_binary structure. The thresholds θi and θj are selected randomly from the range [0, η · bmax], where bmax is the maximal brightness of the image pixels and η is a parameter selected experimentally from the range [0, 1]. The i-th neuron of the A-layer is active (ai = 1) only if the outputs of all the neurons of its I-layer group are equal to 1, and it is non-active (ai = 0) in the opposite case.
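A minimal sketch of the grayscale coding step follows, assuming the I-layer thresholds are drawn uniformly from [0, η · bmax] as just described; the parameter values (η = 0.3, 2,000 A-layer neurons instead of 512,000) and the random test image are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_group(img_h, img_w, win_h, win_w, n_on, n_off, eta=0.3, b_max=255):
    """One I-layer group feeding a single A-layer neuron: random points inside
    a random window, each tagged ON or OFF and given a random threshold."""
    top = rng.integers(0, img_h - win_h + 1)
    left = rng.integers(0, img_w - win_w + 1)
    n = n_on + n_off
    ys = rng.integers(top, top + win_h, n)
    xs = rng.integers(left, left + win_w, n)
    is_on = np.array([True] * n_on + [False] * n_off)
    thresholds = rng.uniform(0.0, eta * b_max, n)
    return ys, xs, is_on, thresholds

def a_neuron_output(image, group):
    """The A-layer neuron fires only if every ON-neuron sees a pixel brighter
    than its threshold and every OFF-neuron a pixel darker than its threshold."""
    ys, xs, is_on, thr = group
    pix = image[ys, xs].astype(float)
    return int(np.all(pix[is_on] > thr[is_on]) and np.all(pix[~is_on] < thr[~is_on]))

# toy usage: 220 x 220 grayscale image, 3 ON- and 5 OFF-neurons per group
img = rng.integers(0, 256, (220, 220))
groups = [make_group(220, 220, 10, 10, n_on=3, n_off=5) for _ in range(2000)]
code = np.array([a_neuron_output(img, g) for g in groups])
print("active A-layer neurons:", int(code.sum()))
```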
Taking into account the small number of active neurons it is convenient to represent the
binary code vector not explicitly but as a list of numbers of active neurons. Let, for example,
the vector A be:
A = 00010000100000010000.
The corresponding list of the numbers of active neurons will be 4, 9, and 16. Such a compact representation of the code vector permits faster calculations in the training procedure. Thus,
after execution of the coding procedure every image has a corresponding list of numbers of
active neurons.
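For illustration, the compact representation of the example vector above can be produced as follows (1-based neuron numbering, as in the text):

```python
import numpy as np

A = np.array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0])
active = (np.flatnonzero(A) + 1).tolist()   # indices of active neurons, 1-based
print(active)                               # [4, 9, 16]
```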
The LIRA_binary neural classifier can be used for the recognition of grayscale images
after transforming them to binary form, for example, by the following procedure:
$$\theta = \frac{2\sum_{i=1}^{W_S}\sum_{j=1}^{H_S} b_{ij}}{W_S \cdot H_S}, \qquad (4)$$

$$s_{ij} = \begin{cases} 1, & \text{if } b_{ij} \geq \theta, \\ 0, & \text{if } b_{ij} < \theta, \end{cases} \qquad (5)$$
where θ is the threshold value, bij is the brightness of the grayscale image pixel with coordinates (i, j), and sij is the brightness of the resulting binary image pixel with coordinates (i, j). However, experiments performed on the MNIST grayscale image database [1] have shown that the recognition rate obtained with the LIRA_grayscale neural classifier on the original grayscale images was better than the one obtained with the LIRA_binary neural classifier on the transformed binary images. That is why, taking into account that our current image database is grayscale, we worked with the LIRA_grayscale neural classifier.
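A sketch of this binarization step is given below. Note that the scale factor applied to the mean brightness follows the reconstructed Eq. (4) and is uncertain, so it is exposed as a parameter rather than hard-coded.

```python
import numpy as np

def binarize(image, scale=2.0):
    """Threshold a grayscale image per Eqs. (4)-(5): theta = scale * mean
    brightness (the scale value is an assumption), s_ij = 1 if b_ij >= theta,
    else 0."""
    theta = scale * image.mean()
    return (image >= theta).astype(np.uint8)

# toy usage
img = np.random.default_rng(2).integers(0, 256, (220, 220))
print("fraction of white pixels:", float(binarize(img).mean()))
```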
The training and recognition procedures are the same for both LIRA_binary and
LIRA_grayscale neural classifiers. The training procedure is the following.
The first stage. The training procedure starts with the presentation of the first image to the
LIRA neural classifier. The image is coded and the R-layer neuron excitations Ei are
computed. Ei is defined as:
$$E_i = \sum_{j=1}^{N} a_j w_{ji}, \qquad (6)$$
where Ei is the excitation of the i-th neuron of the R-layer, aj is the output signal (0 or 1)
of the j-th neuron of the A-layer, wji is the weight of the connection between the j-th neuron of
the A-layer and the i-th neuron of the R-layer.
The second stage. Robustness of the recognition is one of the important requirements the
classifier must satisfy. After calculation of the neuron excitations of the R-layer, the correct
class c of the image under recognition is read. The excitation Ec of the corresponding neuron
of the R-layer is recalculated according to the formula:
$$E_c^* = E_c \cdot (1 - T_E), \qquad (7)$$

where 0 ≤ TE ≤ 1 determines the reserve of excitation the neuron that corresponds to the correct class must have. In our experiments the value TE varied from 0.1 to 0.5.
After that we select the neuron with the largest excitation. This winner neuron represents
the recognized class.
The third stage. Let us denote the winner neuron number as j, keeping the number of the neuron that corresponds to the correct class denoted as c. If j = c then nothing is to be done. If j ≠ c then the following modification of weights is performed:

$$w_{ic}(t+1) = w_{ic}(t) + a_i, \qquad (8)$$

$$w_{ij}(t+1) = w_{ij}(t) - a_i, \qquad (9)$$

where wij(t) and wij(t + 1) are the weights of the connection between the i-th neuron of the A-layer and the j-th neuron of the R-layer before and after modification, and ai is the output signal (0 or 1) of the i-th neuron of the A-layer.
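The three training stages can be collected into one sketch. This is an illustrative reconstruction: the non-negativity clipping of the punished weights follows the analogous rule given later in Section 2.2 and is an assumption for this chapter's classifier.

```python
import numpy as np

def train_step(active, label, W, t_e=0.3):
    """One LIRA training step.
    active : indices of active A-layer neurons for the current image
    label  : correct class c
    W      : (N, n_classes) weight matrix between A-layer and R-layer
    """
    E = W[active].sum(axis=0)            # stage 1: excitations, Eq. (6)
    E_adj = E.copy()
    E_adj[label] *= (1.0 - t_e)          # stage 2: excitation reserve, Eq. (7)
    winner = int(np.argmax(E_adj))
    if winner != label:                  # stage 3: reconstructed Eqs. (8)-(9)
        W[active, label] += 1
        W[active, winner] -= 1
        np.clip(W[:, winner], 0, None, out=W[:, winner])  # assumed clipping
    return winner

# toy usage
rng = np.random.default_rng(3)
W = np.zeros((1000, 4))
active = rng.choice(1000, size=8, replace=False).tolist()
print("winner:", train_step(active, label=2, W=W))
```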
The training process is carried out iteratively. After all the images from the training set
have been presented the total number of training errors is calculated. If this number is larger
than one percent of the total number of images, then the next training cycle is performed; otherwise the training process is stopped. The training process is also stopped if the number of
performed training cycles is more than a predetermined value.
It is obvious that in every new training cycle the image coding procedure is repeated and
gives the same results as in previous cycles. Therefore in our experiments we performed the
coding procedure only once and saved the lists of active neuron numbers for each image on
the hard drive. Later, during the training procedure, we used not the images, but the
corresponding lists of active neurons. Due to this approach, the training process was
accelerated approximately by an order of magnitude.
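The caching strategy described above might be sketched as follows; the encoder callable and the file name are placeholders for the coding procedure and storage used by the authors.

```python
import pickle

def precompute_codes(images, encoder, path="codes.pkl"):
    """Code every training image once (encoder returns the list of active
    A-layer neuron indices) and save the lists, so that later training cycles
    reuse them instead of repeating the slow coding step."""
    codes = [encoder(img) for img in images]
    with open(path, "wb") as f:
        pickle.dump(codes, f)
    return codes

def load_codes(path="codes.pkl"):
    with open(path, "rb") as f:
        return pickle.load(f)
```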
It is known [10] that the performance of recognition systems can be improved by applying distortions to the input image during the training process. In our
experiments we used different combinations of horizontal, vertical and bias image shifts,
skewing and rotation.
In the recognition procedure the R-layer neuron excitations are calculated taking the applied distortions into account:

$$E_i = \sum_{k=0}^{d}\sum_{j=1}^{N} a_{kj} w_{ji}, \qquad (10)$$
where Ei is the excitation of the i-th neuron of the R-layer, akj is the output signal (0 or 1)
of the j-th neuron of the A-layer for the k-th distortion of the initial image, wji is the weight of
the connection between the j-th neuron of the A-layer and the i-th neuron of the R-layer, d is
the number of applied distortions (case k = 0 corresponds to the initial image).
After that we select the neuron with the largest excitation. This winner neuron represents
the recognized class.
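A sketch of recognition with distortions, following the reconstructed Eq. (10): excitations are accumulated over the original image (k = 0) and its distorted versions before the winner is selected. The encoder, weight matrix and the cyclic-shift stand-ins for the 1-pixel shifts are assumptions.

```python
import numpy as np

def recognize_with_distortions(image, distortions, encoder, W):
    """Sum R-layer excitations over the original image and each distorted
    version (Eq. (10)), then return the most excited class."""
    E = np.zeros(W.shape[1])
    for img in [image] + [d(image) for d in distortions]:
        active = encoder(img)                 # list of active A-layer neurons
        E += W[active].sum(axis=0)
    return int(np.argmax(E))

# example distortions: 1-pixel shifts approximated with a cyclic roll
shift_right = lambda im: np.roll(im, 1, axis=1)
shift_down = lambda im: np.roll(im, 1, axis=0)
```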
1.4 Results
To test our texture recognition system we created our own image database of
mechanically treated metal surfaces (see Section 2 for details). We work with four texture classes that correspond to metal surfaces after: milling, polishing with sandpaper, turning with a lathe and polishing with a file. 20 grayscale images of (220 × 220) pixels were taken for each class. We randomly divided these 20 images into the training and test sets for the LIRA_grayscale neural classifier. The number of images in the training set varied from 2 to 10.
All experiments were performed on a Pentium 4, 3.06 GHz computer with 1.00 GB
RAM.
We first carried out a large number of preliminary experiments to estimate the performance of our classifier and to tune the parameter values. On the basis of these preliminary experiments we selected the best set of parameter values and carried out the final experiments to obtain the maximal recognition rate. In the preliminary experiments the following parameter values were set: window h × w with width w = 10 and height h = 10, and excitation reserve parameter TE = 0.3. The following distortions were chosen for the final experiments: 8 distortions for
training including 1 pixel horizontal, vertical and bias image shifts and 4 distortions for
recognition including 1 pixel horizontal and vertical image shifts. The number of training
cycles was equal to 30.
The numbers of ON-neurons and OFF-neurons in the I-layer neuron group that
corresponded to one A-layer neuron were chosen in order to keep the ratio between the
number of active neurons K and the total number of associative neurons N within the limits of
K ≈ c·√N, where c is a constant selected experimentally from the range [1, 5]. This ratio
corresponds to neurophysiological data. The number of active neurons in the cerebral cortex
is hundreds of times less than the total number of neurons. For example, for the total number of
associative neurons N = 512,000 we selected three ON-neurons and five OFF-neurons.
In each experiment we performed 50 runs to obtain statistically reliable results. That is,
the total number of recognized images was calculated as the number of images in the test set for one run multiplied by 50. A new mask of connections between the S-layer and the A-layer and a new division into the training and test sets were created for each run.
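The evaluation protocol (50 independent runs, each with a freshly drawn connection mask and a fresh split) could be organized as sketched below; `make_classifier` with fit/predict methods and the array layout of the data are assumed interfaces, not the authors' code.

```python
import numpy as np

def split_per_class(labels, n_train, rng):
    """Randomly pick n_train indices per class for training; the rest form the test set."""
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)

def evaluate(images, labels, make_classifier, n_runs=50, n_train=10, seed=0):
    """Average recognition rate over n_runs runs, each with a new random
    S-A connection mask and a new training/test division."""
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(n_runs):
        clf = make_classifier(rng)                    # fresh random mask
        tr, te = split_per_class(labels, n_train, rng)
        clf.fit(images[tr], labels[tr])
        rates.append(float(np.mean(clf.predict(images[te]) == labels[te])))
    return float(np.mean(rates))
```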
In the first stage of final experiments we changed the total number of associative neurons
N from 64,000 to 512,000. The results are presented in Table 1. Taking into account that the
amount of time needed for 50 runs of coding and classifier training and recognition with N = 512,000 is approximately 3 h and 20 min (4 min for one run), we can conclude that such computational time is justified by the increase in the recognition rate. That is why we used N = 512,000 in all subsequent experiments.
Table 1. Dependency of the recognition rate on the total number of associative neurons
In the third stage of the final experiments we performed experiments with different numbers of images in the training and test sets. The results are presented in Table 3. The note tr./t. reflects how many images were used for training (tr.) and how many for testing (t.). It can be seen that even in the case of using only 2 images for training and 18 for recognition, the LIRA_grayscale neural classifier gives a good recognition rate of 83.39%.
Table 3. Dependency of the recognition rate on the number of images in training set
1.5 Discussion
The LIRA neural classifier was tested in the task of texture recognition of mechanically
treated metal surfaces. This classifier does not use floating point or multiplication operations.
This property combined with the classifier's parallel structure allows its implementation in
low cost, high speed electronic devices. Sufficiently fast convergence of the training process
and very promising recognition rate of 99.7% were obtained on the specially created image
database (see Section 2 for details). There are quite a few methods that perform well when the
features used for the recognition are obtained from a training set image that has the same
orientation, position and lighting conditions as the test image; but as soon as the orientation, position or lighting conditions of the test image change with respect to those in the training set, the same methods perform poorly. The usefulness of methods that are not robust to such changes is very limited, and that is the reason for developing our texture classification system, which works well independently of the particular orientation, position and
lighting conditions. In this regard the results obtained in experiments are very promising.
1.6 Conclusion
This chapter continues the series of works on automation of micro assembly processes [1,
9].
The LIRA neural classifier is proposed for texture recognition of mechanically treated
metal surfaces. It can be used in systems that have to recognize position and orientation of
complex work pieces in the task of assembly of micromechanical devices as well as in surface
quality inspection systems. The performance of the proposed classifier was tested on the specially created image database in recognition of four texture types that correspond to metal surfaces after: milling, polishing with sandpaper, turning with a lathe and polishing with a file. A promising recognition rate of 99.7% was obtained.
2.1 Introduction
Shape recognition is a major problem in image understanding and computer vision. In the
area of micromechanics, shape recognition is an important task that is related to many micro
factory processes such as the micro device assembly task [1] and the micro piece manufacturing process [18], among others [2, 19].
The micro factory concept was proposed in Japan in 1990 at the Mechanical Engineering Laboratory (MEL) with the development of a desktop machining micro factory [13]. The micro
factory consists of machine tools such as a lathe, a milling machine and a press machine; and
assembly machines such as a transfer arm and a two-fingered hand. This portable micro
factory has external dimensions of 625 × 490 × 380 mm. The purpose of the micro factory
concept is to build very small production units that save space and energy and are characterized by a high level of flexibility.
In 1996, the Micro Equipment Technology (MET) was proposed [20]. In Mexico,
scientists from the Center of Applied Sciences and Technological Development (CCADET),
UNAM, developed and applied this technology [9]. The main idea of this method is to create successive generations of micro equipment, in which the micro equipment of each generation is smaller than that of the previous generation. Each generation produces the micro machine tools of the next generation. This approach would allow the use of low cost components for each micro equipment generation and the creation of micro factories capable of producing low cost micro devices. The disadvantage of this micro equipment is the poor precision of its processes. We therefore suggest the construction of micro factories composed of micro machine tools with automated control of their processes, in order to obtain autonomous and precise systems. It is therefore necessary to develop control systems that improve their precision.
One of the main problems in such micro factories is automation on the basis of vision systems. There are different approaches to constructing a computer vision system for these
purposes [21-23].
One of the mechanical micro components produced by micro machine tools (e.g., a micro lathe) is the screw. Screws are necessary components for constructing micro equipment. The problem with producing these micro pieces is that some of them could have shape defects. These defects are related to the microscrew cutter tool position (Figure 4). Because of the micro machine's small size it is impossible to use ordinary (large-scale) instruments for
control processes. It is necessary to create a vision system that automatically controls the manufacturing process of these work pieces by means of shape recognition. High-quality micro pieces will be achieved in this case.
Our purpose is to develop a vision system based on an artificial neural network (LIRA_grayscale) to support the micro screw manufacturing process. The system's target will be the recognition of the shape of the micro work pieces. Thus, the information obtained can be used to make the appropriate manufacturing corrections (i.e., relocating the cutter position) in case of error detection. The vision system proposed in this chapter can be used with micro machine tools of MET technology to improve their performance and precision.
Figure 4. The position of the cutting tool relative to the work piece: (a) cutting tool located 0.1
mm below the correct position, (b) cutting tool located at the correct position, (c) cutting tool
located 0.1 mm above the correct position, (d) cutting tool located 0.2 mm above the correct
position.
approaches is a vector presentation of associative memory [26]. We will describe a special type of neural classifier [1, 27].
The neural classifier LIRA (Limited Receptive Area), based on Rosenblatt's perceptron principles, was tested on micro device assembly tasks and handwritten digit recognition, showing good results [1, 27]. We adapted LIRA for classifying microscrew grayscale images after the manufacturing process. The resulting neural network was termed LIRA_grayscale [27, 28].
2.2.1 Architecture
LIRA_grayscale consists of four layers: input layer (S), intermediate layer (I), associative
layer (A) and output layer (R). The neural classifier architecture is presented in Figure 5. The
input layer neurons correspond to the image pixels and have outputs in the range [0, 255], where 0 and 255 indicate the minimum (black) and maximum (white) pixel brightness values. This layer has W × H neurons, where W and H correspond to the image width and height. The S-layer is connected to the A-layer through the I-layer by means of a random procedure that we will describe later, and the resulting connections are non-trainable.
The intermediate layer adapts LIRA_grayscale for handling grayscale images. This layer is located between the S-layer and the A-layer. The I-layer contains N groups of neurons, where N corresponds to the total number of neurons in the associative layer A. There are two types of neurons in the I-layer: ON-neurons and OFF-neurons. According to the brain mechanisms for vision presented in [29], ON-neurons react to the presence of a stimulus and OFF-neurons to its absence. These neurons have two-state output values {0, 1}, where 1 indicates the active state of a neuron.
The associative layer corresponds to an image feature extractor containing neurons with
two-state outputs {0, 1}. The A-layer is fully connected to the R-layer, and the weights of these connections are modified during the training process. The output layer consists of linear neurons, and the number of neurons in this layer corresponds to the number of classes to be recognized. The R-layer is the system's output.
The output of an ON-neuron will be equal to 1 (active state) if its input value is larger than a threshold Tpk and will be equal to 0 otherwise, i.e.,

$$\mathrm{on}(x_{ij}) = \begin{cases} 1, & x_{ij} > T_{pk} \\ 0, & x_{ij} \leq T_{pk} \end{cases} \qquad (11)$$
The output of an OFF-neuron will be equal to 1 (active state) if its input value is smaller than a threshold Tnk and will be equal to 0 otherwise, i.e.,

$$\mathrm{off}(x_{ij}) = \begin{cases} 1, & x_{ij} < T_{nk} \\ 0, & x_{ij} \geq T_{nk} \end{cases} \qquad (12)$$
An associative neuron will have an output equal to 1 (active state) if all the ON-neurons and OFF-neurons connected to it are active; otherwise the output will be 0. Each associative neuron acts as one image feature, whose output shows whether this feature is present or absent in the image.
If the excitation yg of the winner neuron g and the excitation yc of its closest competitor c satisfy

$$\frac{y_g - y_c}{y_g} < T_e, \qquad (13)$$
the competitor is now considered the winner, where Te is a constant termed the winner
neuron superfluous excitation parameter. This modification was proposed in order to improve
the recognition process, eliminating the possibility of selecting a winner neuron by chance.
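A sketch of the check implied by Eq. (13): the nominal winner g keeps the win over its closest competitor c only if the relative gap in excitation is at least Te; otherwise the competitor is treated as the winner, as the text states. The surrounding page is incomplete, so exactly where this rule is applied is an assumption here.

```python
def winner_with_margin(y, t_e):
    """y: list of R-layer excitations; returns the winning class index after
    the Eq. (13) margin check against the closest competitor."""
    order = sorted(range(len(y)), key=lambda i: y[i], reverse=True)
    g, c = order[0], order[1]
    if y[g] > 0 and (y[g] - y[c]) / y[g] < t_e:
        return c          # gap too small: competitor considered the winner
    return g

print(winner_with_margin([10.0, 9.5, 3.0], t_e=0.1))   # prints 1
```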
The learning process consists of three stages described next.
In the first stage, the input image is coded and the excitation of each R-layer neuron is computed:

$$y_i = \sum_{k=1}^{N} w_{ki} a_k, \qquad (14)$$

where yi is the output (excitation) of the i-th neuron of the R-layer, ak is the output signal (0 or 1) of the k-th neuron of the A-layer, and wki is the weight of the connection between the k-th neuron of the A-layer and the i-th neuron of the R-layer.
In the second stage, the excitation yr of the R-layer neuron that corresponds to the correct class r is recalculated as

$$y_r = y_r \cdot (1 - T_e). \qquad (15)$$

After this, the neuron having maximum excitation is selected as the winner neuron g.
In the third stage, if the winner neuron g does not correspond to the correct class r, the connection weights are reinforced as follows:

$$\forall k: \quad w_{kr}(t+1) = w_{kr}(t) + a_k,$$

$$\forall k: \quad w_{kg}(t+1) = w_{kg}(t) - a_k, \qquad \text{if } w_{kg}(t+1) < 0 \text{ then } w_{kg}(t+1) = 0,$$

where wki(t) is the connection weight between the k-th neuron of the A-layer and the i-th neuron of the R-layer before reinforcement, wki(t + 1) is the same connection weight after reinforcement, and ak is the output signal of the k-th neuron of the A-layer.
This process will be repeated until a convergence criterion is reached, in our case, until
the maximum number of training cycles (epochs) is attained or the recognition error equals
zero. A test phase is then carried out, in which unknown test images are presented to the
network to establish the extent to which the network has learned the task in hand.
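The training loop with the stopping rule just described might look as follows; the classifier object and its single-image `train_step` method are assumed interfaces standing in for the procedure of Eqs. (14)-(15) and the reinforcement step above.

```python
def train(classifier, train_images, train_labels, max_epochs=30):
    """Repeat training cycles until the classifier makes no errors on the
    training set or the maximum number of epochs is reached; the test phase
    on held-out images is run afterwards by the caller."""
    for _ in range(max_epochs):
        errors = 0
        for image, label in zip(train_images, train_labels):
            predicted = classifier.train_step(image, label)   # assumed API
            if predicted != label:
                errors += 1
        if errors == 0:
            break
    return classifier
```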
2.3 Results
where c is a constant experimentally selected from the range [1, 13]. This relation corresponds to neurophysiological facts: the number of active neurons in the cerebral cortex is hundreds of times smaller than the total number of neurons.
Figure 6. Micro screw images: (a) class 0, (b) class 1, (c) class 2, (d) class 3.
Three main experiments were carried out to evaluate the system. In order to obtain statistically reliable results we executed 30 runs for each experiment. The total number of images to recognize was therefore the number of test images for one run multiplied by 30, and the number of recognition errors was the sum of the misrecognized images over all runs.
In the first experiment we chose images from six micro screws per class for training and the remaining four for testing, out of a total of 10 micro screws per class (Table 4). Since we varied the number of image transformations (0, 4, and 8), we obtained three different numbers of training images. The second experiment used images from seven micro screws per class for training and three for testing (Table 5). In the last experiment we selected images from eight micro screws per class for training and the remaining two for testing (Table 6).
It can be seen that the training sample size is an important parameter: as the number of training images increases, the recognition rate also increases. For example, Table 4 with 264 training images gives a recognition rate of 98.33%, while Table 6 with 352 training images gives 98.71%, that is, 0.38% higher accuracy.
Image transformations also help to increase the recognition rate. For example, in Table 5 the recognition rate improves as the number of image transformations increases.
In addition, we compared our results with a classifier based on ensembles of extremely randomized decision trees for image classification [30], which has been evaluated on several publicly available object recognition databases (MNIST, ORL, COIL-100, etc.) with promising results. We tested this classifier in the same way as ours: with 3168 training images and 88 test images it produced 44 recognition errors out of a total of 2640 images to recognize, i.e., a recognition rate of 98.33%. The processing time for this classifier was 10 minutes, while LIRA_grayscale performed the same task in two minutes.
2.4 Conclusion
The neural classifier LIRA_grayscale for shape recognition of micro screw images in the manufacturing process was developed. It was trained and tested with a database containing images of 40 micro screws of four classes, where every class corresponds to a definite cutter position. The recognition rate for the training phase was 100%, and for the test phase the best recognition rate was 98.90%. These results are acceptable; nevertheless, it is necessary to improve the classification precision.
Finally, we want to point out that the major advantage of the LIRA_grayscale neural classifier is its universality. It was not constructed for a particular shape recognition problem, and it can be adapted and implemented for different image recognition problems.
3.1 Introduction
The miniaturization of products, coupled with the development of new technologies [31], [32], has enabled us to develop equipment and instruments of small dimensions [20], [33]. Our aim is to create a fully automated desktop microfactory. With the help of this microfactory it is possible to create smaller and smaller microfactories. We consider different generations of microfactories, each corresponding to a microfactory of certain dimensions. The proposed technology is MicroEquipment Technology (MET) [20]. In the development of each generation new problems will appear, but many useful products can be produced. We are developing this technology at the Centre of Applied Science and Technological Development (CCADET) of the National Autonomous University of Mexico (UNAM) [9], [33].
The most important areas of microtechnology applications are: automotive industry,
medical engineering, electronics, security systems, vision systems, environmental control,
traffic control, communication networks and aerospace technology.
The potential market for microproducts was estimated at more than 30 billion euros in 2002 [34]. Automation of micromanufacturing and microassembly processes is a fundamental goal for achieving good performance of the corresponding equipment [5], [13], and it is important to apply computer vision methods to this automation. One essential part of an automated microfactory is a system for handling work pieces. We are developing a fully automated work piece handling system. The goal of this system is to recognize one randomly located work piece in the work area, to determine its coordinates, to grasp it with a manipulator, and to transport it to another place. This system has to be flexible; that is, it must be able to handle work pieces of different dimensions and shapes.
The scheme of the proposed system is presented in Figure 7.
This system is composed of two cameras, one manipulator, and three subsystems: the manipulator control subsystem (MCS), the technical vision subsystem (TVS), and the intelligent manipulation subsystem (IMS). The manipulator has special characteristics for mini/micro handling and will have six degrees of freedom. We use general purpose, low cost, commercial CCD cameras, which serve as the input to the TVS. The main purpose of the TVS is to process the data from the cameras in order to recognize work pieces and their positions; this is the main topic of this section. The IMS controls the whole system: it gives instructions to the manipulator, processes the data from the TVS, and communicates between both and with the user. The user can be a person or another system.
We began with a system that contains one camera and the TVS. The TVS is based on a method for micro work piece recognition and position detection. For this purpose we adapted the neural classifier LIRA_grayscale [35]. In the work piece recognition task we obtained a recognition rate of more than 90%, and in the position detection task we obtained the coordinates of the recognized work pieces.
The object recognition problem and different methods to solve it are presented in [36], [37]. There are many approaches and methods for object recognition, and very often their first step is to extract contours from the image. We analyze four methods based on contour images applied to 2D objects: pattern matching, principal component analysis, graph matching, and the generalized Hough transform [38].
$$r(x, y) = \sum_{i \in T}\sum_{j \in T} \left( I(x+i,\, y+j) - T(i, j) \right)^2 \qquad (16)$$

If the resulting sum is smaller than an experimentally obtained threshold, the object template T is found in I. This method is time and memory consuming, and it does not handle brightness changes or scaling in the image.
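A direct sketch of the exhaustive matching in Eq. (16), assuming the image and template are 2-D NumPy arrays of gray levels; the brute-force double loop is precisely why the method is described as time and memory consuming:

import numpy as np

def ssd_match(image, template):
    """Slide the template over the image and compute the squared-difference
    score r(x, y) of Eq. (16) at every admissible position."""
    H, W = image.shape
    th, tw = template.shape
    scores = np.empty((H - th + 1, W - tw + 1))
    for y in range(H - th + 1):
        for x in range(W - tw + 1):
            diff = image[y:y + th, x:x + tw].astype(float) - template
            scores[y, x] = np.sum(diff * diff)
    return scores   # the smallest score marks the best candidate position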
$$X(\theta) = \begin{pmatrix} a_1 \cos\theta \\ a_2 \sin\theta \end{pmatrix} \qquad (17)$$
The aim of the TVS is to recognize microobjects and their positions. It has three internal
subsystems: interface, object finder that searches for certain work pieces and defines their
position and orientation, and the camera controller that controls the camera and its data flow
(Figure 8).
The task of the TVS is to find the object in the image. Figure 9 shows the coordinates x, y and the orientation angle θ of the required work piece. The task is to find a work piece, its coordinates, and its orientation with acceptable performance.
Figure 9. Example of the object with coordinates x, y and the orientation angle θ.
Figure 10. Seven different types of work pieces: (a) tube base, (b) cone head screw, (c) round head
screw, (d) cone, (e) stator axis, (f) Allen screw, (g) wire terminal.
We work with seven classes of objects: tube base, cone head screw, round head screw, cone, stator axis, Allen screw, and wire terminal. These work pieces sometimes have dirty surfaces, insufficient illumination, and heterogeneous brightness; sometimes they are surrounded by shadows that complicate the recognition task. Every work piece has to be recognized by the TVS from a set of randomly distributed work pieces. Below we describe the preprocessing of the images; in the next section we describe the proposed method of image recognition.
The LIRA_grayscale neural classifier is an artificial neural network with four layers: the input layer (S), the layer of groups (I), the associative layer (A), and the output layer (R). The S layer corresponds to the input image to be classified. In this layer, called the retina, each neuron corresponds to the brightness of one pixel of the image to be processed; therefore, the output range of these neurons is [0, B], where 0 corresponds to null brightness (black) and B to the highest brightness. This layer has W x H neurons, where W and H are, respectively, the width and the height of the image to be classified. The group layer, or I layer, contains N neuron groups; each group i contains p ON-neurons and q OFF-neurons. Each ON-neuron is active if xij > TONij, and each OFF-neuron is active if xij < TOFFij, where xij is the input of the corresponding neuron (the brightness of the corresponding pixel of the S layer), TONij is the ON-neuron threshold, and TOFFij is the OFF-neuron threshold. Each neuron threshold is randomly selected within [0, B x h], where h is a constant selected experimentally from (0, 1]. The neurons of a group are randomly connected with the neurons of the S layer located in a rectangular window (w x h) defined on the S layer. The values dx and dy are selected randomly from [0, W – w] and [0, H – h], respectively. The parameters w and h are very important for the performance of the neural classifier and should be chosen experimentally. The connections between the S layer and the I layer are not modified in the training process.
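A sketch of how one I-layer group could be generated under the description above: a random rectangular window on the retina, random pixel positions inside it for the p ON-neurons and q OFF-neurons, and random thresholds in [0, B x h]. The helper below is illustrative, not the authors' implementation:

import numpy as np

rng = np.random.default_rng(0)

def make_group(W, H, w, h, p, q, B, h_const):
    """Create one I-layer neuron group: a random window position on the retina,
    random connected pixel coordinates and random thresholds, all fixed once
    and never modified during training."""
    dx = rng.integers(0, W - w + 1)            # window position on the retina
    dy = rng.integers(0, H - h + 1)
    xs = dx + rng.integers(0, w, size=p + q)   # pixels connected to the group
    ys = dy + rng.integers(0, h, size=p + q)
    thr = rng.uniform(0, B * h_const, size=p + q)
    return xs, ys, thr                         # first p entries: ON-neurons, last q: OFF-neurons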
The A layer contains N neurons; each neuron has (p + q) inputs connected to the outputs of the corresponding group. An A layer neuron is active if and only if all its inputs are active; its output equals 1 if it is active and 0 otherwise. The connections between layers I and A are not modified in the training process. All A layer neurons are connected to each R layer neuron, and the weights of these connections are modified in the training process. The output of R layer neuron i is calculated with the following equation:

$$y_i = \sum_{j=1}^{N} w_{ji}\, a_j, \qquad (18)$$

where wji is the connection weight between A layer neuron j and R layer neuron i, and aj is the output of A layer neuron j. When an image is presented at the classifier input we calculate the activity of the A layer neurons and represent this activity as a binary vector A. We term this calculation 'image coding' and the vector A the 'image code'. We store the image codes of the training set on the hard drive so as not to repeat the coding process in each training cycle.
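Image coding then amounts to evaluating every group on the input image and storing the resulting binary vector. A compact sketch, reusing the group layout produced by the previous fragment (again a hypothetical helper, not the authors' code):

import numpy as np

def image_code(img, groups, p):
    """Return the binary A-layer vector (the 'image code') for one image:
    an A-layer neuron is 1 iff all ON- and OFF-neurons of its group fire."""
    code = np.zeros(len(groups), dtype=np.uint8)
    for k, (xs, ys, thr) in enumerate(groups):
        vals = img[ys, xs].astype(float)
        on_ok  = np.all(vals[:p] > thr[:p])    # ON-neurons: input above threshold
        off_ok = np.all(vals[p:] < thr[p:])    # OFF-neurons: input below threshold
        code[k] = 1 if (on_ok and off_ok) else 0
    return code

# The codes of the training images are computed once and cached, as described above.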
In the training process the outputs are recalculated as

$$y_i := y_i (1 + T_E), \qquad (19)$$

where TE is a constant termed the additional excitation of the winner neuron. After that, the neuron of layer R with the highest output value, called the winner, is detected. This neuron represents the recognized class for the given input. Let yw be the winner neuron output and yc the output of the neuron that represents the correct class. If yw = yc, nothing is done. If yw ≠ yc, then

$$(\forall j)\quad w_{jc}(t+1) = w_{jc}(t) + a_j, \qquad (20)$$
$$(\forall j)\quad w_{jw}(t+1) = w_{jw}(t) - a_j, \qquad (21)$$
$$(\forall j)\quad \text{if } w_{jw}(t+1) < 0 \text{ then } w_{jw}(t+1) = 0, \qquad (22)$$

where wjc(t) is the weight of the connection between A layer neuron j and the correct R layer neuron c before modification and wjc(t+1) is the weight after modification; wjw(t) is the weight of the connection between A layer neuron j and the winner neuron of the R layer before modification and wjw(t+1) is the weight after modification; aj is the output value of A layer neuron j. The training process of the neural classifier is iterative. First all images of the training set are coded; after this, each training cycle deals with the image codes rather than with the initial images, which saves computation time. The training process stops after a fixed number of cycles or when the number of errors in a training cycle falls below a predefined value.
3.4.3. Distortions
The training image set can be enlarged by adding new images obtained by applying distortions to the original images. In our investigation we applied rotational distortions: each image was rotated in the clockwise and counter-clockwise directions, forming pairs of new images.
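A sketch of such rotational distortions, using scipy.ndimage.rotate as a stand-in for whatever interpolation the authors actually used (an assumption):

from scipy.ndimage import rotate

def rotation_distortions(img, angles=(5, 10, 15)):
    """For each angle, add a clockwise and a counter-clockwise rotated copy."""
    out = []
    for a in angles:
        out.append(rotate(img, +a, reshape=False, mode='nearest'))
        out.append(rotate(img, -a, reshape=False, mode='nearest'))
    return out   # 2 * len(angles) distorted images per original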
The user defines properties such as the height and width of each type of work piece in the Optik module, and this information is stored in the system. The source image contains several randomly located work pieces. The user marks each work piece approximately at the centre of its major axis, and the Optik module then normalizes the image of each work piece. These normalized images are stored, creating the database. The RNA module implements the LIRA neural classifier; it communicates with the other modules and with the user only through the RNA interface module. The RNA interface module is designed to manage the neural classifiers and the available databases, to train and test the classifier, and to look for a certain work piece in a source image. The OptikRNA software was tested and the results are presented in this section.
3.5.2. Databases
We use two databases in our experiments. The first, smaller database, containing 150 images, was used for initial experiments with the LIRA classifier and for parameter tuning. The second, larger database, containing 320 images, was used to verify the results and for practical recognition and position-finding experiments.
The work pieces are not isolated (they are heaped up), so the images can contain some work pieces in addition to the normalized one.
The first database contains 30 images for each of five different work piece classes: 15 images for the training set and the other 15 for the test set. The first four classes correspond to different types of work pieces and the fifth to the absence of a work piece in the centre of the image (Figure 13).
Figure 13. Examples of five image classes from database I used in parameter tuning of the classifier:
(a) screw I, (b) screw II, (c) tube base, (d) cone, and (e) no work piece in central point of image.
The second database contains 40 images for each of eight different classes: 20 images for the training set and 20 for the test set. The eight classes consist of seven different types of work pieces (Figure 14) and an eighth class corresponding to the absence of a work piece in the centre of the image. The images for the training and test sets were selected randomly.
Figure 14. Examples of eight different classes from database II used for LIRA neural classifier training and test: (a) tube base, (b) cone head screw, (c) round head screw, (d) cone, (e) stator axis, (f) Allen screw, (g) wire terminal, and (h) no work piece in the central point of the image.
The training set of images can be enlarged by applying distortions to the original set. The images have low resolution and heterogeneous brightness, and some of them contain shadows. In addition, the background is sometimes not homogeneous. Such characteristics reflect real conditions and make the recognition problem more complicated. We trained and tested the classifier on normalized images from our databases. Due to the random selection of work pieces from the database into the training and test sets, the obtained results are statistically reliable.
We tuned the classifier parameters with several experiments on database I. For the LIRA neural classifier we searched for the parameters (for example, w and h of the LIRA window, the number of training cycles, and the number of neurons of the A layer) that achieve the best recognition rate; several experiments were made for this purpose. We performed similar experiments with contour images of the work pieces as well. Final experiments were made on database II to recognize the work piece positions.
Number of experiments             1      2      3      4      5      6      7      8      9
Window size (w x h)               10x10  12x12  15x15  15x15  10x10  13x13  17x17  15x15  15x15
Experimental classifier constant  0.8    1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0
Neurons in layer A (N, thousands) 200    175    175    175    175    175    175    175    200
ON-neurons per group (p)          3      3      4      4      4      5      4      4      4
OFF-neurons per group (q)         4      4      3      3      3      3      3      3      3
Active neurons (%)                0.06   0.09   0.16   0.12   0.13   0.08   0.17   0.17   0.15
Recognition rate (%)              68     89     94     93     85     88     93     89     89
orientations. The results of these experiments are presented in Table 8. The best result was obtained in the experiment with six distortions (+15°, +10°, +5°, -5°, -10°, -15°).
Experiment number        1           2                    3                  4
Number of distortions    6           10                   10                 16
Step (°)                 5           10                   3                  3
Distortion angles (°)    5, 10, 15   10, 20, 30, 40, 50   3, 6, 9, 12, 15    3, 6, 9, 12, 15, 18, 21, 24
Training cycles          80          10                   60                 80
Recognition rate (%)     92          68                   88                 90
The training time in case of 16 distortions of the training images was approximately 16
min. The recognition time of any normalized image from any database was less than 0.4 s.
Figure 15. Examples of the images from database II, which were not recognized by the system.
The little square is the centre of the window. The vertical and horizontal steps of this movement are Δx and Δy, respectively. At each position the window is rotated about its centre by an angular step Δθ (Figure 16). For each angular position the system searches for the requested work piece. The window continues to rotate until something is recognized or a complete revolution is made, and the window centre moves until something is recognized or the image limits are reached. If a work piece is recognized, the system stores the coordinates x, y and finishes the recognition task. The user can set the window displacement parameters (Δx, Δy, Δθ) for investigation purposes.
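The scan described above maps onto two nested loops over window positions plus one loop over the angular step. A schematic sketch, where classify_window is a placeholder for the trained LIRA classifier applied to the normalized window content:

def find_work_piece(image_size, window, step_x, step_y, step_angle, classify_window):
    """Move the search window over the image in steps of (step_x, step_y); at every
    position try each orientation until the requested work piece is recognized."""
    W, H = image_size
    w, h = window
    for y in range(0, H - h + 1, step_y):
        for x in range(0, W - w + 1, step_x):
            for angle in range(0, 360, step_angle):
                if classify_window(x, y, angle):    # placeholder for the trained classifier
                    return x, y, angle              # stored coordinates and orientation
    return None                                      # image limits reached, nothing found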
Figure 17 shows two examples of recognized images. The first image contains two recognized cones (cones 1 and 2 in Figure 17(a)), and the image in Figure 17(b) contains three recognized wire terminals (1, 2, 3). The little white square in this figure represents the position where the requested work piece was recognized, and the big square represents the search window attached to this position. At the bottom of the images the position coordinates and orientation found by the system for each recognized work piece can be read. The LIRA classifier is a flexible method; that is, the system recognizes a work piece even away from its centre, provided the position is sufficiently near the work piece body.
Sometimes the system recognizes one work piece several times. Figure 18 shows multiple recognitions of one work piece (cases 1 and 3). In some cases this happens because the system is able to find a work piece far away from its centre point, for example marks 2 and 3 in Figure 18(a) and 1 and 7 in Figure 18(b). In other cases it happens because the window movement step (Δx, Δy, Δθ) is too large, for example mark 1 in Figure 18(a) and marks 2–5, 8, 9 in Figure 18(b). Marks 4 (Figure 18(a)) and 6 (Figure 18(b)) have sufficient accuracy for manipulator work.
Two examples of images with work pieces that contact each other are presented in Figure 19. In Figure 19(a) the system found three work pieces, even though these objects are far from the image centre. In Figure 19(b) the system shows redundancy; in this case it can filter out the best recognized work piece by taking the highest value of the neural network output, that is, the best recognized work piece in this case corresponds to the mark with the highest output value. We faced two specific recognition problems: the first is to avoid work piece recognition redundancy, and the second is to improve the accuracy of the work piece finder. The obtained results are sufficiently good, but the searching method should be improved. All these experiments were carried out on an Intel Pentium 4, 2.66 GHz personal computer with 512 MB RAM.
3.7 Discussion
The LIRA classifier was adapted to recognize micro work pieces of different classes. It demonstrates sufficiently good results in object recognition even when the images are of low quality and have different illumination. The best results were obtained with normalized images. Two classes (tube base and cone head screw, Figure 14(a) and (b)) were recognized with errors. In the future it is necessary to improve the recognition of the work piece centre and to add more images to the databases.
3.8 Conclusion
The LIRA_grayscale neural classifier was adapted for work piece recognition. It was trained and tested in the recognition of five and eight classes. Every class corresponds to a work piece type, except one that corresponds to the absence of a work piece, so the system can recognize the case when there is no work piece. In the experiments with five classes we used 20 samples for each class, and for eight classes we used 40 samples for each class. Images of a tube base, a cone head screw, a round head screw, a cone, a stator axis, an Allen screw, and a wire terminal were used to test our system. For five classes, the best recognition rate was 94% without distortions and 92% with distortions of the initial training images. For eight classes the recognition rate was 90%. The results are acceptable; nevertheless, it is necessary to improve the accuracy of work piece position detection. The LIRA_grayscale classifier performs well in work piece recognition with images of low quality, and it also has good prospects for work piece position finding. The training and recognition computer time is rather small: the time for recognition of any sample of any database was less than 0.4 s.
ACKNOWLEDGMENT
This investigation was partially supported by projects CONACYT 50231, PAPIIT IN110510-3, PAPIIT IN119610, and project ICyTDF 330/2009.
E. Kussul and T. Baidyk thank DGAPA, UNAM for sabbatical grants.
The information presented in this chapter was initially published in Neurocomputing, Image and Vision Computing, and Optical Memory & Neural Networks (Information Optics).
Part of the information is reprinted from Neurocomputing, Makeyev et al.: Limited receptive area neural classifier for texture recognition of mechanically treated metal surfaces, Vol. 71, Issue 7-9, March 2008, pp. 1413-1421, with the authors' right from Elsevier to prepare other derivative works and to extend the journal article into book-length form.
Part of the information is reprinted from Optical Memory & Neural Networks (Information Optics), Martin-Gonzalez et al.: Improved Neural Classifier for Microscrew Shape Recognition, Allerton Press, Vol. 19, No. 3, 2010, pp. 220-226, with the authors' right from Allerton Press to use the materials from the published article in a book written by the authors.
Part of the information is reprinted from Image and Vision Computing, Toledo-Ramírez et al.: Neural classifier for micro work pieces recognition, Vol. 24, Issue 8, 2006, pp. 827-836, with permission from Elsevier.
We invite researchers interested in classifier investigations to use our databases: http://gengis.dialetheia.net/workpiecesDBs. The databases were prepared by Dr. Gengis Khang Toledo-Ramirez.
REFERENCES
Baidyk, T.; Kussul, E.; Makeyev, O.; Caballero, A.; Ruiz, L.; Carrera, G.; Velasco, G., Flat image recognition in the process of microdevice assembly, Pattern Recogn Lett 2004, Vol. 25, No. 1, 107-118.
Baidyk, T.; Kussul, E.; Makeyev, O., Texture recognition with random subspace neural
classifier, WSEAS Transactions on Circuits and Systems 2005, Vol. 4, No. 4, 319-325.
Brenner, D.; Principe, J.C.; Doty, K.L., Neural network classification of metal surface
properties using a dynamic touch sensor, in: Proc. IJCNN'91 International Joint
Conference on Neural Networks; Publisher: IEEE, Seattle, 1991 Vol.1, pp. 189-194.
Chi-ho Chan; Grantham K.H. Pang, Fabric defect detection by Fourier analysis, IEEE T Ind Appl 2000, Vol. 36, No. 5, 1267-1276.
Friedrich, C.R.; Vasile, M.J. Development of the micromilling process for high- aspect-ratio
micro structures, J Microelectromech S. 1996, Vol.5, 33-38.
Hepplewhite, L.; Stonham, T.J., Surface inspection using texture recognition, in: Proc.
IAPR'94 12th International Conference on Pattern Recognition, 1994, Vol. 1; Publisher:
IEEE, Jerusalem, pp. 589-591.
Jianing Zhu; M. Minami, Finding and quantitative evaluation of minute flaws on metal surface using hairline, in: Proc. IECON'04 30th Annual Conference of IEEE Industrial Electronics Society, 2004, Vol. 2; Publisher: IEEE, Busan, pp. 1240-1245.
Kraaijveld, M.A., An experimental comparison of nonparametric classifiers for time-
constrained classification tasks, in: Proc. ICPR'98 14th International Conference on
Pattern Recognition, 1998, Vol. 1; Publisher: IEEE, Brisbane, pp. 428-435.
Kussul, E.; Baidyk, T.; Ruiz-Huerta, L.; Caballero, A.; Velasco, G.; Kasatkina, L., Development of micromachine tool prototypes for microfactories, J Micromech Microeng 2002, Vol. 12, 795-813.
LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P., Gradient-based learning applied to document recognition, P IEEE 1998, Vol. 86, No. 11, 2278-2344.
Leung, T.; Malik, J., Representing and recognizing the visual appearance of materials using
three-dimensional textons, Int J Comput Vision 2001, Vol. 43, No. 1, 29-44.
Madni, A.M.; Wan, L.A., Micro electro mechanical systems (MEMS): an overview of current
state-of-the art, in: Proc IEEE Aerospace Conference, 1998, Vol. 1; Publisher: IEEE,
Snowmass, pp. 421-427.
Naotake Ooyama; Shigeru Kokaji; Makoto Tanaka; et al, Desktop machining microfactory,
in: Proc. IWMF'00 2nd International Workshop on Microfactories, 2000; Publisher:
FSRM, Fribourg, pp. 14-17.
Neubauer C., Fast detection and classification of defects on treated metal surfaces using a
backpropagation neural network, Neural Networks 1991, Vol. 2, pp. 1148-1153.
Rosenblatt, F. Principles of neurodynamics,Publisher: New York, Spartan books, 1962.
Sanchez-Yanez, R.; Kurmyshev, E.; Fernandez, A., One-class texture classifier in the CCR feature space, Pattern Recogn Lett 2003, Vol. 24, pp. 1503-1511.
Trimmer, W.S. (ed.), Micromechanics and MEMS. Classical and seminal papers to 1990;
Publisher: IEEE Press, New York, 1997.
Baidyk, T.; Kussul, E., Neural network based vision system for micro workpieces
manufacturing, WSEAS Transactions on Systems, April 2004, Vol. 3, No. 2, pp. 483-488.
Toledo-Ramírez, G.K.; Kussul, E.; Baidyk, T., Neural classifier for micro work pieces
recognition, Image and Vision Computing, 2006, Vol. 24, No. 8, pp. 827-836.
Kussul, E.; Rachkovskij, D.; Baidyk, T.; Talayev, S., Micromechanical engineering: A basis
for the low cost manufacturing of mechanical microdevices using microequipment,
Journal of Micromechanics and Microengineering, December 1996, Vol. 6, No. 4, pp.
410-425.
Wu, Q. M. J.; Lee, M.F.R.; de Silva, C.W., Intelligent 3–D sensing in automated
manufacturing processes, Proceedings of the IEEE/ASME International Conference on
Advanced Intelligent Mechatronics, 2001, pp. 366-370.
Lee, S.J.; Kim, K.; Kim, D.H.; Park J.O.; Park, G.T., Recognizing and tracking of 3–D
shaped micro parts using multiple visions for micromanipulation, IEEE International
Symposium on Micromechatronics and Human Science, 2001, pp. 203-210.
Kim, J.Y.; Cho, H. S., A vision based error-corrective algorithm for flexible parts assembly,
Proceedings of the IEEE International Symposium on Assembly and Task Planning,
1999, pp. 205-210.
Dunin-Barkovskii, V.L., On neuroscience, Optical Memory & Neural Networks, 2007, Vol.
16, No. 1, pp. 47-50 .
Mikaelyan, A. L. Neuro-holographic processing methods and availability of neural
nanostructure development, Information Technology and Computer Systems, No.1, pp. 9-
19, 2004.
Kryzhanovsky, B.V.; Litinskii, L.B., The vector models of associative memory,
Neuroinformatics-2003, MIFI Session, 2003, Moscow, Russia, pp.72-85.
Kussul, E.; Baidyk, T., Improved method of handwritten digit recognition tested on MNIST
database, Image and Vision Computing, October 2004, Vol. 22, No. 12, pp. 971-981.
Martín, A.; Baidyk, T., Neural classifier for microscrew shape recognition in
micromechanics, CLEI - IFIP WCC AI, August 2006, pp. 10.
Hubel, D.H.; Wiesel, T.N., Brain mechanisms of vision. Scientific American, 241:150-162,
1979.
Marée, R.; Geurts, P.; Piater, J.; Wehenkel, L., Random subwindows for robust image
classification, Proc. IEEE International Conference on Computer Vision and Pattern
Recognition (CVPR), 2005, Vol.1, pp. 34-40.
Pfeifer, T.; Dussler, G., Process observation for the assembly of hybrid microsystems, IEEE International Symposium on Micromechatronics and Human Science 2002, pp. 117–123.
Bleuler, H.; Clavel, R.; Breguet, J-M.; Langen, H.; Pernette, E., Issues in precision motion control and microhandling, in: Proceedings of the 2000 IEEE International Conference on Robotics & Automation, USA, 2000, pp. 959–964.
Kussul, E.; Baidyk, T.; Ruiz, L.; Caballero, A.; Velasco, G., Development of Low Cost
Microequipment, in: International Symposium on Micromechatronics and Human
Science, Nagoya Japan, October 2002, Vol. 20–23, pp. 125–134.
Detter, H.; Popovic, G., Industrial demands on micromechanical products, IEEE International
Conference on Microelectronics NIS, Serbia, 2000, pp. 61–67.
Baidyk, T.; Kussul, E., Application of neural classifier for flat image recognition in the process of microdevice assembly, in: Proceedings of the IEEE International Joint Conference on Neural Networks, Hawaii, USA, 2002, Vol. 1, pp. 160–164.
Jain, R.; Schunck, B.G.; Kasturi, R. Machine Vision, Publisher: McGraw Hill, New York,
U.S.A., 1995.
Faugeras, O. Three-Dimensional Computer Vision, Publisher: MIT Press, Cambridge, New
York, 1993.
Ehrenmann, M.; Ambela, D.; Steinhaus, P.; Dillman, R., A comparison of four fast vision
based object recognition methods for programming by demonstration applications, IEEE
International Conference on Robotics & Automation, San Francisco, CA, USA, 2000, pp.
1862–1867.
Steinhaus, P., Prinzipielle Komponentenanalyse versus multimediale Filterhistogramme, Master Thesis, Universität Karlsruhe (TH), Germany, July 1997.
Bennamoun, M.; Boashash, B., A structural description based vision system for automatic object recognition, IEEE Transactions on Systems Man and Cybernetics Part B (Cybernetics) 1997, Vol. 27, No. 6, pp. 893–906.
Sun-Ho, L.; Hyun-Ki, H.; Jong-Soo, C., Assembly part recognition using part-based superquadrics model, IEEE TENCON 1999, pp. 479–482.
Detter, H.; Radjenovic, M.J., Recognition of thin, flat microelectromechanical structures for
automation of the assembly process, Journal of Intelligent Manufacturing 1997, Vol. 8,
No. 3, pp. 191–201.
Mitzias, D.A.;Mertzios, B.G., A neural multiclassifier system for object recognition in robotic
vision applications, Measurement 2004, Vol. 36, pp. 315–330.
Toledo, G.K.; Kussul, E.; Baidyk, T., Neural classifier LIRA for recognition of micro work
pieces and their positions in the processes of microassembly and micromanufacturing,
The Seventh All-Ukrainian International Conference on Signal/Image Processing and
Pattern Recognition, UkrObraz2004, Kiev, Ukraine, October 2004, pp. 17–20.
Mardanov,A.; Seyfried, J.; Fatikow, S., An automated assembly system for a microassembly
station, Computer in Industry 1999, Vol. 38, pp. 93–102.
Simard, P.Y.; Steinkraus, D.; Platt, J.C., Best practices for convolutional neural networks applied to visual document analysis, International Conference on Document Analysis and Recognition (ICDAR), IEEE Computer Society, Los Alamitos, USA, 2003, pp. 958–962.
In: Computer Vision ISBN: 978-1-61209-399-4
Editor: Sota R. Yoshida © 2011 Nova Science Publishers, Inc.
Chapter 2

A Survey of Face Recognition by the Genetic Algorithm

Fengzhi Dai, Yutaka Fujihara and Naoki Kushida

ABSTRACT
Computer vision and recognition play an increasingly important role in modern intelligent control. Object detection is the first and most important step in object recognition. Traditionally, specific objects can be detected and recognized by the template matching method, but the recognition speed has always been a problem; likewise, for recognition by a neural network, training on the data is always time-consuming. In this chapter, a genetic algorithm-based face recognition system is presented. The genetic algorithm (GA) has been considered a robust and global searching method. Here, the chromosomes generated by the GA contain information (parameters) about the image, and the genetic operators are used to obtain the best match between the original image and the face of interest. The parameters are the coordinate (x, y) of the center of the face in the original image, the rate of scale, and the angle θ of rotation. The purpose of this chapter is to summarize the GA-based method for face detection and recognition. Finally, the experimental results, a comparison with the traditional template matching method, and some other considerations are given.
Keywords: Face detection and recognition, Genetic algorithm, Image processing, Template
matching
INTRODUCTION
Many papers and applications concerning intelligent control are presented on the web, in conference proceedings, and in journals. Among them, image processing and recognition occupy a very large share: the higher the degree of intelligence, the more important image detection and recognition technology becomes.
For controlling an intelligent system (an autonomous mobile vehicle, a robot, etc.), the most important element is the control strategy. But before the system can move automatically, image recognition is needed. An intelligent control system must acquire information about the external world automatically by sensors in order to recognize its position and the surrounding situation. A camera is one of the most important sensors for computer vision. That is to say, the intelligent system endeavors to find out what is in an image (the environment of the robot) taken by the camera: traffic signs, obstacles, guidelines, etc.
The reliability and time-response of the object detection and recognition have a major
influence on the performance and usability of the whole object recognition system [1]. The
template matching method is a practicable and reasonable method for object detection [2], but
the recognition speed has always been a problem.
In addition, in order to search for the object (face) of interest in an image, a lot of data needs to be processed. The genetic algorithm (GA) has been considered a robust and global searching method (although it is sometimes said that the GA cannot be guaranteed to find the global optimum [3]). Here, the chromosomes generated by the GA contain information about the image data, and the genetic and evolution operations are used to obtain the best match to the template: searching for the best match is the goal of this method.
This idea emerged from the features of the GA and the need for an intelligent system to recognize the faces of specific people easily and quickly. The basic concepts of image processing and the GA will not be introduced here, because there is already extensive literature on these subjects.
This chapter summarizes the research on GA-based face recognition by the authors of the chapter and by other researchers. In the sections below, the encoding and decoding method of the GA, the experimental settings used here, the experiments and their analysis, and the conclusion are addressed in turn.
similarity between the original image f(x, y) and a template h(x, y). Define the mean-squared distance
Figure 1. Object recognition system: the image is preprocessed, features are extracted, and pattern matching against the standard pattern (template) gives the result.
$$d_{fh}^2 = \iint \left\{ f(x, y) - h(x, y) \right\}^2 dx\, dy \qquad (1)$$
On the right-hand side of Eq. 4 there are two terms. The first term is constant; thus Rfh can be used as a measure of the least-squares similarity between the original image and the template [5]. If Rfh has a large value (which means that d²fh is small enough), the image is judged to match the template. If Rfh is less than a pre-selected threshold, recognition will either reject the match or create a new pattern, which means that the similarity between the object in the original image and the template is not satisfactory.
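Eqs. (2)-(4) are not reproduced above, but the text implies that Rfh is the cross-correlation term obtained by expanding Eq. (1); under that assumption, the decision rule can be sketched as follows (hypothetical helper names):

import numpy as np

def correlation_score(f_patch, h_template):
    """Assumed form of R_fh: the cross term of the expanded squared distance,
    i.e. the discretized integral of f(x, y) * h(x, y) over the template support."""
    return float(np.sum(f_patch.astype(float) * h_template.astype(float)))

def matches(f_patch, h_template, threshold):
    """Large R_fh (equivalently, small d_fh) means a match; below the threshold
    the match is rejected or a new pattern is created."""
    return correlation_score(f_patch, h_template) >= threshold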
A. Genetic encoding
As introduced above, Eq.4 is used to match the object between the original image and the
template. Since the chromosomes generated by the GA contain information about the image
data, the first step is to encode the image data into a binary string [6].
Figure 2. One chromosome contains 4 bytes.
Some important parameters of the GA are given in Table 1, and the search field and region are given in Table 2. Table 2 shows that four image parameters, the center of the face (x, y) in the original image, the rate of scale (to satisfy Eq. 3), and the rotation angle θ, are encoded into the elements of the gene (their meanings are introduced below). Since one parameter uses 8 bits (1 byte) and there are four image parameters, one chromosome contains 4 bytes, as shown in Figure 2.
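A sketch of this encoding: each of the four parameters occupies one byte, so a chromosome can be represented as a 32-bit string (the helper names are illustrative):

def encode(x, y, rate_byte, theta_byte):
    """Pack four 8-bit parameter values (each already mapped to 0..255)
    into one 4-byte chromosome represented as a 32-character bit string."""
    chromosome = ''
    for value in (x, y, rate_byte, theta_byte):
        chromosome += format(value & 0xFF, '08b')
    return chromosome

def split_genes(chromosome):
    """Recover the four 8-bit genes (x, y, rate, theta) from the bit string."""
    return [int(chromosome[i:i + 8], 2) for i in range(0, 32, 8)]

print(split_genes(encode(64, 97, 180, 120)))   # -> [64, 97, 180, 120]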
In Eq. 5, temp(i, j) is the gray level at coordinate (i, j) in the template image, whose width and height are temp_x and temp_y. f(x, y, rate, θ) gives the gray level in the original image at the coordinate calculated by translation from (x, y) and by changing the scale and the rotation angle θ with respect to the template. Since the images have 256 gray levels, the division by 255 in Eq. 5 ensures that the resulting fitness is between 0 and 1.
Based on Eq. 5, the program calculates the fitness in the following four steps; Figure 4 gives a graphical example.
(1) The coordinate (i, j) in the template is scaled by the value of the parameter rate, giving f(x, y, rate, θ) = f(0, 0, rate, 0) = (i × rate, j × rate);
(2) The result of step (1) is rotated by the value of the parameter θ, giving f(x, y, rate, θ) = f(0, 0, rate, θ);
(3) The result of step (2) is translated by the coordinate (x, y) in the original image, yielding f(x, y, rate, θ);
(4) All the differences between the gray level at coordinate (i, j) in the template and that of the calculated f(x, y, rate, θ) in the original image are summed according to Eq. 5 to obtain the fitness.
Following Figure 3, if the value of the fitness is larger than the preset threshold, the search process is over and the result is given; otherwise the loop continues.
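Since Eq. (5) itself is not reproduced above, the sketch below reconstructs the four steps under stated assumptions (gray-level differences normalized by 255 and averaged over the template); the interpolation and the exact normalization are simplifications, not the authors' exact formula:

import math

def fitness(original, template, x, y, rate, theta_deg):
    """Average normalized gray-level agreement between the template and the
    scaled/rotated/translated patch of the original image (steps (1)-(4))."""
    th = math.radians(theta_deg)
    H, W = original.shape
    total, count = 0.0, 0
    for j in range(template.shape[0]):
        for i in range(template.shape[1]):
            si, sj = i * rate, j * rate                      # step (1): scale
            ri = si * math.cos(th) - sj * math.sin(th)       # step (2): rotate
            rj = si * math.sin(th) + sj * math.cos(th)
            u, v = int(round(x + ri)), int(round(y + rj))    # step (3): translate
            if 0 <= v < H and 0 <= u < W:
                diff = abs(int(original[v, u]) - int(template[j, i]))
                total += 1.0 - diff / 255.0                  # step (4): accumulate
                count += 1
    return total / count if count else 0.0                   # fitness in [0, 1]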
In order to obtain the true values of the image parameters from the chromosome, decoding is needed. In Eq. 6, after decoding the datum from the GA, the value of each parameter is standardized. A graphical example is given in Figure 5.
Figure 3. Flow chart of the GA-based search: initialization; if the fitness reaches the threshold the search ends, otherwise selection, crossover and mutation are applied and the loop continues.
$$\text{Value} = \text{MIN} + \frac{\text{MAX} - \text{MIN}}{255} \times (\text{datum from GA}) \qquad (6)$$
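Eq. (6) is a linear mapping from an 8-bit gene value into the search range of the corresponding parameter. A small sketch with placeholder ranges (the actual ranges are those listed in Table 2):

def decode(datum, lo, hi):
    """Eq. (6): map an 8-bit gene value (0..255) linearly into [lo, hi]."""
    return lo + (hi - lo) * datum / 255.0

# Hypothetical search ranges; the actual ones are given in Table 2.
x_center = decode(130, 0, 320)     # x coordinate in the original image
rate     = decode(200, 0.5, 2.0)   # scale factor
theta    = decode(90, -30, 30)     # rotation angle in degrees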
Figure 4. Graphical example of the calculation of Eq. 5 by the program: the template image is scaled by rate, rotated by θ, and translated by (x, y) into the original image.
C. Experimental setting
The experiment is done by first loading the original and template images. The GA is used to find whether the object (face) of a template is present in the original image. If the answer is YES, the result gives the coordinate of the center of the object in the original image, the rate of scale, and the rotation angle with respect to the template. For comparison, the general template matching method is also presented [7]. The execution time shows the effectiveness of the GA-based recognition method.
Figure 6 and Figure 7 are the original images and the templates for the experiment [8].
The values are the width × height in pixels of the image. In Figure 6, three original images are
presented, the content and size of which are different. Figure 6a has two faces (the faces of a
person and a toy), Figure 6b shows a face tipped to one side, and the person in Figure 6c
wears a hat and the background is more complicated than Figure 6a and b.
Figure 5. Graphical example of decoding.
Figure 6. Three original images (size: max_x × max_y).
Figure 7. Templates for matching (temp_x × temp_y): (a) 62 × 62, (b) 60 × 64.
The two templates in Figure 7 are not extracted from a single image. For normal use, the template should be extracted as the average of several feature images. In Figure 8, the template (a)-0 is generated from (a)-1, (a)-2, and (a)-3 by taking the average value of the gray levels of those three models. The same is done for the template (b)-0.
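A sketch of building such an averaged template from several aligned face crops of the same size (assumed to be NumPy arrays of gray levels):

import numpy as np

def average_template(models):
    """Average the gray levels of several equally sized model images, as done
    for templates (a)-0 and (b)-0 in Figure 8."""
    stack = np.stack([m.astype(float) for m in models], axis=0)
    return np.round(stack.mean(axis=0)).astype(np.uint8)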
Figure 9. Result of searching by GA.
The results of the GA-based face recognition are given in Figure 9 and Table 3. Figures 9a-c are searched to match the template in Figure 7a, while Figure 9d is matched to Figure 7b.
The result in Table 3(a) means that after the template is scaled by 1.31 times and rotated anticlockwise by 7°, it matches the original image (centered at (64, 97)) with a fitness of 91.6591%.
Figure 9a and Figure 9d reach the matching rate of 0.9 within 300 generations, while Figure 9b and Figure 9c cannot reach the matching rate of 0.9 within 300 generations (the best match is given in Table 3).
In the images of Figures 9a, b, and d, we see that the matches to the template are good: the coordinate (x, y), the rate of scale, and the angle of rotation θ have been found correctly. But for Figure 9c the result is not very satisfactory.
Image in Figure 9    Result
(a)    Original image is Figure 6a, template is Figure 7a. The 18th generation gives the result: fitness = 0.916591, time = 2 s, (x, y) = (64, 97), rate = 1.31, θ = -7°.
(b)    Original image is Figure 6b, template is Figure 7a. The 300th generation gives the result: fitness = 0.889581, time = 39 s, (x, y) = (111, 103), rate = 1.72, θ = -11°.
(c)    Original image is Figure 6c, template is Figure 7a. The 300th generation gives the result: fitness = 0.832722, time = 39 s, (x, y) = (165, 120), rate = 1.00, θ = 30°.
(d)    Original image is Figure 6a, template is Figure 7b. The 2nd generation gives the result: fitness = 0.901470, time = 0 s, (x, y) = (179, 112), rate = 1.45, θ = 18°.
i. The template in Figure 7 cannot represent the face of interest at all times. That is to say, although the person to be recognized in different images is the same, the template cannot capture the features of this person at all times (different appearance, etc.) and under all conditions. (The creation of the template is shown in Figure 8.)
ii. The algorithm itself has some limitations. For example, when using a GA-based recognition method, the settings of the search field (in this chapter, (x, y, rate, θ) is selected), the choice of the genetic operations, and the selection and optimization of the fitness function all have a strong effect on the quality of recognition.
To assess the effect of the GA-based algorithm, the result of the general matching method [7] is also presented. From Figure 10 we see that although both the original image (the top-left image) and the template (the top-right image) are simplified by binarization, the search time is 1 min 22 s. The recognized result is the bottom-left image in Figure 10. For the general matching method, the search time depends mainly on the location of the object of interest in the original image.
Figure 11 and Figure 12 are the template and original images for another experiment [9].
The genetic operations and GA parameters are presented in Table 1 and Table 4. The
fitness is defined in Eq.5. The result of the GA-based face recognition is given in Figure 13
and Table 5. The result shows that it reaches the matching rate 0.904 at the 116th generation.
The result of the general matching method is shown in Figure 14.
Pr 0.6
Pc 0.2
Pm 0.02
Fitness 0.904
Generation 116
Time [sec] 10
(x, y) (145, 77)
Rate 1.118
Angle θ [deg] 3.431
Based on the genetic operations and GA parameters presented in Table 1 and Table 4,
another experiment has been done. Figure 15 and Figure 16 are the template and original
images [10]. The result of the GA-based face recognition is given in Figure 17 and Table 6.
The result of the general matching method is shown in Figure 18.
Figure 16. The original image (size: max_x × max_y = 320 × 240).
Fitness 0.944
Generation 112
Time [sec] 5
(x, y) (112, 107)
Rate 1.02
Angle θ [deg] 2.24
CONCLUSION
In this chapter, the GA-based image recognition method is tested, and a comparison with
the general matching method is presented.
As we know, a GA starts with an initial set of random solutions called the population. Each individual in the population is called a chromosome and represents a solution to the problem. By stochastic search techniques based on the mechanisms of natural selection and natural genetics, genetic operations (crossover and mutation) and the evolution operation (selecting or rejecting) are used to search for the best solution [11].
In this chapter, the chromosomes generated by the GA contain information about the image, and the genetic operators are used to obtain the best match between the original image and the template. The parameters are the coordinate (x, y) of the center of the face in the original image, the rate of scale, and the angle of rotation θ.
In fact, translation, scale, and rotation are the three main invariants in the field of pattern recognition [12]. However, for face recognition, the facial features are difficult to extract and calculate with general pattern recognition theory and methods [13], and even these three invariants will not hold because the facial expression changes between images.
Thus, recognition only gives the best matching result above a predetermined threshold [14]. Both the GA-based method and the general template matching method are presented here, and the comparison with the traditional pattern matching method shows that the recognition is satisfactory, although under some conditions the result is not very good (Figure 9c).
Based on the results of the experiments, further work will emphasize several improvements [15]; this work is important and necessary in order to improve the GA-based face recognition system.
ACKNOWLEDGMENT
This work was sponsored in part by the Scientific Research Foundation of Tianjin
University of Science and Technology (China) for the Introduced Talented Scholars, No.
20100408.
REFERENCES
[1] M. Sugisaka; X. Fan. Development of a face recognition system for the life robot. Proc.
of the Seventh Int. Symp. on Artificial Life and Robotics. 2002, Vol. 2, pp 538-541.
[2] K. Castleman. Digital image processing; Original edition published by Prentice Hall,
Inc., a Simon & Schuster Company, Press of Tsinghua University, China, 1998.
[3] H. Iba. Foundation of genetic algorithm – solution of mystic GA (in Japanese); Omu
Press, 1994.
[4] K. Deguchi; I. Takahashi. Image-based simultaneous control of robot and target object
motion by direct-image-interpretation. Proc. the 1999 IEEE/RSJ International Conf. of
Intelligent Robot and System. 1999, Vol. 1, pp 375-380.
[5] B. Jaehne. Digital image processing - concepts, algorithms, and scientific applications.
The Third Edition, Springer-Verlag, 1995.
[6] T. Agui; T. Nagao. Introduction to image processing using programming language C (in
Japanese). Shoko-do press, 2000.
[7] Ishidate (2010). Ishidate's Studying Room of Visual C++ (in Japanese),
http://homepage3.nifty.com/ishidate/vcpp.htm
[8] F. Dai; T. Kodani; Y. Fujihara. Research on a face recognition system by the genetic
algorithm. J. of Artificial Life and Robotics. 2007, Vol. 11, Number 1, pp 67-70.
[9] F. Dai; Y. Fujihara. Research on object recognition by genetic algorithm. Proc. of 8th
Int. Conf. on Industrial Management. 2006, pp 907-912.
[10] F. Dai; T. Adachi; Y. Fujihara; H. Zhao. Fundamental research on face recognition by
genetic algorithm. Proc. of 12th Int. Symp. on Artificial Life and Robotics. 2007, pp
639-642.
[11] M. Gen; R. Cheng. Genetic algorithms and engineering design. A Wiley-Interscience
Publication, New York, 1997.
[12] T. Agui; T. Nagao. Image processing and recognition (in Japanese). Syokoudou Press,
1992.
[13] H. Takimoto; T. Mitsukura; M. Fukumi; et al. A design of face detection system based
on the feature extraction method. Proc. of 12th Symp. on Fuzzy, Artificial Intelligence,
Neural Networks and Computational Intelligence, 2002, pp 409-410.
[14] Kresimir Delac; Mislav Grgic. Face recognition. I-Tech Education and Publishing,
Vienna, Austria, 2007.
[15] D. Goldberg. Genetic algorithms in search, optimization, and machine learning.
Addison-Wesley Publishing Company, Inc., 1989.
In: Computer Vision ISBN: 978-1-61209-399-4
Editor: Sota R. Yoshida © 2011 Nova Science Publishers, Inc.
Chapter 3

The Attentive Co-Pilot

Thomas Paul Michalke and Robert Kastner
Abstract
Increasingly complex driver assistance functionalities are combined in today’s ve-
hicles. Typically, these functionalities run as independent modules bringing their own sen-
sors, processing devices, and actuators. In general, no information fusion, i.e., cross
talk between modules, takes place as a consequence of unsolved questions in system
design that come along with high system complexity. However, information fusion of
available sensors and processing modules could lead to a new quality of driver assis-
tance functionalities in terms of performance and robustness. Additionally, the grow-
ing number of driver assistance functionalities in today’s vehicles will cause problems
regarding the limited number of interaction channels to inform the driver of potential
dangers - the so-called human machine interface (HMI).
Already today, in specific traffic scenarios contradictory HMI access by different
driver assistance functionalities cannot be avoided. Furthermore, typical driver
assistance functionalities on the market are based on highly specialized and optimized
algorithms that show sound performance for a restricted number of clearly defined sce-
narios only. Also, the combination of several of these rigid systems as a means to reach
the long-term goal of autonomous driving will not lead to robust system performance,
taking the immense variety of traffic situations into account. As opposed to that, the
following contribution presents a flexible, biologically inspired driver assistance sys-
tem that adapts its modules and data exchange between modules online depending on
the task. More specifically, the morphology of the brain as well as brain-like signal
processing principles are mimicked in order to increase the robustness and flexibility
∗ E-mail address: thomas [email protected]
1 The presented results have been obtained while carrying out research as part of a Ph.D. program at Darmstadt
University with support from Honda Research Institute Europe GmbH. The author Thomas Paul Michalke
now works for Daimler AG (email: thomas [email protected]).
of the system. Thereby, we aimed at reaching a generic system structure that supports
several system tasks by modulating and parameterizing its system modules (e.g., de-
tect fast objects, redetect once tracked and later lost objects, predict object trajectories,
locate cars on the road, detect traffic signs). Put differently, our main research focus is
on the design of general mechanisms that lead to a certain observable behavior without
being explicitly designed for this behavior. Following that paradigm, a tunable atten-
tion sub-system was developed that allows the pre-filtering of the visual environment
for task-relevant information in order to provide the driver with assistance in dangerous
situations. In order to include information about the scene context in the system, a robust
unmarked road detection module as well as an approach for assessing the general
type of environment (highway, country road, inner-city) are included in the system.
The realized driver assistance system is tested online and in real-time on a prototype
car. The gathered results prove the applicability of the developed biologically inspired
driver assistance system in real-world scenarios. Extensive system evaluation shows
that different system properties are in close compliance with measurements gathered
in psychophysical studies of humans. Based on these results, it can be stated that the
realized advanced driver assistance system closely models important human informa-
tion processing principles allowing the usage of the system as Attentional Co-Pilot for
human drivers.
cues in Sect. 6., and the road detection system in Sect. 7., some major sub-modules are
described. Evaluation results for the most crucial system modules as well as the overall system
performance measured in an experimental setup are given in Sect. 8.. The contribution ends
with a summary in Sect. 9. and an outlook on future work in Sect. 10..
approach allows for a simple attention-based decomposition of road scenes only, without
incorporating object knowledge or context information. Additionally, the overall system
organization is not biologically inspired and hence shows limitations in its flexibility.
In contrast to the ADAS presented here, most large-scale research projects, e.g., the European
PreVENT project [15], tend to decompose the overall functionality into many building blocks
and to combine these blocks into subsets for solving isolated tasks. While this "divide and
conquer" approach does lead to impressive results in specific settings, the challenge of
integrating all these functionalities into a coherently working flexible system is not yet solved.
For supporting the driver over the full range of driving tasks in all kinds of challenging
situations and going beyond simple reactive behaviors, a more sophisticated task-dependent
processing strategy is required. As the major challenge and in order to achieve this target,
an adequate organization of perception using a generic vision system has to be found.
When assessing biological vision systems, it becomes apparent that these are highly
flexible and capable of adapting to severe changes in the task and/or the environment.
Hence, one of the basic design goals on the way to achieve an “all-situation” ADAS could
be to implement a biologically motivated, cognitive vision system as perceptual front-end
of an ADAS, which can handle the wide variety of situations typically encountered when
driving a car. Note that only if an ADAS vision system attends to the relevant surrounding
traffic and obstacles will it be fast enough to assist the driver in real time during all
dangerous situations.
As motivated before, for solving this challenge a computational model of the human at-
tention system is suitable for the “how” and “when” of scene decomposition and interpreta-
tion. Recently, some authors stress the role of incorporating context into the attention-based
scene analysis. For example, [16] proposes a combination of a bottom-up saliency map and
a top-down context-driven approach. The top-down path uses spatial statistics, which are
learned during an offline learning phase, to modulate the bottom-up saliency map. This
differs from the system described here, where no offline spatial prior learning phase is
required. In the realized online system, context is incorporated in the form of top-down
weights that are modified at run time as well as by fusing road information.
[Figure 1 block diagram: the system control module maps an input feature space of internal percepts onto a parameter space; visual BU and TD attention are generated from weighted feature maps w_i and combined via λ and 1−λ into the TD saliency map; LTM and STM TD weights are applied, car-like road openings trigger the LTM search, and the road surface is suppressed via on-off contrasts; an NCC tracker with 3D-to-2D transform and ego-motion update feeds the depth cue combination (weak fusion) together with a single track model; scene classification provides context; environmental interaction is realized via actuator control (acoustic warning, belt strengthener, brake); inputs are a 300x400 color image, stereo data (disparity), radar, laser, and CAN data.]
Figure 1. Biologically motivated system structure for active, attention-based scene analysis.
objects. In contrast, the “what” pathway considers the detailed analysis of a single spot in
the image. Nevertheless, an ADAS also requires context information in the form of the road
and its shape as well as scene context from a scene classifier, both generated by the static
domain-specific part. Furthermore, for assisting the driver, the system requires interfaces
for allowing environmental interaction (i.e., triggering actuators). The system control mod-
ule relies on numerous internal system percepts as input and numerous system parameters
for controlling the system states and behavior. In order to allow a closer understanding of
the proposed system, a detailed description of its sub-modules is given.
φ = K_conj · Max(F_i), with K_conj ∈ (0, 1] (see Fig. 2a for a visualization of the object training
region and the background). The weights w_i^TD dynamically boost feature maps that are
important for the current task / object class in focus and suppress the rest. The bottom-up
(BU) weights w_i^BU are set object-unspecifically in order to detect unexpected, potentially
dangerous scene elements. The central parameter λ ∈ [0, 1] (see Equation (1)) determines
the relative importance of TD and BU search in the current system state. For more details
on the attention system see Sect. 4..
It is important to note that the TD weights (calculated using Equation (2)) depend on the
features present in the background (rest) of the current image, since the background
information is used to differentiate the searched object from the rest of the image [18].
Because of this, it is not sufficient to store the TD weight sets w_i^TD of different object classes
directly and switch between them during online processing. Instead, all feature maps of
objects F_i,RoI are stored. To compensate for the dependency on the background, the stored
object feature maps are fused with the feature maps of the current image before calculating
the TD weights. In plain words, the system takes the current scene characteristics (i.e., its
features) into account in order to determine the optimal TD weight set that shows a maximum
performance on the current frame. Put differently, the described separation approach in
Equation (2) includes the current scene context on a sensory level.
\[
S_{total} = \lambda \sum_{i=1}^{N} w_i^{TD} F_i + (1-\lambda) \sum_{i=1}^{N} w_i^{BU} F_i \qquad (1)
\]

\[
w_i^{TD} =
\begin{cases}
\dfrac{m_{RoI,i}}{m_{rest,i}} & \forall \; \dfrac{m_{RoI,i}}{m_{rest,i}} \ge 1 \\[2ex]
-\dfrac{m_{rest,i}}{m_{RoI,i}} & \forall \; \dfrac{m_{RoI,i}}{m_{rest,i}} < 1
\end{cases}
\qquad (2)
\]

\[
\text{with } m_{\{RoI,rest\},i} = \frac{\sum_{\forall u,v \in \{RoI,rest\}} \tilde{F}_i(u,v)}{\text{size of region } \{RoI,rest\}}
\]

\[
\text{and } \tilde{F}_i(u,v) =
\begin{cases}
F_i(u,v) & \forall (u,v): F_i(u,v) \ge \phi \\
0 & \text{else}
\end{cases}
\]
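As an illustration of Equations (1) and (2), the following sketch (NumPy, with random feature maps and a rectangular object region; all values are made up and it is not the system's actual implementation) computes the TD weights from the mean feature activations inside the RoI and the background and combines the TD and BU contributions with the factor λ.

```python
import numpy as np

def td_weights(features, roi_mask, rest_mask, phi):
    """Equation (2): one weight per feature map from the mean activation in the RoI vs. the background."""
    w = np.empty(len(features))
    for i, F in enumerate(features):
        Ft = np.where(F >= phi, F, 0.0)                    # threshold the feature map with phi
        m_roi = Ft[roi_mask].sum() / roi_mask.sum()        # mean activation inside the object region
        m_rest = Ft[rest_mask].sum() / rest_mask.sum()     # mean activation in the background
        ratio = m_roi / (m_rest + 1e-12)
        w[i] = ratio if ratio >= 1.0 else -m_rest / (m_roi + 1e-12)
    return w

def saliency(features, w_td, w_bu, lam):
    """Equation (1): weighted combination of the TD and BU contributions."""
    F = np.stack(features)
    return lam * np.tensordot(w_td, F, axes=1) + (1 - lam) * np.tensordot(w_bu, F, axes=1)

# Toy example with random feature maps and a rectangular RoI.
rng = np.random.default_rng(0)
feats = [rng.random((60, 80)) for _ in range(5)]
roi = np.zeros((60, 80), bool); roi[20:40, 30:50] = True
w_td = td_weights(feats, roi, ~roi, phi=0.5)
S = saliency(feats, w_td, np.ones(5) / 5, lam=0.5)
print(w_td, S.shape)
```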
Now, the maximum on the current attention map Stotal is detected and the focus of at-
tention (FoA) is derived by generic region-growing-based segmentation on Stotal . In the
following, only the FoA is classified in full image resolution using a state-of-the-art ob-
ject classifier that is based on neural nets [19]. This procedure (attention generation, FoA
segmentation) models the saccadic eye movements of mammals, where a complex scene is
scanned and decomposed by sequential focusing of objects in the central 2-3◦ foveal retina
area of the visual field. The system uses a temporal integration approach to decide on the
object class, in order to improve the reliability of the classifier decision. More specifically,
all detected objects are tracked and reclassified in the following frames. On each frame
a majority decision (voting) on the current and all stored classifier results decides on the
object class.
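A minimal sketch of such a temporal majority vote, using only the Python standard library (the class and label names are placeholders, not the system's actual interfaces):

```python
from collections import Counter, deque

class TrackedObject:
    """Keeps the classifier decisions of the last frames and votes on the object class."""
    def __init__(self, history=10):
        self.labels = deque(maxlen=history)   # sliding window of per-frame classifications

    def add_classification(self, label):
        self.labels.append(label)

    def voted_class(self):
        # Majority decision over the current and all stored classifier results.
        return Counter(self.labels).most_common(1)[0][0] if self.labels else None

obj = TrackedObject()
for frame_label in ["car", "car", "signal_board", "car"]:
    obj.add_classification(frame_label)
print(obj.voted_class())   # -> "car"
```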
all intrinsic and extrinsic camera parameters (in detail, these are the 3 camera angles θ_X, θ_Y,
and θ_Z, the 3 translational camera offsets t_1, t_2, t_3, the horizontal and vertical principal point
c_u and c_v, as well as the horizontal and vertical focal lengths f_u and f_v); refer to Equations (3)
and (4) as well as Fig. 33b for the applied coordinate system.
In case the NCC tracker is able to re-detect the object in 2D pixel coordinates, the
3D position in the representation is updated using 4 different depth cues for the 2D pixel
(u,v) to 3D world (Xobj , Yobj, Zobj ) transformation. More specifically, the system uses stereo
data, radar, depth from object knowledge and depth from bird’s eye view (see Fig. 37 and
Sect. 6. for more details on these cues). The available depth cues are combined using
the biologically motivated principle of weak fusion (see [22]). Weak fusion combines the
depth sources based on their reliability (i.e., sensor variances). The fusion is realized using
an Extended Kalman Filter (EKF) that combines the cues based on dynamically adapted
weights depending on the static predefined sensor variances and the available depth sources,
as not every cue is available in each time step. The EKF uses a second order process
model for its prediction step that models the relevant kinematics of the car (velocity and
acceleration).
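The weak-fusion principle can be illustrated by an inverse-variance weighted combination of whichever depth cues are available in the current time step; the sketch below is a simplified stand-in for the full EKF, and the cue values and variances are invented.

```python
import numpy as np

def weak_fusion(cues):
    """Combine available depth measurements weighted by their reliability (1 / variance).

    cues: list of (depth_in_m, variance) tuples; unavailable cues are simply omitted.
    """
    depths = np.array([d for d, _ in cues])
    weights = np.array([1.0 / v for _, v in cues])
    fused = np.sum(weights * depths) / np.sum(weights)
    fused_variance = 1.0 / np.sum(weights)
    return fused, fused_variance

# Example frame: stereo, radar and depth-from-object-knowledge are available,
# depth from bird's eye view is not (all numbers are illustrative only).
print(weak_fusion([(22.4, 4.0), (21.8, 0.5), (24.0, 9.0)]))
```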
The entries of the rotation matrix in Equations (3) and (4), built from the camera angles θ_X, θ_Y, and θ_Z, are:

r_11 = cos(θ_Z) cos(θ_Y),
r_12 = −sin(θ_Z) cos(θ_X) + cos(θ_Z) sin(θ_Y) sin(θ_X),
r_13 = sin(θ_Z) sin(θ_X) + cos(θ_Z) sin(θ_Y) cos(θ_X),
r_21 = sin(θ_Z) cos(θ_Y),
r_22 = cos(θ_Z) cos(θ_X) + sin(θ_Z) sin(θ_Y) sin(θ_X),
r_23 = −cos(θ_Z) sin(θ_X) + sin(θ_Z) sin(θ_Y) cos(θ_X),
r_31 = −sin(θ_Y),
r_32 = cos(θ_Y) sin(θ_X),
r_33 = cos(θ_Y) cos(θ_X).
Objects that leave the represented surrounding scene or whose Kalman variances are too
high (i.e., they received no new measurements for several frames) are deleted from the
STM. The concept of appearance-based 2D tracking (analysis of motion in 2D) supported
by a 3D representation (interpretation of motion in 3D) was found in humans as well [23].
From a technical point of view, the advantage of this approach is the simple correction of
the ego-motion determined from the internal 3D representation. The vehicle ego-motion
(translations ∆Xe and ∆Ze , as well as the change of the yaw angle ∆θX ) is determined based
on a standard single track model and compensated in the Kalman prediction step (see Equa-
tion (5) and (6) for the state vector E and process model A). Therefore, a computationally
intensive optical-flow-based motion prediction is no longer needed, as it was done in a pre-
vious system implementation [24]. The main cause of the strong object motion in the
2D image is compensated by correcting the ego-motion-induced position change of objects,
which eases the tracking task considerably.
\[
E = \begin{pmatrix} Z_{obj} & X_{obj} & v_{Z,obj} & v_{X,obj} & 1 \end{pmatrix}^T \qquad (5)
\]

\[
A = \begin{pmatrix}
\cos(\Delta\theta_X) & \sin(\Delta\theta_X) & T & 0 & -\Delta Z_e \\
-\sin(\Delta\theta_X) & \cos(\Delta\theta_X) & 0 & T & -\Delta X_e \\
0 & 0 & \cos(\Delta\theta_X) & \sin(\Delta\theta_X) & 0 \\
0 & 0 & -\sin(\Delta\theta_X) & \cos(\Delta\theta_X) & 0 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix} \qquad (6)
\]
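To make the ego-motion compensation of Equations (5) and (6) concrete, the sketch below builds the process matrix A for a given yaw change and ego translation and applies it to a state vector (the numbers are arbitrary, and the matrix follows the reconstruction given above).

```python
import numpy as np

def process_matrix(d_theta, d_xe, d_ze, T):
    """Process model A of Equation (6): rotation by the yaw change, constant-velocity
    prediction over the cycle time T, and subtraction of the ego translation."""
    c, s = np.cos(d_theta), np.sin(d_theta)
    return np.array([
        [  c,   s,   T, 0.0, -d_ze],
        [ -s,   c, 0.0,   T, -d_xe],
        [0.0, 0.0,   c,   s,  0.0],
        [0.0, 0.0,  -s,   c,  0.0],
        [0.0, 0.0, 0.0, 0.0,  1.0],
    ])

# State E = (Z_obj, X_obj, v_Z, v_X, 1): an object 20 m ahead and 2 m to the side,
# approaching with 5 m/s; the ego car yaws by 0.01 rad and moves 0.8 m forward per cycle.
E = np.array([20.0, 2.0, -5.0, 0.0, 1.0])
A = process_matrix(0.01, d_xe=0.0, d_ze=0.8, T=0.1)
print(A @ E)   # predicted, ego-motion-compensated state
```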
Figure 2. (a) Visualization of the object training region (RoI) for TD weight calculation
against the background (rest), (b) Prediction of object ego-motion (dots: Kalman tracked
object position, squares: ego-motion predicted object position, dashed line: accumulated
object ego-motion).
\[
\beta(P_t, P_t^0) = \frac{P_t - P_t^0}{\sqrt{\sigma_{P_t}^2 + \sigma_{P_t^0}^2}} \qquad (7)
\]
For all dynamic (i.e., moving) and therefore potentially dangerous objects in the scene an
additional attention-based tracker support is realized, in order to solve a typical problem
appearance-based trackers suffer from (i.e., a tracker type that relies on the comparison of
image patches). Said trackers depend strongly on the quality of the object template (i.e.,
the image patch the tracker has to relocalize in the current image). In Fig. 3a the func-
tional description of a NCC tracker is visualized. Here, the template is fix, which leads
to a decreasing tracking performance. The object gets lost quickly. This is caused by the
fact that in the vehicle domain the appearance of tracked objects quickly changes (due to
changes in illumination, view angle, or varying scale), which makes an adaptation of the
template necessary. A typical approach for template adaptation is shown in Fig. 3b, where
the template is reset based on the previous tracking result. Using this procedure incremen-
tal errors of the tracker are accumulated. The detected region drifts away from the object
and gets finally lost. More advanced approaches for template adaptation exist that adapt
the initial template by model-based image transformations in order to compensate the scale
variance and change of view angle (see, e.g., [25]). Although said methods suffer from noise,
they still perform rather robustly, but they require specific and complex algorithms. In the
system presented here, a novel approach for template adaptation is proposed that relies on
approaches already inherent in the system. As visualized in Fig. 3c, based on the previous
template and Equ. (2), a TD weight set and a TD saliency map S_total are computed for the
previous image frame.
The maximum of this map marks the new position of the object template used for track-
ing in the current frame. In other words, the initial template position is derived from the
typical feature characteristics present in the previous template. As opposed to Fig. 3b, the
template adaptation and object redetection is organized separately, thereby preventing the
accumulation of incremental errors.
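The core of the NCC tracker itself is plain normalized cross-correlation of a template with a search region; a minimal NumPy sketch (exhaustive search, toy data, and no template adaptation, so not the authors' adapted tracker) is given below.

```python
import numpy as np

def ncc(patch, template):
    """Normalized cross-correlation coefficient between two equally sized patches."""
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.sqrt((p ** 2).sum() * (t ** 2).sum()) + 1e-12
    return float((p * t).sum() / denom)

def track_ncc(image, template):
    """Exhaustive template search: returns the top-left position with the highest NCC score."""
    th, tw = template.shape
    best, best_pos = -1.0, (0, 0)
    for v in range(image.shape[0] - th + 1):
        for u in range(image.shape[1] - tw + 1):
            score = ncc(image[v:v + th, u:u + tw], template)
            if score > best:
                best, best_pos = score, (u, v)
    return best_pos, best

# Toy example: place a bright square into a noisy image and relocate it.
rng = np.random.default_rng(1)
img = rng.random((40, 60)) * 0.2
img[10:18, 25:33] += 1.0
templ = img[10:18, 25:33].copy()
print(track_ncc(img, templ))   # -> ((25, 10), ~1.0)
```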
From a representational point of view, the “where” pathway of the here presented ADAS
consists on the one hand of the STM, that stores all properties of sensed objects in a 3D
representation and on the other hand of a long term memory (LTM) that stores the generic
properties of object classes. The LTM is filled offline with typical patches and correspond-
ing feature maps Fi of specific object classes. Typically, 3-5 object patches are sufficient
to allow a robust attention-based object detection. For evaluation purposes cars, reflection
posts, and signal boards are used as LTM content, but the ADAS can detect any other ob-
ject types as well, if the attention and the object recognizer are trained accordingly (see [10]
for the attention-based detection and classification of traffic signs). In the default state the
system searches for the generic LTM object class “car”. This is done by calculating the
geometric mean of all TD weight sets of the stored LTM objects that were calculated based
on Equation (2). These weights tune the TD attention in the “what” pathway.
As described above, in case the tracker has re-detected the object in the current frame
the 3D representation is updated. In case the tracker loses the object, the system searches
for the lost STM object in the following frames. This is realized by calculating a TD
weight set that is specific to the lost STM object using Equation (2). The object O_f found
by the STM search is then compared to the searched object O_s by means of the distance
measure δ(O_f, O_s) that is based on the Bhattacharyya coefficient (a measure for determining
the similarity between two histograms), calculated on the histograms H_i^{O_f} and H_i^{O_s} of all
N object feature maps (see Equation (8)).

Figure 3. Template handling of the NCC tracker (schematic): (a) tracking with a fixed template, (b) template adaptation by resetting the template to the previous tracking result, (c) the proposed attention-based template adaptation.
\[
\delta(O_f, O_s) = \sum_{i=1}^{N} \sqrt{1 - \gamma\!\left(H_i^{O_f}, H_i^{O_s}\right)} \qquad (8)
\]

\[
\gamma\!\left(H_i^{O_f}, H_i^{O_s}\right) = \sum_{\forall u,v} \sqrt{H_i^{O_f}(u,v)\, H_i^{O_s}(u,v)}
\]
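A small sketch of the distance measure of Equation (8), computed on normalized histograms of the object feature maps (NumPy; the histogram binning and the toy data are assumptions):

```python
import numpy as np

def bhattacharyya(h1, h2):
    """Bhattacharyya coefficient between two normalized histograms (1.0 = identical)."""
    return float(np.sum(np.sqrt(h1 * h2)))

def object_distance(feat_maps_f, feat_maps_s, bins=32):
    """Distance delta(O_f, O_s) of Equation (8): sum over all feature maps of
    sqrt(1 - Bhattacharyya coefficient) of the per-map histograms."""
    delta = 0.0
    for Ff, Fs in zip(feat_maps_f, feat_maps_s):
        hf, _ = np.histogram(Ff, bins=bins, range=(0.0, 1.0))
        hs, _ = np.histogram(Fs, bins=bins, range=(0.0, 1.0))
        hf = hf / max(hf.sum(), 1)
        hs = hs / max(hs.sum(), 1)
        delta += np.sqrt(max(1.0 - bhattacharyya(hf, hs), 0.0))
    return delta

rng = np.random.default_rng(2)
maps_a = [rng.random((20, 20)) for _ in range(5)]
maps_b = [m + 0.01 * rng.random((20, 20)) for m in maps_a]   # nearly identical object
print(object_distance(maps_a, maps_a), object_distance(maps_a, maps_b))
```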
The LTM and STM object search run in parallel as indicated visually in Fig. 1. Again, it
is important to note that the ADAS is not restricted to the detection and tracking of cars,
reflection posts, signal boards, and traffic signs. By using different LTM object patches and
by offline training of the ADAS object classifier in combination with the generic concept of
online tunable TD attention the ADAS is highly dynamic and flexible.
transform parameterizing a clothoid model, whose input signal is generated by the generic
attention system presented here. The TD attention weights used here boost white and
yellow structures on a darker background (so-called on-off contrast), to which the
biologically motivated DoG filter (see Sect. 4.2.1.) is selective. The yellow on-off structures
are weighted more strongly than the white ones to allow the preferred handling of lane
markings in construction sites.
The state-of-the-art unmarked lane detection evaluates a street training region in front
of the car and two non-street training regions at the side of the road. The features in the
street training region (stereo, edge density, color hue, color saturation) are used to detect
the drivable road based on dynamic probability distributions for all cues (see [26] for more
details). A temporal integration procedure between the current and past detected road seg-
ments based on the bird’s eye view is applied (refer to [20] for a comprehensive description
of the temporal integration procedure). The procedure is used to increase the completeness
of the detected road by decreasing the number of false negative road pixels. Based on a bi-
ologically motivated curbstone/elevated road border detection approach the gathered road
detection results are post-processed. This allows the correction of some minor false positive
errors that are induced by the temporal integration step (see Sect. 7.3. for details). In the
final step, a fusion between the detected marked and unmarked road segments is used to
derive the currently drivable lanes.
The final part of the static domain-specific tasks is the state-of-the-art scene classification.
To be able to efficiently tune the generic system modules to the current environment and its
challenges, the current scene context (e.g., inner-city, country road, highway) has to be
known. For classifying the scene, image data is required as the only input. The processing
is roughly organized as follows: after the preprocessing, the resulting image is divided into
16 parts and each part is independently transformed to the frequency domain. In the
following, each transformed part is sampled with an array of shifted and oriented Gaussian
filters, resulting in an average power spectrum for each of the parts. The final classification
step is realized by Hierarchical Principal Component Classification that is trained during an
offline learning phase. For more information please refer to Sect. 5..
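A rough sketch of this processing chain is given below; the classification step is omitted, a block-wise radial power-spectrum description stands in for the array of shifted and oriented Gaussian filters, and all sizes are assumptions rather than the values used in the actual system.

```python
import numpy as np

def scene_descriptor(image, grid=4):
    """Split the image into grid x grid parts, transform each part to the frequency
    domain and describe it by a coarse average power spectrum (radial bands)."""
    h, w = image.shape
    ph, pw = h // grid, w // grid
    descriptor = []
    for gy in range(grid):
        for gx in range(grid):
            part = image[gy * ph:(gy + 1) * ph, gx * pw:(gx + 1) * pw]
            power = np.abs(np.fft.fftshift(np.fft.fft2(part))) ** 2
            # Average power in a few radial frequency bands (a crude stand-in for the
            # sampling with shifted and oriented Gaussian filters described in the text).
            yy, xx = np.mgrid[0:ph, 0:pw]
            radius = np.hypot(yy - ph / 2, xx - pw / 2)
            bands = [power[(radius >= r0) & (radius < r0 + 3)].mean() for r0 in (0, 3, 6, 9)]
            descriptor.extend(bands)
    return np.array(descriptor)

rng = np.random.default_rng(3)
img = rng.random((128, 160))
print(scene_descriptor(img).shape)   # 16 parts x 4 bands = 64 values per image
```

A subsequent classifier (in the text, a hierarchical principal component classification trained offline) would then map this descriptor onto the scene classes.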
Environmental Interaction
The system can interact with the world via an actuator control module. Currently the
ADAS implementation uses a 3 phase danger handling scheme depending on the distance
and relative speed of a recognized obstacle. When an obstacle is detected in front at a
rapidly decreasing distance, a visual and acoustic warning is issued and the brakes are
prepared. In the second phase, the brakes are engaged with a deceleration of 0.25 g followed
by hard braking of 0.6 g in the third phase.
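Schematically, such a staged danger handling can be expressed as a function of distance and relative speed (here via time-to-collision); the thresholds below are purely illustrative and not the values used in the prototype car.

```python
def danger_phase(distance_m, closing_speed_mps):
    """Three-phase danger handling: 0 = none, 1 = warn and prepare brakes,
    2 = moderate braking (~0.25 g), 3 = hard braking (~0.6 g).
    The time-to-collision (TTC) thresholds are purely illustrative."""
    if closing_speed_mps <= 0.0:
        return 0                        # obstacle not approaching
    ttc = distance_m / closing_speed_mps
    if ttc < 0.8:
        return 3                        # phase 3: hard braking
    if ttc < 1.6:
        return 2                        # phase 2: brake with moderate deceleration
    if ttc < 3.0:
        return 1                        # phase 1: visual/acoustic warning, prepare brakes
    return 0

for d, v in [(40.0, 5.0), (20.0, 10.0), (8.0, 12.0)]:
    print(d, v, "-> phase", danger_phase(d, v))
```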
System Control
The system control module realizes a functional mapping from an input feature space of
measured internal system-state variables to an output parameter space for the modulation
of the system behavior. In the following, the specific features and parameters that the
current system uses for control are named.
Measurement of internal system-state variables (input feature space):
• (i0 ) No condition,
In the following, instances of the functional mapping are listed that control the multiple
parallel "what" pathways, the actuators, as well as the STM data. The functional mapping
between input and output is visualized by the symbol ⇒. The task represented by each
instance is set in parentheses at the end; a schematic sketch of such a rule table is given
after the list.
• 1. (i0 ) ⇒ (o2 ) LTM search for cars, (o5 ) λ = 0.5 (search for cars),
• 2. (i2 ) NCC tracker confidence for a car below threshold ⇒ (o2 ) Interrupt LTM
search, (o3 ) redetect lost object using TD attention, (o6 ) Set spatial prior, (o5 ) λ = 1
(redetect lost cars),
• 3. (i0 ) ⇒ (o2 ) LTM search for signal boards, (o5 ) λ = 0.5 (search for signal boards),
• 4. (i2 ) NCC tracker confidence for a signal board below threshold ⇒ (o2 ) Interrupt
LTM search, (o3 ) redetect lost object using TD attention, (o6 ) Set spatial prior, (o5 )
λ = 1 (redetect lost signal boards),
• 6. (i6 ) Dynamic object leaving the field of view ⇒ (o5 ) λ = 0, (scene exploration,
search dynamic objects),
• 7. (i6 ) Dynamic object reentering the field of view ⇒ (o5 ) λ = 1, (o4 ) Number of
STM objects = 1, (o6 ) Set spatial prior, (o3 ) Saliency tracker support (track dynamic
object),
• 8. (i1 ) Car-like road opening detected ⇒ (o6 ) Set spatial prior in (o2 ) for cars (analyze
conspicuous image region).
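As announced above, the following sketch encodes two of the listed instances as a schematic rule table in Python; the (i*)/(o*) names follow the list, while thresholds, data structures, and everything else are assumptions rather than the actual system interfaces.

```python
# Schematic rule table for the system control module: each rule maps a condition on the
# internal system state to a set of parameter changes.
RULES = [
    {   # instance 1: default behaviour - LTM search for cars with balanced TD/BU attention
        "condition": lambda state: True,                                     # (i0) no condition
        "actions": {"ltm_search": "car", "lambda": 0.5},                     # (o2), (o5)
    },
    {   # instance 2: a tracked car is about to get lost - switch to pure TD redetection
        "condition": lambda state: state.get("ncc_confidence_car", 1.0) < 0.6,   # (i2)
        "actions": {"ltm_search": None, "redetect_lost": True,
                    "spatial_prior": True, "lambda": 1.0},                   # (o2), (o3), (o6), (o5)
    },
]

def control_step(state):
    """Apply all matching rules; later rules overwrite earlier parameter settings."""
    params = {}
    for rule in RULES:
        if rule["condition"](state):
            params.update(rule["actions"])
    return params

print(control_step({"ncc_confidence_car": 0.4}))
```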
In the following, the attention sub-system, i.e., the generic visual preprocessing module,
is described.
The design goals of the introduced TD attention sub-system comprised the development
of an object and task-specific tunable saliency map suitable for the real-world scenarios in
the car domain.
However, the robustness of biological attention systems is difficult to achieve, given
e.g., the high variability of scene content, changes in illumination, and scene dynamics.
Most computational attention models do not show real-time capability and are mainly tested
in a controlled indoor environment on artificial scenes. Important aspects discriminating
real-world scenes from indoor and artificial scenes are the dynamics in the environment
(e.g., changing lighting and weather conditions, dynamic scene content) as well as the high
scene complexity (e.g., cluttered scenes). Dealing with such scenarios requires a strong sys-
tem adaptation capability with respect to changes in the environment. Here, the focus was
set on five conceptual issues crucial for closing the gap between artificial and natural atten-
tion systems operating on real-world scenes. The feasibility of this approach will be shown
on vision data from the car domain (see Sect. 8.1.1.). The described TD tunable attention
system is used as front-end of the vision system of the previously described advanced driver
assistance system, whose architecture is inspired by the human brain.
After elaborating on related approaches in Sect. 4.1. and the attention features in
Sect. 4.2., Sect. 4.3. will describe specific challenges for an attention system under real-
world conditions. Section 4.4. will describe the ADAS attention sub-system in detail point-
ing out the solutions to the denoted challenges. Given these challenges, Sect. 4.5. compares
the proposed attention system on a functional level to two other, influential attention ap-
proaches from literature. Finally, Sect. 8.1.1. evaluates the attention system as a sub-model
of the ADAS.
Figure 4. Psychophysical study conducted by [33] marking the human visual attention as
strong mediator between the world and our perception of the world.
Related to the vehicle domain the task-dependent nature of gazing has also been proven
while steering a car. Recently, it was shown by [34] that the performance for dangerous sit-
uation detection (a colored motorcycle veering into the vehicle’s path) strongly depends on
the feature-match between the current distracting visual task and the unexpected obstacle.
In another example, the gaze of drivers in a virtual environment was examined [35]. The
results show that the performance in detecting stop signs is heavily modulated by context
(i.e., top-down) factors and not only by bottom-up visual saliency. Endowing a vision ar-
chitecture for an intelligent car with similar, task-based attention can result in a gain of per-
formance with minimal additional resource requirements (see gathered results Sect. 8.2.1.).
In most research on human visual attention, the focus is on the bottom-up detection of
salient features/objects in a scene (for a review of biologically evident attention features
see [29]). A well-known computational model for saliency calculation is the approach by
[36] that is used in a number of implemented systems. Recently, this approach has been
extended by various researchers to account for task-dependent aspects of visual attention
(see, e.g., [37, 38, 39]) by applying dynamic weights to different processing stages. The
task is often to find a specific object within a predominantly static indoor scene. A more
complete view on a possible architecture for a visual system incorporating task-dependent
visual attention is given by [32, 40]. The proposed architecture combines top-down (TD)
and bottom-up (BU) influences by using TD weights on the calculated BU features. How-
ever, there is no separation between the untuned BU saliency map and the calculated TD
saliency maps allowing a weighted combination, which would ensure the preservation of
BU influence in all system states. The system is evaluated mainly on static indoor scenes
and a few static outdoor scenes. Furthermore, there are only few attention-based vision sys-
tems that use a motion feature (see [41, 42]). Given the importance of motion in the human
visual perception, modeling the influence of scene dynamics on attention can be seen as a
key issue to realize robust human-like vision systems.
In Sect. 4.5., the two related top-down attention systems of [32] and [1] were chosen
for a detailed structural and functional comparison, since these influenced the work
presented here the most.
However, numerous other psychophysical and computational attention models exist
(please refer to [43, 44, 1, 45] for a comprehensive overview of the latest developments
in attention research and [46] for an overview of related psychophysical studies).
To the authors' knowledge, no task-dependent tunable vision system that mimics human
attention processes exists in the car domain so far.
growing number of distractors the mean search time is constant or only slightly increasing
(see Fig. 5). A basic feature allows a clear differentiation against distractors that do not
possess this feature (i.e., one differentiating feature exists in these so-called feature search
tasks).
In addition to the three features named before, recent biologically motivated systems also
incorporate depth and motion (see, e.g., [48]). Both features are marked as basic features by
most researchers (see [29] for an overview). An important property of said basic features
and reason for their efficiency in visual search tasks is that they draw or guide attention.
The attention principle plays an important role for the here developed ADAS since it will
allow solving specific vision tasks in a generic fashion. It will be described in detail in
Sect. 4.4.. In a nutshell, the attention features introduced in the following are combined
to form a saliency map that is the key aspect and major output of a human-like attention
system. This saliency map shows high activation at image regions that contain a high level
of information in terms of a specific vision task (top-down driven activation) or because the
image regions differ strongly from the rest of the image, meaning that a high local entropy
is present (bottom-up driven activation).
Figure 5. Efficient and inefficient visual search tasks: (a) Efficient search: Orientation is a
basic feature, (b) Inefficient search: Numeric character differentiation is not a basic feature.
neuronal signal processing for a DoG-like receptive field that equals the convolution:
\[
I_{filt}(u,v) = g(u,v) * I(u,v) = \sum_{u_0=0}^{U-1} \sum_{v_0=0}^{V-1} g(u_0,v_0)\, I(u-u_0, v-v_0). \qquad (9)
\]
Figure 6. Simple static neuron model with receptive field and synaptic weights g(u, v) that
are equivalent to a symmetric filter kernel used in image convolution.
In the following two subsections, the technical equivalents of two basic measured re-
ceptive field types are described (DoG and Gabor). After that, with RGBY, a color space is
described that mirrors the processing on the retina.
Biological Motivation
In vitro receptive field measurements of ganglion cells in the retina of macaque monkeys
have shown a characteristic center-surround behavior [50]. Also measurements of receptive
fields of neurons located in early regions of the visual pathway of macaque monkeys have
shown a similar characteristic [52]. In other words, the receptive fields are selective to
monotonous regions (blobs), which differ from the background in terms of their intensity.
An example for such contrast is shown in the lower left image in Fig. 8b. Furthermore,
theories and supporting measurements exist which allow specific brain regions in the human
visual pathway to be interpreted as filter banks that decompose an input image in terms of
the existing frequencies [53]. The realized attention system extends this principle to all
static and dynamic features.
In the following, the measured center-surround behavior is modeled using a filter kernel
of two 2D Gaussian functions that are subtracted (Difference of Gaussians). A parameteri-
zation of these Gaussian functions will be provided that will allow the implementation of a
low-loss filter bank for said filter kernel.
The Difference of Gaussians (DoG) filter is selective to monotonous regions (blobs)
of different sizes (see Fig. 8). The filter kernel is not orientation selective (i.e., isotropic).
In its basic form the centre of the DoG filter kernel is excitatory and the lateral region is
inhibitory. The discrete DoG filter kernel results from sampling a Gaussian curve with a
small variance σ_e², which is subtracted from a Gaussian curve with a bigger variance σ_i²
(see also Fig. 7):
\[
DoG(u,v) = \frac{1}{2\pi\sigma_i^2}\, e^{-\frac{u^2+v^2}{2\sigma_i^2}} - \frac{1}{2\pi\sigma_e^2}\, e^{-\frac{u^2+v^2}{2\sigma_e^2}}. \qquad (10)
\]
Figure 7. One-dimensional on-off DoG (black) and the two Gaussian functions it is com-
posed of (positive Gaussian function with small standard deviation in blue, negative Gaus-
sian function in red).
On a more qualitative level, the DoG subtracts the mean weighted intensity of a smaller
center region from the mean weighted intensity of a bigger surround for each image pixel.
As will become apparent in Sect. 4.4., in order to yield high hit rates in top-down re-
lated search (i.e., when searching for a specific object using the saliency map), the features
of an attention system need high selectivity to provide as much supporting and inhibiting
(i.e., suppressing) maps as possible. At the same time, high efficiency is needed due to con-
straints in computational resources. An approach fulfilling these demands is the separation
of the DoG filter into on-center (called on-off in the following) and off-center selectivity (off-
on), as it is emphasized in [1] (see Fig. 9a and Fig. 9b). Thereby, the attention system can
Figure 8. (a) Test image containing all frequencies in all orientations, (b)-(f) Different levels
of the DoG filter bank with the filter response for the test image on top, the filter kernel in
the image domain at bottom left, and the filter kernel in the frequency domain at bottom
right.
differentiate between bright blobs on a dark background and dark blobs on a bright back-
ground. To realize such an on-off/off-on separation the DoG filter response is separated into
its positive and negative part, which is equivalent to the computationally more demanding
independent filtering with the two different filter kernels depicted in the lower left corner of
Fig. 9a and Fig. 9b respectively.
(a) DoG on−off (b) DoG off−on (c) 0° Even Gabor on−off
(d) 0° Even Gabor off−on (e) 0° Odd Gabor on−off (f) 0° Odd Gabor off−on
Figure 9. Application of filter kernels on simple test images (negative filter response is cut
off). Both the 2 DoG features (a), (b) and the 4 Gabor features (c)-(f) are realized with one
filter operation each. Every picture shows on the left the used input test image and on the
right the respective filter response for the filter kernel in the bottom left corner.
As described at the beginning of this section, the attention features are computed on
different scales allowing a decomposition of the signal into overlapping frequency chunks.
In the following, the appropriate parameterization of the DoG kernel is derived, which
makes such a decomposition possible.
\[
\sigma_i = \frac{1}{\pi f_{center}} \sqrt{\frac{\ln(\gamma_{DoG})}{\gamma_{DoG}^2 - 1}} \qquad (11)
\]
As described before, f_center in Equ. (11) is set to 0.25. Furthermore, e.g., [53] recommends
setting the ratio between the inner and outer Gaussian to γ_DoG = 1.6, which results in a
low-loss signal decomposition with only few redundancies between scales. Thereby, all
parameters of Equation (10) are determined and hence the DoG kernel is defined.
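Putting Equations (10) and (11) together, the following sketch builds a DoG kernel for f_center = 0.25 and γ_DoG = 1.6, filters an image with it, and splits the response into its positive and negative parts, i.e., the on-off/off-on separation described above. The kernel size, the relation σ_e = σ_i / γ_DoG, and the use of SciPy for the convolution are assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

def dog_kernel(f_center=0.25, gamma_dog=1.6, size=15):
    """DoG kernel with sigma_i from Equation (11); sigma_e = sigma_i / gamma_dog is assumed."""
    sigma_i = np.sqrt(np.log(gamma_dog) / (gamma_dog ** 2 - 1)) / (np.pi * f_center)
    sigma_e = sigma_i / gamma_dog
    r = np.arange(size) - size // 2
    u, v = np.meshgrid(r, r)
    gauss = lambda s: np.exp(-(u ** 2 + v ** 2) / (2 * s ** 2)) / (2 * np.pi * s ** 2)
    return gauss(sigma_i) - gauss(sigma_e)

def split_response(image, kernel):
    """Filter the image and split the response into its positive and negative parts,
    yielding both contrast polarities (on-off / off-on) from a single filter operation."""
    resp = convolve2d(image, kernel, mode="same", boundary="symm")
    return np.maximum(resp, 0.0), np.maximum(-resp, 0.0)

rng = np.random.default_rng(4)
img = rng.random((64, 64))
img[30:34, 10:50] += 1.0                      # a bright, lane-marking-like bar
pos, neg = split_response(img, dog_kernel())
print(pos.max(), neg.max())                   # both polarities are available as separate feature maps
```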
Discussion
As described in the previous subsection, the separation of the DoG filter response into
on-off and off-on contrasts increases the feature space without additional computational
costs. In the following, it is qualitatively shown that this approach increases the perfor-
mance of a marked lane detection system. The approach extends known algorithms for lane
marking detection by adding a biologically motivated filter step for preprocessing. More
specifically, the DoG filter is used as input feature. Figure 10a shows a typical inner-city
scenario with strong shadows on the road. For detecting the lane markings, the view from
above (the so-called bird’s eye view) is computed (see Fig. 10b), as will be described in
Sect. 6.4.. On the bird’s eye view, lane marking-like contrasts (bright image regions on a
darker background) are detected by the DoG filter after which a clothoid-model-based ap-
proach for detecting the markings is used (see, e.g., [54, 55, 56] for related clothoid-based
approaches). Figure 10c depicts the DoG filter results without the described on-off/off-on
separation. Since lane markings have a typical on-off contrast (white/yellow markings on
a darker street), the on-off DoG filter results should be used, since these contain fewer false-
positive activations (Fig. 10d). For example, in [57] the pre-filtered road image still contains
the lane-marking-unspecific off-on contrasts (e.g., shadows on the road). Such off-on con-
trasts are filtered out in the marked street detection approach to improve the road detection
performance.
For a quantitative evaluation of the influence of the described on-off DoG separation
an implemented lane marking detection system is used. The system gets a DoG-filtered
edge image without on-off separation (please refer to Fig. 10c) and with on-off separation (as shown
in Fig. 10d). The gathered results are summarized in Tab. 1. The evaluation reveals the
improvement in accuracy of the detected offset (i.e., horizontal position of lane markings)
and radius of the road based on manually labeled ground truth (GT) data of 330 highway
frames (see Fig. 11 for a visualization of the scenario and the gathered results).
Table 1. Mean relative error of detection results (offset and radius of lane marking
model)
Type of input data preprocessing | Mean relative error in offset: $MRE_O = \frac{1}{N}\sum \frac{|GT_{offset} - \mathrm{offset}|}{GT_{offset}}$ | Mean relative error in radius: $MRE_R = \frac{1}{N}\sum \frac{|GT_{radius} - \mathrm{radius}|}{GT_{radius}}$
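The two error measures of Table 1 can be computed directly from labeled ground truth, for example (the numbers below are invented, not the evaluation data):

```python
import numpy as np

def mean_relative_error(gt, est):
    """Mean relative error as used in Table 1: average of |GT - estimate| / GT."""
    gt, est = np.asarray(gt, float), np.asarray(est, float)
    return float(np.mean(np.abs(gt - est) / np.abs(gt)))

# Invented ground-truth and detected lane-marking offsets (in m) and curve radii (in m).
print(mean_relative_error([1.80, 1.75, 1.90], [1.72, 1.80, 1.95]))   # MRE_O
print(mean_relative_error([2400.0, 3100.0], [2378.0, 3000.0]))       # MRE_R
```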
Figure 10. Exemplary performance gain of on-off DoG separation as pre-processing step
of a lane marking detection system: (a) Input image, (b) Bird’s eye view, (c) DoG result
without on-off separation, (d) DoG result with on-off contrasts only (off-on contrasts are
filtered out).
Biological Motivation
According to [58], the lower layers of the cortex in cats contain orientation-selective
neuron populations. Please note that for lines and edges no 360 degree direction, but only
a 180 degree orientation, is defined. The activation of these neuron populations decreases by
50% when the stimulus is rotated by 15-20 degrees relative to the preferred orientation.
According to the spatial frequency theory (refer to [17] for details), these results give
biological motivation for a filter bank with an angle selectivity of 30-40 degrees. Also a
frequency (respectively scale) selectivity of these neuron populations was proven to exist based on
Figure 11. Sample images of the evaluation scene (lane marking detection results visual-
ized).
experiments (see [59]). The receptive fields of these neurons can be described by even and
odd Gabor functionals (pairs of quadrature filters), which represent a Gaussian kernel that
is modulated by a sine functional with a phase shift of 0 and 90 degree. For a visualization
Fig. 12 depicts a one dimensional odd and even Gabor kernel.
The even Gabor filter kernel, which is selective to lines, equals the real part of g(x) in
Equ. (12). The odd Gabor filter kernel, which is selective to edges, equals the imaginary
part of g(x) in Equ. (12), with x = (u, v)^T.
\[
g(x) = \frac{1}{2\pi a b}\, e^{-\frac{1}{2} x^T A x}\, e^{j k_0^T x} \qquad (12)
\]

With:
\[
k_0 = \begin{pmatrix} |k_0| \cos\phi \\ |k_0| \sin\phi \end{pmatrix}, \qquad
A = R P R^T = \begin{pmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{pmatrix}
\begin{pmatrix} a^{-2} & 0 \\ 0 & b^{-2} \end{pmatrix}
\begin{pmatrix} \cos\phi & \sin\phi \\ -\sin\phi & \cos\phi \end{pmatrix}
\]
The variances a² and b² influence the size of the underlying Gaussian function in the
two image dimensions. According to the biological measurements, the filter kernel will
have a modulation orthogonal to the longer principal axis of the Gaussian curve, which
requires a² to be smaller than b². The angle φ in the rotation matrix R determines the
orientation of the filter kernel:

\[
\phi = m \Delta\phi \quad \text{and} \quad \Delta\phi = \frac{2\pi}{M} \quad \text{with } m \in [0\,..\,M-1]. \qquad (13)
\]
The wave number vector k0 determines the 2D period length of the modulated complex
oscillation and thereby the selectivity of the kernels in the frequency domain. The value |k0 |
determines the line width the filter is selective to. It is important to note that |k0 | is constant
for all filter orientations.
The factor γ_Gabor is introduced, which is the ratio between the principal axes a and b
(i.e., width and height) of the underlying Gaussian curve:

\[
\gamma_{Gabor} = \frac{a}{b} \quad \text{with } a < b. \qquad (14)
\]
In [60] γ_Gabor was measured to be between 0.25 and 1.0, with a mean of 0.6, in neuronal
receptive fields in the cat cortex. Given these measurements and Equ. (13), the number of
orientation channels M is typically set to a value between 4 and 18. As shown in [52],
the application of the same parameter setting for the size of the Gaussian curve on all
orientations of a specific frequency channel is possible. Based on this, the following generic
equation is derived for γ_Gabor in [52]:

\[
\gamma_{Gabor} = \frac{a}{b} = \frac{3 \sin\!\left(\frac{\Delta\phi}{2}\right)}{\sqrt{1 - 9\left(\cos\!\left(\frac{\Delta\phi}{2}\right) - 1\right)^2}}. \qquad (15)
\]
Since the filter bank uses a Gaussian resolution pyramid the filter kernel is also independent
of the frequency channel (i.e., scale).
It is now sufficient, to determine the parameter a in Equ. (15), which encodes the overlap
in frequency domain between the filter kernels of a specific orientation of two adjacent
frequency channels (scales). In [52], a generic formulation for the parameter a is proposed
that contains the parameter r, which determines the value at the overlap of two adjacent
transfer functions (adjacent in orientation or scale):
\[
a = \frac{3\sqrt{-2\ln(r)}}{|k_0|}. \qquad (16)
\]
The parameter r is a value between 0 and 1 and is the ratio between the overlap value and
the maximal value of the transfer function. The bigger r, the more adjacent filter kernels
overlap in the frequency domain. Setting r = 0.5 yields a reversible (non-dissipative and
disjoint) signal decomposition compatible with wavelet theory. This means that during
Gabor decomposition of an image and subsequent recombination nothing is lost and no
redundancy exists.
Summarizing, the filter bank formulation uses the parameter ∆φ defining the number
of orientation channels. Furthermore, the overlap value r and the value |k0 | define the
frequency selectivity of the filter. A typical value for |k0 | is 0.25, which makes the filter
selective to lines with a period length of 4 and hence a line thickness of 2 pixels. The
following Fig. 13 depicts a Gabor filter bank with ∆φ = 45 degree, r = 0.5 and |k0 | = 0.25.
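Combining Equations (12) to (16), the sketch below constructs one even/odd Gabor kernel pair for the cited parameterization (∆φ = 45 degree, r = 0.5, |k0| = 0.25); the kernel size is an assumption, and the formulas follow the reconstructions given above rather than the authors' code.

```python
import numpy as np

def gabor_pair(phi, k0_abs=0.25, delta_phi=np.pi / 4, r=0.5, size=63):
    """Even (real) and odd (imaginary) Gabor kernels following Equations (12)-(16)."""
    # Equation (16): envelope size a from the overlap value r of adjacent scales.
    a = 3.0 * np.sqrt(-2.0 * np.log(r)) / k0_abs
    # Equation (15): aspect ratio gamma = a / b from the orientation spacing delta_phi.
    gamma = 3.0 * np.sin(delta_phi / 2.0) / np.sqrt(1.0 - 9.0 * (np.cos(delta_phi / 2.0) - 1.0) ** 2)
    b = a / gamma
    # Rotated quadratic form A = R P R^T and wave number vector k0 of Equation (12).
    c, s = np.cos(phi), np.sin(phi)
    R = np.array([[c, -s], [s, c]])
    P = np.diag([a ** -2, b ** -2])
    A = R @ P @ R.T
    k0 = k0_abs * np.array([c, s])
    # Sample g(x) on a discrete grid.
    coords = np.arange(size) - size // 2
    u, v = np.meshgrid(coords, coords)
    X = np.stack([u, v], axis=-1).astype(float)
    quad = np.einsum("...i,ij,...j->...", X, A, X)
    g = np.exp(-0.5 * quad) * np.exp(1j * (X @ k0)) / (2.0 * np.pi * a * b)
    return g.real, g.imag    # even (line-selective) and odd (edge-selective) kernels

even, odd = gabor_pair(phi=0.0)
print(even.shape, odd.shape, float(np.abs(even).max()))
```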
As stated before, Gabor functions model the receptive field characteristics in early lay-
ers of the vision system of mammals. Gabor filters are suited to the local detection of image
frequencies since they optimally fulfill the trade-off between a good resolution in frequency
and image domain, which represents the “uncertainty principle of quantum mechanics” for
image processing. In other words, the time-bandwidth product reaches its lower bound for
Gabor functions, as was shown in [61], which is equivalent to the fact that Gabor filter pairs
are optimally localized both in the image and frequency domain. For a very descriptive
mathematical formulation of the mentioned facts see [52]. A good localization in the im-
age domain allows small filter kernels and hence minimizes the calculation time for filter
operations. Additionally, a good localization in frequency domain allows an efficient use
of Gaussian pyramids as well as sparseness in the orientation selective feature maps. An
optimized localization in the image domain alone leads to sparse lines in the Gabor filter
response (no repeating patterns), but lines of adjacent orientations are amplified as well.
An optimized localization in the frequency domain leads to good selectivity regarding the
orientation of lines, but the filter response in the image domain shows no isolated line at a
certain location, but an unlocalized pattern of lines of the specific orientation.
For summarizing qualitatively, Gabor filters are selective to oriented lines (i.e., con-
tours), when filtering an image with the even part of the Gabor kernel. Furthermore, Gabor
filters are selective to oriented edges (i.e., steps), when using the odd part of the Gabor
kernel. Additionally, Gabor filters are scale selective (i.e., selective to a certain thickness of
lines respectively sharpness of edges).
Conceptional Extensions
Besides the well-known concept of decomposing the Gabor filter response in the odd
and even part (i.e., computing the real and imaginary part of the filter response), here, an
additional decomposition is done, which is motivated by the DoG filter decomposition.
More specifically, the DoG on-off center concept is transferred to the Gabor filter, and the
odd and even Gabor responses are separated into their positive and negative parts (please
refer to Fig. 9c-f for a visualization). The proposed decomposition increases the performance of
the ADAS attention system. For example, an on-off versus off-on even Gabor separation
allows for the efficient separation of specifically oriented white lane markings from shad-
ows on the road. Also, as shown in the following subsection an on-off/off-on separation for
odd Gabor allows for the crisp suppression of the sky edge present in most scenes in the car
domain.
Figure 13. (a) Test image containing all frequencies in all orientations, (b)-(f) Even Gabor
(without on-off/off-on separation), orientation 0 degree on 5 scales, on top: filter response
to test image, bottom left: filter kernel in image domain, bottom right: filter kernel in fre-
quency domain, (g)-(l) Even Gabor orientation 45 degree, (m)-(q) Even Gabor orientation
90 degree, (r)-(v) Even Gabor orientation 135 degree.
In sum, 4 different Gabor-based feature types are derived from one filtering step (even
on-off, odd on-off, even off-on, odd off-on). Each of these 4 Gabor feature types consists of
20 independently weighable sub-feature maps (4 orientations on 5 scales each). Hence, the
Gabor filter bank in the used parameterization allows for 80 independent filter responses
and hence features.
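The decomposition into the four Gabor feature types can be written compactly: each complex filter response (one per orientation and scale) is split into its even/odd and positive/negative parts. In the sketch below, random complex data stands in for actual Gabor filtering results.

```python
import numpy as np

def split_gabor_response(response):
    """Split one complex Gabor response into the 4 feature types:
    even on-off, even off-on, odd on-off, odd off-on."""
    even, odd = response.real, response.imag
    return {
        "even_on_off": np.maximum(even, 0.0),
        "even_off_on": np.maximum(-even, 0.0),
        "odd_on_off": np.maximum(odd, 0.0),
        "odd_off_on": np.maximum(-odd, 0.0),
    }

# 4 orientations x 5 scales x 4 feature types = 80 independently weighable maps.
rng = np.random.default_rng(5)
maps = {(o, s): split_gabor_response(rng.standard_normal((32, 32)) + 1j * rng.standard_normal((32, 32)))
        for o in range(4) for s in range(5)}
print(sum(len(m) for m in maps.values()))   # -> 80
```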
Discussion
As discussed in the last subsection, a decomposition of the even and odd Gabor feature
in their on-off and off-on component is proposed, in order to increase the feature selectivity.
In the car domain the search performance is strongly influenced by the horizon edge, which
is present in most images of highways and country roads. In the following, this serves as an
exemplary problem for showing the importance of high feature selectivity. Typically, the
horizon edge is removed by mapping out the sky in the input image, which might not be
biologically plausible and is error prone. Instead, the horizon edge is suppressed directly
in the attention by weighting the sub-feature maps (the required weighting procedure is de-
scribed in Equation (2) in Sect. 3.), based on the high selectivity of the attention features.
The gain of this approach is depicted in Fig. 14c that shows the diminished influence of
the horizon edge on the saliency map of the real-world example in Fig. 14a. For a quan-
titative evaluation of the performance gain based on attentional sky suppression, refer to
Sect. 8.1.1..
For a further qualitative assessment of the gain of the on-off/off-on separation see Fig. 15a-d.
Here, a successful top-down search for the black vehicle is only possible using the full
separability of the input feature space.
When comparing the properties of Gabor and DoG filters, it becomes apparent that a DoG
filter response of a specific scale is equivalent to a combination of the 4 Gabor filter responses
for the 4 orientations of the same scale. Still, using both features in an attention system
(instead of Gabor alone) is reasonable, because the discrete nature of image filtering leads
to a certain loss in selectivity for combined Gabor feature maps. Supporting the DoG
feature in the attention system makes up for these losses.
However, it is important to note that DoG and Gabor filters are not independent, which
will play an important role in the normalization process described in Sect. 4.4. (see page
98). Normalization will be done to assure comparability of all attention features, an aspect
that is commonly neglected in comparable systems but is important for a robust vision system.
Biological Motivation
The so-called human search asymmetries, which are measured in psychophysical studies
(e.g., an inclined line among vertical lines is detected more easily than a vertical line
among inclined lines), were conceptualized in [62]. The authors propose a theory stating that
it is easier to detect feature deviations of a non-canonical feature among canonical
features than the other way around. The term canonical feature is related to the term
Figure 14. Evaluation of selectivity, (a) Input image, (b) Original bottom-up attention with-
out sky suppression, (c) Modified bottom-up attention with attentional sky suppression (top-
down influence), using suppressive odd Gabor filter kernels in low scales, (d) Bottom-up
attention with traditional sky suppression.
basic feature (refer to the first paragraph of Sect. 4.2. on page 73 for details on basic fea-
tures). In short, basic features define specific feature types that guide the attention (e.g.,
lines, color), whereas a canonical feature defines characteristics of a certain feature param-
eterization within a basic feature type (e.g., a subset of certain orientations leads to a line
feature that is canonical). Hence, canonical features can be understood as feature param-
eterizations that mimic the way neuron populations are tuned in the human vision system
(e.g., lines of 0 degree and 45 degree orientation are canonical). Feature deviations are rep-
resented by a combination of canonical features (e.g., a 10 degree line is represented by a
combination of 0 degree and 45 degree neurons). Based on measured search asymmetries,
such canonical features can be located. The psychophysical tests described in [47] revealed
typical search asymmetries for colors, which suggest that a canonical feature parameteriza-
tion for colors should differentiate between red, green, blue, and yellow. Furthermore, the
human morphology in the early visual processing on the retina supports this notion. More
specifically, cells exist on the retina that are tuned to red-green and blue-yellow contrasts
(so-called color opponents). Both facts give a biological motivation for preferring RGBY
colors to any of the earlier mentioned color spaces.
Figure 15. Evaluation of the gain of on-off/off-on decomposition and usage of DoG as
feature, (a) Shady input image with top-down search target black car, (b) Bottom-up atten-
tion, (c) Top-down attention, without on-off feature separation (overall attention is negative,
maximum is not on the car, search is not successful), (d) Top-down attention, with on-off
feature separation, only positive values are displayed, search is successful.
R = R - \frac{G + B}{2}    (17)

G = G - \frac{R + B}{2}    (18)

B = B - \frac{R + G}{2}    (19)

Y = \frac{R + G}{2} - \frac{|R - G|}{2} - B.    (20)
Drawbacks of this approach are the missing white balance and the missing uniformity of the
resulting color maps. A color space is uniform if the distance between adjacent colors
is equal over the whole color space (as related to the human color perception, i.e., the human
ability to distinguish between colors). Uniform color spaces hence numerically represent
colors in a way that is very similar to the human color perception.
A more complex approach for computing RGBY colors is described in the following.
The basic idea is based on the work of [1]. This more complex approach has a number
of important advantages, namely its incorporated white balance and its uniformity. Since the
computational demands are moderate, this RGBY approach is included in the feature space
of the introduced attention system.
RGBY colors are based on the Lab color space. The Lab color space, like the Luv
color space, was defined in 1976 by the CIE (Commission Internationale de l'Eclairage) as a
more accurate model of the human color perception. Both color spaces are uniform. The
well-known HSV (Hue, Saturation, Value) and XYZ (X and Z contain color information,
Y luminance information) are examples of non-uniform color spaces. In the uniform Lab
color space, "L" holds the luminance information, "a" represents the red-green contrast and "b"
the blue-yellow contrast. Since the L-channel is independent of the color information,
the Lab color space shows a certain extent of invariance against changes in illumination (similar to
the HSV color space). A basic illumination invariance is very important for the proposed
attention system. At the end of this subsection, the illumination invariance properties of the RGB
and RGBY color spaces will be compared for a color-based detection of signal boards in
twilight. The Lab color space is computed based on the following equations:
L = 116 \left( \frac{Y}{Y_n} \right)^{1/3} - 16    (21)

a = 500 \left[ \left( \frac{X}{X_n} \right)^{1/3} - \left( \frac{Y}{Y_n} \right)^{1/3} \right]    (22)

b = 200 \left[ \left( \frac{Y}{Y_n} \right)^{1/3} - \left( \frac{Z}{Z_n} \right)^{1/3} \right].    (23)

with X_n = 242.4, Y_n = 255.0, Z_n = 277.7
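For illustration, a compact sketch of the Lab conversion according to Equations (21)-(23) is given below (Python/NumPy). The RGB-to-XYZ matrix is an assumption (generic sRGB primaries), since the monitor-dependent matrix from [64] is not reproduced here, and the piecewise-linear part of the official CIE definition is omitted.

import numpy as np

# Assumed RGB -> XYZ matrix (sRGB primaries); the chapter uses the matrix from [64].
RGB2XYZ = np.array([[0.4124, 0.3576, 0.1805],
                    [0.2126, 0.7152, 0.0722],
                    [0.0193, 0.1192, 0.9505]])
XN, YN, ZN = 242.4, 255.0, 277.7  # reference white as given above

def rgb_to_lab(rgb):
    """Per-pixel Lab values following Equations (21)-(23) for an 8-bit RGB image."""
    xyz = rgb.astype(float) @ RGB2XYZ.T
    x, y, z = xyz[..., 0] / XN, xyz[..., 1] / YN, xyz[..., 2] / ZN
    fx, fy, fz = np.cbrt(x), np.cbrt(y), np.cbrt(z)
    L = 116.0 * fy - 16.0          # luminance
    a = 500.0 * (fx - fy)          # red-green axis
    b = 200.0 * (fy - fz)          # blue-yellow axis
    return np.stack([L, a, b], axis=-1)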
The CIE XYZ color space was initially developed in order to better match the characteristics of
monitors based on RGB color information. The elements of the XYZ transformation matrix
have to be selected depending on the monitor type. In the system proposed here, the
transformation matrix proposed in [64] is used.
To compute RGBY, [1] proposes using the Euclidean distance between the Lab color pixels of
the image and the four Lab reference colors (a_ref,R = 0 and b_ref,R = 127 for red, a_ref,G = 127 and
b_ref,G = 127 for green, a_ref,Y = 127 and b_ref,Y = 0 for yellow, and a_ref,B = 127 and b_ref,B = 127 for
blue). Exemplarily, Equ. (25) to (27) show the computation of the R color map of the RGBY
space that has to be applied pixel-wise over the whole image:
Based on this, 4 color maps will result that contain only non-negative values (see
[1]). For a numerical representation of the interdependencies between RGBY and RGB
see Tab. 2.
However, there is a drawback in the approach of [1], which makes the resulting color
maps inappropriate for use in the attention system proposed here. The problem is
that the color maps computed this way are not independent. More specifically, the R map is equal
to the inverted G map and the B map is equal to the inverted Y map, which means that only
2 independent color maps exist. Thereby selectivity is lost. Furthermore, the R map, for example,
holds zeros at image positions of pure green, whereas for an attention system a zero
value in the red and green color maps should define the intermediate value between red and
green.
Therefore, in the following, a rescaling is proposed that leads to four independent
RGBY color maps. Furthermore, it will be shown that the on-off/off-on separation for
RGBY colors is inherent in this representation and therefore already included. Additionally, so-called
double color opponent maps are proposed that are selective to color contrasts. Furthermore,
their importance for the attention is shown by example.
Conceptional Extensions
In order to allow a more suitable decomposition of color maps, the approach of
R_{final} = \begin{cases} 2 R_{tmp} & \forall\; R_{tmp} > 0 \\ 0 & \forall\; R_{tmp} \le 0 \end{cases}    (32)

G_{final} = \begin{cases} 2 G_{tmp} & \forall\; G_{tmp} > 0 \\ 0 & \forall\; G_{tmp} \le 0 \end{cases}    (33)

B_{final} = \begin{cases} 2 B_{tmp} & \forall\; B_{tmp} > 0 \\ 0 & \forall\; B_{tmp} \le 0 \end{cases}    (34)

Y_{final} = \begin{cases} 2 Y_{tmp} & \forall\; Y_{tmp} > 0 \\ 0 & \forall\; Y_{tmp} \le 0 \end{cases}.    (35)
As the equations show, a normalization is done prior to the decomposition into four independent
color channels. The resulting RGBY color maps do not contain redundancies after
the proposed decomposition procedure.
As a result, four normalized independent RGBY color maps are obtained. An additional
decomposition of all these maps into on-off and off-on components, which would double
the number of color feature maps, is not possible. This is the case since the decomposition
of the RG and BY maps into four independent color maps is equivalent to an on-off/off-on
decomposition. Furthermore, image pyramids for all four color maps are built in order
to allow a separation between colored blobs of different sizes. The computed four independent
color pyramids can be used for an attention-based task-driven search (e.g., a search
for red objects of a certain size). The concept of task-driven attention (so-called top-down
attention) is described in detail in Sect. 4.4.
However, colors themselves are not a cue for the contrast-driven attention (so-called
bottom-up attention). For example, a uniformly red image should not lead to contrast-driven
attention. How such a contrast-driven (bottom-up) attention is computed will be
described in Sect. 4. A color-based contrast-driven attention should be guided to image
positions showing color contrasts (e.g., a green blob on a red background). The color feature
type required for such operations is termed double color opponency in the literature. These
double color opponent maps are obtained by filtering all four RGBY color maps with a
Difference of Gaussians kernel based on image pyramids. As a result, for each color map five
double color opponent maps are obtained, also allowing a differentiation between red blobs
on a green background and vice versa. The resulting color contrast maps can be used
to detect targets based on their color contrast (e.g., traffic signs of typically bright colors in
a typically less colorful traffic environment). In addition to bottom-up searches, such color
contrasts can be used for the top-down search for objects with high color contrasts.
In sum, the color feature space consists of four RGBY color pyramids (i.e., 20 color
feature maps) and 20 double color opponent maps. Top-down search can hence draw on 40
color feature maps and bottom-up search on 20 color maps.
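A minimal sketch of how such double color opponent maps could be computed is given below (Python with SciPy); the DoG parameters and the dyadic scale progression are illustrative assumptions.

import numpy as np
from scipy.ndimage import gaussian_filter

def dog(channel, sigma_center=1.0, ratio=1.6):
    """Difference-of-Gaussians response of one color map."""
    return (gaussian_filter(channel, sigma_center)
            - gaussian_filter(channel, sigma_center * ratio))

def double_color_opponency(rgby, n_scales=5):
    """DoG on each RGBY map over a pyramid of scales.
    rgby: dict with 'R', 'G', 'B', 'Y' float maps.
    Returns color-contrast maps, e.g. responding to a green blob on red ground."""
    contrast_maps = {}
    for name, channel in rgby.items():
        for s in range(n_scales):
            response = dog(channel, sigma_center=2.0 ** s)  # coarser blobs at higher s
            # keep only positive responses; the opposing color map carries the inverse
            contrast_maps[(name, s)] = np.maximum(response, 0.0)
    return contrast_maps

With 4 color maps and 5 scales this yields the 20 double color opponent maps mentioned above.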
Discussion
In the following, the performance of the RGBY color space is assessed and compared
qualitatively to RGB. For this, a complex traffic scene in twilight is used (see Fig. 16a), in
which the present signal board should be found based on color information. As can be
expected, the absolute R-channel of RGB shown in Fig. 16b is not sufficient for a successful
color-based detection of the signal board. However, even when normalizing the RGB R-channel
to the overall sum of all three color channels, the signal board cannot be detected
(see Fig. 16c). The reason is that the low-light situation changes the color of the signal board
despite the automatic white balance of the camera. When using the proposed non-linear
RGBY colors, a separation on the RGBY R-channel is possible despite the challenging
lighting conditions (see Fig. 16d). An attention-based search for such signal boards can
hence rely on the highly discriminating R-channel, which boosts the performance of the
attention system as compared to the usage of RGB colors. Extensive testing with HSV
colors has also shown inferior performance compared to the proposed RGBY color space.
High Feature Selectivity: In order to yield high hit rates in TD search, an attention system
needs high feature selectivity so as to have as many supporting and inhibiting feature maps
as possible. For this, the used features must be selected and parameterized appropriately.
Even more important for high selectivity is the use of modulatory TD weights on all sub-feature
maps and scales. Many TD attention approaches allow TD weighting only on a high
integration level (e.g., no weighting on scale level [37]) or without using the full potential of
the features (e.g., no on-off/off-on feature separation [32]), which leads to a performance loss.
The system presented here fulfills both aspects. Based on the extended selectivity of the
attention sub-system, the specific challenges of the car domain can be handled, such as dealing
with the horizon edge present in most images.
Figure 16. (a) Input image of road scene in twilight (signal board marked in red), (b)
Absolute R-channel of RGB-space, (c) Relative R-channel of RGB-space (i.e., R-channel
normalized to the sum of RGB-channels, with thresholding), (d) R-channel of RGBY space
(with thresholding).
achieved. The presented approach realizes this by means of the biological principle of homeostasis,
which is defined as the reversible adaptation of essential processes of a (biological) system to
its environment (see, e.g., [65]).
training images (see Sect. 4.4.), the TD weights are illumination-dependent as well. Put
differently, the TD weights are optimal for the specific illumination and thereby for the
contrast that is present in the training images. The usage of TD weights on test images with
a differing illumination will lead to an inferior TD search performance. Instead of adapting
the TD weights dependent on the illumination, a local exposure control is proposed in order
to adjust the contrast of the training images as well as the test images before applying the TD
weight calculation and the TD search. The following attention system description will elaborate
on the means taken to solve the named five challenges.
[Figure 17 (schematic): The CCD camera color image is decomposed on 5 scales (starting from 256x256) into an intensity pyramid (DoG, on-off/off-on), 4 RGBY pyramids, double color opponency, even/odd Gabor (on-off/off-on), and motion; the TD path applies sub-feature weights w_i^TD, the BU path applies sparseness weights w_i^sparse and BU feature weights w_i^BU; the conspicuity maps C_j^TD and C_j^BU are combined with conspicuity weights w_Cj, normalized with w_norm^TD and w_norm^BU, and fused via λ and 1−λ into the overall saliency.]
The attention system consists of a BU and a TD pathway. The TD pathway (red en-
closed region in Fig. 17) allows an object- and task-dependent filtering of the input data.
All image regions containing features that match the current system task well are supported
(excitation), while the others are suppressed (inhibition), resulting in a sparse task-dependent
scene representation. Opposed to that, the BU pathway (blue enclosed region in Fig. 17)
performs an object- and task-unspecific filtering of the input data, supporting scene elements that
differ from their surroundings. The BU pathway is important for a task-unspecific analysis
of the scene, preserving task-unrelated but salient scene elements.
The BU and TD saliency maps are linearly combined to an overall saliency map. This
map is used to generate FoAs that represent the scene elements higher system layers work
on. The combination is realized using parameter λ (on the right hand side in Fig. 17)
that is set dependent on the system state emphasizing the BU and/or TD influence (see
Equ. (43)). Due to this combination the system also detects scene elements that do not
match the current TD system task. By ensuring a certain BU influence such scene elements
are not suppressed, which would otherwise lead to the so-called inattentional blindness
phenomenon (i.e., complete perceptual suppression of scene elements as described in [33]).
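Since Equation (43) itself is not reproduced in this excerpt, the following sketch only assumes the convex combination indicated by the λ and 1−λ branches in Fig. 17; the function and parameter names are illustrative.

import numpy as np

def overall_saliency(s_bu, s_td, lam):
    """Convex combination of normalized BU and TD saliency maps.
    lam = 0: purely bottom-up; lam = 1: purely top-down (as used for
    the traffic sign search later in the chapter)."""
    lam = float(np.clip(lam, 0.0, 1.0))
    return (1.0 - lam) * s_bu + lam * s_td

def focus_of_attention(saliency):
    """Return the (row, col) position of the saliency maximum as FoA seed."""
    return np.unravel_index(np.argmax(saliency), saliency.shape)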
Turning to the processing details, the following modalities are calculated on the cap-
tured color images: RGBY colors (inspired by [1]), intensity by a Difference of Gaussian
(DoG) kernel, oriented lines and edges by a Gabor kernel, motion by differential images
(see Sect. 4.2. for details on the applied static feature maps) and entropy using the structure
tensor.
In the following, a brief recapitulation of the modalities is given, after which the entropy
feature used to set the camera exposure is specified. The features motion and color are used
differently in the BU and TD path. The BU path uses double color opponency from RGBY
colors by applying a DoG on 5 scales on the RG and BY color opponent maps. The filter
results are separated into their positive and negative parts (on-off/off-on separation, whose
importance is emphasized in [1]), leading to 4 pyramids of double color opponent RG-, GR-,
BY-, and YB-maps. The TD path uses the same color feature but additionally 4 pyramids of
the absolute RGBY maps. Absolute RGBY colors do not support the BU pop-out character
and are hence not used in the BU path. A DoG filter bank is applied on 5 scales, separating
on-off and off-on effects. Furthermore, a Gabor filter bank with 4 orientations (0, π/4, π/2,
3π/4) and 5 scales is calculated separately for lines and edges (even and odd Gabor). The
realized Gabor filter bank ensures a disjoint decomposition of the input image. The detailed
mathematical formulation of the used Gabor filter bank can be found in Sect. 4.2.2. Motivated
by the DoG decomposition, the concept of on-off/off-on separation is transferred to Gabor,
allowing, e.g., the crisp separation of the sky edge or of street markings from shadows on the
street. Motion from differential images on 5 scales is used in the BU path alone. Since this
simple motion concept cannot separate static objects from self-moving objects, it is not helpful
in TD search. The entropy T is based on the absolute gradient strength of the structure tensor
A on the image I_gray:
T = \frac{\det(A)}{\operatorname{trace}(A)}.    (36)
The matrix A is calculated using derivatives of Gaussian filters G_u and G_v and a rectangular window W:
A = \begin{pmatrix} \sum_W (G_u * I_{gray})^2 & \sum_W (G_u * I_{gray})(G_v * I_{gray}) \\ \sum_W (G_v * I_{gray})(G_u * I_{gray}) & \sum_W (G_v * I_{gray})^2 \end{pmatrix}    (37)

G_u(u, v) = -\frac{u}{2\pi\sigma^4} \exp\!\left( -\frac{u^2 + v^2}{2\sigma^2} \right),

G_v(u, v) = -\frac{v}{2\pi\sigma^4} \exp\!\left( -\frac{u^2 + v^2}{2\sigma^2} \right).
The entropy is used as a means to adapt the camera exposure and not as a feature.
The local exposure control works on the accumulated activation T_sum = Σ_RoI T on an
image region of interest (RoI) (e.g., coming from the appearance-based object tracker that
is part of the ADAS, for details see Sect. 3.). Here, inspiration came from the human local
contrast normalization. The exposure time is recursively modified in search of a maximum
of T_sum, which maximizes the contrast in the defined image regions. As described in
Sect. 4.2., the system provides 136 independently weighable sub-feature maps.
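The following sketch (Python with SciPy) illustrates the entropy measure of Equations (36) and (37) and the accumulation T_sum over a RoI; realizing the derivatives of Gaussians via gaussian_filter and the rectangular window W via a local mean are implementation assumptions.

import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

def entropy_measure(i_gray, sigma=1.5, window=7):
    """T = det(A) / trace(A) of the structure tensor A (Equations (36), (37))."""
    gu = gaussian_filter(i_gray, sigma, order=(0, 1))   # derivative along u (horizontal)
    gv = gaussian_filter(i_gray, sigma, order=(1, 0))   # derivative along v (vertical)
    # rectangular summation window W realized as a local mean (scaled sum)
    a11 = uniform_filter(gu * gu, window)
    a12 = uniform_filter(gu * gv, window)
    a22 = uniform_filter(gv * gv, window)
    det = a11 * a22 - a12 * a12
    trace = a11 + a22
    return det / (trace + 1e-12)    # small epsilon avoids division by zero

def accumulated_activation(t_map, roi):
    """T_sum over a rectangular region of interest (r0, r1, c0, c1)."""
    r0, r1, c0, c1 = roi
    return float(t_map[r0:r1, c0:c1].sum())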
Following the calculation of the raw features, a postprocessing step on all sub-feature
maps is performed (see Fig. 18). The feature postprocessing consists of 5 steps. First,
all sub-features are normalized to the maximal value that can be expected for the specific
sub-feature map (not the current maximum on the map). For example, for DoG and Gabor
this is done by determining the filter response for the ideal input pattern, i.e., the maximum
possible filter response. The ideal input pattern is generated by setting all pixels to 1 whose
matching pixel positions in the filter kernel are greater than 0. Figure 19 shows the resulting
ideal DoG and 0° even Gabor input patterns that are derived from the given filter kernels.
This procedure ensures comparability between sub-features of one modality (e.g., all sub-feature
maps of the motion modality).
Next, the signal power is calculated by squaring, and a dynamic neuronal suppression
using a sigmoid function is applied for noise suppression. A parameter K_supp shifts
the sigmoid function horizontally, which influences the degree of noise suppression and thereby
the sparseness of the resulting sub-feature maps. After a bilinear resize to the resolution
256x256, which allows the later feature combination, a sparseness weight w_i^sparse is
multiplied in the BU feature postprocessing; it ensures pop-out by boosting sub-feature
Figure 19. Input patterns that maximize the filter response. The maximum of this filter
response is used for sub-feature normalization: (a) Ideal DoG input pattern, (b) Ideal 0◦
even Gabor input pattern.
The sparseness operator is not used in the TD path (see red enclosed region in Fig. 18) in
order to prevent the suppression of weak object features.
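A condensed sketch of the described five postprocessing steps is given below; the sigmoid slope and the concrete value of K_supp are assumptions, as they are not stated in this excerpt.

import numpy as np
from scipy.ndimage import zoom

def postprocess_map(feature_map, max_response, k_supp=0.3, slope=10.0,
                    out_size=256, sparseness_weight=None):
    """Sketch of the postprocessing chain: normalize to the maximum expected
    response, square (signal power), sigmoidal noise suppression shifted by
    K_supp, bilinear resize to 256x256, and (BU path only) multiplication
    with a sparseness weight."""
    f = feature_map / max_response                      # 1) normalize to ideal response
    f = f ** 2                                          # 2) signal power
    f = 1.0 / (1.0 + np.exp(-slope * (f - k_supp)))     # 3) sigmoid suppression
    zoom_factors = (out_size / f.shape[0], out_size / f.shape[1])
    f = zoom(f, zoom_factors, order=1)                  # 4) bilinear resize
    if sparseness_weight is not None:                   # 5) BU path: pop-out boosting
        f = sparseness_weight * f
    return f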
Later in the TD path a weighting on all 136 sub-feature maps takes place to realize inhibition
and excitation. The TD-related tuning on the feature level is motivated by the fact
that neurobiological studies have shown that attentional influences are present very early
in the human visual pathway (see [3]). Furthermore, neurophysiological studies on monkeys
have shown that attention-based modulation of neuronal activities leads to an increase in
activity in case the bias matches the preferences of cell populations. A suppression
of neuron populations can also be encountered in case the attended features do not match the
preferences of the cells (see [3]). These measurements motivate the usage of supporting
(excitation) and suppressing (inhibition) feature TD weights as realized in the presented
attention system.
As referenced earlier, the TD weights w_i^TD are calculated based on Equ. (2). The weight
calculation scheme is inspired by [37] but was extended to match the scene complexity
of the car domain. The average activation in the object region is related to the average
activation in the surround on each feature map F_i^TD, taking only the N_i pixels above the
threshold φ = K_conj · Max(F_i^TD) with K_conj ∈ (0, 1] into account. As opposed to the weighting
scheme proposed by [37], in the approach presented here the threshold φ assures that
numerous small values on a feature map do not even out rarely present big ones. The proposed
threshold φ improves the TD search performance, since big values influence the TD
search performance overproportionally. In the BU path only excitation (w_i^BU ≥ 0) takes
place, since without object or task knowledge nothing can be inhibited in BU. For a more
detailed discussion of feature map weighting see [66, 1].
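Equation (2) is defined earlier in the chapter and not reproduced in this excerpt; the sketch below therefore follows only the verbal description above (ratio of above-threshold activation in the object region to the surround) and should not be read as the exact weight formula.

import numpy as np

def td_weight(feature_map, object_mask, k_conj=0.5):
    """Sketch of a TD weight for one sub-feature map: average above-threshold
    activation inside the object region related to the surround.
    Values > 1 act excitatory, values < 1 inhibitory."""
    phi = k_conj * feature_map.max()                 # threshold, K_conj in (0, 1]
    strong = feature_map > phi                       # only the N_i strongest pixels count
    obj = strong & object_mask
    sur = strong & ~object_mask
    mean_obj = feature_map[obj].mean() if obj.any() else 0.0
    mean_sur = feature_map[sur].mean() if sur.any() else 1e-12
    return mean_obj / mean_sur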
As visualized in Fig. 17, the j = 1..M conspicuity maps C_j^BU and C_j^TD result from a
\tilde{w}_{C_j}(t) = \frac{1}{\displaystyle \frac{1}{N_j} \sum_{\forall u,v \,:\, C_j^{BU}(u,v) > \xi} C_j^{BU}(u, v)}    (41)
Exponential smoothing is used to fuse the old conspicuity weights w_{C_j}(t − 1) with the new
optimized ones w̃_{C_j}(t):
The parameter α sets the velocity of the adaptation and could be adapted online dependent
on the gist (i.e., the basic environmental situation) via a TD link. In case of fast changes in
the environment, α could be set high for a brief interval, e.g., while passing a tunnel, or low
in case the car stops. Additionally, thresholds for all M conspicuity maps are used, based
on a sigma interval of recorded scene statistics, to avoid a complete adaptation to extreme
environmental situations.
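The homeostatic adaptation can be sketched as follows; since Equation (42) is not reproduced in this excerpt, the exponential smoothing form with α weighting the new estimate is an assumption consistent with the role of α described above.

def conspicuity_weight_update(c_bu, w_old, alpha=0.05, xi=0.1):
    """Homeostatic adaptation of one conspicuity weight.
    The target weight (Equation (41)) is the inverse of the mean above-threshold
    activation; exponential smoothing fuses it with the previous weight
    (the exact form of Equation (42) is assumed here)."""
    strong = c_bu[c_bu > xi]
    w_target = 1.0 / strong.mean() if strong.size else w_old
    return (1.0 - alpha) * w_old + alpha * w_target   # alpha = adaptation velocity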
Before combining the BU and TD saliency maps using the parameter λ (see Equ. (43)
and Fig. 17) a final normalization step takes place. Like the sub-feature and conspicuity
maps, the saliency maps are normalized to the maximum expected value. This maximum
value is determined by stepping backwards through the attention sub-system taking into
account all weights (w_i^sparse, w_i^BU, w_i^TD) and the internal disjointness/conjointness of the
features to determine the highest value (v_max,j^BU and v_max,j^TD) a single pixel can achieve in each
BU and TD conspicuity map j. A feature is defined as internally disjoint (conjoint), when
the input image is decomposed without (with) redundancy in the sub-feature space. In
other words the recombination of disjoint (conjoint) sub-feature maps of adjacent scales or
orientations is equal to (bigger than) the decomposed input image. Since DoG and Gabor
are designed to be internally disjoint between scales and orientations (see Sect. 4.2.) the
maximum pixel value on a conspicuity map j is equal to the maximum of the product of all
sub-feature and/or sparseness weights of the sub-features it is composed of (w_i^sparse and w_i^BU
for BU as well as w_i^TD for TD). Motion is conjoint between scales; therefore, the products of
the sub-feature motion weights w_i^BU with their corresponding w_i^sparse are summed up to get
the maximally expected value on the motion conspicuity map. The contribution of the color
feature to the saliency normalization weight is similar but more complex.
Since, apart from DoG and Gabor, there is disjointness between conspicuity maps, the
maximum possible pixel values for all BU and TD conspicuity maps, calculated as described
above, are multiplied with the corresponding w_{C_j} and added to obtain the normalization
weights w_norm^TD and w_norm^BU for the TD and BU attention (please also refer to Fig. 17
for the position where the normalization weights are applied):
w_{norm}^{BU} = \frac{1}{\sum_{j=1}^{M} k_j\, w_{C_j}\, v_{max,j}^{BU}}    (44)

w_{norm}^{TD} = \frac{1}{\sum_{j=1}^{M} k_j\, w_{C_j}\, v_{max,j}^{TD}}.    (45)

With:

k_j = \begin{cases} 0.5 & \text{for } j \in \{\text{DoG, Gabor}\} \\ 1 & \text{for } j \notin \{\text{DoG, Gabor}\} \end{cases}
Using this approach, w_norm^TD will adapt when the TD weight set changes.
It is important to note that the DoG and Gabor features are conjoint, meaning that they
represent the same signal characteristics. Put differently, the conspicuity maps for DoG and
Gabor are not independent. As discussed in Sect. 4.2., using both DoG and Gabor is still
helpful, since the signal decomposition is different for the two filter types. The conjointness
is taken into account in the attention normalization procedure in Equ. (44) and (45) in the
form of the factor k_j, which decreases the integral influence of DoG and Gabor on the overall
attention.
Most important differences comparing the systems: In the presented system, high
selectivity is obtained by decomposing the DoG (on-off/off-on separation) and Gabor (on-off/off-on
separation, lines and edges) features without increasing the calculation time. Furthermore,
the usage of TD weights on all sub-feature maps and scales results in 136 independently
tunable feature weights that increase the selectivity. The resulting scale variance
of the TD weights is not a crucial issue in the car domain. The RGBY color space is used both
as absolute color and as double color opponency. In contrast to [32, 1], in the system presented
here motion is used to support scene dynamics. All sub-feature maps as well as the BU and TD
saliency maps are normalized without losing information or boosting noise, thereby preventing
false-positive detections. Comparability of modalities is assured via homeostasis. The attention
sub-system works on 5 scales starting at a resolution of 256x256. Experiments have shown
that in the car domain bigger image sizes do not improve the attention system performance.
The presented system supports the conjunction of weak features, since the sparseness operator
is not used in the TD path. Illumination invariance is reached by an image-region-specific
exposure control that is coupled tightly to the system.
Figure 20. Examples for traffic sign conditions and scene complexities: (a) blurred due to
camera movement, (b) various textures in the surroundings, (c) similar colors in the background,
(d) partly covered by other objects.
The first approaches for traffic sign detection date back to the 1980s (see [68] for an
overview of the early approaches). The large number of publications covering this topic
can be explained by the high degree of variation of traffic signs between different countries.
This work focuses on German traffic signs; however, the generic nature of the attention
system would allow an international application of the system. Recently, the first
commercial systems for the recognition of speed signs have also become available. Nevertheless,
they are restricted to circular speed signs and are targeted at highways/country roads as application
area. Many of the publications focus on the recognition of a single or a small number of
classes of traffic signs. Still, some authors aim at recognizing all or numerous sign classes,
like [69, 70, 71]. The work of Fang et al. [69] uses two neural nets to recognize all speed
sign categories, one for color and one for edges. However, the neural nets are only trained
with signs of a single size and can therefore only recognize one size for each traffic sign.
Hence, the approach of [69] is not optimal, since the recognition of signs must
be possible for all sizes and hence all distances to a traffic sign.
The general procedure of the majority of algorithms is similar and can be divided into
two parts. First, the algorithms try to extract all regions with possible traffic sign candidates
(Region of Interest (RoI)), termed the detection phase. Second, the previously detected regions
are classified or recognized, filtering out the false positive detections, referred to as the
classification phase. Additionally, a recognized sign can be temporally integrated to increase the
recognition performance even further. However, this can be done with every recognition approach
and is therefore not treated here. Due to space limitations, only the central aspects
of the different approaches are discussed.
The detection phase can be divided into two types of approaches: color-based algorithms
on the one hand and shape-based approaches on the other. Starting with the former,
the image is segmented by typical traffic sign colors, which can be done by color relations
in the RGB color space [72, 73, 74, 75]; sub-spaces of RGB can also be used, as shown by [76].
Even more prominent is a thresholding in the HSV color space, as used by [77, 78,
79, 80, 81], in order to become more independent of lighting conditions. Some more exotic color
spaces, such as CIECAM97, are also used [82]. Besides simple thresholding, a number of more
complex region growing approaches can be found, e.g., the Color Structure Code (CSC),
which is based on a hierarchical region growing with a hexagonal topology, as used by [83].
A segmentation by fuzzy sets, as introduced by [84], is also possible.
The second type, the shape-based approaches, uses, e.g., a Sobel operator [70], a Canny
edge detector [85], or similar operators running on a grayscale image. Afterwards, the
resulting edge image is analyzed and decomposed to find regions of interest containing candidate
road signs. To this end, [70] propose a new method for the detection phase that searches
for vertical symmetry axes, which should enable the detection of all types of traffic signs.
Nevertheless, their example images show only highway scenes with a rather low scene
complexity, leaving it in doubt whether the approach will also show good performance in inner-city
scenarios with a lot of structure.
However, a large part of the publications proposes a combination of the former two, starting
with a color segmentation followed by some kind of shape extraction. The work of
Paclik et al. [79] proposes a shape-based template matching after the color segmentation
in the HSV color space. Garcia et al. [86] use a Prewitt operator to extract edges of the
red color plane. Afterwards, a maximum search along each of the image axes is done to find
RoIs. Another approach by Oh et al. [87] starts by deriving color segments, followed by the
computation of a number of symmetry properties to get candidate regions for traffic signs.
Zhu et al. [71] introduce the concept of the Color-Shape Pair (CSP), where a certain color
segmentation is followed by a specific shape extraction depending on the type of traffic sign.
In principle, the approach should thus be able to detect all types of traffic signs. Nevertheless,
it is not stated how the color segmentation or shape extraction is done. Additionally, the
evaluation is only done with emulated images and no meaningful evaluation data (e.g.,
false positive and false negative rates) is provided. The work of Tsai et al. [88] applies
an RBF network for the color segmentation, followed by some kind of edge filtering.
Most approaches are marked by the fact that their detection stage is targeted mainly at
one type of traffic sign class and cannot easily be extended to additional traffic sign classes.
Figure 21. Detection stage: (a) Input image of attention system, (b) Attention map for the
TD search, (c) RoIs for all sign classes, (d) Fused RoIs for classification stage.
The bottom-up (BU) weights w_i^BU are set object-unspecifically in order to detect
unexpected, potentially dangerous scene elements in the general case. The parameter λ ∈ [0, 1]
(see Equation (43)) determines the relative importance of TD and BU search and is set to
one for the current task of traffic sign detection. Therefore, only the mechanism of top-down
attention is used for the detection of regions with traffic signs. For more details regarding
the used attention system, see Section 4.
[System structure (schematic): a 400x300 RGB input image is converted to grayscale and RGBY; the attention system calculates N features F_i, weights them with the sign-class weights w_i, computes the attention map S, and segments it into RoIs; within each RoI a color segmentation cs(x, y), noise reduction with CS ◦ X_j, masking m_2(x, y), pixel relation, and eccentricity are evaluated for each object, yielding a probability value for each object, RoI, and sign class.]
Each of the traffic sign classes (Stop, Give Way, Warning, Prohibitive signs, etc.) can
be actively searched for by using the TD attention. After the computation of the TD attention
map, we detect the maximum on it and get the focus of attention (FoA). In the following,
the segmented FoA is treated as a region of interest (RoI) for the classification stage. Nevertheless,
the different sign classes have different numbers of occurrences per image and also
differ in their visual distinctness. Therefore, a different number of RoIs has to be extracted for
each class on each image, varying from three for Stop signs up to five for Give Way signs (three
for circular, five for triangular warning signs). The size of the RoIs can vary from 15x15
pixels to 80x80 pixels, which corresponds to the minimum and maximum size of a traffic
sign. The fusion of the resulting RoIs of the different sign classes is realized by an overlap
criterion of 30% on the smaller RoI, which reduces the computation time by processing each
image region only once. For further details on the online implementation of the attention
system please refer to [24].
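The RoI fusion by the 30% overlap criterion could look like the following sketch; merging overlapping boxes into their enclosing box is an assumption, as only the criterion itself is stated.

def fuse_rois(rois, overlap_threshold=0.3):
    """Fuse RoIs of different sign classes: two boxes (r0, c0, r1, c1) are merged
    when their intersection covers at least 30% of the smaller RoI."""
    fused = []
    for roi in rois:
        merged = False
        for i, kept in enumerate(fused):
            r0 = max(roi[0], kept[0]); c0 = max(roi[1], kept[1])
            r1 = min(roi[2], kept[2]); c1 = min(roi[3], kept[3])
            inter = max(0, r1 - r0) * max(0, c1 - c0)
            area_roi = (roi[2] - roi[0]) * (roi[3] - roi[1])
            area_kept = (kept[2] - kept[0]) * (kept[3] - kept[1])
            if inter >= overlap_threshold * min(area_roi, area_kept):
                # enclose both boxes so the region is processed only once
                fused[i] = (min(roi[0], kept[0]), min(roi[1], kept[1]),
                            max(roi[2], kept[2]), max(roi[3], kept[3]))
                merged = True
                break
        if not merged:
            fused.append(roi)
    return fused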
cs(x, y) = \begin{cases} 1 & \text{if } 0.035 \le roi_r(x, y) \le 0.481 \;\wedge\; G_{min} \le roi_g(x, y) \le G_{max} \;\wedge\; B_{min} \le roi_b(x, y) \le B_{max} \;\wedge\; Y_{min} \le roi_y(x, y) \le Y_{max} \\ 0 & \text{otherwise} \end{cases}    (46)
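A sketch of the color segmentation of Equation (46) is given below; only the red interval (0.035, 0.481) is stated in the text, so the remaining channel bounds are placeholders.

import numpy as np

def color_segmentation(roi_rgby, bounds):
    """Binary segmentation cs(x, y) following Equation (46): a pixel is kept when
    all four channel values lie inside class-specific intervals.
    roi_rgby: dict of 'r', 'g', 'b', 'y' maps of the RoI;
    bounds: dict mapping channel name -> (min, max)."""
    cs = np.ones(roi_rgby["r"].shape, dtype=bool)
    for name, (lo, hi) in bounds.items():
        cs &= (roi_rgby[name] >= lo) & (roi_rgby[name] <= hi)
    return cs.astype(np.uint8)

# example bounds for a red-rimmed sign class (the G/B/Y limits are assumptions)
bounds_giveway = {"r": (0.035, 0.481), "g": (0.0, 0.5), "b": (0.0, 0.5), "y": (0.0, 0.5)}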
[Binary structuring elements X_j, j = 1..4 (figure).]

NR = \sum_{j=1}^{4} CS \circ X_j    (47)
Finally, a mask is generated by evaluating the local neighbourhood (see [64]) of each
pixel. To this end, k_i operators (given by Equations (48) to (53)) are applied in two stages to
the result of the noise reduction stage NR. The computation of the first stage (with result
m_1(x, y)) is started at the upper left and continued to the right in a row-wise manner. The
computation of the second stage (with overall result m_2(x, y)) is started at the upper right
and continued to the left, also in a row-wise manner. The resulting mask is depicted in
Fig. 24d.
m_1(x, y) = \begin{cases} 1 & : k_1(x, y) = 1 \;\vee\; k_2(x, y) = 1 \;\vee\; nr(x, y) = 1 \\ 0 & : \text{otherwise} \end{cases}    (48)
Figure 24. Classification stage: (a) Input RoI of classification system, (b) Result of the
color segmentation, (c) After opening operation, (d) Masks for each object within the RoI.
the corner matching feature. Hence, processing starts with the object having the highest result of the
corner matching, followed by the second highest, and so on. The mask of each
object is multiplied with the initial RoI, providing only the relevant pixels.
The pixel relation is defined by the ratio of red to white pixels. Therefore, the number
of pixels of each of the two classes has to be determined. This has already been done for the
class of red pixels with the initial color segmentation. An additional color segmentation
(similar to the initial one) is applied to extract all white pixels of the RoI.
Afterwards, the mask m_2(x, y) separates the independent objects, facilitating the counting
of red and white pixels for the computation of the pixel relation. As ground
truth data, the red and white area sizes of the signs have been estimated from the specifications
of the German road traffic regulations (see Equation (54) for Give Way and Equation (55)
for Stop).
\left( \frac{p_{red}}{p_{white}} \right)_{GW,ideal} = \frac{A_{red}}{A_{white}} = 0.98.    (54)
\left( \frac{p_{red}}{p_{white}} \right)_{STOP,ideal} = 5.04.    (55)
Due to a number of noise sources (e.g., slant, blur, lighting conditions) the pixel relation
is subject to a certain variation. Therefore, the pixel relation is transformed by a kind of
fuzzy set (see Figure 25 for Give Way signs) providing the membership of the
pixel relation to a certain traffic sign class. The membership function was acquired from
the image test set and provides in our case the probability value P_pix for a certain sign class.
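The fuzzy transformation can be sketched as a trapezoidal membership function; the breakpoints and the plateau height of 0.9 are only read off Fig. 25 approximately and are therefore assumptions.

def membership_give_way(pixel_relation,
                        support=(0.4, 0.8, 1.2, 1.6), plateau_value=0.9):
    """Trapezoidal membership for the pixel relation p_red / p_white of the
    Give Way class (ideal value 0.98, Equation (54)). Breakpoints are assumed."""
    a, b, c, d = support
    x = float(pixel_relation)
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return plateau_value
    if x < b:                                   # rising flank
        return plateau_value * (x - a) / (b - a)
    return plateau_value * (d - x) / (d - c)    # falling flank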
Figure 25. Fuzzy membership function of the pixel relation for the Give Way sign class.
The next feature is the corner matching, which is similar for all sign classes but differs
in the number and type of corner templates that are used (see Fig. 26 for the corners used for
the different sign classes).
Figure 26. Used corners for matching: (a) Give Way, (b) Stop, (c) Warning.
[Exemplary corner-matching template kernel:
  2  1  1  1  1  1  2
 -1  4  2  2  2  4 -1
 -2 -1  4  4  4 -1 -2
 -4 -2 -1  6 -1 -2 -4
 -6 -4 -2 -1 -2 -4 -6 ]
The next feature is the eccentricity, which defines the ratio of length to width of
an object. In order to compute the eccentricity, the mask of each object is used to determine
the smallest and biggest row and column belonging to the object. Given this information, the
eccentricity of an object can easily be computed. Afterwards, the eccentricity is also
transformed by a fuzzy set to a probability P_exc, describing the presence of a certain traffic
sign corresponding to the chosen fuzzy set. The ground truth eccentricity for most of the traffic signs
is about one; nevertheless, we use independent fuzzy sets for the
different sign classes since their noise sensitivity differs.
The final weak classifier is the height in the world. Due to the stereo camera system in
our test vehicle we are able to estimate the world position of all pixels. For further details
about the camera and the matrix transformations please refer to Sect. 6. Hence, for each object
the median of the height values of all object pixels is computed and again transformed by a
fuzzy set to a probability value P_height.
Finally, the overall probability for each of the traffic sign classes has to be evaluated.
Based on that, Equations (57), (58), and (59) provide the overall probabilities for the
supported traffic sign classes Stop, Give Way, and Warning.
P_{GW} = P_{w1} \cdot (1 - P_{w4}) \cdot P_{pix}^{GW} \cdot P_{exc}^{GW} \cdot P_{height}    (57)

P_{STOP} = P_{w2} \cdot P_{w3} \cdot P_{pix}^{STOP} \cdot P_{exc}^{STOP} \cdot P_{height} \cdot P_{rs}^{STOP}    (58)

P_{WARN} = P_{w4} \cdot (1 - P_{w1}) \cdot P_{pix}^{WARN} \cdot P_{exc}^{WARN} \cdot P_{height}    (59)
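Combining the weak classifiers according to Equations (57)-(59) could be sketched as follows; the interpretation of P_w1...P_w4 as the corner matching results of the individual template sets is inferred from context.

def overall_probabilities(p):
    """Combine the weak classifiers into class probabilities following
    Equations (57)-(59). p is a dict holding the individual cues, e.g.
    p['w1']..p['w4'] (corner matching per template set), p['pix'][cls],
    p['exc'][cls], p['height'], and p['rs'] (only used for Stop signs)."""
    p_gw = p["w1"] * (1.0 - p["w4"]) * p["pix"]["GW"] * p["exc"]["GW"] * p["height"]
    p_stop = (p["w2"] * p["w3"] * p["pix"]["STOP"] * p["exc"]["STOP"]
              * p["height"] * p["rs"])
    p_warn = p["w4"] * (1.0 - p["w1"]) * p["pix"]["WARN"] * p["exc"]["WARN"] * p["height"]
    return {"GiveWay": p_gw, "Stop": p_stop, "Warning": p_warn}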
The attention system as a whole as well as the described traffic sign recognition are
evaluated in Sect. 8.1.1.
5. Scene Classification
Currently available Driver Assistance Systems (DAS) focus on restricted application areas,
like, e.g., highways. Information about the current scene context is not included. Said appli-
cations make a number of assumptions based on the designated context, which results in a
predefined set of rules and parameters. Normally, these simplifications restrict the usability
of these systems to one application area only. Different from that, the ADAS presented here
is able to tune its modules dependent on the current context and environment, thereby allowing
its robust application on highways, country roads, and inner-city roads. Therefore, knowledge
about the scene type is valuable as a tuning channel of the Attentive Co-Pilot, allowing
more dedicated and also diverse reactions in different surroundings.
In the following section, the fast and robust approach for the classification of the scene
is presented. The algorithm is independent of potentially outdated map data as well as
errors and inaccuracies of Global Positioning System data. Furthermore, it is independent
of precoded context information in the map data, because it requires only an image of the
scene. Based on the novel Hierarchical Principal Component Classification (HPCC), the
algorithm reliably classifies the scene (over 97%) in various, complex scenarios (evaluated
on 9000 images). As the evaluation in Sect. 8.1.2. will show, the presented approach is an
important step towards a holistic understanding of the driving situation, as well as the basis
for a wider operational range of the ADAS.
Figure 28. Examples for the different scene categories and illumination conditions: (a)
Highway, (b) Country road, (c) Inner city, (d) Country road.
which are generated from a large number of sample images. Therefore, the spectrum of
an image is sampled by a number of frequency-selective filters and assigned to each of
the classes (like, e.g., artificial/natural, open/closed, expanded/enclosed, see [100]) with a
value between 0 and 1 corresponding to their respective membership. Hence, each class
has a continuous scale describing the membership of the image to the class. The sampled
frequency results for each of the features allow the generation of templates, which describe
all relevant frequencies with their corresponding intensity for a certain class. However, the
method proposed by Oliva and Torralba is, on the one hand, continuous regarding the obtained
classification results, which makes a crisp decision for a certain class difficult. On
the other hand, the approach is not discriminative enough in the case of similar categories, as
present in the car domain.
Only a few publications handle scene classification applicable to driver assistance systems.
One of these is the work of [101], which uses features of the HSV color space to identify
the number of image pixels having the color of, e.g., bricks or grass. These features are
In the following, a rough overview of our approach for scene classification is given (see
Fig. 29). Thereafter, all processing steps and their theoretical background are described in
more detail.
The overall system can be divided into three main parts. The first one is the preprocessing,
where adaptations of the image take place to reduce the amount of data as well
as to reduce the influence of changes in lighting conditions (see [98]). In the last step of
the preprocessing, the image is divided into 16 equally sized square sub-parts, which are
processed independently in the following feature extraction part. In the second part, a
feature extraction is done to obtain the relevant information of the image. As already
mentioned, our system uses the frequency domain to get a compact representation of the
data. To this end, for each of the 16 sub-parts the Discrete Fourier Transform (DFT) is
used to compute a spectrum. Afterwards, each spectrum is sampled (i.e., weighted) with a
number of Gaussian filters, inspired by [102], who used oriented Gabor filters for
the sampling of the spectrum. The sampled data is normalized with the mean
and variance values computed in the training phase to provide comparability between the different
scales of the sub-parts. To further reduce the amount of data, while keeping the relevant
information, a principal component analysis (PCA) is applied. Finally, in the third part
the classification takes place, which is independent of the previous steps, since
different methods could be chosen here. Therefore, two versions of the newly developed HPCC
classifier have been evaluated.
5.2.1. Preprocessing
The input of the system is a 400x300 RGB image, which is converted to a grayscale image
in the first step. Afterwards, the influence of changes in lighting conditions is reduced.
To this end, the image is rescaled using a logarithmic curve to change the distribution of the
intensity values. This is motivated by early processes in the human visual pathway. The
intensity values are rescaled to cover the complete dynamic range (see Fig. 30b). The next
step is a high-pass filtering to reduce low spatial frequencies, attenuating large differences
in contrast as well as side effects of the DFT. For this, the image is filtered with a
filter whose frequency response is given in Equation (60) (see [103]).
Figure 29. System structure for scene classification (gray system modules are only used in
the training phase)
H(k, l) = \begin{cases} 1 & \text{for } k = 0 \,\wedge\, l = 0 \\ 1 - 0.9\, e^{-\frac{k^2 + l^2}{156.25}} & \text{otherwise} \end{cases}    (60)
The filtering leads to a uniformly distributed intensity over the complete image (see
Fig. 30c). Afterwards, the 400x300 image is resized to 256x256 pixels, which leads to a
compression of the vertical axis and an elongation of the horizontal axis in the frequency
domain. Since the resize is carried out during the training as well as the execution phase, the
classification result is not affected. The final step is a division of the image into 16 equally
sized square sub-parts of size 64x64 pixels (see Fig. 30d). In the following, each of the
sub-parts is independently transformed to the frequency domain. The division into sub-parts
is done in order to draw conclusions from characteristic amplitude values at certain image
regions, which would not be possible if the overall image were directly transformed to
the frequency domain. Hence, each of the sub-parts can be described as a complex spatial
frequency function separated into magnitude and phase (see Equation (61)).
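The preprocessing chain and the per-sub-part spectra can be sketched as follows (Python/NumPy); the reconstruction of Equation (60) above and the simple index-based resize are assumptions.

import numpy as np

def preprocess_and_spectra(gray_400x300):
    """Logarithmic intensity rescaling, high-pass filtering in the frequency
    domain (Equation (60) as reconstructed above), resize to 256x256, and DFT
    magnitude spectra of the 16 sub-parts of size 64x64."""
    img = np.log1p(gray_400x300.astype(float))
    img = (img - img.min()) / (img.max() - img.min() + 1e-12)   # full dynamic range

    # high-pass filtering: DC preserved, low frequencies attenuated
    F = np.fft.fft2(img)
    k = np.fft.fftfreq(img.shape[0])[:, None] * img.shape[0]
    l = np.fft.fftfreq(img.shape[1])[None, :] * img.shape[1]
    H = 1.0 - 0.9 * np.exp(-(k**2 + l**2) / 156.25)
    H[0, 0] = 1.0
    img = np.real(np.fft.ifft2(F * H))

    # resize to 256x256 by index mapping, then split into 16 sub-parts of 64x64
    rows = np.linspace(0, img.shape[0] - 1, 256).astype(int)
    cols = np.linspace(0, img.shape[1] - 1, 256).astype(int)
    img256 = img[np.ix_(rows, cols)]
    spectra = []
    for r in range(4):
        for c in range(4):
            part = img256[r * 64:(r + 1) * 64, c * 64:(c + 1) * 64]
            spectra.append(np.abs(np.fft.fft2(part)))   # magnitude spectrum
    return spectra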
Figure 30. Different steps of the preprocessing stage: (a) Grayscale Image, (b) Image after
adaptation of intensity, (c) Image after suppression of low frequencies, (d) Rescaled image
with sub-parts
a = \frac{\cos^2\theta}{2\sigma_x^2} + \frac{\sin^2\theta}{2\sigma_y^2}    (63)

b = -\frac{\sin 2\theta}{4\sigma_x^2} + \frac{\sin 2\theta}{4\sigma_y^2}    (64)

c = \frac{\sin^2\theta}{2\sigma_x^2} + \frac{\cos^2\theta}{2\sigma_y^2}    (65)
Finally, the width and height of a filter kernel can be adapted via σ_x and σ_y, which were chosen
such that the -3 dB border frequencies of the filters touch each other. The shifting of
the filter kernel's center is carried out by adapting x_0 and y_0, defined by the spatial
frequency f_0 as well as the angle θ (see Equations (66) and (67)).
x_0 = \frac{N}{2}\, f_0 \cos\theta    (66)

y_0 = \frac{M}{2}\, f_0 \sin\theta    (67)
For each of the sub-parts i, the result of the sampling is a vector ỹ_i (i = 1...16) of
dimension 100x1. The overall result is 16 vectors with 100 values each, i.e., a
total of 1600 attributes per image.
Before a further reduction of the data can be done, each sub-part i should be normalized
to generate a similar range of values for each vector element h of ỹ_i(h). For the normalization,
the mean and standard deviation vectors have to be estimated for each of the sub-parts
from the training data. The database for the training holds N images, which results in 16
matrices Ỹ_i, one for each of the sub-parts i, with an overall dimension of 100 × N. A row j of
matrix Ỹ_i for sub-part i contains the data ỹ_i, which represent the samples from image j.
Hence, the mean estimator (see [105] for details on this concept) for each vector element h
of the sampling vector ỹ_i can be described by Equation (68).
\bar{y}_i(h) = E(\mu_i(h)) = \frac{1}{N} \sum_{n=1}^{N} y_i^n(h)    (68)
The standard deviation (see [105]) is derived by Equation (69), where n denotes the image.
s_i(h) = \sqrt{E(\sigma_i^2(h))} = \sqrt{\frac{1}{N-1} \sum_{n=1}^{N} \left( y_i^n(h) - \bar{y}_i(h) \right)^2}    (69)
The normalization of a sub-part i and element h of the feature vector ỹi (h) is carried out
based on Equation (70).
$$y_i(h) = \frac{\tilde{y}_i(h) - \bar{y}_i(h)}{s_i(h)} \qquad (70)$$
The normalization of the overall training data Ỹi (all elements h of sub-part i) is defined by
Equation (71).
$$Y_i = \frac{\tilde{Y}_i - I_{N\times 1}\cdot \bar{y}_i^T}{I_{N\times 1}\cdot s_i^T} \qquad (71)$$
Here, I_{N×P} denotes the all-ones matrix of dimension N × P, ȳi the mean vector, and si the standard deviation vector of sub-part i (the division in Equ. (71) is element-wise).
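A minimal NumPy sketch of this per-sub-part normalization (Equ. (68)–(71)) is given below, assuming rows of the training matrix correspond to images.

```python
import numpy as np

def fit_normalization(Y_tilde):
    """Mean and standard deviation estimators for one sub-part (Equ. (68), (69)).
    Y_tilde: N x 100 matrix, one row of sampled values per training image."""
    y_bar = Y_tilde.mean(axis=0)             # Equ. (68)
    s = Y_tilde.std(axis=0, ddof=1)          # Equ. (69), unbiased estimator (N - 1)
    return y_bar, s

def normalize(Y_tilde, y_bar, s):
    """Element-wise normalization of a sub-part (Equ. (70), (71))."""
    return (Y_tilde - y_bar) / s
```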
As already mentioned, a further reduction of the extracted features is carried out based
on the results of a PCA (during the training phase). In the following, the PCA (see [106])
will be computed for each matrix Yi (containing data of N training images of sub-part i),
which can be seen as a coordinate transformation, where the first new coordinate axis aligns
along the maximum variance of the data, the second axis along the second largest variance
and so on. The result of the PCA is the transformation matrix Ti, where each column contains the coefficients of a single principal component and the columns are ordered by descending importance. To reduce the dimensionality of each yi, only the first v principal components are used for the projection. Therefore, a shortened transformation matrix T̂i, containing only the first v principal components, is constructed.
The final step of the feature extraction is the projection of the matrices Yi to their new
coordinate system by Equation (72).
Similarly, the transformation is carried out for a single feature vector yi by Equation (73),
which is the input for the following classification step.
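Since Equations (72) and (73) are not reproduced here, the exact projection convention of T̂i is an assumption; the following sketch computes the principal components via an SVD of the normalized training matrix and projects a feature vector onto the first v components.

```python
import numpy as np

def fit_pca(Y, v):
    """PCA on the normalized training matrix Y (N x d) of one sub-part.
    Returns a shortened transformation matrix with the first v components."""
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    T_hat = Vt[:v].T          # d x v: first v principal components as columns
    return T_hat

def project(y, T_hat):
    """Project a (normalized) feature vector onto the first v principal components."""
    return y @ T_hat
```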
Here, ‖x − y‖2 denotes the Euclidean distance between the vectors x and y (see [106]). At the same time, the Euclidean distance between the average values of groups A and B should be maximal (see Equation (75)).
Figure 31. (a) Two-dimensional Gauss function, (b) Spectrum and filter array with -3dB level lines, (c) Image after transformation to frequency domain.
$$\max_{A,B\subseteq C}\; \left\| \frac{1}{|A|}\sum_{i\in A}\hat{y}_i \;-\; \frac{1}{|B|}\sum_{j\in B}\hat{y}_j \right\|_2 \qquad (75)$$
The separation into groups A and B is the initial step at the first node of the decision tree. If, for example, group A contains more than one category, the procedure above is repeated. The normalization is now done with the mean and standard deviation of the categories that are part of group A and combined in the matrix YA. Again, the principal components are computed and the matrices are projected to the principal component subspace of group A with ŶcA = T̂A · Yc. Also, the average values ŷcA of the categories are calculated and the previous criteria (Equation (74), Equation (75)) are optimized to sub-divide group A into AA and AB. This method is applied to group B as well as to all sub-groups containing more than one category. The total number of categories is initially defined by the training data. The advantage of the method is the specific calculation of the principal components
at each node of the decision tree. Only at the first node is the overall training data used to form two groups that are maximally heterogeneous to each other but at the same time maximally homogeneous for the members of each group (concerning the average values ŷc of the categories). The classification at each node can either be carried out with LDA and WDST (see [100]) or with a neural net (see [107]).
Hence, at each node a specific classification task is trained. The overall system including the classification step is depicted in Figure 29. In a nutshell, the training and subsequent classification are carried out in the following way: As a first step, a preprocessing of the complete training data is performed; afterwards, each image with its sub-parts is transformed independently and its spectra are sampled. In the second step, the decision tree is learned. Thereby, the matrices for the transformation to principal component space as well as the mean and standard deviation vectors of each node are stored. The last step is the classification of an image: its features are extracted and stored in the sampling vector ỹi. At each node, the sampling vector is normalized with the corresponding mean and standard deviation vectors and afterwards projected to the principal component space. The result is classified by one of the named methods and assigned to one of the sub-groups. This procedure is repeated until a leaf of the tree is reached, which results in the classification of the image to the category of the leaf. In another variant of the HPCC, a decision tree is generated for each of the sub-parts of an image. Thereby, 16 decision trees are built, one per sub-part, and finally a majority vote is carried out over the 16 results. The approach with a single classification tree for the overall image will be called HPCC 1, and the approach with 16 single classifications will be called HPCC 2.
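As an illustration of this procedure, the following sketch traverses such a learned tree; the node interface (mean, std, T_hat, classifier, left/right children, category) is a hypothetical data structure introduced only for this example, not the authors' implementation.

```python
def classify(node, y_tilde):
    """Traverse the learned decision tree until a leaf (category) is reached.
    Each node is assumed to store the statistics and classifier trained for it."""
    while node.category is None:                      # inner node, not yet a leaf
        y = (y_tilde - node.mean) / node.std          # node-specific normalization
        y_hat = y @ node.T_hat                        # projection to the node's PCA space
        node = node.left if node.classifier.predict(y_hat) == 0 else node.right
    return node.category

def classify_hpcc2(trees, subpart_vectors):
    """HPCC 2 variant: one tree per sub-part, majority vote over the 16 results."""
    votes = [classify(t, y) for t, y in zip(trees, subpart_vectors)]
    return max(set(votes), key=votes.count)
```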
For a comprehensive evaluation of the scene classification system refer to Sect. 8.1.2.
6. Depth Features
Accurate depth information is of vital importance for a driver assistance system. Typical
commercial applications for assisting the driver use Radar or Lidar data. Such sensors de-
liver accurate but sparse depth information of the scene. So far, only a few commercial driver assistance systems use vision, despite the fact that the information density is comparatively high. During the projection of the 3D world to the 2D image chip, one dimension - the depth information - is lost. Recovering the depth cannot be done with 100% certainty, i.e., 2D images are ambiguous in terms of depth. To solve this challenge, several depth cues are fused. After a biological motivation, the implemented depth cues are described.
• Depth from ground plane assumption (assuming a flat world, the vertical image posi-
tion is proportional to the object depth)
• Depth from blur (optimizing the edge sharpness by changing the focal length of the
camera)
• Depth from Time to Contact (infer the time that remains until collision from the
growth of perceived object size)
• Depth from relative size (several objects of the same type in different distances)
• Depth from aerial/atmospheric perspective (blue bias on objects that are far away)
The following subsections describe the four monocular and binocular (stereoscopic) depth cues the presented ADAS relies on in order to perform its various vision tasks.
Figure 32. Processing steps for computing dense 3D world coordinates from stereo.
$$R = R_X R_Y R_Z = \begin{pmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{pmatrix} \qquad (78)$$
Figure 33. (a) Visualization of internal camera parameters, (b) Coordinate system and external camera parameters.
For determining these 12 variables the calibration scene shown in Fig. 34 is used, for
which the 3D world position of the marked points was measured manually with a laser
device and stored.
Figure 34. Calibration scene with measured 3D world calibration points: (a) Left image (calibration points marked), (b) Right image.
can be determined.
$$\begin{aligned}
r_{11}^2 + r_{12}^2 + r_{13}^2 - 1 &= 0\\
r_{21}^2 + r_{22}^2 + r_{23}^2 - 1 &= 0\\
r_{31}^2 + r_{32}^2 + r_{33}^2 - 1 &= 0\\
r_{11}r_{21} + r_{12}r_{22} + r_{13}r_{23} &= 0\\
r_{11}r_{31} + r_{12}r_{32} + r_{13}r_{33} &= 0\\
r_{21}r_{31} + r_{22}r_{32} + r_{23}r_{33} &= 0
\end{aligned} \qquad (79)$$
After repeating the described procedure for the second camera, the image rectification
can be done. After the rectification, an efficient search for correspondences between the
left and right image can be realized. More specifically, for each pixel and its neighborhood
in one of the camera images the best match in the other camera image is determined using
a correspondence search with a probabilistic matching algorithm (refer to [108]). Since
both images are undistorted and rectified the correspondence search between the images
can be restricted to horizontal shifts, which makes the procedure very efficient. The result
of the correspondence search is a dense disparity map D(u, v), which contains a measured
horizontal shift for all image positions.
Based on the disparity image the 3D world position for all image pixels can be computed
using:
$$Z_{stereo}(u,v) = \frac{f_u B}{D(u,v)} + t_3 \qquad (80)$$

$$Y_{stereo}(u,v) = \frac{Z\,(v - v_0)}{f_v} + t_2 \qquad (81)$$

$$X_{stereo}(u,v) = \frac{Z\,(u - u_0)}{f_u} + t_1 \qquad (82)$$
With: B ... baseline (distance between the left and right cameras)
fu, fv ... normalized focal lengths [in pixels]
D(u, v) ... disparity
u0, v0 ... principal point
t1, t2, t3 ... translational camera offset.
The equations are derived by transforming Equ. (3) and (4), setting all camera angles to
zero, since the disparity computation was done on rectified images.
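A minimal NumPy sketch of Equ. (80)–(82), computing dense 3D world coordinates from a rectified disparity map, could look as follows; it is an illustration only, not the implementation used in the system.

```python
import numpy as np

def disparity_to_3d(D, fu, fv, u0, v0, B, t1, t2, t3):
    """Dense 3D world coordinates from a rectified disparity map (Equ. (80)-(82))."""
    h, w = D.shape
    u = np.arange(w)[None, :].repeat(h, axis=0)     # horizontal pixel coordinates
    v = np.arange(h)[:, None].repeat(w, axis=1)     # vertical pixel coordinates
    valid = D > 0                                   # avoid division by zero disparity
    Z = np.full(D.shape, np.nan, dtype=float)
    Z[valid] = fu * B / D[valid] + t3               # Equ. (80)
    Y = Z * (v - v0) / fv + t2                      # Equ. (81)
    X = Z * (u - u0) / fu + t1                      # Equ. (82)
    return X, Y, Z
```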
In the last step, the stereo maps are unrectified (i.e., the prior rectification is neutral-
ized) to make them comparable to the input image on which all other processing steps are
running. This is realized by remapping the pixel values of the rectified stereo maps based
on Equ. (3) and (4), which results in unrectified stereo maps.
Figure 35 depicts a typical example for the resulting unrectified stereo maps in an inner-
city scenario.
Figure 35. Dense 3D world positions for all image pixels based on stereo from a probabilistic matching approach [108].
Conceptual Extensions
Analyzing Equ. (80) to (82) and Fig. 35, it can be seen that the stereo maps are dense (i.e., a 3D world position is computed for all image pixels). However, at image positions near the car the computed values are not sufficiently accurate. These pixels can be identified by thresholding the stereo confidence map that is calculated during the disparity computation. Furthermore, the thereby identified pixels can be corrected using an inter-modality depth cue fusion with the depth cues described in the following (see [26] for details on that approach).
$$Z_{obj,W} \approx \frac{W_{world}\, f_u}{W_{im}} \quad \text{and} \quad Z_{obj,H} \approx \frac{H_{world}\, f_v}{H_{im}} \qquad (83)$$
Equation (83) can be derived by applying and transforming the 3D to 2D pinhole camera
Equations (3) and (4) (see [49] for the detailed derivation).
The bird’s eye view is calculated on the undistorted pixels v and u based on Equ. (3) and (4) by inverse perspective mapping of the 3D world points X, Y, and Z (see Fig. 33b for a visualization of the used coordinate system) to the 2D (u,v) image plane. The equations describe how to map a 3D position of the world to the 2D image plane (refer to [112]). More specifically, only the image pixels (u,v) that are required to obtain a dense metric bird’s eye view (i.e., the XZ-plane) are mapped. The approach also leads to low computational demands. The usage of inverse perspective mapping makes the inversion of Equ. (3) and (4) obsolete when computing the bird’s eye view.
As can be seen in Equ. (3) and (4), the 3D world position coordinates X, Y, and Z of all image pixels (u,v) are required. When using a monocular system, one dimension (the depth Z) is lost. A solution to this dilemma is the so-called flat plane assumption. Here, for all pixels in the image, the height Y is set to 0. Based on this, only objects in the image with Y = 0 (especially the street the system aims to detect) are mapped correctly to the bird’s eye view, while all other regions are stretched to infinity in the bird’s eye view (for example the car in Fig. 37d).
Now, a vertical grow algorithm with dynamic thresholds searches for discontinuities in
the bird’s eye view and assigns a distance value to them (see Fig. 37d).
In the rectified image (i.e., the image is virtually remapped to be equivalent to an im-
age with all 3 camera angles zero, see Sect. 6.2. for details on the image rectification) the
following direct relation between the vertical pixel value v and the depth Zbirds exists:
$$Z_{birds} = \frac{f_v\, t_2}{v - v_0} \qquad (84)$$
With:
t2 ... camera height above the ground
v0 ... vertical principal point
v ... vertical pixel position with significant contrast change
fv ... normalized focal length.
In case the flat plane assumption is not fulfilled (i.e., the street surface is not flat) the
bird’s eye view is inaccurate, which decreases the quality of all algorithms that run on the
bird’s eye view (e.g., depth estimation or temporal road integration, see [20]). An approach
that allows a stable bird’s eye view even in case of non-flat street surfaces and pitching of
the vehicle is described in [49].
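For illustration, Equ. (84) can be written as a one-line function; the numbers in the usage comment are purely illustrative values, not parameters of the described system.

```python
def depth_from_row(v, v0, fv, t2):
    """Depth under the flat plane assumption for a contrast change at image row v
    in the rectified image (Equ. (84)); fv in pixels, t2 in meters."""
    return fv * t2 / (v - v0)

# Example (illustrative values): fv = 800 px, t2 = 1.4 m, v0 = 240.
# A discontinuity detected at row v = 300 yields roughly 18.7 m.
```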
Using Radar sensors, the object distance and velocity can hence be measured with inde-
pendent approaches. Different from visual sensors, Radar is very robust against changing
weather conditions, which makes it an important cue that increases the system robustness.
Figure 37. Used depth cues: Depth from (a) Stereo disparity, (b) Radar, (c) Object knowledge, (d) Bird’s eye view.
As “drivable road area” we define the space in front of the vehicle on which the car can move safely in a physical sense, but without taking symbolic information into account (e.g., one-way streets, traffic signs).
First vision-based approaches for detecting the drivable road area on unmarked streets
were introduced in recent years. Although most of these visual-feature-based approaches
show sound results in scenarios of limited complexity, they seem to lack the necessary
system-inherent flexibility to run in complex environments under changing lighting condi-
tions. To cope with such environments, in Sect. 7.1. we introduce an architecture for robust
unmarked road detection. The system relies on four novel approaches that permit the au-
tonomous adaptation of important system parameters to the environment. As the presented
results show, the approach allows for robust road detection on unmarked inner-city streets
without manual tuning of internal parameters. This is different from most approaches in
literature that rely on strong rigid road models and offline set parameters. In order to fur-
ther stabilize the gathered results, in Sect. 7.2. a novel, generic approach for improving
unmarked road detection systems by temporal integration is proposed. Since the dedicated analysis of the remaining road detection errors after the temporal integration has shown that in challenging situations the detected road segment can surpass the curbstone, we extended the system and added the approach presented here as a final postprocessing step (see Sect. 7.3.). More specifically, all road segments surpassing the detected curbstones are corrected (clipped).
Initial approaches for lane detection on marked roads date back to the 1990s (see [112] for
an overview of the early approaches). The systems commercially available to date are restricted to marked roads with a predictable course, based on a clothoid lane model that is also used for the construction of motorways. In recent years, the field of research for road detection has shifted to unmarked country roads and inner-city streets. To this end, current prototype systems evaluate and fuse different visual features. In the following, the structure of such visual-feature-based systems is analyzed. It is shown that, despite the large number of existing road detection systems, some important techniques for increasing the road detection robustness have not been considered so far.
Image Training Regions: Current approaches for road detection often use street train-
ing regions in front of the car in order to parameterize the probability distributions that
describe the road feature characteristics (e.g., [116, 117], see also Fig. 41). Only very few
approaches partially incorporate information of non-road image regions to improve road
detection (e.g., [54, 118]). However, to our knowledge no approach uses the full potential
of non-road information, e.g., for the autonomous adaptation of internal system parameters
and the dynamic online assessment of the cue quality, as it is done in our system.
Features: Typical visual features for road detection in state-of-the-art systems are:
texture (edge density) on the intensity [119, 54, 120], stereo disparity [121, 117], HSI color
[54, 116, 117, 122], or depth from Lidar / Radar [123, 124]. Many system approaches use
the feature edge density (structure) on the intensity map. However, edge density on further
feature maps is so far not considered. To our knowledge no approach uses the edge density
on color maps for road detection. During the evaluation of our system, we experienced the
edge density computed on color maps as a robust cue for detecting the road.
Feature Granularity: Numerous system approaches rely on probabilistic methods for
classifying street and non-street pixels (e.g., [54, 55, 125, 118, 126]). Such iconic (i.e.,
pixel-based) approaches do not include information of the neighborhood of a pixel, but
handle all pixels independently. Nevertheless, discontinuities in the feature maps often
contain important information that allow improved scene decomposition (e.g., curbstones
that separate the road from the sidewalk). Other approaches stress the importance of region-
based information and use region growing or vertical filling (e.g., [116, 127, 128]). Such
approaches are often sensitive to changing lighting conditions causing large gradients in
the feature maps (e.g., shadows on the road). Both the iconic and the region-based system approaches have important advantages that partially compensate for their drawbacks. However, to our knowledge no system approach for road detection uses both approaches to the same extent.
Road Modeling: Many of the recent feature-based systems use road models of varying
complexity that support the feature-based road detection (e.g., [54, 55, 56] use clothoids,
[121] distinguishes between left, right, and straight street course, [129] uses second order
polynomials). For country roads and highways such approaches seem to yield sound results.
Nevertheless, as further discussed in Sect. 7.1.2., we claim that said rigid street models are
not flexible enough to robustly run on inner-city streets that often show abrupt changes in
their course as well as occlusions of significant parts of the drivable road area. However,
some kind of road model seems to be necessary in order to improve robustness of the road
detection. This dilemma can be resolved by relying on a generic and flexible road model
that makes only simple assumptions about the course of the road. One of the few system
approaches that follows this idea is presented in [122]. The authors point out that the road area typically covers between 30 and 85% of the image. The feature thresholds are adapted in
since the ratio is set offline without constantly adapting it to match the current characteristics
of the scene.
To sum up, existing state-of-the-art road detection systems are marked by a limited
flexibility, which restricts their application to country roads and highways. In order to
allow reliable road detection in more complex inner-city scenarios, we propose four novel
techniques to enhance robustness and system-inherent flexibility by enabling adaptation to
the environment. To our knowledge, a combination of these techniques has not been used for road detection before.
Figure 38. System overview: Adaptive road detection system (red modules contain novel techniques).
Furthermore, vision-based stereo data is used as a feature. For computing stereo vision,
the camera images are rectified in order to facilitate the correspondence search between
the two camera images (i.e., the images are remapped, virtually aligning the two camera
coordinate systems with the world coordinate system). The thereby necessary intrinsic
(i.e., internal camera properties, like the focal length and the principal point) and extrinsic
(i.e., external camera properties, like camera angles and offsets) camera parameters were
determined using the freely available calibration toolbox [110]. The toolbox was applied
on a calibration scene similar to the one described in [130] (see also Sect. 6.2. for details on
the computation of stereo disparity). There is no dynamic change of the camera pitch angle,
since on the one hand the input images are pitch-corrected using a correlation-based method
similar to [131]. On the other hand, we assume a flat road, which is present in most inner-
city environments. When using the system in an urban environment, the course of the road
and hence the camera angles could be estimated using a surface model (e.g., a hyperplane,
please refer to [20] for details). The image rectification assures that the camera angles
(including the static pitch angle) will not influence the stereo results. The correspondence
search yields a disparity map. Based on the disparity map three dense maps containing the
3D-world positions for all image pixels can be obtained (see Fig. 35). The stereo data is
remapped using the measured camera angles in order to have the stereo maps and the image
comparable in terms of the pixel position of objects.
The stereo maps are postprocessed for solving the problem of missing disparity values
near to the car (see Fig. 39b). More specifically, during the computation of the stereo
disparity no correspondence search is possible at image regions near to the car, since this
would come at the cost of high computation time. We solve this problem by searching
line-wise for high horizontal gradients in the bird’s eye view of the camera image (for
information on this representation see [112]) taking only the area directly in front of the car
(e.g., first 10 meters) into account (see example in Fig. 39a). It is assured that no objects
are present in the said area based on Radar and low vertical gradients in the bird’s eye view.
The area between the found gradients, which mark the road borders, is assumed to be road.
The image regions in bird’s eye view representation are mapped to the perspective image
with a pin hole camera model, which includes the determined intrinsic and extrinsic camera
parameters (e.g., static camera angles). Based on the perspectively mapped road regions the
three stereo maps are corrected assuming a perfectly flat plane (see the resulting corrected
depth map in Fig. 39c). Since only the region directly in front of the car is corrected, the
error induced by a non-flat road plane can be considered as small. However, to eliminate
this error the estimated camera angles coming from the optional surface model could also
be included into the pin hole camera model.
Tests have shown that large shadows on the road result in poor stereo quality, since the correspondence search becomes difficult on dark, noisy image regions. This supports using additional cues that are to some extent invariant to shadows, as done in the presented system (e.g., the HSI color space). Altogether, our system relies on six different cues for road detection (see Tab. 3 for an overview).
In the second step, binary road maps (BRM) and road probability maps (RPM) for
the six feature maps are computed. The BRMs are binary maps that hold “1” for pixels
belonging to the detected street and zero for the rest. The six BRMs are calculated with
a region-growing algorithm, by which region-related feature properties are incorporated.
As opposed to that, the six RPMs contain continuous probability values that assess the “road-
likeness” of the feature values for all pixels independently. Both map types rely on the same
Figure 39. (a) Gradient-based road search on the bird’s eye view on the image depicted in Fig. 41, (b) Missing disparity values near to the camera vehicle induce false and missing depth values, (c) Corrected depth map.
Figure 40. Structuring element for region growing for (a) left image half, (b) right image half.
normal distribution, see Equ. (99) and (100). The parameters of the normal distribution are
calculated using a street and at least 2 non-street training regions (see Fig. 41). Please note
that the training region needs to be set beyond the regions of corrected height values (see
Fig. 39c). The training regions are adapted dynamically depending on the scene. For exam-
ple, it is assured that no obstacle is within the training region by incorporating Radar data.
Furthermore, the size of the street training region is set proportionally to the velocity of
the ego vehicle, to exploit the fact that typically no near obstacles exist during fast driving,
e.g., on highways. The street and non-street training regions are chosen by considering the
height map of the scene derived from the stereo disparity map (see Fig. 35) and existing
knowledge about objects in the scene. In the following, the computation of the BRMs and RPMs is described in more detail. For computing the BRMs, a region-growing algorithm that connects continuous regions in the feature maps is applied (i.e., the neighborhood of a pixel is evaluated). This is done in order to get crisp borders between the road and the sidewalks, which often have road-like features. The region growing uses two different structuring elements for the left and right half of the image (see Fig. 40), which is
motivated from the typical course of roads in a perspective image. The region-growing al-
gorithm recursively sets all pixels that are adjacent to the currently known street segment in
BRMi to “1”, when the corresponding pixels in the feature map i are within the confidence
interval (see Equ. (89) and Equ. (90)).
The region-growing algorithm starts from the road-training region. The normal-distribution-based confidence interval in Equ. (89) uses the feature thresholds x̄i ± βi σi, which are independently calculated for all five visual features. Here, the parameter x̄i is the mean and σi the standard deviation of the normal distribution calculated on the street training region. The parameter βi is introduced in order to adapt the confidence interval to the current scene properties. Different from x̄i and σi, which are calculated on the street training region alone, the threshold parameter βi changes dynamically depending on the characteristics of the street and non-street training regions. More specifically, the parameter βi, which influences the feature thresholds, is calculated from di (see Equ. (92)). The parameter di is the distance between the two histograms Hsi and Hni of the street and non-street training regions for the i = 1..6 features. The measure di is based on the Bhattacharyya coefficient γi(Hsi, Hni) (see Equ. (93)), which assesses the similarity of two histograms. Based on βi the confidence interval is adapted (see Equ. (89)): the larger the difference between the street and the non-street areas on a feature map, the bigger the confidence interval becomes.
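Since Equations (92) and (93) are not reproduced here, the exact mapping from the histogram distance di to βi is an assumption; the sketch below only shows the standard Bhattacharyya coefficient on normalized histograms and a placeholder mapping from dissimilarity to the threshold parameter.

```python
import numpy as np

def bhattacharyya(Hs, Hn):
    """Bhattacharyya coefficient of two histograms (similarity in [0, 1])."""
    Hs = Hs / Hs.sum()
    Hn = Hn / Hn.sum()
    return np.sum(np.sqrt(Hs * Hn))

def confidence_scale(Hs, Hn, beta_min=1.0, beta_max=3.0):
    """Placeholder mapping to beta_i: the more separable street and non-street
    histograms are, the wider the confidence interval becomes.
    The actual mapping of Equ. (92) and the range of beta_i may differ."""
    d = np.sqrt(max(0.0, 1.0 - bhattacharyya(Hs, Hn)))   # a common distance variant
    return beta_min + (beta_max - beta_min) * d
```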
Different from the five visual cues (hue, saturation, and the three edge density maps),
the normal distribution of the stereo height Y also depends on the measured distance to the
car. This is empirically plausible, since Y is a function of the stereo disparity D(u, v), and the relative influence of the quantization error of D(u, v) (measured in pixels) grows as D(u, v) becomes smaller, i.e., as the distance of a road segment to the car increases. Hence, the part σq
of the standard deviation of the stereo height cue that is induced by the quantization error
of D(u, v) increases with growing distance to the car. In order to mathematically assess the
error propagation of the quantization error of disparity D(u, v) to the stereo height Y their
functional relation is required. The stereo height x6 = Y can be computed using Equ. (94)
(with B as the horizontal distance between the stereo cameras, h the camera height, v the
vertical pixel position and v0 the vertical principal point of the camera).
$$x_6 = Y = \frac{B\,(v - v_0)}{D(u,v)} - h \qquad (94)$$

$$D_{surf}(v) = \frac{B\,(v - v_0)}{h} \qquad (95)$$

$$\sigma_D = \frac{\Delta g}{\sqrt{12}} = \frac{1}{\sqrt{12}} \qquad (96)$$

$$\sigma_q(v) \approx \sigma_D \left.\frac{dY}{dD}\right|_{D(u,v)=D_{surf}(v)} \approx -\,\sigma_D\,\frac{B\,(v - v_0)}{[D_{surf}(v)]^2} \qquad (97)$$

$$\sigma_q(v) \approx \frac{1}{\sqrt{12}}\,\frac{h^2}{B\,(v - v_0)} \qquad (98)$$
Equation (96) defines the standard deviation σD of the disparity (measured in pixels), which
is induced by the quantization error (the step size ∆g is set to 1 pixel). For computing the
propagated standard deviation σq (required in Equ. (91)), we use Equ. (97) (refer to [64]),
which describes how the standard deviation of a random variable (here the disparity D(u, v))
is propagated through a function (here Y (D)). We are interested in the disparity on road
surface Dsurf alone (see Equ. (95), obtained after rearranging Equ. (94) with Y = 0). Hence, Dsurf defines the position at which Equ. (97) is linearized. Here, the vertical pixel position v is a parameter of the distribution. For the quantization-error-induced standard deviation of the height cue Y, we finally find Equ. (98). The hyperbolic form of Equ. (98) confirms the empirical assumptions made above. Based on that, the confidence interval εY for the stereo height Y (see Equ. (90)) includes the standard deviation σq(v), which is adapted depending on the current vertical image position v of the pixel in focus (see Equ. (91)). Besides adding σq(v) in Equ. (91), the standard deviation σ6, computed on the training region of the Y map, needs to be corrected by σq(vtrain) present at the vertical image position vtrain of the training region. As a result, we now have six BRMs for six features.
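The row-dependent standard deviation of Equ. (98) and the correction by σq(vtrain) translate directly into a few lines; this is a minimal sketch of the formulas above, not the system implementation.

```python
import numpy as np

def sigma_q(v, v0, B, h):
    """Quantization-error-induced standard deviation of the stereo height cue
    at image row v (Equ. (98)); B is the stereo baseline, h the camera height."""
    return (1.0 / np.sqrt(12.0)) * h**2 / (B * (v - v0))

def height_confidence_sigma(sigma6, v, v_train, v0, B, h):
    """Position-adapted standard deviation of the height cue: the training-region
    part sigma_q(v_train) is replaced by sigma_q(v) at the pixel in focus
    (as used in Equ. (91) and (100))."""
    return sigma6 - sigma_q(v_train, v0, B, h) + sigma_q(v, v0, B, h)
```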
In addition to the region-based processing for calculating the BRMs, a pixel-based (iconic) processing for computing the RPMs is performed (i.e., each pixel is handled independently of its surround). All pixel values xi receive a probability value p(xi), which results
in six independent Road Probability Maps (RPMs) for the six features:
$$p(x_i) = e^{-\frac{(x_i - \bar{x}_i)^2}{2\sigma_i^2}} \qquad \forall\, i = 1..5 \qquad (99)$$

$$p(x_6) = e^{-\frac{x_6^2}{2\left[\sigma_6 - \sigma_q(v_{train}) + \sigma_q(v)\right]^2}} \qquad (100)$$
The probability distribution for the stereo-based height cue Y (see Equ. (100)) assumes
the mean height zero x6 = 0 and adapts σq (v) during the computation of RPM6 and BRM6
dependent on the vertical pixel position v. The approach assumes a normal distribution of
the six features in the street training region and beyond. As described for x6 a position-
dependent variance was introduced. The assumed normal distribution was verified with
statistical tests of goodness of fit for all features independently.
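A minimal NumPy sketch of Equ. (99) and (100), applied to whole feature maps, could look as follows; it only restates the formulas above and makes no assumption about how the maps are obtained.

```python
import numpy as np

def rpm_visual(x, x_bar, sigma):
    """Road probability for one of the five visual cues (Equ. (99));
    x may be a full feature map (NumPy array)."""
    return np.exp(-((x - x_bar) ** 2) / (2.0 * sigma ** 2))

def rpm_height(Y, sigma6, sigma_q_train, sigma_q_v):
    """Road probability for the stereo height cue with zero mean and the
    row-dependent standard deviation (Equ. (100))."""
    s = sigma6 - sigma_q_train + sigma_q_v
    return np.exp(-(Y ** 2) / (2.0 * s ** 2))
```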
In the third step, the computed BRMs and RPMs are fused with the detected lane mark-
ings. More specifically, the RPMs for all features are set to a high probability for the de-
tected lane markings. The lane-marking detection is done with the biologically motivated
Difference of Gaussian (DoG) kernel (see Fig. 42a), which takes the receptive fields of neu-
rons in the retina as a role model. The DoG filter kernel is adapted to be selective to bright structures on a dark background, the so-called on-off contrasts, without reacting to dark structures on a brighter background (off-on contrasts). Figure 42c shows the filter response on the inner-city frame shown in Fig. 42b. All image regions that show on-off contrasts, have a height within the confidence interval of Equ. (90), and lie below the horizon are detected as lane markings (see Fig. 42d). The separation between on-off and off-on contrasts reduces the number of false positive road marking detections. For example, in [57] the prefiltered road image still contains off-on contrasts, which are unspecific for lane markings (e.g., traffic signs in front of a bright sky). Such off-on contrasts are filtered out in our approach to improve the road detection performance.
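An on-off selective DoG response can be sketched as follows; the kernel sizes and the simple half-wave rectification are illustrative assumptions, since the exact kernel adaptation used in the system is not specified here.

```python
import cv2
import numpy as np

def on_off_dog(gray, sigma_center=1.0, sigma_surround=3.0):
    """Difference-of-Gaussians response rectified to on-off contrasts only
    (bright structures on a dark background); sigmas are illustrative values."""
    g = gray.astype(np.float32)
    center = cv2.GaussianBlur(g, (0, 0), sigma_center)
    surround = cv2.GaussianBlur(g, (0, 0), sigma_surround)
    dog = center - surround
    return np.maximum(dog, 0.0)      # half-wave rectification: keep on-off responses only
```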
Figure 42. (a) On-off Difference of Gaussian (DoG) filtering on two test images with on-off and off-on contrast (left) as well as the respective filter responses (right), (b) Inner-city test frame, (c) On-off DoG filter response for bright contrasts on a dark background (with lane markings popping out), (d) Detected lane markings (after fusion of DoG and object height from stereo).
The six iconic RPMs and their respective BRMs are combined by multiplication, which
leads to six extended RPMs (eRPM):
In the fourth and final step, the Final Road Map is determined by applying a threshold εfinal to the fRPM:

$$\text{Final Road Map}(u,v) = \begin{cases} 1 & \forall\, fRPM(u,v) > \varepsilon_{final} \\ 0 & \text{else} \end{cases} \qquad (103)$$
The threshold εfinal is set dynamically based on the correlation results of the three currently most reliable feature maps, in order to get a prediction of the current relative size of the road versus the rest of the image. As these three features, the currently best HSI color feature (hue or saturation), the best structure feature (structure on hue, saturation, or intensity), as well as stereo are selected. For the selection process the Bhattacharyya coefficient γi(Hsi, Hni) is evaluated (see Equ. (93)), by which the separability of the street versus non-street histograms Hsi, Hni can be assessed.
Hence, the computation of the Final Road Map relies on a simple road model (expected
fraction of the road area in the current image, termed road-to-image-ratio). No assumptions
are made regarding the current position of the road in the image. As extensive evaluation
has shown, it is of crucial importance to adapt the said expected fraction dynamically to
the current scene. This dynamic adaptation enables the system to run robustly in complex
scenes, as in inner-city scenarios.
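For illustration, the final thresholding of Equ. (103) and the similarity measure κ used in the control loop (Equ. (105) below) can be sketched as follows; the fusion producing fRPM (Equ. (101) and (102)) and the full adaptation rule for εfinal are not reproduced here, so this is only a partial sketch.

```python
import numpy as np

def final_road_map(fRPM, eps_final):
    """Threshold the fused road probability map (Equ. (103))."""
    return (fRPM > eps_final).astype(np.uint8)

def similarity_kappa(S_A, S_B, S_C, S_R):
    """Mean correspondence of the three most reliable BRMs with the road
    reference map R (Equ. (105)); S_* are pixel counts of the binary maps."""
    return (S_R / S_A + S_R / S_B + S_R / S_C) / 3.0
```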
Figure 43. Control loop to adapt the final road detection threshold εfinal.
For adapting εfinal the control loop depicted in Fig. 43 is used. The threshold εfinal is
adapted by a gradient method based on Equ. (108). In the following, the applied procedure
is described in detail. It uses the BRMs of the three most reliable feature maps A, B, and
C, which are combined into the road reference map (i.e., the feature product R that represents the expected road area) depicted in Fig. 43. The four binary maps are summed up, which results in four scalar values S{A,B,C,R}. These values represent the integral number of pixels detected as road for the three feature maps and the road reference map.
Then, the parameter κ is calculated:

$$\kappa = \frac{1}{3}\left(\frac{S_R}{S_A} + \frac{S_R}{S_B} + \frac{S_R}{S_C}\right) \qquad (105)$$
It represents the mean percentage with which the three most reliable feature maps correspond to the road reference map R. The larger κ is, the better the features match each other, i.e., the more similar the three feature maps are. The degree of similarity of these features gives a hint about what to expect from the remaining cues and can hence be used to adapt εfinal. The Final Road Map is computed (see Equ. (103), where εfinal is set to a typical initial value for bootstrapping) and summed up, yielding the scalar value SFRM:
that strongly varies in time depending on environmental conditions (see also previous Sec-
tion 7.1.). The varying detection performance is due to the changing content of the train-
ing region in front of the car. Thereby, the system possibly adapts to local characteristics
present in the current training region that might differ from the global road characteristics.
Furthermore, local illumination changes that depend on the current view angle and light-
ing conditions influence the detection performance. See Fig. 44 for a visualization of both
effects.
Figure 44. Causes for varying road detection performance: (a) Illumination change with dependence on the view angle, (b) Sample image showing typical illumination gradient, (c) Schematic example: Training region in the sun and resulting detected street segment (in white), (d) Schematic example: Training region in the shade and resulting detected street segment (in white).
In the following section a real-time capable approach for improving the road detection
results for this type of state-of-the-art system is presented that adds a generic postprocessing
step. Our proposed architecture removes the drawbacks of these systems using a temporal
integration approach based on the bird’s eye view. In order to test the proposed approach,
the visual-feature-based road detection system described in Sect. 7.1. is used. Still, this road
detection system can be exchanged with any other state-of-the-art system. Evaluation re-
sults computed on inner-city data show that this approach is an important enhancement for
all visual-feature-based road detection systems. One of the used sequences and correspond-
ing ground truth data is accessible on the internet for benchmark testing. The proposed
approach is a crucial step toward robust road detection in complex scenarios that allows
building high-level applications, such as active collision avoidance or trajectory planning, based on vision as the major cue.
Figure 45. Exemplary inner-city frame: (a) Hand-labeled ground truth street segment, (b) Optical flow (colors code the direction of the motion), (c) Bird’s eye view.
Fig. 45c). In the following step, the bird’s eye view is used for detecting the motion of the
static vehicle environment based on Normalized Cross Correlation (NCC). Based on these
correlation results the current and past street segments are fused by temporal integration on
the bird’s eye view. The fused street segments are then mapped back to the perspective view
corresponding to the input image.
The system takes optional input data that improves the quality and makes the temporal
integration more robust. As such optional input data, stereo images as well as the longitudinal ego velocity and yaw rate from the CAN bus of our prototype vehicle are processed. The
depth map that is calculated from stereo images (using the commercial “Small Vision Sys-
tem” [111], see Sect. 6.2.) is the basis for correcting the changes in the pitch and roll angle.
An uncompensated change in the pitch and roll angles makes the bird’s eye view unstable in case the car brakes or the street profile is not flat. The CAN data is used for predicting the
motion of the car based on a single track model. The predicted motion is used for determin-
ing the anchor for the correlation on the bird’s eye view. The usage of CAN data makes the
system faster. Still, without CAN data the detection quality is not reduced.
In the following, the processing steps (as depicted in Fig. 46) are described in more
detail. First, the camera lens distortion is corrected. The undistorted vertical and horizontal pixels v and u are computed from the initial (distorted) vertical and horizontal pixels vd and ud based on Equations (76) and (77).
The undistortion is based on a lens distortion model (described in [109]) that uses ra-
dial (k1 and k2 ) and tangential distortion coefficients (k3 and k4 ). The undistortion step is
essential in order to allow a correct mapping of the image pixels to the bird’s eye view. It
is important to note that for the bird’s eye view as a metric representation, the undistortion
step makes sure that the proportions in the bird’s eye view match the world.
Then the bird’s eye view is calculated on the undistorted pixels v and u based on Equ. (3) and (4) by inverse perspective mapping of the 3D world points X, Y, and Z to the 2D
(u,v) image plane (see Fig. 47b for the notation in our coordinate system). The equations
Figure 46. System structure: Temporal road segment integration (the dashed module can be exchanged with the road detection algorithm preferred by the user, optional module highlighted in red).
describe how to map a 3D position of the world to the 2D image plane (refer to [112]).
More specifically, only the image pixels (u,v) that are needed to get a dense metric bird’s
eye view plane are mapped into the XZ-plane. The usage of inverse perspective mapping
makes the inversion of Equ. (3) and (4) for calculating the bird’s eye view obsolete.
As can be seen in Equ. (3) and (4) the 3D world position coordinates X, Y , and Z of
all image pixels (u,v) are needed for computing the bird’s eye view. However, by using a
monocular system, one dimension (the depth Z) is lost. A solution to this dilemma is the
so-called flat plane assumption. Here, for all pixels in the image, the height Y is set to 0.
Based on this, only objects in the image with Y = 0 (especially, the street we are interested
in) are mapped correctly to the bird’s eye view, while all the other regions are stretched to
infinity in the bird’s eye view (for example the cars in Fig. 45c).
In case this assumption is not fulfilled (i.e., the street surface is not flat) the bird’s eye
view is inaccurate, which leads to decreasing quality of the temporal integration. To allow
a stable bird’s eye view even in case of non-flat street surfaces and pitching of the vehicle,
stereo data from our stereo camera setup is used. In order to enhance the robustness of
the correction, only pixels that belong to the currently detected street segment are used for
surface estimation. More specifically, the differences between the coordinate axes and the
street surface in terms of the pitch ∆θX and roll angle ∆θZ , as well as the height of the
camera over the ground ∆t2 are computed:
$$Y = Y_0 + aZ + bX \qquad (109)$$

$$\Delta\theta_Z = \operatorname{atan}(b) \qquad (110)$$

$$\Delta\theta_X = \operatorname{atan}(a) \qquad (111)$$

$$\Delta t_2 = Y_0 \qquad (112)$$
This is done based on the 3D position for all image pixels derived from the stereo dis-
parity (see Fig. 35 for 3D data of a sample image). The flat plane assumption Y = 0 is then
replaced by Y = f (X, Z) leading to an extended bird’s eye view. In our implementation a
first order model for the street surface (linear hyperplane) is used as shown in Equ. (109)
(see [138] for more details). Results have shown that higher order models lead to inferior
performance. The reason for this is the limited number of 3D measurement points at the borders of the image, since only reliable pixels belonging to the detected street are used for the surface estimation. Since the estimated surface is noisy (stereo data is calculated based on error-prone correlation between the left and right image), a linear Kalman filter is applied to the parameters Y0, a, and b, which raises the performance considerably. A possible improvement would be to use a model of the vehicle kinetics (containing damper and spring characteristics and a realistic distribution of the vehicle mass) for the Kalman prediction (as proposed in [139]) instead of the linear prediction model used here.
Figure 47. (a) Visualization of the bird’s eye view, (b) Coordinate system and position of the camera (car is heading in Z-direction), (c) Single track vehicle model.
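A least-squares fit of the hyperplane of Equ. (109) and the derived corrections of Equ. (110)–(112) can be sketched as follows; the Kalman filtering of Y0, a, and b mentioned above is omitted, and the function is an illustration, not the system implementation.

```python
import numpy as np

def fit_street_surface(X, Z, Y):
    """Least-squares fit of Y = Y0 + a*Z + b*X (Equ. (109)) on the 3D points of
    the currently detected street segment; X, Z, Y are 1-D arrays of equal length.
    Returns the corrections of Equ. (110)-(112)."""
    A = np.column_stack([np.ones_like(Z), Z, X])
    (Y0, a, b), *_ = np.linalg.lstsq(A, Y, rcond=None)
    d_theta_Z = np.arctan(b)      # roll correction, Equ. (110)
    d_theta_X = np.arctan(a)      # pitch correction, Equ. (111)
    d_t2 = Y0                     # camera height correction, Equ. (112)
    return d_theta_X, d_theta_Z, d_t2
```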
By NCC-based correlation between the current and the stored previous bird’s eye view
the vehicle motion (∆X, ∆Z) since the previous time step is detected. A single track vehicle
model, as depicted in Fig. 47c, predicts the starting point xt = xt−1 + ∆x and zt = zt−1 + ∆z
of the NCC correlation patch of time step t-1 in the current bird’s eye view map. The values ∆x and ∆z are calculated based on the sample time T, the distance of the camera from the rear wheel l, as well as the yaw rate θ̇Y and the longitudinal velocity Ż from the CAN bus (see the single track model, Equ. (113) and (114)):
$$\Delta x = \frac{\dot{Z}}{\dot{\theta}_Y}\left(1 - \cos(\dot{\theta}_Y T)\right) + \sin(\dot{\theta}_Y T)\, l \qquad (113)$$

$$\Delta z = \frac{\dot{Z}}{\dot{\theta}_Y}\sin(\dot{\theta}_Y T) + \cos(\dot{\theta}_Y T)\, l - l \qquad (114)$$
The derived longitudinal and lateral motions as well as rotational change (i.e., yaw
angle) between the current and the previous bird’s eye view are stored along with the incre-
mental motion between the previous N = 40 frames (equivalent to 4 seconds of processing
by our prototype vehicles’ vision system).
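Equations (113) and (114) translate directly into a small prediction routine; the handling of the straight-driving limit (yaw rate close to zero) is an assumption added for numerical robustness, not taken from the text.

```python
import math

def predict_patch_shift(yaw_rate, v_long, T, l):
    """Single track model prediction of the correlation patch shift
    (Equ. (113), (114)); yaw_rate in rad/s, v_long in m/s, sample time T in s,
    l = distance of the camera from the rear wheel in m."""
    if abs(yaw_rate) < 1e-6:
        return 0.0, v_long * T            # straight driving limit (assumed handling)
    dx = v_long / yaw_rate * (1.0 - math.cos(yaw_rate * T)) + math.sin(yaw_rate * T) * l
    dz = v_long / yaw_rate * math.sin(yaw_rate * T) + math.cos(yaw_rate * T) * l - l
    return dx, dz
```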
The NCC correlation patch on the bird’s eye view is selected to contain enough structure
(using the entropy-based measure described in Sect. 4.4.), which improves the accuracy of
the NCC. Furthermore, it is assured that the patch belongs to the detected street and that it is
not too far away from the ego vehicle, since the resolution of the bird’s eye view decreases
with growing distance to the vehicle.
The bird’s eye view maps of the detected street segments of the previous N = 40 frames
are calculated and stored. The stored incremental motion during the past 4 seconds is inte-
grated and used to shift all stored bird’s eye view street segments correspondingly. Then the
shifted previous 40 bird’s eye view street segments are weighted (weights αt ) and summed
up by:
$$S_{integ} = \sum_{t=1}^{N} \alpha_t\, S_t \quad \text{with} \quad \sum_{t=1}^{N} \alpha_t = N \qquad (115)$$
Thereafter, the sum of the street segments Sinteg (X, Z) is related to the maximum pos-
sible number of overlaid street segments Smax (X, Z), which results in an Integrated Road
Probability Map (IRPM):
$$IRPM(X,Z) = \frac{S_{integ}(X,Z)}{S_{max}(X,Z)} \qquad (116)$$
Please note that Smax (X, Z) changes depending on the position in the bird’s eye view
map. The following final threshold operation determines the final temporal integrated street
segment Sfinal in the bird’s eye view representation:
$$S_{final} = \begin{cases} 1 & \forall\, IRPM(X,Z) \ge \beta \\ 0 & \forall\, IRPM(X,Z) < \beta \end{cases} \qquad (117)$$
The weight α1 in Equation (115) is set high to ensure that the pixels of the currently detected street segment are, with a high probability, also present in the final temporally integrated street segment. The other weights αt could be set dynamically depending on a quality measure of the bird’s-eye-view-based NCC or of the road detection system, as well as on the capturing time t. The threshold β in Equ. (117) is currently set to 0.7. This means that a pixel is classified as street if at least 70% of the overlaid past street segments have voted for street.
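The weighted temporal integration of Equ. (115)–(117) can be sketched in a few lines; the sketch assumes the street segment maps have already been shifted by the accumulated ego motion described above.

```python
import numpy as np

def integrate_street_segments(shifted_segments, weights, S_max, beta=0.7):
    """Temporal integration of the shifted bird's eye view street segments
    (Equ. (115)-(117)). shifted_segments: list of N binary maps already shifted
    by the accumulated ego motion; weights: N weights summing to N;
    S_max: map of the maximum possible number of overlaid segments per position."""
    S_integ = sum(a * S for a, S in zip(weights, shifted_segments))   # Equ. (115)
    IRPM = S_integ / S_max                                            # Equ. (116)
    return (IRPM >= beta).astype(np.uint8)                            # Equ. (117)
```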
Next, the final temporally integrated street segment Sfinal is mapped back to the image using Equ. (3) and (4). For this operation the resolution of the street segment in the bird’s eye view representation needs to be high (which is achieved by upsampling the size by a factor of 4) in order to allow a lossless perspective mapping of the street segment. The perspective
mapping step produces equidistant, periodic spaces in the street segment directly in front of
the car (see Fig. 48a). These spaces are filled using a morphological close operation with
a small morphological structuring element to prevent adding too many false positive street
pixels (see Fig. 48b). In other words, openings in the bird’s eye street segment (that, e.g.,
correspond to objects on the street) are retained in the final perspectively mapped street
segment. Such openings are explicitly checked for objects in the implemented ADAS (see
Sect. 3.).
Figure 48. Final morphological fill operation for closing spaces in the street segment that are due to perspective mapping (justified openings are preserved): (a) Raw perspectively mapped street segment, (b) After morphological closing.
The following section shows that the proposed temporal integration procedure results in
an enhanced street segmentation. The final detected street segment has fewer holes and is
dynamically more stable than that of other approaches, which allows complex path-related
applications.
guided to suspicious objects near the detected roadside, a precise localization in a digital
map would be possible). Furthermore, various system tasks in inner-city scenarios become possible (improvement of autonomous parking maneuvers, improved collision avoidance in inner-city traffic, or more precise analysis of intersections).
As the following Sect. 7.3.1. will show, only a few dedicated approaches for curbstone and elevated roadside detection exist, and they mainly suffer from different limitations. Most of them rely on a single sensor, disregarding the full potential of sensor fusion. In contrast, the approach described in Sect. 7.3.2. relies on the fusion of different sensor modalities using biologically inspired methods.
Only a few dedicated approaches for the detection of curbstones and elevated roadsides exist. Typically, these approaches rely on a single sensor for the generation of 3D data. Based on this 3D data, a height map of the environment can be generated in which suspicious edges are detected and refined.
More specifically, in [140] a laser scanner is used to derive 3D data of the environment. The derived height differences of neighboring scan points are thresholded, filtered, and a line model is fitted. The laser scanner used in this application has a relatively low frame rate, which restricts the maximum supported velocity. Furthermore, the range of robust curb detection is restricted due to the growing influence of noise with increasing distance to the ego vehicle.
Another more sophisticated approach is presented in [141]. For generating dense 3D
data, a stereo camera setup is used. The authors propose the computation of a so-called
Digital Elevation Map (DEM), which is a height map of the scene as viewed from above. On
the DEM, edge detection, filtering, and spline fitting are realized. Although the presented
DEM is very noisy, the gathered results appear to be robust. As the presented results show,
only curbs near the vehicle can be detected. This is due to the growing influence
of inaccuracies (e.g., the quantization error) with increasing distance to the ego vehicle (see
[26] for a comprehensive treatment of this issue).
Other approaches fuse different sensor outputs in order to improve the detection ro-
bustness. For example, in [142] a laser sensor is combined with a monocular camera. The
single-layer laser sensor is used to detect the position where the curbstone cuts the sensed
layer. This point is used as starting point for the image processing. More specifically, a
static edge detector is applied starting from the laser-sensed curbstone. Since in some cases the appearance of curbstones differs only slightly from the road (i.e., virtually no edge is present in the captured image), this approach suffers from restricted robustness. Another multi-sensor approach is described in [143]. Here, stereo data and vision data are fused using
probabilistic methods. The described approach realizes a late fusion of the road detection
results of all present sensors. More specifically, curbstones are detected by specific edge
filters and elevated roadsides by the stereo sensor. No early information fusion between the
detection results takes place, limiting the achievable performance gain from sensor fusion.
Newly emerging multilayer laser scanners offer novel possibilities for the detection of
curbstones and road boundaries, since such sensors combine high frame rates with a high
accuracy 3D scan (see e.g., [144] for typical high precision applications of such sensors).
However, it is important to note the relatively high costs of this sensor type.
In contrast, stereo cameras are becoming affordable and technologically mature (e.g., robust solutions for calibrating the cameras exist) and require comparatively little space in the vehicle. Still, as described above, sufficient accuracy for curb detection can only be gained in the first few meters from the ego-vehicle. To solve this dilemma, our contribution proposes a specific fusion of 3D and vision data. Vision data offers a high information density; the vision-based detection of curbstones and lane borders is possible even at large distances. Still, curbstones and lane borders can have various appearances, which makes the design of a generic appearance-based model difficult. As described in the following, in the proposed system stereo data is used to detect curbstones near the vehicle. At greater distances a complete switch to vision takes place. The detection result of the stereo cue is used to adapt a vision template of the present curbstone/elevated roadside, while relying on biologically inspired approaches for signal processing.
The proposed overall system concept for the adaptive detection of curbstones and elevated
roadsides is depicted in Fig. 49. After giving a rough overview of the major processing
steps, all system modules are described in detail.
The proposed system consists of two major parts: first, a stereo-based detection step; second, a vision-based detection step that is initialized and modulated by the results of the first. The stereo-based curbstone detection relies on dense 3D data coming from a stereo camera setup. Based on that, a specific height map (the so-called Digital Elevation Map) is computed and temporally integrated in order to reduce noise. On the thereby extended Digital Elevation Map, a specific biologically motivated edge filtering is applied, resulting in a robust curbstone detection up to a distance of 9 meters from the vehicle. Based on the stereo detection result, a vision template of the roadside is generated and adapted. Using this template, the course of the roadside is detected by means of computer vision. The second detection step stops when the statistical properties of the road and nonroad area (on both sides of the road) change abruptly. Such a change typically marks an erroneous vision-based curbstone detection far in the distance or an object that occludes the roadside.
In the following, all system modules are described in detail. The system uses pairs of
color images of 300x400 pixels captured by a stereo camera setup mounted in the car as
input information. After an image undistortion and rectification step (see [109]) as well as a
transformation into grayscale, a dense disparity D(u, v) is computed using correspondence
search with a probabilistic matching algorithm (see [108] for details). Based on the disparity
image the 3D world position for all image pixels can be computed using the previously
introduced Equations (80), (81), and (82).
Please refer to Fig. 35 for a visualization of the used dense stereo data and Fig. 47b for the coordinate system applied here.
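Since Equations (80)-(82) are not repeated here, the following sketch uses the standard pinhole stereo back-projection as a stand-in for them; the calibration symbols f, b, u0, v0 are assumptions of this example.

import numpy as np

def disparity_to_world(D, f, b, u0, v0):
    """Standard pinhole stereo back-projection (a stand-in for Equ. (80)-(82)).

    D        : dense disparity map in pixels, shape (H, W); zeros are treated as invalid.
    f        : focal length in pixels, b : stereo baseline in meters,
    (u0, v0) : principal point in pixels.
    Returns per-pixel coordinates X (lateral), Y (height), Z (depth) in meters.
    """
    H, W = D.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    d = np.where(D > 0, D, np.nan)
    Z = f * b / d
    X = (u - u0) * Z / f
    Y = (v - v0) * Z / f
    return X, Y, Z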
In the following step, the stereo maps Zstereo (u, v), Ystereo (u, v), Xstereo (u, v) are unrecti-
fied (i.e., the prior rectification before disparity computation is neutralized) to make them
comparable to the input image in terms of pixel position. Now, the so-called Digital El-
evation Map (DEM) is computed, which is a metric height map of the scene as viewed
from above (see [145] for details).

Figure 49. Overall system concept for the adaptive detection of curbstones and elevated roadsides: a stereo-based detection stage (compute 3D world position, bird's eye view (BEV), Digital Elevation Map (DEM), temporal integration of the DEM using longitudinal and lateral vehicle motion and yaw angle, odd on-off/off-on Gabor filtering on an iDEM image pyramid, detection of elevation edges) initializes a vision-based detection stage (template generation, prediction and NCC-based verification, template adaptation in t = n+1) that outputs the curbstone/road border.

However, as dedicated testing has revealed, the plain DEM is too noisy for a direct elevation edge detection (see Fig. 50e). Therefore, different from [145] and hence as a novel approach, a temporal integration procedure is applied on
the DEM. More specifically, based on a single track model that uses the vehicle velocity and yaw rate from the CAN bus, the longitudinal and lateral vehicle motion as well as the yaw angle are estimated. Furthermore, from the right camera image a bird's eye view representation (see [112] for background information) is computed for all input images. The bird's eye view (BEV) is a metric representation of the scene as viewed from above that results from remapping the gray scale input image (see Fig. 47a and 50b). Using the BEV of the current and previous image, the motion of the vehicle is computed using Normalized Cross Correlation (NCC). In order to reduce the computational costs of the NCC, the estimated motion of the vehicle from the single track model is used to define an anchor point for the correlation template in the current BEV. Based on this procedure, the vehicle motion of the previous 10 time frames is determined and stored. The DEMs of the previous 10 time frames are shifted accordingly and superimposed, resulting in the integrated DEM (iDEM). As Fig. 50d shows, the iDEM contains much less noise than the DEM in Fig. 50c (see [20] for technical details on the temporal integration procedure).
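The accumulation of motion-compensated DEMs can be sketched as follows; the 2D rigid warp about the map center, the cell size, and the plain averaging are illustrative assumptions, the authors' exact procedure is documented in [20].

import cv2
import numpy as np

def integrate_dem(dems, motions, cell_size=0.1):
    """Superimpose the DEMs of the previous frames into the current frame (iDEM).

    dems    : list of metric height maps (oldest first) with cell_size meters per cell.
    motions : cumulative ego motion (dx, dz, yaw) from each past frame into the
              current frame, e.g., from the single track model refined by the NCC.
    """
    acc = np.zeros_like(dems[-1], dtype=np.float32)
    cnt = np.zeros_like(acc)
    h, w = acc.shape
    for dem, (dx, dz, yaw) in zip(dems, motions):
        # 2D rigid transform of the height map into the current vehicle frame
        M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), np.degrees(yaw), 1.0)
        M[0, 2] += dx / cell_size
        M[1, 2] += dz / cell_size
        warped = cv2.warpAffine(dem.astype(np.float32), M, (w, h),
                                flags=cv2.INTER_NEAREST, borderValue=np.nan)
        valid = ~np.isnan(warped)
        acc[valid] += warped[valid]
        cnt[valid] += 1.0
    return acc / np.maximum(cnt, 1.0)      # averaged iDEM with reduced noise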
Finally, an elevation edge detection is realized on the iDEM. More specifically, odd Gabor filters of the orientations 45, 90 and 135 degrees are applied on three scales, which reduces the computational costs while allowing the detection of different edge widths. For that, a three-level Gaussian image pyramid is computed on the iDEM. The applied Gabor filters are biologically motivated image filters that were shown to exist in the vision pathway of the mammalian brain (see [17] for the biological and [52] for a theoretical background on this filter type). For the right/left roadside, odd Gabor filters with off-on/on-off contrast type selectivity are used (refer to Fig. 51 and 50e for an example and [146] for technical details on this specific decomposition technique for Gabor filters). By decomposing the Gabor filters, a simple and effective way of side-specifically filtering elevated edges on the right and left roadside becomes possible.
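A sketch of the side-specific odd Gabor filtering; splitting the signed filter response into its positive and negative halves is used here as a stand-in for the decomposition of [146], and all kernel parameters are illustrative.

import cv2
import numpy as np

def roadside_edge_responses(idem, theta_deg, ksize=21, sigma=4.0, lambd=12.0):
    """Odd Gabor filtering of the iDEM with on-off / off-on contrast selectivity."""
    theta = np.deg2rad(theta_deg)
    # psi = pi/2 selects the odd (antisymmetric) Gabor component
    kernel = cv2.getGaborKernel((ksize, ksize), sigma, theta, lambd,
                                gamma=0.5, psi=np.pi / 2, ktype=cv2.CV_32F)
    response = cv2.filter2D(idem.astype(np.float32), cv2.CV_32F, kernel)
    on_off = np.maximum(response, 0.0)     # e.g., left roadside (on-off contrasts)
    off_on = np.maximum(-response, 0.0)    # e.g., right roadside (off-on contrasts)
    return on_off, off_on

# Usage: build a three-level Gaussian pyramid of the iDEM (cv2.pyrDown) and call
# roadside_edge_responses(level, theta_deg) for theta_deg in (45, 90, 135).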
Dedicated evaluation has shown that (for our camera setup) from a distance of about 9 meters on, the stereo-based curb detection starts to get noisy and cannot run robustly. Therefore, stereo is used as a cue for roadside detection only for the first 9 meters from the vehicle.
As stated in [17], psychophysical evidence also marks stereo as the primary human depth cue for the first few meters only; after that, other (monocular) vision features take over. In accordance with that, the processing is switched to vision-based edge
detection. More specifically, the last stereo-based elevation edge measurement is used to
adapt a vision template using the data of the left color image. After a prediction step (a
segment-wise linear line model is applied) the template is used to verify and correct the real
position of the roadside using NCC. In case of a low NCC value, a probabilistic crosscheck is done. For that, 5 vision cues are computed on the left color image: the hue and saturation channels of the HSI color space, as well as the edge density on the hue, saturation and intensity channels. All 5 features have been shown to be robust cues for unmarked road detection (see [26] for technical details on these features). In order to determine whether a vision-based detected roadside segment is valid, image patches left and right of the roadside segment (subsequently called “road patch” P_road and “nonroad patch” P_nonroad) are compared to their template counterparts T_road and T_nonroad. The named templates are derived from the valid
roadside region that was previously determined using the DEM. The comparison between
the road/nonroad patches and their templates is realized by applying the distance measure
Figure 50. Exemplary inner-city scenario (visualization for right curbstone only): (a) Cap-
tured image with detected right curbstone (blue line: stereo-based, green line: vision-
based), (b) Bird’s eye view, (c) DEM, (d) Temporally integrated DEM, (e) Odd Gabor off-on
filtered.
δ(P_k, T_k) with k ∈ {road, nonroad}, which is based on the Bhattacharyya coefficient (a measure of the similarity of two histograms) calculated on the histograms H_i^{P_k} and H_i^{T_k} of the image patches of all N = 5 vision cues:
\delta(P_k, T_k) = \sum_{i=1}^{N} \sqrt{1 - \gamma\left(H_i^{P_k}, H_i^{T_k}\right)} \qquad (118)

\gamma\left(H_i^{P_k}, H_i^{T_k}\right) = \sum_{\forall u,v} \sqrt{H_i^{P_k}(u, v)\, H_i^{T_k}(u, v)}.
In case δ(P_k, T_k) is bigger than the maximum distance between all previously found (valid) DEM road/nonroad patches, the newly found roadside segment is invalid and the algorithm stops.

Figure 51. Application of 0 degree odd Gabor filter kernels on simplified road border images: (a) Kernel for on-off contrasts (left road border), (b) Kernel for off-on contrasts (right road border).

A big distance δ(P_k, T_k) can result from a detection error (typically hap-
pening far in the distance) or from an object that occludes and interrupts the roadside. The
thereby gathered information can be used in our biologically motivated ADAS to guide the
attention. In case the roadside is interrupted in the vicinity of the camera-carrying ego-
vehicle, the attention system could be set to analyze the specific image region more closely
in order to rule out a potential danger coming from a so far unknown object on the road.
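The patch validation of Equ. (118) can be sketched as follows; the histogram bin count and the assumption that all cue images are normalized to [0, 1] are illustrative choices of this example.

import numpy as np

def patch_distance(patch_cues, template_cues, bins=32):
    """Distance of Equ. (118) between a road/nonroad patch and its template.

    patch_cues, template_cues : lists of the N = 5 cue images (hue, saturation, and
    the edge densities on hue, saturation, intensity), restricted to the patch region
    and assumed to be normalized to [0, 1].
    """
    delta = 0.0
    for P, T in zip(patch_cues, template_cues):
        hp, _ = np.histogram(P.ravel(), bins=bins, range=(0.0, 1.0))
        ht, _ = np.histogram(T.ravel(), bins=bins, range=(0.0, 1.0))
        hp = hp / max(hp.sum(), 1)                 # normalized histograms
        ht = ht / max(ht.sum(), 1)
        gamma = np.sum(np.sqrt(hp * ht))           # Bhattacharyya coefficient
        delta += np.sqrt(max(1.0 - gamma, 0.0))
    return delta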
In the following, an extensive evaluation of the described algorithm will allow the assessment of its capabilities and robustness.
8. Evaluation Results
In Sect. 8.1. individual system modules are evaluated that play an important role for the
presented cognitive ADAS architecture. In Sect. 8.2. the overall system properties will be
assessed. In Sect. 8.2.1. the system's ability to perform purely vision-driven autonomous braking on a stationary vehicle is tested in an exemplary highway construction site. Then in Sect. 8.2.2.,
based on a complex inner-city scenario it is shown how the system proactively plans and
verifies expectations in order to allow a safe interaction with the environment (correspond-
ing to the control instances 6. and 7. in Sect. 3. on page 69).
First, the system properties related to the challenges defined in Sect. 4.3. are evaluated. All
results are calculated on five real world data sets (cars, reflection post, construction site,
inner-city sequence, toys in an indoor scene) accessible on the internet (see [147]).
High Feature Selectivity
The gain of this approach was exemplarily shown in Sect. 4.2.2. based on the suppres-
sion of the horizon edge (see Fig. 14c for a visualization of the diminished influence of the
horizon edge on the (TD modified) BU saliency on a real world example). For evaluation,
the measures average FoA hit number (Hit) and average detection rate (DRate) were calcu-
lated. The named evaluation measures were introduced by [18]. While DRate is the ratio
of the number of found task-relevant objects to the overall number of task-relevant objects,
Hit states that the object was found on average with the Hit’th generated FoA. Hence, the
smaller Hit is, the earlier an object is detected (see [1] for more details on these measures).
Table 4 qualitatively shows the significant performance gain of attentional sky suppression
versus no horizon edge handling based on real world benchmark data.
Figure 52. Evaluation of normalization: a), b) Input images, c) TD saliency tuned to cars, d) TD saliency tuned to signal boards, e) TD saliency tuned to cars (noise, since no car is present), f) Maximal saliency activation level on the BU, TD car, and TD signal board maps, g) Dynamically adapted conspicuity weights w_{C_j} (homeostasis) for the M = 7 modalities.
Comparability of Modalities
Table 5. Search performance for BU- and TD-based LTM object search for cars and reflection posts for 2 different training sets

The used dynamic adaptation of w_{C_j} (for a theoretical background of the biological concept of homeostasis, see Equation (42)) causes a twofold performance gain. First, the
a priori incomparable modalities can be combined, yielding well balanced BU and TD saliency maps. Second, the system adapts to the dynamics of the environment, preventing varying modalities from influencing the system performance (e.g., in the red evening sun the R color channel will not be overrepresented in the saliency). Figure 52g) depicts the dynamically adapted w_{C_j}. Table 6 shows a noticeable gain in the signal-to-noise ratio (SNR) on the overall saliency for 26 traffic-relevant objects (e.g., traffic lights, road signs, cars), comparing the dynamically adapted w_{C_j} vector with a locally optimized static w_{C_j} vector.
of the maximum within the object. For the used 54 traffic scene images, on average 11% of all feature maps are weak. In case weak feature maps are used to optimally support the TD saliency in an excitatory way, SNR_obj on the TD saliency map increases by 25%. The results are aggregated in Tab. 7. Figure 54a shows that the number of excitatory TD weights w_i^{TD} decreases the bigger K_conj (see Equation (2)) is. An object-dependent trade-off exists, since the TD saliency map gets sparser the bigger K_conj is.
The feature activation of an image region depends on the illumination. Hence the TD
weight set is only optimal for the lighting conditions of the training images and the TD
search performance decreases when illumination changes without an adaptation of the cam-
era exposure. It is important to note that in a real world scene the optimal exposure in
varying illumination is different for all objects (see Fig. 54b and c), making the exposure
control dependent on the current task of the system.
Average hit number (and detection rate [%]) for TD search with λ = 1; target: toys in a complex indoor setup; 20 test images (20 objects); training illumination 75 lx:
    75 lx (training illumination):      1.95 (100%)
    150 lx, without exposure control:   2.74 (95%)
    15 lx, without exposure control:    2.83 (30%)
    150 lx, with exposure control:      1.80 (100%)
    15 lx, with exposure control:       2.0 (100%)
[Figure panels a)-c): Whole image, Lower half, Car.]
\text{Completeness} = \frac{TP}{TP + FN} \qquad (119)

\text{Correctness} = \frac{TP}{TP + FP} \qquad (120)

\text{Quality} = \frac{TP}{TP + FP + FN} \qquad (121)
On a descriptive level Completeness states, based on given ground truth data, how much
of the real object region is covered by the tracked and hence relocalized region. Correctness
states how much of the relocalized region actually belongs to the object to allow a better
assessment of large regions that show a high Completeness. Quality combines both mea-
sures, since between Completeness and Correctness a trade-off is possible. Based on this,
the Quality measure should be used for a comparison, since it weights the FP and FN pixels
equally. For a more detailed analysis, the Completeness and Correctness state what exactly
caused a difference in Quality. The center accuracy describes the mean absolute position
error of the middle axis of the object region in pixels. The necessary ground truth data was
produced by accurate manual annotation of the bicycle region. As Tab. 9 shows, the applied
saliency-enhanced tracking is superior to a classical NCC-based approach. Furthermore, a
spatial prior that depends on the Kalman object position uncertainty improves the tracking
result.
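For pixel-wise masks, the three measures of Equations (119)-(121) can be computed as in the following straightforward sketch (not the authors' evaluation code).

import numpy as np

def segmentation_scores(detected, ground_truth):
    """Pixel-wise Completeness, Correctness and Quality (Equ. (119)-(121)).

    detected, ground_truth : boolean masks of identical shape.
    """
    tp = np.count_nonzero(detected & ground_truth)
    fp = np.count_nonzero(detected & ~ground_truth)
    fn = np.count_nonzero(~detected & ground_truth)
    completeness = tp / max(tp + fn, 1)
    correctness = tp / max(tp + fp, 1)
    quality = tp / max(tp + fp + fn, 1)
    return completeness, correctness, quality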
relevant traffic signs on 93 images. Therefore, a number of images contain two relevant traffic signs. Nevertheless, the approach as well as the evaluation measure treat each traffic sign independently. Based on that, each of the traffic signs on an image has to be classified. No temporal integration of a traffic sign (usage of sign information from previous images) is done here, in order to show the single-image performance.
The images were manually labeled, providing the exact position and type of each traffic
sign. The images show various inner-city scenes, with different scene complexities (see
Fig. 20).
In order to evaluate our algorithm, we adopt Equations (119), (120), and (121). The measures have to be interpreted slightly differently from the previous application. On a descriptive level, the Completeness states, based on given ground truth data, how many of the traffic signs were actually detected. The Correctness states how many of the detected regions were actually relevant traffic signs. The Quality combines both measures. A traffic sign is counted as a true positive detection if the corresponding traffic sign class as well as its position (within a range of five pixels to the ground truth center) on the image is detected.
The three measures were calculated on the detected traffic signs over all images of the
two inner-city sequences. The gathered results are depicted in Tab. 10.
An in-depth comparison with other approaches is difficult to realize. This is due to the fact that there is no well-known benchmark database with image sequences providing the same input images for all approaches. Some publications use images that always contain traffic signs, others use images of traffic signs of a certain size only, others use images
of highway scenes only and a few generate synthetic images. In general, the class of rec-
ognized traffic signs as well as their appearance varies in different countries. Additionally,
numerous publications do not state the reached classification rates. Due to the mentioned
difficulties, a direct comparison with other approaches was not carried out. Nevertheless, an
overall Quality of nearly 90% shows the reliability of our approach, especially when taking
into account that no temporal integration was used.
In the following, we evaluate the performance of our scene classification by training and testing
with a total of 10800 images, taken from several image sequences. The images were man-
ually assigned to one of the categories highway, country road and inner city (see Fig. 28).
The images show various scenes, some containing cars, trucks, and pedestrians, others do
not contain traffic relevant objects. The scenes also show strong differences in the present
lighting conditions. The training was conducted with 600 images per category. Afterwards,
the evaluation was done by five independent runs with 1800 images each (also 600 images
per category). The results of the HPCC on the different test sets showed similar perfor-
mance on the classification rate (variation of 2%). Table 11 shows the average results over
the five evaluation runs. In the following, the scene classification approach with a single
classification tree for the overall image is called HPCC 1 and the one with 16 single classi-
fications is called HPCC 2.
Table 11. Average results of the scene classification on all test sets
To draw a comparison, a system comprising only LDA with WDST was set up. Hence, the LDA/WDST system generates a template for each category without sub-parts. It is comparable to the work of Oliva and Torralba [98] and showed a total result of nearly 68% on a single evaluation run. Compared to our results, the HPCC 1 with a neural net reliably classifies the scene with over 97% accuracy and also requires the least computation time.
Again, no temporal integration was conducted. This would improve the detection results
considerably, since the scene context typically is rather stable over time. The HPCC is able
to reliably provide the current scene context as an input for higher level ADAS applications.
The incorrectly classified images (see Fig. 55 for two examples) show, on the one hand, largely covered areas due to cars and trucks in front. On the other hand, they show ambiguous scenes, which would also be difficult for a human to classify correctly without temporal integration. Figure 55a shows the driveway to a highway, which is not a typical scene for highways but should be interpreted as belonging to this category. Figure 55b
was captured during a stop at a traffic light on a country road, which underlines the smooth
transition between the different categories.
Figure 55. Two example images resulting in a wrong scene classification: (a) Driveway to
a highway as country road, (b) Country road as inner city
Figure 56 shows the EKF-based fusion of depth measurements for a car that drives in front
of the prototype vehicle through an inner-city scenario (see Fig. 37b for a visualization of
the scenario). For the EKF the sensor variances σ_radar = 0.3, σ_birds = 2.8, and σ_obj = 2.7 as well as the process variance σ_process = 0.023 for the prediction step are used. Note that the usage of two additional monocular depth cues of high variance, fused with the low variance Radar cue, ensures the availability of depth values even if the objects of interest are outside the Radar beam.
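As a reduced illustration of the fusion idea, a scalar Kalman filter with the stated variances can be sketched as follows; the full system uses an EKF, and the constant-distance process model as well as the initial values are assumptions of this sketch.

def fuse_depth(frames, variances, z0=20.0, p0=4.0, q=0.023):
    """Scalar Kalman sketch of the depth fusion.

    frames    : list of per-frame dicts {cue_name: depth in m}, e.g.
                {"radar": 18.2, "birds_eye": 19.5, "object": 17.8}; cues may be missing.
    variances : measurement variances per cue, e.g.
                {"radar": 0.3, "birds_eye": 2.8, "object": 2.7} as stated in the text.
    q         : process variance of the prediction step (0.023 in the text).
    """
    z, p = z0, p0                          # state (distance) and its variance
    estimates = []
    for frame in frames:
        p += q                             # prediction step
        for cue, value in frame.items():   # sequential measurement updates
            r = variances[cue]
            k = p / (p + r)                # Kalman gain
            z += k * (value - z)
            p *= (1.0 - k)
        estimates.append(z)
    return estimates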
For a proof of concept, the classifier (see Sect. 3.) was trained to distinguish cars from signal boards and clutter. A set of image segments generated by the presented vision system during online operation was used for training. It contains 11000 square image patches of size 64x64 pixels and was divided into the classes “cars” (2952 patches), “signal boards” (2408 patches) and “clutter” (5803 patches) by visual inspection. Car segments contain complete back and front views of cars (at any position), which must be at least half as large as the patch in both dimensions. At equal false positive and false negative rates, an error of 4.7% for cars and 9.7% for signal boards was obtained on equally large test sets. The performance of the trained classifier is shown in Fig. 57 in the form of a receiver operating characteristic (ROC) curve that visualizes the trade-off between false positive (clutter recognized as object) and false negative (object recognized as clutter) detections when varying the classification thresholds. The ROC curve was generated using 5-fold cross validation. Furthermore, the quality of the classification is enhanced by the voting process described in Sect. 3..
Figure 56. Depth from bird’s eye view, object knowledge, Radar and fusion with EKF.
Figure 57. Receiver operating characteristic curve for cars (back and front views) and signal boards.
that the car can move on safely in a physical sense, but without taking symbolic information
into account (e.g., one-way-street, traffic signs).
In order to evaluate how many of the image pixels representing road area are classified correctly, ground truth was generated by manual labeling of 440 test images of an inner-city sequence. For evaluation, the previously introduced ground-truth-based measures are applied (see Equations (119), (120), and (121)).
The gathered evaluation results for the road detection algorithm with and without tempo-
ral integration (see Sect. 3.) are shown in Tab. 12. A Quality of 89.9% is reached using
temporal integration as opposed to a Quality of only 60.5% without temporal integration.
While the highest Correctness of 98.1% is reached without temporal integration, this comes
at the cost of a low Completeness and, consequently, a low Quality.
Table 12. Comparison of unmarked lane detection with and without temporal
integration (TI)
For further evaluation Fig. 58 shows the street detection results on a number of example
frames of the inner-city sequence used for evaluation. Please refer to [20] and [26] for more
evaluation results concerning the road detection and temporal integration system described.
The input images and stereo data used for the evaluation as well as the ground truth data
and results are accessible on the internet [149] for open benchmark testing.
Figure 58. Road detection on example images (frames 24 and 97) of an inner-city sequence (left column: without temporal integration, right column: with temporally integrated street segment).
a comparison, since it weights the FP and FN pixels equally. For a more detailed analysis, the Completeness and Correctness can be evaluated, thereby determining what exactly caused a difference in Quality.
Table 13. Evaluating the influence of detected curbstones on unmarked road detection results with and without temporal integration

As Tab. 13 shows, using the curbstone detection results in a postprocessing step improved the detection Quality from 66.4% to 73.7% (without temporal integration) and from 72.3% to 83.8% (with temporal integration), respectively. When analyz-
ing Tab. 13 more closely, it can be seen that postprocessing with the curbstone detection results lowers the Completeness to a small extent. For both cases the Completeness is diminished: without temporal integration (from 77.1% to 76.7%) and with temporal integration (from 88.7% to 88.1%). More specifically, this means that some true-positively detected road segments were falsely corrected and hence clipped by applying the curbstone detection results. Still, the postprocessing step corrects more false-positive pixels than it falsely clips true-positive ones. For both cases the Correctness grows: without temporal integration (from 83.9% to 95.8%) and with temporal integration (from 80.2% to 94.9%). Summing up, the Correctness improves disproportionately more, leading to the measured significant increase in Quality. In order to facilitate the empirical assessment of the gathered
results, in Fig. 59 some typical frames of the used inner-city sequence are depicted.
For our experiments we use a Honda Legend prototype car equipped with an mvBlueFox CCD color camera from Matrix Vision delivering images at an initial resolution of 800x600 pixels at 10 Hz, which is hence the minimum processing rate our road detection module has to reach. The image data as well as the vehicle state data from the CAN bus are transmitted via LAN to several Toshiba Tecra A7 laptops (2 GHz Core Duo) running our RTBOS integration middleware [150] on top of Linux. As most of the system's submodules are already part of our ADAS running in real-time (e.g., stereo computation, bird's eye view, temporal integration architecture, single track model), we expect the remaining, currently Matlab-implemented submodules to fulfill real-time requirements as well.
In the following, the overall system performance will be assessed based on a construction
site (see Sect. 8.2.1.) and an inner-city scenario (see Sect. 8.2.2.).
Figure 59. Typical frames of the inner-city sequence (frames 87, 96, and 123): (a) Captured images with detected right
curbstone (blue line: stereo-based, green line: vision-based), (b) Hand-labeled groundtruth
road segments, (c) Detected road without temporal integration, (d) Incorporating curbstone
detection result (without temporal integration), (e) Detected road with temporal integration,
(f) Final detection result: Incorporating curbstone detection (with temporal integration)
Scenario
In order to evaluate the proposed system in a challenging situation, the focus is on
typical construction sites on highways. This situation is quite frequent and a traffic jam
ending exactly within a construction site is a highly dangerous situation: due to the S-curve
in many construction sites, the driver will notice a braking or stopping car quite late as the
signal boards limit the field of view (see Fig. 60a). The ADAS implementation uses a 3-
phase danger handling scheme depending on the distance and relative speed of a recognized
obstacle. For example, when the vehicle drives at around 40 km/h and a static obstacle is
detected in front at less than 33 meters, in the first warning phase a visual and acoustic
warning is issued and the brakes are prepared. If the dangerous situation is not resolved by
the human driver, the second phase triggers the belt pretensioner and the brakes are engaged
with a deceleration of 0.25 g followed by hard braking of 0.6 g in the third phase.
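The 3-phase scheme can be sketched as a simple decision rule; only the 33 m warning distance and the 0.25 g / 0.6 g decelerations are taken from the text, the intermediate phase thresholds below are assumptions.

def danger_phase(distance_m, rel_speed_mps, warn_dist_m=33.0):
    """Illustrative 3-phase danger handling for a static obstacle at ~40 km/h.

    rel_speed_mps < 0 means the ego vehicle is closing in on the obstacle.
    """
    if rel_speed_mps >= 0.0 or distance_m > warn_dist_m:
        return 0, "no action"
    if distance_m > 0.66 * warn_dist_m:
        return 1, "visual and acoustic warning, prepare brakes"
    if distance_m > 0.33 * warn_dist_m:
        return 2, "trigger belt pretensioner, brake with 0.25 g"
    return 3, "hard braking with 0.6 g"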
Figure 60. Scenario: (a) Schematic sketch of the construction site scenario. Stationary car
is visible from 48 meters on. (b) Real scenario.
Technical Setup
For the experiments, a Honda Legend prototype car is used that is equipped with an mvBlueFox CCD color camera from Matrix Vision delivering images of 800×600 pixels at
10 Hz. The image data as well as the Radar and vehicle state data from the CAN bus can be
recorded. The recorded data is used during offline evaluation. For online processing all data
is transmitted via Ethernet to two laptops (2 GHz Core Duo) running the RTBOS (Real-
Time Brain-like Operation System) integration middleware [150] on top of Linux. The
individual RTBOS components are implemented in C using an optimized image processing
library based on the Intel Performance Primitives (IPP) [151].
Figure 61. Stationary car detection distance depending on the TD attention parameter λ =0,
0.25, 0.5, 0.75, and 1 as well as the STM size N=1,2,3,5, and 7 when using ground truth for
detecting a hit.
Figure 61 shows the distance to the stationary car at which the first FoA hits the car, which is defined by hand-labeled ground truth on the recorded sequences.
It can be seen that the larger the TD influence (search task: find cars) expressed by λ, the earlier the car is detected. Similarly, the more objects are stored in the STM (object number N), the earlier the car is detected, as a large part of the visual scene is already contained as (unknown) objects in the STM and therefore inhibited in the saliency map. It can also be deduced that with growing N the influence of TD is reduced, since the scene coverage increases.
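The role of λ can be illustrated with a simplified saliency combination; the linear blending of BU and TD saliency and the hard suppression of STM objects are assumptions standing in for the chapter's exact attention model.

import numpy as np

def combined_saliency(s_bu, s_td, lam, stm_masks):
    """Blend BU and TD saliency and suppress already attended STM objects.

    Assumes S_total = (1 - lambda) * S_BU + lambda * S_TD, a simplification.
    stm_masks : list of boolean masks of the N objects stored in the STM.
    """
    s_total = (1.0 - lam) * s_bu + lam * s_td
    for mask in stm_masks:                 # inhibition of return
        s_total[mask] = 0.0
    return s_total

def next_foa(s_total):
    """Next FoA anchor at the saliency maximum (segmentation step omitted)."""
    return np.unravel_index(np.argmax(s_total), s_total.shape)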
Including the task of object recognition in the evaluation, Fig. 62 shows the distance
to the stationary car when the first FoA hits the target and this RoI is recognized as car by
the object classifier. Since the used classification threshold was set high to obtain a low
false-positive error rate at the cost of a high false-negative error rate, the distance at which the
car is detected is smaller than in the evaluation with ground truth. Differing from Fig. 61,
at large values of N (see Fig. 62 for N=7) the detection distance worsens again. The reason for this effect is that the system does not use object segmentation algorithms but performs the segmentation directly on the saliency image, which can lead to enlarged patches suppressing the surround of the found objects as well. In this way, the borders of the car might be suppressed by adjacent signal board patches, leading to incomplete car FoAs that are not sufficient for a correct classification by the used object classifier. The likelihood that this happens grows with the size N of the STM. However, with a growing N the scene coverage also improves. This trade-off leads to the measured results.
Figure 62. Stationary car detection distance depending on the TD attention parameter λ =0,
0.25, 0.5, 0.75, and 1 as well as the STM size N=1,2,3,5, and 7 when using the classifier for
detecting a hit.
Based on Fig. 62 the best choice of λ for detecting cars would be 1, which equals pure
TD search mode. However, such a parameterization is not appropriate because this leads to
a reduced capability of detecting other objects that are only prominent in the BU saliency
(see Fig. 63). Here, it can be seen that with growing λ the average detection distance
of signal boards (the only other object class besides cars in the evaluation) drops. Stated
differently, the system ignores all other objects while searching for cars in pure TD mode
(λ = 1), which might lead to dangerous situations. The default value for λ was hence set to
0.5 for the online tests.
In the previous evaluations it was assumed that the scene contains more than N objects, and a fixed STM size was used, which is equivalent to storing any object for N frames independent of, e.g., whether it was correctly recognized. Now an object-specific Time To Live
(TTL) is introduced, defining for how many frames an object is stored in the STM before it is removed.

Figure 63. Detection distance depending on the TD attention parameter λ = 0, 0.25, 0.5, 0.75, and 1. Average detection distance of signal boards and the stationary car using the object classifier for an STM size of N = 1, 2, and 5.

In this way, unknown objects can be tracked for only a short time before a new
recognition attempt is carried out if the image region is still salient. Figure 64 shows how
the choice of the TTL influences the system performance.
For an object-unspecific TTL of 5 frames the curve is identical to Fig. 63 for N=5. For the object-specific case, the following parameters were chosen: TTL_sigboard = 6 frames for signal boards, TTL_cars = 20 frames, and TTL_unknown = 3 frames, leading on average to N=5 objects in the STM for the construction site sequences. Note that the low value of TTL_unknown and the high value of TTL_cars both support setting the object recognition threshold high, i.e., it is quite likely to get an unknown object that is a false-negative car, but rather unlikely to get a car that is a false positive.
A clear gain in detection performance can be seen when using object-dependent TTL values, which is due to the fact that FoAs which hit the car very early are often too small for a reliable classification. These unknown scene parts are suppressed only for 3 frames
before the classifier gets a second chance to detect the car. This object-specific TTL param-
eterization was used during the online tests described below.
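The object-specific TTL bookkeeping of the STM can be sketched as follows; the data layout and class names are illustrative, only the TTL values are taken from the text.

# Object-specific Time To Live values (in frames) as stated in the text
TTL = {"car": 20, "signal_board": 6, "unknown": 3}

def update_stm(stm, detections):
    """Age stored objects, drop expired ones, and insert new detections.

    stm        : list of dicts {"label": str, "mask": ..., "ttl": int}
    detections : list of dicts {"label": str, "mask": ...} from the current frame
    """
    stm = [dict(obj, ttl=obj["ttl"] - 1) for obj in stm if obj["ttl"] > 1]
    for det in detections:
        stm.append({"label": det["label"], "mask": det["mask"],
                    "ttl": TTL.get(det["label"], TTL["unknown"])})
    return stm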
Figure 64. Stationary car detection distance depending on the TD attention parameter λ =0,
0.25, 0.5, 0.75, and 1 while using object-unspecific and object-specific TTL values.
During documented online system tests in the setting depicted in Fig. 60, with the prototype vehicle driving at 40 km/h, the system detected the stationary car in time in 57 of 60 cases and issued the 3 warning phases as expected, including autonomous braking. In the remaining cases, either the object recognition detected a signal board as a car and the braking was performed too early, or the FoA generation did not deliver a good car RoI position, so that the fusion of the car RoI with Radar data failed and no warning/braking was performed at all. Note that the presented proof-of-concept system completely relies on vision and does not make use of an additional Radar-based emergency braking that would be needed in real traffic as a backup for situations in which the presented vision system fails.
detected dynamic object including its predicted trajectory (the red region codes a negative
relative velocity: the longer the red region, the faster the dynamic object moves), and the
area covered by the radar sensor (in magenta) is visualized. The bottom right image shows
an environmental representation that visualizes the task-relevant dynamic object. An aura
around the object codes the position uncertainty of the detected object.
In the scenario, while exploring the scene (with λ =0, no spatial prior set, and the STM
holding up to 5 objects) the camera ego-vehicle detects a dynamic, i.e., moving, object
based on the procedure described in Sect. 3. (see Fig. 65a). The object is tracked (in one
“what” pathway, the parameter λ is set to 1 allowing an attention-based tracking support, a
spatial prior is set that depends on the object position uncertainty, as described in Sect. 3.).
The ego-vehicle overtakes the bicycle, giving a blind spot warning. Based on the presented
internal 3D representation the bicycle position is predicted linearly even while it is outside
the field of view of the camera (see Fig. 65b, the growing position uncertainty is visualized
by the growing object aura in the bird’s eye view representation). Without any dynamic
object in the field of view, λ is set to 0 again, the spatial prior is reset, and the object
decay rate is reset to allow the tracking of up to 5 objects. All these adaptations support
an object-independent scene exploration. Now, the ego-camera vehicle stops to turn right.
The ADAS “remembers” the bicycle and gives a blind spot warning. The ego-vehicle waits
for the bicycle to reappear. In order to allow a fast redetection, the top-down attention is
tuned to the bicycle, setting a spatial prior with low sharpness. This allows its instantaneous
redetection (see Fig. 65c). The object position is updated lowering the position uncertainty.
The system tracks the object relying on the attention-based-tracker-support (see Fig. 65d).
Summing up, the system at runtime builds up and verifies expectations of the environment,
thereby autonomously tuning internal parameters and processes that improve and accelerate
the system reaction.
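The blind prediction with growing position uncertainty can be sketched as a constant-velocity step on the ground plane; the process noise value is an illustrative assumption.

import numpy as np

def blind_predict(pos, vel, cov, dt, q=0.5):
    """Linear ('blind') prediction of an object outside the field of view.

    pos, vel : 2D position/velocity in the bird's eye view (ground plane), m and m/s.
    cov      : 2x2 position covariance; it grows with every prediction step, which is
               visualized as the growing aura around the object.
    """
    pos = pos + vel * dt
    cov = cov + q * dt * np.eye(2)
    return pos, cov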
The processing described above relates to one “what” pathway (see Fig. 1), which con-
centrates on the detection, tracking, and prediction of one dynamic and hence potentially
dangerous object. As visualized in Fig. 1, multiple “what” pathways run in parallel. For
example, further “what” pathways handle the detection of cars, signal boards, and other
traffic-relevant objects using the previously described LTM search (see Sect. 3.) that relies
on object-specifically tuned saliency maps. The control of these pathways is realized by the
functional mapping described in Sect. 3.. For each of these “what” pathways the object-
specific LTM search can be interrupted by a STM search in case an object was lost during
tracking (see Sect. 3.). The complete result video is accessible on the internet at [152].
For further qualitative system evaluation, Fig. 66 depicts internal system variables for
three sequential frames of another inner-city sequence with cars as LTM search object. As
described in Sect. 3., for each new image the attention is calculated and a new FoA is
generated via maximum search and segmentation on the saliency map. The detected road area (and thereby also the present lane markings) is mapped out of the saliency map, which decreases the false positive rate of generated FoAs, i.e., fewer non-car FoAs are generated.
In the first frame, the car in front is detected and stored in the representation based on a
car-like hole in the detected street segment that modulates the attention (see highlighted
region in the saliency map of frame 194 in Fig. 66). Please note that cars 2 and 3 are not stored in the internal representation, since their position is beyond the represented road
environment.
(a) (b)
(c) (d)
Figure 65. Visualization of system states for bicycle sequence: (a) After its detection the
bicycle is tracked, ego-vehicle is closing in, (b) Blind prediction of bicycle, (c) Ego-car
searching actively for the bicycle, waiting to turn right, (d) Bicycle redetected successfully,
ego-vehicle turns right.
9. Summary
In modern vehicles, numerous driver assistance functionalities exist that support the driver
in typical traffic situations. In general, all these functionalities bring their own sensors, processing devices, and actuators. No information fusion takes place between the functionalities, due to, among other things, unsolved questions in system design that come along with high system complexity. Thereby, the potential to improve the robustness of so far independent system modules as well as the potential to develop higher level functionalities is ignored.
Furthermore, the number of driver assistance functionalities is growing constantly in to-
day’s vehicles. In the near future, this will lead to problems regarding the limited number
of interaction channels to inform the driver of dangerous events (so-called human machine
interface (HMI)). Already today, in specific traffic scenarios a contradicting HMI access of
different driver assistance functionalities can occur. For solving these challenges, a large-
scale driver assistance system is required that integrates and fuses various functionalities
(leading to so-called advanced driver assistance systems (ADAS)).

Figure 66. System evaluation on example images (frames 194, 196, and 197) of an inner-city sequence. Left column: visualization of found FoAs, middle: calculated saliency map S_total (previously found objects are suppressed by inhibition of return), right column: visualization of the internal representation (dashed line marks the border of the vision field; axes: width in m vs. distance (depth) in m).

Despite the apparent necessity of such systems, only a few of these approaches exist in the literature. Said systems
typically rely on rigid system structures and safely run in clearly defined scenarios only.
While it may be argued that the quality of such engineered systems in terms of isolated
aspects, e.g., object detection or tracking, is often sound, the solutions lack the necessary
flexibility.
Instead of following classical engineering-based approaches, in the contribution at
hand, an ADAS is developed that solves the challenges of complex system design by mim-
icking known information processing principles in the human brain. On the micro-level
the realized system copies the signal processing characteristics of neurons in order to reach
robust image filtering. On the macro-level the system gets inspiration from the way the
human brain organizes higher level signal flows in order to reach a generic system structure
(e.g., a task-dependent tunable attention system).
More specifically, on the micro-level various static and dynamic visual features are
introduced that are biologically inspired. Inspiration is drawn from known processing prin-
ciples of the vision pathway in the human brain (i.e., the signal processing characteristics of
neurons in the brain). Features for the detection of specific intensity changes, oriented lines
and edges as well as a retina-like color space are introduced and tested. Since during the
projection of 3D world objects to the 2D image plane the object depth is lost, a distance fea-
ture becomes necessary. Four biologically motivated depth cues and their dynamic fusion
are described. The realized depth sources are stereo disparity, depth from object knowledge,
depth from the bird's eye view, and Radar-based depth. In sum, 130 static feature maps are accessible by which the proposed ADAS can sense the world. Together with six dynamic feature maps, the ADAS disposes of 136 feature maps in total.
The 136 biologically inspired feature maps are combined in the attention system that is
used as common front-end of all vision processes of the presented ADAS. The key aspect
and output of the attention system is the saliency map, whose amplitude (i.e., its activation
in neurobiological terms) encodes the level of information contained in an image region.
A high activation can be caused by 1) an object that visually differs strongly from its sur-
rounding environment (sensory-driven or bottom-up attention) or 2) by an image region that
matches the current searched object properties (goal-driven or top-down attention). Both
the bottom-up (BU) and top-down (TD) attention use a weighted combination of the fea-
tures. Five novelties increase the adaptivity of the attention system to the scene and thereby
assure a high robustness, which allows for building applications for outdoor scenarios of
the dynamic vehicle domain.
The importance of context information for improving the performance of driver assis-
tance functionalities is a widely acknowledged fact in the community. Therefore the ADAS
contains a real-time unmarked road detection system that relies on vision as the major cue.
As the evaluation showed, the detected road segments match ground truth data well in most situations. However, in case of shadows on the road the detected road segments contain holes and become unstable over time. As a further novelty, for solving these challenges a generic
tracking approach for unmarked road detection systems is applied that is based on temporal
integration (i.e., integration of road detection results over time). The temporal integration
approach was successfully tested on the implemented system described before, but would
be suitable to improve any comparable state-of-the-art unmarked road detection system.
In the central part of the contribution, the realized visual features, the attention sub-
system, and the unmarked road detection sub-system including the temporal integration
approach are combined into a generic, biologically motivated Advanced Driver Assistance
System. The attention sub-system, which weights and combines all visual features, allows
a task-dependent scene decomposition that enables the system to react in real-time to chal-
lenging situations. In the developed ADAS, the attention sub-system is used as front-end
for the detection of a stationary vehicle in a highway construction site. In case a quickly
approaching vehicle is detected by the system, a danger handling scheme is initiated that
in the final phase allows the system to brake autonomously. Different from commercially
available systems and approaches in literature, the object detection is based on vision as the
main cue.
Based on an extensive system evaluation, system properties could be demonstrated that
were measured in psychophysical studies on humans as well. Similar to humans, it could be
shown that a number of 5 to 7 stored and tracked objects in the short term memory results
in the best overall system performance when searching for a dangerous object under time
constraints. Furthermore, it could be shown that the reaction time to unexpected traffic-
relevant objects grows with an increasing focus on a specific object class (a phenomenon
that is called inattentional blindness in psychophysical studies with humans). Based on
these results, it can be stated that the realized ADAS closely models important human information processing principles, allowing the system to be used as an attentional co-pilot for human drivers.
A vehicle of width a = 2.5m with an acuteness factor S = 1.0 can hence be classified from
approximately 2500 meters on.
As stated in Sect. 8.2.1, the ADAS proposed here is able to classify cars from about 42
meters on. Among other things (like, e.g., differences in image resolution), the large differ-
ence in performance between the human and a technical vision system might be due to the
fact that the human integrates information about the scene context and gathered experience.
In the discussed example, context information about the road (e.g., the object is positioned on the road), which restricts the potential object type, and object class-related experience (e.g., a car is a fast-moving object on the road) are crucial. First approaches to include con-
text into the ADAS are realized in the contribution at hand, in order to improve the object
detection. More specifically, the detected road is searched for car-like openings and used
to modulate the attention and object segmentation (see Sect. 3.). However, more context
information (e.g., scene type: inner-city, urban road, highway) needs to be included, in or-
der to further improve the system performance. In [154] a robust, real-time capable system
extension for said basic scene classification is described, which will be incorporated in the
“Attentive Co-Pilot” in the near future.
The contribution at hand concentrates mainly on saliency-based attention and build-
ing of a generic system that allows the dynamic modulation of modules and links between
modules at run-time. The further development of the ADAS focuses on ways to control the
designed cognitive system based on reinforcement learning on a high system level. After the
successful test of the low complexity control approach, in the next step, learning of the func-
tional mapping between the measured input feature space and the output control parameter
space will be the focus. More specifically, it is planned to replay stored sequences of critical traffic situations from a database. As a learning signal, dangerous objects will be manually labeled in these sequences. The system's task is to detect these objects early enough. In case the system reacts too late, the scenario is replayed by the learning algorithm while changing
the functional mapping between input and output data of the behavior control module. Also
measuring and mimicking the reactions of an experienced driver is envisioned in the future.
The introduced system contains first approaches towards such an efficient cognitive control
concept. The central assumption, as proposed in this contribution, is that a robust (and as
envisioned also learning) system requires a generic system structure with a high number
of degrees of freedom for controlling the system reaction and measuring the system state.
Therefore, if required, it is planned to further increase the input feature space as well as out-
put control parameter space of the behavior control module in order to increase the number
of possible system behaviors and prove the scalability of the approach.
References
[1] S. Frintrop, Vocus: A visual attention system for object detection and goal-directed
search, Ph.D. thesis, University of Bonn (2006).
[2] H. I. Christensen, H.-H. Nagel (Eds.), Cognitive Vision Systems: Sampling the Spec-
trum of Approaches, LNCS, Springer-Verlag, 2006.
[3] S. Treue, Visual attention: the where, what, how and why of saliency, in: Current
Opinion in Neurobiology, Vol. 13, 2003.
[5] K. Kodaka, M. Otabe, Y. Urai, H. Koike, Rear-end collision velocity reduction sys-
tem, in: Proc. 2003 SAE World Congress, Detroit, 2003.
[6] K. Kodaka, J. Gayko, Intelligent systems for active and passive safety - Collision
Mitigation Brake System, in: Proc. of the ATA EL conference 2004, Parma, 2004.
[8] WWW, European Commission Information Society Intelligent Car Initiative, http://ec.europa.eu/informationsociety/activities/intelligentcar/ (2007).
[12] G. Färber, Biological aspects in technical sensor systems, in: Proc. Advanced Mi-
crosystems for Automotive Applications, Berlin, 2005, pp. 3–22.
[13] C. Stiller, G. Färber, S. Kammel, Cooperative Cognitive Automobiles, in: IEEE In-
telligent Vehicles Symposium, 2007, pp. 215–220.
[16] A. Torralba, Contextual priming for object detection, in: International Journal of
Computer Vision, Vol. 53, 2003.
[18] S. Frintrop, VOCUS: A visual attention system for object detection and goal-directed
search, Ph.D. thesis, University of Bonn Germany (2006).
[19] H. Wersing, E. Körner, Learning optimized features for hierarchical models of in-
variant object recognition, Neural Computation 15 (2) (2003) 1559–1588.
[21] R. Klein, Inhibition of return, Trends in Cognitive Science 4 (4) (2000) 138–145.
[25] J. Eggert, C. Zhang, E. Körner, Template matching for large transformations, in:
ICANN (2), 2007, pp. 169–179.
[28] H. Egeth, S. Yantis, Visual attention: control, representation, and time course, An-
nual Review of Psychology 48 (1997) 269–297.
[29] J. Wolfe, T. Horowitz, What attributes guide the deployment of visual attention and
how do they do it?, Nat. Reviews Neuroscience 5 (6) (2004) 495–501.
[30] C. Koch, S. Ullman, Shifts in selective visual attention: towards the underlying neu-
ral circuitry, Human Neurobiology 4 (4) (1985) 219–227.
[31] J. Tsotsos, S. Culhane, W. Wai, Y. Lai, N. Davis, F. Nuflo, Modeling visual attention
via selective tuning, Artificial Intelligence 78 (1-2) (1995) 507–545.
[32] V. Navalpakkam, L. Itti, Modeling the influence of task on attention, Vision Research
45 (2) (2005) 205–231.
[33] D. Simons, C. Chabris, Gorillas in our midst: Sustained inattentional blindness for
dynamic events, British Journal of Developmental Psychology 13 (1995) 113–142.
[34] S. Most, R. Astur, Feature-based attentional set as a cause of traffic accidents, Visual
Cognition 15 (2) (2007) 125–132.
[36] L. Itti, C. Koch, E. Niebur, A model of saliency-based visual attention for rapid scene
analysis, IEEE Trans. Pattern Anal. Mach. Intell. 20 (11) (1998) 1254–1259.
[42] G. Backer, B. Mertsching, Integrating depth and motion into the attentional con-
trol of an active vision system, in: G. Baratoff, H. Neumann, (Eds.), Dynamische
Perzeption, St. Augustin (Infix), 2000, pp. 69–74.
[46] J. Findlay, I. Gilchrist, Active Vision: The psychology of looking and seeing, Oxford
University Press, 2003.
[47] A. Treisman, The perception of features and objects, in: A. Baddeley, L. Weiskrantz
(Eds.), Attention: Selection, awareness, and control, Clarendon Press, Oxford, 1993,
pp. 5–35.
[48] Z. Aziz, B. Mertsching, Visual search in static and dynamic scenes using fine-grain
top-down visual attention, in: Lecture Notes in Computer Science, Vol. 5008, 2008,
pp. 3–12.
[50] N. Flores-Herr, Das hemmende Umfeld von Ganglienzellen in der Netzhaut des
Auges, Ph.D. thesis, Frankfurt am Main, Johann Wolfgang Goethe-Universität
(2001).
[55] O. Ramstroem, H. Christensen, A method for following unmarked roads, in: IEEE
Intelligent Vehicles Symposium, 2005, pp. 650–655.
[56] E. Dickmanns, B. Mysliwetz, Recursive 3-d road and relative ego-state recognition,
IEEE Trans. Pattern Anal. Mach. Intell. 14 (2) (1992) 199–213.
[57] T. Luo-Wai, Lane detection using directional random walks, in: IEEE Intelligent
Vehicles Symposium, Eindhoven, 2008.
[58] D. Hubel, T. Wiesel, Receptive fields, binocular interaction and functional architec-
ture in the cat’s visual cortex, Journal of Physiology 160 (1962) 106–154.
[62] A. Treisman, S. Gormican, Feature analysis in early vision: Evidence from search
asymmetries, Psychological Review 95 (1988) 15–48.
[63] D. Forsyth, J. Ponce, Computer Vision: A Modern Approach, Prentice Hall, Berkeley,
2003.
[67] S. Frintrop, M. Klodt, E. Rome, A real-time visual attention system using integral
images, in: Int. Conf. on Computer Vision Systems, Bielefeld, 2007.
[69] C. Fang, S. Chen, C. Fuh, Road-Sign Detection and Tracking, IEEE Transactions on
Vehicular Technology 52 (5).
[71] S. Zhu, L. Liu, X. Lu, Color-Geometric Model for Traffic Sign Recognition, in:
Proc. IMACS Multiconference on Computational Engineering in Systems Applica-
tions, 2006, pp. 2028–2032.
[72] V. Andrey, K. H. Jo, Automatic detection and recognition of traffic signs using ge-
ometric structure analysis, in: Proc. Internat. Joint Conference SICE-ICASE, 2006,
pp. 1451–1456.
[74] H.-M. Yang, C.-L. Liu, K.-H. Liu, S.-M. Huang, Traffic Sign Recognition in Disturb-
ing Environments, in: ISMIS 2003, LNAI 2871, Springer Verlag, 2003, pp. 252–261.
[75] S. Kantawong, Road Traffic Signs Detection and Classification for Blind Man Nav-
igation System, in: Proc. Internat. Conf. on Control, Automation and Systems (IC-
CAS), 2007, pp. 847–852.
[76] M. Zadeh, T. Kasvand, C. Suen, Localization and Recognition of Traffic Signs for
Automated Vehicle Control Systems, in: Conf. on Intelligent Transportation Sys-
tems, part of SPIE’s Intelligent System and Automated Manufacturing, 1997, pp.
272–282.
[77] B. Cyganek, Road Signs Recognition by the Scale-Space Template Matching in the
Log-Polar Domain, in: IbPRIA 2007, Part I, LNCS 4477, Springer Verlag, 2007, pp.
330–337.
[79] P. Paclk, J. Novovicov, P. Pudil, P. Somol, Road Sign Classification Using Laplace
Kernel Classifier, Pattern Recognition Letters 21 (2000) 1165–1173.
[80] S.-D. Zhu, Y. Zhang, X.-F. Lu, Detection for Triangle Traffic Sign Based on Neural
Network, in: ISNN 2006, LNCS 3973, Springer Verlag, 2006, pp. 40–45.
[81] Y. Wang, J. Dang, Z. Zhu, Traffic Signs Detection and Recognition by Improved
RBFNN, in: Proc. Internat. Conf. on Computational Intelligence and Security, 2007,
pp. 433–437.
[84] H. Fleyeh, Traffic Sign Recognition by Fuzzy Sets, in: Proc. IEEE Intelligent Vehi-
cles Symposium, 2008, pp. 422–427.
[87] J.-T. Oh, H.-W. Kwak, Y.-H. Sohn, W.-H. Kim, Segmentation and Recognition of
Traffic Signs Using Shape Information, in: ISVC 2005, LNCS 3804, Springer Verlag,
2005, pp. 519–526.
[88] L.-W. Tsai, Y.-J. Tseng, J.-W. Hsieh, K.-C. Fan, J.-J. Li, Road Sign Detection Using
Eigen Color, in: ACCV 2007, Part I, LNCS 4843, Springer Verlag, 2007, Ch. 169-
179.
[89] A. d. l. Escalera, J. M. Armingol, J. M. Pastor, F. J. Rodriguez, Visual Sign Infor-
mation Extraction and Identification by Deformable Models for Intelligent Vehicles,
IEEE Transactions on Intelligent Transportation Systems 5 (2) (2004) 57–68.
[90] J. Turan, M. Fifik, L. Ovsenik, Transform Based System for Traffic Sign Recognition,
in: Proc. 15th Internat. Conf. on Systems, Signals and Image Processing (IWSSIP),
2008, pp. 441–444.
[91] N. Serrano, A. Savakis, J. Luo, A computationally efficient approach to in-
door/outdoor scene classification, Proc. of the Int. Conference on Pattern Recog-
nition 04 (2002) 146–149.
[92] M. Szummer, R. W. Picard, Indoor-outdoor image classification, in: IEEE Interna-
tional Workshop on Content-based Access of Image and Video Databases, in con-
junction with ICCV’98, 1998, pp. 42–51.
[93] A. Vailaya, A. Jain, H. J. Zhang, On image classification: City vs. landscape, in:
IEEE Workshop on Content - Based Access of Image and Video Libraries, 1998, pp.
3–9.
[94] M. Gorkani, R. W. Picard, Texture orientation for sorting photos "at a glance", in: Proc. ICPR, 1994, pp. 459–464.
[95] K. Hotta, Computer Vision Systems, Vol. 5008/2008 of Lecture Notes in Computer
Science, Springer, Berlin, Heidelberg, 2008, Ch. Scene Classification Based on
Multi-resolution Orientation Histogram of Gabor Features, pp. 291–301.
[96] A. Bosch, A. Zisserman, X. Munoz, Scene classification via plsa, Proceedings of the
European Conference on Computer Vision.
[97] D. Gökalp, S. Aksoy, Scene classification using bag-of-regions representations, in:
Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
CVPR ’07, Vol. 1, 2007, pp. 1–8.
[98] A. Oliva, A. Torralba, A. Guerin-Dugu, J. Herault, Global semantic classification of
scenes using power spectrum templates (1999).
[99] A. Oliva, Gist of the scene, in: L. Itti, G. Rees, J. K. Tsotsos (Eds.), The Encyclopedia
of Neurobiology of Attention, Elsevier, San Diego, CA, 2005, pp. 251–256.
[100] A. Oliva, A. Torralba, Modeling the shape of the scene: A holistic representation of
the spatial envelope, International Journal of Computer Vision 42 (3) (2001) 145–
175.
[101] M. Wilson, Interclass fuzzy rule generation for road scene recognition from colour
images, Proc. of the 9th Int. Conference on Computer Analysis of Images and Pat-
terns 2124 (2001) 692–699.
[102] A. Oliva, P. Schyns, Coarse blobs or fine edges? evidence that information diag-
nosticity changes the perception of complex visual stimuli, Cognitive Psychology 34
(1997) 72–107.
[104] M. Lighthill, An introduction to Fourier analysis and generalized functions, 1st Edi-
tion, Cambridge University Press, Cambridge, MA, USA, 1958.
[105] L. Fahrmeir, R. Künstler, I. Pigeot, G. Tutz, Statistik - Der Weg zur Datenanalyse,
6th Edition, Springer, Heidelberg, Berlin, 2007.
[107] S. Russell, P. Norvig, Artificial Intelligence: A Modern Approach, 2nd Edition, Pren-
tice Hall, Upper Saddle River, 2002.
[109] J. Heikkila, O. Silven, A four-step camera calibration procedure with implicit image
correction (1997).
[111] K. Konolige, Small Vision System: Hardware and implementation, in: Eighth Inter-
national Symposium on Robotics Research, 1997.
[112] A. Broggi, Robust real-time lane and road detection in critical shadow conditions,
in: Proc. Int. Symp. on Computer Vision, IEEE, Parma, 1995.
[114] T. Dang, S. Kammel, C. Duchow, B. Hummel, C. Stiller, Path planning for au-
tonomous driving based on stereoscopic and monoscopic vision cues, in: IEEE
Proceedings of the 2006 American Control Conference, 2006, pp. 191–196.
[116] C. Rotaru, T. Graf, J. Zhang, Extracting road features from color images using a
cognitive approach, in: IEEE Intelligent Vehicles Symposium, 2004.
[118] N. Apostoloff, A. Zelinsky, Robust vision based lane tracking using multiple cues
and particle filtering, in: IEEE Intelligent Vehicles Symposium, 2003.
[119] T. Hong, T. Chang, C. Rasmussen, M. Shneier, Road detection and tracking for au-
tonomous mobile robots, in: Proceedings of SPIE Aerosense Conference, 2002.
[120] Y. Sha, G. Zhang, Y. Yang, A road detection algorithm by boosting using feature
combination, in: IEEE Intelligent Vehicles Symposium, Istanbul, 2007.
[121] P. Lombardi, M. Zanin, S. Messelodi, Unified stereovision for ground, road and ob-
stacle detection, in: IEEE Intelligent Vehicles Symposium, 2005.
[122] X. Lin, S. Chen, Color image segmentation using modified HSI system for road
following, in: IEEE International Conference on Robotics and Automation, 1991.
[123] C. Rasmussen, Combining laser range, color and texture cues for autonomous road
following, in: IEEE International Conference on Robotics and Automation, Wash-
ington DC, 2002.
[126] P. Smuda, R. Schweiger, H. Neumann, W. Ritter, Multiple cue data fusion with par-
ticle filters for road course detection in vision systems, in: IEEE Intelligent Vehicles
Symposium, Tokyo, 2006.
[127] M. Chern, S. Cheng, Finding road boundaries from the unstructured rural road scene,
in: 16th IPPR Conference on Computer Vision, Graphics and Image Processing,
2003.
[131] A. Broggi, P. Grisleri, A software video stabilization system for automotive oriented
applications, in: Procs. IEEE Vehicular Technology Conference, Stockholm, Swe-
den, 2005.
[132] J. Adamy, Fuzzy-Logik, Neuronale Netze und Evolutionäre Algorithmen, Shaker
Verlag, Aachen, 2007.
[133] A. Gepperth, B. Mersch, C. Goerick, J. Fritsch, Color object recognition in real-
world scenes, in: J. Marques de Sa et al. (Eds.), Artificial Neural Networks, 17th International Conference ICANN, Part II, Lecture Notes in Computer Science, Springer Verlag, Berlin Heidelberg New York, 2007, pp. 583–592.
[134] E. Bauer, R. Kohavi, An empirical comparison of voting classification algorithms:
Bagging, boosting, and variants, Machine Learning 36 (1-2) (1999) 105–139.
[135] R. Kalman, A new approach to linear filtering and prediction problems, Transactions
of the ASME–Journal of Basic Engineering 82 (Series D) (1960) 35–45.
[136] M. Nieto, L. Salgado, F. Jaureguizar, J. Cabrera, Stabilization of inverse perspective
mapping images based on robust vanishing point estimation, in: IEEE Intelligent
Vehicles Symposium, 2007.
[137] V. Willert, M. Toussaint, J. Eggert, E. Körner, Uncertainty optimization for robust
dynamic optical flow estimation, in: Proceedings of the 2007 International Confer-
ence on Machine Learning and Applications (ICMLA), IEEE, 2007.
[138] X. Li, X. Yao, Y. Murphey, R. Karlsen, G. Gerhart, A real-time vehicle detection and
tracking system in outdoor traffic scenes, in: Proceedings of the 17th International
Conference on Pattern Recognition, 2004.
[139] M. Cech, W. Niem, S. Abraham, C. Stiller, Dynamic ego-pose estimation for driver
assistance in urban environments, in: IEEE Intelligent Vehicles Symposium, 2004,
pp. 43–48.
[140] R. Wang, B. Gu, T. Yu, L. Gou, Study on curb detection method based on 3d range
image by laserradar, in: IEEE Intelligent Vehicles Symposium, Las Vegas, 2005.
[141] F. Oniga, S. Nedevschi, M. Meinecke, Curb detection based on a multi-frame per-
sistence map for urban driving, in: IEEE Conference on Intelligent Transportation
Systems, Beijing, 2008.
[142] R. Aufrere, C. Mertz, C. Thorpe, Multiple sensor fusion for detecting location of
curbs, walls, and barriers, in: IEEE Intelligent Vehicles Symposium, 2003.
[143] R. Danescu, S. Nedevschi, M. Meinecke, T. To, A stereovision-based probabilistic
lane tracker for difficult road scenarios, in: IEEE Intelligent Vehicles Symposium,
Eindhoven, 2008.
[144] B. Roessler, K. Fuerstenberg, U. Lages, Laserscanner for multiple applications in
passenger cars and trucks, in: Proceedings of AMAA 2006, 10th International Con-
ference on Advanced Microsystems for Automotive Applications, Berlin, 2006.
[145] F. Oniga, S. Nedevschi, M. Meinecke, T. To, Road surface and obstacle detec-
tion based on elevation maps from dense stereo, in: IEEE Conference on Intelligent
Transportation Systems, Seattle, 2007.
[148] M. von Trzebiatowski, A. Gern, U. Franke, U.-P. Kaeppeler, P. Levi, Detecting re-
flection posts - lane recognition on country roads, in: IEEE Intelligent Vehicles Sym-
posium, 2004.
[153] K. Manz, D. Koos̈, K. Klinger, S. Schellinger, Entwicklung von Kriterien zur Bewer-
tung der Fahrzeugbeleuchtung im Hinblick auf ein NCAP für aktive Fahrzeugsicher-
heit, Universität Karlsruhe, Lichttechnisches Institut, 2007.
In: Computer Vision ISBN 978-1-61209-399-4
Editor: Sota R. Yoshida © 2011 Nova Science Publishers, Inc.
Chapter 4
Abstract
Open air is a challenging environment where few computer vision applications
succeed. The reasons behind these difficulties must be found in the uncontrolled nature
of open air applications. Computer vision is a passive technology. Typically a camera
mounted on a mast observes the scene while using "clever" algorithms to interpret the images. However, this interpretation is no easy task, since almost any type of moving or static object may appear in the scene. Also, there are many circumstances
which are not under direct control of the system: illumination changes caused by
passing clouds, movements of the sun or day / night transitions; presence of shadows
and moving artificial lights; camera artifacts resulting from blooming or smearing;
overreaction of camera compensation devices; occlusions; vibrations of the camera;
weather elements, such as: rain, snow or fog; etc.
All these circumstances combined turn computer vision in open scenarios into a
formidable task. Only those applications working under very constrained conditions
and with very concrete objectives will have any opportunity to succeed. In this chapter
we present a computer vision based traffic monitoring system able to detect individual
vehicles in real-time. Our fully integrated system first obtains the main traffic variables
of the vehicles: counting, speed and category; and then computes a complete set of
statistical variables. The objective is to investigate alternative solutions to some of
the difficulties that prevent existing traffic systems from achieving balanced accuracy in every
condition; most notably: day and night transitions, moving shadows, heavy vehicles,
occlusions, slow traffic and congestions.
1. Introduction
The use of Intelligent Transport Systems (ITS) is progressively becoming more important
for the efficient management of road traffic infrastructures. These systems comprise a great
number of technical aids and management strategies from different domains. In this chapter
we focus our attention on a set of advanced sensing technologies categorized under the term
Intelligent Vehicle Highway Systems (IVHS). The aim is to describe an adaptive, real-time,
computer vision based, traffic monitoring system designed to detect and track vehicles in
road traffic scenarios.
Our system is able to robustly detect individual vehicles, measure their speed and iden-
tify their category. Based on these microscopic variables, the system computes a complete
set of statistical parameters; such as: average speed, traffic intensity or flux, vehicle head-
way, occupancy, traffic density, etc. Traffic authorities typically use these data in public
information services, traffic management, alert detection and infrastructure planning.
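As a rough sketch of how such statistical parameters can be derived from the per-vehicle measurements (field names and the exact set of outputs are illustrative assumptions, not the system's actual interface):

```python
def traffic_statistics(vehicles, interval_s):
    """Aggregate per-vehicle records into macroscopic traffic variables.

    vehicles: list of dicts with 'speed_kmh' and 'occupancy_s' (time the
    vehicle occupied the measurement zone) collected over one interval.
    """
    n = len(vehicles)
    flow_veh_h = n * 3600.0 / interval_s                        # traffic intensity
    mean_speed = sum(v['speed_kmh'] for v in vehicles) / n if n else 0.0
    occupancy = sum(v['occupancy_s'] for v in vehicles) / interval_s
    density = flow_veh_h / mean_speed if mean_speed else 0.0    # vehicles per km
    headway = interval_s / n if n else float('inf')             # mean time headway
    return dict(count=n, flow_veh_h=flow_veh_h, mean_speed_kmh=mean_speed,
                occupancy=occupancy, density_veh_km=density, headway_s=headway)
```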
Computer vision presents significant advantages over other more traditional vehicle
measurement technologies (i.e. current loops). Computer vision systems are more flexible,
less invasive, more precise, more robust, are easier to maintain, produce richer information,
do not affect the integrity of the road and offer, as an added bonus, the possibility to transmit
images for human supervision.
Computer vision applied to traffic has been investigated since the late 80’s, but there
is still intense research work going on. The bibliography dealing with this subject is huge.
There are lots of good scientific publications [1, 2, 3] and a few very systematic reports
on massive testing experiences [4, 5, 6]. It is also possible to find a certain number of
commercial systems [7, 8].
Unfortunately, despite its undeniable interest, computer vision is not massively used in
traffic monitoring applications, since existing systems still suffer from poor reliability, high
cost and unbalanced accuracy. The reason is that computer vision systems are strongly affected
by weather and illumination conditions. Their accuracy is seriously limited under changing
weather and it is not uncommon to find blocked systems when conditions are adverse. In
addition, not every system is able to work unattended 24x7; i.e. many systems cannot op-
erate at night or in the periods between day and night. On the other hand, most systems are
unable to cope with slow traffic or congestions and show important difficulties to correctly
detect heavy vehicles. Occlusions and shadows are also important problems, causing lots
of errors.
The system we present [9, 10] is autonomous, works unattended for long periods of
time and adapts automatically to the changing environmental conditions. Several innova-
tions, designed to deal with the above circumstances, are proposed: an integrated calibration
and image rectification step, differentiated methods for day and night operation, an adap-
tive segmentation algorithm with automatic illumination compensation [11], a multistage
shadow detection method [12] and special considerations for heavy vehicle identification
and treatment of slow traffic. A specific methodology, developed to benchmark the accu-
racy of the different methods proposed, is presented at the end of the chapter.
which are of a bright color. The road is assumed to be static and free of obstacles. The area
under surveillance is typically between 10 and 20 meters long and may cover 1 to 3 lanes
per camera.
Certainly our main interest is in the road and in the vehicles running along the road.
However, in the images there appear other elements standing outside the road area, which
might affect the detection of the vehicles: trees, traffic signals, lamp posts, bridges, etc.
Those elements may be static or moving and might also project static or moving shadows
on to the surveillance area.
The traffic scenario is influenced by the weather elements: rain, snow, fog; but also by
changing illumination conditions resulting from the movement of the sun and by passing
clouds. Weather elements may have different effects on the image. Rain causes speckle
noise and reduces the perceived intensity in the image, but otherwise is not a big problem if a steep viewing angle is used for the camera. Snow increases the general intensity levels,
while causing lots of reflections. Finally, fog reduces the contrast and visibility in the image
to the point that it might blind the system.
Variations in the illumination sources are also an important effect to be considered.
The movement of the sun is the cause of a slow, systematic, variation in the intensity and
the direction of the received illumination. This variation affects not only the intensity but
also the chromaticity of the light reflected by the objects and determines the shape, size
and direction of the shadows. Passing clouds also contribute to change the illumination
levels in accordance with the movement of the cloud. Obviously, at night, there is no sun
light. At night, traffic scenes may be dark or illuminated by artificial sources. In this last
case illumination is diffuse, reasonably stable and comes from multiple sources located at
fixed positions. While sun light shows a wide spectrum, artificial lighting is concentrated
in a very narrow band. In the periods in between day and night a mixture of the above
conditions is found.
Shadows are another important phenomenon we need to take into account. Shadows may
be cast by the vehicles or by the objects standing outside the surveillance area. Vehicles’
shadows are always moving and joined to the vehicle. On the other hand, the objects located
outside the surveillance area are usually static, but they can be moved by the wind, thus
causing moving shadows. Under sun lighting each object casts a single shadow, but when
artificial light is used multiple shadows per object are possible.
Next we need to consider the vehicles themselves. Vehicles are moving objects that
can have any length between 3 and 15 meters and any color. Statistically, white cars dom-
inate. Vehicles are composed of different parts, having different chromatic appearance:
body, windows, tires, etc. Most vehicles present parts darker than the background and in
many cases also brighter, due to illumination effects. Vehicles show very characteristic
movements; they may change lanes, but enter and leave the image from the same direction
and their speed can be considered uniform since, statistically, acceleration is little relevant
in the surveillance area. Vehicles are solid objects; they never overlap, nor do they change
positions in the lane. Most vehicles show cast shadows darker than the background, whose
shape, size and orientation will depend on the illumination source and the atmospheric con-
ditions. Vehicles’ shadows can, in general, be differentiated from shadows cast by elements
external to the surveillance area by their movement.
The final element we will consider in the physical scenario is the camera system. Most
frequently, the camera is mounted on top of a pole at the side of the road, but can also be
placed on a bridge or similar supporting structure crossing the road (see Fig. 1).
Figure 1. General layout of a traffic monitoring scenario showing typical camera configura-
tions: Left, Right and Central camera positions. The diagram also presents the layout of the
Control Area (filled) and Surveillance Area (dashed); as well as the four reference points
R̃i used for calibration. W and L are respectively the width and length of the Surveillance
Area, in meters.
Figure 2. Diagram showing the viewing angle observed by the camera and the ability of the
system to discern the separation between two vehicles.
As depicted in Figure 2, the ability of the system to tell the difference between two
vehicles separated by a distance d is a function of the distance D from the camera to the
vehicles, the height of the camera H and the height of the vehicle h; according to the
following formula:
d = \frac{h \cdot D}{H} \qquad (1)
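As an illustration with assumed, not measured, values: for a camera mounted at H = 10 m observing vehicles of height h = 1.5 m at a distance D = 60 m, two vehicles can only be told apart if the gap between them is at least

d = \frac{h \cdot D}{H} = \frac{1.5 \times 60}{10} = 9 \text{ m},

so vehicles following each other more closely than that will merge into a single blob at that range.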
• The shape of vehicles will be distorted. The distortion will acquire different aspects
depending on the intrinsic and extrinsic parameters of the camera. If we assume
vehicles can be approximated by regular parallelepipeds, the distortion will produce
the following effects: the base and top of the vehicle will be deformed (i.e. they will
no longer be rectangular); its height will appear projected to the left side (camera
placed on the right side of the road), to the right side (camera placed on the left side
of the road), or to the front / rear (camera in a front or back central position).
• The size of the vehicles will no longer remain constant in the image, but will depend
on their position in the scenario (the more distant from the camera the smaller they
will be represented in the image). In the same way, shape deformations will also
depend on the distance to the camera.
Figure 3. A representation of the Pinhole camera model [16], showing the layout of the
different reference systems.
• The top / bottom borders of the vehicles will show different relative movements in
the image sequence. Their deformation will depend on their height (i.e. If we take the
plane of the road as our reference, top borders will be affected by the deformation in
height, while bottom borders will not). As a consequence top borders will appear to
move faster than their bottom counterpart. This is the reason why tracking methods
tend to use the back-bottom border (the front-bottom border is always occluded in
front views) to track the vehicle.
• Occlusions can be longitudinal or lateral and they are almost impossible to avoid in
certain circumstances. In free traffic conditions occlusions are not a big problem,
since the distance between vehicles is sufficient to avoid most occlusions. However,
under congested traffic, occlusions caused by large vehicles become more problem-
atic. Occlusions are one of the most difficult problems to be found in computer vision
traffic monitoring systems and minimizing occlusions must be one of the leading ob-
jectives of any method proposed.
• The shape of the Surveillance Area, as seen in the image, will be distorted according
to perspective projection rules (i.e. a rectangular Surveillance Area will no longer
remain rectangular in the image).
• In most traffic systems the length of the Surveillance Area is short enough so that the
speed of vehicles can be assumed constant. However, due to perspective projection
effects, the speed of the vehicle measured in the image is no longer constant; vehicles
move faster in the front plane than they do in the horizon.
• Rectilinear paths in the scenario remain rectilinear in the image. However, vehicles’
trajectories are no longer parallel and they cross at the vanishing point.
• When the camera is moved, e.g. when it is shaken by the wind or when the supporting structure vibrates, the image of the scene also moves.
• The image sensor suffers smearing or blooming problems when working close to its
limits. Those problems typically happen when one of the elements of the sensor gets
saturated due to reflections or direct sun light. The effect in the image is a saturated
line of pixels or a white patch. The use of polarized filters greatly contributes to
mitigate this effect.
Figure 4. The computer vision chain of the TRAFFIC system. Notice how the system
parameters are updated for every incoming image.
s\,\tilde{m} = P\,\tilde{M} \qquad (2)
where M̃ = {X, Y, Z, 1} and m̃ = {x, y, 1} are represented in homogeneous coordinates,
and s is a scale factor.
One of the simplest methods to calculate the World-to-Image transformation is to start
from the projection equation (2). In the general case, the projection matrix P has 3×4 = 12
elements, but only 11 degrees of freedom, since it is defined up to a scale factor. In our case
we are interested only in points located in the plane of the road. For that reason, we define,
for convenience, the Z axis perpendicular to the road plane and assume the coordinates of
the scene points to be always Z = 0. Under such simplification, we may remove s from
(2) and set, without loss of generality, P34 = 1. As a result, we obtain a system of two
equations with 8 unknowns for every pair of corresponding points [M̃, m̃]:
X = \frac{\begin{vmatrix} c & b \\ f & e \end{vmatrix}}{\begin{vmatrix} a & b \\ d & e \end{vmatrix}}, \qquad
Y = \frac{\begin{vmatrix} a & c \\ d & f \end{vmatrix}}{\begin{vmatrix} a & b \\ d & e \end{vmatrix}}, \qquad
\text{with } \begin{aligned}
a &= P_{11} - P_{31}x, & d &= P_{21} - P_{31}y, \\
b &= P_{12} - P_{32}x, & e &= P_{22} - P_{32}y, \\
c &= x - P_{14}, & f &= y - P_{24}
\end{aligned} \qquad (4)
Using (4), and provided the determinant in the denominator is not null, we are able to
calculate the world coordinates of every image pixel. We must emphasize, once more, that the
above transformation is only valid for points placed on the plane of the road.
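A minimal sketch of this back-projection, assuming a known 3 × 4 projection matrix P normalized so that P34 = 1 (function and variable names are hypothetical):

```python
import numpy as np

def pixel_to_road(P, x, y):
    """Back-project image pixel (x, y) onto road-plane coordinates (X, Y, 0).

    P is the 3x4 projection matrix of (2), normalized so that P[2, 3] = 1.
    Implements the closed-form solution (4); valid only for points that
    actually lie on the road plane (Z = 0).
    """
    a = P[0, 0] - P[2, 0] * x
    b = P[0, 1] - P[2, 1] * x
    c = x - P[0, 3]
    d = P[1, 0] - P[2, 0] * y
    e = P[1, 1] - P[2, 1] * y
    f = y - P[1, 3]
    den = a * e - b * d
    if abs(den) < 1e-12:
        raise ValueError("degenerate configuration: determinant is zero")
    X = (c * e - b * f) / den
    Y = (a * f - c * d) / den
    return X, Y
```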
• Vehicle identification becomes more complex since the shape of the vehicle is de-
formed and its deformation changes with time. Also the size of the vehicle will
depend on its position. These facts affect not only the initial detection of the vehi-
cle, but also the matching of the successive instances of the vehicle in the respective
images of the sequence.
• Tracking vehicles is more difficult since they no longer observe parallel trajectories
and the vehicles’ apparent speed will depend on its position in the image.
• Since the Surveillance Area is no longer regular, it becomes more difficult to isolate
vehicles from their surrounding scenario; i.e. from objects located outside the road.
• Most traffic systems sub-sample the input images in order to reduce redundant infor-
mation and speed-up computation. However the reduction in the size of the vehicles
due to perspective greatly affects the required resolution of the Work Image. One
usual problem is that the resolution is far too high for vehicles in the front part of the image,
while vehicles close to the horizon are only represented by a few pixels.
Not all these effects are treated adequately in existing computer vision systems [23,
24]. Some approaches cope with perspective deformations by designing more complex
Segmentation and Tracking algorithms, others simply ignore them, or provide empirical
solutions suitable only for very specific circumstances. In any case, perspective effects,
if not taken into account, result in more computing overhead, decreased robustness and
less accurate measurements. In our case, we compensate some of the above difficulties by
means of the innovative Image Rectification method proposed below.
Perspective deformations cannot be avoided completely. However, they can be miti-
gated, to a great extent, if a bird’s eye view over the road (camera pointing vertically to the
plane of the road) can be selected. Placing the camera in this position is, in general, not
possible. However, it is feasible to construct a virtual bird’s eye view from a lateral view
by applying Image Rectification techniques. On the other hand, the Image Rectification
process has its limitations since it is only possible to transform correctly those points in the
scene lying on a single plane (in our case the road plane). Conversely, points outside the
road plane will remain distorted.
Figure 5. Example showing how a rectangle located inside the Surveillance Area of the
Input Image is transformed to the reference frame of the Work Image, using the rectification
method implicit in (6). Notice how the boundaries of the Surveillance Area are set in
correspondence with the limits of the Work Image. M and N are respectively the horizontal
and vertical resolution of the Work Image.
According to basic Epipolar Geometry theory [17], a certain relation can be found be-
tween two perspective views. In particular, both views can be real camera views or, as in our case, one of them can be virtual. Given the coordinates of a World point in each of the two views, say {m ∈ I | m = (x, y, 1)} and {m′ ∈ I′ | m′ = (x′, y′, 1)} in
homogeneous coordinates, it can be verified [25] that in the general case the following relation holds:
m^{T} F\, m' = 0 \qquad (5)
where F is the Fundamental Matrix. The 3 × 3 matrix F is of rank 2 and is defined up to a
scale factor. In our case we are only interested in points lying on the road plane. Hence an
additional constraint applies, which represents what is usually named a collineation [25];
thus determining the following relation between points in correspondence in the two views:
m = A\, m' \qquad (6)
where m is a point in the real image and m′ is the corresponding point in the Virtual Image we desire to create. The 3 × 3 matrix A represents the collineation and is also defined up to a scale factor. In that way, it is sufficient to take four pairs of corresponding points {mi, m′i} in general position (i.e. no three of them aligned) to obtain a system of 8 equations with 8 unknowns that allows us to calculate the 8 independent coefficients Ai,j.
Now, if we take the four points mi to be the vertexes of the Surveillance Area in the Input Image and m′i the vertexes defining the limits of the Virtual Image, we can use Least Squares to calculate the coefficients Ai,j that will allow us to perform the desired image transformation (see Fig. 5). Once the Ai,j coefficients are known, we may construct the Virtual Image by assigning to every point m′i ∈ I′ the pixel value of its corresponding point mi ∈ I using (6). We denote this Virtual Image as the Work Image. Once the Work Image
becomes available, the Input Image can be discarded and all subsequent processing steps
will be performed over the Work Image.
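A minimal sketch of this rectification step, assuming OpenCV is available; with exactly four point pairs in general position the collineation is determined up to scale, and cv2.getPerspectiveTransform solves for it directly (corner ordering and image sizes are assumptions):

```python
import cv2
import numpy as np

def build_work_image(input_img, surveillance_vertexes, work_size):
    """Rectify the Surveillance Area of the Input Image into the Work Image.

    surveillance_vertexes: four (x, y) corners of the Surveillance Area in the
    Input Image, ordered top-left, top-right, bottom-left, bottom-right.
    work_size: (M, N) horizontal / vertical resolution of the Work Image.
    """
    M, N = work_size
    src = np.float32(surveillance_vertexes)
    dst = np.float32([(0, 0), (M - 1, 0), (0, N - 1), (M - 1, N - 1)])
    # A maps Input Image points onto Work Image points (the inverse direction
    # of the collineation in (6)); four pairs determine it up to scale.
    A = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(input_img, A, (M, N))
```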
The method proposed will have the effect of obtaining the Work Image as a rectified
and sub-sampled version of the Input Image. The trick will be to select the appropriate
shape and dimensions of the Surveillance Area and Work Image so that the transformation
produces exactly the effect we are expecting (i.e. a top view, with z = 0). For that purpose,
the Surveillance Area must be rectangular and will enclose strictly that part of the road we
desire to monitor (thus avoiding the inclusion of parts of the scenario outside the road). On
the other hand, the Work Image must match exactly the image resolution we desire to obtain
in the final result.
Figure 6. Screenshot of the Calibration Tool. The Control and Surveillance Areas are de-
fined over the full resolution Input Image (720x576 pixels); while the metrical references
are taken on the rectified Work Image (256x192 pixels). Notice how the Work Image, com-
puted from the Input Image after sub-sampling and geometry correction, encodes only that
part of the Input Image enclosed inside the Control Area.
The method is started by defining two rectangular areas in the scenario: the Control
Area and the Surveillance Area. Both areas must be rectangular in World coordinates. We
first select the vertexes of the Control Area using a graphical pointer over the Input Image.
The Control Area must be selected so that it covers strictly that part of the road that must
be supervised; i.e. we avoid including in the Control Area parts of the scene outside the
road. The vertexes will be selected using existing references in the scenario (i.e. the traffic
lanes), or, more appropriately, by placing four traffic cones on the road as physical points of
reference.
Next, we define the vertexes of one or more Surveillance Areas using the same pro-
cedure. The Surveillance Area has been defined as a subset of the Control Area, which
comprises strictly that part of the road (typically one or more traffic lanes) that will be rec-
tified and processed by the computer vision algorithms; i.e. while the complete Control
Area is calibrated, only the objects inside the Surveillance Area are processed. Following
this approach we can change the area under supervision remotely without re-calibrating the
camera; we just need to select another Surveillance Area.
Finally, the system requires determining the relative distance (in meters) between two
widely spaced points in the longitudinal and transversal directions of the road. The most
common approach would be to measure the distance between the vertical and horizontal
cones used for reference. This method is highly accurate and minimally invasive for the traffic.
On the other hand we could use existing references (i.e. known distances between road
markings, known size of passing vehicles, etc) to achieve similar results. This last method
is very convenient since we are able to calibrate the camera remotely; however, it will be
less precise. Distance measurements are taken using a pair of graphical sliders over the
Work Image, after the rectification process has been correctly achieved.
Once the Control and Surveillance Areas, as well as the metrical references, have been defined, the Calibration Tool computes the transformation matrices. In our case we have three reference frames, corresponding to the World, the Input Image and the Work Image respectively. The Work Image has been introduced as a rectified and scaled projective transformation of the Input Image, with the aim of arranging the content of the images in a way that is more appropriate for the needs of computer vision algorithms. In consequence, we will need to provide
matrices for the most common transformations between any two reference frames. As de-
scribed in section 4., our method will only require four Reference Points and two distance
measurements in the World coordinate system to obtain all required relations. The result
will be a set of structures that will allow the computation of the required transformations in
real-time.
The four Reference Points R̃i are set in correspondence with the four vertexes delimiting
the Surveillance Area (see Fig. 1). The system will use R̃i to compute the Projection matrix
P using the procedure defined in section 4.1. We will use the Z = 0 assumption; hence
the coordinates of the M̃i points in (2) will be set to: {(0, 0), (W, 0), (0, L), (W, L)}; where
W and L are respectively the width and length of the Control Area, measured in meters.
Obviously the m̃i points will be the pixel coordinates of the corresponding reference points
on the Input Image.
Next, the system will use the image pixels m̃i corresponding to the reference points
R̃i to compute the Fundamental matrix F using the procedure defined in section 4.2. The points m̃′i in (6) will be selected in correspondence with the vertexes of the Work Image: {(0, 0), (M, 0), (0, N), (M, N)}; where M and N are the desired horizontal and vertical
resolutions of the Work Image respectively.
Figure 7. In the figure we show transformations between different reference systems. Only
the transformations Twk,wd , Twk,in and the respective inverses are, in practice, codified into
the array of pointers structure.
Once P and F are known, we are able to transform between the respective coordinate
systems. In practice (Figure 7), only the following transformations will be required: Work
to Input Image Transformation (Twk/in), Work to World Transformation (Twk/wd) and their
inverses.
One of our basic constraints was to be able to perform the transformations in real time.
Even if traffic images are typically not very large, computing the transformations on-the-
run would introduce unacceptable levels of overhead. For that reason, each transformation
above is computed off-line and its results encoded into a two-dimensional array of pointers.
This computation is done only once at start-up and, in general, need not be repeated, unless
the camera is moved.
If we examine the transformation Twk/in, we notice that the pointers act as a look-up table, so that when one pixel {x′, y′} is indexed into the Twk/in structure, or Work Image for short, the pointer returns the intensity value of the corresponding pixel in the Input Image. This is an elegant way to perform the image transformation, while adding very little overhead. Similar structures are created for the other transformations; e.g. indexing the point {x′, y′} into the Twk/wd structure will provide the three-dimensional coordinates of the point in the world: {X, Y, 0}.
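A sketch of the same idea in array form, where NumPy indexing plays the role of the array of pointers (A_inv, the shapes and the rounding strategy are assumptions; A_inv maps Work Image coordinates to Input Image coordinates):

```python
import numpy as np

def precompute_lut(A_inv, work_shape, input_shape):
    """Precompute, once at start-up, which Input Image pixel every Work Image
    pixel points to (the 'array of pointers' behind Twk/in)."""
    N, M = work_shape
    H, W = input_shape
    xs, ys = np.meshgrid(np.arange(M), np.arange(N))
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(M * N)])   # homogeneous coords
    mapped = A_inv @ pts
    mapped /= mapped[2]                                        # dehomogenize
    map_x = np.clip(np.rint(mapped[0]), 0, W - 1).astype(int).reshape(N, M)
    map_y = np.clip(np.rint(mapped[1]), 0, H - 1).astype(int).reshape(N, M)
    return map_y, map_x

def apply_lut(input_img, map_y, map_x):
    """Per-frame lookup: build the Work Image by plain array indexing."""
    return input_img[map_y, map_x]
```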
At run-time the above transformation structures are already pre-computed. Then for
each Input Image, the corresponding Work Image will be obtained. The Work Image may
remain as a two-dimensional array of pointers or be physically copied into an image structure.
At that point the Input Image can be discarded altogether since the Work Image already
contains all the information that is required to perform the computer vision tasks.
Figure 8. Structure of the Segmentation step. Notice how Segmentation operates on a single
gray level image to produce the Segmentation Masks.
5. Segmentation
Our interpretation of the Segmentation step is a bit different from the approaches found
in some of the references [26, 13, 27]. In our case, the objective of the Segmentation
step is limited to extracting the moving objects of the scene and performing a preliminary
classification. That is, Segmentation does not attempt to identify shadows or vehicles, since we consider that at this stage the objects cannot be identified unambiguously and we understand that the
final decision must be delayed to later stages when the complete set of historical information
becomes available.
Once the Work image has been created we segment the image by extracting the moving
objects using an adaptation of well-known Background Suppression [28, 29, 30] techniques.
D_t(x, y) = I_t(x, y) - B_{t-1}(x, y); \qquad -255 < D_t(x, y) < 255 \qquad (7)
where I_t and B_{t-1} are respectively the input image at time t and the background image from the previous step. Our segmentation approach
defines three categories: BCKGND, BRIGHT (potentially a pixel that belongs to a vehicle)
and DARK (potentially a pixel that belongs to a moving shadow or a vehicle); according to
the following classification:
U_t(x, y) = \begin{cases} \text{BRIGHT} & \text{if } D_t(x, y) > T_b \\ \text{DARK} & \text{if } D_t(x, y) < T_d \\ \text{BCKGND} & \text{otherwise} \end{cases} \qquad (8)
T_b and T_d are the thresholds for the BRIGHT and DARK labels respectively. Both thresholds
are calculated automatically using the procedure described in section 5.4.. Once the indi-
vidual pixels have been labeled, contiguous pixels with the same label are grouped together
using a 4-connected blob-coloring algorithm. The result of the one-pass grouping algorithm is a list of object masks complete with their most relevant descriptors: geometric
center, mass center, enclosing rectangle, size in pixels, etc.
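A compact sketch of (7)-(8) followed by 4-connected grouping, using scipy.ndimage for the blob coloring (thresholds and descriptor names are placeholders, not the system's actual values):

```python
import numpy as np
from scipy import ndimage

BCKGND, DARK, BRIGHT = 0, 1, 2

def segment(work_img, background, t_bright, t_dark):
    """Three-label segmentation of (7)-(8) plus 4-connected blob grouping."""
    diff = work_img.astype(int) - background.astype(int)   # eq. (7)
    labels = np.full(diff.shape, BCKGND, dtype=np.uint8)
    labels[diff > t_bright] = BRIGHT                       # eq. (8)
    labels[diff < t_dark] = DARK
    masks = []
    for cls in (BRIGHT, DARK):
        blobs, n = ndimage.label(labels == cls)            # 4-connectivity by default
        for i in range(1, n + 1):
            ys, xs = np.nonzero(blobs == i)
            masks.append({
                'label': cls,
                'size': xs.size,
                'center': (xs.mean(), ys.mean()),
                'bbox': (xs.min(), ys.min(), xs.max(), ys.max()),
            })
    return labels, masks
```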
In contrast with [13], where early classification decisions may result in imperfect
shadow removal and poor identification of dark vehicles, in (8) the objective is to preserve
the compactness of the object masks and prevent the undesired removal of the dark part
of the vehicles. This three-label approach makes it possible to delay the decision to remove the shadows until the tracking stage, where the relative temporal movement of vehicles and
shadows can be better analyzed. This approach, combined with the automatic adaptation of
the thresholds, allows for a simpler, faster and more robust segmentation (see Fig. 9).
The description above is meant for monochrome images. When dealing with color
images, we use an HSV color space. In this case the image of differences (DH and DV) is
calculated for the H and V channels independently and the segmentation criterion is modified
as follows:
U_t(x, y) = \begin{cases} \text{BRIGHT} & \text{if } DV_t(x, y) > TV_b, \text{ or if } DV_t(x, y) < TV_b \wedge |DH_t(x, y)| > T_H \\ \text{DARK} & \text{if } DV_t(x, y) < TV_d \wedge |DH_t(x, y)| \le T_H \\ \text{BKGND} & \text{otherwise} \end{cases} \qquad (9)
Figure 9. Segmentation results from input images taken at various resolutions in different
ambient conditions: (first row) long shadows at the M30 highway-ring in Madrid, (second
row) dense traffic in a US highway, (third row) heavy vehicle in a German highway.
Then, only the gradients corresponding to moving features will be retained, using the
classification criteria below:
U_t(x, y) = \begin{cases} \text{DARK} & \text{if } GV_t(x, y) < T_a \\ \text{BRIGHT} & \text{if } GV_t(x, y) > |GH_t(x, y)| \wedge GV_t(x, y) > T_b \\ \text{BRIGHT} & \text{if } GV_t(x, y) \ge 0 \wedge GV_t(x, y) \le |GH_t(x, y)| \wedge |GH_t(x, y)| > T_c \\ \text{BKGND} & \text{otherwise} \end{cases} \qquad (12)
where T_a, T_b and T_c are the thresholds on the vertical negative, vertical positive and horizontal gradients respectively. It must be noticed that the vertical negative gradient, denoting the transition
between the front of the vehicle and the headlights, will be given preference in the classifi-
cation.
In this case the DARK label represents a vertical negative transition rather than a dark
image pixel. A vehicle will be identified in the tracking stage when a DARK patch followed by a BRIGHT patch is detected within a predefined distance. In that way the vehicles
can be identified robustly while the multiple shadows caused by artificial illumination are
ignored since shadows cannot, in general, produce the required vertical negative transition
(see Fig. 10). The best results have been obtained empirically by setting δ and γ to 2 pixels.
Only monochrome information is used during night operation.
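A sketch of the night-time criterion (12); since equations (10)-(11) defining GV and GH are not reproduced here, simple finite differences with the offsets δ and γ are assumed, and the thresholds are passed as parameters:

```python
import numpy as np

BKGND, DARK, BRIGHT = 0, 1, 2

def night_segment(diff, t_a, t_b, t_c, delta=2, gamma=2):
    """Sketch of the night-time classification of (12).

    diff: signed image of differences between current frame and background.
    GV / GH are approximated by finite differences with offsets delta
    (vertical) and gamma (horizontal); this choice is an assumption.
    """
    d = np.asarray(diff, dtype=float)
    gv = np.zeros_like(d)
    gh = np.zeros_like(d)
    gv[delta:, :] = d[delta:, :] - d[:-delta, :]
    gh[:, gamma:] = d[:, gamma:] - d[:, :-gamma]

    labels = np.full(d.shape, BKGND, dtype=np.uint8)
    labels[(gv >= 0) & (gv <= np.abs(gh)) & (np.abs(gh) > t_c)] = BRIGHT
    labels[(gv > np.abs(gh)) & (gv > t_b)] = BRIGHT
    labels[gv < t_a] = DARK   # negative vertical transition takes precedence
    return labels
```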
Figure 10. An example of night operation. (left) original image, (center) segmented image
(right) detection masks. A white patch represents a BRIGHT label, while a gray patch
is assigned a DARK label. It must be noticed how under night conditions headlights and
shadows do not survive the Segmentation step.
Figure 11. Construction of the background model; obtained after processing 256 gray level
images.
\hat{x}(t_i) = \tilde{x}(t_i) + K(t_i) \cdot [z(t_i) - H(t_i) \cdot \tilde{x}(t_i)] \qquad (13)

\tilde{x}(t_i) = A(t_i) \cdot \hat{x}(t_{i-1}) \qquad (14)
In the above equations, A(ti ) is the system matrix, H(ti ) is named the measurement
matrix, z(ti ) is the input vector, while K(ti ) is the gain matrix of the Kalman filter. The
operation of the method can be summarized in the following steps: first, we take the predic-
tion of the state of the system computed in a previous step; next we compare the prediction
with the measured value of the parameter; finally we obtain a new estimation by computing
the K(ti ) weighted difference between the prediction and the measured value. The Kalman
filter is recursive, so that every new estimation of the value of the parameter automatically
accumulates the information acquired in previous steps. Hence, we avoid the need to store
historic values individually.
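As a minimal illustration of the predict/correct cycle just described, the following Python sketch applies equations (13)-(14) to a single pixel state, using the simplified matrices introduced later in equations (17)-(18); the numerical values of a_{1,2}, a_{2,2} and α are assumptions chosen only for the example.

import numpy as np

def kalman_background_step(x_hat_prev, z, A, H, K):
    """One predict/correct cycle of equations (13)-(14) for a single pixel state."""
    x_tilde = A @ x_hat_prev                    # prediction, equation (14)
    return x_tilde + K * (z - H @ x_tilde)      # correction, equation (13)

# Example with the simplified matrices of equations (17)-(18); the numbers are assumed.
A = np.array([[1.0, 0.7], [0.0, 0.7]])          # a_{1,2} = a_{2,2} = 0.7
H = np.array([1.0, 0.0])
K = np.array([0.05, 0.05])                      # k1 = k2 = alpha
state = np.array([128.0, 0.0])                  # [background gray level, variation]
state = kalman_background_step(state, z=131.0, A=A, H=H, K=K)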
Next, we will use basic Kalman filtering theory to implement our method for the adap-
tation of the Background Image. We denote by s(x, y, t_i) the intensity value of a pixel at a
given time instant t_i. The estimated value of the pixel in the Background Image at this same
time instant is represented by \hat{s}(x, y, t_i), and the variation of the estimate is denoted
\dot{\hat{s}}(x, y, t_i). According to equation (13), the new state of the system can be computed
as follows:
\begin{bmatrix} \hat{s}(x, y, t_i) \\ \dot{\hat{s}}(x, y, t_i) \end{bmatrix} =
\begin{bmatrix} \tilde{s}(x, y, t_i) \\ \dot{\tilde{s}}(x, y, t_i) \end{bmatrix} +
K(x, y, t_i) \cdot \left[ s(x, y, t_i) - H(x, y, t_i) \cdot
\begin{bmatrix} \tilde{s}(x, y, t_i) \\ \dot{\tilde{s}}(x, y, t_i) \end{bmatrix} \right] \qquad (15)
A = \begin{bmatrix} 1 & a_{1,2} \\ 0 & a_{2,2} \end{bmatrix}
\qquad
H = \begin{bmatrix} 1 & 0 \end{bmatrix} \qquad (17)
where a1,2 = a2,2 = constant and matrix H takes this form because we are dealing with
black and white images only. Finally, the gain matrix transforms into:
K(x, y, t_i) = \begin{bmatrix} k_1(x, y, t_i) \\ k_2(x, y, t_i) \end{bmatrix} \qquad (18)
where k1 (x, y, ti) = k2 (x, y, ti) = α. Now, if we expand the first element in equation (15),
we notice:
static objects must be adapted much faster. In that way, in the case of traffic congestion, the
background gets adapted very fast as soon as the congestion dissolves.
The coefficient α will be re-calculated periodically as a function of the period and the
frame rate. The criterion used is to select the α coefficient that will allow us to achieve the
desired rate of adaptation of the Background Image in a given time interval (assuming the
input image remains constant). Since equation (21) is recursive, we need an estimation
for the Background Image at the starting point. In our case we set the initial values of the
Background Image to B0 (x, y) = I(x, y)/2. Next, if we apply equation (21) backwards
until we reach the time instant m, we obtain the following expression:
Figure 12. The figure shows how the ideal probability distribution representing the static
background is displaced in the current image by an amount P , due to a global illumination
change, caused for example by the actuation of the camera auto-iris. This effect cannot
be detected on the time-averaged histogram since the averaging blurs the history of the
histogram. However, parameters Tb and Td , which represent the adaptation to slow illumi-
nation changes, are calculated from the time-averaged histogram with the aim of avoiding the
noisy influences that can be found in the instantaneous histogram.
In that sense, the width of the time-averaged histogram will provide us an estimation
of the required long term noise tolerance, while the position of the peak of the instant
histogram represents the DC level, i.e. the displacement of the current image difference
levels with respect to the optimal background estimation. The criterion is to first select
the relative thresholds Tb0 and Td0 as 2.5 standard deviations (σ) in both directions of the
time-averaged histogram and then calculate the absolute thresholds Tb and Td at time t as
follows:
Figure 13. Examples of real histograms. Top: accumulated histogram. Middle and bottom:
two examples of instantaneous histograms for the same image sequence. Notice how
the instantaneous histogram is noisier and distorts the real width of the histogram, but
allows us to detect the displacement of the peak (DC level).
The above method is fast, since it requires only two thresholds for the whole image; it
uses both temporal and spatial information; and, since we are using the histogram of differ-
ences and relative thresholds, it does not depend on the illumination levels. The algorithm
correctly handles fast illumination changes and automatically raises the threshold levels in
the presence of noisy images or poor background estimation.
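The following Python sketch illustrates the criterion described above; the histogram representation and the function name are hypothetical, and the exact formulas of the chapter (not reproduced in this extract) may differ in detail.

import numpy as np

def adapt_thresholds(hist_avg, hist_now, bin_values, n_sigma=2.5):
    """Derive the absolute thresholds Tb and Td from the two histograms of differences.
    hist_avg: time-averaged histogram; hist_now: instantaneous histogram of the current
    image of differences; bin_values: difference value represented by each bin."""
    w = hist_avg / hist_avg.sum()
    mean = float(np.sum(bin_values * w))
    sigma = float(np.sqrt(np.sum(w * (bin_values - mean) ** 2)))
    tb0, td0 = n_sigma * sigma, -n_sigma * sigma        # relative thresholds (2.5 sigma)
    peak = float(bin_values[np.argmax(hist_now)])       # displacement (DC level) of the instant histogram
    return peak + tb0, peak + td0                        # absolute thresholds Tb, Td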
Figure 14. (Left) Typical output of a camera sensor with gamma 0.45. (Right) Correction
function used to compensate the nonlinearity of the sensor, approximated by two linear
sections.
The immediate consequence is that similar increases in the incident intensity are trans-
formed differently in the image, depending on the underlying gray level. Some important
errors, such as segmentation faults at the border of the traffic lanes in the presence of sudden inten-
sity changes, can be traced back to inappropriate gamma correction.
The usual way to compensate the nonlinearity of the camera output is to apply an in-
verse function to the gray levels of every pixel in the image. However, this approach is very
expensive computationally. For that reason, we opted to keep the gray levels but modify
instead the segmentation thresholds. This solution is much faster, but it is still computation-
ally intensive. In consequence we decided to approximate the inverse gamma function by
two linear sections, as shown in Figure 14.
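One possible reading of this idea is sketched below in Python: instead of correcting every pixel, the segmentation threshold is scaled by one of two constant factors depending on the underlying gray level. The knee position and the two factors are purely illustrative and not the values used by the system.

def gamma_corrected_threshold(threshold, gray_level, knee=64, factor_dark=2.0, factor_bright=0.8):
    """Scale a segmentation threshold according to the underlying gray level, using a
    two-segment approximation of the inverse sensor response: with a gamma of 0.45 the
    sensor amplifies differences in dark regions, so the threshold is relaxed there and
    tightened in bright regions. The knee and the two factors are illustrative only."""
    return threshold * (factor_dark if gray_level < knee else factor_bright)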
Where D_t is the image of differences we want to compute and i indicates the position of
the image in the temporal sequence (i.e. the image comparison need not necessarily be done
between two consecutive images).
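A minimal sketch of the Temporal Differences computation, assuming the frames are available as NumPy arrays:

import numpy as np

def temporal_difference(frames, t, i=1):
    """Image of differences D_t = I_t - I_{t-i} (signed, float).
    frames: a sequence of gray-level images; i: temporal offset (the comparison
    need not use consecutive frames)."""
    return frames[t].astype(np.float32) - frames[t - i].astype(np.float32)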
The main advantage of this method is that we do not need to compute a Background Image.
This is an important factor to consider since acquiring and maintaining the Background
Image is one of the main sources of difficulties for Background Subtraction methods. Fur-
thermore, since the time between two frames is very short, this method is also free from the
undesirable influence of most (but not all) ambient light variations. However, in spite of its
undeniable advantages, the method also has important drawbacks; as described below.
In Figure 15 we show the result of applying the Temporal Differences algorithm to a
simplified traffic scenario. Similarly to the Background Subtraction method, the system
can only detect the vehicle if it is substantially different from the background in previous
images. In this case when we differentiate two consecutive images we get front and rear
borders separated by a valley region for each segmented vehicle. If the vehicle is lighter
than the background, the front border is made of negative differences, while the rear border
is positive. In the case of a vehicle darker than the background results are just the opposite.
As observed in the picture, even in this “idealized” case, the resulting object masks
are very difficult to discern. Every vehicle is associated with at least two segmented regions.
Furthermore, the size and separation of these regions are not constant, since they depend
on the speed of the vehicle, which is normally unknown. Also, the masks are not coincident
with the real position of the vehicle. For that reason identifying the real size and position
Figure 15. The figure shows the intensity profile of column n of the image (I) at two
different time instants (t_0 and t_1). We also show, at the top of the drawing, the resulting
difference image (D). In the input picture there are two “ideal” vehicles of uniform color:
a white vehicle closely followed by a dark vehicle. N_b is the reference gray
level of the background at this point.
of the vehicle becomes more complex. Finally, occlusions become a critical problem since
the method favors overlapping of masks from different vehicles.
On the other hand, it is obvious that this “idealized” case is very rare in practice. A real
vehicle is heterogeneous and frequently composed of many different parts of different
colors, which turns vehicle identification into a very challenging process. Some authors
[33, 34] have tried to address this problem by computing, for each input image, two differ-
ence images: one between the current and previous image and a second one between the
current and following image. Next, they use a border descriptor to extract the contour of
the regions comprising the vehicles and compute the intersection of both resulting images.
This method offers more accurate identification of the contour of the vehicle in the image.
However, the large number of imperfections appearing in these contours poses new prob-
lems, which are, again, very difficult to solve. In practice the Temporal Differences method
presents so many drawbacks that it is hardly ever used alone. In most references [35, 34] this method
is used as a complement or improvement of other more robust alternative solutions.
surfaces. For that reason, a single Gaussian does not represent accurately the information
acquired from the scene.
With the aim to address this problem, in [37] it is proposed to use a mixture of three
Gaussians, modeling respectively: the background, the vehicles and the shadows cast by
the vehicles. Results can be improved further if we use multiple Gaussians to model the
process; as described in [38]. Each Gaussian tries to handle the information coming from
a single surface observed by the pixel in the course of time. Later, a heuristic criterion is
used to determine which of these Gaussians or populations corresponds to the background.
A similar approach can be found in [39], while in [40] the author claims that using a mixture
of Cauchy distributions, instead of Gaussians, better represents the different
populations composing the scene.
According to the original “Background Mixture Model” method, described in [38], the
recent history of each pixel {X1 , . . ., Xt} can be modeled with a mixture of K Gaussians.
To that end, we represent the probability of observing the current value of the pixel by
means of the following formula:
P(X_t) = \sum_{k=1}^{K} w_{k,t} \cdot \eta(X_t, \mu_{k,t}, \Sigma_{k,t}) \qquad (27)
where w_{k,t} is the weight assigned to each of the Gaussians of the mixture at time t and
η is the probability density function. The distribution η is defined for each Gaussian as
a function of the value of the pixel X_t, the average value μ_{k,t} of the distribution and its
covariance matrix Σ_{k,t}, according to the following relation:
\eta(X_t, \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} \, |\Sigma|^{1/2}} \, e^{-\frac{1}{2}(X_t - \mu_{k,t})^T \Sigma^{-1} (X_t - \mu_{k,t})} \qquad (28)
With the aim of reducing the computational needs, it is usual to limit K to a maximum
of between 3 and 5 Gaussians, and we adopt the simplification that the covariance
matrix takes the form Σ_{k,t} = σ_k^2 · I.
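Under the simplification Σ_{k,t} = σ_k²·I, equation (27) can be evaluated as in the following Python sketch; the parameter names are assumptions made for the example.

import numpy as np

def mixture_probability(x, weights, means, sigmas):
    """Evaluate P(X_t) of equation (27) under the simplification Sigma_k = sigma_k^2 * I.
    x: pixel value as an n-dimensional vector; weights, means, sigmas: parameters of
    the K components (K is typically limited to 3-5)."""
    x = np.asarray(x, np.float64)
    n = x.size
    p = 0.0
    for w, mu, s in zip(weights, means, sigmas):
        d2 = np.sum((x - mu) ** 2) / (s ** 2)              # (X_t - mu)^T Sigma^-1 (X_t - mu)
        norm = (2.0 * np.pi) ** (n / 2.0) * s ** n         # (2 pi)^(n/2) |Sigma|^(1/2)
        p += w * np.exp(-0.5 * d2) / norm
    return p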
In that way, the distribution of the recent history of the observed values of each pixel in
the image at a given time instant will be characterized by the mixture of Gaussians described
above. When a new image is analyzed, the new value of the pixel can probably be found
already in one of the components of the mixture. If the match is found, the new value is
used to update the model of the corresponding component of the mixture. On the contrary,
if it is not possible to find a match, the component distribution with lowest confidence level
is removed from the mixture and a new component is created containing a single element.
The new distribution is created with the value of the pixel as average and it is started with
high variance and low weight.
Once the appropriate distribution has been found / created, we need to update the dif-
ferent parameters controlling the mixture model. Firstly, we proceed to update the weights
using the following formula:
EBSCOhost: eBook Collection (EBSCOhost) printed on 1/12/2025 5:32:11 PM UTC via DICKINSON COLLEGE. All use subject to https://www.ebsco.com/terms-of-use.
212 Tomás Rodrı́guez
adaptation of the model. Once the weights have been updated, we re-normalize them.
On the other hand, we only update the parameters µ and σ of the component matched by the current
value of the pixel; µ and σ are not modified in the rest of the distributions. We use the following
formulas to compute the new parameters µ and σ:
\sigma_{k,t}^2 = (1 - \rho) \cdot \sigma_{k,t-1}^2 + \rho \cdot (X_t - \mu_{k,t})^T \cdot (X_t - \mu_{k,t}) \qquad (31)
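The following Python sketch summarizes the update of the matched component. The weight and mean updates, whose equations fall outside this extract, are written in the standard form of [38] and should be read as assumptions; the variance update implements equation (31).

import numpy as np

def update_matched_component(k, x, weights, means, sigmas2, alpha=0.002):
    """Update the mixture after the pixel value x has matched component k.
    Weight and mean updates follow the standard form of [38] (assumed here);
    the variance update implements equation (31)."""
    x = np.asarray(x, np.float64)
    matched = np.zeros(len(weights))
    matched[k] = 1.0
    weights[:] = (1.0 - alpha) * weights + alpha * matched    # weight update (assumed form)
    weights[:] = weights / weights.sum()                      # re-normalization
    rho = alpha                                               # in [38], rho = alpha * eta(x | k)
    means[k] = (1.0 - rho) * means[k] + rho * x               # mean update (assumed form)
    d = x - means[k]
    sigmas2[k] = (1.0 - rho) * sigmas2[k] + rho * float(np.dot(d, d))   # eq. (31)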
At this point, all the parameters of the mixture have been updated with the information
contributed by the new image. Now it is time to decide which distribution best
represents the background. The criterion used is to select the candidates having the lowest
variance and highest weight. For that purpose we first order the Gaussians as a function
of w/σ, so that the most promising distributions appear high in the list. Next, we select
as representative of the background the first B distributions complying with the following
condition:
B = \arg\min_b \left\{ \sum_{k=1}^{b} w_k > T \right\} \qquad (33)
where T signals the size of the space we want to model. In that sense, if T is low, we
basically obtain a unimodal representation, while if T is increased, the background will be
represented by multiple populations. Finally, we segment the image by identifying a pixel
as belonging to a vehicle when its value is more than 2.5 · σ away from the mean of all B distributions.
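A compact Python sketch of this selection and classification step, for a gray-level pixel (the names and the value of T are illustrative):

import numpy as np

def is_foreground(x, weights, means, sigmas, T=0.7, n_sigma=2.5):
    """Select the B background components (equation (33)) and classify a gray-level
    pixel value x: foreground (vehicle) if it is further than n_sigma * sigma from
    the mean of every background component."""
    order = np.argsort(-(np.asarray(weights) / np.asarray(sigmas)))   # best w/sigma first
    accumulated = 0.0
    for k in order:
        if abs(x - means[k]) <= n_sigma * sigmas[k]:
            return False                        # matches one of the background modes
        accumulated += weights[k]
        if accumulated > T:                     # only the first B components are checked
            break
    return True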
Thanks to the automatic adaptation of its parameters, the method is robust against slow
illumination changes and offers the possibility to represent the background using informa-
tion from various populations. One of the merits of the method is the possibility to incor-
porate new information without destroying the previous model. In that way, when a new
value, not previously considered in the model, appears in the image, the new value is incor-
porated into a new distribution. However, the old distributions are still preserved, with the
same µ and σ 2 , but with lower weight w, until a model becomes the least probable and then
it is removed from the mixture. In consequence, if the background is affected by temporal
alterations, the method offers the possibility to recover its previous state. This is useful, for
example, to cope with camera vibrations, where the background oscillates periodically. The
same effect can be observed with moving trees or rain. Another possibility is to model the
behavior of shadows and stopped vehicles.
Unfortunately, the ideal situation described above is not always realized in practice.
The method only works if information is preserved in the model long enough to be recov-
ered at a later time. However, there are several reasons why this may fail. For example,
illumination conditions may change significantly during the time of the background alter-
ation so that when it is finished and the information is recovered, it no longer represents
the real conditions of the scenario. Another example is when the phenomenon altering the
background lasts longer than the time during which the model is preserved.
These circumstances are not rare in traffic scenarios and for that reason it is question-
able whether the mentioned advantages justify the significant increase in the computational needs
implicit in the method. If we use a low K, say 3 to 5, as proposed, the probability that
a large temporal alteration may ruin the model is very high. In the methods using adap-
tive thresholds this circumstance would be handled by increasing the tolerance level of the
threshold. On the contrary, in the methods using a mixture of Gaussians one of the distribu-
tions is removed from the list, resulting in an incorrect segmentation of the pixel, an effect
that might last long after the disturbance has ended.
On the other hand, if K is high, the computational needs become unmanageable for an
embedded system. Also, a high K introduces the problem of how to decide which distribu-
tions belong to the foreground and which to the background. The real problem is that, a priori,
there is no clear criterion saying how many distributions must belong to the background.
Another possible source of errors has been reported in [41]. If the first value considered
for a pixel belongs to the foreground, the algorithm will take log_{(1−α)}(T) frames until the real
value of the background is incorporated and log_{(1−α)}(0.5) frames until it becomes dominant.
For example, if we assume the background can be observed at least 60% of the time and
α = 0.002, 255 and 346 images respectively will be necessary. The problem is even
more critical in situations of dense traffic, when the road is less visible.
To solve this problem it has been proposed in [41] to use two models to update the
different parameters, depending on the situation of the system. First the author defines a
temporal observation window with the aim to give more relevance to recent information and
adapt better to environmental changes. Next, he uses a different model for the initialization
and runtime phases. The approach implemented during the initialization step is as follows:
w_{k,t} = w_{k,t-1} + \frac{1}{t} \left( M_{k,t} - w_{k,t-1} \right) \qquad (34)

\mu_{k,t} = \mu_{k,t-1} + \frac{M_{k,t}}{\sum_{i=1}^{t} M_{k,i}} \left( X_t - \mu_{k,t-1} \right) \qquad (35)

\Sigma_{k,t} = \Sigma_{k,t-1} + \frac{M_{k,t}}{\sum_{i=1}^{t} M_{k,i}} \left( (X_t - \mu_{k,t-1}) \cdot (X_t - \mu_{k,t-1})^T - \Sigma_{k,t-1} \right) \qquad (36)
Later, when the system has acquired at least L samples he uses the alternative approach
described below:
w_{k,t} = w_{k,t-1} + \frac{1}{L} \left( M_{k,t} - w_{k,t-1} \right) \qquad (37)

\mu_{k,t} = \mu_{k,t-1} + \frac{1}{L} \left( \frac{M_{k,t} \cdot X_t}{w_{k,t}} - \mu_{k,t-1} \right) \qquad (38)
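A Python sketch of the two update phases of [41], with M_{k,t} passed as a 0/1 match flag; all container names and shapes are assumptions made for the example.

import numpy as np

def init_phase_update(k, x, t, weights, means, covs, match, m_counts):
    """Initialization-phase update of [41], equations (34)-(36).
    x: pixel value as a 1-D vector; match: 1 if component k matched x, else 0;
    m_counts[k] accumulates the matches M_{k,i} of component k up to time t."""
    x = np.asarray(x, np.float64)
    m_counts[k] += match
    weights[k] += (match - weights[k]) / t              # eq. (34)
    gain = match / max(m_counts[k], 1)                  # M_{k,t} / sum_i M_{k,i}
    d = x - means[k]                                    # X_t - mu_{k,t-1}
    covs[k] += gain * (np.outer(d, d) - covs[k])        # eq. (36)
    means[k] += gain * d                                # eq. (35)

def runtime_phase_update(k, x, L, weights, means, match):
    """Runtime-phase update of [41], equations (37)-(38), once L samples are available."""
    x = np.asarray(x, np.float64)
    weights[k] += (match - weights[k]) / L              # eq. (37)
    means[k] += ((match * x) / weights[k] - means[k]) / L   # eq. (38)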
Another drawback of the K-Gaussians method is the fact that the algorithm only con-
siders information at the pixel level; i.e. it does not take advantage of global information
affecting the whole image. This is unfortunate since the analysis of global changes is critical
in traffic scenarios. Some examples are the global changes induced by passing shadows or
the frequent over-reaction of camera compensation devices. These effects, if not correctly
handled, may corrupt the background model and may result in the erroneous classification
of foreground and background pixels.
In summary, the Background Mixture Model method offers the possibility to handle
multiple populations with the aim of building a model of the scene that would allow us to cope
with important events affecting traffic systems, such as shadows, stopped vehicles, camera
oscillations, etc. However, in practice, the increased computational needs are not compen-
sated by an equivalent improvement in the accuracy of the system, since, as mentioned
above, other less sophisticated methods are able to achieve similar performance under free
traffic. Unfortunately, the use of various populations offers no advantage under dense traffic
because congestions last longer than the background information can be preserved.
Finally, it is interesting to note that if we take K = 1 and assume all values have the
same probability (ρ = α), then equation (30) is transformed into the limiting case so that
we again recover a relation equivalent to that found in (20):
Figure 16. Structure of the Detection and Tracking process. Notice how the detection
process uses feedback from previous (t-1) frame.
robustly identified. Ideally, the Detection stage works on information from a single image,
while the tracking stage combines information from multiple previous images. In practice,
the limits between the Detection and Tracking stages are blurred and information from the
tracker is frequently used to lead searches in the Detection stage.
There is a wide variety of Detection and Tracking methods [15, 42, 43, 44, 45] in the lit-
erature. However, not many of them can be implemented in real time. Historically, research
in traffic monitoring applications has focused mainly on Segmentation and avoided
specific developments for Detection and Tracking. This is unfortunate since general pur-
pose trackers are over-dimensioned for traffic environments, while at the same time they fail to take
advantage of application-specific information. In fact the vehicles’ trajectories are rather
simple; i.e. they are rectilinear and move at nearly constant speed, all in the same direction.
However, these apparently simple trajectories are masked by improper segmentation,
the presence of shadows and other moving objects, occlusions, perspective effects, etc. In
consequence standard trackers are poor performers in this application scenario.
Figure 17. Pseudo-code summarizing the Detection and Tracking processes. Main parame-
ters: objects masks = list of segmentation masks; candidates list = list of candidate objects
resulting from initial filtering of the objects masks; objects list = the list of objects being
computed (refined through subsequent image processing steps for the whole life of the de-
tected objects); object = an element in the objects list (i.e. a possible vehicle or shadow);
traffic variables = the computed traffic variables: counting, speed and category. See main
text for details.
Our Detection and Tracking method performs in real time and is based on the vehicle
model presented in section 2.1. This model describes not only the intrinsic properties of the
vehicles but more interestingly their behavior and their relations with other objects present
in the scene; i.e. the shadows. The structure of the Detection and Tracking stages and a
Figure 18. Object Detection Process. Compatible candidates are assigned to existing ob-
jects. Unpaired candidates are added as new objects.
In day operation, the merging process (Fig. 18) starts with a list of candidate
objects identified in the present image and the list of objects consolidated from previous
images. The task is to take every object from the objects list and try to match it with any
compatible object from the candidates list. With the aim to reduce the risk of incorrectly
merging objects from different vehicles, we first select the oldest objects in the objects list
and, from these, we choose the object showing the widest area. Then, the algorithm pro-
gressively tries to merge the selected object with objects of the same type in the candidates
list, until no more compatible objects are found. The merged objects are automatically re-
moved from the candidates list and the complete process is repeated with the next object
Figure 19. An example of the Merging Process. (a) Observe the candidate objects 1 − 5
and the predicted center and ROI of the two objects currently being processed. (b) Notice
how after the merging process, a new object 3 appears since the composed object exceeded
the limits of the category.
in the objects list until all objects have been considered. Finally, all free objects in the
candidates list are incorporated into the objects list as new objects.
During the merging process we construct a ROI (region of interest) for every object in
the objects list. The ROI is placed at the position predicted for the object in the current im-
age; as described in figure 19. The predicted position is computed from past observations
of the same object. For BRIGHT objects the size of the ROI will be the maximum allowed
size of the vehicle category or the maximum size observed for the selected object in pre-
vious images; whichever is bigger. The maximum allowed size of each vehicle category is
obtained automatically at the calibration step by computing the size of the area projected
on the work image by a parallelogram representative of the category. We select for merging
the object in the candidates list that overlaps most with the ROI. The merging will only
take place if the size of the object resulting from the merging does not exceed the maximum
allowed size of the category or the maximum observed size of the object. For DARK or
SHADOW objects the process is similar; only the historically observed size of the object is
not considered when computing the ROI, in order to avoid undesirable growing of shadow
areas.
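The following Python sketch outlines one iteration of this merging step; the object representation (dictionaries with 'roi', 'area' and 'max_area') is an assumption made for the example.

def try_merge(obj, candidates, category_max_area):
    """One merging step of the day-time Detection process.
    obj and the candidates are dicts with 'roi' = (x0, y0, x1, y1), 'area' and,
    for obj, 'max_area' (maximum area observed in previous images)."""

    def overlap(a, b):
        w = min(a[2], b[2]) - max(a[0], b[0])
        h = min(a[3], b[3]) - max(a[1], b[1])
        return max(w, 0) * max(h, 0)

    limit = max(category_max_area, obj['max_area'])
    best = max(candidates, key=lambda c: overlap(obj['roi'], c['roi']), default=None)
    if best is None or overlap(obj['roi'], best['roi']) == 0:
        return None                               # no compatible candidate
    if obj['area'] + best['area'] > limit:
        return None                               # merged object would exceed the limits
    candidates.remove(best)                       # merged candidates are consumed
    obj['area'] += best['area']
    return best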
At night the Detection process is different since the object masks are also different
(see Fig. 20). As we recall from section 5.2., at night there are no shadows and the
vehicles are represented by one or more DARK regions followed by one or more BRIGHT
regions. The task will be to correctly group those regions. We first group every dark region
with its corresponding BRIGHT region by placing a ROI aligned with the upper border of
the DARK region. The width of the ROI will be coincident with the width of the DARK
region and its length will be the maximum allowed length for the vehicle category. Any
BRIGHT region overlapping with the ROI will be matched with the DARK region. When
all DARK regions have been processed, the unmatched regions will be removed from the
list, thus removing potentially dangerous features such as shadow remnants, reflections,
etc.
Next all BRIGHT objects are converted into DARK objects and a new grouping process
is started. First, we group all objects which are too close together by using a ROI with the
Figure 20. Example of a night detection process. (a) Ideal representation of a vehicle. (b)
Segmentation masks. Notice a vehicle is composed of DARK and BRIGHT regions. (c)
Detection ROI, where L is the maximum allowed size of the vehicle category. (d) Rectangle
inscribing the resulting object. Notice how all BRIGHT objects have been transformed into
DARK objects.
maximum allowed size of the category, extended a few pixels. Next we try to match the left
and right lights of the vehicles. Typically this process is only required for large size vehicles,
but we apply it to all objects as a precaution. We use a ROI with the maximum allowed length
of the category and a width computed from the detected width of the object, extended in
the left and right directions up to the limits of the category. Any object overlapping with
the ROI will be merged provided the composed object does not exceed the limits of the
category. This approach allows us to capture any car light within the limits of the vehicle.
Finally, the resulting candidate objects are subject to a matching process with the historical
objects list similar to that described for day operation.
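A simplified Python sketch of the first night-time grouping step (pairing each DARK region with the BRIGHT regions inside its detection ROI); the region representation is hypothetical and the ROI is assumed to extend downwards in image coordinates.

def group_night_regions(dark_regions, bright_regions, max_len):
    """Pair each DARK region with the BRIGHT regions falling inside its detection ROI.
    Regions are dicts with 'x', 'y', 'w', 'h'; max_len is the maximum allowed length
    of the vehicle category. Unmatched BRIGHT regions are simply discarded."""
    vehicles = []
    for d in dark_regions:
        # ROI aligned with the upper border of the DARK region, same width, length max_len.
        roi = (d['x'], d['y'], d['x'] + d['w'], d['y'] + max_len)
        matched = [b for b in bright_regions
                   if b['x'] < roi[2] and b['x'] + b['w'] > roi[0]
                   and b['y'] < roi[3] and b['y'] + b['h'] > roi[1]]
        if matched:
            vehicles.append({'dark': d, 'bright': matched})
    return vehicles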
(only for BRIGHT objects). To avoid incorrect merging due to temporary occlusions, we
only merge the two objects when several successive merging attempts have taken place. A
counter attached to the object is incremented with every merging attempt, but it is reset
the first time the merging fails.
At this stage we should have a rather compact description of the objects in the list. How-
ever, to account for possible deficiencies in the Segmentation step (i.e. a vehicle appears
broken into several parts), we implement one additional merging process for BRIGHT ob-
jects. In this case we allow the object to grow up to the limits of the category by capturing
any BRIGHT object in its proximity; provided the usual size limits of the merged object are
not exceeded. In this case we use a ROI extended in all four directions up to the maximum
allowed width and length of the category. This is a risky process and can only be applied
once the objects in the list have been compacted in the previous steps.
The above completes the merging approach, whereby we intend to identify one object
with one vehicle or shadow. Now we will focus on the tracking process. Tracking vehicles
is, in principle, straightforward: we compute the average speed of the vehicle by line fitting the positions
detected in previous images, then predict the position of the object in the current image
assuming constant speed, and finally match existing objects with new instances of the objects
in the current image, following the procedure described in section 6.1. Unfortunately, this
apparently simple process is complicated by the presence of shadows, occlusions and
errors resulting from heavy vehicles.
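The constant-speed prediction can be sketched in a few lines of Python; positions are assumed to be expressed in calibrated road coordinates.

import numpy as np

def predict_position(times, positions, t_next):
    """Fit a line to past (time, position) observations and extrapolate assuming
    constant speed. Returns (estimated_speed, predicted_position)."""
    speed, intercept = np.polyfit(np.asarray(times, float), np.asarray(positions, float), 1)
    return speed, speed * t_next + intercept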
Tracking BRIGHT objects is rather straightforward since a BRIGHT label provides an
almost certain guarantee that the object belongs to a vehicle. However the case of DARK
and SHADOW labels is more involved since an object may begin its history as DARK and
then transform into SHADOW. Our approach is to count the number of times the object has
been detected either as DARK or SHADOW and then decide which is the most appropriate
label. For that purpose we use two counters: the first one (detection counter) counts the
number of object detections regardless of its type; while the second one (shadow counter)
counts the number of times this object has been identified as SHADOW.
Difficulties are increased due to occlusions. One of the side effects we were not
able to solve in our perspective correction method is that the height of the vehicles is projected
onto the front and the side of the vehicles. Thus, they appear larger than they are and we face the
risk that they might overlap. This problem is especially challenging in the case of heavy vehicles
since, even if they usually keep larger safety margins with the preceding vehicles,
they are significantly higher and less aerodynamic. In fact it is not uncommon to find two
overlapped vehicles, or even vehicles which are mostly occluded, with the result that one of them
is miscounted. On the other hand, occlusions are more severe close to the horizon.
Since occlusions imply some critical information is missing, our best opportunity to
solve this problem with simple techniques suitable for real time is to compute the traf-
fic variables of the vehicles while they are separate and allow for some tolerance in case they
separate again after being occluded; i.e. a kind of ghost object tracking. This is reasonable
since vehicles frequently show different relative speeds in the different lanes. Treatment of
occlusions is done in our case by, first, optimizing the aiming of the camera and, second,
implementing a tracking method tolerant to temporary faults.
The tracking algorithm is based on the state machine presented in figure 21. The first
time a vehicle enters the scene, it is assigned the ENTERING state. Later, while it is being
tracked, it gets the PASSING state and its detection counter is incremented for every new
detection. On the contrary, the first time the detection fails, the object turns into the FAILED
state but remains in the list in the form of a ghost object (this is only applicable to BRIGHT
objects). A ghost object is subject to the usual matching and merging rules, but with every
new failure its failed object counter is incremented. If a ghost object succeeds in the
matching, its failed object counter is reset and the object returns to the PASSING state. Finally,
an object gets the ENDED state when the failed object counter exceeds a threshold. Only at
this point do we consider the object finished and proceed to perform its final classification
and evaluate its traffic variables.
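A minimal Python sketch of this state machine; the failure threshold is an assumed value.

ENTERING, PASSING, FAILED, ENDED = range(4)

class TrackedObject:
    """Minimal sketch of the state machine of figure 21; max_fails is assumed."""

    def __init__(self, max_fails=5):
        self.state = ENTERING
        self.detections = 0              # detection counter
        self.fails = 0                   # failed object counter
        self.max_fails = max_fails

    def update(self, detected):
        if self.state == ENDED:
            return
        if detected:
            self.detections += 1
            self.fails = 0               # a successful match restores the object
            self.state = PASSING
        else:
            self.fails += 1              # ghost object: kept in the list for a while
            self.state = FAILED
            if self.fails > self.max_fails:
                self.state = ENDED       # final classification is performed here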
BRIGHT and DARK objects that survive the merging at the end of the tracking pro-
cess and offer a number of object detections above the threshold N_d will be identified as
candidates for a physical vehicle. This candidate vehicle will only be validated if the ob-
ject entered and left the scene through the appropriate points of the control area (i.e. top
/ bottom) and if the object is persistent. N_d is computed automatically as a function of
the length of the control area (L) and the measured speed of the vehicle (V_V), so that the
minimum allowed number of object detections guarantees the vehicle has covered at least
half the length of the control area: N_d = (V_P · L) / (2 · V_V), where V_P is the processing speed of the
system (typically 25 frames per second).
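For illustration, N_d can be computed as below; the unit conversion assumes L in meters and V_V in km/h, which the chapter does not state explicitly.

def min_detections(control_area_length_m, vehicle_speed_kmh, fps=25.0):
    """N_d = (V_P * L) / (2 * V_V): detections needed to cover half the control area."""
    speed_ms = vehicle_speed_kmh / 3.6        # km/h -> m/s (assumed units)
    return fps * control_area_length_m / (2.0 * speed_ms)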
Figure 21. State Machine used in the tracking algorithm. Notice it provides certain protec-
tion against temporary occlusions.
On the other hand, surviving SHADOW objects must undergo one additional test. If
the number of shadow detections is above the threshold N_d when the object leaves the scene
(i.e. when it is no longer detected), the object will be classified as SHADOW. Otherwise,
the total number of object detections will be checked against N_d and, if the test is successful,
the object will be identified as an ordinary DARK object. Finished SHADOW objects will
simply be discarded since it is certain their associated BRIGHT object already achieved the
threshold N_d. This approach guarantees that, given a sufficient number of detections, either
the BRIGHT or the SHADOW object is selected, but not both. The method also ensures DARK
vehicles survive even if they might have been temporarily classified as shadows and provides
certain protection against temporary occlusions.
Figure 22. Heavy Vehicle Identification step. (a) See the ROI set for objects 3 and 4 and
the category boundaries. See also how object 4 exceeds the width boundary, while object 3
exceeds the length boundary. (b) Results after the merging, assuming objects 3 and 4 had
passed the confidence test.
Heavy vehicles are subject to the usual merging and tracking processes; only the cat-
egory limits used are larger. Unfortunately, this approach does not work when the heavy
7. Shadow Removal
Vehicles typically exhibit one or more associated shadows in the image. Unfortunately, the
process of discriminating vehicles’ shadows from other vehicles present in their neighbor-
hood is far from trivial. On the other hand, it is necessary to mention the added difficulty
arising from shadows cast by objects other than the vehicles or multiple shadows resulting
from artificial light sources at night.
A shadow is produced when an object intercepts the light path from the illumination
source to the surface observed by the camera. The non-illuminated part of the obstructing
object is called self-shadow, while shadows projected onto the background elements of
the scene are referred to as cast shadows. Shadows are composed of two parts: umbra and penumbra.
In general we describe the umbra as that part of the shadow where the light source has
been occluded almost completely, while in the penumbra it is only partially occluded. The
transition between both parts and between the penumbra and the illuminated part of the
scene is typically smooth.
Most shadow removal methods [13, 27, 46] present in the literature identify shadows
based on their intrinsic properties (color, shape, or luminance levels). Several theoretical
models have been proposed describing the differential chromatic characteristics of shadows.
However, in practice, those models fail to represent the intricacy of shadows and tend to
ignore their dynamic properties, thus wasting a significant amount of potentially useful
information. As a consequence these methods are not very robust, since they frequently remove
parts of the vehicles together with the shadows.
Our shadow removal method [12] is more involved, in the sense that shadows are treated
as one more object within the computer vision chain: a type of object which shows certain
differential intrinsic properties but also has a rich history. The proposed approach is based
on a complex shadow model. The model only considers the effects resulting from moving
shadows. Conversely, static shadows are treated as part of the background. Compilation
of information for shadow identification begins at the Segmentation step; as described in
section 5.. As computation advances, more information is added to the shadow model from
the Detection and Tracking stages. In that sense, in section 6., we learn how shadows are
detected and tracked independently from the time they enter the scene until they finally dis-
appear. We also noticed it is not possible to robustly remove the shadows until all available
historical shadow information has been considered.
Figure 23. Decomposition of the shadow detection process. Notice how the two shadow
detection methods are run in parallel and the results are compared for each input image.
Now, the critical missing part is the criterion used to decide when a DARK object should
become a SHADOW object. For that purpose, we use three different processes: Detecting
Shadows with the Proximity Criteria, Detecting Shadows with the Directional Criteria and
Detecting Shadows from Outside the Control Area; as described in Fig. 23.
Figure 24. The figure presents the process of detecting the dark parts of a vehicle (before
and after the shadows have been identified). It must be noticed how only BRIGHT objects
can have associated shadows.
incoming image and every time a BRIGHT object and a DARK / SHADOW object are
related, the shadow counter of the corresponding object is increased.
Based on the detected shadows, the Proximity Criteria allows us to compute statistically
the average shadow vector linking the geometrical center of the BRIGHT object to the
predicted mass center of the SHADOW object; as described in section 7.2.
The average shadow vector computed by the Proximity Criteria takes precedence at
start-up, when directional information is not available, or as a second option when shadows
are too small or when the average directions of the shadow calculated with the Proximity and
Directional criteria are inconsistent.
Figure 25. The figure shows several possible configurations involving shadows. (a) Long
slanted shadow. (b) Short left shadow. (c) Long shadow in front. (d) Broken lateral shadow.
For each object it is presented: its geometrical center Co , the ROI, the mass center of the
shadow Cg (white cross), the average mass center calculated for the period Cge (inverted
white cross) and the vector Vcge joining Cge and Co .
In order to solve this problem we propose to calculate the average magnitude and ori-
entation of the cast shadows for each vehicle category using statistical methods. Once this
parameter is known, we will identify as shadows those DARK objects whose relative posi-
tion with respect to a BRIGHT object is in the correct direction and whose mass center is
within the permitted range.
In this case we adopt the simplification that all vehicles from a given category
have associated shadows with similar orientation and magnitude at a given time instant.
Again, this concept will be represented by the vector joining the center of the BRIGHT
object and the mass center of the SHADOW object. We further assume this vector (aver-
age shadow vector) moves slowly and can be considered constant for extended periods of
time (learning time). The average shadow vector can be computed by simply averaging
the relative position of the mass center of the SHADOW with respect to the center of the
BRIGHT object for each incoming vehicle, during the learning time. A reasonable learn-
ing time could be in the order of 30 seconds. One average shadow vector is calculated for
each vehicle category in order to account for the different dimensions of the vehicles in the
different categories.
The main purpose of the average shadow vector is to construct a ROI that will allow
us to capture the shadows associated to each BRIGHT object. The ROI is started from the
appropriate diagonal of the BRIGHT object; as can be observed in figure 25. Its width is
set in accordance with the BRIGHT object’s dimensions, and it has the same orientation
and twice the length of the average shadow vector. Any DARK object whose mass center
is within the ROI will be identified as a SHADOW object.
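The following Python sketch gives a simplified geometric version of the Directional Criteria: the average shadow vector is accumulated over the learning period, and a DARK mass center is captured when it lies inside a rectangle with the orientation and twice the length of that vector. The real ROI is anchored at the appropriate diagonal of the BRIGHT object; here a rectangle centered on the object center is used, and all names are assumptions.

import numpy as np

def average_shadow_vector(bright_centers, shadow_centers):
    """Average vector from the BRIGHT object centers to the mass centers of their
    SHADOW objects, accumulated over the learning period for one vehicle category."""
    diffs = np.asarray(shadow_centers, float) - np.asarray(bright_centers, float)
    return diffs.mean(axis=0)

def captures_shadow(bright_center, dark_mass_center, shadow_vec, roi_width):
    """Directional Criteria test: is the DARK mass center inside a ROI with the
    orientation of the average shadow vector and twice its length?"""
    v = np.asarray(shadow_vec, float)
    length = np.linalg.norm(v)
    if length == 0.0:
        return False
    u = v / length
    rel = np.asarray(dark_mass_center, float) - np.asarray(bright_center, float)
    along = float(np.dot(rel, u))                       # signed distance along the vector
    across = abs(rel[0] * u[1] - rel[1] * u[0])         # perpendicular distance
    return 0.0 <= along <= 2.0 * length and across <= roi_width / 2.0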
To ensure a smooth transition between periods, every new average shadow vector is
only accepted if it fits inside the ROI of the vector calculated in the previous period. This
approach implicitly performs a smooth tracking of the average shadow vector; as required
by the shadow model. The algorithm is initialized using the average shadow vector com-
puted with the Proximity Criteria.
The Proximity and Directional Criteria are complementary. The Directional Criteria is
preferred during normal operation since it allows us to capture shadows detached from the
casting object and broken shadows. It is also more robust in situations with long shadows
overlapping the vehicles. However there are circumstances when the computation of the
average shadow vector is not reliable. One such case is when shadows are too short to
produce vectors of statistical significance; i.e. at midday, when the sun is high up in the sky.
It is also possible that the algorithm loses track of the real average shadow vector.
To avoid those problems, both methods are run in parallel. Then, at every learning time
the average shadow vector calculated with the Proximity Criteria is checked against the
ROI of the Directional Criteria. If it does not fit within the ROI, shadows are temporarily
captured using the Proximity Criteria and the Directional Criteria is restarted.
Figure 26. The figure presents several possible scenarios involving external shadows cast
by vehicles running outside the control area. Scenarios (a) and (b) are easily handled by
the proposed external shadow removal method. Scenario (c) is less frequent but presents a
more difficult solution.
Shadows cast by vehicles running outside the control area require specific measures since our
present methods cannot discriminate them from the DARK vehicles; i.e. the Proximity and
Directional Criteria become useless because the casting object cannot be observed inside
the control area.
To solve this problem we propose to label a DARK object as SHADOW whenever
it touches the appropriate lateral border of the control area; as seen in figure 26. The
appropriate border is determined by the orientation of the shadow. In general, we must
check the border opposite to the direction of the average shadow vector. For example, if
the average shadow vector points to the left we must check the right side limit. We also
added two parameters that will allow us to deactivate the external shadow removal process
when, according to the configuration of the scene, no external shadows are possible.
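A sketch of the border test in Python, assuming axis-aligned bounding boxes and a shadow vector whose horizontal component gives the shadow direction:

def is_external_shadow(dark_bbox, control_area, shadow_vec, margin=2):
    """Label a DARK object as SHADOW when it touches the lateral border of the
    control area opposite to the direction of the average shadow vector.
    dark_bbox and control_area are (x0, y0, x1, y1) boxes; margin is assumed."""
    if shadow_vec[0] < 0:                                # shadow points to the left
        return dark_bbox[2] >= control_area[2] - margin  # check the right border
    if shadow_vec[0] > 0:                                # shadow points to the right
        return dark_bbox[0] <= control_area[0] + margin  # check the left border
    return False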
In figure 26 we present several possible scenarios involving external shadows. Scenarios
(a) and (b) are easily solved with the method proposed. However in scenario (c) it is not
possible to discriminate the vehicle from the overlapping shadow. We still have one more
possibility to detect the vehicle if the shadow shows different relative speed with respect
to the vehicle and both objects are overlapped only for a short period of time. On the
other hand, if the tracker is not able to tell them apart, this situation cannot be solved
and a counting error will be produced. Fortunately, the induced error is not very significant
since dark-only vehicles are rare and the vehicle will be miscounted only when the above
circumstances occur simultaneously. Some real examples involving shadows are presented
in Fig. 27.
8. System Implementation
We chose to implement our Traffic Monitoring system on a simple computer architecture
with very limited resources, so that it can be ported to embedded platforms and allow room
for further improvement. Currently, the system runs under Windows and is able to
process over 40 frames per second on an old-fashioned 800 MHz microprocessor with 256
MB of RAM and a low-cost digital camera. The user interface of the application is presented
in figures 28 and 29.
Figure 27. Three examples of the Shadow Removal process showing the original image
(left), the Directional method (center) and the final classification of the shadow objects
(right). In the first row we can see two vehicles with long shadows as well as an external
shadow coming from a vehicle running on a parallel motorway. Notice also the abundant
static shadows cast by trees from outside the control area. The Directional method correctly
handles the long shadows since their mass center is easily captured by the ROI (second
column). On the other hand, the dark patches inside the vehicle are successfully captured
by the Proximity method. Finally the external shadow is also captured since it touches the
right border of the control area. In the second row we can observe three vehicles running
close together. It is apparent how the shadow removal algorithm correctly handles this
difficult situation. Finally the third row presents an example of a broken shadow.
Figure 28. User Interface of the Traffic Monitoring system. Top left, navigation controls.
Center, the Work Image and the Segmented Image. Top right, debugging controls. Left,
object data viewer. Right, Traffic Variables. Bottom right, histograms and Segmentation
controls.
ground truth file is available, we may run the comparison process with different algorithms,
as many times as we desire, while the tool automatically computes the obtained error rates.
The proposed testing method is objective, facilitates testing of individual components, saves
a lot of time and helps the researcher focus on what matters; i.e. we only seek solutions to
problems which might affect the error rate.
Figure 29. The User Interface of the Traffic Monitoring system running under Linux with
extended functions and dockable synchronized windows.
Next, we describe the main traffic variables provided by the system; as well as the
methods used to compute them.
• Average Speed. It is computed as the average of the speeds of the vehicles of a given
category x, detected in a given traffic lane α, over the integration period t. It is
presented in units of km/h, using the formula below:

V(t, x, \alpha) = \frac{\sum_i V_i(t, x, \alpha)}{N(t, x, \alpha)}, \qquad (41)

where V(t, x, α) is the Average Speed in the interval; V_i(t, x, α) is the Instantaneous
Speed of vehicle i; and N(t, x, α) is the Number of Vehicles detected in the
Integration Period t.
• Traffic Flux. The Traffic Flux represents the number of vehicles of category x,
detected in the given traffic lane α, per hour. It is measured in vehicles/h.

\varphi(t, x, \alpha) = \frac{N(t, x, \alpha)}{\Delta T}, \qquad (42)
where ϕ(t, x, α) is the Flux computed in the Integration Period t; and ∆T is the time
interval, measured in hours, used by the Integration Period.
• Traffic Density. The Traffic Density represents the number of vehicles of category
x, detected in the given traffic lane α, per kilometer. It is measured in vehicles/km.

\rho(t, x, \alpha) = \frac{\varphi(t, x, \alpha)}{V(t, x, \alpha)}, \qquad (43)
where ρ(t, x, α) is the Traffic Density computed in the Integration Period t.
• Vehicle Headway. The Vehicle Headway is the average time between two consecutive vehicles
of category x running in lane α. It is expressed in seconds.

L(t, x, \alpha) = \frac{1}{\varphi(t, x, \alpha)}, \qquad (44)
where L(t, x, α) is the Vehicle Headway computed in the Integration Period t and
ϕ(t, x, α) is the Flux of vehicles.
• Occupancy. The Occupancy provides information about the fraction of time a given
traffic lane α remains occupied by the vehicles.

O(t, \alpha) = \frac{\sum_i \Delta t_i}{\Delta T}, \qquad (45)

where O(t, α) is the lane Occupancy in the Integration Period t; and Δt_i is the time
required by vehicle i to cross the control area. It is computed from its speed V_i and
its length l_i: Δt_i = l_i / V_i.
The above variables are computed for each traffic lane and vehicle category indepen-
dently. However, we also provide aggregated information comprising the same variables
computed for all traffic lanes (i.e. the complete road) and vehicles’ categories (i.e. all
vehicles).
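For illustration, the macroscopic variables of equations (41)-(45) can be computed from the per-vehicle measurements as in the Python sketch below; the data structure and the conversion of the headway to seconds are assumptions made for the example.

def traffic_variables(vehicles, delta_t_hours):
    """Macroscopic variables of equations (41)-(45) for one lane and one category.
    vehicles: list of dicts with 'speed' (km/h) and 'length' (km); delta_t_hours:
    duration of the Integration Period in hours."""
    n = len(vehicles)
    if n == 0:
        return None
    avg_speed = sum(v['speed'] for v in vehicles) / n                    # eq. (41), km/h
    flux = n / delta_t_hours                                             # eq. (42), vehicles/h
    density = flux / avg_speed                                           # eq. (43), vehicles/km
    headway = 3600.0 / flux                                              # eq. (44), converted to seconds
    occupancy = sum(v['length'] / v['speed'] for v in vehicles) / delta_t_hours  # eq. (45)
    return {'average_speed': avg_speed, 'flux': flux, 'density': density,
            'headway': headway, 'occupancy': occupancy}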
           Our System   EVA (1)   Traficon (1)   Autoscope (1)   Video Trak (2)   Iteris (2)
Count      3.6 %        5 %       5-15 %         7-8 %           25 %             15 %
Speed      6 %          10 %      8-13 %         8 %             N/A              N/A
a 10% error in the measurement of the vehicles’ speed. We compare favorably with these
results since our global error rates are 3.59% and 6% respectively. These global results look
reasonable, however, looking at the details we discover the dispersion is huge, with error
values peaking over 15% in vehicle counting and 18% in speed. Again we obtain better
marks since our peak rate is below 9%. Finally, we must highlight that some of the systems
analyzed do not operate at night or in slow traffic and present serious difficulties under adverse
weather.
In the second testing phase we compared the results from our traffic system with the
ground truth images using the Testing Tool described above. Tables 2 to 5 present the
results achieved under day, night and transition periods, for various values of traffic flux
and classified by vehicle category. The error rate was obtained by computing the average
rate of the difference between the counting done by our system and the ground truth; as
compared with the total number of vehicles in the ground truth.
If we observe the global results, we discover the average error rate is 3.59%. This
error rate is much lower for light vehicles (2.94%) than for heavy vehicles (6.49%). On
the other hand we find the error rates are very similar during the day (3.29%) and in the
transition periods (3.61%), while they degrade during the night (5.45%) due to the difficulty
of correctly identifying heavy vehicles at night.
Next, we analyze the results for different values of the flux: 800, 1800 and 2200 v/h,
where 1800 v/h signals the accepted maximum capacity of a road. We obtain respectively:
3.69%, 3.71% and 3.67%. However, we must notice these results are biased by the low
proportion of heavy vehicles at high flux. We observe that the error rate of light vehicles in-
creases only slightly with increasing flux, while the results for heavy vehicles degrade
significantly due to the difficulty of identifying heavy vehicles under high density,
when the vehicles are close together.
We also performed tests under different weather conditions. The system performed
slightly better under cloudy skies (3.25%), followed by clear skies (3.49%) and finally
moving clouds (3.63%). On the other hand, accuracy was little affected by rainy weather,
since the error rate was 3.89%. Unfortunately, we could not obtain sufficient testing data
for snowy or foggy conditions. Finally, in table 6, we present the results with different
shadow configurations. We discover shadows do not significantly affect the accuracy
of the system, which demonstrates the correct operation of the shadow removal
method.
Overall, the best results are systematically obtained during the day, for light vehicles
and with flux values of 800 v/h: 2.80%. On the other hand, heavy vehicles at night show
a worst-case rate of 9%. With respect to the classification of heavy vehicles we obtained a success
Table 3. Results with values of flux between 800 and 1800 Vehicles / Hour
rate of 93.51% and our measurements of speed demonstrated the system is able to achieve
an error rate below 6%. These results are well within the limits placed by traffic authorities
who usually require an error rate of between 3% and 5% in vehicle counting, 5% and 10%
in speed measurement and 5% and 10% in classification.
10. Conclusion
An innovative computer vision based traffic monitoring system has been presented. The
proposed system computes in real-time the microscopic traffic variables: counting, speed
and classification; and based upon them produces the set of macroscopic variables: flux,
density, composition and mean headway. The methods proposed are founded on a thorough
analysis of the application scenario, and careful attention was paid to implementation
issues such as simple, non-invasive installation and real-time performance.
The system is self-adaptive and is able to operate autonomously for long periods of time;
i.e. there are no hidden parameters to be adjusted. It performs in all weather conditions and automatically selects
the appropriate algorithm for day, night and transition periods. The system is robust against
fast and slow illumination changes and is able to cope with long broken shadows, and
shadows from parallel roadways. Ordinary camera movements (i.e. wind vibrations) hardly
affect its performance, because the system is tolerant of temporal tracking errors and
strict constraints are used to identify the vehicles. We also provide an adequate treatment of
occlusions and heavy vehicles, and obtain reasonable results in dense traffic; i.e. the system
reports maximum occupancy and zero flux in the presence of traffic jams.
Several innovative contributions can be highlighted in the paper: an exhaustive analysis
of the operational environment; an efficient calibration and image rectification method; an
original Segmentation approach, complemented with an innovative method for the auto-
matic selection of the Segmentation parameters; a Detection and Tracking approach specif-
ically designed for traffic environments; a robust shadow removal method; specific provi-
sions for heavy vehicle detection and the treatment of occlusions; and finally, a systematic
testing and benchmarking methodology.
According to the results presented in section 9., our traffic system outperforms existing
systems in terms of accuracy and already complies with most requirements placed by traffic
authorities. Furthermore, our error rates are more balanced in adverse conditions and we
are able to operate in situations where other systems fail; i.e. with long broken shadows,
under high density of heavy traffic, in night conditions, with slow traffic, etc. However,
even if our results are very good in usual traffic conditions, improvements would be desir-
able in some specific areas. In particular, more attention must be given to the detection of
heavy vehicles in night conditions and also the treatment of occlusions under dense traf-
fic must be improved. The system would benefit from an increase in the accuracy of the
measured speed and more testing will be required in snowy and foggy weather. We may
also revisit the S channel in color Segmentation, with the aim to better discriminate the
vehicles from their shadows. On the other hand, some of the basic methods proposed could
be adapted easily to other application scenarios; such as: incident detection, detection of
suicide vehicles, intersection control, tunnel supervision, security surveillance, automatic
signal management, presence detection for automatic barriers, etc.
In: Computer Vision ISBN 978-1-61209-399-4
Editor: Sota R. Yoshida © 2011 Nova Science Publishers, Inc.

Chapter 5

ALGEBRAIC TOPOLOGY FOR COMPUTER VISION

Daniel Freedman and Chao Chen

1. Introduction
Algebraic topology, with roots dating to Poincaré at the beginning of the twentieth century,
has traditionally been considered one of the purest subfields of mathematics, with very
few connections to applications. The last decade, however, has witnessed an explosion of
interest in computational aspects of algebraic topology, and with the development of this
computational machinery, a concomitant interest in applications. In this paper, we review
some of these developments, and show how methods of computational algebraic topology
may be fruitfully applied to problems of computer vision and image processing. We hope
that this review will stimulate interest on the part of computer vision researchers to both use
and extend these tools.
As we have noted, the new interest in algebraic topological tools has been fueled by two
parallel developments: the design of new algebraic topological algorithms, and the appli-
cation of these algorithms to various scientific and engineering fields. The new algorithms
have focused almost exclusively on computations involving homology groups; without go-
ing into detail at this point (we defer a formal description of homology theory until Section
2.), we may note that homology groups are topological invariants which, roughly speaking,
count the number of holes of various dimensions that a topological space has. The great ad-
vantage conferred by homology groups is that these groups are relatively straightforward to
compute with, as opposed to other topological concepts such as homotopy groups, or worse,
homeomorphism equivalence classes. In particular, the concept of persistent homology, as
introduced by Edelsbrunner et al. [1], has proven to be very useful. The new framework has
been applied to a number of fields, including molecular biology [2, 3], sensor networks [4],
robotics [5], graphics [6], geometric modeling [7], as well as computer vision and image
processing.
Why use topological tools in computer vision? One answer is that, generally speaking,
topological invariants tend to be very robust. If the topological space is stretched and de-
formed continuously, without any tearing or gluing, then the topological invariants of the
space will be preserved. This is, of course, by design: topology is essentially the study of
invariants of spaces under this group of transformations – continuous bijections with contin-
uous inverses, also known as homeomorphisms. This property might be quite useful in the
design of shape signatures, where the goal is to find a compact description of a shape which
does not change much (if at all) when the shape undergoes some types of deformation.
Unfortunately, traditional topological invariants suffer from some problems from the
point of view of the applied scientist. On the one hand, such invariants are perhaps “too
robust:” very different shapes, such as the hand and the circle in Figure 1, may be topo-
logically equivalent, thus leading to poor shape signatures. On the other hand, topological
invariants can be quite sensitive to noise; this phenomenon is illustrated in Figure 1, in
which two spaces – a torus and a two-handled torus – appear very similar, but are in fact
topologically different due to the presence of the tiny handle.
Some of the methods developed in the last decade in the field of applied and compu-
tational algebraic topology attempt to deal with these two situations. In the case of shape
signatures, one remedy involves augmenting the base topological space with extra geomet-
ric information, thereby creating a new topological space. Standard topological invariants
can then be computed for this augmented space, leading to more discriminating shape sig-
natures. We will see an example of this type of augmentation in Section 3.1. For the case of
the sensitivity of topological invariants to noise, the theory of persistent homology has been
developed. This theory deals with “topological noise,” by examining not only the topo-
logical features of a space, but also their “lifetime,” or significance. Aspects of persistent
homology will be used throughout the paper.
The fields of applied and computational algebraic topology are quite new, with most
of the key developments having taken place in the last decade. The applications of these
ideas to computer vision and image processing are even newer, with most of the relevant
work having appeared in the last five years. As a result, this paper is perhaps slightly different
Figure 1. The problems with topological invariants. The hand-like shape in (a) and the
circle in (b) are quite different shapes, but they possess the same topology. On the flip
side, the ordinary torus in (c) is similar to the two-handled torus in (d), due to the fact that
the second handle is very small (i.e. is “topological noise”). However, their topological
descriptors will be different.
from a traditional survey or review of the literature. We focus on two goals: presenting the
mathematical material – which can be quite daunting at first blush – in an accessible fashion,
and reviewing some of the most interesting applications of this material to computer vision
and image processing. The main goal of the paper is to stimulate interest in these new
algebraic topological tools on the part of the computer vision community, so that the tools
may be applied to new problems, and perhaps computationally and theoretically extended
as well.
The remainder of the paper is organized as follows. In Section 2., we review the relevant
material from algebraic topology, focusing in particular on homology theory and persistent
homology. Although the material is presented in a way to make it as intuitive as possible,
references to various standard texts and papers are given, to aid the reader interested in
pursuing the topic further. Section 3. presents four interesting and illustrative applications
of the topological techniques to computer vision and image processing. In Section 3.1., the
problem of designing a shape signature is considered. Section 3.2. examines the problems
of natural image statistics, and shows that the new topological techniques can contribute
to a more accurate characterization of these statistics. Section 3.3. discusses the traditional
problem of noise reduction, and simultaneously examines the problem of image segmenta-
tion. Persistent homology leads to a Mean Shift-like algorithm, but one which has a rigorous
way of merging segments. Finally, Section 4. concludes.
A simplicial complex K is a finite set of simplices that satisfies the following two con-
ditions.
1. Any face of a simplex in K is also in K.
2. The intersection of any two simplices in K is either empty or is a face for both of
them.
The dimension of a simplicial complex is the highest dimension of its simplices. If a subset
K0 ⊆ K is a simplicial complex, it is a subcomplex of K.
Please see Figure 3 for an example simplicial complex. The triangulation of the solid
cube provides 3-dimensional simplices. Therefore the simplicial complex is 3-dimensional.
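As an illustrative aside (not from the original text), the first of these conditions is easy to check computationally when a complex is stored as a set of vertex sets; the sketch below is ours, and the second condition, which concerns the geometric realization, is not checked here:

```python
from itertools import combinations

def is_simplicial_complex(simplices):
    """Check closure condition 1: every face of every simplex is in the set.
    Simplices are represented as frozensets of vertex labels (an abstract,
    purely combinatorial representation; illustrative sketch only)."""
    K = set(simplices)
    for s in K:
        for k in range(1, len(s)):                     # proper faces of dimension k - 1
            for face in combinations(sorted(s), k):
                if frozenset(face) not in K:
                    return False
    return True

# A 2-simplex abc together with all of its faces is a simplicial complex
K = [frozenset(s) for s in ["a", "b", "c", "ab", "ac", "bc", "abc"]]
print(is_simplicial_complex(K))                        # True
print(is_simplicial_complex([frozenset("abc")]))       # False: edges and vertices missing
```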
Note that since the field is Z2 , the set of d-chains is in one-to-one correspondence with the
set of subsets of d-simplices. A d-chain corresponds to an nd-dimensional vector, whose
entries are 0 or 1; the nonzero entries correspond to the included d-simplices. Here nd is
the number of d-simplices in K. For example, in Figure 4, the (formal sum of) red edges
form a 1-chain and the (formal sum of) dark grey triangles form a 2-chain.
If we define the addition of chains as the addition of these vectors, all the d-chains form
the group of d-chains, Cd(K). Note that addition uses Z2 (i.e. mod 2) arithmetic.
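As a small illustration of this representation (our own sketch, using the edge names of the example complex written out with the boundary matrices below), a 1-chain can be stored as a 0/1 vector and chains added coordinatewise mod 2:

```python
import numpy as np

# Fix an ordering of the 1-simplices (edges); n_1 = 7 in the example complex below.
edges = ["ab", "ac", "ad", "ae", "bc", "cd", "ce"]

def chain_vector(simplices):
    """Represent a 1-chain as a 0/1 vector over Z2 in the fixed edge ordering."""
    v = np.zeros(len(edges), dtype=np.uint8)
    for s in simplices:
        v[edges.index(s)] = 1
    return v

c1 = chain_vector(["ab", "ac"])     # the 1-chain ab + ac
c2 = chain_vector(["ac", "bc"])     # the 1-chain ac + bc
print((c1 + c2) % 2)                # ab + bc: the shared edge ac cancels mod 2
```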
The boundary of a d-simplex is the (d − 1)-chain which is the formal sum of the (d − 1)-
simplices which are faces of the d-simplex. For example, the boundary of a 1-simplex is the
chain which is the formal sum of the two vertices which are its endpoints. The boundary of
a d-chain c is then defined as the sum of the (d − 1)-chains which are boundaries of each
of the individual d-simplices appearing in the formal sum c. It is important to note that the
sum over chains uses Z2 (i.e. mod 2) arithmetic, as described in Section 2.2.
This concept is best illustrated by way of example. In Figure 4, the green edges form
the boundary of the 2-chain formed by the three dark grey triangles. Two out of seven 1-
dimensional faces (edges) of the triangles do not appear in the boundary due to the mod 2
addition. Similarly, the tetrahedra of the cube form a 3-chain whose boundary consists of the
triangles of the box bounding the cube.
Figure 4. The green cycle is a boundary. The two blue cycles belong to a nontrivial class.
The red cycle represents another nontrivial class.
For example, for the simplicial complex of Figure 5 (vertices a, b, c, d, e; edges ab, ac, ad, ae, bc, cd, ce; triangles acd and ace), the boundary matrices are

\[
\partial_1 =
\begin{pmatrix}
1 & 1 & 1 & 1 & 0 & 0 & 0\\
1 & 0 & 0 & 0 & 1 & 0 & 0\\
0 & 1 & 0 & 0 & 1 & 1 & 1\\
0 & 0 & 1 & 0 & 0 & 1 & 0\\
0 & 0 & 0 & 1 & 0 & 0 & 1
\end{pmatrix}
\qquad\text{and}\qquad
\partial_2 =
\begin{pmatrix}
0 & 0\\
1 & 1\\
1 & 0\\
0 & 1\\
0 & 0\\
1 & 0\\
0 & 1
\end{pmatrix},
\]

where the rows of ∂1 are indexed by the vertices a, b, c, d, e and its columns by the edges ab, ac, ad, ae, bc, cd, ce, while the rows of ∂2 are indexed by the edges and its columns by the triangles acd and ace.
By convention, we define ∂0 ≡ 0.
A d-cycle is a d-chain with zero boundary. The set of d-cycles forms a subgroup of
the chain group, which is the kernel of the boundary operator, Zd (K) = ker(∂d ). The set
of d-boundaries is defined as the image of the boundary operator, Bd(K) = img(∂d+1);
this set is in fact a group. A d-cycle which is not a d-boundary, z ∈ Zd (K) − Bd (K), is
a nonbounding cycle. In Figure 4, both the green and red chains are 1-cycles. (The red
chain goes around the interior of the tube, but some parts are necessarily occluded in the
rendering.) But only the red chain is nonbounding. It is not hard to see that a d-boundary is
also a d-cycle. Therefore, Bd (K) is a subgroup of Zd (K).
In our case, the coefficients belong to a field, namely Z2 ; when this is the case, the
groups of chains, boundaries and cycles are all vector spaces.1 Computing the boundary of
a d-chain corresponds to multiplying the chain vector with a boundary matrix [b1 , ..., bnd ],
whose column vectors are boundaries of d-simplices in K. By slightly abusing notation,
we call the boundary matrix ∂d .
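For illustration (a sketch of ours, reusing the matrix ∂1 written out above), the boundary of a 1-chain is then just a matrix–vector product mod 2:

```python
import numpy as np

# Boundary matrix d1 of the example: rows a,b,c,d,e; columns ab,ac,ad,ae,bc,cd,ce.
d1 = np.array([[1, 1, 1, 1, 0, 0, 0],    # a
               [1, 0, 0, 0, 1, 0, 0],    # b
               [0, 1, 0, 0, 1, 1, 1],    # c
               [0, 0, 1, 0, 0, 1, 0],    # d
               [0, 0, 0, 1, 0, 0, 1]],   # e
              dtype=np.uint8)

ab_ac_bc = np.array([1, 1, 0, 0, 1, 0, 0], dtype=np.uint8)   # the chain ab + ac + bc
print(d1 @ ab_ac_bc % 2)    # [0 0 0 0 0]: zero boundary, so ab + ac + bc is a 1-cycle

ab = np.array([1, 0, 0, 0, 0, 0, 0], dtype=np.uint8)         # the single edge ab
print(d1 @ ab % 2)          # [1 1 0 0 0]: its boundary is the 0-chain a + b
```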
Consider the set of all cycles of the form z0 + b, where b ranges over Bd(K),
for a fixed z0. This set, denoted [z0] = z0 + Bd(K), is called a coset. Any cycle
belonging to the class can serve as the representative cycle z0. When the representative cycle z0
is a boundary, [z0] = 0 + Bd(K) is the boundary group itself.
The set of equivalence classes (one of which is the boundary group), under the addition
defined by adding representative cycles, forms a group. This group of
equivalence classes is the quotient group Hd(K) = Zd(K)/Bd(K), and is called
the d-dimensional homology group. The boundary group 0 + Bd (K) is the identity element
of Hd (K). Otherwise, when z0 is a nonbounding cycle, [z0 ] is a nontrivial homology class
represented by z0 . Cycles of the same homology class are said to be homologous to each
other, formally, z1 ∼ z2 .
In Figure 4, the two blue cycles are homologous to each other, but not to the red and
green cycles. The 1-dimensional homology group has four different members, represented
by the green cycle (corresponding to the boundary group), the red cycle, one of the blue
cycles, and the sum of a red cycle and a blue cycle, respectively.
In Figure 5, the simplicial complex has 1 nontrivial homology class, represented by four
different nonbounding cycles, (ab + ac + bc), (ab + ad + bc + cd), (ab + ae + bc + ce), and
(ab + ac + ad + ae + bc + cd + ce), whose corresponding vectors are (1, 1, 0, 0, 1, 0, 0)T ,
(1, 0, 1, 0, 1, 1, 0)T , (1, 0, 0, 1, 1, 0, 1)T , and (1, 1, 1, 1, 1, 1, 1)T respectively.
All of the groups we have defined thus far have the structure of vector spaces over the
field Z2 . We can therefore speak of the dimension of any of these groups, by which we
mean the dimension of the corresponding vector space.2 The dimension of the homology
group is referred to as the Betti number,
βd = dim(Hd (K))
= dim(Zd (K)) − dim(Bd (K)).
By linear algebra, the Betti number can be computed by computing the ranks of all boundary
matrices.
As the dimension of the chain group is upper bounded by the cardinality of K, n, so are the
dimensions of Bd (K), Zd (K) and Hd (K).
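As an illustrative sketch (ours, not from the text), these ranks must be computed over Z2; the fragment below does so by Gaussian elimination mod 2 and recovers β1 = 1 for the example complex above:

```python
import numpy as np

def rank_gf2(M):
    """Rank of a 0/1 matrix over the field Z2, via Gaussian elimination mod 2."""
    A = (np.array(M, dtype=np.uint8) % 2).copy()
    rank, rows, cols = 0, A.shape[0], A.shape[1]
    for col in range(cols):
        pivot = next((r for r in range(rank, rows) if A[r, col]), None)
        if pivot is None:
            continue
        A[[rank, pivot]] = A[[pivot, rank]]        # bring the pivot row into place
        for r in range(rows):
            if r != rank and A[r, col]:
                A[r] ^= A[rank]                     # eliminate, mod 2
        rank += 1
    return rank

def betti(n_d, d_d, d_d_plus_1):
    """beta_d = dim Z_d - dim B_d = (n_d - rank d_d) - rank d_{d+1}, over Z2."""
    return (n_d - rank_gf2(d_d)) - rank_gf2(d_d_plus_1)

# Boundary matrices of the example complex (edges ab..ce, triangles acd, ace)
d1 = np.array([[1, 1, 1, 1, 0, 0, 0], [1, 0, 0, 0, 1, 0, 0], [0, 1, 0, 0, 1, 1, 1],
               [0, 0, 1, 0, 0, 1, 0], [0, 0, 0, 1, 0, 0, 1]], dtype=np.uint8)
d2 = np.array([[0, 0], [1, 1], [1, 0], [0, 1], [0, 0], [1, 0], [0, 1]], dtype=np.uint8)
print(betti(7, d1, d2))   # 1: the single nontrivial 1-dimensional homology class
```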
We note that the 0-dimensional homology group provides information about the number
of connected components of a topological space. In particular, the 0th Betti number, β0 , is
equal to the number of connected components. For dimensions d higher than zero, the Betti
number yields information about the number of “independent d-dimensional holes” in that
space. This last sentence is of course imprecise, and is meant only to convey intuition.
2 The definitions which follow can be made without requiring the vector space property, but further mathematical apparatus is required.
What is the purpose of such row and column operations? These operations effectively
change the bases of the chain groups: 0- and 1-dimensional chain groups in the case of ∂1 ,
and 1- and 2-dimensional chain groups in the case of ∂2 . For the d-dimensional boundary
matrix, the set of zero columns corresponds to a basis of the d-dimensional cycle group, i.e.
{ab + ac + bc, ac + ae + ce, ac + ad + cd}. The set of nonzero rows corresponds to a basis
of the (d − 1)-dimensional boundary group. There is a one-to-one correspondence between
this boundary basis and the set of d-chains corresponding to nonzero columns, specified by
these nonzero diagonal entries in the new boundary matrix.3 Each element of this (d − 1)-
3 The relationship may be more complicated if the homology is not over the Z2 field.
dimensional boundary basis is the boundary of its corresponding d-chain. In our example,
the chains ac + ad + cd and ac + ae + ce are the boundaries of the 2-chains acd and ace,
respectively.
In general, combinations of such row and column operations are referred to as reduc-
tions. As can be seen from the example above, reductions enable the computation of Betti
numbers by finding the dimensions of the cycle and boundary groups. When the homology
is over a field, as it is in our case, the reduction may be computed by Gaussian elimination.4
If the homology is not over a field, a more complicated reduction using the so-called Smith
Normal Form [15] must be used.
The idea of reducing the boundary matrices into canonical forms [8] has been extended
to various reduction algorithms for different purposes [16, 10, 17]. Next, we will introduce
one specific reduction, namely, the persistent homology reduction.
Figure 6. Illustrative example for persistent homology: the topological space is the 2D
plane and the filter function is the peaks function in MATLAB.
(c) t = 2.25, a new hole (on the right) is born. (d) t = 3.59, the new hole dies.
Figure 7. Sublevel sets and their homologies. We draw the continuous sublevel sets whereas
the persistence is computed through the simplicial complex.
namely, the simplex-ordering of K with regard to f. This is the order in which simplices
enter the sublevel set f −1(−∞, t] as t increases. Any sublevel set is a subcomplex, denoted
Ki, which has exactly σ1, . . . , σi as its simplices. The nested sequence of sublevel sets
∅ = K0 ⊂ K1 ⊂ · · · ⊂ Km = K
is called a filtration5 of K. Let fi = f(σi) and f0 = −∞; then Ki = f −1(−∞, fi].
5 It is worth noting that, in fact, the theory of persistent homology applies to any filtration, not only those which are derived from the sublevel set structure of a function.
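As a small illustration (our own sketch, assuming the common special case in which f is given on the vertices and extended to each simplex as the maximum over its vertices), the simplex-ordering can be produced by a simple sort:

```python
def simplex_ordering(simplices, vertex_values):
    """Order simplices by filter value, breaking ties by dimension so that every
    face enters the filtration no later than its cofaces.  Assumes f is given on
    the vertices and extended to a simplex as the max over its vertices (a
    lower-star filtration); this is one common choice, not the only one."""
    def f(simplex):
        return max(vertex_values[v] for v in simplex)
    return sorted(simplices, key=lambda s: (f(s), len(s)))

verts = {"a": 0.1, "b": 0.4, "c": 0.2}
K = [("a",), ("b",), ("c",), ("a", "b"), ("a", "c"), ("b", "c"), ("a", "b", "c")]
print(simplex_ordering(K, verts))
# [('a',), ('c',), ('a','c'), ('b',), ('a','b'), ('b','c'), ('a','b','c')]
```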
Points of the persistence diagram carry integer weights, according to their multiplicities. The persistence diagram has been proven
to be stable to changes in the filter function [19]. This stability of persistence implies
that the persistence diagram remains almost the same if we introduce noise into the filter
function (Figure 8(c)).
Alternatively, we could plot the life intervals of persistent classes in the real line. For
a persistent class, the birth and death time are the start and end points of the interval. We
call this representation a persistence barcode. See Figure 8(b) for the persistence barcode
representation of the persistent homology in the preceding example. 0-dimensional classes
(resp. 1-dimensional) are drawn in solid (resp. dashed) lines with round (resp. square) marks
on the start and end points. The essential 0-dimensional class has no death time.
Figure 8. Persistence diagram, barcode and the filter function with noise.
An algorithm for computing persistent homology was developed in [17], which also explains the relationship of this algorithm to Gaussian
elimination. A final version of this algorithm is contained in [21].
To explain the computation of persistence, we follow the exposition of [3, 10], which
unifies the boundary matrices of different dimensions into one overall incidence matrix D.
Rows and columns of D correspond to simplices of K, indexed in the simplex-ordering.
An entry of D is 1 if and only if its corresponding entry is 1 in the corresponding boundary
matrix. The algorithm performs column reductions on D from left to right. Each new
column is reduced by addition with the already reduced columns, until its lowest nonzero
entry is as high as possible.
More specifically, during the reduction, record low (i) as the lowest nonzero entry of
each column i. To reduce column i, we repeatedly find column j satisfying j < i and
low(j) = low(i); we then add column j to column i, until column i becomes a zero column
or we cannot find a qualified j anymore. If column i is reduced to a zero column, low(i)
does not exist. This is equivalent to reducing each boundary matrix into a canonical form,
whose nonzero columns all have different lowest nonzero entries, and thus are linearly
independent. It can be shown that while the order of reduction steps and the reduced matrix
are not unique, the pairs – formed by grouping each column with its lowest nonzero entry
– are unique.
The reduction of D can be written as a matrix multiplication,
R = DV, (1)
where R is the reduced matrix and V is an upper triangular matrix. The reduced matrix R
provides rank(D) many pairs of simplices, (σi , σj ) : low(j) = i. In such a pair, we say
σj is paired on the right and σi is paired on the left. Each simplex appears in at most one pair,
either on the left or on the right (but not both). For simplices that are not paired with any
other simplex, we say they are paired with infinity: (σk, ∞). Simplices paired on the right
are negative simplices. Simplices paired on the left, whether with other simplices or with infinity,
are positive simplices.
A pair (σi , σj ) corresponds to a persistent class, whose birth and death time are fi =
f (σi ) and fj = f (σj ), respectively. That is, positive simplices give birth, while negative
simplices cause death. A pair (σk , ∞) corresponds to an essential class, whose birth time
is fk = f (σk ).
The reduction is completely recorded in the matrix V . Columns of V corresponding to
positive simplices form bases of cycle groups. Columns corresponding to positive simplices
paired with +∞ are cycles representing essential classes and form homology cycle bases.
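The column reduction just described can be sketched compactly as follows (an illustrative implementation of ours; it returns the pairs and the essential columns, and does not record the matrix V):

```python
import numpy as np

def reduce_incidence_matrix(D):
    """Standard persistence reduction over Z2.

    D is an m x m 0/1 incidence matrix whose rows/columns are indexed in the
    simplex-ordering.  Columns are reduced from left to right by adding (mod 2)
    previously reduced columns with the same lowest nonzero entry.  Returns
    (pairs, essential): pairs is a list of (i, j) with low(j) = i, and
    essential lists the columns that remain unpaired.
    """
    R = np.array(D, dtype=np.uint8) % 2
    m = R.shape[1]
    low_inv = {}                       # lowest nonzero row -> column that owns it
    pairs = []
    for j in range(m):
        while True:
            nz = np.flatnonzero(R[:, j])
            if nz.size == 0:           # column reduced to zero: sigma_j gives birth
                break
            low = int(nz[-1])
            if low not in low_inv:     # lowest entry is new: sigma_j causes a death
                low_inv[low] = j
                pairs.append((low, j))
                break
            R[:, j] ^= R[:, low_inv[low]]   # add the conflicting column, mod 2
    paired = {i for i, j in pairs} | {j for i, j in pairs}
    essential = [k for k in range(m) if k not in paired]
    return pairs, essential

# Tiny filtration: vertex a, vertex b, then the edge ab (which merges the two components).
D = np.zeros((3, 3), dtype=np.uint8)
D[0, 2] = D[1, 2] = 1                  # boundary of ab is a + b
print(reduce_incidence_matrix(D))      # ([(1, 2)], [0]): b is killed by ab; a is essential
```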
For the shape signature and natural image statistics applications, the relevant work is essentially contained in a
single paper; in the case of noise reduction and image segmentation, we describe the results
of a number of papers, as these latter problems have received somewhat more treatment in
the still nascent literature.
Then the tangent complex of X, T (X), is the closure of T 0 (X). In the case of a manifold,
the tangent complex is similar to the tangent bundle of the manifold6 – that is, each point
is augmented with the set of unit tangent vectors to the manifold at that point. Thus, for
6 Though not exactly the same, due to the use of unit vectors in Sn−1 in the definition of the tangent complex.
Figure 9. Two visualizations of the tangent complex. Left: the space is the blue cone, and
the unit tangent vectors, which augment each point, are visualized as red circles. Middle and
Right: the space is a circle, here represented as a point cloud (middle); in this 1-dimensional
case, the tangent complex can be represented explicitly (right). (Left figure taken from [24].
Copyright © 2005 World Scientific. Reprinted with permission. All rights reserved. Middle
and right figures taken from [25]. Copyright © 2004 Elsevier. Reprinted with permission.
All rights reserved.)
example, in the case of a smooth surface living in R3 , each point is augmented by a circle
of tangent vectors, see Figure 9 (left). In the case of a smooth closed curve, each point is
augmented by exactly two vectors (i.e., the two elements of S0 ), and the tangent complex
can be visualized more explicitly, see Figure 9 (middle and right). The case of non-manifold
behaviour is somewhat different. If a surface has a crease – e.g. imagine two planes meeting
in a line – then at the crease, each point is augmented with not one, but two circles of tangent
vectors.7 See Figure 10 for an illustration.
Figure 10. The tangent complex for non-manifolds. Left: where the space is a 2-manifold,
each point is augmented by a single circle of unit tangent vectors. Middle and right: where
there is non-manifold behaviour, the space may be augmented by more than one circle
of unit tangent vectors. (Figures taken from [24]. Copyright © 2005 World Scientific.
Reprinted with permission. All rights reserved.)
In order to apply the tools of persistent homology, we will need a filter function for the
space T (X); ideally, the filter function should be geometric. Let us begin by focusing on
the case when X is a curve, and consider the curvature κ(x) at each point x ∈ X. For any
point in the tangent complex, t = (x, ξ) ∈ T (X), we extend the curvature from the curve
itself to the tangent complex in the natural way, κ(t) = κ(x, ξ) ≡ κ(x). Then we may use
the curvature as our filtration function. In the case of curves, this has the effect of focusing
7 In fact, the space T 0 (X) does not contain two circles of tangent vectors at a crease point; however, T (X), which is the closure of T 0 (X), does.
on the flat parts of the curve first, while adding in increasingly more curvy segments as
we increase the value of κ. This idea is illustrated in Figure 11 (middle column), for a
family of ellipses. Note the way in which the ellipses with different eccentricities have
different-looking filtered tangent complexes; while the homologies of the original ellipses
are equivalent, the persistent homologies of these augmented spaces will be quite different,
as desired. To extend the definition of this filtration to arbitrary manifolds – and indeed,
arbitrary spaces – one may, for each t = (x, ξ), define a circle of second order contact (akin
to a classical osculating circle). The reciprocal of the radius of this circle gives an analogue
to the curvature, which we may then use as a filter function. This quantity is essentially the
sectional curvature at the point x in the direction ξ (though it is defined in [24, 25] to apply
to non-manifold spaces as well). The interested reader is referred to [24, 25] for further
details.
Finally, the shape signature for the space X is found by computing the persistent homol-
ogy of the filtered tangent complex. This leads to a set of persistent barcodes, one set for
each dimension. Recall, from Section 2.7., that the barcodes consist of intervals of the real
line: the beginning of the interval is the birthtime and the end of the interval is the deathtime
of the feature in question. These barcodes can be visualized by stacking the intervals, see
Figure 11 (right column).
In order to compare two shapes, the barcode of a given dimension of the first shape is
compared with the barcode of the same dimension of the second shape. The metric between
barcodes is given by a matching algorithm: the cost of matching two intervals is given by
the length of their symmetric difference, while the cost of unmatched intervals is simply
their length. The distance between two barcodes is then given by the cost of the minimal
matching; this distance, which can be computed by a bipartite graph matching algorithm,
can be shown to be a metric, as desired.
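As an illustration of this barcode metric (a sketch of ours, using a generic assignment solver rather than the specific matching algorithm of [24, 25]):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def interval_length(iv):
    b, d = iv
    return d - b

def symdiff_cost(iv1, iv2):
    """Length of the symmetric difference of two real intervals."""
    (b1, d1), (b2, d2) = iv1, iv2
    overlap = max(0.0, min(d1, d2) - max(b1, b2))
    return (d1 - b1) + (d2 - b2) - 2.0 * overlap

def barcode_distance(A, B):
    """Minimal-cost matching between barcodes A and B (lists of (birth, death)
    intervals): matching two intervals costs their symmetric difference, while
    leaving an interval unmatched costs its length."""
    n, m, BIG = len(A), len(B), 1e9
    C = np.zeros((n + m, m + n))
    C[:n, :m] = [[symdiff_cost(a, b) for b in B] for a in A]
    C[:n, m:] = BIG
    C[n:, :m] = BIG
    for i, a in enumerate(A):
        C[i, m + i] = interval_length(a)      # option: leave a_i unmatched
    for j, b in enumerate(B):
        C[n + j, j] = interval_length(b)      # option: leave b_j unmatched
    rows, cols = linear_sum_assignment(C)
    return C[rows, cols].sum()

print(barcode_distance([(0.0, 3.0), (1.0, 2.0)], [(0.0, 2.5)]))   # 0.5 + 1.0 = 1.5
```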
We briefly mention some of the implementation details that are necessary for computing
this signature when the shape of interest is given by a point cloud, which is a common
case in practice. The tangent complex can be computed by using PCA on points in the
neighbourhood of a given point (taken by the k nearest neighbours for a parameter k); if the
space is not smooth, this information can be backed out of the eigenanalysis of the PCA,
though this is more complicated. To compute the curvature at a given point, a circle of
second order contact is fit to the data. To compute the persistence, a simplicial complex
is first computed as the Cech complex, which is the nerve of the set of balls where each
ball (of a fixed radius ε ) is centered around a point in the tangent complex. That is, the
complex is constructed as follows: where two balls intersect, an edge is placed; where three
balls intersect, a triangle is placed; and so on. This complex can be shown to be homotopy
equivalent to, and hence have the same homology as, the the union of balls [26]. Finally, the
standard persistent homology algorithm (see Section 2.9.) can be applied to this simplicial
complex. For further details, the reader is referred to [24, 25].
Results of applying the algorithm to the problem of computing shape signatures of a set
of handwritten letters are shown in Figure 12. Quite obviously, there are mature technolo-
gies available for the optical character recognition (OCR) problem, and this technique does
not outperform them; nonetheless, the results illustrate the power of the approach. In this
example, there are eight letters, of which there are ten examples of each. In the experiment,
each shape is assigned the barcodes described above, as well as extra barcodes derived from
Figure 11. The persistence barcodes of shapes. Left: various ellipses, represented as point
clouds. Middle: the filtered tangent complexes of these point shapes. The filter function
is represented using colour, where the colour code (running from κ = 0 to 50) is given at the
bottom of the figure. Right: the corresponding barcodes (dimension 0), computed using
persistent homology. (Figures taken from [25]. Copyright © 2004 Elsevier. Reprinted with
permission. All rights reserved.)
other filtrations; the purpose of these other filtrations is to increase the power of discrimi-
nation between shapes. Two shapes are then compared by computing the distance between
each pair of their corresponding barcodes, and taking a weighted sum of these distances as the
overall distance (see [25] for details). The resulting distance matrix for these 80 elements
is shown in Figure 12: clearly, the shape signatures have the ability to capture the shape
characteristics of the handwritten letters.
Figure 12. The distance matrix for the 80 letters (10 examples of each of 8 letters); dark
represents a small value, and light represents a large value. (Figures taken from [25].
Copyright © 2004 Elsevier. Reprinted with permission. All rights reserved.)
The work described in this section [35] takes a homological approach to characterizing natural image statistics. The data used in this paper
is the same as that of Lee et al. [32], which we briefly review.8 A collection of more than
4,000 images of natural scenes is used. From each image, 5,000 3 × 3 image patches are
selected randomly, of which the top 20% with highest contrast in log-intensity are retained.
This leads to a total collection of a bit more than 4 million patches.
More specifically, the process of choosing the top 20% of the image patches, per base
image, works as follows. First, each 3 × 3 patch is represented as a 9-dimensional vector.
Second, the elementwise logarithm of each vector is taken. Third, for each vector, the mean
of the vector is subtracted off of each element of the vector. Fourth, if the result of the prior
computation is the vector v, then the so-called D-norm of v, i.e. √(vᵀDv), is computed;
D is a particular 9 × 9 symmetric positive definite matrix. Only the 20% of vectors with
the highest D-norm are retained. Finally, each vector v is normalized by the D-norm, i.e.
v ← v/√(vᵀDv).
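This preprocessing pipeline can be sketched as follows (our own illustration; the particular matrix D of Lee et al. is not reproduced here, so the identity matrix is used below purely as a stand-in):

```python
import numpy as np

def preprocess_patches(patches, D, keep_fraction=0.2):
    """Preprocess 3x3 patches as described: elementwise log, mean-centre, keep
    the top fraction by D-norm, then normalize by the D-norm.  `patches` is
    (N, 9) with positive intensities; D is a 9x9 symmetric positive definite
    matrix (the specific D of Lee et al. is not reproduced here).  The result
    consists of mean-zero vectors of unit D-norm, i.e. points on a 7-sphere
    inside an 8-dimensional subspace of R^9."""
    v = np.log(np.asarray(patches, dtype=float))           # elementwise logarithm
    v = v - v.mean(axis=1, keepdims=True)                  # subtract the patch mean
    d_norm = np.sqrt(np.einsum("ni,ij,nj->n", v, D, v))    # sqrt(v^T D v) per patch
    top = d_norm >= np.quantile(d_norm, 1.0 - keep_fraction)
    return v[top] / d_norm[top, None]                      # normalize the retained patches

# Illustration with random positive patches and D = identity (a stand-in only).
rng = np.random.default_rng(0)
patches = rng.uniform(0.1, 1.0, size=(1000, 9))
points = preprocess_patches(patches, np.eye(9))
print(points.shape)    # about (200, 9)
```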
Examining this process, we see that in the first and second steps, the vectors live
in R9 . In the third step, after the mean has been subtracted off, the vectors live in an 8-
dimensional subspace of R9 . In the final step, after normalization, the transformed vectors
live in S7 , the 7-sphere.
In this setting, the goal is thus to characterize the statistics of the dense point cloud lying
on S7 . In order to do so, a filter function related to the density of the points is introduced.
For each point x in the point cloud, let δk (x) be the distance from x to its k th nearest
neighbour. Thus, δk (x) is inversely related to the density: the smaller is δk (x), the more
densely represented is the area around a point. Setting the parameter k to have a small value
results in a focus on the fine-scale structure of the data; whereas for k large, the coarse-scale
structure of the point cloud becomes more apparent. For a fixed k , the function δk (x) is
used as the filter function, with one caveat: the filtration ends when we have accumulated
a fraction T of the points, where T is usually taken as 0.25. The reason for this latter
restriction is that the high density points (i.e., the fraction T of points with the smallest
values of δk (x)) are said to form a “stable core,” which best represent the image statistics.
Finally, to speed up the computation, 5,000 points are sampled at random from the data,
and the persistence computation is performed on this subset. Many random samplings are
taken to ensure consistency.
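A sketch of this density filter and the selection of the "stable core" follows (our own illustration; the repeated random subsampling is omitted):

```python
import numpy as np

def density_filter(points, k=300, T=0.25):
    """Compute the filter delta_k(x) = distance from x to its k-th nearest
    neighbour, and return the densest fraction T of the points (the 'stable
    core') together with their filter values."""
    P = np.asarray(points, dtype=float)
    # pairwise distances; the k-th nearest neighbour excludes the point itself
    dists = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
    delta_k = np.sort(dists, axis=1)[:, k]
    order = np.argsort(delta_k)
    core = order[: int(T * len(P))]        # fraction T with the smallest delta_k
    return P[core], delta_k[core]

# Small example: the core concentrates where the samples are densest.
rng = np.random.default_rng(1)
cloud = np.concatenate([rng.normal(0, 0.1, (400, 2)),     # dense blob
                        rng.uniform(-3, 3, (100, 2))])    # sparse background
core_pts, core_vals = density_filter(cloud, k=25, T=0.25)
print(len(core_pts))    # 125 points, drawn mostly from the dense blob
```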
Given this filter function, what statistics are discovered? The first important discovery
is that with a k value of 300 – that is, a large k value corresponding to a relatively coarse scale
– there is a single long-lived 1-dimensional homology class, see Figure 13 (top). To what
does this circle on S7 correspond? An examination of the patches making up this circle
indicates that they are patches with a light region on one side of the patch, and a dark region
on the other: that is, they are edges. The circular structure turns out to be derived from the
angle of the line separating the dark and the light regions; the angle can effectively take on
all values from 0 to π. See Figure 13 (bottom) for an illustration.
The second important discovery uses a k value of 25, for a more fine scale analysis.
At this scale, it is observed that there are 5 long-lived generators of the 1-dimensional
homology group, see Figure 14 (top). There are several structures which can give rise to
β1 = 5 on the 7-sphere; on examination of the data, it turns out that the relevant structure is
8 In fact, [32] uses both range and natural images; here, we focus only on the natural images, as these are what is used in [35].
Figure 13. Coarse scale structure of the image statistics, corresponding to k = 300. Top: at
this scale, the persistence barcode indicates that the data has one long-lived 1-dimensional
homology class. Bottom: this single circle in S7 corresponds to edges, with the angle in
the circle giving the angle of the edge. (Figures taken from [11]. Copyright © 2008 Robert
Ghrist. Reprinted with permission. All rights reserved.)
a series of three interlocking circles, see Figure 14 (bottom). The first circle is the same as
that already discovered at the coarse scale; the other two each intersect the original circle
twice, but are disjoint from each other. It turns out that these two new circles represent
patches which include three stripes, in which the stripes are not necessarily monotonic
by grayscale, see Figure 15. Because they intersect the original circle, they include the
edge patches described above (which are three stripe patterns, but where the stripes are
monotonic by grayscale); but they also include non-edge patches, such as a patch consisting
of two dark stripes sandwiching a light stripe. What is the difference between the two new
circles? One circle represents horizontal stripes, and the other vertical stripes. Note that
these horizontal and vertical stripes are not due to pixellization effects – if the images are
all rotated by an angle of π/4, the true horizontal and vertical directions, represented as
diagonals in this coordinate system, are still discovered [35].
Two more points deserve mention. First, more complex information can also be gleaned
from the data, by looking at higher-order homology groups. For example, in looking at the
2-dimensional persistent homology, one finds a Klein Bottle structure. This structure has
an explanation in terms of the underlying patches, but one which is somewhat involved to
explain. The interested reader is referred to [35, 11] for a more in depth exposition. Second,
Figure 14. Fine scale structure of the image statistics, corresponding to k = 25. Top: at
this scale, the persistence barcode indicates that the data has five long-lived 1-dimensional
homology classes. Bottom: β1 = 5 results from a series of three interlocking circles, in
which the first circle is the coarse scale circle, while the other two each intersect the original
circle twice, but are disjoint from each other. (Figures taken from [11]. Copyright © 2008
Robert Ghrist. Reprinted with permission. All rights reserved.)
Figure 15. Analyzing the fine scale structure. The two new circles represent patches which
include three stripes, in which the stripes are not necessarily monotonic by grayscale.
(Figures taken from [11]. Copyright © 2008 Robert Ghrist. Reprinted with permission. All
rights reserved.)
this type of analysis has not been limited to natural image statistics; in a recent paper [36],
Singh et al. have applied the same set of techniques to visual cortex data from experiments
on primates. The latter should be of interest to the biological vision community.
Figure 16. Simplification of a terrain image. Left: the original terrain image, in which
heights range from blue (low) through cyan, green, yellow, and to red (high). Middle and
right: simplifications of the terrain image which preserve large critical points. (Figures
taken from [37]. Copyright © 2009 Dominique Attali. Reprinted with permission. All
rights reserved.)
What is the connection between critical points and the homological tools that we have
thus far discussed? The answer is that the critical points of a function and the persistent
homology of that function are intimately related. In particular, the topology of the sublevel
set of the function changes whenever there is a critical point; this can be seen in the exam-
ple of Figure 7 and corresponding discussion in Section 2.7. As we know, the persistent
homology itself is defined based on the changing homology of the sublevel sets. In fact,
then, it turns out that the critical points of a function are in two-to-one correspondence with
the points of the persistence diagram. That is, every birth time and every death time of a
homological feature corresponds to a critical point of the relevant function. (Recall that
each point in the persistence diagram consists of both a birth and death coordinate, which
is why the correspondence is two-to-one.)
Note that this relationship between the homology of a space and the critical points of a
function on that space dates to Morse and his eponymous Morse Theory [38, 39]. Classical
Morse Theory assumes a smooth function, which in addition satisfies a mild genericity con-
dition known as the Morse condition. The advantage of the persistent homology approach
is that no smoothness is assumed for the function, so that a sensible definition of critical
points exists even when the underlying function is not smooth. That is, a homological crit-
ical value [19] of a function is a value at which the homology of the sublevel set of the
function changes. This definition corresponds with the traditional critical point (at which
the derivative vanishes) for smooth functions, but is more general. For example, it can be
applied to a piecewise linear function, which is a standard case seen in applications.
Having established the relationship between critical points and points in the persistence
diagram, we may now formulate the problem of feature-sensitive noise reduction as that of
removing points with small persistence from the persistence diagram, while leaving the other
points in the persistence diagram alone. This has the effect of retaining important critical
points, while discarding other, less significant ones. This problem has been addressed in
the persistent homology literature, and goes under the title of persistence simplification. The
formal problem of persistence simplification, as described in [40, 37], is, roughly speaking, as
follows: given the function f and a tolerance ε > 0, find a function g with ‖f − g‖∞ ≤ ε whose
persistence diagram retains only those points of the persistence diagram of f that lie sufficiently
far from the diagonal (i.e. whose persistence exceeds a threshold determined by ε), all
low-persistence points having been removed. Such a g is called an ε-simplification of f.
Figure 17. Persistence simplification. Left: the original persistence diagram for a given
function f . Right: the persistence diagram for the ε-simplification g of the function f .
Here ε = 0.3.
It should be clear that if such a g can be found, it will have solved the problem of
feature-sensitive noise reduction, see Figure 17. Before examining algorithms for this general
problem, however, we observe an interesting connection with the seemingly unrelated
problem of segmentation. It will turn out that an important problem in segmentation admits
a simpler version of persistence simplification, for which an algorithm has been developed.
We will then return to a high-level discussion of results for the more general persistence
simplification problem.
3.3.1. Segmentation
The connection between persistence simplification and segmentation comes about through
the Mean Shift algorithm. In Mean Shift segmentation [41, 42, 43], each pixel in the image
is assigned a feature vector – generally speaking, colour, texture, or some combination of
the two, which is then often augmented by position. A non-parametric estimate of the
probability density in this feature space – the Kernel Density Estimate (KDE) – is then
constructed. Given this KDE, the image is segmented according to the modes, or local
maxima, of the KDE. In particular, each local maximum of the KDE represents a segment
of the image, and each pixel is assigned to the local maximum in whose basin of attraction
the pixel lies.
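For reference, a minimal generic Mean Shift sketch in feature space looks as follows (our own illustration with a Gaussian kernel; not the specific implementations of [41, 42, 43]):

```python
import numpy as np

def mean_shift_modes(features, bandwidth=1.0, steps=50, tol=1e-3):
    """Run Mean Shift on each feature vector: repeatedly move a point to the
    kernel-weighted mean of the data, i.e. climb the KDE.  Points that converge
    to within bandwidth/2 of each other are assigned to the same mode/segment."""
    X = np.asarray(features, dtype=float)
    Y = X.copy()
    for _ in range(steps):
        sq = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.exp(-sq / (2 * bandwidth ** 2))             # Gaussian kernel weights
        Y_new = (W @ X) / W.sum(axis=1, keepdims=True)
        if np.abs(Y_new - Y).max() < tol:
            Y = Y_new
            break
        Y = Y_new
    # group converged points into modes
    labels, modes = -np.ones(len(Y), dtype=int), []
    for i, y in enumerate(Y):
        for m, mode in enumerate(modes):
            if np.linalg.norm(y - mode) < bandwidth / 2:
                labels[i] = m
                break
        else:
            modes.append(y)
            labels[i] = len(modes) - 1
    return labels, np.array(modes)

# Two well-separated colour clusters yield two segments (modes).
rng = np.random.default_rng(2)
feats = np.concatenate([rng.normal(0, 0.2, (50, 3)), rng.normal(3, 0.2, (50, 3))])
labels, modes = mean_shift_modes(feats, bandwidth=0.8)
print(len(modes))   # 2
```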
Despite its success in many applications, the Mean Shift algorithm is known to generally
yield an oversegmentation; that is, it produces too many segments. In some applications,
this is tolerable; Mean Shift is sometimes used as preprocessing for a more sophisticated
clustering algorithm, and is used simply to reduce the complexity of this second algorithm.
On the other hand, it would be desirable if Mean Shift were able to yield a more precise
segmentation on its own. Since Mean Shift tends to oversegment, the results might be
corrected by forcing the algorithm to produce fewer clusters. Since each cluster corresponds
to a local maximum of the KDE, the problem of mitigating oversegmentation is equivalent
to the problem of filtering the KDE so as to preserve the most important local maxima, while
eliminating smaller ones. In this sense, this problem is formally similar to a simpler version
of the feature-sensitive noise reduction and persistence simplification outlined above. In
particular, we are interested not in preserving all large critical points, but rather, only large
local maxima.
Chazal et al. [44] present an elegant and practical method for attacking this problem.
Before discussing their method, however, it is worth pointing out the contribution of Paris
and Durand [45], who also attempt to tackle this problem. The basic idea of the paper is
roughly in consonance with the approach of persistence simplification, and is the first work,
to our knowledge, to try to tackle this problem in relation to segmentation. However, there
are problems: the method is very “digital,” in that it attempts to find basins of attraction
and then to use a topological persistence oriented criterion for eliminating modes, by just
using the digital grid, without a true simplicial complex (or other appropriate cell complex)
underlying the analysis. The digitality proves to be a problem, both in theory and in practice.
In theory, there is not much one can prove here; and in practice, some grid points are never
classified as belonging to any basin of attraction, and a heuristic must be used.
The method of Chazal et al. [44], by contrast, is well-founded theoretically, and pro-
vides some nice practical properties as well. The set up is as follows. It is assumed that the
topological space X is unknown and the function f : X → R is only specified on a finite
subset L ⊂ X. The specification of L itself is entirely coordinate-free; all that is needed
is the set of pairwise distances between all points in L.9 The goal is to compute the per-
sistence of this sampled representation, and to simplify it, in the sense above: to eliminate
local maxima whose persistence is smaller than a given threshold.
The first contribution is to show how the persistence diagram itself may be computed
for the sampled representation. The Rips Complex of a f nite point set L and a positive
real number δ is denoted Rδ (L) and is def ned as follows. A k-simplex σ with vertex
set v0 , . . . , vk ∈ L belongs to Rδ (L) if the distance between all pairs of vertices is less
than or equal to δ: d(vi , vj ) ≤ δ for all i, j. It turns out that there is no value of δ for
which the homology of the Rips Complex Rδ (L) matches that of X, even for well-sampled
spaces. Instead, the relationship between a pair of Rips Complexes, Rδ (L) and R2δ (L),
is sufficient to yield the persistent homology of the function f : X → R. The details of
this procedure are highly technical, relying on many algebraic concepts, such as persistence
modules, which we do not define here. Instead we give a rough sketch, and the interested
reader is referred to [44] for the full treatment.
Well-sampling means that the subset L is an ε-geodesic sample of X, i.e. that any point in X has a geodesic distance of less than ε from some point in L, for ε less than 1/4 of the strong convexity radius of X (which we do not define here). If X is well-sampled by
L in the above sense, then Chazal et al. show that an approximation to the persistence
diagram of f : X → R can be computed from the function values on L. Technically, let
Lα = L ∩ f⁻¹((−∞, α]) be the discrete analogue of the sublevel set; then the persistence diagram of f and the persistence diagram of the persistent homology module {Rδ(Lα) ↪ R2δ(Lα)}α∈R have a bottleneck distance of at most a constant times δ, when δ is chosen to
be greater than 2ε and less than 1/2 the strong convexity radius of X. This result relies on an
earlier result [46] which showed that the relationship between the pair of Rips Complexes,
Rδ (L) and R2δ (L), is sufficient to provably approximate the homology of the domain X.
Note that this result on its own is quite important, as it allows for the computation of
persistence in high dimensional spaces, when an approach based on a simplicial complex
would be too expensive (note that the size of the simplicial complex generally grows ex-
ponentially with the embedding dimension). In addition, though, the authors show how to
use this scheme to simply deal with the oversegmentation problem from Mean Shift. First,
a sampled version of Mean Shift is proposed: at any point, a steepest ascent vector is de-
fined by looking at the highest (by function value) point in the neighbourhood of the point,
where the neighbourhood is defined using the 1-skeleton of the Rips Complex R2δ(L). The
point is then moved according to this steepest ascent vector, unless the point itself is the
highest point in its own neighbourhood, in which case it is deemed a local maximum. The
persistence of each such local maximum has already been computed using the algorithm
sketched in the previous paragraph. In fact, the result of that algorithm is a diagram in which
neighbouring maxima are linked. More formally, the diagram consists of pairs (v, e), where
v is a local maximum and e is an edge of the 1-skeleton of the Rips Complex R2δ (L) that
links the connected component created by v in R2δ (L) to the one created by some higher
maximum u. If the lifespan of the connected component of v is shorter than some threshold,
then the cluster of v is merged into that of u.
This algorithm, in addition to its provable properties (under appropriate sampling), can
be shown to have a reasonable complexity. In particular, the complexity of the first part of
the algorithm, the computation of persistence, is O(n³), where n is the number of points
in L, whereas the second part is close to linear in n. Results of applying the algorithm are
illustrated in Figure 18.
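The following toy sketch conveys the flavour of the discrete procedure just described: steepest-ascent pointers on the 1-skeleton of a neighbourhood graph at scale 2δ, followed by merging of clusters whose peaks are not prominent enough. It is an illustration only; in particular, the "death level" used below is a crude stand-in for the true persistence computed in [44], and all names and parameters are assumptions.

```python
import numpy as np

def discrete_mean_shift(points, f, delta, tau):
    """Toy graph-based hill climbing with threshold-based cluster merging.
    points: (n, d) feature vectors; f: (n,) density estimates (assumed
    distinct); delta: Rips scale; tau: merging threshold."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    nbrs = [np.where((dist[i] <= 2 * delta) & (np.arange(n) != i))[0]
            for i in range(n)]

    # steepest ascent: each point climbs to its highest neighbour (by f),
    # or stays put if it is the highest point in its own neighbourhood
    parent = np.arange(n)
    for i in range(n):
        if len(nbrs[i]):
            j = nbrs[i][np.argmax(f[nbrs[i]])]
            if f[j] > f[i]:
                parent[i] = j

    def peak(i):                       # follow ascent pointers to a local max
        while parent[i] != i:
            i = parent[i]
        return i

    labels = np.array([peak(i) for i in range(n)])

    # merge a cluster into a higher one if its peak is not prominent enough;
    # the "death level" here only approximates the true persistence pairing
    redirect = {v: v for v in set(labels)}
    def resolve(v):
        while redirect[v] != v:
            v = redirect[v]
        return v

    for v in sorted(redirect, key=lambda v: -f[v])[1:]:   # skip highest peak
        members = np.where(labels == v)[0]
        death, target = -np.inf, None
        for i in members:
            for j in nbrs[i]:
                w = resolve(labels[j])
                if w != v and f[w] > f[v] and min(f[i], f[j]) > death:
                    death, target = min(f[i], f[j]), w
        if target is not None and f[v] - death < tau:
            redirect[v] = target
    return np.array([resolve(v) for v in labels])
```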
As opposed to our treatment of the problems in the previous sections, in which we focused
on one particular algorithm, we will, in this case, proceed to summarize the state of existing
results in this field. This is mainly due to the fact that a somewhat larger literature has de-
veloped to tackle the problem of persistence simplification, though open problems certainly
remain.
The first paper to introduce persistent homology [1] had already considered the problem
of persistence simplification. In this paper, the setting is purely simplicial, and the filtration
itself is given by an ordering of the simplices. An algorithm is given there for reordering
of the simplices which simplifies the persistence. The main problem with this algorithm is
that it reduces the persistence of all features; that is, all points in the persistence diagram are moved closer to the diagonal, not just the smaller ones. This is not really desirable, as we wish to preserve the large features as precisely as possible, while removing the smaller ones.
Figure 18. Segmentation. Top left: the function. Top right: the persistence barcode, illustrating the six large local maxima. Bottom left: the results of (discrete) Mean Shift. Bottom right: the results after merging clusters using persistence. (Figures taken from [44]. Copyright © 2009 Society for Industrial and Applied Mathematics. Reprinted with permission. All rights reserved.)
Another series of papers tackles the problem of simplification of Morse-Smale complexes [47, 48, 49]. Briefly, the Morse-Smale complex bears a close relationship to the
problem of Mean-Shift segmentation. The stable manifolds which are key ingredients in
the construction of the Morse-Smale complex are essentially the same as the basins of at-
traction of the modes in Mean Shift; however, the Morse-Smale complex also uses the
concept of unstable manifolds, which do not have a direct analogy in Mean Shift. (Unstable
manifolds essentially allow points to run “down the hill” instead of up; this can be seen as
performing mean shift on the negative KDE. The Morse-Smale complex then takes intersec-
tions of stable and unstable manifolds to build up a cell complex.) The main issue related
to these results for simplification is that they are highly dependent on the low dimension
of the underlying topological space; the cases of dimension 2 and 3 are considered in the
previously cited papers. In many cases, clustering occurs in spaces of somewhat higher
dimension; for example, a common choice in image segmentation is d = 5, where each
vector comprises 3 colour and 2 spatial dimensions.
The original paper to pose the problem of persistence simplification as in Definition 1
was that of Edelsbrunner et al. [40]. In this paper, the problem was solved for piecewise
linear 2-manifolds (surface meshes); the main problem with the approach is simply that it
is very complicated, with many subcases considered. A more recent paper of Attali et al.
[37] also tackles the problem of simplification on surfaces and provides an algorithm that is simple to implement and has a low complexity – linear in the number
of simplices. Both of these papers are restricted to the low-dimensional setting.
Finally, we note that multiple simplification algorithms are introduced and discussed in
[18].
References
[1] H. Edelsbrunner, D. Letscher, and A. Zomorodian. Topological persistence and sim-
plification. Discrete and Computational Geometry, 28(4):511–533, 2002.
[4] R. Ghrist and A. Muhammad. Coverage and hole-detection in sensor networks via
homology. In Proceedings of the 4th International Symposium on Information Pro-
cessing in Sensor Networks, 2005.
[5] V. de Silva, R. Ghrist, and A. Muhammad. Blind swarms for coverage in 2-D. In
Robotics: Science and Systems, 2005.
[6] T.K. Dey, K. Li, J. Sun, and D. Cohen-Steiner. Computing geometry-aware handle and
tunnel loops in 3D models. ACM Transactions on Graphics (TOG), 27(3):45:1–45:9,
2008.
[7] H. Edelsbrunner. Surface tiling with differential topology. In Proceedings of the Third
Eurographics Symposium on Geometry Processing, pages 9–11, 2005.
[11] R. Ghrist. Barcodes: The persistent topology of data. Bulletin of the American Math-
ematical Society, 45(1):61–75, 2008.
[14] G. Carlsson. Topology and data. Bulletin of the American Mathematical Society,
46(2):255–308, 2009.
[15] R. Kannan and A. Bachem. Polynomial algorithms for computing the Smith and
Hermite normal forms of an integer matrix. SIAM Journal on Computing, 8:499–507,
1979.
[17] Afra Zomorodian and Gunnar Carlsson. Computing persistent homology. Discrete
and Computational Geometry, 33(2):249–274, 2005.
[18] Afra Zomorodian. Topology for Computing. Cambridge Monographs on Applied and
Computational Mathematics. Cambridge University Press, 2005.
[20] F. Chazal, D. Cohen-Steiner, M. Glisse, L.J. Guibas, and S.Y. Oudot. Proximity of per-
sistence modules and their diagrams. In Proceedings of the 25th Annual Symposium
on Computational Geometry, pages 237–246, 2009.
[22] D. Zhang and G. Lu. Review of shape representation and description techniques.
Pattern Recognition, 37(1):1–19, 2004.
[23] J.W.H. Tangelder and R.C. Veltkamp. A survey of content based 3D shape retrieval
methods. Multimedia Tools and Applications, 39(3):441–471, 2008.
[25] A. Collins, A. Zomorodian, G. Carlsson, and L.J. Guibas. A barcode shape descriptor
for curve point cloud data. Computers & Graphics, 28(6):881–894, 2004.
[26] J. Leray. Sur la forme des espaces topologiques et sur les points fixes des
représentations. J. Math. Pures Appl., IX. Sér., 24:95–167, 1945.
[27] F. Chazal, D. Cohen-Steiner, L.J. Guibas, F. Mémoli, and S.Y. Oudot. Gromov-
Hausdorff stable signatures for shapes using persistence. In Computer Graphics Fo-
rum, volume 28, pages 1393–1403, 2009.
[28] D.J. Field. Relations between the statistics of natural images and the response proper-
ties of cortical cells. Journal of the Optical Society of America A, 4(12):2379–2394,
1987.
[29] B. Olshausen and D. Field. Natural image statistics and efficient coding. Network:
Computation in Neural Systems, 7(2):333–339, 1996.
[30] A. Van der Schaaf and JH Van Hateren. Modelling the power spectra of natural im-
ages: statistics and information. Vision Research, 36(17):2759–2770, 1996.
[31] J. Huang and D. Mumford. Statistics of natural images and models. In Proceedings of
the IEEE Computer Society Conference on Computer Vision and Pattern Recognition,
volume 1, pages 541–547, 1999.
[32] A.B. Lee, K.S. Pedersen, and D. Mumford. The nonlinear statistics of high-contrast
patches in natural images. International Journal of Computer Vision, 54(1):83–103,
2003.
[33] Y. Weiss and WT Freeman. What makes a good model of natural images? In IEEE
Conference on Computer Vision and Pattern Recognition, 2007. CVPR’07, pages 1–8,
2007.
[34] T.S. Cho, M. Butman, S. Avidan, and W.T. Freeman. The patch transform and its
applications to image editing. In IEEE Conference on Computer Vision and Pattern
Recognition, 2008. CVPR 2008, pages 1–8, 2008.
[36] G. Singh, F. Memoli, T. Ishkhanov, G. Sapiro, G. Carlsson, and D.L. Ringach. Topo-
logical analysis of population activity in visual cortex. Journal of Vision, 8(8):1–18,
2008.
[41] K. Fukunaga and L. Hostetler. The estimation of the gradient of a density function,
with applications in pattern recognition. IEEE Transactions on Information Theory,
21(1):32–40, 1975.
[42] Y. Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 17(8):790–799, 1995.
[43] D. Comaniciu and P. Meer. Mean shift: a robust approach toward feature space analy-
sis. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 603–619,
2002.
[44] F. Chazal, L.J. Guibas, S.Y. Oudot, and P. Skraba. Analysis of scalar fields over point
cloud data. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Dis-
crete Algorithms, pages 1021–1030. Society for Industrial and Applied Mathematics
Philadelphia, PA, USA, 2009.
In: Computer Vision ISBN: 978-1-61209-399-4
Editor: Sota R. Yoshida © 2011 Nova Science Publishers, Inc.
Chapter 6
SPECTRAL IMAGING VERSUS NON-SPECTRAL IMAGING
Ismail Bogrekci
Mechanical Engineering Department, Faculty of Engineering,
Adnan Menderes University, Turkey
SUMMARY
Conventional approaches to computer vision have created opportunities for many disciplines and have led to numerous applications and practices in everyday life; spectral imaging promises many more in the future. This chapter first defines spectral and non-spectral imaging and explains the difference between them. Image acquisition tools and techniques are then discussed, followed by image processing techniques for spectral and non-spectral data. Finally, the chapter discusses the advantages and disadvantages of spectral and non-spectral imaging.
INTRODUCTION
Spectral and non-spectral imaging should not be thought of as competitors; each has its own pros and cons in terms of technology, applications, and interpretation. Since imaging technology is continuously developing, the future of imaging will keep surprising us with new approaches. In every case light interacts with a material, and this interaction produces the many different views and images used in practical life.
Spectral and non-spectral imaging systems have a wide variety of applications on platforms ranging from space to the human cell; the application areas include medicine, space, engineering, agriculture, security, communication, and education. In this respect, the importance of imaging technologies is well understood: without imaging technology, life would be blurry.
To give an example from medicine, a surgeon, or a future robot surgeon, cannot operate on a patient with minimal cutting and disturbance without imaging technology.
THE CONCEPT
The underlying concept of non-spectral and spectral imaging is essentially the same. Before it is explained, some definitions are given for better understanding. Imaging is the creation of images of a physical scene using a camera or similar device, and it can be divided into analog and digital imaging; currently, digital imaging dominates. A digital image may be created directly from a physical scene by a camera or similar device. Alternatively, it may be obtained from another image in an analog medium, such as a photograph, photographic film, or printed paper, by an image scanner or similar device. There are also imaging practices in science that do not use a camera or similar device; ultrasound images are one example. Many technical images, such as those acquired with tomographic equipment, side-scan sonar, or radio telescopes, are actually obtained by complex processing of non-image data. This conversion of analog real-world data is known as digitizing, and involves sampling (discretization) and quantization.
Figure 1 illustrates the binary imaging representation of an L-shaped item: the digital numbers (DNs) are given in Figure 1b and the array voltage representation is shown in Figure 1c. A 4 by 4 pixel array detector is used as an example, with a minimum voltage level of 0 V and a maximum of 5 V. When the L-shaped, black-coloured item is fixed to the white board, the image appears as in Figure 1a, the voltage levels of the individual detectors are observed as in Figure 1c, and the DN values of the image matrix, for a 2-bit representation, are as in Figure 1b.
Figure 1. Binary imaging representation of 4 by 4 pixel array a) image, b) DN values and c) voltages
(volt).
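As a small illustration of the quantization step, the sketch below maps a hypothetical 4 × 4 array of detector voltages (0–5 V) to 2-bit DN values in the spirit of Figure 1; the voltage pattern is invented for the example and is not taken from the figure.

```python
import numpy as np

# Hypothetical 4x4 detector voltages (volts): 0 V = dark, 5 V = bright,
# mimicking a dark L-shaped item on a white board.
voltages = np.array([
    [0.0, 5.0, 5.0, 5.0],
    [0.0, 5.0, 5.0, 5.0],
    [0.0, 5.0, 5.0, 5.0],
    [0.0, 0.0, 0.0, 5.0],
])

# 2-bit quantization: map 0..5 V linearly onto digital numbers 0..3.
v_min, v_max, bits = 0.0, 5.0, 2
levels = 2 ** bits - 1
dn = np.round((voltages - v_min) / (v_max - v_min) * levels).astype(int)
print(dn)   # DN matrix analogous to Figure 1b
```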
Figure 2. Gray level imaging representation of 4 x 4 pixel array a) image, b) DN values and c) voltages
(volt).
Figure 3. Color imaging representation of 4 x 4 pixel array a) image, b) DN values and c) voltages
(volt).
Figure 4. Airborne images of a farm in the Lake Okeechobee Drainage Basin in Florida, USA a)
Binary, b) Gray level, and c) Color.
Hyperspectral imaging is collected with a spectral resolution of less than 10 nm, and the number of images (bands) is usually much larger than in multispectral imaging; hyperspectral imaging therefore contains more information than multispectral imaging.
Figure 6 shows the spatial and spectral representation of hyperspectral imaging. As can be seen, many more images are captured in a hyperspectral imaging system than in a multispectral one, so hyperspectral images contain far more information than multispectral and non-spectral images.
Figure 7. Airborne images of a farm in the Lake Okeechobee Drainage Basin in Florida, USA a)
multispectral images and b) hyperspectral images.
IMAGING SYSTEMS
There is a variety of imaging systems for non-spectral and spectral imaging, and they differ from one another in design approach. The main difference between non-spectral and spectral sensing, however, is that spectral sensing requires light separated into different wavelengths to be acquired by the detector unit.
NON-SPECTRAL IMAGING
Non-spectral acquisition is the simplest form of imaging. A camera loaded with black-and-white film is an example of an analogue non-spectral imaging system; a black-and-white digital camera is a good digital example. Non-spectral imaging requires a light source, a scene/view/object, optics, and a detector (Figure 8); a complete imaging system, of course, also needs electronic circuits, storage media, and software. Direct or indirect light falls on the object and is, put simply, reflected, absorbed, and transmitted (many other interactions between the light and the material occur, but they are beyond the scope of this chapter). The reflected light passes through the optics and shines on the detector. Depending on the light intensity, potential differences arise in the detector cells; these voltages are digitized by an ADC unit, the DNs are recorded using various standards, and the image is finally captured and stored on storage media.
For non-spectral imaging, a few significant factors affect the quality of the image. These parameters are the main factors that define image quality, and spatial resolution is one of the most important. Figure 9 shows the effect of image resolution: the image in Figure 9a is much clearer, and carries more information, than the others in Figure 9.
SPECTRAL IMAGING
Spectral imaging can be described as obtaining images together with spectral information; both multispectral and hyperspectral imaging are used. In addition to a light source, a scene/view/object, optics, and a detector, spectral imaging requires a wavelength selector. Figure 10 illustrates a spectral imaging system.
The specifications of the light source, optics, wavelength selector, and detector vary with the application area. The key component of a spectral imaging system is the wavelength selection unit. There are many ways to separate light into its different wavelengths; most of us have observed that a piece of glass can split sunlight into different colours. Wavelength selectors ranging from static to dynamic, from simple to advanced, and from cheap to expensive have been invented and are commercially available. Some of them are listed below:
1. Band Filters,
2. Linear Variable Filters (LVFs),
3. Filter wheels,
4. Prisms,
5. Gratings,
6. Monochromators,
7. Liquid Crystal Tunable Filters,
8. Acousto-Optic Tunable Filters.
The light shines on the object and, depending on the sensing mode, the transmitted or reflected light passes through the optics, where it is separated into different wavelengths. The separated light is then directed onto the detector, and the images are finally captured and stored on storage media.
Figure 11. Airborne hyperspectral images of a farm in the Lake Okeechobee Drainage Basin in Florida,
USA a) plant spectrum b) soil spectrum, and c) water Spectrum.
IMAGE PROCESSING
The main aim of image processing is to enhance and analyse images in order to obtain real-world results. Attention is paid first to image acquisition: the best acquisition system is the one that reproduces the real-world scene most faithfully in the digital medium. Real-time or static image processing systems may use hardware or software to process the signals. Image processing involves enhancement, filtering, and feature extraction before image analysis is carried out; the captured image is resized, filtered, enhanced, and reformatted, all of which change the DNs of the image matrix. The regression or comparison of DNs against real-world values such as length, temperature, or chemical concentration is called image analysis. Commonly used non-spectral image processing techniques include the following:
1. Resampling,
2. Resizing,
3. Filtering (low pass, high pass, band pass, edge filtering etc.)
4. Pattern recognition,
5. Blob analysis,
6. Contrast enhancement,
7. Geometric corrections/calibrations/enhancements.
The techniques above are some of the most commonly used; many more image processing techniques exist for non-spectral imaging.
The images on the left of Figure 12 show image acquisition errors: the acquisition plane and angle introduce distortions, so results obtained from these images deviate from the true values. Geometric corrections are therefore needed to obtain correct results. Figure 12 shows the geometric corrections applied to two original images, with and without a grid, on the left, and the two geometrically corrected images, with and without a grid, on the right.
Figure 12. An original image without grid top left, a geometrically corrected image top right without
grid, an original image with grid bottom left and a geometrically corrected image with grid bottom
right.
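A geometric correction of the kind shown in Figure 12 can be sketched as a planar perspective (homography) warp, assuming OpenCV is available and that four grid-corner correspondences have been identified; the file names and pixel coordinates below are placeholders.

```python
import cv2
import numpy as np

# Hypothetical pixel coordinates of four grid corners in the tilted image
# and their desired positions in the corrected (fronto-parallel) image.
src = np.float32([[112, 80], [508, 95], [520, 410], [98, 395]])
dst = np.float32([[100, 100], [500, 100], [500, 400], [100, 400]])

img = cv2.imread("tilted_soil_image.png")          # placeholder file name
H = cv2.getPerspectiveTransform(src, dst)          # 3x3 homography
corrected = cv2.warpPerspective(img, H, (img.shape[1], img.shape[0]))
cv2.imwrite("corrected_soil_image.png", corrected)
```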
Figure 13. Images (a) after geo-correction (b) after median filter (c) after frame removal.
An image of soil aggregates with a grid and frame is shown in Figure 13a. Applying a median filter to this image yields the image in Figure 13b: the median filter removes the grid. The image in Figure 13b is then cropped to remove the frame (Figure 13c).
When the value of c increases the output image becomes darker, and when the value of s increases it becomes lighter. Figure 15 shows the effect of contrast enhancement on an image (the input image is on the left and the contrast-enhanced image on the right). This technique has some advantages over binarising. Binarising is an image processing technique that reduces an image to two colour levels: a threshold value is selected, pixels whose values are greater than the threshold become white, and pixels whose values are smaller than the threshold become black.
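A minimal sketch of binarising with a fixed threshold; the threshold value is an arbitrary example.

```python
import numpy as np

def binarise(img, threshold):
    """Reduce a grey-level image to two colours: pixels above the
    threshold become white (1), pixels at or below it become black (0)."""
    return (img > threshold).astype(np.uint8)

# example: threshold an 8-bit image at its mid grey level
img = np.random.randint(0, 256, size=(128, 128), dtype=np.uint8)
bw = binarise(img, 127)
```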
An edge may be regarded as a boundary between two dissimilar regions in an image. Many edge extraction techniques can be broken into two distinct phases: (i) finding pixels in the image where edges are likely to occur by looking for discontinuities in the grey-level gradient (candidate edge points are usually referred to as edge points, edge pixels, or edgels); and (ii) linking these edge points in some way to produce descriptions of edges in terms of lines, curves, etc. Figure 16 illustrates edge detection.
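The sketch below illustrates the first phase only, marking candidate edgels where the Sobel gradient magnitude exceeds a threshold; the linking phase is omitted, and the threshold is an assumption.

```python
import numpy as np
from scipy import ndimage

def edge_points(img, threshold):
    """Phase (i) of the edge-extraction scheme above: mark pixels where
    the grey-level gradient magnitude is large (candidate edgels)."""
    gx = ndimage.sobel(img.astype(float), axis=1)   # horizontal gradient
    gy = ndimage.sobel(img.astype(float), axis=0)   # vertical gradient
    magnitude = np.hypot(gx, gy)
    return magnitude > threshold
```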
Only a few image processing techniques have been mentioned and illustrated so far; many more exist for non-spectral imaging in different areas. It is not possible to cover them all within the scope of this chapter; a whole book would be needed.
Commonly used spectral image processing techniques include the following:
1. Spectral Math,
2. Band ratios,
3. Classification (supervised, unsupervised, decision tree etc.),
4. Image sharpening (HSV, Brovey, Gram-Schmidt etc.),
5. Principal components,
6. MNF rotations,
7. Color transforms,
8. Stretches (decorrelation, photographic, saturation etc.),
9. NDVI or indexing,
10. Filtering (convolution and morphology, texture, adaptive, FFT, etc.)
The techniques above are some of the most commonly used; many more image processing techniques exist for spectral image processing. Some of them are illustrated below.
Figure 17. Band ratios application of 420 nm to 990 nm: (a) image at 420 nm (b) image at 990 nm and
(c) the result image 420/990 nm.
Figure 18. Spectral math application (a) image at 500 nm (b) image after Gram-Schmidt sharpening and
(c) image after spectral math calculation (410+467)/550 nm.
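Band ratios and simple spectral math reduce to element-wise arithmetic on selected band images of the spectral cube, as in the rough sketch below; the cube layout, wavelength grid, and helper function are assumptions, and the wavelengths mirror those in Figures 17 and 18.

```python
import numpy as np

# cube: hyperspectral cube with shape (rows, cols, bands) and a matching
# list of band-centre wavelengths in nm (layout and values are assumptions).
wavelengths = np.arange(400, 1000, 10)
cube = np.random.rand(100, 100, wavelengths.size)   # placeholder data

def band(cube, wavelengths, nm):
    """Return the image plane whose band centre is closest to `nm`."""
    return cube[:, :, np.argmin(np.abs(wavelengths - nm))]

eps = 1e-6                                           # avoid division by zero
ratio_420_990 = band(cube, wavelengths, 420) / (band(cube, wavelengths, 990) + eps)
spectral_math = (band(cube, wavelengths, 410) + band(cube, wavelengths, 467)) \
                / (band(cube, wavelengths, 550) + eps)
```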
CONCLUSION
Spectral and non-spectral imaging are part of our lives, with application areas ranging from photography to space, and imaging technologies will always be used. Spectral sensing systems are becoming more available, cheaper, and more user friendly. This young scientific area will benefit human beings in many fields: from brain studies to cells in medicine, from maturity to quality in the agricultural sciences, and from fractures to defects in engineering. The important thing is to know what to use, and when, where, and how.
In: Computer Vision ISBN: 978-1-61209-399-4
Editor: Sota R. Yoshida © 2011 Nova Science Publishers, Inc.
Chapter 7
LEAST SQUARES FITTING
N. Chernov∗ and H. Ma†
Abstract
In computer vision one often fits ellipses and other conics to observed points on a
plane or ellipsoids/quadrics to spatial point clouds. The most accurate and robust fit
is obtained by minimizing geometric (orthogonal) distances, but this problem has no
closed form solution and most known algorithms are prohibitively slow. We revisit this
issue based on recent advances by S. J. Ahn, D. Eberly, and our own. Ahn has sorted
out various approaches and identified the most efficient one. Eberly has developed a
fast method of projecting points onto ellipses/ellipsoids (and gave a proof of its conver-
gence). We extend Eberly’s projection algorithm to other conics, as well as quadrics
in space. We also demonstrate that Eberly’s projection method combined with Ahn’s
most efficient approach (and using Taubin’s algebraic fit for initialization) makes a
highly efficient fitting scheme working well for all quadratic curves and surfaces.
1. Introduction
Fitting simple contours (primitives) to observed image data is one of the basic tasks in
pattern recognition and computer vision. The most popular contours are lines, circles, and
ellipses (recall that round objects appear as elliptic ovals on photos). In 3D space, one
often fits planes, spheres, or more complex surfaces (such as ellipsoids) to point clouds.
We review the most advanced fitting methods and extend them to all quadratic curves and
surfaces.
∗E-mail address: [email protected]
†E-mail address: [email protected]
We begin with the 2D fitting problem. Let (x1 , y1 ), . . ., (xn , yn ) denote the observed
points. Let P (x, y; Θ) = 0 be the equation of the fitting contour, where Θ represents the
vector of unknown parameters. For example, lines can be defined by the equation

x cos θ + y sin θ + d = 0,   (1)

circles by

(x − a)² + (y − b)² − R² = 0,   (2)

and ellipses by

x̆²/a² + y̆²/b² − 1 = 0   (3)

in the canonical coordinates x̆, y̆, which are related to x, y by translation and rotation, i.e., x̆ = (x − c₁) cos θ − (y − c₂) sin θ and y̆ = (x − c₁) sin θ + (y − c₂) cos θ; now Θ = (a, b, c₁, c₂, θ), where (c₁, c₂) is the center, a, b are the semiaxes, and θ is the angle of tilt.
The classical least squares fit minimizes geometric distances from the observed points
to the fitting curve:
F(Θ) = Σᵢ₌₁ⁿ dᵢ² = Σᵢ₌₁ⁿ [(xᵢ − xᵢ⁰)² + (yᵢ − yᵢ⁰)²] → min.   (4)

Here dᵢ denotes the geometric distance from the observed point (xᵢ, yᵢ) to the fitting contour, and (xᵢ⁰, yᵢ⁰) the (orthogonal) projection of (xᵢ, yᵢ) onto the contour.
The least squares fit (4) has many nice features. It is invariant under translations, ro-
tations, and scaling, i.e., the fitting contour does not depend on the choice of the coordi-
nate system. It provides the maximum likelihood estimate of Θ under standard statisti-
cal assumptions (where points are observed with an independent isotropic gaussian noise
[9, 12, 15]). The minimization (4) is always regarded as the most desirable solution of the fitting problem (in particular, it has been prescribed by a recently ratified standard for testing data processing software for coordinate metrology [20]), albeit hard to compute in most cases.
Fig. 1 shows a sample of eight points (their coordinates are given in Table 1; they
are borrowed from [19]) and the best fitting ellipse obtained by (4). We will explore this
example at the end of Section 2.
When one fits lines (1), the problem (4) has a closed form solution, and its properties
have been studied deeply [10, 15]. When one fits circles (2), the problem (4) has no closed
form solution, but one has an explicit formula for the distances,
dᵢ = √((xᵢ − a)² + (yᵢ − b)²) − R,   (5)

hence one can easily compute the objective function (4), as well as its derivatives with respect to a, b, R. This makes the minimization of F rather straightforward – the standard
Levenberg-Marquardt algorithm (the most popular and reputable scheme for solving least-
squares problems [15, 19]) works well. Algebraic circle fits (say, the one by Taubin; see
Appendix) provide good initial guesses.
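As an illustration of the circle case, the sketch below minimizes the geometric residuals (5) with SciPy's Levenberg-Marquardt driver; for brevity it starts from a centroid-based guess rather than the algebraic (Taubin) initialization recommended above.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_circle(x, y):
    """Geometric circle fit: minimize the sum of d_i^2 with
    d_i = sqrt((x_i - a)^2 + (y_i - b)^2) - R, as in (5)."""
    def residuals(theta):
        a, b, R = theta
        return np.hypot(x - a, y - b) - R

    a0, b0 = x.mean(), y.mean()                  # crude initial guess
    R0 = np.hypot(x - a0, y - b0).mean()
    sol = least_squares(residuals, x0=[a0, b0, R0], method="lm")
    return sol.x                                  # (a, b, R)
```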
When one fits ellipses, no simple explicit formulas are available for the distances dᵢ or the projections (xᵢ⁰, yᵢ⁰). Theoretically, the projection (xᵢ⁰, yᵢ⁰) can be found by solving a poly-
nomial equation of degree four [28], but such a solution is complicated and inconvenient
(it involves complex numbers), and even worse – it is numerically unstable [4]. Besides it
is not clear how to differentiate di with respect to Θ. Since one cannot easily compute the
objective function (4) or its derivatives, it appears that the Levenberg-Marquardt scheme is
impractical (see [25], where such a point of view is expressed).
The first thorough investigation of the ellipse fitting problem was done in the middle
1990’s by Gander, Golub, and Strebel [19]. They developed a roundabout way of mini-
mizing F by using auxiliary parameters ωi , i = 1, . . . , n, describing the location of the
projected points (xᵢ⁰, yᵢ⁰) on the ellipse; the latter, in the canonical coordinates (3), are expressed by x̆ᵢ⁰ = a cos ωᵢ and y̆ᵢ⁰ = b sin ωᵢ. Thus the objective function becomes

F = Σᵢ₌₁ⁿ [(xᵢ − c₁ − a cos ωᵢ cos θ − b sin ωᵢ sin θ)² + (yᵢ − c₂ + a cos ωᵢ sin θ − b sin ωᵢ cos θ)²].
A popular alternative to the geometric fit is provided by fast non-iterative algebraic fits, reviewed in the Appendix. They produce ellipses that fit less accurately than those minimizing geometric
distances (4). Some authors say that “the performance gap between algebraic fitting and
geometric fitting is wide...” (see [7, p. 12]).
In the early 2000’s another approach to the minimization of geometric distances (4)
emerged, due to Ahn et al. [4, 6, 7], that turned out to be very efficient. We describe it in a
general form.
One can work either with the distances dᵢ or with the coordinate differences gᵢ = xᵢ − xᵢ⁰ and hᵢ = yᵢ − yᵢ⁰. In the former case we will need the dᵢ and their derivatives (dᵢ)Θ; in the latter we need the gᵢ and hᵢ, as well as their derivatives (gᵢ)Θ and (hᵢ)Θ. The resulting algorithm is said to be distance-based if one uses the dᵢ, or coordinate-based if one uses the gᵢ and hᵢ; see Ahn et al. [6, 7].
Obviously, it is enough to know the projections (xᵢ⁰, yᵢ⁰) in order to compute the dᵢ, gᵢ, and hᵢ. But their derivatives (dᵢ)Θ, (gᵢ)Θ, (hᵢ)Θ present a more challenging problem.
Sometimes finite differences are used to approximate these derivatives, which reduces the
accuracy of the fit. But there are surprisingly simple formulas for these derivatives:
Proposition. Let (x, y) be a given point and (x⁰, y⁰) denote its projection onto the curve P(x, y; Θ) = 0 (then x⁰, y⁰ depend on Θ). Denote g = x − x⁰, h = y − y⁰, and d² = g² + h². Then we have

gΘ = [PΘ Px² − g Py (Px PyΘ − Py PxΘ)] / [Px (Px² + Py²)],   (6)

hΘ = [PΘ Py² − h Px (Py PxΘ − Px PyΘ)] / [Py (Px² + Py²)],   (7)

dΘ = PΘ / √(Px² + Py²),   (8)

where PΘ, Px, Py denote the first order partial derivatives of P with respect to Θ, x, y, respectively, and PxΘ and PyΘ the corresponding second order partial derivatives; all the derivatives are taken at the projection point (x⁰, y⁰).
Proof. Since the vector (x − x⁰, y − y⁰) is orthogonal to the curve, it is parallel to the gradient of P, i.e.,

g = t Px  and  h = t Py   (9)

for some scalar t.
Now (8) follows from (12) and (10). Differentiation of (9) gives
gΘ = tΘ Px + tPxΘ , hΘ = tΘ Py + tPyΘ .
Solving (12) and (13) for gΘ and hΘ we obtain (6) and (7).
The formulas (6)–(8) were essentially obtained by Ahn et al. [6, 7]. Independently the
formula (8) was derived in [3] (see eq. (24) there).
Practically, the calculation of the derivatives dΘ, gΘ, hΘ by (6)–(8) is easy once the projection point (x⁰, y⁰) is located. The differentiation of P(x, y; Θ) with respect to Θ is
usually straightforward; for example, one can easily find the derivatives of (3) with respect
to a, b, c1, c2 , θ.
Alternatively, one can try to change the parametrization scheme in order to simplify
differentiation. For example, instead of (3) one can define an ellipse by equation
√((x − p₁)² + (y − p₂)²) + √((x − q₁)² + (y − q₂)²) − 2a = 0,   (14)
where (p1 , p2 ) and (q1 , q2 ) denote its foci and 2a, as before, the major axis. These are
sometimes called Kepler’s parameters of an ellipse; they have certain advantages [11]. In
particular, differentiation of (14) with respect to p1 , p2, q1 , q2 , and a is quite straightforward.
We note that the coordinate-based scheme using gi ’s and hi ’s operates with 2n terms and
involves the second order derivatives PxΘ and PyΘ . The distance-based scheme operates
only with n terms and does not involve the second order derivatives; cf. (8). As a result, the
coordinate-based scheme seems to be less efficient, and we will only use the distance-based
fit in what follows.
Remark. Since the distance d must be differentiable, we have to treat it as a signed distance – it must be positive on one side of the curve and negative on the other, just as we had it in (5). For ellipses, one can make d > 0 for points outside the ellipse and d < 0 for points inside.
The above minimization scheme also works for surfaces in the 3D space, when they are
defined by implicit equations P (x, y, z; Θ) = 0. In that case the above formulas acquire an
extra term corresponding to the z variable, otherwise they remain pretty much the same.
Table 1. The eight benchmark data points.
x:  1  2  5  7  9  3  6  8
y:  7  6  8  7  5  7  2  4
Now the minimization problem (4) can be solved by the Gauss-Newton or Levenberg-
Marquardt algorithm provided one can project any given point (x, y) onto a given
curve/surface P = 0. We will discuss the projection subproblem later. The rate of convergence of the GN and LM algorithms is nearly quadratic provided one has a good initial
guess (which can be found, for example, by a non-iterative algebraic fit, such as the Taubin
fit, see Appendix).
Ahn et al. have applied the above minimization scheme to quadratic curves (ellipses,
hyperbolas, and parabolas) and some quadratic surfaces (spheres, ellipsoids, cones); see [4,
6, 7]. They compared it with several other minimization algorithms [5, 6, 7] and concluded
that this one is the fastest and most robust. We have also tested it on quadratic curves and
surfaces of various types and found that it has the following two advantages over other
schemes: (i) it converges in fewer iterations, and (ii) it finds a smaller value of the objective
function F more frequently than other methods do (i.e., the other methods tend to end up in a local minimum or diverge more often than this method does).
A benchmark example introduced in [19] and used later in [4] and other papers is a
simple eight point set whose coordinates are shown in Table 1. The best fitting ellipse is
known to have center (2.6996, 3.8160), axes a = 6.5187 and b = 3.0319, and angle of
tilt θ = 0.3596; see Fig. 1. We have run two fitting algorithms – the implicit fit described
here and the Gander-Golub-Strebel (GGS) method, which simultaneously optimizes n + 5
parameters, as mentioned earlier. Both methods were initialized by randomly generated
ellipses (to get an initial ellipse, we just picked 5 points randomly in the square 0 ≤ x, y ≤
10 and used the ellipse interpolating those 5 points). After running these fitting methods
from 105 random initial guesses, we found that the implicit fit failed to converged to the
best ellipse in 11% of the cases, while the GGS method failed in 26% of the cases. In
those cases where both algorithms converged, the implicit method took 20 iterations, on the
average, while the GGS method took 60 iterations, on the average. The cost of one iteration
was also higher for the GGS method. Table 2 summarizes the results.
We note that a high number of iterations here is not unusual. The authors of [19] used
a modification of the best fitting circle to initialize their GGS procedure, and it took 71
iterations (!) to converge. A coordinate-based variant of the implicit fit used in [4] took 19
iterations to converge (it was also initialized with the best fitting circle). Our distance-based implicit fit converges in even fewer iterations; see Table 2.
Table 2. Average number of iterations for different initializations (initial ellipse).
                 Best circle   “Direct fit”   Taubin fit
  Implicit (G)       16             17            17
  Implicit (K)       14             16            16
  GGS                50             54            54
Ellipses. Let us project a given point (u, v) onto an ellipse defined in its canonical coordinates by

x²/a² + y²/b² − 1 = 0.   (15)

Due to symmetry, we can restrict the method to u > 0, v > 0, so that x > 0, y > 0 as well. (Other points can be reflected to the first quadrant about the axes, and then the projection point can be reflected back.) Also, we exclude the degenerate cases where u = 0 or v = 0; they are fairly simple and can be handled separately (see details in [1]).
Now the projection point (x, y) on the ellipse satisfies the orthogonality conditions (9),
hence
u − x = t x/a²  and  v − y = t y/b²   (16)
for some real t (note that t < 0 for points inside the ellipse and t > 0 for points outside the
ellipse). From (16) we find
x = a²u/(t + a²)  and  y = b²v/(t + b²).   (17)

Since x, y > 0, we have constraints t > −a² and t > −b². Assuming, as usual, that a ≥ b, we get a single constraint t > −b². Substituting (17) into (15) we obtain a function

F(t) = a²u²/(t + a²)² + b²v²/(t + b²)² − 1,   (18)
whose root we need to find (because (x, y) must lie on the ellipse). Once we solve equation
(18) for t, we can compute the projection point (x, y) by (17). Note that
F′(t) = −2a²u²/(t + a²)³ − 2b²v²/(t + b²)³   (19)

and

F″(t) = 6a²u²/(t + a²)⁴ + 6b²v²/(t + b²)⁴.   (20)
Thus on the interval (−b², ∞) we have F′ < 0 and F″ > 0, i.e., the function F is monotonically decreasing and convex; see Fig. 2. Thus standard Newton's method starting at any point t₀ where F(t₀) > 0 will converge to the unique root of F. Eberly suggests to
start with t₀ = bv − b², because F(t₀) > 0 is then guaranteed by (18). We found that it is more beneficial to start with

t₀ = max{au − a², bv − b²}.   (21)
Then Newton’s method converges in 3-5 iterations in all practical cases and finds the root to
within 7-8 significant digits. This is, on average, 2-3 times faster than solving an equation of degree four or using general heuristics [4, 7]. The MATLAB code for this method is posted
on our web page [29].
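For illustration, here is a short Python rendering of the projection step just described (Newton's method applied to F(t) of (18), started from (21)); it is a sketch based on the description above, not the authors' posted MATLAB code.

```python
import numpy as np

def project_to_ellipse(u, v, a, b, tol=1e-12, max_iter=50):
    """Project (u, v), with u, v > 0, onto the ellipse x^2/a^2 + y^2/b^2 = 1
    (a >= b) by Newton's method on F(t) of (18), started at t0 from (21).
    Other quadrants are handled by reflecting (u, v) first."""
    t = max(a * u - a * a, b * v - b * b)              # starting point (21)
    for _ in range(max_iter):
        F = (a * u / (t + a * a)) ** 2 + (b * v / (t + b * b)) ** 2 - 1.0
        dF = (-2 * (a * u) ** 2 / (t + a * a) ** 3
              - 2 * (b * v) ** 2 / (t + b * b) ** 3)
        step = F / dF
        t -= step                                      # Newton update
        if abs(step) < tol:
            break
    x = a * a * u / (t + a * a)                        # projection point (17)
    y = b * b * v / (t + b * b)
    return x, y
```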
Hyperbolas. Now let us project a point (u, v) onto a hyperbola. Again, the latter can be
defined in its canonical coordinates:
x²/a² − y²/b² − 1 = 0.   (22)
Figure 2. A typical graph of F(t) for t > −b² and the progress of Newton's iterations
toward the root.
Due to symmetry, we restrict the method to u > 0, v > 0, then we also have x > 0, y > 0.
The orthogonality conditions (9) now read

u − x = t x/a²  and  v − y = −t y/b²,   (23)

from which

x = a²u/(t + a²)  and  y = b²v/(−t + b²).   (24)
Since x, y > 0, we have constraints −a² < t < b². Substituting (24) into (22) we obtain a function

F(t) = a²u²/(t + a²)² − b²v²/(−t + b²)² − 1,   (25)

whose root we need to find. Note that

F′(t) = −2a²u²/(t + a²)³ − 2b²v²/(−t + b²)³   (26)

and

F″(t) = 6a²u²/(t + a²)⁴ − 6b²v²/(−t + b²)⁴.   (27)
Unlike in the ellipse case, F is no longer globally convex on the interval (−a², b²). Two cases are possible: (a) the inflection point t∗ lies above the x axis, i.e., F(t∗) > 0, and (b) the inflection point lies below the x axis. The inflection point is found by solving F″ = 0, hence

t∗ = (b²√(au) − a²√(bv)) / (√(au) + √(bv)).
Now by computing F (t∗ ) we can determine which case, (a) or (b), we have at hand. Stan-
dard Newton's method will converge to the root of F(t) = 0, but the starting point t₀ must be selected wisely. In case (a) we need to choose t₀ such that F(t₀) < 0, and in case (b) we need F(t₀) > 0. In case (a) we can try points tₖ = b² − (b² − t∗)/2ᵏ for k = 1, 2, . . . until we find one where F is negative. In case (b) we can try points tₖ = −a² + (t∗ + a²)/2ᵏ for k = 1, 2, . . . until we find one where F is positive. This process of choosing t₀ is relatively inexpensive and it does not involve the derivatives of F.
This completes our projection algorithm for hyperbolas.
In practice, it works about 1.5 times longer than the projection algorithm for ellipses, due to a more elaborate choice of t₀, so its speed is close to that of heuristic schemes [4, 7]. But it
is guaranteed to converge, according to our analysis.
Figure 3. Two possible appearances of F(t) on the interval −a² < t < b². Arrows show the progress of Newton's iterations toward the root.
The MATLAB code for this method is posted on our web page [29].
Parabolas. Next let us project a point (u, v) onto a parabola. Again, the latter can be
defined in its canonical coordinates:
y² − 2px = 0,   (28)
where p > 0 is the distance from the focus to the directrix. Due to symmetry, we restrict
the method to v > 0, then we also have y > 0. The orthogonality conditions (9) now give
u − x = −p t  and  v − y = t y,   (29)

from which

x = u + p t  and  y = v/(t + 1).   (30)
Since y > 0, we have constraint t > −1. Substituting (30) into (28) we obtain a function

F(t) = v²/(t + 1)² − 2pu − 2p²t.   (31)

Taking the derivatives of F we see that F′(t) = −2v²/(t + 1)³ − 2p² and F″(t) = 6v²/(t + 1)⁴. Thus on the interval (−1, ∞) we have F′ < 0 and F″ > 0, i.e., the function F is monotonically decreasing and convex. Now standard Newton's method starting at any point t₀ where F(t₀) > 0 will converge to the unique root of F. We can try points tₖ = −1 + 2⁻ᵏ for k = 1, 2, . . . until we find one where F is positive. The MATLAB code for this method is
posted on our web page [29].
Ellipsoids. Now let us project a point (u, v, w) onto an ellipsoid defined in its canonical coordinates as

x²/a² + y²/b² + z²/c² − 1 = 0,   (32)

where a ≥ b ≥ c > 0 are its semiaxes. (Other ellipsoids can be translated and rotated to the canonical form (32), and then the projection point can be translated and rotated back to the original ellipsoid.) Due to symmetry, we restrict the method to u > 0, v > 0, w > 0, then we also have x > 0, y > 0, z > 0. The orthogonality conditions now give

u − x = t x/a²,   v − y = t y/b²,   w − z = t z/c²,   (33)

from which

x = a²u/(t + a²),   y = b²v/(t + b²),   z = c²w/(t + c²).   (34)

Since x, y, z > 0, we have constraint t > max{−a², −b², −c²} = −c². Substituting (34) into (32) we obtain a function

F(t) = a²u²/(t + a²)² + b²v²/(t + b²)² + c²w²/(t + c²)² − 1,   (35)

whose root we need to find. As in the ellipse case, F′ < 0 and F″ > 0 on the interval (−c², ∞), so F is monotonically decreasing and convex there, and Newton's method started at any t₀ with F(t₀) > 0 converges to its unique root.
Hyperbolic paraboloids. Now let us project a point (u, v, w) onto a hyperbolic paraboloid
(“saddle”) defined in its canonical coordinates as

x²/a² − y²/b² − z = 0.   (38)

Due to symmetry, we restrict the method to u > 0, v > 0, then we also have x > 0, y > 0. The orthogonality conditions now give

u − x = t x/a²,   v − y = −t y/b²,   w − z = −t/2,   (39)

from which

x = a²u/(t + a²),   y = b²v/(−t + b²),   z = w + t/2.   (40)

Since x, y > 0, we have constraints −a² < t < b². Substituting (40) into (38) we obtain a function

F(t) = a²u²/(t + a²)² − b²v²/(−t + b²)² − w − t/2,   (41)

whose root we need to find. Note that

F′(t) = −2a²u²/(t + a²)³ − 2b²v²/(−t + b²)³ − 1/2   (42)

and

F″(t) = 6a²u²/(t + a²)⁴ − 6b²v²/(−t + b²)⁴.   (43)
Again F is monotonically decreasing, but not necessarily convex, on the interval (−a², b²). Our further analysis repeats that done for hyperbolas in the previous section.
Hyperboloids. Now let us project a point (u, v, w) onto a hyperboloid (one sheet) defined
in its canonical coordinates as
x²/a² + y²/b² − z²/c² − 1 = 0,   (44)

where we can assume a ≥ b. Due to symmetry, we restrict the method to u > 0, v > 0, w > 0, then we also have x > 0, y > 0, z > 0. The orthogonality conditions now give

u − x = t x/a²,   v − y = t y/b²,   w − z = −t z/c²,   (45)

from which

x = a²u/(t + a²),   y = b²v/(t + b²),   z = c²w/(−t + c²).   (46)

Since x, y, z > 0, we have constraints −b² < t < c². Substituting (46) into (44) we obtain a function

F(t) = a²u²/(t + a²)² + b²v²/(t + b²)² − c²w²/(−t + c²)² − 1,   (47)

whose root we need to find. Note that

F′(t) = −2a²u²/(t + a²)³ − 2b²v²/(t + b²)³ − 2c²w²/(−t + c²)³

and

F″(t) = 6a²u²/(t + a²)⁴ + 6b²v²/(t + b²)⁴ − 6c²w²/(−t + c²)⁴,

so F is again monotonically decreasing on (−b², c²) but, as for hyperbolas, it is not necessarily convex, and the starting point t₀ for Newton's method must be chosen on the correct side of the root.
As there is a 50% chance of choosing one of the two sides correctly at random, the
current projection method is perhaps about 1.5 times slower, on average, than the previous
one (a moderate price to pay for extra complications). We emphasize that our analysis still
guarantees that the method converges to the correct projection point in all cases.
Other quadrics. We have covered three major types of quadratic surfaces in 3D. There
are two others – elliptic paraboloid and hyperboloid of two sheets, which are treated very
similarly, with small variations in each case that we leave out.
5. Conclusion
The main goal of our chapter is to put a strong argument against the customary presumption
that fitting ellipses and other quadratic curves/surfaces by minimizing geometric distances
is a prohibitively difficult task. It is no longer so! Three major breakthroughs have occurred
in the last decade:
• Ahn et al. have designed a general fitting scheme for implicit curves and surfaces that
is surprisingly simple and fast. Its most time-consuming part is the projection of data
points onto the curve/surface.
• Eberly discovered a remarkably fast and totally reliable projection algorithm for el-
lipses (which we generalize here to other conics and quadrics).
• Taubin’s algebraic fit gained the reputation of being able to provide a balanced (nearly
unbiased) initialization for the subsequent iterative procedure.
Combining these three advances together gives a complete fitting scheme for fitting
quadratic curves and surfaces of all kinds. This scheme is reliable and efficient.
Acknowledgment. N.C. was partially supported by National Science Foundation, grant
DMS-0969187.
Appendix
Here we review non-geometric (algebraic) fits that are used to provide an initial guess, i.e.,
a curve that initializes an iterative procedure solving the geometric fitting problem (4).
Suppose again that one fits an implicit curve P (x, y; Θ) = 0 to observed points (x1 , y1 ),
. . ., (xn , yn ). Perhaps the simplest non-geometric fit is the one minimizing
F₁(Θ) = Σᵢ₌₁ⁿ [P(xᵢ, yᵢ; Θ)]².   (50)
To justify this method one usually notes that P(xᵢ, yᵢ; Θ) = 0 if and only if the point (xᵢ, yᵢ) lies on the curve, and [P(xᵢ, yᵢ; Θ)]² is small when the point lies near the curve. The minimization of (50) is called the algebraic fit, and |P(xᵢ, yᵢ; Θ)| is called the corresponding
algebraic distance.
Least Squares Fitting 299
This method is called gradient weighted algebraic fit. It was applied to quadratic curves by
Sampson [24] and popularized by Taubin [26].
If the curve is defined by an algebraic equation, such as (51), then both numerator and
denominator of each fraction in (52) are homogeneous quadratic polynomials of the param-
eters. As a result, F2 is invariant under scalings of the parameter vector (A, B, C, D, E, F ),
hence no additional constraints are needed anymore. The minimization of (52) produces a
more accurate fit than the simple algebraic fits (50) do; see statistical analysis in [12].
But the problem of minimizing (52) has no closed-form solution and must be solved
by iterations. Various iterative schemes for the minimization of (52) have been developed
(see [16, 17, 21]); some of them have become standard in the computer vision community. They
are all too complex to be used for an initialization of the geometric fit (4) (besides, each of
them needs its own initialization to get started...). In this sense, the minimization of (52)
and that of (4) can be regarded as two independent approaches to the task of fitting curves
to data. While the minimization of (4) is called Maximum Likelihood Estimation (MLE), that of (52) is called
Approximate Maximum Likelihood Estimation (AMLE); see [16, 17].
One may wonder if the AMLE (52) can be used instead of the MLE (4), as the mini-
mization of (52) is technically simpler than that of (4). Most researchers, however, agree
that the answer is NO, i.e., the minimization of (4) would produce a better fit than that of
(52); see comments in [7, p. 12]. (Though a complete statistical analysis has yet to be done.)
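For concreteness, here is a minimal sketch (ours, not from the original text) of evaluating the AMLE objective (52) for a conic written in the form (51); minimizing it would still require one of the iterative schemes cited above.

import numpy as np

def gradient_weighted_cost(theta, x, y):
    # AMLE objective (52): sum of P^2 / ||grad P||^2 over the data points,
    # for the conic P(x, y) = A x^2 + B xy + C y^2 + D x + E y + F.
    A, B, C, D, E, F = theta
    P  = A*x*x + B*x*y + C*y*y + D*x + E*y + F
    Px = 2*A*x + B*y + D        # dP/dx
    Py = B*x + 2*C*y + E        # dP/dy
    return np.sum(P**2 / (Px**2 + Py**2))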
Taubin [26] simplified (52) and converted it into a non-iterative fit that minimizes
F3(Θ) = Σ_i [P(xi, yi; Θ)]^2 / Σ_i ||∇P(xi, yi; Θ)||^2.    (53)
Note that Taubin simply summed up all the numerators and all the denominators in (52)
separately.
If one fits conics defined by the algebraic equation (51), then both the numerator and the
denominator of (53) are homogeneous quadratic polynomials of the components of the parameter
vector A = (A, B, C, D, E, F)^T. Thus one can rewrite (53) as

F4(A) = (A^T M A) / (A^T N A) → min,    (54)

where M and N are some 6 × 6 symmetric positive semi-definite matrices. Since F4(A)
is invariant under scalings of the vector A, one can solve (54) by minimizing F5(A) =
A^T M A under the constraint A^T N A = 1. Introducing a Lagrange multiplier η, we
minimize the Lagrange function A^T M A − η(A^T N A − 1); setting its gradient with respect
to A to zero yields

M A = η N A,    (55)
thus A must be a generalized eigenvector of the matrix pair (M, N). Moreover, premul-
tiplying (55) by A^T and using the constraint, we see that A^T M A = η, and because we are
minimizing A^T M A, the desired vector A must correspond to the smallest (non-negative)
eigenvalue η. (We note that N here is singular; one usually eliminates F to reduce A to a
5-vector A′ = (A, B, C, D, E)^T and the 6 × 6 problem (55) to a 5 × 5 problem M′A′ = η′N′A′,
where the 5 × 5 matrix N′ is positive definite; see details in [22].)
Solving a generalized eigenvalue problem takes just one call of a standard matrix func-
tion (such functions are included in most modern software packages, e.g., in MATLAB).
Thus Taubin’s fit is regarded as a fast non-iterative procedure. In practice it is
only marginally slower than the simple algebraic fit minimizing (50).
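As an illustration (our sketch, not from the original text), the code below assembles M and N for conics and solves (55) with one call to a generalized eigensolver, here scipy.linalg.eig standing in for the MATLAB routine mentioned above; it works directly with the singular 6 × 6 matrix N instead of performing the reduction to a 5 × 5 problem.

import numpy as np
from scipy.linalg import eig

def taubin_conic_fit(x, y):
    # Taubin fit: minimize A^T M A / A^T N A over A = (A, B, C, D, E, F).
    one, zero = np.ones_like(x), np.zeros_like(x)
    z  = np.column_stack([x*x, x*y, y*y, x, y, one])       # P(x, y; A) = z . A
    dx = np.column_stack([2*x, y, zero, one, zero, zero])  # dP/dx as a linear form in A
    dy = np.column_stack([zero, x, 2*y, zero, one, zero])  # dP/dy as a linear form in A
    M = z.T @ z                    # numerator matrix of (54)
    N = dx.T @ dx + dy.T @ dy      # denominator (gradient) matrix; singular in F
    evals, evecs = eig(M, N)       # generalized eigenproblem M A = eta N A, i.e. (55)
    evals = np.real(evals)
    ok = np.isfinite(evals) & (evals >= 0)
    A = np.real(evecs[:, ok][:, np.argmin(evals[ok])])     # smallest non-negative eigenvalue
    return A / np.linalg.norm(A)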
The advantage of Taubin’s fit is that it is better balanced than simple algebraic fits: it has a much
smaller bias; see the statistical analysis in [22]. Its disadvantage is that it produces a conic
that may be of any kind: an ellipse, a hyperbola, or a parabola. If one fits conics of a
certain type (e.g., ellipses), then Taubin’s fit must be supplemented with another simple fit
whenever it returns the wrong type of curve. Experimental tests show that Taubin’s fit provides a
better initialization of iterative procedures than simple algebraic fits do [14].
References
[1] 3D Game Engine Design, 2nd ed., Morgan Kaufmann Publishers, San Francisco, CA,
2007. See also Internet article Distance from a point to an ellipse in 2D, Geometric
Tools, LLC, www.geometrictools.com
[2] Aigner, M. & Jüttler, B. (2005). Robust computation of foot points on implicitly de-
fined curves, In: Editors Daehlen, M. et al., Mathematical Methods for Curves and
Surfaces, Tromso 2004, Nashboro Press, pp. 1–10.
[3] Aigner, M. & Jüttler, B. (2008). Gauss-Newton type techniques for robustly fitting
implicitly defined curves and surfaces to unorganized data points, In: Shape Modeling
International, pp. 121–130.
[4] Ahn, S. J., Rauh, W. & Warnecke, H. J. (2001). Least-squares orthogonal distances
fitting of circle, sphere, ellipse, hyperbola, and parabola, Pattern Recog., 34, 2283–
2303.
[5] Ahn, S. J., Rauh, W. & Recknagel, M. (2001). Least squares orthogonal distances
fitting of implicit curves and surfaces, In: LNCS 2191, 398–405.
[6] Ahn, S. J., Rauh, W. & Cho, H. S. (2002). Orthogonal distances fitting of implicit
curves and surfaces, IEEE Trans. PAMI, 24, 620–638.
[7] Ahn, S. J. (2004). Least Squares Orthogonal Distance Fitting of Curves and Surfaces
in Space, In: LNCS 3151, Springer, Berlin.
[8] Bookstein, F. L. (1979). Fitting conic sections to scattered data, Comp. Graph. Image
Proc. 9, 56–71.
[9] Chan, N. N. (1965). On circular functional relationships, J. R. Statist. Soc. B, 27, 45–
56.
[10] Cheng, C.-L. & Van Ness, J. W. (1999). Statistical Regression with Measurement Er-
ror, Arnold, London.
[11] Chernov, N., Ososkov, G. & Silin, I. (2000). Robust fitting of ellipses to non-complete
and contaminated data, Czech. J. Phys. 50, 347–354.
[12] Chernov, N. & Lesort, C. (2004). Statistical efficiency of curve fitting algorithms,
Comp. Stat. Data Anal., 47, pp. 713–728.
[13] Chernov, N. & Lesort, C. (2005). Least squares fitting of circles, J. Math. Imag. Vision
23, 239–251.
[15] Chernov, N. (2010). Circular and linear regression: Fitting circles and lines by least
squares, Chapman & Hall/CRC Monographs on Statistics and Applied Probability
117.
[16] Chojnacki, W., Brooks, M. J. & van den Hengel, A. (2001). Rationalising the renor-
malisation method of Kanatani, J. Math. Imag. Vision, 14, 21–38.
[17] Chojnacki, W., Brooks, M. J., van den Hengel, A. & Gawley, D. (2005). FNS, CFNS
and HEIV: A unifying approach, J. Math. Imag. Vision, 23, 175–183.
[18] Fitzgibbon, A. W., Pilu, M. & Fisher, R. B. (1999). Direct Least Squares Fitting of
Ellipses, IEEE Trans. PAMI 21, 476–480.
[19] Gander, W., Golub, G. H. & Strebel, R. (1994). Least squares fitting of circles and
ellipses, BIT, 34, 558–578.
[20] Geometric Product Specification (GPS) – Acceptance and representation test for co-
ordinate measuring machines (CMM) – Part 6: Estimation of errors in computing
Gaussian associated features. Int’l Standard ISO 10360-6. IS, Geneva, Switzerland
(2001).
[21] Kanatani, K. (1994). Statistical bias of conic fitting and renormalization, IEEE Trans.
PAMI, 16, 320–326.
[22] Kanatani, K. (2008). Statistical optimization for geometric fitting: Theoretical accu-
racy bound and high order error analysis, Int. J. Computer Vision 80, 167–188.
[23] Leedan, Y. & Meer, P. (2000). Heteroscedastic regression in computer vision: Prob-
lems with bilinear constraint, Intern. J. Comp. Vision, 37, 127–150.
[24] Sampson, P. D. (1982). Fitting conic sections to very scattered data: an iterative re-
finement of the Bookstein algorithm, Comp. Graphics Image Proc. 18, 97–108.
[25] Sturm, P., & Gargallo, P. (2007). Conic fitting using the geometric distance, Proc.
Asian Conf. Comp. Vision, Tokyo, Japan, 2, pp. 784–795.
[26] Taubin, G. (1991). Estimation of planar curves, surfaces and nonplanar space curves
defined by implicit equations, with applications to edge and range image segmenta-
tion, IEEE Trans. PAMI, 13, 1115–1138.
[27] Wijewickrema, S., Papliński, A. & Esson, Ch. (2006). Orthogonal distance fitting re-
visited, Tech. report, Clayton School Inf. Technol., Monash U., Melbourne, 2006/205.
In: Computer Vision ISBN: 978-1-61209-399-4
Editor: Sota R. Yoshida © 2011 Nova Science Publishers, Inc.
Chapter 8
Ontology Based Image and Video Analysis
Christopher Town
Abstract
This chapter shows how ontologies can be used as the basis for an effective com-
putational and representational framework for computer vision. A particular focus is
on the role of ontologies as a means of representing structured prior information and
integrating different kinds of information in an inference framework. Ontologies are
presented as an effective means of relating semantic descriptors to their parametric
representations in terms of the underlying data primitives.
The efficacy of the proposed approach is demonstrated through the development
and analysis of solutions to a range of challenging visual analysis tasks. Firstly, we
consider the problem of content based image retrieval (CBIR). We present a CBIR sys-
tem that allows users to search image databases using an ontological query language.
Queries are parsed using a probabilistic grammar and Bayesian networks to map high
level concepts onto low level image descriptors, thereby bridging the semantic gap
between users and the retrieval system.
We then extend the notion of ontological languages to video event detection. It is
shown how effective high-level state and event recognition mechanisms can be learned
from a set of annotated training sequences by incorporating syntactic and semantic
constraints represented by an ontology.
1. Introduction
Visual information is inherently ambiguous and semantically impoverished. There conse-
quently exists a wide semantic gap between human interpretations of image and video data
and those currently derivable by means of a computer. This chapter demonstrates how this
gap can be narrowed by means of ontologies. Ontology is the theory of objects in terms
of the criteria which allow one to distinguish between different types of objects and their
relationships, dependencies, and properties. Ontologies encode the relational structure of
concepts which one can use to describe and reason about aspects of the world. This makes
them eminently suitable for many problems in computer vision which require prior knowl-
edge to be modelled and utilised in both a descriptive and prescriptive capacity.
In the work presented here, terms in the ontology are grounded in the data and therefore
carry meaning directly related to the appearance of real world objects. Tasks such as image
retrieval and automated visual surveillance can then be carried out by processing sentences
in a visual language defined over the ontology. Such sentences are not purely symbolic
since they retain a linkage between the symbol and signal levels. They can therefore serve
as a computational vehicle for active knowledge representation which permits incremental
refinement of alternate hypotheses through the fusion of multiple sources of information
and goal-directed feedback.
This approach is broadly motivated by two notions of how visual information process-
ing may be achieved in biological and artificial systems. Firstly, vision can be posed as
knowledge-driven probabilistic inference. Deductive and inductive reasoning can be ap-
plied to deal with two key problems that make vision difficult, namely complexity and
uncertainty. Recognition is thus posed as a joint inference problem relying on the integra-
tion of multiple (weak) clues to disambiguate and combine evidence in the most suitable
context as defined by the top level model structure.
Secondly, vision may be regarded as closely related to (and perhaps an evolutionary
precursor of) language processing. In both cases one ultimately seeks to find symbolic
interpretations of underlying signal data. Such an analysis needs to incorporate a notion of
the syntax and semantics that is seen as governing the domain of interest so that the most
likely explanation of the observed data can be found. A visual language can also serve as an
important mechanism for attentional control by constraining the range of plausible feature
configurations that need to be considered when performing a visual task such as recognition.
Processing may then be performed selectively in response to queries formulated in terms
of the structure of the domain, i.e. relating high-level symbolic representations to extracted
visual and temporal features in the signal.
This chapter presents two concrete implementations of the approach discussed above
which demonstrate its utility for solving relevant research problems for both image and
video analysis.
2. Related Work
An increasing number of research efforts in medium and high level video analysis can
be viewed as following the emerging trend that object recognition and the recognition of
temporal events are best approached in terms of generalised language processing which at-
tempts a machine translation [2, 3] from information in the visual domain to symbols and
strings composed of predicates, objects, and relations. While this approach has shown great
promise for applications ranging from image retrieval to face detection to visual surveil-
lance, a number of problems remain to be solved. The nature of visual information poses
hard challenges which hinder the extent to which mechanisms such as Hidden Markov
models and stochastic parsing techniques popular in the speech and language processing
community can be applied to information extraction from images and video. Consequently
there remains some lack of understanding as to which mechanisms are most suitable for
representing and utilising the syntactic and semantic structure of visual information and
how such frameworks can best be instantiated.
The role of machine learning in computer vision continues to grow and recently there
has been a very strong trend towards using Bayesian techniques for learning and inference,
especially factorised graphical models such as Dynamic Belief networks (DBN). While
finding the right structural assumptions and prior probability distributions needed to in-
stantiate such models requires some domain specific insights, Bayesian graphs generally
offer greater conceptual transparency than e.g. neural network models since the underlying
causal links and prior beliefs are made more explicit.
Many methods for representing and matching ontological knowledge in artificial intelli-
gence (description logics, frame-based representations, semantic nets) are coming back into
vogue [1], not least because of the “semantic web” initiative. However, many problems re-
main when such approaches are applied to highly uncertain and ambiguous data of the sort
that one is confronted with in computer vision and language processing. Much research re-
mains to be done in fusing classical syntactic approaches to knowledge representation with
modern factorised probabilistic modelling and inference frameworks.
[12] introduces an evidence driven probabilistic image analysis framework using on-
tologies and Bayesian networks. It is shown how this framework can yield superior perfor-
mance through the integration of contextual constraints and prior domain knowledge.
This linkage between visual and linguistic information is made explicit in [18], which
discusses an image parsing to text description (I2T) framework for generating text descriptions
of image and video content. The mapping from extracted visual primitives to textual de-
scriptions is facilitated by an and-or graph (AoG) representing a stochastic image grammar,
which defines goal-directed top-down hypotheses during image parsing.
Similar approaches were applied to traffic surveillance problems in [8]. The authors
present a spatio-temporal query language that can be used for analysing traffic surveillance
scenarios. The language features unary and binary relations over attributes such as dis-
tances, orientations, velocities, and temporal intervals. Queries consisting of trees of such
relations are matched to the output of a tracking framework by considering all possible
ways of binding tracked objects to leaf nodes in the tree and evaluating relations to assess
whether all constraints are matched.
[11] describes an event recognition language for video. Events can be hierarchical
composites of simpler primitive events defined by various temporal relationships over object
movements, and the authors present a standardised taxonomy for video event recognition
consisting of a video event representation language (VERL) and a video event markup
language (VEML) for annotation. A similar problem is given a more explicitly ontology-
based solution in [13], where both domain knowledge (entities and events of interest) and
system knowledge (capabilities of the visual recognition system) are modelled to facilitate
video event classification.
In the domain of image analysis, [10] introduces an ontology framework to contain
both visual and contextual knowledge. Probabilistic inference is used to compute high-level
retrieval queries over a natural scene database from the “LabelMe” data set. The problem
of semantic image interpretation is also considered in [6]. An ontology of spatial relations
and fuzzy representations of visual concepts is constructed and applied to knowledge-based
recognition of brain structures in 3D magnetic resonance images. To overcome the inherent
user interface challenge of query formulation, [7] presents a visual query language to reduce
the gap between image features and high-level semantics.
The concept of an ontological query language as presented in section 4. in this chapter
has recently been scaled up to a collection of some 25 million high resolution commercial
images. [16] discusses how modern advances in grid computing allow efficient distributed
processing and content-based analysis of such large image data sets without compromising
the semantic richness of the image descriptors defined by means of an ontology.
field of view or repeatedly returning to the scene over a short time scale. Internally the
system could perform these functions by generating and processing queries of the kind
“does the observed region movement correspond to a person entering the scene?”, “has a
person of similar appearance been observed recently?”, or “is the person emerging from
behind the occluding background object the same person who could no longer be tracked
a short while ago?”. These queries would be phrased in a language which relates them to
the corresponding feature extraction modules (e.g. a Bayesian network for fusing various
cues to track people-shaped objects) and internal descriptions (e.g. a log of events relating to
people entering or leaving the scene at certain locations and times, along with parameterised
models of their visual appearance). Formulating and refining interpretations then amounts
to selectively parsing such queries.
This chapter demonstrates that an ontological content representation and query lan-
guage is an effective vehicle for hierarchical representation and goal-directed inference in
high-level visual analysis tasks. As sketched in figure 1, such a language can serve as a
means of guiding the fusion of multiple sources of visual evidence and refining symbolic
interpretations of dynamic scenes in the context of a particular problem. By maintaining
representations of both the current internal state and derivational goals expressed in terms of
the same language framework, such a system could be seen as performing self-referential
feedback based control of the way in which information is processed over time. Visual
recognition then amounts to selecting a parsing strategy that determines how elements of
the current string set are to be processed further given a stream of lower level tokens gen-
erated by feature detectors. The overall structure of the interpretative module is not limited
to a particular probabilistic framework and allows context-sensitive parsing strategies to be
employed where appropriate.
Figure 2. Model of the retrieval process using an ontological query language to bridge the
semantic gap between user and system notions of content and similarity.
The retrieval process takes place entirely within the ontological domain defined by the
syntax and semantics of the user query. It utilises automatically extracted image segmenta-
tion and classification information, as well as Bayesian networks to infer higher level and
composite terms. The OQUEL language provides an effective mechanism for addressing key
problems of content based image retrieval, namely the ambiguity of image content and user
intention, and the semantic gap which exists between user and system notions of relevance
(see figure 2). By basing such a language on an extensible ontology, one can explicitly state
ontological commitments about categories, objects, attributes, and relations without having
to pre-define any particular method of query evaluation or image interpretation.
a specification syntax for image content. The latter includes adjectives for image region
properties (i.e. shape, colour, and texture) and both relative and absolute object location.
Desired image content can be denoted by nouns such as labels for automatically recognised
visual categories of stuff (“grass”, “cloth”, “sky”, etc.) and through the use of derived higher
level terms for composite objects and scene description. The latter includes a distinction
between singular and plural, hence “people” will be evaluated differently from “person”.
The following gives a somewhat simplified high level context free EBNF-style grammar G
of a reference implementation of the OQUEL language [17]:
G : {
  S  → R
  R  → modifier? (metacategory | SB | BR)
       | not? R (CB R)?
  BR → SB binaryrelation SB
  SB → (CS | PS)+ LS*
  CS → visualcategory | semanticcategory
       | not? CS (CB CS)?
  LS → location | not? LS (CB LS)?
  PS → shapedescriptor | colourdescriptor
       | sizedescriptor | not? PS (CB PS)?
  CB → and | or | xor;
}
The major syntactic categories are:
S: Start symbol of the sentence (text query).
R: Requirement (a query consists of one or more requirements which are evaluated separately, the
probabilities of relevance then being combined according to the logical operators).
BR: Binary relation on SBs.
SB: Specification block consisting of at least one CS or PS and 0 or more LS.
CS: Image content specifier.
LS: Location specifier for regions meeting the CS/PS.
PS: Region property specifier (visual properties of regions such as colour, shape, texture, and size).
CB: Binary (fuzzy) logical connective (conjunction, disjunction, and exclusive-OR).
Tokens (terminals) belong to the following sets:
modifier: Quantifiers such as “a lot of”, “none”, “as much as possible”.
scene descriptor: Categories of image content characterising an entire image, e.g. “countryside”,
“city”, “indoors”.
binaryrelation: Relationships which are to hold between clusters of target content denoted by
specification blocks. The current implementation includes spatial relationships such as “larger
than”, “close to”, “above”, and some more abstract relations such as “similar content”.
visualcategory: Categories of stuff, e.g. “water”, “skin”, “cloud”.
semanticcategory: Higher semantic categories such as “people”, “vehicles”, “animals”.
location: Desired location of image content matching the content or shape specification, e.g. “background”,
“lower half”, “top right corner”.
shapedescriptor: Region shape properties, for example “straight line”, “blob shaped”.
colourdescriptor: Region colour specified either numerically or through the use of adjectives and
The precise semantics of these constructs are dependent upon the way in which the query
language is implemented, the parsing algorithm, and the user query itself, as will be de-
scribed in the following sections. Figure 3 shows some additional query sentences and their
resulting abstract syntax trees.
Colour descriptors
Nearest-neighbour colour classifiers were built from the region colour representation. These
use the Earth Mover’s Distance with a Euclidean ground distance in RGB space to com-
pare region colour profiles with cluster templates learned from a training set. Colour classi-
fiers were constructed for each of twelve “basic” colours (“black”, “blue”, “cyan”, “grey”,
“green”, “magenta”, “orange”, “pink”, “red”, “white”, “yellow”, “brown”). Each region is
associated with the colour labels which best describe it.
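A rough sketch of such a classifier using OpenCV is shown below; the signature construction via k-means and the template dictionary are our illustrative assumptions rather than the exact representation used in the original system.

import numpy as np
import cv2

def colour_signature(rgb_pixels, k=8):
    # Cluster a region's RGB pixels into a compact signature (weight, R, G, B).
    rgb = np.float32(rgb_pixels.reshape(-1, 3))
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, centres = cv2.kmeans(rgb, k, None, criteria, 3, cv2.KMEANS_PP_CENTERS)
    weights = np.bincount(labels.ravel(), minlength=k).astype(np.float32)
    weights /= weights.sum()
    return np.hstack([weights[:, None], centres]).astype(np.float32)

def nearest_colour(region_sig, templates):
    # templates: dict mapping a colour label (e.g. "green") to a learned signature.
    # Earth Mover's Distance with a Euclidean ground distance in RGB space.
    dists = {name: cv2.EMD(region_sig, tmpl, cv2.DIST_L2)[0]
             for name, tmpl in templates.items()}
    return min(dists, key=dists.get)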
Face detection
Face detection relies on identifying elliptical regions (or clusters of regions) classified as
human skin. A nearest-neighbour shape classifier was trained to recognise eyes. Adjacent
image regions classified as human skin in which eye candidates have been identified are
then labelled as containing (or being part of) one or more human faces subject to the scale
visualcategory: The 12 categories of stuff which have been assigned to segmented image regions by
the neural net classifiers.
location: Location specifiers which are simply mapped onto the grid pyramid representation. For
example, when searching for “grass” in the “bottom left” part of an image, only content in the lower
left image quadrant is considered.
shapedescriptor: The current terms include “straight line”, “vertical”, “stripe”, “top edge”, “left
edge”, “right edge”, “polygonal”, and “blobs”. They are defined as predicates over region properties
and aspects of the region graph representation derived from the image segmentation.
colourdescriptor: Region colour specified either numerically in the RGB or HSV colour space or
through the colour labels assigned by the nearest-neighbour classifiers. By assessing the overall
brightness and contrast properties of a region using fixed thresholds, colours identified by each
classifier can be further described by a set of three “colour modifiers” (“bright”, “dark”, “faded”).
sizedescriptor: The size of image content matching other aspects of a query is assessed by adding
the areas of the corresponding regions. Size may be defined as a percentage value of image area
(“at least x%”, “at most x%”, “between x% and y%”) or relative to other image parts (e.g. “largest”,
“bigger than”).
Figure 7. Plots of relative percentages for precision versus recall for the retrieval experi-
ments. As can be seen, results obtained using OQUEL generally outperform those achieved
using query-by-example or a combination of sketch and feature based retrieval.
OQUEL queries are parsed to yield a canonical abstract syntax tree (AST) representation
of their syntactic structure. Figures 3 and 6 show sample queries and their ASTs. The
structure of the syntax trees follows that of the grammar, i.e. the root node is the start symbol
whose children represent particular requirements over image features and content. The leaf
nodes of the tree correspond to the terminal symbols representing particular requirements
such as shape descriptors and visual categories. Intermediate nodes are syntactic categories
instantiated with the relevant token (e.g. “and”, “which is larger than”) and represent the
relationships that are to be applied when evaluating the query.
In the first stage, the syntax tree derived from the query is parsed top-down and the leaf
nodes are evaluated in light of their predecessors and siblings. Information then propagates
back up the tree until one arrives at a single probability of relevance for the entire image. At
the lowest level, tokens map directly or very simply onto the content descriptors via SQL
queries. Higher level terms are either expanded into sentence representations or evaluated
using Bayesian graphs. Matching image content is retrieved and the initial list of results is
sorted in descending order of a probability of relevance score. Next, nodes denoting visual
properties (e.g. size or colour) are assessed in order to filter the initial results and modify
relevance scores according to the location, content, and property specifications which occur
in the syntax tree. Finally, relationships (logical, geometric, or semantic, e.g. similarity) are
assessed and probability scores are propagated up the AST until each potentially relevant
image has one associated relevance score. Relations are evaluated by considering matching
candidate image content (evidence). A closure consisting of a pointer to the identified
content (e.g. a region identifier or grid coordinate) together with the probability of relevance
is passed as a message to higher levels in the tree for evaluation and fusion. Query sentences
consist of requirements which yield matching probabilities that are further modified and
combined according to the top level syntax.
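To make the propagation concrete, the toy sketch below (our simplification with assumed node names and min/max-style fuzzy connectives, not the reference implementation) evaluates a hand-built requirement tree bottom-up to a single relevance score.

from dataclasses import dataclass
from typing import List

@dataclass
class Node:
    kind: str                     # "and", "or", "xor", "not", or "leaf"
    children: List["Node"] = None
    prob: float = 0.0             # leaf: relevance probability from the content descriptors

def evaluate(node: Node) -> float:
    # Bottom-up propagation of relevance probabilities through the syntax tree.
    if node.kind == "leaf":
        return node.prob
    p = [evaluate(c) for c in node.children]
    if node.kind == "and":        # fuzzy conjunction
        return min(p)
    if node.kind == "or":         # fuzzy disjunction
        return max(p)
    if node.kind == "xor":        # one illustrative choice of fuzzy exclusive-OR
        return max(p) - min(p)
    if node.kind == "not":
        return 1.0 - p[0]
    raise ValueError(node.kind)

# Example: "grass in the lower half and not people"
query = Node("and", [Node("leaf", prob=0.8),
                     Node("not", [Node("leaf", prob=0.1)])])
print(evaluate(query))            # single relevance score for the image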
At the leaf nodes of the AST, derived terms such as object labels and scene descriptions
are either expanded into equivalent OQUEL sentence structures or evaluated by Bayesian
networks integrating image content descriptors with additional sources of evidence (e.g. a
face detector). Bayesian networks tend to be context dependent in their applicability and
may therefore give rise to brittle performance when applied to very general content labelling
tasks. In the absence of additional information in the query sentence itself, it was therefore
found useful to evaluate mutually exclusive scene descriptors for additional disambiguation.
For example, the concepts “winter” and “summer” are not merely negations of one another
but correspond to Bayesian nets evaluating different sources of evidence. If both were to
assign high probabilities to a particular image then the labelling is considered ambiguous
and consequently assigned a lower relevance weight. Figure 5 shows a simplified Bayesian
network for the scene descriptor “winter”. Arrows denote conditional dependencies and
terminal nodes correspond to sources of evidence or, in the case of the term “outdoors”,
other Bayesian nets.
Due to the inherent uncertainty and complexity of the task, query evaluation is per-
formed in a way that limits the requirement for runtime inference by quickly ruling out
irrelevant images given the query. The overall approach relies on passing messages (image
structures labelled with probabilities of relevance), assigning weights to these messages ac-
cording to higher level structural nodes (modifiers and relations), and integrating these at the
topmost levels (specification blocks) in order to compute a belief state for the relevance of
the evidence extracted from the given image for the given query. The approach adopted here
is related to that of [9] in that it applies notions of weighting akin to the Dempster-Shafer
theory of evidence to construct an information retrieval model which captures structure, sig-
nificance, uncertainty, and partiality in the evaluation process. Image regions which match
the target content requirements can then be used to assess any other specifications (shape,
size, colour) which appear in the same requirement subtree within the query. Groups of re-
gions which are deemed salient with respect to the query can be compared for the purpose
of evaluating relations as mentioned above. As long as one maintains a notion of uncer-
tainty, borderline false detections will simply result in low-ranked retrieved images. Top
query results correspond to those images where the confidence of having found evidence for
the presence of people is high relative to the other images.
Figure 8. Plots of total number of images retrieved versus number of relevant images re-
trieved for left: OQUEL queries, right: query-by-example (QBE). In each case, results are
shown for an initial query and two iterations of query refinement (top line: OQUEL1 query;
middle line: OQUEL2 query; bottom line: OQUEL3 query).
Figure 9. Tracking results (from left to right): Original frame; Model of the background
variances; Results of background subtraction; Detected blobs after morphological opera-
tions; Resulting tracked objects (with bounding box).
Video sequences and ground truth from the CAVIAR project (EC Funded CAVIAR project/IST 2001 37540, http://homepages.inf.ed.ac.uk/rbf/CAVIAR/) were used to define an
ontology of visual content descriptors arranged in a hierarchy of scenarios, situations, roles,
states, and visual properties. The latter properties were defined by choosing object attributes
such as translational speed and appearance change which could easily be computed by
means of a visual tracking and appearance modelling framework. The CAVIAR training
data was then automatically re-labelled with this extended set of descriptors by instantiating
the tracking framework with the individual objects in the ground truth and computing the
selected visual attributes for all frames in the sequences. The resulting data was then used
to learn both the structure and parameters of Bayesian networks for high-level analysis.
Evaluations were performed to assess how easily the categories of the ontology could be
inferred on the basis of the chosen visual features and on the basis of preceding layers in the
hierarchy. The former allows one to assess the (in)adequacies of a set of given visual content
extraction and representation methods, which is an important tool in designing the computer
vision components of a surveillance system in order to maximise their utility for high-level
inference in light of the domain ontology. Conversely, one can use the probabilistic scoring
methods applicable to Bayesian networks to evaluate how well defined, for example, the pre-defined
set of situation descriptors is in terms of the labels for object roles and states which appear
in the ground truth. Further implementation details can be found in [15].
Figure 10. Bayesian network for occlusion reasoning and prediction of object interactions.
Pixels are deemed to be part of the dynamic foreground if their difference from the background
exceeds a threshold which is a multiple of that pixel’s background variance σ_B,i, and if they are
not deemed to be part of a shadow, as determined by a simple test over luminance versus hue and
saturation changes.
After performing some morphological clean-up operations, foreground pixels are clus-
tered using connected components analysis to identify moving regions (“blobs”). Blob
positions are tracked using a Kalman or particle filter with a second order motion model.
Tracked objects are matched to detected blobs using a weighted dissimilarity metric which
takes into account differences in predicted object location vs blob location and changes
in object appearance. Colour appearance is modelled by histograms in RGB space and
Gaussian mixture models in HSV space, with similarity computed via the Earth Mover’s
Distance and the model likelihood, respectively. Object arrivals, departures, and
occlusions are inferred using a Bayesian network (see figure 10). Figure 9 illustrates the
approach.
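A condensed per-frame sketch of the blob-detection stage using OpenCV follows; it substitutes OpenCV's MOG2 background model (with its built-in shadow flag) for the per-pixel variance model and shadow test described above, so the details are assumptions rather than the original implementation, and the Kalman-filter data association is omitted.

import cv2
import numpy as np

bg = cv2.createBackgroundSubtractorMOG2(detectShadows=True)   # per-pixel background model
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

def detect_blobs(frame, min_area=200):
    fg = bg.apply(frame)                      # foreground mask from background subtraction
    fg[fg == 127] = 0                         # MOG2 marks detected shadow pixels as 127
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, kernel)    # morphological clean-up
    fg = cv2.morphologyEx(fg, cv2.MORPH_CLOSE, kernel)
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(fg)
    return [(tuple(stats[i, :4]), tuple(centroids[i]))  # (x, y, w, h), (cx, cy) per blob
            for i in range(1, n) if stats[i, cv2.CC_STAT_AREA] >= min_area]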
In order to model and track the shape of objects in terms of their closed boundaries,
approaches based on active contours have been very prominent in the vision community.
EBSCOhost: eBook Collection (EBSCOhost) printed on 1/12/2025 5:32:11 PM UTC via DICKINSON COLLEGE. All use subject to https://www.ebsco.com/terms-of-use.
Ontology Based Image and Video Analysis 319
We have adapted Gradient Vector Flow (GVF) in order to make the transition from tracked
edges to closed boundary contours. The snake’s external force is computed as a diffusion of
the gradient vectors of an edge map derived using our edge detector. After fitting a motion
model to the edges of each object by means of the edge tracker and also fitting the GVF ac-
tive contour to the object’s outer boundary, the shape of that boundary is parameterised. The
approach chosen here uses the first four of the seven moment invariants φi proposed
by Hu [5]. In the present case, the invariant moments are calculated over object boundary
pixels only, i.e. f(x, y) is an indicator function which is 1 for a given object’s boundary
pixels and 0 elsewhere. All x, y values are transformed such that they lie in the range [0, 1]
by normalising with respect to the object’s bounding box. In order to reduce the range of
the φi, the logarithms log(φi) of the actual values are used.
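The sketch below computes this descriptor with OpenCV's Hu moments; the rasterisation of the normalised boundary into a fixed-size mask and the abs/epsilon guard inside the logarithm are our illustrative choices, not details given in the text.

import cv2
import numpy as np

def boundary_shape_descriptor(contour, bbox, res=256):
    # contour: object boundary points (N, 1, 2); bbox: (x, y, w, h) bounding box.
    x, y, w, h = bbox
    pts = (contour.reshape(-1, 2) - np.array([x, y])) / np.array([w, h], dtype=float)
    mask = np.zeros((res, res), np.uint8)
    poly = np.int32(pts * (res - 1)).reshape(-1, 1, 2)
    cv2.polylines(mask, [poly], isClosed=True, color=1, thickness=1)  # boundary pixels only
    phi = cv2.HuMoments(cv2.moments(mask, binaryImage=True)).ravel()
    return np.log(np.abs(phi[:4]) + 1e-12)   # first four invariants, log-compressed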
Figure 11. Left: CAVIAR ontology; Right: Extended set of concepts derived through visual
analysis.
current role to depend on the situation they are facing, and their situation to depend on the
scenario in which they are participating. These relationships can be used as a structural
prior for the training of Bayesian networks as described below. In order to ground the terms
of the ontology, we extend it with appropriate descriptors computed from the tracking and
appearance modelling framework described in section 5.1. These visual descriptors (see
table 11) are not claimed to constitute the best choice for the analysis task at hand. They
are merely properties of tracked objects which can be simply and robustly defined using
the techniques described in section 5.1. and offer a reasonable basis for studying the re-
quirements for low-level analysis mechanisms which result from the pre-defined ontology
of higher-level terms.
Figure 12. Bayesian network structure trained using the K2 algorithm from the extended
ontology.
Figure 13. Bayesian network structure trained using the K2 algorithm with a structural prior
from the extended ontology.
accounts for about 95.5% of the variation in the data while reducing the effect of outliers
that may occur due to discontinuities caused by imperfect visual analysis.
Learning the network structure requires a means of searching the space of all possible
DAGs over the set of nodes X and a scoring function to evaluate a given structure over
the training data D. The K2 algorithm is a greedy search technique which starts from an
empty network but with an initial ordering of the nodes. A Bayesian network is then created
iteratively by adding a directed arc to a given node from that parent node whose addition
most increases the score of the resulting graph structure. This process terminates as soon as
none of the possible additions result in an increased score.
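A bare-bones sketch of this greedy search is given below; local_score stands for any decomposable scoring function (for instance the BIC score sketched later in this section), and the cap on the number of parents is an assumption made for tractability.

def k2_structure(order, data, local_score, max_parents=3):
    # Greedy K2 search: for each node (in the given ordering) repeatedly add the
    # single earlier node whose addition most increases the local score.
    parents = {v: [] for v in order}
    for i, v in enumerate(order):
        candidates = set(order[:i])              # only earlier nodes may become parents
        best = local_score(v, parents[v], data)
        improved = True
        while improved and len(parents[v]) < max_parents:
            improved = False
            scores = {c: local_score(v, parents[v] + [c], data)
                      for c in candidates - set(parents[v])}
            if scores and max(scores.values()) > best:
                c = max(scores, key=scores.get)  # best single-parent addition
                parents[v].append(c)
                best = scores[c]
                improved = True
    return parents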
Markov Chain Monte Carlo (MCMC) can be applied to Bayesian network structure
learning without the need for a prior node ordering (although such orderings can be em-
ployed to speed up convergence). The Metropolis-Hastings sampling technique is applied
to search the space of all graphs G by defining a Markov Chain over it whose station-
ary distribution is the posterior probability distribution P(G|D). Following Bayes’ rule,
P(G|D) = P(D|G)P(G). The marginal likelihood of the data P(D|G) can be computed
by means of an appropriate scoring function (see below) and the prior P(G) may be left
uninformative (i.e. a uniform distribution over the set of possible DAGs G). Candidate
structures are then sampled by performing a random walk over the Markov chain. The
highest scoring network structure can then be inferred by averaging over a sufficiently large
number of samples.
In order to compute the score of a candidate network over the training data while avoid-
ing overfitting, two scoring functions were considered:
Figure 14. Plot of recognition rates achieved by the Bayesian network shown in figure 13
for the variables in the CAVIAR ontology given different evidence. Top: given only the
computer vision derived nodes as evidence; 2nd from top: given only the states (tAP, tDI,
tO, tIN, tAC, tWK, tR); 3rd from top: given both the states and computer vision informa-
tion; Bottom: recognition rates for the states given the computer vision derived nodes as
evidence. It can be seen that some nodes in the ontology are insufficiently grounded us-
ing the computer vision components alone but that embedding them in an ontology derived
structure improves recognition.
• The Bayesian score, i.e. the marginal likelihood P(D|G) = ∫ P(D|G, θ)P(θ|G) dθ,
where D is the training data, G is the graph structure, and θ are the network parameters.
• The Bayesian Information Criterion (BIC), which approximates the marginal likelihood
using a Minimum Description Length (MDL) approach. The Laplace approximation
to the parameter posterior can be written in terms of the likelihood and a
penalty term (d/2) log M which explicitly penalises model complexity:

log P(D|G) ≈ log P(D|G, θ̂_G) − (d/2) log M = BIC(D, G),

where M is the number of training cases in D, d is the number of free parameters, and
θ̂_G is their maximum likelihood estimate. A small computational sketch of this score is given below.
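A minimal computation of this local score for one discrete node, assuming the training data is a list of dictionaries mapping variable names to values, might look as follows; counting only parent configurations observed in the data is a simplification that slightly understates d compared with using the full CPT size.

import numpy as np
from collections import Counter

def bic_local_score(v, parents, data):
    # BIC/MDL local score for discrete node v with the given parent set:
    # maximized log-likelihood minus (d/2) log M.
    M = len(data)                                      # number of training cases
    vals_v = {row[v] for row in data}
    parent_configs = Counter(tuple(row[p] for p in parents) for row in data)
    joint = Counter((tuple(row[p] for p in parents), row[v]) for row in data)
    loglik = sum(n * np.log(n / parent_configs[cfg]) for (cfg, _), n in joint.items())
    d = len(parent_configs) * (len(vals_v) - 1)        # free parameters in v's CPT
    return loglik - 0.5 * d * np.log(M)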
MCMC was largely found to provide inferior results and required many thousands of it-
erations to converge to a solution. Furthermore, the K2 method directly benefits from the
prior structural information contained in the ontology. Although the BIC score is a cruder
approximation than the computation of the marginal likelihood shown above, there was very
little difference in the resulting network performance between the two scoring methods. Once
the network structure has been trained, the parameters can be estimated easily by maximum
likelihood estimation smoothed with Dirichlet priors (pseudo-counts).
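For completeness, a sketch of this parameter estimation step with symmetric Dirichlet pseudo-counts (the smoothing constant alpha is an assumed value):

from collections import Counter

def cpt_with_pseudocounts(v, parents, data, values, alpha=1.0):
    # Smoothed (posterior-mean) CPT estimate: add alpha pseudo-counts to every
    # (parent configuration, value) cell before normalising.
    counts = Counter((tuple(row[p] for p in parents), row[v]) for row in data)
    cfgs = {tuple(row[p] for p in parents) for row in data}
    return {(cfg, val): (counts[(cfg, val)] + alpha) /
                        (sum(counts[(cfg, u)] for u in values) + alpha * len(values))
            for cfg in cfgs for val in values}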
Figure 15. Plot of Bayesian network recognition rates for the variables in the CAVIAR
ontology given only the computer vision derived nodes as evidence. The rates achieved by
the network in figure 13 are shown in red (top line), those for the network in figure 12 are
shown in blue (bottom line). It can be seen that the use of an ontological prior improves
accuracy.
Figure 16. Bayesian network structure trained using the K2 algorithm with random initial
ordering of the nodes.
5.4. Results
Figure 12 shows a Bayesian network trained in the manner described in section 5.3. using
the K2 algorithm without a structural prior. The network structure looks somewhat erratic
but captures some of the hierarchical relationships between variables that one would expect
from their semantics. Some nodes in the network (siM, siIS, rLV, tIN) remain unconnected.
That is because their values are almost constant in the data set and hence can in most cases
be inferred trivially through a purely deterministic prior.
The network shown in figure 13 was trained using K2 and a structural prior specifying
that nodes which are part of the same semantic level in the ontology (e.g. all situation
labels) should be treated as equivalent in terms of the ordering of nodes. The resulting
network structure encompasses many of the causal relationships one would expect from
the semantics and shows that there are strong dependencies between the computer vision
derived terms and the states and roles in particular. Figure 14 shows classification rates
achieved by this network given different sets of evidence. As shown in figure 15, this
network achieves better recognition rates than the one trained without a structural prior.
The average recognition rates (computed over an independent testing set) over all elements
of the CAVIAR ontology are now 0.911 and 0.882 respectively for the Bayesian network
with and without a structural prior.
Even without the structural prior (which groups variables in the same level of the ontol-
ogy into an equivalence class), the ontology has thus far still been used to define an ordering
of the nodes in the Bayesian network. Figure 16 shows an example of a Bayesian network
which was trained using K2 without any such ontological information, i.e. with a random
initial ordering of the nodes. The performance of this network is worse than the two dis-
cussed before, with the average recognition rate now being 0.865. This implies that the use
of an ontological prior has reduced the expected error rate from about 14% to about 9% (i.e.
a 36% reduction) even though the same data set, image processing, and learning method are
being used.
It is also possible to provide a structural prior for MCMC learning by initialising the
search with a particular network structure. The resulting average recognition rates for the
CAVIAR ontology are 0.642 (MCMC without ontological prior) and 0.659 (search ini-
tialised with a prior).
6. Conclusion
This chapter presents research in the area of high-level computer vision which shows how
ontologies can be used as effective computational and representational mechanisms that
allow one to relate semantic descriptors to their parametric representations in terms of the
underlying data primitives. A particular focus of this work is on the role of ontologies
as a means of representing structured prior information and of fusing different kinds of
information in an inference framework.
Section 4. presents a novel approach to content-based image retrieval founded on an on-
tological query language, OQUEL. The problems of expressing, representing, and matching
user queries are thus addressed through a prescriptive ontology of image content descriptors,
hierarchically decomposed using a language that embodies a general syntax and
semantics for query composition and representation of target image content. It is shown
how the ontological query language provides a way of narrowing the semantic gap be-
tween users and the retrieval system by providing a shared language and hierarchy of con-
cepts for both. Rather than attempting to describe image content in terms of the language,
this approach recognises that the meaning attributed to a given image by a user relative
to some current retrieval need (and therefore its relevance to a given query) is only dis-
cernible through the composition of the query itself, which defines the ontological domain
over which relevance assessment is carried out. Inference of image content thus occurs
only directly in response to the user query and terminates as soon as the relevance or ir-
relevance of each image has been established. Comparisons with other query composition
and retrieval paradigms (sketch, sample images, property thresholds) show that the OQUEL
query language constitutes a more efficient and flexible retrieval tool. Few prior interpreta-
tive constraints are imposed and relevance assessments are carried out solely on the basis of
the syntax and semantics of the query itself. Such queries have also generally proven to be
more efficient to evaluate since one only needs to analyse those aspects of the image con-
tent representation that are relevant to nodes in the corresponding syntax tree and because
of various possible optimisations in the order of evaluation to quickly rule out non-relevant
images.
In section 5., the problem of building reliable high-level recognition systems for dy-
namic scene analysis (in particular that of surveillance video) is addressed by a combination
of pre-annotated training data, a set of automatically derived visual descriptors, and an ex-
tended ontology incorporating both of these. The section describes how Bayesian networks
can be trained from this data to perform inference over the terms of the ontology. Moreover,
an analysis of the composition and performance of different Bayesian recognition networks
can lead to insights into the coherence, utility, and groundedness of the ontology itself in
terms of the basis vocabulary derived by the visual analysis.
A central problem in the development and application of ontologies is that of grounding
their terms and relations in the underlying data. One way in which this may be achieved
is to hierarchically decompose and re-express the terms of the ontology until they are all
defined in terms of primitives which the system can readily recognise. Another way is to
provide sufficient training data such that the system can be made to internalise an appropri-
ate definition of the concept by means of machine learning. Both of these approaches are
investigated in this work.
The general idea is that recognising an object or event requires one to relate loosely de-
fined symbolic representations of concepts to concrete instances of the referenced object or
behaviour pattern. This is best approached in a hierarchical manner by associating individ-
ual parts at each level of the hierarchy according to rules governing which configurations
of the underlying primitives give rise to meaningful patterns at the higher semantic level.
Thus syntactic rules can be used to drive the recognition of compound objects or events
based on the detection of individual components corresponding to detected features in time
and space. Visual analysis then amounts to parsing a stream of basic symbols according
to prior probabilities to find the most likely interpretation of the observed data in light of
the top-level starting symbols in order to establish correspondence between numerical and
symbolic descriptions of information.
References
[1] Dasiopoulou, S., Kompatsiaris, I.: Trends and issues in description logics frameworks
for image interpretation. In: S. Konstantopoulos, S. Perantonis, V. Karkaletsis, C. Spy-
ropoulos, G. Vouros (eds.) Artificial Intelligence: Theories, Models and Applications,
Lecture Notes in Computer Science, vol. 6040, pp. 61–70. Springer Berlin / Heidel-
berg (2010)
[2] Duygulu, P., Barnard, K., De Freitas, J., Forsyth, D.: Object recognition as machine
translation: Learning a lexicon for a fixed image vocabulary. In: Proc. European
Conference on Computer Vision (2002)
[3] Duygulu, P., Bastan, M.: Multimedia translation for linking visual data to semantics
in videos. Machine Vision and Applications pp. 1–17 (2009)
[4] Harnad, S.: The symbol grounding problem. Physica D 42, 335–346 (1990)
[5] Hu, M.: Visual pattern recognition by moment invariants. IRE Transactions on Infor-
mation Theory 8(2), 179–187 (1962)
[6] Hudelot, C., Atif, J., Bloch, I.: Fuzzy spatial relation ontology for image interpreta-
tion. Fuzzy Sets and Systems 159(15), 1929 – 1951 (2008). From Knowledge Rep-
resentation to Information Processing and Management - Selected papers from the
French Fuzzy Days (LFA 2006)
[7] Iskandar, D.: Visual ontology query language. In: Networked Digital Technologies,
2009. NDT ’09. First International Conference on, pp. 65 –70 (2009)
[8] Kohler, C.: Selecting ghosts and queues from a car tracker’s output using a spatio-
temporal query language. In: Proc. Conference on Computer Vision and Pattern
Recognition (2004)
[9] Lalmas, M.: Applications of Uncertainty Formalisms, chap. Information retrieval and
Dempster-Shafer’s theory of evidence, pp. 157–177. Springer (1998)
[10] Liu, Y., Zhang, J., Li, Z., Tjondronegoro, D.: High-level concept annotation using
ontology and probabilistic inference. In: Proceedings of the First International Con-
ference on Internet Multimedia Computing and Service, ICIMCS ’09, pp. 97–101.
ACM, New York, NY, USA (2009)
[11] Nevatia, R., Hobbs, J., Bolles, B.: An ontology for video event representation. In:
Proc. Int. Workshop on Detection and Recognition of Events in Video (at CVPR04)
(2004)
[12] Nikolopoulos, S., Papadopoulos, G., Kompatsiaris, I., Patras, I.: An evidence-driven
probabilistic inference framework for semantic image understanding. In: P. Perner
(ed.) Machine Learning and Data Mining in Pattern Recognition, Lecture Notes in
Computer Science, vol. 5632, pp. 525–539. Springer Berlin / Heidelberg (2009)
[13] SanMiguel, J., Martinez, J., Garcia, A.: An ontology for event detection and its ap-
plication in surveillance video. In: Advanced Video and Signal Based Surveillance,
2009. AVSS ’09. Sixth IEEE International Conference on, pp. 220 –225 (2009)
[14] Sinclair, D.: Smooth region structure: folds, domes, bowls, ridges, valleys and slopes.
In: Proc. Conference on Computer Vision and Pattern Recognition, pp. 389–394
(2000)
[15] Town, C.: Ontology-driven Bayesian networks for dynamic scene understanding. In:
Proc. Int. Workshop on Detection and Recognition of Events in Video (at CVPR04)
(2004)
[16] Town, C., Harrison, K.: Large-scale grid computing for content-based image retrieval.
Aslib Proceedings 62(4/5), 438–446 (2010)
EBSCOhost: eBook Collection (EBSCOhost) printed on 1/12/2025 5:32:11 PM UTC via DICKINSON COLLEGE. All use subject to https://www.ebsco.com/terms-of-use.
Ontology Based Image and Video Analysis 327
[17] Town, C., Sinclair, D.: Language-based querying of image collections on the basis of
an extensible ontology. International Journal of Image and Vision Computing 22(3),
251–267 (2004)
[18] Yao, B., Yang, X., Lin, L., Lee, M.W., Zhu, S.C.: I2t: Image parsing to text descrip-
tion. Proceedings of the IEEE 98(8), 1485 –1508 (2010)
EBSCOhost: eBook Collection (EBSCOhost) printed on 1/12/2025 5:32:11 PM UTC via DICKINSON COLLEGE. All use subject to https://www.ebsco.com/terms-of-use.
EBSCOhost: eBook Collection (EBSCOhost) printed on 1/12/2025 5:32:11 PM UTC via DICKINSON COLLEGE. All use subject to https://www.ebsco.com/terms-of-use.
In: Computer Vision ISBN: 978-1-61209-399-4
Editor: Sota R. Yoshida © 2011 Nova Science Publishers, Inc.
Chapter 9
Computer Vision by Laser Metrology and Algorithms of Artificial Intelligence
J. A. Muñoz-Rodríguez*
Centro de Investigaciones en Optica, A. C., Leon, Gto., Mexico
* E-mail address: [email protected]; Tel: (477) 441 42 00
ABSTRACT
We present a review of our algorithms for computer vision based on laser metrology
and artificial intelligence. The chapter covers laser metrology, image processing,
contouring by neural networks, modelling of a mobile setup and the determination of
vision parameters. To achieve the computer vision, the object shape is reconstructed by a
laser metrology method. To carry it out, the object is scanned by a laser line. Based on
the deformation of the laser line, a neural network provides the object dimensions. The
behavior of the laser line is obtained by image processing. The relationship between the
laser line and the object shape is modelled by approximation networks. The architecture
of the networks is built using images of a laser line projected on objects whose
dimensions are known. The aim of the neural networks is to perform the contouring
without measurements on the optical setup. The network also provides the camera
parameters and the setup parameters. Thus, the performance and accuracy are improved,
because measurement errors are not passed to the contouring system. To describe the
accuracy, the root mean square (rms) error is calculated using data produced by the
networks and data provided by a contact method. The technique is tested with real
objects and its experimental results are presented.
1. INTRODUCTION
Computer vision largely deals with the analysis of images in order to achieve results
similar to those obtained by a human observer. This field is directed towards recovering the
object contour in a scene. Optical metrology is a valuable tool in computer vision,
especially for object contouring. Application areas such as object modelling, distance
measurement, obstacle location, shape inspection and navigation are addressed by
computer vision [1]. In optical metrology, the object contouring is treated by image
processing to achieve the vision task. Based on a light pattern image, the contour data are
computed to reproduce the object shape. Optical sensors provide the contouring without
contact to achieve the computer vision [2]. The use of structured light generally makes the
system more reliable and the acquired data easier to interpret. Lighting methods have been
implemented to perform the object contouring. Optical methods such as fringe pattern
projection, laser line projection and light spot projection have been developed to extract three-
dimensional data [3-5]. In these methods, the contouring is performed by triangulation using
the geometry of the optical setup [6-8]. In the triangulation procedure, the distances of the
geometry are obtained by a procedure external to the contouring system. This kind of
approach demands accurate positioning of the components of the optical setup.
In this chapter, the computer vision includes automatic contouring by networks to avoid
measurements on the optical setup. In this procedure, the object is scanned by a laser line.
From this step, images of the laser line are captured and digitized. By means of a neural
network, the object contour is computed based on the laser line deformation. This network
generates a mathematical function, which contains the relationship between the object
dimension and the laser line position. Thus, the contour dimension is deduced computationally
and measurements on the optical setup are avoided. In this manner, the performance and
the accuracy are improved, because the errors of measurement on the setup are not
passed to the computer vision system. Also, a modelling of the mobile setup is performed by
the network, which computes the contour dimension. In this case, the devices of the setup can
be moved to detect occluded areas and small details, which are not detected by a static
setup. The network computes the contour dimension based on the laser line deformation at
any camera position. In this manner, the object dimension is deduced for the mobile setup.
Also, the parameters of the imaging system are deduced from the network and by image
processing. The viability of the computer vision is examined based on the root mean squared
error. This analysis is carried out using a contact method as reference. The setup used
in this technique consists of an electromechanical device, a laser projector, a CCD camera and
a computer. In this arrangement, the object is moved along an axis and scanned by a laser line. In the
scanning procedure, images of the laser line are captured by the CCD camera. These images are
processed to detect the laser line position by measuring the position of maximum intensity. This position
is processed by the neural network to determine the object dimension.
The content of this chapter is organized as follows. In section 2, the optical setup for
object contouring is described. In section 3, neural networks are implemented to perform the
object contouring. In section 4, the modelling of the mobile setup is presented. In section 5, the
computer vision parameters are determined. In section 6, experimental results of computer
vision by optical metrology are presented. Finally, a conclusion is given.
The object contouring is performed by detecting the laser line deformation. When a laser line
is projected on an object, the line is deformed at the image plane due to the surface depth
variation and the viewing direction. From the laser line position and the geometry of the
optical setup, the contour data are recovered [10]. The technique adopted here for object
contouring is laser line projection. The setup includes an electromechanical device, a CCD
camera, a laser line projector and a computer. The experimental setup is shown in Figure 1.
In this setup, the object is fixed on a platform of the electromechanical device. A laser line is
projected perpendicularly onto the object surface and the camera is aligned at an oblique
angle. To perform the scanning, the object is moved by the electromechanical device under
computer control. In each step of the scanning, the laser line is deformed in the x-axis
according to the object surface. Every deformed line is captured and digitized by the CCD
camera and a frame grabber, respectively. By detecting the position of the line deformation,
the contouring is performed. A neural network computes the object dimensions by means of
the laser line position. The contour data extracted from one laser line image correspond to a
transverse section of the object. The data of all transverse sections produce the complete
object shape.
Figure 1. Experimental setup: laser line projector, CCD camera, PC and electromechanical device, with the object on the reference plane (x, y and z axes).
Figure 2. Geometry of the optical setup: laser line positions xA and xB at the image plane, lens with focal length F, distance d, line displacement s(x, y) and object depth h(x, y) over the reference plane (points A and B).
The relationship between the laser line position and the object depth is described by the
geometry shown in Figure 2. In this figure, the x-axis and y-axis are located on the reference
plane and the object depth is indicated by h(x, y). The points A and B correspond to the laser
line projected on the reference plane and on the object surface, respectively. When a laser
line is projected on the target, the line position in the image moves from xA to xB. This
displacement can be represented as

s(x, y) = x_A - x_B .    (1)

The line displacement s(x, y) is directly proportional to the object depth h(x, y). A network
models this relationship. Therefore, the network computes the depth h(x, y) for each
displacement s(x, y) detected in the image.
To detect the line displacement, the position of the intensity maximum is measured in the
image. The intensity projected by a laser diode has a Gaussian distribution in the lateral
direction (x-axis) [11]. The pixel intensities of every row of the laser line image are
represented by (x0, z0), (x1, z1), ..., (xn, zn), where xi is the pixel position, zi is the pixel
intensity and n is the number of pixels. A simple way to represent the pixel intensity is a
Gaussian function,

f(x) = \frac{N_i}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} ,    (2)

where Ni is the area under the curve, \mu is the mean of the function and \sigma is the
standard deviation [12]. To determine the position of the laser line, the mean of Eq.(2) is
computed for each row of the image, because \mu represents the position of the intensity
maximum of the Gaussian function. The mean of the Gaussian function is given by

\mu = \frac{\sum_{i=0}^{n} z_i\, x_i}{\sum_{i=0}^{n} z_i} ,    (3)

where xi represents the pixel position and zi the pixel intensity. To detect the laser line
position, the maximum position is calculated for every row of the image via Eq.(3). To
describe this procedure, the maximum position is calculated for the pixels shown in Figure 3.
These pixels correspond to one row of a laser line image. To compute the maximum position,
the pixel positions xi and intensities zi are substituted into Eq.(3). In this case, the result is
\mu = 34.184, which corresponds to the maximum position. Therefore, the laser line position
is xB = 34.184 pixels. Also, Ni and the standard deviation \sigma are calculated to fit the
Gaussian function f(x) shown in Figure 3.
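As a minimal sketch of Eq.(3), the intensity-weighted mean can be computed for every image row with a few NumPy operations; the function names are illustrative only.

    import numpy as np

    def line_position_centroid(row):
        """Intensity-weighted mean of one image row, Eq. (3); this is the mean of the
        fitted Gaussian and therefore the sub-pixel position of the laser line."""
        z = np.asarray(row, dtype=float)
        x = np.arange(z.size)
        return (z * x).sum() / z.sum()

    def line_positions(image):
        """Apply Eq. (3) to every row of a laser line image."""
        return np.array([line_position_centroid(r) for r in image])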
Bezier curves are also used to detect the position of the laser line. The nth-degree Bezier
function is determined by n+1 pixels [13] and is described by the two parametric equations

x(u) = \sum_{i=0}^{n} \binom{n}{i} u^i (1-u)^{n-i}\, x_i ,    (4)

z(u) = \sum_{i=0}^{n} \binom{n}{i} u^i (1-u)^{n-i}\, z_i .    (5)
Eq.(4) represents the pixel position and Eq.(5) represents the pixel intensity. To fit the Bezier
curve shown in Figure 3, x0, x1, x2, ..., xn are substituted into Eq.(4) and z0, z1, z2, ..., zn are
substituted into Eq.(5). Then, these equations are evaluated in the interval 0 \le u \le 1 to fit
the curve shown in Figure 3. The first and second derivatives are then used to determine the
local maximum or minimum. In this case, the second derivative satisfies z''(u) < 0 in the
interval 0 \le u \le 1. Therefore, the maximum is detected by setting the first derivative to
zero, z'(u) = 0, via the bisection method [14]. Beginning with the pair of values ui = 0 and
us = 1, because z(u) is defined on the interval 0 \le u \le 1, u* is taken halfway between ui
and us. If z'(u) evaluated at u = u* is positive, then ui = u*. If z'(u) evaluated at u = u* is
negative, then us = u*. Next, u* is taken as the midpoint of the last pair of values, which
converges to the root. The value u* where z'(u) = 0 is substituted into Eq.(4) to obtain the
maximum position x*. The result is x* = 34.274 and the stripe position is xB = 34.274 pixels,
which is shown in Figure 3. These two methods are compared on the basis of the accuracy of
the laser line detection. According to the root mean square (rms) error, laser line detection by
Bezier curves provides the best accuracy for object contouring. A laser line image is shown
in Figure 4(a), which is processed to detect the line displacement. To carry it out, the
maximum position is computed for every row of the image via Bezier curves. Then, the
maximum position is substituted into Eq.(1) to calculate the displacement s(x, y), which
represents the depth profile shown in Figure 4(b). These depth data are used to compute the
object dimension by means of a network, which is described in section 3.
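The Bezier-based detection can be sketched as follows, assuming a single interior intensity maximum per row so that z'(0) > 0 and z'(1) < 0, as described above; the derivative of z(u) is approximated numerically rather than analytically, and the function names are illustrative.

    import numpy as np
    from math import comb

    def bezier(u, ctrl):
        """Evaluate an nth-degree Bezier curve with control values ctrl at parameter u."""
        n = len(ctrl) - 1
        return sum(comb(n, i) * u**i * (1 - u)**(n - i) * c for i, c in enumerate(ctrl))

    def bezier_peak(x_pix, z_pix, iters=40, h=1e-6):
        """Sub-pixel peak of one row: bisect the sign change of dz/du, then evaluate x(u*)."""
        dz = lambda u: (bezier(u + h, z_pix) - bezier(u - h, z_pix)) / (2 * h)
        lo, hi = 0.0, 1.0
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if dz(mid) > 0:          # intensity still rising: the peak lies to the right
                lo = mid
            else:                    # intensity falling: the peak lies to the left
                hi = mid
        u_star = 0.5 * (lo + hi)
        return bezier(u_star, x_pix)   # substitute u* into the position equation, Eq. (4)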
Figure 3. Maximum position obtained from a Gaussian function f(x) and a Bezier curve (x(u), z(u)): pixel intensity versus pixel position.
Figure 4. (a) Laser line image. (b) Line displacement s(x, y) in pixels, computed for every row of the image.
A radial basis function neural network (RBFNN) is an artificial network applied to problems
of supervised learning, regression, classification, shape identification, object detection and
time series prediction [15-19]. The structure of an RBFNN includes an input vector, a hidden
layer and an output layer [20]. This structure is shown in Figure 5. The input vector is only a
fan-out that connects the input data with the hidden layer. The hidden layer is constructed by
radial basis functions. The output layer is obtained by the summation of the neurons of the
hidden layer, which are multiplied by a weight. Each layer of the network is defined as
follows. The input values s1, s2, s3, ..., sn are obtained by measuring the laser line
position by image processing as described in section 2. The radial basis function of the
hidden layer is a Gaussian neuron, which is described by

\phi_i = \exp\!\left(-\frac{\|s - s_i\|^2}{2\sigma^2}\right) ,    (6)

where si is the neuron centre and \sigma is the neuron width, which is calculated as

\sigma = \sqrt{\frac{1}{q}\sum_{i=1}^{q}(s - s_i)^2} ,    (7)

where q = 2 and s - si is the distance between each pair of nearest neurons. The output layer
is computed by the summation of the weighted neurons, which is represented by

h(s) = \sum_{i=1}^{m} w_i\, \phi(\|s - s_i\|) ,    (8)

where m is the number of neurons. To construct the complete network architecture, the
parameters of Eq.(8) should be determined. These parameters include the neuron shape, the
neuron centres, the neuron width, the number of neurons and the weights.
Figure 5. Structure of the RBFNN: input values s1, ..., sm, Gaussian hidden neurons, weights w1, ..., wm and output h.
The first parameter to be determined is the neuron shape. Several functions can be used to
construct the neuron shape, such as the cubic, inverse quadratic, inverse multi-quadratic and
Gaussian functions [18]. Using these functions, four RBFNNs are constructed. Then, the
mean squared error (MSE) [18] is calculated to determine the best network according to the
neuron shape. The MSE is defined by

MSE = \frac{1}{n}\sum_{i=1}^{n}(hc_i - h_i)^2 ,    (9)
where hci is the output calculated by the RBFNN, hi is the desired output and n is the number
of analyzed data. Based on the MSE, the best output response is given by the Gaussian
function. Therefore, the Gaussian function of Eq.(6) is used to build the hidden layer. The
second parameter is the neuron centre si, which is given by each datum of the input vector.
The third parameter to be determined is the number of neurons of the hidden layer. Again,
the MSE criterion is used, since it provides the error as a function of the number of neurons.
To define the neuron number, the network is constructed with different numbers of neurons
and the number that provides the best MSE is selected. The parameter \sigma, which
corresponds to the neuron width, is calculated by Eq.(7). Finally, the training process is
applied to determine the weights. This training is supervised, because the outputs are known
and the network is forced to produce the correct outputs. To carry it out, the input data
s1, s2, s3, ..., sn are connected to the hidden layer. Therefore, for each input si an output hi is
produced and the next equation system is generated:

h_1 = w_1 \exp\!\left(-\frac{\|s_1 - s_1\|^2}{2\sigma^2}\right) + w_2 \exp\!\left(-\frac{\|s_1 - s_2\|^2}{2\sigma^2}\right) + \cdots + w_m \exp\!\left(-\frac{\|s_1 - s_m\|^2}{2\sigma^2}\right)

h_2 = w_1 \exp\!\left(-\frac{\|s_2 - s_1\|^2}{2\sigma^2}\right) + w_2 \exp\!\left(-\frac{\|s_2 - s_2\|^2}{2\sigma^2}\right) + \cdots + w_m \exp\!\left(-\frac{\|s_2 - s_m\|^2}{2\sigma^2}\right)    (10)

\vdots

h_n = w_1 \exp\!\left(-\frac{\|s_n - s_1\|^2}{2\sigma^2}\right) + w_2 \exp\!\left(-\frac{\|s_n - s_2\|^2}{2\sigma^2}\right) + \cdots + w_m \exp\!\left(-\frac{\|s_n - s_m\|^2}{2\sigma^2}\right)

Denoting \Phi_{i,j} = \exp(-\|s_i - s_j\|^2 / 2\sigma^2), this system is written as

h_1 = w_1\Phi_{1,1} + w_2\Phi_{1,2} + \cdots + w_m\Phi_{1,m}

h_2 = w_1\Phi_{2,1} + w_2\Phi_{2,2} + \cdots + w_m\Phi_{2,m}    (11)

\vdots

h_n = w_1\Phi_{n,1} + w_2\Phi_{n,2} + \cdots + w_m\Phi_{n,m}

which can be rewritten in matrix form as the product H = \Phi W.    (12)
By solving the linear equation system Eq.(12), the weights wi are determined. This system
can be solved by the Cholesky method [21]. This method uses a lower and upper (LU)
decomposition to transform the matrix system. By using the lower and upper triangular
matrices, the linear system Eq.(12), \Phi W = H, is transformed into LUW = H. In this LU
decomposition, \Phi = LU is determined as

\begin{pmatrix} \Phi_{1,1} & \Phi_{1,2} & \cdots & \Phi_{1,m} \\ \Phi_{2,1} & \Phi_{2,2} & \cdots & \Phi_{2,m} \\ \vdots & & & \vdots \\ \Phi_{n,1} & \Phi_{n,2} & \cdots & \Phi_{n,m} \end{pmatrix} = \begin{pmatrix} L_{1,1} & 0 & \cdots & 0 \\ L_{2,1} & L_{2,2} & \cdots & 0 \\ \vdots & & & \vdots \\ L_{n,1} & L_{n,2} & \cdots & L_{n,m} \end{pmatrix} \begin{pmatrix} U_{1,1} & U_{1,2} & \cdots & U_{1,m} \\ 0 & U_{2,2} & \cdots & U_{2,m} \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & U_{n,m} \end{pmatrix} .    (13)

Then, the triangular systems are solved as LY = H and UW = Y. In this manner, Y is
calculated using L and H, and then W is obtained from Y and U. With this step the weight
values w1, w2, w3, ..., wm are determined and the RBFNN is complete. In this manner, for a
particular given si, an hi is calculated by the RBFNN.
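A minimal sketch of the RBFNN construction of Eqs.(6)-(12) is given below, assuming hypothetical calibration arrays of line displacements and known depths. The neuron width only approximates Eq.(7), and a general linear solver replaces the LU factorization described above.

    import numpy as np

    def train_rbfnn(s_train, h_train):
        """Fit the RBFNN so that h(s_i) reproduces the calibration depths h_i (Eqs. 6-12)."""
        s = np.asarray(s_train, dtype=float)
        h = np.asarray(h_train, dtype=float)
        # Neuron width from the spacing of neighbouring centres (an approximation of Eq. 7).
        gaps = np.diff(np.sort(s))
        sigma = np.sqrt(np.mean(gaps ** 2))
        # Design matrix Phi[i, j] = exp(-(s_i - s_j)^2 / (2 sigma^2)) of Eq. (11).
        phi = np.exp(-(s[:, None] - s[None, :]) ** 2 / (2.0 * sigma ** 2))
        w = np.linalg.solve(phi, h)                  # weights from H = Phi W, Eq. (12)
        return w, s, sigma

    def rbfnn_depth(s_query, w, centres, sigma):
        """Evaluate h(s) = sum_i w_i exp(-(s - s_i)^2 / (2 sigma^2)), Eq. (8)."""
        q = np.atleast_1d(np.asarray(s_query, dtype=float))
        phi = np.exp(-(q[:, None] - centres[None, :]) ** 2 / (2.0 * sigma ** 2))
        return phi @ w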
The GRNN was developed by D. Specht in 1991 for systems modeling and identification
[22]. The GRNN is an architecture that can solve approximation problems. The basic idea of
a GRNN is that a data set can be approximated by a continuous function. Assuming that a
function has inputs x1 to xn and one output y, the expected mean value of the output for a
particular input x can be found using the next equation:

\hat{y}(x) = \frac{\int y\, f(x, y)\, dy}{\int f(x, y)\, dy} .    (14)

In this equation, f(x, y) is the joint probability density function (PDF), i.e., the probability of
the output being y and the input being x. The PDF can be approximated by a sum of
Gaussians [22]. This is done by placing the centre of a Gaussian over each input datum and
multiplying the Gaussian by the corresponding output. Eq.(14) can then be approximated by

y = \frac{\sum_{i=1}^{n} w_i \exp(-d_i^2 / 2\sigma^2)}{\sum_{i=1}^{n} \exp(-d_i^2 / 2\sigma^2)} .    (15)

In this equation, di is the distance between the input and each training input, n is the number
of input/output pairs, \sigma is the width of the Gaussian function, and wi is the weight that
multiplies the corresponding Gaussian
function. The architecture used to construct the GRNN of Eq.(15) is shown in Figure 6. In
this structure, the output h is the object dimension and si is its corresponding laser line
displacement. The GRNN is formed by an input vector, a hidden layer, two summation nodes
and an output layer.
Figure 6. Structure of the GRNN: input vector, hidden layer of Gaussian neurons, two summation nodes and output layer.
The data of the input vector s1, s2, s3, ..., sn are the laser line displacements measured by
image processing. The hidden layer is constructed by neurons whose shape is a Gaussian
function centred at the input data. The effective range of the Gaussian function is determined
by the values allocated to its centre and width. The function used to obtain each neuron is

f_i = \exp\!\left(-\frac{\|s - s_i\|^2}{2\sigma^2}\right) ,    (16)

where \|s - s_i\| is the distance di used in Eq.(15), and si and \sigma are the centre and the
width of the Gaussian function, respectively. The two summation nodes contain the sum of
the neurons of the hidden layer, which are multiplied by a weight. The output layer is
determined by the division of the two summation nodes. In this manner, the network output
is defined by

h = \frac{\sum_{i=1}^{n} w_i\, f_i}{\sum_{i=1}^{n} f_i} .    (17)

To construct the GRNN via Eq.(17), the neuron width and the weights should be determined.
The width of the Gaussian neuron is the smoothing parameter \sigma, which is calculated via
Eq.(7). To determine the weights, supervised learning is again applied, because the outputs
are known and the network should be forced to produce the correct outputs. To carry it out, the
input values s1, s2, s3, ..., sn are connected to the hidden layer to produce the outputs
h1, h2, h3, ..., hn. Since the output of the GRNN is defined by the division of the two
summation nodes, for each input sj an output is generated and the following linear equation
system is obtained:

h_1 = \frac{1}{k_1}\left[ w_1 \exp\!\left(-\frac{\|s_1 - s_1\|^2}{2\sigma^2}\right) + w_2 \exp\!\left(-\frac{\|s_1 - s_2\|^2}{2\sigma^2}\right) + \cdots + w_n \exp\!\left(-\frac{\|s_1 - s_n\|^2}{2\sigma^2}\right) \right]

h_2 = \frac{1}{k_2}\left[ w_1 \exp\!\left(-\frac{\|s_2 - s_1\|^2}{2\sigma^2}\right) + w_2 \exp\!\left(-\frac{\|s_2 - s_2\|^2}{2\sigma^2}\right) + \cdots + w_n \exp\!\left(-\frac{\|s_2 - s_n\|^2}{2\sigma^2}\right) \right]    (18)

\vdots

h_n = \frac{1}{k_n}\left[ w_1 \exp\!\left(-\frac{\|s_n - s_1\|^2}{2\sigma^2}\right) + w_2 \exp\!\left(-\frac{\|s_n - s_2\|^2}{2\sigma^2}\right) + \cdots + w_n \exp\!\left(-\frac{\|s_n - s_n\|^2}{2\sigma^2}\right) \right]

where

k_j = \exp\!\left(-\frac{\|s_j - s_1\|^2}{2\sigma^2}\right) + \exp\!\left(-\frac{\|s_j - s_2\|^2}{2\sigma^2}\right) + \cdots + \exp\!\left(-\frac{\|s_j - s_n\|^2}{2\sigma^2}\right), \quad j = 1, \ldots, n.

Denoting f_{j,i} = \exp(-\|s_j - s_i\|^2 / 2\sigma^2) / k_j, the system becomes

h_1 = w_1 f_{1,1} + w_2 f_{1,2} + \cdots + w_n f_{1,n}

h_2 = w_1 f_{2,1} + w_2 f_{2,2} + \cdots + w_n f_{2,n}    (19)

\vdots

h_n = w_1 f_{n,1} + w_2 f_{n,2} + \cdots + w_n f_{n,n}

This equation system can be rewritten in matrix form as the product H = FW.    (20)
By solving the linear system Eq.(20), the weights wi are obtained. This system is solved by
the Cholesky method. With this step the weights w1, w2, w3, ..., wn are calculated and the
GRNN is constructed. The result of this GRNN is a continuous function, which describes the
object dimension according to the laser line displacement. In this manner, for a particular
given si, an hi is calculated by the GRNN. Based on the MSE, the neuron that provides the
best approximation with respect to the desired output is the Gaussian function. Therefore, the
Gaussian function of Eq.(16) is used to construct the hidden layer of the GRNN. The number
of neurons used in the hidden layer of the GRNN is again determined by the MSE criterion.
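The GRNN of Eqs.(15)-(20) can be sketched in the same way, again assuming hypothetical calibration arrays; the row-normalised kernel matrix plays the role of F, and a general linear solver replaces the factorization described above.

    import numpy as np

    def train_grnn(s_train, h_train, sigma):
        """Solve H = F W (Eqs. 18-20), where F[j, i] is the Gaussian of Eq. (16)
        divided by the row sum k_j."""
        s = np.asarray(s_train, dtype=float)
        g = np.exp(-(s[:, None] - s[None, :]) ** 2 / (2.0 * sigma ** 2))
        f = g / g.sum(axis=1, keepdims=True)         # row-normalised kernel matrix F
        return np.linalg.solve(f, np.asarray(h_train, dtype=float))

    def grnn_depth(s_query, s_train, w, sigma):
        """Evaluate h(s) = sum_i w_i f_i / sum_i f_i, Eq. (17)."""
        q = np.atleast_1d(np.asarray(s_query, dtype=float))
        g = np.exp(-(q[:, None] - np.asarray(s_train)[None, :]) ** 2 / (2.0 * sigma ** 2))
        return (g @ w) / g.sum(axis=1)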
The Bezier basis functions have been used to represent curves and surfaces [23-26]. This is
the main motivation for proposing a network of Bezier basis functions. The structure
proposed for this Bezier neural network (BNN) consists of an input vector, a parametric
input, a hidden layer and an output layer, as shown in Figure 7. The input includes the object
dimensions hi and their corresponding laser line displacements si. The parametric input u is a
value proportional to the displacement si. This value u is computed by a weighted linear
input combination (WLIC) [27]. The hidden layer is constructed by Bezier basis functions.
The output layer is formed by the summation of the neurons of the hidden layer, which are
multiplied by a weight. Each layer of the network is performed as follows. The input data
s0, s1, s2, ..., sn are line displacements computed by the image processing described in
section 2. By means of these displacements, the parametric value u is determined via the
WLIC.
Figure 7. Structure of the Bezier neural network: displacement si mapped to the parametric input u by the WLIC, inputs hi, Bezier basis neurons B0, ..., Bn, weights w0, ..., wn and output h.
The relationship between the displacement and the parametric value u is described by

u = a_0 + a_1 s ,    (21)

where the ai are unknown constants and s is the stripe displacement s(x, y). Using the
displacement s and its respective u, Eq.(21) can be determined. The Bezier curves are defined
in the interval 0 \le u \le 1 [28]. Therefore, u = 0 for the first displacement s0 and u = 1 for
the last displacement sn. Substituting the values (s0, u = 0) and (sn, u = 1) into Eq.(21), two
equations with two unknowns are obtained. Solving these equations, a0 and a1 are determined
and the WLIC Eq.(21) is completed. Thus, for a particular displacement si, a value u is
computed. The input data h0, h1, h2, ..., hn are obtained by measuring pattern objects with a
contact method. The neurons of the hidden layer are defined by the Bezier basis functions

B_i(u) = \binom{n}{i} u^i (1-u)^{n-i} , \quad 0 \le u \le 1 ,    (22)

where

\binom{n}{i} = \frac{n!}{i!(n-i)!}

denotes the binomial coefficient from probability and statistics [29]. The output layer is
obtained by the summation of the weighted neurons. The output response is the object
dimension h(u), given by
h(u) = \sum_{i=0}^{n} w_i\, B_i(u)\, h_i , \quad 0 \le u \le 1 ,    (23)
where the wi are the weights, hi is the known dimension of the object and Bi is the Bezier
basis function of Eq.(22). To compute the weights w0, w1, w2, ..., wn, the network is forced
to produce the correct outputs h0, h1, h2, ..., hn. To carry it out, an adjustment mechanism is
applied by substituting the values ui and hi into Eq.(23). For each input ui, an output hi is
obtained. In this manner, the following equation system is obtained, with
\beta_{j,i} = B_i(u_j)\, h_i:

h_0 = w_0\beta_{0,0} + w_1\beta_{0,1} + \cdots + w_n\beta_{0,n}

h_1 = w_0\beta_{1,0} + w_1\beta_{1,1} + \cdots + w_n\beta_{1,n}    (25)

\vdots

h_n = w_0\beta_{n,0} + w_1\beta_{n,1} + \cdots + w_n\beta_{n,n}

This equation system can be rewritten in matrix form as the product between the matrix of
the input data and the vector of weights, \beta W = H, which corresponds to the linear system
Eq.(26). By solving the linear system Eq.(26), the weights wi are obtained. This system is
solved by the Cholesky method. With this step the weights w0, w1, w2, ..., wn are calculated
and the network is complete. The result of this network is a continuous function, which
provides the object dimension according to the laser line displacement. Thus, for a given si,
an hi is computed by the network Eq.(23).
Figure 8(a) shows the surface of a dummy bell. Figure 8(b) shows the surface extracted by
means of the laser line and the network. In this procedure, the measurements of the setup
parameters are not passed to the vision system. Thus, the computational process provides
better performance and the accuracy is improved.
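A minimal sketch of the Bezier network of Eqs.(21)-(26) follows, assuming hypothetical calibration arrays in which the displacements are ordered and the pattern depths are non-zero; a general linear solver replaces the factorization described above.

    import numpy as np
    from math import comb

    def bernstein_row(u, n):
        """Bezier basis values B_i(u) = C(n, i) u^i (1 - u)^(n - i) of Eq. (22)."""
        return np.array([comb(n, i) * u**i * (1 - u)**(n - i) for i in range(n + 1)])

    def train_bnn(s_train, h_train):
        """Fit the Bezier network h(u) = sum_i w_i B_i(u) h_i (Eq. 23) to calibration data."""
        s = np.asarray(s_train, dtype=float)
        h = np.asarray(h_train, dtype=float)
        n = len(s) - 1
        a1 = 1.0 / (s[-1] - s[0])                    # WLIC of Eq. (21): u = a0 + a1 s
        a0 = -a1 * s[0]
        u = a0 + a1 * s
        beta = np.array([bernstein_row(t, n) for t in u]) * h[None, :]   # beta[j, i] = B_i(u_j) h_i
        w = np.linalg.solve(beta, h)                 # weights from beta W = H, Eq. (26)
        return w, (a0, a1)

    def bnn_depth(s_query, w, wlic, h_train):
        """Depth for a measured displacement: map s to u by Eq. (21), then apply Eq. (23)."""
        a0, a1 = wlic
        h = np.asarray(h_train, dtype=float)
        n = len(h) - 1
        u = np.atleast_1d(a0 + a1 * np.asarray(s_query, dtype=float))
        return np.array([bernstein_row(t, n) @ (w * h) for t in u])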
To model the mobile setup, the Bezier network is extended to a two-dimensional network.
The structure of this network is shown in Figure 9. The input includes the object dimensions
hi and the parametric values u and v. The value v is directly proportional to the laser line
position xA at each camera position in the x-axis, which is shown in Figure 10. The value u
corresponds to the laser line displacement in the x-axis. The hidden layer is constructed by
neurons whose shape is represented by the Bezier basis functions. The output layer is formed
by the summation of the neurons of the hidden layer, which are multiplied by a weight. Each
layer of the network is performed as follows. The input data h0, h1, h2, ..., hn are the known
dimensions of the pattern objects. The input data s00, s01, s02, ..., s0n, s10, s11, s12, ..., s1n, ...,
sm0, sm1, sm2, ..., smn are the laser line displacements, which are computed by the image
processing described in section 2. The n-index corresponds to the displacement for each
pattern dimension. The m-index corresponds to the displacement at each camera position xA
in the x-axis. The displacements at the different camera positions are shown in Figure 10.
Figure 9. Structure of the two-dimensional Bezier network: displacement s and camera position xA mapped to the parametric inputs u and v by linear combinations (LC), known dimensions hi, Bezier basis neurons Bij, weights wij and output h.
Using the position xA, the parametric value v is determined by a linear combination (LC).
The relationship between the position xA of the camera and the parametric value v is
represented by

v = a_0 + a_1 x_A ,    (27)

where the ai are unknown constants and xA is the laser line position at each camera position.
Using xA of the camera position and its respective v, Eq.(27) can be determined. The Bezier
curves are defined in the interval 0 \le v \le 1. Therefore, v = 0 for the first xA and v = 1 for
the last camera position xA. Substituting these two xA positions and their corresponding
values v into Eq.(27), two equations with two unknowns are obtained. Solving these
equations, a0 and a1 are determined and Eq.(27) is completed. In this manner, for each xA of
the camera position, a parametric value v is computed via Eq.(27). In the same manner, the
displacement s(x, y) is represented by a parametric value u. The relationship between the
displacement and the parametric value u is represented by

u = b_0 + b_1 s ,    (28)

where the bi are unknown constants and s is the laser line displacement. Based on the
displacement s and its respective u, Eq.(28) can be determined. The Bezier curves are defined
in the interval 0 \le u \le 1. Thus, u = 0 for the first displacement s00 and u = 1 for the last
displacement smn. Substituting these values and their corresponding values u into Eq.(28),
two equations with two unknowns are obtained. Solving these equations, b0 and b1 are
determined and Eq.(28) is completed. In this manner, a parametric value u is computed via
Eq.(28) for each displacement sij.
Figure 10. Geometry of the mobile setup: laser line displacements sij and positions xA and xc for each camera position, object dimensions h0, h1, h2, ..., hn, optical axis, and points A and B on the reference plane (x and z axes).
The hidden layer is built by means of the Bezier basis functions. The expressions for a
two-dimensional Bezier basis are

B_i(u) = \binom{n}{i} u^i (1-u)^{n-i} , \qquad B_j(v) = \binom{m}{j} v^j (1-v)^{m-j} ,    (29)

where

\binom{n}{i} = \frac{n!}{i!(n-i)!} , \qquad \binom{m}{j} = \frac{m!}{j!(m-j)!} .

The output layer is generated by the summation of the neurons, which are multiplied by a
weight. The output response is the object dimension H(u, v), which is given by

H(u, v) = \sum_{i=0}^{n}\sum_{j=0}^{m} w_{ij}\, h_i\, B_i(u)\, B_j(v) , \quad 0 \le u \le 1, \quad 0 \le v \le 1 ,    (30)

where the wij are the weights, hi is the object dimension, and Bi(u) and Bj(v) are the Bezier
basis functions of Eq.(29). To construct the complete network Eq.(30), the suitable weights
wij should be determined. To carry it out, the network is forced to produce the correct object
dimensions hi. To do this, for each hi, the xA of the camera position and the displacement sij
are converted to the values v and u via Eq.(27) and Eq.(28), respectively. Then, the object
dimension hi and its corresponding pair (u, v) are substituted into Eq.(30). Thus, for each pair
(u, v) an output Hij(u, v) is generated and the next equation system is obtained
H_{00} = w_{00}\beta_{00,00} + w_{01}\beta_{00,01} + \cdots + w_{mn}\beta_{00,mn}

\vdots

H_{mn} = w_{00}\beta_{mn,00} + w_{01}\beta_{mn,01} + \cdots + w_{mn}\beta_{mn,mn}

where the coefficients \beta are the products h_i B_i(u) B_j(v) evaluated at each calibration
pair (u, v). This equation system can be rewritten in matrix form as \beta W = H, which
corresponds to the linear system Eq.(33).
By solving the linear system Eq.(33), the weights wij are obtained. This system is solved by
the Cholesky method. In this manner, the network H(u, v) of Eq.(30) is completed. The result
of this network is a two-dimensional model, which computes the object dimension. Thus, for
a particular displacement sij and its respective camera position xA, an hij is computed by the
network Eq.(30). The displacement sij and the xA of the camera position are detected by the
image processing described in section 2.
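Once the weights of Eq.(33) are available, evaluating the two-dimensional network for a measured displacement and camera position reduces to Eqs.(27)-(30). The sketch below assumes that the weight matrix, the pattern depths and the two linear mappings have already been computed; the function names are illustrative.

    import numpy as np
    from math import comb

    def bernstein_row(t, n):
        """Bezier basis values of Eq. (29)."""
        return np.array([comb(n, i) * t**i * (1 - t)**(n - i) for i in range(n + 1)])

    def depth_2d(s, x_a, w, h, u_map, v_map):
        """Evaluate H(u, v) = sum_ij w_ij h_i B_i(u) B_j(v), Eq. (30), for one measurement.
        w is the (n+1) x (m+1) weight matrix from Eq. (33), h holds the pattern depths,
        and u_map = (b0, b1), v_map = (a0, a1) are the mappings of Eqs. (28) and (27)."""
        b0, b1 = u_map
        a0, a1 = v_map
        u = b0 + b1 * s                  # displacement -> u, Eq. (28)
        v = a0 + a1 * x_a                # camera position -> v, Eq. (27)
        n, m = w.shape[0] - 1, w.shape[1] - 1
        bu = bernstein_row(u, n) * np.asarray(h, dtype=float)   # h_i B_i(u)
        return bu @ w @ bernstein_row(v, m)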
k_i = \frac{h_i\,(s_{ij} + x_c - x_A)}{F}    (34)
From Eq.(34), F, xc and xA are constants and hi is computed by the network according to sij.
In this case, ki is a linear function of sij and the derivative of ki with respect to sij, dk/ds, is a
constant.
Figure 11(a). Geometry of the imaging system: image plane with centre (xc, yc), lens of focal length F, optical axis, laser line, positions xA and xB with displacement sij, distance a, and point p(x, y) on the object over the reference plane (x, y and z axes).
Figure 11(b). Geometry of an optical axis perpendicular to the reference plane x-axis: depth zi, object heights h1, h2, h3, ..., hn, points A and B, and projections k1, k2, k3, ..., kn along the x-axis.
Another camera orientation is an optical axis not perpendicular to the reference plane, which
is shown in Figure 11(c). For this geometry, the distance between the image centre and the
laser line position is given by x_c - x_B = F\tan\theta. Thus, the laser line displacement is
s_{ij} = F\tan\theta - (x_c - x_A). The angles of this geometry satisfy
\alpha + \beta + \theta = 180°, \beta + \gamma = 90° and k_i = h_i\tan\gamma. Therefore, the
projection ki is given by

k_i = h_i \tan\!\left[\tan^{-1}\!\left(\frac{s_{ij} + x_c - x_A}{F}\right) + \alpha - 90°\right] .    (35)

From Eq.(35), F, xc, xA and \alpha are constants, and hi is computed as h_i = D - a\tan\psi,
where \psi is an angle smaller than 90° and D is a constant. In this case, sij in Eq.(35) does
not produce a linear function ki, and the derivative dk/ds is not a constant.
Figure 11(c). Geometry of an optical axis not perpendicular to the reference plane x-axis: image positions xA, xB and xc, displacement sij, focal length F, distance a, depth zi, object heights h1, h2, h3, ..., hn, points A and B, and projections k1, k2, k3, ..., kn along the x-axis.
The orientation of the camera in the y-axis is determined from the geometry of Figure 11(d).
In this case, the object is moved along the y-axis over the laser line. When the object is
moved, the object pattern changes from yp to yi on the laser line, and
t_i = (y_c - y_p) - (y_c - y_i). For an optical axis perpendicular to the reference plane y-axis,
a linear qi produces a linear ti at the image plane. Therefore, the derivative dt/dq is a
constant. Thus, the camera orientation is verified by means of dk/ds = constant for the x-axis
and dt/dq = constant for the y-axis. Based on these criteria, the optical axis of the camera is
aligned perpendicular to the x-axis and the y-axis. For the orientation in the x-axis, ki is
computed from the hi provided by the network. Due to the distortion, the derivative dk/ds is
slightly different from a constant, but it is the closest to a constant, as shown in Figure 11(e).
In this figure, the dashed line is dk/ds for \alpha smaller than 90° and the dotted line is dk/ds
for \alpha greater than 90°. Thus, the generated network corresponds to an optical axis
aligned perpendicularly along the x-axis. For the orientation in the y-axis, qi is provided by
the electromechanical device, and the object position yi and ti are obtained by image
processing at each movement. Due to the distortion, the derivative dt/dq is not exactly a
constant, but it is the closest to a constant. Thus, the optical axis is aligned perpendicular to
the y-axis. In this manner, the network and the image processing provide an optical axis
aligned perpendicularly to the reference plane.
Figure 11(d). Geometry of the camera orientation in the y-axis: image positions yp, yi and yc, displacement ti, focal length F, distance b, depth zi, laser line, optical axis, and object positions q1, q2, q3, ..., qn.
Figure 11(e). Derivative dk/ds for an optical axis perpendicular and not perpendicular to the x-axis, plotted against the displacement si (pixels).
Based on the optical axis perpendicular to the reference plane, Figure 11(b), the camera
parameters are deduced. To carry it out, the network produces the object depth hi according
to the displacement sij and the camera position xA. The geometry of the setup in Figure 11(b)
is described by

\frac{z_i}{a} = \frac{z_i + F}{a + \eta\,(x_c - x_B)} ,    (36)

where \eta is the scale factor that converts pixels to millimetres. Using D = z_i + h_i and
s_{ij} = (x_c - x_B) - (x_c - x_A), Eq.(36) is rewritten as

\frac{D - h_i}{a} = \frac{D - h_i + F}{a + \eta\,(s_{ij} + x_c - x_A)} ,    (37)
where D is the distance from the lens to the reference plane. From Eq.(37), the constants
D, a, F, \eta and xc should be determined. To carry it out, Eq.(37) is rewritten as the equation
system

h_0 = D - \frac{F\,a}{\eta\,(s_{00} + x_c - x_A)}

h_1 = D - \frac{F\,a}{\eta\,(s_{01} + x_c - x_A)}

h_2 = D - \frac{F\,a}{\eta\,(s_{02} + x_c - x_A)}    (38)

h_3 = D - \frac{F\,a}{\eta\,(s_{03} + x_c - x_A)}

h_4 = D - \frac{F\,a}{\eta\,(s_{04} + x_c - x_A)}

h_5 = D - \frac{F\,a}{\eta\,(s_{05} + x_c - x_A)}
The values h0, h1, ..., h5 are computed by the network according to s00, s01, ..., s05 at any
camera position. These values are substituted into Eq.(38) and the equation system is solved.
Thus, the constants D, a, F, \eta and xc are determined for the initial camera position. The
coordinate yc is computed from the geometry of Figure 11(d), described by

t_i - (y_c - y_p) = \frac{F\,(D - h_i)}{\eta\,(b + q_{i-1})} ,    (39)
From Eq.(39), the constants D, F, \eta, qi, ti and hi are known, and yc, b and yp should be
determined. To carry it out, Eq.(39) is rewritten as an equation system for a constant hi:

t_1 - (y_c - y_p) = \frac{F\,(D - h_1)}{\eta\,(b + q_0)}

t_2 - (y_c - y_p) = \frac{F\,(D - h_1)}{\eta\,(b + q_1)}    (40)

t_3 - (y_c - y_p) = \frac{F\,(D - h_1)}{\eta\,(b + q_2)}
The values t1, t2 and t3 are taken from the orientation in the y-axis, q0 = 0, and the values q1
and q2 are provided by the electromechanical device. These values are substituted into
Eq.(40) to solve the equation system. Thus, the constants yc, yp and b are determined. In this
manner, the camera parameters and the setup geometry are determined based on the network
and image processing. The distortion is observed by means of the laser line position xB in the
image plane, which is described by

x_B = x_c - \frac{F\,a}{\eta\,(D - h_i)} .    (41)
According to Eq.(41), the position xB follows a well-defined function of hi. However, due to
the lens distortion, the real data xB deviate from this function. The network is constructed by
means of the real data, using the displacement sij = (xc - xB) - (xc - xA) at each camera
position. Therefore, the network produces the non-linear data hi shown in Figure 11(f). Thus,
the distortion is included in the network, which computes the object dimension in the
imaging system. This network computes a transverse section of the object from each laser
line. By means of the information of all the laser line images, the complete object is
constructed.
Figure 11(f). Object dimension produced by the Bezier network at the first camera position: computed object dimension and real object dimension.
The accuracy of the computer vision system is described by the root mean square (rms) error
between the data produced by the network and the data measured by a contact method,

rms = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(ho_i - hc_i)^2} ,    (42)
where hoi is the datum measured by the coordinate measuring machine (CMM), hci is the
value computed by the network and n is the number of data. For the glass, n was 480 data
points, measured by the CMM and computed by the two-dimensional network. The error
calculated for the glass shape is rms = 0.162 mm. Based on the average dimension, the rms
value can be represented in percentage terms. The average of the glass dimensions is
22.00 mm. Therefore, the relative error is 0.0073 based on the rms, which represents a
0.73 % error.
Figure 12. Three-dimensional shape of the glass: x-axis (mm) versus object length (y-axis, mm).
To determine whether the value of n is adequate for the desired precision of the error, the
confidence level is calculated [31]. The value of n required for the desired confidence is
described by

n = \left(\frac{z\,\sigma_x}{e}\right)^2 ,    (43)
which can be rewritten as

z = \frac{e\,\sqrt{n}}{\sigma_x} .    (44)

To know whether the value of n is adequate for the desired confidence level, Eq.(43) is
applied. The desired confidence level is 95 %, which corresponds to z = 1.96 according to
the confidence table [32]. The error is e = 0.73 % and the standard deviation is
\sigma_x = 7.62. Substituting these values into Eq.(43), the result is n = 418.57. This value
indicates that the chosen n is adequate to produce a confidence level of 95 %. Therefore,
values z > 1.96 provide a confidence level greater than 95 %. The confidence level for
n = 480 corresponds to z = 2.09, which indicates a confidence level over 95 %.
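The rms of Eq.(42) and the sample-size check of Eq.(43) can be reproduced with a few lines; the values in the final comment are the figures quoted above.

    import numpy as np

    def rms_error(h_cmm, h_net):
        """Root mean square error between CMM data and network output, Eq. (42)."""
        d = np.asarray(h_cmm, dtype=float) - np.asarray(h_net, dtype=float)
        return np.sqrt(np.mean(d ** 2))

    def required_samples(z, sigma_x, e):
        """Sample size n = (z * sigma_x / e)^2 for the desired confidence, Eq. (43)."""
        return (z * sigma_x / e) ** 2

    # With the figures quoted above, z = 1.96, sigma_x = 7.62 and e = 0.73 give
    # required_samples(1.96, 7.62, 0.73) of about 419, so the 480 measured points suffice.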
The second object to be profiled is a dummy face, shown in Figure 13(a) and Figure 13(b).
In the initial sequence, a laser line occlusion appears due to the surface variation, as shown in
Figure 13(b). The camera is moved toward the laser projector to obtain Figure 13(c), which
provides the occluded region of the laser line. Thus, the contouring can be performed
completely. To carry it out, the dummy face is scanned along the x-axis in steps of 1.27 mm.
From each image produced in the scanning, the camera position xA and the laser line
displacement sij are computed via image processing. Then, these values are converted to the
parametric values v and u via Eq.(27) and Eq.(28), respectively. Based on u and v, the object
dimension is computed by means of the network Eq.(30). All data produced by the network
are stored in an array memory to construct the complete object shape. In this case, sixty-eight
images of the mobile setup are processed to determine the complete object shape shown in
Figure 13(d). The scale of this figure is in mm. To determine the accuracy of the data
provided by the modelling of the mobile setup, the dummy face is measured by the CMM.
The rms is computed using n = 1122 data points, which were provided by the network and
by the CMM as reference. The rms calculated for this object is 0.155 mm. The precision of
this error produces a confidence level over 95 %. The result of the contouring using a static
setup is an incomplete object shape, because some regions are not profiled due to the
occlusions.
Figure 13(a). Image of a dummy face. Figure 13(b). Dummy face with laser line occlusion.
Figure 13(c). Laser line without occlusions. Figure 13(d). Three-dimensional shape of the dummy face.
The third object to be profiled is a plastic fruit, shown in Figure 14(a). The contouring is
performed by scanning the fruit in steps of 1.27 mm. In this case, forty laser lines were
processed to determine the complete object shape shown in Figure 14(b). To determine the
accuracy of the model provided by the network, the plastic fruit was measured by the CMM.
The rms was computed using n = 420 data points, which were provided by the
two-dimensional network and by the CMM as reference. The rms calculated for this object is
0.142 mm. The precision of this error produces a confidence level over 95 %.
Figure 14(a). Image of the plastic fruit without stripe occlusion. Figure 14(b). Three-dimensional shape of the plastic fruit.
Figure 15(a). Object with laser line occlusion. Figure 15(c). Laser line without occlusions.
The computer employed in this process is a 1 GHz PC. Each light line image is processed in
0.022 s. The capture rate of the camera used in this setup is 34 fps. Also, the
electromechanical device is programmed to move the object at 34 steps per second. In this
computer vision procedure, parameters such as the projection angle and the distances of the
geometry are not used to determine the shape representation. Therefore, the automatic
procedure is easier to apply than techniques that rely on the distances between the
components of the optical setup. In this manner, the computer vision is performed by a
computational process and measurements on the optical setup are avoided. Therefore, a good
repeatability is achieved in each computer vision process.
CONCLUSION
A review of the shape detection technique based on a laser line and approximation networks
has been presented. The technique described here provides a valuable tool for industrial
inspection and reverse engineering. The automatic technique avoids the physical
measurements of the setup that are common in laser line projection methods. In this
technique, the parameters of the setup are obtained automatically by a computational process
via image processing and Bezier networks. This procedure improves the accuracy of the
results, because measurement errors are not passed to the contouring system. In this
technique, the ability to detect the laser line with sub-pixel resolution has been achieved by
Bezier curves, and this step is accomplished with few operations. By using this
computational-optical setup, good repeatability is achieved in every measurement. Therefore,
the technique performs well.
ACKNOWLEDGMENTS
J. Apolinar Muñoz Rodríguez would like to thank CONCYTEG of Guanajuato and
CONACYT of Mexico for the partial support of this research.
REFERENCES
[1] W. R. Uttal, R. Kakarala, S. Dayanand, T. Shepherd, J. Kalki, C. F. Lunskis and N. Liu, (1999), Computational Modeling of Vision: The Role of Combination, Marcel Dekker, U.S.A.
[2] G. Sansoni, R. Rodella, M. Carocci and V. Carbone, (2000), "Machine vision: Optical digitization of free-form complex surfaces using the projection of structured light," Optics and Photonics News, 23-29.
[3] C. Lulie, Q. Chenggen, T. Cho-Jui and F. Yu, (2005), "Shape measurement using one frame projected sawtooth fringe pattern," Opt. Comm., Vol. 246, Issue 4-6, p. 275-284.
[4] J. A. Muñoz Rodríguez and R. Rodríguez-Vera, (2003), "Evaluation of the light line displacement location for object shape detection," Journal of Modern Optics, Vol. 50, No. 1, p. 137-154.
[5] R. Klette, K. Schluns and A. Koschan, (1998), Computer Vision: Three-Dimensional Data from Images, Springer-Verlag, Singapore.
[6] W. Guanghui, H. Zhanyi, W. Fuchao and T. Hung-Tat, (2004), "Implementation and experimental study on fast object modeling based on multiple structured stripes," Optics and Laser in Eng., Vol. 42, No. 6, p. 627-638.
[7] F. Remondino, A. Guarnieri and A. Vettore, (2004), "3D modeling of close-range objects: photogrammetry or laser scanning?," SPIE Proc. Vol. 5665, p. 216-225.
[8] Z. Wei, G. Zhang and Y. Xu, (2003), "Calibration approach for structured-light-stripe vision sensor based on invariance of double cross-ratio," Opt. Eng., Vol. 42, No. 10, p. 2956-2966.
[9] M. McIvor, (2002), "Nonlinear calibration of a laser profiler," Opt. Eng., Vol. 42, No. 1, p. 205-212.
[10] L. Zagorchev and A. Goshtasby, (2006), "A paintbrush laser range scanner," Computer Vision and Image Understanding, Vol. 10, p. 65-86.
[11] F. Causa and J. Sarma, (2003), "Realistic model for the output beam profile of stripe and tapered superluminescent light-emitting diodes," Appl. Optics, Vol. 42, No. 21, p. 4341-4348.
[12] J. S. Milton and J. C. Arnold, (2003), Introduction to Probability and Statistics: Principles and Applications for Engineering and the Computing Sciences, McGraw-Hill, U.S.A.
[13] M. E. Mortenson, (1997), Geometric Modeling, Wiley, Second edition, U.S.A.
[14] C. F. Gerald and P. O. Wheatley, (1992), Applied Numerical Analysis, Addison Wesley, Fifth edition, U.S.A.
[15] M. Neifeld and D. Psaltis, (1993), "Optical implementations of radial basis classifiers," Appl. Optics, Vol. 32, No. 8, 1370-1379.
[16] E. W. Foor and A. M. Neifeld, (1995), "Adaptive optical radial basis function neural network for handwritten digit recognition," Appl. Optics, Vol. 34, No. 32, 7545-7555.
[17] T. Sabisch, A. Ferguson and H. Bolouri, (2000), "Identification of complex shapes using a self organizing neural system," IEEE Transactions on Neural Networks, Vol. 11, Issue 4, 921-934.
[18] Rojas, H. Pomares, J. Gonzalez, J. L. Bernier, E. Ros, F. J. Pelayo and A. Prieto, (2000), "Analysis of the functional block involved in the design of Radial Basis Functions," Neural Processing Letters, Vol. 12, No. 1, 1-17.
[19] J. A. Muñoz Rodríguez, R. Rodríguez-Vera and A. Asundi, (2003), "Shape detection of moving objects based on a neural network of a light line," Optics Communications, Vol. 221, 73-86.
[20] P. Picton, (1998), Neural Networks, Palgrave, U.S.A.
[21] W. H. Press, B. P. Flannery, S. A. Teukolsky and W. T. Vetterling, (1993), Numerical Recipes in C, Cambridge Press, U.S.A.
[22] D. F. Specht, (1991), "A general regression neural network," IEEE Trans. Neural Networks, Vol. 2, 568-576.
[23] L. Shao and H. Zhou, (1996), "Curve fitting with Bezier cubics," Graphical Models and Image Processing, Vol. 58, No. 3, 223-232.
[24] Zhang, (1999), "C-Bezier curves and surfaces," Graphical Models and Image Processing, Vol. 61, 2-15.
[25] Mohamed, N. Horoshi and A. Takeshi, (1993), "Bezier and polynomial methods of making curves," SPIE Vol. 2644, 290-295.
[26] A. Muñoz Rodríguez, Miguel Rosales Ciseña and R. Rodriguez-Vera, (2005), "3D shape detection based on a Bezier neural network of a light line," SPIE Proc. Vol. 5776, 630-639.
[27] Robert J. Schalkoff, (1997), Artificial Neural Networks, McGraw-Hill, U.S.A.
[28] E. Mortenson, (1997), Geometric Modeling, Wiley, Second edition, U.S.A.
[29] J. Zhang, (1999), "C-Bezier curves and surfaces," Graphical Models and Image Processing, Vol. 61, p. 2-15.
[30] T. Masters, (1993), Practical Neural Networks Recipes in C++, Academic Press, U.S.A.
[31] J. E. Freund, (1979), Modern Elementary Statistics, Prentice Hall, U.S.A.