Edge Alignment-Based Visual-Inertial Fusion for Tracking of Aggressive Motions

Yonggen Ling · Manohar Kuse · Shaojie Shen
Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, SAR China.
E-mail: [email protected], [email protected], [email protected]

Abstract We propose a novel edge-based visual-inertial fusion approach to address the problem of tracking aggressive motions with real-time state estimates. At the front-end, our system performs edge alignment, which estimates the relative poses in the distance transform domain with a larger convergence basin and stronger resistance to changing lighting conditions or camera exposures compared to the popular direct dense tracking. At the back-end, a sliding-window optimization-based framework is applied to fuse visual and inertial measurements. We utilize efficient inertial measurement unit (IMU) preintegration and two-way marginalization to generate accurate and smooth estimates with limited computational resources. To increase the robustness of our proposed system, we perform an edge alignment self check and an IMU-aided external check. Extensive statistical analysis and comparisons are presented to verify the performance of our proposed approach and its usability on resource-constrained platforms. Compared to state-of-the-art point feature-based visual-inertial fusion methods, our approach achieves better robustness under extreme motions or low frame rates, at the expense of slightly lower accuracy in general scenarios. We release our implementation as open-source ROS packages.

Keywords Visual-inertial fusion · Edge alignment · Tracking of aggressive motions · Visual-inertial odometry

1 Introduction

Real-time, robust, and accurate state estimation is the foremost important component for many autonomous robotics applications. In particular, reliable tracking of fast and aggressive motions is essential for popular applications that involve highly dynamic mobile platforms/devices, such as aerial robotics and augmented reality.

As demonstrated in the literature (Baker and Matthews (2004)), cameras are ideal sensors for tracking slow to moderate motions using feature-based methods under constant lighting conditions and camera exposures. However, large image displacement caused by fast motions can seriously downgrade feature tracking performance. Recent advances in direct dense tracking have shown good adaptability to fast motions (Newcombe et al (2011); Engel et al (2014); Ling and Shen (2015)). These methods operate on image intensities, rather than on sparse features, to minimize a photometric cost function and make full use of all the available information within an image. Thus they essentially bypass the feature processing pipeline and eliminate some of the issues found with feature-based methods. To ensure high image quality under different lighting conditions, camera auto-exposure is usually employed, since the same physical location in space may be imaged with different intensities across frames. The photo-consistency assumption behind direct dense tracking is therefore easily violated when the lighting conditions of the environment change. In contrast to direct methods, our earlier work on edge alignment (Kuse and Shen (2016)) uses the distance transform in the energy formulation and elegantly addresses the lack of photo-consistency. Nevertheless, it fails if the captured images undergo severe blurring. In contrast to cameras, IMUs generate noisy but outlier-free measurements, making them very effective for short-term tracking even under fast motions. On the flip side, low-cost IMUs suffer significant drift in the long run. We believe that combining the complementary nature of edge alignment and IMU measurements opens up the possibility of reliable tracking of aggressive motions.
…motions that lead to large image movement and severe motion blur.

2 Related Work

With recent advances in high-performance mobile computing, direct dense methods have become popular (Kerl et al (2013); Newcombe et al (2011); Engel et al (2013, 2014)). Kerl et al (2013) propose a probabilistic formulation of direct dense tracking based on the Student's t-distribution, which alleviates the influence of outliers and leads to robust estimation. Newcombe et al (2011) present a system for real-time camera tracking and reconstruction using commodity GPU hardware. Recent work on direct dense tracking (Engel et al (2013, 2014)) models the uncertainty of the inverse depth of pixels and achieves impressive performance in large-scale environments. However, these methods rely on the photo-consistency assumption, under which motion estimation is performed by following local gradient directions to minimize the total intensity error. Direct dense methods have a small basin of attraction (as noted in Kerl et al (2013)) and are sensitive to changing lighting conditions.

Another way to estimate the states using visual measurements is an iterative closest point (ICP) based method, which directly aligns three-dimensional point clouds. Stückler and Behnke (2012) employ a method based on ICP for the alignment of point clouds obtained from an RGB-D camera, while Rusinkiewicz and Levoy (2001) present a survey of other attempts to use efficient ICP-like methods for pose estimation. A generalized formulation of ICP is proposed in Segal et al (2005), in which ICP and point-to-plane ICP are combined into a single probabilistic framework. Fitzgibbon (2003) proposes an algorithm to align two two-dimensional point sets that is also extensible to three-dimensional point sets, and uses the distance transform to model the point correspondence function when aligning two-dimensional curves. Since ICP relies on three-dimensional point clouds, its applications to monocular or stereo cameras, which generally give neither dense nor accurate three-dimensional points, are limited.

To overcome the disadvantages of vision-only approaches, visual-inertial fusion has recently gained a lot of attention. It is straightforward to apply variations of Kalman filtering (Huang et al (2011); Shen et al (2013); Scaramuzza et al (2014); Omari et al (2015); Bloesch et al (2015)) to loosely fuse visual and inertial measurements. Huang et al (2011) utilize the information richness of RGB-D cameras and fuse the visual tracking and inertial measurements in an extended Kalman filter (EKF) framework. Shen et al (2013) combine the information from a KLT tracker and IMU measurements with an unscented Kalman filter (UKF), and Omari et al (2015) leverage the recent development of direct dense tracking and fuse it with inertial information in the EKF fashion. Scaramuzza et al (2014) also apply an EKF in a similar way to Shen et al (2013). Intensity errors of image patches as well as an inverse depth parametrization are considered in Bloesch et al (2015). The high-level effect of such fusion is the smoothing of vision-based tracking. Even when visual tracking fails, estimation can be carried out using the IMU for short-term motion prediction. In loosely-coupled methods, visual measurements are usually presented in the form of relative pose transformations, while leaving the visual pose tracking as a black box. This leads to lower computational complexity at the cost of suboptimal results.

Recent developments in visual-inertial fusion indicate that tightly-coupled methods outperform their loosely-coupled counterparts in terms of estimation accuracy (Hesch et al (2014); Li and Mourikis (2013); Leutenegger et al (2015); Shen et al (2015); G. Huang, M. Kaess and Leonard J.J. (2014); Christian et al (2015); Usenko et al (2016)). Shen et al (2015), Christian et al (2015) and Hesch et al (2014) propose schemes of IMU preintegration on the Lie manifold and then fuse it with monocular camera tracking information in a tightly-coupled graph-based optimization framework. The inertial measurement integration approaches in Leutenegger et al (2015), Shen et al (2015) and Christian et al (2015) are slightly different, with respective pros and cons. Hesch et al (2014), Li and Mourikis (2013) and G. Huang, M. Kaess and Leonard J.J. (2014) put emphasis on the system's observability and build up the mathematical foundation of visual-inertial systems. Upon their analysis, they develop monocular visual-inertial systems that are high-precision as well as consistent. Besides this, Hesch et al (2014), Li and Mourikis (2013), Dong-Si and Mourikis (2012), L. Heng, G. H. Lee, and M. Pollefeys (2014), and Yang and Shen (2015) relax the assumption that the transformation between the camera and IMU (camera extrinsics) is known. IMU-camera extrinsics are also optimized in their algorithms, which takes a step forward towards practical applications. Instead of geometric errors of sparse features, Usenko et al (2016) propose a direct visual-inertial odometry that minimizes intensity errors. Tightly-coupled methods consider the coupling between the two types of measurements and allow the adoption of a graph optimization-based framework with the ability to iteratively re-linearize nonlinear functions. In this way, tightly-coupled approaches gain the potential to achieve better performance. On the other hand, these kinds of methods usually come with a higher computational cost, as the number of variables involved in the optimization is large.

3 System Overview

The pipeline of our proposed system is illustrated in Fig. 2 (also see Table 1 in Sect. 7.1). Three threads run simultaneously, utilizing the multi-core architecture.
Fig. 2: The pipeline of our proposed system. It consists of three modules running in separate threads to ensure real-time
availability of state estimates (denoted as three dashed boxes). The driver thread runs at 200 Hz, the edge alignment thread
runs at 25 Hz, and the optimization-based sensor fusion thread runs at 25 Hz.
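To make the threading structure concrete, the following is a minimal C++ sketch of the three-thread layout summarized in Fig. 2 and described below. The names (FrameQueue, driverLoop, edgeAlignmentLoop, fusionLoop) and the polling details are illustrative assumptions, not taken from the released ROS packages.

```cpp
// Illustrative sketch only: three concurrent loops sharing a frame buffer.
#include <chrono>
#include <mutex>
#include <queue>
#include <thread>

struct Frame { /* rectified stereo pair, IMU batch, timestamp, ... */ };

class FrameQueue {                     // thread-safe frame list buffer
 public:
  void push(const Frame& f) { std::lock_guard<std::mutex> lk(m_); q_.push(f); }
  bool popAll(std::queue<Frame>& out) {
    std::lock_guard<std::mutex> lk(m_);
    if (q_.empty()) return false;
    std::swap(out, q_);
    return true;
  }
 private:
  std::queue<Frame> q_;
  std::mutex m_;
};

FrameQueue frame_list;
using namespace std::chrono_literals;

void driverLoop() {                    // ~200 Hz: acquisition and rectification
  while (true) { /* read IMU + stereo, rectify */ std::this_thread::sleep_for(5ms); }
}
void edgeAlignmentLoop() {             // ~25 Hz: Canny, distance transform, alignment
  while (true) {
    Frame f;                           // result of aligning against the keyframe
    frame_list.push(f);
    std::this_thread::sleep_for(40ms);
  }
}
void fusionLoop() {                    // ~25 Hz: sliding-window optimization
  while (true) {
    std::queue<Frame> batch;
    if (frame_list.popAll(batch)) { /* add frames, optimize, marginalize */ }
    std::this_thread::sleep_for(40ms);
  }
}

int main() {
  std::thread t1(driverLoop), t2(edgeAlignmentLoop), t3(fusionLoop);
  t1.join(); t2.join(); t3.join();
}
```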
The first thread is the driver thread, which performs basic operations, such as data acquisition and image rectification.

The edge alignment thread (Sect. 5) performs keyframe-to-frame edge alignment periodically. Canny edge detection and the distance transform are performed for each incoming image. The angular prior from the integration of gyroscope measurements initializes the incremental rotation. This thread also assesses instantaneous tracking performance, detects tracking failure and determines whether to add a new keyframe. A disparity map is computed using a standard block matching algorithm for every new keyframe. The visual measurements and their corresponding frames are stored in a frame list buffer for further processing by the optimization thread.

The optimization-based sensor fusion thread maintains a sliding window of states and measurements, and checks the frame list buffer periodically. If it is not empty, all the frames within the buffer are added into the sliding window. If a keyframe is added, loop closure detection is performed to find possible visual connections between keyframes. Graph optimization is then applied to find the maximum a posteriori estimate of all the states within the sliding window using measurements from IMU preintegration (Sect. 6.2), multi-constrained relative pose measurements (Sect. 6.3) and the prior. A two-way marginalization scheme that selectively removes states is used to bound the computational complexity and the time interval of the IMU integration, and to maximize the information stored within the sliding window (Sect. 6.5).

4 Notations

We begin by giving notations. We consider $(\cdot)^w$ as the earth's inertial frame, and $(\cdot)^{b_k}$ and $(\cdot)^{c_k}$ as the IMU body frame and camera frame while taking the $k$-th image. We assume that the IMU-camera sensor suite is rigidly mounted, and the translation and rotation between the left camera and the IMU are $t^c_b$, $q^c_b$. The intrinsics of the stereo cameras are calibrated beforehand. $p^X_Y$, $v^X_Y$ and $R^X_Y$ are the 3D position, velocity and rotation of frame $Y$ with respect to frame $X$, respectively. We also have the corresponding quaternion ($q^X_Y = [q_x, q_y, q_z, q_w]$) representation for rotation; Hamilton notation is used for quaternions. The states are defined as the combinations of positions, velocities, rotations, linear acceleration biases and angular velocity biases: $x_k = [p^w_{b_k}, v^w_{b_k}, q^w_{b_k}, b^{b_k}_a, b^{b_k}_\omega]$. For camera frame $c_r$ (which denotes the reference frame) and camera frame $c_n$ (which denotes the current frame), the rigid-body transformation between them is $T^{c_n}_{c_r} = \{p^{c_n}_{c_r}, R^{c_n}_{c_r}\} \in SE(3)$, where $p^{c_n}_{c_r}$ and $R^{c_n}_{c_r}$ are the translation and rotation, respectively. Next, we denote a 3D scene point $i$ in the coordinate system of the camera optical center at time instance $k$ by ${}^k f_i \in \mathbb{R}^3$. The camera projection function $\Pi : \mathbb{R}^3 \mapsto \mathbb{R}^2$ projects a visible 3D scene point onto the image domain. The inverse projection function $\tilde{\Pi} : (\mathbb{R}^2, \mathbb{R}) \mapsto \mathbb{R}^3$ back-projects a pixel coordinate given the depth at this pixel coordinate:

$$ {}^k u_i = \Pi({}^k f_i), \qquad (1) $$
$$ {}^k f_i = \tilde{\Pi}({}^k u_i, Z_k({}^k u_i)), \qquad (2) $$

where ${}^k u_i \in \mathbb{R}^2$ denotes the image coordinates of the 3D point ${}^k f_i$, and $Z_k({}^k u_i)$ denotes the depth of point ${}^k f_i$. We use a graph structure to represent the variables (states, combinations of poses and velocities) we aim to solve for and the constraints (links) between variables. See Fig. 3 for an illustration and Sect. 6 for details about variables and constraints.
5 Edge Alignment

In this section, we introduce our formulation for relative camera motion estimation, which we refer to as the edge alignment formulation. It is based on the minimization of a geometric error term at each edge pixel to obtain an estimate of the rigid body transform between two frames, i.e., to find a pose (rotation and translation) such that the edges of the two images align. This is in contrast to previous direct methods, notably the one proposed by Kerl et al (2013), which minimize the photometric error at every pixel. The energy formulation we propose in this work is the sum of the squared distances between the transformed-projected (onto the current frame) coordinates of the edge pixels from the reference frame and the nearest edge pixels in the current frame.

For convenience of notation we derive our energy formulation using $R$ and $p$ as aliases for $R^{c_n}_{c_r}$ and $p^{c_n}_{c_r}$ for a particular instance of the reference and current frames. Our proposed geometric energy function is the sum of the distances between the re-projections (of edge points from the reference image) and the nearest edge points in the current image:

$$ f(R, p) = \sum_i \min_j D^2\big(\Pi[R\,{}^r f_i + p],\ {}^n u_j\big), \qquad (3) $$

where $D : (\mathbb{R}^2, \mathbb{R}^2) \mapsto \mathbb{R}$ denotes the Euclidean distance between those points. The best estimate for the rigid transform can be obtained by solving the following optimization problem:

$$ \underset{R,\,p}{\text{minimize}}\ f(R, p) \quad \text{subject to}\ R \in SO(3). \qquad (4) $$

We relax the geometric energy function by restricting it only to edge points. In this approach, we observe that, if the image points corresponding to edge points in the reference image (denoted by ${}^r e_i \in \mathbb{R}^2$ with corresponding 3D points ${}^r E_i$) are pre-selected, then the function $\min_j D(u_i, u_j)$ is exactly the definition of the distance transform (Felzenszwalb and Huttenlocher (2012)). We denote the distance transform of the edge-map of the current image as $V^{(n)} : \mathbb{R}^2 \mapsto \mathbb{R}$. Thus, the energy term for an edge pixel of the reference frame is given by

$$ \upsilon_{e_i}(R, p) = V^{(n)}\big(\Pi[R\,\tilde{\Pi}({}^r e_i, Z_r({}^r e_i)) + p]\big). \qquad (5) $$

To summarize, the relaxed energy function is

$$ f(R, p) = \sum_{\forall e_i} \big(\upsilon_{e_i}(R, p)\big)^2 \qquad (6) $$

and our goal is to solve for $R^*$ and $p^*$:

$$ \underset{R,\,p}{\text{argmin}}\ f(R, p) \quad \text{subject to}\ R \in SO(3). \qquad (7) $$

Since (5) is highly nonlinear with respect to $R$ and $p$, we linearize it on the Lie group manifold $SE(3)$ with respect to $\xi = (\delta p, \delta\theta) \in \mathfrak{se}(3)$, which is the minimum-dimension error representation,

$$ \upsilon_{e_i}(R, p, \xi) = \upsilon_{e_i}(R, p) + J_i|_{\xi=0}\cdot\xi, \qquad (8) $$

where $J_i$ is the Jacobian matrix of $\upsilon_{e_i}(R, p)$ with respect to $\xi$ at $\xi = 0$. The Lie group $SE(3)$ and the Lie algebra $\mathfrak{se}(3)$ are linked by the exponential and logarithm maps; more details can be found in Ma et al (2012).

Unlike our earlier work Kuse and Shen (2016), which provides a strong theoretical guarantee on convergence, we adopt the Gauss-Newton method to solve (7). We empirically find that the Gauss-Newton method works in most cases and converges to a local minimum quickly, so the real-time requirement of our proposed system is satisfied. To guard against disturbances caused by poor convergence, we have mechanisms to detect and reject failures of edge alignment (Sect. 5.3 and Sect. 6.4).

Following the scheme of the Gauss-Newton approach, we iteratively solve (7) using the linearization (8)
around the current estimate $\hat{T} = \{\hat{R}, \hat{p}\}$ and then perform incremental updates until convergence:

$$ \hat{T} \leftarrow \hat{T} \otimes \exp(\xi). \qquad (9) $$

Fig. 4: Reprojections of edge pixels in the reference frame onto the current frame as the Gauss-Newton optimization progresses (iterations 0, 2, 4, 6 and 8). The middle row shows the reprojections on the current gray image; they are false colored to represent $\upsilon_{e_i}(\xi)$. The last row shows reprojections on the distance transform image of the edge-map of the current frame. Note that the current frame and the reference frame are about 160 ms apart (5 frames) and the Gauss-Newton progress is shown without pyramids, with the initial guess set to the identity. Viewing in color is recommended.
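The following C++ sketch illustrates the edge alignment loop of Sect. 5: Canny edges and the distance transform $V^{(n)}$ of the current frame (computed with OpenCV), the per-edge-pixel residual of (5), and the right-multiplied Gauss-Newton update of (9). For brevity it uses numeric Jacobians, a fixed iteration count and no image pyramid, unlike the actual implementation; all names, thresholds and constants are illustrative.

```cpp
// Illustrative sketch of edge alignment as Gauss-Newton on the distance
// transform (cf. Eqs. (5)-(9)).  Requires OpenCV and Eigen.
#include <opencv2/opencv.hpp>
#include <Eigen/Dense>
#include <vector>

struct Pinhole { double fx, fy, cx, cy; };
struct RefEdgePixel { double u, v, depth; };      // edge pixel of the reference frame

// Distance transform V(n) of the current frame's edge map.
cv::Mat distanceTransformOfEdges(const cv::Mat& gray) {
  cv::Mat edges, inverted, dt;
  cv::Canny(gray, edges, 50, 150);                // thresholds are placeholders
  cv::bitwise_not(edges, inverted);               // edge pixels become zero
  cv::distanceTransform(inverted, dt, cv::DIST_L2, 3);
  return dt;                                      // CV_32F distance to nearest edge
}

// Bilinear lookup of the distance transform; a large value outside the image.
double sampleDT(const cv::Mat& dt, double u, double v) {
  if (u < 1 || v < 1 || u > dt.cols - 2 || v > dt.rows - 2) return 50.0;
  int x = (int)u, y = (int)v;
  double a = u - x, b = v - y;
  return (1 - a) * (1 - b) * dt.at<float>(y, x) + a * (1 - b) * dt.at<float>(y, x + 1) +
         (1 - a) * b * dt.at<float>(y + 1, x) + a * b * dt.at<float>(y + 1, x + 1);
}

Eigen::Matrix3d expSO3(const Eigen::Vector3d& w) {   // exponential map of so(3)
  double angle = w.norm();
  if (angle < 1e-12) return Eigen::Matrix3d::Identity();
  return Eigen::AngleAxisd(angle, w / angle).toRotationMatrix();
}

// Residual of one reference edge pixel under (R, p), cf. Eq. (5).
double residual(const cv::Mat& dt, const Pinhole& K, const RefEdgePixel& e,
                const Eigen::Matrix3d& R, const Eigen::Vector3d& p) {
  Eigen::Vector3d f((e.u - K.cx) / K.fx * e.depth,
                    (e.v - K.cy) / K.fy * e.depth, e.depth);   // back-project
  Eigen::Vector3d g = R * f + p;                               // transform
  if (g.z() < 0.1) return 50.0;
  return sampleDT(dt, K.fx * g.x() / g.z() + K.cx,
                  K.fy * g.y() / g.z() + K.cy);                // project + V(n)
}

// Gauss-Newton with numeric Jacobians and right-multiplied update T <- T * exp(xi).
void alignEdges(const cv::Mat& dt, const Pinhole& K,
                const std::vector<RefEdgePixel>& edges,
                Eigen::Matrix3d& R, Eigen::Vector3d& p, int iterations = 10) {
  const double eps = 1e-5;
  for (int it = 0; it < iterations; ++it) {
    Eigen::Matrix<double, 6, 6> H = Eigen::Matrix<double, 6, 6>::Zero();
    Eigen::Matrix<double, 6, 1> rhs = Eigen::Matrix<double, 6, 1>::Zero();
    for (const RefEdgePixel& e : edges) {
      double r0 = residual(dt, K, e, R, p);
      Eigen::Matrix<double, 1, 6> J;
      for (int k = 0; k < 6; ++k) {               // numeric derivative w.r.t. xi_k
        Eigen::Matrix<double, 6, 1> xi = Eigen::Matrix<double, 6, 1>::Zero();
        xi(k) = eps;
        Eigen::Matrix3d Rp = R * expSO3(xi.tail<3>());
        Eigen::Vector3d pp = R * xi.head<3>() + p;
        J(k) = (residual(dt, K, e, Rp, pp) - r0) / eps;
      }
      H += J.transpose() * J;
      rhs += -J.transpose() * r0;
    }
    Eigen::Matrix<double, 6, 1> xi = H.ldlt().solve(rhs);
    p = R * xi.head<3>() + p;                     // T <- T * exp(xi)
    R = R * expSO3(xi.tail<3>());
  }
}
```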
6 Sliding Window-Based Sensor Fusion

Given two time instants that correspond to two images, we can write the IMU propagation model for position, velocity and rotation with respect to the earth's inertial frame.

6.2 IMU Preintegration

We adopt the IMU preintegration approach proposed in Yang and Shen (2016). The linear acceleration $a^{b_t}$ and angular velocity $\omega^{b_t}$ at time $t$ are modeled as

$$ a^{b_t} = a^{b_t*} + b^{b_t}_a + n^{b_t}_a \qquad (14) $$
$$ \omega^{b_t} = \omega^{b_t*} + b^{b_t}_\omega + n^{b_t}_\omega \qquad (15) $$

where $a^{b_t*}$ and $\omega^{b_t*}$ are the true values, $b^{b_t}_a$ and $b^{b_t}_\omega$ are slowly varying biases which are modeled as Gaussian random walks, and $n^{b_t}_a$ and $n^{b_t}_\omega$ are additive Gaussian white noises. The integration of IMU measurements between time instants $k$ and $k+1$ is

$$ \hat{z}^k_{k+1} = \begin{bmatrix} \hat{\alpha}^k_{k+1} \\ \hat{\beta}^k_{k+1} \\ \hat{q}^k_{k+1} \end{bmatrix} = \begin{bmatrix} \iint_{t\in[k,k+1]} R^k_t\, a^{b_t}\, dt^2 \\ \int_{t\in[k,k+1]} R^k_t\, a^{b_t}\, dt \\ \int_{t\in[k,k+1]} \tfrac{1}{2}\begin{bmatrix} -\lfloor\omega^{b_t}\rfloor_\times & \omega^{b_t} \\ -\omega^{b_t\,T} & 0 \end{bmatrix} q^k_t\, dt \end{bmatrix}. \qquad (16) $$

The residual function between the states and the IMU integration is defined as

$$ r_{\mathcal{S}_i}(\hat{z}^k_{k+1}, \mathcal{X}) = \begin{bmatrix} \delta\alpha^k_{k+1} \\ \delta\beta^k_{k+1} \\ \delta\theta^k_{k+1} \\ \delta b^{b_k}_a \\ \delta b^{b_k}_\omega \end{bmatrix} = \begin{bmatrix} R^{b_k}_w \big(p^w_{b_{k+1}} - p^w_{b_k} - v^w_{b_k}\Delta t + g^w \tfrac{\Delta t^2}{2}\big) - \hat{\alpha}^k_{k+1} \\ R^{b_k}_w \big(v^w_{b_{k+1}} - v^w_{b_k} + g^w \Delta t\big) - \hat{\beta}^k_{k+1} \\ 2\big[(\hat{q}^k_{k+1})^{-1} (q^w_{b_k})^{-1} q^w_{b_{k+1}}\big]_{xyz} \\ b^{b_{k+1}}_a - b^{b_k}_a \\ b^{b_{k+1}}_\omega - b^{b_k}_\omega \end{bmatrix}. \qquad (17) $$
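A minimal sketch of how the preintegrated terms in (16) can be accumulated from raw IMU samples with simple Euler steps is given below. Bias random walks, the covariance propagation of (18) and the bias Jacobians are omitted; the structure and names are illustrative assumptions rather than the released implementation.

```cpp
// Sketch of discrete IMU preintegration between camera frames k and k+1,
// accumulating the alpha/beta/q terms of Eq. (16) with simple Euler steps.
#include <Eigen/Dense>
#include <vector>

struct ImuSample { Eigen::Vector3d acc, gyr; double dt; };   // body-frame a, w

struct Preintegrated {
  Eigen::Vector3d alpha = Eigen::Vector3d::Zero();           // double integral of R a
  Eigen::Vector3d beta  = Eigen::Vector3d::Zero();           // integral of R a
  Eigen::Quaterniond q  = Eigen::Quaterniond::Identity();    // rotation from frame k to t
};

Preintegrated preintegrate(const std::vector<ImuSample>& samples,
                           const Eigen::Vector3d& ba,        // accelerometer bias
                           const Eigen::Vector3d& bw) {      // gyroscope bias
  Preintegrated z;
  for (const ImuSample& s : samples) {
    Eigen::Vector3d a = s.acc - ba;                          // bias-corrected accel
    Eigen::Vector3d w = s.gyr - bw;                          // bias-corrected gyro
    Eigen::Matrix3d R = z.q.toRotationMatrix();              // R^k_t
    z.alpha += z.beta * s.dt + 0.5 * R * a * s.dt * s.dt;
    z.beta  += R * a * s.dt;
    Eigen::Vector3d half = 0.5 * w * s.dt;                   // small-angle quaternion
    Eigen::Quaterniond dq(1.0, half.x(), half.y(), half.z());
    z.q = (z.q * dq).normalized();
  }
  return z;
}
```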
The covariance $P^k_{k+1}$ can be calculated by iteratively linearizing the continuous-time dynamics of the error term and then updating it with a discrete-time approximation:

$$ P^k_{t+\delta t} = (I + F_t \delta t)\, P^k_t\, (I + F_t \delta t)^T + (I + G_t \delta t)\, Q_t\, (I + G_t \delta t)^T, \qquad (18) $$

with the initial condition $P^k_k = 0$. $F_t$ and $G_t$ are the state transition Jacobians with respect to the states and the IMU measurement noise, respectively. Detailed derivations can be found in Yang and Shen (2016). IMU preintegration forms constraints between consecutive state variables (inertial links in the graph model, see Fig. 3).

6.3 Multi-constrained Edge Alignments

Edge alignment (Sect. 5) is performed between the latest keyframe within the sliding window and the latest incoming frame (also referred to as the current frame). The resultant visual measurements are named keyframe-to-frame links in the graph model (Fig. 3). In addition, since significant drift may occur after aggressive motions, we introduce a local loop closure module for recovery. Once a new keyframe is added, loop closure detection is performed to seek possible visual measurements between existing keyframes within the sliding window and the new keyframe using edge alignment. Note that a cross check is adopted to avoid wrong loop closures: the cross check is passed if and only if the two corresponding estimated rigid-body transformations are consistent. The outputs from the loop closure detection module are denoted as loop closure links in the graph model (Fig. 3).

Suppose the visual measurement between reference frame $i$ and aligned frame $j$ obtained from edge alignment is $\hat{z}^j_i = T^{j*}_i$. The residual function is defined as

$$ r_{\mathcal{S}_c}(\hat{z}^j_i, \mathcal{X}) = \begin{bmatrix} \delta p^j_i \\ \delta\theta^j_i \end{bmatrix} = \begin{bmatrix} R^{b_j}_w (p^w_{b_i} - p^w_{b_j}) - R^b_c (R^{c_j}_{c_i} t^c_b + \hat{p}^{c_j}_{c_i}) - t^b_c \\ 2\big[(q^b_c\, \hat{q}^{c_j}_{c_i}\, q^c_b)^{-1} (q^w_{b_j})^{-1} q^w_{b_i}\big]_{xyz} \end{bmatrix}, \qquad (19) $$

where $\hat{q}^j_i$ is the quaternion representation of $\hat{R}^j_i$, and vice versa. It can be derived mathematically that the corresponding covariance $P^j_i$ is the inverse of the Hessian matrix $J^T W J$ at the final Gauss-Newton iteration.

6.4 IMU-aided External Check

Since IMUs provide noisy but outlier-free measurements, estimation using IMU preintegration is reliable in the short term. Moreover, though we can tune the related parameters so that edge alignment exhibits good performance, it can fail when the surroundings are complicated and unknown in advance. Additionally, tuning the parameters is not an easy job, as different external conditions may result in different parameter settings. We propose to use IMU preintegration to threshold the performance of instantaneous edge alignment. The characteristics of an IMU can be easily calculated offline and are assumed to be known prior to the start of the system. We detect possible false edge alignments according to the difference between the IMU preintegration and the edge alignment estimate, and ignore the visual measurements from edge alignment if they are not consistent with the IMU preintegration (for both the rotation and translation estimation). The IMU-aided external check is a vitally important step towards a practical and robust system.

We declare edge alignment estimates as failures if either of the criteria (Sect. 5.3 and Sect. 6.4) fails.

6.5 Two-way Marginalization

Due to the limited memory and computational resources of the system, we can only maintain a certain number of states and measurements within the sliding window. We convert states that carry less information into priors $\{\Lambda_p, b_p\}$ by marginalization, where $\Lambda_p = H_p^T H_p$. Note that the effectiveness of loop closure (Sect. 6.3) and drift elimination depends on whether an older state is kept within the sliding window. For this reason, unlike traditional approaches, which only marginalize old states, we use the two-way marginalization scheme that was first introduced in our earlier work, Shen et al (2014), to selectively remove old or more recent states in order to enlarge the regions covered by the sliding window.

Fig. 5 illustrates the process of our two-way marginalization. Front marginalization removes the second newest state, while back marginalization removes the oldest state. Blue circles represent key states, green circles represent the states to be marginalized and brown circles represent the incoming states. The relation between states and frames is that states include poses and velocities, while frames include poses and images. Each frame has its corresponding state and vice versa. States are linked by IMU preintegration (inertial links), incremental edge alignment (tracking links), loop closure (loop closure links) and the prior (prior link). To perform front marginalization, the second newest state is first linked with the incoming state (step 1) and then marginalized out (step 2). For back marginalization, the oldest state is simply marginalized out (steps 1-2). After marginalization, the third step decides which state is to be marginalized in the next round (front marginalization or back marginalization). Mathematically, to marginalize a specific state, we remove all links related to it and then add the information of the removed links into the new prior.
Fig. 5: The process of our two-way marginalization, which marginalizes all the available information (motion estimates from
edge alignment, inertial measurement, loop closure relation and prior) into a new prior and maintains bounded computational
complexity. Front marginalization marginalizes the second newest state, while back marginalization marginalizes the oldest
state within the sliding window.
…outliers in edge alignment that pass the DEA self check (Sect. 5.3) and IMU-aided external check (Sect. 6.4). $W^j_i$ is computed according to Huber norm thresholding on the current estimate:

$$ (W^j_i)_{ul} = \begin{cases} I_{3\times3}, & \text{if}\ \|R^0_j(p^0_i - p^0_j) - \hat{t}^j_i\| \le c_t \\[4pt] \dfrac{c_t}{\|R^0_j(p^0_i - p^0_j) - \hat{t}^j_i\|}\, I_{3\times3}, & \text{otherwise} \end{cases} \qquad (24) $$

$$ (W^j_i)_{lr} = \begin{cases} I_{3\times3}, & \text{if}\ \big\|2[(\hat{q}^j_i)^{-1}(q^0_j)^{-1} q^0_i]_{xyz}\big\| \le c_a \\[4pt] \dfrac{c_a}{\big\|2[(\hat{q}^j_i)^{-1}(q^0_j)^{-1} q^0_i]_{xyz}\big\|}\, I_{3\times3}, & \text{otherwise,} \end{cases} \qquad (25) $$

where $(W^j_i)_{ul}$ is the upper left $3\times3$ block of $W^j_i$, $(W^j_i)_{lr}$ is the lower right $3\times3$ block of $W^j_i$, $I_{3\times3}$ is the identity matrix, and $c_t$ and $c_a$ are the given translation and angular thresholds, respectively.

7 Experiments

For sensing, we use a VI-sensor, which consists of a MEMS IMU and two global shutter cameras in a fronto-parallel stereo configuration. A power-efficient small-form-factor computer, an Intel NUC with a dual-core i5-4250U CPU running at 1.3 GHz and 16 GB RAM, is used for the computing needs. All the algorithms are developed in C++ with ROS as the interfacing robotics middleware. The IMU generates data at 200 Hz and the stereo camera produces time-synchronized data at 25 Hz.

7.1 Real-time Implementation

To obtain the depth map from the stereo camera, we use the block matching algorithm implemented in OpenCV (StereoBM). Since the proposed edge alignment requires depth values at edge pixels only, simple stereo block matching suffices for our needs. We adopt image pyramids (with three levels) in the edge alignment to handle the large image displacement caused by fast motion and to increase the speed of convergence of the underlying iterative optimization procedure. We set the size of the sliding window to 30. The threshold on the average reprojection distance of the edge alignment self check is set to 5. For the local loop closure module, we first do the cross check of the edge alignment at the coarsest level, and ignore the candidates that fail this test. We then do the cross check of the edge alignment with full image pyramids for the remaining candidates. Meanwhile, we restrict the number of cross checks with full image pyramids so as to limit the maximum time spent on the loop closure module. The computing times of each component are summarized in Table 1.

We do not impose a global prior (like fixing the oldest pose) when solving equations (13) and (22). Instead, we solve equations (13) and (22) without any global prior (the resultant equations may not be well constrained, so we use a perturbed Cholesky decomposition that ensures positive definiteness to solve them). The obtained positions and yaw angles of states in the sliding window after the iteration are subtracted by the position and yaw angle difference of the oldest pose before and after the iteration. We do NOT enforce a global prior, as the pitch and roll angles of the oldest state in the sliding window are observable. The initial position and yaw angle of the oldest pose at time instant $b_0$ before the iteration are zero. The prior matrix $\Lambda_p$ and prior vector $b_p$ obtained in the marginalization step are relative priors between states.
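As an illustration of the depth computation described above, the following sketch runs OpenCV's StereoBM on a rectified grayscale pair and keeps disparities only at Canny edge pixels; the block-matching settings, Canny thresholds and baseline handling are placeholders rather than the parameters of the released code.

```cpp
// Sketch: depth only at edge pixels of a new keyframe via StereoBM + Canny.
#include <opencv2/opencv.hpp>
#include <vector>

struct EdgeDepth { cv::Point2f px; float depth; };

std::vector<EdgeDepth> edgeDepths(const cv::Mat& left, const cv::Mat& right,
                                  double fx, double baseline) {
  // StereoBM expects 8-bit, single-channel, rectified images.
  cv::Ptr<cv::StereoBM> bm = cv::StereoBM::create(64, 21);   // ndisp, block size
  cv::Mat disp16;
  bm->compute(left, right, disp16);            // fixed-point disparity (x16)

  cv::Mat edges;
  cv::Canny(left, edges, 50, 150);             // thresholds are placeholders

  std::vector<EdgeDepth> out;
  for (int v = 0; v < edges.rows; ++v)
    for (int u = 0; u < edges.cols; ++u) {
      if (!edges.at<uchar>(v, u)) continue;    // keep edge pixels only
      float d = disp16.at<short>(v, u) / 16.0f;
      if (d <= 1.0f) continue;                 // reject invalid disparities
      out.push_back({cv::Point2f((float)u, (float)v),
                     (float)(fx * baseline / d)});   // depth = f * b / disparity
    }
  return out;
}
```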
Fig. 6: Part of the captured images during a walk around a circular path with various lighting conditions. (a) Transition from an indoor corridor to an outdoor corridor. (b) Transition from an outdoor corridor to an indoor corridor. (c) Featureless and structured surroundings. (d) Fast changing lighting conditions and strong reflections on windows.
Fig. 7: The estimated trajectory of a walk around a circular path with different lighting conditions (axes: X and Y, in meters). The plot legend marks the estimated trajectory, keyframe insertions via the DEA self check, and failure detections via the IMU-aided external check.

…exposure in a dimly lit room, and using this exposure setting in a brighter room causes severe degradation of image quality.

We note that previous visual odometry algorithms which rely on the photo-consistency assumption fail in this challenging environment. The dense approach proposed by Kerl et al (2013) fails to produce any meaningful results due to the violation of the photo-consistency assumption. Furthermore, feature-based algorithms also fail in this situation because of the rather featureless corridors.

Fig. 7 presents the estimated output of our estimator. The red curve represents the estimated trajectory, green circles represent the locations at which new keyframes are inserted by the edge alignment self check, and blue circles represent the locations at which edge alignment is detected as failed by the IMU-aided external check. Some of the images captured during this experiment are shown in Fig. 6 (a)-(d). The total distance travelled is about 120 meters and the final drift is about 1.4 meters. More details can be found in the accompanying video.

The segments within the blue dashed boxes indicate the locations at which the captured surroundings transition from an indoor corridor to an outdoor corridor or from an outdoor corridor to an indoor corridor (Fig. 6 (a), (b)). Since the lighting conditions change rapidly and greatly, edge detection is not consistent between frames, which results in alignment failure. Our proposed system is able to detect this alignment failure by the IMU-aided external check, and this increases the robustness of the system.

The segments within the purple dashed boxes indicate the locations at which the captured surroundings are featureless (Fig. 6 (c)). Since our estimator tracks incremental motions based on edges instead of sparse features, the alignment succeeds in most cases. Though occasional failures exist (see the blue circles), our system overcomes them by the IMU-aided external check.

The segments within the red dashed boxes indicate the locations at which the lighting conditions of the captured surroundings change rapidly and alternately (Fig. 6 (d)). The changing and alternating lighting conditions are caused by the transitions between the glass and the borders of the windows. We also notice that there are strong reflections on the windows on both sides. Our edge alignment module inserts keyframes frequently to handle these cases (see the green circles). Again, the IMU-aided external check is required (see the blue circles).

7.3 Throw it!

In this experiment, we test the proposed system under extreme experimental conditions. This experiment was first designed to demonstrate the superior tracking performance of our previous work Ling and Shen (2015). We play back the recorded data and redo this experiment using the system proposed in this work. In this challenging experiment, we throw the VI-sensor while walking (Fig. 1). The total walking distance is about 50 meters and the final position drift is about 2.24 meters. The VI-sensor is thrown eight times in total. The estimation results of our proposed method
are shown in Fig. 8. From the figure, we see that our estimator can successfully track the motions of these eight throws, resulting in a smooth estimated trajectory. Though edge alignment in our proposed system is able to handle the large image displacement caused by challenging motions, it fails when the motions become more and more aggressive (the captured images become more and more blurry; see the green and blue circles for indications). Inertial measurements are the last resort that provide crucial links between consecutive states to ensure continuous operation of the estimator. Moreover, failure detection via the edge alignment self check (highlighted with green circles, detailed in Sect. 5.3) and failure detection via the IMU prior (highlighted with blue circles, detailed in Sect. 6.4) are of vital importance to the smoothness of the estimated trajectory. In all cases, our local loop closure is able to largely eliminate drifts after throwing (Sect. 6.3).

Notice that, to the best of our knowledge, this experiment is the toughest test for a visual-inertial estimator that has ever been reported.

Fig. 8: The estimated trajectory of a walk around a circular path during which we throw the VI-sensor while walking (axes in meters; the plot highlights failure detections via the IMU prior).

7.4 Performance on the EuRoC MAV Dataset

We compare our proposed method with other state-of-the-art approaches on the public EuRoC MAV dataset (Burri et al (2016)). The complexity of the sequences in this dataset varies in terms of trajectory length, flight dynamics, and illumination conditions. The reference methods are OKVIS (Leutenegger et al (2015)) and ROVIO (Bloesch et al (2015)). Both OKVIS and ROVIO contain default parameters for the EuRoC MAV dataset in their open-source implementations. Since we use stereo cameras in our proposed system, for fair comparison, we set the "doStereoInitialization" flag to true in ROVIO, and also use stereo images. For evaluation we adopt the average relative error metrics proposed in Geiger et al (2012). The summaries are shown in Table 2. No data means the concerned method fails to converge at some point in the sequence.

OKVIS, a tightly-coupled feature-based approach, is the best in terms of ARE-rot and ARE-trans. Nevertheless, it fails to track the V2_03_difficult sequence. The other approaches are able to track all the sequences successfully. The ARE-rot and ARE-trans of "Edge+IMU" are smaller than those of "Edge+IMU+Loop" in most of the sequences. This is because local loop closure usually causes a noticeable pose correction to the latest estimate, which is unfriendly to the relative metrics ARE-rot and ARE-trans. The same holds for the comparison of "Edge-Only" and "Edge+Loop". In terms of ARE-rot, our proposed method ("Edge+IMU+Loop") is better than ROVIO. However, for ARE-trans, ROVIO obtains smaller errors than our approach. The reason is that the estimation of rotation is not related to the scene depth; its error only depends on the number of pixels that well constrain the rotation. ARE-trans greatly depends on scene depth, thus ROVIO, which is a tightly-coupled approach that jointly optimizes the poses and the scene depth, performs better than our method.

7.5 Discussions on Convergence Basin

One advantage of our proposed method compared to dense tracking based on image intensities is that the convergence basin is larger. Put differently, with the proposed formulation, the iterations converge even for a rather poor initial guess. We evaluate this property by skipping frames (downsampling the image temporal frequency). Suppose the original image temporal frequency is $f_n$ and the number of skipped frames is $s_m$. The downsampled temporal frequency is

$$ f'_n = \frac{f_n}{1 + s_m}. \qquad (26) $$

We use the most difficult sequence (V2_03_difficult) of the EuRoC MAV dataset to give a detailed assessment of our system. Since OKVIS fails to track this sequence, we exclude it from this comparison. We compare our system with ROVIO. As in the previous experiment, we set the "doStereoInitialization" flag to true in ROVIO for fair comparison.
Table 2: Average relative angle error (ARE-rot, deg/m) of different approaches on the EuRoC MAV dataset. The best result
is bold. No data means the concerned method fails to converge at some point in the sequence.
Table 3: Average relative translation error (ARE-trans, m/m) of different approaches on the EuRoC MAV dataset. The best
result is bold. No data means the concerned method fails to converge at some point in the sequence.
Table 4: Comparison between different methods with different numbers of skipped frames in the V2_03_difficult sequence of the EuRoC MAV dataset. The error metric is the average relative angle error (ARE-rot, deg/m). The best result is bold. No data means the concerned method fails to converge at some point in the sequence.

Table 5: Comparison between different methods with different numbers of skipped frames in the V2_03_difficult sequence of the EuRoC MAV dataset. The error metric is the average relative translation error (ARE-trans, m/m). The best result is bold. No data means the concerned method fails to converge at some point in the sequence.
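For reference, one plausible way to compute relative error metrics with the deg/m and m/m units reported in Tables 2-5, in the spirit of the KITTI-style evaluation of Geiger et al (2012), is sketched below. The segment length, trajectory alignment and sampling are assumptions for illustration and may differ from the evaluation actually used.

```cpp
// Hedged sketch of a relative error metric (deg/m and m/m) between an
// estimated and a ground-truth trajectory sampled at the same timestamps.
#include <Eigen/Dense>
#include <cmath>
#include <vector>

struct Pose { Eigen::Matrix3d R; Eigen::Vector3d p; };

static Pose relative(const Pose& a, const Pose& b) {         // a^{-1} * b
  return { a.R.transpose() * b.R, a.R.transpose() * (b.p - a.p) };
}

void relativeErrors(const std::vector<Pose>& est, const std::vector<Pose>& gt,
                    double segment_m, double* are_rot_deg_per_m,
                    double* are_trans_m_per_m) {
  double rot_sum = 0, trans_sum = 0;
  int count = 0;
  // Cumulative ground-truth path length, used to pick segment end points.
  std::vector<double> dist(gt.size(), 0.0);
  for (size_t i = 1; i < gt.size(); ++i)
    dist[i] = dist[i - 1] + (gt[i].p - gt[i - 1].p).norm();

  for (size_t i = 0; i < gt.size(); ++i) {
    size_t j = i;
    while (j < gt.size() && dist[j] - dist[i] < segment_m) ++j;
    if (j >= gt.size()) break;
    Pose d_est = relative(est[i], est[j]);
    Pose d_gt  = relative(gt[i],  gt[j]);
    Pose err   = relative(d_gt, d_est);                       // residual motion
    double c   = std::min(1.0, std::max(-1.0, 0.5 * (err.R.trace() - 1.0)));
    double len = dist[j] - dist[i];
    rot_sum   += (std::acos(c) * 180.0 / M_PI) / len;         // deg per meter
    trans_sum += err.p.norm() / len;                          // meters per meter
    ++count;
  }
  *are_rot_deg_per_m = count ? rot_sum / count : 0.0;
  *are_trans_m_per_m = count ? trans_sum / count : 0.0;
}
```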
We further test our system performance in an outdoor environment with more complex textures and less prominent edge data. There are trees, grass and shadows in the test sequence. Our system is able to handle this tracking sequence. The estimated trajectory is shown in Fig. 9 (a). One of the captured images is shown in Fig. 9 (b). The corresponding edges and distance transform are shown in Fig. 9 (c) and (d). The reference keyframe edges are shown in Fig. 9 (e). The total travel distance is about 120 meters and the final position drift is about 1.5 meters. More tracking details can be found in the supplementary video: https://1drv.ms/u/s!ApzRxvwAxXqQmgX66v7srdWZNvAs.

Fig. 9: Our system is able to track in an outdoor environment with more complex textures and less prominent edge data. (a) The estimated trajectory (in meters). (b) One of the captured images. (c) The edges detected in the current frame. (d) The distance transform of the current frame. (e) The reference keyframe edges. More tracking details can be found in the supplementary video: https://1drv.ms/u/s!ApzRxvwAxXqQmgX66v7srdWZNvAs

8 Conclusions and Future Work

We propose a novel and robust real-time system for state estimation of aggressive motions. Our system is designed specifically for aggressive quadrotor flights or other applications in which aggressive motions are encountered (such as augmented reality). We employ a novel edge-tracking formulation for visual relative pose estimation. We also propose a semi-tightly coupled probabilistic framework for fusion of sensor states over a sliding window. The multi-thread framework enables fast and stable estimation with only the CPU of an off-the-shelf computing platform. Experiments have verified the performance of our system and its potential for use in embedded system applications.

We note that tightly-coupled methods, which jointly optimize poses and point depths, outperform our semi-tightly coupled approach if their front-end trackers work well. As future work, we will integrate the front-end edge tracker and back-end tightly-coupled optimization in a
whole framework to achieve better performance. The main challenge is to handle the greatly increased system complexity. A probabilistic formulation of the edge alignment will also be considered.

9 Acknowledgments

The authors acknowledge the funding support from HKUST internal grant R9341 and the HKUST institutional scholarship. We would like to thank all AURO reviewers for their exceptionally useful reviews.

References

Baker S, Matthews I (2004) Lucas-Kanade 20 years on: A unifying framework. Intl J Comput Vis 56(3):221–255
Bay H, Ess A, Tuytelaars T, Van Gool L (2008) Speeded-up robust features. Computer Vision and Image Understanding
Bloesch M, Omari S, Hutter M, Roland S (2015) Robust visual inertial odometry using a direct EKF-based approach. In: Proc. of the IEEE/RSJ Intl. Conf. on Intell. Robots and Syst.
Burri M, Nikolic J, Gohl P, Schneider T, Rehder J, Omari S, Achtelik MW, Siegwart R (2016) The EuRoC micro aerial vehicle datasets. Intl J Robot Research 35(10):1–11
Christian F, Luca C, Frank D, Davide S (2015) IMU preintegration on manifold for efficient visual-inertial maximum-a-posteriori estimation. In: Proc. of Robot.: Sci. and Syst.
Dong-Si T, Mourikis AI (2012) Estimator initialization in vision-aided inertial navigation with unknown camera-IMU calibration. In: Proc. of the IEEE/RSJ Intl. Conf. on Intell. Robots and Syst.
Engel J, Sturm J, Cremers D (2013) Semi-dense visual odometry for a monocular camera. In: Proc. of the IEEE Intl. Conf. Comput. Vis., Sydney, Australia
Engel J, Schöps T, Cremers D (2014) LSD-SLAM: Large-scale direct monocular SLAM. In: Euro. Conf. on Comput. Vis.
Felzenszwalb PF, Huttenlocher DP (2012) Distance transforms of sampled functions. Theory of Computing 8(1):415–428
Fitzgibbon A (2003) Robust registration of 2D and 3D point sets. Image and Vision Computing 21(14):1145–1153
G Huang, M Kaess and Leonard JJ (2014) Towards consistent visual-inertial navigation. In: Proc. of the IEEE Intl. Conf. on Robot. and Autom., Hong Kong
Geiger A, Lenz P, Urtasun R (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In: Proc. of the IEEE Intl. Conf. on Pattern Recognition
Harris CG, Pike JM (1987) 3D positional integration from image sequences. In: Proc. Alvey Vision Conference
Hesch JA, Kottas DG, Bowman SL, Roumeliotis SI (2014) Consistency analysis and improvement of vision-aided inertial navigation. IEEE Trans Robot 30(1):158–176
Huang AS, Bachrach A, Henry P, Krainin M, Maturana D, Fox D, Roy N (2011) Visual odometry and mapping for autonomous flight using an RGB-D camera. In: Proc. of the Intl. Sym. of Robot. Research, Flagstaff, AZ
Kerl C, Sturm J, Cremers D (2013) Robust odometry estimation for RGB-D cameras. In: Proc. of the IEEE Intl. Conf. on Robot. and Autom.
Kuse M, Shen S (2016) Robust camera motion estimation using direct edge alignment and sub-gradient method. In: Proc. of the IEEE Intl. Conf. on Robot. and Autom.
L Heng, G H Lee, and M Pollefeys (2014) Self-calibration and visual SLAM with a multi-camera system on a micro aerial vehicle. In: Proc. of Robot.: Sci. and Syst.
Leutenegger S, Furgale P, Rabaud V, Chli M, Konolige K, Siegwart R (2015) Keyframe-based visual-inertial odometry using nonlinear optimization. Intl J Robot Research 34(3):314–334
Li M, Mourikis A (2013) High-precision, consistent EKF-based visual-inertial odometry. Intl J Robot Research 32(6):690–711
Ling Y, Shen S (2015) Dense visual-inertial odometry for tracking of aggressive motions. In: Proc. of the IEEE Intl. Conf. on Robot. and Bio.
Ling Y, Liu T, Shen S (2016) Aggressive quadrotor flight using dense visual-inertial fusion. In: Proc. of the IEEE Intl. Conf. on Robot. and Autom.
Lowe DG (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2):91–110
Ma Y, Soatto S, Kosecka J, Sastry SS (2012) An invitation to 3-D vision: from images to geometric models, vol 26. Springer Science & Business Media
Newcombe RA, Lovegrove S, Davison AJ (2011) DTAM: Dense tracking and mapping in real-time. In: Proc. of the IEEE Intl. Conf. Comput. Vis.
Omari S, Bloesch M, Gohl P, Siegwart R (2015) Dense visual-inertial navigation system for mobile robots. In: Proc. of the IEEE Intl. Conf. on Robot. and Autom.
Rosten E, Drummond T (2006) Machine learning for high-speed corner detection. In: Euro. Conf. on Comput. Vis.
Rusinkiewicz S, Levoy M (2001) Efficient variants of the ICP algorithm. In: International Conference on 3-D Imaging and Modeling, pp 145–152
Scaramuzza D, Achtelik M, Doitsidis L, Fraundorfer F, Kosmatopoulos E, Martinelli A, Achtelik M, Chli M, Chatzichristofis S, Kneip L, Gurdan D, Heng L, Lee G,