Lecture Notes CV

Carl Olsson
[email protected]
Contents
10 Model Fitting
  10.1 Noise Models
  10.2 Line Fitting
    10.2.1 Linear Least Squares
12 Local Optimization
  12.1 The Maximal Likelihood Estimator
  12.2 Background
  12.3 Linear Least Squares
  12.4 Non-Linear Least Squares
    12.4.1 Steepest Descent
    12.4.2 Gauss-Newton
    12.4.3 Levenberg-Marquardt
    12.4.4 Bundle Adjustment
Lecture 1:
The Pinhole Camera Model
The main goal of Computer Vision is to understand the world from images. One of the classical problems is the so
called Structure from Motion problem. Given a collection of images of a scene the goal is to estimate the positions
of the cameras that captured the images (the motion) and create a 3D model of the depicted scene (the structure).
Figure 1.1 shows an example of such a model. The main goal of this course is to develop methods for solving these
types of reconstruction problems.
Figure 1.1: An illustration of the Structure from Motion problem. The top row shows 5 images from a collection
of 1208 used to reconstruct the scene. The bottom row shows the reconstruction from two different view points.
The scene model in Figure 1.1 is a 3D point cloud consisting of points whose projections are visible in a subset of
the images. Figure 1.2 illustrates how these projections are obtained. First point detectors are used to generate a
set of 2D points that can potentially be matched to other images (Fig. 1.2(b)). This is followed by a matching step
where the local texture in patches around the points is compared to determine which points are projections of the
same 3D point. For details on point descriptors and the matching process we refer to the course in image analysis.
Here we will assume that the matching process has been completed and we will focus on determining the structure
and motion from the obtained projections.
Figure 1.2: Point detection and matching using the SIFT descriptor.
The most commonly used camera model, which we will also use in the course, is the so called pinhole camera. The
model is inspired by the simplest cameras. It has the shape of a box; light from an object enters through a small hole
(the pinhole) in the front and produces an image on the back camera wall (see Figure 1.3).
Figure 1.3: The Pinhole camera (left), and a mathematical model (right).
To create a mathematical model we first select a coordinate system. The origin C = (0, 0, 0) will represent the so
called camera center (pinhole). The directions of the coordinate axes {e′x, e′y, e′z} are selected so that e′x and
e′y correspond to the x- and y-directions of the image respectively. The third axis e′z is perpendicular to these two and
points in the direction that the camera is facing. Note that {e′x , e′y , e′z } are selected to have positive orientation,
see Figure 1.3. We will refer to this system as the camera coordinate system.
To generate a projection x = (x1 , x2 , 1) of a scene point X = (X1′ , X2′ , X3′ ) we form the line between X and C
and intersect it with the plane z = 1. We will refer to this plane as the image plane and the line as the viewing ray
associated with x or X. The plane z = 1 has the normal e′z and lies at distance 1 from the camera center. We
will refer to e′z as the viewing direction or the principal axis. Note that in contrast to a real pinhole camera we
have placed the image plane in front of the camera center. This has the effect that the image will not appear upside
down, see Figure 1.3.
Since X − C is a direction vector of the viewing ray (see Figure 1.4), and C = (0, 0, 0), we can parametrize the ray by the expression

$$C + s(X - C) = sX. \qquad (1.1)$$

To find the intersection between this line and the image plane z = 1 we need to find s such that the third
coordinate sX3′ of sX fulfills sX3′ = 1. Therefore, assuming X3′ ≠ 0, we get s = 1/X3′ and the projection

$$x = \left(\frac{X_1'}{X_3'},\ \frac{X_2'}{X_3'},\ 1\right). \qquad (1.2)$$
Figure 1.4: The model viewed from the side. (The vector e′x points out of the figure.)
Exercise 1. Compute the image of the cube with corners in (±1, ±1, 2) and (±1, ±1, 4), see Figure 1.5.
Solution: To find the projections we simply divide by the third coordinate. For the front face we get the corner
projections (±1/2, ±1/2) and for the back face (±1/4, ±1/4).
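This computation is easy to check numerically. The following is a minimal sketch (the helper name `project` is our own, not from the notes) that projects the cube corners by dividing with the third coordinate as in (1.2):

```python
import numpy as np

def project(X):
    """Pinhole projection of the 3D points in the columns of X onto z = 1."""
    return X[:2] / X[2]  # divide the first two coordinates by the third

# The eight cube corners (+-1, +-1, 2) and (+-1, +-1, 4) as columns.
corners = np.array([[x, y, z] for z in (2, 4) for x in (-1, 1) for y in (-1, 1)],
                   dtype=float).T
print(project(corners))  # front face: +-1/2, back face: +-1/4
```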
In our applications we will frequently have cameras that have been capturing images from different viewpoints.
Therefore we need to have a way of modeling camera movements. A camera can undergo translation and rotation.
We will represent the translation with a vector t ∈ R³ and the rotation with a 3 × 3 matrix R. Since R is a rotation
matrix it has to be orthogonal, that is, fulfill $R^TR = I$, and have det(R) = 1.
To encode camera movements we introduce a new coordinate system, see Figure 1.6. We will refer to this coordinate
system as the global coordinate system or world coordinate system, since all the camera movements will be related
to this system. Typically all the scene point coordinates are also specified in this coordinate system.

Figure 1.6: The global coordinate system {ex, ey, ez} and the camera coordinate system {e′x, e′y, e′z}.

Let's assume that
a scene point X has coordinates (X1′ , X2′ , X3′ ) in the camera coordinate system and (X1 , X2 , X3 ) in the global
coordinate system. Since the camera can be rotated and translated there is a rotation matrix R and translation
vector t that relates the two coordinate systems via
$$\begin{pmatrix} X_1'\\ X_2'\\ X_3' \end{pmatrix} = R\begin{pmatrix} X_1\\ X_2\\ X_3 \end{pmatrix} + t. \qquad (1.3)$$
Exercise 2. What is the position of the camera (the coordinates of the camera center) in the global coordinate
system? What is the viewing direction in the global coordinate system?
Solution: The camera center is the origin in the camera coordinate system. If C is a vector containing the coordinates
of the camera center in the world coordinate system we will have

$$0 = RC + t \quad\Leftrightarrow\quad C = -R^Tt, \qquad (1.4)$$

since R is orthogonal. The viewing direction can be represented by the vector that goes from the camera center to
the point (0, 0, 1) in the camera coordinate system. If X contains the world coordinates of this point we have
$$\begin{pmatrix}0\\0\\1\end{pmatrix} = RX + t \quad\Leftrightarrow\quad X = R^T\begin{pmatrix}0\\0\\1\end{pmatrix} - R^Tt = R_3^T + C, \qquad (1.5)$$

where R3 is the third row of R. The viewing direction in the world coordinate system is therefore $R_3^T + C - C = R_3^T$.
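As a small numerical sketch of this result (the function name and the example camera are ours):

```python
import numpy as np

def camera_center_and_axis(R, t):
    """Camera center C = -R^T t and viewing direction R3 (third row of R)."""
    return -R.T @ t, R[2]

# Example: a camera rotated 90 degrees about the y-axis and then translated.
R = np.array([[0., 0., -1.],
              [0., 1., 0.],
              [1., 0., 0.]])
t = np.array([1., 2., 3.])
C, axis = camera_center_and_axis(R, t)
print(C, axis)  # center -R^T t and principal axis R3
```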
If we add an extra 1 to the scene point X we can write (1.3) in matrix form
$$X_3'\begin{pmatrix} X_1'/X_3'\\ X_2'/X_3'\\ 1\end{pmatrix} = \begin{pmatrix} X_1'\\ X_2'\\ X_3'\end{pmatrix} = \begin{pmatrix} R & t\end{pmatrix}\begin{pmatrix} X_1\\ X_2\\ X_3\\ 1\end{pmatrix}. \qquad (1.6)$$
Here [R t] is the 3 × 4 matrix where the first 3 × 3 block is R and the last column is t. As we saw previously,
the projection of X is given by (1.2). Therefore we conclude from (1.6) that the projection x of a scene point X
(with coordinates given in the global coordinate system) is obtained by computing the vector $v = [R\ t]\binom{X}{1}$
and dividing the elements of v by the third coordinate of v. From here on we will always assume that scene point
coordinates are given in the global coordinate system, if nothing else is stated.
Exercise 3. Compute the projection of the point X = (0, 0, 1) in the camera

$$P = \frac{1}{3}\begin{pmatrix} 1 & 2 & -2 & 0\\ 2 & 1 & 2 & 0\\ -2 & 2 & 1 & 1\end{pmatrix}. \qquad (1.7)$$

Which 3D points project to the same image point?

Solution: When adding an extra one to the 3D point and multiplying with P we get

$$v = \frac{1}{3}\begin{pmatrix} 1 & 2 & -2 & 0\\ 2 & 1 & 2 & 0\\ -2 & 2 & 1 & 1\end{pmatrix}\begin{pmatrix}0\\0\\1\\1\end{pmatrix} = \frac{1}{3}\begin{pmatrix}-2\\2\\2\end{pmatrix}. \qquad (1.8)$$
Division with the third coordinate now gives the projection (−1, 1).
To get the projection (−1, 1) after dividing the vector v with its third coordinate, v must be of the form
v = λ(−1, 1, 1), where λ is any non-zero number. The 3D points that give this v can be found by solving

$$\lambda v = P\begin{pmatrix}X\\1\end{pmatrix} = RX + t. \qquad (1.9)$$

Solving for X gives

$$X = R^T(\lambda v - t) = \lambda R^Tv + C. \qquad (1.10)$$

Since λ is an arbitrary number the expression above can be interpreted as a line, and this line contains all points that
project to (−1, 1), see Figure 1.3. Note that the value λ = 2/3 gives back the original 3D point X = (0, 0, 1).
We say that a scene point is in front of the camera (or has positive depth) if its third coordinate is positive in the
camera coordinate system. According to (1.6) the third coordinate is given by

$$X_3' = \begin{pmatrix} R_3 & t_3\end{pmatrix}\begin{pmatrix}X\\1\end{pmatrix}, \qquad (1.11)$$

where R3 is the third row of R and t3 is the third coordinate of t. To determine whether a point is in front of the
camera we therefore compute $v = [R\ t]\binom{X}{1}$ and check if the third coordinate of v is positive.
Exercise 4. Which of the points that project to (−1, 1) in the camera P in Exercise 3 are in front of the camera?
Solution: Since v = (−1, 1, 1) we have by (1.9) that the third coordinate of X ′ = RX + t is λ. Therefore the
points in (1.10) are in front of the camera if λ > 0.
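A one-line test of this criterion, for a normalized camera P = [R t] (a sketch with our own naming):

```python
import numpy as np

def in_front(P, X):
    """True if the 3D point X (3-vector) has positive depth in P = [R t]."""
    return (P @ np.append(X, 1.0))[2] > 0
```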
Lecture 2:
Intrinsic Parameters and Projective Geometry
In the previous lecture we presented the pinhole camera model. Here the image plane was embedded in R³. That
is, image projections are given in the length unit of R³ (e.g. meters). Furthermore, the center of the image was
located at (0, 0, 1), and therefore has image coordinate (0, 0). For real cameras we typically obtain images where the
coordinates are measured in pixels with (0, 0) in the upper left corner. To be able to do geometrically meaningful
computations we need to transform pixel coordinates into the length unit of R3 . We do this by adding a mapping
from the image plane embedded in R3 to the real image, see Figure 2.1.
Figure 2.1: The mapping K from the image plane to the real image.
The mapping is represented by a 3 × 3 matrix

$$K = \begin{pmatrix}\gamma f & sf & x_0\\ 0 & f & y_0\\ 0 & 0 & 1\end{pmatrix}. \qquad (2.1)$$
This matrix contains what is usually referred to as the inner parameters of the camera, that is, focal length f,
principal point (x0, y0), skew s and aspect ratio γ. We will discuss these further in Lecture 4; for now we
only mention that the diagonal elements of K are all positive, implying that its determinant is also positive. The
complete mapping from a scene point to its pixel coordinates can then be written

$$\lambda x = K\begin{pmatrix} R & t\end{pmatrix}\begin{pmatrix}X\\1\end{pmatrix}, \qquad (2.2)$$

or in matrix form

$$\lambda x = P\mathbf X. \qquad (2.3)$$

Equation (2.3) is usually called the camera equations and the 3 × 4 matrix P = K[R t] is called the camera matrix.
Exercise 5. Compute the projection of the scene point X = (0, 0, 1) in the two cameras

$$P_1 = \begin{pmatrix}\frac{1}{\sqrt 2} & 0 & -\frac{1}{\sqrt 2} & 0\\ 0 & 1 & 0 & 0\\ \frac{1}{\sqrt 2} & 0 & \frac{1}{\sqrt 2} & 1\end{pmatrix} \quad\text{and}\quad P_2 = \begin{pmatrix}-1 & 0 & 1 & 0\\ 0 & -\sqrt 2 & 0 & 0\\ -1 & 0 & -1 & -\sqrt 2\end{pmatrix}, \qquad (2.4)$$

and show that they have the same camera center.

Solution: Multiplying the homogeneous coordinates (0, 0, 1, 1) with the two camera matrices gives

$$P_1\mathbf X = \begin{pmatrix}-\frac{1}{\sqrt 2}\\ 0\\ \frac{1}{\sqrt 2}+1\end{pmatrix} \quad\text{and}\quad P_2\mathbf X = \begin{pmatrix}1\\ 0\\ -1-\sqrt 2\end{pmatrix}. \qquad (2.5)$$

In both cases division with the third coordinate gives the projection $\left(\frac{-1}{1+\sqrt 2},\ 0\right)$.
The camera center C1 of P1 is given by the nullspace of P1, which we find by solving the system

$$\begin{pmatrix}\frac{1}{\sqrt 2} & 0 & -\frac{1}{\sqrt 2} & 0\\ 0 & 1 & 0 & 0\\ \frac{1}{\sqrt 2} & 0 & \frac{1}{\sqrt 2} & 1\end{pmatrix}\begin{pmatrix}X\\Y\\Z\\W\end{pmatrix} = \begin{pmatrix}0\\0\\0\end{pmatrix}. \qquad (2.6)$$
Since we have 3 equations and 4 unknowns, Gaussian elimination gives the solution parameterization

$$\begin{cases} X = -s/\sqrt 2\\ Y = 0\\ Z = -s/\sqrt 2\\ W = s\end{cases} \quad\text{and therefore}\quad C_1 = \begin{pmatrix}-1/\sqrt 2\\ 0\\ -1/\sqrt 2\end{pmatrix}. \qquad (2.7)$$
Finally, since $P_2 = -\sqrt 2\,P_1$ it is clear that $P_2\mathbf X = 0 \Leftrightarrow P_1\mathbf X = 0$ and therefore P1 and P2 have the same camera
center.
Remark 1. Since we always divide with the third coordinate when computing projections, rescaling of a camera
matrix P makes no difference. In the example above the two camera matrices P1 and P2 represent the same camera
and give the same projections. The same is true if we rescale the vector X. This observation motivates the use of
homogeneous coordinates that are often used in projective geometry.
As we have seen the interpretation of equation (2.3) is that the projection (x1 , x2 ) of the scene point with co-
ordinates (X1 , X2 , X3 ) can be found by first computing v = P X and then dividing v by its third coordinate.
There are several vectors v that give the same projection. For example, v = (3, 2, 1) gives the projection (3, 2) and
v = (6, 4, 2) gives (6/2, 4/2) = (3, 2).
Formally, we will say that two vectors x, y ∈ R³ are equivalent if there is a number λ ≠ 0 such that x = λy, and
write
x ∼ y. (2.8)
The two vectors are said to be representing the same element of the so called two dimensional projective space
P2 . The space consists of all the elements that can be represented by vectors in R3 (with the exception of the zero
vector). For example, the two vectors (6, 9, 3) and (4, 6, 2) are equivalent and therefore both represent the same
element of P2 . Furthermore, by dividing with the third coordinate we can interpret this element as a point in R2 ,
namely (2, 3). The vectors (6, 9, 3) and (4, 6, 2) are called homogeneous coordinates of (2, 3).
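The passage above amounts to a single division. A minimal sketch:

```python
import numpy as np

def to_cartesian(x):
    """Cartesian point represented by a homogeneous vector (last coordinate nonzero)."""
    return x[:-1] / x[-1]

print(to_cartesian(np.array([6., 9., 3.])))  # [2. 3.]
print(to_cartesian(np.array([4., 6., 2.])))  # [2. 3.], the same element of P^2
```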
Note that in Lecture 1 we interpreted the result of the multiplication

$$v = \begin{pmatrix} R & t\end{pmatrix}\begin{pmatrix}X\\1\end{pmatrix} = RX + t \qquad (2.9)$$

as a 3D point in the camera coordinate system. We can now interpret the result as homogeneous coordinates of
the projection. To interpret the result as a regular point we need to divide by the third coordinate.
There are however elements in P2 that cannot be interpreted as points in R2 . Specifically, if the third coordinate is
zero we cannot divide by it. We will soon see that these points also have a simple geometric interpretation.
Similarly, the 4 × 1 vector $\mathbf X = \binom{X}{1}$ can be seen as a homogeneous representative of a point in R³ with
regular coordinates X. Any re-scaled version $\binom{\gamma X}{\gamma}$, γ ≠ 0, of this vector represents the same 3D point. In addition
the projection is unaffected by the scale since

$$\begin{pmatrix} R & t\end{pmatrix}\begin{pmatrix}\gamma X\\ \gamma\end{pmatrix} = \gamma v \sim v. \qquad (2.10)$$
The projective space of n dimensions Pn is defined similarly as P2 . In general the homogeneous coordinates for
representing Pn are the vectors of Rn+1 , and if coordinate n + 1 is not zero then we can interpret them as points
of Rn by dividing with this coordinate.
From linear algebra we know that a line in R2 can be represented by the equation
ax + by + c = 0, (2.11)
where (a, b, c) ̸= (0, 0, 0). For a point x ∼ (x, y, z) in P2 we instead use the modified formula
ax + by + cz = 0. (2.12)
If x is a regular point with Cartesian coordinates (x, y), that is, x ∼ (x, y, 1) then it is clear that the two equations
(2.11) and (2.12) give the same result. Note that (2.11) can be seen as the scalar product of the vectors (x, y, 1)
and (a, b, c). If we use (λx, λy, λ) to represent x instead of (x, y, 1) we get the scalar product

$$\lambda(ax + by + c) = 0, \qquad (2.13)$$

and therefore (2.12) works with any homogeneous representative of x. Similarly, if c ≠ 0 we can represent l with
(a/c, b/c, 1) instead of (a, b, c).
Note that lines and points behave in the same way here. They are both represented with 3-vectors. As a consequence
of this we say that points are dual to lines in P2 . That is, whenever we have a theorem involving lines and points
in P2 we can always exchange points for lines and get a dual statement. For example, the statement "Two lines
intersect each other in one point" is dual to "For any two points there is one line going through them both". (Note
that the first statement is not true in R2 if the lines are parallel.)
In three dimensions the equation

$$ax + by + cz + d = 0 \qquad (2.14)$$

represents a plane. Therefore, for the space P³ (the space consisting of all elements that can be represented by
vectors in R⁴) we have duality between points and planes instead.
Exercise 6. Compute the point of intersection x ∈ P2 (x ∼ (x, y, z)) of the two lines l1 ∼ (−1, 0, 1) and
l2 ∼ (0, −1, 1).
Exercise 7. Compute the line l ∼ (a, b, c) passing through the points x1 ∼ (−1, 0, 1) and x2 ∼ (0, −1, 1).
(Hint: look at the previous exercise.)
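A practical way to solve both exercises is the cross product: a vector orthogonal to both l1 and l2 satisfies (2.12) for both lines and therefore represents the intersection point, and by duality the same computation gives the line through two points. A sketch using the points and lines of the exercises:

```python
import numpy as np

l1, l2 = np.array([-1., 0., 1.]), np.array([0., -1., 1.])
x = np.cross(l1, l2)      # homogeneous intersection point of the two lines
print(x[:2] / x[2])       # the point (1, 1)

x1, x2 = np.array([-1., 0., 1.]), np.array([0., -1., 1.])
print(np.cross(x1, x2))   # by duality: the line through the two points, x + y + 1 = 0
```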
In the space P² every pair of lines has a common intersection point, even parallel ones. Consider for example the
two lines l1 = (−1, 0, 1) and l2 = (1, 0, 1), see Figure 2.2. If x lies on both lines it is a solution to the system of equations

$$\begin{cases} l_1^Tx = 0\\ l_2^Tx = 0\end{cases} \quad\Leftrightarrow\quad \begin{cases} -x + z = 0\\ x + z = 0\end{cases}. \qquad (2.17)$$
Hence, for example (0, 1, 0) is a representative of our intersection point. In this case we cannot interpret the
result by dividing with the third coordinate since it is zero, which makes sense since the lines are parallel
and therefore do not intersect in R². To interpret (0, 1, 0) geometrically we look at (0, 1, ϵ), where ϵ is a small
positive number. This point has non-zero third coordinate and is equivalent to (0, 1/ϵ, 1), that is, it is a point with
x-coordinate zero and a very large y-coordinate. Making ϵ smaller we see that (0, 1, 0) can be interpreted as a point
infinitely far away in the direction (0, 1). We call this type of point an ideal point or a point at infinity.
Figure 2.2: The two parallel lines l1 and l2 (x = 1 and x = −1) in R².
Note that if we instead assume that ϵ is a small negative number we get a point far away in the direction (0, −1).
We therefore do not distinguish between these points, and accordingly (0, 1, 0) ∼ (0, −1, 0).
The line z = 0 is called the ideal line or the line at infinity since it only contains points that have third coordinate
0.
Similarly, we can interpret (x, y, z, 0) as an ideal point in P3 , infinitely far away in the direction (x, y, z). In P3
all points at infinity form a plane. We will sometimes project ideal points in P3 to regular points in P2 . The result
of such a projection is called a vanishing point.
Lecture 3:
Projective Transformations and Camera Models
In Lecture 2 we introduced homogeneous coordinates and the projective spaces Pn . Each point in Pn is represented
with a vector in R^{n+1}\{0}. In addition, two vectors x and y represent the same point if x = λy, λ ≠ 0, in which
case we say that x ∼ y.
Next we will consider mappings between these spaces that can be represented with matrices. A projective trans-
formation is an invertible mapping Pn → Pn defined by
x ∼ Hy (3.1)
where x ∈ Rn+1 and y ∈ Rn+1 are homogeneous coordinates representing elements of Pn and H is an invertible
(n + 1) × (n + 1) matrix. Projective transformations are also often called homographies.
Exercise 8. Show that it does not matter what representative we choose, the result will be the same. (Hint: y and
λy are two representatives of the same point.)
Projective mappings occur often when working with images. In Lecture 2 we saw one example of such a mapping,
namely the K-matrix. In the camera equation
x ∼ K [R t] X (3.3)
the matrix K transforms the point [R t] X in the image plane to the real image coordinate system (with the unit
pixels).
Another example is point transfer via a scene plane. Figure 3.1 shows two images of a desk with a roughly planar
surface. With this setup there is a homography that transforms the points of one image to the other, given by
(3.1), where H is a 3 × 3 matrix. To find the transformation we need to determine the elements of H. There
are 9 elements, but since the scale does not matter (H and λH represent the same transformation) there are only
8 degrees of freedom. Now suppose that we have n points yi, i = 1, ..., n that we know are transformed to n
corresponding points xi, i = 1, ..., n in the second image. Each point pair gives us three equations
λi xi = Hyi (3.4)
Figure 3.1: Two images of a roughly planar surface with 4 detected point correspondences.
(recall that xi and Hyi are vectors of size 3) but one new unknown λi is introduced. We now have 3n equations
and 8 + n degrees of freedom. To be able to find H we therefore need
3n ≥ 8 + n ⇔ 2n ≥ 8 ⇔ n ≥ 4 (3.5)
point correspondences. Figure 3.1 shows four point correspondences that can be used to compute H. (An approach
for doing this, the so called DLT method, will be presented in Lecture 4.)
Figure 3.2: Left: the result of applying H to the left image in Figure 3.1. Right: The transformed image overlaid on
the right image of Figure 3.1.
When H has been computed we can transform the other points in the image. Figure 3.2 shows the transformation
of the left image and the transformed image overlaid on the right image. The images agree well where the scene is
roughly planar.
When looking carefully at the checkerboard pattern it is evident that this transformation preserves lines, that is,
points on a line in the original image are mapped to another line in the new image. This is always the case for
projective transformations.
Exercise 9. Assume that y lies on the line l, that is, lT y = 0, and that x ∼ Hy. Show that x lies on the line
l̂ = (H −1 )T l.
Solution: We have

$$0 = l^Ty = l^TH^{-1}Hy = \left((H^{-1})^Tl\right)^THy \sim \hat l^Tx, \qquad (3.6)$$

which means that lᵀy = 0 if and only if l̂ᵀx = 0.
A special case of a projective transformation Pⁿ → Pⁿ is the affine transformation. For this type of mapping the
matrix H has the shape

$$H = \begin{pmatrix} A & t\\ 0 & 1\end{pmatrix}, \qquad (3.7)$$
where A is an invertible n×n matrix, t is an n×1 vector and 0 is a 1×n vector of all zeros. Besides being projective,
the affine transformation has the special property that parallel lines are mapped to parallel lines. Furthermore, it
preserves the line at infinity, that is, ideal points are mapped to ideal points and regular points to regular points. If we
only consider points in R2 (with regular Cartesian coordinates) then the transformation can be written x = Ay +t.
An example of an affine transformation is shown in Figure 3.3.
Another special case is the similarity transformation, where

$$H = \begin{pmatrix} sR & t\\ 0 & 1\end{pmatrix}, \qquad (3.8)$$

where R is an n × n rotation matrix and s is a positive number. This mapping also preserves angles between lines. An
example of a similarity transformation is shown in Figure 3.4.

If s = 1 in (3.8) then the transformation is called Euclidean. This mapping also preserves distances between
points. An example of a Euclidean transformation is shown in Figure 3.5.
Figure 3.6: Left - A regular pinhole camera where all viewing rays go through the camera center. Right - The affine
camera model where the viewing rays are parallel to the normal of the image plane.
Previously we have derived the pinhole camera model. Here we present a class of cameras that are on one hand less
accurate in practical applications but on the other hand yield simpler solutions to multi-view problems.
The affine camera model is in a sense a simpler model than the regular pinhole camera since no division by the third
coordinate is needed to compute the projection. The geometric interpretation is that the incoming viewing rays
are parallel and do not pass through a common camera center as illustrated in Figure 3.6. While regular cameras
typically do not work this way, it is often a good approximation of the pinhole model if the observed scene points
are at roughly the same depth.
Suppose that X′ ∼ (X1′, X2′, X3′, 1) in the camera coordinate system. The projection x ∼ (x1, x2, 1) is obtained by simply
discarding the X3′ coordinate. In matrix form this can be written

$$\begin{pmatrix}x_1\\x_2\\1\end{pmatrix} \sim \begin{pmatrix}1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 1\end{pmatrix}\begin{pmatrix}X_1'\\X_2'\\X_3'\\1\end{pmatrix} = \begin{pmatrix}I_{2\times3} & 0\\ 0 & 1\end{pmatrix}X', \qquad (3.9)$$
where $I_{2\times3} = \begin{pmatrix}1 & 0 & 0\\ 0 & 1 & 0\end{pmatrix}$ and the zeros denote zero blocks of appropriate sizes. The projection of $\mathbf X \sim \binom{X}{1}$ in world coordinates is therefore

$$x \sim \begin{pmatrix}I_{2\times3} & 0\\ 0 & 1\end{pmatrix}\begin{pmatrix}R & t\\ 0 & 1\end{pmatrix}\begin{pmatrix}X\\1\end{pmatrix} = \begin{pmatrix}R_{2\times3} & t_{2\times1}\\ 0 & 1\end{pmatrix}\begin{pmatrix}X\\1\end{pmatrix}, \qquad (3.10)$$
where R2×3 contains the first two rows of R and t2×1 the first two elements of t. Finally, to obtain the real image
in pixels we multiply with a matrix K, which gives

$$x \sim \underbrace{K\begin{pmatrix}R_{2\times3} & t_{2\times1}\\ 0 & 1\end{pmatrix}}_{=P}\begin{pmatrix}X\\1\end{pmatrix}. \qquad (3.11)$$
If $K = \begin{pmatrix}K_{2\times2} & 0\\ 0 & 1\end{pmatrix}$ we get

$$P = \begin{pmatrix}K_{2\times2}R_{2\times3} & K_{2\times2}t_{2\times1}\\ 0 & 1\end{pmatrix} = \begin{pmatrix}A_{2\times3} & t_{2\times1}\\ 0 & 1\end{pmatrix}. \qquad (3.12)$$
Any matrix A2×3 of size 2 × 3 can be factorized into K2×2 R2×3 where K2×2 is upper triangular and R2×3
contains two rows from a rotation matrix (using the Gram-Schmidt process, which we will discuss in Lecture 5).
Hence this is the most general type of affine camera; the only requirement is that the last row of P is (0 0 0 1).
Similar to an affine transformation, the affine camera preserves parallelism in the sense that two lines that are parallel
in 3D are projected onto two lines that are parallel in the image.
If $K = \begin{pmatrix}sI_{2\times2} & 0\\ 0 & 1\end{pmatrix}$ we get

$$P = \begin{pmatrix}sR_{2\times3} & st_{2\times1}\\ 0 & 1\end{pmatrix}. \qquad (3.13)$$
This camera type is called scaled orthographic or sometimes weak perspective. The name weak perspective comes
from the fact that by changing the scale s one can approximate the perspective effect observed in a pinhole model
when a single object moves away or towards the camera.
If K = I we get

$$P = \begin{pmatrix}R_{2\times3} & t_{2\times1}\\ 0 & 1\end{pmatrix}, \qquad (3.14)$$

the so called orthographic camera. In addition to parallelism this camera also preserves distances that are
perpendicular to the viewing direction. Note that when performing projection in any of these models we do not
need to divide by the third coordinate. The projection can therefore also be expressed in regular coordinates as
$$\begin{pmatrix}x_1\\x_2\end{pmatrix} = A_{2\times3}\begin{pmatrix}X_1\\X_2\\X_3\end{pmatrix} + t_{2\times1}. \qquad (3.15)$$
This gives the camera center C ∼ (0, 0, 1, 0), which is an ideal point that we can interpret as a point infinitely far
away in the direction (0, 0, 1).
3.3 1D Cameras
Navigation for autonomous vehicles is often performed using 1D laser scanners such as the one shown in Figure 3.7.
These devices send out a signal that bounces off a reflector and returns to the sender. The result is a
measurement of the direction in which the reflection was observed. This direction can be thought of as either an
angle or a bearing (a unit vector parallel to the ground plane). The projection into a 1D camera can be seen as a
mapping P² → P¹. Similar to the regular camera model we can use homogeneous coordinates $\mathbf X \sim (X_1, X_2, X_3)$ for the 2D point
and $x \sim (x_1, x_2)$ for its 1D projection, and derive the camera equations

$$\lambda x = \underbrace{K_{2\times2}\begin{pmatrix}R_{2\times2} & t_{2\times1}\end{pmatrix}}_{P_{2\times3}}\mathbf X, \qquad (3.18)$$
where K2×2 is triangular and R2×2 is a 2D-rotation matrix. In contrast to the regular camera matrix we usually
don’t divide by the last coordinate. Instead we typically compute a unit vector, which corresponds to the intersection
Figure 3.7: Autonomous vehicles (left) can perform navigation using laser scanners (right) that provide angular
measurements or bearings (unit vectors in the plane).
with a circle as illustrated in Figure 3.8, or the angle ϕ with respect to the e′y axis. The camera equations can then
be written

$$\lambda\begin{pmatrix}\sin\phi\\ \cos\phi\end{pmatrix} = P_{2\times3}\mathbf X. \qquad (3.19)$$
Note that from a projective geometry point of view it makes no difference whether we choose to work with unit
vectors or with vectors with last element 1. They are both homogeneous coordinates representing the same point
in P¹.
Figure 3.8: The projection into a 1D camera works similarly to a regular camera. However instead of having an
image line at y = 1 we usually intersect with a circle of radius 1.
Exercise 11. The cameras $P_1 = \begin{pmatrix}1 & 0 & 0\\ 0 & 1 & 0\end{pmatrix}$ and $P_2 = \begin{pmatrix}1 & 0 & -1\\ 0 & 1 & 0\end{pmatrix}$ are observing the point X at angles
ϕ1 = π/4 and ϕ2 = 0 respectively. What is the position of the point X?
Solution: We have $(\sin\phi_1, \cos\phi_1) = \left(\frac{1}{\sqrt 2}, \frac{1}{\sqrt 2}\right) \sim (1, 1)$ and $(\sin\phi_2, \cos\phi_2) = (0, 1)$. If X = (X, Y, Z)
we therefore need to solve the system

$$\begin{cases}\lambda_1\begin{pmatrix}1\\1\end{pmatrix} = P_1\mathbf X\\[2mm] \lambda_2\begin{pmatrix}0\\1\end{pmatrix} = P_2\mathbf X\end{cases} \quad\Leftrightarrow\quad \begin{pmatrix}1 & 0 & 0 & -1 & 0\\ 0 & 1 & 0 & -1 & 0\\ 1 & 0 & -1 & 0 & 0\\ 0 & 1 & 0 & 0 & -1\end{pmatrix}\begin{pmatrix}X\\Y\\Z\\\lambda_1\\\lambda_2\end{pmatrix} = 0. \qquad (3.20)$$

It is easily seen that X = λ1, Y = λ1, Z = X, λ2 = λ1 solves this system for any choice of λ1, which means that
(X, Y, Z) = λ1(1, 1, 1) ∼ (1, 1, 1). (Which we can also geometrically interpret as the point (1, 1) in R².)
Lecture 4:
Camera Calibration, DLT and Depth
In this section we will introduce the inner parameters of the cameras. Recall the camera equations

$$\lambda x = P\mathbf X, \qquad (4.1)$$

where P = K[R t], K is a 3 × 3 matrix, R is a 3 × 3 rotation matrix and t is a 3 × 1 vector. The 3 × 4 matrix [R t]
encodes the orientation and position of the camera with respect to a reference coordinate system. Given a 3D point
in homogeneous coordinates X the product [R t] X can be interpreted as the 3D coordinates of the scene point
in the camera coordinate system. Note that alternatively we can interpret the result as the homogeneous coordinates
of the projection of X into the image plane embedded in R3 , since the projection in the camera coordinate system
is computed by division with the third coordinate.
The 3 × 3 matrix K transforms the image plane in R3 to the real image coordinate system (with unit pixels), see
Figure 4.1.
Figure 4.1: The two steps of the camera mapping: [R t] takes the scene point X to the image plane, and K maps
the image plane to the pixel coordinate system (here a 640 × 480 image).
The parameter f is called the focal length. This parameter re-scales the image coordinates into pixels. The point
(x0 , y0 ) is called the principal point. For many cameras it is enough to use the focal length and principal point.
In this case the K matrix transforms the image points according to
$$\begin{pmatrix}fx + x_0\\ fy + y_0\\ 1\end{pmatrix} = \begin{pmatrix}f & 0 & x_0\\ 0 & f & y_0\\ 0 & 0 & 1\end{pmatrix}\begin{pmatrix}x\\y\\1\end{pmatrix}, \qquad (4.3)$$
that is, the coordinates are scaled by the focal length and translated by the principal point. Note that the center
point (0, 0, 1) of the image in R3 is transformed to the principal point (x0 , y0 ).
The parameter γ is called the aspect ratio. For cameras where the pixels are not square the re-scaling needs to be
done differently in the x-direction and the y-direction. In such cases the aspect ratio γ will take a value different
from 1.
The final parameter s is called the skew. This parameter corrects for tilted pixels, see Figure 4.2, and is typically
zero.
A camera P = K [R t] is called calibrated if the inner parameters K are known. For such cameras we can
eliminate the K matrix from the camera equations by multiplying both sides of (4.1) with K −1 from the left. If
we let x̃ = K −1 x we get
λx̃ = K −1 K [R t] X = [R t] X. (4.4)
The new camera matrix [R t] is called the normalized (calibrated) camera and the new image points x̃ are called
the normalized image points. (Note that later in this lecture there is a different concept of normalization that is
used for improving stability of computations. However in the context of calibrated cameras, normalization always
means multiplication with K −1 .)
The calibration model presented in this section is limited in the sense that the normalization is carried out by
applying the homography K⁻¹ to the image coordinates. For some cameras this is not sufficient. For example, in
the image displayed in Figure 4.3, lines that are straight in 3D do not appear as straight lines in the image. Such so
called radial distortion is common in cameras with a wide field of view and can not be removed with a homography.

Figure 4.3: Radial distortion can not be handled with the K matrix. This requires a more complicated model.
The main problem of interest in this course is the Structure from Motion problem, that is, given image projections
xij (of scene point j in image i) determine both 3D point coordinates Xj and camera matrices Pi such that

$$\lambda_{ij}x_{ij} = P_i\mathbf X_j. \qquad (4.5)$$

Note that the scalars λij are also unknown and need to be determined. However, we are primarily interested in the
scene points and cameras; the scalars are by-products of this formulation.
If the calibration is unknown, that is, Pi can be any non-zero 3 × 4 matrix, then the solution to this problem is called
a projective reconstruction. Such a solution can only be uniquely determined up to a projective transformation.
To see this suppose that we have found cameras Pi and 3D-points Xj such that

$$\lambda_{ij}x_{ij} = P_i\mathbf X_j. \qquad (4.6)$$

To construct a different solution we can take any projective transformation H (P³ → P³) and let P̃i =
PiH and X̃j = H⁻¹Xj. The new cameras and scene points also solve the problem since

$$\tilde P_i\tilde{\mathbf X}_j = P_iHH^{-1}\mathbf X_j = P_i\mathbf X_j = \lambda_{ij}x_{ij}. \qquad (4.7)$$
This means that given a solution we can apply any projective transformation to the 3D points and obtain a new
solution. Since projective transformations do not necessarily preserve angles or parallel lines projective reconstruc-
tions can look distorted even though the projections they give match the measured image points. To the left in
Figure 4.4 a projective reconstruction of the Arch of Triumph in Paris is displayed.
Figure 4.4: Reconstructions of the Arch of Triumph in Paris. Left: Projective reconstruction. Right: Euclidean
reconstruction (known camera calibration). Both reconstructions provide the same projections.
One way to remove the projective ambiguity is to use calibrated cameras. If we normalize the image coordinates
using x̃ = K⁻¹x then the structure from motion problem becomes that of finding normalized (calibrated) cameras
[Ri ti] and scene points Xj such that
λij x̃ij = [Ri ti ] Xj , (4.8)
where the first 3 × 3 block Ri is a rotation matrix. The solution of this problem is called a Euclidean Recon-
struction. Given a solution [Ri ti] and Xj we can try to do the same trick as in the projective case. However,
when multiplying [Ri ti] with H the result does not necessarily have a rotation matrix in the first 3 × 3 block.
To achieve a valid solution we need H to be a similarity transformation,

$$H = \begin{pmatrix}sQ & v\\ 0 & 1\end{pmatrix}, \qquad (4.9)$$

where Q is a rotation. We then get

$$\frac{\lambda_{ij}}{s}x_{ij} = \frac{1}{s}\begin{pmatrix}R_i & t_i\end{pmatrix}\begin{pmatrix}sQ & v\\ 0 & 1\end{pmatrix}\underbrace{H^{-1}\mathbf X_j}_{=\tilde{\mathbf X}_j} = \begin{pmatrix}R_iQ & \frac{1}{s}(R_iv + t_i)\end{pmatrix}\tilde{\mathbf X}_j, \qquad (4.10)$$
which is a valid solution since Ri Q is a rotation. Hence, in the case of Euclidean reconstruction we do not have
the same distortion since similarity transformations preserve angles and parallel lines. Note that there is still an
ambiguity here. The entire reconstruction can be re-scaled, rotated and translated without changing the image
projections.
(Strictly speaking, Equation (4.10) only shows that we can apply a similarity transformation without violating the
rotational constraints. It does not show that if H is not a similarity the constraints are violated. This can however
be seen by considering the effects of H on all camera equations.)
In this section we will present a simple method for finding the camera parameters. We will do it in two steps:
1. First, we will compute a camera matrix P . To make sure that there is no projective ambiguity present (as in
Section 4.2) we will assume that the scene point coordinates are known. This can for example be achieved
by using an image of a known object where we have measured all the points by hand.
2. Secondly, once the camera matrix P is known we can factorize it into K[R t], where K is upper triangular
and R is a rotation. This can be done using the so called RQ-factorization.
In this section we will outline a method for finding the camera matrix P . We are assuming that the scene points
Xi and their projections xi are known. The goal is to solve the equations
λi xi = P Xi , i = 1, ..., n, (4.11)
where the λi and P are the unknowns. This problem, determining the camera matrix from known scene points and
projections, is called the resection problem. The 3 × 4 matrix P has 12 elements, but the scale is arbitrary and
therefore it only has 11 degrees of freedom. There are 3n equations (3 for each point projection), but each new
projection introduces one additional unknown λi . Therefore we need
3n ≥ 11 + n ⇒ n ≥ 6 (4.12)
points in order for the problem to be well defined. To solve the problem we will use a simple approach called Direct
Linear Transformation (DLT). This method formulates a homogeneous linear system of equations and solves it
by finding an approximate null space of the system matrix. If we let pi, i = 1, 2, 3 be 4 × 1 vectors containing the
rows of P, that is,

$$P = \begin{pmatrix}p_1^T\\ p_2^T\\ p_3^T\end{pmatrix}, \qquad (4.13)$$
then we can write (4.11) as

$$X_i^Tp_1 - \lambda_ix_i = 0 \qquad (4.14)$$
$$X_i^Tp_2 - \lambda_iy_i = 0 \qquad (4.15)$$
$$X_i^Tp_3 - \lambda_i = 0, \qquad (4.16)$$

or in matrix form

$$\begin{pmatrix}X_i^T & 0 & 0 & -x_i\\ 0 & X_i^T & 0 & -y_i\\ 0 & 0 & X_i^T & -1\end{pmatrix}\begin{pmatrix}p_1\\ p_2\\ p_3\\ \lambda_i\end{pmatrix} = 0. \qquad (4.17)$$
Note that since Xi is a 4 × 1 vector, each 0 on the left hand side of (4.17) actually represents a 1 × 4 block of zeros. Thus
the left hand side is a 3 × 13 matrix multiplied with a 13 × 1 vector. If we include all the projection equations in
one matrix we get a system of the form
$$\underbrace{\begin{pmatrix}
X_1^T & 0 & 0 & -x_1 & 0 & 0 & \cdots & 0\\
0 & X_1^T & 0 & -y_1 & 0 & 0 & \cdots & 0\\
0 & 0 & X_1^T & -1 & 0 & 0 & \cdots & 0\\
X_2^T & 0 & 0 & 0 & -x_2 & 0 & \cdots & 0\\
0 & X_2^T & 0 & 0 & -y_2 & 0 & \cdots & 0\\
0 & 0 & X_2^T & 0 & -1 & 0 & \cdots & 0\\
X_3^T & 0 & 0 & 0 & 0 & -x_3 & \cdots & 0\\
0 & X_3^T & 0 & 0 & 0 & -y_3 & \cdots & 0\\
0 & 0 & X_3^T & 0 & 0 & -1 & \cdots & 0\\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots
\end{pmatrix}}_{=M}
\underbrace{\begin{pmatrix}p_1\\ p_2\\ p_3\\ \lambda_1\\ \lambda_2\\ \lambda_3\\ \vdots\end{pmatrix}}_{=v} = 0. \qquad (4.18)$$
Here we are interested in finding a non-zero vector in the nullspace of M. Since the scale is arbitrary we can add
the constraint ∥v∥² = 1. In most real cases the system Mv = 0 will not have any exact solution due to noise in
the measurements. Therefore we will search for a solution to

$$\min_{\|v\|^2 = 1}\|Mv\|^2. \qquad (4.19)$$
We refer to this type of problem as a homogeneous least squares problem. Note that there are always at least
two solutions to (4.19) since ∥M v∥ = ∥M (−v)∥ and ∥v∥ = ∥ − v∥. These solutions give the same projections,
however for one of them the camera faces away from the scene points, thereby giving negative depths. If the
homogeneous representatives of the scene points have positive fourth coordinates then we should select the solution
where the λi are all positive.
An alternative formulation with only the p-variables can be found by noting that (4.11) means that the vectors xi
and PXi are parallel. This can be expressed using the vector product
xi × P Xi = 0, i = 1, ..., n. (4.20)
These equations are also linear in p1 , p2 , p3 and we can therefore set up a similar homogeneous least squares system
but without the λi .
The solution to (4.19) can be found by eigenvalue computations. If we let f(v) = vᵀMᵀMv and g(v) = vᵀv
we can write the problem as

$$\min_{g(v) = 1} f(v). \qquad (4.21)$$
From basic courses in Calculus (e.g. Persson-Böiers, Fler-dim.) we know that the solution must fulfill

$$\nabla f(v) = \gamma\nabla g(v) \quad\Leftrightarrow\quad M^TMv = \gamma v \qquad (4.22)$$

for some scalar γ (the Lagrange multiplier). Therefore the solution of (4.19) has to be an eigenvector of the matrix
MᵀM. Suppose v∗ is an eigenvector with eigenvalue γ∗. If we insert it into the objective function we get

$$f(v_*) = v_*^TM^TMv_* = \gamma_*v_*^Tv_* = \gamma_*. \qquad (4.23)$$
Since ∥v∗ ∥ = 1 we see that in order to minimize f we should select the eigenvector with the smallest eigenvalue.
Because of the special shape of MᵀM we can compute the eigenvectors efficiently using the so called Singular Value
Decomposition (SVD).
Any m × n matrix M can be factorized as

$$M = USV^T, \qquad (4.24)$$

where U and V are orthogonal matrices and S is an m × n diagonal matrix with non-negative diagonal entries,
the singular values of M. Since U is orthogonal we get

$$M^TM = (USV^T)^TUSV^T = VS^TU^TUSV^T = VS^TSV^T. \qquad (4.26)$$

Since SᵀS is a diagonal matrix this means that V diagonalizes MᵀM and therefore SᵀS contains the eigenvalues
and V the eigenvectors of MᵀM. The diagonal elements of SᵀS are ordered decreasingly: $\sigma_1^2, \sigma_2^2, \ldots, \sigma_r^2, 0, \ldots, 0$.
Thus, to find an eigenvector corresponding to the smallest eigenvalue we should select the last column of V. Note
that if r < n, that is, the matrix M does not have full rank, then the eigenvalue we select will be zero which means
that there is an exact nonzero solution to Mv = 0. In most cases however r = n due to noise.
We summarize the steps of the algorithm here:

1. Set up the matrix M in (4.18) from the measured projections xi and the known scene points Xi.
2. Compute the singular value decomposition

$$M = USV^T. \qquad (4.28)$$

3. Extract the solution v (the rows of P stacked with the depths λi) from the last column of V.
4. If the depths λi are negative, switch to the solution −v.
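A minimal sketch of these steps (function and variable names are our own; `xs` holds the n measured image points and `Xs` the corresponding scene points):

```python
import numpy as np

def resection_dlt(xs, Xs):
    """Estimate P from n >= 6 correspondences via homogeneous least squares (4.18)-(4.19).

    xs: n x 2 image points, Xs: n x 3 scene points. Returns a 3 x 4 camera matrix."""
    n = len(xs)
    M = np.zeros((3 * n, 12 + n))
    for i, (x, X) in enumerate(zip(xs, Xs)):
        Xh = np.append(X, 1.0)
        M[3 * i, 0:4] = Xh                          # row for X_i^T p1 - lambda_i x_i
        M[3 * i + 1, 4:8] = Xh                      # row for X_i^T p2 - lambda_i y_i
        M[3 * i + 2, 8:12] = Xh                     # row for X_i^T p3 - lambda_i
        M[3 * i:3 * i + 3, 12 + i] = -np.append(x, 1.0)
    _, _, Vt = np.linalg.svd(M)
    v = Vt[-1]                                      # last column of V
    P = v[:12].reshape(3, 4)
    return -P if np.all(v[12:] < 0) else P          # pick the positive-depth solution
```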
Normalization
The matrix M will contain the entries xi, yi and ones. Since the xi and yi are measured in pixels the values can be in
the thousands. In contrast the third homogeneous coordinate is 1, and therefore the matrix M contains coefficients
of highly varying magnitude. This can make the matrix MᵀM poorly conditioned, resulting in buildup of numerical
errors.

The numerics can often be greatly improved by translating the coordinates such that their "center of mass" is zero
and then rescaling the coordinates to be of magnitude roughly 1.
This can be achieved with a transformation of the form

$$N = \begin{pmatrix}s & 0 & -s\bar x\\ 0 & s & -s\bar y\\ 0 & 0 & 1\end{pmatrix}. \qquad (4.32)$$

This mapping will first translate the coordinates by (−x̄, −ȳ) and then re-scale the result with the factor s. If for
example (x̄, ȳ) is the mean point then the transformation

$$\tilde x_i = Nx_i \qquad (4.33)$$

moves the center of mass to the origin. Instead of (4.11) we then solve

$$\gamma_iNx_i = \tilde PX_i \qquad (4.35)$$

for the modified camera P̃ using the DLT method, and recover the sought camera matrix as P = N⁻¹P̃.
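A sketch of one common choice of N, rescaling so that the average distance to the centroid becomes 1 (the notes only require magnitudes of roughly 1):

```python
import numpy as np

def normalization_matrix(pts):
    """N in (4.32) for an n x 2 array of pixel points."""
    mean = pts.mean(axis=0)
    s = 1.0 / np.mean(np.linalg.norm(pts - mean, axis=1))
    return np.array([[s, 0.0, -s * mean[0]],
                     [0.0, s, -s * mean[1]],
                     [0.0, 0.0, 1.0]])

# Usage: run the DLT on the points N x_i, then recover P as inv(N) @ P_tilde.
```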
When the camera matrix has been computed we want to find the inner parameters K by factorizing P into

$$P = K\begin{pmatrix}R & t\end{pmatrix}, \qquad (4.36)$$

where K is right triangular and R is a rotation matrix. This can be done using the RQ-factorization, which is
equivalent to running the Gram-Schmidt process on the rows of P.
Theorem 2. If A is an n × n matrix then there is an orthogonal matrix Q and a right triangular matrix R such that
A = RQ. (4.37)
(If A is invertible and the diagonal elements of R are chosen positive then the factorization is unique.)
In order to be consistent with the notation in the rest of the lecture we will use K for the right triangular matrix
and R for the orthogonal matrix. Given a camera matrix P = [A a] we want to use RQ-factorization to find K
and R such that A = KR. If

$$K = \begin{pmatrix}a & b & c\\ 0 & d & e\\ 0 & 0 & f\end{pmatrix},\quad A = \begin{pmatrix}A_1^T\\ A_2^T\\ A_3^T\end{pmatrix} \quad\text{and}\quad R = \begin{pmatrix}R_1^T\\ R_2^T\\ R_3^T\end{pmatrix}, \qquad (4.38)$$
that is, R1, R2, R3 and A1, A2, A3 are 3 × 1 vectors containing the rows of R and A respectively, then we get

$$\begin{pmatrix}A_1^T\\ A_2^T\\ A_3^T\end{pmatrix} = \begin{pmatrix}a & b & c\\ 0 & d & e\\ 0 & 0 & f\end{pmatrix}\begin{pmatrix}R_1^T\\ R_2^T\\ R_3^T\end{pmatrix} = \begin{pmatrix}aR_1^T + bR_2^T + cR_3^T\\ dR_2^T + eR_3^T\\ fR_3^T\end{pmatrix}. \qquad (4.39)$$
From the third row of (4.39) we see that A3 = fR3. Since the matrix R is orthogonal, R3 has to have length
1. We therefore need to select

$$f = \|A_3\| \quad\text{and}\quad R_3 = \frac{1}{\|A_3\|}A_3 \qquad (4.40)$$
to get a positive coefficient f. When R3 is known we can proceed to the second row of (4.39). The equation
A2 = dR2 + eR3 tells us that A2 is a linear combination of two orthogonal vectors (both of length one). Hence,
the coefficient e can be computed from the scalar product

$$e = A_2^TR_3. \qquad (4.41)$$

Since A2 − eR3 = dR2, where R2 has length one, we select

$$d = \|A_2 - eR_3\| \quad\text{and}\quad R_2 = \frac{1}{d}(A_2 - eR_3), \qquad (4.42)$$

similar to what we did for f and R3 in (4.40). When R2 and R3 are known we use the first row of (4.39). The
coefficients b and c are given by the scalar products

$$b = A_1^TR_2 \quad\text{and}\quad c = A_1^TR_3, \qquad (4.43)$$

and finally

$$a = \|A_1 - bR_2 - cR_3\| \quad\text{and}\quad R_1 = \frac{1}{a}(A_1 - bR_2 - cR_3). \qquad (4.44)$$
The resulting matrix K is not necessarily of the form (4.2) since element (3, 3) might not be one. To determine
the individual parameters, focal length, principal point etc., we therefore need to divide the matrix by element
(3, 3). Note however, that this does not modify the camera in any way since the scale is arbitrary. In addition there
is no guarantee that the determinant of R is 1 and therefore R may not be a rotation. However if det(R) = −1
we can simply switch to −P = [−A −a], which gives the factorization −A = K(−R).
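The steps (4.40)-(4.44) translate directly into code. A sketch for the case det(A) > 0 where no sign switch is needed:

```python
import numpy as np

def rq_factorize(A):
    """Factor a 3 x 3 matrix A into K R as in (4.40)-(4.44),
    with K upper triangular (positive diagonal) and R orthogonal."""
    A1, A2, A3 = A                           # the rows of A
    K, R = np.zeros((3, 3)), np.zeros((3, 3))
    K[2, 2] = np.linalg.norm(A3)             # f
    R[2] = A3 / K[2, 2]                      # R3
    K[1, 2] = A2 @ R[2]                      # e
    u = A2 - K[1, 2] * R[2]
    K[1, 1] = np.linalg.norm(u)              # d
    R[1] = u / K[1, 1]                       # R2
    K[0, 1], K[0, 2] = A1 @ R[1], A1 @ R[2]  # b, c
    u = A1 - K[0, 1] * R[1] - K[0, 2] * R[2]
    K[0, 0] = np.linalg.norm(u)              # a
    R[0] = u / K[0, 0]                       # R1
    return K, R
```

Applying this to the A-part of the camera in Exercise 13 below and dividing K by its (3, 3) element reproduces the inner parameters found there.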
Exercise 13. If

$$P = \begin{pmatrix}3000 & 0 & -1000 & 1\\ 1000 & 2000\sqrt 2 & 1000 & 2\\ 2 & 0 & 2 & 3\end{pmatrix},$$

find the inner parameters of the camera.

Solution: To solve the problem we first need to compute the RQ factorization KR = A, where A is the first 3 × 3
block of P, and then extract the inner parameters from K. Performing the calculations outlined in (4.40)-(4.44)
gives

$$K = \begin{pmatrix}2000\sqrt 2 & 0 & 1000\sqrt 2\\ 0 & 2000\sqrt 2 & 1000\sqrt 2\\ 0 & 0 & 2\sqrt 2\end{pmatrix} \quad\text{and}\quad R = \begin{pmatrix}\frac{1}{\sqrt 2} & 0 & -\frac{1}{\sqrt 2}\\ 0 & 1 & 0\\ \frac{1}{\sqrt 2} & 0 & \frac{1}{\sqrt 2}\end{pmatrix}. \qquad (4.45)$$

Since K11 = K22 and K12 = 0 the aspect ratio is 1 and the skew is 0. The focal length is $\frac{2000\sqrt 2}{2\sqrt 2} = 1000$ and the
principal point is $\left(\frac{1000\sqrt 2}{2\sqrt 2}, \frac{1000\sqrt 2}{2\sqrt 2}\right) = (500, 500)$.
4.4 Depth
To determine if a point is in front of a camera we will compute its depth with respect to that camera. In Section 1.4
we defined the depth of a point X as the z-coordinate of the point in the local camera coordinate system.
Given a camera matrix P we would therefore need to compute the RQ-factorization to extract the extrinsic parameters
R and t in order to find the depth. In this section we instead derive a direct formula that uses the elements of P = [A a].
First we determine the principal axis. Recall that if P = K[R t], where R is a rotation and K is upper
triangular with positive diagonal elements, then the third row R3 of R is the principal axis. Note that if det(A) < 0
and A = KR then det(K) det(R) = det(A) < 0. Since an upper triangular K with positive diagonal elements
has positive determinant we must have det(R) = −1, and therefore R is orthogonal but not a rotation. Since
the camera matrix P is only defined up to an unknown scale factor we instead use P = sign(det(A))[A a] to
ensure that the RQ factorization gives a rotation. We then get the principal axis

$$R_3 = \frac{\operatorname{sign}(\det(A))}{\|A_3\|}A_3. \qquad (4.46)$$
The depth of a point $\mathbf X = \binom{X}{1}$, where X are the regular Cartesian coordinates, is given by

$$\operatorname{depth}(P, \mathbf X) = R_3^T(X - C), \qquad (4.47)$$

where C = −A⁻¹a are the Cartesian coordinates of the camera center. Since $A_3^TA^{-1} = \begin{pmatrix}0 & 0 & 1\end{pmatrix}$ and
$\begin{pmatrix}0 & 0 & 1\end{pmatrix}a = a_3$, where a3 is the third element of a, we get

$$\operatorname{depth}(P, \mathbf X) = \frac{\operatorname{sign}(\det(A))}{\|A_3\|}\begin{pmatrix}A_3^T & a_3\end{pmatrix}\mathbf X. \qquad (4.48)$$
Note that if P is an affine camera then A3 = 0. Therefore we cannot talk about points being in front of or behind
an affine camera. Similarly, for a point at infinity we have ρ = 0, where ρ is the fourth homogeneous coordinate,
and therefore only finite points can be said to be behind or in front of the camera.
Exercise 14. Which of the 3D points (1, 0, 1), (0, 1, −2) and (1, 1, 1) are in front of the camera

$$P = \begin{pmatrix}1 & 2 & 0 & 1\\ 0 & -2 & 1 & 0\\ 0 & 0 & 1 & 1\end{pmatrix}?$$
Solution: Since A is triangular its determinant is simply the product of the elements on the diagonal, which gives
det(A) = −2 and therefore sign(det(A)) = −1. Furthermore ∥A3∥ = 1. We can represent the first point with
the homogeneous coordinates X1 ∼ (1, 0, 1, 1), which means that we have selected a representation where ρ = 1.
We then get

$$\lambda = \begin{pmatrix}0 & 0 & 1 & 1\end{pmatrix}\begin{pmatrix}1\\0\\1\\1\end{pmatrix} = 2, \qquad (4.50)$$

which gives depth(P, X1) = −2. Similarly we get depth(P, X2) = 1 and depth(P, X3) = −2. Therefore only
the point (0, 1, −2) is in front of the camera.
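Formula (4.48) in code, applied to the points of the exercise (a sketch with our own naming):

```python
import numpy as np

def depth(P, X):
    """Depth of the 3D point X (3-vector) in the camera P = [A a], eq. (4.48)."""
    A, a = P[:, :3], P[:, 3]
    w = np.append(A[2], a[2])                 # the row [A3^T  a3]
    return np.sign(np.linalg.det(A)) / np.linalg.norm(A[2]) * (w @ np.append(X, 1.0))

P = np.array([[1., 2., 0., 1.],
              [0., -2., 1., 0.],
              [0., 0., 1., 1.]])
for X in ([1., 0., 1.], [0., 1., -2.], [1., 1., 1.]):
    print(X, depth(P, np.array(X)))           # -2.0, 1.0, -2.0
```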
Lecture 5:
Triangulation and Homography Estimation
5.1 Triangulation
The problem of finding an unknown 3D-point position X from measured projections xi, i = 1, ..., n, into known
cameras Pi is called triangulation. We have previously seen that if X is to project to the measurements then the
camera equations

$$\lambda_ix_i = P_i\mathbf X, \quad i = 1, ..., n, \qquad (5.1)$$

must hold. Therefore we need to find X and λi that solve these equations. For simplicity we will assume that
X is a regular point, not at infinity, with homogeneous coordinates $\mathbf X = \binom{X}{1}$, where X = (x, y, z) are the
regular Cartesian 3D-coordinates. Each projection gives us 3 equations and for n projections we therefore have
3n equations. Each projection also introduces one new unknown, the depth λi, and therefore we have 3 + n
unknowns. To solve the problem we need

$$3n \geq 3 + n \quad\Leftrightarrow\quad n \geq \frac{3}{2}, \qquad (5.2)$$

that is, two projections are enough. If Pi = [Ai ti] the camera equations can be written

$$\lambda_ix_i = A_iX + t_i \quad\Leftrightarrow\quad X = -A_i^{-1}t_i + \lambda_iA_i^{-1}x_i. \qquad (5.3)$$
Note that Ci = −Ai⁻¹ti are the 3D-coordinates of the camera center of Pi (see Lecture 1). The geometric
interpretation of the above expression is that X should be on a line going through Ci with direction vector
Ai⁻¹xi. When we vary λi we get different 3D points on the line and all of them project to xi. To find the correct
X we need to determine the intersection of the lines coming from each camera as illustrated in Figure 5.1.
Figure 5.1: Illustration of triangulation. Left: The sought point X is in the intersection of the two viewing lines.
Right: If the camera moves so that the two viewing lines intersect for all points the problem becomes degenerate
and we cannot determine X.
Exercise 15. Find the 3D-point X that projects to $x_1 \sim \left(\frac{1}{2}, \frac{1}{2}, 1\right)$ in $P_1 = \begin{pmatrix}I & 0\end{pmatrix}$ and to $x_2 \sim \left(0, \frac{1}{2}, 1\right)$ in
$P_2 = \begin{pmatrix}I & \begin{smallmatrix}-1\\0\\0\end{smallmatrix}\end{pmatrix}$.
Solution: Since the first 3 × 3 parts of both cameras are rotations, the camera centers are given (in regular Cartesian
coordinates) by

$$C_1 = -I^T\begin{pmatrix}0\\0\\0\end{pmatrix} = \begin{pmatrix}0\\0\\0\end{pmatrix} \quad\text{and}\quad C_2 = -I^T\begin{pmatrix}-1\\0\\0\end{pmatrix} = \begin{pmatrix}1\\0\\0\end{pmatrix}. \qquad (5.4)$$

The 3D point must be in the intersection of the lines that project to x1 and x2. According to (5.3) these lines can
be written $X = \lambda_1x_1$ and $X = \lambda_2x_2 + (1, 0, 0)^T$. Setting these equal to each other we obtain the system

$$\begin{cases}\frac{\lambda_1}{2} = 1\\[1mm] \frac{\lambda_1}{2} = \frac{\lambda_2}{2}\\[1mm] \lambda_1 = \lambda_2\end{cases} \quad\Leftrightarrow\quad \begin{cases}\lambda_1 = 2\\ \lambda_2 = 2\end{cases}, \qquad (5.5)$$

which gives $X = \lambda_1x_1 = (1, 1, 2)^T$.
Remark 2. Note that the system in (5.5) has three equations but only two unknowns and may therefore not have
a solution for other choices of the points x1 and x2 . Geometrically this means that the viewing lines might not
intersect. In real problems with noisy measurements we therefore need to solve the problem approximately in some
sense.
In some special cases the problem will have multiple intersections between the lines of (5.3). If the camera centers
Ci , i = 1, ..., n and the 3D point X are all located on one line in 3D, the expressions (5.3) become identical and
all points on the 3D-line fulfill them. This is called a degenerate configuration and is illustrated to the right in
Figure 5.1. While it is very unlikely that all camera centers and the 3D point should be exactly on the same line
in any real case, degenerate configurations are important since problem instances that are close to degeneracy often
exhibit poor numerics and become very sensitive to noise. Hence in practical cases when the camera centers and the 3D
point are close to being on a line we can expect poor accuracy when we try to recover the point.
In practical cases our measurements will not be exact but will be affected by noise, and therefore we cannot expect
the viewing rays to have an exact intersection point, as illustrated in Figure 5.2. Therefore we need to solve the
camera equations approximately in some sense.

Figure 5.2: In case of noisy measurements we cannot expect the viewing rays to intersect in a point. In such cases
we need to solve the camera equations approximately.

One way of doing this is to use the DLT method from Lecture 4. The problem is linear in the unknowns λi and X,
so we can find the homogeneous least squares solution by re-formulating the problem as

$$Mv = 0, \qquad (5.6)$$
with

$$M = \begin{pmatrix}P_1 & -x_1 & 0 & \cdots & 0\\ P_2 & 0 & -x_2 & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ P_n & 0 & 0 & \cdots & -x_n\end{pmatrix} \qquad (5.7)$$

and

$$v^T = \begin{pmatrix}\mathbf X^T & \lambda_1 & \lambda_2 & \cdots & \lambda_n\end{pmatrix}. \qquad (5.8)$$
As in Lecture 4 we solve

$$\min_{\|v\|^2 = 1}\|Mv\|^2 \qquad (5.9)$$

by computing the singular value decomposition M = USVᵀ of M and letting v be the last column of V.
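A sketch of this triangulation procedure (our own naming; `Ps` is a list of 3 × 4 cameras and `xs` the corresponding homogeneous image points):

```python
import numpy as np

def triangulate_dlt(Ps, xs):
    """Homogeneous least squares solution of (5.6)-(5.9) for one 3D point.

    Returns the homogeneous 4-vector X and the depths lambda_i."""
    n = len(Ps)
    M = np.zeros((3 * n, 4 + n))
    for i, (P, x) in enumerate(zip(Ps, xs)):
        M[3 * i:3 * i + 3, :4] = P
        M[3 * i:3 * i + 3, 4 + i] = -x
    _, _, Vt = np.linalg.svd(M)
    v = Vt[-1]
    return v[:4], v[4:]
```

For the noise-free data of Exercise 15 the recovered X is proportional to (1, 1, 2, 1).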
Figure 5.3 shows three examples of the DLT objective for triangulation. We use 1D-cameras for visualization
purposes. Here a camera is a 2 × 3 matrix which takes a point in the plane P² and projects it onto the camera line,
essentially providing a direction towards the point. For a given 2D point with homogeneous coordinates
$\mathbf X = (x, y, 1)$ we compute

$$f(X) = \min_{\lambda_1^2 + \lambda_2^2 + \gamma^2\|\mathbf X\|^2 = 1}\left\|M\begin{pmatrix}\gamma\mathbf X\\ \lambda_1\\ \lambda_2\end{pmatrix}\right\|^2 \qquad (5.10)$$
and plot the result as a function of X. We also plot the cameras and the projection of the optimal 2D point in the
same figure. When the cameras are close to each other the level curves of the objective function become thin and
elongated. As a result the location of the triangulated point will be more difficult to determine when noise is added
to the problem. In the three images in Figure 5.4 we perform the same triangulation experiment 100 times with noise
added to the image projections. The noise is the same for all three images; only the locations of the cameras
differ. The estimated 2D-points are the blue dots. When the cameras are sufficiently far apart all estimations end
up close to the true 2D-point. In contrast, when the cameras are close to the degenerate case the result becomes
very sensitive to noise and the exact distance from the cameras is difficult to determine accurately.
Figure 5.3: Visualization of the DLT objective for 1-dimensional cameras. When the cameras approach the degen-
erate case (by moving towards each other) the objective is roughly constant along the viewing direction.
Figure 5.4: Estimation of a 2D point from 1D images using the DLT objective. Close to degenerate cases the depth
of the point becomes uncertain.
Projective transformations (or homographies) were introduced in Lecture 3. In practical computer vision problems
they occur in a number of common settings. Figure 5.5 shows a plane induced homography. If a set of points
on a 3D-plane is projected into two images then there is a projective transformation P² → P² between the two
images (see Assignment 1 for a proof). Another example is when two images are captured by cameras that have
the same position, only differing by orientation and possibly calibration. This is useful for building panoramas (see
Section 5.2.3). When we are solving uncalibrated reconstruction problems all possible reconstructions are related
by a projective transformation P³ → P³ (see Lecture 4).
In homography estimation we want to find a projective transformation from Pk to Pk . We will describe the problem
for k = 2 but the procedure is exactly the same for any dimension. Given two sets of points yi and xi that are
related by a homography H we want to solve
λi yi = Hxi , i = 1 . . . n. (5.11)
The matrix H has 9 entries, but since its scale is arbitrary we can assume that one of them is fixed. For n points we
therefore have 3n equations and 8 + n unknowns and the problem can therefore be solved if
3n ≥ 8 + n ⇔ n ≥ 4. (5.12)
Note that in contrast to the triangulation problem, where we obtained an overdetermined system when using two
point projections, the problem can be solved exactly if n = 4. Before we compute the solution we make one
simplification. We assume that
$$x_1 = \begin{pmatrix}1\\0\\0\end{pmatrix},\quad x_2 = \begin{pmatrix}0\\1\\0\end{pmatrix},\quad x_3 = \begin{pmatrix}0\\0\\1\end{pmatrix} \quad\text{and}\quad x_4 = \begin{pmatrix}a\\b\\c\end{pmatrix}, \qquad (5.13)$$
where a, b, c are some known numbers. Note that x1 ,x2 are points at infinity. Although we cannot (directly)
observe such points in a real image, this assumption simplifies the following derivations. We treat the general case
at the end of this section through a change of variables.
Under the assumption (5.13) we have

$$Hx_i = \begin{cases}h_i & i = 1, 2, 3\\ ah_1 + bh_2 + ch_3 & i = 4,\end{cases} \qquad (5.14)$$

where h1, h2 and h3 are the columns of H. Inserting the first case into (5.11) gives

$$h_i = \lambda_iy_i, \quad i = 1, 2, 3. \qquad (5.15)$$
To determine the unknowns λi, i = 1, 2, 3 we consider the second case of (5.14), which inserted in (5.11) gives

$$\lambda_4y_4 = Hx_4 = \lambda_1ay_1 + \lambda_2by_2 + \lambda_3cy_3 = \begin{pmatrix}ay_1 & by_2 & cy_3\end{pmatrix}\begin{pmatrix}\lambda_1\\\lambda_2\\\lambda_3\end{pmatrix}. \qquad (5.16)$$
Note that

$$\begin{pmatrix}ay_1 & by_2 & cy_3\end{pmatrix} = \underbrace{\begin{pmatrix}y_1 & y_2 & y_3\end{pmatrix}}_{:=Y_{1:3}}\underbrace{\begin{pmatrix}a & 0 & 0\\ 0 & b & 0\\ 0 & 0 & c\end{pmatrix}}_{:=D_{x_4}}. \qquad (5.17)$$
Since the scale is arbitrary we can assume that λ4 = 1. When the matrix Y1:3Dx4 is invertible the depths are
therefore given by

$$\lambda_{1:3} = (Y_{1:3}D_{x_4})^{-1}y_4, \qquad (5.18)$$

and the homography is uniquely determined by the formula

$$H = Y_{1:3}D_{\lambda_{1:3}}, \qquad (5.19)$$

where

$$\lambda_{1:3} = \begin{pmatrix}\lambda_1\\\lambda_2\\\lambda_3\end{pmatrix} \quad\text{and}\quad D_{\lambda_{1:3}} = \begin{pmatrix}\lambda_1 & 0 & 0\\ 0 & \lambda_2 & 0\\ 0 & 0 & \lambda_3\end{pmatrix}. \qquad (5.20)$$
In the next section we will study under what conditions the inverses above exist and what happens when they don’t.
We conclude this section by showing how to estimate H from four generally placed points. This can be achieved
through a change of coordinates. Let X1:3 be the 3 × 3 matrix with columns x1, x2 and x3. If X1:3 is invertible
then (5.11) can be written

$$\lambda_iy_i = Hx_i = \underbrace{HX_{1:3}}_{:=\tilde H}\underbrace{X_{1:3}^{-1}x_i}_{:=\tilde x_i}. \qquad (5.21)$$
Since $X_{1:3}^{-1}X_{1:3} = I$ and xi, i = 1, 2, 3 are the columns of X1:3, it is now easy to see that in the new coordinates the x̃i
are of the desired form (5.13). We can therefore instead solve for H̃, as described above, and compute the solution
to (5.11) using $H = \tilde HX_{1:3}^{-1}$. We will return to the issue of X1:3 being invertible in Section 5.2.1.
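A sketch of the whole construction (5.13)-(5.21), assuming the point configuration is non-degenerate so that the inverses exist:

```python
import numpy as np

def homography_from_4pts(xs, ys):
    """H such that lambda_i y_i = H x_i for four homogeneous point pairs.

    xs, ys: 4 x 3 arrays of homogeneous coordinates."""
    X13 = xs[:3].T                            # columns x1, x2, x3
    abc = np.linalg.solve(X13, xs[3])         # x4 in the new coordinates (a, b, c)
    Y13 = ys[:3].T
    lam = np.linalg.solve(Y13 * abc, ys[3])   # (5.18); Y13 * abc = Y13 @ diag(a, b, c)
    H_tilde = Y13 * lam                       # (5.19): columns lambda_i y_i
    return H_tilde @ np.linalg.inv(X13)       # change coordinates back, (5.21)
```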
Suppose now that a = 0 in (5.13), so that Y1:3Dx4 is not invertible. Then (5.16) reduces to

$$\lambda_4y_4 = \lambda_2by_2 + \lambda_3cy_3. \qquad (5.22)$$

It is clear from this expression that λ1 cannot be determined in this case. However, the equations can still be solved
since a = 0 corresponds to x2, x3, x4 (and therefore y2, y3, y4) being linearly dependent. From Linear Algebra we
then know that there are numbers γ2, γ3 and γ4, not all of them zero, such that

$$\gamma_2y_2 + \gamma_3y_3 + \gamma_4y_4 = 0. \qquad (5.23)$$

Furthermore, we must have γ4 ≠ 0 since otherwise γ2y2 = −γ3y3, which means that y2 and y3 are homogeneous
representatives of the same point in P². Therefore selecting λ2 = −γ2/b, λ3 = −γ3/c, λ4 = γ4 and λ1 arbitrarily
gives a family of solutions, all mapping the 4 points correctly. Figure 5.6 shows an example of a degenerate case
where three of the four black points are on a line. Both the estimated transformations map the black points to
themselves but differ elsewhere. The first homography maps the blue grid to the green one while the second maps
the blue grid to the red one.
As in the case of triangulation we often have overdetermined systems with noisy measurements. In such cases we
can again apply the DLT approach. Suppose that we want to solve (5.11) with n ≥ 4. Let hⁱ be row i of H, that
is,

$$H = \begin{pmatrix}h^1\\ h^2\\ h^3\end{pmatrix}, \qquad (5.24)$$

and

$$y_i = \begin{pmatrix}u_i\\ v_i\\ w_i\end{pmatrix}. \qquad (5.25)$$
Figure 5.6: Two examples of homographies that map the four black points to themselves but differ for other points.
The blue grid (left) is mapped to the green and red grids (right) by the two transformations respectively.
Here H is a 3 × 3 matrix so h¹, h² and h³ are 1 × 3 matrices. By stacking all our unknowns in a vector

$$v^T = \begin{pmatrix}h^1 & h^2 & h^3 & \lambda_1 & \lambda_2 & \cdots & \lambda_n\end{pmatrix} \qquad (5.26)$$

we can write all the equations (5.11) as a homogeneous linear system Mv = 0, where

$$M = \begin{pmatrix}
x_1^T & 0 & 0 & -u_1 & 0 & \cdots & 0\\
0 & x_1^T & 0 & -v_1 & 0 & \cdots & 0\\
0 & 0 & x_1^T & -w_1 & 0 & \cdots & 0\\
x_2^T & 0 & 0 & 0 & -u_2 & \cdots & 0\\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots
\end{pmatrix}, \qquad (5.27)$$

and solve $\min_{\|v\|^2 = 1}\|Mv\|^2$ using the SVD as before.
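A sketch of the corresponding estimation code (our own naming; the points are given as homogeneous 3-vectors):

```python
import numpy as np

def homography_dlt(xs, ys):
    """Estimate H from n >= 4 correspondences lambda_i y_i = H x_i, eq. (5.26)-(5.27).

    xs, ys: n x 3 arrays of homogeneous point coordinates."""
    n = len(xs)
    M = np.zeros((3 * n, 9 + n))
    for i in range(n):
        for r in range(3):                    # rows for h^1, h^2 and h^3
            M[3 * i + r, 3 * r:3 * r + 3] = xs[i]
        M[3 * i:3 * i + 3, 9 + i] = -ys[i]
    _, _, Vt = np.linalg.svd(M)
    return Vt[-1][:9].reshape(3, 3)
```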
As an example of homography estimation we will show how we can stitch together a number of images taken from
the same position (camera center). For ease of notation we will assume that we have calibrated cameras and that the
image coordinates are normalized using the inverse of the calibration matrices. Since we have captured the images
in the same point and we are free to choose the global coordinate system, we will assume that the camera center is
at the origin. This means that for two corresponding image points xi and yi, being projections of the 3D point
$\mathbf X_i = \binom{X_i}{1}$, we have the following system of equations:

$$\gamma_ix_i = \begin{pmatrix}R_1 & 0\end{pmatrix}\begin{pmatrix}X_i\\1\end{pmatrix}, \qquad (5.30)$$

$$\eta_iy_i = \begin{pmatrix}R_2 & 0\end{pmatrix}\begin{pmatrix}X_i\\1\end{pmatrix}. \qquad (5.31)$$
Eliminating Xi gives

$$\begin{cases}\gamma_ix_i = R_1X_i\\ \eta_iy_i = R_2X_i\end{cases} \Leftrightarrow \begin{cases}X_i = \gamma_iR_1^Tx_i\\ \eta_iy_i = R_2X_i\end{cases} \Leftrightarrow \begin{cases}X_i = \gamma_iR_1^Tx_i\\ \eta_iy_i = \gamma_iR_2R_1^Tx_i.\end{cases} \qquad (5.32)$$

The second equation is of the form

$$\lambda_iy_i = Hx_i, \qquad (5.33)$$

with λi = ηi/γi and H = R2R1ᵀ. This shows that we can transfer points from the first image plane to the second
by the use of a homography. We know from the previous sections that we can estimate this transformation from
at least four point correspondences. Once we have estimated the homography, all other points in an image can be
transferred using it.
The unnormalized points x̃i and ỹi are related to the normalized ones by $x_i = K_1^{-1}\tilde x_i$ and $y_i = K_2^{-1}\tilde y_i$
respectively. Therefore we have

$$\lambda_iK_2^{-1}\tilde y_i = R_2R_1^TK_1^{-1}\tilde x_i \quad\Leftrightarrow\quad \lambda_i\tilde y_i = K_2R_2R_1^TK_1^{-1}\tilde x_i, \qquad (5.34)$$

which means that the unnormalized points are mapped by a homography of the form

$$H = K_2R_2R_1^TK_1^{-1}. \qquad (5.35)$$
Figures 5.7-5.9 show how we can use homographies to build panoramas from multiple images. In this case H21, which transforms points in image 2 to image 1, is estimated from the red matches, and H32, which transforms points in image 3 to points in image 2, is estimated from the green matches. Figure 5.8 shows the effect of transforming images 2 and 3 using the homographies H21 and H31 = H21 H32. These transformations show where the pixels would have been placed had images 2 and 3 been captured with camera 1. When these transformations have been applied we can extend image 1 with pixels from the other images.
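In code, this transfer amounts to composing the estimated homographies. A small sketch with illustrative names:

    import numpy as np

    def transfer(H, pts):
        # Map n-by-3 homogeneous points with H and normalize the last coordinate.
        q = pts @ H.T
        return q / q[:, 2:3]

    # Given estimated H21 and H32, points of image 2 are moved to image 1 by
    # transfer(H21, pts2) and points of image 3 by transfer(H21 @ H32, pts3).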
Figure 5.7: Original images with matches between image 1 and 2 (red) and between 2 and 3 (green). The matches
can be used to compute two homographies, H21 which maps pixels in image 2 to image 1 and H32 which maps
image 3 to image 2.
Figure 5.8: The homographies H21 and H21 H32 can be used to transform the pixels of images 2 and 3, respectively, to the coordinate system of the first camera.
Figure 5.9: When images 2 and 3 have been transformed we can add the images together.
Lecture 6:
Radial Distortion and 1D-cameras
Figure 6.1: Projection of a line in 3D gives a line in the image under the pinhole camera model. Under radial
distortion this is not true and the projection of a line often appears as slightly bent as shown in the left image. To
the right the image has been undistorted and straight lines appear straight again.
The camera model that we have derived in Lectures 1 and 2 has the property that straight lines are projected onto straight lines. While this is typically accurate enough for regular cameras, wide field of view lenses and fisheye cameras typically exhibit what is called radial distortion, which is illustrated in Figure 6.1. Directly applying the pinhole camera model to images such as this one usually gives large errors. Therefore we first need an extended model that can handle this kind of image.
Figure 6.2: Radial distortion is typically modeled by moving points along lines going through the principal point.
Radial distortion is typically modeled by moving points towards or away from the principal point (0, 0, 1) in the
image plane (z = 1). The distance the point moves depends on how close it is to the principal point. We let
xu ∼ (xu , yu , 1) denote the undistorted point and xd ∼ (xd , yd , 1) the distorted one. Since xd is on the line
going through xu and the principal point we can write it
xu ∼ (d(rd)xd, d(rd)yd, 1)^T = [d(rd) 0 0; 0 d(rd) 0; 0 0 1] (xd, yd, 1)^T. (6.1)

Here rd = √(xd² + yd²) is the distance to the principal point and d(rd) is a function that controls how far the point should be moved. There are several choices for d but usually a low degree polynomial is sufficient. (One may alternatively use ru = √(xu² + yu²). However the problem of undistorting an image becomes easier if one uses rd = √(xd² + yd²).)
The distorted point is finally mapped into the image by the calibration matrix, that is,

x ∼ Kxd. (6.3)

If the inner parameters K and the distortion function d(rd) are known we can remove radial distortion using the following process: map each image point x to the distorted point xd = K^{-1}x, move it to the undistorted position xu using (6.1), and map it back into the image as Kxu.
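The following sketch applies this process to a set of points, assuming the distortion function d is available as a Python function of rd and the principal point is at (0, 0); all names are illustrative:

    import numpy as np

    def undistort_points(x, K, d):
        # x: n-by-3 homogeneous image points; returns the undistorted points.
        xd = x @ np.linalg.inv(K).T              # x_d ~ K^{-1} x
        xd = xd / xd[:, 2:3]
        rd = np.sqrt(xd[:, 0]**2 + xd[:, 1]**2)  # distance to principal point
        s = d(rd)                                # the factor d(r_d) in (6.1)
        xu = np.column_stack([s * xd[:, 0], s * xd[:, 1], np.ones(len(x))])
        return xu @ K.T                          # back to image coordinates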
If the camera intrinsic and distortion parameters are known it is possible to undistort the images (as described
above) and solve multiple view geometry problems such as triangulation and resectioning as we have discussed in
previous sections. Alternatively one can, if the principal point is known, use the so called 1D Radial camera model
that is invariant to radial distortion.
Here we assume for simplicity that the principal point is (x0, y0) = (0, 0). Note that if (x0, y0) ≠ (0, 0) we can multiply the image points with the matrix [1 0 −x0; 0 1 −y0; 0 0 1] to move the principal point to (0, 0). See Assignment 2, Exercise 3.
The 1D-radial camera is represented by a 2 × 4 matrix which is basically the two first rows of a regular 3 × 4 camera
matrix. To derive it we start from the regular camera equations (with radial distortion) which according to the
previous section can be written
x ∼ [γf sf 0; 0 f 0; 0 0 1] [d(rd) 0 0; 0 d(rd) 0; 0 0 1]^{-1} [R t] X. (6.6)
Since

[d(rd) 0 0; 0 d(rd) 0; 0 0 1] [1 0 0; 0 1 0; 0 0 d(rd)] = d(rd) I ∼ I, (6.7)

we get

(x, y, 1)^T ∼ [γf sf 0; 0 f 0; 0 0 d(rd)] [R t] X. (6.8)
Only the third coordinate depends on the radial distortion. If we discard it we get

(x, y)^T ∼ [γf sf 0; 0 f 0] [R t] X. (6.9)

Here we note that the last row of [R t] vanishes from this expression. If [R2×3 t2×1] denotes the first two rows of [R t] we get

(x, y)^T ∼ [γf sf; 0 f] [R2×3 t2×1] X ∼ [γ s; 0 1] [R2×3 t2×1] X =: P2×4 X. (6.10)
Note that the focal length also vanishes from this expression due to the scale ambiguity. Hence, this camera model is invariant to radial distortion and focal length. The above equation can be seen as a projection P3 → P1. Geometrically we can think of (x, y)^T as a directional vector of a line going through the principal point (0, 0) and containing the projection of X, regardless of what the radial distortion and focal length are.
The matrix P2×4 represents the 1D-radial camera. While a projection in such a camera gives less information about the 3D point than a regular camera does, it is still useful for multiple view geometry.
Exercise 16. If (x, y)^T ∼ P2×4 X and P2×4 contains the first two rows of the camera matrix P, show that the projection of X must be on the line l ∼ (y, −x, 0).

Solution: If the first two rows of P are P2×4 then the projection in P is

x ∼ (x, y, ρ)^T ∼ P X, (6.11)

for some ρ, and l^T (x, y, ρ)^T = yx − xy + 0 · ρ = 0, so the projection is on the line l.
Exercise 17. If v ∼ P2×4 X, show that X must be on the 3D plane π ∼ P2×4^T v⊥, where v⊥ is a vector that is perpendicular to v.

Solution: If λv = P2×4 X then

π^T X = v⊥^T P2×4 X = λ v⊥^T v = 0. (6.12)
For practical problems the measurements that we use are still points in images. However, to obtain invariance to distortion (and focal length) we only view them as directional vectors of lines known to contain the undistorted projection. In this way each observed point constrains the 3D point to lie on a plane that depends on the camera parameters. With sufficiently many such planes we can solve multiple view problems.
6.2.1 Triangulation
In this section we consider the problem corresponding to triangulation, that is, determining a 3D point from line
observations in several known 1D radial cameras. Since 3 planes in general have a unique intersection point we can
solve the problem exactly when a point is visible in 3 cameras.
Solution: If we assume that X is a regular point with coordinates (X, Y, Z, 1) we need to solve the system

{λ1 = X, λ1 = Y, 0 = X + 1, λ2 = Y, λ3 = Y, −λ3 = Z}. (6.14)

Eliminating λ1, λ2 and λ3 one sees that the solution is (X, Y, Z) = (−1, −1, 1).
In practical cases we can solve the problem numerically with 3 or more projections using DLT. Given projections vi ∼ (xi, yi) in cameras P2×4^i, i = 1, ..., n, we want to find the unknown 3D point X. To formulate a suitable matrix formulation we note that

λi vi = P2×4^i X ⇔ [P2×4^i −vi] (X; λi) = 0. (6.15)

In Lecture 4 we mentioned that it is possible to eliminate λ using the vector product. Here it is possible to do the same using a vector v⊥ that is perpendicular to v: if λv = P2×4 X then v⊥^T P2×4 X = λ v⊥^T v = 0. Figure 6.3 shows an example of triangulation with radial distortion.
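A minimal sketch of this triangulation, using the perpendicular-vector elimination above (names are our own):

    import numpy as np

    def triangulate_1d_radial(Ps, vs):
        # Ps: list of 2x4 cameras, vs: list of observed 2-vectors v_i.
        rows = []
        for P, v in zip(Ps, vs):
            v_perp = np.array([-v[1], v[0]])   # perpendicular to v
            rows.append(v_perp @ P)            # plane constraint v_perp^T P X = 0
        A = np.vstack(rows)
        # The X minimizing ||A X|| with ||X|| = 1 is the last right singular vector.
        return np.linalg.svd(A)[2][-1]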
6.2.2 Resection
Next we consider the resection problem for 1D-cameras. Given a number of projections vi ∼ (xi, yi) and corresponding 3D points Xi we want to find P2×4 such that

vi ∼ P2×4 Xi. (6.17)

Exercise 19. How many degrees of freedom does a 1D-radial camera have? How many projections are needed to be able to determine a 1D-radial camera?

Solution: The matrix P2×4 has 8 elements but the scale is arbitrary. Therefore the 1D-radial camera has 7 dof. Each projection gives two equations and one additional unknown λ. Therefore the problem is solvable if 2n ≥ 7 + n ⇔ n ≥ 7, where n is the number of projections.
Figure 6.3: Triangulation with radially distorted images. Top row: examples of the images used. Bottom row: Result
obtained with the radial camera model (left), result obtained with a regular pinhole camera (middle), histogram over
the errors to the ground truth for both approaches (right).
As in the triangulation case, the relation (6.17) can be turned into a homogeneous linear system, which we can solve with the method presented in Lecture 4. The left image of Figure 6.4 shows the result of running
the resection problem on the same data set as in Figure 5.4. Since it is not possible to determine a unique camera
center for the 1D radial camera (the nullspace is two dimensional) we can only say that the image was captured
somewhere on a line that is parallel to the camera viewing direction. Figure 6.4 shows these lines in purple for every
camera.
When the inner camera and distortion parameters are known it is relatively easy to undistort the image as shown
in Figure 6.1. It relies on the fact that using the resection formulation above we can estimate the first 2 rows of the
unknown camera matrix independently of radial distortion using DLT. Once this is done the remaining parameters
can be found through a linear system of equations. The only assumption we have to make is that the principal point is known, since otherwise we can't solve the resection problem as described above. The principal point is however usually located in the middle of the image and therefore typically known.
From the solution to the resection problem we can extract γ, s, R2×3 and t2×1 using RQ-factorization. In addition,
when we have the first two rows of a rotation matrix we can compute the third one using the vector product. What
remains now is to find f, t3 and d(rd). These cannot be determined using the line projections but have to be computed from the regular point projections. Throughout this section we will assume that the extracted 2 × 4 camera is of the form P2×4 = [R2×3 t2×1], that is, γ = 1, s = 0 in (6.10). If this is not the case we can multiply the image points with the matrix

[γ s 0; 0 1 0; 0 0 1]^{-1} = [1/γ −s/γ 0; 0 1 0; 0 0 1]. (6.20)
Let

(X, Y, Z)^T = [R (t2×1; 0)] X. (6.21)

If t3 = 0 these are homogeneous coordinates of xd. For other values of t3 the z-coordinate will be Z + t3. We therefore need to find parameters f, d(rd) and t3 so that

[d(rd) 0 0; 0 d(rd) 0; 0 0 1] K^{-1} (x, y, 1)^T ∼ (X, Y, Z + t3)^T. (6.22)

Since γ = 1 and s = 0 we have K^{-1} = [1/f 0 0; 0 1/f 0; 0 0 1], which gives the equations
(d(rd)x/f, d(rd)y/f, 1)^T ∼ (X, Y, Z + t3)^T ⇔ {(d(rd)/f) x = X/(Z + t3), (d(rd)/f) y = Y/(Z + t3)}. (6.23)

Since x ∼ Kxd we know that xd = x/f, yd = y/f and that rd = √(x²/f² + y²/f²) = r/f, where r = √(x² + y²). The equations are therefore equivalent to

{d(r/f) x/f = X/(Z + t3), d(r/f) y/f = Y/(Z + t3)}. (6.24)
Since X/x = Y/y it is not difficult to see that the two equations above are dependent; it is therefore enough to solve a re-scaled version of one of them. Here we use

d(r/f) r/f = R̃/(Z + t3), (6.25)

where R̃ = √(X² + Y²). To proceed further we now need to specify what type of function d is. To obtain a linear system a common choice is d(rd) = 1/p(rd), where

p(rd) = 1 + k1 rd² + k2 rd⁴ + ... + kn rd^{2n}. (6.26)

Inserting into (6.25) gives

(Z + t3) r = R̃ (f + k1 r²/f + k2 r⁴/f³ + ... + kn r^{2n}/f^{2n−1}). (6.27)
The unknowns are f, t3 and the coefficients k1, ..., kn. We can collect these in one vector of unknowns by rewriting (6.27) as

[R̃ R̃r² R̃r⁴ . . . R̃r^{2n} −r] (f, k1/f, k2/f³, . . . , kn/f^{2n−1}, t3)^T = rZ. (6.28)
Each point projection gives one such equation. We have n + 2 unknowns and we therefore need n + 2 point
projections to solve the problem.
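A minimal sketch of setting up and solving (6.28), assuming r, R̃ and Z have already been computed for every point projection from (6.21); names are illustrative:

    import numpy as np

    def solve_focal_distortion(r, R, Z, n_k=2):
        # Columns [R, R r^2, ..., R r^{2n}, -r]; unknowns (f, k1/f, ..., kn/f^{2n-1}, t3).
        cols = [R] + [R * r**(2 * (j + 1)) for j in range(n_k)] + [-r]
        A = np.column_stack(cols)
        sol = np.linalg.lstsq(A, r * Z, rcond=None)[0]
        f, t3 = sol[0], sol[-1]
        # Undo the scaling k_j / f^{2j-1} of the middle entries.
        ks = [sol[1 + j] * f**(2 * j + 1) for j in range(n_k)]
        return f, ks, t3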
Once we have solved the problem we can form a complete 3 × 4 camera for which we can compute a unique camera center. To the right in Figure 6.4 we show the result obtained when applying this approach to the same data as previously. When the function d(rd) has been determined we can also remove the radial distortion from the image and image points. An example of this is shown in Figure 6.5.
Figure 6.4: Left: Result of running the resection formulation of Section 6.2.2. The radial camera does not have any
unique camera center since t3 is unknown. Varying t3 gives a line (parallel to the viewing direction) of potential
positions where the image could have been captured. Right: The result from the method in Section 6.3.
Figure 6.5: After the function d(rd) has been determined we can remove the radial distortion from the image and image points. The left figure shows the original distorted image with detected points. The right shows the undistorted image.
Lecture 7:
Epipolar Geometry and the Fundamental Matrix
In this lecture we will consider the two-view structure from motion problem. That is, given two images we want to
compute both the camera matrices and the scene points such that the camera equations
λi xi = P1 Xi (7.1)
λ̄i x̄i = P2 Xi , (7.2)
i = 1, ..., n, are fulfilled. In previous lectures we have considered sub-problems where either the camera matrices are known (the triangulation problem) or the scene points are known (the resection problem). Since the camera equations become linear in these two cases we could solve them directly by applying DLT. The situation becomes more complicated when both the scene points and camera matrices are unknown. The approach we will take in this lecture formulates a set of algebraic constraints that involve only the image points and the cameras, thereby eliminating the scene points. The resulting equations are linear and can be solved using SVD. Once the cameras are known the 3D points can be computed using triangulation.
Given two sets of corresponding points xi and x̄i, i = 1, ..., n, our goal is to find camera matrices P1 and P2 such that (7.1)-(7.2) are solved. As we observed in Lecture 3, if the cameras are uncalibrated the reconstruction can only be determined uniquely up to an unknown projective transformation. If the cameras are P1 = [A1 t1] and P2 = [A2 t2] then we can apply the transformation

H = [A1^{-1} −A1^{-1}t1; 0 1] (7.3)

to the camera equations (7.1)-(7.2). The camera matrix P1 is then transformed to

P1 H = [A1 t1] [A1^{-1} −A1^{-1}t1; 0 1] = [I 0]. (7.4)
Therefore, we can always assume that there is a solution where P1 = [I 0] and P2 = [A t].
In this section we will derive the so called epipolar constraints. In the following sections we will use these constraints
to find camera matrices that solve (7.1)-(7.2).
We first consider a point x in the first image. The scene points that can project to this image point all lie on a line (the viewing ray of x) in 3D space, see Figure 7.1. If we assume that P1 = [I 0], the finite points X = (X; 1), X ∈ R³, that project to x are exactly those of the form X = (λx; 1), λ ∈ R. Projecting these into the second camera P2 = [A t] gives the points λAx + t.
Figure 7.1: All scene points on the line project to the same point in the left camera.
This is a line in the second image, see Figure 7.2. We refer to this line as the epipolar line of x. Since all scene points that can project to x are on the viewing ray, all points in the second image that can correspond to x have to be on the epipolar line. This condition is called the epipolar constraint. For different points x in the first image we get different viewing rays that project to different epipolar lines. Since the viewing rays all pass through the camera center C1 of the first camera, the epipolar lines will all intersect each other in the projection e2 of the camera center C1, see Figure 7.2. The projections of the camera centers, e1 and e2, are called the epipoles.
Figure 7.2: The projection of the viewing line into the second camera gives an epipolar line.
Exercise 20. Compute the epipoles for the camera pair P1 = [I 0] and

P2 = [1 1 0 0; 1 0 1 0; 0 1 0 1]. (7.8)

Solution: To compute the camera centers we first find the nullspaces of the camera matrices. For P1 it is easy to see that C1 ∼ (0, 0, 0, 1). The nullspace of P2 is given by the system

{X + Y = 0, X + Z = 0, Y + W = 0} ⇔ {X = t, Y = −t, Z = −t, W = t}, (7.9)

so C2 ∼ (1, −1, −1, 1). Projecting the centers into the other camera gives the epipoles e2 ∼ P2 C1 = (0, 0, 1) and e1 ∼ P1 C2 = (1, −1, −1).
For a general camera pair P1 = [I 0], P2 = [A t], where A is invertible, we have C1 = (0, 0, 0, 1) and C2 = (−A^{-1}t; 1) in homogeneous coordinates. Projecting into the two cameras gives e2 ∼ P2 C1 = t and

e1 ∼ [I 0] (−A^{-1}t; 1) = −A^{-1}t.
The expression λAx + t is a parametrization of the epipolar line of x. Note that since e2 = t we see that e2 is on the epipolar line by letting λ = 0. We know that a line in P2 can also be represented by a line equation l^T x = 0. To find the vector l we pick two points on the line, e.g. t and Ax + t, and insert them into the line equation

{l^T t = 0, l^T (Ax + t) = 0}. (7.13)

These equations tell us that l has to be perpendicular to both t and Ax + t. We can find such an l using the vector product, l ∼ t × (Ax + t) = t × (Ax) = e2 × (Ax).
Exercise 21. For the cameras in Exercise 20, compute the epipolar line of the point x = (0, 1, 1) and determine whether the points x1 = (1, 2, 1) and x2 = (1, 1, 1) fulfill the epipolar constraint.

Solution: We have

Ax = [1 1 0; 1 0 1; 0 1 0] (0, 1, 1)^T = (1, 1, 1)^T, (7.16)

and

l = e2 × (Ax) = (−1, 1, 0)^T. (7.17)

This gives

l^T x1 = −1 · 1 + 1 · 2 + 0 · 1 = 1 (7.18)
l^T x2 = −1 · 1 + 1 · 1 + 0 · 1 = 0, (7.19)

so x2 fulfills the epipolar constraint while x1 does not.
A cross product y × x is a linear operation on x and can therefore be represented by a matrix which we denote [y]×. If x = (x1, x2, x3) and y = (y1, y2, y3) then their cross product is

y × x = (y2x3 − y3x2, y3x1 − y1x3, y1x2 − y2x1). (7.20)

In matrix form,

[y]× x = [0 −y3 y2; y3 0 −y1; −y2 y1 0] (x1, x2, x3)^T = (y2x3 − y3x2, y3x1 − y1x3, y1x2 − y2x1)^T. (7.21)
The matrix [y]× is skew symmetric, that is, [y]× = −[y]T× . It is easy to see that for any 3 × 3 skew symmetric
matrix S there is a vector y such that S = [y]× .
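In code the matrix [y]× is formed directly from (7.21); the small sketch below satisfies skew(y) @ x == np.cross(y, x):

    import numpy as np

    def skew(y):
        # The cross product matrix [y]_x of (7.21).
        return np.array([[0.0, -y[2], y[1]],
                         [y[2], 0.0, -y[0]],
                         [-y[1], y[0], 0.0]])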
With this notation the epipolar line can be written

l ∼ [e2]× Ax.

The matrix F = [e2]× A is called the fundamental matrix. It maps points in image 1 to lines in image 2. If x̄ corresponds to x then the epipolar constraint can be written

x̄^T F x = 0.
Note that F only depends on the cameras and therefore the epipolar constraints only involves the image points and
the camera P2 . We will use these constraints to compute F from a number of image correspondences. Once F has
been determined the camera P2 can be extracted.
For the cameras in Exercise 20 we get

F = [e2]× A = [−1 0 −1; 1 1 0; 0 0 0],

and

e2^T F = (0 0 1) [−1 0 −1; 1 1 0; 0 0 0] = (0 0 0). (7.27)
Recall that the objective of the two-view structure from motion problem is to find the scene points and the camera
P2 . We will see in the next lecture that if the Fundamental matrix is known then P2 can be extracted from it. We
now present a simple algorithm for estimating F .
As we saw in the previous section, for each point correspondence xi, x̄i we get one epipolar constraint

x̄i^T F xi = 0. (7.28)
Therefore each correspondence gives one linear constraint on the entries of F . In matrix form we can write the
resulting system as
[x̄1x1 x̄1y1 x̄1z1 · · · z̄1z1; x̄2x2 x̄2y2 x̄2z2 · · · z̄2z2; . . . ; x̄nxn x̄nyn x̄nzn · · · z̄nzn] (F11, F12, F13, . . . , F33)^T = 0, (7.30)

where we denote the matrix on the left by M.
This is a linear homogeneous system which we can solve using SVD as in Lecture 3. The matrix F has 9 entries
but the scale is arbitrary and the system therefore has 8 degrees of freedom. Each correspondence gives one new
constraint on F and we therefore need 8 correspondences to solve this problem.
Note that it is in fact possible to solve the problem with only 7 point correspondences since we also have the
constraint det(F ) = 0. However, this constraint is a polynomial of third order and we cannot use SVD to solve
the resulting system. Therefore we use at least 8 correspondences.
Because of noise the resulting estimate F̃ of the fundamental matrix is not likely to have zero determinant. Therefore, given this estimate, we choose the matrix F that solves

min ‖F − F̃‖ such that det(F) = 0

(where the norm is the Frobenius/sum-of-squares norm). The solution to this problem is given by the SVD of F̃. If

U S V^T = F̃, (7.32)

where S = diag(σ1, σ2, σ3), then F can be found by setting the smallest singular value σ3 to 0, that is, F = U diag(σ1, σ2, 0) V^T.
As was the case with the resection problem, normalization is important for numerical stability. If for example x1 and x̄1 are both on the order of 1000 pixels then x1x̄1 ≈ 10^6 while z1z̄1 = 1. This makes the matrix M^T M very poorly conditioned. To improve the numerics we can use the same normalization as in Lecture 3 (for both cameras).
We summarize the different steps of the algorithm here:

• Normalize the image coordinates as in Lecture 3.
• Form the matrix M and solve the homogeneous least squares problem min ‖Mv‖, ‖v‖ = 1, using SVD.
• Form the matrix F (and ensure that det(F) = 0).
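A minimal sketch of these steps in Python/NumPy (the normalization is omitted here; it would be applied to the points first and undone at the end):

    import numpy as np

    def eight_point(x, xbar):
        # x, xbar: n-by-3 homogeneous points (n >= 8) with xbar^T F x = 0.
        # Row i of M holds the products xbar_i[k] * x_i[l], cf. (7.30).
        M = np.stack([np.outer(xb, xx).ravel() for xb, xx in zip(xbar, x)])
        F = np.linalg.svd(M)[2][-1].reshape(3, 3)
        # Enforce det(F) = 0 by zeroing the smallest singular value.
        U, S, Vt = np.linalg.svd(F)
        return U @ np.diag([S[0], S[1], 0.0]) @ Vt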
Lecture 8:
Extracting Cameras from F
In Lecture 7 we considered the two-view structure from motion problem, that is, given a number of measured points in two images we want to compute both camera matrices and 3D points such that they project to the measurements. We showed that the 3D points can be eliminated from the problem by considering the fundamental matrix F. If x is an image point belonging to the first image and x̄ belongs to the second then there is a 3D point that projects to these if and only if the epipolar constraint

x̄^T F x = 0 (8.1)

is fulfilled. Using the projections of 8 scene points we can compute the fundamental matrix by solving a homogeneous least squares problem (the 8-point algorithm). What remains in order to find a solution to the two-view structure from motion problem is to compute cameras from F and finally compute the 3D-points.
In general we may assume (see Lecture 5) that the cameras are of the form P1 = [I 0] and P2 = [A e2] where e2 is the epipole in the second image. Since we know that F^T e2 = 0 we can find e2 by computing the nullspace of F^T. In what follows we will show that A = [e2]× F gives the correct epipolar geometry and therefore a solution for the second camera is given by

P2 = [[e2]× F e2]. (8.2)
The fundamental matrix of the camera pair P1 and P2 is, according to Lecture 5, given by [t]× A = [e2]× [e2]× F, so we need to show that this expression reduces to F. The epipolar line of an arbitrary point x in image 1 is then

e2 × (e2 × (F x)).

Since e2 is in the nullspace of F^T it is perpendicular to the columns of F and therefore also to the vector F x. This means that v, e2, v × e2, where v = F x/‖F x‖, forms a positively oriented orthonormal basis of R³ (we may assume ‖e2‖ = 1). It is now easy to see that

e2 × (e2 × v) = −v.

Therefore e2 × (e2 × (F x)) = −F x for all x ∈ R³, which shows that [e2]× [e2]× F ∼ F.
So in conclusion, if we have corresponding points that fulfill the epipolar constraints (8.1) then we can always find 3D points that project to these in the cameras P1 = [I 0] and P2 = [[e2]× F e2].
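In code this is a direct translation of (8.2); skew(·) denotes the cross product matrix [·]× (e.g. as in the sketch in Lecture 7):

    import numpy as np

    def cameras_from_F(F):
        # e2 spans the nullspace of F^T: the last right singular vector of F^T.
        e2 = np.linalg.svd(F.T)[2][-1]
        P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
        P2 = np.hstack([skew(e2) @ F, e2.reshape(3, 1)])
        return P1, P2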
To find the 3D point X ∼ (X, Y, Z, W) that projects to x1 and x2 (note that we know that there is such a point from Exercise 21) we solve the system of camera equations

{λ1 x1 = P1 X; λ2 x2 = P2 X} ⇔ {0 = X; λ1 = Y; λ1 = Z; λ2 = −X − Y; λ2 = −X − Z; λ2 = W} ⇒ {X = 0; Y = −t; Z = −t; W = t}. (8.7)
The choice P2 = [[e2]× F e2] may seem a little strange since the matrix [e2]× F has the nullspace e1. Therefore we have

0 = [[e2]× F e2] (e1; 0), (8.8)

which means that P2's camera center is a point at infinity. Since there is a projective ambiguity there are however many choices for P2. Given that P1 = [I 0] the general formula for P2 is

P2 = [[e2]× F + e2 v^T λe2], (8.9)

where v is some vector in R³ and λ is a non-zero scalar. It is easy to verify that this camera pair gives the correct fundamental matrix, similar to what we did previously.
Figure 8.1 shows an example of a reconstruction obtained using the above method. Blue ∗ are the image measurements and red o are the reprojections. While the reprojections appear to be correct, the 3D points may still not look the way we expect them to. This is because the solution we obtain is only unique up to an unknown projective transformation. Such transformations can severely distort the results. In this case P2 has a camera center at infinity and there are points both in front of and behind P1, which makes the results look strange even though the camera equations are solved. To address these issues we will consider how to find a projective transformation that makes all cameras finite and places all the points in front of them.
Figure 8.1: Solution to the two view problem. Blue ∗ are the image measurements and red o are the reprojections.
The 3D points (to the right) look strange because of the projective ambiguity (note the difference in scale on the
axes).
In this section we will introduce the notion of quasi affine homographies. These are homographies that do not
transform any part of the reconstruction to infinity. Figure 8.2 shows an example of a 2D point set {xi } that is
being transformed by 3 different homographies, all of the form

H = [1 0 0; 0 1 0; l1 l2 1]. (8.12)
The transformation Hxi is a point at infinity if its third element is 0, that is, lT xi = 0 where lT = (l1 , l2 , 1).
Therefore l is the line of all points that will be mapped to infinity by the transformation. This line is shown in blue
in Figure 8.2. Furthermore, the convex hull of the point set (that is, the smallest convex set that contains the point set {xi}) is shown in red. As long as l does not intersect the convex hull, the hull is preserved by the mapping (that is, the points on the boundary of the convex hull before the transformation are also on the convex hull in the transformed space). However, when the line intersects the convex hull, significant distortion is introduced and this is no longer the case. To remove this kind of distortion from reconstructions obtained from the fundamental matrix we will use quasi affine transformations.
Figure 8.2: Projective transformation of 2D points. First row: original points (blue) and their convex envelope (red).
Second row: transformed points (blue) and the transformed convex envelope (red). The two first transformations
are quasi affine with respect to the blue points while the third is not.
The examples we showed in the previous section concern 2D projective transformations. However, the same principle applies in 3D. When the plane that is mapped to infinity intersects the convex hull of the set of 3D points, the resulting geometry will look strange. This is also what has happened in Figure 8.1. In the remainder of this section we will consider sets of 3D points. The main purpose is to upgrade the solution from the two view problem to a more reasonable looking one.
A homography H mapping a set of finite points Xi = (Xi; 1) to corresponding points Yi = (Yi; 1) is called quasi affine with respect to (Q.A w.r.t) the set {Xi} if

H Xi = γi Yi, (8.13)

where the γi have the same sign for all i.
If a homography is not Q.A w.r.t {Xi}, the transformation of the point set generally results in strange geometries. For such a transformation there are always i and j such that

H Xi = γi Yi (8.14)
H Xj = γj Yj (8.15)

where γi < 0 and γj > 0. Consider the line segment λXi + (1 − λ)Xj, 0 ≤ λ ≤ 1, between the two points Xi and Xj. This is transformed by H to

H(λXi + (1 − λ)Xj) = λγi (Yi; 1) + (1 − λ)γj (Yj; 1). (8.16)

The last coordinate in this expression is λγi + (1 − λ)γj. If we let λ = γj/(γj − γi) we get

λγi + (1 − λ)γj = γjγi/(γj − γi) + (1 − γj/(γj − γi))γj = 0. (8.17)
Therefore the set of points H(λXi + (1 − λ)Xj), 0 ≤ λ ≤ 1, contains a point at infinity. Since all the points Yi ∼ HXi are finite, the convex hull of {HXi} also consists of finite points. This means that the convex hull of {Xi} is not mapped to the convex hull of {HXi} by H.
Now suppose that we have a camera P and a set of 3D points {Xi} that fulfill

λi xi = P Xi. (8.18)

If P̄ ∼ P H^{-1} and X̄i ∼ H Xi, when is H Q.A w.r.t the set {Xi}? There are non-zero constants c and γi such that P̄ = cP H^{-1} and γi X̄i = H Xi, and therefore we have

λi xi = P Xi = P H^{-1} H Xi = (γi/c) P̄ X̄i ⇔ (λi c/γi) xi = P̄ X̄i, (8.19)

where we define λ̄i := λi c/γi. Since c does not depend on i we see that the λ̄i have the same sign as the λi for all i, or the opposite sign for all i, when γi > 0 for all i. Therefore we have

sign(λ1λ̄1, λ2λ̄2, ..., λnλ̄n) = ±(1, 1, ..., 1), (8.20)

when H is Q.A. w.r.t {Xi}. Conversely, when H is a general homography the γi will not have the same sign for all i, and therefore the signs of λ̄i will vary with i.
Now suppose that we have a "true" metric reconstruction of a scene (with finite scene points). In particular the scene points {Xi} are in front of the cameras. If Xi = (Xi; 1) and one of the cameras is P = [A a] then we have

0 < depth(P, Xi) = det(A) λi / ‖A3‖. (8.21)

Since det(A)/‖A3‖ is independent of i it is clear that λi, i = 1, ..., n, all have the same sign. Then any other projective reconstruction with P̄ ∼ P H^{-1}, X̄i ∼ H Xi and λ̄i xi = P̄ X̄i, that has the same sign for all λ̄i, is related to the true solution via a quasi-affine transformation (w.r.t Xi). We call such a solution a quasi-affine reconstruction.
Note that if all points are visible in all cameras it is enough to check the signs for one camera to determine if H is quasi-affine. Since the points are in front of the cameras it is easy to see that the sign condition will also hold in the other cameras. Furthermore, a quasi affine reconstruction does not necessarily have points in front of the cameras since we do not know how sign(det(A)) is changed by H. The only thing that is guaranteed is that no point in the convex envelope of {Xi} is transformed to infinity.
In the case of the two view problem we have seen in Section 8.1 how to obtain a projective solution where P1 = [I 0] and P2 = [[e2]× F e2]. In this case it is easy to obtain a quasi affine reconstruction (w.r.t a "true" metric point cloud) using the following three steps:

• Swap the third and fourth coordinates of all the 3D points.
• Swap the third and fourth columns of both cameras.
• Rescale the homogeneous coordinates of each point so that the last coordinate is 1.
Note that the projections are not changed by the first two steps since these correspond to transforming the reconstruction with the homography

T = [1 0 0 0; 0 1 0 0; 0 0 0 1; 0 0 1 0]. (8.22)
Neither does the third step since this only rescales the homogeneous coordinates of the 3D points. After the transformation the first camera is

P1 T^{-1} = [1 0 0 0; 0 1 0 0; 0 0 0 1]. (8.23)
Since the scene points all have last coordinate 1 it is clear that multiplication with this camera matrix gives all λi = 1. Therefore the resulting projective reconstruction is related to the true metric solution via some quasi affine transformation H. Figure 8.3 shows the resulting structure after applying the above algorithm to the reconstruction in Figure 8.1. Since P1 has its camera center at an ideal point we only plot the resulting structure.
Figure 8.3: New structure after the quasi affine upgrade. Note that the axis scales differ. This should be expected since the reconstruction is not metrically accurate. The only thing that is guaranteed by a quasi affine reconstruction is that the convex hull of the points is preserved.
While the structure of the above reconstruction typically looks more reasonable than a general projective solution, it has some unreasonable features. Firstly, the camera (8.23) is affine and therefore has a camera center at infinity. For such cameras we cannot talk about the depth of a point since the length ‖(p31, p32, p33)‖ = 0. Secondly, there is no guarantee that the second camera P2 T^{-1} has points in front of it. In this section we address these two issues under the assumption that the determinant of the first 3 columns of P2 T^{-1} is non-zero.
To address the second issue one may simply change the signs of the third column of the camera and the third coordinate of every point, or equivalently apply the transformation

T̄ = [1 0 0 0; 0 1 0 0; 0 0 −1 0; 0 0 0 1] (8.24)
to the reconstruction. This changes the sign of the determinant of the first 3 × 3 part of the second camera. At the same time (P2 T^{-1} T̄^{-1})(T̄ T Xi) = P2 Xi, so the projections are unchanged, and since the determinant changed sign the new depths have opposite signs.
Note that P1 T^{-1} T̄^{-1} = P1 T^{-1} still has its camera center at infinity. To fix this we apply a final transformation

T̃ = [1 0 0 0; 0 1 0 0; 0 0 1 0; 0 0 −ϵ 1], (8.25)

for some small ϵ > 0. This makes the first camera finite, and for small enough ϵ the plane that T̃ maps to infinity stays away from the convex hull of the points.
Figure 8.4: Quasi affine reconstruction with finite cameras and points in front of the cameras. Red arrows show the local coordinate systems of the cameras. Blue lines show the image borders transformed into the image plane using K^{-1}. Note that for the camera to the right the principal point is not in the image.
8.3 Autocalibration
In general the inner parameters of a camera cannot be extracted from an uncalibrated reconstruction alone. Without any knowledge beyond the image projections, the best we can do is a quasi affine reconstruction with points in front of the cameras. However, by adding some extra information it may be possible to extract the true camera parameters. This is called Autocalibration.
We suppose that we have a projective reconstruction P, P̄, {Xi} of a two view problem and want to find an upgrading homography H that gives new cameras

K [R t] ∼ P H, (8.28)
K̄ [R̄ t̄] ∼ P̄ H, (8.29)

where K and K̄ have some desirable properties. The approach relies on the fact that KK^T can be extracted directly (without the need to compute an RQ-factorization) using the observation that

K [R 0] ∼ P H D, (8.30)
where

D = [1 0 0 0; 0 1 0 0; 0 0 1 0; 0 0 0 0]. (8.31)
Therefore

P HDH^T P^T = (P H D)(P H)^T ∼ K [R 0] [R^T; t^T] K^T = K R R^T K^T = K K^T, (8.32)

where we define Ω := HDH^T.
Hence KK^T (and similarly K̄K̄^T) is linear in the elements of the matrix Ω up to an unknown scaling, and can be determined from P and Ω without the need for RQ-factorization. The matrix Ω is symmetric and positive semi-definite with rank 3. The main idea is to add a few reasonable constraints on KK^T and K̄K̄^T and solve for Ω. Once Ω has been determined we can compute H to upgrade P, P̄ and {Xi}. We have
KK^T = [γf sf x0; 0 f y0; 0 0 1] [γf 0 0; sf f 0; x0 y0 1] = [(γ² + s²)f² + x0² sf² + x0y0 x0; sf² + x0y0 f² + y0² y0; x0 y0 1]. (8.33)
Here we will assume that x0 = y0 = s = 0, γ = 1. Note that s = 0 and γ = 1 are typically reasonable assumptions, whereas (x0, y0) is typically the middle of the image. Therefore a coordinate change of the image coordinates using

N = [1 0 −x0; 0 1 −y0; 0 0 1] (8.34)

is required to make this assumption valid. When all the assumptions above are enforced we get
KK^T = [f² 0 0; 0 f² 0; 0 0 1] ∼ P Ω P^T, (8.35)
and similarly for the other camera. This gives us 4 constraints on the matrix w := P Ω P^T, namely w11 = w22 and w12 = w13 = w23 = 0. (Note that w is symmetric by construction and therefore the constraints w21 = w31 = w32 = 0 do not give any additional information.) Similarly we get 4 constraints for w̄ = P̄ Ω P̄^T. The matrix Ω is 4 × 4 and symmetric and therefore has 10 unknown elements, but since the scale is arbitrary it only has 9 DOF. With the constraint det(Ω) = 0 we can therefore find Ω. To avoid the trivial solution Ω = 0 we can fix the scale, for example by setting w33 = 1.
8.3.1 Finding Ω
To solve the problem we stack the unknowns of Ω in a vector ω and obtain a system of the form
Aω = b (8.36)
det(Ω) = 0, (8.37)
where A has size 9 × 10. The solutions to (8.36) are of the form
ωp + λωh , (8.38)
where ωp is one solution to (8.36) and ωh is in the nullspace of A. (In MATLAB ωp and ωh can be found by taking
A\b and null(A).) The matrix expression Ωp + λΩh corresponding to (8.38) should now fulfil det(Ωp + λΩh ) =
0. This corresponds to a generalized eigenvalue problem
Ωp v = −λΩh v (8.39)
which can be solved using eig(Omega_p,-Omega_h). There are four solutions to the eigenvalue problem; however, typically only two of these result in solutions with rank 3. The expression Ωp + λΩh can be seen as a line that
intersects the convex set of positive semi definite matrices. It is easy to see that there has to be either 0, 1, 2 or
infinitely many intersection points at the boundary of the convex set, which consists of matrices with determinant
0. In general there will be two distinct intersection points that correspond to two possible solutions for Ω.
The above approach is sensitive to noise and therefore normalization is important to achieve reasonable results. In addition to moving the principal point it is therefore a good idea to rescale the image coordinates using a matrix of the type [s 0 0; 0 s 0; 0 0 1] after applying (8.34). Note however that the particular normalization we choose must not invalidate the constraints used to form A.
8.3.2 Finding H
After Ω has been determined we also need to find H such that Ω = HDH^T. This can be done by taking the eigenvalue decomposition of Ω. If Ω = VΛV^T, with

Λ = [λ1 0 0 0; 0 λ2 0 0; 0 0 λ3 0; 0 0 0 0], (8.40)

and λ1, λ2, λ3 > 0, then H = V diag(√λ1, √λ2, √λ3, 1) gives HDH^T = VΛV^T = Ω.
Figure 8.5: The four solutions obtained from the autocalibration process. Note that all cameras seem to have principal points in the middle of the image, but only one has points in front of the camera.
Lecture 9:
The Essential Matrix
When solving the relative orientation problem without camera calibration there is, as we have seen in previous lectures, an ambiguity. Basically any projective transformation can be applied to the 3D-points to give a new solution. Therefore the resulting reconstructions can often look strange even though the reprojections are correct. To remove this ambiguity one has to add additional knowledge about the solution to the problem. In Lecture 8 we added some constraints on the inner parameters and used autocalibration to find an upgrading projective transformation. Alternatively, if we have some knowledge about the 3D scene, such as the distance between a few of the points, then we can apply a transform to the solution that makes these distances correct.
If the inner parameters K1 and K2 are completely known we consider the calibrated two-view structure from
motion problem. Given two sets of corresponding points xi and x̄i , i = 1, ..., n and inner parameters K1 and K2
our goal is to find [R1 t1 ], [R2 t2 ] and Xi such that
xi ∼ K1 [R1 t1 ]Xi (9.1)
x̄i ∼ K2 [R2 t2 ]Xi , (9.2)
and R1 ,R2 are rotation matrices.
We can make two simplifications to the problem. First we normalize the cameras by multiplying equations (9.1)
and (9.2) with K1−1 and K2−1 respectively. Furthermore, we apply the euclidean transformation
H = [R1^T −R1^T t1; 0 1] (9.3)

to the cameras (and H^{-1} to the 3D points). This gives us the new cameras

P1 H = [R1 t1] [R1^T −R1^T t1; 0 1] = [I 0], (9.4)
P2 H = [R2 t2] [R1^T −R1^T t1; 0 1] = [R2 R1^T −R2 R1^T t1 + t2] =: [R t]. (9.5)
The fundamental matrix for a pair of cameras of the form [I 0] and [R t] is given by

E = [t]× R, (9.8)

and is called the essential matrix. A rotation has 3 degrees of freedom and a translation 3. Since the scale of the essential matrix does not matter it has 5 degrees of freedom. The reduction in freedom compared to F results in extra constraints on the singular values of E. In addition to having det(E) = 0, the two non-zero singular values have to be equal. Furthermore, since the scale is arbitrary we can assume that these singular values are both 1. Therefore E has the SVD

E = U diag([1 1 0]) V^T. (9.9)

The decomposition is not unique. We will assume that we have a singular value decomposition where det(UV^T) = 1. It is easy to ensure this; if we have an SVD as in (9.9) with det(UV^T) = −1 then we can simply switch the sign of the last column of V. Alternatively we can switch to −E, which then has the SVD

−E = U diag([1 1 0]) (−V)^T, (9.10)

with det(U(−V)^T) = (−1)³ det(UV^T) = 1. Note however that if we recompute the SVD of −E we might get another decomposition since it is not unique.
To find the essential matrix we can use a slightly modified 8-point algorithm. From 8 point correspondences we form the matrix M (see Lecture 6) and solve the homogeneous least squares system

min ‖Mv‖ subject to ‖v‖ = 1

using SVD. The resulting vector v can be used to form a matrix Ẽ that does not necessarily have the right singular values 1, 1, 0. We therefore compute the decomposition Ẽ = USV^T and construct an essential matrix using E = U diag([1 1 0])V^T (alternatively E = U diag([1 1 0])(−V)^T if det(UV^T) = −1).
Since the essential matrix has only 5 degrees of freedom it is possible to find it using only 5 correspondences.
However as in the case of the fundamental matrix the extra constraints are non-linear which makes estimation more
difficult. (We will consider this problem in Lecture 7.)
We summarize the steps of the modified 8-point algorithm here:
• Normalize the coordinates using K1−1 and K2−1 where K1 and K2 are the inner parameters of the cameras.
using SVD.
• Form the matrix E (and ensure that E has the singular values 1, 1, 0).
Once we have determined the essential matrix E we need to extract cameras from it. Basically we want to decompose it into E = SR where S is a skew symmetric matrix and R is a rotation. For this purpose we will begin by
finding a decomposition diag([1 1 0]) = ZW, where Z is skew symmetric and W is a rotation. Since W is orthogonal we have diag([1 1 0]) = ZW ⇔ diag([1 1 0]) W^T = Z, that is,

[1 0 0; 0 1 0; 0 0 0] [w11 w21 w31; w12 w22 w32; w13 w23 w33] = [w11 w21 w31; w12 w22 w32; 0 0 0] = [0 −z3 z2; z3 0 −z1; −z2 z1 0]. (9.12)

By inspecting the individual elements we see that w11 = w22 = 0, w31 = z2 = 0, w32 = −z1 = 0, and w12 = z3 = −w21. Since W is a rotation (with columns of length 1) it is clear that w12 = ±1. Choosing w12 = z3 = −1 gives w21 = 1, and because the third column of W is the vector product of the first two we get the solution

W = [0 −1 0; 1 0 0; 0 0 1] and Z = [0 1 0; −1 0 0; 0 0 0]. (9.13)
Similarly, if we choose w12 = 1, we see that a second solution is given by W^T and Z^T.
To decompose E we now note that

E = U diag([1 1 0]) V^T = U Z W V^T = (U Z U^T)(U W V^T) =: S1 R1 (9.14)

and similarly

E = U diag([1 1 0]) V^T = U Z^T W^T V^T = (U Z^T U^T)(U W^T V^T) =: S2 R2. (9.15)
To see that these are valid solutions we first verify that R1 and R2 are rotations. Since

R1^T R1 = (U W V^T)^T U W V^T = V W^T U^T U W V^T = I, (9.16)

R1 is orthogonal. Furthermore, det(R1) = det(U) det(W) det(V^T) = det(UV^T) det(W) = 1, and therefore R1 is a rotation. (Note that if det(UV^T) = −1 then the R1 that we obtain is not a rotation but a rotation composed with a reflection and therefore not a valid solution.) That S1 is skew symmetric is easy to see since

S1^T = (U Z U^T)^T = U Z^T U^T = −U Z U^T = −S1, (9.18)

and therefore E = S1 R1 is a valid decomposition. E = S2 R2 can be verified similarly.
To create the camera matrices corresponding to the solutions S1 R1 and S2 R2 we need to extract a translation vector from the skew symmetric matrices S1 and S2. We note that since

S1 = U Z U^T = −U Z^T U^T = −S2, (9.19)

these are the same up to a scaling and it is therefore enough to determine t from S1. Since [t]× t = 0 the vector t must be in the nullspace of S1, which is the third column u3 of U. We therefore obtain the solutions P1 = [I 0] and

P2 = [U W V^T u3] or P2′ = [U W^T V^T u3]. (9.20)
The two cameras P2 and P2′ are called the twisted pair. The relative rotation between P2 and P2′ is the rotation R1^T R2 = V W^T U^T U W^T V^T = V W^T W^T V^T. It is not hard to verify that if v3 is the third column of V then V W^T W^T V^T v3 = v3, and therefore v3 is the rotation axis of R1^T R2. The rotation angle is given by

cos(ϕ) = (tr(R2^T R1) − 1)/2 = (tr(W^T W^T) − 1)/2 = −1, (9.21)
which yields ϕ = π. The camera centers of the two cameras P2 and P2′ are

−R1^T u3 = −V W^T U^T u3 = −V W^T (0, 0, 1)^T = −V (0, 0, 1)^T = −v3 (9.22)

and

−R2^T u3 = −V W U^T u3 = −V W (0, 0, 1)^T = −V (0, 0, 1)^T = −v3 (9.23)

respectively. Hence in both solutions we are moving from the center of P1, which is the origin, in the direction −v3, while P2′ is rotated 180° around this direction with respect to P2.
Figure 9.1: Two examples of twisted pair solutions. Note that one of P2 (green) and P2′ (red) has the reconstructed 3D points behind itself.
Note that if E = [t]× R then λE = [λt]× R is also a valid solution. Different λ correspond to rescalings of the solution, and since there is a scale ambiguity we cannot determine a "true" value of λ. However, the sign of λ is important since it determines whether points are in front of the cameras or not in the final reconstruction, see Figure 9.2. To make sure that we find a solution where the points are in front of both cameras we therefore test λ = ±1 and the twisted solution. If u3 is the third column of U we get the four solutions

P2 = [U W V^T u3] or [U W^T V^T u3] (from λ = 1) (9.24)
or [U W V^T −u3] or [U W^T V^T −u3] (from λ = −1). (9.25)

When we have computed these four solutions we compute the 3D points using triangulation for all the choices of P2 and select the one where the points are in front of both P1 and P2. Figure 9.3 shows the four calibrated reconstructions of a scene. Only one of them has all the points in front of both cameras.
Figure 9.2: Reconstruction with different values of λ. Note that changing the sign of λ moves the reconstructed points that are in front of the camera to the rear of it and vice versa.
Figure 9.3: The 4 solutions when solving calibrated structure from motion for the chair images in the first row. Only the second one has positive depths.
Lecture 10:
Model Fitting
In previous lectures we have addressed various problems using linear formulations that approximately solve the governing algebraic equations. While this is an easy approach, we can in general not expect that these equations can be solved exactly. Since the appearance of a patch changes when the viewpoint changes, exact positioning of corresponding points is not possible, see Figure 10.1. Additionally, since matching is done automatically we have to expect that some matches are incorrect, causing large deviations from the model. Therefore, our point measurements will in practice always be corrupted by noise of various forms and levels, and in general approximate solutions based on DLT will not give the "best" possible fit to the observed data.

In this lecture we will derive formulations that give the "best" fit under the assumption of Gaussian noise. The resulting problems are in general more difficult to solve than the formulations that we have used previously. In many cases they can only be locally optimized. Therefore the linear approaches are still very useful since they provide an easy way of creating a starting solution.
Figure 10.1: Two patches extracted from images with slightly different viewpoint. Exact localization of correspond-
ing points is made difficult because of slight appearance differences and limited image resolution.
What is meant by the "best" fit depends on the particular noise model. In this section we will consider two different
noise models and show that they lead to different optimization criteria. For simplicity we will consider the problem
of line fitting since this leads to closed form solutions.
Suppose that (xi , yi ) are measurements of 2D-points belonging to a line y = ax + b. Furthermore, we assume
that yi is corrupted by Gaussian noise, that is,
yi = ỹi + ϵi (10.1)
where ϵi ∈ N (0, 1) (Gaussian noise with mean 0 and standard deviation 1) and ỹi is the true y-coordinate. Our
goal is to estimate the line parameters a and b for which the measurements yi are most likely. Since ϵi ∈ N (0, 1),
its probability density function is

p(ϵi) = (1/√(2π)) e^{−ϵi²/2}. (10.2)

Furthermore, if we assume that the ϵi, i = 1, ..., n, are independent of each other then their joint distribution is

p(ϵ) = ∏_{i=1}^n p(ϵi), (10.3)

where ϵ = (ϵ1, ϵ2, ..., ϵn). Since ϵi = yi − ỹi we can compute the likelihood of the measurements by

p(ϵ) = ∏_{i=1}^n p(ϵi) = ∏_{i=1}^n p(yi − ỹi) = ∏_{i=1}^n p(yi − (axi + b)) = ∏_{i=1}^n (1/√(2π)) e^{−(yi−(axi+b))²/2}. (10.4)
We now want to find the line parameters a and b that make these measurements most likely. To simplify the maximization we maximize the logarithm of the likelihood

log(∏_{i=1}^n p(ϵi)) = −∑_{i=1}^n (yi − (axi + b))²/2 + ∑_{i=1}^n log(1/√(2π)). (10.5)

Since the second term does not depend on a or b, this is the same as minimizing

∑_{i=1}^n (yi − (axi + b))². (10.6)
The minimum of this expression can be computed using the normal equations

(a, b)^T = (A^T A)^{-1} A^T B, (10.8)

where A is the matrix with rows (xi, 1) and B = (y1, ..., yn)^T, which we will derive in Lecture 9. The geometric interpretation of (10.6) is that under this noise model the vertical distance between the line and the measurement should be minimized, see Figure 10.2.
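In code the fit is a single call to a linear least squares solver; a minimal sketch minimizing (10.6):

    import numpy as np

    def fit_line_lsq(x, y):
        # Rows (x_i, 1); solving A (a, b)^T ~ y minimizes (10.6).
        A = np.column_stack([x, np.ones_like(x)])
        return np.linalg.lstsq(A, y, rcond=None)[0]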
Next we will assume that we have noise in both coordinates, that is,

(xi, yi)^T = (x̃i, ỹi)^T + δi, (10.9)
Figure 10.2: Left: the vertical distances between the line and the measured points are minimized in (10.6). Right: the minimal (orthogonal) distances between the line and the measured points are minimized in (10.12).
where δi ∈ N(0, I) and ax̃i + bỹi = c. The δi now belong to a two dimensional normal distribution with probability density function

p(δi) = (1/(2π)) e^{−‖δi‖²/2}. (10.10)

The log likelihood function is

∑_{i=1}^n log(p(δi)) = −∑_{i=1}^n ((xi − x̃i)² + (yi − ỹi)²)/2 + ∑_{i=1}^n log(1/(2π)), (10.11)

so the maximum likelihood estimate is obtained by minimizing

∑_{i=1}^n (xi − x̃i)² + (yi − ỹi)², (10.12)

where ax̃i + bỹi = c. The point (x̃i, ỹi) can be any point on the line; however, since we are minimizing (10.12) we can restrict it to be the closest point on the line. The expression (10.12) then becomes the distance between (xi, yi) and the line. This distance can be expressed using the distance formula as

|axi + byi + c| / √(a² + b²). (10.13)

The problem of minimizing the sum of squared distances,

min ∑_{i=1}^n (axi + byi + c)² (10.14)
such that a² + b² = 1, (10.15)

is often referred to as the total linear least squares problem.
To solve (10.14)-(10.15) we first take derivatives with respect to c. This shows that the optimal solution must fulfill

c = −(ax̄ + bȳ), (10.16)

where x̄ and ȳ are the means of the xi and yi. Inserting this into (10.14) gives a quadratic form in (a, b), and the problem can be written
min t^T M t (10.21)
such that 1 − t^T t = 0, (10.22)

where t is a 2 × 1 vector containing a and b, and M = ∑_{i=1}^n (xi − x̄, yi − ȳ)^T (xi − x̄, yi − ȳ). This is a constrained optimization problem of the type

min f(t) such that g(t) = 0.

According to Persson-Böiers, "Analys i flera variabler", and the method of Lagrange multipliers, the solution of such a problem has to fulfill

∇f(t) + λ∇g(t) = 0. (10.25)

Therefore the solution of (10.21)-(10.22) must fulfill 2Mt − 2λt = 0, that is, Mt = λt. The solution t thus has to be an eigenvector of the matrix M. Furthermore, inserting into (10.21) and using that t^T t = 1, we see that it has to be the eigenvector corresponding to the smallest eigenvalue.
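A minimal sketch of the total least squares fit: c is eliminated with (10.16) and (a, b) is the eigenvector of M corresponding to the smallest eigenvalue:

    import numpy as np

    def fit_line_tls(x, y):
        xc, yc = x - x.mean(), y - y.mean()   # mean-centered coordinates
        D = np.column_stack([xc, yc])
        M = D.T @ D
        w, V = np.linalg.eigh(M)              # eigenvalues in ascending order
        a, b = V[:, 0]                        # eigenvector of smallest eigenvalue
        c = -(a * x.mean() + b * y.mean())    # from (10.16)
        return a, b, c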
In structure from motion problems we typically have two types of noise. Appearance changes make it hard to exactly localize corresponding points, giving rise to displacements that are typically on the order of at most a few pixels. This type of noise is usually modeled as Gaussian. In addition to this we often have mismatches that give rise to very large errors. When minimizing the least squares objective, such measurements can severely distort the results. Figure 10.3 shows a line fitting example with one outlier point. Because of the quadratic growth of the least squares criterion (10.6), points that are far away from the estimated line will have a proportionally larger effect on the estimate than points that are close to it. This gives the poor estimate shown in the middle of Figure 10.3. By using a robust loss-function that essentially removes the effect of the outlier it is possible to obtain a better estimate of the line, as shown to the right in Figure 10.3.
To find a suitable robust loss-function we can replace the assumption of normalized noise with a more general distribution with density function

p(ϵi) = c e^{−ρ(ϵi²)}. (10.27)
Here ρ is a function that should be less sensitive to outliers and c is some constant which ensures that the density
function integrates to 1. As in Section 10.2.1 we assume that we have 2D measurements (xi , yi ) of points belonging
to a line y = ax + b with noise in the y-coordinate yi = ỹi + ϵi . To obtain the maximum likelihood estimate we
should then minimize

∑_{i=1}^n ρ((axi + b − yi)²). (10.28)
Figure 10.3: Line fitting with one outlier. Left: Measurements. Middle: Obtained model fit using the least squares
objective (10.6). Right: Obtained model fit using a robust loss-function.
Setting the derivatives of (10.28) with respect to a and b to zero gives the equations

∑_{i=1}^n ρ′(ϵi²) ϵi xi = 0 and ∑_{i=1}^n ρ′(ϵi²) ϵi = 0,

where ϵi = axi + b − yi. In general these equations lack closed form solutions and have to be solved iteratively. Nonetheless, we can still draw some simple conclusions from these expressions. In matrix form we can write the two equations as M (a, b)^T = m, where

M = ∑_{i=1}^n ρ′(ϵi²) [xi² xi; xi 1] and m = ∑_{i=1}^n ρ′(ϵi²) (xiyi, yi)^T. (10.31)
Assuming that we somehow know what ϵi is at the optimal values of (a, b), we then have (a, b)^T = M^{-1} m. We note that M and m can be seen as weighted sums of the matrices [xi² xi; xi 1] and (xiyi, yi)^T respectively. If we use ρ(t) = t then (10.28) reduces to the least squares problem and ρ′(ϵi²) = 1. By modifying the weight ρ′(ϵi²) we can increase or reduce the impact of measurement i on the solution. For example, selecting ρ so that its derivative is decreasing will reduce the effects of measurements with large errors. A common approach is to use a threshold τ. If we let ρ2(t) = min(t, τ²) for some value τ then ρ2′(ϵi²) will be zero if |ϵi| is larger than τ and therefore this residual completely disappears from (10.31). In contrast, if |ϵi| is smaller than τ then ρ2′(ϵi²) = 1. (Note that this is the derivative of ρ2(t) with respect to t where we have inserted t = ϵi².) Therefore this option will essentially remove large residuals and compute the least squares solution of the remaining ones, which is what is illustrated to the right in Figure 10.3. Figure 10.4 shows the truncated loss function ρ2(ϵi²).
Strictly speaking the function min(ϵi², τ²) will not be differentiable at |ϵi| = τ. However, we can always find approximations that are close to min(ϵi², τ²) while still being differentiable. In Figure 10.4 we show ρ3(ϵi²), which is one such commonly used approximation.
As we noted above, using decreasing derivatives generally makes the loss-function more robust to outliers. On the other hand, optimization then becomes more difficult since this often leads to non-convex problems with local minimizers. If we for example use ∑_{i=1}^n ρ2((axi + b − yi)²) the derivatives will all be zero for any choice of a and b that makes all error residuals larger than τ. Thus an iterative algorithm, making use of the derivatives, will immediately get stuck if the starting point is not close enough to the global minimizer. Selecting ρ so that the derivatives are decreasing but never reach zero can help to achieve better convergence. The function ρ4(ϵi²) in Figure 10.4 shows one such example. However, without convexity there is in general no guarantee that we will
reach the global minimizer. Initialization is therefore an important issue (which we will address in Lecture 8) for
these methods.
The last example h(ϵi) := ρ5(ϵi²) of Figure 10.4 is the Huber function. If we consider h as a function of ϵi it has the derivative

h′(ϵi) = { 2ϵi if |ϵi| < b; 2b·sign(ϵi) if |ϵi| ≥ b }. (10.32)
Since h′ is non-decreasing, h is convex. Furthermore, since a sum of convex functions is also convex, the line fitting problem defined by (10.28) is convex. Therefore local optimization starting from any initialization is guaranteed to converge to the globally optimal solution. On the other hand, the weights ρ′(ϵi²) are decreasing (and tend to 0 as ϵi → ∞). In that sense this is a good trade-off between robustness and convexity. We will study global optimization further in Lecture 10.
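One common way to solve (10.28) numerically is iteratively reweighted least squares: the weights ρ′(ϵi²) are held fixed, the weighted normal equations M(a, b)^T = m from (10.31) are solved, and the process is repeated. A minimal sketch using the Huber weights ρ5′ of Figure 10.4 (all names are our own):

    import numpy as np

    def huber_weights(eps_sq, b=1.0):
        # rho_5'(eps^2) from Figure 10.4: 1 if |eps| <= b, else b / |eps|.
        eps = np.sqrt(eps_sq)
        return np.where(eps >= b, b / np.maximum(eps, 1e-12), 1.0)

    def irls_line(x, y, rho_prime=huber_weights, iters=20):
        A = np.column_stack([x, np.ones_like(x)])
        a, b = np.linalg.lstsq(A, y, rcond=None)[0]   # least squares start
        for _ in range(iters):
            eps = a * x + b - y
            w = rho_prime(eps**2)
            M = np.array([[np.sum(w * x**2), np.sum(w * x)],
                          [np.sum(w * x), np.sum(w)]])
            m = np.array([np.sum(w * x * y), np.sum(w * y)])
            a, b = np.linalg.solve(M, m)
        return a, b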
ρ2(ϵi²) = min(ϵi², τ²), ρ2′(ϵi²) = { 0 if |ϵi| > τ; 1 if |ϵi| < τ }

ρ3(ϵi²) = (ln(1 + τ²) − ln(e^{−ϵi²} + τ²)) / ln(1 + τ²), ρ3′(ϵi²) = e^{−ϵi²} / ((e^{−ϵi²} + τ²) ln(1 + τ²))

ρ4(ϵi²) = b² ln(1 + ϵi²/b²), ρ4′(ϵi²) = b² / (b² + ϵi²)

ρ5(ϵi²) = { 2b|ϵi| − b² if |ϵi| ≥ b; ϵi² if |ϵi| ≤ b }, ρ5′(ϵi²) = { b/|ϵi| if |ϵi| ≥ b; 1 if |ϵi| ≤ b }
Figure 10.4: Some examples of loss-functions and the resulting weights. (Note that the derivative ρ′j (ϵ2i ) is obtained
by differentiating ρj (t) with respect to t and inserting t = ϵ2i ).
In this section we derive the maximum likelihood estimator for our class of projection problems. Suppose the 2D-point xij = (x1ij, x2ij) is a projection in regular Cartesian coordinates of the 3D-point Xj in camera Pi. The projection is then given by

(x1ij, x2ij) = (Pi1 Xj / Pi3 Xj, Pi2 Xj / Pi3 Xj), (10.33)
where Pi1, Pi2, Pi3 are the rows of the camera matrix Pi. Also we assume that the observations are corrupted by Gaussian noise, that is,

(x1ij, x2ij) = (Pi1 Xj / Pi3 Xj, Pi2 Xj / Pi3 Xj) + δij, (10.34)

where δij is normally distributed with covariance I. The probability density function is then

p(δij) = (1/(2π)) e^{−‖δij‖²/2}. (10.35)
Similarly to Section 10.2.2 we now see that the model configuration that maximizes the likelihood of obtaining the observations xij = (x1ij, x2ij) is obtained by solving

min ∑_{i=1}^n ∑_{j=1}^m ‖(x1ij − Pi1 Xj / Pi3 Xj, x2ij − Pi2 Xj / Pi3 Xj)‖², (10.36)

where n is the number of cameras and m is the number of scene points. The geometric interpretation of the above expression is that the distance between the projection and the measured point in the image should be minimized, see Figure 10.5. Note that it does not matter which of the variables Pi and Xj we consider as unknowns; it is always the reprojection error that should be minimized.
Figure 10.5: Geometric interpretation of the maximum likelihood estimate for projection problems. The dashed
distance should be minimized.
The objective function (10.36) is in general very hard to minimize and we are usually limited to local refinement from a suitably selected starting point. In this section we will study the special case of two-view triangulation, where it can be shown that the problem can be solved by finding the roots of a 6th degree polynomial. We assume that the two cameras P and P̄ are known and we are looking for the 3D point X that projects as close to the measured points x and x̄ as possible by minimizing the objective
\min_X \left\| \left( x^1 - \frac{P^1 X}{P^3 X},\; x^2 - \frac{P^2 X}{P^3 X} \right) \right\|^2 + \left\| \left( \bar{x}^1 - \frac{\bar{P}^1 X}{\bar{P}^3 X},\; \bar{x}^2 - \frac{\bar{P}^2 X}{\bar{P}^3 X} \right) \right\|^2. (10.37)
To simplify notation we will let y and ȳ be the exact projections of X, that is λy = PX and λ̄ȳ = P̄X. From Lecture 5 we know that there is a 3D point that projects to y and ȳ if and only if the epipolar constraint ȳᵀFy = 0 holds. Furthermore, if we let d(·,·) denote the distance between two points then (10.37) can be written
\min_{y, \bar{y}} \; d(x, y)^2 + d(\bar{x}, \bar{y})^2 \quad \text{such that} \quad \bar{y}^T F y = 0. (10.38)
The epipolar constraint says that y has to lie on the epipolar line l = Fᵀȳ in image 1. Similarly ȳ has to lie on the epipolar line l̄ = Fy in image 2. It turns out that all points p lying on the epipolar line l in image 1 give the same epipolar line Fp in the second image. To see this we note that since the epipole e of image 1 is on l, the points of l can be written
p = e + tv, (10.39)
where v is a direction and t ∈ ℝ. Therefore we have Fp = Fe + tFv = tFv ∼ Fv, which shows that all these points give the same epipolar line. Similarly one sees that all points on l̄ give rise to the same epipolar line in image 1. There is therefore a one-to-one correspondence between epipolar lines in the two images, and as long as y and ȳ lie on corresponding epipolar lines there is a 3D point X that projects to them.
Figure 10.6 gives a geometric interpretation of corresponding epipolar lines. The camera centers C, C̄ and the 3D point X are coplanar and form an epipolar plane. The epipolar lines are the intersections between this plane and the image planes. A 3D plane has 3 degrees of freedom. Any epipolar plane contains the two camera centers (and the two epipoles), which fixes two of these degrees of freedom. The family of epipolar planes, and consequently the family of corresponding epipolar lines, can therefore be parametrized by a single parameter s.
Figure 10.6: Visualization of corresponding epipolar lines. The camera centers C and C̄ and the 3D point X form a plane. The intersections with the image planes form a pair of corresponding epipolar lines.
Now suppose that we search for the best points over a given pair of corresponding epipolar lines l and l̄. It is then clear that the points that minimize the distance to x and x̄ respectively are given by the orthogonal projections of x and x̄ onto the lines. To solve (10.37) we can therefore equivalently search for the corresponding epipolar lines l and l̄ that minimize
d(x, l)^2 + d(\bar{x}, \bar{l})^2, (10.40)
where d(x, l) denotes the (orthogonal) distance between the measured point x and the line l. To parametrize the family of corresponding lines we note that l goes through e, and we specify its direction by v. If we let v = (s, 1, 0)ᵀ, the line can take any 2D direction (except (1, 0), which corresponds to s → ∞ and has to be checked separately). The expressions for the corresponding lines are now given by
l = e \times v = [e]_\times v, (10.41)
\bar{l} = F(e + tv) \sim Fv. (10.42)
Note that the elements of these two vectors are affine functions of s (1st degree polynomials).
Exercise 25. Compute all epipolar line correspondences for the cameras
P = \begin{bmatrix} I & 0 \end{bmatrix} \quad \text{and} \quad \bar{P} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 \end{bmatrix}. (10.43)
Solution: The fundamental matrix is
F = \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix}. (10.44)
The nullspace of F is easily seen to be t(1, 0, −1) and therefore e ∼ (1, 0, −1). When parameterizing with the directional vector v = (s, 1, 0), s ∈ ℝ, we get
l \sim [e]_\times v = \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 1 \\ 0 & -1 & 0 \end{pmatrix} \begin{pmatrix} s \\ 1 \\ 0 \end{pmatrix} = \begin{pmatrix} -1 \\ s \\ -1 \end{pmatrix} (10.45)
and
\bar{l} \sim Fv = \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} s \\ 1 \\ 0 \end{pmatrix} = \begin{pmatrix} -1 \\ s \\ 0 \end{pmatrix}. (10.46)
The squared distance between a point x = (x¹, x²) and a line l = (l₁, l₂, l₃) is
d(x, l)^2 = \frac{(l_1 x^1 + l_2 x^2 + l_3)^2}{l_1^2 + l_2^2}. (10.47)
Since the elements of l are affine functions of s, this is a quotient of two second degree polynomials in s,
\frac{as^2 + bs + c}{ds^2 + es + f}. (10.48)
Since the same holds for the squared distance between x̄ and l̄, the objective function (10.40) is a sum of two such expressions. To find its minimizer we compute its stationary points by differentiating. The derivative of (10.48) is
\frac{(2as + b)(ds^2 + es + f) - (as^2 + bs + c)(2ds + e)}{(ds^2 + es + f)^2}, (10.49)
which is a quotient of a degree two and a degree four polynomial (the degree three terms of the numerator cancel). Setting the derivative of (10.40) to zero and multiplying with the two denominators thus gives a polynomial of at most degree 6 whose roots contain the optimal value of s.
Exercise 26. If x ∼ (0, 2, 1) and x̄ ∼ (1, 2, 1) compute the corresponding epipolar lines that minimize
d(x, l)2 + d(x̄, l̄)2 for the cameras in Exercise 25.
Solution: We have
d(x, l)^2 = \frac{(l^T x)^2}{l_1^2 + l_2^2} = \frac{(2s-1)^2}{1 + s^2} \quad \text{and} \quad d(\bar{x}, \bar{l})^2 = \frac{(\bar{l}^T \bar{x})^2}{\bar{l}_1^2 + \bar{l}_2^2} = \frac{(2s-1)^2}{1 + s^2}, (10.50)
which gives the objective function
f(s) := d(x, l)^2 + d(\bar{x}, \bar{l})^2 = 2\frac{(2s-1)^2}{1 + s^2}. (10.51)
Differentiation gives
f'(s) = \frac{8(2s-1)}{1 + s^2} - \frac{2(2s-1)^2}{(1+s^2)^2}\,2s = \frac{4(2s-1)(s+2)}{(s^2+1)^2}. (10.52)
Hence f′(s) = 0 when s = −2 and s = 1/2. The function therefore has the following shape:
s      :       −2        1/2
f′(s)  :  +   0    −   0    +
f(s)   :  ↗   10   ↘   0    ↗
and therefore f(s) = 0 at s = 1/2 is a local minimum. To conclude that this is also the global minimum we compute the limits
lim_{s→±∞} f(s) = 8 > 0.
Inserting s = 1/2 into (10.45) and (10.46) gives the optimal epipolar lines l ∼ (−1, 1/2, −1) ∼ (2, −1, 2) and l̄ ∼ (−1, 1/2, 0) ∼ (2, −1, 0).
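As a quick numerical sanity check (my own sketch, not part of the notes), one can verify the minimizer of f with NumPy:

    import numpy as np

    # Objective from Exercise 26: f(s) = 2(2s-1)^2 / (1+s^2).
    f = lambda s: 2 * (2*s - 1)**2 / (1 + s**2)

    s = np.linspace(-10, 10, 200001)
    print(s[np.argmin(f(s))])    # approx 0.5
    print(f(0.5), f(-2.0))       # 0.0 (global minimum), 10.0 (local maximum)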
Figure 10.7 compares the optimal triangulation objective to the DLT objective from Lecture 4.
Figure 10.7: Comparison between the DLT- and the ML objectives for triangulation. First row DLT, second row
ML objective (with the same colormap).
In general the maximum likelihood estimator (10.36) can only be solved using local iterative methods. However, in the special case of affine cameras there is a closed form solution. An affine camera is a camera where the third row
of P is P³ = [0 0 0 t₃]. Since the scale of the camera matrix is arbitrary we may assume that t₃ = 1, and therefore the camera matrix has the form
P = \begin{pmatrix} A & t \\ 0 & 1 \end{pmatrix}, (10.53)
where A is a 2 × 3 matrix and t is a 2 × 1 vector. Geometrically this means that viewing rays are assumed to be
perpendicular to the image plane, see Figure 10.8. If the observed scene points are roughly at the same distance
from the camera this model often works well as an approximation of the pinhole camera model.
Figure 10.8: Left - A regular pinhole camera where all viewing rays go through the camera center. Right - The affine camera model where the viewing rays are parallel to the normal of the image plane.
If we use regular Cartesian coordinates for both image points and scene points the camera equations can be simpli-
fied. If xij is the projection of the scene point Xj in the affine cameras Pi then the projection can be written
xij = Ai Xj + ti . (10.54)
If we minimize the ML objective, the optimal translation for a fixed structure is found by setting the derivative with respect to tᵢ to zero, which gives
t_i = \bar{x}_i - A_i\bar{X}, (10.55)
where \bar{X} = \frac{1}{m}\sum_j X_j and \bar{x}_i = \frac{1}{m}\sum_j x_{ij}. To simplify the problem we therefore change coordinates so that all these mean values are zero by translating all image points and scene points. Using x̃ij = xij − x̄i and X̃j = Xj − X̄ gives the simplified problem
\min \sum_{ij} \|\tilde{x}_{ij} - A_i \tilde{X}_j\|^2. (10.56)
If we collect the points x̃ij in a 2n × m measurement matrix M and the cameras Aᵢ in a 2n × 3 matrix A, with X̃ = [X̃₁ ⋯ X̃ₘ], the problem can be written min ‖M − AX̃‖². Since A has only 3 columns the second term is a matrix of rank at most 3. Thus our problem is to find the matrix of rank 3 that best approximates M. The best approximating matrix can be found by computing the SVD of M and setting all but the first 3 singular values to zero.
We summarize the algorithm for affine cameras here:
1. Re-center all images so that the center of mass of the image points is zero in each image.
2. Form the measurement matrix M from the re-centered image points.
3. Compute the singular value decomposition M = USVᵀ.
4. A solution can be found by extracting the cameras from U(:, 1:3) and the structure from S(1:3, 1:3) ∗ V(:, 1:3)′.
5. Transform back the solution to the original image coordinates.
Note that the approach only works when all points are visible in all images. Furthermore, the camera model is affine, which is a simplification. This is often a good approximation when the scene points have roughly the same depth.
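The algorithm above fits in a few lines of NumPy. The following is a minimal sketch under the stated assumptions (all points visible in all images; function and variable names are my own):

    import numpy as np

    def affine_sfm(x):
        """Affine structure from motion via factorization.

        x : array of shape (n_cameras, 2, n_points) with image coordinates.
        """
        n, _, m = x.shape
        means = x.mean(axis=2, keepdims=True)        # per-image centroids
        M = (x - means).reshape(2 * n, m)            # re-centered measurement matrix
        U, S, Vt = np.linalg.svd(M, full_matrices=False)
        A = U[:, :3]                                 # stacked 2x3 camera blocks
        X = np.diag(S[:3]) @ Vt[:3]                  # 3 x m structure
        return A.reshape(n, 2, 3), X, means[:, :, 0]

    # Synthetic test: 4 affine cameras observing 10 points, plus translations.
    rng = np.random.default_rng(0)
    A_true = rng.normal(size=(4, 2, 3)); X_true = rng.normal(size=(3, 10))
    x = np.einsum('nij,jm->nim', A_true, X_true) + rng.normal(size=(4, 2, 1))
    A, X, t = affine_sfm(x)
    print(np.allclose(np.einsum('nij,jm->nim', A, X) + t[:, :, None], x))  # True

The recovered factors differ from the true ones by an invertible 3 × 3 transformation, as discussed in Lecture 14, but the reprojections agree.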
Lecture 11:
RANSAC and Minimal Solvers
In previous lectures we have studied the algebraic equations that govern projective camera systems. Under the assumption that the data given to us in the form of point correspondences is correct, we have derived algorithms for approximately solving these equations in the presence of moderate noise. However, since correspondences are determined automatically, this assumption will typically not hold in practice. A typical situation is shown in Figure 11.1, where correspondences between two images have been determined using SIFT descriptors. In practice we have to expect that the data contains (at least) a small portion of incorrect matches. We refer to these as outliers and to the rest as inliers.
Figure 11.1: The Outlier Problem. When automatically detecting correspondences using descriptors such as SIFT
there will always be a portion of incorrect matches. Green lines in the figure correspond to correct matches and red
lines correspond to outliers.
As we saw in Lecture 10, the outliers typically do not fulfill the Gaussian noise assumption for the particular problem that we are trying to solve, and they can severely degrade the quality of the estimation. To address this issue we can use robust loss-functions. However, these can be sensitive to initialization and can often only handle a relatively small portion of outliers. Therefore we need a method for removing outliers as well as providing reliable starting solutions that can be locally refined.
11.2 RANSAC
Random sample consensus (RANSAC) is a method for removing outliers. The idea is simple: if the number of outliers is small and we pick a small subset of the measurements at random, we are likely to pick an outlier-free set. The algorithm is outlined below; a small code sketch follows the outline.
The outline of the algorithm is as follows:
1. Randomly select a small subset of measurements and solve the problem using only these.
2. Evaluate the error residuals for the rest of the measurements under the solution from 1. The Consensus Set
for this solution is the set of measurements with error residuals less than some predefined threshold.
3. Repeat a number of times and select the solution that gives the largest consensus set.
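The following Python sketch (my own; names are made up) implements this outline for line fitting, where each sample consists of two points:

    import numpy as np

    def ransac_line(points, iters=100, threshold=0.05, rng=None):
        """RANSAC for 2D line fitting. points : (n, 2) array.

        Returns the line (a, b, c), a*x + b*y + c = 0 with a^2 + b^2 = 1,
        and a boolean mask of the consensus set.
        """
        rng = rng or np.random.default_rng()
        best_line, best_mask = None, np.zeros(len(points), dtype=bool)
        for _ in range(iters):
            p, q = points[rng.choice(len(points), size=2, replace=False)]
            normal = np.array([-(q[1] - p[1]), q[0] - p[0]])   # normal of line p-q
            norm = np.linalg.norm(normal)
            if norm < 1e-12:
                continue                                       # degenerate sample
            a, b = normal / norm
            c = -(a * p[0] + b * p[1])
            mask = np.abs(points @ (a, b) + c) < threshold     # consensus set
            if mask.sum() > best_mask.sum():
                best_line, best_mask = (a, b, c), mask
        return best_line, best_mask

    # Example: noisy points on y = 0.5x + 1 with 10% outliers.
    rng = np.random.default_rng(1)
    xs = rng.uniform(0, 10, 100)
    pts = np.column_stack([xs, 0.5 * xs + 1 + 0.01 * rng.normal(size=100)])
    pts[:10] = rng.uniform(0, 10, (10, 2))
    line, inliers = ransac_line(pts)
    print(inliers.sum())                                       # roughly 90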
The probability of randomly selecting a set of inliers depends on the size of the set and the proportion of inliers.
Exercise 27. Assume that we want to fit a line to a set of points. We randomly select 2 points and fit a line to these in each RANSAC iteration. Suppose that 10% of the points are outliers. How many iterations are required to find at least one set of only inliers with probability p = 95%? You may assume that the set of points is large, so that the portion of outliers does not change when removing a point. (Hint: First compute the probability of failure.)
Solution: The probability of selecting an inlier set is 0.9². Therefore the probability of failing to do so is 1 − 0.9². Now suppose we sample n times. Finding an inlier set at least once is the complement event of failing to find an inlier set all n times. The probability of failing n times is (1 − 0.9²)ⁿ. We should therefore have
(1 − 0.9²)ⁿ < 1 − 0.95 \;\Rightarrow\; n \log(1 − 0.9²) ≤ \log(1 − 0.95) \;\Rightarrow\; n ≥ \frac{\log(1 − 0.95)}{\log(1 − 0.9²)} ≈ 1.8.
Therefore n = 2 works.
Exercise 28. Same question as before, but now we select 8 point correspondences to estimate a fundamental matrix.
Solution:
(1 − 0.9⁸)ⁿ < 1 − 0.95 \;\Rightarrow\; n \log(1 − 0.9⁸) ≤ \log(1 − 0.95) \;\Rightarrow\; n ≥ \frac{\log(1 − 0.95)}{\log(1 − 0.9⁸)} ≈ 5.4.
Therefore n = 6 works.
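The computation in the two exercises generalizes directly; a one-function sketch (mine, not from the notes):

    import numpy as np

    def ransac_iterations(inlier_ratio, sample_size, p=0.95):
        """Iterations needed to draw at least one all-inlier sample with prob. p."""
        return int(np.ceil(np.log(1 - p) / np.log(1 - inlier_ratio**sample_size)))

    print(ransac_iterations(0.9, 2))   # 2  (Exercise 27)
    print(ransac_iterations(0.9, 8))   # 6  (Exercise 28)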
In practice it is a good idea to run more iterations than what is needed since, because of noise, not all inlier sets work equally well for estimating the solution. For example, in the case of line estimation: if the two inlier points used to estimate the line are very close to each other, then the line estimate can be very poor due to noise and may therefore generate a small consensus set, even though the sample contains no outliers.
The more measurements we use in each RANSAC iteration, the more iterations we need to run to find good inlier sets. Therefore it is essential to use as few measurements as possible. Minimal solvers are a class of algebraic solvers that compute solutions from a minimal amount of data. In Lecture 6 we computed the essential matrix from 8 or more point correspondences. However, the essential matrix has only 5 degrees of freedom and a minimal solver for this problem therefore only uses 5 points. The 8-point algorithm is more general in that it works for any number of correspondences above 8, whereas the minimal solver only works for precisely 5. Still, in the context of a RANSAC algorithm the minimal solver is preferable.
Minimal solvers often need to find solutions to systems of non-linear equations. Next we will present a method for
solving polynomial systems of equations. The idea is to transform the problem into an eigenvalue problem.
For simplicity we will first consider a system of two equations in two variables,
x^2 - y - 3 = 0
xy - x = 0. (11.1)
By factoring out x from the second equation it can be seen that the system has three roots, namely (0, −3), (2, 1) and (−2, 1).
In the following sections we will present a method for automatically finding these roots that works under fairly general conditions. A polynomial p of degree n can be represented using a monomial vector m(x, y) containing all monomials of degree at most n and a coefficient vector c_p. For example, the polynomial p(x, y) = 1 + 2x + 3y + 4x² + 5xy + 6y² can be represented by c_pᵀ m(x, y), where
c_p = (1, 2, 3, 4, 5, 6)^T \quad \text{and} \quad m(x, y) = (1, x, y, x^2, xy, y^2)^T. (11.2)
Using the monomials in m(x, y) and a coefficient vector we can represent any second degree polynomial. The
collection of monomials in m(x, y) is called a monomial basis. The approach we will present is based on the
observation that if we insert a root (x0 , y0 ) in the monomial vector m(x, y) then the resulting vector can be found
by computing eigenvectors of a particular matrix.
We define T_x to be an operator that takes a polynomial p(x, y) and multiplies it with x. If we apply T_x to the three monomials 1, x, y we get
1 ↦ x (11.3)
x ↦ x² (11.4)
y ↦ xy. (11.5)
Now let us assume that (x0 , y0 ) is a solution to (11.1). The result of applying Tx to the above monomials can then
be simplified if we insert (x0 , y0 ),
1 ↦ x₀ (11.6)
x₀ ↦ x₀² = y₀ + 3 (11.7)
y₀ ↦ x₀y₀ = x₀. (11.8)
Now suppose that a first order polynomial p is given by the coefficient vector c_p = (c_p^1, c_p^2, c_p^3) and monomial vector m(x, y) = (1, x, y). By q(x, y) we denote the result of applying T_x to p(x, y). Because of the reductions (11.6)-(11.8) we get
q(x_0, y_0) = c_p^1 x_0 + c_p^2(y_0 + 3) + c_p^3 x_0 = 3c_p^2 \cdot 1 + (c_p^1 + c_p^3)x_0 + c_p^2 y_0. (11.9)
We see that if (x0 , y0 ) solves (11.1) then because of the reductions we can represent q(x0 , y0 ) using the vector
m(x0 , y0 ) and a coefficient vector cq . The coefficient vector can be found by identifying the monomials in (11.9),
\begin{pmatrix} c_q^1 \\ c_q^2 \\ c_q^3 \end{pmatrix} = \begin{pmatrix} 3c_p^2 \\ c_p^1 + c_p^3 \\ c_p^2 \end{pmatrix} = \underbrace{\begin{pmatrix} 0 & 3 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}}_{=M_x} \begin{pmatrix} c_p^1 \\ c_p^2 \\ c_p^3 \end{pmatrix}. (11.10)
The matrix Mx is called the action matrix for the mapping Tx . Given the coefficients of p(x, y) it computes the
coefficients of x0 p(x0 , y0 ) provided that (x0 , y0 ) solves the system (11.1). Under certain conditions it is possible
to compute the roots of the system from this matrix.
Next we will show that the x-coordinates of the roots of the system (11.1) are eigenvalues of the action matrix M_x. For any polynomial p we have
x0 p(x0 , y0 ) = x0 cTp m(x0 , y0 ). (11.11)
Furthermore,
x0 p(x0 , y0 ) = q(x0 , y0 ) = cTq m(x0 , y0 ) = (Mx cp )T m(x0 , y0 ) = cTp MxT m(x0 , y0 ). (11.12)
Since this is true for any degree one polynomial (and therefore any coefficient vector c_p of size 3 × 1) we must have that
x_0\, m(x_0, y_0) = M_x^T m(x_0, y_0). (11.13)
Therefore we can conclude that if (x_0, y_0) is a root of (11.1), then m(x_0, y_0) is an eigenvector of M_xᵀ with eigenvalue x_0. It is straightforward to verify that (11.13) holds for all the three roots (0, −3), (2, 1) and (−2, 1) of the system (11.1).
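This is easy to verify numerically; a small sketch (my own) using NumPy's eigendecomposition:

    import numpy as np

    # Action matrix from (11.10); its transpose has the roots' x-coordinates
    # as eigenvalues, with eigenvectors proportional to m(x0, y0) = (1, x0, y0).
    Mx = np.array([[0., 3., 0.],
                   [1., 0., 1.],
                   [0., 1., 0.]])
    vals, vecs = np.linalg.eig(Mx.T)
    for lam, v in zip(vals, vecs.T):
        v = v / v[0]              # normalize so the entry of the monomial 1 is 1
        print(f"x0 = {lam:5.2f}, m(x0, y0) = {np.round(v, 6)}")
    # Recovers the roots (2, 1), (-2, 1) and (0, -3).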
11.3.3 Algorithm
Based on the above derivations we now give an algorithm for finding the roots of a system of polynomials.
1. Select a monomial basis and represent the polynomials using monomial and coefficient vectors.
2. Apply the mapping T_x to the monomial basis and reduce the result, using the equations of the system, until the resulting expressions consist only of monomials from the basis.
3. Construct the action matrix M_x from the reductions.
4. Compute the eigenvectors of M_xᵀ and extract the solutions from these.
The theory from Section 11.3.2 says that the solutions will be among the eigenvectors. It does however not
guarantee that there are no other eigenvectors. Therefore we might have to check the extracted solutions by inserting
into the system of equations.
Furthermore, if the eigenvalues are not distinct there might be infinitely many eigenvectors to search. For example, if the x-coordinates of two of the roots are the same, then M_xᵀ will have a double eigenvalue. If we cannot find the correct eigenvector, the eigenvalue still gives us some information, since it is not possible to have a root with an x-coordinate that is not an eigenvalue of M_xᵀ.
In the above example we only considered monomials of degree 1 in (11.6)-(11.8). This worked since all the
equations resulting from multiplication with x could be reduced to monomials of degree 1. Therefore we could
represent both p(x0 , y0 ) and x0 p(x0 , y0 ) with the monomial basis 1, x0 , y0 . If this is possible or not depends on
the system of equations. In general the basis has to be selected large enough so that all the reductions result in terms
that are present in the basis.
The theory still holds even if we do not select the smallest possible monomial basis. For the system (11.1) we can consider all monomials of degree at most two. For example we can use the reductions:
1 ↦ x₀ (11.15)
x₀ ↦ x₀² (11.16)
y₀ ↦ x₀y₀ (11.17)
x₀² ↦ x₀³ = x₀(y₀ + 3) = x₀y₀ + 3x₀ (11.18)
x₀y₀ ↦ x₀²y₀ = x₀² (11.19)
y₀² ↦ x₀y₀² = x₀y₀. (11.20)
Here we made reductions so that all the terms on the right hand side have degree 2 or less. Since all the monomials on the right hand side are also present on the left hand side we can construct an action matrix from these reductions. Note that some of these terms can be reduced further. However, the theory from Section 11.3.2 holds regardless of whether we do this or not; further reduction would just result in a different action matrix.
The resulting action matrix is in this case the 6 × 6 matrix
M_x = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 3 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}. (11.21)
The transpose of this matrix has eigenvalues λ = −2, 0, 2 which agrees with our roots. The eigenvalue 0 does
however have multiplicity four and therefore there is no unique eigenvector to this value. Hence we might not be
able to find the solution (0, −3) using the eigenvectors of Mx .
In Section 11.3.1 we chose to construct the action matrix for multiplication with x. However, in principle any mapping T_{q(x,y)} could be used. The choice of q does however affect the reductions (11.6)-(11.8). For example, for T_y we get the reductions 1 ↦ y₀, x₀ ↦ x₀y₀, y₀ ↦ y₀², x₀² ↦ x₀²y₀ = x₀², x₀y₀ ↦ x₀y₀² = x₀y₀ and y₀² ↦ y₀³ = x₀² − 3y₀², which give the action matrix
M_y = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & -3 \end{pmatrix}. (11.28)
The transpose of this matrix has the eigenvalues −3, 1 and 0. Since there are two roots with y-coordinate 1, the eigenvalue 1 will have at least multiplicity two. Therefore we can only extract the solution (0, −3), corresponding to the eigenvalue −3, from the eigenvectors.
A simple heuristic for generating action matrices with more distinct eigenvalues is to use a mapping Tq(x,y) where
q(x, y) is a random combination of x and y. For example 0.5MxT + 0.5MyT (with Mx and My from (11.21) and
(11.28) respectively) has five distinct eigenvalues.
Another trick that modifies the eigenspaces is to drop some of the monomials. In (11.28) rows 1 and 2 are all zeros. This means that 1 and x₀ do not occur in any of the expressions after the reductions. Therefore we can remove these from the system. The new action matrix is
M_y = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & -3 \end{pmatrix}. (11.29)
Note however that the monomial vector is now m(x, y) = (y, x², xy, y²) and therefore the eigenvector will not contain the value of x₀.
In this section we will construct a minimal solver for the problem of finding an Essential matrix. Given 5 point
correspondences we will use the following equations:
\bar{x}_i^T E x_i = 0, \quad i = 1, ..., 5, (11.30)
\det(E) = 0, (11.31)
2EE^TE - \mathrm{trace}(EE^T)E = 0. (11.32)
The third constraint (11.32) actually consists of 9 polynomial equations since it is a matrix expression. Any matrix E that has a singular value decomposition of the form
E = U \underbrace{\begin{pmatrix} \sigma & 0 & 0 \\ 0 & \sigma & 0 \\ 0 & 0 & 0 \end{pmatrix}}_{=S} V^T (11.33)
will fulfill this constraint. This can be seen by inserting (11.33) into (11.32). Furthermore, it can be shown that
any matrix that fulfills (11.32) must have a singular value decomposition as in (11.33).
To find the solutions of (11.30)-(11.32) we will first use (11.30) to reduce the number of variables. We construct a matrix M of size 5 × 9 from the 5 epipolar constraints, similar to what we did in Lecture 6. In contrast to the eight point algorithm, the matrix M is in itself not enough to determine all the 8 parameters of the essential matrix, since it only represents 5 equations. Since the dimension of the nullspace of M is 4 we can find 4 linearly independent vectors vᵢ, i = 1, ..., 4 such that
M(a_1 v_1 + a_2 v_2 + a_3 v_3 + a_4 v_4) = 0, (11.34)
for any choice of coefficients a₁, ..., a₄. Reshaping these vectors into matrices E₁, E₂, E₃, E₄ we can write any matrix in the nullspace as
E = a_1 E_1 + a_2 E_2 + a_3 E_3 + a_4 E_4. (11.35)
What remains is to find the coefficients a₁, ..., a₄ such that (11.31) and (11.32) are fulfilled. Note that since the scale of the essential matrix is arbitrary, we can assume that (for example) a₁ = 1.
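The nullspace computation is conveniently done with the SVD. Here is a minimal sketch (my own; the normalization of image points from Lecture 6 is omitted):

    import numpy as np

    def nullspace_basis(x1, x2):
        """Four-dimensional nullspace of the 5x9 epipolar constraint matrix.

        x1, x2 : (5, 3) arrays of homogeneous image points; each
        correspondence gives one row of M via x2^T E x1 = 0.
        """
        M = np.stack([np.outer(b, a).ravel() for a, b in zip(x1, x2)])  # 5x9
        _, _, Vt = np.linalg.svd(M)
        return [v.reshape(3, 3) for v in Vt[5:]]   # E1, ..., E4

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=(5, 3)); x2 = rng.normal(size=(5, 3))
    for E in nullspace_basis(x1, x2):
        print(np.max(np.abs([b @ E @ a for a, b in zip(x1, x2)])))  # ~1e-15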
To determine the coefficients a2 , ..., a4 we will use the method presented in the previous section. Equation (11.32)
consists of 9 third order polynomials in a2 , a3 , a4 . In addition we have (11.31) which consists of 1 third degree
polynomial. To construct the action matrix we first need to compute the coefficients of these polynomials. These
can easily be determined by rewriting (11.32) as
2EE^TE - \mathrm{trace}(EE^T)E = \sum_{i=1}^{4}\sum_{j=1}^{4}\sum_{k=1}^{4} a_i a_j a_k \left( 2E_iE_j^TE_k - \mathrm{trace}(E_iE_j^T)E_k \right). (11.36)
For each of the 9 constraints in (11.32) we can extract coefficients for the monomial a_i a_j a_k using this expression. Note that the monomials occur several times in this sum. For example, (i, j, k) = (1, 1, 2) and (i, j, k) = (1, 2, 1) both yield a_i a_j a_k = a_1^2 a_2 = a_2. There are 64 terms in the sum but only 20 distinct monomials. The determinant constraint can be handled in the same way using the expression
\det(E) = \sum_{i=1}^{4}\sum_{j=1}^{4}\sum_{k=1}^{4} a_i a_j a_k \left( e_{11}^i e_{22}^j e_{33}^k + e_{12}^i e_{23}^j e_{31}^k + e_{13}^i e_{21}^j e_{32}^k - e_{13}^i e_{22}^j e_{31}^k - e_{11}^i e_{23}^j e_{32}^k - e_{12}^i e_{21}^j e_{33}^k \right), (11.37)
where e_{ab}^i is element (a, b) of the matrix E_i. In summary we construct a 10 × 20 matrix that contains the coefficients of all the 10 polynomials. The columns of this matrix correspond to the coefficients of the monomials
\{a_4^3, a_3a_4^2, a_3^2a_4, a_3^3, a_2a_4^2, a_2a_3a_4, a_2a_3^2, a_2^2a_4, a_2^2a_3, a_2^3, a_4^2, a_3a_4, a_3^2, a_2a_4, a_2a_3, a_2^2, a_4, a_3, a_2, 1\}. (11.38)
Figure 11.2: Shape of the 10 × 20 coefficient matrix before (left) and after elimination (right). A '*' means the element is non-zero.
The 10 first monomials are of degree 3 and the rest are of lower degree. If we modify the coefficient matrix
by performing Gaussian elimination we can determine reductions for each of these terms, see Figure 11.2. For
example, after elimination, the first row of the matrix only contains a34 from the third order monomials and can
therefore be used to replace this term with lower order terms. To create an action matrix we use the 10 equations represented by the new coefficient matrix to compute reductions of all the third order monomials. In this case it does not matter if we use T_{a_2}, T_{a_3} or T_{a_4}, since reductions to second order or less are available for all third order monomials in the modified coefficient matrix.
In summary the 5-point solver consists of the following steps:
1. Construct the 5 × 9 matrix M from the 5 point correspondences. (Note that the image points should be
normalized as in Lecture 6.)
2. Compute the 4 vectors that span the nullspace of M and reshape them to the matrices E1 , E2 , E3 , E4 .
3. Using the expressions (11.36) and (11.37), compute coefficients for all monomials and construct the 10 × 20 coefficient matrix. (The monomial order does not have to be the same as in (11.38), however the first 10 terms need to be the 3rd order monomials.)
4. Perform Gaussian elimination on the coefficient matrix.
5. Construct the action matrix for either T_{a_2}, T_{a_3} or T_{a_4} using the reductions available in the modified coefficient matrix.
6. Compute the eigenvectors of the transposed action matrix, extract a_2, a_3, a_4 from these, and form the essential matrix using (11.35).
Figure 11.3 shows a comparison between the 5-point solver and the 8-point solver. For the pair of images depicted in the first row of the figure we run 1000 iterations of RANSAC. The histograms on the second row show the size of the consensus set in each iteration. It can be seen that the 5-point solver generally finds consensus sets with a larger number of inliers.
Figure 11.3: First row: The best solution obtained from the 5-point solver. Yellow ’*’ is an image point, green ’o’ is
the reprojection of an inlier and red ’o’ is a reprojection of an outlier. Second row: Histogram over the size of the
consensus set in each iteration of a 1000-iteration-RANSAC, using (a) - 5 points, (b) - 8 points and (c) - 10 points.
Lecture 12:
Local Optimization
12.1 The Maximal Likelihood Estimator
In the previous lecture we derived the maximum likelihood estimator for our class of projection problems. If xij = (x1ij, x2ij) is the projection in regular Cartesian coordinates of the 3D point Xj in camera Pi, then the model parameters that make the measurements xij most likely are found by minimizing
\sum_{i=1}^{n} \sum_{j=1}^{m} \left\| \left( x_{ij}^1 - \frac{P_i^1 X_j}{P_i^3 X_j},\; x_{ij}^2 - \frac{P_i^2 X_j}{P_i^3 X_j} \right) \right\|^2. (12.1)
This problem is however difficult to solve since there is in general no closed form solution. In practice we are limited to locally improving the objective function from some suitably selected starting point. The starting point is typically generated using algebraic solvers like the ones we have presented previously in the course. In this lecture we will present some simple methods for local optimization.
12.2 Background
Definition 1. A point v is called a stationary point of the function f if
∇f(v) = 0. (12.2)
Definition 2. A point v is called a local minimizer of f if
f(v) ≤ f(s) (12.3)
for all s in a neighborhood around v. (More technically, for all s such that ∥v − s∥ ≤ δ for some δ > 0.)
Definition 3. A matrix M is positive definite (M ≻ 0) if vᵀMv > 0 for all v ≠ 0. It is called positive semidefinite (M ⪰ 0) if vᵀMv ≥ 0 for all v.
Theorem 4. If v is a stationary point of f (∇f(v) = 0) and the Hessian of f is positive definite at v (∇²f(v) ≻ 0), then v is also a local minimum of f.
12.3 Linear Least Squares
Minimizing (12.1) is an example of a non-linear least squares problem. In this section we will show how linear least squares problems can be minimized. In the following sections we will use this approach to tackle the more difficult non-linear versions.
In the linear least squares problem we want to fit a set of linear functions A_i v to a set of measurements b_i, i = 1, ..., n. Here A_iᵀ is a vector of the same size as v, representing a linear function of v. The least squares problem is
\min_v \sum_{i=1}^{n} \|A_i v - b_i\|^2. (12.4)
If we concatenate all the A_i into one matrix A and all the b_i into one vector b, we can write the problem in matrix form as
\min_v \|Av - b\|^2. (12.5)
Expanding the objective function f(v) = ∥Av − b∥² and differentiating gives
∇f(v) = 2A^TAv - 2A^Tb. (12.6)
Setting the gradient to zero gives
A^TAv = A^Tb. (12.7)
These equations are called the normal equations, and assuming that AᵀA is invertible they give the stationary point
v = (A^TA)^{-1}A^Tb. (12.8)
The Hessian of f is
∇²f(v) = A^TA. (12.9)
It is positive semidefinite since
v^TA^TAv = \|Av\|^2 \ge 0. (12.10)
Furthermore, it has to be positive definite: if there were v ≠ 0 such that ∥Av∥ = 0 then, by multiplying with Aᵀ, we would get AᵀAv = 0, which would imply that AᵀA is not invertible (which we have assumed above). Therefore (12.8) is a local minimum.
It is in fact also a global minimum, which can be seen by changing coordinates. Let us call the stationary point s and let v = δ + s. We then get
\|A(\delta + s) - b\|^2 = \delta^TA^TA\delta + 2\delta^T(A^TAs - A^Tb) + \|As - b\|^2. (12.11)
Since s is a stationary point it fulfills (12.7), and therefore the right hand side can be re-written as
\delta^TA^TA\delta + \|As - b\|^2. (12.12)
The result of this simplification is one term that depends on δ and one that is constant. Since AᵀA is positive definite, any choice of δ that is not zero will result in the first term being positive. Therefore the optimal choice is δ = 0 and thereby v = s.
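In NumPy the normal equations are a one-liner; the following sketch (mine) also compares with the built-in least squares solver, which is numerically preferable in practice:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(20, 3))
    b = rng.normal(size=20)

    v_normal = np.linalg.solve(A.T @ A, A.T @ b)       # (A^T A) v = A^T b
    v_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)    # SVD-based solver
    print(np.allclose(v_normal, v_lstsq))              # True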
12.4 Non-Linear Least Squares
Next we turn to the problem of solving the non-linear least squares formulation. In the general formulation we have a set of residuals rᵢ(v) which we want to minimize in a least squares sense,
\min_v \sum_i r_i(v)^2. (12.13)
If we stack all the residuals rᵢ(v) in a vector r(v), then we can write the problem as
\min_v \|r(v)\|^2. (12.14)
In this section we will present three simple procedures that, from a starting point, improve the objective value by locally searching for better values.
12.4.1 Steepest Descent
The perhaps simplest possible strategy is steepest descent search. Given a starting point v₀ this method finds the direction in which the function decreases most rapidly, and takes a step in this direction. For the function f(v) the directional derivatives at a point v₀ are given by
f'_d(v_0) = ∇f(v_0)^T d. (12.15)
Here d is a direction (vector of unit length). To find the d for which f′_d(v₀) is maximally negative we should select d = −∇f(v₀)/|∇f(v₀)|, since this choice gives
f'_d = -∇f(v_0)^T \frac{∇f(v_0)}{|∇f(v_0)|} = -|∇f(v_0)|. (12.16)
For each term rᵢ(v)² of the function f(v) = ∥r(v)∥² = r₁(v)² + r₂(v)² + ... + rₙ(v)² we have
∇(r_i(v)^2) = 2r_i(v)∇r_i(v), (12.17)
and therefore
∇f(v) = 2\sum_i r_i(v)∇r_i(v). (12.18)
Once the descent direction d has been computed, the algorithm picks a new point v₁ by searching along this direction, that is v₁ = v₀ + λd, where λ is selected such that f(v₁) < f(v₀). It is always possible to find such a λ (by choosing λ small) unless ∇f(v₀) = 0, that is, v₀ is already a stationary point. The whole process is then repeated for the point v₁ and so on.
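A minimal implementation with a backtracking line search (a sketch of mine, not the notes' own code) could look as follows:

    import numpy as np

    def steepest_descent(f, grad, v0, iters=100):
        """Steepest descent with backtracking line search."""
        v = v0
        for _ in range(iters):
            g = grad(v)
            if np.linalg.norm(g) < 1e-10:
                break                          # stationary point reached
            d = -g / np.linalg.norm(g)         # unit descent direction
            lam = 1.0
            while f(v + lam * d) >= f(v):      # shrink step until f decreases
                lam /= 2
                if lam < 1e-12:
                    return v
            v = v + lam * d
        return v

    # Example: minimize f(v) = ||r(v)||^2 with r(v) = (v1 - 1, 2 v2).
    f = lambda v: (v[0] - 1)**2 + 4 * v[1]**2
    grad = lambda v: np.array([2 * (v[0] - 1), 8 * v[1]])
    print(steepest_descent(f, grad, np.array([5.0, 3.0])))   # approx (1, 0)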
12.4.2 Gauss-Newton
While very cheap to compute, the gradient does not contain any information about the step length. Therefore in the steepest descent approach we are forced to select a suitable λ by other means.
An alternative approach is the Gauss-Newton method. In this method we first approximate the residuals rᵢ(v) with linear functions and then solve the resulting approximation using the approach from Section 12.3. The Taylor approximation of r(v) at v₀ is
r(v) = \begin{pmatrix} r_1(v) \\ r_2(v) \\ \vdots \\ r_n(v) \end{pmatrix} \approx \underbrace{\begin{pmatrix} r_1(v_0) \\ r_2(v_0) \\ \vdots \\ r_n(v_0) \end{pmatrix}}_{r(v_0)} + \underbrace{\begin{pmatrix} ∇r_1(v_0)^T \\ ∇r_2(v_0)^T \\ \vdots \\ ∇r_n(v_0)^T \end{pmatrix}}_{J(v_0)} \underbrace{(v - v_0)}_{d}. (12.19)
The matrix J(v₀) is the Jacobian of the vector r. Its rows contain the gradients of the entries rᵢ of r. The approximating linear least squares problem is therefore
\min_d \|r(v_0) + J(v_0)d\|^2, (12.20)
which by the normal equations (12.7) has the solution
d = -(J(v_0)^TJ(v_0))^{-1}J(v_0)^Tr(v_0). (12.21)
The new point is then chosen as v₁ = v₀ + d, and the process is repeated.
12.4.3 Levenberg-Marquardt
While convergence of the Gauss-Newton method is very rapid close to the minimum, it can be unstable at points not sufficiently close to the minimum. The reason is that the approach may generate large steps even though the approximation is only valid in a local neighborhood around v₀. Therefore we add a penalty for large steps and solve
\min_d \|r(v_0) + J(v_0)d\|^2 + λ\|d\|^2. (12.23)
To find the minimum of this problem we can again compute the stationary point. Expanding the function gives
\|r(v_0) + J(v_0)d\|^2 + λ\|d\|^2 = d^T\left(J(v_0)^TJ(v_0) + λI\right)d + 2r(v_0)^TJ(v_0)d + r(v_0)^Tr(v_0). (12.24)
Differentiating and setting the result to zero gives
\left(J(v_0)^TJ(v_0) + λI\right)d = -J(v_0)^Tr(v_0). (12.25)
For small λ the step is close to the Gauss-Newton step, while for large λ it approaches a (short) steepest descent step.
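The following sketch (mine; a common λ-update heuristic is assumed, not taken from the notes) implements the step (12.25); setting λ = 0 recovers Gauss-Newton:

    import numpy as np

    def levenberg_marquardt(r, J, v0, lam=1e-2, iters=50):
        """Levenberg-Marquardt; a minimal sketch of (12.23)-(12.25).

        r : residual vector function, J : Jacobian function.
        lam is decreased when a step succeeds and increased otherwise.
        """
        v = v0
        for _ in range(iters):
            Jv, rv = J(v), r(v)
            d = np.linalg.solve(Jv.T @ Jv + lam * np.eye(len(v)), -Jv.T @ rv)
            if np.sum(r(v + d)**2) < np.sum(rv**2):
                v, lam = v + d, lam / 10       # accept step, trust the model more
            else:
                lam *= 10                      # reject step, penalize large steps
        return v

    # Example: fit y = exp(a x) to noisy data.
    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 30)
    y = np.exp(0.7 * x) + 0.01 * rng.normal(size=30)
    r = lambda v: np.exp(v[0] * x) - y
    J = lambda v: (x * np.exp(v[0] * x)).reshape(-1, 1)
    print(levenberg_marquardt(r, J, np.array([0.0])))   # approx [0.7]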
12.4.4 Bundle Adjustment
We now return to the structure from motion problem. We will consider the calibrated version, that is, we assume that the image points xij have already been normalized and that we are searching for camera matrices of the form Pᵢ = [Rᵢ tᵢ] and scene points of the form Xⱼ = (X¹ⱼ, X²ⱼ, X³ⱼ, 1)ᵀ. In the computer vision literature this problem is often called bundle adjustment, since the bundle of viewing rays is locally adjusted to reduce the reprojection errors. The objective function (12.1) can then be written
\sum_{i=1}^{n} \sum_{j=1}^{m} \left\| \left( x_{ij}^1 - \frac{R_i^1 X_j + t_i^1}{R_i^3 X_j + t_i^3},\; x_{ij}^2 - \frac{R_i^2 X_j + t_i^2}{R_i^3 X_j + t_i^3} \right) \right\|^2, (12.27)
where R_i^1, R_i^2, R_i^3 are the rows of the rotation Rᵢ and t_i^1, t_i^2, t_i^3 are the entries of the translation tᵢ. Note that we have assumed in (12.27) that all points are visible in all cameras. However, this is no restriction: if, for example, point 2 is not visible in camera 3 we simply remove the term (i, j) = (3, 2) from the sum.
To be able to apply the Gauss-Newton method we need to linearize expressions of the type
f_1 = \frac{R^1 X + t^1}{R^3 X + t^3} \quad \text{and} \quad f_2 = \frac{R^2 X + t^2}{R^3 X + t^3}. (12.28)
(For ease of notation we drop the indices i, j for now.) There are two difficulties that we need to handle: first, the functions are nonlinear since they contain both fractions and quadratic terms; second, the rotation R depends non-linearly on a set of parameters.
One way to parametrize a rotation is through the exponential map
\exp(A) = \sum_{k=0}^{\infty} \frac{1}{k!}A^k = I + A + \frac{1}{2}A^2 + \frac{1}{6}A^3 + ... (12.29)
If R₀ is our current rotation estimate, then any other rotation R can be written as R₀ multiplied with the exponential map of a skew symmetric matrix,
R = \exp\begin{pmatrix} 0 & -a_1 & -a_2 \\ a_1 & 0 & -a_3 \\ a_2 & a_3 & 0 \end{pmatrix} R_0. (12.30)
Since we are interested in linearizing (12.28) we need only consider the first order terms of (12.29). If we let
S_1 = \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad S_2 = \begin{pmatrix} 0 & 0 & -1 \\ 0 & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix} \quad \text{and} \quad S_3 = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{pmatrix}, (12.31)
then the rotation is approximated to first order by R ≈ (I + a₁S₁ + a₂S₂ + a₃S₃)R₀. Inserting this into (12.28) and differentiating gives the derivatives of f₁ and f₂ with respect to a₁, a₂, a₃, t¹, t², t³ and X; these are expressed in terms of S₁ⁱ, the i'th row of S₁ (and similarly for S₂, S₃), and R₀ⁱʲ, element (i, j) of R₀. Using these (and the rest of the) derivatives we can now form the Jacobian J(v₀) in (12.19). The entries of r(v₀) are obtained by computing the residual values at the current parameter estimate. Note that v₀ contains the parameter values of a₁, a₂, a₃, t¹, t², t³, X¹, X² and X³ for all the cameras and all the points of the problem. Thus if there are n cameras and m scene points, and all points are visible in all cameras, then v₀ is of size (6n + 3m) × 1, r(v₀) is 2mn × 1 and J(v₀) is 2mn × (6n + 3m).
Since the reconstruction is only determined up to an unknown similarity transformation we can reduce the number of variables slightly. For example, we can assume that the first camera is [I 0]. This fixes six of the seven degrees of freedom of the similarity transformation. To also fix the scale we can, for example, fix the first coordinate of the first scene point.
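The rotation update in (12.30) is easy to implement; here is a small sketch (mine, assuming SciPy is available for the matrix exponential):

    import numpy as np
    from scipy.linalg import expm

    def rotation_update(a, R0):
        """Update R0 with the exponential map of the skew matrix in (12.30)."""
        A = np.array([[0.0, -a[0], -a[1]],
                      [a[0], 0.0, -a[2]],
                      [a[1], a[2], 0.0]])
        return expm(A) @ R0

    # For small a, the first order approximation (I + A) R0 used in the
    # linearization is close to this exact update.
    R0 = np.eye(3)
    a = np.array([1e-3, -2e-3, 5e-4])
    R = rotation_update(a, R0)
    print(np.allclose(R.T @ R, np.eye(3)))   # exp of a skew matrix is a rotation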
Lecture 13:
Global Optimization
In the previous lecture we studied local optimization methods for multiple view geometry problems. Under the assumption of Gaussian image noise we optimized the maximum likelihood objective
\sum_{i,j} r_{ij}(v)^2 = \sum_{i,j} \left\| \left( x_{ij}^1 - \frac{R_i^1 X_j + t_i^1}{R_i^3 X_j + t_i^3},\; x_{ij}^2 - \frac{R_i^2 X_j + t_i^2}{R_i^3 X_j + t_i^3} \right) \right\|^2. (13.1)
In general this problem is difficult to solve since the involved terms are quotients of quadratic functions. In this lecture we will study some special cases that we can optimize globally by making a slight modification to the problem. Specifically, we will assume that the residuals are of the form
r_i(v) = \left( \frac{a_i^Tv + \tilde{a}_i}{c_i^Tv + \tilde{c}_i},\; \frac{b_i^Tv + \tilde{b}_i}{c_i^Tv + \tilde{c}_i} \right), (13.2)
where cᵢᵀv + c̃ᵢ > 0, that is, each coordinate is a quotient of affine functions. Instead of solving the least squares formulation we will consider the optimization problem
\min_v \max_i \|r_i(v)\| (13.3)
\text{such that } c_i^Tv + \tilde{c}_i > 0, \; \forall i. (13.4)
Triangulation In this problem we know the camera matrices Pᵢ = [Aᵢ tᵢ] and image points xᵢ and want to find the scene point X. The residuals are of the form
r_i(X) = \left( x_i^1 - \frac{A_i^1X + t_i^1}{A_i^3X + t_i^3},\; x_i^2 - \frac{A_i^2X + t_i^2}{A_i^3X + t_i^3} \right). (13.5)
To see that this expression is of the correct type (13.2) we can rewrite it as
\left( \frac{(x_i^1A_i^3 - A_i^1)X + x_i^1t_i^3 - t_i^1}{A_i^3X + t_i^3},\; \frac{(x_i^2A_i^3 - A_i^2)X + x_i^2t_i^3 - t_i^2}{A_i^3X + t_i^3} \right). (13.6)
The constraint Aᵢ³X + tᵢ³ > 0 means that the scene point should be in front of the camera.
Resection In the resection problem we want to estimate the camera parameters A and t from scene points Xᵢ and their projections (xᵢ¹, xᵢ²). The residuals of this problem are
r_i(A, t) = \left( x_i^1 - \frac{A^1X_i + t^1}{A^3X_i + t^3},\; x_i^2 - \frac{A^2X_i + t^2}{A^3X_i + t^3} \right). (13.7)
Structure from Motion with Known Camera Orientations If the orientations of all the cameras are known (in practice computed with some other method, such as the one described in Section 13.4), then we can solve for both the 3D points and the camera positions simultaneously. In this case the unknowns are X and t, which gives residuals of the form
r_i(X, t) = \left( x_i^1 - \frac{A^1X_i + t^1}{A^3X_i + t^3},\; x_i^2 - \frac{A^2X_i + t^2}{A^3X_i + t^3} \right). (13.8)
In this section we review some properties of convex functions and sets that will be useful for solving (13.3)-(13.4).
A set C ⊆ ℝⁿ is called convex if the line segment joining any two points in C is contained in C. That is, if x, y ∈ C then λx + (1 − λ)y ∈ C for all λ with 0 ≤ λ ≤ 1.
We call a point x of the form
x = \sum_{i=1}^{n} λ_i x_i, (13.9)
where \sum_{i=1}^{n} λ_i = 1, 0 ≤ λᵢ ≤ 1 and xᵢ ∈ C, a convex combination of the points x₁, ..., xₙ. A convex set always contains every convex combination of its points. Conversely, it can be shown that a set is convex if it contains all its convex combinations. Figure 13.1 shows simple examples of the notions introduced.
Next we will state two special cases of convex sets that will be useful to us.
The half space. A half space is a set of the form
\{x ∈ ℝ^n ;\; a^Tx ≤ b\}, (13.10)
where a ≠ 0, i.e., it is the solution set of a nontrivial affine inequality. The boundary of the half space is the hyperplane {x ∈ ℝⁿ ; aᵀx = b}. It is straightforward to verify that these sets are convex.
The second order cone. The second order cone in ℝⁿ⁺¹ is the set
\{(x, t) ∈ ℝ^n \times ℝ ;\; \|x\| ≤ t\}. (13.12)
From the general properties of norms it follows that the second order cone is a convex set in ℝⁿ⁺¹.
Convexity is also preserved under intersection. Thus a set C that is given by several of the constraints above (half spaces and cone constraints) is a convex set.
A function f : C → ℝ is called convex if C is a convex set and, for all x, y ∈ C and 0 ≤ λ ≤ 1, we have
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y). (13.13)
The geometric interpretation of this definition is that the line segment between the points (x, f(x)) and (y, f(y)) should lie above the graph of f. Figure 13.2 shows the geometric interpretation of the definition.
Figure 13.2: Left: Graph of a convex function. The line segment joining two points (x, f(x)) and (y, f(y)) lies above the graph. Right: Graph of a non-convex function.
Theorem 5. If f is convex on a convex set C, then any local minimum is also global.
Proof. Assume that x is a local minimizer, that is, there is a local neighborhood around x such that
f(x) ≤ f(y) (13.14)
for all y such that ∥x − y∥ ≤ δ. Suppose that x is not the global minimizer. Then there is an x* ∈ C such that
f(x*) < f(x). (13.15)
Since C is convex we can form the line segment between x* and x and look at the values of f:
f(λx* + (1 − λ)x) ≤ λf(x*) + (1 − λ)f(x) < λf(x) + (1 − λ)f(x) = f(x), (13.16)
if 0 < λ ≤ 1. Now if we choose λ small enough that y = λx* + (1 − λ)x fulfills ∥x − y∥ ≤ δ, we see that x cannot be a local minimizer since
f(x) > f(y), (13.17)
which contradicts (13.14).
We now return to the min-max problem (13.3)-(13.4). By adding an extra variable γ we can rewrite the problem as
\min_{γ,v} \; γ (13.18)
\text{s.t. } \|r_i(v)\| ≤ γ, \; \forall i. (13.19)
Since we minimize γ and γ ≥ ∥rᵢ(v)∥ for all i, γ has to take the same value as the largest residual maxᵢ ∥rᵢ(v)∥. Therefore the two formulations are equivalent. Since cᵢᵀv + c̃ᵢ > 0 we can write the problem as
\min_{γ,v} \; γ (13.20)
\text{s.t. } \left\| \left( a_i^Tv + \tilde{a}_i,\; b_i^Tv + \tilde{b}_i \right) \right\| ≤ γ(c_i^Tv + \tilde{c}_i), \; \forall i. (13.21)
For a fixed γ the constraint (13.21) is of the type (13.12) and therefore convex. Furthermore, the intersection of all these constraints is also convex. This makes it possible to determine whether there is a set of variables v that fulfills all the constraints for a given γ. Specifically, we can solve the convex program
\min_{s,v} \; s (13.22)
\text{s.t. } \left\| \left( a_i^Tv + \tilde{a}_i,\; b_i^Tv + \tilde{b}_i \right) \right\| ≤ γ(c_i^Tv + \tilde{c}_i) + s, \; \forall i. (13.23)
If the optimal s > 0 then it is not possible to find a v that fulfills all the constraints at the same time; in this case we know that the current γ is smaller than min_v max_i ∥rᵢ(v)∥. Otherwise, if the optimal s ≤ 0, we know that the current γ is larger than (or equal to) min_v max_i ∥rᵢ(v)∥. This makes it possible to search for min_v max_i ∥rᵢ(v)∥ by solving a sequence of convex problems.
Below we outline a simple procedure, called bisection, that finds the optimal solution:
1. Select bounds γₗ, γᵤ such that the optimal value lies in [γₗ, γᵤ].
2. Let γ = (γₗ + γᵤ)/2 and solve (13.22)-(13.23).
3. If the optimal s > 0, set γₗ = γ; otherwise set γᵤ = γ.
4. Repeat from 2 until γᵤ − γₗ is sufficiently small.
The result is an interval [γₗ, γᵤ] that is guaranteed to contain the optimal value.
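The bisection loop itself is trivial to implement; the work lies in the feasibility test, which requires a second order cone solver. A sketch of mine with the solver abstracted away:

    def bisection(feasible, gamma_l, gamma_u, tol=1e-6):
        """Bisection over gamma, following the outline above.

        feasible(gamma) should solve the convex program (13.22)-(13.23)
        and return True when the optimal s <= 0 (a placeholder here; a real
        implementation would call an SOCP solver).
        """
        while gamma_u - gamma_l > tol:
            gamma = (gamma_l + gamma_u) / 2
            if feasible(gamma):
                gamma_u = gamma      # gamma is achievable, search below
            else:
                gamma_l = gamma      # gamma too small, search above
        return gamma_l, gamma_u

    # Toy example where the optimal max-residual happens to be 0.3.
    print(bisection(lambda g: g >= 0.3, 0.0, 10.0))   # approx (0.3, 0.3)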
13.4 Rotation Averaging
The methods that we described in the previous section can be used to solve the structure from motion problem, if we have estimates for the camera rotations, by optimizing over camera positions and 3D points. For this to be useful we need a method that can accurately estimate camera rotations, which we describe in this section.
For each pair of cameras that have a sufficient number of matches we can solve the (calibrated) relative pose problem (see Lecture 6). This gives us two cameras P₁ = [I 0] and P₂ = [R₁₂ t₁₂]. The rotation R₁₂ essentially tells us how we should rotate camera 1 to get camera 2. Similarly, if we solve the relative pose between cameras 2 and 3, we get the solution P₂ = [I 0] and P₃ = [R₂₃ t₂₃]. Note that the solution to the relative pose problem is only defined up to a similarity transformation. That is, if we have one solution we may rotate, translate and scale
the solution to obtain a new one. Now let's say that the camera orientations are R₁, R₂ and R₃ in some common coordinate system. Then, by applying the transformation \begin{pmatrix} R_1 & 0 \\ 0 & 1 \end{pmatrix} to the first solution we get
\begin{bmatrix} I & 0 \end{bmatrix}\begin{pmatrix} R_1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{bmatrix} R_1 & 0 \end{bmatrix} (13.24)
and
\begin{bmatrix} R_{12} & t_{12} \end{bmatrix}\begin{pmatrix} R_1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{bmatrix} R_{12}R_1 & t_{12} \end{bmatrix}. (13.25)
That is, R₂ = R₁₂R₁. Similarly, by applying \begin{pmatrix} R_2 & 0 \\ 0 & 1 \end{pmatrix} to the second camera pair we get R₃ = R₂₃R₂ = R₂₃R₁₂R₁, which gives us estimates for the three camera orientations in the same coordinate system.
Now suppose that we also solve for the relative orientation between cameras 3 and 1, giving us the additional equation R₁ = R₃₁R₃. The system is now over-determined and due to noise it will in general not be possible to find an exact solution to all these equations at once. Least squares rotation averaging attempts to find the rotations that minimize
\sum_{(i,j)} \|R_j - R_{ij}R_i\|_F^2 (13.26)
\text{such that } R_i^TR_i = I, \; i = 1, ..., n. (13.27)
This is a quadratic objective function with quadratic equality constraints, which in general results in a non-convex problem. In the following section we will describe how such problems can be addressed using Lagrangian duality.
13.5 Duality
Suppose that we have a quadratic objective function q(x) = xᵀÃx + 2bᵀx + c, where the variable x ∈ ℝⁿ. By extending x to ℝⁿ⁺¹ and letting the extra coordinate be one we can write the objective as q(x) = xᵀAx, where
A = \begin{pmatrix} \tilde{A} & b \\ b^T & c \end{pmatrix}. (13.28)
Similarly, suppose that we have quadratic functions qᵢ(x) = xᵀAᵢx − 1 and want to solve
\min_x \; q(x) (13.29)
\text{such that } q_i(x) = 0, \; \forall i. (13.30)
We will call this problem the primal problem. To eliminate the constraints we form the Lagrangian L(x, λ) = q(x) + \sum_i λ_i q_i(x) and consider
\min_x L(x, λ) (13.31)
for a given value of λ. Let x* be the minimizer of (13.31). If x* fulfills the constraints qᵢ(x*) = 0, then we have found the minimizer of the original problem, since L(x, λ) = q(x) for all x that fulfill qᵢ(x) = 0. If the minimizer does not fulfill the constraints, then it achieves a value that is at least as good as that of any point fulfilling the constraints. In both cases we have L(x*, λ) ≤ min{q(x) ; qᵢ(x) = 0}. That is, for each value of λ we get a lower bound on the optimal value of the primal problem. Lagrangian duality consists in trying to find the largest such lower bound by solving
\max_λ \min_x L(x, λ). (13.32)
We will refer to this as the dual problem. When both the objective and the constraints are quadratic it is possible to solve the inner minimization in closed form, since L(x, λ) = xᵀ(A + \sum_{i=1}^{n} λᵢAᵢ)x − \sum_{i=1}^{n} λᵢ. The minimum of this function is
L(x^*, λ) = \begin{cases} -\sum_{i=1}^{n} λ_i & \text{if } A + \sum_{i=1}^{n} λ_iA_i ⪰ 0 \\ -\infty & \text{otherwise.} \end{cases} (13.33)
Since the dual problem maximizes with respect to λ we do not need to consider the second case. The dual problem is then
\max_λ \; -\sum_{i=1}^{n} λ_i (13.34)
\text{such that } A + \sum_{i=1}^{n} λ_iA_i ⪰ 0. (13.35)
This problem has a linear objective function and a convex constraint; it is therefore convex and can be reliably solved with standard solvers.
For rotation averaging the dual problem has the form
\max_Λ \; \mathrm{trace}(Λ) (13.36)
\text{such that } M - Λ ⪰ 0, (13.37)
where
M = \begin{pmatrix} 0 & R_{12} & \cdots & R_{1n} \\ R_{21} & 0 & \cdots & R_{2n} \\ \vdots & & \ddots & \vdots \\ R_{n1} & R_{n2} & \cdots & 0 \end{pmatrix} \quad \text{and} \quad Λ = \begin{pmatrix} Λ_1 & 0 & 0 & \cdots \\ 0 & Λ_2 & 0 & \cdots \\ 0 & 0 & Λ_3 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}, (13.38)
with symmetric 3 × 3 blocks Λᵢ.
While the dual problem in general only provides a lower bound on the primal problem, it can be shown that for the rotation averaging problem the bound is tight if the solution has small enough error residuals. Specifically, if the rotation angle of each residual rotation RⱼᵀRᵢⱼRᵢ is smaller than 42.9°, then the lower bound is tight.
Lecture 14:
Factorization and Dimensionality Reduction
Previously we have considered image data generated by viewing a rigid scene with a moving camera. In this section we will consider images depicting more general deforming objects. Solving for both camera and object motion is an ill-posed problem if points are allowed to move arbitrarily. However, in real scenes point motions are typically highly correlated. Figure 14.1 shows four images from a dataset consisting of points tracked through a sequence of images of a hand. While all the points are allowed to move, only a small set of simple motions give rise to reasonable hand shapes. Restricting the set of shapes and motions regularizes the problem and makes it solvable.
Figure 14.1: Four (out of 40) images of a deformable model with 56 tracked points. By concatenating the x- and y-coordinates from all images, each track can be seen as a data point in an 80-dimensional space.
From a mathematical point of view we can see the coordinates from each track as a data point in a high dimensional
space. Let (xij , yij ) be the coordinates of point j in image i. We form the measurement matrix M by
M = \begin{pmatrix} x_{11} & x_{12} & x_{13} & \cdots & x_{1n} \\ y_{11} & y_{12} & y_{13} & \cdots & y_{1n} \\ x_{21} & x_{22} & x_{23} & \cdots & x_{2n} \\ y_{21} & y_{22} & y_{23} & \cdots & y_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & x_{m3} & \cdots & x_{mn} \\ y_{m1} & y_{m2} & y_{m3} & \cdots & y_{mn} \end{pmatrix}. (14.1)
Here each column corresponds to the coordinates of a point track and every two consecutive rows correspond to an
image. The columns of M can be seen as data points in a 2m-dimensional space. To model dependence between
points a common approach is to assume that the data points can be written as a linear combination of a few basis
columns, or equivalently, to assume that the columns of M belong to a low dimensional subspace.
In linear algebra we have the following useful concepts: the column space of a matrix M is the set of all linear combinations of the columns of M, and the row space is the set of all linear combinations of its rows. The rank of M is the dimension of the column space (which equals the dimension of the row space), and a basis for a space is a set of linearly independent vectors that span it.
Exercise 30. Show that the columns B_1 = \begin{pmatrix} 1 \\ 2 \\ 1 \end{pmatrix} and B_2 = \begin{pmatrix} 2 \\ 3 \\ 1 \end{pmatrix} form a basis for the column space of
M = \begin{pmatrix} 1 & 2 & 2 & 0 \\ 2 & 3 & 2 & 1 \\ 1 & 1 & 0 & 1 \end{pmatrix}. (14.2)
Figure 14.2: Left - The columns of M seen as points in ℝ³. Middle - The 2D basis B₁ and B₂ of the factorization, spanning a plane that all the points belong to. Right - When a coordinate of a new data point is missing it can be recovered by enforcing that the point lies on the detected plane.
Since the columns of M have 3 entries they can be interpreted as points in ℝ³, see Figure 14.2. The fact that we can write all the columns of M as linear combinations of the basis vectors B₁ and B₂ shows that they all belong to the 2-dimensional subspace spanned by B₁ and B₂.
Exercise 31. Find a 3 × 2 matrix B and a 4 × 2 matrix C such that M = BCᵀ, and determine a basis for the row space of M.
As we have seen previously we can interpret the columns of M as data points in R3 and B as a basis for the 2-
dimensional column space. Alternatively we can consider the rows of M as data points in R4 and C as a basis
for the row space. In this case the matrix B contains the coefficients used to form the data points in M from
the basis in C. The dimension of these two spaces is the same as the rank of the matrix. For the data shown in
Figure 14.1 the row space will consist of a set of possible hand-shapes whereas the column space consists of a set of
point trajectories.
The expression M = BCᵀ is called a factorization of M, and B and C are the factors. Without any additional constraints there are many possible factorizations. For example, if we let B̃ = BH and C̃ = CH⁻ᵀ, so that C̃ᵀ = H⁻¹Cᵀ, we get
\tilde{B}\tilde{C}^T = BHH^{-1}C^T = BC^T = M, (14.3)
for any invertible matrix H.
When using real data our measurements are usually corrupted by noise. Therefore the measurement matrix is normally of full rank and it is not possible to find any subspace containing the data points unless the whole space is used. In order to reduce the dimensionality of the data we therefore have to remove the noise. Assuming that we add Gaussian noise to a matrix of rank r to get the measurement matrix M, the maximum likelihood estimator is given by
\min_{\mathrm{rank}(X) = r} \|X - M\|^2, (14.4)
that is, we want to find the matrix X that is closest to M in a least squares sense and has rank(X) = r. The solution to this problem can be obtained from the SVD of M.
Theorem 6 (Eckart-Young 1936). If rank(M) = k > r and M has the singular value decomposition M = USVᵀ with
S = \begin{pmatrix} \mathrm{diag}(σ_1, σ_2, ..., σ_k) & 0 \\ 0 & 0 \end{pmatrix}, (14.5)
then the solution to (14.4) is given by X = US_rVᵀ, where
S_r = \begin{pmatrix} \mathrm{diag}(σ_1, σ_2, ..., σ_r, 0, 0, ...) & 0 \\ 0 & 0 \end{pmatrix}. (14.6)
To find the best rank r approximation of M we thus take its SVD and set all but the first r singular values to zero. We can directly obtain a factorization X = BCᵀ from the SVD X = US_rVᵀ. Let U′ and V′ be the first r columns of U and V, and let S_r′ be the top r × r block of S_r. Since all but the first r columns of U vanish in the multiplication US_r, and similarly all but the first r rows of Vᵀ vanish in S_rVᵀ, it is clear that
X = US_rV^T = U'S_r'V'^T. (14.7)
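In NumPy this is a few lines; a sketch of mine with a small check:

    import numpy as np

    def best_rank_r(M, r):
        """Best rank-r approximation of M (Theorem 6) and a factorization."""
        U, S, Vt = np.linalg.svd(M, full_matrices=False)
        B = U[:, :r] * S[:r]              # B = U' S_r'
        C = Vt[:r].T                      # C^T = V'^T
        return B @ C.T, B, C

    rng = np.random.default_rng(0)
    M = rng.normal(size=(80, 56))
    X, B, C = best_rank_r(M, 5)
    print(np.linalg.matrix_rank(X))                     # 5
    print(np.linalg.norm(X - M) <= np.linalg.norm(M))   # True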
Figure 14.3: Left - The measurement matrix for the hand data set in Figure 14.1. Middle - A rank 5 approximation of the measurement matrix. Right - The difference between the measurement matrix and the rank 5 approximation.
Figure 14.3 shows the measurement matrix for the hand data set from Figure 14.1. The left image shows the original noisy measurement matrix and the middle image shows the low rank approximation obtained using the SVD as described above. The difference is shown in the right image; note the different scales. While the two matrices are very similar, the first one has 80 · 56 = 4480 elements while the second one can be represented with (80 + 56) · 5 − 5² = 655 numbers.
If the assumption that all hand shapes can be written as linear combinations of five basis shapes is accurate, we can use the obtained factorization to compute new hand shapes. To generate a new image we need to know the x- and y-coordinates of all the points. According to our assumption these should be linear combinations of the rows of Cᵀ. We therefore should have
\begin{pmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \end{pmatrix} = B_{new}C^T, (14.8)
where (x₁, y₁), (x₂, y₂), ... are the unknown point coordinates and B_new is a 2 × 5 matrix of parameters that specify the point coordinates. Thus we have a linear function f(B_new) := B_new Cᵀ that can be used to generate new images. It can be difficult to find reasonable values for the parameters in B_new, since there are many possible choices of Cᵀ that can be obtained from the factorization. One way to generate reasonable values is to instead specify the coordinates of a few (in this case 5) points and determine the values of B_new from these. If we for example specify the first 5 point positions we get
\begin{pmatrix} x_1 & x_2 & \cdots & x_5 \\ y_1 & y_2 & \cdots & y_5 \end{pmatrix} = B_{new}C_{5\times 5}^T, (14.9)
where C₅ₓ₅ consists of the first five rows of C. Assuming that C₅ₓ₅ is invertible we get
B_{new} = \begin{pmatrix} x_1 & x_2 & \cdots & x_5 \\ y_1 & y_2 & \cdots & y_5 \end{pmatrix} C_{5\times 5}^{-T}. (14.10)
Inserting (14.10) into (14.8) gives a new linear function that takes 5 point positions and returns the coordinates of all 56 points.
The choice of using the first five points to construct the function above is arbitrary and we may use any collection of 5 points. For numerical reasons it is often beneficial to use points that are evenly distributed over the object. In Figure 14.4 we selected the five points marked with red dots and varied the coordinates of one of them. The blue curves show the positions obtained for all points by using (14.10) and (14.8). As long as we stay in the vicinity of the previously observed shapes the silhouettes look reasonable; however, if we move too far away from previously observed shapes the result may start to look unreasonable, as in the right example of Figure 14.4.
Figure 14.4: Modifying the point coordinates of 5 of the observed points to generate new shapes using the computed
row space of C T .
Let us now consider the degrees of freedom (DOF) of the set of matrices that can be written BCᵀ, where B is of size m × r and C is n × r. The two factors have (m + n)r elements in total; however, since the product (14.3) is independent of the r × r matrix H, we get (m + n)r − r² degrees of freedom. In contrast, a general m × n matrix has mn elements. Since mn = (m + n)n − n² = (m + n)m − m², it is clear that BCᵀ has fewer degrees of freedom if r < m or r < n. In the example in Exercise 30 we have mn = 12 and (m + n)r − r² = 10. Thus BCᵀ can be seen as a compressed representation where we can specify the content of M using fewer values than its total number of elements.
The fact that BCᵀ has fewer DOF than a general matrix of size m × n makes it possible to recover M even if only a subset of the elements of M is known.
In the right part of Figure 14.2 we give a geometric interpretation of the above exercise. Varying m₁₅ and m₂₆ we obtain the points of two lines. If the unknown points are assumed to belong to the same subspace as the rest of the columns of M, they have to lie in the intersections between the lines and the plane.
When some elements of M are unknown we cannot use the SVD to recover a low rank approximation of M. If W is a matrix with wᵢⱼ = 1 if mᵢⱼ is known and wᵢⱼ = 0 otherwise, we instead seek a solution to
\min_{\mathrm{rank}(X) = r} \|W \odot (X - M)\|^2, (14.11)
where ⊙ denotes element-wise multiplication. For relatively high levels of missing data this problem has to be solved using iterative local methods. Figure 14.5 shows a solution recovered from a measurement matrix containing roughly 50% of the data from the hand data set.
Figure 14.5: Left - The measurement matrix with roughly 50% missing entries for the hand data set in Figure 14.1.
Middle - A rank 5 approximation obtained using local optimization. Right - The difference between the true
measurement matrix (without missing data) and the obtained rank 5 approximation.
Lecture 15:
Stereo and Surface Estimation
When camera positions have been determined, using structure from motion, we would like to compute a dense
surface model of the scene. In this lecture we will study the so called Stereo Problem, where the goal is to estimate
the depth of each pixel in an image. This requires computing a match for each pixel in the image even if the texture
is ambiguous.
Figure 15.1: Left and Middle: Two images used for computing depth estimates for every pixel in the left image.
Right: The depth estimate color coded.
Dense depth estimation requires matching each pixel to a corresponding pixel in neighboring images. Because of ambiguous texture this problem is difficult to solve. Since the cameras are known we can however use the epipolar lines to limit the search.
We will start by assuming that we have a pair of so called rectified cameras,
P_1 = K\begin{bmatrix} I & 0 \end{bmatrix} \quad \text{and} \quad P_2 = K\begin{bmatrix} I & (b, 0, 0)^T \end{bmatrix}. (15.1)
Geometrically this means that both cameras have the same orientation and that the second camera position is a translation in the x-direction of the first camera, see Figure 15.2. In general, image pairs taken with regular cameras do not fulfill these assumptions; however, they can be modified to do so. The line segment joining the two camera centers is called the baseline.
Figure 15.2: Stereo with rectified cameras. The cameras have the same orientation and the image plane normals are perpendicular to the baseline. Left: The epipolar lines are parallel to the x-axis. Right: The disparity.
There are several ways of rectifying two cameras. A simple approach is to first select the orientation of the axes of the new camera coordinate system in one of the cameras and then rotate both cameras to this new coordinate system. The new x-axis should be parallel to the baseline and the new y- and z-axes should be perpendicular to it. To keep the changes that occur when transforming the images small, we should select the new coordinate system to be as similar to the old one as possible. We can select the new x-axis by projecting the old x-axis onto the baseline and normalizing it. When the x-axis has been determined we can choose the new z-axis as the projection of the old z-axis onto the plane going through the camera center with the new x-axis as normal. When both the x- and z-axes are determined we can find the y-axis using the cross product. Once we know the new camera orientations we rotate the cameras. Since the rectified cameras and the old cameras are related through pure rotations, the rectified images are obtained by transforming the original images using homographies.
Next we will show that for this setup the epipolar lines are parallel to the x-axis of the image. Suppose that
K = \begin{pmatrix} γf & sf & x_0 \\ 0 & f & y_0 \\ 0 & 0 & 1 \end{pmatrix}. (15.2)
Then the projections of a scene point X = (X, Y, Z, 1) in the cameras P₁ and P₂ are given by
x_L = K\begin{bmatrix} I & 0 \end{bmatrix}\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} = \begin{pmatrix} γfX + sfY + x_0Z \\ fY + y_0Z \\ Z \end{pmatrix} (15.3)
and
x_R = K\begin{bmatrix} I & (b, 0, 0)^T \end{bmatrix}\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} = \begin{pmatrix} γf(X + b) + sfY + x_0Z \\ fY + y_0Z \\ Z \end{pmatrix}. (15.4)
In Cartesian coordinates the two projections are
x_L = \left( \frac{γfX + sfY}{Z} + x_0,\; \frac{fY}{Z} + y_0 \right) (15.5)
and
x_R = \left( \frac{γf(X + b) + sfY}{Z} + x_0,\; \frac{fY}{Z} + y_0 \right). (15.6)
Therefore y_L = y_R for any two corresponding points, which means that the epipolar lines are horizontal. The difference between the x-coordinates,
d = x_R - x_L = \frac{γfb}{Z}, (15.7)
is called the disparity. The matching problem can be seen as the problem of assigning a disparity to each point in
the image. The disparity is inversely proportional to the depth $Z$, see Figure 15.3. If we assume that disparities can
only be determined up to integer values (no sub-pixel accuracy), then it can be seen from Figure 15.3 that accurate
(high resolution) depth estimation is only possible when $Z$ is relatively small (compared to the baseline $b$).
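As a concrete illustration (with $\gamma f b = 1$ as in Figure 15.3, so $Z = 1/d$): the integer disparities $d = 1$ and $d = 2$ correspond to depths $Z = 1$ and $Z = 0.5$, so one disparity step covers a depth interval of $0.5$, whereas $d = 10$ and $d = 11$ correspond to $Z = 0.1$ and $Z \approx 0.0909$, an interval of about $0.009$. Integer disparities therefore sample the depth finely near the camera and very coarsely far away.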
Figure 15.3: Depth as a function of disparity ($\gamma f b = 1$). Large disparities give higher depth resolution.
Given a pixel and a candidate depth/disparity we need a criterion for evaluating whether this is a good match or
not. Previously in the course we have used SIFT features. However, this type of feature has various invariance
properties that we do not want here, since the camera positions are already known. A simple and commonly used criterion
is the normalized cross correlation. The cross correlation compares a patch around the pixel to a patch around the
potential match. If I1 and I2 are the two patches then their correlation is
$$NCC(I_1, I_2) = \frac{1}{n-1} \sum_{i=1}^{n} \frac{(I_1(x_i) - \bar{I}_1)(I_2(x_i) - \bar{I}_2)}{\sigma(I_1)\sigma(I_2)}, \qquad (15.8)$$
where $x_i$, $i = 1, \ldots, n$, are the pixels being compared, and $\bar{I}_1$, $\bar{I}_2$, $\sigma(I_1)$ and $\sigma(I_2)$ are the mean values and standard
deviations of the two patches. The result is a number between $-1$ and $1$, where $1$ means that the patches are highly similar.
The normalized cross correlation is invariant to translation and rescaling of the image intensities, which is very
useful if the two images are captured under different lighting conditions. Figure 15.4 shows the evaluation of
normalized cross correlation along an epipolar line.
Figure 15.4: Evaluation of the normalized cross correlation along an epipolar line. Left: Left image and the image
point of interest. Middle: Corresponding epipolar line and a few local maxima of the NCC. Right: NCC for a
range of disparities. Red rings are the same local maxima as in the middle image.
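The following is a small numpy sketch of (15.8) and of its evaluation along an epipolar line in a rectified pair. The patch-extraction helper and its (missing) bounds handling are our own simplifications, and constant patches, where the standard deviation vanishes, are not handled.

```python
import numpy as np

def ncc(patch1, patch2):
    """Normalized cross correlation (15.8) of two equally sized patches
    (assumes neither patch is constant, i.e. nonzero standard deviation)."""
    a = patch1.astype(float).ravel()
    b = patch2.astype(float).ravel()
    a -= a.mean()
    b -= b.mean()
    n = a.size
    # Deviation products divided by the (n-1)-normalized standard deviations.
    return np.dot(a, b) / ((n - 1) * a.std(ddof=1) * b.std(ddof=1))

def ncc_along_epipolar_line(img_left, img_right, x, y, half, disparities):
    """NCC between the (2*half+1)-sized patch around (x, y) in the left
    image and the patch around (x + d, y) in the right image, for each
    candidate disparity d (no bounds checking)."""
    ref = img_left[y - half:y + half + 1, x - half:x + half + 1]
    return np.array([ncc(ref, img_right[y - half:y + half + 1,
                                        x + d - half:x + d + half + 1])
                     for d in disparities])
```

Plotting the returned array against the disparities reproduces curves like the one in the right panel of Figure 15.4.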
When we use more than two cameras it might not be possible to rectify all the images simultaneously. In this case
an alternative is to employ a plane sweep algorithm. The basic idea is that if all the pixels have the same depth
then all the scene points are located on a plane in 3D. Since the projections of points on a plane in two images
are related by a homography, we can hypothesize a depth, transform the pixels of one camera into the second and
compute the cross correlation. Sweeping the plane through a series of depth hypotheses gives a cost for assigning pixels to the
hypothesized depths.
If the cameras are P1 = K[I 0] and P2 = K[R t] then the plane Π with scene points at the depth Z is given
by Π = (0, 0, −1/Z, 1). The homography HZ for the depth Z is then given by the formula
$$H_Z = K \left( R + t \begin{pmatrix} 0 & 0 & \frac{1}{Z} \end{pmatrix} \right) K^{-1}. \qquad (15.9)$$
Figure 15.5: The plane sweep algorithm. Given two cameras (a), hypothesize a depth for all pixels (b), project
into the second camera and compare pixel values (c). Sweeping the scene plane over different depths gives costs for
assigning pixels to these hypothesized depths.
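A compact numpy sketch of (15.9) and of the sweep itself follows. The helpers `warp` and `cost` are assumptions, not part of the text: `warp(img, H)` is assumed to resample `img` so that each output pixel $x$ takes the value of `img` at $Hx$ (an inverse warp), and `cost` compares two images pixel-wise, for example with local NCC.

```python
import numpy as np

def plane_sweep_homography(K, R, t, Z):
    """Homography (15.9) mapping pixels of P1 = K[I 0] to pixels of
    P2 = K[R t] for scene points on the plane at depth Z."""
    return K @ (R + np.outer(t, [0.0, 0.0, 1.0 / Z])) @ np.linalg.inv(K)

def sweep_costs(img1, img2, K, R, t, depths, warp, cost):
    """Cost volume of shape (len(depths), H, W): for each hypothesized
    depth, warp img2 into the first view and compare it to img1."""
    return np.stack([cost(img1, warp(img2, plane_sweep_homography(K, R, t, Z)))
                     for Z in depths])
```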
Compared to rectified cameras, where we can simply compare patches along the epipolar line, transforming the
image using a homography is computationally more expensive. On the other hand this approach is very conve-
nient in that it works for general camera configurations. Furthermore, since the image is re-sampled during the
transformation we do not need to limit ourselves to depths that are inverse values of integer valued disparities.
Therefore sub-pixel accuracy is naturally achieved with this approach.
The result of the plane sweep algorithm is, for each pixel, a function that specifies the cost of assigning that pixel a
certain depth. By selecting the depth with the smallest cost for each pixel we obtain a dense depth estimate. The resulting surface
will often be noisy, since individual pixel estimates can be unreliable due to ambiguous texture and self
occlusion.
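In code, this winner-take-all selection is a single argmin over the cost volume (assuming a `costs` array and `depths` list as in the sweep sketch above):

```python
import numpy as np

def winner_take_all(costs, depths):
    """Pick, for every pixel, the depth with the smallest cost.
    costs: (num_depths, H, W) cost volume; depths: swept depth values."""
    return np.asarray(depths)[np.argmin(costs, axis=0)]  # (H, W) depth map
```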
To improve the estimated surface and reduce the noise a common approach is to seek an assignment where neigh-
boring pixels have similar depths. Typically one minimizes an energy functional of the form
$$\sum_i \sum_{j \in N(i)} E_{ij}(Z_i, Z_j) + \sum_i E_i(Z_i). \qquad (15.10)$$
The second term Ei (Zi ) is the cost of assigning the depth Zi to pixel i. This term is typically referred to as the
data term since it is based on image data. The first term Eij (Zi , Zj ) penalizes differences in depth between pixel i
and a pixel j that is in the neighborhood N (i) of pixel i. This term is usually called the smoothness term since it
favors solutions with smooth surfaces.
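As a concrete instance, here is a numpy sketch that evaluates (15.10) on a 4-connected pixel grid, using the plane-sweep costs as data term and a truncated linear smoothness term (the role of the threshold t in this term is discussed at the end of the lecture; the weight `lam` balancing the two terms is our own illustrative parameter):

```python
import numpy as np

def grid_energy(labels, data_cost, lam, t):
    """Evaluate (15.10) on a 4-connected pixel grid.
    labels:    (H, W) integer depth labels Z_i
    data_cost: (num_labels, H, W) array with data_cost[k, y, x] = E_i(k)
    Smoothness: E_ij(Z_i, Z_j) = lam * min(|Z_i - Z_j|, t)."""
    H, W = labels.shape
    data = data_cost[labels, np.arange(H)[:, None], np.arange(W)[None, :]].sum()
    horiz = np.minimum(np.abs(np.diff(labels, axis=1)), t).sum()
    vert = np.minimum(np.abs(np.diff(labels, axis=0)), t).sum()
    return data + lam * (horiz + vert)
```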
Figure 15.6: The four cuts corresponding to the values of the energy term in (15.12)-(15.15).
Optimizing this type of energy is challenging due to the numerous local minima of the individual data terms
$E_i(Z_i)$, see Figure 15.4. The most successful methods are based on discrete graph techniques. If $Z_i$
is binary, then minimizing the energy (15.10) can be cast as finding a minimum cut in a graph where the nodes correspond to the
pixels and the edges to the neighborhood structure. The minimum cut can be found
efficiently (in polynomial time) by computing the maximum flow in the same graph.
Let us first consider a two pixel case with the energy functional
$$E(Z_i, Z_j), \qquad Z_i, Z_j \in \{0, 1\}. \qquad (15.11)$$
The corresponding graph has two nodes representing the pixels and two special nodes: the source $s$ and the sink $t$.
The graph has directed edges from $s$ to $i$ and $j$, from $i$ and $j$ to $t$, and between $i$ and $j$. In addition each edge has
a (non-negative) capacity $c(i, j)$. A cut in this graph is a partitioning of the nodes into two disjoint sets $S$ and $T$,
where $S$ contains the nodes grouped with the source and $T$ those grouped with the sink. The value of the cut is the
sum of the capacities of all edges going from $S$ to $T$.
The energy (15.11) can be minimized as a minimal cut problem on this graph under certain conditions. First of all, since
graph edges have non-negative capacities, we need $E(Z_i, Z_j) \geq 0$. Note that if the energy takes negative values we
can always add a constant to $E(Z_i, Z_j)$ without changing the minimizer and thereby obtain an equivalent non-negative
energy. If we let $i \in S$ mean that $Z_i = 1$, then from Figure 15.6 we see that
$$E(1,1) = c(i,t) + c(j,t), \qquad (15.12)$$
$$E(1,0) = c(i,t) + c(i,j) + c(s,j), \qquad (15.13)$$
$$E(0,1) = c(s,i) + c(j,i) + c(j,t), \qquad (15.14)$$
$$E(0,0) = c(s,i) + c(s,j). \qquad (15.15)$$
Summing (15.13) and (15.14) and subtracting (15.12) and (15.15) leaves exactly the capacities $c(i,j) + c(j,i) \geq 0$. Since the capacities are all non-negative, in order to be able to represent this energy with
a graph the condition
$$E(0,0) + E(1,1) \leq E(1,0) + E(0,1) \qquad (15.16)$$
must hold. Energies with pairwise smoothing terms that fulfill (15.16) are called submodular. It can be shown that
the condition (15.16) actually ensures that the energy can be represented by a graph. Furthermore, it is easy to see
that the sum of submodular energies is also submodular. This means that if the terms Eij of the energy (15.10) are
all submodular then we can minimize this energy by solving a max-flow/min-cut problem.
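To illustrate, here is a small sketch that minimizes a two pixel energy with scipy's max-flow solver. The concrete numbers and the decomposition $E = E(0,0) + (E(1,0)-E(0,0))Z_i + (E(1,1)-E(1,0))Z_j + \lambda(1-Z_i)Z_j$, with $\lambda = E(1,0)+E(0,1)-E(0,0)-E(1,1) \geq 0$ by (15.16), are our own way of reading off capacities; scipy requires integer capacities.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_flow

# Example two pixel energy; A = E(0,0), B = E(0,1), C = E(1,0),
# D = E(1,1), chosen to satisfy the submodularity condition (15.16).
A, B, C, D = 1, 4, 3, 0

# Nodes: 0 = source s, 1 = pixel i, 2 = pixel j, 3 = sink t.
# As in the text, a node on the source side S corresponds to Z = 1.
cap = np.zeros((4, 4), dtype=int)
const = A  # cut values represent the energy up to this additive constant

# Unary parts (C - A) Zi and (D - C) Zj: a non-negative weight w is paid
# when the node is in S (edge node -> t); a negative one when it is in T.
for node, w in ((1, C - A), (2, D - C)):
    if w >= 0:
        cap[node, 3] += w
    else:
        cap[0, node] -= w
        const += w

# Pairwise part lambda (1 - Zi) Zj: the edge j -> i is cut exactly
# when j is in S and i is in T.
cap[2, 1] += B + C - A - D

res = maximum_flow(csr_matrix(cap), 0, 3)
print("minimum energy:", res.flow_value + const)  # 0, attained at Zi = Zj = 1
```

For a full image the same construction is applied per pixel and per neighbor pair, and dedicated max-flow implementations (e.g. the Boykov-Kolmogorov algorithm) are typically used instead.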
When Zi is not binary (as in stereo) we typically employ move making algorithms. The goal of these algorithms is
to modify as many pixels as possible simultaneously in order to avoid getting stuck in poor local minima. One of
the most popular approaches is that of α-expansion. Suppose that we have a current assignment of depths. Then
each pixel is given the option to switch from the current depth to a new depth α. Since each pixel has two choices
(move to α or retain the old value) this can be formulated as a binary problem. If this new binary problem is
submodular it can be solved efficiently using max-flow/min-cut algorithms.
Suppose that in the current assignment Zi = β and Zj = γ and consider the term Eij (Zi , Zj ) when we let the
pixels switch to α. We define a new binary energy term EB (xi , xj ) by letting xi = 0 when Zi retains its current
assignment, and $x_i = 1$ when $Z_i$ switches to $\alpha$ (and $x_j$ similarly). Then
$$E_B(0,0) = E_{ij}(\beta, \gamma), \quad E_B(1,0) = E_{ij}(\alpha, \gamma), \quad E_B(0,1) = E_{ij}(\beta, \alpha), \quad E_B(1,1) = E_{ij}(\alpha, \alpha).$$
A common choice of smoothness term in stereo is the truncated linear penalty
$$E_{ij}(Z_i, Z_j) = \min(|Z_i - Z_j|, t).$$
This term favors assignments where neighboring depths have small disparity differences. In real images we should
however also allow for discontinuous depth assignments due to object boundaries (transitions between smooth
surfaces). Therefore the threshold $t$ is added to the energy to prevent over-smoothing at discontinuities.
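A small sketch of this construction follows; the truncated linear term and the concrete labels are illustrative choices of ours. For such metric smoothness terms the resulting binary energy is always submodular, so each expansion move can be solved with max-flow/min-cut as above.

```python
def expansion_binary_terms(E_ij, beta, gamma, alpha):
    """Binary pairwise energy E_B for one alpha-expansion move, given
    the current labels Zi = beta and Zj = gamma. x = 0 keeps the
    current label, x = 1 switches the pixel to alpha."""
    return (E_ij(beta, gamma),   # E_B(0,0): both keep their labels
            E_ij(beta, alpha),   # E_B(0,1): j switches to alpha
            E_ij(alpha, gamma),  # E_B(1,0): i switches to alpha
            E_ij(alpha, alpha))  # E_B(1,1): both switch to alpha

# Truncated linear smoothness with threshold t = 3 (illustrative values).
smooth = lambda zi, zj: min(abs(zi - zj), 3)

B00, B01, B10, B11 = expansion_binary_terms(smooth, beta=2, gamma=7, alpha=4)
print("submodular:", B00 + B11 <= B01 + B10)  # True for any metric E_ij
```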