6.801/6.866: Machine Vision, Lecture 5
Professor Berthold Horn, Ryan Sander, Tadayuki Yoshitake
MIT Department of Electrical Engineering and Computer Science
Fall 2020
These lecture summaries are designed to be a review of the lecture. Though I do my best to include all main topics from the
lecture, the lectures themselves contain more elaborate explanations than these notes.
1 Lecture 5: TTC and FOR Montivision Demos, Vanishing Point, Use of VPs in Camera Calibration
In this lecture, we’ll continue going over vanishing points in machine vision, as well as introduce how we can use brightness
estimates to obtain estimates of a surface’s orientation. We’ll introduce this idea with Lambertian surfaces, but we can discuss
how this can generalize to many other types of surfaces as well.
1.1 Robust Line Estimation and Sampling
We’ll start by covering some of the topics discussed during lecture.
1.1.1 Line Intersection Least-Squares
This is helpful when considering multiple lines at a time; the optimization problem is given by:

min over (x, y) of Σᵢ (x cos θᵢ + y sin θᵢ − ρᵢ)²    (1)
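As a quick illustration, here is a minimal numpy sketch (the angles and offsets below are made-up sample values, not from lecture) that solves exactly this problem by stacking the constraints x cos θᵢ + y sin θᵢ = ρᵢ into an overdetermined linear system and handing it to a least-squares solver:

```python
import numpy as np

# Each detected line is parameterized as x*cos(theta_i) + y*sin(theta_i) = rho_i.
# Stacking all lines gives A @ [x, y]^T ~= rho, solved in the least-squares sense.
theta = np.array([0.1, 1.2, 2.0, 0.6])   # line orientations in radians (sample values)
rho = np.array([2.0, 1.5, -0.5, 1.0])    # signed distances from the origin (sample values)

A = np.column_stack([np.cos(theta), np.sin(theta)])
point, residuals, rank, _ = np.linalg.lstsq(A, rho, rcond=None)
print("least-squares intersection (x, y):", point)
```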
1.1.2 Dealing with Outliers
Many of the approaches we’ve covered thus far are centered around the idea of making estimates from data. But recall that
the data gathered by sensors can be noisy, and can be corrupted from phenomena such as crosstalk. Because of this noise,
it is advantageous to distinguish and filter outliers from the inliers in our dataset, especially when we make estimates using
Least-Squares.
A good choice of algorithm for dealing with outliers is RANSAC, or Random Sample Consensus [1]. This algorithm
is essentially an iterative and stochastic variation of Least-Squares. By randomly selecting points from an existing dataset to fit
lines and evaluate the fit, we can iteratively find line fits that minimize the least-squares error while distinguishing inliers from
outliers.
Figure 1: Pseudocode for the RANSAC Algorithm.
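Below is a minimal Python sketch of the same idea applied to 2D line fitting; the iteration count, inlier tolerance, and final refit step are illustrative choices of mine rather than the exact pseudocode from the figure.

```python
import numpy as np

def ransac_line(points, n_iters=200, inlier_tol=0.05, rng=np.random.default_rng(0)):
    """Fit a line to 2D points with RANSAC; returns (n_hat, rho) with n_hat . p = rho."""
    best_inliers = None
    for _ in range(n_iters):
        # 1. Randomly sample a minimal set (two points define a line).
        p1, p2 = points[rng.choice(len(points), size=2, replace=False)]
        direction = p2 - p1
        norm = np.linalg.norm(direction)
        if norm < 1e-12:
            continue
        n_hat = np.array([-direction[1], direction[0]]) / norm   # unit normal to the line
        rho = n_hat @ p1
        # 2. Score the candidate line by counting points within the inlier tolerance.
        dist = np.abs(points @ n_hat - rho)
        inliers = points[dist < inlier_tol]
        if best_inliers is None or len(inliers) > len(best_inliers):
            best_inliers = inliers
    # 3. Refit on the consensus set (total least squares via SVD).
    centroid = best_inliers.mean(axis=0)
    _, _, vt = np.linalg.svd(best_inliers - centroid)
    n_hat = vt[-1]                       # direction of least variance = line normal
    return n_hat, n_hat @ centroid

# Synthetic test: points near y = 0.5x + 0.2 plus a few gross outliers.
rng = np.random.default_rng(1)
t = rng.uniform(-1, 1, size=60)
pts = np.column_stack([t, 0.5 * t + 0.2]) + rng.normal(scale=0.01, size=(60, 2))
pts[:10] = rng.uniform(-1, 1, size=(10, 2))
print(ransac_line(pts))
```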
1.1.3 Reprojection and Rectification
These two problems are highly important in mapmaking, and involve solving the equations:
R r = r′    (2)
r = Rᵀ r′    (3)

Where r is the vector from the Center of Projection (COP) to the image plane.
1.1.4 Resampling
Resampling is also a valuable application in many facets of computer vision and robotics, especially if we seek to run any kind
of interpolation or subsampling algorithms. Some approaches for this:
• Nearest Neighbor: This is a class of methods in which we interpolate based on the values of neighboring points. This
can be done spatially (e.g. by looking at adjacent pixels) as well as over other image properties such as brightness and color. A
common algorithm used here is K-Nearest Neighbors (KNN), in which interpolation is done based on the K nearest
points in the desired space.
• Bilinear Interpolation: An extension of linear interpolation to functions on two-dimensional grids/of two variables
(e.g. (x, y) or (i, j)) [2], such as the brightness or motion in images (a short sketch follows this list).
• Bicubic Interpolation: Similar to bilinear interpolation, bicubic interpolation is an extension of cubic interpolation
to functions on two-dimensional grids/of two variables (e.g. (x, y) or (i, j)) [3], such as the brightness or motion in
images. Bicubic interpolation tends to perform much better than bilinear interpolation, though at the cost of additional
computational resources.
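Here is the bilinear interpolation sketch promised above (a hand-rolled version for clarity; the function name and sample values are my own, and in practice one would typically reach for library routines such as scipy.ndimage.map_coordinates or cv2.remap):

```python
import numpy as np

def bilinear(img, x, y):
    """Interpolate image brightness at continuous coordinates (x, y),
    where integer (i, j) index img[row=j, col=i]."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, img.shape[1] - 1), min(y0 + 1, img.shape[0] - 1)
    fx, fy = x - x0, y - y0
    # Linear interpolation in x along the two bounding rows, then in y between them.
    top = (1 - fx) * img[y0, x0] + fx * img[y0, x1]
    bottom = (1 - fx) * img[y1, x0] + fx * img[y1, x1]
    return (1 - fy) * top + fy * bottom

img = np.arange(16, dtype=float).reshape(4, 4)   # toy brightness ramp
print(bilinear(img, 1.5, 2.25))                  # 10.5, between the four neighbors
```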
1.2 Magnification with TTC
• For the Montivision demo, note the following for the bars:
– A → x-dimension for the motion estimation problem
– B → y-dimension for the motion estimation problem
– C → Time to Contact
• Recall from the previous lectures that the fractional (percent) rate of change of size in the image equals, up to sign, the fractional rate of change of the object's distance, which is exactly what time to contact captures. We can derive this through perspective projection. The equation for this is:

s / f = S / Z    (4)

Where s is the size in the image plane and S is the (constant) size in the world. Differentiating with respect to time (using the product rule) gives us:

d/dt (sZ = f S)  →  Z ds/dt + s dZ/dt = 0    (5)
Rearranging terms:

(dZ/dt) / Z = −(ds/dt) / s    (6)
Recall that the intuition here is that the fractional rate of change of size in the image equals (minus) the fractional rate of change of distance in the world, so the time to contact can be read off from image measurements alone.
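Equation (6) implies that the time to contact, −Z/(dZ/dt), can be computed as s/(ds/dt) without knowing Z or S. A small numeric sketch (the pixel sizes and frame rate below are made up):

```python
# Time to contact from image size alone: TTC = s / (ds/dt); Z and S never appear.
s_prev, s_curr = 40.0, 42.0     # object size in pixels in consecutive frames (made-up)
dt = 1.0 / 30.0                 # frame interval in seconds, assuming 30 fps

ds_dt = (s_curr - s_prev) / dt
ttc = s_curr / ds_dt            # seconds until contact, assuming constant approach speed
print(f"estimated time to contact: {ttc:.2f} s")   # 0.70 s for these numbers
```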
1.2.1 Perspective Projection and Vanishing Points
Let’s begin by defining a few terms. Some of these will be a review, but these review terms connect to new terms that we’ll
introduce shortly:
• vanishing points: These are the points in the image plane (or extended out from the image plane) that parallel lines in
the world converge to.
• Center of Projection (COP): This is the point in the camera/sensor (not in the image plane) where incident projected
light rays converge. An analog to the COP for human vision systems is your eye. NOTE: the COP and vanishing points are
oftentimes different.
• Principal Point: The orthogonal projection of the Center of Projection (COP) onto the image plane.
• f : Similar to the focal length we’ve seen in other perspective projection examples, this f is the perpendicular distance from
the COP to the image plane.
Recall that a common problem we can solve with the use of vanishing points is finding the Center of Projection (COP).
Solving this problem in 3D has 3 degrees of freedom, so consequently we’ll try to solve it using three equations.
Intuitively, this problem of finding the Center of Projection can be thought of as finding the intersection of three spheres,
each of which have two vanishing points along their diameters. Note that three spheres can intersect in up to two places - in this
case we have defined the physically-feasible solution, and therefore the solution of interest, to be the solution above the image
plane (the other solution will be a mirror image of this solution and can be found below the image plane).
Application of this problem: This problem comes up frequently in photogrammetry, in that simply having two locations as
your vanishing points isn’t enough to uniquely identify your location on a sphere.
1.2.2 Lines in 2D and 3D
Next, let’s briefly review how we can parameterize lines in 2D and 3D:
• 2D: (2 Degrees of Freedom)
– y = mx + c
– ax + by + c = 0
– sin θx − cos θy + ρ = 0
– If n̂ ≜ (− sin θ, cos θ)ᵀ, then n̂ · r = ρ.
• 3D: (3 Degrees of Freedom)
– n̂ · r = ρ
– aX + bY + cZ + d = 0
1.2.3 Application: Multilateration (MLAT)
This problem comes up when we estimate either our own position or the position of other objects based on Time of Arrival measurements
[4]. One specific application is localizing ourselves using distances/Times of Arrival from WiFi access points inside
buildings or other locations without access to GPS.
Like the other problems we’ve looked at, this problem can be solved by finding the intersection of 3 spheres. Let’s begin
with:
||r − rᵢ||₂ = ρᵢ    ∀ i ∈ {1, · · · , N}    (7)
Next, let's square both sides of this equation and rewrite the left-hand side with dot products:

||r − rᵢ||₂² = (r − rᵢ)ᵀ(r − rᵢ) = ρᵢ²    (8)
r · r − 2 r · rᵢ + rᵢ · rᵢ = ρᵢ²    (9)

Recall from Bezout's Theorem that there are up to 8 possible solutions here, since we have three second-order polynomial equations (2 × 2 × 2 = 8). To get rid of the second-order term r · r, we simply subtract pairs of these equations:

r · r − 2 r · rᵢ + rᵢ · rᵢ = ρᵢ²    (10)
r · r − 2 r · rⱼ + rⱼ · rⱼ = ρⱼ²    ∀ i, j ∈ {1, 2, 3}, i ≠ j    (11)

Subtracting these pairs of equations yields a linear equation in r:

2 r · (rⱼ − rᵢ) = (Rⱼ² − Rᵢ²) − (ρⱼ² − ρᵢ²)    (12)

(Where the scalar Rⱼ² ≜ rⱼ · rⱼ.)
Putting these equations together, this is equivalent to finding the intersection of three different spheres:

⎡(r₂ − r₁)ᵀ⎤         ⎡(R₂² − R₁²) − (ρ₂² − ρ₁²)⎤
⎢(r₃ − r₂)ᵀ⎥ r  =  ½ ⎢(R₃² − R₂²) − (ρ₃² − ρ₂²)⎥    (13)
⎣(r₁ − r₃)ᵀ⎦         ⎣(R₁² − R₃²) − (ρ₁² − ρ₃²)⎦
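As an aside, here is a minimal numpy sketch of this linearization (the anchor positions and ranges are made up; I also add a fourth, non-coplanar anchor so that the linear system below has a unique least-squares solution, since with only three anchors the matrix is singular, as discussed next):

```python
import numpy as np

# Anchor positions r_i (rows) and noise-free ranges rho_i = ||r - r_i|| (made-up data).
anchors = np.array([[0.0, 0.0, 0.0],
                    [10.0, 0.0, 0.0],
                    [0.0, 10.0, 0.0],
                    [0.0, 0.0, 10.0]])
r_true = np.array([3.0, 4.0, 2.0])
rho = np.linalg.norm(r_true - anchors, axis=1)

# Subtract the first sphere equation from the others to cancel the quadratic term r.r:
#   2 r . (r_i - r_1) = (R_i^2 - R_1^2) - (rho_i^2 - rho_1^2)      (cf. Eq. (12))
A = 2.0 * (anchors[1:] - anchors[0])
R2 = np.sum(anchors**2, axis=1)
b = (R2[1:] - R2[0]) - (rho[1:]**2 - rho[0]**2)

r_est, *_ = np.linalg.lstsq(A, b, rcond=None)
print("estimated position:", r_est)        # approximately [3, 4, 2]
```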
However, even though we've eliminated the second-order terms from these three equations, we still have two solutions. Recall
from linear algebra that equations don't have a unique solution when there is redundancy or linear dependence between them.
If we add up the three rows of the matrix on the left-hand side, we get the zero vector, which indicates that this matrix is singular:

     ⎡(r₂ − r₁)ᵀ⎤
A ≜  ⎢(r₃ − r₂)ᵀ⎥ ∈ ℝ³ˣ³    (14)
     ⎣(r₁ − r₃)ᵀ⎦
To resolve this linear dependence, we again use Bezout's Theorem and keep one of the second-order equations alongside the linear ones. In the camera-calibration version of this problem, where the vanishing points rᵢ sit at opposite ends of each sphere's diameter, those second-order equations are:

(r − r₁) · (r − r₂) = 0    (15)
(r − r₂) · (r − r₃) = 0    (16)

Subtracting (16) from (15):  (r − r₂) · (r₃ − r₁) = 0  →  (r − r₂) ⊥ (r₃ − r₁)    (17)

I.e. the difference of the two sphere equations is a plane that passes through r₂ and is perpendicular to (r₃ − r₁); in the image this is the altitude of the triangle of vanishing points through r₂. Intersecting the altitudes gives the orthocenter of that triangle, which is the principal point. Now, all we need is to find the quantity f to find the Center of Projection.
Next, note the following relations between the vanishing points and ẑ, the unit vector perpendicular to the image plane (measuring the rᵢ from a point in the image plane, such as the principal point, these vectors lie in the image plane):

r₁ · ẑ = 0    (18)
r₂ · ẑ = 0    (19)
r₃ · ẑ = 0    (20)
What else is this useful for? Here are some other applications:
• Camera calibration (this was the example above)
• Detecting image cropping - e.g. if you have been cropped out of an image
• Photogrammetry - e.g. by verifying if the explorer who claimed he was the first to make it “all the way” to the North Pole
actually did (fun fact: he didn’t).
Next, now that we’ve determined the principal point, what can we say about f (the “focal length”)?
For this, let us consider the 3D simplex, which is the triangular surface in ℝ³ given by the unit vectors:

e₁ ≜ (1, 0, 0)ᵀ    (21)
e₂ ≜ (0, 1, 0)ᵀ    (22)
e₃ ≜ (0, 0, 1)ᵀ    (23)
Using this 3D simplex, let us suppose that each side of the triangle formed by this simplex takes length v = √2, which is
consistent with the ℓ₂ norm of the edges spanned by the unit simplex (√(1² + 1²) = √2).
Next, we solve for f by finding the value of a such that the vector of all a's is the unit normal to the simplex plane, i.e. such that its dot product with the unit vector perpendicular to the simplex, n̂ = (1/√3, 1/√3, 1/√3)ᵀ, is equal to 1:

(a, a, a) · (1/√3, 1/√3, 1/√3)ᵀ = 1    (24)

3a/√3 = 1    (25)

a = √3/3 = 1/√3    (26)

This value a = 1/√3 equals f, the perpendicular distance from the COP (here, the origin) to the plane of the simplex, since that distance is the projection of any vertex onto the unit normal: f = n̂ · e₁ = 1/√3. Then we can relate the side length v of the triangle (the distance between pairs of vanishing points, ||eᵢ − eⱼ||₂) and f:

v = √2,  f = 1/√3  ⟹  f = v/√6    (27)
With this, we’ve now computed both the principal point and the “focal length” f for camera calibration.
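To make this concrete, here is a small numpy sketch (my own construction, not code from the lecture) that recovers the principal point as the orthocenter of the vanishing-point triangle and then the focal length from the orthogonality condition (p − vᵢ)·(p − vⱼ) + f² = 0. The three vanishing points below were synthesized from a known COP with principal point (0.2, −0.1) and f = 1.5, so the recovered values should match.

```python
import numpy as np

def cop_from_vanishing_points(v1, v2, v3):
    """Recover the principal point p and focal length f from three vanishing points
    of mutually orthogonal world directions (all in 2D image coordinates)."""
    # p is the orthocenter of the triangle: (p - v1) is perpendicular to (v2 - v3), etc.
    A = np.array([v2 - v3, v1 - v3])
    b = np.array([(v2 - v3) @ v1, (v1 - v3) @ v2])
    p = np.linalg.solve(A, b)
    # Orthogonality of the rays through the COP gives (p - vi).(p - vj) = -f^2.
    f = np.sqrt(-(v1 - p) @ (v2 - p))
    return p, f

# Vanishing points synthesized from principal point (0.2, -0.1) and f = 1.5.
v1 = np.array([-1.30, -0.85])
v2 = np.array([0.95, 1.40])
v3 = np.array([3.20, -3.10])
print(cop_from_vanishing_points(v1, v2, v3))   # -> (array([0.2, -0.1]), 1.5)
```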
1.2.4 Application: Understand Orientation of Camera w.r.t. World
For this problem, the goal is to transform from the world frame of reference to the camera frame of reference, i.e. find a frame
of reference such that the unit vectors x̂ᶜ, ŷᶜ, ẑᶜ have the following orthogonality properties:

x̂ᶜ · ŷᶜ ≈ 0    (28)
ŷᶜ · ẑᶜ ≈ 0    (29)
x̂ᶜ · ẑᶜ ≈ 0    (30)
(Note that the c superscript refers to the camera coordinate system.) If the location of the Center of Projection (COP) is given
above the image plane as the vector pᶜ and the vanishing points are given as the vectors r₁ᶜ, r₂ᶜ, and r₃ᶜ (note that these vanishing
points must be expressed in the same frame of reference for this computation to carry through), then we can derive expressions for the
unit vectors through the following relations:

(pᶜ − r₁ᶜ) ∥ x̂ᶜ  ⟹  x̂ᶜ = (pᶜ − r₁ᶜ) / ||pᶜ − r₁ᶜ||₂    (31)
(pᶜ − r₂ᶜ) ∥ ŷᶜ  ⟹  ŷᶜ = (pᶜ − r₂ᶜ) / ||pᶜ − r₂ᶜ||₂    (32)
(pᶜ − r₃ᶜ) ∥ ẑᶜ  ⟹  ẑᶜ = (pᶜ − r₃ᶜ) / ||pᶜ − r₃ᶜ||₂    (33)
Then, after deriving the relative transformation between the world frame (typically denoted w in robotics) and the camera frame
(typically denoted c in robotics), we can express a vector r, for instance the principal point/Center of Projection, in terms of the camera basis:

r = α x̂ᶜ + β ŷᶜ + γ ẑᶜ    (34)

Where (α, β, γ) are the coordinates of r in this coordinate system (since x̂ᶜ, ŷᶜ, and ẑᶜ comprise its orthonormal basis). We can collect these relative coordinates into the vector:

r′ = (α, β, γ)ᵀ    (35)
To transform between frames, i.e. to map the coordinates r′ back to the vector r, we can do so via the following equation:

r = [ x̂ᶜ  ŷᶜ  ẑᶜ ] r′    (36)

Note that the matrix defined by R ≜ [ x̂ᶜ  ŷᶜ  ẑᶜ ] (the camera axes stacked as columns) is orthonormal, since its columns are orthogonal unit vectors. This matrix R is a rotation matrix, which means it is invertible and its transpose is equal to its inverse:
R⁻¹ = Rᵀ. Therefore, if we want to solve the reverse problem to the one above (finding the coordinates r′ from the vector r), we can do so by simply taking the transpose of the matrix:

r′ = [ x̂ᶜ  ŷᶜ  ẑᶜ ]⁻¹ r    (37)
   = [ x̂ᶜ  ŷᶜ  ẑᶜ ]ᵀ r    (38)
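As a small sanity check (with an arbitrary example basis, not from lecture), the following sketch builds R from the camera axes and verifies that the transpose undoes the transformation:

```python
import numpy as np

# Example orthonormal camera axes (a rotation of 30 degrees about z), in world coordinates.
theta = np.deg2rad(30.0)
x_c = np.array([np.cos(theta), np.sin(theta), 0.0])
y_c = np.array([-np.sin(theta), np.cos(theta), 0.0])
z_c = np.array([0.0, 0.0, 1.0])

R = np.column_stack([x_c, y_c, z_c])   # camera axes as columns, as in Eq. (36)
r_prime = np.array([1.0, 2.0, 3.0])    # coordinates (alpha, beta, gamma) in the camera basis

r = R @ r_prime                        # Eq. (36): back to the vector r
r_prime_back = R.T @ r                 # Eq. (38): transpose = inverse for a rotation
print(np.allclose(r_prime_back, r_prime), np.allclose(R.T @ R, np.eye(3)))   # True True
```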
1.3 Brightness
We'll now switch to the topic of brightness, and how we can use it for surface estimation. Recall that for a Lambertian surface
(which we'll assume for now), the power received by photoreceptors (such as a human eye or an array of photodiodes)
depends both on the power emitted by the source and on the angle between the direction to the light source and the surface normal. However, the
brightness of the object does not depend on the viewing angle between the object and the viewer.
This is related to foreshortening (the visual effect or optical illusion that causes an object or distance to appear shorter
than it actually is because it is angled toward the viewer [6]): the perceived area of a surface patch is its true area times the cosine
of the angle between the surface normal and the direction to the light source (or viewer). Definition: Lambertian Object: An object that appears equally
bright from all viewing directions and reflects all incident light, absorbing none [5].
Let's look at a simple case: a Lambertian surface. Suppose the observed brightness is modeled as:

E = n̂ · s,    (39)

Where s is the vector pointing from the object toward the light source: can we use this information to recover the surface orientation
given by n̂? This unit surface normal has two degrees of freedom, since it is a unit vector in 3D (a point on the unit sphere).
It is hard to estimate this from just a single brightness measurement. But what if we take measurements under different lighting conditions?
E1 = n̂ · s1 (40)
E2 = n̂ · s2 (41)
n̂ · n̂ = ||n̂||2 = 1 (42)
This is equivalent, intuitively, to finding the intersection between two cones for which we have different angles, which forms a
planar curve. We then intersect this planar curve with the unit sphere corresponding to the constraint ||n̂||2 = 1. By Bezout’s
Theorem, this will produce two solutions.
One problem with this approach, however, is that these equations are not linear in the unknowns. We can fold the surface's reflectance (albedo)
into the unknown to linearize the problem. Let us define the following:
Definition: Albedo: This is the ratio of the power reflected by an object to the power incident on it:

"albedo" ≜ ρ = (power out) / (power in) ∈ [0, 1]    (43)
Though the definition varies across different domains, in this class, we define albedo to be for a specific orientation, i.e. a specific
si .
Fun fact: Although an albedo greater than 1 technically violates the 2nd Law of Thermodynamics, surfaces sprayed with
fluorescent paint can appear to exhibit ρ > 1 (for instance by absorbing ultraviolet light and re-emitting it as visible light).
Using this albedo term, we can now remove the nonlinearity from our equations. Note that below we use
three different measurements this time, rather than just two:
E1 = ρn̂ · s1 (44)
E2 = ρn̂ · s2 (45)
E3 = ρn̂ · s3 (46)
This gives a system of 3 equations in 3 unknowns: the components of the vector n, into which we fold the albedo:

n ≜ ρ n̂    (47)
n̂ = n / ||n||₂,  ρ = ||n||₂    (48)
Combining these equations and constraints, we can rewrite the above in matrix/vector form:

⎡ŝ₁ᵀ⎤       ⎡E₁⎤                    ⎡E₁⎤               ⎡ŝ₁ᵀ⎤
⎢ŝ₂ᵀ⎥ n  =  ⎢E₂⎥    ⟹    n = S⁻¹ ⎢E₂⎥    (Where S ≜ ⎢ŝ₂ᵀ⎥.)    (49)
⎣ŝ₃ᵀ⎦       ⎣E₃⎦                    ⎣E₃⎦               ⎣ŝ₃ᵀ⎦
A quick note on the solution above: like other linear algebra-based approaches we've investigated so far, the matrix S above
isn't necessarily invertible. In particular, S is singular (its rows are linearly dependent), and therefore non-invertible, when the
three light source directions are coplanar.
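Here is a minimal numpy sketch of this three-source recovery, sometimes called photometric stereo (the light directions and measurements below are synthetic, chosen by me): in the noise-free case, n = S⁻¹E recovers both the albedo and the unit normal exactly.

```python
import numpy as np

# Three non-coplanar, unit light-source directions stacked as the rows of S.
S = np.array([[0.0, 0.0, 1.0],
              [0.8, 0.0, 0.6],
              [0.0, 0.8, 0.6]])

# Synthesize brightness measurements from a known albedo and surface normal.
rho_true, n_hat_true = 0.7, np.array([0.0, 0.6, 0.8])
E = S @ (rho_true * n_hat_true)

# Recover n = rho * n_hat, then split it into albedo (length) and orientation (direction).
n = np.linalg.solve(S, E)
rho, n_hat = np.linalg.norm(n), n / np.linalg.norm(n)
print(rho, n_hat)    # -> 0.7, [0, 0.6, 0.8]
```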
1.3.1 What if We Can't Use Multiple Orientations?
In the case where the orientation of the camera, object, and light source cannot be changed (i.e. everything is fixed), or if it is
expensive to generate a surface estimate from a new orientation, then another option is to use different color sensors. Most
cameras have RGB ("Red-Green-Blue") sensors, and thus we can use the same approach as above, with each color channel
(illuminated from a different direction) playing the role of a different source orientation.
This RGB approach has a few issues:
• RGB has “crosstalk” between color channels (it can be hard to excite/activate a single channel for color without exciting
the others as well).
• The object may not be uniformly-colored (which, practically, is quite often the case).
However, despite the drawbacks, this approach enables us to recover the surface orientation of an object from a single RGB
monocular camera.
A final note: we originally assumed this object was Lambertian, and because of this used the relation that a surface patch's perceived
area is its true area scaled by the cosine of the angle between the surface normal and the light source direction. Does this apply for real
surfaces? Not exactly, because many are not ideal Lambertian surfaces. However, in practice, we can simply use a lookup table,
calibrated from real images, in place of the cosine model.
1.4 References
1. RANSAC Algorithm: http://www.cse.yorku.ca/~kosta/CompVis_Notes/ransac.pdf
2. Bilinear Interpolation: https://en.wikipedia.org/wiki/Bilinear_interpolation
3. Bicubic Interpolation: https://en.wikipedia.org/wiki/Bicubic_interpolation
4. Multilateration: https://www.icao.int/Meetings/AMC/MA/2007/surv_semi/Day02_SENSIS_Turner.pdf
5. Lambertian Surface: Robot Vision, Page 212
6. Foreshortening: https://en.wikipedia.org/wiki/Perspective
MIT OpenCourseWare
https://ocw.mit.edu
6.801 / 6.866 Machine Vision
Fall 2020
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms