Computer Vision Toolbox™ User's Guide
R2023b
Build and Deploy Visual SLAM Algorithm with ROS in MATLAB . . . . . . 1-25
Code Generation and Third-Party Examples
2
Code Generation for Monocular Visual Simultaneous Localization and Mapping . . . . . . . . 2-2
Detect Image Anomalies Using Pretrained ResNet-18 Feature Embeddings . . . . . . . . 3-141
Activity Recognition from Video and Optical Flow Data Using Deep Learning . . . . . . . . 3-212
Export YOLO v2 Object Detector to ONNX . . . . . . . . . . . . . . . . . . . . . . . 3-443
Find Image Rotation and Scale Using Automated Feature Matching . . . 4-25
Pattern Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-41
Traffic Warning Sign Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-24
Visual Tracking of Occluded and Unresolved Objects . . . . . . . . . . . . . . . 8-15
Create Faster R-CNN Object Detection Network . . . . . . . . . . . . . . . . . . 10-10
Labelers
11
Elements of Ground Truth Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-2
Exported Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-2
Custom Algorithm Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-46
Frame Navigation and Time Interval Settings . . . . . . . . . . . . . . . . . . . 11-102
Labeling Window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-102
Polyline Drawing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-103
Polygon Drawing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-104
Zooming and Panning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-104
App Sessions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-104
Featured Examples
12
Localize and Read Multiple Barcodes in Image . . . . . . . . . . . . . . . . . . . . 12-2
Getting Started with Point Clouds Using Deep Learning . . . . . . . . . . . . . 14-3
Import Point Cloud Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-3
Augment Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-3
Encode Point Cloud Data to Image-like Format . . . . . . . . . . . . . . . . . . . . 14-4
Train a Deep Learning Classification Network with Encoded Point Cloud Data . . . . . . . . 14-4
Subtract Image Background by Using OpenCV in MATLAB . . . . . . . . . 15-19
Draw Shapes and Lines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-8
Rectangle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-8
Line and Polyline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-8
Polygon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-10
Circle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-10
Object Detection
19
Train Custom OCR Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19-2
Prepare Training Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19-2
Train an OCR model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19-4
Evaluate OCR training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19-4
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19-43
Getting Started with R-CNN, Fast R-CNN, and Faster R-CNN . . . . . . . . 19-89
Object Detection Using R-CNN Algorithms . . . . . . . . . . . . . . . . . . . . . . 19-89
Comparison of R-CNN Object Detectors . . . . . . . . . . . . . . . . . . . . . . . . 19-91
Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19-91
Design an R-CNN, Fast R-CNN, and a Faster R-CNN Model . . . . . . . . . . 19-92
Label Training Data for Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . 19-93
Image Classification with Bag of Visual Words . . . . . . . . . . . . . . . . . . . 19-171
Step 1: Set Up Image Category Sets . . . . . . . . . . . . . . . . . . . . . . . . . . 19-171
Step 2: Create Bag of Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19-171
Step 3: Train an Image Classifier With Bag of Visual Words . . . . . . . . 19-172
Step 4: Classify an Image or Image Set . . . . . . . . . . . . . . . . . . . . . . . . 19-173
Fixed-Point Design
21
Fixed-Point Signal Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-2
Fixed-Point Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-2
Benefits of Fixed-Point Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-2
Benefits of Fixed-Point Design with System Toolboxes Software . . . . . . . 21-2
Portable C Code Generation for Functions That Use OpenCV Library . . 22-4
Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-4
Enhance Contrast of Grayscale Image Using Histogram Equalization . . . . . . . . 23-63
Perform Opening of Binary Image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23-141
Convert Data Type and Color Space of Image from RGB to HSV . . . . . 23-153
Camera Calibration and SfM Examples
Monocular Visual-Inertial SLAM
Simultaneous Localization and Mapping (SLAM) is a complex process that involves determining the
position and orientation of a sensor, such as a camera, in relation to its surroundings. It also entails
creating a map of the environment by identifying the 3D locations of various points. This example
demonstrates how to effectively perform SLAM by combining images captured by a monocular
camera with measurements obtained from an IMU sensor.
The method demonstrated in this example is inspired by ORB-SLAM3, which is a feature-based visual-inertial SLAM algorithm. The camera and IMU fusion is achieved using a factorGraph (Navigation Toolbox), which is a bipartite graph that consists of two types of nodes:
1 Variable Nodes: They represent the unknown random variables in an estimation problem such as
the position of a mobile robot or a 3D point.
2 Factor Nodes: They represent a function which quantifies the relationship between the random
variables. For example, a camera correspondence factor between 2D image points and 3D map
points or an IMU measurement factor between two positions.
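To make the two node types concrete, this minimal sketch (which assumes only the Navigation Toolbox factorGraph interface used later in this example, and uses an illustrative measurement value) creates two pose variable nodes and connects them with a single relative-pose factor node:
% Minimal factor graph: two SE(3) pose variable nodes linked by one factor node.
fg = factorGraph;
poseIds = generateNodeID(fg,2);   % unique IDs for two pose variable nodes
% Relative-pose factor between the two poses: 1 m translation along x and an
% identity rotation, in [dx dy dz qw qx qy qz] form (illustrative values only).
fRel = factorTwoPoseSE3(poseIds,Measurement=[1 0 0 1 0 0 0]);
addFactor(fg,fRel);
% Anchor the first pose and optimize; the solver then estimates the second pose.
fixNode(fg,poseIds(1));
optimize(fg,factorGraphSolverOptions);
secondPose = nodeState(fg,poseIds(2))   % returned as [x y z qw qx qy qz]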
Factor graphs are a popular way of designing algorithms for sensor-fusion-based SLAM because they provide an efficient and modular solution to SLAM problems and allow the graph to be easily updated when new measurements or constraints become available. Additionally, this type of model provides a flexible approach to incorporating different types of sensors and data, including visual, lidar, and inertial sensors, which makes it useful for a variety of SLAM applications.
This example illustrates how to construct a monocular visual-inertial SLAM pipeline using a factor
graph step by step.
Overview
• Initial IMU Bias Estimation (Optional): If your data sequence contains a sufficient number of
static frames, you can use them to estimate the IMU bias. This helps you initialize the factor graph
later.
• Map Initialization: Start by initializing the map of 3-D points from two video frames. The 3-D
points and relative camera pose are computed using triangulation based on 2-D ORB feature
correspondences.
• Camera and IMU alignment: Once the map is initialized, generate a set of camera poses using
Structure From Motion (SFM). Then use corresponding IMU measurements to align the camera
and IMU frames and to find the scale conversion between the IMU and camera odometries.
• Tracking: Once the camera and IMU data are aligned and scaled correctly, initialize the factor
graph. Then, for each new key frame, estimate the camera pose by matching features in the
current frame to features in the last key frame. Refine the estimated camera pose by tracking the
local map. Then, add the camera poses to the factor graph with their corresponding IMU
measurements.
• Local Mapping: Use the current frame to create new 3-D map points if it is identified as a key
frame. At this stage, the factor graph is optimized, and the camera poses and 3-D points are
adjusted using a set of local key frames.
• Loop Closure: Detect loops for each key frame by comparing the key frame against all previous
key frames using the bag-of-features approach. Once a loop closure is detected, add a new link to
the factor graph to reflect the connection found by the loop closure.
The data used in this example has been extracted from the Blackbird data set (NYC Subway Winter). It represents a sequence from a UAV flying around in a simulated subway environment. Download the MAT file, which contains the images, camera intrinsics, and IMU measurements.
uavData = helperDownloadSLAMData();
images = uavData.images;
intrinsics = uavData.intrinsics;
timeStamps = uavData.timeStamps;
Set up the noise parameters using the factorIMUParameters object. These values are typically
provided by the IMU manufacturer.
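The exact values depend on your sensor, so the following sketch uses placeholder numbers rather than values from this data set; substitute the noise densities from your IMU datasheet.
% Illustrative IMU noise parameters; replace these values with those from
% your IMU manufacturer's datasheet.
imuParams = factorIMUParameters( ...
    SampleRate=100, ...                  % IMU sample rate (Hz)
    GyroscopeNoise=1e-4*eye(3), ...      % gyroscope measurement noise covariance
    AccelerometerNoise=1e-3*eye(3), ...  % accelerometer measurement noise covariance
    GyroscopeBiasNoise=1e-8*eye(3), ...  % gyroscope bias process noise
    AccelerometerBiasNoise=1e-7*eye(3)); % accelerometer bias process noise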
A section of the input data sequence contains only static frames. These frames are useful for estimating the initial bias of the IMU sensor, which you can then use as a prior when you construct the factor graph. Accounting for the bias helps to reduce pose estimation errors because it can serve as an initial guess during the optimization. This step is optional, as it depends on the availability of static frames.
Map Initialization
Start by initializing the map, which holds the 3-D world points. This step is crucial and has a significant impact on the accuracy of the final SLAM result. Find the initial ORB feature point correspondences between a pair of images using the matchFeatures function. After the
correspondences are found, you can use one of these two geometric transformation models to establish the map initialization:
• Homography: If the scene is planar, a homography projective transformation is a better choice to describe the feature point correspondences.
• Fundamental Matrix: If the scene is non-planar, a fundamental matrix must be used instead.
The homography and the fundamental matrix can be computed using estgeotform2d and estimateFundamentalMatrix, respectively. The model that results in a smaller reprojection error is selected to estimate the relative rotation and translation between the two frames using estrelpose. Because the RGB images are taken by a monocular camera, which does not provide depth information, the relative translation can be recovered only up to a specific scale factor.
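As a hedged sketch of this model-selection step, assume matchedPoints1 and matchedPoints2 are the M-by-2 matched point locations from the two frames and intrinsics is the camera intrinsics object; for brevity, the sketch selects the model by inlier count rather than by the reprojection-error comparison described above.
% Estimate both candidate geometric models from the matched points.
[tformH,inliersH] = estgeotform2d(matchedPoints1,matchedPoints2,"projective");
[F,inliersF] = estimateFundamentalMatrix(matchedPoints1,matchedPoints2,Method="RANSAC");
% Recover the relative camera pose from the better-supported model.
if nnz(inliersH) > nnz(inliersF)
    relPose = estrelpose(tformH,intrinsics, ...
        matchedPoints1(inliersH,:),matchedPoints2(inliersH,:));
else
    relPose = estrelpose(F,intrinsics, ...
        matchedPoints1(inliersF,:),matchedPoints2(inliersF,:));
end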
Given the relative camera pose and the matched feature points in the two images, the 3-D locations of the matched points are determined using the triangulateMultiview function. A triangulated map point is valid when it is located in front of both cameras, when its reprojection error is low, and when the parallax of the two views of the point is sufficiently large.
% Set random seed for reproducibility
rng(0);
currFrameIdx = startFrameIdx + 1;
firstI = currI; % Preserve the first frame
isMapInitialized = false;
currFrameIdx = currFrameIdx + 1;
preMatchedPoints = prePoints(indexPairs(:,1),:);
currMatchedPoints = currPoints(indexPairs(:,2),:);
if ~isValid
continue
end
isMapInitialized = true;
disp(['Map initialized with frame ', num2str(startFrameIdx),' and frame ', num2str(currFrameIdx-1)])
end % End of map initialization loop
if isMapInitialized
% Show matched features
hfeature = showMatchedFeatures(firstI, currI, prePoints(indexPairs(:,1)), ...
currPoints(indexPairs(:, 2)), "Montage");
else
error('Unable to initialize the map.')
end
After the map is initialized using two frames, you can use imageviewset and worldpointset to
store the two key frames and the corresponding map points:
• imageviewset stores the key frames and their attributes, such as ORB descriptors, feature points, and camera poses, as well as connections between the key frames, such as feature point matches and relative camera poses. It also builds and updates a pose graph. The absolute camera poses and relative camera poses of odometry edges are stored as rigidtform3d objects. The relative camera poses of loop-closure edges are stored as affinetform3d objects.
• worldpointset stores 3-D positions of the map points and the 3-D to 2-D projection
correspondences, indicating which map points are observed in a key frame and which key frames
observe a map point. It also stores other attributes of map points, such as the mean view
direction, the representative ORB descriptors, and the range of distance at which the map point
can be observed.
% Add the first key frame. Place the camera associated with the first
% key frame at the origin, oriented along the Z-axis
preViewId = 1;
vSetKeyFrames = addView(vSetKeyFrames, preViewId, rigidtform3d, Points=prePoints,...
Features=preFeatures.Features);
% Add connection between the first and the second key frame
vSetKeyFrames = addConnection(vSetKeyFrames, preViewId, currViewId, relPose, Matches=indexPairs);
% Add image points corresponding to the map points in the first key frame
mapPointSet = addCorrespondences(mapPointSet, preViewId, newPointIdx, indexPairs(:,1));
% Add image points corresponding to the map points in the second key frame
mapPointSet = addCorrespondences(mapPointSet, currViewId, newPointIdx, indexPairs(:,2));
Loop detection in this example is performed using the bag-of-words approach. You can create a visual vocabulary, represented as a bagOfFeatures object, offline using the ORB descriptors extracted from a large set of images in the data set by calling:
bag = bagOfFeatures(imds,CustomExtractor=@helperORBFeatureExtractorFunction, ...
    TreeProperties=[3, 10],StrongestFeatures=1);
The loop closure process builds a database by encoding the incoming images into feature vectors that compactly represent each view.
Initialize the Factor Graph and Configure Trajectory and Map Visualization
Refine the initial reconstruction using a factorGraph (Navigation Toolbox), which optimizes both
camera poses and world points to minimize the overall reprojection errors.
In MATLAB, working with a factor graph involves managing a set of unique IDs for the different parts of the graph, including poses, 3-D points, and IMU measurements. By using these IDs, you can add constraints between the variable nodes in the factor graph, such as the corresponding 2-D image matches for a set of 3-D points, or gyroscope measurements between two sets of poses. This is accomplished using factor nodes. You can generate unique IDs using the generateNodeID (Navigation Toolbox) function.
To initialize the factor graph, two pose variable nodes representing the first and second poses are
combined with a factorCameraSE3AndPointXYZ (Navigation Toolbox) that represents the links
between 2D feature points and 3D map points. At this stage, IMU measurements cannot be added to
the graph yet, as the correct alignment and scaling of IMU data require several camera poses to be
collected first.
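A condensed sketch of that initialization step follows. It assumes poseNodeId is the ID of one of the two pose nodes, pointNodeIds holds one ID per triangulated 3-D map point observed in that view, imagePoints contains the corresponding 2-D feature locations, and K is the camera intrinsic matrix.
% One camera projection factor links a pose node to the 3-D point nodes it
% observes, with the 2-D image locations as the measurements.
fCam = factorCameraSE3AndPointXYZ( ...
    [repmat(poseNodeId,numel(pointNodeIds),1) pointNodeIds(:)], ...
    K,Measurement=imagePoints);
addFactor(fGraph,fCam);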
Once the initialization is successful, update the attributes of the map points including 3-D locations,
view direction, and depth range. You can use helperVisualizeMotionAndStructure to visualize the map
points and the camera locations.
showPlotLegend(mapPlot);
When using a monocular camera, pose estimates are obtained at an unknown scale and differ from the metric measurements collected by an IMU. To address this issue, use the accelerometer measurements to estimate a scale factor that scales the input camera poses to match the metric scale of the IMU measurements. In addition, you must align the reference frame of the input poses with the local navigation reference frame of the IMU so that the constant gravity acceleration captured by the accelerometer can be removed. The estimateGravityRotationAndPoseScale function estimates both the scale and the gravity rotation, which you can use to improve the accuracy of the sensor fusion.
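To make the correction concrete, this sketch applies an estimated gravity rotation and pose scale to a single camera pose, mirroring the helperTransformToNavigationFrame logic shown later in this chapter. It assumes gravityRotation and poseScale are outputs of estimateGravityRotationAndPoseScale and camPose is a rigidtform3d camera pose.
% gravityRotation rotates the gravity vector from the IMU local navigation
% frame into the initial camera frame, so its inverse aligns camera poses
% with the navigation frame; poseScale converts them to metric units.
T = gravityRotation.A' * camPose.A;   % rotate the pose into the navigation frame
T(1:3,4) = poseScale * T(1:3,4);      % rescale the translation
camPoseAligned = rigidtform3d(T);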
The frames used for this alignment and scaling should be collected with a set of criteria in mind:
• Map initialization should be as accurate as possible, because it greatly affects the camera poses. Extracting a large number of features may be necessary to obtain good accuracy.
• Camera poses should contain both static and moving frames, with the moving frames containing a
strong acceleration in the gravity axis direction.
• For efficiency, a sliding window should be used to store the poses. The window size should be kept
relatively small to reduce the amount of scale drift in the poses.
Once the estimation has converged, update the poses in the factor graph by rotating and scaling them correctly. Also update the factor graph by setting the bias prior using the previously estimated value. Use the factorIMU (Navigation Toolbox) object to link the corresponding IMU measurements to the previously stored poses. Finally, optimize the factor graph to adjust the poses and 3-D points according to the newly added constraints. Use factorGraphSolverOptions (Navigation Toolbox) to set up the optimization parameters for the factor graph.
% Main loop
isLastFrameKeyFrame = true;
isIMUInitialized = false;
isLoopClosed = false;
% Estimate the gravity rotation and pose scale that helps in transforming input camera poses to the
% local navigation reference frame of IMU
[gDir,scale] = helperAlignIMUCamera(camPoses,uavData,imuParams,...
keyFrameToFrame,currKeyFrameId);
if scale>0.3
% Scale estimation was successful
isIMUInitialized = true;
xyzPoints1 = mapPointSet.WorldPoints;
xyzPoints2 = nodeState(fGraph,nodeIDs(fGraph,NodeType="POINT_XYZ"));
% Transform and scale the input camera poses and XYZ points using the estimated gravity rotation and pose scale
[updatedCamPoses,updatedXYZPoints1,updatedXYZPoints2] = helperTransformToIMU(...
camPoses,xyzPoints1,xyzPoints2,gDir,scale); %two sets of points because we can't
velToNode = [velToNode;veloId'];
biasToNode = [biasToNode;biasId'];
initVelPrior = factorVelocity3Prior(velToNode(1));
initBiasPrior = factorIMUBiasPrior(biasToNode(1));
addFactor(fGraph,initVelPrior);
addFactor(fGraph,initBiasPrior);
for i=2:length(viewToNode)
imuMeasurements = helperExtractIMUMeasurements(uavData, keyFrameToFrame(i-1), keyFrameToFrame(i));
imuId = [viewToNode(i-1),velToNode(i-1),biasToNode(i-1), ...
viewToNode(i) ,velToNode(i) ,biasToNode(i)];
fIMU = factorIMU(imuId,imuMeasurements.gyro,imuMeasurements.accel,imuParams,SensorTransform=uavData.camToIMUTransform);
addFactor(fGraph,fIMU);
end
optimize(fGraph,fgso);
Tracking
To determine when to insert a new key frame, perform the tracking process for each frame using
these steps:
1 Extract ORB features for each new frame, and then match them (using matchFeatures) with features in the last key frame that have known corresponding 3-D map points.
2 Estimate the camera pose with the Perspective-n-Point algorithm using estworldpose.
3 Using the camera pose, project the map points observed by the last key frame into the current
frame and search for feature correspondences using matchFeaturesInRadius.
4 With the 3-D to 2-D correspondences in the current frame, refine the camera pose by performing a motion-only bundle adjustment using factorGraph. Construct a factor graph using the steps in the previous section, and then fix the 3-D point nodes before proceeding to the optimization step, as shown in the sketch after this list.
5 Project the local map points into the current frame to search for more feature correspondences
using matchFeaturesInRadius and refine the camera pose again using motion-only bundle
adjustment.
6 Determine if the current frame is a new key frame. If the current frame is a key frame, continue
to the Local Mapping process. Otherwise, start Tracking for the next frame.
If tracking is lost because too few feature points can be matched, try inserting new key frames more frequently.
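A minimal sketch of the motion-only refinement from step 4, reusing the fGraph and factorGraphSolverOptions variables of this example:
% Motion-only bundle adjustment: hold the 3-D landmark nodes fixed so that
% only the camera pose nodes are adjusted during optimization.
pointNodeIds = nodeIDs(fGraph,NodeType="POINT_XYZ");
fixNode(fGraph,pointNodeIds);               % freeze the landmark nodes
optimize(fGraph,factorGraphSolverOptions);
fixNode(fGraph,pointNodeIds,false);         % release them again afterward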
% Track the local map and check if the current frame is a key frame.
numSkipFrames = 20;
numPointsKeyFrame = 200;
[localKeyFrameIds, currPose, mapPointsIdx, featureIdx, isKeyFrame] = ...
helperTrackLocalMapVI(mapPointSet, vSetKeyFrames, mapPointsIdx, ...
featureIdx, currPose, currFeatures, currPoints, intrinsics, scaleFactor, numLevels, ...
isLastFrameKeyFrame, lastKeyFrameIdx, currFrameIdx, numSkipFrames, numPointsKeyFrame);
if ~isKeyFrame
isLastFrameKeyFrame = false;
currFrameIdx = currFrameIdx + 1;
continue
else
isLastFrameKeyFrame = true;
keyFrameToFrame = [keyFrameToFrame; currFrameIdx];
encodedImages = [encodedImages; encode(bofData.bag,currI,Verbose=false)];
keyTimeStamps = [keyTimeStamps; timeStamps.imageTimeStamps(currFrameIdx)];
loopCtr = loopCtr+1;
end
% Remove outlier map points that are observed in fewer than 3 key frames
[mapPointSet,pointToNode] = helperCullRecentMapPoints(mapPointSet, pointToNode, mapPointsIdx, newPointIdx);
Loop Closure
Loop candidates are recognized by comparing feature descriptors, which represent the encoded images using a pretrained bag of words. A detected loop candidate is validated when it is located near a previous pose and the distance between the feature descriptors associated with both poses is smaller than a set threshold.
When a valid loop candidate is found, use estgeotform3d to compute the relative pose between the
loop candidate frame and the current key frame. Then add the loop connection with the relative pose
to the factor graph.
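As a hedged sketch of that relative-pose computation, assume currXYZ and loopCandidateXYZ are the matched 3-D map point locations observed by the current key frame and by the loop candidate (the full example performs this step inside helperAddLoopConnectionsVI).
% Estimate the transformation that maps the current key frame map points onto
% the loop candidate map points. A similarity transformation also absorbs the
% residual scale drift of the monocular reconstruction.
[relPoseLoop,inlierIdx] = estgeotform3d(currXYZ,loopCandidateXYZ,"similarity");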
loopClosureCheck = 350; % Before this frame, loop closure does not occur.
% Check loop closure after some key frames have been created
if currKeyFrameId > loopClosureCheck && loopCtr>10
if ~isempty(potentialLoopCandidates)
[isDetected, validLoopCandidates] = helperCheckLoopClosureVI(encodedImages, potentialLoopCandidates);
if isDetected
% Add loop closure connections
[fGraph, isLoopClosed, mapPointSet, vSetKeyFrames] = helperAddLoopConnectionsVI(fGraph, ...
mapPointSet, vSetKeyFrames, validLoopCandidates, currKeyFrameId, ...
currFeatures, currPoints, loopEdgeNumMatches, intrinsics);
if isLoopClosed
loopCtr=0;
end
end
end
end
Local Mapping
Local mapping is performed for every key frame. When a new key frame is determined, add it and its corresponding IMU measurement to the factor graph, and update the mapPointSet and vSetKeyFrames objects. To ensure that mapPointSet contains as few outliers as possible, a valid map point must be observed in at least 3 key frames.
Create new map points by triangulating ORB feature points in the current key frame and its
connected key frames. For each unmatched feature point in the current key frame, search for a match
with other unmatched points in the connected key frames using matchFeatures. The local
optimization of the factor graph refines the pose of the current key frame, the poses of connected key
frames, and all the map points observed in these key frames.
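As a hedged sketch of the triangulation step for a single pair of connected key frames, assume matchedPoints1 and matchedPoints2 are newly matched, previously unmatched feature locations, pose1 and pose2 are the corresponding camera poses, and intrinsics is the camera intrinsics object.
% Build camera projection matrices from the intrinsics and the key frame poses.
camMatrix1 = cameraProjection(intrinsics,pose2extr(pose1));
camMatrix2 = cameraProjection(intrinsics,pose2extr(pose2));
% Triangulate candidate 3-D points and keep only well-conditioned ones.
[xyzNew,reprojErrors,validIdx] = triangulate(matchedPoints1,matchedPoints2, ...
    camMatrix1,camMatrix2);
isGood = validIdx & reprojErrors < 2;   % in front of both cameras, low error (threshold is illustrative)
xyzNew = xyzNew(isGood,:);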
allOptNodes=solInfo.OptimizedNodeIDs;
optPointNodes=setdiff(allOptNodes,viewToNode(refinedKeyFrameIds));
[~,mapPointIdx]=intersect(pointToNode,optPointNodes,'stable');
Global Optimization
Once all the frames have been processed, optimize the factor graph one last time, including all the key frames, 3-D points, and IMU measurements.
% Optimize the poses
vSetKeyFramesOptim = vSetKeyFrames;
fGraphOptim = fGraph;
optimize(fGraphOptim,fgso);
mapPointSet = helperUpdateGlobalMap(mapPointSet,vSetKeyFrames,vSetKeyFramesOptim);
% Update legend
showPlotLegend(mapPlot);
The downloaded data contains the gTruth array, which you can use to compare the obtained
trajectory with the ground truth to qualitatively evaluate the accuracy of the SLAM pipeline. You can
also calculate the root-mean-square-error (RMSE) and absolute-mean-error (AME) of trajectory
estimates.
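As a simple sketch of these metrics, assume estimatedXYZ and gtXYZ are N-by-3 matrices of time-aligned estimated and ground truth camera positions.
% Per-frame position errors between the estimated and ground truth trajectories.
posErrors = vecnorm(estimatedXYZ - gtXYZ,2,2);
rmseATE = sqrt(mean(posErrors.^2));   % root-mean-square error
ameATE = mean(abs(posErrors));        % absolute mean error
disp("Trajectory RMSE (m): " + rmseATE)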
In this example you learned how to utilize the factor graph to combine pose estimates from a camera
with IMU measurements to obtain more accurate results.
Supporting Functions
Other functions are included in separate files. For more information on these functions, please refer
to the “Monocular Visual Simultaneous Localization and Mapping” on page 1-122 example.
helperAddLoopConnectionsVI adds connections between the current keyframe and the valid loop
candidate.
helperCheckLoopClosureVI detects key frames that are loop candidates by retrieving visually
similar images from the database.
helperTrackLastKeyFrameVI estimates the current camera pose by tracking the last key frame.
helperTrackLocalMapVI refines the current camera pose by tracking the local map.
helperDetectAndExtractFeatures detects and extracts ORB features from the image.
helperUpdateGlobalMap updates the 3-D locations of map points after pose graph optimization.
helperAlignIMUCamera estimates the gravity rotation and pose scale to transform the input poses
to the local navigation reference frame of IMU using IMU measurements and factor graph
optimization.
helperTransformToIMU transforms and scales input poses and XYZ 3-D points according to
specified gravity direction and pose scale.
helperLocalFactorGraphOptimization optimizes the factor graph using a set of local key frames.
helperInitFactorGraph initializes the factor graph using the results obtained from SfM.
K = intrinsics.K;
camInfo = ((K(1,1)/1.5)^2)*eye(2);
refinedKeyFrameIds = vSetKeyFrames.Views.ViewId';
graphIds = [];
graphMes = [];
graphPts = [];
fGraph = factorGraph;
mapPointIdx = [];
pointIds = pointToNode(pointIndices);
poseIds = ones(length(pointIds),1)*viewToNode(viewId);
graphIds = [graphIds; [poseIds pointIds]];
ptsInView = vSetKeyFrames.Views.Points{viewId,1}.Location;
graphMes = [graphMes; ptsInView(featureIndices,:)];
optimize(fGraph,factorGraphSolverOptions);
fixNode(fGraph,ptsIDS(idx),false);
% Extract the results of the factor graph optimization and re-package them
poseIDs = nodeIDs(fGraph,NodeType="POSE_SE3");
fgposopt = nodeState(fGraph,poseIDs);
initPose = rigidtform3d(quat2rotm(fgposopt(1,4:7)),fgposopt(1,1:3));
refinedPose = rigidtform3d(quat2rotm(fgposopt(2,4:7)),fgposopt(2,1:3));
ViewId=[1;2];
AbsolutePose=[initPose;refinedPose];
refinedAbsPoses=table(ViewId,AbsolutePose);
ptsIDS = nodeIDs(fGraph,NodeType="POINT_XYZ");
refinedPoints = nodeState(fGraph,ptsIDS);
end
N = length(mapPointsIndices);
K = intrinsics.K;
camInfo = ((K(1,1)/1.5)^2)*eye(2);
viewId = vSetKeyFrames.Views.ViewId(end)+1;
viewsAbsPoses = vSetKeyFrames.Views.AbsolutePose;
for i = 1:numel(keyFramesIndices)
localKeyFrameId = keyFramesIndices(i);
[index3d, index2d] = findWorldPointsInView(mapPoints, localKeyFrameId);
[~, index1, index2] = intersect(index3d, mapPointsIndices, 'stable');
prePose = viewsAbsPoses(localKeyFrameId);
relPose = rigidtform3d(prePose.R' * cameraPose.R, ...
(cameraPose.Translation-prePose.Translation)*prePose.R);
if numel(index1) > 5
vSetKeyFrames = addConnection(vSetKeyFrames, localKeyFrameId, viewId, relPose, ...
Matches=[index2d(index1),featureIndices(index2)]);
ptsInView = vSetKeyFrames.Views.Points{viewId,1}.Location;
mesPts = ptsInView(featureIndices,:);
% If the camera poses have been scaled properly, then add the IMU
% factor to the factor graph
if initIMU
velId = generateNodeID(fGraph,1);
velToNode=[velToNode;velId];
biasId = generateNodeID(fGraph,1);
biasToNode=[biasToNode;biasId];
References
[1] Dellaert, Frank, and Michael Kaess. Factor Graphs for Robot Perception. Now Publishers, 2017.
[2] Campos, Carlos, et al. “ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual–Inertial,
and Multimap SLAM.” IEEE Transactions on Robotics, vol. 37, no. 6, Dec. 2021, pp. 1874–90.
DOI.org (Crossref), https://doi.org/10.1109/TRO.2021.3075644.
Code Generation for Monocular Visual Simultaneous Localization and Mapping
This example shows how to use MATLAB® Coder™ to generate C/C++ code for the visual simultaneous localization and mapping algorithm from the “Monocular Visual Simultaneous Localization and Mapping” on page 1-122 example.
Visual simultaneous localization and mapping (vSLAM) is the process of calculating the position and
the orientation of a camera, with respect to its surroundings, while simultaneously mapping the
environment.
This example shows how to process image data from a monocular camera to build a map of an indoor
environment and estimate the trajectory of the camera. The steps in this process are:
1 Package the visual SLAM algorithm from the “Monocular Visual Simultaneous Localization and
Mapping” on page 1-122 example into an entry-point function, helperMonoVisualSLAM.
2 Modify the helperMonoVisualSLAM function to support code generation.
3 Generate C/C++ code, and verify the results.
You can also integrate the generated code into external software for further testing.
Download Data
This example uses data from the TUM RGB-D benchmark [1] on page 1-24. The size of the data set is
1.38 GB. You can download the data set to a temporary folder using this code.
baseDownloadURL = "https://vision.in.tum.de/rgbd/dataset/freiburg3/rgbd_dataset_freiburg3_long_office_household.tgz";
dataFolder = fullfile(tempdir,"tum_rgbd_dataset",filesep);
options = weboptions(Timeout=Inf);
tgzFileName = dataFolder + "fr3_office.tgz";
folderExists = exist(dataFolder,"dir");
Entry-Point Function
To meet the requirements of MATLAB Coder, restructure the code from the “Monocular Visual
Simultaneous Localization and Mapping” on page 1-122 example into the entry-point function
helperMonoVisualSLAM. This function takes a cell array of images as an input and outputs 3-D
worldpointset, estimated camera poses, and frame indices.
The entry-point function performs these operations:
1 Initializes a map of 3-D points from the first two video frames, and then computes the 3-D points and relative camera pose using triangulation based on 2-D ORB feature correspondences.
2 For each new frame, estimates the camera pose by matching features in the current frame to
features in the previous key frame. The function refines the estimated camera pose by tracking
the local map.
3 If the function identifies the new frame as a key frame, the function uses the new frame to create
new 3-D points. In this step, the function uses bundle adjustment to minimize reprojection errors
in the estimated camera poses and 3-D points.
4 Detects loops in each key frame by comparing it with all previous key frames using the bag-of-
features approach. Once the function detects a loop closure, it optimizes the pose graph by
refining the camera poses of all the key frames.
As code generation does not support the imageDatastore object, read the images, convert them to
grayscale, and store them in a cell array.
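A sketch of that preprocessing step, assuming the image files extracted from the TUM sequence are in a folder named imageFolder:
% Read the RGB frames, convert them to grayscale, and collect them in a cell
% array, because code generation does not support imageDatastore.
imds = imageDatastore(imageFolder);
numImages = numel(imds.Files);
Images = cell(1,numImages);
for k = 1:numImages
    Images{k} = im2gray(readimage(imds,k));
end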
Use the codegen (MATLAB Coder) command to compile the helperVisualSLAMCodegen function into a MEX file. You can specify the -report option to
generate a compilation report that shows the original MATLAB code and the associated files created
during code generation. You can also create a temporary directory where MATLAB Coder can store
the generated files. Note that, by default, the generated MEX file has the same name as the original
MATLAB function with "_mex" appended as a suffix: helperVisualSLAMCodegen_mex.
Alternatively, you can use the -o option to specify the name of the MEX file.
For code generation, you must pass Images as an input to the helperVisualSLAMCodegen
function.
cpuConfig = coder.config("mex");
cpuConfig.TargetLang = "C++";
codegen -config cpuConfig helperVisualSLAMCodegen -args {Images}
Use the helperVisualSLAMCodegen_mex function to find the estimated and optimized camera poses based on the Images cell array.
monoSlamOut = helperVisualSLAMCodegen_mex(Images);
Plot the estimated trajectory and actual trajectory of the camera by specifying monoSlamOut as an
input argument to the helperVisualizeMonoSlam helper function.
% Clear up
clear helperMonoVisualSLAMCodegen_mex
Reference
[1] Sturm, Jürgen, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. “A
Benchmark for the Evaluation of RGB-D SLAM Systems.” In 2012 IEEE/RSJ International Conference
on Intelligent Robots and Systems, 573–80, 2012. https://doi.org/10.1109/IROS.2012.6385773.
Build and Deploy Visual SLAM Algorithm with ROS in MATLAB
In this example, you implement a visual simultaneous localization and mapping (SLAM) algorithm to
estimate the camera poses for the TUM RGB-D Benchmark [1] on page 1-29 dataset. You then
generate C++ code for the visual SLAM algorithm and deploy it as a ROS node to a remote device
using MATLAB®.
The remote device to which you want to deploy the code must have the following dependencies
installed:
• OpenCV 4.5.0 — For more information about downloading the OpenCV source and building it on
your remote device, see OpenCV linux installation.
• g2o library — Download the g2o source and build it on your remote device.
• Eigen3 library — Install eigen3 library using the command $ sudo apt install libeigen3-dev.
For this example, download a virtual machine (VM) using the instructions in the ROS Toolbox documentation, and then follow these steps.
masterIP = '192.168.192.135';
rosinit(masterIP,11311)
This example uses the TUM RGB-D Benchmark [1] on page 1-29 data set. Download the data set as a ROS bag file on the remote device.
$ wget https://cvg.cit.tum.de/rgbd/dataset/freiburg3/rgbd_dataset_freiburg3_long_office_household
This example uses the monovslam object to implement visual SLAM. For each new frame added using
its addFrame object function, the monovslam object extracts and tracks features to estimate camera
poses, identify key frames and compute the 3-D map points in the world frame. The monovslam
object also searches for loop closures using the bag-of-features algorithm and optimizes camera poses
using pose graph optimization, once a loop closure is detected. For more information on visual SLAM
pipeline, see “Monocular Visual Simultaneous Localization and Mapping” on page 1-122.
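Outside of the ROS node, the core loop around the monovslam object looks roughly like this sketch, which assumes intrinsics is the camera intrinsics and Images is a cell array of frames:
% Create the visual SLAM object and feed frames to it one at a time.
vslam = monovslam(intrinsics);
for k = 1:numel(Images)
    addFrame(vslam,Images{k});
    if hasNewKeyFrame(vslam)
        % Query the latest map points and key frame camera poses.
        worldPoints = mapPoints(vslam);
        camPoses = poses(vslam);
    end
end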
The helperROSVisualSLAM function implements the visual SLAM algorithm using these steps:
type("helperROSVisualSLAM.m")
function helperROSVisualSLAM()
% The helperROSVisualSLAM function implements the visual SLAM pipeline for
% deployment as a ROS node.
% Copyright 2023 The MathWorks, Inc.
while 1
if hasNewKeyFrame(vslam)
msg = rosmessage('std_msgs/Float64MultiArray', 'DataFormat', 'struct');
% Get map points and camera poses
worldPoints = mapPoints(vslam);
[camPoses] = poses(vslam);
msg.Layout.Dim(end).Size = uint32(poseSize);
Use MATLAB Coder™ to generate a ROS node for the visual SLAM algorithm defined by the helperROSVisualSLAM function. You can then deploy this node on the remote virtual machine.
Create a MATLAB Coder configuration object that uses "Robot Operating System (ROS)"
hardware. Before remote deployment, set these configuration parameters for the Linux virtual
machine. Note that, if you are deploying to a different remote machine, you must change these to the
appropriate parameters for your device.
Note: By default, the "Build and Load" build action deploys the node to the device, but does not
automatically run it. If you want the node to run immediately after code generation, use the "Build
and Run" build action, instead.
cfg = coder.config('exe');
cfg.Hardware = coder.hardware('Robot Operating System (ROS)');
cfg.Hardware.BuildAction = 'Build and load';
cfg.Hardware.CatkinWorkspace = '~/catkin_ws';
cfg.Hardware.RemoteDeviceAddress = '192.168.192.135';
cfg.Hardware.RemoteDeviceUsername = 'user';
cfg.Hardware.RemoteDevicePassword = 'password';
cfg.Hardware.DeployTo = 'Remote Device';
codegen helperROSVisualSLAM -args {} -config cfg -std:c++11
Configure Visualization
Use the helperVisualSLAMViewer object to create a viewer that visualizes map points along with
the camera trajectory and the current camera pose.
viewer = helperVisualSLAMViewer(zeros(0,3),rigidtform3d(eye(4)));
Create a ROS subscriber to visualize map points and camera poses published by the deployed visual
SLAM node. Assign helperVisualizePointsAndPoses function as a callback to be triggered
whenever the subscriber receives a message from the deployed node.
visualizeSub = rossubscriber('/visualizePoints', 'std_msgs/Float64MultiArray', @(varargin)helperVisualizePointsAndPoses(varargin{:},viewer));
On the Ubuntu desktop of the virtual machine, click the ROS Noetic Terminal icon. Source the
catkin workspace.
$ source ~/catkin_ws/devel/setup.bash
To help the deployed node access library dependencies, append /usr/local/lib path to the
environment variable, LD_LIBRARY_PATH.
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
Navigate to the source directory of the deployed helperrosvisualslam node on the remote device.
You must run the node from this directory because the bag of features file used by the deployed node
is present in this directory.
$ cd ~/catkin_ws/src/helperrosvisualslam/src/
Start playing the ROS bag file in a separate ROS Noetic Terminal.
Disconnect
Disconnect from ROS Network after the nodes have finished execution.
rosshutdown
References
[1] Sturm, Jürgen, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. "A
benchmark for the evaluation of RGB-D SLAM systems". In Proceedings of IEEE/RSJ International
Conference on Intelligent Robots and Systems, pp. 573-580, 2012
Helper Functions
Monocular Visual-Inertial Odometry Using Factor Graph
Monocular visual-inertial odometry estimates the position and orientation of the robot using camera and inertial measurement unit (IMU) sensor data. Camera-based state estimation is accurate during low-speed navigation. However, camera-based estimation faces challenges such as motion blur and track loss at higher speeds. Also, a monocular camera can estimate poses only at an arbitrary scale. In contrast, inertial navigation handles high-speed navigation easily and estimates poses at world scale. You can combine the advantages of both types of sensor data to achieve better accuracy using tightly coupled factor graph optimization. For good execution-time performance, only a small portion of the entire factor graph, containing only the most recent measurements, is optimized at every optimization step. This variant of factor graph optimization is popularly referred to as sliding window or partial graph optimization.
Overview
The visual-inertial system implemented in this example consists of a simplified version of the
monocular visual odometry front-end of the VINS [1 on page 1-53] algorithm and a factor graph
back-end.
The visual odometry front-end has responsibilities similar to standard structure from motion (SfM)
algorithms, such as oriented FAST and rotated BRIEF (ORB) and simultaneous localization and
mapping (SLAM). The visual odometry front-end detects and tracks key points across multiple
frames, estimates camera poses, and triangulates 3-D points using multi-view geometry. The factor
graph back-end jointly optimizes the estimated camera poses, 3-D points, IMU velocity, and bias.
Before fusing the camera and IMU measurements, you must align the camera and IMU to compute
the camera pose scale, gravity rotation, and initial IMU velocity and bias.
Set Up
This example uses the Blackbird data set (NYC Subway Winter) to demonstrate the visual-inertial
odometry workflow. Download the data set.
data = helperDownloadData();
rng(0)
• Sliding window size (windowSize) - Maximum number of recent frames or pose nodes to optimize in the factor graph. Usually, state estimation happens incrementally, meaning that the system sequentially processes the frame data, such as camera images and IMU measurements, at each time step to estimate the robot pose or state at that particular time step. The factor graph optimization can use all the sensor measurements and estimates up to the current frame to refine the state estimates. Using all the frame data in the factor graph at every optimization step produces more accurate solutions, but can be very computationally expensive. To improve execution-time performance, you can consider only a few recent frame measurements for optimization. This type of optimization is referred to as sliding window optimization or partial graph optimization.
• Frame rate at which to run the factor graph optimization (optimizationFrequency) - After processing a fixed number of frames, specified by optimizationFrequency, the factor graph optimization refines the pose estimates. Calling graph optimization after every frame produces more accurate estimates but increases execution time.
params = helperVIOParameters();
% Set to true if IMU data is available.
useIMU = true;
Initialize variables.
status = struct("firstFrame",true,"isMapInitialized",false,"isIMUAligned",false,"Mediandepth",false);
% Set to true to attempt camera-IMU alignment.
readyToAlignCameraAndIMU = false;
% Set initial scene median depth.
initialSceneMedianDepth = 4;
viewId = 0;
removedFrameIds = [];
allCameraTracks = cell(1,5000);
% Enable visualization.
vis = true;
showMatches = false;
if vis
% Figure variables.
axMatches = [];
axTraj = [];
axMap = [];
end
Set up factor graph for back-end tightly coupled factor graph optimization.
Create the point tracker to track key points across multiple frames.
fManager = helperFeaturePointManager(data.intrinsics,params,maxFrames,maxLandmarks);
% Set up the key point detector.
fManager.DetectorFunc = @(I)helperDetectKeyPoints(I);
vSet = imageviewset;
Specify the first and last frames to process from the data set. Then, process the first frame.
% In the first frame, detect new key points and initialize the tracker for
% future tracking.
status.firstFrame = false;
I = data.images{startFrameIdx};
if params.Equalize
% Enhance contrast if images are dark.
I = adapthisteq(I,NumTiles=params.NumTiles,ClipLimit=params.ClipLimit);
end
if params.Undistort
% Undistort if images contain perspective distortion.
I = undistortImage(I,data.intrinsics);
end
% Assign a unique view ID for each processed camera frame or image.
viewId = viewId + 1;
currPoints = createNewFeaturePoints(fManager,I);
updateSlidingWindow(fManager,I,currPoints,true(size(currPoints,1),1),viewId);
initialize(tracker, currPoints, I);
prevI = I;
firstI = I;
vSet = addView(vSet,viewId,rigidtform3d);
Image Preprocessing
• Equalize — Enhance the contrast of an image to correct for dim lighting, which can affect feature
extraction and tracking.
• Undistort — Correct for radial and tangential distortions that can impact state estimation.
Feature Tracking
To compute a camera frame pose, you must calculate 2D-2D correspondences (2-D image point tracks
across multiple frames). There are several ways to estimate 2-D feature points that see the same
landmark (key point tracks), but this example uses a Kalman tracker for tracking feature points in
multiple images.
Tracks are not all accurate and can contain outliers. Tracking performance also depends on the
Kalman tracker parameters, such as bidirectional error. Even in an ideal case, you can expect some
invalid tracks, such as those due to repetitive structures. As such, outlier rejection is a critical task in
feature tracking. To reject outliers from tracks, use fundamental matrix decomposition in the feature
point manager while updating the sliding window with the latest feature point tracks.
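A hedged sketch of the tracking loop, assuming the point tracker is a vision.PointTracker object, as the initialize and setPoints calls in this example suggest (the bidirectional error threshold is illustrative):
% Create and initialize the tracker with key points from the first frame.
tracker = vision.PointTracker(MaxBidirectionalError=1);
initialize(tracker,currPoints,firstImage);
% Track the same points into the next frame; validity flags mark lost points.
[trackedPoints,validity] = tracker(nextImage);
trackedPoints = trackedPoints(validity,:);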
Visualize the feature point tracks between the last key frame and current frame.
if status.isMapInitialized
svIds = getSlidingWindowIds(fManager);
if length(svIds) > 2
[matches1,matches2] = get2DCorrespondensesBetweenViews(fManager,svIds(end-2),viewId);
end
end
The accelerometer and gyroscope readings of the IMU data contain some bias and noise. To estimate
bias values, you must obtain accurate pose estimates between the first few frames. You can achieve
this by using SfM. SfM involves these major steps:
• When there is enough parallax between the first key frame and the current frame, estimate the
relative pose between the two, using 2D-2D correspondences (key point tracks across multiple
frames).
• Triangulate 3-D points using the world poses of key frames and 2D-2D correspondences.
• Track the 3-D points in the current frame, and compute the current frame pose using 3D-2D
correspondences.
if ~status.isMapInitialized
if windowState.FirstFewViews
% Accept the first few camera views.
vSet = addView(vSet,viewId,rigidtform3d);
elseif windowState.EnoughParallax
% Estimate relative pose between the first key frame in the
% window and the current frame.
svIds = getSlidingWindowIds(fManager);
[matches1,matches2] = get2DCorrespondensesBetweenViews(fManager,svIds(end-1),svIds(end));
valRel = false(size(matches1,1),1);
for k = 1:10
[F1,valRel1] = estimateFundamentalMatrix( ...
matches1,matches2,Method="RANSAC", ...
NumTrials=params.F_Iterations,DistanceThreshold=params.F_Threshold, ...
Confidence=params.F_Confidence);
if length(find(valRel)) < length(find(valRel1))
valRel = valRel1;
F = F1;
end
end
inlierPrePoints = matches1(valRel,:);
inlierCurrPoints = matches2(valRel,:);
relPose = estrelpose(F,data.intrinsics, ...
inlierPrePoints,inlierCurrPoints);
%vSet = addView(vSet,svIds(end-1),currPose);
vSet = addView(vSet,viewId,currPose);
status.isMapInitialized = true;
if vis
axisSFM = axes(figure); %#ok
showMatchedFeatures(firstI,I,matches1,matches2, ...
Parent=axisSFM);
title(axisSFM,"Enough Parallax Between Key Frames");
end
end
else
Camera-IMU Alignment
To optimize camera and IMU measurements, you must align them by bringing them to the same base
coordinate frame and scale. Alignment primarily consists of these major tasks:
• Compute the camera pose scale to make it similar to the IMU or world scale.
• Calculate the gravity rotation required to rotate gravity vector from local navigation reference
frame of IMU to initial camera reference frame. The inverse of this rotation aligns the z-axis of the
camera with the local navigation reference frame.
• Estimate the initial IMU bias.
if ~status.isIMUAligned && readyToAlignCameraAndIMU
svIds = getSlidingWindowIds(fManager);
% Because you have not yet computed the latest frame pose,
% use only the past few frames for alignment.
svIds = svIds(1:end-1);
If the alignment is successful, update the camera poses, 3-D points, and add IMU factors between the
initial frames in the current sliding window.
if status.isIMUAligned
% After alignment, add IMU factors to factor graph.
for k = 1:length(gyro)
nId = [ids.pose(svIds(k)),ids.vel(svIds(k)),ids.bias(svIds(k)), ...
ids.pose(svIds(k+1)),ids.vel(svIds(k+1)),ids.bias(svIds(k+1))];
fIMU = factorIMU(nId,gyro{k},accel{k},imuParams,SensorTransform=data.camToIMUTransform);
slidingWindowFactorGraph.addFactor(fIMU);
end
end
Estimate an initial guess for IMU bias by using factor graph optimization with the camera projection
and IMU factors.
fixNode(slidingWindowFactorGraph,ids.pose(svIds(1)));
soll = optimize(slidingWindowFactorGraph, ...
params.SolverOpts);
fixNode(slidingWindowFactorGraph,ids.pose(svIds(1)),false);
When IMU data is available, you can predict the world pose of the camera by integrating
accelerometer and gyroscope readings. Use factor graph optimization to further refine this
prediction.
imuGuess = false;
if status.isIMUAligned
% Extract gyro and accel reading between current image frame
% and last acquired image frame to create IMU factor.
svIds = getSlidingWindowIds(fManager);
svs = svIds((end-1):end);
[gyro,accel] = helperExtractIMUDataBetweenViews(data.gyroReadings, ...
data.accelReadings,data.timeStamps,allFrameIds(svs));
nodeID = [ids.pose(svs(1)) ...
ids.vel(svs(1)) ...
ids.bias(svs(1)) ...
ids.pose(svs(2)) ...
ids.vel(svs(2)) ...
ids.bias(svs(2))];
% Create the transformation required to transform a camera pose
% to the IMU base frame for the IMU residual computation.
fIMU = factorIMU(nodeID,gyro{1},accel{1},imuParams, ...
SensorTransform=data.camToIMUTransform);
[currPoints,pointIds,isTriangulated] = getKeyPointsInView(fManager,viewId);
cVal = true(size(currPoints,1),1);
cTrf = find(isTriangulated);
If no IMU prediction is available, then use 3D-2D correspondences to estimate the current view pose.
if ~imuGuess
x3D = getXYZPoints(fManager,pointIds(isTriangulated));
c2D = currPoints(isTriangulated,:);
ii = false(size(x3D,1),1);
currPose = rigidtform3d;
for k = 1:params.F_loop
[currPosel,iil] = estworldpose( ...
currPoints(isTriangulated,:),x3D, ...
data.intrinsics,MaxReprojectionError=params.F_Threshold,Confidence=params.F_Confidence, ...
MaxNumTrials=params.F_Iterations);
if length(find(ii)) < length(find(iil))
ii = iil;
currPose = currPosel;
end
end
cVal(cTrf(~ii)) = false;
else
Use the IMU predicted pose as an initial guess for motion-only bundle adjustment.
x3D = getXYZPoints(fManager,pointIds(isTriangulated));
c2D = currPoints(isTriangulated,:);
[currPose,velRefined,biasRefined,ii] = helperBundleAdjustmentMotion( ...
x3D,c2D,data.intrinsics,size(I),pp,pv,prevP,prevVel,prevBias,fIMU);
slidingWindowFactorGraph.nodeState( ...
ids.vel(viewId),velRefined);
slidingWindowFactorGraph.nodeState( ...
ids.bias(viewId),biasRefined);
cVal(cTrf(~ii)) = false;
end
setKeyPointValidityInView(fManager,viewId,cVal);
vSet = addView(vSet,viewId,currPose);
Add camera projection factors related to the 3-D point tracks of the current view.
obs2 = pointIds(isTriangulated);
obs2 = obs2(ii);
fCam = factorCameraSE3AndPointXYZ( ...
[ids.pose(viewId*ones(size(obs2))) ids.point3(obs2)], ...
data.intrinsics.K,Measurement=c2D(ii,:), ...
Information=cameraInformation);
allCameraTracks{viewId} = [viewId*ones(size(obs2)) obs2 fCam.Measurement];
slidingWindowFactorGraph.addFactor(fCam);
end
When using the latest 2D-2D correspondences for camera-world pose estimation, you must frequently
create new 3-D points.
if status.isMapInitialized
[newXYZ,newXYZID,newPointViews,newPointObs] = triangulateNew3DPoints(fManager,vSet);
Factor graph optimization reduces the error in trajectory or camera pose estimation. Various factors,
like inaccurate tracking and outliers, can contribute to estimation errors.
Graph optimization adjusts camera pose node estimates to satisfy various sensor measurement constraints, such as camera observations (the projection of a 3-D point onto the image frame, which generates a 2-D image point observation), IMU relative poses, and relative velocity changes. You can categorize the optimization based on the type of factors used. The two important categories are the following.
• Visual-inertial optimization — Along with camera measurements, add IMU measurements, like
gyroscope and accelerometer readings, to the graph by using factorIMU (Navigation Toolbox).
The visual-inertial factor graph system consists of the following node types connected using different
factors:
• Camera pose nodes at different timestamps - These are connected to other nodes using
camera projection and IMU factors. The camera pose node estimates are computed using SfM.
• 3-D point landmark nodes - These are connected to camera pose nodes using camera projection
factors. The landmark node estimates are computed using SfM.
• IMU velocity nodes at different timestamps - These are connected to other nodes using IMU
factors. The velocity node estimates are computed using IMU preintegration.
• IMU bias nodes at different timestamps - These are connected to other nodes using IMU factors. The bias node estimates are usually a result of the factor graph optimization; they are unknown before the optimization.
Each timestep contains a pose node, a velocity node, and a bias node. Each of these nodes in one time
step is connected to the pose, velocity and bias nodes of another time step using a factorIMU object.
When the UAV observes feature points using the camera, a factorCameraSE3AndPointXYZ object
relates the observed feature points to the pose node of the observed time step. This process repeats
for each new time step of the UAV flight.
Update the sliding window with the latest 3-D points and camera view pose.
newPointsTriangulated = false;
if ~isempty(newXYZ)
newPointsTriangulated = true;
% Store all new 3-D to 2-D correspondences.
for pIte = 1:length(newPointViews)
allCameraTracks{newPointViews(pIte)} = [allCameraTracks{newPointViews(pIte)}; newPointObs{pIte}];
end
obs = vertcat(newPointObs{:});
% Create camera projection factors with the latest
% 3-D point landmark observations in the current image.
fCam = factorCameraSE3AndPointXYZ([ids.pose(obs(:,1)) ...
ids.point3(obs(:,2))],data.intrinsics.K, ...
Measurement=obs(:,3:4), ...
Information=cameraInformation);
addFactor(slidingWindowFactorGraph,fCam);
end
Refine the estimated camera frame poses and 3-D points using factor graph optimization. Because the optimization is time consuming, it does not run after estimating the pose of every frame. You can control the frame frequency at which the optimization runs by using a parameter.
if helperDecideToRunGraphOptimization(curIdx,newPointsTriangulated,params)
% Use partial factor graph optimization with only the latest key
% frames, for performance.
nodeIdsToOptimize = ids.pose(slidingWindowViewIds);
xyzIds = getPointIdsInViews(fManager,slidingWindowViewIds);
Update the feature manager and view set with your optimization results.
if ~status.Mediandepth
status.Mediandepth = true;
xyz = slidingWindowFactorGraph.nodeState( ...
ids.point3(xyzIds));
medianDepth = median(vecnorm(xyz.'));
[posesUpdated,xyz] = helperTransformToNavigationFrame(helperUpdateCameraPoseTable(poses(vSet,slidingWindowViewIds), ...
slidingWindowFactorGraph.nodeState(ids.pose(slidingWindowViewIds))), ...
xyz,rigidtform3d,initialSceneMedianDepth/medianDepth);
% Set current camera pose node state guess.
slidingWindowFactorGraph.nodeState(ids.pose(slidingWindowViewIds), ...
helperCameraPoseTableToSE3Vector(posesUpdated));
% Add guess for newly triangulated 3-D points node states.
slidingWindowFactorGraph.nodeState( ...
ids.point3(xyzIds),xyz);
else
posesUpdated = helperUpdateCameraPoseTable(poses(vSet,slidingWindowViewIds), ...
slidingWindowFactorGraph.nodeState( ...
ids.pose(slidingWindowViewIds)));
end
Add new feature points to the tracker when the number of tracked points falls below the feature tracking threshold.
createNewFeaturePoints(fManager,I);
currPoints = getKeyPointsInView(fManager,viewId);
setPoints(tracker,currPoints);
prevPrevI = prevI;
prevI = I;
Plot all key frame camera poses and 3-D points. Observe the landmarks on features such as the
ceiling, floor, and pillars.
helperPlotCameraPosesAndLandmarks(axMap,fManager,vSet,removedFrameIds);
• Absolute trajectory error (ATE) - Root mean squared error (RMSE) between the computed camera locations and the ground truth camera locations.
• Scale error - Percentage difference between the computed median scale and the original scale.
addedFrameIds = allFrameIds(vSet.Views.ViewId);
axf = axes(figure);
helperPlotAgainstGroundTruth(vSet,data.gTruth,data.camToIMUTransform, ...
addedFrameIds,axf,removedFrameIds);
Evaluate the tracking accuracy based on the root mean square error (RMSE) and the median scale error.
helperComputeErrorAgainstGroundTruth(data.gTruth,vSet,allFrameIds,removedFrameIds,data.camToIMUTr
Supporting Functions
This section details the short helper functions included in this example. Larger helper functions are
included in separate files.
helperVisualizeTrajectory updates the trajectory plot with the latest data stored in the view set
and the feature manager.
helperPlotAgainstGroundTruth plots the estimated trajectory and the ground truth trajectory for
visual comparison.
helperGenerateNodeID generates unique factor graph node IDs for a fixed number of camera view
poses, IMU velocities, IMU biases, and 3-D point nodes.
helperUpdateCameraPoseTable updates the pose table with the latest estimated N-by-7 SE(3) poses.
function cameraPoseTableUpdated = helperUpdateCameraPoseTable(cameraPoseTable,cameraPoses)
cameraPoseTableUpdated = cameraPoseTable;
R = quat2rotm(cameraPoses(:,4:7));
for k = 1:size(cameraPoses,1)
cameraPoseTableUpdated.AbsolutePose(k).Translation = cameraPoses(k,1:3);
cameraPoseTableUpdated.AbsolutePose(k).R = R(:,:,k);
end
end
% If the current frame belongs to the initial set of frames, then run graph
% optimization every frame, because the initial SfM is still running.
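The comment above belongs to the frame-gating logic of helperDecideToRunGraphOptimization. A minimal sketch of such gating logic is shown below; the function name and the parameter fields numInitialFrames and optimizationFrequency are illustrative assumptions, not the example's actual implementation.
function tf = exampleDecideToRunGraphOptimization(curIdx,newPointsTriangulated,params)
% Hedged sketch of frame-gated graph optimization. The params field names
% are assumptions for illustration.
if curIdx <= params.numInitialFrames
    % Initial SfM stage: optimize every frame.
    tf = true;
else
    % Afterwards, optimize only every few frames, or when new 3-D points
    % were triangulated.
    tf = newPointsTriangulated || mod(curIdx,params.optimizationFrequency) == 0;
end
end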
helperTransformToNavigationFrame transforms and scales the input poses and XYZ 3-D points to the
local navigation reference frame of the IMU using the gravity rotation and pose scale.
function [posesUpdated,xyzUpdated] = helperTransformToNavigationFrame(poses,xyz,gRot,poseScale)
posesUpdated = poses;
% Input gravity rotation transforms the gravity vector from local
% navigation reference frame to initial camera pose reference frame.
% The inverse of this transforms the poses from camera reference frame
% to local navigation reference frame.
Ai = gRot.A';
for k = 1:length(poses.AbsolutePose)
T = Ai*poses.AbsolutePose(k).A;
T(1:3,4) = poseScale*T(1:3,4);
posesUpdated.AbsolutePose(k) = rigidtform3d(T);
end
% Transform points from initial camera pose reference frame to
% local navigation reference frame of IMU.
xyzUpdated = poseScale*gRot.transformPointsInverse(xyz);
end
len = length(frameIds);
gyro = cell(1,len-1);
accel = cell(1,len-1);
for k = 2:len
% Assumes the IMU data is time-synchronized with the camera data. Compute
% indices of accelerometer readings between consecutive view IDs.
[~,ind1] = min(abs(timeStamps.imuTimeStamps - timeStamps.imageTimeStamps(frameIds(k-1))));
[~,ind2] = min(abs(timeStamps.imuTimeStamps - timeStamps.imageTimeStamps(frameIds(k))));
imuIndBetweenFrames = ind1:(ind2-1);
% Extract the data at the computed indices and store in a cell.
gyro{k-1} = gyroReadings(imuIndBetweenFrames,:);
accel{k-1} = accelReadings(imuIndBetweenFrames,:);
end
end
function helperPlotCameraPosesAndLandmarks(axisHandle,fManager,vSet,removedFrameIds,plotCams)
% helperPlotCameraPosesAndLandmarks plots the key frame camera poses and
% triangulated 3-D point landmarks.
if nargin < 5
pcshow(xyzPoints(indToPlot,:),Parent=axisHandle,Projection="orthographic");
hold(axisHandle,"on")
if plotCams
c = table(camPoses.AbsolutePose,VariableNames={'AbsolutePose'});
plotCamera(c,Parent=axisHandle,Size=0.25);
title(axisHandle,"Initial Structure from Motion")
else
traj = vertcat(camPoses.AbsolutePose.Translation);
plot3(traj(:,1),traj(:,2),traj(:,3),"r-",Parent=axisHandle);
view(axisHandle,27.28,-2.81)
title(axisHandle,"Estimated Trajectory and Landmarks")
end
hold off
drawnow
end
helperDownloadData downloads the data set from the specified URL to the specified output folder.
vioDataTarFile = matlab.internal.examples.downloadSupportFile(...
'shared_nav_vision/data','BlackbirdVIOData.tar');
vioData = load(fullfile(outputFolder,"BlackbirdVIOData","data.mat"));
end
References
[1] Qin, Tong, Peiliang Li, and Shaojie Shen. “VINS-Mono: A Robust and Versatile Monocular Visual-
Inertial State Estimator.” IEEE Transactions on Robotics 34, no. 4 (August 2018): 1004–20.
https://doi.org/10.1109/TRO.2018.2853729
[2] Antonini, Amado, Winter Guerra, Varun Murali, Thomas Sayre-McCord, and Sertac Karaman. “The
Blackbird Dataset: A Large-Scale Dataset for UAV Perception in Aggressive Flight.” In Proceedings of
the 2018 International Symposium on Experimental Robotics, edited by Jing Xiao, Torsten Kröger, and
Oussama Khatib, 11:130–39. Cham: Springer International Publishing, 2020.
https://doi.org/10.1007/978-3-030-33950-0_12
Visual SLAM with an RGB-D Camera
Visual simultaneous localization and mapping (vSLAM), refers to the process of calculating the
position and orientation of a camera with respect to its surroundings, while simultaneously mapping
the environment.
You can perform vSLAM using a monocular camera. However, the depth cannot be accurately
calculated, and the scale of the estimated trajectory is unknown and drifts over time. In addition,
because the initial map cannot be triangulated from the first frame, you must use multiple views of a
monocular camera to produce it. A better, more reliable solution is to use an RGB-D camera, which
provides a pair of images consisting of one RGB color image and one depth image.
This example shows how to process RGB-D image data to build a map of an indoor environment and
estimate the trajectory of the camera. The example uses a version of the ORB-SLAM2 [1] algorithm,
which is feature-based and supports RGB-D cameras.
The pipeline for RGB-D vSLAM is very similar to the monocular vSLAM pipeline in the “Monocular
Visual Simultaneous Localization and Mapping” on page 1-122 example. The major difference is that
in the Map Initialization stage, the 3-D map points are created from a pair of images consisting of
one color image and one depth image instead of two frames of color images.
• Map Initialization: The initial 3-D world points can be constructed by extracting ORB feature
points from the color image and then computing their 3-D world locations from the depth image.
The color image is stored as the first key frame.
• Tracking: Once a map is initialized, the pose of the camera is estimated for each new RGB-D
image by matching features in the color image to features in the last key frame.
• Local Mapping: If the current color image is identified as a key frame, new 3-D map points are
computed from the depth image. At this stage, bundle adjustment is used to minimize reprojection
errors by adjusting the camera pose and 3-D points.
• Loop Closure: Loops are detected for each key frame by comparing it against all previous key
frames using the bag-of-features approach. Once a loop closure is detected, the pose graph is
optimized to refine the camera poses of all the key frames.
The data used in this example is from the TUM RGB-D benchmark [2]. You can download the data to a
temporary folder using a web browser or by running the following code:
baseDownloadURL = "https://cvg.cit.tum.de/rgbd/dataset/freiburg3/rgbd_dataset_freiburg3_long_offi
dataFolder = fullfile(tempdir, 'tum_rgbd_dataset', filesep);
options = weboptions('Timeout', Inf);
tgzFileName = [dataFolder, 'fr3_office.tgz'];
folderExists = exist(dataFolder, 'dir');
Downloading fr3_office.tgz (1.38 GB). This download can take a few minutes.
Create two imageDatastore objects to store the color and depth images, respectively.
imgFolderColor = [imageFolder,'rgb/'];
imgFolderDepth = [imageFolder,'depth/'];
imdsColor = imageDatastore(imgFolderColor);
imdsDepth = imageDatastore(imgFolderDepth);
Note that the color and depth images are not synchronized in this data set. Therefore, you must
associate each color image with a depth image based on the time stamps.
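For reference, a minimal sketch of one way to perform this association is shown below. The variables rgbTimeStamps and depthTimeStamps and the tolerance value are assumptions for illustration; the example uses its own helper for this step.
% Hedged sketch: pair each color image with the depth image whose time
% stamp is closest, within a tolerance. Variable names are illustrative.
maxTimeDiff = 1/30;                        % assumed tolerance in seconds
depthIdxForColor = zeros(numel(rgbTimeStamps),1);
for k = 1:numel(rgbTimeStamps)
    [minDiff,idx] = min(abs(depthTimeStamps - rgbTimeStamps(k)));
    if minDiff < maxTimeDiff
        depthIdxForColor(k) = idx;         % depth frame matched to color frame k
    end
end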
Map Initialization
The pipeline starts by initializing the map that holds 3-D world points. This step is crucial and has a
significant impact on the accuracy of the final SLAM result. Initial ORB feature points are extracted
from the first color image using helperDetectAndExtractFeatures on page 1-65. Their
corresponding 3-D world locations can be computed from the pixel coordinates of the feature points
and the depth value using helperReconstructFromRGBD on page 1-65.
initialPose = rigidtform3d();
[xyzPoints, validIndex] = helperReconstructFromRGBD(currPoints, currIdepth, intrinsics, initialPo
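The core computation performed by such a reconstruction can be sketched as follows. This is not the helper's actual code; it assumes that depthFactor converts raw depth units to meters and that the feature locations can be rounded to pixel indices.
% Hedged sketch: back-project feature pixels using their depth values.
uv = round(currPoints.Location);                          % [u v] pixel coordinates
z = double(currIdepth(sub2ind(size(currIdepth),uv(:,2),uv(:,1))))/depthFactor;
x = (uv(:,1) - intrinsics.PrincipalPoint(1)).*z/intrinsics.FocalLength(1);
y = (uv(:,2) - intrinsics.PrincipalPoint(2)).*z/intrinsics.FocalLength(2);
% Transform the points from the camera frame of the initial pose to the
% world frame.
xyzWorld = transformPointsForward(initialPose,[x y z]);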
Loop detection is performed using the bag-of-words approach. A visual vocabulary, represented as a
bagOfFeatures object, is created offline with the ORB descriptors extracted from a large set of
images in the dataset by calling:
bag = bagOfFeatures(imds,CustomExtractor=@helperORBFeatureExtractorFunction,
TreeProperties=[5, 10], StrongestFeatures=1);
After the map is initialized using the first pair of color and depth images, you can use imageviewset
and worldpointset to store the first key frame and the corresponding map points:
% Show legend
showLegend(mapPlot);
Tracking
The tracking process is performed using every RGB-D image and determines when to insert a new
key frame.
currFrameIdx = 2;
isLoopClosed = false;
1 ORB features are extracted for each new color image and then matched (using matchFeatures)
with features in the last key frame that have known corresponding 3-D map points.
2 Estimate the camera pose with the Perspective-n-Point (PnP) algorithm, which estimates the pose
of a calibrated camera given a set of 3-D points and their corresponding 2-D projections, using
estworldpose.
3 Given the camera pose, project the map points observed by the last key frame into the current
frame and search for feature correspondences using matchFeaturesInRadius.
4 With 3-D to 2-D correspondences in the current frame, refine the camera pose by performing a
motion-only bundle adjustment using bundleAdjustmentMotion.
5 Project the local map points into the current frame to search for more feature correspondences
using matchFeaturesInRadius and refine the camera pose again using
bundleAdjustmentMotion.
6 The last step of tracking is to decide if the current frame should be a new key frame. A frame is a
key frame if both of the following conditions are satisfied:
• At least 20 frames have passed since the last key frame or the current frame tracks fewer than
100 map points or 25% of points tracked by the reference key frame.
• The map points tracked by the current frame are fewer than 90% of points tracked by the
reference key frame.
If the current frame is to become a key frame, continue to the Local Mapping process. Otherwise,
start Tracking for the next frame.
% Main loop
isLastFrameKeyFrame = true;
while ~isLoopClosed && currFrameIdx < numel(imdsColor.Files)
% Track the local map and check if the current frame is a key frame.
% A frame is a key frame if both of the following conditions are satisfied:
%
% 1. At least 20 frames have passed since the last key frame or the
% current frame tracks fewer than 100 map points.
% 2. The map points tracked by the current frame are fewer than 90% of
% points tracked by the reference key frame.
%
% localKeyFrameIds: ViewId of the connected key frames of the current frame
numSkipFrames = 20;
numPointsKeyFrame = 100;
[localKeyFrameIds, currPose, trackedMapPointsIdx, trackedFeatureIdx, isKeyFrame] = ...
helperTrackLocalMap(mapPointSet, vSetKeyFrames, trackedMapPointsIdx, ...
trackedFeatureIdx, currPose, currFeatures, currPoints, intrinsics, scaleFactor, numLevels
isLastFrameKeyFrame, lastKeyFrameIdx, currFrameIdx, numSkipFrames, numPointsKeyFrame);
if ~isKeyFrame
currFrameIdx = currFrameIdx + 1;
isLastFrameKeyFrame = false;
continue
else
% Reconstruct the 3-D world positions from the current color and depth images
[xyzPoints, validIndex] = helperReconstructFromRGBD(currPoints, currIdepth, ...
intrinsics, currPose, depthFactor);
Local Mapping
Local mapping is performed for every key frame. When a new key frame is determined, add it to the
key frames and update the attributes of the map points observed by the new key frame. To ensure
that mapPointSet contains as few outliers as possible, a valid map point must be observed in at least
3 key frames.
New map points are created by triangulating ORB feature points in the current key frame and its
connected key frames. For each unmatched feature point in the current key frame, search for a match
with other unmatched points in the connected key frames using matchFeatures. The local bundle
adjustment refines the pose of the current key frame, the poses of connected key frames, and all the
map points observed in these key frames.
% Remove outlier map points that are observed in fewer than 3 key frames
if currKeyFrameId == 2
triangulatedMapPointsIdx = [];
end
Loop Closure
The loop closure detection step takes the current key frame processed by the local mapping process
and tries to detect and close the loop. Loop candidates are identified by querying images in the
database that are visually similar to the current key frame using evaluateImageRetrieval. A
candidate key frame is valid if it is not connected to the last key frame and three of its neighbor key
frames are loop candidates.
When a valid loop candidate is found, use estgeotform3d to compute the relative pose between the
loop candidate frame and the current key frame. The relative pose represents a 3-D rigid
transformation stored in a rigidtform3d object. Then add the loop connection with the relative
pose and update mapPointSet and vSetKeyFrames.
% Check loop closure after some key frames have been created
if currKeyFrameId > 20
if isDetected
% Add loop closure connections
maxDistance = 0.1;
[isLoopClosed, mapPointSet, vSetKeyFrames] = helperAddLoopConnectionsStereo(...
mapPointSet, vSetKeyFrames, validLoopCandidates, currKeyFrameId, ...
currFeatures, currPoints, loopEdgeNumMatches, maxDistance);
end
end
Finally, apply pose graph optimization over the essential graph in vSetKeyFrames to correct the
drift. The essential graph is created internally by removing connections with fewer than
minNumMatches matches in the covisibility graph. After pose graph optimization, update the 3-D
locations of the map points using the optimized poses.
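A minimal sketch of this step is shown below; the argument order of helperUpdateGlobalMap is inferred from its listing in Supporting Functions and may differ.
% Hedged sketch: optimize the pose graph over the essential graph, then
% update the map points and extract the optimized key frame poses.
vSetKeyFramesOptim = optimizePoses(vSetKeyFrames,minNumMatches);
mapPointSet = helperUpdateGlobalMap(mapPointSet,vSetKeyFrames,vSetKeyFramesOptim);
optimizedPoses = poses(vSetKeyFramesOptim);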
plotOptimizedTrajectory(mapPlot, optimizedPoses)
% Update legend
showLegend(mapPlot);
You can compare the optimized camera trajectory with the ground truth to evaluate the accuracy. The
downloaded data contains a groundtruth.txt file that stores the ground truth camera pose of each
frame. The data has been saved in the form of a MAT-file. You can also calculate the root mean square
error (RMSE) of the trajectory estimates.
% Show legend
showLegend(mapPlot);
Given the refined camera poses, you can reproject all the valid image points in the associated depth
images back to the 3-D space to perform dense reconstruction.
for i = 1: numel(addedFramesIdx)
Icolor = readimage(imdsColor, addedFramesIdx(i));
Idepth = readimage(imdsDepth, addedFramesIdx(i));
figure
pcshow(pointCloudsAll,VerticalAxis="y", VerticalAxisDir="down");
xlabel('X')
ylabel('Y')
zlabel('Z')
Supporting Functions
Short helper functions are listed below. Larger functions are included in separate files.
% Input handling
dataLines = [4, Inf];
helperDetectAndExtractFeatures detects and extracts ORB features from the image.
function [features, validPoints] = helperDetectAndExtractFeatures(Irgb, scaleFactor, numLevels)
numPoints = 1000;
% Extract features
[features, validPoints] = extractFeatures(Igray, points);
end
helperUpdateGlobalMap updates the 3-D locations of map points after pose graph optimization.
posesOld = vSetKeyFrames.Views.AbsolutePose;
posesNew = vSetKeyFramesOptim.Views.AbsolutePose;
positionsOld = mapPointSet.WorldPoints;
positionsNew = positionsOld;
indices = 1:mapPointSet.Count;
% Update world location of each map point based on the new absolute pose of
% the corresponding major view
for i = 1: mapPointSet.Count
majorViewIds = mapPointSet.RepresentativeViewId(i);
tform = rigidtform3d(posesNew(majorViewIds).A/posesOld(majorViewIds).A);
positionsNew(i, :) = transformPointsForward(tform, positionsOld(i, :));
end
mapPointSet = updateWorldPoints(mapPointSet, indices, positionsNew);
end
References
[1] Mur-Artal, Raul, and Juan D. Tardos. “ORB-SLAM2: An Open-Source SLAM System for Monocular,
Stereo, and RGB-D Cameras.” IEEE Transactions on Robotics 33, no. 5 (October 2017): 1255–62.
[2] Sturm, Jürgen, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. “A
Benchmark for the Evaluation of RGB-D SLAM Systems.” In 2012 IEEE/RSJ International Conference
on Intelligent Robots and Systems, 573–80, 2012.
Import Stereo Camera Parameters from ROS
The ROS camera calibration package estimates stereo camera parameters using the OpenCV camera
calibration tools [1]. After calibrating a stereo camera in ROS, you can export its camera parameters
to an INI file using the camera calibration parser. To use the calibrated stereo camera with Computer
Vision Toolbox™ functions, such as rectifyStereoImages, you must read the camera parameters
from the INI file and convert them into a stereoParameters object using
stereoParametersFromOpenCV.
Read the stereo camera parameters stored in stereoParams.ini using the helper function
helperReadINI.
stereoParamsINI = helperReadINI("stereoParams.ini");
The baseline parameters of a stereo camera describe the relative translation and rotation of the two
cameras in the stereo camera pair. The relative rotation and translation of camera 2 with respect to
camera 1 is required to create the stereoParameters object using
stereoParametersFromOpenCV. You can compute these from the rectification and projection
matrices read from the ROS INI file [2].
Extract the translation of camera 2 relative to camera 1 from the last column of the projection matrix.
translationOfCamera2 = cameraParams2.projection(:,end);
The rotation of camera 2 relative to camera 1, R21, is derived from the rectification matrices of the
stereo pair, R1 and R2. The rectification matrices are the rotation matrices that align the camera
coordinate systems to the ideal stereo image plane such that epipolar lines in both stereo images are
parallel. Compute the rotation of camera 2 relative to camera 1 as R21 = R2*R1^T.
rotationOfCamera2 = cameraParams2.rectification*cameraParams1.rectification';
Extract the intrinsic matrices and distortion coefficients of the two cameras from the stereoParams
structure.
intrinsicMatrix1 = cameraParams1.camera_matrix;
intrinsicMatrix2 = cameraParams2.camera_matrix;
distortionCoefficients1 = cameraParams1.distortion;
distortionCoefficients2 = cameraParams2.distortion;
Obtain the image size from the image field of the stereoParams structure.
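The remaining steps can be sketched as below: read the image size from the image section of the INI structure and assemble the stereoParameters object. The image field names height and width are assumptions based on the helperReadINI description.
% Hedged sketch: assemble the stereoParameters object from the values
% extracted above. The image field names are assumed.
imageSize = [stereoParamsINI.image.height,stereoParamsINI.image.width];
stereoParamsObj = stereoParametersFromOpenCV(intrinsicMatrix1, ...
    distortionCoefficients1,intrinsicMatrix2,distortionCoefficients2, ...
    rotationOfCamera2,translationOfCamera2,imageSize);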
Use the imported stereo parameters with rectifyStereoImages to rectify an image pair captured
using the calibrated stereo camera.
Supporting Functions
helperReadINI
The helperReadINI function reads the camera parameters from its input INI file that has been
exported from ROS.
function cameraParams = helperReadINI(filename)
% helperReadINI reads a ROS INI file, filename, and returns a structure with
% these fields: image, <camera_name1>, <camera_name2>. image is a
% structure describing the height and width of the image captured by the
% cameras of the stereo pair. The fields <camera_name1> and <camera_name2>
% are structures named after the camera names present in the INI file, and they contain
% these fields: camera_matrix, distortion, rectification_matrix,
% and projection_matrix. These fields are stored in the INI file with each
% field name on one line and its values on the following lines.
f = fopen(filename,"r");
sectionName = '';
while ~feof(f)
% Read line from file.
line = fgetl(f);
if isempty(line) || line(1)=='#'
% Skip empty line and comments.
continue
elseif line(1) == '[' && line(end) == ']'
% Identify section names and continue reading.
sectionName = line(2:end-1);
sectionName = strrep(sectionName,'/','_');
continue
end
if isempty(line)
% An empty line indicates the end of the value data.
break
elseif line(1)=='#'
% Skip comment lines.
continue
end
line = str2num(line); %#ok
value = [value; line]; %#ok
end
fclose(f);
end
References
[1] http://wiki.ros.org/camera_calibration
[2] http://docs.ros.org/en/melodic/api/sensor_msgs/html/msg/CameraInfo.html
Import Camera Intrinsic Parameters from ROS
The ROS camera calibration package estimates camera intrinsic parameters using the OpenCV
camera calibration tools [1]. After calibrating a camera in ROS, you can export its intrinsic
parameters to a YAML file using the camera calibration parser in ROS. To use the calibrated camera
with Computer Vision Toolbox™ functions, such as undistortImage, you must read the camera
parameters from the YAML file and then convert them into a cameraIntrinsics object using
cameraIntrinsicsFromOpenCV.
Read the camera parameters stored in cameraParams.yaml using the helper function
helperReadYAML.
intrinsicsParams = helperReadYAML('cameraParams.yaml');
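The parameters that cameraIntrinsicsFromOpenCV expects can then be extracted from the returned structure. A minimal sketch, using the field names described in helperReadYAML:
% Hedged sketch: extract the values expected by cameraIntrinsicsFromOpenCV
% from the structure returned by helperReadYAML.
intrinsicMatrix = intrinsicsParams.camera_matrix;
distortionCoefficients = intrinsicsParams.distortion_coefficients;
imageSize = [intrinsicsParams.image_height,intrinsicsParams.image_width];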
intrinsicsObj = cameraIntrinsicsFromOpenCV(intrinsicMatrix,distortionCoefficients,imageSize);
Undistort Image
Use the imported camera intrinsics with undistortImage to undistort an image captured using the
calibrated camera.
Supporting Functions
helperReadYAML
The helperReadYAML function reads the monocular camera parameters from the input YAML file
that was exported from ROS.
function cameraParams = helperReadYAML(filename)
% helperReadYAML reads a ROS YAML file, filename, and returns a structure
% with these fields: image_width, image_height, camera_name,
% camera_matrix, distortion_model, distortion_coefficients,
% rectification_matrix, and projection_matrix. Each of these fields is stored
% in the YAML file on its own line, separated from its value by a colon.
f = fopen(filename,'r');
stringFields = {'camera_name','distortion_model'};
while ~feof(f)
[name,value,isEmptyLine] = helperReadYAMLLine(f);
if isEmptyLine
continue
end
if ~isempty(value)
% Convert all values to numbers except for known string
% fields.
if ~any(contains(name, stringFields))
value = str2num(value); %#ok
end
else
% An empty value in ROS YAML files indicates a matrix in
% upcoming lines. Read the matrix from the upcoming lines.
value = helperReadYAMLMatrix(f);
end
end
fclose(f);
end
helperReadYAMLMatrix
The helperReadYAMLMatrix function reads the rows, columns and data fields of a matrix in the
ROS YAML file.
function matrix = helperReadYAMLMatrix(f)
% helperReadYAMLMatrix reads a matrix from the ROS YAML file. A matrix in
% a ROS YAML file has three fields: rows, cols, and data. rows and cols
% describe the matrix size. data is a contiguous array of the matrix
% elements in row-major order. This helper function assumes the presence
% of all three fields of a matrix to return the correct matrix.
numRows = 0;
numCols = 0;
data = [];
if isEmptyLine
continue
end
switch name
case 'rows'
numRows = str2num(value); %#ok
case 'cols'
numCols = str2num(value); %#ok
case 'data'
data = str2num(value); %#ok
if numel(data) == numRows*numCols
% Reshape the matrix using row-major order.
matrix = reshape(data,[numCols numRows])';
end
end
helperReadYAMLLine
if isempty(line) || line(1)=='#'
% Empty line or comment.
name = '';
value = '';
isEmptyLine = true;
else
% Split the line to get name and value.
c = strsplit(line,':');
assert(length(c)==2,'Unexpected file format')
name = c{1};
value = strtrim(c{2}); % Trim surrounding whitespace.
isEmptyLine = false;
end
end
References
[1] http://wiki.ros.org/camera_calibration
Develop Visual SLAM Algorithm Using Unreal Engine Simulation
This example shows how to develop a visual Simultaneous Localization and Mapping (SLAM)
algorithm using image data obtained from the Unreal Engine® simulation environment.
Visual SLAM is the process of calculating the position and orientation of a camera with respect to its
surroundings while simultaneously mapping the environment. Developing a visual SLAM algorithm
and evaluating its performance in varying conditions is a challenging task. One of the biggest
challenges is generating the ground truth of the camera sensor, especially in outdoor environments.
The use of simulation enables testing under a variety of scenarios and camera configurations while
providing precise ground truth.
This example demonstrates the use of Unreal Engine simulation to develop a visual SLAM algorithm
for either a monocular or a stereo camera in a parking scenario. For more information about the
implementation of the visual SLAM pipelines, see the “Monocular Visual Simultaneous Localization
and Mapping” on page 1-122 example and the “Stereo Visual Simultaneous Localization and
Mapping” on page 1-153 example.
Use the Simulation 3D Scene Configuration block to set up the simulation environment. Select the
built-in Large Parking Lot scene, which contains several parked vehicles. The visual SLAM algorithm
matches features across consecutive images. To increase the number of potential feature matches,
you can use the Parked Vehicles subsystem to add more parked vehicles to the scene. To specify the
parking poses of the vehicles, use the helperAddParkedVehicle function. If you select a more
natural scene, the presence of additional vehicles is not necessary. Natural scenes usually have
enough texture and feature variety suitable for feature matching.
You can follow the “Select Waypoints for Unreal Engine Simulation” (Automated Driving Toolbox)
example to interactively select a sequence of parking locations. You can use the same approach to
select a sequence of waypoints and generate a reference trajectory for the ego vehicle. This example
uses a recorded reference trajectory and parked vehicle locations.
ylim([10 60])
hScene.Position = [100, 100, 1000, 500]; % Resize figure
legend
hold off
modelName = 'GenerateImageDataOfParkingLot';
open_system(modelName);
helperAddParkedVehicles(modelName, parkedPoses);
Set up the ego vehicle moving along the specified reference path by using the Simulation 3D Vehicle
with Ground Following block. The Camera Variant Subsystem contains two configurations of camera
sensors: monocular and stereo. In both configurations, the camera is mounted on the vehicle roof
center. You can use the Camera Calibrator or Stereo Camera Calibrator app to estimate intrinsics of
the actual camera that you want to simulate. This example shows the monocular camera workflow
first followed by the stereo camera workflow.
% Camera intrinsics
focalLength = [700, 700]; % specified in units of pixels
principalPoint = [600, 180]; % in pixels [x, y]
Run the simulation to visualize and record sensor data. Use the Video Viewer block to visualize the
image output from the camera sensor. Use the To Workspace block to record the ground truth
location and orientation of the camera sensor.
close(hScene)
if ~ispc
error("Unreal Engine Simulation is supported only on Microsoft" + char(174) + " Windows" + ch
end
% Run simulation
simOut = sim(modelName);
Use the images to evaluate the monocular visual SLAM algorithm. The function helperVisualSLAM
implements the monocular ORB-SLAM pipeline:
• Map Initialization: ORB-SLAM starts by initializing the map of 3-D points from two images. Use
estrelpose to compute the relative pose based on 2-D ORB feature correspondences and
triangulate to compute the 3-D map points. The two frames are stored in an imageviewset
object as key frames. The 3-D map points and their correspondences to the key frames are stored
in a worldpointset object.
• Tracking: Once a map is initialized, for each new image, the function
helperTrackLastKeyFrame estimates the camera pose by matching features in the current
frame to features in the last key frame. The function helperTrackLocalMap refines the
estimated camera pose by tracking the local map.
• Local Mapping: The current frame is used to create new 3-D map points if it is identified as a key
frame. At this stage, bundleAdjustment is used to minimize reprojection errors by adjusting the
camera pose and 3-D points.
• Loop Closure: Loops are detected for each key frame by comparing it against all previous key
frames using the bag-of-features approach. Once a loop closure is detected, the pose graph is
optimized to refine the camera poses of all the key frames using the optimizePoseGraph
(Navigation Toolbox) function.
For the implementation details of the algorithm, see the “Monocular Visual Simultaneous Localization
and Mapping” on page 1-122 example.
[mapPlot, optimizedPoses, addedFramesIdx] = helperVisualSLAM(imds, intrinsics);
You can evaluate the optimized camera trajectory against the ground truth obtained from the
simulation. Since the images are generated from a monocular camera, the trajectory of the camera
can only be recovered up to an unknown scale factor. You can approximately compute the scale factor
from the ground truth, thus simulating what you would normally obtain from an external sensor.
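One way to approximate the scale factor is sketched below. It assumes that gTruth is an array of rigidtform3d ground truth camera poses aligned with the recorded frames; the example computes scaledTrajectory with its own helper, which may differ.
% Hedged sketch: estimate the monocular scale by comparing the median
% segment lengths of the two trajectories, then scale the estimate.
estimatedLoc = vertcat(optimizedPoses.AbsolutePose.Translation);
groundTruthLoc = vertcat(gTruth(addedFramesIdx).Translation);
scaleFactor = median(vecnorm(diff(groundTruthLoc),2,2))/ ...
    median(vecnorm(diff(estimatedLoc),2,2));
scaledTrajectory = estimatedLoc*scaleFactor;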
% Show legend
showLegend(mapPlot);
You can also calculate the root mean square error (RMSE) of trajectory estimates.
helperEstimateTrajectoryError(gTruth(addedFramesIdx), scaledTrajectory);
In a monocular visual SLAM algorithm, depth cannot be accurately determined using a single
camera. The scale of the map and of the estimated trajectory is unknown and drifts over time.
Additionally, because map points often cannot be triangulated from the first frame, bootstrapping the
system requires multiple views to produce an initial map. Using a stereo camera solves these
problems and provides a more reliable visual SLAM solution. The function
helperVisualSLAMStereo implements the stereo visual SLAM pipeline. The key difference from
the monocular pipeline is that at the map initialization stage, the stereo pipeline creates 3-D map
points from a pair of stereo images of the same frame, instead of creating them from two images of
different frames. For the implementation details of the algorithm, see the “Stereo Visual
Simultaneous Localization and Mapping” on page 1-153 example.
% Run simulation
simOut = sim(modelName);
snapnow;
% Show legend
showLegend(mapPlot);
Compared with the monocular visual SLAM algorithm, the stereo visual SLAM algorithm produces a
more accurate estimation of the camera trajectory.
Given the refined camera poses, you can perform dense reconstruction from the stereo images
corresponding to the key frames.
close_system(modelName, 0);
close all
Supporting Functions
files = dir(dataFolder);
if numel(files) < 3
numFrames = numel(simOut.images.Time);
for i = 3:numFrames % Ignore the first two frames
img = squeeze(simOut.images.Data(:,:,:,i));
imwrite(img, [dataFolder, sprintf('%04d', i-2), '.png'])
end
end
files = dir(dataFolderLeft);
if numel(files) < 3
numFrames = numel(simOut.imagesLeft.Time);
for i = 3:numFrames % Ignore the first two frames
imgLeft = squeeze(simOut.imagesLeft.Data(:,:,:,i));
imwrite(imgLeft, [dataFolderLeft, sprintf('%04d', i-2), '.png'])
imgRight = squeeze(simOut.imagesRight.Data(:,:,:,i));
imwrite(imgRight, [dataFolderRight, sprintf('%04d', i-2), '.png'])
end
end
for i = 1: numel(addedFramesIdx)
I1 = readimage(imdsLeft, addedFramesIdx(i));
I2 = readimage(imdsRight, addedFramesIdx(i));
disparityMap = disparitySGM(im2gray(I1), im2gray(I2), DisparityRange=[0, maxDisparity],Unique
xyzPoints = reconstructScene(disparityMap, reprojectionMatrix);
% Ignore the upper half of the images which mainly show the sky
xyzPoints(1:100, :, :) = NaN;
currPose = optimizedPoses.AbsolutePose(i);
xyzPoints = transformPointsForward(currPose, xyzPoints);
ptCloud = pointCloud(xyzPoints, Color=colors);
ptClouds(i) = pcdownsample(ptCloud, random=0.2);
end
Visual Localization in a Parking Lot
This example shows how to develop a visual localization system using synthetic image data from the
Unreal Engine® simulation environment.
It is a challenging task to obtain ground truth for evaluating the performance of a localization
algorithm in different conditions. Virtual simulation in different scenarios is a cost-effective method to
obtain the ground truth in comparison with more expensive approaches such as using high-precision
inertial navigation systems or differential GPS. The use of simulation enables testing under a variety
of scenarios and sensor configurations. It also enables rapid algorithm development and provides
precise ground truth.
This example uses the Unreal Engine simulation environment from Epic Games® to develop and
evaluate a visual localization algorithm in a parking lot scenario.
Overview
Visual localization is the process of estimating the camera pose for a captured image relative to a
visual representation of a known scene. It is a key technology for applications such as augmented
reality, robotics, and automated driving. Compared with the visual SLAM workflow described in
“Implement Visual SLAM in MATLAB” on page 13-8, visual localization assumes that a map of the
environment is known and does not require 3-D reconstruction or loop closure detection. The visual
localization pipeline includes the following steps:
• Map Loading: Load the pre-built 3-D map containing the world point positions and the 3-D to 2-D
correspondences between the map points and the key frames. Additionally, for each key frame,
load the feature descriptors corresponding to the 3-D map points.
• Global Initialization: Extract features from the first image frame and match them with the
features corresponding to all the 3-D map points. After getting the 3-D to 2-D correspondences,
estimate the camera pose of the first frame in the world coordinate system by solving a
Perspective-n-Point (PnP) problem. Refine the pose using motion-only bundle adjustment (see the
sketch after this list). The key frame that shares the most covisible 3-D map points with the first
frame is identified as the reference key frame.
• Tracking: Once the first frame is localized, for each new frame, match features in the new frame
with features in the reference key frame that have known 3-D world points. Estimate and refine
the camera pose using the same approach as in Global Initialization step. The camera pose can be
further refined by tracking the features associated with nearby key frames.
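The sketch below illustrates the global initialization step referenced in the list above. The variables firstFrame, mapFeatures, and intrinsics are illustrative assumptions rather than the names used inside the Helper Visual Localization block.
% Hedged sketch of global initialization: match the first frame against the
% map feature descriptors, solve PnP, and refine the pose.
Igray = im2gray(firstFrame);
points = detectORBFeatures(Igray);
[features,points] = extractFeatures(Igray,points);
pairs = matchFeatures(features,mapFeatures);   % mapFeatures: descriptors of the 3-D map points
imagePts = points(pairs(:,1)).Location;
worldPts = mapPointSet.WorldPoints(pairs(:,2),:);
camPose = estworldpose(imagePts,worldPts,intrinsics);   % Perspective-n-Point
camPose = bundleAdjustmentMotion(worldPts,imagePts,camPose,intrinsics);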
Create Scene
Guiding a vehicle into a parking spot is a challenging maneuver that relies on accurate localization.
The VisualLocalizationInAParkingLot model simulates a visual localization system in the
parking lot scenario used in the “Develop Visual SLAM Algorithm Using Unreal Engine Simulation”
(Automated Driving Toolbox) example.
• The Simulation 3D Scene Configuration (Automated Driving Toolbox) block sets up the Large
Parking Lot scene. The Parked Vehicles subsystem adds parked cars into the parking lot.
• The Simulation 3D Vehicle with Ground Following (Automated Driving Toolbox) block controls the
motion of the ego vehicle.
• The Simulation 3D Camera (Automated Driving Toolbox) block models a monocular camera fixed at
the center of the vehicle's roof. You can use the Camera Calibrator app to estimate intrinsics of
the actual camera that you want to simulate.
• The Helper Visual Localization MATLAB System block implements the visual localization
algorithm. The initial camera pose with respect to the map is estimated using the
helperGlobalInitialization function. The subsequent camera poses are estimated using the
helperTrackingRefKeyFrame function and refined using the helperTrackLocalKeyFrames
function. This block also provides a visualization of the estimated camera trajectory in the pre-
built map. You can specify the pre-built map data and the camera intrinsic parameters in the block
dialog.
The pre-built map data is generated using the stereo camera in the “Develop Visual SLAM Algorithm
Using Unreal Engine Simulation” (Automated Driving Toolbox) example. The data consists of three
objects that are commonly used to manage image and map data for visual SLAM:
• vSetKeyFrame: an imageviewset object storing the camera poses of key frames and the
associated feature points for each 3-D map point in mapPointSet.
• mapPointSet: a worldpointset object storing the 3-D map point locations and the
correspondences between the 3-D points and 2-D feature points across key frames. The 3-D map
points provide a sparse representation of the environment.
You can follow the “Select Waypoints for Unreal Engine Simulation” (Automated Driving Toolbox)
example to select a sequence of waypoints and generate a reference trajectory for the ego vehicle.
This example uses a recorded reference trajectory.
Run Simulation
Run the simulation and visualize the estimated camera trajectory in the pre-built map. The white
points represent the tracked 3-D map points in the current frame. You can compare the estimated
trajectory with the ground truth provided by the Simulation 3D Camera block to evaluate the
localization accuracy.
if ~ispc
error("Unreal Engine Simulation is supported only on Microsoft" + char(174) + " Windows" + ch
end
% Run simulation
sim(modelName);
Conclusion
With this setup, you can rapidly iterate over different scenarios, sensor configurations, or reference
trajectories and refine the visual localization algorithm before moving to real-world testing.
• To select a different scenario, use the Simulation 3D Scene Configuration (Automated Driving
Toolbox) block. Choose from the existing prebuilt scenes or create a custom scene in the Unreal®
Editor.
• To create a different reference trajectory, use the helperSelectSceneWaypoints tool, as shown
in the “Select Waypoints for Unreal Engine Simulation” (Automated Driving Toolbox) example.
• To alter the sensor configuration, use the Simulation 3D Camera (Automated Driving
Toolbox) block. The Mounting tab provides options for specifying different sensor mounting
placements. The Parameters tab provides options for modifying sensor parameters such as
detection range, field of view, and resolution. You can also use the Simulation 3D Fisheye Camera
(Automated Driving Toolbox) block which provides a larger field of view.
Stereo Visual SLAM for UAV Navigation in 3D Simulation
Visual SLAM is the process of calculating the position and orientation of a camera with respect to its
surroundings while simultaneously mapping the environment. Developing a visual SLAM algorithm
and evaluating its performance in varying conditions is a challenging task. One of the biggest
challenges is generating the ground truth of the camera sensor, especially in outdoor environments.
The use of simulation enables testing under a variety of scenarios and camera configurations while
providing precise ground truth.
This example demonstrates the use of Unreal Engine® simulation to develop a visual SLAM
algorithm for a UAV equipped with a stereo camera in a city block scenario. For more information
about the implementation of the visual SLAM pipeline for a stereo camera [1] on page 1-99, see the
“Stereo Visual Simultaneous Localization and Mapping” on page 1-153 example.
First, set up a scenario in the simulation environment that can be used to test the visual SLAM
algorithm. Use a scene depicting a typical city block with a UAV as the vehicle under test.
Next, select a trajectory for the UAV to follow in the scene. You can follow the “Select Waypoints for
Unreal Engine Simulation” (Automated Driving Toolbox) example to interactively select a sequence of
waypoints and then use the helperSelectSceneWaypoints function to generate a reference
trajectory for the UAV. This example uses a recorded reference trajectory as shown below:
The UAVVisualSLAMIn3DSimulation Simulink® model is configured with the US City Block scene
using the Simulation 3D Scene Configuration (UAV Toolbox) block. The model places a UAV on the
scene using the Simulation 3D UAV Vehicle (UAV Toolbox) block. A stereo camera consisting of two
Simulation 3D Camera (UAV Toolbox) blocks is attached to the UAV. In the dialog box of the
Simulation 3D Camera (UAV Toolbox) block, use the Mounting tab to adjust the placement of the
camera. Use the Parameters tab to configure properties of the camera to simulate different cameras.
To estimate the intrinsics of the stereo camera that you want to simulate, use the Stereo Camera
Calibrator app, as described in “Using the Stereo Camera Calibrator App” on page 18-35.
modelName = 'UAVVisualSLAMIn3DSimulation';
open_system(modelName);
The Helper Stereo Visual SLAM System block implements the stereo visual SLAM pipeline, consisting
of the following steps:
• Map Initialization: The pipeline starts by initializing the map of 3-D points from a pair of images
generated from the stereo camera using the disparity map. The left image is stored as the first key
frame.
• Tracking: Once a map is initialized, for each new stereo pair, the pose of the camera is estimated
by matching features in the left image to features in the last key frame. The estimated camera
pose is refined by tracking the local map.
• Local Mapping: If the current left image is identified as a key frame, new 3-D map points are
computed from the disparity of the stereo pair. At this stage, bundle adjustment is used to
minimize reprojection errors by adjusting the camera pose and 3-D points.
• Loop Closure: Loops are detected for each key frame by comparing it against all previous key
frames using the bag-of-features approach. Once a loop closure is detected, the pose graph is
optimized to refine the camera poses of all the key frames.
For the implementation details of the algorithm, see the “Stereo Visual Simultaneous Localization and
Mapping” on page 1-153 example.
Simulate the model and visualize the results. The Video Viewer block displays the stereo image
output. The Point Cloud Player displays the reconstructed 3-D map with the estimated camera
trajectory.
if ~ispc
error("Unreal Engine Simulation is supported only on Microsoft" + char(174) + " Windows" + ch
end
% Run simulation
sim(modelName);
close_system(modelName);
References
[1] Mur-Artal, Raul, and Juan D. Tardós. "ORB-SLAM2: An open-source SLAM system for monocular,
stereo, and RGB-D cameras." IEEE Transactions on Robotics 33, no. 5 (2017): 1255-1262.
Camera Calibration Using AprilTag Markers
AprilTags are widely used as visual markers for applications in object detection and localization, and
as a target for camera calibration [1]. AprilTags are like QR codes, but are designed to encode less
data and can therefore be decoded faster, which is useful, for example, in real-time robotics
applications. The advantages of using AprilTags as a calibration pattern include a greater number of
detected feature points and consistent, repeatable detections. This example uses the readAprilTag
function to detect and localize AprilTags in a calibration pattern. The readAprilTag function
supports all official tag families. The example also uses additional Computer Vision Toolbox™
functions to perform end-to-end camera calibration. The default checkerboard pattern is replaced by a
grid of evenly spaced AprilTags. For an example of using a checkerboard pattern for calibration, refer
to “Using the Single Camera Calibrator App” on page 18-22.
This example shows how to calibrate a camera using AprilTags programmatically and by using the
Camera Calibrator app:
Pre-generated tags for all the supported families can be downloaded from the AprilRobotics
apriltag-imgs repository using a web browser or by running the following code:
downloadURL = "https://github.com/AprilRobotics/apriltag-imgs/archive/master.zip";
dataFolder = fullfile(tempdir,"apriltag-imgs",filesep);
options = weboptions('Timeout', Inf);
zipFileName = fullfile(dataFolder,"apriltag-imgs-master.zip");
folderExists = exist(dataFolder,"dir");
Extracting apriltag-imgs-master.zip...
The helperGenerateAprilTagPattern on page 1-112 function at the end of the example can be used to
generate a calibration target using the tag images for a specific arrangement of tags. The pattern
image is contained in calibPattern, which can be used to print the pattern (from MATLAB). The
example uses the tag36h11 family, which provides a reasonable trade-off between detection
performance and robustness to false-positive detections.
Using the readAprilTag function on this pattern results in detections with the corner locations of
the individual tags grouped together. The helperAprilTagToCheckerLocations on page 1-114 function
can be used to convert this arrangement to a column-major arrangement, such as a checkerboard.
% Read and localize the tags in the calibration pattern.
[tagIds, tagLocs] = readAprilTag(calibPattern,tagFamily);
• While the pattern is printed on paper in this example, consider printing it on a surface that
remains flat and is not subject to deformations due to moisture or other environmental factors.
• Since the calibration procedure assumes that the pattern is planar, any imperfections in the
pattern (e.g. an uneven surface) can reduce the accuracy of the calibration.
• The calibration procedure requires at least 2 images of the pattern but using between 10 and 20
images produces more accurate results.
• Capture a variety of images of the pattern such that the pattern fills most of the image, thus
covering the entire field of view. For example, to best capture the lens distortion, have images of
the pattern at all edges of the image frame.
• Make sure the pattern is completely visible in the captured images since images with partially
visible patterns will be rejected.
• For more information on preparing images of the calibration pattern, see “Prepare Camera and
Capture Images” on page 18-4.
The helperDetectAprilTagCorners on page 1-113 function, included at the end of the example, is used
to detect and localize the tags from the captured images and arrange them in a checkerboard fashion
to be used as key points in the calibration procedure.
The generated AprilTag pattern is such that the tags are in a checkerboard fashion, and so the world
coordinates for the corresponding image coordinates determined above (in imagePoints) can be
obtained using the generateCheckerboardPoints function.
Here, the size of the square is replaced by the size of the tag, and the size of the board is obtained
from the previous step. Measure the tag size between the outer black edges of one side of the tag.
With the image and world point correspondences, estimate the camera parameters using the
estimateCameraParameters function.
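A minimal sketch of these two steps, assuming imagePoints and boardSize come from the detection step above and tagSize is the measured tag width in world units:
% Hedged sketch: generate world points from the tag arrangement and
% estimate the camera parameters.
worldPoints = generateCheckerboardPoints(boardSize,tagSize);
[params,imagesUsed,estimationErrors] = estimateCameraParameters(imagePoints,worldPoints);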
Visualize the accuracy of the calibration and the extrinsic camera parameters. Show the planes of the
calibration pattern in the captured images.
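For example, assuming params holds the estimated cameraParameters object:
% Hedged sketch: inspect the extrinsics and the reprojection errors.
figure
showExtrinsics(params);
figure
showReprojectionErrors(params);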
Inspect the locations of the detected image points and the reprojected points, which were obtained
using the estimated camera parameters.
While this example uses AprilTag markers in the calibration pattern, the same workflow can be
extended to other planar patterns as well. The estimateCameraParameters function used to obtain
the camera parameters requires:
• imagePoints: Key points in the calibration pattern in image coordinates obtained from the
captured images.
• worldPoints: Corresponding world point coordinates of the key points in the calibration pattern.
Provided there is a way to obtain these key points, the rest of the calibration workflow remains the
same.
For convenience of use, the above workflow can also be integrated into the Camera Calibrator app.
The overall workflow remains the same and the steps are:
2. Import a custom pattern detector class for AprilTags. The detector must do the following:
• MATLAB Toolstrip: On the Apps tab, in the Image Processing and Computer Vision section,
click the Camera Calibrator icon.
• MATLAB command prompt: Enter cameraCalibrator.
On the Calibration tab, in the File section, click Add images, and then select From file. You can
add images from multiple folders by clicking Add images for each folder. This example reuses the
same images as above on page 1-102. You need at least 2 images for camera calibration. Once you add
images, the following UI appears:
The above UI shows a drop-down list for pattern selection. By default, the app does not include a
pattern detector for AprilTags. You can create a custom pattern detector class and then add it to the
list to use in the app. For more information on how to create a custom pattern, click the information
icon. A custom pattern detector class for AprilTags is provided in the
MyCustomAprilTagPatternDetector.m file. This class contains UI code for parameters needed by the
detector and functions for detecting and processing the custom AprilTag calibration pattern.
The example uses the configureUIComponents() function to configure the UI component and the
initializePropertyValues() function to initialize it. The helperDrawImageAxesLabels on page 1-114
function, included at the end of the example, is used to render the origin, X-axis, and Y-axis labels in
the calibration images displayed in the Camera Calibrator app dialog.
• detectPatternPoints() - Detects and localizes the AprilTags from the captured images and sorts
them for use as key points in the calibration procedure. This function is implemented using the
helperDetectAprilTagCorners on page 1-113 function, given at the end of the example.
• generateWorldPoints() - Computes world coordinates for the corresponding image coordinates
in the AprilTag pattern. This function is implemented using the helperGenerateAprilTagPattern on
page 1-112 function, given at the end of the example.
Import the custom pattern detector class by clicking the Import Pattern Detector button under the
Custom Pattern panel. Choose the class file MyCustomAprilTagPatternDetector.m. If there are
no errors in the class, then you see the following view:
For this example, all the fields in the Properties panel have correct values. But you can customize
these values to fit your needs. Note that Square Size represents the width of the tag in world units
and is also assumed to be equal to the spacing between each tag in the image.
Click OK and the Data Browser pane displays a list of images with IDs, as shown below:
These images will contain the detected pattern. To view an image, select it from the Data Browser
pane.
At this point, the camera calibration process is the same as described in “Using the Single Camera
Calibrator App” on page 18-22.
With the default calibration settings, click the Calibrate button on the Calibration tab. Visualize the
accuracy of the calibration by inspecting the Reprojection Errors pane and then visualize estimates
of the extrinsic camera parameters in the Camera-centric pane which shows the patterns positioned
with respect to the camera.
numTags = tagArragement(1)*tagArragement(2);
tagIds = zeros(1,numTags);
% Pad image with white boundaries (ensures the tags replace the black
% portions of the checkerboard).
tagSize = round(max(tagLoc(:,2)) - min(tagLoc(:,2)));
padSize = round(tagSize/2 - (size(Ires,2) - tagSize)/2);
Ires = padarray(Ires,[padSize,padSize],255);
I = readimage(imdsTags,idx + 2);
Igray = im2gray(I);
Ires = imresize(Igray,15,"nearest");
Ires = padarray(Ires,[padSize,padSize],255);
tagIds(idx) = readAprilTag(Ires,tagFamily);
end
% Reshape the tag images to ensure that they appear in column-major order
% (montage function places image in row-major order).
columnMajIdx = reshape(1:numTags,tagArragement)';
tagImages = tagImages(:,:,columnMajIdx(:));
end
[~,sortIdx] = sort(tagIds);
tagLocs = tagLocs(:,:,sortIdx);
end
end
numTagRows = tagArrangement(1);
numTagCols = tagArrangement(2);
numTags = numTagRows * numTagCols;
end
helperDrawImageAxesLabels renders the origin, X-axis and Y-axis labels in the calibration images
displayed in the calibrator app.
numBoardRows = boardSize(1)-1;
numBoardCols = boardSize(2)-1;
refPointIdx = find(~isnan(boardCoordsX(:,1)),2);
p2 = boardCoords(refPointIdx(2),1,:);
refPointIdx = find(~isnan(boardCoordsX(1,:)),2);
p3 = boardCoords(1,refPointIdx(2),:);
originLabel.Location = loc;
originLabel.Orientation = theta;
else
originLabel = struct;
end
% X-axis label
firstRowIdx = numBoardCols:-1:1;
refPointIdx13 = find(~isnan(boardCoordsX(1,firstRowIdx)),2);
refPointIdx13 = firstRowIdx(refPointIdx13);
p1 = boardCoords(1,refPointIdx13(1),:);
p3 = boardCoords(1,refPointIdx13(2),:);
refPointIdx2 = find(~isnan(boardCoordsX(:,refPointIdx13(1))),2);
p2 = boardCoords(refPointIdx2(2),refPointIdx13(1),:);
xLabel.Location = loc;
xLabel.Orientation = theta;
% Y-axis label
firstColIdx = numBoardRows:-1:1;
refPointIdx12 = find(~isnan(boardCoordsX(firstColIdx,1)),2);
refPointIdx12 = firstColIdx(refPointIdx12);
p1 = boardCoords(refPointIdx12(1),1,:);
p2 = boardCoords(refPointIdx12(2),1,:);
[loc,theta] = getAxesLabelPosition(p1,p2,p3);
yLabel.Location = loc;
yLabel.Orientation = theta;
%--------------------------------------------------------------
% p1+v
% \
% \ v1
% p1 ------ p2
% |
% v2 |
% |
% p3
function [loc,theta] = getAxesLabelPosition(p1,p2,p3)
v1 = p3 - p1;
theta = -atan2d(v1(2),v1(1));
v2 = p2 - p1;
v = -v1 - v2;
d = hypot(v(1),v(2));
minDist = 40;
if d < minDist
v = (v / d) * minDist;
end
loc = p1 + v;
end
%--------------------------------------------------------------
end
Reference
[1] E. Olson, "AprilTag: A robust and flexible visual fiducial system," 2011 IEEE International
Conference on Robotics and Automation, Shanghai, 2011, pp. 3400-3407, doi: 10.1109/
ICRA.2011.5979561.
Configure Monocular Fisheye Camera
This example shows how to convert a fisheye camera model to a pinhole model and construct a
corresponding monocular camera sensor simulation. In this example, you learn how to calibrate a
fisheye camera and configure a monoCamera (Automated Driving Toolbox) object.
Overview
To estimate the intrinsic parameters, use a checkerboard for camera calibration. Alternatively, to
better visualize the results, use the Camera Calibrator app. For a fisheye camera, it is useful to place
the checkerboard close to the camera, in order to capture large, noticeable distortion in the image.
% Gather a set of calibration images.
images = imageDatastore(fullfile(toolboxdir('vision'), 'visiondata', ...
'calibration', 'gopro'));
imageFileNames = images.Files;
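The calibration call itself is not shown in this excerpt. A minimal sketch of what it might look like, assuming a checkerboard square size of 29 millimeters (an illustrative value):
% Sketch: detect checkerboard corners and estimate fisheye intrinsics.
[imagePoints, boardSize] = detectCheckerboardPoints(imageFileNames);
squareSize = 29; % square size in millimeters (assumed value)
worldPoints = generateCheckerboardPoints(boardSize, squareSize);
I = imread(imageFileNames{1});
imageSize = [size(I, 1), size(I, 2)];
params = estimateFisheyeParameters(imagePoints, worldPoints, imageSize);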
To estimate the extrinsic parameters, use the same checkerboard to estimate the mounting position
of the camera in the vehicle coordinate system. The following step estimates the parameters from one
image. You can also take multiple checkerboard images to obtain multiple estimations, and average
the results.
% Load a different image of the same checkerboard, where the checkerboard
% is placed on the flat ground. Its X-axis is pointing to the right of the
% vehicle, and its Y-axis is pointing to the camera. The image includes
% noticeable distortion, such as along the wall next to the checkerboard.
% Undistort the image and extract the synthetic pinhole camera intrinsics.
[J1, camIntrinsics] = undistortFisheyeImage(I, params.Intrinsics, 'Output', 'full');
imshow(J1)
title('Undistorted Image');
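The steps that estimate the mounting position and construct the sensor are elided in this excerpt. A minimal sketch of how they might look, using the undistorted ground image J1 and the pinhole intrinsics camIntrinsics from above (the square size and the sensor variable name are assumptions for illustration):
% Sketch: estimate camera mounting parameters from the checkerboard on the
% ground, then construct the monocular camera sensor.
[imagePoints, boardSize] = detectCheckerboardPoints(J1);
squareSize = 0.029; % square size in meters (assumed value)
worldPoints = generateCheckerboardPoints(boardSize, squareSize);
patternOriginHeight = 0; % the checkerboard lies on the ground
[pitch, yaw, roll, height] = estimateMonoCameraParameters(camIntrinsics, ...
    imagePoints, worldPoints, patternOriginHeight);
sensor = monoCamera(camIntrinsics, height, Pitch=pitch, Yaw=yaw, Roll=roll);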
Now you can validate the monoCamera (Automated Driving Toolbox) by plotting a bird's-eye view.
% Define bird's-eye-view transformation parameters
distAheadOfSensor = 6; % in meters
spaceToOneSide = 2.5; % look 2.5 meters to the right and 2.5 meters to the left
bottomOffset = 0.2; % look 0.2 meters ahead of the sensor
outView = [bottomOffset, distAheadOfSensor, -spaceToOneSide, spaceToOneSide];
outImageSize = [NaN,1000]; % output image width in pixels
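The construction of the bird's-eye-view image is not shown in this excerpt. A sketch of the remaining steps, assuming the sensor object from the previous sketch and an illustrative annotation:
% Sketch: create the bird's-eye-view object, transform the undistorted
% image, and annotate it for display.
birdsEye = birdsEyeView(sensor, outView, outImageSize);
B = transformImage(birdsEye, J1);
imagePoint0 = vehicleToImage(birdsEye, [distAheadOfSensor, 0]);
annotatedB = insertText(B, imagePoint0, sprintf('%.1f m', distAheadOfSensor));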
figure
imshow(annotatedB)
title('Bird''s-Eye View')
The plot above shows that the camera measures distances accurately. Now you can use the
monocular camera for object and lane boundary detection. See the “Visual Perception Using
Monocular Camera” (Automated Driving Toolbox) example.
Monocular Visual Simultaneous Localization and Mapping
Visual simultaneous localization and mapping (vSLAM) refers to the process of calculating the
position and orientation of a camera with respect to its surroundings, while simultaneously mapping
the environment. The process uses only visual inputs from the camera. Applications for vSLAM
include augmented reality, robotics, and autonomous driving.
This example shows two implementations of processing image data from a monocular camera to build
a map of an indoor environment and estimate the trajectory of the camera:
1 Modular and Modifiable: The first implementation builds a visual SLAM pipeline step-by-step
using functions and classes mentioned in “Implement Visual SLAM in MATLAB” on page 13-8.
It is modular and designed to teach about the details of the vSLAM implementation loosely based
on the popular and reliable ORB-SLAM [1] on page 1-143 algorithm. The code is easily navigable,
allowing you to understand the algorithm and test how its parameters can impact the
performance of the system. This modular implementation is most suitable for experimentation
and modification to test your own ideas.
2 Performant and Deployable: The second implementation uses the monovslam class which
packages all implementation details together and creates a more practical solution in terms of
performance (execution speed) and deployability. You can also generate C/C++ code from
monovslam using MATLAB Coder and deploy the generated code to non-PC hardware.
To speed up computations, you can enable parallel computing from the “Computer Vision Toolbox
Preferences” dialog box. To open Computer Vision Toolbox™ preferences, on the Home tab, in the
Environment section, click Preferences. Then select Computer Vision Toolbox.
Glossary
• Key Frames: A subset of video frames that contain cues for localization and tracking. Two
consecutive key frames usually involve sufficient visual change.
• Map Points: A list of 3-D points that represent the map of the environment reconstructed from
the key frames.
• Covisibility Graph: A graph consisting of key frames as nodes. Two key frames are connected by
an edge if they share common map points. The weight of an edge is the number of shared map
points.
• Essential Graph: A subgraph of the covisibility graph containing only edges with high weight, that
is, edges with more shared map points.
• Place Recognition Database: A database used to recognize whether a place has been visited in
the past. The database stores the visual word-to-image mapping based on the input bag of
features. It is used to search for an image that is visually similar to a query image.
Overview of ORB-SLAM
• Map Initialization: ORB-SLAM starts by initializing the map of 3-D points from two video frames.
The 3-D points and relative camera pose are computed using triangulation based on 2-D ORB
feature correspondences.
• Tracking: Once a map is initialized, for each new frame, the camera pose is estimated by
matching features in the current frame to features in the last key frame. The estimated camera
pose is refined by tracking the local map.
• Local Mapping: The current frame is used to create new 3-D map points if it is identified as a key
frame. At this stage, bundle adjustment is used to minimize reprojection errors by adjusting the
camera pose and 3-D points.
• Loop Closure: Loops are detected for each key frame by comparing it against all previous key
frames using the bag-of-features approach. Once a loop closure is detected, the pose graph is
optimized to refine the camera poses of all the key frames.
The data used in this example are from the TUM RGB-D benchmark [2] on page 1-143. You can
download the data to a temporary directory using a web browser or by running the following code:
baseDownloadURL = "https://cvg.cit.tum.de/rgbd/dataset/freiburg3/rgbd_dataset_freiburg3_long_office_household.tgz";
dataFolder = fullfile(tempdir, 'tum_rgbd_dataset', filesep);
options = weboptions(Timeout=Inf);
tgzFileName = [dataFolder, 'fr3_office.tgz'];
folderExists = exist(dataFolder, "dir");
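The download itself is elided in this excerpt. A sketch of what it might look like:
% Sketch: download and extract the sequence only if it is not already on disk.
if ~folderExists
    mkdir(dataFolder)
    disp('Downloading fr3_office.tgz. This download can take a few minutes.')
    websave(tgzFileName, baseDownloadURL, options);
    disp('Extracting fr3_office.tgz ...')
    untar(tgzFileName, dataFolder);
end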
imageFolder = [dataFolder,'rgbd_dataset_freiburg3_long_office_household/rgb/'];
imds = imageDatastore(imageFolder);
Map Initialization
The ORB-SLAM pipeline starts by initializing the map that holds 3-D world points. This step is crucial
and has a significant impact on the accuracy of the final SLAM result. Initial ORB feature point
correspondences are found using matchFeatures between a pair of images. After the
correspondences are found, two geometric transformation models, a homography (which assumes a
planar scene) and a fundamental matrix (which assumes a non-planar scene), are used to establish the
map initialization.
The homography and the fundamental matrix can be computed using estgeotform2d and
estimateFundamentalMatrix, respectively. The model that results in a smaller reprojection error
is selected to estimate the relative rotation and translation between the two frames using
estrelpose. Since the RGB images are taken by a monocular camera, which does not provide depth
information, the relative translation can be recovered only up to an unknown scale factor.
Given the relative camera pose and the matched feature points in the two images, the 3-D locations of
the matched points are determined using triangulate function. A triangulated map point is valid
when it is located in front of both cameras, when its reprojection error is low, and when the
parallax of the two views of the point is sufficiently large.
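The model selection and triangulation steps are elided from the loop excerpt that follows. A sketch of what they might look like (variable names follow the surrounding fragments, and the reprojection-error-based model selection is omitted):
% Sketch: compute both candidate models, recover the relative pose, and
% triangulate the initial map points.
[tformH, inliersH] = estgeotform2d(preMatchedPoints, currMatchedPoints, "projective");
[F, inliersF] = estimateFundamentalMatrix(preMatchedPoints, currMatchedPoints);
% Assume here that the fundamental matrix gave the smaller reprojection error.
relPose = estrelpose(F, intrinsics, preMatchedPoints(inliersF), currMatchedPoints(inliersF));
camMatrix1 = cameraProjection(intrinsics, rigidtform3d);
camMatrix2 = cameraProjection(intrinsics, pose2extr(relPose));
xyzWorldPoints = triangulate(preMatchedPoints(inliersF), currMatchedPoints(inliersF), ...
    camMatrix1, camMatrix2);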
currFrameIdx = currFrameIdx + 1;
firstI = currI; % Preserve the first frame
isMapInitialized = false;
currFrameIdx = currFrameIdx + 1;
preMatchedPoints = prePoints(indexPairs(:,1),:);
currMatchedPoints = currPoints(indexPairs(:,2),:);
if ~isValid
continue
end
isMapInitialized = true;
if isMapInitialized
close(himage.Parent.Parent); % Close the previous figure
% Show matched features
hfeature = showMatchedFeatures(firstI, currI, prePoints(indexPairs(:,1)), ...
currPoints(indexPairs(:, 2)), "Montage");
else
error('Unable to initialize the map.')
end
After the map is initialized using two frames, you can use imageviewset and worldpointset to
store the two key frames and the corresponding map points:
• imageviewset stores the key frames and their attributes, such as ORB descriptors, feature
points and camera poses, and connections between the key frames, such as feature points
matches and relative camera poses. It also builds and updates a pose graph. The absolute camera
poses and relative camera poses of odometry edges are stored as rigidtform3d objects. The
relative camera poses of loop-closure edges are stored as affinetform3d objects.
• worldpointset stores 3-D positions of the map points and the 3-D into 2-D projection
correspondences: which map points are observed in a key frame and which key frames observe a
map point. It also stores other attributes of map points, such as the mean view direction, the
representative ORB descriptors, and the range of distance at which the map point can be
observed.
% Create an empty imageviewset object to store key frames
vSetKeyFrames = imageviewset;
% Add the first key frame. Place the camera associated with the first
% key frame at the origin, oriented along the Z-axis
preViewId = 1;
vSetKeyFrames = addView(vSetKeyFrames, preViewId, rigidtform3d, Points=prePoints,...
Features=preFeatures.Features);
% Add connection between the first and the second key frame
vSetKeyFrames = addConnection(vSetKeyFrames, preViewId, currViewId, relPose, Matches=indexPairs);
% Add image points corresponding to the map points in the first key frame
mapPointSet = addCorrespondences(mapPointSet, preViewId, newPointIdx, indexPairs(:,1));
% Add image points corresponding to the map points in the second key frame
mapPointSet = addCorrespondences(mapPointSet, currViewId, newPointIdx, indexPairs(:,2));
Loop detection is performed using the bags-of-words approach. A visual vocabulary represented as a
bagOfFeatures object is created offline with the ORB descriptors extracted from a large set of
images in the dataset by calling:
bag = bagOfFeatures(imds, CustomExtractor=@helperORBFeatureExtractorFunction, ...
    TreeProperties=[3, 10], StrongestFeatures=1);
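A place recognition database can then be built from this vocabulary. A minimal sketch (the choice of images added here, the two initial key frames, is an assumption):
% Sketch: create the place recognition database and add the two initial
% key frame images to it.
loopDatabase = invertedImageIndex(bag, SaveFeatureLocations=false);
addImages(loopDatabase, subset(imds, [1, currFrameIdx - 1]), Verbose=false);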
Refine the initial reconstruction using bundleAdjustment, which optimizes both camera poses and
world points to minimize the overall reprojection errors. After the refinement, the attributes of the
map points including 3-D locations, view direction, and depth range are updated. You can use
helperVisualizeMotionAndStructure to visualize the map points and the camera locations.
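The bundleAdjustment call itself is not shown in this excerpt. A sketch of what it might look like, with the triangulated points and key frame poses gathered from the objects defined above (variable names are assumptions):
% Sketch: refine the initial map with full bundle adjustment, keeping the
% first key frame fixed.
tracks = findTracks(vSetKeyFrames);
cameraPoses = poses(vSetKeyFrames);
[refinedPoints, refinedAbsPoses] = bundleAdjustment(xyzWorldPoints, tracks, ...
    cameraPoses, intrinsics, FixedViewIDs=1, PointsUndistorted=true);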
% Scale the map and the camera pose using the median depth of map points
medianDepth = median(vecnorm(refinedPoints.'));
refinedPoints = refinedPoints / medianDepth;
refinedAbsPoses.AbsolutePose(currViewId).Translation = ...
refinedAbsPoses.AbsolutePose(currViewId).Translation / medianDepth;
relPose.Translation = relPose.Translation/medianDepth;
% Show legend
showLegend(mapPlot);
Tracking
The tracking process is performed using every frame and determines when to insert a new key frame.
To simplify this example, we will terminate the tracking process once a loop closure is found.
isLoopClosed = false;
1 ORB features are extracted for each new frame and then matched (using matchFeatures), with
features in the last key frame that have known corresponding 3-D map points.
2 Estimate the camera pose with the Perspective-n-Point algorithm using estworldpose.
3 Given the camera pose, project the map points observed by the last key frame into the current
frame and search for feature correspondences using matchFeaturesInRadius.
4 With 3-D to 2-D correspondence in the current frame, refine the camera pose by performing a
motion-only bundle adjustment using bundleAdjustmentMotion.
5 Project the local map points into the current frame to search for more feature correspondences
using matchFeaturesInRadius and refine the camera pose again using
bundleAdjustmentMotion.
6 The last step of tracking is to decide if the current frame is a new key frame. If the current frame
is a key frame, continue to the Local Mapping process. Otherwise, start Tracking for the next
frame.
If tracking is lost because too few feature points could be matched, try inserting new
key frames more frequently.
% Main loop
isLastFrameKeyFrame = true;
while ~isLoopClosed && currFrameIdx < numel(imds.Files)
currI = readimage(imds, currFrameIdx);
% Track the local map and check if the current frame is a key frame.
% A frame is a key frame if both of the following conditions are satisfied:
%
% 1. At least 20 frames have passed since the last key frame or the
% current frame tracks fewer than 100 map points.
% 2. The map points tracked by the current frame are fewer than 90% of
% points tracked by the reference key frame.
%
% Tracking performance is sensitive to the value of numPointsKeyFrame.
% If tracking is lost, try a larger value.
%
% localKeyFrameIds: ViewId of the connected key frames of the current frame
numSkipFrames = 20;
numPointsKeyFrame = 80;
[localKeyFrameIds, currPose, mapPointsIdx, featureIdx, isKeyFrame] = ...
helperTrackLocalMap(mapPointSet, vSetKeyFrames, mapPointsIdx, ...
featureIdx, currPose, currFeatures, currPoints, intrinsics, scaleFactor, numLevels, ...
isLastFrameKeyFrame, lastKeyFrameIdx, currFrameIdx, numSkipFrames, numPointsKeyFrame);
if ~isKeyFrame
currFrameIdx = currFrameIdx + 1;
isLastFrameKeyFrame = false;
continue
else
isLastFrameKeyFrame = true;
end
Local Mapping
Local mapping is performed for every key frame. When a new key frame is determined, add it to the
key frames and update the attributes of the map points observed by the new key frame. To ensure
that mapPointSet contains as few outliers as possible, a valid map point must be observed in at least
3 key frames.
New map points are created by triangulating ORB feature points in the current key frame and its
connected key frames. For each unmatched feature point in the current key frame, search for a match
with other unmatched points in the connected key frames using matchFeatures. The local bundle
adjustment refines the pose of the current key frame, the poses of connected key frames, and all the
map points observed in these key frames.
% Remove outlier map points that are observed in fewer than 3 key frames
outlierIdx = setdiff(newPointIdx, mapPointsIdx);
if ~isempty(outlierIdx)
mapPointSet = removeWorldPoints(mapPointSet, outlierIdx);
end
Loop Closure
The loop closure detection step takes the current key frame processed by the local mapping process
and tries to detect and close the loop. Loop candidates are identified by querying images in the
database that are visually similar to the current key frame using evaluateImageRetrieval. A
candidate key frame is valid if it is not connected to the last key frame and three of its neighbor key
frames are loop candidates.
When a valid loop candidate is found, use estgeotform3d to compute the relative pose between the
loop candidate frame and the current key frame. The relative pose represents a 3-D similarity
transformation stored in an affinetform3d object. Then add the loop connection with the relative
pose and update mapPointSet and vSetKeyFrames.
% Check loop closure after some key frames have been created
if currKeyFrameId > 20
if isDetected
% Add loop closure connections
[isLoopClosed, mapPointSet, vSetKeyFrames] = helperAddLoopConnections(...
mapPointSet, vSetKeyFrames, validLoopCandidates, currKeyFrameId, ...
currFeatures, loopEdgeNumMatches);
end
end
Finally, a similarity pose graph optimization is performed over the essential graph in vSetKeyFrames
to correct the drift of camera poses. The essential graph is created internally by removing
connections with fewer than minNumMatches matches in the covisibility graph. After similarity pose
graph optimization, update the 3-D locations of the map points using the optimized poses and the
associated scales.
if isLoopClosed
% Optimize the poses
minNumMatches = 20;
vSetKeyFramesOptim = optimizePoses(vSetKeyFrames, minNumMatches, Tolerance=1e-16);
% Update legend
showLegend(mapPlot);
end
You can compare the optimized camera trajectory with the ground truth to evaluate the accuracy of
ORB-SLAM. The downloaded data contains a groundtruth.txt file that stores the ground truth of
camera pose of each frame. You can import the data using the helperImportGroundTruth
function:
gTruthFileName = [dataFolder,'rgbd_dataset_freiburg3_long_office_household/groundtruth.txt'];
In this example, the data has been saved in the form of a MAT-file. Once the ground truth is imported,
you can calculate the root-mean-square-error (RMSE) of trajectory estimates.
% Show legend
showLegend(mapPlot);
You can test the visual SLAM pipeline with a different dataset by tuning the following parameters:
• numPoints: For an image resolution of 480-by-640 pixels, set numPoints to 1000. For higher
resolutions, such as 720-by-1280, set it to 2000. Larger values require more time for feature
extraction.
• numSkipFrames: For a frame rate of 30 fps, set numSkipFrames to 20. For a slower frame rate,
set it to a smaller value. Increasing numSkipFrames improves the tracking speed, but may
result in lost tracking when the camera motion is fast.
The above sections demonstrated the core principles involved in building a visual SLAM pipeline step-
by-step. In this section, you will learn how to use the monovslam class to easily invoke the entire
pipeline in just a few lines of code. This implementation is significantly faster and is suitable for
deployment via C/C++ code generation on a host computer or an embedded platform.
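The creation of the monovslam object and the per-frame loop are elided from the fragments below, which show only the plotting. A minimal sketch of the elided steps (the intrinsic values are the ones used later in the Code Generation section):
% Sketch: create the vSLAM object and feed it every image in the datastore.
% The plotting fragments shown below run inside and after this loop.
intrinsics = cameraIntrinsics([535.4, 539.2], [320.1, 247.6], [480, 640]);
vslam = monovslam(intrinsics);
for frameIdx = 1:numel(imds.Files)
    addFrame(vslam, readimage(imds, frameIdx));
end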
if hasNewKeyFrame(vslam)
% Display 3-D map points and camera trajectory
plot(vslam);
end
end
% Plot intermediate results and wait until all images are processed
while ~isDone(vslam)
if hasNewKeyFrame(vslam)
plot(vslam);
end
end
xyzPoints = mapPoints(vslam);
camPoses = poses(vslam);
Code Generation
You can generate C++ code from the monovslam class suitable for deployment on a host computer or an
embedded platform that has all the third-party dependencies, including OpenCV and g2o [3] on page
1-143. For illustrative purposes, this section generates MEX code. To meet the
requirements of MATLAB Coder, the code has to be restructured to isolate the algorithm from the
visualization code. The algorithmic content was encapsulated in the helperMonoVisualSLAM
function, which takes an image as the input and outputs 3-D world points and camera poses as
matrices. Inside the function, a monovslam object is created and saved into a persistent variable
called vslam. Note that the helperMonoVisualSLAM function does not display the reconstructed 3-D
point cloud or the camera poses. The plot method of the monovslam class was not designed to
generate code because visualization is typically not deployed on embedded systems.
%#codegen
if isempty(vslam)
% Create a monovslam class to process the image data
focalLength = [535.4, 539.2]; % in units of pixels
principalPoint = [320.1, 247.6]; % in units of pixels
imageSize = [480, 640]; % in units of pixels
intrinsics = cameraIntrinsics(focalLength, principalPoint, imageSize);
vslam = monovslam(intrinsics);
end
xyzPoint = xyzPointsInternal;
You can compile the helperMonoVisualSLAM function into a MEX-file using the codegen command.
Note that the generated MEX-file has the same name as the original MATLAB file with _mex
appended, unless you use the -o option to specify the name of the executable.
compileTimeInputs = {coder.typeof(currI)};
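The codegen call itself might look like the following (a sketch; it compiles the helper described above into helperMonoVisualSLAM_mex):
codegen helperMonoVisualSLAM -args compileTimeInputs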
% Clear up
clear helperMonoVisualSLAM_mex
Supporting Functions
Short helper functions are included below. Larger functions are included in separate files.
helperAddLoopConnections adds connections between the current key frame and the valid loop
candidate.
helperCheckLoopClosure detects loop candidate key frames by retrieving visually similar images
from the database.
helperTrackLastKeyFrame estimates the current camera pose by tracking the last key frame.
helperTrackLocalMap refines the current camera pose by tracking the local map.
helperImportGroundTruth imports the camera pose ground truth from the downloaded data.
helperDetectAndExtractFeatures detects and extracts ORB features from the image.
function [features, validPoints] = helperDetectAndExtractFeatures(Irgb, ...
scaleFactor, numLevels, numPoints, varargin)
% Extract features
[features, validPoints] = extractFeatures(Igray, points);
end
inlierPoints1 = matchedPoints1(inliersLogicalIndex);
inlierPoints2 = matchedPoints2(inliersLogicalIndex);
inliersIndex = find(inliersLogicalIndex);
locations1 = inlierPoints1.Location;
locations2 = inlierPoints2.Location;
xy1In2 = transformPointsForward(H, locations1);
xy2In1 = transformPointsInverse(H, locations2);
error1in2 = sum((locations2 - xy1In2).^2, 2);
error2in1 = sum((locations1 - xy2In1).^2, 2);
outlierThreshold = 6;
inlierPoints1 = matchedPoints1(inliersLogicalIndex);
inlierPoints2 = matchedPoints2(inliersLogicalIndex);
inliersIndex = find(inliersLogicalIndex);
locations1 = inlierPoints1.Location;
locations2 = inlierPoints2.Location;
outlierThreshold = 4;
end
% Check parallax
isValid = all(cosAngle < cosd(minParallax) & cosAngle>0);
end
helperUpdateGlobalMap updates the 3-D locations of map points after pose graph optimization
% Update world location of each map point based on the new absolute pose of
% the corresponding major view
for i = indices
majorViewIds = mapPointSet.RepresentativeViewId(i);
poseNew = posesNew(majorViewIds).A;
tform = affinetform3d(poseNew/posesOld(majorViewIds).A);
positionsNew(i, :) = transformPointsForward(tform, positionsOld(i, :));
end
mapPointSet = updateWorldPoints(mapPointSet, indices, positionsNew);
end
References
[1] Mur-Artal, Raul, Jose Maria Martinez Montiel, and Juan D. Tardos. "ORB-SLAM: a versatile and
accurate monocular SLAM system." IEEE Transactions on Robotics 31, no. 5, pp. 1147-1163, 2015.
[2] Sturm, Jürgen, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. "A
benchmark for the evaluation of RGB-D SLAM systems". In Proceedings of IEEE/RSJ International
Conference on Intelligent Robots and Systems, pp. 573-580, 2012.
[3] Kümmerle, Rainer, Giorgio Grisetti, Hauke Strasdat, Kurt Konolige, and Wolfram Burgard. "g2o:
A general framework for graph optimization." In Proceedings of IEEE International Conference on
Robotics and Automation, pp. 3607-3613, 2011.
See Also
Related Examples
• “Stereo Visual Simultaneous Localization and Mapping” on page 1-153
Structure from Motion from Two Views
Structure from motion (SfM) is the process of estimating the 3-D structure of a scene from a set of 2-
D images. This example shows you how to estimate the poses of a calibrated camera from two
images, reconstruct the 3-D structure of the scene up to an unknown scale factor, and then recover
the actual scale factor by detecting an object of a known size.
Overview
This example shows how to reconstruct a 3-D scene from a pair of 2-D images taken with a camera
calibrated using the Camera Calibrator app. The algorithm consists of the following steps:
1 Match a sparse set of points between the two images. There are multiple ways of finding point
correspondences between two images. This example detects corners in the first image using the
detectMinEigenFeatures function, and tracks them into the second image using
vision.PointTracker. Alternatively you can use extractFeatures followed by
matchFeatures.
2 Estimate the essential matrix using estimateEssentialMatrix.
3 Compute the motion of the camera using the estrelpose function.
4 Match a dense set of points between the two images. Re-detect the points using
detectMinEigenFeatures with a reduced 'MinQuality' to get more points. Then track the
dense points into the second image using vision.PointTracker.
5 Determine the 3-D locations of the matched points using triangulate.
6 Detect an object of a known size. In this scene there is a globe, whose radius is known to be
10cm. Use pcfitsphere to find the globe in the point cloud.
7 Recover the actual scale, resulting in a metric reconstruction.
imageDir = fullfile(toolboxdir('vision'),'visiondata','upToScaleReconstructionImages');
images = imageDatastore(imageDir);
I1 = readimage(images, 1);
I2 = readimage(images, 2);
figure
imshowpair(I1, I2, 'montage');
title('Original Images');
This example uses the camera parameters calculated by the Camera Calibrator app. The parameters
are stored in the cameraIntrinsics object, and include the camera intrinsics and lens distortion
coefficients.
Lens distortion can affect the accuracy of the final reconstruction. You can remove the distortion from
each of the images using the undistortImage function. This process straightens the lines that are
bent by the radial distortion of the lens.
I1 = undistortImage(I1, intrinsics);
I2 = undistortImage(I2, intrinsics);
figure
imshowpair(I1, I2, 'montage');
title('Undistorted Images');
Detect good features to track. Increase 'MinQuality' to detect fewer points, which are then more
uniformly distributed throughout the image. If the motion of the camera is not very large, then
tracking using the KLT algorithm is a good way to establish point correspondences.
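The detection and tracking code is elided in this excerpt. A minimal sketch of what it might look like (the parameter values are illustrative):
% Sketch: detect corners in the first image and track them into the second
% image with the KLT tracker.
imagePoints1 = detectMinEigenFeatures(im2gray(I1), MinQuality=0.1);
tracker = vision.PointTracker(MaxBidirectionalError=1, NumPyramidLevels=5);
imagePoints1 = imagePoints1.Location;
initialize(tracker, imagePoints1, I1);
[imagePoints2, validIdx] = step(tracker, I2);
matchedPoints1 = imagePoints1(validIdx, :);
matchedPoints2 = imagePoints2(validIdx, :);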
% Visualize correspondences
figure
showMatchedFeatures(I1, I2, matchedPoints1, matchedPoints2);
title('Tracked Features');
Use the estimateEssentialMatrix function to compute the essential matrix and find the inlier
points that meet the epipolar constraint.
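A sketch of this step, assuming the matched points from the tracking sketch above (the confidence value is illustrative):
% Sketch: estimate the essential matrix and keep only the inlier matches.
[E, epipolarInliers] = estimateEssentialMatrix(matchedPoints1, matchedPoints2, ...
    intrinsics, Confidence=99.99);
inlierPoints1 = matchedPoints1(epipolarInliers, :);
inlierPoints2 = matchedPoints2(epipolarInliers, :);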
Compute the location and orientation of the second camera relative to the first one. Note that the
translation of relPose is a unit vector, because translation can only be computed up to scale.
relPose = estrelpose(E, intrinsics, inlierPoints1, inlierPoints2);
Re-detect points in the first image using lower 'MinQuality' to get more points. Track the new
points into the second image. Estimate the 3-D locations corresponding to the matched points using
the triangulate function, which implements the Direct Linear Transformation (DLT) algorithm [1].
Place the origin at the optical center of the camera corresponding to the first image.
% Detect dense feature points. Use an ROI to exclude points close to the
% image edges.
border = 30;
roi = [border, border, size(I1, 2)- 2*border, size(I1, 1)- 2*border];
imagePoints1 = detectMinEigenFeatures(im2gray(I1), ROI = roi, ...
MinQuality = 0.001);
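The dense tracking and triangulation steps are elided here. A sketch, reusing the KLT tracking pattern from above and placing the first camera at the origin:
% Sketch: track the dense points into the second image, then triangulate.
tracker = vision.PointTracker(MaxBidirectionalError=1, NumPyramidLevels=5);
imagePoints1 = imagePoints1.Location;
initialize(tracker, imagePoints1, I1);
[imagePoints2, validIdx] = step(tracker, I2);
matchedPoints1 = imagePoints1(validIdx, :);
matchedPoints2 = imagePoints2(validIdx, :);
% Camera projection matrices: camera 1 at the origin, camera 2 at relPose.
camMatrix1 = cameraProjection(intrinsics, rigidtform3d);
camMatrix2 = cameraProjection(intrinsics, pose2extr(relPose));
points3D = triangulate(matchedPoints1, matchedPoints2, camMatrix1, camMatrix2);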
Use the plotCamera function to visualize the locations and orientations of the camera, and the
pcshow function to visualize the point cloud.
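A sketch of this visualization, under the same naming assumptions as the previous sketches:
% Sketch: show both cameras and the up-to-scale reconstruction.
figure
plotCamera(AbsolutePose=rigidtform3d, Size=0.05, Label='1')
hold on
grid on
plotCamera(AbsolutePose=relPose, Size=0.05, Label='2')
pcshow(points3D, VerticalAxis='y', VerticalAxisDir='down', MarkerSize=45)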
Find the globe in the point cloud by fitting a sphere to the 3-D points using the pcfitsphere
function.
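A sketch of the sphere fit (the maximum point-to-sphere distance is an illustrative value):
% Sketch: fit a sphere to the dense points to locate the globe.
ptCloud = pointCloud(points3D);
globe = pcfitsphere(ptCloud, 0.1, MaxNumTrials=1000);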
The actual radius of the globe is 10cm. You can now determine the coordinates of the 3-D points in
centimeters.
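A sketch of the rescaling step that follows from the known 10 cm radius:
% Sketch: recover metric scale from the known globe radius (10 cm).
scaleFactor = 10 / globe.Radius;   % centimeters per reconstruction unit
points3D = points3D * scaleFactor;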
xlabel('x-axis (cm)');
ylabel('y-axis (cm)');
zlabel('z-axis (cm)')
title('Metric Reconstruction of the Scene');
Summary
This example showed you how to recover camera motion and reconstruct the 3-D structure of a scene
from two images taken with a calibrated camera.
References
[1] Hartley, Richard, and Andrew Zisserman. Multiple View Geometry in Computer Vision. Second
Edition. Cambridge, 2000.
Stereo Visual Simultaneous Localization and Mapping
Visual simultaneous localization and mapping (vSLAM) refers to the process of calculating the
position and orientation of a camera with respect to its surroundings, while simultaneously mapping
the environment. The process uses only visual inputs from the camera. Applications for vSLAM
include augmented reality, robotics, and autonomous driving.
vSLAM can be performed by using just a monocular camera. However, because depth cannot be
accurately calculated from a single camera, the scale of the map and of the estimated trajectory is
unknown and drifts over time. In addition, because the initial map cannot be triangulated from the
first frame alone, multiple views are required to bootstrap the system. Using a stereo camera solves
these problems and provides a more reliable vSLAM solution.
This example shows how to process image data from a stereo camera to build a map of an outdoor
environment and estimate the trajectory of the camera. The example uses a version of the ORB-SLAM2
[1] on page 1-166 algorithm, which is feature-based and supports stereo cameras.
The pipeline for stereo vSLAM is very similar to the monocular vSLAM pipeline in the “Monocular
Visual Simultaneous Localization and Mapping” on page 1-122 example. The major difference is that
in the Map Initialization stage, 3-D map points are created from the two images of the same stereo
pair instead of from two images of different frames.
• Map Initialization: The pipeline starts by initializing the map of 3-D points from a pair of stereo
images using the disparity map. The left image is stored as the first key frame.
• Tracking: Once a map is initialized, for each new stereo pair, the pose of the camera is estimated
by matching features in the left image to features in the last key frame. The estimated camera
pose is refined by tracking the local map.
• Local Mapping: If the current left image is identified as a key frame, new 3-D map points are
computed from the disparity of the stereo pair. At this stage, bundle adjustment is used to
minimize reprojection errors by adjusting the camera pose and 3-D points.
• Loop Closure: Loops are detected for each key frame by comparing it against all previous key
frames using the bag-of-features approach. Once a loop closure is detected, the pose graph is
optimized to refine the camera poses of all the key frames.
The data used in this example are from the UTIAS Long-Term Localization and Mapping Dataset
provided by the University of Toronto Institute for Aerospace Studies. You can download the data to a
directory using a web browser or by running the following code:
ftpObj = ftp('asrl3.utias.utoronto.ca');
tempFolder = fullfile(tempdir);
dataFolder = [tempFolder, '2020-vtr-dataset/UTIAS-In-The-Dark/'];
zipFileName = [dataFolder, 'run_000005.zip'];
folderExists = exist(dataFolder, 'dir');
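The download and extraction steps are elided here. A sketch of what they might look like (the remote path is an assumption based on the local folder layout above):
% Sketch: download and extract the run only if it is not already on disk.
if ~folderExists
    mkdir(dataFolder)
    mget(ftpObj, '/2020-vtr-dataset/UTIAS-In-The-Dark/run_000005.zip', tempFolder);
    unzip(zipFileName, dataFolder);
end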
imgFolderLeft = [dataFolder,'/images/left/'];
imgFolderRight = [dataFolder,'/images/right/'];
imdsLeft = imageDatastore(imgFolderLeft);
imdsRight = imageDatastore(imgFolderRight);
Map Initialization
The ORB-SLAM pipeline starts by initializing the map that holds 3-D world points. This step is crucial
and has a significant impact on the accuracy of the final SLAM result. Initial ORB feature point
correspondences are found using matchFeatures between two images of a stereo pair. The matched
pairs should satisfy the following constraints:
• The horizontal shift between the two corresponding feature points in the rectified stereo pair
image is less than the maximum disparity. You can determine the approximate maximum disparity
value from the stereo anaglyph of the stereo pair image. For more information, see “Choosing
Range of Disparity”.
• The vertical shift between the two corresponding feature points in the rectified stereo pair image
is less than a threshold.
• The scales of the matched features are nearly identical.
The 3-D world locations corresponding to the matched feature points are determined as follows:
• Use disparitySGM to compute the disparity map for each pair of stereo images by using the
semi-global matching (SGM) method.
• Use reconstructScene to compute the 3-D world point coordinates from the disparity map.
• Find the locations in the disparity map that correspond to the feature points and their 3-D world
locations.
% Set random seed for reproducibility
rng(0);
% Load the initial camera pose. The initial camera pose is derived based
% on the transformation between the camera and the vehicle:
% http://asrl.utias.utoronto.ca/datasets/2020-vtr-dataset/text_files/transform_camera_vehicle.txt
initialPoseData = load("initialPose.mat");
initialPose = initialPoseData.initialPose;
% In this example, the images are already undistorted and rectified. In a general workflow,
% uncomment the following code to undistort and rectify the images.
% currILeft = undistortImage(currILeft, intrinsics);
% currIRight = undistortImage(currIRight, intrinsics);
% stereoParams = stereoParameters(intrinsics, intrinsics, eye(3), [-baseline, 0 0]);
% [currILeft, currIRight] = rectifyStereoImages(currILeft, currIRight, stereoParams, OutputView="full");
% Detect and extract ORB features from the rectified stereo images
scaleFactor = 1.2;
numLevels = 8;
[currFeaturesLeft, currPointsLeft] = helperDetectAndExtractFeatures(im2gray(currILeft), scaleFactor, numLevels);
[currFeaturesRight, currPointsRight] = helperDetectAndExtractFeatures(im2gray(currIRight), scaleFactor, numLevels);
% Match feature points between the stereo images and get the 3-D world positions
disparityRange = [0 48]; % specified in pixels
[xyzPoints, matchedPairs] = helperReconstructFromStereo(im2gray(currILeft), im2gray(currIRight), ...
    currFeaturesLeft, currFeaturesRight, currPointsLeft, currPointsRight, reprojectionMatrix, initialPose, disparityRange);
After the map is initialized using the first stereo pair, you can use imageviewset and
worldpointset to store the first key frames and the corresponding map points:
% Show legend
showLegend(mapPlot);
Loop detection is performed using the bags-of-words approach. A visual vocabulary represented as a
bagOfFeatures object is created offline with the ORB descriptors extracted from a large set of
images in the dataset by calling:
bag = bagOfFeatures(imds, CustomExtractor=@helperORBFeatureExtractorFunction, ...
    TreeProperties=[3, 10], StrongestFeatures=1);
Tracking
The tracking process is performed using every stereo pair and determines when to insert a new key frame.
currFrameIdx = 2;
isLoopClosed = false;
1 ORB features are extracted for each new stereo pair of images and then matched (using
matchFeatures), with features in the last key frame that have known corresponding 3-D map
points.
2 Estimate the camera pose with the Perspective-n-Point algorithm using estworldpose.
3 Given the camera pose, project the map points observed by the last key frame into the current
frame and search for feature correspondences using matchFeaturesInRadius.
4 With 3-D to 2-D correspondences in the current frame, refine the camera pose by performing a
motion-only bundle adjustment using bundleAdjustmentMotion.
5 Project the local map points into the current frame to search for more feature correspondences
using matchFeaturesInRadius and refine the camera pose again using
bundleAdjustmentMotion.
6 The last step of tracking is to decide if the current frame should be a new key frame. A frame is a
key frame if both of the following conditions are satisfied:
• At least 5 frames have passed since the last key frame or the current frame tracks fewer than 80
map points.
• The map points tracked by the current frame are fewer than 90% of points tracked by the
reference key frame.
If the current frame is to become a key frame, continue to the Local Mapping process. Otherwise,
start Tracking for the next frame.
% Main loop
isLastFrameKeyFrame = true;
while ~isLoopClosed && currFrameIdx <= numel(imdsLeft.Files)
currILeftGray = im2gray(currILeft);
currIRightGray = im2gray(currIRight);
% Track the local map and check if the current frame is a key frame.
% localKeyFrameIds: ViewId of the connected key frames of the current frame
numSkipFrames = 5;
numPointsKeyFrame = 80;
[localKeyFrameIds, currPose, trackedMapPointsIdx, trackedFeatureIdx, isKeyFrame] = ...
helperTrackLocalMap(mapPointSet, vSetKeyFrames, trackedMapPointsIdx, ...
trackedFeatureIdx, currPose, currFeaturesLeft, currPointsLeft, intrinsics, scaleFactor, numLevels, ...
isLastFrameKeyFrame, lastKeyFrameIdx, currFrameIdx, numSkipFrames, numPointsKeyFrame);
% Match feature points between the stereo images and get the 3-D world positions
[xyzPoints, matchedPairs] = helperReconstructFromStereo(currILeftGray, currIRightGray, currFeaturesLeft, ...
    currFeaturesRight, currPointsLeft, currPointsRight, reprojectionMatrix, currPose, disparityRange);
Local Mapping
Local mapping is performed for every key frame. When a new key frame is determined, add it to the
key frames and update the attributes of the map points observed by the new key frame. To ensure
that mapPointSet contains as few outliers as possible, a valid map point must be observed in at least
3 key frames.
New map points are created by triangulating ORB feature points in the current key frame and its
connected key frames. For each unmatched feature point in the current key frame, search for a match
with other unmatched points in the connected key frames using matchFeatures. The local bundle
adjustment refines the pose of the current key frame, the poses of connected key frames, and all the
map points observed in these key frames.
% Remove outlier map points that are observed in fewer than 3 key frames
if currKeyFrameId == 2
triangulatedMapPointsIdx = [];
end
Loop Closure
The loop closure step takes the current key frame processed by the local mapping process and tries
to detect and close the loop. Loop candidates are identified by querying images in the database that
are visually similar to the current key frame using evaluateImageRetrieval. A candidate key
frame is valid if it is not connected to the last key frame and three of its neighbor key frames are loop
candidates.
When a valid loop closure candidate is found, compute the relative pose between the loop closure
candidate frame and the current key frame using estgeotform3d. Then add the loop connection
with the relative pose and update mapPointSet and vSetKeyFrames.
% Check loop closure after some key frames have been created
if currKeyFrameId > 50
Finally, apply pose graph optimization over the essential graph in vSetKeyFrames to correct the
drift. The essential graph is created internally by removing connections with fewer than
minNumMatches matches in the covisibility graph. After pose graph optimization, update the 3-D
locations of the map points using the optimized poses.
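The optimization call is not shown in this excerpt; a sketch, mirroring the monocular example above (the argument order of the helper is an assumption):
% Sketch: optimize the pose graph over the essential graph, then update
% the 3-D map point locations with the optimized poses.
minNumMatches = 20;
vSetKeyFramesOptim = optimizePoses(vSetKeyFrames, minNumMatches, Tolerance=1e-16);
mapPointSet = helperUpdateGlobalMap(mapPointSet, vSetKeyFrames, vSetKeyFramesOptim);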
% Update legend
showLegend(mapPlot);
You can compare the optimized camera trajectory with the ground truth to evaluate the accuracy of
the solution. The downloaded data contains a gps.txt file that stores the GPS location for each
frame. You can convert the GPS location from geographic to local Cartesian coordinates using
latlon2local (Automated Driving Toolbox) from Automated Driving Toolbox or geodetic2enu
(Mapping Toolbox) from Mapping Toolbox. In this example, you can simply load the converted GPS
data from an M-file.
% Show legend
showLegend(mapPlot);
Given the refined camera poses, you can perform dense reconstruction from the stereo images
corresponding to the key frames.
for i = 1: numel(addedFramesIdx)
ILeft = readimage(imdsLeft, addedFramesIdx(i));
IRight = readimage(imdsRight, addedFramesIdx(i));
% Ignore the upper half of the images which mainly show the sky
xyzPoints(1:floor(imageSize(1)/2), :, :) = NaN;
% Ignore the lower part of the images which show the vehicle
xyzPoints(imageSize(1)-50:end, :, :) = NaN;
% Remove world points that are too far away from the camera
validIndex = xyzPoints(:, 3) > 0 & xyzPoints(:, 3) < 100/reprojectionMatrix(4, 3);
xyzPoints = xyzPoints(validIndex, :);
colors = colors(validIndex, :);
Supporting Functions
Short helper functions are listed below. Larger functions are included in separate files.
helperDetectAndExtractFeatures detects and extracts ORB features from the image.
function [features, validPoints] = helperDetectAndExtractFeatures(Igray, scaleFactor, numLevels)
numPoints = 600;
% Extract features
[features, validPoints] = extractFeatures(Igray, points);
end
helperReconstructFromStereo reconstructs the scene from a stereo pair using the disparity map
function [xyzPoints, indexPairs] = helperReconstructFromStereo(I1, I2, ...
features1, features2, points1, points2, reprojectionMatrix, currPose, disparityRange)
% Compute disparity for all pixels in the left image. In practice, it is more
% common to compute disparity just for the matched feature points.
disparityMap = disparitySGM(I1, I2, DisparityRange=disparityRange);
xyzPointsAll = reconstructScene(disparityMap, reprojectionMatrix);
for i = 1:size(locations, 1)
point3d = squeeze(xyzPointsAll(locations(i,1), locations(i, 2), :))';
isPointValid = point3d(3) > 0 & point3d(3) < 200/reprojectionMatrix(4, 3);
if isPointValid
xyzPoints = [xyzPoints; point3d]; %#ok<*AGROW>
isPointFound(i) = true;
end
end
indexPairs = indexPairs(isPointFound, :);
xyzPoints = transformPointsForward(currPose, xyzPoints);
end
if ~isempty(outlierIdx)
mapPointSet = removeWorldPoints(mapPointSet, outlierIdx);
end
end
helperUpdateGlobalMap updates the 3-D locations of map points after pose graph optimization
posesOld = vSetKeyFrames.Views.AbsolutePose;
posesNew = vSetKeyFramesOptim.Views.AbsolutePose;
positionsOld = mapPointSet.WorldPoints;
positionsNew = positionsOld;
indices = 1:mapPointSet.Count;
% Update world location of each map point based on the new absolute pose of
References
[1] Mur-Artal, Raul, and Juan D. Tardós. "ORB-SLAM2: An open-source SLAM system for monocular,
stereo, and RGB-D cameras." IEEE Transactions on Robotics 33, no. 5 (2017): 1255-1262.
Evaluating the Accuracy of Single Camera Calibration
This example shows how to evaluate the accuracy of camera parameters estimated using the “Using
the Single Camera Calibrator App” on page 18-22 app or the estimateCameraParameters
function.
Overview
Camera calibration is the process of estimating parameters of the camera using images of a special
calibration pattern. The parameters include camera intrinsics, distortion coefficients, and camera
extrinsics. Once you calibrate a camera, there are several ways to evaluate the accuracy of the
estimated parameters:
• Plot the relative locations of the camera and the calibration pattern
• Calculate the reprojection errors
• Calculate the parameter estimation errors
Extrinsics
You can quickly discover obvious errors in your calibration by plotting relative locations of the
camera and the calibration pattern. Use the showExtrinsics function to either plot the locations of
the calibration pattern in the camera's coordinate system, or the locations of the camera in the
pattern's coordinate system. Look for obvious problems, such as the pattern being behind the
camera, or the camera being behind the pattern. Also check if a pattern is too far or too close to the
camera.
figure;
showExtrinsics(params, "CameraCentric");
figure;
showExtrinsics(params, "PatternCentric");
Reprojection Errors
Reprojection errors provide a qualitative measure of accuracy. A reprojection error is the distance
between a pattern keypoint detected in a calibration image, and a corresponding world point
projected into the same image. The showReprojectionErrors function provides a useful
visualization of the average reprojection error in each calibration image. If the overall mean
reprojection error is too high, consider excluding the images with the highest error and recalibrating.
figure;
showReprojectionErrors(params);
Estimation Errors
displayErrors(estimationErrors, params);
Intrinsics
----------
Focal length (pixels): [ 714.1886 +/- 3.3219 710.3785 +/- 4.0579 ]
Principal point (pixels):[ 563.6481 +/- 5.3967 355.7252 +/- 3.3036 ]
Radial distortion: [ -0.3536 +/- 0.0091 0.1730 +/- 0.0488 ]
Extrinsics
----------
Rotation vectors:
[ -0.6096 +/- 0.0054 -0.1789 +/- 0.0073 -0.3835 +/- 0.0024
[ -0.7283 +/- 0.0050 -0.0996 +/- 0.0072 0.1964 +/- 0.0027
[ -0.6722 +/- 0.0051 -0.1444 +/- 0.0074 -0.1329 +/- 0.0026
[ -0.5836 +/- 0.0056 -0.2901 +/- 0.0074 -0.5622 +/- 0.0025
Whether a particular reprojection or estimation error is acceptable depends on the precision
requirements of your application. However, if you have determined that your calibration
accuracy is unacceptable, there are several ways to improve it:
• Modify calibration settings. Try using 3 radial distortion coefficients, estimating the tangential
distortion, or estimating the skew.
• Take more calibration images. The pattern in the images must be in different 3D orientations, and
it should be positioned such that you have keypoints in all parts of the field of view. In particular, it
is very important to have keypoints close to the edges and the corners of the image in order to get
a better estimate of the distortion coefficients.
• Exclude images that have high reprojection errors and re-calibrate.
Summary
This example showed how to obtain and interpret camera calibration errors.
References
[1] Z. Zhang. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 22(11):1330-1334, 2000.
Measuring Planar Objects with a Calibrated Camera
This example shows how to measure the diameter of coins in world units using a single calibrated
camera.
Overview
This example shows how to calibrate a camera, and then use it to measure the size of planar objects,
such as coins. An example application of this approach is measuring parts on a conveyor belt for
quality control.
Camera calibration is the process of estimating the parameters of the lens and the image sensor.
These parameters are needed to measure objects captured by the camera. This example shows how
to calibrate a camera programmatically. Alternatively, you can calibrate a camera using the “Using
the Single Camera Calibrator App” on page 18-22 app.
To calibrate the camera, we first need to take multiple images of a calibration pattern from different
angles. A typical calibration pattern is an asymmetric checkerboard, where one side contains an even
number of squares, both black and white, and the other contains an odd number of squares.
The pattern must be affixed to a flat surface, and it should be at approximately the same distance
from the camera as the objects you want to measure. The size of a square must be measured in world
units, for example millimeters, as precisely as possible. In this example we use 9 images of the
pattern, but in practice it is recommended to use 10 to 20 images for accurate calibration.
numImages = 9;
files = cell(1, numImages);
for i = 1:numImages
files{i} = fullfile(matlabroot, 'toolbox', 'vision', 'visiondata', ...
'calibration', 'slr', sprintf('image%d.jpg', i));
end
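The pattern detection and calibration calls are elided in this excerpt. A minimal sketch, assuming a 29 mm square size (an illustrative value):
% Sketch: detect the checkerboard, generate world points, and calibrate.
[imagePoints, boardSize] = detectCheckerboardPoints(files);
squareSize = 29; % square size in millimeters (assumed value)
worldPoints = generateCheckerboardPoints(boardSize, squareSize);
I = imread(files{1});
imageSize = [size(I, 1), size(I, 2)];
cameraParams = estimateCameraParameters(imagePoints, worldPoints, ImageSize=imageSize);
% Visualize the per-image mean reprojection errors.
showReprojectionErrors(cameraParams);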
The bar graph indicates the accuracy of the calibration. Each bar shows the mean reprojection error
for the corresponding calibration image. The reprojection errors are the distances between the
corner points detected in the image, and the corresponding ideal world points projected into the
image.
Load the image containing objects to be measured. This image includes the calibration pattern, and
the pattern is in the same plane as the objects you want to measure. In this example, both the pattern
and the coins are on the same table top.
Alternatively, you could use two separate images: one containing the pattern, and the other
containing the objects to be measured. Again, the objects and the pattern must be in the same plane.
Furthermore, images must be captured from exactly the same view point, meaning that the camera
must be fixed in place.
Use the cameraParameters object to remove lens distortion from the image. This is necessary for
accurate measurement.
% Since the lens introduced little distortion, use 'full' output view to illustrate that
% the image was undistorted. If we used the default 'same' option, it would be difficult
% to notice any difference when compared to the original image. Notice the small black borders.
[im, newOrigin] = undistortImage(imOrig, cameraParams, OutputView = "full");
figure; imshow(im, InitialMagnification = magnification);
title("Undistorted Image");
Note that this image exhibits very little lens distortion. The undistortion step is far more important if
you use a wide-angle lens, or a low-end webcam.
Segment Coins
In this case, the coins are colorful on white background. Use the saturation component of the HSV
representation of the image to segment them out.
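A sketch of the segmentation step, thresholding the saturation channel with Otsu's method:
% Sketch: segment the coins using the saturation channel of the HSV image.
imHSV = rgb2hsv(im);
saturation = imHSV(:, :, 2);
t = graythresh(saturation);
imCoin = saturation > t;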
Detect Coins
We can assume that the two largest connected components in the segmented image correspond to the
coins.
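The detection code is elided here. A sketch using vision.BlobAnalysis (the minimum blob area is an illustrative value; the variable imCoin comes from the segmentation sketch above):
% Sketch: find blobs, keep the two largest, and annotate them on the image.
blobAnalysis = vision.BlobAnalysis(AreaOutputPort=true, ...
    CentroidOutputPort=false, BoundingBoxOutputPort=true, ...
    MinimumBlobArea=200, ExcludeBorderBlobs=true);
[areas, boxes] = step(blobAnalysis, imCoin);
[~, idx] = sort(areas, "descend");
boxes = double(boxes(idx(1:2), :));
imDetectedCoins = insertObjectAnnotation(im, "rectangle", boxes, "Coin");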
figure; imshow(imDetectedCoins);
title("Detected Coins");
Compute Extrinsics
To map points in the image coordinates to points in the world coordinates we need to compute the
rotation and the translation of the camera relative to the calibration pattern. Note that the
estimateExtrinsics function assumes that there is no lens distortion. In this case imagePoints
have been detected in an image that has already been undistorted using undistortImage.
% Adjust the imagePoints so that they are expressed in the coordinate system
% used in the original image, before it was undistorted. This adjustment
% makes it compatible with the cameraParameters object computed for the original image.
imagePoints = imagePoints + newOrigin; % adds newOrigin to every row of imagePoints
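The checkerboard detection and the extrinsics estimation themselves are elided here; a sketch of the estimation call (worldPoints comes from the calibration step):
% Sketch: estimate the camera extrinsics relative to the pattern using the
% adjusted image points.
camExtrinsics = estimateExtrinsics(imagePoints, worldPoints, cameraParams.Intrinsics);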
To measure the first coin we convert the top-left and the top-right corners of the bounding box into
world coordinates. Then we compute the Euclidean distance between them in millimeters. Note that
the actual diameter of a US penny is 19.05 mm.
% Adjust upper left corners of bounding boxes for coordinate system shift
% caused by undistortImage with output view of 'full'. This would not be
% needed if the output was 'same'. The adjustment makes the points compatible
% with the cameraParameters of the original image.
boxes = boxes + [newOrigin, 0, 0]; % zero padding is added for width and height
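The measurement itself is elided here. A sketch using img2world2d and the extrinsics from the previous sketch (variable names are assumptions):
% Sketch: convert the top corners of the first bounding box to world
% coordinates and measure the diameter in millimeters.
box1 = boxes(1, :);
imagePoints1 = [box1(1:2); box1(1) + box1(3), box1(2)]; % top-left and top-right corners
worldPoints1 = img2world2d(imagePoints1, camExtrinsics, cameraParams.Intrinsics);
d = worldPoints1(2, :) - worldPoints1(1, :);
diameterInMillimeters = hypot(d(1), d(2));
fprintf('Measured diameter of one coin = %0.2f mm\n', diameterInMillimeters);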
Measure the second coin the same way as the first coin.
In addition to measuring the size of the coin, we can also measure how far away it is from the
camera.
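A sketch of the distance computation, using the camera extrinsics from above:
% Sketch: compute the camera location in pattern coordinates and its
% distance to the center of the first coin.
camPose = extr2pose(camExtrinsics);
cameraLocation = camPose.Translation;
center1 = worldPoints1(1, :) + d / 2;
distanceToCamera = norm([center1, 0] - cameraLocation);
fprintf('Distance from the camera to the first coin = %0.2f mm\n', distanceToCamera);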
Summary
This example showed how to use a calibrated camera to measure planar objects. Note that the
measurements were accurate to within 0.2 mm.
References
[1] Z. Zhang. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 22(11):1330-1334, 2000.
Depth Estimation from Stereo Video
This example shows how to detect people and their distance to the camera from a video taken with a
calibrated stereo camera.
Load the stereoParameters object, which is the result of calibrating the camera using either the
stereoCameraCalibrator app or the estimateCameraParameters function.
videoFileLeft = 'handshake_left.avi';
videoFileRight = 'handshake_right.avi';
readerLeft = VideoReader(videoFileLeft);
readerRight = VideoReader(videoFileRight);
player = vision.VideoPlayer('Position', [20,200,740 560]);
The frames from the left and the right cameras must be rectified in order to compute disparity and
reconstruct the 3-D scene. Rectified images have horizontal epipolar lines, and are row-aligned. This
simplifies the computation of disparity by reducing the search space for matching points to one
dimension. Rectified images can also be combined into an anaglyph, which can be viewed using the
stereo red-cyan glasses to see the 3-D effect.
frameLeft = readFrame(readerLeft);
frameRight = readFrame(readerRight);
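The rectification call is elided here. A sketch of what it might look like:
% Sketch: rectify the stereo frames with the calibrated stereoParams; the
% reprojection matrix is used later for 3-D scene reconstruction.
[frameLeftRect, frameRightRect, reprojectionMatrix] = ...
    rectifyStereoImages(frameLeft, frameRight, stereoParams);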
figure;
imshow(stereoAnaglyph(frameLeftRect, frameRightRect));
title('Rectified Video Frames');
Compute Disparity
In rectified stereo images any pair of corresponding points are located on the same pixel row. For
each pixel in the left image compute the distance to the corresponding pixel in the right image. This
distance is called the disparity, and it is proportional to the distance of the corresponding world point
from the camera.
frameLeftGray = im2gray(frameLeftRect);
frameRightGray = im2gray(frameRightRect);
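The disparity computation is elided here. A sketch (the display range is illustrative):
% Sketch: compute the disparity map with semi-global matching and show it.
disparityMap = disparitySGM(frameLeftGray, frameRightGray);
figure;
imshow(disparityMap, [0, 64]);
title('Disparity Map');
colormap jet
colorbar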
Reconstruct the 3-D world coordinates of points corresponding to each pixel from the disparity map.
points3D = reconstructScene(disparityMap, reprojectionMatrix);
% Create the people detector object. Limit the minimum object size for
% speed.
peopleDetector = vision.PeopleDetector('MinSize', [166 83]);
% Detect people.
bboxes = peopleDetector.step(frameLeftGray);
Find the 3-D world coordinates of the centroid of each detected person and compute the distance
from the centroid to the camera in meters.
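A sketch of this step (it assumes the 3-D points from reconstructScene are in millimeters, as is typical when the stereo calibration used millimeter world units):
% Sketch: compute detection centroids, look up their 3-D locations, and
% annotate each detection with its distance in meters.
centroids = [round(bboxes(:, 1) + bboxes(:, 3) / 2), ...
    round(bboxes(:, 2) + bboxes(:, 4) / 2)];
centroidsIdx = sub2ind(size(disparityMap), centroids(:, 2), centroids(:, 1));
X = points3D(:, :, 1); Y = points3D(:, :, 2); Z = points3D(:, :, 3);
centroids3D = [X(centroidsIdx), Y(centroidsIdx), Z(centroidsIdx)];
dists = sqrt(sum(centroids3D .^ 2, 2)) / 1000; % meters, assuming millimeter units
labels = cell(1, numel(dists));
for i = 1:numel(dists)
    labels{i} = sprintf('%0.2f meters', dists(i));
end
dispFrame = insertObjectAnnotation(frameLeftRect, 'rectangle', bboxes, labels);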
Apply the steps described above to detect people and measure their distances to the camera in every
frame of the video.
while hasFrame(readerLeft) && hasFrame(readerRight)
% Read the frames.
frameLeft = readFrame(readerLeft);
frameRight = readFrame(readerRight);
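% The rectification step is elided in this excerpt; a sketch assuming the
% same stereoParams as above.
[frameLeftRect, frameRightRect] = ...
    rectifyStereoImages(frameLeft, frameRight, stereoParams);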
% Convert to grayscale.
frameLeftGray = im2gray(frameLeftRect);
frameRightGray = im2gray(frameRightRect);
% Compute disparity.
disparityMap = disparitySGM(frameLeftGray, frameRightGray);
% Detect people.
bboxes = peopleDetector.step(frameLeftGray);
if ~isempty(bboxes)
% Find the centroids of detected people.
centroids = [round(bboxes(:, 1) + bboxes(:, 3) / 2), ...
round(bboxes(:, 2) + bboxes(:, 4) / 2)];
% Clean up
release(player);
Summary
This example showed how to localize pedestrians in 3-D using a calibrated stereo camera.
References
[1] G. Bradski and A. Kaehler, "Learning OpenCV : Computer Vision with the OpenCV Library,"
O'Reilly, Sebastopol, CA, 2008.
[2] Dalal, N. and Triggs, B., Histograms of Oriented Gradients for Human Detection. CVPR 2005.
Structure from Motion from Multiple Views
Structure from motion (SfM) is the process of estimating the 3-D structure of a scene from a set of 2-
D views. It is used in many applications, such as robot navigation, autonomous driving, and
augmented reality. This example shows you how to estimate the poses of a calibrated camera from a
sequence of views, and reconstruct the 3-D structure of the scene up to an unknown scale factor.
Overview
This example shows how to reconstruct a 3-D scene from a sequence of 2-D views taken with a
camera calibrated using the Camera Calibrator. The example uses an imageviewset object to store
and manage the data associated with each view, such as the camera pose and the image points, as
well as matches between points from pairs of views.
The example uses the pairwise point matches to estimate the camera pose of the current view
relative to the previous view. It then links the pairwise matches into longer point tracks spanning
multiple views using the findTracks method of the imageviewset object. These tracks then serve
as inputs to multiview triangulation using the triangulateMultiview function and the refinement
of camera poses and the 3-D scene points using the bundleAdjustment function.
The example consists of two main parts: camera motion estimation and dense scene reconstruction.
In the first part, the example estimates the camera pose for each view using a sparse set of points
matched across the views. In the second part, the example iterates over the sequence of views again,
using vision.PointTracker to track a dense set of points across the views, to compute a dense 3-D
reconstruction of the scene.
Read the input image sequence.

imageDir = fullfile(toolboxdir('vision'), 'visiondata', ...
    'structureFromMotion');
imds = imageDatastore(imageDir);
Use an imageviewset object to store and manage the image points and the camera pose associated
with each view, as well as point matches between pairs of views. Once you populate an
imageviewset object, you can use it to find point tracks across multiple views and retrieve the
camera poses to be used by triangulateMultiview and bundleAdjustment functions.
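The code below uses feature points and descriptors from the first view (prevPoints and prevFeatures), which the full example computes before populating the view set. A brief sketch of that step, assuming the images cell array and the camera intrinsics from the earlier (omitted) setup code, with illustrative SURF parameters:

% Undistort the first image and detect features in it.
I = undistortImage(images{1}, intrinsics);
prevPoints = detectSURFFeatures(im2gray(I), MetricThreshold=500);
prevFeatures = extractFeatures(im2gray(I), prevPoints, Upright=true);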
% Create an empty imageviewset object to manage the data associated with each
% view.
vSet = imageviewset;
% Add the first view. Place the camera associated with the first view
% at the origin, oriented along the Z-axis.
viewId = 1;
vSet = addView(vSet, viewId, rigidtform3d, Points=prevPoints);
for i = 2:numel(images)
% Undistort the current image.
I = undistortImage(images{i}, intrinsics);
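% (Sketch of the feature detection and matching step that the full example
% performs at this point; parameter values are illustrative. The full
% example also estimates the absolute pose of the current view and adds the
% view to vSet before storing the connection below.)
currPoints = detectSURFFeatures(im2gray(I), MetricThreshold=500);
currFeatures = extractFeatures(im2gray(I), currPoints, Upright=true);
indexPairs = matchFeatures(prevFeatures, currFeatures, MaxRatio=0.7, Unique=true);
matchedPoints1 = prevPoints(indexPairs(:, 1));
matchedPoints2 = currPoints(indexPairs(:, 2));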
% Estimate the camera pose of current view relative to the previous view.
% The pose is computed up to scale, meaning that the distance between
% the cameras in the previous view and the current view is set to 1.
% This will be corrected by the bundle adjustment.
[relPose, inlierIdx] = helperEstimateRelativePose(...
matchedPoints1, matchedPoints2, intrinsics);
% Store the point matches between the previous and the current views.
vSet = addConnection(vSet, i-1, i, relPose, Matches=indexPairs(inlierIdx,:));
prevFeatures = currFeatures;
prevPoints = currPoints;
end
Go through the images again. This time, detect a dense set of corners and track them across all views
using vision.PointTracker.
% Read and undistort the first image
I = undistortImage(images{1}, intrinsics);
% Create the point tracker object to track the points across views.
tracker = vision.PointTracker(MaxBidirectionalError=1, NumPyramidLevels=6);
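The initialization and tracking code is abbreviated here. A brief sketch of the typical usage, with an illustrative corner-detection quality threshold:

% Detect a dense set of corners in the first image and initialize the tracker.
prevPoints = detectMinEigenFeatures(im2gray(I), MinQuality=0.001);
initialize(tracker, prevPoints.Location, I);

% Track the points through each subsequent view. The full example stores
% the tracked points in a second imageviewset for triangulation.
for i = 2:numel(images)
    I = undistortImage(images{i}, intrinsics);
    [currPoints, validIdx] = step(tracker, I);
end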
Display the dense 3-D reconstruction of the scene in a figure titled Dense Reconstruction.
References
[1] M.I.A. Lourakis and A.A. Argyros (2009). "SBA: A Software Package for Generic Sparse Bundle
Adjustment". ACM Transactions on Mathematical Software (ACM) 36 (1): 1-30.
[2] R. Hartley, A. Zisserman, "Multiple View Geometry in Computer Vision," Cambridge University
Press, 2003.
Uncalibrated Stereo Image Rectification
Stereo image rectification projects images onto a common image plane in such a way that the
corresponding points have the same row coordinates. This process is useful for stereo vision, because
the 2-D stereo correspondence problem is reduced to a 1-D problem. As an example, stereo image
rectification is often used as a preprocessing step for computing disparity or creating anaglyph
images. For more details, see the “Depth Estimation from Stereo Video” on page 1-181 example.
Read in two color images of the same scene, which were taken from different positions. Then, convert
them to grayscale. Colors are not required for the matching process.
I1 = imread("yellowstone_left.png");
I2 = imread("yellowstone_right.png");
% Convert to grayscale.
I1gray = im2gray(I1);
I2gray = im2gray(I2);
Display both images side by side. Then, display a color composite demonstrating the pixel-wise
differences between the images.
figure
imshowpair(I1,I2,"montage")
title("I1 (left); I2 (right)")
figure
imshow(stereoAnaglyph(I1,I2))
title("Composite Image (Red - Left Image, Cyan - Right Image)")
There is an obvious offset between the images in orientation and position. The goal of rectification is
to transform the images, aligning them such that corresponding points will appear on the same rows
in both images.
The rectification process requires a set of point correspondences between the two images. To
generate these correspondences, you will collect points of interest from both images, and then choose
potential matches between them. Use detectSURFFeatures to find blob-like features in both
images.
blobs1 = detectSURFFeatures(I1gray,MetricThreshold=2000);
blobs2 = detectSURFFeatures(I2gray,MetricThreshold=2000);
Visualize the location and scale of the thirty strongest SURF features in I1 and I2. Notice that not all
of the detected features can be matched because they were either not detected in both images or
because some of them were not present in one of the images due to camera motion.
figure
imshow(I1)
hold on
plot(selectStrongest(blobs1,30))
title("Thirty Strongest SURF Features In I1")
figure
imshow(I2)
hold on
plot(selectStrongest(blobs2,30))
title("Thirty Strongest SURF Features In I2")
Use the extractFeatures and matchFeatures functions to find putative point correspondences.
For each blob, compute the SURF feature vectors (descriptors).
[features1,validBlobs1] = extractFeatures(I1gray,blobs1);
[features2,validBlobs2] = extractFeatures(I2gray,blobs2);
Use the sum of absolute differences (SAD) metric to determine indices of matching features.
indexPairs = matchFeatures(features1,features2,Metric="SAD", ...
    MatchThreshold=5);

% Retrieve the locations of the matched points.
matchedPoints1 = validBlobs1(indexPairs(:,1),:);
matchedPoints2 = validBlobs2(indexPairs(:,2),:);
Show matching points on top of the composite image, which combines stereo images. Notice that
most of the matches are correct, but there are still some outliers.
figure
showMatchedFeatures(I1, I2, matchedPoints1, matchedPoints2)
legend("Putatively Matched Points In I1","Putatively Matched Points In I2")
The correctly matched points must satisfy epipolar constraints. This means that a point must lie on
the epipolar line determined by its corresponding point. You will use the
estimateFundamentalMatrix function to compute the fundamental matrix and find the inliers that
meet the epipolar constraint.
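The code for this step is abbreviated here. A sketch of the fundamental matrix estimation and inlier selection, with illustrative RANSAC parameters:

[fMatrix, epipolarInliers] = estimateFundamentalMatrix( ...
    matchedPoints1, matchedPoints2, Method="RANSAC", ...
    NumTrials=10000, DistanceThreshold=0.1, Confidence=99.99);

% Keep only the inlier points.
inlierPoints1 = matchedPoints1(epipolarInliers, :);
inlierPoints2 = matchedPoints2(epipolarInliers, :);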
figure
showMatchedFeatures(I1, I2, inlierPoints1, inlierPoints2)
legend("Inlier Points In I1","Inlier Points In I2")
Rectify the stereo images, and display them as a stereo anaglyph. You can use red-cyan stereo glasses
to see the 3D effect.
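The rectification code is abbreviated here. A minimal sketch using the inlier points from the previous step (the function choice reflects one possible approach for uncalibrated rectification):

[tform1, tform2] = estimateStereoRectification(fMatrix, ...
    inlierPoints1.Location, inlierPoints2.Location, size(I2));
[I1Rect, I2Rect] = rectifyStereoImages(I1, I2, tform1, tform2);

figure
imshow(stereoAnaglyph(I1Rect, I2Rect))
title("Rectified Stereo Images (Red - Left Image, Cyan - Right Image)")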
The parameters used in the above steps have been set to fit the two particular stereo images. To
process other images, you can use the cvexRectifyStereoImages function, which contains
additional logic to automatically adjust the rectification parameters. The image below shows the
result of processing a pair of images using this function.
cvexRectifyStereoImages("parkinglot_left.png","parkinglot_right.png");
References
[1] Trucco, E; Verri, A. "Introductory Techniques for 3-D Computer Vision." Prentice Hall, 1998.
[2] Hartley, R; Zisserman, A. "Multiple View Geometry in Computer Vision." Cambridge University
Press, 2003.
[3] Hartley, R. "In Defense of the Eight-Point Algorithm." IEEE® Transactions on Pattern Analysis and
Machine Intelligence, v.19 n.6, June 1997.
[4] Fischler, MA; Bolles, RC. "Random Sample Consensus: A Paradigm for Model Fitting with
Applications to Image Analysis and Automated Cartography." Comm. Of the ACM 24, June 1981.
Code Generation and Third-Party Examples
• “Code Generation for Monocular Visual Simultaneous Localization and Mapping ” on page 2-2
• “Code Generation for Object Detection by Using Single Shot Multibox Detector” on page 2-5
• “Code Generation for Object Detection by Using YOLO v2” on page 2-8
• “Introduction to Code Generation with Feature Matching and Registration” on page 2-12
• “Code Generation for Face Tracking with PackNGo” on page 2-19
• “Code Generation for Depth Estimation From Stereo Video” on page 2-27
• “Detect Face (Raspberry Pi2)” on page 2-32
• “Track Face (Raspberry Pi2)” on page 2-38
• “Video Display in a Custom User Interface” on page 2-44
• “Generate Code for Detecting Objects in Images by Using ACF Object Detector” on page 2-49
Code Generation for Monocular Visual Simultaneous Localization and Mapping
This example shows how to use the MATLAB® Coder™ to generate C/C++ code for the visual
simultaneous localization and mapping algorithm from the “Monocular Visual Simultaneous
Localization and Mapping” on page 1-122 example.
Visual simultaneous localization and mapping (vSLAM) is the process of calculating the position and
the orientation of a camera, with respect to its surroundings, while simultaneously mapping the
environment.
This example shows how to process image data from a monocular camera to build a map of an indoor
environment and estimate the trajectory of the camera. The steps in this process are:
1 Package the visual SLAM algorithm from the “Monocular Visual Simultaneous Localization and
Mapping” on page 1-122 example into an entry-point function, helperMonoVisualSLAM.
2 Modify the helperMonoVisualSLAM function to support code generation.
3 Generate C/C++ code, and verify the results.
You can also integrate the generated code into external software for further testing.
Download Data
This example uses data from the TUM RGB-D benchmark [1] on page 2-4. The size of the data set is
1.38 GB. You can download the data set to a temporary folder using this code.
baseDownloadURL = "https://vision.in.tum.de/rgbd/dataset/freiburg3/rgbd_dataset_freiburg3_long_of
dataFolder = fullfile(tempdir,"tum_rgbd_dataset",filesep);
options = weboptions(Timeout=Inf);
tgzFileName = dataFolder + "fr3_office.tgz";
folderExists = exist(dataFolder,"dir");
Entry-Point Function
To meet the requirements of MATLAB Coder, restructure the code from the “Monocular Visual
Simultaneous Localization and Mapping” on page 1-122 example into the entry-point function
helperMonoVisualSLAM. This function takes a cell array of images as an input, and outputs a
worldpointset object of 3-D map points, the estimated camera poses, and the frame indices.
The helperMonoVisualSLAM function performs these operations:
1 Initializes a map of 3-D points from the first two video frames, and then computes the 3-D points
and relative camera pose using triangulation based on 2-D ORB feature correspondences.
2 For each new frame, estimates the camera pose by matching features in the current frame to
features in the previous key frame. The function refines the estimated camera pose by tracking
the local map.
3 If the function identifies the new frame as a key frame, the function uses the new frame to create
new 3-D points. In this step, the function uses bundle adjustment to minimize reprojection errors
in the estimated camera poses and 3-D points.
4 Detects loops in each key frame by comparing it with all previous key frames using the bag-of-
features approach. Once the function detects a loop closure, it optimizes the pose graph by
refining the camera poses of all the key frames.
As code generation does not support the imageDatastore object, read the images, convert them to
grayscale, and store them in a cell array.
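A minimal sketch of building that cell array, assuming the extracted image files are in imageFolder:

imds = imageDatastore(imageFolder);
numImages = numel(imds.Files);
Images = cell(1, numImages);
for i = 1:numImages
    Images{i} = im2gray(readimage(imds, i));
end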
Add the “Compilation Directive %#codegen” (MATLAB Coder) to the helperVisualSLAMCodegen
function, and use the codegen command to compile it into a MEX file. You can specify the -report option to
generate a compilation report that shows the original MATLAB code and the associated files created
during code generation. You can also create a temporary directory where MATLAB Coder can store
the generated files. Note that, by default, the generated MEX file has the same name as the original
MATLAB function with "_mex" appended as a suffix: helperVisualSLAMCodegen_mex.
Alternatively, you can use the -o option to specify the name of the MEX file.
For code generation, you must pass Images as an input to the helperVisualSLAMCodegen
function.
cpuConfig = coder.config("mex");
cpuConfig.TargetLang = "C++";
codegen -config cpuConfig helperVisualSLAMCodegen -args {Images}
Use the helperVisualSLAMCodegen_mex function to find the estimated and optimized camera
poses based on the Images cell array.
monoSlamOut = helperVisualSLAMCodegen_mex(Images);
Plot the estimated trajectory and actual trajectory of the camera by specifying monoSlamOut as an
input argument to the helperVisualizeMonoSlam helper function.
% Clean up
clear helperVisualSLAMCodegen_mex
Reference
[1] Sturm Jürgen, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. “A
Benchmark for the Evaluation of RGB-D SLAM Systems.” In 2012 IEEE/RSJ International Conference
on Intelligent Robots and Systems, 573–80, 2012. https://doi.org/10.1109/IROS.2012.6385773.
Code Generation for Object Detection by Using Single Shot Multibox Detector
This example shows how to generate CUDA® code for an SSD network (ssdObjectDetector object)
and take advantage of the NVIDIA® cuDNN and TensorRT libraries. An SSD network is based on a
feed-forward convolutional neural network that detects multiple objects within an image in a single
shot. An SSD network can be thought of as having two subnetworks: a feature extraction network
followed by a detection network.
This example generates code for the network trained in the Object Detection Using SSD Deep
Learning example from Computer Vision Toolbox™. For more information, see “Object Detection
Using SSD Deep Learning” on page 3-302. The Object Detection Using SSD Deep Learning example
uses ResNet-50 for feature extraction. The detection sub-network is a small CNN compared to the
feature extraction network and is composed of a few convolutional layers and layers specific to SSD.
Third-Party Prerequisites
Required
This example generates CUDA MEX and has the following third-party requirements.
• CUDA® enabled NVIDIA® GPU and compatible driver.
Optional
For non-MEX builds such as static, dynamic libraries or executables, this example has the following
additional requirements.
• NVIDIA toolkit.
• NVIDIA cuDNN library.
• Environment variables for the compilers and libraries. For more information, see “Third-Party
Hardware” (GPU Coder) and “Setting Up the Prerequisite Products” (GPU Coder).
Use the coder.checkGpuInstall (GPU Coder) function to verify that the compilers and libraries
necessary for running this example are set up correctly.
envCfg = coder.gpuEnvConfig('host');
envCfg.DeepLibTarget = 'cudnn';
envCfg.DeepCodegen = 1;
envCfg.Quiet = 1;
coder.checkGpuInstall(envCfg);
This example uses the ssdResNet50VehicleExample_20a MAT-file containing the pretrained SSD
network. This file is approximately 44 MB in size. Download the file from the MathWorks® website.
ssdNetFile = matlab.internal.examples.downloadSupportFile('vision/data','ssdResNet50VehicleExampl
The DAG network contains 180 layers including convolution, ReLU, and batch normalization layers,
anchor box, SSD merge, focal loss, and other layers. To display an interactive visualization of the
deep learning network architecture, use the analyzeNetwork (Deep Learning Toolbox) function.
load(ssdNetFile);
analyzeNetwork(detector.Network);
The ssdObj_detect.m entry-point function takes an image input and runs the detector on the image
using the deep learning network saved in the ssdResNet50VehicleExample_20a.mat file. The
function loads the network object from the ssdResNet50VehicleExample_20a.mat file into a
persistent variable ssdObj and reuses the persistent object on subsequent detection calls.
type('ssdObj_detect.m')
function outImg = ssdObj_detect(in,matFile)
persistent ssdObj;
if isempty(ssdObj)
    ssdObj = coder.loadDeepLearningNetwork(matFile);
end
% Pass in input
[bboxes,~,labels] = detect(ssdObj,in,'Threshold',0.5);
% Annotate the detections in the image.
outImg = insertObjectAnnotation(in,'rectangle',bboxes,cellstr(labels));
To generate CUDA code for the ssdObj_detect.m entry-point function, create a GPU code
configuration object for a MEX target and set the target language to C++. Use the
coder.DeepLearningConfig (GPU Coder) function to create a CuDNN deep learning configuration
object and assign it to the DeepLearningConfig property of the GPU code configuration object.
Run the codegen command specifying an input size of 300-by-300-by-3. This value corresponds to the
input layer size of SSD Network.
cfg = coder.gpuConfig('mex');
cfg.TargetLang = 'C++';
cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');
inputArgs = {ones(300,300,3,'uint8'),coder.Constant(ssdNetFile)};
codegen -config cfg ssdObj_detect -args inputArgs -report
To test the generated MEX, the example uses a small vehicle data set that contains 295 images. Many
of these images come from the Caltech Cars 1999 and 2001 data sets, available at the Caltech
Research Data Repository website, created by Pietro Perona and used with permission.
Load the vehicle data set and randomly select 10 images to test the generated code.
unzip vehicleDatasetImages.zip
imageNames = dir(fullfile(pwd,'vehicleImages','*.jpg'));
imageNames = {imageNames.name}';
rng(0);
imageIndices = randi(length(imageNames),1,10);
Run the generated MEX function on the selected test images and display the detection results.
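A sketch of that test loop, resizing each image to the 300-by-300 network input size; file handling follows the snippet above, and the display assumes the entry-point function returns an annotated image:

for idx = imageIndices
    testImage = imread(fullfile(pwd,'vehicleImages',imageNames{idx}));
    in = imresize(testImage,[300 300]);
    out = ssdObj_detect_mex(in,ssdNetFile);
    imshow(out)
end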
References
[1] Liu, Wei, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng Yang Fu, and
Alexander C. Berg. "SSD: Single shot multibox detector." In 14th European Conference on Computer
Vision, ECCV 2016. Springer Verlag, 2016.
Code Generation for Object Detection by Using YOLO v2
This example shows how to generate CUDA® MEX for a you only look once (YOLO) v2 object
detector. A YOLO v2 object detection network is composed of two subnetworks: a feature extraction
network followed by a detection network. This example generates code for the network trained in the
Object Detection Using YOLO v2 Deep Learning example from Computer Vision Toolbox™. For more
information, see “Object Detection Using YOLO v2 Deep Learning” on page 3-468. You can modify
this example to generate CUDA® MEX for the network imported in the Import Pretrained ONNX
YOLO v2 Object Detector example from Computer Vision Toolbox™. For more information, see
“Import Pretrained ONNX YOLO v2 Object Detector” on page 3-436.
Third-Party Prerequisites
Required
This example generates CUDA MEX and has the following third-party requirements.
• CUDA® enabled NVIDIA® GPU and compatible driver.
Optional
For non-MEX builds such as static, dynamic libraries or executables, this example has the following
additional requirements.
• NVIDIA toolkit.
• NVIDIA cuDNN library.
• Environment variables for the compilers and libraries. For more information, see “Third-Party
Hardware” (GPU Coder) and “Setting Up the Prerequisite Products” (GPU Coder).
Use the coder.checkGpuInstall (GPU Coder) function to verify that the compilers and libraries
necessary for running this example are set up correctly.
envCfg = coder.gpuEnvConfig('host');
envCfg.DeepLibTarget = 'cudnn';
envCfg.DeepCodegen = 1;
envCfg.Quiet = 1;
coder.checkGpuInstall(envCfg);
matFile = matlab.internal.examples.downloadSupportFile('vision/data','yolov2ResNet50VehicleExampl
vehicleDetector = load(matFile);
net = vehicleDetector.detector.Network
net =
DAGNetwork with properties:
InputNames: {'input_1'}
OutputNames: {'yolov2OutputLayer'}
The DAG network contains 150 layers including convolution, ReLU, and batch normalization layers
and the YOLO v2 transform and YOLO v2 output layers. To display an interactive visualization of the
deep learning network architecture, use the analyzeNetwork (Deep Learning Toolbox) function.
analyzeNetwork(net);
The yolov2_detect.m entry-point function takes an image input and runs the detector on the image
using the deep learning network saved in the yolov2ResNet50VehicleExample.mat file. The
function loads the network object from the yolov2ResNet50VehicleExample.mat file into a
persistent variable yolov2Obj and reuses the persistent object on subsequent detection calls.
type('yolov2_detect.m')
function outImg = yolov2_detect(in,matFile)
persistent yolov2Obj;
if isempty(yolov2Obj)
    yolov2Obj = coder.loadDeepLearningNetwork(matFile);
end
% Run the detector and annotate the detections in the image.
[bboxes,~,labels] = detect(yolov2Obj,in,'Threshold',0.5);
outImg = insertObjectAnnotation(in,'rectangle',bboxes,cellstr(labels));
To generate CUDA code for the entry-point function, create a GPU code configuration object for a
MEX target and set the target language to C++. Use the coder.DeepLearningConfig (GPU
Coder) function to create a CuDNN deep learning configuration object and assign it to the
DeepLearningConfig property of the GPU code configuration object. Run the codegen command
specifying an input size of 224-by-224-by-3. This value corresponds to the input layer size of YOLOv2.
cfg = coder.gpuConfig('mex');
cfg.TargetLang = 'C++';
cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');
cfg.GenerateReport = true;
inputArgs = {ones(224,224,3,'uint8'),coder.Constant(matFile)};

codegen -config cfg yolov2_detect -args inputArgs
Set up the video file reader and read the input video. Create a video player to display the video and
the output detections.
videoFile = 'highway_lanechange.mp4';
videoFreader = vision.VideoFileReader(videoFile,'VideoOutputDataType','uint8');
depVideoPlayer = vision.DeployableVideoPlayer('Size','Custom','CustomSize',[640 480]);
Read the video input frame-by-frame and detect the vehicles in the video using the detector.
cont = ~isDone(videoFreader);
while cont
I = step(videoFreader);
in = imresize(I,[224,224]);
out = yolov2_detect_mex(in,matFile);
step(depVideoPlayer, out);
% Exit the loop if the video player figure window is closed
cont = ~isDone(videoFreader) && isOpen(depVideoPlayer);
end
References
[1] Redmon, Joseph, and Ali Farhadi. "YOLO9000: Better, Faster, Stronger." 2017 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.
Introduction to Code Generation with Feature Matching and Registration
This example shows how to use the MATLAB® Coder™ to generate C code for a MATLAB file. The
example explains how to modify the MATLAB code used by the “Find Image Rotation and Scale Using
Automated Feature Matching” on page 4-25 example so that it is supported for code generation. The
example highlights some of the general requirements for code generation, as well as some of the
specific actions you must take to prepare MATLAB code. Once the MATLAB code is ready for code
generation, you use the codegen (MATLAB Coder) command to generate a C-MEX function. Finally,
to verify results, the example shows you how to run the generated C-MEX function in MATLAB and
compare its output with the output of the MATLAB code.
To run this example, you must have access to a C compiler and you must configure it using the
'mex -setup' command. For more information, see “Get Started with MATLAB Coder” (MATLAB Coder).
Generated code can run inside the MATLAB environment as a C-MEX file, or outside the MATLAB
environment as a standalone executable or shared utility to be linked with another standalone
executable. For more details about setting code generation options, see the -config option of the
codegen (MATLAB Coder) command.
MEX Executables
This example generates a MEX executable to be run inside the MATLAB environment.
Generating a C-MEX executable to run inside of MATLAB can also be a great first step in a workflow
that ultimately leads to standalone code. The inputs and the outputs of the MEX-file are available for
inspection in the MATLAB environment, where visualization and other kinds of tools for verification
and analysis are readily available. You also have the choice of running individual commands either as
generated C code, or via the MATLAB engine. To run via MATLAB, declare relevant commands as
coder.extrinsic (MATLAB Coder), which means that the generated code will re-enter the
MATLAB environment when it needs to run that particular command. This is useful in cases where
either an isolated command does not yet have code generation support, or you wish to embed
certain commands that do not generate code (such as the plot command).
Standalone Executables
If deployment of code to another application is the goal, then a standalone executable will be
required. The first step is to configure MATLAB Coder appropriately. For example, one way to tell it
you want a standalone executable is to create a MATLAB Coder project using the MATLAB Coder IDE
and configure that project to generate a module or an executable. You can do so using the C/C++
static library or C/C++ executable options from the Build type widget on the Generate page. This IDE
is available by navigating as follows:
• Click the APPS tab.
• Scroll down to MATLAB Coder.
• In the MATLAB Coder Project dialog box, click OK.
Alternatively, you can create a code generation configuration object at the MATLAB command line:

a = coder.config('exe')

and pass that object to the codegen command. When you create a
standalone executable, you have to write your own main.c (or main.cpp). Note that when you create a
standalone executable, there are no ready-made utilities for importing or exporting data between the
executable and the MATLAB environment. One of the options is to use printf/fprintf to a file (in your
handwritten main.c) and then import data into MATLAB using 'load -ascii' with your file.
Break Out the Computational Part of the Algorithm into a Separate MATLAB Function
MATLAB Coder requires MATLAB code to be in the form of a function in order to generate C code.
Note that it is generally not necessary to generate C code for all of the MATLAB code in question. It is
often desirable to separate the code into the primary computational portion, from which C code
generation is desired, and a harness or driver, which does not need to generate C code - that code
will run in MATLAB. The harness may contain visualization and other verification aids that are not
actually part of the system under test. The code for the main algorithm of this example resides in a
function called visionRecovertformCodeGeneration_kernel.
Once the code has been re-architected as described above, you must check that the rest of the code
uses capabilities that are supported by MATLAB Coder. For a list of supported commands, see
MATLAB Coder “Functions and Objects Supported for C/C++ Code Generation” (MATLAB Coder).
For a list of supported language constructs, see “MATLAB Language Features Supported for C/C++
Code Generation” (MATLAB Coder).
It may be convenient to keep limited visualization, or some other capability that MATLAB Coder does
not support, in the function containing the main algorithm that we want to compile.
In these cases, you can declare these items 'extrinsic' (using coder.extrinsic). Such capability is only
possible when you generate the C code into a MATLAB MEX-file, and those functions will actually run
in interpreted MATLAB mode. If generating code for standalone use, extrinsic functions are either
ignored or they generate an error, depending on whether the code generation engine determines that
they affect the results. Thus the code must be properly architected so that the extrinsic functions do
not materially affect the code in question if a standalone executable is ultimately desired.
The original example uses showMatchedFeatures and imshowpair routines for visualization of the
results. These routines are extracted to a new function
featureMatchingVisualization_extrinsic. This function is declared extrinsic.
scaleRecovered = 0.702550
thetaRecovered = 29.761566
MATLAB Coder places the files it creates in a codegen folder. Note that the generated MEX-file has
the same name as the original MATLAB file with _mex appended, unless you use the -o option to
specify the name of the executable.
MATLAB Coder requires that you specify the properties of all the input parameters. One easy way to
do this is to define the input properties by example at the command-line using the -args option. For
more information see “Define Input Properties by Example at the Command Line” (MATLAB Coder).
Since the inputs to visionRecovertformCodeGeneration_kernel are a pair of images, we
define both of the inputs with the following properties:
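The code for this step is abbreviated here. A sketch of defining the compile-time inputs and generating the MEX, assuming the two input images are available in the variables original and distorted:

compileTimeInputs = {coder.typeof(original), coder.typeof(distorted)};
codegen visionRecovertformCodeGeneration_kernel.m -report -args compileTimeInputs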
scaleRecovered = 0.702448
thetaRecovered = 29.908159
Clean Up
clear visionRecovertformCodeGeneration_kernel_mex;
Recovered scale and theta for both MATLAB and CODEGEN, as shown above, are within reasonable
tolerance. Furthermore, the matched points are identical, as shown below:
isequal(matchedOriginalLocCG,matchedOriginalLoc)
isequal(matchedDistortedLocCG,matchedDistortedLoc)
ans =

  logical

   1

ans =

  logical

   1
Appendix
• featureMatchingVisualization_extrinsic
Code Generation for Face Tracking with PackNGo
This example shows how to generate code from “Face Detection and Tracking Using the KLT
Algorithm” on page 8-89 example with packNGo function. The packNGo (MATLAB Coder) function
packages all relevant files in a compressed zip file so you can relocate, unpack, and rebuild your
project in another development environment without MATLAB present. This example also shows how
to create a makefile for the packNGo content, rebuild the source files and finally run the standalone
executable outside MATLAB environment.
This example is a function with the main body at the top and helper routines in the form of “Nested
Functions” below.
function FaceTrackingKLTpackNGoExample()
To run this example, you must have access to a C++ compiler and you must configure it using the
'mex -setup c++' command. For more information, see “Choose a C++ Compiler”. If you deploy the
application on MATLAB host, use a C++ compiler that is compatible with the compiler used to build
OpenCV libraries. For more information, see “Portable C Code Generation for Functions That Use
OpenCV Library” on page 22-4.
Break Out the Computational Part of the Algorithm into a Separate MATLAB Function
MATLAB Coder requires MATLAB code to be in the form of a function in order to generate C code.
The code for the main algorithm of this example resides in a function called
FaceTrackingKLTpackNGo_kernel.m. This file is derived from “Face Detection and Tracking Using the
KLT Algorithm” on page 8-89. To learn how to modify the MATLAB code to make it compatible for
code generation, you can look at example “Introduction to Code Generation with Feature Matching
and Registration” on page 2-12.
fileName = 'FaceTrackingKLTpackNGo_kernel.m';
visiondemo_dir = pwd;
currentDir = pwd; % Store the current directory
fileName = fullfile(visiondemo_dir, fileName);
Create a code generation configuration object for EXE output, with a packNGo function call in the
post-code-generation stage.
codegenArgs = createCodegenArgs(visiondemo_dir);
currentPath = addpath(visiondemo_dir);
pathCleanup = onCleanup(@()path(currentPath));
cd(codegenOutDir);
dirChange = onCleanup(@()cd(currentDir));
Note that, instead of using the codegen command, you can open a dialog box and launch a code
generation project (see codegen (MATLAB Coder)). Use the post-code-generation command with the
packNGo function to create a zip file.
Unzip the zip file into a new folder. Note that the zip file contains source files, header files, libraries,
a MAT-file containing the build information object, and data files. The unzipPackageContents
function and other helper functions are included in the appendix.
zipFileLocation = codegenOutDir;
fprintf('-> Unzipping files ....\n');
unzipFolderLocation = unzipPackageContents(zipFileLocation);
Create the commands required to build the project and to run it.
fprintf('-> Creating ''Build Command'' and ''Run command'' ....\n');
[buildCommand, runCommand] = createBuildAndRunCommands(zipFileLocation,...
unzipFolderLocation,makefileName,fname);
cd(unzipFolderLocation);
system(runCommand);
The application can be deployed on another machine by copying the executable and the library files.
isPublishing = ~isempty(snapnow('get'));
if ~isPublishing % skip printing out directory to html page
fprintf('Executable and library files are located in the following folder:\n%s\n', unzipFolderL
fprintf('To re-execute run the following commands:\n');
fprintf('1. cd(''%s'')\n', unzipFolderLocation);
fprintf('2. system(''%s'')\n', runCommand);
end
cfg = coder.config('exe');
cfg.PostCodeGenCommand = 'packNGo(buildInfo,''packType'',''hierarchical'');';
cfg.CustomSource = mainCFile;
cfg.CustomInclude = folderForMainC;
cfg.EnableOpenMP = false;
end
unzipFolderLocationName = 'unzipPackNGo';
mkdir(unzipFolderLocationName);
assert(numel(zipFile)==1);
unzip(zipFile.name,unzipFolderLocationName);
for i=1:numel(zipFileInternal)
unzip(fullfile(unzipFolderLocationName,zipFileInternal(i).name), ...
unzipFolderLocationName);
end
unzipFolderLocation = fullfile(zipFileLocation,unzipFolderLocationName);
end
lastDir = cd(unzipFolderLocation);
dirCleanup = onCleanup(@()cd(lastDir));
% Get defines
horzcat_with_space = @(cellval)sprintf('%s ',cellval{:});
defs = horzcat_with_space(getDefines(binfo.buildInfo));
else
[~, cFiles] = system(['find ./ ' '-name ' '''*.c''']);
[~, cppFiles] = system(['find ./ ' '-name ' '''*.cpp''']);
end
for i = 1:length(cIndx)
if i == 1
startIdx = 1;
endIdx = cIndx(i);
else
startIdx = cIndx(i-1)+1;
endIdx = cIndx(i);
end
[~, b, ~] = fileparts(cFiles(startIdx:endIdx));
srcFilesC = [srcFilesC ' ' b '.c']; %#ok<AGROW>
end
for i = 1:length(cppIndx)
if i == 1
startIdx = 1;
endIdx = cppIndx(i);
else
startIdx = cppIndx(i-1)+1;
endIdx = cppIndx(i);
end
[~, b, ~] = fileparts(cppFiles(startIdx:endIdx));
srcFilesCPP = [srcFilesCPP ' ' b '.cpp']; %#ok<AGROW>
end
filecontent = char(fread(fid)');
fclose(fid);
newfilecontent = regexprep(filecontent,...
{'PASTE_ARCH','PASTE_EXT','PASTE_DEFINES','PASTE_SRCFILES', 'PASTE_MATLAB'},...
{ archDir, dllExt, defs, srcFiles, matlabDirName});
makefileName = 'Makefile';
mk_name = fullfile(unzipFolderLocation,makefileName);
if isunix
if( ismac )
[status,sysHeaderPath] = system( 'xcode-select -print-path' );
assert(status==0, ['Could not obtain a path to the system ' ...
'header files using ''xcode-select -print-path''' '']);
end
end
assert(~isempty(sdkPath), ...
sprintf('There is no sdk available in %s. Please check system environment.\n',s
newfilecontent = regexprep(newfilecontent,'PASTE_CC',ccCMD);
newfilecontent = regexprep(newfilecontent,'PASTE_CPP',cppCMD);
end
fid = fopen(mk_name,'w+');
fprintf(fid,'%s',newfilecontent);
fclose(fid);
end
if ismac
buildCommand = [' xcrun make -f ' makefileName];
runCommand = ['./' fileName ' "' fileName '"'];
elseif isunix
buildCommand = [' make -f ' makefileName];
runCommand = ['./' fileName ' "' fileName '"'];
else
% On PC we use the generated BAT files (there should be 2) to help
% build the generated code. These files are copied to the
% unzipFolderLocation where we can use them to build.
batFilename = [fileName '_rtw.bat'];
batFilelocation = fullfile(packageLocation,'codegen', ...
filesep,'exe',filesep,fileName);
batFileDestination = unzipFolderLocation;
oldMakefilename = makefileName;
copyfile(fullfile(batFileDestination,oldMakefilename),...
fullfile(batFileDestination,newMakefileName));
buildCommand = batFilename;
runCommand = [fileName '.exe' ' "' fileName '"'];
end
end
lastDir = cd(unzipFolderLocation);
dirCleanup = onCleanup(@()cd(lastDir));
if hadError
error (sysResults);
end
end
matlabDirName='';
for ij=1:length(dirLists)
thisDirName = dirLists(ij).name;
if (isfolder(thisDirName))
% subdirectory will have toolbox/vision
[subDir1, hasSubDir1] = hasSubdirectory(thisDirName, 'toolbox');
if hasSubDir1
[~, hasSubDir2] = hasSubdirectory(subDir1, 'vision');
if hasSubDir2
matlabDirName = thisDirName;
break;
end
end
end
end
end
subDir = '';
hasSubDir = false;
for ij=1:length(dirLists)
thisDirName = dirLists(ij).name;
thisDir = fullfile(dirName,thisDirName);
end
Code Generation for Depth Estimation From Stereo Video
This example shows how to use the MATLAB® Coder™ to generate C code for a MATLAB function,
which uses the stereoParameters object produced by Stereo Camera Calibrator app or the
estimateCameraParameters function. The example explains how to modify the MATLAB code in
the “Depth Estimation from Stereo Video” on page 1-181 example to support code generation.
Code Generation
You can learn about the basics of code generation using the MATLAB® Coder™ from the
“Introduction to Code Generation with Feature Matching and Registration” on page 2-12 example.
MATLAB Coder requires MATLAB code to be in the form of a function in order to generate C code.
Furthermore, the arguments of the function cannot be MATLAB objects.
This presents a problem for generating code from MATLAB code, which uses cameraParameters or
stereoParameters objects, which are typically created in advance during camera calibration. To
solve this problem, use the toStruct() method to convert the cameraParameters or the
stereoParameters object into a struct. The struct can then be passed into the generated code.
The restructured code for the main algorithm of “Depth Estimation from Stereo Video” on page 1-181
example resides in a function called depthEstimationFromStereoVideo_kernel.m. Note that
depthEstimationFromStereoVideo_kernel is a function that takes a struct created from a
stereoParameters object. Note also that it does not display the reconstructed 3-D point cloud,
because the showPointCloud function does not support code generation.
Load the stereoParameters object, which is the result of calibrating the camera using either the
stereoCameraCalibrator app or the estimateCameraParameters function.
% Convert the object into a struct, which can be passed into generated
% code.
stereoParamsStruct = toStruct(stereoParams);
On Macintosh, VideoReader does not support code generation for reading compressed video.
Uncompress the video files, and store them in the temporary directory.
MATLAB Coder requires that you specify the properties of all the input parameters. One easy way to
do this is to define the input properties by example at the command-line using the -args option. For
more information see “Define Input Properties by Example at the Command Line” (MATLAB Coder).
compileTimeInputs = {coder.typeof(stereoParamsStruct)};
% Generate code.
codegen depthEstimationFromStereoVideo_kernel -args compileTimeInputs;
Clean Up
clear depthEstimationFromStereoVideo_kernel_mex;
release(player);
Summary
This example showed how to generate C code from MATLAB code that takes a cameraParameters
or a stereoParameters object as input.
Detect Face (Raspberry Pi2)
This example shows how to use the MATLAB® Coder™ to generate C code from a MATLAB file and
deploy the application on an ARM target.
The example reads video frames from a webcam and detects faces in each of the frames using the
Viola-Jones face detection algorithm. The detected faces are displayed with bounding boxes. The
webcam function, from 'MATLAB Support Package for USB Webcams', and the VideoPlayer object,
from the Computer Vision System toolbox™, are used for the simulation on the MATLAB host. The
two functions do not support the ARM target, so OpenCV-based webcam reader and video viewer
functions are used for deployment.
The target must have OpenCV version 3.4.0 libraries (built with GTK) and a standard C++ compiler.
A Raspberry Pi 2 with Raspbian Stretch operating system was used for deployment. The example
should work on any ARM target.
This example is a function with the main body at the top and helper routines in the form of “Nested
Functions” below.
function FaceDetectionARMCodeGenerationExample()
To run this example, you must have access to a C++ compiler and you must configure it using the
'mex -setup c++' command. For more information, see “Choose a C++ Compiler”.
Break Out the Computational Part of the Algorithm into a Separate MATLAB Function
MATLAB Coder requires MATLAB code to be in the form of a function in order to generate C code.
The code for the main algorithm of this example resides in a function called
faceDetectionARMKernel.m. The function takes an image from a webcam, as the input. The function
outputs the image with a bounding box around the detected faces. The output image will be displayed
on video viewer window. To learn how to modify the MATLAB code to make it compatible for code
generation, you can look at example “Introduction to Code Generation with Feature Matching and
Registration” on page 2-12.
fileName = 'faceDetectionARMKernel.m';
For a standalone executable target, MATLAB Coder requires that you create a C file containing a
function named "main". This example uses the faceDetectionARMMain.c file. The main function in
this file performs the following tasks:

For simulation on the MATLAB host, the tasks performed in the faceDetectionARMMain.c file are
implemented in faceDetectionARMMain.m
For deployment on ARM, this example implements webcam reader functionality using OpenCV
functions. It also implements a video viewer using OpenCV functions. These OpenCV based utility
functions are implemented in the following files:
• helperOpenCVWebcam.hpp
• helperOpenCVWebcam.cpp
• helperOpenCVVideoViewer.cpp
• helperOpenCVVideoViewer.hpp
For simulation on MATLAB host, the example uses the webcam function from the 'MATLAB Support
Package for USB Webcams' and the VideoPlayer object from the Computer Vision System toolbox.
Run the simulation on the MATLAB host by typing faceDetectionARMMain at the MATLAB®
command line.
This example requires that you install OpenCV 3.4.0 libraries on your ARM target. The video viewer
requires that you build the highgui library in OpenCV with GTK for the ARM target.
Follow the steps to download and build OpenCV 3.4.0 on Raspberry Pi 2 with preinstalled Raspbian
Stretch. You must update your system firmware or install other developer tools and packages as
needed for your system configuration before you start building OpenCV.
• $ make
• $ sudo make install
For official deployment of the example, OpenCV libraries were installed in the following directory on
Raspberry Pi 2:
/usr/local/lib
/usr/local/include
Generate Code
Use build information stored in buildInfo.mat to create a zip folder using packNGo.
fprintf('-> Creating zip folder (it may take a few minutes) ....\n');
bInfo = load(fullfile('codegen','exe','faceDetectionARMKernel','buildInfo.mat'));
packNGo(bInfo.buildInfo, {'packType', 'hierarchical', ...
'fileName', 'faceDetectionARMKernel'});
% The generated zip folder is faceDetectionARMKernel.zip
-> Creating zip folder (it may take a few minutes) ....
Unzip faceDetectionARMKernel.zip into a folder named FaceDetectionARM. Unzip all files and
remove the .zip files.
packngoDir = hUnzipPackageContents();
Deployment on ARM
Transfer your project folder named FaceDetectionARM to your ARM target using your preferred file
transfer tool. Since the Raspberry Pi 2 (with Raspbian Stretch) already has an SSH server, you can
use SFTP to transfer files from host to target.
For official deployment of this example, the FileZilla SFTP Client was installed on the host machine
and the project folder was transferred from the host to the /home/pi/FaceDetectionARM folder on
Raspberry Pi.
Run the makefile to build the executable on ARM. For Raspberry Pi 2, (with Raspbian Stretch), open a
linux shell and cd to /home/pi/FaceDetectionARM. Build the executable using the following command:
make -f faceDetectionARMMakefile.mk

disp('Step-2: Build the executable on ARM using the shell command: make -f faceDetectionARMMakefile.mk');
Step-2: Build the executable on ARM using the shell command: make -f faceDetectionARMMakefile.mk
Run the executable generated in the above step. For Raspberry Pi 2, (with Raspbian Stretch), use the
following command in the shell window:
./faceDetectionARMKernel
Make sure that you are connected to the Raspberry Pi with a window manager, and not just through a
command line terminal to avoid errors related to GTK. This is necessary for the tracking window to
show up.
To close the video viewer while the executable is running on Raspberry Pi2, click on the video viewer
and press the escape key.
disp('Step-3: Run the executable on ARM using the shell command: ./faceDetectionARMKernel');
Step-3: Run the executable on ARM using the shell command: ./faceDetectionARMKernel
end
packngoDirName = 'FaceDetectionARM';
unzip(zipFile.name,packngoDirName);
for i=1:numel(zipFileInternal)
unzip(fullfile(packngoDirName,zipFileInternal(i).name), ...
packngoDirName);
end
% delete internal zip files
delete(fullfile(packngoDirName,'*.zip'));
packngoDir = packngoDirName;
end
for i=1:length(dirList)
thisDir = fullfile(packngoDir,dirList(i).name, 'toolbox', 'vision');
if isfolder(thisDir)
% rename the dir
movefile(fullfile(packngoDir,dirList(i).name), ...
fullfile(packngoDir,'matlab'));
break;
end
end
end
end
end
Track Face (Raspberry Pi2)
This example shows how to use the MATLAB® Coder™ to generate C code from a MATLAB file and
deploy the application on ARM target.
The example reads video frames from a webcam. It detects a face using Viola-Jones face detection
algorithm and tracks the face in a live video stream using the KLT algorithm. It finally displays the
frame with a bounding box and a set of markers around the face being tracked. The webcam function,
from 'MATLAB Support Package for USB Webcams', and the VideoPlayer object, from the Computer
Vision System toolbox™, are used for the simulation on the MATLAB host. The two functions do not
support the ARM target, so OpenCV-based webcam reader and video viewer functions are used for
deployment.
The target must have OpenCV version 3.4.0 libraries (built with GTK) and a standard C++ compiler.
A Raspberry Pi 2 with Raspbian Stretch operating system was used for deployment. The example
should work on any ARM target.
This example is a function with the main body at the top and helper routines in the form of “Nested
Functions” below.
function FaceTrackingARMCodeGenerationExample()
To run this example, you must have access to a C++ compiler and you must configure it using the
'mex -setup c++' command. For more information, see “Choose a C++ Compiler”.
Break Out the Computational Part of the Algorithm into a Separate MATLAB Function
MATLAB Coder requires MATLAB code to be in the form of a function in order to generate C code.
The code for the main algorithm of this example resides in a function called
faceTrackingARMKernel.m. The function takes an image from a webcam, as the input. The function
outputs the image with a bounding box and a set of markers around the face. The output image will
be displayed on video viewer window. To learn how to modify the MATLAB code to make it compatible
for code generation, you can look at example “Introduction to Code Generation with Feature
Matching and Registration” on page 2-12.
fileName = 'faceTrackingARMKernel.m';
For a standalone executable target, MATLAB Coder requires that you create a C file containing a
function named "main". This example uses the faceTrackingARMMain.c file. The main function in this
file performs the following tasks:

For simulation on the MATLAB host, the tasks performed in the faceTrackingARMMain.c file are
implemented in faceTrackingARMMain.m
For deployment on ARM, this example implements webcam reader functionality using OpenCV
functions. It also implements a video viewer using OpenCV functions. These OpenCV based utility
functions are implemented in the following files:
• helperOpenCVWebcam.hpp
• helperOpenCVWebcam.cpp
• helperOpenCVVideoViewer.cpp
• helperOpenCVVideoViewer.hpp
For simulation on MATLAB host, the example uses the webcam function from the 'MATLAB Support
Package for USB Webcams' and the VideoPlayer object from the Computer Vision System toolbox.
Run the simulation on the MATLAB host by typing faceTrackingARMMain at the MATLAB® command
line.
This example requires that you install OpenCV 3.4.0 libraries on your ARM target. The video viewer
requires that you build the highgui library in OpenCV with GTK for the ARM target.
Follow the steps to download and build OpenCV 3.4.0 on Raspberry Pi 2 with preinstalled Raspbian
Stretch. You must update your system firmware or install other developer tools and packages as
needed for your system configuration before you start building OpenCV.
• $ make
• $ sudo make install
For official deployment of the example, OpenCV libraries were installed in the following directory on
Raspberry Pi 2:
/usr/local/lib
/usr/local/include
Generate Code
Use build information stored in buildInfo.mat to create a zip folder using packNGo.
fprintf('-> Creating zip folder (it may take a few minutes) ....\n');
bInfo = load(fullfile('codegen','exe','faceTrackingARMKernel','buildInfo.mat'));
packNGo(bInfo.buildInfo, {'packType', 'hierarchical', ...
'fileName', 'faceTrackingARMKernel'});
% The generated zip folder is faceTrackingARMKernel.zip
-> Creating zip folder (it may take a few minutes) ....
Unzip faceTrackingARMKernel.zip into a folder named FaceTrackingARM. Unzip all files and remove
the .zip files.
packngoDir = hUnzipPackageContents();
Deployment on ARM
Transfer your project folder named FaceTrackingARM to your ARM target using your preferred file
transfer tool. Since the Raspberry Pi 2 (with Raspbian Stretch) already has an SSH server, you can
use SFTP to transfer files from host to target.
For official deployment of this example, the FileZilla SFTP Client was installed on the host machine
and the project folder was transferred from the host to the /home/pi/FaceTrackingARM folder on
Raspberry Pi.
Run the makefile to build the executable on ARM. For Raspberry Pi 2, (with Raspbian Stretch), open a
command line terminal and 'cd' to /home/pi/FaceTrackingARM. Build the executable using the
following command:
make -f faceTrackingARMMakefile.mk
disp('Step-2: Build the executable on ARM using the shell command: make -f faceTrackingARMMakefile.mk');
Step-2: Build the executable on ARM using the shell command: make -f faceTrackingARMMakefile.mk
Run the executable generated in the above step. For Raspberry Pi 2, (with Raspbian Stretch), use the
following command in the shell window:
./faceTrackingARMKernel
Make sure that you are connected to the Raspberry Pi with a window manager, and not just through a
command line terminal to avoid errors related to GTK. This is necessary for the tracking window to
show up.
To close the video viewer while the executable is running on Raspberry Pi2, click on the video viewer
and press the escape key.
disp('Step-3: Run the executable on ARM using the shell command: ./faceTrackingARMKernel');
Step-3: Run the executable on ARM using the shell command: ./faceTrackingARMKernel
mainCFile = 'faceTrackingARMMain.c';
end
packngoDirName = 'FaceTrackingARM';
unzip(zipFile.name,packngoDirName);
for i=1:numel(zipFileInternal)
unzip(fullfile(packngoDirName,zipFileInternal(i).name), ...
packngoDirName);
end
% delete internal zip files
delete(fullfile(packngoDirName,'*.zip'));
packngoDir = fullfile(packngoDirName);
end
end
Video Display in a Custom User Interface
This example shows how to display multiple video streams in a custom graphical user interface (GUI).
Overview
When working on a project involving video processing, we are often faced with creating a custom
user interface. It may be needed for the purpose of visualizing and/or demonstrating the effects of our
algorithms on the input video stream. This example illustrates how to create a figure window with
two axes to display two video streams. It also shows how to set up buttons and their corresponding
callbacks.
This example is written as a function with the main body at the top. The example also uses nested
functions and a separate helper function listed.
function VideoInCustomGUIExample()
videoSrc = VideoReader('vipmen.avi');
Create a figure window and two axes to display the input video and the processed video.
Now that the GUI is constructed, we can press the play button to trigger the main video processing
loop defined in the getAndProcessFrame function listed below.
Note that each video frame is centered in the axis box. If the axis size is bigger than the frame size,
the video frame borders are padded with the background color. If the axis size is smaller than the
frame size, scroll bars are added.
Create a figure window and two axes with titles to display two videos.
end
The axis is created on a uipanel container object. This allows more control over the layout of the GUI.
The video title is created using uicontrol.
% Create panel
hPanel = uipanel('parent',hFig,'Position',pos,'Units','Normalized');
% Create axis
hAxis = axes('position',[0 0 1 1],'Parent',hPanel);
hAxis.XTick = [];
hAxis.YTick = [];
hAxis.XColor = [1 1 1];
hAxis.YColor = [1 1 1];
% Set video title using uicontrol. uicontrol is used so that text
% can be positioned in the context of the figure, not the axis.
titlePos = [pos(1)+0.02 pos(2)+pos(3)+0.3 0.3 0.07];
uicontrol('style','text',...
'String', axisTitle,...
'Units','Normalized',...
'Parent',hFig,'Position', titlePos,...
'BackgroundColor',hFig.Color);
end
Insert Buttons
function insertButtons(hFig,hAxes,videoSrc)
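% (Sketch of the button setup; positions and strings are illustrative.)
% Play/Pause button. The callback starts, pauses, and resumes the video loop.
uicontrol(hFig,'unit','pixel','style','pushbutton','string','Start', ...
    'position',[10 10 75 25],'callback',{@playCallback,videoSrc,hAxes});
% Exit button. The callback closes the figure window.
uicontrol(hFig,'unit','pixel','style','pushbutton','string','Exit', ...
    'position',[100 10 50 25],'callback',{@exitCallback,hFig});
end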
This callback function rotates the input video frame and displays the original input and rotated frame
on two different axes. The helper function showFrameOnAxis, is responsible for displaying a frame
of the video on the user-defined axis.
function playCallback(hObject,~,videoSrc,hAxes)
try
% Check the status of play button
isTextStart = strcmp(hObject.String,'Start');
isTextCont = strcmp(hObject.String,'Continue');
if isTextStart
% Two cases: (1) starting first time, or (2) restarting
% Start from first frame
if ~hasFrame(videoSrc)
videoSrc.CurrentTime = 0.0;
end
end
if (isTextStart || isTextCont)
hObject.String = 'Pause';
else
hObject.String = 'Continue';
end
end
end
This function defines the main algorithm that is invoked when the play button is activated.
function exitCallback(~,~,hFig)
% Close the figure window
close(hFig);
end
end
Generate Code for Detecting Objects in Images by Using ACF Object Detector
This example shows how to generate code from a MATLAB® function that detects objects in images
by using an acfObjectDetector object. When you intend to generate code from your MATLAB
function that uses an acfObjectDetector object, you must create the object outside of the
MATLAB function. The example explains how to modify the MATLAB code in “Train Stop Sign
Detector Using ACF Object Detector” to support code generation.
To generate C Code, MATLAB Coder requires MATLAB code to be in the form of a function. The
arguments of the function cannot be MATLAB objects. This requirement presents a problem for
generating code from the MATLAB function that uses acfObjectDetector objects created outside
of the MATLAB function. To solve this problem, use the toStruct function to convert the
acfObjectDetector object into a structure and pass the structure to the MATLAB function.
To support code generation, this example restructures the code of an existing example (see “Train
Stop Sign Detector Using ACF Object Detector”) in a function called detectObjectsUsingACF,
which is present in the current working folder as a supporting file. The detectObjectsUsingACF
function takes an image as an input and loads the pretrained ACF stop sign detector.
type("detectObjectsUsingACF.m")
Select the ground truth for stop signs. The ground truth data is the set of known locations of stop
signs in the images.
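The code below assumes the stopSignsAndCars ground truth table is in the workspace. One way to load it, assuming the stopSignsAndCars.mat ground truth file used by the original training example, is:
% Load the ground truth table of stop sign locations (file name assumed).
load("stopSignsAndCars.mat")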
stopSigns = stopSignsAndCars(:,1:2);
Use the trainACFObjectDetector function to train the ACF detector. Turn off the training
progress output by setting Verbose=false.
detector = trainACFObjectDetector(stopSigns,NegativeSamplesFactor=2,Verbose=false);
Because you intend to generate code for the MATLAB function detectObjectsUsingACF, convert
the created detector into a structure.
detectorStruct = toStruct(detector);
save("detectorStruct.mat","detectorStruct");
Generate C-MEX code that you can run in the MATLAB environment. Use the codegen (MATLAB
Coder) command.
img = imread("stopSignTest.jpg");
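The codegen invocation itself does not appear in this listing. A typical call, which is an assumption and uses the loaded image to define the input type, is:
% Generate a C-MEX function from detectObjectsUsingACF (options may vary).
codegen detectObjectsUsingACF -args {img} -report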
Call the generated C-MEX function by passing the loaded image img as an input.
[bboxes,scores] = detectObjectsUsingACF_mex(img);
Display the detection results and insert the bounding boxes for objects into the image.
img = insertObjectAnnotation(img,"rectangle",bboxes,scores);
figure
imshow(img)
Clean Up
Release the system memory used to store the generated C-MEX file.
clear detectObjectsUsingACF_mex;
See Also
“Introduction to Code Generation with Feature Matching and Registration” on page 2-12 | “Generate
Code to Detect Edges on Images” (MATLAB Coder)
Deep Learning, Semantic Segmentation, and Detection Examples
3
Hand Pose Estimation Using HRNet Deep Learning
This example shows how to detect keypoints in a human hand and estimate hand pose using the
HRNet deep learning network.
Overview
Hand pose estimation detects and estimates the 2D pose and configuration of a human hand from an
image or a video. It identifies the position and orientation of the hand joints, such as the locations of
fingertips, knuckles, and the palm. The applications of hand pose estimation include virtual and
augmented reality, human-computer interaction, sign language recognition, gesture-based interfaces,
robotics, and medical diagnosis.
This example uses a High-Resolution Net (HRNet) [1 on page 3-15] deep learning network to detect
keypoints in a human hand. To learn more about the HRNet deep learning network, see “Getting
Started with HRNet” on page 19-86.
doTraining = false;
downloadFolder = tempdir;
pretrainedKeypointDetector = helperDownloadHandPoseKeypointDetector(downloadFolder);
I = imread("handPose.jpg");
Specify the bounding box location of the hand region in the form [x y w h]. The x and y values
specify the upper-left corner of the bounding box. w specifies the width of the box, which is its length
along the x-axis. h specifies the height of the box, which is its length along the y-axis.
Alternatively, you can get bounding box locations by training object detectors like
yolov3ObjectDetector and yolov4ObjectDetector to detect object locations.
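The bounding box value used in this example is not shown in the listing. As an illustration, with hypothetical coordinates that depend on your image, you might specify:
handBoundingBoxes = [100 100 300 300]; % hypothetical [x y w h] values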
Use the pretrained keypoint detector to detect the hand keypoints in the image.
[keypoints,scores,visibility] = detect(pretrainedKeypointDetector,I,handBoundingBoxes);
Visualize the detected keypoints. The image represents the detected hand keypoints as yellow dots,
and the keypoint connections using green lines.
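The visualization code is omitted from this listing. One way to render the detected keypoints, shown here without the connection lines because the keypoint connection definitions are introduced later in the example, is:
Iout = insertObjectKeypoints(I,keypoints,KeypointColor="yellow",KeypointSize=3);
figure
imshow(Iout)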
The remainder of this example shows how to configure the pretrained object keypoint detector using
a transfer learning approach, and train an HRNet deep learning network on a hand pose data set.
To illustrate the training procedure, this example uses a labeled data set that contains 2500 images
from the Large-Scale Multiview Hand Pose Dataset [2 on page 3-15]. Each image in the data set
contains a human hand with 21 annotated keypoints.
dataset = helperDownloadHandPoseDataset(downloadFolder);
data = load(dataset);
handPoseDataset = data.handPoseDataset;
The hand pose data table contains three columns. The first, second, and third columns contain the
image filenames, keypoint locations, and hand bounding boxes, respectively. Keypoints consist of N-
by-2 matrices, where N is the number of keypoints present in the hand. Each image contains only one
hand, which is one object. Therefore, each row represents one object in an image. If a custom data set
contains more than one object in an image, create a row of data for each object in that image.
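To produce the preview that follows, display the first few rows of the table (the display command itself is not part of the original listing):
handPoseDataset(1:4,:)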
ans=4×3 table
imageFilename keypoints boundingBoxes
__________________________ _____________ ___________________
% Add the full data path to the locally stored hand pose data folder.
handPoseDataset.imageFilename = fullfile(downloadFolder,"2DHandPoseDataAndGroundTruth","2DHandPos
Read the keypoint class names and keypoint connection information using the
helperHandPoseDatasetKeypointNames and helperKeypointConnection helper functions.
keypointClasses contains the categorical class labels for every hand keypoint.
keypointConnections contains the connectivity information between pairs of keypoints.
keypointClasses = helperHandPoseDatasetKeypointNames;
keypointConnections = helperKeypointConnection;
handPoseKeypointDetector = hrnetObjectKeypointDetector("human-full-body-w32",keypointClasses,Keyp
Use imageDatastore to create ImageDatastore objects for loading the image data.
handPoseImds = imageDatastore(handPoseDataset.imageFilename);
Use arrayDatastore to create ArrayDatastore objects for loading the ground truth keypoint
location data.
handPoseArrds = arrayDatastore(handPoseDataset(:,2));
Use boxLabelDatastore to create boxLabelDatastore objects for loading the bounding box
locations.
handPoseBlds = boxLabelDatastore(handPoseDataset(:,3));
The HRNet deep learning network has been trained on image patches that contain only one object in
each image. Use the transform function and the helperPreprocessCropped helper function to
preprocess the images in the datastore. Use the functions to crop image patches that contain the
object of interest and rescale the keypoints to the new image size. Then, store the preprocessed data
by using the writeall function. The function stores the image patches as JPEG files and the hand
keypoint data as a MAT file.
% Define the input size and number of keypoints to process.
inputSize = handPoseKeypointDetector.InputSize;
numKeypoints = size(handPoseKeypointDetector.KeyPointClasses,1);
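The transform and writeall calls are not shown in this listing. A possible sketch of the preprocessing step described above follows; the output folder name, the helperPreprocessCropped argument list, and the helperWritePatchData write function are assumptions rather than code from the original example:
% Combine the image, keypoint, and box datastores, then crop and rescale (sketch).
imagesPatchDataLocation = fullfile(downloadFolder,"handPosePatchData");
rawCds = combine(handPoseImds,handPoseArrds,handPoseBlds);
patchCds = transform(rawCds, ...
    @(data)helperPreprocessCropped(data,inputSize,numKeypoints));
% Write the image patches and keypoint MAT files to disk using a custom write function.
writeall(patchCds,imagesPatchDataLocation,WriteFcn=@helperWritePatchData);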
Load the data. Create an ImageDatastore object for the image patches and a FileDatastore
object for the keypoints.
handPosePatchImds = imageDatastore(fullfile(imagesPatchDataLocation,"imagePatches"));
handPoseKptfileds = fileDatastore(fullfile(imagesPatchDataLocation,"Keypoints"),"ReadFcn",@load,FileExtensions=".mat");
Split the data set into training, validation, and test sets. Select 80% of the data for training, 10% for
validation, and the rest for testing the trained detector.
rng(0);
numFiles = numel(handPosePatchImds.Files);
shuffledIndices = randperm(numFiles);
numTrain = round(0.8*numFiles);
trainingIdx = shuffledIndices(1:numTrain);
numVal = round(0.10*numFiles);
valIdx = shuffledIndices(numTrain+1:numTrain+numVal);
testIdx = shuffledIndices(numTrain+numVal+1:end);
Create ImageDatastore objects for the training, validation, and test sets.
trainingImages = handPosePatchImds.Files(trainingIdx);
valImages = handPosePatchImds.Files(valIdx);
testImages = handPosePatchImds.Files(testIdx);
imdsTrain = imageDatastore(trainingImages);
imdsValidation = imageDatastore(valImages);
imdsTest = imageDatastore(testImages);
trainingKeypoints = handPoseKptfileds.Files(trainingIdx);
valKeypoints = handPoseKptfileds.Files(valIdx);
testKeypoints = handPoseKptfileds.Files(testIdx);
fdsTrain = fileDatastore(trainingKeypoints,"ReadFcn",@load,FileExtensions=".mat");
fdsValidation = fileDatastore(valKeypoints,"ReadFcn",@load,FileExtensions=".mat");
fdsTest = fileDatastore(testKeypoints,"ReadFcn",@load,FileExtensions=".mat");
Create CombinedDatastore objects for training, validation, and test set by combining the
respective image datastore and file data store of each set.
trainingData = combine(imdsTrain,fdsTrain);
validationData = combine(imdsValidation,fdsValidation);
testData = combine(imdsTest,fdsTest);
Visualize the data set. Render the ground truth keypoints in yellow and the keypoint connections in
green.
data = read(trainingData);
I = data{1};
keypoints = data{2}.keypoint;
Iout = insertObjectKeypoints(I,keypoints, ...
Connections=keypointConnections, ...
ConnectionColor="green", ...
KeypointColor="yellow",KeypointSize=3,LineWidth=3);
figure
imshow(Iout)
Use the handPoseKeypointDetector object and the minibatchqueue (Deep Learning Toolbox)
function to train the HRNet deep learning network on the hand pose data set with a mini-batch size of
8. Decrease the mini-batch size if you run out of memory during training. Create mini-batch queues
for the training and validation data. The minibatchqueue function automatically detects whether a
GPU is available and uses it by default. If you do not have a compatible GPU, or prefer to train on a
CPU, you can specify the OutputEnvironment name-value argument as "cpu" when calling the
minibatchqueue function.
miniBatchSize = 8;
mbqTrain = minibatchqueue(trainingData,3, ...
MiniBatchSize=miniBatchSize, ...
MiniBatchFcn=@(images,keypoints)helperCreateBatchData(images,keypoints,handPoseKeypointDetector), ...
MiniBatchFormat=["SSCB","SSCB","SSCB"]);
• Set the number of epochs to 10. For larger data sets, you must train for a higher number of
epochs.
• Set the learning rate to 0.001.
numEpochs = 10;
initialLearnRate = 0.001;
Initialize the velocity, averageGrad, and averageSqGrad parameters for Adam optimization.
velocity = [];
averageGrad = [];
averageSqGrad = [];
numObservationsTrain = numel(imdsTrain.Files);
numIterationsPerEpoch = floor(numObservationsTrain/miniBatchSize);
numIterations = numEpochs*numIterationsPerEpoch;
Train the HRNet hand pose keypoint detector on the hand pose data. Observe the training progress
plot to monitor the training of the detector object in the custom training loop.
• Read data from the training minibatchqueue. If it does not have any more data, reset and
shuffle the minibatchqueue.
• Evaluate the model gradients using the dlfeval (Deep Learning Toolbox) function. The
modelGradients function, listed as a supporting function, returns the gradients of the loss with
respect to the learnable parameters in the network, the corresponding mini-batch loss, and the
state of the current batch.
• Update the detector parameters using the adamupdate (Deep Learning Toolbox) function.
• Update the state of non-learnable parameters of the detector.
• Update the training progress plot.
if doTraining
monitor = trainingProgressMonitor( ...
Metrics=["TrainingLoss","ValidationLoss"], ...
Info=["Epoch","Iteration","LearningRate"], ...
XLabel="Iteration");
groupSubPlot(monitor,"Loss",["TrainingLoss","ValidationLoss"])
iteration = 0;
monitor.Status = "Running";
reset(mbqTrain)
shuffle(mbqTrain)
if epoch >= 10
currentLR = initialLearnRate/100;
elseif epoch >= 7
currentLR = initialLearnRate/10;
else
currentLR = initialLearnRate;
end
[XTrain,YTrain,WTrain] = next(mbqTrain);
updateInfo(monitor, ...
LearningRate=currentLR, ...
Epoch=string(epoch) + " of " + string(numEpochs), ...
Iteration=string(iteration) + " of " + string(numIterations));
recordMetrics(monitor,iteration, ...
TrainingLoss=trainingLoss, ...
ValidationLoss=validationLoss)
monitor.Progress = 100*iteration/numIterations;
end
end
else
handPoseKeypointDetector = pretrainedKeypointDetector;
end
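The listing above is abridged; the epoch and iteration loops and the gradient update are not shown. A condensed sketch of the loop structure described by the preceding bullets follows. The modelGradients signature and the State property name are assumptions, and the learning-rate schedule, monitor updates, and validation-loss computation that appear in the abridged listing are omitted here for brevity:
for epoch = 1:numEpochs
    reset(mbqTrain)
    shuffle(mbqTrain)
    while hasdata(mbqTrain)
        iteration = iteration + 1;
        % Read a mini-batch of images, target heatmaps, and keypoint weights.
        [XTrain,YTrain,WTrain] = next(mbqTrain);
        % Evaluate the model gradients, loss, and state (signature assumed).
        [gradients,trainingLoss,state] = dlfeval(@modelGradients, ...
            handPoseKeypointDetector,XTrain,YTrain,WTrain);
        % Update the state of the non-learnable parameters (property name assumed).
        handPoseKeypointDetector.State = state;
        % Update the learnable parameters using the Adam optimizer.
        [handPoseKeypointDetector.Learnables,averageGrad,averageSqGrad] = ...
            adamupdate(handPoseKeypointDetector.Learnables,gradients, ...
            averageGrad,averageSqGrad,iteration,currentLR);
    end
end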
Evaluate the hand keypoint detection using the percentage of correct keypoints (PCK) metric [3 on
page 3-15]. The PCK metric measures the percentage of estimated keypoints that fall within a
certain radius of the ground truth keypoints. To compute the PCK metric, set a threshold to define
whether or not a predicted keypoint is accurate. The optimal distance threshold for comparing the
predicted and ground truth keypoints ranges from 0.1 to 0.3. If the distance between them is within
the threshold, you can consider the predicted keypoints accurate.
To compute the PCK metric, calculate the Euclidean distance between the predicted keypoints and
the ground truth keypoints, and then normalize the value by a specified distance. In the case of hand
keypoint detection, use the distance between the middle point and the lowest point of the middle
finger as the normalization factor.
testDataPCK = [];
reset(testData)
while testData.hasdata
data = read(testData);
I = data{1};
keypoint = data{2}.keypoint;
[height, width] = size(I,[1 2]);
bbox = [1 1 width height];
% Distance between the middle and the lower point of middle finger.
normalizationFactor = sqrt((keypoint(5,1)-keypoint(6,1))^2 + (keypoint(5,2)-keypoint(6,2))^2)
threshold = 0.3;
predictedKeypoints = detect(handPoseKeypointDetector,I,bbox);
pck = helperCalculatePCK(predictedKeypoints,keypoint,normalizationFactor,threshold);
testDataPCK = [testDataPCK;pck];
end
PCK = mean(testDataPCK);
disp("Average PCK on the hand pose test dataset is: " + PCK);
A PCK score of 0.9443 on the test data implies that 94.43% of the keypoints have been identified
correctly. To improve the results, you can add more data to the data set or use data augmentation. To
customize this example for your own data, you might need to reduce the learning rate if the
validation loss remains constant and the model does not converge.
Supporting Functions
[dlYPredOut,state] = forward(detector,dlX);
loss = helperCalculateLoss(dlYPredOut,dlW,dlY);
gradients = dlgradient(loss,detector.Learnables);
end
for k = 1:miniBatchSize
I = images{k};
keypoint = keypoints{k}.keypoint;
X(:,:,:,k) = single(rescale(I));
[heatmaps,weights] = helperGenerateHeatmap(single(keypoint),inputSize,outputSize);
Y(:,:,:,k) = single(heatmaps);
W(:,:,:,k) = repmat(permute(weights,[2 3 1]),outputSize(1:2));
end
end
helperPreprocessCropped — Crop the input images based on their bounding boxes, and
transform their corresponding keypoints based on the cropped image coordinates.
end
preprocessedData{2} = keypoint;
preprocessedData{3} = trainingData{3};
preprocessedData{4} = trainingData{4};
end
if size(keypoints,2) == 2
weights = ones(numKeypoints,1);
else
weights = keypoints(:,3);
end
tmpSize = sigma*3;
for k = 1:numKeypoints
muX = round(keypoints(k,1)/featStride(1) + 0.5);
muY = round(keypoints(k,2)/featStride(2) + 0.5);
upperLeft = [floor(muX - tmpSize) floor(muY - tmpSize)];
bottomRight = [floor(muX + tmpSize + 1),floor(muY + tmpSize + 1)];
if (upperLeft(1) >= heatmapSize(1) || upperLeft(2) >= heatmapSize(2) || ...
bottomRight(1) < 0 || bottomRight(2) < 0)
weights(k) = 0;
continue
end
sizeRegion = 2*tmpSize + 1;
[x,y] = meshgrid(1:sizeRegion,1:sizeRegion);
x0 = floor(sizeRegion/2);
y0 = x0;
g = exp(-((x - x0).^2 + (y - y0).^2) ./ (2*(sigma^2)));
gx = [max(0, -upperLeft(1)) min(bottomRight(1),heatmapSize(1))-upperLeft(1)-1] + 1;
helperGetAffineTransform — Calculate the affine transform based on the center and scale of the
image.
function transformMatrix = helperGetAffineTransform(center,scale,outputHeatMapSize,invAffineTrans
% center: Center of the bounding box [x y].
% scale: Scale of the bounding box, normalized by the scale factor, [width height].
% outputHeatMapSize: Size of the destination heatmaps.
% invAffineTransform (boolean): Option to invert the affine transform direction.
% (inv=False: src->dst or inv=True: dst->src).
% shift (0-100%): Shift translation ratio with regard to the width and height.
shift = [0 0];
srcPoint = [1 srcWidth*-0.5];
dstDir = double([1 dstWidth*-0.5]);
src = zeros(3,2);
dst = zeros(3,2);
src(1,:) = center + scaleTmp.*shift;
src(2,:) = center + srcPoint + scaleTmp.*shift;
dst(1,:) = [dstWidth*0.5 dstHeight*0.5];
dst(2,:) = [dstWidth*0.5 dstHeight*0.5] + dstDir;
src(3,:) = helperGetThirdPoint(src(1,:),src(2,:));
dst(3,:) = helperGetThirdPoint(dst(1,:),dst(2,:));
if invAffineTransform
transformMatrix = fitgeotform2d(dst,src,"affine");
else
transformMatrix = fitgeotform2d(src,dst,"affine");
end
end
helperGetThirdPoint — To calculate the affine matrix, you must have three pairs of points. This
function obtains the third point, given 2D points a and b. The function defines the third point by
rotating the vector a - b by 90 degrees anticlockwise, using point b as the rotation center.
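A sketch of this helper that follows the description above might be:
function thirdPoint = helperGetThirdPoint(a,b)
% Rotate the vector (a - b) by 90 degrees counterclockwise about point b.
direction = a - b;
thirdPoint = b + [-direction(2) direction(1)];
end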
Utility Functions
helperDownloadHandPoseDataset — Download the hand pose data set and ground truth labels.
helperCalculatePCK — Calculate the PCK of each predicted keypoint and corresponding ground
truth keypoint.
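Based on the description of the PCK metric earlier in this example, a minimal sketch of this helper could be:
function pck = helperCalculatePCK(predictedKeypoints,groundTruthKeypoints,normalizationFactor,threshold)
% Fraction of keypoints whose normalized Euclidean distance to the ground
% truth keypoint is within the threshold.
d = sqrt(sum((predictedKeypoints(:,1:2) - groundTruthKeypoints(:,1:2)).^2,2));
pck = mean((d./normalizationFactor) <= threshold);
end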
References
[1] Sun, Ke, Bin Xiao, Dong Liu, and Jingdong Wang. “Deep High-Resolution Representation Learning
for Human Pose Estimation.” In 2019 IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 5686–96. Long Beach, CA, USA: IEEE, 2019. https://doi.org/10.1109/
CVPR.2019.00584.
[2] Gomez-Donoso, Francisco, Sergio Orts-Escolano, and Miguel Cazorla. "Large-Scale Multiview 3D
Hand Pose Dataset." Image and Vision Computing 81 (2019): 25–33. https://doi.org/10.1016/
j.imavis.2018.12.001.
[3] Yang, Yi, and Deva Ramanan. "Articulated Human Detection with Flexible Mixtures of Parts." IEEE
Transactions on Pattern Analysis and Machine Intelligence 35, no. 12 (December 2013): 2878–90.
https://doi.org/10.1109/TPAMI.2012.261.
Recognize Seven-Segment Digits Using OCR
This example shows how to recognize seven-segment digits in an image by using optical character
recognition (OCR). In the example, you use the detectTextCRAFT function and region properties to
detect the seven-segment text regions in the image. Then, you use OCR to recognize the seven-
segment digits in the detected text regions.
Read Image
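The image-reading code is not shown in this listing. Assuming a test image file (the file name here is hypothetical), the step might be:
img = imread("sevenSegmentDisplay.jpg"); % hypothetical file name
figure
imshow(img)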
Detect text regions in the input image by using the detectTextCRAFT function. The
CharacterThreshold value is the region threshold to use for localizing each character in the
image. The LinkThreshold value is the affinity threshold that defines the score for grouping two
detected texts into a single instance. You can fine-tune the detection results by modifying the region
and affinity threshold values. Increase the value of the affinity threshold for more word-level and
character-level detections. For information about the effect of the affinity threshold on the detection
results, see the “Detect Characters by Modifying Affinity Threshold” example.
Set the value of the affinity threshold to 0.005. The default value for the region threshold is 0.4. The
output is a set of bounding boxes that localize the text regions in the input image. The bounding box
specifies the spatial coordinates of the detected text regions in the image and is a vector of form [x,
y, width, height]. The vector specifies the upper left corner and size of the detected region in
pixels.
bbox = detectTextCRAFT(img,LinkThreshold=0.005);
Draw the output bounding boxes on the image by using the insertShape function.
Iout = insertShape(img,"rectangle",bbox,LineWidth=4);
In the input image, the seven-segment text region occupies the maximum area. Use the area of the
detected bounding boxes to extract the seven-segment text region.
Compute the area of the bounding boxes and find the bounding box with maximum area.
bboxArea = bbox(:,3).*bbox(:,4);
[value,indx]= max(bboxArea);
Extract the text region with maximum bounding box area from the input image.
roi = bbox(indx,:);
extractedImg = img(roi(2):roi(2)+roi(4),roi(1):roi(1)+roi(3),:);
figure
imshow(extractedImg)
title('Extracted Seven-Segment Text Region')
Recognize the seven-segment digits in the detected text region by using the ocr function. Set the value
of the Model name-value argument to "seven-segment". The output is an ocrText object
containing information about the recognized text, the recognition confidence, and the location of the
text in the original image.
output = ocr(img,roi,Model="seven-segment")
output =
ocrText with properties:
Text: '810000...'
CharacterBoundingBoxes: [17x4 double]
CharacterConfidences: [17x1 single]
Words: {2x1 cell}
WordBoundingBoxes: [2x4 double]
WordConfidences: [2x1 single]
TextLines: {2x1 cell}
TextLineBoundingBoxes: [2x4 double]
Display the recognized seven-segment digits. Notice that OCR detects two bounding boxes enclosing
the text regions and recognizes the digits in each region.
disp([output.Words])
{'810000' }
{'0110555.'}
Iocr = insertObjectAnnotation(img,"Rectangle",output.WordBoundingBoxes,output.Words,LineWidth=4,F
figure
imshow(Iocr)
The main challenges in accurate recognition of the seven-segment digits are the segmentation of text
regions and the choice of the LayoutAnalysis name-value argument of ocr function.
As a preprocessing step, the ocr function performs binarization to segment the text regions from the
background. Due to the nature of the seven-segment text images, the binarized text regions have
disconnected pixels. If the distance between the pixels disconnected along the vertical direction is
large and the value of the LayoutAnalysis parameter is set to "block", the ocr function considers
the input image to have multiple lines of text. Then, the ocr function groups each line of text into a
region and recognizes the digits within each region. As a result, the recognition results might be
inaccurate. In such cases, you can improve the recognition accuracy by selecting a proper value for
the LayoutAnalysis parameter.
If the detected image region consists of only one line of seven-segment text, you can set the
LayoutAnalysis name-value argument to "word", "character", or "line" in order to obtain good
recognition results. For more details about how to select the value for LayoutAnalysis name-value
argument, see ocr.
The input image contains a group of seven-segment digits. To recognize all the digits in the group, set
the value of the LayoutAnalysis name-value argument to "word". Compute the OCR results.
output = ocr(img,roi,Model="seven-segment",LayoutAnalysis="word")
output =
ocrText with properties:
Text: '010555....'
CharacterBoundingBoxes: [9x4 double]
CharacterConfidences: [9x1 single]
Words: {'010555.'}
WordBoundingBoxes: [149 213 1057 365]
WordConfidences: 0.6762
TextLines: {'010555.'}
TextLineBoundingBoxes: [149 213 1057 365]
TextLineConfidences: 0.6762
disp([output.Words])
{'010555.'}
Draw the output bounding boxes on the image by using the insertObjectAnnotation function.
Display the recognition results. Notice that the seven-segment text region in the image is well
localized and the digits are recognized correctly.
Iocr = insertObjectAnnotation(img,"Rectangle",output.WordBoundingBoxes,output.Words,LineWidth=4,F
figure
imshow(Iocr)
Further Exploration
• If the detected text region consists of multiple lines of seven-segment text, set the
LayoutAnalysis name-value argument to "block" for optimal results.
• You can improve the recognition results by accurately localizing and segmenting the seven-
segment text regions in a given image. Though you can use the detectTextCRAFT function for
detecting the text regions, you will have to manually select the appropriate region threshold and
affinity threshold values for good detection results. Alternatively, you can use the Color
Thresholder or Image Segmenter apps to interactively segment the desired text regions in the
image.
• If the segmented region contains outliers, use morphological operations to preprocess the image
before performing OCR. For an example, see the Image Pre-processing and ROI-based Processing
Techniques demonstrated in the “Recognize Text Using Optical Character Recognition (OCR)” on
page 4-46 example. The Improve Recognition Results section in the “Automatically Detect and
Recognize Text Using Pretrained CRAFT Network and OCR” on page 4-14 example also
demonstrates image preprocessing techniques that you can use to improve recognition results if
the image contains multiple lines of text.
Train an OCR Model to Recognize Seven-Segment Digits
This example shows how to train an OCR model to recognize seven-segment digits, use quantization
to improve runtime performance, and evaluate text recognition accuracy. The Computer Vision
Toolbox™ provides several pretrained OCR models, including one for seven-segment digits. Training
an OCR model is necessary when a pretrained model is not effective for your application. This
example demonstrates the general procedure for training an OCR model using the YUVA EB dataset
[1].
Load Data
This example uses 119 images from the YUVA EB dataset. The dataset contains images of energy
meter displays with seven-segment numerals. These images were captured under challenging text
recognition conditions such as tilted positions, lens blur, and non-uniform lighting conditions. A small
dataset is useful for exploring the OCR training procedure, but in practice, more labeled images are
needed to train a robust OCR model.
datasetFiles = helperDownloadDataset;
The images in the dataset were annotated with bounding boxes containing the seven-segment digits,
and text labels were added to these bounding boxes as an attribute by using the Image Labeler app
(see “Get Started with the Image Labeler” on page 11-71). To learn more about labeling images for
OCR training, see “Train Custom OCR Model” on page 19-2. The labels were exported from the app
as a groundTruth object and saved in the 7SegmentGtruth.mat file.
ld = load("7SegmentGtruth.mat");
gTruth = ld.gTruth;
Create datastores that contain images, bounding boxes and text labels from the groundTruth object
using the ocrTrainingData function with the label and attribute names used during labeling.
labelName = "Text";
attributeName = "Digits";
[imds,boxds,txtds] = ocrTrainingData(gTruth,labelName,attributeName);
Analyze the ground truth text to verify that all characters of interest for training have observation
samples in the ground truth data. To verify this, find the character set of the ground truth data.
Read all ground truth text corresponding to each image and combine the text in each image.
allImagesText = txtds.readall;
allText = strjoin(vertcat(allImagesText{:}), "");
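The code that computes the character set is not shown above. One way to obtain it that is consistent with the characterSet and idx variables used below (the exact listing is an assumption) is:
% Unique characters in the ground truth text, and the index of each
% character occurrence into that set.
[characterSet,~,idx] = unique(char(allText));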
The ground truth data contains images of the 10 digits from 0-9 and the period symbol in the seven-
segment font.
In addition to verifying the ground truth character set, it is important to ensure that all characters
have equal representation in the dataset.
Count the occurrences of each of these characters in the ground truth data.
characterSet = cellstr(characterSet');
characterCount = accumarray(idx,1);
Tabulate the character count and sort the count in descending order.
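The table construction is omitted from the listing. A possible version, an assumption that is consistent with the output shown below, is:
characterCountTbl = table(characterSet,characterCount, ...
    VariableNames=["Character","Count"]);
characterCountTbl = sortrows(characterCountTbl,"Count","descend")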
characterCountTbl=11×2 table
Character Count
_________ _____
{'0'} 170
{'.'} 120
{'1'} 98
{'3'} 91
{'2'} 84
{'4'} 78
{'5'} 61
{'9'} 56
{'8'} 55
{'7'} 43
{'6'} 40
numCharacters = numel(characterSet);
figure
bar(1:numCharacters, characterCountTbl.Count)
xticks(1:numCharacters)
xticklabels(characterCountTbl.Character)
xlabel("Digits")
ylabel("Number of samples")
The characters '0' and '.' have the maximum number of occurrences, and the characters '7' and '6'
have the least. In text recognition applications, it is common to have such an imbalance in the
number of character samples because not all characters occur with the same frequency in
paragraphs of text. Dataset imbalance can result in an OCR model that performs poorly on
underrepresented characters. If the trained OCR model shows this behavior, you can balance the
dataset by oversampling the least frequently occurring characters.
cds = combine(imds,boxds,txtds);
Use 60% of the dataset for training and split the rest of the data evenly for validation and testing. The
following code randomly splits the data into training, validation, and test sets.
trainPercent = 60;
[cdsTrain, cdsVal, cdsTest, numTrain, numVal, numTest] = helperPartitionOCRData(cds, trainPercent);
The 60/20/20 split results in the following number of training, validation and test images:
outputDir = "OCRModel";
if ~exist(outputDir, "dir")
mkdir(outputDir);
end
checkpointsDir = "Checkpoints";
if ~exist(checkpointsDir, "dir")
mkdir(checkpointsDir);
end
Use the ocrTrainingOptions function to specify the following training options for OCR training.
Empirical analysis is required to determine the optimal training option values.
• ocrTrainingOptions uses the ADAM solver by default. Set the gradient decay factor for ADAM
optimization to 0.9.
• Use an initial learning rate of 20e-4.
• Set the maximum number of epochs for training to 15.
• Set the verbose frequency to 100 iterations.
• Specify the output directory.
• Specify the checkpoint path to enable saving checkpoints.
• Specify validation data to enable validation step during training.
• Set the validation frequency to 10 iterations.
ocrOptions = ocrTrainingOptions(GradientDecayFactor=0.9,...
InitialLearnRate=20e-4,...
MaxEpochs=15,...
VerboseFrequency=100,...
OutputLocation=outputDir,...
CheckpointPath=checkpointsDir,...
ValidationData=cdsVal,...
ValidationFrequency=10);
Train a new OCR model by fine-tuning the pretrained "english" model. The training will take about
8-9 minutes.
trainedModelName = "sevenSegmentModel";
baseModel = "english";
[trainedModel, trainingInfo] = trainOCR(cdsTrain, trainedModelName, baseModel, ocrOptions);
*************************************************************************
Starting OCR training
|================================================================================================
| Epoch | Iteration | Time Elapsed | Training Statistics | Validatio
| | | (hh:mm:ss) | RMSE | Character Error | Word Error | RMSE | Charact
|================================================================================================
| 1 | 1 | 00:02:02 | 18.73 | 100.00 | 100.00 | 0.00 | 0.
| 1 | 100 | 00:02:40 | 9.05 | 40.80 | 65.00 | 4.42 | 11
| 2 | 200 | 00:03:20 | 6.15 | 24.33 | 46.00 | 3.35 | 13
| 3 | 300 | 00:03:59 | 4.73 | 16.95 | 34.33 | 2.75 | 10
| 4 | 400 | 00:04:38 | 3.90 | 13.71 | 27.00 | 2.67 | 9.
| 5 | 500 | 00:05:16 | 3.36 | 11.33 | 22.40 | 5.59 | 27
| 6 | 600 | 00:05:54 | 3.04 | 10.03 | 19.67 | 2.44 | 11
| 7 | 700 | 00:06:32 | 2.72 | 8.69 | 17.29 | 3.14 | 11
| 8 | 800 | 00:07:11 | 2.45 | 7.63 | 15.25 | 2.39 | 9.
| 9 | 900 | 00:07:49 | 2.24 | 6.79 | 13.56 | 2.92 | 14
| 10 | 1000 | 00:08:27 | 2.07 | 6.11 | 12.20 | 2.31 |