Deep learning – Object Detection and Segmentation
Classification vs. Detection
[Figure: classification assigns one label to the whole image (✓ Dog), while detection draws a labeled bounding box around each object (Dog, Dog).]
Object Detection
[Figure: an image with detected objects labeled deer and cat.]
Object Detection as Classification
[Figure, repeated over several crops: a CNN classifies each candidate region as deer?, cat?, or background?]
Object Detection as Classification with Sliding Window
[Figure: the CNN classifies each crop (deer? cat? background?) as a fixed-size window slides across the image; see the sketch below.]
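The sliding-window approach can be written as a short loop. A minimal sketch, assuming a 64×64 window, a stride of 32, and a stub standing in for a real trained CNN classifier:

```python
import numpy as np

def classify_window(crop):
    # Stub for illustration only; a real system would run a trained CNN here.
    return {"deer": 0.1, "cat": 0.1, "background": 0.8}

def sliding_window_detect(image, window=(64, 64), stride=32, threshold=0.5):
    """Slide a fixed-size window over the image and classify every crop."""
    detections = []
    H, W = image.shape[:2]
    wh, ww = window
    for y in range(0, H - wh + 1, stride):
        for x in range(0, W - ww + 1, stride):
            probs = classify_window(image[y:y + wh, x:x + ww])
            label, score = max(probs.items(), key=lambda kv: kv[1])
            if label != "background" and score >= threshold:
                detections.append((x, y, ww, wh, label, score))
    return detections

print(sliding_window_detect(np.zeros((128, 128, 3))))  # [] with the stub classifier
```

Note the cost: one CNN forward pass per window position and scale, which is why later detectors share computation instead.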
Object Detection as Classification with Box Proposals
Histogram of Oriented Gradients (HOG) - 1986
Example:
From this, we observe that most of the gradients are either up or down
HOG + SVM Vehicle detection
Demo code:
HOG\HOG.py
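The demo script itself is not reproduced here; below is a minimal sketch of a HOG + linear SVM vehicle-detection pipeline using scikit-image and scikit-learn, assuming grayscale 64×64 vehicle/non-vehicle patches are available:

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(images):
    """Compute a HOG descriptor for each grayscale patch (e.g., 64x64)."""
    return np.array([
        hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        for img in images
    ])

# X_pos / X_neg: arrays of vehicle / non-vehicle patches (assumed available).
# clf = LinearSVC().fit(
#     np.vstack([hog_features(X_pos), hog_features(X_neg)]),
#     np.hstack([np.ones(len(X_pos)), np.zeros(len(X_neg))]),
# )
# At test time, score the HOG features of each sliding window with
# clf.decision_function and keep windows above a threshold.
```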
Object Detection
• The RCNN Object Detector (2014)
• The Fast RCNN Object Detector (2015)
• The Faster RCNN Object Detector (2015)
• The YOLO Object Detector (2016)
• The SSD Object Detector (2016)
• Mask-RCNN (2017)
RCNN
https://people.eecs.berkeley.edu/~rbg/papers/r-cnn-cvpr.pdf
Rich feature hierarchies for accurate object detection and semantic
segmentation. Girshick et al. CVPR 2014.
Fast-RCNN
Idea: compute the image features once instead of recomputing them for every box independently,
and regress refined bounding-box coordinates.
https://arxiv.org/abs/1504.08083
https://github.com/sunshineatnoon/Paper-Collection/blob/master/Fast-RCNN.md
Fast R-CNN. Girshick. ICCV 2015.
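The heart of Fast R-CNN is RoI pooling over one shared feature map. A small sketch with torchvision's roi_pool; the feature map and boxes are dummy values, and the 800×800 image size is an assumption:

```python
import torch
from torchvision.ops import roi_pool

features = torch.randn(1, 256, 50, 50)  # backbone output for one 800x800 image
boxes = torch.tensor([[0, 100., 100., 400., 400.],   # (batch_idx, x1, y1, x2, y2)
                      [0, 300., 200., 700., 600.]])  # in image coordinates
# spatial_scale maps image coordinates onto the feature map (50/800 here).
pooled = roi_pool(features, boxes, output_size=(7, 7), spatial_scale=50 / 800)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]) -> one fixed-size tensor per proposal
```

Every proposal thus reuses the same convolutional features and is reduced to a fixed size for the classification and box-regression heads.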
Faster-RCNN
Idea: integrate the bounding-box proposals into the CNN itself via a Region Proposal Network (RPN).
https://arxiv.org/abs/1506.01497
Ren et al. NIPS 2015.
Region proposals
Step 1: Group similar regions.
Step 2: Predict objects from the candidate regions (see the sketch below).
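A quick way to try grouping-based region proposals is OpenCV's selective-search implementation, the classic method used by R-CNN and Fast R-CNN before the RPN replaced it. This requires the opencv-contrib-python package, and the image path is a placeholder:

```python
import cv2

img = cv2.imread("street.jpg")  # placeholder path
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()  # Step 1: group similar regions
rects = ss.process()              # Step 2: emit candidate boxes as (x, y, w, h)
print(f"{len(rects)} proposals")  # typically a few thousand per image
```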
YOLO - You Only Look Once
Idea: No bounding box proposals. Predict a class and a box for every location in a grid.
https://arxiv.org/abs/1506.02640 Redmon et al. CVPR 2016.
YOLO - You Only Look Once
Divide the image into 7x7 cells.
Each cell trains a detector.
The detector needs to predict the object's class distribution.
The detector has 2 bounding-box predictors to predict bounding boxes and confidence scores.
Demo Code: YOLO\ytest.py
https://arxiv.org/abs/1506.02640 Redmon et al. CVPR 2016.
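To make the grid concrete, here is a sketch of how the original YOLO output tensor is laid out and decoded, assuming the paper's defaults (S=7 grid, B=2 boxes per cell, C=20 PASCAL VOC classes, i.e., a 7×7×30 tensor); a real decoder would also convert boxes to image coordinates and apply non-max suppression:

```python
import numpy as np

S, B, C = 7, 2, 20
out = np.random.rand(S, S, B * 5 + C)  # stand-in for the network's prediction

for i in range(S):
    for j in range(S):
        cell = out[i, j]
        boxes = cell[:B * 5].reshape(B, 5)  # each row: (x, y, w, h, confidence)
        class_probs = cell[B * 5:]          # one class distribution shared by the cell
        b = boxes[np.argmax(boxes[:, 4])]   # keep the more confident of the 2 predictors
        score = b[4] * class_probs.max()    # class-specific confidence score
```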
The YOLOv8 model is faster and more accurate while providing a unified framework for training models for performing
• Object Detection,
• Instance Segmentation, and
• Image Classification.
Watch YOLOv8: https://youtu.be/QgF5PHDCwHw
https://www.stereolabs.com/blog/performance-of-yolo-v5-v7-and-v8/
YOLO is known for its high speed and good accuracy, making it suitable for real-time object detection tasks.
However, it is sensitive to object scale and struggles with small objects.
YOLOv2 One of the main differences between YOLO and YOLOv2 is the use of a new backbone network called Darknet-19. The Darknet-19 architecture is VGG-like but with fewer layers and fewer filters per layer, which gives YOLOv2 a faster inference time than the original YOLO. YOLOv2 also introduced batch normalization, anchor boxes, and multi-scale training.
YOLOv3 One of the main differences between YOLOv2 and YOLOv3 is the use of a new and larger backbone network called Darknet-53, which borrows residual connections from ResNet. This allows YOLOv3 to have better accuracy than YOLOv2. YOLOv3 also predicts boxes at three different scales, which improves robustness to object size and, in particular, the detection of small objects.
YOLOv4 One of the main differences between YOLOv3 and YOLOv4 is the use of a new and more powerful backbone network, CSPDarknet-53 (chosen over alternatives such as CSPResNeXt-50 and EfficientNet-B3). Another difference is the use of the “bag of freebies”: training-time techniques such as mosaic data augmentation and label smoothing that improve accuracy without increasing inference cost. Overall, YOLOv4 improved the accuracy of the model at a speed comparable to YOLOv3.
YOLOv5 One of the main differences between YOLOv4 and YOLOv5 is that YOLOv5 is a PyTorch reimplementation (by Ultralytics) built around a CSP-based backbone similar to YOLOv4's. It is released as a family of models with different depths and widths, from nano to extra-large, which lets users trade accuracy for speed, and it made training and deployment considerably easier.
YOLOv7 One of the main differences between YOLOv5 and YOLOv7 is the use of a new and more powerful backbone architecture, E-ELAN (Extended Efficient Layer Aggregation Network), together with “trainable bag-of-freebies” techniques such as model re-parameterization, which improve the accuracy of the model without increasing inference cost.
YOLOv8 is the latest version in the YOLO series, building upon the success of previous models. It introduces an anchor-free, decoupled detection head and a refined CSP backbone that set it apart from earlier YOLO models. This design has led to improvements in accuracy and performance. YOLOv8 is built on the YOLOv5 framework and includes several architectural and developer-experience improvements. It is faster and more accurate than YOLOv5, and it provides a unified framework for training models for object detection, instance segmentation, and image classification.
Watch comparison:
https://youtu.be/b7Lk7aRa5Ek
[Charts: training time and accuracy of YOLOv5 through YOLOv8.]
Demo YOLOv8: see yoloinstruction.txt
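The instruction file is not reproduced here; a minimal YOLOv8 detection demo with the ultralytics package (pip install ultralytics), where the weights file and image path are assumptions:

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # nano weights, downloaded on first use
results = model("bus.jpg")  # run detection on an image (path assumed)
for r in results:
    print(r.boxes.xyxy, r.boxes.cls, r.boxes.conf)  # boxes, class ids, scores
```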
SSD: Single Shot Detector
Idea: Similar to YOLO, but with a denser grid and multiscale grid maps, plus data augmentation, hard negative mining, and other design choices in the network (a sketch of hard negative mining follows below).
Liu et al. ECCV 2016.
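As a sketch of the hard-negative-mining idea: keep all positive anchors but only the highest-loss negatives, at roughly a 3:1 negative-to-positive ratio. The anchor count and losses below are illustrative:

```python
import torch

def hard_negative_mask(cls_loss, is_positive, neg_pos_ratio=3):
    """cls_loss: per-anchor classification loss; is_positive: mask of matched anchors."""
    num_pos = int(is_positive.sum())
    neg_loss = cls_loss.clone()
    neg_loss[is_positive] = 0.0  # exclude positives from the ranking
    num_neg = min(neg_pos_ratio * num_pos, int((~is_positive).sum()))
    idx = torch.topk(neg_loss, num_neg).indices  # hardest negatives = largest loss
    mask = is_positive.clone()
    mask[idx] = True
    return mask  # anchors that contribute to the classification loss

# Example: 8732 anchors (SSD300's default count), random losses, 20 positives.
loss = torch.rand(8732)
pos = torch.zeros(8732, dtype=torch.bool)
pos[:20] = True
print(hard_negative_mask(loss, pos).sum())  # 20 positives + 60 negatives
```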
[Chart: detection speed comparison in video frames per second.]
Object detection in medical images
Non-Max Suppression (NMS) is a technique used to select one bounding box for an object when multiple bounding boxes with varying probability scores were detected by object detection algorithms (e.g., Faster R-CNN, YOLO).
Demo Code: NMS\Nms.py
IoU (Intersection over Union) measures the overlap between two boxes and decides which ones to suppress.
[Figure: three overlapping detections with confidence scores 0.85, 0.8, and 0.8; only the 0.85 box survives.]
Why 0.5? What happens if the threshold is higher? -> more overlapped boxes are kept.
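The demo file is not reproduced here; the following is a minimal NumPy sketch of NMS, using the same 0.85/0.8/0.8 scores as the figure above:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_threshold=0.5):
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(best)
        # drop every remaining box that overlaps the winner too much
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep

boxes = np.array([[10, 10, 60, 60], [12, 12, 62, 62], [100, 100, 150, 150]], float)
print(nms(boxes, np.array([0.85, 0.8, 0.8])))  # -> [0, 2]; the duplicate box is suppressed
```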
Segmentation
What is the difference?
In the left image, every pixel belongs to a particular class (either background or person), and all the pixels belonging to a particular class are represented by the same color (background as black and person as pink). This is an example of semantic segmentation.
The right image also assigns a particular class to each pixel of the image. However, different objects of the same class have different colors (Person 1 as red, Person 2 as green, background as black, etc.). This is an example of instance segmentation.
Thresholding (see the sketch after this list)
Edge Segmentation
Deep Learning-based methods
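As a sketch of the first of these methods, classical thresholding, here is Otsu's method via OpenCV; the image paths are placeholders:

```python
import cv2

gray = cv2.imread("cells.png", cv2.IMREAD_GRAYSCALE)  # placeholder image
# Otsu's method picks the threshold automatically from the intensity histogram.
t, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print(f"Otsu threshold: {t}")  # every pixel above t becomes foreground
cv2.imwrite("cells_mask.png", mask)
```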
Convolutional Encoder-Decoder Architecture
SegNet - 2015
Mask R-CNN
1. We take an image as input and pass it to the ConvNet, which returns the feature map for that image.
2. A Region Proposal Network (RPN) is applied to these feature maps. This returns the object proposals along with their objectness scores.
3. A RoI Align layer is applied to these proposals to bring all the proposals down to the same size.
4. Finally, the proposals are passed to a fully connected layer to classify the objects and output their bounding boxes; a parallel branch also returns the mask for each proposal.
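These steps can be exercised end to end with torchvision's pretrained Mask R-CNN. A sketch, assuming a recent torchvision; the input below is random noise purely to show the input/output shapes:

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
image = torch.rand(3, 480, 640)  # stand-in for a real RGB image scaled to [0, 1]
with torch.no_grad():
    out = model([image])[0]      # one dict per input image
print(out["boxes"].shape, out["labels"].shape, out["masks"].shape)
# masks: (N, 1, H, W) soft masks, one per detected instance
```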
U-Net – medical image segmentation
U-Net: The U-Net addresses the problems of general CNN architectures used for medical image segmentation, since it adopts a symmetric encoder-decoder structure with skip connections.
Unlike common images, medical images usually contain noise and show blurred boundaries. It is therefore very difficult to detect or recognize objects in medical images by relying only on low-level image features.
At the same time, it is impossible to obtain accurate boundaries by relying only on semantic features, due to the lack of image detail information. The U-Net effectively fuses low-level and high-level image features by combining low-resolution and high-resolution feature maps through skip connections, which makes it well suited to medical image segmentation tasks.
Currently, the U-Net has become the benchmark for most medical image segmentation tasks and has inspired many meaningful improvements.
The low-level information helps to improve accuracy. The high-level information helps to extract complex features.
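A minimal U-Net sketch in PyTorch showing the symmetric encoder-decoder and the skip connections that fuse low-level and high-level features; only two scales are used here (a real U-Net uses four or five), and the channel sizes are illustrative:

```python
import torch
import torch.nn as nn

def block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.enc1, self.enc2 = block(1, 64), block(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = block(128, 256)
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = block(256, 128)  # 128 skip channels + 128 upsampled
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = block(128, 64)   # 64 skip channels + 64 upsampled
        self.head = nn.Conv2d(64, n_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)                                # high-resolution, low-level features
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))               # low-resolution, high-level features
        d2 = self.dec2(torch.cat([self.up2(b), e2], 1))  # skip connection fuses the two
        d1 = self.dec1(torch.cat([self.up1(d2), e1], 1))
        return self.head(d1)                             # per-pixel class logits

print(TinyUNet()(torch.rand(1, 1, 128, 128)).shape)  # torch.Size([1, 2, 128, 128])
```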
Annotation
https://www.mdpi.com/2071-1050/13/3/1224/pdf
Image segmentation applications
Robotics (Machine Vision)
1. Instance segmentation for robotic grasping
2. Recycling object picking
3. Autonomous navigation and SLAM
https://youtu.be/aZkmeGIWZVw
Medical imaging
1. Medical image segmentation is the process of extracting the desired object (organ) from a medical image (2D or 3D)
2. X-Ray segmentation
3. CT scan organ segmentation
4. Dental instance segmentation
5. Digital pathology cell segmentation
6. Surgical video annotation
https://youtu.be/wYdI12EN00M
Self-Driving Cars
Drivable surface semantic segmentation
Car and pedestrian instance segmentation
In-vehicle object detection (stuff left behind by passengers)
Pothole detection and segmentation
and many more…