
Yolo & Object Detection - Complete Note + Lab

summary of object detection using YOLO

YOLO & Object Detection — Complete Note + Lab

Goal: A self-contained note that starts from what object detection is, transitions into YOLO intuition and math,
explains how to implement it with CNNs, and finishes with lab-ready TensorFlow/Keras code you can run and
adapt.

1. What is Object Detection?


Object detection is the computer vision task of finding what objects are present in an image and where they
are located. For each detected object we want:

• a class label (what it is), and


• a bounding box (where it is) — typically represented as (x_min, y_min, x_max, y_max) or
(x_center, y_center, width, height) .

Related tasks

• Image classification — one label for the whole image.


• Object detection — multiple labeled boxes per image.
• Semantic segmentation — per-pixel class labels.
• Instance segmentation — segmentation masks per object.

Important evaluation metrics


• Intersection over Union (IoU) for a predicted box B_pred and a ground-truth box B_gt: IoU = Area(B_pred ∩ B_gt) / Area(B_pred ∪ B_gt)
• Precision / Recall computed over detections.
• mAP (mean Average Precision) across classes at IoU thresholds (e.g., 0.5, or COCO-style multiple
thresholds).
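
The IoU formula above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming boxes in (x_min, y_min, x_max, y_max) form; the function name `iou_xyxy` is a hypothetical helper, not part of any library.

```python
def iou_xyxy(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle; width/height clamp to 0 when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


pred = (0, 0, 10, 10)
gt = (5, 5, 15, 15)
# Intersection is 5 x 5 = 25, union is 100 + 100 - 25 = 175.
print(iou_xyxy(pred, gt))
```

With a threshold of 0.5, this prediction (IoU = 25/175, about 0.14) would count as a miss.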

2. The YOLO Family — Intuition and Evolution


YOLO (You Only Look Once) treats detection as a single regression problem from image pixels to bounding
boxes + class probabilities. Its core ideas:

• Divide the image into a grid of cells.


• Each grid cell is responsible for predicting objects whose center falls in that cell.
• The network predicts, for each cell, one or more bounding boxes and class scores — in one forward
pass.

Why YOLO? It is fast and end-to-end: a single network produces all detections, allowing real-time
performance.

Evolution highlights:

• YOLOv1 (2016) — grid-based prediction, simpler, used MSE loss for everything.
• YOLOv2 — introduced anchor boxes (priors) and dimension clustering (k-means) to find good
anchors.
• YOLOv3/v4 — multi-scale detection and improved backbones.
• Newer YOLOs (v5/…/v8) continue to refine architecture, training, and deployment.

3. Mathematical Formulation (YOLOv1-style, then anchors)

3.1 Grid and Output tensor

Let the image be resized to a fixed size (e.g., 448×448). Divide image into an S × S grid.

For each cell i (there are S^2 cells), the network predicts:

• for j = 1 … B bounding boxes: x_ij, y_ij, w_ij, h_ij, C_ij
• x, y are offsets of the box center relative to the cell (often normalized between 0 and 1)
• w, h are relative to the image (or predicted as offsets from anchors; see Section 3.2)
• C_ij is the confidence, Pr(object) × IoU(predicted box, ground truth)
• class probabilities for the cell: p_i(c) for c = 1 … C

So the output per cell has size B × 5 + C, where C here is the number of classes (not the confidence C_ij). The full tensor shape is (S, S, B × 5 + C). For example, S = 7, B = 2 and C = 20 give a 7 × 7 × 30 tensor.
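
To make the per-cell layout concrete, here is a small sketch that splits one cell's flat prediction vector into its boxes and class probabilities. The ordering (all box values first, class probabilities last) is an assumption made for illustration; the name `split_cell` is hypothetical, and a real implementation may order the channels differently.

```python
S, B, NUM_CLASSES = 7, 2, 3          # NUM_CLASSES plays the role of C in the text
CELL_DEPTH = B * 5 + NUM_CLASSES     # values predicted per grid cell


def split_cell(cell_vector, num_boxes=B, num_classes=NUM_CLASSES):
    """Split one cell's flat prediction into boxes and class probabilities.

    Assumed layout (hypothetical): [x, y, w, h, conf] repeated num_boxes
    times, followed by num_classes class probabilities.
    """
    assert len(cell_vector) == num_boxes * 5 + num_classes
    boxes = [tuple(cell_vector[j * 5:(j + 1) * 5]) for j in range(num_boxes)]
    class_probs = list(cell_vector[num_boxes * 5:])
    return boxes, class_probs


cell = [0.5, 0.5, 0.2, 0.3, 0.9,   # box 1: x, y, w, h, confidence
        0.4, 0.6, 0.1, 0.1, 0.1,   # box 2
        0.7, 0.2, 0.1]             # class probabilities
boxes, class_probs = split_cell(cell)
print(len(boxes), len(class_probs), S * S * CELL_DEPTH)
```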

3.2 Bounding box parameterization (anchors)

If not using anchors (YOLOv1), the network directly predicts x, y, w, h .

When anchors are used (common in YOLOv2+), each grid cell has pre-defined anchor boxes with sizes (w_a, h_a). The network predicts offsets t_x, t_y, t_w, t_h and we decode as:

b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = w_a ⋅ e^(t_w)
b_h = h_a ⋅ e^(t_h)

• (c_x, c_y) is the top-left corner coordinate of the grid cell (integer grid indices).
• σ(⋅) is the sigmoid function producing values in (0,1), ensuring centers remain inside the cell.
• The exponential on t_w, t_h ensures positive widths/heights and models multiplicative offsets.
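
The decode equations above translate directly into code. The sketch below assumes anchors are expressed in grid-cell units and divides by the grid size to obtain image-normalized coordinates; if your anchors are already normalized to the image, the width and height should not be divided again. The function names are illustrative.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, anchor_w, anchor_h, grid_size):
    """Decode raw offsets into (x_center, y_center, w, h), normalized to [0, 1].

    Assumption: anchor_w and anchor_h are given in grid-cell units.
    """
    b_x = sigmoid(t_x) + c_x          # center stays inside cell (c_x, c_y)
    b_y = sigmoid(t_y) + c_y
    b_w = anchor_w * math.exp(t_w)    # always positive
    b_h = anchor_h * math.exp(t_h)
    return (b_x / grid_size, b_y / grid_size,
            b_w / grid_size, b_h / grid_size)


# All-zero offsets put the center in the middle of cell (3, 3) on a 7 x 7 grid
# and reproduce the anchor size unchanged.
print(decode_box(0.0, 0.0, 0.0, 0.0, c_x=3, c_y=3,
                 anchor_w=2.0, anchor_h=4.0, grid_size=7))
```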

3.3 Confidence and class probability

• The network also predicts an objectness/confidence score Ĉ per predicted box.
• Conditional class probabilities are predicted per grid cell: p̂(c ∣ object).
• The final per-box class score is often computed as: score(c) = Ĉ × p̂(c ∣ object).

4. Loss Function — Why It’s Special
YOLO's loss combines multiple objectives in one scalar. The canonical formulation (YOLOv1) uses Mean
Squared Error (MSE) with weights:

\[
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{obj} \left[(x_{ij}-\hat{x}_{ij})^2 + (y_{ij}-\hat{y}_{ij})^2\right] \\
&+ \lambda_{coord} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{obj} \left[\left(\sqrt{w_{ij}}-\sqrt{\hat{w}_{ij}}\right)^2 + \left(\sqrt{h_{ij}}-\sqrt{\hat{h}_{ij}}\right)^2\right] \\
&+ \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{obj} \left(C_{ij}-\hat{C}_{ij}\right)^2 \\
&+ \lambda_{noobj} \sum_{i=1}^{S^2} \sum_{j=1}^{B} \mathbb{1}_{ij}^{noobj} \left(C_{ij}-\hat{C}_{ij}\right)^2 \\
&+ \sum_{i=1}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c=1}^{C} \left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
\]

Meaning of terms:

• 1_ij^obj = 1 if an object is present in cell i and bounding box j is the responsible box (usually the predicted box with highest IoU to the ground-truth). Otherwise 0. 1_ij^noobj is its complement, and 1_i^obj = 1 if any object's center falls in cell i.
• The square root on widths/heights makes a given absolute error count less for a large box than for a small one (stabilizes regression).
• λ_coord (e.g., 5) increases weight on localization.
• λ_noobj (e.g., 0.5) decreases the confidence penalty for boxes that contain no object (helps with class imbalance: few objects vs many background cells).
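
As a rough illustration of how the five terms combine, the sketch below evaluates the loss over a list of cells in plain Python. The dictionary layout (`pred_boxes`, `true_box`, `responsible`, `true_conf`, class lists) is a hypothetical structure chosen for readability, and the responsible box index is assumed to be given rather than computed from IoU. It is not a training-ready implementation.

```python
import math


def yolo_v1_loss(cells, lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-of-squares YOLOv1-style loss over a list of cell records.

    Each cell is a dict (hypothetical layout):
      pred_boxes   : list of (x, y, w, h, conf) predictions, w and h >= 0
      true_box     : (x, y, w, h) or None if the cell holds no object
      responsible  : index of the responsible predicted box (if object)
      true_conf    : target confidence for the responsible box (if object)
      pred_classes, true_classes : per-class probabilities (if object)
    """
    total = 0.0
    for cell in cells:
        has_obj = cell["true_box"] is not None
        for j, (x, y, w, h, conf) in enumerate(cell["pred_boxes"]):
            if has_obj and j == cell["responsible"]:
                tx, ty, tw, th = cell["true_box"]
                # Terms 1 and 2: center and sqrt-size localization errors.
                total += lambda_coord * ((tx - x) ** 2 + (ty - y) ** 2)
                total += lambda_coord * ((math.sqrt(tw) - math.sqrt(w)) ** 2
                                         + (math.sqrt(th) - math.sqrt(h)) ** 2)
                # Term 3: confidence error for the responsible box.
                total += (cell["true_conf"] - conf) ** 2
            else:
                # Term 4: down-weighted confidence error, target is 0.
                total += lambda_noobj * conf ** 2
        if has_obj:
            # Term 5: class-probability error, once per object cell.
            total += sum((p - q) ** 2 for p, q in
                         zip(cell["true_classes"], cell["pred_classes"]))
    return total


empty_cell = {"pred_boxes": [(0.5, 0.5, 0.1, 0.1, 0.5),
                             (0.5, 0.5, 0.1, 0.1, 0.5)],
              "true_box": None}
print(yolo_v1_loss([empty_cell]))  # 2 boxes x 0.5 x 0.5^2
```

Note how the empty cell contributes only through the λ_noobj term, which is why lowering that weight keeps the many background cells from dominating the gradient.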

Differences from usual ANN losses

• Classification uses cross-entropy typically, but YOLOv1 used MSE on class probabilities. Later
versions moved toward log-loss (cross-entropy) for classes.
• YOLO mixes regression losses (MSE on continuous coordinates) with classification loss. Because the terms live on different scales, the combination demands careful weighting.
• YOLO explicitly separates losses for object-present and no-object cells with different weights.

5. Anchor Generation (k-means with IoU)


Good anchor sizes help convergence. The standard approach:

1. Extract all ground-truth boxes from dataset, convert to normalized (w, h) (divided by image width/
height).
2. Run k-means clustering but with distance = 1 − IoU (box, centroid) instead of Euclidean.
3. The cluster centers are the normalized anchors.

This produces anchors tailored to dataset shape distributions (vehicles vs people produce different
anchors).
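
The distance used in step 2 can be illustrated on its own. The sketch below assumes each box is described only by (w, h) and that box and centroid are aligned at a common center, so the intersection is simply the overlap of the two rectangles; the anchor values are made up for the example.

```python
def wh_iou(wh_a, wh_b):
    """IoU of two boxes given only as (w, h), assuming they share a center."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def nearest_anchor(box_wh, anchors):
    """Index of the anchor with the smallest (1 - IoU) distance to box_wh."""
    distances = [1.0 - wh_iou(box_wh, a) for a in anchors]
    return min(range(len(anchors)), key=distances.__getitem__)


anchors = [(0.1, 0.2), (0.4, 0.4), (0.8, 0.3)]   # hypothetical normalized anchors
print(nearest_anchor((0.35, 0.5), anchors))
```

One reason to prefer 1 − IoU over Euclidean distance is that Euclidean distance penalizes the same relative mismatch more heavily for large boxes than for small ones, whereas IoU is scale-invariant.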

6. Inference: Decoding Predictions & Non-Max Suppression (NMS)
Steps:

1. Forward pass → get a tensor of shape (S, S, B × 5 + C).
2. For each cell and each anchor (box), decode b_x, b_y, b_w, b_h into absolute image coordinates.
3. Compute class scores: score(c) = conf × p(c).
4. Filter boxes by a confidence threshold (e.g., 0.25).
5. For each class separately, run NMS: keep the highest-scoring box, remove every remaining box whose IoU with it exceeds a threshold (e.g., 0.5), and repeat on the boxes that are left.

TensorFlow helper: tf.image.non_max_suppression(boxes, scores, max_output_size, iou_threshold). It expects boxes as [y1, x1, y2, x2] and returns the indices of the boxes to keep.
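
Steps 3 to 5 can be sketched for a single class in plain Python. This is an illustrative greedy NMS, not the TensorFlow implementation; boxes are assumed to be in (x_min, y_min, x_max, y_max) form and the helper names are hypothetical.

```python
def iou_xyxy(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5, score_threshold=0.25):
    """Greedy NMS for one class; returns kept indices, highest score first."""
    # Step 4: drop low-score boxes, then sort the rest by descending score.
    order = sorted((i for i, s in enumerate(scores) if s >= score_threshold),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Step 5: discard boxes that overlap the kept box too strongly.
        order = [i for i in order
                 if iou_xyxy(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140), (0, 0, 5, 5)]
conf = [0.9, 0.8, 0.7, 0.9]
p_class = [0.9, 0.9, 0.8, 0.1]
scores = [c * p for c, p in zip(conf, p_class)]   # step 3: conf x p(c)
print(nms(boxes, scores))  # [0, 2]
```

Box 1 is suppressed because it overlaps box 0 with IoU of about 0.82, and box 3 never enters NMS because its score (0.09) is below the confidence threshold.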

7. Implementation Notes & Practical Tips


• Use sigmoid for tx, ty and objectness to bound outputs.
• Use exponential for width/height offsets when using anchors.
• Normalize bounding boxes to image size for stable training.
• Use multiple scales (detection at different feature map sizes) for better small-object detection (used
by YOLOv3+).
• For the loss, start with the YOLOv1 loss weights (λ_coord = 5, λ_noobj = 0.5); tune later.

8. Lab Code — YOLOv1-style skeleton with anchors & loss (TensorFlow/Keras)

This lab is a working, runnable skeleton. It uses a simplified backbone and a reasonably complete loss +
decode + NMS pipeline. Replace the dataset-loading parts with your own annotated data.

Note: This is pedagogical — real production YOLOs use more sophisticated backbones and
training tricks.

# yolo_lab.py (run with Python 3.8+, TensorFlow 2.x)

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

# --------------------------
# Utilities: IoU, decode, NMS
# --------------------------

def box_iou(boxes1, boxes2):
    # boxes: [..., 4] as (x_center, y_center, w, h) in absolute coords
    # convert to corner form (x1, y1, x2, y2)
    def to_x1y1x2y2(b):
        x, y, w, h = b[..., 0], b[..., 1], b[..., 2], b[..., 3]
        x1 = x - w / 2
        y1 = y - h / 2
        x2 = x + w / 2
        y2 = y + h / 2
        return tf.stack([x1, y1, x2, y2], axis=-1)

    b1 = to_x1y1x2y2(boxes1)
    b2 = to_x1y1x2y2(boxes2)

    # intersection rectangle; width/height clamp to 0 for disjoint boxes
    ix1 = tf.maximum(b1[..., 0], b2[..., 0])
    iy1 = tf.maximum(b1[..., 1], b2[..., 1])
    ix2 = tf.minimum(b1[..., 2], b2[..., 2])
    iy2 = tf.minimum(b1[..., 3], b2[..., 3])
    iw = tf.maximum(ix2 - ix1, 0.0)
    ih = tf.maximum(iy2 - iy1, 0.0)
    inter = iw * ih

    area1 = (b1[..., 2] - b1[..., 0]) * (b1[..., 3] - b1[..., 1])
    area2 = (b2[..., 2] - b2[..., 0]) * (b2[..., 3] - b2[..., 1])
    union = area1 + area2 - inter
    iou = inter / (union + 1e-9)  # epsilon avoids division by zero
    return iou

# --------------------------
# Anchor calculation (k-means IoU)
# --------------------------

def kmeans_anchors(boxes_wh, k=3, max_iter=100):
    # boxes_wh: N x 2 array of normalized widths/heights
    N = boxes_wh.shape[0]
    np.random.seed(0)
    centroids = boxes_wh[np.random.choice(N, k, replace=False)]

    for _ in range(max_iter):
        # compute 1 - IoU as distance (box and centroid share a center)
        dists = np.zeros((N, k))
        for i in range(N):
            for j in range(k):
                w1, h1 = boxes_wh[i]
                w2, h2 = centroids[j]
                inter_w = min(w1, w2)
                inter_h = min(h1, h2)
                inter = inter_w * inter_h
                union = w1 * h1 + w2 * h2 - inter
                iou = inter / union
                dists[i, j] = 1 - iou
        nearest = np.argmin(dists, axis=1)
        # keep the previous centroid if a cluster ends up empty
        new_centroids = np.array([
            boxes_wh[nearest == j].mean(axis=0) if np.any(nearest == j)
            else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids

# --------------------------
# Model: simple YOLOv1-style
# --------------------------

S = 7 # grid
B = 2 # boxes per cell
C = 3 # classes (example)
INPUT_SHAPE = (448,448,3)

def build_simple_yolo(input_shape=INPUT_SHAPE, S=7, B=2, C=3):
    inputs = tf.keras.Input(shape=input_shape)
    x = layers.Conv2D(64, 7, strides=2, padding='same', activation='relu')(inputs)
    x = layers.MaxPool2D(2)(x)
    x = layers.Conv2D(192, 3, padding='same', activation='relu')(x)
    x = layers.MaxPool2D(2)(x)
    x = layers.Conv2D(128, 1, activation='relu')(x)
    x = layers.Conv2D(256, 3, padding='same', activation='relu')(x)
    x = layers.Conv2D(256, 1, activation='relu')(x)
    x = layers.Conv2D(512, 3, padding='same', activation='relu')(x)
    x = layers.MaxPool2D(2)(x)
    x = layers.Conv2D(1024, 3, padding='same', activation='relu')(x)
    x = layers.Conv2D
