DartVision (Proposal)
INSTITUTE OF ENGINEERING
THAPATHALI CAMPUS
Submitted By:
Kapur Pant (THA077BEI021)
Khagendra Raj Joshi (THA077BEI022)
Kiran Chand (THA077BEI023)
Mohit Bhusal (THA077BEI025)
Submitted To:
Department of Electronics and Computer Engineering
Thapathali Campus
Kathmandu, Nepal
June, 2024
ABSTRACT
The goal of the "DartVision: AI-Driven Dart Targeting System Using Facial Recognition" project is to use computer vision and artificial intelligence to create a novel dart-throwing mechanism that can precisely target human subjects. The system makes use of facial recognition algorithms for target identification, servo motors for precise dart control, and a depth camera for target detection and depth perception. The device accomplishes precise dart targeting by combining projectile motion principles with energy conservation laws. Non-lethal darts and strict safety standards are used to ensure user safety. Beyond recreational sports, the DartVision technology has potential uses in military defense systems for locating and engaging specific targets in crowds. Extensive testing and validation procedures will be carried out to evaluate the system's efficiency and dependability in many situations. This project offers a fresh take on dart aiming systems and improves accuracy and safety in dart-related activities.
Table of Contents

ABSTRACT
List of Figures
List of Tables
List of Abbreviations
1. INTRODUCTION
1.1 Background
1.2 Motivation
1.4 Project Objectives
1.5 Applications
1.6.1 Capabilities
1.6.2 Limitations
2. LITERATURE REVIEW
3. REQUIREMENT ANALYSIS
3.1.1 ESP32cam
3.2.2 Python
3.2.3 YOLOv8
4. SYSTEM ARCHITECTURE AND METHODOLOGY
4.2 System Block Diagram
4.3 System Flowchart
4.7 Target Tracking Using Extended Kalman Filters
5. Expected Outcome
6. Feasibility Analysis
7. Project Schedule
8. Estimated Project Budget
9. APPENDICES
References

List of Figures

List of Tables
List of Abbreviations
AI Artificial Intelligence
KF Kalman Filter
KLT Kanade-Lucas-Tomasi
1. INTRODUCTION
”DartVision: AI-Driven Dart Targeting System Using Facial Recognition” project seeks
to transform dart throwing mechanisms by utilizing computer vision and AI to precisely
target human subjects with non-lethal darts. The DartVision system, whose possible
uses range from leisure sports to military defense systems, promises to improve safety,
accuracy, and efficiency in dart-based activities through a combination of latest tech-
nologies and creative engineering.
1.1 Background
The accuracy and precision of traditional dart-throwing devices are limited, especially when aiming at moving targets, as they frequently rely on manual aiming techniques. Furthermore, conventional systems require continual human interaction because they are unable to recognize and track targets on their own. By enabling automated target detection, recognition, and accurate dart targeting, the development of AI and computer vision technologies offers a chance to overcome these limitations.
Dart sports, such as competitive dart throwing and games like "501," have a long and illustrious history. Their popularity has skyrocketed in recent years, with millions of fans playing in leagues, tournaments, and informal games around the world. Although dart throwing is a fun activity, even recreational play demands a high degree of accuracy and precision from players, which emphasizes the value of trustworthy dart targeting systems.
In order to achieve strategic goals and reduce dangers to friendly forces and civilians, it is imperative for military operations to effectively target adversaries. Conventional approaches to target acquisition and engagement frequently depend on guided weapon systems or manual targeting, which might be inadequate or impracticable in some situations, particularly those involving asymmetric warfare or urban settings.
Defense think tanks like the RAND Corporation have performed research that points to precise targeting as a key component in lowering civilian casualties and collateral damage during military operations. The need for more sophisticated and accurate targeting systems is highlighted by the analysis of historical data, which shows cases in which poor targeting resulted in unintentional harm.
Additionally, market research studies show that the military and law enforcement industries are increasingly in need of technologically sophisticated targeting systems. This pattern reflects the growing focus on autonomous targeting capabilities and precision-guided weapons to improve operational effectiveness and reduce hazards.
1.2 Motivation
The "DartVision: AI-Driven Dart Targeting System Using Facial Recognition" project is driven by the following main reasons:
Improving Safety and Precision: Manual aiming methods are frequently used in traditional dart-throwing devices, which can lead to inconsistent and inaccurate results, especially when aiming at moving targets. The DartVision system attempts to improve targeting precision and accuracy by combining AI-driven facial recognition and computer vision technologies, ensuring that darts hit their intended targets with minimal variation. Because it prioritizes the safety of both operators and targets through the use of non-lethal darts and strict safety regulations, the system can be used for a variety of purposes, from recreational sports to military training exercises.
The DartVision project addresses several significant issues and restrictions of conventional dart aiming systems, especially those pertaining to accuracy, automation, and flexibility. The main issues the project attempts to address are the following:
Lack of Target Precision: Manual aiming methods are frequently used in traditional dart-throwing devices, which leads to inconsistent and inaccurate results, particularly when aiming at moving targets or objects. Whether in casual sports or military drills, this imprecision can reduce the overall efficacy of dart-based activities. By creating an AI-driven targeting system that can autonomously recognize and track human targets with great precision and accuracy, the DartVision project aims to address this issue.
Inefficiency in Target Acquisition: Most conventional dart targeting systems are unable to identify and lock onto targets on their own, necessitating continuous human interaction in order to aim and fire darts correctly. This manual procedure can be laborious and prone to mistakes, especially in busy or dynamic settings where targets might be concealed or move quickly. The goal of the DartVision system is to simplify the target acquisition procedure by combining computer vision and facial recognition technologies, allowing for the quick and accurate identification of human targets in real time.
Safety Issues with Target Engagement: When using darts in activities involving humans, safety must always come first. Conventional dart-throwing systems carry some danger of damage or injury, particularly if the darts are thrown too hard or are not aimed precisely. The DartVision project places a high priority on safety by using non-lethal darts and putting strict safety procedures in place to reduce the possibility of mishaps or unintentional injury to targets and operators.
Demand for Versatile Applications: Although dart targeting systems are often connected to leisure activities, there is an increasing need for these systems in other fields, such as security operations or military defense. The inability of current technologies to scale and adapt to the various needs of different applications may limit their usefulness and efficacy. By creating a versatile targeting system that can satisfy the needs of a range of applications, from recreational activities to tactical engagements, the DartVision project seeks to close this gap.
1.4 Project Objectives
• To create an AI-powered dart aiming system that uses computer vision and facial recognition to target subjects accurately and autonomously.
1.5 Applications
• Practice and Training: DartVision is an excellent training tool for darts players who want to improve, as it can provide immediate feedback on accuracy and technique.
1.6.1 Capabilities
• Continuous Tracking: Using a Kalman Filter (KF), DartVision ensures continuous face tracking and blurs all non-target faces, maintaining focus on the target.
• Precise Dart Targeting: By integrating servo and stepper motors for dart control, the system can accurately adjust the dart trajectory based on the tracked position of the target's face, enhancing targeting precision.
1.6.2 Limitations
2. LITERATURE REVIEW
In the research of S. R. Rath [1], moving objects in a video are detected using computer vision techniques. Frame differencing is implemented to detect whether there are any moving objects in a video. The video is divided into multiple frames, each made up of colored pixels. The current frame is subtracted from the previous frame, and the regions where the color changes are identified. Based on this difference, it is determined whether something has moved or changed position.
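To make the frame-differencing idea concrete, a minimal OpenCV sketch is shown below. This is an illustrative example rather than the cited author's code; the video path and the threshold value are assumptions.
Python code:
import cv2

cap = cv2.VideoCapture("sample.mp4")  # assumed input video path
ok, prev_frame = cap.read()
prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Absolute difference between consecutive frames highlights pixels that changed
    diff = cv2.absdiff(prev_gray, gray)
    _, motion_mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)  # threshold value is an assumption
    cv2.imshow("Motion mask", motion_mask)
    prev_gray = gray
    if cv2.waitKey(30) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()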
The authors of [2] suggested four basic methods for solving object segmentation problems when detecting moving regions: background subtraction, temporal differencing, statistical methods, and optical flow. Background subtraction is commonly used to detect moving regions in images but requires a good background model to handle dynamic scenes. Temporal differencing uses pixel differences between consecutive frames but can create holes in moving objects. Optical flow detects independent motion even with camera movement but is computationally intensive and noise-sensitive. Statistical methods dynamically update background models to classify pixels as foreground or background based on statistical characteristics.
The paper also introduces a robust algorithm based on background subtraction and DST for moving object detection and segmentation. The proposed method reduces computational complexity compared to traditional techniques and effectively identifies and segments moving objects in static backgrounds.
The research paper by Shivam Singh and Prof. S. Graceline Jasmine [3] focuses on developing an automated face recognition system using several algorithms to enhance accuracy and efficiency. The system integrates face detection, feature extraction, and recognition algorithms to automatically identify individuals from still images or video frames. Key algorithms employed include the Viola-Jones algorithm for face detection using Haar cascade classifiers, the Kanade-Lucas-Tomasi (KLT) tracker for continuous face tracking, and Principal Component Analysis (PCA) for feature extraction. The proposed system involves multiple stages: image capture, face detection, pre-processing, database development, and post-processing for real-time recognition. The effectiveness of the system is demonstrated in varying lighting conditions, and its potential applications in security systems, surveillance, and identity verification are emphasized. Despite its robustness, the system faces challenges such as handling different poses, facial expressions, and poor lighting conditions, suggesting areas for future improvement.
The research paper "Real-time face detection based on YOLO" by Wang Yang and Zheng Jiachun [4] details the application of the YOLO (You Only Look Once) network to face detection. YOLO stands out due to its high detection speed and accuracy, which are crucial for real-time applications. The paper [4] compares YOLO with other object detection methods such as R-CNN, emphasizing YOLO's end-to-end training and detection process that integrates feature extraction, classification, and regression into a single network. The YOLOv3 variant, which the authors focus on, uses multi-scale feature maps and up-sampling techniques to improve detection, especially for small objects. They also discuss the importance of adapting the anchor boxes for specific tasks via dimension clustering using the k-means algorithm, which enhances detection accuracy. Experimental results using datasets like WIDER FACE, Celeb Faces, and FDDB demonstrate YOLOv3's robustness and faster detection times, confirming its suitability for real-time face detection even in complex environments.
The research paper "Review and Comparison of Face Detection Techniques" by Sudipto Kumar et al. [5] compares various face detection methods, focusing on Haar-like cascade classifiers, Local Binary Pattern (LBP) cascade classifiers, and Support Vector Machine (SVM)-based methods. It evaluates these techniques based on detection time, accuracy, performance in low light, and effectiveness on diverse skin tones, specifically dark complexions. Haar cascade classifiers, while accurate, struggle with low light and dark skin tones, and have high false positive rates. LBP classifiers perform well in challenging lighting and with dark complexions but have lower accuracy and higher CPU usage. SVMs, although accurate, are slower and less effective in low-light conditions. The study concludes that while each method has strengths and weaknesses, a combination of features from Haar and LBP classifiers could potentially yield better overall performance.
In the paper "Design and Control of an Articulated Robotic Arm for Archery," AhmadRafiq Mohd Khairudin et al. [6] explore the innovative application of robotics in sports, specifically focusing on archery. The study highlights the development and integration of a robotic system comprising a Universal Robots UR5 robotic arm, motion controllers, and a vision-based targeting system. Utilizing OpenCV algorithms, namely the Circle Hough Transform (CHT) and color and contour detection, the robot identifies the target center, aims, draws, and releases the arrow with a high degree of accuracy (87.56). This approach addresses the challenge of hand-eye coordination in archery, leveraging the robot's ability to perform repetitive tasks without tremors. The paper also discusses the methodology for training the arm, including setting up waypoints for accurate targeting and shooting, and compares the efficiency of different algorithms in target detection. The results indicate that the color and contour OpenCV algorithm is more effective for dynamic targeting than the CHT algorithm. The study concludes with a proof of concept demonstrating the feasibility of using collaborative robots for archery, suggesting further research on enhancing precision and automating more complex tasks such as nocking the arrow.
Among the many object detection algorithms available today, YOLO (You Only Look Once) can be considered one of the finest. Many researchers have used YOLO for real-time object detection; one such work is the research paper [7] by Gudala Lavanya and Sagar Dhanraj Pande, which explores the YOLO algorithm's transformative impact on real-time object detection. YOLO's unique approach processes the entire image at once, making it highly efficient and fast. YOLO works by dividing the image into a grid, where each grid cell predicts a fixed number of bounding boxes and class probabilities. The authors also describe the architectural improvements of YOLO from version v1 to v5: v1 established the basic concept, while v5 introduces features such as mosaic augmentation that maintain a balance between speed and accuracy. Despite challenges such as detecting small objects, YOLO's continuous architectural refinements ensure it remains a pivotal tool in computer vision.
The article "Kalman Filter and Its Application" by Qiang Li et al. [8] provides a comprehensive survey of the Kalman filter (KF) and its variations, including the Extended Kalman filter (EKF) and the Unscented Kalman filter (UKF). The Kalman filter, introduced by R. E. Kalman in 1960, is widely used for optimal estimation in dynamic systems, particularly for tasks like target tracking and navigation. The paper discusses the basic theories, strengths, and limitations of each filter type. The standard KF is suited for linear systems with Gaussian noise, while the EKF extends its application to nonlinear systems through linearization, though it can suffer from divergence if noise estimations are inaccurate. The UKF further improves performance by approximating the probability distribution of the nonlinear function without needing linearization, making it more accurate and easier to implement.
3. REQUIREMENT ANALYSIS
The components required for the proper implementation of our project are provided
below.
3.1.1 ESP32cam
The stepper motor serves as a crucial component of the project, primarily responsible for the movement of the mechanical hardware. Stepper motors are usually preferred in applications that require precise control over rotation angle and speed, such as CNC machines and other automated systems. The model we have chosen offers a balance between torque, size, and power consumption, meeting the specific requirements of the mechanical components it will drive.
Figure 3-2 : Stepper Motor
The project uses the SG90 servo ("SG" refers to "Servo Gear" and "90" refers to its approximate rotation range of ±90 degrees), which provides a stall torque of up to 1.8 kgf·cm. Its operating voltage range is 4.5 V to 6 V [9].
Motor drivers, represented by the A4988 model, play a vital role in controlling the stepper motors. These drivers convert the signals from the microcontroller into the necessary power and sequence of pulses to drive the stepper motor effectively. The A4988, in particular, is known for its reliability and compatibility with various stepper motor types. It provides microstepping capabilities, allowing for smoother and more precise control over the motor's movement, contributing to the overall accuracy and efficiency of the system.
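To illustrate how the microcontroller commands the A4988 over its STEP and DIR inputs, a minimal MicroPython sketch is given below. The GPIO pin numbers, step delay, and steps-per-revolution value are assumptions for illustration, not the final wiring of the project.
MicroPython code:
from machine import Pin
from time import sleep_us

step_pin = Pin(12, Pin.OUT)  # assumed GPIO for the A4988 STEP input
dir_pin = Pin(14, Pin.OUT)   # assumed GPIO for the A4988 DIR input

def rotate(steps, clockwise=True, delay_us=800):
    # Each pulse on STEP advances the motor by one step (or microstep)
    dir_pin.value(1 if clockwise else 0)
    for _ in range(steps):
        step_pin.value(1)
        sleep_us(delay_us)
        step_pin.value(0)
        sleep_us(delay_us)

rotate(200)  # one revolution for a typical 1.8-degree stepper in full-step mode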
The software required for the project is mentioned below:
3.2.2 Python
Python is used as the primary programming language on the software side of the project: running the YOLOv8 face detection model, tracking the selected target, computing the pitch and yaw angles, and driving the live GUI feedback system built with OpenCV and Tkinter or PyQt.
3.2.3 YOLOv8
The integration of YOLOv8 (You Only Look Once, version 8) introduces a sophisticated object detection model into the software repertoire. YOLOv8 is renowned for its real-time object detection capabilities, making it an apt choice for identifying and localizing the target face within the captured images. The model's efficiency in processing images and detecting objects with high accuracy aligns with the project's need for robust and rapid face detection. YOLOv8 contributes to the software's capability to analyze the visual data captured by the camera and extract relevant information for further processing.
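As a sketch of how YOLOv8 could be called from Python for per-frame face detection, the snippet below uses the ultralytics package; the weights file name (yolov8n-face.pt) and the camera index are assumptions, not the project's final configuration.
Python code:
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n-face.pt")  # assumed face-detection weights
cap = cv2.VideoCapture(0)        # assumed camera index for camera module 1

while True:
    ok, frame = cap.read()
    if not ok:
        break
    result = model(frame, verbose=False)[0]
    # Draw every detected face with its confidence score
    for box, conf in zip(result.boxes.xyxy, result.boxes.conf):
        x1, y1, x2, y2 = map(int, box)
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, f"{float(conf):.2f}", (x1, y1 - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    cv2.imshow("DartVision", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()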
4. SYSTEM ARCHITECTURE AND METHODOLOGY
The major concept underlying DartVision is how a real-life object, or a specific human being, can be tracked by applying machine learning algorithms. The YOLOv8 object detection system, like its predecessors, is a single-stage detector that processes an input image to predict bounding boxes and class probabilities for all objects in a single pass. The following sections break this process down into several key stages.
An explanation of the components in the system block diagram is presented below. This part includes all the tasks that are processed inside the computer's CPU.
Camera Modules: Two camera modules are required in this project. One of the two modules (camera module 1) is used to capture real-time video, and both modules are used together for stereo vision, which helps in calculating the depth of the target from the launching site.
Image Frame: The real-time video from camera module 1 is converted into a sequence of image frames.
YOLO face detection model: Once the image frames are obtained, the YOLOv8 model is used for face detection, as shown in figure 4-14.
Live GUI feedback system: A live camera video feed is obtained using OpenCV, and the GUI is designed using Tkinter or PyQt. The YOLO face detection model is integrated with the GUI system to visualize the real-time detection and tracking of the face.
Similarly, a mouse click callback function lets the user click the targeted face, whose coordinates are saved separately and later used to calculate the pitch and yaw angles.
A trigger confirmation button is provided in the GUI interface such that, when pressed, the trigger mechanism is activated.
Shooting Spring Constants: A force gauge can be used to measure the dart draw weight at different positions, and linear regression can then be used to obtain the dart string's spring constant.
F = -kx    (4.1)
Conservation of Energy: Now that the spring constant is known, the dart's initial velocity can be solved for, since energy is conserved from the dart string's potential energy to the kinetic energy of the dart upon release.
\frac{1}{2} k_{spring} x^2 = \frac{1}{2} m v_0^2    (4.2)

v_0 = x \sqrt{\frac{k}{m}}    (4.3)
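A small Python sketch of this calibration step is shown below; the force-gauge readings, dart mass, and draw length are placeholder values, not measured data.
Python code:
import numpy as np

# Placeholder force-gauge data: draw distance x (m) and measured force F (N)
x = np.array([0.02, 0.04, 0.06, 0.08, 0.10])
F = np.array([1.1, 2.0, 3.1, 3.9, 5.0])

# Linear regression of F against x gives the spring constant k as the slope
k, intercept = np.polyfit(x, F, 1)

m = 0.015       # assumed dart mass in kg
x_draw = 0.10   # assumed draw length in m

# Equation (4.3): v0 = x * sqrt(k / m)
v0 = x_draw * np.sqrt(k / m)
print(f"k = {k:.1f} N/m, v0 = {v0:.2f} m/s")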
Calculation of Pitch Angle: As the dart is in flight, it moves along a parabolic curve. The objective is to control the shape of that curve or, more precisely, to control the angle and velocity at which the dart hits the target. A diagram of the dart's motion is shown in figure 4-3.
The depth and height of the target are known from the camera, as previously discussed. If the final angle and final velocity of the dart are not constrained, there are multiple solutions for the initial angle and initial velocity. To reduce the solution space, the initial velocity is set to that obtained at the maximum extension of the launching mechanism, and the desired angle is solved for using the formula below.
h_2 = h_1 + d \tan(\theta) - \frac{g d^2}{2 v_0^2 \cos^2(\theta)}    (4.4)
This pitch angle (θ) might have errors due to air resistance and energy losses from multiple factors. Considering these factors, (θ + ψ) is calculated and passed to the ESP32-CAM in order to launch the dart within the desired range of the target. Here, ψ is a constant that accounts for the loss factors.
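Equation (4.4) can be solved for the pitch angle in closed form by substituting 1/cos²(θ) = 1 + tan²(θ), which turns it into a quadratic in tan(θ). A minimal sketch is shown below; the example values of d, h1, h2, and v0 are assumptions, and the loss constant ψ would be added to the returned angle.
Python code:
import numpy as np

def pitch_angle(d, h1, h2, v0, g=9.81):
    # Quadratic in T = tan(theta):  A*T^2 - d*T + (A + h2 - h1) = 0, with A = g*d^2 / (2*v0^2)
    A = g * d**2 / (2.0 * v0**2)
    disc = d**2 - 4.0 * A * (A + (h2 - h1))
    if disc < 0:
        raise ValueError("Target unreachable at this launch speed")
    tan_theta = (d - np.sqrt(disc)) / (2.0 * A)  # flatter of the two solutions
    return np.degrees(np.arctan(tan_theta))

# Example with assumed values: target 4 m away and 0.3 m above the launcher, v0 = 12 m/s
print(pitch_angle(d=4.0, h1=0.0, h2=0.3, v0=12.0))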
Calculation of Yaw Angle: Using the real-world coordinates of the face, which are stored in a unique variable, and the initial position of the servo, the yaw angle is calculated.
From figure 4-4, using the depth value (d) and the real-world x and y coordinates of the target object/face, we can calculate the yaw angle.
d^2 = y^2 + b^2    (4.5)

b = \sqrt{d^2 - y^2}    (4.6)

Now the yaw angle can be calculated using the sine relation as

\sin(\phi) = \frac{x}{\sqrt{d^2 - y^2}}    (4.7)

\phi = \sin^{-1}\left(\frac{x}{\sqrt{d^2 - y^2}}\right)    (4.8)
Here, this yaw angle (φ) is passed to the ESP32-CAM in order to point the dart launching tube in the direction of the target.
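Equations (4.5) to (4.8) translate directly into a short helper; the sample coordinates below are assumptions used only for illustration.
Python code:
import numpy as np

def yaw_angle(x, y, d):
    # phi = asin(x / sqrt(d^2 - y^2)), from equations (4.5)-(4.8)
    b = np.sqrt(d**2 - y**2)
    return np.degrees(np.arcsin(x / b))

print(yaw_angle(x=0.5, y=0.2, d=3.0))  # assumed target coordinates in metres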
Depth Calculation: Since stereo vision is used to calculate the depth from the object to the launching site, the intrinsic and extrinsic parameters of the cameras are first estimated; the optical center and focal length are intrinsic parameters, whereas the position of the camera in 3D space is an extrinsic parameter. Using these data, both cameras are calibrated. Finally, a depth value is assigned to each pixel of the live video feed.
This process mimics human vision. Our brains merge the slightly different images from each eye, allowing us to perceive depth and spatial relationships, resulting in our three-dimensional view of the world.
By employing its prior knowledge of the relative distance between the cameras, the computing system uses triangulation to determine the depth (d).
The depth of each point has to be calculated in order to produce a 3D image from 2D images. Using this, each point's relative depth is found.
Figure 4-6 : Computer system achieving stereo vision [9]
An image (or image channel) containing data on the distances of scene objects' surfaces as seen from a particular perspective is called a depth map. In 3D computer graphics and computer vision, scene depths are commonly represented in this manner.
The apparent shift of objects between the two stereo images is what encodes depth. When we shut one eye and quickly open it while closing the other, we see that objects close to us appear to shift more than those farthest away, which move very little. This difference in apparent position is known as "disparity."
A direction vector in epipolar geometry is a three-dimensional vector that originates from an image pixel. The direction vector, as the name suggests, is the direction from which the light ray arrives at the pixel sensor. This line thus carries all the 3D points that could be candidate sources for the 2D pixel in the image. In the figure, the direction vector L_{s1}S_1 originates from the point L_{s1}, which is the "left" 2D pixel corresponding to the 3D point S_1 in the scene.
In figure 4-8 above, the direction vectors from the left and right images (L_{s1}S_1 and R_{s1}S_1, respectively) intersect at the single source S_1. This 3D source point in the scene is the point from which light rays cast the image pixels L_{s1} and R_{s1} in the left and right images.
Depth Calculation: The distance between the cameras should be known and should be very small compared to the distance between the camera and the object. Then the location of the 3D point in space can be determined by triangulation. The depth is a perpendicular cast onto the line joining the two cameras.
Figure 4-9 shows the actual depth d_{s1} for the point from the line joining the two cameras. The angle between the line d_{s1} and the line L_{s1}R_{s1} is not exactly 90 degrees. In reality, however, the distance L_{s1}R_{s1} is very small compared to d_{s1}, which results in the angle between the line d_{s1} and the line L_{s1}R_{s1} being approximately 90 degrees. Since the location of S_1 is determined by triangulation with the help of the relative distance L_{s1}R_{s1}, the depth d_{s1} can be calculated using the Pythagorean theorem.
Since s is very large compared to t, the angle ∠S_1M_{s1}R_{s1} approaches 90°. The lengths L_{s1}M_{s1} and M_{s1}R_{s1} are almost the same (denoted by t). Also, the lengths L_{s1}S_1 and R_{s1}S_1 are almost the same (denoted by s). Applying the Pythagorean theorem, s^2 = d_{s1}^2 + t^2. Solving for the depth of point S_1:
d_{s1} = \sqrt{s^2 - t^2}    (4.9)
Triangulation in computer vision finds a 3D point’s location using its image projections
and camera information. It takes 2D image points and camera matrices to calculate a
3D point’s location. There are various techniques (mid-point, DLT, essential matrix)
that differ in complexity and accuracy.
Disparity Map: Disparity maps show how far corresponding points have shifted between two stereo images. This shift, inversely related to depth, helps us create 3D models.
Building a disparity map involves finding matching pixels between the left and right images (solving the correspondence problem). Rectifying the images simplifies this by aligning corresponding points horizontally. Block matching is a common technique for finding these matches. Finally, the disparity map is converted to a depth map using the camera data through a process called triangulation.
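A minimal OpenCV sketch of the block-matching and disparity-to-depth steps is shown below; the image file names, focal length, and baseline are assumptions standing in for the calibrated values.
Python code:
import cv2
import numpy as np

# Rectified grayscale stereo pair (assumed file names)
left = cv2.imread("left_rectified.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right_rectified.png", cv2.IMREAD_GRAYSCALE)

# Block matching solves the correspondence problem on rectified images
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # StereoBM returns fixed-point values

f = 700.0   # assumed focal length in pixels (from calibration)
B = 0.06    # assumed baseline between the cameras in metres

# Depth from disparity: depth = f * B / disparity (valid only where disparity > 0)
depth = np.zeros_like(disparity)
valid = disparity > 0
depth[valid] = f * B / disparity[valid]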
4.3 System Flowchart
The flowchart outlines a detailed process for real-time video processing and object tracking using the YOLOv8 model. The process initiates with capturing live video, which is subsequently fed into the YOLOv8 model to identify objects and obtain bounding box coordinates along with their confidence levels. This information is then displayed on a graphical user interface (GUI), enabling user interaction.
The user can click on a specific frame in the video feed, and the system captures the x and y coordinates of the click. It then iterates through the bounding box coordinates provided by the YOLOv8 model to determine whether the click corresponds to any detected object. If a match is found, the system applies a tracking algorithm to the identified object, enabling continuous monitoring and tracking.
This sequence ensures that the system can accurately track objects in real time based on user input, leveraging the object detection capabilities of the YOLOv8 model and the interactive functionality of the GUI. The combination of real-time video capture, object detection, user interaction, and tracking makes the system robust for applications requiring precise and dynamic object monitoring.
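A minimal sketch of the click-to-select step is given below; the window name and the state dictionary are assumptions, and in the full system state["boxes"] would be refreshed every frame from the YOLOv8 detections.
Python code:
import cv2

selected_box = None  # bounding box of the user-selected target face

def on_mouse(event, x, y, flags, param):
    # Store the detection whose bounding box contains the clicked point
    global selected_box
    if event == cv2.EVENT_LBUTTONDOWN:
        for (x1, y1, x2, y2) in param["boxes"]:
            if x1 <= x <= x2 and y1 <= y <= y2:
                selected_box = (x1, y1, x2, y2)
                break

state = {"boxes": []}  # filled with (x1, y1, x2, y2) tuples each frame
cv2.namedWindow("DartVision")
cv2.setMouseCallback("DartVision", on_mouse, state)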
As noted above, YOLOv8 is a single-stage detector that predicts bounding boxes and class probabilities in a single pass. Its architecture consists of a Backbone, a Neck, and a Head. The Backbone is a convolutional neural network (CNN) that is primarily responsible for extracting feature maps from the input image; it processes the image through multiple layers of convolutions, capturing various levels of abstraction and important spatial details. The Neck component is responsible for aggregating the features extracted by the Backbone. This is typically achieved using path aggregation blocks such as the Feature Pyramid Network (FPN), which combines feature maps from different scales to create a rich, multi-scale feature representation. The Neck then passes these features on to the Head, which predicts the final bounding boxes and class probabilities for the detected objects.

Figure 4-13 : YOLOv8 architecture
The state-of-the-art (SOTA) deep learning model YOLOv8 is intended for computer vision applications that require real-time object recognition, and it has been widely recognized for its real-time performance and accuracy. For our project, the YOLO technique is used to detect faces. The following describes how YOLO works for face detection.
Figure 4-14 : Face Detection Using YOLO
The YOLOv8 algorithm works based on the following four main approaches:
1. Grid Division: The first step takes a frame of the real-time video (an image) and divides it into an S×S grid of equal cells, where S in our case is 4, as shown in the figure above. Each cell in the grid is responsible for localizing an object and predicting its class and confidence score.
2. Bounding Box Regression: The next step is to determine the bounding boxes, which correspond to rectangles highlighting all the faces in the image. We can have as many bounding boxes as there are faces within a given image. YOLO determines the attributes of these bounding boxes using a single regression module in the following format, where Y is the final vector representation for each bounding box: Y = [pc, bx, by, bh, bw, c1, c2]. This representation is especially important during the training phase of the model.
• pc is the objectness score: the probability that the grid cell contains an object.
• bx, by are the x and y coordinates of the center of the bounding box with respect to the enveloping grid cell.
• bh, bw correspond to the height and the width of the bounding box with respect to the enveloping grid cell.
• c1 and c2 correspond to the predicted classes; for face detection a single face class is sufficient, and we can have as many classes as the use case requires.
3. Intersection over Union (IoU): Most of the time, a single face in a real-time video can have multiple grid-box candidates for prediction, even though most of them may not be relevant. The main aim of the IoU (a value between 0 and 1) is to discard such boxes and keep only those that are relevant. The logic behind it is as follows:
• The user defines an IoU selection threshold, denoted a.
• YOLO then computes the IoU of each grid cell, which is the intersection area divided by the union area.
• Finally, it ignores the predictions of grid cells having an IoU ≤ a and considers those with an IoU > a.
4. Non-Maximum Suppression (NMS): Simply setting a threshold for the IoU is not always sufficient to eliminate redundant detections and noise in object detection tasks, including face detection with YOLO. This is where Non-Maximum Suppression (NMS) plays a crucial role: NMS ensures that only the bounding box with the highest probability score is kept, effectively reducing the noise and improving the accuracy of the detection. An illustrative sketch of IoU and NMS follows this list.
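The sketch below is an illustrative implementation of the IoU computation and a greedy NMS pass, not YOLOv8's internal code; the threshold value is an assumption.
Python code:
def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); IoU = intersection area / union area
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    # Keep the highest-scoring box, drop every remaining box that overlaps it too much, repeat
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep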
4.7 Target tracking using Extended Kalman Filters
The standard Kalman filter works for linear systems. The Extended Kalman filter (EKF) handles non-linear systems by linearizing the equations around the current estimate. This allows the EKF to track objects where the motion or the measurements are non-linear; an example is when the object's position is measured in spherical coordinates but its state is represented in Cartesian coordinates. In the case of this project, the target might not have linear motion, so this model can be considered one of the best approaches for target tracking in DartVision.
State Update Model: A closed-form expression gives the predicted state as a function of the previous state x_k, the controls u_k, the noise w_k, and time t:

x_{k+1} = f(x_k, u_k, w_k, t)

The Jacobian of the predicted state with respect to the previous state is obtained by partial derivatives as:

F_{(x)} = \frac{\partial f}{\partial x}    (4.12)
The Jacobian of the predicted state with respect to the noise is:

F_{(w)} = \frac{\partial f}{\partial w}    (4.13)
These functions take a simpler form when the noise is additive in the state update equation:

x_{k+1} = f(x_k, u_k, t) + w_k    (4.14)

in which case the Jacobian with respect to the noise reduces to the identity matrix:

F_{(w)} = I    (4.15)
The state-transition Jacobian can be specified using the StateTransitionJacobianFcn property of the trackingEKF object. If this property is not specified, the object computes the Jacobians using numerical differencing, which is slightly less accurate and can increase the computation time.
Measurement Model: The measurement is a function of the state, the measurement noise v_k, and time:

z_k = h(x_k, v_k, t)    (4.16)

The Jacobian of the measurement with respect to the state is:

H_{(x)} = \frac{\partial z_k}{\partial x_k}    (4.17)

The Jacobian of the measurement with respect to the measurement noise is:

H_{(v)} = \frac{\partial z_k}{\partial v_k}    (4.18)

These functions take a simpler form when the noise is additive in the measurement equation:

z_k = h(x_k) + v_k    (4.19)
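Before turning to the filter loop, a compact NumPy sketch of one predict/correct cycle is given below for a 2-D constant-velocity state [x, vx, y, vy] with azimuth/range measurements, matching the additive-noise forms above. The motion model, measurement model, and the noise matrices Q and R passed in are illustrative assumptions, not the project's final tuning.
Python code:
import numpy as np

def ekf_step(x, P, z, dt, Q, R):
    # Predict: the constant-velocity state update is linear, so its Jacobian F is the matrix itself
    F = np.array([[1, dt, 0, 0],
                  [0, 1,  0, 0],
                  [0, 0,  1, dt],
                  [0, 0,  0, 1]], dtype=float)
    x = F @ x
    P = F @ P @ F.T + Q

    # Measurement prediction h(x) and its Jacobian H = dh/dx (the nonlinear part)
    px, py = x[0], x[2]
    r = np.hypot(px, py)
    h = np.array([np.arctan2(py, px), r])
    H = np.array([[-py / r**2, 0.0, px / r**2, 0.0],
                  [ px / r,    0.0, py / r,    0.0]])

    # Correct
    y = z - h                        # innovation
    S = H @ P @ H.T + R              # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
    x = x + K @ y
    P = (np.eye(4) - K @ H) @ P
    return x, P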
Extended Kalman Filter Loop: The extended Kalman filter loop is almost identical to the loop of the linear Kalman filter, except that:
• The filter uses the exact nonlinear state update and measurement functions whenever possible.
• The Jacobians derived above replace the state transition and measurement matrices when propagating the covariance.
Figure 4-15 : Extended Kalman Filter Loop
The toolbox provides predefined state update and measurement functions to use in
trackingEKF.
[Table: Predefined motion model functions for trackingEKF, listing each function name, its purpose, and the state representation it uses.]
Example: Estimate 2-D Target States with Angle and Range Measurements Using trackingEKF
Assume a target moves in 2-D with the following initial position and velocity. The simulation lasts 20 seconds with a sample time of 0.2 seconds.
Matlab Code:
rng(2022); % For repeatable results
dt = 0.2; % seconds
simTime = 20; % seconds
tspan = 0:dt:simTime;
trueInitialState = [30; 1; 40; 1]; % [x;vx;y;vy]
initialCovariance = diag([100,1e3,100,1e3]);
processNoise = diag([0; .01; 0; .01]); % Process noise matrix
Assume the measurements are the azimuth angle relative to the positive-x direction and the range from the origin to the target. The measurement noise covariance matrix is:
Matlab Code:
measureNoise = diag([2e-6;1]); % Measurement noise matrix; units are rad^2 (azimuth) and m^2 (range)
Propagate the constant velocity model and generate the measurements with noise.
numSteps = length(tspan);
trueStates = NaN(4,numSteps);
trueStates(:,1) = trueInitialState;
estimateStates = NaN(size(trueStates));
measurements = NaN(2,numSteps);
% Propagate the true states with process noise and generate noisy measurements
for i = 2:numSteps
    trueStates(:,i) = stateModel(trueStates(:,i-1),dt) + sqrt(processNoise)*randn(4,1);
    measurements(:,i) = measureModel(trueStates(:,i)) + sqrt(measureNoise)*randn(2,1);
end
Output:
Figure 4-16 : True Trajectory
figure(2)
subplot(2,1,1)
plot(tspan,measurements(1,:)*180/pi)
xlabel("time (s)")
ylabel("angle (deg)")
title("Angle and Range")
subplot(2,1,2)
plot(tspan,measurements(2,:))
xlabel("time (s)")
ylabel("range (m)")
Output:
Figure 4-17 : Plot of trajectory angle and range
Initialize the Extended Kalman Filter: Initialize the filter with an initial state estimate of [35; 0; 45; 0].
filter = trackingEKF(State=[35; 0; 45; 0],StateCovariance=initialCovariance, ...
    StateTransitionFcn=@stateModel,ProcessNoise=processNoise, ...
    MeasurementFcn=@measureModel,MeasurementNoise=measureNoise);
estimateStates(:,1) = filter.State;
Run the filter by recursively calling the predict and correct object functions.
for i=2:length(tspan)
    predict(filter,dt);
    estimateStates(:,i) = correct(filter,measurements(:,i));
end

figure(1)
hold on
plot(estimateStates(1,1),estimateStates(3,1),"g*",DisplayName="Initial Estimate")
plot(estimateStates(1,:),estimateStates(3,:),"g",DisplayName="Estimated Trajectory")
legend(Location="northwest")
title("True Trajectory vs Estimated Trajectory")
% Minimal stateModel and measureModel helpers consistent with the
% constant-velocity motion and azimuth/range measurements described above:
function stateNext = stateModel(state,dt)
    % Constant-velocity motion for the state [x; vx; y; vy]
    F = [1 dt 0 0;
         0 1  0 0;
         0 0  1 dt;
         0 0  0 1];
    stateNext = F*state;
end

function z = measureModel(state)
    % Azimuth relative to the positive x-direction and range from the origin
    z = [atan2(state(3),state(1)); norm([state(1) state(3)])];
end
5. EXPECTED OUTCOME
After the completion of the "DartVision" project, it is expected that the dart will be able to hit the target by itself with the help of the face detection and target-locking mechanism. As a reference, the system is closely related to an automatic archery robot, as shown in figure 5-1.

Figure 5-1: (a) Bow targeting an apple on the head using image detection; (b) bow hitting the apple on the head using image detection
6. FEASIBILITY ANALYSIS
• Data Availability and Suitability: For accurate facial detection and tracking, we
will utilize publicly available datasets such as FDDB and WIDER FACE, which
offer a vast number of labeled images. For depth analysis, stereo vision datasets
will be employed. These comprehensive datasets are sufficient for our project
needs, ensuring the availability and suitability of data are well within feasible
limits.
• Time Feasibility: The project is divided into two distinct phases. In the first
phase, we will develop the necessary mathematical concepts and derivations to
support our use of YOLO for facial detection, KF for tracking, and stereo vision
for depth analysis. In the second phase, we will implement and train the model
using the selected datasets, followed by integrating the model with servo and
stepper motors for the dart targeting system. This structured approach allows
for clear milestones and effective time management, ensuring the project can be
completed within the allocated time frame.
7. PROJECT SCHEDULE
8. ESTIMATED PROJECT BUDGET
9. APPENDICES
Mean Average Precision (mAP) is a metric used to evaluate object detection models. It is the mean of the average precision (AP) values, calculated over recall values from 0 to 1. The mAP formula is based on the confusion matrix, Intersection over Union (IoU), recall, and precision sub-metrics.
The confusion matrix is built from four attributes: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). True Positives occur when the model correctly predicts a label that matches the ground truth, while True Negatives occur when the model correctly does not predict a label that is also absent in the ground truth. Conversely, False Positives arise when the model incorrectly predicts a label that is not part of the ground truth, known as a Type I error. False Negatives happen when the model fails to predict a label that is actually present in the ground truth, referred to as a Type II error.
Intersection over Union (IoU) indicates the overlap between the predicted bounding box and the ground truth box. A higher IoU indicates that the predicted bounding box coordinates closely resemble the ground truth box coordinates.
Precision measures how many of the positive predictions are true positives. The precision value may vary based on the model's confidence threshold.

Precision = \frac{TP}{TP + FP}
Recall measures how many of the actual positives (ground-truth instances) the model finds.

Recall = \frac{TP}{TP + FN}
mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i    (9.1)
The equation 9.1 gives the mean Average Precision. The mAP incorporates the trade-off
between precision and recall and considers both false positives (FP) and false negatives
(FN). This property makes mAP a suitable metric for most detection applications.
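A few lines of Python make these definitions concrete; the counts and AP values below are placeholders, not evaluation results.
Python code:
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

def mean_average_precision(ap_values):
    # Equation (9.1): the mean of the per-class AP values
    return sum(ap_values) / len(ap_values)

print(precision(tp=80, fp=20), recall(tp=80, fn=10))
print(mean_average_precision([0.72, 0.65, 0.81]))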
Appendix B: Reference Design of Launching Mechanism
The dart throwing mechanism can be designed such that one servo controls the pitch angle of the mechanism, so that the dart gun's releasing mouth has a minimum offset along the z-axis relative to the center of the target face.
Similarly, another servo controls the yaw angle, so that the dart gun's releasing mouth has a minimum offset along the x-axis relative to the center of the target face.
The depth value obtained from the depth sensor helps determine the range over which the dart is to be thrown, and thus the initial velocity required from the mechanism, as discussed in figure 4-3.
References
[1] S. R. Rath, "Moving object detection using frame differencing with OpenCV," 2020. [Online; accessed 17-June-2024]. Available: https://debuggercafe.com/moving-object-detection-using-frame-differencing-with-opencv/
[2] G. Thapa, K. Sharma, and M. K. Ghose, "Moving object detection and segmentation using frame," International Journal of Computer Applications, vol. 102, 2014.
[4] W. Yang and Z. Jiachun, "Real-time face detection based on YOLO," in 1st IEEE International Conference on Knowledge Innovation and Invention (ICKII), Jeju Island, Korea (South), 2018.
[7] G. Lavanya and S. D. Pande, “Enhancing real-time object detection with yolo
algorithm,” EAI Endorsed Transactions on Internet of Things, vol. 23, no. 5,
123–134, 2023. DOI: 10.4108/eetiot.4541.
[8] Q. Li, R. Li, K. Ji, and W. Dai, "Kalman filter and its application," in Proceedings of the 2015 8th International Conference on Intelligent Networks and Intelligent Systems, Kunming, China, 2015.
[9] DrMax, "Computer vision: Stereo 3D," 2024. [Online; accessed 17-June-2024]. Available: https://www.baeldung.com/cs/stereo-vision-3d
[10] K. Carter, "Robot archer," 2021. [Online; accessed 17-June-2024]. Available: https://hackaday.io/project/179680/gallery#5b03f7f8eac091844f78badc2680e3ac