
Integrating Computer Vision and Unity Engine for 3D Hand Gesture Recognition
Rohit Prasad, Anirudha Sawant, Chinmaay Sharma, Siddhesh Navghare, Beatrice S
*Department of Computer Engineering, Xavier Institute of Engineering, Mahim, Mumbai 400016, India

*Corresponding author, [email protected]
Abstract
The fusion of computer vision and the Unity game engine offers promising opportunities for advancing 3D hand gesture recognition, enabling immersive interactions in various applications. The approach involves real-time hand detection and precise 2D landmark localization through a webcam feed, followed by mapping these landmarks to 3D space within the Unity environment. Interactive elements and visual feedback are incorporated to enhance the system's usability. The system enables accurate hand positioning and manipulation of a 3D hand model in Unity, incorporating depth perception considerations. Realistic rendering and immersive display of the 3D hand model are achieved, showcasing its potential applications in gaming, education, and virtual reality. The integration of computer vision techniques with the Unity game engine presents a compelling approach to advancing 3D hand gesture recognition. The developed system demonstrates the feasibility of this integration for real-time, gesture-driven 3D interaction.

Keywords: Human-computer interaction; Virtual environment; Gesture design



1 Introduction

The integration of 3D hand gesture recognition with computer vision and the Unity game engine marks a significant leap in intuitive human-computer interaction. However, achieving seamless harmony between these technologies poses a key challenge. This paper addresses this challenge by presenting a unified framework that amalgamates computer vision methodologies with Unity's capabilities for precise 3D hand gesture recognition. The primary goal is to enable real-time hand detection, accurate 2D landmark localization, and seamless mapping of these landmarks into Unity's 3D space. By blending computer vision techniques with Unity's environment, the framework allows for the creation and manipulation of a realistic 3D hand model while accounting for depth perception and responsive gesture-driven interactions. The integration of interactive elements within Unity not only enhances user engagement but also extends its applicability across gaming, education, and virtual reality domains.

2 Literature Survey

The literature survey delves into the diverse approaches to gesture recognition, a pivotal domain in computer science crucial for human-computer interaction, virtual reality, and gaming. Research by Zhang et al. explores real-time reconstruction methodologies, employing joint learning networks and energy optimization functions to achieve dynamic gesture recognition with single-depth-camera systems, marking a notable shift towards real-time performance with minimal hardware setups.[1] The integration of deep learning techniques, as highlighted in recent advancements, contributes to improved accuracy and efficiency in gesture recognition systems: CNN-based pose regression and neural-network-based shape estimation enhance real-time performance, paving the way for more robust recognition models.[2] The literature also underscores advances in dataset creation, including hybrid real-synthetic datasets, which are crucial for training gesture recognition models; such datasets alleviate the challenges of manual labeling and foster the development of more accurate recognition systems.[3]

Furthermore, hand pose estimation techniques have undergone significant evolution, transitioning from depth-based to RGB-based approaches with a focus on real-time tracking. Methods such as dense geometry representation and machine-learning-based depth estimation address challenges related to depth ambiguity and interaction handling, bolstering the accuracy of pose estimation systems.[4] Notably, single-camera systems such as RGB2Hands and VNect have demonstrated the potential of capturing hand gestures in real time using only a single RGB camera, broadening the applicability of gesture recognition technology.[5] Gesture recognition systems find diverse applications in human-computer interaction, virtual reality, gaming, and sign language recognition, enhancing user experience and accessibility across various domains.[6]

Despite these advancements, challenges persist, including depth ambiguity, interaction handling, and occlusion. Techniques such as dense geometry representation, neural-network-based depth estimation, and 3D convolutional neural networks offer promising solutions, improving the accuracy and robustness of gesture recognition systems.[7] The practical implications of gesture recognition extend to healthcare, education, and beyond, with potential applications in rehabilitation, assistive technologies, and interactive learning experiences.[8] Looking ahead, future directions in gesture recognition research include the development of more efficient algorithms, the creation of larger datasets, and the exploration of novel applications in emerging fields such as augmented reality and human-robot interaction.[9] These advancements hold promise for further enhancing the capabilities and accessibility of gesture recognition technology.

4 Algorithm

# Step 1: Hand Landmarks Extraction and Transmission
# Python Script

# Assumed import: cvzone's HandTrackingModule provides HandDetector
# (the original import statement is elided in the paper)
from cvzone.HandTrackingModule import HandDetector
import cv2
import socket

# Initialize the hand detector and the UDP socket connection
detector = HandDetector(maxHands=1, detectionCon=0.8)
cap = cv2.VideoCapture(0)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server_address = ('127.0.0.1', 12345)

while True:
    success, img = cap.read()
    if not success:
        break
    h, w, _ = img.shape  # frame height, used to flip the y-axis for Unity
    hands, img = detector.findHands(img)
    data = []
    if hands:
        hand = hands[0]
        lmList = hand["lmList"]  # 21 hand landmarks, each [x, y, z]
        for lm in lmList:
            data.extend([lm[0], h - lm[1], lm[2]])  # extract landmarks, flipping y
        sock.sendto(str.encode(str(data)), server_address)  # send landmark data to Unity

In the Python script, the process begins by initializing the hand tracking module, the component responsible for hand landmark detection, and establishing a socket connection for communication with the Unity environment. Using OpenCV, the script continuously captures frames from the webcam feed, and the hand tracking module analyzes each frame to extract hand landmarks. These landmarks, represented as 2D coordinates describing hand pose and movement, are formatted as a flat list and transmitted over the socket connection to Unity. This segment of the algorithm forms the foundation for capturing and transmitting hand landmark data from the Python environment to Unity for subsequent processing and visualization. Combining Python's computer vision tooling with Unity's visualization capabilities enables the creation of dynamic and interactive hand gesture recognition systems.

// Step 2: Unity Processing and 3D Mapping
// Unity Script: HandTracking.cs

using UnityEngine;
using System.Net;
using System.Net.Sockets;
using System.Text;

public class HandTracking : MonoBehaviour
{
    UdpClient client;
    IPEndPoint endPoint;

    void Start()
    {
        client = new UdpClient(12345);
        endPoint = new IPEndPoint(IPAddress.Parse("127.0.0.1"), 12345);
    }

    void Update()
    {
        // Receive blocks until a packet arrives from the Python script
        byte[] data = client.Receive(ref endPoint);
        string receivedData = Encoding.UTF8.GetString(data);

        // The Python side sends a list literal such as "[x1, y1, z1, x2, ...]",
        // so strip the brackets and split on commas before parsing
        string[] coordinates = receivedData.Trim('[', ']').Split(',');

        // Map each (x, y, z) triple to 3D space and update the hand model
        for (int i = 0; i + 2 < coordinates.Length; i += 3)
        {
            Vector3 position = MapTo3DSpace(coordinates[i], coordinates[i + 1], coordinates[i + 2]);
            UpdateHandModelPosition(i / 3, position);
        }
    }

    Vector3 MapTo3DSpace(string x, string y, string z)
    {
        // Mapping logic from 2D to 3D coordinates; scaling and translation
        // are applied here as described in Sections 5.4 and 5.5
        return new Vector3(float.Parse(x), float.Parse(y), float.Parse(z));
    }

    void UpdateHandModelPosition(int landmarkIndex, Vector3 position)
    {
        // Update the GameObject for this hand landmark in the Unity scene
    }
}
In the Unity script, the system awaits the arrival of hand landmark data sent from the Python environment via the established socket connection. Once the data is received, it is decoded and parsed to extract the individual hand landmark coordinates. A mapping function then converts these 2D coordinates into 3D world space, applying scaling and translation parameters for accurate positioning. The script dynamically updates the positions of the GameObjects representing the hand landmarks, keeping them aligned with the detected hand movements. This section of the algorithm covers the processing and visualization within the Unity environment, enabling an interactive and immersive representation of hand gestures in 3D space.

5 Methodology

Figure 1 shows the complete working of the project: the system takes in a 2D video capture, extracts the image landmarks, and maps them into a 3D environment.


Fig. 1: Working

5.1 Hand Landmarks Extraction and Transmission

Here, the goal is to extract hand landmarks using Python with OpenCV and the MediaPipe library. The hand landmarks, represented as 2D coordinates, are captured in real time. A UDP communication channel is then established to transmit these coordinates to a Unity application: the landmark coordinates are converted into string format and sent over the network to the IP address 127.0.0.1 (localhost). This communication mechanism enables the seamless integration of hand tracking data into a Unity environment for further interactive and immersive applications, such as virtual reality or augmented reality experiences.

5.2 Unity Processing and 3D Mapping

A Unity script named HandTracking.cs receives and processes the 2D coordinates transmitted from the Python script. The received coordinates, representing hand landmarks, pass through a mapping function within the Unity script that converts them into 3D world space. This mapping applies scaling and translation parameters to position the landmarks accurately in the Unity environment. The script then dynamically updates the positions of GameObjects in real time to reflect the movement and positioning of the hand landmarks. This integration allows the seamless incorporation of hand tracking data into the Unity application, providing an interactive and responsive experience in which GameObjects align with the detected hand movements and positions.

5.3 Gesture Visualization and Tracking

The LineCode.cs script in Unity visually connects specific hand landmarks using a LineRenderer, enhancing the visualization of hand movements. Additionally, rotation tracking logic is implemented in both Python and Unity to determine the relative orientation of hand landmarks over time. By tracking changes in the distances and orientations between landmarks, hand gestures are computed and visualized within the Unity environment, as illustrated by the sketch below. This approach enables the dynamic representation of hand movements, fostering a more immersive and interactive user experience by visually conveying gestures through the manipulation of GameObjects and the LineRenderer component in Unity.
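
As a minimal illustration of this rotation tracking on the Python side, the sketch below measures the orientation of the vector between two hand landmarks across consecutive frames; the chosen landmark indices (0, the wrist, and 9, the base of the middle finger in MediaPipe's hand model) are an assumption for illustration, not the paper's stated choice:

import math

def hand_orientation(lmList, a=0, b=9):
    # Angle (degrees) of the vector from landmark a to landmark b in the image plane
    dx = lmList[b][0] - lmList[a][0]
    dy = lmList[b][1] - lmList[a][1]
    return math.degrees(math.atan2(dy, dx))

class RotationTracker:
    # Tracks the frame-to-frame change in hand orientation
    def __init__(self):
        self.prev_angle = None

    def update(self, lmList):
        angle = hand_orientation(lmList)
        delta = 0.0 if self.prev_angle is None else angle - self.prev_angle
        self.prev_angle = angle
        return delta  # signed rotation since the previous frame, in degrees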

5.4 Mapping 2D Landmarks to 3D Coordinates



Let P2D(xi, yi) represent the i-th 2D landmark coordinate extracted from the hand using MediaPipe. The mapping to 3D coordinates P3D(Xi, Yi, Zi) is achieved through a transformation function F:

P3D(Xi, Yi, Zi) = F(P2D(xi, yi))

where F consists of a scaling factor (S), a translation matrix (T), and a mapping equation (M).
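
The concrete formulas for S and T appear as figures in the original and are not recoverable here; one plausible form, assuming a linear min-max normalization consistent with the ranges described in Section 5.5, is:

\[
S = \begin{pmatrix} s_x & 0 \\ 0 & s_y \end{pmatrix}, \qquad
s_x = \frac{X_{\max} - X_{\min}}{x_{\max} - x_{\min}}, \quad
s_y = \frac{Y_{\max} - Y_{\min}}{y_{\max} - y_{\min}}, \qquad
T = -S \begin{pmatrix} x_{\min} \\ y_{\min} \end{pmatrix}
\]

Under this form, P3D = S · P2D + T maps the minimum 2D values to the origin, as Section 5.5 requires, with the Z coordinate supplied by the depth adjustment of Section 5.6.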

5.5 Magnification and Scale Calculation

The changing distances between specific hand landmarks are analyzed over time in both Python and Unity. The goal is to determine a magnification factor by comparing the hand size in the 2D image to a reference size. This factor is then used to dynamically calculate and adjust the scale of GameObjects in the Unity environment based on hand movements. By continuously monitoring the evolving hand size and the distances between landmarks, this approach provides a dynamic and responsive scaling mechanism in Unity. The components of the transformation function F are as follows (a code sketch follows the list):
● Scaling Factor (S): Scale the 2D coordinates to the 3D space, where zmax and zmin define the desired range along the Z-axis, and xmax − xmin and ymax − ymin represent the ranges along the X and Y axes.

● Translation Matrix (T): Translate the 2D coordinates so that the minimum values map to the origin in 3D space.

● Mapping Equation (M): P3D(Xi, Yi, Zi) = S · P2D(xi, yi) + T. Apply the scaling factor and translation matrix to obtain the corresponding 3D coordinates.
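
A minimal Python sketch of this mapping, assuming a linear min-max normalization; the 2D and 3D range constants below are illustrative placeholders, not values from the paper:

X_MIN_2D, X_MAX_2D = 0, 1280      # webcam frame range along X, in pixels (placeholder)
Y_MIN_2D, Y_MAX_2D = 0, 720       # webcam frame range along Y, in pixels (placeholder)
X_MIN_3D, X_MAX_3D = -5.0, 5.0    # desired Unity scene range along X (placeholder)
Y_MIN_3D, Y_MAX_3D = -5.0, 5.0    # desired Unity scene range along Y (placeholder)

def map_to_3d(x, y, z_depth=0.0):
    # Scaling factor S along each axis
    s_x = (X_MAX_3D - X_MIN_3D) / (X_MAX_2D - X_MIN_2D)
    s_y = (Y_MAX_3D - Y_MIN_3D) / (Y_MAX_2D - Y_MIN_2D)
    # Mapping equation M with translation T: minimum 2D values map to the 3D origin offset
    X = s_x * (x - X_MIN_2D) + X_MIN_3D
    Y = s_y * (y - Y_MIN_2D) + Y_MIN_3D
    return (X, Y, z_depth)  # Z is supplied by the depth adjustment of Section 5.6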

5.6 Z-Axis Adjustment and 3D Model Interaction

In this phase, the project focuses on achieving dynamic interaction with 3D models in Unity. A Z-axis adjustment factor is calculated by analyzing changes in the hand's depth over time. This factor is applied to adjust the Z-axis position of the GameObjects representing hand landmarks, ensuring that the 3D models respond dynamically to changes in the hand's depth. This significantly enriches the user's interactive experience by allowing more nuanced and realistic control over the virtual environment.

Define ∆X and ∆Y as the changes in X and Y coordinates between consecutive frames. The adjustment then proceeds in three steps (a code sketch follows the list):

● Magnitude of Change (∆D): Calculate the Euclidean distance ∆D = √(∆X² + ∆Y²) to represent the overall change in hand position.

● Threshold Function (Θ): Evaluate whether the change in hand position exceeds a predefined threshold, indicating a significant gesture.

● Mapping to 3D Magnification (∆Z): ∆Z = Θ(∆D, threshold) · scaling factor · ∆D. Map the detected changes to the Z-axis of the 3D hand model, applying the scaling factor for appropriate magnification.
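
A minimal Python sketch of this three-step adjustment, assuming a simple step-function form of Θ; the threshold and scaling factor are illustrative placeholders, not values from the paper:

import math

THRESHOLD = 15.0        # minimum per-frame movement (pixels) that counts as a gesture (placeholder)
SCALING_FACTOR = 0.01   # pixel-to-Unity-unit conversion along the Z-axis (placeholder)

def z_adjustment(delta_x, delta_y):
    # Magnitude of Change: Euclidean distance between consecutive frames
    delta_d = math.hypot(delta_x, delta_y)
    # Threshold Function: gate out insignificant movements
    theta = 1.0 if delta_d > THRESHOLD else 0.0
    # Mapping to 3D Magnification: delta_z = theta * scaling_factor * delta_d
    return theta * SCALING_FACTOR * delta_d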

5.7 Spatial Constraints and Scene Boundaries

To enhance user control within the Unity scene, spatial constraints are defined by determining scene boundaries based on the X, Y, and Z coordinates of hand landmarks. Both the Python and Unity scripts enforce these boundaries, constraining hand movements to a specified area. This ensures a more guided and controlled interaction, preventing unintended movements outside the defined limits and enhancing the overall usability and safety of the application. A minimal boundary check is sketched below.
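
The boundary check below is a minimal sketch; the scene bounds are illustrative placeholders, not values from the paper:

# Illustrative Unity scene boundaries (placeholders, not from the paper)
BOUNDS = {"x": (-5.0, 5.0), "y": (-5.0, 5.0), "z": (0.0, 10.0)}

def clamp_to_scene(position):
    # Constrain a mapped (X, Y, Z) landmark position to the scene boundaries
    x, y, z = position
    x = min(max(x, BOUNDS["x"][0]), BOUNDS["x"][1])
    y = min(max(y, BOUNDS["y"][0]), BOUNDS["y"][1])
    z = min(max(z, BOUNDS["z"][0]), BOUNDS["z"][1])
    return (x, y, z)
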
5.8 Precise 3D Model Manipulation

The project then provides users with precise control over 3D models in Unity. By receiving mapped 2D coordinates triggered by user input, the position of 3D models is set dynamically using the Transform component. Rotation adjustments based on user interactions are also incorporated, allowing users to finely tune the orientation of 3D models. This level of granularity enhances the interactive experience, enabling a more detailed and personalized interaction with the virtual elements in the Unity environment.

6 Summary and Results

In summary, the project successfully integrated MediaPipe for 2D hand landmark extraction with Unity for 3D gesture recognition and visualization. The Python script used MediaPipe to extract 2D hand landmarks from a standard 2D camera feed and transmitted them to Unity in real time. In Unity, the C# script received the landmark data and mapped it onto 3D GameObjects representing the hand landmarks. The system also implemented a mechanism to detect changes in hand position and adjust the Z-axis of the 3D hand model accordingly, providing users with an immersive and dynamic gesture recognition experience. Throughout the project, iterative testing and optimization were performed to ensure accuracy, efficiency, and user-friendliness.

Original Gesture                    Generated Gesture

Image 1(a)                          Image 1(b)

Image 2(a)                          Image 2(b)


In the results, the original gestures captured by the camera (Images 1(a) and 2(a)) are closely replicated by the corresponding 3D models generated in Unity (Images 1(b) and 2(b)). Image 1(a) shows a user's original gesture, accurately mirrored in Image 1(b) by the project's 3D model; similarly, Image 2(a) displays another original gesture, faithfully reproduced in Image 2(b) by the Unity-generated 3D model. This demonstrates the system's precision in translating 2D hand landmarks into realistic 3D representations.

7 Discussion

The paper presents a comprehensive approach to integrating computer vision techniques with the Unity game engine for accurate 3D hand gesture recognition. The key components of the system include real-time hand detection and 2D landmark localization using a webcam feed and computer vision algorithms, as well as the mapping of these 2D landmarks to 3D space within the Unity environment, with consideration for depth perception.

The authors leverage the strengths of both Python (computer vision) and Unity (3D rendering and interactivity) to create a robust and responsive system. The use of socket communication to transmit hand landmark data from Python to Unity enables seamless integration between the two environments. The paper highlights the techniques used for 2D-to-3D mapping, scaling, and depth adjustment, which are crucial for achieving accurate and immersive 3D hand gesture representation in Unity. The iterative testing and optimization suggest the system can provide reliable and precise hand gesture recognition. The integration of hand gesture recognition with Unity can enable more intuitive and natural interactions within virtual reality (VR) and augmented reality (AR) environments, expanding the possibilities for immersive experiences. Additionally, the system can be utilized in educational and training applications, allowing users to manipulate 3D models and visualizations using hand gestures and making the learning process more engaging and interactive.

The hand gesture recognition capabilities can improve accessibility by providing alternative input methods for individuals with disabilities, enabling them to interact with digital systems more effectively. Furthermore, the system's ability to track and recognize hand gestures can be beneficial in healthcare and rehabilitation applications, such as monitoring hand movements for physical therapy or assisting in the development of assistive technologies.

Overall, the paper presents a well-designed framework that successfully combines computer vision and Unity engine technologies to achieve accurate 3D hand gesture recognition. The proposed system has the potential to significantly enhance user interactions and experiences across various domains, from gaming and virtual reality to education and healthcare.

References
1. H. Zhang, Y. Zhou, Y. Tian, J.-H. Yong, and F. Xu, "Single depth view based real-time reconstruction of hand-object interactions," ACM Transactions on Graphics, vol. 40, no. 3, pp. 1–12, Jun. 2021, doi: 10.1145/3451341.
2. J. Wang et al., "RGB2Hands: Real-time tracking of 3D hand interactions from monocular RGB video," arXiv (Cornell University), Jun. 2021. [Online]. Available: http://arxiv.org/pdf/2106.11725.pdf
3. F. Mueller et al., "Real-time pose and shape reconstruction of two interacting hands with a single depth camera," ACM Transactions on Graphics, vol. 38, no. 4, pp. 1–13, Jul. 2019, doi: 10.1145/3306346.3322958.
4. J. Segen and S. Kumar, "Shadow gestures: 3D hand pose estimation using a single camera," in Proceedings of the 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No. PR00149), Fort Collins, CO, USA, 1999, pp. 479–485, vol. 1, doi: 10.1109/CVPR.1999.786981.
5. N. Shimada, K. Kimura, and Y. Shirai, "Real-time 3D hand posture estimation based on 2D appearance retrieval using monocular camera," in Proceedings of the IEEE ICCV Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, Vancouver, BC, Canada, 2001, pp. 23–30, doi: 10.1109/RATFG.2001.938906.
6. D. Mehta et al., "VNect: Real-time 3D human pose estimation with a single RGB camera," ACM Transactions on Graphics, vol. 36, no. 4, pp. 1–14, 2017.
7. L. Ge, H. Liang, J. Yuan, and D. Thalmann, "Real-time 3D hand pose estimation with 3D convolutional neural networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 4, pp. 956–970, Apr. 2019, doi: 10.1109/TPAMI.2018.2827052.
8. S. S. Rautaray, "Real time hand gesture recognition system for dynamic applications," International Journal of UbiComp, vol. 3, no. 1, pp. 21–31, Jan. 2012, doi: 10.5121/iju.2012.3103.
9. Y. Shi, Y. Li, X. Fu, K. Miao, and Q. Miao, "Review of dynamic gesture recognition," Virtual Reality & Intelligent Hardware, vol. 3, no. 3, pp. 183–206, Jun. 2021, doi: 10.1016/j.vrih.2021.05.001.
10. V. L. Patil, S. R. Sutar, S. B. Ghadge, and S. Palkar, "Gesture recognition for media interaction: A Streamlit implementation with OpenCV and MediaPipe," International Journal for Research in Applied Science and Engineering Technology, vol. 11, no. 9, pp. 1039–1046, Sep. 2023, doi: 10.22214/ijraset.2023.55775.
11. Indriani, M. Harris, and A. S. Agoes, "Applying hand gesture recognition for user guide application using MediaPipe," Advances in Engineering Research, Jan. 2021, doi: 10.2991/aer.k.211106.017.
