Hand Tracking Using MediaPipe
Rohit Prasad, Anirudha Sawant, Chinmaay Sharma, Siddhesh Navghare,
Beatrice S
*Department of Computer Engineering, Xavier Institute of Engineering, Mahim, Mumbai 400016, India
Abstract
The fusion of computer vision and the Unity game engine offers promising opportunities for advancing 3D hand gesture recognition, enabling immersive interactions in a variety of applications. The approach involves real-time hand detection and precise 2D landmark localization from a webcam feed, followed by mapping of these landmarks to 3D space within the Unity environment. Interactive elements and visual feedback are incorporated to enhance the system's usability. The system enables accurate hand positioning and manipulation of a 3D hand model in Unity while accounting for depth perception. Realistic rendering and immersive display of the 3D hand model are achieved, showcasing potential applications in gaming, education, and virtual reality. The developed system demonstrates accurate, responsive tracking that maps real-world hand gestures onto a virtual hand model in real time.
1 Introduction
The integration of 3D hand gesture recognition with computer vision and the Unity game engine marks a
significant leap in intuitive human-computer interaction. However, achieving seamless harmony between
these technologies poses a key challenge. This paper addresses this challenge by presenting a unified
framework that amalgamates computer vision methodologies with Unity's capabilities for precise 3D hand
gesture recognition. The primary goal is to enable real-time hand detection, accurate 2D landmark
localization, and seamless mapping of these landmarks into Unity's 3D space. By intricately blending
computer vision techniques and Unity's environment, the framework allows for the creation and
manipulation of a realistic 3D hand model while ensuring considerations for depth perception and responsive
gesture-driven interactions. This integration of interactive elements within Unity not only enhances user
engagement but also extends its applicability across gaming, education, and virtual reality domains.
2 Literature Survey
The literature survey delves into the diverse approaches in gesture recognition, a pivotal domain in computer
science crucial for human-computer interaction, virtual reality, and gaming. Research by Zhang et al.
explores real-time reconstruction methodologies, employing joint learning networks and energy
optimization functions to achieve dynamic gesture recognition with single-depth-camera systems. This marks a significant shift towards real-time performance using minimal hardware setups.[1] The integration of deep learning techniques, as highlighted in recent advancements, contributes to improved accuracy and
efficiency in gesture recognition systems. CNN-based pose regression and neural network-based shape
estimation enhance real-time performance, paving the way for more robust recognition models.[2] The
literature also underscores advancements in dataset creation, including hybrid real-synthetic datasets, crucial
for training gesture recognition models. These datasets alleviate challenges associated with manual labeling
and foster the development of more accurate recognition systems.[3]
Furthermore, hand pose estimation techniques have undergone significant evolution, transitioning from
depth-based to RGB-based approaches with a focus on real-time tracking. Methods such as dense geometry
representation and machine learning-based depth estimation address challenges related to depth ambiguity
and interaction handling, bolstering the accuracy of pose estimation systems.[4] Notably, single-camera
setups like RGB2Hands and VNect have demonstrated potential for gesture recognition, enabling real-time
capture of hand gestures using only a single RGB camera, thus broadening the applicability of gesture
recognition technology.[5] Gesture recognition systems find diverse applications in human-computer
interaction, virtual reality, gaming, and sign language recognition, enhancing user experience and accessibility.[6]
Despite these advancements, challenges persist, including depth ambiguity, interaction handling, and
occlusion in gesture recognition. Techniques such as dense geometry representation, neural network-based
depth estimation, and 3D convolutional neural networks offer promising solutions, improving the accuracy
and robustness of gesture recognition systems.[7] The practical implications of gesture recognition extend
to healthcare, education, and beyond, offering potential applications in rehabilitation, assistive technologies,
and interactive learning experiences.[8] Looking ahead, future directions in gesture recognition research
include the development of more efficient algorithms, creation of larger datasets, and exploration of novel
applications in emerging fields such as augmented reality and human-robot interaction.[9] These
advancements hold promise for further enhancing the capabilities and accessibility of gesture recognition
technology.
4 Algorithm
# Step 1: Hand Landmarks Extraction and Transmission
# Python Script
import cv2
import socket
from cvzone.HandTrackingModule import HandDetector  # MediaPipe-based hand tracking module

cap = cv2.VideoCapture(0)                                   # Webcam feed
detector = HandDetector(maxHands=1, detectionCon=0.8)       # Hand landmark detector
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)     # UDP socket
server_address = ("127.0.0.1", 12345)                       # Unity listens on this port
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))                 # Frame height, used to flip the y-axis for Unity

while True:
    success, img = cap.read()
    hands, img = detector.findHands(img)
    data = []
    if hands:
        hand = hands[0]
        lmList = hand["lmList"]                             # 21 landmarks, each [x, y, z]
        for lm in lmList:
            data.extend([lm[0], h - lm[1], lm[2]])          # Extract hand landmarks (flip y for Unity)
        sock.sendto(str.encode(str(data)), server_address)  # Send landmark data to Unity
In the Python script, the process begins by initializing the HandTrackingModule, which performs hand landmark detection, and establishing a UDP socket connection for communication with the Unity environment. Using OpenCV, the script continuously captures frames from the webcam feed, and the HandTrackingModule analyzes each frame to extract hand landmarks. These landmarks, represented as 2D coordinates that capture hand movements and gestures, are formatted into a single string and transmitted over the socket connection to Unity. This part of the algorithm provides the foundation for capturing hand landmark data in Python and passing it to Unity for subsequent processing and visualization. Combining Python's computer vision capabilities with Unity's visualization tools enables the creation of dynamic and interactive hand gesture recognition systems.
// Step 2: Receiving Landmark Data and Updating the Hand Model in Unity (HandTracking.cs)
using UnityEngine;
using System.Net;
using System.Net.Sockets;
using System.Text;

public class HandTracking : MonoBehaviour
{
    public GameObject[] handPoints;   // 21 landmark GameObjects assigned in the Inspector

    UdpClient client;
    IPEndPoint endPoint;

    void Start()
    {
        client = new UdpClient(12345);                                    // Listen for data from the Python script
        endPoint = new IPEndPoint(IPAddress.Parse("127.0.0.1"), 12345);
    }

    void Update()
    {
        if (client.Available == 0) return;                                // No new datagram this frame

        byte[] data = client.Receive(ref endPoint);
        string receivedData = Encoding.UTF8.GetString(data);
        string[] coordinates = receivedData.Trim('[', ']').Split(',');    // Payload is "[x1, y1, z1, x2, ...]"

        // Map coordinates to 3D space and update hand model positions (one x, y, z triple per landmark)
        for (int i = 0; i + 2 < coordinates.Length; i += 3)
        {
            // Map 2D coordinates to 3D space
            Vector3 position = MapTo3DSpace(coordinates[i], coordinates[i + 1], coordinates[i + 2]);
            // Update hand model positions
            UpdateHandModelPosition(i / 3, position);
        }
    }

    Vector3 MapTo3DSpace(string x, string y, string z)
    {
        // Mapping logic from 2D image coordinates to 3D world space (scaling and translation; see Section 5.2)
        return new Vector3(float.Parse(x) / 100f, float.Parse(y) / 100f, float.Parse(z) / 100f);
    }

    void UpdateHandModelPosition(int index, Vector3 position)
    {
        handPoints[index].transform.localPosition = position;            // Move the landmark GameObject
    }
}
In the Unity script, the received coordinate string is parsed, mapped into 3D world space, and used to update the positions of GameObjects so that the virtual hand model mirrors the detected hand movements. This section of the algorithm encapsulates the processing and visualization aspects within the Unity environment, facilitating the interactive and immersive representation of hand gestures in 3D space.
5 Methodology
The complete working of the project is shown in Fig. 1: the system takes in a 2D video capture, extracts the hand landmarks from each frame, and maps them into a 3D environment.
Fig. 1: Working
5.1 Hand Landmarks Extraction and Transmission
Here, the goal is to extract hand landmarks using Python with OpenCV and the MediaPipe library. The hand landmarks, represented as 2D coordinates, are captured in real time. A UDP communication channel is then established to transmit these coordinates to the Unity application. The data, consisting of the hand landmark coordinates, is converted into string format and sent over the network to the IP address 127.0.0.1 (localhost). This communication mechanism enables the seamless integration of hand tracking data into the Unity environment for further interactive and immersive applications, such as virtual reality or augmented reality experiences.
5.2 Unity Processing and 3D Mapping
A Unity script named HandTracking.cs is created to receive and process the 2D coordinates transmitted from the Python script. The received coordinates, representing hand landmarks, undergo a mapping function within the Unity script to convert them into 3D world space. This mapping applies scaling and translation parameters to position the landmarks accurately in the Unity environment. The script then dynamically updates the positions of GameObjects in real time to reflect the movement and positioning of the hand landmarks. This integration allows hand tracking data to be incorporated seamlessly into the Unity application, providing an interactive and responsive experience in which GameObjects align with the detected hand movements and positions.
The LineCode.cs script in Unity is employed to visually connect specific hand landmarks using a LineRenderer component. This provides a clear visual representation of hand movements, fostering a more immersive and interactive user experience by visually conveying gestures through the manipulation of GameObjects and the LineRenderer component in Unity.
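As an illustration of this step, a minimal LineRenderer script is sketched below; the field names and the choice of which landmarks to connect are assumptions for illustration rather than the exact contents of LineCode.cs.

// Minimal LineRenderer sketch in the style of LineCode.cs; field names and landmark pairing are assumed.
using UnityEngine;

public class LineCode : MonoBehaviour
{
    public Transform origin;       // first landmark GameObject of the connected pair (assumed field name)
    public Transform destination;  // second landmark GameObject of the pair (assumed field name)
    LineRenderer lineRenderer;

    void Start()
    {
        lineRenderer = GetComponent<LineRenderer>();
        lineRenderer.positionCount = 2;    // one straight segment between the two landmarks
    }

    void Update()
    {
        // Redraw the segment every frame so it follows the moving landmark GameObjects
        lineRenderer.SetPosition(0, origin.position);
        lineRenderer.SetPosition(1, destination.position);
    }
}

One such script (or one landmark pair) is used per connection, so the full hand skeleton is drawn by attaching the component to each segment between adjacent landmarks.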
Let P2D(xi, yi) represent the i-th 2D landmark coordinate extracted from the hand using MediaPipe.
The mapping to 3D coordinates P3D(Xi, Yi, Zi) is achieved through a transformation function F:
P3D(Xi, Yi, Zi) = F(P2D(xi, yi))
Where F consists of: a Scaling Factor (S), a Translation Matrix (T), and the Mapping Equation (M).
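For illustration, F can be realized in the Unity script as a uniform scaling followed by a translation; the numeric values below are placeholders rather than the project's calibrated parameters.

// Sketch of the transformation F (inside HandTracking.cs): scale the 2D landmark, then translate it into the scene.
Vector3 ApplyMapping(Vector3 p2d)
{
    const float S = 1f / 100f;                    // Scaling Factor S (placeholder value)
    Vector3 T = new Vector3(-6.4f, -3.6f, 0f);    // Translation T (placeholder value)
    return p2d * S + T;                           // Mapping Equation M combines both steps
}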
5.5 Magnification and Scale Calculation
The changing distances between specific hand landmarks are analyzed over time in both Python and Unity. The goal is to determine a magnification factor by comparing the hand size in the 2D image to a reference size. This factor is then used to dynamically calculate and adjust the scale of GameObjects in the Unity environment based on hand movements. By continuously monitoring the evolving hand size and the distances between landmarks, this approach provides a dynamic and responsive scaling mechanism in Unity.
● Scaling Factor (S):
Scale the 2D coordinates to the 3D space, where zmax and zmin define the desired range along the Z-axis, and xmax − xmin and ymax − ymin represent the range along the X and Y axes.
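A sketch of one way to implement this normalization inside the Unity script is given below; the image bounds and depth ranges are assumed values for illustration, not the project's exact parameters.

// Sketch of the scaling step (inside HandTracking.cs): normalize pixel coordinates into an assumed target range.
Vector3 ScaleTo3D(float x, float y, float z)
{
    const float xMin = 0f, xMax = 1280f;     // assumed image width range
    const float yMin = 0f, yMax = 720f;      // assumed image height range
    const float zMin = -1f, zMax = 1f;       // assumed desired depth range in the scene

    float X = (x - xMin) / (xMax - xMin);                                  // normalized X
    float Y = (y - yMin) / (yMax - yMin);                                  // normalized Y
    float Z = Mathf.Lerp(zMin, zMax, Mathf.InverseLerp(-100f, 100f, z));   // relative depth mapped into [zMin, zMax]
    return new Vector3(X, Y, Z);
}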
In this phase, the project focuses on achieving dynamic interaction with 3D models in Unity. A Z-axis
adjustment factor is calculated by analyzing changes in the hand’s depth over time. This factor is applied to
adjust the Z-axis position of GameObjects representing hand landmarks, ensuring that the 3D models
respond dynamically to changes in the hand’s depth. This enhancement significantly enriches the user’s
interactive experience by allowing more nuanced and realistic control over the virtual environment.
● Euclidean Distance (∆D):
∆D = √((xt − xt−1)² + (yt − yt−1)² + (zt − zt−1)²)
Calculate the Euclidean distance between the current hand position (xt, yt, zt) and the previous one (xt−1, yt−1, zt−1) to represent the overall change in hand position.
● Threshold Function (Θ):
Θ(∆D, threshold) = 1 if ∆D > threshold, and 0 otherwise
Evaluate whether the change in hand position exceeds a predefined threshold, indicating a significant gesture.
● Mapping to 3D Magnification (∆Z):
∆Z = Θ(∆D, threshold) · scaling factor · ∆D
Map the detected changes to the Z-axis of the 3D hand model, considering the scaling factor for
appropriate magnification.
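Combining the three quantities, the depth-adjustment step can be sketched as follows; the threshold and scaling-factor values, and the handModel field, are assumptions for illustration.

// Sketch of the depth adjustment (inside HandTracking.cs): ∆D, Θ, and ∆Z applied to the hand model.
public Transform handModel;               // root of the 3D hand model (assumed field)
Vector3 previousHandPosition;

void AdjustDepth(Vector3 currentHandPosition)
{
    const float threshold = 0.05f;        // assumed gesture threshold
    const float scalingFactor = 2.0f;     // assumed magnification scaling factor

    float deltaD = Vector3.Distance(currentHandPosition, previousHandPosition); // Euclidean Distance (∆D)
    float theta = deltaD > threshold ? 1f : 0f;                                  // Threshold Function (Θ)
    float deltaZ = theta * scalingFactor * deltaD;                               // Magnification (∆Z)

    handModel.position += new Vector3(0f, 0f, deltaZ);                           // apply along the Z-axis
    previousHandPosition = currentHandPosition;
}

In practice, the sign of the hand's depth change would determine whether the model moves closer to or farther from the camera; the sketch applies the magnitude only, matching the formula above.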
5.7 Spatial Constraints and Scene Boundaries
To enhance user control within the Unity scene, spatial constraints are defined by determining scene
boundaries based on the X, Y, and Z coordinates of hand landmarks. Both Python and Unity scripts are
integrated to enforce these boundaries, constraining hand movements within a specified area. This ensures
a more guided and controlled interaction, preventing unintended movements outside the defined limits and
enhancing the overall usability and safety of the application.
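One simple way to enforce these boundaries on the Unity side is to clamp every mapped position to the scene limits before it is assigned to a GameObject; the boundary values below are illustrative assumptions.

// Sketch of boundary enforcement (inside HandTracking.cs): clamp mapped positions to assumed scene limits.
Vector3 ClampToSceneBounds(Vector3 position)
{
    float x = Mathf.Clamp(position.x, -5f, 5f);   // assumed X limits
    float y = Mathf.Clamp(position.y, -3f, 3f);   // assumed Y limits
    float z = Mathf.Clamp(position.z, 0f, 10f);   // assumed Z limits
    return new Vector3(x, y, z);
}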
5.8 Precise 3D Model Manipulation
The project advances into providing users with precise control over 3D models in Unity. Mapped 2D coordinates, triggered by user input, are received and used to dynamically set the position of 3D models via the Transform component. Additionally, rotation adjustments based on user interactions are incorporated, allowing users to finely tune the orientation of 3D models. This level of granularity enhances the interactive experience, enabling a more detailed and personalized interaction with the virtual elements in the Unity environment.
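A minimal sketch of this Transform-based manipulation is shown below; the rotation speed and the way user input is read are assumptions for illustration.

// Sketch of precise model manipulation (inside a Unity script): position from mapped coordinates, rotation from input.
public Transform targetModel;             // 3D model controlled by the user (assumed field)

void MoveAndRotateModel(Vector3 mappedPosition, float rotationInput)
{
    const float rotationSpeed = 45f;      // degrees per second (assumed)

    targetModel.position = mappedPosition;                                          // position from mapped 2D coordinates
    targetModel.Rotate(Vector3.up, rotationInput * rotationSpeed * Time.deltaTime); // fine-tune orientation
}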
6 Summary and Results
In summary, the project successfully integrated Mediapipe for 2D hand landmark extraction with Unity
for 3D gesture recognition and visualization. The Python script utilized Mediapipe to extract 2D hand
landmarks from a standard 2D camera feed and transmitted them to Unity in real-time. In Unity, the C#
script received the landmark data and mapped it onto 3D game objects that represent the hand landmarks.
The system also implemented a mechanism to detect changes in hand position and adjust the Z-axis of the
3D hand model accordingly, providing users with an immersive and dynamic gesture recognition
experience. Throughout the project, iterative testing and optimization were performed to ensure accuracy,
efficiency, and user-friendliness.
Fig. 2: Original gestures captured by the camera (1(a), 2(a)) and the corresponding generated gestures in Unity (1(b), 2(b))
In the results, the original gestures captured by the camera (images 1(a) and 2(a)) are closely replicated by the corresponding 3D models generated in Unity (images 1(b) and 2(b)). Image 1(a) shows a user's original gesture, accurately mirrored in image 1(b) by the project's 3D model. Similarly, image 2(a) displays another original gesture, faithfully reproduced in image 2(b) by the Unity-generated 3D model. This demonstrates the system's precision in translating real-world hand gestures into their virtual counterparts.
7 Discussion
The paper presents a comprehensive approach to integrating computer vision techniques with the Unity
game engine for accurate 3D hand gesture recognition. The key components of the system include real-time
hand detection and 2D landmark localization using a webcam feed and computer vision algorithms, as well
as the mapping of these 2D landmarks to 3D space within the Unity environment, considering depth
perception.
The authors leverage the strengths of both Python (computer vision) and Unity (3D rendering and
interactivity) to create a robust and responsive system. The use of socket communication to transmit hand
landmark data from Python to Unity enables seamless integration between the two environments. The paper
highlights the techniques used for 2D-to-3D mapping, scaling, and depth adjustment, which are crucial for
achieving accurate and immersive 3D hand gesture representation in Unity. The iterative testing and
optimization mentioned suggest the system can provide reliable and precise hand gesture recognition. The
system's integration of hand gesture recognition with Unity can enable more intuitive and natural
interactions within virtual reality (VR) and augmented reality (AR) environments, expanding the
possibilities for immersive experiences. Additionally, the system can be utilized in educational and training
applications, allowing users to manipulate 3D models and visualizations using hand gestures, making the
learning process more engaging and interactive.
The hand gesture recognition capabilities can improve accessibility by providing alternative input
methods for individuals with disabilities, enabling them to interact with digital systems more effectively.
Furthermore, the system's ability to track and recognize hand gestures can be beneficial in healthcare and
rehabilitation applications, such as monitoring hand movements for physical therapy or assisting in the
development of assistive technologies.
Overall, the paper presents a well-designed framework that successfully combines computer vision and
Unity engine technologies to achieve accurate 3D hand gesture recognition. The proposed system has the
potential to significantly enhance user interactions and experiences across various domains, from gaming
and virtual reality to education and healthcare.
References
1. H. Zhang, Y. Zhou, Y. Tian, J.-H. Yong, and F. Xu, "Single depth view based real-time reconstruction of hand-object interactions," ACM Transactions on Graphics, vol. 40, no. 3, pp. 1–12, Jun. 2021, doi: 10.1145/3451341.
2. J. Wang et al., "RGB2Hands: Real-time tracking of 3D hand interactions from monocular RGB video," arXiv preprint, Jun. 2021. [Online]. Available: http://arxiv.org/pdf/2106.11725.pdf
3. F. Mueller et al., "Real-time pose and shape reconstruction of two interacting hands with a single depth camera," ACM Transactions on Graphics, vol. 38, no. 4, pp. 1–13, Jul. 2019, doi: 10.1145/3306346.3322958.
4. J. Segen and S. Kumar, "Shadow gestures: 3D hand pose estimation using a single camera," in Proc. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No. PR00149), Fort Collins, CO, USA, 1999, pp. 479–485 vol. 1, doi: 10.1109/CVPR.1999.786981.
5. N. Shimada, K. Kimura, and Y. Shirai, "Real-time 3D hand posture estimation based on 2D appearance retrieval using monocular camera," in Proc. IEEE ICCV Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, Vancouver, BC, Canada, 2001, pp. 23–30, doi: 10.1109/RATFG.2001.938906.
6. D. Mehta et al., "VNect: Real-time 3D human pose estimation with a single RGB camera," ACM Transactions on Graphics, vol. 36, no. 4, pp. 1–14, 2017.
7. L. Ge, H. Liang, J. Yuan, and D. Thalmann, "Real-time 3D hand pose estimation with 3D convolutional neural networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 4, pp. 956–970, Apr. 2019, doi: 10.1109/TPAMI.2018.2827052.
8. S. S. Rautaray, "Real time hand gesture recognition system for dynamic applications," International Journal of UbiComp, vol. 3, no. 1, pp. 21–31, Jan. 2012, doi: 10.5121/iju.2012.3103.
9. Y. Shi, Y. Li, X. Fu, K. Miao, and Q. Miao, "Review of dynamic gesture recognition," Virtual Reality & Intelligent Hardware, vol. 3, no. 3, pp. 183–206, Jun. 2021, doi: 10.1016/j.vrih.2021.05.001.
10. V. L. Patil, S. R. Sutar, S. B. Ghadge, and S. Palkar, "Gesture recognition for media interaction: A Streamlit implementation with OpenCV and MediaPipe," International Journal for Research in Applied Science and Engineering Technology, vol. 11, no. 9, pp. 1039–1046, Sep. 2023, doi: 10.22214/ijraset.2023.55775.
11. Indriani, Moh. Harris, and A. S. Agoes, "Applying hand gesture recognition for user guide application using MediaPipe."