Fall Detection:
At a high level, fall detection works in three modules: Object Detection, Pose Estimation and
Action Recognition.
The first module is Object Detection, which employs the TinyYOLOv3 object detection model.
TinyYOLOv3 is a lightweight version of the YOLO (You Only Look Once) object detection model,
designed for real-time object detection. It follows the same basic structure as the original
YOLOv3 model, but with fewer convolutional layers and parameters, making it more
computationally efficient.
The structure of TinyYOLOv3 begins with an input layer that receives the image. Next comes the
backbone network, which is responsible for extracting features from the input image.
TinyYOLOv3's backbone is a simplified version of the Darknet-53 network, consisting of
convolutional layers and residual blocks. The following stage is the Feature Pyramid Network
(FPN), which combines features from different layers of the backbone and allows the model to
detect objects at multiple scales. Finally, the detection head is responsible for predicting
bounding boxes and class probabilities for the objects in the image. It consists of two YOLO
layers, each detecting objects at a different scale: one operates on a coarser 13x13 grid and is
suited to larger objects, while the other operates on a finer 26x26 grid and is suited to smaller
objects.
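As a rough illustration of this layout, the snippet below sketches the raw output of one detection head, assuming a 416x416 input, 3 anchors per scale and 80 classes; these numbers are illustrative assumptions, not values taken from the project code.

```python
import torch

# Rough illustration of a TinyYOLOv3 detection head's raw output layout,
# assuming a 416x416 input, 3 anchors per scale and 80 classes.
num_anchors, num_classes = 3, 80
channels = num_anchors * (5 + num_classes)   # 4 box offsets + objectness + class scores

coarse = torch.zeros(1, channels, 13, 13)    # 13x13 grid: larger objects
fine = torch.zeros(1, channels, 26, 26)      # 26x26 grid: smaller objects

# Reshape so the last dimension holds (tx, ty, tw, th, objectness, class scores)
coarse = coarse.view(1, num_anchors, 5 + num_classes, 13, 13).permute(0, 1, 3, 4, 2)
print(coarse.shape)                          # torch.Size([1, 3, 13, 13, 85])
```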
These bounding boxes are generated using the concept of anchor boxes. TinyYOLOv3 uses a set of
predefined anchor boxes, which are bounding boxes of different aspect ratios and scales. These
anchor boxes are chosen based on the distribution of object sizes and aspect ratios in the
training data. Each input image fed into the model is then divided into a grid of cells;
TinyYOLOv3 uses two such grids, 13x13 and 26x26, one for each detection scale.
For each cell in the grid, the model predicts a fixed number of bounding boxes. The model
predicts four values for each bounding box: x, y, width and height. These values represent
offsets from the corresponding anchor box coordinates and dimensions. The model also predicts a
confidence score for each bounding box, which represents the confidence that an object is
present in that box. For each bounding box prediction, the model additionally predicts a
probability for each object class (person, table, cat, etc.); these class probabilities represent
the model's confidence about which class the detected object belongs to.
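A minimal sketch of the standard YOLO decoding of these offsets is shown below. The anchor values listed are the defaults shipped with yolov3-tiny and are an assumption here, since the project may use its own anchors.

```python
import numpy as np

def decode_box(tx, ty, tw, th, cx, cy, anchor_w, anchor_h, grid_size=13, img_size=416):
    """Convert raw offsets (tx, ty, tw, th) predicted for grid cell (cx, cy)
    and one anchor into absolute pixel coordinates (box centre, width, height)."""
    stride = img_size / grid_size
    bx = (1 / (1 + np.exp(-tx)) + cx) * stride   # sigmoid keeps the centre inside its cell
    by = (1 / (1 + np.exp(-ty)) + cy) * stride
    bw = anchor_w * np.exp(tw)                   # width/height scale the anchor dimensions
    bh = anchor_h * np.exp(th)
    return bx, by, bw, bh

# Default yolov3-tiny anchors (width, height) in pixels used by the 13x13 head
anchors_13 = [(81, 82), (135, 169), (344, 319)]
print(decode_box(0.2, -0.1, 0.05, 0.1, cx=6, cy=7, anchor_w=81, anchor_h=82))
```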
The final stage is Non-Maximum Suppression (NMS). Once the bounding box predictions and class
probabilities have been obtained, the NMS algorithm is applied to filter out overlapping
bounding boxes and keep only the most confident predictions. It first sorts the bounding boxes
by their confidence scores and then iterates through them, removing any box that overlaps too
strongly with a higher-confidence box of the same class. The bounding boxes remaining after NMS
are the final object detections, with their corresponding class labels and confidence scores.
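A compact sketch of this procedure (not the project's implementation) could look as follows, with boxes given as (x1, y1, x2, y2) corners and an assumed IoU threshold of 0.45.

```python
import numpy as np

def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.45):
    order = np.argsort(scores)[::-1]          # highest confidence first
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too strongly
        order = np.array([i for i in order[1:] if iou(boxes[best], boxes[i]) < iou_threshold])
    return keep
```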
For pose estimation, we use the FastPose model. It is a neural network architecture that takes
an input image and outputs heatmaps for each human body key point. These heatmaps represent the
likelihood, or confidence, of each key point being present at different locations in the image.
A wrapper class handles tasks such as model loading: it initializes the appropriate FastPose
network, namely InferenNet_fastRes50, based on the specified backbone, which is ResNet-50. It
also handles the cropping and resizing of the input image based on the detected bounding boxes,
preparing the data for the FastPose model. The preprocessed input is then fed into the FastPose
model, and heatmap outputs are obtained for each key point. The heatmaps are processed to obtain
the final key point coordinates. Finally, a non-maximum suppression step is applied to the
predicted poses, removing redundant or overlapping predictions. The FastPose model itself is a
deep convolutional network designed to be efficient and scalable, allowing real-time
multi-person pose estimation on various computing platforms.
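The core of the heatmap post-processing can be sketched as follows. This is an illustrative simplification (FastPose's actual code also refines the peak location), and mapping the peak back through the bounding box is an assumption about the crop geometry rather than the project's exact code.

```python
import numpy as np

def heatmaps_to_keypoints(heatmaps, bbox):
    """heatmaps: (num_keypoints, H, W); bbox: (x1, y1, x2, y2) of the person crop."""
    x1, y1, x2, y2 = bbox
    num_kp, h, w = heatmaps.shape
    keypoints = []
    for k in range(num_kp):
        idx = np.argmax(heatmaps[k])               # location of the strongest response
        py, px = np.unravel_index(idx, (h, w))
        conf = heatmaps[k, py, px]
        # Map heatmap coordinates back into the original image via the bounding box
        x = x1 + (px / w) * (x2 - x1)
        y = y1 + (py / h) * (y2 - y1)
        keypoints.append((x, y, conf))
    return np.array(keypoints)
```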
The action recognition component implements the Two-Stream Spatial Temporal Graph model and is
responsible for recognizing human actions based on the temporal sequence of estimated poses.
The class defines seven actions that it can recognize, namely 'Standing', 'Walking', 'Sitting',
'Lying Down', 'Stand Up', 'Sit Down' and 'Fall Down'. Actions are predicted from a sequence of
pose key points. The pose key points are first normalized and scaled, then converted into a
PyTorch tensor and permuted to match the expected input format of the model.
Initially, the input is a NumPy array of shape (t, v, c), where t is the number of time steps
(frames), v is the number of key points (body parts) and c is the number of channels (typically
2 for the x and y coordinates, sometimes with an additional channel for the key point confidence
score). The model also requires motion information, which is calculated by taking the difference
between consecutive frames of the pose key points. PyTorch models expect input tensors in a
specific format, so the NumPy array of shape (t, v, c) is permuted into shape (c, t, v), where c
is the number of channels, t is the number of time steps and v is the number of key points. This
permutation rearranges the dimensions to match the expected input format of the Two-Stream
Spatial Temporal Graph model. One more dimension is required to match the model's expected
input: the batch dimension. Even though the input is a single sample, a batch dimension of size
1 is added at the front of the tensor while keeping the other dimensions unchanged, giving a
final shape of (1, c, t, v). Permuting the dimensions and adding the batch dimension ensures
that the pose key point data is properly formatted for input to the model.
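The preprocessing chain described above might be sketched as follows. The 30-frame window, 13 key points, and normalization by the frame size are assumptions used for illustration, not values taken from the project.

```python
import numpy as np
import torch

def prepare_pose_tensor(pts, frame_w, frame_h):
    """pts: NumPy array of shape (t, v, c) with pixel coordinates and scores."""
    pts = pts.astype(np.float32).copy()
    pts[:, :, 0] /= frame_w                      # scale x into [0, 1]
    pts[:, :, 1] /= frame_h                      # scale y into [0, 1]

    # Motion stream: frame-to-frame differences of the (x, y) coordinates
    motion = pts[1:, :, :2] - pts[:-1, :, :2]    # shape (t-1, v, 2)

    # (t, v, c) -> (c, t, v), then add a batch dimension of size 1
    pose = torch.from_numpy(pts).permute(2, 0, 1).unsqueeze(0)       # (1, c, t, v)
    motion = torch.from_numpy(motion).permute(2, 0, 1).unsqueeze(0)  # (1, 2, t-1, v)
    return pose, motion

pose, motion = prepare_pose_tensor(np.random.rand(30, 13, 3), 640, 480)
print(pose.shape, motion.shape)   # torch.Size([1, 3, 30, 13]) torch.Size([1, 2, 29, 13])
```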
The next step is the action recognition itself. The action recognition component leverages the
temporal information encoded in the sequence of pose key points to classify the observed human
actions. For each confirmed track (tracked person), the code checks whether the length of the
key points list is equal to 30. If the list has 30 elements, the model has enough temporal
information to predict the action. The key points list is converted to a NumPy array and passed
to the predict method of the Two-Stream Spatial Temporal Graph model. The predicted action
probabilities are obtained from the model's output, and the name of the action with the highest
probability is retrieved. The action name and its corresponding probability are formatted into a
string, which is then drawn on the output frame along with a color code indicating the type of
action.
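A hypothetical sketch of this per-track decision logic is shown below; track, keypoints_list, action_model and the colour choices are illustrative placeholders rather than the project's actual identifiers.

```python
import numpy as np

# The seven action classes listed earlier in this section
ACTIONS = ['Standing', 'Walking', 'Sitting', 'Lying Down',
           'Stand Up', 'Sit Down', 'Fall Down']

def classify_track(track, action_model, frame_size):
    # Only predict once 30 frames of key points have been accumulated for this track
    if len(track.keypoints_list) == 30:
        pts = np.array(track.keypoints_list, dtype=np.float32)  # shape (30, v, c)
        probs = action_model.predict(pts, frame_size)           # one probability per action
        best = int(np.argmax(probs))
        label = f"{ACTIONS[best]}: {probs[best] * 100:.2f}%"
        # Highlight falls in red, everything else in green (illustrative colour code)
        color = (0, 0, 255) if ACTIONS[best] == 'Fall Down' else (0, 255, 0)
        return label, color
    return None, None
```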