

AlphaPose: Whole-Body Regional Multi-Person Pose Estimation and Tracking in Real-Time

Hao-Shu Fang*, Jiefeng Li*, Hongyang Tang, Chao Xu$, Haoyi Zhu$, Yuliang Xiu, Yong-Lu Li, and Cewu Lu†, Member, IEEE

arXiv:2211.03375v1 [cs.CV] 7 Nov 2022

Hao-Shu Fang, Jiefeng Li, Hongyang Tang, Chao Xu, Haoyi Zhu, Yong-Lu Li and Cewu Lu are with the Department of Electrical and Computer Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China. * denotes that the first two authors contributed equally to the manuscript, email: [email protected], [email protected]. $ denotes that the fourth and fifth authors contributed equally. † Corresponding author: Cewu Lu, email: [email protected]

Abstract—Accurate whole-body multi-person pose estimation and tracking is an important yet challenging topic in computer vision. To capture the subtle actions of humans for complex behavior analysis, whole-body pose estimation, including the face, body, hand and foot, is essential beyond conventional body-only pose estimation. In this paper, we present AlphaPose, a system that can perform accurate whole-body pose estimation and tracking jointly while running in real time. To this end, we propose several new techniques: Symmetric Integral Keypoint Regression (SIKR) for fast and fine localization, Parametric Pose Non-Maximum-Suppression (P-NMS) for eliminating redundant human detections, and Pose-Aware Identity Embedding for joint pose estimation and tracking. During training, we resort to a Part-Guided Proposal Generator (PGPG) and multi-domain knowledge distillation to further improve accuracy. Our method is able to localize whole-body keypoints accurately and track humans simultaneously given inaccurate bounding boxes and redundant detections. We show a significant improvement over current state-of-the-art methods in both speed and accuracy on COCO-WholeBody, COCO, PoseTrack, and our proposed Halpe-FullBody pose estimation dataset. Our model, source codes and dataset are made publicly available at https://github.com/MVIG-SJTU/AlphaPose.

Index Terms—human pose estimation, pose tracking, whole-body pose estimation, hand pose estimation, realtime, multi-person

1 INTRODUCTION

FULL body human pose estimation is a fundamental challenge in computer vision. It has many applications in human-computer interaction [1], the film industry [2], action recognition [3], etc.

In this work, we focus on the problem of multi-person full-body pose estimation. In conventional body-only pose estimation, recognizing the poses of multiple persons in the wild is more challenging than recognizing the pose of a single person in an image [4], [5], [6], [7], [8]. Previous attempts approached this problem by using either a top-down framework [9], [10] or a bottom-up framework [11], [12], [13].

Fig. 1. The quantization error caused by the heatmap representation (green and blue lines). With our symmetric integral keypoint regression (pink line), we can resolve the localization error. (The figure depicts an input image passing through a CNN to produce a heatmap.)

Our approach follows the top-down framework, which first detects human bounding boxes and then estimates the pose within each box independently. Although the performance of top-down methods is dominant on common benchmarks [14], [15], the methodology has some drawbacks. Since the detection stage and the pose estimation stage are separated, i) if the detector fails, there is no cue for the pose estimator to recover the human pose, and ii) current researchers adopt strong human detectors for accuracy, which makes the two-step processing slow at inference. To solve these drawbacks of the top-down framework, we propose a new methodology to make it efficient and reliable in practice. To alleviate the missed-detection problem, we lower the detection confidence and NMS thresholds to provide more candidates for subsequent pose estimation. The resulting redundant poses from redundant boxes are then eliminated by a parametric pose NMS, which introduces a novel pose distance metric to compare pose similarity. A data-driven approach is applied to optimize the pose distance parameters. We show that with such a strategy, a top-down framework with a YOLOv3-SPP detector can achieve performance on par with state-of-the-art detectors while achieving much higher efficiency. Furthermore, to speed up the top-down framework during inference, we design a multi-stage concurrent pipeline in AlphaPose, which allows our framework to run in real time.

Beyond body-only pose estimation, full-body pose estimation in the wild is more challenging as it faces several extra problems. For both the top-down framework and the bottom-up framework, the currently most used representation for keypoints is the heatmap [16], and the heatmap size is usually a quarter of the input image due to the limit of computation resources. However, for localizing the keypoints of

body, face and hands simultaneously, such a representation is unsuitable since it is incapable of handling the large scale variation across different body parts. A major problem is referred to as the quantization error. As illustrated in Fig. 1, since the heatmap representation is discrete, both of the adjacent grids on the heatmap may miss the correct position. This is not a problem for body pose estimation since the correct area is usually large. However, for fine-level keypoints on hands and face, it is easy to miss the correct position.

To solve this problem, previous methods either adopt additional sub-networks for hand and face estimation [17], or adopt RoIAlign to enlarge the feature map [18]. However, both methods are computationally expensive, especially in the multi-person scenario. In this paper, we propose a novel symmetric integral keypoint regression method that can localize keypoints at different scales accurately. It is the first regression method to achieve accuracy on par with the heatmap representation while eliminating the quantization error.

Another problem for full-body pose estimation is the lack of training data. Unlike the frequently studied body pose estimation with abundant datasets [14], [19], there is only one dataset [18] for full-body pose estimation. To promote development in this area, we annotate a new dataset named Halpe for this task, which includes extra essential joints not available in [18]. To further improve the generality of the top-down framework for full-body pose estimation in the wild, two key components are introduced. We adopt multi-domain knowledge distillation to incorporate training data from separate body-part datasets. To alleviate the domain gap between different datasets and the imperfect detection problem, we propose a novel part-guided human proposal generator (PGPG) to augment training samples. By learning the output distribution of a human detector for different poses, we can simulate the generation of human bounding boxes, producing a large sample of training data.

At last, we introduce a pose-aware identity embedding to enable simultaneous human pose tracking within our top-down framework. A person re-ID branch is attached to the pose estimator and we perform pose estimation and human identification jointly. With the aid of pose-guided region attention, our pose estimator is able to identify humans accurately. Such a design allows us to achieve real-time pose estimation and tracking in a unified manner.

This manuscript extends our preliminary work published at the conference ICCV 2017 [20] along the following aspects:

• We extend our framework to the full-body pose estimation scenario and propose a new symmetric integral keypoint localization network for fine-level localization.
• We extend our pose-guided proposal generator to incorporate multi-domain knowledge distillation on different body-part datasets.
• We annotate a new whole-body pose estimation benchmark (136 points for each person) and make comparisons with previous methods.
• We propose the pose-aware identity embedding that enables pose tracking in our top-down framework in a unified manner.
• This work documents the release of AlphaPose, which achieves both accurate and real-time performance. Our library has facilitated many researchers and has been starred over 6,000 times on GitHub.

2 RELATED WORK

In this section, we first briefly review papers on multi-person pose estimation, which provides background knowledge of human pose estimation. In Sec. 2.2 we review related works on multi-person whole-body pose estimation and discuss a key issue in the current literature. In Sec. 2.3 we review integral regression based keypoint localization and clarify our improvements over previous works. In Sec. 2.4 we review pose tracking and summarize the connections and differences between previous works and our method.

2.1 Multi Person Pose Estimation

Bottom-up Approaches. Bottom-up approaches were also called part-based approaches in the early literature. These approaches first detect all possible body parts in an image and then group them into individual skeletons. Representative works include [10], [11], [12], [13], [21], [22], [23]. Chen et al. [11] present an approach to parse largely occluded people with a graphical model which models humans as flexible compositions of body parts. Gkioxari et al. [10] use k-poselets to jointly detect people and predict the locations of human poses by a weighted average of all activated poselets. Pishchulin et al. [12] propose DeepCut to first detect all body parts, and then label, filter and assemble these parts via integer linear programming. A stronger part detector based on ResNet [24] and a better incremental optimization strategy is proposed by Insafutdinov et al. [13], named DeeperCut. OpenPose [17], [25] introduces Part Affinity Fields (PAFs) to encode association scores between body parts and individuals and solves the matching problem by decomposing it into a set of bipartite matching subproblems. Newell et al. [26] learn an identification tag for each detected part to indicate which individual it belongs to, named associative embedding. Cheng et al. [27] use a powerful multi-resolution network [28] as backbone and high-resolution feature pyramids to learn scale-aware representations. OpenPifPaf [22], [23] proposes a Part Intensity Field (PIF) and a Part Association Field (PAF) to localize and associate body parts respectively.

While bottom-up approaches have demonstrated good performance, their body-part detectors can be vulnerable since only small local regions are considered, and they face the scale variation challenge when there are small persons in the image.

Top-down Approaches. Our work follows the top-down paradigm like others [9], [20], [28], [29], [30], [31], which first obtains the bounding box for each human body through an object detector and then performs single-person pose estimation on the cropped image. Fang et al. [20] propose a symmetric spatial transformer network to handle the imperfect, highly noisy bounding boxes given by the human body detector. Mask R-CNN [29] extends Faster R-CNN [32] by adding a pose estimation branch in parallel with the existing bounding box recognition branch after

Fig. 2. Illustration of our full-body pose estimation and tracking framework. Given an input image, we first obtain (i) human detections using off-the-shelf object detectors like YOLOv3 or EfficientDet. For each detected human, we crop and resize it and forward it through the pose estimation and tracking networks to obtain the full-body human pose and re-ID features. The backbones of these two networks can either be separated for adaptation to different pose configurations, or share the same weights for fast inference (thus misaligned in the figure). The (a) symmetric integral regression is adopted for fine-level keypoint localization. We adopt (b) pose NMS to eliminate redundant poses. The (c) pose-guided alignment (PGA) module is applied to the predicted human re-ID features to obtain pose-aligned human re-ID features. The (d) multi-stage identity matching (MSIM) utilizes the human poses, re-ID features and detected boxes to produce the final tracking identity. During training, (e) the proposal generator and knowledge distillation are adopted to improve the generalization ability of the networks.

RoIAlign, enabling end-to-end training. PandaNet [33] proposes an anchor-based method to predict multi-person 3D poses in a single-shot manner and achieves high efficiency. Chen et al. [30] use a feature pyramid network to localize simple joints and a refining network, which integrates features of all levels from the previous network, to handle hard joints. A simple-structured network [31] with ResNet [24] as backbone and a few deconvolutional layers as upsampling head shows effective and competitive results. Sun et al. [28] present a powerful high-resolution network, where a high-resolution subnetwork is established in the first stage, and high-to-low resolution subnetworks are added one by one in parallel in subsequent stages, conducting repeated multi-scale feature fusions. Bertasius et al. [34] extend from images to videos and propose a method for learning pose warping on sparsely labeled videos.

Although state-of-the-art top-down approaches achieve remarkable precision on popular large-scale benchmarks, the two-step paradigm makes them slow in inference compared with bottom-up approaches. In addition, the lack of a library-level framework implementation hinders them from being applied in industry. Thus we present AlphaPose in this paper, in which we develop a multi-stage pipeline to process the time-consuming steps concurrently and enable fast inference.

One-stage Approaches. Some approaches need neither post hoc joint grouping nor human bounding boxes detected in advance. They locate human bodies and detect their joints simultaneously to improve upon the low efficiency of two-stage approaches. Representative works include CenterNet [35], SPM [36], DirectPose [37], and Point-set Anchor [38]. However, these approaches do not achieve precision as high as top-down approaches, partly because body center maps and dense joint displacement maps are high-semantic nonlinear representations that are difficult for the networks to learn.

2.2 Whole-Body Keypoint Localization

Unified detection of body, face, hand and foot keypoints for multiple persons is a relatively new research topic and few methods have been proposed. OpenPose [17] developed a cascaded method. It first detects body keypoints using PAFs [25] and then adopts two separate networks to estimate face landmarks and hand keypoints. Such a design makes it time-inefficient and consumes extra computation resources. Hidalgo et al. [39] propose a single network to estimate the whole-body keypoints. However, due to its one-step mechanism, the output resolution is limited, which decreases its performance on fine-level keypoints such as those on faces and hands. Jin et al. [18] propose a ZoomNet that uses RoIAlign to crop the hand and face regions on the feature maps and predicts keypoints on the resized feature maps. All these methods adopt the heatmap representation for keypoint localization due to its dominant performance on body keypoints. However, the aforementioned quantization problem of heatmaps decreases the accuracy of face and hand keypoints. The requirement of a large-size input also consumes more computation resources. In this paper, we argue that the soft-argmax representation is more suitable for whole-body pose estimation and propose an improved version of soft-argmax that yields higher accuracy. Jin et al. [18] also extended the COCO dataset to the whole-body scenario. However, some joints like head and neck are not present in this dataset, which are essential in tasks like mesh reconstruction. Meanwhile, the face annotation is incompatible with that in 300W. In this paper, we contribute a new in-the-wild multi-person whole-body pose estimation benchmark. We annotate 40K images from HICO-DET [40]

with Creative Commons license¹ as the training set and extend the COCO keypoints validation set (6K instances) as our test set. Experiments on this benchmark and COCO-WholeBody demonstrate the superiority of our method.

1. https://creativecommons.org/licenses/

2.3 Integral Keypoints Localization

The heatmap is a dominant representation for joint localization in the field of human pose estimation. The read-out locations from heatmaps are discrete numbers since heatmaps only describe the likelihood of joints occurring in each spatial grid, which leads to inevitable quantization error. As noted in Sec. 2.2, we argue that soft-argmax based integral regression is more suitable for whole-body keypoint localization. Several previous works have studied the soft-argmax operation to read continuous joint locations from heatmaps [41], [42], [43], [44], [45], [46]. Specifically, Luvizon et al. [43], [46] and Sun et al. [45] apply the soft-argmax operation to single-person 2D/3D pose estimation successfully. However, two drawbacks exist in these works that decrease their accuracy in pose estimation. We summarize the drawbacks as the asymmetric gradient problem and the size-dependent keypoint scoring problem. Details of these problems are provided in Sec. 3.1, as well as our proposed new gradient design and keypoint scoring method. By solving these problems, we provide a new keypoint regression method with higher accuracy, and it shows good performance in both whole-body pose estimation and body-only pose estimation.

2.4 Multi Person Pose Tracking

Multi-person pose tracking extends multi-person pose estimation to videos, giving each predicted keypoint the corresponding identity over time. Similar to the pose estimation literature, it can be divided into two categories: top-down [31], [47], [48], [49], [50], [51], [52], [53], [54] and bottom-up [23], [55], [56]. Based on bottom-up pose estimation methods, [55], [56] use the detected keypoints to build temporal and spatial graphs which aim to link the corresponding individual bodies by solving an optimization problem. However, the prerequisite of temporal and spatial graphs prevents graph-cut optimization from running in an online manner, which makes these methods quite time-consuming and memory-inefficient. [49] utilizes a 3D Mask R-CNN to estimate person tubes and poses simultaneously. [50] proposes a forward and backward bounding box propagation strategy to eliminate the issue of missed detections. The input of these methods is a whole video sequence, so they cannot achieve online tracking. Some other top-down methods take a single frame as input, and then use a designed pose flow [47], GCN [48], [52], optical flow [31] or transformer [54] for identity matching. Yang et al. [53] predict current poses given historical pose sequences and merge them with the pose estimation results from the current frame. A drawback of these methods is that they rely on the spatial continuity of the poses only, which may not be satisfied when the online image stream is unstable or humans are moving rapidly. Specifically, [57] proposes to use re-ID features to tackle the tracking problem. Our tracking method also explicitly adopts human re-ID features to solve this problem. Compared with [57], we design a pose-guided re-ID feature extraction to avoid potential background noise. Moreover, we design a multi-stage information merging method to utilize the boxes, poses, and re-ID features simultaneously.

3 WHOLE-BODY MULTI-PERSON POSE ESTIMATION

The whole pipeline of our proposed method is illustrated in Figure 2. In this section, we introduce the details of our pose estimation, which is shown in the top row of Fig. 2.

3.1 Symmetric Integral Keypoints Regression

As mentioned in Sec. 2.3, there exist two problems in the conventional soft-argmax operation for keypoint regression. We illustrate them in the following sections and propose our novel solutions.

3.1.1 Asymmetric gradient problem

The soft-argmax operation, also known as integral regression, is differentiable, which turns heatmap based approaches into regression based approaches and allows end-to-end training. The integral regression operation is defined as:

$$\hat{\mu} = \sum_{x} x \cdot p_x, \quad (1)$$

where $x$ is the coordinate of each pixel and $p_x$ denotes the pixel likelihood on the heatmap after normalization. During training, the loss function is applied to minimize the $\ell_1$ norm between the predicted joint locations $\hat{\mu}$ and the ground-truth locations $\mu$: $\mathcal{L}_{reg} = \|\mu - \hat{\mu}\|_1$. The gradient of each pixel can be formulated as:

$$\frac{\partial \mathcal{L}_{reg}}{\partial p_x} = x \cdot \mathrm{sgn}(\hat{\mu} - \mu). \quad (2)$$

Notice that the gradient amplitude is asymmetric. The absolute value of the gradient is determined by the absolute position (i.e., $x$) of the pixel instead of the relative position to the ground truth. This means that given the same distance error, the gradient differs when the keypoint is located at a different position. This asymmetry breaks the translation invariance of the CNN, which leads to performance degradation.

Amplitude Symmetric Gradient. To improve the learning efficiency, we propose an amplitude symmetric gradient (ASG) function in backward propagation, which is an approximation to the true gradient:

$$\delta_{ASG} = A_{grad} \cdot \mathrm{sgn}(x - \hat{\mu}) \cdot \mathrm{sgn}(\hat{\mu} - \mu), \quad (3)$$

where $A_{grad}$ denotes the amplitude of the gradients. It is a constant that we manually set to 1/8 of the heatmap size, and we give the derivation in the next paragraph. Using our symmetric gradient, the gradient distribution is centred at the predicted joint location $\hat{\mu}$. In the learning process, this symmetric gradient distribution can better utilize the advantage of heatmaps and approximate the ground-truth locations in a more direct manner. For example, assume the predicted location $\hat{\mu}$ is higher than the ground truth $\mu$. On one hand, the network tends to suppress the heatmap values on the right side of $\hat{\mu}$, because they have positive gradients;

on the other hand, the heatmap values on the left side of $\hat{\mu}$ will be activated because of their negative gradients.

Stable gradients for ASG. Here, we conduct a Lipschitz analysis to derive the value of $A_{grad}$ and show that ASG can provide a more stable gradient for training. Recall that $f$ denotes the objective function that we want to minimize. We say that $f$ is $L$-smooth if:

$$\|\nabla_\theta f(\theta + \Delta\theta) - \nabla_\theta f(\theta)\| \leq L\|\Delta\theta\|, \quad (4)$$

where $\theta$ is the network parameters and $\nabla$ denotes the gradient. The objective function can be re-written as:

$$\nabla_\theta f = \nabla_\theta \mathcal{L}(\mu, h(z)) = \nabla_z \mathcal{L}(\mu, h(z)) \nabla_\theta z, \quad (5)$$

where $z$ denotes the logits that are predicted by the network, and $\hat{\mu} = h(z)$ denotes the composition of the normalization and soft-argmax functions. Here, we assume the gradient of the network is smooth and only analyze the composition function, i.e.:

$$\|\nabla_z \mathcal{L}(\mu, h(z + \Delta z)) - \nabla_z \mathcal{L}(\mu, h(z))\|. \quad (6)$$

In the conventional integral regression, we have:

$$\nabla_z \mathcal{L}(\mu, h(z)) = (x - \hat{\mu}) \cdot p_x. \quad (7)$$

In this case, Eq. 6 is equivalent to:

$$\|(x - \hat{\mu} - \Delta\hat{\mu})(p_x + \Delta p_x) - (x - \hat{\mu}) \cdot p_x\|. \quad (8)$$

Note that $x$ can be an arbitrary position on the heatmap. Denoting the heatmap size as $W$, we have $\|x - \hat{\mu}\| \leq W$ over the whole dataset. Therefore, we derive the Lipschitz constant of integral regression as:

$$\|\nabla_z \mathcal{L}(\mu, h(z + \Delta z)) - \nabla_z \mathcal{L}(\mu, h(z))\| \leq \|W(p_x + \Delta p_x) - W p_x\| = W\|\Delta p_x\| = W \cdot L_s \cdot \|\Delta z\|, \quad (9)$$

where $L_s$ is the Lipschitz constant of the normalization function [58], [59]. It shows that the conventional integral regression multiplies a factor $W$ onto the Lipschitz constant of the normalization.

Similarly, we can derive the Lipschitz constant of the proposed amplitude symmetric function. Firstly, the gradient of the logits is:

$$|\nabla_z \mathcal{L}(\mu, h(z))| = \left|A_{grad} \cdot p_x \cdot \left(1 + \sum_{x_i < \hat{\mu}} p_{x_i} - \sum_{x_i > \hat{\mu}} p_{x_i}\right)\right| \leq 2 \cdot A_{grad} \cdot p_x. \quad (10)$$

We set $A_{grad} = W/8$ to make the average norm of the gradient the same as integral regression. Specifically,

$$\mathbb{E}_x[|(x - \hat{\mu}) p_x|] = \mathbb{E}_x[|x - \hat{\mu}|] \, p_x = \frac{W}{4} \cdot p_x. \quad (11)$$

The Lipschitz constant of the proposed amplitude symmetric function is derived as:

$$\|\nabla_z \mathcal{L}(\mu, h(z + \Delta z)) - \nabla_z \mathcal{L}(\mu, h(z))\| \leq \|2A_{grad}(p_x + \Delta p_x) - 2A_{grad} p_x\| = \frac{W}{4}\|\Delta p_x\| = \frac{W}{4} \cdot L_s \cdot \|\Delta z\|. \quad (12)$$

It shows that the Lipschitz constant of the proposed method is 4 times smaller than that of the original integral regression when $A_{grad} = W/8$, which indicates that the gradient space is smoother and the model can be optimized more easily.

3.1.2 Size-dependent Keypoint Scoring Problem

Before conducting soft-argmax, the element-sum of the predicted heatmaps should be normalized to one, i.e., $\sum p_x = 1$. Prior works [45], [46] adopt the soft-max operation, which works well in single-person pose estimation but leaves a large performance gap to the state-of-the-art in multi-person pose estimation [31], [45], [60]. This is because in multi-person cases, we need not only the joint locations, but also the joint confidences for pose NMS and for calculating the mAP. In previous methods, the maximum value of the heatmap is taken as the joint confidence, which is size-dependent and not accurate.

If we adopt a one-step normalization such as soft-max, the maximum value of the heatmap is inversely proportional to the scale of the distribution, which highly depends on the projected size of the body joint. Therefore, a large-size joint (e.g., left hip) will generate a smaller confidence value than a small-size joint (e.g., nose), which harms the reliability of the predicted confidence values.

Two-step Heatmap Normalization. To decouple confidence prediction and integral regression, we propose a two-step heatmap normalization. In the first step, we perform element-wise normalization to generate the confidence heatmap $C$:

$$c_x = \mathrm{sigmoid}(z_x), \quad (13)$$

where $z_x$ denotes the un-normalized logit value at location $x$, and $c_x$ denotes the confidence heatmap value at location $x$. Hence, the joint confidence can be indicated by the maximum value of the heatmap:

$$\mathrm{conf} = \max(C). \quad (14)$$

Since we use the element-wise sigmoid operation for the first step of normalization and do not force the sum of $C$ to be one, the maximum value of $C$ is not affected by the size of the joint. In this way, the predicted joint confidence is only related to the predicted location. In the second step, we perform global normalization to generate the probability heatmap $P$:

$$p_x = \frac{c_x}{\sum C}. \quad (15)$$

The element-sum of the probability heatmap $P$ is one, which ensures the predicted joint location $\hat{\mu}$ is within the heatmap boundary and stabilizes the training process.

To sum up, we obtain the joint confidence through the first step and obtain the joint location on the heatmap generated by the second step. An ablation study is carried out in Sec. 6.6 to show the effectiveness of our normalization method.
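To make the scheme above concrete, the following is a minimal, self-contained PyTorch sketch of symmetric integral keypoint regression in 1-D: the two-step normalization of Eqs. 13-15 followed by soft-argmax (Eq. 1) whose backward pass is replaced by the ASG rule of Eq. 3. The class and function names are ours, and the 1-D treatment and dummy targets are illustrative assumptions; this is a sketch, not the released AlphaPose implementation.

```python
# A minimal 1-D sketch of Symmetric Integral Keypoint Regression (SIKR),
# assuming logits z of shape (batch, joints, W). Names are illustrative,
# not taken from the released AlphaPose code.
import torch


class ASGIntegral(torch.autograd.Function):
    """Soft-argmax (Eq. 1) whose backward pass uses the ASG rule of Eq. 3."""

    @staticmethod
    def forward(ctx, p):
        # p: probability heatmap, sums to 1 over the last dimension
        coords = torch.arange(p.shape[-1], dtype=p.dtype, device=p.device)
        mu_hat = (p * coords).sum(dim=-1)          # Eq. 1
        ctx.save_for_backward(mu_hat, coords)
        ctx.a_grad = p.shape[-1] / 8.0             # A_grad = W/8 (Sec. 3.1.1)
        return mu_hat

    @staticmethod
    def backward(ctx, grad_out):
        mu_hat, coords = ctx.saved_tensors
        # grad_out carries sgn(mu_hat - mu) from the l1 loss; the amplitude
        # is replaced by A_grad * sgn(x - mu_hat), as in Eq. 3.
        sign_term = torch.sign(coords - mu_hat.unsqueeze(-1))
        return ctx.a_grad * sign_term * grad_out.unsqueeze(-1)


def two_step_normalize(z):
    """Eqs. 13-15: element-wise sigmoid, then global normalization."""
    c = torch.sigmoid(z)                           # confidence heatmap C
    conf = c.amax(dim=-1)                          # joint confidence, Eq. 14
    p = c / c.sum(dim=-1, keepdim=True)            # probability heatmap P
    return p, conf


z = torch.randn(2, 136, 64, requires_grad=True)    # dummy logits
p, conf = two_step_normalize(z)
mu_hat = ASGIntegral.apply(p)
loss = (mu_hat - torch.full_like(mu_hat, 20.0)).abs().mean()
loss.backward()                                    # gradients follow the ASG rule
```

Because the confidence comes from the sigmoid step and the location from the globally normalized step, the returned `conf` is independent of the spatial spread of the heatmap, matching the motivation of Sec. 3.1.2.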

3.2 Multi-Domain Knowledge Distillation

Beyond our novel symmetric integral regression, the performance of the network can further benefit from extra training data. Besides annotating a new dataset (detailed in Sec. 6.1), we also adopt multi-domain knowledge distillation to train our network. Three additional datasets are adopted, namely 300Wface [61], FreiHand [62] and InterHand [63]. The details of these datasets will be introduced in Sec. 6.1. Combining these datasets, our network is able to predict face and hand keypoints accurately for in-the-wild images.

During training, we construct each training batch by sampling different datasets with a fixed ratio. To be specific, 1/3 of the batch is sampled from our annotated dataset, 1/3 from COCO-fullbody, and the remainder is equally sampled from 300Wface and FreiHand. For each sample, we apply dataset-specific augmentation, which is introduced in the next section.

Although these domain-specific datasets are able to provide accurate intermediate supervision, their data distributions are quite different from in-the-wild images. To solve this problem, we extend our pose-guided proposal generator in [20] to the full-body scenario and conduct data augmentation in a unified manner.

Fig. 3. Distributions of bounding box offsets for several different body parts. The dotted boxes denote the ranges of approximated uniform distributions. Best viewed in color.

3.3 Part-Guided Proposal Generator

For two-stage pose estimation, the human proposals generated by the human detector usually follow a different data distribution from the ground-truth human boxes. Meanwhile, the spatial distributions of the face and hands also differ between the full-body images in the wild and the part-only images in the datasets. Without proper data augmentation during training, the pose estimator may not work properly in the testing phase for the detected humans.

To generate training samples with a distribution similar to the output of the human detector, we propose our part-guided proposal generator. For different body parts with a tightly surrounding bounding box, the proposal generator generates a new box that is in line with the distribution of the output of the human detector.

Since we already have the ground-truth bounding box for each part, we simplify this problem into modeling how the distribution of the relative offsets between the detected bounding box and the ground-truth bounding box varies across different parts. To be more specific, there exists a distribution

$$P(\delta x_{min}, \delta x_{max}, \delta y_{min}, \delta y_{max} \mid p),$$

where $\delta x_{min}/\delta x_{max}$ is the normalized offset between the leftmost/rightmost coordinate of a bounding box generated by the human detector and the coordinates of the ground-truth bounding box:

$$\delta x_{min} = \frac{x_{min}^{detect} - x_{min}^{gt}}{x_{max}^{gt} - x_{min}^{gt}}, \quad \delta x_{max} = \frac{x_{max}^{detect} - x_{max}^{gt}}{x_{max}^{gt} - x_{min}^{gt}},$$

and similarly for $\delta y_{min}$ and $\delta y_{max}$; $p$ is the ground-truth part type. If we can model this distribution, we are able to generate many training samples that are similar to the human proposals generated by the human detector.

To achieve that, we adopt an off-the-shelf object detector [64] and generate human detections for our Halpe-FullBody dataset. For each instance in the dataset, we separate the annotations of face, body and hands. For each separated part, we calculate the offsets between its tightly surrounding bounding box and the detected bounding box of the whole person. Since the box variances in the horizontal and vertical directions are usually independent, we simplify the modeling of the original distribution into modeling

$$P_x(\delta x_{min}, \delta x_{max} \mid p), \quad P_y(\delta y_{min}, \delta y_{max} \mid p).$$

After processing all the instances in Halpe-FullBody, the offsets form a frequency distribution, and we fit the data to a Gaussian mixture distribution. For different body parts, we have different Gaussian mixture parameters. We visualize the distributions and their corresponding parts in Figure 3.

During the training phase of the pose estimator, for a training sample belonging to part $p$, we can generate additional offsets to its ground-truth bounding box by dense sampling according to $P_x(\delta x_{min}, \delta x_{max} \mid p)$ and $P_y(\delta y_{min}, \delta y_{max} \mid p)$ to produce augmented training proposals. In practice, we found that sampling from an approximated uniform distribution (the dotted red boxes in Figure 3) can also produce similar performance.
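To illustrate the proposal generator in code, the sketch below perturbs a ground-truth part box with offsets drawn from a per-part Gaussian mixture over $(\delta x_{min}, \delta x_{max})$, as described above (the y-axis is handled analogously). The mixture parameters, part names and function names are hypothetical placeholders, not the statistics actually fitted on Halpe-FullBody.

```python
# Hypothetical sketch of part-guided proposal generation (Sec. 3.3).
# The mixture parameters below are made-up placeholders; in practice they
# would be fitted per part type on detector outputs over Halpe-FullBody.
import numpy as np

# One illustrative 2-component mixture over (dx_min, dx_max) per part type.
GMM_X = {
    "body": dict(weights=[0.6, 0.4],
                 means=[(-0.05, 0.05), (0.02, -0.02)],
                 stds=[(0.08, 0.08), (0.04, 0.04)]),
}

def sample_offsets(part, rng):
    """Draw (dx_min, dx_max) from the fitted mixture for this part."""
    gmm = GMM_X[part]
    k = rng.choice(len(gmm["weights"]), p=gmm["weights"])
    return rng.normal(gmm["means"][k], gmm["stds"][k])

def augment_box(gt_box, part, rng):
    """Perturb a ground-truth box like the detector would (x-axis shown)."""
    x_min, y_min, x_max, y_max = gt_box
    w = x_max - x_min
    dx_min, dx_max = sample_offsets(part, rng)   # normalized by gt width
    return (x_min + dx_min * w, y_min, x_max + dx_max * w, y_max)

rng = np.random.default_rng(0)
print(augment_box((100, 50, 300, 400), "body", rng))
```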

3.4 Parametric Pose NMS

For top-down approaches, a main drawback is the early commitment problem: if the human detector fails to detect a person, there is no recourse for the pose estimator to recover it. Most top-down based methods [28], [30], [31], [65] suffer from this problem since they set the detection confidence to a high value to avoid redundant poses. On the contrary, we set the detection confidence to a low value (0.1 in our experiments) to ensure a high detection recall. In this case, human detectors inevitably generate redundant detections for some people, which results in redundant pose estimations. Therefore, pose non-maximum suppression (NMS) is required to eliminate the redundancies. Previous methods [11], [66] are either not efficient or not accurate enough. In this paper, we propose a parametric pose NMS method. Similar to the previous subsection, the pose $P_i$ with $m$ joints is denoted as $\{\langle k_i^1, c_i^1 \rangle, \ldots, \langle k_i^m, c_i^m \rangle\}$, where $k_i^j$ and $c_i^j$ are the $j$-th location and confidence score of the joints respectively.

NMS scheme. We revisit pose NMS as follows: firstly, the most confident pose is selected as reference, and some poses close to it are subject to elimination by applying an elimination criterion. This process is repeated on the remaining pose set until the redundant poses are eliminated and only unique poses are reported.

Elimination Criterion. We need to define pose similarity in order to eliminate the poses which are too close and too similar to each other. We define a pose distance metric $d(P_i, P_j \mid \Lambda)$ to measure the pose similarity, and a threshold $\eta$ as the elimination criterion, where $\Lambda$ is the parameter set of the function $d(\cdot)$. Our elimination criterion can be written as follows:

$$f(P_i, P_j \mid \Lambda, \eta) = \mathbb{1}[d(P_i, P_j \mid \Lambda, \lambda) \leq \eta]. \quad (16)$$

If $d(\cdot)$ is smaller than $\eta$, the output of $f(\cdot)$ should be 1, which indicates that pose $P_i$ should be eliminated due to redundancy with the reference pose $P_j$.

Pose Distance. Now, we present the distance function $d_{pose}(P_i, P_j)$. We assume that the box for $P_i$ is $B_i$. Then we define a soft matching function

$$K_{Sim}(P_i, P_j \mid \sigma_1) = \begin{cases} \sum_n \tanh\frac{c_i^n}{\sigma_1} \cdot \tanh\frac{c_j^n}{\sigma_1}, & \text{if } k_j^n \text{ is within } \mathcal{B}(k_i^n) \\ 0, & \text{otherwise} \end{cases} \quad (17)$$

where $\mathcal{B}(k_i^n)$ is a box centered at $k_i^n$, and each dimension of $\mathcal{B}(k_i^n)$ is 1/10 of the original box $B_i$. The tanh operation filters out poses with low confidence scores. When two corresponding joints both have high confidence scores, the output will be close to 1. This distance softly counts the number of joints matching between poses.

The spatial distance between parts is also considered, which can be written as

$$H_{Sim}(P_i, P_j \mid \sigma_2) = \sum_n \exp\left[-\frac{(k_i^n - k_j^n)^2}{\sigma_2}\right]. \quad (18)$$

By combining Eqn. 17 and 18, the final distance function can be written as

$$d(P_i, P_j \mid \Lambda) = K_{Sim}(P_i, P_j \mid \sigma_1) + \lambda H_{Sim}(P_i, P_j \mid \sigma_2), \quad (19)$$

where $\lambda$ is a weight balancing the two distances and $\Lambda = \{\sigma_1, \sigma_2, \lambda\}$. Note that the previous pose NMS [11] set the pose distance parameters and thresholds manually. In contrast, our parameters can be determined in a data-driven manner.

Optimization. Given the detected redundant poses, the four parameters in the elimination criterion $f(P_i, P_j \mid \Lambda, \eta)$ are optimized to achieve the maximal mAP on the validation set. Since an exhaustive search in a 4D space is intractable, we optimize two parameters at a time by fixing the other two parameters in an iterative manner. Once convergence is achieved, the parameters are fixed and used in the testing phase.
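A minimal sketch of the greedy NMS scheme with the distance of Eqs. 17-19 follows. Since Eqs. 17-19 produce a similarity-style score (large when two poses match), this sketch treats a high matching score against the reference as redundancy; the parameter values are placeholders for the data-driven ones, and the code is illustrative rather than the exact AlphaPose routine.

```python
# Illustrative greedy parametric pose NMS (Sec. 3.4); a sketch, not the
# exact AlphaPose implementation. poses: (N, m, 2) keypoints, scores: (N, m).
import numpy as np

def matching_score(ki, ci, kj, cj, box_size, s1=0.3, s2=34.0, lam=1.0):
    """Similarity-style distance of Eqs. 17-19 (parameter values are
    placeholders; in the paper they are optimized in a data-driven way)."""
    within = np.all(np.abs(ki - kj) < box_size / 10.0, axis=-1)   # B(k_i^n)
    k_sim = np.where(within, np.tanh(ci / s1) * np.tanh(cj / s1), 0.0).sum()
    h_sim = np.exp(-((ki - kj) ** 2).sum(axis=-1) / s2).sum()     # Eq. 18
    return k_sim + lam * h_sim                                    # Eq. 19

def pose_nms(poses, scores, box_sizes, eta=2.0):
    order = list(np.argsort(-scores.mean(axis=1)))   # most confident first
    keep = []
    while order:
        ref = order.pop(0)
        keep.append(ref)
        # keep only candidates that do not look redundant with the reference
        order = [i for i in order
                 if matching_score(poses[ref], scores[ref], poses[i],
                                   scores[i], box_sizes[ref]) < eta]
    return keep
```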

4 MULTI PERSON POSE TRACKING

In this section, we introduce our multi-person pose tracking method, shown in the middle row of Fig. 2. We attach a person re-ID branch to the pose estimator, so the network can estimate the human pose and re-ID features simultaneously. A Pose-Guided Attention mechanism (PGA) is adopted to enhance the person identity features. Finally, the human proposal information (identity embedding, box and pose) is integrated by our designed Multi-Stage Identity Matching (MSIM) algorithm to achieve online real-time pose tracking.

4.1 Pose-Guided Attention Mechanism

Person re-ID features can be used to identify the same individual among many human proposals. In our top-down framework, we extract re-ID features from each bounding box produced by the object detector. However, the quality of the re-ID features is reduced by the background in the bounding box, especially when other people's bodies are present. In order to solve this problem, we consider using the predicted human pose to construct a region where the human body is concentrated. Thus, Pose-Guided Attention (PGA) is proposed to force the extracted features to focus on the human body of interest and ignore the impact of the background. The insight of PGA is elaborated in the ablation studies (Sec. 6.8).

The pose estimator generates $k$ heatmaps, where $k$ is the number of keypoints for each person. The PGA module then transforms these heatmaps into an attention map ($m_A$) with a simple convolutional layer. Note that $m_A$ has the same size as the re-ID feature map ($m_{id}$). Therefore, we can obtain the weighted re-ID feature map ($m_{wid}$):

$$m_{wid} = m_{id} \odot m_A + m_{id}, \quad (20)$$

where $\odot$ denotes the Hadamard product.

Finally, the identity embedding ($emb_{id}$), which is a 128-dimensional vector, is encoded by a fully-connected layer.

4.2 Multi-Stage Identity Matching

For a video sequence, let $H_t^i$ denote the $i$-th human proposal of the $t$-th frame. As described above, $H_t^i$ has several features: pose ($P_t^i$), bounding box ($B_t^i$) and identity embedding ($E_t^i$). Considering that all these features can determine the identity of a person, we design the MSIM algorithm to assign the corresponding id for $H_t^i$. Assume that the detection and tracking results of the previous $t-1$ frames have been obtained and stored in the tracking pool $Pl$. First, a Kalman filter is used to finetune the detection features in the current frame to make the trajectories smoother. Then we perform the first-stage matching by computing the affinity matrix $M_{emb}^t$ between the identity embeddings of the $t$-th frame and all embeddings existing in $Pl$. The matching rules are as follows:

$$\begin{cases} \mathrm{link}(p, q), & \text{if } M_{emb}^t[p][q] = \min(M_{emb}^t[p]) \text{ and } M_{emb}^t[p][q] \leq \mu_{emb} \\ H_t^p \text{ kept untracked}, & \text{otherwise} \end{cases} \quad (21)$$

where $\mathrm{link}(p, q)$ means $H_t^p$ shares the same trajectory with the $q$-th human proposal in $Pl$, and $\mu_{emb}$ is the threshold. Here we set $\mu_{emb}$ to 0.7 following [67].

At the second stage, we consider both position and shape constraints for those proposals left untracked by Eq. 21. Specifically, we use the IoU metric between bounding boxes as the position constraint and the normalized pose distance as the shape constraint. For two human proposals $H_t^i$ and $H_{t-\delta}^j$, we first resize their boxes to the same scale and get the center point $c$ of each box. Then we compute the normalized pose vector by subtracting the center from each keypoint's coordinates. Finally, we obtain the normalized pose distance ($dist_{np}$) by Eqn. 19. Therefore, the fused distance matrix of shape and location can be written as:

$$M_f^t = (1 - IoU) + \lambda_{np} \times dist_{np}, \quad (22)$$

where $IoU$ and $dist_{np}$ denote the matrices formed by the IoU function and the normalized pose distance between the untracked proposals and $Pl$, and $\lambda_{np}$ is a weight balancing the location and shape distance matrices. Here we also use a threshold $\mu_f$ to filter unmatched proposals as in Eqn. 21 and empirically set it to 0.5.

In order to match the tracklets that are not very similar to previous frames, we appropriately lower the threshold and repeat the above stage. If there is still no matched proposal, we regard it as a new tracklet and assign a new id to it.
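The sketch below illustrates both mechanisms in PyTorch: the pose-guided reweighting of Eq. 20 and a greedy first-stage embedding match in the spirit of Eq. 21. The sigmoid squashing of the attention map and the cosine-distance affinity are our assumptions; the module and function names are ours, not AlphaPose's.

```python
# Illustrative sketch of PGA (Eq. 20) and the first matching stage (Eq. 21).
# Names are ours; this is not the AlphaPose tracker verbatim.
import torch
import torch.nn as nn


class PGA(nn.Module):
    """Turn k keypoint heatmaps into an attention map and reweight re-ID features."""

    def __init__(self, num_joints, reid_channels):
        super().__init__()
        self.att = nn.Conv2d(num_joints, reid_channels, kernel_size=1)

    def forward(self, heatmaps, m_id):
        # sigmoid squashing is our assumption, not stated in the paper
        m_a = torch.sigmoid(self.att(heatmaps))   # attention map m_A
        return m_id * m_a + m_id                  # Eq. 20 (Hadamard product)


def first_stage_match(emb_t, emb_pool, mu_emb=0.7):
    """Greedy row-wise matching on an embedding affinity matrix (Eq. 21)."""
    # cosine distance as an illustrative affinity; lower = more similar
    m = 1 - torch.nn.functional.cosine_similarity(
        emb_t.unsqueeze(1), emb_pool.unsqueeze(0), dim=-1)
    links = {}
    for p in range(m.shape[0]):
        q = int(m[p].argmin())
        if m[p, q] <= mu_emb:
            links[p] = q                          # link(p, q)
        # otherwise H_t^p stays untracked for the second stage
    return links
```

Proposals that remain untracked after this stage would then go through the fused position/shape matching of Eq. 22.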

4.3 Joint Training Strategy

In order to simplify the training process of the whole network, we train the pose estimator and the re-ID branch simultaneously. Our network is trained on the COCO [14] and PoseTrack [68] datasets. PoseTrack has both pose and identity annotations, while COCO only has pose annotations. Therefore, when training on COCO, the gradient contributed by the re-ID branch does not participate in back propagation. We follow the loss-balancing strategy in [69] to jointly optimize the pose and identification sub-tasks.

5 ALPHAPOSE

In this section, we present AlphaPose², the first joint whole-body pose estimation and tracking system.

2. Available at https://github.com/MVIG-SJTU/AlphaPose

Fig. 4. System architecture of AlphaPose. Our system is divided into five modules, namely (a) a data loading module that can take images, video or a camera stream as input, (b) a detection module that provides human proposals, (c) a data transformation module to process the detection results and crop each single person for later modules, (d) a pose estimation module that generates keypoints and/or a human identity for each person, and (e) a post-processing module that processes and saves the pose results. Our framework is flexible and each module contains several components that can be replaced and updated easily. Dashed boxes denote optional components in each module. See text for more details; best viewed in color.

5.1 Pipeline

A drawback of the two-step framework is the limitation on inference speed. To facilitate the fast processing of large-scale data, we design a five-stage pipeline with a multiprocessing implementation to speed up our inference. Fig. 4 illustrates our AlphaPose pipelining mechanism. We divide the whole inference process into five modules, following the principle that each module consumes a similar amount of processing time. During inference, each module is hosted by an independent process or thread. Each process communicates with subsequent processes through a First-In-First-Out queue; that is, it stores the computed results of the current module and the following modules directly fetch the results from the queue. With such a design, these modules are able to run in parallel, resulting in a significant speed-up and enabling real-time application.
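As a toy illustration of this pipelining, the sketch below wires two stages together with multiprocessing FIFO queues and a shutdown sentinel; the stage bodies are stand-ins for the real detection and pose estimation modules, not AlphaPose's actual code.

```python
# Minimal sketch of the FIFO-queue pipeline (Sec. 5.1): each stage runs in
# its own process and streams results to the next. Toy stages, not the
# actual AlphaPose modules.
import multiprocessing as mp

def detect_stage(in_q, out_q):
    while (frame := in_q.get()) is not None:
        out_q.put((frame, f"boxes_for_{frame}"))   # stand-in for detection
    out_q.put(None)                                # propagate shutdown

def pose_stage(in_q):
    while (item := in_q.get()) is not None:
        frame, boxes = item
        print(f"pose estimated on {frame} using {boxes}")

if __name__ == "__main__":
    q1, q2 = mp.Queue(maxsize=8), mp.Queue(maxsize=8)
    workers = [mp.Process(target=detect_stage, args=(q1, q2)),
               mp.Process(target=pose_stage, args=(q2,))]
    for w in workers:
        w.start()
    for frame in ["img0", "img1", "img2"]:  # the data loader stage
        q1.put(frame)
    q1.put(None)                            # end-of-stream sentinel
    for w in workers:
        w.join()
```

The bounded queues (`maxsize=8`) provide back-pressure so a fast stage cannot run arbitrarily far ahead of a slow one, which is what keeps the five stages load-balanced.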

DataSet | #Kpt | Wild | Body Kpt | Hand Kpt | Face Kpt | HOI | Total Instances
MPII [15] | 16 | X | X | | | | 40K
CrowdPose [19] | 14 | X | X | | | | 80K
PoseTrack [68] | 15 | X | X | | | | 150K
COCO [14] | 17 | X | X | | | | 250K
OneHand10K [70] | 21 | X | | X | | | 10K
FreiHand [62] | 21 | | | X | | | 130K
MHP [71] | 21 | | | X | | | 80K
WFLW [72] | 98 | X | | | X | | 10K
AFLW [73] | 19 | X | | | X | | 25K
COFW [74] | 29 | X | | | X | | 1852
300W [61] | 68 | X | | | X | | 3837
COCO-WholeBody [18] | 133 | X | X | X | X | | 250K
Halpe-FullBody | 136 | X | X | X | X | X | 50K

TABLE 1
Overview of some popular public datasets for 2D keypoint estimation in RGB images. Kpt stands for keypoints, and #Kpt is the annotated number. "Wild" denotes whether the dataset is collected in-the-wild. "HOI" denotes human-object-interaction body-part labels.
“HOI” denotes human-object-interaction body-part labels.

5.2 Network (b)


For our two-step framework, various human detector and
pose estimator can be adopted.
In the current implementation, we adopt off-the-shelf
detectors include YOLOV3 [64] and EfficientDet [75] trained
on COCO [14] dataset. We do not retrained these models as
their released model already work well in our case.
For the pose estimator, we design a new backbone
named FastPose, which yields both high accuracy and ef-
ficiency. The network structure is illustrated in Fig. 5. We
use ResNet [24] as our network backbone to extract features
from the input cropped image. Three Dense Upsampling (a) (c)
Convolution (DUC) [76] modules are adopted to upsample Fig. 6. Annotated keypoint format in Halpe-FullBody dataset for (a) body
the extracted features, followed by a 1 × 1 convolution layer and foot, (b) face, (c) hand respectively. Zoom in for details of the face
annotation,
to generate heatmaps. The DUC module first applies 2D
convolution to the feature map with dimension h × w × c
and then reshapes it to 2h × 2w × c0 via a PixelShuffle [77]
operation. 6 DATASETS AND E VALUATIONS
To further boost the performance, we also incorporate 6.1 Datasets
deformable convolution operator into our ResNet backbone Halpe-FullBody To facilitate the development of whole
following [78] to improve the feature extraction. Such net- body human pose estimation, we annotate a full body key-
work is named as FastPose-DCN. points dataset named Halpe-FullBody3 . For each person,
we annotate 136 keypoints, including 20 for body, 6 for
5.3 System feet, 42 for hands and 68 for face. The keypoint format is
illustrated in Fig. 6. Note that since there are two popu-
AlphaPose is developed based on both PyTorch [79] and
lar definition for the face keypoints (see Fig. 7), we only
MXNet [80]. Benefiting from the flexibility of PyTorch, Al-
phaPose supports both Linux and Windows system. Alpha- 3. Available at https://github.com/Fang-Haoshu/Halpe-FullBody
Pose is highly optimized for the purpose of easy usage and
further development, as we decompose the training and
testing pipeline into different modules and one can easily
replace or update different modules for custom purpose.
For the data loading module, we support image input
by specifying image name, directory or a path list. Video
file or stream input from camera are also supported. For
the detection module, we adopt YOLOX [81] YOLOV3-
SPP [64], EfficientDet [75] and JDE [67]. Detecting results
from other detectors are also supported as a file input.
Other trackers like [82] can also be incorporated. For the
data transform module, we implement vanilla box NMS (a) (b)
and soft-NMS [83]. For the pose estimation module, we Fig. 7. Two different definition of face keypoints on the lower jaw. The
green dots represent the same definition and the red dots indicate their
supports SimplePose [31], HRNet [28], and our proposed differences. In (a) and (b), the left definition is commonly used in 2D
FastPose with different variants like FastPose-DCN. Our Re- annotated dataset like [18], [61], [72], while the right definition is used in
ID based tracking algorithm is also available in this module. 3D face alignment task like [84].

annotate the visible lower jaw of the face (green dots in Fig. 7) so as to be compatible with both definitions.

3. Available at https://github.com/Fang-Haoshu/Halpe-FullBody

For the images, our training set uses the training images of the HICO-DET [40] dataset and our testing set uses the COCO-val set. In total, our dataset contains 50K instances for training and 5K images for testing. Tab. 1 compares our dataset with previous popular datasets on human pose estimation.

COCO-WholeBody. As a concurrent work, Jin et al. annotate 133 whole-body keypoints based on the COCO dataset. They share a similar keypoint definition with us, except that the head, neck and hip points are missing in their annotation. The total training set contains 118K images with 250K instances, and the test set contains 5K images. We also evaluate our algorithm on this dataset.

COCO. The COCO dataset is a standard benchmark for human keypoint prediction. It contains 17 keypoints of the human body, without face, hand and foot annotations. In total, there are 118K images for training, 5K for validation and 41K for testing. We train our algorithm on the COCO 2017 train set and compare our FastPose network and symmetric integral loss with previous state-of-the-art models on the COCO 2017 test-dev set.

PoseTrack. PoseTrack is a large-scale dataset for multi-person pose estimation and tracking. It is built on the raw videos provided by the MPII Human Pose dataset [15]. There are more than 1356 video sequences in PoseTrack, split into train, val and test. Each annotated person has 17 keypoints similar to COCO, but two keypoints differ from COCO, namely 'top head' and 'bottom head'. The other annotations share the same format with COCO. We train our method on the PoseTrack-2018 set and compare it with previous methods on both the PoseTrack-2017-val and PoseTrack-2018-val sets.

300Wface, FreiHand and InterHand are used as supplemental datasets to improve the generalization ability of our model. 300Wface [61] contains 300 indoor and 300 outdoor in-the-wild images. For each face, 68 keypoints are annotated. FreiHand [62] contains 33K unique hand samples for training, each with 21 keypoints. InterHand [63] contains 2.6M images of interacting hands, where each hand also has 21 keypoints.

Method | Input Size | full-body (AP AP50 AP75 APL APM AR) | foot (AP AR) | face (AP AR) | hand (AP AR) | body (AP AR)
OpenPose-default [17] | N/A | 0.276 0.528 0.258 0.356 0.310 0.370 | 0.438 0.652 | 0.482 0.495 | 0.140 0.209 | 0.514 0.575
OpenPose-maxacc [17] | N/A | 0.281 0.531 0.265 0.363 0.318 0.381 | 0.456 0.677 | 0.482 0.496 | 0.142 0.211 | 0.526 0.590
SN [39] | N/A | 0.233 0.606 0.128 0.211 0.354 0.362 | 0.481 0.680 | 0.344 0.419 | 0.030 0.071 | 0.563 0.624
HRNet [28] | 256×192 | 0.387 0.782 0.346 0.393 0.432 0.522 | 0.581 0.749 | 0.429 0.558 | 0.104 0.204 | 0.605 0.713
Simple [31] | 256×192 | 0.409 0.782 0.391 0.417 0.435 0.506 | 0.706 0.782 | 0.444 0.536 | 0.141 0.233 | 0.648 0.691
ZoomNet | 384×288 | 0.427 0.803 0.412 0.446 0.433 0.513 | 0.702 0.778 | 0.505 0.569 | 0.136 0.210 | 0.648 0.699
FastPose50-hm | 256×192 | 0.417 0.784 0.406 0.426 0.439 0.516 | 0.730 0.803 | 0.432 0.536 | 0.163 0.258 | 0.658 0.701
FastPose50-si | 256×192 | 0.441 0.772 0.444 0.470 0.446 0.532 | 0.706 0.781 | 0.491 0.580 | 0.207 0.294 | 0.650 0.699
FastPose152-si | 256×192 | 0.451 0.785 0.457 0.475 0.460 0.537 | 0.724 0.791 | 0.508 0.590 | 0.199 0.294 | 0.651 0.699
FastPose50-dcn-si | 256×192 | 0.462 0.795 0.477 0.491 0.464 0.548 | 0.739 0.810 | 0.508 0.589 | 0.214 0.301 | 0.672 0.717
FastPose50-dcn-si* | 256×192 | 0.484 0.826 0.505 0.497 0.508 0.565 | 0.733 0.810 | 0.537 0.596 | 0.226 0.330 | 0.678 0.721

TABLE 2
Whole-body pose estimation results on the Halpe-FullBody dataset. For fair comparison, results are obtained using single-scale testing. "OpenPose-default" and "OpenPose-maxacc" denote its default and maximum-accuracy configurations respectively. "hm" denotes that the network uses heatmap based localization; "si" denotes that the network uses our symmetric integral regression. "*" denotes the model trained with multi-domain knowledge distillation and PGPG. FastPose50 denotes our FastPose network with ResNet-50 as backbone, and likewise FastPose152. "dcn" denotes that the deformable convolutional layer [78] is adopted in the ResNet backbone.

Type | Method | Input Size | GFLOPs | whole-body (AP AR) | body (AP AR) | foot (AP AR) | face (AP AR) | hand (AP AR)
Bottom-Up | OpenPose [17] | N/A | N/A | 0.338 0.449 | 0.563 0.612 | 0.532 0.645 | 0.482 0.626 | 0.198 0.342
Bottom-Up | SN [39] | N/A | N/A | 0.161 0.209 | 0.280 0.336 | 0.121 0.277 | 0.382 0.440 | 0.138 0.336
Bottom-Up | PAF [25] | N/A | N/A | 0.141 0.185 | 0.266 0.328 | 0.100 0.257 | 0.309 0.362 | 0.133 0.321
Bottom-Up | PAF-body [25] | N/A | N/A | - - | 0.409 0.470 | - - | - - | - -
Bottom-Up | AE [26] | N/A | N/A | 0.274 0.350 | 0.405 0.464 | 0.077 0.160 | 0.477 0.580 | 0.341 0.435
Bottom-Up | AE-body [26] | N/A | N/A | - - | 0.582 0.634 | - - | - - | - -
Top-Down | HRNet [28] | 384×288 | 16.0 | 0.432 0.520 | 0.659 0.709 | 0.314 0.424 | 0.523 0.582 | 0.300 0.363
Top-Down | HRNet-body [28] | 384×288 | 16.0 | - - | 0.758 0.809 | - - | - - | - -
Top-Down | ZoomNet | 384×288 | 20.0 | 0.541 0.658 | 0.743 0.802 | 0.798 0.869 | 0.623 0.701 | 0.401 0.498
Top-Down | FastPose50-si | 256×192 | 5.9 | 0.554 0.625 | 0.673 0.717 | 0.636 0.718 | 0.757 0.818 | 0.425 0.515
Top-Down | FastPose152-si | 256×192 | 13.2 | 0.569 0.641 | 0.684 0.730 | 0.672 0.750 | 0.765 0.824 | 0.443 0.532
Top-Down | FastPose50-dcn-si | 256×192 | 6.1 | 0.577 0.650 | 0.693 0.740 | 0.690 0.765 | 0.759 0.820 | 0.453 0.538

TABLE 3
Whole-body pose estimation results on the COCO-WholeBody dataset. For fair comparison, results are obtained using single-scale testing. We only report the input size and GFLOPs of the pose model for top-down approaches and ignore the detection model. "hm" denotes that the network uses heatmap based localization; "si" denotes that the network uses our symmetric integral regression. FastPose50 denotes our FastPose network with ResNet-50 as backbone, and likewise FastPose152. "dcn" denotes that the deformable convolutional layer [78] is adopted in the ResNet backbone.

Methods Backbone Detector Input Size GFLOPs AP AP50 AP75 APM APL AR
G-MRI [65] ResNet-101 Faster-RCNN 353 × 257 57.0 0.649 0.855 0.713 0.623 0.700 0.697
RMPE [20] PyraNet Faster-RCNN 320 × 256 26.7 0.723 0.892 0.791 0.680 0.786 -
CPN [25] ResNet-Inception FPN 384 × 288 - 0.721 0.914 0.800 0.687 0.772 0.785
Detection PAF-body [25] - - - - 0.618 0.849 0.675 0.571 0.682 0.665
AE [26] - - - - 0.655 0.868 0.723 0.606 0.726 0.702
SimplePose [31] ResNet-50 Faster-RCNN 256 × 192 8.9 0.702 0.909 0.783 0.671 0.759 0.758
HRNet [28] HRNet-32 Faster-RCNN 384 × 288 16.0 0.749 0.925 0.828 0.713 0.809 0.801
HRNet [28] HRNet-48 Faster-RCNN 384 × 288 32.9 0.755 0.925 0.833 0.719 0.815 0.805
FastPose-hm ResNet-50 YOLO-v3 256 × 192 5.9 0.718 0.919 0.803 0.728 0.742 0.773
FastPose-dcn-hm ResNet-50 YOLO-v3 256 × 192 6.1 0.726 0.922 0.812 0.737 0.749 0.781
FastPose-dcn-hm ResNet-101 YOLO-v3 256 × 192 9.8 0.727 0.922 0.813 0.736 0.751 0.781
Integral [45] ResNet-101 Faster-RCNN 256 × 256 17.8 0.678 0.882 0.748 0.639 0.740 -
CenterNet [35] Hourglass-2 stacked - - - 0.630 0.868 0.696 0.589 0.704 -
Regression SPM [36] Hourglass-8 stacked - - - 0.669 0.885 0.729 0.626 0.731 -
Point-set Anchor [38] HRNet-W48 - - - 0.687 0.899 0.763 0.648 0.753 -
FastPose-si ResNet-50 YOLO-v3 256 × 256 7.9 0.649 0.865 0.728 0.669 0.663 0.716
FastPose-si ResNet-101 YOLO-v3 256 × 256 12.8 0.679 0.876 0.751 0.675 0.714 0.723
FastPose-dcn-si ResNet-101 YOLO-v3 256 × 256 13.1 0.690 0.901 0.773 0.729 0.690 0.775

TABLE 4
Body pose estimation results on COCO test-dev set. For fair comparisons, results are obtained using single-scale testing. “hm” denotes the
network uses heatmap based localization, “si” denotes the network uses our symmetric integral regression.

6.2 Evaluation Metrics and Tools
Halpe-FullBody We extend the evaluation metric of the COCO keypoints to the full-body scenario. COCO defines an Object Keypoint Similarity (OKS) controlled by a per-keypoint constant k. For our newly added keypoints, we set k for the feet, face and hand to 0.015. Same as COCO, we report AP 0.5:0.95:0.05 as the main result, and detailed results for body, foot, face and hand are also reported.
COCO-WholeBody COCO-WholeBody adopts the same metric as ours, except that the constant k differs from ours for some keypoints.
COCO We adopt the standard AP metric of the COCO dataset for fair comparison with previous works.
PoseTrack Multi-person pose tracking can be regarded as the combination of multi-person pose estimation and multi-object tracking, so the evaluation metrics follow these two tasks. Mean Average Precision (mAP) [12] is used to measure frame-wise human pose accuracy. To evaluate tracking performance, the MOT [85] metric is applied to each body joint independently, and the final tracking performance is obtained by averaging the MOT metric over joints. The PCKh [15] (head-normalized probability of correct keypoint) is one of the most commonly used metrics to evaluate whether a body joint is predicted correctly; here it determines which predicted joint is matched with which ground-truth joint.
To evaluate the tracking results on the PoseTrack validation set, we use the official tool named poseval (https://github.com/leonid-pishchulin/poseval) and report Multiple Object Tracker Accuracy (MOTA), Multiple Object Tracker Precision (MOTP), Precision and Recall.
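For concreteness, the following is a minimal sketch of how an OKS score with per-keypoint constants can be computed. The 17 body-joint constants are the standard COCO values; only the 0.015 constant for the newly added keypoints comes from the protocol above. The array layout is an illustrative assumption, and normalization details (e.g., the factor-of-two convention used inside pycocotools) are elided.

```python
import numpy as np

def oks(pred, gt, vis, scale_sq, kappas):
    """OKS between a predicted and a ground-truth pose.
    pred, gt: (K, 2) keypoint coordinates; vis: (K,) visibility flags;
    scale_sq: object scale s^2; kappas: (K,) per-keypoint constants k."""
    d_sq = ((pred - gt) ** 2).sum(axis=1)            # squared distances
    e = d_sq / (2.0 * scale_sq * kappas ** 2 + np.spacing(1))
    labeled = vis > 0                                # only labeled keypoints count
    return float(np.exp(-e)[labeled].sum() / max(labeled.sum(), 1))

# Standard COCO constants for the 17 body joints, followed by 0.015 for
# each newly added foot, face and hand keypoint (136 keypoints in total).
BODY_K = np.array([.026, .025, .025, .035, .035, .079, .079, .072, .072,
                   .062, .062, .107, .107, .087, .087, .089, .089])
FULLBODY_K = np.concatenate([BODY_K, np.full(136 - 17, 0.015)])
```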
6.3 Implementation Details
We conduct our experiments with PyTorch [79]. We train the network with batch size 32 for 270 epochs. The initial learning rate is 0.01 and we decay it by a factor of 0.1 at epochs 100 and 170. The pose-guided proposal generator is applied after epoch 200. After the entire network is trained, we freeze the backbone and only finetune the re-ID branch on the PoseTrack dataset for 10 epochs; the learning rate in this finetuning phase is 1e-4. We adopt the Adam [86] optimizer during training. All experiments are conducted on 8 Nvidia 2080Ti GPUs.
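Assuming a standard PyTorch loop, the schedule above corresponds roughly to the following sketch. `build_fastpose`, `train_one_epoch`, `finetune_reid_one_epoch` and the `backbone`/`reid_branch` attributes are hypothetical placeholders; that the same optimizer family is used in both stages is our reading of the text, not a confirmed detail.

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import MultiStepLR

model = build_fastpose()  # hypothetical constructor for the pose network

# Main stage: 270 epochs, lr 0.01 decayed by 0.1 at epochs 100 and 170.
optimizer = Adam(model.parameters(), lr=0.01)
scheduler = MultiStepLR(optimizer, milestones=[100, 170], gamma=0.1)
for epoch in range(270):
    use_pgpg = epoch >= 200  # pose-guided proposals only after epoch 200
    train_one_epoch(model, optimizer, use_pgpg=use_pgpg)  # placeholder
    scheduler.step()

# Finetuning: freeze the backbone, train only the re-ID branch at lr 1e-4.
for p in model.backbone.parameters():
    p.requires_grad = False
reid_opt = Adam(model.reid_branch.parameters(), lr=1e-4)
for epoch in range(10):
    finetune_reid_one_epoch(model, reid_opt)  # placeholder
```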
6.4 Evaluation for Full Body Pose Estimation
We first evaluate the performance of our model on the Halpe-FullBody and COCO-WholeBody datasets. Since Halpe-FullBody is a new dataset, we retrain several state-of-the-art models and compare their results with ours. Tab. 2 gives the final results. YOLOv3 is adopted as the human detector for all the top-down models. We can see that top-down methods achieve higher accuracy than bottom-up methods. However, due to the quantization error introduced by heatmaps, conventional SPPEs degrade considerably on fine-level body parts like the face and hands. Equipped with our novel symmetric integral loss function, our FastPose models achieve the best accuracy. Notably, FastPose50-si yields 2.4 mAP (5.7% relative) higher accuracy than its heatmap-based counterpart, with the improvements coming mainly from the face and hands. This demonstrates that the quantization error of heatmaps affects the fine-level localization of face and hand keypoints, and that our symmetric integral regression works well in such cases.
On the COCO-WholeBody dataset, our FastPose embedded with the symmetric integral loss function also outperforms previous state-of-the-art methods by a large margin, especially on the face and hands. Notably, our FastPose achieves the highest accuracy given a smaller input size, and its model complexity is much lower than that of previous methods. This demonstrates the superiority of our network structure and the novel loss.
Some qualitative results of full-body pose estimation are shown in Fig. 8.
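To make the quantization argument concrete, the sketch below contrasts integer argmax decoding of a heatmap with integral (soft-argmax) decoding, which is sub-pixel. This is a generic illustration of the localization step only, not the exact SIKR formulation, which additionally involves the symmetric normalization and gradient design described earlier in the paper.

```python
import torch
import torch.nn.functional as F

def argmax_decode(heatmaps):
    # heatmaps: (K, H, W). Integer argmax -> quantized to the heatmap grid.
    K, H, W = heatmaps.shape
    idx = heatmaps.flatten(1).argmax(dim=1)
    return torch.stack([idx % W, idx // W], dim=1).float()  # (K, 2) as (x, y)

def integral_decode(heatmaps):
    # Soft-argmax: normalize to a distribution, then take the expectation
    # of the (x, y) coordinates -> continuous, sub-pixel predictions.
    K, H, W = heatmaps.shape
    probs = F.softmax(heatmaps.flatten(1), dim=1).view(K, H, W)
    ys = torch.arange(H, dtype=torch.float32).view(1, H, 1)
    xs = torch.arange(W, dtype=torch.float32).view(1, 1, W)
    x = (probs * xs).sum(dim=(1, 2))
    y = (probs * ys).sum(dim=(1, 2))
    return torch.stack([x, y], dim=1)  # (K, 2)
```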
Fig. 8. Qualitative results of AlphaPose on the full-body pose estimation task. Zoom in for more details and best viewed in color.
6.5 Evaluation for Conventional Body Pose Estimation
We also conduct experiments on the conventional body-only pose estimation task to demonstrate the effectiveness of our method, although it is not our main focus. We train our models on the COCO dataset and evaluate them on the COCO test-dev set. The results are reported in Tab. 4. For the heatmap-based methods, our FastPose backbone achieves on-par performance with the state-of-the-art method, given a smaller input size and a weaker human detector, which demonstrates the superiority of our FastPose network. Note that since our goal is to present a new baseline model like SimplePose [31], we conduct these experiments to prove the accuracy and efficiency of our model; further pursuing higher accuracy through speed and resource trade-offs is not our goal in this paper and we leave it for future research.
For the regression-based methods, our method achieves state-of-the-art performance with the lowest GFLOPs. Compared to [45], our network serves as a new baseline for future research.

6.6 Ablation Studies for Pose Estimation
To evaluate the effectiveness of our proposed modules for pose estimation, we conducted ablation experiments on the COCO and Halpe-FullBody datasets. We adopt FastPose50 as the base network and report the numbers on the COCO validation and Halpe-FullBody test sets, respectively. The results are summarized in Tab. 5.

Module Halpe-FullBody (mAP) COCO (mAP)
two-step hm-norm 44.1 69.5
one-step hm-norm 38.1 67.1
w. SIKR 44.1 69.5
w/o SIKR 42.3 64.6
w. P-NMS 44.1 69.5
w/o P-NMS 43.7 68.2
w. PGPG 48.4* N/A
w/o PGPG 47.1* N/A

TABLE 5
Ablation studies on the Halpe-FullBody and COCO datasets. “hm-norm” denotes heatmap normalization. “*” denotes results trained with additional data from Multi-Domain Knowledge Distillation.
Fig. 9. Qualitative results of AlphaPose on the full-body pose tracking task. Zoom in for more details and best viewed in color. The colors of persons
denote their tracking ID. The image order is denoted by the time arrow. See text for more analysis.
Heatmap Normalization We elucidated the essence of our two-step heatmap normalization for applying the integral-based method in the multi-person scenario in Sec. 3.1. Here we conduct an ablation experiment to show the performance gap between different heatmap normalization methods. Replacing our two-step heatmap normalization with the conventional one-step normalization (soft-max) decreases multi-person pose estimation performance by 6 mAP and 2.4 mAP on the Halpe-FullBody and COCO datasets, respectively. This demonstrates that the two-step normalization can alleviate the size-dependent effect and improve performance.
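Since Sec. 3.1 is not reproduced here, the snippet below only illustrates the size-dependent effect itself: a one-step soft-max changes its output distribution when the heatmap responses are rescaled (as happens across person sizes), whereas clamping then sum-normalizing, one plausible reading of the two-step scheme, is invariant to positive rescaling.

```python
import torch
import torch.nn.functional as F

h = torch.randn(64 * 48)            # a flattened heatmap (illustrative)
for scale in (0.5, 1.0, 2.0):       # stand-in for response-magnitude changes
    p_softmax = F.softmax(scale * h, dim=0)   # one-step soft-max: peak changes
    r = torch.clamp(scale * h, min=0.0)
    p_twostep = r / (r.sum() + 1e-8)          # clamp + sum-normalize: invariant
    print(scale, p_softmax.max().item(), p_twostep.max().item())
```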
SIKR Module We compare our symmetric integral function with the original integral regression [45]. In both the full-body pose estimation scenario and conventional body pose estimation, our symmetric integral function greatly outperforms the original integral regression.
Pose NMS Module Without Pose-NMS, multiple poses will be predicted for a single person, and the redundant poses decrease the model performance. From Tab. 5, we can see that without it our model drops by 0.4 mAP and 1.3 mAP on the Halpe-FullBody and COCO datasets, respectively.
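As an illustration of the elimination logic, here is a greedy pose-NMS sketch built on an OKS-style pose similarity. Our full parametric criterion additionally combines keypoint and spatial similarity with learned parameters, so the similarity function and threshold below should be treated as simplified stand-ins.

```python
import numpy as np

def pose_similarity(p, q, kappas, area):
    # OKS-style similarity between two (K, 2) poses of the same scale.
    d_sq = ((p - q) ** 2).sum(axis=1)
    return np.exp(-d_sq / (2 * area * kappas ** 2 + np.spacing(1))).mean()

def pose_nms(poses, scores, kappas, areas, thresh=0.7):
    # poses: (N, K, 2); scores: (N,); areas: (N,) object scales s^2.
    # Greedily keep the highest-scoring pose, drop every remaining pose
    # that is too similar to it (likely the same person), and repeat.
    order = np.argsort(-scores).tolist()
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order
                 if pose_similarity(poses[i], poses[j], kappas, areas[i]) < thresh]
    return keep
```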
PGPG Module Proper data augmentation is needed during training to ensure generalization ability at the testing phase. For the Halpe-FullBody dataset, we compare the results of FastPose50-dcn trained with and without the PGPG module. Tab. 5 shows that without our part-guided proposal generation, the performance decreases due to the domain variance in training.

6.7 Evaluation for Pose Tracking
To verify that our system is sufficient for the multi-person pose tracking task, we apply it to the PoseTrack validation datasets. Tab. 6 shows the comparison with other state-of-the-art methods. The backbone we adopt is FastPose152 and the detector is YOLOX. Our model outperforms most methods in both the mAP and MOTA metrics, and our speed is quite fast; this near real-time processing speed makes it applicable to many real-life scenarios. It is worth noting that some other methods [49], [50] have achieved good results on the PoseTrack dataset, but they mainly exploit the temporal information of the whole video, which means they are not strictly online algorithms; therefore our method is not directly compared with theirs. [52], [53], [54] achieve higher accuracy than ours, but they use very high resolutions for input and output, which consumes a lot of memory and is computationally expensive. Our method achieves satisfactory accuracy while running efficiently.

Method mAP MOTA fps Res Src
PoseTrack 2017 validation:
Det&Track [49] 60.6 55.2 1.2 - ✓
PoseFlow [47] 66.5 58.3 10* - ✓
JointFlow [87] 69.3 59.8 0.2 N/A ×
Fast [57] 70.3 63.2 12.2 N/A ×
TML++ [88] 71.5 61.3 - - ×
STAF [55] 72.6 62.7 3.0 N/A ✓
FlowTrack [31] 76.7 65.4 3.0 384×288 ✓
PGPT [52] 77.2 68.4 1.2 384×288 ✓
Yang et al. [53] 81.1 73.4 - 384×288 ×
Ours-UNI 76.1 65.5 11.3 256×192 ✓
Ours-SEP 76.9 65.7 8.9 256×192 ✓
PoseTrack 2018 validation:
MDPN [51] 71.7 50.6 - 384×288 ×
STAF [55] 70.4 60.9 3.0 N/A ✓
OpenSVAI [89] 69.7 62.4 - - ×
LightTrack [48] 71.2 64.6 0.7 384×288 ✓
KeyTrack [54] 74.3 66.6 1.0 384×288 ×
PGPT [52] 76.8 67.1 1.2 384×288 ✓
Yang et al. [53] 77.9 69.2 - 384×288 ×
Ours-UNI 74.0 64.4 10.9 256×192 ✓
Ours-SEP 74.7 64.7 8.7 256×192 ✓

TABLE 6
Evaluation results on the PoseTrack validation sets. “Res” denotes the input resolution of the pose network and “Src” denotes whether source code is available (✓/×). “Ours-UNI” denotes results trained with a shared backbone for the pose and re-ID branches and “Ours-SEP” denotes results trained with separated backbones. The “*” in fps means detection time is not included. The mAP values are obtained after tracking post-processing.

6.8 Ablation Studies for Pose Tracking
To verify the effectiveness of each part of the tracking algorithm, we designed several sets of ablation experiments.
PGA Module The function of the PGA module is to assist in extracting more effective re-ID features with the help of keypoint information. As a comparison, we remove the PGA module from our framework, which means the human pose and re-ID features are fed into MSIM directly. Tests on the PoseTrack dataset show that the tracking performance decreases after removing the PGA module, as reported in Tab. 7. At the same time, we visualize the re-ID features extracted with and without the PGA module in Fig. 10. Since the detection result is usually a larger box than the actual extent of the human, the background takes a large proportion of the box; this background information makes the human identity embedding carry useless features. This intuitively explains the advantage of PGA: it can better focus attention on the target person's area.
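A minimal sketch of the idea behind pose-guided attention: pool the re-ID feature map under a keypoint-derived spatial attention map so the embedding comes mostly from the person's own pixels rather than from the background inside the box. The aggregation below (summing heatmaps, then weighted average pooling) is an assumed instantiation, not the exact PGA architecture.

```python
import torch

def pose_guided_embedding(feat, kpt_heatmaps):
    # feat: (C, H, W) re-ID feature map; kpt_heatmaps: (K, H, W) predicted
    # keypoint heatmaps from the pose branch, spatially aligned with feat.
    attn = kpt_heatmaps.sum(dim=0, keepdim=True)   # (1, H, W) person mask-like map
    attn = attn / (attn.sum() + 1e-8)              # normalize to a spatial weighting
    emb = (feat * attn).sum(dim=(1, 2))            # attention-weighted pooling, (C,)
    return emb / emb.norm().clamp(min=1e-8)        # unit-norm identity embedding
```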
Setting head shou elb hip knee Total
PGA ablation (mAP):
w/ 77.7 75.4 75.3 69.0 68.1 74.7
w/o 78.0 75.6 75.5 69.3 68.6 74.9
PGA ablation (MOTA):
w/ 74.0 72.6 64.6 66.8 58.9 64.7
w/o 73.4 71.8 64.2 66.1 58.4 63.5
MSIM ablation (mAP):
No-GT 77.7 75.4 75.3 69.0 68.1 74.7
GT-Box 81.3 81.0 81.5 80.8 81.5 81.3
GT-Pose 100.0 100.0 100.0 100.0 100.0 100.0
MSIM ablation (MOTA):
No-GT 74.0 72.6 64.6 66.8 58.9 64.7
GT-Box 75.8 75.5 75.8 75.4 76.7 75.9
GT-Pose 93.8 93.6 93.7 93.8 94.4 93.9

TABLE 7
The ablation study results of the proposed pose tracking method.

Fig. 11. Speed/Accuracy comparison of different pose estimation and tracking libraries. (a) Pose estimation results obtained on the COCO-WholeBody validation set and the COCO validation set. (b) Pose tracking results obtained on the PoseTrack18-val set: MOTA vs. average inference time (ms) per image, for STAF, LightTrack, PGPT and AlphaPose.

MSIM To further verify the performance of our model, we feed different levels of ground-truth information into the network. Specifically, we set up several sets of experiments using the GT box and the GT pose, respectively. These results are reported in Tab. 7. They show that if we replace the human detector and pose estimator with more accurate networks, our tracking performance will be further improved.
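The GT-Box and GT-Pose rows in Tab. 7 correspond to an oracle-style evaluation in which ground truth is substituted for one stage of the pipeline at a time. A schematic sketch of that protocol, where every name is a placeholder:

```python
from typing import Callable, List

def track(frames: List[dict],
          detect: Callable, estimate: Callable, associate: Callable,
          use_gt_box: bool = False, use_gt_pose: bool = False) -> list:
    # Oracle ablation: optionally substitute ground-truth boxes or poses
    # for the predicted ones, isolating the contribution of each stage.
    results = []
    for frame in frames:
        boxes = frame["gt_boxes"] if use_gt_box else detect(frame["image"])
        poses = frame["gt_poses"] if use_gt_pose else estimate(frame["image"], boxes)
        results.append(associate(poses))  # identity assignment (e.g., via MSIM)
    return results
```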
7 FULL BODY POSE TRACKING
In the sections above, we demonstrated the effectiveness of our methods on both full-body pose estimation and pose tracking. Since our tracking algorithm is general, it is also applicable to the whole-body scenario. We adopt a weakly supervised strategy by training on both the PoseTrack and Halpe-FullBody datasets.
Some qualitative results of full-body pose tracking are shown in Fig. 9. Both full-body pose estimation and pose tracking yield high accuracy in this heavily crowded scene, and our method is insensitive to the size variance of humans. Specifically, when a person is occluded by others and re-appears, our method still assigns the correct identity (e.g., the person with black shorts on the right).

Fig. 10. Visualization of the role of the PGA module (columns: original image, w/o PGA, w. PGA). When there is no PGA module, some background areas also have a high response; with the PGA module, the feature response is more concentrated on the target person. Notably, from panel (b) we can see that when two people are close, the feature response focuses on the target person with the aid of PGA (zoom in for more details).

8 LIBRARY ANALYSIS
In this section, we compare our AlphaPose library with other popular open-source libraries in both pose estimation and pose tracking. The results are obtained on a single Nvidia 2080Ti GPU. Fig. 11 shows the speed-accuracy curves of the different libraries.
From Fig. 11(a), we can see that our method achieves the highest accuracy and the highest efficiency on both whole-body and body-only pose estimation. Although a drawback of our top-down approach is that the running time increases as the number of persons in the scene grows, our parallel processing pipeline greatly mitigates this deficiency. According to the statistics reported by OpenPose [17], our library is more efficient when there are fewer than 20 persons in the scene. From Fig. 11(b), we can see that our pose tracking achieves on-par performance with the state-of-the-art libraries while running with high efficiency.

9 CONCLUSION
In this paper, we propose a unified and realtime framework for multi-person full-body pose estimation and tracking. To the best of our knowledge, it is the first framework that serves this purpose. Several novel techniques are presented to achieve this goal, and we demonstrate superior performance in both efficacy and efficiency. A new dataset that contains full-body keypoints (136 keypoints for each person) is annotated to facilitate research in this area. We also present a standard library that is highly optimized for easy usage and hope that it can benefit our community. In future research, we will also add 3D keypoints and meshes to our library.

ACKNOWLEDGMENT
This work is supported in part by the National Key R&D Program of China, No. 2017YFA0700800, Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), Shanghai Qi Zhi Institute, and SHEITC (2018-RGZN-02046). We appreciate Chenxi Wang for helping develop the MXNet version and Yang Han for developing the Jittor version of AlphaPose. Hao-Shu Fang would like to thank the support from the Baidu, MSRA and ByteDance Fellowships.

REFERENCES
[1] K. Wang, R. Zhao, and Q. Ji, “Human computer interaction with head pose, eye gaze and body gestures,” in 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). IEEE, 2018, pp. 789–789.
[2] T. B. Moeslund, A. Hilton, and V. Krüger, “A survey of advances in vision-based human motion capture and analysis,” Computer Vision and Image Understanding, vol. 104, no. 2-3, pp. 90–126, 2006.
[3] B. Pang, K. Zha, and C. Lu, “Human action adverb recognition: ADHA dataset and a three-stream hybrid model,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 2325–2334.
[4] B. Sapp, A. Toshev, and B. Taskar, “Cascaded models for articulated pose estimation,” in European Conference on Computer Vision (ECCV). Springer, 2010, pp. 406–420.
[5] M. Sun, P. Kohli, and J. Shotton, “Conditional regression forests for human pose estimation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2012, pp. 3394–3401.
[6] L. Ladicky, P. H. Torr, and A. Zisserman, “Human pose estimation using a joint pixel-wise and part-wise formulation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 3578–3585.
[7] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” arXiv preprint arXiv:1603.06937, 2016.
[8] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4724–4732.
[9] L. Pishchulin, A. Jain, M. Andriluka, T. Thormählen, and B. Schiele, “Articulated people detection and pose estimation: Reshaping the future,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 3178–3185.
[10] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik, “Using k-poselets for detecting people and localizing their keypoints,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 3582–3589.
[11] X. Chen and A. L. Yuille, “Parsing occluded people by flexible compositions,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3945–3954.
[12] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele, “DeepCut: Joint subset partition and labeling for multi person pose estimation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[13] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele, “DeeperCut: A deeper, stronger, and faster multi-person pose estimation model,” in European Conference on Computer Vision (ECCV), May 2016.
[14] http://mscoco.org/dataset/#keypoints-leaderboard, 2016.
[15] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, “2D human pose estimation: New benchmark and state of the art analysis,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[16] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler, “Joint training of a convolutional network and a graphical model for human pose estimation,” in Conference on Neural Information Processing Systems (NeurIPS), 2014, pp. 1799–1807.
[17] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, “OpenPose: Realtime multi-person 2D pose estimation using part affinity fields,” arXiv preprint arXiv:1812.08008, 2018.
[18] S. Jin, L. Xu, J. Xu, C. Wang, W. Liu, C. Qian, W. Ouyang, and P. Luo, “Whole-body human pose estimation in the wild,” in European Conference on Computer Vision, 2020, pp. 196–214.
[19] J. Li, C. Wang, H. Zhu, Y. Mao, H.-S. Fang, and C. Lu, “CrowdPose: Efficient crowded scenes pose estimation and a new benchmark,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10863–10872.
[20] H.-S. Fang, S. Xie, Y.-W. Tai, and C. Lu, “RMPE: Regional multi-person pose estimation,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2334–2343.
[21] U. Iqbal and J. Gall, “Multi-person pose estimation with local joint-to-person associations,” in European Conference on Computer Vision Workshops (ECCVW), 2016.
[22] S. Kreiss, L. Bertoni, and A. Alahi, “PifPaf: Composite fields for human pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11977–11986.
[23] S. Kreiss, L. Bertoni, and A. Alahi, “OpenPifPaf: Composite fields for semantic keypoint detection and spatio-temporal association,” IEEE Transactions on Intelligent Transportation Systems, 2021.
[24] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[25] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2D pose estimation using part affinity fields,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[26] A. Newell, Z. Huang, and J. Deng, “Associative embedding: End-to-end learning for joint detection and grouping,” in Advances in Neural Information Processing Systems, 2017, pp. 2274–2284.
[27] B. Cheng, B. Xiao, J. Wang, H. Shi, T. S. Huang, and L. Zhang, “HigherHRNet: Scale-aware representation learning for bottom-up human pose estimation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 5386–5395.
[28] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5693–5703.
[29] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2980–2988.
[30] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun, “Cascaded pyramid network for multi-person pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7103–7112.
[31] B. Xiao, H. Wu, and Y. Wei, “Simple baselines for human pose estimation and tracking,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 466–481.
[32] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Conference on Neural Information Processing Systems (NeurIPS), 2015, pp. 91–99.
[33] A. Benzine, F. Chabot, B. Luvison, Q. C. Pham, and C. Achard, “PandaNet: Anchor-based single-shot multi-person 3D pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6856–6865.
[34] G. Bertasius, C. Feichtenhofer, D. Tran, J. Shi, and L. Torresani, “Learning temporal pose estimation from sparsely-labeled videos,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[35] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv preprint arXiv:1904.07850, 2019.
[36] X. Nie, J. Feng, J. Zhang, and S. Yan, “Single-stage multi-person pose machines,” in IEEE International Conference on Computer Vision (ICCV), 2019, pp. 6951–6960.
[37] Z. Tian, H. Chen, and C. Shen, “DirectPose: Direct end-to-end multi-person pose estimation,” arXiv preprint arXiv:1911.07451, 2019.
[38] F. Wei, X. Sun, H. Li, J. Wang, and S. Lin, “Point-set anchors for object detection, instance segmentation and pose estimation,” in ECCV, 2020.
[39] G. Hidalgo, Y. Raaj, H. Idrees, D. Xiang, H. Joo, T. Simon, and Y. Sheikh, “Single-network whole-body pose estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6982–6991.
[40] Y.-W. Chao, Y. Liu, X. Liu, H. Zeng, and J. Deng, “Learning to detect human-object interactions,” in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018, pp. 381–389.
[41] C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel, “Learning visual feature spaces for robotic manipulation with deep spatial autoencoders,” arXiv preprint arXiv:1509.06113, vol. 25, 2015.
[42] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua, “LIFT: Learned invariant feature transform,” in European Conference on Computer Vision. Springer, 2016, pp. 467–483.
[43] D. C. Luvizon, D. Picard, and H. Tabia, “2D/3D pose estimation and action recognition using multitask deep learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5137–5146.
[44] A. Nibali, Z. He, S. Morgan, and L. Prendergast, “3D human pose estimation with 2D marginal heatmaps,” in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2019, pp. 1477–1485.
[45] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei, “Integral human pose regression,” in ECCV, 2018.
[46] D. C. Luvizon, H. Tabia, and D. Picard, “Human pose regression by combining indirect part detection and contextual information,” Computers & Graphics, vol. 85, pp. 15–22, 2019.
[47] Y. Xiu, J. Li, H. Wang, Y. Fang, and C. Lu, “Pose Flow: Efficient online pose tracking,” arXiv preprint arXiv:1802.00977, 2018.
[48] G. Ning and H. Huang, “LightTrack: A generic framework for online top-down human pose tracking,” arXiv preprint arXiv:1905.02822, 2019.
[49] R. Girdhar, G. Gkioxari, L. Torresani, M. Paluri, and D. Tran, “Detect-and-Track: Efficient pose estimation in videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 350–359.
[50] M. Wang, J. Tighe, and D. Modolo, “Combining detection and tracking for human pose estimation in videos,” arXiv preprint arXiv:2003.13743, 2020.
[51] H. Guo, T. Tang, G. Luo, R. Chen, Y. Lu, and L. Wen, “Multi-domain pose network for multi-person pose estimation and tracking,” in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.
[52] Q. Bao, W. Liu, Y. Cheng, B. Zhou, and T. Mei, “Pose-guided tracking-by-detection: Robust multi-person pose tracking,” IEEE Transactions on Multimedia, vol. 23, pp. 161–175, 2020.
[53] Y. Yang, Z. Ren, H. Li, C. Zhou, X. Wang, and G. Hua, “Learning dynamics via graph neural networks for human pose estimation and tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8074–8084.
[54] M. Snower, A. Kadav, F. Lai, and H. P. Graf, “15 keypoints is all you need,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6738–6748.
[55] Y. Raaj, H. Idrees, G. Hidalgo, and Y. Sheikh, “Efficient online multi-person 2D pose tracking with recurrent spatio-temporal affinity fields,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4620–4628.
[56] S. Jin, W. Liu, W. Ouyang, and C. Qian, “Multi-person articulated tracking with spatial and temporal embeddings,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5664–5673.
[57] J. Zhang, Z. Zhu, W. Zou, P. Li, Y. Li, H. Su, and G. Huang, “FastPose: Towards real-time pose estimation and tracking via scale-normalized multi-task networks,” arXiv preprint arXiv:1908.05593, 2019.
[58] B. Gao and L. Pavel, “On the properties of the softmax function with application in game theory and reinforcement learning,” arXiv preprint arXiv:1704.00805, 2017.
[59] H. Gouk, E. Frank, B. Pfahringer, and M. J. Cree, “Regularisation of neural networks by enforcing Lipschitz continuity,” Machine Learning, vol. 110, no. 2, pp. 393–416, 2021.
[60] J. Li, S. Bian, A. Zeng, C. Wang, B. Pang, W. Liu, and C. Lu, “Human pose regression with residual log-likelihood estimation,” in ICCV, 2021.
[61] C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, “300 Faces In-The-Wild Challenge: Database and results,” Image and Vision Computing, vol. 47, pp. 3–18, 2016.
[62] C. Zimmermann, D. Ceylan, J. Yang, B. Russell, M. Argus, and T. Brox, “FreiHAND: A dataset for markerless capture of hand pose and shape from single RGB images,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 813–822.
[63] G. Moon, S.-I. Yu, H. Wen, T. Shiratori, and K. M. Lee, “InterHand2.6M: A dataset and baseline for 3D interacting hand pose estimation from a single RGB image,” arXiv preprint arXiv:2008.09309, 2020.
[64] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
[65] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy, “Towards accurate multi-person pose estimation in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4903–4911.
[66] X. Burgos-Artizzu, D. Hall, P. Perona, and P. Dollár, “Merging pose estimates across space and time,” in British Machine Vision Conference (BMVC), 2013.
[67] Z. Wang, L. Zheng, Y. Liu, Y. Li, and S. Wang, “Towards real-time multi-object tracking,” arXiv preprint arXiv:1909.12605, 2019.
[68] M. Andriluka, U. Iqbal, E. Insafutdinov, L. Pishchulin, A. Milan, J. Gall, and B. Schiele, “PoseTrack: A benchmark for human pose estimation and tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5167–5176.
[69] A. Kendall, Y. Gal, and R. Cipolla, “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7482–7491.
[70] Y. Wang, C. Peng, and Y. Liu, “Mask-pose cascaded CNN for 2D hand pose estimation from single color image,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 11, pp. 3258–3268, 2018.
[71] F. Gomez-Donoso, S. Orts-Escolano, and M. Cazorla, “Large-scale multiview 3D hand pose dataset,” Image and Vision Computing, vol. 81, pp. 25–33, 2019.
[72] W. Wu, C. Qian, S. Yang, Q. Wang, Y. Cai, and Q. Zhou, “Look at boundary: A boundary-aware face alignment algorithm,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2129–2138.
[73] M. Koestinger, P. Wohlhart, P. M. Roth, and H. Bischof, “Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization,” in 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops). IEEE, 2011, pp. 2144–2151.
[74] X. P. Burgos-Artizzu, P. Perona, and P. Dollár, “Robust face landmark estimation under occlusion,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1513–1520.
[75] M. Tan, R. Pang, and Q. V. Le, “EfficientDet: Scalable and efficient object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10781–10790.
[76] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, “Understanding convolution for semantic segmentation,” in WACV, 2018.
[77] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in CVPR, 2016.
[78] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “Deformable convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 764–773.
[79] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “PyTorch: An imperative style, high-performance deep learning library,” Advances in Neural Information Processing Systems, vol. 32, pp. 8026–8037, 2019.
[80] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems,” arXiv preprint arXiv:1512.01274, 2015.
[81] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, “YOLOX: Exceeding YOLO series in 2021,” arXiv preprint arXiv:2107.08430, 2021.
[82] B. Pang, Y. Li, Y. Zhang, M. Li, and C. Lu, “TubeTK: Adopting tubes to track multi-object in a one-step training model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6308–6318.
[83] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis, “Soft-NMS: Improving object detection with one line of code,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5561–5569.
[84] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li, “Face alignment across large poses: A 3D solution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 146–155.
[85] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler, “MOT16: A benchmark for multi-object tracking,” arXiv preprint arXiv:1603.00831, 2016.
[86] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[87] A. Doering, U. Iqbal, and J. Gall, “JointFlow: Temporal flow fields for multi person tracking,” arXiv preprint arXiv:1805.04596, 2018.
[88] J. Hwang, J. Lee, S. Park, and N. Kwak, “Pose estimator and tracker using temporal flow maps for limbs,” in 2019 International Joint Conference on Neural Networks (IJCNN). IEEE, 2019, pp. 1–8.
[89] G. Ning, P. Liu, X. Fan, and C. Zhang, “A top-down approach to articulated human pose estimation and tracking,” in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.