AlphaPose: Whole-Body Regional Multi-Person Pose Estimation and Tracking in Real-Time
Abstract—Accurate whole-body multi-person pose estimation and tracking is an important yet challenging topic in computer vision. To
capture the subtle actions of humans for complex behavior analysis, whole-body pose estimation including the face, body, hand and
foot is essential over conventional body-only pose estimation. In this paper, we present AlphaPose, a system that can perform accurate
whole-body pose estimation and tracking jointly while running in realtime. To this end, we propose several new techniques: Symmetric
Integral Keypoint Regression (SIKR) for fast and fine localization, Parametric Pose Non-Maximum-Suppression (P-NMS) for eliminating
redundant human detections, and Pose Aware Identity Embedding for joint pose estimation and tracking. During training, we resort to
Part-Guided Proposal Generator (PGPG) and multi-domain knowledge distillation to further improve the accuracy. Our method is able
to localize whole-body keypoints accurately and track humans simultaneously given inaccurate bounding boxes and redundant
detections. We show a significant improvement over current state-of-the-art methods in both speed and accuracy on
COCO-wholebody, COCO, PoseTrack, and our proposed Halpe-FullBody pose estimation dataset. Our model, source codes and
dataset are made publicly available at https://github.com/MVIG-SJTU/AlphaPose.
Index Terms—human pose estimation, pose tracking, whole-body pose estimation, hand pose estimation, realtime, multi-person
1 INTRODUCTION
body, face and hands simultaneously, such a representation is unsuitable since it is incapable of handling the large scale variation across different body parts. A major problem is the so-called quantization error. As illustrated in Fig. 1, since the heatmap representation is discrete, both of the adjacent grids on the heatmap may miss the correct position. This is not a problem for body pose estimation, since the correct areas are usually large. However, for fine-level keypoints on hands and face, it is easy to miss the correct position.

To solve this problem, previous methods either adopt additional sub-networks for hand and face estimation [17], or adopt ROI-Align to enlarge the feature map [18]. However, both methods are computationally expensive, especially in the multi-person scenario. In this paper, we propose a novel symmetric integral keypoint regression method that can localize keypoints at different scales accurately. It is the first regression method that achieves accuracy on par with the heatmap representation while eliminating the quantization error.

Another problem for full-body pose estimation is the lack of training data. Unlike the frequently studied body pose estimation with abundant datasets [14], [19], there is only one dataset [18] for full-body pose estimation. To promote development in this area, we annotate a new dataset named Halpe for this task, which includes extra essential joints not available in [18]. To further improve the generality of the top-down framework for full-body pose estimation in the wild, two key components are introduced. We adopt Multi-Domain Knowledge Distillation to incorporate training data from separate body-part datasets. To alleviate the domain gap between different datasets and the imperfect detection problem, we propose a novel part-guided human proposal generator (PGPG) to augment training samples. By learning the output distribution of a human detector for different poses, we can simulate the generation of human bounding boxes, producing a large sample of training data.

At last, we introduce a pose-aware identity embedding to enable simultaneous human pose tracking within our top-down framework. A person re-ID branch is attached to the pose estimator, and we perform pose estimation and human identification jointly. With the aid of pose-guided region attention, our pose estimator is able to identify humans accurately. Such a design allows us to achieve realtime pose estimation and tracking in a unified manner.

This manuscript extends our preliminary work published at the conference ICCV 2017 [20] along the following aspects:

• We extend our framework to the full-body pose estimation scenario and propose a new symmetric integral keypoint localization network for fine-level localization.
• We extend our pose-guided proposal generator to incorporate multi-domain knowledge distillation on different body-part datasets.
• We annotate a new whole-body pose estimation benchmark (136 points for each person) and make comparisons with previous methods.
• We propose the pose-aware identity embedding that enables pose tracking in our top-down framework in a unified manner.
• This work documents the release of AlphaPose, which achieves both accurate and realtime performance. Our library has facilitated many researchers and has been starred over 6,000 times on GitHub.

2 RELATED WORK

In this section, we first briefly review papers on multi-person pose estimation, which provides background knowledge of human pose estimation. In Sec. 2.2 we review related works on multi-person whole-body pose estimation and discuss a key issue in the current literature. In Sec. 2.3 we review integral regression based keypoint localization and clarify our improvements over previous works. In Sec. 2.4 we review pose tracking and summarize the connections and differences between previous works and our method.

2.1 Multi-Person Pose Estimation

Bottom-up Approaches Bottom-up approaches were also called part-based approaches in the early literature. These approaches first detect all possible body parts in an image and then group them into individual skeletons. Representative works include [10], [11], [12], [13], [21], [22], [23]. Chen et al. [11] present an approach to parse largely occluded people with a graphical model that models humans as flexible compositions of body parts. Gkioxari et al. [10] use k-poselets to jointly detect people and predict locations of human poses by a weighted average of all activated poselets. Pishchulin et al. [12] propose DeepCut to first detect all body parts, and then label, filter and assemble these parts via integer linear programming. A stronger part detector based on ResNet [24] and a better incremental optimization strategy is proposed by Insafutdinov et al. [13], named DeeperCut. OpenPose [17], [25] introduces Part Affinity Fields (PAFs) to encode association scores between body parts and individuals, and solves the matching problem by decomposing it into a set of bipartite matching subproblems. Newell et al. [26] learn an identification tag for each detected part to indicate which individual it belongs to, named associative embedding. Cheng et al. [27] use a powerful multi-resolution network [28] as backbone and high-resolution feature pyramids to learn scale-aware representations. OpenPifPaf [22], [23] proposes a Part Intensity Field (PIF) and a Part Association Field (PAF) to localize and associate body parts respectively.

While bottom-up approaches have demonstrated good performance, their body-part detectors can be vulnerable since only small local regions are considered, and they face the scale variation challenge when there are small persons in the image.
Fig. 2. Illustration of our full-body pose estimation and tracking framework. Given an input image, we first obtain (i) human detections using off-the-shelf object detectors like YOLOv3 or EfficientDet. For each detected human, we crop and resize it and forward it through the pose estimation and tracking networks to obtain the full-body human pose and Re-ID features. The backbones of these two networks can either be separated for adaptation to different pose configurations, or share the same weights for fast inference (thus misaligned in the figure). The (a) symmetric integral regression is adopted for fine-level keypoint localization. We adopt (b) pose NMS to eliminate redundant poses. The (c) pose-guided alignment (PGA) module is applied on the predicted human re-ID feature to obtain pose-aligned human re-ID features. The (d) multi-stage identity matching (MSIM) utilizes the human poses, re-ID features and detected boxes to produce the final tracking identity. During training, (e) the proposal generator and knowledge distillation are adopted to improve the generalization ability of the networks.
Top-down Approaches Our work follows the top-down paradigm like others [9], [20], [28], [29], [30], [31], which first obtains the bounding box for each human body through an object detector and then performs single-person pose estimation on the cropped image. Fang et al. [20] propose a symmetric spatial transformer network to solve the problem of imperfect bounding boxes with huge noise given by the human body detector. Mask R-CNN [29] extends Faster R-CNN [32] by adding a pose estimation branch in parallel with the existing bounding box recognition branch after ROIAlign, enabling end-to-end training. PandaNet [33] proposes an anchor-based method to predict multi-person 3D pose estimation in a single-shot manner and achieves high efficiency. Chen et al. [30] use a feature pyramid network to localize simple joints and a refining network which integrates features of all levels from the previous network to handle hard joints. A simple-structured network [31] with ResNet [24] as backbone and a few deconvolutional layers as upsampling head shows effective and competitive results. Sun et al. [28] present a powerful high-resolution network, where a high-resolution subnetwork is established in the first stage, and high-to-low resolution subnetworks are added one by one in parallel in subsequent stages, conducting repeated multi-scale feature fusions. Bertasius et al. [34] extend from images to videos and propose a method for learning pose warping on sparsely labeled videos.

Although state-of-the-art top-down approaches achieve remarkable precision on popular large-scale benchmarks, the two-step paradigm makes them slow in inference compared with the bottom-up approaches. In addition, the lack of a library-level framework implementation hinders them from being applied in industry. Thus we present AlphaPose in this paper, in which we develop a multi-stage pipeline to simultaneously process the time-consuming steps and enable fast inference.

One-stage Approaches Some approaches need neither post joint grouping nor human body bounding boxes detected in advance. They locate human bodies and detect their joints simultaneously to improve the low efficiency of two-stage approaches. Representative works include CenterNet [35], SPM [36], DirectPose [37], and Point-set Anchor [38]. However, these approaches do not achieve precision as high as top-down approaches, partly because the body center map and dense joint displacement maps are high-semantic nonlinear representations and make it difficult for the networks to learn.

2.2 Whole-Body Keypoint Localization

Unified detection of body, face, hand and foot keypoints for multiple persons is a relatively new research topic and few methods have been proposed. OpenPose [17] developed a cascaded method. It first detects body keypoints using PAFs [25] and then adopts two separate networks to estimate face landmarks and hand keypoints. Such a design makes it time-inefficient and consumes extra computation resources. Hidalgo et al. [39] propose a single network to estimate the whole-body keypoints. However, due to its one-step mechanism, the output resolution is limited, which decreases its performance on fine-level keypoints such as those of faces and hands. Jin et al. [18] propose ZoomNet, which uses ROIAlign to crop the hand and face regions on the feature maps and predicts keypoints on the resized feature maps. All these methods adopt the heatmap representation for keypoint localization due to its dominant performance on body keypoints. However, the aforementioned quantization problem of heatmaps decreases the accuracy of face and hand keypoints. The requirement of large-size input also consumes more computation resources. In this paper, we argue that the soft-argmax representation is more suitable for whole-body pose estimation and propose an improved version of soft-argmax that yields higher accuracy. Jin et al. [18] also extended the COCO dataset to the whole-body scenario. However, some joints like head and neck, which are essential in tasks like mesh reconstruction, are not present in this dataset. Meanwhile, the face annotation is incompatible with that in 300Wface. In this paper, we contribute a new in-the-wild multi-person whole-body pose estimation benchmark. We annotate 40K images from HICO-DET [40]
with a Creative Commons license (https://creativecommons.org/licenses/) as the training set and extend the COCO keypoints validation set (6K instances) as our test set. Experiments on this benchmark and COCO-WholeBody demonstrate the superiority of our method.

2.3 Integral Keypoints Localization

Heatmap is a dominant representation for joint localization in the field of human pose estimation. The read-out locations from heatmaps are discrete numbers since heatmaps only describe the likelihood of joints occurring in each spatial grid, which leads to inevitable quantization error. As noted in Sec. 2.2, we argue that soft-argmax based integral regression is more suitable for whole-body keypoint localization. Several previous works have studied the soft-argmax operation to read continuous joint locations from heatmaps [41], [42], [43], [44], [45], [46]. Specifically, Luvizon et al. [43], [46] and Sun et al. [45] apply the soft-argmax operation to single-person 2D/3D pose estimation successfully. However, two drawbacks exist in these works that decrease their accuracy in pose estimation. We summarize the drawbacks as the asymmetric gradient problem and the size-dependent keypoint scoring problem. Details of these problems are provided in Sec. 3.1, as well as our proposed new gradient design and keypoint scoring method. By solving these problems, we provide a new keypoint regression method with higher accuracy, and it shows good performance in both whole-body pose estimation and body-only pose estimation.

2.4 Multi-Person Pose Tracking

Multi-person pose tracking extends multi-person pose estimation to videos, giving each predicted keypoint the corresponding identity over time. Similar to the pose estimation literature, it can be divided into two categories: top-down [31], [47], [48], [49], [50], [51], [52], [53], [54] and bottom-up [23], [55], [56]. Based on bottom-up pose estimation methods, [55], [56] use the detected keypoints to build temporal and spatial graphs which aim to link the corresponding individual bodies by solving an optimization problem. However, the prerequisite of temporal and spatial graphs prevents graph-cut optimization from running in an online manner, which makes them quite time-consuming and memory-inefficient. [49] utilizes a 3D Mask R-CNN to estimate person tubes and poses simultaneously. [50] proposes a forward and backward bounding box propagation strategy to eliminate the issue of missed detections. The input of these methods is a whole video sequence, which cannot achieve online tracking. Some other top-down methods allow a single frame as input, and then use a designed pose flow [47], GCN [48], [52], optical flow [31] or transformer [54] for identity matching. Yang et al. [53] predict current poses given historical pose sequences and merge them with the pose estimation results from the current frame. A drawback of these methods is that they rely on the spatial continuity of the poses only, which may not be satisfied when the online image stream is unstable or humans are moving rapidly. Specifically, [57] proposes to use re-ID features to tackle the tracking problem. Our tracking method also explicitly adopts human re-ID features to solve this problem. Compared with [57], we design a pose-guided re-ID feature extraction to avoid potential background noise. Moreover, we design a multi-stage information merging method to utilize the boxes, poses, and re-ID features simultaneously.

3 WHOLE-BODY MULTI-PERSON POSE ESTIMATION

The whole pipeline of our proposed method is illustrated in Figure 2. In this section, we introduce the details of our pose estimation, which is shown in the top row of Fig. 2.

3.1 Symmetric Integral Keypoints Regression

As mentioned in Sec. 2.3, there exist two problems in the conventional soft-argmax operation for keypoint regression. We illustrate them in the following sections and propose our novel solution.

3.1.1 Asymmetric gradient problem

The soft-argmax operation, also known as integral regression, is differentiable, which turns heatmap based approaches into regression based approaches and allows end-to-end training. The integral regression operation is defined as:

    \hat{\mu} = \sum_x x \cdot p_x,    (1)

where x is the coordinate of each pixel and p_x denotes the pixel likelihood on the heatmap after normalization. During training, the loss function is applied to minimize the \ell_1 norm between the predicted joint locations \hat{\mu} and the ground-truth locations \mu: \mathcal{L}_{reg} = \| \mu - \hat{\mu} \|_1. The gradient of each pixel can be formulated as:

    \frac{\partial \mathcal{L}_{reg}}{\partial p_x} = x \cdot \mathrm{sgn}(\hat{\mu} - \mu).    (2)

Notice that the gradient amplitude is asymmetric. The absolute value of the gradient is determined by the absolute position (i.e., x) of the pixel instead of the relative position to the ground truth. This means that given the same distance error, the gradient becomes different when the keypoint is located at a different position. This asymmetry breaks the translation invariance of the CNN, which leads to performance degradation.

Amplitude Symmetric Gradient To improve the learning efficiency, we propose an amplitude symmetric gradient (ASG) function in backward propagation, which is an approximation to the true gradient:

    \delta_{ASG} = A_{grad} \cdot \mathrm{sgn}(x - \hat{\mu}) \cdot \mathrm{sgn}(\hat{\mu} - \mu),    (3)

where A_{grad} denotes the amplitude of the gradients. It is a constant that we manually set as 1/8 of the heatmap size, and we give the derivation in the next paragraph. Using our symmetric gradient, the gradient distribution is centred at the predicted joint location \hat{\mu}. In the learning process, this symmetric gradient distribution can better utilize the advantage of heatmaps and approximate the ground-truth locations in a more direct manner. For example, assume the predicted location \hat{\mu} is higher than the ground truth \mu. On one hand, the network tends to suppress the heatmap values on the right side of \hat{\mu}, because they have positive gradients;
on the other hand, the heatmap values on the left side of \hat{\mu} will be activated because of their negative gradients.

Stable gradients for ASG Here, we conduct a Lipschitz analysis to derive the value of A_{grad} and show that ASG can provide a more stable gradient for training. Recall that f denotes the objective function that we want to minimize. We say that f is L-smooth if:

    \| \nabla_\theta f(\theta + \Delta\theta) - \nabla_\theta f(\theta) \| \le L \| \Delta\theta \|,    (4)

where \theta denotes the network parameters and \nabla denotes the gradient. The objective function can be re-written as:

    \nabla_\theta f = \nabla_\theta \mathcal{L}(\mu, h(z)) = \nabla_z \mathcal{L}(\mu, h(z)) \, \nabla_\theta z,    (5)

where z denotes the logits that are predicted by the network, and \hat{\mu} = h(z) denotes the composition of the normalization and soft-argmax functions. Here, we assume the gradient of the network is smooth and only analyze the composition function, i.e.:

    \| \nabla_z \mathcal{L}(\mu, h(z + \Delta z)) - \nabla_z \mathcal{L}(\mu, h(z)) \|.    (6)

In the conventional integral regression, we have:

    \nabla_z \mathcal{L}(\mu, h(z)) = (x - \hat{\mu}) \cdot p_x.    (7)

In this case, Eq. 6 is equivalent to:

    \| (x - \hat{\mu} - \Delta\hat{\mu})(p_x + \Delta p_x) - (x - \hat{\mu}) \cdot p_x \|.    (8)

Note that x can be an arbitrary position on the heatmap. Denoting the heatmap size as W, we have \| x - \hat{\mu} \| \le W over the whole dataset. Therefore, we derive the Lipschitz constant of integral regression as:

    \| \nabla_z \mathcal{L}(\mu, h(z + \Delta z)) - \nabla_z \mathcal{L}(\mu, h(z)) \| \le \| W (p_x + \Delta p_x) - W p_x \| = W \| \Delta p_x \| = W \cdot L_s \cdot \| \Delta z \|,    (9)

where L_s is the Lipschitz constant of the normalization function [58], [59]. It shows that the conventional integral regression multiplies a factor W onto the Lipschitz constant of the normalization.

Similarly, we can derive the Lipschitz constant of the proposed amplitude symmetric function. Firstly, the gradient of the logits is:

    | \nabla_z \mathcal{L}(\mu, h(z)) | = \Big| A_{grad} \cdot p_x \cdot \Big(1 + \sum_{x_i < \hat{\mu}} p_{x_i} - \sum_{x_i > \hat{\mu}} p_{x_i}\Big) \Big| \le 2 \cdot A_{grad} \cdot p_x.    (10)

We set A_{grad} = W/8 to make the average norm of the gradient the same as integral regression. Specifically,

    \mathbb{E}_x\big[ |(x - \hat{\mu}) p_x| \big] = \mathbb{E}_x\big[ |x - \hat{\mu}| \big] \, p_x = \frac{W}{4} \cdot p_x.    (11)

The Lipschitz constant of the proposed amplitude symmetric function is derived as:

    \| \nabla_z \mathcal{L}(\mu, h(z + \Delta z)) - \nabla_z \mathcal{L}(\mu, h(z)) \| \le \| 2 A_{grad} (p_x + \Delta p_x) - 2 A_{grad} p_x \| = \frac{W}{4} \| \Delta p_x \| = \frac{W}{4} \cdot L_s \cdot \| \Delta z \|.    (12)

It shows that the Lipschitz constant of the proposed method is 4 times smaller than that of the original integral regression when A_{grad} = W/8, which indicates that the gradient space is smoother and the model can be optimized more easily.
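To make the forward/backward asymmetry concrete, below is a minimal, hedged sketch (not the released AlphaPose implementation) of integral regression over 1D heatmaps with the ASG backward pass of Eq. (3), written as a PyTorch autograd function. The 1D simplification, shapes, and target values are our own illustrative assumptions; an L1 loss upstream supplies the sgn(µ̂ − µ) factor.

```python
# Sketch only: 1D integral regression (Eq. 1) whose backward pass follows
# the amplitude-symmetric gradient of Eq. (3) instead of Eq. (2).
import torch

class SymmetricIntegral1D(torch.autograd.Function):
    @staticmethod
    def forward(ctx, p):
        # p: (N, W) normalized heatmaps, each row summing to one
        x = torch.arange(p.shape[-1], dtype=p.dtype, device=p.device)
        mu_hat = (p * x).sum(dim=-1)              # Eq. (1): soft-argmax
        ctx.save_for_backward(x, mu_hat)
        return mu_hat

    @staticmethod
    def backward(ctx, grad_out):
        # grad_out is dL/d(mu_hat); with an L1 loss it carries sgn(mu_hat - mu),
        # so multiplying by A_grad * sgn(x - mu_hat) reproduces Eq. (3).
        x, mu_hat = ctx.saved_tensors
        a_grad = x.numel() / 8.0                  # A_grad = W / 8
        sym = a_grad * torch.sign(x[None, :] - mu_hat[:, None])
        return grad_out[:, None] * sym

# Illustrative use with an L1 loss on hypothetical targets:
p = torch.softmax(torch.randn(4, 64), dim=-1).requires_grad_()
target = torch.tensor([10.0, 20.0, 30.0, 40.0])
loss = (SymmetricIntegral1D.apply(p) - target).abs().mean()
loss.backward()  # p.grad now holds the amplitude-symmetric gradients
```

Unlike the true gradient of Eq. (2), every pixel in this sketch receives a gradient whose magnitude depends only on which side of µ̂ it lies, not on its absolute coordinate.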
3.1.2 Size-dependent Keypoint Scoring Problem

Before conducting soft-argmax, the element-sum of the predicted heatmaps should be normalized to one, i.e., \sum p_x = 1. Prior works [45], [46] adopt the soft-max operation, which works well in single-person pose estimation but leaves a large performance gap with the state-of-the-art in multi-person pose estimation [31], [45], [60]. This is because in multi-person cases, we need not only the joint locations, but also the joint confidences for pose NMS and for calculating the mAP. In previous methods, the maximum value of the heatmap is taken as the joint confidence, which is size-dependent and not accurate.

If we adopt a one-step normalization such as soft-max, the maximum value of the heatmap is inversely proportional to the scale of the distribution, which highly depends on the projected size of the body joint. Therefore, a large-size joint (e.g., left hip) will generate a smaller confidence value than a small-size joint (e.g., nose), which harms the reliability of the predicted confidence values.

Two-step Heatmap Normalization To decouple confidence prediction and integral regression, we propose a two-step heatmap normalization. In the first step, we perform element-wise normalization to generate the confidence heatmap C:

    c_x = \mathrm{sigmoid}(z_x),    (13)

where z_x denotes the un-normalized logit value at location x, and c_x denotes the confidence heatmap value at location x. Hence, the joint confidence can be indicated by the maximum value of the heatmap:

    \mathrm{conf} = \max(\mathbf{C}).    (14)

Since we use the element-wise operation sigmoid for the first step of normalization and do not force the sum of C to be one, the maximum value of C is not affected by the size of the joint. In this way, the predicted joint confidence is only related to the predicted location. In the second step, we perform global normalization to generate the probability heatmap P:

    p_x = \frac{c_x}{\sum \mathbf{C}}.    (15)

The element-sum of the probability heatmap P is one, which ensures the predicted joint location \hat{\mu} is within the heatmap boundary and stabilizes the training process.

To sum up, we obtain the joint confidence through the first step and obtain the joint location on the heatmap generated by the second step. An ablation study is carried out in Sec. 6.6 to show the effectiveness of our normalization method.
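A minimal sketch of the two-step normalization follows, assuming `logits` is the (K, H, W) un-normalized heatmap tensor produced by the network; the epsilon guard is our addition for numerical safety.

```python
# Sketch only: two-step heatmap normalization (Eqs. 13-15) followed by the
# integral regression read-out (Eq. 1) applied per spatial axis.
import torch

def two_step_normalize(logits, eps=1e-8):
    # Step 1 (Eq. 13): element-wise sigmoid gives the confidence heatmap C;
    # its per-joint maximum (Eq. 14) is the size-independent confidence.
    C = torch.sigmoid(logits)                     # (K, H, W)
    conf = C.flatten(1).max(dim=1).values         # (K,)
    # Step 2 (Eq. 15): global normalization gives the probability heatmap P,
    # whose element-sum is one, as required by the soft-argmax of Eq. (1).
    P = C / C.flatten(1).sum(dim=1).clamp_min(eps)[:, None, None]
    return conf, P

def soft_argmax_2d(P):
    K, H, W = P.shape
    ys = torch.arange(H, dtype=P.dtype)
    xs = torch.arange(W, dtype=P.dtype)
    mu_y = (P.sum(dim=2) * ys).sum(dim=1)         # marginal over rows
    mu_x = (P.sum(dim=1) * xs).sum(dim=1)         # marginal over columns
    return torch.stack([mu_x, mu_y], dim=1)       # (K, 2) joint locations
```

Because the confidence comes from the sigmoid step rather than the globally normalized step, it no longer shrinks for joints with large spatial extent.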
3.2 Multi-Domain Knowledge Distillation

Beyond our novel symmetric integral regression, the performance of the network can further benefit from extra training data. Besides annotating a new dataset (detailed in Sec. 6.1), we also adopt multi-domain knowledge distillation
close to it are subject to elimination by applying the elimination criterion. This process is repeated on the remaining pose set until the redundant poses are eliminated and only unique poses are reported.

Elimination Criterion We need to define pose similarity in order to eliminate the poses which are too close and too similar to each other. We define a pose distance metric d(P_i, P_j | \Lambda) to measure the pose similarity, and a threshold \eta as the elimination criterion, where \Lambda is the parameter set of the function d(\cdot). Our elimination criterion can be written as follows:

    f(P_i, P_j | \Lambda, \eta) = \mathbb{1}\big[ d(P_i, P_j | \Lambda, \lambda) \le \eta \big].    (16)

If d(\cdot) is smaller than \eta, the output of f(\cdot) should be 1, which indicates that pose P_i should be eliminated due to redundancy with the reference pose P_j.

Pose Distance Now, we present the distance function d_{pose}(P_i, P_j). We assume that the box for P_i is B_i. Then we define a soft matching function

    K_{Sim}(P_i, P_j | \sigma_1) = \begin{cases} \sum_n \tanh\frac{c_i^n}{\sigma_1} \cdot \tanh\frac{c_j^n}{\sigma_1}, & \text{if } k_j^n \text{ is within } \mathcal{B}(k_i^n) \\ 0, & \text{otherwise,} \end{cases}    (17)

where \mathcal{B}(k_i^n) is a box centered at k_i^n, and each dimension of \mathcal{B}(k_i^n) is 1/10 of the original box B_i. The tanh operation filters out poses with low confidence scores. When two corresponding joints both have high confidence scores, the output will be close to 1. This distance softly counts the number of joints matched between poses.

The spatial distance between parts is also considered, which can be written as

    H_{Sim}(P_i, P_j | \sigma_2) = \sum_n \exp\Big[ -\frac{(k_i^n - k_j^n)^2}{\sigma_2} \Big].    (18)

By combining Eqs. 17 and 18, the final distance function can be written as

    d(P_i, P_j | \Lambda) = K_{Sim}(P_i, P_j | \sigma_1) + \lambda H_{Sim}(P_i, P_j | \sigma_2),    (19)

where \lambda is a weight balancing the two distances and \Lambda = \{\sigma_1, \sigma_2, \lambda\}. Note that the previous pose NMS [11] set the pose distance parameters and thresholds manually. In contrast, our parameters can be determined in a data-driven manner.

Optimization Given the detected redundant poses, the four parameters in the elimination criterion f(P_i, P_j | \Lambda, \eta) are optimized to achieve the maximal mAP on the validation set. Since an exhaustive search in a 4D space is intractable, we optimize two parameters at a time by fixing the other two parameters in an iterative manner. Once convergence is achieved, the parameters are fixed and will be used in the testing phase.
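The following is a hedged sketch of the elimination loop and the distance of Eqs. (16)-(19). The parameter values are placeholders (the paper learns Λ and η in a data-driven way), and note that d(·) as written is a similarity score (larger for more alike poses), so the sketch eliminates candidates that are too similar to the kept reference, matching the stated intent of redundancy removal.

```python
# Sketch only: parametric pose NMS following Eqs. (16)-(19).
import numpy as np

def pose_distance(ki, ci, kj, cj, box_i, sigma1=0.3, sigma2=30.0, lam=1.0):
    # ki, kj: (n, 2) joint coordinates; ci, cj: (n,) joint confidences;
    # box_i: (x, y, w, h) of the reference pose.
    bw, bh = box_i[2] / 10.0, box_i[3] / 10.0      # B(k_i^n): 1/10 of box B_i
    diff = np.abs(kj - ki)
    inside = (diff[:, 0] <= bw) & (diff[:, 1] <= bh)
    k_sim = np.where(inside,
                     np.tanh(ci / sigma1) * np.tanh(cj / sigma1), 0.0).sum()  # Eq. 17
    h_sim = np.exp(-((ki - kj) ** 2).sum(axis=1) / sigma2).sum()              # Eq. 18
    return k_sim + lam * h_sim                                                # Eq. 19

def pose_nms(poses, joint_confs, pose_scores, boxes, eta=2.0):
    order = list(np.argsort(-np.asarray(pose_scores)))   # most confident first
    keep = []
    while order:
        ref = order.pop(0)
        keep.append(ref)
        # Keep only candidates that are NOT redundant with the reference;
        # candidates too similar to it are eliminated (Eq. 16).
        order = [p for p in order
                 if pose_distance(poses[ref], joint_confs[ref],
                                  poses[p], joint_confs[p], boxes[ref]) < eta]
    return keep
```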
4 MULTI-PERSON POSE TRACKING

In this section, we introduce our multi-person pose tracking method shown in the middle row of Fig. 2. We attach a person re-ID branch to the pose estimator. Thus the network can estimate the human pose and the re-ID feature simultaneously. A Pose-Guided Attention Mechanism (PGA) is adopted to enhance the person identity feature. Finally, the human proposal information (identity embedding, box and pose) is integrated by our designed Multi-Stage Identity Matching (MSIM) algorithm to achieve online realtime pose tracking.

4.1 Pose-Guided Attention Mechanism

The person re-ID feature can be used to identify the same individual among many human proposals. In our top-down framework, we extract the re-ID feature from each bounding box produced by the object detector. However, the quality of the re-ID feature will be reduced by the background in the bounding box, especially when other people's bodies are present. In order to solve this problem, we consider using the predicted human pose to construct a region where the human body is concentrated. Thus, the Pose-Guided Attention (PGA) is proposed to force the extracted features to focus on the human body of interest and to ignore the impact of the background. The insight of PGA is elaborated in the ablation studies (Sec. 6.8).

The pose estimator generates k heatmaps, where k is the number of keypoints for each person. The PGA module then transforms these heatmaps into an attention map m_A with a simple conv layer. Note that m_A has the same size as the re-ID feature map m_{id}. Therefore, we can obtain the weighted re-ID feature map m_{wid}:

    m_{wid} = m_{id} \odot m_A + m_{id},    (20)

where \odot denotes the Hadamard product.

Finally, the identity embedding emb_{id}, which is a 128-dimensional vector, is encoded by a fully-connected layer.
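A minimal sketch of the PGA module around Eq. (20) is given below; the sigmoid on the attention map, the global average pooling, and the layer sizes are our assumptions, since the text only specifies a simple conv layer and a 128-dimensional fully-connected embedding.

```python
# Sketch only: pose-guided attention (Eq. 20) producing the 128-d identity
# embedding from a re-ID feature map and the predicted keypoint heatmaps.
import torch
import torch.nn as nn

class PoseGuidedAttention(nn.Module):
    def __init__(self, num_joints=136, feat_dim=256, embed_dim=128):
        super().__init__()
        self.to_attn = nn.Conv2d(num_joints, 1, kernel_size=1)  # heatmaps -> m_A
        self.fc = nn.Linear(feat_dim, embed_dim)                # emb_id encoder

    def forward(self, m_id, heatmaps):
        # m_id: (N, C, H, W) re-ID feature map; heatmaps: (N, K, H, W),
        # assumed resized to the same spatial size as m_id.
        m_a = torch.sigmoid(self.to_attn(heatmaps))   # attention map m_A
        m_wid = m_id * m_a + m_id                     # Eq. (20): Hadamard + skip
        pooled = m_wid.mean(dim=(2, 3))               # (N, C) global average pool
        return self.fc(pooled)                        # (N, 128) identity embedding
```

The residual term in Eq. (20) keeps the original features available even where the attention map is low, so the attention reweights rather than masks the re-ID features.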
4.2 Multi-Stage Identity Matching

For a video sequence, let H_t^i denote the i-th human proposal of the t-th frame. As described above, H_t^i has several features: pose (P_t^i), bbox (B_t^i) and identity embedding (E_t^i). Considering that all these features can determine the identity of a person, we design the MSIM algorithm to assign the corresponding id to H_t^i. Assume that the detection and tracking results of the previous t-1 frames have been obtained and stored in the tracking pool Pl. First, a Kalman filter is used to finetune the detection features in the current frame, thus making the trajectories smoother. Then we perform the first matching stage by computing the affinity matrix M_{emb} between the identity embeddings of the t-th frame and all embeddings existing in Pl. The matching rules are as follows:

    \begin{cases} \mathrm{link}(p, q), & \text{if } M_{emb}^t[p][q] = \min(M_{emb}^t[p]) \text{ and } M_{emb}^t[p][q] \le \mu_{emb} \\ H_t^p \text{ keeps untracked}, & \text{otherwise,} \end{cases}    (21)

where link(p, q) means H_t^p shares the same trajectory with the q-th human proposal in Pl. \mu_{emb} is the threshold. Here we set \mu_{emb} as 0.7 following [67].

At the second stage, we consider both position and shape constraints for those human proposals left untracked in Eq. 21. Specifically, we use the IoU metric between bboxes as the position constraint and the normalized pose distance as the shape constraint. For two human proposals H_t^i and H_{t-\delta}^j, we first resize their bboxes to the same scale and get the center point c
[Figure: the AlphaPose library pipeline. Image, video, or camera inputs are fed to a human detector followed by box NMS; detected regions are cropped and resized for the pose estimator, whose outputs pass through pose NMS and the re-ID tracker for pose association; rendering and file saving run as separate modules, with the stages connected by FIFO queues.]
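To illustrate how such a queue-connected design overlaps the time-consuming stages, here is a hedged, generic sketch; `detect`, `estimate_pose`, and `track` are placeholder callables, not AlphaPose API functions.

```python
# Sketch only: a producer-consumer pipeline in the spirit of the figure,
# with stages linked by FIFO queues so detection, pose estimation, and
# tracking of different frames run concurrently.
import threading
import queue

def stage(fn, q_in, q_out):
    while True:
        item = q_in.get()
        if item is None:                  # sentinel: propagate shutdown
            if q_out is not None:
                q_out.put(None)
            return
        result = fn(item)
        if q_out is not None:
            q_out.put(result)

def run_pipeline(frames, detect, estimate_pose, track, maxsize=8):
    q_det, q_pose, q_track = (queue.Queue(maxsize) for _ in range(3))
    results = []
    workers = [
        threading.Thread(target=stage, args=(detect, q_det, q_pose)),
        threading.Thread(target=stage, args=(estimate_pose, q_pose, q_track)),
        threading.Thread(target=stage,
                         args=(lambda r: results.append(track(r)), q_track, None)),
    ]
    for w in workers:
        w.start()
    for f in frames:
        q_det.put(f)                      # feed frames to the first stage
    q_det.put(None)                       # end-of-stream sentinel
    for w in workers:
        w.join()
    return results
```

Bounded queues apply backpressure, so a slow stage throttles its producers instead of accumulating unbounded buffered frames.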
TABLE 1
Overview of some popular public datasets for 2D keypoint estimation in
RGB images. Kpt stands for keypoints, and #Kpt means the annotated
number. “Wild” denotes whether the dataset is collected in-the-wild.
“HOI” denotes human-object-interaction body-part labels.
TABLE 2
Whole-body pose estimation results on the Halpe-FullBody dataset. For fair comparison, results are obtained using single-scale testing. “OpenPose-default” and “OpenPose-maxacc” denote its default and maximum-accuracy configurations respectively. “hm” denotes that the network uses heatmap based localization; “si” denotes that the network uses our symmetric integral regression. “*” denotes a model trained with multi-domain knowledge distillation and PGPG. FastPose50 denotes our FastPose network with ResNet50 as backbone, and likewise for FastPose152. “dcn” denotes that deformable convolutional layers [78] are adopted in the ResNet backbone.
Type       Method             Input Size  GFLOPs  whole-body     body           foot           face           hand
                                                  AP     AR      AP     AR      AP     AR      AP     AR      AP     AR
Bottom-Up  OpenPose [17]      N/A         N/A     0.338  0.449   0.563  0.612   0.532  0.645   0.482  0.626   0.198  0.342
           SN [39]            N/A         N/A     0.161  0.209   0.280  0.336   0.121  0.277   0.382  0.440   0.138  0.336
           PAF [25]           N/A         N/A     0.141  0.185   0.266  0.328   0.100  0.257   0.309  0.362   0.133  0.321
           PAF-body [25]      N/A         N/A     -      -       0.409  0.470   -      -       -      -       -      -
           AE [26]            N/A         N/A     0.274  0.350   0.405  0.464   0.077  0.160   0.477  0.580   0.341  0.435
           AE-body [26]       N/A         N/A     -      -       0.582  0.634   -      -       -      -       -      -
Top-Down   HRNet [28]         384×288     16.0    0.432  0.520   0.659  0.709   0.314  0.424   0.523  0.582   0.300  0.363
           HRNet-body [28]    384×288     16.0    -      -       0.758  0.809   -      -       -      -       -      -
           ZoomNet            384×288     20.0    0.541  0.658   0.743  0.802   0.798  0.869   0.623  0.701   0.401  0.498
           FastPose50-si      256×192     5.9     0.554  0.625   0.673  0.717   0.636  0.718   0.757  0.818   0.425  0.515
           FastPose152-si     256×192     13.2    0.569  0.641   0.684  0.730   0.672  0.750   0.765  0.824   0.443  0.532
           FastPose50-dcn-si  256×192     6.1     0.577  0.650   0.693  0.740   0.690  0.765   0.759  0.820   0.453  0.538
TABLE 3
Whole-body pose estimation results on the COCO-WholeBody dataset. For fair comparison, results are obtained using single-scale testing. We only report the input size and GFLOPs of the pose model in top-down based approaches and ignore the detection model. “hm” denotes that the network uses heatmap based localization; “si” denotes that the network uses our symmetric integral regression. FastPose50 denotes our FastPose network with ResNet50 as backbone, and likewise for FastPose152. “dcn” denotes that deformable convolutional layers [78] are adopted in the ResNet backbone.
annotate the visible lower jaw of the face (green dots in Fig. 7) so as to be compatible with these two definitions. For the images, our training set uses the training images of the HICO-DET [40] dataset and our testing set uses the COCO-val set. In total, our dataset contains 50K instances for training and 5K images for testing. Tab. 1 compares our dataset with previous popular datasets on human pose estimation.

COCO-WholeBody As a concurrent work, Jin et al. annotate 133 whole-body keypoints based on the COCO dataset. They share a similar keypoint definition with us, except that the head, neck and hip points are missing in their annotation. The total training set contains 118K images with 250K instances, and the test set contains 5K images. We also evaluate our algorithm on this dataset.

COCO The COCO dataset is a standard benchmark for human keypoint prediction. It contains 17 keypoints of the human body without face, hand and foot annotations. In total, there are 118K images for training, 5K for validation and 41K for testing. We train our algorithm on the COCO 2017 train set and compare our FastPose network and symmetric integral loss with previous state-of-the-art models on the COCO 2017 test-dev set.

PoseTrack PoseTrack is a large-scale dataset for multi-person pose estimation and tracking. It is built on the raw videos provided by the MPII Human Pose dataset [15]. There are more than 1356 video sequences in PoseTrack, and they are split into train, val and test. Each annotated person has 17 keypoints similar to COCO, but two keypoints differ from COCO, namely 'top head' and 'bottom head'. The other annotations share the same format as COCO. We train our method on the PoseTrack-2018 set and compare it with previous methods on both the PoseTrack-2017-val and PoseTrack-2018-val sets.

300Wface, FreiHand and InterHand are used as supplemental datasets to improve the generalization ability of our model. 300Wface [61] contains 300 indoor and 300 outdoor in-the-wild images. For each face, 68 keypoints are annotated. FreiHand [62] contains 33K unique hand samples for training, each with 21 keypoints. InterHand [63] contains 2.6M images with interacting hands, where each hand also has 21 keypoints.
Type        Methods                Backbone             Detector     Input Size  GFLOPs  AP     AP50   AP75   APM    APL    AR
Detection   G-MRI [65]             ResNet-101           Faster-RCNN  353×257     57.0    0.649  0.855  0.713  0.623  0.700  0.697
            RMPE [20]              PyraNet              Faster-RCNN  320×256     26.7    0.723  0.892  0.791  0.680  0.786  -
            CPN [30]               ResNet-Inception     FPN          384×288     -       0.721  0.914  0.800  0.687  0.772  0.785
            PAF-body [25]          -                    -            -           -       0.618  0.849  0.675  0.571  0.682  0.665
            AE [26]                -                    -            -           -       0.655  0.868  0.723  0.606  0.726  0.702
            SimplePose [31]        ResNet-50            Faster-RCNN  256×192     8.9     0.702  0.909  0.783  0.671  0.759  0.758
            HRNet [28]             HRNet-32             Faster-RCNN  384×288     16.0    0.749  0.925  0.828  0.713  0.809  0.801
            HRNet [28]             HRNet-48             Faster-RCNN  384×288     32.9    0.755  0.925  0.833  0.719  0.815  0.805
            FastPose-hm            ResNet-50            YOLO-v3      256×192     5.9     0.718  0.919  0.803  0.728  0.742  0.773
            FastPose-dcn-hm        ResNet-50            YOLO-v3      256×192     6.1     0.726  0.922  0.812  0.737  0.749  0.781
            FastPose-dcn-hm        ResNet-101           YOLO-v3      256×192     9.8     0.727  0.922  0.813  0.736  0.751  0.781
Regression  Integral [45]          ResNet-101           Faster-RCNN  256×256     17.8    0.678  0.882  0.748  0.639  0.740  -
            CenterNet [35]         Hourglass-2 stacked  -            -           -       0.630  0.868  0.696  0.589  0.704  -
            SPM [36]               Hourglass-8 stacked  -            -           -       0.669  0.885  0.729  0.626  0.731  -
            Point-set Anchor [38]  HRNet-W48            -            -           -       0.687  0.899  0.763  0.648  0.753  -
            FastPose-si            ResNet-50            YOLO-v3      256×256     7.9     0.649  0.865  0.728  0.669  0.663  0.716
            FastPose-si            ResNet-101           YOLO-v3      256×256     12.8    0.679  0.876  0.751  0.675  0.714  0.723
            FastPose-dcn-si        ResNet-101           YOLO-v3      256×256     13.1    0.690  0.901  0.773  0.729  0.690  0.775
TABLE 4
Body pose estimation results on the COCO test-dev set. For fair comparison, results are obtained using single-scale testing. “hm” denotes that the network uses heatmap based localization; “si” denotes that the network uses our symmetric integral regression.
6.2 Evaluation Metrics and Tools

Halpe-FullBody We extend the evaluation metric of the COCO keypoints to the full-body scenario. COCO defines an Object Keypoint Similarity (OKS) controlled by a per-keypoint constant k. For our newly added keypoints, we set the k for the feet, face and hands to 0.015. As with COCO, we report AP at OKS thresholds 0.5:0.95:0.05 as the main result, and the detailed results for body, foot, face and hand are also reported.

COCO-WholeBody COCO-WholeBody adopts the same metric as ours, except that the constant k differs from ours for some keypoints.

COCO We adopt the standard AP metric of the COCO dataset for fair comparison with previous works.

PoseTrack In fact, multi-person pose tracking can be regarded as the combination of multi-person pose estimation and multi-object tracking. Thus, the evaluation metrics follow these two tasks. Mean Average Precision (mAP) [12] is used to measure frame-wise human pose accuracy. To evaluate the tracking performance, the MOT [85] metric is applied to each of the body joints independently. The final tracking performance is then obtained by averaging the MOT metric of each joint. The PCKh [15] (head-normalized probability of correct keypoint) is one of the most commonly used metrics to evaluate whether a body joint is predicted correctly. Here it determines which predicted joint is matched with a ground-truth joint. To evaluate the tracking results on the PoseTrack validation dataset, we use the official tool named poseval (https://github.com/leonid-pishchulin/poseval) and report Multiple Object Tracker Accuracy (MOTA), Multiple Object Tracker Precision (MOTP), Precision and Recall.
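As a concrete reference, here is a small sketch of the OKS computation that the extended metric builds on; the formula is the standard COCO definition, and the per-keypoint constants (0.015 for the added keypoints) follow the text.

```python
# Sketch only: COCO-style Object Keypoint Similarity (OKS).
import numpy as np

def oks(pred, gt, visible, k, area):
    # pred, gt: (n, 2) keypoints; visible: (n,) bool mask; k: (n,) constants
    # (0.015 for the new foot/face/hand points); area: object segment area s^2.
    d2 = ((pred - gt) ** 2).sum(axis=1)
    e = d2 / (2.0 * area * k ** 2)           # scale- and keypoint-normalized error
    sim = np.exp(-e)
    return sim[visible].mean() if visible.any() else 0.0
```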
6.3 Implementation Details

We conduct our experiments with PyTorch [79]. We train the network with batch size 32 for 270 epochs. The initial learning rate is 0.01 and we decay it at epochs 100 and 170 by a factor of 0.1. The pose-guided proposal generator is applied after epoch 200. After the entire network is trained, we freeze the backbone and only finetune the re-ID branch on the PoseTrack dataset for 10 epochs. The learning rate in the finetuning phase is 1e-4. We adopt the Adam [86] optimizer during training. All experiments are conducted on 8 Nvidia 2080Ti GPUs.

6.4 Evaluation for Full-Body Pose Estimation

We first evaluate the performance of our model on the Halpe-FullBody and COCO-WholeBody datasets. Since Halpe-FullBody is a new dataset, we retrain several state-of-the-art models and compare the results with ours. Tab. 2 gives the final results. YOLOv3 is adopted as the human detector for all the top-down based models. We can see that top-down methods achieve higher accuracy compared to the bottom-up methods. However, due to the quantization error introduced by heatmaps, conventional SPPEs degrade considerably on the fine-level body parts like the face and hands. Equipped with our novel symmetric integral loss function, our FastPose models achieve the best accuracy. Notably, FastPose50-si yields 2.4 mAP (5.7% relative) higher than its heatmap-based counterpart. The improvements mainly come from the face and hands. This demonstrates that the quantization error of heatmaps affects the fine-level localization of face and hand keypoints, and our symmetric integral regression works well in such cases.

On the COCO-WholeBody dataset, our FastPose embedded with the symmetric integral loss function also outperforms previous state-of-the-art methods by a large margin, especially on the face and hands. Notably, our FastPose achieves the highest accuracy given a smaller input size. The model complexity is also much lower than previous methods. This demonstrates the superiority of our network structure and the novel loss.

Some qualitative results of full-body pose estimation are shown in Fig. 8.

Fig. 8. Qualitative results of AlphaPose on the full-body pose estimation task. Zoom in for more details; best viewed in color.

6.5 Evaluation for Conventional Body Pose Estimation

We also conduct experiments on the conventional body-only pose estimation task to demonstrate the effectiveness of our
method, although it is not our main focus. We train our models on the COCO dataset and evaluate them on the COCO test-dev set. The results are reported in Tab. 4. For the heatmap based methods, we can see that our FastPose backbone achieves on-par performance with the state-of-the-art method, given a smaller input size and a weaker human detector. This demonstrates the superiority of our FastPose network. Note that since our goal is to present a new baseline model like SimplePose [31], we conduct these experiments to prove the accuracy and efficiency of our model. Further pursuing higher accuracy with speed and resource trade-offs is not our goal in this paper, and we leave it for future research.

For the regression based methods, our method achieves state-of-the-art performance with the lowest GFLOPs. Compared to [45], our network serves as a new baseline for future research.

6.6 Ablation Studies for Pose Estimation

To evaluate the effectiveness of our proposed modules for pose estimation, we also conducted ablation experiments on the COCO and Halpe-FullBody datasets. We adopt FastPose50 as the base network and report the numbers on the COCO validation and Halpe-FullBody test sets respectively. The results are summarized in Tab. 5.

TABLE 5
Ablation studies on the Halpe-FullBody dataset and the COCO dataset. “hm-norm” denotes heatmap normalization. “*” denotes results trained with additional data from Multi-Domain Knowledge Distillation.

Module            Halpe-FullBody (mAP)  COCO (mAP)
two-step hm-norm  44.1                  69.5
one-step hm-norm  38.1                  67.1
w. SIKR           44.1                  69.5
w.o SIKR          42.3                  64.6
w. P-NMS          44.1                  69.5
w.o P-NMS         43.7                  68.2
w. PGPG           48.4*                 N/A
w.o PGPG          47.1*                 N/A
Fig. 9. Qualitative results of AlphaPose on the full-body pose tracking task. Zoom in for more details; best viewed in color. The colors of the persons denote their tracking IDs. The image order is indicated by the time arrow. See text for more analysis.
Heatmap Normalization We elucidated the essence of our two-step heatmap normalization for applying the integral-based method in the multi-person scenario in Sec. 3.1. Here we conduct an ablation experiment to show the performance gap of different heatmap normalization methods. We can see that when comparing the conventional one-step heatmap normalization (soft-max) to our two-step heatmap normalization, the performance in multi-person pose estimation decreases by 6 mAP and 2.4 mAP on the Halpe-FullBody and COCO datasets, respectively. This demonstrates that the two-step normalization can alleviate the size-dependent effect and improve performance.

SIKR Module We compare our symmetric integral function with the original integral regression [45]. For both the full-body pose estimation scenario and conventional body pose estimation, we can see that our symmetric integral function greatly outperforms the original integral regression.

Pose NMS Module Without pose NMS, multiple human poses will be predicted for a single person. The redundant poses decrease the model performance. From Tab. 5, we can see that our model decreases by 0.4 mAP and 1.3 mAP on the Halpe-FullBody and COCO datasets respectively.

PGPG Module Proper data augmentation is needed during training to ensure generalization ability at the testing phase. For the Halpe-FullBody dataset, we compare the results of FastPose50-dcn trained with and without the PGPG module. Tab. 5 shows that without our part-guided proposal generation, the performance decreases due to the domain variance in training.

6.7 Evaluation for Pose Tracking

To verify that our system is sufficient for the multi-person pose tracking task, we apply it to the PoseTrack validation dataset. Tab. 6 shows the comparison with other state-of-the-art methods. The backbone we adopted is FastPose152 and the detector is YOLOX. We can see that our model outperforms most methods in both the mAP metric and the MOTA metric, and our speed is quite fast. This near real-time processing speed can be applied to various real-life scenarios. It is worth noting that there are some other methods [49], [50] that have achieved good results on the PoseTrack dataset, but they mainly consider the overall temporal information of the video, which means that they are not strictly online algorithms. Therefore our method is not directly compared with theirs. [52], [53], [54] achieve higher accuracy compared with our results, but they use very high resolutions for input and output, which consumes a lot of memory and is computationally expensive. Our method achieves satisfactory accuracy while running efficiently.

TABLE 6
Evaluation results on the PoseTrack validation datasets. “Res” denotes the input resolution of the pose network and “Src” denotes whether source code is available. “Ours-UNI” denotes results trained with a shared backbone for the pose and re-ID branches and “Ours-SEP” denotes results trained with separated backbones. The “*” in fps means not including detection time. The mAP value is obtained after tracking post-processing.

Data  Method            mAP   MOTA  fps   Res      Src
2017  Det&Track [49]    60.6  55.2  1.2   -        X
      PoseFlow [47]     66.5  58.3  10*   -        X
      JointFlow [87]    69.3  59.8  0.2   N/A      ×
      Fast [57]         70.3  63.2  12.2  N/A      ×
      TML++ [88]        71.5  61.3  -     -        ×
      STAF [55]         72.6  62.7  3.0   N/A      X
      FlowTrack [31]    76.7  65.4  3.0   384×288  X
      PGPT [52]         77.2  68.4  1.2   384×288  X
      Yang et al. [53]  81.1  73.4  -     384×288  ×
      Ours-UNI          76.1  65.5  11.3  256×192  X
      Ours-SEP          76.9  65.7  8.9   256×192  X
2018  MDPN [51]         71.7  50.6  -     384×288  ×
      STAF [55]         70.4  60.9  3.0   N/A      X
      OpenSVAI [89]     69.7  62.4  -     -        ×
      LightTrack [48]   71.2  64.6  0.7   384×288  X
      KeyTrack [54]     74.3  66.6  1.0   384×288  ×
      PGPT [52]         76.8  67.1  1.2   384×288  X
      Yang et al. [53]  77.9  69.2  -     384×288  ×
      Ours-UNI          74.0  64.4  10.9  256×192  X
      Ours-SEP          74.7  64.7  8.7   256×192  X

6.8 Ablation Studies for Pose Tracking

In order to verify the effectiveness of each part of the tracking algorithm, we designed several sets of ablation experiments.

PGA Module The function of the PGA module is to assist in extracting more effective re-ID features with the help of the keypoint information. As a comparison, we remove the PGA module from our framework, which means the human pose and re-ID feature are fed into MSIM directly. Tests on the PoseTrack dataset show that the tracking performance decreases after removing the PGA module, as reported in Tab. 7. At the same time, we visualize the re-ID features extracted with or without the PGA module in Fig. 10. Since the
detection result is usually a larger box than the original size of the human, the background occupies a large proportion of the box. However, the background information makes the human identity embedding carry useless features. This intuitively explains that the advantage of PGA is that it can better focus attention on the target person's area.

Fig. 10. Visualization of the role of the PGA module. When there is no PGA module, some background areas also have a high response. The results of adding the PGA module show that the feature response is more concentrated on the target person. Notably, from figure (b) we can see that when two people are close, the feature response focuses on the target person with the aid of PGA (zoom in for more details).

MSIM To further verify the performance of our model, we feed different levels of information into the network. Specifically, we set up several sets of experiments, respectively using the GT box and the GT pose. These results are reported in Tab. 7. The results show that if we replace the human detector and pose estimator with more accurate networks, our tracking performance will be further improved.

TABLE 7
The ablation study results of the proposed pose tracking method (MOTA).
GT-Box   75.8  75.5  75.8  75.4  76.7  75.9
GT-Pose  93.8  93.6  93.7  93.8  94.4  93.9

8 LIBRARY ANALYSIS

In this section, we compare our AlphaPose library with other popular open-source libraries in both pose estimation and pose tracking. The results are obtained on a single Nvidia 2080Ti GPU. Fig. 11 shows the speed-accuracy curves of the different libraries.

Fig. 11. Speed/Accuracy comparison of different pose estimation and tracking libraries. (a) Pose estimation results obtained on the COCO-WholeBody validation set and the COCO validation set. (b) Pose tracking results obtained on the PoseTrack18-val set.

From Fig. 11(a) we can see that our method has the highest accuracy and yields the highest
efficiency on whole-body and body-only pose estimation. Although a drawback of our top-down based approach is that the running time increases as the number of persons in the scene increases, our parallel processing pipeline greatly redeems this deficiency. According to the statistics by OpenPose [17], our library is more efficient than it when there are fewer than 20 persons in the scene. From Fig. 11(b) we can see that our pose tracking achieves on-par performance with the state-of-the-art library while running with high efficiency.

9 CONCLUSION

In this paper, we propose a unified and realtime framework for multi-person full-body pose estimation and tracking. To the best of our knowledge, it is the first framework that serves this purpose. Several novel techniques are presented to achieve this goal, and we demonstrate superior performance in both efficacy and efficiency. A new dataset that contains full-body keypoints (136 keypoints for each person) is annotated to facilitate research in this area. We also present a standard library that is highly optimized for easy usage and hope that it can benefit our community. In future research, we will also add 3D keypoints and meshes to our library.

ACKNOWLEDGMENT

This work is supported in part by the National Key R&D Program of China, No. 2017YFA0700800, Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), Shanghai Qi Zhi Institute, and SHEITC (2018-RGZN-02046). We appreciate Chenxi Wang for help developing the MXNet version and Yang Han for developing the Jittor version of AlphaPose. Hao-Shu Fang would like to thank the support from Baidu, MSRA and the ByteDance Fellowship.

REFERENCES
[1] K. Wang, R. Zhao, and Q. Ji, “Human computer interaction with head pose, eye gaze and body gestures,” in 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). IEEE, 2018, pp. 789–789.
[2] T. B. Moeslund, A. Hilton, and V. Krüger, “A survey of advances in vision-based human motion capture and analysis,” Computer Vision and Image Understanding, vol. 104, no. 2-3, pp. 90–126, 2006.
[3] B. Pang, K. Zha, and C. Lu, “Human action adverb recognition: Adha dataset and a three-stream hybrid model,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 2325–2334.
[4] B. Sapp, A. Toshev, and B. Taskar, “Cascaded models for articulated pose estimation,” in European Conference on Computer Vision (ECCV). Springer, 2010, pp. 406–420.
[5] M. Sun, P. Kohli, and J. Shotton, “Conditional regression forests for human pose estimation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2012, pp. 3394–3401.
[6] L. Ladicky, P. H. Torr, and A. Zisserman, “Human pose estimation using a joint pixel-wise and part-wise formulation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 3578–3585.
[7] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” arXiv preprint arXiv:1603.06937, 2016.
[8] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4724–4732.
[9] L. Pishchulin, A. Jain, M. Andriluka, T. Thormählen, and B. Schiele, “Articulated people detection and pose estimation: Reshaping the future,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 3178–3185.
[10] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik, “Using k-poselets for detecting people and localizing their keypoints,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 3582–3589.
[11] X. Chen and A. L. Yuille, “Parsing occluded people by flexible compositions,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3945–3954.
[12] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele, “Deepcut: Joint subset partition and labeling for multi person pose estimation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[13] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele, “DeeperCut: A Deeper, Stronger, and Faster Multi-Person Pose Estimation Model,” in European Conference on Computer Vision (ECCV), May 2016.
[14] http://mscoco.org/dataset/#keypoints-leaderboard, 2016.
[15] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, “2d human pose estimation: New benchmark and state of the art analysis,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[16] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler, “Joint training of a convolutional network and a graphical model for human pose estimation,” in Conference on Neural Information Processing Systems (NeurIPS), 2014, pp. 1799–1807.
[17] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, “Openpose: realtime multi-person 2d pose estimation using part affinity fields,” arXiv preprint arXiv:1812.08008, 2018.
[18] S. Jin, L. Xu, J. Xu, C. Wang, W. Liu, C. Qian, W. Ouyang, and P. Luo, “Whole-body human pose estimation in the wild,” in European Conference on Computer Vision, 2020, pp. 196–214.
[19] J. Li, C. Wang, H. Zhu, Y. Mao, H.-S. Fang, and C. Lu, “Crowdpose: Efficient crowded scenes pose estimation and a new benchmark,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10863–10872.
[20] H.-S. Fang, S. Xie, Y.-W. Tai, and C. Lu, “Rmpe: Regional multi-person pose estimation,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2334–2343.
[21] U. Iqbal and J. Gall, “Multi-person pose estimation with local joint-to-person associations,” in European Conference on Computer Vision Workshops 2016 (ECCVW’16), 2016.
[22] S. Kreiss, L. Bertoni, and A. Alahi, “Pifpaf: Composite fields for human pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11977–11986.
[23] S. Kreiss, L. Bertoni, and A. Alahi, “Openpifpaf: Composite fields for semantic keypoint detection and spatio-temporal association,” IEEE Transactions on Intelligent Transportation Systems, 2021.
[24] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” 2016.
[25] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[26] A. Newell, Z. Huang, and J. Deng, “Associative embedding: End-to-end learning for joint detection and grouping,” in Advances in Neural Information Processing Systems, 2017, pp. 2274–2284.
[27] B. Cheng, B. Xiao, J. Wang, H. Shi, T. S. Huang, and L. Zhang, “Higherhrnet: Scale-aware representation learning for bottom-up human pose estimation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 5386–5395.
[28] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5693–5703.
[29] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 2017, pp. 2980–2988.
[30] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun, “Cascaded pyramid network for multi-person pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7103–7112.
[31] B. Xiao, H. Wu, and Y. Wei, “Simple baselines for human pose estimation and tracking,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 466–481.
[32] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Conference on Neural Information Processing Systems (NeurIPS), 2015, pp. 91–99.
[33] A. Benzine, F. Chabot, B. Luvison, Q. C. Pham, and C. Achard, "PandaNet: Anchor-based single-shot multi-person 3d pose estimation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6856–6865.
[34] G. Bertasius, C. Feichtenhofer, D. Tran, J. Shi, and L. Torresani, "Learning temporal pose estimation from sparsely-labeled videos," Advances in Neural Information Processing Systems, vol. 32, 2019.
[35] X. Zhou, D. Wang, and P. Krähenbühl, "Objects as points," arXiv preprint arXiv:1904.07850, 2019.
[36] X. Nie, J. Feng, J. Zhang, and S. Yan, "Single-stage multi-person pose machines," in IEEE International Conference on Computer Vision (ICCV), 2019, pp. 6951–6960.
[37] Z. Tian, H. Chen, and C. Shen, "DirectPose: Direct end-to-end multi-person pose estimation," arXiv preprint arXiv:1911.07451, 2019.
[38] F. Wei, X. Sun, H. Li, J. Wang, and S. Lin, "Point-set anchors for object detection, instance segmentation and pose estimation," in ECCV, 2020.
[39] G. Hidalgo, Y. Raaj, H. Idrees, D. Xiang, H. Joo, T. Simon, and Y. Sheikh, "Single-network whole-body pose estimation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6982–6991.
[40] Y.-W. Chao, Y. Liu, X. Liu, H. Zeng, and J. Deng, "Learning to detect human-object interactions," in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018, pp. 381–389.
[41] C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel, "Learning visual feature spaces for robotic manipulation with deep spatial autoencoders," arXiv preprint arXiv:1509.06113, 2015.
[42] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua, "LIFT: Learned invariant feature transform," in European Conference on Computer Vision. Springer, 2016, pp. 467–483.
[43] D. C. Luvizon, D. Picard, and H. Tabia, "2d/3d pose estimation and action recognition using multitask deep learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5137–5146.
[44] A. Nibali, Z. He, S. Morgan, and L. Prendergast, "3d human pose estimation with 2d marginal heatmaps," in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2019, pp. 1477–1485.
[45] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei, "Integral human pose regression," in ECCV, 2018.
[46] D. C. Luvizon, H. Tabia, and D. Picard, "Human pose regression by combining indirect part detection and contextual information," Computers & Graphics, vol. 85, pp. 15–22, 2019.
[47] Y. Xiu, J. Li, H. Wang, Y. Fang, and C. Lu, "Pose Flow: Efficient online pose tracking," arXiv preprint arXiv:1802.00977, 2018.
[48] G. Ning and H. Huang, "LightTrack: A generic framework for online top-down human pose tracking," arXiv preprint arXiv:1905.02822, 2019.
[49] R. Girdhar, G. Gkioxari, L. Torresani, M. Paluri, and D. Tran, "Detect-and-track: Efficient pose estimation in videos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 350–359.
[50] M. Wang, J. Tighe, and D. Modolo, "Combining detection and tracking for human pose estimation in videos," arXiv preprint arXiv:2003.13743, 2020.
[51] H. Guo, T. Tang, G. Luo, R. Chen, Y. Lu, and L. Wen, "Multi-domain pose network for multi-person pose estimation and tracking," in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.
[52] Q. Bao, W. Liu, Y. Cheng, B. Zhou, and T. Mei, "Pose-guided tracking-by-detection: Robust multi-person pose tracking," IEEE Transactions on Multimedia, vol. 23, pp. 161–175, 2020.
[53] Y. Yang, Z. Ren, H. Li, C. Zhou, X. Wang, and G. Hua, "Learning dynamics via graph neural networks for human pose estimation and tracking," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8074–8084.
[54] M. Snower, A. Kadav, F. Lai, and H. P. Graf, "15 keypoints is all you need," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6738–6748.
[55] Y. Raaj, H. Idrees, G. Hidalgo, and Y. Sheikh, "Efficient online multi-person 2d pose tracking with recurrent spatio-temporal affinity fields," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4620–4628.
[56] S. Jin, W. Liu, W. Ouyang, and C. Qian, "Multi-person articulated tracking with spatial and temporal embeddings," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5664–5673.
[57] J. Zhang, Z. Zhu, W. Zou, P. Li, Y. Li, H. Su, and G. Huang, "FastPose: Towards real-time pose estimation and tracking via scale-normalized multi-task networks," arXiv preprint arXiv:1908.05593, 2019.
[58] B. Gao and L. Pavel, "On the properties of the softmax function with application in game theory and reinforcement learning," arXiv preprint arXiv:1704.00805, 2017.
[59] H. Gouk, E. Frank, B. Pfahringer, and M. J. Cree, "Regularisation of neural networks by enforcing Lipschitz continuity," Machine Learning, vol. 110, no. 2, pp. 393–416, 2021.
[60] J. Li, S. Bian, A. Zeng, C. Wang, B. Pang, W. Liu, and C. Lu, "Human pose regression with residual log-likelihood estimation," in ICCV, 2021.
[61] C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, "300 faces in-the-wild challenge: Database and results," Image and Vision Computing, vol. 47, pp. 3–18, 2016.
[62] C. Zimmermann, D. Ceylan, J. Yang, B. Russell, M. Argus, and T. Brox, "FreiHAND: A dataset for markerless capture of hand pose and shape from single rgb images," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 813–822.
[63] G. Moon, S.-I. Yu, H. Wen, T. Shiratori, and K. M. Lee, "InterHand2.6M: A dataset and baseline for 3d interacting hand pose estimation from a single rgb image," arXiv preprint arXiv:2008.09309, 2020.
[64] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[65] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy, "Towards accurate multi-person pose estimation in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4903–4911.
[66] X. Burgos-Artizzu, D. Hall, P. Perona, and P. Dollár, "Merging pose estimates across space and time," in British Machine Vision Conference (BMVC), 2013.
[67] Z. Wang, L. Zheng, Y. Liu, Y. Li, and S. Wang, "Towards real-time multi-object tracking," arXiv preprint arXiv:1909.12605, 2019.
[68] M. Andriluka, U. Iqbal, E. Insafutdinov, L. Pishchulin, A. Milan, J. Gall, and B. Schiele, "PoseTrack: A benchmark for human pose estimation and tracking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5167–5176.
[69] A. Kendall, Y. Gal, and R. Cipolla, "Multi-task learning using uncertainty to weigh losses for scene geometry and semantics," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7482–7491.
[70] Y. Wang, C. Peng, and Y. Liu, "Mask-pose cascaded cnn for 2d hand pose estimation from single color image," IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 11, pp. 3258–3268, 2018.
[71] F. Gomez-Donoso, S. Orts-Escolano, and M. Cazorla, "Large-scale multiview 3d hand pose dataset," Image and Vision Computing, vol. 81, pp. 25–33, 2019.
[72] W. Wu, C. Qian, S. Yang, Q. Wang, Y. Cai, and Q. Zhou, "Look at boundary: A boundary-aware face alignment algorithm," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2129–2138.
[73] M. Koestinger, P. Wohlhart, P. M. Roth, and H. Bischof, "Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization," in 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops). IEEE, 2011, pp. 2144–2151.
[74] X. P. Burgos-Artizzu, P. Perona, and P. Dollár, "Robust face landmark estimation under occlusion," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1513–1520.
[75] M. Tan, R. Pang, and Q. V. Le, "EfficientDet: Scalable and efficient object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10781–10790.
[76] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, "Understanding convolution for semantic segmentation," in WACV, 2018.
[77] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network," in CVPR, 2016.