3D Human Pose Estimation from a Single 2D Silhouette
Fabrice Dieudonné Atrevi, Damien Vivet, Florent Duculty and Bruno Emile
Univ. Orléans, PRISME, EA 4229, F-45072, Orléans, France
{damien.vivet, bruno.emile, florent.duculty}@univ-orleans.fr, [email protected]
Keywords: Pose estimation, 3D pose, 3D modeling, skeleton extraction, shape descriptor, geometric moment, Krawtchouk moment.
Abstract: This work focuses on the problem of automatically extracting human 3D poses from a single 2D image. By pose we mean the configuration of the human bones needed to reconstruct a 3D skeleton representing the 3D posture of the detected human. This problem is highly non-linear in nature and confounds standard regression techniques. Our approach combines prior learned correspondences between silhouettes and skeletons extracted from 3D human models. In order to match detected silhouettes with simulated silhouettes, we use the Krawtchouk geometric moments as shape descriptors. We provide quantitative results for image retrieval across different actions and subjects, captured from differing viewpoints, and show that our approach gives promising results for 3D pose extraction from a single silhouette.
1 INTRODUCTION

Recognizing human actions has been a real challenge for computer vision researchers over the last two decades (Wang et al., 2011). Human action recognition systems nevertheless have many possible applications in surveillance, pedestrian tracking and human-machine interaction (Aggarwal and Cai, 1999), and human pose estimation is a key step toward action recognition.

A human action is often represented as a succession of human poses (Wang et al., 2013). As these poses can be 2D or 3D, their estimation has attracted a lot of attention. A 2D pose is usually represented by a set of joint locations (Yang and Ramanan, 2011), whose estimation remains challenging because of human body shape variability, viewpoint changes, etc. A 3D pose is usually represented by a skeleton model parameterized by joint locations (Taylor, 2000) or by rotation angles (Lee and Nevatia, 2009). Such a representation has the advantage of being viewpoint-invariant; however, estimating 3D poses from a single image remains a difficult problem, for several reasons. First, multiple 3D poses may project onto the same 2D pose. Second, the 3D pose is inferred from detected 2D joint locations, so the reliability of the 2D pose is essential: it greatly affects skeleton estimation performance. In the camera networks used in a video-surveillance context, image quality is often poor, which makes 2D joint detection a difficult task; moreover, the camera parameters are unknown, which complicates the 2D/3D correspondence.

In this work we propose a new technique for extracting 3D skeleton pose hypotheses from a single 2D image, based on silhouette shape recognition. The technique relies on a 3D human pose and action simulator. A silhouette database is built from this simulator and is used to match the nearest silhouettes and, as a result, the possible 3D human poses.

This article presents a silhouette shape description and a comparison between different subjects and action steps, and shows that a 3D skeleton configuration can be obtained from a single detected 2D silhouette. Section 2 presents related work on human skeleton and action recognition. Section 3 presents the global framework of the method and the 3D simulation used. Section 4 deals with the Krawtchouk shape descriptors applied to human silhouettes. Finally, Sections 3.1 and 5 present the databases and the obtained results.

2 RELATED WORKS
Many methods in the state of the art deal with human pose estimation and action recognition. Nevertheless, these tasks remain challenging for the computer vision community. Human activity analysis started with O'Rourke and Badler (O'Rourke et al., 1980) and Hogg (Hogg, 1983) in the eighties. Over the last decades, scientists have proposed many approaches, which can be divided into two main categories: on one hand, methods using 3D information; on the other hand, techniques using only 2D data.

Most approaches use a 3D model or 3D detection to estimate the pose of a subject and to classify actions. Rehg and Kanade (Rehg and Kanade, 1994) presented a 3D model-based hand tracking system that can recover the state of a 27-DOF skeleton. Gavrila and Davis (Gavrila and Davis, 1996) used 3D model-based tracking of unconstrained human movement, with image sequences acquired from multiple views to recover the 3D body pose of a human. Bourdev and Malik (Bourdev and Malik, 2009) estimated the human pose from key points, using a data set of human annotations with 3D joint information inferred under anthropometric constraints, later applied to human action classification (Maji et al., 2011). Hiyadi et al. (Hiyadi et al., 2015) used the depth information from a Kinect sensor and a tracking algorithm for 3D human gesture recognition. Jiang (Jiang, 2010) proposed an exemplar-based method that prunes the hypotheses with a kd-tree and achieves real-time performance. Andriluka et al. (Andriluka et al., 2010) proposed a three-stage process for recovering 3D poses in uncontrolled environments. Valmadre and Lucey (Valmadre and Lucey, 2010) used deterministic structure from motion over multiple views, based on the related work of Wei and Chai (Wei and Chai, 2009), for 3D pose estimation. These approaches need multiple sensors or specific devices, such as time-of-flight or active cameras, to acquire 3D information; these models also need good parametrization.

The second category of approaches, to which our proposed method belongs, uses 2D models trained from various images. Baumberg and Hogg (Baumberg and Hogg, 1994) used an active shape model to track pedestrians in real-world scenes, with B-splines as the shape vector for training the model. Wren et al. (Wren et al., 1997) tracked people and interpreted their behaviour by using a multiclass statistical model of colour and shape to obtain a 2D representation of head and hands. Gorelick et al. (Gorelick et al., 2005) used the solution of Poisson's equation to extract spatiotemporal features, such as the saliency and orientation of the shape, for action recognition and human pose estimation. Guo et al. (Guo et al., 2009) used a normalized geometrical vector of dimension 13 to describe the shape of a human. Mori and Malik (Mori and Malik, 2002), and Agarwal and Triggs (Agarwal and Triggs, 2006), used the shape context in their research on human pose estimation. de La Gorce et al. (de La Gorce et al., 2011) estimated and tracked the human hand from monocular video through the minimization of an objective function; this minimization is done with a quasi-Newton method, for which they provide a rigorous derivation of the objective function gradient. Yang and Ramanan (Yang and Ramanan, 2011) estimated the pose by capturing the orientation of each part with a mixture of templates modeled by linear SVMs. All of these methods focus on 2D image interpretation in order to detect human poses or actions; for this purpose, learning is required, and such algorithms need complex and expensive systems to obtain a training data set with ground truth.

Our method is based on a very simple silhouette extraction and description. We use the robust Krawtchouk geometric moments for shape analysis in monocular images. For the database, we propose to use software applications from the open-source community; this software makes realistic simulation of various human poses and actions possible. We show in this work that, by using 3D simulations for learning, without a complex machine learning algorithm and with a simple real-time shape descriptor, we can achieve 3D pose estimation on real data with good accuracy from a single 2D image.

3 METHODOLOGY

The proposed approach for pose estimation is based on shape analysis of the human silhouette. The method can be decomposed into four parts: (1) a simulated silhouette and skeleton database, (2) human detection and 2D silhouette extraction, (3) silhouette shape matching, and (4) skeleton scaling and validation. The workflow is presented in Fig. 1.

Figure 1: Human pose estimation methodology.

(1) First, the silhouette and skeleton database is built thanks to open-source 3D software (see Section 3.1). This database is composed of human silhouettes and their corresponding 3D skeletons for the different kinds of actions we want to recognize. For a requested silhouette, it is then possible to find the matching silhouette in the database and, from it, the corresponding 3D skeleton.

(2) 2D silhouette detection is a well-studied field in machine learning and computer vision. For this purpose we used the classical real-time approach proposed by Dollar et al. (Dollar et al., 2010) based on multiscale HOG (Dalal and Triggs, 2005). Once the human silhouette is detected, we convert it into a 128 x 48 pixel image to solve the translation and scale problem.
(3) Silhouette description and similarity measurement is the key point of our methodology. The main objective is to describe the shape of the silhouette accurately. For this task, we use the Krawtchouk geometric moments because of their robustness compared to the Hu, Zernike or shape-context descriptors (see Section 4). Based on this descriptor, a characteristic vector is computed for each silhouette in the database. The similarity between characteristic vectors is measured with the squared Euclidean distance:

    d(z_r, z_t) = \sum_{i=1}^{T} (z_{r_i} - z_{t_i})^2        (1)

where z_r and z_t are, respectively, the characteristic vectors of the request silhouette and of the t-th silhouette in the database.

(4) Skeleton scaling and validation. For each silhouette we retrieve a 3D skeleton, which is scaled to the current silhouette size. At this step we use the ground-truth simulated database to validate the approach: the confidence score is computed by measuring the reprojection error of the predicted joints on the silhouette.
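To make the matching step concrete, here is a minimal sketch of step (3), assuming each database entry pairs a NumPy descriptor with its 3D skeleton; the data structure and function names are illustrative, not the authors' implementation:

```python
import numpy as np

def match_silhouette(z_r, database, n_best=7):
    """Rank database silhouettes by the distance of Eq. (1).

    z_r      : (T,) characteristic vector of the request silhouette.
    database : list of (descriptor, skeleton_3d) pairs, one entry per
               simulated silhouette (hypothetical structure).
    n_best   : number of pose hypotheses to return (the experiments
               in Section 5 keep the N = 7 best matches).
    """
    z_r = np.asarray(z_r, dtype=float)
    # Squared Euclidean distance of Eq. (1) against every stored descriptor.
    dists = [float(np.sum((z_r - np.asarray(z_t, dtype=float)) ** 2))
             for z_t, _ in database]
    order = np.argsort(dists)[:n_best]
    return [database[i] for i in order]   # best silhouette/skeleton pairs
```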
3.1 Construction of the 2D/3D Matching Database

3.1.1 3D Human Avatar and Action Simulation

In order to build our simulated humans, we chose a professional, free and open-source 3D computer graphics application, Blender (https://www.blender.org/), combined with MakeHuman (http://www.makehuman.org/), a free tool for creating realistic 3D humans (see Fig. 2). These avatars can be animated with motion capture data in order to simulate very realistic actions.

Figure 2: 3D simulated avatar and its associated skeleton.

With this software, we simulate human avatars with different morphologies and clothes, and animate them with different realistic motions taken from the CMU motion capture database (mocap.cs.cmu.edu).

3.1.2 Database Construction

In the 3D computer graphics software, we positioned a virtual camera on a hemisphere, looking at the subject. For each movement of the avatar, we record
both the 2D image and silhouette (see Fig. 3), the 3D camera pose, and the 3D joint and bone poses. As a result, for each subject pose we can collect the detected silhouette together with its related 3D skeleton, which contains 19 bones. We recorded 4 subjects with different phenotypes performing 4 different animations: walk cycle, basketball action, jump and climb. As a result, we obtained 2925 silhouette / 3D skeleton pairs. For each silhouette, we computed the feature vector of the shape descriptors presented in Section 4, as well as the 2D poses of the reprojected joints for the quantitative evaluation of the method.

Figure 3: Extracted human silhouette.
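The exact angular sampling of the camera hemisphere is not specified in the text; the sketch below shows one plausible way to generate such viewpoints (the radius and grid density are assumptions), which could then drive a renderer such as Blender:

```python
import numpy as np

def hemisphere_viewpoints(radius=5.0, n_azimuth=12, n_elevation=3):
    """Camera positions on a hemisphere centred on the subject.

    The paper places a virtual camera on a hemisphere looking at the
    avatar; the sampling below (12 azimuths x 3 elevations) is an
    assumption, not the authors' exact setup.
    """
    cams = []
    for el in np.linspace(np.pi / 12, np.pi / 3, n_elevation):  # elevation
        for az in np.linspace(0.0, 2.0 * np.pi, n_azimuth, endpoint=False):
            x = radius * np.cos(el) * np.cos(az)
            y = radius * np.cos(el) * np.sin(az)
            z = radius * np.sin(el)
            cams.append((x, y, z))   # each camera looks at the origin
    return np.array(cams)
```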
4 KRAWTCHOUK POLYNOMIALS AND MOMENTS

4.1 Krawtchouk Polynomials

The n-th order Krawtchouk polynomial is based on the hypergeometric function and is defined as:

    K_n(x; p, N) = \sum_{k=0}^{N} a_{k,n,p} x^k = {}_2F_1(-n, -x; -N; 1/p)        (2)

where x, n = 0, 1, 2, ..., N, N > 0, p \in (0, 1), and the hypergeometric function is defined as:

    {}_2F_1(a, b; c; z) = \sum_{k=0}^{\infty} \frac{(a)_k (b)_k}{(c)_k} \frac{z^k}{k!}        (3)

with

    (a)_k = a(a+1) \cdots (a+k-1) = \frac{\Gamma(a+k)}{\Gamma(a)}        (4)

the Pochhammer symbol. The set of (N+1) Krawtchouk polynomials forms a complete set of discrete basis functions with weight function

    w(x; p, N) = \binom{N}{x} p^x (1-p)^{N-x}        (5)

and satisfies the orthogonality condition:

    \sum_{x=0}^{N} w(x; p, N) K_n(x; p, N) K_m(x; p, N) = \rho(n; p, N) \delta_{nm}        (6)

where \rho(n; p, N) = (-1)^n \left(\frac{1-p}{p}\right)^n \frac{n!}{(-N)_n} and \delta_{nm} is the Kronecker delta.

In order to eliminate the large variability in the dynamic range, a normalization process is applied. The set of normalized (weighted) Krawtchouk polynomials is then defined (Yap et al., 2003) as:

    \bar{K}_n(x; p, N) = K_n(x; p, N) \sqrt{\frac{w(x; p, N)}{\rho(n; p, N)}}        (7)

4.2 Krawtchouk Moments

Krawtchouk moments were first used in image analysis by P.-T. Yap et al. (Yap et al., 2003). Based on the weighted Krawtchouk polynomials, the (n+m)-order Krawtchouk moment of an N x M image with intensity function f(x, y) is defined as:

    Q_{nm} = \sum_{x=0}^{N-1} \sum_{y=0}^{M-1} \bar{K}_n(x; p_1, N-1) \bar{K}_m(y; p_2, M-1) f(x, y)        (8)

The parameters p_1 and p_2 can be viewed as translation factors. Indeed, if p = 0.5 + \Delta p, the weighted Krawtchouk polynomials are shifted by about N \Delta p. The direction of the shift depends on the sign of \Delta p: the polynomials shift along the +x direction when \Delta p is positive, and vice versa. This property makes it possible to extract local properties of an image. For software such as Matlab, there is a matrix form of the Krawtchouk moments:

    Q = K_2 A K_1^T        (9)

where Q = \{Q_{ji}\}_{i,j=0}^{N-1}, K_v = \{\bar{K}_i(j; p_v, N-1)\}_{i,j=0}^{N-1} and A = \{f(j, i)\}_{i,j=0}^{N-1}.
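As a sanity check on Eqs. (2)-(9), the following sketch evaluates the weighted Krawtchouk polynomials directly from the terminating hypergeometric series and assembles the moment matrix in the spirit of Eq. (9). The direct series is exact by definition but numerically delicate at high orders, where a three-term recurrence would be preferred in practice; all function names are ours:

```python
import numpy as np
from math import comb, factorial

def poch(a, k):
    """Pochhammer symbol (a)_k = a(a+1)...(a+k-1), Eq. (4)."""
    out = 1.0
    for i in range(k):
        out *= a + i
    return out

def krawtchouk(n, x, p, N):
    """K_n(x; p, N) via the terminating series of Eqs. (2)-(3)."""
    # The series stops at k = n because (-n)_k vanishes beyond it.
    return sum(poch(-n, k) * poch(-x, k) / (poch(-N, k) * factorial(k))
               * (1.0 / p) ** k for k in range(n + 1))

def weighted_krawtchouk_matrix(orders, p, N):
    """Rows Kbar_n(x; p, N) for n in `orders`, x = 0..N (Eqs. (5)-(7))."""
    # Weight function of Eq. (5).
    w = np.array([comb(N, x) * p**x * (1 - p)**(N - x) for x in range(N + 1)])
    K = np.empty((len(orders), N + 1))
    for r, n in enumerate(orders):
        # rho of Eq. (6); using (-N)_n = (-1)^n N!/(N-n)! gives the
        # positive closed form below.
        rho = ((1 - p) / p) ** n * factorial(n) * factorial(N - n) / factorial(N)
        K[r] = [krawtchouk(n, x, p, N) * np.sqrt(w[x] / rho)
                for x in range(N + 1)]
    return K

def krawtchouk_moments(img, orders, p1, p2):
    """Moments Q_nm of Eq. (8), computed in matrix form as in Eq. (9).

    `img` is indexed img[x, y] with shape (N, M)."""
    N, M = img.shape
    K1 = weighted_krawtchouk_matrix(orders, p1, N - 1)
    K2 = weighted_krawtchouk_matrix(orders, p2, M - 1)
    return K1 @ img @ K2.T
```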
4.3 Feature Extraction

For a given human silhouette image, we use Krawtchouk moments to describe the shape of the human in the image, i.e., we compute the characteristic vector of the image for different moment orders. Thanks to the ability of Krawtchouk moments to extract features from specific regions of the image, we divide each silhouette into two parts, top and bottom (Fig. 4), with the parameters p_1 = 0.5, p_2 = 0.1 for the top and p_1 = 0.5, p_2 = 0.95 for the bottom. We then compute two characteristic vectors and concatenate them into a single descriptor vector. Each extracted human silhouette is converted to a common 128 x 48 space to obtain invariance to translation and scale. For rotation invariance, we assume that the vertical direction is preserved.

Figure 4: Krawtchouk polynomials for the top and bottom parts.

Following related work, we compute the Krawtchouk moments with the parameter m = n. In order to find the best value of n, we used a database of 600 simulated silhouettes and performed cross-validation over all of them. Fig. 5 shows that from the order n = m = 24 onward, we obtain a stable and best accuracy for pose recognition. The final feature vector therefore has 48 dimensions.

Figure 5: Accuracy of cross-validation for different values of n.
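A plausible reading of this construction (assuming the 48 dimensions are the diagonal moments Q_nn, n = 1..24, of each half) is sketched below, reusing krawtchouk_moments() from the previous sketch:

```python
import numpy as np

def silhouette_descriptor(sil):
    """48-D descriptor of a 128 x 48 binary silhouette (Section 4.3).

    Assumed interpretation: diagonal moments Q_nn for n = 1..24,
    computed with (p1, p2) = (0.5, 0.1) to emphasise the top of the
    shape and (0.5, 0.95) for the bottom, then concatenated.
    """
    img = np.asarray(sil, dtype=float).T   # img[x, y]: x horizontal, y vertical
    orders = list(range(1, 25))            # m = n, orders 1..24
    top = np.diag(krawtchouk_moments(img, orders, p1=0.5, p2=0.1))
    bottom = np.diag(krawtchouk_moments(img, orders, p1=0.5, p2=0.95))
    return np.concatenate([top, bottom])   # 24 + 24 = 48 dimensions
```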
5 EXPERIMENTS

As described in Section 3.1, for each 2D silhouette image of the database we store both the silhouette descriptor vector and the associated 3D skeleton composed of 19 joints. Then, for a test image with an extracted silhouette, the similarity between the computed descriptor vector and the database descriptors is measured with the Euclidean distance. As a result, we extract the corresponding silhouette in the database and its 3D joint poses. Note that the approach does not only give the most suitable silhouette: it returns, in ranked order, the N most probable silhouettes. In order to evaluate the results, we used the simulation: knowing the real skeleton of the test image, we can compute the reprojection error of the estimated 3D joints. According to experimental results, when the mean error is less than 5 pixels, the pose of the result is considered similar to the pose of the request silhouette; for this empirical threshold, the difference between two silhouettes is hardly visible to a human.
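This acceptance criterion can be written compactly as follows, assuming the 19 predicted joints have already been scaled and reprojected onto the query image (the array layout is ours):

```python
import numpy as np

def pose_is_accepted(pred_joints_2d, true_joints_2d, thresh_px=5.0):
    """Validation criterion used in the experiments.

    A retrieved pose counts as correct when the mean reprojection
    error of its 19 predicted 2D joint locations stays below the
    empirical 5-pixel threshold. Inputs: (19, 2) pixel coordinates.
    """
    pred = np.asarray(pred_joints_2d, dtype=float)
    true = np.asarray(true_joints_2d, dtype=float)
    per_joint_err = np.linalg.norm(pred - true, axis=1)  # px per joint
    return per_joint_err.mean() < thresh_px
```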
5.1 Representativity and Descriptor Robustness to Noise

Silhouette extraction is still an active research field, and it is well known that extraction is subject to noise, so our first experiment checks the robustness of our descriptors to noise. For this, we conducted experiments with two databases of simulated data for human avatars with different morphologies and different actions. The first database contains 2925 training samples with Gaussian noise around the contour of the shape, and the second contains 608 unlearned samples. The aim of this experiment is to evaluate the capacity of the shape descriptors to encode various shapes for different values of the standard deviation of the Gaussian noise. Considering x_0 = [0, 0] the center of the silhouette, let x_i = [\rho_i, \theta_i] be the polar coordinates of a contour point. The noise \Delta_\sigma is applied on \rho_i, with \Delta_\sigma \sim N(0, std) and std = {0, 1, 2, 3}. Examples of noised silhouettes are presented in Figure 6.

Figure 6: Noised silhouettes with \Delta_\sigma \sim N(0, std) and std = {1, 2, 3}.

This experiment also checks whether the shape descriptor can accurately encode a silhouette and distinguish between close postures: the silhouettes in the database can be very similar because they are extracted from a video of the motion, so two nearby frames provide very similar silhouettes. For std = 0 we keep the original silhouette; for std > 0, Gaussian white noise is added to the silhouette. Figure 7 shows that the recognition accuracy decreases as std increases. For this test we used a training set of 2925 silhouettes and a test set of 608 silhouettes. For a single neighbour (N = 1) and std = {0, 1, 2, 3}, the recognition rates are respectively RR = {98.81, 96.43, 74.6, 44.84}. However, if we increase the number N of hypotheses returned by the program, the recognition rate grows quickly: for N = 7 and std = {0, 1, 2, 3}, the RR are {100, 100, 96.43, 73.41}. Considering that the silhouettes are very similar and the noise very strong, the method gives very good results. For the rest of the article, we consider the N = 7 first silhouettes given by the matcher.
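A minimal sketch of this contour noise model, assuming the contour is given as Cartesian points that we perturb radially around the silhouette centre:

```python
import numpy as np

def noisy_contour(contour_xy, center_xy, std=2.0, rng=None):
    """Radial Gaussian perturbation of Section 5.1.

    Each contour point x_i = (rho_i, theta_i), in polar coordinates
    around the silhouette centre, receives rho_i + N(0, std) while
    theta_i is kept unchanged.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = np.asarray(contour_xy, dtype=float) - np.asarray(center_xy, dtype=float)
    rho = np.hypot(d[:, 0], d[:, 1])            # to polar coordinates
    theta = np.arctan2(d[:, 1], d[:, 0])
    rho = rho + rng.normal(0.0, std, size=rho.shape)  # noise on rho only
    return np.asarray(center_xy, dtype=float) + np.stack(
        [rho * np.cos(theta), rho * np.sin(theta)], axis=1)
```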
In order to evaluate the extracted 3D skeletons, we use the same request silhouettes as in the previous experiment. For each extracted silhouette, we compute the reprojection error and evaluate the accuracy for different values of N. Figure 8 shows skeleton estimations from a single monocular image. For this result, the reprojection error of the first image (human walking) is 2.4739 px and that of the second image (human in cross position) is 1.2614 px. These mean errors show that the retrieved pose is close to the original pose. Note that the database contains no avatar with a similar appearance, so this error is reasonable. The images used as requests in Fig. 8 are simulated images, so we obtain an almost perfect result with a low reprojection error.

The result of the 3D skeleton extraction in Fig. 9 is also very accurate, because this pose is unique and easy to find, and the silhouette extraction is easy thanks to a static and uniform background. In Fig. 10, we used a real-world image extracted from a walking-action video. The chosen pose is similar but not identical to the poses in the learning database, so we do not expect to retrieve exactly the same 3D pose, only similar ones. The result is good in terms of the shape of the pose, but confusion occurred between the right and left foot and arm.

In order to evaluate the stability and robustness of our approach, we considered the successive detections over a complete video of the movement. Note that the timeline is not used: each frame is processed independently. Figure 11 (a) shows the tracking results for four human joints during the execution of the climbing motion. The red curves show the real positions over time and the green curves the estimated positions. The different curves share the same shape, which means that the successive detections are stable in time and that our shape descriptor is reliable. There is, however, an offset due to shape scaling. The mean error over the motion execution is 1.9765 px. Figure 11 also shows that the shape of the curves changes as a function of the motion; the mean error obtained for the jump motion is 1.9892 px. This discrimination confirms that the 3D poses can be used for action classification in video.

5.2 Application to Action Recognition on Real Data

We used the same shape descriptor for human action classification in video, on the public Weizmann database (see Figure 12). As we do not use temporal information, our method matches each frame to an action class and takes the class with the highest associated rate as the action class. The database is a collection of 90 low-resolution (180 x 144, deinterlaced, 50 fps) video sequences showing 9 different people, each performing 10 natural actions: run, walk, skip, jumping-jack, jump, gallop-sideways, wave-two-hands, wave-one-hand, and bend. On the Weizmann database, we performed cross-validation over the different movements and the different phenotypes. In each case, we apply our shape matching method to each frame; as the silhouette retrieved from the database belongs to a specific movement class, we simply count the number of occurrences, and the most represented class is taken as the detected movement.

Based on this very simple workflow, we obtained 71.66% correct action classification. The confusion matrix is shown in Fig. 13. Of course, this accuracy is lower than the recent accuracies obtained on the same database (Blank, 99.64% (Blank et al., 2005), and Gorelick, 97.83% (Gorelick et al., 2007)), but both of these approaches use space-time cubes to analyse the motion, while we do not yet consider the temporal correlation between successive frames. According to Gorelick et al., many successive frames from a first action (e.g., run) may exhibit high spatial similarity to successive frames from a second one, and ignoring the dynamics within the frames may lead to confusion between the two actions. As our approach does not take the time dimension into account, frame-to-frame comparison leads to misclassification for these very similar actions: run, skip and jump.

In future work, we will combine the proposed approach with multi-hypothesis tracking techniques (using the N neighbours) to improve the accuracy of action classification. In this way, we will take into account the temporal information and the dynamics of the action.
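The frame-wise voting scheme can be sketched as follows; match_frame stands for a hypothetical hook that returns the action class of the best-matching database silhouette for one frame:

```python
from collections import Counter

def classify_video(frames, match_frame):
    """Frame-wise action classification by majority vote (Section 5.2).

    Each frame is matched independently (no temporal information);
    the most frequent class over the whole video is reported as the
    detected action.
    """
    votes = Counter(match_frame(f) for f in frames)
    return votes.most_common(1)[0][0]
```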
Figure 7: Histogram of accuracy: the colors represent the noise amplitudes, respectively {0, 1, 2, 3} pixels; the abscissa represents the number N of neighbours considered, {1, 3, 5, 7}.

Figure 8: 3D pose estimation result: on the left, the request silhouette; then, from left to right, the 3D estimated skeleton from various viewpoints.

Figure 9: Real-world data 1.

Figure 10: Real-world data 2.

Figure 11: Tracking results for (a) the climb motion and (b) the jump motion.

Figure 12: Some images of the Weizmann database.

Figure 13: Confusion matrix.

6 CONCLUSIONS

In this paper, we presented a new approach for 3D human pose estimation and action classification in video. The learning database is easily generated thanks to open-source software that allows the simulation of any human pose. The proposed posture recognition method is based on the geometric Krawtchouk moments and gives promising results. Both the application to 3D pose estimation and the application to action classification have been presented. In our work, we tested different moment orders and selected the most suitable one for our approach. We compared our approach with related work in action classification and concluded that it can be improved by using multi-hypothesis tracking during action identification and classification. In future work, we will use a combination of local and global shape descriptors to improve the pose estimation, and use the estimated poses to construct an action model for activity classification.
7 ACKNOWLEDGEMENTS

This work is part of the LUMINEUX project, supported by the Région Centre-Val de Loire (France). The authors would like to thank the Conseil Régional du Centre-Val de Loire for its support.

REFERENCES

Agarwal, A. and Triggs, B. (2006). Recovering 3D human pose from monocular images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(1):44–58.

Aggarwal, J. and Cai, Q. (1999). Human motion analysis: A review. Computer Vision and Image Understanding, 73(3):428–440.

Andriluka, M., Roth, S., and Schiele, B. (2010). Monocular 3D pose estimation and tracking by detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 623–630.

Baumberg, A. and Hogg, D. (1994). Learning flexible models from image sequences. Springer.

Blank, M., Gorelick, L., Shechtman, E., Irani, M., and Basri, R. (2005). Actions as space-time shapes. In The Tenth IEEE International Conference on Computer Vision (ICCV'05), pages 1395–1402.

Bourdev, L. and Malik, J. (2009). Poselets: Body part detectors trained using 3D human pose annotations. In IEEE 12th International Conference on Computer Vision, pages 1365–1372.

Dalal, N. and Triggs, B. (2005). Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 886–893.

de La Gorce, M., Fleet, D., and Paragios, N. (2011). Model-based 3D hand pose estimation from monocular video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(9):1793–1805.

Dollar, P., Belongie, S., and Perona, P. (2010). The fastest pedestrian detector in the west. In Proceedings of the British Machine Vision Conference, pages 1–11.
Gavrila, D. M. and Davis, L. S. (1996). 3-D model-based tracking of humans in action: a multi-view approach. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR'96), pages 73–80.

Gorelick, L., Blank, M., Shechtman, E., Irani, M., and Basri, R. (2005). Actions as space-time shapes. In ICCV, pages 1395–1402.

Gorelick, L., Blank, M., Shechtman, E., Irani, M., and Basri, R. (2007). Actions as space-time shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(12):2247–2253.

Guo, K., Ishwar, P., and Konrad, J. (2009). Action recognition in video by covariance matching of silhouette tunnels. In XXII Brazilian Symposium on Computer Graphics and Image Processing, pages 299–306.

Hiyadi, H., Ababsa, F., Bouyakhf, E. H., Regragui, F., and Montagne, C. (2015). Reconnaissance 3D des gestes pour l'interaction naturelle homme-robot [3D gesture recognition for natural human-robot interaction]. In Journées francophones des jeunes chercheurs en vision par ordinateur.

Hogg, D. (1983). Model-based vision: a program to see a walking person. Image and Vision Computing, 1(1):5–20.

Jiang, H. (2010). 3D human pose reconstruction using millions of exemplars. In 20th International Conference on Pattern Recognition (ICPR), pages 1674–1677.

Lee, M. W. and Nevatia, R. (2009). Human pose tracking in monocular sequence using multilevel structured models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1):27–38.

Maji, S., Bourdev, L., and Malik, J. (2011). Action recognition from a distributed representation of pose and appearance. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3177–3184.

Mori, G. and Malik, J. (2002). Estimating human body configurations using shape context matching. In Computer Vision – ECCV 2002, pages 666–680. Springer.

O'Rourke, J., Badler, N., et al. (1980). Model-based image analysis of human motion using constraint propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):522–536.

Rehg, J. M. and Kanade, T. (1994). Visual tracking of high DOF articulated structures: an application to human hand tracking. In Computer Vision – ECCV'94, pages 35–46. Springer.

Taylor, C. (2000). Reconstruction of articulated objects from point correspondences in a single uncalibrated image. In IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 677–684.

Valmadre, J. and Lucey, S. (2010). Deterministic 3D human pose estimation using rigid structure. In Computer Vision – ECCV 2010, pages 467–480. Springer.

Wang, C., Wang, Y., and Yuille, A. (2013). An approach to pose-based action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 915–922.

Wang, L., Wang, Y., and Gao, W. (2011). Mining layered grammar rules for action recognition. International Journal of Computer Vision, 93(2):162–182.

Wei, X. K. and Chai, J. (2009). Modeling 3D human poses from uncalibrated monocular images. In IEEE 12th International Conference on Computer Vision, pages 1873–1880.

Wren, C. R., Azarbayejani, A., Darrell, T., and Pentland, A. P. (1997). Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):780–785.

Yang, Y. and Ramanan, D. (2011). Articulated pose estimation with flexible mixtures-of-parts. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1385–1392.

Yap, P.-T., Paramesran, R., and Ong, S.-H. (2003). Image analysis by Krawtchouk moments. IEEE Transactions on Image Processing, 12(11):1367–1377.