
Multimed Tools Appl

[Link]

Deep salient-Gaussian Fisher vector encoding


of the spatio-temporal trajectory structures
for person re-identification

Salma Ksibi1 · Mahmoud Mejdoub1 · Chokri Ben Amar1

Received: 27 September 2017 / Revised: 21 March 2018 / Accepted: 23 May 2018


© Springer Science+Business Media, LLC, part of Springer Nature 2018

Abstract In this paper, we propose a deep spatio-temporal appearance (DSTA) descriptor for person re-identification (re-ID). The proposed descriptor is based on the deep Fisher vector (FV) encoding of the trajectory spatio-temporal structures, which have the advantage of robustly handling the misalignment in pedestrian tracklets. The deep encoding exploits the richness of the spatio-temporal structural information around the trajectories. This is achieved by hierarchically encoding the trajectory structures, leveraging a larger tracklet neighborhood scale when moving from one layer to the next. In order to eliminate the noisy background located around the pedestrian and model the uniqueness of its identity, the deep FV encoder is further enriched into the deep Salient-Gaussian weighted FV (deepSGFV) encoder by integrating the pedestrian Gaussian and saliency templates in the encoding process, respectively. The proposed descriptor yields competitive accuracy with respect to state-of-the-art methods, and especially the deep CNN ones, without requiring either pre-training or data augmentation, on four challenging pedestrian video datasets: PRID2011, i-LIDS-VID, Mars and LPW. The further combination of DSTA with the deep CNN boosts the current state-of-the-art methods and demonstrates their complementarity.

Keywords Person re-identification · Deep weighted encoding · Spatio-temporal trajectory structures · Deep spatio-temporal appearance descriptor · Deep CNN

Salma Ksibi
[Link].2014@[Link]
Mahmoud Mejdoub
[Link]@[Link]
Chokri Ben Amar
[Link]@[Link]

1 REGIM: Research Groups on Intelligent Machines, University of Sfax, ENIS, Sfax, Tunisia

1 Introduction

Due to their simplicity and good performance, histogram encoding approaches such as Bag of Visual Words (BOW) [45] and Fisher Vector (FV) [17, 19, 25, 26, 30] have become well-established in the person re-identification (re-ID) task. They map the raw low-level local descriptors into a global, compact and fixed-size representation. Commonly, they consist of two steps: encoding and pooling. In the first step, the local features are quantized into a set of visual words. In the second step, the visual words are aggregated to form a single histogram representative of the image or video. Nonetheless, these
methods ignore the intrinsic spatial and spatio-temporal structural information embedded
in the visual contents. In this context, we presented in a previous work [29] a combined
Salient-Gaussian weighted histogram encoding based on both FV and BossaNova (BN) [4].
This proposed combination improves the re-ID accuracy. Nevertheless, it can only provide a
shallow description of the pedestrian. In order to capture more complex structural informa-
tion, we proposed in [20] a deep version of the earlier shallow salient-Gaussian histogram
encoding methods [17, 29]. The deep representation hierarchically refines the shallow one by integrating more pedestrian spatial neighborhood information when moving from one layer to the next. It yields competitive results with the widely used deep convolutional neural network (CNN) [46]. To further enrich the deep description of the pedestrian images, we
combine the proposed deep representation with the deep CNN one. The resulting combina-
tion conveys two different views of the spatial structural information. On the one hand, the
deep CNN learns the structural relationship among the raw pixels in the pedestrian image via
a cascaded series of convolution transformations on the pixel values, while on the other hand
the proposed framework provides a deep encoding of the hand-crafted low level features.
Recently, deep learning methods [1, 28, 32, 46] have proven to be a good alternative to the histogram encoding methods since they depict, in a hierarchical manner, the structural relationships between the image/video pixels. Specifically, the deep learning methods compute
hierarchical features or representations of the observational visual data, where the higher
level features or factors are defined from lower level ones. The convolutional neural net-
work (CNN) [34, 35, 46, 48] is a widely used deep architecture in the person re-ID task. The
latter has a substantially more sophisticated structure than standard shallow representations.
It models complex spatial relationships among image pixel values by exploiting the strong
spatially local correlation. CNN learns the spatial relationships by enforcing the local con-
nectivity between neurons of adjacent layers. The neurons correspond to non-linear feature
extractors usually represented by convolution, normalization, and max pooling filters. Convolutional layers are usually placed alternately with pooling and normalization ones.
When applied to the person re-ID task, deep learning networks have demonstrated excel-
lent performance. Nevertheless, the CNN has the drawback of ignoring temporal ordering
among the pedestrian sequence of frames. The recurrent neural network (RNN) [28, 49] based on the Long Short-Term Memory (LSTM) architecture effectively addresses this problem using feedback connections that permit it to remember information over the time steps. When receiving a new input related to the current frame in the sequence, the LSTM yields an output based on both the current input and the previous outputs. However, the RNN does not capture the temporal alignment between the pedestrian parts over the successive frames.
In this paper, we propose a deep spatio-temporal appearance descriptor, for the pedes-
trian sequence of frames, based on the deep FV encoding of the spatio-temporal trajectory
structures. The proposed descriptor describes the appearance cue around the pedestrian’s
trajectories since it was proven in many previous works [43, 46] that appearance visual

characteristics are more relevant than motion in the context of person re-ID. Indeed, motion features cannot adequately discriminate the pedestrians since the same person can produce different motion patterns in different cameras, and conversely different persons can yield the
same motion throughout different cameras. The proposed representation extends our shal-
low FV encoding version previously presented in [17–19] in the context of the image-based
person re-ID.
The proposed descriptor makes several contributions:

– The trajectory information inside the sequence of pedestrian frames is exploited in order to robustly handle the temporal misalignment in the pedestrian tracklet.
– The deep FV encoding applied on the trajectory structures exploits the richness of the spatio-temporal structural information. It produces competitive results in comparison with the deep CNN without requiring either pre-training or data augmentation. Furthermore, the combination of the proposed deep hand-crafted descriptor with the deep CNN one considerably enhances the re-ID accuracy.
– We propose to incorporate the saliency and the Gaussian maps in the deep FV encoding in order to emphasize the pedestrian's distinctive parts and to eliminate the background noise around the pedestrian, respectively.

The paper is structured as follows: after introducing the context of our work in Section 1 and presenting the most relevant related state-of-the-art methods in Section 2, we present an overview of the proposed method and then describe each important component in detail in Section 3. The effectiveness of the proposed method is demonstrated via the experimental results obtained on four challenging benchmark datasets in Section 4. Finally, we conclude this paper and give perspectives of our future work in Section 5.

2 Related work
In the person re-ID literature, most methods have investigated this problem in an image-based setting, in which each identity is represented in every camera by a set of temporally
independent images. Recently, video-based re-ID has become popular owing to the rich-
ness of the temporal data that can be extracted from the sequences of frames representing
each person. In [5, 10, 23, 27, 39, 44–46, 48], spatio-temporal appearance features are
derived to describe the pedestrian sequence of frames (tracklet). This is achieved by com-
puting the appearance feature separately in each tracklet frame and aggregating the resulting
features within the tracklet by simply applying either a sum or a max pooling operation.
More specifically, the appearance features can be divided into two main categories: shallow
and deep. Among the most successful shallow appearance features, we can cite the Symmetry-Driven Accumulation of Local Features (SDALF) [5], the Local Maximal Occurrence (LOMO) [23],
BOW [45], HistLBP [39], eSDC [44] and gBiCov [27]. Regarding the deep appearance fea-
tures, we can cite the deep CNN feature called ID-discriminative Embedding (IDE) [46,
48]. The IDE feature is obtained by learning a discriminative embedded feature space from
the tracklet frames within a classification mode. The CaffeNet [16] and ResNet-50 [9]
convolutional neural network (CNN) models are learned in a classification mode by cate-
gorizing the training tracklet frames into pre-defined pedestrian identity classes. The output
of the last convolutional layer is then used as an embedding feature that serves for pedes-
trian matching. It was stated in [46, 48] that the IDE feature considerably boosts the re-ID

accuracy with respect to the previously used Siamese verification models [41]. Indeed, these
treat person re-ID as a two-class recognition task, by taking a pair of images as inputs and
determining whether they belong to the same person or not. In this way, the training set
is enlarged and the shortage of the training images is avoided. Yet, the recent large scale
datasets (e.g. Mars [46]) provide richer training samples per class. In this sense, it was
shown in [24] that the classification model performs better than a verification Siamese one
in the large scale datasets since it can exploit more adequately the intra-class similarity and
inter-class dissimilarity contexts. The Triplet deep CNN models in [2, 10] are also used to
derive deep appearance features. Indeed, the Triplet CNN model is an end-to-end deep neural network that simultaneously learns an embedded appearance feature space and a dissimilarity distance metric between the pedestrian images in that space. This embedded architecture extends the classical CNN models by taking a triplet of images as input (one anchor, one of the same person and one of a different person). To this end, the classification loss function is replaced by the triplet loss function that, in the embedded feature space, pulls together the images of the same identity and pushes apart those of different identities. However, in such models the number of triplets grows cubically with the amount of training data. To partially resolve this problem on the large-scale MARS dataset, the authors in [10] performed the training on a set of small selected batches. In [2], the authors concentrated on the small-dataset case and proposed an improved threshold triplet loss function in order to enhance the intra-class similarity con-
straint. Besides, to further consider the spatial alignment between pedestrian parts, they
jointly trained the network on the full body and the local body horizontal stripes. The afore-
mentioned extensions of the appearance features to the video case have shown promising
re-ID results, especially in the case of deep features. Nevertheless, their major drawback
is that they ignore the temporal ordering information. Therefore, in order to describe the temporal ordering information, RNN models based on the LSTM architecture were proposed in [28, 49]. McLaughlin et al. [28] started with the extraction of a deep CNN feature from each frame and then applied the RNN on top of the resulting CNN features while taking into account the previous time-step. Afterwards, an average or a max pooling was applied on the outputs of the RNN over all the time-steps. In the same context, RNN+OF [28] combines two streams, the frame stream and the optical flow stream, in the introduced CNN/RNN network. As for [49], Zhou et al. have proposed
an end-to-end deep neural network architecture that integrates the temporal motion infor-
mation in the deep CNN model via RNN and simultaneously learns a dissimilarity metric.
The major drawback of the RNN-based aforementioned methods is that they do not take
into account the temporal alignment information in the sequence of frames. To alleviate
this problem, many works [25, 43] have recently focused on the walking profile of the
person in the consecutive frames. This is done by splitting the frames into a set of action
primitives that form the different walking cycles. Indeed, these are detected by computing the flow energy profile (FEP), i.e., a one-dimensional optical flow signal measuring the leg motion between each two consecutive frames. Afterwards, frames located around
the local minimum/maximum FEP are used to derive the different walking cycles. Ideally,
the local maximum of FEP corresponds to the postures when the person’s two legs over-
lap, while at the local minimum the two legs are the farthest away. In this context, in [46],
the walking cycles are first extracted then HOG3d [14] as well as motion GAIT [8] fea-
tures are computed separately on each walking cycle. More specifically, HOG3D [14] is
a joint appearance-motion feature, borrowed from the action recognition field that reflects
the temporal gradient information. GAIT [8] has been introduced as a biometric feature
describing the pedestrian walking sequence. In [25], each sequence is split into body-action

units derived from different walking cycles. FV models are then trained separately on each
body-action unit, upon joint appearance-motion features. These are obtained by computing
the spatial and the temporal derivatives of the pixel intensities. The final description of the
person is obtained by concatenating the FVs related to all body-action units, thus forming
the spatio-temporal FV (STFV3D). In summary, these motion-based features [8, 14, 46]
tend to match pedestrians with similar activities. Meanwhile, the problem with this type of method arises from the fact that two different identities can share similar motion patterns, and, conversely, two pedestrians of the same identity can exhibit different motion patterns. Liu et al. proposed in [43] a discriminative appearance representation by learning a deep CNN on selected representative frames rather than on all sequence frames. The tracklet is divided into a set of walking cycles, and from each walking cycle the representative frames are sampled. A deep CNN is then trained on the representative frames by aggregating them in the pooling layer located between the classifier and the final convolutional layers. Indeed, it was shown in [43] that the walking cycles give good results on single-view datasets with relatively clean backgrounds. Meanwhile, the authors pointed out the limitation of the walking cycles in the case of complex datasets such as the Mars dataset [46]. Indeed, walking cycles are based on the leg motion information, which is sensitive to the noisy backgrounds, occlusions, and changes of poses and viewpoints. This produces inaccurate walking cycles and thus erroneous alignment information in the sequence.
To measure the similarity between two pedestrians the common way is to employ the
supervised metric learning techniques, such as Keep It Simple and Straightforward Metric
Learning (KISSME) [15], locally adaptive decision functions (LADF) [22], the Null space
(NS) metric learning [42], and the Cross-view Quadratic Discriminant Analysis (XQDA)
[23]. These are often applied on the generated features in order to learn an optimal distance that increases the intra-class similarity and decreases the inter-class similarity among the pedestrians in the feature space. Among the metric learning techniques, XQDA achieves
good re-ID results in many works [46, 48]. This is mainly due to the fact that XQDA has
the ability to simultaneously learn a discriminative subspace as well as a distance in the low
dimensional subspace W .
We presented in [29] a combined Salient-Gaussian weighted histogram encoding method. Nevertheless, that method addressed only the image-based re-ID case, where a single shot is used per pedestrian. However, it was shown in many recent works [23, 27, 28, 49] that the re-ID process can be effectively improved by exploiting the temporal information among the sequence of pedestrian frames. To this end, we shift our interest in this paper to the video-based person re-ID case. Meanwhile, in order to capture more complex structural information, we propose a deep weighted FV encoding of the trajectory spatio-temporal structures that extends the previous shallow histogram encoding methods [25, 29, 45]. Regarding the temporal information, it was generally investigated in the state-of-the-art methods [5, 10, 23, 27, 39, 44–46, 48] in three ways. In [23, 27, 39, 44, 45], a simple pooling of the frame global features was applied. The LSTM-based methods [28, 49] take into account the temporal ordering of the frame features; however, the activities of two pedestrians with different/same identities can be similar/dissimilar. The walking-cycle-based methods [8, 14, 25, 46] divide the sequence into cycles to exploit the temporal alignment among the frames of each cycle. Nevertheless, these methods use global optical flow information, which is very sensitive to disturbances such as changes of viewpoint and occlusions. In contrast to these methods, we exploit the temporal information by designing local and dense trajectories that represent a more natural way to handle the temporal alignment among the tracklet frames.

3 The proposed method

In this section, we first present an overview of the proposed method, and then explain it in detail in the following sections.

3.1 Overview of the proposed method

In this paper, we propose a deep spatio-temporal appearance (DSTA) descriptor based


on the deep encoding of the spatio-temporal trajectory structures (see Fig. 1). First, we
construct dense trajectories within the tracklet. The dense trajectories were introduced in
[37] as a natural way to cope with the temporal misalignment in the human action videos
[36]. They are effectively designed by densely tracking the frame points along the video
sequence. Two kinds of spatio-temporal trajectory structures, ranging from small to large granularities, are built around the dense trajectories. The first kind, called Small Trajectory Sub-Volume (STSV) structures, is designed by describing in each frame the spatial neighborhood of the tracked points by means of patch appearance features. The second kind, called Large Trajectory Sub-Volume (LTSV) structures, is constructed by aggregating the STSV structures according to their spatial proximity. Then, to generate the DSTA descriptor, the STSV and LTSV structures are hierarchically encoded via the proposed deep Salient-Gaussian weighted FV (deepSGFV) according to two nested layers. These forward the FV codes generated by the first layer on the STSV structures as input to the FV encoding of the second-layer LTSV structures. Besides, deepSGFV integrates the Gaussian and saliency maps in the encoding process in order to eliminate the background noise located around the pedestrian and to emphasize its most distinctive parts, respectively.
Finally, to compute the dissimilarity between two pedestrian tracklets, we follow [46] and use the XQDA metric learning [23], by separately applying it on the proposed DSTA descriptors and the deep CNN IDE [48] descriptors of the tracklets. Afterwards, a late fusion is performed on the learned distances. It is worth mentioning that all tracklet frames are pre-processed

Fig. 1 Overview of the proposed method: a The proposed deep spatio-temporal appearance descriptor (DSTA): (a.1) Construction of the spatio-temporal trajectory structures. (a.2) Extraction of the Gaussian and saliency maps. (a.3) The proposed deepSGFV encoding. b Deep CNN IDE features. c The dissimilarity computation: combination of the distances learned from the DSTA descriptors and the IDE features using XQDA

with the Retinex transform [23] to reduce the illumination variation before the application of the encoding methods. Besides, the proposed deepSGFV encoding is applied upon a stripe-based representation scheme in order to take into account the spatial alignment information between pedestrian parts.

3.2 Dealing with illumination variations

In this paper, we apply the Multi-Scale Retinex transform with Color Restoration (MSRCR) [13] to handle illumination variations. The Single Scale Retinex (SSR) algorithm is the basic Retinex algorithm, which uses a single scale. The original frame is processed in the logarithmic space in order to highlight the relative details. Besides, a 2D convolution with a Gaussian surround function is applied to smooth the frame. Afterwards, the smoothed part is subtracted from the frame to obtain the final enhanced frame. SSR can provide either dynamic range compression (small scale) or tonal rendition (large scale), but not both simultaneously. The MSRCR algorithm bridges the gap between color images and human observation by effectively combining the dynamic range compression of the small-scale Retinex and the tonal rendition of the large scale with a color restoration function. In the experiments, we used two scales of the Gaussian surround function (σ = 5 and σ = 20).
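For illustration, the following is a minimal sketch of the Retinex processing described above (single-scale responses averaged over the two scales σ = 5 and σ = 20), without the color restoration step of MSRCR; the OpenCV-based implementation details are our own assumptions and not the authors' code.

```python
import cv2
import numpy as np

def single_scale_retinex(img, sigma):
    """SSR: log(image) minus log of its Gaussian-smoothed version."""
    img = img.astype(np.float64) + 1.0            # avoid log(0)
    blur = cv2.GaussianBlur(img, (0, 0), sigma)   # Gaussian surround function
    return np.log(img) - np.log(blur)

def multi_scale_retinex(img, sigmas=(5, 20)):
    """Average the single-scale responses over the two scales used in the paper."""
    out = sum(single_scale_retinex(img, s) for s in sigmas) / len(sigmas)
    out = (out - out.min()) / (out.max() - out.min() + 1e-12)   # rescale for display
    return (255.0 * out).astype(np.uint8)

# Example: enhanced = multi_scale_retinex(cv2.imread("frame.png"))
```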

3.3 Deep spatio-temporal appearance descriptor

We describe hereafter the different steps of the construction of the deep spatio-temporal
appearance (DSTA) descriptor: the construction of spatio-temporal trajectory structures, the
computation of the saliency and the Gaussian maps and then the deepSGFV encoding (see
Fig. 1).

3.3.1 Construction of the spatio-temporal trajectory structures

Dense Trajectory Extraction The dense trajectories are extracted as introduced by [37] in the action recognition research field, in order to capture the local motion information present in the tracklets. To this end, points are densely sampled on a regular grid over the first tracklet frame, using horizontal and vertical strides of 4 pixels. Every point is then tracked along the tracklet frames based on the Farnebäck [6] dense optical flow algorithm to construct a set of dense trajectories. These can effectively handle the temporal misalignment in the tracklet since they are robust to view and pose changes and to cluttered backgrounds. This is due to the local and dense setting used for both the tracked points and the optical flow fields. To extract dense optical flow fields, the Farnebäck method proceeds by embedding a translation motion model between the neighborhoods of two consecutive frames via a polynomial expansion that approximates the neighborhood pixel intensities. We use the OpenCV implementation of the Farnebäck method. As explained
by (1), each point $c_t = (h_t, w_t)$ at frame $t$ is tracked in the next frame by applying a median filtering kernel $\omega$ of size 3 × 3 on the dense optical flow field $F = (u_t, v_t)$, where $u_t$ and $v_t$ represent the vertical and horizontal components of the optical flow field.
$$c_{t+1} = (h_{t+1}, w_{t+1}) = (h_t, w_t) + (F * \omega)\big|_{(h_t, w_t)} \quad (1)$$
We note that pedestrian parts can disappear and re-appear in the sequence. To deal with this problem, we verify whether the point can be tracked in the next frame. If no tracked point is found, the point is progressively tracked within the following frames until a tracked point is detected.
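The point-tracking step of (1) can be sketched as follows with the OpenCV Farnebäck implementation mentioned above; the Farnebäck parameter values are illustrative defaults and not taken from the paper.

```python
import cv2
import numpy as np

def track_points(prev_gray, next_gray, points):
    """points: (N, 2) float array of (h, w) positions in prev_gray."""
    # Dense Farneback optical flow (pyr_scale, levels, winsize, iterations,
    # poly_n, poly_sigma, flags are illustrative defaults).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Median filtering of each flow component with a 3x3 kernel (omega in (1)).
    fx = cv2.medianBlur(np.ascontiguousarray(flow[..., 0]), 3)   # horizontal (w)
    fy = cv2.medianBlur(np.ascontiguousarray(flow[..., 1]), 3)   # vertical (h)
    h = np.clip(np.rint(points[:, 0]).astype(int), 0, prev_gray.shape[0] - 1)
    w = np.clip(np.rint(points[:, 1]).astype(int), 0, prev_gray.shape[1] - 1)
    # c_{t+1} = c_t + (F * omega)|_{c_t}
    return points + np.stack([fy[h, w], fx[h, w]], axis=1)
```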

Construction of the Small Trajectory Sub-Volumes (STSV) In each frame crossed


by the trajectory we construct a patch of size 4 × 4 pixels around the tracked point. The
patch is then described by an appearance feature (see Section 4.3 for the description of
the used appearance features). Since the starting trajectory points are densely sampled in
the first frame, we obtain a set of dense spatio-temporal STSV structures. Each one starts
from a dense patch located in the first frame, and stretches along the spatio-temporal vol-
ume located around the trajectory. The STSVs have the advantage of (1) describing the
spatio-temporal structural information at a fine granularity, (2) dealing with the temporal
misalignment and (3) revealing the appearance cue which is more relevant than the motion
one in the context of person re-ID.

Construction of the Large Trajectory Sub-Volumes (LTSV) The LTSV structures are
formed by the STSV ones, whose starting patches are located in the same spatial neigh-
borhood in the first tracklet frame. Overlapping spatial neighborhoods are built by densely
scanning the first tracklet frame with a sliding sub-volume of size 3 × 3 patches using a
stride of 4 pixels.
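As a purely illustrative sketch of how the structures above could be laid out, the following groups the dense starting points (one STSV each) into overlapping 3 × 3-patch windows (one LTSV each); the grid and stride values follow the text, while the bookkeeping itself is an assumption about the implementation.

```python
import numpy as np

def starting_points(height, width, stride=4):
    """Dense 4-pixel grid of starting points: one STSV per point."""
    ys, xs = np.mgrid[0:height:stride, 0:width:stride]
    return np.stack([ys.ravel(), xs.ravel()], axis=1)

def group_into_ltsv(points, patch=4, window=3, stride=4):
    """For every sliding 3x3-patch window, return the indices of the STSVs
    whose starting patch falls inside that window."""
    win_px = window * patch                       # window size in pixels
    groups = []
    for y0 in range(0, int(points[:, 0].max()) + 1, stride):
        for x0 in range(0, int(points[:, 1].max()) + 1, stride):
            inside = ((points[:, 0] >= y0) & (points[:, 0] < y0 + win_px) &
                      (points[:, 1] >= x0) & (points[:, 1] < x0 + win_px))
            if inside.any():
                groups.append(np.flatnonzero(inside))
    return groups
```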

3.3.2 Extraction of the saliency and Gaussian maps

The features are extracted from the STSV and LTSV structures built in the trajectory neighborhood along all the tracklet frames. The saliency and Gaussian weights are first computed from the starting tracklet frame and then extended to the tracklet level by assigning them to the STSV and LTSV structures. Indeed, thanks to the performed patch tracking, patches are similar enough to reveal consistent saliency information over the tracklet frames. Furthermore, the background suppression via the Gaussian template is performed once on the first tracklet frame and not on each individual frame in the tracklet, because the starting frame reveals the pedestrian's first walking position and posture before his displacement. Indeed, since the person re-ID datasets employ hand-cropped boxes or boxes produced by the DPM detector, the pedestrian in the first tracklet frame lies near the frame center. Besides, computing the Gaussian and saliency maps only in the first tracklet frame reduces the computational time. We describe in the following the proposed methods for the saliency and Gaussian map extraction.

Extraction of the Saliency map The authors in [44] proposed a salience model in the image-based person re-ID context by computing the salience of the pedestrian image patches and weighting them with the salience values, i.e., the saliency weights. Indeed, salient patches are defined as those that possess the property of uniqueness among a reference set selected from the training images. More specifically, a patch is considered salient if it does not share a similar appearance with half of the pedestrians in the reference set taken from the learning set. In this paper, we compute the saliency of the pedestrian in the context of video-based person re-ID. Since the first frame is the most representative in the tracklet, as it contains the initial posture of the pedestrian, we compute the saliency weights for the first frame patches. Then, the saliency of the STSV and LTSV structures is computed according to that of the first frame patches. The STSV saliency is equal to that of its starting patch, while the LTSV saliency is computed by max pooling the weights of the STSV structures located inside the LTSV. We note that max pooling is used for the saliency weight since it more adequately emphasizes the pedestrian's distinctiveness. Consequently, the saliency computation problem for video-based re-ID is transformed from the tracklet level to the frame level. This avoids the redundant saliency computation in each tracklet frame and

saves computational time. Consider $R$ the reference set that corresponds to the first frames of the $N_r$ training tracklets. To compute the salience for the first tracklet frame $I = \{p_{h,w},\ h = 1 \ldots H,\ w = 1 \ldots W\}$ of width $W$ and height $H$, a nearest neighbor set of size $N_r$ is built for every patch $p_{h,w}$ in $I$. This is carried out by searching for the most similar patch to $p_{h,w}$ in every $v$-th reference frame in the training set. When seeking a patch $p_{h,w}$ in the $v$-th training tracklet frame $I^{R,v}$, the search space is restricted to the adjacency set $S(p_{h,w}, I^{R,v})$ (see (2)). The latter corresponds to the horizontal region centered on the $h$-th row. This is established in order to avoid the spatial misalignment (which manifestly occurs in the horizontal direction) and relax the search space.
$$S(p_{h,w}, I^{R,v}) = \left\{ p_{i,j}^{R,v},\ i \in \Phi(h),\ j = 1 \ldots W \right\} \quad (2)$$
where $\Phi(h) = \{\max(0, h - l), \ldots, h, \ldots, \min(h + l, H)\}$. The parameter $l$ defines the width of the adjacency set. Thus, we compute for each input patch $p_{h,w}$ the matching set
$X_{NN}(p_{h,w})$ defined by (3):
$$X_{NN}(p_{h,w}) = \left\{ \arg\min_{p_{i,j}^{R,v}} \mathrm{dist}\big(p_{h,w}, p_{i,j}^{R,v}\big) \;\Big|\; p_{i,j}^{R,v} \in S(p_{h,w}, I^{R,v}),\ v = 1 \ldots N_r \right\} \quad (3)$$
where $\mathrm{dist}(\cdot)$ is the Euclidean distance between two patch features. Afterwards, we use the computed matching set to define the patch saliency score as the distance to the $k$-th nearest neighbor, denoted by $\mathrm{dist}_k$ ($k = \alpha N_r$):
$$\mathrm{score}(p_{h,w}) = \mathrm{dist}_k\big(X_{NN}(p_{h,w})\big) \quad (4)$$
and the saliency weight of $p_{h,w}$:
$$S(p_{h,w}) = 1 - \exp\big(-\mathrm{score}(p_{h,w})^2 / \sigma_0^2\big) \quad (5)$$


where $\sigma_0$ is a saliency score bandwidth parameter. Actually, the higher the patch saliency weight is, the more discriminative the patch is. As done in [44], we set $k = \alpha N_r$ with $\alpha = 1/2$ in the saliency learning scheme, under the empirical assumption that a patch has a special appearance when more than half of the people in the reference set do not share similar patches with it. To build the saliency map, [44] relies on high-dimensional local dColorSIFT features (672 dimensions) to describe the patches. In this work, to reduce the saliency extraction computation time, we use as patch low-level feature the stacking of the three descriptors CN, CHS and 15-d (see Section 4.3), whose total size is 74 dimensions. Then, the patch-to-patch matching is computed via an approximate nearest neighbor (ANN) algorithm (we use as ANN the randomized best-bin-first KD-tree forest introduced in [31]).
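A minimal sketch of the saliency weights of (2)-(5) is given below, assuming the patch features of the probe frame and of the $N_r$ reference frames have already been extracted; the exhaustive search shown here stands in for the ANN (KD-tree forest) matching used in practice, and the values of l and sigma0 are placeholders.

```python
import numpy as np

def saliency_weights(probe, refs, l=2, alpha=0.5, sigma0=0.5):
    """probe: (H, W, d) patch features of the first frame;
    refs: list of (H, W, d) patch features of the Nr reference frames."""
    H, W, _ = probe.shape
    k = max(1, int(alpha * len(refs)))             # k = alpha * Nr
    S = np.zeros((H, W))
    for h in range(H):
        lo, hi = max(0, h - l), min(H, h + l + 1)  # adjacency set of eq. (2)
        for w in range(W):
            p = probe[h, w]
            # Best match of p inside the adjacency set of every reference frame.
            dists = [np.linalg.norm(r[lo:hi].reshape(-1, p.size) - p, axis=1).min()
                     for r in refs]
            score = np.sort(dists)[k - 1]          # k-th nearest match, eq. (4)
            S[h, w] = 1.0 - np.exp(-(score ** 2) / (sigma0 ** 2))   # eq. (5)
    return S
```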

Extraction of the Gaussian Map To distinguish between the pedestrian located in the foreground and the background objects, image and video segmentation algorithms were proposed in previous works [1, 3, 5]. However, generating a mask for a pedestrian is generally unstable and time-consuming due to the cluttered backgrounds and the small pedestrian displacements over the frames in many tracklets. For that reason, this paper proposes a simple solution by mapping a 2-D Gaussian template on the first image of the tracklet. With respect to the pedestrian located in the foreground, this assigns, in the encoding process, low weights to the background objects.

As established in the case of the saliency computation (see Section 3.3.2), the STSV Gaussian weight is equal to that of its starting patch, while the LTSV Gaussian weight is obtained by applying an average pooling on the Gaussian weights of the STSV structures located inside the LTSV. Inspired by [5, 45], the Gaussian function is defined by $N(\mu_x, \sigma)$, where $\mu_x$ is the mean value of the horizontal coordinates and $\sigma$ is the standard deviation. We set $\mu_x$ to the frame centre ($\mu_x = W/2$) and $\sigma = W/4$. This method uses prior knowledge of the person's position, which assumes that the pedestrian lies in the frame centre. Therefore, the Gaussian template weights the locations near the vertical frame centre with higher probabilities. This discards the noise surrounding the person's silhouette, and thus keeps the meaningful parts of the images and eliminates the needless ones. Explicitly, we endow each patch $p_{h,w}$ in the first frame with a Gaussian weight $G(p_{h,w})$ given by:
$$G(p_{h,w}) = \exp\big(-(w - \mu_x)^2 / 2\sigma^2\big) \quad (6)$$
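Equation (6) translates directly into the following small sketch (W here counts patch columns; the replication along the rows is an implementation assumption):

```python
import numpy as np

def gaussian_weights(H, W):
    """One weight per patch column, following eq. (6) with mu_x = W/2, sigma = W/4."""
    mu_x, sigma = W / 2.0, W / 4.0
    g = np.exp(-((np.arange(W) - mu_x) ** 2) / (2.0 * sigma ** 2))
    return np.tile(g, (H, 1))                     # same weight for every row
```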

3.3.3 Shallow Salient-Gaussian weighted Fisher Vector (SGFV) encoding

FV encoding effectively maps each local feature into a high-dimensional super-vector representation. It then groups these encodings into a single histogram by applying a global sum pooling over the entire image/video. The authors in [26] were the first to introduce the FV encoding method in the context of person re-identification. In this paper, we propose a rich extension of the traditional FV encoding method, which consists in incorporating the Gaussian and saliency maps in its encoding process.
As in the traditional FV encoding, we start by learning a Gaussian Mixture Model (GMM) [7] $p(f_i|\theta)$ on the local features $f_i$ extracted from all training pedestrian tracklets. $p(f_i|\theta)$ is the parametric probability density function on $\mathbb{R}^d$ given by (7) and (8):

$$p(f_i|\theta) = \sum_{k=1}^{K} \pi_k\, p(f_i|\mu_k, \Sigma_k) \quad (7)$$
$$p(f_i|\mu_k, \Sigma_k) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma_k}} \exp\left(-\frac{(f_i - \mu_k)^T \Sigma_k^{-1} (f_i - \mu_k)}{2}\right) \quad (8)$$

where $K$ is the number of GMM components and $\theta = (\pi_1, \mu_1, \Sigma_1, \ldots, \pi_K, \mu_K, \Sigma_K)$ is the vector of parameters of the GMM model that includes the mixture weight $\pi_k$ (we note that $\pi_k$ satisfies the constraints (i) $\sum_{k=1}^{K} \pi_k = 1$ and (ii) $0 \le \pi_k \le 1$ for $1 \le k \le K$), the mean $\mu_k \in \mathbb{R}^d$ and the covariance matrix $\Sigma_k \in \mathbb{R}^{d \times d}$ of the $k$-th Gaussian component.
The GMM is learned with the Expectation-Maximization (EM) algorithm from the training set of local features. Afterwards, the soft feature-to-cluster assignments are defined by:
$$\gamma_i(k) = \frac{\pi_k\, p(f_i|\mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, p(f_i|\mu_j, \Sigma_j)} \quad (9)$$

Consider M local structures in the tracklet described by the local features fi and their
respective Gaussian Gi and saliency Si weights. The proposed global SGFV encoding of
these features incorporates the Gaussian and the saliency weights in the traditional FV

encoding, as given by the mean and the standard deviation of the differences between the encoded descriptors and the $k$-th GMM component, denoted $u(k)$ and $v(k)$, respectively.
$$u(k) = \frac{1}{M} \sum_{i=1}^{M} G_i \times S_i \times \gamma_i(k) \left( \frac{f_i - \mu_k}{\sigma_k} \right), \qquad v(k) = \frac{1}{M} \sum_{i=1}^{M} G_i \times S_i \times \gamma_i(k) \left( \frac{(f_i - \mu_k)^2}{\sigma_k^2} - 1 \right) \quad (10)$$
where $\gamma_i(k)$ is the soft assignment weight of the $i$-th feature $f_i$ to the $k$-th Gaussian. For each GMM component, the sum-pooling operation aggregates the $M$ local structure features in the tracklet into a global $2 \times d \times K$ weighted FV denoted as SGFV. The latter is given by the horizontal concatenation of $u(k)$ and $v(k)$ for all $K$ components, as described in the following equation:

$$\mathrm{SGFV} = [u(1), v(1), \ldots, u(K), v(K)] \quad (11)$$

Finally, we apply power and l2 normalization to each SGFV component before l2-normalizing them jointly. Such a normalization has demonstrated good performance in previous works [33].
We present hereinafter Algorithm 1 for the computation of the weighted salient-Gaussian tiny FVs that encode the local features $f_i$ ($M = 1$ in (10)), and Algorithm 2 for the global SGFV encoding, in which we aggregate the tiny FVs over the whole tracklet (see (10)).

Algorithm 2 Global-SGFV-encoding
Require: Set of the trajectory sub-volumes: TSV;
         Number of trajectory sub-volumes in the tracklet: M;
         Set of the local features describing TSV: {f_i, i = 1...M};
         Learned GMM parameters over the features: θ
Ensure: Global SGFV: SGFV
1: for i = 1 to M do
2:    Compute the Gaussian weight G_i of TSV_i;
3:    Compute the Saliency weight S_i of TSV_i;
4:    tinyFV_i = Compute-Tiny-FV(f_i, G_i, S_i, θ);
5: end for
6: SGFV = sum pooling of the M weighted tiny FVs tinyFV_i (see (10));
7: Normalization of SGFV;
8: return (SGFV);
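The weighted encoding of (9)-(11) can be sketched as follows, assuming a diagonal-covariance GMM has been fitted beforehand (e.g., with an EM implementation such as scikit-learn's GaussianMixture with covariance_type="diag"); this is an illustrative transcription of the equations, not the authors' code.

```python
import numpy as np

def sgfv_encode(f, gw, sw, weights, means, variances):
    """f: (M, d) local features; gw, sw: (M,) Gaussian and saliency weights;
    weights (K,), means (K, d), variances (K, d): diagonal GMM parameters."""
    M, d = f.shape
    K = means.shape[0]
    # Soft assignments gamma_i(k) of eq. (9), computed in the log domain.
    log_prob = -0.5 * (((f[:, None, :] - means[None]) ** 2) / variances[None]
                       + np.log(2.0 * np.pi * variances[None])).sum(axis=2)
    log_w = np.log(weights)[None, :] + log_prob
    gamma = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Weighted first- and second-order statistics u(k), v(k) of eq. (10).
    sigma = np.sqrt(variances)
    scale = (np.asarray(gw) * np.asarray(sw))[:, None, None] * gamma[:, :, None]
    diff = (f[:, None, :] - means[None]) / sigma[None]
    u = (scale * diff).sum(axis=0) / M
    v = (scale * (diff ** 2 - 1.0)).sum(axis=0) / M
    # Concatenation [u(1), v(1), ..., u(K), v(K)] of eq. (11).
    fv = np.concatenate([np.stack([u[k], v[k]]).ravel() for k in range(K)])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))        # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)      # l2 normalization
```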

3.3.4 Deep Salient-Gaussian weighted Fisher Vector (DeepSGFV) encoding

The deepSGFV encoding of the trajectory structures exploits the richness of the spatio-temporal structural information. This is performed by encoding the tracklet local features progressively generated from small-to-large trajectory neighborhoods over two nested layers. Each layer ends up by applying the global SGFV encoder (see Fig. 2) on the tracklet local

Fig. 2 Pipeline illustrating the proposed DeepSGFV Encoding



features representative of each layer trajectory structures. The global SGFVs generated from
the two layers form the tracklet DSTA descriptor.

First layer We start the first layer by applying a local FV encoding within each spatio-temporal STSV structure to encode it into an STSV local feature. To this end, we first learn a GMM on the set of the STSV patch appearance features belonging to all training tracklets. Afterwards, we encode each STSV patch appearance feature with a patch tiny FV (M = 1 in (10), without taking into consideration the Gaussian and saliency weights, as
all patch appearance features have the same weight in the same STSV structure). Next, an
average-pooling operation is applied to the trajectory patch tiny FVs within each STSV
structure, in order to aggregate them into a single fixed-length vector. Finally, we apply on
the resulting STSV vectors the same FV normalization process described in Section 3.3.3.
This produces a set of STSV local FVs for each tracklet that can be regarded as tracklet
STSV local features. These latter have the advantage of reflecting a fine spatio-temporal
structural information and describing the appearance cue in the tracklet while taking into
account the temporal misalignment problem between the tracklet frames. Afterwards, as the STSV local features are too high-dimensional to be directly encoded by SGFV, we apply a dimensionality reduction in a discriminative supervised way through XQDA. As the only provided annotation is the identity label, we exploit this information by (1) averaging the STSV local features over every tracklet, (2) applying XQDA to the resulting intermediate vectors in order to learn the XQDA subspace matrix, and (3) projecting each STSV local feature onto the learned XQDA subspace matrix. This gives a good compromise between accuracy and efficiency, and decorrelates the STSV local features in order to make them suitable for their further FV encoding in the next step. Finally, the global SGFV encod-
ing within each tracklet presented in Section 3.3.3 is applied to the reduced STSV local
features. The encoding is carried out, taking into account both the STSV Gaussian and
saliency weights computed for the STSV structures as described in Section 3.3.2. This
forms the global SGFV of the tracklet (see (11)) that represents the first-layer output. The pseudo-code of the first layer is presented hereinafter in Algorithm 4, but we first provide in Algorithm 3 the process of the local FV encoding within a given trajectory sub-volume structure.

Second layer In the second layer, the LTSV structures are built upon the encoded STSV structures of the first layer in order to reflect larger spatio-temporal structural information. After the extraction of the dense LTSV structures within each tracklet, we apply the local SGFV encoding within each one of them (each LTSV structure). It consists in performing an average pooling, inside the LTSV structure rather than over the whole tracklet, on the STSV weighted tiny FVs already computed by the global SGFV encoder of the first layer. More specifically, each weighted tiny FV encodes a single local STSV reduced feature by setting M = 1 in (10) of the first-layer global SGFV encoding. The obtained LTSV local SGFVs are subsequently normalized using the same previously described normalization process. As a result, each tracklet is represented as a set of local LTSV features that depict larger spatio-temporal structural information in comparison with the STSV ones. The encoding of these features is weighted by the Gaussian/saliency weights computed for their corresponding LTSV structures, as described in Section 3.3.2. The resulting global SGFV of the tracklet forms the second-layer output. We present hereinafter Algorithm 5, which describes the steps of the second layer of the proposed method.

Algorithm 5 Second-Layer
Require: Set of the weighted tiny FVs output of layer 1: {tinyFV_i};
Ensure: Deep spatio-temporal descriptor: DSTA;
1: for j = 1 to number of LTSV structures do
2:    Build the LTSV_j structure from the STSVs whose starting patches lie in the 3 × 3 spatial neighborhood of its starting patch SP_j:
      % SP_j represents the starting patch of LTSV_j
      % 3 × 3 represents the 3 × 3 spatial neighborhood of the starting patch
3:    Gather the weighted tiny FVs tinyFV_i of the STSV structures located inside LTSV_j;
4:    LTSV-FV_j = Local-FV-encoding(tiny FVs inside LTSV_j);
5:    Reduce the dimensionality of LTSV-FV_j by XQDA;
6: end for
7: SGFVl2 = Global-SGFV-encoding(reduced LTSV local features, LTSV Gaussian and saliency weights);
8: DSTA = [SGFVl1, SGFVl2];
9: return (DSTA);
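To make the data flow of the two layers concrete, the following high-level sketch chains the steps described above; local_fv, xqda_reduce and the three GMMs (patch level, reduced-STSV level, reduced-LTSV level) are hypothetical stand-ins for components described in the text, and the sketch only illustrates the layering, not the authors' exact implementation (sgfv_encode is the function sketched in Section 3.3.3 above).

```python
import numpy as np

def dsta_descriptor(stsv_patch_feats, ltsv_groups, g_stsv, s_stsv, g_ltsv, s_ltsv,
                    gmm_patch, gmm_stsv, gmm_ltsv, local_fv, xqda_reduce):
    # ----- first layer -------------------------------------------------------
    # One local FV per STSV: average of the unweighted tiny FVs of its patches,
    # then supervised dimensionality reduction, then the global SGFV encoding.
    stsv_fvs = np.stack([local_fv(p, gmm_patch).mean(axis=0) for p in stsv_patch_feats])
    stsv_red = xqda_reduce(stsv_fvs)
    sgfv_l1 = sgfv_encode(stsv_red, g_stsv, s_stsv, *gmm_stsv)

    # ----- second layer ------------------------------------------------------
    # Weighted tiny FV of every single STSV (M = 1 in (10)), averaged inside
    # each LTSV to form the LTSV local features, reduced and globally encoded.
    tiny = np.stack([sgfv_encode(stsv_red[i:i + 1], g_stsv[i:i + 1], s_stsv[i:i + 1],
                                 *gmm_stsv) for i in range(len(stsv_red))])
    ltsv_fvs = np.stack([tiny[idx].mean(axis=0) for idx in ltsv_groups])
    ltsv_red = xqda_reduce(ltsv_fvs)
    sgfv_l2 = sgfv_encode(ltsv_red, g_ltsv, s_ltsv, *gmm_ltsv)

    return np.concatenate([sgfv_l1, sgfv_l2])                    # DSTA descriptor
```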

3.3.5 Pedestrian stripe partition

In order to alleviate the spatial misalignment caused by pose variations in the person's tracklet frames, appearance modeling typically exploits part-based body models to take into account the non-rigid shape of the human body and treat the appearance of the different body parts independently [5, 40]. Inspired by these works, we propose to sub-divide the first frame of the pedestrian tracklet into NS stripes. Since the spatial information along the y-axis exhibits the spatial alignment between the pedestrian parts more reliably than that along the x-axis under viewpoint and pose variations, we choose to divide the silhouette only along the y-axis. We set NS equal to 8, as it has shown a good
compromise between accuracy and efficiency in [45]. More specifically, the output of each
deepSGFV layer is computed according to each partition. To generate the outputs of the
First layer/Second layer, we apply SGFV encoding on the STSV/LTSV local features of
each partition rather than a global SGFV on all tracklet features. Indeed, the STSV/LTSV
features of a given partition are those describing the STSV/LTSV structures whose starting
patches in the first frame are located in that partition. Finally, the computed outputs from
each partition are separately concatenated for each layer and subsequently l2 normalized to
form the global SGFVs of each layer.
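A possible sketch of the stripe-wise encoding is shown below, reusing the sgfv_encode sketch from Section 3.3.3; assigning structures to stripes by the row of their starting patch and the handling of empty stripes are our assumptions.

```python
import numpy as np

def stripe_sgfv(features, start_rows, g, s, gmm, frame_height, ns=8):
    """Encode per horizontal stripe and concatenate (sgfv_encode as in Section 3.3.3)."""
    weights_, means_, variances_ = gmm
    d, K = features.shape[1], means_.shape[0]
    bounds = np.linspace(0, frame_height, ns + 1)
    parts = []
    for i in range(ns):
        idx = np.flatnonzero((start_rows >= bounds[i]) & (start_rows < bounds[i + 1]))
        if idx.size:
            parts.append(sgfv_encode(features[idx], g[idx], s[idx],
                                     weights_, means_, variances_))
        else:
            parts.append(np.zeros(2 * K * d))     # empty stripe (our assumption)
    out = np.concatenate(parts)
    return out / (np.linalg.norm(out) + 1e-12)    # final joint l2 normalization
```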

3.4 Deep CNN feature: IDE

In this paper, we chose to use the ID-discriminative Embedding (IDE) feature introduced in
[46] since, as previously explained in Section 1, it outperforms the Siamese verification [41]
and the Triplet loss [10] models in the re-ID accuracy and the training computational cost,
respectively. Specifically, as it was done in [46, 48], we use CaffeNet [16] and ResNet-50 [9]
to train the CNN in a classification mode. In the training phase, images are resized to 227 ×
227 pixels and they are passed to the CNN model, along with their respective identities.
The CaffeNet network keeps its five convolutional layers with the original architecture, followed by two fully connected layers with 1024 neurons each and a fully connected classifier layer. The number of neurons in the final fully connected layer is defined by the number of training identities in each dataset. The deep residual ResNet-50 network is made up of 5 convolutional blocks (conv1-x, conv2-x, conv3-x, conv4-x, conv5-x) and a classifier block. The conv5-x block ends with 2048 convolutional filters, each of size 1 × 1. We note that the CNN model is pre-trained on the ImageNet dataset [16] before fine-tuning on the target datasets (all the CNN layer weights are fine-tuned while the classifier layer weights are trained from scratch). In the testing phase, 1024- and 2048-dimensional CNN features are extracted for each pedestrian frame from the 7-th layer of CaffeNet and the conv5-x block of ResNet-50, respectively. The CNN features are subsequently l2 normalized.
Then, as in [46], the average pooling is used for i-LIDS-VID, and the max pooling is
used for PRID2011 and MARS datasets in order to generate a tracklet spatio-temporal IDE
feature.
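The tracklet-level IDE feature described above reduces to the following small pooling sketch over the frame-level CNN embeddings (the l2 normalization before pooling follows the text):

```python
import numpy as np

def tracklet_ide(frame_feats, pooling="max"):
    """frame_feats: (T, D) CNN features of the T tracklet frames."""
    f = frame_feats / (np.linalg.norm(frame_feats, axis=1, keepdims=True) + 1e-12)
    return f.max(axis=0) if pooling == "max" else f.mean(axis=0)

# tracklet_ide(feats, "mean") for i-LIDS-VID, tracklet_ide(feats, "max") for PRID2011/MARS
```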

3.5 Dissimilarity computation

The XQDA metric learning [23] is adopted in this work since it has shown a good compro-
mise between efficiency and accuracy in many state-of-the-art methods [23, 46]. It learns a
reduced subspace from the original data, and at the same time learns a Mahalanobis distance
in the reduced subspace for the dissimilarity measurement. Since we use in this work three
patch appearance features (see Section 4.3) and a deepSGFV encoder with two layers, our

DSTA descriptor is represented by six global tracklet vectors. These six vectors are com-
bined together with the tracklet spatio-temporal IDE vector. To this end, the XQDA distance
is learned separately from the seven vectors. The obtained distances are then summed up
to derive the final dissimilarity function. Given a probe, dissimilarity scores are assigned to
all gallery items. The gallery set is then ranked according to the dissimilarity to the probe.
It is worth mentioning that, as performed in the aforementioned works [23, 46], we select as subspace components the eigenvectors of $S_w^{-1} S_b$ corresponding to the eigenvalues that are larger than 1, where $S_w$ and $S_b$ refer to the within-class and between-class scatter matrices, respectively.
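The subspace selection quoted above (eigenvectors of $S_w^{-1} S_b$ with eigenvalues larger than 1) can be sketched as a generalized eigenproblem; this shows only the selection step, not a full XQDA implementation.

```python
import numpy as np
from scipy.linalg import eig

def select_subspace(Sw, Sb):
    """Keep the eigenvectors of Sw^{-1} Sb whose eigenvalues are larger than 1."""
    vals, vecs = eig(Sb, Sw)                      # generalized problem Sb w = lambda Sw w
    vals = vals.real
    order = np.argsort(-vals)
    keep = [i for i in order if vals[i] > 1.0]
    return vecs[:, keep].real                     # columns span the reduced subspace W
```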

3.6 Multiple queries

The use of multiple queries (MultiQ) is shown to give superior results in the person re-
ID task [5, 46, 47], since the intra-class variation between the different pedestrian tracklet
queries is taken into account. In this paper, we reformulate each probe by separately apply-
ing an average pooling on the six histograms of the DSTA descriptor, within the same
camera tracklet queries. To group the IDE features generated from the different tracklet
queries, we apply, as in [5, 46], a max pooling operation. The resulting pooled vectors are then
used to carry out the dissimilarity computation between the probe and the pedestrians of the
gallery as described in Section 3.5.

4 Experiments

4.1 Datasets

In this section, we evaluate the proposed method on four challenging video benchmarks: PRID2011 [11], i-LIDS-VID [38], Mars [46] and LPW [34] (see Table 1). These datasets are challenging for the person re-ID task due to many factors such as the important variations in viewpoint, pose, and illumination in the images, as well as their low resolution, existing occlusions and background clutter. The Mars dataset is among the largest datasets currently available for the video-based person re-ID task.

PRID2011 [11] is a video dataset that includes 400 image sequences for 200 different
persons. This dataset is captured by two non-overlapping cameras, and each sequence con-
tains about 100 images. This dataset was captured in non-crowded outdoor scenes with a
relatively simple and clean background. The 200 identities are divided equally, i.e. half for
the training set and half for the testing set. All experiments are repeated 10 times with different training/testing splits.

Table 1 Comparison of the video datasets used in this work

Datasets     Release time   #IDs    #Bbox       #Tracklets   #Cam

PRID2011     2011           934     40k         400          2
i-LIDS-VID   2014           300     43,800      600          2
Mars         2016           1,261   1,067,516   20,716       6
LPW          2018           2,731   590,547     7,694        7

i-LIDS-VID [38] is a video dataset that involves 300 identities observed across two dis-
joint camera views in public open spaces. Indeed, each single identity has two image
sequences, thus resulting in a total of 600 sequences. The length of the image sequences varies from 23 to 192 frames, with an average of 73. This dataset is challenging due to environment variations. The test and training sets both contain 150 different identities. As for PRID2011, experiments are repeated 10 times with different training/testing splits.

MARS [46] contains 1,261 identities and about 20,000 video sequences, making it the largest video re-ID dataset to date. MARS is made up of “tracklets” that have been manually grouped into person IDs. The 1,261 identities have 1,191,003 tracklet frames, divided into train/test sets of 631/630 identities containing 509,914/681,089 tracklet frames, respectively.

LPW [34] is a recent large dataset named Labeled Pedestrian in the Wild (LPW). It contains 2,731 pedestrians in three different scenes, where each annotated identity is captured by at least 2 and at most 4 cameras. LPW has a remarkable scale of 7,694 tracklets with over 590,000 frames, as well as clean tracklets. It is distinguished from existing datasets in three aspects: large scale with cleanliness, automatically detected bounding boxes, and far more crowded scenes with a greater age span. LPW is divided into train and test sets containing 1,975 and 756 identities, respectively. The persons in the second and third scenes are used for training, whereas the first scene is used for testing. In the testing set, the probe contains the persons in the second view and the gallery contains the people in the other two views. There are 756 probes and 1,072 gallery tracklets, and the same probe can have multiple galleries. As performed in [34], the train/test partitioning is fixed instead of repeating random partitioning 10 times.

4.2 Platform and framework

The computation of the proposed deepSGFV features was performed on a server with a 3.47 GHz CPU. The code was implemented with MATLAB and MEX functions. The deep CNN features were computed with the Caffe deep learning framework [48] on an NVIDIA K40 GPU.

4.3 Patch appearance feature representation

In this work, three kinds of patch appearance features (CN, CHS and 15-d) are extracted
for each patch located in the STSV and LTSV structures. The extracted low-level features
are chosen on account of their good compromise between efficiency and re-ID accuracy
[26, 45]. Indeed, their small dimensionality as compared to other state-of-the-art descriptors
such as the global LOMO descriptor [23] and the local dColorSift one [44], makes them
well adapted to the efficiency factor required by the person re-ID task.

Color Names (CN) The authors in [21] have demonstrated that a color description based on color names ensures good robustness against photometric variations. In this paper, as was done in [21], we use the 11 basic color terms of the English language, i.e., black, blue, brown, gray, green, orange, pink, purple, red, white, and yellow. First, the CN feature vector of each frame pixel is calculated by mapping the HSV pixel values to an 11-dimensional CN vector. Afterwards, we apply a sum pooling on the CN pixel features

related to each STSV patch. Finally, the resulting histogram undergoes a square rooting
operation followed by l1 normalization. The size of the generated CN feature is then equal
to 11.

15-d feature: Inspired by [26], we design a simple 15-d feature. First, the pedestrian image is split into 3 color channels (HSV). For each channel $C$, each pixel is converted into a 5-d local feature, which contains the pixel intensity and the first-order and second-order derivatives at this pixel. The description is as follows:
$$f(x, y, C) = \big(C(x, y), C_x(x, y), C_y(x, y), C_{xx}(x, y), C_{yy}(x, y)\big) \quad (12)$$
where $C(x, y)$ is the raw pixel intensity at position $(x, y)$, $C_x$ and $C_y$ are the first-order derivatives with respect to the pixel coordinates $x$ and $y$, and $C_{xx}$, $C_{yy}$ are the second-order derivatives. Then, we apply, for each color channel, a sum-pooling operation over the 5-d features of the pixels located within each patch. Each of the three obtained appearance patch features undergoes a square root operation followed by l1 normalization. Afterwards, we horizontally concatenate the three normalized features into one single signature.
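A hedged sketch of the 15-d feature of (12) is given below; the use of Sobel kernels for the derivatives and the signed square root are our assumptions about details the text leaves open.

```python
import cv2
import numpy as np

def patch_15d(patch_hsv):
    """patch_hsv: (h, w, 3) HSV patch; returns the 15-d appearance feature."""
    feats = []
    for c in range(3):
        C = np.ascontiguousarray(patch_hsv[:, :, c], dtype=np.float32)
        cx = cv2.Sobel(C, cv2.CV_32F, 1, 0, ksize=3)    # first-order derivatives
        cy = cv2.Sobel(C, cv2.CV_32F, 0, 1, ksize=3)
        cxx = cv2.Sobel(C, cv2.CV_32F, 2, 0, ksize=3)   # second-order derivatives
        cyy = cv2.Sobel(C, cv2.CV_32F, 0, 2, ksize=3)
        v = np.array([m.sum() for m in (C, cx, cy, cxx, cyy)])   # sum pooling, 5-d
        v = np.sign(v) * np.sqrt(np.abs(v))             # signed square root (assumption)
        feats.append(v / (np.abs(v).sum() + 1e-12))      # l1 normalization
    return np.concatenate(feats)                         # 3 x 5 = 15 dimensions
```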

Color Histogram (CHS): For each patch, a 16-bin color histogram is computed in each HSV color space channel. For each color channel, the patch color histogram is square-rooted and subsequently l1-normalized. The three obtained histograms are then concatenated, generating a 48-dimensional (16 × 3) color feature.

4.4 Experimental settings

In this paper, we use a codebook of 256 GMM components for the proposed FV encoder since it yields a good compromise between accuracy and efficiency. Unless otherwise stated, all results generated by the proposed method are given for the supervised case through XQDA and in the single-query setting. The sizes of the patch appearance features are 11, 48 and 15 for the CN, CHS and 15-d descriptors, respectively. After the application of the XQDA dimensionality reduction technique within the two layers, we set the size of the reduced STSV and LTSV local FVs input to the global SGFV encoding to s = 200 for the three kinds of low-level appearance features, as this produces a good compromise between the re-ID accuracy and the speed. This generates for the two layers the following feature space dimensionalities:
1. First layer:
– The size of the STSV local FV is given by: (2 × 256 × 15), (2 × 256 × 11) and
(2 × 256 × 48), according to the three patch appearance features, respectively.
– After the application of the global SGFV encoding, the size of the SGFVl1 is given by (2 × 256 × 200) for each kind of descriptor.
– During the similarity measure operated by the XQDA metric learning, the global SGFVl1 output of layer 1 is reduced to 727, 851 and 887 dimensions according to the three kinds of patch appearance features, respectively.
2. Second Layer:
– The LTSV local FVs according to the three kinds of appearance features have the same size as SGFVl1.
– After the application of the global SGFV encoding, the size of the SGFVl2 is given by (2 × 256 × 200) for each kind of descriptor.
– During the similarity measure operated by the XQDA metric learning, the global SGFVl2 output of layer 2 is reduced to 687, 731 and 824 dimensions according to the three kinds of patch appearance features, respectively.
Note that during the matching between the probe and the gallery tracklets, the low-
dimensional feature space generated by XQDA is automatically computed (see Section 3.5).

4.5 Evaluation metrics

Commonly, the Cumulative Matching Characteristics (CMC) curve is the most popular evaluation metric for the person re-ID task. It shows the probability that a query identity appears in different-sized candidate lists, i.e., the chance of the true match appearing in the top 1, 2, ..., N of the ranked list (the first point on the CMC curve being the Rank-1 accuracy). Rank-1 accuracy refers to the conventional notion of classification accuracy: the percentage of probe images that are correctly matched to their corresponding gallery image.
The CMC curve at rank r is given by the following equation:

$$\mathrm{CMC}(r) = \frac{1}{|q|} \sum_{r'=1}^{r} Cm(r') \quad (13)$$
where $|q|$ is the number of queries and $Cm(r')$ is the number of correct matches at rank $r'$. We
also use the mean average precision (mAP) to evaluate the performance of the proposed methods. For each query, we calculate the area under the precision-recall curve, called average precision (AP), which has the advantage of taking into account both precision and recall. After that, the mean value of the APs (mAP) over all queries is computed.
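For completeness, a minimal sketch of the two metrics, given for each query the ground-truth relevance of the ranked gallery, could look as follows (this follows the standard definitions rather than any released evaluation code):

```python
import numpy as np

def cmc(ranked_relevance, r):
    """Fraction of queries whose true match appears in the top r
    (ranked_relevance: one boolean array per query, ordered by rank)."""
    return float(np.mean([rel[:r].any() for rel in ranked_relevance]))

def mean_average_precision(ranked_relevance):
    """Mean of the per-query average precision (area under the P-R curve)."""
    aps = []
    for rel in ranked_relevance:
        hits = np.flatnonzero(rel)
        if hits.size == 0:
            continue
        precisions = np.arange(1, hits.size + 1) / (hits + 1.0)
        aps.append(precisions.mean())
    return float(np.mean(aps))
```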

4.6 Empirical analysis of the proposed method

4.6.1 Impact of the Gaussian and the saliency maps

As shown in Table 2, both the Gaussian and saliency maps bring important improvements when applied to the proposed deep FV encoding. Indeed, when weighting deepFV with the Gaussian template (deepGFV), the matching rates increase considerably on all datasets. This is due to the elimination of the background noise achieved by the Gaussian mask. Furthermore, the saliency template has a significant impact on the re-ID

Table 2 Impact of the weightings on the proposed deep encoding

Methods Mars i-LIDS-VID PRID2011

r=1 r=5 r=20 mAP r=1 r=5 r=20 r=1 r=5 r=20

DSTA (deepFV) 64.1 81.2 91.5 44.8 66.5 83.0 94.1 78.5 94.7 99.4
DSTA (deepGFV) 70.2 85.1 93.8 55.3 73.8 88.8 96.5 85.1 96.5 100
DSTA (deepSGFV) 76.7 89.8 96.4 65.2 80.0 94.7 99.2 92.7 99.0 100

Results (ranks 1, 5 and 20 matching rates and the mAP for Mars and ranks 1, 5 and 20 for i-LIDS-VID and
PRID2011) are reported for different methods, i.e., DSTA (deepFV), DSTA (deepGFV) and DSTA (deepS-
GFV) that refer to the proposed DSTA descriptor based on the deepFV, deepGFV and deepSGFV encodings,
respectively

Table 3 Impact of the combination with the IDE features

Methods Mars

r=1 r=5 r=20 mAP

DSTA 76.7 89.8 96.4 65.2


IDE(C) 65.3 82.0 89.0 47.6
DSTA+IDE(C) 78.8 91.5 97.9 67.7
IDE(R) 70.5 83.4 90.2 55.1
DSTA+IDE(R) 81.0 92.7 98.6 69.8

Results (rank 1, 5 and 20 matching rates and the mAP) are reported on Mars dataset for: DSTA, IDE features
trained on CaffeNet: IDE(C) and ResNet-50: IDE(R), DSTA+IDE(C) and DSTA+IDE(R)

accuracy (deepSGFV) on all datasets. This proves the effectiveness of the saliency map in stressing the uniqueness of the identity.
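To make the weighting concrete, the sketch below shows one plausible way of building a centered Gaussian template over the bounding box and of combining it with precomputed patch saliency scores inside a weighted FV aggregation. The σ values, the multiplicative fusion of the two weights and all function names are our assumptions for illustration, not the exact formulation of the proposed deepSGFV encoder:

```python
import numpy as np

def gaussian_template(height, width, sigma_y=0.35, sigma_x=0.25):
    """Centered 2-D Gaussian weight map over the pedestrian bounding box."""
    ys = (np.arange(height) - (height - 1) / 2.0) / (sigma_y * height)
    xs = (np.arange(width) - (width - 1) / 2.0) / (sigma_x * width)
    return np.exp(-0.5 * (ys[:, None] ** 2 + xs[None, :] ** 2))

def weighted_fv_aggregation(patch_grads, positions, saliency, gauss_map):
    """Aggregate per-patch FV contributions with salient-Gaussian weights.

    patch_grads: (M, D) per-patch FV contributions.
    positions:   (M, 2) integer (row, col) patch centers inside the box.
    saliency:    (M,) saliency score of each patch.
    """
    w_gauss = gauss_map[positions[:, 0], positions[:, 1]]
    weights = w_gauss * saliency                 # assumed multiplicative fusion
    weights = weights / (weights.sum() + 1e-12)  # normalize the weights
    return (weights[:, None] * patch_grads).sum(axis=0)
```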

4.6.2 Impact of the combination of DSTA and IDE

In this paper, we propose a spatio-temporal deep representation combined with the deep CNN (IDE) features. Indeed, as shown in Tables 3 and 4, when comparing the deep spatio-temporal appearance descriptor (DSTA) with the high-level deep CNN descriptor IDE(C) (trained on CaffeNet), the proposed method clearly performs better. Furthermore, we notice a strong improvement of the results on the Mars dataset (see Table 3) when combining the proposed DSTA descriptor with the deep CNN one trained on the ResNet-50 network (IDE(R)). This proves the complementarity of the proposed deep spatio-temporal appearance descriptor and the deep CNN features. Indeed, the two methods provide two different views of the re-ID process, since they exploit the structural context among the hand-crafted features and among the raw pixels, respectively.
Some visual re-ID results on the Mars dataset, corresponding to the proposed DSTA+IDE(R) method, are shown in Fig. 3. We observe that the proposed method copes well with illumination variations, changes of viewpoint and scale, and cluttered backgrounds. Besides, we notice that the false match observed for the fourth probe is due to the fact that the two distinct identities share similar colors.
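As an illustration of how such a combination can be realized, a minimal sketch of feature-level fusion that concatenates the l2-normalized DSTA and IDE tracklet signatures before metric learning (the normalization and concatenation choices are assumptions for illustration; the exact fusion scheme may differ):

```python
import numpy as np

def l2norm(x, eps=1e-12):
    """l2-normalize a feature vector."""
    return x / (np.linalg.norm(x) + eps)

def fuse_dsta_ide(dsta_vec, ide_vec):
    """Concatenate the two normalized tracklet signatures into a single
    descriptor, which is then fed as one feature to the metric learning."""
    return np.concatenate([l2norm(dsta_vec), l2norm(ide_vec)])
```

Normalizing each signature separately before concatenation keeps the two modalities on a comparable scale, so that neither dominates the learned metric.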

Table 4 Impact of the combination with the IDE features

Methods i-LIDS-VID PRID2011

r=1 r=5 r=20 r=1 r=5 r=20

DSTA 80.0 94.7 99.2 92.7 99.0 100


IDE(C) 53.0 81.4 95.1 77.3 93.5 99.3
DSTA+IDE(C) 83.6 95.8 99.5 96.1 99.5 100

Results (rank 1, 5 and 20 matching rates) are reported on the i-LIDS-VID and PRID2011 datasets for DSTA, the IDE features trained on CaffeNet (IDE(C)), and DSTA+IDE(C)

Fig. 3 Example results of four probes on the Mars dataset. For each probe, we present the ranking results produced by the proposed DSTA+IDE(R). The persons surrounded by green boxes are the same person as the probe, and those surrounded by red boxes are not

4.6.3 Impact of the combination of the layer outputs

In this section, we study the performance of each layer of the proposed deep spatio-temporal appearance descriptor (DSTA). The second-layer output features yield particularly strong results, owing to the rich spatio-temporal structural information captured at this layer. When the two layers are combined to form the proposed DSTA descriptor, the results improve further (see Table 5). This proves the complementarity of the small and large structural granularities represented over the two nested layers.

4.6.4 Impact of the pedestrian stripe partition

We observe clear improvements in Table 6 when we apply the stripe partition described in Section 3.3.5 (for example, from r1=73.6% to r1=76.7% and from mAP=62.0% to mAP=65.2% on the Mars dataset). This is due to the spatial alignment ensured by the partition of the pedestrian into horizontal stripes.
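A minimal sketch of such a horizontal stripe partition (the number of stripes and the assignment rule are illustrative assumptions; the exact partition is the one described in Section 3.3.5):

```python
import numpy as np

def stripe_partition(patch_positions, box_height, n_stripes=6):
    """Assign each patch to one of n_stripes horizontal stripes of the
    pedestrian bounding box, so that the encoding can be computed per stripe
    and the per-stripe signatures concatenated (enforcing spatial alignment).

    patch_positions: (M, 2) integer (row, col) patch centers inside the box.
    Returns an (M,) array of stripe indices in [0, n_stripes - 1].
    """
    rows = np.asarray(patch_positions)[:, 0]
    stripe_h = box_height / float(n_stripes)
    return np.clip((rows / stripe_h).astype(int), 0, n_stripes - 1)
```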

Table 5 Impact of the combination of the features issued from each layer on the Mars, i-LIDS-VID and PRID2011 datasets, i.e. the first layer output (the global STSV) and the second layer output (the global LTSV)

Methods Mars i-LIDS-VID PRID2011

r=1 r=5 r=20 mAP r=1 r=5 r=20 r=1 r=5 r=20

First layer output 63.3 79.1 86.4 45.6 65.5 86.8 97.2 79.4 94.8 98.8
Second layer output 73.1 87.0 94.8 60.1 76.1 92.4 98.8 89.0 97.9 99.8
DSTA 76.7 89.8 96.4 65.2 80.0 94.7 99.2 92.7 99.0 100

Accuracy is presented by rank 1, 5 and 20 matching rates, and the mAP for Mars



Table 6 Impact of the Stripe Partition

Methods Mars i-LIDS-VID PRID2011

r=1 mAP r=1 r=5 r=1 r=5

DSTA (-SP) 73.6 62.0 76.8 92.7 89.2 98.0


DSTA 76.7 65.2 80.0 94.7 92.7 98.6

Note that (-SP) means without Stripe Partition

4.6.5 Comparison with the state-of-the-art methods on Mars, i-LIDS-VID and PRID2011 datasets

In this section, we compare the performance of the proposed methods to existing video-
based re-ID approaches on Mars, i-LIDS-VID and PRID2011 datasets. The different results
are shown in Tables 7 and 8.
We first compare the proposed DSTA descriptor with shallow spatio-temporal features. These features can be divided into two main categories: spatio-temporal features obtained by pooling the appearance features over the tracklet frames (eSDC [44], SDALF [5], LOMO [23], Bag of visual Words (BOW) [45], HistLBP [39] and gBiCov [27]), and spatio-temporal features based on the walking cycle extraction (KISSME (GEI) [8], XQDA (HOG3D) [14], KISSME (HOG3D) [14], KISSME (STFV3D) [25]). In comparison with
the first category of methods, the shallow version of the proposed descriptor (the first layer)
considerably outperforms them in all datasets. This is due to the fact that our method takes
into account the temporal information by addressing the temporal misalignment in the track-
lets. With respect to the motion features constructed upon the walking cycles [8, 14, 25],
our DSTA descriptor achieves considerably higher accuracy. Indeed, motion features are not well adapted to the person re-ID task, since two pedestrians of the same/different identities may have dissimilar/similar motion activities. The proposed deep hand-crafted method (DSTA), without requiring either pre-training or data augmentation, is competitive with the deep CNN methods: the IDE [46, 47], the RNN [28, 49] and the Triplet loss [2, 10] based methods, as well as the method of [43] that applies deep learning on representative frames selected from the walking cycle. Indeed, pre-training is used by all the aforementioned deep learning methods, while data augmentation is employed in [2, 10, 28, 49]. In comparison with [2, 28, 46, 47, 49] and [43], we note that the proposed DSTA
descriptor achieves better re-ID accuracy. This proves 1) the importance of the temporal
alignment information which is handled by the trajectories, 2) the advantage of the deep
spatio-temporal structural information incorporated in our proposed descriptor and 3) the
ability of the weighted encoding to deal with the noisy backgrounds and to concentrate
on the most distinctive parts of the pedestrian. The Triplet method [10] gives slightly better re-ID accuracy than the proposed DSTA descriptor on the Mars dataset, but at the cost of a computationally heavier training step. Indeed, the Triplet loss approach suffers from the cubic growth of the number of triplets passed to the network on large datasets. Besides, we speculate that the higher result obtained by the Triplet loss method [10] is partly due to the data augmentation used in both the training and testing phases. We note that DSTA outperforms the Triplet method presented in [2] on the PRID2011 and i-LIDS-VID datasets. This can be explained by the fact that the Triplet network can be adversely affected by the limited number of training triplets in these small datasets. When combining the DSTA descriptor with the IDE features, we outperform all the

Table 7 Comparison of the proposed methods with the most relevant state-of-the-art results on the Mars
dataset

Methods r=1 r=5 r=20 mAP

KISSME (GEI) [8] 1.2 2.8 7.4 0.4


KISSME (HOG3D) [14] 2.6 6.4 12.4 0.8
KISSME (BOW) [46] 30.6 46.2 59.2 15.5
Shallow Methods DVR (SDALF) [46] 4.1 12.3 25.1 1.8
XQDA (HistLBP) [46] 18.6 33.0 45.9 8.0
XQDA (gBiCov) [46] 9.2 19.8 33.5 3.7
XQDA (LOMO) [46] 30.7 46.6 60.9 16.4
Ours/Shallow Ours (First Layer) 66.3 81.1 88.4 56.2
WC [43] 55.5 70.2 80.2 -
KISSME (IDE(C)) [46] 65.0 81.1 88.9 45.6
KISSME (IDE(C)) (+MultiQ) [46] 68.3 82.6 89.4 49.3
Deep Methods XQDA (IDE(C)) [46] 65.3 82.0 89.0 47.6
KISSME (IDE(R)) [48] 70.3 - - 53.2
XQDA (IDE(R)) [48] 70.5 - - 55.1
ST-RNN [49] 70.6 90.0 97.6 50.7
TriNet [10] 79.8 91.3 - 67.7
Ours(DSTA) 76.7 89.8 96.4 65.2
Ours(DSTA) (+MultiQ) 79.1 91.0 97.5 67.2
Ours/Deep Ours( DSTA+IDE(C)) 79.8 91.5 97.9 67.9
Ours(DSTA+IDE(C)) (+MultiQ) 81.9 93.1 98.8 70.2
Ours(DSTA+IDE(R)) 82.4 93.7 99.0 70.7
Ours(DSTA+IDE(R)) (+MultiQ) 85.0 94.6 99.3 73.0
TriNet+re [10] 81.2 90.7 - 77.4
Re-ranking Methods XQDA (IDE(C))+re [48] 67.7 - - 57.9
XQDA (IDE(R))+re [48] 73.9 - - 68.4
Ours(DSTA+IDE(C)+re) 81.9 93.2 98.7 78.0
Ours/Re-ranking Ours(DSTA+IDE(C)+re) (+MultiQ) 84.5 95.2 99.7 78.0
Ours(DSTA+IDE(R)+re) 83.7 93.7 99.1 80.9
Ours(DSTA+IDE(R)+re) (+MultiQ) 86.6 93.7 99.1 80.9

(+re) means re-ranking approach. Accuracy is presented by rank 1, 5 and 20 matching rates and mAP

state-of-the-art methods. To be comparable with TriNet+re [10], XQDA (IDE(C))+re and XQDA (IDE(R))+re [48], we used the same k-reciprocal re-ranking method [48]. We observe in Tables 7 and 8 that the employed re-ranking method boosts the accuracy, and especially the mAP, since re-ranking exploits the contextual similarity information more adequately. Besides, we note that the MultiQ setting enhances the re-ID accuracy of our method, since the intra-class variation of the pedestrian can be effectively taken into account.
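For reference, the MultiQ setting is commonly implemented by pooling the signatures of all queries belonging to the same identity into a single descriptor before matching; the sketch below uses average or max pooling as an illustrative assumption, since the exact pooling used here is not detailed in this section:

```python
import numpy as np

def multi_query_descriptor(query_feats, mode="avg"):
    """Pool several signatures of the same query identity into one descriptor.

    query_feats: (n_queries, D) array of tracklet/image signatures.
    """
    q = np.asarray(query_feats, dtype=np.float64)
    pooled = q.mean(axis=0) if mode == "avg" else q.max(axis=0)
    return pooled / (np.linalg.norm(pooled) + 1e-12)
```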

4.6.6 Comparison with the state-of-the-art methods on LPW dataset

To further demonstrate the effectiveness of the proposed DSTA, we compare it with two deep CNN methods (RQEN [34] and GoogleNet [35]) on the LPW dataset (see Table 9).

Table 8 Comparison of the proposed methods: Ours(DSTA), Ours(DSTA+IDE(C)) and Ours(DSTA+


IDE(C)+re) with results of the state-of-the-art methods on i-LIDS-VID and PRID2011 datasets

Methods i-LIDS-VID PRID2011

r=1 r=5 r=20 r=1 r=5 r=20

DVR (HOG3D) [46] 23.3 42.4 68.4 28.9 55.3 82.8


DVR (SDALF) [46] 26.7 49.3 71.6 31.6 58.0 85.3
DVR(eSDC) [46] 30.9 54.4 77.1 - - -
KISSME (BOW) [46] 14.0 32.2 59.5 - - -
Shallow Methods XQDA (LOMO) [46] 53.0 78.5 93.4 - - -
KISSME (GEI) [46] 10.3 30.5 61.5 19.0 36.8 63.9
XQDA(HOG3D) [46] 16.1 41.6 74.5 21.7 51.7 87.0
RankSVM(HOG3D) [14] 12.1 29.3 65.3 19.4 44.9 77.2
STFV3D [25] 37.0 64.3 86.9 42.1 71.9 91.6
KISSME(STFV3D) [25] 44.3 71.7 91.7 64.1 87.3 92.0
Ours Shallow Ours(First Layer) 65.5 86.8 97.2 79.4 94.8 98.8
KISSME(IDE(C)) [46] 48.8 75.6 92.6 69.9 90.6 98.2
XQDA (IDE(C)) [46] 53.0 81.4 95.1 77.3 93.5 99.3
WC [43] 60.2 85.1 94.2 83.3 93.3 96.7
Deep Methods RNN [28] 50.0 76.0 94.0 65.0 90.0 97.0
RNN+OF [28] 58.0 84.0 96.0 70.0 90.0 97.0
ST-RNN [49] 55.2 86.5 97.0 79.4 94.4 99.3
CNN+Triplet [2] 60.4 82.7 96.4 22.0 - 57.0
Ours (DSTA) 80.0 94.7 99.2 92.7 99.0 100
Ours Deep Ours(DSTA+IDE(C)) 83.6 95.8 99.5 96.1 99.5 100
Ours(DSTA+IDE(C)+re) 86.5 96.8 99.8 98.8 99.9 100

(+re) means re-ranking approach. Accuracy is presented by rank 1, 5 and 20 matching rates on the i-LIDS-VID and PRID2011 datasets

The Region-based Quality Estimation Network (RQEN) [34] is a deep CNN that learns the complementarity between the regions of the frames along the tracklet. GoogleNet with batch normalization [12] is a 22-layer deep network initialized with the ImageNet model. It is based on the Hebbian principle and the multi-scale intuition. The proposed DSTA achieves higher accuracy than these two deep CNN models, which confirms the robustness of the proposed method.

4.6.7 Discussion

As shown by the experimental results, the proposed deep descriptor outperforms the shallow
histogram encoding features [17–20, 25, 29] and the deep features [28, 34, 35, 46, 49]. This
is due to three advantageous factors in our approach: the weightings of the encoding that
eliminate the noisy background located around the pedestrian and model the uniqueness
of its identity, the consideration of the deep structural constraints among the hand-crafted
features, and the design of the local and dense temporal alignment throughout the patch
trajectories. While the proposed deep DSTA descriptor achieves competitive results at a low

Table 9 Comparison of the proposed DSTA with results of the state-of-the-art methods on the LPW dataset

Methods LPW

r=1 r=5 r=10 r=20

RQEN [34] 57.1 81.3 86.9 91.5


GoogleNet [35] + Batch normalization [12] 41.5 66.7 77.2 86.2
DSTA 59.8 83.7 88.1 92.6

Accuracy is presented by rank 1, 5, 10 and 20 matching rates

training cost with respect to the state-of-the-art deep CNN features, some limitations of the proposed deep histogram encoding method can be underlined.
The deep CNN methods learn the structural constraints directly from the raw pixels, whereas the proposed method relies on a pre-processing step that computes the hand-crafted features before applying the quantization process. Thus, the proposed framework depends on the quality of the hand-crafted features used and on their ability to cope with background occlusions, changes of field of view, illumination, etc. Moreover, the quantization can cause a loss of information within the feature space.
Therefore, to further enrich the deep description of the pedestrian tracklets, we com-
bine the proposed deep representation with the deep CNN one. The resulting combination
enhances the re-ID results since it conveys two different views of the structural information
related to both the raw pixel and the hand-crafted feature data.

4.7 Time and space complexity

Consider K GMM clusters and M d-dimensional trajectory sub-volume features. The time and space complexities of the XQDA dimensionality reduction technique are O(M × d × r) and O(d × r), respectively, where r is the reduced size of the trajectory sub-volume features after projection on the XQDA subspace matrix. The computational complexity of the global SGFV encoding is estimated by considering the following three steps. The first step computes the saliency and Gaussian weights. The Gaussian weight computation costs O(W), and the saliency weight computation costs O(M × N_r × (2l + 1) × W × dim) + O(M × N_r × dim), where dim is the dimensionality of the feature vector used in the saliency computation, since it proceeds by (1) implementing a patch-to-patch matching between the test patch and the patches in the adjacency search area, for each image in the reference set, and (2) computing the saliency score by searching the k-th nearest neighbor patch in the matching set. The second step computes the soft feature-to-cluster assignment and the third step computes the derivatives on the GMM. The computational complexity of the second step is O(K × M × r), given that it computes the Gaussian-based distances from a feature to every visual word. As for the third step, its computational complexity is O(2 × K × M × r), given that we compute the derivatives of the GMM with respect to the mean and the standard deviation. Thus, the time and space complexities of the SGFV encoding are O(M × d × r) + O(d × r) + O(W) + O(M × N_r × (2l + 1) × W × dim) + O(M × N_r × dim) + O(3 × K × M × r) and O(2 × K × r), respectively. The computational complexity of the XQDA metric learning, which measures the similarity between the probe and the gallery tracklets, comprises (1) the dimensionality reduction of the tracklet global SGFV and (2) the similarity computation based on the resulting reduced-size features. Thus, the time and space complexities

are given by O(r × r_g) + O(G × r_g) and O(r × r_g), respectively, where r_g is the size of the features after the dimensionality reduction and G is the number of tracklets in the gallery.
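To give a feel for where the cost concentrates, a back-of-the-envelope sketch plugging illustrative values into the dominant terms above (K = 256 comes from Section 4.4, and r is taken as the reduced size s = 200; M, N_r, l, W and dim are hypothetical placeholders, not values reported in the paper):

```python
# Dominant-term operation counts for one tracklet (illustrative values only).
K   = 256      # GMM components (Section 4.4)
r   = 200      # reduced local-FV size after XQDA (s = 200, Section 4.4)
M   = 500      # hypothetical number of trajectory sub-volume features
N_r = 100      # hypothetical reference-set size for the saliency weights
l   = 2        # hypothetical adjacency half-width
W   = 50       # hypothetical number of patches per image
dim = 48       # hypothetical saliency feature dimensionality (e.g. CHS)

saliency_cost = M * N_r * (2 * l + 1) * W * dim + M * N_r * dim
encoding_cost = 3 * K * M * r   # soft assignment + mean/std derivatives
print(f"saliency ~{saliency_cost:.2e} ops, FV encoding ~{encoding_cost:.2e} ops")
```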

5 Conclusion

In this paper, we proposed a deep spatio-temporal appearance descriptor that consists of applying the deepSGFV encoding to the spatio-temporal trajectory structures. It has the advantage of handling the temporal misalignment in the tracklet through the tracking of the trajectories. Moreover, it progressively exploits small-to-large trajectory spatio-temporal neighborhoods in order to capture rich spatio-temporal structural information at different granularities. The proposed deep descriptor alone competes well with the deep CNN features. Besides, their further combination enhances the re-ID accuracy. The proposed saliency/Gaussian templates allow highlighting the distinctive parts of the pedestrian and getting rid of the noisy background around the pedestrians. In future research, we will investigate more sophisticated deep network architectures.

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.

References

1. Bedagkar-Gala A, Shah SK (2011) Multiple person re-identification using part based spatio-temporal
color appearance model. In: IEEE international conference on computer vision workshops (ICCV),
pp 1721–1728
2. Cheng D, Gong Y, Zhou S, Wang J, Zheng N (2016) Person re-identification by multi-channel parts-
based cnn with improved triplet loss function. In: The IEEE conference on computer vision and pattern
recognition (CVPR)
3. Chinnasamy GMG (2015) Segmentation of pedestrian video using thresholding algorithm and its
parameter analysis. In: International journal of applied research, vol 1, pp 43–46
4. de Avila SEF, Thome N, Cord M, Valle E, de Albuquerque Araújo A (2011) BOSSA: extended bow
formalism for image classification. In: 18th IEEE international conference on image processing (ICIP),
pp 2909–2912
5. Farenzena M, Bazzani L, Perina A, Murino V, Cristani M (2010) Person re-identification by symmetry-
driven accumulation of local features. In: The twenty-third IEEE conference on computer vision and
pattern recognition, CVPR, pp 2360–2367
6. Farnebäck G (2003) Two-frame motion estimation based on polynomial expansion. In: 13th Scandina-
vian conference on image analysis (SCIA), pp 363–370
7. Farquhar J, Szedmak S, Meng H, Taylor JS (2005) Improving bag-of-keypoints image categorisation
generative models and pdf-kernels. Report
8. Han J, Bhanu B (2006) Individual recognition using gait energy image. IEEE Trans Pattern Anal Mach
Intell 28(2):316–322
9. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016 IEEE
conference on computer vision and pattern recognition (CVPR), pp 770–778
10. Hermans A, Beyer L, Leibe B (2017) In defense of the triplet loss for person re-identification. CoRR
arXiv:1703.07737
11. Hirzer M, Beleznai C, Roth PM, Bischof H (2011) Person re-identification by descriptive and
discriminative classification. In: 17th Scandinavian conference on image analysis (SCIA), pp 91–102
12. Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal
covariate shift. In: ICML, JMLR workshop and conference proceedings, vol 37, pp 448–456
13. Jobson DJ, Rahman Z, Woodell GA (1997) A multiscale retinex for bridging the gap between color
images and the human observation of scenes. IEEE Trans Image Process 6(7):965–976

14. Kläser A, Marszalek M, Schmid C (2008) A spatio-temporal descriptor based on 3d-gradients. In:
Proceedings of the British machine vision conference (BMVC), pp 1–10
15. Köstinger M, Hirzer M, Wohlhart P, Roth PM, Bischof H (2012) Large scale metric learning from equiv-
alence constraints. In: 2012 IEEE conference on computer vision and pattern recognition. Providence,
pp 2288–2295
16. Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural
networks. In: Proceedings of the 25th international conference on neural information processing systems
(NIPS), pp 1097–1105
17. Ksibi S, Mejdoub M, Ben Amar C (2016) Extended fisher vector encoding for person re-identification.
In: 2016 IEEE international conference on systems, man, and cybernetics (SMC), pp 4344–4349
18. Ksibi S, Mejdoub M, Ben Amar C (2016) Person re-identification based on combined gaussian weighted
fisher vectors. In: 13th IEEE/ACS international conference of computer systems and applications
(AICCSA), pp 1–8
19. Ksibi S, Mejdoub M, Ben Amar C (2016) Topological weighted fisher vectors for person re-
identification. In: 23rd international conference on pattern recognition (ICPR), pp 3097–3102
20. Ksibi S, Mejdoub M, Ben Amar C (2018) Supervised person re-id based on deep hand-
crafted and cnn features. In: International conference on computer vision theory and applications.
[Link]
21. Kuo CH, Khamis S, Shet VD (2013) Person re-identification using semantic color names and rankboost.
In: IEEE workshop on applications of computer vision, pp 281–287
22. Li Z, Chang S, Liang F, Huang TS, Cao L, Smith JR (2013) Learning locally-adaptive decision functions
for person verification. In: 2013 IEEE conference on computer vision and pattern recognition. Portland,
pp 3610–3617
23. Liao S, Hu Y, Zhu X, Li SZ (2015) Person re-identification by local maximal occurrence representation
and metric learning. In: IEEE conference on computer vision and pattern recognition, CVPR 2015.
Boston, pp 2197–2206
24. Lin Y, Zheng L, Zheng Z, Wu Y, Yang Y (2017) Improving person re-identification by attribute and
identity learning. CoRR arXiv:1703.07220
25. Liu K, Ma B, Zhang W, Huang R (2015) A spatio-temporal appearance representation for video-based
pedestrian re-identification. In: IEEE international conference on computer vision (ICCV), pp 3810–3818
26. Ma B, Su Y, Jurie F (2012) Local descriptors encoded by fisher vectors for person re-identification. In:
ECCV workshops, vol 7583, pp 413–422
27. Ma B, Su Y, Jurie F (2014) Covariance descriptor based on bio-inspired features for person re-
identification and face verification. Image Vis Comput 32(6-7):379–390
28. McLaughlin N, Martinez del Rincon J, Miller P (2016) Recurrent convolutional network for video-based
person re-identification. In: The IEEE conference on computer vision and pattern recognition (CVPR)
29. Mejdoub M, Ksibi S, Ben Amar C, Koubaa M (2017) Person re-id while crossing different cam-
eras: Combination of salient-gaussian weighted bossanova and fisher vector encodings. In: International
journal of advanced computer science and applications (ijacsa), vol 8, pp 399–410
30. Messelodi S, Modena CM (2015) Boosting fisher vector based scoring functions for person re-
identification. Image Vis Comput 44:44–58
31. Muja M, Lowe DG (2009) Fast approximate nearest neighbors with automatic algorithm configuration.
In: International Conference on Computer Vision Theory and Applications (VISAPP), pp 331–340
32. Othmani M, Bellil W, Ben Amar C, Alimi AM (2010) A new structure and training procedure for multi-
mother wavelet networks. IJWMIP 8(1):149–175. [Link]
33. Sapienza M, Cuzzolin F, Torr PHS (2014) Learning discriminative space-time action parts from weakly
labelled videos. Int J Comput Vis 110(1):30–47
34. Song G, Leng B, Liu Y, Hetang C, Cai S (2017) Region-based quality estimation network for large-scale
person re-identification. CoRR arXiv:1711.08766
35. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A
(2015) Going deeper with convolutions. In: Computer vision and pattern recognition (CVPR)
36. Wali A, Ben Aoun N, Karray H, Ben Amar C, Alimi AM (2010) A new system for event detection from
video surveillance sequences. In: Advanced concepts for intelligent vision systems - 12th international
conference, ACIVS 2010, Sydney, Australia, December 13-16, 2010, Proceedings, Part II, pp 110–120,
[Link]
37. Wang H, Kläser A, Schmid C, Liu C (2013) Dense trajectories and motion boundary descriptors for
action recognition. Int J Comput Vis 103(1):60–79
38. Wang T, Gong S, Zhu X, Wang S (2014) Person re-identification by video ranking. In: 13th European
conference on computer vision (ECCV), pp 688–703

39. Xiong F, Gou M, Camps OI, Sznaier M (2014) Person re-identification using kernel-based metric
learning methods. In: The 13th European conference on computer vision (ECCV), pp 1–16
40. Xu Y, Ma B, Huang R, Lin L (2014) Person search in a scene by jointly modeling people commonness
and person uniqueness. In: Proceedings of the ACM international conference on multimedia, pp 937–940
41. Yi D, Lei Z, Li SZ (2014) Deep metric learning for practical person re-identification. CoRR arXiv:1407.4979
42. Zhang L, Xiang T, Gong S (2016) Learning a discriminative null space for person re-identification. In:
2016 IEEE conference on computer vision and pattern recognition (CVPR), pp 1239–1248
43. Zhang W, Hu S, Liu K (2017) Learning compact appearance representation for video-based person
re-identification. CoRR arXiv:1702.06294
44. Zhao R, Ouyang W, Wang X (2013) Unsupervised salience learning for person re-identification. In:
IEEE conference on computer vision and pattern recognition, pp 3586–3593
45. Zheng L, Shen L, Tian L, Wang S, Bu J, Tian Q (2015) Person re-identification meets image search. In:
CoRR, arXiv:1502.02171, pp 2360–2367
46. Zheng L, Bie Z, Sun Y, Wang J, Su C, Wang S, Tian Q (2016) Mars: a video benchmark for large-scale
person re-identification. In: European conference on computer vision (ECCV)
47. Zheng L, Zhang H, Sun S, Chandraker M, Tian Q (2016) Person re-identification in the wild. CoRR
arXiv:1604.02531
48. Zhong Z, Zheng L, Cao D, Li S (2017) Re-ranking person re-identification with k-reciprocal encoding.
CoRR arXiv:1701.08398
49. Zhou Z, Huang Y, Wang W, Wang L, Tan T (2017) See the forest for the trees: Joint spatial and temporal
recurrent neural networks for video-based person re-identification. In: The IEEE conference on computer
vision and pattern recognition (CVPR)

Salma Ksibi received the Engineering degree in Computer Science Engineering from the National Engineering School of Sfax (ENIS), Tunisia, in 2013. She is currently a Ph.D. student at the National Engineering School of Sfax (ENIS). Besides, she is a research member of the REsearch Groups on Intelligent Machines (REGIM) Lab, ENIS, Sfax, Tunisia. She has also been an IEEE Student Member since 2014. Her research interests include Computer Vision and Image and Video Analysis. These research activities are centered around image and video-based person re-identification.

Mahmoud Mejdoub received the Engineering degree in Computer Engineering from the National Engineering School of Sfax (ENIS) in 2004, the Master degree in Novel Technologies in Dedicated Computer Systems from the National Engineering School of Sfax (ENIS) in 2005, and the joint Ph.D. degree (cotutelle) in Engineering of Computer Systems from the National Engineering School of Sfax (ENIS), Tunisia, and in Automatic, Signal and Image Processing from the University of Nice Sophia Antipolis (UNSA), France, in 2011. In 2008, he was an assistant professor at the Faculty of Sciences of Sfax (Tunisia) in the department of telecommunications and computer science. In 2015, he joined Majmaah University in Riyadh, Saudi Arabia, as an assistant professor. He is currently a research member of the Research Groups on Intelligent Machines (REGIM). His research interest focuses on Computer Vision and particularly image and video classification.

Chokri Ben Amar received the B.S. degree in Electrical Engineering from the National Engineering School
of Sfax (ENIS) in 1989, the M.S. and PhD degrees in Computer Engineering from the National Institute of
Applied Sciences in Lyon, France, in 1990 and 1994, respectively. He spent one year at the University of
“Haute Savoie” (France) as a teaching assistant and researcher before joining the higher School of Sciences
and Techniques of Tunis as Assistant Professor in 1995. In 1999, he joined the Sfax University (USS),
where he is currently a professor in the Department of Electrical Engineering of the National Engineering
School of Sfax (ENIS), and the Vice director of the REsearch Group on Intelligent Machines (REGIM).
His research interests include Computer Vision and Image and video analysis. These research activities are
centered on Wavelets and Wavelet networks and their applications to data Classification and approximation,
Pattern Recognition and image and video coding, indexing and watermarking. He is a senior member of
IEEE, and the chair of the IEEE SPS Tunisia Chapter since 2009. He was the chair of the IEEE NGNS’2011
(IEEE Third International Conference on Next Generation Networks and Services) and the Workshop on
Intelligent Machines: Theories & Applications (WIMTA 2008) and the chairman of the organizing commit-
tees of the “Traitement et Analyse de l’Information : Methodes et Applications (TAIMA 2009)” conference,
International Conference on Machine Intelligence ACIDCA-ICMI’2005 and International Conference on
Signals, Circuits and Systems SCS’2004.
