Deep Learning For Sensor-Based Activity Recognition: A Survey
Jindong Wang a,b,c, Yiqiang Chen a,b,c,∗, Shuji Hao d, Xiaohui Peng a,b,c, Lisha Hu a,b,c
a Beijing Key Laboratory of Mobile Computing and Pervasive Device, Beijing, China
b Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
c University of Chinese Academy of Sciences, Beijing, China
ABSTRACT
Sensor-based activity recognition seeks profound high-level knowledge about human activities from multitudes of low-level sensor readings. Conventional pattern recognition approaches have made tremendous progress in the past years. However, most of those approaches rely heavily on heuristic, hand-crafted feature extraction, which dramatically hinders their generalization performance. In addition, those methods often produce unsatisfactory results for unsupervised and incremental learning tasks. Meanwhile, recent advances in deep learning make it possible to perform automatic high-level feature extraction, achieving promising performance in many areas. Since then, deep learning based methods have been widely adopted for sensor-based activity recognition tasks. In this paper, we survey and highlight the recent advances of deep learning approaches for sensor-based activity recognition. Specifically, we summarize the existing literature from three aspects: sensor modality, deep model and application. We also present a detailed discussion and propose grand challenges for future directions.
Keywords: Deep learning; activity recognition; pervasive computing; pattern recognition
[Figure: Activity recognition using conventional pattern recognition approaches. Pipeline: activity signal → feature extraction (mean, variance, max, min, range, skewness; time domain and frequency domain) → model training (k-nearest neighbor, naïve Bayes, neural network, decision tree, Gaussian mixture) → activity inference (upstairs, running, riding bike, having coffee).]

We need to build a model F to predict the activity sequence based on the sensor reading s:

\hat{A} = \{\hat{A}_j\}_{j=1}^{n} = F(s), \quad \hat{A}_j \in A \qquad (3)

while the true activity sequence (ground truth) is denoted as A^{*} = \{A^{*}_j\}_{j=1}^{n}, A^{*}_j \in A.
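To make the formulation concrete, the following minimal sketch (our illustration, not code from any surveyed work) segments a continuous sensor stream into fixed-length windows and applies a model F to each window. The 128-sample window, the 50% overlap, and the stand-in model are assumptions for illustration.

import numpy as np

def sliding_windows(stream, window=128, overlap=0.5):
    # Split a (T, C) sensor stream into (N, window, C) overlapping windows.
    step = int(window * (1.0 - overlap))
    starts = range(0, stream.shape[0] - window + 1, step)
    return np.stack([stream[i:i + window] for i in starts])

def F(w):
    # Stand-in for the learned model of Eq. (3); a real F maps one window
    # of sensor readings to one activity label in the activity set A.
    return "unknown"

stream = np.random.randn(3000, 3)    # e.g., 60 s of 50 Hz tri-axial data
windows = sliding_windows(stream)    # shape: (N, 128, 3)
predicted = [F(w) for w in windows]  # the predicted activity sequence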
[Figure: Activity recognition using deep learning approaches. Pipeline: activity signal → deep feature extraction and model building (CNN, DNN, RNN, DBN) → activity inference (upstairs, running, riding bike, having coffee, watching TV).]

The deep network can also extract high-level representations in its deeper layers, which makes it more suitable for complex activity recognition tasks. What is more, deep learning models trained on a large-scale labeled dataset can usually be transferred to new tasks where there are few or no labels.

In the following sections, we mainly summarize the existing work based on the pipeline of the HAR task: (a) sensor modality, (b) deep model, and (c) application.
3. Sensor Modality

Although most approaches for HAR could be generalized to all data modalities, in practice most of them are designed for specific types. Following (Chavarriaga et al., 2013), we mainly classify the sensors into three categories: body-worn sensors, object sensors and ambient sensors. Table 1 briefly summarizes all the sensor modalities.

Table 1. Sensor modalities in the surveyed articles
Modality Description Examples
Body-worn Worn by the user to describe body movements Smartphone, watch, or band accelerometer, gyroscope, etc.
Object Attached to objects to capture object movements RFID, accelerometer on a cup, etc.
Ambient Applied in the environment to reflect user interaction Sound, door sensor, WiFi, Bluetooth, etc.
Hybrid Crossing the sensor boundary Combination of types, often deployed in smart environments

3.1. Body-worn Sensor

Body-worn sensors are among the most common modalities in HAR. These sensors are worn by the users; typical examples are the accelerometer, magnetometer and gyroscope. The acceleration and angular velocity they record change with the body's movements, so human activities can be inferred from them. Such sensors are often found on smartphones, watches, bands, glasses, and helmets.

Several deep learning approaches have been proposed based on body-worn sensors (Chen and Xue, 2015; Plötz et al., 2011; Zeng et al., 2014; Ronao and Cho, 2016; Jiang and Yin, 2015; Yang et al., 2015). Among those works, the accelerometer is the most widely adopted; the gyroscope and magnetometer are also frequently used together with it. These sensors are often exploited to recognize activities of daily living (ADL) and sports. Instead of extracting statistical and frequency features from the movement data, the original signal is directly used as input to the network.

3.2. Object Sensor

Object sensors are usually placed on objects to detect the movement of a specific object (Chavarriaga et al., 2013). Different from body-worn sensors, which capture human movements, object sensors are mainly used to detect the movement of certain objects in order to infer human activities. For instance, an accelerometer attached to a cup can be used to detect the drinking-water activity. Radio frequency identification (RFID) tags are typically used as object sensors and deployed in smart home environments (Vepakomma et al., 2015; Yang et al., 2015; Fang and Hu, 2014) and for medical activities (Li et al., 2016b; Wang et al., 2016a). RFID can provide fine-grained information for more complex activity recognition.

It should be noted that object sensors are used less than body-worn sensors due to the difficulty of their deployment. Besides, the combination of object sensors with other types is emerging in order to recognize more high-level activities (Yang, 2009).

3.3. Ambient Sensor

Ambient sensors are used to capture the interaction between humans and the environment. They are usually embedded in users' smart environments. There are many kinds of ambient sensors, such as radar, sound sensors, pressure sensors, and temperature sensors. Different from object sensors, which measure object movements, ambient sensors capture changes in the environment.

Several studies used ambient sensors to recognize daily activities and hand gestures (Lane et al., 2015; Wang et al., 2016a; Kim and Toomajian, 2016). Most of this work was tested in smart home environments. As with object sensors, the deployment of ambient sensors is also difficult. In addition, ambient sensors are easily affected by the environment, so only certain types of activities can be robustly inferred.

3.4. Hybrid Sensor

Some work combined different types of sensors for HAR. As shown in (Hayashi et al., 2015), combining acceleration with acoustic information could improve the accuracy of HAR. Ambient sensors are also used together with object sensors, so that both the object movements and the environment state can be recorded. (Vepakomma et al., 2015) designed a smart home environment called A-Wristocracy, where a large number of fine-grained and complex activities of multiple occupants can be recognized through body-worn, object and ambient sensors. It is clear that the combination of sensors can capture rich information about human activities, which also makes it promising for real smart home systems in the future.

4. Deep Model

In this section, we investigate deep learning approaches for HAR. Inspired by (Deng, 2014), we categorize the related work into three classes: generative deep architectures, discriminative deep architectures and hybrid deep architectures. Table 2 presents a brief description of these three classes of deep learning models for HAR tasks.

4.1. Discriminative Deep Architecture

Discriminative deep architectures provide discriminative power for pattern classification by characterizing the posterior distributions of classes conditioned on the visible data (Deng, 2014). Existing research mostly falls into two deep learning models: (a) the deep fully-connected network and (b) the convolutional neural network.
4.1.1. Deep Fully-connected Network

The deep fully-connected network (DFN) developed from the artificial neural network (ANN). A traditional ANN often contains very few hidden layers (shallow), while a DFN contains more (deep). With more layers, a DFN is more capable of learning from large data. The DFN usually serves as the dense layer of other deep models; in this part, we focus on the DFN as a standalone model, while the dense layer is discussed in other sections.

(Vepakomma et al., 2015) first extracted hand-engineered features from the sensors and then fed those features into a DFN model. Similarly, (Walse et al., 2016) performed PCA before using a DFN. In those works, the DFN only served as a classification model without feature extraction, so they may not generalize well; the networks were also rather shallow. (Hammerla et al., 2016) used a 5-hidden-layer DFN to perform automatic feature learning and classification with improved performance. We observe from those works that, when the HAR data is multi-dimensional and the activities are more complex, it is better to use more hidden layers to perform automatic feature learning, since their representation capability is stronger.
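As a concrete illustration, the sketch below (ours, written with PyTorch; the layer sizes, window shape and class count are assumptions, not taken from any surveyed paper) builds a DFN with several hidden layers that maps a flattened sensor window to activity classes.

import torch
import torch.nn as nn

class DFN(nn.Module):
    def __init__(self, in_dim=128 * 3, n_classes=6, hidden=256, n_hidden=5):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(n_hidden):          # "deep": several hidden layers
            layers += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        layers += [nn.Linear(d, n_classes)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):                  # x: (batch, 128, 3) sensor windows
        return self.net(x.flatten(1))      # logits over activity classes

logits = DFN()(torch.randn(8, 128, 3))    # -> shape (8, 6)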
image, which is simple and easy to implement. The disad-
4.1.2. Convolutional Neural Network vantage of this approach is the ignorance of dependencies
Convolutional Neural Network (ConvNets, or CNN) lever- between axis and sensors, which may influence the perfor-
ages three important ideas: sparse interactions, parameter shar- mance.
ing and equivariant representations (LeCun et al., 2015). After • Model-driven approach resizes the inputs to a virtual 2D
convolution, there are usually pooling and fully-connected lay- image so as to adopt a 2D convolution. This approach usu-
ers, which perform classification or regression tasks. ally pertains to non-trivial input tuning techniques. (Ha
CNN is competent to extract features from signals and it et al., 2015) combined all axis to form an image, while
has achieved promising results in image classification, speech (Jiang and Yin, 2015) designed a more complex algorithm
recognition and text analysis. When applied to time series clas- to transform the time series into an image. In (Singh et al.,
sification like HAR, CNN has two advantages over other mod- 2017), pressure sensor data was transformed to image via
els: local dependency and scale invariance. Local dependency modality transformation. Other similar work include (Ravi
means the nearby signals in HAR are likely to be correlated, et al., 2016; Li et al., 2016b). This model-driven approach
while scale invariance refers to the scale-invariant for different can make use of the temporal correlation of sensor. But the
paces or frequencies. Due to the effectiveness of CNN, most of map of time series to image is non-trivial task and needs
the surveyed work are focused on this area. domain knowledge.
When applying CNN to HAR, there are several aspects that
need to be considered: input adaptation, pooling and weight- 2) Pooling. The convolution-pooling combination is com-
sharing. mon in CNN, and most approaches performed max or average
1) Input adaptation. Unlike images, most HAR sensors pro- pooling after convolution (Ha et al., 2015; Kim and Toomajian,
duce time series readings like acceleration signal, which is tem- 2016; Pourbabaee et al., 2017) except work (Mohammed and
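The following minimal sketch (our PyTorch illustration of the data-driven adaptation; kernel sizes, channel counts and window length are assumptions) treats each sensor axis as an input channel and applies 1D convolution along time, followed by pooling and a dense classifier.

import torch
import torch.nn as nn

class ConvHAR(nn.Module):
    def __init__(self, n_axes=3, n_classes=6, window=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_axes, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),                       # convolution-pooling pair
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.classifier = nn.Linear(64 * (window // 4), n_classes)

    def forward(self, x):          # x: (batch, window, n_axes)
        x = x.transpose(1, 2)      # -> (batch, n_axes, window): axes as channels
        return self.classifier(self.features(x).flatten(1))

logits = ConvHAR()(torch.randn(8, 128, 3))  # -> shape (8, 6)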
2) Pooling. The convolution-pooling combination is common in CNN, and most approaches performed max or average pooling after convolution (Ha et al., 2015; Kim and Toomajian, 2016; Pourbabaee et al., 2017), with the exception of (Mohammed and Tashev, 2017). Pooling can also speed up the training process on large data.

3) Weight-sharing. Weight sharing (Zebin et al., 2016; Sathyanarayana et al., 2016) is an efficient scheme to speed up the training process on a new task. (Zeng et al., 2014) utilized a relaxed partial weight-sharing technique, since the same signal appearing in different units may behave differently. (Ha and Choi, 2016) adopted CNN-pf and CNN-pff structures to investigate the performance of different weight-sharing schemes. Those works show that partial weight-sharing can improve the performance of CNN.
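In the spirit of partial weight-sharing, the hedged sketch below (our illustration; it does not reproduce the exact CNN-pf/CNN-pff structures of (Ha and Choi, 2016)) keeps an unshared convolution per sensor while sharing the upper dense layer across sensors.

import torch
import torch.nn as nn

class PartialSharingCNN(nn.Module):
    def __init__(self, n_sensors=2, n_axes=3, n_classes=6, window=128):
        super().__init__()
        # Unshared: one convolution per sensor, since the same pattern may
        # behave differently in different sensors.
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Conv1d(n_axes, 16, 5, padding=2), nn.ReLU(),
                          nn.MaxPool1d(2))
            for _ in range(n_sensors))
        # Shared: a common classifier over the concatenated branch outputs.
        self.head = nn.Linear(n_sensors * 16 * (window // 2), n_classes)

    def forward(self, xs):  # xs: list of (batch, n_axes, window), one per sensor
        feats = [b(x).flatten(1) for b, x in zip(self.branches, xs)]
        return self.head(torch.cat(feats, dim=1))

acc, gyro = torch.randn(8, 3, 128), torch.randn(8, 3, 128)
logits = PartialSharingCNN()([acc, gyro])   # -> shape (8, 6)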
4.2. Generative Deep Architecture

Generative deep architectures aim to build models that characterize the joint distributions of the visible data and their classes. Although there are many types of deep models under this architecture, we mainly introduce three popular ones adopted in HAR tasks: (a) the autoencoder, (b) the restricted Boltzmann machine and (c) the recurrent neural network.

4.2.1. Autoencoder

The autoencoder learns a latent representation of the input values through its hidden layers, which can be considered an encoding-decoding procedure. The purpose of the autoencoder is to learn more advanced feature representations via an unsupervised learning scheme. A stacked autoencoder (SAE) is a stack of autoencoders, which treats every layer as a basic autoencoder; after several rounds of training, the learned features are combined with the labels to form a classifier.

(Almaslukh et al., 2017; Wang et al., 2016a) used the SAE for HAR, where they first adopted greedy layer-wise pre-training (Hinton et al., 2006) and then performed fine-tuning. Compared to those works, (Li et al., 2014) investigated the sparse autoencoder by adding a KL-divergence term and noise to the cost function, which indicates that adding sparsity constraints can improve HAR performance. The advantage of the SAE is that it can perform unsupervised feature learning for HAR, which makes it a powerful tool for feature extraction; however, it depends heavily on its layer structure and activation functions, for which the optimal choices may be hard to find.
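A minimal autoencoder sketch for unsupervised feature learning on sensor windows is given below (our PyTorch illustration; the code dimension and training schedule are assumptions). Training minimizes the reconstruction error, after which the encoder output serves as the learned features; stacking such encoders greedily yields an SAE.

import torch
import torch.nn as nn

in_dim = 128 * 3                     # flattened window: 128 samples x 3 axes
encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
decoder = nn.Sequential(nn.Linear(64, in_dim))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

x = torch.randn(32, in_dim)          # a batch of unlabeled windows
for _ in range(100):                 # unsupervised pre-training
    recon = decoder(encoder(x))      # encode, then decode
    loss = nn.functional.mse_loss(recon, x)
    opt.zero_grad(); loss.backward(); opt.step()

features = encoder(x).detach()       # learned codes used as features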
4.2.2. Restricted Boltzmann Machine

The restricted Boltzmann machine (RBM) is a bipartite, fully-connected, undirected graph consisting of a visible layer and a hidden layer (Hinton et al., 2006). Stacked RBMs form the deep belief network (DBN), which treats every two consecutive layers as an RBM; the DBN/RBM is often followed by fully-connected layers.

In pre-training, most works applied a Gaussian RBM in the first layer and binary RBMs for the rest of the layers (Plötz et al., 2011; Hammerla et al., 2015; Lane et al., 2015). For multi-modal sensors, (Radu et al., 2016) designed a multi-modal RBM, where one RBM is constructed for each sensor modality and the outputs of all modalities are then unified. (Li et al., 2016a) added pooling after the fully-connected layers to extract the important features. (Fang and Hu, 2014) used a contrastive gradient (CG) method to update the weights in fine-tuning, which helps the network search and converge quickly in all directions. (Zhang et al., 2015b) further implemented an RBM on a mobile phone for offline training, indicating that RBMs can be very light-weight. Similar to the autoencoder, the RBM/DBN can also perform unsupervised feature learning for HAR.
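For reference, the sketch below shows one contrastive-divergence (CD-1) update for a binary RBM on binarized sensor features; it is our NumPy illustration of the standard algorithm, not a specific surveyed implementation, and all sizes are assumptions.

import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid, lr = 384, 64, 0.01      # 384 = one flattened 128 x 3 window
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b_v, b_h = np.zeros(n_vis), np.zeros(n_hid)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v0 = (rng.random((32, n_vis)) > 0.5).astype(float)  # batch of visible units

# Positive phase: hidden activations driven by the data.
h0 = sigmoid(v0 @ W + b_h)
h0_sample = (rng.random(h0.shape) < h0).astype(float)
# Negative phase: one Gibbs step reconstructs the visible layer.
v1 = sigmoid(h0_sample @ W.T + b_v)
h1 = sigmoid(v1 @ W + b_h)
# Parameter updates follow the difference between the two phases.
W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
b_v += lr * (v0 - v1).mean(axis=0)
b_h += lr * (h0 - h1).mean(axis=0)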
4.2.3. Recurrent Neural Network

The recurrent neural network (RNN) is widely used in speech recognition and natural language processing, where it exploits the temporal correlations between neurons. Long short-term memory (LSTM) cells are often combined with RNNs, with the LSTM serving as memory units trained through gradient descent. Few works have applied RNNs to HAR tasks (Hammerla et al., 2016; Inoue et al., 2016; Edel and Köppe, 2016; Guan and Ploetz, 2017), where learning speed and resource consumption are the main concerns. (Inoue et al., 2016) first investigated several model parameters and then proposed a relatively good model that can perform HAR with high throughput. (Edel and Köppe, 2016) proposed a binarized-BLSTM-RNN model, in which the weight parameters, inputs, and outputs of all hidden layers are binary values. The main line of RNN-based HAR models deals with resource-constrained environments while still achieving good performance.
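A minimal LSTM-based recognizer is sketched below (our PyTorch illustration; sizes are assumptions): the recurrent layer consumes the window step by step, and the final hidden state is classified.

import torch
import torch.nn as nn

class LSTMHAR(nn.Module):
    def __init__(self, n_axes=3, n_classes=6, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_axes, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):             # x: (batch, time, n_axes)
        out, _ = self.lstm(x)         # out: (batch, time, hidden)
        return self.head(out[:, -1])  # classify from the last time step

logits = LSTMHAR()(torch.randn(8, 128, 3))   # -> shape (8, 6)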
4.3. Hybrid Deep Architecture

A hybrid deep architecture is the combination of discriminative and generative models (Deng, 2014).

One emerging hybrid model is the combination of CNN and RNN. (Ordóñez and Roggen, 2016; Yao et al., 2017) provided good examples of how to combine them. It is shown in (Ordóñez and Roggen, 2016) that 'CNN + recurrent dense layers' outperforms 'CNN + dense layers'; similar results are reported in (Singh et al., 2017). The reason is that CNN is able to capture spatial relationships, while RNN can make use of temporal relationships; combining them enhances the ability to recognize activities with varied time spans and signal distributions. Other work combined CNN with models such as the SAE (Zheng et al., 2016) and the RBM (Liu et al., 2016), where the CNN performs feature extraction and the generative model helps speed up the training process. In the future, we expect more research in this area.
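The sketch below illustrates the idea of 'CNN + recurrent layers' (our hedged PyTorch illustration, not the exact architecture of (Ordóñez and Roggen, 2016)): 1D convolutions extract local features, and an LSTM models their temporal order before classification.

import torch
import torch.nn as nn

class ConvLSTM(nn.Module):
    def __init__(self, n_axes=3, n_classes=6, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_axes, 32, 5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 32, 5, padding=2), nn.ReLU())
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                 # x: (batch, time, n_axes)
        f = self.conv(x.transpose(1, 2))  # (batch, 32, time) local features
        out, _ = self.lstm(f.transpose(1, 2))
        return self.head(out[:, -1])

logits = ConvLSTM()(torch.randn(8, 128, 3))  # -> shape (8, 6)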
5. Applications

HAR is usually not the final goal of an application; rather, it serves as an important step in many applications such as skill assessment and smart home assistants. In this section, we survey deep learning based HAR from the application perspective.

Most of the surveyed work focused on recognizing activities of daily living (ADL) and sports (Zeng et al., 2014; Chen and Xue, 2015; Ronao and Cho, 2016; Ravì et al., 2017). Those activities consist of simple movements that are easily captured by body-worn sensors. Some research studied people's lifestyle, such as sleep (Sathyanarayana et al., 2016) and respiration (Khan et al., 2017; Hannink et al., 2017). The detection of such activities often requires object and ambient sensors such as WiFi and sound, which are rather different from ADL.

It is a developing trend to apply HAR to health and disease issues. Some pioneering work has been done with regard to Parkinson's disease (Hammerla et al., 2015), trauma resuscitation (Li et al., 2016a,b) and paroxysmal atrial fibrillation (PAF) (Pourbabaee et al., 2017). Disease issues are always related to changes in certain body movements or functions, so they can be detected using corresponding sensors. Under those circumstances, a thorough understanding of the disease and the activity is required. Moreover, it is critical to use the right sensors. For instance, Parkinson's disease is related to the freezing of gait, which can be reflected by sensors attached to the patient's shoes (Hammerla et al., 2015).

Other than health and disease, the recognition of high-level activities helps to learn more resourceful information for HAR. Movement, behavior, environment, emotion and thought are all critical parts of recognizing high-level activities. However, most work only focused on body movements in smart homes (Vepakomma et al., 2015; Fang and Hu, 2014), which is not enough to recognize high-level activities. For instance, (Vepakomma et al., 2015) combined activity and environment signals to recognize activities in a smart home, but the activities are constrained to body movements, without further information on user emotion and state, which is also important. In the future, we expect more research in this area.

Benchmark datasets: We extensively explored the benchmark datasets for deep learning based HAR. There are two types of data acquisition schemes: self data collection and public datasets.

Self data collection: Some works performed their own data collection (Chen and Xue, 2015; Zhang et al., 2015b; Bhattacharya and Lane, 2016; Zhang et al., 2015a). Considerable effort is required for self data collection, and it is rather tedious to process the collected data.

Public datasets: There are already many public datasets for HAR, which are adopted by most researchers (Plötz et al., 2011; Ravi et al., 2016; Hammerla et al., 2016). Summarizing the literature, we present the widely used public datasets in Table 3.

Table 3. Public HAR datasets, where A=accelerometer, G=gyroscope, M=magnetometer, O=object sensor, AM=ambient sensor, ECG=electrocardiograph.
ID Dataset Type #Subject S. Rate #Activity #Sample Sensor Reference
D01 OPPORTUNITY ADL 4 32 Hz 16 701,366 A, G, M, O, AM (Ordóñez and Roggen, 2016)
D02 Skoda Checkpoint Factory 1 96 Hz 10 22,000 A (Plötz et al., 2011)
D03 UCI Smartphone ADL 30 50 Hz 6 10,299 A, G (Almaslukh et al., 2017)
D04 PAMAP2 ADL 9 100 Hz 18 2,844,868 A, G, M (Zheng et al., 2014)
D05 USC-HAD ADL 14 100 Hz 12 2,520,000 A, G (Jiang and Yin, 2015)
D06 WISDM ADL 29 20 Hz 6 1,098,207 A (Alsheikh et al., 2016)
D07 DSADS ADL 8 25 Hz 19 1,140,000 A, G, M (Zhang et al., 2015c)
D08 Ambient kitchen Food preparation 20 40 Hz 2 55,000 O (Plötz et al., 2011)
D09 Darmstadt Daily Routines ADL 1 100 Hz 35 24,000 A (Plötz et al., 2011)
D10 Actitracker ADL 36 20 Hz 6 2,980,765 A (Zeng et al., 2014)
D11 SHO ADL 10 50 Hz 7 630,000 A, G, M (Jiang and Yin, 2015)
D12 BIDMC Heart failure 15 125 Hz 2 >20,000 ECG (Zheng et al., 2014)
D13 MHEALTH ADL 10 50 Hz 12 16,740 A, C, G (Ha and Choi, 2016)
D14 Daphnet Gait Gait 10 64 Hz 2 1,917,887 A (Hammerla et al., 2016)
D15 ActiveMiles ADL 10 50-200 Hz 7 4,390,726 A (Ravì et al., 2017)
D16 HASC ADL 1 200 Hz 13 NA A (Hayashi et al., 2015)
D17 PAF PAF 48 128 Hz 2 230,400 ECG (Pourbabaee et al., 2017)
D18 ActRecTut Gesture 2 32 Hz 12 102,613 A, G (Yang et al., 2015)
D19 Heterogeneous ADL 9 100-200 Hz 6 43,930,257 A, G (Yao et al., 2017)
6. Summary and Discussion

Table 4 presents all the surveyed works in this article. We can make several observations based on the table.
Table 4. Summary of existing works based on the three aspects: sensor modality, deep model and application (in reference order)
Reference Sensor Modality Deep Model Application Dataset
(Almaslukh et al., 2017) Body-worn SAE ADL D03
(Alsheikh et al., 2016) Body-worn RBM ADL, factory, Parkinson D02, D06, D14
(Bhattacharya and Lane, 2016) Body-worn, ambient RBM Gesture, ADL, transportation Self, D01
(Chen and Xue, 2015) Body-worn CNN ADL Self
(Chen et al., 2016b) Body-worn CNN ADL D06
(Edel and Köppe, 2016) Body-worn RNN ADL D01, D04, Self
(Fang and Hu, 2014) Object, ambient DBN ADL Self
(Gjoreski et al., 2016) Body-worn CNN ADL Self, D01
(Guan and Ploetz, 2017) Body-worn, object, ambient RNN ADL, smart home D01, D02, D04
(Ha et al., 2015) Body-worn CNN Factory, health D02, D13
(Ha and Choi, 2016) Body-worn CNN ADL, health D13
(Hammerla et al., 2015) Body-worn RBM Parkinson Self
(Hammerla et al., 2016) Body-worn, object, ambient DFN, CNN, RNN ADL, smart home, gait D01, D04, D14
(Hannink et al., 2017) Body-worn CNN Gait Self
(Hayashi et al., 2015) Body-worn, ambient RBM ADL, smart home D16
(Inoue et al., 2016) Body-worn RNN ADL D16
(Jiang and Yin, 2015) Body-worn CNN ADL D03, D05, D11
(Khan et al., 2017) Ambient CNN Respiration Self
(Kim and Toomajian, 2016) Ambient CNN Hand gesture Self
(Kim and Li, 2017) Body-worn CNN ADL Self
(Lane and Georgiev, 2015) Body-worn, ambient RBM ADL, emotion Self
(Lane et al., 2015) Ambient RBM ADL Self
(Lee et al., 2017) Body-worn CNN ADL Self
(Li et al., 2016a) Object RBM Patient resuscitation Self
(Li et al., 2016b) Object CNN Patient resuscitation Self
(Li et al., 2014) Body-worn SAE ADL D03
(Liu et al., 2016) Body-worn CNN, RBM ADL Self
(Mohammed and Tashev, 2017) Body-worn CNN ADL, gesture Self
(Morales and Roggen, 2016) Body-worn CNN ADL, smart home D01, D02
(Ordóñez and Roggen, 2016) Body-worn CNN, RNN ADL, gesture, posture, factory D01, D02
(Plötz et al., 2011) Body-worn, object RBM ADL, food preparation, factory D01, D02, D08, D14
(Pourbabaee et al., 2017) Body-worn CNN PAF disease D17
(Radu et al., 2016) Body-worn RBM ADL D19
(Ravi et al., 2016) Body-worn CNN ADL, factory D02, D06, D14, D15
(Ravı̀ et al., 2017) Body-worn CNN ADL, factory, Parkinson D02, D06, D14, D15
(Ronao and Cho, 2015a,b, 2016) Body-worn CNN ADL D03
(Sathyanarayana et al., 2016) Body-worn CNN, RNN, DFN ADL, sleep Self
(Singh et al., 2017) Ambient CNN, RNN Gait NA
(Vepakomma et al., 2015) Body-worn, object, ambient DFN ADL Self
(Walse et al., 2016) Body-worn DFN ADL D03
(Wang et al., 2016b) Body-worn, ambient CNN ADL, location Self
(Wang et al., 2016a) Object, ambient SAE ADL NA
(Yang et al., 2015) Body-worn, object, ambient CNN ADL, smart home, gesture D01, D18
(Yao et al., 2017) Body-worn, object CNN, RNN Cartrack, ADL Self, D19
(Zebin et al., 2016) Body-worn CNN ADL Self
(Zeng et al., 2014) Body-worn, ambient, object CNN ADL, smart home, factory D01, D02, D10
(Zhang et al., 2015a) Body-worn DFN ADL Self
(Zhang et al., 2015b) Body-worn RBM ADL Self
(Zhang et al., 2015c) Body-worn DBN ADL, smart home D01, D05, D07
(Zheng et al., 2016) Body-worn CNN, SAE ADL D04
(Zheng et al., 2014) Body-worn CNN ADL, heart failure D04, D14
1) Sensor usage. Choosing suitable sensors is critical for successful HAR. In the surveyed literature, body-worn sensors are the most common modality, and the accelerometer is the most used. The reasons are twofold: firstly, many wearable devices such as smartphones and watches are equipped with accelerometers, which are easy to access; secondly, the accelerometer is competent to recognize many types of daily activities, since most of them are simple body movements. Compared to body-worn sensors, object and ambient sensors are better at recognizing activities related to context and environment, such as having coffee. Therefore, it is suggested to use body-worn sensors (mostly accelerometer + gyroscope) for ADL and sports activities. If the activities carry semantic meaning beyond simple body movements, it is better to also combine object and ambient sensors. In addition, there are few public datasets for object and ambient sensors, probably because of privacy issues and the deployment difficulty of the data collecting systems; we therefore expect more open datasets from object and ambient sensors.

2) Model selection. Several deep models are surveyed in this article, and a natural question arises: which model is the best for HAR? (Hammerla et al., 2016) did early work by investigating the performance of DNN, CNN and RNN through 4,000 experiments on some public HAR datasets. We combine their work and our own exploration to draw some conclusions: RNN is recommended for recognizing short activities that have a natural order, while CNN is better at inferring long-term repetitive activities. The reason is that RNN can make use of the time-order relationship between sensor readings, while CNN is more capable of learning deep features contained in recursive patterns. For multi-modal signals, it is better to use CNN, since the features can be integrated through multi-channel convolutions. When adapting CNN, data-driven approaches are better than model-driven approaches, as the inner properties of the activity signal can be better exploited when the input data are transformed into a virtual image. Multiple convolution and pooling operations also help CNN perform better. RBMs and autoencoders are usually pre-trained before being fine-tuned, and multi-layer RBMs or SAEs are preferred for more accurate recognition.

Technically, there is no model that outperforms all the others in all situations, so it is recommended to choose models based on the scenario. To better illustrate the performance of some deep models, Table 5 compares results of existing works on the public datasets of Table 3. In the Skoda and UCI Smartphone protocols, CNN achieves the best performance; in the two OPPORTUNITY protocols, DBN and RNN outperform the others. This confirms that no single model achieves the best performance on every dataset.

1 OPP 1 follows the protocol in (Hammerla et al., 2016); OPP 2 follows the protocol in (Plötz et al., 2011); Skoda follows the protocol in (Zeng et al., 2014); UCI Smartphone follows the protocol in (Ronao and Cho, 2016). In OPP 1, the weighted F1-score is used; in OPP 2, Skoda and UCI Smartphone, accuracy is used.
7. Grand Challenges
Despite the progress of previous work, there are still some challenges for deep learning based HAR. In this section, we present those challenges and propose some feasible solutions.

A. Online and mobile deep activity recognition. Two critical issues remain for deep HAR: online deployment and mobile application. Although some existing work adopted deep HAR on smartphones (Lane et al., 2015) and watches (Bhattacharya and Lane, 2016), they are still far from online and mobile deployment, because the model is often trained offline on a remote server and the mobile device only uses the trained model. This approach is neither real-time nor friendly to incremental learning. There are two approaches to tackle this problem: reducing the communication cost between mobile and server, and enhancing the computing ability of mobile devices.

B. More accurate unsupervised activity recognition. The performance of deep learning still relies heavily on labeled samples. Acquiring sufficient activity labels is expensive and time-consuming, so unsupervised activity recognition is an urgent topic for the future.

• Take advantage of the crowd. The latest research indicates that exploiting knowledge from the crowd can facilitate the task (Prelec et al., 2017). Crowd-sourcing takes advantage of the crowd to annotate the unlabeled activities. In addition to acquiring labels passively, researchers …

• Exploit context information. Context is any information that can be used to characterize the situation of an entity (Abowd et al., 1999). Context information such as WiFi, Bluetooth and GPS can be used to infer more environmental knowledge about the activity. The exploitation of resourceful context information will greatly help to recognize the user state as well as more specific activities.

D. Light-weight deep models. Deep models often require substantial computing resources, which are not available on wearable devices. In addition, the models are often trained off-line and cannot be executed in real-time. However, less complex models such as shallow NNs and conventional PR methods cannot achieve good performance. Therefore, it is necessary to develop light-weight deep models to perform HAR.

• Combination of human-crafted and deep features. Recent work indicated that human-crafted and deep features together can achieve better performance (Plötz et al., 2011). Pre-knowledge about the activity will greatly contribute to more robust feature learning in deep models (Stewart and Ermon, 2016). Researchers should consider the possibility of applying both kinds of features to HAR, combining human experience with machine intelligence.

• Collaboration of deep and shallow models. Deep models have powerful learning abilities, while shallow models are more efficient. The collaboration of these two models has the potential to yield both accurate and light-weight HAR. Several issues, such as how to share parameters between deep and shallow models, remain to be addressed.
E. Non-invasive activity sensing. Traditional activity collection strategies need to be updated with more non-invasive approaches, which collect information and infer activities without disturbing the subjects and require more flexible computing resources.

• Opportunistic activity sensing with deep learning. Opportunistic sensing can dynamically harness non-continuous activity signals to accomplish activity inference (Chen et al., 2016a). In this scenario, the back propagation of deep models should be carefully designed.

F. Beyond activity recognition: assessment and assistance. Recognizing activities is often only the initial step of an application. For instance, professional skill assessment is required in fitness exercises, and smart home assistants play an important role in healthcare services. There is some early work on climbing assessment (Khan et al., 2015). With the advancement of deep learning, more applications should be developed that go beyond mere recognition.

8. Conclusion

In this paper, we surveyed the recent advances of deep learning approaches for sensor-based activity recognition. Compared to traditional pattern recognition methods, deep learning reduces the dependency on human-crafted feature extraction and achieves better performance by automatically learning high-level representations of the sensor readings. We highlighted the recent progress in three important categories: sensor modality, deep model and application. Subsequently, we summarized and discussed the surveyed research. Finally, several grand challenges were presented for future research, and some feasible solutions were proposed.

Acknowledgments

This work is supported in part by the National Key Research & Development Program of China (No. 2016YFB1001200), the Natural Science Foundation of China (No. 61572471), the Chinese Academy of Sciences Research Equipment Development Project (No. YZ201527), and the Science and Technology Planning Project of Guangdong Province (No. 2015B010105001).
References

Abowd, G.D., Dey, A.K., Brown, P.J., Davies, N., Smith, M., Steggles, P., 1999. Towards a better understanding of context and context-awareness, in: International Symposium on Handheld and Ubiquitous Computing, Springer. pp. 304–307.
Almaslukh, B., AlMuhtadi, J., Artoli, A., 2017. An effective deep autoencoder approach for online smartphone-based human activity recognition. International Journal of Computer Science and Network Security (IJCSNS) 17, 160.
Alsheikh, M.A., Selim, A., Niyato, D., Doyle, L., Lin, S., Tan, H.P., 2016. Deep activity recognition models with triaxial accelerometers, in: Workshops at the Thirtieth AAAI Conference on Artificial Intelligence.
Bao, L., Intille, S.S., 2004. Activity recognition from user-annotated acceleration data, in: International Conference on Pervasive Computing, Springer. pp. 1–17.
Bengio, Y., 2013. Deep learning of representations: Looking forward, in: International Conference on Statistical Language and Speech Processing, Springer. pp. 1–37.
Bhattacharya, S., Lane, N.D., 2016. From smart to deep: Robust activity recognition on smartwatches using deep learning, in: 2016 IEEE International Conference on Pervasive Computing and Communication Workshops (PerCom Workshops), IEEE. pp. 1–6.
Bulling, A., Blanke, U., Schiele, B., 2014. A tutorial on human activity recognition using body-worn inertial sensors. ACM Computing Surveys (CSUR) 46, 33.
Chavarriaga, R., Sagha, H., Calatroni, A., Digumarti, S.T., Tröster, G., Millán, J.d.R., Roggen, D., 2013. The Opportunity challenge: A benchmark database for on-body sensor-based activity recognition. Pattern Recognition Letters 34, 2033–2042.
Chen, Y., Gu, Y., Jiang, X., Wang, J., 2016a. Ocean: a new opportunistic computing model for wearable activity recognition, in: Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct, ACM. pp. 33–36.
Chen, Y., Xue, Y., 2015. A deep learning approach to human activity recognition based on single accelerometer, in: Systems, Man, and Cybernetics (SMC), 2015 IEEE International Conference on, IEEE. pp. 1488–1492.
Chen, Y., Zhong, K., Zhang, J., Sun, Q., Zhao, X., 2016b. LSTM networks for mobile human activity recognition.
Cook, D., Feuz, K.D., Krishnan, N.C., 2013. Transfer learning for activity recognition: A survey. Knowledge and Information Systems 36, 537–556.
Deng, L., 2014. A tutorial survey of architectures, algorithms, and applications for deep learning. APSIPA Transactions on Signal and Information Processing 3, e2.
Edel, M., Köppe, E., 2016. Binarized-BLSTM-RNN based human activity recognition, in: Indoor Positioning and Indoor Navigation (IPIN), 2016 International Conference on, IEEE. pp. 1–7.
Fang, H., Hu, C., 2014. Recognizing human activity in smart home using deep learning algorithm, in: Control Conference (CCC), 2014 33rd Chinese, IEEE. pp. 4716–4720.
Gjoreski, H., Bizjak, J., Gjoreski, M., Gams, M., 2016. Comparing deep and classical machine learning methods for human activity recognition using wrist accelerometer, in: IJCAI-16 Workshop on Deep Learning for Artificial Intelligence (DLAI).
Guan, Y., Ploetz, T., 2017. Ensembles of deep LSTM learners for activity recognition using wearables. arXiv preprint arXiv:1703.09370.
Ha, S., Choi, S., 2016. Convolutional neural networks for human activity recognition using multiple accelerometer and gyroscope sensors, in: Neural Networks (IJCNN), 2016 International Joint Conference on, IEEE. pp. 381–388.
Ha, S., Yun, J.M., Choi, S., 2015. Multi-modal convolutional neural networks for activity recognition, in: Systems, Man, and Cybernetics (SMC), 2015 IEEE International Conference on, IEEE. pp. 3017–3022.
Hammerla, N.Y., Fisher, J., Andras, P., Rochester, L., Walker, R., Plötz, T., 2015. PD disease state assessment in naturalistic environments using deep learning, in: Twenty-Ninth AAAI Conference on Artificial Intelligence.
Hammerla, N.Y., Halloran, S., Ploetz, T., 2016. Deep, convolutional, and recurrent models for human activity recognition using wearables, in: IJCAI.
Hannink, J., Kautz, T., Pasluosta, C.F., Gaßmann, K.G., Klucken, J., Eskofier, B.M., 2017. Sensor-based gait parameter extraction with deep convolutional neural networks. IEEE Journal of Biomedical and Health Informatics 21, 85–93.
Hayashi, T., Nishida, M., Kitaoka, N., Takeda, K., 2015. Daily activity recognition based on DNN using environmental sound and acceleration signals, in: Signal Processing Conference (EUSIPCO), 2015 23rd European, IEEE. pp. 2306–2310.
Hinton, G.E., Osindero, S., Teh, Y.W., 2006. A fast learning algorithm for deep belief nets. Neural Computation 18, 1527–1554.
Hu, L., Chen, Y., Wang, S., Wang, J., Shen, J., Jiang, X., Shen, Z., 2016. Less annotation on personalized activity recognition using context data, in: Proceedings of the 2016 International IEEE Conference on Ubiquitous Intelligence Computing (UIC), pp. 327–332.
Inoue, M., Inoue, S., Nishida, T., 2016. Deep recurrent neural network for mobile human activity recognition with high throughput. arXiv preprint arXiv:1611.03607.
Jiang, W., Yin, Z., 2015. Human activity recognition using wearable sensors by deep convolutional neural networks, in: Proceedings of the 23rd ACM International Conference on Multimedia, ACM. pp. 1307–1310.
Khan, A., Mellor, S., Berlin, E., Thompson, R., McNaney, R., Olivier, P., Plötz, T., 2015. Beyond activity recognition: skill assessment from accelerometer data, in: Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, ACM. pp. 1155–1166.
Khan, U.M., Kabir, Z., Hassan, S.A., Ahmed, S.H., 2017. A deep learning framework using passive wifi sensing for respiration monitoring. arXiv preprint arXiv:1704.05708.
Kim, Y., Li, Y., 2017. Human activity classification with transmission and reflection coefficients of on-body antennas through deep convolutional neural networks. IEEE Transactions on Antennas and Propagation 65, 2764–2768.
Kim, Y., Toomajian, B., 2016. Hand gesture recognition using micro-doppler signatures with convolutional neural network. IEEE Access.
Lane, N.D., Georgiev, P., 2015. Can deep learning revolutionize mobile sensing?, in: Proceedings of the 16th International Workshop on Mobile Computing Systems and Applications, ACM. pp. 117–122.
Lane, N.D., Georgiev, P., Qendro, L., 2015. DeepEar: robust smartphone audio sensing in unconstrained acoustic environments using deep learning, in: Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, ACM. pp. 283–294.
Lara, O.D., Labrador, M.A., 2013. A survey on human activity recognition using wearable sensors. IEEE Communications Surveys & Tutorials 15, 1192–1209.
LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521, 436–444.
Lee, S.M., Yoon, S.M., Cho, H., 2017. Human activity recognition from accelerometer data using convolutional neural network, in: Big Data and Smart Computing (BigComp), 2017 IEEE International Conference on, IEEE. pp. 131–134.
Li, X., Zhang, Y., Li, M., Marsic, I., Yang, J., Burd, R.S., 2016a. Deep neural network for RFID based activity recognition, in: Wireless of the Students, by the Students, and for the Students (S3) Workshop with MobiCom.
Li, X., Zhang, Y., Marsic, I., Sarcevic, A., Burd, R.S., 2016b. Deep learning for RFID-based activity recognition, in: Proceedings of the 14th ACM Conference on Embedded Network Sensor Systems, ACM. pp. 164–175.
Li, Y., Shi, D., Ding, B., Liu, D., 2014. Unsupervised feature learning for human activity recognition using smartphone sensors, in: Mining Intelligence and Knowledge Exploration. Springer, pp. 99–107.
Liu, C., Zhang, L., Liu, Z., Liu, K., Li, X., Liu, Y., 2016. Lasagna: towards deep hierarchical understanding and searching over mobile sensing data, in: Proceedings of the 22nd Annual International Conference on Mobile Computing and Networking, ACM. pp. 334–347.
Mohammed, S., Tashev, I., 2017. Unsupervised deep representation learning to remove motion artifacts in free-mode body sensor networks, in: Wearable and Implantable Body Sensor Networks (BSN), 2017 IEEE 14th International Conference on, IEEE. pp. 183–188.
Morales, F.J.O., Roggen, D., 2016. Deep convolutional feature transfer across mobile activity recognition domains, sensor modalities and locations, in: Proceedings of the 2016 ACM International Symposium on Wearable Computers, ACM. pp. 92–99.
Ordóñez, F.J., Roggen, D., 2016. Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition. Sensors 16, 115.
Pan, S.J., Yang, Q., 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22, 1345–1359.
Plötz, T., Hammerla, N.Y., Olivier, P., 2011. Feature learning for activity recognition in ubiquitous computing, in: IJCAI Proceedings-International Joint Conference on Artificial Intelligence, p. 1729.
Pourbabaee, B., Roshtkhari, M.J., Khorasani, K., 2017. Deep convolutional neural networks and learning ECG features for screening paroxysmal atrial fibrillation patients. IEEE Transactions on Systems, Man, and Cybernetics: Systems.
Prelec, D., Seung, H.S., McCoy, J., 2017. A solution to the single-question crowd wisdom problem. Nature 541, 532–535.
Qin, J., Liu, L., Zhang, Z., Wang, Y., Shao, L., 2016. Compressive sequential learning for action similarity labeling. IEEE Transactions on Image Processing 25, 756–769.
Radu, V., Lane, N.D., Bhattacharya, S., Mascolo, C., Marina, M.K., Kawsar, F., 2016. Towards multimodal deep learning for activity recognition on mobile devices, in: Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct, ACM. pp. 185–188.
Ravi, D., Wong, C., Lo, B., Yang, G.Z., 2016. Deep learning for human activity recognition: A resource efficient implementation on low-power devices, in: Wearable and Implantable Body Sensor Networks (BSN), 2016 IEEE 13th International Conference on, IEEE. pp. 71–76.
Ravì, D., Wong, C., Lo, B., Yang, G.Z., 2017. A deep learning approach to on-node sensor data analytics for mobile or wearable devices. IEEE Journal of Biomedical and Health Informatics 21, 56–64.
Ronao, C.A., Cho, S.B., 2015a. Deep convolutional neural networks for human activity recognition with smartphone sensors, in: International Conference on Neural Information Processing, Springer. pp. 46–53.
Ronao, C.A., Cho, S.B., 2015b. Evaluation of deep convolutional neural network architectures for human activity recognition with smartphone sensors, in: Proc. of the KIISE Korea Computer Congress, pp. 858–860.
Ronao, C.A., Cho, S.B., 2016. Human activity recognition with smartphone sensors using deep learning neural networks. Expert Systems with Applications 59, 235–244.
Sathyanarayana, A., Joty, S., Fernandez-Luque, L., Ofli, F., Srivastava, J., Elmagarmid, A., Taheri, S., Arora, T., 2016. Impact of physical activity on sleep: A deep learning based exploration. arXiv preprint arXiv:1607.07034.
Schmidhuber, J., 2015. Deep learning in neural networks: An overview. Neural Networks 61, 85–117.
Singh, M.S., Pondenkandath, V., Zhou, B., Lukowicz, P., Liwicki, M., 2017. Transforming sensor data to the image domain for deep learning - an application to footstep detection. arXiv preprint arXiv:1701.01077.
Stewart, R., Ermon, S., 2016. Label-free supervision of neural networks with physics and domain knowledge. arXiv preprint arXiv:1609.05566.
Vepakomma, P., De, D., Das, S.K., Bhansali, S., 2015. A-Wristocracy: Deep learning on wrist-worn sensing for recognition of user complex activities, in: 2015 IEEE 12th International Conference on Wearable and Implantable Body Sensor Networks (BSN), IEEE. pp. 1–6.
Walse, K.H., Dharaskar, R.V., Thakare, V.M., 2016. PCA based optimal ANN classifiers for human activity recognition using mobile sensors data, in: Proceedings of First International Conference on Information and Communication Technology for Intelligent Systems: Volume 1, Springer. pp. 429–436.
Wang, A., Chen, G., Shang, C., Zhang, M., Liu, L., 2016a. Human activity recognition in a smart home environment with stacked denoising autoencoders, in: International Conference on Web-Age Information Management, Springer. pp. 29–40.
Wang, J., Zhang, X., Gao, Q., Yue, H., Wang, H., 2016b. Device-free wireless localization and activity recognition: A deep learning approach. IEEE Transactions on Vehicular Technology.
Yang, J.B., Nguyen, M.N., San, P.P., Li, X.L., Krishnaswamy, S., 2015. Deep convolutional neural networks on multichannel time series for human activity recognition, in: Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), Buenos Aires, Argentina, pp. 25–31.
Yang, Q., 2009. Activity recognition: Linking low-level sensors to high-level intelligence, in: IJCAI, pp. 20–25.
Yao, S., Hu, S., Zhao, Y., Zhang, A., Abdelzaher, T., 2017. DeepSense: A unified deep learning framework for time-series mobile sensing data processing, in: Proceedings of the 26th International Conference on World Wide Web, International WWW Conferences Steering Committee. pp. 351–360.
Zebin, T., Scully, P.J., Ozanyan, K.B., 2016. Human activity recognition with inertial sensors using a deep learning approach, in: SENSORS, 2016 IEEE, IEEE. pp. 1–3.
Zeng, M., Nguyen, L.T., Yu, B., Mengshoel, O.J., Zhu, J., Wu, P., Zhang, J., 2014. Convolutional neural networks for human activity recognition using mobile sensors, in: Mobile Computing, Applications and Services (MobiCASE), 2014 6th International Conference on, IEEE. pp. 197–205.
Zhang, L., Wu, X., Luo, D., 2015a. Human activity recognition with HMM-DNN model, in: Cognitive Informatics & Cognitive Computing (ICCI*CC), 2015 IEEE 14th International Conference on, IEEE. pp. 192–197.
Zhang, L., Wu, X., Luo, D., 2015b. Real-time activity recognition on smartphones using deep neural networks, in: Ubiquitous Intelligence and Computing (UIC), IEEE. pp. 1236–1242.
Zhang, L., Wu, X., Luo, D., 2015c. Recognizing human activities from raw accelerometer data using deep neural networks, in: 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), IEEE. pp. 865–870.
Zheng, Y., Liu, Q., Chen, E., Ge, Y., Zhao, J.L., 2014. Time series classification using multi-channels deep convolutional neural networks, in: International Conference on Web-Age Information Management, Springer. pp. 298–310.
Zheng, Y., Liu, Q., Chen, E., Ge, Y., Zhao, J.L., 2016. Exploiting multi-channels deep convolutional neural networks for multivariate time series classification. Frontiers of Computer Science 10, 96–112.