Pattern Recognition 97 (2020) 107000

Contents lists available at ScienceDirect

Pattern Recognition

journal homepage: www.elsevier.com/locate/patcog

Dynamic graph convolutional networks

Franco Manessi (a), Alessandro Rozza (a,*), Mario Manzo (b)

(a) Strategic Analytics, lastminute.com, Chiasso, Switzerland
(b) Information Technology Services, University of Naples "L'Orientale", Naples, Italy

(*) Corresponding author. E-mail addresses: [email protected] (F. Manessi), [email protected] (A. Rozza), [email protected] (M. Manzo).

Article info

Article history:
Received 14 May 2018
Revised 21 May 2019
Accepted 13 August 2019
Available online 13 August 2019

Abstract

In many different classification tasks it is required to manage structured data, which are usually modeled as graphs. Moreover, these graphs can be dynamic, meaning that the vertices/edges of each graph may change over time. The goal is to exploit existing neural network architectures to model datasets that are best represented with graph structures that change over time. To the best of the authors' knowledge, this task has not been addressed using these kinds of architectures. Two novel approaches are proposed, which combine Long Short-Term Memory networks and Graph Convolutional Networks to learn long short-term dependencies together with graph structure. The advantage provided by the proposed methods is confirmed by the results achieved on four real world datasets: an increase of up to 12 percentage points in Accuracy and F1 scores for vertex-based semi-supervised classification and up to 2 percentage points in Accuracy and F1 scores for graph-based supervised classification.

© 2019 Elsevier Ltd. All rights reserved.

https://doi.org/10.1016/j.patcog.2019.107000
0031-3203/© 2019 Elsevier Ltd. All rights reserved.

1. Introduction

In machine learning, data are usually described as points in a vector space (x ∈ R^d). Nowadays, structured data are ubiquitous. The capability to capture the structural relationships among the data points can be particularly useful in improving the effectiveness of the models trained on them.

To this aim, graphs are widely employed to represent this kind of information in terms of nodes/vertices and edges, including local and spatial information arising from data. For instance, consider a d-dimensional dataset X = {x_1, ..., x_n} ⊂ R^d representing points in space; a graph can be extracted from X by considering each point as a node, where edge connectivity and weights can be computed using a metric function. The new data representation G = (V, E) is obtained, where V is a set which contains vertices and E is a set of weighted pairs of vertices (edges). Note that an unweighted graph can be considered an instance of a weighted graph where all the weights are equal.

Applications in a graph domain can usually be divided into two main categories: vertex-focused and graph-focused [1]. The former includes tasks where one performs classification/regression on the vertices of a graph, whereas in the latter one performs these tasks on the graph itself. For instance, object detection [2] and image annotation [3] are examples of vertex-focused applications: the former consists of finding whether an image contains a given object and its position, the latter consists of extracting a caption that describes an image. Another possible example could be web page classification, where the web is represented by a graph whose nodes are the pages and whose edges are the hyperlinks between them, the aim being to exploit the web connectivity to classify pages in a set of topics. Instead, estimating the probability of a chemical compound to cause certain diseases can be seen as a graph-focused application [4]. This is possible due to the fact that a chemical compound can be modeled by a graph where the nodes are the atoms and the edges the chemical bonds.

For the sake of simplicity and without loss of generality, just the classification problem is considered. Under this setting, the vertex-focused applications are characterized by a set of labels Y = {1, ..., k}, a dataset X = {x_1, ..., x_n} ⊂ R^d and the related graph G. Let us assume L ⊂ X is a subset of labeled nodes. The goal is to classify the unlabeled nodes belonging to X but not to L by jointly exploiting the node features and the graph structure by means of a semi-supervised learning approach. Graph-focused applications are related to the goal of learning a function f that maps different graphs to integer values by taking into account the features of the nodes of each graph: f(G, X) ∈ Y. This task can be solved by supervised classification on the graphs.


A number of research works are devoted to classification, both for vertex-focused and graph-focused applications [5–10]. Nevertheless, there is a major limitation in existing studies: most of these research works are focused on static graphs. However, many real world graph-structured data are dynamic and the nodes/edges of the graphs may change over time. In such dynamic scenarios, temporal information can also play an important role. For instance, the interactions between individuals inside a building in one day can be modeled as a sequence of graphs, each one describing a time window within the day, where the nodes are the people and the edges are the interactions occurring between them within a time frame. As another example, consider the classification of human activities from motion-capture data, where each frame can be modeled as a graph whose vertices are the skeleton joints of the person in question.

In the last decade, neural networks have shown their great power and flexibility by learning to represent the world as a nested hierarchy of concepts, achieving outstanding results in many different fields of application. It is important to underline that just a few research works have been devoted to encoding graph structures directly using a neural network model [1,11–15]. Among them, to the best of the authors' knowledge, none is able to manage dynamic graphs.

To exploit both graph-structured data and temporal information through the use of a neural network model, two novel approaches are introduced. They combine the Long Short-Term Memory network (LSTM, [16]) and the Graph Convolutional Network (GCN, [14]), which can be considered the two base elements of the proposed architectures. Both of them are able to deal with vertex-focused applications. Respectively, these techniques are able to capture temporal information and to properly manage graph-structured data. The approaches are also extended to deal with graph-focused applications.

LSTMs are a special kind of Recurrent Neural Network (RNN, [17]), which are able to improve the learning of long term dependencies. All RNNs take the form of a chain of repeating modules of neural networks. Precisely, RNNs are artificial neural networks where connections among units form a directed cycle. This creates an internal state of the network which allows it to exhibit dynamic sequential behavior. In standard RNNs, the repeating module is based on a simple structure, such as a single (hyperbolic tangent) unit. LSTMs extend the repeating module by combining four interacting units. Specifically, an LSTM is based on: cell state, forget gate, input gate and output gate. The most important part of LSTMs is the cell state, which can be pictured as a conveyor belt. It runs straight down the entire chain, with some linear interactions. The first interaction decides what information is going to be removed from the cell state. This decision is made by a sigmoid layer, called the forget gate. The next interaction decides what new information will be stored in the cell state. This can be operationalized in two parts: (i) a sigmoid layer, called the input gate, decides which values should be updated; (ii) another sigmoid layer creates a vector of new candidate values that could be added to the cell state. The final output is generated by combining the cell state with a sigmoid layer.

A GCN is a neural network model that directly encodes graph structure, which is trained on a supervised target loss for all the nodes with labels. This approach is able to distribute the gradient information from the supervised loss and to learn representations exploiting both labeled and unlabeled nodes, thus achieving state-of-the-art results. The goal of a GCN is to learn a function of signals/features on a graph which takes as input:

• A feature description for each node, summarized in a N × D feature matrix (where N is the number of nodes and D is the number of input features);
• A representative description of the graph structure in matrix form (typically in the form of an adjacency matrix).

This model produces a node-level output (an N × F feature matrix, where F is the number of output features per node).

A GCN shares its underlying intuition with Convolutional Neural Networks (CNNs, [18]), specialized kinds of neural networks that are highly successful in practical applications (such as computer vision), employing, within some of their layers, a convolutional operation in place of regular matrix multiplication. A convolutional layer in a CNN exploits the locality of information embedded in its input (e.g. pixel adjacency for image inputs) to extract local features. It does this by repeatedly looking at small patches of its input and by learning a mixing matrix of the size of the patches that is shared across all the patches, thus reducing the amount of parameters. Similarly, a GCN exploits the locality notion induced by the graph connectivity by looking at small patches of the graph, where the patches are all neighbourhood subgraphs of the original graph.

2. Related work

Many real world datasets are best represented in graph form: e.g. knowledge graphs, social networks, protein-interaction networks and the World Wide Web.

To achieve good classification results on this kind of data, the traditional approaches proposed in the literature mainly follow two different directions: to identify structural properties as features and manage them using traditional learning methods, or to propagate the labels to obtain a direct classification.

Zhu et al. [19] propose a semi-supervised learning algorithm based on a Gaussian random field model (also known as Label Propagation). The learning problem is formulated exploiting Gaussian random fields over a continuous function state space, rather than random fields over the discrete label set. This ensures that the most probable configuration of the field is unique, it is characterized in terms of harmonic functions and it has a closed form solution that can be computed using matrix methods or belief propagation. In contrast, for multi-label discrete random fields, computing the lowest energy configuration is typically NP-hard and approximation algorithms or other heuristics must be used [20]. In [21], the authors have extended the previous approach by introducing Dynamic Label Propagation, a technique to better deal with multi-class/multi-label problems due to the lack of consideration of label correlation.

Xu et al. [8] present a semi-supervised factor graph model that is able to exploit the relationships among nodes. In this approach, each vertex is modeled as a variable node and the various relationships are modeled as factor nodes. Grover and Leskovec, in [22], present an efficient and scalable algorithm for feature learning in networks that optimizes a novel network-aware, neighborhood preserving objective function using Stochastic Gradient Descent.

Perozzi et al. [23] propose an approach called DeepWalk. This technique uses truncated random walks to efficiently learn representations for vertices in graphs. These latent representations, which encode graph relations in a vector space, can be easily exploited by statistical models, thus producing state-of-the-art results.

Another technique used to extract vector representations starting from graphs is the Bag of Graphs method introduced in [24], where the authors exploit the combination of graphs with a bag-of-words model to create a discriminant and efficient representation based on local structures of an object, leading to fast and accurate results in classification tasks.

Unfortunately, the described techniques cannot deal with graphs that dynamically change over time. There are a few methodologies that have been designed to classify nodes in dynamic networks [25,26]. Li et al. [25] propose an approach that is able to learn the latent feature representation and to capture dynamic patterns. Yao et al. [26] present a Support Vector Machine-based approach that combines the support vectors of the

previous temporal instant with the current training data to exploit temporal relationships. Pei et al. [27] define an approach known as the Dynamic Factor Graph Model, which they apply to node classification in dynamic social networks. This approach organizes the dynamic graph data in a sequence of graphs. Three types of factors, called node, correlation and dynamic factors, are designed to capture node features, node correlations and temporal correlations, respectively. Node and correlation factors are designed to capture the global and local properties of the graph structures, while the dynamic factor exploits the temporal dependencies.

It is important to underline that very little attention has been devoted to the generalization of neural network models to graph-structured datasets. In the last few years, a number of research works have revisited the problem of generalizing neural networks to work on structured graphs, some of them achieving promising results in domains that have been previously dominated by other techniques. Gori et al. [28] and Scarselli et al. [1] formalize a novel neural network model, the Graph Neural Network (GNN). This model is based on extending a neural network method with the purpose of processing data in the form of graph structures. The GNN model can process different types of graphs (e.g., acyclic, cyclic, directed and undirected) and it maps a graph and its nodes into a D-dimensional Euclidean space to learn the final classification/regression model. Li et al. [15] extend the GNN model by relaxing the contractivity requirement of the propagation step through the use of a Gated Recurrent Unit [29] and by predicting sequences of outputs from a single input graph. Bruna et al. [11] describe two generalizations of CNNs: one based on a hierarchical clustering of the domain and another based on the spectrum of the graph (computed using the Laplacian matrix). Duvenaud et al. [13] present another variant of CNNs that works on graph structures. This model allows an end-to-end learning on graphs of arbitrary size and shape. Defferrard et al. [12] introduce a formulation of CNNs in the context of spectral graph theory. The model provides efficient numerical schemes to design fast localized convolutional filters on graphs, achieving the same computational complexity of classical CNNs while working on any graph structure. Kipf and Welling [14] propose an approach for semi-supervised learning on graph-structured data (the GCN) based on CNNs. In their work, they exploit a localized first-order approximation of the spectral graph convolutions framework [30]. Their model scales linearly in the number of graph edges and learns hidden layer representations that encode local and structural graph features. Seo et al. [31] address the problem of generating sequences on static graphs by extending [12] through the usage of a RNN. Monti et al. [32] tackle the problem of matrix completion (one of the most common formulations of recommender systems) by means of recurrent multi-graph neural networks.

Note that none of the aforementioned neural network architectures is able to properly deal with dynamic graph-structured data.

3. Methods

In this section, two novel network architectures are introduced to deal with vertex/graph-focused applications. Both of them rely on the following intuitions:

• GCNs can effectively deal with graph-structured information, but they lack the ability to handle data structures that change over time. This limitation is (at least) twofold: (i) inability to manage dynamic vertex features, (ii) inability to manage dynamic edge connections.
• LSTMs excel in finding long and short range sequence dependencies, but they lack the ability to explicitly exploit graph-structured information.

The new network architectures proposed in this paper will work on ordered sequences of graphs and ordered sequences of vertex features. For sequences of length one, this reduces to the vertex/graph-focused applications described in Section 1.

The paper contributions are based on the idea of combining an extension of the Graph Convolution (GC, the fundamental layer of the GCNs) and a modified version of an LSTM, which learn the downstream recurrent units by exploiting both graph-structured data and vertex features.

Two GC-like layers are here proposed. They take as input a graph sequence and the corresponding ordered sequence of vertex features. The output is an ordered sequence of new vertex representations. These layers are:

• The Waterfall Dynamic-GC layer, which performs at each step of the sequence a graph convolution on the vertex input sequence. An important feature of this layer is that the trainable parameters of each graph convolution are shared among the various steps of the sequence;
• The Concatenated Dynamic-GC layer, which performs at each step of the sequence a graph convolution on the vertex input features and concatenates it to the input. Again, the trainable parameters are shared among the steps in the sequence.

Each of the two layers can jointly be used with a modified version of an LSTM to perform a semi-supervised classification of a sequence of vertices or a supervised classification of a sequence of graphs. The difference between the two tasks just consists in how the final step of processing of the data is performed (for further details, see Eqs. (8) and (10)).

In the following section the mathematical definitions of the two modified GC layers are given, together with the modified version of the LSTM, as well as some other handy definitions that will be useful when the final network architectures are described.

3.1. Definitions

Let [Y]_{i,j} be the ith row, jth column element of the matrix Y and Yᵀ its transpose. I_d is the identity matrix of R^d; softmax and ReLU are the soft-maximum and the rectified linear unit activation functions [33]. Note that all the activation functions act element-wise when applied to a matrix M ∈ R^(n×m), e.g. [ReLU M]_{i,j} = ReLU [M]_{i,j}.

The matrix P ∈ R^(d×d) is a projector on R^d if it is a symmetric, positive semi-definite matrix for which P² = P. In particular, it is a diagonal projector if it is a diagonal matrix (with possibly some zero entries on the main diagonal). In other words, a diagonal projector on R^d is the diagonal matrix with some 1s on the main diagonal that, when it is right-multiplied by a d-dimensional column vector v, zeroes out all the entries of v corresponding to the zeros on the main diagonal of P: [P v]_k = [P]_{k,k} [v]_k for every k.

Let (G_i)_{i∈Z_T} with Z_T := {1, 2, ..., T} be a finite sequence of undirected graphs G_i = (V_i, E_i), with V_i = V ∀i ∈ Z_T, i.e. all the graphs in the sequence share the same vertices. Considering the graph G_i, for each vertex v_k ∈ V let x_k^i ∈ R^d be the corresponding feature vector. Each step i in the sequence Z_T can completely be defined by its graph G_i (modeled by the adjacency matrix A_i)

Fig. 1. Representation of a sequence of unweighted graphs G_1, G_2, G_3 made of 4 nodes (v_1, v_2, v_3, v_4), where each vertex-feature vector x_k^i ∈ R², with i ∈ {1, 2, 3} and k ∈ {1, 2, 3, 4}. The graphs share the same vertex elements, while the edge connectivity and the vertex-feature values change for different graphs of the sequence. For instance, the feature vector for vertex v_1 changes from x_1^2 = (0.9, 0.2) to x_1^3 = (2.8, 1.5) between steps i = 2 and i = 3. Similarly, the edge between vertices v_1 and v_4 disappears between the same steps in the graph sequence.

and by the vertex-features matrix X_i ∈ R^(|V|×d) (the matrix whose row vectors are the x_k^i). Here, the adjacency matrix of a finite undirected graph G = (V, E) is the square matrix A ∈ R^(|V|×|V|) where [A]_{i,j} = [A]_{j,i} = w_ij if and only if there is an edge of weight w_ij between the ith and jth vertices (in the case of an unweighted graph, w_ij = 1). It is worth mentioning that, despite sharing the same vertex elements, the graphs within the sequence do not share the same vertex-feature values. See Fig. 1 for a pictorial representation of a sequence of graphs.

The mathematics of the GC layer [14] and the LSTM [16] are here briefly recalled, since they are the basic building blocks of the contribution of this paper. Given a graph with adjacency matrix A ∈ R^(|V|×|V|) and vertex-feature matrix X ∈ R^(|V|×d), the GC layer with M output nodes and weight matrix B ∈ R^(d×M) is defined as the function:

GC^B_{M,A} : R^(|V|×d) → R^(|V|×M),    GC^B_{M,A}(X) := ReLU(Â X B),    (1)

where Â is the re-normalized adjacency matrix, i.e. Â := D̃^(−1/2) Ã D̃^(−1/2) with Ã := A + I_|V| and [D̃]_{kk} := Σ_l [Ã]_{kl}. Note that the GC layer can be seen as a localized first-order approximation of spectral graph convolution [34], with the additional renormalization trick in order to improve numerical stability [14].
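As a concrete illustration of Eq. (1), the following minimal NumPy sketch (not the authors' implementation; the array names and toy sizes are assumptions) computes the renormalized adjacency matrix Â and one graph-convolution forward pass ReLU(ÂXB).

```python
import numpy as np

def renormalized_adjacency(A):
    # A_tilde = A + I; D_tilde is its diagonal degree matrix;
    # A_hat = D_tilde^{-1/2} A_tilde D_tilde^{-1/2} (renormalization trick of Eq. (1)).
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gc_layer(X, A, B):
    # GC^B_{M,A}(X) = ReLU(A_hat X B): maps |V| x d vertex features to |V| x M outputs.
    return np.maximum(renormalized_adjacency(A) @ X @ B, 0.0)

# Toy example: 4 vertices, d = 2 input features, M = 3 output nodes.
A = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 0.],
              [1., 0., 0., 0.]])
X = np.random.randn(4, 2)
B = np.random.randn(2, 3)
print(gc_layer(X, A, B).shape)  # (4, 3)
```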
Given the sequence (x_i)_{i∈Z_T} with x_i d-dimensional row vectors for each i ∈ Z_T, a returning sequence-LSTM with N output nodes is the function LSTM_N : (x_i)_{i∈Z_T} → (h_i)_{i∈Z_T}, with h_i ∈ R^N and

h_i = o_i ⊙ tanh(c_i),                     f_i = σ(x_i W^f + h_{i−1} U^f + b^f),
c_i = j_i ⊙ c̃_i + f_i ⊙ c_{i−1},           j_i = σ(x_i W^j + h_{i−1} U^j + b^j),
o_i = σ(x_i W^o + h_{i−1} U^o + b^o),      c̃_i = σ(x_i W^c + h_{i−1} U^c + b^c),    (2)

where ⊙ is the Hadamard product, σ(x) := 1/(1 + e^(−x)), W^l ∈ R^(d×N) and U^l ∈ R^(N×N) are weight matrices and b^l are bias vectors, with l ∈ {o, f, j, c}.

Definition 1 (wd-GC layer). Let (A_i)_{i∈Z_T}, (X_i)_{i∈Z_T} be, respectively, the sequence of adjacency matrices and the sequence of vertex-feature matrices for the considered graph sequence (G_i)_{i∈Z_T}, with A_i ∈ R^(|V|×|V|) and X_i ∈ R^(|V|×d) ∀i ∈ Z_T. The Waterfall Dynamic-GC layer with M output nodes is the function wd-GC_M with weight matrix B ∈ R^(d×M) defined as follows:

wd-GC_M : ((X_i)_{i∈Z_T}, (A_i)_{i∈Z_T}) ↦ ( GC^B_{M,A_i}(X_i) )_{i∈Z_T}.    (3)

The wd-GC layer can be seen as multiple copies of a standard GC layer, all of them sharing the same training weights. Then, the resulting training parameters are d × M, independently of the length of the sequence.

In order to introduce the Concatenated Dynamic-GC layer, the definition of the Graph of a Function is now given: considering a function f from A to B, [GF f] : A → A × B, x ↦ [GF f](x) := (x, f(x)). Namely, the GF operator transforms f into a function returning the concatenation between x and f(x).

Definition 2 (cd-GC layer). Let (A_i)_{i∈Z_T}, (X_i)_{i∈Z_T} be, respectively, the sequence of adjacency matrices and the sequence of vertex-feature matrices for the considered graph sequence (G_i)_{i∈Z_T}, with A_i ∈ R^(|V|×|V|) and X_i ∈ R^(|V|×d) ∀i ∈ Z_T. A Concatenated Dynamic-GC layer with M output nodes is the function cd-GC_M defined as follows:

cd-GC_M : ((X_i)_{i∈Z_T}, (A_i)_{i∈Z_T}) ↦ ( [GF GC^B_{M,A_i}](X_i) )_{i∈Z_T},    (4)

where [GF GC^B_{M,A_i}](X_i) ∈ R^(|V|×(M+d)).

Intuitively, cd-GC is a layer made of T copies of GC layers, each copy acting on one instant of the sequence. Each output of the T copies is then concatenated with its input, thus resulting in a sequence of graph-convoluted features together with the vertex-features matrix. Such skip-connections of the vertex-input features counter the possible blurring of the graph features due to the GC units alone. Indeed, by looking at Eq. (1) it can be seen that GC recombines the vertex-features matrix by means of the re-normalized adjacency Â, thus averaging each vertex feature with its neighbourhood. This behavior might be counter-productive in a graph with high connectivity, to the extreme case that in a complete graph (i.e. a graph with [Ã]_{i,j} = 1 ∀i, j) all the recombined vertex features are equal for all the vertices, since all the elements of Â are the same. The weights B ∈ R^(d×M) are shared among the T copies. The number of learnable parameters of this layer is d × (d + M), independently of the number of steps in the sequence (G_i)_{i∈Z_T}.
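A short sketch of the two layers just defined, reusing the gc_layer helper from the previous snippet (again illustrative only, not the authors' code): the same weight matrix B is applied at every step of the sequence, and cd-GC additionally concatenates the input features to the convolved ones.

```python
import numpy as np  # gc_layer is assumed to be defined as in the previous snippet

def wd_gc(X_seq, A_seq, B):
    # Eq. (3): one shared-weight graph convolution per step of the sequence.
    return [gc_layer(X_i, A_i, B) for X_i, A_i in zip(X_seq, A_seq)]

def cd_gc(X_seq, A_seq, B):
    # Eq. (4): as wd-GC, but each |V| x M output is concatenated with the
    # |V| x d input, giving a |V| x (M + d) matrix per step.
    return [np.concatenate([gc_layer(X_i, A_i, B), X_i], axis=1)
            for X_i, A_i in zip(X_seq, A_seq)]
```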
Note that both the input and the output of wd-GC and cd-GC are sequences of matrices (loosely speaking, third order tensors).

Three additional layers will be defined now. These layers will help in reducing the clutter with the notation when, in Sections 3.2 and 3.3, the network architectures used to solve the semi-supervised classification of sequences of vertices and the supervised classification of sequences of graphs will be introduced. They are: (i) the recurrent layer used to process in a parallel fashion the convoluted vertex features, (ii) the two final layers (one per task) used to map the previous layers' outputs into k-class probability vectors.

Definition 3 (v-LSTM layer). Consider (Z_i)_{i∈Z_T} with Z_i ∈ R^(L×M); the Vertex LSTM layer with N output nodes is given by the function v-LSTM_N:

v-LSTM_N : (Z_i)_{i∈Z_T} ↦ ( LSTM_N((V_1 Z_i)_{i∈Z_T}) ; ... ; LSTM_N((V_L Z_i)_{i∈Z_T}) ) ∈ R^(L×N×T),    (5)

with the L per-vertex LSTM outputs stacked along the first dimension,

where V_p is the isometric embedding of R into R^L defined as [V_p]_{i,j} = δ_ip and δ is the Kronecker delta function (an isometric embedding f of a vector space V into a vector space W, with dim(V) < dim(W), is a linear map between V and W such that ‖f(v)‖ = ‖v‖, ∀v ∈ V). The training weights are shared among the L copies of the LSTMs.

Definition 4 (vs-FC layer). Consider (Z_i)_{i∈Z_T} with Z_i ∈ R^(L×N); the Vertex Sequential Fully Connected layer with k output nodes is given by the function vs-FC_k, parameterized by the weight matrix W ∈ R^(N×k) and the bias matrix B := (b, ..., b)ᵀ ∈ R^(L×k):

vs-FC_k : (Z_i)_{i∈Z_T} ↦ ( softmax(Z_i W + B) )_{i∈Z_T},    (6)

with softmax(Z_i W + B) ∈ R^(L×k).

Definition 5 (gs-FC layer). Consider (Z_i)_{i∈Z_T} with Z_i ∈ R^(L×N); the Graph Sequential Fully Connected layer with k output nodes is given by the function gs-FC_k, parameterized by the weight matrices W_1 ∈ R^(N×k), W_2 ∈ R^(1×L) and the bias matrices B_1 := (b, ..., b)ᵀ ∈ R^(L×k) and B_2 ∈ R^(1×k):

gs-FC_k : (Z_i)_{i∈Z_T} ↦ ( softmax(W_2 ReLU(Z_i W_1 + B_1) + B_2) )_{i∈Z_T},    (7)

with softmax(W_2 ReLU(Z_i W_1 + B_1) + B_2) ∈ R^(1×k).

Informally: (i) the v-LSTM layer acts as L copies of an LSTM, each one evaluating the sequence of one row of the input tensor (Z_i)_{i∈Z_T}; (ii) the vs-FC layer acts as L × T copies of a Fully Connected layer (FC, [33]), all of the copies sharing the parameters and each one of them returning a k-class probability vector thanks to the softmax activation; (iii) the gs-FC layer acts as T copies of two FC layers with softmax-ReLU activation, all the copies sharing the parameters. This layer outputs one k-class probability vector for each step in the input sequence, thanks to the softmax activation. Note that both the input and the output of vs-FC and v-LSTM are sequences of matrices, while for gs-FC the input is a sequence of matrices and the output is a sequence of vectors.

3.2. Semi-supervised classification of sequences of vertices

Definition 6 (Semi-Supervised Classification of a Sequence of Vertices). Let (G_i)_{i∈Z_T} be a sequence of T graphs, each one made of |V| vertices, and (X_i)_{i∈Z_T} the related sequence of vertex-features matrices. Let (P^Lab_i)_{i∈Z_T} be a sequence of diagonal projectors on the vector space R^|V|. Define the sequence (P^Unlab_i)_{i∈Z_T} by means of P^Unlab_i := I_|V| − P^Lab_i, ∀i ∈ Z_T; i.e. P^Lab_i and P^Unlab_i identify the labeled and unlabeled vertices of G_i, respectively. Moreover, let (Y_i)_{i∈Z_T} be a sequence of T matrices with |V| rows and k columns, satisfying the property P^Lab_i Y_i = Y_i, where the jth row of the ith matrix represents the one-hot encoding of the k-class label of the jth vertex of the ith graph in the sequence, with the jth vertex being a labeled one. Then, a semi-supervised classification of a sequence of vertices consists in learning a function f such that P^Lab_j f((G_i)_{i∈Z_T}, (X_i)_{i∈Z_T})_j = Y_j, and P^Unlab_j f((G_i)_{i∈Z_T}, (X_i)_{i∈Z_T})_j is the right labeling for the unlabeled vertices, for each j ∈ Z_T.

To address the above task, the networks defined by the following functions are proposed:

v_wd-GC_LSTM_{M,N,k} := vs-FC_k ∘ v-LSTM_N ∘ wd-GC_M,    (8a)

v_cd-GC_LSTM_{M,N,k} := vs-FC_k ∘ v-LSTM_N ∘ cd-GC_M,    (8b)

where ∘ denotes the function composition. Both architectures take ((X_i)_{i∈Z_T}, (A_i)_{i∈Z_T}) as input and produce a sequence of matrices whose row vectors are the probabilities of each vertex of the graph: (Z_i)_{i∈Z_T} with Z_i ∈ R^(|V|×k). For the sake of clarity, in the rest of the paper the networks defined by Eqs. (8a) and (8b) will be called Waterfall Dynamic-GCN (WD-GCN) and Concatenated Dynamic-GCN (CD-GCN), respectively.

As can be seen in Fig. 2a, the wd-GC layer acts as T copies of a regular GC layer, each one working on one instant of the sequence. The output of this first layer is then processed by the v-LSTM layer, which acts as |V| copies of the returning sequence-LSTM layer, each one working on one vertex of the sequence of graphs. The last layer, which produces the k-class probability vector for each vertex and for each instant of the sequence, can be seen as |V| × T copies of a FC layer. Fig. 2b shows the CD-GCN architecture; the interpretation is analogous to that of WD-GCN, where now the v-LSTM unit takes as input both the graph convolutional output and the vertex input features, thanks to the cd-GC layer.

Since the output layers return probability vectors, the weights of the architectures can be learned using gradient descent methods, employing as a loss function the cross entropy, evaluated only on the labeled vertices:

− Σ_{t∈Z_T} Σ_{c∈Z_k} Σ_{v∈Z_|V|} [Y_t]_{v,c} log [P^Lab_t Z_t]_{v,c},    (9)

with the convention that 0 × log 0 = 0.
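The masked cross entropy of Eq. (9) can be sketched as follows (NumPy, dense arrays; representing the projector P^Lab_t as a boolean mask and the variable names are assumptions made for illustration).

```python
import numpy as np

def semi_supervised_cross_entropy(Y_seq, Z_seq, labeled_mask_seq, eps=1e-12):
    # Y_seq: per-step |V| x k one-hot label matrices (all-zero rows for unlabeled vertices).
    # Z_seq: per-step |V| x k predicted probability matrices.
    # labeled_mask_seq: per-step boolean vectors of length |V| (True = labeled vertex),
    # playing the role of the diagonal projector P^Lab_t of Eq. (9).
    loss = 0.0
    for Y, Z, mask in zip(Y_seq, Z_seq, labeled_mask_seq):
        loss -= np.sum(Y[mask] * np.log(Z[mask] + eps))
    return loss
```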
3.3. Supervised classification of sequences of graphs

Definition 7 (Supervised Classification of a Sequence of Graphs). Let (G_i)_{i∈Z_T} be a sequence of T graphs, each one made of |V| vertices, and (X_i)_{i∈Z_T} the related sequence of vertex-features matrices. Moreover, let (y_i)_{i∈Z_T} be a sequence of T one-hot encoded k-class labels, i.e. y_i ∈ {0, 1}^k. Then, the graph-sequence classification task consists in learning a predictive function f such that f((G_i)_{i∈Z_T}, (X_i)_{i∈Z_T}) = (y_i)_{i∈Z_T}.

The proposed architectures are defined by the following functions:

g_wd-GC_LSTM_{M,N,k} := gs-FC_k ∘ v-LSTM_N ∘ wd-GC_M,    (10a)

g_cd-GC_LSTM_{M,N,k} := gs-FC_k ∘ v-LSTM_N ∘ cd-GC_M.    (10b)

The two architectures take as input ((X_i)_{i∈Z_T}, (A_i)_{i∈Z_T}). The output of wd-GC and cd-GC is processed by a v-LSTM, resulting in a |V| × N matrix for each step in the sequence. It is the gs-FC's duty to transform this vertex-based prediction into a graph-based prediction, i.e. to output a sequence of k-class probability vectors (z_i)_{i∈Z_T}. Again, WD-GCN and CD-GCN will be used to refer to the networks defined by Eqs. (10a) and (10b), respectively.

Fig. 3a and b show the two architectures WD-GCN and CD-GCN, respectively. The same interpretation given for Fig. 2a and b holds also here, with the only difference being in the last layer, since now it has to return a k-class probability vector for each instant of the sequence. It can be seen as the composition of two FC layers: the first one working on each vertex for every instant and the next one working on all the vertices at a given instant.

Also under this setting the training can be performed by means of gradient descent methods, with the cross entropy as loss function:

− Σ_{t∈Z_T} Σ_{c∈Z_k} [y_t]_c log [z_t]_c,    (11)

with the convention 0 × log 0 = 0.

Fig. 2. The figure shows the network architectures presented in Section 3.2 for the semi-supervised classification of sequences of vertices. In these pictorial representations,
all of them work on sequences of four graphs composed of five vertices, i.e. (G i )i∈Z4 , |V | = 5.
(a) The wd-GC layer acts as four copies of a regular GC layer, each one working on one instant of the sequence. The output of this first layer is processed by the v-LSTM
layer that acts as five copies of the returning sequence-LSTM layer, each one working on a vertex of the graphs. The last layer, which produces the k-class probability vector
for each vertex and for each instant of the sequence, can be seen as 5 × 4 copies of a FC layer.
(b) The cd-GC and the v-LSTM layers work as the wd-GC and the v-LSTM of the Fig. 2a, the only difference being that v-LSTM works both on graph convolutional
features, as well as plain vertex features, due to the fact that cd-GC produces their concatenation.

Fig. 3. The figure shows the network architectures presented in Section 3.3 for the supervised classification of sequences of graphs. All of them work on sequences of four graphs composed of five vertices, i.e. (G_i)_{i∈Z_4}, |V| = 5. The networks of (a) and (b) differ from their counterparts of Fig. 2 only in the last gs-FC layer, since in this case the networks shall produce one k-class probability vector prediction for each graph.
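To make the role of the shared v-LSTM block in both figures concrete, the sketch below (written with today's tf.keras rather than the Keras 1.2.2 used by the authors; the toy sizes are assumptions) shows how the sequence of graph-convolved matrices is rearranged so that one returning-sequence LSTM processes every vertex independently with shared weights.

```python
import numpy as np
import tensorflow as tf

# Shapes follow Fig. 2: T = 4 steps, |V| = 5 vertices, M = 3 features per vertex, N = 8 LSTM nodes.
T, V, M, N = 4, 5, 3, 8
Z = np.random.randn(T, V, M).astype("float32")   # output of wd-GC/cd-GC: one |V| x M matrix per step

# v-LSTM: the vertices play the role of independent batch entries of length-T sequences,
# so the same LSTM weights are shared across all |V| copies.
lstm = tf.keras.layers.LSTM(N, return_sequences=True)
per_vertex_sequences = np.transpose(Z, (1, 0, 2))  # |V| x T x M
H = lstm(per_vertex_sequences)                     # |V| x T x N
print(H.shape)
```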

3.4. Computational complexity

Considering a sparse implementation for the wd-GC and cd-GC layers, the time complexities of the proposed networks are linear in the number of graph edges and the length of the sequence. Indeed, let F, G, L, K, T, E, V be the number of input features, wd-GC nodes, v-LSTM nodes, output classes, time steps, maximum edges and vertices. Consider first v_wd-GC_LSTM_{G,L,K}, i.e. WD-GCN for a semi-supervised classification of a sequence of vertices. Since such network architecture is composed of wd-GC_G, v-LSTM_L and vs-FC_K, the time complexity of each component is reported below:

• wd-GC_G: each one of the T GC units has a computational complexity of EFG [14], thus leading to a grand total of EFGT;
• v-LSTM_L: each one of the T recurrent iterations of the V LSTM building units requires L² + LG operations for each one of the 4 LSTM internal gates, thus leading to a grand total of 4VLT(G + L);
• vs-FC_K: each one of the fully connected units for each of the V vertices and the T steps of the sequences requires LK operations, giving VLKT total operations.

Similar arguments hold also for the other architectures introduced in this paper; hence the computational complexities of the networks are given by:

• Semi-supervised classification of sequences of vertices
  – WD-GCN → EFGT + 4VLT(G + L) + VLKT;
  – CD-GCN → EFGT + 4VLT(G + F + L) + VLKT;
• Supervised classification of sequences of graphs
  – WD-GCN → EFGT + 4VLT(G + L) + VLKT + VKT;
  – CD-GCN → EFGT + 4VLT(G + F + L) + VLKT + VKT.

Since in many real world scenarios V ≤ E, the above claim follows.

4. Experimental results

4.1. Datasets

The proposed architectures have been tested on the five datasets summarized in Table 1, namely one synthetic and two real world datasets (DBLP, CIAW) for the vertex-focused applications and two real world datasets for the graph-focused applications (CAD-120, HDM05).

Table 1
The table below summarizes the datasets considered in the experiments.

Task                     Dataset     Nodes   Time steps   Vertex features   Prediction classes
Vertex-focused           DBLP        500     10           70                6
(semi-supervised         DBLP        5000    10           70                6
classification)          DBLP        20000   10           70                6
                         CIAW        92      20           64                5
                         Synthetic   500     20           1                 2
Graph-focused            CAD-120     34      1298         24                10
(supervised              HDM05       31      1654         6                 19
classification)

DBLP. A subset of the publicly available DBLP (http://dblp.uni-trier.de/xml/), a vertex-focused dataset described in [27], has been considered. Conferences from six research communities, including artificial intelligence and machine learning, algorithm and theory, database, data mining, computer vision and information retrieval, have been taken into account. The co-author relationships from 2001 to 2010 have been considered and the data belonging to each year has been organized in graph form. Each author represents a node in the network and an edge between two nodes exists if two authors have collaborated on a paper in the considered year. Note that the resulting adjacency matrix is unweighted. The task to be tackled with this dataset is the prediction of each author's research community, by leveraging the dynamic evolution of the collaborations between the authors through the years in a semi-supervised fashion. This is a vertex-focused task.

The node features are extracted from each temporal instant using DeepWalk [23] and correspond to a 64-dimensional vector. DeepWalk uses truncated random walks to learn latent representations of vertices in a graph, thus embedding the nodes in a vector space. This approach is well-known and largely employed, because it allows to build representations (features) that can help techniques that are not able to properly exploit graph structure, such as fully connected (feed-forward) neural networks or LSTMs, to achieve better results. The node features have been augmented by adding the number of articles published by the authors in each of the communities, obtaining a 70-dimensional feature vector.

The dataset is made of 25215 authors across the ten years under analysis. Each year 4786 authors appear on average and 114 authors appear all the years, with an average of 1594 authors appearing on two consecutive years.

The 500/5000/20000 authors with the highest number of connections during the analyzed 10 years have been considered, i.e. the 500/5000/20000 vertices among the total 25215 with the highest Σ_{t∈Z_10} Σ_i [A_t]_{i,j}, with A_t the adjacency matrix at the tth year. If one of the selected authors does not appear in the tth year, its feature vector is set to zero.

The final dataset is composed of 10 vertex-features matrices in R^(x×70) and 10 adjacency matrices belonging to R^(x×x), with x ∈ {500, 5000, 20000}. Moreover, each vertex belongs to one of the 6 classes. Note that even if the label of a vertex does not change during the sequence, all the tested architectures have to provide a vertex label for each year in the sequence.

CIAW. The Contacts In A Workplace (CIAW) dataset (http://www.sociopatterns.org/datasets/contacts-in-a-workplace/) is a vertex-focused dataset of [35] that contains the temporal network of contacts between individuals measured in an office building in France, from June 24 to July 3, 2013. Each individual was wearing a sensor able to record the interaction with another individual within 1.5 m. Any interaction lasting more than 20 s was considered as a contact between the two people. For each 20 s interval between June 24 and July 3, 2013, all the contacts occurring between the surveyed individuals have been recorded. Each of the individuals is further characterized by his or her department name.

Similarly to the DBLP dataset, the task is to predict each individual's department, by leveraging the historical sequence of their interactions in a semi-supervised fashion. Again, this is a vertex-focused task.

The interaction data have been downsampled by grouping them in 20 temporally equally spaced snapshots (note that the downsampling is only for the purpose of experimental comparison and that other aggregations may significantly alter the results [36]). A sequence of 20 graphs has been built as follows: (i) the set of vertices (shared between all the graphs) is composed of the collection of all the individuals; (ii) if two individuals interact in the ith snapshot, the respective vertices are connected by an edge in the ith graph. For all the vertices of every graph of each snapshot, a 64-dimensional representation by means of DeepWalk has been built.

Each person's department has been considered as the target label for the semi-supervised classification, for a grand total of 5 classes. The dataset is made of 92 individuals across the 20 snapshots. For each snapshot, 60 individuals appear on average and 4 individuals appear in all the snapshots, with an average of 49 people appearing on two consecutive snapshots.

The resulting dataset is made of 20 vertex-feature matrices (in R^(92×64)), 20 adjacency matrices (in R^(92×92)) and each vertex belongs to one of the 5 classes.

Synthetic. Similarly to CIAW, the synthetic vertex-focused dataset aims to simulate the behavior of employees in a workplace environment. There are 500 employees, each one belonging to one of two possible departments. Moreover, each employee is characterized by one boolean feature describing his or her working interests: when the feature is zero it means that the employee would like to work for the first of the two departments, while, when it is equal to one, he or she would like to work for the second. Employees can interact with each other over time, as well as change working department and working interests. The task to solve is to predict the current department of each of the employees by looking at their historical interactions with the other employees and the time series of their interests.

The dataset has been created in the following way: the 500 employees have been initially assigned a department as well as a working interest uniformly at random. The graph describing their interactions has been built as follows: (i) two people belonging to the same department have a probability p_same dep to be connected; (ii) two people not belonging to the same department but having the same working interest have a probability p_same inter to be connected; (iii) two people, neither sharing department nor interest, have a probability p_other to be connected. The previous three items completely identify the initial time step in the sequence; the following ones are identified by the following sequential update rules: (i) each employee, not belonging to the department of his or her liking, has a probability p_change inter of changing working interest into the one corresponding to his or her current department; (ii) each employee, not belonging to the department of his or her liking, has a probability p_move of changing department. Experiments have been done using the two configurations reported in Table 2, where each sequence is made of 20 steps.

Table 2
The two configurations used to generate the synthetic dataset. Configuration 2 has much more noise in the graph structured data than Configuration 1.

                   p_same dep   p_same inter   p_other   p_change inter   p_move
Configuration 1    0.5          0.4            0.05      0.05             0.05
Configuration 2    0.5          0.05           0.4       0.05             0.05

The resulting dataset is made of 20 vertex-feature matrices (in R^(500×1)), 20 adjacency matrices (in R^(500×500)) and each vertex belongs to one of the 2 classes.
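The generation procedure described above can be sketched as follows (a simplified re-implementation for illustration only; the function name and the choice of moving an unhappy employee to the department he or she prefers are assumptions not fixed by the text).

```python
import numpy as np

def generate_synthetic_sequence(n=500, T=20, p_same_dep=0.5, p_same_inter=0.4,
                                p_other=0.05, p_change_inter=0.05, p_move=0.05, seed=0):
    rng = np.random.default_rng(seed)
    dep = rng.integers(0, 2, size=n)        # current department of each employee
    interest = rng.integers(0, 2, size=n)   # preferred department (boolean vertex feature)
    X_seq, A_seq, y_seq = [], [], []
    for _ in range(T):
        # Connection probability: same department > same interest > otherwise.
        p = np.full((n, n), p_other)
        p[interest[:, None] == interest[None, :]] = p_same_inter
        p[dep[:, None] == dep[None, :]] = p_same_dep
        A = (rng.random((n, n)) < p).astype(float)
        A = np.triu(A, 1)
        A = A + A.T                                          # undirected graph, no self loops
        X_seq.append(interest.reshape(n, 1).astype(float))   # one boolean feature per vertex
        A_seq.append(A)
        y_seq.append(dep.copy())                             # target: current department
        # Sequential update rules for employees not in the department of their liking.
        unhappy = dep != interest
        interest = np.where(unhappy & (rng.random(n) < p_change_inter), dep, interest)
        dep = np.where(unhappy & (rng.random(n) < p_move), interest, dep)
    return X_seq, A_seq, y_seq
```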
CAD-120. This graph-focused dataset (http://pr.cs.cornell.edu/humanactivities/data.php) is made of 122 RGB-D videos corresponding to 10 high-level human activities [37]. Each video is annotated with sub-activity labels, object affordance labels, tracked human skeleton joints and tracked object bounding boxes. The 10 sub-activity labels are: reaching, moving, pouring, eating, drinking, opening, placing, closing, scrubbing and null. All the data related to the detection of sub-activities have been considered, i.e. no object affordance data have been used. Given one video (with its annotated data), one wants to predict the sub-activity for each frame within the video. Note that detecting the sub-activities is a challenging problem as it involves complex interactions, since humans can interact with multiple objects during a single activity. One can imagine tackling this problem by extracting graph-structured data from the tracked positions of both human skeleton joints and objects. This implies that each video can be seen as a graph sequence as presented at the beginning of Section 3.1, with one graph per frame and the detection of the sub-activity for each of the frames as the prediction of a target label attached to each graph in the sequence, i.e. a graph-focused application.

Each one of the 10 high-level activities is characterized by one person, whose 15 joints are tracked (in position and orientation) in the 3D space for each frame of the sequence. In each high-level activity there appears a variable number of objects, for which are registered their bounding boxes in the video frame, together with the transformation matrix matching the SIFT features [38] extracted from the frame to the ones of the previous frame. 19 objects are involved.

A graph for each video frame has been built: the vertices are the 15 skeleton joints plus the 19 objects, while the weighted adjacency matrix has been derived by exploiting the Euclidean distance. Precisely, for two skeleton joints, the edge weight is given by the Euclidean distance between their 3D positions; for two objects, it is the 2D distance between the centroids of their bounding boxes; for an object and a skeleton joint, it is the 2D distance between the centroid of the object bounding box and the skeleton joint projection into the 2D video frame. All the distances have been scaled between zero and one. When an object does not appear in a frame, its related row and column in the adjacency matrix and its corresponding features in the vertex-features matrix are set to zero.
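The distance-based weighted adjacency just described can be sketched as follows (illustrative only; how the joints are projected into the 2D frame and the global scaling of the distances are assumptions on top of the description above).

```python
import numpy as np

def cad120_adjacency(joint_xyz, joint_uv, object_uv):
    # joint_xyz: 15 x 3 joint positions in 3D; joint_uv: 15 x 2 joint projections into the
    # video frame; object_uv: 19 x 2 bounding-box centroids of the tracked objects.
    n_j, n_o = joint_xyz.shape[0], object_uv.shape[0]
    A = np.zeros((n_j + n_o, n_j + n_o))
    # Joint-joint edges: Euclidean distance between 3D positions.
    A[:n_j, :n_j] = np.linalg.norm(joint_xyz[:, None] - joint_xyz[None, :], axis=-1)
    # Object-object edges: 2D distance between bounding-box centroids.
    A[n_j:, n_j:] = np.linalg.norm(object_uv[:, None] - object_uv[None, :], axis=-1)
    # Joint-object edges: 2D distance between joint projection and object centroid.
    jo = np.linalg.norm(joint_uv[:, None] - object_uv[None, :], axis=-1)
    A[:n_j, n_j:] = jo
    A[n_j:, :n_j] = jo.T
    np.fill_diagonal(A, 0.0)
    return A / A.max()   # scale the distances (used directly as weights in the paper) to [0, 1]
```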
Each vertex is characterized by 24 features:

• 9 numbers representing the 3 × 3 orthonormal matrix characterizing the rotation of a joint's local coordinates with respect to a reference coordinate system;
• 1 boolean number representing the confidence value of a joint orientation measurement—i.e. it is 1 when tracking seems to work, 0 if tracking fails and the measurement apparatus is uncertain about the output;
• 3 numbers representing the position of a joint in the 3D space;
• 1 boolean number representing the confidence value of a joint position measurement;
• 6 numbers representing the transform matrix matching the SIFT features of an object to the ones belonging to the previous frame;
• 4 numbers representing the coordinates of the bounding box of an object within the frame.

Note that object-related features are set to zero for joint vertices and vice versa.

Since the videos have different lengths, all the sequences have been zero-padded to match the longest one, which has 1298 frames. Moreover, the zero-padded steps are not considered during training and testing, i.e. both the loss function evaluation and the metrics disregard the padded steps. Note that no temporal segmentation data have been used: all the experimented architectures worked directly at the frame level with no ground-truth information regarding the similarity between consecutive frames, i.e. segments of consecutive frames belonging to the same ground-truth sub-activity label.

Finally, the feature columns have been standardized. The resulting dataset is composed of 122 × 1298 vertex-feature matrices belonging to R^(34×24), 122 × 1298 adjacency matrices (in R^(34×34)) and each graph belongs to one of the 10 classes.

HDM05. This graph-focused dataset (http://resources.mpi-inf.mpg.de/HDM05/) of Müller et al. [39] contains more than three hours of systematically recorded motion capture data. The motion sequences are performed by five non-professional actors. Most of the sequences have been performed several times by all five actors according to the guidelines fixed in a script. The script consists of five parts, where each part is subdivided into several scenes, for a grand total of 20 scenes. Since

one of these 20 scenes has been performed only once, is has been ployed. Note that two ways of assessing the networks perfor-
dropped. mances have been used for the DBLP and the CIAW datasets. Since
Similarly to the CAD-120 dataset, the task is to classify each in this case not all the vertices appear all the time in the graph
frame according to the scene it represents by exploiting graph- sequences (i.e. in DBLP only 114 authors out of 25215 appear all
structured data extracted from the tracked motion capture data, the years, while in CIAW 4 people out of 60 appear in all the
thus resulting in a graph-focused application. Each person is char- snapshots), the two aforementioned metrics on the test set have
acterized by 31 body parts, which have been tracked both in posi- been evaluated in the following ways: (i) taking into account all
tion and orientation for each frame. The motion capture data have the snapshots in the label sequences for all the vertices, regard-
been downsampled by considering only one frame every 24 and a less of whether or not the vertices appear in the snapshot taken
weighted graph of 31 vertices (the body parts) has been built for into account; (ii) taking into account for each snapshot in the label
each frame. The graph edges have been built employing the same sequences just the vertices appearing in the snapshot.
approach proposed for the CAD-120 joint-joint edges. The consid- Finally, the training time of each architecture with the best
ered vertex features are the 3D position of the corresponding body hyper-parameter configurations has been analyzed.15 The training
parts, together with the 3 Euler angles representing their orienta- time evaluation has been performed on 2 × 10-core Intel Xeon 4114
tion, thus resulting in a 6-dimensional feature vector. @ 2.20 GHz, 64 GB RAM, with a single Nvidia Tesla P100 GPU. All
Since the videos have different lengths, the sequences have the architectures have been implemented in Keras 1.2.2 with Ten-
been zero-padded to match the longest one, which has 1654 sorflow 1.11.0 as a backend.
frames. Note that also in this case the zero-padded steps are not
considered both during training and testing. It is important to un- 4.3. Results
derline that also for this experiment, no temporal segmentation
data have been used. Since the techniques presented in this work are, to the best of
Finally, the features columns have been standardized. The re- the authors knowledge, the first attempts to provide general pur-
sulting dataset is composed of 88 × 1654 vertex-feature matrices pose neural architectures to address classification problems in the
belonging to R31×6 , 88 × 1654 adjacency matrices (in R31×31 ) and context of dynamic graph datasets, other general purpose tech-
each graph belongs to one of the 19 classes. niques based on neural networks have been chosen. The baselines
feature all the base elements of the proposed architectures.
4.2. Experimental settings In particular, for all the datasets one neural network for each
of the following groups has been considered: (i) simple networks
In the experiments, the results achieved by the proposed archi- made of two fully connected layers; (ii) graph convolutional net-
tectures have been compared with those obtained by other base- works; (ii) networks made of a LSTM followed by a fully connected
line networks (see Section 4.3 for a full description of the chosen layer; (iv) networks made of a fully connected layer followed by
baselines). a LSTM and another fully connected layer. In particular, (ii) and
For the baselines that are not able to explicitly exploit sequen- (iii) allow us to assess the performance of our new architectures
tiality in the data, the temporal dimension of all the sequences has against their building blocks, while (iv) allows to ascertain whether
been flattened, thus considering the same point at two different it is the increased depth of WD-GCN and CD-GCN the responsible for
times as two different samples. better results when compared to (iii).
Unless otherwise stated, the hyper-parameters10 of all the net- DBLP. The approaches proposed in Section 3.2 (WD-GCN and
works (in terms of number of nodes of each layer and dropout CD-GCN) have been compared against the following baseline
rate11 ) have been appropriately tuned by means of a grid12 ap- methodologies: (i) a GCN composed of two layers, (ii) a network
proach. The performances have been assessed employing 10 itera- made of two FC layers, (iii) a network composed of LSTM+FC, (iv)
tions of Monte Carlo Cross-Validation13 , preserving the percentage and a deeper architecture made of FC+LSTM+FC. Note that the FC
of samples for each class. It is important to underline that to keep is a Fully Connected layer; when it appears as the first layer of a
the experiments as fair as possible, the 10 train/test sets are gen- network it employs a ReLU activation, a softmax activation is used
erated once and then used to evaluate all the architectures. More- when it is the last layer of a network, in order to get as output a
over, the training phase has been performed using the Adam op- k-class probability vector.
timizer [40] for a maximum of 100 epochs. For each network, the First of all, the 500 vertices version of the DBLP dataset has
epoch where the model achieved the best performance on the val- been considered. The test set contains 30% of the 500 vertices.
idation set has been selected, and the performance of correspond- Moreover, 20% of the remaining vertices have been used for valida-
ing trained model has been assessed on the test set. tion purposes. It is important to underline that an unlabeled vertex
To assess the performances of all the considered architectures, remains unlabeled for all the years in the sequence, i.e. considering
the metrics Accuracy and Unweighted F1 Measure14 have been em- Definition 6, P Labi = P Lab , ∀i ∈ ZT .
pose neural architectures to address classification problems in the context of dynamic graph datasets, other general purpose techniques based on neural networks have been chosen. The baselines feature all the base elements of the proposed architectures.

In particular, for all the datasets one neural network for each of the following groups has been considered: (i) simple networks made of two fully connected layers; (ii) graph convolutional networks; (iii) networks made of an LSTM followed by a fully connected layer; (iv) networks made of a fully connected layer followed by an LSTM and another fully connected layer. Groups (ii) and (iii) allow us to assess the performance of our new architectures against their building blocks, while (iv) allows us to ascertain whether it is the increased depth of WD-GCN and CD-GCN that is responsible for the better results when compared to (iii).

DBLP. The approaches proposed in Section 3.2 (WD-GCN and CD-GCN) have been compared against the following baseline methodologies: (i) a GCN composed of two layers, (ii) a network made of two FC layers, (iii) a network composed of LSTM+FC, and (iv) a deeper architecture made of FC+LSTM+FC. Note that FC is a Fully Connected layer; when it appears as the first layer of a network it employs a ReLU activation, while a softmax activation is used when it is the last layer of a network, in order to output a k-class probability vector.

First of all, the 500 vertices version of the DBLP dataset has been considered. The test set contains 30% of the 500 vertices. Moreover, 20% of the remaining vertices have been used for validation purposes. It is important to underline that an unlabeled vertex remains unlabeled for all the years in the sequence, i.e., considering Definition 6, P^Lab_i = P^Lab, ∀ i ∈ Z_T.

Tables 3 and 4 show the best hyper-parameter configurations, together with the test results of all the evaluated architectures for the 500 vertices DBLP. In particular, Table 3 shows the results in which the metrics have been evaluated considering all the years in the label sequences for all the vertices, while Table 4 takes into account only the years of the label sequences where each vertex appears.
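As a concrete, hedged illustration of the simplest sequential baseline listed above (LSTM+FC) and of the semi-supervised setting in which an unlabeled vertex stays unlabeled across the whole sequence, the sketch below applies a single LSTM, with weights shared across vertices, to every vertex of the sequence and masks the loss on unlabeled vertices. The tensor shapes, the feature size and the use of -1 as the "unlabeled" marker are assumptions made for the example; this is not the authors' implementation.

```python
# Schematic LSTM+FC baseline for semi-supervised classification of a sequence of
# vertices. Assumed shapes: x is (T, N, F) with T time steps, N vertices and F
# features per vertex; labels is (T, N) with -1 marking unlabeled vertices.
import torch
import torch.nn as nn


class VertexLSTMBaseline(nn.Module):
    def __init__(self, in_features, hidden, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(input_size=in_features, hidden_size=hidden)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x):              # x: (T, N, F)
        h, _ = self.lstm(x)            # the LSTM runs over time, shared across vertices
        return self.classifier(h)      # logits: (T, N, num_classes)


def masked_loss(logits, labels):
    # Cross-entropy computed only on labeled vertices (label != -1).
    mask = labels.reshape(-1) != -1
    flat_logits = logits.reshape(-1, logits.shape[-1])[mask]
    flat_labels = labels.reshape(-1)[mask]
    return nn.functional.cross_entropy(flat_logits, flat_labels)


model = VertexLSTMBaseline(in_features=64, hidden=100, num_classes=3)
x = torch.randn(10, 500, 64)              # e.g. 10 years, 500 vertices, 64 features
labels = torch.randint(-1, 3, (10, 500))  # -1 marks unlabeled vertices
loss = masked_loss(model(x), labels)
loss.backward()
```

For the baselines that cannot exploit sequentiality, the same tensors would simply be reshaped so that each (vertex, time step) pair becomes an independent sample, as described in Section 4.2.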
Figs. 4 and 5 show the confusion matrices for each of the best configurations15 of Tables 3 and 4, respectively. By looking at the confusion matrices, it is clear that both WD-GCN and CD-GCN achieved either equivalent or better results than the baseline architectures for each of the target classes, thus resulting in better

Fig. 4. Confusion matrices for the best configuration of each architecture of Table 3. Rows represent real target classes, while columns are the predicted ones. Each entry
shows the number of vertices falling into the relative true class vs predicted class, averaged across the full 10-fold Monte Carlo cross-validation and rounded to the unit. In
brackets, the corresponding standard deviation is shown.

predictions not only for the minority class (i.e. Class 3) but also for the ones bigger in size.

On the 500 vertices DBLP dataset, all the networks have also been tested by changing the ratio of the labeled vertices (i.e. the size of the training set) as follows: 20%, 30%, 40%, 50%, 60%, 70%, 80%. To obtain robust estimations, the performances have been averaged by means of 10 iterations of Monte Carlo Cross-Validation. Fig. 6 reports the results of these experiments, considering the best configuration for each of the architectures in Tables 3 and 4 and evaluating the performance metrics by: (a) taking into account all the years in the label sequences for all the vertices, regardless of whether or not the vertices appear in the year taken into account; (b) taking into account, for each year in the label sequences, just the vertices appearing in the year taken into account.

Finally, employing the hyper-parameters of the best configuration for each of the architectures in Tables 3 and 4, the performance of all the networks has been assessed on the much bigger 5000 and 20000 vertices versions of the DBLP dataset. Also in this case,

Table 3
Results of the evaluated architectures on a semi-supervised classification of a sequence of vertices employing the 500 vertices version of the DBLP dataset. Metrics
have been evaluated considering all the years in the label sequences for all the vertices, regardless of whether or not the vertices appear in the year taken into
account. A Wilcoxon Test shows a p-value < 0.6% when comparing WD-GCN and CD-GCN against all the baselines for both the scores.

Network Hyper-params Grid Accuracy Unweighted F1 Measure

Best config. Performance mean ± std Best config. Performance mean ± std

FC+FC 1st FC nodes: {150, 200, 250, 300, 350, 400} 250 49.1% ± 1.1% 250 48.4% ± 1.1%
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 50% 40%
GC+GC 1st GC nodes: {150, 200, 250, 300, 350, 400} 350 54.4% ± 1.2% 350 54.0% ± 1.7%
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 50% 10%
LSTM+FC LSTM nodes: {100, 150, 200, 300, 400} 100 60.2% ± 2.1% 100 60.2% ± 2.5%
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 0% 0%
FC+LSTM+FC FC nodes: {100, 200, 300, 400} 300 62.5% ± 2.3% 300 62.6% ± 2.5%
LSTM nodes: {100, 200, 300, 400} 300 300
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 50% 50%
WD-GCN wd-GC nodes: {100, 200, 300, 400} 300 70.1% ± 3.1% 400 69.8% ± 3.1%
v-LSTM nodes: {100, 200, 300, 400} 300 300
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 50% 0%
CD-GCN cd-GC nodes: {100, 200, 300, 400} 200 69.3% ± 3.4% 200 69.3% ± 3.1%
v-LSTM nodes: {100, 200, 300, 400} 100 100
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 50% 50%

Table 4
Results of the evaluated architectures on a semi-supervised classification of a sequence of vertices employing the 500 vertices version of the DBLP dataset. Metrics
have been evaluated considering for each year in the label sequences just the vertices appearing in the year taken into account. A Wilcoxon Test shows a p-value
≈ 0.5% when comparing WD-GCN and CD-GCN against all the baselines for both the scores.

Network Hyper-params Grid Accuracy Unweighted F1 Measure

Best config. Performance mean ± std Best config. Performance mean ± std

FC+FC 1st FC nodes: {150, 200, 250, 300, 350, 400} 250 53.5% ± 1.2% 250 50.9% ± 1.5%
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 40% 50%
GC+GC 1st GC nodes: {150, 200, 250, 300, 350, 400} 300 68.4% ± 2.0% 250 66.6% ± 3.4%
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 50% 20%
LSTM+FC LSTM nodes: {100, 150, 200, 300, 400} 100 62.5% ± 2.5% 100 62.2% ± 3.5%
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 0% 0%
FC+LSTM+FC FC nodes: {100, 200, 300, 400} 200 64.6% ± 2.4% 100 63.9% ± 2.7%
LSTM nodes: {100, 200, 300, 400} 100 200
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 50% 50%
WD-GCN wd-GC nodes: {100, 200, 300, 400} 100 77.4% ± 3.4% 100 76.9% ± 3.4%
v-LSTM nodes: {100, 200, 300, 400} 300 300
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 40% 40%

Table 5
Results of the best configuration for each architecture of Table 3 on semi-supervised classification of sequence of vertices em-
ploying the 500, 5000 and 20000 vertices versions of the DBLP dataset. Metrics have been evaluated considering all the years
in the label sequences for all the vertices, regardless of whether or not the vertices appear in the year taken into account. A
Wilcoxon Test shows p-value < 0.6% when comparing WD-GCN and CD-GCN against all the baselines for both the scores.

Network Accuracy (mean ± std) Unweighted F1 Measure (mean ± std)

DBLP 500 DBLP 5000 DBLP 20000 DBLP 500 DBLP 5000 DBLP 20000


FC+FC 49.1% ± 1.1% 38.1% ± 0.3% 31.7% ± 0.1% 48.4% ± 1.1% 34.6% ± 0.4% 24.5% ± 0.2%
GC+GC 54.4% ± 1.2% 41.4% ± 0.3% 27.2% ± 0.1% 54.0% ± 1.7% 39.8% ± 0.5% 24.3% ± 0.2%
LSTM+FC 60.2% ± 2.1% 56.7% ± 0.7% 50.2% ± 0.3% 60.2% ± 2.5% 56.0% ± 0.9% 49.6% ± 0.4%
FC+LSTM+FC 62.5% ± 2.3% 57.8% ± 0.9% 50.8% ± 0.4% 62.6% ± 2.5% 57.4% ± 0.9% 50.4% ± 0.5%
WD-GCN 70.1% ± 3.1% 64.7% ± 1.1% 56.3% ± 0.7% 69.8% ± 3.1% 64.9% ± 1.3% 56.9% ± 0.9%
CD-GCN 69.3% ± 3.4% 64.1% ± 0.7% 55.5% ± 0.3% 69.3% ± 3.1% 64.5% ± 0.7% 56.0% ± 0.4%

the test set is made of 30% of the vertices and the validation one is made of the remaining 20%. The results have been averaged by means of 10 iterations of Monte Carlo Cross-Validation. Tables 5 and 6 show the results of this experiment.

Both the proposed architectures have overcome the results achieved by the considered baselines, both when evaluating the performance metrics by (a) taking into account all the years in the label sequences for all the vertices, regardless of whether or not the vertices appear in the year taken into account, and (b) taking into account, for each year in the label sequences, just the vertices appearing in the year taken into account.

Moreover, the WD-GCN and the CD-GCN performances are roughly equivalent in terms of Accuracy and Unweighted F1 Measure. Architectures such as GCNs and LSTMs are most likely limited by their inability to jointly exploit graph structure and long short-term dependencies (see also the part discussing the synthetic dataset within Section 4.3). Note that the structure of the graphs appearing in the sequence is not exclusively conveyed by the DeepWalk vertex-features and it is effectively captured by the GC units. Indeed, the two-layer GCN has obtained better results with respect to those achieved by the two FC layers.

Fig. 5. Confusion matrices for the best configuration of each architecture of Table 4. Rows represent real target classes, while columns are the predicted ones. Each entry
shows the number of vertices falling into the relative true class vs predicted class, averaged across the full 10-fold Monte Carlo cross-validation and rounded to the unit. In
brackets, the corresponding standard deviation is shown.

It is important to underline that the WD-GCN and the CD-GCN have achieved better performances with respect to the baselines not because they exploit a greater number of parameters, or since they are deeper, rather:

• The baseline architectures have achieved their best performances without employing the maximum of the allowed number of nodes, thus showing that their performance is unlikely to become better with an even greater number of nodes;
• The number of parameters of our approaches is significantly lower than the number of parameters of the biggest employed network: i.e. the best WD-GCN and CD-GCN have, respectively, 872206 and 163406 parameters, while the largest tested network is the FC+LSTM+FC with 1314006 parameters (such counts can be obtained as sketched after this list);
• The FC+LSTM+FC network has a comparable depth with respect to our approaches, but it has achieved lower performance.
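The parameter counts quoted in the list above can be obtained, for any of the compared networks, by summing the sizes of all trainable tensors; the model below is a generic stand-in, not one of the actual architectures.

```python
# Counting trainable parameters of a model (stand-in two-layer fully connected network).
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 250), nn.ReLU(), nn.Linear(250, 3))
n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # weights and biases of both layers
```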

Fig. 6. The figure shows the performances of the tested approaches (on the 500 vertices DBLP dataset) varying the ratio of the labeled vertices—i.e. size of the training
set divided by the number of total samples. The vertical bars represent the standard deviation of the performances achieved in the 10 iterations of the Monte Carlo cross-
validation. The top plots show Accuracy and Unweighted F1 Measure for the best configuration of each architecture of Table 3, taking into account all the years in the label
sequences for all the vertices, regardless of whether or not the vertices appear in the year taken into account. The bottom plots show the same metrics for Table 4, this time
evaluated by taking into account for each year in the label sequences just the vertices appearing in the year taken into account.

Table 6
Results of the best configuration for each architecture of Table 4 on semi-supervised classification of sequence of vertices em-
ploying the 500, 5000 and 20000 vertices versions of the DBLP dataset. Metrics have been evaluated considering for each year
in the label sequences just the vertices appearing in the year taken into account. A Wilcoxon Test shows a p-value ≈ 0.5% when
comparing WD-GCN and CD-GCN against all the baselines for both the scores.

Network Accuracy (mean ± std) Unweighted F1 Measure (mean ± std)

DBLP 500 DBLP 5000 DBLP 20000 DBLP 500 DBLP 5000 DBLP 20000


FC+FC 53.5% ± 1.2% 58.4% ± 0.9% 62.5% ± 0.5% 50.9% ± 1.5% 57.0% ± 1.0% 61.0% ± 0.4%
GC+GC 68.4% ± 2.0% 71.9% ± 1.1% 73.7% ± 0.4% 66.6% ± 3.4% 71.8% ± 0.8% 73.9% ± 0.5%
LSTM+FC 62.5% ± 2.5% 68.4% ± 1.2% 69.7% ± 0.4% 62.2% ± 3.5% 67.6% ± 1.1% 68.9% ± 0.5%
FC+LSTM+FC 64.6% ± 2.4% 70.8% ± 1.2% 71.2% ± 0.4% 63.9% ± 2.7% 70.0% ± 1.0% 70.1% ± 0.5%
WD-GCN 77.4% ± 3.4% 79.1% ± 0.7% 77.5% ± 0.6% 76.9% ± 3.4% 78.7% ± 0.8% 76.9% ± 0.6%
CD-GCN 76.6% ± 3.4% 80.2% ± 0.9% 80.5% ± 0.5% 75.6% ± 3.8% 79.7% ± 1.1% 80.2% ± 0.4%

Moreover, WD-GCN and CD-GCN have shown little sensitivity to the labeling ratio, consistently outperforming the baselines by more than 5 percentage points in both Accuracy and Unweighted F1 Measure, regardless of its value. Furthermore, they have achieved better performance than the baselines when tested also on bigger graphs, further demonstrating the robustness of our methods.

Finally, Fig. 7 represents the wall-clock training time cost in seconds for all the best architectures of Table 3 on the 500, 5000 and 20000 vertices versions of the DBLP dataset. It can be seen that both the WD-GCN and CD-GCN training times are comparable to those of the baselines. Moreover, when increasing the graph size, CD-GCN, which achieves results very similar to the WD-GCN, can be trained faster than FC+LSTM+FC, which is the best performer among the baselines.

CIAW. The performance of WD-GCN and CD-GCN has been measured against the same baselines as in the DBLP dataset. Again, the test set contains 30% of the vertices and 20% of the remaining ones have been used for validation purposes.

Tables 7 and 8 show the best hyper-parameter configurations together with the test results of all the evaluated architectures for the CIAW dataset. In particular, Table 7 shows the results in which the metrics have been evaluated considering all the snapshots in the label sequences for all the vertices, regardless of whether or not the vertices appear in the snapshot taken into account. On the other hand, Table 8 takes into account, for each snapshot in the label sequences, just the vertices appearing in the snapshot taken into account. Finally, Fig. 8 represents the wall-clock training time cost in seconds for all the best architectures of Table 7.

The considerations are similar to those made for the DBLP dataset: our techniques have outperformed all the baselines, with CD-GCN achieving statistically significantly better results than the other architectures. Moreover, it can be seen that the training time of both of our techniques is very similar to the one of FC+LSTM+FC, which is the best performer among the baselines, and even faster than the slower baseline networks, i.e. GC+GC.

Synthetic dataset. The performance of the WD-GCN and CD-GCN has been measured against the same baselines as in the DBLP dataset. Again, the test set contains 30% of the vertices and 20% of the remaining ones have been used for validation purposes.

Fig. 7. The box plots represent the training time in seconds taken by the best configuration of each architecture of Table 3 on semi-supervised classification of sequence of
vertices employing the 500, 5000 and 20000 vertices versions of the DBLP dataset.

Table 7
Results of the evaluated architectures on a semi-supervised classification of a sequence of vertices employing the CIAW dataset. Metrics have been evaluated
considering all the snapshots in the label sequences for all the vertices, whether or not the vertices appear in the snapshot taken into account. A Wilcoxon Test
shows p-value < 3% for the Accuracy score and < 5% for the Unweighted F1 Measure when comparing the best of the proposed architecture against all the baselines.

Network Hyper-params Grid Accuracy Unweighted F1 Measure

Best config. Performance mean ± std Best config. Performance mean ± std

FC+FC 1st FC nodes: {150, 200, 250, 300, 350, 400} 150 62.5% ± 3.6% 150 52.0% ± 2.4%
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 20% 0%
GC+GC 1st GC nodes: {150, 200, 250, 300, 350, 400} 400 64.7% ± 3.5% 150 54.6% ± 3.9%
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 0% 40%
LSTM+FC LSTM nodes: {100, 150, 200, 300, 400} 100 79.3% ± 4.1% 100 66.7% ± 5.2%
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 40% 20%
FC+LSTM+FC FC nodes: {100, 200, 300, 400} 400 80.7% ± 3.5% 400 66.8% ± 5.4%
LSTM nodes: {100, 200, 300, 400} 300 30%
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 300 10%
WD-GCN wd-GC nodes: {100, 200, 300, 400} 400 81.3% ± 2.7% 100 68.5% ± 9.0%
v-LSTM nodes: {100, 200, 300, 400} 200 100
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 10% 10%
CD-GCN cd-GC nodes: {100, 200, 300, 400} 400 81.9% ± 4.0% 400 69.0% ± 5.6%
v-LSTM nodes: {100, 200, 300, 400} 200 200
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 30% 20%

Table 8
Results of the evaluated architectures on a semi-supervised classification of a sequence of vertices employing the CIAW dataset. Metrics have been evaluated
considering for each snapshot in the label sequences just the vertices appearing in the snapshot taken into account. A Wilcoxon Test shows p-value < 5% for both
the metrics when comparing the best of the proposed architecture against all the baselines.

Network Hyper-params Grid Accuracy Unweighted F1 Measure

Best config. Performance mean ± std Best config. Performance mean ± std

FC+FC 1st FC nodes: {150, 200, 250, 300, 350, 400} 150 77.1% ± 2.0% 300 64.6% ± 3.9%
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 0% 10%
GC+GC 1st GC nodes: {150, 200, 250, 300, 350, 400} 200 80.8% ± 2.1% 400 70.0% ± 5.5%
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 20% 10%
LSTM+FC LSTM nodes: {100, 150, 200, 300, 400} 150 84.6% ± 3.4% 150 68.8% ± 2.4%
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 40% 40%
FC+LSTM+FC FC nodes: {100, 200, 300, 400} 300 86.2% ± 2.6% 400 71.6% ± 3.1%
LSTM nodes: {100, 200, 300, 400} 300 300
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 0% 20%
WD-GCN wd-GC nodes: {100, 200, 300, 400} 400 86.6% ± 3.7% 300 72.6% ± 3.4%
v-LSTM nodes: {100, 200, 300, 400} 100 100
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 40% 30%
CD-GCN cd-GC nodes: {100, 200, 300, 400} 400 86.7% ± 2.6% 400 72.1% ± 3.0%
v-LSTM nodes: {100, 200, 300, 400} 200 400
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 30% 40%

Table 9 shows the best hyper-parameter configurations together with the test results of all the architectures, where the synthetic dataset is the one called Configuration 1 in Table 2. It can be seen that, with respect to FC+FC, both GC+GC and the LSTM-based baselines can achieve better results, the former exploiting the graph structure, the latter leveraging the sequence of data. Both WD-GCN and CD-GCN can exploit both graph structure information and sequential correlations at the same time, thus significantly increasing the final network performance.

To confirm the previous interpretation of the results, the experiment has been repeated for the best configurations of Table 9 on a new synthetic dataset with a noisier graph structure, namely Con-
figuration 2 of Table 2. The results presented in Table 10 show that
the added noise greatly reduces the capability of the GC units to
exploit the structured information within the graph, which in turn
implies a decrease in the performance for all GC+GC, WD-GCN
and CD-GCN. Nevertheless, CD-GCN is less affected than WD-GCN
thanks to the skip-connection built within it, thus allowing it to
achieve better performance with respect to all the baselines.
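A schematic sketch of the feature-augmentation idea mentioned above may help: the graph-convolved features are concatenated with the raw vertex features before the vertex-wise LSTM, so that the recurrent part can still rely on the raw features when the graph structure is corrupted by noise. The snippet below is an assumption-laden simplification (a single graph-convolution step, pre-normalized adjacency matrices, arbitrary sizes), not the exact CD-GCN layer definitions given in Section 3.

```python
# Schematic sketch of a skip-connected graph convolution followed by a vertex-wise LSTM.
# Assumed shapes: x is (T, N, F); adj is (T, N, N) and already normalized.
import torch
import torch.nn as nn


class SkipGraphConvLSTMSketch(nn.Module):
    def __init__(self, in_features, gc_features, hidden, num_classes):
        super().__init__()
        self.gc_weight = nn.Linear(in_features, gc_features, bias=False)
        self.lstm = nn.LSTM(input_size=in_features + gc_features, hidden_size=hidden)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x, adj):
        gc = torch.relu(torch.bmm(adj, self.gc_weight(x)))   # graph convolution per time step
        h, _ = self.lstm(torch.cat([x, gc], dim=-1))          # skip connection: raw + graph features
        return self.classifier(h)                              # logits: (T, N, num_classes)
```

This mirrors the explanation given above for why CD-GCN degrades less than WD-GCN when noise is added to the graph structure (Table 10).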
Finally, Fig. 9 represents the wall-clock training time cost in
seconds for all the best architectures of Table 9. Also in this case,
the training time of our techniques is comparable to the one of
the slower baseline network (LSTM+FC) and still in the same or-
der of magnitude as the best and faster baseline performer, the
FC+LSTM+FC.
CAD-120. The approaches proposed in Section 3.3 have been
compared against a GC+gs-FC network, a vs-FC+gs-FC ar-
chitecture, a v-LSTM+gs-FC network and a deeper architecture
made of vs-FC+v-LSTM+gs-FC. Note that for these architec-
tures, the vs-FCs are used with a ReLU activation, instead of a
softmax, since they appear as the first layer of the network.

Of the videos, 10% have been selected for testing the performances of the model and 10% of the remaining videos have been employed for validation.

Fig. 8. The box plots represent the training time in seconds taken by the best configuration of each architecture of Table 7 on semi-supervised classification of sequence of vertices employing the CIAW dataset.
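For the graph-focused tasks, a rough sketch of the final classification stage may clarify the setting: per-vertex features are pooled into a single descriptor for each time step and then mapped to class scores. The mean pooling and the single linear layer below are only assumptions made for illustration; the actual vs-FC, v-LSTM and gs-FC layers are defined in Section 3.3.

```python
# Hedged sketch of a graph-level readout for classifying a sequence of graphs.
import torch
import torch.nn as nn


class GraphReadoutClassifierSketch(nn.Module):
    def __init__(self, in_features, num_classes):
        super().__init__()
        self.gs_fc = nn.Linear(in_features, num_classes)

    def forward(self, vertex_features):           # (T, N, F): per-vertex features over time
        graph_repr = vertex_features.mean(dim=1)  # pool the N vertices: (T, F)
        return self.gs_fc(graph_repr)             # one class-score vector per time step: (T, C)
```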
Table 11 shows the results of this experiment, while Fig. 10a
represents the wall-clock training time cost in seconds for all the
best architectures of the same table. The obtained results have
shown that only CD-GCN has outperformed the baselines, while
WD-GCN has reached performances similar to those obtained by
the baseline architectures. This difference may be due to the low
number of vertices in the sequence of graphs. Under this setting,
the predictive power of the graph convolutional features is less
effective and the CD-GCN approach, which augments the plain
vertex-features with the graph convolutional ones, provides an ad-
vantage. Hence, one can further suppose that while WD-GCN and
CD-GCN are suitable to effectively exploit structure in graphs with
high vertex-cardinality, only the latter can deal with datasets with
a limited number of nodes. It is worth noting that although all the
experiments have shown a high variance in their performances, the
Wilcoxon Test has shown that CD-GCN is statistically better than
the baselines with a p-value < 5% for the Unweighted F1 Measure
and < 10% for the Accuracy. This reveals that in almost every it-
eration of the Monte Carlo Cross-Validation, the CD-GCN has per-
formed better than the baselines.

Fig. 9. The box plots represent the training time in seconds taken by the best configuration for each architecture of Table 9 on semi-supervised classification of sequence of vertices employing the synthetic dataset.

Finally, the same considerations presented for the DBLP dataset
regarding the depth and the number of parameters are valid also

Fig. 10. The box plots represent the training time in seconds taken by the best configuration of each architecture of Tables 11 and 12 on a supervised classification of a
sequence of graphs on the CAD-120 and HDM05 dataset, respectively.

Table 9
Results of the evaluated architectures on a semi-supervised classification of a sequence of vertices employing the Configuration 1 of Table 2 for the synthetic dataset.
A Wilcoxon Test shows a p-value < 5% for both the Accuracy score and the Unweighted F1 Measure when comparing the best of the proposed architecture against
all the baselines.

Network Hyper-params Grid Accuracy Unweighted F1 Measure

Best config. Performance mean ± std Best config. Performance mean ± std

FC+FC 1st FC nodes: {150, 200, 250, 300, 350, 400} 150 65.6% ± 3.0% 150 65.4% ± 3.2%
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 40% 40%
GC+GC 1st GC nodes: {150, 200, 250, 300, 350, 400} 400 73.0% ± 3.6% 300 72.8% ± 3.3%
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 20% 0%
LSTM+FC LSTM nodes: {100, 150, 200, 300, 400} 100 73.2% ± 1.4% 100 73.1% ± 1.8%
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 0% 10%
FC+LSTM+FC FC nodes: {100, 200, 300, 400} 100 73.8% ± 1.4% 300 73.8% ± 1.7%
LSTM nodes: {100, 200, 300, 400} 100 100
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 20% 20%
WD-GCN wd-GC nodes: {100, 200, 300, 400} 400 95.1% ± 0.8% 400 95.1% ± 0.8%
v-LSTM nodes: {100, 200, 300, 400} 100 100
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 40% 40%
CD-GCN cd-GC nodes: {100, 200, 300, 400} 400 96.3% ± 0.7% 400 96.3% ± 0.7%
v-LSTM nodes: {100, 200, 300, 400} 100 100
dropout: {0%, 10%, 20%, 30%, 40%, 50%} 40% 40%

Table 10
Results of the best configuration for each architecture of Table 9 on a semi-supervised classification of a sequence
of vertices, both for the datasets generated using Configuration 1 and Configuration 2 of Table 2.

Network Accuracy (mean ± std) Unweighted F1 Measure (mean ± std)

Configuration one Configuration two Configuration one Configuration two

FC+FC 65.6% ± 3.0% 65.6% ± 3.0% 65.4% ± 3.2% 65.4% ± 3.2%


GC+GC 73.0% ± 3.6% 69.4% ± 2.5% 72.8% ± 3.3% 69.3% ± 2.7%
LSTM+FC 73.2% ± 1.4% 73.2% ± 1.4% 73.1% ± 1.8% 73.1% ± 1.8%
FC+LSTM+FC 73.8% ± 1.4% 73.8% ± 1.4% 73.8% ± 1.7% 73.8% ± 1.7%
WD-GCN 95.1% ± 0.8% 73.1% ± 2.9% 95.1% ± 0.8% 72.9% ± 2.9%
CD-GCN 96.3% ± 0.7% 79.6% ± 2.5% 96.3% ± 0.7% 79.5% ± 2.5%

Table 11
Results of the evaluated architectures on a supervised classification of a sequence of graphs employing the CAD-120 dataset. CD-GCN is the only technique comparing
favorably to all the baselines, resulting in a Wilcoxon Test with a p-value lower than 5% for the Unweighted F1 Measure and lower than 10% for the Accuracy.

Network Hyper-params Grid Accuracy Unweighted F1 Measure

Best config. Performance mean ± std Best config. Performance mean ± std

vs-FC+gs-FC 1st vs-FC nodes: {100, 200, 250, 300} 100 49.9% ± 5.2% 200 48.1% ± 7.2%
dropout: {0%, 20%, 30%, 50%} 20% 20%
GC+gs-FC 1st GC nodes: {100, 200, 250, 300} 250 46.2% ± 3.0% 250 36.7% ± 7.9%
dropout: {0%, 20%, 30%, 50%} 30% 50%
v-LSTM+gs-FC LSTM nodes: {100, 150, 200, 300} 150 56.8% ± 4.1% 150 53.0% ± 9.9%
dropout: {0%, 20%, 30%, 50%} 0% 0%
vs-FC+v-LSTM+gs-FC vs-FC nodes: {100, 200, 250, 300} 200 58.7% ± 1.5% 200 57.5% ± 2.9%
v-LSTM nodes: {100, 150, 200, 300} 150 150
dropout: {0%, 20%, 30%, 50%} 20% 20%
WD-GCN wd-GC nodes: {100, 200, 250, 300} 250 54.3% ± 2.6% 250 50.6% ± 6.3%
v-LSTM nodes: {100, 150, 200, 300} 150 150
dropout: {0%, 20%, 30%, 50%} 30% 30%
CD-GCN cd-GC nodes: {100, 200, 250, 300} 250 60.7% ± 8.6% 250 61.0% ± 5.3%
v-LSTM nodes: {100, 150, 200, 300} 150 150
dropout: {0%, 20%, 30%, 50%} 30% 30%

for this set of data. Also in this case, the training time of our techniques is comparable to the one of the baseline networks, always in the same order of magnitude of the one of the best baseline performer, i.e. vs-FC+v-LSTM+gs-FC. Note that in this dataset all the networks have much longer training times with respect to the previous vertex-focused datasets due to the length of the sample sequence, i.e. 1298 compared to 10, 20 and 20 for DBLP, CIAW and the synthetic datasets respectively.

HDM05. Finally, the same architectures employed for the CAD-120 dataset have been compared. 10% of the data has been used as a test set and 10% of the remaining data for validation.

Table 12 shows the results of this experiment and Fig. 10b represents the wall-clock training time cost in seconds for all the best architectures of the same table. The obtained results have shown that CD-GCN has outperformed the baselines, while WD-GCN has reached performances similar to those obtained by the baseline architectures. It is important to note that under these settings the performances of GC+gs-FC are worse than those of vs-FC+gs-FC, showing that the graph information is limited in this case. However, the graph-convolutional features extracted by the GC units, when considered together with their sequentiality (as in CD-GCN), help in overcoming the performances of a network that deals only

Table 12
Results of the evaluated architectures on a supervised classification of a sequence of graphs employing the HDM05 dataset. CD-GCN compares favorably to all the baselines,
resulting in a Wilcoxon test with a p-value lower than 2% for Accuracy and Unweighted F1 Measure.

Network Hyper-params Grid Accuracy Unweighted F1 Measure

Best config. Performance mean ± std Best config. Performance mean ± std

vs-FC+gs-FC 1st vs-FC nodes: {100, 200, 300} 200 33.7% ± 5.3% 300 30.0% ± 4.0%
dropout: {20%} 20% 20%
GC+gs-FC 1st GC nodes: {100, 200, 300} 200 30.8% ± 3.9% 200 26.5% ± 4.2%
dropout: {20%} 20% 20%
v-LSTM+gs-FC LSTM nodes: {100, 200, 300} 200 77.6% ± 8.5% 200 73.9% ± 8.1%
dropout: {20%} 20% 20%
vs-FC+v-LSTM+gs-FC (vs-FC, v-LSTM) nodes: {(100, 100), (200, 150), (300, 200)} (100, 100) 84.36% ± 2.0% (100,100) 80.0% ± 4.0%
dropout: {20%} 20% 20%
WD-GCN (wd-GC, v-LSTM) nodes: {(100, 100), (200, 150), (300, 200)} (100, 100) 80.4% ± 3.8% (100,100) 76.7% ± 5.1%
dropout: 20% 20% 20%
CD-GCN (cd-GC, v-LSTM) nodes: {(100, 100), (200, 150), (300, 200)} (200, 150) 85.1% ± 1.9% (200, 150) 81.5% ± 3.8%
dropout: {20%} 20% 20%

with vertex features without exploiting the graph structure (as in v-LSTM+gs-FC or vs-FC+v-LSTM+gs-FC).

Moreover, the same considerations presented for the DBLP dataset regarding the depth and the number of parameters still hold. Additionally, also under this setting, the training time of our techniques is comparable to the one of the baseline networks and always in the same order of magnitude of the one of the best baseline performer, i.e. vs-FC+v-LSTM+gs-FC. Note that in this dataset, all the networks have much longer training times with respect to the previous vertex-focused datasets due to the length of the sample sequence, i.e. 1654 compared to 10, 20 and 20 for DBLP, CIAW and the synthetic datasets, respectively.

5. Conclusions and future works

Two novel neural network architectures have been introduced. They can deal with the semi-supervised classification of sequences of vertices and the supervised classification of sequences of graphs. Both the techniques satisfy the following two properties: (i) they are neural architectures that can be trained by means of gradient descent; (ii) they are general purpose, i.e. not tied to a specific application.

Our models are based on modified GC layers connected with a modified version of an LSTM, the former sharing their weights across the time steps of the data sequence, the latter sharing them between the graph vertices. This allows the proposed techniques to have a time cost that is linear in both number of edges and sequence length, as well as the GC and LSTM building blocks.

WD-GCN performance and CD-GCN performance have been assessed against neural network baselines based on FC, GC, and LSTM layers on four real world datasets and a synthetic one. The experiments showed the superiority of the CD-GCN for both the semi-supervised classification of sequences of vertices and the supervised classification of sequences of graphs, with the WD-GCN achieving good results on datasets with more complex graph structures. Experiments performed on the synthetic dataset showed that the CD-GCN can cope better with noise in the graph structure, allowing it to achieve significantly better results than the baselines even when graph information is limited. It is hypothesized that this is due to the feature augmentation approach guaranteed by the skip connection that is present in the CD-GCN architecture, which allows the v-LSTM building block to work on vertex features when uncorrupted graph features are not available due to the noise in the graph structure.

The experiments showed that WD-GCN and CD-GCN scale well with the graph size, both considering the computational and the prediction performance. Furthermore, wall-clock training time is always comparable with the considered baselines, thus making WD-GCN and CD-GCN viable techniques.

Interesting extensions of this work may consist in:

• The usage of alternative recurrent units to replace the LSTM, such as a Gated Recurrent Unit [29], which is known to achieve performance comparable to an LSTM at a much lower computational cost;
• Further extensions of the GC unit, e.g. replacing the GC unit within WD-GCN and CD-GCN with higher order approximations of the spectral graph convolution [34];
• Exploring the performance of deeper architectures that combine the layers proposed in this work;
• Identifying which statistical properties the data have to fulfill in order to have some theoretical guarantees for the WD-GCN and CD-GCN generalization error; e.g. a good candidate might be ergodicity of the dynamic process producing the data, since it would intuitively justify the weight sharing among vertices and time steps as the key enabler allowing the proposed architectures to achieve much better results than the considered neural network baselines;
• Reducing the number of parameters of the proposed architectures by means of compression techniques (such as [41]), so as to speed up model inference.

Acknowledgments

The authors would like to thank Giorgio Patrini and Adam Elwood for their helpful and constructive comments that contributed to improve the work. Moreover, the authors would like to thank Pei et al. [27], who provided the code and their version of the DBLP dataset employed in their experiment.

References

[1] F. Scarselli, M. Gori, A.C. Tsoi, M. Hagenbuchner, G. Monfardini, The graph neural network model, IEEE Trans. Neural Netw. 20 (1) (2009) 61–80.
[2] M. Bianchini, M. Maggini, L. Sarti, F. Scarselli, Recursive neural networks learn to localize faces, Pattern Recognit. Lett. 26 (12) (2005) 1885–1895.
[3] J. Liu, M. Li, Q. Liu, H. Lu, S. Ma, Image annotation via graph learning, Pattern Recognit. 42 (2) (2009) 218–228.
[4] A. Srinivasan, S. Muggleton, R.D. King, M.J. Sternberg, Mutagenesis: ILP experiments in a non-determinate biological domain, in: Proceedings of the 4th International Workshop on Inductive Logic Programming, vol. 237, Citeseer, 1994, pp. 217–232.
[5] A. Jain, A.R. Zamir, S. Savarese, A. Saxena, Structural-RNN: deep learning on spatio-temporal graphs, in: CVPR, IEEE, 2016, pp. 5308–5317.
[6] Y. Yuan, X. Liang, X. Wang, D.-Y. Yeung, A. Gupta, Temporal dynamic graph LSTM for action-driven video object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1801–1810.
[7] A. Rozza, M. Manzo, A. Petrosino, A novel graph-based fisher kernel method for semi-supervised learning, in: ICPR, 2014, pp. 3786–3791.
[8] H. Xu, Y. Yang, L. Wang, W. Liu, Node classification in social network via a factor graph model, in: PAKDD, 2013, pp. 213–224.
[9] Y. Zhao, G. Wang, P.S. Yu, S. Liu, S. Zhang, Inferring social roles and statuses in social networks, in: ACM SIGKDD, ACM, 2013, pp. 695–703.
[10] Y. Dong, Y. Yang, J. Tang, Y. Yang, N.V. Chawla, Inferring user demographics and social strategies in mobile social networks, in: KDD '14, ACM, 2014, pp. 15–24.
[11] J. Bruna, W. Zaremba, A. Szlam, Y. LeCun, Spectral networks and locally connected networks on graphs, ICLR, 2013.
[12] M. Defferrard, X. Bresson, P. Vandergheynst, Convolutional neural networks on graphs with fast localized spectral filtering, NIPS, 2016.
[13] D.K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, R.P. Adams, Convolutional networks on graphs for learning molecular fingerprints, NIPS, 2015.
[14] T.N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, ICLR, 2017.
[15] Y. Li, D. Tarlow, M. Brockschmidt, R.S. Zemel, Gated graph sequence neural networks, ICLR, 2016.
[16] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (1997) 1735–1780.
[17] L.C. Jain, L.R. Medsker, Recurrent Neural Networks: Design and Applications, 1st ed., CRC Press, Inc., 1999.
[18] Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, in: Proceedings of the IEEE, 1998, pp. 2278–2324.
[19] X. Zhu, Z. Ghahramani, J. Lafferty, et al., Semi-supervised learning using gaussian fields and harmonic functions, in: ICML, vol. 3, 2003, pp. 912–919.
[20] Y. Boykov, O. Veksler, R. Zabih, Fast approximate energy minimization via graph cuts, IEEE Trans. Pattern Anal. Mach. Intell. 23 (11) (2001) 1222–1239.
[21] B. Wang, J. Tsotsos, Dynamic label propagation for semi-supervised multi-class multi-label classification, Pattern Recognit. 52 (2016) 75–84.
[22] A. Grover, J. Leskovec, node2vec: Scalable feature learning for networks, in: ACM SIGKDD, ACM, 2016, pp. 855–864.
[23] B. Perozzi, R. Al-Rfou, S. Skiena, DeepWalk: online learning of social representations, in: ACM SIGKDD, ACM, 2014, pp. 701–710.
[24] F.B. Silva, R.d.O. Werneck, S. Goldenstein, S. Tabbone, R.d.S. Torres, Graph-based bag-of-words for classification, Pattern Recognit. 74 (2018) 266–285.
[25] K. Li, S. Guo, N. Du, J. Gao, A. Zhang, Learning, analyzing and predicting object roles on dynamic networks, in: IEEE ICDM, 2013, pp. 428–437.
[26] Y. Yao, L. Holder, Scalable SVM-based classification in dynamic graphs, in: IEEE ICDM, 2014, pp. 650–659.
[27] Y. Pei, J. Zhang, G.H. Fletcher, M. Pechenizkiy, Node classification in dynamic social networks, in: AALTD 2016: 2nd ECML-PKDD International Workshop on Advanced Analytics and Learning on Temporal Data, 2016, pp. 54–93.
[28] M. Gori, G. Monfardini, F. Scarselli, A new model for learning in graph domains, in: Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, vol. 2, 2005, pp. 729–734.
[29] K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder–decoder for statistical machine translation, in: EMNLP, 2014, pp. 1724–1734.
[30] D.K. Hammond, P. Vandergheynst, R. Gribonval, Wavelets on graphs via spectral graph theory, Appl. Comput. Harmon. Anal. 30 (2) (2011) 129–150.
[31] Y. Seo, M. Defferrard, P. Vandergheynst, X. Bresson, Structured sequence modeling with graph convolutional recurrent networks, in: International Conference on Neural Information Processing, Springer, 2018, pp. 362–373.
[32] F. Monti, M. Bronstein, X. Bresson, Geometric matrix completion with recurrent multi-graph neural networks, in: NIPS 30, 2017, pp. 3697–3707.
[33] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016.
[34] M. Defferrard, X. Bresson, P. Vandergheynst, Convolutional neural networks on graphs with fast localized spectral filtering, in: NIPS, 2016, pp. 3844–3852.
[35] M. Génois, C.L. Vestergaard, J. Fournet, A. Panisson, I. Bonmarin, A. Barrat, Data on face-to-face contacts in an office building suggest a low-cost vaccination strategy based on community linkers, Netw. Sci. 3 (2015) 326–347.
[36] M. Hashemian, W. Qian, K.G. Stanley, N.D. Osgood, Temporal aggregation impacts on epidemiological simulations employing microcontact data, BMC Med. Inf. Decis. Making 12 (1) (2012) 132.
[37] H.S. Koppula, R. Gupta, A. Saxena, Learning human activities and object affordances from RGB-D videos, Int. J. Rob. Res. 32 (8) (2013) 951–970.
[38] D.G. Lowe, Object recognition from local scale-invariant features, in: ICCV, IEEE, 1999, pp. 1150–1157.
[39] M. Müller, T. Röder, M. Clausen, B. Eberhardt, B. Krüger, A. Weber, Documentation Mocap Database HDM05, Technical Report, Universität Bonn, 2007.
[40] D. Kingma, J. Ba, Adam: a method for stochastic optimization, ICLR, 2015.
[41] F. Manessi, A. Rozza, S. Bianco, P. Napoletano, R. Schettini, Automated pruning for deep neural network compression, in: Proceedings of the 24th International Conference on Pattern Recognition, 2018, pp. 657–664.

Franco Manessi He obtained his master degree cum Laude from the Department of Theoretical Physics of Universitá degli Studi di Pavia in 2011. He received the Ph.D. degree in Theoretical Physics in January 2015 from the Department of Theoretical Physics, Universitá degli Studi di Pavia. From 2015 to 2016 he was Data Scientist at a consulting firm. In 2017 he was a member of Waynaut's research team, a startup working in the on-line travel industry later acquired by the lastminute.com group. He is now a data scientist in the lastminute.com group. His research interests include machine learning and its applications.

Alessandro Rozza He obtained his master degree cum Laude from the Department of Computer Science of Universitá degli Studi di Milano/Bicocca in 2006. He received the Ph.D. degree in Computer Science in March 2011 from the Department of Scienze dell'Informazione, Universitá degli Studi di Milano. From 2012 to 2014 he was Assistant Professor at Universitá degli Studi di Napoli-Parthenope. From 2015 to 2017, he was head of research at Waynaut. Nowadays, he is the Chief Scientist of lastminute.com group. His research interests include machine learning and its applications.

Mario Manzo He obtained his master degree cum Laude from the Department of Computer Science of Universitá degli Studi di Napoli "Parthenope" in 2008. He received the Ph.D. degree in Computer Science in 2013 from the Department of Scienze dell'Informazione, Universitá degli Studi di Milano. His research interests include machine learning, structured data, and its applications.