0% found this document useful (0 votes)
82 views16 pages

Detection

This document proposes a framework called EULER for detecting network lateral movement via scalable temporal link prediction. EULER stacks a graph neural network (GNN) onto a sequence encoder to encode the topology of a network at discrete time snapshots using the GNN, and encode dynamic connection changes over time using the sequence encoder. This allows the GNN layers to be distributed across multiple machines for processing large datasets in parallel, improving performance. The document evaluates EULER on a large real-world network intrusion dataset, finding it outperforms previous methods in precision and speed for detecting anomalous lateral movement.

Uploaded by

Đorđe Klisura
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
82 views16 pages

Detection

This document proposes a framework called EULER for detecting network lateral movement via scalable temporal link prediction. EULER stacks a graph neural network (GNN) onto a sequence encoder to encode the topology of a network at discrete time snapshots using the GNN, and encode dynamic connection changes over time using the sequence encoder. This allows the GNN layers to be distributed across multiple machines for processing large datasets in parallel, improving performance. The document evaluates EULER on a large real-world network intrusion dataset, finding it outperforms previous methods in precision and speed for detecting anomalous lateral movement.

Uploaded by

Đorđe Klisura
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

E ULER: Detecting Network Lateral Movement via

Scalable Temporal Link Prediction

Isaiah J. King and H. Howie Huang


George Washington University
{iking5, howie}@gwu.edu

Abstract—Lateral movement is a key stage of system com- The most robust way to detect malware propagation is not
promise used by advanced persistent threats. Detecting it is no to exhaustively list every known malicious signature correlat-
simple task. When network host logs are abstracted into discrete ing with it; rather it is to train a model to learn what normal
temporal graphs, the problem can be reframed as anomalous edge activity looks like, and to alert when it detects behavior that
detection in an evolving network. Research in modern deep graph deviates from it. However, detecting anomalous activity in
learning techniques has produced many creative and complicated
models for this task. However, as is the case in many machine
an enterprise network presents unique challenges. The data
learning fields, the generality of models is of paramount impor- involved both during training and the implementation of an
tance for accuracy and scalability during training and inference. anomaly-based intrusion detection system is enormous. Often
In this paper, we propose a formalized approach to this problem the log files that such a system would require as input are
with a framework we call E ULER. It consists of a model-agnostic terabytes large. To be useful, a lateral movement detection
graph neural network stacked upon a model-agnostic sequence model must be highly scalable to accommodate such large
encoding layer such as a recurrent neural network. Models built data. Additionally, when viewed as a classification problem,
according to the E ULER framework can easily distribute their any such system would have to be highly precise. Millions
graph convolutional layers across multiple machines for large of events occur in an enterprise network on any given day,
performance improvements. Additionally, we demonstrate that and only a fraction of a percent of all interactions are ever
E ULER-based models are competitive, or better than many state-
anomalous [18]. Therefore, a model must have an extremely
of-the-art approaches to anomalous link detection and prediction.
As anomaly-based intrusion detection systems, E ULER models can low rate of false alerts so as not to overwhelm its users.
efficiently identify anomalous connections between entities with In this work, we formulate anomalous lateral movement
high precision and outperform other unsupervised techniques for
detection as a temporal graph link prediction problem. In-
anomalous lateral movement detection.
teractions occurring in discrete units of time on a network
can be abstracted into a series of graphs Gt = {V, Et } called
I. I NTRODUCTION & M OTIVATION snapshots, where V is the set of entities in the network that had
interactions Et = {(u, v) ∈ V} during a set period of time, t.
Lateral movement is a key stage of the MITRE ATT&CK A temporal link prediction model will learn normal patterns of
framework [1] describing the behavior of advanced persistent behavior from previous snapshots and assign likelihood scores
threats (APTs). At its core, lateral movement is malware to edges that occur in the future. Edges with low likelihood
propagating through a network to spread onto new computers scores correlate to anomalous connections within the network.
in an attempt to find their target. This may involve pivoting As [9] points out, these anomalous connections are often
through multiple systems and accounts in a network using indicative of lateral movement. As we will later show, this
either legitimate credentials or malware to accomplish the reframing of the problem improves precision over standard
task [2]. As it is one of the final steps in the killchain anomaly-based intrusion detection techniques.
before complete compromise, detecting it early is of critical
importance. Recent approaches to temporal link prediction combine a
graph neural network (GNN) with a sequence encoder such
A plethora of machine learning approaches to intrusion as a recurrent neural network (RNN) to capture topological
detection exist, both signature-based models [3], [4], [5], [6], and temporal features of an evolving network. However, these
[7], [8] and anomaly-based [9], [10], [11], [12], [13], [14], approaches are either reliant on RNN output during the GNN
[15]. These latter techniques are especially well-suited for stage of embedding [19] or merely incorporate GNNs into the
lateral movement detection, as APT techniques like “Pass the RNN architecture [20], [21], [22]. As Figure 1a illustrates,
Ticket” [16], or even just using stolen credentials [17] are these models are necessarily sequential, and unfortunately
very difficult to formalize into signatures for signature-based cannot scale to the large datasets that they would need to
intrusion detection systems [2]. process to be useful lateral movement detectors.
Proposed solution. To address this problem, we have
observed that the most memory-intensive part of existing
architectures occurs during the message passing stage within
Network and Distributed Systems Security (NDSS) Symposium 2022 the GNN. Furthermore, there exists an imbalance between the
24-28 April 2022, San Diego, CA, USA
ISBN 1-891562-74-6 massive size of node input features and the comparatively
https://dx.doi.org/10.14722/ndss.2022.24107 minuscule topological node embeddings. This means the most
www.ndss-symposium.org work and the most memory usage occurs in the GNN before
(a) Sequential temporal link predictors (b) E ULER parallel framework

Fig. 1: (a) Prior approaches rely on RNN output during the GNN stage of embedding or merely incorporate GNNs into the
RNN architecture, which forces models to work in serial, one snapshot at a time. In contrast, (b) the E ULER framework can
utilize several worker machines to hold consecutive snapshots of a discrete temporal graph. These workers process snapshots
in parallel through a replicated GNN shared across each machine. The output of these GNNs is returned to the leader machine
which runs them through a recurrent neural network to create temporal node embeddings which may be used for link prediction.
The framework is explained in detail in Section IV

the simpler forward pass of the RNN is calculated, necessarily the LANL comprehensive multi-source cyber-security events
in serial. If several replicated GNNs operate on snapshots dataset [24]. This dataset includes labeled event data from
independently, they can execute concurrently as shown in Fig- 58 consecutive days of real-world computer network use
ure 1b. Amdahl’s Law [23] would suggest that by distributing interspersed with red team activity. There are approximately
such a large portion of work, performance improvements will 1.6 billion events in total. This is a test not only of the E ULER
ensue. framework’s precision, but also its ability to scale. Our tests
show that E ULER-based models outperform prior works in both
In this work, we have developed E ULER1 , a formalized precision and compute time.
approach for scalable dynamic link prediction and anomalous
edge detection. The framework involves stacking a model In summary, our research contributions are:
agnostic GNN upon a sequence encoder such as an RNN.
In this way, a network’s topology at discrete moments in • We present, to the best of our knowledge, the first use
time is encoded by the GNN, and the dynamic changes of temporal graph link prediction for anomaly-based
in connections are encoded by the RNN. The embeddings intrusion detection. Other research in applying graph
produced by this model provide prior probabilities for future analytics to anomaly detection either does not consider
states of the network given what is embedded about the past the temporal nature of the data or does not use a pow-
structure. Most importantly, the framework is designed such erful GNN model. By incorporating both elements into
that GNNs may be replicated across several worker machines, E ULER, models built on this framework outperform
and execute independently, allowing disjoint sets of snapshots other unsupervised anomaly-based intrusion detection
to be processed concurrently. When the GNNs’ work occurs systems and yield more informative alerts.
in parallel for each snapshot, the topological data for an • We demonstrate that for temporal link prediction
entire series of graphs can theoretically be encoded in the and detection, the simple framework we propose is
time it takes to encode the snapshot with the most edges. equally or more accurate and precise than state-of-
With these immense performance enhancements, detecting the-art temporal graph autoencoder models. E ULER
anomalous user activity in industry-scale, real-world networks models attain higher area under the curve and average
using powerful GNN models becomes tractable. precision scores by 4% in inductive link detection
Experimental evaluation. To prove that the very simple tests, and about equal metrics to the state-of-the-art
model we propose is adequate for the complex task of anoma- in transductive link prediction tests.
lous edge detection, and by extension lateral movement de- • We propose a scalable framework for distributed tem-
tection, we evaluate a model following the E ULER framework poral link prediction for use on big data. The E ULER
against several state-of-the-art temporal link prediction models framework is simple and makes message passing
on three datasets. The results of these experiments show that lightweight even over large graphs. By breaking edge
despite its simplicity, the E ULER framework performs as well lists into temporal chunks and processing them in
or better than the state-of-the-art, and unlike the other models, parallel, the computational complexity of the message
can scale to accommodate larger data. passing stage of the model is theoretically bound
Implementation as an intrusion detection system. We by only the snapshot with the most edges. Other
test several models following the E ULER framework on optimizations allow the RNN to operate in parallel
with some of the GNN workers, further improving
1 Source code available at https://github.com/iHeartGraph/Euler performance.

2
The rest of this work is organized as follows: Section II graph G of network activity that predicts the likelihood of
provides background information about the topic and defines future interactions occurring in unseen snapshots. An observed
our key terms. Section III gives a motivating example to interaction between entities with a likelihood score below a
demonstrate the necessity of temporal graphs in anomaly-based certain threshold is said to be anomalous. These anomalous
intrusion detection. Section IV explains the E ULER framework edges, in the context of network monitoring, are often indica-
in detail. Section V details the experiments we conducted tive of lateral movement [9].
with this model compared to other temporal autoencoders.
Section VI showcases how we used E ULER-based models to III. M OTIVATION
build efficient and precise anomaly-based intrusion detection
systems. Section VII discusses related work in anomaly-based Historically, there have been two main ways of abstract-
intrusion detection systems and temporal link prediction. Fi- ing data for anomaly detection: frequency-based, and events-
nally, we provide a brief discussion and suggestions for future based [18]. Frequency-based methods, such as [12], [13], [25]
work in Section VIII. define anomalous behavior as activity which significantly devi-
ates from observed temporal patterns. Events-based methods,
such as [10], [14], [15], [26], [27], [28] identify commonalities
II. BACKGROUND
in features, such as packet count, protocols, etc., associated
Anomaly-based intrusion detection systems must first de- with individual events within a system. Anomalies are then
fine a baseline for normal behavior, then generate alerts when defined as events with unexpected features.
events occur that significantly deviate from this baseline. The The problem with these two approaches is that they ignore
definition for normalcy is highly contingent on the abstraction the inherently structural nature of network data. This is readily
used to represent the system. Our proposed solution represents apparent by observing that one of the most influential datasets
network activity as discrete temporal graphs. From there, for intrusion detection, [29], [30], has no features for source
detecting evidence of lateral movement betrayed by anomalous or destination entities. All that matters with these models
network activity is equivalent to anomalous edge detection, are the details of the events themselves in isolation, not the
which we accomplish via temporal link prediction. machines between which they occurred. Networks are too
complex to be represented as just a series of unrelated vectors,
Discrete Temporal Graphs or a multivariate probability distribution of isolated samples.
A discrete temporal graph2 G = {G1 , G2 , ...GT } is defined as Networks are webs of relational data, fluctuating over time.
a series of graphs Gt = {V, Et , Xt }, which we refer to as The most natural way to represent and analyze relational data,
snapshots, representing the state of a network at time t. Here, is as a graph.
V denotes a set of all nodes which appear in the network,
Et denotes relationships between nodes at time t, and Xt There has been growing interest in static graph-based
represents any features associated with the nodes at time t. methods for intrusion detection, such as [9]. Here, normalcy
In this work, all graphs are directed, and some have weighed is defined in terms of interactions between entities within
edges, W : E → R representing edge frequencies during the systems, but time is not considered. Anomalies are defined
time period each snapshot encompasses. as edges with low probability given what is known about the
graph’s structure.
Let the interactions I between users and machines in a
network at specific times be represented as a multiset of tuples To demonstrate the advantages and disadvantages of the
<src,dst,ts>. Here, src is an entity interacting with entity dst above intrusion detection systems, consider the example shown
at time ts. From this multiset, we can build the temporal graph in Figure 2. The first two time slices show normal activity in
G = {G0 , ..., GT } with time window δ. The set of all nodes V, the network: first at t0, Alice and Bob authenticate with their
is the set of every src and dst entity that appears in I. The set computers A and B, then at t1 computers A and B make a
of edges at time t, Et is constrained such that for all edges request to the shared drive. At times t2 and t3, we see that
(u, v) ∈ Et there exists an interaction < u, v, w >∈ I where when Bob does not first authenticate with computer B, it does
t ≤ w < t + δ. not communicate with the shared drive. A simple probability
distribution is apparent:
Temporal Link Prediction
Temporal link prediction is defined as finding a function that P((C1, SD) ∈ Et+1 | (B, C1) ∈ Et ) = 1
(1)
describes the likelihood that an edge exists in a temporal P((C1, SD) ∈ Et+1 | (B, C1) 6∈ Et ) = 0
graph at some point in time, given the previously observed
snapshots of a network. By representing an enterprise network However, in t4 and t5, something unusual occurs: computer B
as a temporal graph, we can further extend this definition to requests data from the shared drive without Bob authenticating
encompass anomaly detection. This follows from the assump- with it first! This could be a sign of an APT using attack tech-
tion that anomalous edges in a temporal graph will have a niques (as catalogued by the MITRE ATT&CK framework)
low probability of occurring given what is known about the T1563, remote service hijacking, or attempting T1080, tainting
network’s behavior in the past. shared content [2].

Lateral movement detection with temporal link prediction In order to detect attacks such as the one in the example,
is then defined as finding a function learned from the temporal a model would need to consider events with reference to
those which occurred previously, and with reference to the
2 For simplicity, for the remainder of this work, we refer to these simply as other interactions within the network. An event between two
“temporal graphs” entities that happens at one point in time cannot be considered

3
Fig. 2: A simple example of hard to detect temporally anomalous activity in a network with 3 machines and two users. The
normal sequence of events is that a user authenticates with a computer, then that computer makes a request to the shared drive.
However, at time t5, computer 1 makes a request to the shared drive without Bob first authenticating with it, denoting a perhaps
malicious process running on computer 1 acting not on behalf of the machine’s primary user.

identical to the same event occurring in the future under a A. The Encoder-Decoder
different global context. Unfortunately, existing graph-based
approaches, which do not consider time, and many event-based The E ULER framework is a generic extension of the
approaches that look at each event in isolation such as [9], [10], traditional graph autoencoder model [32] to temporal graphs.
[14] would see no difference between (C1, SD) at time t1 and It consists of a model-agnostic graph neural network (GNN)
(C1, SD) at time t4. What makes it stand out as anomalous is stacked upon a model-agnostic recurrent neural network
something lacking in the previous state of the graph–something (RNN). Together, these models aim to find an encoding func-
that would only be detected by considering prior probabilities tion f(·) and a decoding function g(·). The encoding function
for the next state of the system. maps nodes in a temporal graph with T snapshots to T low-
dimensional embedding vectors. The decoding function en-
Frequency-based and event-based approaches such as [12], sures minimal information is lost during the encoding process
[13], [26], [27], [28] would have similar trouble, as they lack and aims to reconstruct the input snapshots from the latent Z
the relational data present in the network; Bob’s activities a vectors. More formally, we can describe the behavior of the
timestep earlier would have little import on the interpretation encoder as
of the shared drive’s activity later on. If the interaction between Z = f({G0 , . . . , GT })
(C1, SD) at time t4 was sufficiently similar to the interaction (2)
= RNN( [GNN(X0 , A0 ), . . . , GNN(XT , AT )] )
at time t1, these approaches would see no difference between
the two. They lack the ability to capture the importance of where At is the |V| × |V| adjacency matrix representation of
interactions occurring between other entities in the network, the snapshot at time t. This T × |V| × d dimensional tensor Z
and how they may relate to a separate event. is optimized to contain information about both the structure of
the graph, and the dynamics of how it changes over time.
Our proposed solution, to represent the network as a
temporal graph, ensures the global structure of a network This is enforced by a decoder function, g(·) which attempts
at individual points in time is captured without losing the to reconstruct the original graph structure given the embed-
temporal dependencies of the changing connections. We make dings. More formally,
the more difficult assumption that malicious events can have g(Zt ) = Pr(At+n = 1 | Zt ) (3)
the same event features as normal ones; if this is the case,
traditional event-based approaches will not work. We also where Zt = Z[t] is the embedding of graph Gt and n ≥ 0.
assume attackers can make similar connections, or even the As was done by [9], [19], [32], [33] we use the inner product
same connections as observed in normal activity, meaning decoding as this g(·) function:
available graph-based approaches, and statistical approaches
are insufficient. With temporal graphs, assuming they have g(Zt ) = σ(Zt Z|t ) = Ãt+n (4)
enough granularity, these problems, as well as those tackled where σ(·) denotes the logistic sigmoid function, and Ãt+n
by prior works are solvable. represents the reconstructed adjacency matrix at time t + n.
As a consequence of using inner product decoding, the dot
product of vectors Zt [u] and Zt [v] represents the log-odds
IV. E ULER that an edge, (u, v) exists at time t + n. In this way, the g(·)
function is used to detect anomalous edges.
In this section we describe our proposed framework, which
we call E ULER. This framework aims to learn a probability
function conditioned on previous states of a temporal graph, B. Workflow
to determine the likelihood of an edge occurring at a later The core of the E ULER framework is a simple design. It
state. Furthermore, it is the goal of this work to offer an simply stacks the replicas of a model-agnostic GNN which
approach that is not just precise, but also highly scalable. we refer to as the topological encoder upon a model-agnostic
We first describe the basic components of the system, then recurrent layer with a few simple constraints. When fit into
how they are distributed across multiple machines, and how the leader/worker paradigm3 with one recurrent layer as the
these components interact. Finally, we describe how different
training objectives are implemented on the E ULER interface. 3 Historically called the master/slave paradigm

4
Fig. 3: The complete series of interactions between the leader, and worker machines during one training step. In step 1.), the
leader machine initiates each worker, and issues a command for each of them to load a disjoint subset of the snapshots from
memory. When this has occurred, the training loop can begin. At step 2.), the leader issues a command to each worker to perform
a forward pass on their graphs through their GNNs. As they complete this, workers send the topological embeddings, Z̄t to an
ordered queue in the leader. Upon receiving the embedding for t = 0, step 3.) begins. The leader runs embeddings from the
queue through its RNN as they come in to produce the final Z embeddings. Then, in step 4.), the leader sends the embeddings
back to the workers to decode and calculate loss. Finally, 5.), the leader and workers perform a backward pass on the aggregated
loss functions according to the DDP gradient bucketing algorithm [31]. During evaluation, the steps are nearly identical, but on
step 4.) workers return their scores for each edge y
^ t instead of loss.

leader, and multiple topological encoders as workers, it has on earlier embeddings while future snapshots are still being
the potential for massive parallelism. processed by workers. The method for this distribution of work
is explained in more detail in Algorithm 1.
The overall workflow for E ULER is shown in Figure 3. It
occurs in 5 basic stages: 1.) The leader spawns the workers
and instructs them on which snapshots to load; 2.) the leader Algorithm 1: Distributing G across workers
initiates the training loop, and workers generate topological initialize k workers, and temporal graph G = {G0 , ..., GT };
embeddings; 3.) as the topological embeddings are received, /* Give each worker an equal amount of work */
the leader processes them through an RNN; 4.) the output of int minTasks = b Tk c;
the RNN is sent back to the workers to calculate loss or for int tasks = [minTasks] ∗ k;
scoring; 5.) in training mode, loss is returned to the leader /* If work is not evenly divided, assign it to
to be backpropagated. During evaluation, anomaly scores are last workers first */
returned. int remainder = T % k;
if remainder then
In this section, we describe how these 5 steps are imple- for (int i=k-1; i≥k-remainder; i - -) do
mented in greater detail. The distributed workers are imple- tasks[i] ++;
mented using the PyTorch DDP library [31]. More detailed end
end
technical information about the E ULER interface is available
in the Appendix. /* Each worker loads as many contiguous
snapshots as they were assigned */
int tmp, start=0, end=tasks[0];
Loading Data for (i=0; i<k; i ++) do
When the program first starts, several processes are spawned. asynchronously call workers[i].load new data(G[start:end]);
The PyTorch multiprocessing library automatically assigns tmp=start; start += end; end = tmp+tasks[i+1];
end
each spawned process a unique ID from 0 to k + 1; for
convenience the machine with id:0 becomes the leader, and all [w.wait() for w ∈ workers]
others are workers. The leader machine, which holds remote
references to all workers in addition to the recurrent layer After the workers have been assigned their snapshots, they
issues commands to the other processes to spin up all worker concurrently read them in. The leader waits for each worker to
instances. signal that all data is loaded, then moves on to the next phase
Upon their creation, workers connect to the leader and of the workflow.
await instructions. After the initialization of all workers, the
leader assigns them contiguous subsets of temporal graph data Topological Encoding
to load, shown in step 1 of Figure 3. For example, given a During the forward pass–shown in step 2 of Figure 3–the
set W = {w1 , ..., wk } of k workers, and a temporal graph recurrent layer issues asynchronous calls to forward on each
split into T snapshots, the leader assigns each worker s = Tk
 
worker machine. To minimize network traffic, the only thing
snapshots. Then worker wi holds {Gsi , ..., Gsi+(s−1) }. In the the leader sends to the workers is an enum representing which
likely case where k does not divide T , extra units of work partition of edges to process, as some are held out for vali-
are assigned to workers holding snapshots later in time. This dation and testing. Workers then process every snapshot they
is because the recurrent layer of the model processes the hold. Further reducing network traffic, the matrices returned
topological encoders’ output in order, as it comes in. Thus, by the workers are far smaller than those used as inputs, as
the RNN can perform the necessarily sequential forward pass each worker is essentially a graph autoencoder [32].

5
Algorithm 2: Leader machine forward method yt . The process for decoding the embeddings is the same,
def forward(self, workers, partition): however loss is not calculated.
/* Leader tells each worker to begin
executing */ Backpropagation & Evaluation
futures = [];
for w ∈ workers do
When loss has been calculated and returned to the leader
future = asynchronously execute w.forward(partition); machine, gradients are calculated via backpropagation, first
futures.append(future); through the recurrent layer, then components of the loss
function generated by each topological encoder are backpropa-
/* As workers return their embeddings, the
leader processes them in order, as they
gated in parallel and broadcast between workers in accordance
arrive */ with the bucketing algorithm described in [31]. After the
h=NULL; backpropagation step, and collective communication between
zs = []; workers, gradients across all workers’ model replicas are equal.
for f ∈ futures do Finally, the recurrent layer and the topological encoders all
z, h = self.RNN(f.wait(), h);
zs.append(z); update their parameters using an Adam optimizer [34], and
the leader repeats steps 2-5 until convergence.
return concat(zs)
If the model is in evaluation mode, the leader machine
instead uses the y
^ t likelihoods the workers generate, and the
known labels yt , also returned by the workers, to calculate
For performance optimization, and to ensure consistency, precision and accuracy metrics, which are saved. In a real-
we impose just one constraint for this stage: topological world implementation without labels, it would instead raise
encoders must not be dependent on any temporal information. alerts on observed edges with likelihoods below a certain
They provide a purely spatial encoding of the state of each threshold.
snapshot they hold, using only features observable at a single
point in time. With this constraint satisfied, all encoders can
operate in parallel, as no one worker is dependent on the output C. Training
of another. Theoretically, with as many workers as snapshots,
the time complexity of a forward pass is constrained only by There are two modes of training E ULER models: as a
the snapshot with the greatest number of edges. link detector, or a link predictor. These two modes are
distinguished by which Zt embeddings are sent to the workers
Temporal Encoding at step 4 to calculate loss. Link detectors are inductive; they
The leader maintains an ordered list of future objects that point generate Zt using partially observed snapshots {G^0 , ..., G^t } and
to the eventual output of the workers and waits for the future attempt to reconstruct the full adjacency matrix At with g(Zt ).
pointing to the first embeddings to finish executing. When In practice, they would be used for forensic tasks, where one
the leader receives this tensor, it is immediately processed is performing an audit to identify anomalous connections that
by its RNN, shown in step 3 of Figure 3. Note that so long have already occurred.
as tasks are slightly imbalanced such that workers holding Link predictors are transductive; they generate Zt using
later snapshots contain more work units, the leader’s recurrent snapshots {G0 , ..., Gt }, in order to predict the future state, At+n
layer can execute concurrently with at most k − 1 workers’ where n > 0. In practice, they could be used as a live intrusion
topological encoders. Workers with earlier snapshots hold detection tool, as predictive models can score edges as they are
fewer work units, and therefore finish executing earlier, so the observed–before they have been processed into full snapshots.
leader can process their outputs while workers holding greater For example, when n = 1, given what has been observed about
quantities of snapshots further in time are still processing. the network up until time t − 1, it is the goal of predictive
When the recurrent layer has finished processing the output implementation of E ULER to score edges observed at time t.
of one worker, the hidden state and outputs from the RNN are Such a model can use embeddings learned from previous states
saved. The leader waits for the next topological embedding to of the network to process connections as they occur.
finish processing, then uses the saved RNN hidden state and To ensure these objectives, the reconstruction loss function
the next embedding to repeat the process until all workers have aims to minimize the negative log likelihood of Equation 3,
finished executing. This procedure is described in more detail where n = 0 when training detectors, and n > 0 for predictors.
in Algorithm 2. For larger graphs, operating over the entire adjacency matrix
quickly becomes intractable. Instead, we approximate this
Decoding value by minimizing binary cross entropy on the likelihood
When the leader finishes generating the final embeddings, they scores for known edges, and a random sample of non-edges
are sent back to the workers to decode, and if the model is at time t + n: Pt+n and Nt+n . Formally, the reconstruction
training, to calculate loss. This process occurs in parallel on loss function for snapshot t + n is
the worker machines. In general, graph functions such as those
used to find edge likelihoods are more compute and memory
intensive, so we endeavor to run them in parallel whenever Lt = − log(Pr(At+n | Zt ))
possible. This stage is shown in step 4 of Figure 3. −1 X −1 X
≈ log(p) + log(1 − n)
During evaluation, instead of returning loss, the workers |Pt+n | |Nt+n |
p∈Pt+n n∈Nt+1
return edge likelihoods y
^ t , and the ground-truth edge labels (5)

6
New negative edges are randomly sampled on each epoch V. B ENCHMARK E VALUATIONS
by workers for each snapshot they hold. The leader machine
coordinates which slices of the Z tensor are sent to each worker Prior works [19] and [22] both assert that simply em-
to generate Pt+n and Nt+n , and each worker independently bedding graph snapshots with a GNN, and running these
calculates loss on the training data they hold. embeddings through an RNN, as was done by [20], [35] does
not adequately capture the shifting distribution of nodes in
On predictive models, Zt represents the latent space of a dynamic network. Prior work [22] demonstrates that this
nodes n snapshots in the future. Consequently, predictive is the case for inferring complete graph structure several
models cannot calculate loss on the first n snapshots of the time steps in the future, but [19] does not evaluate their
graph. To account for this, the leader pads Z with a n |V| × d model against the very model they so thoroughly disregard.
zero matrices at the beginning, and the final n matrices are To remedy this, we present a comparison between several
removed before returning the embeddings to the workers. This existing temporal autoencoder models, and a simple stacked
way, Z[t] predicts the snapshot indexed at t on all workers GCN [36] and GRU [37] following the E ULER framework.
except the one holding the initial snapshots, which ignores We select these two layers because of their lack of parameters–
embeddings that equal the zero matrix. This process is shown they are the most generalized of the models in their respective
more clearly in Algorithm 3. domains [38], [39].

Algorithm 3: Calculate loss A. Models Tested & Data Sets


def worker loss(zs, n jobs, workers, n=1): In this section, we implement E ULER as a graph convo-
/* Pad the Z tensor if predictive */ lutional network [36], the most general GNN available [38],
zeros = n × |V| × d 0 matrix;
zs = concat(zeros, zs[:-n]); stacked upon a GRU, an RNN with very few parameters.
futures = [];
Though the model is so simple as to be called the “naive
start=0; end=n jobs[0]; loss = 0; i=0; method” by [20], it is also the fastest temporal model tested,
/* Send the correctly offset embeddings to
and as we will show, quite effective. The topological embed-
the workers to calculate loss */ ding layer is a two-layer GCN, essentially a graph autoencoder.
for w ∈ workers do We include an edge dropout layer [40] before the initial
f = asynchronously execute w.loss(zs[start:end]); forward pass and feature dropout layer between all layers to
futures.append(f); prevent overfitting and oversmoothing on the small datasets.
tmp=start; start=end; end=start+n jobs[i ++];
Its hidden layer and output are both 32-dimensional. The
return sum([f.wait() for f in futures]) / len(workers)
sequence of GCN outputs is then passed through a tanh
activation function before they are processed by a single 32-
dimensional GRU, and finally a MLP to project the output into
D. Classification 16-dimensional embeddings.
Though for much of our evaluation, we rely on regression The other methods evaluated are as follows:
metrics relating to the fitness of scores assigned to edges, it DynGraph2Vec [41]: a model which passes adjacency
is useful to automate the process of deciding the threshold vectors into an multilayer perceptron (MLP) and an RNN to
for what counts as anomalous to obtain classification scores. capture graph dynamics using traditional deep learning tech-
To this end, when training the model, we hold out one or niques. The DynAE variant passes adjacency vectors through
more full snapshots to act as an extra validation set. Using the a deep autoencoder architecture; the DynRNN variant passes
final hidden state h of the RNN from the training snapshots as them through a recurrent neural network; and DynAERNN
input for the validation snapshot, a training partition of edges passes the output of deep autoencoders to an RNN to generate
is passed through the model. From there, finding an optimal node embeddings. This latter model fits into the E ULER
cutoff threshold for edge likelihood scores becomes a simple framework if an MLP were used instead of an GNN. We
optimization problem. include this model to show the value of message passing
Given a set of scores for edges that exist in the validation networks.
snapshots, but were held out of the training set, and a set of Evolving GCN [22]: a GCN whose parameters are passed
scores for non-edges, the optimal cutoff threshold τ is the one through an LSTM [42] or GRU [37] at each timestep. In
which satisfies the EGCN-O variant, the GCN parameters are the input
and output of an LSTM; in the EGCN-H variant, the GCN
argmin k(1 − λ)TPR(τ) − λFPR(τ)k (6) parameters are used as the hidden states of a GRU, which
τ
takes the previous embedding as input. When the models were
where TPR(τ) and FPR(τ) refer to the true and false positive evaluated for temporal link prediction in the original paper,
rate of classification given cutoff threshold τ, and λ is a they were given no subset of ground-truth edges from which to
hyperparameter in [0 − 1] biasing the model to optimize for generate embeddings. Instead, they used the predictions from
either a high true positive rate, or a low false positive rate. the previous snapshot as the adjacency matrix inputs. This is
Experiments have shown that for anomaly detection, where a far more difficult task than our method of link prediction,
low false positive rates are critical, λ = 0.6 is very effective. where a subset of edges is available as a ground truth training
For any metric involving classification for the remainder of set for every snapshot. But our interest is anomaly detection,
the paper, this is how classes were determined from the edge where such accommodations are always available. Lastly, we
likelihood scores, and unless otherwise specified λ = 0.6. note that the original model uses an MLP that takes two nodes’

7
TABLE I: Data set metadata C. Experimental Setup
Data Set Nodes Edges Avg. Density Timestamps As was done by [19], we conduct three different bench-
FB 663 23,394 0.00591 9
marking tests to compare E ULER to other temporal link pre-
COLAB 315 5,104 0.01284 10 diction methods: inductive dynamic link detection, transductive
Enron10 184 4,784 0.00514 11 dynamic link prediction, and transductive dynamic new link
prediction. Link detection and link prediction are implemented
as described in Section IV-C. On link detection tests the
embeddings as input to calculate likelihood scores, but our objective function is Equation 3 with n = 0. On (new) link
experiments found using inner product decoding was more prediction tests, n = 1.
effective. For link detection tests, we randomly remove 5 and 10%
of the edges from each snapshot for validation and testing
VGRNN [19]: a GCN stacked upon a GC-LSTM [21]. respectively. The reported results are evaluations of final 3
Their method, however, cannot be fit into the E ULER frame- snapshots, which are never used for training; however, a
work; this is because node embeddings from the GCNs are training set of edges from those latter timesteps are withheld
passed into the RNN, whose output is concatenated with node and not evaluated, to be used as inputs during the forward pass
features to be used as input for the next snapshot. This process at evaluation.
must happen in serial. The hidden state vectors from the RNN
are designed to predict the next state of the graph, in addition For link prediction tests, all edges in final three snapshots
to providing information for the GCN. Hence, for predictive are considered positive samples in the test set, with an equal
tasks, a non-linear transformation of the RNN output is used number of randomly sampled negative edges. As it is a
as the node embeddings rather than the direct output of the transductive test, during the forward pass all edges at time
GCN. t are ingested to predict the state at t + 1. However, 5% of
edges are held out for validation during decoding.
VGAE [32]: A simple two-layer variational graph autoen-
coder. This model does not consider time at all. Instead, it New link prediction is an extra evaluation of predictive
views every interaction as one large graph. We include this models. This test is identical to link prediction, but the set of
model to demonstrate the usefulness of the recurrent unit true positives only includes edges that were not in the snapshot
in other models. The VGAE acts as a baseline to compare immediately before:
inductive tests against; as it is a static model, it cannot be
used for dynamic (new) link prediction. {(u, v) | (u, v) ∈ Et+1 ∧ (u, v) 6∈ Et } (10)
the purpose of this test is to evaluate predictive models’ ability
We use three data sets for these tests: Facebook [43],
to anticipate new edges given what it has observed about the
Enron10 [44], and COLAB [45]. Table I contains more detailed
dynamics of the graph.
information about these data sets. None of these data sets
contain node features, so the identity matrix is used as the Model parameters are updated with an Adam opti-
initial feature input for all models. These graphs are all mizer [34] with a learning rate of either 0.03, 0.02, 0.01,
directed, and during evaluation self-loops are not considered. or 0.005 according to a parameter search. Models are trained
for 1,500 epochs with early stopping after 100 epochs of no
progress on the validation data. Results for the (SI-)VGRNN,
B. Evaluation Metrics the DynGraph2Vec models, and the VGAE are those reported
by [19]. We use the source code provided by [22] to evaluate
As we use a regression model to output probabilities, we the EGCN models.
use the following to measure its effectiveness: area under the
ROC curve (AUC) and average precision score (AP). The AUC D. Results
score is defined as the area under the curve created by plotting
Here, we report the results of the three tests with some
the true positive rate (TPR) and false positive rates (FPR) as
discussion. When comparing results from competing methods,
the threshold for classification changes [46].
ties occur when the difference between the means of two
The TPR and FPR are defined as models’ metrics, A and B, are not statistically significant. This
is determined via a two-tailed t-test with the null hypothesis
TP FP H0 that the two means are the same:
TPR = FPR = (7)
TP + FN FP + TN
0 − (µ(B) − µ(A)) µ(A) − µ(B)
Average precision for regression tasks is defined as t= q =p (11)
Var(B−A) σM (A)2 + σM (B)2
X N
AP = (Rτ − Rτ−1 )Pτ (8)
τ Where N denotes degrees of freedom, and σM (·) denotes
the standard error. If
where Pτ and Rτ denote the precision and recall scores at
threshold τ. Precision and recall are defined as p = 2 min{Pr[T ≤ t | H0 ], Pr[T ≥ t | H0 ]} > 0.05 (12)
TP TP the difference between the two means is deemed insignificant,
P= R= (9) and the two models are considered equivalent. For ease of
TP + FP TP + FN

8
TABLE II: Comparison of E ULER to related work on dynamic TABLE III: Comparison of E ULER to related work on dynamic
link detection link prediction
Metrics Methods Enron COLAB Facebook Metrics Methods Enron COLAB Facebook
VGAE 88.26 ± 1.33 70.49 ± 6.46 80.37 ± 0.12 DynAE 74.22 ± 0.74 63.14 ± 1.30 56.06 ± 0.29
DynAE 84.06 ± 3.30 66.83 ± 2.62 60.71 ± 1.05 DynRNN 86.41 ± 1.36 75.7 ± 1.09 73.18 ± 0.60
DynRNN 77.74 ± 5.31 68.01 ± 5.50 69.77 ± 2.01 DynAERNN 87.43 ± 1.19 76.06 ± 1.08 76.02 ± 0.88
DynAERNN 91.71 ± 0.94 77.38 ± 3.84 81.71 ± 1.51 EGCN-O 84.28 ± 0.87 78.63 ± 2.14 77.31 ± 0.58
EGCN-O 93.07 ± 0.77 90.77 ± 0.39 86.91 ± 0.51 AUC EGCN-H 88.29 ± 0.87 80.80 ± 0.95 75.88 ± 0.32
AUC EGCN-H 92.29 ± 0.66 87.47 ± 0.91 85.95 ± 0.95 VGRNN 93.10 ± 0.57 85.95 ± 0.49 89.47 ± 0.37
VGRNN 94.41 ± 0.73 88.67 ± 1.57 88.00 ± 0.57 SI-VGRNN 93.93 ± 1.03 85.45 ± 0.91 90.94 ± 0.37
SI-VGRNN 95.03 ± 1.07 89.15 ± 1.31 88.12 ± 0.83 E ULER 93.15 ± 0.42 86.54 ± 0.20 90.88 ± 0.12
E ULER 97.34 ± 0.41 91.89 ± 0.76 92.20 ± 0.56 DynAE 76.00 ± 0.77 64.02 ± 1.08 56.04 ± 0.37
VGAE 89.95 ± 1.45 73.08 ± 5.70 79.80 ± 0.22 DynRNN 85.61 ± 1.46 78.95 ± 1.55 75.88 ± 0.42
DynAE 86.30 ± 2.43 67.92 ± 2.43 60.83 ± 0.94 DynAERNN 89.37 ± 1.17 81.84 ± 0.89 78.55 ± 0.73
DynRNN 81.85 ± 4.44 73.12 ± 3.15 70.63 ± 1.75 EGCN-O 86.55 ± 1.57 81.43 ± 1.69 76.13 ± 0.52
DynAERNN 93.16 ± 0.88 83.02 ± 2.59 83.36 ± 1.83 AP EGCN-H 89.33 ± 1.25 83.87 ± 0.83 74.34 ± 0.53
EGCN-O 92.56 ± 0.99 91.41 ± 0.33 84.88 ± 0.52 VGRNN 93.29 ± 0.69 87.77 ± 0.79 89.04 ± 0.33
AP EGCN-H 92.56 ± 0.72 88.00 ± 0.85 82.56 ± 0.91 SI-VGRNN 94.44 ± 0.85 88.36 ± 0.73 90.19 ± 0.27
VGRNN 95.17 ± 0.41 89.74 ± 1.31 87.32 ± 0.60 E ULER 94.10 ± 0.32 89.03 ± 0.08 89.98 ± 0.19
SI-VGRNN 96.31 ± 0.72 89.90 ± 1.06 87.69 ± 0.92
E ULER 97.06 ± 0.48 92.85 ± 0.88 91.74 ± 0.71

TABLE IV: Comparison of E ULER to related work on dynamic


new link prediction
reading, we highlight only ties between the two models with
the highest observed means. Metrics Methods Enron COLAB Facebook
DynAE 66.10 ± 0.71 58.14 ± 1.16 54.62 ± 0.22
DynRNN 83.20 ± 1.01 71.71± 0.73 73.32 ± 0.60
Dynamic Link Detection DynAERNN 83.77 ± 1.65 71.99 ± 1.04 76.35 ± 0.50
As shown in Table II, the simplistic E ULER model out- EGCN-O 84.42 ± 0.82 79.06 ± 1.60 75.95 ± 1.15
performs the more modern ones in almost every test. In tests AUC EGCN-H 87.00 ± 0.85 78.47 ± 1.27 74.85 ± 0.98
VGRNN 88.43 ± 0.75 77.09 ± 0.23 87.20 ± 0.43
where it does not outperform the state-of-the-art methods, SI-VGRNN 88.60 ± 0.95 77.95 ± 0.41 87.74 ± 0.53
it is equivalent, despite having fewer parameters. Compared E ULER 87.92 ± 0.64 78.39 ± 0.68 89.02 ± 0.09
to the static VGAE, which does not consider time at all, DynAE 66.50 ± 1.12 58.82 ± 1.06 54.57 ± 0.20
DynRNN 80.96 ± 1.37 75.34 ± 0.67 75.52 ± 0.50
the benefit of the additional RNN layer is clear. We further DynAERNN 85.16 ± 1.04 77.68 ± 0.66 78.70 ± 0.44
observe the benefit of graph neural networks over MLPs EGCN-O 86.92 ± 0.39 81.36 ± 0.85 73.66 ± 1.25
when E ULER and the state-of-the-art methods are compared AP EGCN-H 86.46 ± 1.42 79.11 ± 2.26 73.43 ± 1.38
VGRNN 87.57 ± 0.57 79.63 ± 0.94 86.30 ± 0.29
to the DynGraph2Vec methods. However, the experiments SI-VGRNN 87.88 ± 0.84 81.26 ± 0.38 86.72 ± 0.54
do not support claims that much is gained by the complex E ULER 88.49 ± 0.55 81.34 ± 0.62 87.54 ± 0.11
models beyond what is afforded simply by using a GNN and
RNN. Both highly engineered models, the EGCN and VGRNN
variants, do not perform significantly better than the simplistic density as Enron, but 3.5x as many nodes, E ULER performs
stacked GCN on a GRU, and in some cases perform worse. significantly better than other methods in new link prediction.
As variance and complexity increases in the data, E ULER
Significantly, the data set where E ULER performed better adapts better than the other methods while retaining precision
than prior works with p < 0.05 was the Facebook data set. This on simpler data without becoming overfit.
data set contains the most nodes and edges and has the fewest
snapshots in the training set. Despite these difficulties, our In both tests, models which process graph embeddings
simple E ULER model achieves a 4% improvement over prior using an RNN were significantly better than DynAE which
work in both AUC and AP, signifying its ability to learn very does not. This component enables temporal attributes of the
complex spatio-temporal patterns even on larger data sets. We data to be carried over from previous time steps. If a new edge
observe that the model is generalized enough not to become has been seen in the distant past, or a pattern that indicates
overfit on the smallest data sets, but not so simple it cannot a new edge is likely to appear has previously been observed,
handle larger ones. This supports our claim that this model this history is carried over by the RNN into future embeddings.
design, despite its simplicity, is highly precise. From this, we can again infer that the benefit derived from the
(SI-)VGRNN models has more to do with the components of
Dynamic (New) Link Prediction those models, which are also GCNs connected with an RNN.
In these cases where temporal data is more significant, the The espoused benefit of the topological embedders ingesting
results are less clear. As shown in Tables III and IV, between temporal information does not appear to be as great as simply
our method and (SI-)VGRNN models, the results are almost using those components; by removing the GNNs’ reliance on
all within each methods’ margin of error. We note that on temporal information, our models can embed at least the same
the dynamic new link prediction test for Enron10, though our quality of information in a more efficient manner.
method’s observed mean AUC and AP were lower than both
VGRNN methods, a t-test showed that this disparity was not With these data, it is clear that the simplicity of models
statistically significant. We can conclude that neither method following the E ULER framework is not a hindrance, and in
is significantly better than the other. Furthermore, we observe many cases is actually advantageous. The purpose of E ULER
that on the Facebook data set, which has roughly the same edge is chiefly to improve efficiency and scalability, so the fact that

9
 C(u, v) − µ 
it is only a small improvement, or about equal to state-of-the- E
W((u, v) ∈ E) = σ (13)
art models is adequate for our purposes. The real benefit of ΣE
building models within the E ULER framework is their ability
to scale. On larger data sets, this advantage is more evident. where σ(·) represents the sigmoid function, C(u, v) rep-
resents the frequency of an authentication between u and v
VI. L ATERAL M OVEMENT D ETECTION in the time window, and µE and ΣE represent the mean and
standard deviation of all edge frequencies in the time window.
In the previous section, all datasets tested were rather small. In this way, edges which occur very infrequently are given
It is not until a real-world application of the E ULER framework lower weight during training so as to appear less “normal” and
is tested that the true performance improvements are evident. edges which occur with high frequency, such as edges from
computers to domain controllers, or ticket granting servers
To demonstrate the impressive speedup achieved by this have high weight, and appear routine.
framework when compared to related work we evaluate several
E ULER models on the LANL 2015 Comprehensive Multi- The LANL dataset has no node features by default, how-
Source Cyber Security Events dataset [24]. The dataset consists ever some information can be gleaned from the naming con-
of 57 days of log files from five different sources within the vention used in the log files. Entities have unique, anonymized
Los Alamos National Laboratory’s internal corporate network identifiers that start with either a U or a C denoting users
as it underwent both normal activity and a redteam campaign. and computers respectively. There are also nodes with non-
Specific details about this dataset are reported in Table V. The anonymized names that have important roles in the system such
edge count in the table represents the number of weighted as TGT, the Kerberos key distribution center, DC, the domain
edges; multiple events between the same entities in the same controller, and so on. To leverage this additional data, we
time period may be compressed into a single weighted edge. concatenate a 1-hot vector denoting user, computer, or special
Because events from the authentication logs have been labeled administrative machine to each node’s one-hot ID vector.
as normal or anomalous, this data set has been widely used for
cyber security research [9], [12], [13], [15]. The labels make it For quicker file scanning, and data loading times, the
especially apt for lateral movement detection research. When full 69 GB auth.txt file is split into chunks, which each
an APT-level threat is attempting to traverse a system, one hold 10,000 seconds (approximately 3 hours) of logs. Worker
possible warning sign will be authentications that should not machines are issued instructions to read in certain ranges of the
normally occur, a sign indicative of lateral movement on a log files and build the temporal graphs. Workers accomplish
network level. this by spawning several child processes to load multiple
snapshots in parallel. The associated edge lists, and edge
weight lists from each child process are combined to form the
TABLE V: LANL Data Set Metadata final TGraph object, which holds a list of all edge lists, edge
weights, node features, and tensor masks to take partitions of
Nodes 17,685
Events 45,871,390
each edge list for training.
Anomalous Edges 518
Duration (Days) 58 For all experiments, the training set consists of all snap-
shots that occur before the first anomalous edge appears in the
authentication logs. This allows models to learn what normal
In this section, all distributed models were implemented activity looks like. From this set, we remove the final 5% of
with 4 worker nodes unless otherwise specified. All exper- snapshots for tuning the classifier, and mask 5% of edges from
iments are run on a server with two Intel Xeon E5-2683 v3 each snapshot for validation.
(2.00 GHz) CPUs, each of which has 14 cores with 28 threads,
and 512 GB of memory [47].
We will first present the utility of models following the EULER framework as an anomaly-based intrusion detection system on the LANL dataset, then an analysis of the immense scalability afforded by splitting models in this way.

A. Graph Construction

We construct a weighted, directed graph from the authentication logs by mapping which entities authenticate with one another. As nodes, we use the entities denoted source and destination computers in the LANL documentation. For all authentications that occur from time t to t + δ, an edge is created between the source computer and the destination computer. If an edge already exists, a tally keeping track of the number of authentications between the two machines is updated. Experiments have shown that the most effective method to normalize these tallies into usable edge weights is to take the logistic sigmoid of the edges' standardized values. Mathematically, it can be represented as

    w(e) = σ((x_e − μ) / s)

where x_e is the authentication tally for edge e, μ and s are the mean and standard deviation of the tallies, and σ denotes the logistic sigmoid. To each node's one-hot ID vector, we concatenate a 1-hot vector denoting user, computer, or special administrative machine.

For quicker file scanning and data loading, the full 69 GB auth.txt file is split into chunks which each hold 10,000 seconds (approximately 3 hours) of logs. Worker machines are issued instructions to read in certain ranges of the log files and build the temporal graphs. Workers accomplish this by spawning several child processes to load multiple snapshots in parallel. The associated edge lists and edge weight lists from each child process are combined to form the final TGraph object, which holds a list of all edge lists, edge weights, node features, and tensor masks to take partitions of each edge list for training.

For all experiments, the training set consists of all snapshots that occur before the first anomalous edge appears in the authentication logs. This allows models to learn what normal activity looks like. From this set, we remove the final 5% of snapshots for tuning the classifier, and mask 5% of edges from each snapshot for validation.
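To make the construction above concrete, the following sketch buckets raw auth.txt rows into δ-second snapshots and computes the sigmoid-of-standardized-tally edge weights. This is a minimal illustration, not the EULER codebase: the column positions (time, source computer, destination computer) and the helper name build_snapshots are our assumptions.

```python
from collections import Counter
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def build_snapshots(auth_lines, delta=1800):
    """Bucket authentication events into delta-second snapshots and
    normalize per-edge tallies with a sigmoid over standardized values."""
    tallies = {}  # snapshot index -> Counter mapping (src, dst) -> count
    for line in auth_lines:
        fields = line.rstrip().split(',')
        # Assumed LANL auth.txt layout: time, ..., src computer, dst computer
        t, src, dst = int(fields[0]), fields[3], fields[4]
        bucket = tallies.setdefault(t // delta, Counter())
        bucket[(src, dst)] += 1

    snapshots = []
    for idx in sorted(tallies):
        counts = tallies[idx]
        vals = list(counts.values())
        mu = sum(vals) / len(vals)
        sd = (sum((v - mu) ** 2 for v in vals) / len(vals)) ** 0.5 or 1.0
        edges = list(counts)
        # Logistic sigmoid of each edge's standardized tally, per the text
        weights = [sigmoid((counts[e] - mu) / sd) for e in edges]
        snapshots.append((edges, weights))
    return snapshots
```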
B. Experimental Setup

We test three encoders in conjunction with two recurrent neural networks, as well as models with no recurrent layer, to measure how much value temporal data adds to the overall embeddings. The encoder models are GCN [36], GAT [48], and GraphSAGE [49]. The recurrent models are GRU [37] and LSTM [42]. The models are trained in the same manner as the link detection and link prediction models in Section V. However, experiments showed that once a local optimum was found and validation scores ceased improving, they rarely improved with further training. As such, early stopping occurs after only ten epochs of no improvement.

Experiments showed for every model that using smaller time windows always led to better results. As such, we only present the output of tests on temporal graphs with time window δ = 1800 seconds (30 minutes).

The GAT encoder uses 3 attention heads, which was found to be optimal via hyperparameter tuning. The SAGE encoder uses maxpooling as its aggregation function, as this was found to be optimal in the original paper, and this makes it capable of discerning between certain graphs GCN cannot [50].
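The early-stopping rule described earlier in this subsection is a standard patience-based loop. A minimal sketch follows, where train_epoch and validate are placeholders for the reader's own training and validation routines, and the epoch cap is our own assumption:

```python
def train_with_early_stopping(model, train_epoch, validate,
                              patience=10, max_epochs=1500):
    best_score, stale, best_state = float('-inf'), 0, None
    for epoch in range(max_epochs):
        train_epoch(model)                    # one pass over all snapshots
        score = validate(model)               # e.g., AUC on held-out edges
        if score > best_score:
            best_score, stale = score, 0
            best_state = {k: v.detach().clone()
                          for k, v in model.state_dict().items()}
        else:
            stale += 1
            if stale >= patience:             # ten stale epochs: stop
                break
    if best_state is not None:
        model.load_state_dict(best_state)     # roll back to best checkpoint
    return best_score
```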
TABLE VI: Performance of EULER Models on the LANL Dataset when δ = 0.5 hours

Dynamic Link Detection

    Encoder    RNN    AUC     AP      TPR (%)  FPR (%)  P
    GCN        GRU    0.9912  0.0523  86.10    0.5698   0.0054
    GCN        LSTM   0.9913  0.0169  89.65    0.5723   0.0056
    GCN        None   0.9916  0.0116  88.57    0.4798   0.0066
    SAGE       GRU    0.9872  0.0307  84.71    0.6874   0.0044
    SAGE       LSTM   0.9887  0.0389  83.55    0.6591   0.0045
    SAGE       None   0.8652  0.0052  79.58    24.5669  0.0001
    GAT        GRU    0.9094  0.0076  85.21    21.533   0.0001
    GAT        LSTM   0.8713  0.0022  96.83    19.873   0.0002
    GAT        None   0.9867  0.0079  99.88    23.174   0.0002
    UA         –      –       –       72.00    4.400    0.0010
    GL-LV [9]  –      –       –       67.00    1.200    0.0034
    GL-GV [9]  –      –       –       85.00    0.900    0.0051
    VGRNN      –      0.9315  0.0000  59.69    4.938    0.0000

Dynamic Link Prediction

    Encoder    RNN    AUC     AP      TPR (%)  FPR (%)  P
    GCN        GRU    0.9906  0.0155  85.49    0.6088   0.0050
    GCN        LSTM   0.9885  0.0166  78.91    0.5987   0.0047
    GCN        None   0.9902  0.0092  86.42    0.5425   0.0057
    SAGE       GRU    0.9847  0.0200  86.30    1.6542   0.0019
    SAGE       LSTM   0.9865  0.0228  85.29    0.8037   0.0038
    SAGE       None   0.9284  0.0020  86.23    16.525   0.0002
    GAT        GRU    0.8826  0.0020  87.82    21.971   0.0001
    GAT        LSTM   0.8383  0.0002  83.42    29.297   0.0001
    GAT        None   0.9352  0.0079  88.83    20.093   0.0002
    VGRNN      –      0.9503  0.0004  70.00    0.280    0.0004

Unfortunately, many other works which experiment with the LANL dataset either do not use the data set in full, as is the case with [12], [13], or conduct tests on portions of the data other than purely the authentication logs, as was done by [15], so it would not be fair or meaningful to compare our results to theirs. Prior work [9] does use the full authentication log as its dataset; however, it trains on a larger set of data, using all days that contained no anomalous activity as the training set, rather than just the days before the attack campaign. We include their results nonetheless. We also include the TPR and FPR of a rules-based "Unknown Authentication" (UA) model reported by [9]. This rule simply marks any edge that did not exist in the training data as anomalous.
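For reference, the UA baseline described above reduces to a set-membership test over training edges. A minimal sketch, reusing the (edges, weights) snapshot format from the earlier example:

```python
def ua_labels(train_snapshots, test_snapshots):
    seen = set()
    for edges, _ in train_snapshots:
        seen.update(edges)            # every (src, dst) observed in training
    # 1 = anomalous (never-before-seen edge), 0 = benign
    return [[0 if e in seen else 1 for e in edges]
            for edges, _ in test_snapshots]
```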
By default, VGRNN operates on full adjacency matrices; however, we modified it to use sparse edge lists for our experiments. This way it was able to scale to the large size of the LANL data set. Unfortunately, the E-GCN and DynGraph2Vec models could not scale to the LANL data set. DynGraph2Vec relies on dense adjacency matrices, and the size of the 1-hot vectors used as inputs was too large for E-GCN to process. As a result, our hardware was unable to fully evaluate these methods and their results are not compared to those of EULER.

All models evaluated use 32-dimensional hidden layers and 16-dimensional embeddings. All EULER models use a tanh activation function between the encoder and the recurrent layers, and an edge dropout layer before the GNNs. They all determine the classification threshold according to Equation 6 with λ = 0.6, except the GraphSAGE models; for this encoder, experiments showed λ = 0.5 was more appropriate. All reported results are average scores from 5 independent tests on link detection and link prediction.

C. Anomalous Edge Detection

It is difficult to properly evaluate methods for classifying imbalanced data, especially anomaly detection, where small false positive rates are so critical. For this reason, in addition to the raw true positive and false positive rates, we report precision (P), area under the curve (AUC), and average precision (AP). This latter metric is recommended for anomaly detection by [51] as especially adept for imbalanced datasets. The AUC and AP metrics evaluate the overall quality of scores given to edges, as opposed to the quality of classification, and provide better measurements of the model if the anomalous score threshold were to be changed. The precision metric provides further context to the quality of classification at the specific threshold. The average results of 5 experiments are shown in Table VI.
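These metrics can be computed directly from per-edge likelihood scores and red-team labels. The sketch below uses scikit-learn and reflects the definitions above rather than the authors' exact evaluation script; the convention of inverting likelihoods into anomaly scores is our assumption.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def summarize(likelihoods, labels, cutoff):
    """likelihoods: per-edge p(edge) from the model (low = suspicious);
    labels: 1 for red-team edges, 0 otherwise; cutoff: likelihood below
    which an edge is flagged as anomalous."""
    anomaly = 1.0 - np.asarray(likelihoods)       # higher = more anomalous
    labels = np.asarray(labels)
    flagged = np.asarray(likelihoods) <= cutoff
    tp = np.sum(flagged & (labels == 1))
    fp = np.sum(flagged & (labels == 0))
    return {
        'AUC': roc_auc_score(labels, anomaly),
        'AP':  average_precision_score(labels, anomaly),
        'TPR': 100.0 * tp / max(labels.sum(), 1),
        'FPR': 100.0 * fp / max((labels == 0).sum(), 1),
        'P':   tp / max(tp + fp, 1),
    }
```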
As a baseline, consider the TPR of the UA rules-based approach. This implies that 28% of anomalous connections are those which have occurred before in the network. This supports our claim that temporal information about the context of connections is just as important as the entities that are authenticating. This system is an excellent baseline model to compare to, as any model that has a higher TPR than UA must be using a more advanced metric than simply memorizing every legitimate connection observed in normal activity. If a model has a TPR above 72% with a FPR lower than 4.4%, it must be leveraging topological or temporal context to judge the validity of connections.

The results show that the GCN is the most effective encoder for link detection, and SAGE is most effective for link prediction; this supports our claim, and that of [38], that more generalized models are more effective. The GAT models, which have 3x as many parameters as GCN and SAGE, performed quite poorly in both quality of scores and quality of classification.

Also worth noting is the way using an RNN affects the output. Surprisingly, the best AUC in the link detection tests was from a GCN with no temporal encoder. However, this metric is not a good indicator of model quality on data sets with imbalance as extreme as LANL. The dramatically higher AP score of all models which use RNNs suggests temporal data strongly affects FPR and cannot be ignored. Similarly, the GCN models without an RNN have high precision. However, again, the AP scores would indicate that while at this specific threshold omitting the RNN is beneficial, over all thresholds, models which take time into account perform better.

In the more realistic transductive link prediction tests, though the difference between the two RNNs tested is small in every case, the benefit they add is unquestionable. The best performing encoder, GraphSAGE, enjoyed a 10x improvement in AP when used in conjunction with any RNN. The next best performing encoder, GCN, achieved a 1.6x improvement in AP. This is evidence that temporal information carries important context for the topological state of the network, particularly for
filtering false positives. Where one authentication may appear
anomalous in isolation, when viewed in the context of previous
authentications, it can be correctly identified as benign.
The GL-LV and GL-GV methods reported in [9] do not consider time at all; the network is viewed as a static graph. Here, we again see the benefit of using a sequence encoder in conjunction with a pure topological embedder. The best EULER methods outperform their random walk-based approach in terms of both TPR and FPR. Also worth noting is that because our model uses temporal graphs, the alerts from EULER-based models come with a time stamp, making them more informative and valuable in a real-world scenario. The prior work ranks any duplicate edges, regardless of their temporal context, as equally anomalous.

Like EULER, VGRNN combines a topological encoder with a temporal one. They claim that by using temporal information as input during the topological encoding phase, complex temporal dependencies are better encoded than without it. However, this method performs no better than the purely statistical UA method. Even still, its false positive rate is excellent in the predictive test, outperforming every EULER model. It is worth noting, however, that the quality of likelihood scores is very poor with this method, which implies that had the threshold for the EULER models been set lower, their FPR would be lower with equivalent TPRs. This is readily apparent in the GCN-based models, where the FPR is only a few tenths of a percent larger than that of the dynamic VGRNN, but their TPRs are almost 10% higher.

Finally, we must concede that while the EULER models do outperform prior works, their FPRs are still too high to be useful as an intrusion detection system on their own. Some of this can be attributed to the data set itself; labeled anomalous events are very coarse-grained. There are likely many events the compromised entities engaged in that should be considered anomalous, and may have even been detected by our models, but that are treated as false positives due to the lack of fine-grained label information. Indeed, the red log only tracks "compromise events" and not the further malicious activity that ensues [24].

However, even if this is not fully the case, this method has great potential as a filtering device for further analysis tools. The low processing cost, which we will demonstrate later on, makes this an efficient way to minimize the number of interactions that need to be analyzed by, for example, a signature-based technique. As part of future work, we plan to work with analysts to answer the question of whether this approach is best used as a step in a longer pipeline or on its own. However, we have demonstrated that compared to all other anomaly-based works which analyze this data set, EULER models are the most effective.

D. Parameter Analysis

Time window size, the hyperparameter δ, has significant tradeoffs associated with it. As this value decreases, the number of edges increases, requiring more processing time. As this value increases, more repeat edges are condensed into single, weighted edges; additionally, there will be fewer discrete time steps for the recurrent networks to process, all contributing to faster processing time at the expense of less precise data. With more edges and more temporal granularity, models are more capable of learning useful patterns across time. Figure 4 illustrates this phenomenon.

Fig. 4: Change in AUC score as δ increases for link prediction and detection with GCN+GRU models. Scores are the average of five tests on the LANL dataset.

As is evident from the figure, smaller window sizes yield better quality scores; however, due to the longer processing time required, and the relatively small performance improvements as window size decreases past a certain point, it may be adequate to leave the snapshot duration a little higher than is optimal for faster training and evaluation. Additionally, we observe that changes in δ seem to affect the link prediction model at a higher rate than the link detection one. We suspect this is caused by the model's inability to predict short-term temporal patterns like the one described in Section III as δ grows, and the predictive model is especially apt to detect this type of anomaly.

Nonetheless, in both cases, more granular graphs lead to more informative edge scores. We speculate that at some point, having time slices too granular would have diminishing returns, as graphs will have too few edges to be useful. But due to the severe increase in training time as δ decreases, we have never managed to reach this point. We leave finding this boundary as an area for future work.

E. Scalability Analysis

The main benefit of using models which fit into the EULER framework is their scalability. While evaluation metrics on benchmarks were generally better than prior work, there were certainly some categories where the advanced models are comparable to our simple ones. However, as we will show, distributed topological encoding has tremendous performance benefits.

In Figure 5 we compare the best runtimes of our method to the runtimes of the serial GNN models we used for benchmarking in Section V. The serial methods were allowed 16 threads for intraprocess communication for a fair comparison to our 16 worker processes. However, even with this advantage, these methods are forced to process time steps one at a time; they simply cannot compete with the efficiency of EULER, especially as the size of data increases. Figure 5a shows that as the size of the data being processed increases, EULER's forward propagation speed is 2x that of the fastest competing algorithm. Additionally, by implementing EULER using DDP [31], backpropagation is sped up dramatically. Figure 5b shows EULER has almost a 16x improvement in backpropagation speed, suggesting backpropagation has near linear scaling as workers are added.
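The gradient synchronization that DDP provides can be illustrated with a minimal, self-contained sketch: each process holds a slice of the snapshots and a replica of a (stand-in) encoder, and every backward pass all-reduces gradients across replicas. The toy module, data, and port number are assumptions for illustration only; the real system replicates a GNN and coordinates an RNN on a leader process.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size, snapshots):
    os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
    os.environ.setdefault('MASTER_PORT', '29501')
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

    encoder = DDP(torch.nn.Linear(16, 16))        # stand-in for a GNN encoder
    opt = torch.optim.Adam(encoder.parameters(), lr=0.01)

    for x, target in snapshots[rank::world_size]: # this worker's snapshot slice
        opt.zero_grad()
        loss = F.mse_loss(encoder(x), target)     # toy reconstruction loss
        loss.backward()                           # DDP all-reduces gradients here
        opt.step()                                # replicas stay in sync
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = 4
    snaps = [(torch.randn(8, 16), torch.randn(8, 16)) for _ in range(12)]
    mp.spawn(worker, args=(world_size, snaps), nprocs=world_size)
```

Because the forward passes over different snapshots are independent, this pattern parallelizes the topological encoding while keeping every replica's weights identical after each optimizer step.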
(a) Forward propagation time    (b) Backward propagation time

Fig. 5: Performance comparison between the distributed EULER model and competing methods on varying numbers of snapshots from the LANL dataset. All time windows, δ, are 0.5 hours.

Fig. 6: Performance improvements as more workers are added for varying numbers of snapshots from the LANL data set. The model used was a GCN stacked on a GRU. All time windows, δ, are 0.5 hours.

Figure 6 shows the speedup of a GCN stacked upon a GRU built within the framework of EULER as more workers are added. For these experiments, we evaluate only worker counts that are powers of two to ensure each worker holds the same number of snapshots. As each worker requires two threads to run, one for the model replica and one for collective communication, our equipment can only accommodate up to 16 workers. The framework allows the number of workers to be a user-defined hyperparameter, so it is effortless to distribute work across a network with enough nodes to support it. It is evident from the chart that performance improvements are immediate. As more workers are added, runtime improves rapidly, with negligible improvement after about 8 workers for smaller amounts of data. However, as the amount of data processed increases, performance improvements diminish at a much slower rate. This is because the topological encoding task, the bulk of which must occur on CPU due to the high number of random accesses, scales perfectly with additional workers.

VII. RELATED WORK

We now compare EULER to related work in both intrusion detection and temporal link prediction generally.

Intrusion Detection Systems

Frequency-based models define normalcy through observed distributions of frequency counts, or other stochastic processes present in the network [12], [13]. Midas [25] does incorporate structural data with frequency counts; however, it can only detect bursts of anomalous events, and would struggle to identify individually anomalous connections. Supervised learning approaches such as [26], [27], [28] analyze only features of events with data mining approaches. Like the EULER approach, many unsupervised systems use deep autoencoders. However, they embed event features rather than network interactions [14], [15]. Kitsune [10] uses temporal patterns in addition to event features; however, temporal information is used as an input feature rather than for sequence encoding. The primary focus of research using graphs for intrusion detection is subgraph matching to detect known malicious patterns in provenance graphs [3], [5], [6], [7], [8], i.e., signature-based approaches. Network-level anomaly-based intrusion detection systems in the field of graph analytics are lacking. Only prior work [9] has proposed a method for detecting anomalous events in host logs that leverages the rich graph structure inherent to the medium. This approach only considers the network as a static graph, so edge embeddings that occurred at different points in time always have the same anomaly score.

Temporal Link Prediction

Temporal link prediction has been used in fraud detection [19], contact tracing [21], and traffic prediction [35]. To our knowledge, however, EULER is the first system to make use of this technique for anomaly-based intrusion detection. Earlier approaches to temporal link prediction used one-hot encodings
of node neighbors as features and processed them using MLPs [41], [52], [53], [54], [55]. However, as we and others have shown, these methods are not as powerful as those which leverage GNNs. Many models that incorporate GNNs for temporal graph embedding often do so by replacing the linear layers in RNNs with GNNs [20], [21]; other existing approaches such as VGRNN [19] and E-GCN [22] use highly engineered models, in addition to graph neural network RNNs, and must be run in serial.

VIII. CONCLUSION

In this work, we presented the EULER framework: a method to exploit the previously untapped potential for distributing the work done to train and execute temporal graph link predictors. When each topological encoder can operate independently, the most compute-heavy task of generating node embeddings can be scaled in a highly efficient manner. Though models following this framework are necessarily simple, we have shown that for anomalous link detection and prediction, models following the EULER framework perform as well as, or better than, their complex contemporaries. Finally, we showed how this framework can be used to train highly precise anomaly-based intrusion detection systems when network activity is viewed through the abstraction of a temporal graph. These intrusion detection systems are scalable, and more sound than other unsupervised techniques despite being trained with less data.

Future work may include testing other topological or temporal encoders that we did not evaluate. Further improvements could be made to the classifier by testing edge decoding techniques more advanced than the simple inner product logistic regression we use. As this framework allows for scalable message passing graph neural networks, future work could even include testing this technique on any large temporal graph data set previously thought intractable for GNNs.

ACKNOWLEDGMENT

The authors thank the anonymous reviewers for their helpful suggestions. This work was supported in part by DARPA under agreement number N66001-18-C-4033 and National Science Foundation grants 1618706, 1717774, and 2127207. The views, opinions, and/or findings expressed in this material are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense, National Science Foundation, or the U.S. Government.

REFERENCES

[1] B. E. Strom, A. Applebaum, D. P. Miller, K. C. Nickels, A. G. Pennington, and C. B. Thomas, "Mitre att&ck: Design and philosophy," Technical report, 2018.
[2] "Lateral movement, tactic TA0008," Oct 2018. [Online]. Available: https://attack.mitre.org/tactics/TA0008/
[3] S. T. King and P. M. Chen, "Backtracking intrusions," in Proceedings of the nineteenth ACM symposium on Operating systems principles, 2003, pp. 223–236.
[4] B. Caswell and J. Beale, Snort 2.1 intrusion detection. Elsevier, 2004.
[5] S. M. Milajerdi, B. Eshete, R. Gjomemo, and V. Venkatakrishnan, "Poirot: Aligning attack behavior with kernel audit records for cyber threat hunting," in Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, 2019, pp. 1795–1812.
[6] S. M. Milajerdi, R. Gjomemo, B. Eshete, R. Sekar, and V. Venkatakrishnan, "Holmes: real-time apt detection through correlation of suspicious information flows," in 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 2019, pp. 1137–1152.
[7] S. Wang and S. Y. Philip, "Heterogeneous graph matching networks: Application to unknown malware detection," pp. 5401–5408, 2019.
[8] R. Wei, L. Cai, A. Yu, and D. Meng, "Deephunter: A graph neural network based approach for robust cyber threat hunting," arXiv preprint arXiv:2104.09806, 2021.
[9] B. Bowman, C. Laprade, Y. Ji, and H. H. Huang, "Detecting lateral movement in enterprise computer networks with unsupervised graph AI," in 23rd International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2020), 2020, pp. 257–268.
[10] Y. Mirsky, T. Doitshman, Y. Elovici, and A. Shabtai, "Kitsune: An ensemble of autoencoders for online network intrusion detection," Machine Learning, vol. 5, p. 2.
[11] I. H. Sarker, "Cyberlearning: Effectiveness analysis of machine learning security modeling to detect cyber-anomalies and multi-attacks," Internet of Things, p. 100393, 2021.
[12] N. Heard and P. Rubin-Delanchy, "Network-wide anomaly detection via the dirichlet process," in 2016 IEEE Conference on Intelligence and Security Informatics (ISI). IEEE, 2016, pp. 220–224.
[13] M. Whitehouse, M. Evangelou, and N. M. Adams, "Activity-based temporal anomaly detection in enterprise-cyber security," in 2016 IEEE Conference on Intelligence and Security Informatics (ISI). IEEE, 2016, pp. 248–250.
[14] F. Farahnakian and J. Heikkonen, "A deep auto-encoder based approach for intrusion detection system," in 2018 20th International Conference on Advanced Communication Technology (ICACT). IEEE, 2018, pp. 178–183.
[15] T. Bai, H. Bian, A. Abou Daya, M. A. Salahuddin, N. Limam, and R. Boutaba, "A machine learning approach for rdp-based lateral movement detection," in 2019 IEEE 44th Conference on Local Computer Networks (LCN). IEEE, 2019, pp. 242–245.
[16] "Use alternate authentication material: Pass the ticket," May 2017. [Online]. Available: https://attack.mitre.org/techniques/T1550/003/
[17] "Valid accounts," May 2017. [Online]. Available: https://attack.mitre.org/techniques/T1078/
[18] A. L. Buczak and E. Guven, "A survey of data mining and machine learning methods for cyber security intrusion detection," IEEE Communications Surveys & Tutorials, vol. 18, no. 2, pp. 1153–1176, 2015.
[19] E. Hajiramezanali, A. Hasanzadeh, K. Narayanan, N. Duffield, M. Zhou, and X. Qian, "Variational graph recurrent neural networks," in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., vol. 32. Curran Associates, Inc., 2019, pp. 10701–10711. [Online]. Available: https://proceedings.neurips.cc/paper/2019/file/a6b8deb7798e7532ade2a8934477d3ce-Paper.pdf
[20] Y. Seo, M. Defferrard, P. Vandergheynst, and X. Bresson, "Structured sequence modeling with graph convolutional recurrent networks," in International Conference on Neural Information Processing. Springer, 2018, pp. 362–373.
[21] J. Chen, X. Xu, Y. Wu, and H. Zheng, "Gc-lstm: Graph convolution embedded lstm for dynamic link prediction," arXiv preprint arXiv:1812.04206, 2018.
[22] A. Pareja, G. Domeniconi, J. Chen, T. Ma, T. Suzumura, H. Kanezashi, T. Kaler, T. Schardl, and C. Leiserson, "Evolvegcn: Evolving graph convolutional networks for dynamic graphs," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 5363–5370.
[23] G. M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," in Proceedings of the April 18-20, 1967, spring joint computer conference, 1967, pp. 483–485.
[24] A. D. Kent, "Comprehensive, multi-source cyber-security events data set," Los Alamos National Lab. (LANL), Los Alamos, NM (United States), Tech. Rep., 2015.
[25] S. Bhatia, B. Hooi, M. Yoon, K. Shin, and C. Faloutsos, "Midas: Microcluster-based detector of anomalies in edge streams," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 3242–3249.
[26] L. Dhanabal and S. Shantharajah, "A study on nsl-kdd dataset for intrusion detection system based on classification algorithms," International journal of advanced research in computer and communication engineering, vol. 4, no. 6, pp. 446–452, 2015.
[27] A. Alazab, M. Hobbs, J. Abawajy, and M. Alazab, "Using feature selection for intrusion detection system," in 2012 international symposium on communications and information technologies (ISCIT). IEEE, 2012, pp. 296–301.
[28] R. Lohiya and A. Thakkar, "Intrusion detection using deep neural network with antirectifier layer," in Applied Soft Computing and Communication Networks. Springer, 2021, pp. 89–105.
[29] "Kdd cup 1999," Oct 1999. [Online]. Available: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
[30] G. Mohi-ud din, "Nsl-kdd," 2018. [Online]. Available: https://dx.doi.org/10.21227/425a-3e55
[31] S. Li, Y. Zhao, R. Varma, O. Salpekar, P. Noordhuis, T. Li, A. Paszke, J. Smith, B. Vaughan, P. Damania et al., "Pytorch distributed: Experiences on accelerating data parallel training," arXiv preprint arXiv:2006.15704, 2020.
[32] T. N. Kipf and M. Welling, "Variational graph auto-encoders," arXiv preprint arXiv:1611.07308, 2016.
[33] A. Grover and J. Leskovec, "node2vec: Scalable feature learning for networks," in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, 2016, pp. 855–864.
[34] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[35] L. Zhao, Y. Song, C. Zhang, Y. Liu, P. Wang, T. Lin, M. Deng, and H. Li, "T-gcn: A temporal graph convolutional network for traffic prediction," IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 9, pp. 3848–3858, 2019.
[36] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
[37] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," arXiv preprint arXiv:1409.1259, 2014.
[38] O. Shchur, M. Mumme, A. Bojchevski, and S. Günnemann, "Pitfalls of graph neural network evaluation," arXiv preprint arXiv:1811.05868, 2018.
[39] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[40] Y. Rong, W. Huang, T. Xu, and J. Huang, "Dropedge: Towards deep graph convolutional networks on node classification," arXiv preprint arXiv:1907.10903, 2019.
[41] P. Goyal, S. R. Chhetri, and A. Canedo, "dyngraph2vec: Capturing network dynamics using dynamic graph representation learning," Knowledge-Based Systems, vol. 187, p. 104816, 2020.
[42] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[43] B. Viswanath, A. Mislove, M. Cha, and K. P. Gummadi, "On the evolution of user interaction in facebook," in Proceedings of the 2nd ACM workshop on Online social networks, 2009, pp. 37–42.
[44] C. E. Priebe, J. M. Conroy, D. J. Marchette, and Y. Park, "Scan statistics on enron graphs," Computational & Mathematical Organization Theory, vol. 11, no. 3, pp. 229–247, 2005.
[45] M. Rahman and M. Al Hasan, "Link prediction in dynamic networks using graphlet," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2016, pp. 394–409.
[46] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to data mining. Pearson Education India, 2016, ch. 5.
[47] "Intel xeon processor e5-2683 v3 (35m cache, 2.00 ghz) product specifications," 2014. [Online]. Available: https://ark.intel.com/content/www/us/en/ark/products/81055/intel-xeon-processor-e5-2683-v3-35m-cache-2-00-ghz.html
[48] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks," arXiv preprint arXiv:1710.10903, 2017.
[49] W. L. Hamilton, R. Ying, and J. Leskovec, "Inductive representation learning on large graphs," arXiv preprint arXiv:1706.02216, 2017.
[50] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, "How powerful are graph neural networks?" arXiv preprint arXiv:1810.00826, 2018.
[51] A. Divakaran and A. Mohan, "Temporal link prediction: a survey," New Generation Computing, pp. 1–46, 2019.
[52] L. Zhou, Y. Yang, X. Ren, F. Wu, and Y. Zhuang, "Dynamic network embedding by modeling triadic closure process," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
[53] M. Rahman, T. K. Saha, M. A. Hasan, K. S. Xu, and C. K. Reddy, "Dylink2vec: Effective feature representation for link prediction in dynamic networks," arXiv preprint arXiv:1804.05755, 2018.
[54] P. Goyal, N. Kamra, X. He, and Y. Liu, "Dyngem: Deep embedding method for dynamic graphs," arXiv preprint arXiv:1805.11273, 2018.
[55] T. Li, J. Zhang, S. Y. Philip, Y. Zhang, and Y. Yan, "Deep dynamic network embedding for link prediction," IEEE Access, vol. 6, pp. 29219–29230, 2018.
TABLE VII: The EULER Interface

Encoding Layer

forward(partition: int) -> Tensor
    Takes a subset of edges hosted on the worker machine denoted by the partition enum, which maps to a mask tensor held on the worker, and passes them through a model; for the purposes of this paper it is assumed to be a GNN. It returns a T_w × |V| × h or |V| × T_w × h tensor, depending on how one wishes to implement the RNN, where T_w is the number of snapshots held on worker w, and h is a user-specified hidden dimension.

calc_loss(zs: Tensor, partition: int, nratio: float) -> Tensor
    Samples some subset of edges held by the worker, and uses the embeddings zs to calculate loss. In general, we use binary cross entropy (BCE) loss on the training set of known edges, and a random sample of |E| × nratio non-edges at each training step. Returns a 1 × 1 tensor of the total loss generated from the edges held in the worker.

score_edges(zs: Tensor) -> Sequence[Tensor]
    Returns the likelihood score of each edge held by the worker given the input tensor of node embeddings. This function is used for evaluation to detect anomalous connections, and to calculate AUC scores. It returns a T_w-length list of |E_t| × 1 tensors containing likelihood scores.

load_new_data(fn: Callable[..., TGraph], *args, **kwargs) -> None
    A function to load new data into a worker, called by the constructor by default. The loader should have a return type TGraph, a special class to hold temporal graphs.

Recurrent Layer

forward(partition: int) -> Tensor
    Issues a command to each worker to run their forward method, and waits for their response. As responses come in, in order, it passes output from workers into an RNN. Returns the T × |V| × d or |V| × T × d tensor, depending on how one wishes to implement the RNN, where T is the total number of snapshots held across all workers and d is the user-specified embedding dimension.

calc_loss(zs: Tensor, partition: int, nratio: float) -> Tensor
    Splits the tensor of embeddings zs into contiguous slices such that z[t] is the parameter for the likelihood of edges in snapshot t in the worker it is sent to. It then passes those embeddings to the workers and issues a call for them to calculate loss on them. When the workers have finished calculating loss, the leader aggregates the totals and returns a 1 × 1 tensor of total loss across all workers.

score_edges(zs: Tensor) -> Sequence[Tensor]
    Splits the tensor of embeddings zs into contiguous slices in the same manner as calc_loss, then passes those embeddings to the workers, and issues a call for them to return the likelihood score of each edge they hold. It aggregates each list of edge likelihoods returned by the workers, and returns a T-length list of |E_t| × 1 tensors of likelihood scores.

APPENDIX A
IMPLEMENTATION DETAILS

Here we provide the technical details of the EULER framework. Every required method for the leader and follower interfaces is detailed in Table VII.

The Topological Encoder is a GNN replicated across several worker computers. This can be accomplished in a number of ways, but in our implementation we use the PyTorch DDP wrapper class around classes which implement the topological encoder interface. This interface requires the following methods: forward, calc_loss, score_edges, and load_new_data. Each encoder also has a field containing a TGraph object. This object holds a list of edges at each snapshot, the weights of those edges, and node features at each time. When workers are constructed, they load this data from memory using the load_new_data method.
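As an illustration of this interface, the sketch below shows one plausible worker-side implementation: a one-layer GCN-style encoder with the inner-product decoder and BCE loss used throughout the paper. The TGraph stand-in, the omission of partition masks, and all internal names are our simplifications, not the authors' code.

```python
import torch
from torch import nn

class TGraph:
    """Minimal stand-in for the TGraph container: one-hot node features
    plus per-snapshot edge lists (2 x nnz) and weights. The real class
    also carries train/validation masks (omitted here)."""
    def __init__(self, num_nodes, edge_lists, weight_lists):
        self.x = torch.eye(num_nodes)
        self.eis = [torch.as_tensor(ei, dtype=torch.long) for ei in edge_lists]
        self.ews = [torch.as_tensor(ew, dtype=torch.float) for ew in weight_lists]
        self.T, self.n = len(edge_lists), num_nodes

class Encoder(nn.Module):
    """Implements the encoding-layer interface of Table VII (sketch)."""
    def __init__(self, loader, *args, hidden=16, **kwargs):
        super().__init__()
        self.load_new_data(loader, *args, **kwargs)
        self.lin = nn.Linear(self.g.n, hidden)

    def load_new_data(self, loader, *args, **kwargs):
        self.g = loader(*args, **kwargs)          # loader returns a TGraph

    def forward(self, partition: int) -> torch.Tensor:
        # Partition masks omitted for brevity: every edge is embedded
        zs = []
        for ei, ew in zip(self.g.eis, self.g.ews):
            adj = torch.sparse_coo_tensor(ei, ew, (self.g.n, self.g.n))
            zs.append(torch.relu(torch.sparse.mm(adj, self.lin(self.g.x))))
        return torch.stack(zs)                    # T_w x |V| x h

    def score_edges(self, zs):
        # Inner-product decoder: p(edge u->v at t) = sigmoid(z_u . z_v)
        return [torch.sigmoid((zs[t][ei[0]] * zs[t][ei[1]]).sum(dim=1))
                for t, ei in enumerate(self.g.eis)]

    def calc_loss(self, zs, partition: int, nratio: float = 1.0):
        pos = torch.cat(self.score_edges(zs))
        n = int(len(pos) * nratio)                # random non-edges as negatives
        src = torch.randint(0, self.g.n, (n,))
        dst = torch.randint(0, self.g.n, (n,))
        neg = torch.sigmoid((zs[-1][src] * zs[-1][dst]).sum(dim=1))
        eps = 1e-8                                # binary cross entropy
        return -(torch.log(pos + eps).mean() + torch.log(1 - neg + eps).mean())
```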

The Recurrent Layer is a model-agnostic RNN held on a single leader machine, which coordinates the actions of the workers. Users must implement the following methods: forward, calc_loss, and score_edges, which are explained in detail in Table VII.
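A matching sketch of the leader side follows: it concatenates worker embeddings in snapshot order, runs them through a GRU, and delegates loss and scoring back to the workers. Again, the worker handles and field names are illustrative assumptions; in the real framework these calls are remote and issued concurrently.

```python
import torch
from torch import nn

class RecurrentLayer(nn.Module):
    """Sketch of the leader: gathers T_w x |V| x h embeddings from each
    worker in order and runs an RNN over the time axis. `workers` stands
    in for RPC handles; local Encoder objects (as above) also work."""
    def __init__(self, workers, hidden=16, emb=16):
        super().__init__()
        self.workers = workers
        self.rnn = nn.GRU(hidden, emb)            # any sequence encoder fits

    def forward(self, partition: int) -> torch.Tensor:
        # Serial here for clarity; EULER issues these calls concurrently
        zs = torch.cat([w.forward(partition) for w in self.workers], dim=0)
        out, _ = self.rnn(zs)                     # T x |V| x d embeddings
        return out

    def calc_loss(self, zs, partition: int, nratio: float = 1.0):
        # Slice embeddings back to each worker's snapshots, sum the losses
        losses, t = [], 0
        for w in self.workers:
            losses.append(w.calc_loss(zs[t:t + w.g.T], partition, nratio))
            t += w.g.T
        return torch.stack(losses).sum()

    def score_edges(self, zs):
        scores, t = [], 0
        for w in self.workers:
            scores.extend(w.score_edges(zs[t:t + w.g.T]))
            t += w.g.T
        return scores                             # T-length list of scores
```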
