Detection
Detection
Abstract—Lateral movement is a key stage of system com- The most robust way to detect malware propagation is not
promise used by advanced persistent threats. Detecting it is no to exhaustively list every known malicious signature correlat-
simple task. When network host logs are abstracted into discrete ing with it; rather it is to train a model to learn what normal
temporal graphs, the problem can be reframed as anomalous edge activity looks like, and to alert when it detects behavior that
detection in an evolving network. Research in modern deep graph deviates from it. However, detecting anomalous activity in
learning techniques has produced many creative and complicated
models for this task. However, as is the case in many machine
an enterprise network presents unique challenges. The data
learning fields, the generality of models is of paramount impor- involved both during training and the implementation of an
tance for accuracy and scalability during training and inference. anomaly-based intrusion detection system is enormous. Often
In this paper, we propose a formalized approach to this problem the log files that such a system would require as input are
with a framework we call E ULER. It consists of a model-agnostic terabytes large. To be useful, a lateral movement detection
graph neural network stacked upon a model-agnostic sequence model must be highly scalable to accommodate such large
encoding layer such as a recurrent neural network. Models built data. Additionally, when viewed as a classification problem,
according to the E ULER framework can easily distribute their any such system would have to be highly precise. Millions
graph convolutional layers across multiple machines for large of events occur in an enterprise network on any given day,
performance improvements. Additionally, we demonstrate that and only a fraction of a percent of all interactions are ever
E ULER-based models are competitive, or better than many state-
anomalous [18]. Therefore, a model must have an extremely
of-the-art approaches to anomalous link detection and prediction.
As anomaly-based intrusion detection systems, E ULER models can low rate of false alerts so as not to overwhelm its users.
efficiently identify anomalous connections between entities with In this work, we formulate anomalous lateral movement
high precision and outperform other unsupervised techniques for
detection as a temporal graph link prediction problem. In-
anomalous lateral movement detection.
teractions occurring in discrete units of time on a network
can be abstracted into a series of graphs Gt = {V, Et } called
I. I NTRODUCTION & M OTIVATION snapshots, where V is the set of entities in the network that had
interactions Et = {(u, v) ∈ V} during a set period of time, t.
Lateral movement is a key stage of the MITRE ATT&CK A temporal link prediction model will learn normal patterns of
framework [1] describing the behavior of advanced persistent behavior from previous snapshots and assign likelihood scores
threats (APTs). At its core, lateral movement is malware to edges that occur in the future. Edges with low likelihood
propagating through a network to spread onto new computers scores correlate to anomalous connections within the network.
in an attempt to find their target. This may involve pivoting As [9] points out, these anomalous connections are often
through multiple systems and accounts in a network using indicative of lateral movement. As we will later show, this
either legitimate credentials or malware to accomplish the reframing of the problem improves precision over standard
task [2]. As it is one of the final steps in the killchain anomaly-based intrusion detection techniques.
before complete compromise, detecting it early is of critical
importance. Recent approaches to temporal link prediction combine a
graph neural network (GNN) with a sequence encoder such
A plethora of machine learning approaches to intrusion as a recurrent neural network (RNN) to capture topological
detection exist, both signature-based models [3], [4], [5], [6], and temporal features of an evolving network. However, these
[7], [8] and anomaly-based [9], [10], [11], [12], [13], [14], approaches are either reliant on RNN output during the GNN
[15]. These latter techniques are especially well-suited for stage of embedding [19] or merely incorporate GNNs into the
lateral movement detection, as APT techniques like “Pass the RNN architecture [20], [21], [22]. As Figure 1a illustrates,
Ticket” [16], or even just using stolen credentials [17] are these models are necessarily sequential, and unfortunately
very difficult to formalize into signatures for signature-based cannot scale to the large datasets that they would need to
intrusion detection systems [2]. process to be useful lateral movement detectors.
Proposed solution. To address this problem, we have
observed that the most memory-intensive part of existing
architectures occurs during the message passing stage within
Network and Distributed Systems Security (NDSS) Symposium 2022 the GNN. Furthermore, there exists an imbalance between the
24-28 April 2022, San Diego, CA, USA
ISBN 1-891562-74-6 massive size of node input features and the comparatively
https://dx.doi.org/10.14722/ndss.2022.24107 minuscule topological node embeddings. This means the most
www.ndss-symposium.org work and the most memory usage occurs in the GNN before
(a) Sequential temporal link predictors (b) E ULER parallel framework
Fig. 1: (a) Prior approaches rely on RNN output during the GNN stage of embedding or merely incorporate GNNs into the
RNN architecture, which forces models to work in serial, one snapshot at a time. In contrast, (b) the E ULER framework can
utilize several worker machines to hold consecutive snapshots of a discrete temporal graph. These workers process snapshots
in parallel through a replicated GNN shared across each machine. The output of these GNNs is returned to the leader machine
which runs them through a recurrent neural network to create temporal node embeddings which may be used for link prediction.
The framework is explained in detail in Section IV
the simpler forward pass of the RNN is calculated, necessarily the LANL comprehensive multi-source cyber-security events
in serial. If several replicated GNNs operate on snapshots dataset [24]. This dataset includes labeled event data from
independently, they can execute concurrently as shown in Fig- 58 consecutive days of real-world computer network use
ure 1b. Amdahl’s Law [23] would suggest that by distributing interspersed with red team activity. There are approximately
such a large portion of work, performance improvements will 1.6 billion events in total. This is a test not only of the E ULER
ensue. framework’s precision, but also its ability to scale. Our tests
show that E ULER-based models outperform prior works in both
In this work, we have developed E ULER1 , a formalized precision and compute time.
approach for scalable dynamic link prediction and anomalous
edge detection. The framework involves stacking a model In summary, our research contributions are:
agnostic GNN upon a sequence encoder such as an RNN.
In this way, a network’s topology at discrete moments in • We present, to the best of our knowledge, the first use
time is encoded by the GNN, and the dynamic changes of temporal graph link prediction for anomaly-based
in connections are encoded by the RNN. The embeddings intrusion detection. Other research in applying graph
produced by this model provide prior probabilities for future analytics to anomaly detection either does not consider
states of the network given what is embedded about the past the temporal nature of the data or does not use a pow-
structure. Most importantly, the framework is designed such erful GNN model. By incorporating both elements into
that GNNs may be replicated across several worker machines, E ULER, models built on this framework outperform
and execute independently, allowing disjoint sets of snapshots other unsupervised anomaly-based intrusion detection
to be processed concurrently. When the GNNs’ work occurs systems and yield more informative alerts.
in parallel for each snapshot, the topological data for an • We demonstrate that for temporal link prediction
entire series of graphs can theoretically be encoded in the and detection, the simple framework we propose is
time it takes to encode the snapshot with the most edges. equally or more accurate and precise than state-of-
With these immense performance enhancements, detecting the-art temporal graph autoencoder models. E ULER
anomalous user activity in industry-scale, real-world networks models attain higher area under the curve and average
using powerful GNN models becomes tractable. precision scores by 4% in inductive link detection
Experimental evaluation. To prove that the very simple tests, and about equal metrics to the state-of-the-art
model we propose is adequate for the complex task of anoma- in transductive link prediction tests.
lous edge detection, and by extension lateral movement de- • We propose a scalable framework for distributed tem-
tection, we evaluate a model following the E ULER framework poral link prediction for use on big data. The E ULER
against several state-of-the-art temporal link prediction models framework is simple and makes message passing
on three datasets. The results of these experiments show that lightweight even over large graphs. By breaking edge
despite its simplicity, the E ULER framework performs as well lists into temporal chunks and processing them in
or better than the state-of-the-art, and unlike the other models, parallel, the computational complexity of the message
can scale to accommodate larger data. passing stage of the model is theoretically bound
Implementation as an intrusion detection system. We by only the snapshot with the most edges. Other
test several models following the E ULER framework on optimizations allow the RNN to operate in parallel
with some of the GNN workers, further improving
1 Source code available at https://github.com/iHeartGraph/Euler performance.
2
The rest of this work is organized as follows: Section II graph G of network activity that predicts the likelihood of
provides background information about the topic and defines future interactions occurring in unseen snapshots. An observed
our key terms. Section III gives a motivating example to interaction between entities with a likelihood score below a
demonstrate the necessity of temporal graphs in anomaly-based certain threshold is said to be anomalous. These anomalous
intrusion detection. Section IV explains the E ULER framework edges, in the context of network monitoring, are often indica-
in detail. Section V details the experiments we conducted tive of lateral movement [9].
with this model compared to other temporal autoencoders.
Section VI showcases how we used E ULER-based models to III. M OTIVATION
build efficient and precise anomaly-based intrusion detection
systems. Section VII discusses related work in anomaly-based Historically, there have been two main ways of abstract-
intrusion detection systems and temporal link prediction. Fi- ing data for anomaly detection: frequency-based, and events-
nally, we provide a brief discussion and suggestions for future based [18]. Frequency-based methods, such as [12], [13], [25]
work in Section VIII. define anomalous behavior as activity which significantly devi-
ates from observed temporal patterns. Events-based methods,
such as [10], [14], [15], [26], [27], [28] identify commonalities
II. BACKGROUND
in features, such as packet count, protocols, etc., associated
Anomaly-based intrusion detection systems must first de- with individual events within a system. Anomalies are then
fine a baseline for normal behavior, then generate alerts when defined as events with unexpected features.
events occur that significantly deviate from this baseline. The The problem with these two approaches is that they ignore
definition for normalcy is highly contingent on the abstraction the inherently structural nature of network data. This is readily
used to represent the system. Our proposed solution represents apparent by observing that one of the most influential datasets
network activity as discrete temporal graphs. From there, for intrusion detection, [29], [30], has no features for source
detecting evidence of lateral movement betrayed by anomalous or destination entities. All that matters with these models
network activity is equivalent to anomalous edge detection, are the details of the events themselves in isolation, not the
which we accomplish via temporal link prediction. machines between which they occurred. Networks are too
complex to be represented as just a series of unrelated vectors,
Discrete Temporal Graphs or a multivariate probability distribution of isolated samples.
A discrete temporal graph2 G = {G1 , G2 , ...GT } is defined as Networks are webs of relational data, fluctuating over time.
a series of graphs Gt = {V, Et , Xt }, which we refer to as The most natural way to represent and analyze relational data,
snapshots, representing the state of a network at time t. Here, is as a graph.
V denotes a set of all nodes which appear in the network,
Et denotes relationships between nodes at time t, and Xt There has been growing interest in static graph-based
represents any features associated with the nodes at time t. methods for intrusion detection, such as [9]. Here, normalcy
In this work, all graphs are directed, and some have weighed is defined in terms of interactions between entities within
edges, W : E → R representing edge frequencies during the systems, but time is not considered. Anomalies are defined
time period each snapshot encompasses. as edges with low probability given what is known about the
graph’s structure.
Let the interactions I between users and machines in a
network at specific times be represented as a multiset of tuples To demonstrate the advantages and disadvantages of the
<src,dst,ts>. Here, src is an entity interacting with entity dst above intrusion detection systems, consider the example shown
at time ts. From this multiset, we can build the temporal graph in Figure 2. The first two time slices show normal activity in
G = {G0 , ..., GT } with time window δ. The set of all nodes V, the network: first at t0, Alice and Bob authenticate with their
is the set of every src and dst entity that appears in I. The set computers A and B, then at t1 computers A and B make a
of edges at time t, Et is constrained such that for all edges request to the shared drive. At times t2 and t3, we see that
(u, v) ∈ Et there exists an interaction < u, v, w >∈ I where when Bob does not first authenticate with computer B, it does
t ≤ w < t + δ. not communicate with the shared drive. A simple probability
distribution is apparent:
Temporal Link Prediction
Temporal link prediction is defined as finding a function that P((C1, SD) ∈ Et+1 | (B, C1) ∈ Et ) = 1
(1)
describes the likelihood that an edge exists in a temporal P((C1, SD) ∈ Et+1 | (B, C1) 6∈ Et ) = 0
graph at some point in time, given the previously observed
snapshots of a network. By representing an enterprise network However, in t4 and t5, something unusual occurs: computer B
as a temporal graph, we can further extend this definition to requests data from the shared drive without Bob authenticating
encompass anomaly detection. This follows from the assump- with it first! This could be a sign of an APT using attack tech-
tion that anomalous edges in a temporal graph will have a niques (as catalogued by the MITRE ATT&CK framework)
low probability of occurring given what is known about the T1563, remote service hijacking, or attempting T1080, tainting
network’s behavior in the past. shared content [2].
Lateral movement detection with temporal link prediction In order to detect attacks such as the one in the example,
is then defined as finding a function learned from the temporal a model would need to consider events with reference to
those which occurred previously, and with reference to the
2 For simplicity, for the remainder of this work, we refer to these simply as other interactions within the network. An event between two
“temporal graphs” entities that happens at one point in time cannot be considered
3
Fig. 2: A simple example of hard to detect temporally anomalous activity in a network with 3 machines and two users. The
normal sequence of events is that a user authenticates with a computer, then that computer makes a request to the shared drive.
However, at time t5, computer 1 makes a request to the shared drive without Bob first authenticating with it, denoting a perhaps
malicious process running on computer 1 acting not on behalf of the machine’s primary user.
identical to the same event occurring in the future under a A. The Encoder-Decoder
different global context. Unfortunately, existing graph-based
approaches, which do not consider time, and many event-based The E ULER framework is a generic extension of the
approaches that look at each event in isolation such as [9], [10], traditional graph autoencoder model [32] to temporal graphs.
[14] would see no difference between (C1, SD) at time t1 and It consists of a model-agnostic graph neural network (GNN)
(C1, SD) at time t4. What makes it stand out as anomalous is stacked upon a model-agnostic recurrent neural network
something lacking in the previous state of the graph–something (RNN). Together, these models aim to find an encoding func-
that would only be detected by considering prior probabilities tion f(·) and a decoding function g(·). The encoding function
for the next state of the system. maps nodes in a temporal graph with T snapshots to T low-
dimensional embedding vectors. The decoding function en-
Frequency-based and event-based approaches such as [12], sures minimal information is lost during the encoding process
[13], [26], [27], [28] would have similar trouble, as they lack and aims to reconstruct the input snapshots from the latent Z
the relational data present in the network; Bob’s activities a vectors. More formally, we can describe the behavior of the
timestep earlier would have little import on the interpretation encoder as
of the shared drive’s activity later on. If the interaction between Z = f({G0 , . . . , GT })
(C1, SD) at time t4 was sufficiently similar to the interaction (2)
= RNN( [GNN(X0 , A0 ), . . . , GNN(XT , AT )] )
at time t1, these approaches would see no difference between
the two. They lack the ability to capture the importance of where At is the |V| × |V| adjacency matrix representation of
interactions occurring between other entities in the network, the snapshot at time t. This T × |V| × d dimensional tensor Z
and how they may relate to a separate event. is optimized to contain information about both the structure of
the graph, and the dynamics of how it changes over time.
Our proposed solution, to represent the network as a
temporal graph, ensures the global structure of a network This is enforced by a decoder function, g(·) which attempts
at individual points in time is captured without losing the to reconstruct the original graph structure given the embed-
temporal dependencies of the changing connections. We make dings. More formally,
the more difficult assumption that malicious events can have g(Zt ) = Pr(At+n = 1 | Zt ) (3)
the same event features as normal ones; if this is the case,
traditional event-based approaches will not work. We also where Zt = Z[t] is the embedding of graph Gt and n ≥ 0.
assume attackers can make similar connections, or even the As was done by [9], [19], [32], [33] we use the inner product
same connections as observed in normal activity, meaning decoding as this g(·) function:
available graph-based approaches, and statistical approaches
are insufficient. With temporal graphs, assuming they have g(Zt ) = σ(Zt Z|t ) = Ãt+n (4)
enough granularity, these problems, as well as those tackled where σ(·) denotes the logistic sigmoid function, and Ãt+n
by prior works are solvable. represents the reconstructed adjacency matrix at time t + n.
As a consequence of using inner product decoding, the dot
product of vectors Zt [u] and Zt [v] represents the log-odds
IV. E ULER that an edge, (u, v) exists at time t + n. In this way, the g(·)
function is used to detect anomalous edges.
In this section we describe our proposed framework, which
we call E ULER. This framework aims to learn a probability
function conditioned on previous states of a temporal graph, B. Workflow
to determine the likelihood of an edge occurring at a later The core of the E ULER framework is a simple design. It
state. Furthermore, it is the goal of this work to offer an simply stacks the replicas of a model-agnostic GNN which
approach that is not just precise, but also highly scalable. we refer to as the topological encoder upon a model-agnostic
We first describe the basic components of the system, then recurrent layer with a few simple constraints. When fit into
how they are distributed across multiple machines, and how the leader/worker paradigm3 with one recurrent layer as the
these components interact. Finally, we describe how different
training objectives are implemented on the E ULER interface. 3 Historically called the master/slave paradigm
4
Fig. 3: The complete series of interactions between the leader, and worker machines during one training step. In step 1.), the
leader machine initiates each worker, and issues a command for each of them to load a disjoint subset of the snapshots from
memory. When this has occurred, the training loop can begin. At step 2.), the leader issues a command to each worker to perform
a forward pass on their graphs through their GNNs. As they complete this, workers send the topological embeddings, Z̄t to an
ordered queue in the leader. Upon receiving the embedding for t = 0, step 3.) begins. The leader runs embeddings from the
queue through its RNN as they come in to produce the final Z embeddings. Then, in step 4.), the leader sends the embeddings
back to the workers to decode and calculate loss. Finally, 5.), the leader and workers perform a backward pass on the aggregated
loss functions according to the DDP gradient bucketing algorithm [31]. During evaluation, the steps are nearly identical, but on
step 4.) workers return their scores for each edge y
^ t instead of loss.
leader, and multiple topological encoders as workers, it has on earlier embeddings while future snapshots are still being
the potential for massive parallelism. processed by workers. The method for this distribution of work
is explained in more detail in Algorithm 1.
The overall workflow for E ULER is shown in Figure 3. It
occurs in 5 basic stages: 1.) The leader spawns the workers
and instructs them on which snapshots to load; 2.) the leader Algorithm 1: Distributing G across workers
initiates the training loop, and workers generate topological initialize k workers, and temporal graph G = {G0 , ..., GT };
embeddings; 3.) as the topological embeddings are received, /* Give each worker an equal amount of work */
the leader processes them through an RNN; 4.) the output of int minTasks = b Tk c;
the RNN is sent back to the workers to calculate loss or for int tasks = [minTasks] ∗ k;
scoring; 5.) in training mode, loss is returned to the leader /* If work is not evenly divided, assign it to
to be backpropagated. During evaluation, anomaly scores are last workers first */
returned. int remainder = T % k;
if remainder then
In this section, we describe how these 5 steps are imple- for (int i=k-1; i≥k-remainder; i - -) do
mented in greater detail. The distributed workers are imple- tasks[i] ++;
mented using the PyTorch DDP library [31]. More detailed end
end
technical information about the E ULER interface is available
in the Appendix. /* Each worker loads as many contiguous
snapshots as they were assigned */
int tmp, start=0, end=tasks[0];
Loading Data for (i=0; i<k; i ++) do
When the program first starts, several processes are spawned. asynchronously call workers[i].load new data(G[start:end]);
The PyTorch multiprocessing library automatically assigns tmp=start; start += end; end = tmp+tasks[i+1];
end
each spawned process a unique ID from 0 to k + 1; for
convenience the machine with id:0 becomes the leader, and all [w.wait() for w ∈ workers]
others are workers. The leader machine, which holds remote
references to all workers in addition to the recurrent layer After the workers have been assigned their snapshots, they
issues commands to the other processes to spin up all worker concurrently read them in. The leader waits for each worker to
instances. signal that all data is loaded, then moves on to the next phase
Upon their creation, workers connect to the leader and of the workflow.
await instructions. After the initialization of all workers, the
leader assigns them contiguous subsets of temporal graph data Topological Encoding
to load, shown in step 1 of Figure 3. For example, given a During the forward pass–shown in step 2 of Figure 3–the
set W = {w1 , ..., wk } of k workers, and a temporal graph recurrent layer issues asynchronous calls to forward on each
split into T snapshots, the leader assigns each worker s = Tk
worker machine. To minimize network traffic, the only thing
snapshots. Then worker wi holds {Gsi , ..., Gsi+(s−1) }. In the the leader sends to the workers is an enum representing which
likely case where k does not divide T , extra units of work partition of edges to process, as some are held out for vali-
are assigned to workers holding snapshots later in time. This dation and testing. Workers then process every snapshot they
is because the recurrent layer of the model processes the hold. Further reducing network traffic, the matrices returned
topological encoders’ output in order, as it comes in. Thus, by the workers are far smaller than those used as inputs, as
the RNN can perform the necessarily sequential forward pass each worker is essentially a graph autoencoder [32].
5
Algorithm 2: Leader machine forward method yt . The process for decoding the embeddings is the same,
def forward(self, workers, partition): however loss is not calculated.
/* Leader tells each worker to begin
executing */ Backpropagation & Evaluation
futures = [];
for w ∈ workers do
When loss has been calculated and returned to the leader
future = asynchronously execute w.forward(partition); machine, gradients are calculated via backpropagation, first
futures.append(future); through the recurrent layer, then components of the loss
function generated by each topological encoder are backpropa-
/* As workers return their embeddings, the
leader processes them in order, as they
gated in parallel and broadcast between workers in accordance
arrive */ with the bucketing algorithm described in [31]. After the
h=NULL; backpropagation step, and collective communication between
zs = []; workers, gradients across all workers’ model replicas are equal.
for f ∈ futures do Finally, the recurrent layer and the topological encoders all
z, h = self.RNN(f.wait(), h);
zs.append(z); update their parameters using an Adam optimizer [34], and
the leader repeats steps 2-5 until convergence.
return concat(zs)
If the model is in evaluation mode, the leader machine
instead uses the y
^ t likelihoods the workers generate, and the
known labels yt , also returned by the workers, to calculate
For performance optimization, and to ensure consistency, precision and accuracy metrics, which are saved. In a real-
we impose just one constraint for this stage: topological world implementation without labels, it would instead raise
encoders must not be dependent on any temporal information. alerts on observed edges with likelihoods below a certain
They provide a purely spatial encoding of the state of each threshold.
snapshot they hold, using only features observable at a single
point in time. With this constraint satisfied, all encoders can
operate in parallel, as no one worker is dependent on the output C. Training
of another. Theoretically, with as many workers as snapshots,
the time complexity of a forward pass is constrained only by There are two modes of training E ULER models: as a
the snapshot with the greatest number of edges. link detector, or a link predictor. These two modes are
distinguished by which Zt embeddings are sent to the workers
Temporal Encoding at step 4 to calculate loss. Link detectors are inductive; they
The leader maintains an ordered list of future objects that point generate Zt using partially observed snapshots {G^0 , ..., G^t } and
to the eventual output of the workers and waits for the future attempt to reconstruct the full adjacency matrix At with g(Zt ).
pointing to the first embeddings to finish executing. When In practice, they would be used for forensic tasks, where one
the leader receives this tensor, it is immediately processed is performing an audit to identify anomalous connections that
by its RNN, shown in step 3 of Figure 3. Note that so long have already occurred.
as tasks are slightly imbalanced such that workers holding Link predictors are transductive; they generate Zt using
later snapshots contain more work units, the leader’s recurrent snapshots {G0 , ..., Gt }, in order to predict the future state, At+n
layer can execute concurrently with at most k − 1 workers’ where n > 0. In practice, they could be used as a live intrusion
topological encoders. Workers with earlier snapshots hold detection tool, as predictive models can score edges as they are
fewer work units, and therefore finish executing earlier, so the observed–before they have been processed into full snapshots.
leader can process their outputs while workers holding greater For example, when n = 1, given what has been observed about
quantities of snapshots further in time are still processing. the network up until time t − 1, it is the goal of predictive
When the recurrent layer has finished processing the output implementation of E ULER to score edges observed at time t.
of one worker, the hidden state and outputs from the RNN are Such a model can use embeddings learned from previous states
saved. The leader waits for the next topological embedding to of the network to process connections as they occur.
finish processing, then uses the saved RNN hidden state and To ensure these objectives, the reconstruction loss function
the next embedding to repeat the process until all workers have aims to minimize the negative log likelihood of Equation 3,
finished executing. This procedure is described in more detail where n = 0 when training detectors, and n > 0 for predictors.
in Algorithm 2. For larger graphs, operating over the entire adjacency matrix
quickly becomes intractable. Instead, we approximate this
Decoding value by minimizing binary cross entropy on the likelihood
When the leader finishes generating the final embeddings, they scores for known edges, and a random sample of non-edges
are sent back to the workers to decode, and if the model is at time t + n: Pt+n and Nt+n . Formally, the reconstruction
training, to calculate loss. This process occurs in parallel on loss function for snapshot t + n is
the worker machines. In general, graph functions such as those
used to find edge likelihoods are more compute and memory
intensive, so we endeavor to run them in parallel whenever Lt = − log(Pr(At+n | Zt ))
possible. This stage is shown in step 4 of Figure 3. −1 X −1 X
≈ log(p) + log(1 − n)
During evaluation, instead of returning loss, the workers |Pt+n | |Nt+n |
p∈Pt+n n∈Nt+1
return edge likelihoods y
^ t , and the ground-truth edge labels (5)
6
New negative edges are randomly sampled on each epoch V. B ENCHMARK E VALUATIONS
by workers for each snapshot they hold. The leader machine
coordinates which slices of the Z tensor are sent to each worker Prior works [19] and [22] both assert that simply em-
to generate Pt+n and Nt+n , and each worker independently bedding graph snapshots with a GNN, and running these
calculates loss on the training data they hold. embeddings through an RNN, as was done by [20], [35] does
not adequately capture the shifting distribution of nodes in
On predictive models, Zt represents the latent space of a dynamic network. Prior work [22] demonstrates that this
nodes n snapshots in the future. Consequently, predictive is the case for inferring complete graph structure several
models cannot calculate loss on the first n snapshots of the time steps in the future, but [19] does not evaluate their
graph. To account for this, the leader pads Z with a n |V| × d model against the very model they so thoroughly disregard.
zero matrices at the beginning, and the final n matrices are To remedy this, we present a comparison between several
removed before returning the embeddings to the workers. This existing temporal autoencoder models, and a simple stacked
way, Z[t] predicts the snapshot indexed at t on all workers GCN [36] and GRU [37] following the E ULER framework.
except the one holding the initial snapshots, which ignores We select these two layers because of their lack of parameters–
embeddings that equal the zero matrix. This process is shown they are the most generalized of the models in their respective
more clearly in Algorithm 3. domains [38], [39].
7
TABLE I: Data set metadata C. Experimental Setup
Data Set Nodes Edges Avg. Density Timestamps As was done by [19], we conduct three different bench-
FB 663 23,394 0.00591 9
marking tests to compare E ULER to other temporal link pre-
COLAB 315 5,104 0.01284 10 diction methods: inductive dynamic link detection, transductive
Enron10 184 4,784 0.00514 11 dynamic link prediction, and transductive dynamic new link
prediction. Link detection and link prediction are implemented
as described in Section IV-C. On link detection tests the
embeddings as input to calculate likelihood scores, but our objective function is Equation 3 with n = 0. On (new) link
experiments found using inner product decoding was more prediction tests, n = 1.
effective. For link detection tests, we randomly remove 5 and 10%
of the edges from each snapshot for validation and testing
VGRNN [19]: a GCN stacked upon a GC-LSTM [21]. respectively. The reported results are evaluations of final 3
Their method, however, cannot be fit into the E ULER frame- snapshots, which are never used for training; however, a
work; this is because node embeddings from the GCNs are training set of edges from those latter timesteps are withheld
passed into the RNN, whose output is concatenated with node and not evaluated, to be used as inputs during the forward pass
features to be used as input for the next snapshot. This process at evaluation.
must happen in serial. The hidden state vectors from the RNN
are designed to predict the next state of the graph, in addition For link prediction tests, all edges in final three snapshots
to providing information for the GCN. Hence, for predictive are considered positive samples in the test set, with an equal
tasks, a non-linear transformation of the RNN output is used number of randomly sampled negative edges. As it is a
as the node embeddings rather than the direct output of the transductive test, during the forward pass all edges at time
GCN. t are ingested to predict the state at t + 1. However, 5% of
edges are held out for validation during decoding.
VGAE [32]: A simple two-layer variational graph autoen-
coder. This model does not consider time at all. Instead, it New link prediction is an extra evaluation of predictive
views every interaction as one large graph. We include this models. This test is identical to link prediction, but the set of
model to demonstrate the usefulness of the recurrent unit true positives only includes edges that were not in the snapshot
in other models. The VGAE acts as a baseline to compare immediately before:
inductive tests against; as it is a static model, it cannot be
used for dynamic (new) link prediction. {(u, v) | (u, v) ∈ Et+1 ∧ (u, v) 6∈ Et } (10)
the purpose of this test is to evaluate predictive models’ ability
We use three data sets for these tests: Facebook [43],
to anticipate new edges given what it has observed about the
Enron10 [44], and COLAB [45]. Table I contains more detailed
dynamics of the graph.
information about these data sets. None of these data sets
contain node features, so the identity matrix is used as the Model parameters are updated with an Adam opti-
initial feature input for all models. These graphs are all mizer [34] with a learning rate of either 0.03, 0.02, 0.01,
directed, and during evaluation self-loops are not considered. or 0.005 according to a parameter search. Models are trained
for 1,500 epochs with early stopping after 100 epochs of no
progress on the validation data. Results for the (SI-)VGRNN,
B. Evaluation Metrics the DynGraph2Vec models, and the VGAE are those reported
by [19]. We use the source code provided by [22] to evaluate
As we use a regression model to output probabilities, we the EGCN models.
use the following to measure its effectiveness: area under the
ROC curve (AUC) and average precision score (AP). The AUC D. Results
score is defined as the area under the curve created by plotting
Here, we report the results of the three tests with some
the true positive rate (TPR) and false positive rates (FPR) as
discussion. When comparing results from competing methods,
the threshold for classification changes [46].
ties occur when the difference between the means of two
The TPR and FPR are defined as models’ metrics, A and B, are not statistically significant. This
is determined via a two-tailed t-test with the null hypothesis
TP FP H0 that the two means are the same:
TPR = FPR = (7)
TP + FN FP + TN
0 − (µ(B) − µ(A)) µ(A) − µ(B)
Average precision for regression tasks is defined as t= q =p (11)
Var(B−A) σM (A)2 + σM (B)2
X N
AP = (Rτ − Rτ−1 )Pτ (8)
τ Where N denotes degrees of freedom, and σM (·) denotes
the standard error. If
where Pτ and Rτ denote the precision and recall scores at
threshold τ. Precision and recall are defined as p = 2 min{Pr[T ≤ t | H0 ], Pr[T ≥ t | H0 ]} > 0.05 (12)
TP TP the difference between the two means is deemed insignificant,
P= R= (9) and the two models are considered equivalent. For ease of
TP + FP TP + FN
8
TABLE II: Comparison of E ULER to related work on dynamic TABLE III: Comparison of E ULER to related work on dynamic
link detection link prediction
Metrics Methods Enron COLAB Facebook Metrics Methods Enron COLAB Facebook
VGAE 88.26 ± 1.33 70.49 ± 6.46 80.37 ± 0.12 DynAE 74.22 ± 0.74 63.14 ± 1.30 56.06 ± 0.29
DynAE 84.06 ± 3.30 66.83 ± 2.62 60.71 ± 1.05 DynRNN 86.41 ± 1.36 75.7 ± 1.09 73.18 ± 0.60
DynRNN 77.74 ± 5.31 68.01 ± 5.50 69.77 ± 2.01 DynAERNN 87.43 ± 1.19 76.06 ± 1.08 76.02 ± 0.88
DynAERNN 91.71 ± 0.94 77.38 ± 3.84 81.71 ± 1.51 EGCN-O 84.28 ± 0.87 78.63 ± 2.14 77.31 ± 0.58
EGCN-O 93.07 ± 0.77 90.77 ± 0.39 86.91 ± 0.51 AUC EGCN-H 88.29 ± 0.87 80.80 ± 0.95 75.88 ± 0.32
AUC EGCN-H 92.29 ± 0.66 87.47 ± 0.91 85.95 ± 0.95 VGRNN 93.10 ± 0.57 85.95 ± 0.49 89.47 ± 0.37
VGRNN 94.41 ± 0.73 88.67 ± 1.57 88.00 ± 0.57 SI-VGRNN 93.93 ± 1.03 85.45 ± 0.91 90.94 ± 0.37
SI-VGRNN 95.03 ± 1.07 89.15 ± 1.31 88.12 ± 0.83 E ULER 93.15 ± 0.42 86.54 ± 0.20 90.88 ± 0.12
E ULER 97.34 ± 0.41 91.89 ± 0.76 92.20 ± 0.56 DynAE 76.00 ± 0.77 64.02 ± 1.08 56.04 ± 0.37
VGAE 89.95 ± 1.45 73.08 ± 5.70 79.80 ± 0.22 DynRNN 85.61 ± 1.46 78.95 ± 1.55 75.88 ± 0.42
DynAE 86.30 ± 2.43 67.92 ± 2.43 60.83 ± 0.94 DynAERNN 89.37 ± 1.17 81.84 ± 0.89 78.55 ± 0.73
DynRNN 81.85 ± 4.44 73.12 ± 3.15 70.63 ± 1.75 EGCN-O 86.55 ± 1.57 81.43 ± 1.69 76.13 ± 0.52
DynAERNN 93.16 ± 0.88 83.02 ± 2.59 83.36 ± 1.83 AP EGCN-H 89.33 ± 1.25 83.87 ± 0.83 74.34 ± 0.53
EGCN-O 92.56 ± 0.99 91.41 ± 0.33 84.88 ± 0.52 VGRNN 93.29 ± 0.69 87.77 ± 0.79 89.04 ± 0.33
AP EGCN-H 92.56 ± 0.72 88.00 ± 0.85 82.56 ± 0.91 SI-VGRNN 94.44 ± 0.85 88.36 ± 0.73 90.19 ± 0.27
VGRNN 95.17 ± 0.41 89.74 ± 1.31 87.32 ± 0.60 E ULER 94.10 ± 0.32 89.03 ± 0.08 89.98 ± 0.19
SI-VGRNN 96.31 ± 0.72 89.90 ± 1.06 87.69 ± 0.92
E ULER 97.06 ± 0.48 92.85 ± 0.88 91.74 ± 0.71
9
C(u, v) − µ
it is only a small improvement, or about equal to state-of-the- E
W((u, v) ∈ E) = σ (13)
art models is adequate for our purposes. The real benefit of ΣE
building models within the E ULER framework is their ability
to scale. On larger data sets, this advantage is more evident. where σ(·) represents the sigmoid function, C(u, v) rep-
resents the frequency of an authentication between u and v
VI. L ATERAL M OVEMENT D ETECTION in the time window, and µE and ΣE represent the mean and
standard deviation of all edge frequencies in the time window.
In the previous section, all datasets tested were rather small. In this way, edges which occur very infrequently are given
It is not until a real-world application of the E ULER framework lower weight during training so as to appear less “normal” and
is tested that the true performance improvements are evident. edges which occur with high frequency, such as edges from
computers to domain controllers, or ticket granting servers
To demonstrate the impressive speedup achieved by this have high weight, and appear routine.
framework when compared to related work we evaluate several
E ULER models on the LANL 2015 Comprehensive Multi- The LANL dataset has no node features by default, how-
Source Cyber Security Events dataset [24]. The dataset consists ever some information can be gleaned from the naming con-
of 57 days of log files from five different sources within the vention used in the log files. Entities have unique, anonymized
Los Alamos National Laboratory’s internal corporate network identifiers that start with either a U or a C denoting users
as it underwent both normal activity and a redteam campaign. and computers respectively. There are also nodes with non-
Specific details about this dataset are reported in Table V. The anonymized names that have important roles in the system such
edge count in the table represents the number of weighted as TGT, the Kerberos key distribution center, DC, the domain
edges; multiple events between the same entities in the same controller, and so on. To leverage this additional data, we
time period may be compressed into a single weighted edge. concatenate a 1-hot vector denoting user, computer, or special
Because events from the authentication logs have been labeled administrative machine to each node’s one-hot ID vector.
as normal or anomalous, this data set has been widely used for
cyber security research [9], [12], [13], [15]. The labels make it For quicker file scanning, and data loading times, the
especially apt for lateral movement detection research. When full 69 GB auth.txt file is split into chunks, which each
an APT-level threat is attempting to traverse a system, one hold 10,000 seconds (approximately 3 hours) of logs. Worker
possible warning sign will be authentications that should not machines are issued instructions to read in certain ranges of the
normally occur, a sign indicative of lateral movement on a log files and build the temporal graphs. Workers accomplish
network level. this by spawning several child processes to load multiple
snapshots in parallel. The associated edge lists, and edge
weight lists from each child process are combined to form the
TABLE V: LANL Data Set Metadata final TGraph object, which holds a list of all edge lists, edge
weights, node features, and tensor masks to take partitions of
Nodes 17,685
Events 45,871,390
each edge list for training.
Anomalous Edges 518
Duration (Days) 58 For all experiments, the training set consists of all snap-
shots that occur before the first anomalous edge appears in the
authentication logs. This allows models to learn what normal
In this section, all distributed models were implemented activity looks like. From this set, we remove the final 5% of
with 4 worker nodes unless otherwise specified. All exper- snapshots for tuning the classifier, and mask 5% of edges from
iments are run on a server with two Intel Xeon E5-2683 v3 each snapshot for validation.
(2.00 GHz) CPUs, each of which has 14 cores with 28 threads,
and 512 GB of memory [47].
B. Experimental Setup
We will first present the utility of models following the
E ULER framework as an anomaly-based intrusion detection We test three encoders in conjunction with two recurrent
system on the LANL dataset, then an analysis of the immense neural networks as well as models with no recurrent layer to
scalability afforded by splitting models in this way. measure how much value temporal data adds to the overall
embeddings. The encoder models are GCN [36], GAT [48],
and GraphSAGE [49]. The recurrent models are GRU [37]
A. Graph Construction and LSTM [42]. The models are trained in the same manner
We construct a weighted, directed graph from the authen- as the link detection and link prediction models in Section V.
tication logs by mapping which entities authenticate with one However, experiments showed that once a local optimum was
another. As nodes, we use the entities denoted source and found and validation scores ceased improving, it rarely im-
destination computers in the LANL documentation. For all proved after further iterations. As such, early stopping occurs
authentications that occur from time t to t + δ, an edge is after only ten epochs of no improvement.
created between the source computer and destination computer. Experiments showed for every model that using smaller
If an edge already exists, a tally keeping track of the number of time windows always lead to better results. As such, we
authentications between the two machines is updated. Experi- only present the output of tests on temporal graphs with time
ments have shown that the most effective method to normalize window δ = 1800 seconds (30 minutes).
these tallies into usable edge weights is to take the logistic
sigmoid of the edges’ standardized values. Mathematically, it The GAT encoder uses 3 attention heads, which was found
can be represented as to be optimal via hyperparameter tuning. The SAGE encoder
10
uses maxpooling as its aggregation function, as this was found TABLE VI: Performance of E ULER Models on the LANL
to be optimal in their paper, and this makes it capable of Dataset when δ = 0.5
discerning between certain graphs GCN cannot [50].
Dynamic Link Detection
Unfortunately, many other works which experiment with Encoder RNN AUC AP TPR FPR P
the LANL dataset either do not use the data set in full, as is
GRU 0.9912 0.0523 86.10 0.5698 0.0054
the case with [12], [13] or conduct tests on portions of the data GCN LSTM 0.9913 0.0169 89.65 0.5723 0.0056
other than purely the authentication logs, as was done by [15], None 0.9916 0.0116 88.57 0.4798 0.0066
so it would not be fair or meaningful to compare our results GRU 0.9872 0.0307 84.71 0.6874 0.0044
to theirs. Prior work [9] does use the full authentication log SAGE LSTM 0.9887 0.0389 83.55 0.6591 0.0045
as its dataset, however it trains on a larger set of data, using None 0.8652 0.0052 79.58 24.5669 0.0001
all days that contained no anomalous activity as the training GRU 0.9094 0.0076 85.21 21.533 0.0001
GAT LSTM 0.8713 0.0022 96.83 19.873 0.0002
set, rather than just the days before the attack campaign. We None 0.9867 0.0079 99.88 23.174 0.0002
include their results, nonetheless. We also include the TPR and
UA – – 72.00 4.400 0.0010
FPR of a rules-based “Unknown Authentication” (UA) model GL-LV [9] – – 67.00 1.200 0.0034
reported by [9]. This rule simply marks any edge that did not GL-GV [9] – – 85.00 0.900 0.0051
exist in the training data as anomalous. VGRNN 0.9315 0.0000 59.69 4.938 0.0000
11
filtering false positives. Where one authentication may appear
anomalous in isolation, when viewed in the context of previous
authentications, it can be correctly identified as benign.
The GL-LV and GL-GV method reported in [9] does
not consider time at all; the network is viewed as a static
graph. Here, we again see the benefit of using a sequence
encoder in conjunction with a pure topological embedder.
The best E ULER methods outperform their random walk-based
approach in terms of both TPR and FPR. Also worth noting
is that because our model uses temporal graphs, the alerts
from E ULER-based models come with a time stamp, making
them more informative and valuable in a real-world scenario.
The prior work ranks any duplicate edges, regardless of their
temporal context, as equally anomalous. Fig. 4: Change in AUC score as δ increases for link prediction
and detection with GCN+GRU models. Scores are the average
Like E ULER, VGRNN combines a sequence encoder with of five tests on the LANL dataset.
a temporal one. They claim that by using temporal information
as input during the topological encoding phase, complex
temporal dependencies are better encoded than without it.
However, this method performs no better than the purely With more edges, and more temporal granularity, models are
statistical UA method. Even still, the false positive rate is more capable of learning useful patterns across time. Figure 4
excellent in the predictive test, outperforming every E ULER illustrates this phenomenon.
model. It is worth noting however, the quality of likelihood
scores is very poor with this method, which implies that had As is evident from the figure, small window sizes are
the threshold for the E ULER models been set lower their FPR more optimal for quality of scores; however, due to the longer
would be lower with equivalent TPRs. This is readily apparent processing time required, and the relatively small performance
in the GCN-based models where the FPR is only a few tenths improvements as window size decreases after a certain point, it
of a percent larger than the dynamic VGRNN, but their TPRs may be adequate to leave the snapshot duration a little higher
are almost 10% higher. than is optimal for faster training and evaluation. Additionally,
we observe that changes in δ seem to affect the link prediction
Finally, we must concede that while the E ULER models model at a higher rate than the link detection one. We suspect
do outperform prior works, their FPRs are still too high to be this is caused by the model’s inability to predict short term
useful as an intrusion detection system on their own. Some of temporal patterns like the one described in section III as δ
this can be attributed to the data set itself; labeled anomalous grows, and the predictive model is especially apt to detect this
events are very coarse-grained. There are likely many events type of anomaly.
the compromised entities engaged in that should be considered
anomalous, and may have even been detected by our models, Nonetheless, in both cases, more granular graphs lead to
but that are treated as false positives due to the lack of more informative edge scores. We speculate that at some point,
fine-grained label information. Indeed, the redlog only tracks having time slices too granular would have diminishing returns
“compromise events” and not the further malicious activity that as graphs will have too few edges to be useful. But due to the
ensues [24]. severe training time as δ decreases, we have never managed
to reach this point. We leave finding this boundary as an area
However, even if this is not fully the case, this method has for future work.
great potential as a filtering device for further analysis tools.
The low cost of processing time, that we will demonstrate
later on, makes this an efficient way to minimize the number E. Scalability Analysis
of interactions that need to be analyzed by a signature-based The main benefit of using models which fit into the
technique for example. As part of future work, we plan to E ULER framework is their scalability. While evaluation metrics
work with analysts to answer the question of whether this on benchmarks were generally better than prior work, there
approach is best used as a step in a longer pipeline, or on were certainly some categories where the advanced models
its own. However, we have demonstrated that compared to all are comparable our simple ones. However, as we will show,
other anomaly-based works which analyze this data set, E ULER distributed topological encoding has tremendous performance
models are the most effective. benefits.
12
(a) Forward propagation time (b) Backward propagation time
Fig. 5: Performance comparison between the distributed E ULER model and competing methods on varying numbers of snapshots
from the LANL dataset. All time windows, δ are 0.5 hours.
13
of node neighbors as features and processed them using [6] S. M. Milajerdi, R. Gjomemo, B. Eshete, R. Sekar, and V. Venkatakrish-
MLPs [41], [52], [53], [54], [55]. However, as we and others nan, “Holmes: real-time apt detection through correlation of suspicious
have shown, these methods are not as powerful as those information flows,” in 2019 IEEE Symposium on Security and Privacy
(SP). IEEE, 2019, pp. 1137–1152.
which leverage GNNs. Many models that incorporate GNNs
[7] S. Wang and S. Y. Philip, “Heterogeneous graph matching networks:
for temporal graph embedding often do so by replacing the Application to unknown malware detection,” pp. 5401–5408, 2019.
linear layers in RNNs with GNNs [20], [21]; other existing [8] R. Wei, L. Cai, A. Yu, and D. Meng, “Deephunter: A graph neural
approaches such as VGRNN [19] and E-GCN [22] use highly network based approach for robust cyber threat hunting,” arXiv preprint
engineered models, in addition to graph neural network RNNs, arXiv:2104.09806, 2021.
and must be run in serial. [9] B. Bowman, C. Laprade, Y. Ji, and H. H. Huang, “Detecting lateral
movement in enterprise computer networks with unsupervised graph
{AI},” in 23rd International Symposium on Research in Attacks, Intru-
VIII. C ONCLUSION sions and Defenses ({RAID} 2020), 2020, pp. 257–268.
In this work, we presented the E ULER framework: a [10] Y. Mirsky, T. Doitshman, Y. Elovici, and A. Shabtai, “Kitsune: An
method to exploit the previously untapped potential for dis- ensemble of autoencoders for online network intrusion detection,”
machine learning, vol. 5, p. 2.
tributing the work done to train and execute temporal graph
link predictors. When each topological encoder can operate [11] I. H. Sarker, “Cyberlearning: Effectiveness analysis of machine learning
security modeling to detect cyber-anomalies and multi-attacks,” Internet
independently, the most compute-heavy task of generating of Things, p. 100393, 2021.
node embeddings can be scaled in a highly efficient man- [12] N. Heard and P. Rubin-Delanchy, “Network-wide anomaly detection
ner. Though models following this framework are necessarily via the dirichlet process,” in 2016 IEEE Conference on Intelligence
simple, we have shown that for anomalous link detection and and Security Informatics (ISI). IEEE, 2016, pp. 220–224.
prediction, models following the E ULER framework perform [13] M. Whitehouse, M. Evangelou, and N. M. Adams, “Activity-based
as well, or better than their complex contemporaries. Finally, temporal anomaly detection in enterprise-cyber security,” in 2016 IEEE
we showed how this framework can be used to train highly pre- Conference on Intelligence and Security Informatics (ISI). IEEE, 2016,
pp. 248–250.
cise anomaly-based intrusion detection systems when network
[14] F. Farahnakian and J. Heikkonen, “A deep auto-encoder based approach
activity is viewed through the abstraction of a temporal graph. for intrusion detection system,” in 2018 20th International Conference
These intrusion detection systems are scalable, and more sound on Advanced Communication Technology (ICACT). IEEE, 2018, pp.
than other unsupervised techniques despite being trained with 178–183.
less data. [15] T. Bai, H. Bian, A. Abou Daya, M. A. Salahuddin, N. Limam, and
R. Boutaba, “A machine learning approach for rdp-based lateral move-
Future work may include testing other topological or tem- ment detection,” in 2019 IEEE 44th Conference on Local Computer
poral encoders that we did not. Further improvements could Networks (LCN). IEEE, 2019, pp. 242–245.
be made to the classifier by testing edge decoding techniques [16] “Use alternate authentication material: Pass the ticket,” May 2017.
more advanced than the simple inner product logistic regres- [Online]. Available: https://attack.mitre.org/techniques/T1550/003/
sion we use. As this framework allows for scalable message [17] “Valid accounts,” May 2017. [Online]. Available: https://attack.mitre.
passing graph neural networks, future work could even include org/techniques/T1078/
testing this technique on any large temporal graph data set [18] A. L. Buczak and E. Guven, “A survey of data mining and machine
learning methods for cyber security intrusion detection,” IEEE Commu-
previously thought intractable for GNNs. nications surveys & tutorials, vol. 18, no. 2, pp. 1153–1176, 2015.
[19] E. Hajiramezanali, A. Hasanzadeh, K. Narayanan, N. Duffield,
ACKNOWLEDGMENT M. Zhou, and X. Qian, “Variational graph recurrent neural networks,”
in Advances in Neural Information Processing Systems, H. Wallach,
The authors thank the anonymous reviewers for their help- H. Larochelle, A. Beygelzimer, F. d' Alché-Buc, E. Fox, and
ful suggestions. This work was supported in part by DARPA R. Garnett, Eds., vol. 32. Curran Associates, Inc., 2019, pp. 10 701–
under agreement number N66001-18-C-4033 and National 10 711. [Online]. Available: https://proceedings.neurips.cc/paper/2019/
Science Foundation grants 1618706, 1717774, and 2127207. file/a6b8deb7798e7532ade2a8934477d3ce-Paper.pdf
The views, opinions, and/or findings expressed in this material [20] Y. Seo, M. Defferrard, P. Vandergheynst, and X. Bresson, “Structured
sequence modeling with graph convolutional recurrent networks,” in
are those of the authors and should not be interpreted as International Conference on Neural Information Processing. Springer,
representing the official views or policies of the Department 2018, pp. 362–373.
of Defense, National Science Foundation, or the U.S. Govern- [21] J. Chen, X. Xu, Y. Wu, and H. Zheng, “Gc-lstm: Graph convo-
ment. lution embedded lstm for dynamic link prediction,” arXiv preprint
arXiv:1812.04206, 2018.
R EFERENCES [22] A. Pareja, G. Domeniconi, J. Chen, T. Ma, T. Suzumura, H. Kanezashi,
T. Kaler, T. Schardl, and C. Leiserson, “Evolvegcn: Evolving graph
[1] B. E. Strom, A. Applebaum, D. P. Miller, K. C. Nickels, A. G. convolutional networks for dynamic graphs,” in Proceedings of the AAAI
Pennington, and C. B. Thomas, “Mitre att&ck: Design and philosophy,” Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 5363–
Technical report, 2018. 5370.
[2] “Lateral movement, tactic TA0008,” Oct 2018. [Online]. Available: [23] G. M. Amdahl, “Validity of the single processor approach to achieving
https://attack.mitre.org/tactics/TA0008/ large scale computing capabilities,” in Proceedings of the April 18-20,
[3] S. T. King and P. M. Chen, “Backtracking intrusions,” in Proceedings of 1967, spring joint computer conference, 1967, pp. 483–485.
the nineteenth ACM symposium on Operating systems principles, 2003, [24] A. D. Kent, “Comprehensive, multi-source cyber-security events data
pp. 223–236. set,” Los Alamos National Lab.(LANL), Los Alamos, NM (United
[4] B. Caswell and J. Beale, Snort 2.1 intrusion detection. Elsevier, 2004. States), Tech. Rep., 2015.
[5] S. M. Milajerdi, B. Eshete, R. Gjomemo, and V. Venkatakrishnan, [25] S. Bhatia, B. Hooi, M. Yoon, K. Shin, and C. Faloutsos, “Midas:
“Poirot: Aligning attack behavior with kernel audit records for cyber Microcluster-based detector of anomalies in edge streams,” in Proceed-
threat hunting,” in Proceedings of the 2019 ACM SIGSAC Conference ings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04,
on Computer and Communications Security, 2019, pp. 1795–1812. 2020, pp. 3242–3249.
14
[26] L. Dhanabal and S. Shantharajah, “A study on nsl-kdd dataset for [49] W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive representation
intrusion detection system based on classification algorithms,” Inter- learning on large graphs,” arXiv preprint arXiv:1706.02216, 2017.
national journal of advanced research in computer and communication [50] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph
engineering, vol. 4, no. 6, pp. 446–452, 2015. neural networks?” arXiv preprint arXiv:1810.00826, 2018.
[27] A. Alazab, M. Hobbs, J. Abawajy, and M. Alazab, “Using feature selec- [51] A. Divakaran and A. Mohan, “Temporal link prediction: a survey,” New
tion for intrusion detection system,” in 2012 international symposium Generation Computing, pp. 1–46, 2019.
on communications and information technologies (ISCIT). IEEE, 2012,
[52] L. Zhou, Y. Yang, X. Ren, F. Wu, and Y. Zhuang, “Dynamic network
pp. 296–301.
embedding by modeling triadic closure process,” in Proceedings of the
[28] R. Lohiya and A. Thakkar, “Intrusion detection using deep neural AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
network with antirectifier layer,” in Applied Soft Computing and Com-
[53] M. Rahman, T. K. Saha, M. A. Hasan, K. S. Xu, and C. K. Reddy,
munication Networks. Springer, 2021, pp. 89–105.
“Dylink2vec: Effective feature representation for link prediction in
[29] “Kdd cup 1999,” Oct 1999. [Online]. Available: http://kdd.ics.uci.edu/ dynamic networks,” arXiv preprint arXiv:1804.05755, 2018.
databases/kddcup99/kddcup99.html
[54] P. Goyal, N. Kamra, X. He, and Y. Liu, “Dyngem: Deep embedding
[30] G. Mohi-ud din, “Nsl-kdd,” 2018. [Online]. Available: https: method for dynamic graphs,” arXiv preprint arXiv:1805.11273, 2018.
//dx.doi.org/10.21227/425a-3e55
[55] T. Li, J. Zhang, S. Y. Philip, Y. Zhang, and Y. Yan, “Deep dynamic
[31] S. Li, Y. Zhao, R. Varma, O. Salpekar, P. Noordhuis, T. Li, A. Paszke, network embedding for link prediction,” IEEE Access, vol. 6, pp.
J. Smith, B. Vaughan, P. Damania et al., “Pytorch distributed: 29 219–29 230, 2018.
Experiences on accelerating data parallel training,” arXiv preprint
arXiv:2006.15704, 2020.
[32] T. N. Kipf and M. Welling, “Variational graph auto-encoders,” arXiv
preprint arXiv:1611.07308, 2016.
[33] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for
networks,” in Proceedings of the 22nd ACM SIGKDD international
conference on Knowledge discovery and data mining, 2016, pp. 855–
864.
[34] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
arXiv preprint arXiv:1412.6980, 2014.
[35] L. Zhao, Y. Song, C. Zhang, Y. Liu, P. Wang, T. Lin, M. Deng,
and H. Li, “T-gcn: A temporal graph convolutional network for traffic
prediction,” IEEE Transactions on Intelligent Transportation Systems,
vol. 21, no. 9, pp. 3848–3858, 2019.
[36] T. N. Kipf and M. Welling, “Semi-supervised classification with graph
convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
[37] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, “On the
properties of neural machine translation: Encoder-decoder approaches,”
arXiv preprint arXiv:1409.1259, 2014.
[38] O. Shchur, M. Mumme, A. Bojchevski, and S. Günnemann, “Pitfalls
of graph neural network evaluation,” arXiv preprint arXiv:1811.05868,
2018.
[39] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of
gated recurrent neural networks on sequence modeling,” arXiv preprint
arXiv:1412.3555, 2014.
[40] Y. Rong, W. Huang, T. Xu, and J. Huang, “Dropedge: Towards deep
graph convolutional networks on node classification,” arXiv preprint
arXiv:1907.10903, 2019.
[41] P. Goyal, S. R. Chhetri, and A. Canedo, “dyngraph2vec: Captur-
ing network dynamics using dynamic graph representation learning,”
Knowledge-Based Systems, vol. 187, p. 104816, 2020.
[42] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[43] B. Viswanath, A. Mislove, M. Cha, and K. P. Gummadi, “On the
evolution of user interaction in facebook,” in Proceedings of the 2nd
ACM workshop on Online social networks, 2009, pp. 37–42.
[44] C. E. Priebe, J. M. Conroy, D. J. Marchette, and Y. Park, “Scan statistics
on enron graphs,” Computational & Mathematical Organization Theory,
vol. 11, no. 3, pp. 229–247, 2005.
[45] M. Rahman and M. Al Hasan, “Link prediction in dynamic networks
using graphlet,” in Joint European Conference on Machine Learning
and Knowledge Discovery in Databases. Springer, 2016, pp. 394–
409.
[46] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to data mining.
Pearson Education India, 2016, ch. 5.
[47] “Intel xeon processor e5-2683 v3 (35m cache,
2.00 ghz) product specifications,” 2014. [Online].
Available: https://ark.intel.com/content/www/us/en/ark/products/81055/
intel-xeon-processor-e5-2683-v3-35m-cache-2-00-ghz.html
[48] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Ben-
gio, “Graph attention networks,” arXiv preprint arXiv:1710.10903,
2017.
15
TABLE VII: The E ULER Interface
Encoding Layer
forward(partition: int) -> Tensor Takes a subset of edges hosted on the worker machine denoted by the partition enum which maps to a mask
tensor held on the worker, and passes them through a model; for the purposes of this paper it is assumed to be
a GNN. It returns a Tw × |V| × h or |V| × Tw × h tensor, depending on how one wishes to implement the
RNN, where Tw is the number of snapshots held on worker w, and h is a user specified hidden dimension.
calc_loss(zs: Tensor, partition: Samples some subset of edges held by the worker, and uses the embeddings, zs to calculate loss. In general, we
int, nratio: float) -> Tensor use binary cross entropy loss (BCE) on the training set of known edges, and a random sample of |E| × nratio
non-edges at each training step. Returns a 1 × 1 tensor of the total loss generated from the edges held in the
worker
score_edges(zs: Tensor) -> Returns the likelihood score of each edge held by the worker given the input tensor of node embeddings. This
Sequence[Tensor] function is used for evaluation to detect anomalous connections, and to calculate AUC scores. It returns a Tw
length list of |Et | × 1 tensors containing likelihood scores.
load_new_data(fn: Callable[..., A function to load new data into a worker, called by the constructor by default. The loader should have a return
TGraph], *args, **kwargs) -> None type TGraph, a special class to hold temporal graphs.
Recurrent Layer
forward(partition: int) -> Tensor Issues a command to each worker to run their forward method, and waits for their response. As responses come
in, in order, it passes output from workers into an RNN. Returns the T × |V| × d or |V| × T × d tensor,
depending on how one wishes to implement the RNN, where T is the total number of snapshots held across all
workers and d is the user specified embedding dimension.
calc_loss(zs: Tensor, partition: Splits the tensor of embeddings zs into contiguous slices such that z[t] is the parameter for the likelihood of
int, nratio: float) -> Tensor edges in snapshot t in the worker it is sent to. It then passes those embeddings to the workers and issues a call
for them to calculate loss on them. When the workers have finished calculating loss, the leader aggregates the
totals and returns a 1 × 1 tensor of total loss across all workers.
score_edges(zs: Tensor) -> Splits the tensor of embeddings zs into contiguous slices in the same manner as calc_loss, then passes those
Sequence[Tensor] embeddings to the workers, and issues a call for them to return the likelihood score of each edge they hold. It
aggregates each list of edge likelihoods returned by the workers, and returns a T length list of |Et | × 1 tensors
of likelihood scores.
A PPENDIX A
I MPLEMENTATION D ETAILS
Here we provide the technical details of the E ULER frame-
work. Every required method for the leader and follower
interfaces is detailed in Table VII.
16