Telecom Fraud Detection with Machine Learning
KTH Thesis Report
Chao Xiong
Computer Science
KTH Royal Institute of Technology
Examiner
Prof. Panos Papadimitratos
NSS Group
KTH Royal Institute of Technology
Supervisor at KTH
Zahra Alimadadi
NSS Group
KTH Royal Institute of Technology
Supervisors at Sinch
Abstract
International Revenue Sharing Fraud (IRSF) is one of the most persistent types of
fraud within the telecommunications industry. According to the 2017 Communications
Fraud Control Association (CFCA) fraud loss survey, IRSF costs 6 billion dollars a
year. Therefore, the detection of such frauds is of vital importance to avoid further
loss. Though many efforts have been made, very few utilize the temporal patterns of
phone call traffic. This project, supported by Sinch's real production data, aims to
exploit both spatial and temporal patterns, learned by a Graph Attention Neural Network
(GAT) combined with a Gated Recurrent Unit (GRU), to find suspicious timestamps in the
historical traffic. Moreover, combined with the time-independent Isolation Forest model,
our approach gives better results on the phone call records. This report first explains the
mechanism of IRSF in detail and introduces the models applied in this project, including
GAT, GRU, and Isolation Forest. Finally, it presents how our experiments were conducted
and reports the results with extensive analysis. We achieve 42.4% precision and 96.1%
recall on the test data provided by Sinch, showing significant advantages over both
previous work and baselines.
Keywords
Contents

1 Introduction
   1.1 Problem Background
   1.2 Related Work
       1.2.1 Signature-based Detection Methods
       1.2.2 Anomaly-based Detection Methods
       1.2.3 Temporal Methods
   1.3 Research Questions and Contributions
2 Theoretical Background
   2.1 Isolation Forest
   2.2 RNN and GRU
       2.2.1 RNN
       2.2.2 GRU
   2.3 Encoder-decoder Architecture
   2.4 Variational Auto-encoder
   2.5 GAT (Graph Attention Neural Network)
   2.6 Evaluation Metrics
3 Proposed Methods
   3.1 Data Description
   3.2 Feature Engineering for the Isolation Forest
   3.3 Temporal Model Design
       3.3.1 Feature Extractor
       3.3.2 Auto-encoder and joint optimization
       3.3.3 Inference
4 Evaluation
   4.1 Experiment
       4.1.1 Preprocess Details
       4.1.2 Setup
   4.2 Results And Analysis
       4.2.1 Results of T1
       4.2.2 Results of T2
5 Discussion
   5.1 Other Trials
   5.2 Limitations
   5.3 Future Work
   5.4 Conclusion
References
Chapter 1
Introduction
This chapter first briefly introduces IRSF and its mechanisms, and explains why this topic
is worth studying. Following that, the related work and approaches that have
inspired this project are presented.
Telephony is one of the world’s oldest and biggest networks. Telephony networks, as
a result of their long history and widespread use, incorporate a variety of technologies
and involve a diverse range of service providers and consumers, making them
complicated and opaque.
Every day, telephony networks handle a massive amount of phone call traffic. Its
intrinsic complexity, which involves many providers and suppliers located in different
countries, makes estimating the cost of each call more challenging. This intricacy has
been exacerbated by the emergence of VoIP technology that connects the Internet and
traditional telephony, allowing telephony networks to be used as a lucrative setting
for various fraud schemes. The fraud we are dealing with in this project is called
International Revenue Sharing Fraud (IRSF). In this fraud scheme, fraudsters often
utilize illegal (e.g., stolen SIM cards) or legal (e.g., verification service) resources to gain
access to an operator’s network in order to bring traffic to phone numbers obtained
from an International Premium Rate Number (IPRN) provider. These numbers charge
telecom companies higher fees because they were originally designed for billing
phone-based services. For instance, in the 90s, people would call for
weather forecasting services at higher prices; in this case, customers do not need to pay
any extra money since the fees are included in their phone bills. However, this feature
has become a critical point of abuse by fraudsters in IRSF.
There are mainly three parties involved in the IRSF scheme for Sinch:
• IPRN providers: The owners of IPRNs, who obtain profits from the higher fees
they charge.
An example of an IRSF case is depicted in Figure 1.1.1. For a normal case (the path
from users to normal calls marked in blue), the telecom company's clients (the enterprises
on the left of the figure) receive phone numbers from their own users and make calls
to these numbers via the telecom company's services. During this process, the telecom
company and its customers have to pay for transferring the data, but the price is
often acceptable because normal numbers do not incur such high fees.
However, for a fraud case (the red path starting from fraudsters to IRSF in the Figure
1.1.1), the telecom company’s customers will call IPRNs given by fraudsters via the
telecom company. If successful calls to those IPRNs are made, both customers and the
telecom company will suffer high expenses for these calls, while fraudsters and IPRN
providers will share the extra profits. Fraudsters tend to make as many phone calls as
possible in a short period of time to maximize profits because the numbers they use can
easily be banned from the telecom company's side. In order to mitigate this loss, the
telecom company should have mechanisms in place that can detect these suspicious
calls in its network, so that it can send warnings to its customers and limit calls with similar
patterns to avoid further loss.
Above is the most common fraud case for telecom companies, where the termination side is
the center of the scheme. There exists another type of fraud introduced by Sahin et al. [27],
where instead of directly calling IPRNs, fraudsters can also short-stop the calls by co-
operating with transit operators and misroute the traffic to the destination they
desire. For example, routing international calls from the originating operator to the
destination operator is usually the responsibility of multiple transit operators. Each
operator on the call route determines the next hop of the call based on its own protocol
and routing algorithm. However, operators along the route have only partial visibility of the
end-to-end path, and the protocols and algorithms they use are usually invisible to the others.
Therefore, operators co-operating with fraudsters can reroute traffic to the target
IPRNs and carry out such frauds, and the original destinations of those rerouted calls
can never be reached, which violates availability. This is another scheme, in which the
transit operators are involved.
General fraud detection is a large and intensively studied field in which many approaches exist.
The simplest signature-based approach blocks calls to frequent IRSF destinations
or number ranges. However, the interference is extensive, because normal calls to the
suspicious country become unreachable as well. Fraudsters can also easily bypass
this approach by hijacking calls during routing.
Crowdsourced Blocklists
This approach is similar to the previous naive one, but it uses pre-defined lists to
block calls. Instead of number ranges, the lists contain specific numbers, so the
interference is smaller than in the previous approach, producing fewer false positives.
Such lists are usually maintained by organizations like the GSMA (Global System for
Mobile Association) and the Communications Fraud Control Association (CFCA) and shared
with their members. However, according to the study in [27], the number space being
abused for IRSF is quite large, so it seems difficult to cover all the numbers with
such lists. In addition, it is always easy for fraudsters to rotate to newly created
IPRNs, which makes such lists outdated.
This approach tries to detect IRSF in advance. IPRN providers often provide
test numbers so that fraudsters can check, before the actual fraud, whether their methods
can generate traffic to the providers' networks. Once test calls are detected, the
company (either the telecom company or the callees) can block calls in the following short
period, or calls whose numbers are similar to those of the test calls. This approach turns the
problem into the detection of test calls, and it shares the same weaknesses as the previous
two.
From the description above, we can see that signature-based methods heavily rely on
updating an IRSF database, which makes signature-based systems vulnerable when
facing unknown attacks. To overcome this vulnerability, anomaly-based methods have
come into use. These types of methods aim to detect fraud by learning normal actions or
patterns from historical records, so that such systems classify unknown actions
as anomalies.
The biggest challenge for anomaly detection is the imbalanced data and the diversity
of anomalies, which means the training is often unsupervised. Specifically, in the
IRSF scenario, the lack of labels adds difficulty to the evaluation. So, the main
problem to solve for anomaly-based IRSF detection is the lack of ground truth. The
following anomaly-based approaches use different methods to create the ground truth
for evaluation and provide different solutions for the detection task.
More recently, Meijaard et al. [26] used Isolation Forest to detect fraud and evaluated
the model based on the results of other existing anti-fraud solutions; unfortunately,
the authors did not mention what those solutions are. Meijaard's work also explores
the effectiveness of using only pre-call features, which are features obtained before
a call is answered, for the fraud detection task.
The above two are unsupervised methods, which involve very little prior human
knowledge. Thus, these two models tend to produce too many false positives, as pointed
out by Sahin et al. [27]. To improve on this, Sahin et al. [27] manually label
a batch of data and use the Random Forest algorithm, treating the
detection as an imbalanced supervised task. The evaluation is based on the labels
they created themselves. In particular, they utilize the gathered test IPRNs to identify
potential fraud instances.
In conclusion, the recent unsupervised methods suffer from producing too many false
positives. To reduce false positives in an unsupervised manner, this project introduces
the use of temporal methods for more accurate detection.
Based on the fact that fraudsters tend to make as many calls as possible in a small
amount of time, we expect to observe unusual peaks like those marked with red
rectangles in Figure 1.2.1. Inspired by this observation, this project introduces temporal
methods into the model. The work listed below motivated the design of the
model from the temporal perspective.
Zhao et al. [38] use data from sensors in Los Angeles to predict traffic in the city. This paper
treats every sensor as a node in a graph and uses a Graph Convolution Network to
extract spatial information from the sensors, then passes the extracted features to a GRU
unit to incorporate the temporal information. Their experiments demonstrate that their
model can capture the spatio-temporal correlations in traffic data, and its predictions
outperform state-of-the-art baselines on real-world traffic datasets.
Zhao et al. [37] aim to detect time-series anomalies on real-world datasets. The main
contribution of this work is that the authors propose two GAT layers in parallel to capture
the correlations among multiple features and among timestamps, respectively.
T2: To classify whether records within those time spans produced by T1 are fraud.
The final goal remains T2; our work adds an intermediate step. Moreover,
this intermediate task T1 helps experts reduce their search space for fraud, lowering
the workload of manual detection.
Q2 Is the temporal module able to sort out suspicious time slots based on historical
data?
classical anomaly detection model named Isolation Forest and investigate the impact
of different experiment settings. Specifically, the main contributions of the project
are:
The next chapter introduces the theoretical basics of this project, and the details of our
approach are explained in Chapter 3.
Chapter 2
Theoretical Background
This chapter introduces the general techniques and theories involved in this project.
2.1 Isolation Forest

Isolation Forest (IF) deals with continuous structured data, aiming to detect outliers using isolation (how
far a sample is from the rest of the data). From a statistical perspective, data samples
have a low probability of lying in areas where points are sparsely distributed. In other
words, IF detects anomalies satisfying the following requirements:
Algorithm 1 iForest(X, n, Ψ)
Input: X: input data, n: number of trees, Ψ: sub-sampling size
Output: a forest of n iTrees
1: Forest.init()
2: set the height limit l = ⌈log2 Ψ⌉
3: for i = 1 to n do
4:     X′ = Ψ data points sampled from X
5:     Forest = Forest ∪ iTree(X′, 0, l)
6: return Forest

Algorithm 2 iTree(X, e, l)
Input: X: input data, e: current height of the tree, l: height limit
Output: an iTree
1: if e ≥ l or |X| ≤ 1 then
2:     return LeafNode{Size = |X|}
3: Randomly select an attribute q from all attributes of X
4: Randomly select a split point p from (q_min, q_max)
5: X_l = X_{q<p}
6: X_r = X − X_l
7: return Node{Left = iTree(X_l, e + 1, l), Right = iTree(X_r, e + 1, l), attr = q, value = p}

Algorithm 3 PathLength(x, T, e)
Input: x: a data point, T: an iTree, e: current path length, initialized to 0
Output: h(x): path length of x
1: if T is a leaf then
2:     return e + c(T.Size)
3: a = T.attr
4: if x_a < T.value then
5:     return PathLength(x, T.Left, e + 1)
6: else
7:     return PathLength(x, T.Right, e + 1)
Algorithm 1 and Algorithm 2 are used for the training stage. In particular, Algorithm 2
constructs a tree from a given subset of the training data, and Algorithm 1 combines
the constructed trees into a forest as the final ensemble model.
For the evaluation stage, the path length of a sample x is computed with Algorithm 3, where c(m)
denotes the average path length of an unsuccessful search in a binary search tree with m samples:

c(m) = 2H(m − 1) − 2(m − 1)/m,   with   H(n) = ln(n) + γ

where γ is the Euler–Mascheroni constant. With the path length, the anomaly score s
of an instance x in an n-sample dataset is defined as:

s(x, n) = 2^{−E(h(x))/c(n)}

where E(h(x)) denotes the expectation of h(x) over the forest. A high score indicates
that few splits are required to separate the point from the rest and, therefore, it
is considered an outlier. Conversely, with a low score, many splits are required to
separate the data point and, therefore, it is considered similar to the other data points.
Because the sampling process, the selection of attributes, and the choice of splitting
points are all random, the anomaly score from a single tree is unreliable.
Thus, averaging over the forest is needed to obtain stable, low-variance
results.
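To make Algorithms 1–3 and the scoring formula above concrete, below is a minimal Python sketch of an Isolation Forest. It is a simplified illustration (for instance, it uses the ln-based approximation of H everywhere), not the implementation used in this project.

```python
import math
import random

def c(m):
    """Average path length of an unsuccessful BST search over m samples."""
    if m <= 1:
        return 0.0
    H = math.log(m - 1) + 0.5772156649  # H(n) ≈ ln(n) + γ (Euler–Mascheroni constant)
    return 2 * H - 2 * (m - 1) / m

def itree(X, e, l):
    """Algorithm 2: recursively split on a random attribute and split point."""
    if e >= l or len(X) <= 1:
        return {"leaf": True, "size": len(X)}
    q = random.randrange(len(X[0]))                       # random attribute
    lo, hi = min(x[q] for x in X), max(x[q] for x in X)
    if lo == hi:                                          # cannot split further
        return {"leaf": True, "size": len(X)}
    p = random.uniform(lo, hi)                            # random split point
    Xl = [x for x in X if x[q] < p]
    Xr = [x for x in X if x[q] >= p]
    return {"leaf": False, "attr": q, "value": p,
            "left": itree(Xl, e + 1, l), "right": itree(Xr, e + 1, l)}

def iforest(X, n_trees=100, psi=256):
    """Algorithm 1: build n_trees iTrees on sub-samples of size psi."""
    l = math.ceil(math.log2(psi))
    return [itree(random.sample(X, min(psi, len(X))), 0, l) for _ in range(n_trees)]

def path_length(x, T, e=0):
    """Algorithm 3: path length h(x) of a point x in a single tree."""
    if T["leaf"]:
        return e + c(T["size"])
    branch = "left" if x[T["attr"]] < T["value"] else "right"
    return path_length(x, T[branch], e + 1)

def anomaly_score(x, forest, psi=256):
    """s(x) = 2^(-E(h(x)) / c(psi)); values close to 1 indicate outliers."""
    Eh = sum(path_length(x, T) for T in forest) / len(forest)
    return 2 ** (-Eh / c(psi))

# toy usage: 2-D points with one obvious outlier
data = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (0.1, 0.2), (5.0, 5.0)]
forest = iforest(data, n_trees=50, psi=4)
print([round(anomaly_score(x, forest, psi=4), 2) for x in data])
```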
2.2 RNN and GRU

Gated Recurrent Unit (GRU) [7] is a variation of Recurrent Neural Networks (RNN).
Unlike feed-forward networks and the Multi-Layer Perceptron (MLP), whose outputs only
depend on the present inputs, the outputs of RNN models also depend on the state of
the previous iteration. This characteristic enables RNNs to model relationships within a
sequence. However, naive RNNs have difficulty dealing with long dependencies in a
sequence because of the nature of the Back Propagation Through Time (BPTT) algorithm:
the short-range gradients dominate the total gradient, which means information from
early timesteps is forgotten during training. So, Long Short-Term Memory (LSTM) and
GRU have come into use to alleviate this problem, introducing gate mechanisms to
memorize long-term information. This chapter only covers GRU, as the project only uses GRU.
2.2.1 RNN
Figure 2.2.1 is an illustration of an RNN unrolled over time, where x_t is the input at time
t, h_t is the hidden state at t, and W, U, and V are the weights for the input x, the previous
hidden state h, and the output o, respectively. (Bold symbols represent vectors or tensors in this
thesis.) Notice that W, U, and V are shared across all x_t and h_t.
L_t represents the loss function at t. The RNN model performs the following
computation at every timestep, where σ is an arbitrary activation function such as ReLU:

h_t = σ(W x_t + U h_{t−1} + b_h)
o_t = σ(V h_t + b_o)

where b_h and b_o are bias terms. These formulas show that the present hidden state h_t is
determined by the present input x_t and the previous hidden state h_{t−1}, while the present
output o_t only depends on h_t. When using BPTT for optimization, the gradient for
the weights W can be written as:
∂L_t/∂W = Σ_{i=1}^{t} (∂L_t/∂o_t) (∂o_t/∂h_t) ( ∏_{j=i+1}^{t} ∂h_j/∂h_{j−1} ) (∂h_i/∂W)

The product term ∏_{j=i+1}^{t} ∂h_j/∂h_{j−1} in the gradient of W causes the gradient vanishing or
exploding problem, which is related to the well-known long-dependency issue in RNNs.
Each factor ∂h_j/∂h_{j−1} is itself a Jacobian matrix, which can be written analytically as:

∂h_j/∂h_{j−1} = (1 − h_j²) W                      (2.2)
When dealing with long sequences, the product term can become either very small or very
large because of the repeated multiplication of these matrices, depending on the norm of W. In
both cases, information is lost. When the norm of ∂h_j/∂h_{j−1} is greater than
1, the gradients can blow up, leading to optimization failure. When the norm
of ∂h_j/∂h_{j−1} is less than 1, the gradients from early states, such as ∂h_{100}/∂h_1, are numerically close to
0, which makes the model learn only short-term dependencies. The exploding problem
can usually be handled by gradient clipping, so the key issue is the vanishing problem. To
solve this, gated RNNs like GRU were proposed.
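The effect of the repeated Jacobian multiplication can be seen in a small numerical experiment. The sketch below uses arbitrary scaling factors of 0.9 and 1.1 (assumptions for illustration, not values from this thesis) to show how the norm of the product collapses or blows up over 100 timesteps.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 16, 100                      # hidden size and sequence length (illustrative)

def product_norm(scale):
    """Spectral norm of the product of T random Jacobians rescaled to spectral norm `scale`."""
    prod = np.eye(d)
    for _ in range(T):
        J = rng.standard_normal((d, d))
        J *= scale / np.linalg.norm(J, 2)   # largest singular value becomes `scale`
        prod = J @ prod
    return np.linalg.norm(prod, 2)

print(product_norm(0.9))   # << 1: gradients from early timesteps vanish
print(product_norm(1.1))   # >> 1: gradients explode
```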
2.2.2 GRU
The biggest change here, compared with formula 2.2, is the addition of the update gate z_j, which turns the
product term into a polynomial containing both high-order and low-order terms (the latter
provide relatively larger gradients because of fewer multiplications). The
naive formula 2.2, in contrast, only contains high-order terms, making the gradients vanish very
fast. This idea is very similar to the residual connections in large deep neural networks
like ResNet [16] and BERT [9], but in GRU and LSTM the residual connections are built
across time instead of across depth.
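For reference, the following is a sketch of the standard GRU cell equations from Cho et al. [5] and Chung et al. [7] (biases are omitted for brevity); the (1 − z) ⊙ h_{t−1} term is the residual-like path across time discussed above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(x_t, h_prev, p):
    """One step of a standard GRU; p holds the weight matrices (biases omitted)."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev)              # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev)              # reset gate
    h_tilde = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r * h_prev))  # candidate state
    return (1 - z) * h_prev + z * h_tilde                      # residual-like interpolation

# toy usage with random parameters
rng = np.random.default_rng(0)
dx, dh = 3, 4
p = {k: rng.standard_normal((dh, dx if k.startswith("W") else dh))
     for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
h = np.zeros(dh)
for x in rng.standard_normal((5, dx)):
    h = gru_cell(x, h, p)
print(h)
```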
2.3 Encoder-decoder Architecture

The input and output of the framework do not have to be different. When the
decoder tries to regenerate the encoded input from a relatively low-dimensional space,
architectures like this are known as auto-encoders. The advantage of such architectures
is that the model can learn an efficient coding, or latent representation, of the input data
in an unsupervised manner.
L(X, X′) = ∥X − X′∥_n
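A minimal Keras auto-encoder corresponding to this reconstruction loss (with n = 2, i.e. a mean-squared-error objective) could look as follows; the layer sizes are illustrative and not the ones used later in this thesis.

```python
from tensorflow import keras

input_dim, latent_dim = 300, 16        # illustrative sizes

encoder = keras.Sequential([
    keras.Input(shape=(input_dim,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(latent_dim, activation="relu"),   # low-dimensional code
])
decoder = keras.Sequential([
    keras.Input(shape=(latent_dim,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(input_dim),                        # reconstruct the input
])

x_in = keras.Input(shape=(input_dim,))
autoencoder = keras.Model(x_in, decoder(encoder(x_in)))
autoencoder.compile(optimizer="adam", loss="mse")         # L(X, X') = ||X - X'||_2
# autoencoder.fit(X, X, ...)  # trained to reproduce its own input
```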
2.4 Variational Auto-encoder

Let P(x) be the likelihood of the observed data and qϕ(z|x) be the proposed posterior
of z, i.e., the encoder in Figure 2.4.2, whose parameters are ϕ. Then

ln P(x) = ln [ P(x, z) / P(z|x) ]

Because ln P(x) does not depend on z, it stays the same when taking the expectation
over qϕ(z|x) on both sides. Therefore, we get

ln P(x) = E_{qϕ(z|x)}[ ln ( P(x, z) / qϕ(z|x) ) ] − E_{qϕ(z|x)}[ ln ( P(z|x) / qϕ(z|x) ) ]
        = E_{qϕ(z|x)}[ ln ( P(x, z) / qϕ(z|x) ) ] + E_{qϕ(z|x)}[ ln ( qϕ(z|x) / P(z|x) ) ]

Notice that the second term E_{qϕ(z|x)}[ ln ( qϕ(z|x) / P(z|x) ) ] is exactly the KL divergence between
qϕ(z|x) and P(z|x), and according to Jensen's inequality [17], the KL divergence is not less
than 0, so we have

ln P(x) ≥ E_{qϕ(z|x)}[ ln ( P(x, z) / qϕ(z|x) ) ] = ELBO
The first term is known as the Evidence Lower Bound (ELBO), because it provides
a lower bound for the likelihood. According to formula 2.3, the ELBO breaks down
into

ELBO = E_{qϕ(z|x)}[ ln ( Pθ(x|z) P(z) / qϕ(z|x) ) ]
     = E_{qϕ(z|x)}[ ln Pθ(x|z) ] + E_{qϕ(z|x)}[ ln ( P(z) / qϕ(z|x) ) ]
     = E_{qϕ(z|x)}[ ln Pθ(x|z) ] − KL(qϕ(z|x) ∥ P(z))                      (2.5)
The advantage of this is that the optimization target is switched to the
maximization of the ELBO, without directly parameterizing the likelihood of the original data as in
maximum likelihood estimation. Notice that in Figure 2.4.2, Pθ(x|z) is the output of the decoder,
which means it can be obtained directly during inference. Moreover, in the course of
maximizing the ELBO, the model learns the relationship between the two random
variables, the data x and the latent variable z. So, it is necessary to specify the forms of
Pθ(x|z), qϕ(z|x) and P(z).
For qϕ(z|x), it is common to use a normal distribution (or another distribution from
the exponential family) whose parameters {µ, σ} are produced by two neural
networks. In order for µ and σ to be optimized during back-propagation, the
reparameterization trick is applied, which uses the linearity of the normal distribution as
illustrated in Figure 2.4.2. As the posterior is normal, the prior is usually also chosen as
the standard normal N(0, I). So, the KL-divergence term in formula 2.5 can be written
as:

KL(qϕ(z|x) ∥ P(z)) = (1/2) Σ_{i=1}^{d} ( µ_i² + σ_i² − ln σ_i² − 1 )                      (2.6)
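The reparameterization trick and the closed-form KL term of formula 2.6 can be written down directly, as in the NumPy sketch below with toy encoder outputs; this illustrates the equations rather than the project's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I), keeping mu and sigma differentiable."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(q(z|x) || N(0, I)) from formula 2.6 (log_var = ln sigma^2)."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1, axis=-1)

# toy encoder outputs: batch of 4 samples, 3-dimensional latent space
mu = rng.standard_normal((4, 3))
log_var = rng.standard_normal((4, 3))
z = reparameterize(mu, log_var)
print(kl_to_standard_normal(mu, log_var))
```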
2.5 GAT (Graph Attention Neural Network)

A convolution kernel in a CNN produces feature maps by calculating the weighted sum of the central pixel and adjacent pixels, allowing
it to extract spatial features effectively. However, it should be noted that the pixels
in the image or video data processed by a CNN are arranged in a regular grid,
which makes a naive CNN unable to deal with data with non-Euclidean structure, like
social networks and molecules. However, researchers still want to use convolution
to extract features from such data, so the Graph Convolution Network (GCN) [20] has
been introduced to satisfy this need, using concepts from graph theory. There are
mainly two approaches to implementing graph convolution: the spatial-domain approach
and the frequency-domain approach. Because frequency-domain approaches are more complicated
and GAT focuses on spatial information, only the spatial-domain approach is
introduced here. A spatial GCN aggregates a node's information with that of its own neighbors.
Traditional CNN kernels can be seen as a specialization of the spatial GCN, whose
nodes are arranged in a grid graph as in Figure 2.5.1. Given a graph G = {V, E}, any
node i ∈ V can be characterized by a node vector X_i. The extracted feature, or node
representation, of node i is obtained by using the associated information (the edges, or
the adjacency matrix A of the graph, as in Figure 2.5.3). Assuming the node vector
X_i ∈ R^n and W ∈ R^{n×d}, the feature extraction can be described as follows:

h_i = σ(Aggregate(X_i) W),   with   Aggregate(X_i) = Σ_{j ∈ neighbor(i)} X_j A_ij

GAT [35] extends this aggregation with attention coefficients that are computed from the node
features themselves:

α_ij = softmax_j( LeakyReLU( aᵀ [h_i ∥ h_j] ) )

where h_i ∈ R^d, a ∈ R^{2d}, and ∥ is the concatenation operation. The reason for denoting
the input as h_i here is that GAT can also be applied as a middle layer which takes
hidden states as inputs. Therefore, the output of the present GAT layer is:

h′_i = σ( (1/K) Σ_{k=1}^{K} Σ_{j∈N_i} α^k_ij h_j W^k )        (averaging)
h′_i = ∥_{k=1}^{K} σ( Σ_{j∈N_i} α^k_ij h_j W^k )              (concatenation)          (2.7)
where K represents the number of attention heads of a GAT layer. The attention mechanism
enables GAT to have adaptive weights for different inputs, since α depends on h_i,
giving the model better generalizability. Similar to Transformers, GAT can also be
equipped with multi-head attention, allowing the model to jointly combine information
from different representation subspaces starting from different initializations. There are
usually two options for combining the K heads, as in formula 2.7: either concatenating or
averaging them.
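A single attention head, following the formulation of Veličković et al. [35], can be sketched as below. The final activation and the multi-head combination of formula 2.7 are omitted for brevity, and the complete-graph adjacency mirrors how the GAT layers are fed in this project.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gat_head(H, A, W, a):
    """Single-head GAT layer.
    H: (N, d) node features, A: (N, N) adjacency (1 where j is a neighbour of i),
    W: (d, d_out) projection, a: (2 * d_out,) attention vector."""
    Hw = H @ W                                   # projected features, (N, d_out)
    a_src, a_dst = np.split(a, 2)
    # e_ij = LeakyReLU(a^T [Hw_i || Hw_j]), computed for all pairs at once
    e = leaky_relu((Hw @ a_src)[:, None] + (Hw @ a_dst)[None, :])
    e = np.where(A > 0, e, -1e9)                 # mask out non-neighbours
    alpha = softmax(e, axis=1)                   # attention coefficients alpha_ij
    return alpha @ Hw                            # weighted sum of neighbour features

# toy usage on a complete graph of 5 nodes
rng = np.random.default_rng(0)
N, d, d_out = 5, 8, 4
H = rng.standard_normal((N, d))
A = np.ones((N, N))                              # complete graph
W = rng.standard_normal((d, d_out))
a = rng.standard_normal(2 * d_out)
print(gat_head(H, A, W, a).shape)                # (5, 4)
```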
2.6 Evaluation Metrics

Classification tasks have many metrics, most of which can only partially reflect
the performance of a model. If the metrics are not used properly, problems in the model may
go unnoticed and wrong conclusions may be drawn. So, it is important to use metrics
relevant to the target task.
The main metrics used to evaluate the performance of the model in this thesis are
precision, recall, F1 score, AUC and mean absolute error (MAE). For comparison with
previous work, the false positive rate (FPR) and true positive rate (TPR) are also chosen.
Except for MAE, the metrics are used for the evaluation of the classification.
Precision is the ratio between correct positives and the total number of predicted
positives, showing how confident the model is about the predicted positives. The
formula is:

Precision = TP / (TP + FP)                      (2.8)
Recall refers to the number of true positives divided by the number of all actual
positives. This metric describes how unlikely the model is to miss true positives. The
formula is:

Recall = TP / (TP + FN)                      (2.9)
Usually there is a trade-off between the two metrics, which means it is hard
to have both metrics high at the same time. So, the harmonic mean of precision and
recall, the F1 score, is often used to evaluate the performance more thoroughly. The F1
score is defined as:

F1 = 2 / (1/Precision + 1/Recall)                      (2.10)
The False Positive Rate (FPR) and True Positive Rate (TPR) are calculated as:

FPR = FP / (FP + TN)                      (2.11)
TPR = TP / (TP + FN)                      (2.12)

where TPR is equivalent to recall and FPR is the proportion of negatives that are
misclassified as positive. These two metrics reflect, for a given dataset, the chance that
actual positives are correctly identified and that actual negatives are incorrectly flagged.
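These metrics can be computed directly with scikit-learn, as in the following sketch on dummy labels.

```python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, confusion_matrix)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                        # dummy ground truth (1 = fraud)
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6]      # model scores
y_pred = [1 if s >= 0.5 else 0 for s in y_score]         # thresholded predictions

precision = precision_score(y_true, y_pred)              # TP / (TP + FP), formula (2.8)
recall = recall_score(y_true, y_pred)                    # TP / (TP + FN), equal to TPR
f1 = f1_score(y_true, y_pred)                            # harmonic mean, formula (2.10)
auc = roc_auc_score(y_true, y_score)                     # threshold-independent

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / (fp + tn)                                     # formula (2.11)
print(precision, recall, f1, auc, fpr)
```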
Chapter 3
Proposed Methods
This chapter illustrates how the data is processed and how the proposed methods are designed
using the previously introduced algorithms.
The phone numbers in the records are anonymized (replaced with randomly generated numbers), which increases
the difficulty of detection because the numbers are informative features, as introduced in
the previous chapter: IPRNs tend to have similar patterns, such as shared prefixes. Records
are preprocessed differently to fit the input requirements of the Isolation
Forest and the temporal model: for the temporal method we only care about
traffic volume at a given frequency, so the temporal model receives time-series input,
whereas the Isolation Forest takes the CDRs as input rather than a generated time
series.
According to Table 3.1.1, the only numerical features are debit and duration. As noted in the
previous introduction of Isolation Forests, the model only accepts numerical features,
so the other features need further preprocessing. Fortunately, the H2O [14] library
offers out-of-the-box support for categorical data, using one-hot encoding for those
discrete values by default.
Figure 3.2.1: Geographical representation of the destination of the IRSF fraud. [27]
According to [26], a pure Isolation Forest based on CDRs without any prior information
produces too many false positives. Thus, more IRSF-related knowledge needs to
be added to the model. Figure 3.2.1 shows that most fraudulent calls are made to
developing countries. On the advice of Sinch's experts, the GDP (crawled from the
World Bank [36]) of both the source and destination countries is added to the input features.
This is because GDP reflects the development status of a country, and countries with
similar GDP may have similar behavior; taking it as a feature introduces a piece
of prior knowledge into the model. Moreover, thanks to the nature of the Isolation Forest,
proper scaling of the numerical features (which is essential for some scale-sensitive models
like the One-Class SVM [32]) is unnecessary. Thus, the final input features for the forest
are similar to Table 3.2.1:
To detect sudden traffic like that in Figure 1.2.1, it is necessary to extract a time series of
traffic from the original records. In this project, traffic is counted at a given frequency
from the earliest timestamp to the latest for every country, producing a data matrix
X ∈ R^{k×T}, where k is the number of countries and T is the number of bins (each bin
represents f minutes) for the whole time span, such as one month.
To formulate the problem from the temporal perspective, the target is, given
the data matrix X ∈ R^{k×T}, to produce anomaly scores Y ∈ R^{k×T} where every component
of Y is in (0, 1).
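As an illustration of this extraction, the pandas sketch below bins calls per destination country at a fixed frequency. The column names (`timestamp`, `dest_country`) are assumptions for the example, not the actual schema of Sinch's CDRs.

```python
import pandas as pd

# one row per call; illustrative columns, not Sinch's actual schema
cdr = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-07-01 00:00:30", "2021-07-01 00:01:10",
                                 "2021-07-01 00:02:40", "2021-07-01 00:03:05"]),
    "dest_country": ["SE", "SE", "SE", "CU"],
})

f = "2min"   # bin length, matching the 2-minute frequency used in the experiments
traffic = (cdr.groupby([pd.Grouper(key="timestamp", freq=f), "dest_country"])
              .size()
              .unstack(fill_value=0)   # columns = countries, rows = time bins
              .T)                      # data matrix X of shape (k countries, T bins)
print(traffic)
```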
Inspired by Jiang et al. [18], graphs are considered suitable structures for memorizing
patterns among phone calls. Our project uses an architecture similar to Zhao et al. [37]
for the temporal model. Since the project uses a GAT-based model, the form of the
node features must be decided; a sliding window of length n is used here.
The reason is that, while maintaining the diversity of the data, low-frequency information
can be preserved by the model. In addition, as a VAE is used here, the observations
within a sequence share a single representation in the latent space z, helping the
model capture the characteristics of the global distribution of the data. Then a pipeline
like the one in Figure 3.3.1 is applied. Compared to the original paper [37], the project's
contributions include the adaptation of the loss function to better deal with the given dataset,
and the specification of each module in the pipeline, as the paper only provides high-level
descriptions, especially for the auto-encoder part. The first two modules in the pipeline
serve as a feature extractor, utilizing the power of GAT to extract features from non-
Euclidean space. Then, the remaining steps are adapted from the auto-encoder architecture to
learn the inherent relationship between the extracted features and the latent space. Below
are the details of each stage introduced by Zhao et al. [37]:
1. A 1-D convolution layer is first applied to the input sliding window to extract
high-level features at each timestamp.
2. Two parallel graph attention (GAT) layers handle the output of the convolution
layer; they aim to learn the relationships among the multiple features and among the
timestamps, respectively.
3. The GRU module takes the concatenation of the two GAT outputs as input and
maps it into the latent space z. This layer is used for capturing sequential patterns
in the time series and compressing the feature representation.
4. The outputs of the encoder are passed into a forecasting-based model and a
reconstruction-based model in parallel for the final result. The forecasting-based
model is implemented as a multilayer perceptron, and a VAE is adopted for the
reconstruction.
Figure 3.3.2: The illustration of feature-oriented (left) and time-oriented GAT layers
(right) [15].
Figure 3.3.3: The illustration of input node features in the feature-oriented layers.
Moreover, GAT is also used to extract the local dependencies within a sliding window.
To do so, every node represents a single timestamp, and the input can be seen as a snapshot of
the traffic of all countries. This process is very similar to the Transformer [34], where
a fully-connected self-attention operation attends over all words in a sequence. This part
takes node features of dimension k and an n × n complete graph as the input, and outputs
a feature matrix of shape n × k, as illustrated on the right of Figure 3.3.2.
To fuse the information from the different sources, the three types of n × k extracted features
are concatenated into an n × 3k matrix, which serves as the next module's input.
As previously mentioned, the GRU encoder is used to capture sequential patterns and
learn the invariants of a sliding window. The model includes a forecasting-based model
to predict the value at the next timestamp and a reconstruction-based model to fit the
data distribution of the sliding windows. The GRU units iterate over the n timesteps,
producing an encoded d-dimensional vector for later use.
For the prediction model, the task is, given the encoded d-dimensional vector, to predict
the value of the following timestamp y_pred ∈ R^k. This is achieved by using three fully-
connected layers, the last of which has the same number of units as the number of
countries, as suggested by Zhao et al. [37]. The loss function for the forecasting task
is:

Loss_pred = MAE(y_pred − y_{n+1})

where MAE denotes the mean absolute error, used instead of the mean square error of Zhao
et al. [37].
For the reconstruction model, the task is to reconstruct the input features from the latent
space. Having observed some sparseness in the data, we chose the Laplace distribution for
this project rather than the Gaussian as the specification for the decoder Pθ(x|z), because
the corresponding training objective minimizes an l1 reconstruction loss, which promotes sparse
reconstruction errors. This is based on the observation that anomalous traffic should be
rare and sparse. To generate a sequence for the reconstruction model, another GRU
is used as the decoder, which takes the latent vectors from the encoder as input. As
described in the previous chapter, the second term of the ELBO, formula 2.6, can be computed
analytically.
Instead of maximizing the PDF of the Laplace distribution directly, we found that
optimizing the exponent of the PDF gives numerically more stable gradients and avoids
crashes during optimization, similar to what the original VAE [19] suggests. Therefore, the
total loss for the model combines the forecasting loss and the reconstruction objective.
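One plausible form of the joint objective is sketched below. The exact weighting of the terms is not reproduced here, so this should be read only as an assumption-laden illustration of combining the MAE forecasting loss, the l1 (Laplace) reconstruction term, and the closed-form KL of formula 2.6.

```python
import numpy as np

def joint_loss(y_next, y_pred, x_window, x_rec, mu, log_var):
    """Assumed combination: forecasting MAE + negative ELBO with a Laplace decoder.
    y_next, y_pred: true and predicted next-timestamp values, shape (k,).
    x_window, x_rec: input window and its reconstruction, shape (n, k).
    mu, log_var: latent Gaussian parameters from the encoder."""
    loss_pred = np.mean(np.abs(y_pred - y_next))     # MAE forecasting loss
    loss_rec = np.mean(np.abs(x_window - x_rec))     # exponent of the Laplace pdf (l1 term)
    kl = 0.5 * np.mean(np.sum(mu ** 2 + np.exp(log_var) - log_var - 1, axis=-1))
    return loss_pred + loss_rec + kl                 # equal weights are an assumption
```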
3.3.3 Inference

The inference also combines the two jointly-optimized models.
The reconstruction model outputs a probability matrix p of shape n × k, whose component
p_ij shows how likely such traffic is at timestamp i for country j. The output
of the forecasting model, y_pre or y_{t+n}, is a k-dimensional vector indicating what the value
y_{t+n} at the next timestamp should be, given a sliding window {y_t, ..., y_{t+n−1}}.
Here p_ij = (1/2) e^{−|y_ij − y_rec,ij|}. The first term, (1 − p_ij), indicates how unlikely the traffic at
timestamp i is in country j, and the second term, |y_pre,ij − y_ij|, measures the inconsistency
between the value at the next timestamp and the ground truth. A weighted average of these
two terms is also an alternative.
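A simplified sketch of how the two signals could be turned into per-timestamp anomaly scores is given below; the weighting factor gamma is an assumption for illustration, not a value from the thesis.

```python
import numpy as np

def timestamp_scores(y, y_rec, y_pred_next, y_true_next, gamma=1.0):
    """y, y_rec: (n, k) window and its reconstruction; y_pred_next, y_true_next: (k,)."""
    p = 0.5 * np.exp(-np.abs(y - y_rec))            # Laplace likelihood of the observed traffic
    recon_term = 1.0 - p                            # how unlikely the observed traffic is
    pred_term = np.abs(y_pred_next - y_true_next)   # forecasting inconsistency
    # the one-step-ahead prediction error is broadcast over the window rows (simplification)
    return (recon_term + gamma * pred_term) / (1.0 + gamma)
```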
To obtain a prediction for a single record, we need to combine the scores from the Isolation
Forest and the GAT-based model. Every record is assigned an anomaly score
depending on which timestamp the record belongs to, because the proposed temporal
model can evaluate every time slot. Denote the score from the temporal model as s_time
and the score from the tree-based model as s_tree. The output set of the proposed detection
model is

Y = X_{s_time > p1 ∧ s_tree > p2} ∪ X_{s_tree > p3}

where p3 > p1 and X is the set of records. The first set, X_{s_time > p1 ∧ s_tree > p2}, represents
calls that are considered suspicious by both models. The requirement for identifying
fraud is stricter than with a single Isolation Forest, so this method should propose fewer
false positives while maintaining an acceptable recall rate. The second set, X_{s_tree > p3},
serves as compensation to the output, because fraud could happen at timestamps that
are not suspicious from a temporal perspective. The proposed model in this project
follows the framework illustrated in Figure 3.3.4: our method first uses the
temporal model to find suspicious time spans and then conducts the detection using
the IF. In our experiments, p3 is set to 0, because we found that the IF alone would only produce
more false positives on our dataset, but it is worth investigating different settings in
future work. Moreover, others can try different temporal or anomaly detection models
and a more flexible way of combining the two results for their own use case.
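The set combination above can be expressed with boolean masks, as in the sketch below; the scores and thresholds are purely illustrative.

```python
import numpy as np

def combine(s_time, s_tree, p1, p2, p3):
    """Y = X_{s_time > p1 and s_tree > p2}  ∪  X_{s_tree > p3}.
    s_time: temporal score of the time slot each record falls into;
    s_tree: Isolation Forest score of the record itself."""
    both_suspicious = (s_time > p1) & (s_tree > p2)
    very_suspicious_tree = s_tree > p3
    return both_suspicious | very_suspicious_tree

# toy usage: 5 records with arbitrary scores and thresholds
s_time = np.array([0.9, 0.2, 0.8, 0.1, 0.7])
s_tree = np.array([0.6, 0.7, 0.3, 0.95, 0.65])
print(combine(s_time, s_tree, p1=0.5, p2=0.5, p3=0.9))   # boolean mask of flagged records
```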
Figure 3.3.4: The overall detection framework (CDR → a single call record → temporal model → anomaly detection model → prediction).
Chapter 4
Evaluation
4.1 Experiment
4.1.1 Preprocess Details

The data provided by Sinch includes CDRs from December 2020 and from March and July
2021. For the time series extraction, the bin length f is set to 2 minutes for every
country, producing a time series for each month. We use the same sliding window size
n = 100 for all models. To feed such long sequences to the temporal model, training
is performed on overlapping sequences obtained by a sliding window with stride = 1,
which provides more data. As the time series are extremely unbalanced (only a few countries
have high traffic volumes), we first take the logarithm of the data:

X = log(X_raw + 1)
The effect of the logarithmic transformation is shown in Figure 4.1.1. While the partial
order remains the same, the transformed data are more evenly spread out, which is
beneficial for optimization. Moreover, high-volume traffic values are pushed closer
to each other, while small and large values remain clearly separated. This property supports
better performance, because a difference of a couple of hundred calls may be negligible for
high-traffic countries but significantly affects the analysis for low-traffic ones. For example,
we want to be more lenient when facing an increase from 2k to 2.5k calls, but more alert when dealing with
a spike from nothing to 0.5k. The time series are then normalized with min-max scaling:

X_in = (X − min(X)) / (max(X) − min(X))

where X_raw is of shape k × T, so the range of the observed traffic lies in the interval
[0, 1].

Figure 4.1.1: The data distributions before (left) and after (right) the logarithmic
transformation.
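The two preprocessing steps can be sketched as follows. Whether the min-max scaling is applied globally or per country is not fully specified above, so this sketch simply scales globally.

```python
import numpy as np

def preprocess(X_raw):
    """Log transform followed by min-max scaling; X_raw is the (k, T) traffic matrix."""
    X = np.log(X_raw + 1)                        # X = log(X_raw + 1)
    return (X - X.min()) / (X.max() - X.min())   # scale observed traffic into [0, 1]

X_in = preprocess(np.random.default_rng(0).poisson(5.0, size=(3, 10)))
print(X_in.min(), X_in.max())
```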
4.1.2 Setup
We have three datasets from three non-consecutive months, of which only July is
partially labeled (the first half), as listed in Table 4.1.1. The dataset contains frauds
that are not labeled due to the lack of accurate information on their exact timing,
although, according to the complaints submitted to Sinch, the fraud calls happened
within certain episodes. Moreover, the dataset might contain other frauds that are not labeled,
as labeling is done manually. Confirmed fraud episodes represent the dates on which
confirmed fraud happened. The concatenation of the December and March data is used
for training (the first 80%) and validation (the last 20%), while the first half of July is used
for evaluation. For T1, we chose a latent space dimension of d = 300 and the AdamW
optimizer [24] to train our model for 100 epochs with a learning rate of 1e-5, a batch size of
128, and an early stopping strategy to avoid over-fitting, as suggested by [37]. Specifically,
the temporal model is implemented using Keras [6] and Spektral [13]. The details of the
model can be found in Appendix Figure A.0.1. Regarding T2, we choose 100 trees for
the Isolation Forest, because Liu et al. [23] state that models with around 100 trees are
sufficient to produce convergent results.
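For reference, an equivalent setup with scikit-learn's IsolationForest is sketched below; the thesis itself uses H2O's implementation, so this is only an illustrative stand-in.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.standard_normal((1000, 6))       # stand-in for the encoded CDR features

# 100 trees, following Liu et al. [23]
model = IsolationForest(n_estimators=100, random_state=0).fit(X_train)
scores = -model.score_samples(X_train)         # higher value = more anomalous
```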
For the hardware setup, the experiments are conducted in Google's Colab virtual
environment, using an Nvidia P100 GPU and a dual-core Intel Xeon processor at 2.3 GHz
with 25 GB of RAM.
We compare the proposed model with a pure Isolation Forest on the T2 detection task.
As the temporal model can be considered a regression model, a pure GRU is
compared with the proposed GAT-based model on regression error and T1.
4.2.1 Results of T1
Table 4.2.1: The results of T1 on July’s data using the full training data
As shown in Table 4.2.1, for T1 the proposed model does not show a dominant
advantage over the naive GRU. There are two potential reasons: first, most of
the suspicious time slots appear in countries where there is usually no traffic at all
but which show extreme peaks, as in Figure 4.2.1, so even the naive temporal model can find
this type of timestamp; second, the concatenation of two months of data provides enough
similar patterns that even the simple model can easily capture the seasonality in the data.
However, if trained with limited data (fewer similar patterns), do the results remain the same?
To answer this question, we then use only March for training. The results are
shown in Table 4.2.2. The proposed model produces fewer false positives than
the naive model when dealing with less data. In addition, according to Figure 4.2.2,
with increasing amounts of training data both methods improve their regression errors,
but the GAT-based model outperforms the naive temporal model in all data settings. The
GAT-based model might be better at capturing the weak seasonality in the traffic with
limited data than the GRU, as shown in Figure 4.2.3.
Table 4.2.2: The results of T1 on July’s data only using March’s Data
Moreover, to interpret what the model has learned during training, we visualize
the encoded latent representations of the suspicious timestamps in Figure 4.2.4. For
Figure 4.2.4, we sample 100 data points around the fraud episodes in July and visualize the
output of the encoder using scikit-learn's t-SNE [29], where the color shows the average
anomaly score described in the previous chapter. The visualization demonstrates that
most anomalies lie at the periphery (yellow points) of the latent distribution learned by the
model. In other words, the model matches anomalies with low-probability samples in
the latent distribution, as we expected.
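The latent-space visualization can be reproduced along the following lines, here with random stand-ins for the encoder outputs and the anomaly scores.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
latent = rng.standard_normal((100, 300))        # stand-in for the 300-dimensional encoder outputs
scores = rng.random(100)                        # stand-in for the average anomaly scores

emb = TSNE(n_components=2, random_state=0).fit_transform(latent)
plt.scatter(emb[:, 0], emb[:, 1], c=scores)     # colour points by anomaly score
plt.colorbar(label="anomaly score")
plt.show()
```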
Figure 4.2.2: The regression error on test set with different percentages of training
data
Figure 4.2.3: The captured seasonality in July of the GRU (top) and the GAT-based model (bottom),
using March data
Recall that when using the feature-oriented GAT module, the relationship between
the nodes (countries) is represented by a complete graph. In other words, the inherent
connections between nodes are captured by the attention mechanism. Figure 4.2.5 shows
a visualization of the attention values, calculated based on 1k samples around Christmas.
Interestingly, some countries that do not celebrate Christmas, like CN and OM, are far from
the US and have relatively small attention values. In contrast, countries with Christmas
holidays, like JP and KE, are closer to the US. This result is expected, because when the
model is trying to fit the training data, it is easier for it to give more attention to those
countries that also have a low volume of traffic during Christmas. However, we do observe
some exceptions, like IN, BB, and CA. The figure also shows that the US has strong
connections with XV and MD during Christmas.
Figure 4.2.4: The visualization of latent space around suspicious time slots
Next, we analyze the false positives produced by the model, as plotted in Figure 4.2.6.
Most countries have low or even zero volume, so when the model sees a spike in
the traffic, it identifies the corresponding time slots as anomalies. However,
From the results of T1, we can conclude, regarding Q2, that the proposed temporal model
is able to detect suspicious time slots using historical traffic and has advantages over
naive temporal models such as a plain GRU.
4.2.2 Results of T2
The aim of T2 is to classify every record in the test data. Before diving into the analysis
of T2, it should be noted that only frauds within one episode are labeled, which
means the metrics can only measure part of the capability of the model. As shown in
Table 4.2.3 and Figure 4.2.7, our proposed model has clear advantages over the
pure Isolation Forest, providing more accurate detection.
We chose 0.5 as the classification threshold for the results in Table 4.2.3. We notice
a minor loss in recall compared to the baseline, because we ignore samples outside the
detected timestamps; this is expected, since some potential frauds outside those timestamps
are discarded. Admittedly, the precision is still relatively low, even though the proposed
model shows a 505% improvement compared to the naive IF.
Figure 4.2.7: The produced ROCs of the Isolation Forest (left), the combined model
(mid) and the IF without GDP (right)
The potential reasons lie in two aspects: first, the use of anomaly detection for
this problem is based on the assumption that frauds are a subset of anomalies,
which means the detected samples are not guaranteed to be frauds; second, due
to privacy issues, the model does not leverage all informative features, such as phone
numbers, leading to coarser detection.
We also conducted an ablation study to validate the effectiveness of the additional GDP
feature and the temporal model, shown in Table 4.2.3. The GDP features bring about a
44% performance gain in recall with only a minor loss in precision. Our
interpretation is that, for some countries that rarely have calls, the GDP feature can
serve as evidence for further branching. For example, in some trees, samples are split
on the GDP field before the country field, so that samples may end up in a subtree
with samples from countries with similar GDP. To validate this guess, we checked
the average anomaly score for countries with a GDP similar to SA (around $23,140), as
shown in Table 4.2.4. Most countries match our guess that countries with
similar GDP should have similar behavior and, hence, close anomaly scores. There are still
some exceptions, like AW, due to the random nature of the IF, so the observation is not
guaranteed under different random seeds. Besides, this feature could also introduce bias
into the model, because the anomaly scores of some regular calls from countries with
similar GDP might increase. Overall, this prior knowledge brings a significant
improvement to the model.
Table 4.2.4: The average anomaly score for samples from some countries

Country        SA     CZ      CY     EE     AW     US
Anomaly score  0.294  0.3025  0.306  0.322  0.490  0.359
In terms of threshold-independent metrics, Figure 4.2.7 shows that the proposed model also
outperforms the two baselines, with a 7% improvement in AUC over the IF with GDP and a
17% improvement over the naive IF.
The comparison with previous related work is shown in Table 4.2.5.
Notice that these works use different metrics than those above, and the reported results
are taken from the original papers. According to the comparison, the proposed model still
shows advantages over the previous work, albeit on a different dataset, since no
publicly available dataset exists. Compared with Sahin et al. [27], the proposed model
fails to outperform in terms of FPR, because Sahin et al. [27] use a supervised model
which can learn the intrinsic nature of fraud from more prior knowledge, while our
model is unsupervised and based on anomaly detection. However, the comparison should
be treated with caution: according to Sinch's experts, the metrics in Table 4.2.5 are not the
most important ones, and the private, closed-source datasets used in previous work
also make such a comparison unfair.
Table 4.2.5: Comparison with previous work

                      TPR     FPR
Sahin et al. [27]     0.964   0.0026
Meijaard et al. [26]  0.87    0.05
Proposed Model        0.961   0.021
The above results and analysis show that, at least on Sinch's dataset, the proposed model
achieves state-of-the-art performance in terms of TPR and FPR. The most considerable
improvement is the reduction of false positives compared to the simple IF, satisfying the
requirement of Q3. Moreover, our model is trained in an unsupervised manner and
allows positives in the training set, which answers Q1.
Efficiency
To explore the potential of our model for real-time detection, we also measured the
training and inference time during the experiments. The results are shown in Table 4.2.6.
In the experiments, the inference stage involves all of the test data, while in production
the model only has to deal with one sample at a time. Thus, when dealing with streaming
data, the inference time would be reduced to a few seconds, and we believe the model
shows great potential for real-time detection.
Chapter 5
Discussion
The second attempt was to use a recommender system in reverse. The motivation was
that, generally, a recommender system recommends to users the actions they are most
likely to take; conversely, behaviors that users are unlikely to exhibit should be anomalies.
After some experiments, we found this method unfeasible in our scenario, because some
users have very little traffic, so their patterns are not formed at all.
Besides these attempts at the model design, some alternatives were also tried to
solve other challenges in this project.
Because the country and reason fields are categorical, they can add more than
220 extra sparse dimensions to the input data, while there are only 3 to 4 numerical
features. This sparsity can hurt the performance of machine learning models.
To avoid such sparsity, embedding methods like Item2Vec [2] were tried, but they did not
end up improving performance.
In the first four months of this project, there was no labeled data, even for evaluation.
We tried some label-free evaluation methods, such as that of Goix [12], which uses criteria
based on Excess-Mass (EM) and Mass-Volume (MV) curves for non-labeled data. However,
after some experiments, we found that these methods are not suitable for our data, because
they require continuous Euclidean feature vectors, which our data do not satisfy.
5.2 Limitations
The most significant limitation of the project is that the phone numbers, which are very
informative features, have not been utilized due to privacy regulations. Anyone who
can access the actual, non-anonymized data may use these features to perform more
accurate detection, because IPRNs tend to have similar patterns.
Another issue is that the July data is only partially labeled, which means there are
some unlabeled frauds. Therefore, the reported precision is just a lower bound on the
true precision, while the actual recall might be lower than what we report. During
the development of the project, the results were sent to Sinch and helped them find
new fraud cases, which supports this assumption.
There are other limitations of the labeled data as well. One should be careful with
the dates neighboring the suspicious timestamps, because fraudsters tend to launch
attacks on consecutive days. Due to time constraints, only the first half of July was
manually validated and used for evaluation, so there could be more frauds in July.

5.3 Future Work

One interesting aspect worth investigating is that, in this project, the snapshot
of the whole network is treated as the input, but one may also conduct experiments
per user or per country, because anomalies in this scenario are context-based. For
example, the definition of an anomaly might vary from country to country and from
month to month.
5.4 Conclusion
This project tries to provide a solution to the long-standing yet unsolved problem of
International Revenue Sharing Fraud (IRSF). The existing methods suffer from high false-
positive rates. To address this, we split the detection task into two subtasks, handled
by two methods respectively. For the first subtask, we introduce temporal methods to
restrict detection to samples in suspicious timestamps, utilizing the power of a graph-based
model. For the second subtask, an Isolation Forest with additional GDP features is applied.
As a result, with the combination of these two methods, our model outperforms the existing
models on Sinch's data, achieving 96.1% recall and 42.4% precision, which demonstrates
great potential for effective detection in production. Extensive analysis also shows that
the model has good interpretability in terms of traffic patterns.
Bibliography
[2] Barkan, Oren and Koenigstein, Noam. Item2Vec: Neural Item Embedding for
Collaborative Filtering. 2017. arXiv: 1603.04259 [cs.LG].
[3] Breunig, Markus M., Kriegel, Hans-Peter, Ng, Raymond T., and Sander, Jörg.
“LOF: Identifying Density-Based Local Outliers”. In: SIGMOD ’00. Accessed:
2021-07-28. Dallas, Texas, USA: Association for Computing Machinery, 2000,
pp. 93–104. ISBN: 1581132174. DOI: 10.1145/342009.335388. URL: https://doi.org/10.1145/342009.335388.
[4] Cheng, Yan, Sun, Huan, Chen, Haomai, Li, Meng, Cai, Yingying, Cai, Zhuang,
and Huang, Jing. “Sentiment Analysis Using Multi-Head Attention Capsules
With Multi-Channel CNN and Bidirectional GRU”. In: IEEE Access 9 (2021),
pp. 60383–60395. DOI: 10.1109/ACCESS.2021.3073988.
[5] Cho, Kyunghyun, Merrienboer, Bart van, Gulcehre, Caglar, Bahdanau, Dzmitry,
Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua.
Learning Phrase Representations using RNN Encoder-Decoder for Statistical
Machine Translation. 2014. arXiv: 1406.1078 [cs.CL].
[6] Chollet, Francois et al. Keras. Accessed: 2021-07-28. 2015. URL: https://github.com/fchollet/keras.
[7] Chung, Junyoung, Gulcehre, Caglar, Cho, KyungHyun, and Bengio, Yoshua.
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence
Modeling. 2014. arXiv: 1412.3555 [cs.NE].
[8] Danel, Tomasz, Spurek, Przemysław, Tabor, Jacek, Śmieja, Marek, Struski,
Łukasz, Słowik, Agnieszka, and Maziarka, Łukasz. Spatial Graph Convolutional
Networks. 2020. arXiv: 1909.05310 [cs.LG].
[9] Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, and Toutanova, Kristina.
BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding. 2018. DOI: 10.48550/ARXIV.1810.04805.
[11] Feng, Weijiang, Guan, Naiyang, Li, Yuan, Zhang, Xiang, and Luo, Zhigang.
“Audio visual speech recognition with multimodal recurrent neural networks”.
In: May 2017, pp. 681–688. DOI: 10.1109/IJCNN.2017.7965918.
[12] Goix, Nicolas. How to Evaluate the Quality of Unsupervised Anomaly Detection
Algorithms? 2016. arXiv: 1607.01152 [stat.ML].
[13] Grattarola, Daniele and Alippi, Cesare. Graph Neural Networks in TensorFlow
and Keras with Spektral. 2020. arXiv: 2006.12138 [cs.LG].
[14] H2O.ai. h2o: Python Interface for H2O. Accessed: 2021-07-28. 2020. URL:
https://github.com/h2oai/h2o-3.
[16] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep Residual
Learning for Image Recognition. 2015. arXiv: 1512.03385 [cs.CV].
[17] Jensen, J. L. W. V. “Sur les fonctions convexes et les inégalités entre les valeurs
moyennes”. In: Acta Mathematica 30 (1906), pp. 175–193. DOI: 10.1007/BF02418571.
[18] Jiang, Nan, Jin, Yu, Skudlark, Ann, Hsu, Wen-Ling, Jacobson, Guy, Prakasam,
Siva, and Zhang, Zhi-Li. “Isolating and analyzing fraud activities in a large
cellular network via voice call graph analysis”. In: MobiSys’12 - Proceedings
of the 10th International Conference on Mobile Systems, Applications, and
Services (June 2012). DOI: 10.1145/2307636.2307660.
[19] Kingma, Diederik P and Welling, Max. Auto-Encoding Variational Bayes. 2014.
arXiv: 1312.6114 [stat.ML].
[20] Kipf, Thomas N and Welling, Max. “Semi-supervised classification with graph
convolutional networks”. In: arXiv preprint arXiv:1609.02907 (2016).
[21] LeCun, Yann, Bengio, Yoshua, and Hinton, Geoffrey. “Deep learning”. In:
nature 521.7553 (2015), pp. 436–444.
[22] Li, Fei-Fei, Karpathy, Andrej, and Johnson, Justin. “CS231n: Convolutional
Neural Networks for Visual Recognition [PDF document]”. In: (2019). Accessed:
2021-07-28. URL: http://cs231n.stanford.edu/slides/2019/cs231n_2019_lecture11.pdf.
[23] Liu, Fei Tony, Ting, Kai Ming, and Zhou, Zhi-Hua. “Isolation forest”. In: 2008
Eighth IEEE International Conference on Data Mining. IEEE. 2008, pp. 413–
422.
[24] Loshchilov, Ilya and Hutter, Frank. Decoupled Weight Decay Regularization.
2019. arXiv: 1711.05101 [cs.LG].
[26] Meijaard, Yoram J., Cappers, Bram C. M., Mengerink, Josh G. M., and Zannone,
Nicola. “Predictive Analytics to Prevent Voice over IP International Revenue
Sharing Fraud”. In: Data and Applications Security and Privacy XXXIV. Ed. by
Anoop Singhal and Jaideep Vaidya. Cham: Springer International Publishing,
2020, pp. 241–260. ISBN: 978-3-030-49669-2.
[28] Narkhede, Sarang. Understanding AUC - ROC Curve. June 2021. URL: https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5.
[29] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos,
A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. “Scikit-learn:
Machine Learning in Python”. In: Journal of Machine Learning Research 12
(2011), pp. 2825–2830.
[30] Pham, Chau. Graph convolutional networks (gcn). Accessed: 2021-07-28. Feb.
2021. URL: https://www.topbots.com/graph-convolutional-networks/.
[31] Santos, Cícero dos and Gatti., Maíra. “Deep Convolutional Neural Networks for
Sentiment Analysis of Short Texts”. In: Proceedings of COLING 2014, the 25th
International Conference on Computational Linguistics: Technical Papers.
Dublin, Ireland: Dublin City University and Association for Computational
Linguistics, Aug. 2014, pp. 69–78.
[32] Schölkopf, Bernhard, Platt, John C., Shawe-Taylor, John C., Smola, Alex J.,
and Williamson, Robert C. “Estimating the Support of a High-Dimensional
Distribution”. In: Neural Comput. 13.7 (July 2001), pp. 1443–1471. ISSN: 0899-
7667. DOI: 10.1162/089976601750264965.
[34] Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion,
Gomez, Aidan N., Kaiser, Lukasz, and Polosukhin, Illia. Attention Is All You
Need. 2017. arXiv: 1706.03762 [cs.CL].
[35] Veličković, Petar, Cucurull, Guillem, Casanova, Arantxa, Romero, Adriana, Liò,
Pietro, and Bengio, Yoshua. Graph Attention Networks. 2018. arXiv: 1710.10903 [stat.ML].
[36] World Bank. GDP. Accessed: 2021-07-28. 2021. URL: https://data.worldbank.org/indicator/NY.GDP.MKTP.CD.
[37] Zhao, Hang, Wang, Yujing, Duan, Juanyong, Huang, Congrui, Cao, Defu, Tong,
Yunhai, Xu, Bixiong, Bai, Jing, Tong, Jie, and Zhang, Qi. Multivariate Time-
series Anomaly Detection via Graph Attention Network. 2020. arXiv: 2009.02040 [cs.LG].
[38] Zhao, Ling, Song, Yujiao, Zhang, Chao, Liu, Yu, Wang, Pu, Lin, Tao, Deng, Min,
and Li, Haifeng. “T-GCN: A Temporal Graph Convolutional Network for Traffic
Prediction”. In: IEEE Transactions on Intelligent Transportation Systems 21.9
(Sept. 2020), pp. 3848–3858. DOI: 10.1109/tits.2019.2935152. URL: https://doi.org/10.1109/tits.2019.2935152.
[39] Zhu, Xinxin, Wang, Weining, Guo, Longteng, and Liu, Jing. AutoCaption:
Image Captioning with Neural Architecture Search. 2020. arXiv: 2012.09742
[cs.CV].
Appendix - Contents

A Appendix
Appendix A
Appendix
Figure A.0.1 shows the details of every component in the model, where black
arrows show that two components are connected, and yellow arrows indicate the
modules' roles as mentioned in the report.