
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2022

Telecom Fraud Detection Using Machine Learning
KTH Thesis Report

Chao Xiong

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Telecom Fraud Detection Based On
Machine Learning

Chao Xiong
Computer Science
KTH Royal Institute of Technology

Examiner
Prof. Panos Papadimitratos
NSS Group
KTH Royal Institute of Technology

Supervisor at KTH
Zahra Alimadadi
NSS Group
KTH Royal Institute of Technology

Supervisors at Sinch

Paulo Fonseca, Pieter Buteneers and Michael Truong


Sinch

Abstract

International Revenue Sharing Fraud (IRSF) is one of the most persistent types of fraud in the telecommunications industry. According to the 2017 Communications Fraud Control Association (CFCA) fraud loss survey, IRSF costs around 6 billion dollars a year. Detecting such fraud is therefore vital to avoid further losses. Although many efforts have been made, very few utilize the temporal patterns of phone call traffic. This project, supported by Sinch's real production data, exploits both the spatial and the temporal patterns learned by a Graph Attention Neural Network (GAT) combined with a Gated Recurrent Unit (GRU) to find suspicious timestamps in the historical traffic.

Combined with the time-independent Isolation Forest model, our approach yields better results on the phone call records. This report first explains the mechanism of IRSF in detail and introduces the models applied in this project, namely GAT, GRU, and Isolation Forest. It then presents how our experiments were conducted and analyses the results extensively. We achieve 42.4% precision and 96.1% recall on the test data provided by Sinch, showing significant advantages over both previous work and the baselines.

Keywords

Fraud Detection, Anomaly Detection, Machine Learning, Deep Learning, International Revenue Sharing Fraud

Abstrakt

International Revenue Sharing Fraud (IRSF) är en av de mest ihållande typerna av bedrägerier inom telekommunikationsindustrin. Enligt 2017 Communications Fraud Control Association (CFCA) bedrägeriförlustundersökning kostar IRSF 6 miljarder dollar per år. Därför är upptäckten av sådana bedrägerier av avgörande betydelse för att undvika ytterligare förluster. Även om många ansträngningar har gjorts är det väldigt få som använder telefonsamtalstrafikens tidsmässiga mönster. Detta projekt, med stöd av Sinchs verkliga produktionsdata, syftar till att utnyttja både rumsliga och tidsmässiga mönster som lärts in av Graph Attention Neural Network (GAT) med Gated Recurrent Unit (GRU) för att hitta misstänkta tidpunkter i den historiska trafiken.

Dessutom, i kombination med den tidsoberoende Isolation forest-modellen, bör vår modell ge bättre resultat för telefonsamtalsposterna. Denna rapport förklarar först mekanismen för IRSF i detalj och introducerar modellerna som används i detta projekt, inklusive GAT, GRU och Isolation forest. Slutligen presenteras hur våra experiment har genomförts och resultaten med omfattande analys. Vi har uppnått 42.4% precision och 96.1% recall på testdata från Sinch, vilket visar betydande fördelar jämfört med både tidigare arbete och baslinjer.

Contents

1 Introduction
  1.1 Problem Background
  1.2 Related Work
      1.2.1 Signature-based Detection Methods
      1.2.2 Anomaly-based Detection Methods
      1.2.3 Temporal Methods
  1.3 Research Questions and Contributions

2 Theoretical Background
  2.1 Isolation Forest
  2.2 RNN and GRU
      2.2.1 RNN
      2.2.2 GRU
  2.3 Encoder-decoder Architecture
  2.4 Variational Auto-encoder
  2.5 GAT (Graph Attention Neural Network)
  2.6 Evaluation Metrics

3 Proposed Methods
  3.1 Data Description
  3.2 Feature Engineering for the Isolation Forest
  3.3 Temporal Model Design
      3.3.1 Feature Extractor
      3.3.2 Auto-encoder and joint optimization
      3.3.3 Inference

4 Evaluation
  4.1 Experiment
      4.1.1 Preprocess Details
      4.1.2 Setup
  4.2 Results And Analysis
      4.2.1 Results of T1
      4.2.2 Results of T2

5 Discussion
  5.1 Other Trials
  5.2 Limitations
  5.3 Future Work
  5.4 Conclusion

References
Chapter 1

Introduction

This chapter first briefly introduces IRSF and its mechanisms and explains why this topic is worth studying. Following that, the related work and approaches that have inspired this project are presented.

1.1 Problem Background

Telephony is one of the world’s oldest and biggest networks. Telephony networks, as
a result of their long history and widespread use, incorporate a variety of technologies
and involve a diverse range of service providers and consumers, making them
complicated and opaque.

Every day, telephony networks handle a massive amount of phone call traffic. Their intrinsic complexity, which involves many providers and suppliers located in different countries, makes estimating the cost of each call challenging. This intricacy has been exacerbated by the emergence of VoIP technology, which connects the Internet with traditional telephony and allows telephony networks to be used as a lucrative setting for various fraud schemes. The fraud we deal with in this project is called International Revenue Sharing Fraud (IRSF). In this scheme, fraudsters use illegal (e.g., stolen SIM cards) or legal (e.g., verification services) resources to gain access to an operator's network in order to drive traffic to phone numbers obtained from an International Premium Rate Number (IPRN) provider. These numbers charge telecom companies higher fees because they were originally designed for paying for phone-based services. For instance, in the 90s, people would call weather forecasting services at higher prices; in such cases, customers did not need to pay extra at the point of call since the fees were included in their phone bills. However, this billing feature has become the part that fraudsters abuse in IRSF.

Figure 1.1.1: Overview of an IRSF case

There are mainly three parties involved in the IRSF scheme from Sinch's perspective:

• IPRN providers: The owners of IPRNs, who profit from the higher fees they charge.

• Fraudsters / call generators: People who use various approaches to generate traffic to IPRNs so as to share the profits with IPRN providers.

• Telecommunications companies: Companies that provide telecom services, such as making calls on behalf of their customers. Telecom companies have to pay for the calls to IPRNs and pass the cost down to their customers.

An example of an IRSF case is depicted in Figure 1.1.1. In the normal case (the path from users to normal calls, marked in blue), the telecom company's clients (the enterprises on the left of the figure) receive phone numbers from their own users and make calls to these numbers via the telecom company's services. During this process, the telecom company and its customers have to pay for transferring the data, but the price is usually acceptable because normal numbers do not charge such high fees.

In a fraud case, however (the red path from fraudsters to IRSF in Figure 1.1.1), the telecom company's customers call IPRNs supplied by fraudsters via the telecom company. If successful calls to those IPRNs are made, both the customers and the telecom company incur high charges, while the fraudsters and IPRN providers share the extra profits. Fraudsters tend to make as many phone calls as possible in a short period of time to maximize profits, because the numbers they use can easily be banned by the telecom company. To mitigate this loss, the telecom company should have mechanisms in place that detect these suspicious calls in its network, so that it can warn its customers and limit calls with similar patterns to avoid further losses.

The above is the most common fraud case for telecom companies, where call termination is at the center of the scheme. There is another type of fraud, introduced by Sahin et al. [27], in which, instead of calling IPRNs directly, fraudsters short-stop calls by cooperating with transit operators and misroute the traffic to the destination they desire. Routing an international call from the originating operator to the destination operator is usually the responsibility of multiple transit operators. Each operator on the call route determines the next hop based on its own protocols and routing algorithms. However, the operators along the route have only partial visibility of the two ends, and the protocols and algorithms they use are usually invisible to the others. Therefore, operators cooperating with fraudsters can reroute traffic to the target IPRNs and carry out such fraud; the original destinations of those rerouted calls can never be reached, which violates availability. This is the other scheme, in which the transit operators are involved.

The project is supported by Sinch, a telecommunications and cloud communications platform-as-a-service (CPaaS) company, which provided experienced experts and the dataset.

1.2 Related Work


For IRSF specifically, Sahin et al. [27] introduce IRSF in detail, from its ecosystem to its mechanisms, which is helpful for anyone who wants to learn about this field.

General fraud detection is a large and intensively studied field with a wide variety of techniques. Many of them follow a framework similar to intrusion detection systems, using either signature-based or anomaly-based detection. Signature-based detection methods aim to detect known attack-related patterns in the system, whereas anomaly-based detection methods memorize or learn what is considered normal and try to detect unknown anomalies, usually by means of machine learning models. The next two subsections briefly introduce the existing IRSF detection techniques and discuss their pros and cons. Moreover, because this project utilizes temporal patterns of phone call traffic, the related temporal methods are introduced as well.

1.2.1 Signature-based Detection Methods

Blocking Number Ranges

This is the simplest approach: blocking calls to frequent IRSF destinations or number ranges. However, the interference is extensive, since it also makes normal calls to the suspicious country unreachable. Fraudsters can also easily bypass this approach by hijacking the call during routing.

Crowdsourced Blocklists

This approach is similar to the previous naive one, but it uses pre-defined lists of specific numbers rather than number ranges to block calls. The interference is therefore somewhat smaller, producing fewer false positives. Such lists are usually maintained by organizations like the GSM Association (GSMA) and the Communications Fraud Control Association (CFCA) and shared with their members. However, according to the study in [27], the number space being abused for IRSF is quite large, so it is difficult to cover all the numbers with such lists. In addition, it is easy for fraudsters to rotate and create new IPRNs, which makes such lists outdated.

Monitoring for Test IPRNs

This approach tries to detect IRSF in advance. IPRN providers often provide test numbers so that fraudsters can check, before committing actual fraud, whether their methods can generate traffic to the providers' networks. Once test calls are detected, the company (either the telecom company or the callees) can block calls in the following short period, or calls whose numbers are similar to the test calls. This approach turns the problem into the detection of test calls, which shares the same weaknesses as the previous two approaches.

1.2.2 Anomaly-based Detection Methods

From the description above, we can see that signature-based methods rely heavily on keeping the IRSF database up to date, which makes them vulnerable to unknown attacks. To overcome this vulnerability, anomaly-based methods have come into use. These methods aim to detect fraud by learning normal actions or patterns from historical records, so that unknown actions are classified as anomalies.

The biggest challenges for anomaly detection are imbalanced data and the diversity of anomalies, which means the training is often unsupervised. Specifically, in the IRSF scenario, the lack of labels also complicates evaluation. The main goal for anomaly-based IRSF detection is therefore to cope with the lack of ground truth. The following anomaly-based approaches use different methods to create ground truth for evaluation and provide different solutions for the detection task.

In 2012, Jiang et al. [18] focused on mobile networks and proposed constructing an international outgoing call graph representing voice calls from domestic callers to foreign recipients. The authors used a Markov clustering method to isolate fraudulent activity. They acquired six months' worth of calls from a large mobile provider and then utilized customer complaints and other internet forums as labels for fraud. The idea of using graphs to memorize patterns in historical traffic inspired the design of this project.

More recently, Meijaard et al. [26] used Isolation Forest to detect fraud and evaluated the model against the results of other existing anti-fraud solutions, although the authors did not specify what those solutions are. Meijaard's work also explores the effectiveness of using only pre-call features, i.e., features obtained before a call is answered, for the fraud detection task.

The above two are unsupervised methods that barely involve prior human knowledge. As pointed out by Sahin et al. [27], these two models therefore tend to produce too many false positives. To improve on this, Sahin et al. [27] manually label a batch of data and use the Random Forest algorithm, treating detection as an imbalanced supervised task. Their evaluation is based on labels they created themselves; in particular, they utilize gathered test IPRNs to identify potential fraud instances.

Figure 1.2.1: The comparison of suspicious (left) and normal traffic (right)

In conclusion, the recent unsupervised methods suffer from producing too many false positives. To reduce false positives in an unsupervised manner, this project introduces temporal methods for more accurate detection.

1.2.3 Temporal Methods

Because fraudsters tend to make as many calls as possible in a small amount of time, unusual peaks like the ones marked with red rectangles in Figure 1.2.1 are expected. Inspired by this observation, this project introduces temporal methods into the model. The work listed below has motivated the temporal aspects of the model design.

Zhao et al. [38] use data from sensors in Los Angeles to predict traffic in the city. The paper treats every sensor as a node in a graph, uses a Graph Convolution Network to extract spatial information about the sensors, and then passes the extracted features to a GRU unit to incorporate temporal information. Their experiments demonstrate that the model captures the spatio-temporal correlation in traffic data, and its predictions outperform state-of-the-art baselines on real-world traffic datasets.

Zhao et al. [37] aim to detect time-series anomalies in real-world datasets. The main contribution of this work is the proposal of two parallel GAT layers that learn the complex dependencies of multivariate time series in both the temporal and the feature dimensions. In addition, Zhao et al. [37] jointly optimize a forecasting-based model and a reconstruction-based model, obtaining better time-series representations through a combination of single-timestamp prediction and reconstruction of the entire time series. In extensive experiments, their proposed method outperformed other state-of-the-art models on three real-world datasets.

1.3 Research Questions and Contributions


Due to the lack of ground truth for this problem, fine-grained detection for every record is sometimes impossible. One workaround is to first detect when fraud might have happened. The detection task for IRSF can therefore be split into two subtasks:

T1: Detect time spans in which fraud could have happened.

T2: Classify whether the records within the time spans produced by T1 are fraud.

The final goal remains T2; our work adds an intermediate step. Moreover, the intermediary task T1 helps experts to reduce their search space for fraud, lowering the workload of manual detection.

Inspired by Zhao et al. [38], we consider countries as the nodes of the telecom network, by analogy with sensors in city traffic. In addition, Jiang et al. [18] show that graphs are effective structures for the detection task. Our proposed method first uses a GAT-based temporal model similar to Zhao et al. [37] to select suspicious traffic and then applies an Isolation Forest for the detection, as suggested by Meijaard et al. [26].

To evaluate the effectiveness of the proposed approach, the following research questions should be answered:

Q1: Is the model robust and effective when trained on unlabeled data?

Q2: Is the temporal module able to sort out suspicious time slots based on historical data?

Q3: Is the combined model able to reduce false positives compared to a single Isolation Forest?

In this project, we compare the proposed methods with a classical anomaly detection model, the Isolation Forest, and investigate the impact of different experimental settings. Specifically, the main contributions of the project are:

• The application of a novel temporal model based on Graph Attention Networks to the detection of IRSF on real data.

• A framework that improves IRSF detection by combining two approaches, namely a temporal model and an anomaly detection model.

• A comparison of the proposed method's performance with previous related literature.

The next chapter introduces the theoretical basics of this project, and the details of our approach are explained in Chapter 3.

Chapter 2

Theoretical Background

This chapter introduces the general techniques and theories involved in this project.

2.1 Isolation Forest


Isolation Forest (IF) [23] is an ensemble-based algorithm with linear time complexity and acceptable performance on high-volume data. It is therefore widely used in industry, including for intrusion detection and fraud detection.

IF deals with continuous structured data and aims to detect outliers using isolation (how far a sample is from the rest of the data). From a statistical perspective, data samples have a low probability of lying in areas where points are sparsely distributed. In other words, IF detects anomalies that satisfy the following requirements:

• They account for only a small portion of the whole dataset.

• They follow distributions different from those of normal samples.

Figure 2.1.1: Anomaly Detection With Isolation Forest [1]



The pseudocode of the algorithm is shown below:

Algorithm 1 iForest(X, n, Ψ)
Input: X: input data, n: number of trees, Ψ: sub-sampling size
Output: a forest of n iTrees
1: Forest.init()
2: set the height limit l = ⌈log2 Ψ⌉
3: for i = 1 to n do
4:     X′ = Ψ data points sampled from X
5:     Forest = Forest ∪ iTree(X′, 0, l)
6: return Forest

Algorithm 2 iTree(X, e, l)
Input: X: input data, e: current height of the tree, l: height limit
Output: an iTree
1: if e ≥ l or |X| ≤ 1 then
2:     return LeafNode{Size = |X|}
3: randomly select an attribute q from all attributes of X
4: randomly select a split point p from (q_min, q_max)
5: X_l = X_{q<p}
6: X_r = X − X_l
7: return Node{Left = iTree(X_l, e + 1, l), Right = iTree(X_r, e + 1, l), attr = q, value = p}

Algorithm 3 PathLength(x, T, e)
Input: x: a data point, T: an iTree, e: current path length, initialized to 0
Output: h(x): the path length of x
1: if T is a leaf then
2:     return e + c(T.size)
3: a = T.attr
4: if x_a < T.value then
5:     return PathLength(x, T.Left, e + 1)
6: else
7:     return PathLength(x, T.Right, e + 1)

Algorithm 1 and Algorithm 2 are used for the training stage. In particular, Algorithm 2 constructs a tree from a given subset of the training data, and Algorithm 1 integrates the constructed trees into a forest as the final ensemble model.

For the evaluation stage, the following formula is used for normalizing the path length of a sample $x$ in the pseudocode of Algorithm 3:

$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \qquad H(n) = \ln(n) + \gamma$$

where $\gamma$ is the Euler-Mascheroni constant. With the path length, the anomaly score $s$ of an instance $x$ in an $n$-sample dataset is defined as:

$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$

where $E(h(x))$ denotes the expectation of $h(x)$ over the forest. A high score indicates that few splits are required to separate the point from the rest, so it is considered an outlier. Conversely, with a low score, many splits are required to separate the data point, so it is considered similar to the other data points. Because the sampling process, the selection of attributes, and the split points are all random, the anomaly score from a single tree is unreliable. Averaging over the forest is therefore needed to obtain stable, low-variance results.
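As an illustration of how the anomaly score is computed, the minimal numpy sketch below evaluates $c(n)$ and $s(x, n)$ for a hypothetical set of per-tree path lengths; the values are made up for demonstration only.

```python
import numpy as np

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful BST search, used to normalise h(x)."""
    if n <= 1:
        return 0.0
    return 2.0 * (np.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(path_lengths, n):
    """s(x, n) = 2^(-E[h(x)] / c(n)); scores near 1 indicate likely outliers."""
    return 2.0 ** (-np.mean(path_lengths) / c(n))

# Hypothetical path lengths of one sample across a 100-tree forest built on
# sub-samples of size n = 256.
path_lengths = np.random.uniform(3, 9, size=100)
print(anomaly_score(path_lengths, n=256))
```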

2.2 RNN and GRU

Gated Recurrent Unit (GRU) [7] is a variation of the Recurrent Neural Network (RNN). Unlike feed-forward networks and the Multi-Layer Perceptron (MLP), whose outputs depend only on the present inputs, the outputs of RNN models also depend on the state of the previous iteration. This characteristic enables RNNs to model relationships within a sequence. However, naive RNNs have difficulty dealing with long dependencies in a sequence because of the nature of the Back Propagation Through Time (BPTT) algorithm: the gradient at short range dominates the total gradient, which means that information from the early stages is forgotten during training. Long Short-Term Memory (LSTM) and GRU were therefore introduced to alleviate this problem, using gate mechanisms to memorize long-term information. This chapter only covers GRU, since the project only uses GRU.

2.2.1 RNN

Figure 2.2.1 is an illustration of an RNN unfolded over time, where $\mathbf{x}_t$ is the input at time $t$, $\mathbf{h}_t$ is the hidden state at $t$, and $U$, $W$ and $V$ are the weights for the input $\mathbf{x}$, the previous hidden state $\mathbf{h}$ and the output $\mathbf{o}$, respectively (bold symbols represent vectors or tensors in this thesis). Notice that $U$ and $V$ are shared across all $\mathbf{x}_t$ and $\mathbf{h}_t$.

Figure 2.2.1: Naive RNN unfolded by time, adapted from [11]

$L_t$ represents the loss function at $t$. The RNN model performs the following computation at every iteration, where $\sigma$ is an arbitrary activation function such as ReLU:

$$\mathbf{h}_t = \tanh(U\mathbf{x}_t + W\mathbf{h}_{t-1} + \mathbf{b}_h) \tag{2.1}$$
$$\mathbf{o}_t = \sigma(V\mathbf{h}_t + \mathbf{b}_o)$$

where $\mathbf{b}_h$ and $\mathbf{b}_o$ are bias terms. These formulas show that the present hidden state $\mathbf{h}_t$ is determined by the present input $\mathbf{x}_t$ and the previous hidden state $\mathbf{h}_{t-1}$, while the present output $\mathbf{o}_t$ depends only on $\mathbf{h}_t$. When using BPTT for optimization, the gradient with respect to the weights $W$ can be written as:

$$\frac{\partial L_t}{\partial W} = \sum_{i=1}^{t} \frac{\partial L_t}{\partial \mathbf{o}_t} \frac{\partial \mathbf{o}_t}{\partial \mathbf{h}_t} \left( \prod_{j=i+1}^{t} \frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}} \right) \frac{\partial \mathbf{h}_i}{\partial W}$$

The product term $\prod_{j=i+1}^{t} \frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}}$ in the gradient of $W$ causes the vanishing or exploding gradient problem, which is related to the well-known long-dependency issue in RNNs. $\frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}}$ is itself a Jacobian matrix, which can be written analytically as:

$$\frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}} = (1 - \mathbf{h}_j^2)W \tag{2.2}$$


When dealing with long sequences, the product term can become either very small or very large because of the repeated multiplication of these matrices, depending on the norm of $W$. In both cases, information is lost. When the norm of $\frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}}$ is greater than 1, the gradients can become very large, leading to optimization failure. When the norm of $\frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}}$ is less than 1, the gradients from early states, such as $\frac{\partial \mathbf{h}_{100}}{\partial \mathbf{h}_1}$, are numerically close to 0, which makes the model learn only short-term dependencies. The exploding problem can usually be handled by gradient clipping, so the key issue is the vanishing problem. To solve it, gate-based RNNs like GRU were proposed.

2.2.2 GRU

In a GRU, instead of formula 2.1, the computations are as follows:

• Update gate: $\mathbf{z}_t = \sigma(W_z\mathbf{x}_t + U_z\mathbf{h}_{t-1})$

• Reset gate: $\mathbf{r}_t = \sigma(W_r\mathbf{x}_t + U_r\mathbf{h}_{t-1})$

• New memory: $\tilde{\mathbf{h}}_t = \tanh(W\mathbf{x}_t + \mathbf{r}_t \odot U\mathbf{h}_{t-1})$

• Final memory: $\mathbf{h}_t = \mathbf{z}_t \odot \mathbf{h}_{t-1} + (1 - \mathbf{z}_t) \odot \tilde{\mathbf{h}}_t$

where $\odot$ denotes element-wise multiplication. If the update gate $\mathbf{z}_t$ is close to 0, the previous state $\mathbf{h}_{t-1}$ is removed from the new memory and only new information is memorized. The reset gate controls how much the present input $\mathbf{x}_t$ and the previous state $\mathbf{h}_{t-1}$ change the current state. The idea is very similar to gates in a circuit, as shown in Figure 2.2.2: when $\mathbf{z}_t$ is close to 0, the gate is effectively turned off, blocking the previous information from being carried into the new memory, and vice versa when $\mathbf{z}_t$ is close to 1. From the mathematical perspective, the product term in a GRU can be written as:

$$\prod_{j=i+1}^{t} \frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}} = \prod_{j=i+1}^{t} \left( \mathbf{z}_j + (1 - \mathbf{z}_j)(1 - \tilde{\mathbf{h}}_j^2)(\mathbf{r}_j \odot U) \right)$$

The biggest change compared with formula 2.2 is the added $\mathbf{z}_j$ term, which turns the product into a polynomial containing both high-order and low-order terms; the low-order terms provide relatively larger gradients because they involve fewer multiplications. The naive formula 2.2, in contrast, only contains high-order terms, making the gradients vanish very quickly. This idea is very similar to the residual connections in large deep neural networks such as ResNet [16] and BERT [9], except that in GRU and LSTM the residual connections are built across time steps instead of depth.
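To make the gate equations concrete, the following numpy sketch performs a single GRU update for toy dimensions; the weight matrices are random placeholders and biases are omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU update following the equations above (biases omitted)."""
    z = sigmoid(Wz @ x_t + Uz @ h_prev)            # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)            # reset gate
    h_tilde = np.tanh(W @ x_t + r * (U @ h_prev))  # new memory candidate
    return z * h_prev + (1.0 - z) * h_tilde        # final memory h_t

# Toy dimensions: 4-dimensional input, 3-dimensional hidden state.
rng = np.random.default_rng(0)
x_t, h_prev = rng.normal(size=4), np.zeros(3)
Wz, Wr, W = (rng.normal(size=(3, 4)) for _ in range(3))
Uz, Ur, U = (rng.normal(size=(3, 3)) for _ in range(3))
print(gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U))
```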


Figure 2.2.2: The structure of a single GRU cell [4]

2.3 Encoder-decoder Architecture


Encoder-decoder [5] is one of the most common model architectures in deep learning; many well-known models are designed on top of the encoder-decoder framework. For example, Cho et al. [5] use two RNNs as the encoder and the decoder for the machine translation task. The motivation is simple: two sentences with the same meaning should be close to each other in the encoded vector space. The model therefore first uses the encoder to map the input sentence to a vector in a space of fixed dimension. Then the same vector is processed by the decoder, which translates the vectorized sentence into the target language. This architecture (Figure 2.3.1) is a universal framework and not restricted to certain models; developers can adapt it with different specific models depending on the downstream task, such as the MLP in BERT [34] and the CNN in image captioning [39].

Figure 2.3.1: Overview of the encoder-decoder architecture [33]

The input and output of the framework do not have to be different. When the decoder tries to regenerate the encoded input from a relatively low-dimensional space, the architecture is known as an auto-encoder. The advantage of such architectures is that models can learn an efficient coding, or latent representation, of the input data in an unsupervised manner.

Taking Figure 2.3.2 as an example, the training target of the auto-encoder is usually to minimize the error between the reconstructed data and the true data:

$$L(\mathbf{X}, \mathbf{X}') = \|\mathbf{X} - \mathbf{X}'\|_n$$

where $\mathbf{X}$ is the original input data and $\mathbf{X}'$ is the regenerated output from the decoder. $\|\cdot\|_n$ denotes the $n$-norm, with $n = 1$ and $n = 2$ being the common choices. In the field of anomaly detection, it is commonly assumed that anomalies are hard to reconstruct from the low-dimensional latent space because they do not come from the same distribution as the normal data, which makes anomaly detection feasible.

Figure 2.3.2: Autoencoder schema [25]
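A minimal Keras sketch of this idea is shown below: a small dense auto-encoder is trained to reconstruct its input under an l1 loss, and the per-sample reconstruction error can then be used as an anomaly score. The layer sizes and the random data are illustrative assumptions, not the architecture used later in this thesis.

```python
import numpy as np
from tensorflow import keras

# Minimal dense auto-encoder: 20-dimensional input, 4-dimensional latent space.
inputs = keras.Input(shape=(20,))
latent = keras.layers.Dense(4, activation="relu")(inputs)        # encoder
outputs = keras.layers.Dense(20, activation="linear")(latent)    # decoder
autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mae")                # l1 reconstruction loss

x = np.random.rand(256, 20).astype("float32")                    # toy "normal" data
autoencoder.fit(x, x, epochs=2, batch_size=32, verbose=0)

# Per-sample reconstruction error, usable as an anomaly score.
errors = np.mean(np.abs(x - autoencoder.predict(x, verbose=0)), axis=1)
print(errors[:5])
```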

2.4 Variational Auto-encoder

Variational Auto-Encoder (VAE) [19] is the combination of an auto-encoder with variational Bayesian methods, introducing a prior distribution over the latent variables. This makes the VAE a probabilistic model rather than a deterministic one, which means there is uncertainty during the inference stage of a trained model. The VAE is based on the latent variable model from the theory of probabilistic graphical models, illustrated in Figure 2.4.1. There, the rectangle is "plate notation", meaning that we can sample $\mathbf{z}$ and $\mathbf{x}$ $N$ times while the model parameter $\theta$ remains fixed. Mathematically, this means the joint distribution can be written as:

$$P(\mathbf{x}, \mathbf{z}) = P_\theta(\mathbf{x}|\mathbf{z})P(\mathbf{z}) \tag{2.3}$$

Figure 2.4.1: The standard VAE model represented as a graphical model [10]


In other words, the latent variable $\mathbf{z}$ determines the generation of $\mathbf{x}$.

Figure 2.4.2: The architecture of VAE, from CS231n [22]

VAE belongs to the family of likelihood-based generative models, meaning that its goal is to maximize the likelihood. Let $P(\mathbf{x})$ be the likelihood of the observed data and $q_\phi(\mathbf{z}|\mathbf{x})$ be the proposed posterior of $\mathbf{z}$ (the encoder in Figure 2.4.2, with parameters $\phi$). Then:

$$\ln P(\mathbf{x}) = \ln \frac{P(\mathbf{x}, \mathbf{z})}{P(\mathbf{z}|\mathbf{x})} = \ln P(\mathbf{x}, \mathbf{z}) - \ln P(\mathbf{z}|\mathbf{x}) = \ln \frac{P(\mathbf{x}, \mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})} - \ln \frac{P(\mathbf{z}|\mathbf{x})}{q_\phi(\mathbf{z}|\mathbf{x})} \tag{2.4}$$

Because $\ln P(\mathbf{x})$ does not depend on $\mathbf{z}$, it remains unchanged when taking the expectation over $q_\phi(\mathbf{z}|\mathbf{x})$ on both sides. Therefore:

$$\ln P(\mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\ln \frac{P(\mathbf{x}, \mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})}\right] - \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\ln \frac{P(\mathbf{z}|\mathbf{x})}{q_\phi(\mathbf{z}|\mathbf{x})}\right] = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\ln \frac{P(\mathbf{x}, \mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})}\right] + \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\ln \frac{q_\phi(\mathbf{z}|\mathbf{x})}{P(\mathbf{z}|\mathbf{x})}\right]$$

Notice that the second term $\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\ln \frac{q_\phi(\mathbf{z}|\mathbf{x})}{P(\mathbf{z}|\mathbf{x})}\right]$ is exactly the KL divergence between $q_\phi(\mathbf{z}|\mathbf{x})$ and $P(\mathbf{z}|\mathbf{x})$, and by Jensen's inequality [17] the KL divergence is never less than 0, so:

$$\ln P(\mathbf{x}) \geq \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\ln \frac{P(\mathbf{x}, \mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})}\right] = \mathrm{ELBO}$$

The first term is known as the Evidence Lower Bound (ELBO), because it provides a lower bound on the likelihood. According to formula 2.3, the ELBO breaks down into:

$$\mathrm{ELBO} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\ln \frac{P_\theta(\mathbf{x}|\mathbf{z})P(\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})}\right] = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\ln P_\theta(\mathbf{x}|\mathbf{z})\right] + \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\ln \frac{P(\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})}\right] = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\ln P_\theta(\mathbf{x}|\mathbf{z})\right] - \mathrm{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, P(\mathbf{z})) \tag{2.5}$$

The advantage of this is that the optimization target is now switched to maximizing the ELBO, without direct parameterization of the original data as in maximum likelihood estimation. Notice that in Figure 2.4.2, $P_\theta(\mathbf{x}|\mathbf{z})$ is the output of the decoder, which means it can be obtained directly during inference. Moreover, in the course of maximizing the ELBO, the model learns the relationship between the two random variables, the data $\mathbf{x}$ and the latent variables $\mathbf{z}$. It is therefore necessary to specify the forms of $P_\theta(\mathbf{x}|\mathbf{z})$, $q_\phi(\mathbf{z}|\mathbf{x})$ and $P(\mathbf{z})$.

For $q_\phi(\mathbf{z}|\mathbf{x})$, it is common to use a normal distribution (or another distribution from the exponential family) whose parameters $\{\boldsymbol{\mu}, \boldsymbol{\sigma}\}$ are provided by two neural networks. In order for $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ to be optimized during back-propagation, the reparameterization trick is applied, which uses the linearity of the normal distribution as illustrated in Figure 2.4.2. As the posterior is normal, the prior is usually also chosen as the standard normal $\mathcal{N}(0, I)$. The KL-divergence term in formula 2.5 can then be written as:

$$\mathrm{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, P(\mathbf{z})) = \frac{1}{2}\sum_{i=1}^{d}\left(\mu_i^2 + \sigma_i^2 - \ln\sigma_i^2 - 1\right) \tag{2.6}$$

where $d$ is the dimension of the latent variable $\mathbf{z}$. As $P_\theta(\mathbf{x}|\mathbf{z})$ is more task-relevant, it is explained in the next chapter.

In a nutshell, a VAE can be considered as an auto-encoder whose latent variables are assumed to follow a certain prior distribution.
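The closed-form KL term of formula 2.6 and the reparameterization trick are short enough to sketch directly. The numpy snippet below assumes the encoder outputs a mean vector and a log-variance vector, which is a common but not the only parameterization.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(q(z|x) || N(0, I)) from formula 2.6, per sample."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0, axis=-1)

def reparameterize(mu, log_var, rng=np.random.default_rng(0)):
    """z = mu + sigma * eps keeps sampling differentiable w.r.t. mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Two toy samples with a 2-dimensional latent space.
mu = np.array([[0.1, -0.2], [1.0, 0.5]])
log_var = np.array([[0.0, 0.1], [-0.3, 0.2]])
print(kl_to_standard_normal(mu, log_var))
print(reparameterize(mu, log_var))
```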

2.5 GAT (Graph Attention Neural Network)


Convolutional Neural Networks (CNN) [21] have proven to be a strong tool in computer vision. The reason is that they use filters (kernels) with shared parameters to form feature maps by calculating the weighted sum of a central pixel and its adjacent pixels, allowing them to extract spatial features effectively. However, the pixels in the image or video data processed by a CNN are arranged in a very regular grid, which makes naive CNNs unable to deal with data with non-Euclidean structures, like social networks and molecules. Since researchers still want to utilize convolution to extract features from such data, the Graph Convolution Network (GCN) [20] was introduced to satisfy this need, using concepts from graph theory. There are mainly two approaches to implementing graph convolution: the spatial-domain approach and the frequency-domain approach. Because frequency-domain approaches are more complicated and GAT focuses on spatial information, only the spatial-domain approach is introduced here. A spatial GCN aggregates each node's information with that of its neighbors.

Figure 2.5.1: A comparison between CNN(left) and GCN(right) [8]

Traditional CNN kernels can be seen as a specialization of the spatial GCN, whose nodes are arranged in a grid graph as in Figure 2.5.1. Given a graph $G = \{V, E\}$, any node $i \in V$ can be characterized by a node vector $\mathbf{X}_i$. The extracted feature, or node representation, of node $i$ is obtained by using the associated information (the edges or the adjacency matrix $A$ of the graph, as in Figure 2.5.3). Assuming the node vector $\mathbf{X}_i \in \mathbb{R}^n$ and $W \in \mathbb{R}^{n \times d}$, the feature extraction can be described as:

$$\mathbf{h}_i = \sigma(\mathrm{Aggregate}(\mathbf{X}_i)W), \qquad \mathrm{Aggregate}(\mathbf{X}_i) = \sum_{j \in \mathrm{neighbor}(i)} \mathbf{X}_j A_{ij}$$

Figure 2.5.2: An example of the forward pass in a GCN [30]

For a numerical example, one may take a look at Figure 2.5.2, where each node vector $\mathbf{X}_i$ is 3-dimensional and the final output is 2-dimensional. After the aggregation operation, the intermediate results are passed to a dense layer with a self-defined activation function $\sigma(\cdot)$ for nonlinearity.

Figure 2.5.3: The average-aggregation GCN

However, the way GCN aggregates messages is structure-dependent, which may negatively affect its generalizability; in other words, the fixed structure limits the expressiveness of GCN. Therefore, inspired by the attention mechanism in the Transformer [34], the attention value for every node pair $i$ and $j$ is defined as:

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\mathbf{a}^\top [(\mathbf{h}_i W) \,\|\, (\mathbf{h}_j W)]\right)\right)}{\sum_{k \in \mathcal{N}(i) \cup \{i\}} \exp\left(\mathrm{LeakyReLU}\left(\mathbf{a}^\top [(\mathbf{h}_i W) \,\|\, (\mathbf{h}_k W)]\right)\right)}$$

where $\mathbf{h}_i \in \mathbb{R}^d$, $\mathbf{a} \in \mathbb{R}^{2d}$ and $\|$ is the concatenation operation. The input is denoted $\mathbf{h}_i$ here because GAT can also be applied as a middle layer that takes hidden states as inputs. The output of the present GAT layer is then:

$$\mathbf{h}'_i = \begin{cases} \sigma\left(\frac{1}{K}\sum_{k=1}^{K}\sum_{j \in \mathcal{N}_i} \alpha_{ij}^k \mathbf{h}_j W^k\right) & \text{avg} \\ \big\|_{k=1}^{K}\, \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij}^k \mathbf{h}_j W^k\right) & \text{concat} \end{cases} \tag{2.7}$$

where $K$ is the number of heads of a GAT layer. The attention mechanism gives GAT adaptive weights for different inputs, since $\alpha$ depends on $\mathbf{h}_i$, making the model more generalizable. Similar to Transformers, GAT can also be equipped with multi-head attention, allowing the model to jointly combine information from different representation subspaces starting from different initializations. As in formula 2.7, there are usually two options to combine the $K$ heads: concatenating or averaging them.

Figure 2.5.4: The computation of GAT [35]
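The attention computation above can be sketched in a few lines of numpy for a single head; the toy complete graph, the random weights, and the choice of tanh as the activation $\sigma$ are illustrative assumptions.

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def gat_layer(H, A, W, a):
    """Single-head GAT sketch. H: (N, d_in) node features, A: (N, N) adjacency
    with self-loops, W: (d_in, d_out) projection, a: (2*d_out,) attention vector."""
    HW = H @ W                                   # project node features
    d_out = W.shape[1]
    # a^T [h_i W || h_j W] splits into a "source" part and a "target" part.
    e = leaky_relu((HW @ a[:d_out])[:, None] + (HW @ a[d_out:])[None, :])
    e = np.where(A > 0, e, -1e9)                 # attend only over neighbours
    alpha = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)  # row-wise softmax
    return np.tanh(alpha @ HW)                   # sigma(sum_j alpha_ij h_j W)

# Toy complete graph: 4 nodes, 3-dimensional features, 2-dimensional output.
rng = np.random.default_rng(0)
H, A = rng.normal(size=(4, 3)), np.ones((4, 4))
print(gat_layer(H, A, rng.normal(size=(3, 2)), rng.normal(size=4)))
```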

2.6 Evaluation Metrics

Classification tasks have many metrics, most of which only partially reflect the performance of a model. If a metric is not used appropriately, problems in the model may go unnoticed and wrong conclusions may be drawn. It is therefore important to choose metrics relevant to the target task.

The main metrics used to evaluate the performance of the model in this thesis are precision, recall, F1 score, AUC and mean absolute error (MAE). For comparison with previous work, the false positive rate (FPR) and true positive rate (TPR) are also used. Except for MAE, all metrics are used for the evaluation of the classification.

Precision is the ratio between correct positive predictions and the total number of predicted positives, showing how confident the model is about its positive predictions:

$$\mathrm{Precision} = \frac{TP}{FP + TP} \tag{2.8}$$

Recall is the number of true positives among the predictions divided by the number of all positives. This metric describes how unlikely the model is to miss true positives:

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{2.9}$$

There is usually a trade-off between the two metrics, which means it is hard to keep both high at the same time. Therefore, the harmonic mean of precision and recall, the F1 score, is often used to evaluate performance more thoroughly:

$$F_1 = 2 \times \frac{1}{\frac{1}{\mathrm{Precision}} + \frac{1}{\mathrm{Recall}}} \tag{2.10}$$

The False Positive Rate (FPR) and True Positive Rate (TPR) are calculated as:

$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{2.11}$$
$$\mathrm{TPR} = \frac{TP}{TP + FN} \tag{2.12}$$

where TPR is equivalent to recall and FPR is the proportion of negatives misidentified as positives. These two metrics reflect the chance that, for a given dataset, samples are correctly or incorrectly classified as positive.

However, the above metrics are threshold-dependent: different thresholds correspond to different classification results, which makes comparing models with these metrics difficult. Another popular option is therefore the AUC (Area Under the ROC Curve), which measures model performance more thoroughly. The ROC (receiver operating characteristic) curve reflects the relationship between FPR and TPR, with the false positive rate on the x-axis and the true positive rate on the y-axis, as shown in Figure 4.2.7. Different thresholds correspond to different points on the curve, and the closer the curve is to the upper-left corner, the better the model usually performs. The AUC captures this expectation: a well-performing classifier should have a large AUC.
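For reference, the sketch below computes these metrics with scikit-learn on hypothetical labels and scores; thresholded predictions feed the precision/recall/F1 computation, while the raw scores feed the AUC.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Hypothetical ground-truth labels and anomaly scores, for illustration only.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
scores = np.array([0.1, 0.4, 0.8, 0.7, 0.3, 0.9, 0.2, 0.6])

y_pred = (scores > 0.5).astype(int)  # threshold-dependent metrics need a cut-off
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
print(roc_auc_score(y_true, scores))  # threshold-independent AUC uses raw scores
```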


Figure 2.6.1: Example of ROC curve [28]

Chapter 3

Proposed Methods

This chapter describes how the data are processed and presents the design of the proposed methods using the previously introduced algorithms.

3.1 Data Description


The raw data provided by Sinch are Call Detail Records (CDR), the format used for describing phone call actions. The most important fields are shown in Table 3.1.1. According to Meijaard et al. [26] and Sahin et al. [27], the other fields are considered irrelevant for IRSF detection. For privacy reasons, phone numbers in the records are anonymized (replaced with randomly generated numbers), which increases the difficulty of detection, because numbers are informative features: as introduced in the previous chapter, IPRNs tend to share patterns such as common prefixes. Records are preprocessed differently to fit the input requirements of the Isolation Forest and the temporal model. For the temporal method we only care about traffic volume at a given frequency, so it receives a time-series input, whereas the Isolation Forest takes the CDR themselves as input rather than a generated time series.

Table 3.1.1: The fields in the CDR

Id: The unique key for a record in the record table, used to identify records during evaluation (string).
Caller Number: The phone number of the caller, in an anonymized format (string).
Callee Number: The phone number of the callee, in an anonymized format. It could be an IPRN in a fraud case (string).
Source: The country the caller's number belongs to (string, 200+ unique values).
Dest: The country the callee's number belongs to (string, 200+ unique values).
Duration: The duration of the call (float).
Debit: The cost of the call for the company (float).
Reason: The reason for the termination of the corresponding call (string, 5 unique values).
Timestamp: The timestamp indicating when the call started (yymmddhhmmss).

3.2 Feature Engineering for the Isolation Forest


The goal of the Isolation Forest is, given data $X$ of shape $n \times d$, where $d$ is the number of attributes and $n$ is the number of samples, to produce anomaly scores $Y \in \mathbb{R}^n$ with components $Y_i \in (0, 1)$; the closer a score is to 1, the more suspicious the record.

According to Table 3.1.1, the only numerical features are debit and duration. As introduced earlier, the Isolation Forest only accepts numerical features, so the other features need further preprocessing. Fortunately, the H2O [14] library offers out-of-the-box support for categorical data, one-hot encoding the discrete values by default.

Figure 3.2.1: Geographical representation of the destination of the IRSF fraud. [27]

According to [26], a pure Isolation Forest based on CDR without any prior information produces too many false positives, so more IRSF-related knowledge needs to be added to the model. Figure 3.2.1 shows that most fraudulent calls are made to developing countries. On the advice of Sinch's experts, the GDP (crawled from the World Bank [36]) of both the source and the destination country is added to the input features. GDP reflects the development status of a country, and countries with similar GDP may behave similarly, so using it as a feature introduces a piece of prior knowledge into the model. Moreover, thanks to the nature of the Isolation Forest, careful scaling of the numerical features (which is essential for sensitive models like the one-class SVM [32]) is unnecessary. The final input features for the forest are listed in Table 3.2.1:

Table 3.2.1: The input for Isolation forest

source dest duration reason dest_gdp source_gdp
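A minimal sketch of this feature construction is shown below. The CDR rows and GDP figures are hypothetical, and scikit-learn's IsolationForest with explicit one-hot encoding stands in for the H2O estimator used in the project, which handles categorical columns natively.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical CDR rows; column names follow Table 3.2.1.
cdr = pd.DataFrame({
    "source": ["SE", "SE", "US"],
    "dest": ["CU", "SE", "LV"],
    "duration": [12.0, 310.0, 4.5],
    "reason": ["completed", "completed", "rejected"],
})
gdp = {"SE": 541.0, "US": 20937.0, "CU": 103.0, "LV": 34.0}  # illustrative values only
cdr["source_gdp"] = cdr["source"].map(gdp)
cdr["dest_gdp"] = cdr["dest"].map(gdp)

# One-hot encode the categorical columns and fit an Isolation Forest as a stand-in.
X = pd.get_dummies(cdr, columns=["source", "dest", "reason"])
scores = -IsolationForest(n_estimators=100, random_state=0).fit(X).score_samples(X)
print(scores)  # higher = more anomalous
```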

3.3 Temporal Model Design

Table 3.3.1: Notation for the temporal model

k: the number of countries
n: the length of the sliding window
d: the dimension of the latent space
T: the length of the largest time span in the dataset
y_pre: the output of the forecasting-based model
y_rec: the output of the reconstruction-based model
f: the length of a bin

To detect sudden traffic like that in Figure 1.2.1, it is necessary to extract time series of traffic from the original records. In this project, calls are counted at a given frequency from the earliest timestamp to the latest for every country, producing a data matrix $X \in \mathbb{R}^{k \times T}$, where $k$ is the number of countries and $T$ is the number of bins (each bin represents $f$ minutes) over the whole time span, e.g., one month.
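A possible way to build this $k \times T$ matrix with pandas is sketched below; the CDR rows are hypothetical and only the destination country and call start time are used.

```python
import pandas as pd

# Hypothetical CDR subset: one row per call with destination country and start time.
cdr = pd.DataFrame({
    "dest": ["CU", "CU", "LV", "CU", "LV"],
    "timestamp": pd.to_datetime([
        "2021-07-01 00:00:30", "2021-07-01 00:01:10", "2021-07-01 00:01:50",
        "2021-07-01 00:03:20", "2021-07-01 00:05:05",
    ]),
})

f = "2min"  # bin length f = 2 minutes, as in the experiments
traffic = (
    cdr.groupby([pd.Grouper(key="timestamp", freq=f), "dest"])
       .size()
       .unstack(fill_value=0)   # rows: time bins, columns: countries
       .T                       # transpose to the k x T data matrix X
)
print(traffic)
```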

Formulated from the temporal perspective, the target is, given the data matrix $X \in \mathbb{R}^{k \times T}$, to produce anomaly scores $Y \in \mathbb{R}^{k \times T}$ where every component of $Y$ lies in $(0, 1)$.

Inspired by Jiang et al. [18], graphs are considered suitable structures for memorizing patterns among phone calls, and our project uses an architecture similar to Zhao et al. [37] for the temporal model. Since the project utilizes a GAT-based model, the form of the node features must be decided; a sliding window of length $n$ is used here. The reason is that the model can preserve low-frequency information while maintaining the diversity of the data. In addition, as a VAE is used, the observations within a sequence share a single representation in the latent space $\mathbf{z}$, helping the model to capture the characteristics of the global distribution of the data. A pipeline like the one in Figure 3.3.1 is then applied. Compared to the original paper [37], the project's contributions include adapting the loss function to better handle the given dataset, and specifying each module of the pipeline, as the paper only provides high-level descriptions, especially for the auto-encoder part.

Figure 3.3.1: The overview of the temporal model [37]

The first two modules in the pipeline serve as a feature extractor, utilizing the power of GAT to extract features from a non-Euclidean space. The remaining steps are adapted from the auto-encoder architecture to learn the inherent relationship between the extracted features and the latent space. Below are the details of each stage introduced by Zhao et al. [37]:

1. First, a 1D convolution with kernel size 7 is applied to extract high-level features from each time-series input. As shown in recent work [31], 1D convolution improves local feature engineering within a sliding window for sequence data.

2. Two parallel graph attention (GAT) layers handle the output of the convolution layer, aiming to learn the relationships between multiple features and between timestamps, respectively.

3. The GRU module takes the concatenation of the two GAT outputs as input and maps it into the latent space $\mathbf{z}$. This layer captures sequential patterns in the time series and compresses the feature representation.

4. The outputs of the encoder are passed in parallel into a forecasting-based model and a reconstruction-based model for the final result. The forecasting-based model is implemented as a multilayer perceptron, and a VAE is adopted for the reconstruction.


3.3.1 Feature Extractor

First, the correlation among the multivariate data (the relationships between countries) needs to be learned without any prior knowledge. We therefore regard the multivariate time series as a complete graph, in which each node represents a country and each edge represents the relationship between the two corresponding countries. In this way, the relationships between adjacent nodes can be captured through the graph attention operation during optimization. This module is called the feature-oriented attention layer. Specifically, for the feature-oriented GAT, every country is considered a node in the graph, resulting in an adjacency matrix of shape $k \times k$ and $n$-dimensional node vectors, as shown on the left of Figure 3.3.2 and in Figure 3.3.3.

Figure 3.3.2: The illustration of feature-oriented (left) and time-oriented GAT layers
(right) [15].

Figure 3.3.3: The illustration of input node features in the feature-oriented layers.


Moreover, a second GAT layer is used to extract the local dependencies within a sliding window. Here, every node represents a single timestamp, and the input can be seen as a snapshot of the traffic of all countries. This is very similar to the Transformer [34], where a fully-connected self-attention operation attends over all words in a sequence. This part takes node features of dimension $k$ and an $n \times n$ complete graph as input and outputs a feature matrix of shape $n \times k$, as illustrated on the right of Figure 3.3.2.

To fuse the information from the different sources, the three types of $n \times k$ extracted features are concatenated into an $n \times 3k$ matrix, which serves as the next module's input.

3.3.2 Auto-encoder and joint optimization

As previously mentioned, the GRU encoder is used to capture sequential patterns and learn the sliding window's invariants. The model includes a forecasting-based model to predict the value at the next timestamp and a reconstruction-based model to fit the data distribution of the sliding windows. The GRU units iterate over the $n$ time steps, producing an encoded $d$-dimensional vector for later use.

For the prediction model, the task is, given the encoded $d$-dimensional vector, to predict the value of the following timestamp $\mathbf{y}_{pred} \in \mathbb{R}^k$. This is achieved with three fully-connected layers, the last of which has as many units as there are countries, as suggested by Zhao et al. [37]. The loss function for the forecasting task is:

$$\mathrm{Loss}_{pred} = \mathrm{MAE}(\mathbf{y}_{pred} - \mathbf{y}_{n+1})$$

where MAE denotes the mean absolute error, instead of the mean square error used in Zhao et al. [37].

For the reconstruction model, the task is to reconstruct the input features from the latent space. Having observed some sparseness in the data, we chose the Laplace distribution rather than a Gaussian as the specification for the decoder $P_\theta(\mathbf{x}|\mathbf{z})$, because the corresponding training objective minimizes an $\ell_1$ reconstruction loss, which promotes sparse reconstruction errors. This is based on the experience that anomalous traffic should be rare and sparse. To generate a sequence for the reconstruction model, another GRU is used as the decoder, which takes the latent vectors from the encoder as input. As described in the previous chapter, the second term of the ELBO (formula 2.6) can be derived analytically. The complete ELBO can now be written as:

$$\mathrm{ELBO} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\ln P_\theta(\mathbf{x}|\mathbf{z})] - \mathrm{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, P(\mathbf{z})) \propto -\mathrm{MAE}(\mathbf{x}_{rec} - \mathbf{x}_{true}) - \frac{1}{2}\sum_{i=1}^{d}\left(\mu_i^2 + \sigma_i^2 - \ln\sigma_i^2 - 1\right)$$

Instead of maximizing the PDF of the Laplace distribution directly, we found that optimizing the exponent of the PDF gives numerically more stable gradients without crashing the optimization, similar to what the original VAE [19] suggests. The total loss for the model is therefore:

$$\mathrm{Loss} = -\mathrm{ELBO} + \mathrm{Loss}_{pred}$$
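A compact numpy sketch of this joint objective, under the assumption that the encoder exposes the latent mean and log-variance, is given below; the toy shapes are illustrative.

```python
import numpy as np

def total_loss(x_true, x_rec, y_next_true, y_next_pred, mu, log_var):
    """Joint training objective: -ELBO (MAE reconstruction + KL) plus MAE forecasting loss."""
    rec = np.mean(np.abs(x_rec - x_true))              # MAE reconstruction term (Laplace exponent)
    kl = 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0)
    pred = np.mean(np.abs(y_next_pred - y_next_true))  # Loss_pred
    return rec + kl + pred                             # Loss = -ELBO + Loss_pred

# Toy shapes: window n = 5, k = 3 countries, latent dimension d = 4.
rng = np.random.default_rng(0)
print(total_loss(rng.random((5, 3)), rng.random((5, 3)),
                 rng.random(3), rng.random(3),
                 rng.normal(size=4), rng.normal(size=4)))
```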

3.3.3 Inference

The inference also combines the two jointly optimized models. The reconstruction model outputs a probability matrix $p$ of shape $n \times k$, whose component $p_{ij}$ shows how likely such traffic is at timestamp $i$ for country $j$. The output of the forecasting model, $\mathbf{y}_{pre}$ or $\mathbf{y}_{t+n}$, is a $k$-dimensional vector indicating what the value $\mathbf{y}_{t+n}$ at the next timestamp should be, given a sliding window $\{\mathbf{y}_t, \ldots, \mathbf{y}_{t+n-1}\}$.

The inference score is defined as:

$$s_{ij} = \frac{(1 - p_{ij}) + (1 + \gamma)\,|y_{pre\_ij} - y_{ij}|}{1 + \gamma}$$

where $p_{ij} = \frac{1}{2}e^{-|y_{ij} - y_{rec\_ij}|}$. The first term $(1 - p_{ij})$ indicates how unlikely the traffic at step $i$ in country $j$ is, and the second term $|y_{pre\_ij} - y_{ij}|$ measures the inconsistency between the predicted value for the next timestamp and the truth. A weighted average of the two terms is an alternative:

$$s_{ij} = (1 - \gamma)(1 - p_{ij}) + \gamma\,|y_{pre\_ij} - y_{ij}|$$

where $\gamma$ is a hyper-parameter that needs to be tuned manually.

To obtain a prediction for a single record, we need to combine the scores from the Isolation Forest and the GAT-based model. Every record is assigned an anomaly score depending on the timestamp it belongs to, because the proposed temporal model can evaluate every time slot. Denote the score from the temporal model as $s_{time}$ and the score produced by the tree-based model as $s_{tree}$. The output set of the proposed detection model is:

$$Y = X_{s_{time} > p_1 \,\&\&\, s_{tree} > p_2} \cup X_{s_{tree} > p_3}$$

where $p_3 > p_1$ and $X$ is the set of records. The first set, $X_{s_{time} > p_1 \,\&\&\, s_{tree} > p_2}$, represents calls that are considered suspicious by both models. The requirement for identifying fraud is stricter than for a single Isolation Forest, so this method should propose fewer false positives while maintaining an acceptable recall. The second set, $X_{s_{tree} > p_3}$, serves as compensation, because fraud can also happen at timestamps that are not suspicious from a temporal perspective. The proposed model in this project follows the framework illustrated in Figure 3.3.4: it first uses the temporal model to find suspicious time spans and then conducts the detection using the IF. In our experiments, $p_3$ is set to 0, because we found that the IF alone would only produce more false positives on our dataset, but it is worth investigating different settings in future work. Moreover, others may try different temporal or anomaly detection models and more flexible ways of combining the two results for their own case.
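The scoring and the set combination can be sketched as follows; the per-record scores and thresholds are hypothetical, and the weighted-average variant of the score is omitted for brevity.

```python
import numpy as np

def temporal_score(y_true, y_rec, y_pred, gamma=0.5):
    """Per-timestamp, per-country score s_ij from the first formula above."""
    p = 0.5 * np.exp(-np.abs(y_true - y_rec))  # Laplace likelihood of the observed traffic
    return ((1.0 - p) + (1.0 + gamma) * np.abs(y_pred - y_true)) / (1.0 + gamma)

def select_records(s_time, s_tree, p1, p2, p3):
    """Output set Y: records flagged by both models, plus the forest-only compensation set."""
    return ((s_time > p1) & (s_tree > p2)) | (s_tree > p3)

# Toy per-record scores for five calls (hypothetical values).
s_time = np.array([0.9, 0.2, 0.7, 0.4, 0.8])
s_tree = np.array([0.8, 0.9, 0.3, 0.5, 0.7])
print(select_records(s_time, s_tree, p1=0.6, p2=0.6, p3=0.95))
```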


Figure 3.3.4: The overview of the general framework

Chapter 4

Evaluation

4.1 Experiment

4.1.1 Preprocess Details

The data provided by Sinch include CDR for December 2020 and for March and July 2021. For the time-series extraction, the bin length $f$ is set to 2 minutes for every country, producing a time series for each month. We use the same sliding window size $n = 100$ for all models. To feed such long sequences to the temporal model, training is performed on overlapping sequences obtained with a sliding window of stride 1, which provides more data. As the time series are extremely unbalanced (only a few countries have high traffic volumes), we first take the logarithm of the data:

$$X = \log(X_{raw} + 1)$$

The effect of the logarithmic transformation is shown in Figure 4.1.1. While the partial order remains the same, the transformed data are more diversely distributed, which is beneficial for optimization. Moreover, high-volume traffic values are brought closer to each other, while large and small values remain well separated. This property supports better performance, because a couple of hundred extra calls may be negligible for high-traffic countries but significantly affects the analysis for low-traffic ones. For example, we want to be more lenient when facing an increase from 2k to 2.5k calls, but more alert when dealing with a spike from nothing to 0.5k.

Figure 4.1.1: The data distributions before (left) and after (right) the logarithmic transformation.

The time series are then normalized with the maximum and minimum values from the training data:

$$X_{in} = \frac{X - \min(X)}{\max(X) - \min(X)}$$

where $X_{raw}$ is of shape $k \times T$; the range of the observed traffic thus lies in the interval $[0, 1]$.
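The two preprocessing steps can be sketched as follows, with toy traffic matrices standing in for the real $k \times T$ data; note that the test split is scaled with the training statistics.

```python
import numpy as np

def preprocess(raw_train, raw_test):
    """Log-transform, then min-max normalise with statistics from the training split."""
    train, test = np.log(raw_train + 1.0), np.log(raw_test + 1.0)
    lo, hi = train.min(), train.max()
    scale = lambda x: (x - lo) / (hi - lo)
    return scale(train), scale(test)

# Toy k x T traffic matrices (2 countries, 6 bins), hypothetical counts.
rng = np.random.default_rng(0)
x_train, x_test = preprocess(rng.integers(0, 3000, (2, 6)),
                             rng.integers(0, 3000, (2, 6)))
print(x_train.round(2))
print(x_test.round(2))
```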

4.1.2 Setup

Table 4.1.1: Information of the datasets

                           Dec        Mar        July
Confirmed fraud episodes   1          unknown    4
Labeled                    No         No         Partially
Sequence length (T)        22321      22322      13681
Number of calls            9828782    3181258    1844208

We have three datasets from three nonconsecutive months, among which only July is partially (the first half) labeled, as listed in Table 4.1.1. The datasets contain frauds that are not labeled because their exact timing is unknown, although, according to the complaints submitted to Sinch, fraud calls did happen within an episode. Moreover, the datasets might contain other frauds that are not labeled, since labeling is done manually. Confirmed fraud episodes represent the dates when confirmed fraud happened. The concatenation of the December and March data is used for training (first 80%) and validation (last 20%), while the first half of July is used for evaluation. For T1, we chose the dimension of the latent space d = 300 and the AdamW optimizer [24] to train our model for 100 epochs with a learning rate of 1e-5, a batch size of 128, and an early stopping strategy to avoid over-fitting, as suggested by [37]. Specifically, the temporal model is implemented using Keras [6] and Spektral [13]. The details of the model can be found in Appendix Figure A.0.1. Regarding T2, we choose 100 trees for the Isolation Forest because Liu et al. [23] state that models with around 100 trees are sufficient to produce convergent results.
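
A hedged sketch of this setup is given below. The build_temporal_model() function is only a stand-in for the GAT-GRU architecture detailed in Appendix A, tf.keras.optimizers.AdamW is assumed to be available (TensorFlow 2.11 or later; older versions expose AdamW through tensorflow_addons), and the Isolation Forest follows the 100-tree recommendation of Liu et al. [23]:

```python
import tensorflow as tf
from sklearn.ensemble import IsolationForest

def build_temporal_model():
    # Placeholder: the real model is the GAT + GRU encoder-decoder of Appendix A.
    inputs = tf.keras.Input(shape=(100, 220))      # window length x number of countries (illustrative)
    x = tf.keras.layers.GRU(300)(inputs)           # latent dimension d = 300
    outputs = tf.keras.layers.Dense(220)(x)
    return tf.keras.Model(inputs, outputs)

model = build_temporal_model()
model.compile(optimizer=tf.keras.optimizers.AdamW(learning_rate=1e-5),
              loss="mae")
early_stop = tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, batch_size=128, callbacks=[early_stop])

# Time-independent detector on individual call records (T2)
iforest = IsolationForest(n_estimators=100, random_state=0)
# iforest.fit(record_features_train)
```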

For the hardware setup, the experiments are conducted in Google's Colab virtual environment, using an Nvidia P100 GPU and a dual-core Intel Xeon processor at 2.3 GHz with 25 GB of RAM.

4.2 Results And Analysis


Before diving into the results, let us recap the definitions of T1 and T2:

T1 : to detect time spans where frauds could have happened.

T2 : to classify whether a single record is fraud or not.

We compare the proposed model with a pure Isolation Forest on the T2 detection task. As the temporal model can be considered a regression model, the pure GRU is compared with the proposed GAT-based model in terms of regression error and T1.

4.2.1 Results of T1
Table 4.2.1: The results of T1 on July's data using the full training data

                    T1 recall    T1 precision    Regression error (MAE)
  GRU               7/7          7/19            0.0381
  Proposed Model    7/7          7/16            0.0176

As shown in Table 4.2.1, for T1 the proposed model does not show a dominant advantage over the naive GRU. There are two potential reasons: first, most of the suspicious time slots appear in countries that usually have no traffic at all but exhibit extreme peaks, as in Figure 4.2.1, so even the naive temporal model can detect this type of timestamp; second, the concatenation of two months' data provides enough similar patterns, so even a simple model can easily capture such seasonality in the data. However, if trained with limited data (fewer similar patterns), would the results remain the same?

To answer the question above, we then use only March for training. The results are shown in Table 4.2.2. The proposed model produced fewer false positives than the naive model when trained with less data. In addition, according to Figure 4.2.2, both methods improve their regression error as the amount of training data increases, but the GAT-based model outperforms the naive temporal model at all data settings. As shown in Figure 4.2.3, the GAT-based model appears more capable of capturing the weak seasonality in the traffic with limited data than the GRU.

Figure 4.2.1: An example of detected fraud episodes

Table 4.2.2: The results of T1 on July's data using only March's data

                    Regression error (MAE)
  GRU               0.0435
  Proposed Model    0.0342

Moreover, to interpret what the model has learned through training, we visualize the encoded latent representations of the suspicious timestamps in Figure 4.2.4. For this figure, we sample 100 data points around the fraud episodes in July and visualize the output of the encoder using scikit-learn's t-SNE [29], where the color shows the average anomaly score described in the previous chapter. The visualization demonstrates that most anomalies lie at the periphery (yellow points) of the latent distribution learned by the model. In other words, the model maps anomalies to low-probability samples in the latent distribution, as we expected.
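
A small sketch of this visualization step is shown below; the latent matrix and scores are generated synthetically here, whereas in the project they would come from the trained encoder and the anomaly scores of the sampled timestamps:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

def plot_latent_space(latent, scores):
    """Project latent encodings to 2-D with t-SNE and color them by anomaly score."""
    embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(latent)
    plt.scatter(embedded[:, 0], embedded[:, 1], c=scores, cmap="viridis")
    plt.colorbar(label="average anomaly score")
    plt.title("Latent representations around suspicious timestamps")
    plt.show()

# Synthetic stand-ins; in practice: latent = encoder.predict(windows_around_fraud)
latent = np.random.randn(100, 300)
scores = np.random.rand(100)
plot_latent_space(latent, scores)
```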


Figure 4.2.2: The regression error on the test set with different percentages of training data

Figure 4.2.3: The seasonality in July captured by GRU (top) and GAT (bottom) using March data

Recall that when using the feature-oriented GAT module, the relationship between every pair of nodes (countries) is represented by a complete graph. In other words, the inherent connections between nodes are captured by the attention mechanism. Figure 4.2.5 visualizes the attention values, which are calculated from 1k samples around Christmas. Interestingly, some countries that do not celebrate Christmas, like CN and OM, are far from the US and have relatively small attention values. In contrast, countries with Christmas holidays, like JP and KE, are closer to the US. This result is expected: when the model tries to fit the training data, it is easier if it pays more attention to countries that have low traffic volumes during Christmas. However, we do observe some exceptions, such as IN, BB, and CA. The figure also shows that the US has strong connections with XV and MD during Christmas.

Figure 4.2.4: The visualization of the latent space around suspicious time slots

Figure 4.2.5: The average attention values of the US for the December data
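
For reference, the sketch below shows one way such attention values could be extracted. It assumes the feature-oriented GAT layer is Spektral's GATConv (version 1.x) built with return_attn_coef=True so that the layer also returns its attention coefficients; this is an assumption about a possible implementation, not a description of the exact model:

```python
import numpy as np
import tensorflow as tf
from spektral.layers import GATConv

num_countries, feat_dim = 220, 100   # illustrative sizes, not Sinch's exact dimensions
x_in = tf.keras.Input(shape=(num_countries, feat_dim))
a_in = tf.keras.Input(shape=(num_countries, num_countries))   # complete-graph adjacency

# The layer returns both its output and the attention coefficients
gat_out, attn = GATConv(64, attn_heads=4, return_attn_coef=True)([x_in, a_in])
attn_model = tf.keras.Model([x_in, a_in], attn)

adj = np.ones((1, num_countries, num_countries), dtype=np.float32)   # all countries connected
features = np.random.rand(1, num_countries, feat_dim).astype(np.float32)
coef = attn_model.predict([features, adj])   # attention coefficients (axis order depends on Spektral version)
print(coef.shape)
```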

Next, we analyze the false positives produced by the model, as plotted in Figure 4.2.6. Most countries have low or even zero volume, so when the model sees a spike in the traffic, it identifies the corresponding time slots as anomalies. However, such spikes are sometimes acceptable for low-volume countries because Sinch mainly provides telecom services to businesses, and a spike might simply be the result of a company's expansion. The detection system might need more domain knowledge to avoid flagging such time slots. Nevertheless, since the main objective of T1 is to narrow down the search space for T2, the results are acceptable; these temporal false positives are likely to be filtered out manually afterwards.

Figure 4.2.6: An example of false positives

From the results of T1, we can conclude, regarding Q2, that the proposed temporal model is able to detect suspicious time slots using historical traffic and has advantages over naive temporal models such as a plain GRU.

4.2.2 Results of T2

The aim of T2 is to classify every record in the test data. Before diving into the analysis for T2, it should be noted that only frauds within one episode are labeled, which means the metrics can only measure part of the model's capability. As shown in Table 4.2.3 and Figure 4.2.7, our proposed model has a dominant advantage over the pure Isolation Forest, providing more accurate detection.

Table 4.2.3: The results of T2 on July's data

                    T2 recall    T2 precision    T2 F1 score
  IF with GDP       0.963        0.07            0.133
  Proposed Model    0.961        0.424           0.589
  IF without GDP    0.667        0.001           0.002

We chose 0.5 as the classification threshold for the results in Table 4.2.3. We notice a minor loss in recall compared to the baseline because we ignore samples outside the detected timestamps, which is expected, since potential frauds outside those timestamps are missed. Admittedly, the precision is still relatively low, but the proposed model shows a 505% improvement compared to the naive IF. The potential reasons lie in two aspects: first, applying anomaly detection to this problem rests on the assumption that frauds are a subset of anomalies, so the detected samples are not guaranteed to be frauds; second, due to privacy issues, the model does not leverage all informative features, such as phone numbers, leading to coarse detection.

Figure 4.2.7: The produced ROCs of the Isolation Forest (left), the combined model (middle) and the IF without GDP (right)
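
The metrics in Table 4.2.3 and the AUCs in Figure 4.2.7 can be computed with standard scikit-learn calls [29]; the sketch below is illustrative and assumes y_true holds the (partial) July labels and y_score the combined anomaly scores, with 0.5 as the classification threshold:

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

def evaluate(y_true, y_score, threshold=0.5):
    """Threshold-dependent metrics plus the threshold-independent ROC AUC.

    y_true, y_score: 1-D NumPy arrays of labels (0/1) and anomaly scores.
    """
    y_pred = (y_score > threshold).astype(int)
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_score),
    }

# metrics = evaluate(y_true_july, combined_scores_july)
```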

We also conducted an ablation study to validate the effectiveness of the additional GDP feature and the temporal model, reported in Table 4.2.3. The GDP feature brings about a 44% performance gain in terms of recall with just a minor loss in precision. Our interpretation is that for countries that rarely have calls, the GDP feature can serve as evidence for further branching. For example, in some trees, samples are split on the GDP field before the country field, so samples may be assigned to a subtree together with samples from countries with similar GDP. To validate this guess, we checked the average anomaly score for countries with a GDP similar to SA (around $23,140), as demonstrated in Table 4.2.4. Most countries satisfy our guess that countries with similar GDP should show similar behavior, i.e., close anomaly scores. There are still exceptions, such as AW, due to the random nature of IF, so the observation is not guaranteed across different random seeds. Besides, this could also introduce bias into the model, because the anomaly scores of some regular calls from countries with similar GDP might increase. Overall, this prior knowledge brings a significant improvement to the model.

Table 4.2.4: The average anomaly score for samples from some countries

                   SA       CZ       CY       EE       AW       US
  Anomaly score    0.294    0.3025   0.306    0.322    0.490    0.359
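
As an illustration of how the GDP prior might be attached to the records before fitting the Isolation Forest, the sketch below joins a small, hypothetical per-country GDP table (only the SA value follows the figure quoted above; the others are examples) onto simplified call records; the column names are illustrative and do not reflect Sinch's actual schema:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical per-country GDP lookup (values are examples only), cf. World Bank data [36]
gdp = pd.DataFrame({"country": ["SA", "CZ", "US"],
                    "gdp": [23140.0, 22931.0, 63544.0]})

# Simplified call records: a categorical country field plus a few numerical features
records = pd.DataFrame({"country": ["SA", "US", "CZ", "SA"],
                        "duration": [32.0, 180.0, 4.0, 600.0],
                        "n_attempts": [1, 2, 1, 15]})

features = records.merge(gdp, on="country", how="left")
features["country"] = features["country"].astype("category").cat.codes  # simple numeric encoding

iforest = IsolationForest(n_estimators=100, random_state=0).fit(features)
print(-iforest.score_samples(features))   # higher value = more anomalous
```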

In terms of threshold-independent metrics, Figure 4.2.7 shows that the proposed model also outperforms the two baselines, with a 7% improvement in AUC over the IF with GDP and 17% over the naive IF.


The comparison with previous related work is shown in Table 4.2.5. Note that these works use different metrics from those above and that the reported results are taken from the original papers. According to the comparison, the proposed model still shows advantages over previous work, albeit on a different dataset, since no publicly available dataset exists. Compared with Merve et al. [27], the proposed model fails to outperform in terms of FPR, because Merve et al. [27] use a supervised model that can learn the intrinsic nature of fraud from more prior knowledge, while our model is unsupervised and based on anomaly detection. However, these results should be interpreted with care: the metrics in Table 4.2.5 are not the most important ones according to Sinch's experts, and the private, closed-source datasets used by previous work make such a comparison inherently unfair.

Table 4.2.5: The comparison with previous related work

                          TPR      FPR
  Merve et al. [27]       0.964    0.0026
  Meijaard et al. [26]    0.87     0.05
  Proposed Model          0.961    0.021

The above results and analysis show that, at least on Sinch's dataset, the proposed model achieves state-of-the-art performance in terms of TPR and FPR. The most considerable improvement is the reduction of false positives compared to a simple IF, satisfying the requirement of Q3. Moreover, our model is trained in an unsupervised manner and tolerates positives in the training set, which answers Q1.

Efficiency

To explore the potential of our model for real-time detection, we also measured the training and inference time during the experiments. The results are shown in Table 4.2.6. In the experiments, the inference stage processes all of the test data, while in production the model only has to deal with one sample at a time. Thus, when dealing with streaming data, the inference time would be reduced to a few seconds, and we believe the model shows great potential for real-time detection.

Table 4.2.6: The time efficiency of the proposed model

                Preprocess    IF training    Temporal training    Inference
  Time (min)    2             7              30                   1

Chapter 5

Discussion

5.1 Other Trials


Below are some approaches that we tried in this project but abandoned for various reasons.

At an early stage of the project, we tried some distance-based anomaly detection methods such as One-Class SVM [32] and Local Outlier Factor [3]. However, in later experiments, we realized that these models require fine-grained feature engineering, which would be extremely difficult in our setting.

The second attempt was to use a recommender system in reverse. The motivation was that a recommender system generally recommends the actions a user is most likely to take; conversely, behaviors that a user is unlikely to exhibit should be anomalies. After some experiments, we found this method unfeasible in our scenario because some users have very little traffic, so no patterns are formed at all.

Besides these attempts at model design, some alternatives were also tried to solve other challenges in this project.

Because the country and reason fields are categorical, encoding them could add more than 220 extra sparse dimensions to the input data, while there are only 3 to 4 numerical features. This sparsity could hurt the performance of machine learning models. To avoid it, embedding methods such as Item2Vec [2] were tried, but they did not improve performance.


In the first four months of this project, there was no labeled data, even for evaluation. We tried label-free evaluation methods such as that of Goix [12], which uses criteria based on Excess-Mass (EM) and Mass-Volume (MV) curves for unlabeled data. However, after some experiments, we found that these methods are not suitable for our data because they require continuous Euclidean vectors, which our data do not satisfy.

5.2 Limitations
The most significant limitation of the project is that the phone numbers, which are very informative features, could not be utilized due to privacy regulations. Anyone with access to actual, non-anonymized data could use these features to perform more accurate detection, because IPRNs tend to show similar patterns.

Another issue is that the July data are only partially labeled, which means there are unlabelled frauds. Therefore, the reported precision is only a lower bound on the true precision, while the actual recall might be lower than what we report. During the development of the project, the results were sent to Sinch and helped them find new fraud cases, which supports this assumption.

There are other limitations of the labeled data as well. One should be careful with the dates neighboring the suspicious timestamps, because fraudsters tend to launch attacks on consecutive days. Due to time constraints, only the first half of July was manually validated and used for evaluation, so there could be more frauds in July.

5.3 Future Work


The basic assumption behind anomaly-detection-based models is that frauds should be a subset of anomalies, or at least overlap largely with them, which in itself produces false positives. To further reduce false positives, one could work with domain experts and use manual detection methods to filter the detected samples.

Another interesting aspect worth investigating is that, in this project, the snapshot of the whole network is treated as the input, but one may also conduct experiments per user or per country, because anomalies in this scenario are context-based. For example, the definition of an anomaly might vary from country to country and from month to month.

5.4 Conclusion
This project tries to provide a solution to the long-standing yet unsolved problem of International Revenue Sharing Fraud (IRSF). Existing methods suffer from high false-positive rates. To address this, we split the detection task into two subtasks handled by two methods respectively. For the first subtask, we introduce a temporal method that detects samples only in suspicious timestamps by exploiting the power of a graph-based model. For the second subtask, an Isolation Forest with an additional GDP feature is applied. With the combination of these two methods, our model outperforms the existing models on Sinch's data, achieving 96.1% recall and 42.4% precision, which demonstrates great potential for effective detection in production. Extensive analysis also shows that the model has good interpretability in terms of traffic patterns.

Bibliography

[1] Anello, Eugenia. IsolationForest. Accessed: 2021-07-28. URL: https://betterprogramming.pub/anomaly-detection-with-isolation-forest-e41f1f55cc6.

[2] Barkan, Oren and Koenigstein, Noam. Item2Vec: Neural Item Embedding for
Collaborative Filtering. 2017. arXiv: 1603.04259 [cs.LG].

[3] Breunig, Markus M., Kriegel, Hans-Peter, Ng, Raymond T., and Sander, Jörg.
“LOF: Identifying Density-Based Local Outliers”. In: SIGMOD ’00. Accessed:
2021-07-28. Dallas, Texas, USA: Association for Computing Machinery, 2000,
pp. 93–104. ISBN: 1581132174. DOI: 10 . 1145 / 342009 . 335388. URL: https :
//doi.org/10.1145/342009.335388.

[4] Cheng, Yan, Sun, Huan, Chen, Haomai, Li, Meng, Cai, Yingying, Cai, Zhuang,
and Huang, Jing. “Sentiment Analysis Using Multi-Head Attention Capsules
With Multi-Channel CNN and Bidirectional GRU”. In: IEEE Access 9 (2021),
pp. 60383–60395. DOI: 10.1109/ACCESS.2021.3073988.

[5] Cho, Kyunghyun, Merrienboer, Bart van, Gulcehre, Caglar, Bahdanau, Dzmitry,
Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua.
Learning Phrase Representations using RNN Encoder-Decoder for Statistical
Machine Translation. 2014. arXiv: 1406.1078 [cs.CL].

[6] Chollet, Francois et al. Keras. Accessed: 2021-07-28. 2015. URL: https : / /
github.com/fchollet/keras.

[7] Chung, Junyoung, Gulcehre, Caglar, Cho, KyungHyun, and Bengio, Yoshua.
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence
Modeling. 2014. arXiv: 1412.3555 [cs.NE].


[8] Danel, Tomasz, Spurek, Przemysław, Tabor, Jacek, Śmieja, Marek, Struski,
Łukasz, Słowik, Agnieszka, and Maziarka, Łukasz. Spatial Graph Convolutional
Networks. 2020. arXiv: 1909.05310 [cs.LG].

[9] Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, and Toutanova, Kristina.
BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding. 2018. DOI: 10.48550/ARXIV.1810.04805.

[10] Doersch, Carl. Tutorial on Variational Autoencoders. 2021. arXiv: 1606.05908 [stat.ML].

[11] Feng, Weijiang, Guan, Naiyang, Li, Yuan, Zhang, Xiang, and Luo, Zhigang.
“Audio visual speech recognition with multimodal recurrent neural networks”.
In: May 2017, pp. 681–688. DOI: 10.1109/IJCNN.2017.7965918.

[12] Goix, Nicolas. How to Evaluate the Quality of Unsupervised Anomaly Detection
Algorithms? 2016. arXiv: 1607.01152 [stat.ML].

[13] Grattarola, Daniele and Alippi, Cesare. Graph Neural Networks in TensorFlow
and Keras with Spektral. 2020. arXiv: 2006.12138 [cs.LG].

[14] H2O.ai. h2o: Python Interface for H2O. Accessed: 2021-07-28. 2020. URL:
https://github.com/h2oai/h2o-3.

[15] Harstad, Axel Øvrebø and Kvaale, William. mtad-gat-pytorch. https://github.com/ML4ITS/mtad-gat-pytorch. Accessed: 2021-07-28. 2021.

[16] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep Residual
Learning for Image Recognition. 2015. arXiv: 1512.03385 [cs.CV].

[17] Jensen, J. L. W. V. “Sur les fonctions convexes et les inégalités entre les valeurs
moyennes”. In: Acta Mathematica 30.none (1906), pp. 175–193. DOI: 10.1007/
BF02418571.

[18] Jiang, Nan, Jin, Yu, Skudlark, Ann, Hsu, Wen-Ling, Jacobson, Guy, Prakasam,
Siva, and Zhang, Zhi-Li. “Isolating and analyzing fraud activities in a large
cellular network via voice call graph analysis”. In: MobiSys’12 - Proceedings
of the 10th International Conference on Mobile Systems, Applications, and
Services (June 2012). DOI: 10.1145/2307636.2307660.

[19] Kingma, Diederik P and Welling, Max. Auto-Encoding Variational Bayes. 2014.
arXiv: 1312.6114 [stat.ML].


[20] Kipf, Thomas N and Welling, Max. “Semi-supervised classification with graph
convolutional networks”. In: arXiv preprint arXiv:1609.02907 (2016).

[21] LeCun, Yann, Bengio, Yoshua, and Hinton, Geoffrey. “Deep learning”. In:
nature 521.7553 (2015), pp. 436–444.

[22] Li, Fei-Fei, Karpathy, Andrej, and Johnson, Justin. “CS231n: Convolutional Neural Networks for Visual Recognition [PDF document]”. In: (2019). Accessed: 2021-07-28. URL: http://cs231n.stanford.edu/slides/2019/cs231n_2019_lecture11.pdf.

[23] Liu, Fei Tony, Ting, Kai Ming, and Zhou, Zhi-Hua. “Isolation forest”. In: 2008
Eighth IEEE International Conference on Data Mining. IEEE. 2008, pp. 413–
422.

[24] Loshchilov, Ilya and Hutter, Frank. Decoupled Weight Decay Regularization.
2019. arXiv: 1711.05101 [cs.LG].

[25] Massi, Michela. Schema of a basic Autoencoder. Accessed: 2021-07-28. 2019. URL: https://commons.wikimedia.org/wiki/File:Autoencoder_schema.png.

[26] Meijaard, Yoram J., Cappers, Bram C. M., Mengerink, Josh G. M., and Zannone,
Nicola. “Predictive Analytics to Prevent Voice over IP International Revenue
Sharing Fraud”. In: Data and Applications Security and Privacy XXXIV. Ed. by
Anoop Singhal and Jaideep Vaidya. Cham: Springer International Publishing,
2020, pp. 241–260. ISBN: 978-3-030-49669-2.

[27] Merve, Şahin and Aurélien, Francillon. “Understanding and Detecting International Revenue Share Fraud”. In: Jan. 2021. DOI: 10.14722/ndss.2021.24051.

[28] Narkhede, Sarang. Understanding AUC - ROC Curve. June 2021. URL: https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5.

[29] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos,
A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. “Scikit-learn:
Machine Learning in Python”. In: Journal of Machine Learning Research 12
(2011), pp. 2825–2830.


[30] Pham, Chau. Graph convolutional networks (gcn). Accessed: 2021-07-28. Feb.
2021. URL: https://www.topbots.com/graph-convolutional-networks/.

[31] Santos, Cícero dos and Gatti., Maíra. “Deep Convolutional Neural Networks for
Sentiment Analysis of Short Texts”. In: Proceedings of COLING 2014, the 25th
International Conference on Computational Linguistics: Technical Papers.
Dublin, Ireland: Dublin City University and Association for Computational
Linguistics, Aug. 2014, pp. 69–78.

[32] Schölkopf, Bernhard, Platt, John C., Shawe-Taylor, John C., Smola, Alex J.,
and Williamson, Robert C. “Estimating the Support of a High-Dimensional
Distribution”. In: Neural Comput. 13.7 (July 2001), pp. 1443–1471. ISSN: 0899-
7667. DOI: 10.1162/089976601750264965.

[33] The encoder-decoder architecture. Accessed: 2021-07-28. URL: https://d2l.ai/chapter_recurrent-modern/encoder-decoder.html#fig-encoder-decoder.

[34] Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion,
Gomez, Aidan N., Kaiser, Lukasz, and Polosukhin, Illia. Attention Is All You
Need. 2017. arXiv: 1706.03762 [cs.CL].

[35] Veličković, Petar, Cucurull, Guillem, Casanova, Arantxa, Romero, Adriana, Liò,
Pietro, and Bengio, Yoshua. Graph Attention Networks. 2018. arXiv: 1710 .
10903 [stat.ML].

[36] World Bank. GDP. Accessed: 2021-07-28. 2021. URL: https://data.worldbank.org/indicator/NY.GDP.MKTP.CD.

[37] Zhao, Hang, Wang, Yujing, Duan, Juanyong, Huang, Congrui, Cao, Defu, Tong,
Yunhai, Xu, Bixiong, Bai, Jing, Tong, Jie, and Zhang, Qi. Multivariate Time-
series Anomaly Detection via Graph Attention Network. 2020. arXiv: 2009 .
02040 [cs.LG].

[38] Zhao, Ling, Song, Yujiao, Zhang, Chao, Liu, Yu, Wang, Pu, Lin, Tao, Deng, Min,
and Li, Haifeng. “T-GCN: A Temporal Graph Convolutional Network for Traffic
Prediction”. In: IEEE Transactions on Intelligent Transportation Systems 21.9
(Sept. 2020), pp. 3848–3858. DOI: 10.1109/tits.2019.2935152. URL: https://doi.org/10.1109/tits.2019.2935152.


[39] Zhu, Xinxin, Wang, Weining, Guo, Longteng, and Liu, Jing. AutoCaption:
Image Captioning with Neural Architecture Search. 2020. arXiv: 2012.09742
[cs.CV].

Appendix - Contents

A Appendix

Appendix A

Appendix

Figure A.0.1 shows the details of every component in the model, where black arrows indicate that two components are connected, and yellow arrows indicate the roles of the modules mentioned in the report.


Figure A.0.1: Implementation details of the temporal model


TRITA-EECS-EX-2022:368

www.kth.se
