Automated Insider Threat Detection System Using User and Role-Based Profile Assessment

4 authors, including Sadie Creese (University of Oxford) and Phil Legg
Abstract—Organisations are experiencing an ever-growing concern of how to identify and defend against insider threats. Those who have authorised access to sensitive organisational data are placed in a position of power that could well be abused and could cause significant damage to an organisation. This could range from financial theft and intellectual property theft, through to the destruction of property and business reputation. Traditional intrusion detection systems are not designed for, nor capable of, identifying those who act maliciously within an organisation. In this paper, we describe an automated system that is capable of detecting insider threats within an organisation. We define a tree-structure profiling approach that incorporates the details of activities conducted by each user and each job role, and then use this to obtain a consistent representation of features that provide a rich description of the user's behaviour. Deviation can be assessed based on the amount of variance that each user exhibits across multiple attributes, compared against their peers. We have performed experimentation using 10 synthetic data-driven scenarios and found that the system can identify anomalous behaviour that may be indicative of a potential threat. We also show how our detection system can be combined with visual analytics tools to support further investigation by an analyst.

Keywords—Insider threat, anomaly detection, cyber security.

I. INTRODUCTION

The insider-threat problem is one that is constantly growing in magnitude, resulting in significant damage to organisations and businesses alike. Those who operate within an organisation are often trusted with highly confidential information such as intellectual property, financial records and customer accounts, in order to perform their job. If an individual should choose to abuse this trust and act maliciously towards the organisation, then their position within the organisation, their knowledge of the organisational systems, and their ability to access such materials mean that they can pose a serious threat to the operation of the business. The range of possible activities could be anything from taking money from a cash register, to exfiltrating intellectual property from the organisation to sell on to rivals, which could effectively destroy the successful operation of the organisation. Cappelli et al. from the Carnegie Mellon University CERT group identify three main groups of insider threat: IT sabotage, theft of IP, and data fraud [1]. A growing number of cases highlighted by the media in recent years reveal that businesses and governments alike have suffered similar experiences, whereby top-secret information has been exfiltrated and passed on to opposing parties. The threat posed by the insider is very real, and requires serious attention from both employees and organisations alike.

Over the years, technological advancements have meant that the way organisations conduct business is constantly evolving. It is now common practice for employees to have access to large repositories of organisation documents stored electronically on distributed file servers. Many organisations provide their employees with company laptops for working whilst on the move, and use e-mail to organise and schedule appointments. Services such as video conferencing are frequently used for hosting meetings across the globe, and employees are constantly connected to the Internet, where they can obtain information on practically anything that they require for conducting their workload. Given the electronic nature of organisational records, these technological advancements could potentially make it easier for insiders to attack. From the organisational view, one advantage is the capability of capturing activity logs that may provide insight into the actions of employees. However, actually analysing such activity logs would be infeasible for any analyst due to the sheer volume of activity being conducted by employees every day. What is required is a capability to analyse individual users who conduct business on organisational systems, to assess when users are behaving normally, and when they are posing a threat.

In this work, we present a systematic approach for insider threat detection and analysis based on the concept of anomaly detection. Given a large collection of activity log data, the system constructs tree-structured profiles that describe individual user activity and combined role activity. Using these profiles, comparisons can be clearly made to assess how the current daily observations vary from previously-observed activities. In this fashion, we construct a feature set representation that describes the observations made for each day, and the variations that are exhibited between the current day and the previously-observed days. This large feature set is reduced into multiple anomaly assessment scores using Principal Component Analysis (PCA) [2] decompositions on subsets of features, to identify the degree of deviation for each grouping. The anomaly assessment scores can be used either with classification schemes to produce a list of suspicious users, or can be visualized using parallel co-ordinates plots
IEEE SYSTEMS JOURNAL, VOL. -, NO. -, SEPTEMBER 2015 2
to provide a more in-depth view. To test the performance of the approach, a red team developed 10 simulated insider threat scenarios for experimentation that are designed to cover a variety of different types of insider attacks that are often observed. It was found that the system performed well for detecting the attacks using the classification alerts, and the visualization enabled analysts to identify what particular attributes caused the insider to be detected. The remainder of the paper is as follows: Section II discusses the related works. Section III describes the requirements of an insider threat detection system. Section IV presents the proposed system, describing in detail the different components. Section V presents the process of constructing effective simulation data and the experimentation of the detection system, and Section VI concludes the paper.

II. RELATED WORKS

The topic of insider threat has recently received much attention in the literature. Researchers have proposed a variety of different models that are designed to prevent or detect the presence of attacks (e.g., [3], [4]). Similarly, there is much work that considers the psychological and behavioural characteristics of insiders who may pose a threat as means for detection (e.g., [5], [6], [7]). Kammüller and Probst [8] consider how organisations can identify attack vectors based on policy violations, to minimise the potential of insider attacks. Likewise, Ogiela and Ogiela [9] study how to prevent insider threats using hierarchical and threshold secret sharing. For the remainder of this section, we choose to focus particularly on works that address the practicalities of designing and developing systems that can predict or detect the presence of insider threat.

Early work by Spitzner [10] discusses the use of honeypots (decoy machines that may lure an attack) for detecting insider attacks. However, as security awareness increases, those choosing to commit insider attacks are finding more subtle methods to cause harm or defraud their organisations, and so there is a need for more sophisticated prevention and detection. Early work by Magklaras and Furnell [11] considers how to estimate the level of threat that is likely to originate from a particular insider based on certain profiles of user behaviour. As they acknowledge, substantial work is still required to validate the proposed solutions. Myers et al. [12] consider how web server log data can be used to identify malicious insiders who look to exploit internal systems. Maloof and Stephens [13] propose a detection tool for when insiders violate need-to-know restrictions that are in place within the organisation. Okolica et al. [14] use Probabilistic Latent Semantic Indexing with Users to determine employee interests, which are used to form social graphs that can highlight insiders. Liu et al. [15] propose a multilevel framework called SIDD (Sensitive Information Dissemination Detection) that incorporates network-level application identification, content signature generation and detection, and covert communication detection.

More recently, Eldardiry et al. [16] also propose a system for insider threat detection based on feature extraction from user activities. However, they do not factor in role-based assessment. In addition, the profiling stage that we perform allows us to extract many more features beyond the activity counts that they suggest. Brdiczka et al. [17] combine psychological profiling with structural anomaly detection to develop an architecture for insider-threat detection. They use data collected from the multi-player online game World of Warcraft to predict whether a player will quit their guild. In contrast to real-world insider threat detection, they acknowledge that the game contains obvious malicious behaviours; however, they aim to apply these techniques to real-world enterprises. Eberle et al. [18] consider Graph-Based Anomaly Detection as a tool for detecting insiders, based on modifications, insertions and deletions of activities from the graph. They use the Enron e-mail dataset [19] and cell-phone traffic as two preliminary cases, with the intention of extending to the CERT insider threat datasets. Senator et al. [20] propose to combine structural and semantic information on user behaviour to develop a real-world detection system. They use a real corporate database, gathered as part of the Anomaly Detection at Multiple Scales (ADAMS) program; however, due to confidentiality they cannot disclose the full details, and so it is difficult to compare against their work. Parveen et al. [21] use stream mining and graph mining to detect insider activity in large volumes of streaming data, based on ensemble-based methods, unsupervised learning and graph-based anomaly detection. Parveen and Thuraisingham [22] extend the work with an incremental learning algorithm for insider threat detection that is based on maintaining repetitive sequences of events. They use trace files collected from real users of the Unix C shell [23]; however, this public dataset is relatively dated now.

One clear observation from these related works is that access to real-world data is extremely difficult, and so researchers synthesise data that is similar to that of a real-world enterprise, or use a subset of data points, or apply insider threat detection techniques to other problem domains (e.g., online games). In our work, we particularly wanted to represent the variety and volume of data that would be observed in a modern real-world organisation, and show how this could be combined to form an overall assessment for each user and for each role. We also wanted to clearly demonstrate a wide variety of insider threat scenarios as represented by our synthetic data generation, and show how our detection system would be capable of detecting the different attacks.

III. REQUIREMENTS ANALYSIS

The work described in this paper was carried out as part of a wider inter-disciplinary project that includes computer scientists, security researchers, and cyber-psychology experts. As the problem of insider threat continues to be of growing concern to businesses and governments alike, there is a critical need for practical tools to help alleviate the threat that is posed. Our understanding of what we believe constitutes insider threat is the result of close inter-disciplinary collaboration between industry, government and academia. The system that is proposed here aims to address the majority of scenarios that are understood from the knowledge that has been shared by organisations experiencing such attacks, and case studies that have been documented in research reports and the media.
Fig. 1. Architecture of the insider threat detection system. The system comprises a number of key components that process incoming log data records, construct a profile of user and role behaviour for the current day, and assess the level of threat posed by the individual. Alerts can be automatically triggered at three levels: policy violations and previously-recognised attacks, threshold-based anomalies, or deviation-based anomalies. Alerts are dealt with by an analyst, who can then determine whether the individual actually poses a threat or not. If deemed not to be a threat, the analyst can refine the detection model to minimise the false positive rate for future observations.
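To make the flow of Fig. 1 concrete, the daily assessment loop might be sketched as below. This is an illustrative sketch only: the `Profile` class and the three check callbacks are hypothetical stand-ins for the policy, threshold, and deviation stages named in the caption, not code from the published system.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """Accumulated activity counts for a user or a role (illustrative)."""
    counts: dict = field(default_factory=dict)

    def add(self, activity: str) -> None:
        self.counts[activity] = self.counts.get(activity, 0) + 1


def assess_day(records, user_profiles, role_profiles, user_role,
               policy_check, threshold_check, deviation_check):
    """One pass of the Fig. 1 pipeline: build daily observation profiles
    from the incoming log records, then raise alerts at the three levels
    described in the caption. Observations that raise no alert are folded
    into the user's previously-observed profile."""
    daily = {}
    for rec in records:                      # incoming log records for the day
        daily.setdefault(rec["user"], Profile()).add(rec["activity"])

    alerts = []
    for user, obs in daily.items():
        if policy_check(obs):                # level 1: policy violation / known attack
            alerts.append((user, "policy"))
        elif threshold_check(obs):           # level 2: threshold-based anomaly
            alerts.append((user, "threshold"))
        elif deviation_check(obs, user_profiles.get(user),
                             role_profiles.get(user_role.get(user))):
            alerts.append((user, "deviation"))
        else:                                # accepted as normal: update history
            hist = user_profiles.setdefault(user, Profile())
            for activity, n in obs.counts.items():
                hist.counts[activity] = hist.counts.get(activity, 0) + n
    return alerts
```

An analyst rejecting an alert would correspond to loosening the parameters behind the relevant check callback before the next day's pass.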
Our initial work on insider threat detection was to develop a conceptual model of how a detection system could connect the actions of the real world with the hypothesis that a particular individual is an insider [3]. It is crucial that organisations looking to deploy insider threat detection tools have a clear understanding of the valuable assets of the organisation, and the monitored activities that relate to these assets, in order to understand the type of attacks that could potentially arise. In developing our conceptual model, we identified the different elements that exist within organisations to understand what could be affected as a result of an insider attack. As a result, we can define the requirements of the detection system as given below:

• The system should be able to determine a score for each user that relates to the threat that they currently pose.
• The system should be able to deal with various forms of insider threat, including sabotage, intellectual property theft, and data fraud.
• The system should also be able to deal with unknown cases of insider threat, whereby the threat is deemed to be an anomaly for that user and for that role.
• The system should assess the threat that an individual poses based on how this behaviour deviates from both their own previous behaviour, and the behaviour exhibited by those in a similar job role.

Whilst we aim for a well-defined detection system that can alleviate the presence of insider threat, to promise a system that can eradicate the problem is a bold claim that we do
not try to state here. By the very nature of an insider attack, a sophisticated attacker would be conscious of covering their tracks to avoid being detected. For example, they could attempt to falsify or delete the activity logs that are reported to the detection system, or they could attempt to circumvent standard monitoring practices. In theory, the very act of modifying or deleting log files should itself be detected and raise an alert, given that this behaviour should not be deemed normal. Such attacks would therefore most likely be detected through a combination of both online and offline behaviours, such as acting suspiciously in the workplace.

IV. SYSTEM OVERVIEW

The architecture of the detection system is detailed in Figure 1. Here, the detection system connects with a database that contains all available log records that exist for the organisation. Such examples may be computer-based access logs, e-mail and web records, and physical building access (e.g., swipe card logs). All records for the current date are retrieved and parsed by the system. For each record, the user ID is used to append the activity to their daily observed profile. Likewise, the activity is also appended to the daily observed profile of their associated role, if applicable. Once the daily observation profiles are constructed, the system proceeds to assess each user based on three levels of alerts: policy violations and previously-recognised attacks, threshold-based anomalies, and deviation-based anomalies. At each stage in the assessment, the system can trigger an alert to notify the analyst of a supposed threat being observed. The analyst can investigate the alert, and then decide whether this alert is correct. Should the analyst decide that the alert is not correct, then they have the capability to reject a detection result, which then refines the parameters within the system to minimise the false positive rate for future observations.

In the following sections, we will detail how each of the key components of the system operates to identify at-risk individuals. We consider the key components of the system to be the retrieval of records from the organisational database, user and role-based profiling, profile feature extraction, anomaly assessment from features, and classification of threat from anomaly scores. For this work, a pilot detection system was developed using the Python programming language. In addition, visualization components have also been developed that allow the analyst to explore different components of the detection process, such as user profiles and multiple anomaly scores. Our visualization components are developed using a Python back-end and the popular D3 JavaScript library for the front-end display [24].

A. Data Input

The first stage of the pipeline is the Data Parser Module that interfaces with the organisation. For each day, the system requests the set of records from the log data that correspond with the current date. In theory, this could consist of many different captures of data from different sensors within the organisation. Our initial work was based on the datasets provided by CMU-CERT. In these datasets, the organisation activity logs consist of five different files that correspond to the different activities that can be performed: login, USB device, e-mail, web, and file access. Each record is parsed to obtain a timestamp, a user ID, a device ID (i.e., what device logged the action) and an activity name (e.g., login, e-mail). Some activities (i.e., e-mails, files, websites) may also contain further information that we assign as the attribute, such as the e-mail recipients, the filename accessed, or the website accessed. Where an attribute is provided, the system is also capable of retrieving and analysing content that can be assigned as the final property of the record, which is handled by the Content Parser.

The Content Parser consists of two main techniques for analysing textual data: bag of words, and Linguistic Inquiry and Word Count (LIWC) [25]. For analysing website and file content, the Content Parser will scrape the given URL and retrieve all text that is recognised to exist within the English dictionary. A bag-of-words approach is then used to construct a feature vector, which is assigned to the given record. Similarly, for e-mail content we construct a feature vector; however, rather than using the raw text content, we use features defined by LIWC. The justification for this is three-fold. Firstly, given the sensitivity of e-mail content, many organisations are concerned about directly monitoring the content of e-mails. Secondly, the LIWC categories have well-defined meaning with regards to psychological context, and so could provide more meaningful information regarding the e-mail content than the raw message would in any case. Finally, there are 80 features defined by the LIWC tool, meaning that the size of the feature vector can easily be reduced. It would be possible to use either technique for the assessment of each activity; however, we make this distinction due to e-mails being user-generated, rather than websites or files that are only being read by a user. Each content-based feature vector is combined with the user and role-based daily observation profiles, which we will describe further in Section IV-B.

The Content Parser serves as an optional module within our architecture. It is understood that many organisations currently do not maintain records of all content from e-mails being sent, due to privacy concerns. However, organisations may well change their position on this, especially if it is believed that such content would help in combating the threat of insider attacks. For the development of our system, we have worked with a number of synthetic data sets including CMU-CERT insider threat scenarios, the published Enron e-mail dataset, sample data provided by CPNI, and in-house generated data. One challenge with using synthetic datasets such as CMU-CERT and our own is that whilst the data may show that e-mails were sent or files were accessed, since these are purely synthetic there is no substantial content within the files or e-mails. E-mail content may be a collection of randomly-chosen words that define a topic, rather than a meaningful communication sent by a human user. Whilst we have been able to trial such methods on e-mail and web analysis in isolation, without these pairing up with corresponding insider threat scenarios it is difficult to truly validate the approach. However, it is incorporated into the overall architecture since it serves as an optional complementary anomaly metric that
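As a rough illustration of the record structure described in Section IV-A, the sketch below splits CSV-like log rows into the timestamp, user ID, device ID, activity, and optional attribute fields, and builds a simple bag-of-words vector of the kind the Content Parser uses for web and file content. The field layout and the example values are assumptions for illustration, not the actual CMU-CERT format.

```python
import csv
from io import StringIO


def parse_records(raw: str):
    """Parse raw activity-log rows into the record structure used by the
    system: timestamp, user ID, device ID, activity name, and an optional
    attribute (e.g. e-mail recipient, filename or URL). The CSV layout
    here is an assumed, illustrative one."""
    records = []
    for row in csv.reader(StringIO(raw)):
        records.append({
            "timestamp": row[0],
            "user": row[1],
            "device": row[2],
            "activity": row[3],
            "attribute": row[4] if len(row) > 4 else None,
        })
    return records


def bag_of_words(text: str, vocabulary: list):
    """Simple bag-of-words feature vector over a fixed vocabulary, of the
    kind assigned to website and file-content records."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]
```

In the full system the attribute field would trigger content retrieval, with the resulting vector attached to the record as its final property.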
between the user's daily activity and their previous activity, and comparisons between the user's daily activity and the previous activity of their role.

D. Threat Assessment

Once the feature set for the current daily observation has been computed, the next stage of the system is to determine whether these features show significant deviation in behaviour compared with all previously-accepted observations. To do this, an n × m matrix is constructed for each user, where n is the total number of sessions (or days) being considered, and m is the number of features that have been obtained from the profile. The bottom row of the matrix represents the current daily observation, with the remainder of the matrix being all previous observation features. To derive the amount of variation that is exhibited in the multivariate feature space, we perform PCA to obtain a projection of the features into a lower-dimensional space based on the amount of variance exhibited by each feature. This means that features that have a higher variance can be projected into a lower-dimensional space whilst preserving separability between similar and dissimilar features. PCA is often used to enable visualization and understanding of large datasets using only 2 or 3 dimensions, to observe the clustering of similar data records. For our application, we also allow a weight to be associated with each feature so that features of greater importance can be emphasised, as dictated by an analyst. In this way, the analyst can generate different models for analysis based on different configurations of weighted combinations. If no weights are specified then the weight is taken to be 1/f, where f is the total number of features. All feature columns are normalized before the PCA decomposition is performed. By default, we consider a decomposition of the features to a 2-dimensional space. If all feature observations were identical, then all points in the new space would be clustered at the origin. However, given the deviation that is expected of human behaviour, points are likely to be clustered near to, but not directly at, the centre. For the new matrix, we consider only the current observation, which is the bottom-most record in the matrix. We compute the distance of this point from the origin in the new space, and take this to be the anomaly score of this metric at this observation. This process is performed for each of the anomaly metrics, where each metric consists of a subset of the overall feature set and, if specified, a corresponding weighting function for each feature. Each anomaly metric can be configured to alert if the score obtained for that particular metric is above a particular threshold.

The anomaly metrics that are currently considered include: Login anomaly, Login duration anomaly, Logoff anomaly, USB insertion anomaly, USB duration anomaly, Email anomaly, Web anomaly, File anomaly, This anomaly, Any anomaly, New anomaly, Hourly anomaly, Number anomaly, User anomaly, Role anomaly, and Total anomaly. The system could easily support the addition of further anomaly metrics, based on the observation of different activity types. From our research into case studies of insider threat, most cases could be associated with either performing a new activity, performing an existing activity at a new time of day, or performing an existing activity more or less often than previously. These define our 'new', 'hourly', and 'number' metrics. The combination of multiple metrics also provides support for greater confidence in the result obtained regarding an individual. For example, we may observe that a particular individual scores higher than other users not only on 'hourly anomaly' or 'total anomaly', but also on 'file anomaly' and 'email anomaly'. By considering how the different subsets of features score, rather than a single overall score, an assessment can be made not only of whether an individual poses a threat, but also of what attack vectors they are acting on. Here, we observe that the user is logging in at an unusual time to access new files and e-mail new contacts.

E. Classification of Threat

The final stage of the system is to provide an assessment of the threat that is posed by an individual, given the observation of their activity and the collection of anomaly scores that have been assigned to their daily observation profile. One relatively effective approach is to simply normalise each column of the anomaly score matrix, and then take the maximum standard deviation as an integer classification of importance. We would expect most data to exist within 2 standard deviations of the norm, so anything above this should certainly be investigated. Likewise, we can also compute the Mahalanobis distance to assess how far away an individual's observations are from the rest of the distribution. As a third approach that can be deployed, we compute the covariance matrix of a user's anomaly scores, and on each daily observation assess the signed differences between the covariances. The system could well be extended to support other classification schemes in the future, as desired by the analyst. The classification can be used to flag up users to the analyst, and also to determine whether a user's daily observation profile should be included within their previously-observed normal profile. If the observation is deemed to be too much of an anomaly, then the observation is recorded as an attack rather than as normal. This is a vital stage so as not to contaminate a user's previously-observed profile with malicious behaviour, whilst also providing the capability for each daily observation to contribute towards the previously-observed profile.

V. EXPERIMENTATION

To be able to assess the performance of the detection system, we conduct a series of experimental scenarios using the prototype system. As part of the wider project on insider threat, 10 scenarios have been developed that cover the broad range of possible attacks that an insider could perform against their organisation. For each scenario, a narrative has been devised that explains what has happened, including why the individual has chosen to act against the organisation, and what they have done. Each scenario is modelled within a unique synthetically-generated dataset that represents the normal activity of the organisation. The data contains all employee activity within
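The per-metric scoring described in Section IV-D can be sketched as follows, assuming NumPy. This is a minimal reconstruction of the described steps (column normalisation, optional 1/f weights, PCA via SVD, distance of the bottom row from the origin in the projected space), not the authors' implementation.

```python
import numpy as np


def anomaly_score(X, weights=None, n_components=2):
    """Anomaly score for the most recent observation (the bottom row of X).

    X : (n_sessions, n_features) matrix of profile features for one user
        and one anomaly metric, with the current day as the last row.
    """
    X = np.asarray(X, dtype=float)
    n, f = X.shape
    if weights is None:
        weights = np.full(f, 1.0 / f)      # default weight 1/f per feature

    # Normalise each feature column, then apply per-feature weights.
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0                      # guard against constant columns
    Z = (X - mu) / sd * weights

    # PCA via SVD of the (column-centred) matrix; keep the top components.
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    projected = Z @ vt[:n_components].T

    # Distance of the current observation from the origin of the new
    # space (the centre of previously-observed behaviour).
    return float(np.linalg.norm(projected[-1]))
```

With two features and two components the projection is a pure rotation, so the score reduces to the weighted z-score norm of the current day; with many features, the retained components capture the directions of greatest variance.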
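The first two classification schemes of Section IV-E might be sketched as below. Again, this is a minimal reconstruction under stated assumptions (the current day is the bottom row of the anomaly-score matrix), not the published code.

```python
import numpy as np


def classify_threat(scores):
    """Integer threat level for the most recent day, following the first
    scheme in Section IV-E: z-score each anomaly-metric column, then take
    the current day's largest deviation in whole standard deviations.
    Values above 2 would warrant investigation.

    scores : (n_days, n_metrics) matrix of anomaly scores, with the
             current day as the bottom row.
    """
    S = np.asarray(scores, dtype=float)
    mu = S.mean(axis=0)
    sd = S.std(axis=0)
    sd[sd == 0] = 1.0                          # guard constant columns
    z = np.abs((S[-1] - mu) / sd)
    return int(z.max())                        # whole standard deviations


def mahalanobis_distance(scores):
    """The second scheme: Mahalanobis distance of the current day's
    anomaly scores from the distribution of previous days."""
    S = np.asarray(scores, dtype=float)
    hist, cur = S[:-1], S[-1]
    mu = hist.mean(axis=0)
    cov = np.atleast_2d(np.cov(hist, rowvar=False))
    inv = np.linalg.pinv(cov)                  # pseudo-inverse guards singular covariance
    d = cur - mu
    return float(np.sqrt(d @ inv @ d))
```

Either score could then drive the decision of whether the day's observation is folded into the user's normal profile or recorded as an attack.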
VI. CONCLUSIONS AND FUTURE WORK

In this work, we have presented an effective approach for insider threat detection. From the organisational log data, the system generates user and role-based profiles that can describe the full extent of activities that users perform within the organisation. The tree-structured profiles are designed to be easily comparable against other users, role types, and temporal observations. From each daily observation, the system constructs a large set of features that describe the state of the current daily profile, and the previously observed profiles for all users. The system then creates subsets of the features that describe particular anomalies of interest, and computes a PCA decomposition on this to identify features that exhibit
high deviation. Alerts are generated when anomaly scores are [5] M. Bishop, S. Engle, S. Peisert, S. Whalen, and C. Gates. We have
deemed to be over a particular threshold, measured as the met the enemy and he is us. In Proc. of the 2008 workshop on New
standard deviation from the normalized anomaly scores. From security paradigms (NSPW’08), Lake Tahoe, California, USA, pages
1–12. ACM, September 2008.
an alert, the analyst can visualize how the user differs from
[6] F. L. Greitzer and R. E. Hohimer. Modeling human behavior to
their normal behaviour, or from other users, using a range anticipate insider attacks. Journal of Strategic Security, 4(2):25–48,
of visualization techniques. We demonstrate this approach for 2011.
a variety of synthetically-generated insider threat scenarios, both from our own development and from CMU-CERT, and find that the system performs well at identifying these attacks across the range of anomaly metrics considered. Clearly, by the very nature of an insider threat, the individual in question is purposely attempting to stay below the radar, and so guaranteeing 100% detection success is difficult, since there may be attacks that the designers of the detection system did not consider.

Our future work is to explore the notion of model evolution, and how multiple detection models could operate in parallel. In our current architecture, we have shown the process of refining the current model, but what if the analyst chose to maintain both models and compare the two? The analyst would then need to be able to assess the performance of each model over time, to decide whether it is worth utilising all models, or whether some models should be discarded. There are also organisation-dependent characteristics that may need to be considered; however, the approach described is designed to be flexible to the forms of data that different organisations may collect. We are currently conducting experiments with a large real-world organisation to see how effective the tools can be when studying real users, and in particular, the differences between real normal behaviour and real threats. We are also exploring whether decomposition to different levels of dimensionality can improve the precision of the detection system, to further alleviate analyst effort. What is very clear, however, is that organisations recognise that real threats exist, and that systems such as this could well detect them and reduce the effort required of organisational security analysts.
ACKNOWLEDGEMENTS

This research was conducted in the context of a collaborative project on Corporate Insider Threat Detection, sponsored by the UK National Cyber Security Programme in conjunction with the Centre for the Protection of National Infrastructure, whose support is gratefully acknowledged. The project brings together three departments of the University of Oxford, the University of Leicester and Cardiff University.
IEEE SYSTEMS JOURNAL, VOL. -, NO. -, SEPTEMBER 2015
Sadie Creese is Professor of Cybersecurity in the Department of Computer Science at the University of Oxford. She is Director of Oxford's Cyber Security Centre, Director of the Global Centre for Cyber Security Capacity Building at the Oxford Martin School, and a co-Director of the Institute for the Future of Computing at the Oxford Martin School. Her research experience spans time in academia, industry and government. She is engaged in a broad portfolio of cyber security research spanning situational awareness, visual analytics, risk propagation and communication, threat modelling and detection, network defence, dependability and resilience, and formal analysis. She has numerous research collaborations with other disciplines and has been leading inter-disciplinary research projects since 2003. Prior to joining Oxford in October 2011, Creese was Professor and Director of e-Security at the University of Warwick's International Digital Laboratory. Creese joined Warwick in 2007 from QinetiQ, where she most recently served as Director of Strategic Programmes for QinetiQ's Trusted Information Management Division.