Data Analytics Project Report
John Wenskovitch, Sierra Merkes, Lata Kodali, Abhinuv Pitale, Mai Dahshan
Abstract— This document discusses our findings during the course of investigating two datasets. During our exploration, we made
use of a variety of programming languages and packages, including Matlab, R, Gephi, Bash, Python, Java, and occasionally Excel
(pivot tables are great!). We used our own laptop and desktop computers, as well as the high-performance computational resources
provided by the Advanced Research Computing (ARC) group. We used a variety of analytical models to explore the data files, and we also wrote our own custom searching, filtering, and analysis code.
This document is broken down into the following sections: Section 1 describes the storylines and scenarios that we found
in both datasets. Section 2 describes how we made these findings, beginning with a discussion of data handling and variable
creation. Section 3 discusses the models that we created, including methods of aggregation, explanatory models that we used, and
justifications for using those models. Section 4 discusses computational issues that we encountered during our exploration. Finally,
Section 5 discusses the importance of our results and conclusions.
1 STORYLINES
In this section, we summarize the scenarios that we discovered within
each dataset. The methods used to uncover these scenarios are described
in more detail in subsequent sections.
1.1 Dataset 1 – ACME
As we began to explore this dataset, we were biased towards looking
for an evil plot of corporate subterfuge or a scenario along those lines.
Under the rationale that most nefarious behavior would take place at
night when fewer employees were present in the office, we focused our search on the after-hours behavior of the employees. We considered “after hours” to be after 10:00 PM and before 5:00 AM, while “late night” was considered to be between 12:00 and 4:00 AM.
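To make these time windows concrete, a small pandas sketch of the kind of flag we used is shown below; the file name and column names are placeholders rather than the actual ACME schema:

import pandas as pd

# Assumed device-connection log: one row per event with an employee id ("user")
# and a timestamp ("date"); both names are illustrative.
events = pd.read_csv("device.csv", parse_dates=["date"])
hour = events["date"].dt.hour

events["after_hours"] = (hour >= 22) | (hour < 5)   # 10:00 PM - 5:00 AM
events["late_night"] = (hour >= 0) & (hour < 4)     # 12:00 - 4:00 AM

late = events[events["late_night"]]
print(late.groupby("user").size().sort_values(ascending=False).head(10))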
Among some other bizarre behavior (such as senior manager Fulton K. Rojas, who worked a 22-hour shift but primarily surfed the web all night and throughout the day), we identified some employees who connected devices very late at night. Listed in Table 1 are individuals with such activity. The first four employees listed in the table leave the company, while the other two do not.

Fig. 1: A PCA plot of ACME employees, highlighting the six employees identified in Table 1.
Fig. 2: The spread of “Prince” throughout the company by email. From left to right, this figure shows the spread of “Prince” on Day 1, at
approximately Day 100, on the final day, and highlighting the most responsible departments on the last day. Networks generated with Gephi.
Fig. 6: PCA plot of psychometric scores for all IT Admins. Nothing stands out psychometrically about the IT Admins who used keyloggers relative to the full IT Admin population.

Fig. 5: The behavior pattern of our suspicious IT Admins before they left the company, including the applicable files in the DTAA dataset.

and then resigned.

ID       Name                 File Accessed    PC Used
CSC0217  Cathleen S. Craig    06/10/10 15:20   PC-6377
GTD0219  Guy T. Daniel        06/17/10 15:14   PC-6425
JGT0221  Jonathan G. Terry    07/15/10 15:20   PC-2948
JTM0223  Jerry T. McCall      07/22/10 15:11   PC-9681
BBS0039  Bevis B. Sheppard    08/12/10 14:54   PC-9436
BSS0369  Brenden S. Shaffer   09/30/10 16:10   PC-3672
MPM0220  Meghan P. Macias     11/04/10 15:19   PC-2344
MSO0222  Medge S. O’Brien     12/09/10 15:23   PC-2524
JLM0364  Jacqueline L. Miles  04/28/11 16:06   PC-3791

Table 4: IT Admins who put keyloggers on their supervisor’s PC

looking for new jobs. We see that multiple emails have been sent from their personal email to various companies containing keywords such as “resume,” “experience,” and “passion,” which led us to conclude that they are job searching. This is particularly threatening to the company, as they appear to have installed a keylogger on their supervisor’s PC before they left the company, enabling them to receive and leak out company secrets and data.
We used PCA again to project the psychometric scores of all IT Admins (Figure 6), highlighting those who used a keylogger on their supervisor as well as those who left the company. We do not see much of a difference between the groups in the projection, suggesting that psychometric scores alone are not a good measure for detecting which individuals would likely become disgruntled in the future.

1.2.3 Part 3 – WikiLeaks: The Real Story about DTAA
Continuing the theme of information leaking from DTAA, we found that a number of employees visited a webpage titled “The Real Story about DTAA” hosted on WikiLeaks. None of the 30 employees who accessed this page stayed with the company for the entire 17 months. Indeed, these employees departed the company between July 2010 and March 2011, roughly the same timespan as their visits to the webpage, and on average they left the company 20 days after visiting WikiLeaks.
Some of the keywords listed as associated with the WikiLeaks URL are concerning, including “subterfuge,” “clandestine,” “forgery,” and “lie,” among others. Though we have no direct evidence to support either hypothesis, we consider the possibility that either (1) these employees discovered some awful truth about DTAA via this WikiLeaks page and resigned, or (2) they themselves contributed their own knowledge of company activities to this WikiLeaks page.

2 DATA SUMMARIZATION
In this section, we describe the methods that we followed for loading and manipulating the data, as well as justify the new variables that we created as we explored and aggregated the data.

2.1 Data Handling
In initial explorations of the data, we loaded the provided data files into a variety of programs, including Notepad++, R, and Matlab. These programs enabled us to view individual records and to begin to locate interesting observations and attributes in the data. When we detected something that appeared to be worth exploring, we began to filter and sort the data to investigate further. For smaller files, this was computationally feasible using these tools. For larger files, command-line programs such as grep were useful for filtering based on keywords that we wanted to investigate. For our DTAA storylines, we were able to use grep to quickly and efficiently locate website visits in the http info file that contained the “keylogger” keyword or the “wikileaks.org” URL.
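A minimal Python equivalent of this grep-style filtering is sketched below; it is an illustration rather than the exact commands we ran, and the file name, delimiter, and keyword list are placeholders:

import csv

# Assumed layout of the http info file: one website visit per row, with some
# free-text field containing the visited page's keywords.
KEYWORDS = ("keylogger", "wikileaks.org")

with open("http_info.csv", newline="", encoding="utf-8") as src, \
     open("http_info_filtered.csv", "w", newline="", encoding="utf-8") as dst:
    reader, writer = csv.reader(src), csv.writer(dst)
    writer.writerow(next(reader))            # copy the header row
    for row in reader:
        text = " ".join(row).lower()         # match anywhere in the record, like grep
        if any(keyword in text for keyword in KEYWORDS):
            writer.writerow(row)

Reading and writing one row at a time keeps the memory footprint small, which matters for files of this size.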
After these initial searches, we began to explore the datasets more deeply. Rather than trying to find storylines with the provided collection of individual files, we worked to create a single master file for each dataset that aggregated and stored all useful information (both variables that were provided to us and variables that we created). This process was considerably easier for the ACME data than for the DTAA data, as the resulting size of the aggregate ACME file was not much larger than the initially-provided http info file. These master files gave us more power in locating relationships within the data, such as finding ownership links between employees and their PCs (and identifying shared PCs).
In the case of DTAA, we aggregated some information within the individual files before combining them into a single master file. For example, we aggregated the keywords listed in each individual record in http info, computing a frequency for each keyword by date and employee. Seeing that this file was still quite large, we filtered to only the top 10 keywords aggregated for each date and employee. This more manageable information was then included in the master file. In both datasets, we combined the provided monthly employee files into a single aggregated employee record, tracking the number of months that each user was employed by the company and the month that they left the company (if applicable).
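The keyword aggregation described above is straightforward to sketch with pandas; the file name and the date, user, and keyword column names below are assumptions standing in for the actual http info layout:

import pandas as pd

# One row per website visit, with a date, an employee id, and a keyword.
visits = pd.read_csv("http_info.csv", usecols=["date", "user", "keyword"])

# Frequency of each keyword per date and employee.
freq = (visits.groupby(["date", "user", "keyword"])
              .size()
              .reset_index(name="count"))

# Keep only the 10 most frequent keywords for each date/employee pair.
top10 = (freq.sort_values("count", ascending=False)
             .groupby(["date", "user"])
             .head(10))

top10.to_csv("keyword_top10.csv", index=False)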
2.2 Variable Creation
The master files that we created for each dataset incorporated a number of variables that were not present in the provided data files. Some of these variables were quite straightforward, such as separating the timestamp field into the individual hour, minute, etc. components. Other variables took a bit more computational effort to create, including computing logon duration and finding logon events that were not followed by an accompanying logoff. These variables were initially created in the ACME master file, but some were duplicated for DTAA as well.
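To make the logon-duration idea concrete, a small pandas sketch follows; the file name, the column names (user, pc, date, activity), and the "Logon"/"Logoff" labels are placeholders rather than the actual schema:

import pandas as pd

events = pd.read_csv("logon.csv", parse_dates=["date"])
events["hour"] = events["date"].dt.hour            # simple timestamp components
events["minute"] = events["date"].dt.minute
events = events.sort_values(["user", "pc", "date"])

durations, unmatched = [], []
for (user, pc), grp in events.groupby(["user", "pc"]):
    open_logon = None
    for _, row in grp.iterrows():
        if row["activity"] == "Logon":
            if open_logon is not None:
                unmatched.append(open_logon)       # logon never followed by a logoff
            open_logon = row
        elif row["activity"] == "Logoff" and open_logon is not None:
            durations.append({"user": user, "pc": pc,
                              "duration": row["date"] - open_logon["date"]})
            open_logon = None
    if open_logon is not None:
        unmatched.append(open_logon)               # session still open at end of data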
Specific to the DTAA dataset, we created a number of aggregation variables in addition to these straightforward ones. For example, some variables denote whether an email or website is “prince” related, as well as the network of users sending and receiving emails (if the email was sent to someone in the company).
Adding these aggregation variables to our master files provided more information to our analysis than using only the individual records, because they enable us to better see patterns in the observations. Understanding areas of more frequent or outlying activity provided hints for where we should focus our current and future investigations.
One possible weakness of our approach of duplicating our ACME variables into the DTAA data was that it biased our initial DTAA exploration. As a result, there was also some bias in the new variables that we created for DTAA. This bias is partly why we did not find most of our DTAA scenarios until shortly before the deadline. We could have potentially improved our DTAA exploration and variable creation by treating it as a new dataset from the beginning, rather than trying to mimic our approach to the ACME data.
Another potential improvement to our variable creation would have been to standardize our variable notation across our group, especially because we started our exploration by each taking ownership of a single data file. The variety of personal notation preferences yielded variables called “emp,” “emp id,” and “EmpId” for the same attribute. Minor conflicts then arose when joining tables later in the exploration process and when sharing among group members.

3 DATA MODELING
In this section, we describe the modeling strategies that we implemented and applied on each dataset. These strategies include aggregating data, applying explanatory and predictive modeling techniques, and justifying the appropriateness of our strategies.

3.2 Explanatory/Predictive Modeling
Initially, we tried to visualize the data across many dimensions (with both supplied and created variables for our datasets). We decided to present PCA plots in this paper for ease of explanation, but we also explored the use of k-means and other clustering algorithms on the DTAA employee psychometric scores. For the psychometric scores alone, we found that there is no inherent structure across all DTAA employees, making these clustering algorithms ineffective for that avenue of exploration. We instead chose smaller subsets of the variables on which to perform k-means. One such subset involved looking at infected files and emails. The resulting clustering found the group of employees who send emails and never access files and further classified those who do, but this failed to capture previously unknown information. In addition to the PCA plots presented here, we generated dozens of others during our explorations, using either smaller groups of observations or smaller collections of dimensions.
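Our psychometric clustering and projection can be sketched as follows; the file name, the five score column names, and the choice of four clusters are illustrative assumptions rather than the settings we report:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

psych = pd.read_csv("psychometric.csv")
score_cols = ["openness", "conscientiousness", "extroversion",
              "agreeableness", "neuroticism"]

X = StandardScaler().fit_transform(psych[score_cols])   # common scale for all scores
coords = PCA(n_components=2).fit_transform(X)           # 2-D projection for plotting
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

psych["pc1"], psych["pc2"] = coords[:, 0], coords[:, 1]
psych["cluster"] = labels

Plotting pc1 against pc2 and coloring points by cluster, or by any indicator variable such as who left the company, gives projections in the style of Figures 1 and 6.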
After visualizing the data and mining the emails and URLs for relevant information, we wanted to better characterize groups of employees. For example, consider the group of employees who left each company: were these individuals all fired, did they all resign, or was there some combination of both cases? As we developed our storylines, we created a variety of indicator variables in an attempt to better inform our predictive models. In trying to predict which other employees were at risk of leaving, we decided to use a logistic LASSO with 10-fold cross validation for the binary response (left or did not leave the company). We selected this approach based upon discussions from the lecture content, since we knew that not all of our selected variables were going to be relevant to employee departure.
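A minimal scikit-learn version of this fit is sketched below; the master file name and the assumption that every column other than the employee id and the response is a candidate predictor are placeholders (our own fits could equally well have been run in R):

import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

master = pd.read_csv("dtaa_master.csv")
feature_cols = [c for c in master.columns if c not in ("employee_id", "left")]
X, y = master[feature_cols], master["left"]

# L1-penalized (LASSO) logistic regression; the penalty is chosen by 10-fold CV.
model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=20, cv=10, penalty="l1",
                         solver="saga", max_iter=5000),
)
model.fit(X, y)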
Another important group that we focused on was the DTAA IT Admins. Since they were overworked, some were disgruntled enough to react and leave the company. When looking at a projection of the psychometric scores for groups of IT Admins, we found nothing peculiar (as seen in Figure 6). However, when used in conjunction with other variables, some of these psychometric scores aided in prediction for our models.
We still find our logistic LASSO to be much more reliable. We show a confusion matrix for the logistic LASSO in Table 6.

         Stayed   Left
Stayed      845      0
Left         25    130

Table 6: Confusion Matrix from Logistic LASSO

Note that the employees who leave around months 14 through 16 may have similar behaviors as those who stay the full 17 months. Hence, some of those who stay the entire duration are much more difficult to classify correctly. Important variables for this classification model include most of the psychometric score coefficients (extrovert being the exception), along with many of the activity variables for logons and receiving emails. Many of the “prince”-associated variables that we created were not deemed important in determining whether or not an employee would remain with the company.
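The variable-importance statement above can be read straight off the fitted model. Continuing the earlier (hypothetical) scikit-learn sketch from Section 3.2, the coefficients that survive the L1 penalty are the variables the LASSO kept:

import pandas as pd

# `model` and `feature_cols` come from the logistic LASSO sketch in Section 3.2.
clf = model.named_steps["logisticregressioncv"]
coefs = pd.Series(clf.coef_.ravel(), index=feature_cols)
kept = coefs[coefs != 0].sort_values(key=abs, ascending=False)
print(kept)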
4 COMPUTATIONAL ISSUES
In this section, we discuss the computational issues related to the size and complexity of the data files used in each dataset. These considerations include how we solved challenges related to data scale and the variety of files, as well as an assessment of the strengths and weaknesses of our modeling choices.

4.1 Computational Considerations and Demands
We have already discussed our aggregation strategies for dealing with large files (primarily the http files in both datasets). These strategies certainly helped to address the computational challenges of dealing with these large files. In addition, we were able to make use of the ARC resources in order to more quickly perform computations on the large data files and our aggregate global datasets. These resources provided a substantial boost in speed over our personal computers. For example, our aggregation code for DTAA’s http info was processing one day of website visits in roughly 5 minutes on John’s laptop, but accelerated to one day of website visits in roughly 15 seconds on ARC. One difficulty related to using ARC resources was the challenge of remote access to these clusters. Connecting via a VPN from the other side of the planet presented substantial latency issues, nearly to the point of unusability.
Designing multi-threaded code provided a further speed increase, especially when running code in the ARC environment. Despite the additional complications and debugging involved in ensuring that the code was correct, the performance increase was worthwhile when processing large files. For example, we first worked to aggregate the content of the DTAA http info file by date and user in a single thread. After getting a sense for how long that aggregation would take to execute, we preprocessed the http info file, separating it into one file for each of the 1,000 employees. Then, we updated the code to aggregate keywords for each employee in a separate thread, allowing us to make use of multi-core desktops and the ARC clusters.
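The per-employee parallelism can be sketched in Python with a process pool; our actual implementation was multi-threaded and not necessarily in Python, and the directory layout and column names here are assumptions:

from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
import pandas as pd

def aggregate_one(path):
    # One pre-split file per employee, e.g. by_employee/AAA0001.csv.
    visits = pd.read_csv(path)
    freq = (visits.groupby(["date", "keyword"]).size()
                  .reset_index(name="count"))
    freq["user"] = path.stem
    return freq

if __name__ == "__main__":
    files = sorted(Path("by_employee").glob("*.csv"))
    with ProcessPoolExecutor() as pool:      # one worker per available core
        parts = list(pool.map(aggregate_one, files))
    pd.concat(parts).to_csv("keywords_by_employee.csv", index=False)

The sketch uses processes rather than threads because CPU-bound work in Python generally scales better across processes.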
Using Gephi for network graphs was also quite useful, as this software package contains utilities such as filtering, timeline preview, and categorization by label. It also performed efficiently, despite the 1,000-node, 1,300,000-edge graph of email communication that we supplied.
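Gephi can import a plain Source/Target edge list through its spreadsheet importer, so producing the graph it consumes takes only a few lines; the email file name, column names, and the ";" address separator below are assumptions:

import pandas as pd

email = pd.read_csv("email.csv", usecols=["from", "to"])

# One edge per sender/recipient pair; the "to" field may list several recipients.
edges = (email.assign(to=email["to"].str.split(";"))
              .explode("to")
              .rename(columns={"from": "Source", "to": "Target"}))
edges["Target"] = edges["Target"].str.strip()

edges.to_csv("email_edges.csv", index=False)   # load in Gephi as an edge table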
4.2 Computational Modeling Choices
Our choices of models were impacted by how we aggregated and reduced all of the information provided in both the ACME and especially the DTAA datasets. Since our ideas about the stories themselves were finalized shortly before the deadline, the amount of time available to us to run and refine models was greatly reduced. Rather than fitting a more complicated model before downsizing the data, or fitting an over-fitted model that may not predict well, we used reliable approaches such as a logistic LASSO and CART. The datasets that we ran models on were aggregated by employee, and thus contained 1,000 observations. We could easily run models on datasets of this size and still learn that the employees who stay the whole 17 months are difficult to classify correctly when their behavior is similar to that of those who left the company.
We note that at first we improperly fit models on the raw counts, since we did not account for the number of days active. We also tried to classify those who would be infected with the “prince” virus, but this also produced slightly high prediction error rates. A weakness here is not fitting models to the datasets aggregated by day, but with the remaining time it was difficult both to wrap up the newly discovered stories and to find appropriate variables to model. Ideally, we would love to understand the data in its raw form, but the sheer size of the DTAA dataset made this nearly impossible.

5 DISCUSSION AND CONCLUSIONS
In this section, we evaluate the importance of our results to each of the companies, as well as discuss some lessons that we learned while exploring the datasets and completing this project.

5.1 Importance of Results
Our results from exploring the ACME data provide hints towards evaluating the behavior of employees to uncover odd or unusual events, as well as employees who are logging hours without working (as in the case of the 22-hour stint of web surfing). Having the ability to locate and eventually correct odd employee behavior will result in a stronger company overall.
DTAA can use our results to improve their company, especially in planning future enforcement of data management to prevent leaks, as well as better information security policies. We showed that the company suffered a cyber attack wherein its computers were infected by malware that corrupted files and email contents. This “prince” malware spread rapidly throughout the company, primarily via email. The spread likely could have been avoided by using a secure mail client, such as Outlook or Proton Mail, that checks for malicious content.
In reference to the outbreak of the “prince” malware, it appeared that the IT Admins were trying to contain it (Figure 2). However, they were not able to completely contain the infection. We suspect this is the reason the IT Admins were overworked, leading to their frustration and retaliation. We recommend that DTAA check their staff’s working hours and gather monthly feedback to estimate employee satisfaction. Also, IT Admins were able to access suspicious websites and download malicious software. This could have been prevented by using trusted anti-virus software and logging such incidents for review by upper management. They were further able to upload these keyloggers to their supervisors’ computers. This is a serious threat to the company, as the supervisors’ data could be leaked and/or corrupted. All of this could have been avoided with stronger security and data encryption tools.

5.2 Project Lessons Learned
Despite working on these datasets for more than two months, nearly all of our best ideas and findings came in the last two days while writing this report, some even in the final 10 hours. This resulted in a massive rewrite of this document in the final hours before the deadline. In addition to demonstrating the benefits of last-minute panic, this shows that moments of inspiration can occur at any time when exploring the data, even at the last moment.
As noted previously, we felt that our exploration of the DTAA dataset was initially biased towards the approach we followed on the ACME dataset. Because the storylines within the companies and datasets were quite different, this caused us issues with detecting the scenarios that we report in this paper.
Lastly, it is important to have items set up even when not all the pieces are finished. When group members have varying schedules and other deadlines to meet, it can be hard to have all the pieces needed in order to analyze something. Having the code ready to go when those pieces are in place would have saved some time.

ACKNOWLEDGMENTS
The authors wish to thank our guiding lighthouses who assisted us in these findings.
APPENDIX
The overseeing of this project was managed very carefully by the lovely
daughter of Mai, Ms. Jana ElFishawy. Without this little girl’s patience,
we would have never been able to finalize our stories or analyze the
emails dataset in those very tiring and long weekend and weeknight
meetings.