Data Analytics Project Report

Team J-SLAM Findings

John Wenskovitch, Sierra Merkes, Lata Kodali, Abhinuv Pitale, Mai Dahshan

Abstract— This document discusses our findings during the course of investigating two datasets. During our exploration, we made
use of a variety of programming languages and packages, including Matlab, R, Gephi, Bash, Python, Java, and occasionally Excel
(pivot tables are great!). We used our own laptop and desktop computers, as well as the high-performance computational resources
provided by the Advanced Research Computing (ARC) group. We used a variety of analytical models to explore the data files, and we wrote our own custom searching, filtering, and analysis code.

This document is broken down into the following sections: Section 1 describes the storylines and scenarios that we found
in both datasets. Section 2 begins to describe how we made these findings, beginning with a discussion of data handling and variable
creation. Section 3 discusses the models that we created, including methods of aggregation, explanatory models that we used, and
justifications for using those models. Section 4 discusses computational issues that we encountered during our exploration. Finally,
Section 5 discusses the importance of our results and conclusions.

1 STORYLINES
In this section, we summarize the scenarios that we discovered within
each dataset. The methods used to uncover these scenarios are described
in more detail in subsequent sections.
1.1 Dataset 1 – ACME
As we began to explore this dataset, we were biased towards looking
for an evil plot of corporate subterfuge or a scenario along those lines.
Under the rationale that most nefarious behavior would take place at
night when fewer employees were present in the office, we focused on
examining the after-hours behavior of the employees. We considered "after hours" to be after 10:00 PM and before 5:00 AM, while "late night" was considered to be between 12:00 and 4:00 AM.
Among some other bizarre behavior (such as senior manager Fulton K. Rojas, who worked a 22-hour shift but primarily surfed the web
all night and throughout the day), we identified some employees who
connect devices very late at night. Listed in Table 1 are individuals with such activity. The first four employees listed in the table leave the company, while the other two do not.

ID        Name                 Role           Late Connects
GML0105   Germaine M. Lyons    Technician     5/9
CTR0537   Candice T. Ramos     Admin. Staff   3/4
LKY0181   Leroy K. York        Engineer       4/5
AFF0760   Aubrey F. Foster     Foreman        3/3
JSE0020   Jane S. Eaton        Tradesman      3/3
MCH0530   Matthew C. Hayes     Janitor        6/6

Table 1: Employees with late night device connection activity.

We found such activity suspicious since these employees rarely ever connect devices, yet some leave the company while others do not. A potential explanation is that some employees may not have been caught doing suspicious late night activity, while others had been caught and fired. Behavior related to after-hours connects and disconnects varies from user to user: some log in for a short time and connect the device briefly, while others may leave the device connected for hours. Neither their web nor their login activity appeared suspicious, and they only use their personal PCs to connect the devices. Figure 1 shows a Principal Component Analysis (PCA) plot of all 1,000 employees, using aggregated counts of logons, connects, late night connects, website visits, and months worked as inputs. We see that the individuals who left the company (in blue) exhibit somewhat different behavior than the rest of the employees, who stayed (in red).

Fig. 1: A PCA plot of ACME employees, highlighting the six employees identified in Table 1.

1.2 Dataset 2 – DTAA

We noticed early on in both datasets that a number of employees left each company during the course of the provided 17 months of information. Interestingly, the pattern of employees leaving was different between the two companies. At ACME, 6.5% of employees left the company, and the departures appeared to be randomly distributed. In contrast, 15.5% of employees left DTAA, with an evident increase in the departure rate between March and November 2010 (Figure 3).

Why could this be important? We noted that employees at DTAA worked much more often on weekends and holidays than the employees at ACME. In fact, no one at ACME worked on the weekends outside of Friday late night shifts. The roles are more varied at DTAA than at ACME, but for similar roles there are some differences. When comparing the activity of IT Admins, we found some stark contrasts. All of the IT Admins at ACME stay with the company the entire time, and all exhibit similar behaviors: they all log on to many different PCs, have late night activity, and only one connects devices. However, at DTAA, 42% of IT Admins leave the company, and their behaviors are not as consistent as in the first dataset.

1.2.1 Part 1 – The Spread of "Prince"

Throughout the 17 months, we noted that emails containing repeated instances of the word "prince" spread throughout the company. This appears to be a virus or some other sort of malware propagating through the company. Figure 2 shows several views of the spread of this word over time. Initially, the infection was contained within the Sales and Marketing department, which is understandable since the majority of emails are internal to departments. However, the "prince" virus eventually began to spread to other departments, with all but 18 employees sending an infected email at some point. Indeed, these 18 individuals never receive emails infected with "prince," suggesting the infection lies within the organization.
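Tracking who has sent an infected email, and when each employee first did so, is a small bookkeeping exercise. A minimal sketch of that tally (the email log layout here is an assumption, not the actual file format):

```python
from collections import defaultdict

def infection_timeline(email_rows):
    """Given (date, sender, content) rows sorted by date, return the
    cumulative count of distinct users who have sent a 'prince' email."""
    first_seen = {}                  # employee id -> first infected-send date
    for date, sender, content in email_rows:
        if "prince" in content.lower() and sender not in first_seen:
            first_seen[sender] = date
    per_day = defaultdict(int)
    for date in first_seen.values():
        per_day[date] += 1           # newly infected senders on this date
    total, cumulative = 0, {}
    for date in sorted(per_day):
        total += per_day[date]
        cumulative[date] = total
    return cumulative

rows = [
    ("2010-01-04", "A", "re: prince prince"),
    ("2010-01-04", "B", "lunch?"),
    ("2010-01-05", "B", "fwd: prince"),
    ("2010-01-06", "A", "prince again"),
]
print(infection_timeline(rows))  # {'2010-01-04': 1, '2010-01-05': 2}
```

The same first-seen dates also give the "new users infected per day" series discussed below.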
Fig. 2: The spread of “Prince” throughout the company by email. From left to right, this figure shows the spread of “Prince” on Day 1, at
approximately Day 100, on the final day, and highlighting the most responsible departments on the last day. Networks generated with Gephi.
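We generated the network figures themselves in Gephi; the weighted edge list that Gephi imports can be produced with a few lines. A sketch, with the email record layout assumed (sender plus a list of recipient ids):

```python
from collections import Counter

def gephi_edge_list(emails):
    """Aggregate (sender, recipient) pairs into weighted directed edges,
    formatted as the CSV that Gephi's spreadsheet importer expects."""
    weights = Counter()
    for sender, recipients in emails:
        for rcpt in recipients:
            if rcpt != sender:                # drop self-loops
                weights[(sender, rcpt)] += 1
    lines = ["Source,Target,Weight"]          # Gephi edge-table header
    for (src, dst), w in sorted(weights.items()):
        lines.append(f"{src},{dst},{w}")
    return "\n".join(lines)

emails = [("A", ["B", "C"]), ("A", ["B"]), ("C", ["A"])]
print(gephi_edge_list(emails))
```

Writing this string to a `.csv` file and importing it as an edge table reproduces the kind of graph shown in Figure 2.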

Functional Unit   FD   Infected   Total   Prince %
Purchasing        61   2897       11627   0.25
Manufacturing     32   2080       8949    0.23
Manufacturing     31   480        2553    0.19
Finance           43   96         1212    0.08
Sales             52   3526       66543   0.05

Table 3: Counts and proportions of files infected with "prince."

Fig. 3: Employee departures by month at each company.
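The per-month departure tallies behind a figure like Fig. 3 fall directly out of the aggregated employee records. A minimal sketch, assuming each record stores the month of departure (None for employees who stayed):

```python
from collections import Counter

def departures_by_month(employees):
    """Count departures per month; `employees` maps id -> month left (or None)."""
    return Counter(m for m in employees.values() if m is not None)

staff = {"AAA0001": "2010-03", "BBB0002": None,
         "CCC0003": "2010-03", "DDD0004": "2010-11"}
print(departures_by_month(staff))  # Counter({'2010-03': 2, '2010-11': 1})
```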

Table 2 shows the number of emails sent and received containing "prince" by functional unit and department (FD). We look at the top four such units by raw count of infections and then by proportions. These counts and proportions confirm that the functional units highlighted in the final-day network graphic in Figure 2 are contributing the most to the spread of this virus. We note that the Sales group has a high raw number of email communications and count of "prince"-infected emails, but proportionally it is not as high as the Manufacturing groups. In contrast, when we look at the top five units who access files infected with "prince" (Table 3), we notice that the Sales group is not the highest proportionally, but is in raw counts of accessing infected files.

Functional Unit            FD   Send           Rec.            Send %   Rec. %
Manufacturing              31   11725/26979    11087/40754     0.43     0.27
Purchasing and Contracts   61   9543/24495     8967/33672      0.39     0.27
Manufacturing              32   12353/33682    11476/50511     0.37     0.23
Sales                      52   69124/744229   65615/1038843   0.09     0.06

Table 2: Counts and proportions of emails infected with "prince."

We also examined the spread of the virus using the number of employees infected per day (Figure 4). We see this count grow quickly in the first month or so, and before the midway point almost all of the company is infected by the virus except for those 18 employees.

Fig. 4: (Left) Cumulative infected users per day. The horizontal line shows the cap at 982 infected employees. (Right) New users infected per day. The plot shows the number of unique new users infected per day with a curve fitted.

When we break this down further and examine only newly infected users, we see something somewhat surprising. The infection spreads quickly in the beginning weeks, but the number of newly infected users decreases for the remaining months. An exponential curve fitted to this data has a negative exponent (-0.78), which suggests that an attempt was made to contain the infection. Who better to tackle such a problem than the IT Admins? These employees also display some interesting behavior, and we next discuss the IT Admins working for DTAA.

1.2.2 Part 2 – Nine Angry IT Administrators

Within the file info data, we discovered nine users (listed in Table 4) who accessed an executable file that contained suspicious content. The possibility of a malicious program naturally alarmed us, and our investigation was furthered by the fact that these programs appear to contain keyloggers (each of these employees visited websites with "keylogger" as a keyword in http info). After combining information from across the collection of data files, we discovered that these nine users were all IT administrators, each having significantly less device connection activity than their colleagues. When examining their email communication as well as file contents, we found a strange pattern of behavior (shown in Figure 5) that repeatedly occurred only a few days before each left the company.

Our analysis of this behavior is that the IT Admins were probably overworked due to the spread of "prince," and their workload continued to increase as a result. One of the major complaints seen in their emails is that they were made to work on weekends and holidays as well. This appears to have caused them to retaliate against the company. Additionally, further investigation of the email info file shows that these IT Admins were looking for new jobs.
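An email keyword scan of this kind can be sketched in a few lines. This is a hedged illustration (the record layout, helper names, and threshold are assumptions), not the code we actually ran:

```python
JOB_KEYWORDS = {"resume", "experience", "passion"}

def job_search_candidates(emails, min_hits=2):
    """Flag senders whose outgoing mail repeatedly uses job-search keywords.
    `emails` is an iterable of (sender, text) pairs."""
    hits = {}
    for sender, text in emails:
        words = set(text.lower().split())
        if words & JOB_KEYWORDS:                 # any keyword present
            hits[sender] = hits.get(sender, 0) + 1
    return sorted(s for s, n in hits.items() if n >= min_hits)

emails = [
    ("CSC0217", "please find my resume attached"),
    ("CSC0217", "I have ten years of experience"),
    ("AAA0001", "weekly status report"),
]
print(job_search_candidates(emails))  # ['CSC0217']
```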
Fig. 5: The behavior pattern of our suspicious IT Admins before they left the company, including the applicable files in the DTAA dataset.

Fig. 6: PCA plot of psychometric scores for all IT Admins. Nothing stands out psychometrically about the IT Admins who used keyloggers relative to the full IT Admin population.
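A PCA projection of this kind can be sketched directly from the singular value decomposition. This is an illustration on toy data, not the script that produced our figures; the feature columns mirror the aggregated counts described in Section 1.1:

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)              # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T      # scores in the principal-component basis

# toy matrix: logons, connects, late-night connects, site visits, months worked
X = np.array([[30.0, 1, 0, 200, 17],
              [28.0, 2, 0, 180, 17],
              [25.0, 9, 6, 150, 9],
              [31.0, 1, 0, 210, 17]])
scores = pca_project(X)
print(scores.shape)  # (4, 2)
```

In practice the features should also be standardized before the decomposition, since the raw counts are on very different scales.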
ID        Name                  File Accessed    PC Used
CSC0217   Cathleen S. Craig     06/10/10 15:20   PC-6377
GTD0219   Guy T. Daniel         06/17/10 15:14   PC-6425
JGT0221   Jonathan G. Terry     07/15/10 15:20   PC-2948
JTM0223   Jerry T. McCall       07/22/10 15:11   PC-9681
BBS0039   Bevis B. Sheppard     08/12/10 14:54   PC-9436
BSS0369   Brenden S. Shaffer    09/30/10 16:10   PC-3672
MPM0220   Meghan P. Macias      11/04/10 15:19   PC-2344
MSO0222   Medge S. O'Brien      12/09/10 15:23   PC-2524
JLM0364   Jacqueline L. Miles   04/28/11 16:06   PC-3791

Table 4: IT Admins who put keyloggers on their supervisor's PC.

We see that multiple emails have been sent from their personal email accounts to various companies containing keywords such as "resume," "experience," and "passion," which led us to conclude that they are job searching. This is particularly threatening to the company, as they appear to have installed a keylogger on their supervisor's PC before they left the company, enabling them to receive and leak company secrets and data.

We used PCA again to project the psychometric scores of all IT Admins (Figure 6), highlighting those who used a keylogger on their supervisor as well as those who left the company. We do not see much of a difference between groups in the projection, suggesting that psychometric scores alone are not a good measure for detecting which individuals are likely to become disgruntled in the future.

1.2.3 Part 3 – WikiLeaks: The Real Story about DTAA

Continuing the theme of information leaking from DTAA, we found that a number of employees visited a "The Real Story about DTAA" webpage hosted on WikiLeaks. None of the 30 employees who accessed this page stayed with the company for the entire 17 months. Indeed, these employees departed the company between July 2010 and March 2011, roughly the same timespan as their visits to the webpage, and on average they left the company 20 days after visiting WikiLeaks.

Some of the keywords listed as associated with the WikiLeaks URL are concerning, including "subterfuge," "clandestine," "forgery," and "lie," among others. Though we have no evidence to support either hypothesis directly, we consider the possibility that either (1) these employees discovered some awful truth about DTAA via this WikiLeaks page and resigned, or (2) they themselves contributed their own knowledge of company activities to this WikiLeaks page and then resigned.

2 DATA SUMMARIZATION

In this section, we describe the methods that we followed for loading and manipulating the data, as well as justify the new variables that we created as we explored and aggregated the data.

2.1 Data Handling

In initial explorations of the data, we loaded the provided data files into a variety of programs, including Notepad++, R, and Matlab. These programs enabled us to view individual records and to begin to locate interesting observations and attributes in the data. When we detected something that appeared to be worth exploring, we began to filter and sort the data to investigate further. For smaller files, this was computationally feasible using these tools. For larger files, command-line programs such as grep were useful for filtering based on keywords that we wanted to investigate. For our DTAA storylines, we were able to use grep to locate website visits that contained the "keylogger" keyword or the "wikileaks.org" URL in the http info file quickly and efficiently.

After these initial searches, we began to explore the datasets more deeply. Rather than trying to find storylines within the provided collection of individual files, we worked to create a single master file for each dataset that aggregated and stored all useful information (both variables that were provided to us and variables that we created). This process was considerably easier for the ACME data than for the DTAA data, as the resulting size of the aggregate ACME file was not much larger than the initially provided http info file. These master files gave us more power in locating relationships within the data, such as finding ownership links between employees and their PCs (and identifying shared PCs).

In the case of DTAA, we aggregated some information within the individual files before combining them into a single master file. For example, we aggregated the keywords listed in each individual record in http info, computing a frequency for each keyword by date and employee. Seeing that this file was still quite large, we filtered to only the top 10 keywords for each date and employee. This more manageable information was then included in the master file. In both datasets, we combined the provided monthly employee files into a single aggregated employee record, tracking the number of months that each user was employed by the company and the month that they left the company (if applicable).

2.2 Variable Creation

The master files that we created for each dataset incorporated a number of variables that were not present in the provided data files. Some of these variables were quite straightforward, such as separating the timestamp field into its individual hour, minute, etc. components.
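Variable creation of this kind is mostly date arithmetic. A hedged sketch of two such variables, the timestamp split and logon durations with unmatched logons flagged (the record layout and timestamp format are assumptions):

```python
from datetime import datetime

def split_timestamp(stamp):
    """Break a raw timestamp string into the component fields we stored."""
    t = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    return {"month": t.month, "day": t.day, "hour": t.hour,
            "minute": t.minute, "weekday": t.weekday()}

def logon_durations(events):
    """Pair each (user, pc) logon with the next logoff; logons that are
    never followed by a logoff are reported with a duration of None."""
    open_sessions, durations = {}, []
    for when, user, pc, action in events:      # events sorted by time
        key = (user, pc)
        if action == "Logon":
            open_sessions[key] = when
        elif key in open_sessions:
            durations.append((user, pc, when - open_sessions.pop(key)))
    for (user, pc), start in open_sessions.items():
        durations.append((user, pc, None))     # unmatched logon
    return durations

events = [
    (datetime(2010, 1, 4, 8, 0), "GML0105", "PC-1", "Logon"),
    (datetime(2010, 1, 4, 17, 30), "GML0105", "PC-1", "Logoff"),
    (datetime(2010, 1, 4, 23, 0), "MCH0530", "PC-2", "Logon"),
]
print(logon_durations(events))
```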
Other variables took a bit more computational effort to create, including computing logon duration and finding logon events that were not followed by an accompanying logoff. These variables were initially created in the ACME master file, but some were duplicated for DTAA as well.

Specific to the DTAA dataset, we created a number of aggregation variables in addition to these straightforward ones. For example, some variables denoted whether an email or website is "prince" related, as well as the network of users sending and receiving emails (if the email was sent to someone in the company).

Adding these aggregation variables to our master files provided more information to our analysis than using only the individual records, because they enabled us to better see patterns in the observations. Understanding areas of more frequent or outlying activity provided hints for where we should focus our current and future investigations.

One possible weakness to our approach of duplicating our ACME variables into the DTAA data was that it biased our initial DTAA exploration. As a result, there was also some bias in the new variables that we created for DTAA. This bias is partly why we did not find most of our DTAA scenarios until shortly before the deadline. We could have potentially improved our DTAA exploration and variable creation by treating it as a new dataset from the beginning, rather than trying to mimic our approach to the ACME data.

Another potential improvement to our variable creation would have been to standardize our variable notation across our group, especially because we started our exploration by each taking ownership of a single data file. The variety of personal notation preferences yielded variables called "emp," "emp id," and "EmpId" for the same attributes. Minor conflicts then arose when joining tables later in the exploration process and sharing among group members.

3 DATA MODELING

In this section, we describe the modeling strategies that we implemented and applied on each dataset. These strategies include aggregating data and applying explanatory and predictive modeling techniques; we also justify the appropriateness of our strategies.

3.1 Data Aggregation

The large data files in both datasets necessitated some aggregation before we were able to merge that content into our master files. Additionally, some of the raw data, such as website keywords and monthly employment records, were not as useful in raw form as they were in aggregated form.

Starting with the employees, we worked in both datasets to combine the individual monthly records into a global employment picture across the 17 months of data. We implemented some custom aggregation code to create records for each employee, capturing the months that they worked, the month that they departed (if applicable), and the total number of months that they appeared in the datasets. This enabled us to focus our investigation in both datasets on employees who left the company during the time period under investigation.

In the DTAA data, we aggregated the keywords in the http info file, summarizing content about the websites that each employee visited by day via the keywords that accompanied each website visit. After aggregating, we were able to sort the keywords by frequency and detect that the "prince" issue was common in the websites as well as the emails, along with some accompanying "anhk" and "ahmose" keywords. We were then able to aggregate by items with and without these words. We aggregated the ACME company data at a minute level, but we used a daily level for DTAA to make the file sizes more manageable.

Using Gephi, we created the network graphs seen in Figure 2 for the emails sent and received by each user. This was useful in displaying the connections between various functional units and departments, as it showed us the spread of "prince" across the DTAA company. The dynamic timeline capability in Gephi enabled us to observe the spread of "prince" over time. We also used word clouds to aggregate word frequencies in subsets of emails, discovering patterns in email groups that led to some of our Angry IT Admin findings.

3.2 Explanatory/Predictive Modeling

Initially, we tried to visualize the data across many dimensions (both with supplied and created variables for our datasets). We decided to present PCA plots in this paper for ease of explanation, but we also explored the use of k-means and other clustering algorithms on the DTAA employee psychometric scores. For psychometric scores alone, we found that there is no inherent structure across all DTAA employees, making these clustering algorithms ineffective for that avenue of exploration. We instead chose smaller subsets of the variables on which to perform k-means. One such subset involved looking at infected files and emails. The clustering algorithm found the group of employees who send emails but never access files, and further subdivided those who do; this failed to capture any previously unknown information. In addition to the PCA plots presented here, we generated dozens of others during our explorations using either smaller groups of observations or smaller collections of dimensions.

After visualizing the data and mining for relevant information from emails and URLs, we wanted to better characterize groups of employees. For example, consider the group of employees who left each company: were these individuals all fired, did they all resign, or was there some combination of both cases? As we developed our storylines, we created a variety of indicator variables in an attempt to better inform our predictive models. In trying to predict which other employees were at risk of leaving, we decided to use a logistic LASSO with 10-fold cross validation for the binary response (left or did not leave the company). We selected this approach based upon discussions from the lecture content, since we knew that not all of our selected variables were going to be relevant to employee departure.

Another important group that we focused on was the DTAA IT Admins. Since they were overworked, some were disgruntled enough to react and leave the company. When looking at a projection of the psychometric scores for groups of IT Admins, we found nothing peculiar (as seen in Figure 6). However, when used in conjunction with other variables, some of these psychometric scores aided in prediction for our models.

3.3 Model Justification

With our binary responses created from our indicator variables, a logistic regression using either a logit or probit link function is suitable. The variables we considered in our models are related to login activity, device activity, file activity, email activity, some web activity (mainly visiting the WikiLeaks page), psychometric scores, and the activities related to the "prince" virus. We used a dataset aggregated by employee and noted the total number of days active as well as an indicator variable of whether or not each employee stayed. Our choice of comparison model was CART, since this model is appropriate for both continuous and categorical variables. To validate each model, we used a 10-fold cross validation process. In particular for CART, we used bagging within our cross validation to find our best tree.

The initial goal with the LASSO and CART was to predict the number of days employed, but we quickly realized that the activity variables were on the incorrect scale. We then used both models to predict whether an employee would leave the company or not. To do this properly, we divided the variables associated with activity during that period by the number of days active. We again validated our models using a 10-fold cross validation scheme. A measure of how incorrectly we predict outcomes is reported in Table 5.

Model   Count of Days Active   Left the Company
LASSO   0.200                  0.033
CART    0.290                  0.007

Table 5: 10-Fold Cross Validation Error Rates

Note the high prediction error rates for the incorrectly scaled data, while the prediction error rates for the latter are much more reasonable. Comparing the prediction errors, CART performs better. However, we are not sure if certain nodes only contain one observation, and so we still find our logistic LASSO to be much more reliable.
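The 10-fold cross validation bookkeeping itself is straightforward to sketch. The stand-in majority-class "model" below is only for illustration of the fold mechanics; our actual models were a logistic LASSO and CART:

```python
import random

def kfold_error_rate(rows, labels, fit, predict, k=10, seed=0):
    """Estimate the misclassification rate with k-fold cross validation."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]        # k roughly equal folds
    wrong = 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        for i in held_out:                       # score on the held-out fold
            wrong += predict(model, rows[i]) != labels[i]
    return wrong / len(rows)

# stand-in "model": predict the majority class from the training labels
fit = lambda X, y: max(set(y), key=y.count)
predict = lambda model, x: model

X = [[i] for i in range(50)]
y = [0] * 40 + [1] * 10                          # 20% minority class
print(kfold_error_rate(X, y, fit, predict))      # 0.2
```

Swapping the two lambdas for real model-fitting and prediction functions yields error rates of the kind reported in Table 5.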
We show a confusion matrix for the logistic LASSO in Table 6.

          Stayed   Left
Stayed    845      0
Left      25       130

Table 6: Confusion Matrix from Logistic LASSO

Note that the employees who leave around months 14 through 16 may have similar behaviors as those who stay the full 17 months. Hence, some of those who stay the entire duration are much more difficult to classify correctly. Important variables for this classification model include most of the psychometric score coefficients (extrovert being the exception) along with many of the activity variables for logons and receiving emails. Many of the "prince"-associated variables that we created were not deemed important in determining whether or not an employee would remain with the company.

4 COMPUTATIONAL ISSUES

In this section, we discuss the computational issues related to the size and complexity of the data files used in each dataset. These considerations include how we solved challenges related to data scale and the variety of files, as well as an assessment of the strengths and weaknesses of our modeling choices.

4.1 Computational Considerations and Demands

We have already discussed aggregation strategies for dealing with large files (primarily the http files in both datasets). These strategies certainly helped to address computational challenges for dealing with these large files. In addition, we were able to make use of the ARC resources in order to more quickly perform computations on the large data files and our aggregate global datasets. These resources provided a substantial boost in speed over our personal computers. For example, our aggregation code for DTAA's http info was processing one day of website visits in roughly 5 minutes on John's laptop, but accelerated to one day of website visits in roughly 15 seconds on ARC. One difficulty related to using ARC resources was the challenge of remote access to these clusters. Connecting via a VPN from the other side of the planet presented substantial latency issues, nearly to the point of unusability.

Designing multi-threaded code further provided a speed increase, especially when running code in the ARC environment. Despite the additional complications and debugging involved in ensuring that the code was correct, the performance increase was worthwhile when processing large files. For example, we first worked to aggregate the content of the DTAA http info file by date and user in a single thread. After getting a sense for how long that aggregation would take to execute, we preprocessed that http info file, separating it into one file for each of the 1,000 employees. Then, we updated the code to aggregate keywords for each employee in a separate thread, allowing us to make use of multi-core desktops and ARC clusters.

Using Gephi for network graphs was also quite useful, as this software package contains utilities such as filtering, timeline preview, and categorization by label. It also performed efficiently, despite the 1,000-node, 1,300,000-edge graph of email communication that we supplied.

4.2 Computational Modeling Choices

Our choices of models were impacted by how we aggregated and reduced all of the information provided in both the ACME and especially the DTAA datasets. Since our ideas about the stories themselves were finalized shortly before the deadline, the amount of time available to us to run and refine models was greatly reduced. Rather than fitting a more complicated model before downsizing the data, or fitting an over-fitted model that may not predict well, we used reliable approaches such as a logistic LASSO and CART. The datasets that we ran models on were aggregated by employees in the company, and thus contained 1,000 observations. We could easily run models on datasets of this size, and still learn that those employees who stay the whole 17 months are difficult to classify correctly when their behavior is similar to those who left the company.

We note that at first we improperly fit models on the raw counts, since we did not account for the number of days active. We also tried to classify those who would be infected with the "prince" virus, but this also produced slightly high prediction error rates. A weakness here is not fitting models to the datasets aggregated by the day, but it was difficult with the remaining time to both wrap up the newly discovered stories and find appropriate variables to model. Ideally, we would love to understand the data in its raw form, but the sheer size of the DTAA dataset made this nearly impossible.

5 DISCUSSION AND CONCLUSIONS

In this section, we evaluate the importance of our results to each of the companies, as well as discuss some lessons that we learned while exploring the datasets and completing this project.

5.1 Importance of Results

Our results from exploring the ACME data provide hints towards evaluating the behavior of employees to uncover odd or unusual events, as well as employees who are logging hours without working (as in the case of the 22-hour stint of web surfing). Having the ability to locate and eventually correct odd employee behavior will result in a stronger company overall.

DTAA can use our results to improve their company, especially in planning future enforcement of data management to prevent leaks, as well as better information security policies. We showed that the company has suffered a cyber attack wherein its computers were infected by some malware which corrupted files and email contents. This "prince" malware spread rapidly throughout the company, primarily via email. The spread likely could have been avoided by using a secure mail client, such as Outlook or ProtonMail, which checks for malicious content.

In reference to the outbreak of the "prince" malware, it appeared that the IT Admins were trying to contain it (Figure 2). However, they were not able to completely contain the infection. We suspect this is the reason the IT Admins were overworked, leading to their frustration and retaliation. We recommend DTAA check their staff's working hours and gather monthly feedback to estimate employee satisfaction. Also, IT Admins were able to access suspicious websites and download malicious software. This could have been prevented by using trusted anti-virus software and logging such incidents for review by upper management. They were further able to upload these keyloggers to their supervisor's computer. This is a serious threat to the company, as the supervisor's data is being leaked and/or infected. All of this could have been avoided with stronger security and data encryption tools.

5.2 Project Lessons Learned

Despite working on these datasets for more than two months, nearly all of our best ideas and findings came in the last two days while writing this report, some even in the final 10 hours. This resulted in a massive rewrite of this document in the final hours before the deadline. In addition to demonstrating the benefits of last-minute panic, this shows that moments of inspiration can occur at any time when exploring the data, even at the last moment.

As noted previously, we felt that our exploration of the DTAA dataset was initially biased towards the approach we followed on the ACME dataset. Because the storylines within the companies and datasets were quite different, this caused us issues with detecting the scenarios that we report in this paper.

Lastly, it is important to have items set up even when not all the pieces are finished. When group members have varying schedules and other deadlines to meet, it can be hard to have all the pieces needed in order to analyze something. Having the code ready to go when those pieces are in place would have saved some time.

ACKNOWLEDGMENTS

The authors wish to thank our guiding lighthouses who assisted us in these findings.
APPENDIX
The overseeing of this project was managed very carefully by the lovely
daughter of Mai, Ms. Jana ElFishawy. Without this little girl’s patience,
we would have never been able to finalize our stories or analyze the
emails dataset in those very tiring and long weekend and weeknight
meetings.

Fig. 7: Jana and Spongebob.


Thanks Jana!
