Exploratory Data Analysis and Data Science
Module 2
Content
1. EDA
a. Basic tools of EDA
b. Philosophy of EDA
2. The Data Science process
a. Case study: Real Direct (online real estate firm)
3. Three basic Machine Learning Algorithms
a. Linear Regression
b. k-Nearest Neighbours (k-NN)
c. k-means
Exploratory Data Analysis (EDA)
Introduction
1. “Exploratory data analysis” is an attitude, a state of flexibility, a willingness to look for
those things that we believe are not there, as well as those we believe to be there. — John
Tukey
2. Exploratory Data Analysis (EDA) is the first step toward building a model.
3. The “exploratory” aspect means that your understanding of the problem you are solving,
or might solve, is changing as you go.
4. In EDA, there is no hypothesis and there is no model.
5. It’s traditionally presented as a bunch of histograms and stem-and-leaf plots.
6. EDA is a critical part of the data science process.
a. Basic tools of EDA
1. The basic tools of EDA are plots, graphs and summary statistics.
2. It’s a method of systematically going through the data, plotting distributions of all
variables (using box plots), plotting time series of data, transforming variables, looking at
all pairwise relationships between variables using scatterplot matrices, and generating
summary statistics for all of them.
3. At the very least that would mean computing their mean, minimum, maximum, the upper
and lower quartiles, and identifying outliers.
4. EDA is about your relationship with the data.
5. You want to understand the data—gain intuition, understand the shape of it, and try to
connect your understanding of the process that generated the data to the data itself.
6. EDA happens between you and the data and isn’t about proving anything to anyone else
yet.
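The summary statistics listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a prescribed method. The sample values are hypothetical, and the 1.5 × IQR rule used to flag outliers is an assumption (a common convention, not something the notes specify).

```python
import statistics

def summarize(values):
    """Return basic summary statistics and IQR-rule outliers for a list of numbers."""
    ordered = sorted(values)
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed outlier rule: values more than 1.5 * IQR beyond the quartiles.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return {
        "mean": statistics.mean(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "lower_quartile": q1,
        "median": median,
        "upper_quartile": q3,
        "outliers": [v for v in ordered if v < low_fence or v > high_fence],
    }

# Hypothetical sample with one suspicious value.
sample = [12, 15, 14, 13, 16, 15, 14, 95]
summary = summarize(sample)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this sample the single large value pulls the mean well above the median, which is the kind of discrepancy that should prompt a closer look at the raw data.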
b. Philosophy of EDA
1. In the context of data in an Internet/engineering company, EDA is done for some of the
same reasons it’s done with smaller datasets, but there are additional reasons to do it
with data that has been generated from logs.
2. There are important reasons anyone working with data should do EDA. Namely,
a. To gain intuition about the data;
b. To make comparisons between distributions;
c. For sanity checking (making sure the data is on the scale you expect, in the format
you thought it should be);
d. To find out where data is missing or if there are outliers;
e. To summarize the data.
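As a rough illustration of sanity checking and finding missing data, the sketch below counts missing entries and flags values outside an expected range. The function name `sanity_check`, the `age` column, the records, and the range 0 to 120 are all hypothetical choices made for this example.

```python
def sanity_check(rows, column, expected_min, expected_max):
    """Count missing and out-of-range values in one column of a list of dicts."""
    missing = 0
    out_of_range = []
    for index, row in enumerate(rows):
        value = row.get(column)  # None if the key is absent or explicitly None
        if value is None:
            missing += 1
        elif not (expected_min <= value <= expected_max):
            out_of_range.append((index, value))
    return {"missing": missing, "out_of_range": out_of_range}

# Hypothetical records: ages in years, two missing, one possibly recorded in months.
records = [{"age": 34}, {"age": None}, {"age": 27}, {"age": 408}, {}]
report = sanity_check(records, "age", expected_min=0, expected_max=120)
print(report)
```

A value such as 408 is not necessarily an error, but it suggests the data may not be on the scale you expected and is worth investigating.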
b. Philosophy of EDA (continued)
1. In the context of data generated from logs, EDA also helps with debugging the logging
process.
a. For example, “patterns” you find in the data could actually be something wrong in the
logging process that needs to be fixed. If you never go to the trouble of debugging,
you’ll continue to think your patterns are real.
2. The engineers we’ve worked with are always grateful for help in this area.
3. Although visualization is involved in EDA, we distinguish between EDA and data
visualization: EDA is done toward the beginning of analysis, and data visualization is
done toward the end to communicate one’s findings.
4. In the end, EDA helps you make sure the product is performing as intended.
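One way a logging problem can masquerade as a pattern is a gap in the logs: an hour with no events might reflect user behaviour, or it might mean the logger was down. The sketch below is one possible check under stated assumptions: the timestamps are hypothetical, they are assumed to be ISO 8601 strings from a single day, and the function name `silent_hours` is invented for this example.

```python
from collections import Counter
from datetime import datetime

def silent_hours(timestamps):
    """Return hours with no events between the first and last observed hour.

    Assumes all timestamps are ISO 8601 strings from the same day.
    """
    hours = [datetime.fromisoformat(ts).hour for ts in timestamps]
    counts = Counter(hours)
    # A Counter returns 0 for hours that never appear.
    return [h for h in range(min(hours), max(hours) + 1) if counts[h] == 0]

# Hypothetical log timestamps with nothing recorded during hour 11.
log_times = [
    "2024-01-01T09:05:00",
    "2024-01-01T09:40:00",
    "2024-01-01T10:15:00",
    "2024-01-01T12:30:00",
    "2024-01-01T13:10:00",
]
gaps = silent_hours(log_times)
print("Hours with no logged events:", gaps)
```

A flagged hour is only a prompt to check with the engineers who own the logging; it does not by itself show that anything is broken.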