Exploratory Data Analysis

The document discusses the data science workflow which includes collecting and cleaning data, exploratory data analysis, developing models and algorithms, and communicating results. It focuses on the importance of data cleaning and provides examples of how data can be messy and techniques for cleaning it such as removing duplicates, standardizing text, handling missing values, and identifying outliers.

Uploaded by ravinyse

The Data Science Workflow: Data Cleaning + Exploratory Data Analysis
DATA SCIENCE WORKFLOW

1. Start with Question
2. Collect Data & Clean Data
3. Exploratory Data Analysis
4. Models & Algorithms
5. Communicate Results
Data Cleaning
WHY IS DATA CLEANING SO IMPORTANT?

- “Data is king”
- “Data >> Features >> Algorithms”
- “Garbage in, garbage out”
HOW CAN DATA BE MESSY?

1. Duplicate or unnecessary data
2. Inconsistent text and typos
3. Missing data
4. Outliers
… and more!
1. DUPLICATE OR UNNECESSARY DATA

- Let’s say I’d like to do some analysis on Metis students

Name     Applied Date  Status    Campus
Alice    Jan 2018      Enrolled  Chicago
Bob      Feb 2018      Enrolled  Chicago
Charlie  June 2017     Rejected  NYC
Charlie  Jan 2018      Enrolled  NYC
Eve      Jan 2018      Enrolled  NYC
Frank    Feb 2018      Deferred  Seattle
Grace    Dec 2017      Enrolled  Seattle
Henry    Jan 2018      Enrolled  SF
1. DUPLICATE OR UNNECESSARY DATA

- Keep an eye out for duplicate values, and dig into why there are multiple values
- It’s a good idea to look at the features you’re bringing in and filter down the data as necessary (but be careful not to filter out features you may want to use later)
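A minimal pandas sketch of the duplicate check, using the example table above. The DataFrame, the column names, and the "keep the latest row" policy are assumptions made for illustration; the slides do not prescribe a specific method.

```python
import pandas as pd

# Example table from the slide, typed in by hand for illustration.
students = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Charlie", "Eve", "Frank", "Grace", "Henry"],
    "applied": ["Jan 2018", "Feb 2018", "June 2017", "Jan 2018",
                "Jan 2018", "Feb 2018", "Dec 2017", "Jan 2018"],
    "status": ["Enrolled", "Enrolled", "Rejected", "Enrolled",
               "Enrolled", "Deferred", "Enrolled", "Enrolled"],
    "campus": ["Chicago", "Chicago", "NYC", "NYC", "NYC", "Seattle", "Seattle", "SF"],
})

# keep=False flags every row whose name appears more than once,
# so both Charlie rows are returned for inspection.
repeated = students[students.duplicated(subset="name", keep=False)]
print(repeated)

# One possible policy (an assumption): keep the last row per name,
# relying on the rows being listed oldest application first.
deduped = students.drop_duplicates(subset="name", keep="last")
print(deduped)

# Filter down to the features needed for the analysis at hand.
slim = deduped[["name", "status", "campus"]]
```

Looking at the repeated rows before dropping anything matters here: Charlie applied twice with different outcomes, so the two rows are not a data-entry error.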
2. INCONSISTENT TEXT AND TYPOS

- Let’s say I’d like to do some analysis on Metis students

Name     Applied Date  Status    Campus
Alice    Jan 2018      Enrolled  Chicago
Bob      Feb 2008      Enrolled  Chicago
Charlie  June 2017     Rejected  NYC
Charlie  Jan 2018      Enrolled  NYC
Eve      Jan 2018      Enrolled  new york city
Frank    Feb 2018      Deferred  Seattle
Grace    Dec 2017      Enrolled  seattle
Henry    Jan 2018      Enrolled  SF
2. INCONSISTENT TEXT AND TYPOS

- Look at some summary statistics for each column
  - For numerical fields, what are the minimum and maximum values? Do they make sense?
  - For categorical fields, what are the unique values? Can some values be grouped together?
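The two checks above can be sketched in pandas as follows. The column names and the alias table are hypothetical, and the application year is stored as a plain integer to keep the example short.

```python
import pandas as pd

# Year and campus columns from the example table, typed in for illustration.
students = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Charlie", "Eve", "Frank", "Grace", "Henry"],
    "applied_year": [2018, 2008, 2017, 2018, 2018, 2018, 2017, 2018],
    "campus": ["Chicago", "Chicago", "NYC", "NYC", "new york city",
               "Seattle", "seattle", "SF"],
})

# Numerical field: do the minimum and maximum make sense?
# A 2008 application stands out next to 2017 and 2018.
print(students["applied_year"].min(), students["applied_year"].max())

# Categorical field: list the unique values and look for near-duplicates.
print(students["campus"].unique())

# Group values together: normalize case and whitespace, then map
# known aliases to one label (the alias table is an assumption).
aliases = {"new york city": "nyc"}
students["campus_clean"] = (
    students["campus"].str.strip().str.lower().replace(aliases)
)
print(sorted(students["campus_clean"].unique().tolist()))
```

Whether 2008 is a typo for 2018 cannot be decided from the table alone; the check only surfaces the value so it can be investigated.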
3. MISSING DATA

- Let’s say I’d like to do some analysis on Metis students

Name     Applied Date  Status    Campus
Alice    Jan 2018      Enrolled  Chicago
Bob      --            Enrolled  Chicago
Charlie  June 2017     Rejected  NYC
Charlie  Jan 2018      Enrolled  NYC
Eve      Jan 2018      Enrolled  new york city
Frank    Feb 2018      Deferred  Seattle
Grace    Dec 2017      Enrolled  seattle
Henry    --            --        --
3. MISSING DATA

- Things to do about missing data
  - Remove the row(s) entirely
  - Impute the data, i.e. replace it with substituted values
    - Fill in the missing data with the most common value, the average value, etc.
- What are the pros and cons of each of these approaches?
4. OUTLIERS

- An outlier is an observation in data that is distant from most other observations
- Typically, these observations are aberrations and do not accurately represent the phenomenon we are trying to explain through the model
- If we do not identify and deal with outliers, they can have a significant impact on the model
HOW TO FIND OUTLIERS

1. Plots: Histogram, Density Plot, Box Plot
2. Statistics: Interquartile Range, Standard Deviation (normally distributed data)
3. Residuals (for regression problems): Studentized Residual, Deleted Residual
1. PLOTS

[Figures: histogram and box plot]
2. STATISTICS

[Figures: interquartile range and standard deviation]
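Since the figures did not survive, here is a minimal NumPy sketch of the two statistical rules. The sample values and the cutoffs (1.5 times the IQR, 2 standard deviations) are assumptions based on common conventions, not values taken from the slides.

```python
import numpy as np

# Hypothetical measurements; 120 is the suspicious value.
values = np.array([52, 55, 57, 58, 60, 61, 63, 65, 66, 120], dtype=float)

# Interquartile range rule: flag points more than 1.5 * IQR
# below the first quartile or above the third quartile.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# Standard deviation rule: flag points more than k standard deviations
# from the mean. This is most meaningful for roughly normal data.
k = 2
z_scores = (values - values.mean()) / values.std()
std_outliers = values[np.abs(z_scores) > k]

print(iqr_outliers)
print(std_outliers)
```

In a small sample the outlier itself inflates the standard deviation, so a stricter cutoff such as 3 would miss the value 120 here; the IQR rule is less sensitive to that effect.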
HOW TO DEAL WITH OUTLIERS

- Remove them
- Assign the mean or median value
- K-nearest neighbors
- Use regression to try and predict what the value should be
- Transform the variable
THE POWER OF TRANSFORMATIONS

[Figure: two panels labeled "RIGHT SKEWED" and "NORMAL!"]
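The slide does not say which transformation was used; a log transform is a common choice for right-skewed data, so the sketch below assumes one. The sample is synthetic (log-normal draws) and the `skewness` helper is written out for illustration.

```python
import numpy as np

# Synthetic right-skewed sample: log-normal draws with a fixed seed.
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

def skewness(x):
    """Sample skewness: the mean of the cubed standardized values."""
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

# The log of a log-normal variable is normally distributed,
# so the long right tail disappears after the transform.
transformed = np.log(skewed)

print("skewness before:", skewness(skewed))
print("skewness after: ", skewness(transformed))
```

A skewness near zero after the transform indicates a roughly symmetric distribution. The log only applies to strictly positive values; data containing zeros or negatives would need a different transform.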


Exploratory Data Analysis
WHAT IS EXPLORATORY DATA ANALYSIS?

“Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.”
Source: Wikipedia
WHY IS EDA USEFUL?

- Get an initial feel for the data
- See if the data makes sense and if further cleaning or more data is needed
- Identify patterns and trends in the data; often these can be just as important as your findings from modeling
WHAT ARE SOME TECHNIQUES?

- Summary Statistics
  - Average, Median, Min, Max, Correlations, etc.
- Visualizations
  - Histograms, Scatter Plots, Box Plots, etc.
WHAT ARE SOME TOOLS?

- Data Wrangling
  - Pandas
- Data Visualization
  - Matplotlib
  - Seaborn
OUR QUESTION

- Let’s say I want to do some analysis to see which applicants get accepted into Metis
- As a class, can you brainstorm some ways you can explore this data using (1) statistics and (2) visualizations?
EDA: SUMMARY STATISTICS

- Average: I could look at the average of all student interview scores, or perhaps the average of student interview scores by city
- Max: I could look at the most common words that accepted vs. rejected students use in their applications
- Correlation: Take a look at the correlation between technical assessment grade and years of Python experience
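These three ideas might look like this in pandas. All data below is made up, and the column names (`interview_score`, `assessment_grade`, `python_years`) are hypothetical stand-ins for whatever the real applicant data contains.

```python
import pandas as pd

# Made-up applicant data for illustration only.
applicants = pd.DataFrame({
    "city": ["Chicago", "Chicago", "NYC", "NYC", "Seattle", "Seattle"],
    "interview_score": [7, 9, 6, 8, 5, 7],
    "assessment_grade": [70, 90, 65, 85, 60, 80],
    "python_years": [1, 4, 0, 3, 0, 2],
})

# Average: overall, and broken down by city.
overall_mean = applicants["interview_score"].mean()
mean_by_city = applicants.groupby("city")["interview_score"].mean()
print(overall_mean)
print(mean_by_city)

# Max: the most common word in a set of (made-up) application essays.
essays = pd.Series([
    "I love data and python",
    "data drives decisions",
    "python and data",
])
word_counts = essays.str.lower().str.split().explode().value_counts()
top_word = word_counts.idxmax()
print(top_word)

# Correlation: Pearson correlation between two numeric columns.
corr = applicants["assessment_grade"].corr(applicants["python_years"])
print(corr)
```

To compare accepted and rejected applicants, the same word count could be run once per group and the two rankings set side by side.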
EDA: VISUALIZATIONS

- Histogram (numeric): Take a look at the distribution of number of years of work experience of our applicants
- Bar Chart (categorical): Create a chart showing the number of applicants with each type of major
- Scatter Plot: Create a scatter plot comparing the technical assessment grade and years of Python experience
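A minimal Matplotlib sketch of the three charts, again on made-up data with hypothetical column names. The `Agg` backend is selected only so the example runs without a display; in a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering, so no display is required
import matplotlib.pyplot as plt
import pandas as pd

# Made-up applicant data for illustration only.
applicants = pd.DataFrame({
    "work_years": [0, 1, 2, 2, 3, 5, 8, 10],
    "major": ["Math", "CS", "CS", "Physics", "Math", "CS", "Economics", "Physics"],
    "assessment_grade": [60, 70, 75, 65, 80, 90, 70, 85],
    "python_years": [0, 1, 2, 0, 2, 4, 1, 3],
})

fig, (ax_hist, ax_bar, ax_scatter) = plt.subplots(1, 3, figsize=(15, 4))

# Histogram (numeric): distribution of years of work experience.
ax_hist.hist(applicants["work_years"], bins=5)
ax_hist.set_xlabel("Years of work experience")
ax_hist.set_ylabel("Applicants")

# Bar chart (categorical): number of applicants per major.
major_counts = applicants["major"].value_counts()
ax_bar.bar(major_counts.index.tolist(), major_counts.values.tolist())
ax_bar.set_ylabel("Applicants")

# Scatter plot: assessment grade against years of Python experience.
ax_scatter.scatter(applicants["python_years"], applicants["assessment_grade"])
ax_scatter.set_xlabel("Years of Python experience")
ax_scatter.set_ylabel("Technical assessment grade")

fig.tight_layout()
# In an interactive session, plt.show() would display the figure.
```

Seaborn, listed among the tools earlier, offers higher-level versions of the same charts; plain Matplotlib is used here to keep the dependencies minimal.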
Summary
SUMMARY

Data Cleaning
- Data is king
- Lots of time here
- May repeat this

EDA
- Summary Statistics
- Visualizations
- Gut check before modeling
ADDITIONAL RESOURCES

- Book: Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython by Wes McKinney
- Website: Exploratory Data Analysis by DataCamp
  - You don’t need to pay for the course, but the course outline shows even more techniques you can use