A Five-Step Guide for Conducting Exploratory Data Analysis

by Cody Mazza-Anthony, Data Science & Engineering
Apr 28, 2021 - 16 minute read
Have you ever been handed a dataset and then been asked to describe it? When I
was first starting out in data science, this question confused me. My first thought
was "What do you mean?" followed by "Can you be more specific?" The reality is
that exploratory data analysis (EDA) is a critical tool in every data scientist's kit,
and the results are invaluable for answering important business questions.

Simply put, an EDA refers to performing visualizations and identifying significant
patterns, such as correlated features, missing data, and outliers. EDAs are also
essential for forming hypotheses about why these patterns occur. An EDA most likely
won't appear in your data product, data highlight, or dashboard, but it will help to
inform all of these things.
Below, I'll walk you through key tips for performing an effective EDA. For the more
seasoned data scientists, the contents of this post may not come as a surprise
(rather a good reminder!), but for the new data scientists, I'll provide a potential
framework that can help you get started on your journey. We'll use a synthetic
dataset for illustrative purposes. The data below does not reflect actual data from
Shopify merchants. As we go step-by-step, I encourage you to take notes as you
progress through each section!
Before You Start
Before you start exploring the data, you should try to understand the data at a
high level. Speak to leadership and product to try to gain as much context as
possible to help inform where to focus your efforts. Are you interested in
performing a prediction task? Is the task purely for exploratory purposes?
Depending on the intended outcome, you might point out very different things in
your EDA.
With that context, it's now time to look at your dataset. It’s important to identify
how many samples (rows) and how many features (columns) are in your dataset.
The size of your data helps inform any computational bottlenecks that may occur
down the road. For instance, computing a correlation matrix on large datasets can
take quite a bit of time. If your dataset is too big to work within a Jupyter
notebook, I suggest subsampling so you have something that represents your
data, but isn’t too big to work with.
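If you do decide to subsample, a seeded random sample is usually enough to preserve the overall shape of the data. Here's a minimal sketch using pandas; the file name and the 10 percent fraction are placeholders, not values from this post:

```python
import pandas as pd

# Hypothetical file name; swap in your own data source.
df = pd.read_csv("merchant_snapshots.csv")

# If the full dataset is too large for a notebook, keep a seeded 10% random sample.
if len(df) > 1_000_000:
    df = df.sample(frac=0.10, random_state=42)
```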
The first 5 rows of our synthetic dataset. The dataset above does not reflect actual data from Shopify
merchants.

Once you have your data in a suitable working environment, it's usually a good idea
to look at the first couple rows. The above image shows an example dataset we
can use for our EDA. This dataset is used to analyze merchant behaviour. Here are
a few details about the features:
* Shop Cohort: the month and year a merchant joined Shopify
* GMV (Gross Merchandise Volume): the total value of merchandise sold
* AOV (Average Order Value): the average value of customers' orders since their
first order
* Conversion Rate: the percentage of sessions that resulted in a purchase
* Sessions: the total number of sessions on the merchant's online store
* Fulfilled Orders: the number of orders that have been packaged and shipped
* Delivered Orders: the number of orders that have been received by the
customer
One question to address is “What is the unique identifier of each row in the data?”
A unique identifier can be a column or set of columns that is guaranteed to be
unique across rows in your dataset. This is key for distinguishing rows and
referencing them in our EDA.
Now, if you've been taking notes, here's what they may look like so far:
* The dataset is about merchant behaviour. It consists of historic information
about a set of merchants collected on a daily basis
* The dataset contains 1500 samples and 13 features. This is a reasonable size
and will allow me to work in Jupyter notebooks
* Each row of my data is uniquely identified by the “Snapshot Date” and “Shop
ID” columns. In other words, my data contains one row per shop per day
* There are 100 unique Shop IDs and 15 unique Snapshot Dates
* Snapshot Dates range from ‘2020-01-01’ to ‘2020-01-15’ for a total of 15 days
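For reference, here's a rough sketch of how those notes could be produced with pandas, assuming df is the dataset loaded above and the column names match the synthetic data:

```python
# Samples (rows) and features (columns).
print(df.shape)  # (1500, 13)

# Confirm that ("Snapshot Date", "Shop ID") uniquely identifies each row.
assert not df.duplicated(subset=["Snapshot Date", "Shop ID"]).any()

# Unique shops, unique snapshot dates, and the date range.
print(df["Shop ID"].nunique(), df["Snapshot Date"].nunique())
print(df["Snapshot Date"].min(), df["Snapshot Date"].max())
```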
1. Check For Missing Data

Now that we've decided how we're going to work with the data, we begin to look at
the data itself. Checking your data for missing values is usually a good place to
start. For this analysis, and future analysis, I suggest analyzing features one at a
time and ranking them with respect to your specific analysis. For example, if we
look at the below missing values analysis, we'd simply count the number of
missing values for each feature, and then rank the features by largest amount of
missing values to smallest. This is especially useful if there are a large number of
features.
Feature ranking by missing value counts
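Here's a sketch of how that count-and-rank step might look with pandas (again assuming df is the dataset from above):

```python
# Count missing values per feature and rank from most to least missing.
missing = df.isna().sum().sort_values(ascending=False)
missing_pct = (missing / len(df) * 100).round(1)

print(missing.to_frame("missing_count").assign(missing_pct=missing_pct))
```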
Let’s look at an example of something that might occur in your data. Suppose a
feature has 70 percent of its values missing. As a result of such a high amount of
missing data, some may suggest removing this feature entirely from the data.
However, before we do anything, we try to understand what this feature
represents and why we're seeing this behaviour. After further analysis, we may
discover that this feature represents a response to a survey question, and in most
cases it was left blank. A possible hypothesis is that a large proportion of the
population didn't feel comfortable providing an answer. If we simply remove this
feature, we introduce bias into our data. Therefore, this missing data is a feature in
its own right and should be treated as such.
Now for each feature, I suggest trying to understand why the data is missing and
what it can mean. Unfortunately, this isn’t always so simple and an answer might
not exist. That's why an entire area of statistics, Imputation, is devoted to this
problem and offers several solutions. What approach you choose depends entirely
on the type of data. For time series data without seasonality or trend, you can
replace missing values with the mean or median. If the time series does contain a
trend but not seasonality, then you can apply a linear interpolation. If it contains
both, then you should adjust for the seasonality and then apply a linear
interpolation. In the survey example I discussed above, I handle missing values by
creating a new category "Not Answered" for the survey question feature. I won't go
into detail about all the various methods here; however, I suggest reading How to
Handle Missing Data for more details on Imputation.
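To make those options concrete, here's a minimal sketch of each approach with pandas. The "Survey Response" column is hypothetical, and which method suits which feature depends entirely on your own data:

```python
# No trend or seasonality: fill missing values with the median.
df["Conversion Rate"] = df["Conversion Rate"].fillna(df["Conversion Rate"].median())

# Trend but no seasonality: linear interpolation within each shop's daily series.
df["Sessions"] = (
    df.sort_values("Snapshot Date")
      .groupby("Shop ID")["Sessions"]
      .transform(lambda s: s.interpolate(method="linear"))
)

# Survey-style categorical feature: treat missing as its own category.
df["Survey Response"] = df["Survey Response"].fillna("Not Answered")
```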
Great! We've now identified the missing values in our data—let’s update our
summary notes:
* 10 features contain missing values
* “Fulfilled Orders” contains the most missing values at 8% and “Shop Currency”
contains the least at 6%
2. Provide Basic Descriptions of Your Sample
and Features
At this point in our EDA, we've identified features with missing values, but we still
know very little about our data. So let’s try to fill in some of the blanks. Let's
categorize our features as either:
Continuous: A feature that is continuous can assume an infinite number of values
in a given range. An example of a continuous feature is a merchant's Gross
Merchandise Value (GMV).
Discrete: A feature that is discrete can assume a countable number of values and
is always numeric. An example of a discrete feature is a merchant's Sessions.
Categorical: A feature that is categorical can only assume a finite number of values.
An example of a categorical feature is a merchant's Shopify plan type.
The goal is to classify all your features into one of these three categories.
GMV        AOV      Conversion Rate
62894.85   628.95   0.17
NaN        1390.86  0.07
25890.06   258.90   0.04
6446.36    64.46    0.20
47432.44   258.90   0.10

Example of continuous features
Sessions   Products Sold   Fulfilled Orders
un         52              108
9          46              96
182        47              98
147        44              99
45         65              125

Example of discrete features
Plan               Country   Shop Cohort
Plus               UK        NaN
Advanced Shopify   UK        2019-04
Plus               Canada    2019-04
Advanced Shopify   Canada    2019-06
Advanced Shopify   UK        2019-06

Example of categorical features
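One way to start filling in these descriptions is with pandas built-ins. A minimal sketch, with the feature groupings taken from the tables above:

```python
continuous = ["GMV", "AOV", "Conversion Rate"]
discrete = ["Sessions", "Products Sold", "Fulfilled Orders", "Delivered Orders"]
categorical = ["Plan", "Country", "Shop Cohort"]

# Min, max, mean, and quartiles for continuous and discrete features.
print(df[continuous + discrete].describe())

# Number of unique values and their counts for categorical features.
for col in categorical:
    print(col, df[col].nunique(), "unique values")
    print(df[col].value_counts(dropna=False))
```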
You might be asking yourself, how does classifying features help us? This
categorization helps us decide what visualizations to choose in our EDA, and what
statistical methods we can apply to this data. Some visualizations won't work on all
continuous, discrete, and categorical features. This means we have to treat groups
of each type of feature differently. We will see how this works in later sections.
Let's focus on continuous features first. Record any characteristics that you think
are important, such as the maximum and minimum values for that feature. Do the
same thing for discrete features. For categorical features, some things I like to
check for are the number of unique values and the number of occurrences of each
unique value. Let's add our findings to our summary notes:
There are 3 continuous features, 4 discrete features, and 4 categorical features
* GMV:
* Continuous Feature
* Values are between $12.07 and $814468.03
* Data is missing for one day... should check to make sure this isn't a data
collection error
* Plan:
* Categorical Feature
* Assumes four values: "Basic Shopify", "Shopify", "Advanced Shopify", and
"Plus"
* The value counts of “Basic Shopify”, “Shopify”, “Advanced Shopify”, and
“Plus” are 255, 420, 450, and 375, respectively
* There seems to be more merchants on the "Advanced Shopify" plan than on
the "Basic" plan. Does this make sense?

3. Identify The Shape of Your Data

The shape of a feature can tell you so much about it. What do I mean by shape? I'm
referring to the distribution of your data, and how it can change over time. Let's
plot a few features from our dataset:
GMV and Sessions behaviour across samples
If the dataset is a time series, then we investigate how the feature changes over
time. Perhaps there's a seasonality to the feature or a positive/negative linear trend
over time. These are all important things to consider in your EDA. In the graphs
above, we can see that AOV and Sessions have positive linear trends and Sessions
exhibits a seasonality (a distinct behaviour that occurs at regular intervals). Recall that
Snapshot Date and Shop ID uniquely define our data, so the seasonality we
observe can be due to particular shops having more sessions than other shops in
the data. In the line graph below, we see that the Sessions seasonality was a result
of two specific shops: Shop 1 and Shop 51. Perhaps these shops have a higher
GMV or AOV?

Next, you'll calculate the mean and variance of each feature. Does the feature
hardly change at all? Is it constantly changing? Try to hypothesize about the
behaviour you see. A feature that has a very low or very high variance may require
additional investigation.
Probability Density Functions (PDFs) and Probability Mass Functions (PMFs) are
your friends. To understand the shape of your features, PMFs are used for discrete
features and PDFs for continuous features.
Example of feature density functions
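A sketch of how you might approximate these plots with pandas and matplotlib: a normalized histogram as a rough PDF for a continuous feature, and normalized value counts as an empirical PMF for a discrete one. The column choices are illustrative:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Continuous feature: histogram normalized to a density (approximate PDF).
df["GMV"].plot(kind="hist", bins=50, density=True, ax=axes[0], title="GMV")

# Discrete feature: empirical probability mass function.
df["Fulfilled Orders"].value_counts(normalize=True).sort_index().plot(
    kind="bar", ax=axes[1], title="Fulfilled Orders"
)

plt.tight_layout()
plt.show()

# Quick numeric check on asymmetry.
print(df[["GMV", "AOV", "Sessions"]].skew())
```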
Here are a few things that PMFs and PDFs can tell you about your data:
* Skewness
* Is the feature heterogeneous (multimodal)?
* If the PDF has a gap in it, the feature may be disconnected
* Is it bounded?
We can see from the example feature density functions above that all three
features are skewed. Skewness measures the asymmetry of your data. This might
deter us from using the mean as a measure of central tendency. The median is
more robust, but it comes with additional computational cost.
Overall, there are a lot of things you can consider when visualizing the distribution
of your data. For more great ideas, I recommend reading Exploratory Data
Analysis. Don’t forget to update your summary notes:
* AOV:
* Continuous Feature
* Values are between $38.69 and $8994.58
* 8% of values missing
* Observe a large skewness in the data.
* Observe a positive linear trend across samples
* Sessions:
* Discrete Feature
* Contains count data (can assume non-negative integer data)
* Values are between 2 and 2257
* 7.7% of values missing
* Observe a large skewness in the data
* Observe a positive linear trend across samples
* Shops 1 and 51 have larger daily sessions and show a more rapid growth of
sessions compared to other shops in the data
* Conversion Rate:
* Continuous Feature
* Values are between 0.0 and 0.86
* 7.7% of values missing
* Observe a large skewness in the data
4. Identify Significant Correlations
Correlation measures the relationship between two variable quantities. Suppose
we focus on the correlation between two continuous features: Delivered Orders
and Fulfilled Orders. The easiest way to visualize correlation is by plotting a scatter
plot with Delivered Orders on the y axis and Fulfilled Orders on the x axis. As
expected, there's a positive relationship between these two features.
Scatter plot showing positive correlation between features "Delivered Orders” and “Fulfilled Orders”
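A one-line version of this kind of plot, assuming df holds the dataset:

```python
import matplotlib.pyplot as plt

df.plot(kind="scatter", x="Fulfilled Orders", y="Delivered Orders")
plt.show()
```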
If you have a high number of features in your dataset, then you can't create this
plot for all of them—it takes too long. So, I recommend computing the Pearson
correlation matrix for your dataset. It measures the linear correlation between
features in your dataset and assigns a value between -1 and 1 to each feature pair.
A positive value indicates a positive relationship and a negative value indicates a
negative relationship.
Correlation matrix for continuous and discrete features
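Here's a sketch of computing and drawing such a matrix with pandas and matplotlib; the column list mirrors the continuous and discrete features identified earlier:

```python
import matplotlib.pyplot as plt

numeric_cols = ["GMV", "AOV", "Conversion Rate", "Sessions",
                "Products Sold", "Fulfilled Orders", "Delivered Orders"]

# Pearson correlation: values between -1 and 1 for each feature pair.
corr = df[numeric_cols].corr(method="pearson")

plt.imshow(corr, vmin=-1, vmax=1, cmap="RdBu")
plt.xticks(range(len(numeric_cols)), numeric_cols, rotation=45, ha="right")
plt.yticks(range(len(numeric_cols)), numeric_cols)
plt.colorbar()
plt.tight_layout()
plt.show()
```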
It's important to take note of all significant correlations between features. It's
possible that you might observe many relationships between features in your
dataset, but you might also observe very little. Every dataset is different! Try to
form hypotheses around why features might be correlated with each other.
In the correlation matrix above, we see a few interesting things. First of all, we see
a large positive correlation between Fulfilled Orders and Delivered Orders, and
between GMV and AOV. We also see a slight positive correlation between Sessions,
GMV, and AOV. These are significant patterns that you should take note of.
Since our data is a time series, we also look at the autocorrelation of shops.
Computing the autocorrelation reveals relationships between a signal's current
value and its previous values. For instance, there could be a positive correlation
between a shop's GMV today and its GMV from 7 days ago. In other words, customers like to shop more on Saturdays compared to Mondays. I won't go into
any more detail here since Time Series Analysis is a very large area of statistics,
but I suggest reading A Very Short Course on Time Series Analysis.
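As a small illustration (not code from the post itself), the 7-day autocorrelation of a single shop's GMV could be checked like this; the shop ID is arbitrary:

```python
shop_gmv = (
    df[df["Shop ID"] == 1]
      .sort_values("Snapshot Date")
      .set_index("Snapshot Date")["GMV"]
)

# Values near 1 suggest a weekly pattern in this shop's GMV.
print(shop_gmv.autocorr(lag=7))
```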
Contingency table for categorical features "Shop Cohort" and "Plan", showing the count of shops in each plan and cohort-month combination (2019-01 through 2019-06)
The methodology outlined above only applies to continuous and discrete features.
There are a few ways to compute correlation between categorical features,
however, I'll only discuss one, the Pearson chi-square test. This involves taking
pairs of categorical features and computing their contingency table. Each cell in the
contingency table shows the frequency of observations. In the Pearson chi-square
test, the null hypothesis is that the categorical variables in question are
independent and, therefore, not related. In the table above, we compute the
contingency table for two categorical features from our dataset: Shop Cohort and
Plan. After that, we perform a hypothesis test using the chi-square distribution
with a specified alpha level of significance. We then determine whether the
categorical features are independent or dependent.
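A minimal sketch of that test using scipy, with the contingency table built by pandas; the 0.05 alpha level is just a common default, not one prescribed by the post:

```python
import pandas as pd
from scipy.stats import chi2_contingency

table = pd.crosstab(df["Plan"], df["Shop Cohort"])
chi2, p_value, dof, expected = chi2_contingency(table)

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: Plan and Shop Cohort appear dependent.")
else:
    print("Fail to reject the null hypothesis: no evidence of dependence.")
```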
5. Spot Outliers in the Dataset

Last, but certainly not least, spotting outliers in your dataset is a crucial step in
EDA. Outliers are significantly different from other samples in your dataset and can
lead to major problems when performing statistical tasks following your EDA.
There are many reasons why an outlier might occur. Perhaps there was a
measurement error for that sample and feature, but in many cases outliers occur
naturally.
Continuous feature box plots
The box plot visualization is extremely useful for identifying outliers. In the above
figure, we observe that all features contain quite a few outliers because we see
data points that are distant from the majority of the data.
There are many ways to identify outliers. It doesn’t make sense to plot all of our
features one at a time to spot outliers, so how do we systematically accomplish
this? One way is to compute the 1st and 99th percentiles for each feature, then
classify data points that are greater than the 99th percentile or less than the 1st
percentile as outliers. For each feature, count the number of outliers, then rank the
features from
most outliers to least outliers. Focus on the features that have the most outliers
and try to understand why that might be. Take note of your findings.
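One way to put that into practice, as a rough sketch with pandas; the percentile cutoffs and the column list are illustrative:

```python
# Flag values outside the 1st-99th percentile range, then rank features by outlier count.
outlier_counts = {}
for col in ["GMV", "AOV", "Conversion Rate", "Sessions"]:
    low, high = df[col].quantile([0.01, 0.99])
    outlier_counts[col] = int(((df[col] < low) | (df[col] > high)).sum())

for col, count in sorted(outlier_counts.items(), key=lambda kv: kv[1], reverse=True):
    print(col, count)
```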
Unfortunately, the aforementioned approach doesn't work for categorical features
since there needs to be an ordering to compute percentiles. An outlier can mean
many things. Suppose our categorical feature can assume one of three values: apple,
orange, or pear. For 99 percent of samples, the value is either apple or orange, and
only 1 percent for pear. This is one way we might classify an outlier for this feature.
For more advanced methods on detecting anomalies in categorical data, check
out Outlier Analysis.

What's Next?
We're at the finish line and have completed our EDA. Let's review the main takeaways:
* Missing values can plague your data. Make sure to understand why they are
there and how you plan to deal with them.
* Provide a basic description of your features and categorize them. This will
drastically change the visualizations you use and the statistical methods you
apply.
* Understand your data by visualizing its distribution. You never know what you
will find! Get comfortable with how your data changes across samples and over
time.
* Your features have relationships! Make note of them. These relationships can
come in handy down the road.
* Outliers can dampen your fun only if you don't know about them. Make known
the unknowns!
But what do we do next? Well, that all depends on the problem. It's useful to
present a summary of your findings to leadership and product. By performing an
EDA, you might answer some of those crucial business questions we alluded to
earlier. Going forward, does your team want to perform a regression or
classification on the dataset? Do they want it to power a KPI dashboard? So many
wonderful options and opportunities to explore!
It's important to note that an EDA is a very large area of focus. The steps I suggest
are by no means exhaustive, and if I had the time, I could write a book on the
subject! In this post, I share some of the most common approaches, but there's so
much more that can be added to your own EDA.
If you're passionate about data at scale, and you're eager to learn more, we're
hiring! Reach out to us or apply on our careers page.
Cody Mazza-Anthony is a Data Scientist on the Insights team. He is currently
working on building a Product Analytics framework for Insights. Cody enjoys
building intelligent systems to enable merchants to grow their business. If you'd
like to connect with Cody, reach out on LinkedIn.
Wherever you are, your next journey starts here! If building systems from the
ground up to solve real-world problems interests you, our Engineering blog has
stories about other challenges we have encountered. Intrigued? Visit our Data
Science & Engineering career page to find out about our open positions. Learn
about how we're hiring to design the future together—a future that is Digital by
Design.