Exploratory Data Analysis (EDA):
Exploratory Data Analysis (EDA) is the process of
examining and summarizing a dataset to understand
its main characteristics, often using visual methods.
It is typically the first step in a data analysis or
machine learning workflow.
The main goal of EDA is to gain insights from the
data, detect patterns, identify anomalies, test
hypotheses, and check assumptions through both
statistical and visual techniques.
Objectives of EDA
1. Understand the data structure – What are the
features, types, and formats?
2. Identify missing or incorrect data – Are there
NaNs or invalid entries?
3. Uncover underlying patterns or trends – Are
there correlations or groupings?
4. Spot outliers or anomalies – Are there data points
that deviate significantly?
5. Prepare data for modeling – Help decide feature
selection and engineering steps.
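These objectives map directly onto a few pandas calls; below is a minimal sketch on an illustrative DataFrame (the values are made up):

```python
import pandas as pd
import numpy as np

# Illustrative dataset (hypothetical values)
df = pd.DataFrame({
    'age': [25, 32, 47, np.nan, 51],
    'income': [40000, 52000, 61000, 58000, 75000],
    'segment': ['A', 'B', 'A', 'B', 'A'],
})

print(df.dtypes)                    # 1. Data structure: feature types
print(df.isna().sum())              # 2. Missing values per column
print(df.describe())                # Summary statistics for numeric columns
print(df.corr(numeric_only=True))   # 3. Correlations between numeric features
```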
Data visualization techniques:
Data visualization is the graphical
representation of information and data.
It helps in communicating insights from data
clearly and effectively.
Below are common data visualization
techniques, categorized by purpose and data type:
1. Basic Charts and Graphs
These are foundational and widely used:
Bar Chart – Compare quantities across
categories.
Column Chart – Vertical bars to show variation
over time or categories.
Pie Chart – Show parts of a whole (not
recommended for precise comparisons).
Line Chart – Show trends over time (time
series).
Area Chart – Like line charts, but areas are
filled to show volume.
Scatter Plot – Show relationship or correlation
between two variables.
Histogram – Show distribution of a single
variable (frequency of bins).
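Three of these basic charts can be drawn with matplotlib; a minimal sketch on made-up data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

# Illustrative data (hypothetical values)
categories = ['A', 'B', 'C']
counts = [10, 24, 17]
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].bar(categories, counts)   # Bar chart: compare quantities across categories
axes[0].set_title("Bar Chart")
axes[1].plot(x, y)                # Line chart: trend over an ordered axis
axes[1].set_title("Line Chart")
axes[2].scatter(x, y)             # Scatter plot: relationship between two variables
axes[2].set_title("Scatter Plot")
fig.savefig("basic_charts.png")
```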
2. Advanced & Multivariate Techniques
Used when dealing with more than two
variables:
Bubble Chart – Like scatter plots but with a
third variable represented by size.
Heatmap – Show data intensity using color
gradients in a matrix.
Box Plot (Box-and-Whisker) – Show
distribution with median, quartiles, and outliers.
Violin Plot – Combines box plot and KDE to
show distribution.
Radar (Spider) Chart – Compare multiple
variables for multiple entities.
3. Time-Series Visualization
Used to analyze trends and patterns over time:
Time Series Line Chart – Most common for
time-related data.
Stacked Area Chart – Show how components of
a whole change over time.
Candlestick Chart – Used in financial market
data (open, high, low, close).
4. Geospatial Visualization
For mapping data with geographic components:
Choropleth Map – Color-coded maps to show
values per region.
Dot Density Map – Use dots to show frequency
or density.
Heat Maps on Maps – Show intensity (e.g.,
population density).
Symbol Maps – Use size or shape of symbols on
a map to convey data.
5. Network and Hierarchical Visualizations
Best for relationships and structure:
Tree Map – Show hierarchical data with nested
rectangles.
Sunburst Chart – Circular version of tree map.
Network Graph – Show relationships (nodes and
links).
Sankey Diagram – Show flow and proportion
between nodes.
Hierarchy Dendrogram – Tree-like structure for
clustering or taxonomies.
6. Interactive Visualization Tools
Allow user interaction for deeper analysis:
Dashboards – Combine multiple charts with
interactivity (e.g., filters).
Sliders/Drilldowns – Allow zooming into specific
data.
Tools: Tableau, Power BI, Plotly, D3.js, Google
Data Studio.
7. Specialized Techniques
Used for unique types of data or storytelling:
Word Cloud – Frequency of text-based data.
Gantt Chart – Project timelines and schedules.
Parallel Coordinates Plot – Visualize multi-
dimensional data.
Correlation Matrix – Visualize correlations
between variables.
Statistical analysis and hypothesis testing:
Statistical Analysis
Statistical analysis is the process of
collecting, organizing, analyzing, interpreting,
and presenting data to discover underlying
patterns and trends.
Types of Statistical Analysis:
1.Descriptive Statistics – Summarizes data.
o Measures: Mean, Median, Mode,
Standard Deviation, Range
o Visuals: Histograms, Box Plots, Pie
Charts
2.Inferential Statistics – Makes predictions or
inferences about a population based on a
sample.
o Techniques: Hypothesis testing,
confidence intervals, regression analysis
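The descriptive measures listed above can be computed with Python's standard statistics module (the scores here are illustrative):

```python
import statistics

# Hypothetical sample of exam scores
scores = [72, 85, 91, 68, 85, 77, 95]

print("Mean:", round(statistics.mean(scores), 2))
print("Median:", statistics.median(scores))
print("Mode:", statistics.mode(scores))
print("Std Dev:", round(statistics.stdev(scores), 2))
print("Range:", max(scores) - min(scores))
```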
Hypothesis Testing:
Hypothesis testing is a method used in
inferential statistics to determine if there is
enough evidence in a sample of data to infer
that a certain condition is true for the entire
population.
Key Components:
Null Hypothesis (H₀): Assumes no effect or
difference (status quo)
Alternative Hypothesis (H₁ or Hₐ): Assumes
some effect or difference
Significance Level (α): Probability of
rejecting H₀ when it is true (commonly 0.05)
Test Statistic: A standardized value to decide
whether to reject H₀
P-value: Probability of obtaining test results at
least as extreme as the observed data, under H₀
Common Hypothesis Tests:
Test              Use Case                        Assumptions
Z-test            Known variance, large sample    Normal distribution
T-test            Comparing means                 Normality (for small samples)
Chi-square test   Categorical data                Expected frequencies > 5
ANOVA             Comparing >2 means              Normality, equal variances
Regression        Predicting an outcome variable  Linearity, independence, normality
Steps in Hypothesis Testing:
1.State the hypotheses (H₀ and H₁)
2.Choose significance level (α)
3.Select the appropriate test
4.Calculate test statistic
5.Compare p-value to α or use critical value
6.Draw conclusion (Reject or fail to reject H₀)
Example: T-test
Suppose we want to test if a new teaching method
is more effective than the traditional method.
H₀: The mean scores of both methods are
equal.
H₁: The mean score of the new method is
higher.
Use a t-test if the sample size is small and
population variance is unknown.
If p-value < α, reject H₀ → Conclude the new
method is significantly better.
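A sketch of this t-test using scipy.stats; the two score samples below are hypothetical:

```python
from scipy import stats

# Hypothetical exam scores for each teaching method (illustrative data)
new_method = [85, 88, 90, 79, 92, 87, 91, 84]
traditional = [78, 82, 80, 75, 85, 79, 83, 77]

# Independent two-sample t-test, one-sided (H1: new mean > traditional mean)
t_stat, p_value = stats.ttest_ind(new_method, traditional, alternative='greater')

alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0, new method looks better")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")
```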
Z-Test: Explanation and Example:
A Z-test is a statistical method used to
determine whether there is a significant
difference between sample and population
means or between two sample means. It's used
when:
The population standard deviation is known
The sample size is large (typically n > 30)
Types of Z-Tests
1.One-Sample Z-Test: Compares the sample
mean to a known population mean.
2.Two-Sample Z-Test: Compares the means
of two independent samples.
3.Z-Test for Proportions: Used when dealing
with proportions instead of means.
One-Sample Z-Test Example
Problem:
A company claims that their light bulbs last on
average 1000 hours. A quality control inspector
selects a sample of 50 bulbs and finds a mean
life of 980 hours, with a known population
standard deviation of 80 hours.
Can we say that the bulbs are not lasting as
long as claimed at the 5% significance level?
Step-by-Step Solution
Step 1: Define the Hypotheses
Null Hypothesis (H₀): μ = 1000
Alternative Hypothesis (H₁): μ < 1000 (left-tailed test)
Step 2: Gather the Data
Population mean μ = 1000
Sample mean x̄ = 980
Standard deviation σ = 80
Sample size n = 50
Significance level α = 0.05
Step 3: Compute the Z-Statistic
Z = (x̄ − μ) / (σ/√n) = (980 − 1000) / (80/√50) = −20/11.31 ≈ −1.77
Step 4: Find the Critical Value
For a left-tailed test at α=0.05
the critical z-value ≈ -1.645
Step 5: Make the Decision
Since Z = −1.77 < −1.645, we reject the null hypothesis.
Conclusion:
There is sufficient evidence at the 5% significance level to conclude that the light bulbs do not last as long as claimed.
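The computation above can be verified in a few lines of plain Python (values taken from the example; the critical value −1.645 is hard-coded rather than looked up):

```python
import math

# Values from the light-bulb example
mu = 1000      # claimed population mean (hours)
x_bar = 980    # sample mean
sigma = 80     # known population standard deviation
n = 50         # sample size

z = (x_bar - mu) / (sigma / math.sqrt(n))
print(f"Z = {z:.2f}")  # ≈ -1.77

# Critical value for a left-tailed test at alpha = 0.05
z_critical = -1.645
if z < z_critical:
    print("Reject H0: bulbs last significantly less than 1000 hours")
else:
    print("Fail to reject H0")
```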
Identifying patterns and insights from data:
Identifying patterns and insights from data is a key
part of data analysis and can help drive informed
decisions. Here’s a structured approach to do that:
1. Understand the Data
Before diving in, ask:
What kind of data is it? (Numerical,
categorical, text, time-series, etc.)
What do the variables mean?
Are there any missing or inconsistent values?
Example tools: Data dictionary, exploratory data
analysis (EDA)
2. Perform Exploratory Data Analysis (EDA)
Explore and summarize the main characteristics:
Summary statistics: Mean, median, standard
deviation
Distributions: Histograms, boxplots
Relationships: Scatter plots, correlation
matrices
Trends over time: Line charts for time series
Look for:
Outliers
Clusters
Trends
Correlations
3. Identify Patterns
Depending on the data type:
For numerical data:
Correlations (e.g., sales vs. marketing spend)
Seasonality/trends in time-series data
Distribution shapes (normal, skewed,
bimodal)
For categorical data:
Frequency counts
Cross-tabulations (e.g., sales by region and
product)
Chi-square tests for association
For text data:
Keyword frequency
Sentiment analysis
Topic modeling
4. Use Analytical Techniques
Clustering (e.g., K-means) to find natural
groupings
Dimensionality reduction (e.g., PCA) to
simplify and visualize complex data
Regression analysis to quantify relationships
Classification models to predict categories
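A minimal clustering sketch with scikit-learn, using two well-separated synthetic groups of points:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious hypothetical groups of 2-D points
rng = np.random.default_rng(42)
group1 = rng.normal(loc=[0, 0], scale=0.5, size=(20, 2))
group2 = rng.normal(loc=[5, 5], scale=0.5, size=(20, 2))
X = np.vstack([group1, group2])

# K-means searches for k natural groupings in the data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # should sit near (0, 0) and (5, 5)
```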
5. Generate Insights
Translate patterns into business or research
insights:
“Sales peak in Q4 due to holiday demand.”
“Users in Segment A are more likely to
churn.”
“There is a strong positive correlation between
price and customer satisfaction—perhaps
higher price signals quality.”
Tools That Help
Python (Pandas, Matplotlib, Seaborn,
Scikit-learn)
R
Excel
Power BI / Tableau
SQL for querying structured data
Here's a clear example of identifying patterns
and insights from data, using a small dataset and
walking through the analysis:
Scenario: Sales Data for an Online Store
Dataset (sample):
Month   Product A Sales   Product B Sales   Marketing Spend ($)
Jan     120               80                1000
Feb     150               100               1200
Mar     170               130               1500
Apr     160               110               1300
May     180               140               1600
Jun     200               150               1700
Step 1: Identify Patterns
1.1 Trend in Sales Over Time:
Product A sales increase from 120 → 200
(steady growth).
Product B sales also increase from 80 → 150,
but slightly less steep.
Insight: Both products show consistent
monthly growth.
1.2 Correlation with Marketing Spend:
Marketing Spend rises alongside sales.
Check correlation:
o More marketing → More sales?
o Simple observation suggests a positive
correlation.
1.3 Monthly Sales Performance:
May and June are the best-performing months
for both products.
These also have the highest marketing spend.
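The eyeballed correlation can be checked numerically; a sketch using pandas on the sample table above:

```python
import pandas as pd

# The sample dataset from the scenario above
df = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
    'product_a': [120, 150, 170, 160, 180, 200],
    'product_b': [80, 100, 130, 110, 140, 150],
    'spend': [1000, 1200, 1500, 1300, 1600, 1700],
})

# Pearson correlation between marketing spend and each product's sales
corr_a = df['spend'].corr(df['product_a'])
corr_b = df['spend'].corr(df['product_b'])
print(f"Spend vs Product A sales: r = {corr_a:.2f}")
print(f"Spend vs Product B sales: r = {corr_b:.2f}")
```

Both coefficients come out strongly positive, which supports (but does not prove) the "more marketing, more sales" reading; correlation alone cannot establish causation.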
Step 2: Extract Insights
1.Marketing Drives Sales:
o As marketing spend increases, so do sales.
o Suggests a potential ROI analysis on
marketing.
2.Growth Rates Differ:
o Product A rose 67% from Jan to Jun (120 → 200), the larger gain in absolute units.
o Product B rose faster in percentage terms (87.5%, 80 → 150) but remains the lower seller.
3.Seasonality Not Evident (yet):
o No sharp dips or spikes suggest steady
growth rather than seasonal behavior.
o More data (e.g. full year) could confirm
seasonality patterns.
Step 3: Recommendations
Continue investing in marketing, especially
around periods of historically higher returns
(e.g. May, June).
Prioritize Product A in promotions – higher
ROI potential.
Collect more data (like customer segments or
regional sales) to improve targeting.
Time Series Analysis:
Time series analysis is a powerful statistical
technique used to analyze and forecast data points
collected or recorded at specific time intervals
(e.g., daily, monthly, yearly).
What Is Time Series Analysis?
Time series analysis involves methods for:
Understanding underlying patterns in time-
ordered data.
Identifying trends, seasonality, and cycles.
Making forecasts (predictions) for future time
points.
Core Components of a Time Series
1.Trend (T)
Long-term upward or downward movement in
the data.
2.Seasonality (S)
Regular pattern that repeats over a fixed period
(e.g., higher sales every December).
3.Cyclic Patterns (C)
Longer-term fluctuations not of fixed period
(e.g., economic cycles).
4.Irregular/Noise (I)
Random variation that can't be explained by
the above.
Time Series Models
1. Classical Decomposition
Additive: Yt=Tt+St+It
Multiplicative: Yt=Tt×St×It
2. Moving Averages
Smooths data to highlight trends.
3. Exponential Smoothing
Weighted averages of past data.
Variants: Simple, Holt’s Linear, Holt-Winters.
4. ARIMA (AutoRegressive Integrated Moving
Average)
Most popular model for forecasting.
AR: Past values
I: Differencing to make series stationary
MA: Past forecast errors
5. SARIMA (Seasonal ARIMA)
Extends ARIMA for seasonal data.
6. Machine Learning Models
Random Forest, XGBoost, LSTM (Deep
Learning for sequential data).
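Simple exponential smoothing (model 3 above) is easy to sketch directly; the α value and the series here are illustrative:

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: each smoothed value is a weighted
    average of the current observation and the previous smoothed value."""
    smoothed = [series[0]]  # initialize with the first observation
    for y in series[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed

# Illustrative monthly series
series = [120, 150, 170, 160, 180, 200]
print(exponential_smoothing(series, alpha=0.5))
```

A larger α weights recent observations more heavily; a smaller α smooths more aggressively.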
Steps in Time Series Analysis
1.Visualize the Data
2.Check for Stationarity
o Use ADF or KPSS test
3.Decompose the Series
4.Choose and Fit a Model
5.Validate the Model
o Use train/test split or cross-validation
6.Forecast and Evaluate
o Metrics: MAE, RMSE, MAPE
Common Tools and Libraries
Python: pandas, statsmodels, scikit-learn,
prophet, pmdarima
R: forecast, tseries, zoo, xts
Visualization: matplotlib, seaborn, plotly
Time series forecasting using ARIMA and
Prophet:
Time series forecasting is a crucial tool for
predicting future data points based on past
observations. Two popular models used for this
purpose are ARIMA (AutoRegressive Integrated
Moving Average) and Facebook Prophet. Here's
a concise comparison and implementation of both.
1. ARIMA (AutoRegressive Integrated Moving
Average)
Overview:
ARIMA is a traditional statistical model suited for
univariate time series data that is stationary or can
be made stationary.
Components:
AR (AutoRegressive): dependence on past
values.
I (Integrated): differencing to make the series
stationary.
MA (Moving Average): dependence on past
errors.
Strengths:
Good for short-term forecasts.
Works well with small datasets.
Transparent model parameters.
Limitations:
Doesn’t handle seasonality natively (need
SARIMA).
Assumes stationarity.
Can be complex to tune (p, d, q parameters).
2. Prophet (by Facebook)
Overview:
Prophet is a decomposable time series model
designed to handle trends, seasonality, and
holidays. It’s robust to missing data and outliers.
Components:
Trend
Seasonality (daily, weekly, yearly)
Holiday effects
Strengths:
Easy to use and interpret.
Handles missing data.
Captures seasonality and trend changes well.
Limitations:
May underperform for highly volatile or non-
seasonal data.
Slower on large datasets.
Implementation in Python
Let's demonstrate both models using a sample
dataset (e.g., monthly airline passengers):
Install Libraries
bash
pip install pandas numpy matplotlib statsmodels
prophet
Sample Data
python
import pandas as pd

# Load sample dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv"
df = pd.read_csv(url, parse_dates=['Month'])
df.columns = ['ds', 'y']  # Prophet requires these column names
Forecasting with ARIMA
python
from statsmodels.tsa.arima.model import ARIMA
import matplotlib.pyplot as plt
# Set index
df_arima = df.copy()
df_arima.set_index('ds', inplace=True)
# Fit ARIMA(p,d,q) model (example: ARIMA(2,1,2))
model = ARIMA(df_arima['y'], order=(2,1,2))
model_fit = model.fit()
# Forecast
forecast = model_fit.forecast(steps=12)
forecast.plot(label='Forecast')
df_arima['y'].plot(label='Original')
plt.legend()
plt.title("ARIMA Forecast")
plt.show()
Forecasting with Prophet
python
from prophet import Prophet
import matplotlib.pyplot as plt

# Initialize and fit the model
model = Prophet()
model.fit(df)

# Create a dataframe extending 12 months into the future
future = model.make_future_dataframe(periods=12, freq='M')
forecast = model.predict(future)
# Plot forecast
model.plot(forecast)
plt.title("Prophet Forecast")
plt.show()