MODULE #5
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is the process of examining datasets to summarize their
main characteristics, often using visual methods and statistical tools. It is a crucial step before
applying machine learning or statistical models.
Objectives of EDA:
Understand the structure, distribution, and quality of the data.
Detect outliers or anomalies.
Check assumptions for statistical methods.
Identify patterns, trends, and relationships between variables.
Guide feature engineering and model selection.
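A first pass over structure, quality, and distribution can be sketched as follows. The small DataFrame `toy` and its values are hypothetical stand-ins for a real dataset; only standard pandas calls are used.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
toy = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, np.nan, 250.00],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
    "day": ["Sun", "Sun", "Sat", "Sat", "Fri", "Fri"],
})

# Structure: column types and number of rows.
print(toy.dtypes)
print("rows:", len(toy))

# Quality: missing values per column.
missing = toy.isna().sum()
print(missing)

# Distribution: summary statistics for the numeric columns.
summary = toy.describe()
print(summary)

# Categorical distribution: frequency of each category.
day_counts = toy["day"].value_counts()
print(day_counts)
```

Here the missing-value count and the unusually large maximum of total_bill are the kind of quality issues this step is meant to surface before any modeling.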
PART 1: Data Visualization with Matplotlib & Seaborn
Why Visualize?
Visualizations simplify:
Distribution understanding
Relationship identification
Outlier detection
Pattern recognition
Libraries Used:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
Common Plots
Plot Type    | Use Case
Histogram    | Distribution of a single variable
Boxplot      | Detecting outliers, comparing groups
Scatter Plot | Relationship between 2 variables
Heatmap      | Correlation matrix
Pairplot     | Visualizing relationships among multiple variables
Example with Code
# Load data
df = sns.load_dataset('tips')
# Histogram of total_bill
sns.histplot(df['total_bill'], kde=True)
plt.title("Distribution of Total Bill")
plt.show()
# Boxplot of total_bill by sex
sns.boxplot(x='sex', y='total_bill', data=df)
plt.title("Total Bill by Gender")
plt.show()
# Scatter plot: total_bill vs tip
sns.scatterplot(x='total_bill', y='tip', hue='sex', data=df)
plt.title("Tip vs Total Bill")
plt.show()
# Heatmap of correlation (numeric columns only; recent pandas raises an error on categorical columns)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()
# Pairplot
sns.pairplot(df, hue='sex')
plt.show()
PART 2: Statistical Analysis & Hypothesis Testing
What is Hypothesis Testing?
It determines whether sample data provide enough evidence to reject a default claim (the null
hypothesis) about the population the data came from.
General Hypothesis Testing Workflow:
1. Define Hypotheses
o Null Hypothesis (H₀): No difference/effect
o Alternative Hypothesis (H₁): There is a difference/effect
2. Select Test (z, t, chi-square, ANOVA)
3. Compute test statistic
4. Determine p-value
5. Compare p-value to α (e.g., 0.05): reject H₀ if p < α; otherwise fail to reject H₀
2.1 Z-Test
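Use when the population variance is known, or when the sample is large (n > 30) so that the
sample standard deviation is a reasonable estimate of it.
Assumes the sample mean is approximately normally distributed.
Formula:
z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}
Where:
\bar{x} = Sample mean
\mu_0 = Hypothesized mean
\sigma = Population standard deviation
n = Sample size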
Problem 1: Is the mean total bill ≠ $19?
from statsmodels.stats.weightstats import ztest
z_stat, p_val = ztest(df['total_bill'], value=19)
print(f"Z-statistic: {z_stat}, P-value: {p_val}")
H₀: μ = 19
H₁: μ ≠ 19
If p-value < 0.05 → Reject H₀
Problem 2: Is the mean tip from female customers ≠ $2.50?
female_tips = df[df['sex'] == 'Female']['tip']
z_stat, p_val = ztest(female_tips, value=2.5)
print(f"Z-statistic: {z_stat}, P-value: {p_val}")
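To see what the test computes, the one-sample z statistic, (sample mean − hypothesized mean) / (standard deviation / √n), can be worked out by hand. This is a minimal sketch using only the standard library; the function name `one_sample_z` and the `bills` values are hypothetical, and the sample standard deviation stands in for σ, which is the usual large-sample approximation.

```python
import math

def one_sample_z(sample, mu0):
    """Return (z, two-sided p-value) for H0: mean == mu0."""
    n = len(sample)
    mean = sum(sample) / n
    # Sample variance (ddof = 1) stands in for the population variance.
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    z = (mean - mu0) / math.sqrt(var / n)
    # Two-sided p-value from the standard normal: P(|Z| > |z|).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

bills = [18.0, 22.0, 20.0, 24.0, 16.0, 20.0, 21.0, 19.0]  # hypothetical values
z_stat, p_val = one_sample_z(bills, mu0=19)
print(f"Z-statistic: {z_stat:.4f}, P-value: {p_val:.4f}")
```

With these values the p-value is well above 0.05, so H₀: μ = 19 would not be rejected. The sample is deliberately tiny to keep the arithmetic easy to follow; in practice a z-test needs a larger n.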
2.2 T-Test
Used when:
Population variance unknown
Small sample size (n < 30)
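Formula for two-sample t-test (pooled variance, the default in ttest_ind):
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \quad s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}
Where:
\bar{x}_i = Sample mean of group i
s_i^2 = Sample variance of group i
n_i = Size of group i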
Problem 1: Do male and female customers tip differently?
from scipy.stats import ttest_ind
male_tips = df[df['sex'] == 'Male']['tip']
female_tips = df[df['sex'] == 'Female']['tip']
t_stat, p_val = ttest_ind(male_tips, female_tips)
print(f"T-statistic: {t_stat}, P-value: {p_val}")
Problem 2: Do smokers tip differently than non-smokers?
smoker = df[df['smoker'] == 'Yes']['tip']
non_smoker = df[df['smoker'] == 'No']['tip']
t_stat, p_val = ttest_ind(smoker, non_smoker)
print(f"T-statistic: {t_stat}, P-value: {p_val}")
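The statistic behind the pooled (equal-variance) two-sample t-test can be reproduced with NumPy. This is a sketch under the equal-variance assumption; `pooled_t`, `group_a`, and `group_b` are hypothetical names and values. When the group variances clearly differ, `ttest_ind(..., equal_var=False)` (Welch's test) is the usual alternative.

```python
import numpy as np

def pooled_t(a, b):
    """Return (t statistic, degrees of freedom) for the pooled two-sample t-test."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n1, n2 = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    t = (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

group_a = [3.0, 4.0, 5.0, 4.0]  # hypothetical tips
group_b = [2.0, 3.0, 2.0, 1.0]
t_stat, dof = pooled_t(group_a, group_b)
print(f"T-statistic: {t_stat:.4f}, degrees of freedom: {dof}")
```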
2.3 Chi-Square Test
Used to test independence between two categorical variables.
Formula:
\chi^2 = \sum \frac{(O - E)^2}{E}
Where:
O = Observed frequency
E = Expected frequency
Problem 1: Are gender and smoking status independent?
from scipy.stats import chi2_contingency
table = pd.crosstab(df['sex'], df['smoker'])
chi2, p, dof, expected = chi2_contingency(table)
print(f"Chi2: {chi2}, P-value: {p}")
Problem 2: Is day of week related to smoking status?
table = pd.crosstab(df['day'], df['smoker'])
chi2, p, dof, expected = chi2_contingency(table)
print(f"Chi2: {chi2}, P-value: {p}")
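The expected frequencies and the chi-square statistic can be computed directly from a contingency table. The sketch below uses a hypothetical 2x2 table of counts. Note that chi2_contingency applies Yates' continuity correction to 2x2 tables by default, so its result matches this hand calculation only with correction=False.

```python
import numpy as np

observed = np.array([[30, 10],
                     [20, 40]])  # hypothetical 2x2 counts

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
# Expected count under independence: row total * column total / grand total.
expected = row_totals * col_totals / observed.sum()
# Sum of (O - E)^2 / E over all cells.
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(f"Chi2: {chi2_manual:.4f}, dof: {dof}")
```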
2.4 ANOVA (Analysis of Variance)
Used to compare means across 3 or more groups.
Formula:
F = \frac{\text{Between-group variance}}{\text{Within-group variance}}
Problem 1: Is average total bill different across days?
from scipy.stats import f_oneway
groups = [group['total_bill'] for name, group in df.groupby('day')]
f_stat, p_val = f_oneway(*groups)
print(f"F-statistic: {f_stat}, P-value: {p_val}")
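Problem 2: Do tips vary by time of day (Lunch vs Dinner)?
With only two groups, one-way ANOVA gives the same p-value as the pooled two-sample t-test (F = t²).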
groups = [group['tip'] for name, group in df.groupby('time')]
f_stat, p_val = f_oneway(*groups)
print(f"F-statistic: {f_stat}, P-value: {p_val}")
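The F ratio of between-group to within-group variance can be worked out by hand for a small example. This is a sketch with three hypothetical groups in `samples`; each variance is a sum of squares divided by its degrees of freedom (k − 1 between groups, N − k within groups).

```python
import numpy as np

samples = [np.array([1.0, 2.0, 3.0]),
           np.array([2.0, 3.0, 4.0]),
           np.array([5.0, 6.0, 7.0])]  # hypothetical groups

k = len(samples)
n_total = sum(len(g) for g in samples)
grand_mean = np.concatenate(samples).mean()

# Between-group: how far each group mean sits from the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in samples)
# Within-group: spread of observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in samples)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within
print(f"F-statistic: {f_manual:.4f}")
```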
PART 3: Identifying Patterns and Insights
Techniques:
GroupBy Analysis: Summarize categories
Correlation: Relationships between numeric variables
Value Counts: Categorical distributions
Outliers: Boxplots or z-scores
Examples:
# Correlation between total_bill and tip
print(df['total_bill'].corr(df['tip']))
# Average tip per day
print(df.groupby('day')['tip'].mean())
# Total bill by time
sns.boxplot(x='time', y='total_bill', data=df)
plt.title("Total Bill by Time")
plt.show()
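Outlier detection by z-scores and by the interquartile range (the 1.5 × IQR convention commonly used for boxplot whiskers) can be sketched as below. The `values` array and the z-score threshold of 2.0 are hypothetical choices for illustration; thresholds of 2.5 or 3 are also common.

```python
import numpy as np

values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 50.0])  # hypothetical

# z-score rule: flag points more than 2 standard deviations from the mean.
z = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z) > 2.0]

# IQR rule: flag points more than 1.5 * IQR beyond the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print("z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)
```

Both rules flag the value 50 here. The IQR rule is less affected by the outlier itself, because quartiles do not move much when one extreme value is added.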
Conclusion:
EDA is not optional—it’s essential. It helps:
Detect errors
Understand relationships
Validate assumptions
Make better decisions about modeling
Introduction to Time Series Analysis
What is a Time Series?
A Time Series is a sequence of data points recorded or observed at successive points in time,
usually at uniform intervals. Examples include:
Daily stock prices
Hourly temperature readings
Monthly sales figures
Annual GDP growth rates
Formally, a time series is a set of observations {x_t} indexed in time order t = 1, 2, 3, …, N.
Why Analyze Time Series?
Time series analysis focuses on understanding the underlying structure and patterns in the
data collected over time to:
Describe the data
Model the data generating process
Forecast future values
Detect anomalies or changes
Test hypotheses about temporal dynamics
Key Characteristics of Time Series Data
1. Trend: A long-term increase or decrease in the data. Example: Increasing global
temperature over decades.
2. Seasonality: Regular repeating patterns or cycles within specific time periods (daily,
weekly, yearly). Example: Retail sales peaking every December.
3. Cyclic Patterns: Fluctuations occurring at irregular intervals due to economic/business
cycles.
4. Noise (Irregularity): Random or unexplained variability not accounted for by trend or
seasonality.
5. Stationarity: A stationary time series has statistical properties (mean, variance) that do
not change over time. Many time series models require or assume stationarity.
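A crude way to see non-stationarity is to compare the mean of the first and second half of a series, before and after differencing. This is only an illustrative sketch on a simulated trending series (all names and values are hypothetical), not a formal test; a formal alternative is the augmented Dickey-Fuller test, available as adfuller in statsmodels.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Hypothetical trending series: each step adds about +1 plus noise.
steps = 1.0 + rng.normal(0.0, 0.5, size=n)
series = np.cumsum(steps)

def half_means(x):
    """Mean of the first half and of the second half of a series."""
    mid = len(x) // 2
    return x[:mid].mean(), x[mid:].mean()

raw_first, raw_second = half_means(series)
diff_first, diff_second = half_means(np.diff(series))

print(f"Raw series:  {raw_first:.1f} vs {raw_second:.1f}")    # means differ strongly
print(f"Differenced: {diff_first:.2f} vs {diff_second:.2f}")  # means are close
```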
Types of Time Series
Univariate Time Series: Observations of a single variable over time (e.g., daily
temperature).
Multivariate Time Series: Multiple variables observed over time (e.g., temperature,
humidity, and wind speed recorded hourly).
Components of Time Series
Mathematically, many time series can be decomposed as:
x_t = T_t + S_t + C_t + \epsilon_t
Where:
T_t: Trend component
S_t: Seasonal component
C_t: Cyclical component
\epsilon_t: Noise or residual
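A classical additive decomposition can be sketched with NumPy alone. The series below is hypothetical and noise-free (linear trend plus a period-4 seasonal pattern), and the cyclical component is not separated: in this simplified version it is assumed to be absorbed into the trend estimate. In practice, seasonal_decompose in statsmodels provides a comparable routine.

```python
import numpy as np

period = 4
t = np.arange(40)
seasonal_true = np.tile([2.0, -1.0, -2.0, 1.0], 10)  # sums to zero over one period
series = 0.5 * t + seasonal_true                     # hypothetical series

# Trend: centered moving average spanning one full period (2x4 MA for an even period).
weights = np.array([0.5, 1.0, 1.0, 1.0, 0.5]) / period
trend = np.convolve(series, weights, mode="valid")   # loses period/2 points at each end
half = period // 2
detrended = series[half:-half] - trend

# Seasonal: average the detrended values at each position within the period.
positions = t[half:-half] % period
seasonal = np.array([detrended[positions == k].mean() for k in range(period)])

# Residual: what remains after removing trend and seasonal parts.
residual = detrended - seasonal[positions]
print("Seasonal estimate:", seasonal)
```

Because the example has no noise, the estimated seasonal pattern matches the one used to build the series and the residual is zero.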
Goals of Time Series Analysis
Exploratory Analysis: Identify patterns, visualize trends and seasonality.
Modeling: Fit models that describe data (e.g., ARIMA, Exponential Smoothing).
Forecasting: Predict future values based on past data.
Anomaly Detection: Identify unusual or unexpected events.
Understanding Relationships: Between variables over time (cross-correlation,
causality).
Common Time Series Models
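Moving Average (MA) Models: Model current value as a function of past forecast errors (not to
be confused with moving-average smoothing, which averages neighboring observations).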
Autoregressive (AR) Models: Model current value as a function of past values.
ARMA / ARIMA Models: Combine AR and MA models; ARIMA incorporates
differencing to make non-stationary data stationary.
Seasonal ARIMA (SARIMA): Handles seasonality.
Exponential Smoothing: Weighted averages that give more importance to recent
observations.
State Space Models: Describe a series through hidden states that evolve over time; typically
estimated with the Kalman filter.
Machine Learning Approaches: LSTM, RNN for complex patterns.
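As an illustration of one entry in this list, simple exponential smoothing can be written in a few lines: each smoothed value is alpha times the new observation plus (1 − alpha) times the previous smoothed value. This is a minimal sketch; the function name, the `demand` values, and the choice to initialise with the first observation are assumptions.

```python
def simple_exp_smoothing(values, alpha):
    """Return the smoothed series s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    smoothed = [values[0]]  # assumption: initialise with the first observation
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

demand = [10.0, 20.0, 30.0]  # hypothetical observations
smoothed = simple_exp_smoothing(demand, alpha=0.5)
print(smoothed)  # [10.0, 15.0, 22.5]
```

A larger alpha follows recent observations more closely; a smaller alpha produces a smoother, slower-reacting series.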
Visualization in Time Series Analysis
Key plots include:
Line plots (time vs. value)
Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots
to check dependencies.
Seasonal plots to reveal repeating cycles.
Decomposition plots to isolate components.
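The quantity an ACF plot displays can be computed directly: the correlation of a series with a lagged copy of itself. The sketch below uses a hypothetical sine series with a 12-step cycle; `sample_acf` is an illustrative helper, not a library function.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation r_k for k = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = (centered ** 2).sum()
    # r_k = sum_t (x_t - mean)(x_{t-k} - mean) / sum_t (x_t - mean)^2
    return np.array([(centered[k:] * centered[: len(x) - k]).sum() / denom
                     for k in range(max_lag + 1)])

t = np.arange(120)
monthly = np.sin(2 * np.pi * t / 12)  # hypothetical series with a 12-step cycle
acf = sample_acf(monthly, max_lag=12)
print(np.round(acf, 2))
```

The autocorrelation is strongly negative at lag 6 (half a cycle) and strongly positive at lag 12 (a full cycle), which is how seasonality shows up in an ACF plot.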
Challenges in Time Series Analysis
Non-stationarity: Trends and seasonality complicate modeling.
Missing Data: Gaps can bias results.
Noise: Distinguishing signal from noise is hard.
High-frequency data: Can be voluminous and noisy.
Changing dynamics: Structural breaks or regime changes.
Practical Applications
Finance: Stock price forecasting, risk management.
Weather: Forecasting temperature, rainfall.
Operations: Demand forecasting, inventory management.
Healthcare: Monitoring patient vitals over time.
IoT & Sensor Data: Equipment failure prediction.
ARIMA Model (Autoregressive Integrated Moving Average)
Overview:
ARIMA is a classical and powerful model for analyzing and forecasting univariate time series
data. It combines three components:
AR: Autoregression — the model uses the relationship between an observation and a
number of lagged observations.
I: Integration — differencing of raw observations to make the time series stationary.
MA: Moving Average — the model uses the dependency between an observation and a
residual error from a moving average model applied to lagged observations.
Mathematical Foundation
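Given a time series {y_t}, an ARIMA(p, d, q) model is defined by three orders:
p: Number of autoregressive terms
d: Number of differences needed to make the series stationary
q: Number of lagged forecast errors in the prediction equation
Step 1: Differencing
y'_t = y_t - y_{t-1} (first difference, d = 1; repeat d times in general)
Step 2: Autoregressive (AR) Model
y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p} + \epsilon_t
Step 3: Moving Average (MA) Model
y'_t = c + \epsilon_t + \theta_1 \epsilon_{t-1} + \dots + \theta_q \epsilon_{t-q}
Step 4: Combining into ARIMA(p, d, q)
y'_t = c + \sum_{i=1}^{p} \phi_i y'_{t-i} + \sum_{j=1}^{q} \theta_j \epsilon_{t-j} + \epsilon_t
Here y'_t is the series after d differences and \epsilon_t is white noise.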
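The differencing and autoregressive steps can be illustrated on simulated data. The sketch below builds a hypothetical ARIMA(1, 1, 0) series (no MA part, no constant), differences it once, and estimates the AR coefficient by least squares. This is a simplification for illustration, not how statsmodels fits the model.

```python
import numpy as np

rng = np.random.default_rng(42)
phi_true, n = 0.6, 2000

# Simulate the differenced series as AR(1): d_t = phi * d_{t-1} + e_t.
noise = rng.normal(0.0, 1.0, size=n)
diffs = np.zeros(n)
for i in range(1, n):
    diffs[i] = phi_true * diffs[i - 1] + noise[i]

# Integrate once (d = 1) to obtain the non-stationary level series.
levels = np.cumsum(diffs)

# Step 1: first differencing recovers the stationary series.
d = np.diff(levels)

# Step 2: estimate phi by least squares of d_t on d_{t-1} (no intercept).
phi_hat = (d[1:] * d[:-1]).sum() / (d[:-1] ** 2).sum()

# One-step-ahead forecast of the level: last level plus predicted next difference.
next_level = levels[-1] + phi_hat * d[-1]
print(f"Estimated phi: {phi_hat:.3f} (true value {phi_true})")
```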
Model Identification
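Plot the data and its ACF/PACF to select p, d, and q: choose d so the differenced series looks
stationary, read p from where the PACF cuts off, and read q from where the ACF cuts off.
Use information criteria such as AIC/BIC to compare candidate orders (lower is better).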
Strengths and Weaknesses
Strengths                        | Weaknesses
Can model a wide range of series | Requires stationarity
Well-studied with strong theory  | Needs manual parameter tuning
Works well with linear data      | Limited for complex seasonal patterns
Python Example
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
# Load time series data (example)
df = pd.read_csv('your_timeseries.csv', index_col=0, parse_dates=True)
# Fit ARIMA(1,1,1)
model = ARIMA(df['value'], order=(1,1,1))
fit = model.fit()
print(fit.summary())
# Forecast next 5 points
forecast = fit.forecast(steps=5)
print(forecast)
Prophet Model
Overview:
Prophet is a modern time series forecasting model developed by Facebook, designed for
business time series that have multiple seasonalities and holiday effects. It is intuitive, requires
minimal tuning, and is robust to missing data and outliers.
Key Idea:
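Prophet decomposes time series into:
y(t) = g(t) + s(t) + h(t) + \epsilon_t
Mathematical Details
1. Trend g(t): piecewise-linear or saturating logistic growth, with changepoints where the growth rate changes
2. Seasonality s(t): a Fourier series with period P, s(t) = \sum_{n=1}^{N} \left( a_n \cos\frac{2\pi n t}{P} + b_n \sin\frac{2\pi n t}{P} \right)
3. Holiday Effects h(t): indicator variables for listed holidays, each with its own estimated effect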
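Prophet's seasonal component is usually described as a Fourier series, and that idea can be sketched with NumPy. The code below builds cosine and sine features for a weekly period and fits their coefficients to a hypothetical signal by ordinary least squares. It illustrates the idea only: `fourier_features` is an illustrative helper, and Prophet itself estimates these coefficients inside its Bayesian model.

```python
import numpy as np

def fourier_features(t, period, order):
    """Columns cos(2*pi*n*t/P), sin(2*pi*n*t/P) for n = 1..order."""
    t = np.asarray(t, dtype=float)
    cols = []
    for n in range(1, order + 1):
        angle = 2.0 * np.pi * n * t / period
        cols.append(np.cos(angle))
        cols.append(np.sin(angle))
    return np.column_stack(cols)

days = np.arange(28)
X = fourier_features(days, period=7, order=2)  # weekly pattern, 2 harmonics

# Fit the coefficients (a_n, b_n) to a hypothetical weekly signal.
signal = np.tile([5.0, 1.0, 0.0, 0.0, 1.0, 3.0, 6.0], 4)
centered = signal - signal.mean()
coef, *_ = np.linalg.lstsq(X, centered, rcond=None)
fitted = X @ coef
print("Coefficients:", np.round(coef, 3))
```

A higher order lets the seasonal curve change more quickly within the period, at the cost of a greater risk of overfitting.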
Model Fitting
Prophet uses Bayesian modeling and maximum a posteriori estimation.
It automatically detects change points and fits parameters using Stan (a probabilistic
programming language).
Strengths and Weaknesses
Strengths                         | Weaknesses
Handles multiple seasonalities    | Less flexible for non-additive effects
Incorporates holidays and events  | May overfit if many changepoints
Robust to missing data & outliers | Model interpretability can be complex
Easy to use with minimal tuning   | Focused on business time series
Python Example
from prophet import Prophet
import pandas as pd
# Data must have columns 'ds' (datetime) and 'y' (value)
df = pd.read_csv('your_timeseries.csv')
df['ds'] = pd.to_datetime(df['date_column'])
df.rename(columns={'value_column':'y'}, inplace=True)
model = Prophet(weekly_seasonality=True, yearly_seasonality=True)
model.fit(df)
# Make future dataframe (e.g., next 30 days)
future = model.make_future_dataframe(periods=30)
forecast = model.predict(future)
model.plot(forecast)
model.plot_components(forecast)
Summary Table
Aspect                | ARIMA                               | Prophet
Type                  | Classical linear model              | Additive model with trend+seasonality+holidays
Stationarity          | Requires stationary or differencing | No stationarity requirement
Seasonality           | Manual or seasonal ARIMA (SARIMA)   | Built-in multiple seasonalities
Holidays              | No built-in holiday handling        | Explicit holiday/event modeling
Handling missing data | Poor                                | Robust
Ease of use           | More technical & requires tuning    | User-friendly, automatic changepoints
Forecast Output       | Point forecast                      | Forecast with uncertainty intervals
How to Evaluate Time Series Models
1. Performance Metrics
These metrics quantify forecast accuracy by comparing predicted values \hat{y}_t with
actual values y_t.
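1.1. Mean Absolute Error (MAE)
MAE = \frac{1}{n} \sum_{t=1}^{n} |y_t - \hat{y}_t|
1.2. Mean Squared Error (MSE)
MSE = \frac{1}{n} \sum_{t=1}^{n} (y_t - \hat{y}_t)^2
1.3. Root Mean Squared Error (RMSE)
RMSE = \sqrt{MSE}
1.4. Mean Absolute Percentage Error (MAPE)
MAPE = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|
1.5. Symmetric Mean Absolute Percentage Error (sMAPE)
sMAPE = \frac{100}{n} \sum_{t=1}^{n} \frac{|y_t - \hat{y}_t|}{(|y_t| + |\hat{y}_t|) / 2}
1.6. Mean Directional Accuracy (MDA)
MDA = \frac{1}{n-1} \sum_{t=2}^{n} 1[\operatorname{sign}(y_t - y_{t-1}) = \operatorname{sign}(\hat{y}_t - y_{t-1})]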
Example in Python
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np
y_true = np.array([100, 150, 200, 250])
y_pred = np.array([110, 145, 195, 240])
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}, MAPE: {mape:.2f}%")
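sMAPE and MDA have no direct helper in sklearn.metrics, but both are short to write with NumPy. The sketch below uses one common definition of each (several sMAPE variants exist, so treat the exact formula as an assumption) and reuses the same example values.

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric MAPE in percent: error relative to the mean of |actual| and |forecast|."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return np.mean(np.abs(y_true - y_pred) / denom) * 100

def mda(y_true, y_pred):
    """Share of steps where the forecast moves in the same direction as the actual value."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    actual_dir = np.sign(y_true[1:] - y_true[:-1])
    pred_dir = np.sign(y_pred[1:] - y_true[:-1])
    return np.mean(actual_dir == pred_dir)

y_true = np.array([100, 150, 200, 250])
y_pred = np.array([110, 145, 195, 240])
smape_val = smape(y_true, y_pred)
mda_val = mda(y_true, y_pred)
print(f"sMAPE: {smape_val:.2f}%, MDA: {mda_val:.2f}")  # sMAPE: 4.88%, MDA: 1.00
```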
2. Train-Test Splitting in Time Series
Time Series ≠ Random Split.
Since time series data is sequential, standard random train-test splits violate temporal order.
Use chronological split.
2.1. Hold-out Split (Single Forecast Horizon)
Train: 2010–2018
Test: 2019
Simple but does not test performance across time.
2.2. Rolling Forecast Origin (Walk-Forward Validation)
Fold 1: Train → 2010–2017 | Test → 2018
Fold 2: Train → 2010–2018 | Test → 2019
Fold 3: Train → 2010–2019 | Test → 2020
Models are updated each time
Mimics real-world forecasting
2.3. Sliding Window Validation
Fold 1: Train → 2010–2015 | Test → 2016
Fold 2: Train → 2011–2016 | Test → 2017
Fold 3: Train → 2012–2017 | Test → 2018
Fixed-length training window, shifts forward
Useful for non-stationary data
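The fold layouts of the rolling-origin and sliding-window schemes can be generated with two small helpers. This is a plain-Python sketch; the function names and the `min_train`, `train_size`, and `horizon` parameters are hypothetical.

```python
def expanding_window_folds(periods, min_train, horizon=1):
    """Rolling forecast origin: the training set grows by one period per fold."""
    folds = []
    for end in range(min_train, len(periods) - horizon + 1):
        folds.append((periods[:end], periods[end:end + horizon]))
    return folds

def sliding_window_folds(periods, train_size, horizon=1):
    """Sliding window: the training set keeps a fixed length and shifts forward."""
    folds = []
    for start in range(0, len(periods) - train_size - horizon + 1):
        end = start + train_size
        folds.append((periods[start:end], periods[end:end + horizon]))
    return folds

years = list(range(2010, 2019))  # 2010..2018
for train, test in sliding_window_folds(years, train_size=6):
    print(f"Train: {train[0]}-{train[-1]} | Test: {test[0]}")
```

With these inputs the sliding-window helper reproduces the three folds listed in 2.3; calling expanding_window_folds on the years 2010 to 2020 with min_train=8 reproduces the folds in 2.2.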
Time Series Split in Python
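import numpy as np
from sklearn.model_selection import TimeSeriesSplit
data = np.arange(12)  # placeholder series; replace with your own observations in time order
tscv = TimeSeriesSplit(n_splits=3)
# Each fold trains on an expanding prefix and tests on the block that follows it
for train_index, test_index in tscv.split(data):
    print("Train:", train_index, "Test:", test_index)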
3. Visual Evaluation Techniques
Forecast vs. Actual Plot
import matplotlib.pyplot as plt
plt.plot(y_true, label='Actual')
plt.plot(y_pred, label='Forecast')
plt.legend()
plt.title("Actual vs Forecast")
plt.show()
Good for spotting bias, lag, or poor seasonality modeling.
Residual Plots
Residuals should be white noise (mean 0, constant variance, no autocorrelation).
residuals = y_true - y_pred
plt.plot(residuals)
plt.title("Residuals")
plt.show()
You can also check ACF/PACF of residuals.
4. Backtesting
Backtesting refers to simulating model performance in historical settings using only past data
available at the time.
Helps understand how model would perform in production.
Can be performed using rolling-origin evaluation with time series cross-validation.
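A walk-forward backtest can be sketched as a loop that, at each step, forecasts the next point using only the history available at that time. The example uses a naive "last value" forecast as a placeholder for a real model; the function name and the `sales` values are hypothetical.

```python
import numpy as np

def backtest_naive(series, min_train):
    """Walk forward one step at a time; forecast each point with the last seen value."""
    errors = []
    for origin in range(min_train, len(series)):
        history = series[:origin]   # only data available at forecast time
        forecast = history[-1]      # naive forecast (placeholder for a fitted model)
        errors.append(abs(series[origin] - forecast))
    return float(np.mean(errors))

sales = [100.0, 102.0, 101.0, 105.0, 107.0, 106.0]  # hypothetical
mae_backtest = backtest_naive(sales, min_train=3)
print(f"Backtest MAE: {mae_backtest:.2f}")
```

Replacing the naive forecast with a model that is refit on `history` at each step gives the rolling-origin evaluation described above. The naive score is also a useful baseline that any real model should beat.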
Summary
Metric | Description                | Best for
MAE    | Average magnitude of error | General-purpose, interpretable
RMSE   | Penalizes large errors     | Sensitive applications
MAPE   | Percent error              | When scale needs to be removed
sMAPE  | Symmetric percentage       | Safer than MAPE
MDA    | Directional correctness    | Trend prediction