
MODULE #5

Exploratory Data Analysis (EDA) is the process of examining datasets to summarize their
main characteristics, often using visual methods and statistical tools. It is a crucial step before
applying machine learning or statistical models.

Exploratory Data Analysis (EDA)


Objectives of EDA:
 Understand the structure, distribution, and quality of the data.
 Detect outliers or anomalies.
 Check assumptions for statistical methods.
 Identify patterns, trends, and relationships between variables.
 Guide feature engineering and model selection.

PART 1: Data Visualization with Matplotlib & Seaborn


Why Visualize?

Visualizations make it easier to:

 Understand distributions
 Identify relationships
 Detect outliers
 Recognize patterns

Libraries Used:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

Common Plots

Plot Type | Use Case
Histogram | Distribution of a single variable
Boxplot | Detecting outliers, comparing groups
Scatter Plot | Relationship between 2 variables
Heatmap | Correlation matrix
Pairplot | Visualizing relationships among multiple variables
Example with Code
# Load data
df = sns.load_dataset('tips')

# Histogram of total_bill
sns.histplot(df['total_bill'], kde=True)
plt.title("Distribution of Total Bill")
plt.show()

# Boxplot of total_bill by sex
sns.boxplot(x='sex', y='total_bill', data=df)
plt.title("Total Bill by Gender")
plt.show()

# Scatter plot: total_bill vs tip
sns.scatterplot(x='total_bill', y='tip', hue='sex', data=df)
plt.title("Tip vs Total Bill")
plt.show()

# Heatmap of correlation (numeric columns only)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()

# Pairplot
sns.pairplot(df, hue='sex')
plt.show()

PART 2: Statistical Analysis & Hypothesis Testing


What is Hypothesis Testing?

Hypothesis testing determines whether the sample data provide enough evidence to support a claim about the underlying population.

General Hypothesis Testing Workflow:

1. Define Hypotheses
o Null Hypothesis (H₀): No difference/effect
o Alternate Hypothesis (H₁): There is a difference/effect
2. Select Test (z, t, chi-square, ANOVA)
3. Compute test statistic
4. Determine p-value
5. Compare p-value to α (e.g., 0.05)
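
As a minimal sketch of step 5, the decision rule in Python (alpha and p_val here are placeholder values; the tests below compute real p-values):

alpha = 0.05   # chosen significance level
p_val = 0.03   # placeholder; in practice, take this from one of the tests below

if p_val < alpha:
    print("Reject H0: evidence of a difference/effect")
else:
    print("Fail to reject H0: insufficient evidence")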

2.1 Z-Test
 Use when the population variance is known or the sample size is large (n > 30).
 Assumes normality.
Formula (one-sample z-test):

z = (x̄ − μ₀) / (σ / √n)
Problem 1: Is the mean total bill ≠ $19?

from statsmodels.stats.weightstats import ztest

z_stat, p_val = ztest(df['total_bill'], value=19)
print(f"Z-statistic: {z_stat}, P-value: {p_val}")

 H₀: μ = 19
 H₁: μ ≠ 19

If p-value < 0.05 → Reject H₀

Problem 2: Are tips by females ≠ $2.5?

female_tips = df[df['sex'] == 'Female']['tip']
z_stat, p_val = ztest(female_tips, value=2.5)
print(f"Z-statistic: {z_stat}, P-value: {p_val}")

2.2 T-Test
Used when:

 Population variance is unknown
 Sample size is small (n < 30)

Formula for two-sample t-test (unequal variances):

t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)

Problem 1: Do male and female customers tip differently?

from scipy.stats import ttest_ind

male_tips = df[df['sex'] == 'Male']['tip']
female_tips = df[df['sex'] == 'Female']['tip']
t_stat, p_val = ttest_ind(male_tips, female_tips)
print(f"T-statistic: {t_stat}, P-value: {p_val}")
Problem 2: Do smokers tip differently than non-smokers?
smoker = df[df['smoker'] == 'Yes']['tip']
non_smoker = df[df['smoker'] == 'No']['tip']
t_stat, p_val = ttest_ind(smoker, non_smoker)
print(f"T-statistic: {t_stat}, P-value: {p_val}")

2.3 Chi-Square Test


Used to test independence between two categorical variables.

Formula:

χ² = Σ (O − E)² / E

Where:

 O = Observed frequency
 E = Expected frequency

Problem 1: Are gender and smoking status independent?

from scipy.stats import chi2_contingency

table = pd.crosstab(df['sex'], df['smoker'])
chi2, p, dof, expected = chi2_contingency(table)
print(f"Chi2: {chi2}, P-value: {p}")

Problem 2: Is day of week related to smoking status?

table = pd.crosstab(df['day'], df['smoker'])
chi2, p, dof, expected = chi2_contingency(table)
print(f"Chi2: {chi2}, P-value: {p}")

2.4 ANOVA (Analysis of Variance)


Used to compare means across 3 or more groups.

Formula:

F = Between-group variance / Within-group variance

Problem 1: Is average total bill different across days?

from scipy.stats import f_oneway

groups = [group['total_bill'] for name, group in df.groupby('day')]
f_stat, p_val = f_oneway(*groups)
print(f"F-statistic: {f_stat}, P-value: {p_val}")

Problem 2: Do tips vary by time of day (Lunch vs Dinner)?

groups = [group['tip'] for name, group in df.groupby('time')]
f_stat, p_val = f_oneway(*groups)
print(f"F-statistic: {f_stat}, P-value: {p_val}")

PART 3: Identifying Patterns and Insights


Techniques:

 GroupBy Analysis: Summarize categories
 Correlation: Relationships between numeric variables
 Value Counts: Categorical distributions
 Outliers: Boxplots or z-scores

Examples:
# Correlation between total_bill and tip
print(df['total_bill'].corr(df['tip']))

# Average tip per day
print(df.groupby('day')['tip'].mean())

# Total bill by time
sns.boxplot(x='time', y='total_bill', data=df)
plt.title("Total Bill by Time")
plt.show()
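
Two techniques from the list above (value counts and z-score outliers) are not shown; a short sketch on the same tips data:

from scipy import stats

# Categorical distribution of days
print(df['day'].value_counts())

# Z-score outliers in total_bill (|z| > 3 is a common rule of thumb)
z_scores = stats.zscore(df['total_bill'])
print(df[abs(z_scores) > 3])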

Conclusion:
EDA is not optional—it’s essential. It helps:

 Detect errors
 Understand relationships
 Validate assumptions
 Make better decisions about modeling

Introduction to Time Series Analysis
What is a Time Series?
A Time Series is a sequence of data points recorded or observed at successive points in time,
usually at uniform intervals. Examples include:

 Daily stock prices
 Hourly temperature readings
 Monthly sales figures
 Annual GDP growth rates

Formally, a time series is a set of observations {x_t} indexed in time order t = 1, 2, 3, …, N.

Why Analyze Time Series?


Time series analysis focuses on understanding the underlying structure and patterns in the
data collected over time to:

 Describe the data
 Model the data-generating process
 Forecast future values
 Detect anomalies or changes
 Test hypotheses about temporal dynamics

Key Characteristics of Time Series Data


1. Trend: A long-term increase or decrease in the data. Example: Increasing global
temperature over decades.
2. Seasonality: Regular repeating patterns or cycles within specific time periods (daily,
weekly, yearly). Example: Retail sales peaking every December.
3. Cyclic Patterns: Fluctuations occurring at irregular intervals due to economic/business
cycles.
4. Noise (Irregularity): Random or unexplained variability not accounted for by trend or
seasonality.
5. Stationarity: A stationary time series has statistical properties (mean, variance) that do
not change over time. Many time series models require or assume stationarity.
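
Stationarity can be checked with a unit-root test; below is a minimal sketch using the Augmented Dickey-Fuller test from statsmodels on a made-up random-walk series (the series here is only for illustration):

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Hypothetical series: a random walk, which is non-stationary
rng = np.random.default_rng(42)
series = pd.Series(np.cumsum(rng.normal(size=300)))

# ADF test: H0 = the series has a unit root (is non-stationary)
adf_stat, p_value, *_ = adfuller(series)
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3f}")
# Small p-value (< 0.05) -> reject H0 and treat the series as stationary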

Types of Time Series


 Univariate Time Series: Observations of a single variable over time (e.g., daily
temperature).
 Multivariate Time Series: Multiple variables observed over time (e.g., temperature,
humidity, and wind speed recorded hourly).

Components of Time Series


Mathematically, many time series can be decomposed as:

x_t = T_t + S_t + C_t + ε_t

Where:

 T_t: Trend component
 S_t: Seasonal component
 C_t: Cyclical component
 ε_t: Noise or residual
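
A minimal sketch of this additive decomposition with statsmodels, on a made-up monthly series (note that seasonal_decompose estimates trend, seasonal, and residual parts but does not separate a cyclical term):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical monthly series: trend + yearly seasonality + noise
idx = pd.date_range('2015-01-01', periods=60, freq='MS')
series = pd.Series(np.linspace(10, 30, 60)
                   + 5 * np.sin(2 * np.pi * idx.month / 12)
                   + np.random.default_rng(0).normal(0, 1, 60), index=idx)

# Additive decomposition: observed = trend + seasonal + residual
result = seasonal_decompose(series, model='additive', period=12)
result.plot()
plt.show()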

Goals of Time Series Analysis


 Exploratory Analysis: Identify patterns, visualize trends and seasonality.
 Modeling: Fit models that describe data (e.g., ARIMA, Exponential Smoothing).
 Forecasting: Predict future values based on past data.
 Anomaly Detection: Identify unusual or unexpected events.
 Understanding Relationships: Between variables over time (cross-correlation,
causality).

Common Time Series Models


 Moving Average (MA) Models: Smooth data by averaging neighboring observations.
 Autoregressive (AR) Models: Model current value as a function of past values.
 ARMA / ARIMA Models: Combine AR and MA models; ARIMA incorporates
differencing to make non-stationary data stationary.
 Seasonal ARIMA (SARIMA): Handles seasonality.
 Exponential Smoothing: Weighted averages that give more importance to recent
observations.
 State Space Models: Like Kalman Filters for dynamic systems.
 Machine Learning Approaches: LSTM, RNN for complex patterns.
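
As one concrete illustration, a minimal Holt-Winters exponential smoothing sketch with statsmodels, again on a made-up monthly series:

import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Hypothetical monthly series with trend and yearly seasonality
idx = pd.date_range('2015-01-01', periods=60, freq='MS')
y = pd.Series(np.linspace(100, 200, 60)
              + 10 * np.sin(2 * np.pi * idx.month / 12)
              + np.random.default_rng(1).normal(0, 3, 60), index=idx)

# Additive trend and additive yearly seasonality, weighting recent data more
model = ExponentialSmoothing(y, trend='add', seasonal='add', seasonal_periods=12)
fit = model.fit()
print(fit.forecast(6))  # forecast the next 6 months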

Visualization in Time Series Analysis


Key plots include:

 Line plots (time vs. value)
 Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots to check dependencies.
 Seasonal plots to reveal repeating cycles.
 Decomposition plots to isolate components.
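
A minimal ACF/PACF sketch with statsmodels, using a made-up AR(1)-like series for illustration:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Hypothetical AR(1) series: x_t = 0.7 * x_{t-1} + noise
rng = np.random.default_rng(0)
x = np.zeros(200)
for t in range(1, 200):
    x[t] = 0.7 * x[t - 1] + rng.normal()
series = pd.Series(x)

# For an AR(1) process the ACF tails off and the PACF cuts off after lag 1
plot_acf(series, lags=24)
plot_pacf(series, lags=24, method='ywm')
plt.show()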

Challenges in Time Series Analysis


 Non-stationarity: Trends and seasonality complicate modeling.
 Missing Data: Gaps can bias results.
 Noise: Distinguishing signal from noise is hard.
 High-frequency data: Can be voluminous and noisy.
 Changing dynamics: Structural breaks or regime changes.

Practical Applications
 Finance: Stock price forecasting, risk management.
 Weather: Forecasting temperature, rainfall.
 Operations: Demand forecasting, inventory management.
 Healthcare: Monitoring patient vitals over time.
 IoT & Sensor Data: Equipment failure prediction.

ARIMA Model (Autoregressive Integrated Moving Average)


Overview:

ARIMA is a classical and powerful model for analyzing and forecasting univariate time series
data. It combines three components:

 AR: Autoregression — the model uses the relationship between an observation and a
number of lagged observations.
 I: Integration — differencing of raw observations to make the time series stationary.
 MA: Moving Average — the model uses the dependency between an observation and a
residual error from a moving average model applied to lagged observations.

Mathematical Foundation
Given a time series {y_t}, an ARIMA(p, d, q) model is specified by three orders:

 p: Number of autoregressive terms
 d: Number of differences needed to make the series stationary
 q: Number of lagged forecast errors in the prediction equation

Step 1: Differencing

y'_t = y_t − y_{t−1}   (applied d times until the series is stationary)

Step 2: Autoregressive (AR) Model

y_t = c + φ_1 y_{t−1} + … + φ_p y_{t−p} + ε_t

Step 3: Moving Average (MA) Model

y_t = μ + ε_t + θ_1 ε_{t−1} + … + θ_q ε_{t−q}

Step 4: Combining into ARIMA(p, d, q)

y'_t = c + φ_1 y'_{t−1} + … + φ_p y'_{t−p} + ε_t + θ_1 ε_{t−1} + … + θ_q ε_{t−q},
where y' is the d-times differenced series.


Model Identification
 Plot data and ACF/PACF to select p,d, q.
 Use methods like AIC/BIC to select best parameters.
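
A minimal sketch of AIC-based order selection with statsmodels, looping over a small (p, d, q) grid on a made-up series (pmdarima's auto_arima is a common alternative):

import itertools
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical series for illustration
rng = np.random.default_rng(0)
y = pd.Series(np.cumsum(rng.normal(size=200)))

best_aic, best_order = np.inf, None
for p, d, q in itertools.product(range(3), range(2), range(3)):
    try:
        fit = ARIMA(y, order=(p, d, q)).fit()
        if fit.aic < best_aic:
            best_aic, best_order = fit.aic, (p, d, q)
    except Exception:
        continue  # skip orders that fail to converge

print(f"Best order by AIC: {best_order} (AIC = {best_aic:.1f})")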

Strengths and Weaknesses


Strengths | Weaknesses
Can model a wide range of series | Requires stationarity
Well-studied with strong theory | Needs manual parameter tuning
Works well with linear data | Limited for complex seasonal patterns

Python Example
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Load time series data (example)
df = pd.read_csv('your_timeseries.csv', index_col=0, parse_dates=True)

# Fit ARIMA(1,1,1)
model = ARIMA(df['value'], order=(1,1,1))
fit = model.fit()
print(fit.summary())

# Forecast next 5 points
forecast = fit.forecast(steps=5)
print(forecast)

Prophet Model
Overview:

Prophet is a modern time series forecasting model developed by Facebook, designed for
business time series that have multiple seasonality and holiday effects. It’s intuitive, requires
minimal tuning, and is robust to missing data and outliers.

Key Idea:

Prophet decomposes a time series as:

y(t) = g(t) + s(t) + h(t) + ε_t

Mathematical Details
1. Trend g(t): piecewise linear or logistic growth with automatically detected changepoints
2. Seasonality s(t): periodic effects modeled with Fourier series (e.g., weekly, yearly)
3. Holiday Effects h(t): indicator terms for user-supplied holidays and events

Model Fitting

 Prophet uses Bayesian modeling and maximum a posteriori (MAP) estimation.
 It automatically detects changepoints and fits parameters using Stan (a probabilistic
programming language).

Strengths and Weaknesses


Strengths | Weaknesses
Handles multiple seasonalities | Less flexible for non-additive effects
Incorporates holidays and events | May overfit if many changepoints
Robust to missing data & outliers | Model interpretability can be complex
Easy to use with minimal tuning | Focused on business time series

Python Example
from prophet import Prophet
import pandas as pd

# Data must have columns 'ds' (datetime) and 'y' (value)
df = pd.read_csv('your_timeseries.csv')
df['ds'] = pd.to_datetime(df['date_column'])
df.rename(columns={'value_column': 'y'}, inplace=True)

model = Prophet(weekly_seasonality=True, yearly_seasonality=True)
model.fit(df)

# Make future dataframe (e.g., next 30 days)
future = model.make_future_dataframe(periods=30)
forecast = model.predict(future)

# Plot forecast and its components
model.plot(forecast)
model.plot_components(forecast)
Summary Table
Aspect | ARIMA | Prophet
Type | Classical linear model | Additive model with trend + seasonality + holidays
Stationarity | Requires stationarity or differencing | No stationarity requirement
Seasonality | Manual or seasonal ARIMA (SARIMA) | Built-in multiple seasonalities
Holidays | No built-in holiday handling | Explicit holiday/event modeling
Handling missing data | Poor | Robust
Ease of use | More technical & requires tuning | User-friendly, automatic changepoints
Forecast output | Point forecast | Forecast with uncertainty intervals

How to Evaluate Time Series Models


1. Performance Metrics
These metrics quantify forecast accuracy by comparing predicted values ŷ_t with
actual values y_t.

1.1. Mean Absolute Error (MAE)

MAE = (1/n) Σ |y_t − ŷ_t|

1.2. Mean Squared Error (MSE)

MSE = (1/n) Σ (y_t − ŷ_t)²

1.3. Root Mean Squared Error (RMSE)

RMSE = √MSE

1.4. Mean Absolute Percentage Error (MAPE)

MAPE = (100/n) Σ |(y_t − ŷ_t) / y_t|

1.5. Symmetric Mean Absolute Percentage Error (sMAPE)

sMAPE = (100/n) Σ |y_t − ŷ_t| / ((|y_t| + |ŷ_t|) / 2)

1.6. Mean Directional Accuracy (MDA)

MDA = proportion of steps where sign(ŷ_t − y_{t−1}) = sign(y_t − y_{t−1})

Example in Python
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

y_true = np.array([100, 150, 200, 250])
y_pred = np.array([110, 145, 195, 240])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}, MAPE: {mape:.2f}%")

2. Train-Test Splitting in Time Series


Time Series ≠ Random Split.

Since time series data is sequential, a standard random train-test split violates temporal order.
Use a chronological split instead.

2.1. Hold-out Split (Single Forecast Horizon)


Train: 2010–2018
Test: 2019

 Simple but does not test performance across time.

2.2. Rolling Forecast Origin (Walk-Forward Validation)


Fold 1: Train → 2010–2017 | Test → 2018
Fold 2: Train → 2010–2018 | Test → 2019
Fold 3: Train → 2010–2019 | Test → 2020

 Models are updated each time
 Mimics real-world forecasting

2.3. Sliding Window Validation


Fold 1: Train → 2010–2015 | Test → 2016
Fold 2: Train → 2011–2016 | Test → 2017
Fold 3: Train → 2012–2017 | Test → 2018

 Fixed-length training window, shifts forward
 Useful for non-stationary data

Time Series Split in Python

from sklearn.model_selection import TimeSeriesSplit
import numpy as np

data = np.arange(12)  # example series; replace with your own values
tscv = TimeSeriesSplit(n_splits=3)

for train_index, test_index in tscv.split(data):
    print("Train:", train_index, "Test:", test_index)

3. Visual Evaluation Techniques


Forecast vs. Actual Plot
import matplotlib.pyplot as plt

plt.plot(y_true, label='Actual')
plt.plot(y_pred, label='Forecast')
plt.legend()
plt.title("Actual vs Forecast")
plt.show()

 Good for spotting bias, lag, or poor seasonality modeling.

Residual Plots

 Residuals should be white noise (mean 0, constant variance, no autocorrelation).

residuals = y_true - y_pred

plt.plot(residuals)
plt.title("Residuals")
plt.show()

 You can also check ACF/PACF of residuals.
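
A quick sketch of that check with statsmodels, assuming residuals is a reasonably long array of one-step-ahead errors from a fitted model (a white-noise stand-in is used here for illustration):

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# Stand-in residuals; in practice use the errors from your fitted model
residuals = np.random.default_rng(1).normal(size=100)

# Spikes outside the confidence band indicate leftover autocorrelation
plot_acf(residuals, lags=20)
plt.show()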

4. Backtesting
Backtesting refers to simulating model performance in historical settings using only past data
available at the time.

 Helps understand how the model would perform in production.
 Can be performed using rolling-origin evaluation with time series cross-validation.
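
A minimal walk-forward backtest sketch with ARIMA, refitting at each step on only the data available up to that point (the series and the ARIMA(1,1,1) order are assumptions for illustration):

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_absolute_error

# Hypothetical series for illustration
rng = np.random.default_rng(7)
y = pd.Series(np.cumsum(rng.normal(size=120)))

history_end, preds, actuals = 100, [], []
for t in range(history_end, len(y)):
    fit = ARIMA(y.iloc[:t], order=(1, 1, 1)).fit()   # train only on past data
    preds.append(fit.forecast(steps=1).iloc[0])      # one-step-ahead forecast
    actuals.append(y.iloc[t])

print(f"Backtest MAE: {mean_absolute_error(actuals, preds):.3f}")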

Summary
Metric | Description | Best for
MAE | Average magnitude of error | General-purpose, interpretable
RMSE | Penalizes large errors | Sensitive applications
MAPE | Percent error | When scale needs to be removed
sMAPE | Symmetric percentage | Safer than MAPE
MDA | Directional correctness | Trend prediction
