Module 5

The document covers Exploratory Data Analysis (EDA) and Time Series Analysis, detailing their importance, objectives, and methodologies. EDA involves visualizing data and conducting statistical tests to understand data characteristics, while Time Series Analysis focuses on analyzing data points collected over time to identify trends, seasonality, and patterns. It also introduces models like ARIMA and Prophet for forecasting, along with evaluation metrics and best practices for handling time series data.

Uploaded by

SATHYABAMA MADHANKUMAR

MODULE #5

Exploratory Data Analysis (EDA) is the process of examining datasets to summarize their
main characteristics, often using visual methods and statistical tools. It is a crucial step before
applying machine learning or statistical models.

Exploratory Data Analysis (EDA)

Objectives of EDA:
• Understand the structure, distribution, and quality of the data.
• Detect outliers or anomalies.
• Check assumptions for statistical methods.
• Identify patterns, trends, and relationships between variables.
• Guide feature engineering and model selection.
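
A first pass over these objectives can be sketched with a few pandas calls. The small DataFrame below is hypothetical toy data (its name `df_demo` and its values are assumptions for illustration, not part of the tips dataset used later).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
df_demo = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, np.nan, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "day": ["Sun", "Sun", "Sat", "Sat", None],
})

# Structure: number of rows/columns and the type of each column.
print(df_demo.shape)
print(df_demo.dtypes)

# Quality: count missing values per column.
missing = df_demo.isna().sum()
print(missing)

# Distribution: summary statistics for the numeric columns.
summary = df_demo.describe()
print(summary)
```

Checking missing values before plotting is a deliberate ordering here: gaps found at this stage affect how the later plots and tests should be read.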

PART 1: Data Visualization with Matplotlib & Seaborn

Why Visualize?

Visualizations simplify:

• Distribution understanding
• Relationship identification
• Outlier detection
• Pattern recognition

Libraries Used:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

Common Plots

Plot Type | Use Case
Histogram | Distribution of a single variable
Boxplot | Detecting outliers, comparing groups
Scatter Plot | Relationship between 2 variables
Heatmap | Correlation matrix
Pairplot | Visualizing relationships among multiple variables

Example with Code
# Load data
df = sns.load_dataset('tips')

# Histogram of total_bill
sns.histplot(df['total_bill'], kde=True)
plt.title("Distribution of Total Bill")
plt.show()

# Boxplot of total_bill by sex

sns.boxplot(x='sex', y='total_bill', data=df)
plt.title("Total Bill by Gender")
plt.show()

# Scatter plot: total_bill vs tip

sns.scatterplot(x='total_bill', y='tip', hue='sex', data=df)
plt.title("Tip vs Total Bill")
plt.show()

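# Heatmap of correlation (numeric columns only; the categorical columns
# in 'tips' would make a plain df.corr() fail on recent pandas versions)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')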
plt.title("Correlation Heatmap")
plt.show()

# Pairplot
sns.pairplot(df, hue='sex')
plt.show()

PART 2: Statistical Analysis & Hypothesis Testing

What is Hypothesis Testing?

It helps determine if there is enough evidence to support a specific claim about a dataset.

General Hypothesis Testing Workflow:

1. Define Hypotheses
   o Null Hypothesis (H₀): No difference/effect
   o Alternative Hypothesis (H₁): There is a difference/effect
2. Select Test (z, t, chi-square, ANOVA)
3. Compute test statistic
4. Determine p-value
5. Compare p-value to α (e.g., 0.05): reject H₀ if p < α; otherwise fail to reject H₀

2.1 Z-Test
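• Use when the population variance is known, or when the sample is large (n > 30) so that the sample standard deviation is a reasonable substitute.
• Assumes the sample mean is approximately normally distributed.
Formula:

z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}

where \bar{x} is the sample mean, \mu_0 the hypothesized mean, \sigma the population standard deviation (estimated by the sample standard deviation s when n is large), and n the sample size.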

Problem 1: Is the mean total bill ≠ $19?

from statsmodels.stats.weightstats import ztest

z_stat, p_val = ztest(df['total_bill'], value=19)

print(f"Z-statistic: {z_stat}, P-value: {p_val}")

• H₀: μ = 19
• H₁: μ ≠ 19

If p-value < 0.05 → Reject H₀
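
To make the z-statistic and the decision rule concrete, here is a minimal hand-rolled sketch. The function name `one_sample_z` and the sample values are hypothetical; the sample standard deviation (ddof=1) stands in for σ, which is also what `ztest` should use with its defaults.

```python
import math
import numpy as np

def one_sample_z(sample, mu0):
    """Return (z, two-sided p-value), estimating sigma by the sample std."""
    x = np.asarray(sample, dtype=float)
    n = x.size
    se = x.std(ddof=1) / math.sqrt(n)      # standard error of the mean
    z = (x.mean() - mu0) / se
    # Standard normal CDF: Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    p = 2.0 * (1.0 - phi)                  # two-sided p-value
    return z, p

# Hypothetical sample, not the tips data.
sample = [18.0, 21.0, 19.5, 22.0, 20.5, 19.0, 23.0, 21.5]
z, p = one_sample_z(sample, mu0=19)

alpha = 0.05
print(f"z = {z:.3f}, p = {p:.4f}, reject H0: {p < alpha}")
```

With only eight observations this is purely an illustration of the arithmetic; for a sample this small a t-test would normally be the better choice.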

Problem 2: Is the mean tip from female customers ≠ $2.50?

female_tips = df[df['sex'] == 'Female']['tip']
z_stat, p_val = ztest(female_tips, value=2.5)
print(f"Z-statistic: {z_stat}, P-value: {p_val}")

2.2 T-Test
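Used when:

• Population variance is unknown
• Sample size is small (n < 30)

Formula for two-sample t-test (pooled variance, which is the default of ttest_ind):

t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}}, \quad s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}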

Problem 1: Do male and female customers tip differently?

from scipy.stats import ttest_ind

male_tips = df[df['sex'] == 'Male']['tip']

female_tips = df[df['sex'] == 'Female']['tip']
t_stat, p_val = ttest_ind(male_tips, female_tips)
print(f"T-statistic: {t_stat}, P-value: {p_val}")

Problem 2: Do smokers tip differently than non-smokers?

smoker = df[df['smoker'] == 'Yes']['tip']
non_smoker = df[df['smoker'] == 'No']['tip']
t_stat, p_val = ttest_ind(smoker, non_smoker)
print(f"T-statistic: {t_stat}, P-value: {p_val}")
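
The pooled two-sample statistic can be checked by hand against `ttest_ind`. This is a sketch on hypothetical numbers (`group_a`, `group_b` and the helper `pooled_t` are assumptions for illustration); it also shows Welch's variant, which drops the equal-variance assumption.

```python
import numpy as np
from scipy.stats import ttest_ind

def pooled_t(a, b):
    """Two-sample t statistic with a pooled variance estimate."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n1, n2 = a.size, b.size
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return (a.mean() - b.mean()) / np.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))

# Hypothetical tip amounts for two groups.
group_a = [2.5, 3.0, 3.5, 4.0, 2.0, 3.2]
group_b = [2.0, 2.8, 3.1, 2.4, 2.9]

t_manual = pooled_t(group_a, group_b)
t_scipy, p_scipy = ttest_ind(group_a, group_b)                   # equal_var=True by default
t_welch, p_welch = ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test

print(f"manual t = {t_manual:.4f}, scipy t = {t_scipy:.4f}, Welch t = {t_welch:.4f}")
```

When the two groups have clearly different spreads or sizes, Welch's version is usually the safer default.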

2.3 Chi-Square Test

Used to test independence between two categorical variables.

Formula:

\chi^2 = \sum \frac{(O - E)^2}{E}

Where:

• O = Observed frequency
• E = Expected frequency

Problem 1: Are gender and smoking status independent?

from scipy.stats import chi2_contingency

table = pd.crosstab(df['sex'], df['smoker'])

chi2, p, dof, expected = chi2_contingency(table)
print(f"Chi2: {chi2}, P-value: {p}")

Problem 2: Is day of week related to smoking status?

table = pd.crosstab(df['day'], df['smoker'])
chi2, p, dof, expected = chi2_contingency(table)
print(f"Chi2: {chi2}, P-value: {p}")
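
The expected frequencies and the statistic can be worked out by hand for a small table. The counts below are hypothetical; under independence each expected count is (row total × column total) / grand total.

```python
import numpy as np

# Hypothetical 2x2 contingency table of observed counts.
observed = np.array([[10, 20],
                     [20, 10]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
total = observed.sum()

# Expected counts under independence.
expected = row_totals @ col_totals / total

# Chi-square statistic: sum of (O - E)^2 / E over all cells.
chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(expected)
print(f"chi2 = {chi2_stat:.4f}, dof = {dof}")
```

Note that `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, so its statistic for this table would be smaller than the uncorrected value computed here; passing `correction=False` should reproduce the hand calculation.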

2.4 ANOVA (Analysis of Variance)

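Used to compare means across three or more groups. With exactly two groups it still applies and is equivalent to the equal-variance two-sample t-test (F = t²).

Formula:

F = \frac{\text{between-group variance}}{\text{within-group variance}}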

Problem 1: Is average total bill different across days?

from scipy.stats import f_oneway

groups = [group['total_bill'] for name, group in df.groupby('day')]

f_stat, p_val = f_oneway(*groups)
print(f"F-statistic: {f_stat}, P-value: {p_val}")

Problem 2: Do tips vary by time of day (Lunch vs Dinner)?

groups = [group['tip'] for name, group in df.groupby('time')]
f_stat, p_val = f_oneway(*groups)
print(f"F-statistic: {f_stat}, P-value: {p_val}")
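
The F ratio can be computed directly from its definition. This is a minimal sketch on hypothetical groups (the helper name `one_way_f` is an assumption); the between-group and within-group variances are the corresponding sums of squares divided by their degrees of freedom.

```python
import numpy as np

def one_way_f(groups):
    """One-way ANOVA F statistic: between-group MS / within-group MS."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k = len(groups)          # number of groups
    n = all_values.size      # total number of observations

    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical groups with means 2, 3 and 6.
f_manual = one_way_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f"F = {f_manual:.2f}")   # prints F = 13.00
```

A large F only says that at least one group mean differs; it does not say which one, so a post-hoc comparison is typically needed afterwards.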

PART 3: Identifying Patterns and Insights

Techniques:

• GroupBy Analysis: Summarize categories
• Correlation: Relationships between numeric variables
• Value Counts: Categorical distributions
• Outliers: Boxplots or z-scores

Examples:
# Correlation between total_bill and tip
print(df['total_bill'].corr(df['tip']))

# Average tip per day

print(df.groupby('day')['tip'].mean())

# Total bill by time

sns.boxplot(x='time', y='total_bill', data=df)
plt.title("Total Bill by Time")
plt.show()
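
The outlier techniques listed above can be sketched numerically. The values below are hypothetical; the 1.5 × IQR fences are the same rule a boxplot uses to draw its whiskers, and the z-score threshold of 3 is a common convention rather than a fixed rule.

```python
import numpy as np

# Hypothetical measurements with one obviously extreme value.
values = np.array([10, 12, 11, 13, 12, 11, 100], dtype=float)

# IQR rule: flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]
print("IQR outliers:", iqr_outliers.tolist())

# z-scores: distance from the mean in units of standard deviation.
z_scores = (values - values.mean()) / values.std()
print("max |z|:", round(float(np.abs(z_scores).max()), 3))
```

In this tiny sample the extreme value inflates the standard deviation so much that its z-score stays below 3, while the IQR rule still flags it. That is one reason the quartile-based rule is often preferred for small or skewed samples.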

Conclusion:
EDA is not optional; it is essential. It helps:

• Detect errors
• Understand relationships
• Validate assumptions
• Make better decisions about modeling

Introduction to Time Series Analysis

What is a Time Series?
A Time Series is a sequence of data points recorded or observed at successive points in time,
usually at uniform intervals. Examples include:

• Daily stock prices
• Hourly temperature readings
• Monthly sales figures
• Annual GDP growth rates

Formally, a time series is a set of observations \{x_t\} indexed in time order, t = 1, 2, 3, \dots, N.

Why Analyze Time Series?

Time series analysis focuses on understanding the underlying structure and patterns in the
data collected over time to:

• Describe the data
• Model the data generating process
• Forecast future values
• Detect anomalies or changes
• Test hypotheses about temporal dynamics

Key Characteristics of Time Series Data

1. Trend: A long-term increase or decrease in the data. Example: Increasing global
temperature over decades.
2. Seasonality: Regular repeating patterns or cycles within specific time periods (daily,
weekly, yearly). Example: Retail sales peaking every December.
3. Cyclic Patterns: Fluctuations occurring at irregular intervals due to economic/business
cycles.
4. Noise (Irregularity): Random or unexplained variability not accounted for by trend or
seasonality.
5. Stationarity: A stationary time series has statistical properties (mean, variance, autocovariance)
that do not change over time. Many time series models, including ARMA, require or assume stationarity.

Types of Time Series

• Univariate Time Series: Observations of a single variable over time (e.g., daily temperature).
• Multivariate Time Series: Multiple variables observed over time (e.g., temperature, humidity, and wind speed recorded hourly).

Components of Time Series

Mathematically, many time series can be decomposed as:

x_t = T_t + S_t + C_t + \epsilon_t

Where:

• T_t: Trend component
• S_t: Seasonal component
• C_t: Cyclical component
• \epsilon_t: Noise or residual
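
A classical additive decomposition can be sketched with NumPy alone. The series below is synthetic and, as a simplifying assumption, contains only a linear trend and a weekly pattern (no cyclical or noise term), so the recovered components can be compared with the true ones. The trend is estimated with a centered moving average over one full period, and the seasonal component by averaging the detrended values at each position in the period.

```python
import numpy as np

period = 7                      # assumed weekly seasonality on daily data
t = np.arange(70)
true_trend = 0.5 * t
pattern = np.array([3.0, 1.0, -2.0, -4.0, -1.0, 1.0, 2.0])  # sums to zero
x = true_trend + pattern[t % period]

# Trend: centered moving average over one full period.
half = period // 2
trend_est = np.convolve(x, np.ones(period) / period, mode="valid")
t_inner = t[half:len(t) - half]          # time points where the average is defined

# Seasonal: mean of the detrended series at each position in the period.
detrended = x[half:len(x) - half] - trend_est
seasonal_est = np.array(
    [detrended[t_inner % period == k].mean() for k in range(period)]
)

# Residual: what is left after removing trend and seasonality.
residual = detrended - seasonal_est[t_inner % period]

print(np.round(seasonal_est, 3))
print("max |residual|:", float(np.abs(residual).max()))
```

An odd period keeps the centered average simple; for an even period (such as 12 for monthly data) a two-step moving average is normally used instead.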

Goals of Time Series Analysis

• Exploratory Analysis: Identify patterns, visualize trends and seasonality.
• Modeling: Fit models that describe data (e.g., ARIMA, Exponential Smoothing).
• Forecasting: Predict future values based on past data.
• Anomaly Detection: Identify unusual or unexpected events.
• Understanding Relationships: Between variables over time (cross-correlation, causality).

Common Time Series Models

• Moving Average (MA) Models: Model the current value as a linear combination of the current and past error terms. (This differs from moving-average smoothing, which averages neighboring observations.)
• Autoregressive (AR) Models: Model the current value as a function of past values.
• ARMA / ARIMA Models: Combine AR and MA models; ARIMA incorporates differencing to make non-stationary data stationary.
• Seasonal ARIMA (SARIMA): Adds seasonal AR, MA, and differencing terms to handle seasonality.
• Exponential Smoothing: Weighted averages that give more importance to recent observations.
• State Space Models: Describe the series through hidden states, typically estimated with the Kalman filter.
• Machine Learning Approaches: Recurrent neural networks (RNNs), including LSTMs, for complex patterns.
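
As one concrete instance from this list, simple exponential smoothing can be written in a few lines. This is a minimal sketch (the function name and the choice to initialize with the first observation are assumptions): each smoothed value is α times the new observation plus (1 − α) times the previous smoothed value.

```python
def simple_exp_smoothing(values, alpha):
    """Return the exponentially smoothed series for 0 < alpha <= 1."""
    smoothed = [float(values[0])]          # initialize with the first observation
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

print(simple_exp_smoothing([10, 20, 30], alpha=0.5))   # [10.0, 15.0, 22.5]
```

A larger α tracks recent observations more closely; a smaller α gives a smoother, slower-moving series.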

Visualization in Time Series Analysis

Key plots include:

• Line plots (time vs. value)
• Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots to check dependencies.
• Seasonal plots to reveal repeating cycles.
• Decomposition plots to isolate components.
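
The numbers behind an ACF plot are sample autocorrelations, which can be computed directly. This is a minimal sketch using the usual estimator (the function name `sample_acf` and the alternating test series are assumptions for illustration).

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation r_k for k = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    denom = np.sum(d ** 2)
    # r_k = sum_t d_t * d_{t+k} / sum_t d_t^2
    return np.array(
        [np.sum(d[:len(x) - k] * d[k:]) / denom for k in range(max_lag + 1)]
    )

# Hypothetical series that flips sign at every step.
series = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
acf = sample_acf(series, max_lag=2)
print(acf)   # lag 0 = 1.0, lag 1 = -0.875, lag 2 = 0.75
```

A strongly negative lag-1 value followed by a positive lag-2 value is the signature of an alternating pattern; slowly decaying positive values would instead suggest a trend.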

Challenges in Time Series Analysis

• Non-stationarity: Trends and seasonality complicate modeling.
• Missing Data: Gaps can bias results.
• Noise: Distinguishing signal from noise is hard.
• High-frequency data: Can be voluminous and noisy.
• Changing dynamics: Structural breaks or regime changes.

Practical Applications
• Finance: Stock price forecasting, risk management.
• Weather: Forecasting temperature, rainfall.
• Operations: Demand forecasting, inventory management.
• Healthcare: Monitoring patient vitals over time.
• IoT & Sensor Data: Equipment failure prediction.

ARIMA Model (Autoregressive Integrated Moving Average)

Overview:

ARIMA is a classical and powerful model for analyzing and forecasting univariate time series
data. It combines three components:

• AR (Autoregression): the model uses the relationship between an observation and a number of lagged observations.
• I (Integration): differencing of raw observations to make the time series stationary.
• MA (Moving Average): the model uses the dependency between an observation and a residual error from a moving average model applied to lagged observations.

Mathematical Foundation
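Given a time series \{y_t\}, an ARIMA(p, d, q) model is specified by three orders:

• p: Number of autoregressive terms
• d: Number of differences needed to make the series stationary
• q: Number of lagged forecast errors in the prediction equation

Step 1: Differencing
Difference the series d times. For d = 1:

y'_t = y_t - y_{t-1}

In general, y'_t = (1 - B)^d y_t, where B is the backshift operator (B y_t = y_{t-1}).

Step 2: Autoregressive (AR) Model

y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p} + \epsilon_t

Step 3: Moving Average (MA) Model

y'_t = c + \epsilon_t + \theta_1 \epsilon_{t-1} + \dots + \theta_q \epsilon_{t-q}

Step 4: Combining into ARIMA(p, d, q)

y'_t = c + \sum_{i=1}^{p} \phi_i y'_{t-i} + \sum_{j=1}^{q} \theta_j \epsilon_{t-j} + \epsilon_t

Here \epsilon_t is white noise, \phi_i are the AR coefficients, and \theta_j are the MA coefficients.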
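
The differencing and autoregressive steps can be illustrated on synthetic data. This is a sketch under simplifying assumptions: the first series is an exact quadratic (so two differences make it constant), and the second is a noise-free AR(1) sequence with an assumed coefficient of 0.6, so least squares recovers the coefficient exactly.

```python
import numpy as np

# Differencing: a quadratic trend needs d = 2 to become constant.
t = np.arange(10)
y = t ** 2
d1 = np.diff(y, n=1)    # first difference: 2t + 1, still trending
d2 = np.diff(y, n=2)    # second difference: constant
print(d1)
print(d2)

# AR(1) without noise: z_t = phi * z_{t-1}, with an assumed phi = 0.6.
phi_true = 0.6
z = np.empty(20)
z[0] = 1.0
for i in range(1, len(z)):
    z[i] = phi_true * z[i - 1]

# Least-squares estimate of phi from (z_{t-1}, z_t) pairs.
phi_hat = np.dot(z[1:], z[:-1]) / np.dot(z[:-1], z[:-1])
print(f"phi_hat = {phi_hat:.3f}")
```

With real data the noise term makes the estimate approximate, and a library routine such as the ARIMA class used elsewhere in these notes handles the AR and MA parts jointly.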

Model Identification
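• Plot the data and its ACF/PACF to choose candidate values of p, d, and q: a sharp cutoff in the PACF suggests p, a sharp cutoff in the ACF suggests q, and d is the number of differences needed before the series looks stationary.
• Use information criteria such as AIC/BIC to compare candidate models and select the best parameters.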

Strengths and Weaknesses

Strengths | Weaknesses
Can model a wide range of series | Requires stationarity
Well-studied with strong theory | Needs manual parameter tuning
Works well with linear data | Limited for complex seasonal patterns

Python Example
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Load time series data (the file name and the 'value' column are placeholders)
df = pd.read_csv('your_timeseries.csv', index_col=0, parse_dates=True)

# Fit ARIMA(1,1,1)
model = ARIMA(df['value'], order=(1,1,1))
fit = model.fit()
print(fit.summary())

# Forecast next 5 points

forecast = fit.forecast(steps=5)
print(forecast)

Prophet Model
Overview:

Prophet is a modern time series forecasting model developed by Facebook, designed for
business time series that have multiple seasonalities and holiday effects. It is intuitive, requires
minimal tuning, and is robust to missing data and outliers.

Key Idea:

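Prophet decomposes time series into:

y(t) = g(t) + s(t) + h(t) + \epsilon_t

Mathematical Details
1. Trend g(t): non-periodic changes, modeled as piecewise-linear growth or as saturating logistic growth with changepoints. In the simplest logistic form, g(t) = \frac{C}{1 + \exp(-k (t - m))}, with carrying capacity C, growth rate k, and offset m.

2. Seasonality s(t): periodic changes, modeled with a Fourier series of period P:
s(t) = \sum_{n=1}^{N} \left( a_n \cos\frac{2\pi n t}{P} + b_n \sin\frac{2\pi n t}{P} \right)

3. Holiday Effects h(t): h(t) = Z(t) \kappa, where Z(t) is a vector of indicators marking whether t falls on each holiday and \kappa holds the corresponding effects.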

Model Fitting

• Prophet uses Bayesian modeling and maximum a posteriori estimation.
• It automatically detects change points and fits parameters using Stan (a probabilistic programming language).
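
Prophet's published description models seasonality with Fourier terms. The sketch below builds such a feature matrix with NumPy to show what the fitting step works with; it is an illustration of the idea, not Prophet's internal code, and the function name `fourier_features` is an assumption.

```python
import numpy as np

def fourier_features(t, period, order):
    """Columns cos(2*pi*n*t/P), sin(2*pi*n*t/P) for n = 1..order."""
    t = np.asarray(t, dtype=float)
    cols = []
    for n in range(1, order + 1):
        angle = 2.0 * np.pi * n * t / period
        cols.append(np.cos(angle))
        cols.append(np.sin(angle))
    return np.column_stack(cols)

# Two weeks of daily time stamps with an assumed weekly period.
days = np.arange(14)
X = fourier_features(days, period=7, order=3)
print(X.shape)   # (14, 6)
```

Each row repeats after one full period, so a linear combination of these columns can represent any smooth weekly pattern; a higher order allows sharper seasonal shapes at the cost of more parameters.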

Strengths and Weaknesses

Strengths | Weaknesses
Handles multiple seasonalities | Less flexible for non-additive effects
Incorporates holidays and events | May overfit if many changepoints
Robust to missing data & outliers | Model interpretability can be complex
Easy to use with minimal tuning | Focused on business time series

Python Example
from prophet import Prophet
import pandas as pd

# Data must have columns 'ds' (datetime) and 'y' (value)

df = pd.read_csv('your_timeseries.csv')
df['ds'] = pd.to_datetime(df['date_column'])
df.rename(columns={'value_column':'y'}, inplace=True)

model = Prophet(weekly_seasonality=True, yearly_seasonality=True)

model.fit(df)

# Make future dataframe (e.g., next 30 days)

future = model.make_future_dataframe(periods=30)
forecast = model.predict(future)

model.plot(forecast)
model.plot_components(forecast)

Summary Table

Aspect | ARIMA | Prophet
Type | Classical linear model | Additive model with trend + seasonality + holidays
Stationarity | Requires a stationary series or differencing | No stationarity requirement
Seasonality | Manual or seasonal ARIMA (SARIMA) | Built-in multiple seasonalities
Holidays | No built-in holiday handling | Explicit holiday/event modeling
Handling missing data | Poor | Robust
Ease of use | More technical & requires tuning | User-friendly, automatic changepoints
Forecast Output | Point forecast by default; prediction intervals also available | Forecast with uncertainty intervals

How to Evaluate Time Series Models

1. Performance Metrics
These metrics quantify forecast accuracy by comparing predicted values \hat{y}_t with
actual values y_t.

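1.1. Mean Absolute Error (MAE)

MAE = \frac{1}{n} \sum_{t=1}^{n} |y_t - \hat{y}_t|

1.2. Mean Squared Error (MSE)

MSE = \frac{1}{n} \sum_{t=1}^{n} (y_t - \hat{y}_t)^2

1.3. Root Mean Squared Error (RMSE)

RMSE = \sqrt{MSE}

1.4. Mean Absolute Percentage Error (MAPE)

MAPE = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|   (undefined when any y_t = 0)

1.5. Symmetric Mean Absolute Percentage Error (sMAPE)

sMAPE = \frac{100}{n} \sum_{t=1}^{n} \frac{2 |y_t - \hat{y}_t|}{|y_t| + |\hat{y}_t|}   (one common definition)

1.6. Mean Directional Accuracy (MDA)

MDA = \frac{1}{n-1} \sum_{t=2}^{n} \mathbf{1}\left[ \operatorname{sign}(y_t - y_{t-1}) = \operatorname{sign}(\hat{y}_t - y_{t-1}) \right]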

Example in Python
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

y_true = np.array([100, 150, 200, 250])

y_pred = np.array([110, 145, 195, 240])

mae = mean_absolute_error(y_true, y_pred)

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}, MAPE: {mape:.2f}%")
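
The example above covers MAE, RMSE, and MAPE. The remaining two metrics can be sketched as follows; the function names are assumptions, sMAPE uses one common definition (others exist), and MDA compares the predicted direction of change with the actual one relative to the previous actual value.

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric MAPE in percent; assumes no pair is zero in both arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = np.abs(y_true) + np.abs(y_pred)
    return float(np.mean(2.0 * np.abs(y_true - y_pred) / denom) * 100)

def mean_directional_accuracy(y_true, y_pred):
    """Share of steps where the forecast moves in the same direction as the actual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    actual_dir = np.sign(y_true[1:] - y_true[:-1])
    pred_dir = np.sign(y_pred[1:] - y_true[:-1])
    return float(np.mean(actual_dir == pred_dir))

y_true = np.array([100, 150, 200, 250])
y_pred = np.array([110, 145, 195, 240])

smape_val = smape(y_true, y_pred)
mda_val = mean_directional_accuracy(y_true, y_pred)
print(f"sMAPE: {smape_val:.2f}%, MDA: {mda_val:.2f}")
```

Here every forecast moves upward along with the actual series, so the directional accuracy is perfect even though the point forecasts are off by several units.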

2. Train-Test Splitting in Time Series

Time Series ≠ Random Split.

Since time series data is sequential, standard random train-test splits violate temporal order
and let the model see the future during training. Use a chronological split instead.

2.1. Hold-out Split (Single Forecast Horizon)

Train: 2010–2018
Test: 2019

• Simple, but does not test performance across time.

2.2. Rolling Forecast Origin (Walk-Forward Validation)

Fold 1: Train → 2010–2017 | Test → 2018
Fold 2: Train → 2010–2018 | Test → 2019
Fold 3: Train → 2010–2019 | Test → 2020

• Models are updated each time
• Mimics real-world forecasting

2.3. Sliding Window Validation

Fold 1: Train → 2010–2015 | Test → 2016
Fold 2: Train → 2011–2016 | Test → 2017
Fold 3: Train → 2012–2017 | Test → 2018

• Fixed-length training window, shifts forward
• Useful for non-stationary data
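
Both validation schemes can be written as small index generators. This is a sketch in plain Python (the function names are assumptions); the years are used only as labels so the folds can be compared with the ones listed above.

```python
def expanding_window_splits(n_samples, initial_train, test_size=1):
    """Walk-forward folds: the training set grows, the test block follows it."""
    start = initial_train
    while start + test_size <= n_samples:
        yield list(range(0, start)), list(range(start, start + test_size))
        start += test_size

def sliding_window_splits(n_samples, train_size, test_size=1):
    """Fixed-length training window that shifts forward by one test block."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        yield train, test
        start += test_size

years_a = list(range(2010, 2021))   # 2010..2020
for train, test in expanding_window_splits(len(years_a), initial_train=8):
    print("expanding:", years_a[train[0]], "-", years_a[train[-1]],
          "| test", [years_a[i] for i in test])

years_b = list(range(2010, 2019))   # 2010..2018
for train, test in sliding_window_splits(len(years_b), train_size=6):
    print("sliding:  ", years_b[train[0]], "-", years_b[train[-1]],
          "| test", [years_b[i] for i in test])
```

In both schemes every test index lies strictly after all training indices, which is the property a random split would break.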

Time Series Split in Python

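import numpy as np
from sklearn.model_selection import TimeSeriesSplit

data = np.arange(12)  # placeholder: 12 observations in time order

tscv = TimeSeriesSplit(n_splits=3)

# Each fold trains on an expanding prefix and tests on the block that follows it
for train_index, test_index in tscv.split(data):
    print("Train:", train_index, "Test:", test_index)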

3. Visual Evaluation Techniques

Forecast vs. Actual Plot
import matplotlib.pyplot as plt

plt.plot(y_true, label='Actual')
plt.plot(y_pred, label='Forecast')
plt.legend()
plt.title("Actual vs Forecast")
plt.show()

• Good for spotting bias, lag, or poor seasonality modeling.

Residual Plots

• Residuals should be white noise (mean 0, constant variance, no autocorrelation).

residuals = y_true - y_pred

plt.plot(residuals)
plt.title("Residuals")
plt.show()

• You can also check ACF/PACF of residuals.

4. Backtesting
Backtesting refers to simulating model performance in historical settings using only past data
available at the time.

• Helps understand how the model would perform in production.
• Can be performed using rolling-origin evaluation with time series cross-validation.
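
A backtest loop can be sketched with the simplest possible model. The naive forecast below (repeat the last observed value) is a stand-in chosen for illustration, not a recommended model; the point is that each forecast only uses data available before the time being predicted.

```python
import numpy as np

def walk_forward_naive_mae(series, initial_train):
    """Backtest a naive last-value forecast with a rolling origin; return MAE."""
    series = np.asarray(series, dtype=float)
    errors = []
    for t in range(initial_train, len(series)):
        history = series[:t]        # only data available before time t
        forecast = history[-1]      # naive forecast: repeat the last value
        errors.append(abs(series[t] - forecast))
    return float(np.mean(errors))

# Hypothetical series that rises by 1 at every step.
mae_bt = walk_forward_naive_mae([1, 2, 3, 4, 5, 6], initial_train=3)
print(mae_bt)   # 1.0
```

Replacing the naive forecast with a refit of the actual model at each step gives a full rolling-origin backtest, and the naive score then serves as a baseline the model should beat.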

Summary
Metric | Description | Best for
MAE | Average magnitude of error | General-purpose, interpretable
RMSE | Penalizes large errors | Applications where large errors are costly
MAPE | Percent error | Comparing series on different scales
sMAPE | Symmetric percentage error | Safer than MAPE when actual values are near zero
MDA | Directional correctness | Trend prediction
