
K.S. SCHOOL OF ENGINEERING AND MANAGEMENT, BENGALURU

DEPARTMENT OF COMPUTER SCIENCE AND BUSINESS SYSTEMS

COMPUTATIONAL STATISTICS LABORATORY (BCBL504)

Prepared by:

Mrs. Bhavya Javagal, Assistant Professor, Dept. of CS & BS, KSSEM
Mr. Prashant Koparde, Assistant Professor, Dept. of CS & BS, KSSEM

LAB MANUAL

ACADEMIC YEAR 2024 – 2025


Institution Vision & Mission

VISION: “To impart quality education in engineering and management to meet technological,
business and societal needs through holistic education and research”

MISSION:
K.S. School of Engineering and Management shall,
 Establish state-of-the-art infrastructure to facilitate effective dissemination of technical and managerial knowledge.
 Provide a comprehensive educational experience through a combination of curricular and experiential learning, strengthened by industry-institute interaction.
 Pursue socially relevant research and disseminate knowledge.
 Inculcate leadership skills and foster entrepreneurial spirit among students.

Department Vision & Mission

VISION: “To provide competent learning ecosystem to develop the understanding of technology and business to produce innovative, principled and insightful leaders to meet the societal demands.”

MISSION:
 To deliver high-quality education in the fields of technology and business
through effective teaching-learning practices and a conducive learning
environment.
 To create a centre of excellence through collaborations with industries and various entities, addressing the evolving demands of society.
 To foster an environment that promotes innovation, multidisciplinary research,
skill enhancement and entrepreneurship.
 To uphold and advocate for elevated standards of professional ethics and transparency.
K.S. SCHOOL OF ENGINEERING AND MANAGEMENT BENGALURU - 560109
DEPARTMENT OF COMPUTER SCIENCE AND BUSINESS SYSTEMS

Course: Computational Statistics Laboratory

Type: Core Course Code: BCBL504 Academic Year: 2024-2025

No. of Hours per Week
  Theory (Lecture Class): 3
  Practical/Field Work/Allied Activities: 2
  Total/Week: 2
  Total Number of Lab Contact Hours: 28

Marks
  Internal Assessment: 50
  Examination: 50
  Total: 100
  Credits: 1

Aim/Objective of the Course:


1. To understand the mean, variance, regression models and error term for use in Multivariate data
analysis.
2. To understand the correlation between the data for decision making.
3. To understand the various tests used for the data analysis.
4. To explore various techniques for data analysis and visualize the results.
Course Outcomes & CO-PO Mapping

Subject: Computational Statistics Laboratory (BCBL504)

COURSE OUTCOMES
CO No.       On completion of this course, students will be able to:                                RBT Level / Cognitive Level
BCBL504.1    Design the experiment for the given problem using statistical methods.                 Applying (K3)
BCBL504.2    Develop the solution for the given real-world problem using statistical techniques.    Applying (K3)
BCBL504.3    Analyze the results and produce substantial written documentation.                     Applying (K3)

CO-PO-PSO MAPPING

CO No.       PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2  PSO3
BCBL504.1     3    2    1    -    -    -    2    -    -    -     -     2     2     3     3
BCBL504.2     3    2    2    -    -    -    3    -    -    -     -     2     2     3     3
BCBL504.3     2    3    3    2    3    -    -    -    3    -     -     3     3     2     2
COMPUTATIONAL STATISTICS LABORATORY (BCBL504)

COMPUTATIONAL STATISTICS LAB Semester V


Course Code BCBL504 CIE Marks 50
Teaching Hours/Week (L:T:P: S) 0:0:2:0 SEE Marks 50
Credits 01 Exam Hours 100
Examination type (SEE) Practical
Course objectives:
● To understand the mean, variance, regression models and error term for use in Multivariate
data analysis.
● To understand the correlation between the data for decision making.
● To understand the various tests used for the data analysis.
● To explore various techniques for data analysis and visualize the results.
Sl No    Experiments (Implementation using Python/R Programming)
1 Program on data wrangling: Combining and merging datasets, Reshaping and Pivoting
2 Program on Data Transformation: String Manipulation, Regular Expressions
3 Program on Time series: GroupBy Mechanics to display in data vector, multivariate
time series and forecasting formats
4 Program to measure central tendency and measures of dispersion: Mean, Median,
Mode, Standard Deviation, Variance, Mean deviation and Quartile deviation for a frequency
distribution/data.
5 Program to perform cross validation for a given dataset to measure Root Mean Squared
Error (RMSE), Mean Absolute Error (MAE) and R2 Error using Validation Set, Leave
One Out Cross-Validation(LOOCV) and K-fold Cross-Validation approaches
6 Program to display Normal, Binomial, Poisson, Bernoulli distributions for a given
frequency distribution and analyze the results.
7 Program to implement one sample, two sample and paired sample t-tests for a sample data
and analyse the results.
8 Program to implement One-way and Two-way ANOVA tests and analyze the results
9 Program to implement correlation, rank correlation and regression and plot x-y plot and heat
maps of correlation matrices.
10 Program to implement PCA for Wisconsin dataset, visualize and analyze the results.
11 Program to implement the working of linear discriminant analysis using iris dataset and
visualize the results.
12 Program to Implement multiple linear regression using iris dataset, visualize and analyze the
results.
Course outcomes (Course Skill Set):
At the end of the course the student will be able to:
1. Design the experiment for the given problem using statistical methods.
2. Develop the solution for the given real world problem using statistical techniques.
3. Analyze the results and produce substantial written documentation.


Assessment Details (both CIE and SEE)


The weightage of Continuous Internal Evaluation (CIE) is 50% and for Semester End Exam
(SEE) is 50%. The minimum passing mark for the CIE is 40% of the maximum marks (20 marks
out of 50) and for the SEE minimum passing mark is 35% of the maximum marks (18 out of 50
marks). A student shall be deemed to have satisfied the academic requirements and earned the
credits allotted to each subject/ course if the student secures a minimum of 40% (40 marks out of
100) in the sum total of the CIE (Continuous Internal Evaluation) and SEE (Semester End
Examination) taken together.

Continuous Internal Evaluation (CIE):


CIE marks for the practical course are 50 Marks.
The split-up of CIE marks for record/journal and test is in the ratio 60:40.
 Each experiment is to be evaluated for conduction with an observation sheet and record
write-up. Rubrics for the evaluation of the journal/write-up for hardware/software
experiments are designed by the faculty who is handling the laboratory session and are
made known to students at the beginning of the practical session.
 Record should contain all the specified experiments in the syllabus and each experiment
write-up will be evaluated for 10 marks.
 Total marks scored by the students are scaled down to 30 marks (60% of maximum
marks).
 Weightage to be given for neatness and submission of record/write-up on time.
 Department shall conduct a test of 100 marks after the completion of all the experiments
listed in the syllabus.
 In the test, the write-up, conduct of the experiment, acceptable result, and procedural knowledge together carry a weightage of 60%, and the remaining 40% is for the viva-voce.
 Suitable rubrics can be designed to evaluate each student’s performance and learning ability.
 The marks scored shall be scaled down to 20 marks (40% of the maximum marks).
The Sum of scaled-down marks scored in the report write-up/journal and marks of a test is
the total CIE marks scored by the student.
Semester End Evaluation (SEE):
 SEE marks for the practical course are 50 Marks.
 SEE shall be conducted jointly by the two examiners of the same institute, examiners are
appointed by the Head of the Institute.
 The examination schedule and names of examiners are informed to the university before
the conduction of the examination. These practical examinations are to be conducted
between the schedule mentioned in the academic calendar of the University.
 All laboratory experiments are to be included for practical examination.
 (Rubrics) Breakup of marks and the instructions printed on the cover page of the answer
script to be strictly adhered to by the examiners. OR based on the course requirement
evaluation rubrics shall be decided jointly by examiners.
 Students can pick one question (experiment) from the questions lot prepared by the
examiners jointly.
 Evaluation of test write-up/ conduction procedure and result/viva will be conducted
jointly by examiners.
 General rubrics suggested for SEE: write-up 20%, conduction procedure and result 60%, and viva-voce 20% of the maximum marks. SEE for the practical shall be evaluated for 100 marks and the marks scored shall be scaled down to 50 marks (however, based on the course type, rubrics shall be decided jointly by the examiners).


 Change of experiment is allowed only once, and 15% of the marks allotted to the procedure part are to be made zero.
The minimum duration of SEE is 02 hours.

Suggested Learning Resources:


● Chris Chatfield, The Analysis of Time Series: An Introduction, 6th Edition, Chapman and Hall/CRC, 2003
● Garrett Grolemund, Hands-On Programming with R, 1st Edition, O’Reilly, 2014
● J. D. Jobson, Applied Multivariate Data Analysis, Vol. I and II, Springer-Verlag New York Inc., 2012
● T. W. Anderson, An Introduction to Multivariate Statistical Analysis, 3rd Edition, Wiley Publications, 2009
● Mark Lutz, Programming Python, 4th Edition, O’Reilly Media, 2012
● https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data
● https://www.kaggle.com/datasets/arshid/iris-flower-dataset
● https://www.youtube.com/watch?v=VSRUm3HRoiU
● https://www.youtube.com/watch?v=DkwvAn9AAU0
● https://www.youtube.com/playlist?list=PLoROMvodv4rPP6braWoRt5UCXYZ71GZIQ


Computational Statistics Laboratory (BCBL504)


INDEX
Sl No   Programs List
1.   Program on data wrangling: Combining and merging datasets, Reshaping and Pivoting.
2.   Program on Data Transformation: String Manipulation, Regular Expressions.
3.   Program on Time series: GroupBy Mechanics to display in data vector, multivariate time series and forecasting formats.
4.   Program to measure central tendency and measures of dispersion: Mean, Median, Mode, Standard Deviation, Variance, Mean deviation and Quartile deviation for a frequency distribution/data.
5.   Program to perform cross validation for a given dataset to measure Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) and R2 Error using Validation Set, Leave One Out Cross-Validation (LOOCV) and K-fold Cross-Validation approaches.
6.   Program to display Normal, Binomial, Poisson, Bernoulli distributions for a given frequency distribution and analyze the results.
7.   Program to implement one sample, two sample and paired sample t-tests for sample data and analyse the results.
8.   Program to implement One-way and Two-way ANOVA tests and analyze the results.
9.   Program to implement correlation, rank correlation and regression and plot x-y plots and heat maps of correlation matrices.
10.  Program to implement PCA for Wisconsin dataset, visualize and analyze the results.
11.  Program to implement the working of linear discriminant analysis using iris dataset and visualize the results.
12.  Program to implement multiple linear regression using iris dataset, visualize and analyze the results.
13.  VIVA QUESTIONS


1. Program on data wrangling: Combining and merging data sets, reshaping and
pivoting

pandas provides various methods for combining, merging, and comparing Series and DataFrames.

1. Merging DataFrames
import pandas as pd
# Creating two DataFrames
df1 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']
})
df2 = pd.DataFrame({
'ID': [1, 2, 4],
'Age': [25, 30, 22]
})
# Merging DataFrames on 'ID'
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)

Output - Merging DataFrames

ID Name Age
0 1 Alice 25
1 2 Bob 30

2. Joining DataFrames
# Setting 'ID' as the index for df1
df1.set_index('ID', inplace=True)


# Joining df1 and df2


joined_df = df1.join(df2.set_index('ID'), how='inner')
print(joined_df)

Output - Joining DataFrames

Name Age
ID
1 Alice 25
2 Bob 30

3. Concatenating DataFrames
# Creating two DataFrames
df3 = pd.DataFrame({
'ID': [5, 6],
'Name': ['David', 'Eva']
})

# Concatenating DataFrames vertically


concatenated_df = pd.concat([df1.reset_index(), df3], ignore_index=True)
print(concatenated_df)

Output - Concatenating DataFrames

ID Name
0 1 Alice
1 2 Bob
2 3 Charlie
3 5 David
4 6 Eva


4. Comparing DataFrames
# Creating another DataFrame for comparison
df4 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']
})

# Comparing DataFrames
comparison = df1.equals(df4.set_index('ID'))
print(f"Are the DataFrames equal? {comparison}")

Output - Comparing DataFrames

Are the DataFrames equal? True

5. Reshaping Data
Reshaping data typically involves changing the layout of a DataFrame. This can be done
using the melt() function, which transforms a wide format DataFrame into a long format.
Example of Reshaping with melt()
import pandas as pd
# Sample DataFrame

data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Math': [85, 90, 95],
'Science': [80, 85, 90]
}
df = pd.DataFrame(data)
# Reshaping the DataFrame
melted_df = pd.melt(df, id_vars=['Name'], value_vars=['Math', 'Science'],
var_name='Subject', value_name='Score')


print(melted_df)

Output - Reshaping Data

Name Subject Score


0 Alice Math 85
1 Bob Math 90
2 Charlie Math 95
3 Alice Science 80
4 Bob Science 85
5 Charlie Science 90

6. Pivoting Data
Using pivot()
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Alice', 'Bob'],
'Subject': ['Math', 'Math', 'Science', 'Science'],
'Score': [85, 90, 80, 85]
}
df = pd.DataFrame(data)
# Pivoting the DataFrame
pivoted_df = df.pivot(index='Name', columns='Subject', values='Score')
print(pivoted_df)

Output - Pivoting Data

Subject Math Science


Name
Alice 85 80
Bob 90 85


Using pivot_table()
The pivot_table() function is more versatile than pivot(), as it allows for aggregation of data.
This is particularly useful when you have duplicate entries for the index/column pairs. You
can specify an aggregation function to summarize the data.

Example:
Consider sales data with multiple entries for the same product on the same date:
# Sample data with duplicates
data = {
'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-01'],
'Product': ['A', 'B', 'A', 'B', 'A'],
'Sales': [100, 150, 200, 250, 300]
}

df = pd.DataFrame(data)

# Using pivot_table
pivot_table_df = df.pivot_table(index='Date', columns='Product', values='Sales',
aggfunc='sum')
print(pivot_table_df)
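Since pivot_table aggregates the duplicate Date/Product pairs with sum, the 2023-01-01 entry for product A combines 100 + 300 = 400, so the printed table for this sample data should look like:

Product       A    B
Date
2023-01-01  400  150
2023-01-02  200  250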


2. Program on Data transformations: String manipulation and Regular expression


String Manipulation
Python provides a rich set of built-in methods for string manipulation. Here are some
common operations:
1. Changing Case: You can easily change the case of strings using methods like upper(),
lower(), and title().

text = "hello world"


print(text.upper()) # Output: HELLO WORLD
print(text.title()) # Output: Hello World

Output - String Manipulation

HELLO WORLD
Hello World

2. Trimming Whitespace: The strip(), lstrip(), and rstrip() methods are useful for
removing unwanted whitespace.

text = " Hello World "


print(text.strip()) # Output: Hello World

Output - Trimming Whitespace

Hello World

3. Replacing Substrings: The replace() method allows you to replace occurrences of a


substring with another substring.

text = "I love Python"


new_text = text.replace("Python", "programming")
print(new_text) # Output: I love programming


Output - Replacing Substrings

I love programming

4. Splitting and Joining Strings: You can split a string into a list of substrings using
split(), and join a list of strings into a single string using join().

text = "apple,banana,cherry"
fruits = text.split(",")
print(fruits) # Output: ['apple', 'banana', 'cherry']

new_text = " and ".join(fruits)


print(new_text) # Output: apple and banana and cherry
Output - Splitting and Joining Strings

['apple', 'banana', 'cherry']

apple and banana and cherry

Regular Expressions
Regular expressions are a powerful tool for searching and manipulating strings based on
patterns. The re module in Python provides functions to work with regex.

1. Searching for Patterns: The search() function checks if a pattern exists in a string.

import re

text = "My email is [email protected]"


match = re.search(r'\S+@\S+', text)
if match:
print("Found email:", match.group())


Output - Searching for Patterns

Found email: john.doe@example.com

2. Finding All Matches: The findall() function returns all occurrences of a pattern in a
string.
text = "Contact us at [email protected] or [email protected]"
emails = re.findall(r'\S+@\S+', text)
print("Found emails:", emails)

Output - Finding All Matches

Found emails: ['support@example.com', 'sales@example.com']

3. Replacing Patterns: The sub() function allows you to replace occurrences of a pattern
with a specified string.
text = "My phone number is 123-456-7890"
new_text = re.sub(r'\d{3}-\d{3}-\d{4}', 'XXX-XXX-XXXX', text)
print(new_text)
Output - Replacing Patterns

My phone number is XXX-XXX-XXXX


4. Validating Input: Regular expressions can also be used to validate input formats, such
as checking if a string is a valid email address

def is_valid_email(email):
pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
return re.match(pattern, email) is not None

print(is_valid_email("[email protected]")) # Output: True


print(is_valid_email("invalid-email")) # Output: False


Output - Validating Input

True
False


3. Program on Time series: GroupBy Mechanics to display in data vector,


multivariate time series and forecasting formats

Time series analysis is a powerful technique used in various fields such as finance,
economics, and environmental science. In Python, the pandas library provides robust tools for
handling time series data, including the groupby functionality, which allows for efficient data
aggregation and transformation.

GroupBy Mechanics in Time Series


The groupby method in pandas is essential for aggregating data based on specific time
intervals. For instance, you can group data by day, month, or year, which is particularly useful
for summarizing trends over time.

Here’s a simple example to illustrate how to use groupby with a time series dataset:

import pandas as pd
import numpy as np

# Create a sample time series DataFrame


date_rng = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
data = np.random.randint(0, 100, size=(len(date_rng), 2))
df = pd.DataFrame(data, columns=['A', 'B'], index=date_rng)

# Display the DataFrame


print("Original DataFrame:")
print(df)

Output - GroupBy Mechanics in Time Series

Original DataFrame:
A B
2023-01-01 83 7


2023-01-02 72 61
2023-01-03 13 5
2023-01-04 0 8
2023-01-05 79 79
2023-01-06 53 11
2023-01-07 4 39
2023-01-08 92 45
2023-01-09 26 74
2023-01-10 52 49

# Group by day and calculate the mean


daily_mean = df.resample('D').mean()
print("\nDaily Mean:")
print(daily_mean)

In this example, we create a DataFrame with random data indexed by dates and then use the resample method to group the data by day and calculate the mean for each day. Because the sample data already contains one observation per day, the daily mean equals the original values; resampling becomes useful when aggregating to coarser intervals such as weeks or months.
Output -

Daily Mean:
A B
2023-01-01 83.0 7.0
2023-01-02 72.0 61.0
2023-01-03 13.0 5.0
2023-01-04 0.0 8.0
2023-01-05 79.0 79.0
2023-01-06 53.0 11.0
2023-01-07 4.0 39.0
2023-01-08 92.0 45.0
2023-01-09 26.0 74.0
2023-01-10 52.0 49.0
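The coarser groupings mentioned earlier (month or year) can be produced with groupby itself rather than resample. A minimal sketch, not part of the original program, reusing the df created above:

# Group by calendar month; all sample dates fall in January 2023, so a single row results
monthly_mean = df.groupby(df.index.to_period('M')).mean()
print(monthly_mean)

# Group by year in the same way
yearly_mean = df.groupby(df.index.year).mean()
print(yearly_mean)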


Multivariate Time Series

When dealing with multivariate time series, you may have multiple variables that you want to
analyze simultaneously. The same groupby mechanics can be applied, but you can also
visualize the relationships between these variables.

Here’s how you can handle multivariate time series data:

# Create a multivariate time series DataFrame


date_rng = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
data = {
'Temperature': np.random.randint(20, 30, size=(len(date_rng))),
'Humidity': np.random.randint(30, 70, size=(len(date_rng)))
}
df_multivariate = pd.DataFrame(data, index=date_rng)

# Display the multivariate DataFrame


print("\nMultivariate DataFrame:")
print(df_multivariate)

Output - Multivariate Time Series

Multivariate DataFrame:
Temperature Humidity
2023-01-01 23 53
2023-01-02 23 31
2023-01-03 29 36
2023-01-04 22 60
2023-01-05 25 46
2023-01-06 22 56


2023-01-07 23 65
2023-01-08 25 39
2023-01-09 27 43
2023-01-10 22 36

# Group by day and calculate the mean for each variable


daily_mean_multivariate = df_multivariate.resample('D').mean()
print("\nDaily Mean for Multivariate DataFrame:")
print(daily_mean_multivariate)

In this code snippet, we create a multivariate DataFrame with temperature and humidity data.
We then apply the same resample method to calculate daily means for both variables.

Output -

Daily Mean for Multivariate DataFrame:


Temperature Humidity
2023-01-01 23.0 53.0
2023-01-02 23.0 31.0
2023-01-03 29.0 36.0
2023-01-04 22.0 60.0
2023-01-05 25.0 46.0
2023-01-06 22.0 56.0
2023-01-07 23.0 65.0
2023-01-08 25.0 39.0
2023-01-09 27.0 43.0
2023-01-10 22.0 36.0

Forecasting Formats
Forecasting in time series can be approached using various models, such as ARIMA,
Exponential Smoothing, or machine learning techniques. The statsmodels library provides
tools for implementing these models.


Here’s a basic example of how to implement an ARIMA model for forecasting:

from statsmodels.tsa.arima.model import ARIMA


import matplotlib.pyplot as plt

# Fit an ARIMA model


model = ARIMA(df['A'], order=(1, 1, 1))
model_fit = model.fit()

# Forecast the next 5 days


forecast = model_fit.forecast(steps=5)
print("\nForecast for the next 5 days:")
print(forecast)

Output - Forecasting Formats

Forecast for the next 5 days:


2023-01-11 45.658360
2023-01-12 46.916272
2023-01-13 46.666756
2023-01-14 46.716249
2023-01-15 46.706432

# Plot the results


plt.figure(figsize=(10, 5))
plt.plot(df['A'], label='Historical Data')
plt.plot(pd.date_range(start=df.index[-1] + pd.Timedelta(days=1), periods=5), forecast,
label='Forecast', color='red')
plt.title('Time Series Forecasting')
plt.xlabel('Date')


plt.ylabel('Values')
plt.legend()
plt.show()

In this example, we fit an ARIMA model to the time series data and forecast the next five
days. The results are then visualized using matplotlib.

Output: a plot titled 'Time Series Forecasting' showing the historical values of column A with the five forecast points appended in red (the forecast Series is printed with Freq: D, Name: predicted_mean, dtype: float64).


4. Program to measure central tendency and measures of dispersion: Mean,


median, mode, standard deviation, variance, mean deviation and quartile
deviation for a frequency distribution/data.

In statistics, understanding the central tendency and measures of dispersion is crucial for
analyzing data. Central tendency provides a summary measure that represents the entire
dataset, while measures of dispersion indicate the spread or variability of the data. Below, we
will explore how to compute these statistics using Python.

Required Libraries
To perform these calculations, we will utilize the numpy and scipy libraries. If you haven't
installed these libraries yet, you can do so using pip:

pip install numpy scipy

Sample Data
Let's assume we have a frequency distribution represented as a list of tuples, where each tuple
contains a value and its corresponding frequency. For example:

data = [(1, 5), (2, 10), (3, 15), (4, 20), (5, 10)]

Mean: The mean is calculated as the sum of each value multiplied by its frequency, divided by the total frequency.

Median: The median is the middle value when the data is sorted. If the number of
observations is even, it is the average of the two middle values.

Mode: The mode is the value that appears most frequently in the dataset.

Calculating Measures of Dispersion


Variance: Variance measures how far a set of numbers is spread out from their average value.

Standard Deviation: The standard deviation is the square root of the variance, providing a
measure of the average distance from the mean.


Mean Deviation: This is the average of the absolute deviations from the mean.

Quartile Deviation: This is half the difference between the third quartile (Q3) and the first quartile (Q1), i.e., (Q3 - Q1) / 2.
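As a cross-check on these definitions, the mean and variance can also be computed directly from the (value, frequency) pairs without expanding the data. A minimal sketch (the variable names here are illustrative and not used in the program below):

import numpy as np

data = [(1, 5), (2, 10), (3, 15), (4, 20), (5, 10)]
values = np.array([v for v, f in data])
freqs = np.array([f for v, f in data])

total = freqs.sum()                                    # total frequency N
weighted_mean = (values * freqs).sum() / total         # sum(f * x) / N
weighted_var = (freqs * (values - weighted_mean) ** 2).sum() / total  # population variance

print(weighted_mean)   # 3.333...
print(weighted_var)    # 1.3888...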

Implementation

import numpy as np
from scipy import stats

# Sample frequency distribution


data = [(1, 5), (2, 10), (3, 15), (4, 20), (5, 10)]

# Expanding the data based on frequency


expanded_data = []
for value, frequency in data:
expanded_data.extend([value] * frequency)

# Convert to numpy array for calculations


expanded_data = np.array(expanded_data)

# Central Tendency
mean = np.mean(expanded_data)
median = np.median(expanded_data)
mode = stats.mode(expanded_data, keepdims=True).mode[0]  # most frequent value (keepdims for newer SciPy versions)

# Measures of Dispersion
variance = np.var(expanded_data)
std_deviation = np.std(expanded_data)
mean_deviation = np.mean(np.abs(expanded_data - mean))


# Quartiles
Q1 = np.percentile(expanded_data, 25)
Q3 = np.percentile(expanded_data, 75)
quartile_deviation = (Q3 - Q1) / 2

# Displaying the results


print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print(f"Variance: {variance}")
print(f"Standard Deviation: {std_deviation}")
print(f"Mean Deviation: {mean_deviation}")
print(f"Quartile Deviation: {quartile_deviation}")

Conclusion
This program effectively calculates the central tendency and measures of dispersion for a
frequency distribution. By utilizing Python's powerful libraries, we can easily perform
statistical analysis, making it a valuable tool for data scientists and analysts. Understanding
these measures allows for better insights into the data, guiding informed decision-making.

Output – Mean , median, mode, standard deviation, variance, mean deviation and
quartile deviation for a frequency distribution/data.

Mean: 3.3333333333333335
Median: 3.5
Mode: 4
Variance: 1.3888888888888886
Standard Deviation: 1.178511301977579
Mean Deviation: 0.9999999999999998
Quartile Deviation: 0.625


5. Program to perform cross validation for a given dataset to measure Root Mean
Squared Error (RMSE), Mean Absolute Error (MAE) and R2 Error using
validation set, Leave one out cross-validation (LOOCV) and k-fold cross-
validation approaches.

import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

# Generate synthetic data


X, y = make_regression(n_samples=100, n_features=1, noise=10)

# Initialize model
model = LinearRegression()

# K-Fold Cross Validation


kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]

model.fit(X_train, y_train)
predictions = model.predict(X_test)

print("K-Fold Metrics:")
print("RMSE:", np.sqrt(mean_squared_error(y_test, predictions)))
print("MAE:", mean_absolute_error(y_test, predictions))
print("R-squared:", r2_score(y_test, predictions))


# Leave-One-Out Cross Validation


loo = LeaveOneOut()
for train_index, test_index in loo.split(X):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]

model.fit(X_train, y_train)
predictions = model.predict(X_test)

print("LOOCV Metrics:")
print("RMSE:", np.sqrt(mean_squared_error(y_test, predictions)))
print("MAE:", mean_absolute_error(y_test, predictions))
print("R-squared:", r2_score(y_test, predictions))

Output - Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) and R2
Error using validation set.

K-Fold Metrics:
RMSE: 11.631373440835564
MAE: 9.567705989063176
R-squared: 0.14034702030864432
K-Fold Metrics:
RMSE: 10.401084827603405
MAE: 8.142428049490906
R-squared: -0.00725242933644954
K-Fold Metrics:
RMSE: 11.333138518405033
MAE: 7.793746967878539
R-squared: 0.0005038144360902663
K-Fold Metrics:
RMSE: 9.662205911675452


MAE: 7.543601870520996
R-squared: 0.2531329275137174
K-Fold Metrics:
RMSE: 9.895247485776258
MAE: 7.894191252365322
R-squared: 0.32917106292669396

Output - Leave one out cross-validation(LOOCV) and k-fold cross-validation


approaches

LOOCV Metrics:
RMSE: 18.428016888511046
MAE: 18.428016888511046
R-squared: nan
LOOCV Metrics:
RMSE: 17.169373950290414
MAE: 17.169373950290414
R-squared: nan
LOOCV Metrics:
RMSE: 17.276426835940963
MAE: 17.276426835940963
R-squared: nan
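The R-squared values above are nan because each LOOCV test fold contains only a single observation, for which R2 is undefined. The experiment title also lists the validation-set approach, which is not shown above; a minimal sketch using a single train/test split (assuming scikit-learn's train_test_split, and reusing the X, y, model and metric functions defined earlier):

from sklearn.model_selection import train_test_split

# Hold out 20% of the data as a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model.fit(X_train, y_train)
predictions = model.predict(X_val)

print("Validation Set Metrics:")
print("RMSE:", np.sqrt(mean_squared_error(y_val, predictions)))
print("MAE:", mean_absolute_error(y_val, predictions))
print("R-squared:", r2_score(y_val, predictions))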


6. Program to display Normal, Binomial, Poisson, Bernoulli distributions for a given


frequency distribution and analyze the results.
import numpy as np
import matplotlib.pyplot as plt

# Function to calculate normal distribution


def normal_distribution(x, mu, sigma):
return (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

# Function to calculate binomial distribution


def binomial_distribution(n, p, k):
from math import comb
return comb(n, k) * (p ** k) * ((1 - p) ** (n - k))

# Function to calculate Poisson distribution


def poisson_distribution(lmbda, k):
from math import exp, factorial
return (lmbda ** k * exp(-lmbda)) / factorial(k)

# Function to calculate Bernoulli distribution


def bernoulli_distribution(p, k):
return p ** k * (1 - p) ** (1 - k)

# Parameters
mu = 0
sigma = 1
n = 10
p = 0.5
lmbda = 3

# X values for normal distribution


x = np.linspace(-5, 5, 100)
normal_y = normal_distribution(x, mu, sigma)

# X values for binomial distribution


k_values = np.arange(0, n + 1)
binomial_y = [binomial_distribution(n, p, k) for k in k_values]

# X values for Poisson distribution


poisson_k_values = np.arange(0, 15)
poisson_y = [poisson_distribution(lmbda, k) for k in poisson_k_values]

# X values for Bernoulli distribution


bernoulli_k_values = [0, 1]
bernoulli_y = [bernoulli_distribution(p, k) for k in bernoulli_k_values]


# Plotting
plt.figure(figsize=(12, 8))

# Normal Distribution
plt.subplot(2, 2, 1)
plt.plot(x, normal_y, label='Normal Distribution', color='blue')
plt.title('Normal Distribution')
plt.xlabel('X')
plt.ylabel('Probability Density')
plt.grid()

# Binomial Distribution
plt.subplot(2, 2, 2)
plt.bar(k_values, binomial_y, label='Binomial Distribution', color='orange')
plt.title('Binomial Distribution')
plt.xlabel('Number of Successes')
plt.ylabel('Probability')
plt.grid()

# Poisson Distribution
plt.subplot(2, 2, 3)
plt.bar(poisson_k_values, poisson_y, label='Poisson Distribution', color='green')
plt.title('Poisson Distribution')
plt.xlabel('Number of Events')
plt.ylabel('Probability')
plt.grid()

# Bernoulli Distribution
plt.subplot(2, 2, 4)
plt.bar(bernoulli_k_values, bernoulli_y, label='Bernoulli Distribution', color='red')
plt.title('Bernoulli Distribution')
plt.xlabel('Outcome')
plt.ylabel('Probability')
plt.xticks(bernoulli_k_values)
plt.grid()

plt.tight_layout()
plt.show()

Conclusion
Normal Distribution: The function normal_distribution computes the PDF for a range
of x values.
Binomial Distribution: The function binomial_distribution calculates the probability
for each number of successes.


Poisson Distribution: The function poisson_distribution computes the probabilities for


a range of events.
Bernoulli Distribution: The function bernoulli_distribution calculates the probabilities
for two outcomes (success and failure).
Output - Normal, Binomial, Poisson, Bernoulli distributions


7. Program to implement one-sample, two-sample and paired-sample t-tests for sample data and analyze the results.
import math

def one_sample_t_test(sample, population_mean):


sample_mean = sum(sample) / len(sample)
sample_std = math.sqrt(sum((x - sample_mean) ** 2 for x in sample) / (len(sample) - 1))
t_statistic = (sample_mean - population_mean) / (sample_std / math.sqrt(len(sample)))
return t_statistic, sample_mean

# Sample data
sample_data = [2.3, 2.5, 2.8, 3.0, 2.7]
population_mean = 2.5

t_statistic, sample_mean = one_sample_t_test(sample_data, population_mean)


print(f"One-Sample T-Test: t-statistic = {t_statistic}, Sample Mean = {sample_mean}")
def two_sample_t_test(sample1, sample2):
mean1 = sum(sample1) / len(sample1)
mean2 = sum(sample2) / len(sample2)
std1 = math.sqrt(sum((x - mean1) ** 2 for x in sample1) / (len(sample1) - 1))
std2 = math.sqrt(sum((x - mean2) ** 2 for x in sample2) / (len(sample2) - 1))

pooled_std = math.sqrt(((len(sample1) - 1) * std1**2 + (len(sample2) - 1) * std2**2) /


(len(sample1) + len(sample2) - 2))
t_statistic = (mean1 - mean2) / (pooled_std * math.sqrt(1/len(sample1) + 1/len(sample2)))

return t_statistic, mean1, mean2

# Sample data
sample_data1 = [2.3, 2.5, 2.8, 3.0, 2.7]
sample_data2 = [3.1, 3.3, 3.5, 3.7, 3.6]


t_statistic, mean1, mean2 = two_sample_t_test(sample_data1, sample_data2)


print(f"Two-Sample T-Test: t-statistic = {t_statistic}, Sample Mean 1 = {mean1}, Sample
Mean 2 = {mean2}")
def paired_sample_t_test(sample1, sample2):
differences = [x - y for x, y in zip(sample1, sample2)]
mean_diff = sum(differences) / len(differences)
std_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in differences) / (len(differences) - 1))
t_statistic = mean_diff / (std_diff / math.sqrt(len(differences)))

return t_statistic, mean_diff

# Sample data
sample_data1 = [2.3, 2.5, 2.8, 3.0, 2.7]
sample_data2 = [2.1, 2.4, 2.6, 2.9, 2.5]

t_statistic, mean_diff = paired_sample_t_test(sample_data1, sample_data2)


print(f"Paired Sample T-Test: t-statistic = {t_statistic}, Mean Difference = {mean_diff}")

Conclusion:
Analysis of Results
One-Sample T-Test: The t-statistic indicates how far the sample mean is from the population mean in terms of standard errors. A higher absolute value suggests a significant difference.
Two-Sample T-Test: The t-statistic here compares the means of two independent samples. If
the t-statistic is significantly high or low, it suggests that the two groups differ in their means.
Paired Sample T-Test: This test focuses on the differences between paired observations. A
significant t-statistic indicates that the treatment or condition has had an effect.
In conclusion, implementing t-tests without statistical packages allows for a deeper
understanding of the underlying calculations and assumptions. By analyzing the t-statistics
and means, we can draw conclusions about the significance of our sample data.


Output - one Sample, Two Sample and Paired-Sample t-test

One-Sample T-Test: t-statistic = 1.3241694217637898, Sample Mean = 2.66


Two-Sample T-Test: t-statistic = -4.818856093078681, Sample Mean 1 = 2.66, Sample Mean
2 = 3.4400000000000004
Paired Sample T-Test: t-statistic = 6.531972647421822, Mean Difference =
0.15999999999999998
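The hand-computed t-statistics can be cross-checked against scipy.stats, which also reports p-values. An optional sketch (scipy is an addition here, not part of the from-scratch implementation above; the variable names are illustrative and reuse the same sample data):

from scipy import stats

sample_data = [2.3, 2.5, 2.8, 3.0, 2.7]
group_a = [2.3, 2.5, 2.8, 3.0, 2.7]
group_b = [3.1, 3.3, 3.5, 3.7, 3.6]
before = [2.3, 2.5, 2.8, 3.0, 2.7]
after = [2.1, 2.4, 2.6, 2.9, 2.5]

print(stats.ttest_1samp(sample_data, popmean=2.5))  # one-sample t-test against mean 2.5
print(stats.ttest_ind(group_a, group_b))            # two-sample (independent) t-test
print(stats.ttest_rel(before, after))               # paired-sample t-test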


8. Program to Implement One-Way and Two-way ANOVA test and analyze the
results.
One-Way ANOVA One-way ANOVA is used when comparing the means of three or
more independent groups. The null hypothesis states that all group means are equal.

Steps to Implement One-Way ANOVA:

Calculate the Group Means: Find the mean of each group.

Calculate the Overall Mean: Find the mean of all data points.

Calculate the Between-Group Variance: This measures how much the group means
deviate from the overall mean.

Calculate the Within-Group Variance: This measures how much the individual data
points deviate from their respective group means.

Calculate the F-statistic: This is the ratio of the between-group variance to the within-
group variance.

import numpy as np

# Sample data for three groups

group1 = [23, 20, 22, 25, 30]

group2 = [30, 32, 29, 35, 31]

group3 = [25, 27, 24, 22, 26]

# Combine groups into a list

data = [group1, group2, group3]

# Calculate means

group_means = [np.mean(group) for group in data]

overall_mean = np.mean([item for group in data for item in group])

# Calculate Between-Group Variance


SSB = sum(len(group) * (mean - overall_mean) ** 2 for group, mean in zip(data,


group_means))

# Calculate Within-Group Variance

SSW = sum(sum((x - mean) ** 2 for x in group) for group, mean in zip(data,


group_means))

# Degrees of freedom

df_between = len(data) - 1

df_within = sum(len(group) for group in data) - len(data)

# Mean Squares

MSB = SSB / df_between

MSW = SSW / df_within

# F-statistic

F_statistic = MSB / MSW

print(f"F-statistic for One-Way ANOVA: {F_statistic}")

# Sample data for two factors (A and B)

factor_A = [[23, 20, 22], [30, 32, 29], [25, 27, 24]]

factor_B = [[25, 30, 28], [22, 20, 21], [27, 29, 26]]

# Calculate means

means_A = [np.mean([factor_A[i][j] for i in range(len(factor_A))]) for j in


range(len(factor_A[0]))]


means_B = [np.mean([factor_B[i][j] for i in range(len(factor_B))]) for j in


range(len(factor_B[0]))]

overall_mean = np.mean([item for sublist in factor_A for item in sublist] + [item for
sublist in factor_B for item in sublist])

# Calculate Sum of Squares

SS_A = sum(len(factor_B[0]) * (mean - overall_mean) ** 2 for mean in means_A)

SS_B = sum(len(factor_A[0]) * (mean - overall_mean) ** 2 for mean in means_B)

# Interaction Sum of Squares

SS_AB = sum((np.mean(factor_A[i]) - overall_mean) ** 2 for i in


range(len(factor_A)))

# Total Sum of Squares

SST = SS_A + SS_B + SS_AB

# Degrees of freedom

df_A = len(factor_A) - 1

df_B = len(factor_B) - 1

df_AB = df_A * df_B

# Mean Squares

MS_A = SS_A / df_A

MS_B = SS_B / df_B

MS_AB = SS_AB / df_AB


# F-statistics

F_A = MS_A / (SST / (len(factor_A) * len(factor_B) - 1))

F_B = MS_B / (SST / (len(factor_A) * len(factor_B) - 1))

print(f"F-statistic for Factor A: {F_A}")

print(f"F-statistic for Factor B: {F_B}")

Output - One-Way and Two-way ANOVA test

F-statistic for One-Way ANOVA: 10.52765957446808


F-statistic for Factor A: 0.2870813397129179
F-statistic for Factor B: 0.44019138755980847
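For the one-way case, the F-statistic can be cross-checked with scipy.stats.f_oneway, which should reproduce the value above and also report a p-value. An optional sketch, not part of the original program:

from scipy import stats

group1 = [23, 20, 22, 25, 30]
group2 = [30, 32, 29, 35, 31]
group3 = [25, 27, 24, 22, 26]

# One-way ANOVA on the three independent groups
f_stat, p_value = stats.f_oneway(group1, group2, group3)
print("F-statistic:", f_stat)   # approximately 10.53, matching the hand computation
print("p-value:", p_value)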


9. Program to implement correlation, rank correlation and regression, and plot an x-y plot and heat maps of correlation matrices.

import numpy as np

import matplotlib.pyplot as plt

# Generating sample data

np.random.seed(0)

x = np.random.rand(100)

y = 2 * x + np.random.normal(0, 0.1, 100) # Linear relationship with noise

def pearson_correlation(x, y):

n = len(x)

sum_x = np.sum(x)

sum_y = np.sum(y)

sum_x2 = np.sum(x**2)

sum_y2 = np.sum(y**2)

sum_xy = np.sum(x * y)

numerator = n * sum_xy - sum_x * sum_y

denominator = np.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))

return numerator / denominator

correlation = pearson_correlation(x, y)

print(f"Pearson Correlation Coefficient: {correlation}")

def spearman_rank_correlation(x, y):

rank_x = np.argsort(np.argsort(x))

rank_y = np.argsort(np.argsort(y))


return pearson_correlation(rank_x, rank_y)

rank_correlation = spearman_rank_correlation(x, y)

print(f"Spearman Rank Correlation Coefficient: {rank_correlation}")

def linear_regression(x, y):

n = len(x)

m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) -


(np.sum(x)**2))

b = (np.sum(y) - m * np.sum(x)) / n

return m, b

slope, intercept = linear_regression(x, y)

print(f"Linear Regression: Slope = {slope}, Intercept = {intercept}")

plt.scatter(x, y, label='Data Points')

plt.plot(x, slope * x + intercept, color='red', label='Regression Line')

plt.xlabel('X')

plt.ylabel('Y')

plt.title('Scatter Plot with Regression Line')

plt.legend()

plt.show()

def plot_correlation_matrix(x, y):

correlation_matrix = np.corrcoef(x, y)

plt.imshow(correlation_matrix, cmap='hot', interpolation='nearest')

plt.colorbar()

plt.title('Correlation Matrix Heat Map')

plt.xticks([0, 1], ['X', 'Y'])


plt.yticks([0, 1], ['X', 'Y'])

plt.show()

plot_correlation_matrix(x, y)

Output - correlation, rank correlation and regression x-y plot and heat maps of
correlation matrices
Pearson Correlation Coefficient: 0.9853103832101713
Spearman Rank Correlation Coefficient: 0.9836063606360637
Linear Regression: Slope = 1.9936935021402027, Intercept = 0.022215107744723496
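These from-scratch results can be verified against library routines. A short optional sketch (np.corrcoef, np.polyfit and scipy.stats.spearmanr are additions beyond the original code, reusing the x and y arrays generated above):

from scipy import stats

print(np.corrcoef(x, y)[0, 1])            # Pearson correlation, should match the value above
print(stats.spearmanr(x, y).correlation)  # Spearman rank correlation
slope_chk, intercept_chk = np.polyfit(x, y, 1)  # least-squares line of degree 1
print(slope_chk, intercept_chk)           # should match the slope and intercept above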


10. Program to implement PCA for Wisconsin dataset, visualize and analyze the
results.

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer

# Load the Wisconsin breast cancer dataset
data = load_breast_cancer()
X = data.data                                   # feature matrix
y = data.target                                 # class labels
df = pd.DataFrame(X, columns=data.feature_names)

print(df)

# Standardize the data

X_mean = np.mean(X, axis=0)

X_std = np.std(X, axis=0)

X_standardized = (X - X_mean) / X_std

# Compute the covariance matrix

cov_matrix = np.cov(X_standardized, rowvar=False)

print(cov_matrix)

# Compute eigenvalues and eigenvectors

eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

# Sort the eigenvalues and eigenvectors

sorted_indices = np.argsort(eigenvalues)[::-1]

eigenvalues_sorted = eigenvalues[sorted_indices]

eigenvectors_sorted = eigenvectors[:, sorted_indices]

# Select the top 2 principal components

k=2

eigenvectors_subset = eigenvectors_sorted[:, :k]

# Transform the data

X_pca = X_standardized.dot(eigenvectors_subset)


# Visualize the PCA results

plt.figure(figsize=(10, 6))

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k', s=50)

plt.title('PCA of Wisconsin Breast Cancer Dataset')

plt.xlabel('Principal Component 1')

plt.ylabel('Principal Component 2')

plt.colorbar(label='Class Label')

plt.grid()

plt.show()

Output - PCA for Wisconsin dataset
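As an optional extension, not in the original program, the share of variance captured by the two selected components can be computed from the sorted eigenvalues, which indicates how faithful the 2-D plot is:

# Proportion of total variance explained by each principal component
explained_variance_ratio = eigenvalues_sorted / eigenvalues_sorted.sum()
print("Explained variance ratio of the first", k, "components:", explained_variance_ratio[:k])
print("Total variance captured:", explained_variance_ratio[:k].sum())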


11. Program to implement the working of linear discriminant analysis using IRIS
dataset and visualize the result.

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn import datasets

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Load the iris dataset

iris = datasets.load_iris()

X = iris.data # Features

y = iris.target # Target classes

print(y)

# Create an instance of LDA

lda = LDA(n_components=2)

# Fit and transform the data

X_lda = lda.fit_transform(X, y)

# Create a DataFrame for visualization

lda_df = pd.DataFrame(data=X_lda, columns=['LD1', 'LD2'])

lda_df['target'] = y

# Map target values to class names

lda_df['target'] = lda_df['target'].map({0: 'Setosa', 1: 'Versicolor', 2: 'Virginica'})

# Plotting

plt.figure(figsize=(10, 6))

sns.scatterplot(data=lda_df, x='LD1', y='LD2', hue='target', palette='viridis', s=100)


plt.title('LDA of Iris Dataset')

plt.xlabel('Linear Discriminant 1')

plt.ylabel('Linear Discriminant 2')

plt.legend(title='Species')

plt.grid()

plt.show()

Output - working of linear discriminant analysis using IRIS dataset

[0 0 0 ... 0 1 1 1 ... 1 2 2 2 ... 2]   (50 zeros, 50 ones and 50 twos: one class label per iris sample)
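Optionally, the fitted LDA object exposes explained_variance_ratio_, which shows how much of the between-class variance each discriminant captures (a small addition reusing the lda object fitted above):

# Fraction of between-class variance explained by LD1 and LD2
print("Explained variance ratio:", lda.explained_variance_ratio_)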


12. Program to implement multiple linear regression using IRIS dataset, visualize
and analyze the results.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load the iris dataset
iris = sns.load_dataset('iris')
print(iris.head())
# Define independent variables (features) and dependent variable (target)
X = iris[['sepal_length', 'sepal_width', 'petal_width']]
y = iris['petal_length']
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')


print(f'R-squared: {r2}')
# Visualize the results
plt.figure(figsize=(10, 6))


plt.scatter(y_test, y_pred, color='blue')


plt.plot([y.min(), y.max()], [y.min(), y.max()], color='red', linewidth=2)
plt.title('Actual vs Predicted Petal Length')
plt.xlabel('Actual Petal Length')
plt.ylabel('Predicted Petal Length')
plt.grid()
plt.show()

Output - multiple linear regression using IRIS dataset

sepal_length sepal_width petal_length petal_width species


0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
Mean Squared Error: 0.13001626031382693
R-squared: 0.9603293155857664
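To interpret the fitted model, its coefficients and intercept can also be printed (a small optional addition using standard attributes of the fitted LinearRegression object):

# Contribution of each feature to the predicted petal length
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: {coef:.3f}")
print(f"Intercept: {model.intercept_:.3f}")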


VIVA QUESTIONS
1. What is computational statistics, and how does it differ from traditional
statistics?
Computational statistics is a field of statistics that focuses on the use of computer-
based techniques and algorithms to solve complex statistical problems and analyze
large datasets. It involves the development and application of computational methods
to implement statistical models, perform simulations, estimate parameters, and
visualize data, often in situations where traditional analytical methods are impractical
due to the complexity or size of the data.
Computational statistics includes the use of numerical methods such as Monte Carlo
simulations, optimization algorithms, bootstrapping, and Markov Chain Monte Carlo
(MCMC) methods. It also encompasses the use of software tools and programming
languages (e.g., R, Python, MATLAB) to carry out statistical analyses and generate
insights from data.
Key Features of Computational Statistics:
 Simulation-Based Methods: Use of simulation techniques like Monte Carlo or bootstrapping to approximate statistical quantities when analytical methods are difficult or impossible (a short bootstrap sketch follows this list).
 High-Dimensional Data: Application of methods to handle large and complex
datasets that are too large for traditional methods to process effectively.
 Numerical Optimization: Computational statisticians often work on optimization
problems (e.g., maximizing likelihood functions) to estimate model parameters.
 Algorithm Development: Computational statisticians develop new algorithms for
data analysis, statistical inference, and model fitting.
 Big Data Analytics: Ability to work with large datasets, using parallel computing or
distributed computing frameworks like Hadoop or Spark.
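As a concrete illustration of the simulation-based methods listed above, a basic bootstrap estimate of a 95% confidence interval for a mean takes only a few lines (an illustrative sketch with hypothetical sample values, not part of the syllabus programs):

import numpy as np

rng = np.random.default_rng(0)
sample = np.array([2.3, 2.5, 2.8, 3.0, 2.7, 2.9, 2.4, 2.6])  # hypothetical sample

# Resample with replacement many times and record each resample's mean
boot_means = [rng.choice(sample, size=len(sample), replace=True).mean()
              for _ in range(10000)]

# The 2.5th and 97.5th percentiles give a 95% bootstrap confidence interval
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: ({lower:.3f}, {upper:.3f})")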
How Computational Statistics Differs from Traditional Statistics
While traditional statistics focuses on the mathematical theory and analytical
approaches for understanding and interpreting data, computational statistics
emphasizes the use of computers and numerical methods to apply these statistical
concepts to complex problems. The main differences between computational statistics
and traditional statistics are as follows:
1. Nature of Methods:
 Traditional Statistics: Relies primarily on closed-form solutions and analytical
methods. For example, statistical tests like the t-test or regression analysis rely on
mathematical formulas to make inferences from data.
 Computational Statistics: Uses numerical methods and computer simulations to
solve problems. Techniques like bootstrapping, Markov Chain Monte Carlo
(MCMC), and Monte Carlo simulations allow statisticians to approximate solutions
when analytical methods are impractical.
2. Data Size and Complexity:
 Traditional Statistics: Typically works with small to medium-sized datasets, where
statistical assumptions (e.g., normality) can be easily checked and met. Traditional
methods also assume that the data can be summarized with simple statistical measures
like the mean or variance.


 Computational Statistics: Specializes in large and complex datasets that may


involve high-dimensional, unstructured, or streaming data. Computational methods
allow statisticians to analyze these large datasets efficiently.
3. Assumptions:
 Traditional Statistics: Often assumes that the data fits specific parametric
distributions (e.g., normal distribution). These methods typically work under strict
assumptions (e.g., homogeneity of variance, normality, independence).
 Computational Statistics: More flexible and less assumption-dependent. While
computational methods may still rely on some basic assumptions (e.g., random
sampling), they often don't require strong assumptions about the underlying
distribution of the data. For example, non-parametric methods (like bootstrapping)
can be used without assuming a particular data distribution.
4. Computational Resources:
 Traditional Statistics: Methods are often based on formulas that can be manually
computed or calculated using basic tools like calculators or spreadsheets.
 Computational Statistics: Involves the use of advanced computing resources, such
as high-performance computing, cloud computing, and specialized software to
perform complex calculations and simulations efficiently.
5. Modeling and Simulation:
 Traditional Statistics: Focuses more on parameter estimation and hypothesis
testing using models with closed-form solutions (e.g., least squares regression, chi-
square tests).
 Computational Statistics: Involves using simulations (e.g., Monte Carlo,
bootstrapping) to estimate parameters, perform uncertainty quantification, or simulate
scenarios when analytical solutions are not feasible.
6. Real-Time and Big Data:
 Traditional Statistics: Typically applied to data that can be handled by standard
desktop software or statistical packages, making it more suitable for smaller-scale
problems.
 Computational Statistics: Often works with big data and uses distributed
computing systems (e.g., Hadoop, Spark) to analyze data from sources like sensors,
web logs, social media, or real-time streaming.

2. Can you explain the difference between parametric and non-parametric


methods?
In statistics, parametric and non-parametric methods refer to two broad categories
of techniques used to analyze data. The main difference between them lies in the
assumptions they make about the underlying data distribution.
1. Parametric Methods
Definition: Parametric methods are statistical techniques that assume the data follows
a specific probability distribution, such as the normal distribution, and estimate the
parameters of that distribution.
Key Characteristics:
 Assumptions: These methods make strong assumptions about the form of the
underlying population distribution. For example, a common parametric method like t-
tests assumes that the data comes from a normal distribution.


 Parameters: Parametric methods estimate parameters (e.g., mean, variance,


regression coefficients) of a known distribution.
 Efficiency: When the assumptions about the data distribution hold true, parametric
methods are very efficient and provide precise estimates.
 Examples:
o t-tests: Compare means of two groups assuming normality.
o Linear regression: Assumes a linear relationship between independent and
dependent variables with normally distributed errors.
o ANOVA (Analysis of Variance): Assumes that the populations from which
samples are taken are normally distributed.
Advantages:
 Efficiency: They generally require fewer data points to estimate parameters
effectively when the assumptions are correct.
 Powerful: When the assumptions hold, parametric methods tend to be more powerful
(i.e., they are more likely to detect a true effect).
Disadvantages:
 Assumption-Dependent: If the data does not meet the assumptions (e.g., non-
normality), parametric methods can lead to biased or incorrect conclusions.
2. Non-Parametric Methods
Definition: Non-parametric methods are statistical techniques that do not make strong
assumptions about the specific form of the population distribution. These methods are
more flexible and rely on fewer assumptions.
Key Characteristics:
 No Assumptions About Distribution: Non-parametric methods do not assume that
the data follows a specific distribution (such as normal). They can be used for data
that does not meet the assumptions of parametric methods.
 Rank-Based: Many non-parametric tests use the ranks of the data rather than the
actual values (e.g., comparing the relative sizes of observations rather than their
specific values).
 Robustness: Non-parametric methods are often used when the data is skewed, has
outliers, or is ordinal (i.e., categorical with a meaningful order) rather than
continuous.
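To illustrate the contrast described above, here is a minimal sketch (assuming NumPy and SciPy, which are not named in the text; the two groups and their values are hypothetical) that applies a parametric t-test and a non-parametric Mann-Whitney U test to the same two samples:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=50, scale=5, size=30)   # approximately normal sample
group_b = rng.normal(loc=53, scale=5, size=30)

# Parametric: independent two-sample t-test (assumes normally distributed data)
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# Non-parametric: Mann-Whitney U test (rank-based, no normality assumption)
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"t-test:       statistic = {t_stat:.3f}, p-value = {t_p:.4f}")
print(f"Mann-Whitney: statistic = {u_stat:.3f}, p-value = {u_p:.4f}")

When the normality assumption holds, the two tests usually agree; when the data are skewed or contain outliers, the rank-based test is the safer choice.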
3. What is the role of a computational statistician?
A computational statistician combines expertise in statistics and computer science to
develop and apply computational techniques for analyzing large, complex datasets.
The role of a computational statistician is to design, implement, and interpret
statistical models and algorithms, often using advanced computational tools, to extract
meaningful insights from data. Computational statisticians typically work in research,
data science, and analytics, providing critical support in areas like machine learning,
data analysis, and decision-making.
Here are the key roles and responsibilities of a computational statistician:
1. Data Analysis and Modeling:
 Statistical Modeling: Computational statisticians create and apply statistical models
to understand and make predictions based on data. This includes developing models
such as linear regression, time series analysis, Bayesian models, and more advanced
machine learning algorithms.
 Model Validation: They assess and validate models using techniques such as cross-
validation, bootstrapping, and hypothesis testing to ensure accuracy and robustness.
 Complex Data: They are skilled at handling complex, high-dimensional data,
including time series, spatial data, and unstructured data like text or images.
2. Algorithm Development:
 Developing Computational Algorithms: A computational statistician designs and
implements algorithms to solve statistical problems, such as optimization algorithms
for parameter estimation, sampling methods like Markov Chain Monte Carlo
(MCMC), or simulation techniques like Monte Carlo methods.
 Scalability: They develop algorithms that can handle large datasets efficiently,
ensuring computational methods scale to the size of the data without sacrificing
accuracy.
3. Data Cleaning and Preprocessing:
 Data Wrangling: They play an important role in cleaning and preprocessing raw
data, handling missing data, outliers, and data transformations to make it suitable for
analysis.
 Feature Engineering: They may design new features or variables to improve model
performance and help the machine learning algorithms learn better patterns from the
data.
4. Statistical Inference and Decision Making:
 Hypothesis Testing: They conduct statistical hypothesis testing to draw inferences
about populations based on sample data.
 Bayesian Inference: Many computational statisticians use Bayesian methods to
perform inference and update beliefs based on data, especially in uncertain or
complex systems.
 Uncertainty Quantification: They work with uncertainty in models and data, using
techniques like confidence intervals, credible intervals, and bootstrapping to quantify
uncertainty in predictions.
5. Programming and Software Development:
 Coding Skills: Computational statisticians are proficient in programming languages
such as R, Python, MATLAB, or Julia to implement statistical methods and
algorithms.
 Automation and Efficiency: They write efficient, reusable code to automate
repetitive tasks and ensure their statistical analyses can be applied to new datasets
with minimal effort.
 Software Tools: They may also be involved in developing software tools or libraries
that enable others to apply complex statistical methods easily.
6. Data Visualization and Communication:
 Data Visualization: They create clear and effective visualizations (e.g., histograms,
scatter plots, heatmaps) to help stakeholders understand patterns, distributions, and
relationships in the data.
 Communicating Results: They must be able to communicate complex statistical
results and insights to non-experts, including business leaders, policymakers, or
researchers, in a clear and actionable manner.
7. Research and Development:
 Advancing Statistical Methods: Computational statisticians contribute to the
development of new statistical methodologies, improving existing methods, or finding
novel ways to apply statistics in fields like genomics, finance, or engineering.
 Collaborating with Researchers: They often work in interdisciplinary teams,
collaborating with domain experts to apply statistical methods to specific research
problems, such as analyzing healthcare data or designing experiments.
8. Applications in Machine Learning and AI:
 Machine Learning: Computational statisticians work closely with machine learning
techniques, such as supervised learning, unsupervised learning, and deep learning, to
develop predictive models and uncover hidden patterns in data.
 AI Integration: They play a key role in integrating statistical models into AI systems,
where probabilistic models, optimization, and statistical inference are central to the
learning and decision-making processes of artificial intelligence systems.
9. Ensuring Ethical and Responsible Use of Data:
 Ethics and Privacy: Computational statisticians are responsible for ensuring that
statistical methods and data analyses are ethically sound, particularly when dealing
with sensitive or personal data. They are involved in ensuring privacy, fairness, and
transparency in data-driven decision-making.
 Bias Mitigation: They may work on methods to detect and mitigate bias in data and
algorithms, ensuring that statistical models and machine learning systems make fair
predictions.
Key Skills and Tools:
 Statistical Expertise: Proficiency in statistical theory, probability theory, and applied
statistics.
 Computational Skills: Strong coding and algorithm development skills, with
knowledge of programming languages like Python, R, MATLAB, or Julia.
 Data Management: Experience with data wrangling, cleaning, and manipulation
using libraries like Pandas, NumPy, and dplyr.
 Machine Learning: Familiarity with machine learning libraries and frameworks,
such as Scikit-learn, TensorFlow, or PyTorch.
 Software Engineering: Experience with software development best practices,
including version control, testing, and performance optimization.
Industries and Applications:
Computational statisticians are in demand across a variety of industries, including:
 Healthcare: Analyzing clinical trial data, bioinformatics, and epidemiological studies.
 Finance: Quantitative analysis, risk assessment, and algorithmic trading.
 Marketing and Retail: Customer segmentation, recommendation systems, and
demand forecasting.
 Government: Public health analysis, social science research, and policy evaluation.
 Tech and AI: Machine learning, natural language processing, and data science.
4. How does Monte Carlo simulation work? Can you provide a basic example?
Monte Carlo simulation is a computational technique used to estimate numerical
solutions to problems by relying on random sampling. The method is particularly
useful for solving problems that may be difficult or impossible to solve analytically,
especially in cases involving complex or uncertain systems. Monte Carlo methods are
often used to approximate probabilities, integrals, optimization solutions, and other
numerical quantities.
The general process of Monte Carlo simulation involves:
1. Define the Problem: Identify the problem you want to solve and express it in
terms of random variables. For example, you might want to estimate the expected
value of a function over a certain distribution.
2. Generate Random Samples: Randomly sample values from the input
distributions that define the problem. This is typically done using pseudorandom
number generators.
3. Perform Calculations: Apply the random samples to the problem's equations or
model and compute the corresponding values (e.g., outcomes, costs, risks, etc.).
4. Aggregate Results: After performing the simulation many times (usually
thousands or millions of iterations), aggregate the results (e.g., take the average) to
obtain an estimate of the desired quantity.
5. Analyze the Results: Use the aggregated results to make inferences, such as
estimating probabilities, computing confidence intervals, or performing sensitivity
analysis.
Key Features of Monte Carlo Simulation:
 Random Sampling: The method uses random sampling to explore the solution space,
which is especially useful when dealing with high-dimensional or complex problems.
 Repetition: The simulation is run many times (often millions of times) to obtain a
large number of samples, which helps to ensure the results are reliable and converge
to the true solution.
 Stochastic Nature: Monte Carlo simulations deal with uncertainty by using
probabilistic models and randomness in the inputs.
Basic Example: Estimating the Value of π
One of the classic examples of Monte Carlo simulation is estimating the value of π using the Monte Carlo method for integration. Here's how you can estimate π using random sampling.
Problem:
You want to estimate the value of π using a circle inscribed in a square. The area of the circle is πr², and the area of the square is 4r² (if we set the radius r = 1, these areas are π and 4).
By randomly generating points in the square and checking how many fall inside the circle, you can estimate the ratio of the areas (π/4) and thus estimate π.
Steps:
1. Define the Geometry:
o Imagine a square with side length 2, and a circle inscribed in it with a radius of 1. The square's area is 4 and the circle's area is π.
2. Generate Random Points:
o Generate random points with coordinates (x, y) where x and y are uniformly distributed between -1 and 1. These points will fall inside the square.
o To check if a point is inside the circle, use the condition x² + y² ≤ 1. If this condition is true, the point lies inside the circle.
3. Estimate π:
o The fraction of random points that fall inside the circle approximates the ratio of the areas, π/4. Multiplying this fraction by 4 gives an estimate of π, and the estimate improves as more points are generated. A minimal Python sketch of this procedure follows.
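The sketch below (assuming NumPy, which is not referenced in the text above) implements the steps just described:

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000                       # number of random points
x = rng.uniform(-1, 1, n)           # x-coordinates inside the square
y = rng.uniform(-1, 1, n)           # y-coordinates inside the square
inside = (x**2 + y**2) <= 1         # True for points that fall inside the circle
pi_estimate = 4 * inside.mean()     # fraction inside ≈ π/4, so multiply by 4
print(f"Estimated value of π: {pi_estimate:.4f}")

Increasing n brings the estimate closer to the true value of π, which is exactly the behaviour that the Law of Large Numbers (discussed next) guarantees.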
5. Explain the law of large numbers and its significance in computational statistics.
The Law of Large Numbers (LLN) is a fundamental theorem in probability theory
and statistics that describes the behavior of the sample mean as the sample size
increases. It essentially states that as the sample size n grows, the sample mean X̄_n (the average of a set of observations) will converge to the true population mean μ (the expected value of the random variable).
There are two main forms of the Law of Large Numbers:
1. Weak Law of Large Numbers (WLLN):
o States that for a sequence of independent and identically distributed (i.i.d.) random variables with finite expected value μ and variance σ², the sample mean X̄_n converges in probability to the true population mean μ as the sample size n tends to infinity.
o In other words, for any ε > 0, the probability that the sample mean deviates from the population mean by more than ε approaches zero as n increases:
P(|X̄_n − μ| ≥ ε) → 0 as n → ∞
2. Strong Law of Large Numbers (SLLN):
o A stronger form, which states that the sample mean X̄_n converges almost surely (with probability 1) to the true population mean μ as n becomes large. This means that the probability of the sample mean not converging to μ becomes zero as the number of samples grows. (A short simulation illustrating this convergence is sketched just below.)
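The convergence described above can be seen directly in a short simulation. This sketch (assuming NumPy; the die-roll example is illustrative) tracks the running mean of simulated fair-die rolls, whose true expected value is 3.5:

import numpy as np

rng = np.random.default_rng(1)
rolls = rng.integers(1, 7, size=100_000)                  # i.i.d. fair-die rolls (values 1-6)
running_mean = np.cumsum(rolls) / np.arange(1, rolls.size + 1)

for n in (10, 100, 1_000, 100_000):
    print(f"n = {n:>6}: sample mean = {running_mean[n - 1]:.4f} (true mean = 3.5)")

As n grows, the printed sample means settle ever closer to 3.5, which is the behaviour the LLN predicts.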
Significance of the Law of Large Numbers in Computational Statistics
The Law of Large Numbers has several important implications in the field of
computational statistics, particularly in methods like Monte Carlo simulations,
bootstrap resampling, and other statistical estimation techniques:
1. Convergence of Sample Estimates:
o As the sample size increases, the sample mean (or any sample-
based statistic) becomes a more reliable estimate of the true
population parameter. This is foundational in computational
statistics, where we often rely on sample-based estimations to make
inferences about populations when the full data is not available.
o For example, in Monte Carlo simulations, repeated sampling
from a distribution allows the estimation of expected values, and as
more samples are drawn, the estimate becomes more accurate.
2. Improved Accuracy:
o The LLN guarantees that, as the sample size grows, the variance
of the sample mean decreases, meaning that large samples provide
more stable and accurate estimates of population parameters.
o This property is particularly important in statistical inference, as it
ensures that larger datasets lead to more precise and reliable results.
3. Central Limit Theorem (CLT):
o The LLN lays the groundwork for the Central Limit Theorem,
which asserts that the distribution of the sample mean approaches a
normal distribution as the sample size increases, regardless of the
original distribution of the data. This is crucial for many statistical
techniques, such as hypothesis testing and confidence interval
estimation, which rely on the assumption of normality in large
samples.
4. Simulation and Resampling Techniques:
o In methods like bootstrapping and permutation tests, the Law of
Large Numbers ensures that repeated resampling leads to reliable
estimates of confidence intervals, standard errors, and other
statistics.
o Bootstrapping, for instance, relies on repeatedly resampling from
the observed data to simulate the sampling distribution of a
statistic. The more resamples you generate, the closer the
resampled statistic will approximate the true sampling distribution.
5. Computational Efficiency:
o LLN also plays a role in computational efficiency. For example, in
many Monte Carlo simulations or Markov Chain Monte Carlo
(MCMC) methods, we use random sampling to approximate
complex integrals or distributions. As the number of samples
increases, the estimates converge, meaning that computational
resources spent on generating large datasets are more likely to
produce accurate results.
6. Estimation of High-Dimensional Models:
o In high-dimensional statistical models (such as in machine
learning or Bayesian inference), the Law of Large Numbers helps
ensure that the estimates of model parameters (e.g., regression
coefficients, variances) become more accurate as the sample size
increases. This is particularly useful in regularization techniques
(e.g., LASSO), where the sample size is often critical for achieving
reliable parameter estimates.
6. What are the main differences between Monte Carlo methods and
Bootstrapping?
Monte Carlo methods and bootstrapping are both statistical techniques that rely on
random sampling to estimate properties of a distribution or model, but they differ in
their underlying principles, goals, and applications. Here's a breakdown of the main
differences:
1. Purpose and Application
 Monte Carlo Methods:
o Monte Carlo methods are used primarily for estimating numerical
quantities (such as expectations, variances, or probabilities) that are
difficult to compute analytically.
o These methods typically involve sampling from a known probability
distribution (often the theoretical distribution of a variable or model) to
estimate values like integrals, means, or variances.
o Example applications: Calculating the value of complex integrals, estimating probabilities of rare events, or simulating physical systems in physics and engineering.
 Bootstrapping:
o Bootstrapping is a resampling technique used to estimate the sampling
distribution of a statistic (e.g., mean, variance, confidence intervals) from
observed data.
o It involves resampling with replacement from the observed sample to
generate new datasets, which are then used to compute statistics and
estimate their variability.
o Example applications: Estimating confidence intervals, performing
hypothesis tests, and assessing the uncertainty of model parameters when
the true population distribution is unknown.
2. Sampling Distribution
 Monte Carlo Methods:
o Sampling is done from a known theoretical probability distribution,
such as the normal distribution, exponential distribution, etc.
o The goal is often to estimate properties of this known distribution by
generating random samples based on its parameters.
 Bootstrapping:
o Sampling is done from the empirical distribution of the observed data,
i.e., the data you have already collected.
o Bootstrapping generates new samples (called bootstrap samples) by
randomly selecting observations from the original dataset with
replacement, and the goal is to use these to approximate the sampling
distribution of a statistic.
3. Data Used for Sampling
 Monte Carlo Methods:
o Monte Carlo methods generally require knowledge of the true probability
distribution and often involve generating random samples from that
distribution.
o These methods do not rely on an initial dataset; they generate simulated
data based on a predefined model or distribution.
 Bootstrapping:
o Bootstrapping works exclusively with the observed data. No assumption
about the underlying distribution of the data is required.
o The observed sample is treated as if it is the population, and new datasets
are created by sampling with replacement from this dataset.
4. Resampling Technique
 Monte Carlo Methods:
o In Monte Carlo simulations, new samples are generated according to a
known distribution and used to perform simulations or estimate quantities.
o The process does not involve resampling from the observed data, but
rather simulating new data points.
 Bootstrapping:
o Bootstrapping involves resampling with replacement from the original dataset, meaning each observation in the dataset can be selected multiple times or not at all in each new sample.
o This process generates multiple "bootstrap samples" that are used to
compute statistics and form confidence intervals or assess variability.
5. Goal of the Estimation
 Monte Carlo Methods:
o The goal of Monte Carlo methods is to approximate an unknown quantity,
often an integral or expectation, using random sampling.
o It's a general-purpose tool for numerical simulation and integration.
 Bootstrapping:
o The goal of bootstrapping is to estimate the variability of a sample
statistic by using resampling methods to simulate multiple possible
samples from the original data.
o It is particularly useful when you cannot easily derive the distribution of a
statistic analytically.
6. Mathematical Assumptions
 Monte Carlo Methods:
o Monte Carlo methods assume that you know the distribution you are
sampling from (or have some model that describes it).
o It is based on the principle that if you sample sufficiently from the known
distribution, the law of large numbers will ensure the estimates converge to
the true value.
 Bootstrapping:
o Bootstrapping makes fewer assumptions. It assumes that the observed
data is representative of the population and that the sampling distribution
of the statistic can be approximated by the resampling process.
o It does not require knowledge of the underlying distribution of the data.
7. Examples and Use Cases
 Monte Carlo Methods:
o Estimating the integral of a function that is difficult to evaluate
analytically.
o Simulating systems in physics, finance, and engineering (e.g., option
pricing, particle movement).
o Estimating probabilities for rare events, such as tail risk in finance.
 Bootstrapping:
o Estimating the confidence intervals of a sample mean, median, or other
statistics without making strong assumptions about the data's distribution.
o Assessing the variance of a sample statistic, such as the error rate of a
machine learning model.
o Hypothesis testing when the population distribution is unknown.
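As an illustration of the first bootstrapping use case above, the following sketch (assuming NumPy; the observed sample itself is simulated here purely for illustration) computes a 95% percentile bootstrap confidence interval for a sample mean:

import numpy as np

rng = np.random.default_rng(2)
data = rng.exponential(scale=2.0, size=50)        # observed (skewed) sample

n_boot = 10_000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    resample = rng.choice(data, size=data.size, replace=True)  # resample with replacement
    boot_means[b] = resample.mean()

lower, upper = np.percentile(boot_means, [2.5, 97.5])          # 95% percentile interval
print(f"Sample mean: {data.mean():.3f}")
print(f"95% bootstrap confidence interval: ({lower:.3f}, {upper:.3f})")

A Monte Carlo estimate, by contrast, would draw new samples directly from an assumed theoretical distribution rather than resampling the observed data.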
7. How does importance sampling work, and why would you use it?
Importance Sampling is a statistical technique used to estimate properties of a
distribution by sampling from a different distribution, called the proposal
distribution, and then re-weighting the samples to reflect the target distribution. It is
particularly useful when direct sampling from the target distribution is difficult or
computationally expensive.
Difficult to Sample from Target Distribution: In many cases, directly sampling from the target distribution p(x) may be hard or computationally expensive. Importance sampling allows you to sample from an easier distribution q(x) and then adjust for the discrepancy between q(x) and p(x).
Rare Events (Heavy-Tailed Distributions): Importance sampling is often used in
problems involving rare events or tail estimation, where the probability of the event of
interest is very small. In such cases, directly sampling from the target distribution
would require an impractically large number of samples. By using a proposal
distribution that places more mass on the rare event, importance sampling can yield
more efficient estimates.
Efficiency: If the proposal distribution is well-chosen and resembles the target
distribution, importance sampling can lead to more efficient estimates than other
methods like Monte Carlo integration or direct simulation. By focusing sampling
on the regions of interest in the target distribution, it reduces the number of samples
needed to achieve a certain level of accuracy.
Variance Reduction: Importance sampling can reduce the variance of estimators,
especially when the proposal distribution is designed to focus on the important
regions of the target distribution. In cases where certain values of x are more significant than others, this technique can significantly improve the estimation process.
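A minimal sketch of this idea (assuming NumPy and SciPy, which are not named in the text): to estimate the rare-event probability P(X > 4) for X ~ N(0, 1), samples are drawn from a shifted proposal N(4, 1) and re-weighted by p(x)/q(x):

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 100_000
samples = rng.normal(loc=4.0, scale=1.0, size=n)   # draw from the proposal q(x) = N(4, 1)

# Importance weights: target density p(x) = N(0, 1) divided by proposal density q(x)
weights = stats.norm.pdf(samples, loc=0, scale=1) / stats.norm.pdf(samples, loc=4, scale=1)

estimate = np.mean((samples > 4) * weights)        # weighted average of the indicator
print(f"Importance-sampling estimate of P(X > 4): {estimate:.3e}")
print(f"Exact tail probability:                   {stats.norm.sf(4):.3e}")

Naive Monte Carlo with the same number of draws from N(0, 1) would see almost no samples beyond 4, which is why the re-weighted proposal is far more efficient here.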
8. How do you estimate the parameters of a probability distribution?
Estimating the parameters of a probability distribution involves determining the
values of the distribution's parameters that best explain the observed data. There are
different methods for parameter estimation depending on the type of distribution and
available data. The two most common methods are Maximum Likelihood
Estimation (MLE) and Method of Moments (MoM).
1. Maximum Likelihood Estimation (MLE)
Maximum Likelihood Estimation (MLE) is the most widely used method for
estimating parameters. It involves finding the parameter values that maximize the
likelihood of the observed data under a given probability distribution.
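For a normal distribution the MLEs have closed forms (the sample mean, and the standard deviation computed with a divisor of n). The sketch below (assuming NumPy and SciPy; the simulated dataset is illustrative) compares these closed-form values with the numerical maximum likelihood fit returned by scipy.stats.norm.fit:

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
data = rng.normal(loc=10.0, scale=2.0, size=500)   # simulated observations

mu_hat = data.mean()                     # closed-form MLE of the mean
sigma_hat = data.std(ddof=0)             # closed-form MLE of the standard deviation
mu_fit, sigma_fit = stats.norm.fit(data) # maximum likelihood fit via scipy

print(f"Closed-form MLE:  mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
print(f"scipy norm.fit(): mu = {mu_fit:.3f}, sigma = {sigma_fit:.3f}")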
9. What is the difference between Maximum Likelihood Estimation (MLE) and Bayesian estimation?
1. Philosophical Approach:
 MLE (Frequentist): MLE is a frequentist method that focuses on finding the
parameter values that maximize the likelihood of the observed data. The parameters
are considered fixed, and the goal is to find the most likely values of the parameters
given the data. There is no incorporation of prior knowledge or beliefs.
 Bayesian Estimation: In Bayesian estimation, parameters are treated as random
variables with probability distributions. The estimation process incorporates both
prior beliefs (prior distributions) about the parameters and the observed data. Bayes'
theorem is used to update these beliefs, resulting in a posterior distribution for the
parameters.
2. Treatment of Parameters:
 MLE: The parameters are treated as fixed but unknown quantities. MLE estimates
the parameters by maximizing the likelihood function, assuming that the true
parameter values exist and can be estimated from the data.
 Bayesian Estimation: The parameters are treated as random variables that have a
probability distribution. The Bayesian approach combines prior beliefs with the
likelihood of the data to update the parameters' distribution, resulting in a posterior
distribution that reflects both the prior and the observed data.
3. Estimation Process:
 MLE: MLE seeks a point estimate of the parameters that maximizes the likelihood
function. It provides a single estimate for each parameter, but it does not quantify the
uncertainty around these estimates. The objective is to find the value of the parameter
that makes the observed data most probable.
 Bayesian Estimation: Bayesian estimation provides a distribution for each
parameter, called the posterior distribution. It quantifies the uncertainty about the
parameters by incorporating both the prior belief (prior distribution) and the
likelihood of the data. Instead of a single point estimate, Bayesian estimation yields a
range of possible parameter values with associated probabilities.
4. Use of Prior Information:
 MLE: MLE does not use any prior information about the parameters. It relies solely
on the observed data. The method only focuses on the likelihood of the data, without
considering any external or prior beliefs about the parameters.
 Bayesian Estimation: Bayesian estimation explicitly incorporates prior knowledge
about the parameters through a prior distribution. This prior represents the belief
about the parameter values before observing any data. After incorporating the data
(via the likelihood), Bayes' theorem updates the prior to form the posterior
distribution.
5. Posterior vs Point Estimate:
 MLE: Provides a point estimate of the parameters. It finds the value of the parameter that maximizes the likelihood function:
θ̂_MLE = argmax_θ p(D | θ)
 Bayesian Estimation: Provides a probabilistic distribution for the parameters. The posterior distribution is given by:
p(θ | D) = p(D | θ) · p(θ) / p(D)
From this posterior distribution, you can derive various summaries, such as the mean, mode, or median as a point estimate, but you also have the full distribution to quantify uncertainty.
6. Uncertainty and Confidence:
 MLE: In MLE, uncertainty is typically expressed through the standard errors of the
estimates or confidence intervals. The confidence interval gives a range of values
within which the true parameter value is expected to lie, but it is based on the
assumption of the sampling distribution of the estimator.
 Bayesian Estimation: In Bayesian estimation, uncertainty is directly represented by
the posterior distribution. The range of possible values for each parameter, as well
as the associated probabilities, gives a more intuitive understanding of uncertainty.
For example, a credible interval in Bayesian inference is a range of parameter values
within which the true value lies with a certain probability.
7. Computation:
 MLE: MLE often requires numerical optimization techniques to find the maximum
likelihood estimate. It can be computationally intensive for complex models or high-
dimensional data.
 Bayesian Estimation: Bayesian estimation often requires more complex
computations, such as Markov Chain Monte Carlo (MCMC) sampling, especially
when the posterior distribution does not have a closed form. This can be
computationally demanding, but it provides a full distribution, not just a point
estimate.
8. Flexibility and Robustness:
 MLE: MLE can be limited when the sample size is small or when the likelihood
function is not well-behaved. MLE estimates are sensitive to the data and may lead to
biased estimates if the model is misspecified.
 Bayesian Estimation: Bayesian methods are more flexible and can incorporate
complex prior information. The posterior distribution allows for a more robust
understanding of parameter uncertainty, especially in cases with limited data or when
data are noisy or sparse.
10. Describe the role of prior, likelihood, and posterior in Bayesian statistics.
 The prior represents our beliefs about the model parameters before observing the
data. It encodes the knowledge or assumptions we have about the parameters based on
previous experience, domain expertise, or theoretical considerations. The prior can be
based on historical data, expert opinions, or even a default assumption like uniformity
(if there’s no prior knowledge).
 Purpose: The prior is used to express initial uncertainty about the parameters. It
represents how likely different values of the parameters are before any data is
observed.
 Types of Priors:
o Non-informative (or vague) prior: A prior that assumes minimal knowledge
about the parameter, often used when we want to let the data speak for itself
(e.g., a uniform distribution).
o Informative prior: A prior based on strong prior knowledge about the
parameter, often used when previous data or expert opinion is available (e.g., a
normal distribution with a known mean and variance).
 Mathematical Representation: The prior distribution is denoted as p(θ), where θ represents the model parameters.
2. Likelihood:
 The likelihood represents the probability of the observed data given the parameters of the model. It is the likelihood function p(D | θ), where D is the observed data, and θ represents the parameters of the model.
 Purpose: The likelihood quantifies how likely the observed data is for different
values of the parameters. It is a key component in updating the prior distribution to
the posterior distribution, as it incorporates the new data into the analysis.
 Example: In a coin toss scenario, the likelihood describes the probability of obtaining the observed number of heads (or tails) for a given probability of heads θ.
 Mathematical Representation: The likelihood function is p(D | θ), which is typically derived from a known probability distribution (e.g., normal distribution, binomial distribution) based on the data generation process.
3. Posterior:
 The posterior represents the updated belief about the parameters after observing the data. By Bayes' theorem, it combines the prior and the likelihood:
p(θ | D) = p(D | θ) · p(θ) / p(D)
 Purpose: The posterior is the end product of Bayesian inference; point estimates, credible intervals, and predictions are all derived from it (a small numerical illustration follows below).
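A minimal grid-approximation sketch (assuming NumPy; the coin-toss numbers are illustrative) showing how the prior and the likelihood combine into the posterior for the coin example above:

import numpy as np

theta = np.linspace(0.001, 0.999, 999)       # grid of candidate values for P(heads)
prior = np.ones_like(theta)                  # uniform (non-informative) prior
prior /= prior.sum()

heads, tosses = 7, 10
likelihood = theta**heads * (1 - theta)**(tosses - heads)   # binomial likelihood (up to a constant)

posterior = prior * likelihood               # prior x likelihood
posterior /= posterior.sum()                 # normalize so the probabilities sum to 1

print(f"Posterior mean of theta: {np.sum(theta * posterior):.3f}")
print(f"Posterior mode of theta: {theta[np.argmax(posterior)]:.3f}")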
11. What is a conjugate prior, and why is it useful in Bayesian inference?
In Bayesian statistics, a conjugate prior is a prior distribution that, when combined
with a particular likelihood function, results in a posterior distribution that belongs to
the same family as the prior distribution. In other words, a prior and its corresponding
likelihood are said to be conjugate if the posterior distribution can be expressed in the
same form as the prior distribution. The concept of conjugate priors is very useful
because it simplifies the process of updating beliefs and calculating the posterior
distribution. It allows for easier analytical solutions in Bayesian inference.
12. Explain the role of posterior predictive distributions in Bayesian analysis.
In Bayesian analysis, the posterior predictive distribution plays a crucial role in
making predictions about new, unseen data based on the information learned from the
observed data. It is used to estimate the likelihood of future outcomes, given the data
and the model. This approach combines the Bayesian framework's ability to
incorporate uncertainty into model parameters with the goal of forecasting future
events or observations.
Key Concepts:
1. Bayesian Framework Recap:
o In Bayesian analysis, we start with a prior distribution that represents our
beliefs about the parameters of the model before observing any data.
o After observing the data, we update our beliefs through Bayes’ theorem,
resulting in a posterior distribution that encapsulates our knowledge
about the parameters after observing the data.
2. Posterior Predictive Distribution:
o The posterior predictive distribution represents the distribution of a future or unseen data point y_new given the observed data D. In essence, it accounts for the uncertainty about the parameters of the model (as encapsulated in the posterior distribution) and predicts the future outcomes based on this uncertainty.
Mathematically, it can be expressed as:
p(y_new | D) = ∫ p(y_new | θ) p(θ | D) dθ
Where:
o y_new is the new or future data point we wish to predict,
o θ represents the model parameters,
o p(θ | D) is the posterior distribution of the model parameters, given the observed data D,
o p(y_new | θ) is the likelihood of the new data point given the model parameters θ,
o The integral sums over all possible values of θ, weighted by the posterior distribution.
3. Interpretation of Posterior Predictive Distribution:
o The posterior predictive distribution integrates the uncertainty in the model parameters θ and the inherent variability in the data. This means that it reflects not only the uncertainty in parameter estimation but also the random variation in future observations.
o It provides a distribution for the predicted values of y_new, rather than just a point estimate, allowing for more informed decision-making under uncertainty.
4. Key Uses:
o Predicting Future Data: The posterior predictive distribution allows us to
generate predictions for future observations, given the data we have. This
is particularly valuable when making forecasts in time series data or in
applications where new data is expected to be observed.
o Model Checking and Validation: By generating predictions from the
posterior predictive distribution, we can compare them to actual observed
values. If the predictions do not align with the data, this suggests that the
model might need adjustments or that the assumptions behind the model
are incorrect.
o Quantifying Uncertainty: Instead of providing a single-point prediction,
Bayesian analysis gives a distribution, offering a range of likely outcomes.
This helps quantify the uncertainty in the predictions.
5. Computational Considerations:
o In practice, the posterior predictive distribution is often computed via
Monte Carlo methods (such as Markov Chain Monte Carlo, MCMC).
These methods allow sampling from the posterior distribution and
generating predictions based on these samples. This approach is useful in
complex models where analytical solutions are difficult to obtain.
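In the conjugate Beta-Binomial setting this Monte Carlo approach is especially simple. A minimal sketch (assuming NumPy; the posterior parameters are illustrative) draws θ from the posterior and then simulates new data given each draw:

import numpy as np

rng = np.random.default_rng(5)
alpha_post, beta_post = 9, 5          # posterior Beta(9, 5) from an earlier update

n_draws = 10_000
theta_draws = rng.beta(alpha_post, beta_post, size=n_draws)   # θ ~ p(θ | D)
y_new = rng.binomial(n=10, p=theta_draws)                     # y_new ~ p(y_new | θ)

# The empirical distribution of y_new approximates the posterior predictive p(y_new | D)
values, counts = np.unique(y_new, return_counts=True)
for v, c in zip(values, counts):
    print(f"P(y_new = {v:2d} heads in 10 tosses) ≈ {c / n_draws:.3f}")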