COMPUTATIONAL STATISTICS LABORATORY (BCBL504)
LAB MANUAL
Prepared by: -
VISION: “To impart quality education in engineering and management to meet technological,
business and societal needs through holistic education and research”
MISSION:
K.S. School of Engineering and Management shall,
Establish state-of-the-art infrastructure to facilitate effective dissemination of technical and managerial knowledge.
Provide comprehensive educational experience through a combination of curricular and experiential learning, strengthened by industry-institute interaction.
Pursue socially relevant research and disseminate knowledge.
Inculcate leadership skills and foster entrepreneurial spirit among students.
MISSION:
To deliver high-quality education in the fields of technology and business
through effective teaching-learning practices and a conducive learning
environment.
To create centres of excellence through collaborations with industries and various entities, addressing the evolving demands of society.
To foster an environment that promotes innovation, multidisciplinary research,
skill enhancement and entrepreneurship.
To uphold and advocate for elevated standards of professional ethics and transparency.
K.S. SCHOOL OF ENGINEERING AND MANAGEMENT BENGALURU - 560109
DEPARTMENT OF COMPUTER SCIENCE AND BUSINESS SYSTEMS
COURSE OUTCOMES
CO No.      On completion of this course, students will be able to:                               RBT Level / Cognitive Level
BCBL504.1   Design the experiment for the given problem using statistical methods.                Applying (K3)
BCBL504.2   Develop the solution for the given real-world problem using statistical techniques.   Applying (K3)
BCBL504.3   Analyze the results and produce substantial written documentation.                    Applying (K3)
CO-PO-PSO MAPPING
CO No.      PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2  PSO3
BCBL504.1    3    2    1    -    -    -    2    -    -    -     -     2     2     3     3
BCBL504.2    3    2    2    -    -    -    3    -    -    -     -     2     2     3     3
BCBL504.3    2    3    3    2    3    -    -    -    3    -     -     3     3     2     2
COMPUTATIONAL STATISTICS LABORATORY (BCBL504)
Change of experiment is allowed only once, and 15% of the marks allotted to the procedure part are to be made zero.
The minimum duration of the SEE is 02 hours.
1. Program on data wrangling: Combining and merging data sets, reshaping and
pivoting
pandas provides various methods for combining and comparing Series or DataFrame objects.
1. Merging DataFrames
import pandas as pd
# Creating two DataFrames
df1 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']
})
df2 = pd.DataFrame({
'ID': [1, 2, 4],
'Age': [25, 30, 22]
})
# Merging DataFrames on 'ID'
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)
ID Name Age
0 1 Alice 25
1 2 Bob 30
2. Joining DataFrames
# Setting 'ID' as the index for df1 and joining on the index of df2
df1.set_index('ID', inplace=True)
joined_df = df1.join(df2.set_index('ID'), how='inner')
print(joined_df)
Name Age
ID
1 Alice 25
2 Bob 30
3. Concatenating DataFrames
# Creating another DataFrame
df3 = pd.DataFrame({
    'ID': [5, 6],
    'Name': ['David', 'Eva']
})
# Concatenating df1 (with its 'ID' index restored as a column) and df3
concatenated_df = pd.concat([df1.reset_index(), df3], ignore_index=True)
print(concatenated_df)
ID Name
0 1 Alice
1 2 Bob
2 3 Charlie
3 5 David
4 6 Eva
4. Comparing DataFrames
# Creating another DataFrame for comparison
df4 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']
})
# Comparing DataFrames
comparison = df1.equals(df4.set_index('ID'))
print(f"Are the DataFrames equal? {comparison}")
5. Reshaping Data
Reshaping data typically involves changing the layout of a DataFrame. This can be done
using the melt() function, which transforms a wide format DataFrame into a long format.
Example of Reshaping with melt()
import pandas as pd
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Math': [85, 90, 95],
'Science': [80, 85, 90]
}
df = pd.DataFrame(data)
# Reshaping the DataFrame
melted_df = pd.melt(df, id_vars=['Name'], value_vars=['Math', 'Science'],
var_name='Subject', value_name='Score')
print(melted_df)
6. Pivoting Data
Using pivot()
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Alice', 'Bob'],
'Subject': ['Math', 'Math', 'Science', 'Science'],
'Score': [85, 90, 80, 85]
}
df = pd.DataFrame(data)
# Pivoting the DataFrame
pivoted_df = df.pivot(index='Name', columns='Subject', values='Score')
print(pivoted_df)
Using pivot_table()
The pivot_table() function is more versatile than pivot(), as it allows for aggregation of data.
This is particularly useful when you have duplicate entries for the index/column pairs. You
can specify an aggregation function to summarize the data.
Example:
Let’s modify our previous example to include multiple sales entries for the same product on
the same date:
# Sample data with duplicates
data = {
'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-01'],
'Product': ['A', 'B', 'A', 'B', 'A'],
'Sales': [100, 150, 200, 250, 300]
}
df = pd.DataFrame(data)
# Using pivot_table
pivot_table_df = df.pivot_table(index='Date', columns='Product', values='Sales',
aggfunc='sum')
print(pivot_table_df)
2. Program on string manipulation and regular expressions
String Manipulation
1. Changing Case: Methods such as upper(), lower(), and title() return copies of a string with the case changed, for example:
HELLO WORLD
Hello World
2. Trimming Whitespace: The strip(), lstrip(), and rstrip() methods are useful for
removing unwanted whitespace.
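A small illustration (the sample string is an assumption):
text = "   Hello World   "
print(text.strip())   # strip() removes whitespace from both ends
# text.lstrip() removes only leading whitespace; text.rstrip() removes only trailing whitespace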
Hello World
I love programming
4. Splitting and Joining Strings: You can split a string into a list of substrings using
split(), and join a list of strings into a single string using join().
text = "apple,banana,cherry"
fruits = text.split(",")
print(fruits) # Output: ['apple', 'banana', 'cherry']
Regular Expressions
Regular expressions are a powerful tool for searching and manipulating strings based on
patterns. The re module in Python provides functions to work with regex.
1. Searching for Patterns: The search() function checks if a pattern exists in a string.
import re
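A minimal illustration of search(), with an assumed sample text and pattern:
text = "The rain in Spain"
match = re.search(r"Spain", text)
if match:
    print("Pattern found:", match.group())
else:
    print("Pattern not found")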
2. Finding All Matches: The findall() function returns all occurrences of a pattern in a
string.
text = "Contact us at [email protected] or [email protected]"
emails = re.findall(r'\S+@\S+', text)
print("Found emails:", emails)
3. Replacing Patterns: The sub() function allows you to replace occurrences of a pattern
with a specified string.
text = "My phone number is 123-456-7890"
new_text = re.sub(r'\d{3}-\d{3}-\d{4}', 'XXX-XXX-XXXX', text)
print(new_text)
Output - Replacing Patterns
My phone number is XXX-XXX-XXXX
Validating Patterns: The match() function can be used to check whether an entire string conforms to a pattern, for example a simple email format.
def is_valid_email(email):
    pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
    return re.match(pattern, email) is not None
# Illustrative calls (the addresses are examples)
print(is_valid_email("alice@example.com"))
print(is_valid_email("not-an-email"))
True
False
3. Program on GroupBy mechanics and data formats: time series analysis, multivariate time series and forecasting formats
Time series analysis is a powerful technique used in various fields such as finance, economics, and environmental science. In Python, the pandas library provides robust tools for handling time series data, including the groupby functionality, which allows for efficient data aggregation and transformation.
Here’s a simple example to illustrate how to use groupby with a time series dataset:
import pandas as pd
import numpy as np
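The sample data and the daily grouping can be sketched as follows (random integers are used here, so the values will differ from the output shown below):
# Ten consecutive days of random integer data (illustrative)
dates = pd.date_range('2023-01-01', periods=10, freq='D')
df = pd.DataFrame(np.random.randint(0, 100, size=(10, 2)), columns=['A', 'B'], index=dates)
print("Original DataFrame:")
print(df)
# Group the data by day with resample and compute the mean for each day
daily_mean = df.resample('D').mean()
print("Daily Mean:")
print(daily_mean)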
Original DataFrame:
A B
2023-01-01 83 7
2023-01-02 72 61
2023-01-03 13 5
2023-01-04 0 8
2023-01-05 79 79
2023-01-06 53 11
2023-01-07 4 39
2023-01-08 92 45
2023-01-09 26 74
2023-01-10 52 49
In this example, we create a DataFrame with random data indexed by dates. We then use the
resample method to group the data by day and calculate the mean for each day.
Output -
Daily Mean:
A B
2023-01-01 83.0 7.0
2023-01-02 72.0 61.0
2023-01-03 13.0 5.0
2023-01-04 0.0 8.0
2023-01-05 79.0 79.0
2023-01-06 53.0 11.0
2023-01-07 4.0 39.0
2023-01-08 92.0 45.0
2023-01-09 26.0 74.0
2023-01-10 52.0 49.0
When dealing with multivariate time series, you may have multiple variables that you want to
analyze simultaneously. The same groupby mechanics can be applied, but you can also
visualize the relationships between these variables.
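A sketch of the multivariate case (random values, so the numbers will differ from those shown below):
# Daily temperature and humidity readings (illustrative)
dates = pd.date_range('2023-01-01', periods=10, freq='D')
weather_df = pd.DataFrame({
    'Temperature': np.random.randint(20, 30, size=10),
    'Humidity': np.random.randint(30, 70, size=10)
}, index=dates)
print("Multivariate DataFrame:")
print(weather_df)
# The same resample mechanics give daily means for both variables
print(weather_df.resample('D').mean())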
Multivariate DataFrame:
Temperature Humidity
2023-01-01 23 53
2023-01-02 23 31
2023-01-03 29 36
2023-01-04 22 60
2023-01-05 25 46
2023-01-06 22 56
2023-01-07 23 65
2023-01-08 25 39
2023-01-09 27 43
2023-01-10 22 36
In this code snippet, we create a multivariate DataFrame with temperature and humidity data.
We then apply the same resample method to calculate daily means for both variables.
Output -
Forecasting Formats
Forecasting in time series can be approached using various models, such as ARIMA,
Exponential Smoothing, or machine learning techniques. The statsmodels library provides
tools for implementing these models.
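A minimal sketch of an ARIMA fit and five-day forecast (the series, its name ts, and the ARIMA order are assumptions):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA

# Illustrative daily series
dates = pd.date_range('2023-01-01', periods=30, freq='D')
ts = pd.Series(np.random.randn(30).cumsum() + 50, index=dates)

# Fit an ARIMA(1, 1, 1) model and forecast the next five days
model = ARIMA(ts, order=(1, 1, 1))
fitted = model.fit()
forecast = fitted.forecast(steps=5)

# Plot the observed series and the forecast
plt.figure(figsize=(10, 5))
plt.plot(ts, label='Observed')
plt.plot(forecast, label='Forecast')
plt.xlabel('Date')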
plt.ylabel('Values')
plt.legend()
plt.show()
In this example, we fit an ARIMA model to the time series data and forecast the next five
days. The results are then visualized using matplotlib.
Output -
4. Program to compute mean, median, mode, standard deviation, variance, mean deviation and quartile deviation for a frequency distribution/data
In statistics, understanding the central tendency and measures of dispersion is crucial for analyzing data. Central tendency provides a summary measure that represents the entire dataset, while measures of dispersion indicate the spread or variability of the data. Below, we will explore how to compute these statistics using Python.
Required Libraries
To perform these calculations, we will utilize the numpy and scipy libraries. If you haven't installed them yet, you can do so using pip:
pip install numpy scipy
Sample Data
Let's assume we have a frequency distribution represented as a list of tuples, where each tuple
contains a value and its corresponding frequency. For example:
data = [(1, 5), (2, 10), (3, 15), (4, 20), (5, 10)]
Mean: The mean is calculated as the sum of all values multiplied by their frequencies divided
by the total frequency.
Median: The median is the middle value when the data is sorted. If the number of
observations is even, it is the average of the two middle values.
Mode: The mode is the value that appears most frequently in the dataset.
Standard Deviation: The standard deviation is the square root of the variance, providing a
measure of the average distance from the mean.
Mean Deviation: This is the average of the absolute deviations from the mean.
Quartile Deviation: This is half the difference between the third quartile (Q3) and the first quartile (Q1), i.e. (Q3 - Q1) / 2.
Implementation
import numpy as np
from scipy import stats
# Frequency distribution expanded into individual observations
data = [(1, 5), (2, 10), (3, 15), (4, 20), (5, 10)]
expanded_data = np.array([value for value, freq in data for _ in range(freq)])
# Central Tendency
mean = np.mean(expanded_data)
median = np.median(expanded_data)
mode = stats.mode(expanded_data, keepdims=True)[0][0]  # most frequent value
# Measures of Dispersion
variance = np.var(expanded_data)
std_deviation = np.std(expanded_data)
mean_deviation = np.mean(np.abs(expanded_data - mean))
# Quartiles
Q1 = np.percentile(expanded_data, 25)
Q3 = np.percentile(expanded_data, 75)
quartile_deviation = (Q3 - Q1) / 2
Conclusion
This program effectively calculates the central tendency and measures of dispersion for a
frequency distribution. By utilizing Python's powerful libraries, we can easily perform
statistical analysis, making it a valuable tool for data scientists and analysts. Understanding
these measures allows for better insights into the data, guiding informed decision-making.
Output - Mean, median, mode, standard deviation, variance, mean deviation and quartile deviation for a frequency distribution/data.
Mean: 3.3333333333333335
Median: 3.5
Mode: 4
Variance: 1.3888888888888886
Standard Deviation: 1.178511301977579
Mean Deviation: 0.9999999999999998
Quartile Deviation: 0.625
5. Program to perform cross validation for a given dataset to measure Root Mean
Squared Error (RMSE), Mean Absolute Error (MAE) and R2 Error using
validation set, Leave one out cross-validation (LOOCV) and k-fold cross-
validation approaches.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
# Synthetic regression dataset (the generation parameters here are illustrative assumptions)
X, y = make_regression(n_samples=25, n_features=1, noise=10, random_state=1)
# Initialize model
model = LinearRegression()
# K-fold cross-validation with 5 folds
kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print("K-Fold Metrics:")
    print("RMSE:", np.sqrt(mean_squared_error(y_test, predictions)))
    print("MAE:", mean_absolute_error(y_test, predictions))
    print("R-squared:", r2_score(y_test, predictions))
# Leave-one-out cross-validation: each observation serves as the test set exactly once
loo = LeaveOneOut()
for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print("LOOCV Metrics:")
    print("RMSE:", np.sqrt(mean_squared_error(y_test, predictions)))
    print("MAE:", mean_absolute_error(y_test, predictions))
    print("R-squared:", r2_score(y_test, predictions))  # undefined (nan) for a single test sample
Output - Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) and R2
Error using validation set.
K-Fold Metrics:
RMSE: 11.631373440835564
MAE: 9.567705989063176
R-squared: 0.14034702030864432
K-Fold Metrics:
RMSE: 10.401084827603405
MAE: 8.142428049490906
R-squared: -0.00725242933644954
K-Fold Metrics:
RMSE: 11.333138518405033
MAE: 7.793746967878539
R-squared: 0.0005038144360902663
K-Fold Metrics:
RMSE: 9.662205911675452
MAE: 7.543601870520996
R-squared: 0.2531329275137174
K-Fold Metrics:
RMSE: 9.895247485776258
MAE: 7.894191252365322
R-squared: 0.32917106292669396
LOOCV Metrics:
RMSE: 18.428016888511046
MAE: 18.428016888511046
R-squared: nan
LOOCV Metrics:
RMSE: 17.169373950290414
MAE: 17.169373950290414
R-squared: nan
LOOCV Metrics:
RMSE: 17.276426835940963
MAE: 17.276426835940963
R-squared: nan
6. Program to plot the Normal, Binomial, Poisson and Bernoulli distributions for given parameters
# Parameters
mu = 0        # mean of the Normal distribution
sigma = 1     # standard deviation of the Normal distribution
n = 10        # number of Binomial trials
p = 0.5       # probability of success (Binomial / Bernoulli)
lmbda = 3     # Poisson rate
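The probability values used in the plots below can be computed as follows (a minimal sketch; the helper-function names follow the conclusion of this program, but their exact signatures are assumptions):
import numpy as np
import matplotlib.pyplot as plt
from math import comb, exp, factorial, pi, sqrt

def normal_distribution(x, mu, sigma):
    # PDF of the Normal(mu, sigma) distribution
    return (1 / (sigma * sqrt(2 * pi))) * np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

def binomial_distribution(n, p, k):
    # P(X = k) for a Binomial(n, p) random variable
    return comb(n, k) * (p ** k) * ((1 - p) ** (n - k))

# Normal PDF over a range of x values
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 200)
normal_y = normal_distribution(x, mu, sigma)
# Binomial probabilities for k = 0..n successes
k_values = np.arange(0, n + 1)
binomial_y = [binomial_distribution(n, p, k) for k in range(n + 1)]
# Poisson probabilities for k = 0..14 events
poisson_k_values = np.arange(0, 15)
poisson_y = [exp(-lmbda) * lmbda ** k / factorial(k) for k in range(15)]
# Bernoulli probabilities for the outcomes 0 and 1
bernoulli_k_values = [0, 1]
bernoulli_y = [1 - p, p]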
# Plotting
plt.figure(figsize=(12, 8))
# Normal Distribution
plt.subplot(2, 2, 1)
plt.plot(x, normal_y, label='Normal Distribution', color='blue')
plt.title('Normal Distribution')
plt.xlabel('X')
plt.ylabel('Probability Density')
plt.grid()
# Binomial Distribution
plt.subplot(2, 2, 2)
plt.bar(k_values, binomial_y, label='Binomial Distribution', color='orange')
plt.title('Binomial Distribution')
plt.xlabel('Number of Successes')
plt.ylabel('Probability')
plt.grid()
# Poisson Distribution
plt.subplot(2, 2, 3)
plt.bar(poisson_k_values, poisson_y, label='Poisson Distribution', color='green')
plt.title('Poisson Distribution')
plt.xlabel('Number of Events')
plt.ylabel('Probability')
plt.grid()
# Bernoulli Distribution
plt.subplot(2, 2, 4)
plt.bar(bernoulli_k_values, bernoulli_y, label='Bernoulli Distribution', color='red')
plt.title('Bernoulli Distribution')
plt.xlabel('Outcome')
plt.ylabel('Probability')
plt.xticks(bernoulli_k_values)
plt.grid()
plt.tight_layout()
plt.show()
Conclusion
Normal Distribution: The function normal_distribution computes the PDF for a range
of x values.
Binomial Distribution: The function binomial_distribution calculates the probability
for each number of successes.
7. Program to implement one Sample, Two Sample and Paired-Sample t-test for a
simple data and analyze the results.
import math
# Sample data
sample_data = [2.3, 2.5, 2.8, 3.0, 2.7]
population_mean = 2.5
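A sketch of the one-sample t-statistic computed by hand (consistent with the conclusion below, only the math module is used):
n = len(sample_data)
sample_mean = sum(sample_data) / n
sample_variance = sum((x - sample_mean) ** 2 for x in sample_data) / (n - 1)
# t = (sample mean - population mean) / (s / sqrt(n))
t_one_sample = (sample_mean - population_mean) / math.sqrt(sample_variance / n)
print("One-Sample t-statistic:", t_one_sample)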
# Sample data
sample_data1 = [2.3, 2.5, 2.8, 3.0, 2.7]
sample_data2 = [3.1, 3.3, 3.5, 3.7, 3.6]
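A two-sample sketch; the unpooled (Welch) form of the t-statistic is used here as an assumption:
def mean_and_variance(data):
    m = sum(data) / len(data)
    v = sum((x - m) ** 2 for x in data) / (len(data) - 1)
    return m, v

m1, v1 = mean_and_variance(sample_data1)
m2, v2 = mean_and_variance(sample_data2)
t_two_sample = (m1 - m2) / math.sqrt(v1 / len(sample_data1) + v2 / len(sample_data2))
print("Two-Sample t-statistic:", t_two_sample)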
# Sample data
sample_data1 = [2.3, 2.5, 2.8, 3.0, 2.7]
sample_data2 = [2.1, 2.4, 2.6, 2.9, 2.5]
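A paired-sample sketch, working on the per-pair differences:
differences = [a - b for a, b in zip(sample_data1, sample_data2)]
n = len(differences)
mean_diff = sum(differences) / n
variance_diff = sum((d - mean_diff) ** 2 for d in differences) / (n - 1)
t_paired = mean_diff / math.sqrt(variance_diff / n)
print("Paired-Sample t-statistic:", t_paired)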
Conclusion:
Analysis of Results One-Sample T-Test: The t-statistic indicates how far the sample mean is
from the population mean in terms of standard errors. A higher absolute value suggests a
significant difference.
Two-Sample T-Test: The t-statistic here compares the means of two independent samples. If
the t-statistic is significantly high or low, it suggests that the two groups differ in their means.
Paired Sample T-Test: This test focuses on the differences between paired observations. A
significant t-statistic indicates that the treatment or condition has had an effect.
In conclusion, implementing t-tests without statistical packages allows for a deeper
understanding of the underlying calculations and assumptions. By analyzing the t-statistics
and means, we can draw conclusions about the significance of our sample data.
8. Program to Implement One-Way and Two-way ANOVA test and analyze the
results.
One-Way ANOVA One-way ANOVA is used when comparing the means of three or
more independent groups. The null hypothesis states that all group means are equal.
Calculate the Overall Mean: Find the mean of all data points.
Calculate the Between-Group Variance: This measures how much the group means
deviate from the overall mean.
Calculate the Within-Group Variance: This measures how much the individual data
points deviate from their respective group means.
Calculate the F-statistic: This is the ratio of the between-group variance to the within-
group variance.
import numpy as np
# Illustrative group data: three independent groups (the values are assumptions)
data = [[23, 20, 22], [30, 32, 29], [25, 27, 24]]
# Calculate means
group_means = [np.mean(group) for group in data]
overall_mean = np.mean([x for group in data for x in group])
# Degrees of freedom
df_between = len(data) - 1
df_within = sum(len(group) for group in data) - len(data)
# Mean Squares
ms_between = sum(len(g) * (m - overall_mean) ** 2 for g, m in zip(data, group_means)) / df_between
ms_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(data, group_means)) / df_within
# F-statistic
f_statistic = ms_between / ms_within
print("One-Way ANOVA F-statistic:", f_statistic)
Two-Way ANOVA
factor_A = [[23, 20, 22], [30, 32, 29], [25, 27, 24]]
factor_B = [[25, 30, 28], [22, 20, 21], [27, 29, 26]]
# Calculate means
overall_mean = np.mean([item for sublist in factor_A for item in sublist] + [item for
sublist in factor_B for item in sublist])
# Degrees of freedom
df_A = len(factor_A) - 1
df_B = len(factor_B) - 1
# Mean Squares
# F-statistics
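The remaining mean squares and F-statistics can also be obtained with statsmodels' ANOVA table; a minimal sketch on long-format data (the values and factor labels are illustrative):
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Illustrative long-format data: one observation per combination of the two factors
scores = pd.DataFrame({
    'score':    [23, 20, 22, 30, 32, 29, 25, 27, 24],
    'factor_a': ['a1', 'a1', 'a1', 'a2', 'a2', 'a2', 'a3', 'a3', 'a3'],
    'factor_b': ['b1', 'b2', 'b3', 'b1', 'b2', 'b3', 'b1', 'b2', 'b3'],
})
# Additive two-way ANOVA (no interaction term, since there is one observation per cell)
model = ols('score ~ C(factor_a) + C(factor_b)', data=scores).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)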
9. Program to implement correlation, rank correlation and regression x-y plot and
heat maps of correlation matrices.
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
x = np.random.rand(100)
y = 2 * x + np.random.normal(0, 0.1, 100)  # assumption: y is a noisy linear function of x

def pearson_correlation(x, y):
    n = len(x)
    sum_x = np.sum(x)
    sum_y = np.sum(y)
    sum_x2 = np.sum(x**2)
    sum_y2 = np.sum(y**2)
    sum_xy = np.sum(x * y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = np.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
    return numerator / denominator

correlation = pearson_correlation(x, y)
print("Pearson Correlation Coefficient:", correlation)
def spearman_rank_correlation(x, y):
    # argsort of argsort converts values to ranks (starting at 0)
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    d = rank_x - rank_y
    n = len(x)
    return 1 - (6 * np.sum(d**2)) / (n * (n**2 - 1))

rank_correlation = spearman_rank_correlation(x, y)
print("Spearman Rank Correlation Coefficient:", rank_correlation)
def linear_regression(x, y):
    n = len(x)
    m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
    b = (np.sum(y) - m * np.sum(x)) / n
    return m, b
m, b = linear_regression(x, y)
print(f"Linear Regression: Slope = {m}, Intercept = {b}")
# Scatter plot with the fitted regression line
plt.scatter(x, y, label='Data')
plt.plot(x, m * x + b, color='red', label='Regression line')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.show()
def plot_correlation_matrix(x, y):
    correlation_matrix = np.corrcoef(x, y)
    # Heat map of the 2x2 correlation matrix
    plt.imshow(correlation_matrix, cmap='coolwarm', vmin=-1, vmax=1)
    plt.xticks([0, 1], ['X', 'Y'])
    plt.yticks([0, 1], ['X', 'Y'])
    plt.title('Correlation Matrix Heat Map')
    plt.colorbar()
    plt.show()

plot_correlation_matrix(x, y)
Output - correlation, rank correlation and regression x-y plot and heat maps of
correlation matrices
Pearson Correlation Coefficient: 0.9853103832101713
Spearman Rank Correlation Coefficient: 0.9836063606360637
Linear Regression: Slope = 1.9936935021402027, Intercept = 0.022215107744723496
10. Program to implement PCA for Wisconsin dataset, visualize and analyze the
results.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
# Load the Wisconsin breast cancer dataset (the scikit-learn copy is assumed)
data = load_breast_cancer()
X, y = data.data, data.target
df = pd.DataFrame(X, columns=data.feature_names)
print(df)
# Standardize the features and compute the covariance matrix
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)
cov_matrix = np.cov(X_standardized.T)
print(cov_matrix)
# Eigen decomposition, sorted by decreasing eigenvalue
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
sorted_indices = np.argsort(eigenvalues)[::-1]
eigenvalues_sorted = eigenvalues[sorted_indices]
eigenvectors_sorted = eigenvectors[:, sorted_indices]
# Project the data onto the first two principal components
k = 2
eigenvectors_subset = eigenvectors_sorted[:, :k]
X_pca = X_standardized.dot(eigenvectors_subset)
plt.figure(figsize=(10, 6))
# Scatter plot of the data projected onto the first two principal components
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(label='Class Label')
plt.grid()
plt.show()
11. Program to implement the working of linear discriminant analysis using IRIS
dataset and visualize the result.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
# Load the iris dataset
iris = datasets.load_iris()
X = iris.data  # Features
y = iris.target  # Class labels (0, 1, 2)
print(y)
# Project the data onto two linear discriminants
lda = LDA(n_components=2)
X_lda = lda.fit_transform(X, y)
lda_df = pd.DataFrame(X_lda, columns=['LD1', 'LD2'])
lda_df['target'] = y
# Plotting: one colour per species
plt.figure(figsize=(10, 6))
for target, species in enumerate(iris.target_names):
    subset = lda_df[lda_df['target'] == target]
    plt.scatter(subset['LD1'], subset['LD2'], label=species)
plt.legend(title='Species')
plt.grid()
plt.show()
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
12. Program to implement multiple linear regression using IRIS dataset, visualize
and analyze the results.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load the iris dataset
iris = sns.load_dataset('iris')
print(iris.head())
# Define independent variables (features) and dependent variable (target)
X = iris[['sepal_length', 'sepal_width', 'petal_width']]
y = iris['petal_length']
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
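The evaluation can be printed and visualized, for example with an actual-versus-predicted scatter plot (a minimal sketch):
print("Mean Squared Error:", mse)
print("R-squared:", r2)
# Actual vs predicted petal length
plt.scatter(y_test, y_pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red')  # reference line y = x
plt.xlabel('Actual petal length')
plt.ylabel('Predicted petal length')
plt.title('Multiple Linear Regression: Actual vs Predicted')
plt.show()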
VIVA- QUESTIONS
1. What is computational statistics, and how does it differ from traditional
statistics?
Computational statistics is a field of statistics that focuses on the use of computer-
based techniques and algorithms to solve complex statistical problems and analyze
large datasets. It involves the development and application of computational methods
to implement statistical models, perform simulations, estimate parameters, and
visualize data, often in situations where traditional analytical methods are impractical
due to the complexity or size of the data.
Computational statistics includes the use of numerical methods such as Monte Carlo
simulations, optimization algorithms, bootstrapping, and Markov Chain Monte Carlo
(MCMC) methods. It also encompasses the use of software tools and programming
languages (e.g., R, Python, MATLAB) to carry out statistical analyses and generate
insights from data.
Key Features of Computational Statistics:
Simulation-Based Methods: Use of simulation techniques like Monte Carlo or
bootstrapping to approximate statistical quantities when analytical methods are
difficult or impossible.
High-Dimensional Data: Application of methods to handle large and complex
datasets that are too large for traditional methods to process effectively.
Numerical Optimization: Computational statisticians often work on optimization
problems (e.g., maximizing likelihood functions) to estimate model parameters.
Algorithm Development: Computational statisticians develop new algorithms for
data analysis, statistical inference, and model fitting.
Big Data Analytics: Ability to work with large datasets, using parallel computing or
distributed computing frameworks like Hadoop or Spark.
How Computational Statistics Differs from Traditional Statistics
While traditional statistics focuses on the mathematical theory and analytical
approaches for understanding and interpreting data, computational statistics
emphasizes the use of computers and numerical methods to apply these statistical
concepts to complex problems. The main differences between computational statistics
and traditional statistics are as follows:
1. Nature of Methods:
Traditional Statistics: Relies primarily on closed-form solutions and analytical
methods. For example, statistical tests like the t-test or regression analysis rely on
mathematical formulas to make inferences from data.
Computational Statistics: Uses numerical methods and computer simulations to
solve problems. Techniques like bootstrapping, Markov Chain Monte Carlo
(MCMC), and Monte Carlo simulations allow statisticians to approximate solutions
when analytical methods are impractical.
2. Data Size and Complexity:
Traditional Statistics: Typically works with small to medium-sized datasets, where
statistical assumptions (e.g., normality) can be easily checked and met. Traditional
methods also assume that the data can be summarized with simple statistical measures
like the mean or variance.
such as linear regression, time series analysis, Bayesian models, and more advanced
machine learning algorithms.
Model Validation: They assess and validate models using techniques such as cross-
validation, bootstrapping, and hypothesis testing to ensure accuracy and robustness.
Complex Data: They are skilled at handling complex, high-dimensional data,
including time series, spatial data, and unstructured data like text or images.
2. Algorithm Development:
Developing Computational Algorithms: A computational statistician designs and
implements algorithms to solve statistical problems, such as optimization algorithms
for parameter estimation, sampling methods like Markov Chain Monte Carlo
(MCMC), or simulation techniques like Monte Carlo methods.
Scalability: They develop algorithms that can handle large datasets efficiently,
ensuring computational methods scale to the size of the data without sacrificing
accuracy.
3. Data Cleaning and Preprocessing:
Data Wrangling: They play an important role in cleaning and preprocessing raw
data, handling missing data, outliers, and data transformations to make it suitable for
analysis.
Feature Engineering: They may design new features or variables to improve model
performance and help the machine learning algorithms learn better patterns from the
data.
4. Statistical Inference and Decision Making:
Hypothesis Testing: They conduct statistical hypothesis testing to draw inferences
about populations based on sample data.
Bayesian Inference: Many computational statisticians use Bayesian methods to
perform inference and update beliefs based on data, especially in uncertain or
complex systems.
Uncertainty Quantification: They work with uncertainty in models and data, using
techniques like confidence intervals, credible intervals, and bootstrapping to quantify
uncertainty in predictions.
5. Programming and Software Development:
Coding Skills: Computational statisticians are proficient in programming languages
such as R, Python, MATLAB, or Julia to implement statistical methods and
algorithms.
Automation and Efficiency: They write efficient, reusable code to automate
repetitive tasks and ensure their statistical analyses can be applied to new datasets
with minimal effort.
Software Tools: They may also be involved in developing software tools or libraries
that enable others to apply complex statistical methods easily.
6. Data Visualization and Communication:
Data Visualization: They create clear and effective visualizations (e.g., histograms,
scatter plots, heatmaps) to help stakeholders understand patterns, distributions, and
relationships in the data.
Communicating Results: They must be able to communicate complex statistical
results and insights to non-experts, including business leaders, policymakers, or
researchers, in a clear and actionable manner.
4. How does Monte Carlo simulation work? Can you provide a basic example?
Monte Carlo simulation is a computational technique used to estimate numerical
solutions to problems by relying on random sampling. The method is particularly
useful for solving problems that may be difficult or impossible to solve analytically,
especially in cases involving complex or uncertain systems. Monte Carlo methods are
often used to approximate probabilities, integrals, optimization solutions, and other
numerical quantities.
The general process of Monte Carlo simulation involves:
1. Define the Problem: Identify the problem you want to solve and express it in
terms of random variables. For example, you might want to estimate the expected
value of a function over a certain distribution.
2. Generate Random Samples: Randomly sample values from the input
distributions that define the problem. This is typically done using pseudorandom
number generators.
3. Perform Calculations: Apply the random samples to the problem's equations or
model and compute the corresponding values (e.g., outcomes, costs, risks, etc.).
4. Aggregate Results: After performing the simulation many times (usually
thousands or millions of iterations), aggregate the results (e.g., take the average) to
obtain an estimate of the desired quantity.
5. Analyze the Results: Use the aggregated results to make inferences, such as
estimating probabilities, computing confidence intervals, or performing sensitivity
analysis.
Key Features of Monte Carlo Simulation:
Random Sampling: The method uses random sampling to explore the solution space,
which is especially useful when dealing with high-dimensional or complex problems.
Repetition: The simulation is run many times (often millions of times) to obtain a
large number of samples, which helps to ensure the results are reliable and converge
to the true solution.
Stochastic Nature: Monte Carlo simulations deal with uncertainty by using
probabilistic models and randomness in the inputs.
Basic Example: Estimating the Value of π
One of the classic examples of using Monte Carlo simulation is estimating the value of π with the Monte Carlo method for integration. Here's how you can estimate π using random sampling.
Problem:
You want to estimate the value of π using a circle inscribed in a square. The area of the circle is πr², and the area of the square is (2r)² = 4r² (if we set the radius r = 1).
By randomly generating points in the square and checking how many fall inside the circle, you can estimate the ratio of the areas and thus estimate π.
Steps:
1. Define the Geometry:
o Imagine a square with side length 2, and a circle inscribed in it with a radius of 1. The square's area is 4 and the circle's area is π.
2. Generate Random Points:
o Generate random points with coordinates (x, y) where x and y are uniformly distributed between -1 and 1. These points will fall inside the square.
o To check if a point is inside the circle, use the condition x² + y² ≤ 1. If this condition is true, the point lies inside the circle.
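These steps translate directly into a short simulation (a minimal sketch; the sample size is arbitrary):
import random

def estimate_pi(num_samples=1_000_000):
    inside = 0
    for _ in range(num_samples):
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)
        if x**2 + y**2 <= 1:  # the point falls inside the unit circle
            inside += 1
    # Ratio of areas: circle / square = pi / 4
    return 4 * inside / num_samples

print(estimate_pi())  # prints a value close to 3.1416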
5. Explain the law of large numbers and its significance in computational statistics?
The Law of Large Numbers (LLN) is a fundamental theorem in probability theory and statistics that describes the behavior of the sample mean as the sample size increases. It essentially states that as the sample size n grows, the sample mean X̄_n (the average of a set of observations) will converge to the true population mean μ (the expected value of the random variable).
There are two main forms of the Law of Large Numbers:
1. Weak Law of Large Numbers (WLLN):
o States that for a sequence of independent and identically distributed (i.i.d.) random variables with finite expected value μ and variance σ², the sample mean X̄_n converges in probability to the true population mean μ as the sample size n tends to infinity.
o In other words, for any ε > 0, the probability that the sample mean deviates from the population mean by more than ε approaches zero as n increases: P(|X̄_n − μ| ≥ ε) → 0 as n → ∞.
2. Strong Law of Large Numbers (SLLN):
o A stronger form, which states that the sample mean X̄_n converges almost surely (with probability 1) to the true population mean μ as n becomes large. This means that the probability of the sample mean not converging to μ becomes zero as the number of samples grows.
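A quick simulation illustrates this convergence (an illustrative sketch with an exponential distribution whose true mean is 2.0):
import numpy as np
np.random.seed(42)
samples = np.random.exponential(scale=2.0, size=100_000)
for n in (10, 100, 10_000, 100_000):
    print(f"n = {n:>6}: sample mean = {samples[:n].mean():.4f}")  # approaches 2.0 as n grows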
Significance of the Law of Large Numbers in Computational Statistics
The Law of Large Numbers has several important implications in the field of
computational statistics, particularly in methods like Monte Carlo simulations,
bootstrap resampling, and other statistical estimation techniques:
1. Convergence of Sample Estimates:
o As the sample size increases, the sample mean (or any sample-
based statistic) becomes a more reliable estimate of the true
population parameter. This is foundational in computational
statistics, where we often rely on sample-based estimations to make
inferences about populations when the full data is not available.
o For example, in Monte Carlo simulations, repeated sampling
from a distribution allows the estimation of expected values, and as
more samples are drawn, the estimate becomes more accurate.
2. Improved Accuracy:
o The LLN guarantees that, as the sample size grows, the variance
of the sample mean decreases, meaning that large samples provide
more stable and accurate estimates of population parameters.
o This property is particularly important in statistical inference, as it
ensures that larger datasets lead to more precise and reliable results.
3. Central Limit Theorem (CLT):
o The LLN lays the groundwork for the Central Limit Theorem,
which asserts that the distribution of the sample mean approaches a
normal distribution as the sample size increases, regardless of the
original distribution of the data. This is crucial for many statistical
techniques, such as hypothesis testing and confidence interval
estimation, which rely on the assumption of normality in large
samples.
4. Simulation and Resampling Techniques:
o In methods like bootstrapping and permutation tests, the Law of
Large Numbers ensures that repeated resampling leads to reliable
estimates of confidence intervals, standard errors, and other
statistics.
o Bootstrapping, for instance, relies on repeatedly resampling from
the observed data to simulate the sampling distribution of a
statistic. The more resamples you generate, the closer the
resampled statistic will approximate the true sampling distribution.
5. Computational Efficiency:
o LLN also plays a role in computational efficiency. For example, in
many Monte Carlo simulations or Markov Chain Monte Carlo
(MCMC) methods, we use random sampling to approximate
complex integrals or distributions. As the number of samples
increases, the estimates converge, meaning that computational
resources spent on generating large datasets are more likely to
produce accurate results.
6. Estimation of High-Dimensional Models:
o In high-dimensional statistical models (such as in machine
learning or Bayesian inference), the Law of Large Numbers helps
ensure that the estimates of model parameters (e.g., regression
coefficients, variances) become more accurate as the sample size
increases. This is particularly useful in regularization techniques
(e.g., LASSO), where the sample size is often critical for achieving
reliable parameter estimates.
6. What are the main differences between Monte Carlo methods and
Bootstrapping?
Monte Carlo methods and bootstrapping are both statistical techniques that rely on
random sampling to estimate properties of a distribution or model, but they differ in
their underlying principles, goals, and applications. Here's a breakdown of the main
differences:
1. Purpose and Application
Monte Carlo Methods:
o Monte Carlo methods are used primarily for estimating numerical
quantities (such as expectations, variances, or probabilities) that are
difficult to compute analytically.
o These methods typically involve sampling from a known probability
distribution (often the theoretical distribution of a variable or model) to
estimate values like integrals, means, or variances.
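Bootstrapping, by contrast, resamples the observed data itself rather than a theoretical distribution; a minimal sketch (the data are illustrative):
import numpy as np
np.random.seed(0)
data = np.random.normal(loc=50, scale=5, size=30)  # the observed sample
# Bootstrap estimate of the standard error of the sample mean
boot_means = [np.mean(np.random.choice(data, size=len(data), replace=True)) for _ in range(10_000)]
print("Bootstrap standard error of the mean:", np.std(boot_means))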
7. How does importance sampling work, and why would you use it?
Importance Sampling is a statistical technique used to estimate properties of a
distribution by sampling from a different distribution, called the proposal
distribution, and then re-weighting the samples to reflect the target distribution. It is
particularly useful when direct sampling from the target distribution is difficult or
computationally expensive.
Difficult to Sample from Target Distribution: In many cases, directly sampling from the target distribution p(x) may be hard or computationally expensive. Importance sampling allows you to sample from an easier distribution q(x) and then adjust for the discrepancy between q(x) and p(x).
Rare Events (Heavy-Tailed Distributions): Importance sampling is often used in
problems involving rare events or tail estimation, where the probability of the event of
interest is very small. In such cases, directly sampling from the target distribution
would require an impractically large number of samples. By using a proposal
distribution that places more mass on the rare event, importance sampling can yield
more efficient estimates.
Efficiency: If the proposal distribution is well-chosen and resembles the target
distribution, importance sampling can lead to more efficient estimates than other
methods like Monte Carlo integration or direct simulation. By focusing sampling
on the regions of interest in the target distribution, it reduces the number of samples
needed to achieve a certain level of accuracy.
Variance Reduction: Importance sampling can reduce the variance of estimators, especially when the proposal distribution is designed to focus on the important regions of the target distribution. In cases where certain values of x are more significant than others, this technique can significantly improve the estimation process.
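A minimal sketch of the re-weighting idea: estimating the rare-event probability P(X > 4) for a standard normal target p by sampling from a shifted proposal q (the proposal and sample size are assumptions):
import numpy as np
from scipy.stats import norm

np.random.seed(0)
n = 100_000
# Proposal q = N(4, 1) places much more mass in the region of interest than the target p = N(0, 1)
samples = np.random.normal(loc=4.0, scale=1.0, size=n)
weights = norm.pdf(samples, 0, 1) / norm.pdf(samples, 4, 1)  # p(x) / q(x)
estimate = np.mean((samples > 4) * weights)
print("Importance-sampling estimate:", estimate)
print("Exact value:", 1 - norm.cdf(4))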
MLE: The parameters are treated as fixed but unknown quantities. MLE estimates
the parameters by maximizing the likelihood function, assuming that the true
parameter values exist and can be estimated from the data.
Bayesian Estimation: The parameters are treated as random variables that have a
probability distribution. The Bayesian approach combines prior beliefs with the
likelihood of the data to update the parameters' distribution, resulting in a posterior
distribution that reflects both the prior and the observed data.
3. Estimation Process:
MLE: MLE seeks a point estimate of the parameters that maximizes the likelihood
function. It provides a single estimate for each parameter, but it does not quantify the
uncertainty around these estimates. The objective is to find the value of the parameter
that makes the observed data most probable.
Bayesian Estimation: Bayesian estimation provides a distribution for each
parameter, called the posterior distribution. It quantifies the uncertainty about the
parameters by incorporating both the prior belief (prior distribution) and the
likelihood of the data. Instead of a single point estimate, Bayesian estimation yields a
range of possible parameter values with associated probabilities.
4. Use of Prior Information:
MLE: MLE does not use any prior information about the parameters. It relies solely
on the observed data. The method only focuses on the likelihood of the data, without
considering any external or prior beliefs about the parameters.
Bayesian Estimation: Bayesian estimation explicitly incorporates prior knowledge
about the parameters through a prior distribution. This prior represents the belief
about the parameter values before observing any data. After incorporating the data
(via the likelihood), Bayes' theorem updates the prior to form the posterior
distribution.
5. Posterior vs Point Estimate:
MLE: Provides a point estimate of the parameters. It finds the value of the parameter that maximizes the likelihood function:
θ̂_MLE = argmax_θ p(D | θ)
Bayesian Estimation: Provides a probabilistic distribution for the parameters. The posterior distribution is given by Bayes' theorem:
p(θ | D) = p(D | θ) · p(θ) / p(D)
From this posterior distribution, you can derive various summaries, such as the mean, mode, or median as a point estimate, but you also have the full distribution to quantify uncertainty.
6. Uncertainty and Confidence:
MLE: In MLE, uncertainty is typically expressed through the standard errors of the
estimates or confidence intervals. The confidence interval gives a range of values
within which the true parameter value is expected to lie, but it is based on the
assumption of the sampling distribution of the estimator.
Bayesian Estimation: In Bayesian estimation, uncertainty is directly represented by
the posterior distribution. The range of possible values for each parameter, as well
as the associated probabilities, gives a more intuitive understanding of uncertainty.
For example, a credible interval in Bayesian inference is a range of parameter values
within which the true value lies with a certain probability.
7. Computation:
MLE: MLE often requires numerical optimization techniques to find the maximum
likelihood estimate. It can be computationally intensive for complex models or high-
dimensional data.
Bayesian Estimation: Bayesian estimation often requires more complex
computations, such as Markov Chain Monte Carlo (MCMC) sampling, especially
when the posterior distribution does not have a closed form. This can be
computationally demanding, but it provides a full distribution, not just a point
estimate.
8. Flexibility and Robustness:
MLE: MLE can be limited when the sample size is small or when the likelihood
function is not well-behaved. MLE estimates are sensitive to the data and may lead to
biased estimates if the model is misspecified.
Bayesian Estimation: Bayesian methods are more flexible and can incorporate
complex prior information. The posterior distribution allows for a more robust
understanding of parameter uncertainty, especially in cases with limited data or when
data are noisy or sparse.
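A coin-toss sketch of the contrast (illustrative counts): the MLE is a single point, while the Bayesian answer is a full posterior distribution.
from scipy.stats import beta

heads, tails = 7, 3
# MLE: the point estimate that maximizes the likelihood
p_mle = heads / (heads + tails)
# Bayesian: a uniform Beta(1, 1) prior updated to a Beta posterior
posterior = beta(1 + heads, 1 + tails)
print("MLE estimate:", p_mle)
print("Posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))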
10. Describe the role of prior, likelihood, and posterior in Bayesian statistics.
The prior represents our beliefs about the model parameters before observing the
data. It encodes the knowledge or assumptions we have about the parameters based on
previous experience, domain expertise, or theoretical considerations. The prior can be
based on historical data, expert opinions, or even a default assumption like uniformity
(if there’s no prior knowledge).
Purpose: The prior is used to express initial uncertainty about the parameters. It
represents how likely different values of the parameters are before any data is
observed.
Types of Priors:
o Non-informative (or vague) prior: A prior that assumes minimal knowledge
about the parameter, often used when we want to let the data speak for itself
(e.g., a uniform distribution).
o Informative prior: A prior based on strong prior knowledge about the
parameter, often used when previous data or expert opinion is available (e.g., a
normal distribution with a known mean and variance).
Mathematical Representation: The prior distribution is denoted as p(θ), where θ represents the model parameters.
2. Likelihood:
The likelihood represents the probability of the observed data given the parameters of the model. It is the likelihood function p(D | θ), where D is the observed data and θ represents the parameters of the model.
Purpose: The likelihood quantifies how likely the observed data is for different
values of the parameters. It is a key component in updating the prior distribution to
the posterior distribution, as it incorporates the new data into the analysis.
Example: In a coin toss scenario, the likelihood would describe the probability of obtaining the observed number of heads (or tails) for a given probability of heads θ.
5. Computational Considerations:
o In practice, the posterior predictive distribution is often computed via
Monte Carlo methods (such as Markov Chain Monte Carlo, MCMC).
These methods allow sampling from the posterior distribution and
generating predictions based on these samples. This approach is useful in
complex models where analytical solutions are difficult to obtain.