Statistical Analysis for Data Science
Statistical analysis is a fundamental component of data science that involves collecting,
exploring, and presenting large amounts of data to discover underlying patterns and trends. It
provides the mathematical foundation for making informed decisions and predictions based on
data.
Descriptive Statistics
Descriptive statistics summarize and quantify the main features of a dataset.
Measures of Central Tendency
Mean: The average of all values in a dataset.
Median: The middle value when data is arranged in order (the mean of the two middle values when the count is even).
Mode: The most frequently occurring value in a dataset.
Measures of Dispersion
Range: The difference between the maximum and minimum values.
Variance: The average squared deviation from the mean.
Standard Deviation: The square root of the variance.
Interquartile Range (IQR): The difference between the third quartile (Q3) and the first quartile (Q1), which spans the middle 50% of the data.
import numpy as np
from scipy import stats
# Sample dataset
data = [12, 15, 17, 19, 20, 22, 25, 25, 27, 28, 30, 32, 35, 37, 40, 42, 45]
# Measures of central tendency
mean = np.mean(data)
median = np.median(data)
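# Recent SciPy releases return a scalar mode for 1-D input, so indexing it directly fails;
# np.atleast_1d keeps this line working on both older and newer versions.
mode = np.atleast_1d(stats.mode(data).mode)[0]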
# Measures of dispersion
data_range = np.max(data) - np.min(data)
variance = np.var(data)  # population variance (NumPy's default, ddof=0)
std_dev = np.std(data)
q75, q25 = np.percentile(data, [75, 25])
iqr = q75 - q25
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print(f"Range: {data_range}")
print(f"Variance: {variance}")
print(f"Standard Deviation: {std_dev}")
print(f"Interquartile Range: {iqr}")
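The variance defined earlier divides by the number of values, which is the population form. When the data is a sample from a larger population, the usual estimate divides by n - 1 instead. The following is a minimal sketch of that difference, reusing the sample dataset from the example; the variable names are illustrative only.

```python
import numpy as np

data = [12, 15, 17, 19, 20, 22, 25, 25, 27, 28, 30, 32, 35, 37, 40, 42, 45]
n = len(data)

# Population variance: mean of the squared deviations (divide by n).
pop_var = np.var(data)

# Sample variance: divide by n - 1 (Bessel's correction), requested with ddof=1.
sample_var = np.var(data, ddof=1)
sample_std = np.std(data, ddof=1)

print(f"Population variance: {pop_var:.2f}")
print(f"Sample variance: {sample_var:.2f}")
print(f"Sample standard deviation: {sample_std:.2f}")
```

The two values differ by a factor of n / (n - 1), so the gap shrinks as the dataset grows.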
Inferential Statistics
Inferential statistics use sample data to make inferences about a larger population.
Hypothesis Testing
The process of making statistical decisions based on experimental data.
Null Hypothesis (H₀): A statement that there is no effect or difference.
Alternative Hypothesis (H₁): A statement that contradicts the null hypothesis.
p-value: The probability of obtaining results at least as extreme as the observed results,
assuming the null hypothesis is true.
Significance Level (α): The threshold, chosen before the analysis (commonly 0.05), below which a p-value leads to rejecting the null hypothesis.
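The terms above fit together in a single decision rule: compute a p-value under H₀ and compare it with α. The sketch below shows one way this might look with a two-sample t-test from SciPy. The measurements are hypothetical values made up for illustration, and Welch's variant (`equal_var=False`) is an assumption chosen here, not something the text prescribes.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements, made up purely for illustration.
control = np.array([20.1, 21.3, 19.8, 22.0, 20.5, 21.1, 19.9, 20.7])
treatment = np.array([22.4, 23.1, 21.9, 24.0, 22.8, 23.5, 22.1, 23.0])

alpha = 0.05  # significance level, fixed before looking at the data

# H0: both groups share the same mean. H1: the means differ (two-sided test).
# equal_var=False selects Welch's t-test, which does not assume equal variances.
result = stats.ttest_ind(treatment, control, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue

print(f"t = {t_stat:.3f}, p = {p_value:.4g}")
if p_value < alpha:
    print("Reject H0 at the chosen significance level.")
else:
    print("Fail to reject H0.")
```

Failing to reject H₀ is not evidence that H₀ is true; it only means the data did not contradict it strongly enough at the chosen α.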
Common Statistical Tests
| Test | Use Case | Example |
|---|---|---|
| t-test | Compare means of two groups | Comparing treatment vs. control group |
| ANOVA | Compare means of three or more groups | Comparing multiple treatment options |
| Chi-square | Test relationships between categorical variables | Analyzing survey responses |
| Correlation | Measure relationship between two variables | Height vs. weight analysis |
| Regression | Predict a dependent variable based on independent variables | Predicting house prices |
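As a rough illustration of the remaining tests listed above, the sketch below calls the corresponding SciPy functions on synthetic data. All values, group sizes, and the contingency table are assumptions invented for the example, and simple linear regression with `linregress` stands in for regression in general.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the synthetic data is reproducible

# ANOVA: compare the means of three synthetic groups.
group_a = rng.normal(10.0, 2.0, size=30)
group_b = rng.normal(11.0, 2.0, size=30)
group_c = rng.normal(12.0, 2.0, size=30)
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)

# Chi-square test of independence on a hypothetical 2x2 table of counts.
table = np.array([[30, 10],
                  [20, 40]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# Correlation and simple linear regression on synthetic x/y pairs.
x = rng.uniform(150, 200, size=50)
y = 0.9 * x - 90 + rng.normal(0, 5, size=50)
r, p_corr = stats.pearsonr(x, y)
fit = stats.linregress(x, y)

print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.4g}")
print(f"Chi-square: chi2 = {chi2:.2f}, dof = {dof}, p = {p_chi2:.4g}")
print(f"Pearson r = {r:.3f}, p = {p_corr:.4g}")
print(f"Regression: y = {fit.slope:.3f} * x + {fit.intercept:.3f}")
```

Each call returns a test statistic and a p-value, which can be compared with a chosen significance level in the same way as for the t-test.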
Probability Distributions
Probability distributions describe the likelihood of different outcomes in an experiment.
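To make this concrete, the following sketch evaluates one discrete and one continuous distribution with `scipy.stats`. The choice of a binomial model for coin flips and a standard normal distribution is an assumption made for illustration.

```python
from scipy import stats

# Discrete: number of heads in 10 fair coin flips, modeled as Binomial(n=10, p=0.5).
coin = stats.binom(n=10, p=0.5)
p_five_heads = coin.pmf(5)    # probability of exactly 5 heads (252/1024)
p_at_most_two = coin.cdf(2)   # probability of 0, 1, or 2 heads (56/1024)

# Continuous: standard normal distribution (mean 0, standard deviation 1).
z = stats.norm(loc=0, scale=1)
within_one_sd = z.cdf(1) - z.cdf(-1)  # probability mass within one standard deviation

# Draw reproducible random samples from the distribution.
samples = z.rvs(size=1000, random_state=0)

print(f"P(exactly 5 heads) = {p_five_heads:.4f}")
print(f"P(at most 2 heads) = {p_at_most_two:.4f}")
print(f"P(-1 < Z < 1) = {within_one_sd:.4f}")
print(f"Sample mean of 1000 draws = {samples.mean():.3f}")
```

Discrete distributions expose a probability mass function (`pmf`), while continuous ones expose a density (`pdf`); both provide `cdf` for cumulative probabilities.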