Statistics for Data Science: Complete Study Guide
Essential Statistical Concepts for Data Analysis

Chapter 1: Descriptive Statistics


1.1 Measures of Central Tendency
Mean (µ): Sum of all values divided by the count. Sensitive to outliers. Formula: µ = Σx/N (population); x̄ = Σx/n (sample)
Median: Middle value when the data are sorted. Robust to outliers. For even n: average of the two middle values
Mode: Most frequently occurring value(s). Can be multimodal. Useful for categorical data
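The three measures can be computed directly with Python's standard library; a minimal sketch (the data values are made up for illustration, with one deliberate outlier):

```python
import statistics

data = [2, 3, 3, 5, 7, 10, 100]  # 100 is an outlier

mean = statistics.mean(data)      # pulled upward by the outlier
median = statistics.median(data)  # robust: middle of the sorted values
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)  # mean far exceeds the median here
```

Note how the single outlier drags the mean to about 18.6 while the median stays at 5, illustrating the robustness claim above.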

1.2 Measures of Dispersion


Variance (σ²): Average squared deviation from the mean
• Population: σ² = Σ(x−µ)²/N
• Sample: s² = Σ(x−x̄)²/(n−1) [Bessel's correction]
Standard Deviation (σ): Square root of the variance. Same units as the data. Follows the 68-95-99.7 rule for normal distributions
Interquartile Range (IQR): Q3 − Q1. Robust measure of spread. Used for outlier detection: outliers fall beyond Q1 − 1.5×IQR or Q3 + 1.5×IQR
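The IQR outlier rule above can be sketched with the standard library (illustrative data; `statistics.quantiles` with its default exclusive method is one of several common quartile conventions, so exact cutoffs can differ slightly from other tools):

```python
import statistics

data = [4, 7, 8, 9, 10, 11, 12, 13, 45]

s2 = statistics.variance(data)  # sample variance, n-1 denominator
s = statistics.stdev(data)      # sample standard deviation

# n=4 splits the data at three cut points: Q1, Q2 (median), Q3
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
```

Here the fences are wide enough to keep every ordinary value and flag only the extreme point 45.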

1.3 Shape Measures


Skewness: Measure of asymmetry
• Positive skew: tail extends right, mean > median
• Negative skew: tail extends left, mean < median
• Zero skew: symmetric distribution
Kurtosis: Measure of tail heaviness
• Leptokurtic: heavy tails, high peak (kurtosis > 3)
• Platykurtic: light tails, flat peak (kurtosis < 3)
• Mesokurtic: normal distribution (kurtosis = 3)
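The mean-versus-median rule of thumb for skew can be checked numerically; a minimal sketch using the moment definition of skewness (the sample and function name are illustrative, and this is the population formula without small-sample corrections):

```python
import statistics

def skewness(xs):
    """Third standardized moment: mean((x - m)^3) / sd^3."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    n = len(xs)
    return sum((x - m) ** 3 for x in xs) / (n * sd ** 3)

right_skewed = [1, 2, 2, 3, 3, 3, 10]  # long right tail
g = skewness(right_skewed)             # positive, and mean > median
```

A symmetric sample such as [1, 2, 3, 4, 5] gives a skewness of exactly zero, matching the "zero skew" case above.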
Chapter 2: Probability Theory
2.1 Basic Probability
Probability Axioms:
1. 0 ≤ P(A) ≤ 1 for any event A
2. P(S) = 1, where S is the sample space
3. For mutually exclusive events: P(A∪B) = P(A) + P(B)
Conditional Probability: P(A|B) = P(A∩B)/P(B)
Bayes' Theorem: P(A|B) = P(B|A)×P(A)/P(B)
Applications: spam filtering, medical diagnosis, A/B testing
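A worked sketch of Bayes' Theorem for the medical-diagnosis application (the prevalence, sensitivity, and false-positive numbers are hypothetical, chosen only to make the arithmetic concrete):

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical test: 1% prevalence, 95% sensitivity, 5% false-positive rate
p_disease = 0.01
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.05

# Law of total probability gives the denominator P(positive)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

p_disease_given_pos = bayes(p_pos_given_disease, p_disease, p_pos)
```

Even with a sensitive test, the low prevalence keeps P(disease | positive) around 16%, a classic illustration of why the prior matters.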

2.2 Probability Distributions


Distribution   Type         Parameters         Use Cases
Bernoulli      Discrete     p (success prob)   Binary outcomes
Binomial       Discrete     n, p               Number of successes in n trials
Poisson        Discrete     λ (rate)           Count of events in an interval
Normal         Continuous   µ, σ               Natural phenomena, CLT
Exponential    Continuous   λ (rate)           Time between events
Chi-Square     Continuous   df                 Goodness-of-fit tests
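The two discrete distributions most used in practice have short closed-form probability mass functions; a minimal sketch from first principles (function names are illustrative):

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(X = k) for Binomial(n, p): C(n,k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for Poisson(lam): lam^k e^(-lam) / k!."""
    return lam**k * exp(-lam) / factorial(k)

# e.g. probability of exactly 2 heads in 4 fair coin flips
p_two_heads = binomial_pmf(2, 4, 0.5)
```

Summing a pmf over all outcomes should give 1, which makes a handy sanity check for either function.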


Chapter 3: Statistical Inference
3.1 Hypothesis Testing
Steps:
1. State the null (H₀) and alternative (H₁) hypotheses
2. Choose a significance level (α, typically 0.05)
3. Select an appropriate test statistic
4. Calculate the p-value or critical value
5. Make a decision: reject H₀ if p-value < α
Types of Errors:
• Type I: Reject a true H₀ (probability = α)
• Type II: Fail to reject a false H₀ (probability = β)
• Power = 1 − β (probability of detecting a true effect)
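The five steps can be walked through end to end with a one-sample z-test, which needs only the normal CDF (expressed via `math.erf`); the sample values and the assumption of a known σ are illustrative:

```python
from math import erf, sqrt
import statistics

def one_sample_z_test(xs, mu0, sigma):
    """Two-sided z-test for the mean when sigma is known."""
    n = len(xs)
    z = (statistics.fmean(xs) - mu0) / (sigma / sqrt(n))
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))); p = 2 * P(Z > |z|)
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

sample = [5.1, 5.3, 4.9, 5.4, 5.2, 5.6, 5.0, 5.3]
z, p = one_sample_z_test(sample, mu0=5.0, sigma=0.3)
reject = p < 0.05  # step 5: reject H0 at alpha = 0.05
```

With this sample the p-value comes out near 0.03, so H₀: µ = 5.0 is rejected at the 5% level; σ unknown in practice would call for the t-test listed below.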

3.2 Common Statistical Tests


Parametric Tests (assume normal distribution):
• t-test: Compare means (one-sample, two-sample, paired)
• ANOVA: Compare means across multiple groups
• Pearson correlation: Linear relationship strength
Non-parametric Tests (distribution-free):
• Mann-Whitney U: Alternative to the two-sample t-test
• Kruskal-Wallis: Alternative to ANOVA
• Spearman correlation: Monotonic relationship strength
• Chi-square test: Independence or goodness of fit

3.3 Confidence Intervals


Interpretation: If we repeated the sampling many times, 100(1−α)% of the constructed intervals would contain the true parameter.
For the mean (known σ): x̄ ± z_(α/2) × σ/√n
For the mean (unknown σ): x̄ ± t_(α/2,df) × s/√n
For a proportion: p̂ ± z_(α/2) × √(p̂(1−p̂)/n)
Width depends on: sample size (n), variability (σ), and confidence level (1−α)
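The known-σ interval above is a one-liner in code; a minimal sketch (the sample values and the known σ = 0.5 are illustrative, and z = 1.96 is the usual 95% critical value):

```python
import statistics
from math import sqrt

def z_confidence_interval(xs, sigma, z=1.96):
    """CI for the mean with known sigma; z=1.96 gives ~95% coverage."""
    xbar = statistics.fmean(xs)
    margin = z * sigma / sqrt(len(xs))
    return xbar - margin, xbar + margin

sample = [10.2, 9.8, 10.1, 10.4, 9.9, 10.0, 10.3, 9.7, 10.2]  # n = 9
lo, hi = z_confidence_interval(sample, sigma=0.5)
```

Quadrupling n would halve the margin, the sample-size dependence noted above.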
Chapter 4: Regression Analysis
4.1 Linear Regression
Simple Linear Regression: y = β₀ + β₁x + ε
• Least squares estimation minimizes Σ(y − ŷ)²
• Assumptions: linearity, independence, normality, homoscedasticity
Multiple Linear Regression: y = β₀ + β₁x₁ + ... + β_p x_p + ε
• Adjusted R² penalizes additional predictors
• Multicollinearity: high correlation between predictors
• Variable selection: forward, backward, stepwise
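For the simple case, the least-squares estimates have a closed form: β̂₁ = Σ(x−x̄)(y−ȳ) / Σ(x−x̄)² and β̂₀ = ȳ − β̂₁x̄. A minimal sketch with made-up data near y = 2x:

```python
import statistics

def fit_simple_ols(xs, ys):
    """Closed-form least squares: slope = cov(x,y)/var(x)."""
    xbar, ybar = statistics.fmean(xs), statistics.fmean(ys)
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    b0 = ybar - b1 * xbar   # intercept from the means
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # roughly y = 2x with noise
b0, b1 = fit_simple_ols(x, y)
```

The fitted slope lands near 2 and the intercept near 0, recovering the generating relationship despite the noise.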

4.2 Model Evaluation


Metrics:
• R²: Proportion of variance explained (0 to 1)
• RMSE: √(Σ(y − ŷ)²/n), same units as y
• MAE: Σ|y − ŷ|/n, robust to outliers
• AIC/BIC: Balance fit and complexity
Diagnostics:
• Residual plots: check assumptions
• Q-Q plots: assess normality
• Cook's distance: identify influential points
• VIF: detect multicollinearity (VIF > 10 is problematic)
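The first three metrics follow directly from the residuals; a minimal sketch (the true and predicted values are made up for illustration):

```python
from math import sqrt

def r2_rmse_mae(y_true, y_pred):
    """R^2, RMSE, and MAE from paired true/predicted values."""
    n = len(y_true)
    resid = [yt - yp for yt, yp in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in resid)            # residual sum of squares
    ybar = sum(y_true) / n
    ss_tot = sum((yt - ybar) ** 2 for yt in y_true)
    r2 = 1 - ss_res / ss_tot                      # variance explained
    rmse = sqrt(ss_res / n)                       # same units as y
    mae = sum(abs(r) for r in resid) / n          # robust to outliers
    return r2, rmse, mae

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 6.5, 9.5]
r2, rmse, mae = r2_rmse_mae(y_true, y_pred)
```

With every residual at ±0.5, RMSE and MAE coincide here; a single large residual would push RMSE above MAE, which is why MAE is called the more robust of the two.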
Chapter 5: Data Science Applications
5.1 A/B Testing
Design Considerations:
• Sample size calculation: depends on effect size, power, and significance level
• Randomization: ensure comparable groups
• Multiple testing correction: Bonferroni, FDR
• Early stopping: sequential testing methods
Common Pitfalls:
• Peeking at results (inflates the Type I error rate)
• Simpson's paradox in segmented analysis
• Novelty effects and seasonality
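The sample-size dependence on effect size, power, and significance can be sketched with the standard normal-approximation formula for comparing two proportions; this is an approximation, and the baseline/lift numbers are illustrative:

```python
from math import ceil

def ab_sample_size(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Approximate per-group n for a two-proportion test.
    Defaults: alpha = 0.05 two-sided (z=1.96), power = 0.80 (z=0.84)."""
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * var / (p1 - p2) ** 2)

# detect a 2-point lift from a 10% baseline conversion rate
n_per_group = ab_sample_size(0.10, 0.12)
```

Note the inverse-square dependence on the effect size: halving the detectable lift roughly quadruples the required sample, which is why underpowered tests are such a common pitfall.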

5.2 Time Series Analysis


Components:
• Trend: long-term direction
• Seasonality: regular periodic patterns
• Cyclic: irregular long-term fluctuations
• Noise: random variation
Methods:
• Moving averages: smoothing
• ARIMA: autoregressive integrated moving average
• Exponential smoothing: weighted averages
• Decomposition: separating the components
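The simplest of the methods above, the moving average, fits in a few lines; a minimal sketch with an invented series containing one noisy spike:

```python
def moving_average(series, window):
    """Simple moving average: mean of each length-`window` slice."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

sales = [10, 12, 11, 13, 40, 12, 14]   # 40 is a noisy spike
smooth = moving_average(sales, window=3)
```

The smoothed series is shorter by window − 1 points and spreads the spike across its neighbors, trading responsiveness for noise reduction; a larger window smooths more at the cost of more lag.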
Quick Reference: Statistical Formulas
Concept                   Formula                        When to Use
Standard Error (mean)     SE = σ/√n                      Sampling distribution of the mean
Z-score                   z = (x − µ)/σ                  Standardization, outlier detection
Correlation               r = Cov(X,Y)/(σ_X σ_Y)         Linear relationship strength
Effect Size (Cohen's d)   d = (µ₁ − µ₂)/σ_pooled         Practical significance
Chi-square statistic      χ² = Σ(O − E)²/E               Categorical data analysis
F-statistic               F = MS_between/MS_within       ANOVA, variance comparison
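Two of the reference formulas, the z-score and Cohen's d with its pooled standard deviation, translate directly to code; a minimal sketch (the example numbers are made up for illustration):

```python
from math import sqrt

def z_score(x, mu, sigma):
    """Standardized distance of x from the mean, in sigma units."""
    return (x - mu) / sigma

def cohens_d(m1, m2, s1, s2, n1, n2):
    """Cohen's d using the pooled standard deviation of two groups."""
    s_pooled = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2)
                    / (n1 + n2 - 2))
    return (m1 - m2) / s_pooled

z = z_score(120, mu=100, sigma=15)        # IQ-style scaling
d = cohens_d(10, 8, 2, 2, 30, 30)         # two-unit mean gap, sd 2
```

With equal group standard deviations the pooled sd reduces to that common value, so the second example gives d = 1, conventionally read as a large effect.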

Remember: Statistics is about understanding uncertainty and making informed decisions from data!
