Final Notes

The document outlines key concepts in statistics, including various visualization techniques such as scatter plots and histograms, as well as probability rules like multiplication and addition. It covers hypothesis testing, A/B testing, bootstrapping, and regression analysis, detailing methods for calculating means, confidence intervals, and regression coefficients. Additionally, it introduces k-NN classification and Bayes Theorem, emphasizing their applications in statistical analysis.


Pre-Midterm

Visualizations
-​ Scatter plot – relation between two numerical variables; useful for visualizing associations
-​ Line plot – also two numerical variables; used for chronological or sequential data
-​ Bar chart – categorical data; all bars have the same width; categories go on the y-axis
-​ Histogram – visualizes the distribution of numerical values; each bin includes its lower bound and excludes its upper bound: [lower bound, upper bound)
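A minimal sketch of the matching calls in the datascience library (the table and its column names 'year', 'sales', and 'region' are made up for illustration):

import numpy as np
from datascience import Table

tbl = Table().with_columns(
    'year', [2019, 2020, 2021, 2022],
    'sales', [10, 12, 9, 15],
    'region', ['N', 'S', 'N', 'S'])

tbl.scatter('year', 'sales')                 # association between two numerical columns
tbl.plot('year', 'sales')                    # sequential / chronological data
tbl.group('region').barh('region')           # counts per category, categories on the y-axis
tbl.hist('sales', bins=np.arange(8, 18, 2))  # distribution; each bin is [lower, upper)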

Chance
-​ Multiplication rule: two events happen together (think "and")
-​ Addition rule: an event can happen in multiple ways (think "or")
-​ Complement rule: P(at least one) = 1 - P(none)
-​ The probabilities of all outcomes sum to 1, so P(every other event) = 1 - P(one event)
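A worked dice example of the three rules (the numbers are mine, not from the notes):

p_two_sixes = (1/6) * (1/6)         # multiplication rule: P(six AND six) = 1/36
p_sum_is_three = 1/36 + 1/36        # addition rule: P(1 then 2, OR 2 then 1) = 2/36
p_at_least_one_six = 1 - (5/6)**2   # complement rule: 1 - P(no sixes) = 11/36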

Hypothesis testing
-​ Simulate the test statistic under the null hypothesis
-​ Test stats include the TVD (for categorical distributions) and the absolute difference (e.g., of means)
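A minimal sketch of one simulated TVD under the null, using the datascience helper sample_proportions; the distributions here are made up:

import numpy as np
from datascience import sample_proportions

null_dist = np.array([0.25, 0.25, 0.50])      # hypothetical null proportions
observed_dist = np.array([0.20, 0.30, 0.50])  # hypothetical observed proportions

def tvd(dist1, dist2):
    # total variation distance between two categorical distributions
    return sum(abs(dist1 - dist2)) / 2

observed_tvd = tvd(observed_dist, null_dist)
one_sample = sample_proportions(1000, null_dist)  # one simulated sample of size 1000
one_simulated_tvd = tvd(one_sample, null_dist)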

A/B Testing
-​ 2 groups, usually comparing means
-​ Test stat is usually the (absolute) difference between the group means
-​ Simulate by shuffling the group labels, as in the sketch below
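A minimal sketch of one shuffle, assuming a hypothetical two-group Table tbl with a 'group' label column and a numerical 'value' column:

import numpy as np

shuffled_labels = tbl.sample(with_replacement=False).column('group')  # permute the labels
shuffled = tbl.with_column('group', shuffled_labels)
group_means = shuffled.group('group', np.mean).column('value mean')
one_test_stat = abs(group_means.item(0) - group_means.item(1))        # abs diff of group means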

Bootstrapping
-​ 95% CI: if you repeated the whole process many times (a new sample, then e.g. 10,000 bootstrap resamples), about 95% of the resulting intervals would contain the true mean
-​ Cutting off the two tails (below the 2.5th and above the 97.5th percentile) ignores the most extreme values
-​ Use a (100 - p)% interval when you have a p% p-value cutoff; otherwise the level is given
-​ To get the interval, take percentile(lower_bound, statistics) and percentile(upper_bound, statistics)
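A minimal sketch of reading off the 95% interval, assuming bootstrap_means holds the means from a bootstrap loop like the one at the end of these notes:

from datascience import percentile, make_array

left = percentile(2.5, bootstrap_means)     # cut off the bottom 2.5%
right = percentile(97.5, bootstrap_means)   # cut off the top 2.5%
ci_95 = make_array(left, right)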

Post-Midterm
Center and Spread
-​ Mean: np.mean(array); Median: percentile(50, array)
-​ SD: np.sqrt(np.mean((array - np.mean(array)) ** 2)), which is what np.std(array) computes
-​ Chebyshev's bound: at least 1 - 1/k² of the data lies within k SDs of the mean
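A small worked example of the SD formula and Chebyshev's bound (the data is made up):

import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])
sd = np.sqrt(np.mean((data - np.mean(data)) ** 2))   # 2.0, same as np.std(data)

k = 2
chebyshev_bound = 1 - 1 / k**2   # at least 75% of the data is within 2 SDs of the mean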

Sample means and CLT (central limit theorem)


-​ CLT says the distribution of sample means is roughly normal and centered at the population mean
-​ It applies to sums and means; it does not work with statistics like the max
-​ 95% confidence interval: (sample mean - 2 SDs, sample mean + 2 SDs), using the SD of the sample mean
-​ Width of the 95% CI ≈ 4 × (population SD / sqrt(sample size))
-​ SD of sample means = population SD / sqrt(sample size)
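A minimal sketch of a CLT-based 95% CI, assuming a numerical array named sample and using the sample SD as a stand-in for the population SD:

import numpy as np

sample_mean = np.mean(sample)
sd_of_sample_means = np.std(sample) / np.sqrt(len(sample))
ci_95 = (sample_mean - 2 * sd_of_sample_means,
         sample_mean + 2 * sd_of_sample_means)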
Regression
-​ -1 ≤ r ≤ 1; the sign of r matches the sign of the slope
-​ Residuals show the error (actual - predicted)
-​ RMSE – np.sqrt(np.mean((actual - predicted) ** 2))
-​ Correlation coefficient – r = np.mean(x_su * y_su), where x_su and y_su are x and y converted to standard units
-​ Regression line formula:
-​ Predicted y = slope * x + intercept
-​ Slope = r × np.std(y) / np.std(x)
-​ Intercept = np.mean(y) - slope * np.mean(x)
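A minimal sketch putting the regression formulas together, assuming hypothetical numerical arrays x and y:

import numpy as np

def standard_units(arr):
    # convert an array to standard units (mean 0, SD 1)
    return (arr - np.mean(arr)) / np.std(arr)

r = np.mean(standard_units(x) * standard_units(y))   # correlation coefficient
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
predicted_y = slope * x + intercept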
k-NN Classification
-​ k-Nearest Neighbors
-​ Distance: np.sqrt(np.sum((feats_1 - feats_2) ** 2)), the Euclidean distance between a new point and each training set point; in 2D: sqrt((x1 - x2)² + (y1 - y2)²)
-​ k-NN regression
-​ Use the Euclidean distance first, then average the values of the k nearest neighbors: (Neighbor 1 + Neighbor 2 + ... + Neighbor k) / k
-​ Multiple linear regression: y = m1 × x1 + m2 × x2 + b
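A minimal sketch of k-NN prediction, assuming a hypothetical training Table with numerical feature columns plus a 'label' column holding the value to predict:

import numpy as np
from datascience import make_array

def distance(feats_1, feats_2):
    # Euclidean distance between two feature arrays
    return np.sqrt(np.sum((feats_1 - feats_2) ** 2))

def knn_predict(training, new_point, k):
    feature_tbl = training.drop('label')
    feats = np.column_stack([feature_tbl.column(lbl) for lbl in feature_tbl.labels])
    dists = make_array()
    for row in feats:
        dists = np.append(dists, distance(row, new_point))
    nearest = training.with_column('distance', dists).sort('distance').take(np.arange(k))
    return np.mean(nearest.column('label'))   # average of the k nearest labels

For classification, take the majority label among the k nearest rows instead of averaging.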

Bayes Theorem
-​ P(A & B) = P(A) × P(B given A); the same holds with A and B swapped
-​ The theorem says:
-​ P(A|B) = P(A & B) / P(B) = P(A) × P(B given A) / P(B)
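A small worked example of the theorem (all numbers are made up):

p_disease = 0.01            # hypothetical prior: 1% of people have the disease
p_pos_given_disease = 0.95  # hypothetical true positive rate
p_pos_given_healthy = 0.10  # hypothetical false positive rate

# addition rule over the two ways a positive test can happen
p_pos = p_disease * p_pos_given_disease + (1 - p_disease) * p_pos_given_healthy

# Bayes: P(disease | positive) = P(disease & positive) / P(positive)
p_disease_given_pos = p_disease * p_pos_given_disease / p_pos   # about 0.088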

Repetition
cool_stats = make_array()
for i in np.arange(n):
    stat = make_statistic()                   # make_statistic: placeholder for computing one statistic
    cool_stats = np.append(cool_stats, stat)

TVD: sum(abs(array1 - array2)) / 2

Hypothesis testing
def simulate(num_simulations):
    test_stats = make_array()
    for i in np.arange(num_simulations):
        one_test_stat = calculate_statistics()   # placeholder for one simulated statistic
        test_stats = np.append(test_stats, one_test_stat)
    return test_stats

P-value
np.count_nonzero(test_stats >= observed_ts) / len(test_stats)

Bootstrapping
def bootstrap(tbl):
    statistics = make_array()
    for i in np.arange(n):
        bootstrap_tbl = tbl.sample()              # resample with replacement, same size as tbl
        statistic = np.mean(bootstrap_tbl.column(0))
        statistics = np.append(statistics, statistic)
    return statistics

Regression prediction
x_su = (x_value - np.mean(x_array)) / np.std(x_array)     # x in standard units
y_su_pred = r * x_su                                      # prediction in standard units
y_pred = y_su_pred * np.std(y_array) + np.mean(y_array)   # back to original units
