Pre-Midterm
Visualizations
- Scatter plot – Relation between two values, useful for visualizing associations
- Line plot – Also two values, used for chronological or sequential data
- Bar chart – Categorical data, all bars have the same width; with barh the categories go on the y-axis
- Histogram – Visualizes the distribution of numerical values; each bin includes its lower bound and excludes its upper bound: [lower bound, upper bound), as sketched below
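A minimal sketch of all four chart types, assuming the course's datascience library and a made-up table t (the column names here are hypothetical):

from datascience import Table, make_array

t = Table().with_columns(
    'x', make_array(1, 2, 3, 4),
    'y', make_array(2, 3, 5, 8),
    'category', make_array('a', 'b', 'c', 'd'))

t.scatter('x', 'y')      # scatter plot: association between two numeric variables
t.plot('x', 'y')         # line plot: sequential/chronological data
t.barh('category', 'y')  # bar chart: categories on the y-axis, equal bar widths
t.hist('y')              # histogram: distribution of a numeric variable, bins are [low, high)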
Chance
- Multiplication rule: two events happen together (think and)
- Addition rule: event happens in multiple ways (think or)
- Complement rule: "at least one" of something: P(at least one) = 1 - P(none)
- Probabilities of all outcomes sum to 1, so P(event) = 1 - P(event does not happen); see the worked example below
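A worked example combining the rules (a standard dice setup, not from the notes): the chance of at least one six in four rolls uses the multiplication rule for "no six every time" and the complement rule for "at least one":

p_no_six = (5 / 6) ** 4            # multiplication rule: no six, four times in a row
p_at_least_one_six = 1 - p_no_six  # complement rule: about 0.518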
Hypothesis testing
- Simulate under the null hypothesis (see the sketch below)
- Common test statistics: total variation distance (TVD), absolute difference
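A minimal simulation sketch, assuming the datascience library's sample_proportions and a made-up null hypothesis that a coin is fair over 100 flips:

from datascience import make_array, sample_proportions
import numpy as np

null_probs = make_array(0.5, 0.5)   # null: heads and tails equally likely
test_stats = make_array()
for i in np.arange(10000):
    sim = sample_proportions(100, null_probs)   # simulated [heads, tails] proportions
    one_stat = abs(sim.item(0) * 100 - 50)      # abs difference from the expected 50 heads
    test_stats = np.append(test_stats, one_stat)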
A/B Testing
- 2 groups, usually comparing means
- Test stat is usually the difference (or absolute difference) between the two group means
- Simulate by shuffling the group labels (a permutation test); see the sketch below
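A minimal sketch of one shuffle, assuming a hypothetical table tbl with a 'group' label column and a numeric 'value' column:

import numpy as np

# Sampling all rows without replacement shuffles them; reattach the shuffled labels
shuffled_labels = tbl.sample(with_replacement=False).column('group')
shuffled = tbl.select('value').with_column('shuffled group', shuffled_labels)
group_means = shuffled.group('shuffled group', np.mean).column('value mean')
one_stat = group_means.item(0) - group_means.item(1)   # difference in shuffled-group means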
Bootstrapping
- 95% CI means: if you repeated the whole sample-then-bootstrap process many times, about 95% of the resulting intervals would contain the true parameter (e.g., the population mean)
- Cutting off the two tails (the 2.5th and 97.5th percentiles) ignores the most extreme values
- With a p% p-value cutoff, use a (100 - p)% confidence level; otherwise the level is given
- To get the interval: percentile(lower_bound, bootstrapped_stats) and percentile(upper_bound, bootstrapped_stats), as sketched below
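A minimal sketch, assuming boot_stats is an array of bootstrapped statistics (like the one returned by the bootstrap function at the end of these notes):

from datascience import percentile

left = percentile(2.5, boot_stats)     # lower end of the 95% CI
right = percentile(97.5, boot_stats)   # upper end of the 95% CI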
Post-Midterm
Center and Spread
- Mean: np.mean(array); Median: percentile(50, array)
- SD: np.sqrt(np.mean((array - np.mean(array)) ** 2)), the same value np.std(array) gives (checked in the sketch below)
- Chebyshev's bound: at least 1 - 1/k^2 of the data lies within k SDs of the mean
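A quick sketch checking the by-hand SD against np.std, with made-up numbers:

import numpy as np

array = np.array([2, 3, 3, 9])
sd_by_hand = np.sqrt(np.mean((array - np.mean(array)) ** 2))
print(sd_by_hand == np.std(array))   # True: np.std computes the same formula by default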
Sample means and CLT (central limit theorem)
- CLT: the distribution of sample means is roughly normal, centered at the population mean
- Applies to sums and means; does not work for statistics like the max
- 95% confidence interval: (sample mean - 2 SDs, sample mean + 2 SDs), where SD is the SD of the sample mean
- Width of a 95% CI = 4 x (population SD / sqrt(sample size))
- SD of sample means = population SD / sqrt(sample size); see the simulation sketch below
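A minimal simulation sketch checking the SD formula on a made-up (deliberately non-normal) population:

import numpy as np

rng = np.random.default_rng(0)
pop = rng.exponential(scale=10, size=100_000)   # made-up skewed population
sample_size = 100
sample_means = np.array([rng.choice(pop, sample_size).mean()
                         for i in range(10_000)])
print(np.std(sample_means))                  # close to...
print(np.std(pop) / np.sqrt(sample_size))    # ...population SD / sqrt(sample size)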
Regression
- -1 <= r <= 1; the sign of r matches the sign of the slope
- Residuals show the error: actual - predicted
- RMSE – np.sqrt(np.mean((actual - predicted) ** 2))
- Correlation coefficient – r = np.mean(x_su * y_su), where x_su and y_su are x and y converted to standard units
- Regression line formula
- Predicted y = slope * x + intercept
- Slope = r * np.std(y) / np.std(x)
- Intercept = np.mean(y) - slope * np.mean(x)
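A minimal sketch putting the regression formulas together on made-up arrays:

import numpy as np

def standard_units(arr):
    # how many SDs each value sits above or below the mean
    return (arr - np.mean(arr)) / np.std(arr)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up data
y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])

r = np.mean(standard_units(x) * standard_units(y))   # correlation coefficient
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)
predicted_y = slope * x + intercept                  # regression line predictions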
k-NN Classification
- k-Nearest Neighbors
- Distance between the new point and each training set point: np.sqrt(np.sum((feats_1 - feats_2)**2))
- In two dimensions this is sqrt((x1 - x2)^2 + (y1 - y2)^2)
- k-NN regression
- Use the Euclidean distance to find the k nearest neighbors, then average their y-values: (Neighbor 1 + Neighbor 2 + ... + Neighbor k) / k (see the k-NN sketch below)
- Multiple linear regression
- Predicted y = m1 * x1 + m2 * x2 + b
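A minimal k-NN sketch; train_feats, train_labels, and train_ys are hypothetical NumPy arrays standing in for a training set:

import numpy as np

def distances(train_feats, new_feats):
    # Euclidean distance from the new point to every training point (row-wise)
    return np.sqrt(np.sum((train_feats - new_feats) ** 2, axis=1))

def knn_classify(train_feats, train_labels, new_feats, k):
    nearest = np.argsort(distances(train_feats, new_feats))[:k]
    labels, counts = np.unique(train_labels[nearest], return_counts=True)
    return labels[np.argmax(counts)]     # majority vote among the k nearest

def knn_regress(train_feats, train_ys, new_feats, k):
    nearest = np.argsort(distances(train_feats, new_feats))[:k]
    return np.mean(train_ys[nearest])    # average of the k neighbors' y-values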
Bayes Theorem
- P(A & B happen) = P(A happens) * P(B happens given A happens), same for B & A
- Theorem: P(A|B) = P(A & B) / P(B) = (P(A happens) x P(B happens given A happens)) / P(B); worked example below
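A worked example with made-up numbers (not from the notes): suppose 1% of emails are spam, and a filter flags 90% of spam and 5% of non-spam; the chance a flagged email is actually spam follows straight from the theorem:

p_spam = 0.01                     # P(A)
p_flag_given_spam = 0.90          # P(B given A)
p_flag = p_spam * p_flag_given_spam + (1 - p_spam) * 0.05   # P(B), by total probability
p_spam_given_flag = p_spam * p_flag_given_spam / p_flag     # about 0.154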
Repetition
cool_stats = make_array()
for i in np.arange(n):                # n = number of repetitions
    stat = make_statistic()           # placeholder for computing one statistic
    cool_stats = np.append(cool_stats, stat)

TVD: sum(abs(array1 - array2)) / 2

P-value
np.count_nonzero(test_stats >= observed_ts) / len(test_stats)

Hypothesis testing
def simulate(num_simulations):
    test_stats = make_array()
    for i in np.arange(num_simulations):
        one_test_stat = calculate_statistics()   # placeholder for the test statistic
        test_stats = np.append(test_stats, one_test_stat)
    return test_stats

Bootstrapping
def bootstrap(tbl):
    statistics = make_array()
    for i in np.arange(n):                       # n = number of resamples
        bootstrap_tbl = tbl.sample()             # resample rows with replacement
        statistic = np.mean(bootstrap_tbl.column(0))
        statistics = np.append(statistics, statistic)
    return statistics

Regression prediction
x_su = (x_value - np.mean(x_array)) / np.std(x_array)    # x in standard units
y_su_pred = r * x_su                                     # prediction in standard units
y_pred = y_su_pred * np.std(y_array) + np.mean(y_array)  # convert back to original units