Data Analysis - Formula Sheet
by Fady Morris Ebeid (2021)

Chapter 1: Practical Statistics

1 Descriptive Statistics

1.1 Data Types

• Quantitative: data takes on numeric values that allow us to perform mathematical operations (like the number of dogs).

  Quantitative data can be divided into:

  – Continuous data: can be split into smaller and smaller units, and still a smaller unit exists. An example of this is the age of a dog - we can measure the units of the age in years, months, days, hours, seconds, but there are still smaller units that could be associated with the age.
  – Discrete data: only takes on countable values. The number of dogs we interact with is an example of a discrete data type.

  They can also be classified into:

  – Interval data: numeric values where absolute differences are meaningful (addition and subtraction operations can be made). Examples: year and temperature in Celsius.
  – Ratio data: numeric values where relative differences are meaningful (multiplication and division operations can be made). There must be a meaningful zero point. Examples: document word count and mass in kilograms.

• Categorical (Qualitative): used to label a group or set of items (like dog breeds - Collies, Labs, Poodles, etc.). We can divide categorical data further into two types:

  – Ordinal: data take on a ranked ordering (like a ranked interaction on a scale from Very Poor to Very Good with the dogs). The distances between the categories are unknown.
  – Nominal: data do not have an order or ranking (like the breeds of the dog).

When analyzing categorical variables, we commonly just look at the count or percent of a group that falls into each level of a category.

1.2 Notation

A random variable is a placeholder for the possible values of some process. It represents a column in the dataset. Random variables are represented by capital letters (for example X). Once we observe an outcome of a random variable, we notate it as a lowercase of the same letter (for example x_1).

1.3 Summary Statistics

There are four main aspects to analyzing quantitative data:

1. Measures of Center
2. Measures of Spread
3. The Shape of the data
4. Outliers

Measures of Center

• Mean (often called the average or expected value):

    \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

  where x_i is the i-th data point and n is the number of samples.

• Median: splits our data so that 50% of our values are lower and 50% are higher. In order to compute the median we must sort our values first:

    median(X) = x_{(n+1)/2}                  if n is odd
    median(X) = (x_{n/2} + x_{n/2+1}) / 2    if n is even

• Mode: the most frequently observed value in our dataset. There might be multiple modes for a particular dataset (multimodal), or no mode at all.

Measures of Spread

The 5 number summary:

• Minimum: the smallest number in the dataset.
• Q1 (First Quartile): the value such that 25% of the data fall below.
• Q2 (Second Quartile): the value such that 50% of the data fall below (the median).
• Q3 (Third Quartile): the value such that 75% of the data fall below.
• Maximum: the largest value in the dataset.

Other measures of spread:

• Range: the difference between the maximum and the minimum.
• Interquartile Range (IQR): the difference between Q3 and Q1.
• Standard Deviation: the average distance of each observation from the mean. It represents how far each point in our dataset is from the mean:

    s = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}

  It has the same units as our original data.

• Variance: the average squared difference of each observation from the mean:

    Var(X) = s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2

  It has units that are the square of the units of the original data.
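The measures above map directly onto numpy one-liners. A minimal sketch, assuming a made-up sample and using the 1/n (population) forms to match the formulas above:

    import numpy as np

    x = np.array([2, 3, 3, 5, 7, 8, 9, 12, 13, 21])   # made-up sample

    mean = x.mean()
    median = np.median(x)                              # sorts internally
    vals, counts = np.unique(x, return_counts=True)
    mode = vals[counts.argmax()]                       # most frequent value
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1                                      # Q3 - Q1
    data_range = x.max() - x.min()
    s = x.std(ddof=0)                                  # 1/n form, as above
    var = x.var(ddof=0)                                # square of s

    print(mean, median, mode, data_range, iqr, s, var)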
Shape of the Data

The shape of the data can be investigated using histograms or box plots. The distribution of data can take one of three shapes:

• Symmetric (normally distributed): the mean equals the median of the data.

    \bar{x} = median(X)

  Examples: height, weight, errors, precipitation.

  To know if the data is normally distributed, there are plots called normal quantile-quantile plots and statistical methods like the Kolmogorov-Smirnov test.

• Right skew (positive skew, right tailed): the mean is pulled to the right of the median of the data.

    \bar{x} > median(X)

  Examples: amount of drug remaining in a bloodstream, time between phone calls at a call center, time until a light bulb dies.

• Left skew (negative skew, left tailed): the mean is pulled to the left of the median of the data.

    \bar{x} < median(X)

  Examples: grades as a percentage in many universities, age at death, asset price changes.
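A small sketch of the normality checks just mentioned (a normal Q-Q plot and the Kolmogorov-Smirnov test via scipy); the simulated heights are an illustrative assumption:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(42)
    heights = rng.normal(loc=170, scale=10, size=500)   # symmetric example

    stats.probplot(heights, dist="norm", plot=plt)      # normal Q-Q plot
    plt.show()

    # K-S test against a normal with the sample's own mean and std
    stat, p = stats.kstest(heights, "norm",
                           args=(heights.mean(), heights.std()))
    print(stat, p)

    # quick skew check: a mean above the median suggests right skew
    print(heights.mean() > np.median(heights))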
Outliers

Outliers are points that fall very far from the rest of our data points. They influence measures like the mean and standard deviation much more than they influence measures associated with the five number summary.

Outliers can be identified visually using a histogram, and there are a number of different techniques for identifying them; refer to the paper [Seo06].

When outliers are present we should consider the following points:

1. Note that they exist and note their impact on summary statistics.
2. If they are typos, remove or fix them.
3. Understand why they exist, and their impact on the questions we are trying to answer about our data.
4. Reporting the 5 number summary values is often a better indication than measures like the mean and standard deviation when we have outliers.
5. Be careful in reporting. Know how to ask the right questions.

1.4 General Steps for Working with a Random Variable

1. Plot your data to identify if you have outliers.
2. Handle outliers accordingly via the methods above.
3. If there are no outliers and your data follow a normal distribution, use the mean and standard deviation to describe your dataset, and report that the data are normally distributed.
4. If you have skewed data or outliers, use the five number summary to summarize your data and report the outliers.
1.5 Descriptive vs. Inferential Statistics

Comparison:

1. Descriptive statistics is about describing collected data.
2. Inferential statistics is about using collected data to draw conclusions about a larger population.

We look at specific examples that allow us to identify the:

(a) Population: the entire group of interest.
(b) Parameter: a numeric summary about the population.
(c) Sample: a subset of the population.
(d) Statistic: a numeric summary about a sample.

2 Probability

• Probability of an event: P(A)
• Probability of the opposite event:

    P(A) = 1 - P(\neg A)

• Probability of the occurrence of a composite event n times (independent events):

    P(A, A, ..., A) = P(A) \cdot P(A) \cdot ... \cdot P(A) = P(A)^n
3 Binomial Distribution

The binomial distribution helps us determine the probability of x successes in n Bernoulli trials. The probability mass function of the binomial distribution is given by:

    b(x; n, p) = \binom{n}{x} p^x (1 - p)^{n-x},   x = 0, 1, 2, ..., n
    b(x; n, p) = 0,                                otherwise

where:

    p -> probability of success.
    x -> number of successes.
    n -> total number of trials.
    \binom{n}{x} -> binomial coefficient:

    \binom{n}{x} = \frac{n!}{x!(n - x)!}

Mean:     E(X) = np
Variance: Var(X) = npq,  where q = 1 - p
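A hedged sketch of these formulas with scipy.stats.binom; the values of n, p, and x are arbitrary illustrations:

    from math import comb
    from scipy import stats

    n, p = 10, 0.3
    x = 4

    print(stats.binom.pmf(x, n, p))                 # b(x; n, p)
    print(comb(n, x) * p**x * (1 - p)**(n - x))     # same, from the closed form
    print(stats.binom.mean(n, p))                   # E(X) = np
    print(stats.binom.var(n, p))                    # Var(X) = npq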
4 Conditional Probability

Conditional probability formula:

    P(A|B) = \frac{P(A \cap B)}{P(B)}

[Figure 1.1: Conditional Probability - a Venn diagram of events A and B with their intersection A \cap B.]

Joint probabilities (intersection):

    P(A \cap B) = P(A, B) = P(B) P(A|B)

P(A, B) is the joint probability of A and B. Note that:

    P(A) = P(A \cap B) + P(A \cap \bar{B})
    P(\bar{A}) = P(\bar{A} \cap B) + P(\bar{A} \cap \bar{B})
    P(A|B) + P(\bar{A}|B) = 1
    P(A|\bar{B}) + P(\bar{A}|\bar{B}) = 1
    P(A) + P(\bar{A}) = 1

5 Bayes Rule

    P(A|B) = \frac{P(B|A) P(A)}{P(B)}

P(A|B) is the posterior probability. P(A) is the prior probability.

5.1 Example: Cancer Test Case

Prior probabilities:

    P(C)  = 0.01
    P(¬C) = 0.99

Confusion matrix:

          Pos              Neg
    C     TP: P(Pos|C)     FN: P(Neg|C)      =   0.90   0.10
    ¬C    FP: P(Pos|¬C)    TN: P(Neg|¬C)     =   0.15   0.85

Sensitivity (true positive rate): measures the proportion of positives that are correctly identified, i.e. the proportion of those who have some condition (affected) who are correctly identified as having the condition. Also called the recall.

Specificity (true negative rate): measures the proportion of negatives that are correctly identified, i.e. the proportion of those who do not have the condition (unaffected) who are correctly identified as not having the condition.

Type I error (false positive): "The true fact is that the patients do not have a specific disease, but the physicians judge the patients as ill according to the test reports."

Type II error (false negative): "The true fact is that the disease is actually present, but the test reports provide a falsely reassuring message to patients and physicians that the disease is absent."

[Figure 1.2: Cancer Test Case - Joint Probabilities. A tree diagram splitting the population into C ∩ Pos, C ∩ Neg, ¬C ∩ Pos, and ¬C ∩ Neg.]

Joint probabilities (intersection):

    [P(Pos, C)   P(Neg, C) ]   [P(C)  P(Pos|C)    P(C)  P(Neg|C) ]   [0.009    0.001 ]
    [P(Pos, ¬C)  P(Neg, ¬C)] = [P(¬C) P(Pos|¬C)   P(¬C) P(Neg|¬C)] = [0.1485   0.8415]

Marginal probabilities:

    [P(Pos)  P(Neg)] = [P(Pos, C) + P(Pos, ¬C)    P(Neg, C) + P(Neg, ¬C)] = [0.1575   0.8425]

Posterior probabilities:

    [P(C|Pos)   P(C|Neg) ]   [P(Pos, C)/P(Pos)    P(Neg, C)/P(Neg) ]   [0.0571   0.0012]
    [P(¬C|Pos)  P(¬C|Neg)] = [P(Pos, ¬C)/P(Pos)   P(Neg, ¬C)/P(Neg)] = [0.9429   0.9988]

P(C|Pos) is called the precision.
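The cancer-case arithmetic above can be reproduced with plain numpy; a minimal sketch, with rows C / ¬C and columns Pos / Neg:

    import numpy as np

    prior = np.array([0.01, 0.99])          # P(C), P(¬C)
    likelihood = np.array([[0.90, 0.10],    # P(Pos|C),  P(Neg|C)
                           [0.15, 0.85]])   # P(Pos|¬C), P(Neg|¬C)

    joint = prior[:, None] * likelihood     # joint probabilities
    marginal = joint.sum(axis=0)            # P(Pos), P(Neg)
    posterior = joint / marginal            # Bayes rule, column by column

    print(joint)       # [[0.009  0.001 ] [0.1485 0.8415]]
    print(marginal)    # [0.1575 0.8425]
    print(posterior)   # posterior[0, 0] ~ 0.0571 = P(C|Pos), the precision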
6 Normal Distribution

    N(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
                        = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}

7 Central Limit Theorem

Modeling a coin flip according to the number of flips:

    Single coin flip:          p
    A few coin flips:          binomial distribution,  \binom{n}{k} p^k (1-p)^{n-k}
    Infinite coin flips (∞):   normal distribution,    \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}

7.1 Sampling Distribution

Definition 7.1. A sampling distribution is the distribution of a statistic. It is a distribution formed by samples.

Characteristics of a sampling distribution:

1. The sampling distribution is centered on the original parameter value:

    \mu_M = \mu

2. The sampling distribution's variance depends on the sample size n. It decreases when n increases. If we have a random variable X with a variance of \sigma^2, then the sampling distribution of the sample mean \bar{X} has a variance of

    \sigma_M^2 = \frac{\sigma^2}{n}

3. The standard error is the standard deviation of the sampling distribution:

    standard error = \sqrt{\frac{\sigma^2}{n}}

7.2 Notation

A parameter \theta pertains to a population, while a statistic or estimator \hat{\theta} pertains to a sample.

    \theta      Statistic   Estimator        Description
    \mu         \bar{x}     \hat{\mu}        The mean of a dataset
    \sigma      s           \hat{\sigma}     The standard deviation of a dataset
    \sigma^2    s^2         \hat{\sigma}^2   The variance of a dataset
    \pi         p           \hat{\pi}        The proportion (mean) of a binomial dataset
    \rho        r           \hat{\rho}       The correlation coefficient
    \beta       b           \hat{\beta}      The regression coefficient

A binomial dataset is a dataset with only 0 and 1 values. The parameter, which is a numeric summary of the population, doesn't change, while a statistic changes based on the sample selected from the population.

7.3 The Law of Large Numbers

Theorem 7.1 (The Law of Large Numbers). As our sample size increases, the sample statistic gets closer to the population parameter.

Most common ways of parameter estimation:

1. Maximum Likelihood Estimation
2. Method of Moments Estimation
3. Bayesian Estimation

7.4 The Central Limit Theorem

Theorem 7.2 (The Central Limit Theorem). With a large enough sample size, the sampling distribution of the mean will be normally distributed.

It applies to the following statistics:

1. Sample means (\bar{x}).
2. Sample proportions (p).
3. Difference in sample means (\bar{x}_1 - \bar{x}_2).
4. Difference in sample proportions (p_1 - p_2).

It does not apply to the following statistics:

1. Sample standard deviation s.
2. Correlation coefficient r.
3. Maximum value in the dataset.
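A simulation sketch of Theorem 7.2, under the assumption of a deliberately skewed (exponential) population: the sampling distribution of the mean still behaves as the characteristics above describe.

    import numpy as np

    rng = np.random.default_rng(0)
    population = rng.exponential(scale=2.0, size=100_000)   # skewed population

    n = 50
    sample_means = np.array([rng.choice(population, size=n).mean()
                             for _ in range(10_000)])

    print(population.mean(), sample_means.mean())       # mu_M is close to mu
    print(population.var() / n, sample_means.var())     # sigma^2 / n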
7.5 Bootstrapping

Definition 7.2 (Bootstrapping). Bootstrapping is a technique where we sample from a group with replacement.

• We can use bootstrapping to simulate the creation of a sampling distribution.
• An element can be picked more than once from the dataset.
• The probability of any number in our set stays the same regardless of how many times it has been chosen. Flipping a coin and rolling a die are examples of bootstrapping.

8 Confidence Intervals

Definition 8.1. Confidence intervals provide a range of values that are possible for a population parameter. A confidence interval is the probability that a population parameter will fall between a set of values for a certain proportion of times.

• Confidence intervals can be interpreted as "we are x% confident that the population parameter falls between the bounds of the interval".
• Confidence intervals can be built for different parameters, such as a population mean or a difference in means.
• Common confidence levels are 90%, 95%, 98%, and 99%.
• An important application that uses comparison of means is A/B testing.

8.1 Statistical vs. Practical Significance

Statistical significance: evidence from hypothesis tests and confidence intervals that H_1 is true.

Practical significance: considers real world aspects, not just numbers, in making final conclusions. It takes into account other real world constraints such as space, time, or money.

8.2 Building Confidence Intervals

There are two methods:

• Bootstrapping.
• Traditional methods: these methods are no longer necessary with what is possible with statistics in modern computing. For reference, see the Stat Trek site.
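A minimal bootstrap confidence-interval sketch: resample with replacement, collect the statistic, then take percentile bounds. The sample data is an invented stand-in:

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=67, scale=3, size=200)    # stand-in sample

    boot_means = np.array([
        rng.choice(data, size=data.size, replace=True).mean()
        for _ in range(10_000)
    ])

    lower, upper = np.percentile(boot_means, [2.5, 97.5])   # 95% interval
    print(lower, upper)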
8.3 Other Terms Associated with Confidence Intervals

• The confidence interval width is the difference between the upper and lower bounds of the confidence interval.
• The margin of error is half the confidence interval width.
  Example: "Candidate A has 34% ± 3% of the votes" ⟹ (31%, 37%).

The relationship of sample size and confidence level to the confidence interval:

1. Increasing the sample size n will decrease the width of the confidence interval.
2. Increasing the confidence level (say from 95% to 99%) will increase the width of the confidence interval.

8.4 Confidence Intervals vs. Machine Learning

Confidence intervals are about parameters in a population, while machine learning makes predictions about individual data points.
9 Hypothesis Testing

Hypothesis tests are performed on population parameters, never on statistics, as statistics are values you already know from the data.

Rules for setting up the null and alternative hypotheses:

1. H_0 is the null hypothesis. It is the condition we believe to be true before collecting any data.
2. H_0 usually states that there is no effect or that two groups are equal.
3. H_1 is the alternative hypothesis. It is what we would like to prove to be true.
4. H_0 and H_1 are competing, non-overlapping hypotheses.
5. H_0 contains an equal sign of some kind when it pertains to mathematical ideas: either =, ≤, or ≥.
6. H_1 contains the opposite of the null hypothesis: either ≠, >, or <.
9.1 Types of Errors

Table of error types:

                                 H_0 is False (+)                       H_0 is True (−)
    Reject H_0 (+)               True Positive (1 − β)                  False Positive, Type I error (α)
    Don't reject H_0 (−)         False Negative, Type II error (β)      True Negative (1 − α)

Type I errors:

• False positives.
• The error rate is denoted by α. Commonly it is 1-5%.
• Deciding the alternative (H_1) is true, when actually the null (H_0) is true.
• The worse of the two errors.
• You should set up your null and alternative hypotheses so that the worst of your errors is the Type I error.

Type II errors:

• False negatives.
• The error rate is denoted by β.
• Deciding the null (H_0) is true, when actually the alternative (H_1) is true.

Power of a statistical test: the true positive rate (1 − β). This is the ability to correctly choose the alternative hypothesis (the probability of rejecting a null hypothesis that is false).

9.2 Common Types of Hypothesis Tests

Common hypothesis tests:

1. Testing a population mean (one sample t-test).
2. Testing the difference in means (two sample t-test).
3. Testing the difference before and after some treatment on the same individual (paired t-test).
4. Testing a population proportion (one sample z-test).
5. Testing the difference between population proportions (two sample z-test).

T table: t distribution critical values.

Two-sided test: testing simply if the parameters of two groups are the same or if they are different. The equal case should still be in the null hypothesis. We aren't interested in whether one parameter is greater than another.

9.3 Difference in Means

Notice that the standard deviation of the difference in means is the square root of the sum of the variances of each of the individual sampling distributions:

    \sigma_{diff} = \sqrt{\sigma_{M_1}^2 + \sigma_{M_2}^2}  ⟹  \sigma_{diff}^2 = \sigma_{M_1}^2 + \sigma_{M_2}^2

And the standard deviation of the mean is the standard deviation of the original draws divided by the square root of the sample size taken:

    \sigma_M = \frac{\sigma}{\sqrt{n}}

Thus:

    \sigma_{diff}^2 = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}
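A hedged sketch of test 2 above (the two sample t-test for a difference in means) via scipy; both groups are simulated stand-ins:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    group1 = rng.normal(5.0, 1.0, size=40)    # simulated control
    group2 = rng.normal(5.5, 1.2, size=35)    # simulated treatment

    # Welch's variant (equal_var=False) avoids assuming equal variances
    t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)
    print(t_stat, p_value)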
9.4 P-value

P-value: the conditional probability of observing a test statistic (or more extreme test results) in favor of the alternative hypothesis (H_1), if the null hypothesis (H_0) is true.

• A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis.

    Small p-value ⟶ choose H_1

• It is always the case that when your p-value is large you will end up staying with the null hypothesis as your choice.

    Large p-value ⟶ choose H_0

One-Sided Right-Tail Test

Consider an observed test statistic t from an unknown distribution T. The p-value is the probability of observing a test statistic ≥ t if H_0 were true.

The process is as follows:

1. Simulate the values of your statistic that are possible from the null.
2. Calculate the value of the statistic (t) you actually obtained in your data.
3. Compare your statistic to the values from the null.
4. Calculate the p-value as the proportion of null values that are considered extreme based on your alternative.

    p = P(T ≥ t | H_0 is true)

p-value and Errors

Comparing the p-value to the Type I error threshold (α), the decision about which hypothesis to choose becomes:

• p-value ≤ α ⟹ Reject H_0
• p-value > α ⟹ Fail to reject H_0
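A sketch of the four-step process above for a right-tailed test of a mean; the data and the H_0 value are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(2)
    sample = rng.normal(70.5, 3, size=100)    # hypothetical data
    null_mean = 70                            # H0: mu <= 70 vs H1: mu > 70

    obs = sample.mean()                                       # step 2
    null_vals = rng.normal(null_mean,
                           sample.std() / np.sqrt(sample.size),
                           size=10_000)                       # step 1
    p_value = (null_vals >= obs).mean()                       # steps 3-4
    print(p_value)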
9.5 Other Things to Consider

Impact of Large Sample Size

• With large sample sizes, hypothesis testing leads to even the smallest findings becoming statistically significant (ending up rejecting essentially every null). However, these findings may not be practically significant at all.
• Hypothesis testing takes an aggregate approach towards the conclusions made based on data, as these tests are aimed at understanding population parameters (which are aggregate population values).
• Alternatively, machine learning techniques take an individual approach towards making conclusions, as they attempt to predict an outcome for each specific data point.
Performing More Than One Hypothesis Test

When performing more than one hypothesis test, your Type I error compounds. In order to correct for this, you can use one of the following techniques:

1. Bonferroni correction. A simple, but very conservative, approach: the new Type I error rate should be the error rate you actually want divided by the number of tests you are performing. Let m be the number of tests; then your new error rate becomes:

    \alpha_{new} = \frac{\alpha}{m}

2. Tukey correction.
3. Q-values (popular in medical tests).

How Do Confidence Intervals and Hypothesis Testing Compare

A confidence interval and a two-sided hypothesis test (a test that involves a ≠ in the alternative) are the same in terms of conclusions:

    1 − CI = α

Example: a 95% confidence interval is similar to a hypothesis test with a Type I error rate of α = 0.05.
10 A/B Testing

Definition 10.1 (A/B testing). A/B testing (also known as bucket testing or split-run testing) is a user experience research methodology. A/B tests consist of a randomized experiment with two variants, A and B. It is a way to compare two versions of a single variable, typically by testing a subject's response to variant A against variant B, and determining which of the two variants is more effective.

A/B testing is a form of hypothesis testing where:

• Null hypothesis: the experiment does equally well as, or worse than, the control.
• Alternative hypothesis: the experiment does better than the control.

A/B testing has drawbacks. It can't tell you about options you haven't considered, and it is subject to bias when tested on existing users. There are two types of bias:

• Change aversion: existing users may give an unfair advantage to the old version, simply because they resist change, even if the new version is better in the long run.
• Novelty effect: existing users may give an unfair advantage to the new version, because they are excited by or drawn to the change, even if it isn't any better in the long run.
10.1 Testing Changes on a Web Page

A/B tests are used to test changes on a web page by running an experiment where a control group sees the old version, while the experiment group sees the new version. A metric is then chosen to measure the level of engagement from users in each group. These results are then used to judge whether one version is more effective than the other.

The metric used here is the click-through rate (CTR):

    CTR = \frac{\text{# clicks by unique users}}{\text{# views by unique users}}

Hypotheses:

    H_0: CTR_{new} ≤ CTR_{old}
    H_1: CTR_{new} > CTR_{old}

Steps Taken to Analyze the Results of an A/B Test (a simulation sketch follows at the end of this section):

1. Compute the observed difference between the CTR metrics for the control and experiment groups.
2. Simulate the sampling distribution for the difference in proportions (difference in CTR).
3. Use the sampling distribution to simulate the distribution under the null hypothesis, by creating a random normal distribution centered at 0 with the same standard deviation as the sampling distribution.
4. Compute the p-value by finding the null values in the null distribution that are greater than the observed difference.
5. Compare the p-value to the Type I error threshold α to determine the statistical significance of the observed difference.

Analyzing Multiple Metrics

The more metrics you evaluate, the more likely you are to observe significant differences just by chance. The probability of any false positive increases as you increase the number of metrics (tests). This multiple comparison problem can be solved by several techniques, such as the Bonferroni correction (see section 9.5).

Since the Bonferroni method is too conservative when we expect correlation among metrics, we can better approach this problem with more sophisticated methods, such as:

• The closed testing procedure
• The Boole-Bonferroni bound
• The Holm-Bonferroni method

These are less conservative and take this correlation into account. If you do choose to use a less conservative method, just make sure the assumptions of that method are truly met in your situation, and that you're not just trying to cheat on a p-value. Choosing a poorly suited test just to get significant results will only lead to misguided decisions that harm performance in the long run.

Difficulties in A/B Testing

• Novelty effect and change aversion when existing users first experience a change.
• Sufficient traffic and conversions to have significant and repeatable results.
• Best metric choice for making the ultimate decision (e.g. measuring revenue vs. clicks).
• Long enough run time for the experiment to account for changes in behavior based on time of day/week or seasonal events.
• Practical significance of a conversion rate (the cost of launching a new feature vs. the gain from the increase in conversion).
• Consistency among test subjects in the control and experiment groups (an imbalance in the population represented in each group can lead to situations like Simpson's Paradox).
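The promised sketch of the five analysis steps above, with invented click counts standing in for real experiment data:

    import numpy as np

    rng = np.random.default_rng(3)
    n_old, clicks_old = 4000, 320     # invented control counts
    n_new, clicks_new = 3900, 350     # invented experiment counts

    ctr_old, ctr_new = clicks_old / n_old, clicks_new / n_new
    obs_diff = ctr_new - ctr_old                                  # step 1

    diffs = (rng.binomial(n_new, ctr_new, 10_000) / n_new         # step 2
             - rng.binomial(n_old, ctr_old, 10_000) / n_old)

    null_vals = rng.normal(0, diffs.std(), 10_000)                # step 3
    p_value = (null_vals > obs_diff).mean()                       # step 4
    print(obs_diff, p_value)                                      # step 5: compare to alpha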
11 Regression

Machine learning is split into:

• Supervised learning: predicting the labels of data. Examples: fraud detection, whether customers will buy a product or not, house values.
• Unsupervised learning: clustering unlabeled data together.

11.1 Simple Linear Regression

We compare two quantitative variables to one another, to predict a linear function:

    y = f(x)

y is called the response variable or dependent variable. It is the variable you are interested in predicting.

x is called the explanatory variable or independent variable. It is the variable used to predict the response.

11.2 Covariance

    s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}

11.3 Correlation Coefficient

There are different ways to measure the correlation between two variables [see this link]. For a linear relationship, the most common way to measure correlation is Pearson's correlation coefficient, denoted by r:

    r ∈ [−1, 1]

The correlation coefficient for a sample:

    r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2} \sqrt{\sum (y_i - \bar{y})^2}}
It measures:

• Strength: a relationship can be classified according to strength into:

  – Weak: 0.0 ≤ |r| < 0.3
  – Moderate: 0.3 ≤ |r| < 0.7
  – Strong: 0.7 ≤ |r| ≤ 1.0

• Direction: either negative or positive, depending on the sign.

11.4 Equation of The Line

The line in linear regression has the equation:

    \hat{y} = b_0 + b_1 x_1

where:

    \hat{y} -> the predicted value of the response from the line.
    b_0 -> the intercept: the predicted value of the response when the x-variable is zero (denoted \beta_0 for the population).
    b_1 -> the slope: the predicted change in the response for every one-unit increase in the x-value (denoted \beta_1 for the population).
    x_1 -> the explanatory variable.
    y -> the actual response value for a data point in our dataset (also called the label).

11.5 Fitting a Regression Line

The algorithm used to fit a regression line to a dataset is called the least-squares algorithm; it finds the line that minimizes

    \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Closed form solution:

    \bar{x} = \frac{1}{n} \sum x_i
    \bar{y} = \frac{1}{n} \sum y_i
    s_x = \sqrt{\frac{1}{n-1} \sum (x_i - \bar{x})^2}    (using the Bessel's correction formula)
    s_y = \sqrt{\frac{1}{n-1} \sum (y_i - \bar{y})^2}    (using the Bessel's correction formula)
    r_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2} \sqrt{\sum (y_i - \bar{y})^2}}
    b_1 = r \frac{s_y}{s_x}
    b_0 = \bar{y} - b_1 \bar{x}
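A numpy sketch of the closed-form solution above; x and y are invented linear data, and np.polyfit is used only as a cross-check:

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.uniform(0, 10, size=50)
    y = 3.0 + 2.0 * x + rng.normal(0, 1.5, size=50)   # invented linear data

    s_x, s_y = x.std(ddof=1), y.std(ddof=1)    # Bessel's correction
    s_xy = np.cov(x, y)[0, 1]                  # sample covariance (ddof=1 default)
    r = s_xy / (s_x * s_y)                     # Pearson's r

    b1 = r * s_y / s_x                         # slope
    b0 = y.mean() - b1 * x.mean()              # intercept

    print(r, np.corrcoef(x, y)[0, 1])          # identical values
    print(b1, b0, np.polyfit(x, y, 1))         # polyfit returns [slope, intercept]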
Chapter 2: Data Visualization

There are two main reasons for creating visuals using data:

1. Exploratory analysis is done when you are searching for insights. These visualizations don't need to be perfect. You are using plots to find insights, but they don't need to be aesthetically appealing. You are the consumer of these plots, and you need to be able to find the answer to your questions from them.
2. Explanatory analysis is done when you are providing your results for others. These visualizations need to provide the emphasis necessary to convey your message. They should be accurate, insightful, and visually appealing.

The five steps of the data analysis process:

1. Extract - obtain the data from a spreadsheet, SQL, the web, etc.
2. Clean - here we could use exploratory visuals.
3. Explore - here we use exploratory visuals.
4. Analyze - here we might use either exploratory or explanatory visuals.
5. Share - here is where explanatory visuals live.

1 Design of Visualizations

Visuals can be bad if they:

1. Don't convey the desired message.
2. Are misleading.

Reference: [Huf93].

1.1 Visual Encodings

Humans are able to best understand data encoded with:

• Positional changes (differences in x- and y-position, as in scatterplots).
• Length changes (differences in box heights, as in bar charts and histograms).

Alternatively, humans struggle with understanding data encoded with:

• Color hue changes (as are unfortunately commonly used as an additional variable encoding in scatter plots).
• Area changes (as in pie charts, which often makes them not the best plot choice).

1.2 Chart Junk

Chart junk: all visual elements in charts and graphs that are not necessary to comprehend the information represented on the graph, or that distract the viewer from this information.

Examples of chart junk include:

1. Heavy grid lines
2. Unnecessary text
3. Pictures surrounding the visual
4. Shading or 3D components
5. Ornamented chart axes

1.3 Data-Ink Ratio

The more of the ink in your visual that is related to conveying the message in the data, the better. Limiting chart junk increases the data-ink ratio.

1.4 Design Integrity

Lie factor: the degree to which a visualization distorts or misrepresents the data values being plotted. It is calculated as:

    lie factor = \frac{\Delta visual / visual_{start}}{\Delta data / data_{start}}

It is the relative change shown in the graphic divided by the actual relative change in the data. Ideally, the lie factor should be 1. Any other value means that there is some mismatch in the ratio of depicted change to actual change.

1.5 Using Color

1. Before adding color to a visualization, start with black and white.
2. When using color, use less intense colors - not all the colors of the rainbow, which is the default in many software applications.
3. Color for communication: use color to highlight your message and separate groups of interest. Don't add color just to have color in your visualization.

1.6 Designing for Color Blindness

Stay away from a red-to-green palette and use a blue-to-orange palette instead. Further reading: 5 tips on designing colorblind-friendly visualizations.

1.7 Additional Encodings

We typically use the x- and y-axes to depict the values of variables. If we have more than two variables, we can use other visual encodings:

1. Color and shape for categorical data.
2. Size of marker for quantitative data.

Use additional encodings only when necessary. If the visual gets complicated, consider breaking it into multiple visuals that convey multiple messages.
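A matplotlib sketch of the extra encodings above: x/y position for two numeric variables, color for a categorical third, marker size for a quantitative fourth. All data is simulated for illustration:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(5)
    x, y = rng.normal(size=100), rng.normal(size=100)   # two numeric variables
    group = rng.integers(0, 3, size=100)                # categorical -> color
    weight = rng.uniform(10, 200, size=100)             # quantitative -> size

    plt.scatter(x, y, c=group, s=weight, alpha=0.6)
    plt.colorbar(label="group")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()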
2 Univariate Exploration of Data

Univariate visualizations visualize single variables, using plots such as bar charts, histograms, and line charts.

2.1 Bar Charts

Used to depict the distribution of categorical (qualitative) variables. They can also be used for discrete quantitative data.

2.2 Pie Charts

Guidelines for using a pie chart:

• Make sure that your interest is in relative frequencies.
• Limit the number of slices plotted to two or three, though it's possible to plot four or five slices as long as the wedge sizes can be distinguished.
• Plot the data systematically. Start from the 12 o'clock position of the circle, then plot each categorical level from most frequent to least frequent.

Otherwise, use a bar chart instead.

2.3 Histograms

A histogram is used to plot the distribution of a quantitative (numeric) variable. It is the quantitative version of the bar chart. Values are grouped into continuous bins. When a data value is on a bin edge, it is counted in the bin to its right; the exception is the rightmost bin edge.
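A matplotlib sketch of the two univariate plots above, on invented dog data (the breeds and ages are illustrative assumptions):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(6)
    breeds = rng.choice(["Collie", "Lab", "Poodle"], size=300, p=[0.2, 0.5, 0.3])
    ages = rng.gamma(shape=4, scale=2, size=300)     # a quantitative variable

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    labels, counts = np.unique(breeds, return_counts=True)
    ax1.bar(labels, counts)              # categorical data -> bar chart
    ax1.set_title("Bar chart")

    ax2.hist(ages, bins=20)              # quantitative data -> histogram
    ax2.set_title("Histogram")
    plt.show()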
3 Bivariate Exploration of Data

Bivariate visualizations involve two variables, where the variation in one variable affects the value of the other.

3.1 Scatterplots

Quantitative variable vs. quantitative variable.

If we have a very large number of points, or our numeric variables are discrete-valued, then using a scatterplot won't be informative: the visualization will suffer from overplotting. Overplotting is where the high amount of overlap in points makes it difficult to see the actual relationship between the variables. To make the trends in the data clearer, it is overcome by:

1. Using jitter.
2. Using transparency.

3.2 Heat Maps

Quantitative variable vs. quantitative variable.

A heat map is a 2-D version of the histogram that can be used as an alternative to a scatterplot. Heat maps are good in the following cases:

1. Good for discrete variable vs. discrete variable.
2. Good alternative to transparency for a lot of data.

Correct choice of bin sizes is important.

3.3 Violin Plots

Quantitative variable vs. qualitative (categorical) variable.

For each level of the categorical variable, a distribution of the values of the numeric variable is plotted. The distribution is plotted as a kernel density estimate, like a smoothed histogram.

3.4 Box Plots

Quantitative (numeric) variable vs. qualitative (categorical) variable.

Compared to the violin plot, the box plot is better for displaying the 5 point summary of the data, reporting a set of descriptive statistics:

• The central line in the box indicates the median of the distribution.
• The top and bottom of the box represent the third and first quartiles of the data, respectively.
• The height of the box is the interquartile range (IQR).
• From the top and bottom of the box, the whiskers indicate the range from the first or third quartiles to the minimum or maximum value in the distribution.
• Typically, a maximum range is set on whisker length; by default, this is 1.5 times the IQR.
• Individual points below the lower whisker or above the upper whisker indicate individual outlier points that are more than 1.5 times the IQR below the first quartile or above the third quartile.

Box plots are better than violin plots for explanatory visualizations.

3.5 Clustered Bar Charts

Depict the relationship between two categorical variables. Bars are organized into clusters based on the levels of the first variable, and then bars are ordered consistently across the second variable within each cluster.

3.6 Line Plots

A line plot plots the trend of one numeric variable against values of a second variable. In a line plot, only one point is plotted for every unique x-value or bin of x-values (like a histogram).

Line plots are used instead of bar plots to:

• Emphasize relative change; a zero on the y-axis is not necessary.
• Emphasize trends across x-values.

Time series plot: a line plot where the x-variable represents time. For example, stock or currency charts.

4 Multivariate Exploration of Data

4.1 Non-Positional Encodings for Third Variables

There are four major cases to consider when we want to plot three variables together:

• Three numeric variables
• Two numeric variables and one categorical variable
• One numeric variable and two categorical variables
• Three categorical variables

If we have at least two numeric variables, we use a scatterplot to encode two of the numeric variables, then use a non-positional encoding (like shape, size, and color) for the third variable.

4.2 Color Palettes

There are three major classes of color palette to consider:

• Qualitative: distinct colors, for nominal-type data.
• Sequential: a light-to-dark trend across a single or small range of hues of the same color. Used for categorical ordinal or numeric (quantitative) data types.
• Diverging: used if there is a meaningful zero or center value for the variable. Two sequential palettes with different hues are put back to back, with a common color (usually white or gray) connecting them. One hue indicates values greater than the center point, while the other indicates values smaller than the center.

4.3 Faceting

Faceting allows you to plot multiple simpler plots across levels of one or two other variables.

4.4 Heat Maps

Heat maps can be used for multivariate visualizations, substituting the count with a third variable.

4.5 Plot Matrices

Plot matrices give a high-level look at the pairwise relationships between all variables. In a plot matrix, each row and column represents a different variable.

4.6 Correlation Heat Maps

For numeric variables, this is similar to a correlation matrix. It shows the strength of the relationships between variables.
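A seaborn sketch of a plot matrix (4.5) and a correlation heat map (4.6); it assumes seaborn's bundled iris dataset, which load_dataset downloads on first use:

    import seaborn as sns
    import matplotlib.pyplot as plt

    iris = sns.load_dataset("iris")

    sns.pairplot(iris, hue="species")        # plot matrix
    plt.show()

    corr = iris.drop(columns="species").corr()
    sns.heatmap(corr, annot=True, center=0)  # correlation heat map
    plt.show()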
4.7 Feature Engineering

There may exist two variables that are related in some way. Feature engineering is about creating a new variable with a sum, product, or ratio between original variables that gives insight into research questions.

Example:

    crime incident rate = \frac{\text{crime incidents}}{\text{population totals}}

Another way to perform feature engineering is to divide a numeric variable into ordered bins.
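A pandas sketch of both moves above - a ratio feature and ordered bins; the column names and numbers are hypothetical:

    import pandas as pd

    df = pd.DataFrame({
        "crime_incidents": [120, 45, 300, 80],
        "population": [10_000, 2_500, 40_000, 16_000],
    })

    # ratio between original variables
    df["crime_incident_rate"] = df["crime_incidents"] / df["population"]

    # divide a numeric variable into ordered bins
    df["rate_bin"] = pd.cut(df["crime_incident_rate"], bins=3,
                            labels=["low", "medium", "high"])
    print(df)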
References

[Huf93] Darrell Huff. How to Lie with Statistics. Reissue edition. New York: W. W. Norton & Company, Oct. 1993. ISBN: 978-0-393-31072-6.

[Seo06] Songwon Seo. A Review and Comparison of Methods for Detecting Outliers in Univariate Data Sets. University of Pittsburgh ETD. University of Pittsburgh, Aug. 2006. URL: http://d-scholarship.pitt.edu/7948/ (visited on 01/28/2021).
© 2021 Fady Morris Ebeid
https://github.com/FadyMorris/formula-sheets