Data Analysis Cheatsheet
Data Analysis Formula Sheet
Fady Morris Ebeid (2021)

Chapter 1: Practical Statistics

1 Descriptive Statistics

1.1 Data Types

• Quantitative: data takes on numeric values that allow us to perform mathematical operations (like the number of dogs).

  Quantitative data can be divided into:

  – Continuous data: can be split into smaller and smaller units, and a still smaller unit exists. An example of this is the age of a dog - we can measure the age in years, months, days, hours, seconds, but there are still smaller units that could be associated with the age.

  – Discrete data: only takes on countable values. The number of dogs we interact with is an example of a discrete data type.

  They can also be classified into:

  – Interval data: numeric values where absolute differences are meaningful (addition and subtraction operations can be made). Examples: year and temperature in Celsius.

  – Ratio data: numeric values where relative differences are meaningful (multiplication and division operations can be made). There must be a meaningful zero point. Examples: document word count and mass in kilograms.

• Categorical (Qualitative): used to label a group or set of items (like dog breeds - Collies, Labs, Poodles, etc.). We can divide categorical data further into two types:

  – Ordinal: data take on a ranked ordering (like a ranked interaction on a scale from Very Poor to Very Good with the dogs). The distances between the categories are unknown.

  – Nominal: data do not have an order or ranking (like the breeds of the dogs).

When analyzing categorical variables, we commonly just look at the count or percent of a group that falls into each level of a category.
1.2 Notation

A random variable is a placeholder for the possible values of some process. It represents a column in the dataset.
Random variables are represented by capital letters (for example X). Once we observe an outcome of a random variable, we notate it as a lower case of the same letter (for example x_1).

1.3 Summary Statistics

There are four main aspects to analyzing quantitative data:

1. Measures of Center

2. Measures of Spread

3. The Shape of the data

4. Outliers

Measures of Center

• Mean (often called average or expected value):

  \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

  Where:
  x_i → the ith data point.
  n → the number of samples.

• Median: the median splits our data so that 50% of our values are lower and 50% are higher.

  \mathrm{median}(X) =
  \begin{cases}
    x_{(n+1)/2}, & \text{if } n \text{ is odd} \\
    \dfrac{x_{n/2} + x_{n/2+1}}{2}, & \text{if } n \text{ is even}
  \end{cases}

  In order to compute the median we must sort our values first.

• Mode: the most frequently observed value in our dataset. There might be multiple modes for a particular dataset (multimodal), or no mode at all.

Measures of Spread

The 5 number summary:

• Minimum: the smallest number in the dataset.

• Q1 (First Quartile): the value such that 25% of the data fall below it.

• Q2 (Second Quartile): the value such that 50% of the data fall below it (the median).

• Q3 (Third Quartile): the value such that 75% of the data fall below it.

• Maximum: the largest value in the dataset.

Measures of spread:

• Range: the difference between the maximum and the minimum.

• Interquartile Range (IQR): the difference between Q3 and Q1.

• Standard Deviation: the average distance of each observation from the mean. It represents how far each point in our dataset is from the mean.

  s = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}

  It has the same units as our original data.

• Variance: the average squared difference of each observation from the mean.

  \mathrm{Var}(X) = s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2

  It has units that are the square of the units of the original data.
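As a quick reference, here is a minimal Python sketch of these statistics, assuming NumPy is available; the `data` array is a made-up sample, and the variance/standard deviation use the 1/n form shown above:

```python
import numpy as np
from collections import Counter

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])          # hypothetical sample

mean = data.mean()                                  # x-bar
median = np.median(data)                            # sorts the values internally
mode = Counter(data.tolist()).most_common(1)[0][0]  # most frequent value
variance = data.var()                               # divides by n, as in the sheet
std_dev = data.std()                                # same units as the data
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                                       # interquartile range
value_range = data.max() - data.min()

print(mean, median, mode, std_dev, variance, iqr, value_range)
```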
Shape of the Data

The shape of the data can be investigated using histograms or box plots. The distribution of data can take one of three shapes:

• Symmetric: normally distributed. The mean equals the median of the data.

  \bar{x} = \mathrm{median}(X)

  Examples: height, weight, errors, precipitation.
  To know if the data are normally distributed, there are plots called normal quantile-quantile plots and statistical methods like the Kolmogorov-Smirnov test.

• Right skew (positive skew - right tailed): the mean is pulled to the right of the median of the data.

  \bar{x} > \mathrm{median}(X)

  Examples: amount of drug remaining in a bloodstream, time between phone calls at a call center, time until a light bulb dies.

• Left skew (negative skew - left tailed): the mean is pulled to the left of the median of the data.

  \bar{x} < \mathrm{median}(X)

  Examples: grades as a percentage in many universities, age at death, asset price changes.
Outliers

Outliers are points that fall very far from the rest of our data points.
They influence measures like the mean and standard deviation much more than measures associated with the five number summary.
Outliers can be identified visually using a histogram, and there are a number of different techniques for identifying them; refer to the [Seo06] paper.

When outliers are present we should consider the following points:

1. Noting they exist and the impact on summary statistics.

2. If they are typos - remove or fix them.

3. Understanding why they exist, and the impact on questions we are trying to answer about our data.

4. Reporting the 5 number summary values is often a better indication than measures like the mean and standard deviation when we have outliers.

5. Be careful in reporting. Know how to ask the right questions.
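One common technique (among those surveyed in [Seo06]) is the 1.5 × IQR fence also used by box plots. A minimal sketch, assuming NumPy; the data array and the planted outlier are fabricated:

```python
import numpy as np

data = np.array([8, 9, 10, 10, 11, 12, 13, 42])   # 42 is a planted outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]
print(outliers)   # -> [42]
```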
1.4 General Steps for Working with a Random Variable

1. Plot your data to identify if you have outliers.

2. Handle outliers accordingly via the methods above.

3. If there are no outliers and your data follow a normal distribution - use the mean and standard deviation to describe your dataset, and report that the data are normally distributed.

4. If you have skewed data or outliers, use the five number summary to summarize your data and report the outliers.

1.5 Descriptive vs. Inferential Statistics

Comparison:

1. Descriptive statistics is about describing collected data.

2. Inferential statistics is about using collected data to draw conclusions about a larger population.

We look at specific examples that allow us to identify the:

(a) Population: the entire group of interest.

(b) Parameter: a numeric summary about the population.

(c) Sample: a subset of the population.

(d) Statistic: a numeric summary about a sample.
2 Probability

• Probability of an event: P(A)

• Probability of the opposite event:

  P(A) = 1 - P(\neg A)

• Probability of the occurrence of a composite event n times (independent events):

  P(A, A, \ldots, A) = P(A) \cdot P(A) \cdot \ldots \cdot P(A) = P(A)^n

3 Binomial Distribution

The Binomial Distribution helps us determine the probability of x successes in n Bernoulli trials.
The probability mass function of the binomial distribution is given by:

  b(x; n, p) =
  \begin{cases}
    \binom{n}{x} p^x (1-p)^{n-x}, & x = 0, 1, 2, \ldots, n \\
    0, & \text{otherwise}
  \end{cases}

Where:

  p → probability of success.
  x → number of successes.
  n → total number of trials.
  \binom{n}{x} → binomial coefficient,

  \binom{n}{x} = \frac{n!}{x!\,(n-x)!}

Mean: E(X) = np
Variance: Var(X) = npq, where q = 1 - p.
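A minimal sketch of these formulas, assuming SciPy is installed; the parameters (10 trials, p = 0.3) are made up for illustration:

```python
from scipy import stats

n, p = 10, 0.3                   # hypothetical trials and success probability
x = 4

dist = stats.binom(n, p)
print(dist.pmf(x))               # b(4; 10, 0.3), probability of exactly 4 successes
print(dist.mean(), dist.var())   # np = 3.0 and npq = 2.1
```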
4 Conditional Probability

Conditional probability formula:

  P(A|B) = \frac{P(A \cap B)}{P(B)}

[Figure 1.1: Conditional Probability - Venn diagram of events A and B with their intersection A ∩ B.]

Joint Probabilities (Intersection):

  P(A \cap B) = P(A, B) = P(B)\,P(A|B)

P(A, B) is the joint probability of A and B.

[Probability tree: the first level branches to B and B̄ with probabilities P(B) and P(B̄); each then branches to A and Ā with conditional probabilities P(A|B), P(Ā|B), P(A|B̄), P(Ā|B̄), giving the joint probabilities P(A∩B), P(Ā∩B), P(A∩B̄), P(Ā∩B̄).]

Note that:

  P(A) = P(A \cap B) + P(A \cap \bar{B})
  P(\bar{A}) = P(\bar{A} \cap B) + P(\bar{A} \cap \bar{B})

  P(A|B) + P(\bar{A}|B) = 1
  P(A|\bar{B}) + P(\bar{A}|\bar{B}) = 1

  P(A) + P(\bar{A}) = 1

5 Bayes Rule

  P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}

P(A|B) is the posterior probability.
P(A) is the prior probability.

5.1 Example: Cancer Test Case

Prior probabilities:

  \begin{pmatrix} P(C) \\ P(\neg C) \end{pmatrix}
  =
  \begin{pmatrix} 0.01 \\ 0.99 \end{pmatrix}

Confusion Matrix:

        Pos   Neg
  C     TP    FN
  ¬C    FP    TN

  \begin{pmatrix} P(Pos|C) & P(Neg|C) \\ P(Pos|\neg C) & P(Neg|\neg C) \end{pmatrix}
  =
  \begin{pmatrix} 0.9 & 0.1 \\ 0.15 & 0.85 \end{pmatrix}

Sensitivity (True Positive rate): measures the proportion of positives that are correctly identified (i.e. the proportion of those who have some condition (affected) who are correctly identified as having the condition). Also called the recall.

Specificity (True Negative rate): measures the proportion of negatives that are correctly identified (i.e. the proportion of those who do not have the condition (unaffected) who are correctly identified as not having the condition).

Type I error (false positive): "The true fact is that the patients do not have a specific disease but the physicians judge the patients as ill according to the test reports."

Type II error (false negative): "The true fact is that the disease is actually present but the test reports provide a falsely reassuring message to patients and physicians that the disease is absent."

[Figure 1.2: Cancer Test Case - Joint Probabilities. A probability tree branching from C and ¬C to Pos and Neg, giving C∩Pos, C∩Neg, ¬C∩Pos, ¬C∩Neg.]
Joint probabilities:

  \begin{pmatrix} P(Pos, C) & P(Neg, C) \\ P(Pos, \neg C) & P(Neg, \neg C) \end{pmatrix}
  =
  \begin{pmatrix} P(C)P(Pos|C) & P(C)P(Neg|C) \\ P(\neg C)P(Pos|\neg C) & P(\neg C)P(Neg|\neg C) \end{pmatrix}
  =
  \begin{pmatrix} 0.009 & 0.001 \\ 0.1485 & 0.8415 \end{pmatrix}

Marginal probabilities:

  \begin{pmatrix} P(Pos) & P(Neg) \end{pmatrix}
  =
  \begin{pmatrix} P(Pos, C) + P(Pos, \neg C) & P(Neg, C) + P(Neg, \neg C) \end{pmatrix}
  =
  \begin{pmatrix} 0.1575 & 0.8425 \end{pmatrix}

Posterior probabilities:

  \begin{pmatrix} P(C|Pos) & P(C|Neg) \\ P(\neg C|Pos) & P(\neg C|Neg) \end{pmatrix}
  =
  \begin{pmatrix} \dfrac{P(Pos, C)}{P(Pos)} & \dfrac{P(Neg, C)}{P(Neg)} \\[1ex] \dfrac{P(Pos, \neg C)}{P(Pos)} & \dfrac{P(Neg, \neg C)}{P(Neg)} \end{pmatrix}
  =
  \begin{pmatrix} 0.0571 & 0.0012 \\ 0.9429 & 0.9988 \end{pmatrix}

P(C|Pos) is called the precision.
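The same computation as a short NumPy sketch (the numbers mirror the example above):

```python
import numpy as np

prior = np.array([0.01, 0.99])                # P(C), P(not C)
likelihood = np.array([[0.90, 0.10],          # P(Pos|C),     P(Neg|C)
                       [0.15, 0.85]])         # P(Pos|not C), P(Neg|not C)

joint = prior[:, None] * likelihood           # joint P(disease, test result)
evidence = joint.sum(axis=0)                  # marginals P(Pos), P(Neg)
posterior = joint / evidence                  # Bayes rule, column-wise

print(posterior[0, 0])                        # P(C|Pos) = precision ~ 0.0571
```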
6 Normal Distribution

  N(x; \mu, \sigma^2)
  = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
  = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}
7 Central Limit Theorem

Modeling a coin flip according to the number of flips:

  Single Coin Flip   A Few Coin Flips                  Infinite Coin Flips (∞)
                     Binomial distribution             Normal distribution
  p                  \binom{n}{k} p^k (1-p)^{n-k}      \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}

7.1 Sampling Distribution

Definition 7.1. A sampling distribution is the distribution of a statistic. It is a distribution formed by samples.

Characteristics of a sampling distribution:

1. The sampling distribution is centered on the original parameter value.

  \mu_M = \mu

2. The sampling distribution's variance depends on the sample size n. It decreases when n increases. If we have a random variable X with a variance of \sigma^2, then the sampling distribution of the sample mean \bar{X} has a variance of

  \sigma_M^2 = \frac{\sigma^2}{n}

3. The standard error is the standard deviation of the sampling distribution.

  \text{standard error} = \sqrt{\frac{\sigma^2}{n}}
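A simulation sketch of these properties, assuming NumPy: draw many samples of size n from a skewed population and look at the distribution of their means (the population, seed, and sizes are all arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=2.0, size=100_000)   # skewed population

n = 50
sample_means = np.array([
    rng.choice(population, size=n).mean()
    for _ in range(10_000)
])

# Centered on the population mean, with variance close to sigma^2 / n.
print(population.mean(), sample_means.mean())
print(population.var() / n, sample_means.var())
```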
7.2 Notation

A parameter θ pertains to a population, while a statistic or estimator θ̂ pertains to a sample.

  θ     Statistic   Estimator   Description
  μ     x̄           μ̂           The mean of a dataset
  σ     s           σ̂           The standard deviation of a dataset
  σ²    s²          σ̂²          The variance of a dataset
  π     p           π̂           The proportion (mean) of a binomial dataset
  ρ     r           ρ̂           The correlation coefficient
  β     b           β̂           The regression coefficient

A binomial dataset is a dataset with only 0 and 1 values.
The parameter, which is a numeric summary of the population, doesn't change, while a statistic changes based on the sample selected from the population.

7.3 The Law of Large Numbers

Theorem 7.1. The Law of Large Numbers: as our sample size increases, the sample statistic gets closer to the population parameter.

Most common ways of parameter estimation:

1. Maximum Likelihood Estimation

2. Method of Moments Estimation

3. Bayesian Estimation

7.4 The Central Limit Theorem

Theorem 7.2. The Central Limit Theorem states that with a large enough sample size the sampling distribution of the mean will be normally distributed.

It applies to the following statistics:

1. Sample means (x̄).

2. Sample proportions (p).

3. Difference in sample means (x̄₁ − x̄₂).

4. Difference in sample proportions (p₁ − p₂).

It does not apply to the following statistics:

1. Sample standard deviation s.

2. Correlation coefficient r.

3. Maximum value in the dataset.

7.5 Bootstrapping

Definition 7.2. Bootstrapping is a technique where we sample from a group with replacement.

• We can use bootstrapping to simulate the creation of a sampling distribution.

• An element can be picked more than once from the dataset.

• The probability of any element in our set stays the same regardless of how many times it has been chosen - in this sense each draw behaves like an independent event, such as flipping a coin or rolling a die.

8 Confidence Intervals

Definition 8.1. Confidence intervals provide a range of values that are possible for a population parameter. A confidence interval is the probability that a population parameter will fall between a set of values for a certain proportion of times.

• Confidence intervals can be interpreted as "we are x% confident that the population parameter falls between the bounds of the interval".

• Confidence intervals can be built for different parameters, such as a population mean or a difference in means.

• Confidence levels can be 90%, 95%, 98%, 99%.

• An important application that uses comparison of means is A/B testing.

8.1 Statistical vs. Practical Significance

Statistical significance: evidence from hypothesis tests and confidence intervals that H₁ is true.

Practical significance: considers real world aspects, not just numbers, in making final conclusions. It takes into account other real world constraints such as space, time, or money.

8.2 Building Confidence Intervals

There are two methods:

• Bootstrapping.

• Traditional methods: these methods are no longer necessary with what is possible with statistics in modern computing. For reference, see the Stat Trek site.
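A minimal sketch of a bootstrapped confidence interval (sections 7.5 and 8.2), assuming NumPy; the sample, the seed, and the 95% level are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=70, scale=10, size=200)   # hypothetical observed sample

# Resample with replacement many times to simulate the sampling distribution.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])

# A 95% confidence interval from the percentiles of the bootstrap distribution.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"95% CI for the mean: ({lower:.2f}, {upper:.2f})")
```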
8.3 Other Terms Associated with Confidence Intervals

• The confidence interval width is the difference between the upper and lower bounds of the confidence interval.

• The margin of error is half the confidence interval width.
  Example: "Candidate A has 34% ± 3% of the votes" ⟹ (31%, 37%)

The relationship of sample size and confidence level to the confidence interval:

1. Increasing the sample size n will decrease the width of the confidence interval.

2. Increasing the confidence level (say from 95% to 99%) will increase the width of the confidence interval.

8.4 Confidence Intervals vs. Machine Learning

Confidence intervals are about parameters in a population, while machine learning makes predictions about individual data points.
9 Hypothesis Testing

Rules for setting up the null and alternative hypotheses:

1. H₀ is the null hypothesis. It is the condition we believe to be true before collecting any data.

2. H₀ usually states that there is no effect or that two groups are equal.

3. H₁ is the alternative hypothesis. It is what we would like to prove to be true.

4. H₀ and H₁ are competing, non-overlapping hypotheses.

5. H₀ contains an equal sign of some kind when it pertains to mathematical ideas: either =, ≤, or ≥.

6. H₁ contains the opposite of the null hypothesis: either ≠, >, or <.

9.1 Types of Errors

Table of error types:

                                        Null Hypothesis (H₀) is
                                        False (+)                            True (−)
  Decision     Reject (+)               True Positive (1 − β)                False Positive (Type I error, α)
  about H₀     Don't reject (−)         False Negative (Type II error, β)    True Negative (1 − α)

Type I Errors:

• False positives.

• The error rate is denoted by α. Commonly it is 1-5%.

• Deciding the alternative (H₁) is true, when actually the null (H₀) is true.

• Considered the worse of the two error types.

• You should set up your null and alternative hypotheses so that the worst of your errors is the type I error.

Type II Errors:

• False negatives.

• The error rate is denoted by β.

• Deciding the null (H₀) is true, when actually the alternative (H₁) is true.

Power of a statistical test: the true positive rate (1 − β). This is the ability to correctly choose the alternative hypothesis (the probability of rejecting a null hypothesis that is false).
9.2 Common Types of Hypothesis Tests

Hypothesis tests are performed on population parameters, never on statistics, as statistics are values you already know from the data.

Common hypothesis tests:

1. Testing a population mean (one sample t-test).

2. Testing the difference in means (two sample t-test).

3. Testing the difference before and after some treatment on the same individual (paired t-test).

4. Testing a population proportion (one sample z-test).

5. Testing the difference between population proportions (two sample z-test).

T table: t distribution critical values.

Two-sided test: tests simply whether the parameters of two groups are the same or different. The equal case should still be in the null hypothesis. We aren't interested in whether one parameter is greater than another.

9.3 Difference in Means

Notice that the standard deviation of the difference in means is the square root of the sum of the variances of the individual sampling distributions:

  \sigma_{\text{diff}} = \sqrt{\sigma_{M_1}^2 + \sigma_{M_2}^2}
  \implies \sigma_{\text{diff}}^2 = \sigma_{M_1}^2 + \sigma_{M_2}^2

And the standard deviation of the mean is the standard deviation of the original draws divided by the square root of the sample size taken:

  \sigma_M = \frac{\sigma}{\sqrt{n}}

Thus:

  \sigma_{\text{diff}}^2 = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}

9.4 P-value

P-value: the conditional probability of observing a test statistic (or more extreme test results) in favor of the alternative hypothesis (H₁) if the null hypothesis (H₀) is true.

• A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis.

  Small p-value ⟶ Choose H₁

• When your p-value is large, you will end up staying with the null hypothesis as your choice.

  Large p-value ⟶ Choose H₀

One-Sided Right-Tail Test

Consider an observed test statistic t from an unknown distribution T. The p-value is the probability of observing a test statistic ≥ t if H₀ were true.

The process is as follows (a simulation sketch appears at the end of this subsection):

1. Simulate the values of your statistic that are possible from the null.

2. Calculate the value of the statistic (t) you actually obtained in your data.

3. Compare your statistic to the values from the null.

4. Calculate the p-value as the proportion of null values that are considered extreme based on your alternative.

  p = P(T \ge t \mid H_0 \text{ is true})

p-value and Errors

Comparing the p-value to the type I error threshold (α), the decision about which hypothesis to choose becomes:

• p-value ≤ α ⟹ Reject H₀

• p-value > α ⟹ Fail to Reject H₀
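A minimal simulation of steps 1-4 for a right-tailed difference in means, assuming NumPy; the two groups are fabricated, and the null distribution is a normal centered at 0 with the spread from section 9.3:

```python
import numpy as np

rng = np.random.default_rng(1)
group1 = rng.normal(52, 10, size=100)   # hypothetical data
group2 = rng.normal(50, 10, size=120)

# Step 2: the statistic we actually obtained.
obs_diff = group1.mean() - group2.mean()

# Step 1: simulate the statistic under the null (true difference of 0),
# using sigma_diff^2 = s1^2/n1 + s2^2/n2 from section 9.3.
sigma_diff = np.sqrt(group1.var(ddof=1) / group1.size +
                     group2.var(ddof=1) / group2.size)
null_vals = rng.normal(0, sigma_diff, size=10_000)

# Steps 3-4: one-sided right-tail p-value.
p_value = (null_vals >= obs_diff).mean()
print(p_value)
```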
9.5 Other Things to Consider

Impact of Large Sample Size

• With large sample sizes, hypothesis testing leads to even the smallest findings becoming statistically significant (ending up rejecting essentially every null). However, these findings may not be practically significant at all.

• Hypothesis testing takes an aggregate approach towards the conclusions made based on data, as these tests are aimed at understanding population parameters (which are aggregate population values).

• Alternatively, machine learning techniques take an individual approach towards making conclusions, as they attempt to predict an outcome for each specific data point.
Performing More Than One Hypothesis Test

When performing more than one hypothesis test, your type I error compounds. In order to correct for this, you can use one of the following techniques:

1. Bonferroni correction.

   A simple, but very conservative approach. The new type I error rate should be the error rate you actually want divided by the number of tests you are performing.

   Let m be the number of tests; then your new error rate becomes:

     \alpha_{\text{new}} = \frac{\alpha}{m}

2. Tukey correction.

3. Q-values (popular in medical tests).
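A small sketch of why the correction is needed; the numbers (α = 0.05, m = 10 independent tests) are arbitrary:

```python
alpha, m = 0.05, 10   # hypothetical per-test error rate and number of tests

# With m independent tests, the chance of at least one false positive compounds.
compounded = 1 - (1 - alpha) ** m
print(f"P(at least one type I error) = {compounded:.3f}")   # ~0.401

# Bonferroni: divide the desired overall rate by the number of tests.
alpha_new = alpha / m
print(f"Corrected per-test threshold = {alpha_new}")         # 0.005
```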
How Do Confidence Intervals and Hypothesis Testing Compare?

A confidence interval and a two-sided hypothesis test (a test that involves a ≠ in the alternative) are the same in terms of conclusions:

  1 - \text{CI} = \alpha

Example: a 95% confidence interval is similar to a hypothesis test with a type I error rate of α = 0.05.
10 A/B Testing

Definition 10.1 (A/B testing). Also known as bucket testing or split-run testing, A/B testing is a user experience research methodology. A/B tests consist of a randomized experiment with two variants, A and B. It is a way to compare two versions of a single variable, typically by testing a subject's response to variant A against variant B, and determining which of the two variants is more effective.

A/B testing is a form of hypothesis testing where:

• Null Hypothesis: the experiment does equally well as or worse than the control.

• Alternative Hypothesis: the experiment does better than the control.

A/B testing has drawbacks. It can't tell you about options you haven't considered. It is also subject to bias when tested on existing users. There are two types of bias:

• Change Aversion: existing users may give an unfair advantage to the old version, simply because they resist change, even if the new version is better in the long run.

• Novelty Effect: existing users may give an unfair advantage to the new version, because they are excited or drawn to the change, even if it isn't any better in the long run.
10.1 Testing Changes on a Web Page

A/B tests are used to test changes on a web page by running an experiment where a control group sees the old version, while the experiment group sees the new version. A metric is then chosen to measure the level of engagement from users in each group. These results are then used to judge whether one version is more effective than the other.

The metric used is the click-through rate (CTR):

  \text{CTR} = \frac{\#\,\text{clicks by unique users}}{\#\,\text{views by unique users}}

Hypotheses:

  H_0 : \text{CTR}_{\text{new}} \le \text{CTR}_{\text{old}}
  H_1 : \text{CTR}_{\text{new}} > \text{CTR}_{\text{old}}

Steps Taken to Analyze the Results of an A/B Test

1. Compute the observed difference between the CTR metrics for the control and experiment groups.

2. Simulate the sampling distribution for the difference in proportions (difference in CTR).

3. Use the sampling distribution to simulate the distribution under the null hypothesis, by creating a random normal distribution centered at 0 with the same standard deviation as the sampling distribution.

4. Compute the p-value by finding the null values in the null distribution that are greater than the observed difference.

5. Compare the p-value to the type I error threshold α to determine the statistical significance of the observed difference.
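A sketch of steps 1-5, assuming NumPy and two fabricated arrays of per-user click indicators (1 = clicked); the click probabilities and sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(7)
old = rng.binomial(1, 0.10, size=4000)   # hypothetical control clicks
new = rng.binomial(1, 0.12, size=4000)   # hypothetical experiment clicks

# Step 1: observed difference in CTR.
obs_diff = new.mean() - old.mean()

# Step 2: bootstrap the sampling distribution of the difference.
diffs = np.array([
    rng.choice(new, new.size).mean() - rng.choice(old, old.size).mean()
    for _ in range(10_000)
])

# Step 3: null distribution centered at 0 with the same spread.
null_vals = rng.normal(0, diffs.std(), size=10_000)

# Steps 4-5: p-value for H1: CTR_new > CTR_old, compared to alpha.
p_value = (null_vals > obs_diff).mean()
print(p_value, p_value <= 0.05)
```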
Analyzing Multiple Metrics

The more metrics you evaluate, the more likely you are to observe differences that are significant just by chance. The probability of any false positive increases as you increase the number of metrics (tests). This multiple comparison problem can be solved by several techniques, such as the Bonferroni correction (see section 9.5).

Since the Bonferroni method is too conservative when we expect correlation among metrics, we can better approach this problem with more sophisticated methods, such as:

• The closed testing procedure

• The Boole-Bonferroni bound

• The Holm-Bonferroni method

These are less conservative and take this correlation into account.

If you do choose to use a less conservative method, just make sure the assumptions of that method are truly met in your situation, and that you're not just trying to cheat on a p-value. Choosing a poorly suited test just to get significant results will only lead to misguided decisions that harm performance in the long run.

Difficulties in A/B Testing

• Novelty effect and change aversion when existing users first experience a change.

• Sufficient traffic and conversions to have significant and repeatable results.

• Best metric choice for making the ultimate decision (e.g. measuring revenue vs. clicks).

• Long enough run time for the experiment to account for changes in behavior based on time of day/week or seasonal events.

• Practical significance of a conversion rate (the cost of launching a new feature vs. the gain from the increase in conversion).

• Consistency among test subjects in the control and experiment groups (imbalance in the population represented in each group can lead to situations like Simpson's Paradox).

11 Regression

Machine learning is split into:

• Supervised learning: predicting the labels of data. Examples are fraud detection, whether customers will buy a product or not, and house values.

• Unsupervised learning: clustering unlabeled data together.

11.1 Simple Linear Regression

We compare two quantitative variables to one another, to predict a linear function

  y = f(x)

y is called the response variable or dependent variable. It is the variable you are interested in predicting.
x is called the explanatory variable or independent variable. It is the variable used to predict the response.

11.2 Covariance

  s_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}

11.3 Correlation Coefficient

There are different ways to measure the correlation between two variables [see this link].
For a linear relationship, the most common way to measure correlation is Pearson's correlation coefficient, denoted by r.

  r \in [-1, 1]
Correlation coefficient for a sample:

  r_{xy} = \frac{s_{xy}}{s_x s_y}
  = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum (x_i - \bar{x})^2}\,\sqrt{\sum (y_i - \bar{y})^2}}

It measures:

• Strength: a relationship can be classified according to strength into:

  – Weak: 0.0 ≤ |r| < 0.3

  – Moderate: 0.3 ≤ |r| < 0.7

  – Strong: 0.7 ≤ |r| ≤ 1.0

• Direction: either negative or positive, depending on the sign.
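A minimal sketch with NumPy (the arrays are made up); `np.corrcoef` applies the same Pearson formula:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Covariance with Bessel's correction (dividing by n - 1).
s_xy = ((x - x.mean()) * (y - y.mean())).sum() / (x.size - 1)

# Pearson's r = s_xy / (s_x * s_y).
r = s_xy / (x.std(ddof=1) * y.std(ddof=1))
print(r, np.corrcoef(x, y)[0, 1])   # both ~1.0, a strong positive relationship
```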
11.4 Equation of The Line

The line in linear regression has the equation:

  \hat{y} = b_0 + b_1 x_1

Where:

  ŷ ⟶ the predicted value of the response from the line.
  b₀ ⟶ the intercept. It is the predicted value of the response when the x-variable is zero (denoted β₀ for the population).
  b₁ ⟶ the slope. It is the predicted change in the response for every one unit increase in the x-value (denoted β₁ for the population).
  x₁ ⟶ the explanatory variable.
  y ⟶ the actual response value for a data point in our dataset (also called the label).

11.5 Fitting a Regression Line

The algorithm used to fit a regression line to a dataset is called the least-squares algorithm; it finds the line that minimizes

  \sum_{i=1}^{n}(y_i - \hat{y}_i)^2

Closed form solution:

  \bar{x} = \frac{1}{n}\sum x_i

  \bar{y} = \frac{1}{n}\sum y_i

  s_x = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2} \quad \text{(using Bessel's correction)}

  s_y = \sqrt{\frac{1}{n-1}\sum (y_i - \bar{y})^2} \quad \text{(using Bessel's correction)}

  r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}
                {\sqrt{\sum (x_i - \bar{x})^2}\,\sqrt{\sum (y_i - \bar{y})^2}}

  b_1 = r\,\frac{s_y}{s_x}

  b_0 = \bar{y} - b_1 \bar{x}
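The closed form as a short NumPy sketch, reusing the hypothetical x and y arrays from above; `np.polyfit` returns the same least-squares line:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

r = np.corrcoef(x, y)[0, 1]
b1 = r * y.std(ddof=1) / x.std(ddof=1)   # slope:     b1 = r * s_y / s_x
b0 = y.mean() - b1 * x.mean()            # intercept: b0 = y-bar - b1 * x-bar

print(b0, b1)
print(np.polyfit(x, y, deg=1))           # -> [b1, b0], the same fitted line
```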
Chapter 2: Data Visualization

There are two main reasons for creating visuals using data:

1. Exploratory analysis is done when you are searching for insights. These visualizations don't need to be perfect. You are using plots to find insights, but they don't need to be aesthetically appealing. You are the consumer of these plots, and you need to be able to find the answers to your questions from them.

2. Explanatory analysis is done when you are providing your results for others. These visualizations need to provide the emphasis necessary to convey your message. They should be accurate, insightful, and visually appealing.

The five steps of the data analysis process:

1. Extract - Obtain the data from a spreadsheet, SQL, the web, etc.

2. Clean - Here we could use exploratory visuals.

3. Explore - Here we use exploratory visuals.

4. Analyze - Here we might use either exploratory or explanatory visuals.

5. Share - Here is where explanatory visuals live.

1 Design of Visualizations

Visuals can be bad if they:

1. Don't convey the desired message.

2. Are misleading.

Reference: [Huf93]

1.1 Visual Encodings

Humans are able to best understand data encoded with:

• positional changes (differences in x- and y-position, as in scatterplots).

• length changes (differences in box heights, as in bar charts and histograms).

Alternatively, humans struggle with understanding data encoded with:

• color hue changes (as are unfortunately commonly used as an additional variable encoding in scatter plots).

• area changes (as in pie charts, which often makes them not the best plot choice).

1.2 Chart Junk

Chart junk: all visual elements in charts and graphs that are not necessary to comprehend the information represented on the graph, or that distract the viewer from this information.

Examples of chart junk include:

1. Heavy grid lines

2. Unnecessary text

3. Pictures surrounding the visual

4. Shading or 3d components

5. Ornamented chart axes

1.3 Data-Ink Ratio

The more of the ink in your visual that is related to conveying the message in the data, the better. Limiting chart junk increases the data-ink ratio.

1.4 Design Integrity

Lie factor: the degree to which a visualization distorts or misrepresents the data values being plotted. It is calculated as:

  \text{lie factor} = \frac{\Delta\text{visual}/\text{visual}_{\text{start}}}{\Delta\text{data}/\text{data}_{\text{start}}}

It is the relative change shown in the graphic divided by the actual relative change in the data.
Ideally, the lie factor should be 1. Any other value means that there is some mismatch in the ratio of depicted change to actual change.

1.5 Using Color

1. Before adding color to a visualization, start with black and white.

2. When using color, use less intense colors - not all the colors of the rainbow, which is the default in many software applications.

3. Use color for communication. Use color to highlight your message and separate groups of interest. Don't add color just to have color in your visualization.

1.6 Designing for Color Blindness

Stay away from a red-to-green palette and use a blue-to-orange palette instead.
Further reading: 5 tips on designing colorblind-friendly visualizations

1.7 Additional Encodings

We typically use the x- and y-axes to depict the values of variables. If we have more than two variables we can use other visual encodings:

1. Color and shape for categorical data.

2. Size of marker for quantitative data.

Use additional encodings only when necessary. If the visual gets complicated, consider breaking it into multiple visuals that convey multiple messages.
2 Univariate Exploration of Data

Univariate visualizations visualize single variables, using plots such as bar charts, histograms, and line charts.

2.1 Bar Charts

Used to depict the distribution of categorical/qualitative variables. They can also be used for discrete quantitative data.

2.2 Pie Charts

Guidelines for using a pie chart:

• Make sure that your interest is in relative frequencies.

• Limit the number of slices plotted to two or three, though it's possible to plot four or five slices as long as the wedge sizes can be distinguished.

• Plot the data systematically. Start from the 12 o'clock position of the circle, then plot each categorical level from most frequent to least frequent.

Otherwise, use a bar chart instead.

2.3 Histograms

A histogram is used to plot the distribution of a quantitative (numeric) variable. It is the quantitative version of the bar chart. Values are grouped into continuous bins.
When a data value is on a bin edge, it is counted in the bin to its right. The exception is the rightmost bin edge.
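A minimal sketch of both chart types, assuming Matplotlib and Seaborn are installed; the DataFrame of dogs is fabricated:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical dataset of dogs.
df = pd.DataFrame({
    "breed": ["Collie", "Lab", "Lab", "Poodle", "Lab", "Collie"],
    "age":   [2.0, 4.5, 3.1, 7.2, 1.8, 5.0],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
sns.countplot(data=df, x="breed", ax=ax1)   # bar chart of a categorical variable
ax2.hist(df["age"], bins=4)                 # histogram of a numeric variable
plt.show()
```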
3 Bivariate Exploration of Data

Bivariate visualizations are those visualizations involving two variables, where the variation in one variable affects the value of the other variable.

3.1 Scatterplots

Quantitative variable vs. quantitative variable.

If we have a very large number of points or our numeric variables are discrete-valued, then using a scatterplot won't be informative: the visualization will suffer from overplotting.
Overplotting is where the high amount of overlap in points makes it difficult to see the actual relationship between the variables.
To make the trends in the data clearer, overplotting is overcome by:

1. Using jitter.

2. Using transparency.
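A minimal sketch of both fixes, assuming NumPy and Matplotlib; the discrete-valued data are fabricated:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.integers(1, 6, size=2000)             # discrete values overplot badly
y = x + rng.integers(0, 3, size=2000)

# Jitter: add small random noise so identical points separate visually.
x_jit = x + rng.uniform(-0.2, 0.2, size=x.size)
y_jit = y + rng.uniform(-0.2, 0.2, size=y.size)

# Transparency: alpha < 1 makes dense regions appear darker.
plt.scatter(x_jit, y_jit, alpha=0.1)
plt.show()
```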
3.2 Heat Maps

Quantitative variable vs. quantitative variable.

A heat map is a 2-D version of the histogram that can be used as an alternative to a scatterplot.
They are good in the following cases:

1. Good for discrete variable vs. discrete variable.

2. Good alternative to transparency for a lot of data.

The correct choice of bin sizes is important.

3.3 Violin Plots

Quantitative variable vs. qualitative (categorical) variable.

For each level of the categorical variable, a distribution of the values of the numeric variable is plotted. The distribution is plotted as a kernel density estimate, like a smoothed histogram.

3.4 Box Plots

Quantitative (numeric) variable vs. qualitative (categorical) variable.

Compared to the violin plot, the box plot is better for displaying the 5 point summary of the data, reporting a set of descriptive statistics:

• The central line in the box indicates the median of the distribution.

• The top and bottom of the box represent the third and first quartiles of the data, respectively.

• The height of the box is the interquartile range (IQR).

• From the top and bottom of the box, the whiskers indicate the range from the first or third quartiles to the minimum or maximum value in the distribution.

• Typically, a maximum range is set on whisker length; by default, this is 1.5 times the IQR.

• Individual points below the lower whisker or above the upper whisker indicate individual outlier points that are more than 1.5 times the IQR below the first quartile or above the third quartile.

Box plots are better than violin plots for explanatory visualizations.

3.5 Clustered Bar Charts

Depict the relationship between two categorical variables.
Bars are organized into clusters based on the levels of the first variable, and then the bars are ordered consistently across the second variable within each cluster.

3.6 Line Plots

A line plot plots the trend of one numeric variable against the values of a second variable. In a line plot, only one point is plotted for every unique x-value or bin of x-values (like a histogram).

Line plots are used instead of bar plots to:

• Emphasize relative change. A zero on the y-axis is not necessary.

• Emphasize trends across x-values.

Time series plot: a line plot where the x-variable represents time. For example, stock or currency charts.
4 Multivariate Exploration of Data

4.1 Non-Positional Encodings for Third Variables

There are four major cases to consider when we want to plot three variables together:

• Three numeric variables

• Two numeric variables and one categorical variable

• One numeric variable and two categorical variables

• Three categorical variables

If we have at least two numeric variables, we use a scatterplot to encode two of the numeric variables, then use a non-positional encoding (like shape, size, or color) for the third variable.

4.2 Color Palettes

There are three major classes of color palette to consider:

• Qualitative: distinct colors, for nominal-type data.

• Sequential: a light-to-dark trend across a single or small range of hues of the same color. Used for categorical ordinal or numeric (quantitative) data types.

• Diverging: used if there is a meaningful zero or center value for the variable. Two sequential palettes with different hues are put back to back, with a common color (usually white or gray) connecting them. One hue indicates values greater than the center point, while the other indicates values smaller than the center.

4.3 Faceting

Faceting allows you to plot multiple simpler plots across levels of one or two other variables.
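A minimal Seaborn sketch of both ideas - color as a non-positional encoding for a third variable, and faceting across a categorical level - using Seaborn's bundled `tips` example dataset (downloaded on first use):

```python
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")   # bundled example dataset

# Scatterplot of two numeric variables, with hue encoding a third variable
# and col= faceting the plot across levels of a categorical variable.
sns.relplot(data=tips, x="total_bill", y="tip",
            hue="size",           # non-positional encoding for a third variable
            col="time")           # one facet per level of "time"
plt.show()
```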
. research questions.
1. Good for discrete variable vs. discrete variable. . .
. . Example:
2. Good alternative to transparency for a lot of data. . Time series plot: A line plot where the x-variable represents .
. time. For example, stock or currency charts. . crime incidents
. . crime incident rate =
Correct choice of bin sizes is important. population totals
7
Another way to perform feature engineering is to divide a numeric variable into ordered bins.
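A pandas sketch of both ideas - a ratio feature and ordered bins - on a fabricated DataFrame:

```python
import pandas as pd

# Hypothetical city-level data.
df = pd.DataFrame({
    "crime_incidents": [120, 45, 300, 80],
    "population":      [10_000, 8_000, 25_000, 12_000],
})

# Ratio between two related variables.
df["crime_rate"] = df["crime_incidents"] / df["population"]

# Divide a numeric variable into ordered bins.
df["rate_level"] = pd.cut(df["crime_rate"], bins=3,
                          labels=["low", "medium", "high"])
print(df)
```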
References

[Huf93] Darrell Huff. How to Lie with Statistics. Reissue edition. New York: W. W. Norton & Company, Oct. 1993. ISBN: 978-0-393-31072-6.

[Seo06] Songwon Seo. A Review and Comparison of Methods for Detecting Outliers in Univariate Data Sets. University of Pittsburgh ETD. Aug. 2006. URL: http://d-scholarship.pitt.edu/7948/ (visited on 01/28/2021).
© 2021 Fady Morris Ebeid
https://github.com/FadyMorris/formula-sheets