Data Analysis Cheatsheet
Data Analysis Formula Sheet
Fady Morris Ebeid (2021)

Chapter 1: Practical Statistics

1 Descriptive Statistics

1.1 Data Types

• Quantitative: data takes on numeric values that allow us to perform mathematical operations (like the number of dogs).

  Quantitative data can be divided into:

  – Continuous data: can be split into smaller and smaller units, and a still smaller unit exists. An example of this is the age of a dog - we can measure the age in years, months, days, hours, seconds, but there are still smaller units that could be associated with the age.

  – Discrete data: only takes on countable values. The number of dogs we interact with is an example of a discrete data type.

  They can also be classified into:

  – Interval data: numeric values where absolute differences are meaningful (addition and subtraction operations can be made). Examples: year and temperature in Celsius.

  – Ratio data: numeric values where relative differences are meaningful (multiplication and division operations can be made). There must be a meaningful zero point. Examples: document word count and mass in kilograms.

• Categorical (Qualitative): used to label a group or set of items (like dog breeds - Collies, Labs, Poodles, etc.). We can divide categorical data further into two types:

  – Ordinal: data take on a ranked ordering (like a ranked interaction on a scale from Very Poor to Very Good with the dogs). The distances between the categories are unknown.

  – Nominal: data do not have an order or ranking (like the breeds of the dogs).

When analyzing categorical variables, we commonly just look at the count or percent of a group that falls into each level of a category.
1.2 Notation

A random variable is a placeholder for the possible values of some process. It represents a column in the dataset.
Random variables are represented by capital letters (for example X). Once we observe an outcome of a random variable, we notate it as a lower case of the same letter (for example x_1).

1.3 Summary Statistics

There are four main aspects to analyzing quantitative data:

1. Measures of Center

2. Measures of Spread

3. The Shape of the data

4. Outliers

Measures of Center

• Mean (often called average or expected value):

  \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

  Where:
  x_i → the ith data point.
  n → the number of samples.

• Median: the median splits our data so that 50% of our values are lower and 50% are higher.

  \mathrm{median}(X) =
  \begin{cases}
    x_{(n+1)/2}, & \text{if } n \text{ is odd} \\
    \dfrac{x_{n/2} + x_{n/2+1}}{2}, & \text{if } n \text{ is even}
  \end{cases}

  In order to compute the median we must sort our values first.

• Mode: the most frequently observed value in our dataset. There might be multiple modes for a particular dataset (multimodal), or no mode at all.

Measures of Spread

The 5 number summary:

• Minimum: the smallest number in the dataset.

• Q1 (First Quartile): the value such that 25% of the data fall below it.

• Q2 (Second Quartile): the value such that 50% of the data fall below it (the median).

• Q3 (Third Quartile): the value such that 75% of the data fall below it.

• Maximum: the largest value in the dataset.

Measures of spread:

• Range: the difference between the maximum and the minimum.

• Interquartile Range (IQR): the difference between Q3 and Q1.

• Standard Deviation: the average distance of each observation from the mean. It represents how far each point in our dataset is from the mean.

  s = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}

  It has the same units as our original data.

• Variance: the average squared difference of each observation from the mean.

  \mathrm{Var}(X) = s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2

  It has units that are the square of the units of the original data.
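As a quick reference, here is a minimal Python sketch of these statistics, assuming NumPy is available; the `data` array is a made-up sample, and the variance/standard deviation use the 1/n form shown above:

```python
import numpy as np
from collections import Counter

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])          # hypothetical sample

mean = data.mean()                                  # x-bar
median = np.median(data)                            # sorts the values internally
mode = Counter(data.tolist()).most_common(1)[0][0]  # most frequent value
variance = data.var()                               # divides by n, as in the sheet
std_dev = data.std()                                # same units as the data
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                                       # interquartile range
value_range = data.max() - data.min()

print(mean, median, mode, std_dev, variance, iqr, value_range)
```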
Shape of the Data

The shape of the data can be investigated using histograms or box plots. The distribution of data can take one of three shapes:

• Symmetric: normally distributed. The mean equals the median of the data.

  \bar{x} = \mathrm{median}(X)

  Examples: height, weight, errors, precipitation.
  To know if the data are normally distributed, there are plots called normal quantile-quantile plots and statistical methods like the Kolmogorov-Smirnov test.

• Right skew (positive skew - right tailed): the mean is pulled to the right of the median of the data.

  \bar{x} > \mathrm{median}(X)

  Examples: amount of drug remaining in a bloodstream, time between phone calls at a call center, time until a light bulb dies.

• Left skew (negative skew - left tailed): the mean is pulled to the left of the median of the data.

  \bar{x} < \mathrm{median}(X)

  Examples: grades as a percentage in many universities, age at death, asset price changes.
Outliers

Outliers are points that fall very far from the rest of our data points.
They influence measures like the mean and standard deviation much more than measures associated with the five number summary.
Outliers can be identified visually using a histogram, and there are a number of different techniques for identifying them; refer to the [Seo06] paper.

When outliers are present we should consider the following points:

1. Noting they exist and the impact on summary statistics.

2. If they are typos - remove or fix them.

3. Understanding why they exist, and the impact on questions we are trying to answer about our data.

4. Reporting the 5 number summary values is often a better indication than measures like the mean and standard deviation when we have outliers.

5. Be careful in reporting. Know how to ask the right questions.
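One common technique (among those surveyed in [Seo06]) is the 1.5 × IQR fence also used by box plots. A minimal sketch, assuming NumPy; the data array and the planted outlier are fabricated:

```python
import numpy as np

data = np.array([8, 9, 10, 10, 11, 12, 13, 42])   # 42 is a planted outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]
print(outliers)   # -> [42]
```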
1.4 General Steps for Working with a Random Variable

1. Plot your data to identify if you have outliers.

2. Handle outliers accordingly via the methods above.

3. If there are no outliers and your data follow a normal distribution - use the mean and standard deviation to describe your dataset, and report that the data are normally distributed.

4. If you have skewed data or outliers, use the five number summary to summarize your data and report the outliers.

1.5 Descriptive vs. Inferential Statistics

Comparison:

1. Descriptive statistics is about describing collected data.

2. Inferential statistics is about using collected data to draw conclusions about a larger population.

We look at specific examples that allow us to identify the:

(a) Population: the entire group of interest.

(b) Parameter: a numeric summary about the population.

(c) Sample: a subset of the population.

(d) Statistic: a numeric summary about a sample.
2 Probability

• Probability of an event: P(A)

• Probability of the opposite event:

  P(A) = 1 - P(\neg A)

• Probability of the occurrence of a composite event n times (independent events):

  P(A, A, \ldots, A) = P(A) \cdot P(A) \cdot \ldots \cdot P(A) = P(A)^n

3 Binomial Distribution

The Binomial Distribution helps us determine the probability of x successes in n Bernoulli trials.
The probability mass function of the binomial distribution is given by:

  b(x; n, p) =
  \begin{cases}
    \binom{n}{x} p^x (1-p)^{n-x}, & x = 0, 1, 2, \ldots, n \\
    0, & \text{otherwise}
  \end{cases}

Where:

  p → probability of success.
  x → number of successes.
  n → total number of trials.
  \binom{n}{x} → binomial coefficient,

  \binom{n}{x} = \frac{n!}{x!\,(n-x)!}

Mean: E(X) = np
Variance: Var(X) = npq, where q = 1 - p.
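A minimal sketch of these formulas, assuming SciPy is installed; the parameters (10 trials, p = 0.3) are made up for illustration:

```python
from scipy import stats

n, p = 10, 0.3                   # hypothetical trials and success probability
x = 4

dist = stats.binom(n, p)
print(dist.pmf(x))               # b(4; 10, 0.3), probability of exactly 4 successes
print(dist.mean(), dist.var())   # np = 3.0 and npq = 2.1
```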
4 Conditional Probability

Conditional probability formula:

  P(A|B) = \frac{P(A \cap B)}{P(B)}

[Figure 1.1: Conditional Probability - Venn diagram of events A and B with their intersection A ∩ B.]

Joint Probabilities (Intersection):

  P(A \cap B) = P(A, B) = P(B)\,P(A|B)

P(A, B) is the joint probability of A and B.

[Probability tree: the first level branches to B and B̄ with probabilities P(B) and P(B̄); each then branches to A and Ā with conditional probabilities P(A|B), P(Ā|B), P(A|B̄), P(Ā|B̄), giving the joint probabilities P(A∩B), P(Ā∩B), P(A∩B̄), P(Ā∩B̄).]

Note that:

  P(A) = P(A \cap B) + P(A \cap \bar{B})
  P(\bar{A}) = P(\bar{A} \cap B) + P(\bar{A} \cap \bar{B})

  P(A|B) + P(\bar{A}|B) = 1
  P(A|\bar{B}) + P(\bar{A}|\bar{B}) = 1

  P(A) + P(\bar{A}) = 1

5 Bayes Rule

  P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}

P(A|B) is the posterior probability.
P(A) is the prior probability.

5.1 Example: Cancer Test Case

Prior probabilities:

  \begin{pmatrix} P(C) \\ P(\neg C) \end{pmatrix}
  =
  \begin{pmatrix} 0.01 \\ 0.99 \end{pmatrix}

Confusion Matrix:

        Pos   Neg
  C     TP    FN
  ¬C    FP    TN

  \begin{pmatrix} P(Pos|C) & P(Neg|C) \\ P(Pos|\neg C) & P(Neg|\neg C) \end{pmatrix}
  =
  \begin{pmatrix} 0.9 & 0.1 \\ 0.15 & 0.85 \end{pmatrix}

Sensitivity (True Positive rate): measures the proportion of positives that are correctly identified (i.e. the proportion of those who have some condition (affected) who are correctly identified as having the condition). Also called the recall.

Specificity (True Negative rate): measures the proportion of negatives that are correctly identified (i.e. the proportion of those who do not have the condition (unaffected) who are correctly identified as not having the condition).

Type I error (false positive): "The true fact is that the patients do not have a specific disease but the physicians judge the patients as ill according to the test reports."

Type II error (false negative): "The true fact is that the disease is actually present but the test reports provide a falsely reassuring message to patients and physicians that the disease is absent."

[Figure 1.2: Cancer Test Case - Joint Probabilities. A probability tree branching from C and ¬C to Pos and Neg, giving C∩Pos, C∩Neg, ¬C∩Pos, ¬C∩Neg.]
Joint probabilities:

  \begin{pmatrix} P(Pos, C) & P(Neg, C) \\ P(Pos, \neg C) & P(Neg, \neg C) \end{pmatrix}
  =
  \begin{pmatrix} P(C)P(Pos|C) & P(C)P(Neg|C) \\ P(\neg C)P(Pos|\neg C) & P(\neg C)P(Neg|\neg C) \end{pmatrix}
  =
  \begin{pmatrix} 0.009 & 0.001 \\ 0.1485 & 0.8415 \end{pmatrix}

Marginal probabilities:

  \begin{pmatrix} P(Pos) & P(Neg) \end{pmatrix}
  =
  \begin{pmatrix} P(Pos, C) + P(Pos, \neg C) & P(Neg, C) + P(Neg, \neg C) \end{pmatrix}
  =
  \begin{pmatrix} 0.1575 & 0.8425 \end{pmatrix}

Posterior probabilities:

  \begin{pmatrix} P(C|Pos) & P(C|Neg) \\ P(\neg C|Pos) & P(\neg C|Neg) \end{pmatrix}
  =
  \begin{pmatrix} \dfrac{P(Pos, C)}{P(Pos)} & \dfrac{P(Neg, C)}{P(Neg)} \\[1ex] \dfrac{P(Pos, \neg C)}{P(Pos)} & \dfrac{P(Neg, \neg C)}{P(Neg)} \end{pmatrix}
  =
  \begin{pmatrix} 0.0571 & 0.0012 \\ 0.9429 & 0.9988 \end{pmatrix}

P(C|Pos) is called the precision.
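The same computation as a short NumPy sketch (the numbers mirror the example above):

```python
import numpy as np

prior = np.array([0.01, 0.99])                # P(C), P(not C)
likelihood = np.array([[0.90, 0.10],          # P(Pos|C),     P(Neg|C)
                       [0.15, 0.85]])         # P(Pos|not C), P(Neg|not C)

joint = prior[:, None] * likelihood           # joint P(disease, test result)
evidence = joint.sum(axis=0)                  # marginals P(Pos), P(Neg)
posterior = joint / evidence                  # Bayes rule, column-wise

print(posterior[0, 0])                        # P(C|Pos) = precision ~ 0.0571
```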
6 Normal Distribution

  N(x; \mu, \sigma^2)
  = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
  = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}
7 Central Limit Theorem

Modeling a coin flip according to the number of flips:

  Single Coin Flip   A Few Coin Flips                  Infinite Coin Flips (∞)
                     Binomial distribution             Normal distribution
  p                  \binom{n}{k} p^k (1-p)^{n-k}      \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}

7.1 Sampling Distribution

Definition 7.1. A sampling distribution is the distribution of a statistic. It is a distribution formed by samples.

Characteristics of a sampling distribution:

1. The sampling distribution is centered on the original parameter value.

  \mu_M = \mu

2. The sampling distribution's variance depends on the sample size n. It decreases when n increases. If we have a random variable X with a variance of \sigma^2, then the sampling distribution of the sample mean \bar{X} has a variance of

  \sigma_M^2 = \frac{\sigma^2}{n}

3. The standard error is the standard deviation of the sampling distribution.

  \text{standard error} = \sqrt{\frac{\sigma^2}{n}}
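A simulation sketch of these properties, assuming NumPy: draw many samples of size n from a skewed population and look at the distribution of their means (the population, seed, and sizes are all arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=2.0, size=100_000)   # skewed population

n = 50
sample_means = np.array([
    rng.choice(population, size=n).mean()
    for _ in range(10_000)
])

# Centered on the population mean, with variance close to sigma^2 / n.
print(population.mean(), sample_means.mean())
print(population.var() / n, sample_means.var())
```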
7.2 Notation

A parameter θ pertains to a population, while a statistic or estimator θ̂ pertains to a sample.

  θ     Statistic   Estimator   Description
  μ     x̄           μ̂           The mean of a dataset
  σ     s           σ̂           The standard deviation of a dataset
  σ²    s²          σ̂²          The variance of a dataset
  π     p           π̂           The proportion (mean) of a binomial dataset
  ρ     r           ρ̂           The correlation coefficient
  β     b           β̂           The regression coefficient

A binomial dataset is a dataset with only 0 and 1 values.
The parameter, which is a numeric summary of the population, doesn't change, while a statistic changes based on the sample selected from the population.

7.3 The Law of Large Numbers

Theorem 7.1. The Law of Large Numbers: as our sample size increases, the sample statistic gets closer to the population parameter.

Most common ways of parameter estimation:

1. Maximum Likelihood Estimation

2. Method of Moments Estimation

3. Bayesian Estimation

7.4 The Central Limit Theorem

Theorem 7.2. The Central Limit Theorem states that with a large enough sample size the sampling distribution of the mean will be normally distributed.

It applies to the following statistics:

1. Sample means (x̄).

2. Sample proportions (p).

3. Difference in sample means (x̄₁ − x̄₂).

4. Difference in sample proportions (p₁ − p₂).

It does not apply to the following statistics:

1. Sample standard deviation s.

2. Correlation coefficient r.

3. Maximum value in the dataset.

7.5 Bootstrapping

Definition 7.2. Bootstrapping is a technique where we sample from a group with replacement.

• We can use bootstrapping to simulate the creation of a sampling distribution.

• An element can be picked more than once from the dataset.

• The probability of any element in our set stays the same regardless of how many times it has been chosen - in this sense each draw behaves like an independent event, such as flipping a coin or rolling a die.

8 Confidence Intervals

Definition 8.1. Confidence intervals provide a range of values that are possible for a population parameter. A confidence interval is the probability that a population parameter will fall between a set of values for a certain proportion of times.

• Confidence intervals can be interpreted as "we are x% confident that the population parameter falls between the bounds of the interval".

• Confidence intervals can be built for different parameters, such as a population mean or a difference in means.

• Confidence levels can be 90%, 95%, 98%, 99%.

• An important application that uses comparison of means is A/B testing.

8.1 Statistical vs. Practical Significance

Statistical significance: evidence from hypothesis tests and confidence intervals that H₁ is true.

Practical significance: considers real world aspects, not just numbers, in making final conclusions. It takes into account other real world constraints such as space, time, or money.

8.2 Building Confidence Intervals

There are two methods:

• Bootstrapping.

• Traditional methods: these methods are no longer necessary with what is possible with statistics in modern computing. For reference, see the Stat Trek site.
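A minimal sketch of a bootstrapped confidence interval (sections 7.5 and 8.2), assuming NumPy; the sample, the seed, and the 95% level are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=70, scale=10, size=200)   # hypothetical observed sample

# Resample with replacement many times to simulate the sampling distribution.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])

# A 95% confidence interval from the percentiles of the bootstrap distribution.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"95% CI for the mean: ({lower:.2f}, {upper:.2f})")
```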
8.3 Other Terms Associated with Confidence Intervals

• The confidence interval width is the difference between the upper and lower bounds of the confidence interval.

• The margin of error is half the confidence interval width.
  Example: "Candidate A has 34% ± 3% of the votes" ⟹ (31%, 37%)

The relationship of sample size and confidence level to the confidence interval:

1. Increasing the sample size n will decrease the width of the confidence interval.

2. Increasing the confidence level (say from 95% to 99%) will increase the width of the confidence interval.

8.4 Confidence Intervals vs. Machine Learning

Confidence intervals are about parameters in a population, while machine learning makes predictions about individual data points.
9 Hypothesis Testing

Rules for setting up the null and alternative hypotheses:

1. H₀ is the null hypothesis. It is the condition we believe to be true before collecting any data.

2. H₀ usually states that there is no effect or that two groups are equal.

3. H₁ is the alternative hypothesis. It is what we would like to prove to be true.

4. H₀ and H₁ are competing, non-overlapping hypotheses.

5. H₀ contains an equal sign of some kind when it pertains to mathematical ideas: either =, ≤, or ≥.

6. H₁ contains the opposite of the null hypothesis: either ≠, >, or <.

9.1 Types of Errors

Table of error types:

                                        Null Hypothesis (H₀) is
                                        False (+)                            True (−)
  Decision     Reject (+)               True Positive (1 − β)                False Positive (Type I error, α)
  about H₀     Don't reject (−)         False Negative (Type II error, β)    True Negative (1 − α)

Type I Errors:

• False positives.

• The error rate is denoted by α. Commonly it is 1-5%.

• Deciding the alternative (H₁) is true, when actually the null (H₀) is true.

• Considered the worse of the two error types.

• You should set up your null and alternative hypotheses so that the worst of your errors is the type I error.

Type II Errors:

• False negatives.

• The error rate is denoted by β.

• Deciding the null (H₀) is true, when actually the alternative (H₁) is true.

Power of a statistical test: the true positive rate (1 − β). This is the ability to correctly choose the alternative hypothesis (the probability of rejecting a null hypothesis that is false).
9.2 Common Types of Hypothesis Tests

Hypothesis tests are performed on population parameters, never on statistics, as statistics are values you already know from the data.

Common hypothesis tests:

1. Testing a population mean (one sample t-test).

2. Testing the difference in means (two sample t-test).

3. Testing the difference before and after some treatment on the same individual (paired t-test).

4. Testing a population proportion (one sample z-test).

5. Testing the difference between population proportions (two sample z-test).

T table: t distribution critical values.

Two-sided test: tests simply whether the parameters of two groups are the same or different. The equal case should still be in the null hypothesis. We aren't interested in whether one parameter is greater than another.

9.3 Difference in Means

Notice that the standard deviation of the difference in means is the square root of the sum of the variances of the individual sampling distributions:

  \sigma_{\text{diff}} = \sqrt{\sigma_{M_1}^2 + \sigma_{M_2}^2}
  \implies \sigma_{\text{diff}}^2 = \sigma_{M_1}^2 + \sigma_{M_2}^2

And the standard deviation of the mean is the standard deviation of the original draws divided by the square root of the sample size taken:

  \sigma_M = \frac{\sigma}{\sqrt{n}}

Thus:

  \sigma_{\text{diff}}^2 = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}

9.4 P-value

P-value: the conditional probability of observing a test statistic (or more extreme test results) in favor of the alternative hypothesis (H₁) if the null hypothesis (H₀) is true.

• A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis.

  Small p-value ⟶ Choose H₁

• When your p-value is large, you will end up staying with the null hypothesis as your choice.

  Large p-value ⟶ Choose H₀

One-Sided Right-Tail Test

Consider an observed test statistic t from an unknown distribution T. The p-value is the probability of observing a test statistic ≥ t if H₀ were true.

The process is as follows (a simulation sketch appears at the end of this subsection):

1. Simulate the values of your statistic that are possible from the null.

2. Calculate the value of the statistic (t) you actually obtained in your data.

3. Compare your statistic to the values from the null.

4. Calculate the p-value as the proportion of null values that are considered extreme based on your alternative.

  p = P(T \ge t \mid H_0 \text{ is true})

p-value and Errors

Comparing the p-value to the type I error threshold (α), the decision about which hypothesis to choose becomes:

• p-value ≤ α ⟹ Reject H₀

• p-value > α ⟹ Fail to Reject H₀
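A minimal simulation of steps 1-4 for a right-tailed difference in means, assuming NumPy; the two groups are fabricated, and the null distribution is a normal centered at 0 with the spread from section 9.3:

```python
import numpy as np

rng = np.random.default_rng(1)
group1 = rng.normal(52, 10, size=100)   # hypothetical data
group2 = rng.normal(50, 10, size=120)

# Step 2: the statistic we actually obtained.
obs_diff = group1.mean() - group2.mean()

# Step 1: simulate the statistic under the null (true difference of 0),
# using sigma_diff^2 = s1^2/n1 + s2^2/n2 from section 9.3.
sigma_diff = np.sqrt(group1.var(ddof=1) / group1.size +
                     group2.var(ddof=1) / group2.size)
null_vals = rng.normal(0, sigma_diff, size=10_000)

# Steps 3-4: one-sided right-tail p-value.
p_value = (null_vals >= obs_diff).mean()
print(p_value)
```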
9.5 Other Things to Consider

Impact of Large Sample Size

• With large sample sizes, hypothesis testing leads to even the smallest findings becoming statistically significant (ending up rejecting essentially every null). However, these findings may not be practically significant at all.

• Hypothesis testing takes an aggregate approach towards the conclusions made based on data, as these tests are aimed at understanding population parameters (which are aggregate population values).

• Alternatively, machine learning techniques take an individual approach towards making conclusions, as they attempt to predict an outcome for each specific data point.
Performing More Than One Hypothesis Test

When performing more than one hypothesis test, your type I error compounds. In order to correct for this, you can use one of the following techniques:

1. Bonferroni correction.

   A simple, but very conservative approach. The new type I error rate should be the error rate you actually want divided by the number of tests you are performing.

   Let m be the number of tests; then your new error rate becomes:

     \alpha_{\text{new}} = \frac{\alpha}{m}

2. Tukey correction.

3. Q-values (popular in medical tests).
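A small sketch of why the correction is needed; the numbers (α = 0.05, m = 10 independent tests) are arbitrary:

```python
alpha, m = 0.05, 10   # hypothetical per-test error rate and number of tests

# With m independent tests, the chance of at least one false positive compounds.
compounded = 1 - (1 - alpha) ** m
print(f"P(at least one type I error) = {compounded:.3f}")   # ~0.401

# Bonferroni: divide the desired overall rate by the number of tests.
alpha_new = alpha / m
print(f"Corrected per-test threshold = {alpha_new}")         # 0.005
```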
How Do Confidence Intervals and Hypothesis Testing Compare?

A confidence interval and a two-sided hypothesis test (a test that involves a ≠ in the alternative) are the same in terms of conclusions:

  1 - \text{CI} = \alpha

Example: a 95% confidence interval is similar to a hypothesis test with a type I error rate of α = 0.05.
10 A/B Testing

Definition 10.1 (A/B testing). Also known as bucket testing or split-run testing, A/B testing is a user experience research methodology. A/B tests consist of a randomized experiment with two variants, A and B. It is a way to compare two versions of a single variable, typically by testing a subject's response to variant A against variant B, and determining which of the two variants is more effective.

A/B testing is a form of hypothesis testing where:

• Null Hypothesis: the experiment does equally well as or worse than the control.

• Alternative Hypothesis: the experiment does better than the control.

A/B testing has drawbacks. It can't tell you about options you haven't considered. It is also subject to bias when tested on existing users. There are two types of bias:

• Change Aversion: existing users may give an unfair advantage to the old version, simply because they resist change, even if the new version is better in the long run.

• Novelty Effect: existing users may give an unfair advantage to the new version, because they are excited or drawn to the change, even if it isn't any better in the long run.
10.1 Testing Changes on a Web Page

A/B tests are used to test changes on a web page by running an experiment where a control group sees the old version, while the experiment group sees the new version. A metric is then chosen to measure the level of engagement from users in each group. These results are then used to judge whether one version is more effective than the other.

The metric used is the click-through rate (CTR):

  \text{CTR} = \frac{\#\,\text{clicks by unique users}}{\#\,\text{views by unique users}}

Hypotheses:

  H_0 : \text{CTR}_{\text{new}} \le \text{CTR}_{\text{old}}
  H_1 : \text{CTR}_{\text{new}} > \text{CTR}_{\text{old}}

Steps Taken to Analyze the Results of an A/B Test

1. Compute the observed difference between the CTR metrics for the control and experiment groups.

2. Simulate the sampling distribution for the difference in proportions (difference in CTR).

3. Use the sampling distribution to simulate the distribution under the null hypothesis, by creating a random normal distribution centered at 0 with the same standard deviation as the sampling distribution.

4. Compute the p-value by finding the null values in the null distribution that are greater than the observed difference.

5. Compare the p-value to the type I error threshold α to determine the statistical significance of the observed difference.
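A sketch of steps 1-5, assuming NumPy and two fabricated arrays of per-user click indicators (1 = clicked); the click probabilities and sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(7)
old = rng.binomial(1, 0.10, size=4000)   # hypothetical control clicks
new = rng.binomial(1, 0.12, size=4000)   # hypothetical experiment clicks

# Step 1: observed difference in CTR.
obs_diff = new.mean() - old.mean()

# Step 2: bootstrap the sampling distribution of the difference.
diffs = np.array([
    rng.choice(new, new.size).mean() - rng.choice(old, old.size).mean()
    for _ in range(10_000)
])

# Step 3: null distribution centered at 0 with the same spread.
null_vals = rng.normal(0, diffs.std(), size=10_000)

# Steps 4-5: p-value for H1: CTR_new > CTR_old, compared to alpha.
p_value = (null_vals > obs_diff).mean()
print(p_value, p_value <= 0.05)
```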
Analyzing Multiple Metrics

The more metrics you evaluate, the more likely you are to observe differences that are significant just by chance. The probability of any false positive increases as you increase the number of metrics (tests). This multiple comparison problem can be solved by several techniques, such as the Bonferroni correction (see section 9.5).

Since the Bonferroni method is too conservative when we expect correlation among metrics, we can better approach this problem with more sophisticated methods, such as:

• The closed testing procedure

• The Boole-Bonferroni bound

• The Holm-Bonferroni method

These are less conservative and take this correlation into account.

If you do choose to use a less conservative method, just make sure the assumptions of that method are truly met in your situation, and that you're not just trying to cheat on a p-value. Choosing a poorly suited test just to get significant results will only lead to misguided decisions that harm performance in the long run.

Difficulties in A/B Testing

• Novelty effect and change aversion when existing users first experience a change.

• Sufficient traffic and conversions to have significant and repeatable results.

• Best metric choice for making the ultimate decision (e.g. measuring revenue vs. clicks).

• Long enough run time for the experiment to account for changes in behavior based on time of day/week or seasonal events.

• Practical significance of a conversion rate (the cost of launching a new feature vs. the gain from the increase in conversion).

• Consistency among test subjects in the control and experiment groups (imbalance in the population represented in each group can lead to situations like Simpson's Paradox).

11 Regression

Machine learning is split into:

• Supervised learning: predicting the labels of data. Examples are fraud detection, whether customers will buy a product or not, and house values.

• Unsupervised learning: clustering unlabeled data together.

11.1 Simple Linear Regression

We compare two quantitative variables to one another, to predict a linear function

  y = f(x)

y is called the response variable or dependent variable. It is the variable you are interested in predicting.
x is called the explanatory variable or independent variable. It is the variable used to predict the response.

11.2 Covariance

  s_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}

11.3 Correlation Coefficient

There are different ways to measure the correlation between two variables [see this link].
For a linear relationship, the most common way to measure correlation is Pearson's correlation coefficient, denoted by r.

  r \in [-1, 1]
Correlation coefficient for a sample:

  r_{xy} = \frac{s_{xy}}{s_x s_y}
  = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum (x_i - \bar{x})^2}\,\sqrt{\sum (y_i - \bar{y})^2}}

It measures:

• Strength: a relationship can be classified according to strength into:

  – Weak: 0.0 ≤ |r| < 0.3

  – Moderate: 0.3 ≤ |r| < 0.7

  – Strong: 0.7 ≤ |r| ≤ 1.0

• Direction: either negative or positive, depending on the sign.
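A minimal sketch with NumPy (the arrays are made up); `np.corrcoef` applies the same Pearson formula:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Covariance with Bessel's correction (dividing by n - 1).
s_xy = ((x - x.mean()) * (y - y.mean())).sum() / (x.size - 1)

# Pearson's r = s_xy / (s_x * s_y).
r = s_xy / (x.std(ddof=1) * y.std(ddof=1))
print(r, np.corrcoef(x, y)[0, 1])   # both ~1.0, a strong positive relationship
```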
11.4 Equation of The Line

The line in linear regression has the equation:

  \hat{y} = b_0 + b_1 x_1

Where:

  ŷ ⟶ the predicted value of the response from the line.
  b₀ ⟶ the intercept. It is the predicted value of the response when the x-variable is zero (denoted β₀ for the population).
  b₁ ⟶ the slope. It is the predicted change in the response for every one unit increase in the x-value (denoted β₁ for the population).
  x₁ ⟶ the explanatory variable.
  y ⟶ the actual response value for a data point in our dataset (also called the label).

11.5 Fitting a Regression Line

The algorithm used to fit a regression line to a dataset is called the least-squares algorithm; it finds the line that minimizes

  \sum_{i=1}^{n}(y_i - \hat{y}_i)^2

Closed form solution:

  \bar{x} = \frac{1}{n}\sum x_i

  \bar{y} = \frac{1}{n}\sum y_i

  s_x = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2} \quad \text{(using Bessel's correction)}

  s_y = \sqrt{\frac{1}{n-1}\sum (y_i - \bar{y})^2} \quad \text{(using Bessel's correction)}

  r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}
                {\sqrt{\sum (x_i - \bar{x})^2}\,\sqrt{\sum (y_i - \bar{y})^2}}

  b_1 = r\,\frac{s_y}{s_x}

  b_0 = \bar{y} - b_1 \bar{x}
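The closed form as a short NumPy sketch, reusing the hypothetical x and y arrays from above; `np.polyfit` returns the same least-squares line:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

r = np.corrcoef(x, y)[0, 1]
b1 = r * y.std(ddof=1) / x.std(ddof=1)   # slope:     b1 = r * s_y / s_x
b0 = y.mean() - b1 * x.mean()            # intercept: b0 = y-bar - b1 * x-bar

print(b0, b1)
print(np.polyfit(x, y, deg=1))           # -> [b1, b0], the same fitted line
```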
Chapter 2: Data Visualization

There are two main reasons for creating visuals using data:

1. Exploratory analysis is done when you are searching for insights. These visualizations don't need to be perfect. You are using plots to find insights, but they don't need to be aesthetically appealing. You are the consumer of these plots, and you need to be able to find the answers to your questions from them.

2. Explanatory analysis is done when you are providing your results for others. These visualizations need to provide the emphasis necessary to convey your message. They should be accurate, insightful, and visually appealing.

The five steps of the data analysis process:

1. Extract - Obtain the data from a spreadsheet, SQL, the web, etc.

2. Clean - Here we could use exploratory visuals.

3. Explore - Here we use exploratory visuals.

4. Analyze - Here we might use either exploratory or explanatory visuals.

5. Share - Here is where explanatory visuals live.

1 Design of Visualizations

Visuals can be bad if they:

1. Don't convey the desired message.

2. Are misleading.

Reference: [Huf93]

1.1 Visual Encodings

Humans are able to best understand data encoded with:

• positional changes (differences in x- and y-position, as in scatterplots).

• length changes (differences in box heights, as in bar charts and histograms).

Alternatively, humans struggle with understanding data encoded with:

• color hue changes (as are unfortunately commonly used as an additional variable encoding in scatter plots).

• area changes (as in pie charts, which often makes them not the best plot choice).

1.2 Chart Junk

Chart junk: all visual elements in charts and graphs that are not necessary to comprehend the information represented on the graph, or that distract the viewer from this information.

Examples of chart junk include:

1. Heavy grid lines

2. Unnecessary text

3. Pictures surrounding the visual

4. Shading or 3d components

5. Ornamented chart axes

1.3 Data-Ink Ratio

The more of the ink in your visual that is related to conveying the message in the data, the better. Limiting chart junk increases the data-ink ratio.

1.4 Design Integrity

Lie factor: the degree to which a visualization distorts or misrepresents the data values being plotted. It is calculated as:

  \text{lie factor} = \frac{\Delta\text{visual}/\text{visual}_{\text{start}}}{\Delta\text{data}/\text{data}_{\text{start}}}

It is the relative change shown in the graphic divided by the actual relative change in the data.
Ideally, the lie factor should be 1. Any other value means that there is some mismatch in the ratio of depicted change to actual change.

1.5 Using Color

1. Before adding color to a visualization, start with black and white.

2. When using color, use less intense colors - not all the colors of the rainbow, which is the default in many software applications.

3. Use color for communication. Use color to highlight your message and separate groups of interest. Don't add color just to have color in your visualization.

1.6 Designing for Color Blindness

Stay away from a red-to-green palette and use a blue-to-orange palette instead.
Further reading: 5 tips on designing colorblind-friendly visualizations

1.7 Additional Encodings

We typically use the x- and y-axes to depict the values of variables. If we have more than two variables we can use other visual encodings:

1. Color and shape for categorical data.

2. Size of marker for quantitative data.

Use additional encodings only when necessary. If the visual gets complicated, consider breaking it into multiple visuals that convey multiple messages.
2 Univariate Exploration of Data

Univariate visualizations visualize single variables, using plots such as bar charts, histograms, and line charts.

2.1 Bar Charts

Used to depict the distribution of categorical/qualitative variables. They can also be used for discrete quantitative data.

2.2 Pie Charts

Guidelines for using a pie chart:

• Make sure that your interest is in relative frequencies.

• Limit the number of slices plotted to two or three, though it's possible to plot four or five slices as long as the wedge sizes can be distinguished.

• Plot the data systematically. Start from the 12 o'clock position of the circle, then plot each categorical level from most frequent to least frequent.

Otherwise, use a bar chart instead.

2.3 Histograms

A histogram is used to plot the distribution of a quantitative (numeric) variable. It is the quantitative version of the bar chart. Values are grouped into continuous bins.
When a data value is on a bin edge, it is counted in the bin to its right. The exception is the rightmost bin edge.
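A minimal sketch of both chart types, assuming Matplotlib and Seaborn are installed; the DataFrame of dogs is fabricated:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical dataset of dogs.
df = pd.DataFrame({
    "breed": ["Collie", "Lab", "Lab", "Poodle", "Lab", "Collie"],
    "age":   [2.0, 4.5, 3.1, 7.2, 1.8, 5.0],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
sns.countplot(data=df, x="breed", ax=ax1)   # bar chart of a categorical variable
ax2.hist(df["age"], bins=4)                 # histogram of a numeric variable
plt.show()
```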
3 Bivariate Exploration of Data

Bivariate visualizations are those visualizations involving two variables, where the variation in one variable affects the value of the other variable.

3.1 Scatterplots

Quantitative variable vs. quantitative variable.

If we have a very large number of points or our numeric variables are discrete-valued, then using a scatterplot won't be informative: the visualization will suffer from overplotting.
Overplotting is where the high amount of overlap in points makes it difficult to see the actual relationship between the variables.
To make the trends in the data clearer, overplotting is overcome by:

1. Using jitter.

2. Using transparency.
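A minimal sketch of both fixes, assuming NumPy and Matplotlib; the discrete-valued data are fabricated:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.integers(1, 6, size=2000)             # discrete values overplot badly
y = x + rng.integers(0, 3, size=2000)

# Jitter: add small random noise so identical points separate visually.
x_jit = x + rng.uniform(-0.2, 0.2, size=x.size)
y_jit = y + rng.uniform(-0.2, 0.2, size=y.size)

# Transparency: alpha < 1 makes dense regions appear darker.
plt.scatter(x_jit, y_jit, alpha=0.1)
plt.show()
```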
3.2 Heat Maps

Quantitative variable vs. quantitative variable.

A heat map is a 2-D version of the histogram that can be used as an alternative to a scatterplot.
They are good in the following cases:

1. Good for discrete variable vs. discrete variable.

2. Good alternative to transparency for a lot of data.

The correct choice of bin sizes is important.

3.3 Violin Plots

Quantitative variable vs. qualitative (categorical) variable.

For each level of the categorical variable, a distribution of the values of the numeric variable is plotted. The distribution is plotted as a kernel density estimate, like a smoothed histogram.

3.4 Box Plots

Quantitative (numeric) variable vs. qualitative (categorical) variable.

Compared to the violin plot, the box plot is better for displaying the 5 point summary of the data, reporting a set of descriptive statistics:

• The central line in the box indicates the median of the distribution.

• The top and bottom of the box represent the third and first quartiles of the data, respectively.

• The height of the box is the interquartile range (IQR).

• From the top and bottom of the box, the whiskers indicate the range from the first or third quartiles to the minimum or maximum value in the distribution.

• Typically, a maximum range is set on whisker length; by default, this is 1.5 times the IQR.

• Individual points below the lower whisker or above the upper whisker indicate individual outlier points that are more than 1.5 times the IQR below the first quartile or above the third quartile.

Box plots are better than violin plots for explanatory visualizations.

3.5 Clustered Bar Charts

Depict the relationship between two categorical variables.
Bars are organized into clusters based on the levels of the first variable, and then the bars are ordered consistently across the second variable within each cluster.

3.6 Line Plots

A line plot plots the trend of one numeric variable against the values of a second variable. In a line plot, only one point is plotted for every unique x-value or bin of x-values (like a histogram).

Line plots are used instead of bar plots to:

• Emphasize relative change. A zero on the y-axis is not necessary.

• Emphasize trends across x-values.

Time series plot: a line plot where the x-variable represents time. For example, stock or currency charts.
4 Multivariate Exploration of Data

4.1 Non-Positional Encodings for Third Variables

There are four major cases to consider when we want to plot three variables together:

• Three numeric variables

• Two numeric variables and one categorical variable

• One numeric variable and two categorical variables

• Three categorical variables

If we have at least two numeric variables, we use a scatterplot to encode two of the numeric variables, then use a non-positional encoding (like shape, size, or color) for the third variable.

4.2 Color Palettes

There are three major classes of color palette to consider:

• Qualitative: distinct colors, for nominal-type data.

• Sequential: a light-to-dark trend across a single or small range of hues of the same color. Used for categorical ordinal or numeric (quantitative) data types.

• Diverging: used if there is a meaningful zero or center value for the variable. Two sequential palettes with different hues are put back to back, with a common color (usually white or gray) connecting them. One hue indicates values greater than the center point, while the other indicates values smaller than the center.

4.3 Faceting

Faceting allows you to plot multiple simpler plots across levels of one or two other variables.
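A minimal Seaborn sketch of both ideas - color as a non-positional encoding for a third variable, and faceting across a categorical level - using Seaborn's bundled `tips` example dataset (downloaded on first use):

```python
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")   # bundled example dataset

# Scatterplot of two numeric variables, with hue encoding a third variable
# and col= faceting the plot across levels of a categorical variable.
sns.relplot(data=tips, x="total_bill", y="tip",
            hue="size",           # non-positional encoding for a third variable
            col="time")           # one facet per level of "time"
plt.show()
```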
. research questions.
1. Good for discrete variable vs. discrete variable. . .
. . Example:
2. Good alternative to transparency for a lot of data. . Time series plot: A line plot where the x-variable represents .
. time. For example, stock or currency charts. . crime incidents
. . crime incident rate =
Correct choice of bin sizes is important. population totals
7
Another way to perform feature engineering is to divide a numeric variable into ordered bins.
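A pandas sketch of both ideas - a ratio feature and ordered bins - on a fabricated DataFrame:

```python
import pandas as pd

# Hypothetical city-level data.
df = pd.DataFrame({
    "crime_incidents": [120, 45, 300, 80],
    "population":      [10_000, 8_000, 25_000, 12_000],
})

# Ratio between two related variables.
df["crime_rate"] = df["crime_incidents"] / df["population"]

# Divide a numeric variable into ordered bins.
df["rate_level"] = pd.cut(df["crime_rate"], bins=3,
                          labels=["low", "medium", "high"])
print(df)
```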
References

[Huf93] Darrell Huff. How to Lie with Statistics. Reissue edition. New York: W. W. Norton & Company, Oct. 1993. ISBN: 978-0-393-31072-6.

[Seo06] Songwon Seo. A Review and Comparison of Methods for Detecting Outliers in Univariate Data Sets. University of Pittsburgh ETD. Aug. 2006. URL: http://d-scholarship.pitt.edu/7948/ (visited on 01/28/2021).
© 2021 Fady Morris Ebeid
https://github.com/FadyMorris/formula-sheets