Exploring Categorical Data
Marginal Distribution:
- A marginal distribution gets its name because it appears in the margins of a
probability distribution table.
- The marginal distribution of a subset of a collection of random variables is the
probability distribution of the variables contained in the subset. It gives the
probabilities of various values of the variables in the subset without reference
to the values of the other variables.
Conditional Distribution:
- A conditional distribution is a probability distribution for a subpopulation. In
other words, it shows the probability that a randomly selected item in a
subpopulation has a characteristic you're interested in.
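Both distributions can be read off a two-way table. A minimal Python sketch with made-up counts (the table and variable names are mine):

```python
# Hypothetical two-way table of counts: rows = group, columns = preference
table = {
    "male":   {"math": 30, "art": 10},
    "female": {"math": 20, "art": 40},
}
total = sum(sum(row.values()) for row in table.values())      # 100

# Marginal distribution of preference: column totals over the grand total
marginal = {subj: sum(row[subj] for row in table.values()) / total
            for subj in ("math", "art")}
print(marginal)        # {'math': 0.5, 'art': 0.5}

# Conditional distribution of preference given the 'female' subpopulation
female_total = sum(table["female"].values())
conditional = {s: c / female_total for s, c in table["female"].items()}
print(conditional)     # {'math': 0.333..., 'art': 0.666...}
```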
Categorical Vs. Quantitative Data:
- Categorical variables take category or label values and place an individual into
one of several groups.
- Quantitative variables take numerical values and represent some kind of
measurement.
Five Number Summary
- A five-number summary simply consists of the smallest data value, the first
quartile, the median, the third quartile, and the largest data value. A box plot
is a graphical device based on a five-number summary.
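A minimal Python sketch of the five-number summary, assuming the standard library's statistics.quantiles (quartile conventions vary between tools):

```python
import statistics

def five_number_summary(data):
    """Minimum, Q1, median, Q3, maximum."""
    q1, q2, q3 = statistics.quantiles(data, n=4)   # three quartile cut points
    return min(data), q1, q2, q3, max(data)

print(five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9]))
# (1, 2.5, 5.0, 7.5, 9)
```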
Exploring one-variable quantitative Data
Dot Plot
- Dot plots take data from a frequency table and show it visually.
Distributions
- When describing a distribution, it is important to talk about: shape (left skew, right skew, symmetric), centre (mean, median), spread (range, IQR, MAD), and outliers.
- When comparing distributions, talk about the spread/variability (range) and the centre.
Exploring one-variable quantitative data:
Summary statistics
Descriptive and inferential statistics
- Descriptive statistics summarise the data you actually have (centre, spread, shape) without listing every value.
Inferential statistics
- Inferential statistics draw conclusions about a population based on information from a sample.
- If you have a left-skewed distribution, the mean will most likely lie to the left of the median, and vice versa for a right skew.
Interquartile range (IQR) - calculates spread
- The interquartile range is the difference between Q3 (the median of the half of the data above the overall median) and Q1 (the median of the half below it): IQR = Q3 - Q1.
- Example with the data 1 2 3 4 5 6 7 8 9:
  lower half 1 2 3 4 → Q1 = 2.5
  upper half 6 7 8 9 → Q3 = 7.5
  IQR = 7.5 - 2.5 = 5
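A Python sketch of this median-of-halves method (the function names are mine; library defaults such as numpy.percentile interpolate differently):

```python
def median(values):
    """Median of a list of numbers."""
    s = sorted(values)
    n, mid = len(s), len(s) // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def iqr(values):
    """Q3 - Q1, using medians of the two halves and excluding the
    overall median when the number of data points is odd."""
    s = sorted(values)
    n = len(s)
    lower, upper = s[: n // 2], s[(n + 1) // 2 :]
    return median(upper) - median(lower)

print(iqr([1, 2, 3, 4, 5, 6, 7, 8, 9]))   # 7.5 - 2.5 = 5.0
```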
Sample Variance - used to check the deviation of data points with respect to the data's average (standard deviation - calculates spread)
1. Find the mean of the data set.
2. Take each data point, subtract the mean, square the result, add up all the squared results, then divide by the number of data points minus 1.
- S(n-1)^2 (dividing by n - 1) is a less biased estimate of the population variance than Sn^2 (dividing by n).
- A standard deviation (or σ) is a measure of how dispersed the data is in
relation to the mean. Low standard deviation means data are clustered around
the mean, and high standard deviation indicates data are more spread out.
- Even the standard deviation calculated as the square root of the unbiased sample variance is itself slightly biased.
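A Python sketch of the two-step recipe (the standard library's statistics.variance uses the same n - 1 divisor, so it doubles as a check):

```python
import statistics

def sample_variance(data):
    """Sample variance with the n - 1 divisor."""
    mean = sum(data) / len(data)                   # step 1: the mean
    squared = [(x - mean) ** 2 for x in data]      # step 2: squared differences
    return sum(squared) / (len(data) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_variance(data), sample_variance(data) ** 0.5)   # variance, SD
print(statistics.variance(data))                             # library check
```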
Adding standard deviation:
Reading Box Plots:
Outliers:
2.5 standard deviations rule:
- Values that are greater than +2.5 standard deviations from the mean, or less than -2.5 standard deviations, are included as outliers in the output results.
Box Plots:
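The box plot figures didn't survive; as a stand-in, a minimal matplotlib sketch (whisker and outlier behaviour follow matplotlib's 1.5 × IQR default):

```python
import matplotlib.pyplot as plt

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 15]    # 15 falls beyond the upper whisker

fig, ax = plt.subplots()
ax.boxplot(data, vert=False)   # box = Q1..Q3, line = median, dots = outliers
ax.set_xlabel("value")
plt.show()
```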
Exploring one-variable quantitative data:
Percentiles, z-scores, and the normal
distribution
Percentiles
- Percentile refers to the % of the data that is below the value in question,
- or, under a second common convention, the % of the data that is at or below the value in question.
Z-score/standardised score
- # of standard deviations from our population mean for a particular data point.
- A z-score is calculated by subtracting the mean (μ) from a value and then dividing by the standard deviation (σ, or Sx/Sy for a sample): z = (x - μ)/σ.
- Z-scores are a really good way to think about how usual or unusual a certain
data point is.
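As a one-liner in Python (function and example values are mine):

```python
def z_score(x, mu, sigma):
    """Number of standard deviations x lies from the mean mu."""
    return (x - mu) / sigma

print(z_score(80, mu=70, sigma=5))   # 2.0: two standard deviations above
```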
Density Curves
- In a skewed distribution, the mean and the median are different.
Empirical rule:
- 68-95-99.7
- The empirical rule says that in a normal distribution, about 68% of values lie within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3.
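The three percentages can be checked from the normal CDF with only the standard library, since P(-k < Z < k) = erf(k/√2):

```python
from math import erf, sqrt

def within_k_sds(k):
    """P(-k < Z < k) for a standard normal Z."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(within_k_sds(k), 4))   # 0.6827, 0.9545, 0.9973
```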
Bivariate:
- Bivariate data records two variables for each individual, typically shown as (x, y) points on a scatterplot.
Correlation coefficient(r):
- The correlation coefficient is a measure of how well a line can describe the
relationship between x and y.
- -1≤r≤1
- The closer r is to 1, the closer an upward sloping line can accurately describe
the relationship.
- The closer r is to -1, the closer a downward (negatively) sloping line can accurately describe the relationship.
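A sketch computing r from scratch as the average product of z-scores (n - 1 convention; the data are made up):

```python
def correlation(xs, ys):
    """Sample correlation coefficient r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = (sum((x - mx) ** 2 for x in xs) / (n - 1)) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / (n - 1)) ** 0.5
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)

print(correlation([1, 2, 3, 4], [2, 4, 5, 9]))   # ~0.96: strong upward trend
```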
Introduction to residuals and least-squares regression:
- The residual for each observation is the difference between the observed value of y (dependent variable) and the value the model predicts: residual = actual y value - predicted y value.
Calculating the equation of a regression line:
- The equation for the least-squares regression line for predicting y from x is of the form:
- Estimated y: ŷ = a + bx
- where a is the y-intercept and b is the slope (b = r · sy/sx, and since the line passes through (x̄, ȳ), a = ȳ - b·x̄).
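A sketch computing a and b directly (equivalent to b = r · sy/sx; the function name is mine and is reused in later sketches):

```python
def least_squares(xs, ys):
    """Slope b and intercept a of the least-squares line y-hat = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx                  # the line passes through (mx, my)
    return a, b

print(least_squares([1, 2, 3, 4], [2, 4, 5, 9]))   # (-0.5, 2.2)
```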
Using least squares regression output:
Exploring two-variable quantitative data
R^2:
- R^2 is called the coefficient of determination.
- The coefficient of determination is a measure used to assess how well a model
explains and predicts future outcomes.
- The standard deviation of the residuals, or S, measures the size of a typical prediction error in the y variable, so the units of S match the units of the y variable.
- One way to measure the fit of the line is to calculate the sum of the squared
residuals.
- R-squared tells us what percent of the prediction error in the y variable is
eliminated when we use least-squares regression on the x variable.
- You calculate the % of total variation that is not described by the regression line as (squared error of the line / total variation in y); R^2 is 1 minus that fraction.
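That definition translates directly; a sketch building on the least_squares function above:

```python
def r_squared(xs, ys, a, b):
    """1 - (squared error of the line) / (total variation in y)."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs, ys = [1, 2, 3, 4], [2, 4, 5, 9]
a, b = least_squares(xs, ys)          # from the earlier sketch
print(r_squared(xs, ys, a, b))        # ~0.93 of the variation explained
```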
Standard deviation of residuals or root mean square deviation (RMSD)
- Root mean square deviation (RMSD, also called root mean square error) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far data points are from the regression line.
Transforming nonlinear data:
- When the data is not linear, instead of trying to fit a linear regression line to it, we can take log(y); for data with an exponential trend this makes the relationship approximately linear.
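The worked example image didn't survive; as a stand-in, a sketch fitting a line to (x, log y) for exponential data and recovering the original parameters:

```python
import math

xs = [0, 1, 2, 3, 4]
ys = [2, 6, 18, 54, 162]               # exponential: y = 2 * 3**x

log_ys = [math.log(y) for y in ys]     # the transform straightens the curve
a, b = least_squares(xs, log_ys)       # reuse the earlier sketch
print(math.exp(a), math.exp(b))        # ~2.0 and ~3.0: y = 2 * 3**x
```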
Collecting data:
Types of studies:
- Experimental studies test whether a group shows different results under some kind of stimulus. They use treatment, control, and placebo groups.
- Observational studies consist of retrospective data (looking at the past) and prospective data (following subjects forward).
- It's important to know the difference between an association between variables and causation. E.g. concluding that smartphone use makes teens less happy is an invalid causal claim, as the data could also be read as: teens who are less happy use their smartphones more.
Significant experiments:
- If re-randomising the groups produces results at least as extreme as the original less than five percent of the time, the experiment's result is significant.
Probability
Bayes’ theorem
- p(A|B) = p(B|A) · p(A) / p(B), which follows from the two ways of writing p(A ∩ B) under Terms below.
Mutually exclusive:
- In logic and probability theory, two events are mutually exclusive or disjoint if
they cannot both occur at the same time. A clear example is the set of
outcomes of a single coin toss, which can result in either heads or tails, but not
both.
Discrete and continuous distributions
- A discrete probability distribution counts occurrences that have countable or
finite outcomes. This is in contrast to a continuous distribution, where
outcomes can fall anywhere on a continuum.
Terms
- Complement of an event is the set of all possible outcomes in a sample space that do not lead to the event.
p(A') = 1 - p(A)
- Disjoint or mutually exclusive events are events that have no outcome in common.
- Union of events A and B is the set of all possible outcomes that lead to at least one of the two events: (A ∪ B) or (A or B).
p(A ∪ B) = p(A) + p(B) - p(A ∩ B)
If the events A and B are disjoint then p(A ∩ B) = 0, so
p(A ∪ B) = p(A) + p(B)
- Intersection of events A and B is the set of all possible outcomes that lead to both events occurring: (A ∩ B) or (A and B).
p(A ∩ B) = p(A) · p(B|A) = p(B) · p(A|B)
- A conditional event (see Bayes' theorem), A given B, is the set of outcomes for event A that occur given that B has occurred: (A|B).
p(A|B) = p(A ∩ B)/p(B)
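A tiny numeric check of these identities, using one fair die roll (A = even, B = greater than 3; the events are my choice):

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
A, B = {2, 4, 6}, {4, 5, 6}

def p(event):
    return Fraction(len(event), len(sample_space))

assert p(sample_space - A) == 1 - p(A)         # complement
assert p(A | B) == p(A) + p(B) - p(A & B)      # union (set | is union here)

p_A_given_B = p(A & B) / p(B)                  # conditional definition
assert p(A & B) == p(B) * p_A_given_B          # intersection identity
print(p_A_given_B)                             # p(A|B) = 2/3
```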
Random variables and probability distributions
Variance
- Variance is the square of the standard deviation (σ): variance = σ^2.
- The mean of a random variable X is also known as the expected value denoted
by E(X).
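A sketch of E(X) and σ^2 for a discrete random variable (the distribution is made up):

```python
# Hypothetical distribution: value -> probability
dist = {0: 0.2, 1: 0.5, 2: 0.3}

mean = sum(x * p for x, p in dist.items())                    # E(X)
variance = sum((x - mean) ** 2 * p for x, p in dist.items())  # sigma^2

print(mean, variance, variance ** 0.5)   # 1.1, 0.49, 0.7
```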
Theoretical probability
Combining random variables
- To combine the variances of two random variables, we need to know, or be willing to assume, that the two variables are independent.
- For independent X and Y, Var(X + Y) = Var(X) + Var(Y); the variances also add for X - Y.
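A quick simulation of that rule with the standard library (distributions and sample size are arbitrary):

```python
import random, statistics

N = 100_000
xs = [random.gauss(0, 3) for _ in range(N)]   # Var(X) = 9
ys = [random.gauss(0, 4) for _ in range(N)]   # Var(Y) = 16, independent of X

print(statistics.variance([x + y for x, y in zip(xs, ys)]))  # ~ 9 + 16 = 25
```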
Binomial Variable/Geometric variable
- If your sample is less than 10% of the population, it's not unreasonable to treat the count of successes as a binomial variable (the trials are approximately independent).
- The geometric distribution is a special case of the negative binomial distribution. It deals with the number of trials required for a single success. Thus, the geometric distribution is a negative binomial distribution where the number of successes (r) is equal to 1.
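A sketch of the geometric pmf implied by that definition, assuming independent trials with success probability p:

```python
def geometric_pmf(k, p):
    """P(first success occurs on trial k), k = 1, 2, 3, ..."""
    return (1 - p) ** (k - 1) * p

print(geometric_pmf(3, 0.5))                             # 0.125: tail, tail, head
print(sum(geometric_pmf(k, 0.5) for k in range(1, 30)))  # approaches 1
```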
Combination
- xCy = (x choose y) = x!/[y!(x - y)!]
- Standard deviation of a binomial variable: σ = √(np(1 - p)), with mean μ = np.
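math.comb implements exactly this formula; a sketch that also uses it for the binomial pmf:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n trials) = nCk * p^k * (1-p)^(n-k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

print(comb(5, 2))                 # 5!/(2! * 3!) = 10
print(binomial_pmf(2, 5, 0.5))    # 10 / 32 = 0.3125
```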
Sampling distributions
Central limit theorem
- The central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution (more closely as n grows) regardless of the population's distribution. Sample sizes equal to or greater than 30 are often considered sufficient for the CLT to hold.
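A simulation sketch: sample means from a deliberately skewed population still pile up in a bell shape (population and sizes are arbitrary):

```python
import random, statistics

population = [0] * 90 + [100] * 10       # heavily right-skewed, mean 10
n = 40                                   # sample size

means = [statistics.mean(random.choices(population, k=n))
         for _ in range(10_000)]

print(statistics.mean(means))    # close to the population mean, 10
print(statistics.stdev(means))   # close to sigma / sqrt(n), about 4.7
```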
Kurtosis
- Kurtosis is a measure of how heavy a distribution's tails are compared with a normal distribution.
Sampling distribution
- If np ≥ 10 and n(1 - p) ≥ 10 (the large counts condition), the sampling distribution of the sample proportion will be close to normal.
Normalcdf/normalpdf function
- Normalcdf/normalpdf:
1. Lower bound:
2. Upper Bound:
3. μ:
4. σ:
- Normalpdf gives the height of the normal density curve at a single point, given any mean and standard deviation (for a continuous variable, the probability of one exact value is 0).
- Normalcdf just finds the probability of getting a value in a range of values on a
normal curve given any mean and standard deviation. (So normalcdf is just
the area between two points).
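Standard-library equivalents of both calculator functions (statistics.NormalDist has pdf and cdf methods):

```python
from statistics import NormalDist

dist = NormalDist(mu=70, sigma=5)      # any mean and standard deviation

print(dist.pdf(75))                    # normalpdf: curve height at one point
print(dist.cdf(75) - dist.cdf(65))     # normalcdf: area between two bounds,
                                       # here within 1 SD: ~0.6827
```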
Standard error of the mean
- As you increase the sample size, the distribution becomes more normal, while
the standard deviation decreases.
- The standard deviation of the sampling distribution of the sample mean is equal to the standard deviation of the original population divided by the square root of n.
- σx̄ = σ/√n
- μx̄ = μ
Type 1 and 2 errors
- Type 1 error: rejecting the null hypothesis when it is actually true. Type 2 error: failing to reject the null hypothesis when it is actually false.
Inference for categorical data: proportions
Confidence/Z interval
- Standard error of p̂: SE = √(p̂(1 - p̂)/n).
- A confidence interval (CI) gives a range of plausible values for the population parameter. The margin of error and the confidence level are directly proportional: if the confidence level increases, the margin of error also increases.
- The conditions we need for inference on one proportion are:
- Random: The data needs to come from a random sample or randomised experiment.
- Normal: The sampling distribution of p̂ needs to be approximately normal: at least 10 expected successes and 10 expected failures.
- Independent: Individual observations need to be independent. If sampling without replacement, our sample size shouldn't be more than 10% of the population.
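A sketch of the resulting one-proportion z-interval (1.96 is the z* for 95% confidence; the names are mine):

```python
from math import sqrt

def one_prop_z_interval(successes, n, z_star=1.96):
    """p-hat +/- z* * sqrt(p-hat * (1 - p-hat) / n)."""
    p_hat = successes / n
    moe = z_star * sqrt(p_hat * (1 - p_hat) / n)   # margin of error
    return p_hat - moe, p_hat + moe

print(one_prop_z_interval(60, 100))   # about (0.504, 0.696)
```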
Hypothesis
- Ho: Null hypothesis, no difference
- Ha: Alternative hypothesis, new claim
Inference for quantitative data: Means
T value
- To find the critical t value, we need the df and the percent of the tail(s).
Z vs T statistic
- If we are doing inference on a mean (population σ unknown), we use the t table; if it is a proportion, we use the z value.
- tcdf( computes the t-distribution probability between lowerbound and upperbound for the specified df (degrees of freedom), which must be > 0.
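A SciPy equivalent of the calculator's tcdf, assuming SciPy is available:

```python
from scipy import stats

def tcdf(lower, upper, df):
    """P(lower < T < upper) for a t distribution with df degrees of freedom."""
    return stats.t.cdf(upper, df) - stats.t.cdf(lower, df)

print(tcdf(-2, 2, df=10))   # ~0.927: heavier tails than the normal's 0.954
```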
Chi Squared/Inference for quantitative data: slopes
Chi
- The chi-square test of homogeneity tests whether different columns (or rows) of data in a table come from the same population or not.
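A sketch of the homogeneity test with SciPy (chi2_contingency takes any two-way table of counts; this table is made up):

```python
from scipy.stats import chi2_contingency

table = [[30, 20],    # hypothetical counts: rows = groups,
         [25, 25]]    # columns = outcome categories

chi2, p_value, df, expected = chi2_contingency(table)
print(chi2, p_value, df)   # large p-value -> no evidence the rows differ
```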
Confidence intervals for the slope of a regression model
- The interval is b ± t* · (standard error of b), using a t distribution with n - 2 degrees of freedom.
Review:
- Residuals
- Binomial cdf/pdf
- Probability