Exploring Categorical Data

Marginal Distribution:
- A marginal distribution gets its name because it appears in the margins of a two-way (contingency) table: the row and column totals.
- The marginal distribution of a subset of a collection of random variables is the probability distribution of the variables in that subset. It gives the probabilities of the various values of those variables without reference to the values of the other variables.
Conditional Distribution:
- A conditional distribution is a probability distribution for a subpopulation. In other words, it shows the probability that a randomly selected item in a subpopulation has a characteristic you're interested in.
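A minimal Python sketch of both ideas on a two-way table; the table, its labels, and its counts are all made up for illustration:

```python
# Hypothetical two-way table: rows = education level, columns = "owns a passport".
counts = {
    "HS":      {"Yes": 30, "No": 70},
    "College": {"Yes": 60, "No": 40},
}

total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of education: row totals / grand total.
marginal_edu = {edu: sum(row.values()) / total for edu, row in counts.items()}

# Conditional distribution of passport ownership given "College":
# each cell in that row divided by the row total.
row = counts["College"]
conditional = {k: v / sum(row.values()) for k, v in row.items()}

print(marginal_edu)   # {'HS': 0.5, 'College': 0.5}
print(conditional)    # {'Yes': 0.6, 'No': 0.4}
```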

Categorical vs. Quantitative Data:
- Categorical variables take category or label values and place an individual into one of several groups.
- Quantitative variables take numerical values and represent some kind of measurement.
Five Number Summary
- A five-number summary consists of the smallest data value, the first quartile, the median, the third quartile, and the largest data value. A box plot is a graphical device based on a five-number summary.
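A quick sketch with numpy on made-up data (note np.percentile interpolates, so its quartiles can differ slightly from the median-of-halves method taught in class):

```python
import numpy as np

data = np.array([2, 4, 4, 5, 7, 8, 9, 11, 12, 15])  # hypothetical sample

five_num = {
    "min":    data.min(),
    "Q1":     np.percentile(data, 25),
    "median": np.median(data),
    "Q3":     np.percentile(data, 75),
    "max":    data.max(),
}
print(five_num)
```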

Exploring one-variable quantitative Data

Dot Plot
- Dot plots take data from a frequency table and show it visually.
Distributions
- When describing a distribution it is important to talk about: Shape (left skew, right skew, symmetric), Centre (mean, median), Spread (range, IQR, MAD), & Outliers.
- When comparing distributions, talk about the centre and the spread/variability (e.g. range).

Exploring one-variable quantitative data:
Summary statistics

Descriptive and inferential statistics
- Descriptive statistics: describing and summarising the data at hand, without drawing conclusions beyond it.
- Inferential statistics: inferring conclusions about a population from the information in a sample.

_____________________________

- If you have a left-skewed distribution, the mean will most likely lie to the left of the median (pulled toward the long tail), and vice versa for a right skew.
Interquartile range (IQR) - calculate spread
- The interquartile range is the difference between the median of the group to the right of the overall median (Q3) and the median of the group to the left of it (Q1): IQR = Q3 - Q1.
- Example: 1 2 3 4 5 6 7 8 9
  lower half 1 2 3 4 -> Q1 = 2.5
  upper half 6 7 8 9 -> Q3 = 7.5
  IQR = 7.5 - 2.5 = 5
Sample Variance - used to check the deviation of data points with respect to the data's average (standard deviation - calculate spread)
1. Find the mean of the data set.
2. Take each data point, subtract the mean, square the result, sum over all the data points, then divide by the number of data points minus 1.
- S_(n-1)^2 (dividing by n - 1) is a more accurate (unbiased) estimate of the population variance than S_n^2 (dividing by n).
- A standard deviation (or σ) is a measure of how dispersed the data is in relation to the mean. Low standard deviation means data are clustered around the mean, and high standard deviation indicates data are more spread out.
- Note: even the standard deviation calculated as the square root of the unbiased sample variance is itself slightly biased.
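A sketch of the two variance formulas on made-up data; numpy's ddof argument switches between them:

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = data.mean()                       # step 1: mean of the data set
ss = ((data - mean) ** 2).sum()          # step 2: sum of squared deviations

var_unbiased = ss / (len(data) - 1)      # S_(n-1)^2, divides by n - 1
var_biased   = ss / len(data)            # S_n^2, divides by n

# numpy's ddof parameter controls the same choice:
assert np.isclose(var_unbiased, data.var(ddof=1))
assert np.isclose(var_biased, data.var(ddof=0))

print(var_biased ** 0.5, var_unbiased ** 0.5)   # the two standard deviations
```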
Adding standard deviations:
- Standard deviations do not add directly. For independent random variables, variances add: σ_(X+Y)^2 = σ_X^2 + σ_Y^2, so σ_(X+Y) = (σ_X^2 + σ_Y^2)^0.5 (see Combining random variables below).
Reading Box Plots:
- A box plot displays the five-number summary: the box spans Q1 to Q3 with a line at the median, and the whiskers extend to the smallest and largest values that are not outliers.
Outliers:
- Box plots flag a point as an outlier if it lies more than 1.5 × IQR below Q1 or above Q3.
Standard deviations rule:
- Values that are greater than +2.5 standard deviations from the mean, or less than -2.5 standard deviations, are included as outliers in the output results.
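A sketch of both outlier rules on made-up data with one planted outlier:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])  # 30 is a planted outlier

# 1.5 x IQR fence rule (the one box plots use)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
fence_low, fence_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = data[(data < fence_low) | (data > fence_high)]

# Standard-deviation rule: flag points beyond +-2.5 SDs of the mean
z = (data - data.mean()) / data.std(ddof=1)
sd_outliers = data[np.abs(z) > 2.5]

print(iqr_outliers, sd_outliers)   # both flag 30
```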

Exploring one-variable quantitative data:
Percentiles, z-scores, and the normal distribution
Percentiles
- A percentile refers to the % of the data that is below the value in question
- (or, on some definitions, the % of the data that is at or below the value in question).
Z-score / standardised score
- The number of standard deviations a particular data point is from the population mean.
- A z-score is calculated by subtracting the mean (μ) from a value and then dividing by the standard deviation (σ for a population, S_x for a sample): z = (x - μ)/σ.
- Z-scores are a really good way to think about how usual or unusual a certain data point is.
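A small sketch computing z-scores for hypothetical test scores:

```python
import numpy as np

scores = np.array([65, 70, 75, 80, 85, 90, 95])   # hypothetical test scores
mu, sigma = scores.mean(), scores.std(ddof=1)

z = (scores - mu) / sigma          # z = (x - mean) / standard deviation
print(dict(zip(scores.tolist(), z.round(2).tolist())))

# A single point, e.g. x = 95:
print((95 - mu) / sigma)           # how many SDs above the mean 95 is
```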
Density Curves
- In a skewed distribution, the mean and the median are different: the mean is pulled toward the long tail.
Empirical rule:
- 68-95-99.7
- For a normal distribution, about 68% of values fall within 1 standard deviation of the mean, about 95% within 2 standard deviations, and about 99.7% within 3 standard deviations.
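The 68-95-99.7 figures can be checked with the standard library's NormalDist:

```python
from statistics import NormalDist   # standard library, Python 3.8+

nd = NormalDist(mu=0, sigma=1)
for k in (1, 2, 3):
    # P(-k sigma < X < +k sigma) for a normal distribution
    p = nd.cdf(k) - nd.cdf(-k)
    print(f"within {k} SD: {p:.3%}")
# within 1 SD: 68.269%, within 2 SD: 95.450%, within 3 SD: 99.730%
```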
Bivariate:
- Bivariate data records two variables for each individual, typically as (x, y) pairs, so the relationship between the variables can be studied.
Correlation coefficient (r):
- The correlation coefficient is a measure of how well a line can describe the relationship between x and y.
- -1 ≤ r ≤ 1
- The closer r is to 1, the more accurately an upward-sloping line can describe the relationship.
- The closer r is to -1, the more accurately a downward-sloping line can describe the relationship.
- r can be computed as the average product of z-scores: r = (1/(n-1)) Σ z_x z_y.
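A sketch computing r both from the z-score formula above and with numpy, on made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical, roughly linear

# r as the average product of z-scores (using n - 1):
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
r = (zx * zy).sum() / (len(x) - 1)

# Same value from numpy's correlation matrix:
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
print(r)   # close to +1: strong upward linear relationship
```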
Introduction to residuals and least-squares regression:
- The residual for each observation is the difference between the observed value of y (the dependent variable) and the value predicted by the line: residual = actual y value - predicted y value.
Calculating the equation of a regression line:
- The equation for the least-squares regression line for predicting y from x is of the form:
- Estimated y (y-hat) = a + bx
- where a is the y-intercept and b is the slope.
- The slope and intercept can be found from summary statistics: b = r · s_y/s_x and a = ȳ - b·x̄ (the line passes through (x̄, ȳ)); see the sketch below.
Using least-squares regression output:
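A sketch recovering a and b from the summary-statistic formulas above and cross-checking against numpy's least-squares fit (same made-up data as before):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]
b = r * y.std(ddof=1) / x.std(ddof=1)   # slope: b = r * s_y / s_x
a = y.mean() - b * x.mean()             # intercept: line passes through (x-bar, y-bar)

# Cross-check against numpy's least-squares fit:
b_np, a_np = np.polyfit(x, y, 1)        # returns slope first, then intercept
assert np.isclose(b, b_np) and np.isclose(a, a_np)

y_hat = a + b * x                        # predicted values
print(f"y-hat = {a:.2f} + {b:.2f}x")
```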

Exploring two-variable quantitative data

R^2:
- R^2 is called the coefficient of determination.
- The coefficient of determination is a measure used to assess how well a model explains and predicts future outcomes.
- The standard deviation of the residuals, or S, measures the size of a typical prediction error in the y variable, so the units of S match the units of the y variable.
- One way to measure the fit of the line is to calculate the sum of the squared residuals.
- R-squared tells us what percent of the prediction error in the y variable is eliminated when we use least-squares regression on the x variable.
- The % of total variation that is not described by the regression line is (squared error of the line) / (total variation in y), so R^2 = 1 - (squared error of the line) / (total variation in y).

Standard deviation of residuals or root-mean-square deviation (RMSD)
- Root-mean-square deviation (RMSD), also called root-mean-square error (RMSE), is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far data points are from the regression line.
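A sketch computing R^2, S, and RMSD for the same hypothetical fit (note S divides by n - 2, the degrees of freedom left after fitting a two-parameter line):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)              # actual y - predicted y

ss_res = (residuals ** 2).sum()          # sum of squared residuals
ss_tot = ((y - y.mean()) ** 2).sum()     # total variation in y

r_squared = 1 - ss_res / ss_tot          # fraction of variation explained
s = np.sqrt(ss_res / (len(x) - 2))       # SD of residuals, n - 2 df
rmsd = np.sqrt((residuals ** 2).mean())  # root-mean-square deviation

print(r_squared, s, rmsd)
```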
Transforming nonlinear data:
- When the data is not linear, instead of trying to fit a linear regression line to it directly, we can transform it, e.g. take log(y); if the underlying relationship is exponential, the transformed data will be linear.
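A sketch with synthetic exponential data, where log(y) versus x becomes exactly linear:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 3.0 * 2.0 ** x                       # exponential data: y = 3 * 2^x

# y vs x is curved, but log(y) vs x is a straight line:
slope, intercept = np.polyfit(x, np.log10(y), 1)

print(slope, np.log10(2))                # slope recovers log10(2)
print(intercept, np.log10(3))            # intercept recovers log10(3)
```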

Collecting data:
Types of studies:
- Experimental studies tell whether a group shows different results in response to some kind of stimulus. These have treatment, control, and placebo groups.
- Observational studies consist of retrospective data (looking at the past) and prospective data (following subjects forward).
- It's important to know the difference between an association between variables and causality. E.g. concluding that teens who use smartphones are less happy because of the phones is an invalid causal claim, as the data could also be read the other way: teens who are less happy use their smartphones more.
Significant experiments:
- Re-randomisation test: shuffle the group labels many times and recompute the difference between groups each time. If a difference at least as extreme as the original shows up in less than five percent of the re-randomisations, the experiment's result is significant.
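A minimal re-randomisation sketch with made-up treatment/control data:

```python
import random

treatment = [12, 14, 15, 16, 17]     # hypothetical measurements
control = [9, 10, 11, 12, 13]
observed = sum(treatment) / 5 - sum(control) / 5   # observed difference in means

pooled = treatment + control
count = 0
trials = 10_000
for _ in range(trials):
    random.shuffle(pooled)                          # re-randomise group labels
    diff = sum(pooled[:5]) / 5 - sum(pooled[5:]) / 5
    if diff >= observed:                            # at least as extreme
        count += 1

p_value = count / trials
print(p_value, "significant" if p_value < 0.05 else "not significant")
```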

Probability
Bayes' theorem
- P(A|B) = P(B|A) · P(A) / P(B)
Mutually exclusive:
- In logic and probability theory, two events are mutually exclusive or disjoint if they cannot both occur at the same time. A clear example is the set of outcomes of a single coin toss, which can result in either heads or tails, but not both.
Discrete and continuous distributions
- A discrete probability distribution counts occurrences that have countable or finite outcomes. This is in contrast to a continuous distribution, where outcomes can fall anywhere on a continuum.
Terms
- Complement of an event is the set of all possible outcomes in a sample space that do not lead to the event.
  P(A') = 1 - P(A)
- Disjoint or mutually exclusive events are events that have no outcome in common.
- Union of events A and B is the set of all possible outcomes that lead to at least one of the two events A and B: (A ∪ B) or (A or B).
  P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
  If the events A and B are disjoint, then P(A ∩ B) = 0, so
  P(A ∪ B) = P(A) + P(B)
- Intersection of events A and B is the set of all possible outcomes that lead to both events occurring: (A ∩ B) or (A and B).
  P(A ∩ B) = P(A) · P(B|A) = P(B) · P(A|B)
- A conditional event (the basis of Bayes' theorem), A given B, is the set of outcomes for event A that occur given that B has occurred: (A|B).
  P(A|B) = P(A ∩ B)/P(B)
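These rules can be checked exactly on a single die roll; the events below are chosen for illustration:

```python
from fractions import Fraction

# One roll of a fair die: A = "even", B = "greater than 3".
sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}
B = {4, 5, 6}

def p(event):
    return Fraction(len(event), len(sample_space))

# Union rule: P(A or B) = P(A) + P(B) - P(A and B)
assert p(A | B) == p(A) + p(B) - p(A & B)   # 4/6 = 3/6 + 3/6 - 2/6
# Complement rule: P(A') = 1 - P(A)
assert p(sample_space - A) == 1 - p(A)
print(p(A | B))   # 2/3
```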
Random variables and probability distributions
Variance
- Variance is the square of the standard deviation (σ): Var(X) = σ^2.
- The mean of a random variable X is also known as the expected value, denoted by E(X).
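A sketch of E(X) and Var(X) for a small discrete distribution (X = number of heads in two fair coin flips):

```python
values = [0, 1, 2]
probs  = [0.25, 0.50, 0.25]

mean = sum(x * p for x, p in zip(values, probs))                # E(X)
var  = sum((x - mean) ** 2 * p for x, p in zip(values, probs))  # sigma^2
sd   = var ** 0.5                                               # sigma

print(mean, var, sd)   # 1.0, 0.5, ~0.707
```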
Theoretical probability
- Theoretical probability is computed from a model of the situation (e.g. equally likely outcomes), as opposed to experimental probability, which is estimated from observed trials.
Combining random variables
- For any two random variables: E(X ± Y) = E(X) ± E(Y).
- To combine the variances of two random variables, we need to know, or be willing to assume, that the two variables are independent; then Var(X ± Y) = Var(X) + Var(Y).
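A simulation sketch showing that variances add for independent variables (the parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(10, 3, size=1_000_000)   # independent draws, SD = 3
Y = rng.normal(5, 4, size=1_000_000)    # independent draws, SD = 4

# Var(X + Y) = Var(X) + Var(Y) = 9 + 16 = 25, so SD(X + Y) = 5
# (standard deviations do NOT add directly: 3 + 4 != 5).
print((X + Y).var())    # ~25
print((X - Y).var())    # variances also ADD for a difference: ~25
```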
Binomial variable / Geometric variable
- If your sample size is less than 10% of the population, it's not unreasonable to treat the count of successes when sampling without replacement as binomial.
- The geometric distribution is a special case of the negative binomial distribution. It deals with the number of trials required for a single success. Thus, the geometric distribution is a negative binomial distribution where the number of successes (r) is equal to 1.
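A sketch of the two probability mass functions from first principles (n and p are arbitrary):

```python
from math import comb

n, p = 10, 0.3

# Binomial: P(X = k) = C(n, k) p^k (1-p)^(n-k)
binom_pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
print(sum(binom_pmf))          # probabilities sum to 1
print(binom_pmf[3])            # P(exactly 3 successes in 10 trials)

# Geometric: P(first success on trial k) = (1-p)^(k-1) * p
geom_pmf = [(1 - p)**(k - 1) * p for k in range(1, 11)]
print(geom_pmf[0])             # P(success on the very first trial) = 0.3
```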
Combination
- xCy = (x choose y) = x!/[y!(x - y)!], the number of ways to choose y items from x.
- Standard deviation of a binomial variable: for n trials with success probability p, μ = np and σ = (np(1 - p))^0.5.
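A sketch checking the combination formula against math.comb, plus the binomial mean and SD formulas above:

```python
from math import comb, factorial, sqrt

x, y = 5, 2
# C(x, y) = x! / (y! (x - y)!)
assert comb(x, y) == factorial(x) // (factorial(y) * factorial(x - y))
print(comb(5, 2))   # 10 ways to choose 2 items from 5

# Mean and SD of a binomial variable with n trials, success probability p:
n, p = 10, 0.3
mu = n * p                      # mu = np
sigma = sqrt(n * p * (1 - p))   # sigma = sqrt(np(1 - p))
print(mu, sigma)                # 3.0, ~1.449
```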

Sampling distributions
Central limit theorem
- The central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution (more and more closely) as the sample size gets larger, regardless of the population's distribution. Sample sizes of 30 or more are often considered sufficient for the CLT to hold.
Kurtosis
- Kurtosis is a measure of the heaviness of a distribution's tails relative to a normal distribution.
Sampling distribution
- If np ≥ 10 and n(1 - p) ≥ 10, the sampling distribution of the sample proportion will be close to normal.
Normalcdf/normalpdf function
- Normalcdf/normalpdf arguments:
1. Lower bound:
2. Upper bound:
3. μ:
4. σ:
- Normalpdf gives the height of the normal density curve at a single point, given any mean and standard deviation (for a continuous distribution, the probability of any one exact value is 0).
- Normalcdf finds the probability of getting a value in a range of values on a normal curve given any mean and standard deviation (so normalcdf is just the area between two points).
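Equivalents of the calculator functions using the standard library; the mean and SD here are hypothetical:

```python
from statistics import NormalDist

nd = NormalDist(mu=100, sigma=15)   # hypothetical mean and SD

# normalpdf(x, mu, sigma): height of the density curve at a single x
print(nd.pdf(100))

# normalcdf(lower, upper, mu, sigma): area between two bounds
lower, upper = 85, 115
print(nd.cdf(upper) - nd.cdf(lower))   # ~0.683 (the 68 in 68-95-99.7)
```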
Standard error of the mean
- As you increase the sample size, the sampling distribution becomes more normal, while its standard deviation decreases.
- The standard deviation of the sampling distribution of the sample mean equals the standard deviation of the original population divided by the square root of n:
- σ_x̄ = σ/n^0.5
- μ_x̄ = μ
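A simulation sketch confirming σ_x̄ = σ/n^0.5 (the population parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 50, 10

for n in (4, 25, 100):
    # Draw many samples of size n and take each sample's mean
    means = rng.normal(mu, sigma, size=(20_000, n)).mean(axis=1)
    # SD of those sample means should match sigma / sqrt(n)
    print(n, means.std(), sigma / np.sqrt(n))
```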
Type 1 and 2 errors
- A Type I error is rejecting the null hypothesis when it is actually true (a false positive); a Type II error is failing to reject the null hypothesis when it is actually false (a false negative).

Inference for categorical data: proportions

Confidence/Z interval
- Standard error of p̂: SE = (p̂(1 - p̂)/n)^0.5
- A confidence interval (CI) gives a range of plausible values for the population parameter; with repeated sampling, the stated percentage of such intervals would capture the true value. The margin of error and the confidence level rise together: if the confidence level increases, the margin of error also increases.

- The conditions we need for inference on one proportion are (see the sketch after this list):
- Random: The data needs to come from a random sample or randomised experiment.
- Normal: The sampling distribution of p̂ needs to be approximately normal (needs at least 10 expected successes and 10 expected failures).
- Independent: Individual observations need to be independent. If sampling without replacement, our sample size shouldn't be more than 10% of the population.
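A sketch of a one-proportion z-interval under those conditions; the counts are made up:

```python
from statistics import NormalDist
from math import sqrt

successes, n = 62, 100            # hypothetical sample
p_hat = successes / n

# Normal condition: at least 10 successes and 10 failures
assert successes >= 10 and n - successes >= 10

se = sqrt(p_hat * (1 - p_hat) / n)          # standard error of p-hat
z_star = NormalDist().inv_cdf(0.975)        # critical z for 95% confidence (~1.96)
moe = z_star * se                           # margin of error

print(f"{p_hat:.2f} +- {moe:.3f}")          # 95% CI for the population proportion
```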
Hypothesis
- H_0: Null hypothesis, no difference
- H_a: Alternative hypothesis, new claim


Inference for quantitative data: Means

T value
- To find the critical t value, we need the degrees of freedom (df) and the percent in the tail(s).
Z vs T statistic
- If we are doing inference on a mean (population SD unknown), we use the t table; if it is a proportion, we use the z value.
- tcdf( computes the t-distribution probability between lowerbound and upperbound for the specified df (degrees of freedom), which must be > 0.

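A sketch of a one-sample t-interval and a tcdf equivalent using scipy; the data are made up:

```python
from statistics import mean, stdev
from scipy import stats

data = [12.1, 13.4, 11.8, 14.2, 12.9, 13.7, 12.5]   # hypothetical sample
n = len(data)
df = n - 1

t_star = stats.t.ppf(0.975, df)          # critical t for 95% confidence
moe = t_star * stdev(data) / n ** 0.5    # margin of error: t* x s/sqrt(n)
print(f"{mean(data):.2f} +- {moe:.2f}")

# tcdf(lower, upper, df) equivalent: area under the t distribution
print(stats.t.cdf(2.0, df) - stats.t.cdf(-2.0, df))
```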

Chi Squared / Inference for quantitative data: slopes

Chi
- The chi-square statistic compares observed counts with the counts expected under the null hypothesis: χ^2 = Σ (observed - expected)^2 / expected.
- The chi-square test of homogeneity tests whether different columns (or rows) of data in a table come from the same population or not.
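A sketch of the homogeneity test with scipy on a hypothetical two-way table:

```python
from scipy.stats import chi2_contingency

# Hypothetical table: rows = response, columns = group.
# Homogeneity asks: do the columns come from the same population?
observed = [
    [30, 45],   # "yes" counts in group A, group B
    [70, 55],   # "no" counts in group A, group B
]

chi2, p_value, df, expected = chi2_contingency(observed)
print(chi2, p_value, df)
print(expected)   # counts expected if the groups were identical
```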
Confidence intervals for the slope of a regression model
- A CI for the slope has the form b ± t* · SE_b, where SE_b is the standard error of the slope from the regression output and t* uses n - 2 degrees of freedom.
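A sketch of such an interval using scipy's linregress, whose stderr field is the standard error of the slope (data reused from the regression example):

```python
from scipy import stats

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

res = stats.linregress(x, y)         # least-squares fit
df = len(x) - 2                      # n - 2 degrees of freedom for a line
t_star = stats.t.ppf(0.975, df)      # critical t for 95% confidence

moe = t_star * res.stderr            # margin of error for the slope
print(f"slope: {res.slope:.3f} +- {moe:.3f}")   # CI: b +- t* x SE_b
```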

Review:
- Residuals
- Binomial cdf/pdf
- Probability
