The Basic Statistics Concepts for Data Analytics
Understanding the fundamentals of statistics is a core capability for becoming a Data
Scientist. Review these essential ideas that will be pervasive in your work and raise your
expertise in the field.
Statistics is a form of mathematical analysis that uses quantified models and
representations for a given set of experimental data or real-life studies. Its main
advantage is that it presents information in an easily digestible form. I recently reviewed
my statistics materials and organized them into the eight basic concepts every aspiring data
scientist should know:
Understand the Type of Analytics
Probability
Central Tendency
Variability
Relationship Between Variables
Probability Distributions
Hypothesis Testing and Statistical Significance
Regression
Understand the Type of Analytics
Descriptive Analytics tells us what happened in the past and helps a business understand
how it is performing by providing context to help stakeholders interpret information.
Diagnostic Analytics takes descriptive data a step further and helps you understand why
something happened in the past.
Predictive Analytics predicts what is most likely to happen in the future and provides
companies with actionable insights based on the information.
Prescriptive Analytics provides recommendations regarding actions that will take
advantage of the predictions and guide the possible actions toward a solution.
Probability
Probability is the measure of the likelihood that an event will occur in a Random
Experiment.
Complement: P(A) + P(A’) = 1
Intersection: P(A∩B) = P(A|B)P(B), which reduces to P(A)P(B) when A and B are independent
Union: P(A∪B) = P(A) + P(B) − P(A∩B)
Conditional Probability: P(A|B) is a measure of the probability of one event occurring
with some relationship to one or more other events. P(A|B)=P(A∩B)/P(B), when P(B)>0.
Independent Events: Two events are independent if the occurrence of one does not affect
the probability of occurrence of the other. P(A∩B) = P(A)P(B), and equivalently
P(A|B) = P(A) and P(B|A) = P(B), when P(A) ≠ 0 and P(B) ≠ 0.
Mutually Exclusive Events: Two events are mutually exclusive if they cannot both occur at
the same time. P(A∩B)=0 and P(A∪B)=P(A)+P(B).
Bayes’ Theorem describes the probability of an event based on prior knowledge of
conditions that might be related to the event.
Bayes’ Theorem: P(A|B) = P(B|A)P(A) / P(B), when P(B) > 0.
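To make these rules concrete, here is a minimal Python sketch with made-up probabilities for two events A and B; the numbers are purely illustrative.

```python
# Minimal sketch of the basic probability rules.
# Assumed (illustrative) probabilities: P(A) = 0.3, P(B) = 0.5, P(A and B) = 0.15.
p_a, p_b, p_a_and_b = 0.3, 0.5, 0.15

p_not_a = 1 - p_a                      # Complement: P(A') = 1 - P(A)
p_a_or_b = p_a + p_b - p_a_and_b       # Union: P(A U B)
p_a_given_b = p_a_and_b / p_b          # Conditional: P(A|B) = P(A and B) / P(B)
p_b_given_a = p_a_and_b / p_a          # Conditional: P(B|A) = P(A and B) / P(A)

# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b_bayes = p_b_given_a * p_a / p_b

# Here P(A and B) = P(A) * P(B), so A and B are independent and P(A|B) = P(A).
print(p_not_a, p_a_or_b, p_a_given_b, p_a_given_b_bayes)
```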
Central Tendency
Mean: The average of the dataset.
Median: The middle value of an ordered dataset.
Mode: The most frequent value in the dataset. If several values occur equally most often,
the distribution is multimodal.
Skewness: A measure of the asymmetry of a distribution about its mean.
Kurtosis: A measure of whether the data are heavy-tailed or light-tailed relative to a
normal distribution.
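A minimal sketch of these measures in Python, assuming NumPy and SciPy are available and using a small made-up sample:

```python
import numpy as np
from scipy import stats

data = np.array([2, 3, 3, 5, 7, 8, 8, 8, 10])   # illustrative sample

print("mean:    ", np.mean(data))
print("median:  ", np.median(data))
print("mode:    ", stats.mode(data, keepdims=False).mode)  # most frequent value
print("skewness:", stats.skew(data))      # asymmetry about the mean
print("kurtosis:", stats.kurtosis(data))  # excess kurtosis relative to a normal distribution
```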
Variability
Range: The difference between the highest and lowest value in the dataset.
Percentiles, Quartiles and Interquartile Range (IQR)
Percentiles — A measure that indicates the value below which a given
percentage of observations in a group of observations falls.
Quartiles — Values that divide the ordered data points into four more or less
equal parts, or quarters.
Interquartile Range (IQR)— A measure of statistical dispersion and variability
based on dividing a data set into quartiles. IQR = Q3 − Q1
Variance: The average squared difference of the values from the mean; it measures how
spread out a set of data is relative to the mean.
Standard Deviation: The square root of the variance; it measures the typical difference
between each data point and the mean.
Population variance: σ² = Σ(xᵢ − μ)² / N; sample variance: s² = Σ(xᵢ − x̄)² / (n − 1). The
population and sample standard deviations are their square roots.
Standard Error (SE): An estimate of the standard deviation of the sampling distribution of a
statistic, most commonly the sample mean: SE = σ / √n when the population standard
deviation is known, or SE = s / √n when estimated from a sample.
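The same kind of sketch for the variability measures, again on a made-up sample:

```python
import numpy as np

data = np.array([2, 3, 3, 5, 7, 8, 8, 8, 10])   # illustrative sample
n = len(data)

value_range = data.max() - data.min()    # Range
q1, q3 = np.percentile(data, [25, 75])   # first and third quartiles
iqr = q3 - q1                            # Interquartile Range: Q3 - Q1

sample_var = np.var(data, ddof=1)        # sample variance (divide by n - 1)
sample_std = np.std(data, ddof=1)        # sample standard deviation
std_error = sample_std / np.sqrt(n)      # standard error of the mean: s / sqrt(n)

print(value_range, iqr, sample_var, sample_std, std_error)
```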
Relationship Between Variables
Causality: Relationship between two events where one event is affected by the other.
Covariance: A quantitative measure of the joint variability between two or more variables.
Correlation: Measures the strength of the relationship between two variables; it is the
normalized version of covariance and always ranges from −1 to 1.
Sample covariance: cov(X, Y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1); correlation: ρ(X, Y) = cov(X, Y) / (σX σY).
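A minimal sketch of covariance and correlation for two illustrative variables x and y:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

cov_xy = np.cov(x, y)[0, 1]         # sample covariance between x and y
corr_xy = np.corrcoef(x, y)[0, 1]   # Pearson correlation, always between -1 and 1

print(cov_xy, corr_xy)
```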
Probability Distributions
Probability Distribution Functions
Probability Mass Function (PMF): A function that gives the probability that a discrete
random variable is exactly equal to some value.
Probability Density Function (PDF): A function for continuous data where the value at
any given sample can be interpreted as providing a relative likelihood that the value of the
random variable would equal that sample.
Cumulative Distribution Function (CDF): A function that gives the probability that a random
variable is less than or equal to a certain value.
Continuous Probability Distribution
Uniform Distribution: Also called a rectangular distribution, is a probability distribution
where all outcomes are equally likely.
Normal/Gaussian Distribution: The curve of the distribution is bell-shaped and
symmetrical. It is closely tied to the Central Limit Theorem, which states that the sampling
distribution of the sample mean approaches a normal distribution as the sample size gets larger.
Exponential Distribution: A probability distribution of the time between the events in
a Poisson point process.
Chi-Square Distribution: The distribution of the sum of squared standard normal
deviates.
Discrete Probability Distribution
Bernoulli Distribution: The distribution of a random variable from a single trial with only
two possible outcomes: 1 (success) with probability p and 0 (failure) with probability 1 − p.
Binomial Distribution: The distribution of the number of successes in a sequence
of n independent experiments, each with only two possible outcomes: 1 (success) with
probability p and 0 (failure) with probability 1 − p.
Poisson Distribution: The distribution that expresses the probability of a given number of
events k occurring in a fixed interval of time, when these events occur with a known constant
average rate λ and independently of the time since the last event.
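A minimal sketch of evaluating PMFs, PDFs, and CDFs for some of these distributions with scipy.stats; the parameters are arbitrary examples:

```python
from scipy import stats

# Discrete distributions: the PMF gives P(X = k), the CDF gives P(X <= k)
print(stats.binom.pmf(3, n=10, p=0.3))   # Binomial: P(X = 3) for n = 10, p = 0.3
print(stats.binom.cdf(3, n=10, p=0.3))   # Binomial: P(X <= 3)
print(stats.poisson.pmf(2, mu=4))        # Poisson:  P(X = 2) with rate lambda = 4

# Continuous distributions: the PDF gives the density, the CDF gives P(X <= x)
print(stats.norm.pdf(0.0))               # standard normal density at 0
print(stats.norm.cdf(1.96))              # P(X <= 1.96), about 0.975
print(stats.expon.cdf(2.0, scale=2.0))   # Exponential with rate 0.5 (scale = 1 / rate)
```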
Hypothesis Testing and Statistical Significance
Null and Alternative Hypothesis
Null Hypothesis: A general statement that there is no relationship between two measured
phenomena or no association among groups.
Alternative Hypothesis: The statement contrary to the null hypothesis, asserting that a
relationship or effect does exist.
In statistical hypothesis testing, a type I error is the rejection of a true null hypothesis,
while a type II error is the non-rejection of a false null hypothesis.
Interpretation
P-value: The probability of observing a test statistic at least as extreme as the one observed,
given that the null hypothesis is true. When the p-value > α we fail to reject the null
hypothesis; when the p-value ≤ α we reject the null hypothesis and can conclude that we
have a significant result.
Critical Value: A point on the scale of the test statistic beyond which we reject the null
hypothesis; it is derived from the level of significance α of the test. It depends on the test
statistic, which is specific to the type of test, and on the significance level α, which defines
the sensitivity of the test.
Significance Level and Rejection Region: The rejection region is actually dependent on
the significance level. The significance level is denoted by α and is the probability of
rejecting the null hypothesis if it is true.
Z-Test
A Z-test is any statistical test for which the distribution of the test statistic under the null
hypothesis can be approximated by a normal distribution and tests the mean of a
distribution in which we already know the population variance. Therefore, many statistical
tests can be conveniently performed as approximate Z-tests if the sample size is large or
the population variance is known.
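A minimal sketch of a two-sided one-sample Z-test, assuming the population standard deviation is known; the sample and parameters are made up:

```python
import numpy as np
from scipy import stats

sample = np.array([5.2, 4.9, 5.4, 5.1, 5.3, 5.0, 5.2, 4.8])  # illustrative data
mu_0 = 5.0     # hypothesized population mean under the null hypothesis
sigma = 0.2    # known population standard deviation

z = (sample.mean() - mu_0) / (sigma / np.sqrt(len(sample)))
p_value = 2 * (1 - stats.norm.cdf(abs(z)))   # two-sided p-value

print(z, p_value)   # reject H0 if p_value <= alpha
```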
T-Test
A T-test is used when the population variance is unknown and the sample size is small (n < 30).
Paired sample means that we collect data twice from the same group, person, item, or
thing. Independent sample implies that the two samples must have come from two
completely different populations.
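A minimal sketch of the independent and paired t-tests with scipy.stats on two illustrative samples:

```python
from scipy import stats

group_a = [5.1, 4.9, 5.6, 5.2, 5.0, 5.4]   # illustrative measurements
group_b = [4.6, 4.8, 4.5, 5.0, 4.7, 4.4]

# Independent two-sample t-test: two completely different groups
t_ind, p_ind = stats.ttest_ind(group_a, group_b)

# Paired t-test: the same group measured twice (e.g., before and after)
t_rel, p_rel = stats.ttest_rel(group_a, group_b)

print(t_ind, p_ind)
print(t_rel, p_rel)
```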
ANOVA (Analysis of Variance)
ANOVA is a way to find out whether experimental results are significant. One-way
ANOVA compares the means of two or more independent groups using a single independent
variable. Two-way ANOVA is the extension of one-way ANOVA that uses two independent
variables to estimate the main effects and the interaction effect.
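A minimal sketch of a one-way ANOVA across three made-up groups using scipy.stats:

```python
from scipy import stats

group_1 = [23, 25, 27, 22, 26]   # illustrative measurements per group
group_2 = [30, 31, 29, 32, 28]
group_3 = [24, 26, 25, 27, 23]

f_stat, p_value = stats.f_oneway(group_1, group_2, group_3)
print(f_stat, p_value)   # a small p-value suggests at least one group mean differs
```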
Chi-Square Test
Chi-Square statistic: χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ, where Oᵢ is the observed count and Eᵢ is the expected count in each category.
The Chi-Square Test checks whether the observed counts of a discrete (categorical) variable
differ from the counts expected under a hypothesized distribution. The Goodness of Fit Test
determines whether a sample of one categorical variable matches a hypothesized population
distribution. The Chi-Square Test for Independence compares two categorical variables to see
if there is a relationship between them.
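A minimal sketch of both chi-square tests with scipy.stats, using made-up observed counts:

```python
import numpy as np
from scipy import stats

# Goodness of Fit: do observed counts match the expected distribution?
observed = np.array([18, 22, 20, 25, 15])
expected = np.array([20, 20, 20, 20, 20])
chi2_gof, p_gof = stats.chisquare(observed, f_exp=expected)

# Test for Independence: are two categorical variables related?
contingency = np.array([[30, 10],
                        [20, 40]])
chi2_ind, p_ind, dof, expected_counts = stats.chi2_contingency(contingency)

print(chi2_gof, p_gof)
print(chi2_ind, p_ind)
```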
Regression
Linear Regression
Assumptions of Linear Regression
Linear Relationship
Multivariate Normality
No or Little Multicollinearity
No or Little Autocorrelation
Homoscedasticity
Linear Regression is a linear approach to modeling the relationship between a dependent
variable and one independent variable. An independent variable is a variable that is
controlled in a scientific experiment to test the effects on the dependent variable.
A dependent variable is a variable being measured in a scientific experiment.
Simple linear regression: y = β₀ + β₁x + ε, where β₀ is the intercept, β₁ is the slope, and ε is the error term.
Multiple Linear Regression is a linear approach to modeling the relationship between a
dependent variable and two or more independent variables.
Multiple linear regression: y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ + ε.
Steps for Running the Linear Regression
Step 1: Understand the model description, causality, and directionality
Step 2: Check the data, categorical data, missing data, and outliers
An outlier is a data point that differs significantly from the other observations. We can
detect outliers with the standard deviation method or the interquartile range (IQR) method.
A dummy variable takes only the value 0 or 1 to represent a categorical variable in the model.
Step 3: Simple Analysis — Check the relationships between the dependent variable and each
independent variable, and between pairs of independent variables
Use scatter plots to check the correlation
Multicollinearity occurs when two or more independent variables are highly
correlated. We can measure it with the Variance Inflation Factor (VIF): if VIF > 5 the
variables are highly correlated, and if VIF > 10 there is certainly multicollinearity
among the variables.
An interaction term implies that the slope of one independent variable changes across the values of another variable.
Step 4: Multiple Linear Regression — Check the model and the correct variables
Step 5: Residual Analysis
Check that the residuals are approximately normally distributed.
Homoscedasticity describes a situation in which the error term is the same
across all values of the independent variables and means that the residuals are
equal across the regression line.
Step 6: Interpretation of Regression Output
R-Squared is a statistical measure of fit that indicates how much of the variation in the
dependent variable is explained by the independent variables. A higher R-Squared value
indicates smaller differences between the observed data and the fitted values.
P-value
Regression Equation
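Finally, a minimal sketch that puts Steps 4 to 6 together with statsmodels on a small synthetic dataset; the coefficients and noise level are made up:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(scale=0.5, size=n)   # synthetic data

X = sm.add_constant(np.column_stack([x1, x2]))   # add the intercept term
model = sm.OLS(y, X).fit()                       # multiple linear regression

print(model.params)     # estimated coefficients of the regression equation
print(model.pvalues)    # p-value for each coefficient
print(model.rsquared)   # R-squared: variation explained by the model
print(model.resid[:5])  # residuals for the residual analysis in Step 5
```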