Statistics
Statistics is the science of collecting, organizing, analysing, and interpreting data to draw meaningful conclusions. Examples of where it is applied:
1. Business - Data analysis (identifying customer behaviour) and demand forecasting
2. Medical - Identifying the efficacy of new medicines (clinical trials), identifying risk factors for diseases (epidemiology)
3. Government & Politics - Conducting surveys, polling
4. Environmental Science - Climate research
Types of Statistics
1. Descriptive Statistics
Descriptive statistics deals with the collection, organization, analysis, interpretation, and
presentation of data in such a way that it is easy to understand and interpret.
It focuses on summarizing and describing the main features of a set of data, without making
inferences or predictions about the larger population.
It helps us better understand the data we are looking at by summarizing the important information,
such as the highest and lowest value, the middle value (median), and how spread out the data is
(range and variance). These pieces of information help us draw conclusions about the group we are studying.
2. Inferential Statistics
Inferential statistics is the branch of statistics that makes inferences (or predictions) about the
population dataset based on the sample dataset. It involves hypothesis testing, a process of
using statistical methods to determine whether the hypothesis about the population is likely
true.
Inferential statistics are widely used in Scientific & Market Research and social sciences to
make predictions, test hypotheses, and make decisions based on a solid understanding of the
data. It also helps to minimize errors and biases in the result.
OR
Inferential statistics is a branch of statistics that deals with making inferences or predictions
about a larger population based on a sample of data. It involves using statistical techniques to
test hypotheses and draw conclusions from data. Some of the topics that come under
inferential statistics are:
1. Hypothesis testing: This involves testing a hypothesis about a population parameter based
on a sample of data. For example, testing whether the mean height of a population is
different from a given value.
2. Confidence intervals: This involves estimating the range of values that a population
parameter could take based on a sample of data. For example, estimating the population
mean height within a given confidence level.
3. Analysis of variance (ANOVA): This involves comparing means across multiple groups to
determine if there are any significant differences. For example, comparing the mean
height of individuals from different regions.
4. Chi-square tests: This involves testing the independence or association between two
categorical variables. For example, testing whether gender and occupation are
independent variables.
5. Sampling techniques: This involves ensuring that the sample of data is representative of
the population. For example, using random sampling to select individuals from a
population.
A population is the entire group of individuals or objects we want to study. A sample, on the
other hand, is a subset of the population. It is a smaller group of individuals or
objects that we select from the population to study. Samples are used to estimate characteristics
of the population, such as the mean or the proportion with a certain attribute. For example, we
might randomly select 100 students.
Examples
1. All cricket fans (population) vs. the fans present in the stadium (sample)
2. All students of a college (population) vs. the students who attend lectures (sample)
Parameter Vs Statistics
A parameter is a characteristic of a population, while a statistic is a characteristic of a sample.
Parameters are generally unknown and are estimated using statistics. The goal of statistical
inference is to use the information obtained from the sample to make inferences about the
population parameters.
What is Data?
Data is defined as a collection of numbers, characters, images, and other items that can be arranged
in some manner to form meaningful information. In statistics, data is mainly a collection of numbers
that is first studied, then analysed, and finally presented in a way that yields meaningful insight.
• Types of Data
Numerical variables
a) Discrete variable
Observations can only take values based on a count from a set of distinct whole values. A discrete
variable cannot take the value of a fraction.
Ex - Number of registered cars, number of business locations, and number of children in a family,
all of which are measured as whole units (i.e., 1, 2, 3 cars).
b) Continuous variable
Observations can take any value within a range, including fractional values. Continuous variables
are typically obtained by measuring rather than counting.
Ex - Height, weight, temperature, and time.
Categorical variables
Categorical variables have values that describe a 'quality' or 'characteristic' of a data unit, like
'what type' or 'which category'. Categorical variables fall into mutually exclusive (in one
category or in another) and exhaustive (include all possible options) categories. Therefore,
categorical variables are qualitative variables and tend to be represented by a non-numeric
value.
a) Nominal variable
It can take a value that describes a category with no intrinsic order or ranking, e.g., gender,
eye colour, or country of birth.
b) Ordinal variable
It can take a value that can be logically ordered or ranked. The categories associated with
ordinal variables can be ranked higher or lower than one another, but do not necessarily establish
a numeric difference between each category.
Ex - Academic grades (i.e., A, B, C), clothing size (i.e., small, medium, large, extra large), and
attitudes (i.e., strongly agree, agree, disagree, strongly disagree).
# Descriptive Statistics
1. Mean
The mean (or average) is the most popular and well known measure of central tendency. It can be
used with both discrete and continuous data, although its use is most often with continuous data.
The mean is equal to the sum of all the values in the data set divided by the number of values in the
data set.
2. Median
The median is the middle value in the dataset when the data is arranged in order.
For example, with 11 marks arranged in order, the median mark is the middle (6th) mark - in this
case, 56 - because there are 5 scores before it and 5 scores after it. With an even number of
values, the median is the average of the two middle values.
3. Mode
The mode is the most frequent element in our data set.
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
The most commonly occurring value is 54, therefore the mode of this distribution is 54.
4. Weighted Mean
The weighted mean is the sum of the products of each value and its weight, divided by the
sum of the weights. It is used to calculate a mean when the values in the dataset have
different importance or frequency.
5. Trimmed Mean
A trimmed mean is calculated by removing a certain percentage of the smallest and largest
values from the dataset and then taking the mean of the remaining values. The percentage of
values removed is called the trimming percentage.
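As an illustrative sketch (not part of the original notes), the measures above can be computed with NumPy and SciPy; the marks, values, and weights below are made-up data:

```python
# Illustrative sketch of the central-tendency measures above (made-up data).
import numpy as np
from scipy import stats
from collections import Counter

marks = np.array([54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60])

mean = np.mean(marks)                                  # sum of values / number of values
median = np.median(marks)                              # middle value of the sorted data
mode = Counter(marks.tolist()).most_common(1)[0][0]    # most frequent value

# Weighted mean: each value is multiplied by its weight before averaging.
values = np.array([60, 70, 80])
weights = np.array([1, 2, 3])                          # hypothetical importance of each value
weighted_mean = np.average(values, weights=weights)

# Trimmed mean: drop 10% of values from each end, then average the rest.
trimmed_mean = stats.trim_mean(marks, proportiontocut=0.1)

print(mean, median, mode, weighted_mean, trimmed_mean)
```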
➢ Measure of Dispersion
A measure of dispersion is a statistical measure that describes the spread or variability of a
dataset. It provides information about how the data is distributed around the central tendency
(mean, median or mode) of the dataset.
The coefficient of variation (CV) is a statistical measure that expresses the amount of variability
in a dataset relative to the mean. It is a dimensionless quantity that is expressed as a
percentage.
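A minimal sketch of common dispersion measures and the coefficient of variation, using invented data:

```python
# Illustrative sketch of dispersion measures (made-up data).
import numpy as np

data = np.array([12, 15, 14, 10, 18, 20, 16])

data_range = data.max() - data.min()                            # spread between largest and smallest value
variance = np.var(data, ddof=1)                                 # sample variance
std_dev = np.std(data, ddof=1)                                  # sample standard deviation
iqr = np.percentile(data, 75) - np.percentile(data, 25)         # interquartile range

# Coefficient of variation: standard deviation relative to the mean, as a percentage.
cv = std_dev / np.mean(data) * 100

print(data_range, variance, std_dev, iqr, cv)
```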
In statistics, frequency refers to the number of times a value occurs in a set of data. Three related
measures are commonly used (a small sketch follows this list):
1. Absolute Frequency: the raw count of how many times a value occurs.
2. Relative Frequency: the absolute frequency divided by the total number of observations, expressed as a proportion or percentage.
3. Cumulative Frequency: the running total of frequencies up to and including the current value or class.
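As an illustrative sketch (the grades below are made up), the three frequencies can be tabulated with pandas:

```python
# Absolute, relative, and cumulative frequency of a small made-up grades list.
import pandas as pd

grades = pd.Series(["A", "B", "B", "C", "A", "B", "C", "C", "C", "A"])

absolute = grades.value_counts().sort_index()   # how many times each grade occurs
relative = absolute / absolute.sum()            # proportion of the total (sums to 1)
cumulative = absolute.cumsum()                  # running total of the absolute counts

print(pd.DataFrame({"absolute": absolute,
                    "relative": relative,
                    "cumulative": cumulative}))
```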
➢ Graphical Representation
Graphical representation is a way to visualize data using charts and graphs. It's a tool used in
statistics to analyze numerical data, identify patterns, and make predictions.
- Histogram
- Boxplot
- Scatterplot
1. Histogram
A histogram groups numeric data into bins and shows the frequency of observations in each bin,
making the shape of the distribution visible. In a negatively (left) skewed distribution the tail
extends to the left and the mean is less than the median; in a positively (right) skewed distribution
the tail extends to the right and the mean is greater than the median.
➢ Quantiles and Percentiles
Quantiles are statistical measures used to divide a set of numerical data into equal-sized groups,
with each group containing an equal number of observations.
Quantiles are important measures of variability and can be used to: understand distribution of
data, summarize and compare different datasets. They can also be used to identify outliers.
1. Quartiles: Divide the data into four equal parts, Q1 (25th percentile), Q2
(50th percentile or median), and Q3 (75th percentile).
2. Deciles: Divide the data into ten equal parts, D1 (10th percentile), D2
(20th percentile), ..., D9 (90th percentile).
3. Percentiles: Divide the data into 100 equal parts, P1 (1st percentile), P2
(2nd percentile), ..., P99 (99th percentile).
The five-number summary consists of five values:
1. Minimum: The smallest value in the dataset.
2. First quartile (Q1): The value that separates the lowest 25% of the data from
the rest of the dataset.
3. Median (Q2): The value that separates the lowest 50% from the highest 50%
of the data.
4. Third quartile (Q3): The value that separates the lowest 75% of the data from
the highest 25% of the data.
5. Maximum: The largest value in the dataset.
The five-number summary is often represented visually using a box plot, which displays the range
of the dataset, the median, and the quartiles.
The five-number summary is a useful way to quickly summarize the central
tendency, variability, and distribution of a dataset.
Interquartile Range
The interquartile range (IQR) is a measure of variability that is based on the five-number summary
of a dataset. Specifically, the IQR is defined as the difference between the third quartile (Q3) and
the first quartile (Q1) of a dataset.
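A short sketch, on invented scores, of quartiles, a percentile, the five-number summary, and the IQR with NumPy:

```python
# Quartiles, a percentile, the five-number summary, and the IQR (made-up scores).
import numpy as np

scores = np.array([35, 42, 47, 51, 56, 58, 61, 64, 70, 78, 85])

q1, q2, q3 = np.percentile(scores, [25, 50, 75])   # quartiles
p90 = np.percentile(scores, 90)                    # 90th percentile (9th decile)

five_number_summary = (scores.min(), q1, q2, q3, scores.max())
iqr = q3 - q1                                      # interquartile range

print(five_number_summary, iqr, p90)
```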
2. Boxplot
A box plot, also known as a box-and-whisker plot, is a graphical representation of a
dataset that shows the distribution of the data. The box plot displays a summary of the
data, including the minimum and maximum values, the first quartile (Q1), the median
(Q2), and the third quartile (Q3).
OR
The idea of the box plot was presented by John Tukey in 1970. He wrote about it in his book
"Exploratory Data Analysis" in 1977. The box plot is also known as a whisker plot, box-and-whisker plot,
or simply a box-and-whisker diagram. A box plot is a graphical representation of the distribution of a
dataset. It displays key summary statistics such as the median, quartiles, and potential outliers in a
concise and visual manner. A box plot therefore provides a summary of the distribution, helps identify
potential outliers, and allows different datasets to be compared in a compact and visual manner.
3. Scatterplot
Scatter plots' primary use is to observe and show relationships between two numeric
variables. The dots in a scatter plot not only report the values of individual data points, but
also reveal patterns when the data are taken as a whole.
Identification of correlational relationships is common with scatter plots. In these cases,
we want to know what a good prediction for the vertical value would be, given a particular
horizontal value. You will often see the variable on the horizontal axis
denoted an independent variable, and the variable on the vertical axis the dependent
variable. Relationships between variables can be described in many ways: positive or
negative, strong or weak, linear or nonlinear.
A scatter plot can also be useful for identifying other patterns in data. We can divide data
points into groups based on how closely sets of points cluster together. Scatter plots can
also show if there are any unexpected gaps in the data and if there are any outlier points.
This can be useful if we want to segment the data into different parts, like in the
development of user personas.
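The following is a minimal matplotlib sketch of the three plot types discussed in this section, using randomly generated data purely for illustration:

```python
# Histogram, box plot, and scatter plot on synthetic data (illustration only).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
heights = rng.normal(loc=170, scale=8, size=200)       # synthetic continuous variable
weights = 0.5 * heights + rng.normal(0, 5, size=200)   # roughly linearly related variable

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].hist(heights, bins=20)           # histogram: shape of the distribution
axes[0].set_title("Histogram")

axes[1].boxplot(heights)                 # box plot: median, quartiles, outliers
axes[1].set_title("Box plot")

axes[2].scatter(heights, weights, s=10)  # scatter plot: relationship between two variables
axes[2].set_title("Scatter plot")

plt.tight_layout()
plt.show()
```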
➢ Outlier
An outlier in statistics is a data point that significantly deviates from the other observations in a
dataset. It stands out because it is unusually high or low compared to the rest of the data. Outliers
can occur due to variability in the data, errors in measurement, or other factors. Identifying and
dealing with outliers is important because they can influence the results of statistical analyses and
lead to misleading conclusions.
- Types of Outliers
1. Univariate Outliers
These are outliers in a single variable. For instance, in a dataset of students' test scores, a score of 100 in
a class where most scores are between 50 and 70 would be considered a univariate outlier.
2. Multivariate Outliers
These occur when considering multiple variables together. A data point may not be an outlier in any
single variable but can be an outlier when considering its relationship with other variables. For example,
a person with an unusually high salary for their age and experience level could be a multivariate outlier.
3. Point Outliers
These are individual data points that are far removed from the rest of the data. They are the simplest
form of outliers and are often easy to spot on a scatter plot.
4. Contextual Outliers
These are outliers when considered in a specific context. For example, a temperature reading of 30°C
might be normal in the summer but would be an outlier in the winter.
5. Collective Outliers
These are a group of data points that are outliers with respect to the entire dataset but not when taken
individually. For instance, a sequence of unusual but individually typical events in a time-series dataset.
- Causes of Outliers
1. Data Entry Errors: Simple human errors in entering data can create extreme values.
2. Measurement Error: Faulty device or experimental setup problems can cause abnormally
high or low readings.
3. Experimental Errors: Flaws in experimental design might produce data points that do not
represent what they are supposed to measure.
4. Intentional Outliers: In some cases, data might be manipulated deliberately to produce
outlier effects, often seen in fraud cases.
5. Data Processing Errors: During the collection and processing stages, technical glitches can
introduce erroneous data.
6. Natural Variation: Inherent variability in the underlying data can also lead to outliers.
- How Can Outliers Be Identified?
Identifying outliers is crucial in data analysis, as they can significantly influence the results of statistical tests.
Here are several common methods to identify outliers:
1. Visual Inspection
• Box Plot: A graphical representation that displays the distribution of data and highlights outliers.
Outliers are typically shown as individual points beyond the "whiskers" of the box plot.
• Scatter Plot: Useful for identifying outliers in a bivariate dataset. Points that fall far from the main
cluster are potential outliers.
2. Statistical Methods
• Z-Score: Measures how many standard deviations a data point is from the mean. Typically, a z-score
greater than 3 or less than -3 indicates an outlier.
• Interquartile Range (IQR): Outliers are identified as values that fall below Q1 − 1.5 × IQR or
above Q3 + 1.5 × IQR, where Q1 and Q3 are the first and third quartiles, respectively.
• Grubbs' Test: A statistical test specifically designed to detect outliers in a univariate dataset. It tests the
hypothesis that the maximum or minimum value in a dataset is an outlier.
3. Machine Learning Methods
• Isolation Forest: An algorithm that isolates observations by randomly selecting a feature and splitting
the data. Outliers are points that require fewer splits to be isolated.
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Clusters data points based
on their density. Points that do not fit into any cluster are considered outliers.
4. Context-Specific Methods
• Domain Knowledge: Sometimes, domain knowledge is essential to identify outliers. For example, a
medical professional may identify outliers in health data based on their expertise.
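A small sketch of the z-score and IQR rules from the list above, applied to made-up data that includes one extreme value:

```python
# Flagging outliers with the z-score rule and the IQR rule (made-up data).
import numpy as np

data = np.array([52, 55, 56, 58, 60, 61, 62, 63, 65, 120])

# Z-score rule: flag points more than 3 standard deviations from the mean.
# (With very small samples the IQR rule below is usually more sensitive.)
z_scores = (data - data.mean()) / data.std(ddof=1)
z_outliers = data[np.abs(z_scores) > 3]

# IQR rule: flag points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = data[(data < lower) | (data > upper)]

print(z_outliers, iqr_outliers)
```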
➢ Covariance
Covariance in statistics measures the degree to which two random variables change together. If
they tend to increase and decrease together, the covariance is positive. If one increases while the
other decreases, the covariance is negative.
Types of Covariance:
1. Positive Covariance: When one variable increases, the other variable tends to increase as
well, and vice versa.
2. Negative Covariance: When one variable increases, the other variable tends to decrease.
3. Zero Covariance: There is no linear relationship between the two variables; they move
independently of each other.
Covariance:
1. It describes the relationship between a pair of random variables in which a change in one variable
is associated with a change in the other variable.
2. It can take any value between – infinity to +infinity, where the negative value represents the
negative relationship whereas a positive value represents the positive relationship.
3. It is used for the linear relationship between variables.
4. It gives the direction of relationship between variables.
➢ Correlation
Correlation in statistics measures the strength and direction of the relationship between two
variables. Unlike covariance, correlation is standardized, so it ranges from -1 to 1: a value of +1
indicates a perfect positive linear relationship, -1 a perfect negative linear relationship, and 0 no
linear relationship, with intermediate values indicating weaker relationships.
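A minimal NumPy sketch contrasting covariance and correlation on invented x/y values:

```python
# Covariance vs. correlation on made-up data.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 6], dtype=float)

covariance = np.cov(x, y)[0, 1]        # sign shows direction, but magnitude depends on units
correlation = np.corrcoef(x, y)[0, 1]  # standardized: always between -1 and +1

print(covariance, correlation)
```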
➢ Probability
Probability can be defined as the ratio of the number of favourable outcomes to the total
number of outcomes of an event. For an experiment having n possible outcomes, if the number of
favourable outcomes is denoted by x, then P(Event) = x / n.
• Types of events
In probability, an "event" is a set of outcomes from a random experiment. Each outcome is
a possible result of the experiment.
1. Simple Event: An event that consists of a single outcome. For example, getting a head when flipping a
coin is a simple event.
2. Compound Event: An event that consists of two or more simple events. For example, rolling a 2 and
then a 4 in two rolls of a die is a compound event.
3. Mutually Exclusive (Disjoint) Events: Two events that cannot occur at the same time. For example,
getting a 3 and getting a 5 in a single roll of a die are mutually exclusive events.
4. Non-Mutually Exclusive Events: Events that can occur simultaneously. For example, being a student
and having a part-time job are non-mutually exclusive events since both can happen at the same time.
5. Independent Events: Two events are independent if the occurrence of one event does not affect the
occurrence of the other. For example, flipping a coin and rolling a die are independent events.
6. Dependent Events: Two events are dependent if the occurrence of one event affects the occurrence of
the other. For example, drawing two cards from a deck without replacement is a set of dependent events.
7. Complementary Events: The complement of an event A is the event that A does not occur. The
probability of the complement of A is 1 − P(A).
8. Equally Likely Events: Events that have the same probability of occurring. For example, the six
possible faces when rolling a fair die are equally likely events.
9. Exhaustive Events: A set of events is exhaustive if at least one of the events must occur. For example,
when flipping a coin, the events "getting a head" and "getting a tail" are exhaustive.
➢ Conditional Probability
Conditional probability is a measure of the probability of an event occurring, given that another
event has already occurred. It's a way to refine our predictions based on additional information.
The conditional probability of an event 𝐴 given that event 𝐵 has occurred is denoted as 𝑃(𝐴∣𝐵).
It is given by P(A|B) = P(A ∩ B) / P(B), where P(A ∩ B) is the probability that both events A and B
occur, and P(B) is the probability of event B occurring (with P(B) > 0).
➢ Bayes Theorem
Bayes' Theorem is a fundamental principle in probability theory and statistics that describes how
to update the probabilities of hypotheses when given evidence. Named after the Reverend
Thomas Bayes, the theorem provides a way to revise existing predictions or theories
(hypotheses) based on new evidence.
The theorem states: P(A|B) = [P(B|A) × P(A)] / P(B)
Where:
• 𝑃(𝐴∣𝐵) is the probability of event A occurring given that B is true (posterior probability).
• 𝑃(𝐵∣𝐴) is the probability of event B occurring given that A is true (likelihood).
• 𝑃(𝐴) is the probability of event A occurring (prior probability).
• 𝑃(𝐵) is the probability of event B occurring (marginal likelihood).
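A worked numeric sketch of the theorem; the disease-prevalence and test-accuracy numbers below are hypothetical and chosen only to illustrate the calculation:

```python
# Hypothetical disease-screening example of Bayes' Theorem.
p_disease = 0.01             # P(A): prior probability of having the disease
p_pos_given_disease = 0.95   # P(B|A): probability the test is positive if diseased (likelihood)
p_pos_given_healthy = 0.05   # P(B|not A): false-positive rate

# Marginal likelihood P(B): total probability of a positive test.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(round(p_disease_given_pos, 3))   # about 0.161: most positives are still false positives
```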
➢ Probability Distribution
• Random Variables
A random variable is a variable whose possible values are numerical outcomes of a random
phenomenon. Random variables are used in probability and statistics to quantify random
outcomes and to facilitate the analysis of random processes.
• PMF
The Probability Mass Function (PMF) of a discrete random variable X gives the probability that X
takes on each specific value, written P(X = x).
Properties of a PMF:
1. Non-negativity: P(X = x) ≥ 0 for every possible value x.
2. Normalization: the probabilities of all possible values sum to 1, i.e. Σ P(X = x) = 1.
Example:
Consider a discrete random variable 𝑋 representing the outcome of rolling a fair six-sided die. The
possible values of X are 1, 2, 3, 4, 5, and 6. The PMF for this random variable is P(X = x) = 1/6 for
x = 1, 2, ..., 6. This means that the probability of rolling any specific number (1 through 6) is 1/6.
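A tiny sketch of this PMF in Python, checking the normalization property:

```python
# PMF of a fair six-sided die: P(X = x) = 1/6 for x = 1, ..., 6.
pmf = {x: 1 / 6 for x in range(1, 7)}

print(pmf[3])               # 0.1666... probability of rolling a 3
print(sum(pmf.values()))    # 1.0: the probabilities sum to 1 (normalization property)
```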
• PDF
The Probability Density Function (PDF) is a fundamental concept in statistics used to describe the
likelihood of a continuous random variable taking on a specific value within an interval. For a
continuous random variable 𝑋, the PDF is denoted by 𝑓(𝑥). Unlike the PMF for discrete random
variables, the PDF does not give the probability of 𝑋 taking on an exact value, but rather the
density of the probability in a small interval around 𝑥.
Properties of PDF:
1. Non-negativity: For all possible values 𝑥, 𝑓(𝑥) ≥ 0.
2. Normalization: The total area under the PDF curve over the entire range of possible
values is equal to 1, i.e. ∫ f(x) dx = 1 (integrated over all possible values of x).
➢ Discrete Distributions
• Bernoulli Distribution
Bernoulli Distribution is defined as a fundamental tool for calculating probabilities in scenarios
where only two choices are present (i.e. binary situations), such as passing or failing, winning or
losing, or a straightforward yes or no. Bernoulli Distribution can be resembled through the flipping
of a coin. Binary situations involve only two possibilities: success or failure.
For example, when flipping a coin, it can land on either heads, representing success, or tails,
indicating failure. The likelihood of achieving heads is p, and the likelihood of getting tails is
1 − p (often written q).
• Binomial Distribution
The binomial distribution describes the number of successes in a fixed number n of independent
Bernoulli trials, each with the same probability of success p. For example, the number of heads in
10 flips of a fair coin follows a binomial distribution with n = 10 and p = 0.5.
➢ Continuous Distributions
• Uniform Distribution
A uniform distribution is a type of probability distribution where every possible outcome has an
equal probability of occurring. This means that all values within a given range are equally likely to
be observed.
When rolling a fair six-sided die, each face (1, 2, 3, 4, 5, 6) has an equal probability of 1/6 of
landing face up. This is a classic example of a discrete uniform distribution.
• Normal Distribution
A normal distribution is a continuous, bell-shaped distribution that is symmetric about its mean. Its
probability density function is
f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))
where,
- x is the random variable
- μ is the mean
- σ is the standard deviation
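A short sketch of the distributions above using scipy.stats; the parameter values (p = 0.3, n = 10, μ = 170, σ = 8) are arbitrary examples:

```python
# Bernoulli, binomial, uniform, and normal distributions with scipy.stats (arbitrary parameters).
from scipy import stats

bernoulli = stats.bernoulli(p=0.3)          # single yes/no trial with success probability 0.3
binomial = stats.binom(n=10, p=0.3)         # number of successes in 10 such trials
uniform = stats.uniform(loc=0, scale=10)    # continuous uniform on [0, 10]
normal = stats.norm(loc=170, scale=8)       # normal with mean 170 and standard deviation 8

print(bernoulli.pmf(1))   # P(success) = 0.3
print(binomial.pmf(3))    # P(exactly 3 successes)
print(uniform.pdf(5))     # density is flat: 1/10 everywhere on [0, 10]
print(normal.pdf(170))    # density is highest at the mean
```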
- Standard Normal Distribution
Standard normal distribution, also known as the z-distribution, is a special type of normal
distribution. In this distribution, the mean (average) is 0 and the standard deviation (a measure of
spread) is 1. This creates a bell-shaped curve that is symmetrical around the mean.
- Standardization
Standardization in statistics is a process that involves transforming data into a common format or
scale to make comparisons or analyses more meaningful. This process is essential, especially when
dealing with variables measured on different scales or units. Here's a brief overview:
1. Purpose:
• Standardization helps in comparing variables that have different units of measurement.
• It is useful in statistical analyses like regression, where comparing coefficients becomes
easier.
• It allows for more accurate comparisons across different datasets.
2. Method:
• One common method is to subtract the mean of the dataset from each data point and then
divide the result by the standard deviation. This process transforms the data to have a
mean of 0 and a standard deviation of 1.
• Formula: Z = (X − μ) / σ (see the sketch at the end of this section)
Where 𝑍 is the standardized value, 𝑋 is the original value, 𝜇 is the mean of the dataset, and 𝜎 is the
standard deviation.
3. Applications:
• Data Analysis: Ensures variables contribute equally to analyses, like in Principal Component
Analysis (PCA).
• Machine Learning: Essential for algorithms that are sensitive to the scale of data, such as k-
nearest neighbours (KNN) or support vector machines (SVM).
4. Benefits:
• Eliminates units, making it easier to compare and interpret data.
• Enhances the performance and accuracy of statistical and machine learning models.
- Normalization
Normalization in statistics is a process used to adjust the values of different variables so they can
be compared on a common scale, typically ranging from 0 to 1. This is particularly important when
the data being compared have different units or scales. Here's a brief overview:
1. Purpose:
• It ensures that variables contribute equally to analyses.
• It's essential in machine learning, where different scales can affect the performance of
algorithms.
2. Method:
• One common technique is Min-Max normalization. This method rescales the data to a fixed
range, usually 0 to 1.
• Formula: X_norm = (X − X_min) / (X_max − X_min) (see the sketch at the end of this section)
Where 𝑋norm is the normalized value, 𝑋 is the original value, 𝑋min is the minimum value in the
dataset, and 𝑋max is the maximum value in the dataset.
3. Applications:
• Data Analysis: Provides a consistent basis for comparison.
• Machine Learning: Enhances the performance of distance-based algorithms like k-
nearest neighbours (KNN).
4. Benefits:
• Removes the effects of scale, ensuring fair comparisons.
• Improves the convergence speed and accuracy of machine learning algorithms.
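A minimal sketch of Min-Max normalization on the same kind of invented data:

```python
# Min-Max normalization: rescale data to the [0, 1] range (made-up data).
import numpy as np

x = np.array([150., 160., 170., 180., 190.])

x_norm = (x - x.min()) / (x.max() - x.min())

print(x_norm)   # [0.   0.25 0.5  0.75 1.  ]
```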
- Empirical Rule
Empirical Rule, also known as the 68-95-99.7 Rule, is a statistical guideline that describes the
distribution of data in a normal distribution. It states that in a bell-shaped curve, approximately
68% of the data falls within one standard deviation from the mean, about 95% within two standard
deviations, and nearly 99.7% within three standard deviations. This rule provides a quick way to
understand the spread of data and is applicable in various fields for analyzing and interpreting
distributions.
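A quick simulation sketch that checks the 68-95-99.7 rule on normally distributed random data:

```python
# Empirical (68-95-99.7) rule checked on simulated standard normal data.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0, scale=1, size=100_000)

for k in (1, 2, 3):
    share = np.mean(np.abs(data) <= k)     # fraction of values within k standard deviations
    print(f"within {k} sd: {share:.3f}")   # roughly 0.683, 0.954, 0.997
```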
# Inferential Statistics
➢ Point Estimation
A point estimate is a sample statistic calculated using the sample data to estimate the most likely
value of the corresponding unknown population parameter. In other words, we derive the point
estimate from a single value in the sample and use it to estimate the population value.
x̅ = Σx/n
For example, suppose 62 is the average (x̅) mark achieved by a sample of 15 students randomly
selected from a class of 150 students; this value is used as an estimate of the mean mark of the
entire class. Since it is a single numeric value, it is a point estimate.
➢ Interval Estimation
A confidence interval estimate is a range of values constructed from sample data so that the
population parameter is likely to fall within that range with a specified probability. This specified
probability is called the level of confidence.
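A sketch of a point estimate (the sample mean) together with a 95% confidence interval for the population mean; the sample marks below are invented, and the t-distribution interval is one standard way to construct it:

```python
# Point estimate and 95% confidence interval for a population mean (made-up sample).
import numpy as np
from scipy import stats

sample = np.array([62, 58, 71, 65, 55, 60, 68, 64, 59, 66, 61, 57, 63, 70, 52])

point_estimate = sample.mean()   # x-bar, the point estimate of the population mean
sem = stats.sem(sample)          # standard error of the mean

# 95% confidence interval using the t distribution (appropriate for small samples).
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1,
                                   loc=point_estimate, scale=sem)

print(point_estimate, (round(ci_low, 2), round(ci_high, 2)))
```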