
STATISTICS

What is Statistics?


Statistics is a branch of mathematics that involves collecting, analysing, interpreting, and
presenting data. It provides tools and methods to understand and make sense of large
amounts of data and to draw conclusions and make decisions based on the data.
In practice, statistics is used in a wide range of fields, such as business, economics, social
sciences, medicine, and engineering. It is used to conduct research studies, analyse market
trends, evaluate the effectiveness of treatments and interventions, and make forecasts and
predictions.

Examples:
1. Business - Data Analysis (identifying customer behaviour) and Demand Forecasting
2. Medical - Identifying the efficacy of new medicines (clinical trials), identifying risk factors for diseases (epidemiology)
3. Government & Politics - Conducting surveys, polling
4. Environmental Science - Climate research

Types of Statistics
1. Descriptive Statistics
Descriptive statistics deals with the collection, organization, analysis, interpretation, and
presentation of data in such a way that it is easy to understand and interpret.
It focuses on summarizing and describing the main features of a set of data, without making
inferences or predictions about the larger population.

Descriptive statistics are used to answer questions like, what is the:


- average income of people in a state.
- most common type of car on the road.
- ranges of ages of employees working in a particular company.

It helps us better understand the data we are looking at by summarizing the important information,
such as the highest and lowest value, the middle value (median), and how spread out the data is
(range and variance). These pieces of information help us draw conclusions about the group we are studying.

2. Inferential Statistics
Inferential statistics is the branch of statistics that makes inferences (or predictions) about the
population dataset based on the sample dataset. It involves hypothesis testing, a process of
using statistical methods to determine whether the hypothesis about the population is likely
true.
Inferential statistics are widely used in Scientific & Market Research and social sciences to
make predictions, test hypotheses, and make decisions based on a solid understanding of the
data. It also helps to minimize errors and biases in the result.

Inferential statistics are used to answer questions like:


• How likely is it that a new medicine will be effective, based on results from clinical trials?
• What is the probability that party A will win, based on the results of a survey?
• How likely is it that a certain event will happen in the future, based on historical data?

OR

Inferential statistics is a branch of statistics that deals with making inferences or predictions
about a larger population based on a sample of data. It involves using statistical techniques to
test hypotheses and draw conclusions from data. Some of the topics that come under
inferential statistics are:

1. Hypothesis testing: This involves testing a hypothesis about a population parameter based
on a sample of data. For example, testing whether the mean height of a population is
different from a given value.

2. Confidence intervals: This involves estimating the range of values that a population
parameter could take based on a sample of data. For example, estimating the population
mean height within a given confidence level.

3. Analysis of variance (ANOVA): This involves comparing means across multiple groups to
determine if there are any significant differences. For example, comparing the mean
height of individuals from different regions.

4. Regression analysis: This involves modelling the relationship between a dependent variable and one or more independent variables. For example, predicting the sales of a product based on advertising expenditure.

5. Chi-square tests: This involves testing the independence or association between two
categorical variables. For example, testing whether gender and occupation are
independent variables.

6. Sampling techniques: This involves ensuring that the sample of data is representative of
the population. For example, using random sampling to select individuals from a
population.

7. Bayesian statistics: This is an alternative approach to statistical inference that involves updating beliefs about the probability of an event based on new evidence. For example, updating the probability of a disease given a positive test result.
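
To make items 1 and 2 above concrete, here is a minimal sketch in Python using scipy.stats, assuming a small made-up sample of heights and a hypothesised population mean of 170 cm (both are illustrative values, not from the text):

```python
# Minimal sketch: one-sample t-test and a confidence interval for the mean.
# The sample values and the hypothesised mean (170 cm) are illustrative only.
import numpy as np
from scipy import stats

heights = np.array([168.2, 171.5, 169.8, 174.0, 166.9, 172.3, 170.1, 175.2])

# 1. Hypothesis test: is the population mean height different from 170 cm?
t_stat, p_value = stats.ttest_1samp(heights, popmean=170.0)
print(f"t = {t_stat:.3f}, p-value = {p_value:.3f}")   # small p-value -> reject H0

# 2. Confidence interval: 95% interval for the population mean height.
mean = heights.mean()
sem = stats.sem(heights)                               # standard error of the mean
margin = stats.t.ppf(0.975, df=len(heights) - 1) * sem
print(f"95% CI: ({mean - margin:.2f}, {mean + margin:.2f})")
```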

➢ Population vs Sample


Population refers to the entire group of individuals or objects that we are interested in studying. It
is the complete set of observations that we want to make inferences about. For example, the
population might be all the students in a particular school or all the cars in a particular city.

A sample, on the other hand, is a subset of the population. It is a smaller group of individuals or
objects that we select from the population to study. Samples are used to estimate characteristics
of the population, such as the mean or the proportion with a certain attribute. For example, we
might randomly select 100 students.

Examples
1. All cricket fans vs the fans who were present in the stadium
2. All students vs those who visit college for lectures

Things to be careful about when creating samples


1. Sample Size
2. Random
3. Representative

Parameter vs Statistic
A parameter is a characteristic of a population, while a statistic is a characteristic of a sample.
Parameters are generally unknown and are estimated using statistics. The goal of statistical
inference is to use the information obtained from the sample to make inferences about the
population parameters.
What is Data?
Data is defined as a collection of numbers, characters, images, and other items that can be arranged in some manner to form meaningful information. In statistics, data is mainly a collection of numbers that is first studied, then analysed and presented in a way that lets us extract meaningful insight from it.

• Types of Data

1. Quantitative / Numerical Data


Numeric variables have values that describe a measurable quantity as a number, like 'how
many' or ‘how much’.

Types of Quantitative / Numeric Variable

a) Discrete variable

Observations can take a value based on a count from a set of distinct whole values. A discrete variable cannot take the value of a fraction.

Ex - No. of registered cars, no. of business locations, and no. of children in a family, all of which are measured as whole units (i.e., 1, 2, 3 cars).

b) Continuous variable

It can take any value between a certain set of real numbers.

Ex- Height, time, age and temperature.


2. Categorical Variable

Categorical variables have values that describe a 'quality' or ‘characteristics’ of a data unit, like
‘what type’ or 'which category’. Categorical variables fall into mutually exclusive (in one
category or in another) and exhaustive (include all possible options) categories. Therefore,
categorical variables are qualitative variables and tend to be represented by a non-numeric
value.

Types of categorical variable

a) Nominal variable

It can take a value that is not able to be organised in a logical sequence.

Ex-Gender, business type, eye colour, religion and brand.

b) Ordinal variable

It can take a value that can be logically ordered or ranked. The categories associated with
ordinal variables can be ranked higher or lower than another, but do not necessarily establish
a numeric difference between each category.

Ex - Academic grades (i.e., A, B, C), clothing sizes (i.e., small, medium, large, extra large) and attitudes (i.e., strongly agree, agree, disagree, strongly disagree).
# Descriptive Statistics

➢ Measure of Central Tendency


A measure of central tendency is a statistical measure that represents a typical or central value
for a dataset. It provides a summary of the data by identifying a single value that is most
representative of the dataset as a whole.

1. Mean

The mean (or average) is the most popular and well known measure of central tendency. It can be
used with both discrete and continuous data, although its use is most often with continuous data.

The mean is equal to the sum of all the values in the data set divided by the number of values in the
data set.

Sample mean: x̄ = Σx / n
Population mean: μ = Σx / N

2. Median
The median is the middle value in the dataset when the data is arranged in order.

Our median mark is the middle mark - in this case, 56. With 11 marks arranged in order, it is the middle mark because there are 5 scores before it and 5 scores after it.

3. Mode
The mode is the most frequent element in our data set.

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
The most commonly occurring value is 54, therefore the mode of this distribution is 54.
4. Weighted Mean
The weighted mean is the sum of the products of each value and its weight, divided by the
sum of the weights. It is used to calculate a mean when the values in the dataset have
different importance or frequency.
5. Trimmed Mean
A trimmed mean is calculated by removing a certain percentage of the smallest and largest
values from the dataset and then taking the mean of the remaining values. The percentage of
values removed is called the trimming percentage.
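
A minimal Python sketch of these measures, using the marks from the mode example above; the weights are made-up illustrative values:

```python
# Minimal sketch: mean, median, mode, weighted mean and trimmed mean.
# The marks come from the example above; the weights are illustrative only.
import numpy as np
from statistics import mean, median, mode
from scipy.stats import trim_mean

marks = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

print(mean(marks))                       # arithmetic mean
print(median(marks))                     # middle value of the sorted data
print(mode(marks))                       # most frequent value -> 54

# Weighted mean: each value multiplied by its weight, divided by the sum of weights.
weights = [1, 1, 1, 2, 2, 3, 3, 2, 2, 1, 1]
print(np.average(marks, weights=weights))

# Trimmed mean: drop 10% of values from each end before averaging.
print(trim_mean(marks, proportiontocut=0.1))
```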
➢ Measure of Dispersion
A measure of dispersion is a statistical measure that describes the spread or variability of a
dataset. It provides information about how the data is distributed around the central tendency
(mean, median or mode) of the dataset.

• Types of Measures of Dispersion


1. Range
The range is the difference between the maximum and minimum values in the dataset.
It is a simple measure of dispersion that is easy to calculate but can be affected by
outliers.
- It does not consider how the values are distributed between the extremes.
- Suitable for small datasets.
- Sensitive to extreme values.
2. Variance
Variance: The variance is the average of the squared differences between each data
point and the mean. It measures the average distance of each data point from the mean
and is useful in comparing the dispersion of datasets with different means.
3. Standard Deviation
Standard deviation is a measure used in statistics to understand how the data points in a set
are spread out from the mean value. It indicates the extent of the data’s variation and shows
how far individual data points deviate from the average.
4. Coefficient of Variation
The CV is the ratio of the standard deviation to the mean expressed as a percentage. It is used
to compare the variability of datasets with different means and is commonly used in fields such
as biology, chemistry, and engineering.

The coefficient of variation (CV) is a statistical measure that expresses the amount of variability
in a dataset relative to the mean. It is a dimensionless quantity that is expressed as a
percentage.

- C.V. is expressed as a percentage.
- 100 times the coefficient of dispersion based on standard deviation is called the coefficient of variation.
- A distribution with a smaller C.V. is said to be more homogeneous, uniform or less variable than the other, and the series with a greater C.V. is said to be more heterogeneous or more variable than the other.
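
A minimal NumPy sketch of these four measures; the data values are illustrative, and ddof=1 is assumed for the sample variance and standard deviation:

```python
# Minimal sketch: range, variance, standard deviation and coefficient of variation.
# The data values are illustrative only; ddof=1 gives the sample (n-1) versions.
import numpy as np

data = np.array([54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60])

data_range = data.max() - data.min()          # range: max minus min
variance = np.var(data, ddof=1)               # average squared deviation from the mean
std_dev = np.std(data, ddof=1)                # square root of the variance
cv = std_dev / data.mean() * 100              # CV: std dev relative to the mean, in %

print(data_range, variance, std_dev, round(cv, 2))
```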
➢ Frequency
Frequency refers to the number of times a particular value or event occurs in a data set.
- Categorical variables.
- Frequency distribution table.

In statistics, frequency refers to the number of times a value occurs in a set of data. Three related measures are commonly reported:

1. Absolute Frequency - the raw count of how many times a value occurs.
2. Relative Frequency - the absolute frequency divided by the total number of observations (a proportion or percentage).
3. Cumulative Frequency - the running total of frequencies up to and including a given value.
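
A minimal pandas sketch of a frequency distribution table for an illustrative categorical variable (the colour values are made up):

```python
# Minimal sketch: absolute, relative and cumulative frequency of a categorical variable.
# The colour values are illustrative only.
import pandas as pd

colours = pd.Series(["red", "blue", "red", "green", "blue", "red", "green", "red"])

absolute = colours.value_counts()                 # absolute frequency (counts)
relative = colours.value_counts(normalize=True)   # relative frequency (proportions)
cumulative = absolute.cumsum()                    # cumulative frequency (running total)

freq_table = pd.DataFrame({"absolute": absolute,
                           "relative": relative,
                           "cumulative": cumulative})
print(freq_table)
```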
➢ Graphical Representation
Graphical representation is a way to visualize data using charts and graphs. It's a tool used in
statistics to analyze numerical data, identify patterns, and make predictions.

- Histogram
- Boxplot
- Scatterplot

1. Histogram

A histogram is a type of graphical representation used in statistics to show the distribution of numerical data. It looks somewhat like a bar chart, but unlike bar graphs, which are used for categorical data, histograms are designed for continuous data, grouping it into logical ranges which are also known as "bins."

A histogram helps in visualizing the distribution of data across a continuous interval or period, which makes the data more understandable and also highlights trends and patterns.
- Types of Histogram (by skewness)

- Symmetric: Mean ≈ Median ≈ Mode
- Positively skewed: Mean > Median
- Negatively skewed: Mean < Median
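
A minimal matplotlib sketch of a histogram; the data are randomly generated for illustration, and comparing the mean with the median hints at the direction of skew:

```python
# Minimal sketch: histogram of a numeric variable and a quick skewness check.
# The data are randomly generated for illustration.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=1000)   # right-skewed data

plt.hist(data, bins=30, edgecolor="black")     # group values into 30 bins
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram")
plt.show()

print(np.mean(data) > np.median(data))         # True -> positively skewed
```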
➢ Quantiles and Percentiles
Quantiles are statistical measures used to divide a set of numerical data into equal-sized groups,
with each group containing an equal number of observations.

Quantiles are important measures of variability and can be used to understand the distribution of data and to summarize and compare different datasets. They can also be used to identify outliers.

There are several types of quantiles used in statistical analysis, including:

1. Quartiles: Divide the data into four equal parts, Q1 (25th percentile), Q2
(50th percentile or median), and Q3 (75th percentile).

2. Deciles: Divide the data into ten equal parts, D1 (10th percentile), D2
(20th percentile), ..., D9 (90th percentile).

3. Percentiles: Divide the data into 100 equal parts, P1 (1st percentile), P2
(2nd percentile), ..., P99 (99th percentile).

4. Quintiles: Divide the data into five equal parts.

Things to remember while calculating these measures:


1. Data should be sorted from low to high
2. You are basically finding the location of an observation
3. They are not actual values in the data
4. All other quantiles can be easily derived from percentiles
• Percentile
The percentile formula is used to determine the performance of a person in comparison with others. It is used mostly in schools when test results are released, to check where a person stands relative to others. The percentile rank of a score 'x' is the number of scores that fall below 'x', divided by the total number of values in the given population, and multiplied by 100.
• 5 Number Summary
The five-number summary is a descriptive statistic that provides a summary of a dataset. It consists
of five values that divide the dataset into four equal parts, also known as quartiles. The five-
number summary includes the following values:

1. Minimum value: The smallest value in the dataset.

2. First quartile (Q1): The value that separates the lowest 25% of the data from
the rest of the dataset.

3. Median (Q2): The value that separates the lowest 50% from the highest 50%
of the data.

4. Third quartile (Q3): The value that separates the lowest 75% of the data from
the highest 25% of the data.

5. Maximum value: The largest value in the dataset.

The five-number summary is often represented visually using a box plot, which displays the range
of the dataset, the median, and the quartiles.
The five-number summary is a useful way to quickly summarize the central
tendency, variability, and distribution of a dataset.

Interquartile Range
The interquartile range (IQR) is a measure of variability that is based on the five-number summary
of a dataset. Specifically, the IQR is defined as the difference between the third quartile (Q3) and
the first quartile (Q1) of a dataset.
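
A minimal NumPy sketch of the five-number summary and the IQR, using illustrative data:

```python
# Minimal sketch: five-number summary and interquartile range.
# The data values are illustrative only.
import numpy as np

data = np.array([14, 35, 45, 55, 55, 56, 56, 65, 87, 89, 92])

minimum, q1, median, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])
iqr = q3 - q1                                   # IQR = Q3 - Q1

print(f"min={minimum}, Q1={q1}, median={median}, Q3={q3}, max={maximum}")
print(f"IQR={iqr}")
```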
2. Boxplot
A box plot, also known as a box-and-whisker plot, is a graphical representation of a
dataset that shows the distribution of the data. The box plot displays a summary of the
data, including the minimum and maximum values, the first quartile (Q1), the median
(Q2), and the third quartile (Q3).

OR

The idea of box plot was presented by John Tukey in 1970. He wrote about it in his book
“Exploratory Data Analysis” in 1977. Box plot is also known as a whisker plot, box-and-whisker plot,
or simply a box-and whisker diagram. Box plot is a graphical representation of the distribution of a
dataset. It displays key summary statistics such as the median, quartiles, and potential outliers in a
concise and visual manner. By using a box plot you can provide a summary of the distribution, identify potential outliers, and compare different datasets in a compact and visual manner.

Elements of Box Plot

A box plot gives a five-number summary of a set of data which is-

- Minimum – It is the minimum value in the dataset excluding the outliers.


- First Quartile (Q1) – 25% of the data lies below the First (lower) Quartile.
- Median (Q2) – It is the mid-point of the dataset. Half of the values lie below it and half above.
- Third Quartile (Q3) – 75% of the data lies below the Third (Upper) Quartile.
- Maximum – It is the maximum value in the dataset excluding the outliers.
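
A minimal matplotlib sketch of a box plot; the data values are illustrative, and points beyond the whiskers are drawn as individual outlier markers:

```python
# Minimal sketch: box plot showing median, quartiles, whiskers and outliers.
# The data values are illustrative only.
import matplotlib.pyplot as plt

data = [14, 35, 45, 55, 55, 56, 56, 65, 87, 89, 92, 150]   # 150 is a likely outlier

plt.boxplot(data, vert=False)     # horizontal box-and-whisker plot
plt.xlabel("Value")
plt.title("Box plot")
plt.show()
```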
3. Scatter Plot
A scatter plot (aka scatter chart, scatter graph) uses dots to represent values for two
different numeric variables. The position of each dot on the horizontal and vertical axis
indicates values for an individual data point. Scatter plots are used to observe relationships
between variables.

- When you should use a Scatter Plot

Scatter plots’ primary uses are to observe and show relationships between two numeric
variables. The dots in a scatter plot not only report the values of individual data points, but
also patterns when the data are taken as a whole.

Identification of correlational relationships is common with scatter plots. In these cases, we want to know, if we were given a particular horizontal value, what a good prediction would be for the vertical value. You will often see the variable on the horizontal axis denoted as the independent variable, and the variable on the vertical axis as the dependent variable. Relationships between variables can be described in many ways: positive or negative, strong or weak, linear or nonlinear.

A scatter plot can also be useful for identifying other patterns in data. We can divide data
points into groups based on how closely sets of points cluster together. Scatter plots can
also show if there are any unexpected gaps in the data and if there are any outlier points.
This can be useful if we want to segment the data into different parts, like in the
development of user personas.
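
A minimal matplotlib sketch of a scatter plot, assuming two made-up, loosely related variables (advertising expenditure and sales):

```python
# Minimal sketch: scatter plot of two numeric variables.
# The advertising spend and sales figures are illustrative only.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
advertising = rng.uniform(10, 100, size=50)             # independent variable (x-axis)
sales = 3.0 * advertising + rng.normal(0, 20, size=50)  # dependent variable (y-axis)

plt.scatter(advertising, sales)
plt.xlabel("Advertising expenditure")
plt.ylabel("Sales")
plt.title("Scatter plot")
plt.show()
```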
➢ Outlier
An outlier in statistics is a data point that significantly deviates from the other observations in a
dataset. It stands out because it is unusually high or low compared to the rest of the data. Outliers
can occur due to variability in the data, errors in measurement, or other factors. Identifying and
dealing with outliers is important because they can influence the results of statistical analyses and
lead to misleading conclusions.

- Types of Outliers

1. Univariate Outliers
These are outliers in a single variable. For instance, in a dataset of students' test scores, a score of 100 in
a class where most scores are between 50 and 70 would be considered a univariate outlier.

2. Multivariate Outliers
These occur when considering multiple variables together. A data point may not be an outlier in any
single variable but can be an outlier when considering its relationship with other variables. For example,
a person with an unusually high salary for their age and experience level could be a multivariate outlier.

3. Point Outliers
These are individual data points that are far removed from the rest of the data. They are the simplest
form of outliers and are often easy to spot on a scatter plot.

4. Contextual Outliers
These are outliers when considered in a specific context. For example, a temperature reading of 30°C
might be normal in the summer but would be an outlier in the winter.

5. Collective Outliers
These are a group of data points that are outliers with respect to the entire dataset but not when taken
individually. For instance, a sequence of unusual but individually typical events in a time-series dataset.

- Main Causes of Outliers


Outliers can arise from various sources, making their detection vital:

1. Data Entry Errors: Simple human errors in entering data can create extreme values.
2. Measurement Error: Faulty device or experimental setup problems can cause abnormally
high or low readings.
3. Experimental Errors: Flaws in experimental design might produce data points that do not represent what they are supposed to measure.
4. Intentional Outliers: In some cases, data might be manipulated deliberately to produce
outlier effects, often seen in fraud cases.
5. Data Processing Errors: During the collection and processing stages, technical glitches can
introduce erroneous data.
6. Natural Variation: Inherent variability in the underlying data can also lead to outliers.
- How Outliers can be Identified?

Identifying outliers is crucial in data analysis, as they can significantly influence the results of statistical tests.
Here are several common methods to identify outliers:

1. Visual Inspection

• Box Plot: A graphical representation that displays the distribution of data and highlights outliers.
Outliers are typically shown as individual points beyond the "whiskers" of the box plot.
• Scatter Plot: Useful for identifying outliers in a bivariate dataset. Points that fall far from the main
cluster are potential outliers.

2. Statistical Methods

• Z-Score: Measures how many standard deviations a data point is from the mean. Typically, a z-score
greater than 3 or less than -3 indicates an outlier.
• Interquartile Range (IQR): Outliers are identified as values that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, where Q1 and Q3 are the first and third quartiles, respectively.
• Grubbs' Test: A statistical test specifically designed to detect outliers in a univariate dataset. It tests the
hypothesis that the maximum or minimum value in a dataset is an outlier.

3. Machine Learning Algorithms

• Isolation Forest: An algorithm that isolates observations by randomly selecting a feature and splitting
the data. Outliers are points that require fewer splits to be isolated.
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Clusters data points based
on their density. Points that do not fit into any cluster are considered outliers.

4. Context-Specific Methods

• Domain Knowledge: Sometimes, domain knowledge is essential to identify outliers. For example, a
medical professional may identify outliers in health data based on their expertise.
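
A minimal NumPy sketch of the z-score and IQR methods described above; the data are randomly generated with one extreme value appended for illustration:

```python
# Minimal sketch: flagging outliers with the z-score rule and the 1.5*IQR rule.
# The data are randomly generated for illustration, with one extreme value appended.
import numpy as np

rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=60, scale=5, size=200), 120.0)  # 120 is the outlier

# Z-score method: points more than 3 standard deviations from the mean.
z_scores = (data - data.mean()) / data.std(ddof=1)
print(data[np.abs(z_scores) > 3])               # flags the extreme value

# IQR method: points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])    # flags the extreme value (and possibly a few others)
```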
➢ Covariance
Covariance in statistics measures the degree to which two random variables change together. If
they tend to increase and decrease together, the covariance is positive. If one increases while the
other decreases, the covariance is negative.

Types of Covariance:

1. Positive Covariance: When one variable increases, the other variable tends to increase as
well, and vice versa.

2. Negative Covariance: When one variable increases, the other variable tends to decrease.

3. Zero Covariance: There is no linear relationship between the two variables; they move
independently of each other.

Covariance:

1. It describes the relationship between a pair of random variables, where a change in one variable is associated with a change in the other variable.
2. It can take any value between −infinity and +infinity, where a negative value represents a negative relationship and a positive value represents a positive relationship.
3. It is used for the linear relationship between variables.
4. It gives the direction of relationship between variables.
➢ Correlation
Correlation in statistics measures the strength and direction of the relationship between two
variables. Unlike covariance, correlation is standardized, so it ranges from -1 to 1. Here’s what
these values mean:

• 1: Perfect positive correlation (variables move in the same direction)

• -1: Perfect negative correlation (variables move in opposite directions)

• 0: No correlation (no linear relationship between the variables)
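
A minimal NumPy sketch of covariance and correlation; the hours-studied and exam-score values are made up for illustration:

```python
# Minimal sketch: covariance and Pearson correlation between two variables.
# The hours-studied and exam-score values are illustrative only.
import numpy as np

hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8])
exam_scores   = np.array([52, 55, 61, 64, 70, 72, 79, 83])

cov = np.cov(hours_studied, exam_scores)[0, 1]        # sample covariance
corr = np.corrcoef(hours_studied, exam_scores)[0, 1]  # correlation, between -1 and 1

print(f"covariance = {cov:.2f}, correlation = {corr:.3f}")
```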


# Probability
Probability defines the likelihood of occurrence of an event.

Probability can be defined as the ratio of the number of favourable outcomes to the total
number of outcomes of an event. For an experiment having 'n' number of outcomes, the
number of favourable outcomes can be denoted by x.

The formula to calculate the probability of an event is as follows.

Probability(Event) = Favourable Outcomes / Total Outcomes = x/n

• Types of events
In probability, an "event" is a set of outcomes from a random experiment. Each outcome is
a possible result of the experiment.

1. Simple Event: An event that consists of a single outcome. For example, getting a head when flipping a
coin is a simple event.
2. Compound Event: An event that consists of two or more simple events. For example, rolling a 2 and
then a 4 in two rolls of a die is a compound event.
3. Mutually Exclusive (Disjoint) Events: Two events that cannot occur at the same time. For example,
getting a 3 and getting a 5 in a single roll of a die are mutually exclusive events.
4. Non-Mutually Exclusive Events: Events that can occur simultaneously. For example, being a student
and having a part-time job are non-mutually exclusive events since both can happen at the same time.
5. Independent Events: Two events are independent if the occurrence of one event does not affect the
occurrence of the other. For example, flipping a coin and rolling a die are independent events.
6. Dependent Events: Two events are dependent if the occurrence of one event affects the occurrence of
the other. For example, drawing two cards from a deck without replacement is a set of dependent events.
7. Complementary Events: The complement of an event A is the event that A does not occur. The probability of the complement of A is 1 − P(A).
8. Equally Likely Events: Events that have the same probability of occurring. For example, getting any
one of the six faces when rolling a fair die are equally likely events.
9. Exhaustive Events: A set of events is exhaustive if at least one of the events must occur. For example,
when flipping a coin, the events "getting a head" and "getting a tail" are exhaustive.

➢ Conditional Probability
Conditional probability is a measure of the probability of an event occurring, given that another
event has already occurred. It's a way to refine our predictions based on additional information.
The conditional probability of an event 𝐴 given that event 𝐵 has occurred is denoted as 𝑃(𝐴∣𝐵).

Mathematically, it's defined as:


𝑃(𝐴∣𝐵) = 𝑃(𝐴∩𝐵) / 𝑃(𝐵)

where 𝑃(𝐴∩𝐵) is the probability that both events 𝐴 and 𝐵 occur, and 𝑃(𝐵) is the probability of
event 𝐵 occurring.
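
A short worked sketch of this formula using a fair die: the probability that a roll is a 6 (event A) given that the roll is even (event B). The example is illustrative, not from the text:

```python
# Minimal sketch: P(A|B) = P(A and B) / P(B) for a fair six-sided die.
# A = "roll is a 6", B = "roll is even"; the example is illustrative.
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)                                   # each outcome is equally likely

p_b = sum(p for x in outcomes if x % 2 == 0)         # P(B) = 3/6
p_a_and_b = sum(p for x in outcomes if x == 6)       # P(A and B) = 1/6 (6 is even)

print(p_a_and_b / p_b)                               # P(A|B) = 1/3
```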
➢ Bayes Theorem
Bayes' Theorem is a fundamental principle in probability theory and statistics that describes how
to update the probabilities of hypotheses when given evidence. Named after the Reverend
Thomas Bayes, the theorem provides a way to revise existing predictions or theories
(hypotheses) based on new evidence.

Here's the formula for Bayes' Theorem:


𝑃(𝐴∣𝐵)=𝑃(𝐵∣𝐴)⋅𝑃(𝐴) / 𝑃(𝐵)

Where:
• 𝑃(𝐴∣𝐵) is the probability of event A occurring given that B is true (posterior probability).
• 𝑃(𝐵∣𝐴) is the probability of event B occurring given that A is true (likelihood).
• 𝑃(𝐴) is the probability of event A occurring (prior probability).
• 𝑃(𝐵) is the probability of event B occurring (marginal likelihood).

Bayes' Theorem is particularly useful in various fields such as:


• Medicine (diagnostic testing)
• Machine Learning (classification algorithms)
• Data Science (predictive modelling)
• Finance (risk assessment)
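
A minimal sketch of Bayes' Theorem applied to diagnostic testing; the prevalence, sensitivity and false-positive rate are made-up illustrative values:

```python
# Minimal sketch: posterior probability of disease given a positive test (Bayes' Theorem).
# Prevalence, sensitivity and the false-positive rate are illustrative assumptions.
p_disease = 0.01              # P(A): prior probability of having the disease
p_pos_given_disease = 0.95    # P(B|A): sensitivity of the test
p_pos_given_healthy = 0.10    # false-positive rate (1 - specificity)

# P(B): total probability of testing positive (marginal likelihood)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # ~0.088: still fairly unlikely despite a positive test
```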

➢ Probability Distribution

• Random Variables
A random variable is a variable whose possible values are numerical outcomes of a random
phenomenon. Random variables are used in probability and statistics to quantify random
outcomes and to facilitate the analysis of random processes.

There are two main types of random variables:

1. Discrete Random Variables


A discrete random variable can take on a countable number of distinct values.
Examples of discrete random variables include:
• The number of heads in 10 coin flips.
• The number of students in a class.
• The number of goals scored in a soccer match.

2. Continuous Random Variables


A continuous random variable can take on any value within a given range. The possible values
are uncountably infinite, often represented by intervals on the real number line.
Examples of continuous random variables include:

• The height of individuals in a population.


• The amount of time it takes to complete a task.
• The temperature at a specific location.
• PMF
In statistics, the Probability Mass Function (PMF) is a function that provides the probability of a
discrete random variable taking on a specific value. The PMF is essential in understanding the
distribution of a discrete random variable and helps in calculating the probabilities of various
outcomes. For a discrete random variable 𝑋, the PMF is denoted as 𝑃(𝑋=𝑥), which represents the
probability that 𝑋 takes the value 𝑥.

Properties of a PMF:

1. Non-negativity: 𝑃(𝑋=𝑥)≥0 for all 𝑥.


2. Normalization: The sum of the probabilities for all possible values of 𝑋 is equal to 1. That is, Σₓ P(X = x) = 1.

Example:

Consider a discrete random variable 𝑋 representing the outcome of rolling a fair six-sided die. The possible values of 𝑋 are 1, 2, 3, 4, 5, and 6. The PMF for this random variable is:

P(X = x) = 1/6, for x = 1, 2, 3, 4, 5, 6

This means that the probability of rolling any specific number (1 through 6) is 1/6.

• PDF
The Probability Density Function (PDF) is a fundamental concept in statistics used to describe the
likelihood of a continuous random variable taking on a specific value within an interval. For a
continuous random variable 𝑋, the PDF is denoted by 𝑓(𝑥). Unlike the PMF for discrete random
variables, the PDF does not give the probability of 𝑋 taking on an exact value, but rather the
density of the probability in a small interval around 𝑥.

Properties of PDF:
1. Non-negativity: For all possible values 𝑥, 𝑓(𝑥) ≥ 0.
2. Normalization: The total area under the PDF curve over the entire range of possible values is equal to 1, i.e. ∫ f(x) dx = 1 over the whole range of X. The probability that X falls in an interval [a, b] is the area under f(x) between a and b.
➢ Discrete Distributions

• Bernoulli Distribution
The Bernoulli distribution is a fundamental tool for calculating probabilities in scenarios where only two choices are present (i.e. binary situations), such as passing or failing, winning or losing, or a straightforward yes or no. The Bernoulli distribution can be illustrated by the flipping of a coin: binary situations involve only two possibilities, success or failure.

For example, when flipping a coin, it can land on either heads, representing success, or tails, indicating failure. The likelihood of getting heads is p, and the likelihood of getting tails is 1 − p, often written q.

• Binomial Distribution
The binomial distribution describes the number of successes in n independent Bernoulli trials, each with the same probability of success p. The probability of observing exactly k successes is P(X = k) = C(n, k) · p^k · (1 − p)^(n − k), where C(n, k) is the number of ways to choose k successes from n trials. For example, the number of heads in 10 flips of a fair coin follows a binomial distribution with n = 10 and p = 0.5.
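
A minimal sketch of these two discrete distributions using scipy.stats; the parameter values (p = 0.5, n = 10) are illustrative assumptions:

```python
# Minimal sketch: Bernoulli and binomial PMFs with scipy.stats.
# p = 0.5 and n = 10 are illustrative parameter choices.
from scipy.stats import bernoulli, binom

p = 0.5                                   # probability of success (e.g. heads)

print(bernoulli.pmf(1, p))                # P(success) = 0.5
print(bernoulli.pmf(0, p))                # P(failure) = 0.5

n = 10                                    # number of independent coin flips
print(binom.pmf(5, n, p))                 # P(exactly 5 heads in 10 flips) ~= 0.246
print(binom.cdf(3, n, p))                 # P(at most 3 heads) ~= 0.172
```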
➢ Continuous Distributions

• Uniform Distribution
A uniform distribution is a type of probability distribution where every possible outcome has an
equal probability of occurring. This means that all values within a given range are equally likely to
be observed.

When rolling a fair six-sided die, each face (1, 2, 3, 4, 5, 6) has an equal probability of 1/6 of
landing face up. This is a classic example of a discrete uniform distribution.

• Types of Uniform Distribution

1. Continuous Uniform Distribution: A continuous uniform probability distribution is a


distribution that has an infinite number of values defined in a specified range. It has a rectangular-shaped graph, which is why it is also called the rectangular distribution. It works on values which are continuous in nature. Example: a random number generator.

2. Discrete Uniform Distribution: A discrete uniform probability distribution is a distribution that


has a finite number of values defined in a specified range. Its graph contains various vertical
lines for each finite value. It works on values that are discrete in nature. Example: a die is rolled.

Properties of Discrete Uniform Distribution


• Each outcome in the sample space has an equal probability of occurrence.
• The probability mass function (PMF) is constant over the range of possible outcomes.
• The mean of a discrete uniform distribution is the average of the minimum and maximum values.
• The variance of a discrete uniform distribution is [(n^2 – 1) / 12], where n is the number of
possible outcomes.
• Normal Distribution
Normal distribution, also known as the Gaussian distribution, is a continuous probability
distribution that is symmetric about the mean, depicting that data near the mean are more
frequent in occurrence than data far from the mean.

Its probability density function is

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

where,
- x is the random variable
- μ is the mean
- σ is the standard deviation
- Standard Normal Distribution
Standard normal distribution, also known as the z-distribution, is a special type of normal
distribution. In this distribution, the mean (average) is 0 and the standard deviation (a measure of
spread) is 1. This creates a bell-shaped curve that is symmetrical around the mean.
- Standardization
Standardization in statistics is a process that involves transforming data into a common format or
scale to make comparisons or analyses more meaningful. This process is essential, especially when
dealing with variables measured on different scales or units. Here's a brief overview:

1. Purpose:
• Standardization helps in comparing variables that have different units of measurement.
• It is useful in statistical analyses like regression, where comparing coefficients becomes
easier.
• It allows for more accurate comparisons across different datasets.
2. Method:
• One common method is to subtract the mean of the dataset from each data point and then
divide the result by the standard deviation. This process transforms the data to have a
mean of 0 and a standard deviation of 1.

• Formula: Z = (X − μ) / σ

Where Z is the standardized value, X is the original value, μ is the mean of the dataset, and σ is the standard deviation.

3. Applications:

• Data Analysis: Ensures variables contribute equally to analyses, like in Principal Component
Analysis (PCA).
• Machine Learning: Essential for algorithms that are sensitive to the scale of data, such as k-
nearest neighbours (KNN) or support vector machines (SVM).

4. Benefits:
• Eliminates units, making it easier to compare and interpret data.
• Enhances the performance and accuracy of statistical and machine learning models.

- Normalization
Normalization in statistics is a process used to adjust the values of different variables so they can
be compared on a common scale, typically ranging from 0 to 1. This is particularly important when
the data being compared have different units or scales. Here's a brief overview:

1. Purpose:
• It ensures that variables contribute equally to analyses.
• It's essential in machine learning, where different scales can affect the performance of
algorithms.

2. Method:
• One common technique is Min-Max normalization. This method rescales the data to a fixed
range, usually 0 to 1.
• Formula: Xnorm = (X − Xmin) / (Xmax − Xmin)

Where Xnorm is the normalized value, X is the original value, Xmin is the minimum value in the dataset, and Xmax is the maximum value in the dataset.

3. Applications:
• Data Analysis: Provides a consistent basis for comparison.
• Machine Learning: Enhances the performance of distance-based algorithms like k-
nearest neighbours (KNN).

4. Benefits:
• Removes the effects of scale, ensuring fair comparisons.
• Improves the convergence speed and accuracy of machine learning algorithms.
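
A minimal NumPy sketch of both transformations on the same illustrative data:

```python
# Minimal sketch: z-score standardization and min-max normalization.
# The data values are illustrative only.
import numpy as np

data = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Standardization: mean 0, standard deviation 1.
standardized = (data - data.mean()) / data.std()
print(standardized)

# Min-max normalization: rescale to the range [0, 1].
normalized = (data - data.min()) / (data.max() - data.min())
print(normalized)
```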

- Empirical Rule
Empirical Rule, also known as the 68-95-99.7 Rule, is a statistical guideline that describes the
distribution of data in a normal distribution. It states that in a bell-shaped curve, approximately
68% of the data falls within one standard deviation from the mean, about 95% within two standard
deviations, and nearly 99.7% within three standard deviations. This rule provides a quick way to
understand the spread of data and is applicable in various fields for analyzing and interpreting
distributions.
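
A minimal NumPy sketch that checks the 68-95-99.7 rule on simulated normal data:

```python
# Minimal sketch: fraction of normally distributed data within 1, 2 and 3 standard deviations.
# The data are randomly generated for illustration.
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=0, scale=1, size=100_000)

for k in (1, 2, 3):
    within = np.mean(np.abs(data - data.mean()) <= k * data.std())
    print(f"within {k} sd: {within:.3f}")   # roughly 0.68, 0.95, 0.997
```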
# Inferential Statistics

➢ Point Estimation
A point estimate is a sample statistic calculated using the sample data to estimate the most likely
value of the corresponding unknown population parameter. In other words, we derive the point
estimate from a single value in the sample and use it to estimate the population value.

For instance, we use the sample mean x̅ to estimate the mean µ of a population:

x̅ = Σx / n

For example, 62 is the average mark (x̅) achieved by a sample of 15 students randomly selected from a class of 150 students, and it is used as an estimate of the mean mark of the entire class. Since it is a single numeric value, it is a point estimate.

Properties of Point Estimators

• Unbiasedness: An estimator is unbiased if, on average, it provides an accurate estimate of the parameter it is trying to estimate.
• Consistency: Consistency is the property that as the sample size increases, the estimator
tends to get closer and closer to the true value of the parameter.
• Efficiency: An efficient estimator achieves the smallest possible variance among all unbiased
estimators. In other words, it's the most precise estimator possible.
• Sufficiency: A sufficient statistic contains all the information in the sample about the
parameter being estimated.

➢ Interval Estimation
A confidence interval estimate is a range of values constructed from sample data so that the
population parameter will likely occur within the range at a specified probability. Accordingly, the
specified probability is the level of confidence.

• Broader and probably more accurate than a point estimate.
• Used with inferential statistics to develop a confidence interval – the range within which we believe, with a certain degree of confidence, that the population parameter lies.
• Any parameter estimate that is based on a sample statistic has some amount of sampling error.
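
A minimal sketch of a 95% confidence interval for the mean, reusing the point-estimate example above (n = 15 students, sample mean 62); the sample standard deviation of 8 is an assumed illustrative value:

```python
# Minimal sketch: 95% confidence interval for the mean mark of the class.
# n and the sample mean come from the example above; the sample std dev (8) is assumed.
import math
from scipy import stats

n = 15              # sample size
sample_mean = 62    # point estimate of the population mean
sample_std = 8      # assumed sample standard deviation

standard_error = sample_std / math.sqrt(n)
t_critical = stats.t.ppf(0.975, df=n - 1)        # two-sided 95% -> 0.975 quantile

margin = t_critical * standard_error
print(f"95% CI: ({sample_mean - margin:.2f}, {sample_mean + margin:.2f})")
```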
