Introduction to Data Science Unit 2 Notes
Uploaded by Rohit

Introduction to Data Science

Unit 2

Population:
A population refers to the entire set of individuals, objects, or data points that you want to
study. It can be large or small depending on the scope of your research.
• For example, all students in a school or all people in a country.

• It provides a complete picture and is usually denoted by N.

Sample:

A sample is a subset of the population that is selected for analysis. It's used when studying the
entire population is impractical or impossible. Sampling allows for inferences about the
population using statistical techniques.

• For example, if the population is all students in a school, a sample could be 50 students randomly chosen from different classes to participate in a survey.

• It offers an estimate and is denoted by n.

Parameters (like population mean) describe the population, while statistics (like sample
mean) describe the sample. Sampling enables us to make inferences about the population
using statistical techniques.

When to use a Population:

Populations are used when your research question requires it, or when you have access to
data from every member of the population. Usually, it is only straightforward to collect data
from a whole population when it is small, accessible, and cooperative.
Example:

• A marketing manager at a small local bakery wants to understand customer preferences.

• They collect data on every customer’s bread purchase over a month.

• Since the customer base is limited and accessible, they analyze the entire population to identify trends.

When to use a Sample:

When your population is large in size, geographically dispersed, or difficult to contact, it’s
necessary to use a sample. With statistical analysis, you can use sample data to make
estimates or test hypotheses about population data.

Example:

• You're researching smartphone usage among teenagers in a city.

• The population includes all teenagers aged 13–18, which could be tens of thousands.

• You select a random sample of 500 teens from different schools.


• This sample participates in surveys to provide insights into broader usage patterns.

Population vs Sample

The main difference between population and sample is given below:

Population:
• Includes all members of a specified group.
• Collecting data from an entire population can be time-consuming, expensive, and sometimes impractical or impossible.
• Example: all residents in the city.

Sample:
• A subset of the population.
• Offers a more feasible approach to studying populations, allowing researchers to draw conclusions from smaller, manageable datasets.
• Example: 1000 households, a subset of the entire population.
Population Parameter vs Sample Statistic

Population Parameter:
• A numerical characteristic that describes the entire population.
• Parameters are typically unknown and must be estimated.

Sample Statistic:
• Calculated from sample data; serves as an estimate or approximation of the corresponding population parameter.
• Directly computed from data in a sample drawn from the population.
• Used to infer population characteristics from a representative subset of the population.

Population And Sample Formulas

Some important formulas related to population and sample are:

• Population mean: μ = ΣX / N

• Sample mean: x̄ = Σx / n

• Population variance: σ² = Σ(X − μ)² / N

• Sample variance: s² = Σ(x − x̄)² / (n − 1)
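A minimal sketch of these formulas using Python's standard statistics module; the scores below are hypothetical, chosen only to illustrate the N versus n − 1 denominators:

```python
import statistics

# Hypothetical population: exam scores of all 10 students in a small class
population = [56, 61, 64, 67, 70, 73, 75, 78, 82, 90]
# A sample of n = 4 scores drawn from that population
sample = [61, 70, 75, 82]

# Population mean (mu) and population variance (sigma^2, divides by N)
mu = statistics.mean(population)
sigma_sq = statistics.pvariance(population)

# Sample mean (x-bar) and sample variance (s^2, divides by n - 1)
x_bar = statistics.mean(sample)
s_sq = statistics.variance(sample)

print("Population: mean =", mu, ", variance =", sigma_sq)
print("Sample:     mean =", x_bar, ", variance =", s_sq)
```

Note that `pvariance` divides by N while `variance` divides by n − 1, matching the two variance formulas above.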

Statistics Definition:

Statistics is the branch of mathematics that deals with every aspect of data. Statistical knowledge helps in choosing the proper method of collecting data and in employing those samples in the correct analysis process to produce results effectively. In short, statistics is a crucial tool for making decisions based on data.

An example of statistical analysis is when we have to determine the number of people in a town who watch TV out of the total population in the town. The small group of people is called the sample here, which is taken from the population.

Types of Statistics
The two main branches of statistics are:

• Descriptive Statistics

• Inferential Statistics

Descriptive Statistics – Through graphs, tables, or numerical calculations, descriptive statistics uses the data to provide descriptions of the population.

Inferential Statistics – Based on the data sample taken from the population, inferential
statistics makes the predictions and inferences.

Characteristics of Statistics

The important characteristics of Statistics are as follows:

• Statistics are numerically expressed.

• Statistics are an aggregate of facts.

• Data are collected in a systematic order.

• Data should be comparable to each other.

• Data are collected for a planned purpose.

Descriptive Statistic:

Statistics is the foundation of data science. Descriptive statistics are simple tools that help us
understand and summarize data. They show the basic features of a dataset, like the average,
highest and lowest values and how spread out the numbers are. It's the first step in making
sense of information.
Types of Descriptive Statistics

There are three categories for standard classification of descriptive statistics methods, each
serving different purposes in summarizing and describing data. They help us understand:

1. Where the data centers (Measures of Central Tendency)

2. How spread out the data is (Measure of Variability)

3. How the data is distributed (Measures of Frequency Distribution)

1. Measures of Central Tendency

Statistical values that describe the central position within a dataset. There are three main
measures of central tendency:

Measures of Central Tendency

Mean: the sum of observations divided by the total number of observations, also known as the average.

Mean = (Sum of all observations) / (Number of observations)
Let's look at an example of how we can find the mean of a dataset using a Python code implementation.

import numpy as np

# Sample data
arr = [5, 6, 11]

# Mean
mean = np.mean(arr)
print("Mean = ", mean)

Output
Mean = 7.333333333333333

Mode: The most frequently occurring value in the dataset. It’s useful for categorical data
and in cases where knowing the most common choice is crucial.

import scipy.stats as stats

# Sample data
arr = [1, 2, 2, 3]

# Mode
mode = stats.mode(arr)
print("Mode = ", mode)

Output:
Mode = ModeResult(mode=array([2]), count=array([2]))

Median: The median is the middle value in a sorted dataset. If the number of values is odd,
it's the center value, if even, it's the average of the two middle values. It's often better than the
mean for skewed data.

import numpy as np

# Sample data
arr = [1, 2, 3, 4]

# Median
median = np.median(arr)
print("Median = ", median)

Output
Median = 2.5

2. Measure of Variability

Knowing not just where the data centers but also how it spreads out is important. Measures of variability, also called measures of dispersion, describe the spread or distribution of observations in a dataset. They help in identifying outliers, assessing model assumptions, and understanding data variability in relation to its mean. The key measures of variability include:
1. Range: describes the difference between the largest and smallest data points in our data set. The bigger the range, the more spread out the data, and vice versa. While easy to compute, the range is sensitive to outliers. This measure can provide a quick sense of the data spread but should be complemented with other statistics.

Range = Largest data value − Smallest data value

import numpy as np
# Sample Data
arr = [1, 2, 3, 4, 5]
# Finding Max
Maximum = max(arr)
# Finding Min
Minimum = min(arr)
# Difference Of Max and Min
Range = Maximum-Minimum
print("Maximum = {}, Minimum = {} and Range = {}".format(
Maximum, Minimum, Range))

Output
Maximum = 5, Minimum = 1 and Range = 4

2. Variance: the average squared deviation from the mean. It is calculated by finding the difference between every data point and the mean, squaring those differences, adding them all together, and dividing by the number of data points in the data set.

σ² = Σ(x − μ)² / N

where,
• x → observation under consideration
• N → number of terms
• μ → mean

import statistics

# Sample data
arr = [1, 2, 3, 4, 5]

# Sample variance (statistics.variance divides by n - 1)
print("Var = ", statistics.variance(arr))
Output
Var = 2.5
3. Standard Deviation: widely used to measure the extent of variation or dispersion in data. It is especially important when assessing model performance (e.g., residuals) or comparing datasets with different means.

It is defined as the square root of the variance. It is calculated by finding the mean, subtracting it from each number, squaring the results, adding all the values, dividing by the number of terms, and taking the square root:

σ = √( Σ(x − μ)² / N )

where,

• x = observation under consideration

• N = number of terms

• μ = mean

import statistics

# Sample data
arr = [1, 2, 3, 4, 5]

# Sample standard deviation (square root of the sample variance)
print("Std = ", statistics.stdev(arr))

Output

Std = 1.5811388300841898

Variability measures are important in residual analysis to check how well a model fits the
data.

3. Measures of Frequency Distribution

A frequency distribution table is a powerful way to summarize how data points are distributed across different categories or intervals. It helps identify patterns, outliers, and the overall structure of the dataset. It is often the first step in understanding the dataset before applying more advanced analytical methods or creating visualizations like histograms or pie charts.

A frequency distribution table includes measures like:

• Data intervals or categories

• Frequency counts

• Relative frequencies (percentages)
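As a sketch of how such a table can be built, Python's collections.Counter yields categories, frequency counts, and relative frequencies; the survey responses below are hypothetical:

```python
from collections import Counter

# Hypothetical survey data: preferred transport of 12 commuters
responses = ["bus", "car", "bus", "bike", "car", "bus",
             "car", "bus", "walk", "bike", "bus", "car"]

counts = Counter(responses)      # frequency count per category
total = sum(counts.values())

# Frequency distribution table: category, frequency, relative frequency
for category, freq in counts.most_common():
    print(f"{category:5s}  {freq:2d}  {freq / total:.1%}")
```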


Variable

Variable is a term derived from the word 'vary', meaning 'to change' or 'to differ'. Therefore,
variables are the characteristics that change or differ. A variable refers to the quantity or
characteristic whose value varies from one investigation to another. The difference in the
value of a variable may be with respect to a place, item, individual, or time, and each value
within this range is known as Variate.

For example, Quantity is a variable as different commodities have different quantities.


Similarly, Price, Age, Weight, Height, Income, Production, etc., are also variables. Different
variables are measured in different units, such as quantity is measured in units or tonnes, etc.,
height in centimeters or inches, age in years, etc.

Variables are of two kinds Discrete Variable (Discontinuous Variable) and Continuous
Variable.

1. Discrete Variable (Discontinuous Variable)

Discrete variables are the variables that take only exact values and not fractional values. In
simple terms, the variables which are expressed in terms of complete numbers are known as
discrete variables. For example, the number of teachers in a school is a discrete variable, as
they cannot be expressed in fractions. The data of discrete variables is collected through
counting.

Example
The data collected on the number of workers in different teams of a company (a discrete variable) can be shown as:

2. Continuous Variable

Continuous variables are the variables that can take every possible value whether it is integral
or fractional, in a specified given range. For example, the height of an individual can be in
decimals or in exact numbers within the specified range. The data of continuous variables is
collected through measurement.
Example
The data collected on the height of students in a class in continuous variables can be shown
as:

Discrete Variable v/s Continuous Variable

Discrete Variable:
• Meaning: takes only exact values, not fractional values.
• Change in values: increases in complete numbers.
• Data collection: by counting.
• Example: the number of students in a college, as it cannot be expressed in fractions.

Continuous Variable:
• Meaning: can take every possible value, whether integral or fractional, in a specified range.
• Change in values: can increase in complete numbers as well as in fractions.
• Data collection: by measurement.
• Example: the height of students in a class, as it can be measured in whole numbers as well as in fractions.
Percentile:

A percentile is a statistical measure that indicates the value below which a given percentage
of observations in a group of data falls. It helps understand how a particular value compares
to the rest of the data.

• In simple words, percentiles are a way to express the relative standing of a value
within a dataset, indicating what percentage of the data falls below that value.

• For example, if you scored in the 90th percentile on a standardized test, it means you
performed better than 90% of the people who took the test.

To find the percentile rank of a specific value 'x', the following formula is used:

Percentile of x = (Number of values below 'x' / Total number of values) × 100

P = (n/N) × 100

Where,

• P is the percentile,

• n - Number of values below 'x',

• N - Total count of population.

The above formula is used to calculate the percentile for a particular value in the population.
If we have a percentile value and we need to find the 'n' value, i.e., for which data value in the
population, then we can rewrite the above formula as

n = (P × N)/100
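A minimal sketch of the formula P = (n/N) × 100; percentile_of is an illustrative helper (not a library function) and the scores are hypothetical:

```python
# Hypothetical sorted test scores
scores = [35, 40, 45, 50, 55, 60, 65, 70, 75, 80]

def percentile_of(x, data):
    """P = (number of values below x / total number of values) * 100"""
    n = sum(1 for v in data if v < x)
    return n / len(data) * 100

# 7 of the 10 values fall below a score of 70
print("Percentile of 70 =", percentile_of(70, scores))
```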

Quartiles:

Quartiles divide a data set into four equal parts, each containing 25% of the data. They help
to understand the spread and center of the data. As an important concept in statistics, quartiles
are used to analyze large data sets by highlighting values near the middle. This method is
particularly useful for identifying outliers and comparing different data sets.

Lower or First Quartile (Q1)


• Quartile 1 lies between the starting term and the middle term.

• This is the median of the lower half of the data set.

• It is also known as the 25th percentile because it marks the point where 25% of the
data is below it.

Median or Second Quartile (Q2)


• Quartile 2 lies between the first term and the last term, i.e., it is the middle term.
• This is the median of the entire data set.

• It is also known as the 50th percentile, as it divides the data into two halves.

Upper or Third Quartile

• Quartile 3 lies between quartile 2 and the last term.


• This is the median of the upper half of the data set.

• It is also known as the 75th percentile because it marks the point where 75% of the
data is below it.
Quartile Formula

As mentioned above, quartiles divide the data into 4 equal parts. There is a separate formula for finding each quartile value, and the steps to obtain them are as follows:

Step 1: Sort the given data in ascending order.


Step 2: Find respective quartile values/terms as per need from the below formulae.

• First Quartile (Q1) = ((n + 1)/4)th term

• Second Quartile (Q2) = ((n + 1)/2)th term

• Third Quartile (Q3) = (3(n + 1)/4)th term

5 Number Summary:

The concept of a 5 number summary is a way to describe a distribution using 5 numbers. This
includes minimum number, quartile-1, median/quartile-2, quartile-3, and maximum number.
This concept of 5 number summary comes under the concept of Statistics which deals with
the collection of data, analyzing it, interpreting, and presenting the data in an organized
manner.

As stated in the paragraph above, it gives a rough idea of what the given dataset looks like by representing the minimum value, maximum value, median, and quartile values.

Calculating 5 number summary


In order to find the 5 number summary, the data needs to be sorted. If it is not, first sort it in ascending order and then proceed.

• Minimum Value: It is the smallest number in the given data, and the first number
when it is sorted in ascending order.

• Maximum Value: It is the largest number in the given data, and the last number when
it is sorted in ascending order.

Median: Middle value between the minimum and maximum value. Below is the formula to
find median,

Median = (n + 1)/2th term

• Quartile 1: the middle value between the minimum and the median. For a small dataset we can simply identify this value by inspection; for a big dataset with many numbers it is better to use the formula,

Quartile 1 = ((n + 1)/4)th term
• Quartile 3: Middle/center value between median and maximum value.

Quartile 3 = (3(n + 1)/4)th term

Question 1: What is the 5 number summary in the given data 10, 20, 5, 15, 25, 30, 8.

Solution:

• Step-1 Sort the given data in ascending order.


5, 8, 10, 15, 20, 25, 30
• Step-2 Find minimum number

Minimum value = 5

• Step-3 Find maximum number

Maximum value = 30

• Step-4 Find median


Here we need to find median value by a formula (n + 1)/2 th term where n
is the total count of numbers.
Here n = 7
So median = (7 + 1)/2 = 8/2 = 4 th term
4th term is median which is 15.

For Quartile-1 Formula is ((n + 1)/4)th term where n is the count of numbers in the dataset.
n = 7 because there are 7 numbers in the data.

Quartile-1 = ((7 + 1)/4)th term

= (8/4)th term

= 2nd term
2nd term is 8 So, Quartile-1 = 8

In the same way find the quartile-3 using the formula (3(n + 1)/4)th term.

Quartile 3 = (3(7 + 1)/4)th term

= (3(8)/4)th term

= (24/4)th term

= 6th term

6th term is 25 so Quartile-3 = 25
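The worked example above can be verified with a short sketch; the (n + 1)-based positions land on whole terms here because n = 7:

```python
# Data from the worked example above
data = sorted([10, 20, 5, 15, 25, 30, 8])   # [5, 8, 10, 15, 20, 25, 30]
n = len(data)                               # n = 7

minimum = data[0]                           # smallest value
maximum = data[-1]                          # largest value
median = data[(n + 1) // 2 - 1]             # ((n + 1)/2)th term -> 15
q1 = data[(n + 1) // 4 - 1]                 # ((n + 1)/4)th term -> 8
q3 = data[3 * (n + 1) // 4 - 1]             # (3(n + 1)/4)th term -> 25

print("5 number summary:", minimum, q1, median, q3, maximum)
```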

What is Covariance?

Covariance is a statistical measure of the relationship between a pair of random variables: it assesses how much two variables change together relative to their mean values. Covariance is calculated by taking the average of the product of the deviations of each variable from their respective means. It tells us the direction of the relationship but not how strong it is, because the number depends on the units used. It is an important tool for seeing how two things are connected.
1. It can take any value between −∞ and +∞, where a negative value represents a negative relationship and a positive value represents a positive relationship.

2. It is used for the linear relationship between variables.


3. It gives the direction of relationship between variables.

Covariance Formula

For paired observations (xᵢ, yᵢ), the sample covariance is:

Cov(X, Y) = Σ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

For a whole population, divide by N instead of (n − 1).

Types of Covariance
• Positive Covariance: When one variable increases, the other variable tends to
increase as well and vice versa.

• Negative Covariance: When one variable increases, the other variable tends to
decrease.
• Zero Covariance: There is no linear relationship between the two variables; they
move independently of each other.


What is Correlation?

Correlation is a standardized measure of the strength and direction of the linear relationship
between two variables. It is derived from covariance and ranges between -1 and 1. Unlike
covariance, which only indicates the direction of the relationship, correlation provides a
standardized measure.

• Positive Correlation (close to +1): As one variable increases, the other variable also
tends to increase.

• Negative Correlation (close to -1): As one variable increases, the other variable tends
to decrease.

• Zero Correlation: There is no linear relationship between the variables.

The correlation coefficient ρ (rho) for variables X and Y is defined as:

ρ = Cov(X, Y) / (σₓ · σᵧ)

where σₓ and σᵧ are the standard deviations of X and Y.

1. Correlation takes values between −1 and +1, where values close to +1 represent strong positive correlation and values close to −1 represent strong negative correlation.

2. It is a standardized (unit-free) measure, so it is not affected by the scale of the variables.

3. It gives the direction and strength of the relationship between variables.


Difference between Covariance and Correlation

This table shows the difference between covariance and correlation:

Covariance:
• A measure of how much two random variables vary together.
• Involves the relationship between two variables or data sets.
• Lies between −∞ and +∞.
• A measure of correlation.
• Provides the direction of the relationship.
• Dependent on the scale of the variables.
• Has dimensions (units).

Correlation:
• A statistical measure that indicates how strongly two variables are related.
• Can involve the relationship between multiple variables as well.
• Lies between −1 and +1.
• A scaled version of covariance.
• Provides the direction and strength of the relationship.
• Independent of the scale of the variables.
• Dimensionless.
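Both measures can be sketched with NumPy; the paired data below (hours studied vs. exam score) is hypothetical. Note how rescaling one variable changes the covariance but leaves the correlation untouched:

```python
import numpy as np

# Hypothetical paired data: hours studied (x) vs. exam score (y)
x = np.array([1, 2, 3, 4, 5])
y = np.array([52, 58, 65, 70, 80])

cov = np.cov(x, y)[0, 1]        # sample covariance (divides by n - 1)
rho = np.corrcoef(x, y)[0, 1]   # Pearson correlation, always within [-1, 1]

print("Covariance  =", cov)
print("Correlation =", rho)

# Rescaling x changes the covariance but leaves the correlation unchanged
assert not np.isclose(np.cov(10 * x, y)[0, 1], cov)
assert np.isclose(np.corrcoef(10 * x, y)[0, 1], rho)
```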

Probability:
Probability is a branch of mathematics that quantifies the likelihood of an event occurring. It
is a numerical value between 0 and 1, where 0 indicates that an event is impossible and 1
indicates that it is certain.

How to calculate probability

The most basic way to calculate the probability of a single event is to divide the number of
favorable outcomes by the total number of possible outcomes.

P(A)=Number of favorable outcomes/Total number of possible outcomes


Addition Rule: for any two events A and B,

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

If A and B are mutually exclusive, this reduces to P(A ∪ B) = P(A) + P(B).

Multiplication Rule: for any two events A and B,

P(A ∩ B) = P(A) × P(B | A)

If A and B are independent, this reduces to P(A ∩ B) = P(A) × P(B).
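As a sketch, both rules can be verified on a standard 52-card deck using exact fractions:

```python
from fractions import Fraction

# Events on a standard 52-card deck: A = drawing a heart, B = drawing a king
p_heart = Fraction(13, 52)
p_king = Fraction(4, 52)
p_heart_and_king = Fraction(1, 52)   # the king of hearts

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_heart_or_king = p_heart + p_king - p_heart_and_king
print("P(heart or king) =", p_heart_or_king)      # 4/13

# Multiplication rule for independent events: two draws WITH replacement
p_two_kings = p_king * p_king
print("P(two kings)     =", p_two_kings)          # 1/169
```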
Probability Distribution - Function, Formula, Table

A probability distribution is a mathematical function or rule that describes how the


probabilities of different outcomes are assigned to the possible values of a random variable. It
provides a way of modeling the likelihood of each outcome in a random experiment.

While a frequency distribution shows how often outcomes occur in a sample or dataset, a probability distribution assigns probabilities to outcomes theoretically, regardless of any specific dataset. These probabilities represent the likelihood of each outcome occurring.
Common types of probability distributions include the Bernoulli, Binomial, Poisson, Uniform, and Normal distributions.

Properties of a probability distribution include:

• The probability of each outcome is greater than or equal to zero.


• The sum of the probabilities of all possible outcomes equals 1.

This unit covers the key concepts of probability distributions, the types of probability distributions, and their applications in computer science.

Probability Distribution of a Random Variable

Now the question comes, how to describe the behavior of a random variable?

Suppose that our random variable only takes finitely many values x1, x2, x3, ..., xn; i.e., the range of X is the set of n values {x1, x2, x3, ..., xn}.
The behavior of X is completely described by giving probabilities for all the values of the
random variable X.

Event    Probability
x1       P(X = x1)
x2       P(X = x2)
x3       P(X = x3)

The probability function of a discrete random variable X is the function p(x) satisfying

p(x) = P(X = x)

Example: We draw two cards successively with replacement from a well-shuffled deck of 52
cards. Find the probability distribution of finding aces.

Answer:

Let X be the number of aces drawn. Since the draws are with replacement, the two draws are independent, with P(ace) = 4/52 = 1/13 and P(no ace) = 12/13 on each draw. X can take the values 0, 1, 2:

• P(X = 0) = (12/13)² = 144/169

• P(X = 1) = 2 × (1/13) × (12/13) = 24/169

• P(X = 2) = (1/13)² = 1/169
Types of Probability Distributions

We have seen what Probability Distributions are; now we will see different types of
Probability Distributions. The Probability Distribution's type is determined by the type of
random variable. There are two types of Probability Distributions:

• Discrete Probability Distributions for Discrete Variables

• Continuous Probability Distribution for Continuous Variables

We will study in detail two types of discrete probability distributions.

Discrete Probability Distributions


Discrete probability functions apply to discrete random variables, which take countable values (e.g., 0, 1, 2, ...). These distributions assign probabilities to individual values. This includes distributions such as the Bernoulli, Binomial, and Poisson, which are used to model outcomes that can be counted, as explained below:

Bernoulli Trials
Trials of a random experiment are known as Bernoulli trials if they satisfy the conditions given below:

• A finite number of trials is required.

• All trials must be independent. (when the outcome of any trial is independent of the
outcome of any other trial.)

• Every trial has two outcomes : success or failure.

• Probability of success remains constant across all trials.

Example: Can throwing a fair die 50 times be considered an example of 50 Bernoulli trials if
we define:

• Success is getting an even number (2, 4, or 6),

• Failure as getting an odd number (1, 3, or 5)


Answer:

Yes, this can be considered an example of 50 Bernoulli trials:

• There are 3 even numbers out of 6 possible outcomes, so p = 3/6 = 1/2.

• There are 3 odd numbers out of 6, so q = 3/6 = 1/2.

So, throwing a fair die 50 times with this definition is a classic example of 50 Bernoulli trials, with p = 1/2 and q = 1/2.
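The die-throwing experiment above can be sketched as a simulation; the seed is arbitrary, chosen only for reproducibility:

```python
import random

# Success = rolling an even number (2, 4 or 6) on a fair die, p = 1/2
random.seed(42)   # arbitrary seed, for reproducibility only
trials = 50
successes = sum(1 for _ in range(trials) if random.randint(1, 6) % 2 == 0)

print(successes, "successes out of", trials, "trials")
print("Observed success rate:", successes / trials)
```

With many more trials, the observed success rate would settle close to p = 1/2.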

Binomial Distribution

A binomial distribution describes the number of successes in n independent Bernoulli trials, each with the same success probability p. Its probability mass function is:

P(X = k) = nCk · p^k · (1 − p)^(n − k),  for k = 0, 1, ..., n

Poisson Probability Distribution

The Poisson distribution models the number of events occurring in a fixed interval when events happen independently at a constant average rate λ. Its probability mass function is:

P(X = k) = (e^(−λ) · λ^k) / k!,  for k = 0, 1, 2, ...
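As an illustrative sketch, both the binomial and Poisson PMFs can be written with only the Python standard library (the function names are ours, not from a library):

```python
import math

# Binomial PMF: P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
def binomial_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Poisson PMF: P(X = k) = e^(-lam) * lam^k / k!
def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

# P(exactly 3 evens in 5 fair die throws), with p = 1/2
print(binomial_pmf(3, 5, 0.5))      # 0.3125

# P(2 arrivals) when the average rate is lambda = 4
print(poisson_pmf(2, 4))
```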


Continuous Probability Distributions

Probability distributions for continuous random variables (uncountable outcomes, e.g., time, height, temperature), such as the Uniform and Normal distributions, are explained below.

Uniform Distribution

The continuous uniform distribution on an interval [a, b] assigns equal density to every value in the interval: f(x) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise.

Normal (Gaussian) Distribution

The normal distribution is the familiar bell-shaped curve, characterized by its mean μ and standard deviation σ:

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

Chi-Square Distribution

The chi-square distribution with k degrees of freedom is the distribution of a sum of squares of k independent standard normal variables; it is widely used in hypothesis testing, for example in goodness-of-fit tests.

Application of Probability Distribution in Computer Science


Probability distributions are used in many areas of computer science, as follows:

• In machine learning, they help make predictions and deal with uncertainty.

• In natural language processing, they are used to model how often words appear.

• In computer vision, they help understand image data and remove noise.

• In networking, distributions like Poisson are used to study how data packets arrive.
• Cryptography uses random numbers based on probability.

• Software testing and reliability also use distributions to predict bugs and failures.

Overall, probability distributions help in building smarter, more reliable, and efficient
computer systems.
Solved Questions on Probability Distribution

Question 1: A box contains 4 blue balls and 3 green balls. Find the probability distribution of
the number of green balls in a random draw of 3 balls.

Solution:

Given that the total number of balls is 7, out of which 3 have to be drawn at random. On drawing 3 balls, the possibilities are: all 3 are green, exactly 2 are green, exactly 1 is green, and none are green. Hence X = 0, 1, 2, 3.

• P(No ball is green) = P(X = 0) = 4C3/7C3 = 4/35

• P(1 ball is green) = P(X = 1) = 3C1 × 4C2 / 7C3 = 18/35

• P(2 balls are green) = P(X = 2) = 3C2 × 4C1 / 7C3 = 12/35

• P(All 3 balls are green) = P(X = 3) = 3C3 / 7C3 = 1/35

Hence, the probability distribution for this problem is given as follows

X 0 1 2 3

P(X) 4/35 18/35 12/35 1/35

Question 2: From a lot of 10 bulbs containing 3 defective ones, 4 bulbs are drawn at random.
If X is a random variable that denotes the number of defective bulbs. Find the probability
distribution of X.

Solution:
Since, X denotes the number of defective bulbs and there is a maximum of 3 defective bulbs,
hence X can take values 0, 1, 2, and 3. Since 4 bulbs are drawn at random, the possible
combination of drawing 4 bulbs is given by 10C4.

• P(Getting No defective bulb) = P(X = 0) = 7C4 / 10C4 = 1/6


• P(Getting 1 Defective Bulb) = P(X = 1) = 3C1 × 7C3/10C4 = 1/2
• P(Getting 2 defective Bulb) = P(X = 2) = 3C2 × 7C2/10C4 = 3/10
• P(Getting 3 Defective Bulb) = P(X = 3) = 3C3 × 7C1/10C4 = 1/30
Hence Probability Distribution Table is given as follows

X 0 1 2 3

P(X) 1/6 1/2 3/10 1/30
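The bulb example above can be verified with exact arithmetic using math.comb:

```python
from math import comb
from fractions import Fraction

# 10 bulbs, 3 defective, 4 drawn at random; X = number of defective bulbs
total = comb(10, 4)   # 210 possible draws
pmf = {k: Fraction(comb(3, k) * comb(7, 4 - k), total) for k in range(4)}

for k, p in pmf.items():
    print(f"P(X = {k}) = {p}")
print("Sum =", sum(pmf.values()))   # probabilities sum to 1
```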


What is Probability Mass Function?

A probability function that gives the probability of a discrete random variable being equal to an exact value is called the probability mass function, abbreviated as PMF. Different distributions have different formulas for calculating the probability mass function.

Probability Mass Function Definition

The PMF is the probability that a discrete random variable is equal to a particular value. It is represented as f(x) = P(X = x), where X is the discrete random variable and x is the specified value.

Probability Mass Function Example

If a die is rolled, the probability of getting a number equal to 4 is an example of a probability mass function. The sample space for the given event is {1, 2, 3, 4, 5, 6}, and let X be the random variable for the number obtained. The probability mass function evaluated at X = 4 is 1/6.
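The die example can be sketched as a PMF lookup table:

```python
from fractions import Fraction

# PMF of a fair die: f(x) = P(X = x) = 1/6 for x in {1, ..., 6}
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

print("P(X = 4) =", pmf[4])               # 1/6, as in the example above
print("Sum of PMF =", sum(pmf.values()))  # must equal 1
```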
What is Probability Density Function (PDF)?
Probability Density Function is used for calculating the probabilities for continuous random
variables. When the cumulative distribution function (CDF) is differentiated we get the
probability density function (PDF). Both functions are used to represent the probability
distribution of a continuous random variable.

The probability density function is defined over a specific range. By differentiating the CDF we get the PDF, and by integrating the probability density function we get the cumulative distribution function.

Probability Density Function Definition

Probability density function is the function that represents the density of probability for a
continuous random variable over the specified ranges.

Probability Density Function is abbreviated as PDF and for a continuous random variable X,
Probability Density Function is denoted by f(x).

The PDF of a random variable is obtained by differentiating the CDF (Cumulative Distribution Function) of X. The probability density function should be non-negative for all possible values of the variable, and the total area between the density curve and the x-axis should be equal to 1.

Necessary Conditions for PDF

Let X be a continuous random variable with probability density function f(x). For f(x) to be a valid probability density function, it should satisfy the conditions below:

• f(x) ≥ 0 for all x

• The total integral ∫ f(x) dx over the whole range of X equals 1.
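As a sketch, the two standard PDF conditions (non-negativity and unit total area) can be checked numerically for the standard normal density; the integration grid below is an arbitrary choice:

```python
import math

# Standard normal PDF: f(x) = (1 / (sigma * sqrt(2*pi))) * exp(-(x - mu)^2 / (2*sigma^2))
def normal_pdf(x, mu=0.0, sigma=1.0):
    coef = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coef * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Condition 1: the density is never negative
assert all(normal_pdf(x) >= 0 for x in range(-10, 11))

# Condition 2: total area under the curve is 1 (trapezoidal rule on [-6, 6])
steps = 12000
h = 12 / steps
xs = [-6 + i * h for i in range(steps + 1)]
area = sum(h * (normal_pdf(a) + normal_pdf(b)) / 2 for a, b in zip(xs, xs[1:]))
print("Area under the curve ~", round(area, 6))
```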
Hypothesis Testing:

Hypothesis testing is a statistical method used to evaluate an assumption or claim about a


population using data from a sample. It is a formalized procedure that uses sample data to
determine whether there is enough evidence to reject a particular idea about the population.
What is Error?

In statistics and hypothesis testing, an error refers to a discrepancy between a value obtained by observation or calculation and the actual or expected value.

Errors may arise from different factors, such as unclear implementation or faulty assumptions. Errors can be of many types, such as:

• Measurement Error

• Calculation Error

• Human Error
• Systematic Error

• Random Error

In hypothesis testing, two kinds of error are of particular concern: the Type I error and the Type II error.

Type I Error - False Positive

Type I error, also known as a false positive, occurs in statistical hypothesis testing when a
null hypothesis that is actually true is rejected. It's the error of incorrectly concluding that
there is a significant effect or difference when there isn't one in reality.
In hypothesis testing, there are two competing hypotheses:
• Null Hypothesis (H0): This hypothesis represents a default assumption that there is
no effect, no difference or no relationship in the population being studied.

• Alternative Hypothesis (H1): This hypothesis represents the opposite of the null
hypothesis. It suggests that there is a significant effect, difference or relationship in
the population.

A Type I error occurs when the null hypothesis is rejected based on the sample data, even
though it is actually true in the population.
Type II Error - False Negative

Type II error, also known as a false negative, occurs in statistical hypothesis testing when a
null hypothesis that is actually false is not rejected. In other words, it's the error of failing to
detect a significant effect or difference when one exists in reality.

A Type II error occurs when the null hypothesis is not rejected based on the sample data, even
though it is actually false in the population. It's a failure to recognize a real effect or
difference.

Suppose a medical researcher is testing a new drug to see if it's effective in treating a certain
condition. The null hypothesis (H0) states that the drug has no effect, while the alternative
hypothesis (H1) suggests that the drug is effective.

If the researcher conducts a statistical test and fails to reject the null hypothesis (H0),
concluding that the drug is not effective, when in fact it does have an effect, this would be a
Type II error.

Type I and Type II Errors - Comparison

Error Type | Description                               | Also Known As  | When It Occurs
Type I     | Rejecting a true null hypothesis          | False Positive | You believe there is an effect or difference when there isn't
Type II    | Failing to reject a false null hypothesis | False Negative | You believe there is no effect or difference when there is

Type I and Type II Errors Examples

Examples of Type I Error


• Medical Testing: Suppose a medical test is designed to diagnose a particular disease.
The null hypothesis (H0) is that the person does not have the disease, and the
alternative hypothesis (H1) is that the person does have the disease. A Type I error
occurs if the test incorrectly indicates that a person has the disease (rejects the null
hypothesis) when they do not actually have it.

• Legal System: In a criminal trial, the null hypothesis (H0) is that the defendant is
innocent, while the alternative hypothesis (H1) is that the defendant is guilty. A Type I
error occurs if the jury convicts the defendant (rejects the null hypothesis) when they
are actually innocent.

• Quality Control: In manufacturing, quality control inspectors may test products to
ensure they meet certain specifications. The null hypothesis (H0) is that the product
meets the required standard, while the alternative hypothesis (H1) is that the product
does not meet the standard. A Type I error occurs if a product is rejected (null
hypothesis is rejected) as defective when it actually meets the required standard.

Examples of Type II Error

• Medical Testing: In a medical test designed to diagnose a disease, a Type II error
occurs if the test incorrectly indicates that a person does not have the disease (fails to
reject the null hypothesis) when they actually do have it.

• Legal System: In a criminal trial, a Type II error occurs if the jury acquits the
defendant (fails to reject the null hypothesis) when they are actually guilty.

• Quality Control: In manufacturing, a Type II error occurs if a defective product is
accepted (fails to reject the null hypothesis) as meeting the required standard.
How to Minimize Type I and Type II Errors

To minimize Type I and Type II errors in hypothesis testing, there are several strategies that
can be employed based on the information from the sources provided:

Minimizing Type I Error


• To reduce the probability of a Type I error (rejecting a true null hypothesis), one can
choose a smaller level of significance (alpha) at the beginning of the study.

• By setting a lower significance level, the chances of incorrectly rejecting the null
hypothesis decrease, thus minimizing Type I errors.

Minimizing Type II Error

• The probability of a Type II error (failing to reject a false null hypothesis) can be
minimized by increasing the sample size or choosing a "threshold" alternative value
of the parameter further from the null value.
• Increasing the sample size reduces the variability of the statistic, making it less likely
to fall in the non-rejection region when it should be rejected, thus minimizing Type II
errors.
Factors Affecting Type I and Type II Errors
• Sample Size: In statistical hypothesis testing, larger sample sizes generally reduce the
probability of both Type I and Type II errors. With larger samples, the estimates tend
to be more precise, resulting in more accurate conclusions.

• Significance Level: The significance level (α) in hypothesis testing determines the
probability of committing a Type I error. Choosing a lower significance level reduces
the risk of Type I error but increases the risk of Type II error, and vice versa.

• Effect Size: The magnitude of the effect or difference being tested influences the
probability of Type II error. Smaller effect sizes are more challenging to detect,
increasing the likelihood of failing to reject the null hypothesis when it's false.

• Statistical Power: The power of a test (1 – β) is the probability of correctly rejecting
a false null hypothesis; it is the complement of the probability of committing a
Type II error. As the power of the test rises, the chance of a Type II error drops.
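The role of the significance level α can be illustrated by simulation. The sketch below (pure Python; the sample size, trial count, and seed are chosen purely for illustration) repeatedly runs a two-sided z-test on samples drawn while the null hypothesis is actually true, and counts how often it is wrongly rejected — each such rejection is a Type I error:

```python
import math
import random

def z_test_p_value(sample, mu0=0.0, sigma=1.0):
    """Two-sided z-test p-value for H0: population mean == mu0 (sigma known)."""
    n = len(sample)
    z = (sum(sample) / n - mu0) * math.sqrt(n) / sigma
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard normal CDF
    return 2 * (1 - phi)

random.seed(42)
alpha = 0.05
trials = 2000
rejections = 0
for _ in range(trials):
    # H0 is true here: the data really come from N(0, 1).
    sample = [random.gauss(0, 1) for _ in range(30)]
    if z_test_p_value(sample) < alpha:
        rejections += 1  # a Type I error

print(rejections / trials)  # hovers near alpha = 0.05
```

The observed false-positive rate stays close to α, which is exactly why choosing a smaller α reduces Type I errors (at the cost of more Type II errors).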

What is the P-value?

The p-value, or probability value, is a statistical measure used in hypothesis testing to assess
the strength of evidence against a null hypothesis. It represents the probability of obtaining
results as extreme as, or more extreme than, the observed results under the assumption that
the null hypothesis is true.

In simpler words, it is used to reject or support the null hypothesis during hypothesis testing.
In data science, it gives valuable insights on the statistical significance of an independent
variable in predicting the dependent variable.

How is the P-value calculated?

Calculating the p-value typically involves the following steps:

1. Formulate the Null Hypothesis (H0): Clearly state the null hypothesis, which
typically states that there is no significant relationship or effect between the variables.

2. Choose an Alternative Hypothesis (H1): Define the alternative hypothesis, which
proposes the existence of a significant relationship or effect between the variables.
3. Determine the Test Statistic: Calculate the test statistic, which is a measure of the
discrepancy between the observed data and the expected values under the null
hypothesis. The choice of test statistic depends on the type of data and the specific
research question.

4. Identify the Distribution of the Test Statistic: Determine the appropriate sampling
distribution for the test statistic under the null hypothesis. This distribution represents
the expected values of the test statistic if the null hypothesis is true.
5. Calculate the p-value: Based on the observed test statistic and its sampling
distribution, find the probability of obtaining the observed test statistic or a more
extreme one, assuming the null hypothesis is true. This probability is the p-value.

6. Interpret the results: Compare the p-value with the chosen significance level (α). If
the p-value is less than α, there is evidence to reject the null hypothesis; otherwise,
fail to reject it.
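The steps above can be traced by hand for a two-sided z-test, using the standard normal CDF built from `math.erf` (the observed statistic z = 1.96 is an assumed example value):

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Step 3/4: suppose the observed test statistic is z = 1.96, and under H0
# it follows the standard normal distribution.
z = 1.96

# Step 5: two-sided p-value — probability of a result at least this extreme.
p_value = 2 * (1 - normal_cdf(abs(z)))
print(round(p_value, 3))  # 0.05

# Step 6: compare with the significance level alpha.
alpha = 0.05
print("reject H0" if p_value < alpha else "fail to reject H0")
```

A z-statistic of 1.96 sits right at the boundary of the conventional 5% level, which is why 1.96 appears so often as a critical value in textbooks.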

What is EDA?

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets
and summarize their main characteristics, often employing data visualization methods.

EDA helps determine how best to manipulate data sources to get the answers you need,
making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or
check assumptions.

Types of EDA
There are four primary types of EDA:

• Univariate non-graphical
• Univariate graphical
• Multivariate non-graphical
• Multivariate graphical
Univariate non-graphical

This is the simplest form of data analysis, where the data being analyzed consists of just one
variable. Since it's a single variable, it doesn't deal with causes or relationships. The main
purpose of univariate analysis is to describe the data and find patterns that exist within it.
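Typical univariate non-graphical summaries are measures of center and spread. A minimal sketch using Python's standard `statistics` module (the exam scores are made-up data):

```python
import statistics

# Hypothetical single-variable data: exam scores for one class.
scores = [72, 85, 78, 90, 66, 85, 74, 88, 91, 70]

print("mean:  ", statistics.mean(scores))            # center: average score
print("median:", statistics.median(scores))          # center: middle value
print("mode:  ", statistics.mode(scores))            # most frequent value
print("stdev: ", round(statistics.stdev(scores), 2)) # spread around the mean
print("range: ", max(scores) - min(scores))          # total spread
```

These few numbers already describe the distribution's center and spread without any plot.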
Univariate graphical

Non-graphical methods don’t provide a full picture of the data. Graphical methods are
therefore required. Common types of univariate graphics include:

• Stem-and-leaf plots, which show all data values and the shape of the distribution.

• Histograms, a bar plot in which each bar represents the frequency (count) or
proportion (count/total count) of cases for a range of values.

• Box plots, which graphically depict the five-number summary of minimum, first
quartile, median, third quartile, and maximum.
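A stem-and-leaf plot can even be produced without a plotting library, since it is just text. A small sketch (the data values are made up for illustration):

```python
from collections import defaultdict

# Hypothetical data: test scores.
data = [23, 25, 31, 34, 34, 42, 47, 51, 55, 55, 58, 62]

stems = defaultdict(list)
for value in sorted(data):
    stems[value // 10].append(value % 10)  # stem = tens digit, leaf = units digit

for stem in sorted(stems):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
```

The printed rows show every data value while the row lengths reveal the shape of the distribution, which is exactly what the plot is for.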

Multivariate non-graphical

Multivariate data arises from more than one variable. Multivariate non-graphical EDA
techniques generally show the relationship between two or more variables of the data through
cross-tabulation or statistics.
Multivariate graphical

Multivariate graphical EDA uses graphics to display relationships between two or more sets
of data. The most commonly used graphic is a grouped bar plot, with each group representing
one level of one of the variables and each bar within a group representing the levels of the
other variable.

Other common types of multivariate graphics include:

• Scatter plot, which is used to plot data points on a horizontal and a vertical axis to
show how much one variable is affected by another.

• Multivariate chart, which is a graphical representation of the relationships between
factors and a response.

• Run chart, which is a line graph of data plotted over time.

• Bubble chart, which is a data visualization that displays multiple circles (bubbles) in a
two-dimensional plot.

• Heat map, which is a graphical representation of data where values are depicted by
color.

Why is Exploratory Data Analysis Important?

1. Helps to understand the dataset by showing how many features it has, what type of
data each feature contains and how the data is distributed.

2. Helps to identify hidden patterns and relationships between different data points,
which helps us in model building.

3. Allows to identify errors or unusual data points (outliers) that could affect our results.

4. The insights gained from EDA help us to identify most important features for building
models and guide us on how to prepare them for better performance.

5. By understanding the data, it helps us in choosing the best modeling techniques and
adjusting them for better results.

What is Data Visualization?

Data visualization is the graphical representation of information and data using visual
elements like charts, graphs, and maps. It is the process of translating complex datasets and
statistics into accessible visuals that reveal trends, patterns, and outliers.
The main goal is to help people understand and analyze data more easily, leading to faster
and more informed decision-making. This is especially critical in the age of Big Data, where
companies deal with massive volumes of information.

Different types of graphs for data visualization


The choice of graph depends on the type of data and the insights you want to convey.
Comparison

These graphs are used to compare values across different categories or to show changes over
time.

• Bar Chart: Displays data with rectangular bars whose lengths are proportional to the
values they represent. It is one of the most common and simple chart types for
comparing categories.

• Line Chart: Connects a series of data points with a continuous line to show trends
over time, such as stock prices or website traffic.

• Column Chart: Similar to a bar chart but with vertical bars. It is often used for
side-by-side comparisons of different values.

Composition

These charts show how individual parts make up a whole.

• Pie Chart: Divides a circle into slices, with each slice representing a percentage or
proportion of the total. It is most effective when comparing a few categories.

• Donut Chart: A variant of a pie chart with a hollow center, often used to display the
total value in the center.

• Treemap: Displays hierarchical data as a set of nested rectangles, where the size of
each rectangle is proportional to its value.

• Waterfall Chart: Illustrates how an initial value is affected by a series of positive and
negative changes.

Distribution
These visualizations show the frequency or spread of data.

• Histogram: A type of bar chart that shows the frequency distribution of continuous
numerical data. The bars represent ranges (bins), and their height shows the number
of values in that range.

• Box Plot: Displays the distribution of numerical data by showing the median,
quartiles, and potential outliers. It is useful for comparing distributions across
different groups.

• Violin Plot: Combines a box plot with a density plot to show the distribution and
summary statistics of data.
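The five-number summary that a box plot depicts can be computed directly with Python's standard `statistics` module (the data list is an arbitrary example):

```python
import statistics

# Hypothetical numerical data.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# quantiles(..., n=4) returns the three quartile cut points; the
# "inclusive" method interpolates between data points.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
five_number = (min(data), q1, median, q3, max(data))
print(five_number)  # (1, 3.5, 6.0, 8.5, 11)
```

A box plot is just this summary drawn as a box (Q1 to Q3), a line (median), and whiskers (minimum and maximum, with points beyond the whiskers flagged as outliers).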

Relationships
These graphs highlight correlations and connections between two or more variables.

• Scatter Plot: Uses dots to display the relationship between two variables. The pattern
of the dots can reveal correlations, clusters, or outliers.
• Bubble Chart: An extension of the scatter plot where a third variable is represented
by the size of the bubbles.
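The correlation a scatter plot reveals visually can be quantified with the Pearson correlation coefficient. A self-contained sketch (the hours/score data are invented for illustration):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
score = [52, 58, 61, 67, 72]

print(round(pearson_r(hours, score), 3))  # ~0.996: strong positive correlation
```

Values near +1 or -1 indicate a tight linear relationship (points close to a line in the scatter plot), while values near 0 indicate no linear relationship.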
