Introduction To Data Science Unit 2 Notes
Unit 2
Population:
A population refers to the entire set of individuals, objects, or data points that you want to
study. It can be large or small depending on the scope of your research.
• For example, all students in a school or all people in a country.
Sample:
A sample is a subset of the population that is selected for analysis. It's used when studying the
entire population is impractical or impossible. Sampling allows for inferences about the
population using statistical techniques.
Parameters (like the population mean) describe the population, while statistics (like the
sample mean) describe the sample.
Populations are used when your research question requires it, or when you have access to
data from every member of the population. Usually, it is only straightforward to collect data
from a whole population when it is small, accessible, and cooperative.
Example:
• A company with a small, accessible customer base analyzes its entire population of
customers to identify trends, since the customer base is limited and accessible.
When your population is large in size, geographically dispersed, or difficult to contact, it’s
necessary to use a sample. With statistical analysis, you can use sample data to make
estimates or test hypotheses about population data.
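The idea of estimating a population parameter from a sample statistic can be sketched in Python. The population and sample size below are made-up illustrative values:

```python
import random
import statistics

# Hypothetical population: 100 numeric values (e.g., ages)
population = list(range(1, 101))
population_mean = statistics.mean(population)  # the parameter

# Draw a random sample of 10 members
random.seed(42)  # fixed seed so the example is reproducible
sample = random.sample(population, k=10)
sample_mean = statistics.mean(sample)  # the statistic

print("Population mean (parameter):", population_mean)
print("Sample mean (statistic):    ", sample_mean)
```

The sample mean will generally be close to, but not exactly equal to, the population mean; statistical techniques quantify that uncertainty.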
Example:
• A researcher studying the social media habits of teenagers uses a sample: the
population includes all teenagers aged 13–18, which could be tens of thousands.
Population vs Sample
• Population: Described by parameters (e.g., the population mean), which are calculated
using data from every member of the population.
• Sample: Described by statistics (e.g., the sample mean), which are computed directly
from sample data and used to estimate population parameters, helping researchers infer
population characteristics from a representative subset.
Statistics is a branch of mathematics that deals with every aspect of data. Statistical
knowledge helps in choosing the proper method of collecting data and employing those
samples in the correct analysis process in order to produce results effectively. In short,
statistics is a crucial process that helps us make decisions based on data.
Types of Statistics
The two main branches of statistics are:
• Descriptive Statistics
• Inferential Statistics
Descriptive Statistics – Summarizes and describes the main features of a dataset.
Inferential Statistics – Based on a data sample taken from the population, inferential
statistics makes predictions and inferences about the population.
Descriptive Statistics:
Statistics is the foundation of data science. Descriptive statistics are simple tools that help us
understand and summarize data. They show the basic features of a dataset, like the average,
highest and lowest values and how spread out the numbers are. It's the first step in making
sense of information.
Types of Descriptive Statistics
There are three standard categories of descriptive statistics methods, each serving a
different purpose in summarizing and describing data:
• Measures of Central Tendency
• Measures of Variability (Dispersion)
• Measures of Frequency Distribution
1. Measures of Central Tendency
Statistical values that describe the central position within a dataset. There are three main
measures of central tendency:
Mean: The sum of observations divided by the total number of observations; in other
words, the average.
Mean = (Sum of all observations) / (Number of observations)
Let's look at an example of how we can find the mean of a dataset using a Python code
implementation.
import numpy as np
# Sample Data
arr = [5, 6, 11]
# Mean
mean = np.mean(arr)
print("Mean = ", mean)
Output
Mean = 7.333333333333333
Mode: The most frequently occurring value in the dataset. It's useful for categorical data
and in cases where knowing the most common choice is crucial. For example, using
scipy.stats (the snippet below is a reconstruction of the missing code; newer SciPy
versions return scalars rather than arrays):
from scipy import stats
# Sample Data (2 occurs twice, more than any other value)
arr = [1, 2, 2, 3]
# Mode
mode = stats.mode(arr)
print("Mode = ", mode)
Output:
Mode = ModeResult(mode=array([2]), count=array([2]))
Median: The median is the middle value in a sorted dataset. If the number of values is odd,
it's the center value; if even, it's the average of the two middle values. It's often better than
the mean for skewed data.
import numpy as np
# Sample Data
arr = [1, 2, 3, 4]
# Median
median = np.median(arr)
print("Median = ", median)
Output
Median = 2.5
2. Measures of Variability
Knowing not just where the data centers but also how it spreads out is important. Measures
of variability, also called measures of dispersion, describe the spread or distribution of
observations in a dataset. They help in identifying outliers, assessing model assumptions,
and understanding data variability in relation to its mean. The key measures of variability
include:
1. Range: The difference between the largest and smallest data points in the dataset. The
bigger the range, the greater the spread of the data, and vice versa. While easy to compute,
the range is sensitive to outliers. It can provide a quick sense of the data spread but should
be complemented with other statistics.
Range = Largest data value - Smallest data value
import numpy as np
# Sample Data
arr = [1, 2, 3, 4, 5]
# Finding Max
Maximum = max(arr)
# Finding Min
Minimum = min(arr)
# Difference Of Max and Min
Range = Maximum-Minimum
print("Maximum = {}, Minimum = {} and Range = {}".format(
Maximum, Minimum, Range))
Output
Maximum = 5, Minimum = 1 and Range = 4
2. Variance: The average of the squared differences between each observation and the
mean. For a population,
Variance (σ²) = Σ(x − μ)² / N
where,
• x -> Observation under consideration
• N -> number of terms
• mu -> Mean
For a sample, the denominator N is replaced by n − 1, which is what Python's
statistics.variance computes:
import statistics
# sample data
arr = [1, 2, 3, 4, 5]
# sample variance (divides by n - 1)
print("Var = ", (statistics.variance(arr)))
Output
Var = 2.5
3. Standard deviation: Standard deviation is widely used to measure the extent of variation
or dispersion in data. It's especially important when assessing model performance (e.g.,
residuals) or comparing datasets with different means.
It is defined as the square root of the variance. It is calculated by finding the mean,
subtracting each observation from the mean, and squaring the results; then adding all those
squared values, dividing by the number of terms, and taking the square root:
Standard deviation (σ) = √( Σ(x − μ)² / N )
where,
• x = Observation under consideration
• N = number of terms
• mu = Mean
import statistics
# sample data
arr = [1, 2, 3, 4, 5]
# sample standard deviation (square root of the sample variance)
print("Std = ", (statistics.stdev(arr)))
Output
Std = 1.5811388300841898
Variability measures are important in residual analysis to check how well a model fits the
data.
3. Measures of Frequency Distribution
These describe how often each value (or range of values) occurs in a dataset, most simply
as frequency counts.
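Frequency counts are straightforward to compute with Python's standard library; the survey responses below are made-up illustrative data:

```python
from collections import Counter

# Frequency counts for a small categorical dataset
responses = ["yes", "no", "yes", "yes", "maybe", "no"]
freq = Counter(responses)

print("Frequency counts:", dict(freq))       # {'yes': 3, 'no': 2, 'maybe': 1}
print("Most common value:", freq.most_common(1)[0])  # ('yes', 3)
```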
Variable is a term derived from the word 'vary', meaning 'to change' or 'to differ'. Therefore,
variables are the characteristics that change or differ. A variable refers to the quantity or
characteristic whose value varies from one investigation to another. The difference in the
value of a variable may be with respect to a place, item, individual, or time, and each value
within this range is known as Variate.
Variables are of two kinds: Discrete Variables (Discontinuous Variables) and Continuous
Variables.
1. Discrete Variable
Discrete variables are variables that take only exact values, not fractional values. In
simple terms, variables that are expressed in whole numbers are known as discrete
variables. For example, the number of teachers in a school is a discrete variable, as
teachers cannot be expressed in fractions. The data of discrete variables is collected
through counting.
Example
The number of workers in different teams of a company is a discrete variable, since
workers are counted in whole numbers.
2. Continuous Variable
Continuous variables are the variables that can take every possible value whether it is integral
or fractional, in a specified given range. For example, the height of an individual can be in
decimals or in exact numbers within the specified range. The data of continuous variables is
collected through measurement.
Example
The height of students in a class is a continuous variable, since heights are measured and
can take any value, including fractions, within a range.
Percentile:
A percentile is a statistical measure that indicates the value below which a given percentage
of observations in a group of data falls. It helps understand how a particular value compares
to the rest of the data.
• In simple words, percentiles are a way to express the relative standing of a value
within a dataset, indicating what percentage of the data falls below that value.
• For example, if you scored in the 90th percentile on a standardized test, it means you
performed better than 90% of the people who took the test.
To find the percentile rank of a particular value, the following formula is used:
P = (n / N) × 100
Where,
• P is the percentile,
• n is the number of values below the given value,
• N is the total number of values.
The above formula is used to calculate the percentile for a particular value in the population.
If we have a percentile value and need to find n, i.e., how many data values in the
population fall below it, we can rewrite the above formula as
n = (P × N)/100
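The percentile formula can be sketched directly in Python (the dataset is an illustrative example):

```python
def percentile_rank(value, data):
    """P = (n / N) * 100, where n is the count of values below `value`."""
    n = sum(1 for x in data if x < value)
    return n / len(data) * 100

data = [2, 4, 6, 8, 10]
p = percentile_rank(8, data)   # 3 of the 5 values are below 8
print("Percentile rank of 8:", p)   # 60.0

# Rearranged form: n = (P * N) / 100
n = (p * len(data)) / 100
print("Number of values below it:", n)   # 3.0
```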
Quartiles:
Quartiles divide a data set into four equal parts, each containing 25% of the data. They help
to understand the spread and center of the data. As an important concept in statistics, quartiles
are used to analyze large data sets by highlighting values near the middle. This method is
particularly useful for identifying outliers and comparing different data sets.
• First Quartile (Q1): It is also known as the 25th percentile because it marks the point
where 25% of the data is below it.
• Second Quartile (Q2): It is also known as the 50th percentile (the median), as it divides
the data into two halves.
• Third Quartile (Q3): It is also known as the 75th percentile because it marks the point
where 75% of the data is below it.
Quartile Formula
As mentioned above, quartiles divide the data into 4 equal parts. For a sorted dataset of n
values, there is a separate formula for finding each quartile value:
Quartile 1 (Q1) = ((n + 1)/4)th term
Quartile 2 (Q2) = ((n + 1)/2)th term
Quartile 3 (Q3) = (3(n + 1)/4)th term
5 Number Summary:
The concept of a 5 number summary is a way to describe a distribution using 5 numbers. This
includes minimum number, quartile-1, median/quartile-2, quartile-3, and maximum number.
This concept of 5 number summary comes under the concept of Statistics which deals with
the collection of data, analyzing it, interpreting, and presenting the data in an organized
manner.
As told in the above paragraph, It gives a rough idea how the given dataset looks like by
representing minimum value, maximum value, median, quartile values, etc. To understand
better the 5 number summary concept look at the below pictorial representation of 5 number
summary
• Minimum Value: It is the smallest number in the given data, and the first number
when it is sorted in ascending order.
• Maximum Value: It is the largest number in the given data, and the last number when
it is sorted in ascending order.
• Median: Middle value between the minimum and maximum value. The formula to find
the median is:
Median = ((n + 1)/2)th term
• Quartile 1: Middle/center value between the minimum value and the median. For a small
dataset we can simply identify the middle value between the median and the minimum
value by inspection; for a big dataset with many numbers, it is better to use the formula,
Quartile 1 = ((n + 1)/4)th term
• Quartile 3: Middle/center value between the median and the maximum value, found with
the formula,
Quartile 3 = (3(n + 1)/4)th term
Question 1: What is the 5 number summary in the given data 10, 20, 5, 15, 25, 30, 8.
Solution:
Sorted data: 5, 8, 10, 15, 20, 25, 30, so n = 7.
Minimum value = 5
Maximum value = 30
Median = ((n + 1)/2)th term = (8/2)th term = 4th term = 15
For Quartile-1 the formula is ((n + 1)/4)th term, where n is the count of numbers in the
dataset.
= (8/4)th term
= 2nd term
The 2nd term is 8, so Quartile-1 = 8
In the same way, find Quartile-3 using the formula (3(n + 1)/4)th term.
= (3(8)/4)th term
= (24/4)th term
= 6th term
The 6th term is 25, so Quartile-3 = 25
Therefore, the 5 number summary is {5, 8, 15, 25, 30}.
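The worked example can be checked with Python's statistics module; method="exclusive" uses the same (n + 1) positioning as the quartile formulas in these notes:

```python
import statistics

data = sorted([10, 20, 5, 15, 25, 30, 8])  # -> [5, 8, 10, 15, 20, 25, 30]

# quantiles with n=4 returns the three quartiles Q1, Q2 (median), Q3
q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
summary = [min(data), q1, median, q3, max(data)]

print("5 number summary:", summary)  # [5, 8.0, 15.0, 25.0, 30]
```

Note that other conventions (e.g., NumPy's default linear interpolation) can give slightly different quartile values for the same data.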
What is Covariance?
Covariance is a measure of the direction of the linear relationship between two variables:
it indicates whether the variables tend to increase and decrease together.
Covariance Formula
For N paired observations (x, y) with means μx and μy, the population covariance is
Cov(X, Y) = Σ(x − μx)(y − μy) / N
(For a sample, divide by n − 1 instead of N.)
Types of Covariance
• Positive Covariance: When one variable increases, the other variable tends to
increase as well and vice versa.
• Negative Covariance: When one variable increases, the other variable tends to
decrease.
• Zero Covariance: There is no linear relationship between the two variables; they
move independently of each other.
Example
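With made-up data, covariance can be computed both by hand and with NumPy; note that np.cov divides by n − 1 (sample covariance) by default:

```python
import numpy as np

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]  # y rises with x -> positive covariance

# Manual sample covariance: sum((x - x̄)(y - ȳ)) / (n - 1)
x_bar, y_bar = np.mean(x), np.mean(y)
cov_manual = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)

# np.cov returns a 2x2 covariance matrix; the off-diagonal entry is Cov(X, Y)
cov_np = np.cov(x, y)[0][1]

print("Manual covariance:", cov_manual)  # 5.0
print("NumPy covariance: ", cov_np)      # 5.0
```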
What is Correlation?
Correlation is a standardized measure of the strength and direction of the linear relationship
between two variables. It is derived from covariance and ranges between -1 and 1. Unlike
covariance, which only indicates the direction of the relationship, correlation provides a
standardized measure.
• Positive Correlation (close to +1): As one variable increases, the other variable also
tends to increase.
• Negative Correlation (close to -1): As one variable increases, the other variable tends
to decrease.
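Correlation can likewise be computed with NumPy (illustrative data; np.corrcoef returns a 2x2 matrix of correlation coefficients):

```python
import numpy as np

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]   # perfectly linear in x
z = [10, 8, 6, 4, 2]   # perfectly inversely linear in x

r_xy = np.corrcoef(x, y)[0][1]
r_xz = np.corrcoef(x, z)[0][1]

print("Correlation of x and y:", r_xy)  # close to +1
print("Correlation of x and z:", r_xz)  # close to -1
```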
Covariance vs Correlation
• Covariance involves the relationship between two variables or data sets.
• Correlation can involve the relationship between multiple variables as well.
Probability:
Probability is a branch of mathematics that quantifies the likelihood of an event occurring. It
is a numerical value between 0 and 1, where 0 indicates that an event is impossible and 1
indicates that it is certain.
The most basic way to calculate the probability of a single event is to divide the number of
favorable outcomes by the total number of possible outcomes.
Multiplication Rule:
For independent events A and B, P(A and B) = P(A) × P(B). More generally,
P(A and B) = P(A) × P(B | A), where P(B | A) is the probability of B given that A has
occurred.
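Both rules can be sketched with exact fractions; a fair coin and a fair die are the illustrative experiments here:

```python
from fractions import Fraction

# Single event: P = favorable outcomes / total outcomes
p_roll_4 = Fraction(1, 6)          # rolling a 4 on a fair die
p_even = Fraction(3, 6)            # rolling an even number (2, 4, or 6)
print("P(roll a 4) =", p_roll_4)   # 1/6
print("P(even) =", p_even)         # 1/2

# Multiplication rule for independent events:
# P(head on coin AND 4 on die) = P(head) * P(4)
p_head = Fraction(1, 2)
p_head_and_4 = p_head * p_roll_4
print("P(head and 4) =", p_head_and_4)  # 1/12
```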
Probability Distribution - Function, Formula, Table
While a Frequency Distribution shows how often outcomes occur in a sample or dataset, a
probability distribution assigns probabilities to outcomes abstractly, theoretically, regardless
of any specific dataset. These probabilities represent the likelihood of each outcome
occurring.
Common types of probability distributions include:
• Bernoulli and Binomial distributions
• Uniform distribution
• Normal distribution
• Poisson distribution
Now the question comes, how to describe the behavior of a random variable?
Suppose that our Random Variable only takes finite values, like x1, x2, x3,..., and xn. i.e., the
range of X is the set of n values is {x1, x2, x3,..., and xn}.
The behavior of X is completely described by giving probabilities for all the values of the
random variable X.
Event    Probability
x1       P(X = x1)
x2       P(X = x2)
x3       P(X = x3)
...      ...
xn       P(X = xn)
The probability function of a discrete random variable X is the function p(x) satisfying
p(x) = P(X = x)
for every value x in the range of X, with p(x) ≥ 0 and Σ p(x) = 1.
Example: We draw two cards successively with replacement from a well-shuffled deck of 52
cards. Find the probability distribution of finding aces.
Answer:
Since the draws are with replacement, each draw is independent with P(ace) = 4/52 = 1/13.
Let X be the number of aces drawn, so X can be 0, 1, or 2.
P(X = 0) = (12/13)² = 144/169
P(X = 1) = 2 × (1/13) × (12/13) = 24/169
P(X = 2) = (1/13)² = 1/169
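Because the draws are with replacement, the number of aces follows a binomial distribution with n = 2 and p = 1/13; the distribution can be verified with exact fractions:

```python
from fractions import Fraction
from math import comb

p = Fraction(4, 52)   # probability of an ace on one draw = 1/13
n = 2                 # two draws with replacement

# Binomial probabilities P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
dist = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}
for k, prob in dist.items():
    print(f"P(X = {k}) = {prob}")

assert sum(dist.values()) == 1  # probabilities must sum to 1
```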
Types of Probability Distributions
We have seen what Probability Distributions are; now we will see different types of
Probability Distributions. The Probability Distribution's type is determined by the type of
random variable. There are two types of Probability Distributions:
Bernoulli Trials
Trials of a random experiment are known as Bernoulli trials if they satisfy the following
conditions:
• A finite number of trials is required.
• All trials must be independent (the outcome of any trial is independent of the
outcome of any other trial).
• Each trial has exactly two outcomes: success or failure.
• The probability of success (and of failure) remains the same in each trial.
Example: Can throwing a fair die 50 times be considered an example of 50 Bernoulli trials if
we define:
Answer: Yes, provided "success" is defined as a fixed outcome (for example, getting an
even number): the 50 throws are finite, each throw is independent of the others, each has
only the two outcomes success/failure, and the probability of success is the same on every
throw.
Binomial Distribution
A binomial distribution gives the probability of exactly k successes in n independent
Bernoulli trials, each with success probability p:
P(X = k) = C(n, k) × pᵏ × (1 − p)ⁿ⁻ᵏ, for k = 0, 1, ..., n
Uniform Distribution
In a uniform distribution, every outcome in the range is equally likely; for example, each
face of a fair die has probability 1/6.
Applications of Probability Distributions in Computer Science
Probability distributions are used widely in computer science:
• In machine learning, they help make predictions and deal with uncertainty.
• In natural language processing, they are used to model how often words appear.
• In computer vision, they help understand image data and remove noise.
• In networking, distributions like Poisson are used to study how data packets arrive.
• Cryptography uses random numbers based on probability.
• Software testing and reliability also use distributions to predict bugs and failures.
Overall, probability distributions help in building smarter, more reliable, and efficient
computer systems.
Solved Questions on Probability Distribution
Question 1: A box contains 4 blue balls and 3 green balls. Find the probability distribution of
the number of green balls in a random draw of 3 balls.
Solution:
Given that the total number of balls is 7, out of which 3 have to be drawn at random. On
drawing 3 balls the possibilities are: all 3 green, only 2 green, only 1 green, and no green.
Hence X = 0, 1, 2, 3. Since P(X = k) = C(3, k) × C(4, 3 − k) / C(7, 3) and C(7, 3) = 35:
X        0       1       2       3
P(X)   4/35   18/35   12/35    1/35
Question 2: From a lot of 10 bulbs containing 3 defective ones, 4 bulbs are drawn at random.
If X is a random variable that denotes the number of defective bulbs. Find the probability
distribution of X.
Solution:
Since X denotes the number of defective bulbs and there are at most 3 defective bulbs, X
can take the values 0, 1, 2, and 3. Since 4 bulbs are drawn at random, the number of
possible combinations of drawing 4 bulbs is 10C4 = 210, and
P(X = k) = C(3, k) × C(7, 4 − k) / C(10, 4):
X        0        1        2        3
P(X)   35/210  105/210   63/210   7/210
       = 1/6    = 1/2    = 3/10   = 1/30
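The probabilities in Question 2 can be checked with math.comb and exact fractions (this is a hypergeometric distribution, since bulbs are drawn without replacement):

```python
from fractions import Fraction
from math import comb

total, defective, drawn = 10, 3, 4
good = total - defective  # 7 working bulbs

# P(X = k) = C(3, k) * C(7, 4 - k) / C(10, 4)
dist = {
    k: Fraction(comb(defective, k) * comb(good, drawn - k), comb(total, drawn))
    for k in range(defective + 1)
}
for k, prob in dist.items():
    print(f"P(X = {k}) = {prob}")

assert sum(dist.values()) == 1  # sanity check: probabilities sum to 1
```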
What is Probability Mass Function (PMF)?
A probability function that gives the probability of a discrete random variable being exactly
equal to some value is called the probability mass function, abbreviated as PMF. Different
distributions have different formulas for the probability mass function.
For example, suppose a die is rolled and we want the probability of getting a number equal
to 4. The sample space for the experiment is {1, 2, 3, 4, 5, 6}; letting X be the random
variable for the outcome of the roll, the probability mass function evaluated at X = 4 is 1/6.
What is Probability Density Function(PDF)?
Probability Density Function is used for calculating the probabilities for continuous random
variables. When the cumulative distribution function (CDF) is differentiated we get the
probability density function (PDF). Both functions are used to represent the probability
distribution of a continuous random variable.
The probability density function is defined over a specific range. By differentiating the
CDF we get the PDF, and by integrating the probability density function we get the
cumulative distribution function.
Probability density function is the function that represents the density of probability for a
continuous random variable over the specified ranges.
Probability Density Function is abbreviated as PDF and for a continuous random variable X,
Probability Density Function is denoted by f(x).
Let X be a continuous random variable with probability density function f(x). For f(x) to
be a valid probability density function, it should satisfy the following conditions:
• f(x) ≥ 0 for all x
• ∫ f(x) dx = 1 over the entire range of X
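These conditions can be checked numerically for a simple case. The uniform distribution on [0, 2] has constant density f(x) = 0.5, and a midpoint Riemann sum approximates its integral (a sketch, not a rigorous proof):

```python
def f(x):
    """PDF of the uniform distribution on [0, 2]."""
    return 0.5 if 0 <= x <= 2 else 0.0

# Approximate the integral of f over [0, 2] with a midpoint Riemann sum
n = 1000
dx = 2 / n
total = sum(f((i + 0.5) * dx) * dx for i in range(n))

print("f(x) >= 0 everywhere sampled:", all(f(x) >= 0 for x in [-1, 0, 1, 2, 3]))
print("Integral of f over its range ≈", total)  # ≈ 1.0
```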
Hypothesis Testing:
In statistics and hypothesis testing, an error refers to a discrepancy between a value
obtained by observation or calculation and the actual or expected value.
Errors may arise from different factors, such as flawed implementation or faulty
assumptions. Errors can be of many types, such as
• Measurement Error
• Calculation Error
• Human Error
• Systematic Error
• Random Error
In hypothesis testing, two kinds of error are of particular concern: the Type I error and the
Type II error.
Type I Error - False Positive
A Type I error, also known as a false positive, occurs in statistical hypothesis testing when
a null hypothesis that is actually true is rejected. It's the error of incorrectly concluding
that there is a significant effect or difference when there isn't one in reality.
In hypothesis testing, there are two competing hypotheses:
• Null Hypothesis (H0): This hypothesis represents a default assumption that there is
no effect, no difference or no relationship in the population being studied.
• Alternative Hypothesis (H1): This hypothesis represents the opposite of the null
hypothesis. It suggests that there is a significant effect, difference or relationship in
the population.
A Type I error occurs when the null hypothesis is rejected based on the sample data, even
though it is actually true in the population.
Type II Error - False Negative
Type II error, also known as a false negative, occurs in statistical hypothesis testing when a
null hypothesis that is actually false is not rejected. In other words, it's the error of failing to
detect a significant effect or difference when one exists in reality.
A Type II error occurs when the null hypothesis is not rejected based on the sample data, even
though it is actually false in the population. It's a failure to recognize a real effect or
difference.
Suppose a medical researcher is testing a new drug to see if it's effective in treating a certain
condition. The null hypothesis (H0) states that the drug has no effect, while the alternative
hypothesis (H1) suggests that the drug is effective.
If the researcher conducts a statistical test and fails to reject the null hypothesis (H0),
concluding that the drug is not effective, when in fact it does have an effect, this would be a
Type II error.
Error Type | Description | Also Known As | When It Occurs
Type I | Rejecting a true null hypothesis | False positive | When the test reports an effect that does not exist
Type II | Failing to reject a false null hypothesis | False negative | When the test misses an effect that does exist
• Legal System: In a criminal trial, the null hypothesis (H0) is that the defendant is
innocent, while the alternative hypothesis (H1) is that the defendant is guilty. A Type I
error occurs if the jury convicts the defendant (rejects the null hypothesis) when they
are actually innocent.
• Legal System: In a criminal trial, a Type II error occurs if the jury acquits the
defendant (fails to reject the null hypothesis) when they are actually guilty.
To minimize Type I and Type II errors in hypothesis testing, there are several strategies that
can be employed based on the information from the sources provided:
• By setting a lower significance level, the chances of incorrectly rejecting the null
hypothesis decrease, thus minimizing Type I errors.
• The probability of a Type II error (failing to reject a false null hypothesis) can be
minimized by increasing the sample size or choosing a "threshold" alternative value
of the parameter further from the null value.
• Increasing the sample size reduces the variability of the statistic, making it less likely
to fall in the non-rejection region when it should be rejected, thus minimizing Type II
errors.
Factors Affecting Type I and Type II Errors
• Sample Size: In statistical hypothesis testing, larger sample sizes generally reduce the
probability of both Type I and Type II errors. With larger samples, the estimates tend
to be more precise, resulting in more accurate conclusions.
• Significance Level: The significance level (α) in hypothesis testing determines the
probability of committing a Type I error. Choosing a lower significance level reduces
the risk of Type I error but increases the risk of Type II error, and vice versa.
• Effect Size: The magnitude of the effect or difference being tested influences the
probability of Type II error. Smaller effect sizes are more challenging to detect,
increasing the likelihood of failing to reject the null hypothesis when it's false.
• Statistical Power: The power of a test (1 − β) is the probability of correctly rejecting a
false null hypothesis, i.e., the complement of the chance of committing a Type II
error. As the power of the test rises, the chance of a Type II error drops.
The p-value, or probability value, is a statistical measure used in hypothesis testing to assess
the strength of evidence against a null hypothesis. It represents the probability of obtaining
results as extreme as, or more extreme than, the observed results under the assumption that
the null hypothesis is true.
In simpler words, it is used to reject or support the null hypothesis during hypothesis testing.
In data science, it gives valuable insights on the statistical significance of an independent
variable in predicting the dependent variable.
1. Formulate the Null Hypothesis (H0): Clearly state the null hypothesis, which
typically states that there is no significant relationship or effect between the variables.
2. Formulate the Alternative Hypothesis (H1): State the hypothesis that there is a
significant relationship or effect.
3. Choose the significance level (α) and compute the test statistic from the sample data.
4. Identify the Distribution of the Test Statistic: Determine the appropriate sampling
distribution for the test statistic under the null hypothesis. This distribution represents
the expected values of the test statistic if the null hypothesis is true.
5. Calculate the p-value: Based on the observed test statistic and the sampling
distribution, find the probability of obtaining the observed test statistic or a more
extreme one, assuming the null hypothesis is true.
6. Interpret the results: Compare the p-value with the significance level α. If the p-value
is less than or equal to α, there is evidence to reject the null hypothesis; otherwise,
fail to reject it.
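As an illustration of these steps, consider a made-up coin example computed exactly with the binomial distribution (rather than a t-test): we flip a coin 100 times and observe 60 heads. H0 is that the coin is fair (p = 0.5); the one-sided p-value is the probability of seeing 60 or more heads under H0.

```python
from math import comb

n, observed = 100, 60

# Under H0 (fair coin), P(X = k) = C(n, k) / 2^n
p_value = sum(comb(n, k) for k in range(observed, n + 1)) / 2**n
print("One-sided p-value:", p_value)  # roughly 0.028

alpha = 0.05
if p_value <= alpha:
    print("Reject H0: the coin is unlikely to be fair.")
else:
    print("Fail to reject H0.")
```

Since the p-value is below α = 0.05, we would reject the null hypothesis in this example.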
What is EDA?
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets
and summarize their main characteristics, often employing data visualization methods.
EDA helps determine how best to manipulate data sources to get the answers you need,
making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or
check assumptions.
Types of EDA
There are four primary types of EDA:
• Univariate non-graphical
• Univariate graphical
• Multivariate non-graphical
• Multivariate graphical
Univariate non-graphical
This is the simplest form of data analysis, where the data being analyzed consists of just one
variable. Since it’s a single variable, it doesn’t deal with causes or relationships. The main
purpose of univariate analysis is to describe the data and find patterns that exist within it.
Univariate graphical
Non-graphical methods don’t provide a full picture of the data. Graphical methods are
therefore required. Common types of univariate graphics include:
• Stem-and-leaf plots, which show all data values and the shape of the distribution.
• Histograms, a bar plot in which each bar represents the frequency (count) or
proportion (count/total count) of cases for a range of values.
• Box plots, which graphically depict the five-number summary of minimum, first
quartile, median, third quartile, and maximum.
Multivariate non-graphical
Multivariate data arises from more than one variable. Multivariate non-graphical EDA
techniques generally show the relationship between two or more variables of the data through
cross-tabulation or statistics.
Multivariate graphical
Multivariate graphical EDA uses graphics to display relationships between two or more sets
of data. The most commonly used graphic is a grouped bar plot or bar chart, with each
group representing one level of one of the variables and each bar within a group
representing the levels of the other variable. Other common multivariate graphics include:
• Scatter plot, which is used to plot data points on a horizontal and a vertical axis to
show how much one variable is affected by another.
• Bubble chart, which is a data visualization that displays multiple circles (bubbles) in a
two-dimensional plot.
• Heat map, which is a graphical representation of data where values are depicted by
color.
Importance of EDA
1. Helps to understand the dataset by showing how many features it has, what type of
data each feature contains and how the data is distributed.
2. Helps to identify hidden patterns and relationships between different data points,
which help us in feature selection and model building.
3. Allows to identify errors or unusual data points (outliers) that could affect our results.
4. The insights gained from EDA help us to identify most important features for building
models and guide us on how to prepare them for better performance.
Data visualization is the graphical representation of information and data using visual
elements like charts, graphs, and maps. It is the process of translating complex datasets and
statistics into accessible visuals that reveal trends, patterns, and outliers.
The main goal is to help people understand and analyze data more easily, leading to faster
and more informed decision-making. This is especially critical in the age of Big Data, where
companies deal with massive volumes of information.
Comparison
These graphs are used to compare values across different categories or to show changes over
time.
• Bar Chart: Displays data with rectangular bars whose lengths are proportional to the
values they represent. It is one of the most common and simple chart types for
comparing categories.
• Line Chart: Connects a series of data points with a continuous line to show trends
over time, such as stock prices or website traffic.
• Column Chart: Similar to a bar chart but with vertical bars. It is often used for side-
by-side comparisons of different values.
Composition
• Pie Chart: Divides a circle into slices, with each slice representing a percentage or
proportion of the total. It is most effective when comparing a few categories.
• Donut Chart: A variant of a pie chart with a hollow center, often used to display the
total value in the center.
• Treemap: Displays hierarchical data as a set of nested rectangles, where the size of
each rectangle is proportional to its value.
• Waterfall Chart: Illustrates how an initial value is affected by a series of positive and
negative changes.
Distribution
These visualizations show the frequency or spread of data.
• Histogram: A type of bar chart that shows the frequency distribution of continuous
numerical data. The bars represent ranges (bins), and their height shows the number
of values in that range.
• Box Plot: Displays the distribution of numerical data by showing the median,
quartiles, and potential outliers. It is useful for comparing distributions across
different groups.
• Violin Plot: Combines a box plot with a density plot to show the distribution and
summary statistics of data.
Relationships
These graphs highlight correlations and connections between two or more variables.
• Scatter Plot: Uses dots to display the relationship between two variables. The pattern
of the dots can reveal correlations, clusters, or outliers.
• Bubble Chart: An extension of the scatter plot where a third variable is represented
by the size of the bubbles.
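A minimal sketch of a scatter plot next to a bubble chart with matplotlib (the data values are made up; the Agg backend renders off-screen without a display):

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering, no display needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 7]
sizes = [30, 80, 150, 80, 300]  # a third variable, encoded as bubble size

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.scatter(x, y)  # plain scatter plot: relationship between two variables
ax1.set_title("Scatter plot")

ax2.scatter(x, y, s=sizes, alpha=0.5)  # bubble chart: size adds a third variable
ax2.set_title("Bubble chart")

fig.savefig("scatter_vs_bubble.png")
```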