Fundamentals of Data Science
2024-2025
SYLLABUS
UNIT -I
Need for data science –benefits and uses –facets of data – data science process –setting the
research goal – retrieving data –cleansing, integrating and transforming data –exploratory data
analysis –build the models – presenting and building applications.
UNIT-II
Frequency distributions – Outliers –relative frequency distributions –cumulative frequency
distributions – frequency distributions for nominal data –interpreting distributions –graphs –
averages –mode –median –mean
UNIT-III
Normal distributions –z scores –normal curve problems – finding proportions –finding scores –
more about z scores –correlation –scatter plots –correlation coefficient for quantitative data –
computational formula for correlation coefficient-averages for qualitative and ranked data.
UNIT-IV
Basics of Numpy arrays, aggregations, computations on arrays, comparisons, structured arrays,
Data manipulation, data indexing and selection, operating on data, missing data, hierarchical
indexing, combining datasets –aggregation and grouping, pivot tables
UNIT-V
Visualization with matplotlib, line plots, scatter plots, visualizing errors, density and contour
plots, histograms, binnings, and density, three dimensional plotting, geographic data
Text Books:
1. David Cielen, Arno D. B. Meysman, and Mohamed Ali, "Introducing Data Science", Manning Publications, 2016.
2. Robert S. Witte and John S. Witte, "Statistics", Eleventh Edition, Wiley Publications, 2017.
3. Jake VanderPlas, "Python Data Science Handbook", O'Reilly, 2016.
References :
1. Allen B. Downey, "Think Stats: Exploratory Data Analysis in Python", Green Tea Press, 2014.
Web Resources
● [Link]
● [Link]
● [Link]
Mapping with Programme Outcomes:
CO1 3 3 3 3 3 2
CO2 3 3 3 2 2 3
CO3 2 2 2 3 3 3
CO4 3 3 3 3 3 2
CO5 3 3 3 3 3 1
Weightage of course contributed to each PSO: 14 14 14 14 14 11
UNIT -I
What is Data Science?
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and
systems to extract knowledge and insights from structured and unstructured data. In simpler terms,
data science is about obtaining, processing, and analyzing data to gain insights for many purposes.
Data Science is a combination of multiple disciplines that uses statistics, data analysis, and
machine learning to analyze data and to extract knowledge and insights from it.
Despite every data science project being unique—depending on the problem, the industry it's
applied in, and the data involved—most projects follow a similar lifecycle.
This lifecycle provides a structured approach for handling complex data, drawing accurate
conclusions, and making data-driven decisions.
Here are the five main phases that structure the data science lifecycle:
Data collection and storage
This initial phase involves collecting data from various sources, such as databases, Excel files, text
files, APIs, web scraping, or even real-time data streams. The type and volume of data collected
largely depend on the problem you’re addressing.
Once collected, this data is stored in an appropriate format ready for further processing. Storing
the data securely and efficiently is important to allow quick retrieval and processing.
Data preparation
Often considered the most time-consuming phase, data preparation involves cleaning and
transforming raw data into a suitable format for analysis. This phase includes handling missing or
inconsistent data, removing duplicates, normalization, and data type conversions. The objective is
to create a clean, high-quality dataset that can yield accurate and reliable analytical results.
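The short pandas sketch below illustrates these preparation steps on a hypothetical file; the file name (sales.csv) and the column names (price, city, order_date) are assumptions used only for illustration.

import pandas as pd

# Hypothetical raw dataset containing missing values, duplicates and mixed types
df = pd.read_csv("sales.csv")

# Handle missing or inconsistent data
df["price"] = pd.to_numeric(df["price"], errors="coerce")   # non-numeric entries become NaN
df["price"] = df["price"].fillna(df["price"].median())      # impute missing prices with the median
df["city"] = df["city"].str.strip().str.title()             # fix inconsistent text formatting

# Remove duplicate rows
df = df.drop_duplicates()

# Data type conversion
df["order_date"] = pd.to_datetime(df["order_date"])

# Normalization (min-max scaling of a numeric column)
df["price_scaled"] = (df["price"] - df["price"].min()) / (df["price"].max() - df["price"].min())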
Data exploration and visualization
During this phase, data scientists explore the prepared data to understand its patterns,
characteristics, and potential anomalies. Techniques like statistical analysis and data visualization
summarize the data's main characteristics, often with visual methods.
Visualization tools, such as charts and graphs, make the data more understandable, enabling
stakeholders to comprehend the data trends and patterns better.
Modeling and prediction
Data scientists use machine learning algorithms and statistical models to identify patterns, make
predictions, or discover insights in this phase. The goal here is to derive something significant
from the data that aligns with the project's objectives, whether predicting future outcomes,
classifying data, or uncovering hidden patterns.
Communicating results
The final phase involves interpreting and communicating the results derived from the data analysis.
It's not enough to have insights; you must communicate them effectively, using clear, concise
language and compelling visuals. The goal is to convey these findings to non-technical
stakeholders in a way that influences decision-making or drives strategic initiatives.
Understanding and implementing this lifecycle allows for a more systematic and successful
approach to data science projects. Let's now delve into why data science is so important.
Data science has emerged as a revolutionary field that is crucial in generating insights from data
and transforming businesses. It's not an overstatement to say that data science is the backbone of
modern industries. But why has it gained so much significance?
● Data volume. Firstly, the rise of digital technologies has led to an explosion of data. Every
online transaction, social media interaction, and digital process generates data. However,
this data is valuable only if we can extract meaningful insights from it. And that's precisely
where data science comes in.
● Value-creation. Secondly, data science is not just about analyzing data; it's about
interpreting and using this data to make informed business decisions, predict future trends,
understand customer behavior, and drive operational efficiency. This ability to drive
decision-making based on data is what makes data science so essential to organizations.
● Career options. Lastly, the field of data science offers lucrative career opportunities. With
the increasing demand for professionals who can work with data, jobs in data science are
among the highest paying in the industry. As per Glassdoor, the average salary for a data
scientist in the United States is $137,984, making it a rewarding career choice.
Data science is used for an array of applications, from predicting customer behavior to optimizing
business processes. The scope of data science is vast and encompasses various types of analytics.
● Descriptive analytics. Analyzes past data to understand the current state and identify trends.
For instance, a retail store might use it to analyze last quarter's sales or
identify best-selling products.
● Diagnostic analytics. Explores data to understand why certain events occurred, identifying
patterns and anomalies. If a company's sales fall, it would identify whether poor product
quality, increased competition, or other factors caused it.
● Predictive analytics. Uses statistical models to forecast future outcomes based on past data,
used widely in finance, healthcare, and marketing. A credit card company may employ it
to predict customer default risks.
● Prescriptive analytics. Suggests actions based on results from other types of analytics to
mitigate future problems or leverage promising trends. For example, a navigation app
advising the fastest route based on current traffic conditions.
Data science can add value to any business which uses its data effectively. From statistics to
predictions, effective data-driven practices can put a company on the fast track to success. Here
are some ways in which data science is used:
Data Science can significantly improve a company's operations in various departments, from
logistics and supply chain to human resources and beyond. It can help in resource allocation,
performance evaluation, and process automation. For example, a logistics company can use data
science to optimize routes, reduce delivery times, save fuel costs, and improve customer
satisfaction.
Data Science can uncover hidden patterns and insights that might not be evident at first glance.
These insights can provide companies with a competitive edge and help them understand their
business better. For instance, a company can use customer data to identify trends and preferences,
enabling them to tailor their products or services accordingly.
Companies can use data science to innovate and create new products or services based on customer
needs and preferences. It also allows businesses to predict market trends and stay ahead of the
competition. For example, streaming services like Netflix use data science to understand viewer
preferences and create personalized recommendations, enhancing user experience.
The implications of data science span across all industries, fundamentally changing how
organizations operate and make decisions. While every industry stands to gain from implementing
data science, it's especially influential in data-rich sectors.
Let's delve deeper into how data science is revolutionizing these key industries:
The finance sector has been quick to harness the power of data science. From fraud detection and
algorithmic trading to portfolio management and risk assessment, data science has made complex
financial operations more efficient and precise. For instance, credit card companies utilize data
science techniques to detect and prevent fraudulent transactions, saving billions of dollars
annually.
Healthcare is another industry where data science has a profound impact. Applications range from
predicting disease outbreaks and improving patient care quality to enhancing hospital management
and drug discovery. Predictive models help doctors diagnose diseases early, and treatment plans
can be customized according to the patient's specific needs, leading to improved patient outcomes.
Marketing is a field that has been significantly transformed by the advent of data science. The
applications in this industry are diverse, ranging from customer segmentation and targeted
advertising to sales forecasting and sentiment analysis. Data science allows marketers to
understand consumer behavior in unprecedented detail, enabling them to create more effective
campaigns. Predictive analytics can also help businesses identify potential market trends, giving
them a competitive edge. Personalization algorithms can tailor product recommendations to
individual customers, thereby increasing sales and customer satisfaction.
Technology companies are perhaps the most significant beneficiaries of data science. From
powering recommendation engines to enhancing image and speech recognition, data science finds
applications in diverse areas. Ride-hailing platforms, for example, rely on data science for
connecting drivers with ride hailers and optimizing the supply of drivers depending on the time of
day.
While data science overlaps with many fields that also work with data, it carries a unique blend of
principles, tools, and techniques designed to extract insightful patterns from data.
Distinguishing between data science and these related fields can give a better understanding of the
landscape and help in setting the right career path. Let's demystify these differences.
Data science and data analytics both serve crucial roles in extracting value from data, but their
focuses differ. Data science is an overarching field that uses methods, including machine learning
and predictive analytics, to draw insights from data. In contrast, data analytics concentrates on
processing and performing statistical analysis on existing datasets to answer specific questions.
While business analytics also deals with data analysis, it is more centered on leveraging data for
strategic business decisions. It is generally less technical and more business-focused than data
science. Data science, though it can inform business strategies, often dives deeper into the technical
aspects, like programming and machine learning.
Data engineering focuses on building and maintaining the infrastructure for data collection,
storage, and processing, ensuring data is clean and accessible. Data science, on the other hand,
analyzes this data, using statistical and machine learning models to extract valuable insights that
influence business decisions. In essence, data engineers create the data 'roads', while data scientists
'drive' on them to derive meaningful insights. Both roles are vital in a data-driven organization.
Statistics, a mathematical discipline dealing with data collection, analysis, interpretation, and
organization, is a key component of data science. However, data science integrates statistics with
other methods to extract insights from data, making it a more multidisciplinary field.
Data Science: driving value with data across the four levels of analytics (key skills: programming, machine learning, statistics).
Data Engineering: building and maintaining data infrastructure (key skills: data collection, storage, and processing).
Having understood these distinctions, we can now delve into the key concepts every data scientist
needs to master.
Key Data Science Concepts
A successful data scientist doesn't just need technical skills but also an understanding of core
concepts that form the foundation of the field. Here are some key concepts to grasp:
Statistics and probability
These are the bedrock of data science. Statistics is used to derive meaningful insights from data,
while probability allows us to make predictions about future events based on available data.
Understanding distributions, statistical tests, and probability theories is essential for any data
scientist.
The Data Science Process
➢ The first step of this process is setting a research goal. The main purpose here is making sure
all the stakeholders understand the what, how, and why of the project. In every serious project this
will result in a project charter.
➢ The second phase is data retrieval. You want to have data available for analysis, so this step
includes finding suitable data and getting access to the data from the data owner. The result is data
in its raw form, which probably needs polishing and transformation before it becomes usable.
➢ Now that you have the raw data, it’s time to prepare it. This includes transforming the data from
a raw form into data that’s directly usable in your models. To achieve this, you’ll detect and correct
different kinds of errors in the data, combine data from different data sources, and transform it. If
you have successfully completed this step, you can progress to data visualization and modeling.
➢ The fourth step is data exploration. The goal of this step is to gain a deep understanding of the
data. You’ll look for patterns, correlations, and deviations based on visual and descriptive
techniques. The insights you gain from this phase will enable you to start modeling.
➢ Finally, we get to model building (often referred to as “data modeling” throughout this book).
It is now that you attempt to gain the insights or make the predictions stated in your project charter.
Now is the time to bring out the heavy guns, but remember research has taught us that often (but
not always) a combination of simple models tends to outperform one complicated model. If you’ve
done this phase right, you’re almost done.
➢ The last step of the data science model is presenting your results and automating the analysis,
if needed. One goal of a project is to change a process and/or make better decisions. You may still
need to convince the business that your findings will indeed change the business process as
expected. This is where you can shine in your influencer role. The importance of this step is more
apparent in projects on a strategic and tactical level. Certain projects require you to perform the
business process over and over again, so automating the project will save time.
DEFINING RESEARCH GOALS
A project starts by understanding the what, the why, and the how of your project. The outcome
should be a clear research goal, a good understanding of the context, well-defined deliverables,
and a plan of action with a timetable. This information is then best placed in a project charter.
Spend time understanding the goals and context of your research:
➢ An essential outcome is the research goal that states the purpose of your assignment in a clear
and focused manner.
➢ Understanding the business goals and context is critical for project success.
➢ Continue asking questions and devising examples until you grasp the exact business
expectations, identify how your project fits in the bigger picture, appreciate how your research is
going to change the business, and understand how they’ll use your results.
Create a project charter
A project charter requires teamwork, and your input covers at least the following:
➢ A clear research goal
➢ The project mission and context
➢ How you’re going to perform your analysis
➢ What resources you expect to use
➢ Proof that it’s an achievable project, or proof of concepts
➢ Deliverables and a measure of success
➢ A timeline
RETRIEVING DATA
➢ The next step in data science is to retrieve the required data. Sometimes you need to go into
the field and design a data collection process yourself, but most of the time you won’t be involved
in this step.
➢ Many companies will have already collected and stored the data for you, and what they don’t
have can often be bought from third parties.
➢ More and more organizations are making even high-quality data freely available for public and
commercial use.
➢ Data can be stored in many forms, ranging from simple text files to tables in a database. The objective now is acquiring all the data you need.
Start with data stored within the company (internal data)
➢ Most companies have a program for maintaining key data; so much of the cleaning work may
already be done. This data can be stored in official data repositories such as databases, data marts,
data warehouses, and data lakes maintained by a team of IT professionals.
➢ Data warehouses and data marts are home to pre-processed data, data lakes contain data in its
natural or raw format.
➢ Finding data even within your own company can sometimes be a challenge. As companies
grow, their data becomes scattered around many places. The data may be dispersed as people
change positions and leave the company.
➢ Getting access to data is another difficult task. Organizations understand the value and
sensitivity of data and often have policies in place so everyone has access to what they need and
nothing more.
➢ These policies translate into physical and digital barriers called Chinese walls. These “walls” are mandatory and well-regulated for customer data in most countries.
External data
➢ If data isn’t available inside your organization, look outside your organization. Companies
provide data so that you, in turn, can enrich their services and ecosystem. Such is the case with
Twitter, LinkedIn, and Facebook.
➢ More and more governments and organizations share their data for free with the world.
➢ A list of open data providers that should get you started.
Data cleaning
Data cleaning, also known as data cleansing or data scrubbing, is a crucial step in the data science
pipeline. It involves identifying and correcting errors, inconsistencies, and inaccuracies in datasets
to improve their quality and reliability. This process ensures that the data used for analysis and
modeling is accurate, complete, and suitable for its intended purpose.
In this section, we’ll explore the importance of data cleaning, common issues that data scientists
encounter, and various techniques and best practices for effective data cleaning.
The Importance of Data Cleaning
Data cleaning plays a vital role in the data science process for several reasons:
Data Quality: Clean data leads to more accurate analyses and reliable insights. Poor data quality
can result in flawed conclusions and misguided decisions.
Model Performance: Machine learning models trained on clean data tend to perform better and
generalize more effectively to new, unseen data.
Efficiency: Clean data reduces the time and resources spent on troubleshooting and fixing issues
during later stages of analysis or model development.
Consistency: Data cleaning helps ensure consistency across different data sources and formats,
making it easier to integrate and analyze data from multiple origins.
Compliance: In many industries, clean and accurate data is essential for regulatory compliance
and reporting purposes.
Exploratory data analysis (EDA) is one of the basic and essential steps of a data science project. A data
scientist spends almost 70% of the project's work on the EDA of the dataset. In this section, we will
discuss what Exploratory Data Analysis (EDA) is and the steps to perform it.
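As a minimal sketch of what these first EDA steps look like in practice, the snippet below builds a small made-up DataFrame and inspects it; the column names and values are purely illustrative.

import pandas as pd

# Small hypothetical dataset for illustration
df = pd.DataFrame({
    "age": [23, 35, 31, 40, 29],
    "income": [30000, 52000, 46000, 61000, 39000],
    "city": ["Chennai", "Mumbai", "Delhi", "Chennai", "Pune"],
})

print(df.describe())                        # summary statistics for numeric columns
print(df.isna().sum())                      # missing values per column
print(df["city"].value_counts())            # frequency of each category
print(df.select_dtypes("number").corr())    # correlations between numeric columns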
5 Mark Questions:
1. Explain the role of data cleansing and why it is essential in the data science process.
2. Describe the key stages of data science, from setting the research goal to presenting the
results.
3. Discuss the process of data integration and transformation in data science and its
significance.
4. How does exploratory data analysis (EDA) help in uncovering patterns in data? Provide
examples.
5. What are the primary benefits of applying data science in industries like healthcare,
finance, and marketing?
10 Mark Questions:
1. Explain the complete data science process with an emphasis on the importance of setting
the research goal, retrieving data, and the stages of model building.
2. Discuss the various techniques of data transformation and integration. How do these
processes contribute to the overall quality of the data?
3. In the context of data science, what is the significance of exploratory data analysis
(EDA)? Explain with examples how EDA helps in understanding and cleaning data.
4. Explain how machine learning models are built and validated in the data science process.
Include details about data preparation, model selection, and evaluation.
5. How do data science applications benefit decision-making in modern businesses?
Illustrate your answer with examples from real-world industries.
UNIT-II
Frequency Distribution is a tool in statistics that helps us organize the data and also helps us
reach meaningful conclusions. It tells us how often any specific values occur in the dataset. A
frequency distribution in a tabular form organizes data by showing the frequencies (the number of
times values occur) within a dataset.
A frequency distribution represents the pattern of how frequently each value of a variable appears
in a dataset. It shows the number of occurrences for each possible value within the dataset.
Let’s learn about Frequency Distribution including its definition, graphs, solved examples, and
frequency distribution table in detail.
What is Outlier?
Outliers, in the context of data analysis, are data points that deviate significantly from the other
observations in a dataset. These anomalies can show up as unusually high or low values, disrupting
the distribution of the data. For instance, in a dataset of monthly sales figures, if the sales for one
month are significantly higher than the sales for all the other months, that high sales figure would
be considered an outlier.
Why Removing Outliers is Necessary?
● Impact on Analysis: Outliers can have a disproportionate influence on statistical measures
like the mean, skewing the overall results and leading to misguided conclusions. Removing
outliers can help ensure the analysis is based on a more representative sample of the data.
● Statistical Significance: Outliers can affect the validity and reliability of statistical
inferences drawn from the data. Removing outliers, when appropriate, can help maintain
the statistical significance of the analysis.
Identifying and correctly dealing with outliers is critical in data analysis to ensure the integrity
and accuracy of the results.
Types of Outliers
Outliers manifest in different forms, each presenting unique challenges:
● Univariate Outliers: These occur when a data point in a single variable deviates
substantially from the rest of the dataset. For example, if you are studying the heights of
adults in a certain region and most fall in the range of 5 feet 5 inches to 6 feet, a person
who measures 7 feet tall would be considered a univariate outlier.
● Multivariate Outliers: In contrast to univariate outliers, multivariate outliers involve
observations that are outliers in multiple variables simultaneously, highlighting complex
relationships in the data. Continuing with our example, consider evaluating height and
weight together: an individual who is especially tall and relatively heavy compared to the
rest of the population would be considered a multivariate outlier, as their characteristics in
both height and weight deviate from the norm at the same time.
● Point Outliers: These are individual points that lie far away from the rest of the data. For
instance, in a dataset of average household energy usage, a value that is exceptionally high
or low compared to the rest is a point outlier.
● Contextual Outliers: Sometimes known as conditional outliers, these are data points that
deviate from the norm only in a specific context or condition. For instance, a very low
temperature might be normal in winter but unusual in summer.
● Collective Outliers: These consist of a set of data points that might not be extreme by
themselves but are unusual as a whole. This type of outlier often indicates a change in data
behavior or an emergent phenomenon.
Main Causes of Outliers
Outliers can arise from various sources, making their detection vital:
● Data Entry Errors: Simple human errors in entering data can create extreme values.
● Measurement Error: Faulty equipment or problems with the experimental setup can cause
abnormally high or low readings.
● Experimental Errors: Flaws in experimental design might produce data points that do
not represent what they are supposed to measure.
● Intentional Outliers: In some cases, data might be manipulated deliberately to produce
outlier effects, often seen in fraud cases.
● Data Processing Errors: During the collection and processing stages, technical glitches
can introduce erroneous data.
● Natural Variation: Inherent variability in the underlying data can also lead to outliers.
How Can Outliers Be Identified?
Identifying outliers is a vital step in data analysis, helping to discover anomalies, errors, or valuable
insights within datasets. One common approach for identifying outliers is through visualizations,
where data is graphically represented to highlight any points that deviate appreciably from the
overall pattern. Techniques like box plots and scatter plots offer intuitive visual cues for recognizing
outliers based on their position relative to the rest of the data.
Another method involves the use of statistical measures, such as the Z-score, the DBSCAN
algorithm, or the isolation forest algorithm, which quantitatively assess the deviation of data points
from the mean or identify outliers based on their density within the data space.
By combining visual inspection with statistical analysis, analysts can efficiently identify outliers
and gain deeper insights into the underlying characteristics of the data.
1. Outlier Identification Using Visualizations
Visualizations offer insight into data distributions and anomalies. Visual tools such as scatter plots
and box plots can effectively highlight data points that deviate notably from the majority. In a
scatter plot, outliers often appear as data points lying far from the primary cluster or displaying
unusual patterns compared to the rest. Box plots offer a clear depiction of the data's central tendency
and spread, with outliers represented as individual points beyond the whiskers.
1.1 Identifying outliers with box plots
Box plots are valuable tools in data analysis for visually summarizing the distribution of a dataset.
They are useful in outlier identification because they offer a concise illustration of key statistical
measures such as the median, quartiles, and range. A box plot includes a rectangular "box" that
spans the interquartile range (IQR), with a line indicating the median. "Whiskers" extend from the
box to the minimum and maximum values within a specified range, often set at 1.5 times the IQR.
Any data points beyond those whiskers are considered potential outliers. These outliers, represented
as individual points, can provide essential insights into the dataset's variability and possible
anomalies. Thus, box plots serve as a visual aid in outlier detection, allowing analysts to pick out
data points that deviate notably from the general pattern and warrant further investigation.
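The following is a minimal matplotlib sketch of this idea on a small made-up sample of household energy usage; the single large value appears beyond the whiskers as a potential outlier.

import matplotlib.pyplot as plt

# Hypothetical data: most values are close together, 250 is a potential outlier
energy_usage = [210, 215, 212, 220, 218, 216, 214, 250]

plt.boxplot(energy_usage)                   # points beyond the whiskers (1.5 * IQR) appear as outliers
plt.title("Box plot of household energy usage")
plt.ylabel("Units consumed")
plt.show()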
1.2 Identifying outliers with Scatter Plots
Scatter plots serve as vital tools in identifying outliers within datasets, mainly when exploring
relationships between two continuous variables. These visualizations plot individual data points as
dots on a graph, with one variable represented on each axis. Outliers in scatter plots often show up
as points that deviate significantly from the overall pattern or trend observed among the majority
of data points.
They might appear as isolated dots, lying far from the main cluster, or exhibiting unusual patterns
compared to the bulk of the data. By visually inspecting scatter plots, analysts can quickly pinpoint
potential outliers, prompting further investigation into their nature and potential impact on the
analysis. This preliminary identification lays the groundwork for deeper exploration and
understanding of the data's behavior and distribution.
2. Outlier Identification using Statistical Methods
2.1 Identifying outliers with Z-Score
The Z-score, a widely-used statistical approach, quantifies how many standard deviations a data
point is from the mean of the dataset. In outlier detection using the Z-score, data points with
Z-scores beyond a certain threshold (usually set at ±3) are considered outliers. A high positive or
negative Z-score suggests that the data point is unusually far from the mean, signaling its potential
outlier status. By calculating the Z-score for each data point, analysts can systematically discover
outliers based on their deviation from the mean, providing a robust quantitative method for outlier
detection.
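A minimal NumPy sketch of Z-score based detection is shown below, using a made-up sample and the ±3 threshold mentioned above.

import numpy as np

# Hypothetical sample: the last value is far from the rest
data = np.array([12, 14, 15, 13, 14, 15, 16, 14, 13, 15,
                 14, 13, 16, 15, 14, 13, 14, 15, 16, 14, 45])

z_scores = (data - data.mean()) / data.std()   # standard deviations away from the mean

threshold = 3
outliers = data[np.abs(z_scores) > threshold]
print(outliers)                                # [45]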
Frequency Distribution
What is Frequency Distribution in Statistics?
A frequency distribution is an overview of all values of some variable and the number of times
they occur. It tells us how frequencies are distributed over the values. That is how many values lie
between different intervals. They give us an idea about the range where most values fall and the
ranges where values are scarce.
Frequency Distribution Graphs
To represent the Frequency Distribution, there are various methods such as Histogram, Bar Graph,
Frequency Polygon, and Pie Chart.
Frequency Polygon: connects the midpoints of class frequencies using lines, similar to a histogram
but without bars; it is useful for comparing various datasets.
In a Grouped Frequency Distribution, observations are grouped into class intervals and the number
of observations in each interval is counted, for example:
Class Interval Frequency
0-20 6
20-40 12
40-60 22
60-80 15
80-100 5

Class Interval Frequency
10 – 20 5
20 – 30 8
30 – 40 12
40 – 50 6
50 – 60 3
In Ungrouped Frequency Distribution, all distinct observations are mentioned and counted
individually. This Frequency Distribution is often used when the given dataset is small.
Example: Make the Frequency Distribution Table for the ungrouped data given as follows:
10, 20, 15, 25, 30, 10, 15, 10, 25, 20, 15, 10, 30, 25
Solution:
The unique observations in the given data are 10, 15, 20, 25, and 30, each occurring with its own
frequency.
Thus the Frequency Distribution Table of the given data is as follows:
Value Frequency
10 4
15 3
20 2
25 3
30 2
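This table can also be produced directly in pandas; a short sketch using the values from the example above:

import pandas as pd

values = [10, 20, 15, 25, 30, 10, 15, 10, 25, 20, 15, 10, 30, 25]

freq = pd.Series(values).value_counts().sort_index()   # frequency of each distinct value
print(freq)
# 10    4
# 15    3
# 20    2
# 25    3
# 30    2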
Relative Frequency Distribution
This distribution displays the proportion or percentage of observations in each interval or class. It
is useful for comparing different data sets or for analyzing the distribution of data within a set.
Relative Frequency is given by:
Relative Frequency = (Frequency of Event)/(Total Number of Events)
Example: Make the Relative Frequency Distribution Table for data whose class frequencies are 5, 10, 20, 10, and 5.
Solution:
To create the Relative Frequency Distribution table, we divide the frequency of each class by the
total number of observations (50):
Frequency 5 10 20 10 5
Relative Frequency 0.10 0.20 0.40 0.20 0.10
Total 50 1.00
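Continuing in the same style, relative frequencies are obtained by dividing each class frequency by the total; a short sketch using the frequencies 5, 10, 20, 10, 5 from the example:

import pandas as pd

freq = pd.Series([5, 10, 20, 10, 5])
rel_freq = freq / freq.sum()        # relative frequency = frequency / total number of events
print(rel_freq.tolist())            # [0.1, 0.2, 0.4, 0.2, 0.1]
print(rel_freq.sum())               # 1.0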
Cumulative frequency is defined as the sum of all the frequencies in the previous values or intervals
up to the current one. The frequency distributions which represent the frequency distributions
using cumulative frequencies are called cumulative frequency distributions. There are two types
of cumulative frequency distributions:
● Less than Type: We sum all the frequencies before the current interval.
● More than Type: We sum all the frequencies after the current interval.
Example: Construct less than and more than cumulative frequency distributions for the following
data on runs scored in 25 innings:
45 34 50 75 22
56 63 70 49 33
0 8 14 39 86
92 88 70 56 50
57 45 42 12 39
Solution:
Since there are a lot of distinct values, we’ll express this in the form of a grouped distribution with
intervals like 0-10, 10-20, and so on. First let’s represent the data in the form of a grouped frequency
distribution.
Runs Frequency
0-10 2
10-20 2
20-30 1
30-40 4
40-50 4
50-60 5
60-70 1
70-80 3
80-90 2
90-100 1
Now we will convert this frequency distribution into a cumulative frequency distribution by
summing up the values of the current interval and all the previous intervals.
Less than 10 2
Less than 20 4
Less than 30 5
Less than 40 9
Less than 50 13
Less than 60 18
Less than 70 19
Less than 80 22
Less than 90 24
Less than 100 25
This table represents the cumulative frequency distribution of less than type.
More than 0 25
More than 10 23
More than 20 21
More than 30 20
More than 40 16
More than 50 12
More than 60 7
More than 70 6
More than 80 3
More than 90 1
This table represents the cumulative frequency distribution of more than type.
We can plot both the type of cumulative frequency distribution to make the Cumulative Frequency
Curve.
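Both cumulative tables above can be reproduced with a running (cumulative) sum; a minimal pandas sketch using the grouped frequencies from the runs example:

import pandas as pd

freq = pd.Series(
    [2, 2, 1, 4, 4, 5, 1, 3, 2, 1],
    index=["0-10", "10-20", "20-30", "30-40", "40-50",
           "50-60", "60-70", "70-80", "80-90", "90-100"],
)

less_than = freq.cumsum()               # less than type: running total from the first class downwards
more_than = freq[::-1].cumsum()[::-1]   # more than type: running total from the last class upwards
print(less_than)
print(more_than)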
Frequency Distribution Curve
A frequency distribution curve, also known as a frequency curve, is a graphical representation of
a data set’s frequency distribution. It is used to visualize the distribution and frequency of values
or observations within a dataset. Its shape may be symmetric, skewed, or have multiple peaks; the
common shapes are discussed under “Interpreting Distributions” later in this unit.
Coefficient of Variation
To compare the variability of two series, we use the Coefficient of Variation (C.V.), defined as
C.V. = (σ / x̄) × 100
where σ1 and x̄1 are the standard deviation and mean of the first series, and σ2 and x̄2 are the
standard deviation and mean of the second series. The Coefficient of Variation of each series is
calculated as follows:
C.V. of first series = (σ1 / x̄1) × 100
C.V. of second series = (σ2 / x̄2) × 100
We are given that both series have the same mean, i.e., x̄2 = x̄1 = x̄.
So now the C.V. for both series are:
C.V. of the first series = (σ1 / x̄) × 100
C.V. of the second series = (σ2 / x̄) × 100
Notice that now both series can be compared with the value of standard deviation only. Therefore,
we can say that for two series with the same mean, the series with a larger deviation can be
considered more variable than the other one.
Frequency Distribution Examples
Example 1: Suppose we have a series with a mean of 20 and a variance of 100. Find the
Coefficient of Variation.
Solution:
We know the formula for the Coefficient of Variation:
C.V. = (σ / x̄) × 100
Given mean x̄ = 20 and variance σ² = 100, so σ = √100 = 10.
Substituting the values in the formula:
C.V. = (10 / 20) × 100 = 50
Example 2: Two series have Coefficients of Variation of 70 and 80, and means of 20 and 30
respectively. Find the standard deviation of each series.
Solution:
In this question we apply the formula for C.V. and substitute the given values.
Standard deviation of the first series:
C.V. = (σ / x̄) × 100
70 = (σ / 20) × 100
1400 = σ × 100
σ = 14
Thus, the standard deviation of the first series = 14.
Standard deviation of the second series:
80 = (σ / 30) × 100
2400 = σ × 100
σ = 24
Thus, the standard deviation of the second series = 24.
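Both worked examples can be verified with a few lines of Python; a minimal sketch:

import math

# Example 1: mean = 20, variance = 100
mean = 20
variance = 100
sd = math.sqrt(variance)            # standard deviation = square root of the variance
print((sd / mean) * 100)            # C.V. = 50.0

# Example 2: given C.V. and mean, solve C.V. = (sd / mean) * 100 for sd
for cv, m in [(70, 20), (80, 30)]:
    print(cv * m / 100)             # 14.0 and 24.0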
Example 3: Draw the frequency distribution table for the following data:
2, 3, 1, 4, 2, 2, 3, 1, 4, 4, 4, 2, 2, 2
Solution:
Since there are only very few distinct values in the series, we will plot the ungrouped frequency
distribution.
Value Frequency
1 2
2 6
3 2
4 4
Total 14
Example 4: The table below gives the values of temperature recorded in Hyderabad for 25 days in
summer. Represent the data in the form of less-than-type cumulative frequency distribution:
37 34 36 27 22
25 25 24 26 28
30 31 29 28 30
32 31 28 27 30
30 32 35 34 29
Solution:
Since there are many distinct values here, we will use a grouped frequency distribution with the
intervals 20-25, 25-30, 30-35, and 35-40. The frequency distribution table is made by counting the
number of values lying in each of these intervals.
Temperature Frequency
20-25 2
25-30 10
30-35 10
35-40 3
This is the grouped frequency distribution table. It can be converted into a cumulative frequency
distribution by adding the previous values.
Less than 25 2
Less than 30 12
Less than 35 22
Less than 40 25
Example 5: Make a Frequency Distribution Table as well as the curve for the data:
{45, 22, 37, 18, 56, 33, 42, 29, 51, 27, 39, 14, 61, 19, 44, 25, 58, 36, 48, 30, 53, 41, 28, 35, 47, 21,
32, 49, 16, 52, 26, 38, 57, 31, 59, 20, 43, 24, 55, 17, 50, 23, 34, 60, 46, 13, 40, 54, 15, 62}
Solution:
To create the frequency distribution table for given data, let’s arrange the data in ascending order
as follows:
{13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62}
Now, we can count the observations for intervals: 10-20, 20-30, 30-40, 40-50, 50-60 and 60-70.
Interval Frequency
10 – 20 7
20 – 30 10
30 – 40 10
40 – 50 10
50 – 60 10
60 – 70 3
From this data, we can plot the Frequency Distribution Curve as follows:
A cumulative frequency is defined as the total of frequencies that are distributed over different
class intervals. It means that the data and the total are represented in the form of a table in which
the frequencies are distributed according to the class interval. In this section, we are going to discuss
the cumulative frequency distribution, the types of cumulative frequencies, and the construction of
the cumulative frequency distribution table, with examples.
What is Meant by Cumulative Frequency Distribution?
The cumulative frequency is the total of frequencies, in which the frequency of the first class
interval is added to the frequency of the second class interval and then the sum is added to the
frequency of the third class interval and so on. Hence, the table that represents the cumulative
frequencies that are divided over different classes is called the cumulative frequency table or
cumulative frequency distribution. Generally, the cumulative frequency distribution is used to
identify the number of observations that lie above or below the particular frequency in the provided
data set.
Types of Cumulative Frequency Distribution
The cumulative frequency distribution is classified into two different types, namely: less than
cumulative frequency (less than ogive) and more than (greater than) cumulative frequency.
Less Than Cumulative Frequency:
The Less than cumulative frequency distribution is obtained by adding successively the
frequencies of all the previous classes along with the class against which it is written. In this type,
the cumulate begins from the lowest to the highest size.
Greater Than Cumulative Frequency:
The greater than cumulative frequency is also known as the more than type cumulative frequency.
Here, the greater than cumulative frequency distribution is obtained by determining the cumulative
total frequencies starting from the highest class to the lowest class.
Graphical Representation of Less Than and More Than Cumulative Frequency
Representation of cumulative frequency graphically is easy and convenient as compared to
representing it using a table, bar-graph, frequency polygon etc.
The cumulative frequency graph can be plotted in two ways:
1. Cumulative frequency distribution curve(or ogive) of less than type
2. Cumulative frequency distribution curve(or ogive) of more than type
Steps to Construct Less than Cumulative Frequency Curve
The steps to construct the less than cumulative frequency curve are as follows:
1. Mark the upper limit on the horizontal axis or x-axis.
2. Mark the cumulative frequency on the vertical axis or y-axis.
3. Plot the points (x, y) in the coordinate plane where x represents the upper limit value and
y represents the cumulative frequency.
4. Finally, join the points and draw the smooth curve.
5. The curve so obtained gives a cumulative frequency distribution graph of less than type.
To draw a cumulative frequency distribution graph of less than type, consider the following
cumulative frequency distribution table which gives the number of participants in any level of
essay writing competition according to their age:
Table 1 Cumulative Frequency distribution table of less than type
These graphs are helpful in figuring out the median of a given data set. The median can be found
by drawing both types of cumulative frequency distribution curves on the same graph: the value of
the point of intersection of both curves gives the median of the given set of data.
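A minimal matplotlib sketch of plotting both ogives on the same axes, using the runs data from the earlier example; the x-value of the point where the two curves cross indicates the median.

import matplotlib.pyplot as plt

upper_limits = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
less_than = [2, 4, 5, 9, 13, 18, 19, 22, 24, 25]          # less-than cumulative frequencies

lower_limits = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
more_than = [25, 23, 21, 20, 16, 12, 7, 6, 3, 1]          # more-than cumulative frequencies

plt.plot(upper_limits, less_than, marker="o", label="Less than ogive")
plt.plot(lower_limits, more_than, marker="o", label="More than ogive")
plt.xlabel("Runs")
plt.ylabel("Cumulative frequency")
plt.legend()
plt.show()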
Example on Cumulative Frequency
Example:
Create a cumulative frequency table for the following information, which represent the number of
hours per week that Arjun plays indoor games:
Arjun’s game time:
Monday 2 hrs
Tuesday 1 hr
Wednesday 2 hrs
Thursday 3 hrs
Friday 4 hrs
Saturday 2 hrs
Sunday 6 hrs
Solution:
Let the no. of hours be the frequency.
Hence, the cumulative frequency table is calculated as follows:
Monday 2 hrs 2
Tuesday 1 hr 2 + 1 = 3
Wednesday 2 hrs 3 + 2 = 5
Thursday 3 hrs 5 + 3 = 8
Friday 4 hrs 8 + 4 = 12
Saturday 2 hrs 12 + 2 = 14
Sunday 6 hrs 14 + 6 = 20
Frequency Distributions for Nominal Data
The significance of understanding nominal data extends beyond mere classification; it
impacts how data is interpreted and the statistical methods applied to it. Since nominal data does
not imply any numerical relationship or order among its categories, traditional measures of central
tendency like mean or median are not applicable.
● Categorical Classification:
Nominal data is used to categorize variables into distinct groups based on qualitative
attributes, without any numerical significance or inherent order.
● Mutually Exclusive:
Each data point can belong to only one category, ensuring clear and precise classification
without overlap between groups.
● No Order or Hierarchy:
The categories within nominal data do not have a ranked sequence or hierarchy; all
categories are considered equal but different.
● Identified by Labels:
Categories are often identified using names or labels, which can occasionally include
numbers used as identifiers rather than quantitative values.
● Limited Statistical Analysis:
Analysis of nominal data primarily involves counting frequency, determining mode, and
using chi-square tests, as measures of central tendency like mean or median are not
applicable.
● Frequency Distribution:
One of the most common methods of analyzing nominal data is to count the frequency of
occurrences in each category. This helps in understanding the distribution of data across
the different categories. For instance, in a nominal data example like survey responses on
preferred types of cuisine, frequency distribution would reveal how many respondents
prefer each type of cuisine.
● Mode Determination:
The mode, or the most frequently occurring category in the dataset, is a key measure of
central tendency that can be applied to nominal data. It provides insight into the most
common or popular category among the data points. For example, if analyzing nominal
data on pet ownership, the mode would indicate the most common type of pet among
participants.
● Cross-tabulation:
Cross-tabulation involves comparing two or more nominal variables to identify
relationships between categories. This analysis can reveal patterns and associations that
are not immediately apparent. For instance, cross-tabulating nominal data on consumers'
favorite fast-food chains with their age groups could uncover preferences trends among
different age demographics.
● Chi-square Test:
For more complex analysis involving nominal data, the chi-square test is used to examine
the relationships between two nominal variables. It tests whether the distribution of
sample categorical data matches an expected distribution. As an example, researchers
might use a chi-square test to analyze whether there is a significant association between
gender (a nominal data example) and preference for a particular brand of product.
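As a hedged illustration of the last point, the chi-square test of independence can be run on a small hypothetical contingency table (gender versus preferred brand) using scipy; the counts below are invented for the example.

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts: rows are genders, columns are preferred brands
observed = np.array([
    [30, 20, 10],
    [20, 25, 15],
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)
# A small p-value (for example, below 0.05) suggests an association between the two nominal variables.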
Examples
To illustrate the concept of nominal data more concretely, here are some practical examples that
showcase its application across various fields and contexts:
Nominal Data vs. Ordinal Data
Analysis Techniques: for nominal data, frequency counts, mode, and chi-square tests; for ordinal
data, median, percentile, rank correlation, and non-parametric tests.
Application: nominal data is used for categorizing data without any need for ranking; ordinal data
is used when data classification requires a hierarchy or ranking.
Interpreting Distributions
1. Normal Distribution
- Symmetric, bell-shaped
- Mean = Median = Mode
- Characteristics:
- Most data points cluster around mean
- Tails decrease exponentially
- 68% data within 1 standard deviation
- 95% data within 2 standard deviations
- Examples: Height, IQ scores, measurement errors
2. Skewed Distribution
- Asymmetric, tails on one side
- Types:
- Positive Skew: Tail on right side (e.g., income distribution, wealth distribution)
- Negative Skew: Tail on left side (e.g., failure time distribution, response times)
- Characteristics:
- Mean ≠ Median ≠ Mode
- Tails are longer on one side
- Data is concentrated on one side
- Examples: Income, wealth, failure times
3. Bimodal Distribution
4. Multimodal Distribution
- Multiple peaks
- Characteristics:
- Multiple modes (local maxima)
- Multiple valleys
- Data has multiple distinct groups
- Examples: Gene expression data, text analysis
5. Uniform Distribution
6. Exponential Distribution
8. Lognormal Distribution
9. Binomial Distribution
Median
It is the middle value of the data set when the values are arranged in order. If the data set has an
even number of values, the median is found by taking the average of the two middle values.
Consider these 10 (even) values: 1, 2, 3, 7, 8, 3, 2, 5, 4, 15. We first sort the values in ascending
order: 1, 2, 2, 3, 3, 4, 5, 7, 8, 15. The median is (3 + 4)/2 = 3.5, which is the average of the two
middle values, i.e. the values located at the 5th and 6th positions in the sorted sequence, each with
4 numbers on either side.
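This can be checked quickly in Python; a small sketch using the statistics module:

import statistics

values = [1, 2, 3, 7, 8, 3, 2, 5, 4, 15]
print(sorted(values))               # [1, 2, 2, 3, 3, 4, 5, 7, 8, 15]
print(statistics.median(values))    # 3.5, the average of the 5th and 6th sorted values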
Mode
It is the most frequent value in the data set. We can easily get the mode by counting the frequency
of occurrence of each value. Consider a data set with the values 1, 5, 5, 6, 8, 2, 6, 6.
The value 6 occurs the most (three times), hence the mode of the data set is 6.
We often check our data by plotting the distribution curve. If most of the values are centrally located
and very few values are far from the center, we say that the data has a normal distribution; in that
case the values of the mean, median, and mode are almost equal.
However, when our data is skewed, for example in a right-skewed data set, the mean is dragged in
the direction of the skew. In such a skewed distribution, mode < median < mean. The more skewed
the distribution, the greater the difference between the median and the mean, and we then rely on
the median for conclusions. The best example of a right-skewed distribution is the salaries of
employees, where high earners give a misleading picture of the typical income if the mean salary
is reported instead of the median salary.
For a left-skewed distribution, mean < median < mode. In such a case also, we emphasize the
median value of the distribution.
Mean, Median & Mode Example
To understand this let us consider an example. An OTT platform company has conducted a survey
in a particular region based on the watch time, language of streaming, and age of the viewer. For
our understanding, we have taken a sample of 10 people.
import pandas as pd
import seaborn as sns   # seaborn is assumed here for the plots shown below

df = pd.read_csv("[Link]")
df
df["Watch Time"].mean()
2.5
df["Watch Time"].mode()
0 1.5
dtype: float64
df["Watch Time"].median()
2.0
If we observe the values, we can conclude that the mean Watch Time is 2.5 hours, which appears
reasonably correct. For the Age of the viewers, the following results are obtained:
df["Age"].median()
12.5
df["Age"].mean()
19.9
df["Age"].mode()
0 12
1 15
dtype: int64
The value of the mean Age looks somewhat away from the actual data: most of the viewers are in
the range of 10 to 15, while the mean comes out to 19.9. This is because of the outliers present in
the data set. We can easily find the outliers using a boxplot.
sns.boxplot(df['Age'], orient='vertical')   # assuming the original call used seaborn's boxplot
If we observe the value of the median Age, the result looks correct. The value of the mean is very
sensitive to outliers.
Now, for the most popular language, we cannot calculate the mean and median since this is nominal
data.
[Link](x="Language",y="Age",data=df)
[Link](x="Language",y="Watch Time",data=df)
If we observe the graphs, the Tamil bar is the largest in both the Language vs Age and the Language
vs Watch Time graphs. But this is misleading because there is only one person who watches shows
in Tamil.
df["Language"].value_counts()
Hindi 4
English 3
Tamil 1
Telgu 1
Marathi 1
Name: Language, dtype: int64
df["Language"].mode()
0 Hindi
dtype: object
Result
From the above result, it is concluded that the most popular language is Hindi. This is observed
when we find the mode of the data set.
Hence, from the above observations, we conclude that in the sample survey the typical viewer is
about 12.5 years old (the median age) and watches a show for about 2.5 hours daily, most commonly
in the Hindi language.
We can say there is no single best measure of central tendency, because the right choice always
depends on the type of data. For ordinal data, and for interval and ratio data that are skewed, we
prefer the median. For nominal data, the mode is preferred, and for interval and ratio data that are
not skewed, the mean is preferred.
Measures of Central Tendency and Dispersion
Dispersion measures indicate how data values are spread out. The range, which is the difference
between the highest and lowest values, is a simple measure of dispersion. The standard deviation
measures the expected difference between a data value and the mean.
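A short NumPy sketch of these two dispersion measures for a small made-up sample:

import numpy as np

data = np.array([4, 8, 6, 5, 3, 9, 7])

data_range = data.max() - data.min()   # range: highest value minus lowest value
std_dev = data.std(ddof=1)             # sample standard deviation: typical spread around the mean
print(data_range, round(std_dev, 2))   # 6 2.16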
1. What does a frequency distribution represent? a) The raw data in a tabular form
b) The count of how often each value appears in the data set
c) The mean of a dataset
d) The median of a dataset
Answer: b) The count of how often each value appears in the data set
2. What is an outlier in a data set? a) A data point that occurs most frequently
b) A data point that lies significantly outside the range of the rest of the data
c) A point at the median
d) A value that represents the average
Answer: b) A data point that lies significantly outside the range of the rest of the data
3. Which type of frequency distribution is used for nominal data? a) Cumulative
frequency distribution
b) Relative frequency distribution
c) Frequency distribution for nominal data
d) Continuous frequency distribution
Answer: c) Frequency distribution for nominal data
4. In a relative frequency distribution, what does the frequency of each class
represent? a) The total number of observations
b) The percentage of the total data points in each class
c) The number of outliers in the data
d) The cumulative count of data points
Answer: b) The percentage of the total data points in each class
5. What is a cumulative frequency distribution? a) A distribution showing the sum of
frequencies for all values less than or equal to each class
b) A distribution showing only the most frequent values
c) A distribution based on nominal data
d) A distribution used only for large datasets
Answer: a) A distribution showing the sum of frequencies for all values less than or
equal to each class
6. What is the primary purpose of interpreting a frequency distribution? a) To find the
most frequent value
b) To understand the shape and patterns in the data
c) To find the median
d) To identify the largest value in the dataset
Answer: b) To understand the shape and patterns in the data
7. Which of the following graphs is commonly used to display frequency distributions?
a) Pie chart
b) Histogram
c) Scatter plot
d) Line graph
Answer: b) Histogram
8. What does a bar graph represent in terms of frequency data? a) The relationship
between two continuous variables
b) The distribution of values in nominal data
c) The cumulative frequency of data
d) The mean of a dataset
Answer: b) The distribution of values in nominal data
9. Which measure of central tendency represents the value that appears most
frequently in the dataset? a) Median
b) Mode
c) Mean
d) Range
Answer: b) Mode
10. What is the median in a dataset? a) The value that occurs most often
b) The average of all values
c) The middle value when the data is arranged in ascending order
d) The sum of all values divided by the number of observations
Answer: c) The middle value when the data is arranged in ascending order
11. Which of the following is true about the mean of a dataset? a) It is always the same as
the median
b) It is the middle value of the data
c) It can be affected by extreme values (outliers)
d) It is the most frequent value
Answer: c) It can be affected by extreme values (outliers)
12. What is the purpose of creating a frequency distribution for nominal data? a) To
calculate the mean
b) To show the distribution of categories in the data
c) To determine outliers
d) To compute the cumulative frequency
Answer: b) To show the distribution of categories in the data
13. Which of the following graphs is best suited for displaying cumulative frequency
distributions? a) Pie chart
b) Histogram
c) Ogive
d) Box plot
Answer: c) Ogive
14. Which of the following is a characteristic of the mode in a dataset? a) It is always
unique
b) It can have more than one value in bimodal or multimodal distributions
c) It is unaffected by outliers
d) It is the average of the data
Answer: b) It can have more than one value in bimodal or multimodal distributions
15. In a frequency distribution, the relative frequency is expressed as: a) A fraction of
the total data points
b) The total number of observations
c) A count of the most frequent value
d) The cumulative sum of all frequencies
Answer: a) A fraction of the total data points
16. Which measure of central tendency is most affected by extreme outliers? a) Mode
b) Mean
c) Median
d) Range
Answer: b) Mean
17. What is the difference between the cumulative frequency and the relative
cumulative frequency? a) Cumulative frequency represents the total number of data
points, while relative cumulative frequency represents the percentage
b) Cumulative frequency shows frequencies for each class, while relative cumulative
frequency shows cumulative count
c) Cumulative frequency is used for nominal data, while relative cumulative frequency is
for continuous data
d) There is no difference; both terms mean the same thing
Answer: a) Cumulative frequency represents the total number of data points, while
relative cumulative frequency represents the percentage
18. Which of the following graphs is used to represent the distribution of data over
time? a) Line graph
b) Histogram
c) Scatter plot
d) Box plot
Answer: a) Line graph
19. Which of the following measures of central tendency is best used for ordinal data? a)
Mean
b) Median
c) Mode
d) Range
Answer: b) Median
20. In a frequency distribution, if the data is heavily skewed, which measure of central
tendency would provide the best representation of the "center"? a) Mode
b) Median
c) Mean
d) Standard deviation
Answer: b) Median
21. Which measure of central tendency is appropriate when the data is skewed or has
outliers? a) Mean
b) Median
c) Mode
d) Variance
Answer: b) Median
22. Which of the following is an example of a continuous frequency distribution? a) The
number of people in different age groups
b) The number of red cars in a parking lot
c) The range of temperatures over a week
d) The number of students in each grade
Answer: c) The range of temperatures over a week
23. What does an ogive graph show? a) The distribution of nominal data
b) The cumulative frequency of the data
c) The most frequent value in the data
d) The median and mean
Answer: b) The cumulative frequency of the data
24. Which of the following methods is used to find the median in a frequency
distribution? a) Add all values and divide by the total number of observations
b) Find the value that divides the data into two equal parts
c) Find the mode of the data
d) Identify the value that appears most often
Answer: b) Find the value that divides the data into two equal parts
25. In which type of distribution would you expect to see multiple modes? a) Uniform
distribution
b) Normal distribution
c) Bimodal or multimodal distribution
d) Skewed distribution
Answer: c) Bimodal or multimodal distribution
5 Mark Questions:
1. Explain the concept of frequency distributions and discuss their significance in data
analysis.
2. Define outliers and explain how they can affect frequency distributions and statistical
measures like the mean.
3. What is the difference between cumulative frequency distribution and relative frequency
distribution? Explain with examples.
4. How can frequency distributions be used to interpret nominal data? Provide examples of
when this is useful.
5. Describe how to calculate and interpret the mean, median, and mode in a data set.
Discuss situations where one might be more useful than the others.
10 Mark Questions:
1. Explain the process of creating a frequency distribution for a given dataset and interpret
the results. Include examples of nominal and continuous data.
2. Discuss the role of graphs in the interpretation of frequency distributions. Compare and
contrast histograms, bar graphs, and ogives.
3. How do outliers impact the mean, median, and mode? Explain with examples, and
discuss methods for identifying and handling outliers in data analysis.
4. Describe the concept of cumulative frequency distribution. How do you construct one,
and what insights can it provide about a data set?
5. Explain how to calculate and interpret the different measures of central tendency (mean,
median, mode) for a given dataset. In which situations would each measure be most
appropriate?
UNIT -III
Normal distributions
Normal Distribution is the most common or normal form of distribution of Random Variables,
hence the name “normal distribution.” It is also called Gaussian Distribution in Statistics or
Probability. We use this distribution to represent a large number of random variables. It serves as
a foundation for statistics and probability theory.
It also describes many natural phenomena, forms the basis of the Central Limit Theorem, and also
supports numerous statistical methods.
The normal distribution is the most important and most widely used distribution in statistics. It is
sometimes called the “bell curve,” although the tonal qualities of such a bell would be less than
pleasing. It is also called the “Gaussian curve” or the Gaussian distribution, after the mathematician
Karl Friedrich Gauss.
Strictly speaking, it is not correct to talk about “the normal distribution” since there are many
normal distributions. Normal distributions can differ in their means and in their standard
deviations. Figure 4.1 shows three normal distributions. The blue (left-most) distribution has a
mean of −3 and a standard deviation of 0.5, the distribution in red (the middle distribution) has a
mean of 0 and a standard deviation of 1, and the black (right-most) distribution has a mean of 2
and a standard deviation of 3. These as well as all other normal distributions are symmetric with
relatively more values at the center of the distribution and relatively few in the tails. What is
consistent about all normal distributions is the shape and the proportion of scores within a given
distance along the x-axis. We will focus on the standard normal distribution (also known as the
unit normal distribution), which has a mean of 0 and a standard deviation of 1 (i.e., the red
distribution in Figure 4.1).
Figure 4.1. Normal distributions differing in mean and standard deviation. (“Normal Distributions
with Different Means and Standard Deviations” by Judy Schmitt is licensed under CC BY-NC-SA
4.0.)
Seven features of normal distributions are listed below.
1. Normal distributions are symmetric around their mean.
2. The mean, median, and mode of a normal distribution are equal.
3. The area under the normal curve is equal to 1.0.
4. Normal distributions are denser in the center and less dense in the tails.
5. Normal distributions are defined by two parameters, the mean (μ) and the standard
deviation (σ).
6. 68% of the area of a normal distribution is within one standard deviation of the mean.
7. Approximately 95% of the area of a normal distribution is within two standard
deviations of the mean.
These properties enable us to use the normal distribution to understand how scores relate to one
another within and across a distribution. But first, we need to learn how to calculate the
standardized score that makes up a standard normal distribution.
Z Scores
A z score is a standardized version of a raw score (x) that gives information about the relative
location of that score within its distribution. The formula for converting a raw score into a z score
is z = (x − μ) / σ when the population mean and standard deviation are known, or z = (x − M) / s when sample statistics are used.
As you can see, z scores combine information about where the distribution is located (the
mean/center) with how wide the distribution is (the standard deviation/spread) to interpret a raw
score (x). Specifically, z scores will tell us how far the score is away from the mean in units of
standard deviations and in what direction.
The value of a z score has two parts: the sign (positive or negative) and the magnitude (the actual
number). The sign of the z score tells you in which half of the distribution the z score falls: a
positive sign (or no sign) indicates that the score is above the mean and on the right-hand side or
upper end of the distribution, and a negative sign tells you the score is below the mean and on the
left-hand side or lower end of the distribution. The magnitude of the number tells you, in units of
standard deviations, how far away the score is from the center or mean. The magnitude can take
on any value between negative and positive infinity, but for reasons we will see soon, z scores
generally fall between −3 and 3.
Let’s look at some examples. A z score value of −1.0 tells us that this z score is 1 standard deviation
(because of the magnitude 1.0) below (because of the negative sign) the mean. Similarly, a z score
value of 1.0 tells us that this z score is 1 standard deviation above the mean. Thus, these two scores
are the same distance away from the mean but in opposite directions. A z score of −2.5 is two-and-
a-half standard deviations below the mean and is therefore farther from the center than both of the
previous scores, and a z score of 0.25 is closer than all of the ones before. In Unit 2, we will learn
to formalize the distinction between what we consider “close to” the center or “far from” the center.
For now, we will use a rough cut-off of 1.5 standard deviations in either direction as the difference
between close scores (those within 1.5 standard deviations or between z = −1.5 and z = 1.5) and
extreme scores (those farther than 1.5 standard deviations—below z = −1.5 or above z = 1.5).
We can also convert raw scores into z scores to get a better idea of where in the distribution those
scores fall. Let’s say we get a score of 68 on an exam. We may be disappointed to have scored so
low, but perhaps it was just a very hard exam. Having information about the distribution of all
scores in the class would be helpful to put some perspective on ours. We find out that the class got
an average score of 54 with a standard deviation of 8. To find out our relative location within this
distribution, we simply convert our test score into a z score: z = (68 − 54) / 8 = 14 / 8 = 1.75.
We find that we are 1.75 standard deviations above the average, above our rough cut-off for close
and far. Suddenly our 68 is looking pretty good!
Figure 4.2 shows both the raw score and the z score on their respective distributions. Notice that
the red line indicating where each score lies is in the same relative spot for both. This is because
transforming a raw score into a z score does not change its relative location, it only makes it easier
to know precisely where it is.
Figure 4.2. Raw and standardized versions of a single score. (“Raw and Standardized Versions of
a Score” by Judy Schmitt is licensed under CC BY-NC-SA 4.0.)
z Scores are also useful for comparing scores from different distributions. Let’s say we take the
SAT and score 501 on both the math and critical reading sections. Does that mean we did equally
well on both? Scores on the math portion are distributed normally with a mean of 511 and standard
deviation of 120, so our z score on the math section is z_math = (501 − 511) / 120 = −10 / 120 ≈ −0.08,
which is just slightly below average (note the use of “math” as a subscript; subscripts are used
when presenting multiple versions of the same statistic in order to know which one is which and
have no bearing on the actual calculation). The critical reading section has a mean of 495 and
standard deviation of 116, so z_reading = (501 − 495) / 116 = 6 / 116 ≈ 0.05.
So even though we were almost exactly average on both tests, we did a little bit better on the
critical reading portion relative to other people.
Finally, z scores are incredibly useful if we need to combine information from different measures
that are on different scales. Let’s say we give a set of employees a series of tests on things like job
knowledge, personality, and leadership. We may want to combine these into a single score we can
use to rate employees for development or promotion, but look what happens when we take the
average of raw scores from different scales, as shown in Table 4.1.
[Table 4.1. Employees' raw scores on job knowledge (0–100), personality (1–5), and leadership (1–5), with the average of each employee's raw scores.]
Because the job knowledge scores were so big and the scores were so similar, they overpowered
the other scores and removed almost all variability in the average. However, if we standardize
these scores into z scores, our averages retain more variability and it is easier to assess differences
between employees, as shown in Table 4.2.
[Table 4.2. The same test scores converted to z scores, with the average of each employee's z scores.]
We can also convert z scores back into raw scores by rearranging the z score formula: x = zσ + μ for a population and x = zs + M for a sample. Notice that these are just simple rearrangements of the original formulas for calculating z from raw scores.
Let’s say we create a new measure of intelligence, and initial calibration finds that our scores have
a mean of 40 and standard deviation of 7. Three people who have scores of 52, 43, and 34 want to
know how well they did on the measure. We can convert their raw scores into z scores: z = (52 − 40) / 7 ≈ 1.71, z = (43 − 40) / 7 ≈ 0.43, and z = (34 − 40) / 7 ≈ −0.86.
A problem is that these new z scores aren’t exactly intuitive for many people. We can give people
information about their relative location in the distribution (for instance, the first person scored
well above average), or we can translate these z scores into the more familiar metric of IQ scores,
which have a mean of 100 and standard deviation of 16: IQ = z(16) + 100, giving 127.4, 106.9, and 86.3.
We would also likely round these values to 127, 107, and 87, respectively, for convenience.
We saw in Chapter 3 that standard deviations can be used to divide the normal distribution: 68%
of the distribution falls within 1 standard deviation of the mean, 95% within (roughly) 2 standard
deviations, and 99.7% within 3 standard deviations. Because z scores are in units of standard
deviations, this means that 68% of scores fall between z = −1.0 and z = 1.0 and so on. We call this
68% (or any percentage we have based on our z scores) the proportion of the area under the curve.
Any area under the curve is bounded by (defined by, delineated by, etc.) a single z score or pair
of z scores.
An important property to point out here is that, by virtue of the fact that the total area under the
curve of a distribution is always equal to 1.0 (see section on Normal Distributions at the beginning
of this chapter), these areas under the curve can be added together or subtracted from 1 to find the
proportion in other areas. For example, we know that the area between z = −1.0 and z = 1.0 (i.e.,
within one standard deviation of the mean) contains 68% of the area under the curve, which can
be represented in decimal form as .6800. (To change a percentage to a decimal, simply move the
decimal point 2 places to the left.) Because the total area under the curve is equal to 1.0, that means
that the proportion of the area outside z = −1.0 and z = 1.0 is equal to 1.0 − .6800 = .3200 or 32%
(see Figure 4.3). This area is called the area in the tails of the distribution. Because this area is split
between two tails and because the normal distribution is symmetrical, each tail has exactly one-
half, or 16%, of the area under the curve.
Figure 4.3. Shaded areas represent the area under the curve in the tails. (“Area under the Curve in
the Tails” by Judy Schmitt is licensed under CC BY-NC-SA 4.0.)
We will have much more to say about this concept in the coming chapters. As it turns out, this is
a quite powerful idea that enables us to make statements about how likely an outcome is and what
that means for research questions we would like to answer and hypotheses we would like to test.
Because normally distributed variables are so common, many statistical tests are designed for
normally distributed populations.
Understanding the properties of normal distributions means you can use inferential statistics to
compare different groups and make estimates about populations using samples.
The mean is the location parameter while the standard deviation is the scale parameter.
The mean determines where the peak of the curve is centered. Increasing the mean moves the curve
right, while decreasing it moves the curve left.
The standard deviation stretches or squeezes the curve. A small standard deviation results in a
narrow curve, while a large standard deviation leads to a wide curve.
Empirical rule
The empirical rule, or the 68-95-99.7 rule, tells you where most of your values lie in a normal
distribution:
● Around 68% of values are within 1 standard deviation from the mean.
● Around 95% of values are within 2 standard deviations from the mean.
● Around 99.7% of values are within 3 standard deviations from the mean.
Example: Using the empirical rule in a normal distribution
You collect SAT scores from students in a new test preparation course. The data follows a normal
distribution with a mean score (M) of 1150 and a standard deviation (SD) of 150.
Following the empirical rule:
● Around 68% of scores are between 1,000 and 1,300, 1 standard deviation above and below the
mean.
● Around 95% of scores are between 850 and 1,450, 2 standard deviations above and below the
mean.
● Around 99.7% of scores are between 700 and 1,600, 3 standard deviations above and below the
mean.
The empirical rule is a quick way to get an overview of your data and check for any outliers or
extreme values that don’t follow this pattern.
If data from small samples do not closely follow this pattern, then other distributions like the t-
distribution may be more appropriate. Once you identify the distribution of your variable, you can
apply appropriate statistical tests.
Central limit theorem
The central limit theorem is the basis for how normal distributions work in statistics.
In research, to get a good idea of a population mean, ideally you’d collect data from multiple
random samples within the population. A sampling distribution of the mean is the distribution of
the means of these different samples.
● Law of Large Numbers: As you increase the sample size (or the number of samples), the
sample mean will approach the population mean.
● With multiple large samples, the sampling distribution of the mean is normally distributed,
even if your original variable is not normally distributed.
Parametric statistical tests typically assume that samples come from normally distributed
populations, but the central limit theorem means that this assumption isn’t necessary to meet when
you have a large enough sample.
You can use parametric tests for large samples from populations with any kind of distribution as
long as other important assumptions are met. A sample size of 30 or more is generally considered
large.
For small samples, the assumption of normality is important because the sampling distribution of
the mean isn’t known. For accurate results, you have to be sure that the population is normally
distributed before you can use parametric tests with small samples.
In a probability density function, the area under the curve tells you probability. The normal
distribution is a probability distribution, so the total area under the curve is always 1 or 100%.
The formula for the normal probability density function looks fairly complicated. But to use it,
you only need to know the population mean and standard deviation.
For any value of x, you can plug in the mean and standard deviation into the formula to find the
probability density of the variable taking on that value of x.
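For reference, the probability density function of a normal distribution with mean μ and standard deviation σ is:
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))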
Example: Using the probability density function
You want to know the probability that SAT scores in your sample exceed 1380.
On your graph of the probability density function, the probability is the shaded area under the curve that
lies to the right of where your SAT scores equal 1380.
You can find the probability value of this score using the standard normal distribution.
While individual observations from normal distributions are referred to as x, they are referred to
as z in the z-distribution. Every normal distribution can be converted to the standard normal
distribution by turning the individual values into z-scores.
Z-scores tell you how many standard deviations away from the mean each value lies.
You only need to know the mean and standard deviation of your distribution to find the z-score of
a value: z = (x − μ) / σ, where:
● x = individual value
● μ = mean
● σ = standard deviation
We convert normal distributions into the standard normal distribution for several reasons: it lets us look up probabilities in a single Z-table, and it makes scores from different distributions directly comparable.
Example: Finding probability using the z-distribution
To find the probability of SAT scores in your sample exceeding 1380, you first find the z-score.
The mean of our distribution is 1150, and the standard deviation is 150. The z-score tells you how many
standard deviations away 1380 is from the mean.
z = (1380 − 1150) / 150 = 230 / 150 ≈ 1.53
For a z-score of 1.53, the p-value is 0.937. This is the probability of SAT scores being 1380 or less
(93.7%), and it’s the area under the curve left of the shaded area.
To find the shaded area, you take away 0.937 from 1, which is the total area under the curve.
Probability of x > 1380 = 1 – 0.937 = 0.063
That means only about 6.3% of SAT scores in your sample exceed 1380.
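The same calculation can be reproduced programmatically; below is a minimal sketch using SciPy's standard normal distribution (this assumes SciPy is installed; the text itself works from printed Z-tables):
from scipy.stats import norm

mean, sd = 1150, 150
z = (1380 - mean) / sd          # (1380 - 1150) / 150, approximately 1.53
p_below = norm.cdf(z)           # P(Z <= 1.53), approximately 0.937
p_above = 1 - p_below           # P(X > 1380), approximately 0.063
print(round(z, 2), round(p_below, 3), round(p_above, 3))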
What is a Z-score?
A Z-score (also called a standard score) represents the number of standard deviations a data point
is from the mean of a distribution. It allows you to standardize data from different distributions so
that you can compare them directly.
• Formula for Z-score: Z = (X − μ) / σ, where:
o X is the value of the data point you're interested in,
o μ is the mean of the population (or sample),
o σ is the standard deviation of the population (or sample).
2. Interpreting a Z-score:
• A Z-score of 0 means that the data point is exactly at the mean.
• A positive Z-score indicates that the data point is above the mean.
• A negative Z-score indicates that the data point is below the mean.
• The magnitude of the Z-score tells you how far, in standard deviations, the data point is
from the mean.
3. Finding Z-scores:
To find a Z-score, you need to know the value you're analyzing (X), the mean (μ), and the
standard deviation (σ) of the population (or sample).
Example: Suppose you're analyzing the scores of students on a test where the mean score is 70
and the standard deviation is 10. If a student scored 85, their Z-score would be:
Z = (85 − 70) / 10 = 1.5
This means that the student's score is 1.5 standard deviations above the mean.
4. Using Z-scores to Find Proportions:
Z-scores are often used to find the proportion of data that falls below, above, or between certain
values in a normal distribution.
• To find the proportion of data below a given Z-score, you can use the standard normal
distribution table (also called the Z-table) or a calculator with statistical functions.
• A Z-table provides the cumulative probability (or proportion) to the left of a given Z-score
in a standard normal distribution (which has a mean of 0 and a standard deviation of 1).
Example: Finding the Proportion Below a Z-score
If you want to find the proportion of data points that are below a Z-score of 1.5, you would look
up the value for Z = 1.5 in the Z-table. The value for Z = 1.5 is approximately 0.9332,
which means that about 93.32% of the data falls below a Z-score of 1.5.
• This means that in a standard normal distribution, approximately 93.32% of the values are
less than or equal to 1.5 standard deviations above the mean.
5. Finding Scores from Z-scores:
You can also use Z-scores to find the original score (or data point) from a given Z-score.
Rearranging the Z-score formula:
X = Z · σ + μ
where:
• Z is the Z-score,
• σ is the standard deviation,
• μ is the mean,
• X is the original score.
Example: If the mean score is 70, the standard deviation is 10, and you want to find the score
corresponding to a Z-score of 2, the formula is:
X = 2 · 10 + 70 = 90
So, a Z-score of 2 corresponds to a score of 90.
6. Applications of Z-scores:
• Comparing data from different distributions: Z-scores allow you to compare scores
from different datasets (or tests) even if the datasets have different means and standard
deviations. For example, you can compare a test score from one exam to a test score from
another exam by calculating the Z-scores for both.
• Identifying outliers: A Z-score that is significantly higher or lower than the mean (usually
greater than +2 or less than -2) can be considered an outlier, since it indicates that the data
point is far from the mean.
7. Standard Normal Distribution:
The Z-score is related to the standard normal distribution, which is a special case of the normal
distribution. In this distribution:
• The mean is always 0.
• The standard deviation is always 1.
• The shape of the curve is symmetric, with the majority of data points (about 68%) falling
within 1 standard deviation of the mean, 95% within 2 standard deviations, and 99.7%
within 3 standard deviations (the famous 68-95-99.7 rule).
8. Finding the Proportion Between Two Z-scores:
If you want to find the proportion of data between two Z-scores, you can use the following process:
• Look up the cumulative probability for each Z-score in the Z-table.
• Subtract the smaller cumulative probability from the larger cumulative probability.
Example: To find the proportion of data between Z = -1 and Z = 1:
• Find the cumulative probability for Z = 1, which is approximately 0.8413.
• Find the cumulative probability for Z = -1, which is approximately 0.1587.
• Subtract: 0.8413 − 0.1587 = 0.6826.
So, about 68.26% of the data falls between Z-scores of -1 and 1.
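If a Z-table is not at hand, the same lookups can be done in code; a minimal sketch using SciPy's standard normal distribution (an assumption, since the text works from printed tables):
from scipy.stats import norm

# Cumulative probabilities for the standard normal distribution (mean 0, sd 1)
p_upper = norm.cdf(1)    # approximately 0.8413
p_lower = norm.cdf(-1)   # approximately 0.1587
print(round(p_upper - p_lower, 4))   # approximately 0.6827 (Z-table values round this to 0.6826)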
Correlation
Correlation describes the relationship between two variables. There are three main types of correlation in data science: positive, negative, and zero.
1. Positive Correlation: This occurs when both variables move in the same direction. For
example, the more hours you study, the higher your test scores tend to be. Both studying
and scores are increasing together.
2. Negative Correlation: Here, the variables move in opposite directions. An example could
be the relationship between the amount of time spent watching TV and grades. Typically,
as TV time increases, grades might decrease.
3. Zero Correlation: This means there’s no relationship between the variables. For example,
there might be no correlation between the number of books you read and the color of your
car.
These types of correlation help you determine which category the relationship between two
variables falls under. Understanding relationships in this way is part of the data analysis phase
of the data science process lifecycle.
How Do We Measure Correlation in Data Science?
With the help of the previous section, you may have a basic understanding of what correlation in
data science is and what are its types. Now, let’s get into the details of how you can measure
correlation.
Scatter Plot
Scatter plot is one of the most important data visualization techniques and it is considered one of
the Seven Basic Tools of Quality. A scatter plot is used to plot the relationship between two
variables, on a two-dimensional graph that is known as Cartesian Plane on mathematical grounds.
It is generally used to plot the relationship between one independent variable and one dependent
variable, where an independent variable is plotted on the x-axis and a dependent variable is plotted
on the y-axis so that you can visualize the effect of the independent variable on the dependent
variable. These plots are known as Scatter Plot Graph or Scatter Diagram.
Applications of Scatter Plot
As already mentioned, a scatter plot is a very useful data visualization technique. A few
applications of Scatter Plots are listed below.
• Correlation Analysis: Scatter plot is useful in the investigation of the correlation between
two different variables. It can be used to find out whether two variables have a positive
correlation, negative correlation or no correlation.
• Outlier Detection: Outliers are data points, which are different from the rest of the data
set. A Scatter Plot is used to bring out these outliers on the surface.
• Cluster Identification: In some cases, scatter plots can help identify clusters or groups
within the data.
Scatter Plot Graph
A scatter plot is known by several other names, including scatter chart, scattergram, scatter
diagram, and XY graph. A scatter plot is used to visualize a data pair, such that each element gets its
own axis: generally the independent variable gets the x-axis and the dependent variable gets the y-axis.
This layout makes it easier to see what kind of relationship the plotted pair of variables holds.
A scatter plot is therefore useful when we have to find out the relationship between two sets of data,
or when we suspect that a relationship between two variables may be the root cause of some problem.
Now let us understand how to construct a scatter plot and its use case via an example.
How to Construct a Scatter Plot?
To construct a scatter plot, we have to follow the given steps.
Step 1: Identify the independent and dependent variables
Step 2: Plot the independent variable on x-axis
Step 3: Plot the dependent variable on y-axis
Step 4: Extract the meaningful relationship between the given variables.
Let's understand the process through an example. In the following table, a data set of two variables
is given.
Matches Played 2 5 7 1 12 15 18
Goals Scored 1 4 5 2 7 12 11
Now in this data set there are two variables, first is the number of matches played by a certain
player and second is the number of goals scored by that player. Suppose, we aim to find out the
relationship between the number of matches played by a certain player and the number of goals
scored by him/her. For now, let us discard our obvious intuitive understanding that the number of
goals scored is directly proportional to the number of matches played, and instead assume that
we only have the given dataset and must extract the relationship between the data pair from it.
Solution:
X-axis: Number of Matches Played
Y-axis: Number of Goals Scored
Graph:
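The graph itself is not reproduced here; a minimal Matplotlib sketch that recreates it from the table above (Matplotlib assumed to be available, variable names are illustrative):
import matplotlib.pyplot as plt

matches = [2, 5, 7, 1, 12, 15, 18]   # independent variable (x-axis)
goals = [1, 4, 5, 2, 7, 12, 11]      # dependent variable (y-axis)

plt.scatter(matches, goals)
plt.xlabel('Number of Matches Played')
plt.ylabel('Number of Goals Scored')
plt.title('Matches Played vs. Goals Scored')
plt.show()
The points rise from left to right, suggesting a positive correlation between matches played and goals scored.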
Correlation Coefficient Definition
A statistical measure that quantifies the strength and direction of the linear relationship between
two variables is called the Correlation coefficient. Generally, it is denoted by the symbol ‘r’ and
ranges from -1 to 1.
What is Correlation Coefficient Formula?
The correlation coefficient procedure is used to determine how strong the relationship between
two variables is. It yields a value between −1 and 1, in which:
• −1 indicates a perfect negative relationship
• +1 indicates a perfect positive relationship
• 0 implies no linear relationship at all
Understanding Correlation Coefficient
• A correlation coefficient of −1 means that for every positive increase in one variable, there is a
decrease of a fixed proportion in the other; for example, the amount of fuel in a tank decreases in
near-perfect correlation with distance driven.
• A correlation coefficient of 1 means that for every positive increase in one variable, there is an
increase of a fixed proportion in the other; for example, shoe size goes up in perfect correlation
with foot length.
• A correlation coefficient of 0 means that an increase in one variable is associated with neither an
increase nor a decrease in the other. The two just aren’t related.
Types of Correlation Coefficient Formula
Various types of correlation coefficient formulas are:
Pearson’s Correlation Coefficient Formula
Pearson’s Correlation Coefficient Formula is:
r = [n(Σxy) − (Σx)(Σy)] / √{[nΣx² − (Σx)²][nΣy² − (Σy)²]}
Sample Correlation Coefficient Formula
Sample Correlation Coefficient Formula is:
rxy = Sxy / (Sx · Sy)
where,
• Sxy is the sample covariance, Cov(x, y)
• Sx and Sy are the sample standard deviations
Population Correlation Coefficient Formula
Population Correlation Coefficient Formula is:
ρxy = σxy / (σx · σy)
where,
• σx and σy are the population standard deviations
• σxy is the population covariance
Pearson’s Correlation
It is the most common correlation measure in statistics. Its full name is the Pearson Product Moment
Correlation (PPMC). It describes the linear relationship between two sets of data. Two symbols are
used to represent the Pearson correlation: the Greek letter rho (ρ) for a population and the letter r
for a sample correlation coefficient.
How to Find Pearson’s Correlation Coefficient?
Follow the steps added below to find the Pearson’s Correlation Coefficient of any given data set
Step 1: First, make a chart with the given data (for example, subject, x, and y) and add three more
columns to it: xy, x², and y².
Step 2: Multiply the x and y columns to fill the xy column. For example, if x is 24 and y is 65,
then xy is 24 × 65 = 1560.
Step 3: Now, take the square of the numbers in the x column and fill the x² column.
Step 4: Now, take the square of the numbers in the y column and fill the y² column.
Step 5: Now, add up all the values in the columns and put the result at the bottom. Greek letter
sigma (Σ) is the short way of saying summation.
Step 6: Now, use the formula for Pearson’s correlation coefficient:
r = [n(Σxy) − (Σx)(Σy)] / √{[nΣx² − (Σx)²][nΣy² − (Σy)²]}
The sign of r tells you whether the relationship is positive or negative, and its magnitude tells you how strong it is.
To compute the Pearson correlation coefficient (denoted as r) manually, you can use the following
computational formula, which is often more practical than the definitional formula for calculations
involving large datasets. The computational formula for r is:
r = [nΣXY − (ΣX)(ΣY)] / √{[nΣX² − (ΣX)²][nΣY² − (ΣY)²]}
Where:
• n is the number of data points (pairs of X and Y),
• X and Y are the individual data points of the two variables,
• ΣX is the sum of all X values,
• ΣY is the sum of all Y values,
• ΣXY is the sum of the products of corresponding X and Y values,
• ΣX² is the sum of the squares of the X values,
• ΣY² is the sum of the squares of the Y values.
Step-by-Step Calculation:
Let’s go over how to calculate r step by step with an example.
Example:
Suppose we have the following data on the number of hours of study (X) and test scores (Y):
Hours of Study (X) Test Score (Y)
1 55
2 60
3 65
4 70
5 75
We need to compute the Pearson correlation coefficient using the computational formula
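Carrying the computation through with these five pairs (all sums below follow directly from the table):
n = 5, ΣX = 15, ΣY = 325, ΣXY = 1025, ΣX² = 55, ΣY² = 21375
r = [5(1025) − (15)(325)] / √{[5(55) − 15²][5(21375) − 325²]}
  = (5125 − 4875) / √(50 × 1250)
  = 250 / 250 = 1
Because test scores rise by exactly 5 points for every extra hour of study, the correlation is perfect and positive (r = 1). A quick check in Python (NumPy assumed to be available):
import numpy as np

X = np.array([1, 2, 3, 4, 5])
Y = np.array([55, 60, 65, 70, 75])
print(np.corrcoef(X, Y)[0, 1])   # approximately 1.0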
Averages for qualitative and ranked data.
When dealing with qualitative (or categorical) and ranked (or ordinal) data, we use different types
of averages or central tendency measures compared to quantitative data. Let’s go over the types of
averages used for these kinds of data:
1. Qualitative (Categorical) Data:
Qualitative data is data that represents categories or groups, such as gender, colors, types of fruits,
or eye color. These data are typically non-numeric.
For qualitative data, the most appropriate measures of central tendency are:
Mode:
• The mode is the most common category or value in a dataset.
• It represents the category or group that occurs the most frequently.
• The mode is the only measure of central tendency that can be used for nominal data (where
categories have no inherent order, like "red," "blue," and "green" for colors).
Example: If you have the following data on favorite colors:
red, blue, blue, green, red, red, blue
Both red and blue are modes here, since each occurs most frequently (3 times); such a dataset is bimodal.
Frequency Distribution:
• While not a "single average," you can summarize qualitative data with a frequency
distribution to show how many times each category appears.
Example:
Color Frequency
Red 3
Blue 3
Green 1
In this case, you could say that red and blue are the most frequent, each appearing 3 times.
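A quick way to tabulate such frequencies in Python is with collections.Counter; a minimal sketch using the colour data above:
from collections import Counter

colors = ['red', 'blue', 'blue', 'green', 'red', 'red', 'blue']
counts = Counter(colors)
print(counts)                  # Counter({'red': 3, 'blue': 3, 'green': 1})
print(counts.most_common(2))   # [('red', 3), ('blue', 3)] -> two modes, i.e. bimodal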
5 Mark Questions:
1. Explain the concept of a normal distribution and its characteristics. How is the Z-score
related to a normal distribution?
2. Discuss how to calculate the Z-score for a data point and interpret its meaning in the
context of a normal distribution.
3. What is correlation, and how can it be measured? Explain the difference between
positive, negative, and zero correlation with examples.
4. Describe the process of calculating the correlation coefficient for two quantitative
variables. Why is this measure important in statistics?
5. How are averages calculated for qualitative and ranked data? Discuss the methods used
and provide examples.
10 Mark Questions:
1. Explain the concept of normal distributions, the role of Z-scores, and how they can be
used to calculate proportions and find scores. Provide an example.
2. Discuss the correlation between two variables, including how it is represented on a scatter
plot and the calculation of the correlation coefficient. Explain the computational formula
for the correlation coefficient.
3. Describe the steps involved in finding and interpreting Z-scores in a normal distribution.
How do they help in comparing data points from different distributions?
4. What are the different types of correlation (positive, negative, zero correlation), and how
can they be interpreted using a correlation coefficient and scatter plot? Provide real-world
examples.
5. Discuss the use of averages for qualitative and ranked data. How are these averages
calculated, and how do they differ from calculating averages for continuous data?
UNIT 4
Basics of NumPy Arrays
NumPy arrays can be created from Python lists and tuples using np.array().
import numpy as np

# From a list
arr = np.array([1, 2, 3, 4, 5])
# From a tuple
arr2 = np.array((10, 20, 30))
# Multi-dimensional array
arr3 = np.array([[1, 2], [3, 4], [5, 6]])
Data Types
• NumPy automatically infers the data type of the array, but you can also specify it using the
dtype parameter.
arr_int = np.array([1, 2, 3], dtype=int)
arr_float = np.array([1.1, 2.2, 3.3], dtype=float)
arr_complex = np.array([1+2j, 3+4j], dtype=complex)
Array Attributes
• .shape provides the dimensions of the array (rows, columns, etc.).
• .size gives the total number of elements.
• .dtype gives the data type of the array's elements.
• .ndim gives the number of dimensions (axes).
print(arr3.shape) # Output: (3, 2)
print(arr3.size) # Output: 6
print(arr3.dtype) # Output: int64
print(arr3.ndim) # Output: 2 (2D array)
Aggregations in NumPy
Aggregation operations allow you to summarize or reduce the size of your array data.
Common Aggregation Functions:
• Sum: Adds all elements in the array.
arr = np.array([1, 2, 3, 4, 5])
print(np.sum(arr)) # Output: 15
• Mean: Computes the average of the elements.
print(np.mean(arr)) # Output: 3.0
• Minimum and Maximum: Finds the smallest or largest element.
print(np.min(arr)) # Output: 1
print(np.max(arr)) # Output: 5
• Standard Deviation: Measures the spread of data.
print(np.std(arr)) # Output: 1.4142135623730951
• Median: Finds the middle value when the data is ordered.
arr = np.array([1, 3, 2, 5, 4])
print(np.median(arr)) # Output: 3.0
In multi-dimensional arrays, you can specify an axis (rows or columns) to perform aggregations
along.
• Sum along columns (axis=0): Operates across rows.
arr2 = np.array([[1, 2], [3, 4], [5, 6]])
print(np.sum(arr2, axis=0)) # Output: [9 12]
• Sum along rows (axis=1): Operates across columns.
print(np.sum(arr2, axis=1)) # Output: [3 7 11]
Computations on Arrays
Element-wise arithmetic works directly on arrays of the same shape.
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
# Addition
print(arr1 + arr2) # Output: [5 7 9]
# Subtraction
print(arr1 - arr2) # Output: [-3 -3 -3]
# Multiplication
print(arr1 * arr2) # Output: [4 10 18]
# Division
print(arr1 / arr2) # Output: [0.25 0.4 0.5]
Trigonometric Functions
NumPy also provides functions for trigonometric and other mathematical operations.
arr = np.array([0, np.pi/2, np.pi])
print(np.sin(arr)) # sin(0)=0, sin(pi/2)=1, sin(pi) is approximately 1.22e-16 (effectively 0)
Broadcasting
When performing operations on arrays of different shapes, NumPy will attempt to "broadcast" the
smaller array to match the shape of the larger array, provided the shapes are compatible.
arr = np.array([1, 2, 3])
scalar = 10
result = arr * scalar # Element-wise multiplication with a scalar
print(result) # Output: [10 20 30]
Comparisons in NumPy
You can compare elements of NumPy arrays using comparison operators. These operations return
boolean arrays.
Basic Comparisons:
• Equality: ==
• Inequality: !=
• Greater than: >
• Less than: <
arr = np.array([1, 2, 3, 4, 5])
result = arr > 3 # Output: [False False False True True]
Using Logical Functions
You can perform logical operations like "all" or "any" on boolean arrays.
• np.all(): Returns True if all elements are True.
arr = np.array([True, True, False])
print(np.all(arr)) # Output: False
• np.any(): Returns True if any element is True.
print(np.any(arr)) # Output: True
Structured Arrays
A structured array is a type of array that allows for storing multiple fields (different data types) in
each element. These are especially useful for handling complex data such as records.
Creating Structured Arrays
dtype = [('name', 'S20'), ('age', 'i4')]
data = np.array([('Alice', 25), ('Bob', 30)], dtype=dtype)
print(data)
# Output: [(b'Alice', 25) (b'Bob', 30)]
Data Manipulation
Data manipulation includes operations like reshaping, adding, deleting, or joining arrays.
Reshaping Arrays
You can reshape arrays using reshape(). This changes the number of rows and columns without
modifying the original data.
arr = np.array([1, 2, 3, 4, 5, 6])
reshaped = arr.reshape(2, 3) # 2 rows and 3 columns
print(reshaped)
Stacking Arrays
NumPy provides multiple ways to combine arrays, such as np.concatenate(), np.vstack(), and
np.hstack().
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
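A minimal sketch of the three stacking functions named above, followed by an assumed structured array data (with 'category' and 'value' fields) that the field-access and grouping snippets below operate on; the values are illustrative:
print(np.concatenate([arr1, arr2]))  # [1 2 3 4 5 6]
print(np.vstack([arr1, arr2]))       # [[1 2 3]
                                     #  [4 5 6]]
print(np.hstack([arr1, arr2]))       # [1 2 3 4 5 6]

# Assumed structured array used by the following snippets (illustrative values)
data = np.array([('A', 10), ('B', 20), ('A', 30), ('B', 40)],
                dtype=[('category', 'U1'), ('value', 'i4')])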
# Accessing fields
print(data['category']) # Output: ['A' 'B' 'A' 'B']
print(data['value']) # Output: [10 20 30 40]
Grouping and Aggregating with Structured Arrays
You can use structured arrays to perform operations like grouping data by one field and applying
aggregation on another.
# Example of grouping by 'category' and calculating the sum of 'value' for each category
category_A = data[data['category'] == 'A']
category_B = data[data['category'] == 'B']
sum_A = np.sum(category_A['value'])
sum_B = np.sum(category_B['value'])
# Simulate a pivot table by grouping on 'category' (rows) and 'region' (columns).
# This snippet assumes a dataset with 'category', 'region', and 'value' fields;
# a complete, runnable version is sketched after the explanation below.
categories = np.unique(data['category']) # Unique categories (A, B)
regions = np.unique(data['region'])      # Unique regions (North, South)
print(pivot_table)                       # Summed 'value' for each (category, region) pair
Output:
[[60 30] # Sum for Category A: [North, South]
[40 80]] # Sum for Category B: [North, South]
Here:
• We group the data by category and region.
• We then compute the sum of the value for each combination of category and region.
• This is similar to a pivot table, where the rows represent category and the columns represent
region, and the values are the sum of value (a self-contained sketch follows below).
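A self-contained sketch of this idea; the category/region records below are illustrative assumptions, chosen so that the sums match the output shown above:
import numpy as np

# Structured array with two grouping keys and a numeric value (illustrative data)
records = np.array([('A', 'North', 20), ('A', 'North', 40), ('A', 'South', 30),
                    ('B', 'North', 40), ('B', 'South', 30), ('B', 'South', 50)],
                   dtype=[('category', 'U1'), ('region', 'U5'), ('value', 'i4')])

categories = np.unique(records['category'])   # ['A' 'B']
regions = np.unique(records['region'])        # ['North' 'South']

# Build the pivot-like table of summed values: rows = category, columns = region
pivot_table = np.zeros((len(categories), len(regions)), dtype=int)
for i, cat in enumerate(categories):
    for j, reg in enumerate(regions):
        mask = (records['category'] == cat) & (records['region'] == reg)
        pivot_table[i, j] = records['value'][mask].sum()

print(pivot_table)
# [[60 30]
#  [40 80]]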
Using np.histogram2d() for Binning and Pivot-like Operations:
For large datasets, NumPy offers np.histogram2d() to create 2D histograms, which can also be
useful in creating pivot-like summaries of data.
# Binning data into a 2D grid (categories vs. regions).
# np.histogram2d works on numeric coordinates, so categories and regions
# are first encoded as numeric codes (see the sketch below).
category_bins = np.array([0, 1]) # Numeric codes for categories (A, B)
region_bins = np.array([0, 1])   # Numeric codes for regions (North, South)
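A minimal sketch of a pivot-like summary with np.histogram2d; the numeric codes and values are the same illustrative ones as above, and the weights argument carries the values being summed:
import numpy as np

# Numeric codes: category A=0, B=1; region North=0, South=1 (illustrative)
cat_codes = np.array([0, 0, 0, 1, 1, 1])
reg_codes = np.array([0, 0, 1, 0, 1, 1])
values    = np.array([20, 40, 30, 40, 30, 50])

# Sum 'values' into a 2x2 grid of (category, region) bins
sums, _, _ = np.histogram2d(cat_codes, reg_codes, bins=[2, 2], weights=values)
print(sums)
# [[60. 30.]
#  [40. 80.]]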
21. What does the groupby() method in pandas do? a) Groups data based on the unique
values of one or more columns
b) Aggregates data using functions like sum, mean, etc.
c) Both a and b
d) Sorts the data based on a column
Answer: c) Both a and b
22. How do you compute the sum of grouped data in pandas? a)
df.groupby('column_name').sum()
b) df.aggregate('sum')
c) df.sum()
d) df.group_by('column_name').sum()
Answer: a) df.groupby('column_name').sum()
23. Which of the following is a valid aggregation function used with the pandas
groupby() method? a) mean()
b) sum()
c) count()
d) All of the above
Answer: d) All of the above
24. What is the purpose of a pivot table in pandas? a) To sort data
b) To compute statistics for different subsets of data
c) To combine datasets
d) To plot data visually
Answer: b) To compute statistics for different subsets of data
25. Which function is used to create a pivot table in pandas? a) df.pivot()
b) df.pivot_table()
c) df.groupby()
d) pd.crosstab()
Answer: b) df.pivot_table()
5 Mark Questions:
1. Explain the basics of NumPy arrays and how they differ from Python lists. Discuss their
advantages for numerical computations.
2. Describe the process of data manipulation in NumPy, including array indexing, slicing,
and reshaping. Provide examples.
3. What is hierarchical indexing in pandas? Explain how it allows you to work with multi-
level data and its advantages.
4. Discuss the importance of aggregation and grouping in data analysis. How is the
groupby() method used in pandas to aggregate data?
5. Explain how pivot tables work in pandas. Provide an example of how to use pivot_table()
to summarize data.
10 Mark Questions:
1. Discuss the process of aggregation and grouping in pandas. Explain the groupby()
method with an example, highlighting how it can be used to perform various aggregate
operations such as sum, mean, and count.
2. Describe how NumPy arrays are used in data manipulation and computation. Discuss
indexing, slicing, reshaping, and operations on arrays, providing code examples.
3. Explain the concept of structured arrays in NumPy and how they differ from regular
arrays. Provide an example of creating and accessing data in structured arrays.
4. How is missing data handled in NumPy and pandas? Discuss different methods for
dealing with missing data, including using NaN, filling with default values, and using
interpolation techniques.
5. Discuss how to combine datasets in pandas. Explain the use of merge(), concat(), and
join() methods for combining datasets, and provide use cases for each.
UNIT 5
Data Visualization using Matplotlib
Data visualization is a crucial aspect of data analysis, enabling data scientists and analysts to
present complex data in a more understandable and insightful manner. One of the most popular
libraries for data visualization in Python is Matplotlib. In this unit, we provide a guide to using
Matplotlib to create various types of plots, customize them to fit specific needs, and visualize data
effectively.
Visualization control functions such as xlim() and ylim() set the limits of the x- and y-axes.
Line plot
A line chart is one of the basic plots and can be created using the plot() function. It is used to
represent the relationship between two variables, X and Y, plotted on the two axes.
Syntax:
matplotlib.pyplot.plot(*args, scalex=True, scaley=True, data=None, **kwargs)
Example:
Python
import matplotlib.pyplot as plt
plt.show()
Output:
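A complete, minimal version of such a line-plot example (the x and y values are illustrative assumptions):
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]       # illustrative data
y = [2, 4, 1, 5, 3]
plt.plot(x, y)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Simple Line Plot')
plt.show()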
Scatter plots
Scatter plots are used to observe relationships between variables. The scatter() method in the
matplotlib library is used to draw a scatter plot.
Syntax:
matplotlib.pyplot.scatter(x_axis_data, y_axis_data, s=None, c=None, marker=None,
cmap=None, vmin=None, vmax=None, alpha=None, linewidths=None, edgecolors=None)
Example:
Python
import matplotlib.pyplot as plt
import pandas as pd
plt.show()
Output:
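A complete, minimal scatter-plot example (the small data frame below is an illustrative assumption, not the original dataset):
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'total_bill': [10.3, 21.0, 23.7, 24.6, 16.9, 32.4],
                   'tip': [1.7, 3.0, 3.3, 3.6, 2.0, 4.3]})
plt.scatter(df['total_bill'], df['tip'])
plt.xlabel('total_bill')
plt.ylabel('tip')
plt.title('Scatter Plot')
plt.show()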
Histogram
A histogram is used to represent data grouped into bins. It is a type of bar
plot where the x-axis represents the bin ranges while the y-axis gives information about
frequency. The hist() function is used to compute and create a histogram of x.
Syntax:
matplotlib.pyplot.hist(x, bins=None, range=None, density=False, weights=None,
cumulative=False, bottom=None, histtype='bar', align='mid', orientation='vertical',
rwidth=None, log=False, color=None, label=None, stacked=False, *, data=None, **kwargs)
Example:
Python
import matplotlib.pyplot as plt
import pandas as pd
plt.show()
Output:
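A complete, minimal histogram example (the randomly generated data are an illustrative assumption):
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data: 1,000 values drawn from a normal distribution
data = np.random.normal(loc=50, scale=10, size=1000)
plt.hist(data, bins=20, color='skyblue', edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram')
plt.show()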
Bar plot
A bar chart is a graph that represents categories of data with rectangular bars whose lengths
are proportional to the values they represent. Bar plots can be plotted
horizontally or vertically. A bar chart describes comparisons between discrete categories.
It can be created using the bar() method.
In the below example, we will use the tips dataset. Tips database is the record of the tip given by
the customers in a restaurant for two and a half months in the early 1990s. It contains 6 columns
as total_bill, tip, sex, smoker, day, time, size.
Example:
Python
import matplotlib.pyplot as plt
import pandas as pd
plt.show()
Output:
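A complete, minimal bar-plot example (the per-day figures below are an illustrative stand-in for the tips dataset described above):
import matplotlib.pyplot as plt
import pandas as pd

tips = pd.DataFrame({'day': ['Thur', 'Fri', 'Sat', 'Sun'],
                     'total_bill': [17.7, 17.2, 20.4, 21.4]})
plt.bar(tips['day'], tips['total_bill'], color='maroon')
plt.xlabel('day')
plt.ylabel('average total_bill')
plt.title('Bar Plot')
plt.show()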
Visualizing errors
Visualizing errors is an important aspect of data analysis and model evaluation. In machine
learning, data science, and scientific computing, error visualization helps you identify patterns,
understand the distribution of errors, and diagnose problems with models or data. Let's look at
different types of error visualizations and how you can implement them using libraries like
Matplotlib, Seaborn, and NumPy.
1. Basic Error Visualization Concepts
Errors in data analysis or machine learning often come in different forms:
• Absolute Errors: The absolute difference between the predicted and true values.
• Squared Errors: The squared difference between predicted and true values, commonly
used in loss functions like Mean Squared Error (MSE).
• Residuals: The difference between the predicted value and the true value (sometimes
called model residuals).
• Percentage Errors: Error relative to the actual value, useful when considering error as a
proportion of the actual value.
• Bias and Variance: When analyzing model performance, bias refers to error introduced
by overly simplistic models, while variance refers to error introduced by models that are
too complex.
Visualizing these types of errors helps in understanding the performance of your model, detecting
outliers, and checking for assumptions like homoscedasticity (constant variance of errors).
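One common way to show such uncertainty directly on a plot is Matplotlib's errorbar(); a minimal sketch (all data and error values below are illustrative assumptions):
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 6)
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9])     # measured values (illustrative)
yerr = np.array([0.3, 0.2, 0.4, 0.3, 0.5])  # uncertainty in each measurement

plt.errorbar(x, y, yerr=yerr, fmt='o-', capsize=4)
plt.xlabel('Measurement')
plt.ylabel('Value')
plt.title('Measurements with Error Bars')
plt.show()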
Contour Plots
Contour plots are useful for visualizing three-dimensional data in two dimensions. They
represent level curves of a function over a 2D plane. The contour lines indicate regions of equal
value, making it easy to see the shape of a surface.
• What it shows:
Contour plots display how the values of a third variable change across a 2D plane, which
is especially useful in geospatial data, physics, or machine learning.
• Use case:
Used in scientific computing, spatial analysis, and machine learning, especially to
represent decision boundaries of classification models (e.g., decision trees, SVMs).
• Tools:
Contour plots can be created in Python using Matplotlib or Seaborn.
import numpy as np
import matplotlib.pyplot as plt

# Build a grid and evaluate a function Z over it (illustrative surface)
X, Y = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
Z = np.sin(X) * np.cos(Y)

plt.contour(X, Y, Z)
plt.title('Contour Plot')
plt.show()
Histograms
Histograms are one of the most basic and effective ways to visualize the distribution of data.
They group data into bins and show the frequency of data points within each bin.
• What it shows:
Histograms display the distribution of a single continuous variable and provide insight
into the shape of the distribution, such as whether it’s normal, skewed, bimodal, etc.
• Use case:
Used when you want to understand the distribution of numerical data, such as the
frequency of values within a range (e.g., age groups, test scores).
• Tools:
In Python, Matplotlib and Seaborn are often used for histograms. Seaborn also allows
easy customization with sns.histplot() and sns.kdeplot().
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative data: two overlapping normal samples
data1 = np.random.normal(0, 1, 500)
data2 = np.random.normal(1, 1.5, 500)

# Multiple variables on one plot, with KDE curves overlaid
sns.histplot(data1, kde=True, color='blue', label='Data 1')
sns.histplot(data2, kde=True, color='red', label='Data 2')
plt.legend()
plt.show()
Three-Dimensional Plotting
Three-dimensional plots allow you to visualize data that has three continuous variables. This is
essential for understanding relationships in multi-variable datasets and is common in scientific
and engineering applications.
• What it shows:
3D plots help visualize data points in three-dimensional space, showing how three
variables interact or correlate.
• Use case:
Applied in machine learning to visualize high-dimensional relationships, geographic data,
or any problem where three variables are crucial.
• Tools:
Matplotlib offers a 3D plotting toolkit via Axes3D. The plot_surface function is
commonly used for visualizing 3D data.
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection
import numpy as np
import matplotlib.pyplot as plt

# Create 3D data
X = np.linspace(-5, 5, 100)
Y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(X, Y)
Z = np.sin(np.sqrt(X**2 + Y**2))

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, Z, cmap='viridis')
plt.title('3D Surface Plot')
plt.show()
Geographic Data
Libraries such as Folium can render interactive maps directly from Python; the snippet below assumes Folium is installed.
import folium

# Create an interactive map centered on New York City
m = folium.Map(location=[40.7128, -74.0060], zoom_start=12)
Summary:
• Density and Contour Plots: Great for showing the distribution and relationships
between continuous variables, with contour plots being especially useful for visualizing
surfaces and spatial relationships.
• Histograms: Ideal for understanding the distribution of a single variable and visualizing
frequency distributions.
• Binning and Density: Binning helps with summarizing and smoothing data, while
density plots estimate the underlying distribution.
• 3D Plotting: Useful for visualizing relationships between three continuous variables.
• Geographic Data: Visualization tools like Geopandas, Folium, and Plotly help display
spatial data on maps, which is essential for geographic analysis, heatmaps, and
understanding spatial patterns.
1. Which function in Matplotlib is used to create a line plot? a) plt.scatter()
b) plt.plot()
c) plt.bar()
d) plt.hist()
Answer: b) plt.plot()
2. What type of plot would you use to visualize the relationship between two variables
in a dataset? a) Line plot
b) Histogram
c) Scatter plot
d) Box plot
Answer: c) Scatter plot
3. What does plt.xlabel() do in Matplotlib? a) Sets the title of the plot
b) Labels the x-axis
c) Labels the y-axis
d) Adds a legend to the plot
Answer: b) Labels the x-axis
4. How can you display a plot in Matplotlib after defining it? a) plt.show()
b) plt.draw()
c) plt.savefig()
d) plt.figure()
Answer: a) plt.show()
5. What is the default color of lines in Matplotlib plots? a) Red
b) Blue
c) Green
d) Black
Answer: b) Blue
6. In a scatter plot, what does each point represent? a) A single data value
b) A relationship between two variables
c) The average of a dataset
d) A histogram of data
Answer: b) A relationship between two variables
7. Which function in Matplotlib is used to add error bars to a plot? a) plt.errorbar()
b) plt.plot()
c) plt.scatter()
d) plt.fill_between()
Answer: a) plt.errorbar()
8. What is the purpose of visualizing errors in data? a) To identify the range of variation
or uncertainty in data
b) To fit the data to a model
c) To identify outliers
d) To smooth the data
Answer: a) To identify the range of variation or uncertainty in data
9. How would you visualize the distribution of errors in data? a) By using a scatter plot
b) By using a bar chart
c) By using a histogram or density plot
d) By using a line plot
Answer: c) By using a histogram or density plot
10. What is a density plot used for in data visualization? a) To show the frequency of data
points
b) To represent the distribution of data over continuous intervals
c) To visualize the relationship between categorical data
d) To show the linear relationship between two variables
Answer: b) To represent the distribution of data over continuous intervals
11. Which Matplotlib function is used to create a density plot? a) plt.plot()
b) plt.bar()
c) plt.hist() with density=True
d) plt.scatter()
Answer: c) plt.hist() with density=True
12. What is a contour plot used for? a) To display a 3D surface
b) To visualize 2D data with levels of equal value
c) To show error bars on a plot
d) To plot the cumulative distribution function
Answer: b) To visualize 2D data with levels of equal value
13. How can you add contour lines to a plot in Matplotlib? a) plt.contour()
b) plt.plot()
c) plt.hist()
d) plt.errorbar()
Answer: a) plt.contour()
14. What does a histogram display? a) The distribution of categorical data
b) The frequency distribution of continuous numerical data
c) A relationship between two variables
d) The trend over time
Answer: b) The frequency distribution of continuous numerical data
15. Which function is used to create a histogram in Matplotlib? a) plt.bar()
b) plt.hist()
c) plt.plot()
d) plt.boxplot()
Answer: b) plt.hist()
16. What is the purpose of binning in a histogram? a) To group continuous data into
discrete intervals
b) To sort data by its value
c) To smooth the data
d) To combine data from different sources
Answer: a) To group continuous data into discrete intervals
17. What does the density=True argument do in plt.hist()? a) It normalizes the histogram
to show probability density
b) It reduces the number of bins
c) It increases the number of bins
d) It adds color to the histogram bars
Answer: a) It normalizes the histogram to show probability density
18. Which of the following can be plotted to visualize data distribution and detect
outliers? a) Scatter plot
b) Histogram
c) Box plot
d) All of the above
Answer: d) All of the above
22. Which library in Python is commonly used for geographic data visualization? a)
Seaborn
b) Matplotlib
c) Cartopy
d) Pandas
Answer: c) Cartopy
23. Which type of plot is used to visualize geographic data over maps? a) Line plot
b) Heatmap
c) Choropleth map
d) Box plot
Answer: c) Choropleth map
24. Which of the following is used for plotting geographic data on a map in Matplotlib?
a) plt.hist()
b) plt.pie()
c) plt.bar()
d) plt.plot() with Cartopy or Basemap
Answer: d) plt.plot() with Cartopy or Basemap
25. What is the purpose of using geographic data visualization tools like Cartopy or
Basemap in Python? a) To analyze text data
b) To map data onto geographic regions, helping to identify spatial patterns
c) To visualize time-series data
d) To generate statistical plots
Answer: b) To map data onto geographic regions, helping to identify spatial patterns
5 Mark Questions:
1. Explain the use of line plots in data visualization. How are they created in Matplotlib, and
in what cases are they useful?
2. Describe how scatter plots work and when they are useful in data analysis. How do you
add a regression line to a scatter plot in Matplotlib?
3. What is a contour plot, and how does it help in visualizing the relationships between three
variables? Explain with an example.
4. Explain histograms and their importance in data visualization. How do you perform
binning, and what is the significance of setting the density=True argument in plt.hist()?
5. What are density plots, and how do they differ from histograms? Discuss how density
plots can be created in Matplotlib and their use in visualizing data distribution.
10 Mark Questions:
1. Discuss how Matplotlib can be used to create different types of plots, such as line plots,
scatter plots, and histograms. Provide code examples for each type.
2. Explain how error bars are added to plots in Matplotlib. Provide an example where error
bars are used to represent uncertainty in data.
3. Describe the process of creating a 3D plot in Matplotlib. Include details on how to set up
a 3D axis and create 3D scatter plots and surface plots.
4. Discuss the use of geographic data visualization in Python. Explain how Cartopy or
Basemap can be used to visualize geographic data, including a discussion of choropleth
maps and their applications.
5. Explain the importance of visualizing the distribution of data. Compare and contrast
histograms, box plots, and density plots, and discuss their strengths and limitations in
different scenarios.