
DEPARTMENT OF CS(AI & DS)

FUNDAMENTALS OF DATA SCIENCE

COURSE CODE: 23UAD04

2024-2025
SYLLABUS

UNIT -I
Need for data science –benefits and uses –facets of data – data science process –setting the
research goal – retrieving data –cleansing, integrating and transforming data –exploratory data
analysis –build the models – presenting and building applications.
UNIT-II
Frequency distributions – Outliers –relative frequency distributions –cumulative frequency
distributions – frequency distributions for nominal data –interpreting distributions –graphs –
averages –mode –median –mean
UNIT-III
Normal distributions –z scores –normal curve problems – finding proportions –finding scores –
more about z scores –correlation –scatter plots –correlation coefficient for quantitative data –
computational formula for correlation coefficient-averages for qualitative and ranked data.
UNIT-IV
Basics of Numpy arrays, aggregations, computations on arrays, comparisons, structured arrays,
Data manipulation, data indexing and selection, operating on data, missing data, hierarchical
indexing, combining datasets –aggregation and grouping, pivot tables
UNIT-V
Visualization with matplotlib, line plots, scatter plots, visualizing errors, density and contour
plots, histograms, binnings, and density, three dimensional plotting, geographic data

Text Books:
1. David Cielen, Arno D. B. Meysman, and Mohamed Ali, "Introducing Data Science", Manning Publications, 2016.
2. Robert S. Witte and John S. Witte, "Statistics", Eleventh Edition, Wiley Publications, 2017.
3. Jake VanderPlas, "Python Data Science Handbook", O'Reilly, 2016.
References:
1. Allen B. Downey, "Think Stats: Exploratory Data Analysis in Python", Green Tea Press, 2014.
Web Resources
● [Link]
● [Link]
● [Link]
Mapping with Programme Outcomes:

S-Strong-3 M-Medium-2 L-Low-1

CO/PSO                                        PSO 1   PSO 2   PSO 3   PSO 4   PSO 5   PSO 6
CO1                                             3       3       3       3       3       2
CO2                                             3       3       3       2       2       3
CO3                                             2       2       2       3       3       3
CO4                                             3       3       3       3       3       2
CO5                                             3       3       3       3       3       1
Weightage of course contributed to each PSO    14      14      14      14      14      11
UNIT -I
What is Data Science?
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and
systems to extract knowledge and insights from structured and unstructured data. In simpler terms,
data science is about obtaining, processing, and analyzing data to gain insights for many purposes.
Data Science is a combination of multiple disciplines that uses statistics, data analysis, and
machine learning to analyze data and to extract knowledge and insights from it.

Data Science is about data gathering, analysis and decision-making.
Data Science is about finding patterns in data through analysis, and making future predictions.
By using Data Science, companies are able to make:
● Better decisions (should we choose A or B)
● Predictive analysis (what will happen next?)
● Pattern discoveries (find pattern, or maybe hidden information in the data)

Where is Data Science Needed?


Data Science is used in many industries in the world today, e.g. banking, consultancy, healthcare,
and manufacturing.
Examples of where Data Science is needed:
● For route planning: To discover the best routes to ship
● To foresee delays for flight/ship/train etc. (through predictive analysis)
● To create promotional offers
● To find the best suited time to deliver goods
● To forecast next year's revenue for a company
● To analyze health benefit of training
● To predict who will win elections
Data Science can be applied in nearly every part of a business where data is available. Examples
are:
● Consumer goods
● Stock markets
● Industry
● Politics
● Logistic companies
● E-commerce

The data science lifecycle


The data science lifecycle refers to the various stages a data science project generally undergoes,
from initial conception and data collection to communicating results and insights.

Despite every data science project being unique—depending on the problem, the industry it's
applied in, and the data involved—most projects follow a similar lifecycle.

This lifecycle provides a structured approach for handling complex data, drawing accurate
conclusions, and making data-driven decisions.

Here are the five main phases that structure the data science lifecycle:

Data collection and storage

This initial phase involves collecting data from various sources, such as databases, Excel files, text
files, APIs, web scraping, or even real-time data streams. The type and volume of data collected
largely depend on the problem you’re addressing.

Once collected, this data is stored in an appropriate format ready for further processing. Storing
the data securely and efficiently is important to allow quick retrieval and processing.

Data preparation
Often considered the most time-consuming phase, data preparation involves cleaning and
transforming raw data into a suitable format for analysis. This phase includes handling missing or
inconsistent data, removing duplicates, normalization, and data type conversions. The objective is
to create a clean, high-quality dataset that can yield accurate and reliable analytical results.

Exploration and visualization

During this phase, data scientists explore the prepared data to understand its patterns,
characteristics, and potential anomalies. Techniques like statistical analysis and data visualization
summarize the data's main characteristics, often with visual methods.

Visualization tools, such as charts and graphs, make the data more understandable, enabling
stakeholders to comprehend the data trends and patterns better.

Experimentation and prediction

Data scientists use machine learning algorithms and statistical models to identify patterns, make
predictions, or discover insights in this phase. The goal here is to derive something significant
from the data that aligns with the project's objectives, whether predicting future outcomes,
classifying data, or uncovering hidden patterns.

Data Storytelling and communication

The final phase involves interpreting and communicating the results derived from the data analysis.
It's not enough to have insights; you must communicate them effectively, using clear, concise
language and compelling visuals. The goal is to convey these findings to non-technical
stakeholders in a way that influences decision-making or drives strategic initiatives.

Understanding and implementing this lifecycle allows for a more systematic and successful
approach to data science projects. Let's now delve into why data science is so important.

Why is Data Science Important?

Data science has emerged as a revolutionary field that is crucial in generating insights from data
and transforming businesses. It's not an overstatement to say that data science is the backbone of
modern industries. But why has it gained so much significance?

● Data volume. Firstly, the rise of digital technologies has led to an explosion of data. Every
online transaction, social media interaction, and digital process generates data. However,
this data is valuable only if we can extract meaningful insights from it. And that's precisely
where data science comes in.
● Value-creation. Secondly, data science is not just about analyzing data; it's about
interpreting and using this data to make informed business decisions, predict future trends,
understand customer behavior, and drive operational efficiency. This ability to drive
decision-making based on data is what makes data science so essential to organizations.
● Career options. Lastly, the field of data science offers lucrative career opportunities. With
the increasing demand for professionals who can work with data, jobs in data science are
among the highest paying in the industry. As per Glassdoor, the average salary for a data
scientist in the United States is $137,984, making it a rewarding career choice.

What is Data Science Used For?

Data science is used for an array of applications, from predicting customer behavior to optimizing
business processes. The scope of data science is vast and encompasses various types of analytics.

● Descriptive analytics. Analyzes past data to understand the current state and identify trends. For instance, a retail store might use it to analyze last quarter's sales or
identify best-selling products.
● Diagnostic analytics. Explores data to understand why certain events occurred, identifying
patterns and anomalies. If a company's sales fall, it would identify whether poor product
quality, increased competition, or other factors caused it.
● Predictive analytics. Uses statistical models to forecast future outcomes based on past data,
used widely in finance, healthcare, and marketing. A credit card company may employ it
to predict customer default risks.
● Prescriptive analytics. Suggests actions based on results from other types of analytics to
mitigate future problems or leverage promising trends. For example, a navigation app
advising the fastest route based on current traffic conditions.

The increasing sophistication from descriptive to diagnostic to predictive to prescriptive analytics can provide companies with valuable insights to guide decision-making and strategic planning.

What are the Benefits of Data Science?

Data science can add value to any business which uses its data effectively. From statistics to
predictions, effective data-driven practices can put a company on the fast track to success. Here
are some ways in which data science is used:

Optimize business processes

Data Science can significantly improve a company's operations in various departments, from
logistics and supply chain to human resources and beyond. It can help in resource allocation,
performance evaluation, and process automation. For example, a logistics company can use data
science to optimize routes, reduce delivery times, save fuel costs, and improve customer
satisfaction.

Unearth new insights

Data Science can uncover hidden patterns and insights that might not be evident at first glance.
These insights can provide companies with a competitive edge and help them understand their
business better. For instance, a company can use customer data to identify trends and preferences,
enabling them to tailor their products or services accordingly.

Create innovative products and solutions

Companies can use data science to innovate and create new products or services based on customer
needs and preferences. It also allows businesses to predict market trends and stay ahead of the
competition. For example, streaming services like Netflix use data science to understand viewer
preferences and create personalized recommendations, enhancing user experience.

Which Industries Use Data Science?

The implications of data science span across all industries, fundamentally changing how
organizations operate and make decisions. While every industry stands to gain from implementing
data science, it's especially influential in data-rich sectors.

Let's delve deeper into how data science is revolutionizing these key industries:

Data science applications in finance

The finance sector has been quick to harness the power of data science. From fraud detection and
algorithmic trading to portfolio management and risk assessment, data science has made complex
financial operations more efficient and precise. For instance, credit card companies utilize data
science techniques to detect and prevent fraudulent transactions, saving billions of dollars
annually.


Data science applications in healthcare

Healthcare is another industry where data science has a profound impact. Applications range from
predicting disease outbreaks and improving patient care quality to enhancing hospital management
and drug discovery. Predictive models help doctors diagnose diseases early, and treatment plans
can be customized according to the patient's specific needs, leading to improved patient outcomes.

Data science applications in marketing

Marketing is a field that has been significantly transformed by the advent of data science. The
applications in this industry are diverse, ranging from customer segmentation and targeted
advertising to sales forecasting and sentiment analysis. Data science allows marketers to
understand consumer behavior in unprecedented detail, enabling them to create more effective
campaigns. Predictive analytics can also help businesses identify potential market trends, giving
them a competitive edge. Personalization algorithms can tailor product recommendations to
individual customers, thereby increasing sales and customer satisfaction.


Data science applications in technology

Technology companies are perhaps the most significant beneficiaries of data science. From
powering recommendation engines to enhancing image and speech recognition, data science finds
applications in diverse areas. Ride-hailing platforms, for example, rely on data science for
connecting drivers with ride hailers and optimizing the supply of drivers depending on the time of
day.

How is Data Science Different from Other Data-Related Fields?

While data science overlaps with many fields that also work with data, it carries a unique blend of
principles, tools, and techniques designed to extract insightful patterns from data.

Distinguishing between data science and these related fields can give a better understanding of the
landscape and help in setting the right career path. Let's demystify these differences.

Data science vs data analytics

Data science and data analytics both serve crucial roles in extracting value from data, but their
focuses differ. Data science is an overarching field that uses methods including machine learning
and predictive analytics, to draw insights from data. In contrast, data analytics concentrates on
processing and performing statistical analysis on existing datasets to answer specific questions.

Data science vs business analytics

While business analytics also deals with data analysis, it is more centered on leveraging data for
strategic business decisions. It is generally less technical and more business-focused than data
science. Data science, though it can inform business strategies, often dives deeper into the technical
aspects, like programming and machine learning.

Data science vs data engineering

Data engineering focuses on building and maintaining the infrastructure for data collection,
storage, and processing, ensuring data is clean and accessible. Data science, on the other hand,
analyzes this data, using statistical and machine learning models to extract valuable insights that
influence business decisions. In essence, data engineers create the data 'roads', while data scientists
'drive' on them to derive meaningful insights. Both roles are vital in a data-driven organization.

Data science vs machine learning

Machine learning is a subset of data science, concentrating on creating and implementing algorithms that let machines learn from and make decisions based on data. Data science, however, is broader and incorporates many techniques, including machine learning, to extract meaningful information from data.

Data Science vs Statistics

Statistics, a mathematical discipline dealing with data collection, analysis, interpretation, and
organization, is a key component of data science. However, data science integrates statistics with
other methods to extract insights from data, making it a more multidisciplinary field.

Field               Focus                                                          Technical Emphasis
Data Science        Driving value with data across the 4 levels of analytics       Programming, ML, statistics
Data Analytics      Performing statistical analysis on existing datasets           Statistical analysis
Business Analytics  Leveraging data for strategic business decisions               Business strategies, data analysis
Data Engineering    Building and maintaining data infrastructure                   Data collection, storage, processing
Machine Learning    Creating and implementing machine learning algorithms          Algorithm development, model implementation
Statistics          Data collection, analysis, interpretation, and organization    Statistical analysis, mathematical principles

Having understood these distinctions, we can now delve into the key concepts every data scientist
needs to master.
Key Data Science Concepts
A successful data scientist doesn't just need technical skills but also an understanding of core
concepts that form the foundation of the field. Here are some key concepts to grasp:
Statistics and probability
These are the bedrock of data science. Statistics is used to derive meaningful insights from data,
while probability allows us to make predictions about future events based on available data.
Understanding distributions, statistical tests, and probability theories is essential for any data
scientist.

OVERVIEW OF THE DATA SCIENCE PROCESS


The typical data science process consists of six steps through which you'll iterate, as outlined below.

➢ The first step of this process is setting a research goal. The main purpose here is making sure
all the stakeholders understand the what, how, and why of the project. In every serious project this
will result in a project charter.
➢ The second phase is data retrieval. You want to have data available for analysis, so this step
includes finding suitable data and getting access to the data from the data owner. The result is data
in its raw form, which probably needs polishing and transformation before it becomes usable.
➢ Now that you have the raw data, it’s time to prepare it. This includes transforming the data from
a raw form into data that’s directly usable in your models. To achieve this, you’ll detect and correct
different kinds of errors in the data, combine data from different data sources, and transform it. If
you have successfully completed this step, you can progress to data visualization and modeling.
➢ The fourth step is data exploration. The goal of this step is to gain a deep understanding of the
data. You’ll look for patterns, correlations, and deviations based on visual and descriptive
techniques. The insights you gain from this phase will enable you to start modeling.
➢ Finally, we get to model building (often referred to as “data modeling” throughout this book).
It is now that you attempt to gain the insights or make the predictions stated in your project charter.
Now is the time to bring out the heavy guns, but remember research has taught us that often (but
not always) a combination of simple models tends to outperform one complicated model. If you’ve
done this phase right, you’re almost done.
➢ The last step of the data science process is presenting your results and automating the analysis,
if needed. One goal of a project is to change a process and/or make better decisions. You may still
need to convince the business that your findings will indeed change the business process as
expected. This is where you can shine in your influencer role. The importance of this step is more
apparent in projects on a strategic and tactical level. Certain projects require you to perform the
business process over and over again, so automating the project will save time.
DEFINING RESEARCH GOALS
A project starts by understanding the what, the why, and the how of your project. The outcome
should be a clear research goal, a good understanding of the context, well-defined deliverables,
and a plan of action with a timetable. This information is then best placed in a project charter.
Spend time understanding the goals and context of your research:
➢ An essential outcome is the research goal that states the purpose of your assignment in a clear
and focused manner.
➢ Understanding the business goals and context is critical for project success.
➢ Continue asking questions and devising examples until you grasp the exact business expectations, identify how your project fits in the bigger picture, appreciate how your research is going to change the business, and understand how they'll use your results.

Create a project charter
A project charter requires teamwork, and your input covers at least the following:
➢ A clear research goal
➢ The project mission and context
➢ How you’re going to perform your analysis
➢ What resources you expect to use
➢ Proof that it’s an achievable project, or proof of concepts
➢ Deliverables and a measure of success
➢ A timeline
RETRIEVING DATA
➢ The next step in data science is to retrieve the required data. Sometimes you need to go into
the field and design a data collection process yourself, but most of the time you won’t be involved
in this step.
➢ Many companies will have already collected and stored the data for you, and what they don’t
have can often be bought from third parties.
➢ More and more organizations are making even high-quality data freely available for public and
commercial use.
➢ Data can be stored in many forms, ranging from simple text files to tables in a database. The objective now is acquiring all the data you need.

Start with data stored within the company (internal data)
➢ Most companies have a program for maintaining key data; so much of the cleaning work may
already be done. This data can be stored in official data repositories such as databases, data marts,
data warehouses, and data lakes maintained by a team of IT professionals.
➢ Data warehouses and data marts are home to pre-processed data, data lakes contain data in its
natural or raw format.
➢ Finding data even within your own company can sometimes be a challenge. As companies
grow, their data becomes scattered around many places. The data may be dispersed as people
change positions and leave the company.
➢ Getting access to data is another difficult task. Organizations understand the value and
sensitivity of data and often have policies in place so everyone has access to what they need and
nothing more.
➢ These policies translate into physical and digital barriers called Chinese walls. These "walls" are mandatory and well-regulated for customer data in most countries.

External data
➢ If data isn't available inside your organization, look outside your organization. Companies
provide data so that you, in turn, can enrich their services and ecosystem. Such is the case with
Twitter, LinkedIn, and Facebook.
➢ More and more governments and organizations share their data for free with the world.
➢ Published lists of open data providers can help get you started.

Data cleaning
Data cleaning, also known as data cleansing or data scrubbing, is a crucial step in the data science
pipeline. It involves identifying and correcting errors, inconsistencies, and inaccuracies in datasets
to improve their quality and reliability. This process ensures that the data used for analysis and
modeling is accurate, complete, and suitable for its intended purpose.
In this section, we'll explore the importance of data cleaning, common issues that data scientists encounter, and various techniques and best practices for effective data cleaning.
The Importance of Data Cleaning
Data cleaning plays a vital role in the data science process for several reasons:
Data Quality: Clean data leads to more accurate analyses and reliable insights. Poor data quality
can result in flawed conclusions and misguided decisions.
Model Performance: Machine learning models trained on clean data tend to perform better and
generalize more effectively to new, unseen data.
Efficiency: Clean data reduces the time and resources spent on troubleshooting and fixing issues
during later stages of analysis or model development.
Consistency: Data cleaning helps ensure consistency across different data sources and formats,
making it easier to integrate and analyze data from multiple origins.
Compliance: In many industries, clean and accurate data is essential for regulatory compliance
and reporting purposes.
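
To make these ideas concrete, here is a minimal pandas sketch of typical cleaning steps. The DataFrame and its columns (age, city) are invented for illustration; a real project would load data from a file or database.

```python
import pandas as pd
import numpy as np

# Hypothetical raw data with the usual problems: a missing value,
# a duplicate row, an inconsistent category label, and an implausible age.
raw = pd.DataFrame({
    "age":  [25, np.nan, 32, 32, 120],
    "city": ["Chennai", "chennai", "Madurai", "Madurai", "Salem"],
})

clean = (
    raw.drop_duplicates()                             # remove exact duplicate rows
       .assign(city=lambda d: d["city"].str.title())  # normalize inconsistent labels
)

# Fill missing ages with the median and drop out-of-range values.
clean["age"] = clean["age"].fillna(clean["age"].median())
clean = clean[clean["age"].between(0, 100)]

print(clean)
```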

Exploratory data analysis is one of the basic and essential steps of a data science project. A data scientist typically spends a large share of the work, often cited as around 70%, on EDA of the dataset. This section discusses what Exploratory Data Analysis (EDA) is and the steps to perform it.

What is Exploratory Data Analysis (EDA)?


Exploratory Data Analysis (EDA) is a crucial initial step in data science projects. It involves analyzing and visualizing data to understand its key characteristics, uncover patterns, locate outliers, and identify relationships between variables. EDA is normally carried out as a preliminary step before undertaking more formal statistical analyses or modeling.
Key aspects of EDA include:
Distribution of Data: Examining the distribution of data points to understand their range, central
tendencies (mean, median), and dispersion (variance, standard deviation).
Graphical Representations: Utilizing charts such as histograms, box plots, scatter plots, and bar
charts to visualize relationships within the data and distributions of variables.
Outlier Detection: Identifying unusual values that deviate from other data points. Outliers can
influence statistical analyses and might indicate data entry errors or unique cases.
Correlation Analysis: Checking the relationships between variables to understand how they
might affect each other. This includes computing correlation coefficients and creating correlation
matrices.
Handling Missing Values: Detecting and deciding how to address missing data points, whether
by imputation or removal, depending on their impact and the amount of missing data.
Summary Statistics: Calculating key statistics that provide insight into data trends and nuances.
Testing Assumptions: Many statistical tests and models assume the data meet certain conditions
(like normality or homoscedasticity). EDA helps verify these assumptions.
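
The short sketch below shows how a few of these aspects (summary statistics, correlation, and a missing-value check) might look in pandas. The tiny DataFrame and its column names are hypothetical stand-ins for a real dataset.

```python
import pandas as pd

# Hypothetical dataset; in practice this would be loaded from a file or database.
df = pd.DataFrame({
    "hours_studied": [2, 4, 6, 8, 10, 12],
    "exam_score":    [45, 55, 62, 70, 82, 90],
})

print(df.describe())       # central tendency and dispersion of each column
print(df.corr())           # correlation matrix between quantitative variables
print(df.isnull().sum())   # missing values per column
```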
Why Exploratory Data Analysis is Important?
Exploratory Data Analysis (EDA) is important for several reasons, especially in the context of data
science and statistical modeling. Here are some of the key reasons why EDA is a critical step in
the data analysis process:
Understanding Data Structures: EDA helps in getting familiar with the dataset, understanding
the number of features, the type of data in each feature, and the distribution of data points. This
understanding is crucial for selecting appropriate analysis or prediction techniques.
Identifying Patterns and Relationships: Through visualizations and statistical summaries, EDA
can reveal hidden patterns and intrinsic relationships between variables. These insights can guide
further analysis and enable more effective feature engineering and model building.
Detecting Anomalies and Outliers: EDA is essential for identifying errors or unusual data points
that may adversely affect the results of your analysis. Detecting these early can prevent costly
mistakes in predictive modeling and analysis.
Testing Assumptions: Many statistical models assume that data follow a certain distribution or
that variables are independent. EDA involves checking these assumptions. If the assumptions do
not hold, the conclusions drawn from the model could be invalid.
Informing Feature Selection and Engineering: Insights gained from EDA can inform which
features are most relevant to include in a model and how to transform them (scaling, encoding) to
improve model performance.
Optimizing Model Design: By understanding the data’s characteristics, analysts can choose
appropriate modeling techniques, decide on the complexity of the model, and better tune model
parameters.
Facilitating Data Cleaning: EDA helps in spotting missing values and errors in the data, which
are critical to address before further analysis to improve data quality and integrity.
Enhancing Communication: Visual and statistical summaries from EDA can make it easier to
communicate findings and convince others of the validity of your conclusions, particularly when
explaining data-driven insights to stakeholders without technical backgrounds.
Types of Exploratory Data Analysis
EDA, or Exploratory Data Analysis, refers to the process of analyzing and examining datasets to uncover patterns, identify relationships, and gain insights. Various EDA techniques can be employed depending on the nature of the data and the goals of the analysis. Depending on the number of variables analyzed at a time, EDA can be divided into three types: univariate, bivariate, and multivariate.
1. Univariate Analysis
Univariate analysis focuses on a single variable to understand its internal structure. It is primarily concerned with describing the data and finding patterns within a single feature. It involves summarizing and visualizing one variable at a time to understand its distribution, central tendency, spread, and other relevant characteristics. Common techniques include the following (a short sketch in Python follows the list):
Histograms: Used to visualize the distribution of a variable.
Box plots: Useful for detecting outliers and understanding the spread and skewness of the data.
Bar charts: Employed for categorical data to show the frequency of each category.
Summary statistics: Calculations like mean, median, mode, variance, and standard deviation that
describe the central tendency and dispersion of the data.
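
A minimal sketch of univariate analysis with pandas and Matplotlib, assuming a small invented series of monthly sales figures:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical single variable: monthly sales figures (one value, 60, is unusual).
sales = pd.Series([12, 15, 14, 18, 22, 14, 19, 25, 60, 17], name="monthly_sales")

print(sales.describe())               # mean, std, quartiles
print("median:", sales.median())
print("mode:", sales.mode().tolist())

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(sales, bins=5)           # histogram of the distribution
axes[0].set_title("Histogram")
axes[1].boxplot(sales)                # box plot highlights the outlier
axes[1].set_title("Box plot")
plt.tight_layout()
plt.show()
```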
2. Bivariate Analysis
Bivariate analysis involves exploring the relationship between two variables. It helps find associations, correlations, and dependencies between pairs of variables. Some key techniques used in bivariate analysis (a short sketch follows the list):
Scatter Plots: These are one of the most common tools used in bivariate analysis. A scatter plot
helps visualize the relationship between two continuous variables.
Correlation Coefficient: This statistical measure (often Pearson’s correlation coefficient for
linear relationships) quantifies the degree to which two variables are related.
Cross-tabulation: Also known as contingency tables, cross-tabulation is used to analyze the
relationship between two categorical variables. It shows the frequency distribution of categories
of one variable in rows and the other in columns, which helps in understanding the relationship
between the two variables.
Line Graphs: In the context of time series data, line graphs can be used to compare two variables
over time. This helps in identifying trends, cycles, or patterns that emerge in the interaction of the
variables over the specified period.
Covariance: Covariance is a measure used to determine how much two random variables change
together. However, it is sensitive to the scale of the variables, so it’s often supplemented by the
correlation coefficient for a more standardized assessment of the relationship.
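
A short sketch of bivariate analysis using NumPy and Matplotlib; the paired ad-spend and sales values are invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical paired observations: advertising spend vs. sales.
ad_spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
sales    = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

# Pearson correlation coefficient and covariance.
r = np.corrcoef(ad_spend, sales)[0, 1]
cov = np.cov(ad_spend, sales)[0, 1]
print(f"Pearson r = {r:.3f}, covariance = {cov:.3f}")

# Scatter plot to visualize the relationship.
plt.scatter(ad_spend, sales)
plt.xlabel("Ad spend")
plt.ylabel("Sales")
plt.title("Bivariate relationship")
plt.show()
```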
3. Multivariate Analysis
Multivariate analysis examines the relationships among more than two variables in the dataset. It
aims to understand how variables interact with one another, which is crucial for most statistical
modeling techniques. Techniques include:
Pair plots: Visualize relationships across several variables simultaneously to capture a
comprehensive view of potential interactions.
Principal Component Analysis (PCA): A dimensionality reduction technique used to reduce the
dimensionality of large datasets, while preserving as much variance as possible.
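
A brief sketch of multivariate exploration using Seaborn's pair plot and scikit-learn's PCA; the three feature columns are synthetic and invented for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Hypothetical dataset with three numeric features, two of them correlated.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
df = pd.DataFrame({
    "feature_a": x,
    "feature_b": x * 0.8 + rng.normal(scale=0.3, size=100),
    "feature_c": rng.normal(size=100),
})

# Pair plot: all pairwise scatter plots plus per-variable distributions.
sns.pairplot(df)
plt.show()

# PCA: project the three features onto two principal components.
pca = PCA(n_components=2)
components = pca.fit_transform(df)
print("explained variance ratio:", pca.explained_variance_ratio_)
```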
Specialized EDA Techniques
In addition to univariate and multivariate analysis, there are specialized EDA techniques tailored
for specific types of data or analysis needs:
Spatial Analysis: For geographical data, using maps and spatial plotting to understand the
geographical distribution of variables.
Text Analysis: Involves techniques like word clouds, frequency distributions, and sentiment
analysis to explore text data.
Time Series Analysis: This type of analysis is mainly applied to datasets that have a temporal component. Time series analysis involves inspecting and modeling patterns, trends, and seasonality in the data over time. Techniques like line plots, autocorrelation analysis, moving averages, and ARIMA (AutoRegressive Integrated Moving Average) models are commonly used in time series analysis.
Tools for Performing Exploratory Data Analysis
Exploratory Data Analysis (EDA) can be effectively performed using a variety of tools and
software, each offering unique features suitable for handling different types of data and analysis
requirements.
1. Python Libraries
Pandas: Provides extensive functions for data manipulation and analysis, including data structure
handling and time series functionality.
Matplotlib: A plotting library for creating static, interactive, and animated visualizations in Python.
Seaborn: Built on top of Matplotlib, it provides a high-level interface for drawing attractive and
informative statistical graphics.
Plotly: An interactive graphing library for making interactive plots and offers more sophisticated
visualization capabilities.
2. R Packages
ggplot2: Part of the tidyverse, it’s a powerful tool for making complex plots from data in a data
frame.
dplyr: A grammar of data manipulation, providing a consistent set of verbs that help you solve
the most common data manipulation challenges.
tidyr: Helps to tidy your data. Tidying your data means storing it in a consistent form that matches
the semantics of the dataset with the way it is stored.
Steps for Performing Exploratory Data Analysis
Performing Exploratory Data Analysis (EDA) involves a series of steps designed to help you
understand the data you’re working with, uncover underlying patterns, identify anomalies, test
hypotheses, and ensure the data is clean and suitable for further analysis.

Step 1: Understand the Problem and the Data


The first step in any data analysis project is to clearly understand the problem you are trying to solve and the data you have at your disposal. This involves asking questions such as:
What is the business goal or research question you are trying to address?
What are the variables in the data, and what do they mean?
What are the data types (numerical, categorical, text, etc.)?
Are there any known data quality issues or limitations?
Are there any relevant domain-specific considerations or constraints?
By thoroughly understanding the problem and the data, you can better formulate your analysis approach and avoid making incorrect assumptions or drawing misguided conclusions. It is also important to involve domain experts or stakeholders at this stage to ensure you have a complete understanding of the context and requirements.
Step 2: Import and Inspect the Data
Once you have a clear understanding of the problem and the data, the next step is to import the data into your analysis environment (e.g., Python, R, or a spreadsheet program). During this step, inspecting the data is critical to gain an initial understanding of its structure, variable types, and potential issues.
Here are a few tasks you can carry out at this stage (a brief sketch in Python follows the list):
Load the data into your analysis environment, ensuring that it is imported correctly and without errors or truncations.
Examine the size of the data (number of rows and columns) to get a sense of its scale and complexity.
Check for missing values and their distribution across variables, as missing data can significantly affect the quality and reliability of your analysis.
Identify data types and formats for each variable, as this information is needed for the subsequent data manipulation and analysis steps.
Look for any apparent errors or inconsistencies in the data, such as invalid values, mismatched units, or outliers, that can indicate data quality issues.
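
A minimal sketch of importing and inspecting a dataset with pandas; the file name survey.csv is a placeholder, not a dataset referenced by this course.

```python
import pandas as pd

# Load the data; "survey.csv" is a placeholder for whatever file you are given.
df = pd.read_csv("survey.csv")

print(df.shape)                     # number of rows and columns
print(df.head())                    # first few records, to spot obvious import problems
df.info()                           # column names, data types, and non-null counts
print(df.isnull().sum())            # missing values per column
print(df.describe(include="all"))   # quick look for invalid values or outliers
```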
Step 3: Handle Missing Data
Missing data is a common challenge in many datasets, and it can significantly impact the quality and reliability of your analysis. During EDA it is critical to identify and deal with missing data appropriately, as ignoring or mishandling it can lead to biased or misleading results.
Here are some techniques you can use to handle missing data:
Understand the patterns and potential reasons for missing data: Is the data missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR)? Understanding the underlying mechanism can inform the appropriate method for handling missing data.
Decide whether to remove observations with missing values (listwise deletion) or to impute (fill in) missing values: Removing observations with missing values can result in a loss of information and potentially biased results, especially if the missing data are not MCAR. Imputing missing values can help preserve valuable information, but the imputation approach must be chosen carefully.
Use suitable imputation strategies, such as mean/median imputation, regression imputation, multiple imputation, or machine-learning-based imputation methods like k-nearest neighbors (KNN) or decision trees. The choice of imputation technique should be based on the characteristics of the data and the assumptions underlying each method.
Consider the impact of missing data: Even after imputation, missing data can introduce uncertainty and bias. It is important to acknowledge these limitations and interpret your results with caution.
Handling missing data properly can improve the accuracy and reliability of your analysis and prevent biased or misleading conclusions. It is also important to document the techniques used to address missing data and the rationale behind your choices.
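
A short sketch of two of these imputation strategies using scikit-learn; the small income/age DataFrame is invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical numeric data with gaps.
df = pd.DataFrame({"income": [42.0, np.nan, 55.0, 61.0, np.nan, 48.0],
                   "age":    [23.0, 31.0, np.nan, 45.0, 52.0, 38.0]})

# Simple strategy: replace missing values with the column median.
median_imputed = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(df), columns=df.columns)

# Model-based strategy: estimate each gap from the k nearest complete rows.
knn_imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns)

print(median_imputed)
print(knn_imputed)
```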
Step 4: Explore Data Characteristics
After addressing missing data, the next step in the EDA process is to explore the characteristics of your data. This involves examining the distribution, central tendency, and variability of your variables and identifying any potential outliers or anomalies. Understanding the characteristics of your data is critical for choosing appropriate analytical techniques, spotting potential data quality issues, and gaining insights that can inform subsequent analysis and modeling decisions.
Calculate summary statistics (mean, median, mode, standard deviation, skewness, kurtosis, and so on) for numerical variables: These statistics provide a concise overview of the distribution and central tendency of each variable, aiding in the identification of potential issues or deviations from expected patterns.
Step 5: Perform Data Transformation
Data transformation is a critical step in the EDA process because it prepares your data for further analysis and modeling. Depending on the characteristics of your data and the requirements of your analysis, you may need to carry out various transformations to ensure that your data are in the most appropriate format.
Here are a few common data transformation techniques:
Scaling or normalizing numerical variables to a standard range (e.g., min-max scaling, standardization)
Encoding categorical variables for use in machine learning models (e.g., one-hot encoding, label encoding)
Applying mathematical transformations to numerical variables (e.g., logarithmic, square root) to correct for skewness or non-linearity
Creating derived variables or features based on existing variables (e.g., calculating ratios, combining variables)
Aggregating or grouping records based on specific variables or conditions
By appropriately transforming your data, you can ensure that your analysis and modeling techniques are applied correctly and that your results are reliable and meaningful.
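
A brief sketch of several of these transformations with pandas and scikit-learn, assuming a small invented DataFrame with a skewed numeric column and a categorical column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical dataset mixing a skewed numeric column and a categorical one.
df = pd.DataFrame({"price": [120, 250, 90, 4000, 310],
                   "segment": ["retail", "wholesale", "retail", "online", "online"]})

# Scaling / standardization of a numeric variable.
df["price_minmax"] = MinMaxScaler().fit_transform(df[["price"]]).ravel()
df["price_zscore"] = StandardScaler().fit_transform(df[["price"]]).ravel()

# Log transform to reduce skewness.
df["price_log"] = np.log1p(df["price"])

# One-hot encoding of the categorical variable.
df = pd.get_dummies(df, columns=["segment"])
print(df)
```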
Step 6: Visualize Data Relationships
Visualization is a powerful tool in the EDA process, as it helps uncover relationships between variables and identify patterns or trends that may not be immediately apparent from summary statistics or numerical outputs. To visualize data relationships, explore univariate, bivariate, and multivariate views of the data.
Create frequency tables, bar plots, and pie charts for categorical variables: These visualizations can help you understand the distribution of categories and discover any potential imbalances or unusual patterns.
Generate histograms, box plots, violin plots, and density plots to visualize the distribution of numerical variables. These visualizations can reveal important information about the shape, spread, and potential outliers in the data.
Examine the correlation or association between variables using scatter plots, correlation matrices, or statistical tests like Pearson's correlation coefficient or Spearman's rank correlation: Understanding the relationships between variables can inform feature selection, dimensionality reduction, and modeling choices.
Step 7: Handling Outliers
An outlier is a data item/object that deviates significantly from the rest of the (so-called normal) objects. Outliers can be caused by measurement or execution errors. The analysis for outlier detection is referred to as outlier mining. There are many ways to detect outliers, and removing them from a DataFrame works the same way as removing any other rows from a pandas DataFrame.
Identify and inspect potential outliers using strategies like the interquartile range (IQR), z-scores, or domain-specific rules: Outliers can considerably impact the results of statistical analyses and machine learning models, so it is essential to identify and handle them appropriately.
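
A minimal sketch of IQR-based outlier detection and removal with pandas, using an invented column containing one obvious outlier.

```python
import pandas as pd

# Hypothetical measurements with one obvious outlier (980).
df = pd.DataFrame({"value": [12, 14, 15, 13, 16, 14, 980, 15, 13, 14]})

q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df["value"] < lower) | (df["value"] > upper)]
print("Detected outliers:\n", outliers)

# Removing the flagged rows works like any other row filter on the DataFrame.
df_clean = df[(df["value"] >= lower) & (df["value"] <= upper)]
print(df_clean)
```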
Step 8: Communicate Findings and Insights
The final step in the EDA process is to communicate your findings and insights effectively. This includes summarizing your analysis, highlighting key discoveries, and presenting your results clearly and compellingly.
Here are a few tips for effective communication:
Clearly state the objectives and scope of your analysis
Provide context and background information to help others understand your approach
Use visualizations and graphics to support your findings and make them more accessible
Highlight key insights, patterns, or anomalies discovered during the EDA process
Discuss any limitations or caveats related to your analysis
Suggest potential next steps or areas for further investigation
Effective communication is critical for ensuring that your EDA efforts have a meaningful impact and that your insights are understood and acted upon by stakeholders.
In today's fast-paced, technology-driven world, data science has become a leading support for decision-making, automation, and insight across industries. In essence, data science involves handling very large datasets, searching for patterns in the data, predicting specific outcomes based on the patterns found, and, finally, acting or making informed decisions based on those predictions. This is operationalized through data science modeling, which involves designing the algorithms and statistical models that process and analyze data. The process can be challenging for learners who are just starting out, but broken into clear steps, even a beginner can follow it and build models effectively.
What is Data Science Modelling
Data science modeling is a set of steps from defining the problem to deploying the model in the real world. The aim of this section is to demystify that process and present a simple, stepwise guide that anyone with a basic grasp of data science ideas can follow. Each step is explained in plain language so that even a beginner can apply these practices in their own projects.
Data Science Modelling Steps
1. Define Your Objective
2. Collect Data
3. Clean Your Data
4. Explore Your Data
5. Split Your Data
6. Choose a Model
7. Train Your Model
8. Evaluate Your Model
9. Improve Your Model
10. Deploy Your Model
These ten steps guide a beginner through the data science modeling process and are meant to be an easily readable guide for anyone who wants to build models that can analyze data
and give insights. Each step is crucial and builds upon the previous one, ensuring a comprehensive
understanding of the entire process. Designed for students, professionals who would like to switch
their career paths, and even curious minds out there in pursuit of knowledge, this guide gives the
perfect foundation for delving deeper into the world of data science models.
1. Define Your Objective
First, define very clearly what problem you are going to solve. Whether that is a customer churn
prediction, better product recommendations, or patterns in data, you first need to know your
direction. This should bring clarity to the choice of data, algorithms, and evaluation metrics.
2. Collect Data
Gather data relevant to your objective. This can include internal data from your company, publicly
available datasets, or data purchased from external sources. Ensure you have enough data to train
your model effectively.
3. Clean Your Data
Data cleaning is a critical step to prepare your dataset for modeling. It involves handling missing
values, removing duplicates, and correcting errors. Clean data ensures the reliability of your
model's predictions.
4. Explore Your Data
Data exploration, or exploratory data analysis (EDA), involves summarizing the main
characteristics of your dataset. Use visualizations and statistics to uncover patterns, anomalies, and
relationships between variables.
5. Split Your Data
Divide your dataset into training and testing sets. The training set is used to train your model, while
the testing set evaluates its performance. A common split ratio is 80% for training and 20% for
testing.
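
A short sketch of an 80/20 split using scikit-learn's train_test_split; the feature matrix and labels here are randomly generated stand-ins for a real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix X and target vector y with 100 samples.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# 80% of the rows go to training, 20% to testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)   # (80, 3) (20, 3)
```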
6. Choose a Model
Select a model that suits your problem type (e.g., regression, classification) and data. Beginners
can start with simpler models like linear regression or decision trees before moving on to more
complex models like neural networks.
7. Train Your Model
Feed your training data into the model. This process involves the model learning from the data,
adjusting its parameters to minimize errors. Training a model can take time, especially with large
datasets or complex models.
8. Evaluate Your Model
After training, assess your model's performance using the testing set. Common evaluation metrics
include accuracy, precision, recall, and F1 score. Evaluation helps you understand how well your
model will perform on unseen data.
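
A minimal sketch of computing these metrics with scikit-learn; the true labels and predictions below are invented for illustration.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true labels from the testing set and predictions from a model.
y_test = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1 score :", f1_score(y_test, y_pred))
```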
9. Improve Your Model
Based on the evaluation, you may need to refine your model. This can involve tuning
hyperparameters, choosing a different model, or going back to data cleaning and preparation for
further improvements.
10. Deploy Your Model
Once satisfied with your model's performance, deploy it for real-world use. This could mean
integrating it into an application or using it for decision-making within your organization.

Presenting Findings and Building Applications


• The team delivers final reports, briefings, code, and technical documents.
• In addition, the team may run a pilot project to implement the models in a production environment.
• The last stage of the data science process is where your soft skills will be most useful: presenting your results to the stakeholders and industrializing your analysis process for repetitive reuse and integration with other tools.
Build the Models
• To build a model, the data should be clean and its content properly understood. The components of model building are as follows:
a) Selection of model and variable
b) Execution of model
c) Model diagnostic and model comparison
• Building a model is an iterative process. Most models consist of the following main steps:
1. Selection of a modeling technique and variables to enter in the model
2. Execution of the model
3. Diagnosis and model comparison
Model and Variable Selection
• For this phase, consider model performance and whether the project meets all the requirements to use the model, as well as other factors:
1. Must the model be moved to a production environment and, if so, would it be easy to implement?
2. How difficult is the maintenance of the model: how long will it remain relevant if left untouched?
3. Does the model need to be easy to explain?
Model Execution
• Various programming languages can be used to implement the model. For model execution, Python provides libraries like StatsModels or Scikit-learn. These packages implement several of the most popular techniques.
• Coding a model is a nontrivial task in most cases, so having these libraries available can speed up the process. The following remarks apply to the model output:
a) Model fit: R-squared or adjusted R-squared is used.
b) Predictor variables have a coefficient: For a linear model this is easy to interpret.
c) Predictor significance: Coefficients are great, but sometimes not enough evidence exists to show that the influence is there.
• Linear regression works if we want to predict a value, but to classify something, classification models are used. The k-nearest neighbors method is one of the most commonly used. A short sketch of model execution with these libraries follows the tool lists below.
• The following commercial tools are used:
1. SAS enterprise miner: This tool allows users to run predictive and descriptive models based on
large volumes of data from across the enterprise.
2. SPSS modeler: It offers methods to explore and analyze data through a GUI.
3. Matlab: Provides a high-level language for performing a variety of data analytics, algorithms
and data exploration.
4. Alpine miner: This tool provides a GUI front end for users to develop analytic workflows and
interact with Big Data tools and platforms on the back end.
• Open Source tools:
1. R and PL/R: PL/R is a procedural language for PostgreSQL with R.
2. Octave: A free software programming language for computational modeling, has some of the
functionality of Matlab.
3. WEKA: It is a free data mining software package with an analytic workbench. The functions
created in WEKA can be executed within Java code.
4. Python is a programming language that provides toolkits for machine learning and analysis.
5. SQL in-database implementations, such as MADlib, provide an alternative to in-memory desktop analytical tools.
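
As a rough illustration of model execution with the open-source Python libraries mentioned above, the sketch below fits a linear regression for a numeric prediction and a k-nearest neighbors classifier for a classification task. All data here is synthetic and generated inside the script.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Regression: predict a numeric value from one feature (synthetic data).
X_reg = rng.uniform(0, 10, size=(50, 1))
y_reg = 3.0 * X_reg.ravel() + rng.normal(scale=1.0, size=50)
lin = LinearRegression().fit(X_reg, y_reg)
print("coefficient:", lin.coef_[0], "R-squared:", lin.score(X_reg, y_reg))

# Classification: k-nearest neighbors on two synthetic classes.
X_clf = np.vstack([rng.normal(0, 1, size=(25, 2)), rng.normal(3, 1, size=(25, 2))])
y_clf = np.array([0] * 25 + [1] * 25)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_clf, y_clf)
print("training accuracy:", knn.score(X_clf, y_clf))
```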
Model Diagnostics and Model Comparison
Try to build multiple models and then select the best one based on multiple criteria. Working with a holdout sample helps you pick the best-performing model.
• In the holdout method, the data is split into two different datasets labeled as a training and a testing dataset. This can be a 60/40, 70/30, or 80/20 split. This technique is called the hold-out validation technique.
• Suppose we have a database with house prices as the dependent variable and two independent variables showing the square footage of the house and the number of rooms. Now, imagine this dataset has 30 rows. The whole idea is that you build a model that can predict house prices accurately.
• To 'train' our model, or see how well it performs, we randomly subset 20 of those rows and fit the model. The second step is to predict the values of the 10 rows that we excluded and measure how good our predictions are.
• As a rule of thumb, experts suggest randomly sampling 80% of the data into the training set and 20% into the test set.
• The holdout method has two basic drawbacks:
1. It requires an extra dataset.
2. Because it is a single train-and-test experiment, the holdout estimate of the error rate will be misleading if we happen to get an "unfortunate" split.
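
The holdout example above can be sketched in Python roughly as follows. The 30-row house-price dataset is generated synthetically here purely to mirror the description; the 20/10 split and the error measurement follow the text.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 30-row house-price dataset described above.
rng = np.random.default_rng(1)
sqft = rng.uniform(600, 2500, size=30)
rooms = rng.integers(1, 6, size=30)
price = 50 * sqft + 10000 * rooms + rng.normal(scale=5000, size=30)
houses = pd.DataFrame({"sqft": sqft, "rooms": rooms, "price": price})

# Holdout split: fit on 20 randomly chosen rows, test on the remaining 10.
train, test = train_test_split(houses, train_size=20, random_state=7)
model = LinearRegression().fit(train[["sqft", "rooms"]], train["price"])

predictions = model.predict(test[["sqft", "rooms"]])
print("mean absolute error on the 10 held-out rows:",
      mean_absolute_error(test["price"], predictions))
```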
Multiple Choice Questions:

1. What is the primary purpose of Data Science?
a) To store and manage data
b) To analyze and interpret complex data to make decisions
c) To retrieve raw data
d) To eliminate all errors in data
Answer: b) To analyze and interpret complex data to make decisions
2. Which of the following is a benefit of using Data Science?
a) It only helps in data visualization
b) It improves decision-making through data insights
c) It focuses only on data storage
d) It reduces the amount of data generated
Answer: b) It improves decision-making through data insights
3. Which facet of data science refers to cleaning and organizing raw data?
a) Data Acquisition
b) Data Wrangling
c) Data Analysis
d) Data Reporting
Answer: b) Data Wrangling
4. Which of the following is a key step in the data science process?
a) Data entry
b) Data retrieval
c) Data manipulation
d) Data deletion
Answer: b) Data retrieval
5. What does "Data Cleansing" primarily involve?
a) Adding new data sources
b) Removing errors, duplicates, and irrelevant data
c) Visualizing the data
d) Presenting data to stakeholders
Answer: b) Removing errors, duplicates, and irrelevant data
6. In the data science process, which phase involves defining the problem or question to be addressed?
a) Data Collection
b) Exploratory Data Analysis
c) Setting the Research Goal
d) Data Transformation
Answer: c) Setting the Research Goal
7. What is the purpose of Exploratory Data Analysis (EDA)?
a) To present results
b) To test hypotheses
c) To explore patterns and relationships within data
d) To transform data
Answer: c) To explore patterns and relationships within data
8. Which of the following is NOT typically involved in the Data Science process?
a) Building models
b) Retrieving data
c) Data storage
d) Data reporting
Answer: c) Data storage
9. What is the role of data integration in the data science process?
a) Transforming the data into useful insights
b) Merging data from different sources into a unified format
c) Cleansing the data
d) Presenting the data
Answer: b) Merging data from different sources into a unified format
10. Which technique is commonly used for transforming data?
a) Data encoding
b) Data scraping
c) Data cleansing
d) Data formatting
Answer: a) Data encoding
11. In which stage of Data Science are machine learning algorithms typically applied?
a) Data collection
b) Building models
c) Data cleansing
d) Data transformation
Answer: b) Building models
12. Which of the following is an example of an application of Data Science?
a) Detecting fraudulent transactions
b) Storing raw data
c) Entering data manually
d) Making data unreadable
Answer: a) Detecting fraudulent transactions
13. What does "Data Transformation" involve?
a) Converting data into a usable format
b) Visualizing data for reports
c) Adding irrelevant data
d) Deleting unnecessary data
Answer: a) Converting data into a usable format
14. What is the goal of presenting results in Data Science?
a) To confuse the audience with technical jargon
b) To deliver the final data report to the stakeholders
c) To ignore visualizations and focus on raw data
d) To store data for further analysis
Answer: b) To deliver the final data report to the stakeholders
15. Which of the following is an essential skill for a Data Scientist?
a) Strong programming knowledge
b) Mastery of marketing strategies
c) Understanding how to create ads
d) Proficiency in graphic design
Answer: a) Strong programming knowledge
16. What is the purpose of setting a research goal in Data Science?
a) To identify the stakeholders
b) To determine what data to analyze and what results are expected
c) To cleanse the data
d) To store data for later use
Answer: b) To determine what data to analyze and what results are expected
17. Which of the following is an example of a "facet" of data in data science?
a) Data retrieval
b) Data quality
c) Data privacy laws
d) Data reporting tools
Answer: b) Data quality
18. What is the importance of data analysis in Data Science?
a) To define the research question
b) To clean and store data
c) To explore the data and derive meaningful patterns
d) To convert data into images
Answer: c) To explore the data and derive meaningful patterns
19. In which stage of Data Science are models built using training data?
a) Data Collection
b) Data Wrangling
c) Building the Models
d) Exploratory Data Analysis
Answer: c) Building the Models
20. What is the ultimate goal of Data Science?
a) To generate large amounts of data
b) To turn data into actionable insights
c) To store data efficiently
d) To present raw data to users
Answer: b) To turn data into actionable insights
21. Which of the following describes data integration in the Data Science process?
a) Collecting data from multiple sources and merging it
b) Displaying data for easy visualization
c) Deleting irrelevant data from the dataset
d) Testing machine learning algorithms
Answer: a) Collecting data from multiple sources and merging it
22. Which is a key benefit of using Data Science in business decision-making?
a) It automates customer service
b) It identifies patterns and trends to predict future outcomes
c) It stores historical data
d) It manages data privacy laws
Answer: b) It identifies patterns and trends to predict future outcomes
23. What is a typical output of the exploratory data analysis phase?
a) A fully trained machine learning model
b) Insights into the structure and relationships within the data
c) A cleaned dataset
d) A report on how to integrate data
Answer: b) Insights into the structure and relationships within the data
24. Which of the following is NOT part of the Data Science process?
a) Data Analysis
b) Data Cleansing
c) Data Reporting
d) Data Destruction
Answer: d) Data Destruction
25. Which step involves testing and evaluating the models in Data Science?
a) Data Retrieval
b) Data Cleansing
c) Model Evaluation
d) Data Transformation
Answer: c) Model Evaluation

5 Mark Questions:

1. Explain the role of data cleansing and why it is essential in the data science process.
2. Describe the key stages of data science, from setting the research goal to presenting the
results.
3. Discuss the process of data integration and transformation in data science and its
significance.
4. How does exploratory data analysis (EDA) help in uncovering patterns in data? Provide
examples.
5. What are the primary benefits of applying data science in industries like healthcare,
finance, and marketing?

10 Mark Questions:

1. Explain the complete data science process with an emphasis on the importance of setting
the research goal, retrieving data, and the stages of model building.
2. Discuss the various techniques of data transformation and integration. How do these
processes contribute to the overall quality of the data?
3. In the context of data science, what is the significance of exploratory data analysis
(EDA)? Explain with examples how EDA helps in understanding and cleaning data.
4. Explain how machine learning models are built and validated in the data science process.
Include details about data preparation, model selection, and evaluation.
5. How do data science applications benefit decision-making in modern businesses?
Illustrate your answer with examples from real-world industries.
UNIT-II

Frequency Distribution is a tool in statistics that helps us organize the data and also helps us
reach meaningful conclusions. It tells us how often any specific values occur in the dataset. A
frequency distribution in a tabular form organizes data by showing the frequencies (the number of
times values occur) within a dataset.
A frequency distribution represents the pattern of how frequently each value of a variable appears
in a dataset. It shows the number of occurrences for each possible value within the dataset.
Let’s learn about Frequency Distribution including its definition, graphs, solved examples, and
frequency distribution table in detail.

What is Outlier?
Outliers, in the context of data analysis, are data points that deviate significantly from the other observations in a dataset. These anomalies can show up as unusually high or low values that disrupt the overall distribution of the data. For instance, in a dataset of monthly sales figures, if the sales for one month are far higher than the sales for all of the other months, that high sales figure would be considered an outlier.
Why Removing Outliers is Necessary?
● Impact on Analysis: Outliers can have a disproportionate influence on statistical measures such as the mean, skewing the overall results and leading to misguided conclusions. Removing outliers can help ensure the analysis is based on a more representative sample of the data.
● Statistical Significance: Outliers can affect the validity and reliability of statistical inferences drawn from the data. Removing outliers, when appropriate, helps preserve the statistical soundness of the analysis.
Identifying and correctly dealing with outliers is therefore critical in data analysis to ensure the integrity and accuracy of the results.
Types of Outliers
Outliers manifest in different forms, each presenting unique challenges:
● Univariate Outliers: These occur when a value of a single variable deviates substantially from the rest of the dataset. For example, if you are studying the heights of adults in a certain region and most fall in the range of 5 feet 5 inches to 6 feet, a person who measures 7 feet tall would be considered a univariate outlier.
● Multivariate Outliers: In contrast to univariate outliers, multivariate outliers are observations that are unusual across several variables at once, highlighting complex relationships in the data. Continuing with the example, suppose you are looking at height and weight together and find a person who is both exceptionally tall and exceptionally heavy compared to the rest of the population. That person would be considered a multivariate outlier, because their height and weight deviate from the norm simultaneously.
● Point Outliers: These are individual points that lie far away from the rest of the data. For instance, in a dataset of average household energy usage, a value that is extremely high or low compared to the rest is a point outlier.
● Contextual Outliers: Sometimes called conditional outliers, these are data points that deviate from the norm only in a specific context or condition. For instance, a very low temperature might be normal in winter but unusual in summer.
● Collective Outliers: These consist of a group of data points that may not be extreme by themselves but are unusual as a whole. This type of outlier often indicates a change in data behavior or an emergent phenomenon.
Main Causes of Outliers
Outliers can arise from various sources, making their detection vital:
● Data Entry Errors: Simple human errors in entering data can create extreme values.
● Measurement Error: Faulty equipment or problems with the experimental setup can cause abnormally high or low readings.
● Experimental Errors: Flaws in experimental design can produce data points that do not represent what they are supposed to measure.
● Intentional Outliers: In some cases, data might be manipulated deliberately to produce
outlier effects, often seen in fraud cases.
● Data Processing Errors: During the collection and processing stages, technical glitches
can introduce erroneous data.
● Natural Variation: Inherent variability in the underlying data can also lead to outliers.
How Outliers can be Identified?
Identifying outliers is a vital step in data analysis, helping to uncover anomalies, errors, or valuable insights within datasets. One common approach for identifying outliers is through visualizations, where data is graphically represented to highlight any points that deviate appreciably from the overall pattern. Techniques like box plots and scatter plots offer intuitive visual cues for recognizing outliers based on their position relative to the rest of the data.
Another approach involves statistical measures, such as the Z-score, the DBSCAN algorithm, or the Isolation Forest algorithm, which quantitatively assess how far a data point deviates from the mean or identify outliers based on their density within the data space.
By combining visual inspection with statistical analysis, analysts can efficiently identify outliers and gain deeper insight into the underlying characteristics of the data.
1. Outlier Identification Using Visualizations
Visualizations offer insight into data distributions and anomalies. Visual tools such as scatter plots and box plots can effectively highlight data points that deviate noticeably from the majority. In a scatter plot, outliers often appear as points lying far from the main cluster or displaying unusual patterns compared to the rest. Box plots offer a clear depiction of the data's central tendency and spread, with outliers represented as individual points beyond the whiskers.
1.1 Identifying outliers with box plots
Box plots are valuable tools in data analysis for visually summarizing the distribution of a dataset. They are useful in outlier identification because they offer a concise representation of key statistical measures such as the median, quartiles, and range. A box plot consists of a rectangular "box" that spans the interquartile range (IQR), with a line indicating the median. "Whiskers" extend from the box to the minimum and maximum values within a specified range, often set at 1.5 times the IQR. Any data points beyond those whiskers are considered potential outliers. These outliers, represented as individual points, can provide important insights into the dataset's variability and possible anomalies. Thus, box plots serve as a visual aid in outlier detection, allowing analysts to pick out data points that deviate notably from the general pattern and warrant further investigation.
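As a quick illustration, the 1.5 × IQR rule behind the box-plot whiskers can be applied directly in Python. The sketch below uses NumPy on a small made-up set of monthly sales figures (the numbers and the 1.5 multiplier are illustrative assumptions, not taken from the text):

import numpy as np

# Made-up monthly sales figures; the last value is unusually high
sales = np.array([210, 225, 198, 240, 232, 250, 219, 980])

q1, q3 = np.percentile(sales, [25, 75])          # first and third quartiles
iqr = q3 - q1                                    # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # box-plot whisker limits

outliers = sales[(sales < lower) | (sales > upper)]
print("IQR bounds:", lower, upper)
print("Potential outliers:", outliers)           # flags the 980 value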
1.2 Identifying outliers with Scatter Plots
Scatter plots serve as useful tools for identifying outliers within datasets, particularly when exploring relationships between two continuous variables. These visualizations plot individual data points as dots on a graph, with one variable represented on each axis. Outliers in scatter plots often show up as points that deviate noticeably from the overall pattern or trend followed by the majority of data points.
They might appear as isolated dots lying far from the main cluster, or as points exhibiting unusual patterns compared to the bulk of the data. By visually inspecting scatter plots, analysts can quickly pinpoint potential outliers, prompting further investigation into their nature and possible impact on the analysis. This initial identification lays the groundwork for deeper exploration and understanding of the data's behavior and distribution.
2. Outlier Identification using Statistical Methods
2.1 Identifying outliers with Z-Score
The Z-score, a widely used statistical measure, quantifies how many standard deviations a data point is from the mean of the dataset. In outlier detection using the Z-score, data points with Z-scores beyond a certain threshold (usually set at ±3) are considered outliers. A large positive or negative Z-score indicates that the data point is unusually far from the mean, signaling its potential outlier status. By calculating the Z-score for each data point, analysts can systematically identify outliers based on their deviation from the mean, providing a robust quantitative approach to outlier detection.
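A minimal sketch of this idea in Python, assuming NumPy is available; the data are randomly generated around 14 with one injected extreme value, so the exact numbers are illustrative only:

import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(loc=14, scale=2, size=200), [95.0]])  # 95 is an injected outlier

z_scores = (data - data.mean()) / data.std()   # deviation from the mean in standard deviations
outliers = data[np.abs(z_scores) > 3]          # |z| > 3 is the usual rule of thumb
print("Flagged as outliers:", outliers)        # only the injected 95 is flagged here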

2.2 Identifying outliers with DBSCAN


DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that identifies outliers based on the density of data points in their neighborhood. Unlike traditional clustering algorithms that require specifying the number of clusters in advance, DBSCAN automatically determines clusters based on data density. Data points that fall outside dense clusters or fail to satisfy the density criteria are labeled as outliers (noise). By analyzing the local density of data points, DBSCAN successfully identifies outliers in datasets with complex structure and varying densities, making it particularly suitable for outlier detection in spatial data analysis and other applications.
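A small sketch of DBSCAN-based outlier detection, assuming scikit-learn is installed; the two clusters, the isolated points, and the eps/min_samples settings are all made-up choices for illustration:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
cluster1 = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))   # dense cluster near (0, 0)
cluster2 = rng.normal(loc=[5, 5], scale=0.3, size=(50, 2))   # dense cluster near (5, 5)
isolated = np.array([[2.5, 8.0], [9.0, 0.5]])                # low-density points far from both clusters
X = np.vstack([cluster1, cluster2, isolated])

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)       # label -1 marks noise points
print("Points labelled as noise (outliers):")
print(X[labels == -1])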
2.3 Identifying outliers with Isolation Forest algorithm
The Isolation Forest algorithm is an anomaly detection method based on the idea of isolating outliers in a dataset. It constructs a forest of random decision trees and isolates observations by recursively partitioning the dataset into subsets. Outliers are identified as instances that require fewer partitions to separate them from the rest of the data. Since outliers are usually few in number and have attribute values that differ markedly from ordinary instances, they are more likely to be isolated early in the tree-building process. The Isolation Forest algorithm provides a scalable and efficient approach to outlier detection, especially in high-dimensional datasets, and is robust against the presence of irrelevant features.
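The following sketch shows how an Isolation Forest might be applied with scikit-learn (an assumed dependency); the data and the contamination value are illustrative assumptions:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)
normal_points = rng.normal(loc=50, scale=5, size=(200, 1))   # typical observations
anomalies = np.array([[5.0], [120.0]])                       # clearly unusual values
X = np.vstack([normal_points, anomalies])

model = IsolationForest(contamination=0.01, random_state=7)
labels = model.fit_predict(X)            # -1 marks predicted outliers, 1 marks inliers
print("Predicted outliers:", X[labels == -1].ravel())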
When Should You Remove Outliers?
Deciding when to remove outliers depends on the context of the analysis. Outliers should be removed when they are due to errors or anomalies that do not reflect the true nature of the data. A few considerations for removing outliers are:
● Impact on Analysis: Removing outliers can affect statistical measures and model accuracy.
● Statistical Significance: Consider the consequences of outlier removal for the validity of the analysis.
What is Frequency Distribution in Statistics?
A frequency distribution is an overview of all values of some variable and the number of times
they occur. It tells us how frequencies are distributed over the values. That is how many values lie
between different intervals. They give us an idea about the range where most values fall and the
ranges where values are scarce.
Frequency Distribution Graphs
To represent the Frequency Distribution, there are various methods such as Histogram, Bar Graph,
Frequency Polygon, and Pie Chart.

A brief description of all these graphs is as follows:


Graph Type | Description | Use Cases
Histogram | Represents the frequency of each interval of continuous data using bars of equal width. | Continuous data distribution analysis.
Bar Graph | Represents the frequency of each category using bars of equal width; can also represent discrete data. | Comparing discrete data categories.
Frequency Polygon | Connects the midpoints of class frequencies using lines, similar to a histogram but without bars. | Comparing various datasets.
Pie Chart | Circular graph showing data as slices of a circle, indicating the proportional size of each slice relative to the whole dataset. | Showing relative sizes of data portions.

Frequency Distribution Table


A frequency distribution table is a way to organize and present data in tabular form, which helps us summarize a large dataset in a concise table. A frequency distribution table has two columns: one represents the data, either as a range (class interval) or as individual values, and the other shows the frequency of each interval or value.
For example, let’s say we have a dataset of students’ test scores in a class.

Test Score Frequency

0-20 6

20-40 12

40-60 22

60-80 15

80-100 5



Types of Frequency Distribution
There are four types of frequency distribution:

1. Grouped Frequency Distribution


2. Ungrouped Frequency Distribution
3. Relative Frequency Distribution
4. Cumulative Frequency Distribution

Grouped Frequency Distribution


In Grouped Frequency Distribution observations are divided between different intervals known as
class intervals and then their frequencies are counted for each class interval. This Frequency
Distribution is used mostly when the data set is very large.
Example: Make the Frequency Distribution Table for the ungrouped data given as follows:
23, 27, 21, 14, 43, 37, 38, 41, 55, 11, 35, 15, 21, 24, 57, 35, 29, 10, 39, 42, 27, 17, 45, 52, 31, 36,
39, 38, 43, 46, 32, 37, 25
Solution: As the observations lie between 10 and 57, we can choose class intervals of 10-20, 20-30, 30-40, 40-50, and 50-60. These class intervals cover all the observations, and we can count the frequency for each interval.
Thus, the Frequency Distribution Table for the given data is as follows:

Class Interval Frequency

10 – 20 5

20 – 30 8

30 – 40 12

40 – 50 6

50 – 60 3
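If pandas is available, a grouped frequency table like the one above can be produced with pd.cut. The marks below are made-up values used only to illustrate the idea; the bin edges are an assumption:

import pandas as pd

marks = [12, 18, 25, 27, 33, 35, 38, 41, 44, 47, 52, 55, 23, 29, 36, 48, 51, 58, 14, 39]

bins = [10, 20, 30, 40, 50, 60]                      # class boundaries 10-20, 20-30, ..., 50-60
intervals = pd.cut(marks, bins=bins, right=False)    # right=False keeps intervals of the form [a, b)
freq_table = pd.Series(intervals).value_counts().sort_index()
print(freq_table)                                    # frequency of each class interval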

Ungrouped Frequency Distribution

In Ungrouped Frequency Distribution, all distinct observations are mentioned and counted
individually. This Frequency Distribution is often used when the given dataset is small.
Example: Make the Frequency Distribution Table for the ungrouped data given as follows:
10, 20, 15, 25, 30, 10, 15, 10, 25, 20, 15, 10, 30, 25

Solution:
As unique observations in the given data are only 10, 15, 20, 25, and 30 with each having a different
frequency.
Thus the Frequency Distribution Table of the given data is as follows:

Value Frequency

10 4

15 3

20 2

25 3

30 2

Relative Frequency Distribution

This distribution displays the proportion or percentage of observations in each interval or class. It
is useful for comparing different data sets or for analyzing the distribution of data within a set.
Relative Frequency is given by:
Relative Frequency = (Frequency of Event)/(Total Number of Events)

Example: Make the Relative Frequency Distribution Table for the following data:

Score Range 0-20 21-40 41-60 61-80 81-100

Frequency 5 10 20 10 5

Solution:
To Create the Relative Frequency Distribution table, we need to calculate Relative Frequency for
each class interval. Thus Relative Frequency Distribution table is given as follows:

Score Range Frequency Relative Frequency

0-20 5 5/50 = 0.10

21-40 10 10/50 = 0.20

41-60 20 20/50 = 0.40


61-80 10 10/50 = 0.20

81-100 5 5/50 = 0.10

Total 50 1.00
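The same relative frequency table can be reproduced with pandas (an assumed dependency) by simply dividing each frequency by the total:

import pandas as pd

# Frequencies from the table above (score ranges 0-20 to 81-100)
freq = pd.Series([5, 10, 20, 10, 5],
                 index=["0-20", "21-40", "41-60", "61-80", "81-100"])

relative = freq / freq.sum()                              # frequency of each class / total (50)
print(relative)                                           # 0.10, 0.20, 0.40, 0.20, 0.10
print("Sum of relative frequencies:", relative.sum())     # always 1.0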

Cumulative Frequency Distribution

Cumulative frequency is defined as the sum of all the frequencies in the previous values or intervals
up to the current one. The frequency distributions which represent the frequency distributions
using cumulative frequencies are called cumulative frequency distributions. There are two types
of cumulative frequency distributions:

● Less than Type: We sum the frequencies of all classes up to and including the current one.
● More than Type: We sum the frequencies of the current class and all the classes that follow it.


Let’s see how to represent a cumulative frequency distribution through an example,


Example: The table below gives the values of runs scored by Virat Kohli in the last 25 T-20
matches. Represent the data in the form of less-than-type cumulative frequency distribution:

45 34 50 75 22

56 63 70 49 33
0 8 14 39 86

92 88 70 56 50

57 45 42 12 39

Solution:
Since there are a lot of distinct values, we’ll express this in the form of grouped distributions with
intervals like 0-10, 10-20 and so. First let’s represent the data in the form of grouped frequency
distribution.
Runs Frequency

0-10 2

10-20 2

20-30 1

30-40 4

40-50 4

50-60 5

60-70 1

70-80 3
80-90 2

90-100 1

Now we will convert this frequency distribution into a cumulative frequency distribution by summing up the values of the current interval and all the previous intervals.

Runs scored by Virat Kohli Cumulative Frequency

Less than 10 2

Less than 20 4

Less than 30 5

Less than 40 9

Less than 50 13
Less than 60 18

Less than 70 19

Less than 80 22

Less than 90 24

Less than 100 25

This table represents the cumulative frequency distribution of less than type.

Runs scored by Virat Kohli Cumulative Frequency

More than 0 25

More than 10 23
More than 20 21

More than 30 20

More than 40 16

More than 50 12

More than 60 7

More than 70 6

More than 80 3

More than 90 1

This table represents the cumulative frequency distribution of more than type.
We can plot both types of cumulative frequency distribution to obtain the cumulative frequency curves (ogives).
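Both cumulative tables above can be obtained from the grouped frequencies with a running sum. A short pandas sketch (assuming pandas is available) using the same run frequencies:

import pandas as pd

runs_freq = pd.Series([2, 2, 1, 4, 4, 5, 1, 3, 2, 1],
                      index=["0-10", "10-20", "20-30", "30-40", "40-50",
                             "50-60", "60-70", "70-80", "80-90", "90-100"])

less_than = runs_freq.cumsum()               # "less than" type: running total from the top
more_than = runs_freq[::-1].cumsum()[::-1]   # "more than" type: running total from the bottom
print(less_than)    # 2, 4, 5, 9, 13, 18, 19, 22, 24, 25
print(more_than)    # 25, 23, 21, 20, 16, 12, 7, 6, 3, 1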
Frequency Distribution Curve
A frequency distribution curve, also known as a frequency curve, is a graphical representation of
a data set’s frequency distribution. It is used to visualize the distribution and frequency of values
or observations within a dataset. Let's understand its different types, based on the shape of the curve, as follows:

Frequency Distribution Curve Types


Type of Distribution | Description
Normal Distribution | Symmetric and bell-shaped; data concentrated around the mean.
Skewed Distribution | Not symmetric; can be positively skewed (right-tailed) or negatively skewed (left-tailed).
Bimodal Distribution | Two distinct peaks or modes in the frequency distribution, suggesting data from different populations.
Multimodal Distribution | More than two distinct peaks or modes in the frequency distribution.
Uniform Distribution | All values or intervals have roughly the same frequency, resulting in a flat, constant distribution.
Exponential Distribution | Rapid drop-off in frequency as values increase, resembling an exponential function.
Log-Normal Distribution | Logarithm of the data follows a normal distribution; often used for multiplicative data; positively skewed.

Frequency Distribution Formula


There are various formulas which can be learned in the context of Frequency Distribution, one
such formula is the coefficient of variation. This formula for Frequency Distribution is discussed
below in detail.
Coefficient of Variation
We can use the mean and standard deviation to describe the dispersion of values. However, comparing two series or frequency distributions directly can be difficult, since they may have different units.
The coefficient of variation is defined as,
C.V. = (σ / x̄) × 100
Where,
● σ represents the standard deviation
● x̄ represents the mean of the observations
Note: Data with greater C.V. is said to be more variable than the other. The series having lesser
C.V. is said to be more consistent than the other.
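A small sketch of the coefficient of variation in Python using NumPy; the two series of marks are made-up values for illustration:

import numpy as np

series_a = np.array([42, 45, 48, 50, 55])    # hypothetical series A
series_b = np.array([20, 35, 48, 60, 77])    # hypothetical series B (same rough range, more spread)

def coefficient_of_variation(x):
    # C.V. = (standard deviation / mean) * 100
    return x.std() / x.mean() * 100

print("C.V. of A:", round(coefficient_of_variation(series_a), 2))
print("C.V. of B:", round(coefficient_of_variation(series_b), 2))
# The series with the larger C.V. (B here) is more variable, i.e. less consistent.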

Comparing Two Frequency Distributions with the Same Mean


We have two frequency distributions. Let's say σ1 and x̄1 are the standard deviation and mean of the first series, and σ2 and x̄2 are the standard deviation and mean of the second series. The coefficient of variation (C.V.) of each series is calculated as follows:
C.V. of first series = (σ1 / x̄1) × 100
C.V. of second series = (σ2 / x̄2) × 100
We are given that both series have the same mean, i.e., x̄2 = x̄1 = x̄. So the C.V. of the two series becomes:
C.V. of the first series = (σ1 / x̄) × 100
C.V. of the second series = (σ2 / x̄) × 100
Notice that now both series can be compared using the value of the standard deviation alone. Therefore, we can say that for two series with the same mean, the series with the larger standard deviation can be considered more variable than the other one.
Frequency Distribution Examples
Example 1: Suppose we have a series with a mean of 20 and a variance of 100. Find the Coefficient of Variation.
Solution:
We know the formula for the coefficient of variation,
C.V. = (σ / x̄) × 100
Given mean x̄ = 20 and variance σ² = 100, so σ = √100 = 10.
Substituting the values into the formula,
C.V. = (10 / 20) × 100 = 50
Example 2: Given two series with Coefficients of Variation 70 and 80. The means are 20 and 30.
Find the values of standard deviation for both series.
Solution:
In this question we need to apply the formula for C.V. and substitute the given values.
Standard deviation of the first series:
C.V. = (σ / x̄) × 100
70 = (σ / 20) × 100
1400 = σ × 100
σ = 14
Thus, the standard deviation of the first series = 14
Standard deviation of the second series:
C.V. = (σ / x̄) × 100
80 = (σ / 30) × 100
2400 = σ × 100
σ = 24
Thus, the standard deviation of the second series = 24
Example 3: Draw the frequency distribution table for the following data:
2, 3, 1, 4, 2, 2, 3, 1, 4, 4, 4, 2, 2, 2
Solution:
Since there are only very few distinct values in the series, we will plot the ungrouped frequency
distribution.

Value Frequency

1 2

2 6

3 2

4 4

Total 14

Example 4: The table below gives the values of temperature recorded in Hyderabad for 25 days in
summer. Represent the data in the form of less-than-type cumulative frequency distribution:

37 34 36 27 22

25 25 24 26 28
30 31 29 28 30

32 31 28 27 30

30 32 35 34 29

Solution:
Since there are many distinct values here, we will use a grouped frequency distribution. Let's take the intervals 20-25, 25-30, 30-35 and 35-40. The frequency distribution table can be made by counting the number of values lying in these intervals.

Temperature Number of Days

20-25 2

25-30 10

30-35 10

35-40 3

This is the grouped frequency distribution table. It can be converted into a cumulative frequency distribution by adding the previous values.

Temperature Number of Days

Less than 25 2

Less than 30 12

Less than 35 22

Less than 40 25

Example 5: Make a Frequency Distribution Table as well as the curve for the data:
{45, 22, 37, 18, 56, 33, 42, 29, 51, 27, 39, 14, 61, 19, 44, 25, 58, 36, 48, 30, 53, 41, 28, 35, 47, 21,
32, 49, 16, 52, 26, 38, 57, 31, 59, 20, 43, 24, 55, 17, 50, 23, 34, 60, 46, 13, 40, 54, 15, 62}
Solution:
To create the frequency distribution table for given data, let’s arrange the data in ascending order
as follows:
{13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62}
Now, we can count the observations for intervals: 10-20, 20-30, 30-40, 40-50, 50-60 and 60-70.

Interval Frequency

10 – 20 7

20 – 30 10

30 – 40 10
40 – 50 10

50 – 60 10

60 – 70 3

From this data, we can plot the Frequency Distribution Curve as follows:

A cumulative frequency is defined as the total of frequencies that are distributed over different
class intervals. It means that the data and the total are represented in the form of a table in which
the frequencies are distributed according to the class interval. In this article, we are going to discuss
in detail the cumulative frequency distribution, types of cumulative frequencies, and the
construction of the cumulative frequency distribution table with examples in detail.
What is Meant by Cumulative Frequency Distribution?
The cumulative frequency is the total of frequencies, in which the frequency of the first class
interval is added to the frequency of the second class interval and then the sum is added to the
frequency of the third class interval and so on. Hence, the table that represents the cumulative
frequencies that are divided over different classes is called the cumulative frequency table or
cumulative frequency distribution. Generally, the cumulative frequency distribution is used to
identify the number of observations that lie above or below the particular frequency in the provided
data set.
Types of Cumulative Frequency Distribution
The cumulative frequency distribution is classified into two different types namely: less than ogive
or cumulative frequency and more/greater than cumulative frequency.
Less Than Cumulative Frequency:
The Less than cumulative frequency distribution is obtained by adding successively the
frequencies of all the previous classes along with the class against which it is written. In this type,
the cumulate begins from the lowest to the highest size.
Greater Than Cumulative Frequency:
The greater than cumulative frequency is also known as the more than type cumulative frequency.
Here, the greater than cumulative frequency distribution is obtained by determining the cumulative
total frequencies starting from the highest class to the lowest class.
Graphical Representation of Less Than and More Than Cumulative Frequency
Representation of cumulative frequency graphically is easy and convenient as compared to
representing it using a table, bar-graph, frequency polygon etc.
The cumulative frequency graph can be plotted in two ways:
1. Cumulative frequency distribution curve(or ogive) of less than type
2. Cumulative frequency distribution curve(or ogive) of more than type
Steps to Construct Less than Cumulative Frequency Curve
The steps to construct the less than cumulative frequency curve are as follows:
1. Mark the upper limit on the horizontal axis or x-axis.
2. Mark the cumulative frequency on the vertical axis or y-axis.
3. Plot the points (x, y) in the coordinate plane where x represents the upper limit value and
y represents the cumulative frequency.
4. Finally, join the points and draw the smooth curve.
5. The curve so obtained gives a cumulative frequency distribution graph of less than type.
To draw a cumulative frequency distribution graph of less than type, consider the following
cumulative frequency distribution table which gives the number of participants in any level of
essay writing competition according to their age:
Table 1 Cumulative Frequency distribution table of less than type

Level of Essay | Age Group (class interval) | Age group | Number of participants (Frequency) | Cumulative Frequency
Level 1 | 10-15 | Less than 15 | 20 | 20
Level 2 | 15-20 | Less than 20 | 32 | 52
Level 3 | 20-25 | Less than 25 | 18 | 70
Level 4 | 25-30 | Less than 30 | 30 | 100

On plotting the corresponding points from Table 1, we obtain the less-than type cumulative frequency curve (ogive).

Steps to Construct Greater than Cumulative Frequency Curve


The steps to construct the more than/greater than cumulative frequency curve are as follows:
1. Mark the lower limit on the horizontal axis.
2. Mark the cumulative frequency on the vertical axis.
3. Plot the points (x, y) in the coordinate plane where x represents the lower limit value, and
y represents the cumulative frequency.
4. Finally, draw the smooth curve by joining the points.
5. The curve so obtained gives the cumulative frequency distribution graph of more than type.
To draw a cumulative frequency distribution graph of more than type, consider the same
cumulative frequency distribution table, which gives the number of participants in any level of
essay writing competition according to their age:
Table 2 Cumulative Frequency distribution table of more than type
Level of Essay | Age Group (class interval) | Age group | Number of participants (Frequency) | Cumulative Frequency
Level 1 | 10-30 | More than 10 | 20 | 100
Level 2 | 15-30 | More than 15 | 32 | 80
Level 3 | 20-30 | More than 20 | 18 | 48
Level 4 | 25-30 | More than 25 | 30 | 30

On plotting these points, we obtain the more-than type cumulative frequency curve (ogive).

These graphs are helpful in figuring out the median of a given data set. The median can be found by drawing both types of cumulative frequency distribution curves on the same graph: the value at the point of intersection of the two curves gives the median of the given set of data. For the data in Table 1, the median age can be read off from this point of intersection.
Example on Cumulative Frequency
Example:
Create a cumulative frequency table for the following information, which represent the number of
hours per week that Arjun plays indoor games:
Arjun’s game time:

Days No. of Hours

Monday 2 hrs

Tuesday 1 hr

Wednesday 2 hrs

Thursday 3 hrs

Friday 4 hrs
Saturday 2 hrs

Sunday 6 hrs

Solution:
Let the no. of hours be the frequency.
Hence, the cumulative frequency table is calculated as follows:

Days No. of Hours (Frequency) Cumulative Frequency

Monday 2 hrs 2

Tuesday 1 hr 2+1 = 3

Wednesday 2 hrs 3+2 = 5

Thursday 3 hrs 5+3 = 8

Friday 4 hrs 8+4 = 12

Saturday 2 hrs 12+2 = 14

Sunday 6 hrs 14+6 = 20

Therefore, Arjun spends 20 hours in a week to play indoor games.


What is Nominal Data?
Nominal data is a type of data classification used in statistical analysis to categorize variables
without assigning any quantitative value. This form of data is identified by labels or names that
serve the sole purpose of distinguishing one group from another, without suggesting any form of
hierarchy or order among them. The essence of nominal data lies in its ability to organize data into
discrete categories, making it easier for researchers and analysts to sort, identify, and analyze
variables based on qualitative rather than quantitative attributes. Such categorization is
fundamental in various research fields, enabling the collection and analysis of data related to
demographics, preferences, types, and other non-numeric characteristics.

Nominal data example:


Types of Payment Methods - Credit Card, Debit Card, Cash, Electronic Wallet. Each payment
method represents a distinct category that helps in identifying consumer preferences in transactions
without implying any numerical value or order among the options.

Ordinal data example:


Customer Satisfaction Ratings - Very Unsatisfied, Unsatisfied, Neutral, Satisfied, Very Satisfied.
This classification not only categorizes responses but also implies a clear order or ranking from
least to most satisfied, distinguishing it from nominal data by introducing a hierarchy among the
categories.

The significance of understanding what is nominal data extends beyond mere classification; it
impacts how data is interpreted and the statistical methods applied to it. Since nominal data does
not imply any numerical relationship or order among its categories, traditional measures of central
tendency like mean or median are not applicable.

Characteristics of Nominal Data


Nominal data, distinguished by its role in categorizing and labeling, has several defining
characteristics that set it apart from other data types. These characteristics are essential for
researchers to understand as they dictate how nominal data can be collected, analyzed, and
interpreted. Below are the key characteristics of nominal data:

● Categorical Classification:
Nominal data is used to categorize variables into distinct groups based on qualitative
attributes, without any numerical significance or inherent order.
● Mutually Exclusive:
Each data point can belong to only one category, ensuring clear and precise classification
without overlap between groups.
● No Order or Hierarchy:
The categories within nominal data do not have a ranked sequence or hierarchy; all
categories are considered equal but different.
● Identified by Labels:
Categories are often identified using names or labels, which can occasionally include
numbers used as identifiers rather than quantitative values.
● Limited Statistical Analysis:
Analysis of nominal data primarily involves counting frequency, determining mode, and
using chi-square tests, as measures of central tendency like mean or median are not
applicable.

Analysis of Nominal Data


Analyzing nominal data involves techniques that are tailored to its qualitative nature and the
characteristics that define what is nominal data. Since nominal data categorizes variables without
implying any numerical value or order, the analysis focuses on identifying patterns, distributions,
and relationships within the categorical data. Here's how nominal data is typically analyzed:

● Frequency Distribution:
One of the most common methods of analyzing nominal data is to count the frequency of
occurrences in each category. This helps in understanding the distribution of data across
the different categories. For instance, in a nominal data example like survey responses on
preferred types of cuisine, frequency distribution would reveal how many respondents
prefer each type of cuisine.
● Mode Determination:
The mode, or the most frequently occurring category in the dataset, is a key measure of
central tendency that can be applied to nominal data. It provides insight into the most
common or popular category among the data points. For example, if analyzing nominal
data on pet ownership, the mode would indicate the most common type of pet among
participants.
● Cross-tabulation:
Cross-tabulation involves comparing two or more nominal variables to identify
relationships between categories. This analysis can reveal patterns and associations that
are not immediately apparent. For instance, cross-tabulating nominal data on consumers'
favorite fast-food chains with their age groups could uncover preferences trends among
different age demographics.
● Chi-square Test:
For more complex analysis involving nominal data, the chi-square test is used to examine
the relationships between two nominal variables. It tests whether the distribution of
sample categorical data matches an expected distribution. As an example, researchers
might use a chi-square test to analyze whether there is a significant association between
gender (a nominal data example) and preference for a particular brand of product.
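A brief sketch of a chi-square test on nominal data, assuming pandas and SciPy are available; the survey responses below are invented solely to show the mechanics:

import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical survey data: gender vs. preferred payment method (both nominal variables)
df = pd.DataFrame({
    "gender":  ["M", "F", "F", "M", "F", "M", "M", "F", "F", "M", "F", "M"],
    "payment": ["Card", "Cash", "Card", "Cash", "Wallet", "Card",
                "Cash", "Card", "Wallet", "Card", "Cash", "Wallet"],
})

table = pd.crosstab(df["gender"], df["payment"])     # cross-tabulation of the two variables
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print("chi-square:", round(chi2, 3), "p-value:", round(p_value, 3))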

Examples
To illustrate the concept of nominal data more concretely, here are some practical examples that
showcase its application across various fields and contexts:

● Survey Responses on Favorite Color:


○ Categories:
Red, Blue, Green, Yellow, etc.
○ This nominal data example involves categorizing survey participants based on their
favorite color. Each color represents a distinct category without any implied
hierarchy or numerical value.
● Types of Pets Owned:
○ Categories:
Dog, Cat, Bird, Fish, None.
○ In a study on pet ownership, the types of pets individuals own are classified into
separate categories. Each category is mutually exclusive, highlighting the
categorical nature of nominal data.
● Vehicle Types in a Parking Lot:
○ Categories:
Car, Motorcycle, Bicycle, Truck.
○ Observing a parking lot to categorize vehicles by type is another nominal data
example. This involves identifying vehicles without assigning any order or
quantitative assessment to the categories.
● Nationality of Respondents in a Multinational Survey:
○ Categories:
American, Canadian, British, Australian, etc.
○ When conducting multinational surveys, researchers often categorize participants
by nationality. This classification is based solely on qualitative attributes,
underscoring the essence of what is nominal data.

Nominal Vs Ordinal Data


Understanding the difference between nominal and ordinal data is fundamental in the field of
statistics and research, as it influences the choice of analysis methods and how conclusions are
drawn from data. Here’s a comparison to highlight the key distinctions:
Feature | Nominal Data | Ordinal Data
Definition | Data categorized based on names or labels without any quantitative significance or inherent order. | Data categorized into ordered categories that indicate a sequence or relative ranking.
Nature | Qualitative | Qualitative, with an element of order
Order | No inherent order among categories | Inherent order or ranking among categories
Examples | Gender (Male, Female, Other), Blood type (A, B, AB, O) | Satisfaction level (High, Medium, Low), Education level (High School, Bachelor's, Master's, PhD)
Quantitative Value | None | Implied through the order of categories, but not precise
Analysis Techniques | Frequency counts, mode, chi-square tests | Median, percentile, rank correlation, non-parametric tests
Application | Used for categorizing data without any need for ranking. | Used when data classification requires a hierarchy or ranking.

Interpreting Distributions

1. Normal Distribution (Gaussian)

- Symmetric, bell-shaped
- Mean = Median = Mode
- Characteristics:
- Most data points cluster around mean
- Tails decrease exponentially
- 68% data within 1 standard deviation
- 95% data within 2 standard deviations
- Examples: Height, IQ scores, measurement errors

2. Skewed Distribution
- Asymmetric, tails on one side
- Types:
- Positive Skew: Tail on right side (e.g., income distribution, wealth distribution)
- Negative Skew: Tail on left side (e.g., failure time distribution, response times)
- Characteristics:
- Mean ≠ Median ≠ Mode
- Tails are longer on one side
- Data is concentrated on one side
- Examples: Income, wealth, failure times

3. Bimodal Distribution

- Two distinct peaks


- Characteristics:
- Two modes (local maxima)
- Valley between peaks
- Data has two distinct groups
- Examples: Customer segmentation, gene expression data

4. Multimodal Distribution

- Multiple peaks
- Characteristics:
- Multiple modes (local maxima)
- Multiple valleys
- Data has multiple distinct groups
- Examples: Gene expression data, text analysis

5. Uniform Distribution

- Equal probability across range


- Characteristics:
- Constant probability density
- No distinct modes or peaks
- Data is evenly distributed
- Examples: Random number generation, simulation studies

6. Exponential Distribution

- Rapid decline, long tail


- Characteristics:
- High probability of small values
- Low probability of large values
- Memoryless property
- Examples: Failure time analysis, reliability engineering

7. Power Law Distribution


- Heavy-tailed, few extreme values
- Characteristics:
- Few very large values
- Many small values
- Scale-free property
- Examples: City population sizes, word frequencies

8. Lognormal Distribution

- Log-transformed normal distribution


- Characteristics:
- Positive values only
- Skewed to right
- Logarithmic transformation yields normal distribution
- Examples: Stock prices, income distribution

9. Binomial Distribution

- Discrete, two outcomes


- Characteristics:
- Fixed number of trials (n)
- Probability of success (p)
- Number of successes (k)
- Examples: Coin toss, medical diagnosis
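To get a feel for these shapes, one can draw random samples from a few of them with NumPy and compare the mean with the median: for symmetric distributions the two are close, while a right skew pulls the mean above the median. This is only an illustrative sketch and the parameters are arbitrary choices:

import numpy as np

rng = np.random.default_rng(1)
samples = {
    "normal":      rng.normal(loc=0, scale=1, size=10_000),
    "uniform":     rng.uniform(low=0, high=1, size=10_000),
    "exponential": rng.exponential(scale=1.0, size=10_000),      # right-skewed
    "lognormal":   rng.lognormal(mean=0, sigma=1, size=10_000),  # right-skewed
}

for name, x in samples.items():
    print(f"{name:12s} mean = {x.mean():6.3f}   median = {np.median(x):6.3f}")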
What are Data Types?
Data is largely divided into two major categories, quantitative and qualitative. They are further
divided into other parts. Refer to the graph given below for reference –
Types of Data
● Quantitative data: This type of data consists of numerical values that can be measured or
counted. Examples include time, speed, temperature, and the number of items.
● Qualitative data: This type includes non-numerical values representing qualities or
attributes. Examples include colors, yes or no responses, and opinions.
Types of Quantitative Data
● Discrete data: This refers to separate and distinct values, typically counting numbers. For
instance, the numbers on a dice or the number of students in a class are discrete data points.
● Continuous data: This type of data can take on any value within a range and be measured
with high precision. Examples include height, weight, and temperature.
Data Types Based on Level of Measurement
Data can be further classified into four types based on the level of measurement: nominal, ordinal,
interval, and ratio.
● Nominal data: This represents categorical information without any inherent order or
ranking. Examples include gender, religion, or marital status.
● Ordinal data: This type has a defined order or ranking among the values. Examples include
exam grades (A, B, C) or positions in a competition (1st place, 2nd place, 3rd place).
● Interval data: Interval data has a defined order and equal intervals between the values. An
example is the Celsius temperature scale, where the difference between 30°C and 20°C is
the same as the difference between 20°C and 10°C.
● Ratio data: Ratio data possesses all the characteristics of interval data but has a meaningful
zero point. In addition to setting up inequalities, ratios can also be formed with this data
type. Examples include height, weight, or income.
What is Measure of Central Tendency?
We should first understand the term Central Tendency. Data tend to accumulate around the average
value of the total data under consideration. Measures of central tendency will help us to find the
middle, or the average, of a data set. If most of the data is centrally located and there is a very small spread, it will form a symmetric bell curve. In such conditions the values of mean, median and mode are equal.
Mean, Median, Mode
Let’s understand the definition and role of mean, median and mode with the help of examples –
Mean
It is the average of values. Consider 3 temperature values 30 °C, 40 °C and 50 °C; then the mean is (30 + 40 + 50)/3 = 40 °C.
Median
It is the centrally located value of the data set sorted in ascending order. Consider 11 (ODD) values
1,2,3,7,8,3,2,5,4,15,16. We first sort the values in ascending order 1,2,2,3,3,4,5,7,8,15,16 then the
median is 4 which is located at the 6th number and will have 5 numbers on either side.

If the data set is having an even number of values then the median can be found by taking the
average of the two middle values. Consider 10 (EVEN) values 1,2,3,7,8,3,2,5,4,15. We first sort
the values in ascending order 1,2,2,3,3,4,5,7,8,15 then the median is (3+4)/2=3.5 which is the
average of the two middle values i.e. the values which are located at the 5th and 6th number in the
sequence and will have 4 numbers on either side.
Mode
It is the most frequent value in the data set. We can easily get the mode by counting the frequency
of occurrence. Consider a data set with the values 1,5,5,6,8,2,6,6. In this data set, we can observe
the following,

The value 6 occurs the most hence the mode of the data set is 6.
We often test our data by plotting the distribution curve. If most of the values are centrally located and very few values are far from the center, we say that the data has a normal distribution; in that case the values of the mean, median, and mode are almost equal.

However, when our data is skewed, for example, as with the right-skewed data set below:
We can say that the mean is being dragged in the direction of the skew. In this skewed distribution,
mode < median < mean. The more skewed the distribution, the greater the difference between the median and mean, so in such cases we rely on the median. The classic example of a right-skewed distribution is employee salaries, where a few high earners give a misleading picture of the typical income if the mean salary is reported instead of the median.
For left-skewed distribution mean < median < mode. In such a case also, we emphasize the median
value of the distribution.
Mean, Median & Mode Example
To understand this let us consider an example. An OTT platform company has conducted a survey
in a particular region based on the watch time, language of streaming, and age of the viewer. For
our understanding, we have taken a sample of 10 people.
df=pd.read_csv("[Link]")
df

df["Watch Time"].mean()
2.5

df["Watch Time"].mode()
0 1.5
dtype: float64

df["Watch Time"].median()
2.0
From these values we can conclude that the mean watch time is 2.5 hours, which appears reasonable. For the age of the viewers, the following results are obtained:
df["Age"].median()
12.5

df["Age"].mean()
19.9
df["Age"].mode()
0 12
1 15
dtype: int64
The value of the mean age looks somewhat removed from the actual data: most of the viewers are in the range of 10 to 15, while the mean comes out at 19.9. This is because of the outliers present in the data set. We can easily find the outliers using a boxplot.
[Link](df['Age'], orient='vertical')

If we observe the value of the median age, the result looks correct; the mean is very sensitive to outliers.
For the most popular language, we cannot calculate the mean or median since this is nominal data.
[Link](x="Language",y="Age",data=df)
[Link](x="Language",y="Watch Time",data=df)

If we observe the graphs, the Tamil bar is the largest in both the Language vs Age and Language vs Watch Time plots. But this is misleading, because there is only one person who watches shows in Tamil.
df["Language"].value_counts()
Hindi 4
English 3
Tamil 1
Telgu 1
Marathi 1
Name: Language, dtype: int64
df["Language"].mode()
0 Hindi
dtype: object
Result
From the above result, it is concluded that the most popular language is Hindi; this is observed when we find the mode of the data set.
Hence, from the above observations, we conclude that in the sample survey the typical viewer is about 12.5 years old (median age) and watches for about 2.5 hours daily, mostly in the Hindi language.
We can say there is no single best measure of central tendency, because the appropriate choice always depends on the type of data. For ordinal data, and for interval and ratio data that is skewed, the median is preferred. For nominal data, the mode is preferred, and for interval and ratio data that is not skewed, the mean is preferred.
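Since the CSV file used above is not included here, the following sketch rebuilds a similar 10-person sample with made-up values chosen to give the same summary statistics quoted above, and repeats the mean/median/mode calculations with pandas:

import pandas as pd

df = pd.DataFrame({
    "Age":        [10, 11, 12, 12, 12, 13, 14, 15, 45, 55],          # two older outliers
    "Watch Time": [1.5, 1.5, 1.5, 1.5, 2.0, 2.0, 2.5, 3.0, 4.0, 5.5],
    "Language":   ["Hindi", "Hindi", "Hindi", "Hindi", "English",
                   "English", "English", "Marathi", "Tamil", "Telugu"],
})

print(df["Watch Time"].mean(), df["Watch Time"].median())   # 2.5 and 2.0
print(df["Age"].mean(), df["Age"].median())                 # 19.9 and 12.5 (mean inflated by outliers)
print(df["Language"].mode()[0])                             # Hindi, the most frequent category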
Measures of Central Tendency and Dispersion
Dispersion measures indicate how data values are spread out. The range, which is the difference
between the highest and lowest values, is a simple measure of dispersion. The standard deviation
measures the expected difference between a data value and the mean.
1. What does a frequency distribution represent? a) The raw data in a tabular form
b) The count of how often each value appears in the data set
c) The mean of a dataset
d) The median of a dataset
Answer: b) The count of how often each value appears in the data set
2. What is an outlier in a data set? a) A data point that occurs most frequently
b) A data point that lies significantly outside the range of the rest of the data
c) A point at the median
d) A value that represents the average
Answer: b) A data point that lies significantly outside the range of the rest of the data
3. Which type of frequency distribution is used for nominal data? a) Cumulative
frequency distribution
b) Relative frequency distribution
c) Frequency distribution for nominal data
d) Continuous frequency distribution
Answer: c) Frequency distribution for nominal data
4. In a relative frequency distribution, what does the frequency of each class
represent? a) The total number of observations
b) The percentage of the total data points in each class
c) The number of outliers in the data
d) The cumulative count of data points
Answer: b) The percentage of the total data points in each class
5. What is a cumulative frequency distribution? a) A distribution showing the sum of
frequencies for all values less than or equal to each class
b) A distribution showing only the most frequent values
c) A distribution based on nominal data
d) A distribution used only for large datasets
Answer: a) A distribution showing the sum of frequencies for all values less than or
equal to each class
6. What is the primary purpose of interpreting a frequency distribution? a) To find the
most frequent value
b) To understand the shape and patterns in the data
c) To find the median
d) To identify the largest value in the dataset
Answer: b) To understand the shape and patterns in the data
7. Which of the following graphs is commonly used to display frequency distributions?
a) Pie chart
b) Histogram
c) Scatter plot
d) Line graph
Answer: b) Histogram
8. What does a bar graph represent in terms of frequency data? a) The relationship
between two continuous variables
b) The distribution of values in nominal data
c) The cumulative frequency of data
d) The mean of a dataset
Answer: b) The distribution of values in nominal data
9. Which measure of central tendency represents the value that appears most
frequently in the dataset? a) Median
b) Mode
c) Mean
d) Range
Answer: b) Mode
10. What is the median in a dataset? a) The value that occurs most often
b) The average of all values
c) The middle value when the data is arranged in ascending order
d) The sum of all values divided by the number of observations
Answer: c) The middle value when the data is arranged in ascending order
11. Which of the following is true about the mean of a dataset? a) It is always the same as
the median
b) It is the middle value of the data
c) It can be affected by extreme values (outliers)
d) It is the most frequent value
Answer: c) It can be affected by extreme values (outliers)
12. What is the purpose of creating a frequency distribution for nominal data? a) To
calculate the mean
b) To show the distribution of categories in the data
c) To determine outliers
d) To compute the cumulative frequency
Answer: b) To show the distribution of categories in the data
13. Which of the following graphs is best suited for displaying cumulative frequency
distributions? a) Pie chart
b) Histogram
c) Ogive
d) Box plot
Answer: c) Ogive
14. Which of the following is a characteristic of the mode in a dataset? a) It is always
unique
b) It can have more than one value in bimodal or multimodal distributions
c) It is unaffected by outliers
d) It is the average of the data
Answer: b) It can have more than one value in bimodal or multimodal distributions
15. In a frequency distribution, the relative frequency is expressed as: a) A fraction of
the total data points
b) The total number of observations
c) A count of the most frequent value
d) The cumulative sum of all frequencies
Answer: a) A fraction of the total data points
16. Which measure of central tendency is most affected by extreme outliers? a) Mode
b) Mean
c) Median
d) Range
Answer: b) Mean
17. What is the difference between the cumulative frequency and the relative
cumulative frequency? a) Cumulative frequency represents the total number of data
points, while relative cumulative frequency represents the percentage
b) Cumulative frequency shows frequencies for each class, while relative cumulative
frequency shows cumulative count
c) Cumulative frequency is used for nominal data, while relative cumulative frequency is
for continuous data
d) There is no difference; both terms mean the same thing
Answer: a) Cumulative frequency represents the total number of data points, while
relative cumulative frequency represents the percentage
18. Which of the following graphs is used to represent the distribution of data over
time? a) Line graph
b) Histogram
c) Scatter plot
d) Box plot
Answer: a) Line graph
19. Which of the following measures of central tendency is best used for ordinal data? a)
Mean
b) Median
c) Mode
d) Range
Answer: b) Median
20. In a frequency distribution, if the data is heavily skewed, which measure of central
tendency would provide the best representation of the "center"? a) Mode
b) Median
c) Mean
d) Standard deviation
Answer: b) Median
21. Which measure of central tendency is appropriate when the data is skewed or has
outliers? a) Mean
b) Median
c) Mode
d) Variance
Answer: b) Median
22. Which of the following is an example of a continuous frequency distribution? a) The
number of people in different age groups
b) The number of red cars in a parking lot
c) The range of temperatures over a week
d) The number of students in each grade
Answer: c) The range of temperatures over a week
23. What does an ogive graph show? a) The distribution of nominal data
b) The cumulative frequency of the data
c) The most frequent value in the data
d) The median and mean
Answer: b) The cumulative frequency of the data
24. Which of the following methods is used to find the median in a frequency
distribution? a) Add all values and divide by the total number of observations
b) Find the value that divides the data into two equal parts
c) Find the mode of the data
d) Identify the value that appears most often
Answer: b) Find the value that divides the data into two equal parts
25. In which type of distribution would you expect to see multiple modes? a) Uniform
distribution
b) Normal distribution
c) Bimodal or multimodal distribution
d) Skewed distribution
Answer: c) Bimodal or multimodal distribution

5 Mark Questions:

1. Explain the concept of frequency distributions and discuss their significance in data
analysis.
2. Define outliers and explain how they can affect frequency distributions and statistical
measures like the mean.
3. What is the difference between cumulative frequency distribution and relative frequency
distribution? Explain with examples.
4. How can frequency distributions be used to interpret nominal data? Provide examples of
when this is useful.
5. Describe how to calculate and interpret the mean, median, and mode in a data set.
Discuss situations where one might be more useful than the others.

10 Mark Questions:

1. Explain the process of creating a frequency distribution for a given dataset and interpret
the results. Include examples of nominal and continuous data.
2. Discuss the role of graphs in the interpretation of frequency distributions. Compare and
contrast histograms, bar graphs, and ogives.
3. How do outliers impact the mean, median, and mode? Explain with examples, and
discuss methods for identifying and handling outliers in data analysis.
4. Describe the concept of cumulative frequency distribution. How do you construct one,
and what insights can it provide about a data set?
5. Explain how to calculate and interpret the different measures of central tendency (mean,
median, mode) for a given dataset. In which situations would each measure be most
appropriate?

UNIT -III
Normal distributions

Normal Distribution is the most common or normal form of distribution of Random Variables,
hence the name “normal distribution.” It is also called Gaussian Distribution in Statistics or
Probability. We use this distribution to represent a large number of random variables. It serves as
a foundation for statistics and probability theory.
It also describes many natural phenomena, forms the basis of the Central Limit Theorem, and also
supports numerous statistical methods.
The normal distribution is the most important and most widely used distribution in statistics. It is
sometimes called the “bell curve,” although the tonal qualities of such a bell would be less than
pleasing. It is also called the “Gaussian curve” or Gaussian distribution after the mathematician
Karl Friedrich Gauss.

Strictly speaking, it is not correct to talk about “the normal distribution” since there are many
normal distributions. Normal distributions can differ in their means and in their standard
deviations. Figure 4.1 shows three normal distributions. The blue (left-most) distribution has a
mean of −3 and a standard deviation of 0.5, the distribution in red (the middle distribution) has a
mean of 0 and a standard deviation of 1, and the black (right-most) distribution has a mean of 2
and a standard deviation of 3. These as well as all other normal distributions are symmetric with
relatively more values at the center of the distribution and relatively few in the tails. What is
consistent about all normal distributions is the shape and the proportion of scores within a given
distance along the x-axis. We will focus on the standard normal distribution (also known as the
unit normal distribution), which has a mean of 0 and a standard deviation of 1 (i.e., the red
distribution in Figure 4.1).
Figure 4.1. Normal distributions differing in mean and standard deviation. (“Normal Distributions
with Different Means and Standard Deviations” by Judy Schmitt is licensed under CC BY-NC-SA
4.0.)
Seven features of normal distributions are listed below.
1. Normal distributions are symmetric around their mean.
2. The mean, median, and mode of a normal distribution are equal.
3. The area under the normal curve is equal to 1.0.
4. Normal distributions are denser in the center and less dense in the tails.

5. Normal distributions are defined by two parameters, the mean (μ) and the standard
deviation (σ).
6. 68% of the area of a normal distribution is within one standard deviation of the mean.
7. Approximately 95% of the area of a normal distribution is within two standard
deviations of the mean.

These properties enable us to use the normal distribution to understand how scores relate to one
another within and across a distribution. But first, we need to learn how to calculate the
standardized score that makes up a standard normal distribution.

Z Scores
A z score is a standardized version of a raw score (x) that gives information about the relative
location of that score within its distribution. The formula for converting a raw score into a z score
is

z = (x − μ) / σ

for values from a population and

z = (x − M) / s

for values from a sample.

As you can see, z scores combine information about where the distribution is located (the
mean/center) with how wide the distribution is (the standard deviation/spread) to interpret a raw
score (x). Specifically, z scores will tell us how far the score is away from the mean in units of
standard deviations and in what direction.

The value of a z score has two parts: the sign (positive or negative) and the magnitude (the actual
number). The sign of the z score tells you in which half of the distribution the z score falls: a
positive sign (or no sign) indicates that the score is above the mean and on the right-hand side or
upper end of the distribution, and a negative sign tells you the score is below the mean and on the
left-hand side or lower end of the distribution. The magnitude of the number tells you, in units of
standard deviations, how far away the score is from the center or mean. The magnitude can take
on any value between negative and positive infinity, but for reasons we will see soon, they
generally fall between −3 and 3.

Let’s look at some examples. A z score value of −1.0 tells us that this z score is 1 standard deviation
(because of the magnitude 1.0) below (because of the negative sign) the mean. Similarly, a z score
value of 1.0 tells us that this z score is 1 standard deviation above the mean. Thus, these two scores
are the same distance away from the mean but in opposite directions. A z score of −2.5 is two-and-
a-half standard deviations below the mean and is therefore farther from the center than both of the
previous scores, and a z score of 0.25 is closer than all of the ones before. In Unit 2, we will learn
to formalize the distinction between what we consider “close to” the center or “far from” the center.
For now, we will use a rough cut-off of 1.5 standard deviations in either direction as the difference
between close scores (those within 1.5 standard deviations or between z = −1.5 and z = 1.5) and
extreme scores (those farther than 1.5 standard deviations—below z = −1.5 or above z = 1.5).

We can also convert raw scores into z scores to get a better idea of where in the distribution those
scores fall. Let’s say we get a score of 68 on an exam. We may be disappointed to have scored so
low, but perhaps it was just a very hard exam. Having information about the distribution of all
scores in the class would be helpful to put some perspective on ours. We find out that the class got
an average score of 54 with a standard deviation of 8. To find out our relative location within this
distribution, we simply convert our test score into a z score: z = (68 − 54) / 8 = 14 / 8 = 1.75.
We find that we are 1.75 standard deviations above the average, above our rough cut-off for close
and far. Suddenly our 68 is looking pretty good!
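The arithmetic is easy to check by hand, but a minimal Python sketch of the same calculation (using only the mean, standard deviation, and raw score given above) looks like this:

mean, sd = 54, 8        # class mean and standard deviation
raw = 68                # our exam score
z = (raw - mean) / sd
print(z)                # 1.75 -> well beyond the rough +/-1.5 cut-off for "close" scores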

Figure 4.2 shows both the raw score and the z score on their respective distributions. Notice that
the red line indicating where each score lies is in the same relative spot for both. This is because
transforming a raw score into a z score does not change its relative location, it only makes it easier
to know precisely where it is.
Figure 4.2. Raw and standardized versions of a single score. (“Raw and Standardized Versions of
a Score” by Judy Schmitt is licensed under CC BY-NC-SA 4.0.)

z Scores are also useful for comparing scores from different distributions. Let’s say we take the
SAT and score 501 on both the math and critical reading sections. Does that mean we did equally
well on both? Scores on the math portion are distributed normally with a mean of 511 and standard
deviation of 120, so our z score on the math section is

z_math = (501 − 511) / 120 ≈ −0.08,

which is just slightly below average (note the use of “math” as a subscript; subscripts are used
when presenting multiple versions of the same statistic in order to know which one is which and
have no bearing on the actual calculation). The critical reading section has a mean of 495 and
standard deviation of 116, so

z_reading = (501 − 495) / 116 ≈ 0.05.
So even though we were almost exactly average on both tests, we did a little bit better on the
critical reading portion relative to other people.

Finally, z scores are incredibly useful if we need to combine information from different measures
that are on different scales. Let’s say we give a set of employees a series of tests on things like job
knowledge, personality, and leadership. We may want to combine these into a single score we can
use to rate employees for development or promotion, but look what happens when we take the
average of raw scores from different scales, as shown in Table 4.1.

Table 4.1. Raw test scores on different scales (ranges in parentheses).

Employee      Job Knowledge (0–100)   Personality (1–5)   Leadership (1–5)   Average
Employee 1    98                      4.2                 1.1                34.43
Employee 2    96                      3.1                 4.5                34.53
Employee 3    97                      2.9                 3.6                34.50

Because the job knowledge scores were so big and the scores were so similar, they overpowered
the other scores and removed almost all variability in the average. However, if we standardize
these scores into z scores, our averages retain more variability and it is easier to assess differences
between employees, as shown in Table 4.2.

Table 4.2. Standardized scores.

Employee      Job Knowledge (0–100)   Personality (1–5)   Leadership (1–5)   Average
Employee 1    1.00                    1.14                −1.12              0.34
Employee 2    −1.00                   −0.43               0.81               −0.20
Employee 3    0.00                    −0.71               0.30               −0.14
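A minimal NumPy sketch (an illustration, not part of the original text) reproduces Table 4.2 from the raw scores in Table 4.1 by standardizing each column with its sample standard deviation (ddof=1):

import numpy as np

# Raw scores from Table 4.1: rows = employees, columns = job knowledge, personality, leadership
raw = np.array([[98, 4.2, 1.1],
                [96, 3.1, 4.5],
                [97, 2.9, 3.6]])

# Standardize each column: subtract the column mean, divide by the sample standard deviation
z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)
print(z.round(2))               # matches Table 4.2
print(z.mean(axis=1).round(2))  # per-employee averages: [ 0.34 -0.2  -0.14]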

Setting the Scale of a Distribution


Another convenient characteristic of z scores is that they can be converted into any “scale” that we
would like. Here, the term scale means how far apart the scores are (their spread) and where they
are located (their central tendency). This can be very useful if we don’t want to work with negative
numbers or if we have a specific range we would like to present. The formulas for transforming z
to x are:

x = μ + zσ

for a population and

x = M + zs

for a sample. Notice that these are just simple rearrangements of the original formulas for
calculating z from raw scores.

Let’s say we create a new measure of intelligence, and initial calibration finds that our scores have
a mean of 40 and standard deviation of 7. Three people who have scores of 52, 43, and 34 want to
know how well they did on the measure. We can convert their raw scores into z scores:
A problem is that these new z scores aren’t exactly intuitive for many people. We can give people
information about their relative location in the distribution (for instance, the first person scored
well above average), or we can translate these z scores into the more familiar metric of IQ scores,
which have a mean of 100 and standard deviation of 16:

IQ = 1.71(16) + 100 = 127.36

IQ = 0.43(16) + 100 = 106.88

IQ = −0.80(16) + 100 = 87.20

We would also likely round these values to 127, 107, and 87, respectively, for convenience.
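A one-line Python sketch of the same rescaling (the z values are taken from the text above):

for z in (1.71, 0.43, -0.80):
    print(round(z * 16 + 100, 2))   # 127.36, 106.88, 87.2 -> IQ-style scores with mean 100, SD 16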

Z Scores and the Area under the Curve


z Scores and the standard normal distribution go hand-in-hand. A z score will tell you exactly
where in the standard normal distribution a value is located, and any normal distribution can be
converted into a standard normal distribution by converting all of the scores in the distribution into
z scores, a process known as standardization.

We saw in Chapter 3 that standard deviations can be used to divide the normal distribution: 68%
of the distribution falls within 1 standard deviation of the mean, 95% within (roughly) 2 standard
deviations, and 99.7% within 3 standard deviations. Because z scores are in units of standard
deviations, this means that 68% of scores fall between z = −1.0 and z = 1.0 and so on. We call this
68% (or any percentage we have based on our z scores) the proportion of the area under the curve.
Any area under the curve is bounded (defined, delineated, etc.) by a single z score or pair
of z scores.

An important property to point out here is that, by virtue of the fact that the total area under the
curve of a distribution is always equal to 1.0 (see section on Normal Distributions at the beginning
of this chapter), these areas under the curve can be added together or subtracted from 1 to find the
proportion in other areas. For example, we know that the area between z = −1.0 and z = 1.0 (i.e.,
within one standard deviation of the mean) contains 68% of the area under the curve, which can
be represented in decimal form as .6800. (To change a percentage to a decimal, simply move the
decimal point 2 places to the left.) Because the total area under the curve is equal to 1.0, that means
that the proportion of the area outside z = −1.0 and z = 1.0 is equal to 1.0 − .6800 = .3200 or 32%
(see Figure 4.3). This area is called the area in the tails of the distribution. Because this area is split
between two tails and because the normal distribution is symmetrical, each tail contains exactly one-
half of that 32%, or 16% of the total area under the curve.
Figure 4.3. Shaded areas represent the area under the curve in the tails. (“Area under the Curve in
the Tails” by Judy Schmitt is licensed under CC BY-NC-SA 4.0.)
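These proportions can also be checked numerically. A minimal sketch, assuming SciPy is available (scipy.stats.norm gives cumulative areas of the standard normal curve):

from scipy.stats import norm

middle = norm.cdf(1.0) - norm.cdf(-1.0)   # area between z = -1 and z = +1
print(round(middle, 4))                   # ~0.6827
print(round((1 - middle) / 2, 4))         # ~0.1587 in each tail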

We will have much more to say about this concept in the coming chapters. As it turns out, this is
a quite powerful idea that enables us to make statements about how likely an outcome is and what
that means for research questions we would like to answer and hypotheses we would like to test.

Why do normal distributions matter?


All kinds of variables in natural and social sciences are normally or approximately normally
distributed. Height, birth weight, reading ability, job satisfaction, or SAT scores are just a few
examples of such variables.

Because normally distributed variables are so common, many statistical tests are designed for
normally distributed populations.

Understanding the properties of normal distributions means you can use inferential statistics to
compare different groups and make estimates about populations using samples.

What are the properties of normal distributions?


Normal distributions have key characteristics that are easy to spot in graphs:
● The mean, median and mode are exactly the same.
● The distribution is symmetric about the mean—half the values fall below the mean and
half above the mean.
● The distribution can be described by two values: the mean and the standard deviation.

The mean is the location parameter while the standard deviation is the scale parameter.

The mean determines where the peak of the curve is centered. Increasing the mean moves the curve
right, while decreasing it moves the curve left.
The standard deviation stretches or squeezes the curve. A small standard deviation results in a
narrow curve, while a large standard deviation leads to a wide curve.

Empirical rule
The empirical rule, or the 68-95-99.7 rule, tells you where most of your values lie in a normal
distribution:
● Around 68% of values are within 1 standard deviation from the mean.
● Around 95% of values are within 2 standard deviations from the mean.
● Around 99.7% of values are within 3 standard deviations from the mean.

Example: Using the empirical rule in a normal distribution. You collect SAT scores from students in a
new test preparation course. The data follows a normal distribution with a mean score (M) of 1150 and
a standard deviation (SD) of 150.
Following the empirical rule:

● Around 68% of scores are between 1,000 and 1,300, 1 standard deviation above and below the
mean.
● Around 95% of scores are between 850 and 1,450, 2 standard deviations above and below the
mean.
● Around 99.7% of scores are between 700 and 1,600, 3 standard deviations above and below the
mean.

The empirical rule is a quick way to get an overview of your data and check for any outliers or
extreme values that don’t follow this pattern.

If data from small samples do not closely follow this pattern, then other distributions like the t-
distribution may be more appropriate. Once you identify the distribution of your variable, you can
apply appropriate statistical tests.
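A short Python sketch of the empirical-rule ranges for this example (M = 1150, SD = 150):

M, SD = 1150, 150
for k in (1, 2, 3):
    print(k, M - k * SD, M + k * SD)   # (1000, 1300), (850, 1450), (700, 1600)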
Central limit theorem
The central limit theorem is the basis for how normal distributions work in statistics.

In research, to get a good idea of a population mean, ideally you’d collect data from multiple
random samples within the population. A sampling distribution of the mean is the distribution of
the means of these different samples.

The central limit theorem shows the following:

● Law of Large Numbers: As you increase the sample size (or the number of samples), the
sample mean will approach the population mean.
● With multiple large samples, the sampling distribution of the mean is normally distributed,
even if your original variable is not normally distributed.

Parametric statistical tests typically assume that samples come from normally distributed
populations, but the central limit theorem means that this assumption isn’t necessary to meet when
you have a large enough sample.

You can use parametric tests for large samples from populations with any kind of distribution as
long as other important assumptions are met. A sample size of 30 or more is generally considered
large.

For small samples, the assumption of normality is important because the sampling distribution of
the mean isn’t known. For accurate results, you have to be sure that the population is normally
distributed before you can use parametric tests with small samples.

Formula of the normal curve


Once you have the mean and standard deviation of a normal distribution, you can fit a normal
curve to your
data using a probability density function.

In a probability density function, the area under the curve tells you probability. The normal
distribution is a probability distribution, so the total area under the curve is always 1 or 100%.

The formula for the normal probability density function looks fairly complicated. But to use it,
you only need to know the population mean and standard deviation.

For any value of x, you can plug in the mean and standard deviation into the formula to find the
probability density of the variable taking on that value of x.

Normal probability density formula:

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

where:
● f(x) = probability density
● x = value of the variable
● μ = mean
● σ = standard deviation
● σ² = variance
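A minimal NumPy sketch of this density function (an illustration; the function name normal_pdf is our own):

import numpy as np

def normal_pdf(x, mu, sigma):
    # Density of a normal distribution with mean mu and standard deviation sigma, evaluated at x
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

print(normal_pdf(1150, 1150, 150))   # density at the mean of the SAT example, about 0.00266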

Example: Using the probability density function. You want to know the probability that SAT scores in
your sample exceed 1380.
On your graph of the probability density function, the probability is the shaded area under the curve that
lies to the right of where your SAT scores equal 1380.

You can find the probability value of this score using the standard normal distribution.

What is the standard normal distribution?


The standard normal distribution, also called the z-distribution, is a special normal distribution
where the mean is 0 and the standard deviation is 1.
Every normal distribution is a version of the standard normal distribution that’s been stretched or
squeezed and moved horizontally right or left.

While individual observations from normal distributions are referred to as x, they are referred to
as z in the z-distribution. Every normal distribution can be converted to the standard normal
distribution by turning the individual values into z-scores.

Z-scores tell you how many standard deviations away from the mean each value lies.
You only need to know the mean and standard deviation of your distribution to find the z-score of
a value.

Z-score formula: z = (x − μ) / σ

where:

● x = individual value
● μ = mean
● σ = standard deviation

We convert normal distributions into the standard normal distribution for several reasons:

● To find the probability of observations in a distribution falling above or below a given


value.
● To find the probability that a sample mean significantly differs from a known population
mean.
● To compare scores on different distributions with different means and standard deviations.

Finding probability using the z-distribution


Each z-score is associated with a probability, or p-value, that tells you the likelihood of values
below that z-score occurring. If you convert an individual value into a z-score, you can then find
the probability of all values up to that value occurring in a normal distribution.

Example: Finding probability using the z-distribution. To find the probability of SAT scores in your
sample exceeding 1380, you first find the z-score.
The mean of our distribution is 1150, and the standard deviation is 150. The z-score tells you how many
standard deviations away 1380 is from the mean.

z = (x − μ) / σ = (1380 − 1150) / 150 ≈ 1.53

For a z-score of 1.53, the p-value is 0.937. This is the probability of SAT scores being 1380 or less
(93.7%), and it’s the area under the curve left of the shaded area.

To find the shaded area, you take away 0.937 from 1, which is the total area under the curve.
Probability of x > 1380 = 1 – 0.937 = 0.063

That means only about 6.3% of SAT scores in your sample are expected to exceed 1380.
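The same tail probability can be computed directly, without a Z-table. A minimal sketch, assuming SciPy is available:

from scipy.stats import norm

# P(X > 1380) for SAT scores with mean 1150 and standard deviation 150
p = 1 - norm.cdf(1380, loc=1150, scale=150)
print(round(p, 3))   # ~0.063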

What is a Z-score?
A Z-score (also called a standard score) represents the number of standard deviations a data point
is from the mean of a distribution. It allows you to standardize data from different distributions so
that you can compare them directly.
• Formula for Z-score: Z = (X − μ) / σ, where:
o X is the value of the data point you're interested in,
o μ is the mean of the population (or sample),
o σ is the standard deviation of the population (or sample).
2. Interpreting a Z-score:
• A Z-score of 0 means that the data point is exactly at the mean.
• A positive Z-score indicates that the data point is above the mean.
• A negative Z-score indicates that the data point is below the mean.
• The magnitude of the Z-score tells you how far, in standard deviations, the data point is
from the mean.
3. Finding Z-scores:
To find a Z-score, you need to know the value you're analyzing (X), the mean (μ), and
the standard deviation (σ) of the population (or sample).
Example: Suppose you're analyzing the scores of students on a test where the mean score is 70
and the standard deviation is 10. If a student scored 85, their Z-score would be:
Z = (85 − 70) / 10 = 1.5
This means that the student's score is 1.5 standard deviations above the mean.
4. Using Z-scores to Find Proportions:
Z-scores are often used to find the proportion of data that falls below, above, or between certain
values in a normal distribution.
• To find the proportion of data below a given Z-score, you can use the standard normal
distribution table (also called the Z-table) or a calculator with statistical functions.
• A Z-table provides the cumulative probability (or proportion) to the left of a given Z-score
in a standard normal distribution (which has a mean of 0 and a standard deviation of 1).
Example: Finding the Proportion Below a Z-score
If you want to find the proportion of data points that are below a Z-score of 1.5, you would look
up the value for Z = 1.5 in the Z-table. The value for Z = 1.5 is
approximately 0.9332, which means that about 93.32% of the data falls below a Z-score of 1.5.
• This means that in a standard normal distribution, approximately 93.32% of the values are
less than or equal to 1.5 standard deviations above the mean.
5. Finding Scores from Z-scores:
You can also use Z-scores to find the original score (or data point) from a given Z-score.
Rearranging the Z-score formula:
X = Z · σ + μ
where:
• Z is the Z-score,
• σ is the standard deviation,
• μ is the mean,
• X is the original score.
Example: If the mean score is 70, the standard deviation is 10, and you want to find the score
corresponding to a Z-score of 2, the formula gives:
X = 2 · 10 + 70 = 90
So, a Z-score of 2 corresponds to a score of 90.
6. Applications of Z-scores:
• Comparing data from different distributions: Z-scores allow you to compare scores
from different datasets (or tests) even if the datasets have different means and standard
deviations. For example, you can compare a test score from one exam to a test score from
another exam by calculating the Z-scores for both.
• Identifying outliers: A Z-score that is significantly higher or lower than the mean (usually
greater than +2 or less than -2) can be considered an outlier, since it indicates that the data
point is far from the mean.
7. Standard Normal Distribution:
The Z-score is related to the standard normal distribution, which is a special case of the normal
distribution. In this distribution:
• The mean is always 0.
• The standard deviation is always 1.
• The shape of the curve is symmetric, with the majority of data points (about 68%) falling
within 1 standard deviation of the mean, 95% within 2 standard deviations, and 99.7%
within 3 standard deviations (the famous 68-95-99.7 rule).
8. Finding the Proportion Between Two Z-scores:
If you want to find the proportion of data between two Z-scores, you can use the following process:
• Look up the cumulative probability for each Z-score in the Z-table.
• Subtract the smaller cumulative probability from the larger cumulative probability.
Example: To find the proportion of data between Z = -1 and Z = 1:
• Find the cumulative probability for Z = 1, which is approximately 0.8413.
• Find the cumulative probability for Z = -1, which is approximately 0.1587.
• Subtract: 0.8413 − 0.1587 = 0.6826.
So, about 68.26% of the data falls between Z-scores of -1 and 1.
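A minimal sketch of the same lookup done in code (assuming SciPy; norm.cdf plays the role of the Z-table):

from scipy.stats import norm

below_upper = norm.cdf(1)    # ~0.8413
below_lower = norm.cdf(-1)   # ~0.1587
print(round(below_upper - below_lower, 4))   # ~0.6827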

Correlation in Data Science


At its core, correlation is a statistical measure that describes the relationship between two variables.
Think of it as a way to see how two things are connected.
For example, if you notice that ice cream sales tend to go up when the temperature rises, you’re
observing a correlation between these two variables: temperature and ice cream sales.
But it’s not just about noticing a connection. Correlation in data science quantifies the strength and
direction of that relationship. This means it can tell you not only if two variables are related but
also how strong that relationship is and whether one variable tends to increase or decrease as the
other does.
Types of Correlation in Data Science

There are three main types of correlation in data science: positive, negative, and zero.
1. Positive Correlation: This occurs when both variables move in the same direction. For
example, the more hours you study, the higher your test scores tend to be. Both studying
and scores are increasing together.
2. Negative Correlation: Here, the variables move in opposite directions. An example could
be the relationship between the amount of time spent watching TV and grades. Typically,
as TV time increases, grades might decrease.
3. Zero Correlation: This means there’s no relationship between the variables. For example,
there might be no correlation between the number of books you read and the color of your
car.
These types of correlation in data science help you determine which category two variables fall
under. This way of understanding the relationship between data falls under the analysis stage of the
data science process lifecycle.
How Do We Measure Correlation in Data Science?

The previous section gave you a basic understanding of what correlation in data science is and
what its types are. Now, let's get into the details of how you can measure correlation.

Scatter Plot
The scatter plot is one of the most important data visualization techniques and is considered one of
the Seven Basic Tools of Quality. A scatter plot plots the relationship between two variables on a
two-dimensional graph, known in mathematics as the Cartesian plane. It is generally used to plot the
relationship between one independent variable and one dependent variable, where the independent
variable is plotted on the x-axis and the dependent variable on the y-axis, so that you can visualize
the effect of the independent variable on the dependent variable. These plots are also known as
scatter plot graphs or scatter diagrams.
Applications of Scatter Plot
As already mentioned, a scatter plot is a very useful data visualization technique. A few
applications of Scatter Plots are listed below.
• Correlation Analysis: Scatter plot is useful in the investigation of the correlation between
two different variables. It can be used to find out whether two variables have a positive
correlation, negative correlation or no correlation.
• Outlier Detection: Outliers are data points, which are different from the rest of the data
set. A Scatter Plot is used to bring out these outliers on the surface.
• Cluster Identification: In some cases, scatter plots can help identify clusters or groups
within the data.
Scatter Plot Graph
A scatter plot is known by several other names, such as scatter chart, scattergram, scatter diagram,
and XY graph. A scatter plot is used to visualize a pair of data values such that each variable gets its
own axis; generally the independent variable gets the x-axis and the dependent variable gets the y-axis.
This layout makes it easier to see the kind of relationship that the plotted pair of variables holds. A
scatter plot is therefore useful when we have to find the relationship between two sets of data, or when
we suspect that there may be some relationship between two variables and that this relationship may
be the root cause of some problem.
Now let us understand how to construct a scatter plot and its use case via an example.
How to Construct a Scatter Plot?
To construct a scatter plot, we have to follow the given steps.
Step 1: Identify the independent and dependent variables
Step 2: Plot the independent variable on x-axis
Step 3: Plot the dependent variable on y-axis
Step 4: Extract the meaningful relationship between the given variables.
Let's understand the process through an example. In the following table, a data set of two variables
is given.

Matches Played 2 5 7 1 12 15 18

Goals Scored 1 4 5 2 7 12 11

In this data set there are two variables: the first is the number of matches played by a certain
player, and the second is the number of goals scored by that player. Suppose we aim to find the
relationship between the number of matches played and the number of goals scored. For now, let us
set aside our intuitive understanding that the number of goals scored is roughly proportional to the
number of matches played, and assume that we only have the given dataset from which to extract the
relationship between the data pair.

Types of Scatter Plot


On the basis of correlation of two variables, Scatter Plot can be classified into following types.
• Scatter Plot For Positive Correlation
• Scatter Plot For Negative Correlation
• Scatter Plot For Null Correlation
Scatter Plot For Positive Correlation
In this type of scatter plot, the value on the y-axis increases as we move from left to right. In more
technical terms, if one variable is directly proportional to another, the scatter plot will show positive
correlation. Positive correlation can be further classified into Perfect Positive, High Positive and
Low Positive.
Scatter Plot For Negative Correlation
In this type of scatter plot, the value on the y-axis decreases as we move from left to right. In other
words, the value of one variable decreases as the other increases. Negative correlation can be further
classified into Perfect Negative, High Negative and Low Negative.
Scatter Plot For Null Correlation
In this type of scatter-plot, values are scattered all over the graph. Generally this kind of graph
represents that there is no relationship between the two variables plotted on the Scatter Plot.
What is Scatter Plot Analysis?
Scatter plot analysis involves examining the distribution of the points and interpreting the overall
pattern to gain insights into the relationship between the variables. A scatter plot is used to visualize
the relationship between two variables, but real-life situations are rarely so ideal; often more than two
variables are correlated with each other.
In such situations, we use a scatter plot matrix. For n variables, the scatter plot matrix has n rows and
n columns, where the scatter plot of variables xi and xj is located at the ith row and jth column.
Solved Examples on Scatter Plot
Example 1: Draw a scatter plot for the given data that shows the number of IPL matches
played and runs scored in each instance.

Matches Played 10 12 14 16 18

Runs Scored 287 300 297 350 345

Solution:
X-axis: Number of Matches Played
Y-axis: Number of Runs Scored
Graph:
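The plotted graph itself is not reproduced here; a minimal matplotlib sketch that would draw it from the data above:

import matplotlib.pyplot as plt

matches = [10, 12, 14, 16, 18]
runs = [287, 300, 297, 350, 345]

plt.scatter(matches, runs)
plt.xlabel("Number of Matches Played")
plt.ylabel("Number of Runs Scored")
plt.title("Matches Played vs Runs Scored")
plt.show()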
Correlation Coefficient Definition
A statistical measure that quantifies the strength and direction of the linear relationship between
two variables is called the Correlation coefficient. Generally, it is denoted by the symbol ‘r’ and
ranges from -1 to 1.
What is Correlation Coefficient Formula?
The correlation coefficient is used to determine how strong the relationship between two variables
is. The procedure yields a value between -1 and 1, in which:
• -1 indicates a strong negative relationship
• 1 indicates strong positive relationships
• Zero implies no connection at all
Understanding Correlation Coefficient
• A correlation coefficient of -1 means that for every positive increase in one variable, there is a
decrease of a fixed proportion in the other. For example, the amount of gas in a tank decreases in
near-perfect correlation with the speed.
• A correlation coefficient of 1 means that for every positive increase in one variable, there is a
positive increase of a fixed proportion in the other. For example, shoe size goes up in nearly
perfect correlation with foot length.
• A correlation coefficient of 0 means that an increase in one variable is associated with neither an
increase nor a decrease in the other. The two just aren't related.
Types of Correlation Coefficient Formula
Various types of Correlation Coefficient formulas are:
Pearson’s Correlation Coefficient Formula
Pearson’s Correlation Coefficient Formula is added below:
r = [n∑xy − (∑x)(∑y)] / √{[n∑x² − (∑x)²] [n∑y² − (∑y)²]}
Sample Correlation Coefficient Formula
Sample Correlation Coefficient Formula is added below:
rxy = Sxy / (Sx · Sy)
where,
• Sxy is Covariance of Sample
• Sx and Sy are Standard Deviations of Sample
Population Correlation Coefficient Formula
Population Correlation Coefficient Formula is added below:
ρxy = σxy / (σx · σy)
where,
• σx and σy are Population Standard Deviations
• σxy is Population Covariance

Pearson’s Correlation
It is the most common correlation measure in statistics. Its full name is Pearson's Product-Moment
Correlation, in short PPMC. It describes the linear relation between two sets of data. Two symbols
are used to represent the Pearson correlation: the Greek letter rho (ρ) for a population and the letter
“r” for a sample correlation coefficient.
How to Find Pearson’s Correlation Coefficient?
Follow the steps added below to find the Pearson’s Correlation Coefficient of any given data set
Step 1: First, make a chart with the given data (subject, x, and y) and add three more columns to it:
xy, x² and y².
Step 2: Multiply the x and y columns to fill the xy column. For example, if x is 24 and y is 65, then
xy = 24 × 65 = 1560.
Step 3: Take the square of the numbers in the x column and fill the x² column.
Step 4: Take the square of the numbers in the y column and fill the y² column.
Step 5: Add up all the values in each column and put the result at the bottom. The Greek letter
sigma (Σ) is the short way of saying summation.
Step 6: Use the formula for Pearson's correlation coefficient:
r = [n∑xy − (∑x)(∑y)] / √{[n∑x² − (∑x)²] [n∑y² − (∑y)²]}
The sign of r tells you whether the correlation is positive or negative.
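Once the column totals are in hand, the same result can be obtained directly in Python. A minimal sketch, using hypothetical x and y values chosen purely for illustration:

import numpy as np

# Hypothetical paired data, for illustration only
x = np.array([24, 30, 36, 42, 48])
y = np.array([65, 70, 78, 80, 89])

# np.corrcoef returns the 2x2 correlation matrix; r is the off-diagonal entry
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))   # ~0.987 for these made-up values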

computational formula for correlation coefficient

To compute the Pearson correlation coefficient (denoted as r) manually, you can use the following
computational formula, which is often more practical than the general formula for calculations
involving large datasets. The computational formula for r is:

r = [n∑XY − (∑X)(∑Y)] / √{[n∑X² − (∑X)²] [n∑Y² − (∑Y)²]}

Where:
• n is the number of data points (or pairs of X and Y),
• X and Y are the individual data points of the two variables,
• ∑X is the sum of all X values,
• ∑Y is the sum of all Y values,
• ∑XY is the sum of the products of corresponding X and Y values,
• ∑X² is the sum of the squares of the X values,
• ∑Y² is the sum of the squares of the Y values.
Step-by-Step Calculation:
Let’s go over how to calculate r step-by-step with an example.
Example:
Suppose we have the following data on the number of hours of study (X) and test scores (Y):
Hours of Study (X) Test Score (Y)
1 55
2 60
3 65
4 70
5 75
We need to compute the Pearson correlation coefficient using the computational formula; the calculation is sketched below.
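For this table, the required sums are ∑X = 15, ∑Y = 325, ∑XY = 1025, ∑X² = 55 and ∑Y² = 21375 with n = 5, so r = (5·1025 − 15·325) / √[(5·55 − 15²)(5·21375 − 325²)] = 250 / 250 = 1, a perfect positive linear relationship. A minimal Python sketch of the same computation:

import numpy as np

X = np.array([1, 2, 3, 4, 5])        # hours of study
Y = np.array([55, 60, 65, 70, 75])   # test scores
n = len(X)

numerator = n * np.sum(X * Y) - np.sum(X) * np.sum(Y)      # 5*1025 - 15*325 = 250
denominator = np.sqrt((n * np.sum(X**2) - np.sum(X)**2) *
                      (n * np.sum(Y**2) - np.sum(Y)**2))   # sqrt(50 * 1250) = 250
r = numerator / denominator
print(r)   # 1.0 -> scores rise perfectly linearly with study hours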
Averages for qualitative and ranked data.

When dealing with qualitative (or categorical) and ranked (or ordinal) data, we use different types
of averages or central tendency measures compared to quantitative data. Let’s go over the types of
averages used for these kinds of data:
1. Qualitative (Categorical) Data:
Qualitative data is data that represents categories or groups, such as gender, colors, types of fruits,
or eye color. These data are typically non-numeric.
For qualitative data, the most appropriate measures of central tendency are:
Mode:
• The mode is the most common category or value in a dataset.
• It represents the category or group that occurs the most frequently.
• The mode is the only measure of central tendency that can be used for nominal data (where
categories have no inherent order, like "red," "blue," and "green" for colors).
Example: If you have the following data on favorite colors:
red, blue, blue, green, red, red, blue
Both red and blue occur most frequently (3 times each), so this data set is bimodal: red and blue are both modes.
Frequency Distribution:
• While not a "single average," you can summarize qualitative data with a frequency
distribution to show how many times each category appears.
Example:
Color Frequency
Red 3
Blue 3
Green 1
In this case, you could say that red and blue are the most frequent, each appearing 3 times.

2. Ranked (Ordinal) Data:


Ranked or ordinal data are categorical data with a meaningful order or ranking, but the differences
between ranks are not necessarily uniform. Examples include education levels (e.g., high school,
college, graduate), survey responses (e.g., "strongly agree," "agree," "disagree"), or Likert scales
(e.g., 1–5 rating).
For ordinal data, the most appropriate averages or central tendency measures are:
Mode:
• Like with qualitative data, the mode is the most frequent rank or category in ordinal data.
Example:
Survey responses:
"Strongly Agree", "Agree", "Disagree", "Agree", "Agree"
The mode is "Agree" since it appears most often.
Median:
• The median is the middle value when the data points are arranged in order of magnitude.
The median divides the dataset into two halves.
• The median is appropriate for ordinal data because it accounts for the rank order (even
though the exact differences between ranks are not equal).
Example:
Suppose you have the following ordinal data:
2, 1, 4, 3, 5 (where 1 = "Strongly Disagree", 2 = "Disagree", 3 = "Neutral", 4 = "Agree", 5 =
"Strongly Agree").
o Ordered: 1, 2, 3, 4, 5.
o The median is 3, corresponding to the rank "Neutral," as it's the middle value
when the data is ordered.
Rank Average (or Mean Rank):
• Sometimes, ordinal data can be assigned numerical values to calculate a mean rank. For
example, if survey responses are coded as numbers (e.g., 1 = "Strongly Disagree", 2 =
"Disagree", 3 = "Neutral", etc.), you can compute the average rank (mean) by treating the
ranks as numbers.
Example:
Survey responses (coded as numbers):
1, 2, 4, 3, 5 (Strongly Disagree = 1, Disagree = 2, Neutral = 3, Agree = 4, Strongly Agree = 5).
To find the mean rank:
Mean Rank = (1 + 2 + 4 + 3 + 5) / 5 = 15 / 5 = 3
The mean rank is 3, which corresponds to the "Neutral" response.

Key Differences in Calculating Averages for Qualitative and Ranked Data:


• Qualitative (Nominal) data: Mode. The most common category; suitable for non-numeric categories with no inherent order.
• Ranked (Ordinal) data: Mode, Median, Mean Rank. Mode for the most frequent rank, median for the middle value, and mean rank for the average rank.
When to Use Each Measure:
• Mode: Always applicable for both qualitative and ordinal data. It identifies the most
frequent category or rank.
• Median: Use this for ordinal data when you want to find the middle value in an ordered
dataset. The median is especially useful when the data is skewed or has outliers.
• Mean Rank: For ordinal data that has been coded numerically, the mean rank can give you
an average "position" in the ordered data.
Example Scenarios:
1. Qualitative (Nominal) Data:
o Survey responses on preferred pets: dog, cat, cat, dog, bird.
Dog and cat each appear twice, so this data set is bimodal: dog and cat are both modes.
2. Ranked (Ordinal) Data:
o Rating of a restaurant experience on a scale from 1 to 5 (1 = Poor, 5 = Excellent):
4, 5, 3, 5, 4.
▪ The ratings 4 and 5 each appear twice, so the data is bimodal: 4 ("Very Good") and 5 ("Excellent") are both modes.
▪ The median is 4, the middle value when the data is ordered (3, 4, 4, 5, 5).
▪ The mean rank is (4 + 5 + 3 + 5 + 4) / 5 = 21 / 5 = 4.2.
So the average rank is 4.2, which suggests that, on average, people rated the experience as closer
to "Very Good".
1. What is a normal distribution? a) A distribution where the data is not symmetrical
b) A distribution that is bell-shaped and symmetrical about the mean
c) A distribution that only contains positive values
d) A uniform distribution
Answer: b) A distribution that is bell-shaped and symmetrical about the mean
2. What is the purpose of a Z-score in a normal distribution? a) To calculate the mean
of the distribution
b) To represent how many standard deviations a data point is from the mean
c) To calculate the total frequency of a dataset
d) To find the proportion of data points greater than the mean
Answer: b) To represent how many standard deviations a data point is from the mean
3. What does a Z-score of 0 represent in a normal distribution? a) The data point is one
standard deviation below the mean
b) The data point is one standard deviation above the mean
c) The data point is exactly at the mean
d) The data point is an outlier
Answer: c) The data point is exactly at the mean
4. In the context of normal distributions, how is a proportion found? a) By calculating
the Z-score and using a Z-table to find the area under the curve
b) By finding the mean of the data set
c) By adding all the values in the data set
d) By dividing the number of data points by the total number of observations
Answer: a) By calculating the Z-score and using a Z-table to find the area under the
curve
5. What is the formula to calculate the Z-score? a) Z = (X - μ) / σ
b) Z = (μ - σ) / X
c) Z = (σ - X) / μ
d) Z = X * σ / μ
Answer: a) Z = (X - μ) / σ
6. What is the relationship between the Z-score and the normal distribution curve? a)
The Z-score shifts the entire curve
b) The Z-score provides the position of a data point relative to the mean in standard
deviation units
c) The Z-score increases the standard deviation
d) The Z-score represents the proportion of data points in the distribution
Answer: b) The Z-score provides the position of a data point relative to the mean in
standard deviation units
7. What is the correlation between two variables? a) The proportion of data points below
the mean
b) The relationship or association between two variables
c) The distance between the data points
d) The variance of the data set
Answer: b) The relationship or association between two variables
8. What type of data is best represented by a scatter plot? a) Nominal data
b) Ordinal data
c) Quantitative data
d) Categorical data
Answer: c) Quantitative data
9. What does the correlation coefficient (r) measure? a) The mean of the data set
b) The strength and direction of a linear relationship between two variables
c) The variance of the data
d) The number of outliers in the data
Answer: b) The strength and direction of a linear relationship between two variables
10. What is the range of values for the correlation coefficient (r)? a) -1 to 1
b) 0 to 100
c) -10 to 10
d) 0 to 1
Answer: a) -1 to 1
11. Which of the following correlation coefficients indicates a perfect positive linear
relationship between two variables? a) -1
b) 0
c) 1
d) 0.5
Answer: c) 1
12. Which of the following correlation coefficients indicates no linear relationship
between two variables? a) -1
b) 1
c) 0
d) 0.5
Answer: c) 0
13. What does a Z-score of 1.5 indicate in a normal distribution? a) The data point is 1.5
standard deviations below the mean
b) The data point is 1.5 standard deviations above the mean
c) The data point is exactly at the mean
d) The data point is an outlier
Answer: b) The data point is 1.5 standard deviations above the mean
14. What is the formula for calculating the correlation coefficient using the
computational formula? a) r = Σ(xy) / Σ(x²) + Σ(y²)
b) r = Σ(xy) / √(Σx² * Σy²)
c) r = Σx + Σy / N
d) r = Σx - Σy / N
Answer: b) r = Σ(xy) / √(Σx² * Σy²)
15. When analyzing the correlation between two variables, which of the following values
of the correlation coefficient suggests a weak relationship? a) 0.95
b) -0.5
c) 0.1
d) -1
Answer: c) 0.1
16. What is the shape of the normal distribution curve? a) Symmetrical and bell-shaped
b) Skewed to the right
c) Skewed to the left
d) U-shaped
Answer: a) Symmetrical and bell-shaped
17. In a normal distribution, approximately what percentage of data falls within one
standard deviation from the mean? a) 34%
b) 50%
c) 68%
d) 95%
Answer: c) 68%
18. Which type of data is best suited for calculating the Z-score? a) Nominal data
b) Ordinal data
c) Continuous data
d) Categorical data
Answer: c) Continuous data
19. What does a negative correlation coefficient indicate? a) A strong positive linear
relationship
b) A strong negative linear relationship
c) No relationship between the variables
d) A nonlinear relationship
Answer: b) A strong negative linear relationship
20. In the context of correlation, which of the following statements is true? a) A high
correlation means causality is established between the variables
b) A low correlation always indicates no relationship between the variables
c) Correlation measures the strength and direction of a linear relationship, not causality
d) A correlation of 1 always means the variables are independent
Answer: c) Correlation measures the strength and direction of a linear relationship, not
causality
21. Which of the following is an example of qualitative data that can be analyzed using
averages for categorical data? a) The heights of students
b) The average salary of employees
c) The average score on an exam
d) The most common type of pet among survey respondents
Answer: d) The most common type of pet among survey respondents
22. What is the primary purpose of a scatter plot? a) To show the distribution of data
points
b) To visualize the relationship between two quantitative variables
c) To calculate the correlation coefficient
d) To find the mode of the data
Answer: b) To visualize the relationship between two quantitative variables
23. What is the advantage of using Z-scores in normal distributions? a) They help
standardize data and compare scores from different distributions
b) They are used only for calculating the correlation coefficient
c) They help calculate proportions in categorical data
d) They simplify the process of finding the median
Answer: a) They help standardize data and compare scores from different distributions
24. When interpreting the correlation coefficient, which value suggests the strongest
relationship? a) 0.2
b) -0.5
c) 1
d) 0.7
Answer: c) 1
25. What is the advantage of using the computational formula for the correlation
coefficient? a) It simplifies the calculation process, especially when dealing with large
datasets
b) It is used to calculate Z-scores
c) It calculates the mean of the dataset
d) It helps find the variance of the data
Answer: a) It simplifies the calculation process, especially when dealing with large
datasets

5 Mark Questions:
1. Explain the concept of a normal distribution and its characteristics. How is the Z-score
related to a normal distribution?
2. Discuss how to calculate the Z-score for a data point and interpret its meaning in the
context of a normal distribution.
3. What is correlation, and how can it be measured? Explain the difference between
positive, negative, and zero correlation with examples.
4. Describe the process of calculating the correlation coefficient for two quantitative
variables. Why is this measure important in statistics?
5. How are averages calculated for qualitative and ranked data? Discuss the methods used
and provide examples.

10 Mark Questions:

1. Explain the concept of normal distributions, the role of Z-scores, and how they can be
used to calculate proportions and find scores. Provide an example.
2. Discuss the correlation between two variables, including how it is represented on a scatter
plot and the calculation of the correlation coefficient. Explain the computational formula
for the correlation coefficient.
3. Describe the steps involved in finding and interpreting Z-scores in a normal distribution.
How do they help in comparing data points from different distributions?
4. What are the different types of correlation (positive, negative, zero correlation), and how
can they be interpreted using a correlation coefficient and scatter plot? Provide real-world
examples.
5. Discuss the use of averages for qualitative and ranked data. How are these averages
calculated, and how do they differ from calculating averages for continuous data?
UNIT 4

Basics of NumPy Arrays


NumPy arrays are a more powerful version of Python lists. They are used to store homogeneous
data (i.e., all elements must be of the same type) and allow for efficient storage and computation.
Creating Arrays
• You can create a NumPy array from lists, tuples, or other array-like structures.
import numpy as np

# From a list
arr = np.array([1, 2, 3, 4, 5])

# From a tuple
arr2 = np.array((10, 20, 30))

# Multi-dimensional array
arr3 = np.array([[1, 2], [3, 4], [5, 6]])
Data Types
• NumPy automatically infers the data type of the array, but you can also specify it using the
dtype parameter.
arr_int = np.array([1, 2, 3], dtype=int)
arr_float = np.array([1.1, 2.2, 3.3], dtype=float)
arr_complex = np.array([1+2j, 3+4j], dtype=complex)
Array Attributes
• .shape provides the dimensions of the array (rows, columns, etc.).
• .size gives the total number of elements.
• .dtype gives the data type of the array's elements.
• .ndim gives the number of dimensions (axes).
print(arr3.shape)  # Output: (3, 2)
print(arr3.size)   # Output: 6
print(arr3.dtype)  # Output: int64
print(arr3.ndim)   # Output: 2 (2D array)
Aggregations in NumPy
Aggregation operations allow you to summarize or reduce the size of your array data.
Common Aggregation Functions:
• Sum: Adds all elements in the array.
arr = np.array([1, 2, 3, 4, 5])
print(np.sum(arr))  # Output: 15
• Mean: Computes the average of the elements.
print(np.mean(arr))  # Output: 3.0
• Minimum and Maximum: Finds the smallest or largest element.
print(np.min(arr))  # Output: 1
print(np.max(arr))  # Output: 5
• Standard Deviation: Measures the spread of data.
print(np.std(arr))  # Output: 1.4142135623730951
• Median: Finds the middle value when the data is ordered.
arr = np.array([1, 3, 2, 5, 4])
print(np.median(arr))  # Output: 3.0

Aggregating along specific axes:

In multi-dimensional arrays, you can specify an axis (rows or columns) to perform aggregations
along.
• Sum along columns (axis=0): Operates across rows.
arr2 = np.array([[1, 2], [3, 4], [5, 6]])
print(np.sum(arr2, axis=0))  # Output: [ 9 12]
• Sum along rows (axis=1): Operates across columns.
print(np.sum(arr2, axis=1))  # Output: [ 3  7 11]

Element-wise Computations on Arrays


One of NumPy's most powerful features is its ability to perform element-wise operations on arrays.
These operations include basic arithmetic, trigonometric functions, and more.
Arithmetic Operations
NumPy arrays support element-wise arithmetic operations such as addition, subtraction,
multiplication, and division.
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# Addition
print(arr1 + arr2)  # Output: [5 7 9]

# Subtraction
print(arr1 - arr2)  # Output: [-3 -3 -3]

# Multiplication
print(arr1 * arr2)  # Output: [ 4 10 18]

# Division
print(arr1 / arr2)  # Output: [0.25 0.4  0.5 ]
Trigonometric Functions
NumPy also provides functions for trigonometric and other mathematical operations.
arr = np.array([0, np.pi/2, np.pi])

# Sine of each element
print(np.sin(arr))  # Output (approximately): [0. 1. 0.]

# Cosine of each element
print(np.cos(arr))  # Output (approximately): [ 1.  0. -1.]

Broadcasting
When performing operations on arrays of different shapes, NumPy will attempt to "broadcast" the
smaller array to match the shape of the larger array, provided the shapes are compatible.
arr = np.array([1, 2, 3])
scalar = 10
result = arr * scalar  # Element-wise multiplication with a scalar
print(result)  # Output: [10 20 30]
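Broadcasting also applies between arrays of different dimensions. A minimal sketch of the usual broadcasting rules (the 1-D array is stretched across each row of the 2-D array):

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
row = np.array([10, 20, 30])

print(matrix + row)
# Output:
# [[11 22 33]
#  [14 25 36]]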

Comparisons in NumPy
You can compare elements of NumPy arrays using comparison operators. These operations return
boolean arrays.
Basic Comparisons:
• Equality: ==
• Inequality: !=
• Greater than: >
• Less than: <
arr = np.array([1, 2, 3, 4, 5])
result = arr > 3
print(result)  # Output: [False False False  True  True]
Using Logical Functions
You can perform logical operations like "all" or "any" on boolean arrays.
• np.all(): Returns True if all elements are True.
arr = np.array([True, True, False])
print(np.all(arr))  # Output: False
• np.any(): Returns True if any element is True.
print(np.any(arr))  # Output: True

Structured Arrays
A structured array is a type of array that allows for storing multiple fields (different data types) in
each element. These are especially useful for handling complex data such as records.
Creating Structured Arrays
dtype = [('name', 'S20'), ('age', 'i4')]
data = np.array([('Alice', 25), ('Bob', 30)], dtype=dtype)
print(data)
# Output: [(b'Alice', 25) (b'Bob', 30)]

# Accessing fields by name
print(data['name'])  # Output: [b'Alice' b'Bob']
print(data['age'])   # Output: [25 30]
Use Case for Structured Arrays
Structured arrays are useful when you need to manage tabular data or datasets where each row can
have different types (e.g., string and integer).

Data Manipulation
Data manipulation includes operations like reshaping, adding, deleting, or joining arrays.
Reshaping Arrays
You can reshape arrays using reshape(). This changes the number of rows and columns without
modifying the original data.
arr = np.array([1, 2, 3, 4, 5, 6])
reshaped = arr.reshape(2, 3)  # 2 rows and 3 columns
print(reshaped)
# Output:
# [[1 2 3]
#  [4 5 6]]
Stacking Arrays
NumPy provides multiple ways to combine arrays, such as np.concatenate(), np.vstack(), and
np.hstack().
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# Vertically stack arrays
stacked = np.vstack([arr1, arr2])    # Output: [[1 2 3] [4 5 6]]

# Horizontally stack arrays
stacked_h = np.hstack([arr1, arr2])  # Output: [1 2 3 4 5 6]
Splitting Arrays
You can also split arrays into smaller chunks using np.split().
arr = np.array([1, 2, 3, 4, 5, 6])
splits = np.split(arr, 3)  # Split into 3 equal parts
print(splits)  # Output: [array([1, 2]), array([3, 4]), array([5, 6])]
Indexing and Selection
Indexing and slicing are essential for extracting specific portions of data from an array.
Basic Indexing
arr = np.array([1, 2, 3, 4, 5])
print(arr[0])  # Output: 1 (first element)
Slicing Arrays
You can slice arrays to extract sub-arrays:
arr = np.array([1, 2, 3, 4, 5])
print(arr[1:4])  # Output: [2 3 4]
Boolean Indexing
You can use boolean conditions to index an array.
arr = np.array([1, 2, 3, 4, 5])
print(arr[arr > 3])  # Output: [4 5]

Missing Data in NumPy


Handling missing or undefined data is a critical aspect of data analysis. In NumPy, missing data
is typically represented by np.nan (Not a Number), which is used for floating-point arrays. np.nan
is specifically designed to handle situations where data is incomplete or unavailable.
How to Work with Missing Data in NumPy
1. Detecting Missing Data
You can use the np.isnan() function to check for NaN values in an array.
import numpy as np

# Create an array with NaN values
arr = np.array([1.0, 2.0, np.nan, 4.0, 5.0])

# Detect NaN values
print(np.isnan(arr))  # Output: [False False  True False False]
np.isnan() returns a boolean array where True represents a NaN value and False represents any
valid number.
2. Ignoring NaN in Aggregations
In cases where an array has NaN values, aggregation functions like np.sum(), np.mean(), and np.max()
will return NaN unless you explicitly handle it. NumPy provides functions that ignore NaN values
during aggregation.
• np.nanmean(): Returns the mean of the array while ignoring NaN values.
• np.nansum(): Returns the sum while ignoring NaN values.
• np.nanstd(): Returns the standard deviation while ignoring NaN values.
Python
arr = np.array([1.0, 2.0, np.nan, 4.0, 5.0])

# Mean, ignoring NaN values
mean = np.nanmean(arr)
print(mean)  # Output: 3.0

# Sum, ignoring NaN values
total = np.nansum(arr)
print(total)  # Output: 12.0
3. Replacing NaN Values
Sometimes, you may want to replace NaN values with a specific value (like zero or the mean).
You can use np.nan_to_num() to replace NaN with a specified value, or you can directly
manipulate the array.
Python
# Replace NaN with 0
arr_no_nan = np.nan_to_num(arr, nan=0)
print(arr_no_nan)  # Output: [1. 2. 0. 4. 5.]

# Alternatively, replace NaN with the mean of the non-NaN values
mean_value = np.nanmean(arr)
arr_filled = np.where(np.isnan(arr), mean_value, arr)
print(arr_filled)  # Output: [1. 2. 3. 4. 5.]
Key Takeaways for Handling Missing Data in NumPy:
• Detect missing values using np.isnan().
• Perform aggregations while ignoring NaN values using the np.nan*() functions (np.nanmean(), np.nansum(), np.nanstd()).
• Replace NaN values using np.nan_to_num() or np.where().

Hierarchical Indexing (Multi-indexing)


In NumPy, hierarchical indexing is not as straightforward as in pandas, but it can be achieved
using structured arrays or record arrays, which allow multiple fields (or columns) of different
types (e.g., integer, float, string) in each row.
Hierarchical indexing is often useful when you want to work with multi-dimensional data and need
to manage multi-level labels, similar to how pandas handles multi-indexes.
Structured Arrays and Record Arrays
You can use structured arrays (or record arrays) to simulate hierarchical indexing. These arrays
allow each element to have multiple fields (or columns) with different data types.
Python
import numpy as np

# Define a structured array with fields 'category' and 'value'
dtype = [('category', 'U10'), ('value', 'i4')]
data = np.array([('A', 10), ('B', 20), ('A', 30), ('B', 40)], dtype=dtype)

# Print the structured array
print(data)
# Output: [('A', 10) ('B', 20) ('A', 30) ('B', 40)]

# Accessing fields
print(data['category'])  # Output: ['A' 'B' 'A' 'B']
print(data['value'])     # Output: [10 20 30 40]
Grouping and Aggregating with Structured Arrays
You can use structured arrays to perform operations like grouping data by one field and applying
aggregation on another.
Python
# Example of grouping by 'category' and calculating the sum of 'value' for each category
category_A = data[data['category'] == 'A']
category_B = data[data['category'] == 'B']

sum_A = np.sum(category_A['value'])
sum_B = np.sum(category_B['value'])

print(f"Sum of category A: {sum_A}")  # Output: Sum of category A: 40
print(f"Sum of category B: {sum_B}")  # Output: Sum of category B: 60
Although pandas offers more advanced functionality for hierarchical indexing (e.g., multi-level
indices with pd.MultiIndex), you can simulate similar functionality with structured arrays in
NumPy, though it is less flexible and more cumbersome for large datasets.
Key Takeaways for Hierarchical Indexing in NumPy:
• Structured arrays allow multi-field data, simulating hierarchical or multi-level indexing.
• You can group and aggregate data manually by selecting fields and applying operations.
• NumPy is less flexible than pandas when it comes to complex hierarchical indexing, but it
can handle simpler cases.
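For comparison, the same kind of data can be given a true hierarchical index in pandas. The following is only a minimal sketch, assuming pandas is installed; it is not part of NumPy itself:
Python
import pandas as pd

df = pd.DataFrame({'category': ['A', 'B', 'A', 'B'],
                   'region':   ['North', 'North', 'South', 'South'],
                   'value':    [10, 20, 30, 40]})

# Build a multi-level (hierarchical) index from two columns
indexed = df.set_index(['category', 'region'])
print(indexed.loc['A'])  # All rows for category A, indexed by region

# Aggregate across one level of the hierarchy
print(indexed.groupby(level='category').sum())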

Pivot Tables (Using NumPy)


Pivot tables are a common operation in data analysis, often used to summarize or aggregate data
across different categories. While pandas has built-in support for pivot tables, you can simulate
similar functionality in NumPy by combining boolean masks, np.unique(), and aggregation functions.
Simulating Pivot Tables in NumPy
Let's simulate the creation of a pivot table using NumPy for a simple case.
Suppose you have a dataset where you want to compute the sum of values across different
categories and regions.
Python
import numpy as np

# Sample data: [category, region, value]
data = np.array([
    ['A', 'North', 10],
    ['B', 'South', 20],
    ['A', 'South', 30],
    ['B', 'North', 40],
    ['A', 'North', 50],
    ['B', 'South', 60]
])

# Extract the 'value' column as a numeric array for aggregation
# (the array above stores everything as strings)
values = data[:, 2].astype(int)

# Simulate a pivot table by first grouping by 'category' and 'region' (rows, columns)
categories = np.unique(data[:, 0])  # Unique categories (A, B)
regions = np.unique(data[:, 1])     # Unique regions (North, South)

# Create an empty pivot table matrix (2D array)
pivot_table = np.zeros((len(categories), len(regions)), dtype=int)

# Populate the pivot table by aggregating 'value'
for i, category in enumerate(categories):
    for j, region in enumerate(regions):
        # Filter data for each category and region, and sum 'value'
        mask = (data[:, 0] == category) & (data[:, 1] == region)
        pivot_table[i, j] = np.sum(values[mask])

print(pivot_table)
Output:
[[60 30]   # Sum for Category A: [North, South]
 [40 80]]  # Sum for Category B: [North, South]
Here:
• We group the data by category and region.
• We then compute the sum of the value for each combination of category and region.
• This is similar to a pivot table, where the rows represent category and the columns represent
region, and the values are the sum of value.
Using np.histogram2d() for Binning and Pivot-like Operations:
For large datasets, NumPy offers np.histogram2d() to create 2D histograms, which can also be
useful in creating pivot-like summaries of data.
Python
# np.histogram2d() works on numeric data, so first encode the string
# labels as integer codes
category_codes = np.unique(data[:, 0], return_inverse=True)[1]  # A -> 0, B -> 1
region_codes = np.unique(data[:, 1], return_inverse=True)[1]    # North -> 0, South -> 1

# Count the frequency of each combination of category and region
hist, xedges, yedges = np.histogram2d(category_codes, region_codes, bins=2)
print(hist)
# Output:
# [[2. 1.]
#  [1. 2.]]

Key Takeaways for Pivot Tables in NumPy:


• NumPy can be used to simulate pivot tables through manual grouping, aggregation (e.g.,
sum, mean), and reshaping.
• While NumPy can perform simple pivot operations, pandas is far more powerful and
efficient for this task, with native support for pivot tables.
• Use np.histogram2d() or manual grouping and aggregation for pivot-like operations in
NumPy.
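For reference, the same summary can be produced in a single call with pandas. This is a sketch assuming pandas is available; the column names simply mirror the NumPy example above:
Python
import pandas as pd

df = pd.DataFrame({'category': ['A', 'B', 'A', 'B', 'A', 'B'],
                   'region':   ['North', 'South', 'South', 'North', 'North', 'South'],
                   'value':    [10, 20, 30, 40, 50, 60]})

# Rows = category, columns = region, cell values = sum of 'value'
pivot = df.pivot_table(values='value', index='category', columns='region', aggfunc='sum')
print(pivot)
# region    North  South
# category
# A            60     30
# B            40     80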
1. What is the primary advantage of using NumPy arrays over Python lists for
numerical computations? a) They allow for heterogeneous data types
b) They are more memory efficient and allow for faster operations
c) They are easier to use
d) They can handle non-numeric data types
Answer: b) They are more memory efficient and allow for faster operations
2. Which function is used to create a NumPy array from a list in Python? a)
[Link]()
b) [Link]()
c) [Link]()
d) np.list_to_array()
Answer: a) [Link]()
3. Which of the following is true about a NumPy array? a) It can hold different data
types in a single array
b) It can store elements with variable lengths
c) It is optimized for vectorized operations and can perform faster computations than
Python lists
d) NumPy arrays can only be one-dimensional
Answer: c) It is optimized for vectorized operations and can perform faster computations
than Python lists
4. Which of the following operations can be performed on a NumPy array directly
without a loop? a) Element-wise arithmetic operations (e.g., addition, multiplication)
b) Sorting the array
c) Searching for the maximum element
d) All of the above
Answer: d) All of the above
5. What is the result of applying the [Link]() function on a NumPy array? a) It
modifies the array in-place
b) It changes the shape of the array but does not modify the data
c) It changes both the shape and the data of the array
d) It removes elements from the array
Answer: b) It changes the shape of the array but does not modify the data
6. What is a structured array in NumPy? a) An array of dictionaries
b) An array where each element has multiple attributes (fields) of different types
c) An array with random access
d) An array of lists
Answer: b) An array where each element has multiple attributes (fields) of different
types
7. How do you access the elements of a structured NumPy array? a) array[field_name]
b) array[0, field_name]
c) array.field_name
d) All of the above
Answer: d) All of the above
8. How do you perform element-wise addition of two NumPy arrays arr1 and arr2? a)
arr1 + arr2
b) [Link](arr2)
c) [Link](arr1, arr2)
d) Both a and c
Answer: d) Both a and c
9. Which function would you use to calculate the sum of all elements in a NumPy array
arr? a) [Link](arr)
b) [Link]()
c) [Link](arr)
d) Both a and b
Answer: d) Both a and b
10. How can you select elements from a NumPy array based on a condition (e.g., values
greater than 10)? a) arr > 10
b) [Link](arr > 10)
c) [Link](arr > 10)
d) arr[condition]
Answer: d) arr[condition]
11. What does hierarchical indexing in pandas allow you to do? a) Index a single column
b) Perform operations on data across multiple rows
c) Organize data in a multi-level structure, allowing for more complex data selection
d) None of the above
Answer: c) Organize data in a multi-level structure, allowing for more complex data
selection
12. What does the [Link]() function do in NumPy? a) Returns indices where the
condition is True
b) Computes the sum of an array
c) Deletes elements based on a condition
d) All of the above
Answer: a) Returns indices where the condition is True
13. Which method would you use to handle missing data in a NumPy array? a)
np.nan_to_num()
b) [Link]()
c) [Link]()
d) [Link]()
Answer: a) np.nan_to_num()
14. How would you select a subset of a NumPy array arr from rows 1 to 4 (inclusive)
and columns 2 to 3? a) arr[1:5, 2:4]
b) arr[0:4, 1:3]
c) arr[1:4, 2:3]
d) arr[1:5, 2:4]
Answer: a) arr[1:5, 2:4]
15. Which of the following operations is used to concatenate two NumPy arrays arr1
and arr2 along rows (vertically)? a) [Link]([arr1, arr2], axis=0)
b) [Link]([arr1, arr2], axis=1)
c) [Link]([arr1, arr2])
d) Both a and c
Answer: d) Both a and c
16. What is the use of [Link]() in NumPy? a) To split a dataset into training and testing
sets
b) To divide a NumPy array into multiple smaller arrays
c) To split a 2D array into multiple rows
d) To divide an array into equal-sized chunks along an axis
Answer: b) To divide a NumPy array into multiple smaller arrays
17. Which of the following is NOT a valid NumPy operation? a) Element-wise addition of
two arrays
b) Appending new elements to an array
c) Changing the shape of an array without modifying the data
d) Adding an element to a NumPy array using append() method
Answer: d) Adding an element to a NumPy array using append() method
18. Which function would you use to find the maximum value in a NumPy array? a)
[Link]()
b) [Link]()
c) [Link]()
d) [Link]()
Answer: a) [Link]()
19. What is the result of performing the operation arr1 > arr2 on two NumPy arrays
arr1 and arr2? a) A new array of boolean values indicating where arr1 is greater than
arr2
b) An array with the larger values between arr1 and arr2
c) The sum of arr1 and arr2
d) A sorted version of arr1 and arr2
Answer: a) A new array of boolean values indicating where arr1 is greater than arr2
20. Which of the following functions is used for basic aggregation (e.g., sum, mean) in
NumPy? a) [Link]()
b) [Link]()
c) [Link]()
d) [Link]()
Answer: c) [Link]()

21. What does the groupby() method in pandas do? a) Groups data based on the unique
values of one or more columns
b) Aggregates data using functions like sum, mean, etc.
c) Both a and b
d) Sorts the data based on a column
Answer: c) Both a and b
22. How do you compute the sum of grouped data in pandas? a)
[Link]('column_name').sum()
b) [Link]('sum')
c) [Link]()
d) df.group_by('column_name').sum()
Answer: a) [Link]('column_name').sum()
23. Which of the following is a valid aggregation function used with the pandas
groupby() method? a) mean()
b) sum()
c) count()
d) All of the above
Answer: d) All of the above
24. What is the purpose of a pivot table in pandas? a) To sort data
b) To compute statistics for different subsets of data
c) To combine datasets
d) To plot data visually
Answer: b) To compute statistics for different subsets of data
25. Which function is used to create a pivot table in pandas? a) [Link]()
b) df.pivot_table()
c) [Link]()
d) [Link]()
Answer: b) df.pivot_table()

5 Mark Questions:

1. Explain the basics of NumPy arrays and how they differ from Python lists. Discuss their
advantages for numerical computations.
2. Describe the process of data manipulation in NumPy, including array indexing, slicing,
and reshaping. Provide examples.
3. What is hierarchical indexing in pandas? Explain how it allows you to work with multi-
level data and its advantages.
4. Discuss the importance of aggregation and grouping in data analysis. How is the
groupby() method used in pandas to aggregate data?
5. Explain how pivot tables work in pandas. Provide an example of how to use pivot_table()
to summarize data.
10 Mark Questions:

1. Discuss the process of aggregation and grouping in pandas. Explain the groupby()
method with an example, highlighting how it can be used to perform various aggregate
operations such as sum, mean, and count.
2. Describe how NumPy arrays are used in data manipulation and computation. Discuss
indexing, slicing, reshaping, and operations on arrays, providing code examples.
3. Explain the concept of structured arrays in NumPy and how they differ from regular
arrays. Provide an example of creating and accessing data in structured arrays.
4. How is missing data handled in NumPy and pandas? Discuss different methods for
dealing with missing data, including using NaN, filling with default values, and using
interpolation techniques.
5. Discuss how to combine datasets in pandas. Explain the use of merge(), concat(), and
join() methods for combining datasets, and provide use cases for each.

UNIT 5
Data Visualization using Matplotlib

Data visualization is a crucial aspect of data analysis, enabling data scientists and analysts to
present complex data in a more understandable and insightful manner. One of the most popular
libraries for data visualization in Python is Matplotlib. This unit provides a guide to using
Matplotlib to create various types of plots and to customize them to fit specific needs.

Key Pyplot Functions:


Category               Function             Description
Plot Creation          plot()               Creates line plots with customizable styles.
                       scatter()            Generates scatter plots to visualize relationships.
Graphical Elements     bar()                Creates bar charts for comparing categories.
                       hist()               Draws histograms to show data distribution.
                       pie()                Creates pie charts to represent parts of a whole.
Customization          xlabel(), ylabel()   Sets labels for the X and Y axes.
                       title()              Adds a title to the plot.
                       legend()             Adds a legend to differentiate data series.
Visualization Control  xlim(), ylim()       Sets limits for the X and Y axes.
                       grid()               Adds gridlines to the plot for readability.
                       show()               Displays the plot in a window.
Figure Management      figure()             Creates or activates a figure.
                       subplot()            Creates a grid of subplots within a figure.
                       savefig()            Saves the current figure to a file.
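A short sketch tying several of these functions together (figure creation, subplots, labels, legend, grid, and saving); the data values and the output file name are arbitrary examples:
Python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]

fig = plt.figure()             # create a figure

plt.subplot(1, 2, 1)           # first of two side-by-side subplots
plt.plot(x, [1, 4, 9, 16], label='squares')
plt.title('Line')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.grid(True)

plt.subplot(1, 2, 2)           # second subplot
plt.bar(['A', 'B', 'C'], [3, 7, 5])
plt.title('Bar')

plt.savefig('demo_plots.png')  # save the figure to a file
plt.show()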

Line plot

A line chart is one of the most basic plots and can be created using the plot() function. It is used to
represent the relationship between two sets of data, X and Y, plotted on the two axes.
Syntax:
matplotlib.pyplot.plot(*args, scalex=True, scaley=True, data=None, **kwargs)
Example:
Python
import matplotlib.pyplot as plt

# initializing the data
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]

# plotting the data
plt.plot(x, y)

# Adding title to the plot
plt.title("Line Chart")

# Adding label on the y-axis
plt.ylabel('Y-Axis')

# Adding label on the x-axis
plt.xlabel('X-Axis')

plt.show()

Output:
Scatter plots

Scatter plots are used to observe relationships between variables. The scatter() method in the
matplotlib library is used to draw a scatter plot.
Syntax:
matplotlib.pyplot.scatter(x_axis_data, y_axis_data, s=None, c=None, marker=None, cmap=None,
vmin=None, vmax=None, alpha=None, linewidths=None, edgecolors=None)
Example:
Python
import matplotlib.pyplot as plt
import pandas as pd

# Reading the tips.csv file
data = pd.read_csv('tips.csv')

# initializing the data
x = data['day']
y = data['total_bill']

# plotting the data
plt.scatter(x, y)

# Adding title to the plot
plt.title("Tips Dataset")

# Adding label on the y-axis
plt.ylabel('Total Bill')

# Adding label on the x-axis
plt.xlabel('Day')

plt.show()
Output:

Histogram
A histogram is used to represent data grouped into bins. It is a type of bar
plot where the X-axis represents the bin ranges while the Y-axis gives information about
frequency. The hist() function is used to compute and create a histogram of x.
Syntax:
matplotlib.pyplot.hist(x, bins=None, range=None, density=False, weights=None,
cumulative=False, bottom=None, histtype='bar', align='mid', orientation='vertical',
rwidth=None, log=False, color=None, label=None, stacked=False, *, data=None, **kwargs)
Example:
Python
import matplotlib.pyplot as plt
import pandas as pd

# Reading the tips.csv file
data = pd.read_csv('tips.csv')

# initializing the data
x = data['total_bill']

# plotting the data
plt.hist(x)

# Adding title to the plot
plt.title("Tips Dataset")

# Adding label on the y-axis
plt.ylabel('Frequency')

# Adding label on the x-axis
plt.xlabel('Total Bill')

plt.show()
Output:

Bar plot

A bar chart is a graph that represents categories of data with rectangular bars whose lengths
and heights are proportional to the values they represent. Bar plots can be plotted
horizontally or vertically. A bar chart describes comparisons between discrete categories.
It can be created using the bar() method.
In the below example, we will use the tips dataset. The tips dataset is a record of the tips given by
customers in a restaurant over two and a half months in the early 1990s. It contains 7 columns:
total_bill, tip, sex, smoker, day, time and size.
Example:
Python
import matplotlib.pyplot as plt
import pandas as pd

# Reading the tips.csv file
data = pd.read_csv('tips.csv')

# initializing the data
x = data['day']
y = data['total_bill']

# plotting the data
plt.bar(x, y)

# Adding title to the plot
plt.title("Tips Dataset")

# Adding label on the y-axis
plt.ylabel('Total Bill')

# Adding label on the x-axis
plt.xlabel('Day')

plt.show()
Output:
Visualizing errors
Visualizing errors is an important aspect of data analysis and model evaluation. In machine
learning, data science, and scientific computing, error visualization helps you identify patterns,
understand the distribution of errors, and diagnose problems with models or data. Let's look at
different types of error visualizations and how you can implement them using libraries like
Matplotlib, Seaborn, and NumPy.
1. Basic Error Visualization Concepts
Errors in data analysis or machine learning often come in different forms:
• Absolute Errors: The absolute difference between the predicted and true values.
• Squared Errors: The squared difference between predicted and true values, commonly
used in loss functions like Mean Squared Error (MSE).
• Residuals: The difference between the predicted value and the true value (sometimes
called model residuals).
• Percentage Errors: Error relative to the actual value, useful when considering error as a
proportion of the actual value.
• Bias and Variance: When analyzing model performance, bias refers to error introduced
by overly simplistic models, while variance refers to error introduced by models that are
too complex.
Visualizing these types of errors helps in understanding the performance of your model, detecting
outliers, and checking for assumptions like homoscedasticity (constant variance of errors).

2. Visualizing Errors: Common Techniques


2.1. Error vs True Values Plot
One of the most basic plots to visualize errors is a scatter plot of errors vs true values. This helps
to identify whether there are any patterns in the errors, which could indicate issues with the model
or data.
Python
import numpy as np
import matplotlib.pyplot as plt

# Simulated data: true values and predicted values
true_values = np.array([3, 5, 7, 9, 11, 13])
predicted_values = np.array([2.8, 5.1, 7.2, 9.1, 10.9, 13.2])

# Calculate errors (residuals)
errors = predicted_values - true_values

# Plot errors vs true values
plt.scatter(true_values, errors)
plt.axhline(0, color='red', linestyle='--')  # Line at 0 to show no error
plt.title("Error vs True Values")
plt.xlabel("True Values")
plt.ylabel("Errors (Residuals)")
plt.show()
Explanation:
• This plot can help detect patterns in residuals. Ideally, residuals should be randomly
distributed with no clear pattern, indicating that the model has captured all the underlying
patterns in the data. If there are trends or structures in the plot, this may indicate a problem
with the model.

2.2. Histogram of Errors


The histogram of errors (or residuals) shows the distribution of the errors. If your model is well-
calibrated, the errors should follow a normal distribution centered around zero (especially for
regression problems). If the errors are skewed or have heavy tails, this can indicate problems with
the model, such as underfitting or overfitting.
Python
# Plot histogram of errors
plt.hist(errors, bins=10, edgecolor='k', alpha=0.7)
plt.title("Histogram of Errors")
plt.xlabel("Error")
plt.ylabel("Frequency")
plt.show()
Explanation:
• If the errors follow a normal distribution, it suggests that the model is appropriately
accounting for the data's structure.
• If the histogram is skewed or has outliers, you may need to reconsider the model, data
preprocessing, or outlier treatment.

2.3. Residual Plot (Residuals vs Fitted Values)


The residual plot is a key diagnostic tool in regression. It plots the residuals (errors) against the
predicted values (fitted values). This plot helps detect issues like non-linearity, heteroscedasticity
(non-constant variance of errors), or outliers.
Python
# Plot residuals vs fitted values
plt.scatter(predicted_values, errors)
plt.axhline(0, color='red', linestyle='--')  # Line at 0 to show no error
plt.title("Residuals vs Fitted Values")
plt.xlabel("Fitted Values (Predicted)")
plt.ylabel("Residuals (Errors)")
plt.show()
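Error bars are another common way to show uncertainty directly on a plot, using Matplotlib's plt.errorbar() function. The following is a minimal sketch; the measurement values and the per-point uncertainties are made up purely for illustration:
Python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative measurements with an assumed uncertainty for each point
x = np.arange(1, 7)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])
y_err = np.array([0.3, 0.4, 0.2, 0.5, 0.4, 0.3])  # hypothetical standard errors

plt.errorbar(x, y, yerr=y_err, fmt='o', capsize=4, ecolor='gray')
plt.title("Measurements with Error Bars")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()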
1. Absolute Errors (AE)
• Definition:
The absolute error is the absolute difference between the predicted value (\hat{y}) and the true
value (y) for each observation.
AE = |\hat{y} - y|
• Use case:
Absolute errors are useful when you want to understand how far off your predictions are
from the true values, regardless of direction (overestimating or underestimating). This
metric is simple and interpretable, but it doesn't account for larger errors more severely
than smaller ones.
• Example:
If the true value is 100 and your model predicts 90, the absolute error is:
AE = |90 - 100| = 10
If the prediction is 110, the absolute error is:
AE = |110 - 100| = 10
2. Squared Errors (SE) / Mean Squared Error (MSE)
• Definition:
Squared errors are the square of the difference between the predicted value and the true
value:
SE = (\hat{y} - y)^2
The Mean Squared Error (MSE) is the average of these squared errors across all data points in
your dataset:
MSE = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2
• Use case:
Squaring the errors penalizes larger deviations more than smaller ones, making MSE
particularly sensitive to outliers. This is why MSE is often used in regression problems,
as it encourages the model to minimize larger errors more aggressively.
• Example:
If the true value is 100 and the prediction is 90, the squared error is:
SE = (90 - 100)^2 = 100
If the prediction is 110, the squared error is:
SE = (110 - 100)^2 = 100
Notice that the squared error increases quickly with larger deviations, which is one reason why
MSE is sensitive to outliers.
• Important Note:
MSE is very common in regression tasks, and it is the basis for many machine learning
algorithms (e.g., linear regression, neural networks). However, MSE's units are the
square of the original units, which can make interpretation less intuitive.
3. Residuals
• Definition:
Residuals are the differences between the predicted values and the true values, just like
the errors. However, the term "residuals" is often used specifically in the context of
regression analysis.
\text{Residual} = \hat{y} - y
• Use case:
Residuals are useful for assessing how well your model fits the data. In a regression
context, the residuals give you insight into whether your model is making systematic
errors, such as consistently overestimating or underestimating. If residuals are randomly
distributed (i.e., no pattern), it suggests the model has captured the underlying
relationships in the data well.
• Example:
For the true value y = 100 and predicted value \hat{y} = 90, the residual is:
\text{Residual} = 90 - 100 = -10
For \hat{y} = 110, the residual is:
\text{Residual} = 110 - 100 = 10
Residuals can be positive (overestimation) or negative (underestimation), and often, residuals are
analyzed to check for patterns that suggest model improvements.
• Residuals in Regression Analysis:
In linear regression, the residual plot (residuals vs. fitted values) is a common diagnostic
tool. If the residuals show a pattern (e.g., non-random distribution or curvature), it
suggests the model may not be appropriate (e.g., need for nonlinear models).
4. Percentage Errors
• Definition:
Percentage error measures the error relative to the true value, often expressed as a
percentage. It's calculated as:
\text{Percentage Error} = \frac{|\hat{y} - y|}{|y|} \times 100
• Use case:
Percentage errors are particularly useful when the magnitude of the true values varies
widely, and you want to evaluate error in relation to the size of the actual values. This is
especially helpful when comparing errors across different scales or when the actual
values are very large or very small.
• Example:
If the true value is 100 and the predicted value is 90, the percentage error is:
\text{Percentage Error} = \frac{|90 - 100|}{100} \times 100 = 10\%
If the predicted value is 110, the percentage error is:
\text{Percentage Error} = \frac{|110 - 100|}{100} \times 100 = 10\%
For small values of y, even a small absolute error can lead to a large percentage error. For
instance, if y = 1 and \hat{y} = 1.1, the percentage error is:
\text{Percentage Error} = \frac{|1.1 - 1|}{1} \times 100 = 10\%
But for larger values of y, the percentage error becomes a smaller proportion of the true value.
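All of the metrics above can be computed directly with NumPy. A minimal sketch, reusing the small true/predicted arrays from the plotting examples earlier in this section:
Python
import numpy as np

true_values = np.array([3, 5, 7, 9, 11, 13])
predicted_values = np.array([2.8, 5.1, 7.2, 9.1, 10.9, 13.2])

errors = predicted_values - true_values                    # residuals
abs_errors = np.abs(errors)                                # absolute errors
mse = np.mean(errors ** 2)                                 # mean squared error
pct_errors = np.abs(errors) / np.abs(true_values) * 100    # percentage errors

print(abs_errors)              # [0.2 0.1 0.2 0.1 0.1 0.2]
print(round(mse, 4))           # 0.025 (average of the squared residuals)
print(np.round(pct_errors, 2)) # percentage error per observation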
5. Bias and Variance in Model Performance
• Bias:
Bias refers to the error introduced by a model that is too simple (underfitting). A high-
bias model makes strong assumptions about the data and may fail to capture the
complexity of the relationships between input variables and the target variable.
o Example of Bias:
A linear model trying to fit a highly nonlinear dataset will have high bias because
it underestimates the complexity of the true relationship.
o Impact of Bias:
A model with high bias generally has systematic errors, meaning it consistently
makes the same type of error (e.g., consistently underestimating or overestimating
the target).
• Variance:
Variance refers to the error introduced by a model that is too complex (overfitting). A
high-variance model is overly sensitive to the training data and may not generalize well
to unseen data. It tends to fit the training data very closely, capturing even random noise
in the dataset.
o Example of Variance:
A decision tree that grows too deep and perfectly fits the training data will likely
have high variance because it memorizes the data rather than learning the
underlying patterns.
o Impact of Variance:
A model with high variance may perform well on training data but poorly on new,
unseen data, due to overfitting.
• Bias-Variance Tradeoff:
The ideal model strikes a balance between bias and variance. A model with high bias and
low variance (e.g., a simple linear model) may have poor performance because it
oversimplifies the data. A model with low bias and high variance (e.g., a very complex
neural network) may perform well on training data but generalize poorly to test data.
o Reducing Bias:
To reduce bias, you can use more complex models or add more features to the
model.
o Reducing Variance:
To reduce variance, you can simplify the model, use regularization, or increase
the amount of training data.

Density and Contour Plots


Density Plots
Density plots visualize the distribution of a continuous variable by estimating the probability
density function (PDF). Instead of a histogram, which counts occurrences in bins, density plots
provide a smooth curve representing the estimated distribution.
• What it shows:
Density plots allow you to see where the data is concentrated, how spread out it is, and
any potential multimodality (multiple peaks). This is useful for understanding the
distribution of data, especially when comparing multiple distributions.
• Use case:
Ideal for continuous data where you want to understand the shape of the distribution.
Commonly used for comparing the distribution of variables in different groups.
• Tools:
In Python, you can create density plots using Seaborn or Matplotlib. The sns.kdeplot
function in Seaborn is commonly used for kernel density estimation (KDE).
Python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Sample data for illustration
data = np.random.randn(1000)
data1 = np.random.randn(1000)
data2 = np.random.randn(1000) + 2

# Example with one dataset
sns.kdeplot(data, fill=True)  # fill=True shades the area under the curve
plt.title('Density Plot')
plt.show()

# Example with multiple datasets
sns.kdeplot(data1, fill=True, label="Data 1")
sns.kdeplot(data2, fill=True, label="Data 2")
plt.legend()
plt.show()

Contour Plots
Contour plots are useful for visualizing three-dimensional data in two dimensions. They
represent level curves of a function over a 2D plane. The contour lines indicate regions of equal
value, making it easy to see the shape of a surface.
• What it shows:
Contour plots display how the values of a third variable change across a 2D plane, which
is especially useful in geospatial data, physics, or machine learning.
• Use case:
Used in scientific computing, spatial analysis, and machine learning, especially to
represent decision boundaries of classification models (e.g., decision trees, SVMs).
• Tools:
Contour plots can be created in Python using Matplotlib or Seaborn.

Python
import numpy as np
import matplotlib.pyplot as plt

# Create grid data
X = np.linspace(-5, 5, 100)
Y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(X, Y)
Z = np.sin(np.sqrt(X**2 + Y**2))  # Example function

plt.contour(X, Y, Z)
plt.title('Contour Plot')
plt.show()

Histograms
Histograms are one of the most basic and effective ways to visualize the distribution of data.
They group data into bins and show the frequency of data points within each bin.
• What it shows:
Histograms display the distribution of a single continuous variable and provide insight
into the shape of the distribution, such as whether it’s normal, skewed, bimodal, etc.
• Use case:
Used when you want to understand the distribution of numerical data, such as the
frequency of values within a range (e.g., age groups, test scores).
• Tools:
In Python, Matplotlib and Seaborn are often used for histograms. Seaborn also allows
easy customization with sns.histplot and sns.displot.
Python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Sample data for illustration
data = np.random.randn(1000)
data1, data2 = np.random.randn(1000), np.random.randn(1000) + 2

# Single variable histogram
sns.histplot(data, kde=True)  # with Kernel Density Estimation
plt.title('Histogram')
plt.show()

# Multiple variables
sns.histplot(data1, kde=True, color='blue', label='Data 1')
sns.histplot(data2, kde=True, color='red', label='Data 2')
plt.legend()
plt.show()

Binning and Density


Binning
Binning is a technique where you divide data into intervals or "bins" to facilitate easier analysis
and visualization, especially when the data is continuous. Binning can be used with both
histograms and density plots.
• What it shows:
Binning helps to reduce noise by grouping continuous data into discrete intervals. The
binning process is particularly useful for large datasets or when you're looking to simplify
the representation of a distribution.
• Use case:
Often used in histograms and KDE plots to smooth out noise or to group data into ranges
(e.g., age groups, income brackets).
• Tools:
Binning is handled automatically in Seaborn's histplot and displot, or you can manually
define bin edges and pass them to plt.hist() in Matplotlib.
Python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Create some data
data = np.random.randn(1000)

# Binning with histplot
sns.histplot(data, bins=20, kde=True)  # You can adjust the number of bins here
plt.title('Histogram with Binning')
plt.show()

# Manually binning with numpy
bin_edges = np.linspace(-4, 4, 20)
hist, edges = np.histogram(data, bins=bin_edges)
plt.hist(edges[:-1], edges, weights=hist)  # Plot with custom bins
plt.show()

Density Plots with Binning


When using density plots, binning can help smooth the curve. The KDE method performs an
estimation of the density function by calculating a kernel for each data point and then
aggregating them into bins.
• What it shows:
Binned density estimates smooth out the distribution by averaging the values in each bin,
which can highlight the underlying structure of the data (e.g., unimodal vs. bimodal).
• Tools:
Seaborn automatically applies kernel density estimation to binned data with the
sns.kdeplot function.
Python
sns.kdeplot(data, bw_adjust=0.5)  # bw_adjust controls the smoothness
plt.show()

Three-Dimensional Plotting
Three-dimensional plots allow you to visualize data that has three continuous variables. This is
essential for understanding relationships in multi-variable datasets and is common in scientific
and engineering applications.
• What it shows:
3D plots help visualize data points in three-dimensional space, showing how three
variables interact or correlate.
• Use case:
Applied in machine learning to visualize high-dimensional relationships, geographic data,
or any problem where three variables are crucial.
• Tools:
Matplotlib offers a 3D plotting toolkit via Axes3D. The plot_surface function is
commonly used for visualizing 3D data.
Python
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import matplotlib.pyplot as plt

# Create 3D data
X = np.linspace(-5, 5, 100)
Y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(X, Y)
Z = np.sin(np.sqrt(X**2 + Y**2))

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, Z, cmap='viridis')
plt.title('3D Surface Plot')
plt.show()
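Besides surfaces, individual data points can be drawn in three dimensions with a 3D scatter plot. A minimal sketch using randomly generated points (the data is purely illustrative):
Python
import numpy as np
import matplotlib.pyplot as plt

# Random 3D points for illustration
x, y, z = np.random.rand(3, 100)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x, y, z, c=z, cmap='viridis')  # color each point by its z value
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
plt.title('3D Scatter Plot')
plt.show()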

Geographic Data Visualization


Geographic data visualization allows you to display spatial information on a map, which is
essential in fields like geography, urban planning, and environmental science.
Geospatial Data Visualization
When working with geospatial data, visualizing geographic patterns (e.g., population
distribution, climate data, geographic boundaries) can help uncover insights about spatial
relationships.
• What it shows:
Geographic visualizations display how variables change across different locations. This
might include mapping variables like temperature, air quality, or population density over
geographic regions.
• Use case:
Often used in fields like climate science, epidemiology, and city planning to analyze
spatial data, such as the spread of diseases, weather patterns, or traffic.
• Tools:
Python provides several libraries for geographic plotting, including Geopandas, Folium,
and Plotly.
Example with Geopandas and Matplotlib:
Python
import geopandas as gpd
import matplotlib.pyplot as plt

# Load a built-in geographic dataset
# ('naturalearth_lowres' ships with geopandas versions before 1.0)
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

# Plot the map
world.plot()
plt.title('World Map')
plt.show()
Example with Folium for Interactive Maps:
Python
import folium

# Create a map centered at a specific latitude and longitude
m = folium.Map(location=[40.7128, -74.0060], zoom_start=12)  # Example: New York City

# Add a marker to the map
folium.Marker([40.7128, -74.0060], popup="New York").add_to(m)

# Save the map to an HTML file to view it
m.save("nyc_map.html")
Heatmaps in Geographic Data
Heatmaps are a great way to represent density or intensity of data points across a geographic
area.
• What it shows:
Heatmaps on geographic maps show regions where data points (such as accidents,
population density, or other spatial data) are concentrated.
• Tools:
Libraries like Folium or Plotly can create heatmaps on geographic maps.
Example with Folium Heatmap:
Python
from folium.plugins import HeatMap
import folium

# Create a map
m = folium.Map(location=[40.7128, -74.0060], zoom_start=12)

# List of lat, lon coordinates for heatmap
heat_data = [[40.7128, -74.0060], [40.7138, -74.0070], [40.7148, -74.0050]]

# Add heatmap to map
HeatMap(heat_data).add_to(m)

# Save the map to an HTML file to view it
m.save("nyc_heatmap.html")

Summary:
• Density and Contour Plots: Great for showing the distribution and relationships
between continuous variables, with contour plots being especially useful for visualizing
surfaces and spatial relationships.
• Histograms: Ideal for understanding the distribution of a single variable and visualizing
frequency distributions.
• Binning and Density: Binning helps with summarizing and smoothing data, while
density plots estimate the underlying distribution.
• 3D Plotting: Useful for visualizing relationships between three continuous variables.
• Geographic Data: Visualization tools like Geopandas, Folium, and Plotly help display
spatial data on maps, which is essential for geographic analysis, heatmaps, and
understanding spatial patterns.
1. Which function in Matplotlib is used to create a line plot? a) [Link]()
b) [Link]()
c) [Link]()
d) [Link]()
Answer: b) [Link]()
2. What type of plot would you use to visualize the relationship between two variables
in a dataset? a) Line plot
b) Histogram
c) Scatter plot
d) Box plot
Answer: c) Scatter plot
3. What does [Link]() do in Matplotlib? a) Sets the title of the plot
b) Labels the x-axis
c) Labels the y-axis
d) Adds a legend to the plot
Answer: b) Labels the x-axis
4. How can you display a plot in Matplotlib after defining it? a) [Link]()
b) [Link]()
c) [Link]()
d) [Link]()
Answer: a) [Link]()
5. What is the default color of lines in Matplotlib plots? a) Red
b) Blue
c) Green
d) Black
Answer: b) Blue

6. In a scatter plot, what does each point represent? a) A single data value
b) A relationship between two variables
c) The average of a dataset
d) A histogram of data
Answer: b) A relationship between two variables
7. Which function in Matplotlib is used to add error bars to a plot? a) [Link]()
b) [Link]()
c) [Link]()
d) [Link]()
Answer: a) [Link]()
8. What is the purpose of visualizing errors in data? a) To identify the range of variation
or uncertainty in data
b) To fit the data to a model
c) To identify outliers
d) To smooth the data
Answer: a) To identify the range of variation or uncertainty in data
9. How would you visualize the distribution of errors in data? a) By using a scatter plot
b) By using a bar chart
c) By using a histogram or density plot
d) By using a line plot
Answer: c) By using a histogram or density plot
10. What is a density plot used for in data visualization? a) To show the frequency of data
points
b) To represent the distribution of data over continuous intervals
c) To visualize the relationship between categorical data
d) To show the linear relationship between two variables
Answer: b) To represent the distribution of data over continuous intervals
11. Which Matplotlib function is used to create a density plot? a) [Link]()
b) [Link]()
c) [Link]()
d) [Link]()
Answer: c) [Link]()
12. What is a contour plot used for? a) To display a 3D surface
b) To visualize 2D data with levels of equal value
c) To show error bars on a plot
d) To plot the cumulative distribution function
Answer: b) To visualize 2D data with levels of equal value
13. How can you add contour lines to a plot in Matplotlib? a) [Link]()
b) [Link]()
c) [Link]()
d) [Link]()
Answer: a) [Link]()
14. What does a histogram display? a) The distribution of categorical data
b) The frequency distribution of continuous numerical data
c) A relationship between two variables
d) The trend over time
Answer: b) The frequency distribution of continuous numerical data
15. Which function is used to create a histogram in Matplotlib? a) [Link]()
b) [Link]()
c) [Link]()
d) [Link]()
Answer: b) [Link]()
16. What is the purpose of binning in a histogram? a) To group continuous data into
discrete intervals
b) To sort data by its value
c) To smooth the data
d) To combine data from different sources
Answer: a) To group continuous data into discrete intervals
17. What does the density=True argument do in [Link]()? a) It normalizes the histogram
to show probability density
b) It reduces the number of bins
c) It increases the number of bins
d) It adds color to the histogram bars
Answer: a) It normalizes the histogram to show probability density
18. Which of the following can be plotted to visualize data distribution and detect
outliers? a) Scatter plot
b) Histogram
c) Box plot
d) All of the above
Answer: d) All of the above

Three Dimensional Plotting

19. Which module in Matplotlib is used to create 3D plots? a) mpl_toolkits.mplot3d


b) matplotlib.py3d
c) matplotlib.mplot3d
d) matplotlib.pyplot3d
Answer: a) mpl_toolkits.mplot3d
20. Which function is used to create a 3D scatter plot? a) [Link]()
b) plt.plot3d()
c) plt.scatter3d()
d) [Link]()
Answer: a) [Link]()
21. How can you set up a 3D plot in Matplotlib? a) ax = [Link](projection='3d')
b) fig = [Link](), ax = fig.add_subplot(111, projection='3d')
c) ax = plt.plot3D()
d) Both a and b
Answer: d) Both a and b

22. Which library in Python is commonly used for geographic data visualization? a)
Seaborn
b) Matplotlib
c) Cartopy
d) Pandas
Answer: c) Cartopy
23. Which type of plot is used to visualize geographic data over maps? a) Line plot
b) Heatmap
c) Choropleth map
d) Box plot
Answer: c) Choropleth map
24. Which of the following is used for plotting geographic data on a map in Matplotlib?
a) [Link]()
b) [Link]()
c) [Link]()
d) [Link]() with Cartopy or Basemap
Answer: d) [Link]() with Cartopy or Basemap
25. What is the purpose of using geographic data visualization tools like Cartopy or
Basemap in Python? a) To analyze text data
b) To map data onto geographic regions, helping to identify spatial patterns
c) To visualize time-series data
d) To generate statistical plots
Answer: b) To map data onto geographic regions, helping to identify spatial patterns

5 Mark Questions:

1. Explain the use of line plots in data visualization. How are they created in Matplotlib, and
in what cases are they useful?
2. Describe how scatter plots work and when they are useful in data analysis. How do you
add a regression line to a scatter plot in Matplotlib?
3. What is a contour plot, and how does it help in visualizing the relationships between three
variables? Explain with an example.
4. Explain histograms and their importance in data visualization. How do you perform
binning, and what is the significance of setting the density=True argument in [Link]()?
5. What are density plots, and how do they differ from histograms? Discuss how density
plots can be created in Matplotlib and their use in visualizing data distribution.
10 Mark Questions:

1. Discuss how Matplotlib can be used to create different types of plots, such as line plots,
scatter plots, and histograms. Provide code examples for each type.
2. Explain how error bars are added to plots in Matplotlib. Provide an example where error
bars are used to represent uncertainty in data.
3. Describe the process of creating a 3D plot in Matplotlib. Include details on how to set up
a 3D axis and create 3D scatter plots and surface plots.
4. Discuss the use of geographic data visualization in Python. Explain how Cartopy or
Basemap can be used to visualize geographic data, including a discussion of choropleth
maps and their applications.
5. Explain the importance of visualizing the distribution of data. Compare and contrast
histograms, box plots, and density plots, and discuss their strengths and limitations in
different scenarios.
