Machine Learning Notes

Ms. Suhana Ali Mir Laiqui Ali (M.Sc. Data Science)

UNIT - I
I. Introduction to Machine Learning
What is Machine Learning?

Machine learning (ML) is a branch of artificial intelligence and computer science that focuses on using data and algorithms to enable AI to imitate the way that humans learn, gradually improving its accuracy.

In other words, machine learning is focused on building computer systems that learn from
data.

ML algorithms are trained to find relationships and patterns in data. Using historical data as input, these algorithms can make predictions, classify information, cluster data points, reduce dimensionality and even generate new content. Examples include generative AI tools such as OpenAI's ChatGPT.

The first definition of Machine Learning was given by Arthur Samuel back in 1959. He coined the term Machine Learning and defined it as,

“Machine Learning is the field of study that gives computers the ability to learn without
being explicitly programmed.”
OR
“Machine learning is a subfield of artificial intelligence that uses algorithms trained on
data sets to create models that enable machines to perform tasks that would otherwise
only be possible for humans, such as categorizing images, analyzing data, or predicting
price fluctuations.”

How does machine learning work?

Machine Learning enables computers to learn from data and make predictions or decisions
without explicit programming. The process involves several key steps:

1. Data Collection: The first step in Machine Learning is gathering relevant data
representing the problem or task at hand. This data can be collected from various
sources such as databases, sensors, or online platforms.
2. Data Preprocessing: Once the data is collected, it needs to be pre-processed to
ensure its quality and suitability for training the model. This involves cleaning the
data, handling missing values, and normalizing or transforming the data to a
consistent format.
3. Feature Extraction and Selection: The collected data may contain many features or
attributes in many cases. Feature extraction and selection involve identifying the
most informative and relevant features contributing to the learning task. This helps
reduce the data's dimensionality and improves the learning process's efficiency and
effectiveness.

4. Model Training: The training phase involves feeding the pre-processed data into a
Machine Learning algorithm or model. The model learns from the data by adjusting
its internal parameters based on the patterns and relationships it discovers. This is
done through iterative optimization processes, such as gradient descent or
backpropagation, depending on the specific algorithm used.
5. Model Evaluation: The model must be evaluated to assess its performance and
generalization ability after training it. This is typically done using a separate data set
called the test set, which was not used during training. Common evaluation metrics
include accuracy, precision, recall, and F1 score, depending on the nature of the
learning task.
6. Prediction or Decision Making: Once the model is trained and evaluated, it can
predict or decide on new, unseen data. The model takes input features and applies
the learned patterns to generate the desired output or prediction.
7. Model Refinement and Iteration: ML is an iterative process that involves refining the model based on feedback and new data. If the model's performance is unsatisfactory or not accurate enough, we can make adjustments by retraining the model with additional data, changing the algorithm, or tuning the model's parameters. A minimal code sketch of this workflow is shown below.
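To make steps 1 to 6 concrete, here is a minimal sketch of a typical scikit-learn workflow. The dataset (the built-in Iris data), the model choice and the metric are illustrative assumptions, not part of the notes above:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Steps 1-2: data collection and basic preprocessing (Iris is already clean)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3 (here: simple feature scaling instead of full feature selection)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Step 4: model training
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Step 5: model evaluation on the held-out test set
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 6: prediction on new, unseen data (one hypothetical flower measurement)
print(model.predict(scaler.transform([[5.1, 3.5, 1.4, 0.2]])))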

II. Machine learning in real life


When the average person thinks about machine learning, it may feel overwhelming,
complicated, and perhaps intangible, conjuring up images of futuristic robots taking over
the world. As more organizations and people rely on machine learning models to manage
growing volumes of data, instances of machine learning are occurring in front of and around
us daily—whether we notice or not. What’s exciting to see is how it’s improving our quality
of life, supporting quicker and more effective execution of some business operations and
industries, and uncovering patterns that humans are likely to miss. The more data machine
learning (ML) algorithms consume, the more accurate they become in their predictions and
decision-making processes.

Here are examples of machine learning at work in our daily life that provide value in many
ways—some large and some small.

1. Product recommendations

Do you wonder how Amazon or other retailers frequently know what you might like to
purchase? Or, have they gotten it wildly wrong and you wonder how they came up with the
recommendation? Thank machine learning. Targeted marketing in retail uses machine learning to group customers based on buying habits or demographic similarities, and by extrapolating what one person may want from someone else's purchases. While some suggested purchase pairings are obvious, machine learning can become remarkably accurate by finding hidden relationships in data and predicting what you want before you know you want it. If
the data is incomplete, sometimes you may end up with an offbase (not relatable)
recommendation—but don’t worry, because not buying it is another data point to learn
from.

2. Email automation and spam filtering

While your inbox seems relatively boring, machine learning influences its function behind the scenes. Email automation is a direct result of successful machine learning, and one function that goes mostly unnoticed is spam filtering. Successful spam filtering adapts and finds patterns in email content that is undesirable. This includes data from email domains, a sender's physical location, message text and structure, and IP addresses. It also requires help from users as they mark emails that are mistakenly filed. With each marked email, a new data reference is added that helps with future accuracy. This reduces junk email and keeps spam out of the inbox. One of the primary methods for spam mail detection is email filtering, which involves categorizing incoming emails into spam and non-spam. Machine learning algorithms can be trained to filter out spam mails based on their content and metadata, as sketched below.
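As a rough illustration of this idea, a tiny spam classifier can be sketched with scikit-learn. The four example emails, their labels and the choice of a Naive Bayes model are purely hypothetical:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data: email texts with labels (1 = spam, 0 = not spam)
emails = ["win a free prize now", "meeting agenda for tomorrow",
          "cheap loans click here", "lunch at noon?"]
labels = [1, 0, 1, 0]

# Convert the email text into word-count features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Train a Naive Bayes classifier on the labelled emails
clf = MultinomialNB()
clf.fit(X, labels)

# Classify a new, unseen email (expected to be flagged as spam)
print(clf.predict(vectorizer.transform(["free prize inside"])))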

3. Social media optimization

Machine learning has become helpful in fighting inappropriate content and cyberbullying,
which pose a risk to platforms in losing users and weakening brand loyalty. With the help of
machine learning, social media platforms can provide a better user experience, use data to
forecast future states, and predict more accurate results. According to data, 82% of X
(Twitter) users view videos, and 90% do it on a handheld device. As a result, X bought Magic
Pony Technology, a technology company with offices in London that has created machine
learning methods for visual augmentation to enhance the visual experience further.
Machine Learning enables social networking giants to promote their goods to niche
audiences by analyzing user data such as demographics, interests, and preferences. This
data is then used to create targeted ads that are shown to specific groups of users,
increasing the chances that they will be interested in the promoted goods. ML algorithms
can also analyze user behavior and predict which products or services they are most likely to
be interested in, allowing for even more precise targeting. Machine learning algorithms can
help protect social media platforms by detecting and flagging potentially harmful or
inappropriate content before it spreads. This not only helps prevent the spread of harmful
content but also helps maintain the platform’s reputation by promoting a safe and
welcoming online community. Pinterest uses machine learning to ensure data security. With ML, the business can identify spam users and content, promote relevant content, and gauge the likelihood that a user will pin it. Another example of a training algorithm in
ML is the “people you may know” feature on social media platforms like LinkedIn,
Instagram, Facebook, and X (formerly known as Twitter.) Based on your contacts,
comments, likes, or existing connections, the algorithm suggests familiar faces from your
real-life network that you might want to connect with or follow.

4. Healthcare advancement
For the healthcare industry, machine learning algorithms are particularly valuable because
they can help us make sense of the massive amounts of healthcare data that is generated
every day within electronic health records. Machine learning algorithms can help us find patterns and insights in this medical data that would be impossible to find manually. That means healthcare information for clinicians can
be enhanced with analytics and machine learning to gain insights that support better
planning and patient care, improved diagnoses, and lower treatment costs. Healthcare
brands such as Pfizer and Providence have begun to benefit from analytics enhanced by
human and artificial intelligence. There are some processes that are better suited to
leverage machine learning; machine learning integration with radiology, cardiology, and
pathology. As an example, wearables generate mass amounts of data on the wearer’s health
and many use AI and machine learning to alert them or their doctors of issues to support
preventative measures and respond to emergencies. Wearables are electronic technology
or devices incorporated into items that can be comfortably worn on a body. These
wearable devices are used for tracking information on real time basis. They have motion
sensors that take the snapshot of your day to day activity and sync them with mobile
devices or laptop computers. Also, one of the most common use cases for machine learning among healthcare professionals is automating medical billing. At MD Anderson Cancer Center (located in Houston, Texas), data scientists have developed the first deep-learning healthcare algorithm that uses machine learning to predict acute toxicities in patients receiving radiation therapy for head and neck cancers. ML algorithms can analyze
retinal images to detect diabetic retinopathy, predict cardiovascular risks from electronic
health records, or assist in the early detection of cancerous tumors through imaging. These
machine learning in healthcare examples highlight the technology's potential to augment
(increase) the capabilities of medical professionals, rather than replace them.

5. Predictive analytics

Predictive analytics is an area of advanced analytics that uses data to make predictions
about the future. Techniques such as data mining, statistics, machine learning and artificial intelligence are used to analyze current and historical data for any patterns or anomalies that can
help identify risks and opportunities, minimize the chance for human errors, and increase
speed and thoroughness of analysis. With closer investigation of what happened and what
could happen using data, people and organizations are becoming more proactive and
forward looking. Florida International University is one example. By integrating predictive
models with data analysis from Tableau, they’re communicating critical insights about
academic performance before students are at risk and supporting their individual needs to
help them successfully complete all courses and graduate.

6. Image recognition

Image recognition is another machine learning technique that appears in our day-to-day life.
With the use of ML, programs can identify an object or person in an image based on the
intensity of the pixels. This type of facial recognition is used for password protection
methods like Face ID and in law enforcement. By filtering through a database of people to
identify commonalities and matching them to faces, police officers and investigators can
narrow down a list of crime suspects.

7. Virtual personal assistants

Virtual personal assistants are devices you might have in your own homes, such as Amazon’s
Alexa, Google Home, or the Apple iPhone’s Siri. These devices use a combination of speech
recognition technology and machine learning to capture data on what you're requesting and
how often the device is accurate in its delivery. They detect when you start speaking, what
you’re saying, and deliver on the command. For example, when you say, “Siri, what is the
weather like today?”, Siri searches the web for weather forecasts in your location and
provides detailed information.

8. Credit card fraud detection

Predictive analytics can help determine whether a credit card transaction is fraudulent or
legitimate. Fraud examiners use AI and machine learning to monitor variables involved in
past fraud events. They use these training examples to measure the likelihood that a specific
event was fraudulent activity.

9. Traffic predictions

When you use Google Maps to map your commute to work or a new restaurant in town, it
provides an estimated time of arrival. Google uses machine learning to build models of how
long trips will take based on historical traffic data (gleaned from satellites). It then takes that
data based on your current trip and traffic levels to predict the best route according to these
factors.

10. Self-driving car technology

A frequently used type of machine learning is reinforcement learning, which is used to power self-driving car technology. Self-driving vehicle company Waymo uses machine
learning sensors to collect data of the car's surrounding environment in real time. This data
helps guide the car's response in different situations, whether it is a human crossing the
street, a red light, or another car on the highway.

III. Types of Learning


Based on the methods and way of learning, machine learning is divided into mainly four
types, which are:

1. Supervised Machine Learning


2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning

SUPERVISED MACHINE LEARNING


What is Supervised Learning?
As its name suggests, Supervised machine learning is based on supervision. It means in the
supervised learning technique, we train the machines using the "labelled" dataset, and
based on the training, the machine predicts the output. Here, the labelled data specifies
that some of the inputs are already mapped to the output. More precisely, we can say:
first, we train the machine with the input and corresponding output, and then we ask the
machine to predict the output using the test dataset.

Let's understand supervised learning with an example. Suppose we have an input dataset of
cats and dog images. So, first, we will provide the training to the machine to understand the
images, such as the shape & size of the tail of cat and dog, Shape of eyes, colour, height
(dogs are taller, cats are smaller), etc. After completion of training, we input the picture of
a cat and ask the machine to identify the object and predict the output. Now, the machine is
well trained, so it will check all the features of the object, such as height, shape, colour,
eyes, ears, tail, etc., and find that it's a cat. So, it will put it in the Cat category. This is the
process of how the machine identifies the objects in Supervised Learning.

The main goal of the supervised learning technique is to map the input variable(x) with
the output variable(y). Some real-world applications of supervised learning are Risk
Assessment, Fraud Detection, Spam filtering, etc.

Another example of Supervised Machine Learning is EXIT POLLS

Exit polls act as training data: they are collected as feedback from different groups of people (older voters, youngsters, etc.), recording which party got how many seats, which party is winning and which is losing. All of this acts as training data.

When the actual counting starts, it acts as the testing data on which the trained model is applied. Based on this new input data, the model gives out results. If the results on the testing data (new input) match the pattern learned from the training data, it means the model has learned well.

Categories of Supervised Machine Learning


Supervised machine learning can be classified into two types of problems, which are given
below:

a) Classification
Classification algorithms are used to solve the classification problems in which the output
variable is categorical, such as "Yes" or No, Male or Female, Red or Blue, etc. The
classification algorithms predict the categories present in the dataset. Some real-world
examples of classification algorithms are Spam Detection, Email filtering, etc.

Some popular classification algorithms are given below:

o Random Forest Algorithm


o Decision Tree Algorithm
o Logistic Regression Algorithm
o Support Vector Machine Algorithm

b) Regression
Regression algorithms are used to solve regression problems, in which the output variable is continuous and there is a relationship between the input and output variables. These are used to predict continuous output variables, such as market trends, weather prediction, etc. (A small example is given after the list of algorithms below.)

Some popular Regression algorithms are given below:



o Simple Linear Regression Algorithm


o Multivariate Regression Algorithm
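A minimal sketch of a simple linear regression with scikit-learn; the toy house-size and price numbers below are invented purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: house size in square feet (input) and price (output)
X = np.array([[500], [750], [1000], [1250], [1500]])
y = np.array([25, 36, 50, 61, 75])

# Fit a simple linear regression model
reg = LinearRegression()
reg.fit(X, y)

# Predict a continuous output for a new, unseen house size
print(reg.predict([[1100]]))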

Advantages and Disadvantages of Supervised Learning


Advantages:

o Since supervised learning works with a labelled dataset, we can have an exact idea about the classes of objects.
o These algorithms are helpful in predicting the output on the basis of prior experience.
Disadvantages:

o These algorithms are not able to solve complex tasks.


o It may predict the wrong output if the test data is different from the training data.
o It requires lots of computational time to train the algorithm.

Applications of Supervised Learning


Some common applications of Supervised Learning are given below:

o Image Segmentation:
Supervised Learning algorithms are used in image segmentation. In this process, image classification is performed on different image data with pre-defined labels.

o Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis purposes. This is done by using medical images and past data labelled with disease conditions. With such a process, the machine can identify a disease for new patients.

o Fraud Detection:
Supervised Learning classification algorithms are used for identifying fraud
transactions, fraud customers, etc. It is done by using historic data to identify the
patterns that can lead to possible fraud.

o Spam detection:
In spam detection & filtering, classification algorithms are used. These algorithms
classify an email as spam or not spam. The spam emails are sent to the spam folder.

o Speech Recognition:
Supervised learning algorithms are also used in speech recognition. The algorithm is
trained with voice data, and various identifications can be done using the same, such
as voice-activated passwords, voice commands, etc.

UNSUPERVISED MACHINE LEARNING


Unsupervised learning is different from the Supervised learning technique; as its name
suggests, there is no need for supervision. It means, in unsupervised machine learning, the
machine is trained using the unlabeled dataset, and the machine predicts the output
without any supervision.

In unsupervised learning, the models are trained with the data that is neither classified nor
labelled, and the model acts on that data without any supervision.

The main aim of the unsupervised learning algorithm is to group or categorize the
unsorted dataset according to the similarities, patterns, and differences. Machines are
instructed to find the hidden patterns from the input dataset.

Let's take an example to understand it more precisely; suppose there is a basket of fruit
images, and we input it into the machine learning model. The images are totally unknown to
the model, and the task of the machine is to find the patterns and categories of the objects.

So, now the machine will discover its patterns and differences, such as colour difference,
shape difference, and predict the output when it is tested with the test dataset.

Categories of Unsupervised Machine Learning


Unsupervised Learning can be further classified into two types, which are given below:

1) Clustering
The clustering technique is used when we want to find the inherent groups from the data. It
is a way to group the objects into a cluster such that the objects with the most similarities
remain in one group and have fewer or no similarities with the objects of other groups. An example of the clustering algorithm is grouping customers by their purchasing behaviour (a small sketch is given after the list of algorithms below).

Some of the popular clustering algorithms are given below:

o K-Means Clustering algorithm


o Principal Component Analysis (strictly a dimensionality-reduction technique, often used together with clustering)
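A minimal sketch of clustering with scikit-learn's K-Means; the customer numbers below are invented purely for illustration:

import numpy as np
from sklearn.cluster import KMeans

# Toy customer data: [annual spend (in thousands), visits per month]
X = np.array([[10, 2], [12, 3], [11, 2],
              [50, 15], [48, 14], [52, 16]])

# Group the customers into 2 clusters based on similarity
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster assignment for each customer
print(kmeans.cluster_centers_)  # centre of each discovered group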

2) Association
Association rule learning is an unsupervised learning technique, which finds interesting
relations among variables within a large dataset. The main aim of this learning algorithm is
to find the dependency of one data item on another data item and map those variables
accordingly so that it can generate maximum profit. This algorithm is mainly applied
in Market Basket analysis, continuous production, etc.

Some popular algorithms of Association rule learning are Apriori Algorithm, etc.

Advantages and Disadvantages of Unsupervised Learning Algorithm

Advantages:

 These algorithms can be used for complicated tasks compared to the supervised
ones because these algorithms work on the unlabeled dataset.
 Unsupervised algorithms are preferable for various tasks as getting the unlabeled
dataset is easier as compared to the labelled dataset.

Disadvantages:

 The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled and the algorithm is not trained with the exact output beforehand.
 Working with Unsupervised learning is more difficult as it works with the unlabelled
dataset that does not map with the output.

Applications of Unsupervised Learning

Network Analysis
Unsupervised learning is used in document network analysis of text data, for example to identify plagiarism and copyright issues among scholarly articles.

Recommendation Systems
Recommendation systems widely use unsupervised learning techniques for building
recommendation applications for different web applications and e-commerce websites.

Anomaly Detection
Anomaly detection is a popular application of unsupervised learning, which can identify
unusual data points within the dataset. It is used to discover fraudulent transactions.

Semi-Supervised Learning
Semi-Supervised learning is a type of Machine Learning algorithm that lies between
Supervised and Unsupervised machine learning.

To overcome the drawbacks of supervised learning and unsupervised learning algorithms, the concept of Semi-supervised learning is introduced.

We can imagine these algorithms with an example. Supervised learning is like a student being under the supervision of an instructor at home and college. Further, if that student is self-analysing the same concept without any help from the instructor, it comes under unsupervised learning. Under semi-supervised learning, the student has to revise on their own after analysing the same concept under the guidance of an instructor at college.

Reinforcement Learning
Reinforcement learning works on a feedback-based process, in which an AI agent (a software component) automatically explores its surroundings by hit and trial: taking actions, learning from experiences, and improving its performance. The agent gets rewarded for each good action and punished for each bad action; hence the goal of a reinforcement learning agent is to maximize the rewards.

In reinforcement learning, there is no labelled data like supervised learning, and agents
learn from their experiences only.

The reinforcement learning process is similar to a human being; for example, a child learns
various things by experiences in his day-to-day life. An example of reinforcement learning is
to play a game, where the Game is the environment, moves of an agent at each step define
states, and the goal of the agent is to get a high score. Agent receives feedback in terms of
punishment and rewards.

Due to its way of working, reinforcement learning is employed in different fields such as game theory and operations research.

A reinforcement learning problem can be formalized using a Markov Decision Process (MDP). In an MDP, the agent constantly interacts with the environment and performs actions; at each action, the environment responds and generates a new state. A tiny Q-learning sketch of this loop is given below.
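As a rough illustration of this agent-environment loop, here is a minimal tabular Q-learning sketch. The 5-state chain environment, the reward values and the hyper-parameters are all invented for demonstration:

import numpy as np

# A tiny chain environment: states 0..4, actions 0 = left, 1 = right.
# Reaching state 4 gives a reward of +1 and ends the episode.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))        # Q-table: value of each (state, action) pair
alpha, gamma, epsilon = 0.1, 0.9, 0.2      # learning rate, discount factor, exploration rate

def step(state, action):
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    done = next_state == n_states - 1
    return next_state, reward, done

for episode in range(200):
    state, done = 0, False
    while not done:
        # Explore with probability epsilon, otherwise exploit the best known action
        if np.random.rand() < epsilon:
            action = np.random.randint(n_actions)
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done = step(state, action)
        # Q-learning update: nudge Q towards reward + discounted best future value
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(Q)  # after training, the 'right' action should have the higher value in each state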

Categories of Reinforcement Learning


Reinforcement learning is categorized mainly into two types of methods/algorithms:

o Positive Reinforcement Learning


o Negative Reinforcement Learning

Real-world Use cases of Reinforcement Learning

o Video Games:
RL algorithms are popular in gaming applications, where they are used to reach super-human performance. A popular game environment used for RL research is Minecraft.
o Robotics:
RL is widely being used in Robotics applications. Robots are used in the industrial and
manufacturing area, and these robots are made more powerful with reinforcement
learning. There are different industries that have their vision of building intelligent
robots using AI and Machine learning technology.

Advantages and Disadvantages of Reinforcement Learning

Advantages

o It helps in solving complex real-world problems which are difficult to solve with general techniques.
o The learning model of RL is similar to the learning of human beings; hence it can produce highly accurate results.
o Helps in achieving long term results.
Disadvantage

o RL algorithms are not preferred for simple problems.


o RL algorithms require huge data and computations.
o Too much reinforcement learning can lead to an overload of states which can
weaken the results.

IV. Applications of Machine Learning in Data Science


Data Science, as you probably know, covers a wide spectrum of domains and Machine
Learning is one of them. Data Science basically comprises various fields and techniques, like
Statistics and Artificial Intelligence for Data Analysis to draw meaningful insights.

What are the Applications of Machine Learning in Data Science?


Listed below are some of the most popular applications of Machine Learning in Data
Science:

 Real-Time Navigation: Google Maps is one of the most commonly used real-time navigation applications. But have you ever wondered why, despite the usual traffic, you are put on the fastest route? It is because of the data received from people currently using this service and a database of historical traffic data. Everyone who uses this service contributes to making this application more accurate. When you open the application, it constantly sends data back to Google, providing information about the route being travelled and traffic patterns at any given time of the day.
 Image Recognition: Image Recognition is one of the most common applications of
Machine Learning in Data Science. Image Recognition is used to identify objects,
persons, places, etc.
 Product Recommendation: Product Recommendation is profoundly used by
eCommerce and Entertainment companies like Amazon, Netflix, Hotstar, etc. They
use various Machine Learning algorithms on the data collected from you to
recommend products or services that you might be interested in.
 Speech Recognition: Speech Recognition is a process of translating spoken
utterances into text. This text can be in terms of words, syllables, sub-word units, or
even characters. Some of the well-known examples are Siri, Google Assistant,
Youtube Closed Captioning, etc.

(Refer to the “Machine learning in real life” topic from section II of these notes for more applications)

V. Data Pre-processing and Feature Engineering

Data Pre-processing
Data pre-processing is a process of preparing the raw data and making it suitable for a
machine learning model. It is the first and crucial step while creating a machine learning
model.

When creating a machine learning project, we do not always come across clean and formatted data. And while doing any operation with data, it is necessary to clean it and put it in a formatted way. For this, we use the data pre-processing task.

Why do we need Data Pre-processing?

Real-world data generally contains noise and missing values, and may be in an unusable format which cannot be directly used for machine learning models. Data pre-processing is the required task for cleaning the data and making it suitable for a machine learning model, which also increases the accuracy and efficiency of the model.

It involves below steps:

o Getting the dataset


o Importing libraries
o Importing datasets
o Finding Missing Data
o Encoding Categorical Data
o Splitting dataset into training and test set
o Feature scaling

1) Get the Dataset


To create a machine learning model, the first thing we require is a dataset, as a machine learning model works entirely on data. The collected data for a particular problem in a proper format is known as the dataset.

Datasets come in different formats for different purposes; for example, the dataset needed for a business problem will be different from the dataset required for a liver-patient study. So each dataset is different from another dataset. To use the dataset in our code, we usually put it into a CSV file. However, sometimes we may also need to use an HTML or xlsx file.

What is a CSV File?


CSV stands for "Comma-Separated Values"; it is a file format which allows us to save tabular data, such as spreadsheets. It is useful for large datasets, and these datasets can be used directly in programs.

2) Importing Libraries
In order to perform data pre-processing using Python, we need to import some predefined
Python libraries. These libraries are used to perform some specific jobs. There are three
specific libraries that we will use for data pre-processing, which are:

Numpy: The Numpy Python library is used for including any type of mathematical operation in the code. It is the fundamental package for scientific calculation in Python. It also supports large, multidimensional arrays and matrices. In Python, we can import it as:

import numpy as nm
Here we have used nm, which is a short name for Numpy, and it will be used in the whole
program.

Matplotlib: The second library is matplotlib, which is a Python 2D plotting library. With this library, we need to import the sub-library pyplot. This library is used to plot any type of chart in Python. It will be imported as below:

import matplotlib.pyplot as mpt


Here we have used mpt as a short name for this library.

Pandas: The last library is the Pandas library, which is one of the most famous Python libraries and is used for importing and managing datasets. It is an open-source data manipulation and analysis library. It will be imported as below:

import pandas as pd

Here, we have used pd as a short name for this library.

3) Importing the Datasets


Now we need to import the datasets which we have collected for our machine learning
project. But before importing a dataset, we need to set the current directory as a working
directory.

read_csv() function:
Now to import the dataset, we will use the read_csv() function of the pandas library, which is used to read a csv file and perform various operations on it. Using this function, we can read a csv file locally as well as through a URL. We can use the read_csv function as below:

data_set= pd.read_csv('Dataset.csv')
Here, data_set is a name of the variable to store our dataset, and inside the function, we
have passed the name of our dataset. Once we execute the above line of code, it will
successfully import the dataset in our code. We can also check the imported dataset by
clicking on the section variable explorer, and then double click on data_set.

If you are using the Python language for machine learning, then this extraction step is required, but for the R language it is not.
Extracting dependent and independent variables:

In machine learning, it is important to distinguish the matrix of features (independent variables) and the dependent variable in the dataset.

Extracting the independent variables:

To extract the independent variables, we will use the iloc[ ] method of the Pandas library. It is used to extract the required rows and columns from the dataset.

Extracting the dependent variable:

To extract the dependent variable, we will again use the Pandas .iloc[] method, as sketched below.
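A minimal sketch of this extraction, assuming the last column of Dataset.csv is the dependent variable (the file name and column layout are assumptions carried over from the read_csv example above):

import pandas as pd

data_set = pd.read_csv('Dataset.csv')

# All rows, every column except the last one -> matrix of features (independent variables)
x = data_set.iloc[:, :-1].values

# All rows, only the last column -> dependent variable
y = data_set.iloc[:, -1].values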

4) Handling Missing data:


The next step of data pre-processing is to handle missing data in the datasets. If our dataset
contains some missing data, then it may create a huge problem for our machine learning
model. Hence it is necessary to handle missing values present in the dataset.

Ways to handle missing data:

There are mainly two ways to handle missing data, which are:

By deleting the particular row: The first way is commonly used to deal with null values. In this way, we just delete the specific row or column which contains null values. But this way is not very efficient, and removing data may lead to loss of information, which will not give an accurate output.

By calculating the mean: In this way, we calculate the mean of the column or row which contains the missing value and put it in the place of the missing value. This strategy is useful for features which have numeric data, such as age, salary, year, etc. Here, we will use this approach.

To handle missing values, we will use the Scikit-learn library in our code, which contains various tools for building machine learning models. Here we will use the SimpleImputer class of the sklearn.impute module (in older scikit-learn versions this was the Imputer class of sklearn.preprocessing). Below is the code for it:

# handling missing data (replacing missing data with the mean value)
from sklearn.impute import SimpleImputer
import numpy as np
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Fitting the imputer object to the independent variables x (columns 1 and 2)
imputer = imputer.fit(x[:, 1:3])
# Replacing missing data with the calculated mean value
x[:, 1:3] = imputer.transform(x[:, 1:3])

Feature Engineering
What is a Feature?

In the context of machine learning, a feature (also known as a variable or attribute) is an individual measurable property or characteristic of a data point that is used as input for a machine learning algorithm. Features can be numerical, categorical, or text-based, and they represent different aspects of the data that are relevant to the problem at hand.

For example, in a dataset of housing prices, features could include the number of bedrooms,
the square footage, the location, and the age of the property. In a dataset of customer
demographics, features could include age, gender, income level, and occupation.

The choice and quality of features are critical in machine learning, as they can greatly impact
the accuracy and performance of the model.

What is Feature Engineering?

Feature engineering is the process of transforming raw data into features that are suitable
for machine learning models. In other words, it is the process of selecting, extracting, and
transforming the most relevant features from the available data to build more accurate and
efficient machine learning models.

Feature Engineering is the process of creating new features or transforming existing features to improve the performance of a machine-learning model. It involves selecting relevant information from raw data and transforming it into a format that can be easily understood by a model. The goal is to improve model accuracy by providing more meaningful and relevant information.

The success of machine learning models heavily depends on the quality of the features used
to train them. Feature engineering involves a set of techniques that enable us to create new
features by combining or transforming the existing ones. These techniques help to highlight
the most important patterns and relationships in the data, which in turn helps the machine
learning model to learn from the data more effectively.
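A minimal sketch of creating new features with pandas, using the housing example mentioned earlier; the column names and values are invented for illustration:

import pandas as pd

# Toy housing data
df = pd.DataFrame({'price': [250000, 340000, 180000],
                   'square_footage': [1200, 1700, 950],
                   'year_built': [1995, 2010, 1980]})

# New feature: price per square foot, created by combining two existing columns
df['price_per_sqft'] = df['price'] / df['square_footage']

# New feature: age of the property, derived from the year it was built
df['property_age'] = 2024 - df['year_built']

print(df)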

Need for Feature Engineering in Machine Learning?

 Improve User Experience: The primary reason we engineer features is to enhance the user experience of a product or service. By adding new features, we can make the product more intuitive, efficient, and user-friendly, which can increase user satisfaction and engagement.

 Competitive Advantage: Another reason we engineer features is to gain a competitive advantage in the marketplace. By offering unique and innovative features, we can differentiate our product from competitors and attract more customers.

 Meet Customer Needs: We engineer features to meet the evolving needs of customers. By analyzing user feedback, market trends, and customer behavior, we can identify areas where new features could enhance the product’s value and meet customer needs.

 Increase Revenue: Features can also be engineered to generate more revenue. For example, a new feature that streamlines the checkout process can increase sales, or a feature that provides additional functionality could lead to more upsells or cross-sells.

 Future-Proofing: Engineering features can also be done to future-proof a product or service. By anticipating future trends and potential customer needs, we can develop features that ensure the product remains relevant and useful in the long term.

VI. Handling Missing data


To gain practical experience in addressing missing data challenges through
the analysis of the Titanic dataset within a Python environment.

The first step is data preprocessing, before looking for any insights from the data; only then can we train our machine learning model, because uncleaned data cannot be processed by most machine learning algorithms. Missing values in a dataset are a very common phenomenon in reality, yet a big problem in real-life scenarios. Missing data is also referred to as NA (Not Available) values in pandas. Many datasets simply arrive with missing data in a DataFrame, either because it exists and was not collected, or because it never existed.

For example, Suppose during a survey-

 Different users may choose not to share their addresses.


 There might be a failure in recording the values due to human error.
 Due to improper maintenance past data might get corrupted.

In these ways, values can end up missing from a dataset.

To solve all these problems, we have various methods to handle the missing data. They are
as follows –

1. Delete Rows with Missing Values


One way of handling missing values is the deletion of the rows or columns having null values. If any column has more than half of its values as null, then you can drop the entire column. In the same way, rows can also be dropped if they have one or more column values as null. Before using this method, one thing we have to keep in mind is that we should not be losing information; if the information we are deleting contributes to the output value, then we should not use this method, because this will affect our output.
When to delete the rows/column in a dataset?
 If a certain column has many missing values then you can choose to drop the
entire column.
 When you have a huge dataset. Deleting for e.g. 2-3 rows/columns will not
make much difference.

No doubt it is one of the quick techniques one can use to deal with missing
values. But this approach is not recommended.

2. Replacing With Arbitrary Value

You can replace a missing value with some arbitrary value using fillna().

Ex. In the below code, we replace the missing values with '0'. You can also replace the missing values of any particular column with an arbitrary value.
 Replacing with previous value – Forward fill
We can impute the values with the previous value by using forward fill. It is
mostly used in time series data.
Syntax: df.fillna(method=’ffill’)
 Replacing with next value – Backward fill
In backward fill, the missing value is imputed using the next value. It is mostly
used in time series data.
3. Interpolation
Missing values can also be imputed using interpolation. The Pandas interpolate() method can be used to replace missing values with different interpolation methods. Interpolation is, in most cases, considered the best technique to fill missing values.

Handling missing values: python code

(Refer to your ML Practical No. 2 for a brief overview for the code)

We have taken the dataset titanic.csv, which is freely available at kaggle.com. This dataset was chosen because it has missing values. A small generic sketch of the operations described below is given after the step list.

1. Reading the data


The dataset is read, and the columns 'PassengerId', 'Fare', 'Survived', 'Pclass', 'Name', 'Gender', 'Age', 'Ticket' and 'Cabin' are used in the given code.
2. Checking if there are missing values
3. Filling missing values with 0
4. Filling NaN values with forward fill value
If we use forward fill, that simply means we are carrying the previous value forward wherever we have NaN values.
We can also set the forward fill limit to 1, which means the value will be copied downwards only once. For example, we had three consecutive NaN values in the column Survived, but only one NaN value was filled because the limit is set to 1.
5. Filling NaN values in Backward Direction
6. Interpolate of missing values
The code given first fills in the missing values in the 'Age' column with the
mean age of the existing values. Then, it creates a new DataFrame df_1 with
all the numerical values rounded to 2 decimal places.
7. Dropna()
Previously we had 891 rows, and after running this code we are left with 710 rows, because some of the rows containing NaN values were dropped.
8. Deleting the rows having all NaN values
Those rows in which all the values are NaN will be deleted. But if the row has even one valid value, it will not be dropped.
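A minimal sketch of these operations on a toy DataFrame (the column names and values below are invented, not the actual Titanic data):

import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [22, np.nan, 38, np.nan],
                   'Fare': [7.25, 71.28, np.nan, 8.05]})

print(df.fillna(0))            # replace all NaN values with 0
print(df.ffill(limit=1))       # forward fill, copying a value downwards at most once
print(df.bfill())              # backward fill using the next valid value
print(df.interpolate())        # fill NaN values by interpolation
print(df.dropna())             # drop rows that contain any NaN value
print(df.dropna(how='all'))    # drop only rows in which every value is NaN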

VII. Feature Scaling


(REFER TO YOUR ML PRACTICAL NO. 1 FOR THE TOPIC’S PRACTICAL
IMPLEMENTATION)

Feature Scaling is a critical step in building accurate and effective machine learning models.
One key aspect of feature engineering is scaling, normalization, and standardization, which
involves transforming the data to make it more suitable for modeling. These techniques can
help to improve model performance, reduce the impact of outliers, and ensure that the data
is on the same scale.

“Feature scaling is a data preprocessing technique that transforms the values of independent variables in a dataset to a common range or scale.”

In other words, Feature scaling is a preprocessing technique that transforms feature values
to a similar scale, ensuring all features contribute equally to the model.

It’s essential for datasets with features of varying ranges, units, or magnitudes. Common
techniques include standardization, normalization, and min-max scaling. This process
improves model performance, convergence, and prevents bias from features with larger
values.

This process is important because it ensures that all features contribute equally to a
machine learning model, and that the model doesn't give more importance to a feature
based on its value range.

Imagine you're comparing the heights of a group of people:

 Original Data:
o Person A: 5 feet 10 inches (70 inches)
o Person B: 6 feet 2 inches (74 inches)
o Person C: 5 feet 3 inches (63 inches)
 Problem: The raw numbers (inches) might not be the most intuitive for comparison,
especially if you're dealing with many different measurements (weight, age, etc.).

Feature Scaling to the Rescue


Feature scaling transforms the data to a common scale, making comparisons easier. Let's
use MinMax Scaling as an example:

1. Find the Minimum and Maximum:


o Minimum height: 63 inches (Person C)
o Maximum height: 74 inches (Person B)

2. Apply the Formula:


o Scaled value = (Original value - Minimum value) / (Maximum value -
Minimum value)

 Scaled Heights:
o Person A: (70 - 63) / (74 - 63) = 7/11 ≈ 0.64
o Person B: (74 - 63) / (74 - 63) = 11/11 = 1
o Person C: (63 - 63) / (74 - 63) = 0/11 = 0

Now, the heights are scaled between 0 and 1:

 Person A: Slightly taller than average


 Person B: Tallest person
 Person C: Shortest person

Why is this helpful in Machine Learning?

 Faster Convergence: Many algorithms (like gradient descent) converge faster when
features are on a similar scale.
 Improved Accuracy: Some algorithms are sensitive to feature scales, and scaling can
significantly improve their performance.
 Better Visualization: Scaled data is often easier to visualize and interpret.

In Summary
Feature scaling is like standardizing measurements to a common unit. It makes your data
more consistent and can lead to better results in your machine learning models.

Types of Feature Scaling


Two primary methods are normalization and standardization.

1. Normalization: This method scales each feature so that all values are within the range of 0 and 1. It achieves this by subtracting the minimum value of the feature and dividing by the range (difference between maximum and minimum values).

Normalization is a fundamental technique in machine learning that transforms numerical features to a standard scale, typically between 0 and 1. This process ensures that all features contribute equally to the analysis, preventing those with larger numerical ranges from dominating the model’s outcomes.


Understanding Normalization
Normalization is achieved through a simple formula:
(x - min(X)) / (max(X) - min(X))

 x is the value you want to normalize

 min(X) is the minimum value in your data set

 max(X) is the maximum value in your data set

(Standard Deviation: a measure of how spread out the values are, i.e. the extent to which they differ from the average.)

2. Standardization: Here, each feature is transformed to have a mean of 0 and a standard deviation of 1. This is achieved by subtracting the mean value and dividing by the standard deviation of the feature.

The formula for standardization is:

Z = (x - μ) / σ
Where:

 Z is the standardized score (also called a z-score)

 x is the original value you want to standardize

 μ (mu) is the mean of the data set

 σ (sigma) is the standard deviation of the data set
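A minimal sketch of both techniques with scikit-learn, using the toy income and age values from the examples below; the column order is an assumption:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data: [annual income, age] for three people
X = np.array([[70000, 45],
              [60000, 44],
              [52000, 40]])

# Normalization: rescale each feature to the range [0, 1]
print(MinMaxScaler().fit_transform(X))

# Standardization: rescale each feature to mean 0 and standard deviation 1
print(StandardScaler().fit_transform(X))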

Visualizing/Example of Feature Scaling

Consider a simple dataset with two features: annual income and age. Let’s
say we have three individuals — blue, purple, and red — with varying
values for these features. Without scaling, the difference in the scale of
these features can mislead the algorithm. For instance, a small difference
in age might be overshadowed by a large difference in income.

Applying Feature Scaling

Example of Normalization
Consider a dataset with annual income data for three individuals:
- Person A: $70,000
- Person B: $60,000
- Person C: $52,000
To normalize the income values using the formula:
1. Find the minimum and maximum values:
- Minimum income: min(x) = $52,000
- Maximum income: max(x) = $70,000
2. Apply the normalization formula to each income value:

After normalization:
- Person A’s normalized income = 1
- Person B’s normalized income ≈ 0.444
- Person C’s normalized income = 0

Normalization transforms data into a standardized range, facilitating fair comparisons across features with different scales. By bringing all values within a consistent range (0 to 1 in this case), normalization enhances the performance and interpretability of machine learning models, ensuring robust and reliable results.

Understanding Standardization

Consider a dataset with age data for three individuals:

- Person A: 45 years
- Person B: 44 years
- Person C: 40 years

To standardize the age values using the formula:
1. Compute the mean: μ = (45 + 44 + 40) / 3 = 43
2. Compute the (population) standard deviation: σ = sqrt(((45-43)² + (44-43)² + (40-43)²) / 3) ≈ 2.16
3. Apply Z = (x - μ) / σ to each age value.

After standardization:
- Person A’s standardized age ≈ 0.93
- Person B’s standardized age ≈ 0.46
- Person C’s standardized age ≈ -1.39
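The same calculation in a few lines of numpy (the population standard deviation is used, as scikit-learn's StandardScaler also does):

import numpy as np

ages = np.array([45, 44, 40])
z_scores = (ages - ages.mean()) / ages.std()   # std() defaults to the population formula
print(z_scores)                                # approximately [ 0.93  0.46 -1.39]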

Standardization transforms data to have properties that are more suitable for many machine learning algorithms. By centering the data around 0 and scaling it to have a standard deviation of 1, standardization ensures that all features contribute equally to model training and evaluation. This pre-processing step is particularly useful when features have varying scales and distributions, enhancing the stability and performance of machine learning models.

Impact of Scaling

Post-scaling, the purple person’s data shows clearer similarity:


- Normalized income values: 0.5 (between blue and red)
- Normalized age values: closer to blue (0.44 versus 0.33 for red)

Conclusion

In conclusion, feature scaling is not just a technical step but a critical process that ensures the integrity and effectiveness of machine learning models. By standardizing the scale of features, we enable fair comparisons and accurate predictions across different types of data.

Understanding feature scaling empowers machine learning practitioners to preprocess data effectively, enhancing the performance and reliability of their models.

Normalization vs Standardization

- Objective: normalization brings the values of a feature within a specific range, often between 0 and 1; standardization transforms the values of a feature to have a mean of 0 and a standard deviation of 1.
- Outliers: normalization is sensitive to outliers and to the range of the data; standardization is less sensitive to outliers due to the use of the mean and standard deviation.
- When useful: normalization is useful when maintaining the original range is essential; standardization is effective when algorithms assume a standard normal distribution.
- Distribution assumption: normalization makes no assumption about the distribution of the data; standardization assumes a normal distribution or a close approximation.
- Suitable algorithms: normalization suits algorithms where the absolute values and their relations are important (e.g., k-nearest neighbors, neural networks); standardization is particularly useful for algorithms that assume normally distributed data, such as linear regression and support vector machines.
- Interpretability: normalization maintains the interpretability of the original values within the specified range; standardization alters the original values, making interpretation more challenging due to the shift in scale and units.
- Convergence: normalization can lead to faster convergence, especially in algorithms that rely on gradient descent; standardization also contributes to faster convergence, particularly in algorithms sensitive to the scale of input features.
- Use cases: normalization is used in image processing, neural networks, and other algorithms sensitive to feature scales; standardization is used in linear regression, support vector machines, and algorithms assuming a normal distribution.

When to use normalization and standardization

 When you don’t know the distribution of your data or when you know it’s not
Gaussian, normalization is a smart approach to apply. Normalization is useful
when your data has variable scales and the technique you’re employing, such as k-
nearest neighbors and artificial neural networks, doesn’t make assumptions about
the distribution of your data.

 The assumption behind standardization is that your data follows a Gaussian (bell curve) distribution. This isn’t strictly required; however, the approach works better if your attribute distribution is Gaussian. Standardization is useful when your data has variable dimensions and the technique you’re using assumes normally distributed data (like logistic regression, linear regression, or linear discriminant analysis).

VIII. One Hot Encoding

What Is One Hot Encoding

One common challenge in machine learning is dealing with categorical variables (such as colors, product types, or locations) because the algorithms typically require numerical input. One solution to this problem is one-hot encoding.

One-hot encoding is a technique used to convert categorical data into a binary format where each category is represented by a separate column, with a 1 indicating its presence and 0s for all other categories.

This means that one-hot encoding is a method of converting categorical variables into a format that can be provided to machine learning algorithms to improve prediction. It involves creating new binary columns for each unique category in a
feature. Each column represents one unique category, and a value of 1 or 0 indicates
the presence or absence of that category.

Let's consider an example to illustrate how one-hot encoding works. Suppose we have a dataset with a single categorical feature, Color, that can take on three values: Red, Green, and Blue. Using one-hot encoding, we can transform this feature as follows:
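The resulting table, reconstructed here for illustration:

Color | Color_Red | Color_Green | Color_Blue
Red   |     1     |      0      |     0
Green |     0     |      1      |     0
Blue  |     0     |      0      |     1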

In this example, the original "Color" column is replaced by three new binary
columns, each representing one of the colors. A value of 1 indicates the
presence of the color in that row, while a 0 indicates its absence.

Why Use One-Hot Encoding?

One-hot encoding is an essential technique in data preprocessing for several reasons. It transforms categorical data into a format that machine learning models can easily understand and use. This transformation allows each category to be treated independently without implying any false relationships between them.

Additionally, many data processing and machine learning libraries support one-
hot encoding. It fits smoothly into the data preprocessing workflow, making it
easier to prepare datasets for various machine learning algorithms.

Machine learning compatibility

Most machine learning algorithms require numerical input to perform their calculations. Categorical data needs to be transformed into a numerical format for these algorithms to use effectively. One-hot encoding provides a straightforward way to achieve this transformation, ensuring that categorical variables can be integrated into machine learning models.

Avoiding ordinality

Label encoding is another method to convert categorical data into numerical values by
assigning each category a unique number. However, this approach can create problems
because it might suggest an order or ranking among categories that doesn't actually exist.

For example:

Assigning 1 to Red, 2 to Green, and 3 to Blue could make the model think
that Green is greater than Red and Blue is greater than both. This
misunderstanding can negatively affect the model's performance.

One-hot encoding solves this problem by creating a separate binary column for
each category. This way, the model can see that each category is distinct and
unrelated to the others.

Label encoding is useful when the categorical data has an inherent ordinal
relationship, meaning the categories have a meaningful order or ranking. In
such cases, the numerical values assigned by label encoding can effectively
represent this order, making it a suitable choice.

Consider a dataset with a feature representing education levels. The categories


are:

 High School
 Bachelor's Degree
 Master's Degree
 PhD
These categories have a clear order, where PhD represents a higher level of
education than Master's Degree, which in turn is higher than Bachelor's
Degree, and so on. In this case, label encoding can effectively capture the
ordinal nature of the data:
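One possible mapping (the specific integers are just a convention):

High School        -> 0
Bachelor's Degree  -> 1
Master's Degree    -> 2
PhD                -> 3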

Implementing One-Hot Encoding in Python

Python offers powerful libraries like Pandas and Scikit-learn, which provide
convenient and efficient ways to perform one-hot encoding.

We'll start with Pandas' get_dummies() function, which is quick and easy for
straightforward encoding tasks. Then, we'll explore Scikit-
learn's OneHotEncoder, which offers more flexibility and control, particularly
useful for more complex encoding needs.

What is Scikit-learn?

Scikit-learn is an open-source Python library that implements a range of
machine learning, pre-processing, cross-validation, and visualization algorithms
through a unified interface. It provides a plethora of tools for various
machine-learning tasks such as Classification, Regression, Clustering, and many more.

In Pandas, the get_dummies() function converts categorical variables into
dummy/indicator variables.

Using Pandas get_dummies()


Pandas provides a very convenient function, get_dummies(), to create one-hot
encoded columns directly from a DataFrame.

import pandas as pd
# Sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Red']}
df = pd.DataFrame(data)

# Applying one-hot encoding

df_encoded = pd.get_dummies(df, dtype=int)

# Displaying the encoded DataFrame

print(df_encoded)

 First, we import the Pandas library. Then, we create a dictionary data with a
single key 'Color' and a list of color names as values, and convert this
dictionary into a Pandas DataFrame df with one Color column and four rows.

 We use the pd.get_dummies() function to apply one-hot encoding to the


DataFrame df. This function automatically detects the categorical
column(s) and creates new binary columns for each unique category.
The dtype=int argument ensures the encoding is done
with 1 and 0 instead of the default Booleans. The resulting
DataFrame df_encoded looks like this:
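The printed output should look roughly like this (get_dummies orders the new columns alphabetically):

   Color_Blue  Color_Green  Color_Red
0           0            0          1
1           0            1          0
2           1            0          0
3           0            0          1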

Using Scikit-learn's OneHotEncoder

For more flexibility and control over the encoding process, Scikit-learn offers
the OneHotEncoder class. This class provides advanced options, such as
handling unknown categories and fitting the encoder to the training data.

from sklearn.preprocessing import OneHotEncoder


import numpy as np
# Creating the encoder
enc = OneHotEncoder(handle_unknown='ignore')
# Sample data
X = [['Red'], ['Green'], ['Blue']]
# Fitting the encoder to the data
enc.fit(X)
# Transforming new data
result = enc.transform([['Red']]).toarray()
# Displaying the encoded result
print(result)

We import the OneHotEncoder class from sklearn.preprocessing, and we also


import numpy. After this, we create an instance of OneHotEncoder.
The handle_unknown='ignore' parameter tells the encoder to ignore unknown
categories (categories that were not seen during the fitting process) during the
transformation. We then create a list of lists X, where each inner list contains a
single color. This is the data we’ll use to fit the encoder.

We fit the encoder to the sample data X. During this step, the encoder learns
the unique categories in the data. We use the fitted encoder to transform new
data. In this case, we transform a single color, 'Red'. The .transform() method
returns a sparse matrix, which we convert to a dense array using
the .toarray() method.

The result [[0. 0. 1.]] reflects the alphabetically ordered categories learned by the
encoder ('Blue', 'Green', 'Red'): 'Red' is present (1) while 'Blue' and 'Green' are absent (0).

IX. Handling Categorical Data in Python


Categorical data is a set of predefined categories or groups an observation can
fall into. It can be found everywhere; for example, survey responses like marital
status, profession, educational qualifications, etc. However, certain problems can
arise with categorical data that must be dealt with before proceeding with any other
task, and a common first step is to handle the categorical columns of a DataFrame.
Categorical data can only take up a finite set of values. However, due to human
error while filling out a survey form, or for some other reason, bogus values may
appear in the dataset.
Once the bogus values are found, the corresponding rows can be dropped from the
dataset, or the values can be replaced with other values if enough information is
available. When no such information is available for a particular column, the
affected rows are simply dropped.

Importing Libraries
Python libraries make it very easy for us to handle categorical data in a
DataFrame and perform typical and complex tasks with a single line of code.
Pandas – This library helps to load the data frame in a 2D array format and has
multiple functions to perform analysis tasks in one go.
Numpy – Numpy arrays are very fast and can perform large computations in a
very short time.
Matplotlib/Seaborn – This library is used to draw visualizations.
Sklearn – This module contains multiple libraries having pre-implemented
functions to perform tasks from data preprocessing to model development and
evaluation.
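A typical set of imports for this workflow (a minimal sketch; the aliases are a common convention rather than a requirement):

import pandas as pd                  # load and manipulate DataFrames
import numpy as np                   # fast array computations
import matplotlib.pyplot as plt      # visualizations
import seaborn as sns                # statistical plots
from sklearn import preprocessing    # encoders, scalers and other preprocessing tools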
Categorical data is often represented using discrete values, such as integers or strings, and is
frequently encoded as one-hot vectors before being used as input to machine learning models.
One-hot encoding involves creating a binary vector for each category, where the vector has a 1
in the position corresponding to the category and 0s in all other positions.

Techniques for Handling Categorical Data

Handling categorical data is an important part of machine learning preprocessing, as many


algorithms require numerical input. Depending on the algorithm and the nature of the
categorical data, different encoding techniques may be used, such as label encoding, ordinal
encoding, or binary encoding.

1. One-Hot Encoding

One-hot encoding is a popular technique for handling categorical data in machine learning. It
involves creating a binary vector for each category, where each element of the vector
represents the presence or absence of the category. For example, if we have a categorical
variable for color with values red, blue, and green, one-hot encoding would create three binary
vectors: [1, 0, 0], [0, 1, 0], and [0, 0, 1] respectively.

2. Label Encoding

Label Encoding is another technique for handling categorical data in machine learning. It
involves assigning a unique numerical value to each category in a categorical variable, with the
order of the values based on the order of the categories.

For example, suppose we have a categorical variable "Size" with three categories: "small,"
"medium," and "large." Using label encoding, we would assign the values 0, 1, and 2 to these
categories, respectively. Label encoding can be useful when there is a natural ordering between
the categories, such as in the case of ordinal categorical variables. However, it should be used
with caution for nominal categorical variables because the numerical values may imply an order
that does not actually exist. In these cases, one-hot encoding is a safer option.
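A minimal label-encoding sketch in pandas for the Size example (the explicit integer mapping is one possible convention):

import pandas as pd

df = pd.DataFrame({'Size': ['small', 'large', 'medium', 'small']})
size_order = {'small': 0, 'medium': 1, 'large': 2}    # ordinal mapping chosen by hand
df['Size_encoded'] = df['Size'].map(size_order)
print(df)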

3. Frequency Encoding

Frequency Encoding is another technique for handling categorical data in machine learning. It
involves replacing each category in a categorical variable with its frequency (or count) in the
dataset. The idea behind frequency encoding is that categories that appear more frequently
may be more important or informative for the machine learning algorithm.

Frequency encoding can be a useful alternative to one-hot encoding or label encoding,


especially when dealing with high-cardinality categorical variables (i.e., variables with a large
number of categories). However, it may not always be effective, and its performance can
depend on the particular dataset and machine learning algorithm being used.
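A possible pandas sketch of frequency encoding (the City column is made up for illustration):

import pandas as pd

df = pd.DataFrame({'City': ['Pune', 'Mumbai', 'Pune', 'Delhi', 'Pune']})
freq = df['City'].value_counts()          # number of occurrences of each category
df['City_freq'] = df['City'].map(freq)    # replace each category with its count
print(df)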

4. Target Encoding

Target Encoding is another technique for handling categorical data in machine learning. It
involves replacing each category in a categorical variable with the mean (or other aggregation)
of the target variable (i.e., the variable you want to predict) for that category. The idea behind
target encoding is that it can capture the relationship between the categorical variable and
the target variable, and therefore improve the predictive performance of the machine learning
model.

Target encoding can be a powerful technique for improving the predictive performance of
machine learning models, especially for datasets with high-cardinality categorical variables.
However, it is important to avoid overfitting by using cross-validation and regularization
techniques.
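A minimal target-encoding sketch in pandas (column names are illustrative; in practice the category means should be computed with cross-validation to avoid leakage):

import pandas as pd

df = pd.DataFrame({
    'City':  ['Pune', 'Mumbai', 'Pune', 'Delhi', 'Mumbai'],
    'Price': [10, 20, 12, 8, 18]              # the target variable
})
means = df.groupby('City')['Price'].mean()    # mean target value per category
df['City_target_enc'] = df['City'].map(means)
print(df)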

X. Feature Extraction
What is Feature Extraction?
Feature extraction is a necessary step in machine learning and data analysis:
raw data must be selected and transformed into features that are better suited
for modeling.

Feature extraction is a machine learning technique that reduces the


number of resources required for processing while retaining significant or
relevant information.

Feature scaling standardizes and normalizes data, and feature selection chooses
the best subset of existing features, whereas feature extraction transforms the
original features into a new set of features.

When working with huge datasets, particularly in fields such as image
processing, natural language processing, and signal processing, it is
common to encounter data containing many characteristics, several of which
may be useless or redundant (data redundancy refers to the duplication of data in a
computer system). Feature extraction simplifies the data: the extracted features
capture the essential characteristics of the original data, allowing for more
efficient processing and analysis.

Why is Feature Extraction Important?


 Reduced Computation Cost: Real-world data is usually complex
and multi-faceted. Feature extraction lets us focus on just the vital
information in the sea of raw data. It therefore simplifies the data,
making it easier for machines to handle and process.
 Improved Model Performance: Extracting and choosing key
characteristics can reveal information about the underlying processes
that generated the data, thereby increasing the accuracy of the model.
 Better Insights: Algorithms generally perform better with fewer features,
because noise and extraneous information are eliminated, enabling the
algorithm to concentrate on the data's most significant features.

 Overfitting Prevention: When models have too many features, they can
overfit the training data (i.e., fail to make accurate predictions on unseen
testing data), which means they won't generalize well to new, unknown data.
Feature extraction helps prevent this by simplifying the model.
Different types of Techniques for Feature Extraction
Various techniques exist to extract meaningful features from different
types of data:
1. Statistical Methods
Statistical methods are widely used in feature extraction to
summarize and explain patterns in data. Common statistical features
include:
 Mean: The average value of a dataset.
 Median: The middle value of the data when it is sorted in ascending
order.
 Standard Deviation: A measure of the spread or dispersion of a
sample.
 Correlation and Covariance: Measures of the linear relationship
between two or more variables.
 Regression Analysis: A way to model the relationship between a dependent
variable and one or more independent variables.
These statistical methods can be used to represent the central tendency,
spread, and relationships within a dataset.
2. Dimensionality Reduction Methods for feature extraction
Dimensionality reduction is an essential stage in machine learning for
feature extraction because it reduces the complexity of high-dimensional
data, enhances model interpretability, and prevents the curse of
dimensionality. Dimensionality reduction approaches include Principal
Component Analysis (PCA), Linear Discriminant Analysis (LDA), and t-
SNE.
 Principal Component Analysis (PCA): PCA is a prevalent (widespread)
dimensionality reduction approach that converts high-dimensional data
into a lower-dimensional space by constructing a small set of new
variables (principal components) that account for the majority of the
variation in the data. Since it is an unsupervised method, class labels
(the categories a model predicts for input data) are not taken into
consideration. A short code sketch of PCA follows this list.
 Linear Discriminant Analysis (LDA): LDA is a technique for
identifying the linear combinations of characteristics that best
distinguish two or more classes of objects or events. LDA is similar to
PCA but is supervised, meaning it takes into account class labels.
 Autoencoders: An autoencoder is a neural network that consists of
two parts: an encoder and a decoder. The encoder maps the input data
to a lower-dimensional version, known as the latent space, and the
decoder maps the latent space back to the original input


space. Autoencoders can be used for dimensionality reduction by
training the network to reconstruct the input data from the lower-
dimensional representation. The latent space learned by the autoencoder can
then be used as a dimensionality-reduced version of the original input data,
which can serve as input to other machine learning models.
 t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is
a non-linear approach for reducing dimensionality that retains the
data's local structure. It effectively embeds high-dimensional data
into a two or three-dimensional space that may be seen in a scatter
plot. It functions notably well for datasets with complicated
structures.
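As mentioned in the PCA item above, here is a minimal PCA sketch using Scikit-learn (the toy matrix X and the choice of two components are purely illustrative):

import numpy as np
from sklearn.decomposition import PCA

# A small illustrative dataset: 4 samples, 3 features
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.8]])

pca = PCA(n_components=2)               # keep the two directions with the most variance
X_reduced = pca.fit_transform(X)        # project the data onto those directions

print(X_reduced.shape)                  # (4, 2)
print(pca.explained_variance_ratio_)    # fraction of variance captured by each component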
3. Feature Extraction Methods for Textual Data
Feature extraction for textual data converts unstructured text into a numerical
format that can be handled by machine learning algorithms. Such methods are
important for natural language processing (NLP) tasks; one common method,
illustrated with a short sketch after this list, is:
1. Bag of Words (BoW): The Bag of Words (BoW) model is a basic way
for text modeling and feature extraction in NLP. It shows a written
document as a multiset of its words, ignoring structure and word order,
but keeping the frequency of words. This model is useful for tasks such
as text classification, document matching, and text grouping. The BoW
model is used in document classification, where each word is used as
a feature for training the classifier.
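A brief Bag of Words sketch using Scikit-learn's CountVectorizer (the two example sentences are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)        # sparse document-term count matrix

print(vectorizer.get_feature_names_out())   # vocabulary learned from the documents
print(bow.toarray())                        # word counts for each document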
Choosing the Right Method
There is no one-size-fits-all approach to feature extraction. The proper
approach must be chosen carefully, which often requires domain expertise,
and two trade-offs should be kept in mind:
 Information Loss: During the feature extraction process, there is
always the possibility of losing essential data.
 Computational Complexity: Some feature extraction approaches may
be computationally costly, particularly for big datasets.

Applications of Feature Extraction


Feature extraction finds applications across various fields where data
analysis is performed. Here are some common applications:
1. Image Processing and Computer Vision:
 Object Recognition: Extracting features from images to recognize
objects or patterns within them.
 Facial Recognition: Identifying faces in images or videos by
extracting facial features.
 Image Classification: Using extracted features for categorizing
images into different classes or groups.
2. Natural Language Processing (NLP):
 Text Classification: Extracting features from textual data to classify


documents or texts into categories.
 Sentiment Analysis: Identifying sentiment or emotions expressed
in text by extracting relevant features.
3. Speech Recognition: Identifying relevant features from speech
signals for recognizing spoken words or phrases.
4. Biomedical Engineering:
 Medical Image Analysis: Extracting features from medical images
(like MRI or CT scans) to assist in diagnosis or medical research.
 Biological Signal Processing: Analyzing biological signals (such
as EEG or ECG) by extracting relevant features for medical
diagnosis or monitoring.
5. Machine Condition Monitoring: Extracting features from sensor data
to monitor the condition of machines and predict failures before they
occur.
Tools and Libraries for Feature Extraction
There are several tools and libraries available for feature extraction across
different domains. Here's a list of some popular ones:
1. Scikit-learn: This Python library provides a wide range of tools for
machine learning, including feature extraction techniques such as
Principal Component Analysis (PCA), Independent Component
Analysis (ICA), and various other preprocessing methods.
2. TensorFlow / Keras: These deep learning libraries in Python
provide APIs for building and training neural networks, which can be
used for feature extraction from image, text, and other types of data.
3. PyTorch: Similar to TensorFlow, PyTorch is another deep learning
library with support for building custom neural network architectures
for feature extraction and other tasks.
4. NLTK (Natural Language Toolkit): NLTK is a Python library for
NLP tasks, offering tools for feature extraction from text data, such
as bag-of-words representations, TF-IDF vectors, and word
embeddings.
5. Gensim: Another Python library for NLP, Gensim provides tools for
topic modeling and document similarity, which involve feature
extraction from text data.
6. MATLAB: MATLAB provides numerous built-in functions and
toolboxes for signal processing, image processing, and other data
analysis tasks, including feature extraction techniques like wavelet
transforms, Fourier transforms, and image processing filters.

Benefits of Feature Extraction


Feature extraction provides a powerful toolbox for data analysis and
machine learning.

 Reduced Data Complexity (Dimensionality Reduction): Imagine a really
large, messy room (multidimensional data) full of all the information we
need. Feature extraction acts like a smart organizer that carefully arranges
the contents into a neat space, keeping only the needed items (relevant
features). This simplifies the data, making it easier to process and to
visualize.
 Improved Machine Learning Performance (Better Algorithms)
 Simplified Data Analysis (Focusing on What Matters)

Challenges in Feature Extraction


 Handling High-Dimensional Data
 Overfitting and Underfitting
 Computational Complexity
 Feature Redundancy and Irrelevance

Feature Selection vs. Feature Extraction


 Definition: Feature selection selects a subset of relevant features from the original set,
while feature extraction transforms the original features into a new set of features.
 Purpose: Feature selection aims to reduce dimensionality; feature extraction aims to
transform data into a more manageable or informative representation.
 Processes: Feature selection uses filter, wrapper, and embedded methods; feature
extraction uses signal processing, statistical techniques, and transformation algorithms.
 Output: Feature selection yields a subset of the selected features; feature extraction
yields a new set of transformed features.
 Drawbacks: Feature selection may discard less relevant features; feature extraction may
lose the interpretability of the original features.
 Computational cost: Generally lower for feature selection; may be higher for feature
extraction, especially for complex transformations.
 Examples: Feature selection - forward selection, backward elimination; feature
extraction - Principal Component Analysis (PCA), Singular Value Decomposition (SVD),
Autoencoders.