Machine Learning Notes
UNIT - I
I. Introduction to Machine Learning
What is Machine Learning?
Machine learning (ML) is a branch of artificial intelligence and computer science that
focuses on using data and algorithms to enable AI to imitate the way that humans learn,
gradually improving its accuracy.
In other words, machine learning is focused on building computer systems that learn from
data.
ML algorithms are trained to find relationships and patterns in data. Using historical data as
input, these algorithms can make predictions, classify information, cluster data points,
reduce dimensionality and even generate new content. Examples include generative AI
systems such as OpenAI's ChatGPT.
The first definition of Machine Learning was given by Arthur Samuel back in 1959. He is
the person who coined the term Machine Learning and defined it as,
“Machine Learning is the field of study that gives computers the ability to learn without
being explicitly programmed.”
OR
“Machine learning is a subfield of artificial intelligence that uses algorithms trained on
data sets to create models that enable machines to perform tasks that would otherwise
only be possible for humans, such as categorizing images, analyzing data, or predicting
price fluctuations.”
Machine Learning enables computers to learn from data and make predictions or decisions
without explicit programming. The process involves several key steps:
1. Data Collection: The first step in Machine Learning is gathering relevant data
representing the problem or task at hand. This data can be collected from various
sources such as databases, sensors, or online platforms.
2. Data Preprocessing: Once the data is collected, it needs to be pre-processed to
ensure its quality and suitability for training the model. This involves cleaning the
data, handling missing values, and normalizing or transforming the data to a
consistent format.
3. Feature Extraction and Selection: The collected data may contain many features or
attributes. Feature extraction and selection involve identifying the most informative
and relevant features contributing to the learning task. This helps reduce the data's
dimensionality and improves the efficiency and effectiveness of the learning process.
4. Model Training: The training phase involves feeding the pre-processed data into a
Machine Learning algorithm or model. The model learns from the data by adjusting
its internal parameters based on the patterns and relationships it discovers. This is
done through iterative optimization processes, such as gradient descent or
backpropagation, depending on the specific algorithm used.
5. Model Evaluation: The model must be evaluated to assess its performance and
generalization ability after training it. This is typically done using a separate data set
called the test set, which was not used during training. Common evaluation metrics
include accuracy, precision, recall, and F1 score, depending on the nature of the
learning task.
6. Prediction or Decision Making: Once the model is trained and evaluated, it can
predict or decide on new, unseen data. The model takes input features and applies
the learned patterns to generate the desired output or prediction.
7. Model Refinement and Iteration: ML is an iterative process that involves refining
the model based on feedback and new data. If the model's performance is
unsatisfactory, we can make adjustments by retraining the model with additional
data, changing the algorithm, or tuning the model's parameters (a minimal code
sketch of this workflow follows below).
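To make these steps concrete, here is a minimal sketch of the workflow using scikit-learn.
The Iris dataset, logistic regression, and the variable names are illustrative assumptions,
not part of the process description above:

# Minimal sketch of the ML process: data, preprocessing, training,
# evaluation, and prediction (illustrative choices throughout).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                    # step 1: collect data
X_train, X_test, y_train, y_test = train_test_split( # hold out a test set
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)               # step 2: preprocessing
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000)            # step 4: training
model.fit(X_train, y_train)

y_pred = model.predict(X_test)                       # step 6: prediction
print("Accuracy:", accuracy_score(y_test, y_pred))   # step 5: evaluation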
Here are examples of machine learning at work in our daily life that provide value in many
ways—some large and some small.
1. Product recommendations
Do you wonder how Amazon or other retailers frequently know what you might like to
purchase? Or, when they get it wildly wrong, how they came up with the recommendation?
Thank machine learning. Targeted marketing in retail uses machine learning to group
customers based on buying habits or demographic similarities, and to extrapolate what one
person may want from someone else's purchases. While some suggested purchase pairings
are obvious, machine learning can get surprisingly accurate by finding hidden relationships
in data and predicting what you want before you know you want it. If the data is
incomplete, you may sometimes end up with an off-base (not relatable) recommendation,
but don't worry, because not buying it is just another data point to learn from.
2. Email automation and spam filtering
While your inbox seems relatively boring, machine learning influences its function behind
the scenes. Email automation is a direct result of successful machine learning, and the one
function that goes most unnoticed is spam filtering. Successful spam filtering adapts and
finds patterns in email content that is undesirable. This includes data from email domains, a
sender's physical location, message text and structure, and IP addresses. It also requires
help from users as they mark emails that were mistakenly filed. With each marked email, a
new data reference is added that helps with future accuracy, reducing junk mail and keeping
inbox spam to a minimum. One of the primary methods for spam mail detection is email
filtering, which involves categorizing incoming emails into spam and non-spam. Machine
learning algorithms can be trained to filter out spam mails based on their content and
metadata.
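As a hedged illustration of this idea, the sketch below trains a tiny Naive Bayes spam
classifier; the example messages, labels, and variable names are all invented for
demonstration only:

# Toy spam filter: bag-of-words features + Multinomial Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["win a free prize now", "meeting at 10am tomorrow",
            "free offer click now", "project report attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)      # email text -> word-count features

classifier = MultinomialNB().fit(X, labels)
new_mail = vectorizer.transform(["free prize inside"])
print(classifier.predict(new_mail))         # expected: [1] (spam)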
3. Social media optimization
Machine learning has become helpful in fighting inappropriate content and cyberbullying,
which pose a risk of platforms losing users and weakening brand loyalty. With the help of
machine learning, social media platforms can provide a better user experience, use data to
forecast future states, and make more accurate predictions. According to data, 82% of X
(Twitter) users view videos, and 90% do so on a handheld device. As a result, X bought
Magic Pony Technology, a London-based technology company that has created machine
learning methods for visual augmentation, to further enhance the visual experience.
Machine learning enables social networking giants to promote their goods to niche
audiences by analyzing user data such as demographics, interests, and preferences. This
data is then used to create targeted ads that are shown to specific groups of users,
increasing the chances that they will be interested in the promoted goods. ML algorithms
can also analyze user behavior and predict which products or services users are most likely
to be interested in, allowing for even more precise targeting. Machine learning algorithms
can help protect social media platforms by detecting and flagging potentially harmful or
inappropriate content before it spreads. This not only helps prevent the spread of harmful
content but also helps maintain the platform's reputation by promoting a safe and
welcoming online community. Pinterest uses machine learning to ensure data security; with
ML, the business can identify spam users and content, promote content, and gauge
(estimate) the possibility that a user will pin it. Another example of a training algorithm in
ML is the "people you may know" feature on social media platforms like LinkedIn,
Instagram, Facebook, and X (formerly known as Twitter). Based on your contacts,
comments, likes, or existing connections, the algorithm suggests familiar faces from your
real-life network that you might want to connect with or follow.
4. Healthcare advancement
For the healthcare industry, machine learning algorithms are particularly valuable because
they can help us make sense of the massive amounts of healthcare data generated every
day within electronic health records. Machine learning algorithms can help us find patterns
and insights in medical data that would be impossible to find manually. That means
healthcare information for clinicians can be enhanced with analytics and machine learning
to gain insights that support better planning and patient care, improved diagnoses, and
lower treatment costs. Healthcare brands such as Pfizer and Providence have begun to
benefit from analytics enhanced by human and artificial intelligence. Some processes are
especially well suited to machine learning, such as its integration with radiology, cardiology,
and pathology. As an example, wearables generate mass amounts of data on the wearer's
health, and many use AI and machine learning to alert wearers or their doctors of issues, to
support preventative measures and to respond to emergencies. Wearables are electronic
technology or devices incorporated into items that can be comfortably worn on the body.
These
wearable devices are used for tracking information on a real-time basis. They have motion
sensors that take a snapshot of your day-to-day activity and sync it with mobile devices or
laptop computers. Also, the most common use case for machine learning in healthcare
among healthcare professionals is automating medical billing. At MD Anderson Cancer
Center (located in Houston, Texas), data scientists have developed the first deep learning
algorithm in healthcare using machine learning to predict acute toxicities in patients
receiving radiation therapy for head and neck cancers. Algorithms can analyze retinal
images to detect diabetic retinopathy, predict cardiovascular risks from electronic health
records, or assist in the early detection of cancerous tumors through imaging. These
machine learning in healthcare examples highlight the technology's potential to augment
(increase) the capabilities of medical professionals, rather than replace them.
5. Predictive analytics
Predictive analytics is an area of advanced analytics that uses data to make predictions
about the future. Techniques such as data mining, statistics, machine learning and artificial
intelligence are used to analyze current and historical data for any patterns or anomalies
that can help identify risks and opportunities, minimize the chance of human error, and
increase the speed and thoroughness of analysis. With closer investigation of what
happened and what could happen using data, people and organizations are becoming more
proactive and forward-looking. Florida International University is one example. By
integrating predictive models with data analysis from Tableau, they're communicating
critical insights about academic performance before students are at risk and supporting
their individual needs to help them successfully complete all courses and graduate.
6. Image recognition
Image recognition is another machine learning technique that appears in our day-to-day life.
With the use of ML, programs can identify an object or person in an image based on the
intensity of the pixels. This type of facial recognition is used for password protection
methods like Face ID and in law enforcement. By filtering through a database of people to
identify commonalities and matching them to faces, police officers and investigators can
narrow down a list of crime suspects.
7. Virtual personal assistants
Virtual personal assistants are devices you might have in your own home, such as Amazon's
Alexa, Google Home, or the Apple iPhone's Siri. These devices use a combination of speech
recognition technology and machine learning to capture data on what you're requesting
and how often the device is accurate in its delivery. They detect when you start speaking,
recognize what you're saying, and deliver on the command. For example, when you say,
"Siri, what is the weather like today?", Siri searches the web for weather forecasts in your
location and provides detailed information.
8. Fraud detection
Predictive analytics can help determine whether a credit card transaction is fraudulent or
legitimate. Fraud examiners use AI and machine learning to monitor variables involved in
past fraud events. They use these training examples to measure the likelihood that a
specific event was fraudulent activity.
9. Traffic predictions
When you use Google Maps to map your commute to work or to a new restaurant in town,
it provides an estimated time of arrival. Google uses machine learning to build models of
how long trips will take based on historical traffic data (gleaned from satellites). It then
combines that data with your current trip and traffic levels to predict the best route
according to these factors.
Supervised Learning
Let's understand supervised learning with an example. Suppose we have an input dataset of
cat and dog images. First, we provide training to the machine to understand the images: the
shape and size of the tail of a cat and a dog, the shape of the eyes, colour, height (dogs are
taller, cats are smaller), etc. After training is complete, we input a picture of a cat and ask
the machine to identify the object and predict the output. Since the machine is now well
trained, it will check all the features of the object, such as height, shape, colour, eyes, ears,
tail, etc., and find that it's a cat, so it will put it in the Cat category. This is the process of
how the machine identifies objects in Supervised Learning.
The main goal of the supervised learning technique is to map the input variable(x) with
the output variable(y). Some real-world applications of supervised learning are Risk
Assessment, Fraud Detection, Spam filtering, etc.
Election polls are another way to see this. Polls collected as feedback from different groups
of people (the old-aged, youngsters, etc.) about which party is expected to get how many
seats, which party is winning and which is losing, all act as training data.
When the actual counting starts, it acts as testing data: based on the new input data, the
model gives out results. If the testing data (new input) matches the training data, it means
the model has learned well.
a) Classification
Classification algorithms are used to solve the classification problems in which the output
variable is categorical, such as "Yes" or No, Male or Female, Red or Blue, etc. The
classification algorithms predict the categories present in the dataset. Some real-world
examples of classification algorithms are Spam Detection, Email filtering, etc.
b) Regression
Regression algorithms are used to solve regression problems in which there is a relationship
between the input variables and a continuous output variable. These are used to predict
continuous output values, such as market trends, weather prediction, etc.
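A short sketch may help contrast the two: the first model below predicts a category, the
second a continuous value. The toy data is invented for illustration:

# Classification predicts a category; regression predicts a continuous value.
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: 0 = not spam, 1 = spam, from a count of suspicious words
X_cls = [[0], [1], [5], [7]]
y_cls = [0, 0, 1, 1]
clf = LogisticRegression().fit(X_cls, y_cls)
print(clf.predict([[6]]))       # -> [1], a discrete class label

# Regression: predict a price from size in square feet
X_reg = [[500], [1000], [1500], [2000]]
y_reg = [100.0, 180.0, 260.0, 340.0]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[1200]]))    # -> about 212.0, a continuous value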
Advantages:
o Since supervised learning works with labelled datasets, we can have an exact
idea about the classes of objects.
o These algorithms are helpful in predicting the output on the basis of prior
experience.
Disadvantages:
o Supervised models are not well suited to very complex tasks, may predict the
wrong output if the test data differs from the training data, and can require
substantial computation time to train.
Applications of Supervised Learning:
o Image Segmentation:
Supervised Learning algorithms are used in image segmentation. In this process,
image classification is performed on different image data with pre-defined labels.
o Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis purposes. This
is done by using medical images and past data labelled with disease conditions.
With such a process, the machine can identify a disease for new patients.
o Fraud Detection:
Supervised Learning classification algorithms are used for identifying fraud
transactions, fraud customers, etc. It is done by using historic data to identify the
patterns that can lead to possible fraud.
o Spam detection:
In spam detection & filtering, classification algorithms are used. These algorithms
classify an email as spam or not spam. The spam emails are sent to the spam folder.
o Speech Recognition:
Supervised learning algorithms are also used in speech recognition. The algorithm is
trained with voice data, and various identifications can be done using the same, such
as voice-activated passwords, voice commands, etc.
Unsupervised Learning
In unsupervised learning, the models are trained with data that is neither classified nor
labelled, and the model acts on that data without any supervision.
The main aim of an unsupervised learning algorithm is to group or categorize the
unsorted dataset according to similarities, patterns, and differences. Machines are
instructed to find the hidden patterns in the input dataset.
Let's take an example to understand it more precisely: suppose there is a basket of fruit
images, and we input it into the machine learning model. The images are totally unknown to
the model, and the task of the machine is to find the patterns and categories of the objects.
The machine will discover patterns and differences on its own, such as differences in colour
and shape, and predict the output when it is tested with the test dataset.
1) Clustering
The clustering technique is used when we want to find the inherent groups from the data. It
is a way to group the objects into a cluster such that the objects with the most similarities
remain in one group and have fewer or no similarities with the objects of other groups. An
example of the clustering algorithm is grouping the customers by their purchasing
behaviour.
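A minimal sketch of this customer-grouping example with k-means clustering; the two
features (annual spend, visits per month) and k = 2 are assumptions for illustration:

# Group customers by purchasing behaviour with k-means.
import numpy as np
from sklearn.cluster import KMeans

customers = np.array([[200, 2], [220, 3], [210, 2],       # low spenders
                      [950, 12], [1000, 14], [980, 11]])  # high spenders

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)   # two clusters, e.g. [0 0 0 1 1 1]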
2) Association
Association rule learning is an unsupervised learning technique, which finds interesting
relations among variables within a large dataset. The main aim of this learning algorithm is
to find the dependency of one data item on another data item and map those variables
accordingly so that it can generate maximum profit. This algorithm is mainly applied
in Market Basket analysis, continuous production, etc.
Some popular algorithms of Association rule learning are Apriori Algorithm, etc.
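A brief sketch of the Apriori algorithm on an invented market-basket table. It assumes the
third-party mlxtend library (pip install mlxtend), which is not mentioned in the notes
themselves:

# Frequent itemsets with Apriori (mlxtend assumed installed).
import pandas as pd
from mlxtend.frequent_patterns import apriori

# One-hot table: each row is a transaction, each column an item.
transactions = pd.DataFrame({
    'bread':  [1, 1, 0, 1],
    'butter': [1, 1, 0, 0],
    'milk':   [0, 1, 1, 1],
}).astype(bool)

frequent = apriori(transactions, min_support=0.5, use_colnames=True)
print(frequent)   # itemsets appearing in at least 50% of transactions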
Advantages:
o These algorithms can be used for complicated tasks compared to the supervised
ones because these algorithms work on an unlabeled dataset.
o Unsupervised algorithms are preferable for various tasks because obtaining an
unlabeled dataset is easier than obtaining a labelled dataset.
Disadvantages:
o The output of an unsupervised algorithm can be less accurate, as the dataset is not
labelled and the algorithms are not trained with the exact output in advance.
o Working with unsupervised learning is more difficult, as it works with unlabelled
datasets that do not map to an output.
Applications of Unsupervised Learning:
Network Analysis
Unsupervised learning is used in the network analysis of text data, for example to identify
plagiarism and copyright issues in scholarly articles.
Recommendation Systems
Recommendation systems widely use unsupervised learning techniques for building
recommendation applications for different web applications and e-commerce websites.
Anomaly Detection
Anomaly detection is a popular application of unsupervised learning, which can identify
unusual data points within the dataset. It is used to discover fraudulent transactions.
Semi-Supervised Learning
Semi-Supervised learning is a type of Machine Learning algorithm that lies between
Supervised and Unsupervised machine learning.
We can understand these algorithms with an analogy. Supervised learning is where a
student is under the supervision of an instructor at home and college. If that student is
self-analysing the same concept without any help from the instructor, it falls under
unsupervised learning. Under semi-supervised learning, the student revises on his own
after analysing the concept under the guidance of an instructor at college.
Reinforcement Learning
Reinforcement learning works on a feedback-based process, in which an AI agent (a
software component) automatically explores its surroundings by hit and trial: taking
actions, learning from experience, and improving its performance. The agent gets rewarded
for each good action and punished for each bad action; hence the goal of a reinforcement
learning agent is to maximize the rewards.
In reinforcement learning, there is no labelled data like supervised learning, and agents
learn from their experiences only.
The reinforcement learning process is similar to that of a human being; for example, a child
learns various things by experience in day-to-day life. An example of reinforcement learning
is playing a game, where the game is the environment, the moves of the agent at each step
define states, and the goal of the agent is to get a high score. The agent receives feedback
in terms of punishments and rewards.
Due to its way of working, reinforcement learning is employed in different fields such as
game theory and operations research.
o Video Games:
RL algorithms are very popular in gaming applications and are used to achieve
super-human performance. A popular game that uses RL algorithms is Minecraft.
o Robotics:
RL is widely being used in Robotics applications. Robots are used in the industrial and
manufacturing area, and these robots are made more powerful with reinforcement
learning. There are different industries that have their vision of building intelligent
robots using AI and Machine learning technology.
Applications of Machine Learning in Data Science
Real-Time Navigation: Google Maps is one of the most commonly used real-time
navigation applications. But have you ever wondered why, despite the usual traffic,
you are on the fastest route? It is because of the data received from people
currently using the service and a database of historical traffic data. Everyone who
uses this service contributes to making the application more accurate. When you
open the application, it constantly sends data back to Google, providing information
about the route being traveled and traffic patterns at any given time of the day.
Image Recognition: Image Recognition is one of the most common applications of
Machine Learning in Data Science. Image Recognition is used to identify objects,
persons, places, etc.
Product Recommendation: Product Recommendation is profoundly used by
eCommerce and Entertainment companies like Amazon, Netflix, Hotstar, etc. They
use various Machine Learning algorithms on the data collected from you to
recommend products or services that you might be interested in.
Speech Recognition: Speech Recognition is a process of translating spoken
utterances into text. This text can be in terms of words, syllables, sub-word units, or
even characters. Some of the well-known examples are Siri, Google Assistant,
Youtube Closed Captioning, etc.
Data Pre-processing
Data pre-processing is a process of preparing the raw data and making it suitable for a
machine learning model. It is the first and crucial step while creating a machine learning
model.
When creating a machine learning project, we do not always come across clean and
formatted data. And before doing any operation with data, it is mandatory to clean it and
put it in a formatted way. For this, we use the data pre-processing task.
Real-world data generally contains noise and missing values, and may be in an unusable
format that cannot be directly used for machine learning models. Data pre-processing is
the required task of cleaning the data and making it suitable for a machine learning model,
which also increases the accuracy and efficiency of the model.
1) Get the Dataset
Datasets may come in different formats for different purposes; for example, the dataset for
a machine learning model built for a business purpose will be different from the dataset
required for a liver-patient analysis. So each dataset is different from another dataset. To
use a dataset in our code, we usually put it into a CSV file. However, sometimes we may
also need to use an HTML or xlsx file.
2) Importing Libraries
In order to perform data pre-processing using Python, we need to import some predefined
Python libraries. These libraries are used to perform some specific jobs. There are three
specific libraries that we will use for data pre-processing, which are:
Numpy: The Numpy Python library is used for including any type of mathematical operation
in the code. It is the fundamental package for scientific computation in Python, and it adds
support for large, multidimensional arrays and matrices. In Python, we can import it as:
import numpy as nm
Here we have used nm, which is a short name for Numpy, and it will be used in the whole
program.
Matplotlib: The second library is matplotlib, which is a Python 2D plotting library; with this
library, we need to import its sub-library pyplot. This library is used to plot any type of
chart in Python. It will be imported as below (the mpt alias mirrors the nm alias used
above):
import matplotlib.pyplot as mpt
Pandas: The last library is the Pandas library, which is one of the most famous Python
libraries and is used for importing and managing datasets. It is an open-source data
manipulation and analysis library. It will be imported as below:
import pandas as pd
3) Importing the Dataset
read_csv() function:
Now to import the dataset, we will use the read_csv() function of the pandas library, which
is used to read a csv file and perform various operations on it. Using this function, we can
read a csv file locally as well as through a URL. We can use the read_csv function as below:
data_set= pd.read_csv('Dataset.csv')
Here, data_set is the name of the variable that stores our dataset, and inside the function
we have passed the name of our dataset file. Once we execute the above line of code, it will
successfully import the dataset into our code. We can also check the imported dataset by
clicking on the variable explorer section and then double-clicking on data_set.
If you are using the Python language for machine learning, then this extraction step is
mandatory, but for the R language it is not required.
Extracting dependent and independent variables:
To extract the independent variables, we will use the iloc[ ] method of the Pandas library.
It is used to extract the required rows and columns from the dataset.
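Assuming the dataset's last column holds the dependent variable (a common convention,
not something these notes specify), a typical extraction looks like:

# x: independent variables (all rows, every column except the last)
x = data_set.iloc[:, :-1].values
# y: dependent variable (all rows, only the last column)
y = data_set.iloc[:, -1].values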
There are mainly two ways to handle missing data:
By deleting the particular row: This is the way null values are commonly dealt with: we just
delete the specific row or column that contains null values. But this way is not very
efficient, and removing data may lead to a loss of information that prevents accurate
output.
By calculating the mean: In this way, we calculate the mean of the column or row that
contains a missing value and put it in place of the missing value. This strategy is useful for
features that have numeric data, such as age, salary, year, etc. Here, we will use this
approach.
To handle missing values, we will use the Scikit-learn library in our code, which contains
various classes for building machine learning models. Here we will use the SimpleImputer
class of the sklearn.impute module (in older versions of Scikit-learn this was the Imputer
class of sklearn.preprocessing, which has since been removed). Below is the code for it:
#handling missing data (Replacing missing data with the mean value)
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
#Fitting the imputer object to the independent variables x (columns 1 and 2)
imputer = imputer.fit(x[:, 1:3])
#Replacing missing data with the calculated mean value
x[:, 1:3] = imputer.transform(x[:, 1:3])
Feature Engineering
What is a Feature?
A feature is an individual measurable property or attribute of the data being analyzed; in
tabular data, the columns of the dataset are its features.
For example, in a dataset of housing prices, features could include the number of bedrooms,
the square footage, the location, and the age of the property. In a dataset of customer
demographics, features could include age, gender, income level, and occupation.
The choice and quality of features are critical in machine learning, as they can greatly impact
the accuracy and performance of the model.
What is Feature Engineering?
Feature engineering is the process of transforming raw data into features that are suitable
for machine learning models. In other words, it is the process of selecting, extracting, and
transforming the most relevant features from the available data to build more accurate and
efficient machine learning models.
The success of machine learning models heavily depends on the quality of the features used
to train them. Feature engineering involves a set of techniques that enable us to create new
features by combining or transforming the existing ones. These techniques help to highlight
the most important patterns and relationships in the data, which in turn helps the machine
learning model to learn from the data more effectively.
Increase Revenue: Features can also be engineered to generate more revenue. For
example, a new feature that streamlines the checkout process can increase sales, or
a feature that provides additional functionality could lead to more upsells or cross-
sells.
Handling Missing Values
The first step is to do data pre-processing before looking for any insights from the data;
only then can we train our machine learning model, because unclean data cannot be
processed by most machine learning algorithms. Missing values are a very common
phenomenon in real-life datasets, yet a big problem in practice. Missing data is also
referred to as NA (Not Available) values in pandas. Many datasets simply arrive with
missing data in the DataFrame, either because the data exists and was not collected or
because it never existed.
To solve all these problems, we have various methods to handle missing data. They are as
follows:
Replacing with an arbitrary value
No doubt this is one of the quickest techniques one can use to deal with missing values,
but this approach is not recommended.
Ex. In the code below, we replace the missing values with '0'. You can also replace the
missing values of any particular column with some other arbitrary value.
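The code block referred to above appears to be missing; the following is a minimal pandas
sketch with invented columns:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [22, np.nan, 35], 'Salary': [50000, 62000, np.nan]})
df_zero = df.fillna(0)                  # replace every NaN with 0
df['Salary'] = df['Salary'].fillna(-1)  # or one column, with another value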
Replacing with previous value – Forward fill
We can impute the missing values with the previous value by using forward fill. It is
mostly used in time series data.
Syntax: df.fillna(method='ffill')
Replacing with next value – Backward fill
In backward fill, the missing value is imputed using the next value. It is mostly used in
time series data.
Syntax: df.fillna(method='bfill')
3. Interpolation
Missing values can also be imputed using 'interpolation'. The Pandas interpolate method
can be used to replace missing values with different interpolation methods. Interpolation
is in most cases considered the best technique to fill missing values.
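As a small illustration (the series values are made up), linear interpolation fills a NaN with
a value between its neighbours:

import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0])
print(s.interpolate(method='linear'))   # the NaN becomes 20.0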
(Refer to your ML Practical No. 2 for a brief overview for the code)
We have taken the dataset titanic.csv, which is freely available at kaggle.com. This dataset
was chosen because it has missing values.
Rows in which all the values are NaN will be deleted; but if a row has even one non-null
value, it will not be dropped.
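This behaviour corresponds to pandas' dropna(how='all'); a minimal sketch with invented
data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, np.nan], 'B': [2, 5, np.nan]})
print(df.dropna(how='all'))   # only the last row (all NaN) is dropped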
Feature Scaling is a critical step in building accurate and effective machine learning models.
One key aspect of feature engineering is scaling, normalization, and standardization, which
involves transforming the data to make it more suitable for modeling. These techniques can
help to improve model performance, reduce the impact of outliers, and ensure that the data
is on the same scale.
In other words, Feature scaling is a preprocessing technique that transforms feature values
to a similar scale, ensuring all features contribute equally to the model.
It’s essential for datasets with features of varying ranges, units, or magnitudes. Common
techniques include standardization, normalization, and min-max scaling. This process
improves model performance, convergence, and prevents bias from features with larger
values.
This process is important because it ensures that all features contribute equally to a
machine learning model, and that the model doesn't give more importance to a feature
based on its value range.
Original Data:
o Person A: 5 feet 10 inches (70 inches)
o Person B: 6 feet 2 inches (74 inches)
o Person C: 5 feet 3 inches (63 inches)
Problem: The raw numbers (inches) might not be the most intuitive for comparison,
especially if you're dealing with many different measurements (weight, age, etc.).
Scaled Heights (min-max scaling, using (x - min) / (max - min)):
o Person A: (70 - 63) / (74 - 63) = 7/11 ≈ 0.64
o Person B: (74 - 63) / (74 - 63) = 11/11 = 1
o Person C: (63 - 63) / (74 - 63) = 0/11 = 0
Benefits of feature scaling:
Faster Convergence: Many algorithms (like gradient descent) converge faster when
features are on a similar scale.
Improved Accuracy: Some algorithms are sensitive to feature scales, and scaling can
significantly improve their performance.
Better Visualization: Scaled data is often easier to visualize and interpret.
In Summary
Feature scaling is like standardizing measurements to a common unit. It makes your data
more consistent and can lead to better results in your machine learning models.
1. Normalization: This method scales each feature so that all values are within the range
of 0 and 1. It achieves this by subtracting the minimum value of the feature and dividing
by the range (maximum minus minimum):
x_norm = (x - min(x)) / (max(x) - min(x))
This process ensures that all features contribute equally to the analysis, preventing those
with larger ranges from dominating.
2. Standardization: This method transforms each feature so that it has a mean of 0 and a
standard deviation of 1. This is achieved by subtracting the mean value and dividing by the
standard deviation:
x_std = (x - mean(x)) / std(x)
(Standard Deviation: It calculates the extent to which the values differ from the average.
The Standard Deviation is a measure of how spread out numbers are.)
Consider a simple dataset with two features: annual income and age. Let's say we have
three individuals (blue, purple, and red) with varying values for these features. Without
scaling, the difference in the scale of these features can mislead the algorithm. For
instance, a small difference in age might be overshadowed by a large difference in income.
Example of Normalization
Consider a dataset with annual income data for three individuals:
- Person A: $70,000
- Person B: $60,000
- Person C: $52,000
To normalize the income values using the formula x_norm = (x - min(x)) / (max(x) - min(x)):
1. Find the minimum and maximum values:
- Minimum income min(x) = $52,000
- Maximum income max(x) = $70,000
2. Apply the normalization formula to each income value:
- Person A: (70,000 - 52,000) / (70,000 - 52,000) = 1
- Person B: (60,000 - 52,000) / (70,000 - 52,000) ≈ 0.444
- Person C: (52,000 - 52,000) / (70,000 - 52,000) = 0
After normalization:
- Person A's normalized income = 1
- Person B's normalized income ≈ 0.444
- Person C's normalized income = 0
Understanding Standardization
Consider the ages of the same three individuals:
- Person A: 45 years
- Person B: 44 years
- Person C: 40 years
To standardize, first compute the mean and the (population) standard deviation:
- Mean age = (45 + 44 + 40) / 3 = 43
- Standard deviation ≈ 2.16
Then apply x_std = (x - mean) / std. After standardization:
- Person A's standardized age = (45 - 43) / 2.16 ≈ 0.93
- Person B's standardized age = (44 - 43) / 2.16 ≈ 0.46
- Person C's standardized age = (40 - 43) / 2.16 ≈ -1.39
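The two worked examples can be reproduced with scikit-learn's scalers (note that
StandardScaler uses the population standard deviation):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

incomes = np.array([[70000.0], [60000.0], [52000.0]])
print(MinMaxScaler().fit_transform(incomes).ravel())
# -> approximately [1.    0.444 0.   ]

ages = np.array([[45.0], [44.0], [40.0]])
print(StandardScaler().fit_transform(ages).ravel())
# -> approximately [ 0.93  0.46 -1.39]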
Impact of Scaling
After scaling, income and age are on comparable scales, so neither feature dominates the
model simply because of its larger numeric range.
Conclusion
Normalization | Standardization
Sensitive to outliers and the range of the data | Less sensitive to outliers due to the use of the mean and standard deviation
Useful when maintaining the original range is essential | Effective when algorithms assume a standard normal distribution
Suitable for algorithms where the absolute values and their relations are important (e.g., k-nearest neighbors, neural networks) | Particularly useful for algorithms that assume normally distributed data, such as linear regression and support vector machines
Helps in algorithms that rely on gradient descent | Improves performance, particularly in algorithms sensitive to the scale of input features
Use cases: image processing, neural networks, algorithms sensitive to feature scales | Use cases: linear regression, support vector machines, algorithms assuming normal distribution
When you don't know the distribution of your data, or when you know it's not Gaussian,
normalization is a smart approach to apply. Normalization is useful when your data has
variable scales and the technique you're employing, such as k-nearest neighbors or
artificial neural networks, doesn't make assumptions about the distribution of your data.
The assumption behind standardization is that your data follows a Gaussian (bell curve)
distribution. This isn't strictly required, but the approach works better if your attribute
distribution is Gaussian. Standardization is useful when your data has variable scales and
the technique you're using (like logistic regression, linear regression, or linear discriminant
analysis) makes assumptions about a Gaussian distribution.
One-Hot Encoding
One-hot encoding transforms categorical data into a set of binary columns, one per unique
value of the feature. Each column represents one unique category, and a value of 1 or 0
indicates the presence or absence of that category.
For example, a "Color" column with values Red, Green, and Blue is replaced by three new
binary columns, each representing one of the colors. A value of 1 indicates the presence of
the color in that row, while a 0 indicates its absence.
Additionally, many data processing and machine learning libraries support one-hot
encoding. It fits smoothly into the data preprocessing workflow, making it easier to prepare
datasets for various machine learning algorithms.
Avoiding ordinality
Label encoding is another method to convert categorical data into numerical values by
assigning each category a unique number. However, this approach can create problems
because it might suggest an order or ranking among categories that doesn't actually exist.
For example:
Assigning 1 to Red, 2 to Green, and 3 to Blue could make the model think that Green is
greater than Red and that Blue is greater than both. This misunderstanding can negatively
affect the model's performance.
One-hot encoding solves this problem by creating a separate binary column for
each category. This way, the model can see that each category is distinct and
unrelated to the others.
Label encoding is useful when the categorical data has an inherent ordinal
relationship, meaning the categories have a meaningful order or ranking. In
such cases, the numerical values assigned by label encoding can effectively
represent this order, making it a suitable choice.
High School
Bachelor's Degree
Master's Degree
PhD
These categories have a clear order, where PhD represents a higher level of
education than Master's Degree, which in turn is higher than Bachelor's
Degree, and so on. In this case, label encoding can effectively capture the
ordinal nature of the data:
High School → 0, Bachelor's Degree → 1, Master's Degree → 2, PhD → 3
Python offers powerful libraries like Pandas and Scikit-learn, which provide
convenient and efficient ways to perform one-hot encoding.
We'll start with Pandas' get_dummies() function, which is quick and easy for
straightforward encoding tasks. Then, we'll explore Scikit-
learn's OneHotEncoder, which offers more flexibility and control, particularly
useful for more complex encoding needs.
What is Scikit-learn?
Scikit-learn is a free, open-source Python library for machine learning that provides simple,
efficient tools for data preprocessing, model training, and evaluation, built on top of NumPy
and SciPy.
import pandas as pd
# Sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Red']}
df = pd.DataFrame(data)
# One-hot encode the Color column with get_dummies()
df_encoded = pd.get_dummies(df, columns=['Color'])
print(df_encoded)
For more flexibility and control over the encoding process, Scikit-learn offers
the OneHotEncoder class. This class provides advanced options, such as
handling unknown categories and fitting the encoder to the training data.
We fit the encoder to the sample data X. During this step, the encoder learns
the unique categories in the data. We use the fitted encoder to transform new
data. In this case, we transform a single color, 'Red'. The .transform() method
returns a sparse matrix, which we convert to a dense array using
the .toarray() method.
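The code the paragraph above describes appears to be missing from the notes; the
following is a minimal sketch consistent with that description (the variable names X,
encoder, and result are assumptions):

from sklearn.preprocessing import OneHotEncoder

X = [['Red'], ['Green'], ['Blue'], ['Red']]   # sample training data

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(X)                         # learns the unique categories

result = encoder.transform([['Red']])  # returns a sparse matrix
print(result.toarray())                # [[0. 0. 1.]] (columns: Blue, Green, Red)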
Importing Libraries
Python libraries make it very easy for us to handle categorical data in a
DataFrame and perform typical and complex tasks with a single line of code.
Pandas – This library helps to load the data frame in a 2D array format and has
multiple functions to perform analysis tasks in one go.
Numpy – Numpy arrays are very fast and can perform large computations in a
very short time.
Matplotlib/Seaborn – This library is used to draw visualizations.
Sklearn – This module contains multiple libraries having pre-implemented
functions to perform tasks from data preprocessing to model development and
evaluation.
Categorical data is often represented using discrete values, such as integers or strings, and is
frequently encoded as one-hot vectors before being used as input to machine learning models.
One-hot encoding involves creating a binary vector for each category, where the vector has a 1
in the position corresponding to the category and 0s in all other positions.
1. One-Hot Encoding
One-hot encoding is a popular technique for handling categorical data in machine learning. It
involves creating a binary vector for each category, where each element of the vector
represents the presence or absence of the category. For example, if we have a categorical
variable for color with values red, blue, and green, one-hot encoding would create three binary
vectors: [1, 0, 0], [0, 1, 0], and [0, 0, 1] respectively.
2. Label Encoding
Label Encoding is another technique for handling categorical data in machine learning. It
involves assigning a unique numerical value to each category in a categorical variable, with the
order of the values based on the order of the categories.
For example, suppose we have a categorical variable "Size" with three categories: "small,"
"medium," and "large." Using label encoding, we would assign the values 0, 1, and 2 to these
categories, respectively. Label encoding can be useful when there is a natural ordering between
the categories, such as in the case of ordinal categorical variables. However, it should be used
with caution for nominal categorical variables because the numerical values may imply an order
that does not actually exist. In these cases, one-hot encoding is a safer option.
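A small sketch of the "Size" example above. One caveat worth showing: scikit-learn's
LabelEncoder assigns codes alphabetically, so an explicit mapping is used here to preserve
the small < medium < large order:

import pandas as pd

sizes = pd.Series(['small', 'medium', 'large', 'medium'])
order = {'small': 0, 'medium': 1, 'large': 2}   # explicit ordinal mapping
print(sizes.map(order).tolist())                # [0, 1, 2, 1]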
3. Frequency Encoding
Frequency Encoding is another technique for handling categorical data in machine learning. It
involves replacing each category in a categorical variable with its frequency (or count) in the
dataset. The idea behind frequency encoding is that categories that appear more frequently
may be more important or informative for the machine learning algorithm.
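A minimal sketch of frequency encoding with an invented City column:

import pandas as pd

df = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Delhi', 'Pune', 'Delhi']})
# Replace each category with its count in the dataset.
df['City_freq'] = df['City'].map(df['City'].value_counts())
print(df)   # Delhi -> 3, Mumbai -> 1, Pune -> 1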
4. Target Encoding
Target Encoding is another technique for handling categorical data in machine learning. It
involves replacing each category in a categorical variable with the mean (or other aggregation)
of the target variable (i.e., the variable you want to predict) for that category. The idea behind
target encoding is that it can capture the relationship between the categorical variable and
the target variable, and therefore improve the predictive performance of the machine learning
model.
Target encoding can be a powerful technique for improving the predictive performance of
machine learning models, especially for datasets with high-cardinality categorical variables.
However, it is important to avoid overfitting by using cross-validation and regularization
techniques.
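A minimal sketch of target encoding with invented Neighborhood/Price columns; in line
with the caution above, the category means should be computed on training folds only to
avoid leakage:

import pandas as pd

df = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B', 'B', 'B'],
    'Price':        [100, 120, 200, 180, 220],   # the target variable
})
means = df.groupby('Neighborhood')['Price'].mean()   # A -> 110, B -> 200
df['Neighborhood_enc'] = df['Neighborhood'].map(means)
print(df)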
X. Feature Extraction
What is Feature Extraction?
Feature extraction is a required step in machine learning and data analysis: raw data is
selected and transformed into features that are better suited for modeling.
Feature Selection | Feature Extraction
Its output is a subset of the selected features. | Its output is a new set of transformed features.
May discard less relevant features. | May lose the interpretability of the original features.