
ARJ COLLEGE OF ENGINEERING & TECHNOLOGY

&
A.R.J INSTITUTE OF MANAGEMENT STUDIES

MASTER OF COMPUTER APPLICATIONS


SEMESTER-III

AY 2025-26 / Odd Semester


Core Subject
MC4301 - Machine learning
LTPC-3003

UNIT I INTRODUCTION 9
Human Learning - Types – Machine Learning - Types - Problems not to be solved - Applications -
Languages/Tools– Issues. Preparing to Model: Introduction - Machine Learning Activities - Types
of data - Exploring structure of data - Data quality and remediation - Data Pre-processing
UNIT II MODEL EVALUATION AND FEATURE ENGINEERING 9
Model Selection - Training Model - Model Representation and Interpretability - Evaluating
Performance of a Model - Improving Performance of a Model - Feature Engineering: Feature
Transformation - Feature Subset Selection
UNIT III BAYESIAN LEARNING 9
Basic Probability Notation- Inference – Independence - Bayes’ Rule. Bayesian Learning:
Maximum Likelihood and Least Squared error hypothesis-Maximum Likelihood hypotheses for
predicting probabilities- Minimum description Length principle -Bayes optimal classifier - Naïve
Bayes classifier - Bayesian Belief networks -EM algorithm.
UNIT IV PARAMETRIC MACHINE LEARNING 9
Logistic Regression: Classification and representation – Cost function – Gradient descent –
Advanced optimization – Regularization - Solving the problems on overfitting. Perceptron – Neural
Networks – Multi – class Classification - Backpropagation – Non-linearity with activation
functions (Tanh, Sigmoid, Relu, PRelu) - Dropout as regularization
UNIT V NON PARAMETRIC MACHINE LEARNING 9
k- Nearest Neighbors- Decision Trees – Branching – Greedy Algorithm - Multiple Branches –
Continuous attributes – Pruning. Random Forests: ensemble learning. Boosting – Adaboost
algorithm. Support Vector Machines – Large Margin Intuition – Loss Function - Hinge Loss –
SVM Kernels

REFERENCES:
1. Ethem Alpaydin, “Introduction to Machine Learning 3e (Adaptive Computation and Machine
Learning Series)”, Third Edition, MIT Press, 2014
2. Tom M. Mitchell, “Machine Learning”, India Edition, 1st Edition, McGraw-Hill Education
Private Limited, 2013
3. Saikat Dutt, Subramanian Chandramouli and Amit Kumar Das, "Machine Learning", 1st
Edition, Pearson Education, 2019
4. Christopher M. Bishop, “Pattern Recognition and Machine Learning”, Revised Edition,
Springer, 2016.

UNIT - I

HUMAN LEARNING:

Learning is the process of acquiring new understanding, knowledge, behaviors, skills, values,
attitudes, and preferences.
Learning consists of complex information processing, problem-solving, decision-making under uncertainty, and the urge to transfer knowledge and skills into new, unknown settings.
The process of learning is continuous: it starts right from the birth of an individual and continues till death. We all engage in learning endeavours in order to develop our adaptive capabilities as per the requirements of the changing environment.
For a learning to occur, two things are important:

1. The presence of a stimulus in the environment and


2. The innate dispositions like emotional and instinctual
dispositions.

A person keeps on learning across all the stages of life, by constructing or reconstructing experiences under the influence of emotional and instinctual dispositions.
Psychologists in general define learning as relatively permanent behavioural modifications which take place as a result of experience. This definition of learning stresses three important elements of learning:
 Learning involves a behavioural change which can be for better or worse.
 This behavioural change should take place as a result of practice and experience. Changes resulting from maturity or growth cannot be considered as learning.
 This behavioural change must be relatively permanent and last for a relatively long time.
John B. Watson was among the first thinkers to demonstrate that behavioural changes occur as a result of learning. Watson is regarded as the founder of the Behavioural school of thought, which gained prominence around the first half of the 20th century.

Gales defined Learning as the behavioural modification which occurs as a result of
experience as well as training.

Crow and Crow defined learning as the process of acquisition of knowledge, habits
and attitudes.
The key characteristics of the learning process are:
1. When described in the simplest possible manner, learning is described as an
experience acquisition process.
2. In its complex form, learning can be described as a process of acquisition,
retention and modification of experience.
3. It re-establishes the relationship between a stimulus and response.
4. It is a method of problem solving and is concerned about making adjustments
with the environment.
5. It involves the whole gamut of activities which may have a relatively permanent
effect on the individual.
6. The process of learning is concerned with experience acquisition, retention of
experiences, experience development in a step-by-step manner, and the synthesis
of both old and new experiences to create a new pattern.
7. Learning is concerned with cognitive, conative and affective aspects.
Knowledge acquisition is cognitive, any change in the emotions is affective,
and the acquisition of new habits or skills is conative.
Types

Types of Learning:

1. Motor Learning: Our day-to-day activities like walking, running, driving, etc., must be learnt for ensuring a good life. These activities to a great extent involve muscular coordination.
2. Verbal Learning: It is related with the language which we use
to communicate and various other forms of verbal
communication such as symbols, words, languages, sounds,
figures and signs.
3. Concept Learning: This form of learning is associated with
higher order cognitive processes like intelligence, thinking,

reasoning, etc, which we learn right from our childhood.
Concept learning involves the processes of abstraction and
generalization, which is very useful for identifying or
recognizing things.
4. Discrimination Learning: Learning which distinguishes
between various stimuli and gives each its appropriate and different
response is regarded as discrimination learning.
5. Learning of Principles: Learning which is based on principles
helps in managing the work most effectively. Principles based
learning explains the relationship between various concepts.
6. Attitude Learning: Attitude shapes our behaviour to a very great
extent, as our positive or negative behaviour is based on our attitudinal
predisposition.
Types of Behavioural Learning:
The Behavioural School of Thought, founded by John B. Watson and highlighted
in his seminal work, “Psychology as the Behaviorist Views It”, stressed
the fact that psychology is an objective science; hence mere emphasis on
mental processes should be avoided, as such processes cannot be objectively
measured or observed.

Watson tried to prove his theory with the help of his famous Little Albert
Experiment, by way of which he conditioned a small kid to be scared of a white rat.
The behavioural psychology described three types of learning: Classical
Conditioning, Observational Learning and Operant Conditioning.

1. Classical Conditioning: In the case of Classical Conditioning, the process of
learning is described as a Stimulus-Response connection or association.
Classical Conditioning theory has been explained with the help of Pavlov’s classic experiment, in
which food was used as the natural stimulus and was paired with a previously neutral
stimulus, a bell in this case. By
establishing an association between the natural stimulus (food) and the neutral
stimulus (sound of the bell), the desired response can be elicited. This theory
is discussed in detail in later sections.
2. Operant Conditioning: Propounded first by Edward Thorndike
and later by B.F. Skinner, this theory stresses the fact that the
consequences of actions shape behaviour. The theory explains that the intensity of a response is
either increased or decreased as a result of punishment or reinforcement. Skinner explained how
with the help of reinforcement one can strengthen behaviour and with
punishment reduce or curb behaviour. It was also analyzed that the
behavioural change strongly depends on the schedules of reinforcement with
focus on timing and rate of reinforcement.
3. Observational Learning: The Observational Learning process was propounded by Albert
Bandura in his Social Learning Theory, which focused
on learning by imitation or by observing people’s behaviour. For observational
learning to take place effectively, four elements are essential:
Motivation, Attention, Memory and Motor Skills.
Machine Learning:
Machine learning is a subfield of artificial intelligence, which is broadly defined as
the capability of a machine to imitate intelligent human behavior. Artificial
intelligence systems are used to perform complex tasks in a way that is similar to how
humans solve problems.
Machine learning is used in internet search engines, email filters to sort out spam,
websites to make personalised recommendations, banking software to detect unusual
transactions, and lots of apps on our phones such as voice recognition.
Types:

As with any method, there are different ways to train machine learning algorithms,
each with their own advantages and disadvantages. To understand the pros and cons of each type of
machine learning, we must first look at what kind of data they ingest.
In ML, there are two kinds of data — labeled data and unlabeled data.

Labeled data has both the input and output parameters in a completely machine-
readable pattern, but requires a lot of human labor to label the data, to begin with.

Unlabeled data only has one or none of the parameters in a machine-readable form.
This negates the need for human labor but requires more complex solutions.
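To make the distinction concrete, here is a minimal sketch (an illustration added for clarity; the toy values are hypothetical) of what labeled and unlabeled data look like as arrays:

```python
import numpy as np

# Labeled data: every input row has a known output (label) attached.
X_labeled = np.array([[5.1, 3.5], [6.2, 2.9], [4.7, 3.2]])  # input features
y_labels = np.array([0, 1, 0])                               # human-provided labels

# Unlabeled data: only the inputs are available; there are no outputs to learn from.
X_unlabeled = np.array([[5.9, 3.0], [5.0, 3.4]])

print(X_labeled.shape, y_labels.shape)  # (3, 2) (3,)
print(X_unlabeled.shape)                # (2, 2)
```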
There are also some types of machine learning algorithms that are used in very
specific use-cases, but three main methods are used today.
Supervised Learning:
Supervised learning is one of the most basic types of machine learning. In this type,

the machine learning algorithm is trained on labeled data. Even though the data needs
to be labeled accurately for this method to work, supervised learning is extremely
powerful when used in the right circumstances.
In supervised learning, the ML algorithm is given a small training dataset to work
with. This training dataset is a smaller part of the bigger dataset and serves to give the
algorithm a basic idea of the problem, solution, and data points to be dealt with. The
training dataset is also very similar to the final dataset in its characteristics and
provides the algorithm with the labeled parameters required for the problem.

The algorithm then finds relationships between the parameters given, essentially
establishing a cause and effect relationship between the variables in the dataset. At the
end of the training, the algorithm has an idea of how the data works and the
relationship between the input and the output.

This solution is then deployed for use with the final dataset, which it learns from in
the same way as the training dataset. This means that supervised machine learning
algorithms will continue to improve even after being deployed, discovering new
patterns and relationships as it trains itself on new data.
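As an illustration of this workflow, the following is a minimal supervised-learning sketch using scikit-learn; the built-in Iris dataset, the decision-tree model and the split ratio are illustrative assumptions rather than part of the syllabus material.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load a small labeled dataset: inputs X and known labels y.
X, y = load_iris(return_X_y=True)

# Hold out part of the data so the trained model can be checked on unseen cases.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train (fit) the algorithm on the labeled training set.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# The model has learned an input-to-output mapping; evaluate it on the held-out data.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```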

Unsupervised Learning:

Unsupervised machine learning holds the advantage of being able to work with
unlabeled data. This means that human labor is not required to make the dataset
machine-readable, allowing much larger datasets to be worked on by the program.

In supervised learning, the labels allow the algorithm to find the exact nature of the
relationship between any two data points. However, unsupervised learning does not
have labels to work off of, resulting in the creation of hidden structures. Relationships
between data points are perceived by the algorithm in an abstract manner, with no
input required from human beings.

The creation of these hidden structures is what makes unsupervised learning


algorithms versatile. Instead of a defined and set problem statement, unsupervised
learning algorithms can adapt to the data by dynamically changing hidden structures.
This offers more post-deployment development than supervised learning algorithms.
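A minimal clustering sketch illustrates the idea: the algorithm is given only inputs, with no labels, and discovers groups on its own. The synthetic data and the choice of k-means with three clusters are assumptions made for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate unlabeled data: only the inputs X are used; the true labels are discarded.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# The algorithm discovers structure (clusters) on its own, with no labels provided.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)

print("Cluster assignments for the first 10 points:", cluster_ids[:10])
print("Discovered cluster centres:\n", kmeans.cluster_centers_)
```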

Reinforcement Learning:
Reinforcement learning directly takes inspiration from how human beings learn from
data in their lives. It features an algorithm that improves upon itself and learns from
new situations using a trial-and-error method. Favorable outputs are encouraged or
‘reinforced’, and non-favorable outputs are discouraged or ‘punished’.

Based on the psychological concept of conditioning, reinforcement learning works by
putting the algorithm in a work environment with an interpreter and a reward system.
In every iteration of the algorithm, the output result is given to the interpreter, which
decides whether the outcome is favorable or not.

If the program finds the correct solution, the interpreter reinforces the
solution by providing a reward to the algorithm. If the outcome is not favorable, the
algorithm is forced to reiterate until it finds a better result. In most cases, the reward
system is directly tied to the effectiveness of the result.

In typical reinforcement learning use-cases, such as finding the shortest route between
two points on a map, the solution is not an absolute value. Instead, it takes on a score of
effectiveness, expressed in a percentage value. The higher this percentage value is,
the more reward is given to the algorithm. Thus, the program is trained to give the
best possible solution for the best possible reward.
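The trial-and-error loop can be illustrated with a toy multi-armed bandit, a simple reinforcement learning setting; the reward values, the epsilon-greedy rule and the number of iterations below are illustrative assumptions.

```python
import random

# Toy "environment": three possible actions, each with a hidden average reward.
true_rewards = [0.2, 0.5, 0.8]   # hypothetical values, unknown to the learner

estimates = [0.0, 0.0, 0.0]      # the agent's current estimate of each action's value
counts = [0, 0, 0]
epsilon = 0.1                    # fraction of the time the agent explores at random

for step in range(1000):
    # Trial and error: usually exploit the best-known action, sometimes explore.
    if random.random() < epsilon:
        action = random.randrange(3)
    else:
        action = max(range(3), key=lambda a: estimates[a])

    # The interpreter / reward system returns a noisy reward for the chosen action.
    reward = true_rewards[action] + random.gauss(0, 0.1)

    # Reinforce: update the running estimate of that action's value.
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

print("Learned value estimates:", [round(e, 2) for e in estimates])
print("Best action found:", estimates.index(max(estimates)))
```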

Problems not to be solved:

We are always amazed at how machine learning has made such an impact on our lives.
There is no doubt that ML will completely change the face of various industries, as well as job
profiles. While it offers a promising future, there are some inherent
problems at the heart of ML and AI advancements that put these technologies at a disadvantage.
While it can solve a plethora of challenges, there are a few tasks which
ML fails to answer.
1. Reasoning Power
One area that ML has not mastered is reasoning power, a distinctly

human trait. Algorithms available today are mainly oriented towards specific use cases and are
narrow in applicability. They cannot reason about
why a particular method works the way it does, or ‘introspect’ their own outcomes.
For instance, if an image recognition algorithm identifies apples and oranges in a given scenario, it
cannot say whether the apple (or orange) has gone bad or not, or why
that fruit is an apple or an orange.
Mathematically, all of this learning process can be
explained by us, but from an algorithmic perspective, the innate property cannot be
articulated by the algorithms, or even by us.
In other words, ML algorithms lack the ability to reason beyond their intended
application.
2. Contextual Limitation

If we consider the area of natural language processing (NLP), text and speech
information are the means to understand languages by NLP algorithms. They may
learn letters, words, sentences or even the syntax, but where they fall short is in the
context of the language. Algorithms do not understand the context of the language
used. A classic example for this would be the “Chinese room” argument given by

philosopher John Searle, which says that computer programs or algorithms grasp the
idea merely by ‘symbols’ rather than the context given.

So, ML does not have an overall idea of the situation. It is limited by mnemonic
interpretations rather than thinking to see what is actually going on.
3. Scalability:
Although we see ML implementations being deployed widely, everything
depends on data as well as its scalability. Data is growing at an enormous rate and in
many forms, which largely affects the scalability of an ML project. Algorithms cannot
do much about this unless they are constantly updated to handle new kinds of data. This is
where ML regularly requires human intervention, and scalability
remains mostly unsolved.
In addition, growing data has to be dealt with in the right way if shared on an ML platform,
which again needs examination through knowledge and intuition that current ML
apparently lacks.
4. Regulatory Restriction For Data In ML:

ML usually needs considerable (in fact, massive) amounts of data in stages such as
training, cross-validation, etc. Sometimes this data includes private as well as general
information. This is where it gets complicated. Most tech companies hold privatised
data, and these data are the ones which are actually useful for ML applications. But
there is a risk of wrongful usage of data, especially in critical areas such as medical
research, health insurance, etc.

Even though data are sometimes anonymised, they can still be vulnerable.
This is the reason regulatory rules are imposed heavily when it comes to using
private data.
5. Internal Working Of Deep Learning:

This sub-field of ML is actually responsible for today’s AI growth. What was once
just a theory has appeared to be the most powerful aspect of ML. Deep Learning (DL) now powers
applications such as voice recognition, image recognition and so on
through artificial neural networks.

But the internal working of DL is still unknown and yet to be fully understood. Advanced DL
algorithms still baffle researchers in terms of their working and efficiency. The millions of
neurons that form the neural networks in DL increase abstraction at every level, which
cannot be fully comprehended. This is why deep learning is dubbed a ‘black box’: its internal
workings are unknown.

Applications:
Popular Machine Learning Applications and Examples
1. Social Media Features

Social media platforms use machine learning algorithms and approaches to create
some attractive and excellent features. For instance, Facebook notices and records
your activities, chats, likes, and comments, and the time you spend on specific kinds
of posts. Machine learning learns from your own experience and makes friends and
page suggestions for your profile.

2. Product Recommendations
Product recommendation is one of the most popular and known applications of

machine learning. Product recommendation is one of the stark features of almost
every e-commerce website today, which is an advanced application of machine
learning techniques. Using machine learning and AI, websites track your behavior
based on your previous purchases, searching patterns, and cart history, and then make
product recommendations.

3. Image Recognition
Image recognition, which is an approach for cataloging and detecting a feature or an
object in the digital image, is one of the most significant and notable machine learning
and AI techniques. This technique is being adopted for further analysis, such as pattern recognition,
face detection, and face recognition.

4. Sentiment Analysis
Sentiment analysis is one of the most necessary applications of machine learning.
Sentiment analysis is a real-time machine learning application that determines the
emotion or opinion of the speaker or the writer. For instance, if someone has written a
review or email (or any form of a document), a sentiment analyzer will instantly find
out the actual thought and tone of the text. This sentiment analysis application can be
used to analyze a review based website, decision-making applications, etc.
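As a rough illustration only, a toy lexicon-based scorer shows the basic idea of assigning a tone to text; real sentiment analyzers are trained on labeled data rather than on hand-written word lists like the hypothetical ones below.

```python
# Toy word lists; a production system would learn these weights from labeled reviews.
POSITIVE = {"good", "great", "excellent", "love", "amazing"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "awful"}

def sentiment(text: str) -> str:
    """Return a rough sentiment label by counting positive and negative words."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("The product is great and I love it"))   # positive
print(sentiment("Terrible quality and very poor support"))  # negative
```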

5. Automating Employee Access Control


Organizations are actively implementing machine learning algorithms to determine
the level of access employees would need in various areas, depending on their job
profiles. This is one of the coolest applications of machine learning.

6. Marine Wildlife Preservation


Machine learning algorithms are used to develop behavior models for endangered
cetaceans and other marine species, helping scientists regulate and monitor their
populations.

7. Regulating Healthcare Efficiency and Medical Services


Significant healthcare sectors are actively looking at using machine learning
algorithms to manage better. They predict the waiting times of patients in the

emergency waiting rooms across various departments of hospitals. The models use vital factors that
help define the algorithm, details of staff at various times of day, records of patients, and complete
logs of department chats and the layout of
emergency rooms. Machine learning algorithms also come to play when detecting a disease,
therapy planning, and prediction of the disease situation. This is one of the
most necessary machine learning applications.

8. Predict Potential Heart Failure


An algorithm designed to scan a doctor’s free-form e-notes and identify patterns in a patient’s
cardiovascular history is making waves in medicine. Instead of a physician
digging through multiple health records to arrive at a sound diagnosis, redundancy is
now reduced, with computers making an analysis based on the available information.
9. Banking Domain
Banks are now using the latest advanced technology machine learning has to offer to
help prevent fraud and protect accounts from hackers. The algorithms determine what
factors to consider to create a filter to keep harm at bay. Various sites that are unauthentic will be
automatically filtered out and restricted from initiating
transactions.

10. Language Translation


One of the most common machine learning applications is language translation.
Machine learning plays a significant role in the translation of one language to another. We are
amazed at how websites can translate from one language to another
effortlessly and give contextual meaning as well. The technology behind the
translation tool is called ‘machine translation.’ It has enabled people to interact with
others from all around the world; without it, life would not be as easy as it is now. It
has provided confidence to travelers and business associates to safely venture into
foreign lands with the conviction that language will no longer be a barrier. Your model will need to
be taught what you want it to learn. Feeding back relevant
data will help the machine draw patterns and act accordingly. It is imperative to
provide relevant data and feed files to help the machine learn what is expected. In this
case, with machine learning, the results you strive for depend on the contents of the
files that are being recorded.

Languages/Tools

Regardless of individual preferences for a particular programming language, we have profiled
the five best programming languages for machine learning:

1. Python Programming Language

With over 8.2 million developers across the world using Python for coding, Python
ranks first in the latest annual ranking of popular programming languages by IEEE
Spectrum with a score of 100. Stack overflow programming language trends clearly
show that it’s the only language on rising for the last five years.

The increasing adoption of machine learning worldwide is a major factor contributing


to its growing popularity. Around 69% of machine learning engineers use Python, and it
has become the favourite choice for data analytics, data science, machine learning,
and AI – all thanks to its vast library ecosystem that lets machine learning
practitioners access, handle, transform, and process data with ease. Python wins the
hearts of machine learning engineers for its platform independence, lower complexity, and better
readability. The poem “The Zen of Python”, written by
Tim Peters, beautifully describes why Python is gaining popularity as the best
language for machine learning.

Python is the preferred programming language of choice for machine learning for some of the
giants in the IT world including Google, Instagram, Facebook, Dropbox,
Netflix, Walt Disney, YouTube, Uber, Amazon, and Reddit. Python is an indisputable
leader and by far the best language for machine learning today and here’s why:

 Extensive Collection of Libraries and Packages

Python’s in-built libraries and packages provide base-level code so machine learning
engineers don’t have to start writing from scratch. Machine learning requires
continuous data processing and Python has in-built libraries and packages for almost
every task. This helps machine learning engineers reduce development time and
improve productivity when working with complex machine learning applications. The
best part of these libraries and packages is that there is zero learning curve, once you

know the basics of Python programming, you can start using these libraries.

1. Working with textual data – use NLTK, SciKit, and NumPy
2. Working with images – use Sci-Kit image and OpenCV
3. Working with audio – use Librosa
4. Implementing deep learning – use TensorFlow, Keras, PyTorch
5. Implementing basic machine learning algorithms – use Sci-Kit-learn
6. Want to do scientific computing – use Sci-Py
7. Want to visualise the data clearly – use Matplotlib, Sci-Kit, and Seaborn
(A short sketch using a few of these libraries follows this list.)
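Here is a small sketch of how a few of these libraries fit together; the tiny spam/ham texts and the choice of TF-IDF features with logistic regression are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny hypothetical labeled text dataset.
texts = ["free prize, click now", "meeting at 10am tomorrow",
         "win money fast", "project report attached"]
labels = np.array([1, 0, 1, 0])  # 1 = spam, 0 = not spam

# scikit-learn turns raw text into numeric features; NumPy holds the label array.
features = TfidfVectorizer().fit_transform(texts)

model = LogisticRegression()
model.fit(features, labels)
print(model.predict(features))  # predictions on the training texts
```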

 Code Readability

The joy of coding in Python should be in seeing short, concise, readable classes that express a lot of action in a small amount of clear code — not in reams of trivial code that bores the reader to death. – Guido van Rossum

The math behind machine learning is usually complicated and unobvious. Thus, code
readability is extremely important to successfully implement complicated machine
learning algorithms and versatile workflows. Python’s simple syntax and the
importance it puts on code readability makes it easy for machine learning engineers to
focus on what to write instead of thinking about how to write. Code readability makes
it easier for machine learning practitioners to easily exchange ideas, algorithms, and
tools with their peers. Python is not only popular among machine learning engineers,
but it is also one of the most popular programming languages among data scientists.

 Flexibility

The multiparadigm and flexible nature of Python makes it easy for machine learning
engineers to approach a problem in the simplest way possible. It supports the
procedural, functional, object-oriented, and imperative style of programming allowing
machine learning experts to work comfortably on what approach fits best. The
flexibility Python offers helps machine learning engineers choose the programming
style based on the type of problem – sometimes it would be beneficial to capture the
state in an object while other times the problem might require passing around
functions as parameters. Python provides flexibility in choosing either of the

approaches and minimises the likelihood of errors. Not only in terms of programming

styles but Python has a lot to offer in terms of flexibility when it comes to
implementing changes as machine learning practitioners need not recompile the
source code to see the changes.
2. R Programming Language

With more than 2 million R users, 12000 packages in the CRAN open-source
repository, close to 206 R Meetup groups, over 4000 R programming questions asked
every month, and 40K+ members on LinkedIn’s R group – R is an incredible
programming language for machine learning written by a statistician for statisticians.
R language can also be used by non-programmers, including data miners, data analysts,
and statisticians.
A critical part of a machine learning engineer’s day-to-day job roles is understanding
statistical principles so they can apply these principles to big data. R programming
language is a fantastic choice when it comes to crunching large numbers and is the preferred choice
for machine learning applications that use a lot of statistical data.
With user-friendly IDE’s like RStudio and various tools to draw graphs and manage
libraries – R is a must-have programming language in a machine learning engineer’s
toolkit. Here’s what makes R one of the most effective machine learning languages
for cracking business problems –
 Machine learning engineers need to train algorithms and bring in automation
to make accurate predictions. R language provides a variety of tools to train
and evaluate machine learning algorithms for predicting future events making
machine learning easy and approachable. R has an exhaustive list of packages
for machine learning –
1. MICE for dealing with missing values.
2. CARET for working with classification and regression problems.
3. PARTY and rpart for creating data partitions.
4. randomFOREST for creating decision trees.
5. dplyr and tidyr for data manipulation.
6. ggplot2 for creating beautiful visualisations.
7. Rmarkdown and Shiny for communicating insights through reports.
 R is an open-source programming language making it a highly cost-effective

choice for machine learning projects of any size.
 R supports the natural implementation of matrix arithmetic and other data
structures like vectors, which Python does not. For a similar implementation
in the Python programming language, machine learning engineers have to use the
NumPy package, which is a clumsier implementation when compared to R.
 R is considered a powerful choice for machine learning because of the breadth
of machine learning techniques it provides. Be it data visualisation, data
sampling, data analysis, model evaluation, or supervised/unsupervised machine
learning – R has a diverse array of techniques to offer.
 The style of programming in the R language is quite easy.
 R is highly flexible and also offers cross-platform compatibility. R does not
impose restrictions on how machine learning tasks are performed in the language.
3. Java and JavaScript

Though Python and R continue to be the favourites of machine learning enthusiasts,


Java is gaining popularity among machine learning engineers who hail from a Java
development background as they don’t need to learn a new programming language
like Python or R to implement machine learning. Many organisations already have
huge Java codebases, and most of the open-source tools for big data processing like
Hadoop, Spark are written in Java. Using Java for machine learning projects makes it easier for
machine learning engineers to integrate with existing code repositories.
Features like the ease of use, package services, better user interaction, easy debugging,
and graphical representation of data make it a machine learning language of choice –

 Java has plenty of third-party libraries for machine learning. JavaML is an in-built machine learning library that provides a collection of machine learning algorithms implemented in Java. You can also use the Arbiter Java library for hyperparameter tuning, which is an integral part of making ML algorithms run effectively, or the Deeplearning4J library, which supports popular machine learning algorithms such as K-Nearest Neighbor and lets you create neural networks, or Neuroph for neural networks.
 Scalability is an important feature that every machine learning engineer must consider before beginning a project. Java makes application scaling easier for machine learning engineers, making it a great choice for the development of large and complex machine learning applications from scratch.
 Java Virtual Machine is one of the best platforms for machine learning, as engineers can write the same code on multiple platforms. JVM also helps machine learning engineers create custom tools at a rapid pace and has various IDEs that help improve overall productivity. Java works best for speed-critical machine learning projects as it is fast executing.

4. Julia
Julia is a high-performance, general-purpose dynamic programming language
emerging as a potential competitor for Python and R with many predominant features
exclusively for machine learning. While it is a general-purpose programming language
that can be used for the development of all kinds of applications, it works best for
high-performance numerical analysis and computational science. With support for all
types of hardware, including TPUs and GPUs on every cloud, Julia is powering machine
learning applications at big corporations like Apple, Disney, Oracle, and NASA.
Why use Julia for machine learning?
 Julia is particularly designed for implementing basic mathematics and scientific queries that underlie most machine learning algorithms.
 Julia code is compiled Just-in-Time, at run time, using the LLVM framework. This gives machine learning engineers great speed without any handcrafted profiling or optimisation techniques, solving most performance problems.
 Julia’s code is universally executable. Once a machine learning application is written, it can be compiled in Julia and called natively from other languages like Python or R through wrappers like PyCall or RCall.
 Scalability, as discussed, is crucial for machine learning engineers, and Julia makes it easier to deploy quickly on large clusters. With powerful tools like TensorFlow, [Link], [Link], [Link], and many others that utilise the scalability provided by Julia, it is an apt choice for machine learning applications.
 Julia offers support for editors like Emacs and VIM and also IDEs like Visual Studio and Juno.

5. LISP

Created in 1958 by John McCarthy, LISP (List Processing) is the second-oldest
programming language still in use and was mainly developed for AI-centric applications.
LISP is a dynamically typed programming language that has influenced the creation
of many machine learning programming languages like Python, Julia, and Java. LISP
works on Read-Eval-Print-Loop (REPL) and has the capability to code, compile, and
run code in 30+ programming languages.
Lisp is a language for doing what you’ve been told is impossible – Kent Pitman
LISP is considered as the most efficient and flexible machine learning language for
solving specifics as it adapts to the solution a programmer is coding for. This is what
makes LISP different from other machine learning languages. Today, it is particularly
used for inductive logic problems and machine learning. The first AI chatbot ELIZA
was developed using LISP and even today machine learning practitioners can use it to
create chatbots for eCommerce. LISP definitely deserves a mention on the list of best
language for machine learning because even today developers rely on LISP for
artificial intelligence projects that are heavy on machine learning as LISP offers –
 Rapid prototyping capabilities
 Dynamic object creation
 Automatic garbage collection
 Flexibility
 Support for symbolic expressions
Despite being flexible for machine learning, LISP lacks the support of well-known
machine learning libraries. LISP is neither a beginner-friendly machine learning
language (it is difficult to learn) nor does it have a large user community like that of
Python or R.
The best language for machine learning depends on the area in which it is going to be
applied, the scope of the machine learning project, which programming languages are used in your
industry/company, and several other factors. Experimentation, testing,
and experience help a machine learning practitioner decide on an optimal choice of
programming language for any given machine learning problem. Of course, the best

thing would be to learn at least two programming languages for machine learning as
this will help you put your machine learning resume at the top of the stack. Once you
are proficient in one machine learning language, learning another one is easy.
Machine Learning Tools
1. Microsoft Azure Machine Learning
Azure Machine Learning is a cloud platform that allows developers to build, train,
and deploy AI models. Microsoft is constantly making updates and improvements to
its machine learning tools and has recently announced changes to Azure Machine
Learning, retiring the Azure Machine Learning Workbench.
2. IBM Watson
No, IBM’s Watson Machine Learning isn’t something out of Sherlock Holmes.
Watson Machine Learning is an IBM cloud service that uses data to put machine
learning and deep learning models into production. This machine learning tool allows users to
perform training and scoring, two fundamental machine learning operations.
Keep in mind, IBM Watson is best suited for building machine learning applications
through API connections.
3. Google TensorFlow
TensorFlow, which is used for research and production at Google, is an open-source
software library for dataflow programming. The bottom line, TensorFlow is a machine learning
framework. This machine learning tool is relatively new to the
market and is evolving quickly. TensorFlow's easy visualization of neural networks is
likely the most attractive feature to developers.
4. Amazon Machine Learning
It should come as no surprise that Amazon offers an impressive number of machine
learning tools. According to the AWS website, Amazon Machine Learning is a managed service for
building Machine Learning models and generating predictions.
Amazon Machine Learning includes an automatic data transformation tool,
simplifying the machine learning tool even further for the user. In addition, Amazon

also offers other machine learning tools such as Amazon SageMaker, which is a fully
managed platform that makes it easy for developers and data scientists to utilize

machine learning models.


5. OpenNN
OpenNN, short for Open Neural Networks Library, is a software library that

implements neural networks. Written in C++ programming language, OpenNN offers
you the perk of downloading its entire library for free from GitHub or SourceForge.

Issues

Although machine learning is being used in every industry and helps organizations
make more informed and data-driven choices that are more effective than classical
methodologies, it still has many problems that cannot be ignored. Here are some
common issues in Machine Learning that professionals face while building ML skills
and creating applications from scratch.

1. Inadequate Training Data


The major issue that comes while using machine learning algorithms is the lack of
quality as well as quantity of data. Although data plays a vital role in the processing
of machine learning algorithms, many data scientists claim that inadequate, noisy,
and unclean data make machine learning algorithms far less effective. For
example, a simple task requires thousands of sample data, and an advanced task such
as speech or image recognition needs millions of sample data examples. Further, data
quality is also important for the algorithms to work ideally, but the absence of data
quality is also found in Machine Learning applications. Data quality can be affected
by some factors as follows:
 Noisy data – responsible for inaccurate predictions that affect the decision as well as the accuracy of classification tasks.
 Incorrect data – also responsible for faulty programming and results obtained in machine learning models; hence, incorrect data may affect the accuracy of the results.
 Generalizing of output data – sometimes generalizing output data becomes complex, which results in comparatively poor future actions.
2. Poor quality of data

As we have discussed above, data plays a significant role in machine learning, and it
must be of good quality as well. Noisy data, incomplete data, inaccurate data, and
unclean data lead to less accuracy in classification and low-quality results. Hence,

data quality can also be considered as a major common problem while processing
machine learning algorithms.

3. Non-representative training data


To make sure our training model generalizes well, we have to ensure that the
sample training data is representative of the new cases to which we need to generalize.
The training data must cover all cases that have already occurred as well as those that are occurring.
Further, if we use non-representative training data in the model, it results in less
accurate predictions. A machine learning model is said to be ideal if it predicts well
for generalized cases and provides accurate decisions. If there is too little training data,
there will be sampling noise in the model, called a non-representative training
set; it won't be accurate in predictions and will be biased towards one
class or group.

Hence, we should use representative data in training to protect against being biased
and make accurate predictions without any drift.
4. Overfitting and Underfitting
Overfitting is one of the most common issues faced by Machine Learning engineers
and data scientists. Whenever a machine learning model is trained with a huge amount
of data, it starts capturing noise and inaccurate data into the training data set. It
negatively affects the performance of the model. Let's understand with a simple
example where we have a few training data sets such as 1000 mangoes, 1000 apples,
1000 bananas, and 5000 papayas. Then there is a considerable probability of
identification of an apple as papaya because we have a massive amount of biased data
in the training data set; hence the prediction gets negatively affected. The main reason
behind overfitting is the use of non-linear methods in machine learning algorithms, as
they build unrealistic data models. We can overcome overfitting by using linear and
parametric algorithms in the machine learning models.
Methods to reduce overfitting:
 Increase training data in a dataset.
 Reduce model complexity by simplifying the model and selecting one with fewer parameters.
 Ridge regularization and Lasso regularization (see the sketch below).
 Early stopping during the training phase.
 Reduce the noise.
 Reduce the number of attributes in training data.
 Constrain the model.
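The following sketch illustrates Ridge and Lasso regularization on deliberately overfit-prone polynomial features; the synthetic data, polynomial degree and alpha values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical noisy data: y depends linearly on x, plus random noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = 0.5 * X[:, 0] + rng.normal(0, 0.3, size=40)

# Many polynomial features make it easy for an unregularized model to fit the noise.
X_poly = PolynomialFeatures(degree=9, include_bias=False).fit_transform(X)
X_poly = StandardScaler().fit_transform(X_poly)

plain = LinearRegression().fit(X_poly, y)
ridge = Ridge(alpha=1.0).fit(X_poly, y)                    # L2 penalty shrinks coefficients
lasso = Lasso(alpha=0.05, max_iter=10000).fit(X_poly, y)   # L1 penalty drives some to zero

print("Sum of |coefficients| without regularization:", np.abs(plain.coef_).sum().round(2))
print("Sum of |coefficients| with Ridge:", np.abs(ridge.coef_).sum().round(2))
print("Non-zero coefficients kept by Lasso:", int(np.sum(lasso.coef_ != 0)))
```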

Underfitting:
Underfitting is just the opposite of overfitting. Whenever a machine learning model is
trained with too little data, it produces incomplete and inaccurate results and destroys
the accuracy of the machine learning model.
Underfitting occurs when our model is too simple to capture the underlying structure of
the data, just like an undersized pant. This generally happens when we have limited
data in the data set and we try to build a linear model with non-linear data. In such
scenarios, the model becomes too simple, its rules are too easy to apply to the data set,
and the model starts making wrong predictions as well.
Methods to reduce underfitting:
 Increase model complexity.
 Remove noise from the data.
 Train on more and better features.
 Reduce the constraints.
 Increase the number of epochs to get better results.

5. Monitoring and maintenance
As we know, generalized output data is mandatory for any machine learning
model; hence, regular monitoring and maintenance become compulsory. Different
results for different actions require data changes; hence, editing of code as well as
resources for monitoring it also become necessary.
6. Getting bad recommendations
A machine learning model operates in a specific context, and a change in that context
results in bad recommendations and concept drift in the model. Let's understand this
with an example: at a specific time a customer is looking for some gadgets, but the
customer's requirements change over time, while the machine learning model keeps
showing the same recommendations even though the customer's expectations have changed.
This incident is called Data Drift. It generally occurs when new data is introduced or
the interpretation of data changes.
However, we can overcome this by regularly updating
and monitoring data according to the expectations.
7. Lack of skilled resources
Although Machine Learning and Artificial Intelligence are continuously growing in
the market, these industries are still younger than others. The absence of

skilled resources in the form of manpower is also an issue. Hence, we need manpower
with in-depth knowledge of mathematics, science, and technology for developing
and managing the scientific substance of machine learning.
8. Customer Segmentation
Customer segmentation is also an important issue while developing a machine
learning algorithm. It is difficult to identify the customers who act on the recommendations
shown by the model and those who do not even check them. Hence, an algorithm is
necessary to recognize customer behaviour and trigger relevant recommendations
for the user based on past experience.

9. Process Complexity of Machine Learning


The machine learning process is very complex, which is another major issue
faced by machine learning engineers and data scientists. Machine Learning
and Artificial Intelligence are very new technologies; they are still in an experimental
phase and continuously changing over time. Much of the work is hit-and-trial
experimentation; hence the probability of error is higher than expected. Further, the process
also includes analyzing the data, removing data bias, training data, applying complex
mathematical calculations, etc., making the procedure more complicated and quite
tedious.

10. Data Bias


Data bias is also a big challenge in Machine Learning. These errors occur
when certain elements of the dataset are heavily weighted or given more importance
than others. Biased data leads to inaccurate results, skewed outcomes, and other
analytical errors. However, we can resolve this error by determining where data is
actually biased in the dataset and then taking the necessary steps to reduce it.

Methods to remove Data Bias:

 Research more for customer segmentation.
 Be aware of your general use cases and potential outliers.
 Combine inputs from multiple sources to ensure data diversity.
 Include bias testing in the development process.
 Analyze data regularly and keep tracking errors to resolve them easily.
 Review the collected and annotated data.
 Use multi-pass annotation such as sentiment analysis, content moderation, and intent recognition.
11. Lack of Explainability
This basically means the outputs cannot be easily comprehended, as the model is programmed
in specific ways to deliver results for certain conditions. Hence, a lack of explainability is
also found in machine learning algorithms, which reduces the credibility of the
algorithms.
12. Slow implementations and results
This issue is also very commonly seen in machine learning models. Machine
learning models can be highly efficient in producing accurate results but are
time-consuming. Slow programming, excessive requirements, and overloaded data
take more time than expected to provide accurate results. This needs continuous
maintenance and monitoring of the model to deliver accurate results.
13. Irrelevant features
Although machine learning models are intended to give the best possible outcome, if we feed
garbage data as input, then the result will also be garbage. Hence, we should
use relevant features in our training sample. A machine learning model is said to be
good if the training data has a good set of features with few to no irrelevant features.
Preparing to Model - Introduction:
Getting the data right is the first step in any AI or machine learning project -- and it's
often more time-consuming and complex than crafting the machine learning
algorithms themselves. Advanced planning to help streamline and improve data preparation in
machine learning can save considerable work down the road. It can also
lead to more accurate and adaptable algorithms.
"Data preparation is the action of gathering the data you need, massaging it into a
format that's computer-readable and understandable, and asking hard questions of it to
check it for completeness and bias," said Eli Finkelshteyn, founder and CEO of
[Link], which makes an AI-driven search engine for product websites.

It's tempting to focus only on the data itself, but it's a good idea to first consider the
problem you're trying to solve.
That can help simplify considerations about what kind
of data to gather, how to ensure it fits the intended purpose and how to transform it
into the appropriate format for a specific type of algorithm.
Good data preparation can lead to more accurate and efficient algorithms, while
making it easier to pivot to new analytics problems, adapt when model accuracy drifts
and save data scientists and business users considerable time and effort down the line.
The importance of data preparation in machine learning
"Being a great data scientist is like being a great chef," surmised Donncha Carroll, a partner at
consultancy Axiom Consulting Partners.

Managers need to appreciate the ways in which data shapes machine learning
application development differently compared to customary application development. "Unlike
traditional rule-based programming, machine learning consists of two parts
that make up the final executable algorithm -- the ML algorithm itself and the data to
learn from," explained Felix Wick, corporate vice president of data science at supply
chain management platform provider Blue Yonder. "But raw data are often not ready
to be used in ML models. So, data preparation is at the heart of ML."
Data preparation consists of several steps, which consume more time than other
aspects of machine learning application development. A 2021 study by data science platform
vendor Anaconda found that data scientists spend an average of 22% of their
time on data preparation, which is more than the average time spent on other tasks
like deploying models, model training and creating data visualizations.
Although it is a time-intensive process, data scientists must pay attention to various
considerations when preparing data for machine learning. Following are six key steps
that are part of the process.
1. Problem formulation
Data preparation for building machine learning models is a lot more than just cleaning
and structuring data.
In many cases, it's helpful to begin by stepping back from the
data to think about the underlying problem you're trying to solve. "To build a
successful ML model," Carroll advised, "you must develop a detailed understanding
of the problem to inform what you do and how you do it."
Start by spending time with the people that operate within the domain and have a good
understanding of the problem space, synthesizing what you learn through

conversations with them and using your experience to create a set of hypotheses that
describes the factors and forces involved.
This simple step is often skipped or underinvested in, Carroll noted, even though it can make a
significant difference in

deciding what data to capture. It can also provide useful guidance on how the data
should be transformed and prepared for the machine learning model.
An Axiom legal client, for example, wanted to know how different elements of
service delivery impact account retention and growth.
Carroll's team collaborated with
the attorneys to develop a hypothesis that accounts served by legal professionals
experienced in their industry tend to be happier and continue as clients longer.
To provide that information as an input to a machine learning model, they looked back
over the course of each professional's career and used billing data to determine how
much time they spent serving clients in that industry.
"Ultimately," Carroll added, "it became one of the most important predictors of client
retention and something we would never have calculated without spending the time
upfront to understand what matters and how it matters."
2. Data collection and discovery:
Once a data science team has formulated the machine learning problem to be solved,
it needs to inventory potential data sources within the enterprise and from external
third parties. The data collection process must consider not only what the data is
purported to represent, but also why it was collected and what it might mean,
particularly when used in a different context. It's also essential to consider factors that
may have biased the data.
"To reduce and mitigate bias in machine learning models," said Sophia Yang, a senior
data scientist at Anaconda, "data scientists need to ask themselves where and how the
data was collected to determine if there were significant biases that might have been
captured." To train a machine learning model that predicts customer behavior, for
example, look at the data and ensure the data set was collected from diverse people,
geographical areas and perspectives.
"The most important step often missed in data preparation for machine learning is
asking critical questions of data that otherwise looks technically correct,"
Finkelshteyn said. In addition to investigating bias, he recommended determining if

there's reason to believe that important missing data may lead to a partial picture of
the analysis being done.
In some cases, analytics teams use data that works
technically but produces inaccurate or incomplete results, and people who use the
resulting models build on these faulty learnings without knowing something is wrong.
3. Data exploration:
Data scientists need to fully understand the data they're working with early in the
process to cultivate insights into its meaning and applicability. "A common mistake is
to launch into model building without taking the time to really understand the data
you've wrangled," Carroll said.
Data exploration means reviewing such things as the type and distribution of data
contained within each variable, the relationships between variables and how they vary
relative to the outcome you're predicting or interested in achieving.

This step can highlight problems like collinearity -- variables that move together -- or situations
where standardization of data sets and other data transformations are necessary. It can also surface
opportunities to improve model performance, like
reducing the dimensionality of a data set.
Data visualizations can also help improve this process.
"This might seem like an added step that isn't needed," Yang conjectured, "but our brains are great
at spotting
patterns along with data that doesn't match the pattern." Data scientists can easily see
trends and explore the data correctly by creating suitable visualizations before
drawing conclusions. Popular data visualization tools include Tableau, Microsoft
Power BI, [Link] and Python libraries such as Matplotlib, Bokeh and the HoloViz
stack.
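A brief exploration sketch using pandas and Matplotlib shows the kind of summaries and plots this step involves; the toy customer table below is a hypothetical stand-in for real data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical customer data; in practice this would be read from a file or database.
df = pd.DataFrame({
    "age":    [23, 35, 45, 52, 46, 23, 38, 60, 29, 41],
    "income": [28, 54, 61, 85, 70, 30, 58, 90, 41, 65],  # in thousands
    "churn":  [1, 0, 0, 0, 1, 1, 0, 0, 1, 0],
})

print(df.describe())  # type and distribution of each variable
print(df.corr())      # relationships between variables (e.g. collinearity)

# A quick visualization often reveals patterns that summary tables hide.
df.hist(figsize=(8, 3))
plt.tight_layout()
plt.show()
```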
4. Data cleansing and validation:
Various data cleansing and validation techniques can help analytics teams identify
and rectify inconsistencies, outliers, anomalies, missing data and other issues. Missing
data values, for example, can often be addressed with imputation tools that fill empty
fields with statistically relevant substitutes.
But Blue Yonder's Wick cautioned that semantic meaning is an often overlooked
aspect of missing data. In many cases, creating a dedicated category for capturing the
significance of missing values can help. In others, teams may consider explicitly
setting missing values as neutral to minimize their impact on machine learning

models.
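A minimal sketch of these two ideas, using scikit-learn's SimpleImputer on a hypothetical column, keeps a dedicated indicator for the semantic meaning of "missing" and then fills the empty fields with a statistically relevant substitute.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical column with missing values.
df = pd.DataFrame({"monthly_spend": [120.0, np.nan, 80.0, 95.0, np.nan, 150.0]})

# Keep the semantic meaning of "missing" in a dedicated indicator column...
df["spend_missing"] = df["monthly_spend"].isna()

# ...then fill the empty fields with a statistically relevant substitute (the median).
imputer = SimpleImputer(strategy="median")
df["monthly_spend"] = imputer.fit_transform(df[["monthly_spend"]]).ravel()

print(df)
```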
A wide range of commercial and open source tools can be used to cleanse and
validate data for machine learning and ensure good quality data. Open source
technologies such as Great Expectations and Pandera, for example, are designed to validate the data frames commonly used to organize analytics data into two-dimensional tables. Tools that validate code and data processing workflows are also available. One of them is pytest, which, Yang said, data scientists can use to apply a software development unit-test mindset and manually write tests of their workflows.
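As a minimal illustration of the ideas above (imputing missing numeric values, giving missing categories their own label, and running a simple validation check), here is a small pandas sketch. Column names and thresholds are invented for illustration; dedicated tools such as Great Expectations or Pandera provide far richer validation.

# A small cleansing sketch (hypothetical columns; not a complete pipeline).
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [34, np.nan, 51, 28, np.nan],
    "segment": ["retail", None, "wholesale", "retail", None],
})

# Impute a numeric column with a statistically relevant substitute (the median).
df["age"] = df["age"].fillna(df["age"].median())

# Preserve the semantic meaning of missingness with a dedicated category.
df["segment"] = df["segment"].fillna("unknown")

# Simple validation: flag rows that violate a basic expectation.
invalid = df[(df["age"] < 0) | (df["age"] > 120)]
print(df)
print(f"{len(invalid)} rows failed the age range check")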
5. Data structuring:
Once data science teams are satisfied with their data, they need to consider the
machine learning algorithms being used. Most algorithms, for example, work better
when data is broken into categories, such as age ranges, rather than left as raw
numbers.
Two often-missed data preprocessing tricks, Wick said, are data binning and
smoothing continuous features. These data regularization methods can reduce a machine learning
model's variance by preventing it from being misled by minor
statistical fluctuations in a data set.
Binning data into different groups can be done either in an equidistant manner, with the same "width" for each bin, or in an equi-statistical manner, with approximately the same number of samples in each bin. Binning can also serve as a prerequisite for local optimization of the data in each bin to help produce low-bias machine learning models.
Smoothing continuous features can help in "denoising" raw data. It can also be used
to impose causal assumptions about the data-generating process by representing
relationships in ordered data sets as monotonic functions that preserve the order
among data elements.
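A minimal pandas sketch of the two binning strategies, using synthetic values and an arbitrary number of bins:

# Binning sketch: equidistant vs. equi-statistical bins (illustrative values).
import pandas as pd
import numpy as np

income = pd.Series(np.random.default_rng(0).lognormal(mean=10, sigma=0.5, size=1000))

equal_width = pd.cut(income, bins=5)    # same "width" for each bin
equal_freq = pd.qcut(income, q=5)       # roughly the same number of samples per bin

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())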
Other actions that data scientists often take in structuring data for machine learning
include the following:
 data reduction, through techniques such as attribute or record sampling and
data aggregation;
 data normalization, which includes dimensionality reduction and data rescaling; and
 creating separate data sets for training and testing machine learning models.
6. Feature engineering and selection:
The last stage in data preparation before developing a machine learning model is
feature engineering and feature selection.
Wick said feature engineering, which involves adding or creating new variables to
improve a model's output, is the main craft of data scientists and comes in various
forms.
Examples include extracting the days of the week or other variables from a data set, decomposing
variables into separate features, aggregating variables and
transforming features based on probability distributions.
Data scientists also must address feature selection -- choosing relevant features to
analyze and eliminating nonrelevant ones. Many features may look promising but lead
to problems like extended model training and overfitting, which limits a model's
ability to accurately analyze new data. Methods such as lasso regression and
automatic relevance determination can help with feature selection.
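The sketch below, on synthetic data, shows both ideas: decomposing a timestamp into simpler features and letting a Lasso model drive feature selection. Column names, values and the alpha setting are illustrative only.

# Feature engineering and selection sketch (synthetic data, illustrative only).
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(1)
df = pd.DataFrame({"order_ts": pd.date_range("2024-01-01", periods=200, freq="D")})

# Feature engineering: decompose the timestamp into simpler variables.
df["day_of_week"] = df["order_ts"].dt.dayofweek
df["month"] = df["order_ts"].dt.month
df["random_noise"] = rng.normal(size=200)          # an irrelevant feature on purpose

# Toy target that really depends only on the day of the week.
y = 100 + 10 * df["day_of_week"] + rng.normal(0, 5, size=200)

# Feature selection: keep only the features the Lasso assigns non-zero weight to.
X = df[["day_of_week", "month", "random_noise"]]
selector = SelectFromModel(Lasso(alpha=0.5)).fit(X, y)
print(dict(zip(X.columns, selector.get_support())))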
Machine Learning Activities:
Machine Learning Technology is already a part of all of our lives. It is making
decisions both for us and about us. It is the technology behind:
 Facial recognition
 Targeted advertising
 Voice recognition
 SPAM filters
 Machine translation
 Detecting credit card fraud
 Virtual Personal Assistants
 Self-driving cars
 ... and lots more.

To fully understand the opportunities and consequences of the machine learning filled
future, everyone needs to be able to ...
 Understand the basics of how machine learning works.
 Develop applications by training a machine learning engine.
 Use machine learning applications.
 Understand the Ethical and Societal Issues.
What is Machine Learning?
Machine Learning is a technology that “allows computers to perform specific tasks
intelligently, by learning from examples”. Rather than crafting an algorithm to do a
job step by step... you craft an algorithm that learns to do the job itself, and then you train it on large amounts of data. It is all about spotting patterns in massive amounts of data.
In practice creating machine learning tools is done in several steps.
1. First create a machine learning engine. It is a program implementing an
algorithm of how to learn in general. (This step is for experts!)
2. Next you train it on relevant data (e.g. images of animals). The more data it sees the better it gets
at recognising things or making decisions (e.g.
identifying animals).
3. You package up the newly trained tool in a user interface to make it easy for
anyone to use it.
4. Your users then use the new machine learning application by giving it new
data (e.g. you show it pictures of animals and it tells you what kind of animal
they are).
Here is an example of a robot with a machine learning brain. It reacts just to the tone
of voice – it doesn’t understand the words. It learnt very much like a dog does. It was
‘rewarded’ when it reacted in an appropriate way and was ‘punished’ when it reacted
in an inappropriate way. Eventually it learnt to behave like this.
Understanding how machine learning works
There are several ways to try to make a machine do tasks ‘intelligently’. For example:
 Rule-based systems (writing rules explicitly)
 Neural networks (copying the way our brains learn)
 Genetic algorithms (copying the way evolution improves species to fit their environment)
 Bayesian Networks (building in existing expert knowledge)

Types of data
Why is machine learning important?
Machine learning is a form of artificial intelligence (AI) that teaches computers to
think in a similar way to humans: learning and improving upon past experiences.
Almost any task that can be completed with a data-defined pattern or set of rules can

be automated with machine learning.
So, why is machine learning important? It allows companies to transform processes
that were previously only possible for humans to perform—think responding to
customer service calls, bookkeeping, and reviewing resumes for everyday businesses.
Machine learning can also scale to handle larger problems and technical questions—
think image detection for self-driving cars, predicting natural disaster locations and
timelines, and understanding the potential interaction of drugs with medical
conditions before clinical trials. That’s why machine learning is important.
Why is data important for machine learning?
Machine learning data analysis uses algorithms to continuously improve itself over
time, but quality data is necessary for these models to operate efficiently.
What is a dataset in machine learning?
A single row of data is called an instance. Datasets are a collection of instances that
all share a common attribute. Machine learning models will generally contain a few
different datasets, each used to fulfill various roles in the system.
For machine learning models to understand how to perform various actions, training
datasets must first be fed into the machine learning algorithm, followed by validation

datasets (or testing datasets) to ensure that the model is interpreting this data
accurately. Once you feed these training and validation sets into the system, subsequent datasets
can then be used to sculpt your machine learning model going forward. The more data
you provide to the ML system, the faster that model can learn and improve.
What type of data does machine learning need?
Data can come in many forms, but machine learning models rely on four primary data
types. These include numerical data, categorical data, time series data, and text data.

Numerical data
Numerical data, or quantitative data, is any form of measurable data such as your
height, weight, or the cost of your phone bill. You can determine if a set of data is
numerical by attempting to average out the numbers or sort them in ascending or descending order.
Exact or whole numbers (e.g., 26 students in a class) are considered discrete numbers, while those which fall into a given range (e.g., a 3.6 percent interest rate) are considered continuous numbers. While learning this type of data, keep in mind that numerical data is not tied to any specific point in time; it is simply raw numbers.
Categorical data
Categorical data is sorted by defining characteristics. This can include gender, social
class, ethnicity, hometown, the industry you work in, or a variety of other labels.
While learning this data type, keep in mind that it is non-numerical, meaning you are unable to add
them together, average them out, or sort them in any chronological
order. Categorical data is great for grouping individuals or ideas that share similar
attributes, helping your machine learning model streamline its data analysis.

Time series data

Time series data consists of data points that are indexed at specific points in time.
More often than not, this data is collected at consistent intervals. Learning and
utilizing time series data makes it easy to compare data from week to week, month to
month, year to year, or according to any other time-based metric you desire. The
distinct difference between time series data and numerical data is that time series data
has established starting and ending points, while numerical data is simply a collection
of numbers that aren’t rooted in particular time periods.
Text data

Text data is simply words, sentences, or paragraphs that can provide some level of
insight to your machine learning models. Since these words can be difficult for
models to interpret on their own, they are most often grouped together or analyzed
using various methods such as word frequency, text classification, or sentiment
analysis.
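A tiny pandas frame with made-up values shows how the four data types typically appear in practice:

# The four data types in one small pandas frame (values are made up).
import pandas as pd

df = pd.DataFrame({
    "monthly_bill": [42.5, 30.0, 55.2],                        # numerical
    "industry": pd.Categorical(["retail", "tech", "retail"]),  # categorical
    "reading_date": pd.to_datetime(
        ["2024-01-31", "2024-02-29", "2024-03-31"]),           # time series
    "review_text": ["great value", "too expensive", "ok so far"],  # text
})
print(df.dtypes)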
Where do engineers get datasets for machine learning?
There is an abundance of places where you can find machine learning data, from public dataset repositories and open data portals to data collected within your own organization.

Exploring structure of data


The data structure used for machine learning is quite similar to other software
development fields where it is often used. Machine Learning is a subset of artificial
intelligence that includes various complex algorithms to solve mathematical problems
to a great extent. Data structure helps to build and understand these complex

problems. Understanding the data structure also helps you to build ML models and
algorithms in a much more efficient way than other ML professionals.
What is Data Structure?

The data structure is defined as the basic building block of computer programming
that helps us to organize, manage and store data for efficient search and retrieval.
In other words, the data structure is the collection of data type 'values' which are
stored and organized in such a way that it allows for efficient access and modification.
Types of Data Structure
The data structure is the ordered sequence of data, and it tells the compiler how a programmer is using the data, such as Integer, String, Boolean, etc.
There are two different types of data structures: Linear and Non-linear data structures.

1. Linear Data structure:


The linear data structure is a special type of data structure that helps to organize and
manage data in a specific order where the elements are attached adjacently. There are mainly 4
types of linear data structure as follows:

Array:

An array is one of the most basic and common data structures used in Machine
Learning. It is also used in linear algebra to solve complex mathematical problems.
You will use arrays constantly in machine learning, whether it's:
 To convert the column of a data frame into a list format in pre-processing analysis.
 To order the frequency of words present in datasets.
 To use a list of tokenized words to begin clustering topics.
 In word embedding, by creating multi-dimensional matrices.
An array contains index numbers to represent an element starting from 0. The lowest
index is arr[0] and corresponds to the first element.
Let's take an example of a Python array used in machine learning. Although the Python array is quite different from arrays in other programming languages, the Python list is more popular because it is flexible about the data types it holds and its length. If you are using Python for ML algorithms, it is a good idea to start your journey with arrays and lists.
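As a hedged illustration of the uses listed above, the toy sketch below converts a data frame column to a list, orders word frequencies, and allocates a small embedding matrix. The data and dimensions are invented.

# How arrays/lists show up in ML pre-processing (toy data for illustration).
from collections import Counter
import numpy as np
import pandas as pd

df = pd.DataFrame({"review": ["good phone", "bad battery", "good screen"]})

# 1) Convert a data frame column into a list for pre-processing.
reviews = df["review"].tolist()

# 2) Order the frequency of words present in the dataset.
tokens = [word for review in reviews for word in review.split()]
print(Counter(tokens).most_common())

# 3) Multi-dimensional arrays (matrices) are the basis of word embeddings.
embedding_matrix = np.zeros((len(set(tokens)), 4))   # vocab_size x embedding_dim
print(embedding_matrix.shape)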

Unit II

Model Selection:
Model selection in machine learning is the crucial process of choosing the best
model from a set of candidate models for a specific task and dataset.
Why is Model Selection Important?
 Accuracy:
Different models have varying strengths and weaknesses, and selecting the right
one ensures the model performs well on the given task.
 Efficiency:
Model selection impacts training speed and computational resources required,
especially with large datasets.
 Generalization:
A well-chosen model generalizes well to new, unseen data, preventing overfitting
(performing well on training data but poorly on new data).
 Interpretability:
In some applications, such as healthcare or finance, understanding the model's
decision-making process is crucial. Simpler models like decision trees or logistic
regression may be preferred over complex neural networks.
 Real-world applications:
Model selection is vital for ensuring reliable predictions and scalable AI solutions,
particularly in fields like autonomous vehicles.
Key Aspects of Model Selection:
 Problem Definition:
Clearly defining the problem (e.g., classification, regression, clustering) guides the
selection of suitable model types.
 Data Characteristics:
Understanding the data's size, dimensionality, and complexity helps narrow down
model choices.
 Performance Metrics:
Selecting appropriate evaluation metrics (e.g., accuracy, precision, recall, MSE) is
crucial for comparing models.
 Model Complexity:
Balancing model complexity with the risk of overfitting is essential.

 Computational Resources:
Training time and available computing power influence model selection, especially
with large datasets or complex models.
 Interpretability:
The need for model interpretability can guide the selection of simpler models over
complex ones.
Common Model Selection Techniques:
 Cross-validation:
A technique to evaluate a model's performance on different subsets of the data,
providing a more robust estimate of its generalization ability.
 Grid search:
A method to explore different combinations of hyperparameters for a model to find
the best performing configuration.
 Regularization:
Techniques like L1 and L2 regularization can help prevent overfitting by penalizing
complex models.
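The techniques above can be combined in a few lines of scikit-learn. The sketch below compares three candidate models with 5-fold cross-validation on a built-in dataset; the candidates and scoring metric are illustrative choices, not a recommendation.

# Model selection sketch: compare candidate models with cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "decision_tree": DecisionTreeClassifier(max_depth=4),
    "svm": SVC(),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")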

Definition: Choosing the most suitable model for a given problem.


Approach:
Start with understanding the problem (regression, classification,
clustering).
Evaluate different algorithms (e.g., decision trees, SVM, logistic
regression, etc.) based on their suitability to your problem.
Consider factors like:
 Accuracy
 Interpretability
 Training time
 Overfitting or underfitting

1. Training the Model

Training a model in machine learning is the core process where a machine learning
algorithm learns from data to identify patterns, make predictions, or perform specific
tasks. This process involves the following key components and steps:
 Data:
Training begins with a dataset, which includes input features (attributes) and, in
supervised learning, corresponding output labels or targets. The model learns the
relationships between these inputs and outputs.

 Algorithm:
An algorithm, or learning algorithm, is the computational procedure that guides the
model's learning process. It defines the rules and mathematical operations the
model will use to analyze the data.
 Parameters:
During training, the model adjusts its internal parameters, often referred to as
weights and biases. These parameters are the model's internal settings that are
optimized to minimize the difference between its predictions and the actual
outcomes.
 Loss Function:
A loss function quantifies the discrepancy between the model's predictions and the
true target values. The goal of training is to minimize this loss, indicating improved
accuracy.
 Optimization:
Optimization techniques are used to iteratively adjust the model's parameters to
minimize the loss function. Gradient descent is a common optimization algorithm
that guides the model towards optimal parameter values.
The Training Process:
 Feeding Data: The model is exposed to the training data.
 Prediction: The model makes predictions based on its current parameters.
 Loss Calculation: The loss function calculates the error between the model's
predictions and the actual values.
 Parameter Adjustment: The optimization algorithm uses the calculated loss to
adjust the model's parameters, aiming to reduce future errors.
 Iteration: This process is iterative, meaning the model repeatedly learns from the
data, makes predictions, calculates loss, and adjusts parameters until a satisfactory
level of performance is achieved, or a stopping criterion is met.

 Definition: The process of teaching the model to recognize patterns from data.
 Approach:
o Split the data into training and validation sets (common split: 70% train,
30% test).
o Choose an optimization algorithm (e.g., Gradient Descent).
o Use a loss function to quantify how well the model performs.
o Train the model until convergence (when the loss stops decreasing
significantly).
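The training loop below is a minimal sketch of these components (data, predictions, a loss function, gradient-based parameter adjustment, and iteration) for a one-feature linear model on synthetic data.

# A minimal training loop: gradient descent on a one-feature linear model.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
y = 3.0 * X + 5.0 + rng.normal(0, 1, size=100)   # true weight 3, bias 5, plus noise

w, b = 0.0, 0.0                                  # parameters (weight and bias)
lr = 0.01                                        # learning rate
for epoch in range(1000):                        # iteration
    y_pred = w * X + b                           # prediction
    loss = np.mean((y_pred - y) ** 2)            # loss function (MSE)
    grad_w = 2 * np.mean((y_pred - y) * X)       # gradients of the loss
    grad_b = 2 * np.mean(y_pred - y)
    w -= lr * grad_w                             # parameter adjustment
    b -= lr * grad_b
print(f"learned w={w:.2f}, b={b:.2f}, final loss={loss:.3f}")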

2. Model Representation and Interpretability:

Model representation and interpretability in machine learning refer to how a model is
structured and how well its decision-making process can be understood by
humans. A model's representation impacts how easily its inner workings can be
deciphered. Interpretability, on the other hand, focuses on the ability to explain a
model's predictions and decisions in a way that is understandable to humans.
Model Representation:
 Transparent Models:
Some models, like linear regression or decision trees, are inherently transparent,
making their decision-making processes relatively easy to understand.
 Complex Models:
Deep learning models, with their multiple layers and intricate connections, are often
referred to as "black boxes" due to their lack of transparency.
 Representation Learning:
This involves learning meaningful representations from data to bridge the gap
between raw input and higher-level concepts, often used in deep learning.
Interpretability:
 Intrinsic Interpretability:
Focuses on building models that are inherently interpretable, meaning their
structure makes their decision-making process clear.
 Post-hoc Interpretability:
Involves using techniques to explain the predictions of existing complex models
(black boxes) after they have been trained.
 Examples of Interpretability Techniques:
 SHAP (SHapley Additive exPlanations) : Explains individual predictions by
calculating the contribution of each feature to the prediction.
 LIME (Local Interpretable Model-agnostic Explanations) : Explains individual
predictions by approximating the model locally with an interpretable model.
 Saliency Maps: Visualizes which parts of an input image are most important for a
deep learning model's prediction.
Why is Interpretability Important?
 Trust and Transparency:
Understanding how a model makes decisions builds trust and allows for greater
transparency in critical decision-making processes.
 Debugging and Improvement:
Interpretability helps in identifying potential biases, errors, or areas for model
improvement.
 Regulatory Compliance:

In many applications, especially in sensitive domains like healthcare or finance,
regulations may require model interpretability.

 Human-AI Collaboration:
Interpretable models enable better collaboration between humans and AI, allowing
for human oversight and intervention when needed.
 Model Representation: Refers to how the model is built or structured (e.g.,
decision trees have a tree structure, neural networks have layers).
 Interpretability: The ability to explain or understand the model’s decision-
making process.
o In simpler models (like linear regression or decision trees),
interpretability is easy.
o In complex models (like deep learning), interpretability can be harder,
but techniques like LIME, SHAP, and feature importance can help.
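SHAP and LIME are separate libraries, but scikit-learn's built-in permutation importance is enough to sketch the post-hoc idea: shuffle one feature at a time and see how much performance drops. The dataset and model below are illustrative.

# Post-hoc interpretability sketch: permutation feature importance.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)

# Print the five features whose shuffling hurts accuracy the most.
top = result.importances_mean.argsort()[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]}: {result.importances_mean[i]:.4f}")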

3. Evaluating the Performance of a Model:


Model evaluation is the process of assessing how well a machine learning model
performs on a given task, using various metrics and techniques. It helps determine
the model's reliability, accuracy, and ability to generalize to new, unseen data. This
process is crucial for ensuring that the model is effective and suitable for its intended
purpose, whether it's during the development phase or after deployment.
1. Why is model evaluation important?
 Ensure reliability:
Evaluation helps identify potential issues like overfitting or underfitting, ensuring
the model performs well in real-world scenarios.
 Compare models:
It allows you to compare different models and choose the best one for a specific
task.
 Improve model performance:
Evaluation provides insights into areas where the model can be improved, guiding
further training or adjustments.
 Inform decision-making:
It helps determine if the model is ready for deployment and if it meets performance
requirements.
2. Key Concepts:
 Metrics:
Model evaluation relies on various metrics, which are quantitative measures that
assess different aspects of model performance.
 Classification vs. Regression:

Different types of problems (classification and regression) require different
evaluation metrics.

 Training, Validation, and Test Sets:


Datasets are typically split into these sets to assess model performance during and
after training.
 Holdout and Cross-Validation:
These are techniques used to evaluate model performance on unseen data.
 Confusion Matrix:
A table that summarizes the performance of a classification model, showing true
positives, true negatives, false positives, and false negatives.
3. Common Evaluation Metrics:
 Classification: Accuracy, precision, recall, F1-score, ROC curve, AUC score.
 Regression: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean
Absolute Error (MAE), R-squared.
4. Model Evaluation Process:
1. Choose appropriate metrics: Select metrics relevant to the specific problem and
desired outcomes.
2. Split data: Divide the data into training, validation, and test sets.
3. Train and evaluate: Train the model and evaluate its performance using the chosen
metrics.
4. Analyze results: Identify strengths and weaknesses of the model based on the
evaluation metrics.
5. Iterate: Adjust model parameters or algorithms and repeat the evaluation process
until satisfactory performance is achieved.
6. Monitor performance: Continuously monitor model performance after deployment
to detect potential issues like data drift or model bias, according to IBM.
 Metrics:

Classification problems: Accuracy, Precision, Recall, F1-Score, ROC-AUC.
Regression problems: Mean Squared Error (MSE), R-squared, Mean Absolute Error (MAE).
 Cross-validation: Helps evaluate the performance of the model on unseen
data by splitting the data into k-folds and training the model on different folds.
 Confusion Matrix: Helps in understanding misclassifications in classification
models.
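A short scikit-learn sketch of this workflow on a built-in classification dataset; the model and split are arbitrary illustrations.

# Evaluation sketch: common classification metrics on a held-out test set.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1-score :", f1_score(y_test, y_pred))
print("roc-auc  :", roc_auc_score(y_test, y_prob))
print("confusion matrix:\n", confusion_matrix(y_test, y_pred))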

4. Improving the Performance of a Model:


Improving the performance of a machine learning model involves various strategies,
often applied iteratively:
Data Quality and Quantity:
 Collect More Data: Increasing the size of the training dataset can help the model
learn more complex patterns and generalize better.
 Data Cleaning and Preprocessing: Address missing values, outliers,
inconsistencies, and transform data into suitable formats (e.g., normalization,
standardization, encoding categorical features).
 Feature Engineering: Create new features from existing ones or select the most
relevant features to provide more informative inputs to the model.
Model Selection and Architecture:
 Choose the Right Algorithm: Different algorithms are suited for different types of
data and problems. Experiment with various algorithms to find the best fit.
 Optimize Model Architecture: For complex models like neural networks, adjust
the number of layers, neurons, and activation functions.
Hyperparameter Tuning:
Tune Hyperparameters: Systematically adjust the parameters that control the
learning process of the model (e.g., learning rate, regularization strength, number of
neighbors in K-NN) using techniques like grid search or random search.
Training and Evaluation Techniques:
 Cross-Validation: Use techniques like k-fold cross-validation to get a more robust
estimate of model performance and reduce overfitting.
 Regularization: Apply techniques like L1 or L2 regularization to prevent
overfitting by penalizing complex models.
 Ensemble Methods: Combine predictions from multiple models (e.g., bagging,
boosting, stacking) to improve overall performance and robustness.
Error Analysis:
 Analyze Mispredictions: Examine instances where the model makes incorrect
predictions to identify patterns of errors and understand the model's
limitations. This can guide further data collection, feature engineering, or model
adjustments.
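As a sketch of hyperparameter tuning combined with cross-validation, the example below grid-searches a small, illustrative parameter grid for a gradient boosting classifier.

# Improvement sketch: hyperparameter tuning with k-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print("best params:", search.best_params_)
print("best cv accuracy:", round(search.best_score_, 3))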

5. Feature Engineering:
Definition: The process of using domain knowledge to create features that
make machine learning algorithms work.
Approach:
Feature Transformation:

 Scaling: Normalize or standardize features (e.g., Min-Max
scaling, Z-score).
 Encoding categorical variables: One-hot encoding, Label
encoding.
 Polynomial features: To capture interactions between features.
 Log transformations for skewed distributions.

Feature Subset Selection:


Filter methods: Use statistical tests (e.g., correlation, Chi-squared test)
to remove irrelevant features.
Wrapper methods: Use search techniques like recursive feature
elimination (RFE) to select the best subset.
Embedded methods: Feature selection as part of the model training
process (e.g., Lasso, Decision Trees).

Example:
Improving a model using feature engineering and selection
Let's say you're working on a model predicting house prices. After training a basic
linear regression model, the performance is mediocre. Here’s how you could improve
it:
1. Feature Transformation:
o Normalize continuous features like square footage or price.
o One-hot encode categorical variables like neighborhood or house type.
o Add polynomial features for interaction terms, like "bedrooms * square
footage".
2. Feature Selection:
o Apply a correlation matrix to see which features are strongly correlated
with the target variable (price).
o Use Recursive Feature Elimination (RFE) to remove less important
features.
3. Model Improvement:
o Apply Regularization (like Lasso) to penalize less relevant features.
o Experiment with ensemble models (e.g., Gradient Boosting) to improve
prediction accuracy.
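A hedged sketch of that recipe on synthetic house-price data: scaling and polynomial expansion for the numeric features, one-hot encoding for the neighborhood, and a Lasso model that shrinks less relevant terms. Column names and coefficients are made up.

# Sketch of the house-price improvements described above (synthetic data).
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "sqft": rng.uniform(500, 3500, n),
    "bedrooms": rng.integers(1, 6, n),
    "neighborhood": rng.choice(["north", "south", "east"], n),
})
price = 50_000 + 150 * df["sqft"] + 20_000 * df["bedrooms"] + rng.normal(0, 25_000, n)

numeric = ["sqft", "bedrooms"]
categorical = ["neighborhood"]
preprocess = ColumnTransformer([
    ("num", Pipeline([("scale", StandardScaler()),
                      ("poly", PolynomialFeatures(degree=2, include_bias=False))]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
model = Pipeline([("prep", preprocess), ("lasso", Lasso(alpha=1.0))])
scores = cross_val_score(model, df, price, cv=5, scoring="r2")
print("mean R^2:", round(scores.mean(), 3))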

Feature subset selection, also known as feature selection or variable selection, is a


crucial step in machine learning and data preprocessing. It involves choosing a
relevant subset of features (input variables) from a larger set to build a predictive
model. The primary goal is to identify and remove irrelevant and redundant features,
which can lead to several benefits:
Reduced Dimensionality: Decreases the number of input variables, simplifying the
dataset.

Improved Model Performance: Eliminates noise and irrelevant information that can
confuse the model, potentially leading to higher accuracy and better generalization on
unseen data.
Faster Training Times: With fewer features, algorithms can train models more
efficiently.
Enhanced Interpretability: A simpler model with fewer features is often easier to
understand and interpret.
Reduced Overfitting: By focusing on truly relevant features, the model is less likely to
memorize the training data and generalize poorly to new data.

Common Approaches to Feature Subset Selection:

Filter Methods: These methods select features based on their inherent characteristics, independent of the chosen machine learning algorithm. They use statistical measures like correlation, mutual information, or chi-square tests to rank features and select the top-performing ones. Filters are computationally efficient but may not always select the optimal subset for a specific model.

Wrapper Methods: These methods use a specific machine learning algorithm to evaluate the performance of different feature subsets. They involve training and testing the model with various subsets and selecting the one that yields the best performance (e.g., highest accuracy or lowest error rate). Examples include Sequential Forward Selection (SFS) and Sequential Backward Elimination (SBE). Wrappers are computationally more intensive than filters but often result in a feature set optimized for the chosen model.

Embedded Methods: These methods integrate feature selection directly into the model training
process. Certain algorithms inherently perform feature selection as part of their
learning mechanism, such as Lasso regression or tree-based models like Random
Forests and Gradient Boosting.

Process of Feature Subset Selection:

The general process involves:

Subset Generation: Creating candidate feature subsets (e.g., using forward selection, backward elimination, or other search strategies).
Subset Evaluation: Assessing the quality of each candidate subset using a chosen metric (e.g., model accuracy for wrappers, statistical scores for filters).
Stopping Criterion: Defining when to stop the search for optimal subsets (e.g., when a certain performance threshold is met or a fixed number of features is selected).
Validation: Evaluating the selected feature subset on an independent dataset to ensure its generalization capabilities.
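To contrast a filter method with a wrapper method in code, the sketch below applies SelectKBest and RFE to the same built-in dataset; keeping five features is an arbitrary choice.

# Filter vs. wrapper selection on the same data (illustrative only).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
X, y = data.data, data.target

# Filter method: rank features with a statistical score, keep the top 5.
filter_sel = SelectKBest(score_func=f_classif, k=5).fit(X, y)
print("filter keeps :", list(data.feature_names[filter_sel.get_support()]))

# Wrapper method: let a model repeatedly discard the weakest feature.
wrapper_sel = RFE(LogisticRegression(max_iter=5000), n_features_to_select=5).fit(X, y)
print("wrapper keeps:", list(data.feature_names[wrapper_sel.get_support()]))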


Unit III
C. Abdul Hakeem College of Engineering & Technology
Department of Master of Computer Applications
MC4301 - Machine Learning
Unit 3
Bayesian Learning
Basic Probability Notation
Uncertainty:
We might write A→B, which means if A is true then B is true. But consider a situation where we are not sure whether A is true or not; then we cannot express this statement with certainty. This situation is called uncertainty.
So to represent uncertain knowledge, where we are not sure about the predicates, we
need uncertain reasoning or probabilistic reasoning.
Causes of uncertainty:
Following are some leading causes of uncertainty to occur in the real world.
1. Information occurred from unreliable sources.
2. Experimental Errors
3. Equipment fault
4. Temperature variation
5. Climate change.
Probabilistic reasoning:
Probabilistic reasoning is a way of knowledge representation where we apply the
concept of probability to indicate the uncertainty in knowledge. In probabilistic
reasoning, we combine probability theory with logic to handle the uncertainty.
We use probability in probabilistic reasoning because it provides a way to handle the
uncertainty that is the result of someone's laziness and ignorance.
In the real world, there are lots of scenarios, where the certainty of something is not
confirmed, such as "It will rain today," "behavior of someone for some situations," "A
match between two teams or two players." These are probable sentences for which we
can assume that it will happen but not sure about it, so here we use probabilistic
reasoning.
Need of probabilistic reasoning in AI:
 When there are unpredictable outcomes.
 When the specifications or possibilities of predicates become too large to handle.
 When an unknown error occurs during an experiment.
In probabilistic reasoning, there are two ways to solve problems with uncertain
knowledge:
 Bayes' rule
 Bayesian Statistics
As probabilistic reasoning uses probability and related terms, so before understanding
probabilistic reasoning, let's understand some common terms:
Probability: Probability can be defined as a chance that an uncertain event will occur.
It is the numerical measure of the likelihood that an event will occur. The value of
probability always remains between 0 and 1 that represent ideal uncertainties.
0 ≤ P(A) ≤ 1, where P(A) is the probability of an event A.
P(A) = 0 indicates total uncertainty in an event A.
P(A) = 1 indicates total certainty in an event A.
We can find the probability of an uncertain event by using the below formula:
Probability of occurrence = Number of desired outcomes / Total number of outcomes
P(¬A) = probability of event A not happening.
P(¬A) + P(A) = 1.
Event: Each possible outcome of a variable is called an event.
Sample space: The collection of all possible events is called sample space.
Random variables: Random variables are used to represent the events and objects in
the real world.
Prior probability: The prior probability of an event is probability computed before
observing new information.
Posterior Probability: The probability that is calculated after all evidence or information has been taken into account. It is a combination of prior probability and new information.
Conditional probability:
Conditional probability is the probability of an event occurring when another event has already happened.
Let's suppose we want to calculate the probability of event A when event B has already occurred, "the probability of A under the condition B". It can be written as:
P(A|B) = P(A⋀B) / P(B)
Where P(A⋀B) = joint probability of A and B, and P(B) = marginal probability of B.
If the probability of A is given and we need to find the probability of B, then it will be given as:
P(B|A) = P(A⋀B) / P(A)
It can be explained by using a Venn diagram: since event B has already occurred, the sample space is reduced to set B, and we can now calculate event A by dividing the probability of P(A⋀B) by P(B).
Example:
In a class, 70% of the students like English and 40% of the students like both English and Mathematics. What percent of the students who like English also like Mathematics?
Solution:
Let A be the event that a student likes Mathematics and B be the event that a student likes English.
P(A|B) = P(A⋀B) / P(B) = 0.4 / 0.7 ≈ 0.57
Hence, 57% of the students who like English also like Mathematics.
Inference
Inference means to find a conclusion based on the facts, information, and
evidence. In simple words, when we conclude the facts and figures to reach a
particular decision, that is called inference. In artificial intelligence, the expert system
or any agent performs this task with the help of the inference engine. In the inference
engine, the information and facts present in the knowledge base are considered
according to the situation and the engine makes the conclusion out of these facts,
based on which the further processing and decision making takes place in the agent.
The inference process in an agent takes place according to some rules, which are
known as the inference rules or rule of inference. Following are the major types of
inference rules that are used:
1) Addition: This inference rule is stated as follows:
P
----------
∴ P v Q

2) Simplification: This inference rule states that:
P ^ Q
----------
∴ P
OR
P ^ Q
----------
∴ Q

3) Modus Ponens: This is the most widely used inference rule. It states:
P -> Q
P
----------
∴ Q

4) Modus Tollens: This rule states that:
P -> Q
~Q
----------
∴ ~P

5) Forward Chaining: It is a type of deductive inference rule. It states that:
P
P -> Q
----------
∴ Q

6) Backward Chaining: This is also a type of deductive inference rule. This rule states that:
Q
P -> Q
----------
∴ P

7) Resolution: In reasoning by resolution, we are given the goal condition and the available facts and statements. Using these facts and statements, we have to decide whether the goal condition is true or not, i.e. whether it is possible for the agent to reach the goal state. We prove this by the method of contradiction. This rule states that:
P v Q
~P v R
----------
∴ Q v R

8) Hypothetical Syllogism: This rule states the transitive relation between the statements:
P -> Q
Q -> R
----------
∴ P -> R

9) Disjunctive Syllogism: This rule is stated as follows:
P v Q
~P
----------
∴ Q

Machine learning (ML) inference is the process of running live data points into a
machine learning algorithm (or “ML model”) to calculate an output such as a single
numerical score. This process is also referred to as “operationalizing an ML model” or
“putting an ML model into production.” When an ML model is running in production,
it is often then described as artificial intelligence (AI) since it is performing functions
similar to human thinking and analysis. Machine learning inference basically entails
deploying a software application into a production environment, as the ML model is
typically just software code that implements a mathematical algorithm. That
algorithm makes calculations based on the characteristics of the data, known as
“features” in the ML vernacular.
An ML lifecycle can be broken up into two main, distinct parts. The first is the
training phase, in which an ML model is created or “trained” by running a specified
subset of data into the model. ML inference is the second phase, in which the model is
put into action on live data to produce actionable output. The data processing by the
ML model is often referred to as “scoring,” so one can say that the ML model scores
the data, and the output is a score.
ML inference is generally deployed by DevOps engineers or data engineers.
Sometimes the data scientists, who are responsible for training the models, are asked
to own the ML inference process. This latter situation often causes significant
obstacles in getting to the ML inference stage, since data scientists are not necessarily
skilled at deploying systems. Successful ML deployments often are the result of tight
coordination between different teams, and newer software technologies are also often
deployed to try to simplify the process. An emerging discipline known as “MLOps” is
starting to put more structure and resources around getting ML models into
production and maintaining those models when changes are needed.
How Does Machine Learning Inference Work?
To deploy a machine learning inference environment, you need three main
components in addition to the model:
1. One or more data sources
2. A system to host the ML model

3. One or more data destinations
In machine learning inference, the data sources are typically a system that captures the
live data from the mechanism that generates the data. The host system for the machine
learning model accepts data from the data sources and inputs the data into the
machine learning model. The data destinations are where the host system should
deliver the output score from the machine learning model.
The data sources are typically a system that captures the live data from the mechanism
that generates the data. For example, a data source might be an Apache Kafka cluster
that stores data created by an Internet of Things (IoT) device, a web application log
file, or a point-of-sale (POS) machine. Or a data source might simply be a web
application that collects user clicks and sends data to the system that hosts the ML
model.
The host system for the ML model accepts data from the data sources and inputs the
data into the ML model. It is the host system that provides the infrastructure to turn
the code in the ML model into a fully operational application. After an output is
generated from the ML model, the host system then sends that output to the data
destinations. The host system can be, for example, a web application that accepts data
input via a REST interface, or a stream processing application that takes an incoming
feed of data from Apache Kafka to process many data points per second.
The data destinations are where the host system should deliver the output score from
the ML model. A destination can be any type of data repository like Apache Kafka or
a database, and from there, downstream applications take further action on the scores.
For example, if the ML model calculates a fraud score on purchase data, then the
applications associated with the data destinations might send an “approve” or “decline”
message back to the purchase site.
Challenges of Machine Learning Inference
As mentioned earlier, the work in ML inference can sometimes be misallocated to the
data scientist. If given only a low-level set of tools for ML inference, the data scientist
may not be successful in the deployment.
Additionally, DevOps and data engineers are sometimes not able to help with
deployment, often due to conflicting priorities or a lack of understanding of what’s
required for ML inference. In many cases, the ML model is written in a language like
Python, which is popular among data scientists, but the IT team is more well-versed in
a language like Java. This means that engineers must take the Python code and
translate it to Java to run it within their infrastructure. In addition, the deployment of
ML models requires some extra coding to map the input data into a format that the
ML model can accept, and this extra work adds to the engineers’ burden when
deploying the ML model.
Also, the ML lifecycle typically requires experimentation and periodic updates to the
ML models. If deploying the ML model is difficult in the first place, then updating
models will be almost as difficult. The whole maintenance effort can be difficult, as
there are business continuity and security issues to address.
Another challenge is attaining suitable performance for the workload. REST-based
systems that perform the ML inference often suffer from low throughput and high
latency. This might be suitable for some environments, but modern deployments that
deal with IoT and online transactions are facing huge loads that can overwhelm these
simple REST-based deployments. And the system needs to be able to scale to not only
handle growing workloads but to also handle temporary load spikes while retaining
consistent responsiveness.
Independence
1. The intuition of Conditional Independence
Let’s say A is the height of a child and B is the number of words that the child
knows. It seems when A is high, B is high too.
There is a single piece of information that will make A and B completely
independent. What would that be?
The child’s age.
The height and the # of words known by the kid are NOT independent, but they are
conditionally independent if you provide the kid’s age.
2. Mathematical Form
A: The height of a child
B: The # of words that the child knows
C: The child's age
A better way to remember the expression:
Conditional independence is basically the concept of independence P(A ∩ B) = P(A)
* P(B) applied to the conditional model.
Why is P(A|B ∩ C) = P(A|C) when (A ㅛ B)|C?
Here goes the proof:
P(A|B ∩ C) = P(A ∩ B|C) / P(B|C) = [P(A|C) * P(B|C)] / P(B|C) = P(A|C)
The gist of conditional independence: Knowing C makes A and B independent.
P(A,B|C) = P(A|C) * P(B|C)
3. Applications
Why does the conditional independence even matter?
Because it is a foundation for many statistical models that we use. (e.g., latent class
models, factor analysis, graphical models, etc.)
A. Conditional Independence in Bayesian Network (aka Graphical Models)
A Bayesian network represents a joint distribution using a graph. Specifically, it is
a directed acyclic graph in which each edge is a conditional dependency, and each
node is a distinctive random variable. It has many other names: belief network,
decision network, causal network, Bayes(ian) model or probabilistic directed acyclic
graphical model, etc.
It looks like so:
In order for the Bayesian network to model a probability distribution, it relies on
the important assumption: each variable is conditionally independent of its
non-descendants, given its parents.
For instance, we can simplify P(Grass Wet|Sprinkler, Rain) into P(Grass
Wet|Sprinkler) since Grass Wet is conditionally independent of its non-descendant,
Rain, given Sprinkler.
Using this property, we can simplify the whole joint distribution into the formula
below:
P(X1, ..., Xn) = ∏ P(Xi | Parents(Xi))
What is so great about this approximation (the conditional independence assumption)?
Conditional independence between variables can greatly reduce the
number of parameters.
This reduces so much of the computation since we now only take into account its
parent and disregard everything else.
Let’s take a look at the numbers.
Let’s say you have n binary variables (= n nodes).
The unconstrained joint distribution requires O(2^n) probabilities.
For a Bayesian Network, with a maximum of k parents for any node, we need
only O(n * 2^k) probabilities. (This can be carried out in linear time for certain
numbers of classes.)
n = 30 binary variables, k = 4 maximum parents for nodes• Unconstrained Joint
Distribution: needs 2^30 (about 1 million) probabilities -> Intractable!• Bayesian
Network: needs only 480 probabilities
We can have an efficient factored representation for a joint
distribution using Conditional independence.
B. Conditional Independence in Bayesian Inference
Let’s say I’d like to estimate the engagement (clap) rate of my blog. Let p be the
proportion of readers who will clap for my articles. We’ll choose n readers randomly
from the population. For i = 1, …, n, let Xi = 1 if the reader claps or Xi = 0 if s/he
doesn’t.
In a frequentist approach, we don’t assign the probability distribution to p.
p would be simply ‘sum (Xi) / n’.
And we would treat X1, …, Xn as independent random variables.
On the other hand, in Bayesian inference, we assume p follows a distribution, not
just a constant. In this model, the random variables X1, …, Xn are NOT
independent, but they are conditionally independent given the distribution of p.
C. Correlation ≠ Causation
“Correlation is not causation” means that just because two things correlate does not
necessarily mean that one causes the other.
Here is a hilarious example of taxi accidents:
A study has shown a positive and significant correlation between the number of
accidents and taxi drivers’ wearing coats. They found that coats might hinder the
driver’s movements and cause accidents. A new law was ready to ban taxi drivers
from wearing coats while driving.
Until another study pointed out that people wear coats when it rains…
P(accidents, coats | rain) = P(accidents | rain) * P(coats | rain)
Correlations between two things can be caused by a third factor that affects both of
them. This third factor is called a confounder. The confounder, which is rain, was
responsible for the correlation between accident and wearing coats.
P(accidents | coats, rain) = P(accidents | rain)
Note that this does NOT mean accidents are independent of wearing coats. What it means is: given that it is raining, knowing whether drivers wear coats doesn't give any more information about accidents.
4. Conditional Independence vs Marginal Independence
Marginal independence is just the same as plain independence: P(A ∩ B) = P(A) * P(B).
Sometimes, two random variables might not be marginally independent. However,
they can become independent after we observe some third variable.
5. More examples!

Amount of speeding fine ㅛ Type of car | Speed

Lung cancer ㅛ Yellow teeth | Smoking

Child’s genes ㅛ Grandparents’ genes | Parents’ genes

Is the car’s starter motor working? ㅛ Is the car’s radio working?| Battery

Future ㅛ Past | Present (This is the Markov assumption!)
They are all in the same form as A ㅛ B | C.
A and B look related if we don’t take C into account. However, once we include C in
the picture, then the apparent relationship between A and B disappears. As you see,
any causal relationship is potentially conditionally independent. We will never know
for sure about the relationship between A & B until we test every possible C
(confounding variable)!
Bayes’ Rule
Bayes' Rule is the most important rule in data science. It is the mathematical rule that
describes how to update a belief, given some evidence. In other words – it describes
the act of learning.
The equation itself is not too complex:
The equation: Posterior = Prior x (Likelihood over Marginal probability)
There are four parts:

Posterior probability (updated probability after the evidence is considered)

Prior probability (the probability before the evidence is considered)

Likelihood (probability of the evidence, given the belief is true)

Marginal probability (probability of the evidence, under any circumstance)
Bayes' Rule can answer a variety of probability questions, which help us (and
machines) understand the complex world we live in.
It is named after Thomas Bayes, an 18th century English theologian and
mathematician. Bayes originally wrote about the concept, but it did not receive much
attention during his lifetime.
French mathematician Pierre-Simon Laplace independently published the rule in his
1814 work Essai philosophique sur les probabilités.
Today, Bayes' Rule has numerous applications, from statistical analysis to machine
learning.
Conditional probability
Conditional probability is the bridge that lets you talk about how multiple uncertain

events are related. It lets you talk about how the probability of an event can vary
under different conditions.
For example, consider the probability of winning a race, given the condition you
didn't sleep the night before. You might expect this probability to be lower than the
probability you'd win if you'd had a full night's sleep.
Or, consider the probability that a suspect committed a crime, given that their
fingerprints are found at the scene. You'd expect the probability they are guilty to be
greater, compared with had their fingerprints not been found.
The notation for conditional probability is usually:
P(A|B)
Which is read as "the probability of event A occurring, given event B occurs".
An important thing to remember is that conditional probabilities are not the same as
their inverses.
That is, the "probability of event A given event B" is not the same thing as the
"probability of event B, given event A".
To remember this, take the following example:
The probability of clouds, given it is raining (100%) is not the same as
the probability it is raining, given there are clouds.
Bayes' Rule in detail
Bayes' Rule tells you how to calculate a conditional probability with information you
already have.
It is helpful to think in terms of two events – a hypothesis (which can be true or false)
and evidence (which can be present or absent).
However, it can be applied to any type of events, with any number of discrete or
continuous outcomes.
Bayes' Rule lets you calculate the posterior (or "updated") probability. This is a
conditional probability. It is the probability of the hypothesis being true, if the
evidence is present.
Think of the prior (or "previous") probability as your belief in the hypothesis
before seeing the new evidence. If you had a strong belief in the hypothesis already,
the prior probability will be large.
The prior is multiplied by a fraction. Think of this as the "strength" of the evidence.
The posterior probability is greater when the top part (numerator) is big, and the
bottom part (denominator) is small.
The numerator is the likelihood. This is another conditional probability. It is the
probability of the evidence being present, given the hypothesis is true.
This is not the same as the posterior!
Remember, the "probability of the evidence being present given the hypothesis is
true" is not the same as the "probability of the hypothesis being true given the
evidence is present".
Now look at the denominator. This is the marginal probability of the evidence. That
is, it is the probability of the evidence being present, whether the hypothesis is true or
false. The smaller the denominator, the more "convincing" the evidence.
Worked example of Bayes' Rule
Here's a simple worked example.

Your neighbour is watching their favourite football (or soccer) team. You hear them
cheering, and want to estimate the probability their team has scored.
Step 1 – write down the posterior probability of a goal, given cheering
Step 2 – estimate the prior probability of a goal as 2%
Step 3 – estimate the likelihood probability of cheering, given there's a goal as 90%
(perhaps your neighbour won't celebrate if their team is losing badly)
Step 4 – estimate the marginal probability of cheering
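Step 5 would be to divide: posterior = prior x likelihood / marginal. The tiny sketch below completes the arithmetic, with the marginal probability of cheering set to an assumed illustrative value since the text does not give one.

# Completing the cheering example numerically (the marginal here is an assumed
# illustrative value, not a figure given in the text).
prior = 0.02        # P(goal) in any given minute
likelihood = 0.90   # P(cheering | goal)
marginal = 0.03     # assumed P(cheering) from all causes, for illustration

posterior = prior * likelihood / marginal        # Bayes' Rule
print(f"P(goal | cheering) = {posterior:.2f}")   # 0.60 with these numbers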

Unit 4
C. Abdul Hakeem College of Engineering & Technology
Department of Master of Computer Applications
MC4301 - Machine Learning
Unit 4
Parametric Machine Learning
Logistic Regression
What are the differences between supervised learning, unsupervised learning &
reinforcement learning?
Machine learning algorithms are broadly classified into three categories - supervised
learning, unsupervised learning, and reinforcement learning.
1. Supervised Learning - Learning where data is labeled and the motivation is
to classify something or predict a value. Example: Detecting fraudulent
transactions from a list of credit card transactions.
2. Unsupervised Learning - Learning where data is not labeled and the
motivation is to find patterns in given data. In this case, you are asking the
machine learning model to process the data from which you can then draw
conclusions. Example: Customer segmentation based on spend data.
3. Reinforcement Learning - Learning by trial and error. This is the closest to
how humans learn. The motivation is to find optimal policy of how to act in a
given environment. The machine learning model examines all possible actions,
makes a policy that maximizes benefit, and implements the policy(trial). If
there are errors from the initial policy, apply reinforcements back into the
algorithm and continue to do this until you reach the optimal policy. Example:
Personalized recommendations on streaming platforms like YouTube.
What are the two types of supervised learning?
As supervised learning is used to classify something or predict a value, naturally there
are two types of algorithms for supervised learning - classification models and
regression models.
1. Classification model - In simple terms, a classification model predicts
possible outcomes. Example: Predicting if a transaction is fraud or not.
2. Regression model - Are used to predict a numerical value. Example:
Predicting the sale price of a house.
What is logistic regression?
Logistic regression is an example of supervised learning. It is used to calculate or
predict the probability of a binary (yes/no) event occurring. An example of logistic
regression could be applying machine learning to determine if a person is likely to be
infected with COVID-19 or not. Since we have two possible outcomes to this question
- yes they are infected, or no they are not infected - this is called binary classification.
In this imaginary example, the probability of a person being infected with COVID-19
could be based on the viral load and the symptoms and the presence of antibodies, etc.
Viral load, symptoms, and antibodies would be our factors (Independent Variables),
which would influence our outcome (Dependent Variable).
How is logistic regression different from linear regression?
In linear regression, the outcome is continuous and can be any possible value.
However in the case of logistic regression, the predicted outcome is discrete and
restricted to a limited number of values.
For example, say we are trying to apply machine learning to the sale of a house. If we
are trying to predict the sale price based on the size, year built, and number of stories
we would use linear regression, as linear regression can predict a sale price of any
possible value. If we are using those same factors to predict whether the house sells or not,
we would use logistic regression, as the possible outcomes here are restricted to yes or no.
Hence, linear regression is an example of a regression model and logistic regression is
an example of a classification model.
Where to use logistic regression
Logistic regression is used to solve classification problems, and the most common use
case is binary logistic regression, where the outcome is binary (yes or no). In the real
world, you can see logistic regression applied across multiple areas and fields.
 In health care, logistic regression can be used to predict if a tumor is likely to
be benign or malignant.
 In the financial industry, logistic regression can be used to predict if a
transaction is fraudulent or not.
 In marketing, logistic regression can be used to predict if a targeted audience
will respond or not.
Are there other use cases for logistic regression aside from binary logistic regression?
Yes. There are two other types of logistic regression that depend on the number of
predicted outcomes.
The three types of logistic regression
1. Binary logistic regression - When we have two possible outcomes, like our
original example of whether a person is likely to be infected with COVID-19
or not.
2. Multinomial logistic regression - When we have multiple outcomes, say if
we build out our original example to predict whether someone may have the
flu, an allergy, a cold, or COVID-19.
3. Ordinal logistic regression - When the outcome is ordered, like if we build
out our original example to also help determine the severity of a COVID-19
infection, sorting it into mild, moderate, and severe cases.
Training data assumptions for logistic regression
Training data that satisfies the below assumptions is usually a good fit for logistic
regression.

 The predicted outcome is strictly binary or dichotomous. (This applies to
binary logistic regression.)
 The factors, or the independent variables, that influence the outcome are
independent of each other. In other words, there is little or no multicollinearity
among the independent variables.
 The independent variables can be linearly related to the log odds.
 The sample size is fairly large.
If your training data does not satisfy the above assumptions, logistic regression may
not work for your use case.
Mathematics behind logistic regression
Probability always ranges between 0 (does not happen) and 1 (happens). Using our
COVID-19 example, in the case of binary classification, the probabilities of testing
positive and of not testing positive sum to 1. We use the logistic function, or sigmoid
function, to calculate probability in logistic regression. The logistic function is a
simple S-shaped curve used to convert data into a value between 0 and 1.
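As a minimal sketch of this mapping (plain NumPy, with made-up weights and bias rather
than learned ones), a linear combination of the factors is passed through the sigmoid to
obtain a probability between 0 and 1:

import numpy as np

def sigmoid(z):
    # Squashes any real-valued input into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical factors for one person: viral load, symptom score, antibody level
x = np.array([2.5, 1.0, 0.3])
w = np.array([0.8, 0.5, -1.2])   # illustrative (not learned) weights
b = -1.0                         # illustrative bias

z = np.dot(w, x) + b             # linear combination of the inputs
p = sigmoid(z)                   # predicted probability of the positive class
print(f"P(infected) = {p:.3f}")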
Classification and representation
What is Supervised Learning?
In Supervised Learning, the model learns by example. Along with our input variable,
we also give our model the corresponding correct labels. While training, the model
gets to look at which label corresponds to our data and hence can find patterns
between our data and those labels.
Some examples of Supervised Learning include:
1. Spam detection, by teaching a model which mail is spam and which is not.
2. Speech recognition, where you teach a machine to recognize your voice.
3. Object recognition, by showing a machine what an object looks like and
having it pick that object out from among other objects.
We can further divide Supervised Learning into the following:
Figure 1: Supervised Learning Subdivisions
What is Classification?
Classification is defined as the process of recognition, understanding, and grouping of
objects and ideas into preset categories a.k.a "sub-populations." With the help of these
pre-categorized training datasets, classification in machine learning leverages a wide
range of algorithms to classify future datasets into the respective and relevant
categories.
Classification algorithms used in machine learning utilize input training data for the
purpose of predicting the likelihood or probability that the data that follows will fall
into one of the predetermined categories. One of the most common applications of
classification is for filtering emails into “spam” or “non-spam”, as used by today’s top
email service providers.
In short, classification is a form of "pattern recognition": classification
algorithms applied to the training data find the same patterns (similar number
sequences, words or sentiments, and the like) in future data sets.
We will explore classification algorithms in detail, and discover how text analysis
software can perform actions like sentiment analysis - categorizing unstructured text
by opinion polarity (positive, negative, neutral, and the like).

Figure 2: Classification of vegetables and groceries
What is Classification Algorithm?
Based on training data, the Classification algorithm is a Supervised Learning
technique used to categorize new observations. In classification, a program uses the
dataset or observations provided to learn how to categorize new observations into
various classes or groups. For instance, 0 or 1, red or blue, yes or no, spam or not
spam, etc. Targets, labels, or categories can all be used to describe classes. The
Classification algorithm uses labeled input data because it is a supervised learning
technique, with the data comprising both input and output information. In the
classification process, the input variables (x) are mapped to a discrete output (y).
In simple words, classification is a type of pattern recognition in which classification
algorithms are performed on training data to discover the same pattern in new data
sets.
Learners in Classification Problems:
There are two types of learners.
Lazy Learners
It first stores the training dataset before waiting for the test dataset to arrive. When
using a lazy learner, the classification is carried out using the training dataset's most
appropriate data. Less time is spent on training, but more time is spent on predictions.
Some of the examples are case-based reasoning and the KNN algorithm.
Eager Learners
Before obtaining a test dataset, eager learners build a classification model using a
training dataset. They spend more time studying and less time predicting. Some of the
examples are ANN, naive Bayes, and Decision trees.
4 Types Of Classification Tasks In Machine Learning
Before diving into the four types of Classification Tasks in Machine Learning, let us
first discuss Classification Predictive Modeling.
Classification Predictive Modeling
A classification problem in machine learning is one in which a class label is
anticipated for a specific example of input data.
Examples of classification problems include the following:
 Given an example, indicate whether it is spam or not.
 Identify a handwritten character as one of the recognized characters.
 Determine whether to label the current user behavior as churn.
A training dataset with numerous examples of inputs and outputs is necessary for
classification from a modeling standpoint.
A model will determine the optimal way to map samples of input data to certain class
labels using the training dataset. The training dataset must therefore contain a large
number of samples of each class label and be suitably representative of the problem.
When providing class labels to a modeling algorithm, string values like "spam" or
"not spam" must first be converted to numeric values. Label encoding, which is
frequently used, assigns a distinct integer to every class label, such as "spam" = 0,
"not spam" = 1.
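A quick sketch of this encoding step using scikit-learn's LabelEncoder (the label
strings are illustrative; note that LabelEncoder assigns integers in sorted order, so
the exact 0/1 mapping may differ from the convention above):

from sklearn.preprocessing import LabelEncoder

labels = ["spam", "not spam", "not spam", "spam"]

encoder = LabelEncoder()
y = encoder.fit_transform(labels)   # each distinct string becomes a distinct integer
print(y)                            # e.g. [1 0 0 1]
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))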
There are many different algorithms available for classification predictive modeling,
and there is no strong theory on how to map algorithms onto problem types. It is
therefore typically advised that a practitioner undertake controlled experiments to
determine which algorithm and algorithm configuration produces the best performance
for a given classification task.
Classification predictive modeling algorithms are assessed based on their outputs. A
common statistic for assessing a model's performance from predicted class labels is
classification accuracy. Although not perfect, classification accuracy is a reasonable
place to start for many classification tasks.
Some tasks may call for a class membership probability to be predicted for each
example rather than a class label. This conveys the uncertainty in the prediction,
which a user or application can then interpret. The ROC curve is a popular diagnostic
for assessing predicted probabilities.
There are four different types of Classification Tasks in Machine Learning:
 Binary Classification
 Multi-Class Classification
 Multi-Label Classification
 Imbalanced Classification
Binary Classification
Classification tasks with only two class labels are referred to as binary
classification.
Examples include:
 Prediction of conversion (buy or not).
 Churn forecast (churn or not).
 Detection of spam email (spam or not).
Binary classification problems often require two classes, one representing the normal
state and the other representing the aberrant state.
For instance, the normal condition is "not spam," while the abnormal state is "spam."
Another illustration is when a task involving a medical test has a normal condition of
"cancer not identified" and an abnormal state of "cancer detected."
Class label 0 is given to the class in the normal state, whereas class label 1 is given to
the class in the abnormal condition.
A model that forecasts a Bernoulli probability distribution for each case is frequently
used to represent a binary classification task.
The discrete probability distribution known as the Bernoulli distribution deals with
the situation where an event has a binary result of either 0 or 1. In terms of
classification, this indicates that the model forecasts the likelihood that an example
would fall within class 1, or the abnormal state.
The following are well-known binary classification algorithms:
 Logistic Regression
 Support Vector Machines
 Naive Bayes
 Decision Trees
Some algorithms, such as Support Vector Machines and Logistic Regression, were
created expressly for binary classification and do not by default support more than
two classes.
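As a small illustration (synthetic data, default settings), a binary classifier such as
logistic regression can be fit and used to predict both hard class labels and
Bernoulli-style class probabilities:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-class dataset standing in for, e.g., spam vs. not spam
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print(clf.predict(X_test[:5]))         # hard class labels (0 = normal, 1 = abnormal)
print(clf.predict_proba(X_test[:5]))   # probability of each class for each example
print(clf.score(X_test, y_test))       # classification accuracy on held-out data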
Multi-Class Classification
Classification tasks with more than two class labels are referred to as multi-class
classification.
Examples include:
 Face categorization.
 Classifying plant species.
 Optical character recognition.
The multi-class classification does not have the idea of normal and abnormal
outcomes, in contrast to binary classification. Instead, instances are grouped into one
of several well-known classes.
In some cases, the number of class labels could be rather high. In a facial recognition
system, for instance, a model might predict that a shot belongs to one of thousands or
tens of thousands of faces.
Text translation models and other problems involving word prediction could be
categorized as a particular case of multi-class classification. Each word in the
sequence of words to be predicted requires a multi-class classification, where the
vocabulary size determines the number of possible classes that may be predicted and
may range from tens of thousands to hundreds of thousands of words.
Multiclass classification tasks are frequently modeled using a model that forecasts a
Multinoulli probability distribution for each example.
The Multinoulli distribution is a discrete probability distribution that covers an event
with a categorical outcome k in {1, 2, 3, ..., K}. In terms of classification, this
implies that the model forecasts the probability that a given example belongs to each
of the possible class labels.
For multi-class classification, many binary classification techniques are applicable.
The following well-known algorithms can be used for multi-class classification:
 Gradient Boosting
 Decision Trees
 k-Nearest Neighbors
 Random Forest
 Naive Bayes
Multi-class problems can also be solved using algorithms created for binary
classification. To do this, the multi-class problem is split into multiple binary
classification problems using one of two strategies:
 One-vs-Rest: fit a single binary classification model for each class versus all
other classes.
 One-vs-One: fit a single binary classification model for each pair of classes.
Binary classification algorithms that can apply these multi-class strategies include:
 Support Vector Machines
 Logistic Regression
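A brief sketch of these two strategies in scikit-learn, using the built-in iris dataset
(three classes) purely for illustration:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)   # three classes, so a multi-class problem

# One-vs-Rest: one binary model per class (class k vs. all other classes)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_))          # 3 underlying binary models

# One-vs-One: one binary model per pair of classes, K*(K-1)/2 = 3 models here
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovo.predict(X[:5]))            # predicted class labels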
Multi-Label Classification
Multi-label classification problems are those that feature two or more class labels and
allow for the prediction of one or more class labels for each example.
Think about the photo classification example. Here a model can predict the existence
of many known things in a photo, such as “person”, “apple”, "bicycle," etc. A
particular photo may have multiple objects in the scene.
This greatly contrasts with multi-class classification and binary classification, which
anticipate a single class label for each occurrence.
Multi-label classification problems are frequently modeled using a model that
forecasts many outcomes, with each outcome being forecast as a Bernoulli probability
distribution. In essence, this approach predicts several binary classifications for each
example.
Classification methods designed for binary or multi-class problems cannot be directly
applied to multi-label classification. Instead, so-called multi-label versions of the
algorithms, which are specialized versions of the conventional classification
algorithms, are used, including:
 Multi-label Gradient Boosting
 Multi-label Random Forests
 Multi-label Decision Trees
Another strategy is to use a separate classification algorithm to predict each of the
class labels.
Imbalanced Classification
The term "imbalanced classification" describes classification jobs where the
distribution of examples within each class is not equal.
A majority of the training dataset's instances belong to the normal class, while a
minority belong to the abnormal class, making imbalanced classification tasks binary
classification tasks in general.
Examples include:
 Clinical diagnostic procedures
 Outlier detection
 Fraud detection
Although they may require specialized techniques, these problems are modeled as
binary classification tasks.
By oversampling the minority class or undersampling the majority class, specialized
strategies can be employed to alter the sample composition in the training dataset.
Examples include:
 SMOTE Oversampling
 Random Undersampling
It is possible to utilize specialized modeling techniques, like the cost-sensitive
machine learning algorithms, that give the minority class more consideration when
fitting the model to the training dataset.
Examples include:
 Cost-sensitive Support Vector Machines
 Cost-sensitive Decision Trees
 Cost-sensitive Logistic Regression
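One common way to express this kind of cost sensitivity in scikit-learn is the
class_weight parameter, sketched below on an artificially imbalanced dataset:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic dataset where ~95% of examples belong to the majority (normal) class
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# 'balanced' weights errors inversely to class frequency, so mistakes on the
# rare (abnormal) class cost more during fitting
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
print(clf.score(X, y))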
Since reporting the classification accuracy may be deceptive, alternate performance
indicators may be necessary.
Examples include:
 F-Measure
 Recall
 Precision
Types of Classification Algorithms
You can apply many different classification methods based on the dataset you are
working with, because the study of classification in statistics is extensive. Some of the
most commonly used machine learning classification algorithms are listed below.
1. Logistic Regression
It is a supervised learning classification technique that forecasts the likelihood of a
target variable. There will only be a choice between two classes. Data can be coded as
either 1 (yes), representing success, or 0 (no), representing failure. The
dependent variable can be predicted most effectively using logistic regression. When
the forecast is categorical, such as true or false, yes or no, or a 0 or 1, you can use it.
A logistic regression technique can be used, for example, to determine whether or not
an email is spam.
2. Naive Bayes

60
Naive Bayes determines whether a data point falls into a particular category. It can be
used to classify phrases or words in text analysis as either falling within a
predetermined classification or not.
Text                                        Tag
“A great game”                              Sports
“The election is over”                      Not Sports
“What a great score”                        Sports
“A clean and unforgettable game”            Sports
“The spelling bee winner was a surprise”    Not Sports
3. K-Nearest Neighbors
It calculates the likelihood that a data point will join the groups based on which group
the data points closest to it are a part of. When using k-NN for classification, you
determine how to classify the data according to its nearest neighbor.
4. Decision Tree
A decision tree is an example of supervised learning. Although it can solve regression
and classification problems, it excels in classification problems. Similar to a flow
chart, it divides data points into two similar groups at a time, starting with the "tree
trunk" and moving through the "branches" and "leaves" until the categories are more
closely related to one another.
5. Random Forest Algorithm
The random forest algorithm is an extension of the Decision Tree algorithm: you
first create a number of decision trees using subsets of the training data, and then
combine their outputs - by majority vote for classification or by averaging for
regression - to produce the final prediction. These models are great for reducing the
decision tree's tendency to force data points unnecessarily within a category.
6. Support Vector Machine
Support Vector Machine is a popular supervised machine learning technique for
classification and regression problems. It classifies data by finding the boundary that
separates the classes as widely as possible.
Types of ML Classification Algorithms
1. Supervised Learning Approach
The supervised learning approach explicitly trains algorithms under close human
supervision. Both the input and the output data are first provided to the algorithm. The
algorithm then develops rules that map the input to the output. The training procedure
is repeated until the desired level of performance is attained.
The two types of supervised learning approaches are:
 Regression
 Classification
2. Unsupervised Learning
This approach is applied to examine data's inherent structure and derive insightful
information from it. This technique looks for insights that can produce better results

61
by looking for patterns and insights in unlabeled data.
There are two types of unsupervised learning:
 Clustering
 Dimensionality reduction
3. Semi-supervised Learning
Semi-supervised learning lies on the spectrum between unsupervised and supervised
learning. It combines the most significant aspects of both worlds to provide a unique
set of algorithms.
4. Reinforcement Learning
The goal of reinforcement learning is to create autonomous, self-improving
algorithms. The algorithm's goal is to improve itself through a continual cycle of trials
and errors based on the interactions and combinations between the incoming and
labeled data.
Classification Models

Naive Bayes: Naive Bayes is a classification algorithm that assumes that
predictors in a dataset are independent. This means that it assumes the features
are unrelated to each other. For example, if given a banana, the classifier will
see that the fruit is of yellow color, oblong-shaped and long and tapered. All of
these features will contribute independently to the probability of it being a
banana and are not dependent on each other.

Decision Trees: A Decision Tree is an algorithm that is used to visually
represent decision-making. A Decision Tree can be made by asking a yes/no
question and splitting the answer to lead to another decision. The question is at
the node and it places the resulting decisions below at the leaves. The tree
depicted below is used to decide if we can play tennis.
Figure 4: Decision Tree
In the above figure, depending on the weather conditions and the humidity and wind,
we can systematically decide if we should play tennis or not. In decision trees, all the
False statements lie on the left of the tree and the True statements branch off to the
right. Knowing this, we can make a tree which has the features at the nodes and the
resulting classes at the leaves.

K-Nearest Neighbors: K-Nearest Neighbor is a classification and prediction
algorithm that is used to divide data into classes based on the distance between
the data points. K-Nearest Neighbor assumes that data points which are close
to one another must be similar and hence, the data point to be classified will be
grouped with the closest cluster.
Figure 5: Data to be classified
Figure 6: Classification using K-Nearest Neighbours
Evaluating a Classification Model

After our model is finished, we must assess its performance, whether it is a
regression or a classification model. We have the following options for assessing a
classification model:
1. Confusion Matrix

 The confusion matrix describes the model performance and gives us a matrix
or table as an output.
 The error matrix is another name for it.
 The matrix summarizes the prediction results in a condensed manner,
together with the total number of correct and incorrect predictions.
The matrix appears in the following table:

                     Actual Positive    Actual Negative
Predicted Positive   True Positive      False Positive
Predicted Negative   False Negative     True Negative
Accuracy = (TP+TN)/Total Population
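A minimal sketch of computing these quantities with scikit-learn (the label arrays
below are made up for illustration):

from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                     # 3 true positives, 3 true negatives, 1 FP, 1 FN
print(accuracy_score(y_true, y_pred))     # (TP + TN) / total = 6/8 = 0.75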
2. Log Loss or Cross-Entropy Loss

 It is used to assess a classifier's performance, and its output is a probability
value between 0 and 1.
 A successful binary classification model should have a log loss value that is
close to 0.
 The value of log loss rises as the predicted probability diverges from the
actual value.
 A lower log loss indicates higher accuracy of the model.
Cross-entropy for binary classification can be calculated as:
    -(y log(p) + (1 - y) log(1 - p))
where p = predicted probability of the positive class and y = actual label (0 or 1).
3. AUC-ROC Curve

 AUC stands for Area Under the Curve, and ROC refers to the Receiver
Operating Characteristic curve.
 It is a graph that displays the classification model's performance at various
classification thresholds.
 The AUC-ROC curve is most commonly used to show how well a binary
classification model separates the two classes, and it can be extended to
multi-class settings.
 The ROC curve is drawn with the True Positive Rate (TPR) on the Y-axis and
the False Positive Rate (FPR) on the X-axis.
Use Cases Of Classification Algorithms
There are many applications for classification algorithms. Here are a few of them:
 Speech Recognition
 Detecting Spam Emails
 Categorization of Drugs
 Cancer Tumor Cell Identification
 Biometric Authentication, etc.
Representation
A machine learning model can't directly see, hear, or sense input examples. Instead,
you must create a representation of the data to provide the model with a useful
vantage point into the data's key qualities. That is, in order to train a model, you must
choose the set of features that best represent the data.
The choice of representation has an enormous effect on the performance of machine
learning algorithms.
In the context of neural networks, Chollet says that layers extract representations.
The core building block of neural networks is the layer, a data-processing module that
you can think of as a filter for data. Some data goes in, and it comes out in a more
useful form. Specifically, layers extract representations out of the data fed into
them--hopefully, representations that are more meaningful for the problem at hand.
Most of deep learning consists of chaining together simple layers that will implement
a form of progressive data distillation. A deep-learning model is like a sieve for data
processing, made of a succession of increasingly refined data filters--the layers.
That makes me think that representations are the form that the training/test data takes
as it is progressively transformed. e.g. words could initially be represented as dense or
sparse (one-hot encoded) vectors. And then their representation changes one or more
times as they are fed into a model.
Mitchell says that we need to choose a representation for the target function.
Now that we have specified the ideal target function V , we must
choose a representation that the learning program will use to describe
the function V^ that it will learn.
This makes me think that the 'representation' could be described as the architecture of
the model, or maybe a mathematical description of the model. With this definition, we
don't know the true representation (equation) of the target function (if we did we
would have nothing to learn). So it is our task to decide what equation we want to use
to best approximate the target function.
Cost function
Machine Learning models require a high level of accuracy to work in the actual world.
But how do you calculate how wrong or right your model is? This is where the cost
function comes into the picture. A machine learning parameter that is used for
correctly judging the model, cost functions are important to understand to know how
well the model has estimated the relationship between your input and output
parameters.
What Is Cost Function in Machine Learning?
After training your model, you need to see how well your model is performing. While
accuracy functions tell you how well the model is performing, they do not provide
you with an insight on how to better improve them. Hence, you need a correctional
function that can help you compute when the model is the most accurate, as you need
to hit that small spot between an undertrained model and an overtrained model.
A Cost Function is used to measure just how wrong the model is in finding a relation
between the input and output. It tells you how badly your model is behaving or
predicting.
Consider a robot trained to stack boxes in a factory. The robot might have to consider
certain changeable parameters, called Variables, which influence how it performs.
Let’s say the robot comes across an obstacle, like a rock. The robot might bump into
the rock and realize that it is not the correct action.
It will learn from this, and next time it will learn to avoid rocks. Hence, your machine
uses variables to better fit the data. The outcome of all these obstacles will further
optimize the robot and help it perform better. It will generalize and learn to avoid
obstacles in general, say like a fire that might have broken out. The outcome acts as a
cost function, which helps you optimize the variable, to get the best variables and fit
for the model.
Figure 1: Robot learning to avoid obstacles
What Is Gradient Descent?
Gradient Descent is an algorithm that is used to optimize the cost function or the error
of the model. It is used to find the minimum value of error possible in your model.
Gradient Descent can be thought of as the direction you have to take to reach the least
possible error. The error in your model can be different at different points, and you
have to find the quickest way to minimize it, to prevent resource wastage.
Gradient Descent can be visualized as a ball rolling down a hill. Here, the ball will
roll to the lowest point on the hill. It can take this point as the point where the error is
least as for any model, the error will be minimum at one point and will increase again
after that.
In gradient descent, you find the error in your model for different values of input
variables. This is repeated, and soon you see that the error values keep getting smaller
and smaller. Soon you’ll arrive at the values for variables when the error is the least,
and the cost function is optimized.
Figure 2: Gradient Descent
What Is the Cost Function For Linear Regression?
A Linear Regression model uses a straight line to fit the data. This is done using the
equation for a straight line, as shown:
Figure 3: Linear regression function
In the equation, you can see that two entities can have changeable values (variables): a,
which is the point at which the line intercepts the y-axis (the intercept), and b, which
is how steep the line is (the slope).
At first, if the variables are not properly optimized, you get a line that might not
properly fit the model. As you optimize the values of the model, for some variables,
you will get the perfect fit. The perfect fit will be a straight line running through most
of the data points while ignoring the noise and outliers. A properly fit Linear
Regression model looks as shown below :

65
Figure 4: Linear regression graph
For the Linear Regression model, the cost function is based on the Mean Squared Error
(or its square root, the Root Mean Squared Error): the predicted values are subtracted
from the actual values, and the squared errors are averaged. Training seeks the values
of a and b that minimize this cost.
Figure 5: Linear regression cost function
By the definition of gradient descent, you have to move in the direction in which the
error decreases. This direction is obtained by differentiating the cost function with
respect to the variables; at each step, the gradient (scaled by a learning rate) is
subtracted from the current values of the variables, moving them down the slope.
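A compact sketch of this update rule for simple linear regression, assuming a
mean-squared-error cost and a hand-picked learning rate (all values below are
illustrative):

import numpy as np

# Tiny synthetic dataset: y is roughly 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=50)

a, b = 0.0, 0.0          # intercept and slope, initialized arbitrarily
lr = 0.01                # learning rate (step size)

for _ in range(2000):
    y_pred = a + b * x
    error = y_pred - y
    # Gradients of the mean squared error with respect to a and b
    grad_a = 2 * error.mean()
    grad_b = 2 * (error * x).mean()
    # Step downhill: subtract the scaled gradients from the current values
    a -= lr * grad_a
    b -= lr * grad_b

print(f"intercept = {a:.2f}, slope = {b:.2f}")   # should approach 1 and 2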

Unit 5
C. Abdul Hakeem College of Engineering & Technology
Department of Master of Computer Applications
MC4301 - Machine Learning
Unit 5
Non Parametric Machine Learning
k- Nearest Neighbors
The k-nearest neighbors algorithm, also known as KNN or k-NN, is a non-parametric,
supervised learning classifier, which uses proximity to make classifications or
predictions about the grouping of an individual data point. While it can be used for
either regression or classification problems, it is typically used as a classification
algorithm, working off the assumption that similar points can be found near one
another.
For classification problems, a class label is assigned on the basis of a majority
vote—i.e. the label that is most frequently represented around a given data point is
used. While this is technically considered "plurality voting", the term "majority vote"
is more commonly used in the literature. The distinction between these terminologies is
that “majority voting” technically requires a majority of greater than 50%, which
primarily works when there are only two categories. When you have multiple
classes—e.g. four categories, you don’t necessarily need 50% of the vote to make a
conclusion about a class; you could assign a class label with a vote of greater than
25%.
Regression problems use a similar concept to classification problems, but in this case
the average of the k nearest neighbors is taken to make a prediction about a value.
The main distinction here is that classification is used for discrete values, whereas
regression is used with continuous ones. However, before a classification can be made,
the distance must be defined.
It's also worth noting that the KNN algorithm is also part of a family of “lazy learning”
models, meaning that it only stores a training dataset versus undergoing a training
stage. This also means that all the computation occurs when a classification or
prediction is being made. Since it heavily relies on memory to store all its training
data, it is also referred to as an instance-based or memory-based learning method.
However, as a dataset grows, KNN becomes increasingly inefficient, compromising
overall model performance. It is commonly used for simple recommendation systems,
pattern recognition, data mining, financial market predictions, intrusion detection, and
more.

Compute KNN: distance metrics
The goal of the k-nearest neighbor algorithm is to identify the nearest neighbors of a
given query point, so that we can assign a class label to that point. In order to do this,
KNN has a few requirements:
Determine your distance metrics
In order to determine which data points are closest to a given query point, the distance
between the query point and the other data points will need to be calculated. These
distance metrics help to form decision boundaries, which partitions query points into
different regions.
While there are several distance measures that you can choose from, some of the most
common are the following:
Euclidean distance (p=2): This is the most commonly used distance measure, and it
is limited to real-valued vectors. Using the below formula, it measures a straight line
between the query point and the other point being measured.
Manhattan distance (p=1): This is also another popular distance metric, which
measures the absolute value between two points. It is also referred to as taxicab
distance or city block distance as it is commonly visualized with a grid, illustrating
how one might navigate from one address to another via city streets.
Minkowski distance: This distance measure is the generalized form of Euclidean and
Manhattan distance metrics. The parameter, p, in the formula below, allows for the
creation of other distance metrics. Euclidean distance is represented by this formula
when p is equal to two, and Manhattan distance is denoted with p equal to one.
Hamming distance: This technique is typically used with Boolean or string
vectors, identifying the points where the vectors do not match. As a result, it has also
been referred to as the overlap metric. This can be represented with the following
formula:
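    Hamming(x, y) = the number of positions i at which x_i and y_i differ

For reference, for two real-valued points x = (x_1, ..., x_n) and y = (y_1, ..., y_n),
the other distance measures described above take the following standard forms:

    Euclidean(x, y) = sqrt( sum over i of (x_i - y_i)^2 )        (Minkowski with p = 2)
    Manhattan(x, y) = sum over i of |x_i - y_i|                  (Minkowski with p = 1)
    Minkowski(x, y) = ( sum over i of |x_i - y_i|^p )^(1/p)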
Compute KNN: defining k
The k value in the k-NN algorithm defines how many neighbors will be checked to
determine the classification of a specific query point. For example, if k=1, the
instance will be assigned to the same class as its single nearest neighbor. Defining k
can be a balancing act as different values can lead to overfitting or underfitting.
Lower values of k can have high variance, but low bias, and larger values of k may
lead to high bias and lower variance. The choice of k will largely depend on the input
data as data with more outliers or noise will likely perform better with higher values
of k. Overall, it is recommended to have an odd number for k to avoid ties in
classification, and cross-validation tactics can help you choose the optimal k for your
dataset.
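A brief sketch with scikit-learn's KNeighborsClassifier (the iris dataset and k = 5 are
chosen purely for illustration; in practice k would be tuned with cross-validation as
described above):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k = 5 neighbours with Euclidean distance (p=2); p=1 would give Manhattan distance
knn = KNeighborsClassifier(n_neighbors=5, p=2)
knn.fit(X_train, y_train)

print(knn.predict(X_test[:5]))     # class labels decided by a vote of the neighbours
print(knn.score(X_test, y_test))   # accuracy on held-out data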
Applications of k-NN in machine learning
The k-NN algorithm has been utilized within a variety of applications, largely within
classification. Some of these use cases include:
- Data preprocessing: Datasets frequently have missing values, but the KNN
algorithm can estimate for those values in a process known as missing data
imputation.
- Recommendation Engines: Using clickstream data from websites, the KNN
algorithm has been used to provide automatic recommendations to users on additional
content.
- Finance: It has also been used in a variety of finance and economic use cases.
- Healthcare: KNN has also had application within the healthcare industry, making
predictions on the risk of heart attacks and prostate cancer. The algorithm works by
calculating the most likely gene expressions.
- Pattern Recognition: KNN has also assisted in identifying patterns, such as in text
and digit classification
Advantages and disadvantages of the KNN algorithm
Just like any machine learning algorithm, k-NN has its strengths and weaknesses.
Depending on the project and application, it may or may not be the right choice.
Advantages
- Easy to implement: Given the algorithm’s simplicity and accuracy, it is one of the
first classifiers that a new data scientist will learn.
- Adapts easily: As new training samples are added, the algorithm adjusts to account
for any new data since all training data is stored into memory.
- Few hyperparameters: KNN only requires a k value and a distance metric, which
is low when compared to other machine learning algorithms.
Disadvantages
- Does not scale well: Since KNN is a lazy algorithm, it takes up more memory and
data storage compared to other classifiers. This can be costly from both a time and
money perspective. More memory and storage will drive up business expenses and
more data can take longer to compute. While different data structures, such as
Ball-Tree, have been created to address the computational inefficiencies, a different
classifier may be ideal depending on the business problem.
- Curse of dimensionality: The KNN algorithm tends to fall victim to the curse of
dimensionality, which means that it doesn’t perform well with high-dimensional data
inputs.
- Prone to overfitting: Due to the “curse of dimensionality”, KNN is also more prone
to overfitting. While feature selection and dimensionality reduction techniques are
leveraged to prevent this from occurring, the value of k can also impact the model’s
behavior. Lower values of k can overfit the data, whereas higher values of k tend to
“smooth out” the prediction values since it is averaging the values over a greater area,
or neighborhood. However, if the value of k is too high, then it can underfit the data.
Decision Trees
Decision tree algorithm falls under the category of supervised learning. They can be
used to solve both regression and classification problems. Decision tree uses the tree
representation to solve the problem in which each leaf node corresponds to a class
label and attributes are represented on the internal node of the tree. We can represent
any boolean function on discrete attributes using the decision tree.
Below are some assumptions that we made while using the decision tree:
 At the beginning, we consider the whole training set as the root.
 Feature values are preferred to be categorical. If the values are continuous then
they are discretized prior to building the model.
 On the basis of attribute values, records are distributed recursively.
 We use statistical methods for ordering attributes as root or the internal node.
A Decision Tree works on the Sum of Product form, which is also known as Disjunctive
Normal Form. For example, a tree might predict whether people use a computer in their
daily life. In the Decision Tree, the major challenge is the identification of the
attribute for the root node at each level.
This process is known as attribute selection. We have two popular attribute selection
measures:
1. Information Gain
2. Gini Index

1. Information Gain
When we use a node in a decision tree to partition the training instances into smaller
subsets, the entropy changes. Information gain is a measure of this change in entropy.
Definition: Suppose S is a set of instances, A is an attribute, S_v is the subset of S
with A = v, and Values(A) is the set of all possible values of A. Then

    Gain(S, A) = Entropy(S) - sum over v in Values(A) of (|S_v| / |S|) * Entropy(S_v)

Entropy
Entropy is the measure of uncertainty of a random variable; it characterizes the
impurity of an arbitrary collection of examples. The higher the entropy, the more the
information content. If the instances in S belong to classes with proportions
p_1, ..., p_c, then

    Entropy(S) = - sum over i of p_i * log2(p_i)
Building Decision Tree using Information Gain The essentials:
 Start with all training instances associated with the root node
 Use info gain to choose which attribute to label each node with
 Note: No root-to-leaf path should contain the same discrete attribute twice
 Recursively construct each subtree on the subset of training instances that
would be classified down that path in the tree.
 If all positive or all negative training instances remain, label that node "yes"
or "no" accordingly
 If no attributes remain, label with a majority vote of training instances left at
that node
 If no instances remain, label with a majority vote of the parent’s training
instances.
2. Gini Index
 Gini Index is a metric to measure how often a randomly chosen element would
be incorrectly identified.
 It means an attribute with a lower Gini index should be preferred.
 Sklearn supports the "gini" criterion for the Gini Index, and it uses the "gini"
criterion by default.
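A small sketch of these impurity measures in plain Python/NumPy (the class labels and
the candidate split below are made up for illustration):

import numpy as np

def entropy(labels):
    # Entropy(S) = -sum_i p_i * log2(p_i)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini(S) = 1 - sum_i p_i^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, subsets):
    # Gain = Entropy(parent) - weighted average entropy of the child subsets
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted

parent = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])   # 10 examples, 50/50 classes
left, right = parent[:6], parent[6:]                # one candidate attribute split
print(entropy(parent), gini(parent))                # 1.0 and 0.5 for a 50/50 split
print(information_gain(parent, [left, right]))      # how much the split reduces entropy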
Branching
A decision tree is a map of the possible outcomes of a series of related choices. It
allows an individual or organization to weigh possible actions against one another
based on their costs, probabilities, and benefits. They can be used either to drive
informal discussion or to map out an algorithm that predicts the best choice
mathematically.
A decision tree typically starts with a single node, which branches into possible
outcomes. Each of those outcomes leads to additional nodes, which branch off into
other possibilities. This gives it a treelike shape.
There are three different types of nodes: chance nodes, decision nodes, and end nodes.
A chance node, represented by a circle, shows the probabilities of certain results. A
decision node, represented by a square, shows a decision to be made, and an end node
shows the final outcome of a decision path.
Decision trees can also be drawn with flowchart symbols, which some people find
easier to read and understand.
Decision tree symbols
Shape Name             Meaning
Decision node          Indicates a decision to be made
Chance node            Shows multiple uncertain outcomes
Alternative branches   Each branch indicates a possible outcome or action
Rejected alternative   Shows a choice that was not selected
Endpoint node          Indicates a final outcome
How to draw a decision tree
To draw a decision tree, first pick a medium. You can draw it by hand on paper or a
whiteboard, or you can use special decision tree software. In either case, here are the
steps to follow:
1. Start with the main decision. Draw a small box to represent this point, then draw
a line from the box to the right for each possible solution or action. Label them
accordingly.
2. Add chance and decision nodes to expand the tree as follows:
 If another decision is necessary, draw another box.
 If the outcome is uncertain, draw a circle (circles represent chance nodes).
 If the problem is solved, leave it blank (for now).
From each decision node, draw possible solutions. From each chance node, draw lines
representing possible outcomes. If you intend to analyze your options numerically,
include the probability of each outcome and the cost of each action.
3. Continue to expand until every line reaches an endpoint, meaning that there are
no more choices to be made or chance outcomes to consider. Then, assign a value to
each possible outcome. It could be an abstract score or a financial value. Add triangles
to signify endpoints.
With a complete decision tree, you’re now ready to begin analyzing the decision you
face.
Advantages and disadvantages
Decision trees remain popular for reasons like these:
 How easy they are to understand
 They can be useful with or without hard data, and any data requires minimal
preparation
 New options can be added to existing trees
 Their value in picking out the best of several options
 How easily they combine with other decision making tools
However, decision trees can become excessively complex. In such cases, a more
compact influence diagram can be a good alternative. Influence diagrams narrow the
focus to critical decisions, inputs, and objectives.
Greedy Algorithm
What Does Greedy Algorithm Mean?
A greedy algorithm is an algorithmic strategy that makes the best optimal choice at
each small stage with the goal of this eventually leading to a globally optimum
solution. This means that the algorithm picks the best solution at the moment without
regard for consequences. It picks the best immediate output, but does not consider the
big picture, hence it is considered greedy.
A greedy algorithm works by choosing the best possible answer in each step and then
moving on to the next step until it reaches the end, without regard for the overall
solution. It only hopes that the path it takes is the globally optimum one, but as proven
time and again, this method does not often come up with a globally optimum solution.
In fact, it is entirely possible that the most optimal short-term solutions lead to the
worst possible global outcome.
Think of it as taking a lot of shortcuts in a manufacturing business: in the short term
large amounts are saved in manufacturing cost, but this eventually leads to downfall
since quality is compromised, resulting in product returns and low sales as customers
become acquainted with the “cheap” product. But this is not always the case, there are
a lot of applications where the greedy algorithm works best to find or approximate the
globally optimum solution such as in constructing a Huffman tree or a decision
learning tree.
For example, consider choosing a path through a tree so as to maximize the overall
sum. A greedy algorithm may, out of shortsightedness, take the branch with the larger
immediate value rather than the path that actually yields the largest total sum.
Components:
 A candidate set of data that needs a solution
 A selection function that chooses the best contributor to the final solution
 A feasibility function that aids the selection function by determining if a
candidate can be a contributor to the solution
 An objective function that assigns a value to a partial solution
 A solution function that indicates that the optimum solution has been
discovered
What is greedy approach in Decision tree algorithm
“Greedy Approach is based on the concept of Heuristic Problem Solving by making
an optimal local choice at each node. By making these local optimal choices, we reach
the approximate optimal solution globally.”
The algorithm can be summarized as :
1. At each stage (node), pick out the best feature as the test condition.
2. Now split the node into the possible outcomes (internal nodes).
3. Repeat the above steps until all the test conditions have been exhausted into leaf
nodes.
When you start to implement the algorithm, the first question is: ‘How to pick the
starting test condition?’
To make that decision, you need to have some knowledge about entropy and
information gain.
Entropy: Entropy in Decision Tree stands for homogeneity. If the data is completely
homogenous, the entropy is 0, else if the data is divided (50-50%) entropy is 1.
Information Gain: Information Gain is the decrease in the Entropy value achieved when
the node is split.
An attribute should have the highest information gain to be selected for
splitting. Based on the computed values of Entropy and Information Gain, we choose
the best attribute at any particular step.
Multiple Branches
Multiclass classification is a popular problem in supervised machine learning.
Problem – Given a dataset of m training examples, each of which contains
information in the form of various features and a label. Each label corresponds to a
class, to which the training example belongs. In multiclass classification, we have a
finite set of classes. Each training example also has n features.
For example, in the case of identifying different types of fruits, "Shape", "Color",
and "Radius" can be the features, and "Apple", "Orange", and "Banana" can be the
different class labels.
A decision tree classifier is a systematic approach for multiclass classification. It
poses a set of questions to the dataset (related to its attributes/features). The decision
tree classification algorithm can be visualized on a binary tree. On the root and each
of the internal nodes, a question is posed and the data on that node is further split into
separate records that have different characteristics. The leaves of the tree refer to the
classes in which the dataset is split.
Continuous attributes
How are continuous-valued attributes handled in decision trees?
The test condition for a continuous-valued attribute can either be expressed using a
comparison operator (≥, ≤) or the attribute can be split into a finite set of range
buckets. It is important to note that a comparison-based test condition gives us a
binary split whereas range buckets give us a multiway split.
Converting a continuous-valued attribute into a categorical attribute (multiway
split) :
 An equal width approach converts the continuous data points into n
categories each of equal width. For instance, a continuous-valued attribute
with a range of 0–50 can be converted into 5 categories of equal width -[0–10),
[10–20), [20–30), [30–40), [40–50]. The number of categories is a
hyper-parameter.
 It is important to note that the equal width approach is sensitive to outliers.
 The equal frequency approach converts the continuous-valued attribute into n
categories such that each category contains approximately the same number of
data points.
 More sophisticated methods involve the use of unsupervised clustering
algorithms to define the optimal categories.
Converting a continuous-valued attribute into a binary attribute (two-way split):
 A comparison-based test condition of the form attribute >= v involves the
determination of v.
 It is easy to see that a brute force approach of trying out every single value of
the continuous variable is computationally expensive.
 A better way for identifying the split candidates involves sorting the values of
the continuous attribute and taking the midpoint of the adjacent values in the
sorted array.
 For example, after sorting and taking midpoints, the potential candidates for the
split can be narrowed down to -15, -9, 0, 12, and 21.
 It is evident that the number of candidates after taking the midpoint of the
sorted array can still be computationally expensive.
 A more optimized version involves selecting midpoint candidates with
different class labels. This will narrow down the candidates to -9 and 12
which is a significant improvement over the brute force approach.
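A short sketch of this candidate-generation step (the attribute values and class labels
are made up, chosen so that the midpoints match the numbers used above; only midpoints
between adjacent values with different class labels are kept):

import numpy as np

# Hypothetical continuous attribute values with their class labels
values = np.array([-20, -10, -8, 8, 16, 26])
labels = np.array([  0,   0,  1, 1,  0,  0])

order = np.argsort(values)                 # sort by attribute value
v, lab = values[order], labels[order]

midpoints = (v[:-1] + v[1:]) / 2.0         # midpoints of adjacent values
print(midpoints)                           # [-15.  -9.   0.  12.  21.]

# Keep only midpoints where the class label changes across the adjacent pair
candidates = midpoints[lab[:-1] != lab[1:]]
print(candidates)                          # [-9. 12.]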
Pruning
Decision trees are a machine learning algorithm that is susceptible to overfitting. One
of the techniques you can use to reduce overfitting in decision trees is pruning.
Decision Trees are a non-parametric supervised learning method that can be used for
classification and regression tasks. The goal is to build a model that can make
predictions on the value of a target variable by learning simple decision rules inferred
from the data features.
Decision Trees are made up of
1. Root Node - the very top of the decision tree and is the ultimate decision
you’re trying to make.
2. Internal Nodes - this branches off from the Root Node, and represent different
options

3. Leaf Nodes - these are attached at the end of the branches and represent
possible outcomes for each action.
Just like any other machine learning algorithm, the most annoying thing that can
happen is overfitting. And Decision Trees are one of the machine learning algorithms
that are susceptible to overfitting.
Overfitting is when a model completely fits the training data and struggles or fails to
generalize the testing data. This happens when the model memorizes noise in the
training data and fails to pick up essential patterns which can help them with the test
data.
One of the techniques you can use to reduce overfitting in Decision Trees is Pruning.
What is Decision Tree Pruning and Why is it Important?
Pruning is a technique that removes the parts of the Decision Tree which prevent it
from growing to its full depth. The parts that it removes from the tree are the parts that
do not provide the power to classify instances. A Decision tree that is trained to its
full depth will highly likely lead to overfitting the training data - therefore Pruning is
important.
In simpler terms, the aim of Decision Tree Pruning is to construct an algorithm that
will perform worse on training data but will generalize better on test data. Tuning the
hyperparameters of your Decision Tree model can do your model a lot of justice and
save you a lot of time and money.
How do you Prune a Decision Tree?
There are two types of pruning: Pre-pruning and Post-pruning. I will go through both
of them and how they work.
Pre-pruning
The pre-pruning technique of Decision Trees is tuning the hyperparameters prior to
the training pipeline. It involves the heuristic known as ‘early stopping’ which stops
the growth of the decision tree - preventing it from reaching its full depth.
It stops the tree-building process to avoid producing leaves with small samples.
During each stage of the splitting of the tree, the cross-validation error will be
monitored. If the value of the error does not decrease anymore - then we stop the
growth of the decision tree.
The hyperparameters that can be tuned for early stopping and preventing overfitting
are:
max_depth, min_samples_leaf, and min_samples_split
These same parameters can also be used to tune to get a robust model. However, you
should be cautious as early stopping can also lead to underfitting.
Post-pruning
Post-pruning does the opposite of pre-pruning and allows the Decision Tree model to
grow to its full depth. Once the model grows to its full depth, tree branches are
removed to prevent the model from overfitting.
The algorithm will continue to partition data into smaller subsets until the final
subsets produced are similar in terms of the outcome variable. The final subset of the
tree will consist of only a few data points allowing the tree to have learned the data to
the T. However, when a new data point is introduced that differs from the learned data
- it may not get predicted well.
The hyperparameter that can be tuned for post-pruning and preventing overfitting is:
ccp_alpha
ccp stands for Cost Complexity Pruning and can be used as another option to control
the size of a tree. A higher value of ccp_alpha will lead to an increase in the number
of nodes pruned.
Cost complexity pruning (post-pruning) steps:
1. Train your Decision Tree model to its full depth.
2. Compute the candidate ccp_alpha values using cost_complexity_pruning_path().
3. Train your Decision Tree model with different ccp_alpha values and compute the
train and test performance scores.
4. Plot the train and test scores for each ccp_alpha value.
This hyperparameter can also be tuned to obtain the best-fitting model.
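The steps above can be sketched roughly as follows (an illustrative example, not from the original notes; the dataset and plotting choices are placeholders):

# Hedged sketch of cost complexity (post-)pruning with ccp_alpha.
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Steps 1-2: compute the candidate ccp_alpha values from a fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

# Step 3: train one tree per ccp_alpha and record train/test scores.
train_scores, test_scores = [], []
for alpha in ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    train_scores.append(tree.score(X_train, y_train))
    test_scores.append(tree.score(X_test, y_test))

# Step 4: plot the scores and pick the alpha with the best test performance.
plt.plot(ccp_alphas, train_scores, marker="o", label="train")
plt.plot(ccp_alphas, test_scores, marker="o", label="test")
plt.xlabel("ccp_alpha")
plt.ylabel("accuracy")
plt.legend()
plt.show()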
Random Forests
A Random Forest Algorithm is a supervised machine learning algorithm which is
extremely popular and is used for Classification and Regression problems in Machine
Learning. We know that a forest comprises numerous trees, and the more trees it has,
the more robust it is. Similarly, the greater the number of trees in a Random Forest
Algorithm, the higher its accuracy and problem-solving ability. Random Forest is a
classifier that contains several decision trees on various subsets of the given dataset
and takes the average to improve the predictive accuracy of that dataset. It is based on
the concept of ensemble learning which is a process of combining multiple classifiers
to solve a complex problem and improve the performance of the model.
Working of Random Forest Algorithm
The following steps explain the working of the Random Forest Algorithm:
Step 1: Select random samples (with replacement) from the given training set.
Step 2: Construct a decision tree for every sample.
Step 3: Each decision tree produces a prediction; voting (or averaging, for regression)
takes place across the trees.
Step 4: Finally, select the most voted prediction result as the final prediction.
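For illustration only (not from the original notes), these steps are all handled internally by scikit-learn's RandomForestClassifier; the iris dataset and the number of trees below are arbitrary placeholder choices.

# Hedged sketch: training a Random Forest and scoring it on held-out data.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Bootstrap sampling (Step 1), one tree per sample (Step 2) and voting
# across trees (Steps 3-4) all happen inside fit() and predict().
forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))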
This combination of multiple models is called Ensemble. Ensemble uses two
methods:
1. Bagging: Creating a different training subset from sample training data with
replacement is called Bagging. The final output is based on majority voting.
2. Boosting: Combining weak learners into strong learners by creating sequential
models such that the final model has the highest accuracy is called Boosting.
Examples: AdaBoost, XGBoost.
Bagging: From the principle mentioned above, we can understand that Random Forest
uses Bagging. Now, let us understand this concept in detail. Bagging, also known as
Bootstrap Aggregation, is the technique used by Random Forest. The process begins
with the original data, which is organised into random samples drawn with replacement,
known as bootstrap samples; this step is known as bootstrapping. The models are then
trained individually on these samples, yielding different results. In the last step, all the
results are combined and the generated output is based on majority voting; this step is
known as Aggregation. Together, bootstrapping and aggregation make up Bagging,
which is carried out using an ensemble classifier.
Essential Features of Random Forest
 Miscellany: Each tree is trained on a different sample of the data and a different
set of features, so no two trees are the same.
 Immune to the curse of dimensionality: Since each tree does not consider all of
the features, the feature space it works with is reduced.
 Parallelization: We can fully use the CPU to build random forests, since each
tree is created independently from different data and features.
 Train-Test split: In a Random Forest, we don't have to set aside separate data for
train and test, because roughly one-third of the data is never seen by a given
decision tree (the out-of-bag samples) and can be used for evaluation.
 Stability: The final result is based on Bagging, meaning the result is based on
majority voting or averaging.
Difference between Decision Tree and Random Forest
Decision Trees:
 They usually suffer from the problem of overfitting if they are allowed to grow
without any control.
 A single decision tree is comparatively faster in computation.
 They use a particular set of rules when a data set with features is taken as input.
Random Forest:
 Since the trees are created from subsets of the data and the final output is based
on average or majority ranking, the problem of overfitting does not arise here.
 It is slower.
 Random Forest randomly selects observations, builds decision trees and then
obtains the result based on majority voting. No formulas are required here.
Why Use a Random Forest Algorithm?
There are a lot of benefits to using Random Forest Algorithm, but one of the main
advantages is that it reduces the risk of overfitting and the required training time.
Additionally, it offers a high level of accuracy. Random Forest algorithm runs
efficiently in large databases and produces highly accurate predictions by estimating
missing data.
Important Hyperparameters
Hyperparameters are used in random forests to either enhance the performance and
predictive power of models or to make the model faster.
The following hyperparameters are used to enhance the predictive power:
 n_estimators: The number of trees the algorithm builds before averaging the
predictions.
 max_features: The maximum number of features Random Forest considers when
looking for the best split at a node.
 min_samples_leaf: The minimum number of samples required to be at a leaf
node.
The following hyperparameters are used to increase the speed of the model:
 n_jobs: Tells the engine how many processors it is allowed to use. If the value is
1, it can use only one processor; if the value is -1, there is no limit.
 random_state: Controls randomness of the sample. The model will always
produce the same results if it has a definite value of random state and if it has
been given the same hyperparameters and the same training data.
 oob_score: OOB (Out Of Bag) is a random forest cross-validation method.
Roughly one-third of the samples are not used to train a given tree and are
instead used to evaluate its performance.
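A small sketch of how these hyperparameters appear in scikit-learn is shown below (illustrative only; the dataset and the specific values are placeholders, not recommendations):

# Hedged sketch: setting the hyperparameters discussed above.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=200,     # number of trees built before averaging
    max_features="sqrt",  # features considered when splitting a node
    min_samples_leaf=5,   # minimum samples required at a leaf
    n_jobs=-1,            # use all available processors
    random_state=0,       # reproducible sampling
    oob_score=True,       # evaluate on the out-of-bag samples
)
forest.fit(X, y)
print("out-of-bag accuracy:", forest.oob_score_)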
Ensemble learning
Ensemble learning helps improve machine learning results by combining several
models. This approach allows the production of better predictive performance
compared to a single model. The basic idea is to learn a set of classifiers (experts) and to
allow them to vote.
Advantage : Improvement in predictive accuracy.
Disadvantage : It is difficult to understand an ensemble of classifiers.
Why do ensembles work?
 Statistical Problem –
The Statistical Problem arises when the hypothesis space is too large for the
amount of available data. Hence, there are many hypotheses with the same
accuracy on the data and the learning algorithm chooses only one of them!
There is a risk that the accuracy of the chosen hypothesis is low on unseen
data!
 Computational Problem –
The Computational Problem arises when the learning algorithm cannot
guarantee finding the best hypothesis.
 Representational Problem –
The Representational Problem arises when the hypothesis space does not
contain any good approximation of the target class(es).
Main Challenge for Developing Ensemble Models?
The main challenge is not to obtain highly accurate base models, but rather to obtain
base models which make different kinds of errors. For example, if ensembles are used
for classification, high accuracies can be accomplished if different base models
misclassify different training examples, even if the base classifier accuracy is low.
Types of Ensemble Classifier –
Bagging:
Bagging (Bootstrap Aggregation) is used to reduce the variance of a decision tree.
Suppose a set D of d tuples; at each iteration i, a training set D_i of d tuples is sampled
with replacement from D (i.e., a bootstrap sample). A classifier model M_i is then
learned from each training set D_i. Each classifier M_i returns its class prediction. The
bagged classifier M* counts the votes and assigns the class with the most votes to X
(the unknown sample).
Implementation steps of Bagging –
1. Multiple subsets are created from the original data set with equal tuples,
selecting observations with replacement.
2. A base model is created on each of these subsets.
3. Each model is learned in parallel from each training set and independent of
each other.
4. The final predictions are determined by combining the predictions from all the
models.
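These steps can be sketched with scikit-learn's BaggingClassifier, using decision trees as the base model (illustrative only; note that older scikit-learn versions name the first argument base_estimator instead of estimator):

# Hedged sketch of bagging with decision trees as base models.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base model trained on each subset
    n_estimators=50,                     # number of bootstrap subsets / models
    bootstrap=True,                      # sample observations with replacement
    random_state=0,
)
bagging.fit(X_train, y_train)            # the base models are trained independently
print("test accuracy:", bagging.score(X_test, y_test))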
Random Forest:
Random Forest is an extension over bagging. Each classifier in the ensemble is a
decision tree classifier and is generated using a random selection of attributes at each
node to determine the split. During classification, each tree votes and the most
popular class is returned.
Implementation steps of Random Forest –
1. Multiple subsets are created from the original data set, selecting observations
with replacement.
2. A subset of features is selected randomly and whichever feature gives the best
split is used to split the node iteratively.
3. The tree is grown to the largest extent possible.
4. Repeat the above steps and prediction is given based on the aggregation of
predictions from n number of trees.
Boosting
Boosting is an ensemble learning method that combines a set of weak learners into a
strong learner to minimize training errors. In boosting, a random sample of data is
selected, fitted with a model and then trained sequentially—that is, each model tries to
compensate for the weaknesses of its predecessor. With each iteration, the weak rules
from each individual classifier are combined to form one, strong prediction rule.
Ensemble learning
Ensemble learning gives credence to the idea of the “wisdom of crowds,” which
suggests that the decision-making of a larger group of people is typically better than
that of an individual expert. Similarly, ensemble learning refers to a group (or
ensemble) of base learners, or models, which work collectively to achieve a better
final prediction. A single model, also known as a base or weak learner, may not
perform well individually due to high variance or high bias. However, when weak
learners are aggregated, they can form a strong learner, as their combination reduces
bias or variance, yielding better model performance.
Ensemble methods are frequently illustrated using decision trees as this algorithm can
be prone to overfitting (high variance and low bias) when it hasn’t been pruned and it
can also lend itself to underfitting (low variance and high bias) when it’s very small,
like a decision stump, which is a decision tree with one level. Remember, when an
algorithm overfits or underfits to its training dataset, it cannot generalize well to new
datasets, so ensemble methods are used to counteract this behavior to allow for
generalization of the model to new datasets. While decision trees can exhibit high
variance or high bias, it’s worth noting that it is not the only modeling technique that
leverages ensemble learning to find the “sweet spot” within the bias-variance
tradeoff.
Bagging vs. boosting
Bagging and boosting are two main types of ensemble learning methods. The main
difference between these learning methods is the way in which they are trained. In
bagging, weak learners are trained in parallel, but in boosting, they learn sequentially.
This means that a series of models are constructed and with each new model iteration,
the weights of the misclassified data in the previous model are increased. This
redistribution of weights helps the algorithm identify the parameters that it needs to
focus on to improve its performance. AdaBoost, which stands for “adaptive
boosting,” is one of the most popular boosting algorithms as it was one of
the first of its kind. Other types of boosting algorithms include XGBoost,
GradientBoost, and BrownBoost.
Another difference between bagging and boosting is in how they are used. For
example, bagging methods are typically used on weak learners that exhibit high
variance and low bias, whereas boosting methods are leveraged when low variance
and high bias are observed. While bagging can be used to avoid overfitting, boosting
methods can be more prone to it, although it really depends on the dataset. However,
parameter tuning can help avoid the issue.
As a result, bagging and boosting have different real-world applications as well.
Bagging has been leveraged for loan approval processes and statistical genomics
while boosting has been used more within image recognition apps and search
engines.
Types of boosting
Boosting methods are focused on iteratively combining weak learners to build a
strong learner that can predict more accurate outcomes. As a reminder, a weak learner
classifies data slightly better than random guessing. This approach can provide robust
results for prediction problems, and can even outperform neural networks and support
vector machines for tasks like image retrieval.
Boosting algorithms can differ in how they create and aggregate weak learners during
the sequential process. Three popular types of boosting methods include:
 Adaptive boosting or AdaBoost: Yoav Freund and Robert Schapire are
credited with the creation of the AdaBoost algorithm. This method operates
iteratively, identifying misclassified data points and adjusting their weights to
minimize the training error. The model continues to optimize in a sequential
fashion until it yields the strongest predictor.
 Gradient boosting: Building on the work of Leo Breiman, Jerome H.
Friedman developed gradient boosting, which works by sequentially adding
predictors to an ensemble with each one correcting for the errors of its
predecessor. However, instead of changing weights of data points like
AdaBoost, the gradient boosting trains on the residual errors of the previous
predictor. The name, gradient boosting, is used since it combines the gradient
descent algorithm and boosting method.
 Extreme gradient boosting or XGBoost: XGBoost is an implementation of
gradient boosting that’s designed for computational speed and scale. XGBoost
leverages multiple cores on the CPU, allowing for learning to occur in parallel
during training.
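As a rough illustration of sequential boosting (not from the original notes), scikit-learn's GradientBoostingClassifier can be used as follows; the dataset and hyperparameter values are placeholders.

# Hedged sketch: gradient boosting with shallow trees as weak learners.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=100,   # number of sequential trees
    learning_rate=0.1,  # contribution of each tree
    max_depth=3,        # weak learners are shallow trees
    random_state=0,
)
gbm.fit(X_train, y_train)  # each tree fits the residual errors of the ensemble so far
print("test accuracy:", gbm.score(X_test, y_test))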
Benefits and challenges of boosting
There are a number of key advantages and challenges that the boosting method
presents when used for classification or regression problems.
The key benefits of boosting include:
 Ease of implementation: Boosting can be used with several hyperparameter
tuning options to improve fitting. No data preprocessing is required, and some
boosting implementations have built-in routines to handle missing data. In
Python, the scikit-learn ensemble module (sklearn.ensemble) makes it easy to
implement popular boosting methods such as AdaBoost and gradient boosting,
while libraries like XGBoost provide their own implementations.
 Reduction of bias: Boosting algorithms combine multiple weak learners in a
sequential method, iteratively improving upon observations. This approach
can help to reduce high bias, commonly seen in shallow decision trees and
logistic regression models.
 Computational Efficiency: Since boosting algorithms only select features
that increase its predictive power during training, it can help to reduce
dimensionality as well as increase computational efficiency.
The key challenges of boosting include:
 Overfitting: There’s some dispute in the research around whether or not
boosting can help reduce overfitting or exacerbate it. We include it under
challenges because in the instances that it does occur, predictions cannot be
generalized to new datasets.
 Intense computation: Sequential training in boosting is hard to scale up.
Since each estimator is built on its predecessors, boosting models can be
computationally expensive, although XGBoost seeks to address scalability
issues seen in other types of boosting methods. Boosting algorithms can be
slower to train when compared to bagging as a large number of parameters can
also influence the behavior of the model.
Applications of boosting
Boosting algorithms are well suited for artificial intelligence projects across a broad
range of industries, including:
Healthcare: Boosting is used to lower errors in medical data predictions, such
as predicting cardiovascular risk factors and cancer patient survival rates. For
example, research shows that ensemble methods significantly improve the
accuracy in identifying patients who could benefit from preventive treatment
of cardiovascular disease, while avoiding unnecessary treatment of others.
Likewise, another study found that applying boosting to
multiple genomics platforms can improve the prediction of cancer survival
time.
IT: Gradient boosted regression trees are used in search engines for page
rankings, while the Viola-Jones boosting algorithm is used for image retrieval.
As noted by Cornell, boosted classifiers allow the computations to be
stopped sooner when it’s clear which way a prediction is headed. This
means that a search engine can stop the evaluation of lower-ranked pages,
while image scanners will only consider images that actually contain the
desired object.
Finance: Boosting is used with deep learning models to automate critical
tasks, including fraud detection, pricing analysis, and more. For example,
boosting methods in credit card fraud detection and financial products pricing
analysis improve the accuracy of analyzing massive data sets to minimize
financial losses.
Adaboost algorithm
Boosting is an ensemble modeling technique that attempts to build a strong classifier
from the number of weak classifiers. It is done by building a model by using weak
models in series. Firstly, a model is built from the training data. Then the second
model is built which tries to correct the errors present in the first model. This
procedure is continued and models are added until either the complete training data
set is predicted correctly or the maximum number of models are added.
AdaBoost was the first really successful boosting algorithm developed for the
purpose of binary classification. AdaBoost is short for Adaptive Boosting and is a very
popular boosting technique that combines multiple “weak classifiers” into a single
“strong classifier”. It was formulated by Yoav Freund and Robert Schapire. They also
won the 2003 Gödel Prize for their work.
Algorithm:
1. Initialise the dataset and assign an equal weight to each data point.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weight of the wrongly classified data points.
4. If the required results have been obtained, go to step 5; otherwise, go to step 2.
5. End
Explanation:
The above diagram explains the AdaBoost algorithm in a very simple way. Let’s try
to understand it in a stepwise process:
 B1 consists of 10 data points of two types, plus(+) and minus(-); 5 of them are
plus(+), the other 5 are minus(-), and each one is initially assigned an equal
weight. The first model tries to classify the data points and generates a vertical
separator line, but it wrongly classifies 3 plus(+) points as minus(-).
 B2 consists of the 10 data points from the previous model in which the 3
wrongly classified plus(+) are weighted more so that the current model tries
more to classify these pluses(+) correctly. This model generates a vertical
separator line that correctly classifies the previously wrongly classified
pluses(+) but in this attempt, it wrongly classifies three minuses(-).
 B3 consists of the 10 data points from the previous model in which the 3
wrongly classified minus(-) are weighted more so that the current model tries
more to classify these minuses(-) correctly. This model generates a horizontal
separator line that correctly classifies the previously wrongly classified
minuses(-).
 B4 combines together B1, B2, and B3 in order to build a strong prediction
model which is much better than any individual model used.
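For illustration only (not part of the original notes), the reweighting idea above maps onto scikit-learn's AdaBoostClassifier with decision stumps as the weak classifiers; the dataset and settings are placeholders, and older scikit-learn versions name the first argument base_estimator instead of estimator.

# Hedged sketch: AdaBoost combining decision stumps into a strong classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # a decision stump
    n_estimators=50,    # number of weak classifiers combined
    learning_rate=1.0,  # weight applied to each classifier
    random_state=0,
)
ada.fit(X_train, y_train)  # misclassified points receive larger weights each round
print("test accuracy:", ada.score(X_test, y_test))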
Support Vector Machines
Support Vector Machine (SVM) is a supervised machine learning algorithm used for
both classification and regression. Though it can handle regression problems as well, it
is best suited for classification. The objective of the SVM algorithm is to find a hyperplane in an
N-dimensional space that distinctly classifies the data points. The dimension of the
hyperplane depends upon the number of features. If the number of input features is
two, then the hyperplane is just a line. If the number of input features is three, then the
hyperplane becomes a 2-D plane. It becomes difficult to imagine when the number of
features exceeds three.
Let’s consider two independent variables x1, x2 and one dependent variable which is
either a blue circle or a red circle.
Linearly Separable Data points
From the figure above it is very clear that there are multiple lines (our hyperplane here
is a line because we are considering only two input features x1, x2) that segregate our
data points or classify the red and blue circles. So how do we choose the best line, or in
general the best hyperplane, that segregates our data points?
Selecting the best hyper-plane:
One reasonable choice as the best hyperplane is the one that represents the largest
separation or margin between the two classes.
So we choose the hyperplane whose distance from it to the nearest data point on each
side is maximized. If such a hyperplane exists it is known as the maximum-margin
hyperplane/hard margin. So from the above figure, we choose L2.
Let’s consider a scenario like shown below
Here we have one blue ball in the boundary of the red ball. So how does SVM classify
the data? It’s simple! The blue ball in the boundary of red ones is an outlier of blue
balls. The SVM algorithm has the characteristics to ignore the outlier and finds the
best hyperplane that maximizes the margin. SVM is robust to outliers.
For this type of data, what SVM does is find the maximum margin as with the previous
data sets, and in addition it adds a penalty each time a point crosses the margin. The
margins in such cases are called soft margins. When there is a soft margin, the SVM
tries to minimize (1/margin + λ·(Σ penalty)). Hinge loss is a commonly used penalty: if
there are no violations, there is no hinge loss; if there are violations, the hinge loss is
proportional to the distance of the violation.
Till now, we were talking about linearly separable data(the group of blue balls and red
balls are separable by a straight line/linear line). What to do if data are not linearly
separable?
Say our data is as shown in the figure above. SVM solves this by creating a new
variable using a kernel. We take a point x_i on the line and create a new variable y_i as
a function of its distance from the origin. If we plot this, we get something like what is
shown below.
In this case, the new variable y is created as a function of distance from the origin. A
non-linear function that creates such a new variable is referred to as a kernel.
SVM Kernel:
The SVM kernel is a function that takes a low-dimensional input space and transforms
it into a higher-dimensional space, i.e. it converts a non-separable problem into a
separable problem. It is mostly useful in non-linear separation problems. Simply put,
the kernel performs some extremely complex data transformations and then finds the
process to separate the data based on the labels or outputs defined.
Advantages of SVM:
 Effective in high-dimensional cases.
 Memory efficient, as it uses a subset of training points in the decision function,
called support vectors.
 Different kernel functions can be specified for the decision function, and it is
possible to specify custom kernels.
Large Margin Intuition
· Sometimes people refer to the SVM as a large margin classifier. We'll consider what
that means and what an SVM hypothesis looks like. The SVM cost function is as
above, and the cost terms are drawn out below.
· On the left is cost1 and on the right is cost0. What does it take to make these terms small?
· If y = 1, cost1(z) = 0 only when z >= 1. If y = 0, cost0(z) = 0 only when z <= -1.
· An interesting property of the SVM: if you have a positive example, you only really
need z to be greater than or equal to 0 to predict 1.
· The SVM wants a bit more than that - it doesn't want to *just* get it right, but to have
the value be quite a bit bigger than zero. It throws in an extra safety margin factor.
· Logistic regression does something similar. What are the consequences of this?
Consider a case where we set C to be huge, say C = 100,000. Since we are minimizing
C·A + B, for us the A term shown in the figure below must be driven to 0.
· If C is huge, we're going to pick an A value so that A is equal to zero. What is the
optimization problem here - how do we make A = 0? If y = 1, then to make our "A"
term 0 we need to find a value of θ so that θᵀx is greater than or equal to 1.
· Similarly, if y = 0, then to make "A" = 0 we need to find a value of θ so that θᵀx is
less than or equal to -1.
· So, if we think of our optimization problem as a way to ensure that this first "A" term
is equal to 0, we can re-factor it into just minimizing the "B" (regularization) term,
because when A = 0, C·A = 0. So we're minimizing B, under the constraints shown below.
· Turns out when you solve this problem you get interesting decision boundaries
· The green and magenta lines are functional decision boundaries which could be
chosen by logistic regression, but they probably don't generalize too well.
· The black line, by contrast, is the one chosen by the SVM because of the safety net
imposed by the optimization. It is a more robust separator: mathematically, that black
line has a larger minimum distance (margin) from any of the training examples.
· By separating with the largest margin you incorporate robustness into your
decision-making process. We looked at this when C is very large; the SVM is more
sophisticated than the large-margin view might suggest.
· If you were just using the large margin criterion, the SVM would be very sensitive to outliers.
· You would risk letting a single example hugely impact your classification boundary,
and a single example is not a good reason to change the algorithm. If C is very large,
then we do use this quite naive maximize-the-margin approach.
· In that case we'd change from the black boundary to the magenta one. But if C is
reasonably small, or not too large, then you stick with the black decision boundary.
· What about non-linearly separable data? The SVM still does the right thing if you use
a normal-sized C. So the idea of the SVM as a large margin classifier is only really
relevant when you have no outliers and the data is easily linearly separable; otherwise
it effectively ignores a few outliers.
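A rough sketch of the effect of C (illustrative only, not from the original notes; the dataset and the two C values are placeholders) is shown below: a very large C approximates the naive hard-margin behaviour, while a moderate C gives a softer margin.

# Hedged sketch: comparing a huge C with a moderate C for a linear SVM.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for C in (100000.0, 1.0):
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=C))
    clf.fit(X_train, y_train)
    print(f"C={C}: train={clf.score(X_train, y_train):.3f}, "
          f"test={clf.score(X_test, y_test):.3f}")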
Loss Function - Hinge Loss
Loss functions play an important role in any statistical model - they define an
objective which the performance of the model is evaluated against and the parameters
learned by the model are determined by minimizing a chosen loss function.
A loss function takes a theoretical proposition to a practical one. Building a highly
accurate predictor requires constant iteration of the problem through questioning,
modeling the problem with the chosen approach and testing.
The only criterion by which a statistical model is scrutinized is its performance - how
accurate the model’s decisions are. This calls for a way to measure how far a
particular iteration of the model is from the actual values. This is where loss functions
come into play.
Loss functions measure how far an estimated value is from its true value. A loss
function maps decisions to their associated costs. Loss functions are not fixed; they
change depending on the task at hand and the goal to be met.
Loss functions for regression
Regression involves predicting a specific value that is continuous in nature.
Estimating the price of a house or predicting stock prices are examples of regression
because one works towards building a model that would predict a real-valued
quantity.
Let’s take a look at some loss functions which can be used for regression problems
and try to draw comparisons among them.
Mean Absolute Error (MAE)
Mean Absolute Error (also called L1 loss) is one of the simplest yet most robust loss
functions used for regression models.
Regression problems may have variables that are not strictly Gaussian in nature due to
the presence of outliers (values that are very different from the rest of the data). Mean
Absolute Error would be an ideal option in such cases because it does not take into
account the direction of the outliers (unrealistically high positive or negative values).
As the name suggests, MAE takes the average of the absolute differences between the
actual and the predicted values. For a data point x_i and its predicted value y_i, with n
being the total number of data points in the dataset, the mean absolute error is defined as:
MAE = (1/n) Σ_{i=1}^{n} |x_i − y_i|
Mean Squared Error (MSE)
Mean Squared Error (also called L2 loss) is almost every data scientist’s preference
when it comes to loss functions for regression. This is because most variables can be
modeled into a Gaussian distribution.
Mean Squared Error is the average of the squared differences between the actual and
the predicted values. For a data point Y_i and its predicted value Ŷ_i, where n is the
total number of data points in the dataset, the mean squared error is defined as:
MSE = (1/n) Σ_{i=1}^{n} (Y_i − Ŷ_i)²
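Both losses are straightforward to compute; a small NumPy sketch (with made-up values, for illustration only) is:

# Hedged sketch: MAE and MSE on a few illustrative values.
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])  # actual values (placeholders)
y_pred = np.array([2.5,  0.0, 2.0, 8.0])  # predicted values (placeholders)

mae = np.mean(np.abs(y_true - y_pred))    # average absolute difference
mse = np.mean((y_true - y_pred) ** 2)     # average squared difference

print("MAE:", mae)  # 0.5
print("MSE:", mse)  # 0.375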
Mean Bias Error (MBE)
Mean Bias Error is used to calculate the average bias in the model. Bias, in a nutshell,
is overestimating or underestimating a parameter. Corrective measures can be taken to
reduce the bias post-evaluating the model using MBE.
Mean Bias Error takes the actual difference between the target and the predicted value,
and not the absolute difference. One has to be cautious as the positive and the
negative errors could cancel each other out, which is why it is one of the lesser-used
loss functions.
The formula of Mean Bias Error is:
MBE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)
where y_i is the true value, ŷ_i is the predicted value and n is the total number of data
points in the dataset.
Mean Squared Logarithmic Error (MSLE)
Sometimes, one may not want to penalize the model too much for predicting unscaled
quantities directly. Relaxing the penalty on huge differences can be done with the help
of Mean Squared Logarithmic Error.
Calculating the Mean Squared Logarithmic Error is the same as calculating the Mean
Squared Error, except that the natural logarithms of the values are used rather than the
values themselves:
MSLE = (1/n) Σ_{i=1}^{n} (log(y_i + 1) − log(ŷ_i + 1))²
where y_i is the true value, ŷ_i is the predicted value and n is the total number of data
points in the dataset.
Huber Loss
A comparison between L1 and L2 loss yields the following results:
1. L1 loss is more robust than its counterpart.
On taking a closer look at the formulas, one can observe that if the difference between
the predicted and the actual value is high, L2 loss magnifies the effect when compared
to L1. Since L2 succumbs to outliers, L1 loss function is the more robust loss
function.
2. L1 loss is less stable than L2 loss.
Since L1 loss deals with the difference in distances, a small horizontal change can
lead to the regression line jumping a large amount. Such an effect taking place across
multiple iterations would lead to a significant change in the slope between iterations.
On the other hand, MSE ensures the regression line moves lightly for a small
adjustment in the data point.
Huber Loss combines the robustness of L1 with the stability of L2, essentially the best
of L1 and L2 losses. For huge errors, it is linear and for small errors, it is quadratic in
nature.
Huber Loss is characterized by the parameter delta (δ), which sets the error threshold
below which the loss is quadratic and above which it is linear.
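A minimal NumPy sketch of Huber loss (illustrative only; the delta value and data are placeholders) is:

# Hedged sketch: Huber loss is quadratic for small errors, linear for large ones.
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    is_small = np.abs(error) <= delta
    quadratic = 0.5 * error ** 2                    # used when |error| <= delta
    linear = delta * (np.abs(error) - 0.5 * delta)  # used when |error| > delta
    return np.mean(np.where(is_small, quadratic, linear))

y_true = np.array([1.0, 2.0, 6.0])
y_pred = np.array([1.2, 2.1, 3.0])
print("Huber loss:", huber_loss(y_true, y_pred, delta=1.0))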
Loss functions for classification
Classification problems involve predicting a discrete class output. It involves dividing
the dataset into different and unique classes based on different parameters so that a
new and unseen record can be put into one of the classes.
An email can be classified as spam or not spam, and a person’s dietary preferences can
be put in one of three categories - vegetarian, non-vegetarian and vegan. Let’s
take a look at loss functions that can be used for classification problems.
Binary Cross Entropy Loss
This is the most common loss function used for classification problems that have two
classes. The word “entropy”, seemingly out-of-place, has a statistical interpretation.
Entropy is the measure of randomness in the information being processed, and cross
entropy is a measure of the difference of the randomness between two random
variables.
If the divergence of the predicted probability from the actual label increases, the
cross-entropy loss increases. Going by this, predicting a probability of .011 when the
actual observation label is 1 would result in a high loss value. In an ideal situation, a
“perfect” model would have a log loss of 0. Looking at the loss function would make
things even clearer -
BCE = −(1/n) Σ_{i=1}^{n} [ y_i · log(h_θ(x_i)) + (1 − y_i) · log(1 − h_θ(x_i)) ]
where y_i is the true label and h_θ(x_i) is the predicted value post hypothesis.
Since binary classification means the classes take either 0 or 1, if y_i = 0, the first term
ceases to exist, and if y_i = 1, the (1 − y_i) term becomes 0.
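A small NumPy sketch of this loss (illustrative only; the labels and probabilities are made up) shows how a confident wrong prediction dominates the loss:

# Hedged sketch: binary cross entropy for a handful of predictions.
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 1])            # actual labels (placeholders)
y_prob = np.array([0.9, 0.1, 0.8, 0.011])  # predicted probabilities (placeholders)

print("BCE:", binary_cross_entropy(y_true, y_prob))
# The last prediction (0.011 for a true label of 1) contributes most of the loss.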
Categorical Cross Entropy Loss
Categorical Cross Entropy loss is essentially Binary Cross Entropy Loss expanded to
multiple classes. One requirement when categorical cross entropy loss function is
used is that the labels should be one-hot encoded.
This way, only one element will be non-zero, as the other elements in the vector are
multiplied by zero. This property is used together with an activation function called
softmax, which converts the model’s outputs into a probability distribution over the classes.
Hinge Loss
Another commonly used loss function for classification is the hinge loss. Hinge loss is
primarily developed for support vector machines for calculating the maximum margin
from the hyperplane to the classes.
Loss functions penalize wrong predictions and do not do so for the right predictions.
So, the score of the target label should be greater than the score of each of the incorrect
labels by a margin of at least one.
This margin corresponds to the maximum margin from the hyperplane to the data
points, which is why hinge loss is preferred for SVMs. The following image clears the
air on what a hyperplane and maximum margin are:
The mathematical formulation of hinge loss is as follows:
L = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + 1)
where s_j is the score predicted for an incorrect class j and s_{y_i} is the score predicted
for the true class y_i.
Hinge Loss is also extended to Squared Hinge Loss Error and Categorical Hinge Loss
Error.
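A minimal NumPy sketch of the multi-class hinge loss written above (illustrative only; the scores are made up) is:

# Hedged sketch: hinge loss penalises incorrect-class scores that are not
# below the true-class score by a margin of at least 1.
import numpy as np

def hinge_loss(scores, true_index):
    margins = np.maximum(0, scores - scores[true_index] + 1)
    margins[true_index] = 0  # the true class contributes no loss
    return margins.sum()

scores = np.array([2.0, 5.0, -1.0])  # classifier scores for 3 classes (placeholders)
print("hinge loss:", hinge_loss(scores, true_index=0))  # max(0, 5 - 2 + 1) = 4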
Kullback Leibler Divergence Loss (KL Loss)
Kullback Leibler Divergence Loss is a measure of how a distribution varies from a
reference distribution (or a baseline distribution). A Kullback Leibler Divergence
Loss of zero means that both the probability distributions are identical.
The amount of information lost when the predicted distribution is used to approximate
the reference distribution serves as the measure. The KL Divergence of a distribution
P(x) from Q(x) is given by:
KL(P || Q) = Σ_x P(x) · log(P(x) / Q(x))
SVM Kernels
A Kernel Function is a method used to take data as input and transform it into the form
required for processing. The term “Kernel” refers to the set of mathematical functions
used in the Support Vector Machine that provide the window to manipulate the data. A
Kernel Function generally transforms the training data so that a non-linear decision
surface becomes a linear equation in a higher-dimensional space. Basically, it returns
the inner product between two points in a standard feature space.
Standard Kernel Function Equation :
Major Kernel Functions :-
For Implementing Kernel Functions, first of all, we have to install the “scikit-learn”
library using the command prompt terminal:
pip install scikit-learn
 Gaussian Kernel: It is used to perform transformation when there is no prior
knowledge about data.
 Gaussian Kernel Radial Basis Function (RBF): Same as above kernel
function, adding radial basis method to improve the transformation.
Gaussian Kernel Graph
 Sigmoid Kernel: This function is equivalent to a two-layer perceptron model of a
neural network, which is used as an activation function for artificial neurons.
Sigmoid Kernel Graph
 Polynomial Kernel: It represents the similarity of vectors in the training set of
data in a feature space over polynomials of the original variables used in the
kernel.
Polynomial Kernel Graph
 Linear Kernel: used when data is linearly separable.
Linear: k(x,z)=x⊤ z
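For illustration only (not from the original notes), the kernels listed above can be compared with scikit-learn's SVC; the iris dataset and default kernel parameters are placeholder choices.

# Hedged sketch: cross-validated accuracy of SVC with different kernels.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

for kernel in ("linear", "rbf", "poly", "sigmoid"):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{kernel:>8} kernel: mean accuracy = {scores.mean():.3f}")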