ML Notes
Distance Based Methods: Distance-based algorithms are machine learning methods that classify a query point by computing distances between the query and the stored training examples.
K-Nearest Neighbour (KNN) Algorithm:
• K-Nearest Neighbour is one of the simplest machine learning algorithms based on supervised
learning technique.
• K-NN assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
• K-NN stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using K-NN.
• K-NN algorithm can be used for Regression as well as for Classification but mostly it is used
for the classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs an action on the dataset.
• At the training phase, the KNN algorithm just stores the dataset, and when it gets new data it classifies that data into the category that is most similar to the new data.
Why do we need the KNN algorithm?
Suppose there are two categories, Category A and Category B, and we have a new data point x1. Which of these categories will this data point belong to? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:
Advantages of the KNN Algorithm:
• It is simple to implement.
• It is robust to the noisy training data.
• It can be more effective if the training data is large.
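To make the above concrete, here is a minimal sketch (not part of the original notes) using scikit-learn's KNeighborsClassifier on the built-in Iris dataset; the choice of k = 5 is arbitrary and only for illustration.

```python
# Minimal KNN sketch with scikit-learn (illustrative; k = 5 chosen arbitrarily).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)   # K = 5 nearest neighbours
knn.fit(X_train, y_train)                   # "lazy" learning: the data is simply stored
print("Test accuracy:", knn.score(X_test, y_test))
```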
Decision Tree Algorithm:
• Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but mostly it is preferred for solving classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.
• In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision
nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the
output of those decisions and do not contain any further branches.
• The decisions or the test are performed on the basis of features of the given dataset.
• It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
• In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
• A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.
• Below diagram explains the general structure of a decision tree:
• Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further
after getting a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root into sub-nodes according
to the given conditions.
• Branch/Sub Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing the unwanted branches from the tree.
• Parent/Child node: The root node of the tree is called the parent node, and the other nodes are called the child nodes.
How does the decision tree algorithm work?
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where you cannot further classify the nodes; the final node is called a leaf node.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he should
accept the offer or Not. So, to solve this problem, the decision tree starts with the root node (Salary
attribute by ASM). The root node splits further into the next decision node (distance from the office)
and one leaf node based on the corresponding labels. The next decision node further gets split into one
decision node (Cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes
(Accepted offer and Declined offer). Consider the below diagram:
Advantages of the Decision Tree:
• It is simple to understand, as it follows the same process which a human follows while making any decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think about all the possible outcomes for a problem.
• There is less requirement for data cleaning compared to other algorithms.
Disadvantages of the Decision Tree:
• The decision tree contains many levels, which can make it complex.
• It may have an overfitting issue, which can be resolved using pruning or Random Forest.
• For more class labels, the computational complexity of the decision tree may increase.
Naïve Bayes Classifier Algorithm:
• The Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes' theorem and is used for solving classification problems.
• It is mainly used in text classification that includes a high-dimensional training dataset.
• Naïve Bayes Classifier is one of the simplest and most effective classification algorithms, and it helps in building fast machine learning models that can make quick predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
• Some popular examples of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.
Why is it called Naïve Bayes?
The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be described as:
• Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple; each feature individually contributes to identifying it as an apple without depending on the others.
• Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.
Bayes' Theorem:
• Bayes' theorem is also known as Bayes' Rule or Bayes' Law, and is used to determine the probability of a hypothesis with prior knowledge. It depends on conditional probability.
• The formula for Bayes' theorem is given as:
P(A|B) = [ P(B|A) · P(A) ] / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that hypothesis A is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
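A tiny worked example of Bayes' theorem, using made-up numbers for a hypothetical spam filter (illustration only):

```python
# Worked Bayes' theorem example with made-up numbers (hypothetical spam filter).
# P(spam | "offer") = P("offer" | spam) * P(spam) / P("offer")
p_spam = 0.2                       # prior probability P(A)
p_offer_given_spam = 0.6           # likelihood P(B|A)
p_offer_given_ham = 0.05           # P(B | not A)
p_offer = p_offer_given_spam * p_spam + p_offer_given_ham * (1 - p_spam)   # marginal P(B)

posterior = p_offer_given_spam * p_spam / p_offer
print(round(posterior, 3))         # 0.75 = posterior P(spam | "offer")
```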
Advantages of Naïve Bayes Classifier:
• Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
• It can be used for Binary as well as Multi-class classifications.
• It performs well in Multi-class predictions as compared to the other algorithms.
• It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
• Naïve bayes assumes that all features are independent or unrelated, so it cannot learn the
relationship between features.
Applications of Naïve Bayes Classifier:
• Spam filtering
• Sentiment analysis
• Classifying articles and documents
Linear Regression:
Linear regression is a supervised learning algorithm that models a linear relationship between a dependent variable and one or more independent variables. It has two types:
• Simple Linear Regression: In simple linear regression, a single independent variable is used to predict the value of the dependent variable.
• Multiple Linear Regression: In multiple linear regression, more than one independent variable is used to predict the value of the dependent variable.
Linear Regression Line:
A linear line showing the relationship between the dependent and independent variables is called a
regression line. A regression line can show two types of relationship:
Positive Linear Relationship:
• If the dependent variable increases on the Y-axis and independent variable increases on X-
axis, then such a relationship is termed as a Positive linear relationship.
Negative Linear Relationship:
• If the dependent variable decreases on the Y-axis and independent variable increases on the
X-axis, then such a relationship is called a negative linear relationship.
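A minimal simple linear regression sketch on synthetic data, assuming scikit-learn and NumPy are available; the slope 3 and intercept 5 are arbitrary values used only to generate the toy data.

```python
# Simple linear regression sketch: one independent variable, positive linear relationship.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(100, 1) * 10                     # single independent variable
y = 3.0 * X.ravel() + 5.0 + rng.randn(100)    # y = 3x + 5 plus noise

model = LinearRegression().fit(X, y)
print("Learned slope:", model.coef_[0], "intercept:", model.intercept_)
print("Prediction at x = 4:", model.predict([[4.0]])[0])
```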
Logistic Regression:
• Logistic regression is one of the most popular machine learning algorithms, which comes under the supervised learning technique. It is used for predicting a categorical dependent variable using a given set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or
False, etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic values
which lie between 0 and 1.
• Logistic regression is very similar to linear regression except in how they are used: linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems.
• In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
• The curve from the logistic function indicates the likelihood of something such as whether the
cells are cancerous or not, a mouse is obese or not based on its weight, etc.
• Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
• Logistic Regression can be used to classify the observations using different types of data and
can easily determine the most effective variables used for the classification. The below image
is showing the logistic function:
Assumptions for Logistic Regression:
• The dependent variable must be categorical in nature.
• The independent variables should not have multi-collinearity.
Logistic Regression Equation:
The logistic regression equation can be obtained from the straight-line (linear regression) equation:
y = b0 + b1·x1 + b2·x2 + ... + bn·xn
• In logistic regression, y can only be between 0 and 1, so we divide the above equation by (1 − y):
y / (1 − y), which is 0 for y = 0 and infinity for y = 1.
• But we need a range between −infinity and +infinity, so taking the logarithm of the equation gives:
log[ y / (1 − y) ] = b0 + b1·x1 + b2·x2 + ... + bn·xn
Types of Logistic Regression:
• Binomial: In binomial logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as “cat”, “dogs”, or “sheep”.
• Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as “low”, “Medium”, or “High”.
Support Vector Machine (SVM):
A Support Vector Machine is a supervised learning algorithm that finds an optimal hyperplane (decision boundary) separating the classes. SVM can be of two types:
• Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
• Non-Linear SVM: Non-Linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Advantages:
• It uses a subset of training points in the decision function which makes it memory efficient
and is highly effective in high dimensional spaces.
Disadvantages:
• The only disadvantage with the support vector machine is that the algorithm does not directly
provide probability estimates.
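A short sketch contrasting a linear and a non-linear (RBF-kernel) SVM on data that no straight line can separate; the dataset and kernel settings are illustrative choices, not part of the original notes.

```python
# Linear vs. non-linear SVM on concentric circles (not linearly separable).
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, "SVM accuracy:", clf.score(X_test, y_test))
# The RBF (non-linear) kernel separates the circles well; the linear kernel cannot.
```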
Non-linearity and Kernel Methods:
When the data is not linearly separable in the original feature space, kernel methods implicitly map the data into a higher-dimensional space in which a linear separator exists, without explicitly computing that mapping (the "kernel trick"). Commonly used kernels include the linear, polynomial, and Gaussian (RBF) kernels.
Binary Classification:
Binary classification is the task of classifying the given data into two classes. It is basically a kind of prediction about which of two groups the thing belongs to.
Let us suppose, two emails are sent to you, one is sent by an insurance company that keeps sending
their ads, and the other is from your bank regarding your credit card bill. The email service provider
will classify the two emails, the first one will be sent to the spam folder and the second one will be
kept in the primary one.
This process is known as binary classification, as there are two discrete classes, one is spam and the
other is primary. So, this is a problem of binary classification.
Term Related to binary Classification:
1. Precision: Precision in binary classification (Yes/No) refers to a model's ability to correctly
interpret positive observations. In other words, how often does a positive value forecast turn
out to be correct? We may manipulate this metric by only returning positive for the single
observation in which we have the most confidence.
2. Recall: Recall is also known as sensitivity. In binary classification (Yes/No), recall is used to measure how "sensitive" the classifier is to detecting positive cases. To put it another way, how many of the real positive cases did we "catch" in our sample? We could manipulate this metric by simply classifying every observation as positive.
3. F1 Score: The F1 score can be thought of as a weighted average of precision and recall, with the best value being 1 and the worst being 0. Precision and recall make an equal contribution to the F1 score.
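A small sketch computing these three metrics with scikit-learn on made-up labels:

```python
# Precision, recall, and F1 for a toy binary prediction (labels are made up).
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP) = 0.8
print("Recall:   ", recall_score(y_true, y_pred))       # TP / (TP + FN) = 0.8
print("F1 score: ", f1_score(y_true, y_pred))           # harmonic mean of the two
```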
Multi-class Classification:
An input can belong to exactly one of the K classes
Training Data: Each input feature vector xi is associated with a class label yi ∈ {1, . . . ,K}
Prediction: Given a new input, predict the class label
Examples: object classification, document classification, optical character recognition, context-sensitive spelling correction, etc.
Structured Output Prediction:
With multiclass classification we can already (more or less successfully):
• Assign topics to documents
• Assign names to object images
• Assign sentiments to reviews
In structured output prediction, the output is a structured object (such as a sequence, tree, or segmentation) rather than a single class label.
MNIST:
The MNIST database (Modified National Institute of Standards and Technology database[1]) is a
large database of handwritten digits that is commonly used for training various image processing
systems.[2][3] The database is also widely used for training and testing in the field of machine
learning.
The MNIST database of handwritten digits consists of a training set of 60,000 examples, and a test set
of 10,000 examples. It is a subset of a larger set available from NIST. Additionally, the black and
white images from NIST were size-normalized and centered to fit into a 28x28 pixel bounding box
and anti-aliased, which introduced grayscale levels.
This database is well liked for training and testing in the field of machine learning and image
processing. It is a remixed subset of the original NIST datasets. One half of the 60,000 training images consists of images from NIST's testing dataset and the other half from NIST's training set. The 10,000 images from the testing set are similarly assembled.
The MNIST dataset is used by researchers to test and compare their research results with others. The
lowest error rates in literature are as low as 0.21 percent.
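A minimal sketch for loading MNIST through OpenML with scikit-learn and recreating the 60,000/10,000 split described above (assumes an internet connection for the first download):

```python
# Loading MNIST via OpenML and recreating the standard train/test split.
from sklearn.datasets import fetch_openml

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
print("Data shape:", X.shape)                   # (70000, 784): 28 x 28 = 784 pixels per image
print("Pixel range:", X.min(), "to", X.max())   # grayscale levels 0-255
X_train, X_test = X[:60000], X[60000:]          # standard 60,000 / 10,000 split
y_train, y_test = y[:60000], y[60000:]          # labels are the digit strings '0'-'9'
```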
Ranking:-
Ranking is a type of supervised machine learning (ML) that uses labeled datasets to train its data and
models to classify future data to predict outcomes. Quite simply, the goal of a ranking model is to sort
data in an optimal and relevant order.
Ranking was first largely deployed within search engines. People search for a topic, while the ranking
algorithm reorders search results based on the PageRank, and the search engine is able to display the
most relevant results to its customers.
Until recently, most ranking models, and ML as a whole, were limited in their scope of use, as most
companies didn’t have enough data to power these algorithms. Better methods for data collection and
more intuitive ML tools have made it possible for nearly anyone to deploy a successful ranking model
within their business.
Ranking models are made up of 2 main factors: queries and documents. Queries are any input value,
such as a question on Google or an interaction on an e-commerce site. Documents are the output value
or results of the query. Given the query, and the associated documents, a function, given a list of
parameters to rank on, will score the documents to be sorted in order of relevancy.
The machine learning algorithm (learning to rank) takes the scores from this model and uses them to predict future outcomes on a new and unseen list of documents.
UNIT – 2
Unsupervised Learning:
Unsupervised learning is the training of a machine learning model using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. Here
the task of the machine is to group unsorted information according to similarities, patterns, and
differences without any prior training of data.
Unlike supervised learning, no teacher is provided that means no training will be given to the
machine. Therefore the machine is restricted to find the hidden structure in unlabeled data by itself.
For instance, suppose the machine is shown an image containing both dogs and cats, which it has never seen before.
Thus the machine has no idea about the features of dogs and cats so we can’t categorize it as ‘dogs
and cats ‘. But it can categorize them according to their similarities, patterns, and differences, i.e., we
can easily categorize the above picture into two parts. The first may contain all pics having dogs in
them and the second part may contain all pics having cats in them. Here you didn’t learn anything
before, which means no training data or examples.
It allows the model to work on its own to discover patterns and information that was previously
undetected. It mainly deals with unlabelled data.
Unsupervised learning is classified into two categories of algorithms:
1. Clustering
2. Association
Clustering:
Clustering or cluster analysis is a machine learning technique, which groups the unlabelled dataset. It
can be defined as "A way of grouping the data points into different clusters, consisting of similar data
points. The objects with the possible similarities remain in a group that has less or no similarities with
another group."
It does it by finding some similar patterns in the unlabelled dataset such as shape, size, color,
behavior, etc., and divides them as per the presence and absence of those similar patterns.
It is an unsupervised learning method, hence no supervision is provided to the algorithm, and it deals
with the unlabeled dataset.
After applying this clustering technique, each cluster or group is given a cluster ID. The ML system can use this ID to simplify the processing of large and complex datasets.
The clustering technique can be widely used in various tasks. Some most common uses of this
technique are:
1. Market segmentation
2. Statistical data analysis
3. Social network analysis
4. Image segmentation
5. Anomaly detection, etc.
K-Means :
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into
different clusters. Here K defines the number of pre-defined clusters that need to be created in the
process, as if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group that has similar properties.
It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in the unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this
algorithm is to minimize the sum of distances between the data point and their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k-number of clusters, and
repeats the process until it does not find the best clusters. The value of k should be predetermined in
this algorithm.
The k-means clustering algorithm mainly performs two tasks:
• Determines the best value for the K center points or centroids by an iterative process.
• Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster.
Hence each cluster has datapoints with some commonalities, and it is away from other clusters.
The below diagram explains the working of the K – means clustering algorithm.
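A minimal K-Means sketch on synthetic data, assuming scikit-learn; K = 3 is chosen to match the number of generated blobs.

```python
# K-Means sketch: cluster unlabeled 2-D data into K = 3 clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # unlabeled data

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)                 # cluster ID assigned to every data point
print("Centroids:\n", kmeans.cluster_centers_)
print("First 10 cluster assignments:", labels[:10])
```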
Advantages of K-means
1. It is very simple to implement.
2. It is scalable to huge datasets and is faster for large datasets.
3. It adapts to new examples easily.
4. It generalizes to clusters of different shapes and sizes.
Disadvantages of K-means
1. It is sensitive to the outliers.
2. Choosing the k values manually is a tough job.
3. As the number of dimensions increases its scalability decreases.
Dimensionality reduction:
Dimensionality reduction can be defined as "a way of converting a higher-dimensional dataset into a lower-dimensional dataset while ensuring that it provides similar information." These
techniques are widely used in machine learning for obtaining a better fit predictive model while
solving the classification and regression problems.
It is commonly used in the fields that deal with high-dimensional data, such as speech recognition,
signal processing, bioinformatics, etc. It can also be used for data visualization, noise reduction,
cluster analysis, etc.
Curse of Dimensionality
Curse of Dimensionality refers to a set of problems that arise when working with high-dimensional
data. The dimension of a dataset corresponds to the number of attributes/features that exist in a
dataset. A dataset with a large number of attributes, generally of the order of a hundred or more, is
referred to as high dimensional data. Some of the difficulties that come with high dimensional data
manifest during analyzing or visualizing the data to identify patterns, and some manifest while
training machine learning models. The difficulties related to training machine learning models due to high-dimensional data are referred to as the 'Curse of Dimensionality'. Two well-known aspects of the curse of dimensionality are 'data sparsity' and 'distance concentration'.
Benefits of applying Dimensionality Reduction
1. By reducing the dimensions of the features, the space required to store the dataset also gets
reduced.
2. Less computation training time is required for reduced dimensions of features.
3. Reduced dimensions of features of the dataset help in visualizing the data quickly.
4. It removes the redundant features by taking care of multicollinearity.
Disadvantages
1. Some data may be lost due to dimensionality reduction.
2. In the PCA dimensionality reduction technique, the number of principal components to consider is sometimes unknown.
Approaches of Dimensionality Reduction:
Common techniques of dimensionality reduction:
1. Principal component analysis
2. Backward elimination
3. Forward selection
4. Score comparison
5. Missing value ratio
6. Low variance filter
7. High correlation filter
8. Random forest
9. Factor analysis
10. Auto encoder
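As an illustration of the first technique in the list (principal component analysis), a minimal scikit-learn sketch reducing the 4-dimensional Iris data to 2 components:

```python
# PCA sketch: reduce 4 features to 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("Original shape:", X.shape, "-> reduced shape:", X_reduced.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```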
Kernel PCA:
PCA is a linear method; that is, it can only be applied effectively to datasets which are linearly separable. It does an excellent job for such datasets, but if we use it on non-linear datasets, we might get a result which is not the optimal dimensionality reduction. Kernel PCA uses a kernel function to project the dataset into a higher-dimensional feature space, where it becomes linearly separable. This is similar to the idea behind Support Vector Machines. There are various kernel functions, such as linear, polynomial, and Gaussian (RBF).
In the kernel space, the two classes become linearly separable, and standard PCA can then be applied there. In practice, kernel PCA can be applied to non-linear datasets conveniently using libraries such as scikit-learn.
Kernel Principal Component Analysis (PCA) is a technique for dimensionality reduction in machine
learning that uses the concept of kernel functions to transform the data into a high-dimensional feature
space. In traditional PCA, the data is transformed into a lower-dimensional space by finding the
principal components of the covariance matrix of the data. In kernel PCA, the data is transformed into
a high dimensional feature space using a non-linear mapping function, called a kernel function, and
then the principal components are found in this high-dimensional space.
The following are the general steps in Kernel PCA:
Select a kernel function: Choose an appropriate kernel function according to the properties of the
data. The challenge at hand and the data's underlying structure determine which kernel should be
used.
Compute the kernel matrix: Using the selected kernel function, compute the pairwise similarity (or distance) between data points. This produces the kernel matrix, a symmetric positive semi-definite matrix.
Choose the principal components: The top k eigenvectors corresponding to the largest eigenvalues are selected to create the reduced-dimensional representation of the data.
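A minimal kernel PCA sketch on the classic concentric-circles dataset, assuming scikit-learn; the RBF kernel and the gamma value are illustrative choices.

```python
# Kernel PCA on a non-linear dataset (concentric circles) using an RBF kernel.
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)   # gamma chosen by hand here
X_kpca = kpca.fit_transform(X)
print("Original shape:", X.shape, "-> kernel PCA output shape:", X_kpca.shape)
# In the kernel-induced feature space the two circles become (approximately) linearly
# separable, which linear PCA on the raw 2-D coordinates cannot achieve.
```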
Applications:
o One of the many uses for kernel PCA is the reduction of nonlinear dimensionality.
o Identification and categorization of patterns in high-dimensional environments.
o Signal and image processing.
o Bioinformatics and genetics to analyze biological data with complicated interactions.
Advantages of Kernel PCA:
1. Non-linearity: Kernel PCA can capture non-linear patterns in the data that are not possible with
traditional linear PCA.
2. Robustness: Kernel PCA can be more robust to outliers and noise in the data, as it considers the
global structure of the data, rather than just local distances between data points.
3. Versatility: Different types of kernel functions can be used in kernel PCA to suit different types of
data and different objectives.
Disadvantages of Kernel PCA:
1. Complexity: Kernel PCA can be computationally expensive, especially for large datasets, as it
requires the calculation of eigenvectors and eigenvalues.
2. Model selection: Choosing the right kernel function and the right number of components can be
challenging and may require expert knowledge or trial and error.
Matrix Factorization:
Matrix Factorization in Machine Learning
1. What is Matrix Factorization?
Matrix Factorization (MF) is a dimensionality reduction technique that decomposes a large matrix
into two or more smaller matrices to extract meaningful patterns from data. It is widely used in
recommendation systems, latent feature discovery, and dimensionality reduction.
Why Use Matrix Factorization?
Reduces High-Dimensional Data – Converts large matrices into smaller ones.
Extracts Latent Features – Helps discover hidden structures in data.
Speeds Up Computation – Makes operations on large datasets computationally efficient.
Handles Missing Data – Common in collaborative filtering for recommendation systems.
Mathematical Formulation:
A ≈ W × H
Where:
• A is the original matrix (m × n).
• W is the first factor matrix (m × k).
• H is the second factor matrix (k × n).
• k is the reduced number of latent features.
This transformation allows us to approximate A using a lower-dimensional representation.
Common Matrix Factorization Methods:
1. Singular Value Decomposition (SVD):
Application:
o Used in dimensionality reduction and recommendation systems.
o Example: the Netflix Prize Challenge used SVD for movie recommendations.
2. Non-Negative Matrix Factorization (NMF):
Why Non-Negative?
o Helps in feature interpretability.
o Common in text mining and image processing.
Application:
o Topic modeling in Natural Language Processing (NLP).
o Feature extraction in image recognition.
3. Alternating Least Squares (ALS):
Application:
o Scalable for large-scale recommendation systems (e.g., Netflix, Spotify).
Example: A user–movie rating matrix with missing entries (?):
            Movie 1   Movie 2   Movie 3   Movie 4
User 1         5         ?         3         ?
User 2         ?         4         ?         5
User 3         2         ?         1         3
Where:
o W represents users' preferences (e.g., genre preferences).
o H represents movies' characteristics.
• Missing values (e.g., unobserved ratings) can be predicted using this factorization.
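A toy NumPy-only sketch that factorizes the small rating matrix above as A ≈ W·H by gradient descent on the observed entries; all values and hyperparameters are made up for illustration.

```python
# Toy matrix factorization A ≈ W·H fitted only on observed ratings (NaN = missing).
import numpy as np

A = np.array([[5, np.nan, 3, np.nan],
              [np.nan, 4, np.nan, 5],
              [2, np.nan, 1, 3]], dtype=float)
observed = ~np.isnan(A)
k = 2                                        # number of latent features
rng = np.random.RandomState(0)
W = rng.rand(A.shape[0], k) * 0.5            # user factors
H = rng.rand(k, A.shape[1]) * 0.5            # item (movie) factors

lr = 0.01
for _ in range(5000):
    E = np.where(observed, A - W @ H, 0.0)   # reconstruction error on observed entries only
    W += lr * E @ H.T                        # gradient step for user factors
    H += lr * W.T @ E                        # gradient step for item factors

print(np.round(W @ H, 2))                    # the ? entries now have predicted ratings
```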
Advantages
1. Efficient for Large Datasets – Handles sparse matrices effectively.
2. Extracts Meaningful Features – Identifies latent patterns in data.
3. Handles Missing Data – Useful in recommendation systems.
4. Scalable – Methods like ALS scale well with large datasets.
Disadvantages
1. Computationally Expensive – Some factorization methods require high computation.
2. Choice of Rank k is Crucial – Selecting the correct number of latent features is
challenging.
3. Sensitive to Noise – Noisy data can affect factorization quality.
Matrix Completion:
Matrix Completion in Machine Learning
1. What is Matrix Completion?
Matrix Completion is a technique in machine learning used to predict missing values in a partially
observed matrix. It is widely used in recommendation systems, collaborative filtering, image
inpainting, and anomaly detection.
Why Matrix Completion?
• Real-world data is often incomplete (e.g., missing ratings in recommendation systems,
corrupted pixels in images).
• Matrix completion can predict missing values by learning patterns from observed data.
• It helps in dimensionality reduction and feature extraction for better predictions.
2. Mathematical Formulation
Problem Statement
Given a partially observed matrix A of size m × n, we aim to find a completed matrix Â such that Â agrees with A on the observed entries while accurately predicting the missing ones.
Low-Rank Approximation
• Most real-world data matrices are low-rank:
o Users prefer only a few categories of movies (recommendations).
o Images contain repetitive patterns (compression).
• We assume that the true matrix A is low-rank, meaning it can be decomposed into smaller matrices:
A ≈ W × H
where:
o W is a matrix of user preferences (e.g., movie genres).
o H is a matrix of item features (e.g., movie properties).
Finance & Fraud Detection Completes missing financial transactions to detect anomalies.
1. Handles Missing Data Efficiently – Predicts unknown values with high accuracy.
2. Extracts Meaningful Patterns – Identifies latent structures in data.
3. Scales Well for Large Datasets – Especially when using ALS or PMF.
Disadvantages
Generative Models:
Generative Models in Machine Learning: Mixture Models and Latent Factor Models
Generative models are a class of machine learning models that learn the underlying distribution of
data to generate new samples similar to the given dataset. Unlike discriminative models (which focus
on classification), generative models aim to model the joint probability distribution P(X,Y)P(X, Y)
and can generate new instances from learned patterns.
2. Mixture Models
A Mixture Model assumes that data is drawn from a combination of multiple underlying probability
distributions.
2.1 Gaussian Mixture Model (GMM)
One of the most common mixture models is the Gaussian Mixture Model (GMM), which assumes
that data is generated from a mixture of multiple Gaussian distributions.
Mathematical Formulation
A Gaussian Mixture Model represents the density of a dataset as a weighted sum of K Gaussians:
p(x) = Σ_k π_k · N(x | μ_k, Σ_k)
where π_k are the mixture weights (summing to 1) and N(x | μ_k, Σ_k) is a Gaussian with mean μ_k and covariance Σ_k; the parameters are typically estimated with the Expectation-Maximization (EM) algorithm.
3. Latent Factor Models
Latent factor models such as PLSA/LDA explain documents through hidden topics, modeling the probability of a word w in a document d as:
P(w|d) = Σ_z P(w|z) · P(z|d)
where:
• P(w∣z) is the probability of word w given topic z.
• P(z∣d) is the probability of topic z in document d.
Example:
• A document about sports might have words like "football," "goal," and "player" with high
probability under a "Sports" topic.
Latent factors are also used for collaborative filtering, where the rating matrix is factorized as A ≈ W × H,
where:
• A is the user-item rating matrix.
• W represents user preferences.
• H represents item properties.
Comparison:
Feature            | Mixture Models (GMM)          | Latent Factor Models (LDA, PLSA)
Common Algorithm   | Expectation-Maximization (EM) | Variational Inference, Gibbs Sampling
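To tie this back to the mixture-model column, here is a minimal scikit-learn sketch that fits a GMM with EM and then samples new points from the learned (generative) model; the data and number of components are arbitrary.

```python
# Fit a Gaussian Mixture Model with EM and generate new samples from it.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=1)

gmm = GaussianMixture(n_components=3, random_state=1).fit(X)   # EM runs under the hood
print("Mixture weights:", gmm.weights_)
print("Component means:\n", gmm.means_)
new_samples, _ = gmm.sample(5)     # generative step: draw new data from the learned model
print("Generated samples:\n", new_samples)
```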
UNIT – 3
Evaluating Machine Learning algorithms and Model Selection, Introduction to Statistical
Learning Theory, Ensemble Methods (Boosting, Bagging, Random Forests)
4. Bias-Variance Tradeoff
Understanding the bias-variance tradeoff helps prevent underfitting and overfitting.
Problem                     | Description                                           | Typical Fix
Underfitting (High Bias)    | Model is too simple and fails to capture patterns.    | Increase model complexity (e.g., deeper neural network, more features).
Overfitting (High Variance) | Model memorizes training data but fails on new data.  | Use regularization (L1/L2), reduce complexity, use more data.
5. Hyperparameter Tuning
Once a model is chosen, hyperparameter tuning optimizes its performance.
(a) Grid Search
Tests multiple combinations of hyperparameters.
(b) Random Search
Randomly selects hyperparameters and is faster than Grid Search.
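A short sketch comparing the two approaches with scikit-learn; the SVC parameter grid is an arbitrary example.

```python
# Grid search vs. randomized search with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}

grid = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)        # tries all 16 combinations
rand = RandomizedSearchCV(SVC(), param_grid, n_iter=5, cv=5,
                          random_state=0).fit(X, y)            # tries only 5 random ones
print("Grid search best params:  ", grid.best_params_)
print("Random search best params:", rand.best_params_)
```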
6. Model Comparison
Model           | Best For            | Strengths                | Weaknesses
Neural Networks | Deep learning tasks | Handles complex patterns | Needs large data
A simple linear model for house prices might be:
price = b0 + b1 × size
This model would describe the price of a home as having a linear relationship with the size of the home, and would represent a simple model for the relationship.
If we suppose that the size of the home is not the only independent variable when determining the price, and that the number of bathrooms is also an independent variable, the equation would look like:
price = b0 + b1 × size + b2 × number_of_bathrooms
Model Generalization:
In order to build an effective model, the available data needs to be used in a way that would make the
model generalizable for unseen situations. Some problems that occur when building models are that the model under-fits or over-fits the data.
• Under-fitting —when a statistical model does not adequately capture the underlying structure
of the data and, therefore, does not include some parameters that would appear in a correctly
specified model.
• Over-fitting —when a statistical model contains more parameters than can be justified by the
data and includes the residual variation ("noise") as if the variation represents underlying
model structure.
As you can see, it is important to create a model that can generalize to the data that it is given so that
it can make the most accurate predictions when given new data.
Model Validation:
Model Validation is used to assess over-fitting and under-fitting of the data. The steps to perform
model validation are:
1. Split the data into two parts, training data and testing data (anywhere between 80/20 and 70/30 is ideal).
2. Use the larger portion (training data) to train the model.
3. Use the smaller portion (testing data) to test the model. This data is not used to train the model, so it will be new data for the model to build predictions from.
4. If the model has learned well from the training data, it will perform well with both the training data and the testing data. To determine how well the model is performing on both sets of data, you can calculate the accuracy score for each. The model is over-fitting if the training data has a significantly higher accuracy score than the testing data.
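These steps map directly onto a few lines of scikit-learn; a minimal sketch (dataset and model chosen only for illustration):

```python
# Model validation sketch: 70/30 split, then compare train vs. test accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Train accuracy:", model.score(X_train, y_train))   # ~1.0 for an unpruned tree
print("Test accuracy: ", model.score(X_test, y_test))      # noticeably lower => over-fitting
```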
Bagging (Bootstrap Aggregating):
Bagging is a machine learning ensemble algorithm designed to improve the accuracy of machine learning algorithms used in classification and regression. It also reduces variance and helps to avoid overfitting. We randomly choose subsets ("bags") from the training data with replacement, train a model on each bag, and as a result take the average of the predictions over all bags.
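A minimal bagging sketch with scikit-learn's BaggingClassifier over decision trees (dataset and number of estimators are illustrative):

```python
# Bagging: many trees trained on bootstrap samples, predictions combined by voting/averaging.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        bootstrap=True, random_state=0)
bag.fit(X_train, y_train)
print("Bagging test accuracy:", bag.score(X_test, y_test))
```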
Boosting:
For boosting, we build our first bag of data by selecting randomly from the training data and train a model in the usual way. Next, we take all our training data and use it to test the model. We will discover that some of the points are not well predicted (significant error). For the second bag we again choose data randomly, but each instance is weighted according to this error. Then we test the system altogether, combine the outputs, and again measure the error across all of the data. Then we build the next bag, and so on.
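The notes describe boosting generically; AdaBoost is one standard, concrete boosting algorithm and is sketched below with scikit-learn (illustrative settings).

```python
# AdaBoost: each new weak learner gives more weight to previously mis-predicted points.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

boost = AdaBoostClassifier(n_estimators=100, random_state=0)   # default base learner: shallow trees
boost.fit(X_train, y_train)
print("AdaBoost test accuracy:", boost.score(X_test, y_test))
```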
Random Forest:
Random forests are collections of decision trees, where each tree is slightly different from the others (see the Decision Tree section above for how a single tree is built).
The idea of random forests is to build many trees, so that we can reduce the amount of overfitting by averaging their results.
In order to implement this, we need to build many decision trees that are different from each other.
In order to build random forests, we need to decide how many trees to use. This is the n_estimators (number of trees) parameter of RandomForestRegressor or RandomForestClassifier in Scikit-Learn.
To start building a Random Forest model, first of all we need to understand how the bootstrap works. Bootstrap means we choose data points randomly with replacement, repeating as many times as there are data points. Let's say we have data = [a, b, c, d, e, f], so the count of data is n = len(data) = 6. A possible bootstrap sample would be [b, d, d, c, e, f]; another possible sample would be [c, d, b, b, f, f]; and so on, after repeating 6 times.
Strengths, weaknesses and parameters of Random Forest
Random forests for regression and classification are currently among the most widely used machine
learning methods.
They are very powerful, often work well without heavy tuning of the parameters, and don't require
scaling of the data.
You should keep in mind that random forests, by their nature, are random, and setting different
random states (or not setting the random_state at all) can drastically change the model that is built.
The more trees there are in the forest, the more robust it will be against the choice of random state. If
you want to have reproducible results, it is important to fix the random_state.
Random forests usually work well even on very large datasets, and training can easily be parallelized
over many CPU cores within a powerful computer. However, random forests require more memory
and are slower to train and to predict than linear models. If time and memory are important in an
application, it might make sense to use a linear model instead.
The important parameters to adjust are n_estimators, max_features and possibly pre-pruning options
like max_depth. For n_estimators, larger is always better. Averaging more trees will yield a more
robust ensemble. However, there are diminishing returns, and more trees need more memory and
more time to train. A common rule of thumb is to build "as many as you have time / memory for".
UNIT – 4
Sparse Modeling and Estimation, Modeling Sequence/Time-Series Data, Deep Learning and
Feature Representation Learning
Sparse modeling and estimation
Sparse modeling and estimation are essential concepts in machine learning, especially when dealing
with high-dimensional data. These techniques allow models to focus on a limited number of informative
features, improving interpretability and often leading to more efficient and robust models. Here are
some key points to help college students understand sparse modeling and estimation:
1. What is Sparsity?
• Definition: Sparsity refers to models or datasets where most of the elements are zero or close to zero.
In machine learning, sparse models focus on a subset of features, ignoring irrelevant ones.
• Importance: In many applications, only a few features or parameters significantly impact the model
outcome, which leads to simpler and faster models.
2. Why Sparse Modeling?
• Interpretability: Sparse models are easier to understand since they only use a subset of features.
• Reduced Overfitting: By reducing the number of parameters, sparse models help prevent overfitting,
especially in cases with limited data.
• Efficiency: Sparse models are computationally more efficient, both in terms of memory and
processing time.
3. Techniques for Sparse Modeling
• Regularization: Adding a penalty term to the model's objective function encourages sparsity. Common
techniques include:
o Lasso Regression: Uses L1 regularization, which can shrink some coefficients to zero, effectively removing certain features from the model.
o Elastic Net: Combines L1 (lasso) and L2 (ridge) regularization for more flexibility in feature selection.
• Dimensionality Reduction:
o Principal Component Analysis (PCA): While not inherently sparse, PCA reduces data to a
smaller set of components, capturing essential information.
o Sparse PCA: A variant of PCA that enforces sparsity in the principal components.
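To make the lasso idea above concrete, a minimal sketch on synthetic data in which only a few of 100 features are informative (the alpha value is an arbitrary illustrative choice):

```python
# Lasso (L1) regularization drives most coefficients exactly to zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 100 features, but only 5 of them actually carry information.
X, y = make_regression(n_samples=200, n_features=100, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
print("Non-zero coefficients:", int(np.sum(lasso.coef_ != 0)), "out of", X.shape[1])
```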
4. Applications of Sparse Modeling
• Natural Language Processing: Sparse modeling helps handle high-dimensional data like word counts, focusing on the most relevant terms.
• Finance: Selecting key financial indicators among thousands that influence stock prices.
5. Challenges in Sparse Modeling
a. Parameter selection
b. Interpretability vs. accuracy
c. Scalability
Time-Series Data:
A time series is a sequence of data points collected, recorded, or measured at successive, evenly-
spaced time intervals. Time series data is commonly represented graphically with time on the
horizontal axis and the variable of interest on the vertical axis, allowing analysts to identify trends,
patterns, and changes over time.
Importance of Time Series Analysis:
1. Predict Future Trends: Time series analysis enables the prediction of future trends, allowing
businesses to anticipate market demand, stock prices, and other key variables, facilitating proactive
decision-making.
2. Detect Patterns and Anomalies: By examining sequential data points, time series analysis helps
detect recurring patterns and anomalies, providing insights into underlying behaviors and potential
outliers.
3. Risk Mitigation: By spotting potential risks, businesses can develop strategies to mitigate them,
enhancing overall risk management.
4. Strategic Planning: Time series insights inform long-term strategic planning, guiding decision-
making across finance, healthcare, and other sectors.
5. Competitive Edge: Time series analysis enables businesses to optimize resource allocation
effectively, whether it's inventory, workforce, or financial assets. By staying ahead of market trends,
responding to changes, and making data-driven decisions, businesses gain a competitive edge.
Components of Time series data
There are 4 main components of a time series:
1. Trend: Trend represents the long-term movement or directionality of the data over
time. It captures the overall tendency of the series to increase, decrease, or remain
stable. Trends can be linear, indicating a consistent increase or decrease, or nonlinear,
showing more complex patterns.
2. Seasonality: Seasonality refers to periodic fluctuations or patterns that occur at
regular intervals within the time series. These cycles often repeat annually, quarterly,
monthly, or weekly and are typically influenced by factors such as seasons, holidays,
or business cycles.
3. Cyclic variations: Cyclical variations are longer-term fluctuations in the time series
that do not have a fixed period like seasonality. These fluctuations represent
economic or business cycles, which can extend over multiple years and are often
associated with expansions and contractions in economic activity.
4. Irregularity: Irregularity, also known as noise or randomness, refers to the
unpredictable or random fluctuations in the data that cannot be attributed to the trend,
seasonality, or cyclical variations. These fluctuations may result from random events,
measurement errors, or other unforeseen factors. Irregularity makes it challenging to
identify and model the underlying patterns in the time series data.
Time Series Models in Machine Learning
1. Autoregressive Integrated Moving Average (ARIMA)
ARIMA is a widely used model for time series analysis. It is a statistical model that uses past
values to predict future values of a time series. ARIMA models are widely used in fields like
finance, economics, and meteorology. The model works well when the data has a clear trend,
seasonality, and is stationary.
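A minimal ARIMA sketch, assuming the statsmodels library is available; the toy series and the (p, d, q) order are arbitrary illustrative choices.

```python
# ARIMA sketch on a toy series (random walk with drift).
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.RandomState(0)
series = np.cumsum(rng.randn(200)) + 0.1 * np.arange(200)   # trend + noise

model = ARIMA(series, order=(1, 1, 1))   # (p, d, q) = (AR order, differencing, MA order)
result = model.fit()
print(result.forecast(steps=5))          # forecast the next 5 time steps
```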
Pros:
• ARIMA can handle a wide range of time series data.
• It is relatively easy to understand and implement.
Cons:
Deep Learning:-
Deep learning is the branch of machine learning that is based on artificial neural network architectures. An artificial neural network (ANN) uses layers of interconnected nodes, called neurons, that work together to process and learn from the input data.
In a fully connected Deep neural network, there is an input layer and one or more hidden layers
connected one after the other. Each neuron receives input from the previous layer neurons or the input
layer. The output of one neuron becomes the input to other neurons in the next layer of the network,
and this process continues until the final layer produces the output of the network. The layers of the
neural network transform the input data through a series of nonlinear transformations, allowing the
network to learn complex representations of the input data.
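Real deep learning work typically uses frameworks such as TensorFlow or PyTorch, but the layered, fully connected structure described above can be sketched with scikit-learn's MLPClassifier (the layer sizes below are arbitrary):

```python
# A small fully connected neural network (multi-layer perceptron).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(64, 32),    # two hidden layers of neurons
                    activation="relu", max_iter=500, random_state=0)
mlp.fit(X_train, y_train)
print("Test accuracy:", mlp.score(X_test, y_test))
```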
Today Deep learning AI has become one of the most popular and visible areas of machine learning,
due to its success in a variety of applications, such as computer vision, natural language processing,
and Reinforcement learning. Deep learning AI can be used for supervised, unsupervised as well as
reinforcement machine learning; it uses a variety of learning paradigms to process data:
• Supervised learning
• Unsupervised learning
• Semi- supervised learning
Applications:
1. Facial recognition
2. Voice assistants
3. Financial fraud detection
Benefits:
1. Efficiency
2. Adaptability
3. Accuracy
Limitations:
1. Data dependency
2. Computational costs
3. Overfitting
4. Interpretability.
UNIT – 5
Scalable Machine Learning (Online and Distributed Learning) A selection from some other
advanced topics, e.g., Semi-supervised Learning, Active Learning, Reinforcement Learning,
Inference in Graphical Models, Introduction to Bayesian Learning and Inference. Recent trends
in various learning techniques of machine learning and classification methods for IOT
applications. Various models for IOT applications.
Scalable Machine Learning (Online and Distributed Learning):
Scalable machine learning refers to techniques that allow machine learning models to efficiently
handle large-scale data and adapt to growing datasets. Two major approaches to scalability are:
1. Online Learning – The model updates continuously as new data arrives.
2. Distributed Learning – The model is trained across multiple machines or processors to
handle massive datasets.
Online Learning (Incremental Learning):
Online learning (or incremental learning) is a technique where the model learns from data as it
arrives instead of training on the entire dataset at once. This is useful when data is too large to fit into
memory or is continuously changing (e.g., stock prices, user activity logs).
• Instead of training on a static dataset, the model updates incrementally as new data points
arrive.
• Helps in real-time decision-making without retraining from scratch.
• Applications: Spam filtering, fraud detection, stock price prediction.
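A minimal online-learning sketch using scikit-learn's SGDClassifier and its partial_fit method, feeding the data in chunks as if it were arriving as a stream (the chunk size is arbitrary):

```python
# Online (incremental) learning: the model is updated chunk by chunk with partial_fit.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)
classes = np.unique(y)
clf = SGDClassifier(random_state=0)

for start in range(0, len(X), 1000):                       # data "arrives" in chunks of 1000
    X_batch, y_batch = X[start:start + 1000], y[start:start + 1000]
    clf.partial_fit(X_batch, y_batch, classes=classes)     # classes required on the first call

print("Accuracy on the data seen so far:", clf.score(X, y))
```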
Advantages of online learning:
• Adapts in real time as new data arrives, without retraining from scratch.
• Works when the data is too large to fit into memory, since it is processed in small pieces.
1. Semi-Supervised Learning:
Semi-supervised learning combines a small amount of labeled data with a large amount of unlabeled data during training. Common techniques include:
• Self-training
• Consistency regularization
• Graph-Based semi supervised learning
Applications:
• Speech recognition
• Medical diagnosis
• Webpage classification
2. Active Learning:
Active learning is a technique where the model actively selects the most useful data points
to be labeled by humans to maximize learning efficiency.
• Instead of labelling all data, we focus on the most informative examples.
• Helps reduce labelling costs while maintaining high accuracy.
Strategies:
• Uncertainty Sampling
• Query-by-Committee
• Diversity sampling
Applications:
• Medical diagnosis
• Fraud detection
• Chatbots
3. Reinforcement Learning:
Reinforcement learning (RL) is a paradigm where an agent learns to take actions in an
environment to maximize rewards over time.
• Unlike supervised learning, RL does not rely on labeled data.
• The agent explores and exploits to learn the best strategy (policy).
• Used in robotics, game playing, and autonomous systems.
Key concepts in RL: agent, environment, state, action, reward, and policy.
Applications of RL:
• Game AI
• Robotics
• Autonomous vehicles.
4. Bayesian Learning:
Bayesian learning treats model parameters as random variables and maintains probability distributions over them, which are updated as data is observed.
Advantages:
• Handles uncertainty
• Works with small datasets
• Prevents overfitting
• Continuously updates beliefs.
Bayesian inference:
Bayesian inference is the process of updating beliefs about model parameters as new data is
observed. Unlike traditional machine learning methods that optimize for a single best model, Bayesian
inference maintains a distribution over possible models.
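A tiny Beta-Binomial sketch of this belief-updating process (plain Python, all numbers made up for illustration):

```python
# Bayesian updating: Beta prior over a coin's bias, updated after observing flips.
prior_alpha, prior_beta = 1.0, 1.0        # uniform prior belief about P(heads)

observed_heads, observed_tails = 7, 3     # new data arrives

post_alpha = prior_alpha + observed_heads
post_beta = prior_beta + observed_tails
posterior_mean = post_alpha / (post_alpha + post_beta)
print("Posterior mean of P(heads):", posterior_mean)   # 8 / 12 ≈ 0.667

# Further observations simply update the posterior again: beliefs are refined continuously.
```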
Steps:
1. Define a prior distribution over the model parameters.
2. Compute the likelihood of the observed data given the parameters.
3. Apply Bayes' theorem to obtain the posterior distribution.
4. Repeat the update as new data arrives.
Bayesian Learning Algorithms:
Applications:
• Medical diagnosis
• Spam detection
• Stock market prediction
• Autonomous robotics
• Natural Language Processing