ML Notes

The document provides an overview of various machine learning algorithms, including K-Nearest Neighbour (KNN), Decision Trees, Naïve Bayes, Linear Regression, Logistic Regression, and Support Vector Machines (SVM). Each algorithm is described with its working principles, advantages, disadvantages, and applications, highlighting their roles in classification and regression tasks. Additionally, it discusses binary and multi-class classification concepts, including metrics like precision, recall, and F1 score.

Unit -1

Distance Based Methods: Distance-based algorithms are machine learning methods that classify queries by
computing the distances between the query and the stored training examples.
K-Nearest Neighbour (KNN) Algorithm:

• K-Nearest Neighbour is one of the simplest machine learning algorithms based on supervised
learning technique.
• K-NN algorithm assumes the similarity between the new case/data and available cases and
put the new case into the category that is most similar to the available categories.
• K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can be easily classified into a well-suited
category by using the K-NN algorithm.
• K-NN algorithm can be used for Regression as well as for Classification, but mostly it is used
for classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead, it stores the dataset and, at the time of classification, performs an action
on the dataset.
• The KNN algorithm at the training phase just stores the dataset, and when it gets new data, it
classifies that data into the category that is most similar to the new data.
Why do we need KNN algorithm?
Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1.
In which of these categories will this data point lie? To solve this type of problem, we need the K-NN
algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point.
Consider the below diagram:
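A minimal K-NN sketch using scikit-learn is given below; the two-feature points, labels, and K = 3 are made up purely for illustration.

```python
# Minimal KNN sketch (illustrative data; K = 3 chosen arbitrarily)
from sklearn.neighbors import KNeighborsClassifier

# Toy dataset: two features per point, two categories (Category A = 0, Category B = 1)
X_train = [[1, 1], [1, 2], [2, 1],      # Category A
           [6, 6], [6, 7], [7, 6]]      # Category B
y_train = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)   # K = 3
knn.fit(X_train, y_train)                   # "training" just stores the data (lazy learner)

# Classify a new point x1 by majority vote among its 3 nearest neighbours
print(knn.predict([[2, 2]]))   # -> [0] (Category A)
print(knn.predict([[6, 5]]))   # -> [1] (Category B)
```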

Advantages of KNN algorithm:

• It is simple to implement.
• It is robust to the noisy training data.
• It can be more effective if the training data is large.

Disadvantages of KNN algorithm:


• It always needs to determine the value of K, which may be complex at times.
• The computation cost is high because of calculating the distance between the data points for
all the training samples.

Decision Tree Classification Algorithm:

• Decision Tree is a Supervised learning technique that can be used for both Classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules, and each leaf node represents the outcome.
• In a Decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision
nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the
output of those decisions and do not contain any further branches.
• The decisions or tests are performed on the basis of the features of the given dataset.
• It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
• In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
• A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the
tree into subtrees.
• Below diagram explains the general structure of a decision tree:

Decision Tree Terminologies:

• Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be segregated further
after reaching a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root into sub-nodes according
to the given conditions.
• Branch/Sub Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing the unwanted branches from the tree.
• Parent/Child node: The root node of the tree is called the parent node, and the other nodes are
called the child nodes.
How does the decision tree algorithm work?
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where you cannot further classify the nodes; the final
node is called a leaf node.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he should
accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary
attribute, chosen by the ASM). The root node splits further into the next decision node (distance from the office)
and one leaf node based on the corresponding labels. The next decision node further splits into one
decision node (Cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes
(Accepted offer and Declined offer). Consider the below diagram:
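A small sketch of the job-offer example as a decision tree in scikit-learn; the encoded features (salary, distance, cab facility) and labels below are hypothetical.

```python
# Decision tree sketch for the job-offer example (hypothetical encoded data)
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [salary (lakhs), distance_to_office (km), cab_facility (1 = yes, 0 = no)]
X = [[12, 5, 1], [12, 30, 1], [12, 30, 0], [5, 5, 1], [5, 30, 0]]
y = [1, 1, 0, 0, 0]   # 1 = accept offer, 0 = decline (made-up labels)

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

# Inspect the learned splits; the root split is chosen by the attribute selection measure
print(export_text(tree, feature_names=["salary", "distance", "cab"]))
print(tree.predict([[12, 25, 1]]))   # prediction for a new candidate
```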

Advantages of the Decision Tree:

• It is simple to understand, as it follows the same process that a human follows while making
any decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think about all the possible outcomes for a problem.
• It requires less data cleaning compared to other algorithms.
Disadvantages of the Decision Tree:

• The decision tree contains lots of layers, which makes it complex.


• It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
• For more class labels, the computational complexity of the decision tree may increase.

Naïve Bayes Classifier Algorithm:

• The Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes' theorem
and used for solving classification problems.
• It is mainly used in text classification that includes a high-dimensional training dataset.
• Naïve Bayes Classifier is one of the simplest and most effective classification algorithms,
which helps in building fast machine learning models that can make quick predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the probability of an
object.
• Some popular examples of the Naïve Bayes algorithm are spam filtering, sentiment analysis,
and classifying articles.
Why is it called Naïve Bayes?
The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be described as:

• Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the basis
of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence
each feature individually contributes to identifying it as an apple, without depending on the
others.
• Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.
Bayes' Theorem:

• Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
• The formula for Bayes' theorem is given as:

P(A|B) = [P(B|A) × P(A)] / P(B)

Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that hypothesis A is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
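A small worked example of Bayes' theorem in Python; the prior and likelihood numbers below are assumed purely for illustration (a spam-filtering setting).

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Example: probability that an email is spam (A) given it contains the word "offer" (B).
p_spam = 0.30                 # prior P(A): fraction of emails that are spam (assumed)
p_word_given_spam = 0.60      # likelihood P(B|A) (assumed)
p_word_given_ham = 0.05       # P(B|not A) (assumed)

# Marginal probability of the evidence, P(B), by the law of total probability
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior P(A|B)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))   # ~0.837
```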
Advantages of Naïve Bayes Classifier:

• Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
• It can be used for Binary as well as Multi-class classifications.
• It performs well in Multi-class predictions as compared to the other algorithms.
• It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:

• Naïve Bayes assumes that all features are independent or unrelated, so it cannot learn the
relationships between features.
Applications of Naïve Bayes Classifier:

• It is used for credit scoring.


• It is used in medical data classification.
• It can be used in real-time predictions because the Naïve Bayes classifier is an eager learner.
• It is used in Text classification such as Spam filtering and Sentiment analysis.
Linear Models
Linear Regression:
Linear regression is one of the most popular and simple machine learning algorithms used for
predictive analysis. Here, predictive analysis means making predictions about something, and linear
regression makes predictions for continuous numerical values such as salary, age, etc.
It shows the linear relationship between the dependent and independent variables, and shows how the
dependent variable(y) changes according to the independent variable (x).
It tries to fit the best line between the dependent and independent variables, and this best-fit line is
known as the regression line.
The linear regression model provides a sloped straight line representing the relationship between the
variables. Consider the below image:
Linear regression is further divided into two types:

• Simple linear regression: In simple linear regression, a single independent variable is used
to predict the value of the dependent variable.
• Multiple linear regression: In multiple linear regression, more than one independent
variable is used to predict the value of the dependent variable.
Linear Regression Line:
A straight line showing the relationship between the dependent and independent variables is called a
regression line. A regression line can show two types of relationship:
Positive Linear Relationship:
• If the dependent variable increases on the Y-axis as the independent variable increases on the
X-axis, then such a relationship is termed a positive linear relationship.
Negative Linear Relationship:

• If the dependent variable decreases on the Y-axis as the independent variable increases on the
X-axis, then such a relationship is called a negative linear relationship.
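A minimal simple-linear-regression sketch with scikit-learn; the (experience, salary) pairs are made up for illustration.

```python
# Simple linear regression sketch (made-up experience -> salary data)
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # independent variable x (years of experience)
y = np.array([30, 35, 42, 48, 55])        # dependent variable y (salary in thousands)

model = LinearRegression().fit(X, y)
print("slope (m):", model.coef_[0])       # gradient of the regression line
print("intercept (c):", model.intercept_)
print("prediction for x = 6:", model.predict([[6]])[0])
```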

Logistic Regression in Machine Learning:

• Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical dependent
variable using a given set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, True or
False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values
which lie between 0 and 1.
• Logistic Regression is quite similar to Linear Regression except in how they are used.
Linear Regression is used for solving regression problems, whereas Logistic Regression is
used for solving classification problems.
• In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
• The curve from the logistic function indicates the likelihood of something such as whether the
cells are cancerous or not, a mouse is obese or not based on its weight, etc.
• Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
• Logistic Regression can be used to classify the observations using different types of data and
can easily determine the most effective variables used for the classification. The below image
is showing the logistic function:
Assumptions for Logistic Regression:

• The dependent variable must be categorical in nature.


• The independent variables should not have multi-collinearity.
Logistic Regression Equation:
The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:

• We know the equation of a straight line can be written as:

y = b0 + b1x1 + b2x2 + ... + bnxn

• In logistic regression, y can only be between 0 and 1, so let's divide the above equation by
(1 - y):

y / (1 - y);  0 for y = 0, and infinity for y = 1

• But we need a range from –[infinity] to +[infinity], so we take the logarithm of the equation, and it
becomes:

log[ y / (1 - y) ] = b0 + b1x1 + b2x2 + ... + bnxn

The above equation is the final equation for Logistic Regression.
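A short sketch showing the sigmoid (the inverse of the logit above) and a scikit-learn logistic regression fit; the hours-studied data is made up for illustration.

```python
# Logistic regression sketch: the sigmoid maps the linear term to a probability in (0, 1)
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # inverse of log[y / (1 - y)] = z, i.e. y = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # 0.5: the decision boundary

# Made-up data: hours studied -> pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[3.5]]))   # probabilistic output between 0 and 1
print(clf.predict([[3.5]]))         # class label (0 or 1)
```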


Type of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:

• Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as “cat”, “dogs”, or “sheep”.
• Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as “low”, “Medium”, or “High”.

Support Vector Machine Algorithm:


Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is
used for Classification as well as Regression problems. However, primarily, it is used for
Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-
dimensional space into classes so that we can easily put the new data point in the correct category in
the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are
called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the
below diagram in which there are two different categories that are classified using a decision
boundary or hyperplane:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM:-
SVM can be of two types:

• Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can
be classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called a Linear SVM classifier.
• Non-Linear SVM: Non-Linear SVM is used for non-linearly separable data, which means that if
a dataset cannot be classified by using a straight line, then such data is termed non-linear
data, and the classifier used is called a Non-linear SVM classifier.
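A minimal sketch contrasting a linear and a non-linear (RBF kernel) SVM in scikit-learn; the tiny datasets below are made up for illustration.

```python
# SVM sketch: linear kernel for linearly separable data, RBF kernel otherwise
from sklearn.svm import SVC

# Linearly separable toy data
X_lin = [[0, 0], [1, 1], [4, 4], [5, 5]]
y_lin = [0, 0, 1, 1]
linear_svm = SVC(kernel="linear").fit(X_lin, y_lin)
print("support vectors:", linear_svm.support_vectors_)   # extreme points defining the hyperplane

# Non-linearly separable toy data (XOR-like), handled with the RBF kernel
X_xor = [[0, 0], [1, 1], [0, 1], [1, 0]]
y_xor = [0, 0, 1, 1]
rbf_svm = SVC(kernel="rbf", gamma=1.0).fit(X_xor, y_xor)
print(rbf_svm.predict([[0.9, 0.1]]))   # expected [1]: the point sits near the (1, 0) class
```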
Advantages:

• It uses a subset of training points in the decision function which makes it memory efficient
and is highly effective in high dimensional spaces.
Disadvantages:

• The only disadvantage with the support vector machine is that the algorithm does not directly
provide probability estimates.

Nonlinearity:-
Kernel Methods:
Binary Classification:
Binary Classification is a process or task of classification in which the given data is classified
into two classes. It is basically a kind of prediction about which of two groups the thing belongs to.
Let us suppose, two emails are sent to you, one is sent by an insurance company that keeps sending
their ads, and the other is from your bank regarding your credit card bill. The email service provider
will classify the two emails, the first one will be sent to the spam folder and the second one will be
kept in the primary one.
This process is known as binary classification, as there are two discrete classes, one is spam and the
other is primary. So, this is a problem of binary classification.
Terms related to binary classification:
1. Precision: Precision in binary classification (Yes/No) refers to a model's ability to correctly
interpret positive observations. In other words, how often does a positive value forecast turn
out to be correct? We may manipulate this metric by only returning positive for the single
observation in which we have the most confidence.
2. Recall: The recall is also known as sensitivity. In binary classification (Yes/No), recall is used
to measure how "sensitive" the classifier is to detecting positive cases. To put it another way,
how many real positives did we "catch" in our sample? We may manipulate this metric by
classifying all results as positive.
3. F1 Score: The F1 score can be thought of as a weighted average of precision and recall, with
the best value being 1 and the worst being 0. Precision and recall make an equal
contribution to the F1 score.
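A short sketch computing precision, recall, and F1 score with scikit-learn; the true/predicted labels are made up (1 = spam, 0 = primary).

```python
# Precision / Recall / F1 sketch (made-up labels: 1 = spam, 0 = primary)
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("precision:", precision_score(y_true, y_pred))  # of predicted spam, how many were really spam
print("recall:   ", recall_score(y_true, y_pred))     # of real spam, how many we caught
print("f1 score: ", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```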

Multi-class Classification:
An input can belong to exactly one of the K classes
Training Data: Each input feature vector xi is associated with a class label yi ∈ {1, . . . ,K}
Prediction: Given a new input, predict the class label
Eg. Object Classification, Document Classification, Optical Character Recognition, Context sensitive
spelling correction etc.
Structured Output Prediction:
We can successfully (?) do multiclass classification
Assign topics to documents
Names to object images
Sentiments to reviews

How do we take this knowledge of ML to predict:

Assign topics to documents that come from a label hierarchy
Parse objects in a scene and find relations between them, e.g. OCR
Find the adjectives, verbs, and nouns in reviews to possibly perform aspect-based sentiment analysis.
Example:-

MNIST:
The MNIST database (Modified National Institute of Standards and Technology database) is a
large database of handwritten digits that is commonly used for training various image processing
systems. The database is also widely used for training and testing in the field of machine
learning.
The MNIST database of handwritten digits consists of a training set of 60,000 examples, and a test set
of 10,000 examples. It is a subset of a larger set available from NIST. Additionally, the black and
white images from NIST were size-normalized and centered to fit into a 28x28 pixel bounding box
and anti-aliased, which introduced grayscale levels.
This database is well liked for training and testing in the field of machine learning and image
processing. It is a remixed subset of the original NIST datasets. One half of the 60,000 training
images consists of images from NIST's testing dataset and the other half from NIST's training set. The
10,000 images in the testing set are similarly assembled.
The MNIST dataset is used by researchers to test and compare their research results with others. The
lowest error rates in literature are as low as 0.21 percent.
Ranking:-
Ranking is a type of supervised machine learning (ML) that uses labeled datasets to train models
that score and order future data to predict outcomes. Quite simply, the goal of a ranking model is to sort
data in an optimal and relevant order.
Ranking was first largely deployed within search engines. People search for a topic, while the ranking
algorithm reorders search results based on the PageRank, and the search engine is able to display the
most relevant results to its customers.
Until recently, most ranking models, and ML as a whole, were limited in their scope of use, as most
companies didn’t have enough data to power these algorithms. Better methods for data collection and
more intuitive ML tools have made it possible for nearly anyone to deploy a successful ranking model
within their business.
Ranking models are made up of 2 main factors: queries and documents. Queries are any input value,
such as a question on Google or an interaction on an e-commerce site. Documents are the output value
or results of the query. Given the query, and the associated documents, a function, given a list of
parameters to rank on, will score the documents to be sorted in order of relevancy.
The machine learning algorithm learning to rank takes the scores from this model and uses them to
predict future outcomes on a new and unseen list of documents.

UNIT – 2
Unsupervised Learning:
Unsupervised learning is the training of a machine learning model using information that is neither
classified nor labeled, allowing the algorithm to act on that information without guidance. Here
the task of the machine is to group unsorted information according to similarities, patterns, and
differences without any prior training of data.
Unlike supervised learning, no teacher is provided, which means no training will be given to the
machine. Therefore the machine is restricted to finding the hidden structure in unlabeled data by itself.
For instance, suppose the machine is given pictures of both dogs and cats which it has never seen before.
The machine has no idea about the features of dogs and cats, so it cannot categorize them as 'dogs'
and 'cats'. But it can categorize them according to their similarities, patterns, and differences, i.e., we
can easily categorize the pictures into two parts. The first may contain all pictures having dogs in
them and the second part may contain all pictures having cats in them. Here the machine did not learn
anything beforehand, which means there is no training data or examples.
It allows the model to work on its own to discover patterns and information that was previously
undetected. It mainly deals with unlabelled data.
Unsupervised learning is classified into two categories of algorithms:
1. Clustering
2. Association
Clustering:
Clustering or cluster analysis is a machine learning technique, which groups the unlabelled dataset. It
can be defined as "A way of grouping the data points into different clusters, consisting of similar data
points. The objects with the possible similarities remain in a group that has less or no similarities with
another group."
It does it by finding some similar patterns in the unlabelled dataset such as shape, size, color,
behavior, etc., and divides them as per the presence and absence of those similar patterns.
It is an unsupervised learning method, hence no supervision is provided to the algorithm, and it deals
with the unlabeled dataset.
After applying this clustering technique, each cluster or group is given a cluster-ID. The ML
system can use this ID to simplify the processing of large and complex datasets.
The clustering technique can be widely used in various tasks. Some most common uses of this
technique are:
1. Market segmentation
2. Statistical data analysis
3. Social network analysis
4. Image segmentation
5. Anomaly detection, etc.
K-Means :
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into
different clusters. Here K defines the number of pre-defined clusters that need to be created in the
process, as if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that
each data point belongs to only one group, and points in the same group have similar properties.
It allows us to cluster the data into different groups and is a convenient way to discover the categories of
groups in an unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this
algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and
repeats the process until it finds the best clusters (i.e., until the assignments stop changing). The value of k
should be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:

• Determines the best value for the K center points or centroids by an iterative process.
• Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster.
Hence each cluster has datapoints with some commonalities, and it is away from other clusters.
The below diagram explains the working of the K-Means clustering algorithm.

The working of the K-Means algorithm is explained in the below steps:


Step-1: Select the number K to decide the number of clusters.
Step-2: Select random K points or centroids.
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of
each cluster.
Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.
Step-7: The model is ready.
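A minimal K-Means sketch with scikit-learn; the 2-D points and the choice K = 2 are made up for illustration.

```python
# K-Means sketch: cluster made-up 2-D points into K = 2 groups
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [2, 3],     # one natural group
              [8, 8], [9, 9], [8, 10]])   # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)  # K must be chosen beforehand
labels = kmeans.fit_predict(X)            # iterative assign / update steps happen inside fit

print("cluster labels:", labels)          # cluster-ID for each data point
print("centroids:\n", kmeans.cluster_centers_)
print("new point cluster:", kmeans.predict([[0, 0]]))
```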
Stages of Data preprocessing for K-Means Clustering
1. Data Cleaning
a. Removing duplicates
b. Removing irrelevant observations and errors.
c. Removing unnecessary columns
d. Handling inconsistent data
e. Handling outliers and noise
2. Handling missing data
3. Data Integration
4. Data Transformation
a. Feature Construction
b. Handling skewness
c. Data Scaling
5. Data Reduction
a. Removing dependent variables.
b. Feature selection
c. PCA

Advantages of K-means
1. It is very simple to implement.
2. It is scalable to huge datasets and is also fast on large datasets.
3. It adapts to new examples easily.
4. It generalizes to clusters of different shapes and sizes.
Disadvantages of K-means
1. It is sensitive to the outliers.
2. Choosing the k values manually is a tough job.
3. As the number of dimensions increases its scalability decreases.

Dimensionality reduction:
A dimensionality reduction technique can be defined as "a way of converting a higher-dimensional
dataset into a lower-dimensional dataset while ensuring that it provides similar information." These
techniques are widely used in machine learning for obtaining a better fit predictive model while
solving the classification and regression problems.
It is commonly used in the fields that deal with high-dimensional data, such as speech recognition,
signal processing, bioinformatics, etc. It can also be used for data visualization, noise reduction,
cluster analysis, etc.
Curse of Dimensionality
Curse of Dimensionality refers to a set of problems that arise when working with high-dimensional
data. The dimension of a dataset corresponds to the number of attributes/features that exist in a
dataset. A dataset with a large number of attributes, generally of the order of a hundred or more, is
referred to as high dimensional data. Some of the difficulties that come with high dimensional data
manifest during analyzing or visualizing the data to identify patterns, and some manifest while
training machine learning models. The difficulties related to training machine learning models due to
high-dimensional data are referred to as the 'Curse of Dimensionality'. The popular aspects of the curse of
dimensionality, 'data sparsity' and 'distance concentration', are discussed in the following sections.
Benefits of applying Dimensionality Reduction
1. By reducing the dimensions of the features, the space required to store the dataset also gets
reduced.
2. Less computation training time is required for reduced dimensions of features.
3. Reduced dimensions of features of the dataset help in visualizing the data quickly.
4. It removes the redundant features by taking care of multicollinearity.
Disadvantages
1. Some data may be lost due to dimensionality reduction.
2. In the PCA dimensionality reduction technique, sometimes the principal components required
to be considered are unknown.
Approaches of Dimensionality Reduction:
Common techniques of dimensionality reduction:
1. Principal component analysis
2. Backward elimination
3. Forward selection
4. Score comparison
5. Missing value ratio
6. Low variance filter
7. High correlation filter
8. Random forest
9. Factor analysis
10. Auto encoder

Disadvantages of Dimensionality Reduction:


1. Loss of important information
2. Hard to choose the right number of dimensions
3. Computationally expensive for large datasets
4. Difficult to Interpret transformed features
5. Not always suitable for all models.

Principal Component Analysis(PCA):


Principal component analysis is an unsupervised learning algorithm that is used for the dimensionality
reduction in machine learning. It is a statistical process that converts the observations of correlated
features into a set of linearly uncorrelated features with the help of orthogonal transformation. These
new transformed features are called the Principal Components.
PCA works by considering the variance of each attribute, because an attribute with high variance shows a good
split between the classes, and hence it reduces the dimensionality. Some real-world applications of
PCA are image processing, movie recommendation systems, and optimizing the power allocation in
various communication channels. It is a feature extraction technique, so it keeps the important
variables and drops the least important ones.
The PCA algorithm is based on some mathematical concepts such as:
1. Variance and Covariance
2. Eigenvalues and Eigenvectors
Some common terms used in PCA algorithm:
1. Dimensionality: It is the number of features or variables present in the given dataset. More
easily, it is the number of columns present in the dataset.
2. Correlation: It signifies how strongly two variables are related to each other, such that if
one changes, the other variable also changes. The correlation value ranges from -1 to +1.
Here, -1 occurs if variables are inversely proportional to each other, and +1 indicates that
variables are directly proportional to each other.
3. Orthogonal: It defines that variables are not correlated to each other, and hence the
correlation between the pair of variables is zero.
4. Eigenvectors: Given a square matrix M and a non-zero vector v, v is an eigenvector of M if
Mv is a scalar multiple of v.
5. Covariance Matrix: A matrix containing the covariance between the pair of variables is called
the Covariance matrix.
Principal Components in PCA:
The transformed new features, or the output of PCA, are the Principal Components. The number of
these PCs is either equal to or less than the number of original features present in the dataset. Some
properties of these principal components are given below:
1. The principal components must be linear combinations of the original features.
2. These components are orthogonal, i.e., the correlation between a pair of variables is zero.
3. The importance of each component decreases when going from 1 to n; the first PC has the
most importance, and the nth PC has the least importance.
Steps for PCA Algorithm:
1. Getting the dataset
Firstly, we need to take the input dataset and divide it into two subparts X and Y, where X is
the training set, and Y is the validation set.
2. Representing data into a structure
Now we will represent our dataset in a structure, such as a two-dimensional matrix of the
independent variables X. Here each row corresponds to a data item, and each column corresponds to a
feature. The number of columns gives the dimensionality of the dataset.
3. Standardizing the data
In this step, we will standardize our dataset. In a particular column, the features with
high variance are more important compared to the features with lower variance.
If the importance of features is independent of the variance of the feature, then we will divide
each data item in a column with the standard deviation of the column. Here we will name the
matrix as Z.
4. Calculating the Covariance of Z
To calculate the covariance of Z, we will take the matrix Z, and will transpose it. After
transpose, we will multiply it by Z. The output matrix will be the Covariance matrix of Z.
5. Calculating the Eigen Values and Eigen Vectors
Now we need to calculate the eigenvalues and eigenvectors for the resultant covariance
matrix of Z. Eigenvectors of the covariance matrix are the directions of the axes with high
information, and the corresponding eigenvalues indicate the amount of variance (information) along those directions.
6. Sorting the Eigen Vectors
In this step, we will take all the eigenvalues and will sort them in decreasing order, which
means from largest to smallest, and simultaneously sort the eigenvectors accordingly into a
matrix P of eigenvectors. The resultant matrix will be named P*.
7. Calculating the new features Or Principal Components
Here we will calculate the new features. To do this, we will multiply the P* matrix to the Z. In
the resultant matrix Z*, each observation is the linear combination of original features. Each
column of the Z* matrix is independent of each other.
8. Remove less or unimportant features from the new dataset.
The new feature set is obtained, so we will decide here what to keep and what to remove. It
means we will only keep the relevant or important features in the new dataset, and
unimportant features will be removed.
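A short PCA sketch with scikit-learn that mirrors the steps above (standardize, then keep the top 2 principal components); the small dataset is made up for illustration.

```python
# PCA sketch: standardize, then project onto the top-2 principal components
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4, 1.0],
              [0.5, 0.7, 0.2],
              [2.2, 2.9, 0.9],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 1.3]])           # made-up data: 5 observations, 3 features

Z = StandardScaler().fit_transform(X)     # step 3: standardized matrix Z

pca = PCA(n_components=2)                 # keep the 2 most important components
Z_star = pca.fit_transform(Z)             # steps 4-7 (covariance, eigenvectors, projection) done internally

print("explained variance ratio:", pca.explained_variance_ratio_)
print("transformed data (Z*):\n", Z_star)
```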
Application of PCA:
1. PCA is mainly used as the dimensionality reduction technique in various AI applications such
as computer vision, image compression, etc.
2. It can also be used for finding hidden patterns if data has high dimensions. Some fields where
PCA is used are finance, data mining, Psychology, etc.

Kernel PCA:
PCA is a linear method; that is, it works best on datasets which are linearly separable. It does
an excellent job for such datasets, but if we apply it to non-linear datasets, we
might get a result which may not be the optimal dimensionality reduction.
function to project dataset into a higher dimensional feature space, where it is linearly separable. It is
similar to the idea of Support Vector Machines. There are various kernel methods like linear,
polynomial, and gaussian.
In the kernel space the two classes become linearly separable. Kernel PCA uses a kernel function to project
the dataset into a higher-dimensional space, where it is linearly separable; a scikit-learn sketch on a
non-linear dataset is given after the steps below.
Kernel Principal Component Analysis (PCA) is a technique for dimensionality reduction in machine
learning that uses the concept of kernel functions to transform the data into a high-dimensional feature
space. In traditional PCA, the data is transformed into a lower-dimensional space by finding the
principal components of the covariance matrix of the data. In kernel PCA, the data is transformed into
a high dimensional feature space using a non-linear mapping function, called a kernel function, and
then the principal components are found in this high-dimensional space.
The following are the general steps in Kernel PCA:
Select a kernel function: Choose an appropriate kernel function according to the properties of the
data. The challenge at hand and the data's underlying structure determine which kernel should be
used.
Compute the kernel matrix: Using the selected kernel function, determine the pairwise similarity (or
distance) between data points. As a result, the kernel matrix (a symmetric positive semi-definite
matrix) is produced.
Choose the principal components: The top k eigenvectors corresponding to the largest eigenvalues are
selected to create the reduced-dimensional representation of the data.
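A minimal Kernel PCA sketch with scikit-learn on a standard non-linear toy dataset (two concentric circles); the RBF kernel and gamma = 10 are arbitrary illustrative choices.

```python
# Kernel PCA sketch: an RBF kernel handles data that linear PCA cannot separate
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)   # gamma chosen for illustration
X_kpca = kpca.fit_transform(X)

# In the kernel PCA space the two circles become (approximately) linearly separable
print(X_kpca.shape)     # (200, 2)
print(X_kpca[:3])
```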
Applications:
o One of the many uses of kernel PCA is nonlinear dimensionality reduction.
o Identification and categorization of patterns in high-dimensional environments.
o Signal and image processing.
o Bioinformatics and genetics to analyze biological data with complicated interactions.
Advantages of Kernel PCA:
1. Non-linearity: Kernel PCA can capture non-linear patterns in the data that are not possible with
traditional linear PCA.
2. Robustness: Kernel PCA can be more robust to outliers and noise in the data, as it considers the
global structure of the data, rather than just local distances between data points.
3. Versatility: Different types of kernel functions can be used in kernel PCA to suit different types of
data and different objectives.
Disadvantages of Kernel PCA:
1. Complexity: Kernel PCA can be computationally expensive, especially for large datasets, as it
requires the calculation of eigenvectors and eigenvalues.
2. Model selection: Choosing the right kernel function and the right number of components can be
challenging and may require expert knowledge or trial and error.

Matrix Factorization:
Matrix Factorization in Machine Learning
1. What is Matrix Factorization?
Matrix Factorization (MF) is a dimensionality reduction technique that decomposes a large matrix
into two or more smaller matrices to extract meaningful patterns from data. It is widely used in
recommendation systems, latent feature discovery, and dimensionality reduction.
Why Use Matrix Factorization?
Reduces High-Dimensional Data – Converts large matrices into smaller ones.
Extracts Latent Features – Helps discover hidden structures in data.
Speeds Up Computation – Makes operations on large datasets computationally efficient.
Handles Missing Data – Common in collaborative filtering for recommendation systems.

2. How Matrix Factorization Works


Matrix Factorization breaks down a large matrix A into smaller matrices W and H such that:

A ≈ W × H

Where:
• A is the original matrix (m × n).
• W is the first factor matrix (m × k).
• H is the second factor matrix (k × n).
• k is the reduced number of latent features.
This transformation allows us to approximate A using a lower-dimensional representation.

3. Types of Matrix Factorization Methods


(a) Singular Value Decomposition (SVD)
• Definition:
SVD decomposes a matrix A into three matrices:

A = U Σ V^T

where U and V are orthogonal matrices and Σ is a diagonal matrix of singular values.
Application:
o Used in dimensionality reduction and recommendation systems.
o Example: Netflix Prize Challenge used SVD for movie recommendations.

(b) Non-Negative Matrix Factorization (NMF)


• Definition:
Given a non-negative matrix A, NMF finds two non-negative matrices W and H (all entries ≥ 0) such
that:

A ≈ W × H
Why Non-Negative?
o Helps in feature interpretability.
o Common in text mining and image processing.
• Application:
o Topic modeling in Natural Language Processing (NLP).
o Feature extraction in image recognition.

(c) Probabilistic Matrix Factorization (PMF)


• Definition:
PMF assumes that the matrix entries are generated from a probability distribution and tries
to infer the missing values by maximizing likelihood.
• Application:
o Used in collaborative filtering for personalized recommendations.

(d) Alternating Least Squares (ALS)


• Definition:
ALS minimizes the loss function by alternating updates for W and H. In the standard regularized
squared-error form:

minimize Σ over observed (i, j) of (A_ij − w_i · h_j)² + λ(||W||² + ||H||²),

fixing H and solving for W, then fixing W and solving for H, and repeating.
Application:
o Scalable for large-scale recommendation systems (e.g., Netflix, Spotify).

4. Matrix Factorization Example: Recommendation System


Consider a user-item rating matrix (e.g., movie ratings):

User/Item | Movie 1 | Movie 2 | Movie 3 | Movie 4
User 1    |    5    |    ?    |    3    |    ?
User 2    |    ?    |    4    |    ?    |    5
User 3    |    2    |    ?    |    1    |    3

Here, "?" represents missing ratings.


Using Matrix Factorization:
• We decompose the user-item matrix into two smaller matrices:

A ≈ W × H

Where:
o W represents users' preferences (e.g., genre preferences).
o H represents movies' characteristics.
• Missing values (e.g., unobserved ratings) can be predicted using this factorization.
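A small sketch using scikit-learn's NMF to factorize the rating matrix above into W and H and estimate the "?" entries; replacing the missing ratings with 0 is a simplification made purely for illustration (real recommender pipelines treat missing entries more carefully).

```python
# Matrix factorization sketch: A (3 users x 4 movies) ~ W (3 x k) * H (k x 4)
import numpy as np
from sklearn.decomposition import NMF

# "?" entries replaced with 0 for this illustration
A = np.array([[5, 0, 3, 0],
              [0, 4, 0, 5],
              [2, 0, 1, 3]], dtype=float)

model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = model.fit_transform(A)    # user preferences (3 x 2)
H = model.components_         # movie characteristics (2 x 4)

A_hat = W @ H                 # reconstructed matrix: used to estimate the missing ratings
print(np.round(A_hat, 2))
```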

Advantages
1. Efficient for Large Datasets – Handles sparse matrices effectively.
2. Extracts Meaningful Features – Identifies latent patterns in data.
3. Handles Missing Data – Useful in recommendation systems.
4. Scalable – Methods like ALS scale well with large datasets.
Disadvantages
1. Computationally Expensive – Some factorization methods require high computation.
2. Choice of Rank k is Crucial – Selecting the correct number of latent features is
challenging.
3. Sensitive to Noise – Noisy data can affect factorization quality.

7. Applications of Matrix Factorization

Application | Use Case
Recommendation Systems | Netflix, Amazon, YouTube recommendations
Dimensionality Reduction | Reducing large feature spaces (PCA, SVD)
Anomaly Detection | Identifying fraud in financial transactions
Natural Language Processing (NLP) | Topic modeling, text summarization
Computer Vision | Image compression, face recognition

Matrix Completion:
Matrix Completion in Machine Learning
1. What is Matrix Completion?
Matrix Completion is a technique in machine learning used to predict missing values in a partially
observed matrix. It is widely used in recommendation systems, collaborative filtering, image
inpainting, and anomaly detection.
Why Matrix Completion?
• Real-world data is often incomplete (e.g., missing ratings in recommendation systems,
corrupted pixels in images).
• Matrix completion can predict missing values by learning patterns from observed data.
• It helps in dimensionality reduction and feature extraction for better predictions.

2. Mathematical Formulation
Problem Statement
Given a partially observed matrix A of size m × n, we aim to find a completed matrix Â such that:

Â_ij = A_ij for all observed entries (i, j), with the remaining (missing) entries predicted.
Low-Rank Approximation
• Most real-world data matrices are low-rank:
o Users prefer only a few categories of movies (recommendations).
o Images contain repetitive patterns (compression).
• We assume that the true matrix A is low-rank, meaning it can be decomposed into smaller
matrices:

A ≈ W × H
where:
o W is a matrix of user preferences (e.g., movie genres).
o H is a matrix of item features (e.g., movie properties).

3. Techniques for Matrix Completion


Several algorithms exist for solving the matrix completion problem:
(a) Singular Value Thresholding (SVT)
• Based on Singular Value Decomposition (SVD).
• Finds the best low-rank approximation of A.
• Thresholds small singular values to reduce noise.
Example: Movie Recommendation System
• If a user hasn't rated a movie, SVT estimates it based on similar users’ ratings.

(b) Alternating Least Squares (ALS)


• Used in collaborative filtering for recommendation systems.
• Alternates between optimizing:
o User preferences (W) while keeping item features (H) fixed.
o Item features (H) while keeping user preferences (W) fixed.
Example: Netflix Movie Recommendation
• Predicts missing movie ratings by learning user and movie feature matrices.

(c) Nuclear Norm Minimization


• Convex optimization approach that minimizes the nuclear norm (sum of singular values).
• Finds the simplest low-rank matrix that fits the observed data.
Example: Image Inpainting
• Filling missing pixels in an image while maintaining its structure.

(d) Probabilistic Matrix Factorization (PMF)


• Assumes missing values follow a probabilistic distribution.
• Maximizes the likelihood of observed data.
Example: Personalized Recommendations
• Predicts product ratings on Amazon using probability distributions.

5. Applications of Matrix Completion

Application | Use Case
Recommendation Systems | Predicts missing movie ratings (Netflix, Amazon).
Image Inpainting | Recovers missing pixels in images.
Medical Diagnosis | Fills missing health records for better predictions.
Finance & Fraud Detection | Completes missing financial transactions to detect anomalies.
Natural Language Processing | Fills in missing words or phrases in text.

6. Advantages and Disadvantages of Matrix Completion


Advantages

1. Handles Missing Data Efficiently – Predicts unknown values with high accuracy.
2. Extracts Meaningful Patterns – Identifies latent structures in data.
3. Scales Well for Large Datasets – Especially when using ALS or PMF.

Disadvantages

1. Computationally Expensive – Large matrices require significant processing.


2. Depends on Assumptions – Assumes data follows a low-rank structure, which may not
always be true.
3. Sensitive to Noise – Incorrect missing values can lead to poor performance.

Generative Models:
Generative Models in Machine Learning: Mixture Models and Latent Factor Models
Generative models are a class of machine learning models that learn the underlying distribution of
data to generate new samples similar to the given dataset. Unlike discriminative models (which focus
on classification), generative models aim to model the joint probability distribution P(X, Y)
and can generate new instances from learned patterns.

1. What are Generative Models?


Generative models try to learn how data is generated by modeling its probability distribution. Given
training data X, they estimate P(X) or P(X∣Z), where Z is some latent (hidden) variable.
Why Use Generative Models?
• Can generate new data similar to the training set.
• Handle missing data by estimating the probability of missing values.
• Learn hidden structures within the data.
• Used in unsupervised learning, semi-supervised learning, and density estimation.
Types of Generative Models
1. Mixture Models – Assume data is generated from multiple distributions.
2. Latent Factor Models – Assume that observed variables depend on a set of hidden (latent)
factors.

2. Mixture Models
A Mixture Model assumes that data is drawn from a combination of multiple underlying probability
distributions.
2.1 Gaussian Mixture Model (GMM)
One of the most common mixture models is the Gaussian Mixture Model (GMM), which assumes
that data is generated from a mixture of multiple Gaussian distributions.
Mathematical Formulation
A Gaussian Mixture Model represents a dataset as:

p(x) = Σ_{k=1..K} π_k · N(x | μ_k, Σ_k)

where:
• π_k are the mixing weights (non-negative and summing to 1), and
• N(x | μ_k, Σ_k) is a Gaussian component with mean μ_k and covariance Σ_k.

Each data point is assumed to be generated from one of these K Gaussians.


Training a GMM
• Expectation-Maximization (EM) Algorithm is used to estimate the parameters:
o E-step: Compute probability of each data point belonging to a Gaussian.
o M-step: Update the Gaussian parameters based on weighted data points.
Applications of GMM
• Clustering (better than K-Means for non-circular clusters).
• Anomaly detection (detects rare data points).
• Speech recognition (models phoneme distributions).
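A minimal Gaussian Mixture Model sketch with scikit-learn, fitting K = 2 Gaussians via the EM algorithm; the data is sampled from two made-up Gaussians for illustration.

```python
# GMM sketch: fit a mixture of K = 2 Gaussians with the EM algorithm
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Made-up data drawn from two different Gaussians
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(6, 1, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print("mixing weights:", gmm.weights_)    # pi_k
print("component means:\n", gmm.means_)   # mu_k
print("soft assignment of a new point:", gmm.predict_proba([[5.5, 6.2]]))  # E-step responsibilities
```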

3. Latent Factor Models


Latent Factor Models assume that observed data is influenced by hidden variables (latent factors).
These factors help in capturing underlying structure in data.
3.1 Probabilistic Latent Semantic Analysis (PLSA)
Used for topic modeling in text data. It assumes that documents are mixtures of latent topics:

P(w | d) = Σ_z P(w | z) P(z | d)

where:
• P(w∣z) is the probability of word w given topic z.
• P(z∣d) is the probability of topic z in document d.
Example:
• A document about sports might have words like "football," "goal," and "player" with high
probability under a "Sports" topic.

3.2 Latent Dirichlet Allocation (LDA)


LDA is another topic modeling approach that assigns words to topics based on a probabilistic
distribution.
1. Each document is modeled as a mixture of latent topics.
2. Each topic is a distribution over words.
Mathematical Representation:

• P(w | d) = Σ_{z} P(w | z) P(z | d).


• Topics and word distributions are inferred using Bayesian inference.
Example:
A newspaper article might be a mix of 50% Politics, 30% Economy, and 20% Sports.

3.3 Matrix Factorization Models (Collaborative Filtering)


Latent Factor Models are also used in recommendation systems.
Example: Netflix Movie Recommendations
• Users have preferences (latent factors).
• Movies have characteristics (latent factors).
• Matrix Factorization (e.g., SVD) finds hidden patterns:

A ≈ W × H

where:
• A is the user-item rating matrix.
• W represents user preferences.
• H represents item properties.

4. Comparison: Mixture Models vs. Latent Factor Models

Feature | Mixture Models (GMM) | Latent Factor Models (LDA, PLSA)
Purpose | Clustering, density estimation | Uncover hidden structures
Assumption | Data from multiple Gaussians | Data influenced by hidden factors
Common Algorithm | Expectation-Maximization (EM) | Variational Inference, Gibbs Sampling
Applications | Anomaly detection, image segmentation | Topic modeling, recommender systems

5. Applications of Generative Models

Domain | Use Case
Natural Language Processing | Topic modeling (LDA, PLSA)
Recommendation Systems | Movie recommendations (Matrix Factorization)
Image Processing | Image generation (GANs, Variational Autoencoders)
Anomaly Detection | Fraud detection in finance (GMM)
Bioinformatics | Gene expression modeling

UNIT – 3
Evaluating Machine Learning algorithms and Model Selection, Introduction to Statistical
Learning Theory, Ensemble Methods (Boosting, Bagging, Random Forests)

Evaluating Machine learning algorithms and model selection:

Evaluating Machine Learning Algorithms and Model Selection


Evaluating machine learning models and selecting the best one is crucial for building reliable AI
systems. It ensures that models generalize well to unseen data and avoid overfitting or underfitting.

1. Key Aspects of Model Evaluation


When evaluating machine learning models, we focus on the following:
• Performance Metrics: Accuracy, Precision, Recall, F1-score, AUC, etc.
• Cross-Validation: Splitting data into training and test sets properly.
• Bias-Variance Tradeoff: Balancing underfitting and overfitting.
• Model Complexity: Avoiding over-complex models that do not generalize.
• Computational Efficiency: Training time and inference speed.
2. Performance Metrics
(a) Classification Metrics
For classification tasks (e.g., spam detection, disease diagnosis), we use the following metrics:

Example: Evaluating a Spam Classifier

(b) Regression Metrics


For regression problems (e.g., predicting house prices, stock prices), we use:

Example: Evaluating a House Price Prediction Model
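The individual regression metrics are not listed in the text above; a short sketch computing the standard ones (MAE, MSE, RMSE, R²) with scikit-learn on made-up house-price predictions:

```python
# Regression metrics sketch (made-up true vs. predicted house prices, in thousands)
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([250, 300, 180, 420, 310])
y_pred = np.array([240, 320, 200, 400, 305])

mae = mean_absolute_error(y_true, y_pred)   # average absolute error
mse = mean_squared_error(y_true, y_pred)    # average squared error
rmse = np.sqrt(mse)                         # root mean squared error
r2 = r2_score(y_true, y_pred)               # proportion of variance explained

print("MAE:", mae, "MSE:", mse, "RMSE:", round(rmse, 2), "R2:", round(r2, 3))
```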

3. Model Selection Techniques


Selecting the right model is critical for achieving high accuracy and generalization. The following
techniques help in choosing the best model:
(a) Train-Test Split
• Splits data into training (80%) and test (20%).
• Used to evaluate model generalization.
(b) K-Fold Cross-Validation
• Splits data into K equal parts.
• Trains model on K-1 folds and tests on the remaining fold.
• Repeats this process K times, ensuring each fold is used for testing once.

Example: 5-Fold Cross-Validation
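A short 5-fold cross-validation sketch with scikit-learn; the built-in Iris dataset and logistic regression are used here purely as placeholders.

```python
# 5-fold cross-validation sketch: each fold is used once as the test set
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)   # trains on 4 folds, tests on the 5th, 5 times
print("fold accuracies:", scores)
print("mean accuracy:  ", scores.mean())
```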

4. Bias-Variance Tradeoff
Understanding the bias-variance tradeoff helps prevent underfitting and overfitting.

Type | Description | Solution
Underfitting (High Bias) | Model is too simple and fails to capture patterns. | Increase model complexity (e.g., deeper neural network, more features).
Overfitting (High Variance) | Model memorizes training data but fails on new data. | Use regularization (L1/L2), reduce complexity, use more data.

Example: Overfitting vs. Underfitting in Polynomial Regression

5. Hyperparameter Tuning
Once a model is chosen, hyperparameter tuning optimizes its performance.
(a) Grid Search
Tests multiple combinations of hyperparameters.
(b) Random Search
Randomly selects hyperparameters and is faster than Grid Search.
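A minimal hyperparameter-tuning sketch using GridSearchCV over an SVM's C and kernel (RandomizedSearchCV is used the same way for random search); the parameter values and dataset are arbitrary examples.

```python
# Grid search sketch: try every combination of the listed hyperparameters with 5-fold CV
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}   # example values
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best CV score:  ", search.best_score_)
```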

6. Model Comparison

Model | Best for | Pros | Cons
Logistic Regression | Binary classification | Simple, fast | Assumes linear relationship
Decision Trees | Interpretability | Easy to understand | Prone to overfitting
Random Forest | General-purpose tasks | Handles non-linearity | Computationally expensive
SVM | Small datasets | Works well with outliers | Hard to tune
Neural Networks | Deep learning tasks | Handles complex patterns | Needs large data

7. Steps for Model Selection


Step 1: Define the Problem
• Identify if it is classification, regression, clustering, or reinforcement learning.
Step 2: Choose Candidate Models
• Consider simple models first (Logistic Regression, Decision Trees).
• Test complex models (Random Forest, Neural Networks) if needed.
Step 3: Evaluate Models
• Use cross-validation for fair comparison.
• Check performance using metrics (Accuracy, RMSE, AUC-ROC).
Step 4: Hyperparameter Tuning
• Optimize model hyperparameters using Grid Search or Random Search.
Step 5: Final Testing
• Evaluate the best model on an independent test set.

Introduction to Statistical Learning Theory:


A statistical model defines the relationship between a dependent and an independent variable. For
example, the relationship between the size of a home and the price of the home can be illustrated by
a straight line. We can define this relationship using y = mx + c, where m represents the gradient
and c is the intercept. Another way this equation can be expressed is with beta coefficients, which
would look something like:

price = β0 + β1 × size
This model would describe the price of a home as having a linear relationship with the size of a home.
This would represent a simple model for the relationship.
If we suppose that the size of the home is not the only independent variable when determining the
price, and that the number of bathrooms is also an independent variable, the equation would look like:

price = β0 + β1 × size + β2 × bathrooms
Model Generalization:
In order to build an effective model, the available data needs to be used in a way that makes the
model generalizable to unseen situations. Some problems that occur when building models are that the
model under-fits or over-fits the data.

• Under-fitting —when a statistical model does not adequately capture the underlying structure
of the data and, therefore, does not include some parameters that would appear in a correctly
specified model.
• Over-fitting — when a statistical model contains more parameters than can be justified by the
data and includes the residual variation ("noise") as if the variation represents underlying
model structure.
As you can see, it is important to create a model that can generalize to the data that it is given so that
it can make the most accurate predictions when given new data.
Model Validation:
Model Validation is used to assess over-fitting and under-fitting of the data. The steps to perform
model validation are:
1. Split the data into two parts, training data and testing data (anywhere between 80/20 and 70/30 is
ideal).
2. Use the larger portion (training data) to train the model.
3. Use the smaller portion (testing data) to test the model. This data is not used to train the
model, so it will be new data for the model to build predictions from.
4. If the model has learned well from the training data, it will perform well with both the training data
and the testing data. To determine how well the model is performing on both sets of data, you can
calculate the accuracy score for each. The model is over-fitting if the training data has a significantly
higher accuracy score than the testing data.
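A short sketch of the validation procedure above (80/20 split, then compare training and testing accuracy); the dataset and model are placeholders chosen for illustration.

```python
# Model validation sketch: compare training accuracy vs. testing accuracy
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
# A much higher train accuracy than test accuracy suggests over-fitting.
```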

Ensemble Methods (Boosting, Bagging, Random Forests)


Ensemble Methods
Ensembles are methods that combine multiple machine learning models (KNN, Decision Tree, SVM,
etc.) to create more powerful models. The main principle behind the ensemble model is
that a group of weak learners come together to form a strong learner. There are many models in the
machine learning literature that belong to this category: Random Forests, Gradient Boosted Decision
Trees, etc.
Why ensembles?
1. Lower error
2. Less overfitting
3. Reduces variance

Bagging (bootstrap aggregating)
Bagging is a machine learning ensemble algorithm designed to improve the accuracy of machine
learning algorithms used in classification and regression. It also reduces variance and helps to avoid
overfitting. We randomly choose subsets from the training data with replacement (bootstrap samples),
train a model on each subset, and as a result take the average of the predictions over all the bags.
Boosting:
For boosting, we build our first bag of data by selecting randomly from the training data and train a
model in the usual way. Next, we take all our training data and use it to test the model. We will
discover that some of the points are not well predicted (significant error). For the second bag we again
choose data randomly, but each instance is weighted according to this error. Then we test the models
together, combine their outputs, and again measure the error across all of this data. Then we build the
next bag, and so on.

Random Forest:
Random forests are collections of decision trees, where each tree is slightly different from the others.
The idea of random forests is to build many trees, so that we can reduce the amount of overfitting by
averaging their results.
In order to implement this, we need to build many decision trees that are different from one another.
In order to build a random forest we need to decide how many trees to build. This is the n_estimators (number of trees) parameter of RandomForestRegressor or RandomForestClassifier in Scikit-Learn.
To start building a random forest model, we first need to understand how the bootstrap works. Bootstrapping means choosing data points randomly, with replacement, as many times as there are data points. Say we have data = [a, b, c, d, e, f], so the number of samples is n = len(data) = 6. A possible bootstrap sample would be [b, d, d, c, e, f]; another possible sample would be [c, d, b, b, f, f]; and so on, each obtained by drawing 6 times with replacement.
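A minimal sketch of drawing one bootstrap sample, assuming NumPy:

import numpy as np

data = np.array(["a", "b", "c", "d", "e", "f"])
rng = np.random.default_rng(0)

# Draw len(data) items with replacement: some items repeat, others are left out.
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
print(bootstrap_sample)  # a sample of 6 items drawn with replacement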
Strengths, weaknesses and parameters of Random Forest
Random forests for regression and classification are currently among the most widely used machine
learning methods.
They are very powerful, often work well without heavy tuning of the parameters, and don't require
scaling of the data.
You should keep in mind that random forests, by their nature, are random, and setting different
random states (or not setting the random_state at all) can drastically change the model that is built.
The more trees there are in the forest, the more robust it will be against the choice of random state. If
you want to have reproducible results, it is important to fix the random_state.
Random forests usually work well even on very large datasets, and training can easily be parallelized
over many CPU cores within a powerful computer. However, random forests require more memory
and are slower to train and to predict than linear models. If time and memory are important in an
application, it might make sense to use a linear model instead.
The important parameters to adjust are n_estimators, max_features, and possibly pre-pruning options such as max_depth. For n_estimators, larger is always better: averaging more trees yields a more robust ensemble. However, there are diminishing returns, and more trees need more memory and more time to train. A common rule of thumb is to build "as many as you have time / memory for".
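A minimal random forest sketch, assuming scikit-learn; the parameter names match those discussed above, and the dataset is an illustrative synthetic one:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # illustrative data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators: number of trees; max_features: features considered per split;
# random_state is fixed so the result is reproducible.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                max_depth=None, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))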

UNIT – 4
Sparse Modeling and Estimation, Modeling Sequence/Time-Series Data, Deep Learning and
Feature Representation Learning
Sparse modeling and estimation
Sparse modeling and estimation are essential concepts in machine learning, especially when dealing with high-dimensional data. These techniques allow models to focus on a limited number of informative features, improving interpretability and often leading to more efficient and robust models. Here are some key points for understanding sparse modeling and estimation:
1. What is Sparsity?

• Definition: Sparsity refers to models or datasets where most of the elements are zero or close to zero.
In machine learning, sparse models focus on a subset of features, ignoring irrelevant ones.

• Importance: In many applications, only a few features or parameters significantly impact the model
outcome, which leads to simpler and faster models.
2. Why Sparse Modeling?

• Interpretability: Sparse models are easier to understand since they only use a subset of features.

• Reduced Overfitting: By reducing the number of parameters, sparse models help prevent overfitting,
especially in cases with limited data.

• Efficiency: Sparse models are computationally more efficient, both in terms of memory and
processing time.
3. Techniques for Sparse Modeling
• Regularization: Adding a penalty term to the model's objective function encourages sparsity. Common techniques include:
o Lasso Regression: Uses L1 regularization, which can shrink some coefficients exactly to zero, effectively removing certain features from the model (a lasso sketch is given after this list).
o Elastic Net: Combines L1 (lasso) and L2 (ridge) regularization for more flexibility in feature selection.

• Dimensionality Reduction:
o Principal Component Analysis (PCA): While not inherently sparse, PCA reduces data to a
smaller set of components, capturing essential information.
o Sparse PCA: A variant of PCA that enforces sparsity in the principal components.

• Feature Selection Methods:


o Filter Methods: Evaluate each feature individually to determine its relevance to the target
variable.
o Wrapper Methods: Use cross-validation to iteratively select features, often with greedy
algorithms like forward selection.
o Embedded Methods: Feature selection happens as part of the model training process, as seen
in decision trees and certain regularized methods.
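As referenced above, a minimal lasso sketch (assuming scikit-learn and an illustrative synthetic regression problem) shows how the L1 penalty drives most coefficients exactly to zero:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Illustrative data: 100 features, but only 5 actually influence the target.
X, y = make_regression(n_samples=200, n_features=100, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0)  # alpha controls the strength of the L1 penalty
lasso.fit(X, y)

# Most coefficients are shrunk exactly to zero, i.e. those features are dropped.
print("non-zero coefficients:", np.sum(lasso.coef_ != 0), "out of", X.shape[1])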
4. Sparse Estimation
Sparse estimation involves estimating model parameters with the constraint that many of them are
zero or close to zero. Key methods include:
o Thresholding Techniques: Simple methods that apply a threshold to keep only large
coefficients.
o Bayesian Approaches: Priors like the Laplace prior induce sparsity in Bayesian models.
o Coordinate Descent: An optimization algorithm that can be efficiently applied in lasso and
elastic net to find sparse solutions.
5. Applications of Sparse Modeling and Estimation

• Genomics: Identifying a small subset of genes related to a particular disease.

• Image Processing: Sparse representations of images for compression or denoising.

• Natural Language Processing: Sparse modeling helps handle high-dimensional data like word
counts, focusing on the most relevant terms.

• Finance: Selecting key financial indicators among thousands that influence stock prices.
6. Challenges in Sparse Modeling
a. Parameter selection
b. Interpretability vs. accuracy
c. Scalability
Time-Series Data:
A time series is a sequence of data points collected, recorded, or measured at successive, evenly-
spaced time intervals. Time series data is commonly represented graphically with time on the
horizontal axis and the variable of interest on the vertical axis, allowing analysts to identify trends,
patterns, and changes over time.
Importance of Time Series Analysis:
1. Predict Future Trends: Time series analysis enables the prediction of future trends, allowing
businesses to anticipate market demand, stock prices, and other key variables, facilitating proactive
decision-making.
2. Detect Patterns and Anomalies: By examining sequential data points, time series analysis helps
detect recurring patterns and anomalies, providing insights into underlying behaviors and potential
outliers.
3. Risk Mitigation: By spotting potential risks, businesses can develop strategies to mitigate them,
enhancing overall risk management.
4. Strategic Planning: Time series insights inform long-term strategic planning, guiding decision-
making across finance, healthcare, and other sectors.
5. Competitive Edge: Time series analysis enables businesses to optimize resource allocation
effectively, whether it's inventory, workforce, or financial assets. By staying ahead of market trends,
responding to changes, and making data-driven decisions, businesses gain a competitive edge.
Components of Time series data
There are 4 main components of a time series:

1. Trend: Trend represents the long-term movement or directionality of the data over
time. It captures the overall tendency of the series to increase, decrease, or remain
stable. Trends can be linear, indicating a consistent increase or decrease, or nonlinear,
showing more complex patterns.
2. Seasonality: Seasonality refers to periodic fluctuations or patterns that occur at
regular intervals within the time series. These cycles often repeat annually, quarterly,
monthly, or weekly and are typically influenced by factors such as seasons, holidays,
or business cycles.
3. Cyclic variations: Cyclical variations are longer-term fluctuations in the time series
that do not have a fixed period like seasonality. These fluctuations represent
economic or business cycles, which can extend over multiple years and are often
associated with expansions and contractions in economic activity.
4. Irregularity: Irregularity, also known as noise or randomness, refers to the
unpredictable or random fluctuations in the data that cannot be attributed to the trend,
seasonality, or cyclical variations. These fluctuations may result from random events,
measurement errors, or other unforeseen factors. Irregularity makes it challenging to
identify and model the underlying patterns in the time series data.
Time Series Models in Machine Learning
1. Autoregressive Integrated Moving Average (ARIMA)
ARIMA is a widely used model for time series analysis. It is a statistical model that uses past
values to predict future values of a time series. ARIMA models are widely used in fields like
finance, economics, and meteorology. The model works well when the data has a clear trend,
seasonality, and is stationary.
Pros:
• ARIMA can handle a wide range of time series data.
• It is relatively easy to understand and implement.
Cons:

• ARIMA requires stationary data, which can be difficult to achieve in practice.


• It can be challenging to determine the optimal parameters for the model.
Usage: forecasting stock prices, weather, and economic trends.
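A minimal ARIMA forecasting sketch, assuming the statsmodels library and an illustrative synthetic monthly series:

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Illustrative series: a noisy upward trend observed monthly.
rng = np.random.default_rng(0)
values = np.cumsum(rng.normal(loc=0.5, scale=1.0, size=120))
series = pd.Series(values, index=pd.date_range("2015-01-01", periods=120, freq="MS"))

# order=(p, d, q): autoregressive lags, degree of differencing, moving-average lags.
model = ARIMA(series, order=(1, 1, 1))
fitted = model.fit()
print(fitted.forecast(steps=6))  # forecast the next 6 months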
2. Seasonal Autoregressive Integrated Moving Average (SARIMA)
SARIMA is an extension of ARIMA that is designed to handle time series data with seasonal patterns. It uses the same approach as ARIMA but takes into account seasonal factors that can affect the data. SARIMA is widely used in fields like retail sales and marketing to forecast sales for specific seasons.
Pros:
• SARIMA is effective in capturing seasonal patterns.
• It can handle non-stationary data.
Cons:
• SARIMA requires a large amount of historical data to build accurate models.
• It can be challenging to determine the optimal parameters for the model.
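A minimal SARIMA sketch, assuming statsmodels' SARIMAX implementation and an illustrative monthly series with yearly seasonality:

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Illustrative monthly series with a yearly seasonal pattern plus noise.
rng = np.random.default_rng(1)
months = pd.date_range("2015-01-01", periods=120, freq="MS")
seasonal = 10 * np.sin(2 * np.pi * (np.arange(120) % 12) / 12)
series = pd.Series(seasonal + rng.normal(scale=2.0, size=120), index=months)

# order = non-seasonal (p, d, q); seasonal_order = (P, D, Q, s) with period s = 12.
model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
fitted = model.fit(disp=False)
print(fitted.forecast(steps=12))  # forecast one full seasonal cycle ahead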

Deep Learning:-
The definition of Deep learning is that it is the branch of machine learning that is based on artificial
neural network architecture. An artificial neural network or ANN uses layers of interconnected nodes
called neurons that work together to process and learn from the input data.
In a fully connected Deep neural network, there is an input layer and one or more hidden layers
connected one after the other. Each neuron receives input from the previous layer neurons or the input
layer. The output of one neuron becomes the input to other neurons in the next layer of the network,
and this process continues until the final layer produces the output of the network. The layers of the
neural network transform the input data through a series of nonlinear transformations, allowing the
network to learn complex representations of the input data.

Today Deep learning AI has become one of the most popular and visible areas of machine learning,
due to its success in a variety of applications, such as computer vision, natural language processing,
and reinforcement learning. Deep learning can be used for supervised, unsupervised, as well as reinforcement machine learning, and it uses different architectures and training procedures for each.

• Supervised Machine Learning: Supervised machine learning is the machine learning technique in which the neural network learns to make predictions or classify data based on labeled datasets. Here we provide the input features along with the target variables. The neural network learns to make predictions based on the cost or error that comes from the difference between the predicted and the actual target; this process is known as backpropagation. Deep learning algorithms like convolutional neural networks and recurrent neural networks are used for many supervised tasks such as image classification and recognition, sentiment analysis, and language translation.
• Unsupervised Machine Learning: Unsupervised machine learning is the machine learning technique in which the neural network learns to discover patterns or to cluster the dataset based on unlabeled datasets. Here there are no target variables; the machine has to determine the hidden patterns or relationships within the dataset on its own. Deep learning algorithms like autoencoders and generative models are used for unsupervised tasks like clustering, dimensionality reduction, and anomaly detection.
• Reinforcement Machine Learning: Reinforcement machine learning is the machine learning technique in which an agent learns to make decisions in an environment to maximize a reward signal. The agent interacts with the environment by taking actions and observing the resulting rewards. Deep learning can be used to learn policies, or sets of actions, that maximize the cumulative reward over time. Deep reinforcement learning algorithms like Deep Q-Networks (DQN) and Deep Deterministic Policy Gradient (DDPG) are used for tasks like robotics and game playing.
Artificial neural networks
Artificial neural networks are built on the principles of the structure and operation of human neurons.
It is also known as neural networks or neural nets. An artificial neural network’s input layer, which is
the first layer, receives input from external sources and passes it on to the hidden layer, which is the
second layer. Each neuron in the hidden layer gets information from the neurons in the previous layer,
computes the weighted total, and then transfers it to the neurons in the next layer. These connections
are weighted, which means that the impacts of the inputs from the preceding layer are more or less
optimized by giving each input a distinct weight. These weights are then adjusted during the training
process to enhance the performance of the model.
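A minimal sketch of such a fully connected network, assuming PyTorch and an illustrative layer layout (input layer, one hidden layer, output layer):

import torch
import torch.nn as nn

# Each Linear layer computes a weighted sum of its inputs (plus a bias);
# the ReLU non-linearity lets the network learn non-linear relationships.
model = nn.Sequential(
    nn.Linear(4, 8),   # input layer -> hidden layer (4 inputs, 8 hidden neurons)
    nn.ReLU(),
    nn.Linear(8, 3),   # hidden layer -> output layer (3 output scores)
)

x = torch.randn(1, 4)   # one example with 4 input features
print(model(x))         # raw output scores produced by the final layer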
Difference between ML and Deep Learning
• Approach: Machine learning applies statistical algorithms to learn the hidden patterns and relationships in the dataset; deep learning uses artificial neural network architectures to learn them.
• Data requirements: Machine learning can work with a smaller amount of data; deep learning requires a larger volume of data.
• Suitable tasks: Machine learning is better for simpler, lower-complexity tasks; deep learning is better for complex tasks like image processing and natural language processing.
• Training time: Machine learning models take less time to train; deep learning models take more time to train.
• Feature engineering: In machine learning, a model is created from relevant features that are manually extracted (e.g., from images to detect an object); in deep learning, relevant features are extracted automatically, making it an end-to-end learning process.
• Interpretability: Machine learning models are less complex and their results are easier to interpret; deep learning models are more complex, work like a black box, and their results are not easy to interpret.
Types of Neural Networks:
1. Feedforward neural network
2. Convolutional neural networks
Applications:
1. Computer vision
2. Natural language processing
3. Reinforcement learning
Advantages:
1. High accuracy
2. Automated feature engineering
3. Scalability
4. Flexibility
Disadvantages:
1. High computational requirements
2. Requires large amounts of labeled data
3. Interpretability
4. Overfitting
5. Black-box nature.

Feature Representation Learning:


Feature learning, in the context of machine learning, is the automatic process through which a model
identifies and optimizes key patterns, structures, or characteristics (called "features") from raw data to
enhance its performance in a given task. It plays a pivotal role because, instead of manually
engineering these features, machines can automatically learn the most informative ones, which can
greatly improve the accuracy and efficiency of predictions.
Types of machine learning

• Supervised learning
• Unsupervised learning
• Semi-supervised learning
Applications:
1. Facial recognition
2. Voice assistants
3. Financial fraud detection
Benefits:
1. Efficiency
2. Adaptability
3. Accuracy
Limitations:
1. Data dependency
2. Computational costs
3. Overfitting
4. Interpretability.

UNIT – 5
Scalable Machine Learning (Online and Distributed Learning) A selection from some other
advanced topics, e.g., Semi-supervised Learning, Active Learning, Reinforcement Learning,
Inference in Graphical Models, Introduction to Bayesian Learning and Inference. Recent trends
in various learning techniques of machine learning and classification methods for IOT
applications. Various models for IOT applications.
Scalable Machine Learning (Online and Distributed Learning):
Scalable machine learning refers to techniques that allow machine learning models to efficiently
handle large-scale data and adapt to growing datasets. Two major approaches to scalability are:
1. Online Learning – The model updates continuously as new data arrives.
2. Distributed Learning – The model is trained across multiple machines or processors to
handle massive datasets.
Online Learning (Incremental Learning):
Online learning (or incremental learning) is a technique where the model learns from data as it
arrives instead of training on the entire dataset at once. This is useful when data is too large to fit into
memory or is continuously changing (e.g., stock prices, user activity logs).

• Instead of training on a static dataset, the model updates incrementally as new data points
arrive.
• Helps in real-time decision-making without retraining from scratch.
• Applications: Spam filtering, fraud detection, stock price prediction.
Advantages of online learning

• Efficient for large scale data


• Adapts to data distribution changes
• Does not require storing all past data
Disadvantages

• New data can override old patterns


• Learning rate tuning is crucial.
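A minimal online learning sketch, assuming scikit-learn's SGDClassifier, whose partial_fit method updates the model incrementally as each new batch of data arrives:

import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()                   # a linear model trained with stochastic gradient descent
classes = np.array([0, 1])                # all classes must be declared up front
rng = np.random.default_rng(0)

for step in range(10):                    # simulate 10 batches arriving over time
    X_batch = rng.normal(size=(32, 5))    # illustrative streaming features
    y_batch = (X_batch[:, 0] > 0).astype(int)              # illustrative labels
    model.partial_fit(X_batch, y_batch, classes=classes)   # incremental update

print(model.predict(rng.normal(size=(3, 5))))  # the model is usable at any point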
Distributed learning:
Distributed learning is a machine learning approach where large-scale models are trained on multiple machines or processors in parallel. It is used:

• When datasets do not fit on a single machine.
• When training a model requires high computational power.
• When training time needs to be reduced significantly.
Types of Distributed Learning:
1. Data Parallelism
• The dataset is split into multiple chunks, and different machines train on different parts of the data.
• The model updates are aggregated and synchronized.
2. Model Parallelism
• The model itself is split across multiple machines.
• Common in deep learning when a single model is too large to fit on one machine.
Example: model parallelism in PyTorch (a minimal sketch is given below).
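A minimal model-parallelism sketch, assuming PyTorch and two available GPUs (cuda:0 and cuda:1); the first half of the network lives on one device and the second half on the other:

import torch
import torch.nn as nn

class TwoDeviceNet(nn.Module):
    def __init__(self):
        super().__init__()
        # First half of the model on GPU 0, second half on GPU 1.
        self.part1 = nn.Sequential(nn.Linear(100, 64), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(64, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))      # compute the first half on GPU 0
        return self.part2(x.to("cuda:1"))   # move activations, finish on GPU 1

model = TwoDeviceNet()
out = model(torch.randn(32, 100))           # the output tensor lives on cuda:1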
Advanced Topics in Machine Learning:
In addition to supervised and unsupervised learning, there are several advanced topics that play crucial roles in modern machine learning applications. Here, we explore:
1. Semi-Supervised learning
2. Active learning
3. Reinforcement learning
1. Semi-Supervised Learning:
Semi-supervised learning (SSL) is a learning paradigm where a model is trained using a small amount of labeled data and a large amount of unlabeled data. It is attractive because:
• Labelling data is expensive and time-consuming.
• Unlabelled data is abundant.
There are several approaches to using unlabelled data effectively (a self-training sketch is given after this list):

• Self-training
• Consistency regularization
• Graph-based semi-supervised learning
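A minimal self-training sketch, assuming scikit-learn's SelfTrainingClassifier; unlabeled points are marked with the label -1, and the proportions here are illustrative:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=500, random_state=0)   # illustrative data

# Pretend only ~10% of the labels are known; the rest are marked -1 (unlabeled).
y_partial = y.copy()
rng = np.random.default_rng(0)
y_partial[rng.random(len(y)) > 0.1] = -1

# The base classifier is repeatedly retrained, adding its own confident
# predictions on the unlabeled points as pseudo-labels.
model = SelfTrainingClassifier(LogisticRegression())
model.fit(X, y_partial)
print("accuracy against the true labels:", model.score(X, y))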
Applications:

• Speech recognition
• Medical diagnosis
• Webpage classification
2. Active Learning:
Active learning is a technique where the model actively selects the most useful data points
to be labeled by humans to maximize learning efficiency.
• Instead of labelling all data, we focus on the most informative examples.
• Helps reduce labelling costs while maintaining high accuracy.
Strategies (an uncertainty-sampling sketch is given after this list):

• Uncertainty sampling
• Query-by-committee
• Diversity sampling
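A minimal uncertainty-sampling sketch, assuming scikit-learn and an illustrative pool of unlabeled data; the examples whose predicted probabilities are closest to 0.5 are selected as the most informative ones to label next:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)   # illustrative pool
labeled_idx = np.arange(20)                                  # start with 20 labeled points
pool_idx = np.arange(20, len(X))                             # the rest are "unlabeled"

model = LogisticRegression().fit(X[labeled_idx], y[labeled_idx])

# Uncertainty = how close the positive-class probability is to 0.5.
proba = model.predict_proba(X[pool_idx])[:, 1]
uncertainty = 1 - np.abs(proba - 0.5) * 2
query_idx = pool_idx[np.argsort(uncertainty)[-10:]]          # 10 most uncertain points

print("indices to send to a human annotator:", query_idx)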
Applications:

• Medical diagnosis
• Fraud detection
• Chatbots
3. Reinforcement Learning:
Reinforcement learning (RL) is a paradigm where an agent learns to take actions in an
environment to maximize rewards over time.
• Unlike supervised learning, RL does not rely on labeled data.
• The agent explores and exploits to learn the best strategy
• Used in robotics, game playing, and autonomous systems.
Key concepts in RL (a tabular Q-learning sketch is given after this list):

• Agent: the entity making decisions.
• Environment: the world with which the agent interacts.
• State (S): the current situation the agent is in.
• Action (A): the possible moves the agent can make.
• Reward (R): the feedback received after taking an action.
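A minimal tabular Q-learning sketch on a toy chain environment; the environment, reward scheme, and hyperparameters are all illustrative assumptions:

import numpy as np

n_states, n_actions = 5, 2              # toy chain: action 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))     # Q-table of state-action values
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

for episode in range(500):
    state = 0
    for step in range(100):                                  # cap the episode length
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))            # explore: random action
        else:
            best = np.flatnonzero(Q[state] == Q[state].max())
            action = int(rng.choice(best))                   # exploit (ties broken randomly)
        next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == n_states - 1 else 0.0  # reward only at the goal
        # Q-learning update: move Q(s, a) toward reward + gamma * max_a' Q(s', a').
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state
        if state == n_states - 1:
            break                                            # episode ends at the goal

print(Q)   # the learned values favor moving right toward the rewarding end state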
Applications:

• Game AI
• Robotics
• Autonomous vehicles.

Introduction to bayesian learning and inference:


Bayesian Learning:
Bayesian learning is a probabilistic approach to machine learning that incorporates prior
knowledge and updates beliefs as new data arrives. It is based on Bayes’ Theorem, which allows us
to compute the probability of a hypothesis given the observed data.
Bayes' Theorem:

P(h | D) = P(D | h) · P(h) / P(D)

where P(h) is the prior probability of hypothesis h, P(D | h) is the likelihood of the observed data D given h, P(D) is the evidence, and P(h | D) is the posterior probability of h after observing D.
Uses of Bayesian Learning:

• Handles uncertainty
• Works with small datasets
• Prevents overfitting
• Continuously updates beliefs

Bayesian inference:
Bayesian inference is the process of updating beliefs about model parameters as new data is
observed. Unlike traditional machine learning methods that optimize for a single best model, Bayesian
inference maintains a distribution over possible models.
Steps:
1. Specify a prior distribution P(θ) over the model parameters.
2. Define the likelihood P(D | θ) of the observed data.
3. Apply Bayes' theorem to obtain the posterior P(θ | D) ∝ P(D | θ) · P(θ).
4. Use the posterior distribution to make predictions as new data arrives.
Bayesian Learning Algorithms:
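Naïve Bayes is one commonly used Bayesian learning algorithm; a minimal sketch, assuming scikit-learn's GaussianNB and the built-in iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)                 # illustrative dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# GaussianNB applies Bayes' theorem under the assumption that features are
# conditionally independent and Gaussian within each class.
model = GaussianNB().fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
print("class probabilities for one sample:", model.predict_proba(X_test[:1]))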
Advantages:

• Handles uncertainty effectively


• Works well with small datasets
• Provides probabilistic confidence intervals.
• Avoids overfitting
Disadvantages

• Computationally expensive for large datasets.


• Choosing the right prior can be challenging.
• Requires sampling methods.
Applications:

• Medical diagnosis
• Spam detection
• Stock market prediction
• Autonomous robotics
• Natural Language Processing
