Unit 3.1 Machine Learning Linear Regression
Unsupervised Learning: Unsupervised machine learning is a type of machine learning that deals with data that has not been labeled,
categorized, or classified. The goal is to identify patterns, relationships, or structures within the data without any prior guidance on what those
patterns might be. Examples of Algorithms: K-Means, Hierarchical Clustering, t-SNE, Autoencoders.
Common Use Cases: Customer segmentation for targeted marketing
Anomaly detection in network security
Document clustering
Semi-Supervised Learning: Semi-supervised learning is a machine learning approach that combines aspects of both supervised and
unsupervised learning. It leverages a small amount of labeled data along with a larger amount of unlabeled data to improve learning efficiency
and model performance. Examples of Techniques: Self-Training, Co-Training, Graph-Based Methods.
Common Use Cases: Image classification with limited labeled images
Text classification where only a few documents are labeled
Reinforcement Learning: This is a feedback-based learning method, based on a system of rewards and punishments for correct
and incorrect actions respectively. The aim is for the “learning agent” to receive maximum reward and hence improve its performance.
Examples of Algorithms: Q-Learning, Deep Q-Networks (DQN), Policy Gradient Methods.
Common Use Cases: Game AI (e.g., training agents to play video games)
Robotics (e.g., teaching robots to navigate and manipulate objects)
Autonomous vehicles
How Supervised Learning Works
In supervised learning, models are trained on a labelled dataset, from which the model learns the characteristics of each type of
data. Once the training process is complete, the model is evaluated on test data (a held-out portion of the dataset, kept separate
from the training data), and then it predicts the output.
The working of supervised learning can be easily understood by the following example:
Suppose we have a dataset of different types of shapes, including squares, rectangles, triangles, and polygons. The first step is to
train the model on each shape:
If the given shape has four sides, and all the sides are equal, it will be labelled as a square.
If the given shape has three sides, it will be labelled as a triangle.
If the given shape has six equal sides, it will be labelled as a hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify the shape.
The machine is already trained on all types of shapes, and when it encounters a new shape, it classifies the shape on the basis of
its number of sides and predicts the output.
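As a minimal illustration of the rules above, the following Python sketch hard-codes the shape rules; the function name and inputs are illustrative, not from any library.

# A minimal sketch of the shape example: the "model" here is just the
# hand-written rules learned from the training descriptions above.
def classify_shape(num_sides: int, all_sides_equal: bool) -> str:
    """Classify a shape using the rules described in the text."""
    if num_sides == 3:
        return "triangle"
    if num_sides == 4 and all_sides_equal:
        return "square"
    if num_sides == 6 and all_sides_equal:
        return "hexagon"
    return "unknown"

# "Testing" the trained rules on new shapes:
print(classify_shape(4, True))   # square
print(classify_shape(3, False))  # triangle
print(classify_shape(6, True))   # hexagon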
Supervised Learning:
Classification: In classification tasks, the goal is to predict a discrete label or category. Common algorithms include:
o Logistic Regression
o Decision Trees
o Random Forests
o Support Vector Machines (SVM)
o Naive Bayes
o Neural Networks
Examples of classification problems include spam detection (spam or not spam) and sentiment analysis (positive, negative,
neutral).
Regression: In regression tasks, the goal is to predict a continuous value. Common algorithms include:
o Linear Regression
o Polynomial Regression
o Ridge and Lasso Regression
o Support Vector Regression (SVR)
o Decision Trees (for regression)
o Neural Networks (for regression)
Examples of regression problems include predicting house prices, stock prices, or temperature.
Other Considerations
o Binary vs. Multi-class: Classification can be binary (two classes) or multi-class (more than two classes).
o Single-output vs. Multi-output: Regression can involve predicting a single continuous variable or multiple continuous
variables.
Supervised Learning:
Supervised machine learning algorithms can be categorized into several types based on their characteristics and the nature of the
tasks they perform. Here are the main types:
1. Classification Algorithms: Predicting categorical labels (e.g., spam detection).
These algorithms predict discrete labels or categories.
•Logistic Regression: A linear model used for binary classification. (e.g., spam detection).
•Decision Trees: A tree-like model that makes decisions based on feature values. (e.g., loan approval).
•Random Forests: An ensemble of decision trees that improves accuracy and reduces overfitting. (e.g., classifying species of
flowers).
•Support Vector Machines (SVM): Finds the hyperplane that best separates classes in the feature space. (e.g., image
recognition).
•Naive Bayes: A probabilistic classifier based on Bayes' theorem, assuming feature independence.
•K-Nearest Neighbors (KNN): Classifies based on the majority label of the nearest neighbors.
2. Regression Algorithms: Predicting continuous values (e.g., house prices).
These algorithms learn the relationship between the features and a continuous target. For example, predicting house prices involves
collecting data on features like square footage, number of bedrooms, and location, along with the selling price of each house. After
preprocessing the data, such as handling missing values and encoding categorical variables, a regression model (e.g., linear
regression or random forest regression) is trained on a subset of the data. The model is then evaluated using metrics like Mean
Absolute Error (MAE) and R-squared (R²) to measure its accuracy. Once validated, the model can predict house prices for new data,
providing insights into real estate values based on the learned relationships between features and prices.
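A minimal sketch of this workflow using scikit-learn; the synthetic data, coefficients, and feature choices below are illustrative assumptions, not real housing data.

# House-price regression: train/test split, fit, and evaluate with MAE and R².
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

rng = np.random.default_rng(0)
# Synthetic features: square footage and number of bedrooms
X = np.column_stack([rng.uniform(800, 3000, 200), rng.integers(1, 6, 200)])
# Price depends linearly on the features plus noise
y = 50_000 + 120 * X[:, 0] + 15_000 * X[:, 1] + rng.normal(0, 10_000, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, pred))
print("R^2:", r2_score(y_test, pred))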
Classification is a type of supervised learning that is used to predict categorical values, such as whether an email is spam or not,
or whether a medical image shows a tumor or not. Classification algorithms learn a function that maps from the input features to
a probability distribution over the output classes.
The best example of an ML classification algorithm is Email Spam Detector.
The main goal of the Classification algorithm is to identify the category of a given dataset, and these algorithms are mainly used
to predict the output for the categorical data.
Classification can be understood with a simple two-class example: suppose there are two classes, Class A and Class B. Points within
each class have features that are similar to one another and dissimilar to those of the other class.
The algorithm that implements classification on a dataset is known as a classifier. There are two types of classification:
Binary Classifier: If the classification problem has only two possible outcomes, it is called a binary classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
Multi-class Classifier: If a classification problem has more than two outcomes, it is called a multi-class classifier.
Examples: classifying types of crops, classifying types of music.
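A minimal binary-classification sketch in scikit-learn, in the spirit of the spam-detector example; the tiny corpus and labels are made up for illustration.

# Spam vs. not-spam classification with bag-of-words and logistic regression.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["win a free prize now", "meeting at 10am tomorrow",
         "free money click here", "project report attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

vec = CountVectorizer().fit(texts)              # turn text into word counts
clf = LogisticRegression().fit(vec.transform(texts), labels)
print(clf.predict(vec.transform(["claim your free prize"])))  # likely [1] (spam)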
Unsupervised Learning:
Unsupervised learning can be categorized into several types based on its objectives and methodologies. Here are the main types:
1. Clustering
Clustering aims to group similar data points together based on certain features. Common algorithms include:
K-Means: Partitions data into k clusters by minimizing variance within each cluster. (e.g., customer segmentation).
Hierarchical Clustering: Creates a tree-like structure of clusters using either agglomerative or divisive methods. (e.g., organizing documents).
DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Identifies clusters based on the density of data points, useful for finding
irregularly shaped clusters. (e.g., anomaly detection in network data).
Gaussian Mixture Models (GMM): Assumes that data points are generated from a mixture of several Gaussian distributions.
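A minimal K-Means sketch with scikit-learn on synthetic two-dimensional points; the blob locations and cluster count are assumptions chosen so the result is easy to verify.

# Partition unlabeled points into 2 clusters by minimizing within-cluster variance.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two blobs of points around (0, 0) and (5, 5)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)               # approximately [0, 0] and [5, 5]
print(km.labels_[:5], km.labels_[-5:])   # cluster assignments for sample points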
2. Dimensionality Reduction
These techniques aim to reduce the number of features or variables in the data while preserving its essential structure. This is particularly
useful when dealing with high-dimensional datasets. Common methods include:
Principal Component Analysis (PCA): Transforms data to a lower-dimensional space by identifying principal components. (e.g., anomaly detection
in network data).
t-Distributed Stochastic Neighbor Embedding (t-SNE): A non-linear method primarily used for visualizing high-dimensional data. (e.g., visualizing
word embeddings).
Autoencoders: Neural network-based models that compress data into a lower-dimensional representation. (e.g., denoising images).
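A minimal PCA sketch with scikit-learn: five observed features are generated from two underlying factors (an assumption of this toy setup), so two principal components capture almost all the variance.

# Project correlated 5-D data down to 2 principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))          # 2 underlying factors
X = base @ rng.normal(size=(2, 5))        # embedded in 5 dimensions
X += 0.01 * rng.normal(size=X.shape)      # small noise

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)      # two components explain almost all variance
X_2d = pca.transform(X)                   # reduced representation, shape (100, 2)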
Semi-Supervised Learning Techniques
These algorithms utilize both labeled and unlabeled data to improve learning efficiency.
Self-Training: The model iteratively labels the unlabeled data based on its predictions. (e.g., image classification with few labeled
images).
In this approach, a model is initially trained on the labeled data. It then makes predictions on the unlabeled data, and the most
confident predictions are added to the training set as "pseudo-labels." This process can be iteratively refined; a minimal code
sketch follows at the end of this section.
Co-Training: Two models train on different features and label data for each other. (e.g., text classification).
Co-training involves training two or more classifiers on different feature sets of the same data. Each classifier labels unlabeled
examples for the other classifiers, allowing them to improve by leveraging their different perspectives.
Graph-Based Methods: Use graph structures to propagate labels from labeled to unlabeled data. (e.g., social network analysis).
These methods represent data as a graph, where nodes represent data points and edges represent similarities. The idea is to
propagate labels through the graph, allowing unlabeled data points to acquire labels based on their connections to labeled
points.
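As promised above, here is a minimal self-training sketch using scikit-learn's SelfTrainingClassifier, where unlabeled points are marked with -1; the synthetic data and the number of hidden labels are illustrative assumptions.

# Self-training: start from a few labels, pseudo-label the rest iteratively.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)
# Two synthetic classes: blobs around (0, 0) and (4, 4)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y_true = np.array([0] * 50 + [1] * 50)

# Hide most labels: -1 marks an unlabeled point
y_partial = np.full(100, -1)
y_partial[:5] = 0      # 5 labeled examples of class 0
y_partial[50:55] = 1   # 5 labeled examples of class 1

clf = SelfTrainingClassifier(LogisticRegression()).fit(X, y_partial)
print("Accuracy vs. true labels:", (clf.predict(X) == y_true).mean())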
Reinforcement Learning
• Reinforcement Learning is a feedback-based Machine learning technique in which an agent learns to behave in an environment by performing
the actions and seeing the results of actions. For each good action, the agent gets positive feedback, and for each bad action, the agent gets
negative feedback or penalty.
• In Reinforcement Learning, the agent learns automatically from feedback without any labeled data, unlike supervised learning.
• Since there is no labeled data, the agent must learn from its own experience.
• RL solves a specific type of problem where decision making is sequential, and the goal is long-term, such as game-playing, robotics, etc.
• The reinforcement learning agent learns about a problem by interacting with its environment. The environment provides information on its
current state. The agent then uses that information to determine which action(s) to take. If that action obtains a reward signal from the
surrounding environment, the agent is encouraged to take that action again when in a similar future state. This process repeats for every new
state thereafter. Over time, the agent learns from rewards and punishments to take actions within the environment that meet a specified goal.
• The primary goal of an agent in reinforcement learning is to improve its performance by accumulating the maximum positive reward.
• The agent learns through trial and error, and based on this experience, it learns to perform the task in a better way. Hence, we can say
that "Reinforcement learning is a type of machine learning method where an intelligent agent (computer program) interacts with the
environment and learns to act within it."
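A minimal tabular Q-learning sketch on a toy one-dimensional corridor; the environment, the +1 reward at the goal cell, and the hyperparameters are assumptions for illustration, not part of any standard benchmark.

# The agent starts at cell 0 and earns +1 for reaching the rightmost cell.
import numpy as np

n_states, n_actions = 5, 2               # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2    # learning rate, discount, exploration
rng = np.random.default_rng(0)

for _ in range(500):                     # training episodes
    s = 0                                # start at the leftmost cell
    while s != n_states - 1:             # episode ends at the goal cell
        # epsilon-greedy action selection
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s_next == n_states - 1 else 0.0    # reward only at the goal
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1)[:-1])  # learned policy in non-terminal states: move right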
Characteristics of Regression
Here are the characteristics of the regression:
•Continuous Target Variable: Regression deals with predicting continuous target variables that represent numerical values. Examples include
predicting house prices, forecasting sales figures, or estimating patient recovery times.
•Error Measurement: Regression models are evaluated based on their ability to minimize the error between the predicted and actual values of the
target variable. Common error metrics include mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE).
•Model Complexity: Regression models range from simple linear models to more complex nonlinear models. The choice of model complexity depends
on the complexity of the relationship between the input features and the target variable.
•Overfitting and Underfitting: Regression models are susceptible to overfitting (fitting noise in the training data) and underfitting (being too
simple to capture the underlying relationship).
•Interpretability: The interpretability of regression models varies depending on the algorithm used. Simple linear models are highly interpretable, while
more complex models may be more difficult to interpret.
Linear Regression:
•Linear regression is a statistical regression method used for predictive analysis.
•It is one of the simplest and most widely used algorithms; it performs regression and models the relationship between continuous variables.
•It is used for solving regression problems in machine learning.
•Linear regression shows the linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis), hence the name linear regression.
•If there is only one input variable (x), it is called simple linear regression. If there is more than one input variable, it is called multiple linear regression.
•Below is the mathematical equation for linear regression:
Y = aX + b
Here, Y = dependent variable (target variable), X = independent variable (predictor variable), and a and b are the linear coefficients (slope and intercept).
•The relationship between variables in a linear regression model can be illustrated by predicting the salary of an employee on the basis of years of experience.
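A minimal sketch of fitting Y = aX + b by ordinary least squares with NumPy, mirroring the salary-versus-experience example; the salary figures are made up.

# Fit a simple linear regression line with ordinary least squares.
import numpy as np

experience = np.array([1, 2, 3, 4, 5, 6], dtype=float)    # years (X)
salary = np.array([35, 42, 50, 55, 63, 68], dtype=float)  # in $1000s (Y)

# np.polyfit with degree 1 returns the least-squares slope a and intercept b
a, b = np.polyfit(experience, salary, 1)
print(f"Y = {a:.2f} * X + {b:.2f}")
print("Predicted salary at 7 years:", a * 7 + b)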
Why do we use Regression Analysis?
• Regression analysis helps in the prediction of a continuous variable.
• There are various real-world scenarios where we need future predictions, such as weather conditions, sales figures, or marketing
trends. For such cases we need a technique that can make accurate predictions; regression analysis is a statistical method used for
this purpose in machine learning and data science.
• Regression estimates the relationship between the target and the independent variable.
• It is used to find the trends in data.
Common types of regression include:
•Linear Regression
•Logistic Regression
•Polynomial Regression
•Support Vector Regression
•Decision Tree Regression
•Random Forest Regression
•Ridge Regression
•Lasso Regression
Regression Algorithms
There are many different types of regression algorithms, but some of the most common include:
•Linear Regression
• Linear regression is one of the simplest and most widely used statistical models. This assumes that there is a linear relationship between the
independent and dependent variables. This means that the change in the dependent variable is proportional to the change in the independent
variables.
•Polynomial Regression
• Polynomial regression is used to model nonlinear relationships between the dependent variable and the independent variables. It adds
polynomial terms to the linear regression model to capture more complex relationships.
•Support Vector Regression (SVR)
• Support vector regression (SVR) is a regression algorithm based on the support vector machine (SVM). SVM is primarily used for
classification tasks, but the same idea can be adapted for regression. SVR fits a function so that most training points lie within a
margin (ε) of it, penalizing only the points that fall outside this margin.
•Decision Tree Regression
• Decision tree regression is a type of regression algorithm that builds a decision tree to predict the target value. A decision tree is a tree-like
structure that consists of nodes and branches. Each node represents a decision, and each branch represents the outcome of that decision. The
goal of decision tree regression is to build a tree that can accurately predict the target value for new data points.
•Random Forest Regression
• Random forest regression is an ensemble method that combines multiple decision trees to predict the target value. Ensemble methods are a
type of machine learning algorithm that combines multiple models to improve the performance of the overall model. Random forest regression
works by building a large number of decision trees, each of which is trained on a different subset of the training data. The final prediction is
made by averaging the predictions of all of the trees.
Lasso Regression
A regression model that uses the L1 regularization technique is called LASSO (Least Absolute Shrinkage and Selection
Operator) regression. Lasso regression adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function (L). Lasso
regression also helps us achieve feature selection by shrinking the weights of features that do not serve the model, driving them to
exactly zero.
Ridge Regression
A regression model that uses the L2 regularization technique is called Ridge regression. Ridge regression adds the “squared
magnitude” of the coefficients as a penalty term to the loss function (L).
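A minimal sketch contrasting the two penalties with scikit-learn; the synthetic data (only two informative features out of five) and the alpha value are illustrative assumptions.

# L1 vs. L2 regularization on data where only 2 of 5 features matter.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, 100)  # only 2 informative features

# Lasso loss: MSE + alpha * sum(|w_i|)  -> can zero out useless weights
# Ridge loss: MSE + alpha * sum(w_i^2)  -> shrinks weights, rarely to exactly zero
print("Lasso coefficients:", Lasso(alpha=0.1).fit(X, y).coef_.round(3))
print("Ridge coefficients:", Ridge(alpha=0.1).fit(X, y).coef_.round(3))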
Advantages of Regression
•Easy to understand and interpret (especially linear models)
•Some variants (e.g., tree-based and robust regression methods) handle outliers well
•The family of methods can handle both linear and nonlinear relationships.
Disadvantages of Regression
•Linear models assume linearity between features and target
•Sensitive to multicollinearity among input features
•May not be suitable for highly complex relationships
What is the Best-Fit Line?
Our primary objective when using linear regression is to locate the best-fit line, which means the error between the predicted and
actual values should be kept to a minimum; the best-fit line is the line with the least error.
The best Fit Line equation provides a straight line that represents the relationship between the dependent and
independent variables. The slope of the line indicates how much the dependent variable changes for a unit
change in the independent variable(s).
Prediction
Prediction refers to the process of using a trained machine learning model to make forecasts or estimates about future or unseen data based on patterns
learned from historical data. The focus is on the accuracy of the outcome.
Example: Imagine you have a machine learning model that predicts house prices based on features like the number of bedrooms, square footage, and
location. After training the model on historical data of house sales, you can use it to predict the price of a new house.
•Input Data: Number of bedrooms = 3, Square footage = 2,000 sq ft, Location = Suburban
•Model Output (Prediction): Estimated price = $350,000
The model provides a prediction for the price of the house based on the patterns it has learned.
Prediction Process:
Training: The model is trained on historical data to learn relationships between input features and target outcomes.
Testing: The model is validated using a separate dataset to evaluate its predictive accuracy.
Inference: The model is deployed to make predictions on new data.
Evaluation Metrics:
For regression: Common metrics include Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared.
For classification: Metrics include accuracy, precision, recall, F1 score, and ROC-AUC.
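A minimal sketch computing the metrics listed above with scikit-learn on small made-up prediction vectors.

# Evaluation metrics for regression and classification predictions.
from sklearn.metrics import (mean_absolute_error, mean_squared_error, r2_score,
                             accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Regression metrics
y_true_r = [3.0, 5.0, 2.5, 7.0]
y_pred_r = [2.8, 5.3, 2.0, 7.4]
print("MAE:", mean_absolute_error(y_true_r, y_pred_r))
print("MSE:", mean_squared_error(y_true_r, y_pred_r))
print("R^2:", r2_score(y_true_r, y_pred_r))

# Classification metrics
y_true_c = [1, 0, 1, 1, 0, 1]
y_pred_c = [1, 0, 0, 1, 0, 1]
print("Accuracy:", accuracy_score(y_true_c, y_pred_c))
print("Precision:", precision_score(y_true_c, y_pred_c))
print("Recall:", recall_score(y_true_c, y_pred_c))
print("F1:", f1_score(y_true_c, y_pred_c))
print("ROC-AUC:", roc_auc_score(y_true_c, y_pred_c))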
Example: A machine learning model trained on historical sales data can predict future sales for a retail store based on factors like seasonality, promotions,
and past performance.
Interpretation:
Interpretation involves understanding and explaining how a machine learning model makes its predictions. It focuses on the transparency of the
model and the reasons behind its output. This is crucial for trust, debugging, and improving the model.
Example: Continuing with the house price prediction model, interpretation might involve understanding which features most influenced the
prediction.
•Feature Importance: The model might reveal that the location has the highest impact on the predicted price, followed by square footage, and
then the number of bedrooms.
•Model Explanation: You might use techniques like feature importance scores or SHAP (SHapley Additive exPlanations) values to explain how
much each feature contributed to the prediction of $350,000.
Methods of Interpretation:
Feature Importance: Techniques like feature importance scores (e.g., from decision trees) indicate which features most significantly impact
predictions.
SHAP Values (SHapley Additive exPlanations): A method that assigns each feature an importance value for a particular prediction, helping to
explain individual predictions.
LIME (Local Interpretable Model-agnostic Explanations): Provides explanations for individual predictions by approximating the model locally
with an interpretable model.
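A minimal feature-importance sketch: a random forest trained on synthetic "house" data constructed so that location dominates the price; the feature names and coefficients are illustrative assumptions.

# Inspect which features most influence a random forest's predictions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 300
sqft = rng.uniform(800, 3000, n)
bedrooms = rng.integers(1, 6, n).astype(float)
location_score = rng.uniform(0, 10, n)
X = np.column_stack([sqft, bedrooms, location_score])
# Location dominates the price in this synthetic setup
y = 30_000 * location_score + 100 * sqft + 5_000 * bedrooms + rng.normal(0, 5_000, n)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
for name, imp in zip(["sqft", "bedrooms", "location"], model.feature_importances_):
    print(f"{name}: {imp:.2f}")  # location should get the largest share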
Data Splits
Data splitting is a crucial step in the machine learning process that involves dividing the dataset into distinct subsets to
train, validate, and test models. This practice helps ensure that models generalize well to unseen data and prevents
overfitting. Here’s a detailed overview of common data splits, their purposes, and best practices.
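A minimal sketch of a train/validation/test split with scikit-learn's train_test_split; the 60/20/20 proportions assumed here are a common convention, not a fixed rule.

# Split a dataset into train (60%), validation (20%), and test (20%) subsets.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# First carve out 20% for the final test set...
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# ...then split the remainder into train (60% overall) and validation (20% overall)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20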