
Unit 3.1 Machine Learning Linear Regression

Uploaded by

shristii365
Copyright
© © All Rights Reserved

Unit 3:

• Supervised Machine Learning: Regression,


• Introduction to Supervised Machine Learning,
• Types of Machine Learning,
• Interpretation and Prediction,
• Linear Regression,
• Regression and Classification,
• Data Splits.
Supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning are four fundamental approaches to machine learning:
Supervised Learning: Supervised machine learning involves training a model on a labeled dataset, where the input data is paired
with the correct output. Examples of Algorithms: Logistic Regression, Decision Trees, Support Vector Machines, Neural Networks.
Common Use Cases:
Email classification (spam detection)
Credit scoring
Medical diagnosis

Unsupervised Learning: Unsupervised machine learning is a type of machine learning that deals with data that has not been labeled,
categorized, or classified. The goal is to identify patterns, relationships, or structures within the data without any prior guidance on what those
patterns might be. Examples of Algorithms: K-Means, Hierarchical Clustering, t-SNE, Autoencoders.
Common Use Cases: Customer segmentation for targeted marketing
Anomaly detection in network security
Document clustering

Semi-Supervised Learning: Semi-supervised learning is a machine learning approach that combines aspects of both supervised and
unsupervised learning. It leverages a small amount of labeled data along with a larger amount of unlabeled data to improve learning efficiency
and model performance. Examples of Techniques: Self-Training, Co-Training, Graph-Based Methods.
Common Use Cases: Image classification with limited labeled images
Text classification where only a few documents are labeled

•Reinforcement Learning: This is a feedback-based learning method, based on a system of rewards and punishments for correct
and incorrect actions respectively. The aim is for the “learning agent” to receive maximum reward and hence improve its performance.
Examples of Algorithms: Q-Learning, Deep Q-Networks (DQN), Policy Gradient Methods.
Common Use Cases: Game AI (e.g., training agents to play video games)
Robotics (e.g., teaching robots to navigate and manipulate objects)
Autonomous vehicles
How Does Supervised Learning Work?
In supervised learning, models are trained using a labelled dataset, where the model learns about each type of
data. Once the training process is completed, the model is tested on test data (a held-out subset of the
labelled data that was not used for training), and then it predicts the output.
The working of Supervised learning can be easily understood by the below example and diagram:

Suppose we have a dataset of different types of shapes, which includes squares, rectangles, triangles, and polygons (for example, hexagons).
The first step is to train the model on each shape.
If the given shape has four sides, and all the sides are equal, then it will be labelled as a square.
If the given shape has three sides, then it will be labelled as a triangle.
If the given shape has six equal sides, then it will be labelled as a hexagon.
After training, we test our model using the test set, and the task of the model is to identify the shape.
The machine is already trained on all types of shapes, so when it finds a new shape, it classifies the shape on the basis of its
number of sides and predicts the output.
• Supervised Learning:

• Classification: In classification tasks, the goal is to predict a discrete label or category. Common algorithms include:
o Logistic Regression
o Decision Trees
o Random Forests
o Support Vector Machines (SVM)
o Naive Bayes
o Neural Networks
Examples of classification problems include spam detection (spam or not spam) and sentiment analysis (positive, negative,
neutral).

• Regression: In regression tasks, the goal is to predict a continuous value. Common algorithms include:
o Linear Regression
o Polynomial Regression
o Ridge and Lasso Regression
o Support Vector Regression (SVR)
o Decision Trees (for regression)
o Neural Networks (for regression)
Examples of regression problems include predicting house prices, stock prices, or temperature.

• Other Considerations
o Binary vs. Multi-class: Classification can be binary (two classes) or multi-class (more than two classes).
o Single-output vs. Multi-output: Regression can involve predicting a single continuous variable or multiple continuous
variables.
Supervised Learning:
Supervised machine learning algorithms can be categorized into several types based on their characteristics and the nature of the
tasks they perform. Here are the main types:
1. Classification Algorithms: Predicting categorical labels (e.g., spam detection).
These algorithms predict discrete labels or categories.
•Logistic Regression: A linear model used for binary classification. (e.g., spam detection).
•Decision Trees: A tree-like model that makes decisions based on feature values. (e.g., loan approval).
•Random Forests: An ensemble of decision trees that improves accuracy and reduces overfitting. (e.g., classifying species of
flowers).
•Support Vector Machines (SVM): Finds the hyperplane that best separates classes in the feature space. (e.g., image
recognition).
•Naive Bayes: A probabilistic classifier based on Bayes' theorem, assuming feature independence.
•K-Nearest Neighbors (KNN): Classifies based on the majority label of the nearest neighbors.

2. Regression Algorithms: Predicting continuous values (e.g., house price prediction).


These algorithms predict continuous values.
•Linear Regression: Models the relationship between input features and a continuous output as a linear equation. (e.g.,
predicting house prices)
•Polynomial Regression: Extends linear regression by adding polynomial terms for non-linear relationships. (e.g., sales
predictions).
•Ridge and Lasso Regression: Regularization techniques to prevent overfitting in linear models.
•Support Vector Regression (SVR): An extension of SVM for regression tasks. (e.g., predicting stock prices).
•Decision Trees (for regression): Similar to classification trees, but used for predicting continuous values.
1. Regression
Regression is a type of supervised learning that is used to predict continuous values, such as sales, salary, weight, or temperature.
Regression algorithms learn a function that maps from the input features to the output value.

For example, predicting house prices involves collecting data on features like square footage, number of bedrooms, and location,
along with the selling price of each house. After preprocessing the data, such as handling missing values and encoding categorical
variables, a regression model (e.g., linear regression or random forest regression) is trained on a subset of the data. The model is
then evaluated using metrics like Mean Absolute Error (MAE) and R-squared (R²) to measure its accuracy. Once validated, the
model can predict house prices for new data, providing insights into real estate values based on the learned relationships
between features and prices. In short, a regression algorithm can be trained to learn the relationship between the features and the
price of the house.
2. Classification
Classification is a type of supervised learning that is used to predict categorical values, such as whether an email is spam or not,
or whether a medical image shows a tumor or not. Classification algorithms learn a function that maps from the input features to
a probability distribution over the output classes.
The best example of an ML classification algorithm is Email Spam Detector.
The main goal of the Classification algorithm is to identify the category of a given dataset, and these algorithms are mainly used
to predict the output for the categorical data.
Classification algorithms can be better understood using the below diagram. In the below diagram, there are two classes, class A
and Class B. These classes have features that are similar to each other and dissimilar to other classes.
The algorithm which implements the classification on a dataset is known as a classifier. There are two types of classification:
Binary Classifier: If the classification problem has only two possible outcomes, then it is called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
Multi-class Classifier: If a classification problem has more than two outcomes, then it is called a Multi-class Classifier.
Example: Classifications of types of crops, Classification of types of music.
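The statement above that a classifier maps input features to a probability distribution over the output classes can be illustrated with a small sketch. The raw scores and class names below are made up for illustration; in practice the scores would come from a trained model. The softmax function is one common way (not the only one) to turn scores into probabilities.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    highest = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(scores, class_names):
    """Return the most probable class together with the full distribution."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs

# Hypothetical scores produced by some trained model for one email.
label, probs = classify([2.0, 0.5], ["SPAM", "NOT SPAM"])
print(label, probs)
```

With two classes this is a binary classifier; passing three or more scores and names gives the multi-class case.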
• Unsupervised Learning:
Unsupervised learning can be categorized into several types based on its objectives and methodologies. Here are the main types:
1. Clustering
Clustering aims to group similar data points together based on certain features. Common algorithms include:
K-Means: Partitions data into k clusters by minimizing variance within each cluster. (e.g., customer segmentation).
Hierarchical Clustering: Creates a tree-like structure of clusters using either agglomerative or divisive methods. (e.g., organizing documents).
DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Identifies clusters based on the density of data points, useful for finding
irregularly shaped clusters. (e.g., anomaly detection in network data).
Gaussian Mixture Models (GMM): Assumes that data points are generated from a mixture of several Gaussian distributions.

2. Dimensionality Reduction
These techniques aim to reduce the number of features or variables in the data while preserving its essential structure. This is particularly
useful when dealing with high-dimensional datasets. Common methods include:
Principal Component Analysis (PCA): Transforms data to a lower-dimensional space by identifying principal components. (e.g., anomaly detection
in network data).
t-Distributed Stochastic Neighbor Embedding (t-SNE): A non-linear method primarily used for visualizing high-dimensional data. (e.g., visualizing
word embeddings).
Autoencoders: Neural network-based models that compress data into a lower-dimensional representation. (e.g., denoising images).

3. Association Rule Learning


Association rule learning is an unsupervised learning method used for finding relationships between variables in large databases.
Common algorithms include:
Apriori Algorithm: Mines frequent itemsets in transactional data and generates association rules from them. (e.g.,
market basket analysis).
Eclat Algorithm: Finds frequent itemsets using depth-first search, which is often more efficient than
Apriori.
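To make "frequent itemsets" concrete, the sketch below counts how often each pair of items appears together and keeps the pairs whose support (the fraction of transactions containing them) meets a threshold. This is plain brute-force counting for illustration, not the pruning strategy of Apriori or Eclat, and the baskets are invented.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, size, min_support):
    """Return {itemset: support} for itemsets of the given size whose
    support (fraction of transactions containing them) is >= min_support."""
    counts = Counter()
    for basket in transactions:
        for itemset in combinations(sorted(set(basket)), size):
            counts[itemset] += 1
    n = len(transactions)
    return {items: c / n for items, c in counts.items() if c / n >= min_support}

# Hypothetical market baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "jam"],
]
pairs = frequent_itemsets(baskets, size=2, min_support=0.4)
print(pairs)
```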
• Semi-Supervised Learning:

These algorithms utilize both labeled and unlabeled data to improve learning efficiency.

Self-Training: The model iteratively labels the unlabeled data based on its predictions. (e.g., image classification with few labeled
images).
In this approach, a model is initially trained on the labeled data. It then makes predictions on the unlabeled data, and the most
confident predictions are added to the training set as "pseudo-labels." This process can be iteratively refined.

Co-Training: Two models train on different features and label data for each other. (e.g., text classification).
Co-training involves training two or more classifiers on different feature sets of the same data. Each classifier labels unlabeled
examples for the other classifiers, allowing them to improve by leveraging their different perspectives.

Graph-Based Methods: Use graph structures to propagate labels from labeled to unlabeled data. (e.g., social network analysis).
These methods represent data as a graph, where nodes represent data points and edges represent similarities. The idea is to
propagate labels through the graph, allowing unlabeled data points to acquire labels based on their connections to labeled
points.
Reinforcement Learning
• Reinforcement Learning is a feedback-based Machine learning technique in which an agent learns to behave in an environment by performing
the actions and seeing the results of actions. For each good action, the agent gets positive feedback, and for each bad action, the agent gets
negative feedback or penalty.
• In Reinforcement Learning, the agent learns automatically from feedback without any labeled data, unlike supervised learning.
• Since there is no labeled data, the agent is bound to learn by its experience only.
• RL solves a specific type of problem where decision making is sequential, and the goal is long-term, such as game-playing, robotics, etc.
• The reinforcement learning agent learns about a problem by interacting with its environment. The environment provides information on its
current state. The agent then uses that information to determine which actions(s) to take. If that action obtains a reward signal from the
surrounding environment, the agent is encouraged to take that action again when in a similar future state. This process repeats for every new
state thereafter. Over time, the agent learns from rewards and punishments to take actions within the environment that meet a specified goal.
• The primary goal of an agent in reinforcement learning is to improve the performance by getting the maximum positive rewards.
• The agent learns by trial and error, and based on the experience, it learns to perform the task in a better way. Hence, we can say
that "Reinforcement learning is a type of machine learning method where an intelligent agent (computer program) interacts with the
environment and learns to act within that."
The below figure illustrates how the agent interacts with the environment in reinforcement learning:

Terms used in Reinforcement Learning


•Agent: An entity that can perceive/explore the environment and act upon it (e.g., an AI robot).
•Environment: The situation in which an agent is present or by which it is surrounded (e.g., a room, maze, football
ground, etc.).
•Action: Actions are the moves taken by an agent within the environment.
•State: State is the situation returned by the environment after each action taken by the agent.
•Reward: Feedback returned to the agent from the environment to evaluate the action of the
agent.
•Policy: Policy is a strategy applied by the agent to choose the next action based on the current state.
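The terms above can be tied together with a small sketch of tabular Q-learning (one of the algorithms named earlier). The environment is a made-up five-cell corridor in which the agent starts at cell 0 and is rewarded only on reaching cell 4; the learning rate, discount factor, and exploration rate are illustrative assumptions, not recommended settings.

```python
import random

N_STATES = 5        # states: corridor cells 0..4, cell 4 is the goal
ACTIONS = (0, 1)    # actions: 0 = move left, 1 = move right

def step(state, action):
    """Environment: return (next_state, reward) for the chosen action."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action]
    for _ in range(episodes):
        state = 0
        for _ in range(200):  # cap on episode length
            # Explore with probability epsilon, or when both actions look equal.
            explore = rng.random() < epsilon or q[state][0] == q[state][1]
            action = rng.choice(ACTIONS) if explore else q[state].index(max(q[state]))
            next_state, reward = step(state, action)
            # Q-learning update: move the estimate towards reward + discounted best future value.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if state == N_STATES - 1:
                break
    return q

q_table = train()
# Policy: in each non-goal state, take the action with the highest learned value.
policy = [row.index(max(row)) for row in q_table[:-1]]
print(policy)
```

Here the agent is the training loop, the environment is `step`, the reward is the value it returns, and the policy is read off the learned table.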
Steps Involved in Supervised Learning:

• First, determine the type of training dataset.
• Collect/gather the labelled training data.
• Split the dataset into a training dataset, a validation dataset, and a test dataset.
• Determine the input features of the training dataset, which should carry enough information for the model
to accurately predict the output.
• Determine a suitable algorithm for the model, such as support vector machine, decision tree, etc.
• Execute the algorithm on the training dataset. Sometimes we also need a validation set, a subset held out
from the training data, to tune the control parameters (hyperparameters).
• Evaluate the accuracy of the model on the test set. If the model predicts the correct outputs for the
test set, the model is accurate.

Types of supervised Machine learning Algorithms:


Supervised learning can be further divided into two types of problems: regression and classification.
Advantages of Supervised learning:
• Supervised learning uses previously collected, labelled data to produce outputs for new
inputs.
• Helps to optimize performance criteria with the help of experience.
• Supervised machine learning helps to solve various types of real-world computation problems.
• It performs classification and regression tasks.
• It allows estimating or mapping the result to a new sample.
• We have complete control over choosing the number of classes we want in the training data.

Disadvantages of supervised learning:


• Supervised learning models can struggle with very complex tasks.
• Supervised learning may not predict the correct output if the test data differs substantially from the
training dataset.
• Training supervised learning models can need a lot of computation time.
• In supervised learning, we need enough knowledge about the classes of objects.
Applications of Supervised learning
Supervised learning can be used to solve a wide variety of problems, including:
•Spam filtering: Supervised learning algorithms can be trained to identify and classify spam emails based
on their content, helping users avoid unwanted messages.
•Image classification: Supervised learning can automatically classify images into different categories,
such as animals, objects, or scenes, facilitating tasks like image search, content moderation, and image-
based product recommendations.
•Medical diagnosis: Supervised learning can assist in medical diagnosis by analyzing patient data, such
as medical images, test results, and patient history, to identify patterns that suggest specific diseases or
conditions.
•Fraud detection: Supervised learning models can analyze financial transactions and identify patterns
that indicate fraudulent activity, helping financial institutions prevent fraud and protect their customers.
•Natural language processing (NLP): Supervised learning plays a crucial role in NLP tasks, including
sentiment analysis, machine translation, and text summarization, enabling machines to understand and
process human language effectively.
Regression Analysis
What is regression in machine learning?
Regression Analysis in Machine learning
Regression analysis is a statistical approach used to analyze the relationship between a dependent variable (target variable) and one or more independent
variables (predictor). More specifically, Regression analysis helps us to understand how the value of the dependent variable is changing corresponding to
an independent variable when other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.

Characteristics of Regression
Here are the characteristics of the regression:
•Continuous Target Variable: Regression deals with predicting continuous target variables that represent numerical values. Examples include
predicting house prices, forecasting sales figures, or estimating patient recovery times.
•Error Measurement: Regression models are evaluated based on their ability to minimize the error between the predicted and actual values of the
target variable. Common error metrics include mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE).
•Model Complexity: Regression models range from simple linear models to more complex nonlinear models. The choice of model complexity depends
on the complexity of the relationship between the input features and the target variable.
•Overfitting and Underfitting: Regression models are susceptible to overfitting and underfitting.
•Interpretability: The interpretability of regression models varies depending on the algorithm used. Simple linear models are highly interpretable, while
more complex models may be more difficult to interpret.
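The error metrics named above (MAE, MSE, RMSE) follow directly from the residuals, as this minimal sketch shows. The actual and predicted values are invented for illustration.

```python
import math

def regression_errors(actual, predicted):
    """Return (MAE, MSE, RMSE) for paired actual and predicted values."""
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(r) for r in residuals) / n   # mean absolute error
    mse = sum(r * r for r in residuals) / n    # mean squared error
    rmse = math.sqrt(mse)                      # root mean squared error
    return mae, mse, rmse

# Hypothetical house prices (in thousands) and a model's predictions.
actual = [200.0, 250.0, 300.0]
predicted = [210.0, 240.0, 330.0]
mae, mse, rmse = regression_errors(actual, predicted)
print(mae, mse, rmse)
```

Because MSE and RMSE square the residuals, the single 30-unit miss dominates them far more than it dominates MAE.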
Linear Regression:
•Linear regression is a statistical regression
method which is used for predictive analysis.
•It is one of the very simple and easy algorithms
which works on regression and shows the
relationship between the continuous variables.
•It is used for solving the regression problem in
machine learning.
•Linear regression shows the linear relationship
between the independent variable (X-axis) and
the dependent variable (Y-axis), hence called
linear regression.
•If there is only one input variable (x), then such
linear regression is called simple linear
regression. And if there is more than one input
variable, then such linear regression is
called multiple linear regression.
•The relationship between variables in the linear
regression model can be explained using the
below image. Here we are predicting the salary
of an employee on the basis of the year of
experience.
•Below is the mathematical equation for Linear regression:
Y = aX + b
Here, Y = dependent variable (target variable),
X = independent variable (predictor variable),
a and b are the linear coefficients
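For simple linear regression, the coefficients in Y = aX + b can be estimated with the standard least-squares formulas: a is the covariance of X and Y divided by the variance of X, and b makes the line pass through the point of means. The sketch below applies them to invented experience and salary figures.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope a and intercept b for Y = aX + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx              # slope
    b = mean_y - a * mean_x    # intercept
    return a, b

# Hypothetical data: years of experience and salary (in thousands).
years = [1, 2, 3, 4, 5]
salary = [35, 40, 45, 50, 55]
a, b = fit_line(years, salary)
predicted_salary = a * 6 + b   # prediction for 6 years of experience
print(a, b, predicted_salary)
```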
Why do we use Regression Analysis?
• Regression analysis helps in the prediction of a continuous variable.
• There are various scenarios in the real world where we need future predictions, such as weather
conditions, sales, and marketing trends. For such cases we need a technique that can make
accurate predictions: regression analysis, a statistical method used in machine learning and
data science.
• Regression estimates the relationship between the target and the independent variable.
• It is used to find the trends in data.

Terminologies Related to the Regression Analysis:


•Dependent Variable: The main factor in Regression analysis which we want to predict or understand is called the dependent
variable. It is also called target variable.
•Independent Variable: The factors which affect the dependent variables or which are used to predict the values of the dependent
variables are called independent variable, also called as a predictor.
•Outliers: An outlier is an observation with a very low or very high value in comparison to the other observed values.
An outlier can distort the fitted model, so outliers should be detected and handled before fitting.
•Multicollinearity: If the independent variables are highly correlated with each other, the condition is
called multicollinearity. It should be avoided in the dataset, because it makes it difficult to rank the variables by how strongly they affect the target.
•Underfitting and Overfitting: If our algorithm works well with the training dataset but not with the test dataset, the problem is
called overfitting. If our algorithm does not perform well even with the training dataset, the problem is called underfitting.
Types of Regression
There are various types of regressions which are used
in data science and machine learning. Each type has
its own importance on different scenarios, but at the
core, all the regression methods analyze the effect of
the independent variable on dependent variables.
Here we are discussing some important types of
regression which are given below:

•Linear Regression
•Logistic Regression (despite its name, mainly used for classification)
•Polynomial Regression
•Support Vector Regression
•Decision Tree Regression
•Random Forest Regression
•Ridge Regression
•Lasso Regression
Regression Algorithms
There are many different types of regression algorithms, but some of the most common include:
•Linear Regression
• Linear regression is one of the simplest and most widely used statistical models. This assumes that there is a linear relationship between the
independent and dependent variables. This means that the change in the dependent variable is proportional to the change in the independent
variables.
•Polynomial Regression
• Polynomial regression is used to model nonlinear relationships between the dependent variable and the independent variables. It adds
polynomial terms to the linear regression model to capture more complex relationships.
•Support Vector Regression (SVR)
• Support vector regression (SVR) is a type of regression algorithm that is based on the support vector machine (SVM) algorithm. SVM is
mainly used for classification tasks, but it can also be adapted to regression. SVR works by fitting a function (a hyperplane in the
feature space) that keeps as many training points as possible within a margin of tolerance (epsilon) around it, penalizing only the points that fall outside that margin.
•Decision Tree Regression
• Decision tree regression is a type of regression algorithm that builds a decision tree to predict the target value. A decision tree is a tree-like
structure that consists of nodes and branches. Each node represents a decision, and each branch represents the outcome of that decision. The
goal of decision tree regression is to build a tree that can accurately predict the target value for new data points.
•Random Forest Regression
• Random forest regression is an ensemble method that combines multiple decision trees to predict the target value. Ensemble methods are a
type of machine learning algorithm that combines multiple models to improve the performance of the overall model. Random forest regression
works by building a large number of decision trees, each of which is trained on a different subset of the training data. The final prediction is
made by averaging the predictions of all of the trees.

Regularized Linear Regression Techniques


•Ridge Regression
• Ridge regression is a type of linear regression that is used to prevent overfitting. Overfitting occurs when the model fits the training data too
closely, including its noise, and is unable to generalize to new data. Ridge regression counters this by adding a penalty on the squared size of the weights to the loss function, which shrinks the weights towards zero.
•Lasso regression
• Lasso regression is another type of linear regression that is used to prevent overfitting. It does this by adding a penalty term to the loss function
that shrinks the weights and can set some of them exactly to zero, so that only a subset of the features is used.
Regularized Linear Regression Techniques: Regularization is a technique used to reduce errors by fitting the function appropriately on the given
training set and avoiding overfitting.

Lasso regression
A regression model which uses the L1 regularization technique is called LASSO (Least Absolute Shrinkage and Selection
Operator) regression. Lasso regression adds the "absolute value of magnitude" of the coefficients as a penalty term to the loss function (L). Lasso
regression also helps us achieve feature selection by shrinking the weights of features that do not serve any
purpose in the model, often to exactly zero.

Ridge Regression
A regression model that uses the L2 regularization technique is called Ridge regression. Ridge regression adds the “squared
magnitude” of the coefficient as a penalty term to the loss function(L).

Elastic Net Regression


This model is a combination of L1 as well as L2 regularization. That implies that we add the absolute norm of the weights as well
as the squared measure of the weights.
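The three penalties differ only in what is added to the loss. The sketch below writes them out for a given list of weights. The parameter names `lam` (penalty strength) and `l1_ratio` (mix between L1 and L2) are assumptions for this example, and the exact scaling of each term varies between textbooks and libraries.

```python
def l1_penalty(weights):
    """Lasso term: sum of absolute values of the weights."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Ridge term: sum of squared weights."""
    return sum(w * w for w in weights)

def regularized_loss(base_loss, weights, lam=0.1, l1_ratio=0.5):
    """Base loss plus a weighted mix of the L1 and L2 penalties.
    l1_ratio=1.0 gives lasso, 0.0 gives ridge, values in between give elastic net."""
    penalty = l1_ratio * l1_penalty(weights) + (1 - l1_ratio) * l2_penalty(weights)
    return base_loss + lam * penalty

weights = [3.0, -4.0]   # hypothetical model weights
mse = 1.0               # hypothetical unregularized loss
lasso_loss = regularized_loss(mse, weights, l1_ratio=1.0)
ridge_loss = regularized_loss(mse, weights, l1_ratio=0.0)
elastic_loss = regularized_loss(mse, weights, l1_ratio=0.5)
print(lasso_loss, ridge_loss, elastic_loss)
```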
Note: Already explained in Class
Applications of Regression
•Predicting prices: For example, a regression model could be used to predict the price of a house based on its size, location, and
other features.
•Forecasting trends: For example, a regression model could be used to forecast the sales of a product based on historical sales
data and economic indicators.
•Identifying risk factors: For example, a regression model could be used to identify risk factors for heart disease based on patient
data.
•Making decisions: For example, a regression model could be used to recommend which investment to buy based on market data.

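Advantages of Regression
•Easy to understand and interpret (especially linear models)
•Can handle both linear and nonlinear relationships, for example through polynomial terms or tree-based models.

Disadvantages of Regression
•Linear regression assumes linearity
•Sensitive to outliers: extreme values can distort a least-squares fit
•Sensitive to multicollinearity
•May not be suitable for highly complex relationships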
What is the best Fit Line?
Our primary objective while using linear regression is to locate the best-fit line, which implies that the error
between the predicted and actual values should be kept to a minimum. There will be the least error in the best-
fit line.
The best Fit Line equation provides a straight line that represents the relationship between the dependent and
independent variables. The slope of the line indicates how much the dependent variable changes for a unit
change in the independent variable(s).

Here Y is called the dependent or target variable and X is called the independent variable, also known as the predictor of Y. There are
many types of functions or models that can be used for regression. A linear function is the simplest type of function. Here,
X may be a single feature or multiple features representing the problem.
Linear regression predicts the value of a dependent variable (y) from a given independent variable (x) by assuming a linear
relationship between them. Hence the name Linear Regression. In the figure above, X (input) is the work experience and
Y (output) is the salary of a person. The regression line is the best-fit line for our model.
Different values for the weights (the coefficients of the line) result in different regression lines, so we use a cost function
to compare candidate lines and compute the coefficient values that give the best-fit line.
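One common way to use a cost function to find the best-fit line is gradient descent on the mean squared error. The sketch below is a minimal version for a single feature; the data, learning rate, and number of steps are assumptions chosen so that this small example converges.

```python
def mse_cost(a, b, xs, ys):
    """Cost function: mean squared error of the line y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Start from a = b = 0 and repeatedly step against the gradient of the cost."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(a * x + b) - y for x, y in zip(xs, ys)]
        grad_a = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b

# Hypothetical data lying exactly on the line y = 2x + 1.
experience = [1, 2, 3, 4, 5]
salary = [3, 5, 7, 9, 11]
a, b = gradient_descent(experience, salary)
print(a, b, mse_cost(a, b, experience, salary))
```

Each candidate pair (a, b) defines a different line; the cost ranks them, and the updates move towards the pair with the lowest cost.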
Interpretation and Prediction
Interpretation and Prediction in Machine Learning
In machine learning, interpretation and prediction are two crucial aspects that guide the understanding and application of models. Here's a detailed
overview of both concepts:

Prediction
Prediction refers to the process of using a trained machine learning model to make forecasts or estimates about future or unseen data based on patterns
learned from historical data. The focus is on the accuracy of the outcome.

Example: Imagine you have a machine learning model that predicts house prices based on features like the number of bedrooms, square footage, and
location. After training the model on historical data of house sales, you can use it to predict the price of a new house.
•Input Data: Number of bedrooms = 3, Square footage = 2,000 sq ft, Location = Suburban
•Model Output (Prediction): Estimated price = $350,000
The model provides a prediction for the price of the house based on the patterns it has learned.

Key Aspects of Prediction:


Model Types:
Regression Models: Predict continuous outcomes (e.g., house prices).
Classification Models: Predict discrete categories (e.g., spam vs. not spam).

Prediction Process:
Training: The model is trained on historical data to learn relationships between input features and target outcomes.
Testing: The model is validated using a separate dataset to evaluate its predictive accuracy.
Inference: The model is deployed to make predictions on new data.

Evaluation Metrics:
For regression: Common metrics include Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared.
For classification: Metrics include accuracy, precision, recall, F1 score, and ROC-AUC.
Example: A machine learning model trained on historical sales data can predict future sales for a retail store based on factors like seasonality, promotions,
and past performance.
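The classification metrics listed above can be computed from counts of true positives, false positives, and false negatives. A minimal sketch, with invented labels and "spam" treated as the positive class:

```python
def classification_metrics(actual, predicted, positive="spam"):
    """Return accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical true labels and model predictions for six emails.
actual = ["spam", "spam", "spam", "ham", "ham", "ham"]
predicted = ["spam", "spam", "ham", "spam", "ham", "ham"]
metrics = classification_metrics(actual, predicted)
print(metrics)
```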
Interpretation:
Interpretation involves understanding and explaining how a machine learning model makes its predictions. It focuses on the transparency of the
model and the reasons behind its output. This is crucial for trust, debugging, and improving the model.
Example: Continuing with the house price prediction model, interpretation might involve understanding which features most influenced the
prediction.
•Feature Importance: The model might reveal that the location has the highest impact on the predicted price, followed by square footage, and
then the number of bedrooms.
•Model Explanation: You might use techniques like feature importance scores or SHAP (SHapley Additive exPlanations) values to explain how
much each feature contributed to the prediction of $350,000.

Methods of Interpretation:
Feature Importance: Techniques like feature importance scores (e.g., from decision trees) indicate which features most significantly impact
predictions.
SHAP Values (SHapley Additive exPlanations): A method that assigns each feature an importance value for a particular prediction, helping to
explain individual predictions.
LIME (Local Interpretable Model-agnostic Explanations): Provides explanations for individual predictions by approximating the model locally
with an interpretable model.
Data Splits
Data splitting is a crucial step in the machine learning process that involves dividing the dataset into distinct subsets to
train, validate, and test models. This practice helps ensure that models generalize well to unseen data and prevents
overfitting. Here’s a detailed overview of common data splits, their purposes, and best practices.

Types of Data Splits


1.Training Set:
1. Purpose: Used to train the machine learning model.
2. Description: The model learns patterns and relationships from this subset of data.
3. Typical Size: Generally comprises 60-80% of the total dataset, depending on the overall size and complexity of the
data.
2.Validation Set:
1. Purpose: Used to tune model hyperparameters and make decisions about model architecture.
2. Description: This subset helps evaluate model performance during training without using the test set.
3. Typical Size: Usually about 10-20% of the total dataset. It is especially important in scenarios where
hyperparameter tuning is critical.
3.Test Set:
1. Purpose: Used to assess the final performance of the model.
2. Description: The model is evaluated on this unseen data to determine its accuracy and generalization ability.
3. Typical Size: Generally 10-20% of the total dataset.
Common Data Splitting Techniques
1.Random Split:
• Method: Randomly divides the dataset into training, validation, and test sets.
• Advantages: Simple to implement and often effective if the dataset is large enough and well-shuffled.
• Disadvantages: May not maintain the distribution of certain classes, especially in imbalanced datasets.
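A random split can be sketched in a few lines: shuffle the rows, then cut the shuffled list at the chosen proportions. The 70/15/15 proportions and the seed are illustrative assumptions within the typical ranges given above.

```python
import random

def random_split(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle the rows and cut them into train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains
    return train, val, test

train, val, test = random_split(range(100))
print(len(train), len(val), len(test))
```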
2.Stratified Split:
• Method: Ensures that each class is represented in the same proportion in all splits.
• Advantages: Maintains the distribution of classes, which is particularly useful for imbalanced datasets.
• Disadvantages: Slightly more complex than a random split but crucial for preserving class distributions.
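A stratified split can be sketched by splitting each class separately with the same fraction, so that every split keeps the original class proportions. The labels, the 20% test fraction, and the seed below are invented for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_indices, test_indices) with the same class mix in both."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_frac)  # same fraction from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical imbalanced dataset: 10% spam, 90% ham.
labels = ["spam"] * 10 + ["ham"] * 90
train_idx, test_idx = stratified_split(labels)
spam_in_test = sum(labels[i] == "spam" for i in test_idx)
print(len(train_idx), len(test_idx), spam_in_test)
```

A purely random 20% sample of this dataset could easily contain zero or four spam examples; splitting per class fixes the count at two.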
3.K-Fold Cross-Validation:
• Method: The dataset is divided into k equal parts (folds). The model is trained on k−1 folds and validated on the
remaining fold. This process is repeated k times.
• Advantages: Provides a more reliable estimate of model performance by using all data for both training and validation.
• Disadvantages: Computationally expensive, especially for large datasets and complex models.
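The k-fold procedure can be sketched as a generator of index lists: each fold serves once as the validation set while the remaining k−1 folds form the training set. This sketch assumes the data has already been shuffled and only produces indices; the model training itself is left out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) once for each of the k folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        training = [i for i in range(n_samples) if i < start or i >= stop]
        yield training, validation
        start = stop

folds = list(k_fold_indices(n_samples=10, k=5))
for training, validation in folds:
    print(validation, training)
```

Setting k equal to the number of samples makes every validation fold a single data point.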
4.Leave-One-Out Cross-Validation (LOOCV):
• Method: A specific case of k-fold cross-validation where k is equal to the number of data points. Each data point is used
as a test set once, while the rest are used for training.
• Advantages: Maximizes training data for each iteration.
• Disadvantages: Very computationally intensive for larger datasets.
5.Time Series Split:
• Method: For time-dependent data, the training set consists of earlier observations, while the validation and test sets
consist of later observations.
• Advantages: Maintains the temporal order of data, which is crucial for time series forecasting.
• Disadvantages: Cannot use random splits, as it would break the temporal sequence.
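For time-ordered data the split is a matter of slicing without shuffling, so that every validation and test observation comes after all training observations. The sales figures and the 60/20/20 proportions are invented for this sketch.

```python
def time_series_split(observations, train_frac=0.6, val_frac=0.2):
    """Split chronologically ordered data into train, validation, and test parts."""
    n = len(observations)
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    train = observations[:train_end]        # earliest observations
    val = observations[train_end:val_end]   # next block in time
    test = observations[val_end:]           # most recent observations
    return train, val, test

# Hypothetical monthly sales, oldest first.
monthly_sales = [100, 104, 110, 108, 115, 121, 119, 125, 130, 128]
train, val, test = time_series_split(monthly_sales)
print(train, val, test)
```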
