Linear Regression: A Foundational ML Algorithm
NAME: RAJU SINGH
YEAR: 4th
SEMESTER: 7th
SUBJECT: MACHINE LEARNING (PEC-CS701E)
STREAM: COMPUTER SCIENCE AND ENGINEERING
TOPIC: LINEAR REGRESSION
ROLL NO.: 27900122020
REGD NO.: 222790110019
Introduction
Demystifying Machine Learning
Machine Learning (ML) is a subset of Artificial Intelligence that enables systems to learn from data, identify patterns, and make
decisions with minimal human intervention. It's about training algorithms to improve their performance over time, without being
explicitly programmed for every task.
• Supervised Learning: Algorithms learn from labeled data, where the output is known. Examples include classification (predicting categories) and regression (predicting continuous values).
• Unsupervised Learning: Algorithms discover patterns in unlabeled data. Clustering (grouping similar data points) and dimensionality reduction are common applications.
• Reinforcement Learning: Agents learn by interacting with an environment, receiving rewards or penalties for actions. This is often used in robotics and game playing.
Linear regression falls under the umbrella of supervised learning, serving as one of the most fundamental and widely used
algorithms for predicting continuous outcomes.
Core Concept
What is Linear Regression?
Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. Essentially, it helps us understand how a change in the independent variable(s) influences the dependent variable.
At its core, linear regression assumes a linear relationship between the input features and the target variable. The algorithm aims to find the best-fitting straight line (or hyperplane in multiple dimensions) that minimizes the overall distance, typically the sum of squared vertical distances, between the observed data points and the line itself.
The most basic form is Simple Linear Regression, which involves only one independent variable. For
instance, predicting house price (dependent) based solely on its size (independent). When multiple
independent variables are used, it's called Multiple Linear Regression.
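To make this concrete, here is a minimal sketch of simple linear regression with scikit-learn; the house sizes and prices below are made-up values purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: house size in square feet vs. sale price
sizes = np.array([[800], [1200], [1500], [2000], [2500]])
prices = np.array([150000, 210000, 250000, 320000, 390000])

model = LinearRegression().fit(sizes, prices)
print(model.coef_[0], model.intercept_)   # learned slope (m) and intercept (b)
print(model.predict([[1800]]))            # estimated price for an 1800 sq ft house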
Practical Applications
Real-World Impact of Linear Regression
Linear regression is a versatile tool with widespread applications across various industries due to its simplicity and interpretability.
• Predicting Housing Prices: Using features like square footage, number of bedrooms, and location to estimate a property's market value. This is crucial for real estate agents and buyers.
• Forecasting Sales: Predicting future sales volumes based on advertising spend, past sales data, and economic indicators, helping businesses optimize inventory and marketing strategies.
• Risk Assessment: In finance, assessing credit risk by predicting the likelihood of loan defaults based on a borrower's income, credit score, and debt history.
• Medical Diagnosis: Predicting blood pressure based on age and weight, or dosage requirements for medication based on patient characteristics.
Its importance lies in providing a clear, quantifiable relationship between variables, making it an excellent starting point for many
predictive modeling tasks.
The Math Behind the Model
The Linear Regression Equation
The core of linear regression lies in its mathematical equation, which defines the linear relationship between variables.
Simple Linear Regression Equation: y = mx + b
• y: The dependent variable (the value we want to predict).
• m: The slope of the line, representing how much y changes for a unit change in x. This is also known as the coefficient.
• x: The independent variable (the feature used for prediction).
• b: The y-intercept, representing the value of y when x is zero.
Multiple Linear Regression Equation: y = w_0 + w_1x_1 + w_2x_2 + \dots + w_nx_n
• y: The dependent variable.
• w_0: The y-intercept.
• w_1, w_2, \dots, w_n: The coefficients for each independent variable (features). These weights determine the impact of each feature on the prediction.
• x_1, x_2, \dots, x_n: The independent variables (features).
The goal of the linear regression algorithm is to find the optimal values for these coefficients (m and b, or w_0 through w_n) that best fit the data.
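For a quick sense of what "optimal" means here, these coefficients can be recovered with an off-the-shelf least-squares solver; the sketch below uses NumPy on a small, made-up dataset:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])         # roughly y = 2x + 1, with noise

X = np.column_stack([x, np.ones_like(x)])  # design matrix: [x, 1]
(m, b), *_ = np.linalg.lstsq(X, y, rcond=None)
print(m, b)                                # best-fit slope and intercept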
Key Requirements
Assumptions of Linear Regression
For linear regression to provide reliable and accurate results, several key assumptions about the data and model must be met.
Violating these assumptions can lead to misleading conclusions.
• Linearity: There should be a linear relationship between the independent variable(s) and the dependent variable. If the relationship is not linear, the model will not accurately capture the trend.
• Independence of Observations: Each observation (data point) should be independent of every other observation. This means there is no correlation between the residuals (errors) of consecutive observations.
• Homoscedasticity: The variance of the residuals should be constant across all levels of the independent variables. In simpler terms, the spread of the data points around the regression line should be roughly the same along the entire range of predictions.
• Normality of Errors: The residuals (the differences between observed and predicted values) should be normally distributed. This assumption is particularly important for constructing confidence intervals and performing hypothesis tests.
Checking these assumptions is a crucial step in the model building process to ensure the validity of the linear regression results.
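One common way to check these assumptions in practice is to inspect the residuals; a rough sketch (the y_true and y_pred arrays here are hypothetical) might look like this:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Hypothetical observed and predicted values
y_true = np.array([3.0, 5.1, 6.9, 9.2, 11.0])
y_pred = np.array([3.2, 5.0, 7.0, 9.0, 10.8])
residuals = y_true - y_pred

# Residuals vs. predictions: a flat, evenly spread band around zero
# suggests linearity and homoscedasticity hold
plt.scatter(y_pred, residuals)
plt.axhline(0, color="red")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()

# Shapiro-Wilk test for normality of the errors
stat, p_value = stats.shapiro(residuals)
print(p_value)   # a very small p-value suggests non-normal residuals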
Model Optimization
Training the Linear Regression Model
Training a linear regression model involves finding the optimal coefficients that result in the best-fitting line.
This is achieved by minimizing a cost function.
Loss Function: Mean Squared Error (MSE)
MSE is the most common loss function for linear regression. It calculates the average of the squared differences between the predicted values and the actual values. Squaring the errors ensures that positive and negative errors don't cancel out, and it penalizes larger errors more heavily.
MSE = (1/N) \sum_{i=1}^{N} (y_i - ŷ_i)^2
• y_i: Actual value
• ŷ_i: Predicted value
• N: Number of data points
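Expressed in code, the formula is a one-liner; a minimal NumPy sketch:

import numpy as np

def mse(y_true, y_pred):
    """Average of the squared residuals between actual and predicted values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

print(mse([3.0, 5.0, 7.0], [2.5, 5.5, 7.0]))  # 0.1666...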
Optimization: Gradient Descent
Gradient Descent is an iterative optimization algorithm used to find the minimum of the MSE function. It works by taking small steps in the direction of the steepest descent of the cost function until a minimum is reached. Each step updates the model's coefficients (m and b).
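The sketch below implements this loop from scratch for the simple one-feature case; the learning rate and epoch count are arbitrary choices for illustration:

import numpy as np

def gradient_descent(x, y, lr=0.05, epochs=2000):
    """Fit y ≈ m*x + b by repeatedly stepping down the MSE gradient."""
    m, b = 0.0, 0.0                               # start from an arbitrary line
    n = len(x)
    for _ in range(epochs):
        y_pred = m * x + b
        dm = (-2 / n) * np.sum(x * (y - y_pred))  # ∂MSE/∂m
        db = (-2 / n) * np.sum(y - y_pred)        # ∂MSE/∂b
        m -= lr * dm                              # step opposite the gradient
        b -= lr * db
    return m, b

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0                                 # data generated from y = 2x + 1
print(gradient_descent(x, y))                     # converges to m ≈ 2, b ≈ 1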
Assessing Performance
Evaluating the Linear Regression Model
After training, it's crucial to evaluate how well the model performs on unseen data. Various metrics help us understand its accuracy
and reliability.
• R-squared (R²), the Coefficient of Determination: Represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s). A higher R² (closer to 1) indicates a better fit.
• Mean Absolute Error (MAE): The average of the absolute differences between predictions and actual values. It provides a straightforward measure of prediction error in the same units as the target variable.
• Mean Squared Error (MSE): The average of the squared differences between predictions and actual values. It penalizes larger errors more heavily, making it sensitive to outliers.
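All three metrics are available off the shelf in scikit-learn; a short sketch on hypothetical values:

import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

y_true = np.array([150.0, 210.0, 250.0, 320.0, 390.0])
y_pred = np.array([160.0, 200.0, 255.0, 310.0, 395.0])

print(r2_score(y_true, y_pred))              # closer to 1 is better
print(mean_absolute_error(y_true, y_pred))   # same units as the target
print(mean_squared_error(y_true, y_pred))    # penalizes large errors more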
Understanding these metrics also helps identify issues like overfitting (model performs well on training data but poorly on new
data) or underfitting (model is too simple and doesn't capture the underlying trend). A well-evaluated model balances complexity
and generalizability.
Strategic Considerations
Applications & Limitations
While powerful, linear regression isn't a one-size-fits-all solution. Knowing its strengths and weaknesses is key to its effective application.
Where Linear Regression Excels:
• Simplicity and Interpretability: Easy to understand and explain the relationship between variables.
• Fast Computation: Quick to train, even on large datasets.
• Good Baseline: Often serves as a strong baseline model against which more complex algorithms can be compared.
• When Linear Relationships Exist: Highly effective when the relationship between variables is genuinely linear.
Limitations to Consider:
• Non-Linear Data: Performs poorly when the true relationship between variables is non-linear, as it cannot capture complex patterns.
• Outlier Sensitivity: Highly sensitive to outliers, which can drastically skew the regression line and coefficients (see the sketch after this list).
• Multicollinearity: Struggles when independent variables are highly correlated with each other, making it difficult to determine the individual impact of each feature.
• Assumption Violations: Performance degrades if the statistical assumptions (linearity, homoscedasticity, normality of errors) are violated.
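To illustrate the outlier point, the sketch below fits the same made-up data twice, once with a single corrupted value; the exact numbers are arbitrary:

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.arange(1.0, 9.0).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0                    # clean data on the line y = 2x + 1

clean = LinearRegression().fit(x, y)
y_outlier = y.copy()
y_outlier[-1] += 40.0                        # inject one extreme outlier
skewed = LinearRegression().fit(x, y_outlier)

print(clean.coef_[0], clean.intercept_)      # ≈ 2.0, 1.0
print(skewed.coef_[0], skewed.intercept_)    # noticeably pulled toward the outlier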
Understanding these aspects allows practitioners to choose linear regression appropriately or identify when more advanced models are
required.
Conclusion: The Power of Simplicity
Linear regression stands as a cornerstone in the field of machine learning. Its simplicity, interpretability, and effectiveness for linearly related data make it an
invaluable tool for prediction and understanding relationships within datasets.
1. Predictive Modeling: A powerful tool for forecasting continuous values and identifying trends.
2. Interpretability: Provides clear insights into the influence of independent variables.
3. Foundation for ML: Understanding it is crucial for grasping more complex algorithms.
Further Reading & References:
• Scikit-learn Documentation: Linear Models
• The Elements of Statistical Learning (Hastie, Tibshirani, Friedman)
• Python Data Science Handbook (Jake VanderPlas)