
Unit II: Regression

What is Regression? Explain types of Regressions.


1. What is Regression?

Regression is a supervised learning technique in machine learning used to predict a continuous numerical value (quantity) based on one or more input features.

 Goal: Find the relationship between a dependent variable (target) and independent
variables (predictors).
 Example: Predicting a house price using features like size, location, and number of
rooms.

Key Terms:

1. Dependent Variable (Target) – The value we want to predict (e.g., house price).
2. Independent Variables (Features) – Input factors affecting the prediction (e.g.,
locality, rooms).

Need for Regression:

 Price prediction (houses, stocks, etc.)


 Trend forecasting (sales, demand)
 Risk analysis (medical or financial risk)
 Decision-making based on patterns

2. Types of Regression

There are several types, but the main ones covered in your syllabus are:

A. Linear Regression

 Definition: Models the relationship between dependent and independent variables with a straight-line equation.
 Formula:

y = b0 + b1x

where b0 = intercept and b1 = slope.

 Types:
1. Simple Linear Regression:
 One independent variable.
 Example: Predicting marks based on study hours.
2. Multiple Linear Regression:
 Two or more independent variables.
 Example: Predicting house price using size, location, and number of
bedrooms.
 Advantages: Easy to interpret, works well for linear data.
 Limitations: Cannot model non-linear relationships well.
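
As a rough illustration (not from the syllabus), a minimal scikit-learn sketch of simple and multiple linear regression; the numbers below are made-up toy data:

```python
# Hedged sketch: fitting simple and multiple linear regression with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

# Simple linear regression: marks vs study hours (toy data).
hours = np.array([[1], [2], [3], [4], [5]])
marks = np.array([35, 45, 50, 62, 70])
simple = LinearRegression().fit(hours, marks)
print("intercept b0:", simple.intercept_, "slope b1:", simple.coef_[0])

# Multiple linear regression: house price from size (sq.ft) and rooms (toy data, in lakhs).
X = np.array([[800, 2], [1000, 2], [1200, 3], [1500, 4]])
price = np.array([40, 50, 62, 80])
multi = LinearRegression().fit(X, price)
print("predicted price for 1100 sq.ft, 3 rooms:", multi.predict([[1100, 3]])[0])
```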

B. Non-Linear Regression

 Definition: Models situations where the relationship between variables is not a straight line.
 Formula: Could be polynomial, exponential, logarithmic, etc.
 Example: Population growth, disease spread curves.
 Advantages: Can handle complex patterns.
 Limitations: More complex, harder to interpret, may require iterative methods.

C. Polynomial Regression

 Definition: A special case of non-linear regression where the model is a polynomial of the independent variable(s).
 Formula:

y = b0 + b1x + b2x^2 + ... + bnx^n

 Example: Predicting traffic flow across different times of the day.


 Advantage: Fits curves better than linear regression.
 Limitation: Risk of overfitting if degree is too high.
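
A small sketch of polynomial regression using a scikit-learn pipeline; the degree and the traffic data are illustrative assumptions:

```python
# Hedged sketch: degree-2 polynomial regression with a scikit-learn pipeline.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

hour_of_day = np.arange(0, 24).reshape(-1, 1)                       # input feature
traffic = 400 - 2.0 * (hour_of_day.ravel() - 12) ** 2 \
          + np.random.randn(24) * 10                                # toy curved target

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(hour_of_day, traffic)
print("predicted traffic at 8 AM:", model.predict([[8]])[0])
```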

D. Stepwise Regression

 Definition: Iteratively adds or removes variables to find the most relevant predictors.
 Types:
1. Forward Selection – Start with no variables, add one by one.
2. Backward Elimination – Start with all variables, remove the least useful
ones.
 Advantages: Reduces complexity, focuses on important variables.
 Limitations: Can lead to overfitting, may miss the best combination of features.
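
Classic stepwise regression is not a single built-in scikit-learn model, but SequentialFeatureSelector performs a comparable forward selection / backward elimination; a sketch using an assumed example dataset:

```python
# Hedged sketch: forward and backward feature selection around a linear model.
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# Forward selection: start empty, add features one by one.
forward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                    direction="forward").fit(X, y)
# Backward elimination: start with all features, drop the least useful.
backward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                     direction="backward").fit(X, y)
print("forward keeps columns:", forward.get_support(indices=True))
print("backward keeps columns:", backward.get_support(indices=True))
```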

E. Decision Tree Regression

 Definition: Uses a tree-like model to split data into smaller groups based on feature
values, predicting the average of the group.
 Advantages: Easy to interpret, handles non-linear data.
 Limitations: Can overfit, unstable to small changes in data.
F. Random Forest Regression

 Definition: An ensemble method that combines many decision trees to improve accuracy.
 Advantages: High accuracy, handles missing data, less overfitting.
 Limitations: More complex, less interpretable than a single tree.

✅ Summary Table:

Type | Relationship Shape | Complexity | Handles Non-Linear Data? | Example
Simple Linear | Straight line | Low | ❌ No | Marks vs Study Hours
Multiple Linear | Straight plane | Medium | ❌ No | House Price
Polynomial | Curved line | Medium | ✅ Yes | Traffic Flow
Stepwise | Variable | Medium | ✅ Sometimes | Feature Selection
Decision Tree | Piecewise splits | Medium | ✅ Yes | Salary Prediction
Random Forest | Many trees | High | ✅ Yes | Stock Price
Differentiate multivariate regression and univariate regression.
Aspect | Univariate Regression | Multivariate Regression
Number of Variables Considered | Deals with one dependent variable and one independent variable (Simple Linear Regression), or one dependent variable and multiple independent variables (Multiple Regression is still univariate if there is only one dependent variable). | Deals with more than one dependent variable and multiple independent variables.
Purpose | Studies the relationship between a single dependent variable and predictors. | Studies relationships among multiple dependent variables simultaneously.
Complexity | Less complex, easier to visualize and interpret. | More complex, requires advanced statistical techniques.
Equation Form | y = b0 + b1x (or extended for multiple predictors, but still one y). | Multiple equations, one for each dependent variable, e.g., y1 = b01 + b11x1 + ... and y2 = b02 + b12x1 + ...
Output | Predicts one output value. | Predicts multiple output values at once.
Example | Predicting a student's marks based on study hours. | Predicting height and weight of a person based on age, diet, and exercise.

In Short:
 Univariate regression → 1 dependent variable
 Multivariate regression → 2 or more dependent variables

Explain Bias-Variance Trade-off with respect to Machine Learning.


1. What is Bias?

 Definition: The error caused by wrong assumptions in the learning algorithm.


 High Bias → Underfitting
o Model is too simple.
o Misses important patterns in the data.
o Performs poorly on both training and test data.
 Example: Trying to fit a straight line to curved data.

2. What is Variance?

 Definition: The error caused by model sensitivity to small changes in the training
data.
 High Variance → Overfitting
o Model is too complex.
o Fits noise as well as actual patterns.
o Performs well on training data but poorly on new data.
 Example: Very deep decision tree memorizing the training set.

3. Bias–Variance Trade-off

 Definition: The balance between bias and variance to achieve the best generalization
on unseen data.
 Goal: Find the "sweet spot" where total error is minimal.
 Reason for Trade-off:
o If a model is too simple → High bias, low variance → Underfits.
o If a model is too complex → Low bias, high variance → Overfits.
o We need a model that’s just complex enough to capture patterns without
memorizing noise.

4. Graphical Understanding

Imagine a curve showing:

 Bias decreases as model complexity increases.


 Variance increases as model complexity increases.
 Total error is minimized at a middle point → This is the ideal trade-off.
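
One way to see the trade-off numerically is to fit a too-simple and a too-complex model on the same noisy curved data and compare training and test error; a rough sketch with assumed degrees and toy data:

```python
# Hedged sketch: train vs test error for a too-simple and a too-complex model.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)    # curved data + noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 15):   # degree 1 tends to underfit, degree 15 tends to overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(degree,
          "train MSE:", mean_squared_error(y_tr, model.predict(X_tr)),
          "test MSE:", mean_squared_error(y_te, model.predict(X_te)))
```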

5. Summary Table

Aspect | Low Bias & High Variance | High Bias & Low Variance
Model Complexity | Too complex | Too simple
Error on Training Data | Low | High
Error on Test Data | High | High
Problem Type | Overfitting | Underfitting
Example | High-degree polynomial curve | Straight line for curved data

✅ Key Tip for Exams:


Think of it like Goldilocks’ porridge:

 Too simple → underfit (high bias).


 Too complex → overfit (high variance).
 Just right → good trade-off, best performance.
Differentiate Ridge and Lasso Regression techniques.
1. Basic Idea

Both Ridge and Lasso are regularization techniques used in regression to:

 Reduce overfitting
 Improve model generalization
 Work by adding a penalty term to the regression equation

3. Summary in Simple Words

 Ridge → "Shrink but don’t delete" coefficients.


 Lasso → "Shrink and sometimes delete" coefficients.

4. Quick Example

Imagine predicting house prices with 100 features:

 Ridge will keep all features but reduce the importance of less useful ones.
 Lasso will completely remove irrelevant features and keep only the most important
ones.
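
A small sketch that makes the difference visible: with the same data, Ridge keeps every coefficient non-zero while Lasso drives many exactly to zero (the alpha values and toy dataset are assumptions):

```python
# Hedged sketch: Ridge shrinks coefficients, Lasso shrinks and zeroes them out.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# Toy data: 100 features but only 10 actually informative.
X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("Ridge zero coefficients:", sum(c == 0 for c in ridge.coef_))   # typically 0
print("Lasso zero coefficients:", sum(c == 0 for c in lasso.coef_))   # typically many
```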
Explain three evaluation metrics used for a regression model.
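
Three commonly used metrics are Mean Absolute Error (MAE), Mean Squared Error (MSE, often reported as RMSE), and the R² score; a minimal sketch computing them with scikit-learn on made-up values:

```python
# Hedged sketch: MAE, MSE/RMSE and R-squared on toy predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([50, 52, 55, 60])     # actual house prices (lakhs), toy values
y_pred = np.array([48, 53, 54, 63])     # model predictions, toy values

mae = mean_absolute_error(y_true, y_pred)   # average absolute error
mse = mean_squared_error(y_true, y_pred)    # average squared error (penalizes big misses)
rmse = np.sqrt(mse)                         # back in the original units
r2 = r2_score(y_true, y_pred)               # fraction of variance explained (1.0 is perfect)
print(mae, mse, rmse, r2)
```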
Explain Random Forest Regression in detail.
1. What is Random Forest Regression?

 Definition: A machine learning algorithm that predicts continuous numerical values by combining results from multiple decision trees (an ensemble method).
 Idea: Instead of relying on one decision tree (which might overfit), build many trees
and average their predictions.
 Type: Supervised learning algorithm.

2. How It Works

Random Forest builds multiple decision trees in four main steps:

1. Bootstrap Sampling (Bagging)


o Randomly select samples with replacement from the dataset to train each
tree.
o Ensures each tree gets slightly different data.
2. Feature Sampling
o At each split in a tree, only a random subset of features is considered.
o Helps make trees diverse and less correlated.
3. Tree Building
o Each tree is grown independently using its sampled data and features.
o Uses Mean Squared Error (MSE) as splitting criterion for regression tasks.
4. Prediction Aggregation
o For regression, predictions from all trees are averaged to get the final output.

3. Example

Suppose we want to predict house price:

 Tree 1 predicts ₹52 lakh


 Tree 2 predicts ₹50 lakh
 Tree 3 predicts ₹55 lakh
 Final Prediction = (52 + 50 + 55) / 3 = ₹52.33 lakh
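
A minimal RandomForestRegressor sketch mirroring this averaging idea; the number of trees and the toy dataset are assumptions:

```python
# Hedged sketch: random forest regression = average of many bootstrapped trees.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(
    n_estimators=100,     # number of trees to average
    max_features="sqrt",  # random subset of features considered at each split
    random_state=0,
).fit(X_tr, y_tr)

print("R^2 on test data:", forest.score(X_te, y_te))
print("feature importances:", forest.feature_importances_.round(2))
```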

4. Advantages

 High Accuracy: Averaging multiple trees reduces error.


 Handles Non-linear Relationships: Works well with complex patterns.
 Robustness: Less affected by noise or missing values.
 Feature Importance: Can tell which features impact predictions most.
 Less Overfitting: Bagging and feature sampling reduce variance.

5. Disadvantages

 Complexity: More difficult to interpret compared to a single tree.


 Computation Time: Slower to train and predict if there are many trees.
 Memory Usage: Requires storing multiple trees in memory.

6. When to Use

 Large datasets with many features.


 Problems with non-linear or complex relationships.
 When avoiding overfitting is important.

✅ Quick Summary Table

Aspect | Random Forest Regression
Type | Ensemble (Bagging)
Base Learner | Decision Tree
Output | Average of tree outputs
Strength | High accuracy, robust
Weakness | Less interpretable, slower

Differentiate between Regression and Correlation.


Aspect | Correlation | Regression
Meaning | Measures the strength and direction of the relationship between two variables. | Models the relationship between dependent and independent variables to make predictions.
Purpose | To see if variables are related and how strongly. | To predict the value of a dependent variable based on one or more independent variables.
Output | A single value (correlation coefficient, e.g., Pearson's r) between -1 and +1. | An equation that describes the relationship, e.g., y = b0 + b1x.
Direction of Relationship | Shows positive, negative, or no correlation. | Shows how much the dependent variable changes when the independent variable changes.
Prediction | ❌ Cannot be used for prediction. | ✅ Can be used for prediction.
Causation | Does not imply causation. | Does not prove causation, but can help investigate possible causal effects.
Mathematical Expression | Single coefficient r. | Equation with coefficients (slope, intercept).
Example | Correlation between ice cream sales and temperature. | Predicting house price based on size and location.

✅ Key Tip to Remember:

 Correlation → "Are they related?" (strength & direction only)


 Regression → "How are they related?" + "Can we predict?"
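
A tiny numeric sketch of the difference: correlation returns one number, regression returns an equation you can predict with (toy data assumed):

```python
# Hedged sketch: correlation coefficient vs a fitted regression line.
import numpy as np

temperature = np.array([20, 24, 28, 32, 36])          # toy data
ice_cream_sales = np.array([150, 190, 260, 310, 380])

r = np.corrcoef(temperature, ice_cream_sales)[0, 1]   # single number in [-1, +1]
b1, b0 = np.polyfit(temperature, ice_cream_sales, 1)  # slope and intercept of y = b0 + b1*x

print("correlation r:", round(r, 3))
print("prediction at 30 degrees:", b0 + b1 * 30)      # regression can predict, r cannot
```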

What is underfitting and overfitting in machine learning? Explain the techniques to reduce overfitting.
1. Underfitting

Definition:

 Happens when a model is too simple to capture the underlying patterns in data.
 Performs poorly on both training data and test data.
Causes:

 Model complexity is too low.


 Not enough training time (early stopping too soon).
 Missing important features in the dataset.
 Incorrect assumptions (e.g., using linear regression for non-linear data).

Characteristics:

 High Bias, Low Variance.


 Predictions are inaccurate even on training data.

Example:

 Using a straight line (linear model) to fit a dataset with a clear curve.

2. Overfitting

Definition:

 Happens when a model memorizes the training data, including noise and outliers.
 Performs well on training data but poorly on unseen (test) data.

Causes:

 Model complexity is too high.


 Too many features without proper regularization.
 Training for too many epochs without monitoring performance.
 Small dataset with high model capacity.

Characteristics:

 Low Bias, High Variance.


 Training error is low, but test error is high.

Example:

 Very deep decision tree fitting every point in training data, including noise.

3. Bias–Variance View

 Underfitting → High Bias, Low Variance.


 Overfitting → Low Bias, High Variance.
 Goal: Find the right bias–variance trade-off for best generalization.
4. Techniques to Reduce Overfitting

Here are the main methods used in practice:

A. Simplify the Model

 Reduce the number of features (Feature Selection).


 Use fewer parameters.

B. Regularization

 Add penalty terms to control coefficient size:


o Ridge Regression (L2 penalty)
o Lasso Regression (L1 penalty)
o Elastic Net (combination of L1 & L2).

C. Cross-Validation

 Use k-fold cross-validation to check performance on different subsets of data and prevent reliance on a single train/test split.
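
A short sketch of 5-fold cross-validation in scikit-learn; the model and dataset are illustrative assumptions:

```python
# Hedged sketch: 5-fold cross-validation instead of a single train/test split.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print("R^2 per fold:", scores.round(3))
print("mean R^2:", scores.mean().round(3))   # more reliable than one lucky/unlucky split
```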

D. Early Stopping

 Stop training when validation error starts increasing, even if training error is
decreasing.

E. Pruning (in Decision Trees)

 Remove unnecessary branches to simplify the tree.
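
In scikit-learn, pruning a regression tree can be done with cost-complexity pruning (ccp_alpha); a brief sketch where the alpha value is picked from the pruning path purely for illustration:

```python
# Hedged sketch: pruning a regression tree via cost-complexity pruning.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)      # unpruned tree
path = full.cost_complexity_pruning_path(X_tr, y_tr)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]                # mid-range alpha, illustrative only
pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)

print("leaves:", full.get_n_leaves(), "->", pruned.get_n_leaves())
print("test R^2:", full.score(X_te, y_te), "->", pruned.score(X_te, y_te))
```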

F. Dropout (in Neural Networks)

 Randomly drop some neurons during training to prevent over-dependence on certain paths.

G. Data Augmentation

 Create more training data artificially (especially for images, text) by transformations
like rotation, flipping, cropping, etc.

H. Increase Training Data

 More diverse training samples help the model generalize better.

5. Quick Comparison Table

Feature | Underfitting | Overfitting
Model Complexity | Too simple | Too complex
Bias | High | Low
Variance | Low | High
Training Error | High | Low
Test Error | High | High
Fix | Increase complexity | Reduce complexity / Regularize

Explain Elastic Net regression in Machine Learning.


1. What is Elastic Net Regression?

 Definition:
Elastic Net is a regularization technique that combines Ridge Regression (L2
penalty) and Lasso Regression (L1 penalty) into a single model.
 Purpose:
To handle limitations of both Ridge and Lasso and work well when:
o There are many correlated features.
o We need both feature selection and coefficient shrinkage.

3. How It Works

 L1 (Lasso) part → forces some coefficients to exactly zero (feature selection).


 L2 (Ridge) part → shrinks remaining coefficients smoothly (reduces variance).
 This combination helps when:
o Some features are irrelevant.
o Some features are highly correlated.
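
A minimal ElasticNet sketch: in scikit-learn the L1/L2 mix is controlled by l1_ratio and the overall strength by alpha; the values and toy data below are assumptions (in practice they are tuned, e.g. with ElasticNetCV):

```python
# Hedged sketch: Elastic Net = L1 + L2 penalty in one model.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Toy data with many correlated, mostly irrelevant features.
X, y = make_regression(n_samples=200, n_features=50, n_informative=8,
                       effective_rank=10, noise=10.0, random_state=0)

enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)   # 50% L1, 50% L2
print("coefficients set exactly to zero:", sum(c == 0 for c in enet.coef_))
print("largest surviving coefficient:", abs(enet.coef_).max().round(2))
```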
4. Advantages

 Handles multicollinearity (like Ridge).


 Performs feature selection (like Lasso).
 Works well when:
o Number of predictors > Number of observations.
o Features are highly correlated.
 More stable than Lasso when predictors are correlated.

5. Disadvantages

 Slightly more complex to tune because we have two parameters (λ and α).
 Requires careful cross-validation to find best values.

6. Example Use Case

 Genomics: Thousands of gene features, many correlated, but only some relevant for
predicting a disease risk.
 Finance: Predicting stock returns where many economic indicators are correlated.

7. Quick Summary Table

Feature | Ridge | Lasso | Elastic Net
Penalty | L2 | L1 | L1 + L2
Feature Selection | No | Yes | Yes
Handles Multicollinearity | Yes | No | Yes
Coefficient Shrinking | Yes | Yes (some to zero) | Yes
Best When | All features useful | Few features important | Many correlated features & need selection
