Regression Analysis
Foundation Skills Academy
Index
1. Introduction to Regression Analysis
2. Types of Regression Analysis
3. Linear Regression Model
4. Example: Simple Linear Regression
5. Example: Multiple Linear Regression
6. Logistic Regression Model
7. Example: Logistic Regression
Introduction to Regression Analysis
Regression analysis is a statistical method for understanding and quantifying the relationship between two or more variables.
It helps a business estimate one dependent variable based on the values of one or more independent variables.
Dependent Variable: The dependent variable is essentially the "outcome" you’re trying to understand or predict. It’s the
focus of your study, whether you’re looking at quarterly sales figures, customer satisfaction ratings, or any other key result.
Independent Variable: Independent variables are the "factors" that might influence or cause changes in the dependent
variable. These are the variables you manipulate or observe to see their impact on your outcome of interest. For example, if
you adjust the price of a product, that price change is an independent variable that could affect sales figures.
Output of Regression Analysis
Data Analysis – Types of Regression Analysis
Simple Linear Regression: Simple linear regression is used when a single independent variable predicts a dependent
variable. The linear regression formula is represented as Y = a + bX, where Y is the dependent variable, X is the independent
variable, a is the intercept (the value of Y when X = 0), and b is the slope, also called the coefficient (the change in Y for a unit change in X).
Business Application: It's frequently used to identify how a change in one variable will affect another. For example, predicting
sales based on advertising expenditure or estimating employee productivity based on hours worked.
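As an illustrative sketch (the advertising/sales numbers below are made up, not from the slides), the intercept a and slope b of Y = a + bX can be computed directly in Python with the least-squares closed form:

```python
# Simple linear regression: fit Y = a + bX by least squares (closed form).
def fit_simple_linear(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # b = covariance(X, Y) / variance(X)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x  # intercept: value of Y when X = 0
    return a, b

# Hypothetical advertising spend (in $10k) vs. sales (in $10k)
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 4.0, 6.2, 8.1, 9.9]
a, b = fit_simple_linear(spend, sales)  # a ≈ 0.15, b ≈ 1.97
```

Each extra unit of spend is then associated with roughly b additional units of sales under this fitted line.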
Multiple Linear regression: Multiple regression extends linear regression by considering multiple independent variables to
predict the dependent variable. The relationship is represented as Y = a + b₁X₁ + b₂X₂ + ... + bₙXₙ
Business Application: Businesses use it to understand how multiple factors influence outcomes. For instance, predicting
home prices based on features like square footage, number of bedrooms, and neighborhood.
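As a sketch, the coefficients of a multiple linear model can be found by solving the normal equations (XᵀX)β = Xᵀy. The pure-Python example below uses synthetic data generated exactly from Y = 1 + 2X₁ + 3X₂, so the fit should recover those coefficients:

```python
# Multiple linear regression Y = a + b1*X1 + b2*X2 via the normal equations
# (X^T X) beta = X^T y, solved with Gaussian elimination. Data is synthetic.
def fit_multiple_linear(rows, ys):
    X = [[1.0] + list(r) for r in rows]  # prepend 1 for the intercept a
    n, k = len(X), len(X[0])
    XtX = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)]
           for p in range(k)]
    Xty = [sum(X[i][p] * ys[i] for i in range(n)) for p in range(k)]
    # Forward elimination with partial pivoting
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(XtX[r][col]))
        XtX[col], XtX[piv] = XtX[piv], XtX[col]
        Xty[col], Xty[piv] = Xty[piv], Xty[col]
        for r in range(col + 1, k):
            f = XtX[r][col] / XtX[col][col]
            for c in range(col, k):
                XtX[r][c] -= f * XtX[col][c]
            Xty[r] -= f * Xty[col]
    # Back substitution
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):
        beta[r] = (Xty[r] - sum(XtX[r][c] * beta[c]
                                for c in range(r + 1, k))) / XtX[r][r]
    return beta  # [a, b1, b2, ...]

rows = [(1, 1), (2, 1), (1, 2), (3, 2), (2, 3)]
ys = [6.0, 8.0, 9.0, 13.0, 14.0]   # generated from Y = 1 + 2*X1 + 3*X2
beta = fit_multiple_linear(rows, ys)  # beta ≈ [1.0, 2.0, 3.0]
```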
Non-Linear regression: It is used in cases where the relationship between the dependent and independent variables is
nonlinear. The model can take various forms depending on the specific problem. It is generally represented as Y = f(X, θ)
where θ represents the parameters of the nonlinear function f.
Data Analysis – Types of Regression Analysis
Examples of nonlinear regression:
Logistic Regression: Logistic regression is used when the dependent variable is binary (two possible outcomes) or
categorical. It models the probability of a particular outcome occurring.
Business Application: In business, logistic regression is employed for tasks like predicting customer churn (yes/no), whether
a customer will purchase a product (yes/no), or whether a loan applicant will default on a loan (yes/no).
Polynomial Regression: Polynomial regression is used when the relationship between the independent and dependent
variables follows a polynomial curve and is not linear.
Business Application: It can be used to model more complex relationships in data, such as predicting the growth of a plant
based on time and other environmental factors.
Exponential Regression: Exponential regression is a type of nonlinear regression that fits an exponential function to the
data. The general form of an exponential regression model is Y = a·e^(bX).
Power Regression: Power regression is a type of nonlinear regression that fits a power function to the data. The general
form of a power regression model is Y = a·X^b.
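Both models can often be fitted by transforming them into a linear one: taking logs of Y = a·e^(bX) gives ln(Y) = ln(a) + bX, which ordinary least squares can handle. A pure-Python sketch for the exponential case, on synthetic data generated with a = 2 and b = 0.5 (valid only when all Y values are positive):

```python
import math

# Exponential regression Y = a * e^(bX) fitted by log-linearization:
# ln(Y) = ln(a) + bX is linear in X, so least squares applies.
# Requires all Y > 0. Data below is synthetic (a = 2, b = 0.5).
def fit_exponential(xs, ys):
    logs = [math.log(y) for y in ys]
    n = len(xs)
    mx = sum(xs) / n
    ml = sum(logs) / n
    b = sum((x - mx) * (l - ml) for x, l in zip(xs, logs)) / \
        sum((x - mx) ** 2 for x in xs)
    ln_a = ml - b * mx
    return math.exp(ln_a), b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
a, b = fit_exponential(xs, ys)  # recovers a ≈ 2.0, b ≈ 0.5
```

Power regression works the same way after taking logs of both X and Y.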
Data Analysis – Importance of Regression Analysis
Predictive Modeling: Regression analysis is commonly used for predictive modeling. By examining historical data and
identifying relationships between variables, businesses can make informed predictions about sales, demand, etc.
Identifying Key Drivers: Regression analysis can help identify which independent variables significantly impact the
dependent variable. For example, it can determine which marketing channels or advertising strategies influence sales the most.
Optimizing Decision Making: Whether it's optimizing pricing strategies, production processes, or marketing campaigns,
regression can help companies allocate resources efficiently and achieve better outcomes.
Risk Assessment: Businesses are exposed to various risks, such as economic fluctuations, market changes, and
competitive pressures. Regression analysis-powered risk assessment techniques can be used to assess how changes in
independent variables may affect business performance.
Performance Evaluation: Regression analysis can evaluate the effectiveness of different initiatives and strategies. For
instance, it can assess the impact of employee training on productivity or the relationship between customer satisfaction
and repeat purchases.
Market Research: In market research, regression analysis can be used to understand consumer behavior and
preferences. By examining demographics, pricing, and product features, businesses can tailor their products and
marketing efforts to specific target audiences.
How to Perform Regression Analysis?
Data collection and preparation: Gather and clean data, ensuring it meets assumptions like linearity and independence.
Appropriate regression model: Choose the correct type of regression (linear, polynomial, etc.) based on the data and objective.
Data analysis and interpretation: Test regression assumptions, assess model accuracy, and interpret coefficients.
Model evaluation and validation: Test the model's performance using metrics like R-squared and mean squared error.
• p-values and coefficients in regression analysis work together to tell you which relationships in the model are statistically
significant and the nature of those relationships.
• The linear regression coefficients describe the mathematical relationship between each independent variable and the
dependent variable. The p values for the coefficients indicate whether these relationships are statistically significant.
• After fitting a regression model, check the residual plots to be sure that you have unbiased estimates.
• R-squared is a goodness-of-fit measure for linear regression models. This statistic indicates the percentage of the
variance in the dependent variable that the independent variables explain collectively. R-squared measures the strength
of the relationship between your model and the dependent variable on a 0 – 100% scale. For example, an R-squared of
60% reveals that 60% of the variability observed in the target variable is explained by the regression model. Generally, a
higher R-squared indicates more variability is explained by the model.
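R-squared can be computed directly as 1 − SSres / SStot. A small Python sketch with hypothetical actual and fitted values:

```python
# R-squared: share of the variance in y explained by the model's predictions.
def r_squared(y_actual, y_pred):
    mean_y = sum(y_actual) / len(y_actual)
    ss_tot = sum((y - mean_y) ** 2 for y in y_actual)       # total variance
    ss_res = sum((y - p) ** 2 for y, p in zip(y_actual, y_pred))  # unexplained
    return 1 - ss_res / ss_tot

# Hypothetical actual vs. fitted values
actual = [3.0, 5.0, 7.0, 9.0]
fitted = [2.8, 5.1, 7.2, 8.9]
r2 = r_squared(actual, fitted)  # ≈ 0.995, i.e. 99.5% of variance explained
```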
Using software tools: Use Python or R to perform regression analysis efficiently.
Assumptions of Linear Regression Analysis
Linearity: The relationship between the independent and dependent variables is linear.
Sample representativeness: The sample is representative of the population.
Normally distributed errors: The errors are normally distributed.
Homoscedasticity: The variance of the errors (residuals) remains constant across all levels of the independent
variable(s). Put simply, it signifies that the dispersion of residuals stays consistent, enhancing the accuracy and legitimacy
of regression predictions.
No multicollinearity: When independent variables are highly correlated, it becomes challenging to determine their
impact on the dependent variable.
No outliers: There are no outliers in the data.
Simple Linear Regression Example
You are a social researcher interested in the relationship between income and happiness. You survey 500 people whose
incomes range from $15k to $75k and ask them to rank their happiness on a scale from 1 to 10.
Your independent variable (income) and dependent variable (happiness) are both quantitative, so you can do a regression
analysis to see if there is a linear relationship between them.
R code for simple linear regression:
income.happiness.lm <- lm(happiness ~ income, data = income.data)
This code takes the data you have collected (data = income.data) and calculates the effect that the independent variable
income has on the dependent variable happiness using the equation for the linear model: lm().
To view the results of the model, you can use the summary() function in R:
summary(income.happiness.lm)
Note: In linear regression, while the dependent variable must be continuous (e.g., age, weight, temperature), the independent
variables can be either continuous or categorical (e.g., gender, city, type of product, after encoding them as dummy variables).
Simple Linear Regression Example
Results of the Model:
This output table first repeats the formula
that was used to generate the results
(‘Call’), then summarizes the model
residuals (‘Residuals’), which give an idea
of how well the model fits the real data.
Next is the ‘Coefficients’ table. The first
row gives the estimate of the y-intercept,
and the second row gives the regression
coefficient of the model.
happiness = 0.20 + 0.71*income ± 0.018
The number in the table (0.713) tells us
that for every one-unit increase in income
(where one unit of income = $10,000),
there is a corresponding 0.71-unit
increase in reported happiness (where
happiness is measured on a scale of 1 to 10).
Simple Linear Regression Example
Results of the Model:
The Std. Error column shows how much
variation there is in our estimate of the
relationship between income and
happiness.
The t value column displays the test
statistic. The larger the test statistic, the
less likely it is that our results occurred by
chance.
The Pr(>| t |) column shows the p value.
The p-value indicates whether the
independent variable has a significant
influence. p-values smaller than 0.05 (or
sometimes 0.001) are considered
significant.
Because the p value is so low (p < 0.001),
we can conclude that income has a
statistically significant effect on
happiness.
Simple Linear Regression Example
Homoscedasticity - Residual Plots
A residual is a measure of how far away a point is vertically from the regression line. Simply, it is the error between a
predicted value and the observed actual value.
The most important assumption
of a linear regression model is
that the errors are independent
and normally distributed.
A few characteristics of a good
residual plot are as follows:
• It has a high density of
points close to the origin and
a low density of points away
from the origin
• It is symmetric about the
origin
Multiple Linear Regression Analysis Example - Marketing Mix Modeling
Market Mix Modeling (MMM) is a technique that helps quantify the impact of several marketing inputs on sales or
market share. The purpose of using MMM is to understand how much each marketing input contributes to sales, and how
much to spend on each marketing input. Specifically, here are some ways MMM helps businesses thrive:
Optimizing marketing spending: Helps businesses understand which marketing activities contribute most effectively to
achieving business objectives.
Budget allocation: After analyzing the ROI of various marketing channels and tactics, businesses can make more
informed decisions about where to allocate their marketing budget for the greatest yield.
Forecasting and planning: Businesses can simulate the impact of changes in marketing strategies or external factors and
use these insights to anticipate potential outcomes and adjust their plans accordingly.
Understanding customer behavior: Helps businesses understand how different customer segments respond to various
marketing stimuli, enabling more targeted and effective marketing strategies.
Continuous improvement: Monitoring key performance metrics and analyzing trends enables businesses to identify
opportunities for optimization, test new strategies, and adapt to changing market conditions, ensuring that their marketing
efforts remain effective and competitive.
Marketing Mix Modeling
This is a more representative setting, as simple linear regression is hardly used in real-life MMM projects; it is too
simplistic and does not handle the complexity of consumer behavior and the media landscape.
In a typical marketing mix modeling project, multiple variables impact the sales performance. To be able to measure the
impact of those variables on sales or any other chosen KPI, the analyst needs to build a robust model which accounts for
all the variables influencing the movement of sales.
The model takes the form Sales = β₀ + β₁x₁ + β₂x₂ + ... + βₖxₖ + ε, where:
• x₁, x₂, ..., xₖ are the independent variables influencing sales.
• Each term βX represents the contribution of the variable X to sales: i.e., how much sales are driven
by the variable X (incremental impact).
• β₀ is the base level of sales, and ε is the error term.
Marketing Mix Modeling – Contribution Chart
A contribution chart visually represents different marketing tactics’ impact on sales. It shows how much each tactic
contributes to the total sales and highlights the most effective ones.
The chart can be used to benchmark performance, compare campaigns over time, and plan for future initiatives.
Contribution charts provide an easy-to-understand overview of where a marketer’s efforts should be focused to maximize
ROI and optimize campaign performance.
Logistic Regression
Types of Logistic Regression
Binary Logistic Regression: Binary logistic regression is used to predict the probability of a binary outcome, such as yes or
no, true or false, or 0 or 1. For example, it could be used to predict whether a customer will churn or not, whether a patient
has a disease or not, or whether a loan will be repaid or not.
Multinomial Logistic Regression: Multinomial logistic regression is used to predict the probability of one of three or more
possible outcomes, such as the type of product a customer will buy, the rating a customer will give a product, or the political
party a person will vote for.
Ordinal Logistic Regression: It is used to predict the probability of an outcome that falls into a predetermined order, such as
the level of customer satisfaction, the severity of a disease, or the stage of cancer.
How to Perform Logistic Regression Analysis?
Prepare the data: The data should be in a format where each row represents a single observation and each column
represents a different variable. The target variable (the variable you want to predict) should be binary (yes/no, true/false, 0/1).
Train the model: We teach the model by showing it the training data. This involves finding the values of the model
parameters that minimize the error in the training data.
Evaluate the model: The model is evaluated on the test data to assess its performance on unseen data.
Use the model to make predictions: After the model has been trained and assessed, it can be used to forecast outcomes
on new data.
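The steps above can be sketched end to end with a minimal logistic regression trained by gradient descent. Everything here (the data, learning rate, and epoch count) is made up for illustration; it is a sketch, not a production implementation:

```python
import math

# Minimal single-feature logistic regression trained by gradient descent.
# Synthetic data: label 1 roughly when x > 2.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.5, epochs=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log-loss with respect to w and b
        gw = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        gb = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * gw
        b -= lr * gb
    return w, b

# 1. Prepare the data (binary target)
xs = [0.0, 1.0, 1.5, 2.5, 3.0, 4.0]
ys = [0, 0, 0, 1, 1, 1]
# 2. Train the model
w, b = train_logistic(xs, ys)
# 3-4. Use the model: predicted probability for a new observation
p = sigmoid(w * 3.5 + b)  # should be well above 0.5
```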
Logistic Regression
• In medicine, a frequent application is to find out which variables have an influence on a disease. In this case, 0 could
stand for not diseased and 1 for diseased. Subsequently, the influence of age, gender and smoking status (smoker or not)
on this particular disease could be examined.
• In linear regression, the independent variables (e.g., age and gender) are used to estimate the specific value of the
dependent variable (e.g., body weight).
• In logistic regression, on the other hand, the dependent variable is dichotomous (0 or 1) and the probability that
outcome 1 occurs is estimated. Returning to the example above, this means: how likely is it that the disease is present
if the person under consideration has a certain age, sex, and smoking status?
To build a logistic regression model, the linear regression equation is used as the starting point.
Logistic Regression
However, if a linear regression were simply calculated for solving a logistic regression, the following result would appear
graphically:
As can be seen in the graph, values between plus and minus infinity can now occur. The goal of logistic regression,
however, is to estimate the probability of occurrence and not the value of the variable itself. Therefore, the equation must be
transformed.
To do this, it is necessary to restrict the value range for the prediction to the range between 0 and 1. To ensure that only
values between 0 and 1 are possible, the logistic function is used.
Logistic Regression
Logistic Function
The logistic model is based on the logistic function. The special thing about the logistic function is that for values between
minus and plus infinity, it always assumes only values between 0 and 1.
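A quick Python sketch of this bounded-range property (the checks below are illustrative):

```python
import math

# The logistic (sigmoid) function maps any real number into (0, 1),
# which is what lets logistic regression output probabilities.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

assert sigmoid(0) == 0.5          # exactly halfway at z = 0
assert 0 < sigmoid(-10) < 0.001   # large negative z approaches 0
assert 0.999 < sigmoid(10) < 1    # large positive z approaches 1
```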
To calculate the probability of a person being sick or not using logistic regression for the example above, the model
parameters b1, b2, b3 and a must first be determined. Once these have been determined, the equation becomes:
P(diseased) = 1 / (1 + e^−(a + b1·age + b2·gender + b3·smoking status))
Key properties of the Logistic Regression equation
Sigmoid Function: The logistic regression model uses a special “S”-shaped curve to predict
probabilities. It ensures that the predicted probabilities stay between 0 and 1, which makes sense for probabilities.
Coefficients: These are just numbers that tell us how much each input affects the outcome in the logistic regression
model. For example, if age is a predictor, the coefficient tells us how much the outcome changes for every one-year
increase in age.
Best Guess: We figure out the best coefficients for the logistic regression model by looking at the data we have and
tweaking them until our predictions match the real outcomes as closely as possible.
Basic Assumptions: We assume that our observations are independent, meaning one doesn’t affect the other. We also
assume that there’s not too much overlap between our predictors (like age and height).
Linearity in the Logit: The relationship between the independent variables and the logit of the dependent variable (ln(p /
(1−p))) is assumed to be linear. This doesn’t necessarily mean the outcome itself has a linear relationship with the
independent variables, but the log-odds do.
Probabilities, Not Certainties: Instead of saying “yes” or “no” directly, logistic regression gives us probabilities, like
saying there’s a 70% chance it’s a “yes” in the logistic regression model. We can then decide on a cutoff point to make our
final decision.
Checking Our Work: We have some tools to make sure our predictions are good, like accuracy, precision, recall, and a
curve called the ROC curve.
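As a small sketch of the “Probabilities, Not Certainties” point above: a cutoff converts predicted probabilities into final yes/no decisions (the probabilities below are made up):

```python
# Thresholding: logistic regression outputs probabilities,
# and a chosen cutoff converts them to final yes/no decisions.
def classify(probs, cutoff=0.5):
    return [1 if p >= cutoff else 0 for p in probs]

probs = [0.70, 0.40, 0.55, 0.10]
decisions = classify(probs)             # -> [1, 0, 1, 0]
stricter = classify(probs, cutoff=0.6)  # -> [1, 0, 0, 0]
```

Raising the cutoff trades fewer false positives for more false negatives.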
Logistic Regression Example
You have a dataset, and you need to predict whether a candidate will be admitted to the desired college or not, based on
the person’s GRE score, GPA, and college rank.
Steps:
1. In the dataset, we are given the GRE scores, GPAs and college ranks for several students, but it also has a column that
indicates whether those students were admitted or not.
2. Based on this labeled data, you can train the model, validate it, and then use it to predict the admission for any GRE,
GPA and college rank.
3. Once you split the data into training and test sets, you will apply the regression on the three independent variables (GRE,
GPA and Rank), generate the model, and then run the test set through the model.
4. Once that is complete, you will validate the model to see how well it performed.
Data Set
Logistic Regression Example
Model & Results Interpretation
1- Each one-unit increase in GRE increases the log odds of
admission by 0.002, and its p-value indicates that it is
somewhat significant in determining admission.
2- Each one-unit increase in GPA increases the log odds of
admission by 0.80, and its p-value indicates that it is somewhat
significant in determining admission.
3- The interpretation of rank is different from the others: going
from a rank-1 college to a rank-2 college decreases the log odds
of admission by 0.67. Going from rank-2 to rank-3 decreases it
by 1.340.
4- The difference between the null deviance and the residual
deviance tells us whether the model is a good fit: the greater
the difference, the better the model. The null deviance is the
value when you have only the intercept in your equation, with
no variables; it tells us how well the response variable can be
predicted by a model with only an intercept term. The residual
deviance is the value when all the variables are taken into
account.
*When using logistic regression, you should convert a rank from an integer to a factor to indicate that the rank is a categorical variable.
Logistic Regression Example
Prediction
Let’s say a student has a profile with a GRE score of 790, a GPA of 3.8, and a rank-1 college. Now you want to predict
the chances of that student being admitted in the future.
We see that there is an 85% chance that this student will be admitted.
Terminologies used
A confusion matrix measures the performance and accuracy of machine learning
classification models. It gives a breakdown of the predictions made by a model compared
to the actual outcomes.
Accuracy score
Accuracy is the percentage of cases that the model predicted correctly. This is a very high-level
summary; we need more information to evaluate the classifier properly.
Precision (or positive predicted value)
Precision is the ratio of correct positive predictions out of all positive predictions (both
correct and incorrect). If we have high precision, then we minimize false positives.
Recall (sensitivity or true positive rate)
Recall is the ratio of correct positive predictions out of all positive cases. High recall means
that false negatives are minimized.
Specificity (true negative rate)
Specificity is the ratio of correct negative predictions out of all cases that are actually
negative.
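These four metrics can be computed directly from predicted and actual labels. A pure-Python sketch with hypothetical labels:

```python
# Confusion-matrix metrics from actual vs. predicted binary labels.
def confusion_metrics(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(actual),
        "precision": tp / (tp + fp),      # correct positives / predicted positives
        "recall": tp / (tp + fn),         # correct positives / actual positives
        "specificity": tn / (tn + fp),    # correct negatives / actual negatives
    }

# Hypothetical labels for 10 cases
actual    = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]
m = confusion_metrics(actual, predicted)
# accuracy 0.7, precision 0.5, recall 2/3, specificity 5/7
```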
ROC Curve
• Receiver Operating Characteristic (ROC) curves are graphical
representations of how well the model can tell classes apart at different
decision thresholds.
• The curve plots the true positive rate (sensitivity or recall) against the false
positive rate (1 – specificity) at various classification thresholds.
This gives a good overview of a model’s performance across
various thresholds, helping to understand the trade-offs between
TPR and FPR.
• We can also calculate the Area Under the ROC Curve (AUC) for a
single measure of the model’s overall performance. Higher AUC =
better model performance.
• The diagonal line represents random guessing; any curve above it
indicates better-than-random performance. The closer the curve is
to the top-left corner, the higher the model’s performance.
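AUC can be approximated from ROC points with the trapezoidal rule; the checks below cover the perfect-classifier and random-guessing cases (the points are illustrative):

```python
# Trapezoidal AUC from ROC points given as (FPR, TPR) pairs.
def auc(points):
    pts = sorted(points)  # order by false positive rate
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# A perfect classifier's ROC passes through (0, 1): AUC = 1.0
assert auc([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]) == 1.0
# The diagonal (random guessing) gives AUC = 0.5
assert auc([(0.0, 0.0), (1.0, 1.0)]) == 0.5
```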
Foundation Skills Academy
Thank You