K502 Business Analytics
Institute of Business Administration (IBA)
Prediction Using Logistic Regression (LR) Model
Synopsis
You need to estimate a logistic regression model in MS Excel based on a dataset of
your selection and test that model. Categorical or continuous data need to be dummy
encoded. For dummy variable encoding, literature from the relevant domain must be
consulted and justifications provided. After model estimation, the trade-off between
false positives and true positives is to be shown through a receiver operating
characteristic (ROC) curve, and your decision regarding the cut-off value should consider
the context in which the model is being developed.
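The trade-off described above can be illustrated with a minimal Python sketch. It is not part of the required Excel workflow; the labels, predicted probabilities, and the function names `confusion_counts` and `roc_points` are made up for illustration. For each candidate cut-off, it counts the confusion-matrix cells and derives the false positive rate and true positive rate that make up one point on the ROC curve.

```python
def confusion_counts(y_true, scores, cutoff):
    """Return (TP, FP, TN, FN) when predicting 1 for every score >= cutoff."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= cutoff else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def roc_points(y_true, scores, cutoffs):
    """Return a list of (cutoff, false positive rate, true positive rate)."""
    points = []
    for cutoff in cutoffs:
        tp, fp, tn, fn = confusion_counts(y_true, scores, cutoff)
        tpr = tp / (tp + fn) if (tp + fn) else 0.0
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        points.append((cutoff, fpr, tpr))
    return points


# Made-up test-set labels and model probabilities (illustration only).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
cutoffs = [i / 10 for i in range(0, 11)]

print("cutoff\tFPR\tTPR")
for cutoff, fpr, tpr in roc_points(y_true, scores, cutoffs):
    print(f"{cutoff:.1f}\t{fpr:.2f}\t{tpr:.2f}")
```

Lowering the cut-off raises both rates, so the choice depends on which error is costlier in the given context, for example a missed diagnosis versus a false alarm.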
Instructions
You need to estimate a logistic regression model by maximizing the log-likelihood as
discussed in class and as detailed in Chapter 6 of Data Smart¹. Please note the
following:
1. Select a dataset from Kaggle or any other source. I can share suggestions
regarding possible sources depending on interest area. The source and related
description are to be mentioned in the relevant section of the accompanying
report.
2. Your dataset will likely have a lot of columns. Most of them will be non-binary.
Therefore, you need to dummy-encode categorical or continuous data.
3. For dummy variable encoding, find relevant literature. For example, let’s say
you are working with heart disease prediction, and there is a column named
‘chol’ that measures blood cholesterol level in mg/dl. You need to find
literature from the heart disease domain and determine the cut-offs for
cholesterol levels in mg/dl at which the ‘chol’ column should be encoded into,
e.g., ‘normalCholesterol’, ‘borderlineHigh’, and
‘highCholesterol’. At least 3 relevant sources should be cited for each cut-off
determination.
4. You need to train-test split your data into an 80:20 or 70:30 ratio. The
model should be trained on the randomly selected training records (80% or 70%)
and tested on the remaining test set (20% or 30%) to plot the ROC curve and determine
accuracy, precision, confusion matrix, etc. (Hint: use the Excel random number
generator to randomly select data. There can be multiple ways of doing this.)
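Steps 2 to 4 can be sketched in Python as a cross-check on the Excel work, under stated assumptions. The cholesterol cut-offs below are placeholders only and must be replaced by values justified from the literature. The ten data rows are made up. The fit uses plain gradient ascent on the log-likelihood, which is a substitute for whatever optimizer is used in Excel, not the method from class. All function names are hypothetical.

```python
import math
import random

# Placeholder cut-offs in mg/dl (assumption for illustration; justify real
# values from the domain literature as required in step 3).
BORDERLINE_CUTOFF = 200
HIGH_CUTOFF = 240


def encode_chol(chol):
    """Return (borderlineHigh, highCholesterol); normalCholesterol is the baseline."""
    borderline = 1 if BORDERLINE_CUTOFF <= chol < HIGH_CUTOFF else 0
    high = 1 if chol >= HIGH_CUTOFF else 0
    return borderline, high


def train_test_split(rows, train_share=0.8, seed=42):
    """Shuffle a copy of the rows and cut it into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(train_share * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]


def predict_proba(coefs, x):
    """Logistic function of intercept coefs[0] plus the weighted dummies."""
    z = coefs[0] + sum(b * v for b, v in zip(coefs[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(coefs, X, y):
    """Sum of y*ln(p) + (1-y)*ln(1-p) over all records."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict_proba(coefs, xi), 1e-12), 1 - 1e-12)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total


def fit_logistic(X, y, learning_rate=0.1, steps=2000):
    """Maximize the log-likelihood by gradient ascent, starting from zeros."""
    coefs = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(steps):
        grad = [0.0] * len(coefs)
        for xi, yi in zip(X, y):
            error = yi - predict_proba(coefs, xi)
            grad[0] += error
            for j, v in enumerate(xi):
                grad[j + 1] += error * v
        coefs = [b + learning_rate * g / n for b, g in zip(coefs, grad)]
    return coefs


# Made-up (chol, outcome) records, for illustration only.
rows = [(180, 0), (190, 0), (195, 1), (210, 0), (220, 1),
        (230, 0), (250, 1), (260, 1), (270, 0), (300, 1)]

train, test = train_test_split(rows, train_share=0.8)
X_train = [list(encode_chol(chol)) for chol, _ in train]
y_train = [outcome for _, outcome in train]

coefs = fit_logistic(X_train, y_train)
print("coefficients:", [round(b, 3) for b in coefs])
print("training log-likelihood:", round(log_likelihood(coefs, X_train, y_train), 3))
```

One category (here ‘normalCholesterol’) is left out as the baseline so that the dummies are not perfectly collinear with the intercept. The held-out `test` rows are then scored with `predict_proba` to build the confusion matrix and ROC curve.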
¹ Jordan Goldmeier, Data Smart: Using Data Science to Transform Information into Insight, 2nd
edition (Hoboken, New Jersey: Wiley, 2023).
Deliverable
You need to submit the following:
1. MS Excel File: The Excel file should have the following. All formulas should be
dynamic and accessible for me to check and understand how they are working.
Number of sheets can vary across students.
a. Raw data
b. Dummy encoded data
c. Training data and test data (after split)
d. Estimated logistic regression model
e. Model test
f. ROC Curve
g. Interpretation of model coefficients and your decision regarding cut-off
value
2. Brief Report Word File: Prepare a brief report (4 pages maximum; Times New Roman 12 pt,
1.5 line spacing, A4; no cover page, just the title page) that will have, among other components,
the following:
a. A synopsis of the context in which you estimate this model
b. Dataset description
c. Review of relevant literature for dummy-encoding
d. Brief description of logistic regression model with relevant equations
e. Your model coefficients and interpretation
f. Model accuracy, precision, recall measures and ROC curve
g. Conclusion and your reflection on benefits/limitations/future
recommendations for LR modelling in this context
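For item d, the standard logistic regression equations can be typeset as in the LaTeX sketch below. The notation is an assumption and should be adapted to match the class material: n records, k dummy variables x_{ij}, binary outcome y_i, and coefficients \beta_0, ..., \beta_k. The `align` environment requires the `amsmath` package.

```latex
\begin{align}
  p_i &= \frac{1}{1 + e^{-z_i}},
  \qquad z_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} \\
  \ln\frac{p_i}{1 - p_i} &= z_i \\
  \ell(\beta) &= \sum_{i=1}^{n} \bigl[\, y_i \ln p_i + (1 - y_i) \ln (1 - p_i) \,\bigr]
\end{align}
```

The coefficients are those that maximize \ell(\beta). Since the second equation is linear in the log-odds, e^{\beta_j} can be read as the odds ratio of dummy j relative to its omitted baseline category.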
Bonus mark: A bonus of 2.5 will be assigned if you do all of the above properly AND submit the report in LaTeX. Please note,
I will decide whether your submission qualifies for a bonus. Mere submission of a LaTeX report may be insufficient.