PREVIOUS YEAR PAPERS
1. Business Analytics for Supermarket Optimization
Q1a. How might business analytics help the supermarket? What data would be
needed to facilitate good decisions?
a) Business analytics can help the supermarket by analyzing historical transaction
data, foot traffic patterns, and employee shift schedules to identify peak hours and
optimize staffing decisions. Predictive modeling can be used to anticipate high-
demand periods and allocate resources accordingly.
Data Required:
Transaction timestamps
Number of active registers
Customer entry and exit times
Sales volume by day and hour
Employee shift schedules
Example: If analytics indicate that Mondays from 6 PM to 8 PM experience high
checkout congestion, the supermarket can proactively assign more employees to
checkout duty during that period.
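A minimal R sketch of how such an analysis might look, assuming a hypothetical transactions table with columns timestamp and basket_value (the data and column names below are simulated for illustration, not taken from the question):
library(dplyr)
library(lubridate)
set.seed(7)
# Hypothetical checkout transactions (column names assumed for illustration)
transactions <- data.frame(
  timestamp    = as.POSIXct("2024-01-01 08:00:00") + runif(1000, 0, 7 * 24 * 3600),
  basket_value = round(runif(1000, 5, 120), 2))
peak_hours <- transactions %>%
  mutate(weekday = wday(timestamp, label = TRUE),   # day of week
         hour    = hour(timestamp)) %>%             # hour of day
  group_by(weekday, hour) %>%
  summarise(n_checkouts = n(),
            revenue     = sum(basket_value), .groups = "drop") %>%
  arrange(desc(n_checkouts))
head(peak_hours)   # busiest weekday/hour slots suggest where extra staff are needed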
Q1b. Do you agree with the statement "Data Scientist: The Sexiest Job of the
21st Century"? Justify.
b) Yes. The claim by Thomas Davenport and D.J. Patil (Harvard Business Review, 2012) is
justified: organizations increasingly rely on data-driven decision-making, yet the
combination of skills a data scientist needs (statistics, programming, and business domain
knowledge) remains scarce. This gap between demand and supply keeps data science one
of the most sought-after and well-paid careers of the 21st century.
2. Metrics for Fast-Food Restaurant Management
Q2a. Suggest some metrics a fast-food restaurant manager might want to collect.
How might the manager use the data to facilitate better decisions?
a) Key metrics for a fast-food restaurant include:
Average Order Time: Helps in optimizing service speed.
Customer Satisfaction Scores: Provides insights into service quality.
Inventory Turnover Rate: Ensures efficient stock management.
Revenue per Hour: Determines peak business periods.
Employee Efficiency: Tracks productivity levels.
Managers can use these metrics to streamline operations, reduce waste, and improve
customer experience.
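A short R sketch of how two of these metrics (average order time and revenue per hour) could be computed, assuming a hypothetical orders table with columns placed_at, served_at, and amount (names and data are illustrative only):
library(dplyr)
set.seed(6)
# Hypothetical order-level data (column names assumed for illustration)
orders <- data.frame(
  placed_at = as.POSIXct("2024-06-01 11:00:00") + cumsum(runif(200, 30, 300)),
  amount    = round(runif(200, 4, 25), 2))
orders$served_at <- orders$placed_at + runif(200, 60, 600)   # seconds to serve
# Average order time in minutes
orders %>%
  summarise(avg_order_time_min =
              mean(as.numeric(difftime(served_at, placed_at, units = "mins"))))
# Revenue per hour of the day
orders %>%
  mutate(hour = format(placed_at, "%H")) %>%
  group_by(hour) %>%
  summarise(revenue = sum(amount), .groups = "drop")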
Q2b. How do you perform exploratory analysis using R? What package do you
use and how do you analyze the data through exploratory analysis? What
descriptive techniques might you want to use?
b) In R, exploratory data analysis (EDA) can be performed using packages like
ggplot2, dplyr, and tidyverse. Techniques include summary statistics, box plots,
histograms, and correlation matrices.
Example Code (R):
# Load packages for data wrangling and visualization
library(ggplot2)
library(dplyr)
glimpse(mtcars)                      # structure of a built-in example dataset
summary(mtcars)                      # descriptive statistics for every variable
ggplot(mtcars, aes(x = mpg)) +       # distribution of one variable
  geom_histogram(bins = 10)
3. Loan Default Prediction & Customer Segmentation
Q3a. How do you build an algorithm that predicts loan defaulters in advance?
Should you use a Decision Tree or a regression model? What will be the
procedure?
a) Predicting whether a borrower will default is a binary classification problem, so a
classification model is needed: a Decision Tree is a good first choice because its rules are
easy to explain to credit officers, while Logistic Regression (a classification model, not
linear regression) is a strong alternative. The procedure (a short R sketch follows these
steps):
Data Collection (loan history, credit score, income, etc.)
Data Cleaning (handling missing values, standardization)
Feature Selection
Train/test split and Model Training using a Decision Tree (or Logistic Regression)
Evaluation using accuracy, recall, and a confusion matrix
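A minimal sketch of these steps using the rpart package for the tree and glm for logistic regression; the loans data frame and its columns are simulated purely for illustration:
library(rpart)
set.seed(1)
# Hypothetical loan records (columns assumed for illustration)
loans <- data.frame(
  credit_score = round(rnorm(500, 650, 80)),
  income       = round(rnorm(500, 50000, 15000)),
  loan_amount  = round(runif(500, 5000, 40000)))
loans$default <- factor(ifelse(loans$credit_score + rnorm(500, 0, 60) < 600, "yes", "no"))
idx   <- sample(nrow(loans), 0.7 * nrow(loans))      # train/test split
train <- loans[idx, ]; test <- loans[-idx, ]
tree_fit  <- rpart(default ~ ., data = train, method = "class")  # decision tree
logit_fit <- glm(default ~ ., data = train, family = binomial)   # logistic regression
pred <- predict(tree_fit, test, type = "class")
table(Predicted = pred, Actual = test$default)       # confusion matrix -> accuracy, recall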
Q3b. A marketing team wants to target its set of customers and use an algorithm
that can divide them. What algorithm would you suggest and explain the steps
involved?
b) For customer segmentation, K-Means clustering is suitable. Steps (a short R sketch
follows the list):
Collect customer purchase history
Normalize data
Determine optimal clusters using the Elbow method
Apply K-Means and analyze cluster characteristics
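A base-R sketch of these steps; the customer features below (annual spend, visit frequency, average basket) are assumed for illustration and would be replaced by real purchase-history variables:
set.seed(2)
# Hypothetical customer features (assumed for illustration)
customers <- data.frame(
  annual_spend = runif(300, 100, 5000),
  visit_freq   = rpois(300, 12),
  avg_basket   = runif(300, 10, 200))
scaled <- scale(customers)                        # normalize so no feature dominates
# Elbow method: total within-cluster SS for k = 1..8
wss <- sapply(1:8, function(k) kmeans(scaled, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")
fit <- kmeans(scaled, centers = 3, nstart = 25)   # pick k at the "elbow"
aggregate(customers, by = list(cluster = fit$cluster), FUN = mean)  # profile each cluster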
4. Recommendation Systems & Sampling Challenges
Q4a. How do recommendations appear on platforms like Amazon or Netflix?
Explain the process and algorithm used.
a) Amazon and Netflix use collaborative filtering and content-based filtering for
recommendations. The process (a toy sketch follows these steps):
Collect user interactions (views, purchases, ratings)
Compute similarity scores (user-based or item-based)
Generate recommendations based on preferences
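A toy sketch of item-based collaborative filtering with cosine similarity on a tiny made-up user-item rating matrix; this only illustrates the idea and is not the actual Amazon or Netflix system:
# Toy user-item rating matrix (rows = users, cols = items); 0 = not rated
ratings <- matrix(c(5, 3, 0, 1,
                    4, 0, 0, 1,
                    1, 1, 0, 5,
                    0, 1, 5, 4), nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("user", 1:4), paste0("item", 1:4)))
# Item-item cosine similarity
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
sim <- outer(1:ncol(ratings), 1:ncol(ratings),
             Vectorize(function(i, j) cosine(ratings[, i], ratings[, j])))
dimnames(sim) <- list(colnames(ratings), colnames(ratings))
# Recommend for user1: score items by similarity-weighted ratings
user   <- ratings["user1", ]
scores <- sim %*% user / rowSums(abs(sim))
scores[user > 0] <- NA                       # drop items user1 already rated
sort(scores[, 1], decreasing = TRUE)         # highest score = top recommendation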
Q4b. How important is sampling in data analysis? What challenges might arise
in diagnosing COVID-19 data, and how do you handle them?
b) Sampling determines whether conclusions generalize to the whole population; a biased
sample gives misleading estimates no matter how large it is. In COVID-19 diagnosis,
challenges include:
Biased sample representation (e.g., mostly symptomatic people get tested)
Data collection errors
Incomplete patient history
Handling these challenges requires stratified sampling (e.g., by age group, region, and
symptom status) to ensure diverse and representative data, as in the sketch below.
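A small dplyr sketch of stratified sampling from a hypothetical patient-records table (the column names age_group, region, and test_result are assumed for illustration):
library(dplyr)
set.seed(8)
# Hypothetical COVID-19 testing records (columns assumed for illustration)
patients <- data.frame(
  age_group   = sample(c("0-17", "18-44", "45-64", "65+"), 2000, replace = TRUE),
  region      = sample(c("North", "South", "East", "West"), 2000, replace = TRUE),
  test_result = sample(c("positive", "negative"), 2000, replace = TRUE, prob = c(0.1, 0.9)))
# Stratified sample: draw 10% from every age-group/region stratum
strat_sample <- patients %>%
  group_by(age_group, region) %>%
  slice_sample(prop = 0.10) %>%
  ungroup()
table(strat_sample$age_group, strat_sample$region)   # every stratum stays represented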
Q4c. You have run a classification model, namely logistic regression. How will
you present the effectiveness of the model to the business team?
c) Logistic Regression Model Effectiveness Report (a sketch of the supporting R output
follows):
Accuracy & Precision: Measure predictive performance on a held-out test set
ROC Curve & AUC: Show the trade-off between true-positive and false-positive rates across
classification thresholds
Business Impact: Explain how the model helps reduce financial risk (e.g., expected losses
avoided by acting on the predictions)
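A sketch of how these figures could be produced, assuming the pROC package is available; the data and the fitted model below are simulated for illustration:
library(pROC)
set.seed(3)
# Hypothetical binary outcome and a fitted logistic regression
df   <- data.frame(x1 = rnorm(400), x2 = rnorm(400))
df$y <- rbinom(400, 1, plogis(1.5 * df$x1 - df$x2))
fit  <- glm(y ~ x1 + x2, data = df, family = binomial)
prob <- predict(fit, type = "response")          # predicted probabilities
pred <- ifelse(prob > 0.5, 1, 0)                 # classify at a 0.5 threshold
table(Predicted = pred, Actual = df$y)           # confusion matrix -> accuracy, precision
roc_obj <- roc(df$y, prob)                       # ROC curve
plot(roc_obj); auc(roc_obj)                      # AUC summarizes discrimination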
5. Health Data Analysis & ANOVA Table
Q5a. Given patient health data, what kind of algorithm can you use to predict
fever? Why? Explain the process.
a) Given patient health data (symptoms, vital signs), a Naïve Bayes classifier is a good
choice for fever prediction: it handles many categorical symptom indicators directly, is fast
to train, and works reasonably well even on small datasets. Steps (a short R sketch follows):
Split the data into training and test sets
Estimate the probability of each symptom/measurement given "fever" and "no fever"
Apply Bayes' rule to classify each new patient
Evaluate using a confusion matrix
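A minimal sketch using naiveBayes from the e1071 package; the patient records and symptom columns below are simulated purely for illustration:
library(e1071)
set.seed(4)
# Hypothetical patient records (symptom columns assumed for illustration)
patients <- data.frame(
  cough       = sample(c("yes", "no"), 300, replace = TRUE),
  body_ache   = sample(c("yes", "no"), 300, replace = TRUE),
  temperature = round(rnorm(300, 37.2, 0.8), 1))
patients$fever <- factor(ifelse(patients$temperature > 38 | runif(300) < 0.1, "yes", "no"))
idx   <- sample(nrow(patients), 0.7 * nrow(patients))
train <- patients[idx, ]; test <- patients[-idx, ]
nb_fit <- naiveBayes(fever ~ cough + body_ache + temperature, data = train)
pred   <- predict(nb_fit, test)
table(Predicted = pred, Actual = test$fever)     # confusion matrix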
Q5b. Construct an ANOVA table for a Multiple Regression Problem and
interpret the output (No. of Response Variables: 1, No. of Predictors: 4).
b) ANOVA Table for Multiple Regression (illustrative figures):
Source        SS      df     MS       F        p-value
Regression    1200    4      300.0    ~177     < 0.001
Residual      500     295    1.69     -        -
Total         1700    299    -        -        -
Interpretation: F = MSR / MSE = 300 / 1.69 ≈ 177. The residual df of 295 equals n − k − 1
with k = 4 predictors, so n = 300 observations. A p-value below 0.05 indicates that at least
one of the four predictors significantly affects the response variable. An R sketch of how
such a table is produced follows.
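A sketch showing how this kind of table can be obtained in R by comparing the full model against the intercept-only model; the data are simulated, so the numbers will differ from the illustrative table above:
set.seed(5)
# Simulated data with one response and four predictors (illustrative only)
n  <- 300
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
df$y <- 2 + 1.5 * df$x1 - 0.8 * df$x2 + 0.5 * df$x3 + rnorm(n)
full <- lm(y ~ x1 + x2 + x3 + x4, data = df)
null <- lm(y ~ 1, data = df)
anova(null, full)   # regression vs. residual SS, df = 4 and n - 5, overall F-test
summary(full)       # the same overall F statistic and p-value appear at the bottom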
6. Statistical Analysis Questions
Q6a. Compute the 30th Percentile and Five-Number Summary for BA Quiz
Scores.
Given Scores: 95, 81, 81, 55, 68, 111, 88, 100, 94, 87, 65, 93,
85, 79, 106, 92, 15, 67, 83
Sorted data (n = 19): 15, 55, 65, 67, 68, 79, 81, 81, 83, 85, 87, 88, 92, 93, 94, 95, 100,
106, 111
30th Percentile: L = (30/100)(n + 1) = 0.30 × 20 = 6, so the 30th percentile is the 6th
ordered value = 79
Five-Number Summary: Min = 15, Q1 = 68, Median = 85, Q3 = 94, Max = 111
(A base-R check follows; different quartile conventions give slightly different values.)
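A quick base-R check; note that quantile() uses interpolation (its default gives roughly 79.8 here) and fivenum() uses Tukey's hinges, so both can differ slightly from the (n + 1) textbook convention used above:
scores <- c(95, 81, 81, 55, 68, 111, 88, 100, 94, 87, 65, 93,
            85, 79, 106, 92, 15, 67, 83)
sort(scores)               # ordered data, n = 19
quantile(scores, 0.30)     # 30th percentile under R's default interpolation
fivenum(scores)            # min, lower hinge, median, upper hinge, max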
Q6b. Dataset Analysis (Tablet Specifications Table)
Elements: 8
Variables: 5 (Cost, Operating System, Display Size, Battery Life, CPU Manufacturer)
Categorical: Operating System, CPU Manufacturer
Quantitative: Cost, Display Size, Battery Life
Q6c. Compute Sample Covariance and Correlation.
Given X = [4, 6, 11, 3, 16], Y = [50, 50, 40, 60, 30]
Working: x̄ = 8, ȳ = 46; Σ(x − x̄)(y − ȳ) = −240, so
Sample Covariance = −240 / (5 − 1) = −60.0
Sample standard deviations: s_x ≈ 5.43, s_y ≈ 11.40, so
Sample Correlation = −60 / (5.43 × 11.40) ≈ −0.969
Interpretation: Strong negative linear relationship; as X increases, Y decreases (verified in
the short R check below).
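A two-line R check of the hand calculation:
x <- c(4, 6, 11, 3, 16)
y <- c(50, 50, 40, 60, 30)
cov(x, y)   # sample covariance: -60
cor(x, y)   # sample correlation: about -0.969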
Q6d. IPL 2023 Runs Analysis
Kohli’s Runs: [82, 21, 61, 50, 6, 59, 0, 54, 31, 55, 1, 18, 100,
101]
Du Plessis’ Runs: [73, 23, 79, 22, 62, 84, 62, 17, 44, 45, 65, 55,
71, 28]
Findings:
Kohli’s scores fluctuate more: his innings range from 0 to 101, mixing several single-digit
scores with two centuries.
Du Plessis is more consistent: his innings stay within a narrower 17–84 range.
The means, standard deviations, and ranges can be compared with the sketch below.
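A base-R sketch that computes the summary statistics behind these findings from the scores listed above:
kohli     <- c(82, 21, 61, 50, 6, 59, 0, 54, 31, 55, 1, 18, 100, 101)
duplessis <- c(73, 23, 79, 22, 62, 84, 62, 17, 44, 45, 65, 55, 71, 28)
sapply(list(Kohli = kohli, DuPlessis = duplessis),
       function(r) c(mean = mean(r), sd = sd(r), min = min(r), max = max(r)))
boxplot(list(Kohli = kohli, DuPlessis = duplessis), ylab = "Runs")  # compare spread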
Q6e. Sleep Data Analysis using Empirical Rule
Mean = 6.9 hours, SD = 1.2 hours
68-95-99.7 Rule:
o 4.5 to 9.3 hours lies within ±2 SD of the mean, so about 95% of people (95.45% using
exact normal probabilities)
o More than 9.3 hours is the upper tail beyond +2 SD, so about 2.5% (2.28% exact)
Z-score for 8 hours sleep: z = (8 − 6.9) / 1.2 ≈ 0.92
These figures can be checked with the short R sketch below.
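A quick check of the exact normal probabilities and the z-score using base R:
mu <- 6.9; sigma <- 1.2
pnorm(9.3, mu, sigma) - pnorm(4.5, mu, sigma)   # P(4.5 < X < 9.3), about 0.9545
1 - pnorm(9.3, mu, sigma)                       # P(X > 9.3), about 0.0228
(8 - mu) / sigma                                # z-score for 8 hours, about 0.92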