K. J. Somaiya School of Engineering
(Somaiya Vidyavihar University)
Batch:A1 Roll No.:16010323004
Experiment / assignment / tutorial No.08
Grade: AA / AB / BB / BC / CC / CD /DD
Signature of the Staff In-charge with date
TITLE: Data modelling using Multiple linear regression model
AIM: a) Formulate a hypothesis about the choice of the model using graphical and statistical tools.
b) Find model parameters using the Ordinary Least Squares (OLS) method.
c) Validate the multiple linear regression model using statistical tests and diagnostic plots.
d) Interpret model parameters and their significance.
e) Evaluate model performance on training and testing datasets and determine its effectiveness.
OUTCOME: After completion of the experiments students will be able to: examine the
relationship between independent and dependent variables by implementing simple linear
regression
Procedure:
1. Load the Dataset: Read the dataset from a CSV file.
2. Explore Data: Identify predictor (independent) and response (dependent)
variables.
3. Check Data Quality: Identify missing values and data types.
4. Split Data: Separate into training (80%) and testing (20%) datasets.
5. Perform Exploratory Data Analysis (EDA):
a. Generate summary statistics.
b. Check for correlation between variables.
c. Plot scatterplots and pair plots.
6. Build Multiple Linear Regression Model:
a. Train the model using OLS estimation.
b. Interpret model coefficients and significance.
7. Check Assumptions:
a. Linearity, Normality, Homoscedasticity, and Multicollinearity.
b. Generate residual plots and QQ plots.
8. Evaluate Model Performance:
a. Compute MSE and R² score on training and test data.
b. Compare actual vs predicted values.
Department of Electronics and Telecommunication Engineering
Page No EXTC/Sem IV/PSOT/Jan-May 2025
Experiment Tasks for Students
1. Generate Sample Dataset (Students' Study Hours & Exam Performance)
Student ID   Study_Hours   Sleep_Hours   Attendance (%)   Exam_Score
1            6.5           7.2           85               78
2            8.0           6.5           96               88
3            5.0           8.0           77               70
Predictors (Independent Variables): Study Hours, Sleep Hours, Attendance (%)
Response (Dependent Variable): Exam Score
2. Modify the dataset to include additional predictors like "Self-study hours" or
"Extracurricular activities".
3. Experiment with different train-test split ratios (e.g., 70%-30%).
4. Remove one independent variable and analyze changes in model performance.
Conclusion: In this experiment we learnt the concept of multiple linear regression,
implemented it in Python using OLS estimation, and compared the MSE and R² results
for 80-20 and 70-30 train-test splits, with an added predictor, and with one predictor removed.
Signature of faculty in-charge with date
CODE:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# 1. Load the Dataset
df = pd.read_excel('student.xlsx') # make sure the file exists and is correctly named
# Print column names to check for any discrepancies
print("Columns in dataset:", df.columns.tolist())
# Strip any whitespace from column names
df.columns = df.columns.str.strip()
# Identify predictor (X) and response variable (Y)
# Dynamically detect the exam score column
score_col = None
for col in df.columns:
    if 'exam_score' in col.lower():
        score_col = col
        break
if score_col is None:
    raise KeyError("Exam score column not found in dataset. Check the column names.")
print(f"Using '{score_col}' as the target variable.")
y = df[score_col]
X = df.drop(columns=[score_col])
# 4. Split Data into Training (80%) and Testing (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 6. Build Multiple Linear Regression Model
X_train_const = sm.add_constant(X_train)
X_test_const = sm.add_constant(X_test)
model = sm.OLS(y_train, X_train_const).fit()
# a. Compute MSE and R² score
train_pred = model.predict(X_train_const)
test_pred = model.predict(X_test_const)
mse_train = mean_squared_error(y_train, train_pred)
mse_test = mean_squared_error(y_test, test_pred)
r2_train = r2_score(y_train, train_pred)
r2_test = r2_score(y_test, test_pred)
print(f"MSE (Train): {mse_train:.2f}, R² (Train): {r2_train:.2f}")
print(f"MSE (Test): {mse_test:.2f}, R² (Test): {r2_test:.2f}")
#question2
# 1. Load the Dataset
df = pd.read_excel('student_2.xlsx') # make sure the file exists and is correctly named
# Print column names to check for any discrepancies
print("Columns in added dataset:", df.columns.tolist())
# Strip any whitespace from column names
df.columns = df.columns.str.strip()
# Identify predictor (X) and response variable (Y)
# Dynamically detect the exam score column
score_col = None
for col in df.columns:
    if 'exam_score' in col.lower():
        score_col = col
        break
if score_col is None:
    raise KeyError("Exam score column not found in dataset. Check the column names.")
print(f"Using '{score_col}' as the target variable.")
y = df[score_col]
X = df.drop(columns=[score_col])
# 4. Split Data into Training (80%) and Testing (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 6. Build Multiple Linear Regression Model
X_train_const = sm.add_constant(X_train)
X_test_const = sm.add_constant(X_test)
model = sm.OLS(y_train, X_train_const).fit()
# a. Compute MSE and R² score
train_pred = model.predict(X_train_const)
test_pred = model.predict(X_test_const)
mse_train = mean_squared_error(y_train, train_pred)
mse_test = mean_squared_error(y_test, test_pred)
r2_train = r2_score(y_train, train_pred)
r2_test = r2_score(y_test, test_pred)
print(f"MSE (Train): {mse_train:.2f}, R² (Train): {r2_train:.2f}")
print(f"MSE (Test): {mse_test:.2f}, R² (Test): {r2_test:.2f}")
#question3
# 1. Load the Dataset
df = pd.read_excel('student.xlsx') # make sure the file exists and is correctly named
# Print column names to check for any discrepancies
print("Columns in dataset:", df.columns.tolist())
# Strip any whitespace from column names
df.columns = df.columns.str.strip()
# Identify predictor (X) and response variable (Y)
# Dynamically detect the exam score column
score_col = None
for col in df.columns:
    if 'exam_score' in col.lower():
        score_col = col
        break
if score_col is None:
    raise KeyError("Exam score column not found in dataset. Check the column names.")
print(f"Using '{score_col}' as the target variable.")
y = df[score_col]
X = df.drop(columns=[score_col])
# 4. Split Data into Training (70%) and Testing (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# 6. Build Multiple Linear Regression Model
X_train_const = sm.add_constant(X_train)
X_test_const = sm.add_constant(X_test)
model = sm.OLS(y_train, X_train_const).fit()
# a. Compute MSE and R² score
train_pred = model.predict(X_train_const)
test_pred = model.predict(X_test_const)
mse_train = mean_squared_error(y_train, train_pred)
mse_test = mean_squared_error(y_test, test_pred)
r2_train = r2_score(y_train, train_pred)
r2_test = r2_score(y_test, test_pred)
print("for 70-30 train-test split")
print(f"MSE (Train): {mse_train:.2f}, R² (Train): {r2_train:.2f}")
print(f"MSE (Test): {mse_test:.2f}, R² (Test): {r2_test:.2f}")
#question4
# 1. Load the Dataset
df = pd.read_excel('student_3.xlsx') # make sure the file exists and is correctly named
# Print column names to check for any discrepancies
print("Columns in dataset:", df.columns.tolist())
# Strip any whitespace from column names
df.columns = df.columns.str.strip()
# Identify predictor (X) and response variable (Y)
# Dynamically detect the exam score column
score_col = None
for col in df.columns:
    if 'exam_score' in col.lower():
        score_col = col
        break
if score_col is None:
    raise KeyError("Exam score column not found in dataset. Check the column names.")
print(f"Using '{score_col}' as the target variable.")
y = df[score_col]
X = df.drop(columns=[score_col])
# 4. Split Data into Training (70%) and Testing (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# 6. Build Multiple Linear Regression Model
X_train_const = sm.add_constant(X_train)
X_test_const = sm.add_constant(X_test)
model = sm.OLS(y_train, X_train_const).fit()
# a. Compute MSE and R² score
train_pred = model.predict(X_train_const)
test_pred = model.predict(X_test_const)
mse_train = mean_squared_error(y_train, train_pred)
mse_test = mean_squared_error(y_test, test_pred)
r2_train = r2_score(y_train, train_pred)
r2_test = r2_score(y_test, test_pred)
print("Removed Study_hours")
print(f"MSE (Train): {mse_train:.2f}, R² (Train): {r2_train:.2f}")
print(f"MSE (Test): {mse_test:.2f}, R² (Test): {r2_test:.2f}")