TASK : 02
Diabetes Prediction Model using Logistic Regression
Objective:
To understand and implement data preprocessing techniques for a medical dataset.
To perform feature engineering for improved model performance.
To build and train a Logistic Regression model for diabetes prediction.
To evaluate the performance of the trained model using appropriate metrics.
Tools/Software Required:
Python 3.x
Pandas
Scikit-learn (sklearn)
NumPy (for numerical operations)
Anaconda (recommended for environment management) or Google Colab
Dataset:
A sample diabetes dataset is provided in the code. This dataset includes the following
features:
Pregnancies: Number of pregnancies.
Glucose: Glucose level.
BloodPressure: Blood pressure.
SkinThickness: Skin thickness.
Insulin: Insulin level.
BMI: Body mass index.
DiabetesPedigreeFunction: Diabetes pedigree function.
Age: Age.
Outcome: Diabetes status (1: Diabetes, 0: No Diabetes).
Procedure:
1. Data Preprocessing:
1. Load the Data:
o Create a Pandas DataFrame from the provided sample diabetes dataset.
o Print the original DataFrame to inspect the raw data.
2. Handle Missing Values:
o Identify and handle missing values in the 'BMI' column using SimpleImputer
with the mean strategy.
o Print the DataFrame after imputation to verify the changes.
3. Scale Numerical Features:
o Scale the numerical features ('Pregnancies', 'Glucose', 'BloodPressure',
'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age') using
StandardScaler.
o Print the DataFrame after scaling to observe the transformed data.
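The two preprocessing steps above can be checked by hand on the BMI column alone. The following is a minimal sketch, assuming only the BMI values from the sample dataset; the variable names (bmi, manual_mean, manual_z) are illustrative and not part of the task program. It shows that mean imputation fills the NaN with the mean of the observed values, and that StandardScaler computes z = (x - mean) / std using the population standard deviation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# BMI values copied from the sample dataset; one entry is missing (NaN).
bmi = pd.DataFrame(
    {'BMI': [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, np.nan, 30.5, 0.0]}
)

# Mean imputation: the NaN is replaced by the mean of the observed values.
manual_mean = bmi['BMI'].mean()  # pandas skips NaN by default
imputed = SimpleImputer(strategy='mean').fit_transform(bmi[['BMI']])

# Standardization: z = (x - mean) / std, with the population std (ddof=0),
# which is NumPy's default for ndarray.std().
manual_z = (imputed - imputed.mean()) / imputed.std()
scaled = StandardScaler().fit_transform(imputed)

print("Imputed value for row 7:", imputed[7, 0])
print("Manual and StandardScaler results agree:", np.allclose(manual_z, scaled))
```

Because the 0.0 entry is an observed value rather than NaN, it is included in the mean that replaces the missing entry.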
2. Feature Engineering:
1. Create Interaction Feature:
o Create a new feature 'Glucose_BMI' by multiplying the 'Glucose' and 'BMI'
columns.
o Print the DataFrame after adding the new feature.
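As a small illustration of the interaction feature, the sketch below multiplies the two columns for two rows only. It assumes the already-scaled 'Glucose' and 'BMI' values of rows 0 and 9 as they appear in the recorded output, since the program for this task applies the multiplication after scaling; the DataFrame name rows is illustrative.

```python
import pandas as pd

# Scaled Glucose and BMI values for rows 0 and 9, copied from the output tables.
rows = pd.DataFrame(
    {'Glucose': [0.544471, -0.060497], 'BMI': [0.648853, -2.588989]},
    index=[0, 9],
)

# Element-wise product of the two columns gives the interaction feature.
rows['Glucose_BMI'] = rows['Glucose'] * rows['BMI']
print(rows)
```

Row 9 shows a side effect of multiplying standardized columns: two below-average values produce a positive product, so the sign of 'Glucose_BMI' does not by itself indicate high glucose and high BMI.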
3. Machine Learning Model Building and Evaluation:
1. Prepare Data for Model:
o Define the features (X) by dropping the 'Outcome' column.
o Define the target variable (y) as the 'Outcome' column.
2. Split Data:
o Use train_test_split to split the dataset into training and testing sets (80%
training, 20% testing) with random_state=42 for reproducibility.
o Print the shapes and contents of the training and tes ng sets.
3. Train the Model:
o Create a LogisticRegression model.
o Train the model using the training data (X_train, y_train).
4. Make Predictions:
o Use the trained model to make predictions on the testing data (X_test).
o Print the predictions (y_pred).
5. Evaluate the Model:
o Calculate the accuracy of the model using accuracy_score.
o Generate a classification report using classification_report to assess precision,
recall, and F1-score.
o Print the accuracy and the classification report.
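The split and evaluation steps above can be illustrated in isolation. The following is a rough sketch, not the task program: the feature matrix X is a hypothetical placeholder, and the label vectors y_true and y_hat in the second part are made-up values chosen so that the confusion counts are easy to verify by hand. It shows how many rows an 80/20 split leaves in each set, and how accuracy, precision, recall, and F1-score for the positive class follow from those counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Part 1: an 80/20 split of 10 rows leaves 8 for training and 2 for testing.
X = np.arange(20).reshape(10, 2)               # hypothetical placeholder features
y = np.array([1, 0, 1, 0, 1, 0, 0, 1, 1, 0])   # Outcome column of the sample dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Train/test shapes:", X_train.shape, X_test.shape)

# Part 2: metrics computed from confusion counts on hypothetical labels.
y_true = np.array([1, 0, 1, 1, 0, 0])
y_hat = np.array([1, 0, 0, 1, 1, 0])
tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # false negatives
tn = int(np.sum((y_true == 0) & (y_hat == 0)))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# The hand-computed values should agree with the scikit-learn functions.
print("accuracy :", accuracy, accuracy_score(y_true, y_hat))
print("precision:", precision, precision_score(y_true, y_hat))
print("recall   :", recall, recall_score(y_true, y_hat))
print("f1       :", f1, f1_score(y_true, y_hat))
```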
Program :
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
# 1. Sample Diabetes Dataset
data = {
'Pregnancies': [6, 1, 8, 1, 0, 5, 3, 10, 2, 4],
'Glucose': [148, 85, 183, 89, 137, 116, 78, 115, 197, 125],
'BloodPressure': [72, 66, 64, 66, 40, 74, 50, 0, 70, 96],
'SkinThickness': [35, 29, 0, 23, 35, 0, 32, 0, 45, 0],
'Insulin': [0, 0, 0, 94, 168, 0, 88, 0, 543, 0],
'BMI': [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, np.nan, 30.5, 0.0],
'DiabetesPedigreeFunction': [0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, 0.134, 0.158,
0.177],
'Age': [50, 31, 32, 21, 33, 30, 26, 29, 53, 41],
'Outcome': [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]
}
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)
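# 2. Data Preprocessing
# Replace the missing (NaN) BMI entry with the mean of the observed BMI values.
imputer = SimpleImputer(strategy='mean')
# fit_transform returns a 2-D array; ravel() flattens it to one column of values.
df['BMI'] = imputer.fit_transform(df[['BMI']]).ravel()
print("\nDataFrame after Imputing BMI:\n", df)
# Standardize every numerical feature to zero mean and unit variance.
scaler = StandardScaler()
numerical_cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
'DiabetesPedigreeFunction', 'Age']
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
print("\nDataFrame after Scaling:\n", df)
# 3. Feature Engineering
# Interaction feature: product of the scaled Glucose and BMI columns.
df['Glucose_BMI'] = df['Glucose'] * df['BMI']
print("\nDataFrame after Feature Engineering:\n", df)
# 4. Model Building and Evaluation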
X = df.drop('Outcome', axis=1)
y = df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("\nTraining Data (X_train):\n", X_train)
print("\nTesting Data (X_test):\n", X_test)
print("\nTraining Labels (y_train):\n", y_train)
print("\nTesting Labels (y_test):\n", y_test)
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("\nPredictions (y_pred):\n", y_pred)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print("\nAccuracy:", accuracy)
print("\nClassification Report:\n", report)
Output :
Original DataFrame:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \
0 6 148 72 35 0 33.6
1 1 85 66 29 0 26.6
2 8 183 64 0 0 23.3
3 1 89 66 23 94 28.1
4 0 137 40 35 168 43.1
5 5 116 74 0 0 25.6
6 3 78 50 32 88 31.0
7 10 115 0 0 0 NaN
8 2 197 70 45 543 30.5
9 4 125 96 0 0 0.0
DiabetesPedigreeFunction Age Outcome
0 0.627 50 1
1 0.351 31 0
2 0.672 32 1
3 0.167 21 0
4 2.288 33 1
5 0.201 30 0
6 0.248 26 0
7 0.134 29 1
8 0.158 53 1
9 0.177 41 0
DataFrame after Imputing BMI:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \
0 6 148 72 35 0 33.600000
1 1 85 66 29 0 26.600000
2 8 183 64 0 0 23.300000
3 1 89 66 23 94 28.100000
4 0 137 40 35 168 43.100000
5 5 116 74 0 0 25.600000
6 3 78 50 32 88 31.000000
7 10 115 0 0 0 26.866667
8 2 197 70 45 543 30.500000
9 4 125 96 0 0 0.000000
DiabetesPedigreeFunction Age Outcome
0 0.627 50 1
1 0.351 31 0
2 0.672 32 1
3 0.167 21 0
4 2.288 33 1
5 0.201 30 0
6 0.248 26 0
7 0.134 29 1
8 0.158 53 1
9 0.177 41 0
DataFrame after Scaling:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \
0 0.645497 0.544471 0.501265 0.885345 -0.553913 0.648853
1 -0.968246 -1.112615 0.254741 0.533552 -0.553913 -0.025697
2 1.290994 1.465074 0.172566 -1.166779 -0.553913 -0.343699
3 -0.968246 -1.007403 0.254741 0.181760 0.029153 0.118849
4 -1.290994 0.255139 -0.813528 0.885345 0.488163 1.564314
5 0.322749 -0.297223 0.583439 -1.166779 -0.553913 -0.122061
6 -0.322749 -1.296735 -0.402655 0.709449 -0.008064 0.398306
7 1.936492 -0.323526 -2.457018 -1.166779 -0.553913 0.000000
8 -0.645497 1.833316 0.419090 1.471666 2.814225 0.350124
9 0.000000 -0.060497 1.487359 -1.166779 -0.553913 -2.588989
DiabetesPedigreeFunction Age Outcome
0 0.200095 1.579674 1
1 -0.242777 -0.369274 0
2 0.272302 -0.266698 1
3 -0.538025 -1.395037 0
4 2.865348 -0.164122 1
5 -0.483468 -0.471851 0
6 -0.408052 -0.882156 0
7 -0.590977 -0.574427 1
8 -0.552466 1.887403 1
9 -0.521979 0.656488 0
DataFrame after Feature Engineering:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \
0 0.645497 0.544471 0.501265 0.885345 -0.553913 0.648853
1 -0.968246 -1.112615 0.254741 0.533552 -0.553913 -0.025697
2 1.290994 1.465074 0.172566 -1.166779 -0.553913 -0.343699
3 -0.968246 -1.007403 0.254741 0.181760 0.029153 0.118849
4 -1.290994 0.255139 -0.813528 0.885345 0.488163 1.564314
5 0.322749 -0.297223 0.583439 -1.166779 -0.553913 -0.122061
6 -0.322749 -1.296735 -0.402655 0.709449 -0.008064 0.398306
7 1.936492 -0.323526 -2.457018 -1.166779 -0.553913 0.000000
8 -0.645497 1.833316 0.419090 1.471666 2.814225 0.350124
9 0.000000 -0.060497 1.487359 -1.166779 -0.553913 -2.588989
DiabetesPedigreeFunction Age Outcome Glucose_BMI
0 0.200095 1.579674 1 0.353282
1 -0.242777 -0.369274 0 0.028591
2 0.272302 -0.266698 1 -0.503545
3 -0.538025 -1.395037 0 -0.119729
4 2.865348 -0.164122 1 0.399117
5 -0.483468 -0.471851 0 0.036280
6 -0.408052 -0.882156 0 -0.516497
7 -0.590977 -0.574427 1 -0.000000
8 -0.552466 1.887403 1 0.641887
9 -0.521979 0.656488 0 0.156625
Training Data (X_train):
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \
5 0.322749 -0.297223 0.583439 -1.166779 -0.553913 -0.122061
0 0.645497 0.544471 0.501265 0.885345 -0.553913 0.648853
7 1.936492 -0.323526 -2.457018 -1.166779 -0.553913 0.000000
2 1.290994 1.465074 0.172566 -1.166779 -0.553913 -0.343699
9 0.000000 -0.060497 1.487359 -1.166779 -0.553913 -2.588989
4 -1.290994 0.255139 -0.813528 0.885345 0.488163 1.564314
3 -0.968246 -1.007403 0.254741 0.181760 0.029153 0.118849
6 -0.322749 -1.296735 -0.402655 0.709449 -0.008064 0.398306
DiabetesPedigreeFunction Age Glucose_BMI
5 -0.483468 -0.471851 0.036280
0 0.200095 1.579674 0.353282
7 -0.590977 -0.574427 -0.000000
2 0.272302 -0.266698 -0.503545
9 -0.521979 0.656488 0.156625
4 2.865348 -0.164122 0.399117
3 -0.538025 -1.395037 -0.119729
6 -0.408052 -0.882156 -0.516497
Testing Data (X_test):
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \
8 -0.645497 1.833316 0.419090 1.471666 2.814225 0.350124
1 -0.968246 -1.112615 0.254741 0.533552 -0.553913 -0.025697
DiabetesPedigreeFunction Age Glucose_BMI
8 -0.552466 1.887403 0.641887
1 -0.242777 -0.369274 0.028591
Training Labels (y_train):
5 0
0 1
7 1
2 1
9 0
4 1
3 0
6 0
Name: Outcome, dtype: int64
Testing Labels (y_test):
8 1
1 0
Name: Outcome, dtype: int64
Predictions (y_pred):
[1 0]
Accuracy: 1.0
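The recorded output stops at the accuracy line, although the last statement of the program also prints a classification report. The report can be regenerated from the recorded test labels and predictions alone. The sketch below assumes only those two recorded vectors ([1, 0] for both y_test and y_pred); it does not retrain the model.

```python
from sklearn.metrics import accuracy_score, classification_report

# Test labels and predictions as recorded in the output above.
y_test = [1, 0]
y_pred = [1, 0]

accuracy = accuracy_score(y_test, y_pred)
# output_dict=True returns the same figures as a nested dict, keyed by class label.
report = classification_report(y_test, y_pred, output_dict=True)

print("Accuracy:", accuracy)
print("Classification Report:\n", classification_report(y_test, y_pred))
```

With only two test samples, each class is represented by a single example, so the perfect scores here say little about how the model would behave on a larger dataset.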
Results:
Record the original DataFrame.
Record the DataFrame after imputation and scaling.
Record the DataFrame after feature engineering.
Record the shapes and content of the training and testing sets.
Record the predictions made by the model.
Record the accuracy and classification report of the model.