
Task 2

The document outlines a task to build a diabetes prediction model using logistic regression, focusing on data preprocessing, feature engineering, and model evaluation. It details the required software, dataset features, and a step-by-step procedure for data handling, model training, and performance evaluation. The provided Python code demonstrates the implementation of these steps using libraries such as Pandas and Scikit-learn.

Uploaded by

Subramanian R

TASK : 02

Diabetes Prediction Model using Logistic Regression

Objective:

• To understand and implement data preprocessing techniques for a medical dataset.

• To perform feature engineering for improved model performance.

• To build and train a Logistic Regression model for diabetes prediction.

• To evaluate the performance of the trained model using appropriate metrics.

Tools/Software Required:

• Python 3.x

• Pandas

• Scikit-learn (sklearn)

• NumPy (for numerical operations)

• Anaconda (recommended for environment management) or Google Colab

Dataset:

A sample diabetes dataset is provided in the code. This dataset includes the following
features:

• Pregnancies: Number of pregnancies.

• Glucose: Glucose level.

• BloodPressure: Blood pressure.

• SkinThickness: Skin thickness.

• Insulin: Insulin level.

• BMI: Body mass index.

• DiabetesPedigreeFunction: Diabetes pedigree function.

• Age: Age.

• Outcome: Diabetes status (1: Diabetes, 0: No Diabetes).
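
Before preprocessing, it can help to count how many entries in each column are NaN and how many are 0, since a blood pressure or BMI of 0 is not a plausible measurement and may stand in for an unrecorded value. The following is a minimal sketch on three columns copied from the sample data; the names `missing_counts` and `zero_counts` are illustrative and do not appear in the original program.

```python
import numpy as np
import pandas as pd

# Three columns copied from the sample dataset used in this task.
df = pd.DataFrame({
    "BloodPressure": [72, 66, 64, 66, 40, 74, 50, 0, 70, 96],
    "Insulin": [0, 0, 0, 94, 168, 0, 88, 0, 543, 0],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, np.nan, 30.5, 0.0],
})

# Count explicit missing values (NaN) per column.
missing_counts = df.isna().sum()

# Count zero entries per column; NaN does not compare equal to 0.
zero_counts = (df == 0).sum()

print("NaN per column:\n", missing_counts)
print("Zeros per column:\n", zero_counts)
```

Only the NaN in 'BMI' is picked up by an imputer configured for missing values; the zero entries pass through unchanged unless they are converted to NaN first.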

Procedure:

1. Data Preprocessing:

1. Load the Data:


o Create a Pandas DataFrame from the provided sample diabetes dataset.

o Print the original DataFrame to inspect the raw data.

2. Handle Missing Values:

o Identify and handle missing values in the 'BMI' column using SimpleImputer
with the mean strategy.

o Print the DataFrame after imputation to verify the changes.

3. Scale Numerical Features:

o Scale the numerical features ('Pregnancies', 'Glucose', 'BloodPressure',
'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age') using
StandardScaler.

o Print the DataFrame after scaling to observe the transformed data.
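
The imputation and scaling steps can be checked in isolation on the 'BMI' column alone. This is a small sketch, not the full program; the names `bmi`, `imputed`, `filled_value`, and `scaled` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# The BMI column from the sample dataset, including its one NaN (row 7).
bmi = pd.DataFrame(
    {"BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, np.nan, 30.5, 0.0]}
)

# Mean imputation: the NaN is replaced by the mean of the nine observed values.
imputed = SimpleImputer(strategy="mean").fit_transform(bmi)
filled_value = imputed[7, 0]

# Standardization: subtract the column mean and divide by the population
# standard deviation, giving a column with mean 0 and standard deviation 1.
scaled = StandardScaler().fit_transform(imputed)

print("Imputed BMI for row 7:", round(filled_value, 6))
print("Scaled mean:", scaled.mean(), "Scaled std:", scaled.std())
```

Because the imputed row holds the column mean, its scaled value is 0, which is a quick way to spot imputed entries in the scaled DataFrame.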

2. Feature Engineering:

1. Create Interaction Feature:

o Create a new feature 'Glucose_BMI' by multiplying the 'Glucose' and 'BMI'
columns.

o Print the DataFrame after adding the new feature.
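
Since the procedure scales the columns before creating 'Glucose_BMI', the new feature is a product of standardized values rather than of raw measurements. The sketch below, using the first four sample rows, illustrates the difference; `raw`, `scaled`, and `raw_product` are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# First four Glucose and BMI values from the sample dataset.
raw = pd.DataFrame({
    "Glucose": [148, 85, 183, 89],
    "BMI": [33.6, 26.6, 23.3, 28.1],
})

# Product of the raw measurements: always positive for positive inputs.
raw_product = raw["Glucose"] * raw["BMI"]

# Product of the standardized columns, as in the procedure above.
scaled = pd.DataFrame(StandardScaler().fit_transform(raw), columns=raw.columns)
scaled["Glucose_BMI"] = scaled["Glucose"] * scaled["BMI"]

print("Raw product:\n", raw_product)
print("Scaled product:\n", scaled["Glucose_BMI"])
```

The scaled product is positive when both values lie on the same side of their column means and negative otherwise, so it carries different information from the raw product.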

3. Machine Learning Model Building and Evaluation:

1. Prepare Data for Model:

o Define the features (X) by dropping the 'Outcome' column.

o Define the target variable (y) as the 'Outcome' column.

2. Split Data:

o Use train_test_split to split the dataset into training and testing sets (80%
training, 20% testing) with random_state=42 for reproducibility.

o Print the shapes and contents of the training and testing sets.

3. Train the Model:

o Create a LogisticRegression model.

o Train the model using the training data (X_train, y_train).

4. Make Predictions:

o Use the trained model to make predictions on the testing data (X_test).

o Print the predictions (y_pred).

5. Evaluate the Model:

o Calculate the accuracy of the model using accuracy_score.

o Generate a classification report using classification_report to assess precision,
recall, and F1-score.

o Print the accuracy and the classification report.
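
To see how the evaluation metrics behave on a two-sample test set, the sketch below applies them to the test labels and predictions recorded in the Output section, and to a second, hypothetical set of predictions (`y_pred_alt`) that is included only for comparison.

```python
from sklearn.metrics import accuracy_score, classification_report

y_test = [1, 0]      # test labels from the recorded run
y_pred = [1, 0]      # predictions from the recorded run
y_pred_alt = [1, 1]  # hypothetical predictions, for comparison only

accuracy = accuracy_score(y_test, y_pred)
accuracy_alt = accuracy_score(y_test, y_pred_alt)

# output_dict=True returns the report as a nested dict keyed by class label;
# zero_division=0 reports 0 for a class that is never predicted.
report_alt = classification_report(
    y_test, y_pred_alt, output_dict=True, zero_division=0
)

print("Accuracy (recorded run):", accuracy)
print("Accuracy (hypothetical):", accuracy_alt)
print("Class 1 precision (hypothetical):", report_alt["1"]["precision"])
print("Class 0 recall (hypothetical):", report_alt["0"]["recall"])
```

With only two test samples, accuracy can take just three values (0.0, 0.5, or 1.0), so a perfect score here says little about how the model would perform on more data.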

Program :

Python

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# 1. Sample Diabetes Dataset
data = {
    'Pregnancies': [6, 1, 8, 1, 0, 5, 3, 10, 2, 4],
    'Glucose': [148, 85, 183, 89, 137, 116, 78, 115, 197, 125],
    'BloodPressure': [72, 66, 64, 66, 40, 74, 50, 0, 70, 96],
    'SkinThickness': [35, 29, 0, 23, 35, 0, 32, 0, 45, 0],
    'Insulin': [0, 0, 0, 94, 168, 0, 88, 0, 543, 0],
    'BMI': [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, np.nan, 30.5, 0.0],
    'DiabetesPedigreeFunction': [0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, 0.134, 0.158,
                                 0.177],
    'Age': [50, 31, 32, 21, 33, 30, 26, 29, 53, 41],
    'Outcome': [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]
}
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

# 2. Data Preprocessing
# Replace the missing BMI value (NaN) with the mean of the observed values.
imputer = SimpleImputer(strategy='mean')
df['BMI'] = imputer.fit_transform(df[['BMI']]).ravel()
print("\nDataFrame after Imputing BMI:\n", df)

# Standardize every feature column to zero mean and unit variance.
scaler = StandardScaler()
numerical_cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
                  'DiabetesPedigreeFunction', 'Age']
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
print("\nDataFrame after Scaling:\n", df)

# 3. Feature Engineering
# Interaction feature: product of the (already scaled) Glucose and BMI columns.
df['Glucose_BMI'] = df['Glucose'] * df['BMI']
print("\nDataFrame after Feature Engineering:\n", df)

# 4. Model Building and Evaluation
X = df.drop('Outcome', axis=1)
y = df['Outcome']

# Hold out 20% of the rows (2 of 10) for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("\nTraining Data (X_train):\n", X_train)
print("\nTesting Data (X_test):\n", X_test)
print("\nTraining Labels (y_train):\n", y_train)
print("\nTesting Labels (y_test):\n", y_test)

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("\nPredictions (y_pred):\n", y_pred)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print("\nAccuracy:", accuracy)
print("\nClassification Report:\n", report)
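
The program above fits the imputer and the scaler on all ten rows before splitting, so the two test rows contribute to the preprocessing statistics. One common variation, sketched here as an alternative and not as the method used in this task, is to split first and wrap the preprocessing and the model in a scikit-learn Pipeline so that the imputer and scaler are fitted on the training rows only. To keep the sketch short it uses just three of the feature columns, and the name `pipeline` is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A three-feature subset of the sample dataset (assumption: chosen for brevity).
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116, 78, 115, 197, 125],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, np.nan, 30.5, 0.0],
    "Age": [50, 31, 32, 21, 33, 30, 26, 29, 53, 41],
    "Outcome": [1, 0, 1, 0, 1, 0, 0, 1, 1, 0],
})
X = df.drop("Outcome", axis=1)
y = df["Outcome"]

# Split first, so preprocessing never sees the test rows during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Imputer and scaler are fitted inside the pipeline, on X_train only.
pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(random_state=42),
)
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
print("Predictions:", y_pred)
```

On a ten-row sample the practical difference is small, but the same structure carries over unchanged to a larger dataset.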

Output :

Original DataFrame:

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \

0 6 148 72 35 0 33.6

1 1 85 66 29 0 26.6

2 8 183 64 0 0 23.3

3 1 89 66 23 94 28.1

4 0 137 40 35 168 43.1

5 5 116 74 0 0 25.6

6 3 78 50 32 88 31.0

7 10 115 0 0 0 NaN

8 2 197 70 45 543 30.5

9 4 125 96 0 0 0.0
DiabetesPedigreeFunction Age Outcome

0 0.627 50 1

1 0.351 31 0

2 0.672 32 1

3 0.167 21 0

4 2.288 33 1

5 0.201 30 0

6 0.248 26 0

7 0.134 29 1

8 0.158 53 1

9 0.177 41 0

DataFrame after Imputing BMI:

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \

0 6 148 72 35 0 33.600000

1 1 85 66 29 0 26.600000

2 8 183 64 0 0 23.300000

3 1 89 66 23 94 28.100000

4 0 137 40 35 168 43.100000

5 5 116 74 0 0 25.600000

6 3 78 50 32 88 31.000000

7 10 115 0 0 0 26.866667

8 2 197 70 45 543 30.500000

9 4 125 96 0 0 0.000000
DiabetesPedigreeFunction Age Outcome

0 0.627 50 1

1 0.351 31 0

2 0.672 32 1

3 0.167 21 0

4 2.288 33 1

5 0.201 30 0

6 0.248 26 0

7 0.134 29 1

8 0.158 53 1

9 0.177 41 0

DataFrame after Scaling:

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \

0 0.645497 0.544471 0.501265 0.885345 -0.553913 0.648853

1 -0.968246 -1.112615 0.254741 0.533552 -0.553913 -0.025697

2 1.290994 1.465074 0.172566 -1.166779 -0.553913 -0.343699

3 -0.968246 -1.007403 0.254741 0.181760 0.029153 0.118849

4 -1.290994 0.255139 -0.813528 0.885345 0.488163 1.564314

5 0.322749 -0.297223 0.583439 -1.166779 -0.553913 -0.122061

6 -0.322749 -1.296735 -0.402655 0.709449 -0.008064 0.398306

7 1.936492 -0.323526 -2.457018 -1.166779 -0.553913 0.000000

8 -0.645497 1.833316 0.419090 1.471666 2.814225 0.350124

9 0.000000 -0.060497 1.487359 -1.166779 -0.553913 -2.588989


DiabetesPedigreeFunction Age Outcome

0 0.200095 1.579674 1

1 -0.242777 -0.369274 0

2 0.272302 -0.266698 1

3 -0.538025 -1.395037 0

4 2.865348 -0.164122 1

5 -0.483468 -0.471851 0

6 -0.408052 -0.882156 0

7 -0.590977 -0.574427 1

8 -0.552466 1.887403 1

9 -0.521979 0.656488 0

DataFrame after Feature Engineering:

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \

0 0.645497 0.544471 0.501265 0.885345 -0.553913 0.648853

1 -0.968246 -1.112615 0.254741 0.533552 -0.553913 -0.025697

2 1.290994 1.465074 0.172566 -1.166779 -0.553913 -0.343699

3 -0.968246 -1.007403 0.254741 0.181760 0.029153 0.118849

4 -1.290994 0.255139 -0.813528 0.885345 0.488163 1.564314

5 0.322749 -0.297223 0.583439 -1.166779 -0.553913 -0.122061

6 -0.322749 -1.296735 -0.402655 0.709449 -0.008064 0.398306

7 1.936492 -0.323526 -2.457018 -1.166779 -0.553913 0.000000

8 -0.645497 1.833316 0.419090 1.471666 2.814225 0.350124

9 0.000000 -0.060497 1.487359 -1.166779 -0.553913 -2.588989


DiabetesPedigreeFunction Age Outcome Glucose_BMI

0 0.200095 1.579674 1 0.353282

1 -0.242777 -0.369274 0 0.028591

2 0.272302 -0.266698 1 -0.503545

3 -0.538025 -1.395037 0 -0.119729

4 2.865348 -0.164122 1 0.399117

5 -0.483468 -0.471851 0 0.036280

6 -0.408052 -0.882156 0 -0.516497

7 -0.590977 -0.574427 1 -0.000000

8 -0.552466 1.887403 1 0.641887

9 -0.521979 0.656488 0 0.156625

Training Data (X_train):

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \

5 0.322749 -0.297223 0.583439 -1.166779 -0.553913 -0.122061

0 0.645497 0.544471 0.501265 0.885345 -0.553913 0.648853

7 1.936492 -0.323526 -2.457018 -1.166779 -0.553913 0.000000

2 1.290994 1.465074 0.172566 -1.166779 -0.553913 -0.343699

9 0.000000 -0.060497 1.487359 -1.166779 -0.553913 -2.588989

4 -1.290994 0.255139 -0.813528 0.885345 0.488163 1.564314

3 -0.968246 -1.007403 0.254741 0.181760 0.029153 0.118849

6 -0.322749 -1.296735 -0.402655 0.709449 -0.008064 0.398306

DiabetesPedigreeFunction Age Glucose_BMI

5 -0.483468 -0.471851 0.036280

0 0.200095 1.579674 0.353282

7 -0.590977 -0.574427 -0.000000

2 0.272302 -0.266698 -0.503545


9 -0.521979 0.656488 0.156625

4 2.865348 -0.164122 0.399117

3 -0.538025 -1.395037 -0.119729

6 -0.408052 -0.882156 -0.516497

Testing Data (X_test):

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \

8 -0.645497 1.833316 0.419090 1.471666 2.814225 0.350124

1 -0.968246 -1.112615 0.254741 0.533552 -0.553913 -0.025697

DiabetesPedigreeFunction Age Glucose_BMI

8 -0.552466 1.887403 0.641887

1 -0.242777 -0.369274 0.028591

Training Labels (y_train):

5 0

0 1

7 1

2 1

9 0

4 1

3 0

6 0

Name: Outcome, dtype: int64

Testing Labels (y_test):

8 1
1 0

Name: Outcome, dtype: int64

Predictions (y_pred):

[1 0]

Accuracy: 1.0

Results:

• Record the original DataFrame.

• Record the DataFrame after imputation and scaling.

• Record the DataFrame after feature engineering.

• Record the shapes and content of the training and testing sets.

• Record the predictions made by the model.

• Record the accuracy and classification report of the model.
