Final MLA File for Practical

Practical 1:

Problem Statement:
Extract the data from the database using Python.

Dataset Characteristics:
This dataset, referred to as biodeg, contains chemical descriptors that quantify various molecular properties.
Here’s a brief overview of the key columns:
- SpMax_L, J_Dz(e), nHM: These represent different topological or quantum-chemical descriptors that relate to the structure and reactivity of the molecules.
- F01[N-N], F04[C-N], F03[C-N], F03[C-O]: These denote the presence of specific chemical bonds (e.g., nitrogen-nitrogen, carbon-nitrogen, carbon-oxygen) and their influence on molecular properties.
- NssssC, nCb-, C%, nCp, nO: Indicators of certain structural elements like types of carbon (C), carbon bonds, or oxygen atoms in the molecular structure.
- SdssC, HyWi_B(m), LOC, SM6_L: Various structural and electronic properties, including measures like size, polarity, or molecular location.
- Me, Mi, nN-N, nArNO2: Additional descriptors like molecular weight, specific interactions, and functional groups (e.g., aromatic nitro groups).

Each row represents a different molecule, with the numerical values providing details about its structural and
chemical characteristics, commonly used for predicting properties like biological activity or toxicity in
cheminformatics.
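A minimal sketch of the extraction step, assuming the biodeg data has already been loaded into a SQLite database; the file name chemdata.db and table name biodeg are placeholders (for a CSV export, pandas.read_csv would serve the same purpose):

import sqlite3
import pandas as pd

# Connect to the database and pull the full biodeg table into a DataFrame
conn = sqlite3.connect("chemdata.db")
df = pd.read_sql_query("SELECT * FROM biodeg", conn)
conn.close()

print(df.shape)    # number of molecules x number of descriptors
print(df.head())   # first few rows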

Results:

Practical 2:
Problem Statement:
Write a program to implement linear and logistic regression.

Linear Regression:
Data Statistics:
1) Dataset Size: 414 rows (each representing a unique house entry), 9 columns in total
2) Features:
- Transaction Date: The date of the house transaction.
- House Age: Age of the house in years.
- Distance from Nearest Metro Station: Distance in kilometers.
- Number of Convenience Stores: Count of nearby convenience stores.
- Latitude: Geographic coordinate.
- Longitude: Geographic coordinate.
- Number of Bedrooms: Count of bedrooms in the house.
- House Size (sqft): Size of the house in square feet.
- House Price of Unit Area: Price per square foot.
3) Data Quality:
1. No duplicate values
2. No null values
4) Statistical Summary:

Technique:
Linear Regression: This technique models the relationship between the dependent variable (house price)
and independent variables (features like house age and size).

Algorithm:
Step 1: Data Preparation
Load the dataset.
Select the independent variable X and the dependent variable y:
X = Transaction date, House Age, Distance from nearest Metro station (km), Number of convenience
stores, latitude, longitude, Number of bedrooms, House size (sqft)
y = House price of unit area

Step 2: Split the Dataset


Divide the dataset into training and testing sets (e.g., 80% training, 20% testing).

Step 3: Model Initialization


Initialize the linear regression model.

Step 4: Model Training


Fit the model to the training data using the formula:

\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_8 x_8
Step 5: Prediction
Use the trained model to predict house prices on the test dataset

Step 6: Model Evaluation


Assess model performance using the following metrics:
Mean Squared Error (MSE):

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

R-squared (R²):

R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}
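A minimal sketch of Steps 1-6 with scikit-learn; the CSV file name (real_estate.csv) and the exact column names are assumptions based on the feature list above:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Step 1: load data; file and column names are illustrative placeholders
df = pd.read_csv("real_estate.csv")
X = df.drop(columns=["House Price of Unit Area"])  # the 8 feature columns
y = df["House Price of Unit Area"]

# Step 2: 80/20 train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Steps 3-4: initialize and fit the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Steps 5-6: predict and evaluate
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R² :", r2_score(y_test, y_pred))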

Results:
Comparison of evaluation metrics for Linear Regression

Logistic Regression:
Data Statistics:
1) Data Size: 380 rows (each representing a unique Social Network Ads entry), 5 columns in total

2) Features:
- User ID: A unique identifier for each user.
- Gender: Categorical feature with values "Male" and "Female".
- Age: The age of the user, a numerical feature.
- EstimatedSalary: The user's estimated annual salary, a numerical feature.
- Purchased: Target variable, indicating whether the user purchased a product (1 for yes, 0 for no).

3) Data Quality:
1. Duplicate values removed
2. No null values

4) Statistical Summary:

Technique:
Logistic Regression: A statistical method used for binary classification tasks, where the goal is to model the
probability of a certain class or event.
Algorithm:
Step 1: Data Preparation
Load the dataset.
Drop unnecessary columns (e.g., 'User ID').
Map categorical variables (e.g., convert 'Gender' to numerical values: Male = 0, Female = 1).
Check for missing values and duplicates; handle them as necessary (e.g., drop duplicates).

Step 2: Feature Selection


Select independent variables X and dependent variable y:

Step 3: Split the Dataset


Divide the dataset into training and testing sets (e.g., 80% training, 20% testing).

Step 4: Feature Scaling


Apply standard scaling to normalize the features in X using StandardScaler.

Step 5: Model Initialization


Initialize the logistic regression model.

Step 6: Model Training
Fit the model to the training data. The logistic regression model uses the sigmoid function for prediction:

\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}, \quad z = w^T x + b
Step 7: Prediction
Use the trained model to predict outcomes on the test dataset.

Step 8: Model Evaluation


Assess model performance using the following metrics:
Confusion Matrix: a 2×2 table of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
Accuracy Score:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

Precision Score:

Precision = \frac{TP}{TP + FP}

Recall Score:

Recall = \frac{TP}{TP + FN}

F1 Score:

F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}
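A minimal sketch of Steps 1-8 with scikit-learn; the file name Social_Network_Ads.csv is a placeholder:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# Steps 1-2: load, clean, and select features
df = pd.read_csv("Social_Network_Ads.csv")           # placeholder file name
df = df.drop(columns=["User ID"]).drop_duplicates()  # drop ID column and duplicates
df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1})

X = df[["Gender", "Age", "EstimatedSalary"]]
y = df["Purchased"]

# Step 3: 80/20 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: scale features (fit on training data only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Steps 5-7: initialize, train, predict
clf = LogisticRegression()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Step 8: evaluate
print(confusion_matrix(y_test, y_pred))
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))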

Result:
1) Confusion Matrix:

2) Performance:

Practical 3:
Problem Statement:
Write a program to implement the naïve Bayesian classifier for a sample training
data set stored as a .CSV file. Compute the accuracy of the classifier, considering
a few test data sets.
Data Statistics:
1) Data Size: 3233 rows (each representing a unique data entry), 8 columns
2) Features:
- long_hair (binary): Indicates whether the person has long hair (1) or not (0).
- forehead_width_cm (numerical): The width of the person's forehead in centimeters.
- forehead_height_cm (numerical): The height of the person's forehead in centimeters.
- nose_wide (binary): Indicates if the person's nose is wide (1) or not (0).
- nose_long (binary): Indicates if the person's nose is long (1) or not (0).
- lips_thin (binary): Indicates if the person has thin lips (1) or not (0).
- distance_nose_to_lip_long (binary): Indicates if the distance between the nose and lip is long (1) or not (0).
- gender (categorical): Indicates the gender of the person (Male/Female).

3) Data Quality:
- Missing Values: There were no missing values in the dataset, ensuring completeness and avoiding potential bias in the model's predictions.
- Duplicate Records: The dataset contained some duplicate rows, which were identified and removed to prevent redundancy and improve model accuracy.

4) Statistical Summary:

Technique:
The Naive Bayes technique is a probabilistic classification method based on Bayes' theorem, assuming
independence among predictors. It calculates the probability of a sample belonging to each class and
predicts the class with the highest probability.
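In symbols, with the features x_1, ..., x_n assumed conditionally independent given the class y, the prediction rule the paragraph describes is:

\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)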

Algorithm:
Step 1: Data Preparation
- Load the dataset.
- Map categorical variables (e.g., convert 'Gender' to numerical values: Male = 1, Female = 0).
- Check for missing values and duplicates; handle them (e.g., drop duplicates).

Step 2: Feature Selection


Select independent variables X and dependent variable y (gender).

Step 3: Split the Dataset


Split into training (80%) and testing (20%) sets using train_test_split:

Step 4: Feature Scaling


Apply StandardScaler to normalize the X features:

Step 5: Model Initialization


Initialize the Naive Bayes model (e.g., GaussianNB) using:

Step 6: Model Training

Fit the model to the training data:

Step 7: Prediction
Predict outcomes on the test dataset:

Step 8: Model Evaluation


Confusion Matrix:

Accuracy Score:

Precision Score:

Recall Score:

F1 Score:
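A minimal sketch of Steps 1-8 with scikit-learn; the file name gender_classification.csv is a placeholder, and GaussianNB is an assumed choice of Naive Bayes variant:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, classification_report

# Step 1: load, encode the target, drop duplicates
df = pd.read_csv("gender_classification.csv")       # placeholder file name
df["gender"] = df["gender"].map({"Male": 1, "Female": 0})
df = df.drop_duplicates()

# Steps 2-3: select features/target and split 80/20
X = df.drop(columns=["gender"])
y = df["gender"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Steps 5-7: initialize, train, predict
nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)

# Step 8: evaluate
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))  # accuracy, precision, recall, F1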

Result:
1) Confusion Matrix:

2) Performance:

Practical 4:
Problem Statement:
Write a program to implement the k-Nearest Neighbors (KNN) and Support Vector
Machine (SVM) classifiers for a sample training data set stored as a .CSV file.
Compute the accuracy of the classifiers, considering a few test data sets.
Data Statistics:
1) Data Size: 3233 rows (each representing a unique data entry), 8 columns
2) Features:
- long_hair (binary): Indicates whether the person has long hair (1) or not (0).
- forehead_width_cm (numerical): The width of the person's forehead in centimeters.
- forehead_height_cm (numerical): The height of the person's forehead in centimeters.
- nose_wide (binary): Indicates if the person's nose is wide (1) or not (0).
- nose_long (binary): Indicates if the person's nose is long (1) or not (0).
- lips_thin (binary): Indicates if the person has thin lips (1) or not (0).
- distance_nose_to_lip_long (binary): Indicates if the distance between the nose and lip is long (1) or not (0).
- gender (categorical): Indicates the gender of the person (Male/Female).

3) Data Quality:
- Missing Values: There were no missing values in the dataset, ensuring completeness and avoiding potential bias in the model's predictions.
- Duplicate Records: The dataset contained some duplicate rows, which were identified and removed to prevent redundancy and improve model accuracy.

4) Statistical Summary:

Technique:
- K-Nearest Neighbors (KNN):
KNN is an instance-based learning algorithm that classifies instances based on the majority class of
their k nearest neighbors, using a distance metric (like Euclidean distance) to determine proximity.

- Support Vector Machine (SVM):
SVM constructs a hyperplane that maximizes the margin between classes by finding the optimal
separating line (or hyperplane) while using support vectors, and can utilize kernel functions to handle
non-linear data.

Algorithm:
K-Nearest Neighbors (KNN) and Support Vector Machine (SVM) Implementation
Step 1: Data Preparation
1. Import Necessary Libraries:
Import libraries such as NumPy, Pandas, Matplotlib, Seaborn, and warnings.

2. Load the Dataset:


Load the dataset from a CSV file and display the first few rows.

3. Map Categorical Variables:


Convert categorical variables (e.g., 'Gender') to numerical values:

4. Check for Missing Values and Duplicates:

Check for missing values and duplicates, and visualize feature correlations with a heatmap.

5. Outlier Detection:
Create a boxplot to visualize potential outliers in the dataset.

Step 2: Feature Selection


Select the independent variables X and the dependent variable y:

Step 3: Split the Dataset


Divide the dataset into training and testing sets (e.g., 80% training, 20% testing):

Step 4: Feature Scaling


Normalize the features using StandardScaler:

Step 5: Model Initialization (KNN)
Initialize K-Nearest Neighbors Model:
Create an instance of the KNeighborsClassifier with a specified number of neighbors k.

Step 6: Model Training (KNN)


Train the KNN model using the training data.

Step 7: Prediction (KNN)


Use the trained KNN model to make predictions on the test dataset.

Step 8: Model Initialization (SVM)


Create an instance of the SVC model with a specified kernel (e.g., RBF).

Step 9: Model Training (SVM)


Train the SVM model using the training data.

Step 10: Prediction (SVM)


Use the trained SVM model to make predictions on the test dataset.

Step 11: Model Evaluation (KNN and SVM):


Confusion Matrix:

Accuracy Score:

Precision Score:

Recall Score:

F1 Score:

KNN Algorithm Formula:

For KNN, the prediction for a new instance is made based on the majority class among the k nearest
neighbors. The distance can be calculated using various metrics, with the Euclidean distance being the most
common:

d(x, x') = \sqrt{\sum_{i=1}^{n} (x_i - x'_i)^2}

SVM Algorithm Formula:

For SVM, the decision boundary is defined by the hyperplane that maximizes the margin between classes.
The decision function can be represented as:

f(x) = \operatorname{sign}(w^T x + b)

Where,
- w is the weight vector (normal to the hyperplane).
- b is the bias term.

The optimization problem for SVM (hard margin) is defined as:

\min_{w, b} \; \frac{1}{2} \lVert w \rVert^2

Subject to:

y_i (w^T x_i + b) \geq 1 \quad \text{for all } i
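A minimal end-to-end sketch for both classifiers on the same gender dataset as Practical 3; the file name is again a placeholder, and k = 5 is an assumed neighbor count (the RBF kernel follows Step 8):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Steps 1-4: load, encode, split, scale
df = pd.read_csv("gender_classification.csv")   # placeholder file name
df["gender"] = df["gender"].map({"Male": 1, "Female": 0})
df = df.drop_duplicates()

X = df.drop(columns=["gender"])
y = df["gender"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Steps 5-7: KNN
knn = KNeighborsClassifier(n_neighbors=5)       # k = 5 is an assumed choice
knn.fit(X_train, y_train)
print("KNN:\n", classification_report(y_test, knn.predict(X_test)))

# Steps 8-11: SVM with RBF kernel
svm = SVC(kernel="rbf")
svm.fit(X_train, y_train)
print("SVM:\n", classification_report(y_test, svm.predict(X_test)))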

Result:
1) Confusion matrix:

2) Performance:

Practical 5:
Problem statement:
Implement classification of a given dataset using Random Forest.

Data statistics:
1) Data Size: 1599 rows (each representing a unique data entry), 12 columns
2) Features:
- Fixed Acidity: The amount of non-volatile acids in wine.
- Volatile Acidity: The amount of acetic acid in wine, which can impact its taste.
- Citric Acid: Adds freshness and flavor to the wine.
- Residual Sugar: The amount of sugar left after fermentation, which affects sweetness.
- Chlorides: The amount of salt in the wine, impacting flavor.
- Free Sulfur Dioxide: The level of free SO₂, which acts as a preservative.
- Total Sulfur Dioxide: The total amount of SO₂ in the wine.
- Density: The density of the wine solution, which can relate to its alcohol content.
- pH: A measure of acidity, influencing taste and stability.
- Sulphates: A compound that can enhance flavor and preservation.
- Alcohol: The alcohol content of the wine, usually expressed as a percentage.
The twelfth column, the wine quality score, serves as the target variable for classification.

3) Data Quality:
- Duplicates: The dataset was checked for duplicates, and any found were removed to ensure the integrity of the analysis.
- Missing Values: An initial check for missing values was performed, although the code does not apply any further handling if any are present.

4) Statistical Summary:

Technique:
The Random Forest algorithm is an ensemble learning technique that constructs a multitude of decision trees
during training and outputs the mode of their predictions for classification tasks or the mean for regression
tasks. It operates by utilizing bootstrap sampling (bagging) to create diverse subsets of the data, which helps
to reduce overfitting and improve model accuracy. Additionally, Random Forest randomly selects a subset
of features for each tree, enhancing the model's generalization ability and robustness against noise in the
data.

Algorithm:
Step 1: Data Preparation
- Load the dataset.
- Read the dataset from a CSV file using a library like Pandas.
- Check for missing values and duplicates; handle them as necessary (e.g., drop duplicates).

Step 2: Feature Selection


Select independent variables X and dependent variable y.

Step 3: Split the Dataset


Divide the dataset into training and testing sets (e.g., 80% training, 20% testing).

Step 4: Feature Scaling


Apply standard scaling to normalize the features in X using StandardScaler.

Step 5: Model Initialization


Initialize the Random Forest model. The Random Forest algorithm creates a "forest" of decision trees
by randomly selecting subsets of the training data and features.

Step 6: Model Training


Fit the model to the training data. For classification, the final prediction is the majority vote (mode) of
the predictions from all individual trees:

\hat{y} = \operatorname{mode}\{ T_1(x), T_2(x), \dots, T_B(x) \}

Step 7: Prediction
Use the trained model to predict outcomes on the test dataset.
Step 8: Model Evaluation
Assess model performance using metrics:
Confusion Matrix:

Accuracy Score:

Precision Score:

Recall Score:

F1 Score:
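A minimal sketch of the pipeline; the file name winequality-red.csv and the target column name quality are assumptions (the UCI copy of this dataset is semicolon-separated, so sep=";" may be needed):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Step 1: load and deduplicate; file/column names are placeholders
df = pd.read_csv("winequality-red.csv")
df = df.drop_duplicates()

# Steps 2-3: features/target and 80/20 split
X = df.drop(columns=["quality"])   # 'quality' assumed to be the target column
y = df["quality"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Steps 5-7: a forest of 100 trees, trained and applied to the test set
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# Step 8: evaluate
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))  # accuracy, precision, recall, F1 per class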

Results:
1) Confusion matrix:

2) Performance:

Practical 6:
Problem statement:
Build an Artificial Neural Network (ANN) by implementing the Backpropagation
algorithm and test it using appropriate data sets.
Data statistics:

1) Data Size:
The dataset contains a total of 1,000 entries (rows) and features related to citrus fruit classification,
with the target variable being the fruit type (orange or grapefruit).

2) Features:
- Size: Measurement related to the dimensions of the fruit.
- Weight: The weight of the fruit.
- Color: Attributes that describe the color of the fruit.
- Sweetness: A measure of how sweet the fruit is.
- Acidity: The level of acidity in the fruit.
- Firmness: The firmness or texture of the fruit.
- Shape: Characteristics that describe the shape of the fruit.

3) Data Quality:
The dataset has no missing values, as confirmed by the `.isna().sum()` check, and there are no
duplicate entries, ensuring high data quality for training the model.

4) Statistical Summary:

Technique:
In my project, I used Deep Learning techniques, specifically implementing a Neural Network to classify
citrus fruits, distinguishing between oranges and grapefruits. The model is structured as a Sequential model
using TensorFlow's Keras API, featuring multiple Dense layers with ReLU activation functions in the
hidden layers and a sigmoid activation function in the output layer, which allows for effective binary
classification.

Algorithm:
Step 1: Data Preprocessing
- Load the dataset and perform necessary preprocessing, including:
  - Handling missing values.
  - Encoding categorical variables.
  - Splitting the dataset into features X and labels y.

Step 2: Train-Test Split


Divide the dataset into training and testing sets.

Step 3: Initialize the Neural Network


Define the structure of the neural network:
Input layer with n features.
Hidden layers with a specified number of neurons and activation functions (e.g., ReLU).
Output layer with 1 neuron and sigmoid activation for binary classification.

Step 4: Forward Propagation

For each neuron in the hidden layers, calculate the weighted sum:

z = \sum_{i} w_i x_i + b

Apply the activation function (e.g., ReLU):

a = \mathrm{ReLU}(z) = \max(0, z)

For the output layer, use the sigmoid activation:

\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}

Step 5: Loss Calculation

Compute the loss using the binary cross-entropy loss function:

L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]

Step 6: Backpropagation
Calculate the gradients using the chain rule:

\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}

Update the weights using gradient descent (learning rate \eta):

w \leftarrow w - \eta \, \frac{\partial L}{\partial w}

Step 7: Model Training
Repeat steps 4 to 6 for a specified number of epochs or until convergence.

Step 8: Prediction
Use the trained model to make predictions on the test set

Step 9: Model Evaluation


Accuracy Score:

Precision Score:

Recall Score:

F1 Score:
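A minimal sketch of the Keras pipeline the Technique section describes; the file name citrus.csv, the target column name, the layer sizes, and the epoch count are all assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

# Steps 1-2: load, encode the assumed target column, split
df = pd.read_csv("citrus.csv")                                # placeholder file name
df["name"] = df["name"].map({"orange": 0, "grapefruit": 1})   # assumed target column

X = df.drop(columns=["name"])
y = df["name"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Step 3: Sequential model with ReLU hidden layers and a sigmoid output
model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

# Steps 4-7: binary cross-entropy loss; the optimizer performs the weight updates
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=0)

# Steps 8-9: threshold the sigmoid output at 0.5 and evaluate
y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()
print("Test accuracy:", (y_pred == y_test.to_numpy()).mean())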

Results:
Confusion Matrix:

Performance :

Practical 9:
Problem statement:
Write a program for empirical comparison of different supervised learning algorithms.
Data statistics:
1. Data Size: 200 rows, 6 columns (including the target variable 'Drug')

2. Features:
- Age (continuous)
- Sex (categorical, label encoded as 0 or 1)
- BP (blood pressure; categorical, label encoded as 0, 1, or 2)
- Cholesterol (categorical, label encoded as 0, 1, or 2)
- Na_to_K (continuous; sodium-to-potassium ratio in the blood)
- Drug (categorical target variable, label encoded as 0, 1, 2, 3, or 4)

3. Data Quality:
- No missing values: As seen from the `.info()` method, all columns are non-null.
- Categorical Features Encoded: Features such as Sex, BP, Cholesterol, and Drug have been label encoded to ensure compatibility with machine learning algorithms.
- Feature Scaling: The continuous features have been normalized using `MinMaxScaler` to bring all features into the same range for better model performance.
- Correlation Matrix: A correlation heatmap was generated to observe relationships between features, although for this dataset (mostly categorical) the correlations might not be as informative.

4. Statistical Summary:

Technique:
- Random Forest Classifier:
Random Forest is an ensemble method that builds multiple decision trees using random subsets of
data and features, and outputs the majority vote for classification. It reduces overfitting, improves
accuracy, and handles high-dimensional data well.

- Logistic Regression:
Logistic Regression is a linear model that predicts the probability of a class using a sigmoid function,
making it suitable for binary and multiclass classification. It is simple, interpretable, and works well
with linearly separable data.

- Decision Tree Classifier:
Decision Trees split data based on feature values, forming a tree structure where each node
represents a decision rule. They are easy to visualize and interpret and can handle both categorical
and numerical data, but are prone to overfitting.

Algorithm:
Step 1: Data Preparation
- Load the dataset.
- Inspect the dataset.
- Check for missing values and duplicates.
- Label Encoding: Map categorical variables to numerical values (e.g., BP, Cholesterol, Sex, Drug).

Step 2: Feature Selection


Select independent variables X and dependent variable y:

Step 3: Split the Dataset


Divide the dataset into training and testing sets (80% training, 20% testing)

Step 4: Data Normalization


Scale the data to normalize feature ranges using MinMaxScaler.

Step 5: Model Initialization & Training


Random Forest Classifier:

Logistic Regression:

Decision Tree Classifier:

Step 6: Prediction
Random Forest Predictions:

Logistic Regression Predictions:

Decision Tree Predictions:

Step 7: Model Evaluation


Confusion Matrix

Accuracy Score:

Precision Score:

Recall Score:

F1 Score:
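A minimal sketch of the empirical comparison; the file name drug200.csv is a placeholder, and macro-averaged F1 is an assumed choice for the multiclass setting:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

# Step 1: load and label-encode the categorical columns
df = pd.read_csv("drug200.csv")   # placeholder file name
for col in ["Sex", "BP", "Cholesterol", "Drug"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# Steps 2-3: features/target and 80/20 split
X = df.drop(columns=["Drug"])
y = df["Drug"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: normalize feature ranges
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Steps 5-7: train all three models and compare on the same test set
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{name}: accuracy={accuracy_score(y_test, y_pred):.3f}, "
          f"macro-F1={f1_score(y_test, y_pred, average='macro'):.3f}")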

Results:
Confusion matrix:

Performance:
