Machine Learning Laboratory Manual
Definition of Machine Learning:
Machine Learning is the science (and art) of programming computers so they can
learn from data.
Machine Learning is the field of study that gives computers the ability to learn
without being explicitly programmed.
—Arthur Samuel, 1959
A computer program is said to learn from experience E with respect to some task T
and some performance measure P, if its performance on T, as measured by
P, improves with experience E.
—Tom Mitchell, 1997
Types of Machine Learning:
Supervised learning
• In supervised learning, the training set fed to the algorithm includes the
desired solutions, called labels.
• The most important supervised learning algorithms:
1. K-Nearest Neighbors
2. Linear Regression
3. Logistic Regression
4. Support Vector Machines (SVMs)
5. Decision Trees and Random Forests
6. Neural networks
Unsupervised learning
• In unsupervised learning, the training data is unlabeled.
• The most important unsupervised learning algorithms:
1. Clustering
• K-Means
• DBSCAN
• Hierarchical Cluster Analysis (HCA)
2. Anomaly detection and novelty detection
• One-class SVM
• Isolation Forest
3. Visualization and dimensionality reduction
• Principal Component Analysis (PCA)
• Kernel PCA
• Locally Linear Embedding (LLE)
• t-Distributed Stochastic Neighbor Embedding (t-SNE)
4. Association rule learning
• Apriori
• Eclat
Language for Machine Learning Laboratory:
Python is one of the most popular open-source programming languages and is widely adopted by
the machine learning community. It was designed by Guido van Rossum and first released in
1991. The reference implementation of Python, CPython, is managed by the Python Software
Foundation, a nonprofit organization.
Python has very strong libraries for advanced mathematical functionalities (NumPy),
algorithms and mathematical tools (SciPy) and numerical plotting (matplotlib). Python
libraries are collections of modules that contain useful code and functions, eliminating the
need to write them from scratch. There are tens of thousands of Python libraries that help
machine learning developers, as well as professionals working in data science, data
visualization, and more.
Popularly used Python Libraries for Machine Learning:
1. NumPy
NumPy is a popular Python library for multi-dimensional array and matrix processing
that supports a great variety of mathematical operations. Its support for linear algebra,
Fourier transforms, and more makes NumPy well suited to machine learning and artificial
intelligence (AI) projects, where data is usually represented as arrays and matrices.
Because its array operations are implemented in compiled code, NumPy is typically much
faster than equivalent loops written in pure Python.
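As a small illustration of the linear algebra support mentioned above, the following sketch solves a 2x2 linear system and applies an element-wise operation. The matrix and vector values are made up for the example.

```python
import numpy as np

# A small system of equations: 2x + y = 5 and x + 3y = 10
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([5.0, 10.0])

# Solve A @ solution = b
solution = np.linalg.solve(A, b)
print(solution)

# Element-wise operations apply to the whole array without an explicit loop
squared = A ** 2
print(squared)
```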
2. SciPy
SciPy is a Python library used for scientific and technical computing. It is built on top of
NumPy. It contains different modules for optimization, linear algebra, integration and
statistics.
3. Matplotlib
Matplotlib is a Python library focused on data visualization and primarily used for creating
beautiful graphs, plots, histograms, and bar charts. It is compatible with plotting data from
SciPy, NumPy, and Pandas.
4. Pandas
Pandas is another Python library that is built on top of NumPy, responsible for preparing
high-level data sets for machine learning and training. It relies on two types of data
structures, one-dimensional (series) and two-dimensional (DataFrame). This allows Pandas to
be applicable in a variety of industries including finance, engineering, and statistics.
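The two data structures can be sketched as follows. The column names and values are made up for illustration.

```python
import pandas as pd

# A Series is a one-dimensional labeled array
ages = pd.Series([25, 32, 47], name='age')

# A DataFrame is a two-dimensional table whose columns are Series
df = pd.DataFrame({'age': [25, 32, 47],
                   'income': [30000, 52000, 81000]})

print(ages.max())
print(df.shape)
print(type(df['income']))
```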
5. Seaborn
Seaborn is another open-source Python library. It is based on Matplotlib (which
focuses on plotting and data visualization) and works directly with Pandas data structures.
Seaborn is often used in ML projects because it can generate statistical plots of the
training data with little code. Its default styles produce attractive graphs and plots,
making it an effective choice for exploratory data analysis.
6. Scikit-learn
Scikit-learn is a very popular machine learning library that is built on NumPy and SciPy. It
supports most of the classic supervised and unsupervised learning algorithms, and it can also
be used for data mining, modelling, and analysis. Scikit-learn’s simple design offers a user-
friendly library for those new to machine learning.
7. TensorFlow
TensorFlow is an open-source Python library that specializes in what is called differentiable
programming, meaning it can automatically compute a function's derivatives within a
high-level language. Both machine learning and deep learning models can be developed and
evaluated with TensorFlow's flexible architecture and framework. TensorFlow can be used to
visualize machine learning models on both desktop and mobile.
8. Theano
Theano is a Python library that focuses on numerical computation and is specifically made
for machine learning. It is able to optimize and evaluate mathematical models and matrix
calculations that use multi-dimensional arrays to create ML models. Theano is almost
exclusively used by machine learning and deep learning developers or programmers.
9. Keras
Keras is a Python library that is designed specifically for developing neural networks for ML
models. It can run on top of Theano and TensorFlow to train neural networks. Keras is
flexible, portable, user-friendly, and easily integrated with multiple functions.
10. PyTorch
PyTorch is a popular open-source Python machine learning library based on Torch and
developed by Facebook. Torch is an open-source machine learning library implemented in C
with a Lua wrapper. You can use your favorite Python packages (e.g., Cython,
NumPy, SciPy) to extend PyTorch.
PyTorch has two predominant, high-level features:
1. Tensor computation coupled with strong GPU acceleration
2. Deep neural networks constructed on a tape-based autograd system
PyTorch has a vast selection of tools and libraries that support computer vision, natural
language processing (NLP), and a host of other machine learning programs. PyTorch allows
developers to conduct computations on Tensors with GPU acceleration and aids in creating
computational graphs. Considered one of the best deep learning and machine learning
frameworks, it faces stiff competition from TensorFlow.
LIST OF EXERCISES:
1. Introduction to Python machine learning libraries.
The following are web links to the documentation of the machine learning libraries. Work with
these libraries to get acquainted with each of them. Refer to these documents as needed when
solving the problems.
Library Name    Web link
NumPy           https://numpy.org/doc/stable/user/absolute_beginners.html
SciPy           https://docs.scipy.org/doc/scipy/tutorial/index.html
Matplotlib      https://matplotlib.org/stable/tutorials/index.html
Pandas          https://pandas.pydata.org/docs/getting_started/index.html
Seaborn         https://seaborn.pydata.org/tutorial.html
Scikit-learn    https://scikit-learn.org/1.1/user_guide.html
TensorFlow      https://www.tensorflow.org/guide
Theano          https://www.projectpro.io/data-science-in-python-tutorial/theano-deep-learning-tutorial-
Keras           https://keras.io/getting_started/
PyTorch         https://pytorch.org/tutorials/beginner/basics/intro.html
2. Use Naïve Bayes classifier to solve the credit card fraud detection problem.
Download the credit card fraud detection dataset from the following link:
https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
Save the dataset and note down the path. Then follow the steps below.
I. Problem Understanding
Credit card fraud detection is a classification problem where the goal is to distinguish
between fraudulent and legitimate transactions. The Naïve Bayes classifier is suitable
for this task as it works well with probabilistic models and can handle large datasets
efficiently.
II. Prepare the Environment
Ensure you have the necessary libraries installed. You can install them using pip if
you haven't already:
pip install numpy pandas scikit-learn
III. Load and Explore the Data
Start by loading your dataset and performing initial exploratory data analysis (EDA).
import pandas as pd
# Load the dataset (replace 'path_to_file' with the actual file path)
data = pd.read_csv('path_to_file.csv')
# Display basic information about the dataset
# (info() prints its summary directly and returns None, so it is not wrapped in print)
data.info()
print(data.head())
IV. Preprocess the Data
Preprocessing is crucial for preparing your data for the model:
• Handle Missing Values: Check for and handle any missing values.
• Feature Selection: Identify which features are relevant. For Naïve Bayes, feature
scaling is generally not required.
• Encode Categorical Variables: Convert categorical variables into numerical
format if any.
• Split the Data: Divide the dataset into features (X) and target (y), then into
training and testing sets.
from sklearn.model_selection import train_test_split
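# In the Kaggle credit card fraud dataset, the fraud labels are in the 'Class' column
# (1 = fraud, 0 = legitimate)
X = data.drop('Class', axis=1)
y = data['Class']
# Split the dataset into training and test sets; stratify=y keeps the fraud ratio
# the same in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)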
V. Train the Naïve Bayes Classifier
You can use different variants of Naïve Bayes depending on your data characteristics.
For continuous features, Gaussian Naïve Bayes is commonly used.
from sklearn.naive_bayes import GaussianNB
# Initialize the classifier
nb_classifier = GaussianNB()
# Fit the model to the training data
nb_classifier.fit(X_train, y_train)
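To make the idea behind Gaussian Naïve Bayes concrete, here is a minimal from-scratch sketch. It assumes each feature is normally distributed within each class and picks the class with the highest log posterior. The function names and the toy data are illustrative only and are not scikit-learn's implementation.

```python
import numpy as np

def gaussian_nb_fit(X, y):
    """Estimate per-class priors, feature means, and feature variances."""
    params = {}
    for label in np.unique(y):
        rows = X[y == label]
        params[label] = {
            'prior': len(rows) / len(X),
            'mean': rows.mean(axis=0),
            'var': rows.var(axis=0) + 1e-9,  # small constant avoids division by zero
        }
    return params

def gaussian_nb_predict(params, x):
    """Return the class with the highest log posterior for one sample x."""
    best_label, best_score = None, -np.inf
    for label, p in params.items():
        # Sum of per-feature Gaussian log likelihoods (features treated as independent)
        log_likelihood = -0.5 * np.sum(
            np.log(2 * np.pi * p['var']) + (x - p['mean']) ** 2 / p['var']
        )
        score = np.log(p['prior']) + log_likelihood
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Tiny made-up dataset: two features, two classes
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                  [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_params = gaussian_nb_fit(X_toy, y_toy)
pred_a = gaussian_nb_predict(toy_params, np.array([1.1, 2.1]))
pred_b = gaussian_nb_predict(toy_params, np.array([5.1, 5.9]))
print(pred_a, pred_b)
```

Working in log space avoids numerical underflow when many small likelihoods are multiplied together.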
VI. Make Predictions
After training the model, use it to make predictions on the test set.
# Make predictions
y_pred = nb_classifier.predict(X_test)
VII. Evaluate the Model
Evaluate the model’s performance using various metrics. For fraud detection,
precision, recall, and F1-score are particularly important due to the imbalanced nature
of fraud datasets.
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Print classification report
print(classification_report(y_test, y_pred))
# Print confusion matrix
print(confusion_matrix(y_test, y_pred))
# Print accuracy score
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
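To see why accuracy alone can mislead on imbalanced data, consider the following small sketch. The labels are made up (10 frauds among 1,000 transactions) and the helper confusion_counts is illustrative only.

```python
# Hypothetical labels: 1 = fraud, 0 = legitimate
y_true = [1] * 10 + [0] * 990
# A useless model that predicts "legitimate" for every transaction
y_always_legit = [0] * 1000

def confusion_counts(actual, predicted):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_counts(y_true, y_always_legit)
accuracy = (tp + tn) / len(y_true)
# Guard against division by zero when there are no positive predictions or labels
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f'Accuracy: {accuracy}, Recall: {recall}, Precision: {precision}')
```

The model looks excellent by accuracy yet catches no fraud at all, which is why recall and precision on the fraud class deserve the most attention.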
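3. Implement the K-Nearest Neighbor algorithm to solve a classification problem.
Follow the steps discussed in problem 2. Because KNN relies on distances between samples,
standardize the features before training. Use the following in the "Train the model" step: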
from sklearn.neighbors import KNeighborsClassifier
# Initialize the classifier with k=5 (default value)
knn_classifier = KNeighborsClassifier(n_neighbors=5)
# Fit the model to the training data
knn_classifier.fit(X_train, y_train)
Develop the KNN algorithm in Python.
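One possible starting point for a from-scratch version is sketched below. It assumes Euclidean distance and a simple majority vote. The function name knn_predict and the toy data are illustrative and not part of the exercise statement.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Tiny made-up dataset with two well-separated classes
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                  [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y_toy = np.array(['A', 'A', 'A', 'B', 'B', 'B'])

pred_near_a = knn_predict(X_toy, y_toy, np.array([1.8, 1.6]), k=3)
pred_near_b = knn_predict(X_toy, y_toy, np.array([6.4, 6.6]), k=3)
print(pred_near_a, pred_near_b)
```

An odd k avoids tied votes in two-class problems. Comparing the predictions with those of KNeighborsClassifier on the same data is a useful check.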
4. Implement the CART algorithm for decision tree learning. Use an appropriate data set
for building the decision tree and apply this knowledge to classify a new sample.
Explore the problem of overfitting in decision trees and develop a solution using
a pruning technique.
Follow the steps discussed in problem 2. Use the following in the "Train the model" step.
Train the Decision Tree Classifier
Initialize and train the decision tree classifier. The DecisionTreeClassifier in scikit-learn
is the implementation of the CART algorithm.
from sklearn.tree import DecisionTreeClassifier
# Initialize the classifier
dt_classifier = DecisionTreeClassifier(random_state=42)
# Fit the model to the training data
dt_classifier.fit(X_train, y_train)
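CART chooses each split by minimizing an impurity measure. For classification, scikit-learn's default criterion is Gini impurity. The following sketch shows the idea for a single numeric feature. The helper names and the toy data are illustrative, and a full CART implementation would repeat this search over all features and recurse on each side of the split.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Try midpoints between sorted unique values; return (threshold, weighted Gini)."""
    best = (None, float('inf'))
    unique_vals = sorted(set(values))
    for lo, hi in zip(unique_vals, unique_vals[1:]):
        threshold = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        # Impurity of the split, weighted by the size of each side
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Made-up feature values and class labels
feature = [2.0, 3.0, 4.0, 7.0, 8.0, 9.0]
target = [0, 0, 0, 1, 1, 1]

root_impurity = gini(target)
split = best_threshold(feature, target)
print(root_impurity, split)
```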
Add the following eighth step to address overfitting with pruning.
Overfitting Problem: Decision trees tend to overfit the training data by creating complex
trees that capture noise. This results in poor generalization to new data.
Solution: Pruning Techniques
Pruning helps to reduce the complexity of the decision tree and mitigate overfitting.
scikit-learn provides several parameters that limit tree growth during training
(often called pre-pruning):
1. Limit Tree Depth:
• Restrict the maximum depth of the tree.
2. Minimum Samples Per Leaf:
• Set the minimum number of samples required to be at a leaf node.
3. Minimum Samples Split:
• Set the minimum number of samples required to split an internal node.
4. Maximum Features:
• Limit the number of features considered when looking for the best split.
Here’s how to apply these parameters:
# Initialize the classifier with pruning parameters
pruned_dt_classifier = DecisionTreeClassifier(
    max_depth=5,           # Maximum depth of the tree
    min_samples_split=10,  # Minimum number of samples required to split an internal node
    min_samples_leaf=5,    # Minimum number of samples required to be at a leaf node
    max_features='sqrt',   # Use sqrt(n_features) features for each split
    random_state=42
)
# Fit the model to the training data
pruned_dt_classifier.fit(X_train, y_train)
In the ninth step, evaluate the pruned model to assess the performance of the pruned
decision tree classifier.
# Make predictions with the pruned model
y_pred_pruned = pruned_dt_classifier.predict(X_test)
# Print classification report
print(classification_report(y_test, y_pred_pruned))
# Print confusion matrix
print(confusion_matrix(y_test, y_pred_pruned))
# Print accuracy score
print(f'Accuracy: {accuracy_score(y_test, y_pred_pruned)}')
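Besides limiting growth during training, scikit-learn also supports minimal cost-complexity pruning (post-pruning) through the ccp_alpha parameter of DecisionTreeClassifier. The sketch below uses a synthetic dataset from make_classification as a stand-in for your own data. The variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (replace with your own X and y)
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=4, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

# Compute the candidate alpha values for the fully grown tree
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    Xd_train, yd_train)

results = []
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    tree.fit(Xd_train, yd_train)
    results.append((alpha, tree.get_n_leaves(), tree.score(Xd_test, yd_test)))

# Larger alpha values prune more aggressively and give smaller trees
for alpha, n_leaves, test_acc in results:
    print(f'alpha={alpha:.5f}  leaves={n_leaves}  test accuracy={test_acc:.3f}')
```

In practice the alpha value is usually chosen with cross-validation on the training data rather than by looking at test accuracy.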
5. Perform Exploratory Data Analysis on the given dataset. Implement CART
algorithm for decision tree learning. Use an appropriate data set for building the
decision tree and apply this knowledge to classify a new sample.
Follow the steps below. Use the dataset specified by the faculty.
1. Understand the Problem
Clarify the objective of your analysis. In this case, you're using a CART-based decision
tree classifier to understand patterns in the data and gain insights into how different
features affect the target variable.
2. Prepare the Environment
Ensure you have the necessary libraries installed:
pip install numpy pandas scikit-learn matplotlib seaborn
3. Load and Explore the Dataset
Start by loading the dataset and performing initial exploratory data analysis (EDA).
import pandas as pd
# Load the dataset
data = pd.read_csv('path_to_file.csv')
# Display basic information about the dataset (info() prints directly and returns None)
data.info()
# Show the first few rows of the dataset
print(data.head())
# Summary statistics
print(data.describe())
4. Visualize Data Distributions
Explore the distribution of features and the target variable using visualizations.
import matplotlib.pyplot as plt
import seaborn as sns
# Plot the distribution of the target variable
sns.countplot(x='target', data=data)
plt.title('Distribution of Target Variable')
plt.show()
# Plot the distribution of numerical features
numerical_features = data.select_dtypes(include='number')
for feature in numerical_features.columns:
    plt.figure()
    sns.histplot(data[feature], kde=True)
    plt.title(f'Distribution of {feature}')
    plt.show()
# Pairplot to explore relationships between features and target
sns.pairplot(data, hue='target')
plt.show()
5. Preprocess the Data
Prepare the data for modeling:
Handle Missing Values: Impute or drop missing values as necessary.
Encode Categorical Variables: Convert categorical variables into numerical format if
required.
Feature Scaling (if necessary): Though not strictly necessary for decision trees,
standardizing or normalizing can help in visualizations.
from sklearn.model_selection import train_test_split
# Handle missing values (example: fill numerical features with their median)
data = data.fillna(data.median(numeric_only=True))
# Split the dataset into features and target before encoding,
# so that the target column itself is not one-hot encoded
X = data.drop('target', axis=1)
y = data['target']
# Convert categorical features (example: one-hot encoding)
X = pd.get_dummies(X, drop_first=True)
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
6. Train the Decision Tree Classifier
Train the CART-based decision tree classifier on the training data.
from sklearn.tree import DecisionTreeClassifier
# Initialize the classifier
dt_classifier = DecisionTreeClassifier(random_state=42)
# Fit the model to the training data
dt_classifier.fit(X_train, y_train)
7. Visualize the Decision Tree
Visualize the decision tree to understand the decision rules and how features are split.
from sklearn.tree import plot_tree
# Plot the decision tree
plt.figure(figsize=(20,10))
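# class_names must be strings listed in the same order as dt_classifier.classes_
plot_tree(dt_classifier, feature_names=list(X.columns),
          class_names=[str(c) for c in dt_classifier.classes_],
          filled=True, rounded=True)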
plt.title('Decision Tree Visualization')
plt.show()
8. Analyze Feature Importance
Determine which features are most influential in making predictions.
# Get feature importances
importances = dt_classifier.feature_importances_
features = X.columns
# Create a DataFrame for feature importances
feature_importance_df = pd.DataFrame({'Feature': features, 'Importance': importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
# Plot feature importances
plt.figure(figsize=(10,6))
sns.barplot(x='Importance', y='Feature', data=feature_importance_df)
plt.title('Feature Importance')
plt.show()
9. Evaluate the Model
Assess the performance of the decision tree classifier on the test set.
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Make predictions
y_pred = dt_classifier.predict(X_test)
# Print classification report
print(classification_report(y_test, y_pred))
# Print confusion matrix
print(confusion_matrix(y_test, y_pred))
# Print accuracy score
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
10. Address Overfitting (Optional)
If the model is overfitting (i.e., performing well on training data but poorly on test data),
consider pruning the decision tree.
# Initialize the classifier with pruning parameters
pruned_dt_classifier = DecisionTreeClassifier(
    max_depth=5,           # Limit the depth of the tree
    min_samples_split=10,  # Minimum samples required to split a node
    min_samples_leaf=5,    # Minimum samples required to be at a leaf node
    random_state=42
)
# Fit the pruned model
pruned_dt_classifier.fit(X_train, y_train)
# Make predictions with the pruned model
y_pred_pruned = pruned_dt_classifier.predict(X_test)
# Evaluate the pruned model
print(classification_report(y_test, y_pred_pruned))
print(confusion_matrix(y_test, y_pred_pruned))
print(f'Accuracy: {accuracy_score(y_test, y_pred_pruned)}')
6. Train an SVM Classifier with Linear Kernel. Use an appropriate data set for
building the SVM Classifier and apply this knowledge to classify a new sample.
Methodology to Build an SVM Classifier with Linear Kernel
1. Understand the Problem
Support Vector Machine (SVM) classifiers are powerful for classification tasks. A linear
kernel SVM is used when you expect that the data can be separated by a linear decision
boundary.
2. Prepare the Environment
Make sure you have the necessary libraries installed:
pip install numpy pandas scikit-learn
3. Load and Explore the Data
Start by loading the dataset and performing exploratory data analysis (EDA).
import pandas as pd
# Load the dataset (replace 'path_to_file' with your file path)
data = pd.read_csv('path_to_file.csv')
# Display basic information about the dataset (info() prints directly and returns None)
data.info()
# Show the first few rows
print(data.head())
# Summary statistics
print(data.describe())
4. Preprocess the Data
Prepare the dataset for training the SVM classifier:
Handle Missing Values: Impute or remove missing values.
Encode Categorical Variables: Convert categorical variables to numerical format.
Feature Scaling: SVMs require feature scaling for optimal performance. Standardize
features so that they have a mean of 0 and a standard deviation of 1.
Split the Data: Divide the dataset into features (X) and target labels (y), then split into
training and testing sets.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Assume 'target' is the column with labels
X = data.drop('target', axis=1)
y = data['target']
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
5. Train the SVM Classifier with Linear Kernel
Initialize and train the SVM classifier with a linear kernel.
from sklearn.svm import SVC
# Initialize the SVM classifier with a linear kernel
svm_classifier = SVC(kernel='linear', random_state=42)
# Fit the model to the training data
svm_classifier.fit(X_train, y_train)
6. Make Predictions
Use the trained model to make predictions on the test set.
# Make predictions
y_pred = svm_classifier.predict(X_test)
7. Evaluate the Model
Assess the performance of the SVM classifier using appropriate metrics.
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Print classification report
print(classification_report(y_test, y_pred))
# Print confusion matrix
print(confusion_matrix(y_test, y_pred))
# Print accuracy score
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
8. Tune Hyperparameters (Optional)
Although the linear kernel does not have many parameters, you can still tune the
regularization parameter C to optimize model performance.
from sklearn.model_selection import GridSearchCV
# Define the parameter grid
param_grid = {'C': [0.1, 1, 10, 100, 1000]}
# Initialize GridSearchCV
grid_search = GridSearchCV(SVC(kernel='linear', random_state=42), param_grid, cv=5,
                           scoring='accuracy')
# Fit GridSearchCV
grid_search.fit(X_train, y_train)
# Print the best parameters and score
print(f'Best parameters: {grid_search.best_params_}')
print(f'Best score: {grid_search.best_score_}')
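One refinement worth considering is to put the scaler and the classifier into a single scikit-learn Pipeline, so that the scaler is refit inside every cross-validation fold and new samples are scaled automatically before prediction. The sketch below uses synthetic data from make_classification as a stand-in for your dataset. The variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (replace with your own X and y)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

# Scaling and classification are chained, so each CV fold fits its own scaler
pipeline = make_pipeline(StandardScaler(), SVC(kernel='linear'))

# make_pipeline names each step after its lowercased class, hence 'svc__C'
search = GridSearchCV(pipeline, {'svc__C': [0.1, 1, 10]}, cv=5, scoring='accuracy')
search.fit(Xd_train, yd_train)

print(f'Best parameters: {search.best_params_}')
print(f'Test accuracy: {search.score(Xd_test, yd_test)}')

# Classify one new (unscaled) sample; the pipeline applies the scaler first
new_sample = Xd_test[:1]
new_prediction = search.predict(new_sample)
print(f'Predicted class for the new sample: {new_prediction[0]}')
```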
7. Build linear regression and multiple regression models to predict the price of the
house (Boston House Prices Dataset).
Download the Boston House Prices dataset from the following link:
https://www.kaggle.com/datasets/vikrishnan/boston-house-prices/data
Save the dataset to use while building regression models to predict the price.
Methodology for Building Linear and Multiple Regression Models
1. Understand the Problem
The goal is to predict house prices based on features using linear and multiple regression
models. Linear regression predicts a continuous target variable as a function of one or
more input features.
2. Prepare the Environment
Ensure you have the necessary libraries installed:
pip install numpy pandas scikit-learn matplotlib seaborn
3. Load and Explore the Data
Load the Boston House Prices dataset and perform exploratory data analysis (EDA).
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
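# Note: load_boston was removed from scikit-learn in version 1.2, so the
# dataset downloaded from Kaggle is loaded with pandas instead.
# Assumption: the file is whitespace-delimited with no header row; adjust
# 'sep', 'header', and the path if your copy differs.
# The last column (MEDV, the median home value) is named PRICE here.
column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
                'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'PRICE']
data = pd.read_csv('path_to_file.csv', header=None, sep=r'\s+', names=column_names)
feature_names = column_names[:-1]
# Display basic information about the dataset (info() prints directly)
data.info()
# Show the first few rows
print(data.head())
# Summary statistics
print(data.describe())
4. Visualize Data Relationships
Explore the relationships between features and the target variable.
# Plot the relationship between features and target
sns.pairplot(data, x_vars=feature_names, y_vars=['PRICE'], height=2.5)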
plt.show()
# Correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(data.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()
5. Preprocess the Data
Prepare the dataset for modelling:
Handle Missing Values: Impute or remove missing values if necessary.
Feature Selection: Decide which features to include in the model.
Split the Data: Divide the dataset into features (X) and target (y), then split into training
and testing sets.
from sklearn.model_selection import train_test_split
# Define features and target
X = data.drop('PRICE', axis=1)
y = data['PRICE']
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
6. Build and Train Linear Regression Model
Simple linear regression predicts the target variable based on a single feature.
from sklearn.linear_model import LinearRegression
# Initialize the model
linear_regressor = LinearRegression()
# Train the model on the training data
linear_regressor.fit(X_train[['RM']], y_train) # 'RM' is an example feature
7. Build and Train Multiple Regression Model
Multiple regression uses multiple features to predict the target variable.
# Initialize the model
multiple_regressor = LinearRegression()
# Train the model on the training data
multiple_regressor.fit(X_train, y_train)
8. Make Predictions
Use the trained models to make predictions on the test set.
# Predictions with Linear Regression (using 'RM' feature as an example)
y_pred_linear = linear_regressor.predict(X_test[['RM']])
# Predictions with Multiple Regression
y_pred_multiple = multiple_regressor.predict(X_test)
9. Evaluate the Models
Assess the performance of both models using appropriate metrics such as Mean Absolute
Error (MAE), Mean Squared Error (MSE), and R-squared score.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Evaluate Linear Regression
mae_linear = mean_absolute_error(y_test, y_pred_linear)
mse_linear = mean_squared_error(y_test, y_pred_linear)
r2_linear = r2_score(y_test, y_pred_linear)
print(f'Linear Regression - MAE: {mae_linear}, MSE: {mse_linear}, R2: {r2_linear}')
# Evaluate Multiple Regression
mae_multiple = mean_absolute_error(y_test, y_pred_multiple)
mse_multiple = mean_squared_error(y_test, y_pred_multiple)
r2_multiple = r2_score(y_test, y_pred_multiple)
print(f'Multiple Regression - MAE: {mae_multiple}, MSE: {mse_multiple}, R2: {r2_multiple}')
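The three metrics can also be computed by hand, which helps when interpreting them. The following sketch uses a few made-up actual and predicted values.

```python
import numpy as np

# Made-up actual and predicted values
y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_predicted = np.array([2.5, 5.0, 8.0, 9.5])

errors = y_actual - y_predicted

# MAE: average absolute error, in the same units as the target
mae_manual = np.mean(np.abs(errors))
# MSE: average squared error, which penalizes large errors more heavily
mse_manual = np.mean(errors ** 2)
# RMSE: square root of MSE, back in the units of the target
rmse_manual = np.sqrt(mse_manual)
# R2: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, mse_manual, rmse_manual, r2_manual)
```

An R2 of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the target.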
10. Visualize Results
Visualize the predictions versus actual values to understand model performance.
# Plot actual vs predicted values for Multiple Regression
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_multiple, alpha=0.5, color='blue', label='Predicted')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='--', label='Ideal Line')
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Actual vs Predicted Prices (Multiple Regression)')
plt.legend()
plt.show()
8. Build a polynomial regression model for predicting the salary of the employees.
Download the Employee Salary Dataset from the following link:
https://www.kaggle.com/datasets/rkiattisak/salaly-prediction-for-beginer
Save the dataset to use while building a polynomial regression model to predict the
employee salary.
Methodology to Build a Polynomial Regression Model
1. Understand the Problem
The goal is to predict employee salaries based on one or more features using polynomial
regression. Polynomial regression allows you to model more complex relationships than
linear regression by fitting polynomial functions.
2. Prepare the Environment
Make sure you have the necessary libraries installed:
pip install numpy pandas scikit-learn matplotlib
3. Load and Explore the Data
Load your dataset and perform initial exploratory data analysis (EDA) to understand the
structure and relationships within the data.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
# Load the dataset (replace 'path_to_file' with your actual file path)
data = pd.read_csv('path_to_file.csv')
# Display basic information about the dataset (info() prints directly and returns None)
data.info()
# Show the first few rows
print(data.head())
# Summary statistics
print(data.describe())
4. Preprocess the Data
Prepare the data for polynomial regression:
Handle Missing Values: Impute or remove missing values.
Feature Selection: Choose the feature(s) to use for predicting salary.
Feature Scaling (if necessary): Polynomial features can benefit from scaling.
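# Example feature selection: assume 'Experience' is the feature and 'Salary' is the target.
# Column names in the downloaded file may differ; check data.columns and adjust.
# Drop rows with missing values in the selected columns
data = data.dropna(subset=['Experience', 'Salary'])
X = data[['Experience']]
y = data['Salary']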
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
5. Create Polynomial Features
Transform the feature(s) into polynomial features using PolynomialFeatures from scikit-
learn.
from sklearn.preprocessing import PolynomialFeatures
# Initialize PolynomialFeatures with degree 2 (quadratic) as an example
poly_features = PolynomialFeatures(degree=2)
# Transform the features to polynomial features
X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)
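For a single feature x, a degree-2 transformation produces the columns [1, x, x^2], and polynomial regression is then ordinary linear regression on those columns. The following NumPy-only sketch illustrates this with made-up data generated from a known quadratic, so the recovered coefficients can be checked.

```python
import numpy as np

# Made-up data that follows salary = 20 + 5*x + 2*x^2 exactly
experience = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
salary = 20 + 5 * experience + 2 * experience ** 2

# Degree-2 design matrix with columns [1, x, x^2]
design = np.column_stack([np.ones_like(experience),
                          experience,
                          experience ** 2])

# Ordinary least squares on the expanded features
coeffs, *_ = np.linalg.lstsq(design, salary, rcond=None)
print(np.round(coeffs, 6))
```

With real, noisy data the coefficients will not match any exact formula, and a higher degree fits the training data more closely at the risk of overfitting.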
6. Train the Polynomial Regression Model
Initialize and train a linear regression model on the transformed polynomial features.
from sklearn.linear_model import LinearRegression
# Initialize the linear regression model
poly_regressor = LinearRegression()
# Train the model on the polynomial features
poly_regressor.fit(X_train_poly, y_train)
7. Make Predictions
Use the trained polynomial regression model to make predictions on the test set.
# Make predictions
y_pred = poly_regressor.predict(X_test_poly)
8. Evaluate the Model
Assess the performance of the polynomial regression model using metrics like Mean
Absolute Error (MAE), Mean Squared Error (MSE), and R-squared score.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'MAE: {mae}')
print(f'MSE: {mse}')
print(f'R2: {r2}')
9. Visualize the Results
Visualize the polynomial regression fit to understand how well the model captures the
relationship between the feature(s) and the target variable.
# Create a grid of values for plotting
X_grid = np.arange(min(X['Experience']), max(X['Experience']), 0.1)
X_grid = X_grid.reshape((len(X_grid), 1))
X_grid_poly = poly_features.transform(X_grid)
# Predict salaries for the grid values
y_grid_pred = poly_regressor.predict(X_grid_poly)
# Plot the results
plt.scatter(X, y, color='red', label='Actual data')
plt.plot(X_grid, y_grid_pred, color='blue', label='Polynomial regression fit')
plt.title('Polynomial Regression')
plt.xlabel('Experience')
plt.ylabel('Salary')
plt.legend()
plt.show()
9. Build a neural network that will read the image of a digit and correctly identify the
number.
Download the image dataset for digits (0-9) from the following link:
https://www.kaggle.com/datasets/karnikakapoor/digits/data
Save the dataset to use while building a neural network model to correctly identify the
number.
Note: You can also use the standard MNIST dataset for digit classification, which is
readily available in TensorFlow.
Methodology to Build a Neural Network Model for Digit Classification
1. Understand the Problem
The goal is to build a neural network model that can classify images of digits (0-9). This
is a multi-class classification problem where each image corresponds to one of ten
classes.
2. Prepare the Environment
Ensure you have the necessary libraries installed:
pip install numpy pandas tensorflow matplotlib scikit-learn
3. Load and Explore the Dataset
MNIST is a standard dataset for digit classification, readily available in TensorFlow's
datasets module.
import tensorflow as tf
from tensorflow.keras.datasets import mnist
import matplotlib.pyplot as plt
# Load the MNIST dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Display the shape of the dataset
print(f'Training data shape: {X_train.shape}')
print(f'Test data shape: {X_test.shape}')
# Display a sample image
plt.imshow(X_train[0], cmap='gray')
plt.title(f'Label: {y_train[0]}')
plt.show()
4. Preprocess the Data
Prepare the data for the neural network model:
Normalize the Data: Scale pixel values to the range [0, 1].
Flatten Images: Convert 2D images into 1D vectors.
One-hot Encode Labels: Convert class labels into one-hot encoded vectors.
# Normalize pixel values to [0, 1]
X_train = X_train / 255.0
X_test = X_test / 255.0
# Flatten images (28x28) to vectors (784)
X_train = X_train.reshape(-1, 28 * 28)
X_test = X_test.reshape(-1, 28 * 28)
# One-hot encode labels
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)
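One-hot encoding replaces each integer label with a vector of length 10 that has a 1 at the label's index and 0 elsewhere. A NumPy-only sketch of the same idea, using a few made-up labels:

```python
import numpy as np

# Made-up digit labels
labels = np.array([0, 3, 9])

# Row i of the 10x10 identity matrix is the one-hot vector for class i
one_hot = np.eye(10)[labels]
print(one_hot)

# argmax recovers the original integer labels
recovered = one_hot.argmax(axis=1)
print(recovered)
```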
5. Build the Neural Network Model
Define a neural network architecture. A simple fully connected neural network is suitable
for this task.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# Initialize the model
model = Sequential()
# Add layers
model.add(Dense(128, activation='relu', input_shape=(28 * 28,)))
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))
# Compile the model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
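As a quick sanity check on the architecture defined above, the number of trainable parameters can be worked out by hand: each Dense layer has one weight per input-output pair plus one bias per output unit. The helper `dense_params` below is a hypothetical name used only for this sketch; calling `model.summary()` on the real model should report the same total.

```python
# Layer widths of the network defined above: 784 inputs -> 128 -> 64 -> 10
layer_sizes = [28 * 28, 128, 64, 10]

def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [100480, 8256, 650]
print(total)      # 109386
```

Almost all parameters sit in the first layer, because it connects every one of the 784 pixels to 128 units.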
6. Train the Model
Train the model using the training data.
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)
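The arguments of the `fit` call above determine how much data is actually used for weight updates and how many updates happen per epoch. The sketch below works this out for the 60000 MNIST training images; the variable names are chosen for illustration. Keras takes the validation samples from the end of the arrays, before any shuffling.

```python
import math

n_samples = 60000        # MNIST training images loaded above
validation_split = 0.2
batch_size = 32

n_val = round(n_samples * validation_split)   # held out for validation
n_train = n_samples - n_val                   # used for weight updates
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_val, steps_per_epoch)  # 48000 12000 1500
```

The progress bar printed during training should therefore count up to 1500 steps in each epoch.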
7. Evaluate the Model
Assess the performance of the model using the test data.
# Evaluate the model
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f'Test Loss: {test_loss}')
print(f'Test Accuracy: {test_accuracy}')
8. Visualize Training History
Plot the training and validation accuracy and loss to understand the model's learning
progress.
# Plot training & validation accuracy values
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()
# Plot training & validation loss values
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()
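The curves plotted above are usually read for two things: the epoch at which validation loss is lowest, and the gap between training and validation loss, which tends to grow when the model overfits. The sketch below shows one way to read both from a history dictionary. The numbers in `example_history` are made up purely for illustration and are not results from this model; substitute `history.history` from a real run.

```python
# Made-up numbers for illustration only; use history.history from a real run
example_history = {
    'loss':     [0.30, 0.14, 0.10, 0.07, 0.05, 0.04],
    'val_loss': [0.16, 0.12, 0.10, 0.11, 0.12, 0.13],
}

val_loss = example_history['val_loss']

# 0-based index of the epoch with the lowest validation loss
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)

# Gap between validation and training loss at the last epoch
final_gap = val_loss[-1] - example_history['loss'][-1]

print(f'Lowest validation loss at epoch {best_epoch + 1}')
print(f'Final val-train loss gap: {final_gap:.2f}')
```

When validation loss starts rising while training loss keeps falling, training for fewer epochs is a reasonable first thing to try.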
9. Make Predictions
Use the trained model to make predictions on data it has not seen during training (here,
the test set).
# Predict on the test set
predictions = model.predict(X_test)
# Display predictions for a sample
import numpy as np
sample_index = 0
sample_image = X_test[sample_index].reshape(28, 28)
predicted_label = np.argmax(predictions[sample_index])
true_label = np.argmax(y_test[sample_index])
plt.imshow(sample_image, cmap='gray')
plt.title(f'True label: {true_label}, Predicted label: {predicted_label}')
plt.show()
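The step above inspects a single sample. The same `argmax` idea extends to the whole test set, for example to list which samples were misclassified. The sketch below uses a small hypothetical probability array `probs` and label array `true` so that it runs on its own; with the real model, `predictions` and `np.argmax(y_test, axis=1)` would take their place.

```python
import numpy as np

# Hypothetical softmax outputs for 4 samples over 3 classes (rows sum to 1)
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.2, 0.6],
])
true = np.array([0, 1, 2, 2])

pred = np.argmax(probs, axis=1)          # most probable class per row
wrong = np.flatnonzero(pred != true)     # indices of misclassified samples
accuracy = np.mean(pred == true)

print(pred)      # [0 1 1 2]
print(wrong)     # [2]
print(accuracy)  # 0.75
```

Plotting the images at the indices in `wrong` is often a useful way to see which digits the model confuses.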
10. Solve a classification problem by constructing a feed-forward neural network using
the backpropagation algorithm. (Wheat Seed Data)
Download the wheat seed dataset from the following link:
https://www.kaggle.com/datasets/jmcaro/wheat-seedsuci
Save the dataset to use while building a feed forward neural network model.
Methodology to Build a Feed-Forward Neural Network with Backpropagation
1. Understand the Problem
The task is to build a feed-forward neural network that classifies wheat seeds into
different types based on their features. This is a multi-class classification problem.
2. Prepare the Environment
Ensure you have the necessary libraries installed:
pip install numpy pandas scikit-learn tensorflow matplotlib
3. Load and Explore the Data
Load the Wheat Seed dataset and perform exploratory data analysis (EDA).
import pandas as pd
# Load the dataset (replace 'path_to_file' with your actual file path)
data = pd.read_csv('path_to_file.csv')
# Display basic information about the dataset
print(data.info())
# Show the first few rows
print(data.head())
# Summary statistics
print(data.describe())
4. Preprocess the Data
Prepare the data for training the neural network:
• Handle Missing Values: Impute or remove missing values if necessary.
• Encode Categorical Variables: Convert categorical labels into numerical format using
one-hot encoding.
• Feature Scaling: Normalize or standardize feature values.
• Split the Data: Divide the dataset into features (X) and target labels (y), then split into
training and testing sets.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Assume 'Type' is the target variable and needs to be one-hot encoded
X = data.drop('Type', axis=1)
y = data['Type']
# One-hot encode the class labels; cast to float because recent pandas
# versions return boolean dummy columns
y = pd.get_dummies(y).astype('float32')
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Normalize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
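The scaling code above fits the scaler on the training split only and then reuses it on the test split, so that no information from the test data leaks into training. The sketch below reproduces that arithmetic with plain NumPy on two small hypothetical arrays, `train` and `test`; it is an illustration of the idea, not a replacement for `StandardScaler`.

```python
import numpy as np

# Hypothetical feature matrices: rows are samples, columns are features
train = np.array([[10.0, 1.0], [12.0, 3.0], [14.0, 5.0]])
test = np.array([[12.0, 2.0]])

# Statistics come from the training split only
mean = train.mean(axis=0)
std = train.std(axis=0)      # population std (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuse training statistics, no refit

print(train_scaled.mean(axis=0))    # approximately zero for every column
print(test_scaled)
```

After scaling, every training feature has mean 0 and standard deviation 1, while the test features are only approximately standardized.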
5. Build the Neural Network Model
Define and compile the feed-forward neural network using TensorFlow/Keras.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# Initialize the model
model = Sequential()
# Add layers to the model
model.add(Dense(64, input_dim=X_train.shape[1], activation='relu')) # First hidden layer (takes the input features)
model.add(Dense(32, activation='relu')) # Second hidden layer
model.add(Dense(y_train.shape[1], activation='softmax')) # Output layer (one unit per class)
# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
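The output layer and loss chosen above work as a pair: softmax turns the raw outputs into class probabilities, and categorical cross-entropy penalizes a low probability on the true class. The sketch below computes both by hand for a single sample. The `logits` values are hypothetical, and three classes are assumed here to match the three wheat varieties in the seeds data.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]   # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs))

logits = [2.0, 1.0, 0.1]          # hypothetical raw outputs for three classes
probs = softmax(logits)
loss = categorical_crossentropy([1, 0, 0], probs)

print([round(p, 3) for p in probs])
print(round(loss, 3))
```

The loss approaches 0 as the probability of the true class approaches 1, and grows without bound as that probability approaches 0.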
6. Train the Model
Fit the neural network model on the training data.
history = model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2)
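The `fit` call above runs backpropagation internally, so the algorithm named in the experiment title is never visible in the Keras code. The sketch below writes out one possible version by hand in NumPy for a much smaller network (2 inputs, 4 hidden ReLU units, 2 softmax outputs). The synthetic data, the layer sizes, the learning rate, and the use of plain gradient descent instead of Adam are all assumptions made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic problem (hypothetical): the class equals the first feature
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)  # one-hot targets

# Parameters of a 2 -> 4 -> 2 network
W1 = rng.normal(0, 0.5, (2, 4))
b1 = np.zeros(4)
W2 = rng.normal(0, 0.5, (4, 2))
b2 = np.zeros(2)
lr = 0.5                      # learning rate for plain gradient descent
losses = []

for step in range(200):
    # Forward pass
    z1 = X @ W1 + b1
    h = np.maximum(z1, 0)                              # ReLU
    z2 = h @ W2 + b2
    e = np.exp(z2 - z2.max(axis=1, keepdims=True))
    P = e / e.sum(axis=1, keepdims=True)               # softmax
    losses.append(-np.mean(np.sum(Y * np.log(P + 1e-12), axis=1)))

    # Backward pass: chain rule applied from the output layer inward
    dz2 = (P - Y) / len(X)         # gradient of mean cross-entropy w.r.t. z2
    dW2 = h.T @ dz2
    db2 = dz2.sum(axis=0)
    dh = dz2 @ W2.T
    dz1 = dh * (z1 > 0)            # ReLU passes gradient only where z1 > 0
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0)

    # Gradient descent update
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

print(f'loss: {losses[0]:.3f} -> {losses[-1]:.3f}')
```

Keras performs the same forward pass, backward pass, and update cycle for every batch, with the gradients computed automatically.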
7. Evaluate the Model
Assess the performance of the model on the test data.
# Evaluate the model
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f'Test Loss: {test_loss}')
print(f'Test Accuracy: {test_accuracy}')
8. Visualize Training History
Plot the training and validation accuracy and loss to understand the model's learning
process.
import matplotlib.pyplot as plt
# Plot training & validation accuracy values
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()
# Plot training & validation loss values
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()
9. Make Predictions
Use the model to make predictions and evaluate its performance on new data.
# Predict on the test set
y_pred = model.predict(X_test)
# Convert predicted probabilities and one-hot targets to class indices (0, 1, 2, ...)
# so that both use the same label encoding
import numpy as np
y_pred_labels = np.argmax(y_pred, axis=1)
y_test_labels = np.argmax(np.asarray(y_test), axis=1)
from sklearn.metrics import classification_report, confusion_matrix
# Print classification report
print(classification_report(y_test_labels, y_pred_labels))
# Print confusion matrix
print(confusion_matrix(y_test_labels, y_pred_labels))
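To help read the two printouts above, the sketch below builds a confusion matrix by hand and derives per-class recall and precision from it. It follows the scikit-learn layout, in which rows are true classes and columns are predicted classes. The function name `confusion_counts` and the six labels are hypothetical and are not results from the wheat seed model.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Rows index the true class, columns the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Hypothetical labels for six seeds, classes encoded 0..2
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_counts(y_true, y_pred, 3)
for row in cm:
    print(row)

# Recall = diagonal / row sum; precision = diagonal / column sum
recall = [cm[i][i] / sum(cm[i]) for i in range(3)]
precision = [cm[i][i] / sum(row[i] for row in cm) for i in range(3)]
print(recall)      # [0.5, 1.0, 0.5]
print(precision)
```

Off-diagonal entries show which pairs of classes are confused with each other, which the overall accuracy alone does not reveal.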
***** THE END *****