MACHINE LEARNING LAB
List of Experiments
1. Write a python program to compute Central Tendency Measures: Mean, Median, Mode
Measure of Dispersion: Variance, Standard Deviation
2. Study of Python Basic Libraries such as Statistics, Math, Numpy and Scipy
3. Study of Python Libraries for ML application such as Pandas and Matplotlib
4. Write a Python program to implement Simple Linear Regression
5. Implementation of Multiple Linear Regression for House Price Prediction using sklearn
6. Implementation of Decision tree using sklearn and its parameter tuning
7. Implementation of KNN using sklearn
8. Implementation of Logistic Regression using sklearn
9. Implementation of K-Means Clustering
10. Performance analysis of Classification Algorithms on a specific dataset (Mini Project)
1. Write a python program to compute Central Tendency Measures: Mean, Median, Mode; Measure of Dispersion: Variance, Standard Deviation
Code:
import statistics
# Sample data
data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]
# Central Tendency Measures
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
# Measures of Dispersion
variance = statistics.variance(data)
std_deviation = statistics.stdev(data)
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print(f"Variance: {variance}")
print(f"Standard Deviation: {std_deviation}")
output:
Alternative Code (using a reusable function):
import statistics
# Function to compute central tendency and dispersion measures
def compute_statistics(data):
    # Central Tendency Measures
    mean = statistics.mean(data)
    median = statistics.median(data)
    mode = statistics.mode(data) if len(set(data)) != len(data) else "No mode"
    # Dispersion Measures
    variance = statistics.variance(data)
    std_deviation = statistics.stdev(data)
    return mean, median, mode, variance, std_deviation
# Sample data (You can input your own data)
data = [10, 20, 20, 30, 40, 40, 40, 50, 60, 70]
# Calculate the statistics
mean, median, mode, variance, std_deviation = compute_statistics(data)
# Output the results
print(f"Data: {data}")
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print(f"Variance: {variance:.2f}")
print(f"Standard Deviation: {std_deviation:.2f}")
output:
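Note: on older Python versions, statistics.mode() raises StatisticsError when the data has more than one most-common value. Since Python 3.8, statistics.multimode() returns every value tied for the highest frequency. A minimal sketch (the sample list below is only an illustration):
import statistics
# Hypothetical sample with two equally common values (20 and 40)
data = [10, 20, 20, 30, 40, 40]
# multimode() returns a list of all values with the highest frequency (Python 3.8+)
modes = statistics.multimode(data)
print(f"All modes: {modes}")  # Expected: [20, 40]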
2. Study of Python Basic Libraries such as Statistics, Math, Numpy and Scipy
Code:
1. statistics Module
The statistics module provides functions to perform statistical operations. It is part of
Python's standard library.
Key Functions:
• mean(data): Returns the arithmetic mean of the data.
• median(data): Returns the median of the data.
• mode(data): Returns the most common data point.
• variance(data): Returns the variance of the data.
• stdev(data): Returns the standard deviation of the data.
Example:
import statistics
data = [10, 20, 20, 30, 40, 50]
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
variance = statistics.variance(data)
stdev = statistics.stdev(data)
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print(f"Variance: {variance}")
print(f"Standard Deviation: {stdev}")
output:
2. math Module
The math module provides mathematical functions. It is also part of Python's standard library.
Key Functions:
• sqrt(x): Returns the square root of x.
• factorial(x): Returns the factorial of x.
• log(x, base): Returns the logarithm of x to the given base.
• sin(x), cos(x), tan(x): Trigonometric functions.
• pi: The constant π.
Example:
import math
number = 16
sqrt_value = math.sqrt(number)
factorial_value = math.factorial(5)
log_value = math.log(100, 10)
pi_value = math.pi
print(f"Square Root: {sqrt_value}")
print(f"Factorial: {factorial_value}")
print(f"Logarithm (base 10): {log_value}")
print(f"Value of pi: {pi_value}")
output:
3. numpy Library
numpy is a powerful library for numerical computations, particularly for handling large arrays
and matrices. It is not included in the standard library and needs to be installed separately.
Key Functions:
• numpy.array(): Creates a numpy array.
• numpy.mean(): Computes the mean of array elements.
• numpy.median(): Computes the median of array elements.
• numpy.std(): Computes the standard deviation.
• numpy.var(): Computes the variance.
• numpy.linalg.inv(): Computes the inverse of a matrix
Summary
• statistics: Basic statistics functions for small datasets.
• math: Fundamental mathematical functions.
• numpy: Advanced numerical operations and array handling.
• scipy: Additional scientific and technical computing tools.
Example:
import numpy as np
data = np.array([10, 20, 30, 40, 50])
mean = np.mean(data)
median = np.median(data)
std_dev = np.std(data)
variance = np.var(data)
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Standard Deviation: {std_dev}")
print(f"Variance: {variance}")
output:
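The example above covers the statistical helpers; numpy.linalg.inv() from the key-function list can be illustrated separately. A minimal sketch (the 2x2 matrix is chosen only for illustration):
import numpy as np
# Hypothetical invertible 2x2 matrix
A = np.array([[4.0, 7.0],
              [2.0, 6.0]])
# Compute the inverse and verify that A @ A_inv is (numerically) the identity matrix
A_inv = np.linalg.inv(A)
print("Inverse:\n", A_inv)
print("Check (A @ A_inv):\n", np.round(A @ A_inv, 10))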
4. scipy Library
scipy builds on numpy and provides additional functionality for scientific and technical computing.
Key Functions:
• scipy.stats.describe(): Provides descriptive statistics.
• scipy.stats.norm: Functions related to the normal distribution, such as PDF and CDF.
• scipy.optimize.minimize(): Optimization routines.
• scipy.integrate.quad(): Numerical integration.
Example:
from scipy import stats
data = [10, 20, 30, 40, 50]
# Descriptive statistics
desc = stats.describe(data)
print(f"Mean: {desc.mean}")
print(f"Variance: {desc.variance}")
print(f"Minimum: {desc.minmax[0]}")
print(f"Maximum: {desc.minmax[1]}")
# Normal distribution example
mean, var = stats.norm.fit(data)
print(f"Fitted Mean: {mean}")
print(f"Fitted Variance: {var}")
output:
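scipy.integrate.quad() and scipy.optimize.minimize() from the key-function list are not covered by the example above; a minimal sketch (the integrand and the quadratic objective are illustrative choices):
import numpy as np
from scipy import integrate, optimize
# Numerical integration: integrate sin(x) from 0 to pi (exact value is 2)
value, abs_error = integrate.quad(np.sin, 0, np.pi)
print(f"Integral of sin(x) on [0, pi]: {value:.4f} (error estimate: {abs_error:.2e})")
# Optimization: minimize f(x) = (x - 3)^2, whose minimum is at x = 3
result = optimize.minimize(lambda x: (x[0] - 3.0) ** 2, x0=[0.0])
print(f"Minimum found at x = {result.x[0]:.4f}")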
3. Study of Python Libraries for ML application such as Pandas and Matplotlib.
Code:
1. pandas Library
pandas provides high-performance data structures, most importantly the DataFrame, for working with tabular data. First, import the library and create a DataFrame from a dictionary of columns.
import pandas as pd
# Manually create data
data = {
'Date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05'],
'Product': ['Widget A', 'Widget B', 'Widget C', 'Widget A', 'Widget B'],
'Category': ['Electronics', 'Electronics', 'Home Goods', 'Home Goods', 'Electronics'],
'Sales': [1500, 2000, 1200, 1800, 2100],
'Profit': [300, 500, 200, 400, 600]
}
# Create DataFrame
df = pd.DataFrame(data)
# Display the DataFrame
print("Initial DataFrame:")
print(df)
output:
Exploring the Data
Next, examine the basic structure and summary statistics of the DataFrame.
# Display basic information about the DataFrame
print("\nDataFrame Info:")
print(df.info())
# Display descriptive statistics
print("\nDescriptive Statistics:")
print(df.describe())
# Display the first few rows
print("\nFirst Few Rows:")
print(df.head())
output:
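For ML-style data preparation, grouping, aggregating, and filtering rows are common next steps. A minimal sketch that reuses the df created above (the aggregation and the threshold are illustrative choices):
# Total sales and average profit per category
summary = df.groupby('Category').agg({'Sales': 'sum', 'Profit': 'mean'})
print("\nSales and Profit by Category:")
print(summary)
# Filter rows with sales above a threshold
high_sales = df[df['Sales'] > 1500]
print("\nRows with Sales > 1500:")
print(high_sales)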
2. matplotlib Library
matplotlib is a comprehensive library for creating static, animated, and interactive
visualizations in Python.
Key Features:
• Basic Plotting Functions:
o plt.plot(): Create line plots.
o plt.scatter(): Create scatter plots.
o plt.bar(): Create bar charts.
o plt.hist(): Create histograms.
o plt.show(): Display the plot.
• Customization:
o plt.title(): Add a title to the plot.
o plt.xlabel(), plt.ylabel(): Label the axes.
o plt.legend(): Add a legend.
o plt.grid(): Add a grid.
Example:
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
# Line plot
plt.plot(x, y, label='Line Plot', marker='o')
plt.title('Line Plot Example')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.legend()
plt.grid(True)
plt.show()
# Scatter plot
plt.scatter(x, y, color='red', label='Scatter Plot')
plt.title('Scatter Plot Example')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.legend()
plt.show()
# Histogram
data = [1, 2, 2, 3, 4, 4, 4, 5, 6]
plt.hist(data, bins=5, edgecolor='black')
plt.title('Histogram Example')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
output:
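plt.bar() from the key-feature list is not shown above; a minimal sketch (the categories and values are purely illustrative):
import matplotlib.pyplot as plt
# Bar chart
categories = ['A', 'B', 'C', 'D']
values = [5, 7, 3, 8]
plt.bar(categories, values, color='skyblue', edgecolor='black')
plt.title('Bar Chart Example')
plt.xlabel('Category')
plt.ylabel('Value')
plt.show()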
4. Write a Python program to implement Simple Linear Regression
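Simple linear regression fits a line y = β0 + β1·x by ordinary least squares. The program below computes the coefficients directly from the closed-form formulas, stated here for reference:
\[
\beta_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2},
\qquad
\beta_0 = \frac{\sum y_i - \beta_1 \sum x_i}{n}
\]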
Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Create sample data
data = {
'x': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'y': [1.5, 3.5, 3.7, 5.5, 6.5, 7.8, 8.7, 10.5, 11.3, 12.8]
}
# Load data into a DataFrame
df = pd.DataFrame(data)
# Extract x and y values
x = df['x'].values
y = df['y'].values
n = len(x)
# Calculate the necessary sums
sum_x = np.sum(x)
sum_y = np.sum(y)
sum_xy = np.sum(x * y)
sum_x_squared = np.sum(x ** 2)
# Calculate coefficients
beta_1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x_squared - sum_x ** 2)
beta_0 = (sum_y - beta_1 * sum_x) / n
print(f"Coefficients: beta_0 = {beta_0}, beta_1 = {beta_1}")
# Predict y values
y_pred = beta_0 + beta_1 * x
# Display predictions
print("Predicted y values:")
print(y_pred)
# Plot the data points
plt.scatter(x, y, color='blue', label='Data points')
# Plot the regression line
plt.plot(x, y_pred, color='red', label='Regression line')
# Add labels and title
plt.xlabel('x')
plt.ylabel('y')
plt.title('Simple Linear Regression')
plt.legend()
# Show the plot
plt.show()
output:
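As a quick sanity check, the manually computed coefficients can be compared with numpy's built-in least-squares fit; a minimal sketch assuming the x and y arrays from the code above are still in scope:
# np.polyfit returns coefficients from highest degree to lowest: [beta_1, beta_0]
slope, intercept = np.polyfit(x, y, deg=1)
print(f"np.polyfit check: beta_1 = {slope}, beta_0 = {intercept}")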
5. Implementation of Multiple Linear Regression for House Price Prediction
using sklearn.
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Create a sample dataset
data = {
'SquareFootage': [1500, 2000, 2500, 3000, 3500, 4000, 4500],
'NumBedrooms': [3, 4, 3, 5, 4, 5, 6],
'Age': [10, 15, 20, 5, 8, 12, 4],
'Price': [300000, 400000, 500000, 600000, 700000, 800000, 900000]
}
# Load data into a DataFrame
df = pd.DataFrame(data)
# Define features and target variable
X = df[['SquareFootage', 'NumBedrooms', 'Age']]
y = df['Price']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Get the coefficients and intercept
coefficients = model.coef_
intercept = model.intercept_
print(f"Coefficients: {coefficients}")
print(f"Intercept: {intercept}")
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
# Plot actual vs. predicted values
plt.scatter(y_test, y_pred, color='blue')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linewidth=2)  # Reference line y = x (perfect predictions)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Actual vs. Predicted Prices')
plt.show()
output:
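Once trained, the same model can price an unseen house. A minimal sketch assuming the fitted model and the column names from the code above (the feature values are hypothetical):
# Hypothetical new house: 2800 sq ft, 4 bedrooms, 10 years old
new_house = pd.DataFrame({'SquareFootage': [2800], 'NumBedrooms': [4], 'Age': [10]})
predicted_price = model.predict(new_house)
print(f"Predicted price: {predicted_price[0]:.2f}")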
6. Implementation of Decision tree using sklearn and its parameter
tuning.
Code:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
# Load the Iris dataset
data = load_iris()
X = data.data # Features
y = data.target # Target labels
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a basic Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)
# Train the model on the training data
dt.fit(X_train, y_train)
# Make predictions on the test data
y_pred = dt.predict(X_test)
# Calculate the accuracy before tuning
initial_accuracy = accuracy_score(y_test, y_pred)
print(f"Initial Accuracy without tuning: {initial_accuracy:.2f}")
# Parameter tuning using GridSearchCV
param_grid = {
    'criterion': ['gini', 'entropy'],        # Split quality measure
    'max_depth': [None, 10, 20, 30, 40],     # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],         # Minimum samples required to split an internal node
    'min_samples_leaf': [1, 2, 4],           # Minimum samples required at a leaf node
    'max_features': [None, 'sqrt', 'log2']   # Number of features to consider when looking for the best split
}
# Perform Grid Search to find the best hyperparameters
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)
# Output the best hyperparameters found by Grid Search
best_params = grid_search.best_params_
print(f"Best Hyperparameters: {best_params}")
# Use the best estimator to predict on the test set
best_dt = grid_search.best_estimator_
y_pred_best = best_dt.predict(X_test)
# Calculate the accuracy after tuning
tuned_accuracy = accuracy_score(y_test, y_pred_best)
print(f"Tuned Accuracy: {tuned_accuracy:.2f}")
What the program does:
1. Data Loading: It loads the Iris dataset, a commonly used dataset for classification tasks.
2. Model Creation: It creates a DecisionTreeClassifier and trains it using the training data.
3. Grid Search for Hyperparameter Tuning: It defines a parameter grid and performs
GridSearchCV to find the best hyperparameters for the decision tree.
4. Accuracy Calculation: It calculates and prints the accuracy before and after hyperparameter
tuning.
output:
Explanation:
• The initial accuracy is the performance of the decision tree model before any hyperparameter
tuning.
• The best hyperparameters are determined by GridSearchCV through cross-validation.
• The tuned accuracy shows the performance after applying the best combination of
hyperparameters.
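To see what the tuned tree actually learned, sklearn can print its decision rules as plain text. A minimal sketch assuming best_dt and the loaded data object from the code above:
from sklearn.tree import export_text
# Print the decision rules of the tuned tree using the Iris feature names
rules = export_text(best_dt, feature_names=list(data.feature_names))
print(rules)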
7. Implementation of KNN using sklearn
Code:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load the Iris dataset
data = load_iris()
X = data.data # Features
y = data.target # Target labels
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a K-Nearest Neighbors classifier
knn = KNeighborsClassifier(n_neighbors=3)
# Train the model on the training data
knn.fit(X_train, y_train)
# Make predictions on the test data
y_pred = knn.predict(X_test)
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
# Output the results
print(f"Accuracy of KNN model: {accuracy:.2f}")
Explanation:
1. Data Loading: The Iris dataset is loaded using load_iris(), which is a well-known
dataset containing 150 samples from 3 species of iris flowers.
2. Model Creation: The KNeighborsClassifier is initialized with n_neighbors=3, meaning
it will consider the 3 nearest neighbors to classify a data point.
3. Training: The model is trained using knn.fit(), which uses the training data (X_train,
y_train).
4. Prediction: The model predicts on the test data (X_test).
5. Accuracy: The accuracy score is computed using accuracy_score() to evaluate the
performance.
output:
Explanation of Output:
• The accuracy indicates the percentage of correctly classified instances from the
test set.
• A result of 1.00 (or 100%) means that the model classified every test sample correctly. This can happen on the Iris dataset because the test set is small and the classes (particularly setosa) are well separated.
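The choice n_neighbors=3 above is somewhat arbitrary; a common refinement is to compare several values of k with cross-validation and pick the best. A minimal sketch assuming X_train and y_train from the code above:
from sklearn.model_selection import cross_val_score
# Mean 5-fold cross-validation accuracy for a few candidate values of k
for k in [1, 3, 5, 7, 9]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    print(f"k={k}: mean CV accuracy = {scores.mean():.3f}")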
8. Implementation of Logistic Regression using sklearn
Code:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load the Iris dataset
data = load_iris()
X = data.data # Features
y = data.target # Target labels
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a Logistic Regression classifier
logreg = LogisticRegression(max_iter=200)
# Train the model on the training data
logreg.fit(X_train, y_train)
# Make predictions on the test data
y_pred = logreg.predict(X_test)
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
# Output the results
print(f"Accuracy of Logistic Regression model: {accuracy:.2f}")
Explanation:
1. Data Loading: The Iris dataset is loaded using load_iris() from sklearn.datasets. This
dataset contains 150 samples from three species of iris flowers, and we use it to
predict the species based on the input features (sepal length, sepal width, petal
length, and petal width).
2. Model Creation: We create a LogisticRegression model. The parameter max_iter=200
is used to ensure the algorithm converges within 200 iterations.
3. Training: The model is trained using the training data (X_train, y_train) via
logreg.fit().
4. Prediction: After training, the model predicts the species for the test data (X_test).
5. Accuracy Calculation: The accuracy of the model is computed by comparing the
predicted labels (y_pred) with the true labels (y_test).
output:
Explanation of Output:
• The accuracy represents the proportion of correctly predicted samples out of the
total test samples.
• In this example, a perfect accuracy of 1.00 means the model was able to correctly
classify all the test samples. This result might happen if the data is relatively
simple and well-separable by logistic regression.
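Beyond a single accuracy figure, logistic regression also exposes per-class probabilities, and sklearn can report per-class precision and recall. A minimal sketch assuming the fitted logreg, the test split, and the loaded data object from the code above:
from sklearn.metrics import classification_report
# Per-class precision, recall and F1 on the test set
print(classification_report(y_test, y_pred, target_names=list(data.target_names)))
# Predicted class probabilities for the first test sample
probabilities = logreg.predict_proba(X_test[:1])
print(f"Class probabilities for the first test sample: {probabilities[0]}")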
9. Implementation of K-Means Clustering
Code:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Load the Iris dataset
data = load_iris()
X = data.data # Features
y = data.target # Target labels
# Initialize the KMeans model with 3 clusters (as we know there are 3 species)
kmeans = KMeans(n_clusters=3, random_state=42)
# Fit the model on the dataset
kmeans.fit(X)
# Get the predicted cluster labels
y_pred = kmeans.labels_
# Get the centroids of the clusters
centroids = kmeans.cluster_centers_
# Calculate the inertia (within-cluster sum of squares)
inertia = kmeans.inertia_
# Output the results
print(f"Cluster Centers (Centroids): \n{centroids}")
print(f"Inertia: {inertia:.2f}")
# Visualize the clusters using the first two features (sepal length and sepal width)
plt.figure(figsize=(8,6))
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y_pred, palette="Set1", s=100)
plt.scatter(centroids[:, 0], centroids[:, 1], marker="X", color="black", s=200, label="Centroids")
plt.title('K-Means Clustering (Iris Dataset)', fontsize=15)
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.legend()
plt.show()
Explanation:
1. Data Loading: The Iris dataset is loaded using load_iris(), which contains 150
samples with 4 features (sepal length, sepal width, petal length, and petal width).
2. Model Creation: We initialize a KMeans model with n_clusters=3 (since we expect
3 clusters, corresponding to 3 species of Iris).
3. Model Training: The fit() method is used to fit the KMeans model on the data.
4. Cluster Labels: We get the predicted labels (y_pred) that indicate which cluster
each data point belongs to.
5. Centroids: The cluster_centers_ attribute gives the coordinates of the cluster
centroids.
6. Inertia: The inertia_ represents the sum of squared distances of samples to their
closest cluster center, which is a measure of the model's "tightness" or how well
the samples fit their clusters.
7. Visualization: We plot the clusters based on the first two features (sepal length
and sepal width) and mark the centroids on the scatter plot.
output:
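Here the number of clusters is known in advance; when it is not, the elbow method fits K-Means for a range of k values and looks for the bend in the inertia curve. A minimal sketch assuming X, KMeans, and plt from the code above:
# Elbow method: fit K-Means for k = 1..8 and plot the inertia
inertias = []
k_values = range(1, 9)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)
plt.plot(k_values, inertias, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.show()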
10. Performance analysis of Classification Algorithms on a specific
dataset (Mini Project)
Code:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Load the Iris dataset
data = load_iris()
X = data.data # Features
y = data.target # Target labels
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the classifiers
models = {
"Logistic Regression": LogisticRegression(max_iter=200),
"K-Nearest Neighbors": KNeighborsClassifier(),
"Decision Tree": DecisionTreeClassifier(),
"Support Vector Machine": SVC(),
"Random Forest": RandomForestClassifier()
}
# Dictionary to store performance metrics
performance_metrics = {}
# Train each model, make predictions, and evaluate
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    performance_metrics[model_name] = {
        "Accuracy": accuracy,
        "Precision": precision,
        "Recall": recall,
        "F1 Score": f1
    }
# Display the performance metrics for each classifier
import pandas as pd
performance_df = pd.DataFrame(performance_metrics).T
print(performance_df)
Explanation:
1. Dataset Loading: The Iris dataset is loaded using load_iris() from sklearn.datasets,
and it's split into features (X) and target labels (y).
2. Model Initialization: We initialize five classification models:
o Logistic Regression
o K-Nearest Neighbors (KNN)
o Decision Tree Classifier
o Support Vector Machine (SVM)
o Random Forest Classifier
3. Model Training & Prediction: Each model is trained using the training data
(X_train, y_train), and predictions are made on the test set (X_test).
4. Evaluation Metrics: The models are evaluated using:
o Accuracy: The proportion of correct predictions.
o Precision: The proportion of positive predictions that are actually positive.
o Recall: The proportion of actual positive instances that were predicted
correctly.
o F1-Score: The harmonic mean of precision and recall.
5. Performance Output: The metrics are stored in a dictionary and converted to a
pandas DataFrame for easy visualization.
Output:
Explanation of Output:
• Accuracy: This metric indicates how many of the predictions made by the model
were correct.
• Precision: Precision is a measure of how many of the predicted positive classes
were actually positive. A higher precision means fewer false positives.
• Recall: Recall measures how many of the actual positive classes were correctly
predicted. A higher recall means fewer false negatives.
• F1-Score: The F1-score provides a balance between precision and recall. It is
useful when you need to balance both false positives and false negatives.
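A single 80/20 split can be optimistic on a dataset as small as Iris; as a follow-up, the same classifiers can be compared with k-fold cross-validation. A minimal sketch assuming the models dictionary and the full X and y from the code above:
from sklearn.model_selection import cross_val_score
# Mean 5-fold cross-validation accuracy for each classifier
for model_name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{model_name}: mean accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")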