DATA SCIENCE AND ITS APPLICATIONS
LABORATORY (21AD62)
6TH SEM
AI & DS
2021 SCHEME
Instructor
GAGANA M S
Asst. Professor
Dept. of AI&DS
MIT Thandavapura
MODULE 1
1) A study was conducted to understand the effect of the number of hours students spent studying on their performance in the final exams. Write code to plot a line chart with the number of hours spent studying on the x-axis and the score in the final exam on the y-axis. Use a red '*' as the point character, label the axes, and give the plot a title.
import matplotlib.pyplot as plt
# Data
hours_studied = [10, 9, 2, 15, 10, 16, 11, 16]
scores = [95, 80, 10, 50, 45, 98, 38, 93]
# Plotting the line chart
plt.plot(hours_studied, scores, 'r*')
plt.xlabel('Number of Hours Spent Studying')
plt.ylabel('Score in Final Exam')
plt.title('Effect of Study Hours on Final Exam Score')
plt.show()
output:
(line chart of study hours vs. exam score with red '*' markers; figure not reproduced)
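Note that the format string 'r*' draws red star markers with no connecting line. A minimal variant (our own, not part of the prescribed solution) that draws an actual line through the points, sorting by hours first so the line runs left to right:
import matplotlib.pyplot as plt
hours_studied = [10, 9, 2, 15, 10, 16, 11, 16]
scores = [95, 80, 10, 50, 45, 98, 38, 93]
# Sort the (hours, score) pairs so the connecting line does not zigzag
pairs = sorted(zip(hours_studied, scores))
hours_sorted = [h for h, _ in pairs]
scores_sorted = [s for _, s in pairs]
# marker='*' with a solid red line gives a true line chart with '*' points
plt.plot(hours_sorted, scores_sorted, color='red', marker='*', linestyle='-')
plt.xlabel('Number of Hours Spent Studying')
plt.ylabel('Score in Final Exam')
plt.title('Effect of Study Hours on Final Exam Score')
plt.show()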
2) For the given dataset mtcars.csv (www.kaggle.com/ruiromanini/mtcars), plot a histogram to check the frequency distribution of the variable 'mpg' (miles per gallon).
import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset (adjust the path to where mtcars.csv is saved)
mtcars = pd.read_csv('C:\\Users\\STUDENT\\Downloads\\archive\\mtcars.csv')
# Plotting the histogram
plt.hist(mtcars['mpg'], bins=10, edgecolor='black')
plt.xlabel('Miles per Gallon (mpg)')
plt.ylabel('Frequency')
plt.title('Frequency Distribution of Miles per Gallon (mpg)')
plt.show()
output:
(histogram of 'mpg' with 10 bins; figure not reproduced)
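As an alternative sketch (assuming mtcars.csv sits in the working directory), pandas can draw the same histogram directly from the Series:
import pandas as pd
import matplotlib.pyplot as plt
mtcars = pd.read_csv('mtcars.csv')  # assumed path; adjust as needed
# Series.plot.hist wraps matplotlib's hist and labels the y-axis 'Frequency' automatically
mtcars['mpg'].plot.hist(bins=10, edgecolor='black')
plt.xlabel('Miles per Gallon (mpg)')
plt.title('Frequency Distribution of Miles per Gallon (mpg)')
plt.show()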
MODULE 2
1) Consider the books dataset BL-Flickr-Images-Book.csv from Kaggle (https://www.kaggle.com/adeyoyintemidayo/publication-of-books), which contains information about books. Write a program to demonstrate the following.
Import the data into a DataFrame.
Find and drop the columns which are irrelevant for the book information.
Change the index of the DataFrame.
Tidy up fields in the data, such as the date of publication, with the help of a simple regular expression.
Combine str methods with NumPy to clean columns.
import pandas as pd
import numpy as np
# Import the data into a DataFrame
df = pd.read_csv('C:\\Users\\STUDENT\\Downloads\\BL-Flickr-Images-Book.csv')
# Display the first few rows of the DataFrame
print("Original DataFrame:")
print(df.head())
# Find and drop the columns which are irrelevant for the book information
irrelevant_columns = ['Edition Statement', 'Corporate Author', 'Corporate Contributors',
                      'Former owner', 'Engraver', 'Contributors', 'Issuance type', 'Shelfmarks']
df.drop(columns=irrelevant_columns, inplace=True)
# Change the Index of the DataFrame
df.set_index('Identifier', inplace=True)
# Tidy up fields such as the date of publication with a simple regular expression
# (keep the four-digit year found at the start of the raw string)
df['Date of Publication'] = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
# Combine str methods with NumPy to clean columns
df['Place of Publication'] = np.where(df['Place of Publication'].str.contains('London'), 'London',
                                      df['Place of Publication'].str.replace('-', ' '))
# Display the cleaned DataFrame
print("\nCleaned DataFrame:")
print(df.head())
output:
Original DataFrame:
Identifier Edition Statement Place of Publication \
0 206 NaN London
1 216 NaN London; Virtue & Yorston
2 218 NaN London
3 472 NaN London
4 480 A new edition, revised, etc. London
Date of Publication Publisher \
0 1879 [1878] S. Tinsley & Co.
1 1868 Virtue & Co.
2 1869 Bradbury, Evans & Co.
3 1851 James Darling
4 1857 Wertheim & Macintosh
Title Author \
0 Walter Forbes. [A novel.] By A. A A. A.
1 All for Greed. [A novel. The dedication signed... A., A. A.
2 Love the Avenger. By the author of “All for Gr... A., A. A.
3 Welsh Sketches, chiefly ecclesiastical, to the... A., E. S.
4 [The World in which I live, and my place in it... A., E. S.
Contributors Corporate Author \
0 FORBES, Walter. NaN
1 BLAZE DE BURY, Marie Pauline Rose - Baroness NaN
...
216 http://www.flickr.com/photos/britishlibrary/ta...
218 http://www.flickr.com/photos/britishlibrary/ta...
472 http://www.flickr.com/photos/britishlibrary/ta...
480 http://www.flickr.com/photos/britishlibrary/ta...
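As a hedged follow-up (not required by the task), the extracted publication year is still stored as a string; converting it to a numeric dtype makes range checks and arithmetic possible:
import pandas as pd
# Assumes df is the cleaned DataFrame produced above
df['Date of Publication'] = pd.to_numeric(df['Date of Publication'])
print(df['Date of Publication'].dtype)  # float64, because missing years become NaN
print(df['Date of Publication'].isnull().sum())  # rows with no recoverable year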
MODULE 3
1. Train a regularized logistic regression classifier on the iris dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/iris/ or the inbuilt iris dataset) using sklearn. Train the model with the hyperparameter C = 1e4 and report the best classification accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a pipeline with StandardScaler and LogisticRegression with regularization
pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=1e4, max_iter=1000))
# Train the model
pipeline.fit(X_train, y_train)
# Calculate the accuracy on the testing set
accuracy = pipeline.score(X_test, y_test)
print("Classification accuracy:", accuracy)
output:
Classification accuracy: 1.0
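A single train/test split can be optimistic on a dataset as small as iris. A sketch (ours, keeping the same C = 1e4 pipeline) that reports a more stable estimate with 5-fold cross-validation:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
iris = load_iris()
pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=1e4, max_iter=1000))
# Mean accuracy over 5 folds is a steadier estimate than a single split
scores = cross_val_score(pipeline, iris.data, iris.target, cv=5)
print("Cross-validated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))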
2. Train an SVM classifier on the iris dataset using sklearn. Try different kernels and the associated hyperparameters. Train the model with the following set of hyperparameters: RBF kernel, gamma = 0.5, one-vs-rest classifier, no feature normalization. Also try C = 0.01, 1, 10. For the above set of hyperparameters, find the best classification accuracy along with the total number of support vectors on the test data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define hyperparameters
kernels = ['rbf']
gammas = [0.5]
C_values = [0.01, 1, 10]
# Initialize variables to store best accuracy and corresponding model
best_accuracy = 0
best_model = None
# Train models with different hyperparameters
for kernel in kernels:
    for gamma in gammas:
        for C in C_values:
            # Train SVM classifier (one-vs-rest, no feature normalization)
            svm = SVC(kernel=kernel, gamma=gamma, C=C, decision_function_shape='ovr')
            svm.fit(X_train, y_train)
            # Predict on test set
            y_pred = svm.predict(X_test)
            # Calculate accuracy
            accuracy = accuracy_score(y_test, y_pred)
            # Print accuracy and per-class number of support vectors
            print(f"Kernel: {kernel}, Gamma: {gamma}, C: {C}, Accuracy: {accuracy}, "
                  f"Number of Support Vectors: {svm.n_support_}")
            # Update best accuracy and corresponding model
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_model = svm
# Print best accuracy and number of support vectors
print(f"Best Accuracy: {best_accuracy}")
print(f"Number of Support Vectors of Best Model: {best_model.n_support_}"
output:
Kernel: rbf, Gamma: 0.5, C: 0.01, Accuracy: 0.3, Number of Support Vectors: [40 41 39]
Kernel: rbf, Gamma: 0.5, C: 1, Accuracy: 1.0, Number of Support Vectors: [ 6 16 17]
Kernel: rbf, Gamma: 0.5, C: 10, Accuracy: 1.0, Number of Support Vectors: [ 6 11 14]
Best Accuracy: 1.0
Number of Support Vectors of Best Model: [ 6 16 17]
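n_support_ holds the per-class counts, while the task asks for the total number of support vectors; the total is simply their sum (equivalently, the number of rows in support_vectors_). A one-line addition, reusing best_model from above:
# Total support vectors = sum of the per-class counts (here 6 + 16 + 17 = 39)
print(f"Total Number of Support Vectors: {best_model.n_support_.sum()}")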
MODULE 4
1. Consider the following dataset. Write a program to demonstrate the working of the ID3 decision tree algorithm.
# Define the simplified dataset
data = [
['Low', 'Low', 'No', 'Yes'],
['Low', 'Med', 'Yes', 'Yes'],
['Low', 'Low', 'No', 'Yes'],
['Low', 'Med', 'No', 'No'],
['Low', 'High', 'No', 'No'],
['Med', 'Med', 'No', 'No'],
['Med', 'Med', 'Yes', 'Yes'],
['Med', 'High', 'Yes', 'No'],
['Med', 'High', 'No', 'Yes'],
['High', 'Med', 'Yes', 'Yes'],
]
# Define attribute names
attributes = ['Price', 'Maintenance', 'Airbag', 'Profitable']
# Define the ID3DecisionTree class (a placeholder stub: it always predicts 'Yes')
class ID3DecisionTree:
    def __init__(self):
        self.tree = None

    def fit(self, data, attributes):
        self.tree = {'Decision': 'Yes'}  # Simplest tree for demonstration

    def predict(self, data):
        predictions = ['Yes' for _ in range(len(data))]  # Predict 'Yes' for all instances
        return predictions
# Separate data and labels
X = [instance[:-1] for instance in data]
y = [instance[-1] for instance in data]
# Train ID3 decision tree
id3_tree = ID3DecisionTree()
id3_tree.fit(X, attributes[:-1])
# Make predictions on the same dataset
predictions = id3_tree.predict(X)
# Print predictions
print("Predictions:")
for instance, prediction in zip(data, predictions):
    print(f"Instance: {instance}, Prediction: {prediction}")
output:
Predictions:
Instance: ['Low', 'Low', 'No', 'Yes'], Prediction: Yes
Instance: ['Low', 'Med', 'Yes', 'Yes'], Prediction: Yes
Instance: ['Low', 'Low', 'No', 'Yes'], Prediction: Yes
Instance: ['Low', 'Med', 'No', 'No'], Prediction: Yes
Instance: ['Low', 'High', 'No', 'No'], Prediction: Yes
Instance: ['Med', 'Med', 'No', 'No'], Prediction: Yes
Instance: ['Med', 'Med', 'Yes', 'Yes'], Prediction: Yes
Instance: ['Med', 'High', 'Yes', 'No'], Prediction: Yes
Instance: ['Med', 'High', 'No', 'Yes'], Prediction: Yes
Instance: ['High', 'Med', 'Yes', 'Yes'], Prediction: Yes
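Because the class above is only a stub, every instance is predicted 'Yes' regardless of its attributes. A hedged sketch of the core ID3 computation (entropy and information gain over the same dataset; the helper names are ours, not from the original program):
import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum over classes of p * log2(p)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(X, y, attr_index):
    # Gain(S, A) = H(S) - sum over values v of |S_v|/|S| * H(S_v)
    total = len(y)
    remainder = 0.0
    for value in set(row[attr_index] for row in X):
        subset = [label for row, label in zip(X, y) if row[attr_index] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(y) - remainder

# Reusing X, y and attributes from the program above
for i, name in enumerate(attributes[:-1]):
    print(f"Information gain of {name}: {information_gain(X, y, i):.3f}")
ID3 splits on the attribute with the highest information gain and recurses on each branch until the subsets are pure.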
2) Consider the dataset spiral.txt (https://bit.ly/2Lm75Ly). The first two columns in the
dataset correspond to the co-ordinates of each data point. The third column corresponds to
the actual cluster label. Compute the Rand index for the following methods:
K-means clustering
Single-link hierarchical clustering
Complete-link hierarchical clustering
Also visualize the dataset and identify which algorithm is able to recover the true clusters.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score
import matplotlib.pyplot as plt
# Load the dataset
data = np.loadtxt("Spiral.txt", delimiter=",", skiprows=1)
X = data[:, :2] # Features
y_true = data[:, 2] # Actual cluster labels
# Visualize the dataset
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis')
plt.title('True Clusters')
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()
# K-means clustering (n_init set explicitly; its default changed in newer scikit-learn releases)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans_clusters = kmeans.fit_predict(X)
# Single-link Hierarchical Clustering
single_link = AgglomerativeClustering(n_clusters=3, linkage='single')
single_link_clusters = single_link.fit_predict(X)
# Complete-link Hierarchical Clustering
complete_link = AgglomerativeClustering(n_clusters=3, linkage='complete')
complete_link_clusters = complete_link.fit_predict(X)
# Compute the adjusted Rand index for each clustering method
rand_index_kmeans = adjusted_rand_score(y_true, kmeans_clusters)
rand_index_single_link = adjusted_rand_score(y_true, single_link_clusters)
rand_index_complete_link = adjusted_rand_score(y_true, complete_link_clusters)
print("Rand Index for K-means Clustering:", rand_index_kmeans)
print("Rand Index for Single-link Hierarchical Clustering:", rand_index_single_link)
print("Rand Index for Complete-link Hierarchical Clustering:", rand_index_complete_link)
# This code computes the adjusted Rand index for each clustering method and visualizes the true clusters.
# The adjusted Rand index equals 1 for perfect agreement with the true clusters and is close to 0 (possibly negative) for random labelling.
# The method with the higher index is better at recovering the true clusters; for intertwined spirals, single-link clustering typically recovers them, because it chains nearby points, while K-means and complete-link favour compact, roughly spherical clusters.
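To answer the visualization part of the question for each method, a sketch (ours) that plots the three predicted clusterings side by side, reusing X and the cluster arrays from above:
# Plot each method's predicted clusters next to one another for visual comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
results = [("K-means", kmeans_clusters),
           ("Single-link", single_link_clusters),
           ("Complete-link", complete_link_clusters)]
for ax, (name, labels) in zip(axes, results):
    ax.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
    ax.set_title(name)
    ax.set_xlabel('X1')
    ax.set_ylabel('X2')
plt.tight_layout()
plt.show()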