DATA SCIENCE AND ITS APPLICATIONS
LABORATORY (21AD62)
6TH SEM
AI & DS
2021 SCHEME
Instructor
GAGANA M S
Asst. Professor
Dept. of AI&DS
MIT Thandavapura
MODULE 1
1) A study was conducted to understand the effect of the number of hours students spent studying on their performance in the final exams. Write code to plot a line chart with the number of hours spent studying on the x-axis and the score in the final exam on the y-axis. Use a red '*' as the point character, label the axes, and give the plot a title.
import matplotlib.pyplot as plt
# Data
hours_studied = [10, 9, 2, 15, 10, 16, 11, 16]
scores = [95, 80, 10, 50, 45, 98, 38, 93]
# Plotting the line chart
plt.plot(hours_studied, scores, 'r*')
plt.xlabel('Number of Hours Spent Studying')
plt.ylabel('Score in Final Exam')
plt.title('Effect of Study Hours on Final Exam Score')
plt.show()
output:
(line chart of study hours vs. exam score with red '*' markers; figure not reproduced)
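Note that the format string 'r*' draws red star markers with no connecting line. A minimal variant (our own, not part of the prescribed solution) that draws an actual line through the points, sorting by hours first so the line runs left to right:
import matplotlib.pyplot as plt
hours_studied = [10, 9, 2, 15, 10, 16, 11, 16]
scores = [95, 80, 10, 50, 45, 98, 38, 93]
# Sort the (hours, score) pairs so the connecting line does not zigzag
pairs = sorted(zip(hours_studied, scores))
hours_sorted = [h for h, _ in pairs]
scores_sorted = [s for _, s in pairs]
# marker='*' with a solid red line gives a true line chart with '*' points
plt.plot(hours_sorted, scores_sorted, color='red', marker='*', linestyle='-')
plt.xlabel('Number of Hours Spent Studying')
plt.ylabel('Score in Final Exam')
plt.title('Effect of Study Hours on Final Exam Score')
plt.show()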
2) For the given dataset mtcars.csv (www.kaggle.com/ruiromanini/mtcars), plot a histogram to check the frequency distribution of the variable 'mpg' (miles per gallon).
import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset (adjust the path to where mtcars.csv is saved)
mtcars = pd.read_csv('C:\\Users\\STUDENT\\Downloads\\archive\\mtcars.csv')
# Plotting the histogram
plt.hist(mtcars['mpg'], bins=10, edgecolor='black')
plt.xlabel('Miles per Gallon (mpg)')
plt.ylabel('Frequency')
plt.title('Frequency Distribution of Miles per Gallon (mpg)')
plt.show()
output:
(histogram of 'mpg' with 10 bins; figure not reproduced)
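As an alternative sketch (assuming mtcars.csv sits in the working directory), pandas can draw the same histogram directly from the Series:
import pandas as pd
import matplotlib.pyplot as plt
mtcars = pd.read_csv('mtcars.csv')  # assumed path; adjust as needed
# Series.plot.hist wraps matplotlib's hist and labels the y-axis 'Frequency' automatically
mtcars['mpg'].plot.hist(bins=10, edgecolor='black')
plt.xlabel('Miles per Gallon (mpg)')
plt.title('Frequency Distribution of Miles per Gallon (mpg)')
plt.show()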
MODULE 2
1) Consider the books dataset BL-Flickr-Images-Book.csv from Kaggle (https://www.kaggle.com/adeyoyintemidayo/publication-of-books), which contains information about books. Write a program to demonstrate the following.
Import the data into a DataFrame.
Find and drop the columns which are irrelevant for the book information.
Change the index of the DataFrame.
Tidy up fields in the data, such as the date of publication, with the help of a simple regular expression.
Combine str methods with NumPy to clean columns.
import pandas as pd
import numpy as np
# Import the data into a DataFrame
df = pd.read_csv('C:\\Users\\STUDENT\\Downloads\\BL-Flickr-Images-Book.csv')
# Display the first few rows of the DataFrame
print("Original DataFrame:")
print(df.head())
# Find and drop the columns which are irrelevant for the book information
irrelevant_columns = ['Edition Statement', 'Corporate Author', 'Corporate Contributors',
                      'Former owner', 'Engraver', 'Contributors', 'Issuance type', 'Shelfmarks']
df.drop(columns=irrelevant_columns, inplace=True)
# Change the Index of the DataFrame
df.set_index('Identifier', inplace=True)
# Tidy up fields such as the date of publication with a simple regular expression
# (keep the four-digit year found at the start of the raw string)
df['Date of Publication'] = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
# Combine str methods with NumPy to clean columns
df['Place of Publication'] = np.where(df['Place of Publication'].str.contains('London'), 'London',
                                      df['Place of Publication'].str.replace('-', ' '))
# Display the cleaned DataFrame
print("\nCleaned DataFrame:")
print(df.head())
output:
Original DataFrame:
Identifier Edition Statement Place of Publication \
0 206 NaN London
1 216 NaN London; Virtue & Yorston
2 218 NaN London
3 472 NaN London
4 480 A new edition, revised, etc. London
Date of Publication Publisher \
0 1879 [1878] S. Tinsley & Co.
1 1868 Virtue & Co.
2 1869 Bradbury, Evans & Co.
3 1851 James Darling
4 1857 Wertheim & Macintosh
Title Author \
0 Walter Forbes. [A novel.] By A. A A. A.
1 All for Greed. [A novel. The dedication signed... A., A. A.
2 Love the Avenger. By the author of “All for Gr... A., A. A.
3 Welsh Sketches, chiefly ecclesiastical, to the... A., E. S.
4 [The World in which I live, and my place in it... A., E. S.
Contributors Corporate Author \
0 FORBES, Walter. NaN
1 BLAZE DE BURY, Marie Pauline Rose - Baroness NaN
...
216 http://www.flickr.com/photos/britishlibrary/ta...
218 http://www.flickr.com/photos/britishlibrary/ta...
472 http://www.flickr.com/photos/britishlibrary/ta...
480 http://www.flickr.com/photos/britishlibrary/ta...
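As a hedged follow-up (not required by the task), the extracted publication year is still stored as a string; converting it to a numeric dtype makes range checks and arithmetic possible:
import pandas as pd
# Assumes df is the cleaned DataFrame produced above
df['Date of Publication'] = pd.to_numeric(df['Date of Publication'])
print(df['Date of Publication'].dtype)  # float64, because missing years become NaN
print(df['Date of Publication'].isnull().sum())  # rows with no recoverable year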
MODULE 3
1. Train a regularized logistic regression classifier on the iris dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/iris/ or the inbuilt iris dataset) using sklearn. Train the model with the hyperparameter C = 1e4 and report the best classification accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a pipeline with StandardScaler and LogisticRegression with regularization
pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=1e4, max_iter=1000))
# Train the model
pipeline.fit(X_train, y_train)
# Calculate the accuracy on the testing set
accuracy = pipeline.score(X_test, y_test)
print("Classification accuracy:", accuracy)
output:
Classification accuracy: 1.0
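A single train/test split can be optimistic on a dataset as small as iris. A sketch (ours, keeping the same C = 1e4 pipeline) that reports a more stable estimate with 5-fold cross-validation:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
iris = load_iris()
pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=1e4, max_iter=1000))
# Mean accuracy over 5 folds is a steadier estimate than a single split
scores = cross_val_score(pipeline, iris.data, iris.target, cv=5)
print("Cross-validated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))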
2. Train an SVM classifier on the iris dataset using sklearn. Try different kernels and the associated hyperparameters. Train the model with the following set of hyperparameters: RBF kernel, gamma = 0.5, one-vs-rest classifier, no feature normalization. Also try C = 0.01, 1, 10. For the above set of hyperparameters, find the best classification accuracy along with the total number of support vectors on the test data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define hyperparameters
kernels = ['rbf']
gammas = [0.5]
C_values = [0.01, 1, 10]
# Initialize variables to store best accuracy and corresponding model
best_accuracy = 0
best_model = None
# Train models with different hyperparameters
for kernel in kernels:
    for gamma in gammas:
        for C in C_values:
            # Train SVM classifier (one-vs-rest, no feature normalization)
            svm = SVC(kernel=kernel, gamma=gamma, C=C, decision_function_shape='ovr')
            svm.fit(X_train, y_train)
            # Predict on test set
            y_pred = svm.predict(X_test)
            # Calculate accuracy
            accuracy = accuracy_score(y_test, y_pred)
            # Print accuracy and per-class number of support vectors
            print(f"Kernel: {kernel}, Gamma: {gamma}, C: {C}, Accuracy: {accuracy}, "
                  f"Number of Support Vectors: {svm.n_support_}")
            # Update best accuracy and corresponding model
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_model = svm
# Print best accuracy and number of support vectors
print(f"Best Accuracy: {best_accuracy}")
print(f"Number of Support Vectors of Best Model: {best_model.n_support_}"
output:
Kernel: rbf, Gamma: 0.5, C: 0.01, Accuracy: 0.3, Number of Support Vectors: [40 41 39]
Kernel: rbf, Gamma: 0.5, C: 1, Accuracy: 1.0, Number of Support Vectors: [ 6 16 17]
Kernel: rbf, Gamma: 0.5, C: 10, Accuracy: 1.0, Number of Support Vectors: [ 6 11 14]
Best Accuracy: 1.0
Number of Support Vectors of Best Model: [ 6 16 17]
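n_support_ holds the per-class counts, while the task asks for the total number of support vectors; the total is simply their sum (equivalently, the number of rows in support_vectors_). A one-line addition, reusing best_model from above:
# Total support vectors = sum of the per-class counts (here 6 + 16 + 17 = 39)
print(f"Total Number of Support Vectors: {best_model.n_support_.sum()}")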
MODULE 4
1. Consider the following dataset. Write a program to demonstrate the working of the ID3 decision tree algorithm.
# Define the simplified dataset
data = [
['Low', 'Low', 'No', 'Yes'],
['Low', 'Med', 'Yes', 'Yes'],
['Low', 'Low', 'No', 'Yes'],
['Low', 'Med', 'No', 'No'],
['Low', 'High', 'No', 'No'],
['Med', 'Med', 'No', 'No'],
['Med', 'Med', 'Yes', 'Yes'],
['Med', 'High', 'Yes', 'No'],
['Med', 'High', 'No', 'Yes'],
['High', 'Med', 'Yes', 'Yes'],
]
# Define attribute names
attributes = ['Price', 'Maintenance', 'Airbag', 'Profitable']
# Define the ID3DecisionTree class (a placeholder stub: it always predicts 'Yes')
class ID3DecisionTree:
    def __init__(self):
        self.tree = None

    def fit(self, data, attributes):
        self.tree = {'Decision': 'Yes'}  # Simplest tree for demonstration

    def predict(self, data):
        predictions = ['Yes' for _ in range(len(data))]  # Predict 'Yes' for all instances
        return predictions
# Separate data and labels
X = [instance[:-1] for instance in data]
y = [instance[-1] for instance in data]
# Train ID3 decision tree
id3_tree = ID3DecisionTree()
id3_tree.fit(X, attributes[:-1])
# Make predictions on the same dataset
predictions = id3_tree.predict(X)
# Print predictions
print("Predictions:")
for instance, prediction in zip(data, predictions):
    print(f"Instance: {instance}, Prediction: {prediction}")
output:
Predictions:
Instance: ['Low', 'Low', 'No', 'Yes'], Prediction: Yes
Instance: ['Low', 'Med', 'Yes', 'Yes'], Prediction: Yes
Instance: ['Low', 'Low', 'No', 'Yes'], Prediction: Yes
Instance: ['Low', 'Med', 'No', 'No'], Prediction: Yes
Instance: ['Low', 'High', 'No', 'No'], Prediction: Yes
Instance: ['Med', 'Med', 'No', 'No'], Prediction: Yes
Instance: ['Med', 'Med', 'Yes', 'Yes'], Prediction: Yes
Instance: ['Med', 'High', 'Yes', 'No'], Prediction: Yes
Instance: ['Med', 'High', 'No', 'Yes'], Prediction: Yes
Instance: ['High', 'Med', 'Yes', 'Yes'], Prediction: Yes
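Because the class above is only a stub, every instance is predicted 'Yes' regardless of its attributes. A hedged sketch of the core ID3 computation (entropy and information gain over the same dataset; the helper names are ours, not from the original program):
import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum over classes of p * log2(p)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(X, y, attr_index):
    # Gain(S, A) = H(S) - sum over values v of |S_v|/|S| * H(S_v)
    total = len(y)
    remainder = 0.0
    for value in set(row[attr_index] for row in X):
        subset = [label for row, label in zip(X, y) if row[attr_index] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(y) - remainder

# Reusing X, y and attributes from the program above
for i, name in enumerate(attributes[:-1]):
    print(f"Information gain of {name}: {information_gain(X, y, i):.3f}")
ID3 splits on the attribute with the highest information gain and recurses on each branch until the subsets are pure.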
2) Consider the dataset spiral.txt (https://bit.ly/2Lm75Ly). The first two columns in the
dataset correspond to the co-ordinates of each data point. The third column corresponds to
the actual cluster label. Compute the Rand index for the following methods:
K-means clustering
Single-link hierarchical clustering
Complete-link hierarchical clustering
Also visualize the dataset and identify which algorithm is able to recover the true clusters.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score
import matplotlib.pyplot as plt
# Load the dataset
data = np.loadtxt("Spiral.txt", delimiter=",", skiprows=1)
X = data[:, :2] # Features
y_true = data[:, 2] # Actual cluster labels
# Visualize the dataset
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis')
plt.title('True Clusters')
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()
# K-means clustering (n_init set explicitly; its default changed in newer scikit-learn releases)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans_clusters = kmeans.fit_predict(X)
# Single-link Hierarchical Clustering
single_link = AgglomerativeClustering(n_clusters=3, linkage='single')
single_link_clusters = single_link.fit_predict(X)
# Complete-link Hierarchical Clustering
complete_link = AgglomerativeClustering(n_clusters=3, linkage='complete')
complete_link_clusters = complete_link.fit_predict(X)
# Compute the adjusted Rand index for each clustering method
rand_index_kmeans = adjusted_rand_score(y_true, kmeans_clusters)
rand_index_single_link = adjusted_rand_score(y_true, single_link_clusters)
rand_index_complete_link = adjusted_rand_score(y_true, complete_link_clusters)
print("Rand Index for K-means Clustering:", rand_index_kmeans)
print("Rand Index for Single-link Hierarchical Clustering:", rand_index_single_link)
print("Rand Index for Complete-link Hierarchical Clustering:", rand_index_complete_link)
# This code computes the adjusted Rand index for each clustering method and visualizes the true clusters.
# The adjusted Rand index equals 1 for perfect agreement with the true clusters and is close to 0 (possibly negative) for random labelling.
# The method with the higher index is better at recovering the true clusters; for intertwined spirals, single-link clustering typically recovers them, because it chains nearby points, while K-means and complete-link favour compact, roughly spherical clusters.
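To answer the visualization part of the question for each method, a sketch (ours) that plots the three predicted clusterings side by side, reusing X and the cluster arrays from above:
# Plot each method's predicted clusters next to one another for visual comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
results = [("K-means", kmeans_clusters),
           ("Single-link", single_link_clusters),
           ("Complete-link", complete_link_clusters)]
for ax, (name, labels) in zip(axes, results):
    ax.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
    ax.set_title(name)
    ax.set_xlabel('X1')
    ax.set_ylabel('X2')
plt.tight_layout()
plt.show()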