DWM Record (Data Science)

The document outlines a series of experiments demonstrating data preprocessing and analysis tasks using Python libraries. It covers loading datasets, handling missing data, dealing with categorical variables, splitting datasets, scaling features, and implementing various similarity measures. Additionally, it includes building models using linear regression, decision trees, and Naïve Bayes classification, as well as generating frequent itemsets and association rules using the Apriori algorithm.


Date

Experiment-1

Aim: Demonstrate the following data preprocessing tasks using python libraries.

a) Loading the dataset

import pandas as pd

dataset = pd.read_excel("age_salary.xls")

Note:

The ‘nan’ you see in some cells of the dataframe denotes the missing fields

b) Classifying the Dependent and Independent Variables

X = dataset.iloc[:,:-1].values #Takes all rows of all columns except the last column
Y = dataset.iloc[:,-1].values # Takes all rows of the last column

c) Dealing with Missing Data

import numpy as np
from sklearn.impute import SimpleImputer


imp = SimpleImputer(missing_values=np.nan, strategy="mean")
X = imp.fit_transform(X)
Y = Y.reshape(-1,1)
Y = imp.fit_transform(Y)
Y = Y.reshape(-1)

Output



Date

Experiment-2

Aim: Demonstrate the following data preprocessing tasks using python libraries.

a) Dealing with Categorical Data

dataset = pd.read_csv("dataset.csv")
X = dataset.iloc[:,[0,2,3]].values
Y = dataset.iloc[:,1].values
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
le_X = LabelEncoder()
X[:,0] = le_X.fit_transform(X[:,0])
# categorical_features was removed from OneHotEncoder in recent scikit-learn,
# so ColumnTransformer one-hot encodes column 0 and passes the rest through
ohe_X = ColumnTransformer([("ohe", OneHotEncoder(), [0])], remainder="passthrough", sparse_threshold=0)
X = ohe_X.fit_transform(X)


Output

Y = le_X.fit_transform(Y)

Output


b) Splitting the Dataset into Training and Testing sets

from sklearn.model_selection import train_test_split


X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 0)

c) Scaling the features

from sklearn.preprocessing import StandardScaler


sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

sc_y = StandardScaler()
Y_train = Y_train.reshape((len(Y_train), 1))
Y_train = sc_y.fit_transform(Y_train)
Y_train = Y_train.ravel()

Output

X_train before scaling :


X_train after scaling :



Date

Experiment-3

Aim: Demonstrate the following Similarity and Dissimilarity Measures using python

a) Pearson’s Correlation

We calculate this metric for the vectors x and y in the following way:
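In standard notation, writing \bar{x} and \bar{y} for the sample means of the two length-n vectors:

r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}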

The Pearson correlation can take values from -1 to +1. It equals exactly +1 or -1 only when the two variables are perfectly linearly related; merely increasing or decreasing together is not enough.

import numpy as np
from scipy.stats import pearsonr
import matplotlib.pyplot as plt

# seed random number generator
np.random.seed(42)
# prepare data
x = np.random.randn(15)
y = x + np.random.randn(15)
# plot x and y with a fitted line
plt.scatter(x, y)
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
plt.xlabel('x')
plt.ylabel('y')
plt.show()


# calculate Pearson's correlation


corr, _ = pearsonr(x, y)
print('Pearsons correlation: %.3f' % corr)

Output: Pearsons correlation: 0.810

b) Cosine Similarity
The cosine similarity calculates the cosine of the angle between two vectors. In order to
calculate the cosine similarity we use the following formula:
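In standard notation, for vectors x and y of length n:

\text{cos\_sim}(x, y) = \frac{x \cdot y}{\lVert x \rVert\,\lVert y \rVert} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}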

Recall the cosine function: on the left the red vectors point at different angles and the graph
on the right shows the resulting function.


Accordingly, the cosine similarity can take on values between -1 and +1. If the vectors point
in the exact same direction, the cosine similarity is +1. If the vectors point in opposite
directions, the cosine similarity is -1

Implementation
from sklearn.metrics.pairwise import cosine_similarity
cos_sim = cosine_similarity(x.reshape(1,-1),y.reshape(1,-1))
print('Cosine similarity: %.3f' % cos_sim)

Output: Cosine similarity: 0.773


c) Jaccard Similarity

Cosine similarity is for comparing two real-valued vectors, but Jaccard similarity is for
comparing two binary vectors (sets).

In set theory it is often helpful to see a visualization of the formula:
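In standard notation, for sets A and B:

J(A, B) = \frac{|A \cap B|}{|A \cup B|}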


We can see that the Jaccard similarity divides the size of the intersection by the size of the
union of the sample sets.

Implementation in Python
from sklearn.metrics import jaccard_score
A = [1, 1, 1, 0]
B = [1, 1, 0, 1]
jacc = jaccard_score(A,B)
print("Jaccard similarity: %.3f" % jacc)

Output: Jaccard similarity: 0.500
(Two positions are 1 in both A and B, while four positions are 1 in at least one of them, so 2/4 = 0.5.)

d) Euclidean Distance

The Euclidean distance is a straight-line distance between two vectors.

For the two vectors x and y, this can be computed as follows:
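In standard notation:

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}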

Implementation in Python
from scipy.spatial import distance
dst = distance.euclidean(x,y)
print("Euclidean distance: %.3f" % dst)


output
Euclidean distance: 3.273

e)Manhattan Distance

We calculate the Manhattan distance as follows:
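In standard notation:

d(x, y) = \sum_{i=1}^{n} |x_i - y_i|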

In many ML applications Euclidean distance is the metric of choice. However, for high-dimensional data the Manhattan distance is often preferable, as it yields more robust results.

Implementation in Python
from scipy.spatial import distance
dst = distance.cityblock(x,y)
print("Manhattan distance: %.3f" % dst)

Output: Manhattan distance: 10.468



Date

Experiment-4

Aim: Build a model using linear regression algorithm on any dataset.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import seaborn as sns
from matplotlib import rcParams

%matplotlib inline
%pylab inline
Populating the interactive namespace from numpy and matplotlib

df = pd.read_csv('data.csv')
df.head()

#Reading the csv file from Kaggle using pandas (pd.read_csv).


index  id          date             price     bedrooms  bathrooms  sqft_living  sqft_lot
0      7129300520  20141013T000000  221900.0  3         1.00       1180         5650
1      6414100192  20141209T000000  538000.0  3         2.25       2570         7242
2      5631500400  20150225T000000  180000.0  2         1.00       770          10000
3      2487200875  20141209T000000  604000.0  4         3.00       1960         5000
4      1954400510  20150218T000000  510000.0  3         2.00       1680         8080


# Checking to see if any of our data has null values. If there were any, we'd
# drop or filter the null values out.
df.isnull().any()

id False
date False
price False
bedrooms False
bathrooms False
sqft_living False
sqft_lot False
dtype: bool
# Checking out the data types for each of our variables. We want to get a sense of
# whether or not data is numerical (int64, float64) or not (object).
df.dtypes

id int64
date object
price float64
bedrooms int64
bathrooms float64
sqft_living int64
sqft_lot int64
dtype: object

# Next: Simple exploratory analysis and regression results.


df.describe()


        index          price  bedrooms  bathrooms  sqft_living     sqft_lot

count     5.0            5.0       5.0       5.00          5.0          5.0
mean      2.0       410780.0       3.0       1.85       1632.0       7194.4
std   1.581139  195127.245663  0.707107   0.858778   695.895107  1991.137062
min       0.0       180000.0       2.0       1.00        770.0       5000.0
25%       1.0       221900.0       3.0       1.00       1180.0       5650.0
50%       2.0       510000.0       3.0       2.00       1680.0       7242.0
75%       3.0       538000.0       3.0       2.25       1960.0       8080.0
max       4.0       604000.0       4.0       3.00       2570.0      10000.0

fig = plt.figure(figsize=(12, 6))


sqft = fig.add_subplot(121)
cost = fig.add_subplot(122)

sqft.hist(df.sqft_living, bins=80)
sqft.set_xlabel('Ft^2')
sqft.set_title("Histogram of House Square Footage")

cost.hist(df.price, bins=80)
cost.set_xlabel('Price ($)')
cost.set_title("Histogram of Housing Prices")

plt.show()
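The histograms above complete the exploratory step; the aim, however, asks for a regression model. A minimal sketch that fits price against living area with scikit-learn (column names taken from the dataframe above) could look like this:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# predict price from the square footage of the living area
X = df[['sqft_living']].values
y = df['price'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

reg = LinearRegression()
reg.fit(X_train, y_train)

print('Intercept:', reg.intercept_)
print('Coefficient:', reg.coef_[0])
print('R^2 on the test set:', reg.score(X_test, y_test))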



Date

Experiment-5

Aim: Build a classification model using Decision Tree algorithm on iris dataset

from sklearn.datasets import load_iris


from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.tree import export_graphviz
from six import StringIO #changed
from IPython.display import Image
from pydot import graph_from_dot_data
import pandas as pd
import numpy as np
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Categorical.from_codes(iris.target, iris.target_names)
X

index sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)

0 5.1 3.5 1.4 0.2

1 4.9 3.0 1.4 0.2

2 4.7 3.2 1.3 0.2

3 4.6 3.1 1.5 0.2

4 5.0 3.6 1.4 0.2

... ... ... ... ...

145 6.7 3.0 5.2 2.3


146 6.3 2.5 5.0 1.9

147 6.5 3.0 5.2 2.0

148 6.2 3.4 5.4 2.3

149 5.9 3.0 5.1 1.8

150 rows × 4 columns

y = pd.get_dummies(y)
y

index setosa versicolor virginica

0 1 0 0

1 1 0 0

2 1 0 0

3 1 0 0

4 1 0 0

... ... ... ...

145 0 0 1

146 0 0 1

147 0 0 1

148 0 0 1

149 0 0 1

150 rows × 3 columns


X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)


dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
dot_data = StringIO()
export_graphviz(dt, out_file=dot_data, feature_names=iris.feature_names)
(graph, ) = graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
output

Let's see how our decision tree does when it's presented with test data.
y_pred = dt.predict(X_test)
species = np.array(y_test).argmax(axis=1)
predictions = np.array(y_pred).argmax(axis=1)
confusion_matrix(species, predictions)
output
array([[13, 0, 0],
[ 0, 15, 1],
[ 0, 0, 9]])



Date

Experiment-6

Aim: Apply Naïve Bayes Classification algorithm on any dataset

# Assigning features and label variables


weather=['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast','Sunny','Sunny',
'Rainy','Sunny','Overcast','Overcast','Rainy']
temp=['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mild','Mild','Hot','Mild']

play=['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No']
# Import LabelEncoder
from sklearn import preprocessing
#creating labelEncoder
le = preprocessing.LabelEncoder()
# Converting string labels into numbers.
weather_encoded=le.fit_transform(weather)
print ("weather:",weather_encoded)
# Converting string labels into numbers
temp_encoded=le.fit_transform(temp)
label=le.fit_transform(play)

print( "Temp:",temp_encoded)
print( "Play:",label)
#Combining weather and temp into a single list of tuples using zip

features=list(zip(weather_encoded,temp_encoded))
print("weather,temp:" ,features)
#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB

#Create a Gaussian Classifier


model = GaussianNB()

# Train the model using the training sets


model.fit(features,label)

#Predict Output
predicted= model.predict([[0,2]]) # 0:Overcast, 2:Mild
print ("Predicted Value:", predicted)

output
weather: [2 2 0 1 1 1 0 2 2 1 2 0 0 1]
Temp: [1 1 1 2 0 0 0 2 0 2 2 2 1 2]
Play: [0 0 1 1 1 0 1 0 1 1 1 1 1 0]
Weather,temp: [(2, 1), (2, 1), (0, 1), (1, 2), (1, 0), (1, 0), (0, 0), (2, 2), (2, 0), (1, 2), (2, 2), (0,
2), (0, 1), (1, 2)]
Predicted Value: [1]



Date

Experiment-7

Aim: Generate frequent itemsets using Apriori Algorithm in python and also generate
association rules for any market basket data

We will make use of the following python libraries


1. Remember good ol’ pandas and numpy?
2. mlxtend or ML extended will be used for apriori implementation and extracting association
rules.
3. And then there was one: matplotlib for visualizing results

import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules
import matplotlib.pyplot as plt
df = pd.read_csv('retail_dataset.csv', sep=',')
## Print first 10 rows
df.head(10)

items = set()
for col in df:
    items.update(df[col].unique())
print(items)

itemset = set(items)
encoded_vals = []
for index, row in df.iterrows():
    rowset = set(row)
    labels = {}
    uncommons = list(itemset - rowset)
    commons = list(itemset.intersection(rowset))
    for uc in uncommons:
        labels[uc] = 0
    for com in commons:
        labels[com] = 1
    encoded_vals.append(labels)
encoded_vals[0]
ohe_df = pd.DataFrame(encoded_vals)
#apriori(df, min_support=0.5, use_colnames=False, max_len=None)

freq_items = apriori(ohe_df, min_support=0.2, use_colnames=True)


freq_items.head(7)
rules = association_rules(freq_items, metric="confidence", min_threshold=0.6)
rules.head()

   antecedents  consequents  antecedent support  consequent support   support  confidence      lift  leverage  conviction
0  (Milk)       (nan)                  0.501587            0.869841  0.409524    0.816456  0.938626 -0.026778    0.709141
1  (Bagel)      (nan)                  0.425397            0.869841  0.336508    0.791045  0.909413 -0.033520    0.622902
2  (Meat)       (nan)                  0.476190            0.869841  0.368254    0.773333  0.889051 -0.045956    0.574230
3  (Wine)       (nan)                  0.438095            0.869841  0.317460    0.724638  0.833069 -0.063613    0.472682
4  (Diaper)     (nan)                  0.406349            0.869841  0.317460    0.781250  0.898152 -0.035999    0.595011

Note: the (nan) consequents appear because empty cells in the CSV are read as NaN, which the one-hot encoding loop above then treats as just another item.


Visualizing results

plt.scatter(rules['support'], rules['confidence'], alpha=0.5)


plt.xlabel('support')
plt.ylabel('confidence')
plt.title('Support vs Confidence')
plt.show()



Date

Experiment-8

Aim: Apply K- Means clustering algorithm on any dataset

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

import sklearn
from sklearn import cluster

%matplotlib inline

faithful = pd.read_csv('faithful.csv')
faithful.head()

index eruptions waiting

0 3.6 79
1 1.8 54
2 3.333 74
3 2.283 62
4 4.533 85

Basic scatterplot of the data.


faithful.columns = ['eruptions', 'waiting']

plt.scatter(faithful.eruptions, faithful.waiting)
plt.title('Old Faithful Data Scatterplot')
plt.xlabel('Length of eruption (minutes)')
plt.ylabel('Time between eruptions (minutes)')


Step two: Building the cluster model

faith = np.array(faithful)

k=2
kmeans = cluster.KMeans(n_clusters=k)
kmeans.fit(faith)

labels = kmeans.labels_
centroids = kmeans.cluster_centers_
for i in range(k):
    # select only data observations with cluster label == i
    ds = faith[np.where(labels==i)]
    # plot the data observations
    plt.plot(ds[:,0], ds[:,1], 'o', markersize=7)
    # plot the centroids
    lines = plt.plot(centroids[i,0], centroids[i,1], 'kx')
    # make the centroid x's bigger
    plt.setp(lines, ms=15.0)
    plt.setp(lines, mew=4.0)
plt.show()



Date

Experiment-9

Aim: Apply Hierarchical Clustering algorithm on any dataset.

#First import the required libraries:


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('Wholesaledata.csv')
data.head()

index  Channel  Region  Fresh  Milk  Grocery  Frozen  Detergents_Paper  Delicassen
0      2        3       12669  9656  7561     214     2674              1338
1      2        3       7057   9810  9568     1762    3293              1776
2      2        3       6353   8808  7684     2405    3516              7844
3      1        3       13265  1196  4221     6404    507               1788
4      2        3       22615  4510  7196     3915    1777              5185

# Normalize the data and bring all the variables to the same scale:
from sklearn.preprocessing import normalize
data_scaled = normalize(data)
data_scaled = pd.DataFrame(data_scaled, columns=data.columns)
data_scaled.head()


index   Channel    Region     Fresh      Milk   Grocery    Frozen  Detergents_Paper  Delicassen
0      0.000112  0.000168  0.708333  0.539874  0.422741  0.011965          0.149505    0.074809
1      0.000125  0.000188  0.442198  0.614704  0.599540  0.110409          0.206342    0.111286
2      0.000125  0.000187  0.396552  0.549792  0.479632  0.150119          0.219467    0.489619
3      0.000065  0.000194  0.856837  0.077254  0.272650  0.413659          0.032749    0.115494
4      0.000080  0.000120  0.901769  0.179835  0.286939  0.156110          0.070858    0.206751

#Draw the dendrogram to help us decide the number of clusters for this particular problem:

import scipy.cluster.hierarchy as shc


plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(data_scaled, method='ward'))


The x-axis contains the samples and y-axis represents the distance between these samples.
The vertical line with maximum distance is the blue line and hence we can decide a threshold
of 6 and cut the dendrogram:

plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(data_scaled, method='ward'))
plt.axhline(y=6, color='r', linestyle='--')


from sklearn.cluster import AgglomerativeClustering


cluster = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward')
cluster.fit_predict(data_scaled)

output
array([0, 0, 0, 1, 1])
We can see the values of 0s and 1s in the output since we defined 2 clusters. 0 represents the
points that belong to the first cluster and 1 represents points in the second cluster. Let’s now
visualize the two clusters:

plt.figure(figsize=(10, 7))
plt.scatter(data_scaled['Milk'], data_scaled['Grocery'], c=cluster.labels_)



Date

Experiment-10

Aim: Apply DBSCAN clustering algorithm on any dataset.

from mpl_toolkits.basemap import Basemap
import numpy as np
import matplotlib
from PIL import Image
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline
rcParams['figure.figsize'] = (14,10)

# weather_df (the Canadian weather-station dataframe) and my_map (a Basemap
# instance for the region) are assumed to be loaded/created earlier, as in the
# source tutorial. Project longitude/latitude into map x/y coordinates:
xs, ys = my_map(np.asarray(weather_df.Long), np.asarray(weather_df.Lat))


1. Clustering the Weather Data (Temperatures & Coordinates as Features)

For clustering the data, I've followed the steps shown in the scikit-learn demo of DBSCAN.

Choosing the temperatures ('Tm', 'Tx', 'Tn') and the x/y map projections of the coordinates ('xm', 'ym') as features, and setting ϵ and MinPts to 0.3 and 10 respectively, gives 8 unique clusters (noise is labeled as -1). Feel free to change these parameters to see how much the clustering is affected.
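A minimal sketch of this clustering step, assuming weather_df carries the temperature columns 'Tm', 'Tx' and 'Tn', and storing the projected coordinates from above as 'xm' and 'ym':

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# keep the projected map coordinates alongside the temperature features
weather_df['xm'] = xs
weather_df['ym'] = ys
feature_cols = ['Tm', 'Tx', 'Tn', 'xm', 'ym']
features = StandardScaler().fit_transform(weather_df[feature_cols].values)

# eps (ϵ) = 0.3 and min_samples (MinPts) = 10, as described above
db = DBSCAN(eps=0.3, min_samples=10).fit(features)
labels = db.labels_  # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters:', n_clusters)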
Let’s visualize these clusters using Basemap —
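One possible way to draw the stations coloured by cluster, assuming my_map is the Basemap instance created earlier and labels comes from the DBSCAN fit above:

# draw the base map, then scatter the projected station coordinates,
# colouring each point by its cluster label (noise, -1, gets its own colour)
my_map.drawcoastlines()
my_map.drawcountries()
plt.scatter(xs, ys, c=labels, cmap='jet', marker='o', s=30, alpha=0.8)
plt.title('DBSCAN clusters of weather stations')
plt.show()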

Finally, I included precipitation ('P') in the features and repeated the same clustering steps with ϵ and MinPts set to 0.5 and 10. We see some differences from the previous clustering, which illustrates how hard it is to cluster unlabeled data, even with DBSCAN, when we lack domain knowledge.
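A sketch of that repeated run, again assuming the precipitation column is named 'P' as the text indicates:

features_p = StandardScaler().fit_transform(
    weather_df[['Tm', 'Tx', 'Tn', 'P', 'xm', 'ym']].values)
labels_p = DBSCAN(eps=0.5, min_samples=10).fit(features_p).labels_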


Figure: unique clusters in Canada based on the selected features (now including precipitation, compared to the previous case) in the weather data. ϵ and MinPts set to 0.5 and 10, respectively.

You can repeat the process with more features, or change the clustering parameters, to build a better overall picture.
