Experiment-1
Aim: Demonstrate the following data preprocessing tasks using Python libraries.
a) Loading the dataset
import pandas as pd
dataset = pd.read_excel("age_salary.xls")
Note:
The 'nan' values you see in some cells of the dataframe denote missing fields.
b) Classifying the Dependent and Independent Variables
X = dataset.iloc[:,:-1].values #Takes all rows of all columns except the last column
Y = dataset.iloc[:,-1].values # Takes all rows of the last column
c) Dealing with Missing Data
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
X = imp.fit_transform(X)
Y = Y.reshape(-1, 1)  # SimpleImputer expects a 2-D array
Y = imp.fit_transform(Y)
Y = Y.reshape(-1)
Output
Experiment-2
Aim: Demonstrate the following data preprocessing tasks using Python libraries.
a) Dealing with Categorical Data
import pandas as pd
dataset = pd.read_csv("dataset.csv")
X = dataset.iloc[:, [0, 2, 3]].values
Y = dataset.iloc[:, 1].values
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
le_X = LabelEncoder()
X[:, 0] = le_X.fit_transform(X[:, 0])
# OneHotEncoder's categorical_features argument was removed in scikit-learn 0.22;
# ColumnTransformer now one-hot encodes only the first column and passes the rest through.
ct = ColumnTransformer([("ohe", OneHotEncoder(), [0])], remainder="passthrough", sparse_threshold=0)
X = ct.fit_transform(X)
Output
Y = le_X.fit_transform(Y)
Output
b) Splitting the Dataset into Training and Testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 0)
c) Scaling the features
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
Y_train = Y_train.reshape((len(Y_train), 1))
Y_train = sc_y.fit_transform(Y_train)
Y_train = Y_train.ravel()
Output
X_train before scaling :
X_train after scaling :
Experiment-3
Aim: Demonstrate the following Similarity and Dissimilarity Measures using Python.
a) Pearson’s Correlation
We calculate this metric for the vectors x and y in the following way:
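r = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ (xᵢ − x̄)² · Σᵢ (yᵢ − ȳ)² ), where x̄ and ȳ are the means of x and y.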
The Pearson’s correlation can take a range of values from -1 to +1. Merely having increases or decreases that move together will not, by itself, lead to a Pearson’s correlation of exactly 1 or -1; the relationship between the two vectors must be perfectly linear.
import numpy as np
from scipy.stats import pearsonr
import matplotlib.pyplot as plt

# seed random number generator
np.random.seed(42)
# prepare data
x = np.random.randn(15)
y = x + np.random.randn(15)
# plot x and y with a fitted regression line
plt.scatter(x, y)
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
plt.xlabel('x')
plt.ylabel('y')
plt.show()
# calculate Pearson's correlation
corr, _ = pearsonr(x, y)
print('Pearsons correlation: %.3f' % corr)
Output: Pearsons correlation: 0.810
b) Cosine Similarity
The cosine similarity calculates the cosine of the angle between two vectors. In order to
calculate the cosine similarity we use the following formula:
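cos(θ) = (x · y) / (‖x‖ ‖y‖) = Σᵢ xᵢyᵢ / ( √(Σᵢ xᵢ²) · √(Σᵢ yᵢ²) )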
Recall the cosine function: on the left the red vectors point at different angles and the graph
on the right shows the resulting function.
Accordingly, the cosine similarity can take on values between -1 and +1. If the vectors point
in the exact same direction, the cosine similarity is +1. If the vectors point in opposite
directions, the cosine similarity is -1
Implementation
from sklearn.metrics.pairwise import cosine_similarity
cos_sim = cosine_similarity(x.reshape(1,-1),y.reshape(1,-1))
print('Cosine similarity: %.3f' % cos_sim)
Output: Cosine similarity: 0.773
c) Jaccard Similarity
Whereas cosine similarity compares two real-valued vectors, Jaccard similarity compares two binary vectors (sets).
In set theory it is often helpful to see a visualization of the formula:
We can see that the Jaccard similarity divides the size of the intersection by the size of the
union of the sample sets.
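J(A, B) = |A ∩ B| / |A ∪ B|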
Implementation in Python
from sklearn.metrics import jaccard_score
A = [1, 1, 1, 0]
B = [1, 1, 0, 1]
jacc = jaccard_score(A,B)
print('Jaccard similarity: %.3f' % jacc)
Output: Jaccard similarity: 0.500
d) Euclidean Distance
The Euclidean distance is a straight-line distance between two vectors.
For the two vectors x and y, this can be computed as follows:
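d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )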
Implementation in Python
from scipy.spatial import distance
dst = distance.euclidean(x,y)
print("Euclidean distance: %.3f" % dst)
Output: Euclidean distance: 3.273
e) Manhattan Distance
We calculate the Manhattan distance as follows:
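d(x, y) = Σᵢ |xᵢ − yᵢ|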
In many ML applications, Euclidean distance is the metric of choice. However, for high-dimensional data, Manhattan distance is often preferable as it yields more robust results.
Implementation in Python
from scipy.spatial import distance
dst = distance.cityblock(x,y)
print("Manhattan distance: %.3f" % dst)
Output: Manhattan distance: 10.468
Experiment-4
Aim: Build a model using linear regression algorithm on any dataset.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import seaborn as sns
from matplotlib import rcParams
%matplotlib inline
%pylab inline
Populating the interactive namespace from numpy and matplotlib
df = pd.read_csv('data.csv')
df.head()
#Reading the csv file from Kaggle using pandas (pd.read_csv).
index  id          date             price     bedrooms  bathrooms  sqft_living  sqft_lot
0      7129300520  20141013T000000  221900.0  3         1.0        1180         5650
1      6414100192  20141209T000000  538000.0  3         2.25       2570         7242
2      5631500400  20150225T000000  180000.0  2         1.0        770          10000
3      2487200875  20141209T000000  604000.0  4         3.0        1960         5000
4      1954400510  20150218T000000  510000.0  3         2.0        1680         8080
# Checking to see if any of our data has null values. If there were any, we'd
# drop or filter the null values out.
df.isnull().any()
id False
date False
price False
bedrooms False
bathrooms False
sqft_living False
sqft_lot False
dtype: bool
# Checking out the data types for each of our variables. We want to get a sense
# of whether the data is numerical (int64, float64) or not (object).
df.dtypes
id int64
date object
price float64
bedrooms int64
bathrooms float64
sqft_living int64
sqft_lot int64
dtype: object
# Next: Simple exploratory analysis and regression results.
df.describe()
       id        price          bedrooms  bathrooms  sqft_living  sqft_lot
count  5.0       5.0            5.0       5.0        5.0          5.0
mean   2.0       410780.0       3.0       1.85       1632.0       7194.4
std    1.581139  195127.245663  0.707107  0.858778   695.895107   1991.137062
min    0.0       180000.0       2.0       1.0        770.0        5000.0
25%    1.0       221900.0       3.0       1.0        1180.0       5650.0
50%    2.0       510000.0       3.0       2.0        1680.0       7242.0
75%    3.0       538000.0       3.0       2.25       1960.0       8080.0
max    4.0       604000.0       4.0       3.0        2570.0       10000.0
fig = plt.figure(figsize=(12, 6))
sqft = fig.add_subplot(121)
cost = fig.add_subplot(122)
sqft.hist(df.sqft_living, bins=80)
sqft.set_xlabel('Ft^2')
sqft.set_title("Histogram of House Square Footage")
cost.hist(df.price, bins=80)
cost.set_xlabel('Price ($)')
cost.set_title("Histogram of Housing Prices")
plt.show()
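The steps above cover the exploratory side; the regression model itself can then be fitted. A minimal sketch, assuming price is predicted from sqft_living in the same df (this single-feature choice is only illustrative):
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Illustrative choice of a single feature; more columns could be added.
X = df[['sqft_living']].values
y = df['price'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

reg = LinearRegression()
reg.fit(X_train, y_train)

print("Coefficient:", reg.coef_[0])
print("Intercept:", reg.intercept_)
print("R^2 on the test set:", reg.score(X_test, y_test))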
Experiment-5
Aim: Build a classification model using the Decision Tree algorithm on the iris dataset.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.tree import export_graphviz
from six import StringIO
from IPython.display import Image
from pydot import graph_from_dot_data
import pandas as pd
import numpy as np
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Categorical.from_codes(iris.target, iris.target_names)
X
index sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
... ... ... ... ...
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8
150 rows × 4 columns
y = pd.get_dummies(y)
y
index setosa versicolor virginica
0 1 0 0
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
... ... ... ...
145 0 0 1
146 0 0 1
147 0 0 1
148 0 0 1
149 0 0 1
150 rows × 3 columns
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
dot_data = StringIO()
export_graphviz(dt, out_file=dot_data, feature_names=iris.feature_names)
(graph, ) = graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
output
Let's see how our decision tree does when it's presented with test data.
y_pred = dt.predict(X_test)
species = np.array(y_test).argmax(axis=1)
predictions = np.array(y_pred).argmax(axis=1)
confusion_matrix(species, predictions)
output
array([[13,  0,  0],
       [ 0, 15,  1],
       [ 0,  0,  9]])
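As a quick follow-on check (using the same species and predictions arrays from above), the overall accuracy can be computed directly:
from sklearn.metrics import accuracy_score
# Fraction of test samples whose predicted class matches the true class.
print("Accuracy: %.3f" % accuracy_score(species, predictions))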
Experiment-6
Aim: Apply Naïve Bayes Classification algorithm on any dataset
# Assigning features and label variables
weather=['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast','Sunny','Sunny',
'Rainy','Sunny','Overcast','Overcast','Rainy']
temp=['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mild','Mild','Hot','Mild']
play=['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No']
# Import LabelEncoder
from sklearn import preprocessing
#creating labelEncoder
le = preprocessing.LabelEncoder()
# Converting string labels into numbers.
weather_encoded=le.fit_transform(weather)
print ("weather:",weather_encoded)
# Converting string labels into numbers
temp_encoded=le.fit_transform(temp)
label=le.fit_transform(play)
print( "Temp:",temp_encoded)
print( "Play:",label)
#Combining weather and temp into a single list of tuples using zip
features=list(zip(weather_encoded,temp_encoded))
print("weather,temp:" ,features)
#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
#Create a Gaussian Classifier
model = GaussianNB()
# Train the model using the training sets
model.fit(features,label)
#Predict Output
predicted= model.predict([[0,2]]) # 0:Overcast, 2:Mild
print ("Predicted Value:", predicted)
output
weather: [2 2 0 1 1 1 0 2 2 1 2 0 0 1]
Temp: [1 1 1 2 0 0 0 2 0 2 2 2 1 2]
Play: [0 0 1 1 1 0 1 0 1 1 1 1 1 0]
weather,temp: [(2, 1), (2, 1), (0, 1), (1, 2), (1, 0), (1, 0), (0, 0), (2, 2), (2, 0), (1, 2), (2, 2), (0, 2), (0, 1), (1, 2)]
Predicted Value: [1]
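Because the same LabelEncoder instance was last fitted on the play labels, the numeric prediction can be mapped back to its class name (a small follow-on sketch):
# Decode the numeric prediction (1) back to the original label ('Yes').
print("Predicted class:", le.inverse_transform(predicted)[0])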
Experiment-7
Aim: Generate frequent itemsets using Apriori Algorithm in python and also generate
association rules for any market basket data
We will make use of the following python libraries
1. Remember good ol’ pandas and numpy?
2. mlxtend or ML extended will be used for apriori implementation and extracting association
rules.
3. And then there was one: matplotlib for visualizing results
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules
import matplotlib.pyplot as plt
df = pd.read_csv('retail_dataset.csv', sep=',')
## Print first 10 rows
df.head(10)
items = set()
for col in df:
    items.update(df[col].unique())
print(items)
itemset = set(items)
encoded_vals = []
for index, row in df.iterrows():
    rowset = set(row)
    labels = {}
    uncommons = list(itemset - rowset)
    commons = list(itemset.intersection(rowset))
    for uc in uncommons:
        labels[uc] = 0
    for com in commons:
        labels[com] = 1
    encoded_vals.append(labels)
encoded_vals[0]
ohe_df = pd.DataFrame(encoded_vals)
#apriori(df, min_support=0.5, use_colnames=False, max_len=None)
freq_items = apriori(ohe_df, min_support=0.2, use_colnames=True)
freq_items.head(7)
rules = association_rules(freq_items, metric="confidence", min_threshold=0.6)
rules.head()
  antecedents  consequents  antecedent support  consequent support  support   confidence  lift      leverage   conviction
0 (Milk)       (nan)        0.501587            0.869841            0.409524  0.816456    0.938626  -0.026778  0.709141
1 (Bagel)      (nan)        0.425397            0.869841            0.336508  0.791045    0.909413  -0.033520  0.622902
2 (Meat)       (nan)        0.476190            0.869841            0.368254  0.773333    0.889051  -0.045956  0.574230
3 (Wine)       (nan)        0.438095            0.869841            0.317460  0.724638    0.833069  -0.063613  0.472682
4 (Diaper)     (nan)        0.406349            0.869841            0.317460  0.781250    0.898152  -0.035999  0.595011
Visualizing results
plt.scatter(rules['support'], rules['confidence'], alpha=0.5)
plt.xlabel('support')
plt.ylabel('confidence')
plt.title('Support vs Confidence')
plt.show()
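The rules DataFrame can also be filtered or ranked on these metrics; for example, a short sketch that keeps only rules with lift above 1 and lists the strongest first:
# Rules with lift > 1 indicate positively correlated itemsets.
strong_rules = rules[rules['lift'] > 1].sort_values('lift', ascending=False)
print(strong_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head())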
Experiment-8
Aim: Apply K- Means clustering algorithm on any dataset
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn
from sklearn import cluster
%matplotlib inline
faithful = pd.read_csv('faithful.csv')
faithful.head()
index  eruptions  waiting
0      3.600      79
1      1.800      54
2      3.333      74
3      2.283      62
4      4.533      85
Basic scatterplot of the data.
faithful.columns = ['eruptions', 'waiting']
plt.scatter(faithful.eruptions, faithful.waiting)
plt.title('Old Faithful Data Scatterplot')
plt.xlabel('Length of eruption (minutes)')
plt.ylabel('Time between eruptions (minutes)')
Step two: Building the cluster model
faith = np.array(faithful)
k=2
kmeans = cluster.KMeans(n_clusters=k)
kmeans.fit(faith)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
for i in range(k):
    # select only data observations with cluster label == i
    ds = faith[np.where(labels == i)]
    # plot the data observations
    plt.plot(ds[:, 0], ds[:, 1], 'o', markersize=7)
    # plot the centroids
    lines = plt.plot(centroids[i, 0], centroids[i, 1], 'kx')
    # make the centroid x's bigger
    plt.setp(lines, ms=15.0)
    plt.setp(lines, mew=4.0)
plt.show()
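Once fitted, the model can also assign a new observation to the nearest centroid; a small illustrative sketch (the eruption length and waiting time below are made-up values):
# Predict the cluster for a new (eruption length, waiting time) observation.
new_point = np.array([[3.0, 70.0]])
print("Cluster for new point:", kmeans.predict(new_point)[0])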
Experiment-9
Aim: Apply Hierarchical Clustering algorithm on any dataset.
#First import the required libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('Wholesaledata.csv')
data.head()
index  Channel  Region  Fresh  Milk  Grocery  Frozen  Detergents_Paper  Delicassen
0      2        3       12669  9656  7561     214     2674              1338
1      2        3       7057   9810  9568     1762    3293              1776
2      2        3       6353   8808  7684     2405    3516              7844
3      1        3       13265  1196  4221     6404    507               1788
4      2        3       22615  4510  7196     3915    1777              5185
# Normalize the data and bring all the variables to the same scale:
from sklearn.preprocessing import normalize
data_scaled = normalize(data)
data_scaled = pd.DataFrame(data_scaled, columns=data.columns)
data_scaled.head()
   Channel   Region    Fresh     Milk      Grocery   Frozen    Detergents_Paper  Delicassen
0  0.000112  0.000168  0.708333  0.539874  0.422741  0.011965  0.149505          0.074809
1  0.000125  0.000188  0.442198  0.614704  0.599540  0.110409  0.206342          0.111286
2  0.000125  0.000187  0.396552  0.549792  0.479632  0.150119  0.219467          0.489619
3  0.000065  0.000194  0.856837  0.077254  0.272650  0.413659  0.032749          0.115494
4  0.000080  0.000120  0.901769  0.179835  0.286939  0.156110  0.070858          0.206751
#Draw the dendrogram to help us decide the number of clusters for this particular problem:
import scipy.cluster.hierarchy as shc
plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(data_scaled, method='ward'))
The x-axis contains the samples and the y-axis represents the distance between these samples. The vertical line with the maximum distance is the blue line, hence we can decide on a threshold of 6 and cut the dendrogram:
plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(data_scaled, method='ward'))
plt.axhline(y=6, color='r', linestyle='--')
from sklearn.cluster import AgglomerativeClustering
# 'affinity' was renamed to 'metric' in recent scikit-learn releases; euclidean is
# the default for ward linkage, so it can simply be omitted here.
cluster = AgglomerativeClustering(n_clusters=2, linkage='ward')
cluster.fit_predict(data_scaled)
output
array([0, 0, 0, 1, 1])
We can see the values of 0s and 1s in the output since we defined 2 clusters. 0 represents the
points that belong to the first cluster and 1 represents points in the second cluster. Let’s now
visualize the two clusters:
plt.figure(figsize=(10, 7))
plt.scatter(data_scaled['Milk'], data_scaled['Grocery'], c=cluster.labels_)
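As a quick sanity check on the result, the size of each cluster can be counted from the fitted labels:
# Count how many wholesale customers fall into each of the two clusters.
unique_labels, counts = np.unique(cluster.labels_, return_counts=True)
print(dict(zip(unique_labels, counts)))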
EXPERIMENT -10
Aim: Apply DBSCAN clustering algorithm on any dataset.
from mpl_toolkits.basemap import Basemap
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from pylab import rcParams
%matplotlib inline
rcParams['figure.figsize'] = (14, 10)
# weather_df (the weather-station data) and my_map (a Basemap instance covering the
# stations) are assumed to have been created earlier; project the station coordinates.
xs, ys = my_map(np.asarray(weather_df.Long), np.asarray(weather_df.Lat))
1.Clustering the Weather Data (Temperatures & Coordinates as Features)
For clustering the data, I've followed the steps shown in the scikit-learn demo of DBSCAN.
Choosing temperatures (‘Tm’, ‘Tx’, ‘Tn’) and x/y map projections of coordinates (‘xm’,
‘ym’) as features and, setting ϵ and MinPts to 0.3 and 10 respectively, gives 8 unique clusters
(noise is labeled as -1). Feel free to change these parameters to test how much clustering is
affected accordingly.
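A minimal sketch of that clustering step, assuming weather_df already contains the 'Tm', 'Tx', 'Tn' columns and that the projected coordinates have been stored as 'xm' and 'ym' (these column names are assumptions carried over from the description above):
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Assumed feature set: mean/max/min temperature plus projected map coordinates.
features = weather_df[['Tm', 'Tx', 'Tn', 'xm', 'ym']].values
features = np.nan_to_num(features)           # replace missing readings with 0
features = StandardScaler().fit_transform(features)

# eps (ϵ) = 0.3 and min_samples (MinPts) = 10, as described above.
db = DBSCAN(eps=0.3, min_samples=10).fit(features)
labels = db.labels_

# Noise points are labelled -1; the remaining labels are cluster indices.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Estimated number of clusters:", n_clusters)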
Let’s visualize these clusters using Basemap —
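A rough visualization sketch along those lines, again assuming my_map is the Basemap instance created earlier and that weather_df carries the projected 'xm'/'ym' columns:
my_map.drawcoastlines()
my_map.fillcontinents(color='white', alpha=0.3)
# Colour each station by its DBSCAN cluster label; noise (-1) gets its own colour.
plt.scatter(weather_df.xm, weather_df.ym, c=labels, cmap='tab10', s=30)
plt.title('DBSCAN clusters of weather stations')
plt.show()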
Finally, I included precipitation ('P') in the features and repeated the same clustering steps with ϵ and MinPts set to 0.5 and 10. We see some differences from the previous clustering, which illustrates how hard it is to cluster unlabelled data, even with DBSCAN, when we lack domain knowledge.
Unique clusters in Canada based on the selected features (now including precipitation, compared to the previous case) in the weather data, with ϵ and MinPts set to 0.5 and 10 respectively.
You can repeat the process with more features, or change the clustering parameters, to build a better overall picture.