Data analytics lab manual
Data Analytics Lab (Dr. A.P.J. Abdul Kalam Technical University)
ABHISHEK BISHT 1813313006
Department of Information Technology
LAB FILE
ON
Data Analytics
KIT 651
(6th Semester)
(2020 – 2021)
Submitted To: Submitted By:
Ms. Tanya Varshney Name: ABHISHEK BISHT
Roll No: 1813313006
INDEX
DATA ANALYTICS KIT-651
S.NO | TOPIC | DATE OF PRACTICAL | REMARKS
1 | To get the input from user and perform numerical operations (MAX, MIN, AVG, SUM, SQRT, ROUND) using R/Python. | |
2 | To perform data import/export (.CSV, .XLS, .TXT) operations using data frames in R/Python. | |
3 | To get the input matrix from user and perform Matrix addition, subtraction, multiplication, inverse, transpose and division operations using vector concept in R/Python. | |
4 | To perform statistical operations (Mean, Median, Mode and Standard deviation) using R/Python. | |
5 | To perform data pre-processing operations: i) Handling Missing data ii) Min-Max normalization. | |
6 | To perform Simple Linear Regression with R/Python. | |
7 | To perform Logistic Regression with R/Python. | |
8 | To implement K-Means clustering algorithm on Real Time data and its visualization using R/Python. | |
Aim 1:- To get the input from user and perform numerical
operations (MAX, MIN, AVG, SUM, SQRT, ROUND) using
R/Python
CODE:-
# read n integers from the user into a list
a = []
n = int(input("Enter no of element in list"))
for i in range(0, n):
    ele = int(input("Enter the element"))
    a.append(ele)
print("list is = ", a)
# maximum of the numbers
print("maximum number from this is", max(a))
# smallest number
print("smallest number is", min(a))
# average of the numbers
def average(num):
    return sum(num) / len(num)
print("average of a given list is ", average(a))
# sum of the numbers
print("sum of numbers", sum(a))
# square root of a number
number = int(input("Enter the number for finding square root"))
number_sqrt = number ** 0.5
print("square root of a number = ", number_sqrt)
# round off a number to the nearest integer
num = float(input("enter a number for round of "))
print("round off is ", round(num))
OUTPUT:- Enter no of element in list 5
Enter the element1
Enter the element5
Enter the element2
Enter the element3
Enter the element5
list is = [1, 5, 2, 3, 5]
maximum number from this is 5
smallest number is 1
average of a given list is 3.2
sum of numbers 16
Enter the number for finding square root 25
square root of a number = 5.0
enter a number for round of 5.3624
round off is 5
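The same operations can be wrapped in a function so they can be checked without typing input each time. The sketch below is only an illustration: the function name `summarize` and its dictionary keys are assumptions, and `math.sqrt` is used in place of `** 0.5`. It also shows that Python's built-in `round` sends exact halves to the nearest even integer.

```python
import math

def summarize(values):
    """Return max, min, average and sum of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    return {
        "max": max(values),
        "min": min(values),
        "avg": sum(values) / len(values),
        "sum": sum(values),
    }

stats = summarize([1, 5, 2, 3, 5])   # the list from the sample run above
print(stats)
print(math.sqrt(25))      # square root; raises ValueError for negative input
print(round(5.3624))      # nearest integer
print(round(5.3624, 2))   # keep two decimal places
print(round(2.5))         # exact halves go to the even neighbour
```

Keeping the calculation separate from `input()` makes it easy to reuse the function on data read from a file.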
Aim 2:- To perform data import/export (.CSV, .XLS, .TXT)
operations using data frames in R/Python
CODE:-
1.) Importing data from drive
from google.colab import drive
drive.mount("/content/drive")
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/ALOK NIET/ITUR_rain1.csv")
2.) Importing data from our computer
import numpy as np
import pandas as pd
# Importing the dataset
from google.colab import files
uploaded = files.upload()
data = pd.read_csv("FL_insurance_sample.csv")
x = data.iloc[:,0:1].values.astype(float)
y = data.iloc[:,3:4].values.astype(float)
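The listing above covers CSV import only. A minimal sketch of the export side, assuming a small made-up data frame and a temporary folder (the names `round_trip`, `sample.csv` and `sample.txt` are hypothetical), could look like this. Excel export would go through `DataFrame.to_excel`, which needs an additional engine package, so it is left out of the sketch.

```python
import os
import tempfile

import pandas as pd

def round_trip(df, folder):
    """Write df as .csv and tab-separated .txt, then read both files back."""
    csv_path = os.path.join(folder, "sample.csv")
    txt_path = os.path.join(folder, "sample.txt")
    df.to_csv(csv_path, index=False)            # export to CSV
    df.to_csv(txt_path, sep="\t", index=False)  # export to tab-separated text
    from_csv = pd.read_csv(csv_path)            # import from CSV
    from_txt = pd.read_csv(txt_path, sep="\t")  # import from text
    return from_csv, from_txt

# hypothetical data, not one of the lab datasets
sample = pd.DataFrame({"city": ["A", "B", "C"], "rain_mm": [12.5, 0.0, 7.25]})
with tempfile.TemporaryDirectory() as folder:
    from_csv, from_txt = round_trip(sample, folder)
print(from_csv)
print(from_txt)
```

Passing `index=False` keeps the row index out of the file, so the data frame read back has the same columns as the one written.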
Aim 3 :- To get the input matrix from user and perform Matrix
addition, subtraction, multiplication, inverse, transpose
and division operations using vector concept in R /
python.
CODE :-
import numpy as np

def matrix(m, n):
    # read an m x n matrix from the user, row by row
    o = []
    for i in range(m):
        row = []
        for j in range(n):
            inp = int(input(f"enter o[{i}][{j}]"))
            row.append(inp)
        o.append(row)
    print("\n")
    return o

def result(m, n):
    # build an m x n matrix of zeros to hold a result
    o = []
    for i in range(m):
        row = []
        for j in range(n):
            row.append(0)
        o.append(row)
    return o

def add(a, b):
    # element-wise sum of two matrices of the same size
    k = []
    for i in range(len(a)):
        row = []
        for j in range(len(a[0])):
            row.append(a[i][j] + b[i][j])
        k.append(row)
    return k

def sub(a, b):
    # element-wise difference of two matrices of the same size
    s = []
    for i in range(len(a)):
        row = []
        for j in range(len(a[0])):
            row.append(a[i][j] - b[i][j])
        s.append(row)
    return s

def multiplication(a, b, r):
    # r must be a zero matrix with len(a) rows and len(b[0]) columns
    for i in range(len(a)):
        for j in range(len(b[0])):
            for k in range(len(b)):
                r[i][j] += a[i][k] * b[k][j]
    return r

def transpose(a, r2):
    # r2 must have len(a[0]) rows and len(a) columns
    for i in range(len(a)):
        for j in range(len(a[0])):
            r2[j][i] = a[i][j]
    return r2

m = int(input("Enter number of rows"))
n = int(input("Enter number of columns"))
print("Enter the elements of matrix A row wise")
a = matrix(m, n)
print(a)
print("Enter the elements of matrix B row wise")
b = matrix(m, n)
print(b)
c = add(a, b)
print("Sum of matrices = ", c)
d = sub(a, b)
print("Subtraction of A and B = ", d)
# the product A x B and the inverses below need square matrices (m == n)
p = multiplication(a, b, result(m, n))
print("Multiplication of A and B = ", p)
# each transpose gets its own result matrix so the two do not share storage
t = transpose(a, result(n, m))
print("Transpose of A = ", t)
t2 = transpose(b, result(n, m))
print("Transpose of B = ", t2)
I = np.array(a)
print("Inverse of matrix A = ", np.linalg.inv(I))
I2 = np.array(b)
print("Inverse of matrix B = ", np.linalg.inv(I2))
OUTPUT:- Enter number of rows2 Enter number of columns2
Enter the elements of matrix A row wise
enter o[0][0]1
enter o[0][1]2
enter o[1][0]3
enter o[1][1]4
[[1, 2], [3, 4]]
Enter the elements of matrix B row wise
enter o[0][0]1
enter o[0][1]2
enter o[1][0]3
enter o[1][1]4
[[1, 2], [3, 4]]
Sum of matrices = [[2, 4], [6, 8]]
Subtraction of A and B = [[0, 0], [0, 0]]
Multiplication of A and B = [[7, 10], [15, 22]]
Transpose of A = [[1, 3], [2, 4]]
Transpose of B = [[1, 3], [2, 4]]
Inverse of matrix A = [[-2. 1. ] [ 1.5 -0.5]]
Inverse of matrix B = [[-2. 1. ] [ 1.5 -0.5]]
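The aim also lists division, which the program does not show. Matrices have no direct division operator, so one common reading is A multiplied by the inverse of B, and another is element-wise division. The sketch below shows both with NumPy; the function name `matrix_divide` is an assumption made for illustration.

```python
import numpy as np

def matrix_divide(a, b):
    """Return A @ inverse(B), one common meaning of matrix 'division'."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if b.ndim != 2 or b.shape[0] != b.shape[1]:
        raise ValueError("B must be a square matrix to have an inverse")
    # np.linalg.inv raises numpy.linalg.LinAlgError when B is singular
    return a @ np.linalg.inv(b)

A = [[1, 2], [3, 4]]   # the matrices from the sample run above
B = [[1, 2], [3, 4]]
quotient = matrix_divide(A, B)
elementwise = np.array(A) / np.array(B)   # divides matching entries
print(np.round(quotient, 6))
print(elementwise)
```

Because A and B are equal in the sample run, A multiplied by the inverse of B is the identity matrix, and the element-wise quotient is a matrix of ones.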
Aim 4:- To perform statistical operations (Mean, Median,
Mode and Standard deviation) using R/Python
CODE:-
import numpy as np
from scipy import stats
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 7, 6, 5]
# arithmetic mean
x = np.mean(a)
print("Mean is =", x)
# median (middle value of the sorted data)
y = np.median(a)
print("median = ", y)
# mode (most frequent value; only one value is reported when several tie)
z = stats.mode(a)
print("Mode = ", z)
# population standard deviation (NumPy default, ddof=0)
b = np.std(a)
print("standard deviation is ", b)
OUTPUT:- Mean is = 5.25
median = 5.5
Mode = ModeResult(mode=array([5]), count=array([2]))
standard deviation is 2.3139072294858036
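The same statistics can be cross-checked with Python's standard `statistics` module. This is a sketch for comparison, not the method used in the lab program. In this data 5, 6 and 7 each occur twice, so `multimode` lists all three while the result above reports only 5. The sketch also separates the population standard deviation (divide by n) from the sample standard deviation (divide by n - 1).

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 7, 6, 5]

mean = statistics.mean(data)
median = statistics.median(data)
modes = statistics.multimode(data)     # every value with the highest count
pop_std = statistics.pstdev(data)      # divides by n, like np.std(a)
sample_std = statistics.stdev(data)    # divides by n - 1, like np.std(a, ddof=1)

print(mean, median, modes)
print(pop_std, sample_std)
```

The population value matches the standard deviation printed by the program; the sample value is slightly larger because of the smaller divisor.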
Aim 5:- To perform Simple Linear Regression with R/python.
CODE:-
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn import linear_model
# notebook magic: render plots inline (Jupyter/Colab only)
%matplotlib inline
# import the training file from Drive
df = pd.read_csv('/content/drive/MyDrive/ALOK NIET/test.csv')
# draw the data points on a scatter plot
plt.scatter(df.area, df.price, marker='*', color='red')
plt.xlabel('Area(Sq.ft.)')
plt.ylabel('Price(us$)')
# fit price as a linear function of area
reg = linear_model.LinearRegression()
reg.fit(df[['area']], df.price)
# predict the price for an area of 3300 sq. ft.
reg.predict([[3300]])
# slope of line
reg.coef_
# y intercept
reg.intercept_
# predict prices for the areas listed in a second file
df1 = pd.read_csv("/content/drive/MyDrive/ALOK NIET/area.csv")
p = reg.predict(df1)
df1['price'] = p
# export the result to Drive as regression.csv
df1.to_csv('/content/drive/MyDrive/ALOK NIET/regression.csv', index=True)
# plot the data points together with the fitted line
plt.xlabel('Area(Sq.ft.)')
plt.ylabel('Price(us$)')
plt.scatter(df.area, df.price, marker='*', color='red')
plt.plot(df.area, reg.predict(df[['area']]), color='blue')
# score() returns the coefficient of determination (R^2) of the fit
Accuracy = reg.score(df[['area']], df.price)
print('Accuracy=', Accuracy)
OUTPUT:-
Text(0, 0.5, 'Price(us$)')
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
180616.43835616432
[<matplotlib.lines.Line2D at 0x7f2270d23e50>]
Accuracy= 0.9584301138199486
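For a regression model, the value returned by `score` is the coefficient of determination R^2, that is, 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on made-up numbers; the function name `r_squared` and the data are assumptions, not values from test.csv.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# hypothetical prices and model predictions
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
score = r_squared(actual, predicted)
print(score)
print(r_squared(actual, actual))             # perfect predictions
print(r_squared(actual, [25, 25, 25, 25]))   # always predicting the mean
```

A value near 1 means the line explains most of the variation in price; always predicting the mean gives 0.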
Aim 6:- To perform data pre-processing operations
CODE:-
i) Handling Missing data
import pandas as pd
import numpy as np
df = pd.read_csv("/content/drive/MyDrive/ALOK NIET/titanic.csv")
df.head()
# drop the columns that are not used as features
df.drop(["PassengerId", "Name", "SibSp", "Parch", "Ticket", "Cabin", "Embarked"], axis="columns", inplace=True)
df.head()
target = df.Survived
inputs = df.drop("Survived", axis="columns")
# assumed step: one-hot encode Sex, which creates the 'female' and 'male' columns used below
dummies = pd.get_dummies(inputs.Sex)
inputs = pd.concat([inputs, dummies], axis="columns")
inputs.head(4)
# keep only the 'female' indicator; 'Sex' and 'male' are now redundant
inputs.drop(['Sex', 'male'], axis='columns', inplace=True)
inputs.head(3)
# list the columns that contain missing values
inputs.columns[inputs.isna().any()]
# replace missing ages with the mean age
inputs.Age = inputs.Age.fillna(inputs.Age.mean())
inputs.head()
inputs.Age[:10]
from sklearn.model_selection import train_test_split
# hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(inputs, target, test_size=0.3)
len(X_train)
len(X_test)
len(inputs)
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
X_train
model.fit(X_train, y_train)
model.score(X_test, y_test)
model.predict_proba(X_test[0:10])
OUTPUT:-
Index(['Age'], dtype='object')
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
5 NaN
6 54.0
7 2.0
8 27.0
9 14.0
Name: Age, dtype: float64
623
268
X_train
GaussianNB(priors=None, var_smoothing=1e-09)
0.7574626865671642
552 0
176 0
630 1
162 0
849 1
94 0
360 0
596 1
640 0
146 1
Name: Survived, dtype: int64
array([[0.9607858 , 0.0392142 ],
[0.9587296 , 0.0412704 ],
[0.63649097, 0.36350903],
[0.95854681, 0.04145319],
[0.00536352, 0.99463648],
[0.9603635 , 0.0396365 ],
[0.96114074, 0.03885926],
[0.16354633, 0.83645367],
[0.95346105, 0.04653895],
[0.95921577, 0.04078423]])
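The missing-data steps are easier to follow on a tiny table. The sketch below uses a made-up three-row frame (the name `people` and its values are hypothetical, not rows from titanic.csv) to show which column is flagged as incomplete, how `get_dummies` encodes Sex, and how `fillna` with the column mean fills the gap.

```python
import numpy as np
import pandas as pd

# hypothetical miniature of the inputs table
people = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Age": [22.0, np.nan, 26.0],
})

# columns that contain at least one missing value
missing_columns = list(people.columns[people.isna().any()])

# one-hot encode Sex, then attach the indicator columns
dummies = pd.get_dummies(people["Sex"], dtype=int)
encoded = pd.concat([people.drop(columns="Sex"), dummies], axis="columns")

# mean imputation: the mean ignores NaN, so it is (22 + 26) / 2
encoded["Age"] = encoded["Age"].fillna(encoded["Age"].mean())

print(missing_columns)
print(encoded)
```

Mean imputation keeps every row but pulls the filled values toward the centre of the column, which can understate its spread.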
ii) Min-Max normalization
import matplotlib.pyplot as plt
inputs.plot(kind = 'bar')
Output:
df_max_scaled = inputs.copy()
# apply maximum absolute scaling: divide each column by its largest absolute value
for column in df_max_scaled.columns:
    df_max_scaled[column] = df_max_scaled[column] / df_max_scaled[column].abs().max()
# view normalized data (display is available in Jupyter/Colab notebooks)
display(df_max_scaled)
Output:
import matplotlib.pyplot as plt
df_max_scaled.plot(kind = 'bar')
Output:
df_min_max_scaled = inputs.copy()
# apply min-max normalization: (x - min) / (max - min) maps each column to [0, 1]
for column in df_min_max_scaled.columns:
    col_min = df_min_max_scaled[column].min()
    col_max = df_min_max_scaled[column].max()
    df_min_max_scaled[column] = (df_min_max_scaled[column] - col_min) / (col_max - col_min)
# view normalized data
print(df_min_max_scaled)
Output:
import matplotlib.pyplot as plt
df_min_max_scaled.plot(kind = 'bar')
Output:
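The two scalings in this section reduce to short formulas: maximum absolute scaling computes x / max|x|, and min-max normalization computes (x - min) / (max - min). The sketch below applies both to plain lists. The function names are assumptions, and mapping a constant column to 0 is also an assumed convention, chosen here only to avoid dividing by zero.

```python
def min_max_scale(values):
    """Map values linearly so the minimum becomes 0 and the maximum 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]   # assumption: constant column maps to 0
    return [(v - lo) / (hi - lo) for v in values]

def max_abs_scale(values):
    """Divide by the largest absolute value, giving results in [-1, 1]."""
    largest = max(abs(v) for v in values)
    if largest == 0:
        return [0.0 for _ in values]   # assumption: all-zero column stays 0
    return [v / largest for v in values]

print(min_max_scale([2, 4, 6]))
print(max_abs_scale([-4, 2, 1]))
print(min_max_scale([3, 3]))
```

Min-max scaling always fills the whole range from 0 to 1, whereas maximum absolute scaling keeps the sign of each value and leaves zero at zero.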
Aim 6: To perform Simple Linear Regression with R/Python.
CODE: -
import pandas as pd
ar=pd.read_csv("/content/drive/MyDrive/mount file/area.csv")
import numpy as np
from matplotlib import pyplot as plt
x1 = ar.iloc[:,0:1].values.astype(float) #x1=ar.Area
y1 = ar.iloc[:,1:2].values.astype(float) #y1=ar.Price
#print(x1)
#print(y1)
plt.scatter(x1, y1,marker='*',color='red')
plt.xlabel('Area(Sq.ft.)')
plt.ylabel('Price(INR)')
plt.plot()
#plt.xticks(())
#plt.yticks(())
plt.show()
Output:
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
regr = linear_model.LinearRegression()
regr.fit(ar[['Area']],ar.Price)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
regr.predict([[80]])
array([9171.67605634])
regr.coef_
array([100.6691978])
regr.intercept_
1118.140232700558
# Verify the prediction by hand: area * slope + intercept
80*100.6691978+1118.140232700558
9171.676056700559
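The slope and intercept that `LinearRegression` reports for a single feature can also be obtained from the closed-form least-squares formulas: slope = S_xy / S_xx and intercept = mean(y) - slope * mean(x). The sketch below applies them to made-up numbers; the function name `fit_line` and the data are assumptions, not values from area.csv.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# hypothetical areas and prices lying exactly on price = 2 * area + 5
areas = [10, 20, 30, 40]
prices = [25, 45, 65, 85]
slope, intercept = fit_line(areas, prices)
prediction = slope * 80 + intercept   # same hand check as above
print(slope, intercept, prediction)
```

The last line repeats the hand verification: a prediction is the area times the slope plus the intercept.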
Aim 7: To Perform Simple Logistic Regression with R/Python.
CODE:-
import numpy as np
import pandas as pd
from google.colab import files
uploaded = files.upload()
# upload log printed by Colab:
# insurance_data.csv(n/a) - 184 bytes, last modified: 7/16/2021 - 100% done
# Saving insurance_data.csv to insurance_data.csv
df = pd.read_csv("insurance_data.csv")
from matplotlib import pyplot as plt
# first column is the feature, second column is the 0/1 label
X = df.iloc[:, 0:1].values.astype(float)
y = df.iloc[:, 1:2].values.astype(float)
plt.scatter(X, y, marker='*', color='red')
df.head()
from sklearn.model_selection import train_test_split
# hold out 10% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1)
len(X_train)
len(X_test)
X_train
X_test
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
# y_train is a column vector here, which triggers the DataConversionWarning
# shown in the output; passing y_train.ravel() would avoid it
model.fit(X_train, y_train)
X_test
model.predict(X_test)
model.predict([[31]])
model.predict([[67]])
# one row per test sample: [P(class 0), P(class 1)]
model.predict_proba(X_test)
OUTPUT: -
24
3
/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py:760:
DataConversionWarning: A column-vector y was passed when a 1d array was
expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
array([1., 0., 1.])
array([0.])
array([1.])
array([[0.14105334, 0.85894666],
[0.84509337, 0.15490663],
[0.33614027, 0.66385973]])
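Each row of the `predict_proba` output holds the probabilities of class 0 and class 1, which sum to 1. For a single feature, logistic regression obtains the class 1 probability by passing coef * x + intercept through the sigmoid function, and predicts 1 when that probability reaches 0.5. The sketch below illustrates the mapping; the coefficient and intercept are hypothetical values chosen for the example, not the ones fitted above.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, coef, intercept):
    """Return [P(class 0), P(class 1)] for one feature value."""
    p1 = sigmoid(coef * x + intercept)
    return [1.0 - p1, p1]

def predict(x, coef, intercept, threshold=0.5):
    """Predict 1 when P(class 1) is at least the threshold, else 0."""
    return 1 if predict_proba(x, coef, intercept)[1] >= threshold else 0

# hypothetical parameters: decision boundary at x = 40
COEF, INTERCEPT = 0.1, -4.0
for value in (31, 67):
    print(value, predict_proba(value, COEF, INTERCEPT), predict(value, COEF, INTERCEPT))
```

With these assumed parameters the boundary sits where coef * x + intercept equals 0, so inputs below 40 are labelled 0 and inputs above it are labelled 1.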
Aim 8: To implement K-Means clustering algorithm on Real
Time Data and its visualization using R/Python.
CODE: -
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use("ggplot")
from sklearn.cluster import KMeans
x = [1, 1.5, 2, 2.5, 3, 3.5, 4, 5, 5.5, 6, 7]
y = [0.0000259, 0.0000443, 0.0000847, 0.0001321, 0.000139, 0.0001155,
     0.0001071, 0.0002162, 0.0003909, 0.0007056, 0.001915]
plt.scatter(x, y)
plt.show()
# stack the two lists into an 11 x 2 array of (x, y) points
X = np.column_stack((x, y))
# Here we are doing it for n=2 (number of clusters will be 2)
kmeans = KMeans(n_clusters=2)
# Fit the model, that is, apply the algorithm to our dataset
kmeans.fit(X)
# Cluster centers and the cluster label assigned to each point
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
# Report the number of iterations, the centers and the labels
print(kmeans.n_iter_)
print(centroids)
print(labels)
colors = ["g.", "r.", "c.", "y."]
# inertia: sum of squared distances of the points to their nearest center
print(kmeans.inertia_)
# plot every point in the colour of its cluster
for i in range(len(X)):
    print("coordinate:", X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)
# mark the cluster centers with a cross
plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=150, linewidths=5, zorder=10)
plt.show()
Output:-
2
[[2.50000000e+00 9.26571429e-05]
[5.87500000e+00 8.06925000e-04]]
[0 0 0 0 0 0 0 1 1 1 1]
9.187501771421825
coordinate: [1.00e+00 2.59e-05] label: 0
coordinate: [1.50e+00 4.43e-05] label: 0
coordinate: [2.00e+00 8.47e-05] label: 0
coordinate: [2.500e+00 1.321e-04] label: 0
coordinate: [3.00e+00 1.39e-04] label: 0
coordinate: [3.500e+00 1.155e-04] label: 0
coordinate: [4.000e+00 1.071e-04] label: 0
coordinate: [5.000e+00 2.162e-04] label: 1
coordinate: [5.500e+00 3.909e-04] label: 1
coordinate: [6.000e+00 7.056e-04] label: 1
coordinate: [7.000e+00 1.915e-03] label: 1
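What `KMeans` does internally can be sketched with Lloyd's algorithm, which alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The version below is a plain-Python illustration under stated assumptions: the function names are made up, and the starting centroids are simply the first and last data points, whereas scikit-learn chooses its own starting centroids (k-means++ by default), so its label numbering can differ.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # assignment step: index of the nearest centroid for every point
        labels = [min(range(len(centroids)),
                      key=lambda k: squared_distance(p, centroids[k]))
                  for p in points]
        # update step: move each centroid to the mean of its members
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)   # an empty cluster keeps its centroid
        if updated == centroids:      # no centroid moved: converged
            break
        centroids = updated
    inertia = sum(squared_distance(p, centroids[label])
                  for p, label in zip(points, labels))
    return centroids, labels, inertia

x = [1, 1.5, 2, 2.5, 3, 3.5, 4, 5, 5.5, 6, 7]
y = [0.0000259, 0.0000443, 0.0000847, 0.0001321, 0.000139, 0.0001155,
     0.0001071, 0.0002162, 0.0003909, 0.0007056, 0.001915]
points = list(zip(x, y))
# assumed initialisation: first and last data point
centroids, labels, inertia = k_means(points, [points[0], points[-1]])
print(centroids)
print(labels)
print(inertia)
```

Because the y values are several orders of magnitude smaller than the x values, the distances, and therefore the clusters, are decided almost entirely by x; scaling both columns first would give y more influence.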