
Data Analytics Lab Manual 1



Data analytics lab manual

Data Analytics Lab (Dr. A.P.J. Abdul Kalam Technical University)


ABHISHEK BISHT 1813313006

Department of Information Technology

LAB FILE
ON
Data Analytics
KIT 651
(6th Semester)
(2020 – 2021)

Submitted To: Ms. Tanya Varshney
Submitted By: ABHISHEK BISHT (Roll No: 1813313006)

INDEX
DATA ANALYTICS KIT-651


S.NO  TOPIC  DATE OF PRACTICAL  REMARKS
1  To get the input from user and perform numerical operations (MAX, MIN, AVG, SUM, SQRT, ROUND) using R/Python.
2  To perform data import/export (.CSV, .XLS, .TXT) operations using data frames in R/Python.
3  To get the input matrix from user and perform Matrix addition, subtraction, multiplication, inverse, transpose and division operations using vector concept in R/Python.
4  To perform statistical operations (Mean, Median, Mode and Standard deviation) using R/Python.
5  To perform data pre-processing operations: i) Handling Missing data, ii) Min-Max normalization.
6  To perform Simple Linear Regression with R/Python.
7  To perform Logistic Regression with R/Python.
8  To implement K-Means clustering algorithm on Real Time data and its visualization using R/Python.


Aim 1:- To get the input from user and perform numerical operations (MAX, MIN, AVG, SUM, SQRT, ROUND) using R/Python
CODE:-
a = []
n = int(input("Enter no of element in list"))
for i in range(0, n):
    ele = int(input("Enter the element"))
    a.append(ele)
print("list is = ", a)

# maximum of the numbers
print("maximum number from this is", max(a))

# smallest number
print("smallest number is", min(a))

# average of the numbers
def average(num):
    return sum(num) / len(num)
print("average of a given list is ", average(a))

# sum of the numbers
print("sum of numbers", sum(a))

# square root of a number
number = int(input("Enter the number for finding square root"))
number_sqrt = number ** 0.5
print("square root of a number = ", number_sqrt)

# rounding off a number
num = float(input("enter a number for round of "))


print("round off is ",round(num))

OUTPUT:-
Enter no of element in list 5
Enter the element1
Enter the element5
Enter the element2
Enter the element3
Enter the element5
list is = [1, 5, 2, 3, 5]
maximum number from this is 5
smallest number is 1
average of a given list is 3.2
sum of numbers 16
Enter the number for finding square root 25
square root of a number = 5.0
enter a number for round of 5.3624
round off is 5
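
The same operations can be wrapped in a function so that they can be checked without typing any input. This is a minimal sketch rather than part of the original listing; the name `summarize` is an assumption made for illustration. It uses `math.sqrt` in place of `** 0.5` and shows that Python's built-in `round` sends exact halves to the nearest even integer.

```python
import math

def summarize(values):
    """Return max, min, average and sum of a non-empty list of numbers."""
    return {
        "max": max(values),
        "min": min(values),
        "avg": sum(values) / len(values),
        "sum": sum(values),
    }

stats = summarize([1, 5, 2, 3, 5])   # the list used in the output above
root = math.sqrt(25)                 # square root as a float
rounded = round(5.3624)              # nearest integer
half_down = round(2.5)               # exact half: goes to the even neighbour 2
half_up = round(3.5)                 # exact half: goes to the even neighbour 4
```

If halves should always round upwards, `math.floor(x + 0.5)` is one common alternative for non-negative values.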


Aim 2:- To perform data import/export (.CSV, .XLS, .TXT) operations using data frames in R/Python
CODE:-

1.) Importing data from drive

# Mount Google Drive (Google Colab only), then read the CSV into a data frame
from google.colab import drive
drive.mount("/content/drive")
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/ALOK NIET/ITUR_rain1.csv")

2.) Importing data from our computer

import numpy as np
import pandas as pd
# Importing the dataset: upload it from the local machine (Google Colab only)
from google.colab import files
uploaded = files.upload()
data = pd.read_csv("FL_insurance_sample.csv")

# First column as x and fourth column as y, both as 2-D float arrays
x = data.iloc[:,0:1].values.astype(float)
y = data.iloc[:,3:4].values.astype(float)
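
The listing above covers import only. The export half of the aim can be sketched as a round trip through a temporary folder, assuming a small made-up data frame (the column names and values are illustrative, not taken from the lab datasets). CSV and tab-separated TXT files are written with `to_csv` and read back with `read_csv`; Excel files follow the same pattern with `to_excel` and `read_excel`, which need an Excel engine such as openpyxl installed, so those calls are left as comments.

```python
import os
import tempfile

import pandas as pd

# Made-up sample data, used only to demonstrate the round trip
df = pd.DataFrame({"area": [2600, 3000, 3200],
                   "price": [550000, 565000, 610000]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "sample.csv")
    txt_path = os.path.join(tmp, "sample.txt")

    # Export: index=False keeps the row index out of the file
    df.to_csv(csv_path, index=False)
    df.to_csv(txt_path, sep="\t", index=False)

    # Import: the separator must match the one used when writing
    from_csv = pd.read_csv(csv_path)
    from_txt = pd.read_csv(txt_path, sep="\t")

    # Excel (requires an engine such as openpyxl):
    # df.to_excel(os.path.join(tmp, "sample.xlsx"), index=False)
    # from_xlsx = pd.read_excel(os.path.join(tmp, "sample.xlsx"))
```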


Aim 3 :- To get the input matrix from user and perform Matrix addition, subtraction, multiplication, inverse, transpose and division operations using vector concept in R/Python.
CODE :-
def matrix(m, n):
    # Read an m x n matrix from the user, one element at a time
    o = []
    for i in range(m):
        row = []
        for j in range(n):
            inp = int(input(f"enter o[{i}][{j}]"))
            row.append(inp)
        o.append(row)
    print("\n")
    return o

def result(m, n):
    # Build an m x n matrix of zeros to hold a result
    o = []
    for i in range(m):
        row = []
        for j in range(n):
            row.append(0)
        o.append(row)
    return o

def add(a, b):


    k = []
    for i in range(len(a)):
        row = []
        for j in range(len(a[0])):
            row.append(a[i][j] + b[i][j])
        k.append(row)
    return k

def sub(a, b):
    s = []
    for i in range(len(a)):
        row = []
        for j in range(len(a[0])):
            row.append(a[i][j] - b[i][j])
        s.append(row)
    return s

def multiplication(a, b, r):
    # r must be a zero matrix with len(a) rows and len(b[0]) columns;
    # the product is defined only when len(a[0]) == len(b)
    for i in range(len(a)):
        for j in range(len(b[0])):
            for k in range(len(b)):
                r[i][j] += a[i][k] * b[k][j]
    return r

def transpose(a, r2):
    # r2 must be a zero matrix with len(a[0]) rows and len(a) columns
    for i in range(len(a)):
        for j in range(len(a[0])):
            r2[j][i] = a[i][j]


    return r2

import numpy as np

m = int(input("Enter number of rows"))
n = int(input("Enter number of columns"))

print("Enter the elements of matrix A row wise")
a = matrix(m, n)
print(a)

print("Enter the elements of matrix B row wise")
b = matrix(m, n)
print(b)

r = result(m, n)
r2 = result(n, m)

c = add(a, b)
print("Sum of matrices = ", c)

d = sub(a, b)
print("Subtraction of A and B = ", d)


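# Multiplication and inverse assume square matrices (rows == columns)
p = multiplication(a, b, r)
print("Multiplication of A and B = ", p)

t = transpose(a, r2)
print("Transpose of A = ", t)

# Use a fresh zero matrix so the transpose of A held in t is not overwritten
t2 = transpose(b, result(n, m))
print("Transpose of B = ", t2)

I = np.array(a)
print("Inverse of matrix A = ", np.linalg.inv(I))

I2 = np.array(b)
print("Inverse of matrix B = ", np.linalg.inv(I2))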

OUTPUT:-
Enter number of rows2
Enter number of columns2
Enter the elements of matrix A row wise
enter o[0][0]1
enter o[0][1]2
enter o[1][0]3
enter o[1][1]4
[[1, 2], [3, 4]]
Enter the elements of matrix B row wise
enter o[0][0]1
enter o[0][1]2

enter o[1][0]3


enter o[1][1]4
[[1, 2], [3, 4]]
Sum of matrices = [[2, 4], [6, 8]]
Subtraction of A and B = [[0, 0], [0, 0]]
Multiplication of A and B = [[7, 10], [15, 22]]
Transpose of A = [[1, 3], [2, 4]]
Transpose of B = [[1, 3], [2, 4]]
Inverse of matrix A = [[-2. 1. ] [ 1.5 -0.5]]
Inverse of matrix B = [[-2. 1. ] [ 1.5 -0.5]]
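
The aim also lists division, which the listing does not show. One possible reading, sketched here with NumPy arrays, treats matrix division as multiplication by the inverse, A · B⁻¹, and contrasts it with element-wise division. The variable names are assumptions, and the matrices are the ones entered in the output above.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]], dtype=float)
B = np.array([[1, 2], [3, 4]], dtype=float)

# "Division" in the matrix sense: A multiplied by the inverse of B.
# np.linalg.inv raises LinAlgError if B is singular.
right_division = A @ np.linalg.inv(B)

# Element-wise division: each entry of A divided by the matching entry of B.
elementwise = A / B

# The same vectorised style covers the other operations of this practical.
total = A + B
difference = A - B
product = A @ B
transposed = A.T
```

Because A and B are equal here, `right_division` is numerically the identity matrix and `elementwise` is a matrix of ones.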


Aim 4:- To perform statistical operations (Mean, Median, Mode and Standard deviation) using R/Python
CODE:-

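import numpy as np
from scipy import stats

a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 7, 6, 5]

x = np.mean(a)
print("Mean is =", x)

y = np.median(a)
print("median = ", y)

# stats.mode reports a single mode; on a tie it returns the smallest value
z = stats.mode(a)
print("Mode = ", z)

# np.std computes the population standard deviation by default (ddof=0)
b = np.std(a)
print("standard deviation is ", b)

OUTPUT:-
Mean is = 5.25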


median = 5.5
Mode = ModeResult(mode=array([5]), count=array([2]))
standard deviation is 2.3139072294858036
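
As a cross-check, the same four statistics can be computed with Python's standard `statistics` module. This sketch is an addition to the lab listing; it shows that the list actually has three modes (5, 6 and 7 each occur twice) and separates the population standard deviation, which matches the value above, from the sample standard deviation.

```python
import statistics

a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 7, 6, 5]

mean = statistics.mean(a)
median = statistics.median(a)
# multimode returns every value that shares the highest count
modes = statistics.multimode(a)
# pstdev divides by n (population); stdev divides by n - 1 (sample)
pop_std = statistics.pstdev(a)
sample_std = statistics.stdev(a)
```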

Aim 5:- To perform Simple Linear Regression with R/python.


CODE:-

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

# read the training data from Drive
df = pd.read_csv('/content/drive/MyDrive/ALOK NIET/test.csv')

# draw the data points on a scatter plot
plt.scatter(df.area, df.price, marker='*', color='red')
plt.xlabel('Area(Sq.ft.)')
plt.ylabel('Price(us$)')

from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(df[['area']], df.price)

# predict the price for an area of 3300 sq. ft.
reg.predict([[3300]])


# slope of the fitted line
reg.coef_

# y-intercept
reg.intercept_

# predict prices for the areas listed in area.csv
df1 = pd.read_csv("/content/drive/MyDrive/ALOK NIET/area.csv")
p = reg.predict(df1)
df1['price'] = p

# export the result to Drive (file name: regression.csv)
df1.to_csv('/content/drive/MyDrive/ALOK NIET/regression.csv', index=True)

plt.xlabel('Area(Sq.ft.)')
plt.ylabel('Price(us$)')
plt.scatter(df.area, df.price, marker='*', color='red')
plt.plot(df.area, reg.predict(df[['area']]), color='blue')

# score() returns the coefficient of determination (R^2) on the given data
Accuracy = reg.score(df[['area']], df.price)
print('Accuracy=', Accuracy)


OUTPUT:-
Text(0, 0.5, 'Price(us$)')

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)


180616.43835616432

[<matplotlib.lines.Line2D at 0x7f2270d23e50>]

Accuracy= 0.9584301138199486
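
The value printed as Accuracy is what `LinearRegression.score` returns, namely the coefficient of determination R² = 1 − SS_res / SS_tot. A minimal sketch of that formula follows; the function name `r_squared` and the small number lists are assumptions used only for illustration, not data from the lab.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # residual sum of squares: error left over after the model's predictions
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # total sum of squares: error of always predicting the mean
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot

perfect = r_squared([1, 2, 3], [1, 2, 3])      # predictions match exactly
mean_only = r_squared([1, 2, 3], [2, 2, 2])    # no better than the mean
```

A score of 1 means the line passes through every point, and a score of 0 means the model does no better than predicting the mean price.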


Aim 6:- To perform data pre-processing operations


CODE:-

i) Handling Missing data

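import pandas as pd
import numpy as np

df = pd.read_csv("/content/drive/MyDrive/ALOK NIET/titanic.csv")
df.head()

# drop the columns that are not used as features
df.drop(["PassengerId","Name","SibSp","Parch","Ticket","Cabin","Embarked"], axis="columns", inplace=True)
df.head()

target = df.Survived
inputs = df.drop("Survived", axis="columns")

# one-hot encode Sex; this step is assumed, since the listing uses dummies
# without defining it and later drops the resulting 'male' column
dummies = pd.get_dummies(inputs.Sex)
inputs = pd.concat([inputs, dummies], axis="columns")
inputs.head(4)

# drop the original Sex column and the redundant 'male' indicator
inputs.drop(['Sex','male'], axis='columns', inplace=True)
inputs.head(3)

# columns that contain missing values
inputs.columns[inputs.isna().any()]

# replace missing ages with the mean age
inputs.Age = inputs.Age.fillna(inputs.Age.mean())
inputs.head()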


inputs.Age[:10]

# hold out 30% of the rows for testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(inputs, target, test_size=0.3)
len(X_train)
len(X_test)
len(inputs)

# Gaussian Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()

X_train

model.fit(X_train, y_train)
model.score(X_test, y_test)
model.predict_proba(X_test[0:10])

OUTPUT:-


Index(['Age'], dtype='object')
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
5 NaN
6 54.0
7 2.0
8 27.0
9 14.0
Name: Age, dtype: float64
623
268
X_train

GaussianNB(priors=None, var_smoothing=1e-09)
0.7574626865671642
552 0
176 0
630 1
162 0
849 1


94 0
360 0
596 1
640 0
146 1
Name: Survived, dtype: int64
array([[0.9607858 , 0.0392142 ],
[0.9587296 , 0.0412704 ],
[0.63649097, 0.36350903],
[0.95854681, 0.04145319],
[0.00536352, 0.99463648],
[0.9603635 , 0.0396365 ],
[0.96114074, 0.03885926],
[0.16354633, 0.83645367],
[0.95346105, 0.04653895],
[0.95921577, 0.04078423]])
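
The missing-data step in part i) can be tried without the Titanic file. This sketch assumes a tiny made-up data frame (the values are illustrative only): it lists the columns that contain missing values and then fills the gaps with the column mean, as the listing does for Age.

```python
import pandas as pd

# Made-up rows; None becomes NaN in a float column
df = pd.DataFrame({"Age": [22.0, 38.0, None, 35.0],
                   "Fare": [7.25, 71.28, 8.05, 53.10]})

# Columns with at least one missing value
cols_with_na = list(df.columns[df.isna().any()])

# mean() skips NaN by default, so the fill value is the mean of 22, 38 and 35
df["Age"] = df["Age"].fillna(df["Age"].mean())
```

Mean imputation keeps every row but pulls the column's spread inwards; dropping the incomplete rows with `dropna` is the usual alternative when only a few rows are affected.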


ii) Min-Max normalization

import matplotlib.pyplot as plt


inputs.plot(kind = 'bar')

Output:

df_max_scaled = inputs.copy()
# apply maximum absolute scaling: divide each column by its largest absolute value
for column in df_max_scaled.columns:
    df_max_scaled[column] = df_max_scaled[column] / df_max_scaled[column].abs().max()
# view normalized data
display(df_max_scaled)

Output:


import matplotlib.pyplot as plt


df_max_scaled.plot(kind = 'bar')

Output:


df_min_max_scaled = inputs.copy()
# apply min-max normalization: (x - min) / (max - min) maps each column to [0, 1]
for column in df_min_max_scaled.columns:
    df_min_max_scaled[column] = (df_min_max_scaled[column] - df_min_max_scaled[column].min()) / (df_min_max_scaled[column].max() - df_min_max_scaled[column].min())
# view normalized data
print(df_min_max_scaled)
Output:


import matplotlib.pyplot as plt


df_min_max_scaled.plot(kind = 'bar')

Output:
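
The two scalings used in part ii) can be checked by hand on a short list. The sketch below is plain Python with hypothetical function names and made-up numbers: maximum absolute scaling computes x / max|x|, and min-max normalization computes (x − min) / (max − min). Returning zeros for a constant column is a choice made for this sketch; the pandas loops above would produce NaN in that case.

```python
def max_abs_scale(values):
    """Divide every value by the largest absolute value (result in [-1, 1])."""
    largest = max(abs(v) for v in values)
    return [v / largest for v in values]

def min_max_scale(values):
    """Map values linearly so the minimum becomes 0 and the maximum 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # constant column: the range is zero, so return zeros instead of dividing by 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scaled_abs = max_abs_scale([-4, 2, 8])
scaled_min_max = min_max_scale([2, 4, 6, 10])
```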


Aim 6: To perform Simple Linear Regression with R/Python.


CODE: -
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

ar = pd.read_csv("/content/drive/MyDrive/mount file/area.csv")

x1 = ar.iloc[:,0:1].values.astype(float)  # x1 = ar.Area
y1 = ar.iloc[:,1:2].values.astype(float)  # y1 = ar.Price
#print(x1)
#print(y1)

plt.scatter(x1, y1, marker='*', color='red')
plt.xlabel('Area(Sq.ft.)')
plt.ylabel('Price(INR)')
#plt.xticks(())
#plt.yticks(())
plt.show()

Output:


from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(ar[['Area']], ar.Price)
# LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
regr.predict([[80]])
# array([9171.67605634])
regr.coef_
# array([100.6691978])
regr.intercept_
# 1118.140232700558
# The prediction can be verified by hand: slope * 80 + intercept
80*100.6691978+1118.140232700558
# 9171.676056700559
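
The slope and intercept that the fitted model reports can also be derived from the least-squares formulas: slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and intercept = ȳ − slope · x̄. The sketch below assumes made-up points that lie exactly on y = 2x + 1, since area.csv is not reproduced in this file; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # covariance-like and variance-like sums about the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up points on the line y = 2x + 1
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])

# Prediction is slope * x + intercept, the same check done by hand above
prediction_at_80 = slope * 80 + intercept
```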


Aim 7: To Perform Simple Logistic Regression with R/Python.


CODE:-

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from google.colab import files

# upload insurance_data.csv from the local machine (Google Colab only)
uploaded = files.upload()
# insurance_data.csv(n/a) - 184 bytes, last modified: 7/16/2021 - 100% done
# Saving insurance_data.csv to insurance_data.csv

df = pd.read_csv("insurance_data.csv")
X = df.iloc[:,0:1].values.astype(float)
y = df.iloc[:,1:2].values.astype(float)
plt.scatter(X, y, marker='*', color='red')
df.head()

# hold out 10% of the rows for testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1)
len(X_train)
len(X_test)
X_train
X_test

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
# y_train is a column vector here, which triggers the DataConversionWarning
# seen in the output; model.fit(X_train, y_train.ravel()) would avoid it
model.fit(X_train, y_train)
X_test
model.predict(X_test)


model.predict([[31]])
model.predict([[67]])
model.predict_proba(X_test)

OUTPUT: -


24
3
/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py:760:
DataConversionWarning: A column-vector y was passed when a 1d array was
expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
array([1., 0., 1.])

array([0.])
array([1.])

array([[0.14105334, 0.85894666],
[0.84509337, 0.15490663],
[0.33614027, 0.66385973]])
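
Each row of the `predict_proba` output holds P(class 0) and P(class 1), and the two always add up to 1. For a single feature, a logistic model computes P(class 1) as the sigmoid of coef · x + intercept. The sketch below illustrates this with hypothetical coefficients (0.1 and −4.0 are assumptions chosen so the decision boundary falls at x = 40; they are not the values fitted in this practical).

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, coef, intercept):
    """Return [P(class 0), P(class 1)] for one feature value."""
    p1 = sigmoid(coef * x + intercept)
    return [1.0 - p1, p1]

def predict(x, coef, intercept):
    """Class 1 when its probability exceeds 0.5, otherwise class 0."""
    return 1 if predict_proba(x, coef, intercept)[1] > 0.5 else 0

# Hypothetical coefficients, not taken from the fitted model
coef, intercept = 0.1, -4.0

young = predict(31, coef, intercept)
old = predict(67, coef, intercept)
proba_50 = predict_proba(50, coef, intercept)
```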

Aim 8: To implement K-Means clustering algorithm on Real Time Data and its visualization using R/Python.
CODE: -
import numpy as np
import matplotlib.pyplot as plt


from matplotlib import style
style.use("ggplot")
from sklearn.cluster import KMeans

x = [1, 1.5, 2, 2.5, 3, 3.5, 4, 5, 5.5, 6, 7]
y = [0.0000259, 0.0000443, 0.0000847, 0.0001321, 0.000139, 0.0001155,
     0.0001071, 0.0002162, 0.0003909, 0.0007056, 0.001915]
plt.scatter(x, y)
plt.show()


X = np.array([[1, 0.0000259],
              [1.5, 0.0000443],
              [2, 0.0000847],
              [2.5, 0.0001321],
              [3, 0.000139],
              [3.5, 0.0001155],
              [4, 0.0001071],
              [5, 0.0002162],
              [5.5, 0.0003909],
              [6, 0.0007056],
              [7, 0.001915]])
# Number of clusters: 2, matching the two centroids in the reported output
kmeans = KMeans(n_clusters=2)
# Fit the model, i.e. apply the algorithm to the dataset
kmeans.fit(X)
# Cluster centres and the cluster label assigned to each point
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
# Number of iterations run, then the centroids and the labels
print(kmeans.n_iter_)
print(centroids)
print(labels)

colors = ["g.", "r.", "c.", "y."]
# Sum of squared distances from each point to its nearest centroid
print(kmeans.inertia_)
for i in range(len(X)):
    print("coordinate:", X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)


plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=150, linewidths=5, zorder=10)
plt.show()
Output:-

2
[[2.50000000e+00 9.26571429e-05]
[5.87500000e+00 8.06925000e-04]]
[0 0 0 0 0 0 0 1 1 1 1]
9.187501771421825

coordinate: [1.00e+00 2.59e-05] label: 0


coordinate: [1.50e+00 4.43e-05] label: 0
coordinate: [2.00e+00 8.47e-05] label: 0
coordinate: [2.500e+00 1.321e-04] label: 0
coordinate: [3.00e+00 1.39e-04] label: 0
coordinate: [3.500e+00 1.155e-04] label: 0
coordinate: [4.000e+00 1.071e-04] label: 0
coordinate: [5.000e+00 2.162e-04] label: 1
coordinate: [5.500e+00 3.909e-04] label: 1


coordinate: [6.000e+00 7.056e-04] label: 1


coordinate: [7.000e+00 1.915e-03] label: 1
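
The reported centroids, labels and inertia can be reproduced with a plain-Python sketch of the two alternating K-Means steps: assign every point to its nearest centroid, then move each centroid to the mean of its points. This is an illustration rather than scikit-learn's implementation; the helper names and the choice of the first and last points as starting centroids are assumptions, and the sketch does not handle a cluster that becomes empty.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Assignment step: index of the nearest centroid for each point."""
    labels = []
    for p in points:
        distances = [squared_distance(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

def update(points, labels, k):
    """Update step: each centroid becomes the mean of its assigned points."""
    centroids = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return centroids

def k_means(points, centroids, max_iter=100):
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, len(centroids))
        if new_centroids == centroids:   # converged: centroids stopped moving
            break
        centroids = new_centroids
    return centroids, labels

points = [(1, 0.0000259), (1.5, 0.0000443), (2, 0.0000847), (2.5, 0.0001321),
          (3, 0.000139), (3.5, 0.0001155), (4, 0.0001071), (5, 0.0002162),
          (5.5, 0.0003909), (6, 0.0007056), (7, 0.001915)]

# Assumed starting centroids: the first and the last data point
centroids, labels = k_means(points, [points[0], points[-1]])

# Inertia: sum of squared distances from each point to its own centroid
inertia = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
```

Since the first coordinate ranges from 1 to 7 while the second stays below 0.002, the distances, and therefore the clusters, are determined almost entirely by the first coordinate. Scaling both features to a common range before clustering would let the second one contribute.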
