AIM:
To perform data manipulation using NumPy and pandas, and data visualization using Matplotlib.
Program:
A. Data Manipulation using NumPy
NumPy stands for Numerical Python. It is a Python library used for working with arrays.
array(): Used to create a NumPy ndarray
copy(): Returns a new array that owns its data. Changes made to the copy do not affect the
original array, and changes made to the original do not affect the copy
view(): Returns a view of the original array. The view does not own the data, so changes
made to the view affect the original array, and changes made to the original affect the view
reshape(): Used to change the shape (dimensions) of the array without changing its data
concatenate(): Used to join two or more arrays into a single array
split(): Splitting is the reverse operation of joining. The array and the number of splits are
passed as arguments.
where(): Searches an array for elements that satisfy a condition and returns the indices of the matches
sort(): Sorting means putting elements in an ordered sequence. np.sort() returns a sorted copy of the array.
random: Module for working with random numbers
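The random module is listed above but is not used in the programs that follow. A minimal sketch of it is given here, assuming the Generator interface (np.random.default_rng); the seed value and the array sizes are arbitrary choices made for illustration.

```python
import numpy as np

# Seeded generator so that repeated runs give the same numbers (seed 0 is arbitrary)
rng = np.random.default_rng(0)

ints = rng.integers(low=1, high=7, size=5)      # 5 integers from 1 to 6 (high is exclusive)
floats = rng.random(3)                          # 3 floats in the interval [0, 1)
pick = rng.choice(np.array([10, 20, 30, 40]))   # one element chosen at random

print(ints)
print(floats)
print(pick)
```

The printed values depend on the seed and on the NumPy version, so no output is listed here.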
1. Program to create array and perform reshape:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
x = arr.reshape(2, 3)
print(x)
Output:
[[1 2 3]
 [4 5 6]]
2. Program to demonstrate copy and view
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
x = arr.copy()
arr[0] = 42
print(arr)
print(x)
arr = np.array([11, 12, 13, 14, 15])
y = arr.view()
arr[0] = 42
print(arr)
print(y)
Output:
[42 2 3 4 5]
[1 2 3 4 5]
[42 12 13 14 15]
[42 12 13 14 15]
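Whether an array owns its data can also be checked directly through its base attribute. The following is a small sketch of that check; the variable names c and v are chosen here for illustration only.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
c = arr.copy()
v = arr.view()

print(c.base is None)   # the copy owns its data, so it has no base array
print(v.base is arr)    # the view borrows its data from arr

v[1] = 99               # writing through the view also changes arr
print(arr)
print(c)                # the copy keeps the original values
```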
3. Program to demonstrate concatenate and split
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.concatenate((arr1, arr2))
print(arr)
print(np.split(arr,2))
Output:
[1 2 3 4 5 6]
[array([1, 2, 3]), array([4, 5, 6])]
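np.split() requires the array to divide into equal parts. When it does not, np.array_split() can be used instead. A short sketch, assuming a 6-element array split into 4 parts:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

# 6 elements cannot be divided equally into 4 parts, so array_split
# makes the leading parts one element longer than the rest
parts = np.array_split(arr, 4)
print(parts)

# np.split refuses an unequal division
try:
    np.split(arr, 4)
except ValueError as err:
    print("np.split failed:", err)

# Joining the parts again restores the original array
rejoined = np.concatenate(parts)
print(rejoined)
```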
4. Program to demonstrate sorting and searching
import numpy as np
arr = np.array([3, 2, 0, 1])
print(np.sort(arr))
print(np.where(arr==2))
Output:
[0 1 2 3]
(array([1]),)
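np.where() returns a tuple containing one index array per dimension, so the indices themselves are taken from element 0 of the result. The sketch below shows this, along with np.argsort(), which gives the indices that would sort the array; the sample values are chosen here for illustration.

```python
import numpy as np

arr = np.array([3, 2, 0, 1, 2])

matches = np.where(arr == 2)   # tuple with one index array per dimension
idx = matches[0]               # positions at which the value 2 occurs
print(idx)

order = np.argsort(arr)        # indices that would put arr in ascending order
print(order)
print(arr[order])              # same values as np.sort(arr)
```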
B. Data Manipulation using Pandas
Pandas is a Python library that allows us to analyze large datasets and draw conclusions based on
statistical theory.
read_csv(file name): Used to load a dataset from a CSV file
dropna(): Used to remove rows or columns with missing values
fillna(): Used to replace NaN values with other values
mean(): Returns the mean of the values along the requested axis
sum(): Returns the sum of the values along the requested axis
count(): Returns the number of non-NA/null observations along the given axis
1. Program for handling missing data:
import numpy as np
import pandas as pd
# Load Dataset
dframe = pd.read_csv('Data.csv')
print(dframe)
# Dropping rows with missing data (returns a new DataFrame; dframe itself is unchanged)
print(dframe.dropna())
# Filling missing data with the column mean
dframe.fillna(value = dframe.mean(), inplace = True)
print(dframe)
Output:
    a     b    c
0  23  10.0  0.0
1  24  12.0  NaN
2  22   NaN  NaN
    a     b    c
0  23  10.0  0.0
    a     b    c
0  23  10.0  0.0
1  24  12.0  0.0
2  22  11.0  0.0
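The program above depends on a file named Data.csv. As a sketch that runs without the file, the same values can be built in memory; the DataFrame below is reconstructed from the printed output and is an assumption about the file's contents. The names dropped and filled are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Values taken from the output shown above (assumed contents of Data.csv)
dframe = pd.DataFrame({'a': [23, 24, 22],
                       'b': [10.0, 12.0, np.nan],
                       'c': [0.0, np.nan, np.nan]})

print(dframe.isna().sum())              # number of missing values in each column

dropped = dframe.dropna()               # keeps only the rows with no missing value
filled = dframe.fillna(dframe.mean())   # replaces each NaN with its column mean

print(dropped)
print(filled)
```

Neither call uses inplace=True here, so dframe keeps its missing values and both results can be compared against it.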
2. Program for merge and join of data frames using pandas:
import numpy as np
import pandas as pd
left = pd.DataFrame({'Key': ['K0', 'K1', 'K2', 'K3'], 'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'Key': ['K0', 'K1', 'K2', 'K3'], 'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
# Merging the dataframes
print(pd.merge(left, right, how ='inner', on ='Key'))
# Joining the dataframes
left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']}, index= ['K0', 'K1', 'K2', 'K3'])
right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']}, index= ['K0', 'K1', 'K2', 'K3'])
print(left.join(right))
Output:
  Key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2
3  K3  A3  B3  C3  D3
     A   B   C   D
K0  A0  B0  C0  D0
K1  A1  B1  C1  D1
K2  A2  B2  C2  D2
K3  A3  B3  C3  D3
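In the program above both data frames contain the same keys, so every value of how gives the same rows. The sketch below uses two small data frames with only partly matching keys (made up here for illustration) to show how the how argument changes the result.

```python
import pandas as pd

# Keys K1 and K2 appear in both frames; K0 only in left, K3 only in right
left = pd.DataFrame({'Key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'Key': ['K1', 'K2', 'K3'], 'C': ['C1', 'C2', 'C3']})

inner = pd.merge(left, right, how='inner', on='Key')      # keys present in both frames
outer = pd.merge(left, right, how='outer', on='Key')      # keys present in either frame
left_only = pd.merge(left, right, how='left', on='Key')   # all keys of the left frame

print(inner)
print(outer)
print(left_only)
```

Rows without a partner in the other frame are kept by the outer and left merges, with NaN in the columns that have no matching value.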
3. Program for aggregation in pandas:
import numpy as np
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
# Mean
print(df.mean())
# Count
print(df.count())
# Median
print(df.median())
# Sum
print(df.sum())
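# Mad (Mean Absolute Deviation); DataFrame.mad() was removed in pandas 2.0,
# so it is computed from its definition
print((df - df.mean()).abs().mean())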
# Std (Standard deviation)
print(df.std())
# Var (Variance)
print(df.var())
Output:
          A         B
0  0.374540  0.155995
1  0.950714  0.058084
2  0.731994  0.866176
3  0.598658  0.601115
4  0.156019  0.708073
A 0.562385
B 0.477888
dtype: float64
A 5
B 5
dtype: int64
A 0.598658
B 0.601115
dtype: float64
A 2.811925
B 2.389442
dtype: float64
A 0.237685
B 0.296679
dtype: float64
A 0.308748
B 0.353125
dtype: float64
A 0.095325
B 0.124697
dtype: float64
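Several of these aggregations can also be requested in a single call with agg(). The sketch below assumes the data frame shown in the output above, typed in by hand so that it runs without data.csv; the names summary and mad are chosen here for illustration.

```python
import pandas as pd

# Values copied from the printed data frame above (rounded to 6 decimals)
df = pd.DataFrame({'A': [0.374540, 0.950714, 0.731994, 0.598658, 0.156019],
                   'B': [0.155995, 0.058084, 0.866176, 0.601115, 0.708073]})

# One row per aggregation, one column per column of df
summary = df.agg(['mean', 'count', 'median', 'sum', 'std', 'var'])
print(summary)

# Mean absolute deviation, computed from its definition
mad = (df - df.mean()).abs().mean()
print(mad)
```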
C. Data Visualization using Matplotlib
Matplotlib is a multi-platform data visualization library built on NumPy arrays. Matplotlib
supports several plot types such as line, bar, scatter, and histogram.
figure(): Creates a figure, a single container that holds all the objects representing axes, graphics,
text, and labels
axes(): Creates an axes, a bounding box with ticks and labels that will eventually contain the plot
elements that make up the visualization
xlim(), ylim(): Used to set the x and y limits of the axes
axis([xmin, xmax, ymin, ymax]): Used to set the x and y limits in a single call
xlabel(), ylabel(): Used to set the labels of the axes
legend(): Used to display a legend for the labeled plot lines in the graph
plt.axis('equal'): Sets equal scaling, so that one unit on the x axis has the same length as one unit on the y axis
plt.axis('tight'): Sets the axis limits just large enough to show all the data
plt.savefig(): Used to save the output graph to a file in the specified format
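The functions listed above can be combined as in the following sketch. The Agg backend is selected only so that the code runs without a display, and the sine and cosine data are chosen here for illustration.

```python
import matplotlib
matplotlib.use('Agg')   # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 50)

fig = plt.figure()      # container for the whole drawing
ax = plt.axes()         # bounding box that holds the plot elements
ax.plot(x, np.sin(x), label='sin(x)')
ax.plot(x, np.cos(x), linestyle='dashed', label='cos(x)')

plt.axis([0, 10, -1.5, 1.5])   # xmin, xmax, ymin, ymax in one call
plt.xlabel('x')
plt.ylabel('y')
plt.legend()                   # uses the label given to each plot line

limits = plt.axis()            # called without arguments, returns the current limits
print(limits)
```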
1. Line Drawing
Program:
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('seaborn-v0_8-whitegrid') # named 'seaborn-whitegrid' in Matplotlib versions before 3.6
fig = plt.figure()
plt.xlim(1,10)
plt.ylim(1,20)
plt.axis('tight')
plt.title('Line styles')
plt.xlabel('x')
plt.ylabel('Y')
x=np.linspace(1,20,10)
plt.plot(x, x + 0, linestyle='solid', linewidth=5,label='solid')
plt.plot(x, x + 1, linestyle='dashed',label='dashed')
plt.plot(x, x + 2, linestyle='dashdot',label='dash dot')
plt.plot(x, x + 3, linestyle='dotted',label='dotted')
plt.legend()
plt.savefig('sincos.png') # Output graph saved in PNG format in the current folder
Output:
2. Scatter Plot
Program:
import matplotlib.pyplot as plt
import numpy as np
plt.xlim(1,10)
plt.ylim(0,10)
plt.title('Scatter plots')
plt.xlabel('x axis')
plt.ylabel('Y axis')
x = np.linspace(0, 10, 10)
y = x+2
plt.plot(x, y, 'o', color='green') # scatter plot using plt.plot()
# OR (either call alone is enough)
plt.scatter(x, y, marker='o', color='green') # scatter plot using plt.scatter()
plt.savefig('scatter.pdf') # Output graph saved in PDF format in the current folder
Output:
3. Bar chart
Program:
import matplotlib.pyplot as plt
# x-coordinates of left sides of bars
left = [1, 2, 3, 4, 5]
# heights of bars
height = [10, 24, 36, 40, 5]
# labels for bars
tick_label = ['one', 'two', 'three', 'four', 'five']
# plotting a bar chart
plt.bar(left, height, tick_label = tick_label,
width = 0.8, color = ['red', 'green'])
# naming the x-axis
plt.xlabel('x - axis')
# naming the y-axis
plt.ylabel('y - axis')
# plot title
plt.title('My bar chart!')
Output: