DA Python Record & Manual - 2nd Yr
PRACTICAL – 1
1. Use matplotlib and plot inline in Jupyter.
Aim : To plot graphs using matplotlib.
Matplotlib inline : Now, let's talk about the %matplotlib magic function. It sets up
matplotlib to work interactively. It lets you activate matplotlib's interactive
support anywhere in an IPython session (like in a Jupyter notebook).
#Scatter plots are used to observe relationships between variables and use dots
to represent the relationship between them.The scatter() method in the
matplotlib library is used to draw a scatter plot.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
tips=pd.read_csv("C:/Users/dell/Desktop/tips.csv")
plt.scatter(tips['day'], tips['tip'])
plt.title("Tips")
plt.xlabel("Day")
plt.ylabel("Tip")
plt.show()
PRACTICAL – 2
Implement Commands of Python Language basics
Python is an interpreted, interactive, object-oriented programming language. It
incorporates modules, exceptions, dynamic typing, very high level dynamic data types, and
classes. It supports multiple programming paradigms beyond object-oriented programming,
such as procedural and functional programming.
● Comments:
Comments start with '#'; the interpreter ignores the rest of the line.
# add two numbers
a=5
b=4
c=a+b
print(c)
Out[] : 9
● Creating variables:
Variables in Python must start with a letter or an underscore. A Python
variable is a reserved memory location used to store values.
Ex:
1. In []: a=5
   a
   Out []: 5
2. In []: a=b=1
   b
   Out []: 1
● Identifiers:
Valid Identifiers:
1.var1
2. _var1
3._1_var
4. var_1
● Keywords:
The keywords are some predefined and reserved words in python that
have special meanings. Keywords are used to define the syntax of the
coding. The keyword cannot be used as an identifier, function, and variable
name.
Examples: if, import, in, is, yield
● Literals:
These are notations for representing a fixed value. They can also be
defined as raw value or data given in variables or constants.
Types of literals:
a) Numeric Literals:
They are immutable and there are three types of numeric literal :
In []: x = 88 # Integer literal
y = 24.3 # Float literal
z = 6+2j # Complex literal (Python uses j, not i)
print(x, y ,z)
b) String literals:
In []: s = 'Hello' # in single quotes
d = "Hello everyone" # in double quotes
# multi-line String
k = '''Hello
everyone'''
print(s)
print(d)
print(k)
Out []:
Hello
Hello everyone
Hello
everyone
c) Boolean Literals:
There are only two Boolean literals in Python: True and False.
In []: a = (1 == True)
b = (0 == False)
# creating a list
my_list = [20,50,"alex",60]
my_list.append(55)
my_list
my_list.extend(["manu",76])
my_list
my_list.insert(2, "love")
my_list
my_list.pop(4)
60
my_list
[20, 50, 'love', 'alex', 55, 'manu', 76]
my_list.remove(50)
my_list
my_list[4]
'manu'
my_list[:]
my_list[2:]
## tuples
my_tuple = ("Alex", "Sowji", 100, 95, 76)
my_tuple
('Alex', 'Sowji', 100, 95, 76)
## slicing the tuple
my_tuple[1:]
('Sowji', 100, 95, 76)
my_tuple[:4]
('Alex', 'Sowji', 100, 95)
my_tuple[4]
76
my_tuple[-5]
'Alex'
PRACTICAL – 4
Create built-in Sequence functions
Python has a handful of useful sequence functions.
● sorted () :
The sorted function returns a new sorted list from the elements of any
sequence.
Ex:
The sorted function accepts the same arguments as the sort method of lists.
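Since the Ex: above is blank, here is a minimal sketch (sample values assumed):

```python
# sorted() returns a new, sorted list from the elements of any sequence
nums = [7, 1, 2, 6, 0, 3, 2]
print(sorted(nums))
print(sorted('horse race'))  # works on any sequence, e.g. the characters of a string
print(nums)  # the original sequence is left unchanged
```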
● zip ():
Zip function pairs up the elements of a number of lists, tuples or other sequences
to create a list of tuples:
Ex:
Zip can be applied in a clever way to “unzip” the sequence. Another way to think
about this is converting a list of rows into a list of columns.
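A short sketch of both directions (sample sequences assumed):

```python
seq1 = ['foo', 'bar', 'baz']
seq2 = ['one', 'two', 'three']
pairs = list(zip(seq1, seq2))  # pair up elements position by position
print(pairs)

# "unzipping": zip(*...) converts a list of rows into a list of columns
first, second = zip(*pairs)
print(first)
print(second)
```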
● reversed ():
Reversed iterates over the elements of a sequence in reverse order.
Ex:
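A one-line sketch, assuming a simple range as the input sequence:

```python
# reversed() returns an iterator, so materialize it with list()
print(list(reversed(range(5))))
```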
● enumerate ():
enumerate returns a sequence of (index, value) tuples, which is useful for
building a mapping from values to their positions.
Syntax:
enumerate(iterable, start)
Ex:
In []: some_list = ['foo', 'bar', 'baz']
y = enumerate(some_list)
print(list(y))
Out []: [(0, 'foo'), (1, 'bar'), (2, 'baz')]
In []: mapping = {}
for i, v in enumerate(some_list):
    mapping[v] = i
mapping
Out []: {'foo': 0, 'bar': 1, 'baz': 2}
PRACTICAL – 5
Now about transforming the elements using List, Set and Dictionary
Comprehensions .
Firstly, transforming through lists: List comprehensions are one of the most-loved
Python language features. They allow you to concisely form a new list by filtering
the elements of a collection, transforming the elements passing the filter in one
concise expression. They take the basic form:
[expr for val in collection if condition]
This is equivalent to the following for loop:
result = []
for val in collection:
    if condition:
        result.append(expr)
The filter condition can be omitted, leaving only the expression. For example,
given a list of strings, we could filter out strings with length 2 or less and also
convert them to uppercase like this:
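A sketch of that filter-and-transform in one expression (sample strings assumed):

```python
strings = ['a', 'as', 'bat', 'car', 'dove', 'python']  # sample data (assumed)
# keep only strings longer than 2 characters, upper-cased
result = [x.upper() for x in strings if len(x) > 2]
print(result)
```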
Set and dict comprehensions are a natural extension, producing sets and dicts in
an idiomatically similar way instead of lists. A dict comprehension looks like this:
dict_comp = {key-expr: value-expr for value in collection if condition}
A set comprehension looks like the equivalent list comprehension except with
curly braces instead of square brackets:
set_comp = {expr for value in collection if condition}
Like list comprehensions, set and dict comprehensions are mostly conveniences,
but they similarly can make code both easier to write and read. Consider the list
of strings from before. Suppose we wanted a set containing just the lengths of the
strings contained in the collection; we could easily compute this using a set
comprehension:
In []: strings = ['a', 'as', 'bat', 'car', 'dove', 'python']
unique_lengths = {len(x) for x in strings}
unique_lengths
Out []: {1, 2, 3, 4, 6}
A dict comprehension can similarly build a lookup map of these strings to their
locations in the list:
In [4]: loc_mapping = {val: index for index, val in enumerate(strings)}
loc_mapping
Out [4]: {'a': 0, 'as': 1, 'bat': 2, 'car': 3, 'dove': 4, 'python': 5}
PRACTICAL – 6
Create a functional pattern to modify the strings at a
high level.
1. Stripping Whitespace
Strip leading whitespace with the lstrip() method (left), trailing whitespace with
rstrip() (right), and whitespace from both sides with strip().
Input:
s = '   This is a sentence with whitespace.   '
print('Strip leading whitespace: {}'.format(s.lstrip()))
Output:
Strip leading whitespace: This is a sentence with whitespace.
These same methods are also helpful for stripping unwanted characters, and
are used by passing in the character(s) you want stripped.
Input:
s = 'This is a sentence with unwanted characters.AAAAAAAA'
print('Strip unwanted characters: {}'.format(s.rstrip('A')))
Output:
Strip unwanted characters: This is a sentence with unwanted characters.
2. Splitting Strings
Splitting strings into lists of smaller substrings is often useful and easily
accomplished with the split() method, which splits on whitespace by default.
Input:
s = 'KDnuggets is a fantastic resource'
print(s.split())
Output:
['KDnuggets', 'is', 'a', 'fantastic', 'resource']
A separator character can be passed in as well.
Input:
s = 'these,words,are,separated,by,comma'
print('\',\' separated split -> {}'.format(s.split(',')))
s = 'abacbdebfgbhhgbabddba'
print('\'b\' separated split -> {}'.format(s.split('b')))
Output:
',' separated split -> ['these', 'words', 'are', 'separated', 'by', 'comma']
'b' separated split -> ['a', 'ac', 'de', 'fg', 'hhg', 'a', 'dd', 'a']
3. Joining Strings
Need the opposite of the above operation? You can join list element strings
into a single string with the join() method.
Input:
s = ['KDnuggets', 'is', 'a', 'fantastic', 'resource']
print(' '.join(s))
Output:
KDnuggets is a fantastic resource
4. Reversing a String
Python does not have a built-in string reverse method. However, given that
strings can be sliced like lists, reversing one can be done in the same succinct
fashion with slicing.
Input:
s = 'KDnuggets'
print('The reverse of {} is: {}'.format(s, s[::-1]))
Output:
The reverse of KDnuggets is: steggunDK
5. Changing Case
Converting between cases can be done with the upper(), lower(), and
swapcase() methods.
Input:
s = 'KDnuggets'
print('\'{}\' as uppercase: {}'.format(s, s.upper()))
print('\'{}\' as lowercase: {}'.format(s, s.lower()))
Output:
'KDnuggets' as uppercase: KDNUGGETS
'KDnuggets' as lowercase: kdnuggets
6. Checking for Membership
The easiest way to check for string membership in Python is using the in
operator.
Input:
s1 = 'perpendicular'
s2 = 'pen'
s3 = 'pep'
print('\'{}\' in \'{}\' -> {}'.format(s2, s1, s2 in s1))
Output:
'pen' in 'perpendicular' -> True
If you are more interested in finding the location of a substring within a string
(as opposed to simply checking whether or not the substring is contained), the
find() string method can help.
Input:
s = 'Does this string contain a substring?'
print('\'string\' location -> {}'.format(s.find('string')))
Output:
'string' location -> 10
find() returns the index of the first character of the first occurrence of the
substring by default, and returns -1 if the substring is not found. Check the
documentation for the optional start and end arguments.
7. Replacing Substrings
What if you want to replace substrings, instead of just finding them? The Python
replace() string method has you covered.
Input:
s2 = 'practice'
Output:
The new sentence: The practice of data science is of the utmost importance.
An optional count argument can specify the maximum number of successive
replacements to make.
8. Combining Strings
Have multiple lists of strings you want to combine together in some element-
wise fashion? The zip() function can help.
Input:
Output:
The capital of USA is Washington.
9. Checking for Anagrams
To check whether two strings are anagrams, count the occurrences of each
character in each string and check if these counts are equal. This is
straightforward using the Counter class.
Input :
s1 = 'listen'
s2 = 'silent'
s3 = 'runner'
s4 = 'neuron'
Output :
'listen' an anagram of 'silent' -> True
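The missing input can be sketched with collections.Counter (function name is illustrative; the sample strings mirror those above):

```python
from collections import Counter

def is_anagram(s1, s2):
    # anagrams have identical character counts
    return Counter(s1) == Counter(s2)

print(is_anagram('listen', 'silent'))
print(is_anagram('runner', 'neuron'))
```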
10. Checking for Palindromes
Algorithmically, we need to create a reverse of the word and then use the ==
operator to check if these 2 strings (the original and the reverse) are equal.
Input
def is_palindrome(s):
reverse = s[::-1]
if (s == reverse):
return True
return False
s1 = 'racecar'
s2 = 'hippopotamus'
print('\'{}\' is a palindrome -> {}'.format(s1, is_palindrome(s1)))
Output
'racecar' is a palindrome -> True
PRACTICAL – 7
Write a python program to cast a string to a floating-point number
but fails with value error on improper inputs using errors and
exception handling.
In [2]: float('something')
Suppose we wanted a version of float that fails gracefully, returning the input
argument. We can do this by writing a function that encloses the call to float in a
try/except block:
def attempt_float(x):
try:
return float(x)
except:
return x
The code in the except part of the block will only be executed if float(x) raises an
exception:
In [3]: attempt_float('1.2345')
In [4]: attempt_float('something')
You might notice that float can raise exceptions other than ValueError:
In [5]: float((1,2))
def attempt_float(x):
try:
return float(x)
except ValueError:
return x
But with only ValueError caught, a non-string input still raises an uncaught
TypeError. To handle both cases, catch a tuple of exception types:
In [6]: attempt_float((1,2))
def attempt_float(x):
try:
return float(x)
except (TypeError,ValueError):
return x
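Putting the final version together with a few sample calls:

```python
def attempt_float(x):
    # fall back to returning the input unchanged when float() cannot convert it
    try:
        return float(x)
    except (TypeError, ValueError):
        return x

print(attempt_float('1.2345'))
print(attempt_float('something'))
print(attempt_float((1, 2)))
```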
PRACTICAL – 8
AIM: Create an ndarray object and use operations on it.
➢ NumPy is a Python package whose name means 'Numerical Python'. It is the
library for scientific computing, which contains a powerful n-dimensional array
object.
Creating nd arrays:
1d array:
In [ 1 ] : import numpy as np
d1 = [6, 7.5, 8, 0, 1]
In [ 2 ] : arr1 = np.array ( d1 )
In [ 3 ] : arr1
2d array:
In [ 4 ] : d2 = [ [1, 2, 3, 4] , [5, 6, 7, 8] ]
In [ 5 ] : arr2 = np.array ( d2 )
In [ 6 ] : arr2
Operations on arrays:
In [ 7 ] : arr2.ndim
Out [ 7 ] : 2
In [ 8 ] : arr2.shape
Out [ 8 ] : (2,4)
In [ 9 ] : arr1.dtype
Out [ 9 ] : dtype('float64')
In [ 10 ] : np.zeros ( 3 )
In [ 12 ] : np.zeros ( (3,3) )
In [ 14 ] : np.arange ( 10 )
Out [ 14 ] : array ( [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ] )
PRACTICAL – 9
Use arithmetic operations on Numpy Arrays
Create a NumPy ndarray Object
Example:
import numpy as np
X = np.array([12,34,56,86,34])
print(X)
Output: [12 34 56 86 34]
Arithmetic operations on NumPy arrays are element-wise: the "+" symbol adds
two arrays, the "-" symbol subtracts them, and the "*" symbol multiplies them.
Division “/” symbol is used to divide two arrays. It will generate an warning
message when any element is divided by zero but not an error.
Example:
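Since the Example: blocks above are blank, here is a runnable sketch of the four operations (array values assumed):

```python
import numpy as np

# sample arrays (values assumed)
a = np.array([10, 20, 30, 40])
b = np.array([1, 2, 3, 4])

print(a + b)   # element-wise addition
print(a - b)   # element-wise subtraction
print(a * b)   # element-wise multiplication
print(a / b)   # element-wise division (dividing by zero gives a warning, not an error)
```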
PRACTICAL - 10
Using Numpy arrays perform Indexing and Slicing
Boolean Indexing, FancyIndexing operations.
Array : An array is a container which can hold a fixed number of items, and
these items should be of the same type.
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr[1]) # here we got second element from array
output : 2
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('2nd element on 1st row: ', arr[0, 1])
output: 2
Indexing for 3-D array
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(arr[0, 1, 2])
output: 6
Negative Indexing :
Use negative indexing to access an array from the end with -1,-2………
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('Last element from 2nd dim: ', arr[1, -1])
output :10
Slicing arrays :
Slicing means taking elements from one given index to another given
index.
SLICING FOR 1-D ARRAY : Slice elements from the following array:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
> print(arr[4:])
output:[5 6 7]
> print(arr[:4])
output:[1 2 3 4]
> print(arr[-3:-1])
output:[5 6]
import numpy as np
A = np.array([4, 7, 3, 4, 2, 8])
print(A == 4)
OUTPUT: [ True False False True False False]
print(A < 5)
OUTPUT: [ True False True True True False]
B = np.array([[42, 50, 60, 70], [45, 55, 65, 41], [50, 43, 30, 20]]) # 2-D array (values assumed to match the output below)
print(B>=42)
OUTPUT:[[ True True True True]
[ True True True False]
[ True True False False]]
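The boolean mask itself can then be used to select elements:

```python
import numpy as np

A = np.array([4, 7, 3, 4, 2, 8])
mask = A < 5          # a boolean array of the same shape
print(A[mask])        # keeps only the elements where the mask is True
```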
Fancy indexing allows you to select entire rows or columns, in any order, by
passing a list of indices. To show this, set up an array in which each row i
holds the value i:
Input: import numpy as np
s = np.zeros((10, 10))
# set up array: fill each row i with the value i
for i in range(10):
    s[i] = i
> f = s.shape[0] # here 0 indicates the number of rows
f
output:10
# select rows 2 and 4
Input: s[[2,4]]
Output: array([[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
[4., 4., 4., 4., 4., 4., 4., 4., 4., 4.]])
# allows any order
Input: s[[5,2,1]]
Output: array([[5., 5., 5., 5., 5., 5., 5., 5., 5., 5.],
[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
PRACTICAL - 11
Aim:- create an image plot from a two-dimensional array of function values.
Procedure:-
BAR PLOT - A bar plot shows categorical data as rectangular bars with the height
of bars proportional to the value they represent.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# sample data (values assumed)
data = {'Apple': 10, 'Banana': 15, 'Mango': 12, 'Orange': 8}
names = list(data.keys())
values= list(data.values())
fig = plt.figure(figsize=(10,5))
plt.barh(names,values,color="orange")
plt.xlabel("Quantity") # barh puts the values on the x-axis
plt.ylabel("Fruits")
plt.show()
Output :-
BOX PLOT :-
In[2] :
import matplotlib.pyplot as plt
%matplotlib inline
#Data Prep
total = [20,4,1,30,20,10,20,70,30,10]
order = [10,3,1,15,17,2,30,44,2,1]
discount = [30,10,20,5,10,20,50,60,20,45]
data = [total, order, discount]
plt.boxplot(data, showmeans=True)
plt.title("Box Plot")
plt.grid(True)
plt.show()
HISTOGRAM :- A histogram groups values into bins and shows how many values
fall into each bin.
import matplotlib.pyplot as plt
%matplotlib inline
numbers = [10,12,13,45,67,23,45,89,12,45,90,32]
plt.hist(numbers, bins=5)
plt.ylabel("Frequencies")
plt.legend(['line1'],loc="best")
plt.show()
PRACTICAL – 12
Use the basic statistical methods available for NumPy arrays:
● sum
● mean
● std
● var
● min
● max
● argmin
● argmax
● cumsum
● cumprod
Input: import numpy as np
a=np.array([[5,6,1],[2,5,7],[3,6,5]])
1.sum( ):
Input: np.sum(a)
output:40
2.mean( ):
Input: np.mean(a)
output:4.444444444444445
3.std( ):
Input: np.std(a)
output:1.8921540406584886
4.var( ):
Input: np.var(a)
output:3.580246913580247
5.min( ):
Input: np.min(a)
output:1
6.max( ):
Input: np.max(a)
output:7
7.argmin( ):
Input: np.argmin(a)
output:2
8.argmax( ):
Input: np.argmax(a)
output:5
9.cumsum( ):
Input: np.cumsum(a)
output: array([ 5, 11, 12, 14, 19, 26, 29, 35, 40], dtype=int32)
10.cumprod( ):
1-D array:
Input: import numpy as np
b=np.array([2,5,6,3])
np.cumprod(b)
output: array([  2,  10,  60, 180], dtype=int32)
2-D array:
Input: np.cumprod(a)
PRACTICAL - 13
13. To implement numpy.random functions are used to generate random data in
Python.
• randint()
• choice()
• shuffle()
• uniform()
• random()
• rand()
• permutation()
➢ np.random.rand()
Input:
import numpy as np
x= np.random.rand(3,2)
# a 3x2 array of random floats in the interval [0, 1)
print(x)
➢ np.random.randint()
Input:
np.random.randint(5, size=(2,4))
➢ np.random.choice()
Input: np.random.choice(5,2)
➢ np.random.shuffle()
Input:
arr=np.arange(9).reshape((3,3))
np.random.shuffle(arr)
arr
Output: the rows of arr in a randomly shuffled order (np.random.shuffle works in place along the first axis)
➢ np.random.uniform()
Input: s=np.random.uniform(-1,0,10)
# any value within the given interval is equally likely to be drawn by uniform s
Output : array([-0.38279886, -0.74777372, -0.24510703, -0.88106394, -
0.39874888, -0.12382905, -0.70945686, -0.12948132, -0.60052525, -0.64751423])
Input: import numpy as np
x=np.random.uniform(size=(2,3))
x
➢ np.random.seed()
Input: np.random.seed(7)
print(np.random.random())
Output: 0.07630828937395717
➢ np.random.permutation()
arr =np.arange(16).reshape((4,4))
# Randomly permute a sequence, or return a permuted range.
np.random.permutation(arr)
➢ np.random.random()
x= np.random.random(5)
# it is randomly selected
print(x)
PRACTICAL - 14
1. Plot the values of first 100 values on the values obtained from
random walks.
Aim : To plot first 100 values on the values obtained from random walks.
import numpy as np
import matplotlib.pyplot as plt #importing matplotlib.pyplot
import random
rand = np.random.randn(100)
rand
rand_plot= random.sample(range(1,300),100)
plt.plot(rand_plot)
plt.title("Random_Number_Plotting")
plt.show()
rand_gen = random.sample(range(1,300),100)
plt.plot(rand_gen)
plt.title("Random_number_plotting")
plt.show()
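Strictly speaking, a random walk is the cumulative sum of random steps; a sketch of that construction (seed and step values assumed):

```python
import numpy as np

np.random.seed(0)                            # fixed seed for reproducibility
steps = np.random.choice([-1, 1], size=100)  # one +1/-1 step per time point
walk = steps.cumsum()                        # the walk is the running sum of the steps

print(walk[:10])
# in a notebook: plt.plot(walk); plt.title("Random_Walk"); plt.show()
```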
PRACTICAL – 15
Create a data frame using pandas and retrieve the rows and
columns in it by performing some indexing options and
transpose it.
AIM: To create a DATAFRAME and retrieving rows and columns from it by using
some indexing options and transposing it.
CREATING A DATAFRAME:
# importing pandas library
import pandas as pd
# sample data (values assumed)
data = {'Name': ['Alisa', 'Bobby', 'Rocky'], 'Age': [25, 26, 25]}
data
df=pd.DataFrame(data)
df
OUTPUT:
RETRIEVING ROWS AND COLUMNS USING INDEXING
OPTIONS:
We can use different ways to retrieve data from data frame by using indexing
options. Indexing in pandas means simply selecting particular rows and columns of
data from a Data Frame. Indexing can also be known as Subset Selection.
The .loc and .iloc indexers also use the indexing operator to make selections.
row = df.loc[0] # retrieve a row by index label
row
OUTPUT:
column = df['Name'] # retrieve a column by name
column
OUTPUT:
######## TRANSPOSE DATAFRAME ##############
df_transpose = df.T
df_transpose
OUTPUT:
RESULTS:
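A self-contained sketch of these indexing options and the transpose (data assumed):

```python
import pandas as pd

# sample data (assumed)
df = pd.DataFrame({'Name': ['Alisa', 'Bobby', 'Rocky'],
                   'Age': [25, 26, 25]})

row = df.loc[0]        # a row, selected by index label
column = df['Name']    # a column, selected by name
cell = df.iloc[1, 1]   # a single value, selected by integer position

df_transpose = df.T    # rows become columns and vice versa
print(df_transpose.shape)
```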
PRACTICAL – 16
## creation of dataframe
In[1] :import pandas as pd
import numpy as np
d={'Name':pd.Series(['Alisa','Bobby','Rocky']),'Age':pd.Series([25,26,25]),'Rating':
pd.Series([4.23,3.24,3.98])}
In[2] :df=pd.DataFrame(d)
print(df)
Output:
In[3] :print(df.sum())
Name AlisaBobbyRocky
Age 76
Rating 11.45
dtype:object
In[4] :print(df.sum(1))
Out[4]: 0 29.23
1 29.24
2 28.98
dtype:float64
In[5] :print(df.mean())
Out[5]:Age 25.333333
Rating 3.816667
dtype:float64
In[6] :print(df.std())
Out[6]:Age 0.577350
Rating 0.514814
dtype:float64
In[7] :print(df.count())
Out[7]:Name 3
Age 3
Rating 3
dtype:int64
In[8] :print(df.min())
Out[8]: Name Alisa
Age 25
Rating 3.24
dtype:object
In[9] :print(df.max())
Out[9]:Name Rocky
Age 26
Rating 4.23
dtype:object
In[10] :print(df.median())
Out[10]:Age 25.00
Rating 3.98
dtype:float64
In[11] :print(df.mode())
In[12] :print(df.describe())
Age Rating
In[13] :df.sort_values(by=['Rating'])
1 Bobby 26 3.24
2 Rocky 25 3.98
0 Alisa 25 4.23
PRACTICAL – 17
File Formats: Reading and Writing Data in Text Format
All the powerful data structures like the Series and the DataFrames would avail to nothing, if the Pandas
module didn't provide powerful functionalities for reading in and writing out data. It is not only a
matter of having functions for interacting with files. To be useful to data scientists it also needs functions
which support the most important data formats like
✔ CSV
✔ HTML
✔ XML
✔ JSON
For reading CSV files, Pandas has historically offered two functions:
✔ DataFrame.from_csv
✔ read_csv
There is no big difference between those two functions, e.g. they have different default values in some
cases and read_csv has more parameters. We will focus on read_csv, because DataFrame.from_csv is
kept inside Pandas for reasons of backwards compatibility.
import pandas as pd
# "exchange_rates.csv" is a placeholder file name
exchange_rates = pd.read_csv("exchange_rates.csv")
print(exchange_rates)
Pandas is a very powerful and popular framework for data analysis and manipulation. One of the most
striking features of Pandas is its ability to read and write various types of files including CSV and Excel.
You can effectively and easily manipulate CSV files in Pandas using functions like read_csv() and to_csv()
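A round-trip sketch with to_csv() and read_csv() (the file name and data are assumed):

```python
import pandas as pd

df = pd.DataFrame({'currency': ['USD', 'EUR'], 'rate': [1.0, 1.08]})
df.to_csv('rates.csv', index=False)   # write, omitting the row index

back = pd.read_csv('rates.csv')       # read the file back into a dataframe
print(back)
```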
Generally, there will be multiple record types in the file, all of which share a common header format.
For example, binary data from a car’s computer might have one record type for driver controls such as
the brake pedal and steering wheel positions, and another type to record engine statistics such as fuel
consumption and temperature.
PRACTICAL -18
Implement the data Cleaning and Filtering methods
Input : df.dropna(inplace=True)
print(df.isnull().sum())
output : A 0
B 0
C 0
Input : df
Output : A B C
0 45 56 19.0
1 30.0 NaN 25.0
Input : df['C']=df['C'].fillna("0")
df.isnull().sum()
output : A 0
B 0
C 0
dtype :int64
● The map( ) function executes a specified function for each item in an iterable. The item
is sent to the function as a parameter.
● Python’s map( ) is a built-in function that allows you to process and transform all the
items in iterable without using an explicit for loop, this technique commonly known as
mapping.
Syntax :
map(function, iterables)
Parameter | Description
function  | The function to execute for each item.
iterable  | A sequence, collection or an iterator object. You can send as many
            iterables as you like; just make sure the function has one parameter for
            each iterable.
● We use the for loop and the map function separately, to understand how the map()
function works.
num = [2, 3, 4, 5] # sample values (assumed)
mul = []
for n in num:
    mul.append(n ** 2)
print(mul)
def square(i):
    return i * i
result = map(square, num)
print(result)
mul_output = list(result)
print(mul_output)
As you can see, the map() function iterates through the iterable, just like the for loop. Once the
iteration is complete, it returns the map object. You can then convert the map object to a list and
print it.
● In this code, you will take a tuple containing some string values. You will then define
a function to convert the strings to uppercase. Lastly, you must use map in
Python on the tuple and the function, to convert the string values to
uppercase.
In[]: tup = ('python', 'data', 'analysis') # sample values (assumed)
def letter(s):
    return s.upper()
upd_tup = map(letter, tup)
print(upd_tup)
print(tuple(upd_tup))
In[]: data = ["Data science", "Python", "map"] # sample strings (assumed, matching the output)
x = list(map(len, data))
print(x)
Output: [12, 6, 3]
In the above code, we use Python's len( ) function along with map( ) to find the length of
some words.
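map() also accepts multiple iterables, passing one argument per iterable (sample lists assumed):

```python
a = [1, 2, 3]
b = [10, 20, 30]
# the lambda takes one parameter per iterable; map stops at the shortest iterable
sums = list(map(lambda x, y: x + y, a, b))
print(sums)
```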
PRACTICAL - 20
Note - though pandas once offered a Panel data structure for multiple dimensions,
hierarchical indexing is the more familiar way to work with higher-dimensional data.
pop
Output:
In[4] df
Output:
In[5] df.unstack()
Output:
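Since the pop and df transcripts above are missing, here is a minimal sketch of a hierarchically indexed Series and unstack() (the state names and population figures are placeholders):

```python
import pandas as pd

# population figures are placeholders
index = pd.MultiIndex.from_tuples([('California', 2000), ('California', 2010),
                                   ('Texas', 2000), ('Texas', 2010)])
pop = pd.Series([33871648, 37253956, 20851820, 25145561], index=index)

df = pop.unstack()   # the inner index level becomes the columns
print(df)
```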
PRACTICAL - 21
# creation of DataFrame
import pandas as pd
import numpy as np
d = {'Name':pd.Series(['Alisa','Bobby','Cathrine','Madonna','Rocky','Sebastian','Jaqluine',
'Rahul','David','Andrew','Ajay','Teresa']),
'Age':pd.Series([26,27,25,24,31,27,25,33,42,32,51,47]),
'Score':pd.Series([89,87,67,55,47,72,76,79,44,92,99,69])}
#Create a DataFrame
df = pd.DataFrame(d)
print(df)
print(df.describe(include=['object']))
print(df.describe(include='all'))
PRACTICAL - 22
22. Use different Join types with arguments and merge data
with keys and multiple keys.
The pandas module contains various features to perform
various operations on join two dataframes. There are mainly
five types of Joins in Pandas:
> Inner Join
> Left Outer Join
> Right Outer Join
> Full Outer Join or simply Outer Join
> Index Join
## To perform a join operation we should create 2 data frames
first.
In[1]: import pandas as pd
## creation of DataFrame
In[2]: wheat_2020 = [("Punjab",125.84),
("Madhya Pradesh",113.38),
("Haryana",70.65),
("Uttar Pradesh",20.39),
("Rajasthan",10.63),
("Uttarakhand",0.31),
("Gujarat",0.21),
("Chandigarh",0.12)]
pulses_2020 = [("Punjab",29.20),
("Rajasthan",4497.13),
("Uttar Pradesh",2447.32),
("Uttarakhand",57.79), ("Sikkim",5.04), ("Tamil Nadu",605.41),
("Telangana",549.18), ("Tripura",18.67)]
In[3]: wheat_2020 = pd.DataFrame(wheat_2020,
columns = ["States", "Production(In Tons)"])
In[4]: pulses_2020 = pd.DataFrame(pulses_2020,
columns = ["States","Production(In Tons)"])
In[5]: wheat_2020
Out[5]:    States          Production(In Tons)
        0  Punjab          125.84
        1  Madhya Pradesh  113.38
        2  Haryana         70.65
        3  Uttar Pradesh   20.39
        4  Rajasthan       10.63
        5  Uttarakhand     0.31
        6  Gujarat         0.21
        7  Chandigarh      0.12
In[6]: pulses_2020
Out[6]:    States         Production(In Tons)
        0  Punjab         29.20
        1  Rajasthan      4497.13
        2  Uttar Pradesh  2447.32
        3  Uttarakhand    57.79
        4  Sikkim         5.04
        5  Tamil Nadu     605.41
        6  Telangana      549.18
        7  Tripura        18.67
## Inner Join: Inner join is the most common type of join you’ll
be working with. It returns a dataframe with only those rows
that have common characteristics. This is similar to the
intersection of two sets.
In[7]: df_inner = pd.merge(wheat_2020, pulses_2020,
on='States', how='inner')
df_inner
out[7]:
   States         Production(In Tons)_x  Production(In Tons)_y
0  Punjab         125.84                 29.20
1  Uttar Pradesh  20.39                  2447.32
2  Rajasthan      10.63                  4497.13
3  Uttarakhand    0.31                   57.79
## Left Outer Join: With a left outer join, all the records from
the first dataframe will be displayed, irrespective of whether the
keys in the first dataframe can be found in the second dataframe.
Whereas, for the second dataframe, only the records with the
keys in the second dataframe that can be found in the first
dataframe will be displayed.
In[8]: df_left = pd.merge(wheat_2020, pulses_2020,
on='States', how='left')
df_left
out[8]:
   States          Production(In Tons)_x  Production(In Tons)_y
0  Punjab          125.84                 29.20
1  Madhya Pradesh  113.38                 NaN
2  Haryana         70.65                  NaN
3  Uttar Pradesh   20.39                  2447.32
4  Rajasthan       10.63                  4497.13
5  Uttarakhand     0.31                   57.79
6  Gujarat         0.21                   NaN
7  Chandigarh      0.12                   NaN
## Right Outer Join: For a right join, all the records from the
second dataframe will be displayed. However, only the records
with the keys in the first dataframe that can be found in the
second dataframe will be displayed.
In[9]: df_right = pd.merge(wheat_2020, pulses_2020,
on='States', how='right')
df_right
out[9]:
   States         Production(In Tons)_x  Production(In Tons)_y
0  Punjab         125.84                 29.20
1  Rajasthan      10.63                  4497.13
2  Uttar Pradesh  20.39                  2447.32
3  Uttarakhand    0.31                   57.79
4  Sikkim         NaN                    5.04
5  Tamil Nadu     NaN                    605.41
6  Telangana      NaN                    549.18
7  Tripura        NaN                    18.67
## Full Outer Join: A full outer join returns all the rows from the
left dataframe, all the rows from the right dataframe, and
matches up rows where possible, with NaNs elsewhere. But if
the dataframe is complete, then we get the same output.
In[10]: df_outer = pd.merge(wheat_2020, pulses_2020,
on='States', how='outer')
df_outer
out[10]:
    States          Production(In Tons)_x  Production(In Tons)_y
0   Punjab          125.84                 29.20
1   Madhya Pradesh  113.38                 NaN
2   Haryana         70.65                  NaN
3   Uttar Pradesh   20.39                  2447.32
4   Rajasthan       10.63                  4497.13
5   Uttarakhand     0.31                   57.79
6   Gujarat         0.21                   NaN
7   Chandigarh      0.12                   NaN
8   Sikkim          NaN                    5.04
9   Tamil Nadu      NaN                    605.41
10  Telangana       NaN                    549.18
11  Tripura         NaN                    18.67
## Index Join: An index join merges the two dataframes on their row index,
pairing up rows by position rather than by key values.
In[11]: df_index = pd.merge(wheat_2020, pulses_2020,
right_index = True, left_index = True)
df_index
out[11]:
   States_x        Production(In Tons)_x  States_y       Production(In Tons)_y
0  Punjab          125.84                 Punjab         29.20
1  Madhya Pradesh  113.38                 Rajasthan      4497.13
2  Haryana         70.65                  Uttar Pradesh  2447.32
3  Uttar Pradesh   20.39                  Uttarakhand    57.79
4  Rajasthan       10.63                  Sikkim         5.04
5  Uttarakhand     0.31                   Tamil Nadu     605.41
6  Gujarat         0.21                   Telangana      549.18
7  Chandigarh      0.12                   Tripura        18.67
In[12]: lis = [6, 4, 4, 3, 2, 2, 1, 2]
In[13]: wheat_2020["Ranks"] = lis
In[14]: wheat_2020
Out[14]:
   States          Production(In Tons)  Ranks
0  Punjab          125.84               6
1  Madhya Pradesh  113.38               4
2  Haryana         70.65                4
3  Uttar Pradesh   20.39                3
4  Rajasthan       10.63                2
5  Uttarakhand     0.31                 2
6  Gujarat         0.21                 1
7  Chandigarh      0.12                 2
In[15]: lis2 = [6, 2, 1, 6, 5, 3, 2, 5]
In[16]: pulses_2020["Ranks"] = lis2
In[17]: pulses_2020
Out[17]:
   States         Production(In Tons)  Ranks
0  Punjab         29.20                6
1  Rajasthan      4497.13              2
2  Uttar Pradesh  2447.32              1
3  Uttarakhand    57.79                6
4  Sikkim         5.04                 5
5  Tamil Nadu     605.41               3
6  Telangana      549.18               2
7  Tripura        18.67                5
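The practical's title also asks for merging on multiple keys; a small sketch, passing a list to on= (the frames and values here are assumed, not the ones above):

```python
import pandas as pd

# small example frames (values assumed)
left = pd.DataFrame({'States': ['Punjab', 'Rajasthan', 'Punjab'],
                     'Year': [2019, 2020, 2020],
                     'Wheat': [120.00, 10.63, 125.84]})
right = pd.DataFrame({'States': ['Punjab', 'Rajasthan', 'Punjab'],
                      'Year': [2020, 2020, 2019],
                      'Pulses': [29.20, 4497.13, 28.00]})

# rows pair up only when BOTH States and Year agree
both = pd.merge(left, right, on=['States', 'Year'], how='inner')
print(both)
```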
LAB MANUAL
NUMPY
1 create an empty NumPy array
Ans.
import numpy as np
emptyarray = np.empty((3, 4)) # uninitialized values; shape assumed
print("Empty Array")
print(emptyarray)
2. Create an array filled with a given value
Ans.
import numpy as np
fullarray = np.full((3, 3), 5) # shape and fill value assumed
print(fullarray)
3. Test whether a given 2-D array contains a specified row
Ans.
import numpy as np
array = np.array([[1,2,3,4,5],
[6, 7, 8, 9, 10],
[11, 12, 13, 14, 15]]) # third row assumed
print(array)
print([1, 2, 3, 4, 5] in array.tolist())
4. Write a NumPy program to convert a list of numeric value into a one-dimensional NumPy
array.
Ans.
import numpy as np
l = [12.23, 13.32, 100, 36.32] # sample list (assumed)
print("Original List:",l)
a = np.array(l)
print("One-dimensional NumPy array:",a)
5. Write a NumPy program to create an array with values ranging from 12 to 38.
Ans.
import numpy as np
x = np.arange(12, 38)
print(x)
6. Write a NumPy program to reverse an array.
Ans.
import numpy as np
x = np.arange(12, 38)
print("Original array:")
print(x)
print("Reverse array:")
x = x[::-1]
print(x)
7. Write a NumPy program to convert a list and a tuple into arrays.
Ans.
import numpy as np
my_list = [1, 2, 3, 4, 5, 6, 7, 8]
print(np.asarray(my_list))
my_tuple = ([8, 4, 6], [1, 2, 3]) # sample tuple (assumed)
print(np.asarray(my_tuple))
Ans.
import numpy as np
x = np.empty((3,4))
print(x)
y = np.full((3,3),6)
print(y)
9. Write a NumPy program to test whether each element of a 1-D array is also present in a
second array.
Ans.
import numpy as np
array1 = np.array([0, 10, 20, 40, 60]) # sample arrays (assumed)
array2 = [10, 30, 40]
print("Array1: ",array1)
print("Array2: ",array2)
print(np.in1d(array1, array2))
10. Write a NumPy program to find the unique elements of an array.
Ans.
import numpy as np
x = np.array([10, 10, 20, 20, 30, 30]) # sample array (assumed)
print("Original array:")
print(x)
print("Unique elements:")
print(np.unique(x))
11. Write a NumPy program to change the shape of an array.
Ans.
import numpy as np
x = np.array([1, 2, 3, 4, 5, 6])
print(x.shape)
x = np.array([1,2,3,4,5,6,7,8,9])
x.shape = (3, 3)
print(x)
12. Write a NumPy program to create a new shape to an array without changing its data.
Ans.
import numpy as np
x = np.array([1, 2, 3, 4, 5, 6])
y = np.reshape(x,(3,2))
print("Reshape 3x2:")
print(y)
z = np.reshape(x,(2,3))
print("Reshape 2x3:")
print(z)
13. Write a NumPy program to create a 1-D array going from 0 to 50 and an array from 10 to 50.
Ans.
import numpy as np
x = np.arange(50)
print(x)
x = np.arange(10, 50)
print(x)
14. Write a NumPy program to collapse a 3-D array into one dimension array.
Ans.
import numpy as np
x = np.eye(3)
print("3-D array:")
print(x)
f = np.ravel(x, order='F')
print(f)
15. Write a NumPy program to concatenate two 2-D arrays.
Ans.
import numpy as np
a = np.array([[0, 1, 3], [5, 7, 9]]) # sample arrays (assumed)
b = np.array([[0, 2, 4], [6, 8, 10]])
c = np.concatenate((a, b), 1)
print(c)
16. Write a NumPy program to create an array of (3, 4) shape, multiply every element value by 3
and display the new array.
Ans.
import numpy as np
x= np.arange(12).reshape(3, 4)
print(x)
# iterate with write access and scale each element in place
for a in np.nditer(x, op_flags=['readwrite']):
    a[...] = 3 * a
print(x)
17. Write a NumPy program to create a record array from a (flat) list of arrays.
Ans.
import numpy as np
a1=np.array([1,2,3,4])
a2=np.array(['Red','Green','White','Orange'])
a3=np.array([12.20,15,20,40])
result= np.core.records.fromarrays([a1, a2, a3],names='a,b,c')
print(result[0])
print(result[1])
print(result[2])
PANDAS
1. Write the code in python to create a dataframe from a given list.
Ans.
import pandas as pd
L1 = ["Anil", "Ruby", "Raman", "Suman"]
L2 = [35, 56, 48, 85]
DF = pd.DataFrame([L1, L2])
print(DF)
2. Write a program to create dataframe “DF” from “data.csv”
Ans.
import pandas as pd
DF = pd.read_csv("data.csv")
print(DF)
3. Creating a Pandas dataframe using list of tuples
Ans.
import pandas as pd
data = [('Peter', 18, 7),
('Riff', 15, 6),
('John', 17, 8),
('Michel', 18, 7),
('Sheli', 17, 5) ]
df = pd.DataFrame(data, columns =['Name', 'Age', 'Score'])
print(df)
4. Create a Pandas Series from a list
Ans.
import pandas as pd
author = ['Jitender', 'Purnima', 'Arpit', 'Jyoti']
auth_series = pd.Series(author)
print(auth_series)
5. Reindexing the Rows using pandas
Ans.
# import numpy and pandas module
import pandas as pd
import numpy as np
column=['a','b','c','d','e']
index=['A','B','C','D','E']
df1 = pd.DataFrame(np.random.rand(5,5),
columns=column, index=index)
print(df1)
# reindex the rows in a new order (order assumed)
df2 = df1.reindex(['B', 'D', 'A', 'C', 'E'])
print(df2)
6. Iterating over the rows of a dataframe
Ans.
import pandas as pd
input_df = [{'name':'Sujeet', 'age':10},
{'name':'Sameer', 'age':11},
{'name':'Sumit', 'age':12}]
df = pd.DataFrame(input_df)
for index, row in df.iterrows():
print(row['name'], row['age'])
7. Append one dataframe to another
Ans.
import pandas as pd
df1 = pd.DataFrame({"a":[1, 2, 3, 4],
"b":[5, 6, 7, 8]})
df2 = pd.DataFrame({"a":[1, 2, 3],
"b":[5, 6, 7]})
df1.append(df2) # in newer pandas versions, use pd.concat([df1, df2]) instead
Ans.
import pandas as pd
data = {'name': ['Simon', 'Marsh', 'Gaurav', 'Alex', 'Selena'],
'Maths': [8, 5, 6, 9, 7]}
df = pd.DataFrame(data)
print(df)
11. Selecting all the rows from the given dataframe in which ‘Percentage’ is greater than 80 using
the basic method.
Ans.
import pandas as pd
record = {'Name': ['Ankit', 'Amit', 'Aishwarya', 'Priyanka'],
'Percentage': [88, 92, 75, 70]} # sample data (assumed)
df = pd.DataFrame(record)
rslt_df = df[df['Percentage'] > 80]
print(rslt_df)
Ans.
import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())
Ans.
import pandas as pd
info = {'ID' :[101, 102, 103],'Department' :['B.Sc','B.Tech','M.Tech',]}
info = pd.DataFrame(info)
print (info)
DATA CLEANING
1. Check for Missing Values
Ans-
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df['one'].isnull())
Ans-
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'], columns=['one',
'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print(df.fillna(0))
Ans-
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.dropna())
Ans-
import pandas as pd
import numpy as np
df = pd.DataFrame({'one':[10,20,30,40,50,2000],
'two':[1000,0,30,40,50,60]})
print(df.replace({1000:10,2000:60}))
Ans-
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({
})
print(df.isnull().sum())
Ans-
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({
})
print(df)
Ans-
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({
})
df = df.fillna(0)
print(df)
Ans-
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({
})
print(df)
Ans-
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({
'Name': ['Nik', 'Kate', 'Evan', 'Kyra', np.NaN],
})
print(df.duplicated())
Ans-
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({
})
print(df)
Ans-
import pandas as pd
df = pd.DataFrame.from_dict({
'Name': ['Tranter, Melvyn', 'Lana, Courtney', 'Abel, Shakti', 'Vasu, Imogene', 'Aravind,
Shelly'],
'Region': ['Region A', 'Region A', 'Region B', 'Region C', 'Region D'],
'Favorite Color': [' green ', 'red', ' yellow', 'blue', 'purple ']
})
print(df)
Ans-
import pandas as pd
df = pd.DataFrame.from_dict({
'Name': ['Tranter, Melvyn', 'Lana, Courtney', 'Abel, Shakti', 'Vasu, Imogene', 'Aravind, Shelly'],
'Region': ['Region A', 'Region A', 'Region B', 'Region C', 'Region D'],
'Favorite Color': [' green ', 'red', ' yellow', 'blue', 'purple ']
})
print(df['Name'].str.split(','))
Ans-
import pandas as pd
df = pd.DataFrame.from_dict({
'Name': ['Tranter, Melvyn', 'Lana, Courtney', 'Abel, Shakti', 'Vasu, Imogene', 'Aravind, Shelly'],
'Region': ['Region A', 'Region A', 'Region B', 'Region C', 'Region D'],
'Location': ['TORONTO', 'LONDON', 'New york', 'ATLANTA', 'toronto'],
'Favorite Color': [' green ', 'red', ' yellow', 'blue', 'purple ']
})
print(df)
Ans-
import pandas as pd
df = pd.DataFrame.from_dict({
'Name': ['Tranter, Melvyn', 'Lana, Courtney', 'Abel, Shakti', 'Vasu, Imogene', 'Aravind, Shelly'],
'Region': ['Region A', 'Region A', 'Region B', 'Region C', 'Region D'],
'Favorite Color': [' green ', 'red', ' yellow', 'blue', 'purple ']
})
print(df.isnull().sum() / len(df))
15. Drop any duplicate records based only on the Name column, keeping the last record.
Ans-
import pandas as pd
df = pd.DataFrame.from_dict({
'Name': ['Tranter, Melvyn', 'Lana, Courtney', 'Abel, Shakti', 'Vasu, Imogene', 'Aravind, Shelly'],
'Region': ['Region A', 'Region A', 'Region B', 'Region C', 'Region D'],
})
df = df.drop_duplicates(subset='Name', keep='last')
DATA MANIPULATION
1. Selecting Pandas DataFrame rows using logical operators
Ans-
import pandas as pd
# sample data (illustrative)
data = {'name': ['John', 'Mary', 'Sam'], 'marks': [65, 80, 72]}
df = pd.DataFrame(data)
print(df[df.name != "John"])
# OR: combine conditions, e.g. name is not John OR marks greater than 70
print(df[(df.name != "John") | (df.marks > 70)])
Ans-
import pandas as pd
data = {'column1': [1, 2, 3, 4]}  # sample data (illustrative)
df = pd.DataFrame(data)
def double(x):
    return 2 * x
df.column1 = df.column1.apply(double)
# to apply a function across each row instead, pass axis=1 to DataFrame.apply
Ans-
import pandas as pd
data = {'oldColumn': [10, 20, 30, 40]}  # sample data (illustrative)
df = pd.DataFrame(data)
df['newColumn'] = [1, 2, 3, 4]          # from a list
df['newColumn'] = 1                      # broadcast a scalar
df['newColumn'] = df['oldColumn'] * 5    # derived from an existing column
4. Crosstab in pandas
Ans-
import pandas as pd
import numpy as np
a = np.array(["foo", "foo", "foo", "foo", "bar", "bar",
              "bar", "bar", "foo", "foo", "foo"], dtype=object)
b = np.array(["one", "one", "one", "two", "one", "one",
              "one", "two", "two", "two", "one"], dtype=object)
c = np.array(["dull", "dull", "shiny", "dull", "dull", "shiny",
              "shiny", "dull", "shiny", "shiny", "shiny"], dtype=object)
print(pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c']))
Ans-
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
{'id':[1,2,3,4,5],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print(pd.merge(left,right,on='id'))
Ans-
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
{'id':[1,2,3,4,5],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print(pd.merge(left, right, on='subject_id'))
Ans-
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'subject_id':['sub7','sub8','sub4','sub6','sub5']})
print(left.sort_values(by='subject_id'))
Ans-
import pandas as pd
# (reconstructed sketch; the original data was lost) merge two mark sheets on a common key
marks = pd.DataFrame({'id': [1, 2, 3], 'marks': [85, 72, 90]})
marks_2 = pd.DataFrame({'id': [1, 2, 3], 'marks_2': [78, 88, 64]})
data_merged = pd.merge(marks, marks_2, on='id')
print(data_merged)
Ans-
import pandas as pd
left = pd.DataFrame({
'iD':[1,2,3,None,None],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub7','sub8','sub4','sub6','sub5']})
pd.crosstab(left["Name"],left["subject_id"],margins=True)
Ans-
import pandas as pd
left = pd.DataFrame({
'iD':[1,2,3,None,None],
'subject_id':['sub7','sub8','sub4','sub6','sub5']})
print(left.iloc[0])               # select a row by integer position
print(left.loc[:, 'subject_id'])  # select a column by label
SEABORN
1. Install Seaborn:
pip install seaborn
2. Import Matplotlib:
import matplotlib.pyplot as plt
3. Import Seaborn:
import seaborn as sns
sns.histplot([0, 1, 2, 3, 4, 5])  # distplot() is deprecated; histplot() replaces it
plt.show()
6. Barplot:
res = sns.barplot(x=mtcars['cyl'], y=mtcars['carb'])
plt.show()
7. Countplot:
sns.countplot(x='cyl', data=mtcars, palette='Set1')
8. Distribution Plot:
9. Heatmap:
sns.heatmap(mtcars.corr(), cbar=True, linewidths=0.5)
12. remove the top and right axis spines using the despine() function.
from matplotlib import pyplot as plt
import seaborn as sns
df = sns.load_dataset("car_crashes")  # dataset with 'speeding' and 'alcohol' columns
plt.scatter(df.speeding, df.alcohol)
sns.set_style("ticks")
sns.despine()
plt.show()
2. Which library would you prefer for plotting in Python language: Seaborn or Matplotlib?
Seaborn and Matplotlib are two of Python's most powerful visualization libraries. Seaborn
requires less syntax and has stunning default themes, while Matplotlib is more easily
customizable through accessing its underlying classes.
3. What is the main difference between a Pandas series and a single-column DataFrame in
Python?
A Series is a one-dimensional labeled array that can hold integers, strings, floats and
more. A Series contains a single list with an index, whereas a DataFrame can be made up of
more than one Series; even a single-column DataFrame keeps a column axis.
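The distinction can be seen directly; a minimal sketch with illustrative values:

```python
import pandas as pd

s = pd.Series([10, 20, 30], name="marks")  # one-dimensional, labeled array
df = s.to_frame()                          # single-column DataFrame built from it

print(type(s).__name__)   # Series
print(type(df).__name__)  # DataFrame
print(s.shape)            # (3,)  - no column axis
print(df.shape)           # (3, 1) - rows x columns
```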
5. Why you should use NumPy arrays instead of nested Python lists?
NumPy arrays take significantly less memory than Python lists. They also provide a
mechanism for specifying the data type of the contents, which allows further
optimisation of the code.
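The memory difference can be checked directly; a small sketch comparing 1000 integers stored both ways:

```python
import sys
import numpy as np

py_list = list(range(1000))
arr = np.arange(1000, dtype=np.int64)

# the array stores 1000 raw 8-byte integers contiguously
print(arr.nbytes)  # 8000 bytes of data

# the list stores 1000 separate Python int objects plus pointers to them
list_size = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
print(list_size > arr.nbytes)  # True
```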
1. What is Numpy?
Ans: NumPy is a general-purpose array-processing package. It provides a high-
performance multidimensional array object and tools for working with these arrays. It is
the fundamental package for scientific computing with Python, offering a powerful
N-dimensional array object and sophisticated (broadcasting) functions.
2. Why NumPy is used in Python?
Ans: NumPy is a package in Python used for scientific computing. The NumPy package
is used to perform different operations. The ndarray (NumPy array) is a
multidimensional array used to store values of the same datatype. These arrays are
indexed just like sequences, starting with zero.
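Zero-based indexing on a multidimensional array, as a quick sketch:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

print(a[0, 0])  # 1 - first row, first column (indexing starts at zero)
print(a[1, 2])  # 6 - second row, third column
print(a[0])     # [1 2 3] - whole first row, like sequence indexing
```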
3. What does NumPy mean in Python?
Ans: NumPy (pronounced /ˈnʌmpaɪ/ (NUM-py) or sometimes /ˈnʌmpi/ (NUM-pee)) is a
library for the Python programming language, adding support for large, multi-dimensional
arrays and matrices, along with a large collection of high-level mathematical
functions to operate on these arrays.
5. Where is NumPy used?
Ans: NumPy is an open-source numerical Python library. NumPy contains multi-dimensional
array and matrix data structures. It can be utilised to perform a number of
mathematical operations on arrays, such as trigonometric, statistical and algebraic
routines. NumPy is an extension of Numeric and Numarray.
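A short sketch of the trigonometric, statistical and algebraic routines mentioned above (sample values only):

```python
import numpy as np

angles = np.array([0.0, np.pi / 2, np.pi])
print(np.sin(angles))    # trigonometric routine, element-wise: [0, 1, ~0]

data = np.array([2.0, 4.0, 6.0, 8.0])
print(np.mean(data))     # statistical routine -> 5.0
print(np.std(data))      # standard deviation

m = np.array([[1, 2], [3, 4]])
print(np.linalg.det(m))  # algebraic routine: determinant, approximately -2.0
```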
6. How to import numpy in python?
Ans: import numpy as np
7. What Is The Difference Between Numpy And Scipy?
NumPy stands for Numerical Python while SciPy stands for Scientific Python. Both
NumPy and SciPy are modules of Python, and they are used for various operations of the
data. Coming to NumPy first, it is used for efficient operation on homogeneous data that
are stored in arrays. In other words, it is used in the manipulation of numerical data.
8. List the advantages NumPy Arrays have over (nested) Python lists
Size - NumPy data structures take up less space.
Performance - they are faster than Python lists.
Functionality - SciPy and NumPy have optimized functions, such as linear algebra
operations, built in.
● The performance of NumPy is better than that of Pandas for 50K rows or less.
● The performance of Pandas is better than that of NumPy for 500K rows or more.
● Between 50K and 500K rows, performance depends on the kind of operation.
PANDAS
● Data Alignment
● Reshaping
● Time Series
5. What is the name of Pandas library tools used to create a scatter plot matrix?
scatter_matrix (available as pandas.plotting.scatter_matrix)
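A minimal sketch of scatter_matrix with illustrative data (the Agg backend lets it run without a display):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no window needed
import pandas as pd
from pandas.plotting import scatter_matrix

df = pd.DataFrame({"Maths": [8, 5, 6, 9, 7],
                   "Science": [7, 6, 8, 8, 9]})
axes = scatter_matrix(df, diagonal="hist")
print(axes.shape)  # (2, 2) - one panel per column pair
```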
6. Explain Categorical data in Pandas?
Categorical data is defined as a Pandas data type that corresponds to a categorical
variable in statistics. A categorical variable is generally used to take a limited and usually
fixed number of possible values. Examples: gender, country affiliation, blood type, social
class, observation time, or rating via Likert scales. All values of categorical data are either
in categories or np.nan.
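The blood-type example above can be sketched as a categorical Series:

```python
import pandas as pd

blood = pd.Series(["A", "B", "O", "A", "AB", "O"], dtype="category")
print(blood.dtype)                 # category
print(list(blood.cat.categories))  # ['A', 'AB', 'B', 'O'] - the fixed set of values
print(blood.cat.codes.tolist())    # integer codes backing each value
```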
7. How to iterate over a Pandas DataFrame?
You can iterate over the rows of the DataFrame by using a for loop in combination with an
iterrows() call on the DataFrame.
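A small self-contained sketch of that pattern (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Simon", "Marsh", "Gaurav"],
                   "age": [21, 22, 20]})

# iterrows() yields (index, row) pairs; each row is a Series
for index, row in df.iterrows():
    print(index, row["name"], row["age"])
```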
DATA CLEANING
1. What is the Difference between Data Mining and Data Analysis?
Data Mining:
● It refers to the process of identifying patterns in a pre-built database.
● Data mining is done on clean and well-documented data.
● The outcomes are not easy to interpret.
● It is mostly used for Machine Learning, where algorithms are used to recognize patterns.
Data Analysis:
● It is used to order and organize raw data in a meaningful manner.
● Data analysis involves cleaning the data; hence the data is not presented in a well-documented format.
● The outcomes are easy to interpret.
● It is used to gather insights from raw data, which has to be cleaned and organized before performing the analysis.
2. What is the Difference between Data Profiling and Data Mining?
Data Profiling: It refers to the process of analyzing individual attributes of data. It
primarily focuses on providing valuable information on data attributes such as data
type, frequency, length, occurrence of null values.
Data Mining: It refers to the analysis of data with respect to finding relations that
have not been discovered earlier. It mainly focuses on the detection of unusual
records, dependencies and cluster analysis.
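The attribute-level profiling checks described above (data type, null occurrence, length, frequency) can be sketched in pandas with illustrative data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"name": ["Amy", "Ben", None, "Dia"],
                   "age": [23, 25, 31, np.nan]})

print(df.dtypes)                   # data type of each attribute
print(df.isnull().sum())           # occurrence of null values per column
print(df["name"].str.len().max())  # maximum length of a string attribute
print(df["age"].value_counts())    # frequency of each value
```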
3. What is the Process of Data Analysis?
Data analysis is the process of collecting, cleansing, interpreting, transforming, and
modeling data to gather insights and generate reports that support business decisions.
The main steps involved in the process are:
Collect Data: The data is collected from various sources and stored to be cleaned and
prepared. In this step, all the missing values and outliers are removed.
Analyse Data: Once the data is ready, the next step is to analyze the data. A model is
run repeatedly for improvements. Then, the model is validated to check whether it
meets the business requirements.
Create Reports: Finally, the model is implemented, and then reports thus generated
are passed onto the stakeholders.
4. What is Data Wrangling or Data Cleansing/Cleaning?
Data Cleansing is the process of identifying and removing errors to enhance the
quality of data. We must check for the following things and correct where needed:
o For known variables, is the data type as expected? (For example, if age is in date format,
something is suspicious.)
If anything is suspicious, we can further investigate it and correct it accordingly.
What are Some of the Challenges You Have Faced during Data Analysis?
o Challenge in blending/integrating the data from multiple sources, particularly when
there are no consistent parameters and conventions.
5. What is VLOOKUP?
VLOOKUP stands for ‘Vertical Lookup’. It is a function that makes Excel search for
a certain value in a column (or the ‘table array’), in order to return a value from a
different column in the same row.
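VLOOKUP itself is an Excel function, but the same row-wise lookup can be sketched in pandas with a left merge (hypothetical data):

```python
import pandas as pd

orders = pd.DataFrame({"product_id": [101, 102, 101]})
products = pd.DataFrame({"product_id": [101, 102],
                         "price": [9.99, 4.50]})

# like VLOOKUP: find each product_id in the lookup table, return its price
looked_up = orders.merge(products, on="product_id", how="left")
print(looked_up["price"].tolist())  # [9.99, 4.5, 9.99]
```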
6. What is a Pivot Table, and What are the Different Sections of a Pivot Table?
A Pivot Table is used to summarise, sort, reorganize, group, count, total or average
data stored in a table. It allows us to transform columns into rows and rows into
columns. It allows grouping by any field (column) and using advanced calculations
on them.
o Values Area: The area where the summarized values themselves are displayed.
o Rows Area: The headings which are present on the left of the values.
o Column Area: The headings at the top of the values area make the columns area.
o Filter Area: This is an optional filter used to drill down in the data set.
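pandas offers the same idea through pivot_table; a minimal sketch with illustrative data:

```python
import pandas as pd

sales = pd.DataFrame({"region": ["A", "A", "B", "B"],
                      "year":   [2023, 2024, 2023, 2024],
                      "amount": [10, 20, 30, 40]})

# rows area = region, column area = year, values area = summed amount
pivot = pd.pivot_table(sales, index="region", columns="year",
                       values="amount", aggfunc="sum")
print(pivot)
```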
1. What is Seaborn?
Seaborn is an open-source Python data visualisation library built on Matplotlib that is
tightly integrated with pandas data structures. The core component of Seaborn is
visualisation, which aids in data exploration and understanding. Data can be represented as
plots, which are simple to study, explore, and interpret.
2. What is Matplotlib?
Matplotlib is a plotting library for Python that works with NumPy, Python's numerical
mathematics extension. It offers an object-oriented API for embedding plots into
applications using GUI toolkits such as Tkinter, wxPython, Qt, or GTK.
3. Does Seaborn need Matplotlib?
Seaborn's plots are drawn using Matplotlib behind the scenes: Seaborn is a library that
uses Matplotlib underneath to plot graphs, so importing Seaborn is usually enough.
Seaborn helps in resolving the two major issues faced by Matplotlib; the problems are:
● Default Matplotlib parameters
● Working with data frames
4. What is CMAP in Seaborn?
With the cmap option of the heatmap() method in Seaborn, you can adjust the colours of
your heatmap (for example, a sequential palette, which varies shades of a single colour).
The vmax and vmin parameters in the function can also be used to establish maximum and
minimum values for the colour bar on a Seaborn heatmap.
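A minimal headless sketch of cmap, vmin and vmax on a heatmap (illustrative grid):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no window needed
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

grid = np.array([[0.1, 0.5], [0.8, 1.2]])
# "Blues" is a sequential palette; values outside [vmin, vmax] are clipped on the bar
ax = sns.heatmap(grid, cmap="Blues", vmin=0.0, vmax=1.0, cbar=True)
plt.close("all")
```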
5. What is the Seaborn function for colouring plots?
color_palette() is a Seaborn function that can be used to give colours to plots and give
them additional artistic appeal.
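A quick sketch of color_palette():

```python
import seaborn as sns

palette = sns.color_palette("husl", 4)  # 4 evenly spaced hues
print(len(palette))  # 4
print(palette[0])    # an (r, g, b) tuple with values in [0, 1]
```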
6. What are Histograms in Seaborn?
Histograms show the distribution of data by constructing bins throughout the data's range
and then drawing bars to show how many observations fall into each bin.
7. How do you plot a histogram in Seaborn?
We can plot a histogram in Seaborn by using histplot(), optionally combined with a
density curve (kde=True).
8. How to change the legend font size of FacetGrid plot in Seaborn?
We can access the legend of the FacetGrid that sns.displot returns through
FacetGrid.legend, and then set its font size.
9. How do I make all of the lines in seaborn.lineplot black?
We can set the line colour to black, following the line-graph example in the official
reference. When everything is one colour, though, it becomes impossible to distinguish the
data by hue; for visualisation it may still be preferable to utilise a single colour tone.
10. How to colour the data points by a category, e.g. to assign colours to the 'regional indicators'?
The simplest solution is to choose the columns and use .melt to reshape them into a long
dataframe, then use sns.lmplot together with sns.regplot. hue can be used to define colours
based on region; however, this results in a separate regression line for each group rather
than one for all data points, so the overall regression line is not shown by .lmplot but can
be plotted separately on each axis with .regplot.
Seaborn is a high-level API for Matplotlib.
DATA MANIPULATIONS