DATA STRUCTURES
IN
PANDAS DATAFRAMES
Pandas DataFrame
A Data frame is a two-dimensional data structure,
i.e., data is aligned in a tabular fashion in rows
and columns.
Features of Data Frame
• Columns can be of different types
• Size of data frame is mutable i.e., the number of
rows and columns can be increased or
decreased
• Its data/values are also mutable and can be
changed any time
• Indexes may be of numbers, strings or letters
syntax :
pandas.DataFrame( data, index, columns, dtype,
copy)
data - takes various forms like ndarray, series, map,
lists, dict, constants and also another DataFrame.
index - used for row labels. Default np.arrange(n) if
no index is passed.
columns - used for column labels. This is only true if
no index is passed.
dtype - Data type of each column.
copy - used for copying of data, if the default is False.
Library
Python library is a collection of functions, methods
and huge number of modules that can be imported.
• import pandas as pd – Python library pandas is
added for working out with data structures in
pandas series.
• import numpy as np – Python library numpy is
added for creating numpy array and generating
random numbers.
Creation of DataFrame
A dataframe can be created using various inputs
like
• Series
• List
• Dictionary
• Numpy ndarrays
Create an Empty DataFrame
import pandas as pd
df = pd.DataFrame()
print df
output
Empty DataFrame
Columns: []
Index: []
Create Dataframe from dict of series
• Dictionary of Series can be passed to form a
DataFrame.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print df
Output
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
Create Dataframe from list of dicts
• List of dictionaries
import pandas as pd
l = [{'Name' : 'Raj', 'lastname': 'Kumar'},
{'Name' : 'Vinodh', 'lastname' : 'Sharma'}]
df = pd.DataFrame(l)
print df
Output
Name lastname
0 Raj Kumar
1 Vinodh Sharma
• List of dictionaries and the row indices.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print df
Output
a b c
first 1 2 NaN
second 5 10 20.0
• List of dictionaries, row indices and column
indices.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df1 = pd.DataFrame(data, index=['first', 'second'],
columns=['a', 'b'])
df2 = pd.DataFrame(data, index=['first', 'second'],
columns=['a', 'b1'])
print df1
print df2
Output
#df1 output
a b
first 1 2
second 5 10
#df2 output
a b1
first 1 NaN
second 5 NaN
Create CSV file
A CSV is a comma separated value file, which allows
data to be saved in a tabular format.
• Open Ms-Excel, type the details in rows and
columns
• Save the file with the extension csv
ex: ('C:\Users\emp.csv')
Importing CSV in a DataFrame
• pd.read_csv() – is used to read a csv file
import pandas as pd
df = pd.read_csv(r'C:\Users\emp.csv')
print(df)
Display subset of columns from CSV
import pandas as pd
data = pd.read_csv (r'C:\Users\emp.csv')
df = pd.DataFrame(data,
columns=[‘empid',‘ename'])
print (df)
Exporting data from dataframe to csv
import pandas as pd
l = [{'Name' : 'Sachin', 'Sirname': 'Bhardwaj'},
{'Name' : 'Vinod', 'Sirname' : 'Verma'},
{'Name' : 'Rajesh', 'Sirname' : 'Mishra'}]
df = pd.DataFrame(l)
df.to_csv('C:\Users\det.csv‘)
Iteration on rows and columns
• Iteration is used to access record or data
from a data frame row wise or column wise.
• Pandas provide 2 functions to perform
iterations
– iterrows()
– iteritems()
• iterrows() – used to access data row wise
import pandas as pd
l = [{'Name' : 'Raj', 'lastname': 'Kumar'},
{'Name' : 'Vinodh', 'lastname' : 'Sharma'}]
df = pd.DataFrame(l)
print df
for(row_index,row_value) in df.iterrows():
print('Row index is : ', row_index)
print('Row value is : ')
print(row_value)
Output
Name lastname
0 Raj Kumar
1 Vinodh Sharma
Row index is : 0
Row value is :
Name Raj
lastname Kumar
Name: 0, dtype: object
Row index is : 1
Row value is :
Name Vinodh
lastname Sharma
Name: 1, dtype: object
• iteritems() – used to access data column wise
import pandas as pd
l = [{'Name' : 'Raj', 'lastname': 'Kumar'},
{'Name' : 'Vinodh', 'lastname' : 'Sharma'}]
df = pd.DataFrame(l)
print df
for(col_name,col_value) in df.iteritems():
print('Column Name is : ', col_name)
print('Column Values are : ')
print(col_value)
Output
Name lastname
0 Raj Kumar
1 Vinodh Sharma
Column Name is : Name
Column Values are :
0 Raj
1 Vinodh
Name: Name, dtype: object
Column Name is : lastname
Column Values are :
0 Kumar
1 Sharma
Name: lastname, dtype: object
Operations on Rows
• Various operations are carried out in rows.
– Add
– Select
– Delete
Row Addition
• Add new rows to a DataFrame using the append
function.
• This function will append the rows at the end.
import pandas as pd
df1 = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
df = df1.append(df2)
print df
Output
a b
0 10 20
1 30 40
0 50 60
1 70 80
Row Selection
Row selection can be performed by
• Label – it is used to access a group of rows and
columns
• Integer - it is used to access a group of rows and
columns based on integer index
• multiple rows – it is used to slice the rows
Selection by label
• Rows can be selected by passing row label to a loc
function.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print df.loc['b', :]
print df.loc['a':'c', :]
Output
one 2.0
two 2.0
Name: b, dtype: float64
one two
a 1.0 1
b 2.0 2
c 3.0 3
Selection by integer location
• Rows can be selected by passing integer location
to an iloc function.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print("Dataframe :")
print df
print("Integer location index 2 value : ")
print df.iloc[2]
print("First 2 rows of second column : ")
print df.iloc[0:2,1:2]
print("All rows of first column : ")
print df.iloc[: , 0:1]
Output
Dataframe :
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
Integer location index 2 value :
one 3.0
two 3.0
Name: c, dtype: float64
First 2 rows of second column :
two
a 1
b 2
All rows of first column :
one
a 1.0
b 2.0
c 3.0
d NaN
Slice rows
• Multiple rows can be selected using ‘ : ’ operator.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print df[2:4]
Output
one two
c 3.0 3
d NaN 4
Row Deletion
• Use index label to delete or drop rows from a
DataFrame.
• If label is duplicated, then multiple rows will
be dropped.
import pandas as pd
df1 = pd.DataFrame([[1, 2], [3, 4]], columns =
['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns =
['a','b'])
df = df1.append(df2)
df = df.drop(0)
print df
df=df.drop(1)
print df
Output
# drop(0)
a b
1 3 4
1 7 8
# drop(1)
a b
0 1 2
0 5 6
Operations on Columns
• Various operations are carried out in rows.
– Select
– Add
– Rename
– Delete
Column Selection
import pandas as pd
s = pd.Series([10,15,20,25])
df=pd.DataFrame(s)
df.columns=['Col1']
df['Col2']=30
print df['Col1']
print df['Col2']
Output
0 10
1 15
2 20
3 25
Name: Col1, dtype: int64
0 30
1 30
2 30
3 30
Name: Col2, dtype: int64
Column Addition
1) Adding a new column to an existing data frame
2) Adding a new column by passing a series
3) Adding a new column using insert function
Adding a column using existing columns
import pandas as pd
s = pd.Series([10,15,20,25])
df=pd.DataFrame(s)
df.columns=['Col1']
df['Col2']=30
df['Col3']=df['Col1']+df['Col2']
print("Adding a new column using the existing
columns in DataFrame : ")
print(df)
Output
Adding a new column using the existing columns
in DataFrame :
Col1 Col2 Col3
0 10 30 40
1 15 30 45
2 20 30 50
3 25 30 55
Adding a column by passing a series
import pandas as pd
s = pd.Series([10,15,20,25])
df=pd.DataFrame(s)
df.columns=['Col1']
df['Col2']=30
print ("Adding a new column by passing a Series:")
df['Col3']=pd.Series([5,5,5,5])
print(df)
Output
Adding a new column by passing a Series:
Col1 Col2 Col3
0 10 30 5
1 15 30 5
2 20 30 5
3 25 30 5
Adding a new column using insert
• Add a column in the last place(3rd place) named
‘Col3' value=[50,100,150,200].
import pandas as pd
s = pd.Series([10,15,20,25])
df=pd.DataFrame(s)
df.columns=['Col1']
df['Col2']=30
print ("Adding a new column using insert
function:")
df.insert(loc=2,column='Col3',value=[50,100,150,20
0])
print df
Output
Adding a new column using insert function:
Col1 Col2 Col3
0 10 30 50
1 15 30 100
2 20 30 150
3 25 30 200
Rename Column
• rename column m1 as marks1in both the dataframes
df1 and df2.
import pandas as pd
df1=pd.DataFrame({'m1':[56,77,82], 'm2':[89,68,45]})
df2=pd.DataFrame({'m1':[10,20,30], 'm2':[15,25,35]})
print("Dataframe 1")
df1.rename(columns={'m1':'marks1'},inplace=True)
print df1
print("Dataframe 2")
df2.rename(columns={'m1':'marks1'}, inplace=True)
print df2
Output
Dataframe 1
marks1 m2
0 56 89
1 77 68
2 82 45
Dataframe 2
marks1 m2
0 10 15
1 20 25
2 30 35
Rename index label
• change index label from 0 to zero, from 1 to
one and 2 to two for the given dataframe.
import pandas as pd
df=pd.DataFrame({'m1':[56,77,82],'m2':[89,68,4
5]},index=[0,1,2])
df.rename(index = {0: 'zero', 1:'one', 2:'two'},
inplace = True)
print df
Output
m1 m2
zero 56 89
one 77 68
two 82 45
Column Deletion
• Column from a data frame can be deleted by any
of the following:
– del()
– pop()
– drop()
• Note - output will be displayed based on column name
ascending
del () – deletes a column by passing column name in
subscript with df
import pandas as pd
s = pd.Series([10,15,20,25])
df=pd.DataFrame(s)
df.columns=['Col1']
df['Col2']=30
df['Col3']=df['Col1']+df['Col2']
print("Dataframe : ")
print df
print ("Deleting column using del():")
del df['Col1']
print df
Output
Dataframe:
Col1 Col2 Col3
0 10 30 40
1 15 30 45
2 20 30 50
3 25 30 55
Deleting column using del():
Col2 Col3
0 30 40
1 30 45
2 30 50
3 30 55
pop() – deletes a column by passing column name
in pop method.
import pandas as pd
s = pd.Series([10,15,20,25])
df=pd.DataFrame(s)
df.columns=['Col1']
df['Col2']=30
df['Col3']=df['Col1']+df['Col2']
print("Dataframe : ")
print df
print ("Deleting column using pop():")
df.pop('Col2')
print df
Output
Dataframe :
Col1 Col2 Col3
0 10 30 40
1 15 30 45
2 20 30 50
3 25 30 55
Deleting column using pop():
Col1 Col3
0 10 40
1 15 45
2 20 50
3 25 55
drop() – deletes a column by passing column name
and specify axis =1.
import pandas as pd
s = pd.Series([10,15,20,25])
df=pd.DataFrame(s)
df.columns=['Col1']
df['Col2']=30
print("Dataframe : ")
print df
print ("Deleting column using drop():")
df1=df.drop('Col2',axis=1)
print df1
Output
Dataframe :
Col1 Col2
0 10 30
1 15 30
2 20 30
3 25 30
Deleting column using drop():
Col1
0 10
1 15
2 20
3 25
Head function
• Head -returns the first n rows. The default
number of elements to display is five.
• Display first two rows of given dataframe
import pandas as pd
df=pd.DataFrame({'Name':['Tom', 'Jack', 'Steve',
'Ricky'],'Age':[28,34,29,42]})
print df.head(2)
Output
Age Name
0 28 Tom
1 34 Jack
Tail function
• Tail - returns the last n rows. The default
number of elements to display is five.
• Display last three rows of given dataframe
import pandas as pd
df=pd.DataFrame({'Name':['Tom', 'Jack', 'Steve',
'Ricky'],'Age':[28,34,29,42]})
print df.tail(3)
Output
Age Name
1 34 Jack
2 29 Steve
3 42 Ricky
Boolean indexing
• Boolean indexing helps to select the data from the data frames
using a boolean vector.
import pandas as pd
dic = {'Name' : ['Sachin', 'Vinod', 'Rajesh'],
'Age' : [32, 35, 40]}
df = pd.DataFrame(dic, index = [True, False, True])
print("Dataframe : ")
print df
print
print("Boolean True Value : ")
print df.loc[True]
print
print("Boolean False Value : ")
print df.loc[False]
Output
Dataframe :
Age Name
True 32 Sachin
False 35 Vinod
True 40 Rajesh
Boolean True Value :
Age Name
True 32 Sachin
True 40 Rajesh
Boolean False Value :
Age 35
Name Vinod
dtype: object
Concatenation
• concat() - performs concatenation operations
along an axis
• ignore_index – Boolean. Default False. If True,
do not use the index values on the
concatenation axis.
import pandas as pd
dic1 = { 'id' : [1,2,3], ' v1' : [56,78,98],
'v2' :[34,48,85]}
dic2 = { 'id' : [2,3,4], 'v1' : [22,48,65],
'v2' :[12,89,77]}
df1=pd.DataFrame(dic1)
df2=pd.DataFrame(dic2)
df3=pd.concat([df1,df2])
print df3
Output
id v1 v2
0 1 56 34
1 2 78 48
2 3 98 85
0 2 22 12
1 3 48 89
2 4 65 77
import pandas as pd
dic1 = { 'id' : [1,2,3], 'v1' : [56,78,98],
'v2' :[34,48,85]}
dic2 = { 'id' : [2,3,4], 'v1' : [22,48,65],
'v2' :[12,89,77]}
df1=pd.DataFrame(dic1)
df2=pd.DataFrame(dic2)
df3=pd.concat([df1,df2], ignore_index=True)
print df3
Output
id v1 v2
0 1 56 34
1 2 78 48
2 3 98 85
3 2 22 12
4 3 48 89
5 4 65 77
Merging
• merge() – used to join two data frames with
holding different values and linked by common
column
• on – columns to join on.
import pandas as pd
dic1 = { 'id' : [1,2,3], 'v1' : [56,78,98], 'v2'
:[34,44,85]}
dic2 = { 'id' : [2,3,4], 'v3':['a','b','c']}
df1=pd.DataFrame(dic1)
df2=pd.DataFrame(dic2)
df3=pd.merge(df1,df2,on='id')
print df3
Output
id v1 v2 v3
0 2 78 44 a
1 3 98 85 b
Joining
• The full outer join combines the results of both
left and right outer joins.
• The joined data frame will contain all records
from both the data frames and fill NaN for
missing data on either side.
• how − ('left', 'right', 'outer', 'inner‘)
import pandas as pd
dic1 = { 'id' : [1,2,3], 'v1' : [56,78,98],
'v2' :[34,48,85]}
dic2 = { 'id' : [2,3,4], 'v1' : [22,48,65],
'v2‘ :[12,89,77]}
df1=pd.DataFrame(dic1)
df2=pd.DataFrame(dic2)
df3=pd.merge(df1,df2,on='id',how='outer')
print df3
Output
id v1_x v2_x v1_y v2_y
0 1 56.0 34.0 NaN NaN
1 2 78.0 48.0 22.0 12.0
2 3 98.0 85.0 48.0 89.0
3 4 NaN NaN 65.0 77.0