Module - 2
Data Wrangling
Python Pandas
Pandas is an open-source Python Library providing high-performance
data manipulation and analysis tool using its powerful data structures.
Using Pandas, we can accomplish five typical steps in the processing
and analysis of data, regardless of the origin of data — load, prepare,
manipulate, model, and analyze.
Python with Pandas is used in a wide range of fields including academic
and commercial domains including finance, economics, Statistics,
analytics, etc.
Key features of Pandas
• Fast and efficient DataFrame object with default and customized
indexing.
• Tools for loading data into in-memory data objects from different file
formats.
• Data alignment and integrated handling of missing data.
• Reshaping and pivoting of date sets.
• Label-based slicing, indexing and subsetting of large data sets.
• Columns from a data structure can be deleted or inserted.
• Group by data for aggregation and transformations.
• High performance merging and joining of data.
• Time Series functionality.
Data structures
Pandas deals with the following three data structures −
• Series
• DataFrame
• Panel
These data structures are built on top of Numpy array, which means
they are fast.
Data Structure Dimensions Description
1D labeled homogeneous array,
Series 1
sizeimmutable.
General 2D labeled, size-mutable
Data Frames 2 tabular structure with potentially
heterogeneously typed columns.
General 3D labeled, size-mutable
Panel 3
array.
Mutability
All Pandas data structures are value mutable (can be changed) and
except Series all are size mutable. Series is size immutable.
Note − DataFrame is widely used and one of the most important data
structures. Panel is used much less.
Series
Series is a one-dimensional array like structure with homogeneous
data. For example, the following series is a collection of integers 10, 23,
56, …
10 23 56 17 52 61 73 90 26 72
Key Points
Homogeneous data
Size Immutable
Values of Data Mutable
DataFrame
DataFrame is a two-dimensional array with heterogeneous data. For
example,
Name Age Gender Rating
Steve 32 Male 3.45
Lia 28 Female 4.6
Vin 45 Male 3.9
Katie 38 Female 2.78
The table represents the data of a sales team of an organization with
their overall performance rating. The data is represented in rows and
columns. Each column represents an attribute and each row represents
a person.
Data Type of Columns
The data types of the four columns are as follows −
Column Type
Name String
Age Integer
Gender String
Rating Float
Key Points
Heterogeneous data
Size Mutable
Data Mutable
A Data frame is a two-dimensional data structure, i.e., data is aligned in
a tabular fashion in rows and columns.
Features of DataFrame
Potentially columns are of different types
Size – Mutable
Labeled axes (rows and columns)
Can Perform Arithmetic operations on rows and columns
Let us assume that we are creating a data frame with student’s data.
pandas.DataFrame
A pandas DataFrame can be created using the following constructor −
pandas.DataFrame( data, index, columns, dtype, copy)
A pandas DataFrame can be created using various inputs like −
Lists
dict
Series
Numpy ndarrays
Another DataFrame
#import the pandas library and aliasing as pd
import pandas as pd
df = pd.DataFrame()
print(df)
Create a DataFrame from Lists
The DataFrame can be created using a single list or a list of lists.
Example
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
Print(df)
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print df
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'],dtype=float)
print df
Create a DataFrame from Dict of ndarrays / Lists
All the ndarrays must be of same length. If index is passed, then the
length of the index should equal to the length of the arrays.
If no index is passed, then by default, index will be range(n), where n is
the array length.
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print df
Let us now create an indexed DataFrame
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print df
Python Pandas - Merging/Joining
Pandas has full-featured, high performance in-memory join operations
idiomatically very similar to relational databases like SQL.
Pandas provides a single function, merge, as the entry point for all
standard database join operations between DataFrame objects −
pd.merge(left, right, how='inner', on=None, left_on=None,
right_on=None, left_index=False, right_index=False, sort=True)
left − A DataFrame object.
right − Another DataFrame object.
on − Columns (names) to join on. Must be found in both the left and right DataFrame objects.
left_on − Columns from the left DataFrame to use as keys. Can either be column names or arrays
with length equal to the length of the DataFrame.
right_on − Columns from the right DataFrame to use as keys. Can either be column names or
arrays with length equal to the length of the DataFrame.
left_index − If True, use the index (row labels) from the left DataFrame as its join key(s). In case of
a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join
keys from the right DataFrame.
right_index − Same usage as left_index for the right DataFrame.
how− One of 'left', 'right', 'outer', 'inner'. Defaults to inner
sort − Sort the result DataFrame by the join keys in lexicographical order. Defaults to True, setting
to False will improve the performance substantially in many cases.
Let us now create two different DataFrames and perform the merging
operations on it.
# import the pandas library
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
{'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
Print(left)
print(right)
Merge Two DataFrames on a Key
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print pd.merge(left,right,on='id')
Merge Two DataFrames on Multiple Keys
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print pd.merge(left,right,on=['id','subject_id'])
Merge Using 'how' Argument
The how argument to merge specifies how to determine which keys are to be
included in the resulting table. If a key combination does not appear in either the
left or the right tables, the values in the joined table will be NA.
Here is a summary of the how options and their SQL equivalent names −
Merge Method SQL Equivalent Description
left LEFT OUTER JOIN Use keys from left object
right RIGHT OUTER JOIN Use keys from right object
outer FULL OUTER JOIN Use union of keys
inner INNER JOIN Use intersection of keys
Left Join
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print pd.merge(left, right, on='subject_id', how='left')
Its output is as follows −
Name_x id_x subject_id Name_y id_y
0 Alex 1 sub1 NaN NaN
1 Amy 2 sub2 Billy 1.0
2 Allen 3 sub4 Brian 2.0
3 Alice 4 sub6 Bryce 4.0
4 Ayoung 5 sub5 Betty 5.0
Right Join
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print pd.merge(left, right, on='subject_id', how='right')
Its output is as follows −
Name_x id_x subject_id Name_y id_y
0 Amy 2.0 sub2 Billy 1
1 Allen 3.0 sub4 Brian 2
2 Alice 4.0 sub6 Bryce 4
3 Ayoung 5.0 sub5 Betty 5
4 NaN NaN sub3 Bran 3
Outer Join
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print pd.merge(left, right, how='outer', on='subject_id')
Its output is as follows −
Name_x id_x subject_id Name_y id_y
0 Alex 1.0 sub1 NaN NaN
1 Amy 2.0 sub2 Billy 1.0
2 Allen 3.0 sub4 Brian 2.0
3 Alice 4.0 sub6 Bryce 4.0
4 Ayoung 5.0 sub5 Betty 5.0
5 NaN NaN sub3 Bran 3.0
Inner Join
Joining will be performed on index. Join operation honors the object on which it is called.
So, a.join(b) is not equal to b.join(a).
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print pd.merge(left, right, on='subject_id', how='inner')
Its output is as follows −
Name_x id_x subject_id Name_y id_y
0 Amy 2 sub2 Billy 1
1 Allen 3 sub4 Brian 2
2 Alice 4 sub6 Bryce 4
3 Ayoung 5 sub5 Betty 5
Reshaping/Pivoting
Python has operations for rearranging tabular data, known as
reshaping or pivoting operations. For example, hierarchical indexing
provides a consistent way to rearrange data in a DataFrame. There are
two primary functions in hierarchical indexing: stack(): rotates or pivots
data from columns to rows.
The pivot() function is used to reshape a given DataFrame organized by
given index / column values.
We can reshape a dataframe using melt(),stack(),unstack() and pivot()
function.
Define a dataframe.
Apply melt() function to convert wide dataframe column as rows. It is
defined below,
df.melt()
Example
import pandas as pd
df = pd.DataFrame({'Id':[1,2,3],'Age':[13,14,13],'Mark':[80,90,85]})
print("Dataframe is:\n",df)
print(df.melt())
Define a dataframe.
Apply stack() function to increase the level of the index in a dataframe. It is
defined below,
df.stack().to_frame()
If you want to revert back the changes, you can use unstack().
df.unstack().to_frame()
to_frame() function is used to convert the given series object to a dataframe.
Example
import pandas as pd
df = pd.DataFrame({'Id':[1,2,3],'Age':[13,14,13],'Mark':[80,90,85]})
print("Dataframe is:\n",df)
print(df.stack().to_frame())
print(df.unstack().to_frame())
Define a dataframe
Apply pivot() function to reshape a dataframe based on Id column,
df.pivot(columns='Id')
Example
import pandas as pd
df = pd.DataFrame({'Id':[1,2,3],'Age':[13,14,13],'Mark':[80,90,85]})
print("Dataframe is:\n",df)
print(df.pivot(columns='Id'))
import pandas as pd
#Create a DataFrame
d = { 'countries':['A','B','C','A','B','C'],
'metrics':['population_in_million','population_in_million','population_in_million
', 'gdp_percapita','gdp_percapita','gdp_percapita'],
'values':[100,200,120,2000,7000,15000] }
df = pd.DataFrame(d,columns=['countries','metrics','values'])
print(df)
df2=df.pivot(index='countries', columns='metrics', values='values')
print(df2)
Pivot function() reshapes the data from long to wide in Pandas python.
Countries column is used on index.
Values of Metrics column is used as column names and values of value
column is used as its value.
Python RegEx
A RegEx, or Regular Expression, is a sequence of characters that forms a
search pattern.
RegEx can be used to check if a string contains the specified search pattern.
RegEx Module
Python has a built-in package called re, which can be used to work with
Regular Expressions.
Import the re module:
import re
Search the string to see if it starts with "The" and ends with "Spain":
import re
txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)
^ beginning
$ end
. Any character
*
RegEx Functions
The re module offers a set of functions that allows us to search a string
for a match:
Function Description
findall Returns a list containing all matches
search Returns a Match object if there is a match anywhere in the
string
split Returns a list where the string has been split at each
match
sub Replaces one or many matches with a string
Metacharacters
Metacharacters are characters with a special meaning:
import re
txt = "The rain in Spain"
#Find all lower case characters alphabetically between "a" and "m":
x = re.findall("[a-m]", txt)
print(x)
import re
txt = "That will be 59 dollars"
#Find all digit characters:
x = re.findall("\d", txt)
print(x)
import re
txt = "hello planet"
#Search for a sequence that starts with "he", followed by two (any) characters, and an "o":
x = re.findall("he..o", txt)
print(x)
import re
txt = "hello planet"
#Check if the string starts with 'hello':
x = re.findall("^hello", txt)
if x:
print("Yes, the string starts with 'hello'")
else:
print("No match")
import re
txt = "hello planet"
#Check if the string ends with 'planet':
x = re.findall("planet$", txt)
if x:
print("Yes, the string ends with 'planet'")
else:
print("No match")
import re
txt = "hello planet"
#Search for a sequence that starts with "he", followed by 0 or more (any) characters, and an "o":
x = re.findall("he.*o", txt)
print(x)
import re
txt = "hello planet"
#Search for a sequence that starts with "he", followed by 1 or more (any) characters, and an "o":
x = re.findall("he.+o", txt)
print(x)
import re
txt = "hello planet"
#Search for a sequence that starts with "he", followed by 0 or 1 (any) character, and an "o":
x = re.findall("he.?o", txt)
print(x)
#This time we got no match, because there were not zero, not one, but two characters between
"he" and the "o"
import re
txt = "hello planet"
#Search for a sequence that starts with "he", followed excactly 2 (any) characters,
and an "o":
x = re.findall("he.{2}o", txt)
print(x)
import re
txt = "The rain in Spain falls mainly in the plain!"
#Check if the string contains either "falls" or "stays":
x = re.findall("falls|stays", txt)
print(x)
if x:
print("Yes, there is at least one match!")
else:
print("No match")
Special Sequences
A special sequence is a \ followed by one of the characters
Sets
A set is a set of characters inside a pair of square brackets [] with a special meaning: