Handling Structured Data using Pandas
Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language. In this session we are going to learn about
DataFrames and Series in Python Pandas
Creating DataFrames and loading data into Pandas
Indexing, Slicing data
Sorting, Filtering, Groupby, Pivot functionalities
Merging dataframes
Series and DataFrame
When we worked with tabular data, pandas help us explore, clean and process the data. This tabular data is called as a DataFrame in pandas. A dataframe can be visually represented as follows
In [1]:
# import pandas library
import pandas as pd
Create dataframe from dictionary of lists
In [2]:
df = pd.DataFrame({'team': ['India', 'South Africa', 'New Zealand', 'England'],
'points': [10, 8, 3, 5],
'runrate': [0.5, 1.4, 2, -0.6],
'wins': [5, 4, 2, 2]})
df
Out[2]: team points runrate wins
0 India 10 0.5 5
1 South Africa 8 1.4 4
2 New Zealand 3 2.0 2
3 England 5 -0.6 2
In [3]:
df.shape
Out[3]: (4, 4)
In [4]:
df.columns
Out[4]: Index(['team', 'points', 'runrate', 'wins'], dtype='object')
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 team 4 non-null object
1 points 4 non-null int64
2 runrate 4 non-null float64
3 wins 4 non-null int64
dtypes: float64(1), int64(2), object(1)
memory usage: 256.0+ bytes
In [6]:
df.head(2)
Out[6]: team points runrate wins
0 India 10 0.5 5
1 South Africa 8 1.4 4
In [7]:
df.tail(3)
Out[7]: team points runrate wins
1 South Africa 8 1.4 4
2 New Zealand 3 2.0 2
3 England 5 -0.6 2
In [8]:
df.wins
Out[8]: 0 5
1 4
2 2
3 2
Name: wins, dtype: int64
In [9]:
df.wins >2
Out[9]: 0 True
1 True
2 False
3 False
Name: wins, dtype: bool
In [10]:
df['wins']
Out[10]: 0 5
1 4
2 2
3 2
Name: wins, dtype: int64
Create dataframe from list of lists
In [11]:
# creating a nested list
nested_list = [[1,2,3],[10,20,30],[100,200,300]]
# creating DataFrame
df = pd.DataFrame(nested_list, columns= ['A','B','C'])
print(df)# displaying resultant DataFrame
A B C
0 1 2 3
1 10 20 30
2 100 200 300
Loading data using pandas ( pd.read_csv)
When data and notebook in the same folder
In [12]:
pwd
Out[12]: 'C:\\Users\\isdc0\\Downloads'
In [13]:
Data = pd.read_csv("titanic.csv")
Data.head(2)
Out[13]: PassengerId Survived Pclass Lname Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 0 PC 17599 71.2833 C85 C
In [14]:
# Load data, available in other folder
Data = pd.read_csv("E:/Aug 2022/Programming for Analytics/Learning Resources/advertising.csv")
Data.sample(2)
Out[14]: Daily Time Spent on Site Age Area Income Daily Internet Usage Ad Topic Line City Male Country Timestamp Clicked on Ad
912 34.96 42 36913.51 160.49 Right-sized mobile initiative West James 1 Macedonia 2016-06-01 16:10:30 1
78 56.14 38 32689.04 113.53 Devolved tangible approach Lake Edward 1 Ireland 2016-02-03 07:59:16 1
In [15]:
# Load excel file
df = pd.read_excel("E:/Aug 2022/Programming for Analytics/Learning Resources/sample.xlsx")
df.head()
Out[15]: ID Item Review
0 1 tv OK
1 2 laptop :-)
2 3 camera poor quality
3 4 tv Superb!
4 5 laptop NaN
In [16]:
# Load txt file using pandas
df1 = pd.read_csv("E:/Aug 2022/Programming for Analytics/Learning Resources/Text Folder/test.txt")
print(df1)
First line: This is a test file
0 This is second line
1 third line
2 and subsiquent lines..
3 and more..
Writing the files to a csv/excel using pandas
In [17]:
df1
Out[17]: First line: This is a test file
0 This is second line
1 third line
2 and subsiquent lines..
3 and more..
In [18]:
df_new=df[['Review','ID']]
df_new
Out[18]: Review ID
0 OK 1
1 :-) 2
2 poor quality 3
3 Superb! 4
4 NaN 5
5 over priced 6
6 repancement 7
7 bad quality 8
8 waiting for refund 9
In [19]:
# write as a csv file
df.to_csv("E:/Aug 2022/Programming for Analytics/Learning Resources/Subset.csv")
In [20]:
# write as a excel file
df.to_excel("E:/Aug 2022/Programming for Analytics/Learning Resources/Subset1.xlsx")
Data Manipulation - Working with Retail Data
To learn about dataframes there is no better way than to work with real data and hence we will learn all the functionalities in pandas using this online retail data.
The is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many
customers of the company are wholesalers.
InvoiceNo Invoice number. If this code starts with letter 'c', it indicates a cancellation.
StockCode Product (item) code.
Description Product (item) name.
Quantity The quantities of each product (item) per transaction.
InvoiceDate Invoice Date and time.
UnitPrice Unit price. Numeric, Product price per unit in sterling.
CustomerID Customer number.
Country Country name.
In [21]:
# let us look at loading the data set
# complete list https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io
df = pd.read_excel("E:/Aug 2022/Programming for Analytics/Learning Resources/Online Retail.xlsx")
# preview your dataframe head and tail
df.head()
Out[21]: InvoiceNo StockCode Description Quantity ID UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
In [22]:
df.head(2)
Out[22]: InvoiceNo StockCode Description Quantity ID UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
In [23]:
df.tail(2)
Out[23]: InvoiceNo StockCode Description Quantity ID UnitPrice CustomerID Country
601 536412 21448 12 DAISY PEGS IN WOOD BOX 2 2010-12-01 11:49:00 1.65 17920 United Kingdom
602 536412 22968 ROSE COTTAGE KEEPSAKE BOX 4 2010-12-01 11:49:00 9.95 17920 United Kingdom
In [24]:
# information about the records in the dataframe
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 603 entries, 0 to 602
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 InvoiceNo 603 non-null object
1 StockCode 603 non-null object
2 Description 603 non-null object
3 Quantity 603 non-null int64
4 ID 603 non-null datetime64[ns]
5 UnitPrice 603 non-null float64
6 CustomerID 603 non-null int64
7 Country 603 non-null object
dtypes: datetime64[ns](1), float64(1), int64(2), object(4)
memory usage: 37.8+ KB
In [25]:
# summary statistics of dataframe; for numerical data
df.describe()
Out[25]: Quantity UnitPrice CustomerID
count 603.000000 603.000000 603.000000
mean 14.215589 3.403499 16050.829187
std 35.563101 7.390356 1814.393502
min -24.000000 0.100000 12431.000000
25% 2.000000 1.250000 14307.000000
50% 6.000000 2.100000 16098.000000
75% 12.000000 3.750000 17850.000000
max 432.000000 165.000000 18074.000000
In [26]:
# summary statistics of dataframe; for object data
df.describe(include=[object])
Out[26]: InvoiceNo StockCode Description Country
count 603 603 603 603
unique 46 385 386 4
top 536401 85123A WHITE HANGING HEART T-LIGHT HOLDER United Kingdom
freq 64 8 8 567
In [27]:
df.columns
Out[27]: Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'ID', 'UnitPrice',
'CustomerID', 'Country'],
dtype='object')
In [28]:
# rename column names (take care of inplace)
df.rename(columns = {'ID':"invoice date"},inplace = True)
df.head()
Out[28]: InvoiceNo StockCode Description Quantity invoice date UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
Slicing, filtering
In [29]:
# subset of the data with columns Quantity and UnitPrice
subset_cols = ['Quantity','UnitPrice']
subset_1 = df[subset_cols]
subset_1.head()
Out[29]: Quantity UnitPrice
0 6 2.55
1 6 3.39
2 8 2.75
3 6 3.39
4 6 3.39
In [30]:
# indexing and slicing using iloc
df.iloc[10:20,4:6]
Out[30]: invoice date UnitPrice
10 2010-12-01 08:34:00 2.10
11 2010-12-01 08:34:00 2.10
12 2010-12-01 08:34:00 3.75
13 2010-12-01 08:34:00 1.65
14 2010-12-01 08:34:00 4.25
15 2010-12-01 08:34:00 4.95
16 2010-12-01 08:34:00 9.95
17 2010-12-01 08:34:00 5.95
18 2010-12-01 08:34:00 5.95
19 2010-12-01 08:34:00 7.95
In [31]:
# filter the dataframe based on a criteria
# fetch all the records where quantity is greater than 2000
df[df['Quantity']>5]
Out[31]: InvoiceNo StockCode Description Quantity invoice date UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850 United Kingdom
... ... ... ... ... ... ... ... ...
560 536412 21880 RED RETROSPOT TAPE 10 2010-12-01 11:49:00 0.65 17920 United Kingdom
564 536412 22961 JAM MAKING SET PRINTED 24 2010-12-01 11:49:00 1.45 17920 United Kingdom
573 536412 22243 5 HOOK HANGER RED MAGIC TOADSTOOL 6 2010-12-01 11:49:00 1.65 17920 United Kingdom
577 536412 22077 6 RIBBONS RUSTIC CHARM 7 2010-12-01 11:49:00 1.65 17920 United Kingdom
581 536412 85049E SCANDINAVIAN REDS RIBBONS 12 2010-12-01 11:49:00 1.25 17920 United Kingdom
328 rows × 8 columns
In [32]:
# fetch the records where quantity is exact 1000 and 2000 (isin and or operator)
#df[(df['Quantity'] == 1000) | (df['Quantity'] == 2000) | (df['Quantity'] == 3000)]
df_sub = df[df['Quantity'].isin([1000,2000,3000])]
In [33]:
df_sub.to_csv("subset_data.csv")
Derived Data
In [34]:
# create a new derived column for invoice amount and tax
df['InvoiceAmount'] = df['Quantity'] * df['UnitPrice']
df['VAT'] = df['InvoiceAmount'] * 0.08
df.head()
# The calculation of the values is done element_wise.
# This means all values in the given column are multiplied by the value 0.08 at once.
# You do not need to use a loop to iterate each of the rows!
Out[34]: InvoiceNo StockCode Description Quantity invoice date UnitPrice CustomerID Country InvoiceAmount VAT
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850 United Kingdom 15.30 1.2240
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850 United Kingdom 20.34 1.6272
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850 United Kingdom 22.00 1.7600
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850 United Kingdom 20.34 1.6272
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850 United Kingdom 20.34 1.6272
Aggregating data
In [35]:
# unique
items = list(df['Description'].unique())
len(items)
Out[35]: 386
In [36]:
countries = list(df['Country'].unique())
len(countries)
Out[36]: 4
In [37]:
# value counts
order_country = df['Country'].value_counts()
order_country
Out[37]: United Kingdom 567
France 20
Australia 14
Netherlands 2
Name: Country, dtype: int64
In [38]:
# calculate total, mean, and median of invoice amount
df['VAT'].sum()
Out[38]: 1383.328
In [39]:
# aggregating data for multiple columns
# calculate median of InvoiceAmount and Quantity
df[['InvoiceAmount','VAT']].median()
Out[39]: InvoiceAmount 15.0
VAT 1.2
dtype: float64
Sorting data
In [40]:
# sort the dataframe by invoice amount
df.sort_values(by='InvoiceAmount',ascending = False,inplace = True)
In [41]:
df.sort_values(by = ['Country','Quantity'],ascending = True)
Out[41]: InvoiceNo StockCode Description Quantity invoice date UnitPrice CustomerID Country InvoiceAmount VAT
208 536389 22191 IVORY DINER WALL CLOCK 2 2010-12-01 10:03:00 8.50 12431 Australia 17.00 1.3600
207 536389 22192 BLUE DINER WALL CLOCK 2 2010-12-01 10:03:00 8.50 12431 Australia 17.00 1.3600
204 536389 22193 RED DINER WALL CLOCK 2 2010-12-01 10:03:00 8.50 12431 Australia 17.00 1.3600
203 536389 85014A BLACK/BLUE POLKADOT UMBRELLA 3 2010-12-01 10:03:00 5.95 12431 Australia 17.85 1.4280
201 536389 35004G SET OF 3 GOLD FLYING DUCKS 4 2010-12-01 10:03:00 6.35 12431 Australia 25.40 2.0320
... ... ... ... ... ... ... ... ... ... ...
180 536387 22779 WOODEN OWLS LIGHT GARLAND 192 2010-12-01 09:58:00 3.37 16029 United Kingdom 647.04 51.7632
179 536387 22780 LIGHT GARLAND BUTTERFILES PINK 192 2010-12-01 09:58:00 3.37 16029 United Kingdom 647.04 51.7632
219 536390 20668 DISCO BALL CHRISTMAS DECORATION 288 2010-12-01 10:19:00 0.10 17511 United Kingdom 28.80 2.3040
181 536387 22466 FAIRY TALE COTTAGE NIGHTLIGHT 432 2010-12-01 09:58:00 1.45 16029 United Kingdom 626.40 50.1120
182 536387 21731 RED TOADSTOOL LED NIGHT LIGHT 432 2010-12-01 09:58:00 1.25 16029 United Kingdom 540.00 43.2000
603 rows × 10 columns
Concatenate dataframes
In [42]:
# products dataframes
df_product = pd.DataFrame({'productCode':[1001,1002,1003,1004],
'productDesc':['Laptop','Phone','Keyboard','Camera']})
df_product
Out[42]: productCode productDesc
0 1001 Laptop
1 1002 Phone
2 1003 Keyboard
3 1004 Camera
In [43]:
df_product1 = pd.DataFrame({'productCode':[1005,1006,1007,1008],
'productDesc':['Mouse','Speakers','Monitor','UPS']})
df_product1
Out[43]: productCode productDesc
0 1005 Mouse
1 1006 Speakers
2 1007 Monitor
3 1008 UPS
In [44]:
df_product_master = pd.concat([df_product,df_product1], ignore_index=True)
df_product_master
Out[44]: productCode productDesc
0 1001 Laptop
1 1002 Phone
2 1003 Keyboard
3 1004 Camera
4 1005 Mouse
5 1006 Speakers
6 1007 Monitor
7 1008 UPS
Merging dataframes
In [45]:
df_product
Out[45]: productCode productDesc
0 1001 Laptop
1 1002 Phone
2 1003 Keyboard
3 1004 Camera
In [46]:
df_invoice = pd.DataFrame({'productCode':[1001,1001,1004,1002,1006],
'invoiceAmount':[1800,899,99,499,1299]})
df_invoice
Out[46]: productCode invoiceAmount
0 1001 1800
1 1001 899
2 1004 99
3 1002 499
4 1006 1299
In [47]:
df_invoice.merge(right = df_product)
Out[47]: productCode invoiceAmount productDesc
0 1001 1800 Laptop
1 1001 899 Laptop
2 1004 99 Camera
3 1002 499 Phone
Descriptive statistics -
Measure the Central Tendancy and Disperson of the Data
In [48]:
# Computation of measures of central tendency
# Mean
mean = df['UnitPrice'].mean()
print(mean)
3.4034991708126117
In [49]:
#median
median = df['UnitPrice'].median()
print(median)
2.1
In [50]:
# Mode
mode = df['UnitPrice'].mode()
print(mode)
0 1.65
dtype: float64
Computation of measures of dispersion or variability
In [51]:
# Minimum Value
print(df['UnitPrice'].min())
# Maximum Value
print(df['UnitPrice'].max())
# Range
print(df['UnitPrice'].max() - df['UnitPrice'].min())
# Varience
print(df['UnitPrice'].var())
# SD
print(df['UnitPrice'].std())
0.1
165.0
164.9
54.6173550112117
7.390355540243764
In [52]:
# Computation of measures of shape of distribution
# 1. Skewness
print(df['UnitPrice'].skew())
# skewness=1 for normal distribution; more positive skewness in this case
# 2. Kurtosis
print(df['UnitPrice'].kurt())
# it is greater than 3 and so excess kurtosis > 0.
#Hence, we can conclude that the Revenue curve is a leptokurtic curve
17.846805793784075
382.0981389464283
Skewness and Kurtosis Qualitative check
In [ ]: