Data analysis made simple: Python Pandas
Python for Data Analysis: Pandas
The Pandas library is one of the most important and popular tools for Python data scientists and analysts, as it is the backbone of many data projects.
Pandas is an open-source Python package for data cleaning and data manipulation. It provides extended, flexible data structures to hold different types of labelled and relational data.
Pandas is built on the NumPy package, so a lot of the structure between them is similar. Pandas is also used in SciPy for statistical analysis or with Matplotlib for plotting functions.
Using Pandas, you can do things like:
Easily calculate statistics about data, such as finding the average, distribution, and median of columns
Use data visualization tools, such as Matplotlib, to easily create bar plots, histograms, and more
Clean your data by filtering columns by particular criteria or easily removing values
Manipulate your data flexibly using operations like merging, joining, reshaping, and more
Read, write, and store your clean data as a database, txt file, or CSV file
Installing Pandas
You can install Pandas using the built-in Python tool pip by running the following command.
$ pip install pandas
The data types available to us in Pandas are also called dtypes:
object: text or mixed numeric and non-numeric values
int64: integer numbers
bool: true/false values
float64: floating point numbers
category: finite list of text values
datetime64: date and time values
timedelta[ns]: differences between two datetimes
A data structure is a particular way of organizing our data. Pandas has two data structures, and all operations are based on those two objects:
Series
DataFrame
Series are the columns, and the DataFrame is a table composed of a collection of series. A Series can best be described as a single column of a 2-D array that can store data of any type.
A DataFrame is like a table that stores data similar to a spreadsheet, using multiple columns and rows. Each value in a DataFrame object is associated with a row index and a column index.
Let's look at both Pandas data structures with some additional annotation.
We create a Series by invoking the pd.Series() method and passing a list of values. By default, Pandas counts the index from 0; we can also explicitly define those index values. The srs.values attribute returns the values stored in the Series object, and srs.index.values returns the index values.
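A minimal sketch of these attributes (the snippet this passage originally referenced is not shown here):
import pandas as pd

# Default integer index counting from 0
srs = pd.Series([11.9, 36.0, 16.6, 21.8, 34.2])

print(srs.values)        # the stored values as a NumPy array
print(srs.index.values)  # the index labels, [0 1 2 3 4] by default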
Assign names to our values
Each index corresponds to its value in the Series object. Let's look at an example where we assign a country name to population growth rates.
Example:
# Importing pandas in our program
import pandas as pd
# Defining a series object
srs = pd.Series([11.9, 36.0, 16.6, 21.8, 34.2], index = ['China', 'India', 'USA', 'Brazil', 'Pakistan'])
# Set Series name
srs.name = "Growth Rate"
# Set index name
srs.index.name = "Country"
# printing series values
print("The Indexed Series values are:")
print(srs)
The attribute srs.name sets the name of our Series object. The attribute srs.index.name then sets the name for the indexes.
Select entries from a Series
We select elements based on the index name or index number.
import numpy as np
import pandas as pd
srs = pd.Series(np.arange(0, 6, 1), index = ['ind0', 'ind1', 'ind2', 'ind3', 'ind4', 'ind5'])
srs.index.name = "Index"
print("The original Series:\n", srs)
print("\nSeries element at index ind3:")
print(srs['ind3']) # Fetch element at index named ind3
print("\nSeries element at index 3:")
print(srs[3]) # Fetch element at index 3
print("\nSeries elements at multiple indexes:\n")
print(srs[['ind1', 'ind4']]) # Fetch elements at multiple indexes
The elements from the Series are selected in three ways:
On line 9, the element is selected based on the index name.
On line 12, the element is selected based on the index number. Keep in mind that index numbers start from 0.
On line 15, multiple elements are selected from the Series by passing multiple index names inside the [].
Drop entries from a Series
Dropping an unwanted index is a common operation in Pandas. If the drop(index_name) function is called with a given index on a Series object, the desired index name is deleted.
import numpy as np
import pandas as pd
srs = pd.Series(np.arange(0, 6, 1), index = ['ind0', 'ind1', 'ind2', 'ind3', 'ind4', 'ind5'])
srs.index.name = "Index"
print("The original Series:\n", srs)
srs = srs.drop('ind2') # drop index named ind2
print("The New Series:\n", srs)
The output shows that the ind2 index is dropped. Also, an index can only be dropped by specifying the index name and not the number. So, srs.drop(srs[2]) does not work.
DataFrame: the most important operations
Using the pandas.DataFrame() function
To create a Pandas DataFrame from a NumPy array, pass the NumPy array as an argument to the pandas.DataFrame() function. You can also pass the index and column labels for the DataFrame. The following is the syntax:
df = pandas.DataFrame(data=arr, index=None, columns=None)
There are several ways to make a DataFrame in Pandas. The easiest way to create one from scratch is to pass your data to the DataFrame constructor and print the resulting df.
We can also create a dict and pass our dictionary data to the DataFrame constructor. Say we have some data on vegetable sales and want to organize it by type of vegetable and quantity. We build the dict and pass it to the constructor with a simple command.
Each item, or value, in our data will correspond with a column in the DataFrame we created, just like a chart. The index for this DataFrame is listed as numbers by default, but we can specify it further depending on our needs. Say we wanted to know quantity per month; that would be our new index. The sketch below walks through both steps.
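A minimal sketch, assuming illustrative vegetable names and quantities:
import pandas as pd

# Hypothetical vegetable sales data: type of vegetable -> quantities sold
data = {'carrots': [10, 15, 20], 'peppers': [5, 8, 12]}

# Pass the dict to the constructor; the index defaults to 0, 1, 2, ...
quantity = pd.DataFrame(data)
print(quantity)

# Specify months as the index to see quantity per month
quantity = pd.DataFrame(data, index=['June', 'July', 'August'])
print(quantity)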
Get info about your data
One of the first commands you run after loading your data is .info(), which provides all the essential information about a dataset.
You can access more information with other operations, like .shape, which outputs a tuple of (rows, columns).
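Continuing with the quantity DataFrame sketched above:
quantity.info()        # column dtypes, non-null counts, and memory usage
print(quantity.shape)  # (3, 2): three rows, two columns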
We use the .columns operator to print a dataset's column names.
quantity.columns
You can then rename your columns easily with the .rename() method.
quantity.rename(columns = {'carrots':'bananas'})
Searching and selecting in our DataFrame
We need to know how to manipulate or access the data in our DataFrame, such as selecting, searching, or deleting data values. You can do this either by column or by row. Let's see how it's done. The easiest way to select a column of data is by using brackets [ ]. We can also use brackets to select multiple columns. Say we only wanted to look at June's vegetable quantity; the sketch below shows these selections.
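Assuming the month-indexed quantity DataFrame from earlier:
# Select a single column with brackets
print(quantity['carrots'])

# Select multiple columns by passing a list of names
print(quantity[['carrots', 'peppers']])

# Select June's row by index name
print(quantity.loc['June'])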
Note: loc and iloc are used for locating data.
.iloc locates by numerical index
.loc locates by the index name. This is similar to list slicing in Python.
The Pandas DataFrame object also provides methods to select specific columns. The following example shows how it can be done.
quantity['peppers']
Create a new DataFrame from pre-existing columns
We can also grab multiple columns and create a new DataFrame object from them.
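For instance, a minimal sketch using the columns from earlier:
# Select two columns and copy them into a new DataFrame object
new_df = quantity[['carrots', 'peppers']].copy()
print(new_df)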
Reindex data in a DataFrame
We can also reindex the data either by the indexes themselves or by the columns. Reindexing with reindex() allows us to make changes without messing up the initial setting of the objects.
Note: The rules for reindexing are the same for Series and DataFrame objects.
# Importing pandas in our program
import pandas as pd
# Defining a series object
srs1 = pd.Series([11.9, 36.0, 16.6, 21.8, 34.2], index = ['China', 'India', 'USA', 'Brazil', 'Pakistan'])
# Set Series name
srs1.name = "Growth Rate"
# Set index name
srs1.index.name = "Country"
srs2 = srs1.reindex(['China', 'India', 'Malaysia', 'USA', 'Brazil', 'Pakistan', 'England'])
print("The series with new indexes is:\n",srs2)
srs3 = srs1.reindex(['China', 'India', 'Malaysia', 'USA', 'Brazil', 'Pakistan', 'England'], fill_value=0)
print("\nThe series with new indexes is:\n",srs3)
On line 11, the indexes are changed: the new index names ('Malaysia' and 'England') are added, and since srs1 has no values for them, they are assigned NaN by default. On line 13, fill_value=0 replaces those NaN entries with 0. For a DataFrame, the columns keyword should be used to reindex the columns; the rules are the same as for the indexes, and NaN values are assigned to the whole new column by default.
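A short sketch of column reindexing on a DataFrame (the frame here is illustrative):
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['Row1', 'Row2'])

# Use the columns keyword to reindex columns; the new column 'C' gets NaN
df2 = df.reindex(columns=['A', 'B', 'C'])
print(df2)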
How to read or import Pandas data
It is quite easy to read or import data from other files using the Pandas library. In fact, we can use various sources, such as CSV, JSON, or Excel, to load our data and access it.
Reading and importing data from CSV files
We can import data from a CSV file, which is common practice for Pandas users.
We simply create or open a CSV file, copy the data, paste it into a text editor, and save it in the same directory that houses your Python scripts.
You then use a bit of code to read the data using the read_csv function built into Pandas.
read_csv will generate a new index column by default, so if we want the first column of the file to serve as the index, we need to change this. We can do so by passing the index_col parameter to tell Pandas which column to index.
Once we've used Pandas to sort and clean data, we can then save it back as the original file with simple commands. You only have to input the filename and extension.
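A minimal sketch, assuming an illustrative file named vegetables.csv in the working directory:
import pandas as pd

# Read the CSV, using the file's first column as the index
df = pd.read_csv('vegetables.csv', index_col=0)

# ... sort and clean the data ...

# Save the cleaned data back out; just supply a filename and extension
df.to_csv('vegetables_clean.csv')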
Reading and importing data from JSON (JavaScript Object Notation)
Example:
{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}
Say you have a JSON file. A JSON file is basically like a stored Python dict, so Pandas can easily access and read it using the read_json function. Let's look at an example.
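A minimal sketch, assuming the glossary above is saved as glossary.json:
import pandas as pd

# read_json parses the JSON file into a DataFrame
df = pd.read_json('glossary.json')
print(df)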
Reading and importing data from an Excel file
Say you have an Excel file. You can similarly use the read_excel function to access and read that data. Once we call the read_excel function, we pass the name of the Excel file as our argument, so read_excel will open the file's data. We can then use print() to display the data. If we want to go one step further, we can add the loc method from earlier, allowing us to read specific rows and columns of our file.
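A minimal sketch, with an illustrative filename and column names:
import pandas as pd

# Open the Excel file's data (requires an engine such as openpyxl)
df = pd.read_excel('sales.xlsx')
print(df)

# Go one step further: read specific rows and columns with loc
print(df.loc[0:4, ['Product', 'Quantity']])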
Data Wrangling with Pandas (Combining DataFrames)
Once we have our data, we can use data wrangling processes to manipulate and prepare it for analysis. The most common data wrangling processes are merging, concatenation, and grouping.
Merge method - Description
left - use keys from the left frame only
right - use keys from the right frame only
outer - use the union of keys from both frames
inner - use the intersection of keys from both frames
merge()
We can join or merge two data frames in Pandas using the merge() function. The different arguments to merge() allow you to perform a natural join, left join, right join, and full outer join in pandas. There are also other types of join and concatenate operations, such as joins based on the index (row index and column index).
Join or merge in Pandas - syntax:
merge(left_df, right_df, on='Customer_id', how='inner')
left_df - DataFrame 1
right_df - DataFrame 2
on - columns (names) to join on; must be found in both the left and right DataFrame objects
how - the type of join to be performed: 'left', 'right', 'outer', or 'inner'; the default is an inner join
The data frames must have the same column names on which the merging happens. The merge() function in pandas is similar to the database join operation in SQL.
UNDERSTANDING THE DIFFERENT TYPES OF JOIN OR MERGE IN PANDAS:
Inner join or natural join: to keep only rows that match from the data frames, specify the argument how='inner'.
Outer join or full outer join: to keep all rows from both data frames, specify how='outer'.
Left join or left outer join: to include all the rows of your data frame x and only those from y that match, specify how='left'.
Right join or right outer join: to include all the rows of your data frame y and only those from x that match, specify how='right'.
Example:
import pandas as pd
import numpy as np
# data frame 1
d1 = {'Customer_id':pd.Series([1,2,3,4,5,6]),
'Product':pd.Series(['Oven','Oven','Oven','Television','Television','Television'])}
df1 = pd.DataFrame(d1)
# data frame 2
d2 = {'Customer_id':pd.Series([2,4,6,7,8]),
'State':pd.Series(['California','California','Texas','New York','Indiana'])}
df2 = pd.DataFrame(d2)
Printing df1 and df2 shows the two frames we are about to merge on Customer_id: df1 holds customers 1-6 with their products, and df2 holds customers 2, 4, 6, 7, and 8 with their states.
Inner join in pandas:
Returns only the rows in which the left table has matching keys in the right table.
Example:
#inner join in python pandas
inner_join_df= pd.merge(df1, df2, on='Customer_id', how='inner')
inner_join_df
The resultant DataFrame contains only Customer_id values 2, 4, and 6, which appear in both frames.
Outer join in pandas:
Returns all rows from both tables, joining records from the left table that have matching keys in the right table. Where there is no match in either table, NaN is returned.
Example:
# outer join in python pandas
outer_join_df=pd.merge(df1, df2, on='Customer_id', how='outer')
outer_join_df
The resultant DataFrame contains all Customer_id values 1 through 8, with NaN wherever a frame has no matching row.
Left outer join or left join in pandas:
Returns all rows from the left table, and any rows with matching keys from the right table. Where there is no match in the right table, NaN is returned.
Example:
# left join in python
left_join_df = pd.merge(df1, df2, on='Customer_id', how='left')
left_join_df
The resultant DataFrame contains all rows from df1 (Customer_id 1-6), with State set to NaN where df2 has no match.
Right outer join or right join in pandas:
Returns all rows from the right table, and any rows with matching keys from the left table.
Example:
# right join in python pandas
right_join_df= pd.merge(df1, df2, on='Customer_id', how='right')
right_join_df
The resultant DataFrame contains all rows from df2 (Customer_id 2, 4, 6, 7, and 8), with Product set to NaN for customers 7 and 8.
Pandas - Joining DataFrames with Concat and Append
It is frequently required to join dataframes together, such as when data is loaded from multiple files or even multiple sources. pandas.concat() is used to add the rows of multiple dataframes together and produce a new dataframe with the combined data.
concat
The pandas.concat function joins a number of dataframes along one of the axes. The default is to join along the index, stacking the dataframes on top of each other.
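A minimal sketch with two illustrative frames:
import pandas as pd

df_a = pd.DataFrame({'x': [1, 2]})
df_b = pd.DataFrame({'x': [3, 4]})

# By default concat joins along the index, stacking the rows
print(pd.concat([df_a, df_b]))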
pandas.concat parameters (with defaults):
objs - list of DataFrame or Series objects
axis - the axis to concatenate along (0 = 'index', 1 = 'columns'); default 0
join - how to handle indexes on the other axis ('inner' or 'outer'); default 'outer'
ignore_index - boolean value on preserving the source index; default False
keys - sequence used to create a hierarchical index using the passed keys; default None
levels - list of sequences used to create a MultiIndex; default None
names - list of names for the levels in the hierarchical index; default None
verify_integrity - boolean value to specify whether the new concatenated axis contains duplicates; default False
sort - boolean value to specify sorting of the non-concatenation axis if it is not already aligned when join is 'outer'; default False
copy - boolean value to specify whether data is copied unnecessarily; default True
concat with different column names
If the dataframes have different column names, concat still works: with the default outer join, missing entries are filled with NaN.
concat with axis = 1
The concat function has a number of parameters, all of which have defaults. The axis parameter specifies the axis along which to join the dataframes: 0 for the index (the default) and 1 for the columns. Calling concat() with axis = 1 joins two dataframes along the columns.
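Continuing the sketch above:
# axis=1 places the frames side by side, aligned on the index
print(pd.concat([df_a, df_b], axis=1))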
concat with inner join
Concatenating two dataframes with an inner join keeps only the matching indexes.
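For example:
# With join='inner', only index labels present in both frames survive
df_c = pd.DataFrame({'y': [5, 6, 7]})  # has one extra index label (2)
print(pd.concat([df_a, df_c], axis=1, join='inner'))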
Concatenating multiple dataframes
More than two dataframes can be concatenated together. The default is to concatenate along the index. Multiple dataframes can also be concatenated along the columns with axis=1.
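Sketch:
# Three or more frames can be combined in a single call
print(pd.concat([df_a, df_b, df_c]))          # stacked along the index
print(pd.concat([df_a, df_b, df_c], axis=1))  # side by side along the columns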
DataFrame.append
The append() instance method performs the same function as concat by appending a Series or DataFrame onto the end of the calling dataframe and returning a new dataframe. (Note that append() was deprecated in recent Pandas versions in favor of concat().)
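A sketch of the equivalent operations:
# Older Pandas: new_df = df_a.append(df_b)
# Recent Pandas: use concat instead
new_df = pd.concat([df_a, df_b], ignore_index=True)
print(new_df)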
groupby() function in pandas
The Pandas DataFrame.groupby() function is used to collect identical data into groups and perform aggregate functions on the grouped data. A group-by operation involves splitting the data, applying some functions, and finally aggregating the results.
In Pandas, you can use groupby() with a combination of sum(), count(), pivot(), transform(), aggregate(), and many more methods to perform various operations on grouped data.
Aggregation function - Description
sum() - sum of values
mean() - mean (average) of values
min() / max() - minimum / maximum value
std() / var() - standard deviation / variance
count() - count of non-missing values
nunique() - number of unique values
cumsum() - cumulative sum
agg() - apply multiple aggregation functions
The 'groupby' function is commonly used in data analysis. It is used to gain insights into the relationship between variables.
Key points -
groupby() is used to split data into groups based on one or more keys, allowing for efficient analysis and aggregation of grouped data.
It supports various aggregation functions like sum, mean, count, min, and max, which can be applied to each group.
You can apply multiple aggregations on different columns using .agg(), offering more flexibility in analysis.
The result of groupby() often returns a DataFrame with a MultiIndex, where each level represents a grouping key.
You can filter groups based on specific conditions by using .filter() after groupby().
groupby() allows iteration over groups, enabling customized operations on each subset of data.
The syntax for 'groupby()' is as follows:
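A sketch of the signature with its common defaults:
# General form of DataFrame.groupby(), with the usual defaults:
# df.groupby(by=None, axis=0, level=None, as_index=True,
#            sort=True, group_keys=True, observed=False, dropna=True)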
Parameters of Pandas DataFrame.groupby()
by - list of column names to group by
axis - defaults to 0; takes 0 or 'index', 1 or 'columns'
level - used with a MultiIndex
as_index - whether to return the group labels as the index; set False for SQL-style grouped output
sort - defaults to True; specify whether to sort after the group operation
group_keys - whether to add the group keys or not
squeeze - deprecated in new versions
observed - only applies if any of the groupers are Categoricals
dropna - defaults to True, which drops None/NaN values from the group keys; use False to keep them
For example, we can apply the groupby() function along with the sum() function to perform the sum operation on grouped data, as in the sketch below.
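A minimal sketch, using an illustrative courses dataset (the document's original data is not shown):
import pandas as pd

# Illustrative dataset with the column names used later in this section
df = pd.DataFrame({
    'Courses': ['Spark', 'PySpark', 'Spark', 'Pandas', 'Pandas'],
    'Fee': [20000, 25000, 22000, 24000, 26000],
    'Duration': ['30days', '40days', '35days', '60days', '50days'],
})

# Split into groups by course, then sum the fees within each group
print(df.groupby('Courses')['Fee'].sum())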
groupby() on Two or More Columns
You can also group by two or more columns by passing a list of column names, as below.
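Sketch:
# Group on both Courses and Duration, then aggregate the Fee column
print(df.groupby(['Courses', 'Duration'])['Fee'].sum())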
Add Index to the Grouped Data
By default, the groupby() function doesn't return the row index; you can add the index using the DataFrame.reset_index() method, as sketched below.
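Sketch:
# reset_index() turns the group keys back into ordinary columns
result = df.groupby(['Courses', 'Duration'])['Fee'].sum().reset_index()
print(result)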
Drop NA/None/NaN (on group key) from the Result
You can also choose whether to include NA/None/NaN in the group keys by setting the dropna parameter. By default, dropna is set to True, so None/NaN values are not included in the group keys; to include them, set the dropna=False parameter.
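Sketch:
import numpy as np

# Add a row with a NaN group key for illustration
df2 = pd.concat([df, pd.DataFrame({'Courses': [np.nan], 'Fee': [1500],
                                   'Duration': ['15days']})])

# dropna=False keeps the NaN key as its own group
print(df2.groupby('Courses', dropna=False)['Fee'].sum())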
Sort groupby() Result by Group Key
To remove sorting on grouped results in pandas, you can pass the sort=False parameter to the groupby() function. This ensures that the grouped results are not sorted by the group key, preserving the original order of appearance of the courses in the DataFrame.
To sort the group keys (courses) in descending order after performing the groupby() operation, you can use the sort_index() method with the ascending=False parameter: the code below first groups the DataFrame by Courses, calculates the sum of each group, and then sorts the group keys in descending order.
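Sketch:
# Preserve the original order of appearance of the courses
print(df.groupby('Courses', sort=False)['Fee'].sum())

# Group, sum, then sort the group keys in descending order
print(df.groupby('Courses')['Fee'].sum().sort_index(ascending=False))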
Apply More Aggregations
You can compute multiple aggregations at the same time on grouped data simply by passing a list of aggregate functions to aggregate().
To compute different aggregations on different columns in a grouped DataFrame, you can pass a dictionary to the agg() function specifying the aggregation function for each column. The example below calculates the count of the grouped Duration column and the min and max of the grouped Fee column.
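Sketch:
# A list of functions applies every aggregation to the Fee column
print(df.groupby('Courses')['Fee'].aggregate(['min', 'max', 'mean']))

# A dict applies a different aggregation to each column:
# count on Duration, min and max on Fee
print(df.groupby('Courses').agg({'Duration': 'count', 'Fee': ['min', 'max']}))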
Pandas Handling Missing Data in DataFrame
What is Missing Data?
In the world of Data Science, a Pandas DataFrame is the most popular and globally accepted data structure for storing large-scale data in the form of rows and columns, just like an Excel spreadsheet or SQL table. A DataFrame can contain almost any type of data; however, missing data in a DataFrame refers to values that are unavailable.
Example of Missing Data in a Pandas DataFrame
"Missing data in a DataFrame" simply means the values that are unavailable or missing in a Pandas DataFrame. Values that are missing in a DataFrame are automatically replaced by the NaN type (here NaN comes from NumPy). In the following example, we have two missing values in a DataFrame, which are replaced by the NaN value.
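A minimal sketch with two illustrative missing values:
import pandas as pd
import numpy as np

# Two values are missing and show up as NaN
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Carol'],
                   'Age': [25, np.nan, 31],
                   'City': ['Paris', 'London', np.nan]})
print(df)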
Why Should You Handle Missing Data in a DataFrame?
In the process of exploratory data analysis, one of the most important steps is data preprocessing, where you will mainly be dealing with missing data handling. Before looking into the insights of data, you need a clean dataset, free of outliers and missing values.
You need to handle missing data in a Pandas DataFrame because:
1. Missing values in a DataFrame negatively affect the data insights
2. Training a Machine Learning model needs a clean dataset
3. A DataFrame with missing values is hard to process, visualize, and build a data pipeline on
So, you need to find the missing data in your DataFrame and get rid of the missing values.
How to Find Missing Data in a DataFrame?
Use functions like isna() or isnull() to detect missing values. Pair them with sum() to count missing entries.
There are several common checks, each sketched in the code below:
1. Find rows having NaN values
2. Find columns having NaN values
3. Find the percentage of missing data in each column (here, DataFrame.isna() is used to check if the DataFrame has NA values)
4. Find the number of NaN values in each row w.r.t. the columns
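Sketches of the four checks, continuing with the illustrative df above:
# 1. Rows having NaN values
print(df[df.isna().any(axis=1)])

# 2. Columns having NaN values
print(df.columns[df.isna().any()].tolist())

# 3. Percentage of missing data in each column
print(df.isna().mean() * 100)

# 4. Number of NaN values in each row
print(df.isna().sum(axis=1))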
Different Methods to Handle Missing Data in a DataFrame
Based on the data you are working with, you may have to follow any of the following techniques for handling missing data in a DataFrame. Review all of the methods and apply the one that best suits your need.
The best ways to handle missing data in a DataFrame are:
1. Remove rows or columns from the DataFrame that have missing data
2. Replace the missing data with another value
1. Remove Rows or Columns Having Missing Data
We can simply find rows or columns where we have missing data and drop them using Pandas functions.
1.1 Removing Rows Having Missing Data
In Pandas, we can use the function df.dropna() to remove all rows that have missing data.
1.2 Removing Columns Having Missing Data
Just like removing rows, we can also remove columns from our DataFrame that have missing data. The same Pandas built-in function, df.dropna(), can be used with an extra "axis" parameter.
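Sketch:
# Drop every row that contains at least one missing value
rows_removed = df.dropna()

# Drop every column that contains at least one missing value
cols_removed = df.dropna(axis=1)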
2. Replace Missing Data in DataFrame
This method is a bit tedious, yet a more powerful and optimistic way to handle missing data in a DataFrame. You will have a lot of ways to replace the missing data in the DataFrame.
To replace missing data in a DataFrame you can use the following methods:
1. Replace missing data with fixed values in the DataFrame
2. Replace missing data with the mean value
3. Replace missing data with the median value
2.1 Replace Missing Data with Fixed Values in DataFrame
We can impute the missing values in the DataFrame with a fixed value. The fixed value can be an integer or any other data depending on the nature of your dataset. For example, if you are dealing with gender data, you can replace all the missing values with the word "unknown", "Male", or "Female". Common fixed-value replacements include replacing NaN with 0, replacing NaN with an empty string, or imputing all missing values with a random number generated using the Python random module.
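Sketch:
import random

df_zero = df.fillna(0)    # replace NaN with 0
df_empty = df.fillna('')  # replace NaN with an empty string

# Replace all NaN with one random number from the random module
df_rand = df.fillna(random.randint(0, 100))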
2.2 Replace Missing Data with Mean Value
You can use the mean value to replace the missing values in case the data distribution is symmetric. You have a choice between three statistics functions: mean, mode, or median. It strongly depends on the dataset you are working on, as in the sketch below.
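Sketch:
# Fill missing Age values with the column mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Median is often the better choice when the distribution is skewed
# df['Age'] = df['Age'].fillna(df['Age'].median())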
Pivot Tables in Pandas
Pivot tables are tables of grouped values that aggregate specific items of an original table into one or more discrete categories. They are a way of creating short summaries of your original dataset that display things such as sums of columns, averages, or any other statistical value you are interested in. By summarizing large amounts of data into pivot tables, you usually notice some patterns, which helps you deduce how your data behaves based on certain factors. This knowledge is very useful because it can help subject matter experts make better strategic decisions.
Key Differences:
Purpose: pivot tables reshape and summarize data; groupby groups and aggregates; merge combines data based on a key; concat stacks/appends data.
Requires aggregation: pivot tables and groupby, yes; merge and concat, no.
Reshapes data: only pivot tables (grid format).
Key for combining: pivot tables, none; groupby, grouping key(s); merge, common key(s); concat, not required.
Use case: pivot tables for multi-dimensional summary analysis; groupby for aggregating column values; merge for joining datasets; concat for adding new rows or columns.
The pivot_table() function in Pandas allows us to create a spreadsheet-style pivot table, making it easier to group and analyze our data.
To create a pivot table using this method you need to define values for the following parameters:
index
columns (optional)
values
aggfunc
The index parameter defines what is going to be the index of your pivot table. For example, it defines how the rows of your original DataFrame are going to be grouped into categories. If you input a list of values instead of just one value, you are going to end up with a multi-index as your row index.
The columns parameter is an optional parameter that allows you to introduce an extra value to your columns index, which in turn transforms your pivot table column index into a multi-index.
The values parameter defines which columns you want to aggregate. Essentially it tells Pandas what it needs to aggregate, based on some aggregation function, after your data has been grouped based on the values you entered for the index parameter.
The aggfunc parameter defines which type of aggregation you want to perform. Based on what you decide to use here, you can access various information such as the means, the sums, etc. If you want to, you can also enter multiple values here, which will end up transforming your column index into a multi-index.
In the following example, we reshape the DataFrame with Date as the index, City as the columns, and Temperature as the values.
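A minimal sketch with illustrative weather data:
import pandas as pd

# Illustrative data with the columns used throughout this section
df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'City': ['London', 'Paris', 'London', 'Paris'],
    'Temperature': [5, 7, 6, 9],
    'Humidity': [85, 80, 88, 75],
})

# Reshape: Date as index, City as columns, Temperature as values
print(df.pivot_table(index='Date', columns='City', values='Temperature'))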
pivot_table() is based on the following syntax:
index - keys to group by on the pivot table index
columns - keys to group by on the pivot table columns
values - columns used for the aggregation data of the pivot table
aggfunc - function or list of functions used for aggregation
pivot_table() with Multiple Values
We can also create a pivot table for multiple values, i.e. Temperature and Humidity, as in the sketch below.
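Sketch:
# Two value columns produce a column MultiIndex
print(df.pivot_table(index='Date', columns='City',
                     values=['Temperature', 'Humidity']))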
pivot_table() With Aggregate Functions
We can use the pivot_table() method with different aggregate functions using the aggfunc parameter. We can set the value of aggfunc to functions such as 'sum', 'mean', 'count', 'max' or 'min'.
For example, we can calculate the mean temperature of each city using the aggfunc='mean' argument in pivot_table(), as sketched below.
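Sketch:
# Mean temperature per city
print(df.pivot_table(index='City', values='Temperature', aggfunc='mean'))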
Pivot Table With MultiIndex
We can create a pivot table with a MultiIndex using the pivot_table() function, by passing a list of columns as the index argument.
A MultiIndex contains multiple levels of indexes, with columns linked to one another through a parent/child relationship. Here, Country is the parent column and City is the child column, as in the sketch below.
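Sketch (adding an illustrative Country column to the data):
# Country is the parent level, City the child level of the row MultiIndex
df['Country'] = ['UK', 'France', 'UK', 'France']
print(df.pivot_table(index=['Country', 'City'], values='Temperature'))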
Advanced Pivot Options
Alternatively, we may use more options, which have the default values shown in the signature sketch below.
Useful pivot options are:
fill_value - value to replace missing values with
dropna - exclude columns whose entries are all NaN
margins - add rows/columns with subtotals and grand totals
sort - sort the results
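A sketch of the signature with these defaults:
# pivot_table() signature with the defaults relevant here:
# DataFrame.pivot_table(values=None, index=None, columns=None, aggfunc='mean',
#                       fill_value=None, margins=False, dropna=True, sort=True)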
Pivot Table with Multiple aggfunc
We can use multiple aggregation functions, and the functions may differ between columns. For example:
'D' - mean
'E' - min and max
The sketch below shows this.
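Sketch, assuming an illustrative frame with a grouping column 'A' and value columns 'D' and 'E':
import pandas as pd

df2 = pd.DataFrame({'A': ['x', 'x', 'y', 'y'],
                    'D': [1.0, 2.0, 3.0, 4.0],
                    'E': [10, 20, 30, 40]})

# 'D' is aggregated with mean; 'E' with both min and max
print(df2.pivot_table(index='A', aggfunc={'D': 'mean', 'E': ['min', 'max']}))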
Pivot table replace NaN
To replace NaN values in the pivot table, we can use the fill_value parameter. For example, we can replace NaN values with 0.
Pivot table remove NaN
To drop columns with NaN values, we can use the option dropna=True, as sketched below.
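Sketch:
# Replace NaN cells in the pivot result with 0
print(df.pivot_table(index='Date', columns='City',
                     values='Temperature', fill_value=0))

# dropna=True (the default) excludes columns whose entries are all NaN
print(df.pivot_table(index='Date', columns='City',
                     values='Temperature', dropna=True))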