Filter Data in WPS by Conditions

The document discusses data munging, a process essential for transforming unusable data into a useful format, primarily using Python's Pandas and NumPy libraries. It emphasizes the importance of cleaning datasets, detailing methods such as dropping unnecessary columns, changing DataFrame indices, and tidying up fields for consistency. The document provides practical examples and code snippets to illustrate these data cleaning techniques.

Data Analysis and Visualization

DATA MUNGING

Data munging is the general procedure for transforming data from erroneous or unusable forms,
into useful and use-case-specific ones. Without some degree of munging, whether performed by
automated systems or specialized users, data cannot be ready for any kind of downstream
consumption.

But powerful and versatile tools, like Python, are making it increasingly easy for anyone to
munge effectively.

PYTHONIC DATA CLEANING WITH PANDAS AND NUMPY


Data scientists spend a large amount of their time cleaning datasets and getting them down to
a form with which they can work. In fact, a lot of data scientists argue that the initial steps of
obtaining and cleaning data constitute 80% of the job.
Therefore, if you are just stepping into this field or planning to step into this field, it is important
to be able to deal with messy data, whether that means missing values, inconsistent formatting,
malformed records, or nonsensical outliers.

Ways of Cleaning Data:

 Dropping unnecessary columns in a DataFrame


 Changing the index of a DataFrame
 Using .str() methods to clean columns
 Using the .applymap() function to clean the entire dataset, element-wise
 Renaming columns to a more recognizable set of labels
 Skipping unnecessary rows in a CSV file

Let’s import the required modules and get started!


>>> import pandas as pd
>>> import numpy as np

i. Dropping Columns in a DataFrame:


Often, you’ll find that not all the categories of data in a dataset are useful to you. For
example, you might have a dataset containing student information (name, grade, standard,
parents’ names, and address) but want to focus on analyzing student grades. In this case, the
address or parents’ names categories are not important to you. Retaining these unneeded
categories will take up unnecessary space and potentially also bog down runtime.
pandas provides a handy way of removing unwanted columns or rows from a DataFrame with
the drop() function. Let’s look at a simple example where we drop a number of columns from
a DataFrame.
First, let’s create a DataFrame out of the CSV file ‘[Link]’. In the examples
below, we pass a relative path to pd.read_csv, meaning that all of the datasets are in a folder
named Datasets in our current working directory:
>>>
>>> df = pd.read_csv('Datasets/[Link]')
>>> df.head()

Identifier Edition Statement Place of Publication \


0 206 NaN London
1 216 NaN London; Virtue & Yorston
2 218 NaN London
3 472 NaN London
4 480 A new edition, revised, etc. London

Date of Publication Publisher \


0 1879 [1878] S. Tinsley & Co.
1 1868 Virtue & Co.
2 1869 Bradbury, Evans & Co.
3 1851 James Darling
4 1857 Wertheim & Macintosh

Title Author \
0 Walter Forbes. [A novel.] By A. A A. A.
1 All for Greed. [A novel. The dedication signed... A., A. A.
2 Love the Avenger. By the author of “All for Gr... A., A. A.
3 Welsh Sketches, chiefly ecclesiastical, to the... A., E. S.
4 [The World in which I live, and my place in it... A., E. S.

Contributors Corporate Author \


0 FORBES, Walter. NaN
1 BLAZE DE BURY, Marie Pauline Rose - Baroness NaN
2 BLAZE DE BURY, Marie Pauline Rose - Baroness NaN
3 Appleyard, Ernest Silvanus. NaN
4 BROOME, John Henry. NaN

Corporate Contributors Former owner Engraver Issuance type \


0 NaN NaN NaN monographic
1 NaN NaN NaN monographic
2 NaN NaN NaN monographic
3 NaN NaN NaN monographic
4 NaN NaN NaN monographic

Flickr URL \
0 [Link]
1 [Link]
2 [Link]
3 [Link]
4 [Link]

Shelfmarks
0 British Library HMNTS 12641.b.30.
1 British Library HMNTS [Link].2.
2 British Library HMNTS [Link].1.
3 British Library HMNTS [Link].15.
4 British Library HMNTS 9007.d.28.

When we look at the first five entries using the head() method, we can see that a handful of
columns provide ancillary information that would be helpful to the library but isn’t very
descriptive of the books themselves: Edition Statement, Corporate Author, Corporate
Contributors, Former owner, Engraver, Issuance type and Shelfmarks.

We can drop these columns in the following way:


>>> to_drop = ['Edition Statement', 'Corporate Author', 'Corporate Contributors',
... 'Former owner', 'Engraver', 'Contributors', 'Issuance type', 'Shelfmarks']
>>> df.drop(to_drop, inplace=True, axis=1)
Above, we defined a list that contains the names of all the columns we want to drop. Next, we
call the drop() function on our object, passing in the inplace parameter as True and
the axis parameter as 1. This tells pandas that we want the changes to be made directly in our
object and that it should look for the values to be dropped in the columns of the object.
When we inspect the DataFrame again, we’ll see that the unwanted columns have been removed:
>>> df.head()
Identifier Place of Publication Date of Publication \
0 206 London 1879 [1878]
1 216 London; Virtue & Yorston 1868
2 218 London 1869
3 472 London 1851
4 480 London 1857

Publisher Title \

0 S. Tinsley & Co. Walter Forbes. [A novel.] By A. A


1 Virtue & Co. All for Greed. [A novel. The dedication signed...
2 Bradbury, Evans & Co. Love the Avenger. By the author of “All for Gr...
3 James Darling Welsh Sketches, chiefly ecclesiastical, to the...
4 Wertheim & Macintosh [The World in which I live, and my place in it...

Author Flickr URL


0 A. A. [Link]
1 A., A. A. [Link]
2 A., A. A. [Link]
3 A., E. S. [Link]
4 A., E. S. [Link]

Alternatively, we could also remove the columns by passing them to the columns parameter
directly instead of separately specifying the labels to be removed and the axis where pandas
should look for the labels:
>>> df.drop(columns=to_drop, inplace=True)
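The drop-by-columns call can be sanity-checked on a tiny throwaway frame (the column names below are just stand-ins, not the full book dataset):

```python
import pandas as pd

# A toy frame standing in for the book dataset (columns are assumptions)
df = pd.DataFrame({'Identifier': [206, 216],
                   'Edition Statement': [None, None],
                   'Title': ['Walter Forbes', 'All for Greed']})

# Drop the unwanted column directly on the object
df.drop(columns=['Edition Statement'], inplace=True)
```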

ii. Changing the Index of a DataFrame


A pandas Index extends the functionality of NumPy arrays to allow for more versatile
slicing and labeling. In many cases, it is helpful to use a uniquely valued identifying field of the
data as its index.
For example, in the dataset used in the previous section, it can be expected that when a librarian
searches for a record, they may input the unique identifier (values in the Identifier column) for a
book:
>>> df['Identifier'].is_unique
True
Let’s replace the existing index with this column using set_index:
>>> df = df.set_index('Identifier')
>>> df.head()
Place of Publication Date of Publication \
206 London 1879 [1878]
216 London; Virtue & Yorston 1868
218 London 1869
472 London 1851
480 London 1857

Publisher \

206 S. Tinsley & Co.


216 Virtue & Co.
218 Bradbury, Evans & Co.
472 James Darling
480 Wertheim & Macintosh

Title Author \
206 Walter Forbes. [A novel.] By A. A A. A.
216 All for Greed. [A novel. The dedication signed... A., A. A.
218 Love the Avenger. By the author of “All for Gr... A., A. A.
472 Welsh Sketches, chiefly ecclesiastical, to the... A., E. S.
480 [The World in which I live, and my place in it... A., E. S.

Flickr URL
206 [Link]
216 [Link]
218 [Link]
472 [Link]
480 [Link]

We can access each record in a straightforward way with loc[]. Although loc[] may not have all
that intuitive of a name, it allows us to do label-based indexing, which is the labeling of a row or
record without regard to its position:
>>> df.loc[206]
Place of Publication London
Date of Publication 1879 [1878]
Publisher S. Tinsley & Co.
Title Walter Forbes. [A novel.] By A. A
Author A. A.
Flickr URL [Link]
Name: 206, dtype: object

In other words, 206 is the first label of the index. To access it by position, we could
use df.iloc[0], which does position-based indexing.

Previously, our index was a RangeIndex: integers starting from 0, analogous to Python’s built-
in range. By passing a column name to set_index, we have changed the index to the values
in Identifier.

You may have noticed that we reassigned the variable to the object returned by the method
with df = df.set_index(...). This is because, by default, the method returns a modified copy of our
object and does not make the changes directly to the object. We can avoid this by setting
the inplace parameter:
df.set_index('Identifier', inplace=True)
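A minimal sketch of set_index plus label- versus position-based access, using made-up rows:

```python
import pandas as pd

df = pd.DataFrame({'Identifier': [206, 216, 218],
                   'Title': ['A', 'B', 'C']})
df.set_index('Identifier', inplace=True)

# loc uses the index labels; iloc uses integer positions
by_label = df.loc[206, 'Title']
by_position = df.iloc[0]['Title']
```

Both expressions refer to the same first row, one by its label and one by its position.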

iii. Tidying up Fields in the Data


So far, we have removed unnecessary columns and changed the index of
our DataFrame to something more sensible. In this section, we will clean specific columns and
get them to a uniform format to get a better understanding of the dataset and enforce consistency.
In particular, we will be cleaning Date of Publication and Place of Publication.
Upon inspection, all of the data types are currently the object dtype.
It encapsulates any field that can’t be neatly fit as numerical or categorical data. This
makes sense since we’re working with data that is initially a bunch of messy strings:
>>> df.dtypes.value_counts()
object    6
One field where it makes sense to enforce a numeric value is the date of publication so that we
can do calculations down the road:
>>> df.loc[1905:, 'Date of Publication'].head(10)
Identifier
1905 1888
1929 1839, 38-54
2836 [1897?]
2854 1865
2956 1860-63
2957 1873
3017 1866
3131 1899
4598 1814
4884 1820
Name: Date of Publication, dtype: object

A particular book can have only one date of publication. Therefore, we need to do the following:

 Remove the extra dates in square brackets, wherever present: 1879 [1878]
 Convert date ranges to their “start date”, wherever present: 1860-63; 1839, 38-54
 Completely remove the dates we are not certain about and replace them with
NumPy’s NaN: [1897?]
 Convert the string nan to NumPy’s NaN value
Synthesizing these patterns, we can actually take advantage of a single regular expression to
extract the publication year:

regex = r'^(\d{4})'
The regular expression above is meant to find any four digits at the beginning of a string, which
suffices for our case. The above is a raw string (meaning that a backslash is no longer an escape
character), which is standard practice with regular expressions.
The \d represents any digit, and {4} repeats this rule four times. The ^ character matches the start
of a string, and the parentheses denote a capturing group, which signals to pandas that we want
to extract that part of the regex. (We want ^ to avoid cases where [ starts off the string.)
Let’s see what happens when we run this regex across our dataset:
>>> extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
>>> extr.head()
Identifier
206 1879
216 1868
218 1869
472 1851
480 1857
Name: Date of Publication, dtype: object

Technically, this column still has object dtype, but we can easily get its numerical version
with pd.to_numeric:
>>> df['Date of Publication'] = pd.to_numeric(extr)
>>> df['Date of Publication'].dtype
dtype('float64')
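The extract-then-convert step can be sketched on a few made-up date strings mirroring the patterns listed above:

```python
import pandas as pd

# Toy values mimicking the messy dates (assumed, not the real records)
dates = pd.Series(['1879 [1878]', '1860-63', '[1897?]', 'nan'])

# Capture four digits at the start of the string, then coerce to numbers;
# strings with no leading year (e.g. '[1897?]') become NaN
extr = dates.str.extract(r'^(\d{4})', expand=False)
years = pd.to_numeric(extr)
```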

iv. Combining str Methods with NumPy to Clean Columns


Above, you may have noticed the use of df['Date of Publication'].str. This attribute is a way to
access speedy string operations in pandas that largely mimic operations on native Python strings
or compiled regular expressions, such as .split(), .replace(), and .capitalize().
To clean the Place of Publication field, we can combine pandas str methods with
NumPy’s np.where() function, which is basically a vectorized form of Excel’s IF() macro.
We’ll be making use of these two functions to clean Place of Publication since this column has
string objects. Here are the contents of the column:
>>> df['Place of Publication'].head(10)
Identifier
206 London
216 London; Virtue & Yorston
218 London
472 London
480 London
481 London

519 London
667 pp. 40. G. Bryan & Co: Oxford, 1898
874 London]
1143 London
Name: Place of Publication, dtype: object

We see that for some rows, the place of publication is surrounded by other unnecessary
information. If we were to look at more values, we would see that this is the case for only some
rows that have their place of publication as ‘London’ or ‘Oxford’.
Let’s take a look at two specific entries:
>>> df.loc[4157862]
Place of Publication Newcastle-upon-Tyne
Date of Publication 1867
Publisher T. Fordyce
Title Local Records; or, Historical Register of rema...
Author T. Fordyce
Flickr URL [Link]
Name: 4157862, dtype: object

>>> df.loc[4159587]
Place of Publication Newcastle upon Tyne
Date of Publication 1834
Publisher Mackenzie & Dent
Title An historical, topographical and descriptive v...
Author E. (Eneas) Mackenzie
Flickr URL [Link]
Name: 4159587, dtype: object

These two books were published in the same place, but one has hyphens in the name of the place
while the other does not.
To clean this column in one sweep, we can use str.contains() to get a Boolean mask.
We clean the column as follows:
>>> pub = df['Place of Publication']
>>> london = pub.str.contains('London')
>>> london[:5]
Identifier
206 True
216 True

218 True
472 True
480 True
Name: Place of Publication, dtype: bool

>>> oxford = pub.str.contains('Oxford')


We combine them with np.where:
>>> df['Place of Publication'] = np.where(london, 'London',
...     np.where(oxford, 'Oxford', pub.str.replace('-', ' ')))

>>> df['Place of Publication'].head()


Identifier
206 London
216 London
218 London
472 London
480 London
Name: Place of Publication, dtype: object

Here, the np.where() function is called in a nested structure, with the condition being a Series of
Booleans obtained with str.contains(). The contains() method works similarly to the built-
in in keyword used to find the occurrence of an entity in an iterable (or a substring in a string).
The replacement to be used is a string representing our desired place of publication. We also
replace hyphens with a space using str.replace() and reassign to the column in our DataFrame.

Let’s have a look at the first five entries, which look a lot crisper than when we started out:
>>> df.head()
Place of Publication Date of Publication Publisher \
206 London 1879 S. Tinsley & Co.
216 London 1868 Virtue & Co.
218 London 1869 Bradbury, Evans & Co.
472 London 1851 James Darling
480 London 1857 Wertheim & Macintosh

Title Author \
206 Walter Forbes. [A novel.] By A. A A. A.
216 All for Greed. [A novel. The dedication signed... A., A. A.
218 Love the Avenger. By the author of “All for Gr... A., A. A.
472 Welsh Sketches, chiefly ecclesiastical, to the... A., E. S.
480 [The World in which I live, and my place in it... A., E. S.

Flickr URL
206 [Link]
216 [Link]
218 [Link]
472 [Link]
480 [Link]

v. Cleaning the Entire Dataset Using the applymap Function


There are some instances where it would be helpful to apply a customized function to
each cell or element of a DataFrame. pandas .applymap() method is similar to the in-
built map() function and simply applies a function to all the elements in a DataFrame.
Let’s look at an example. We will create a DataFrame out of the “university_towns.txt” file:
$ head Datasets/university_towns.txt
Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]

We see that we have periodic state names followed by the university towns in that state: StateA
TownA1 TownA2 StateB TownB1 TownB2.... If we look at the way state names are written in
the file, we’ll see that all of them have the “[edit]” substring in them.
We can take advantage of this pattern by creating a list of (state, city) tuples and wrapping that
list in a DataFrame:
>>> university_towns = []
>>> with open('Datasets/university_towns.txt') as file:
... for line in file:
... if '[edit]' in line:
... # Remember this `state` until the next is found
... state = line

... else:
... # Otherwise, we have a city; keep `state` as last-seen
... university_towns.append((state, line))

>>> university_towns[:5]
[('Alabama[edit]\n', 'Auburn (Auburn University)[1]\n'),
('Alabama[edit]\n', 'Florence (University of North Alabama)\n'),
('Alabama[edit]\n', 'Jacksonville (Jacksonville State University)[2]\n'),
('Alabama[edit]\n', 'Livingston (University of West Alabama)[2]\n'),
('Alabama[edit]\n', 'Montevallo (University of Montevallo)[2]\n')]

We can wrap this list in a DataFrame and set the columns as “State” and “RegionName”. pandas
will take each element in the list and set State to the left value and RegionName to the right
value.
The resulting DataFrame looks like this:
>>> towns_df = pd.DataFrame(university_towns,
...                         columns=['State', 'RegionName'])

>>> towns_df.head()
State RegionName
0 Alabama[edit]\n Auburn (Auburn University)[1]\n
1 Alabama[edit]\n Florence (University of North Alabama)\n
2 Alabama[edit]\n Jacksonville (Jacksonville State University)[2]\n
3 Alabama[edit]\n Livingston (University of West Alabama)[2]\n
4 Alabama[edit]\n Montevallo (University of Montevallo)[2]\n

While we could have cleaned these strings in the for loop above, pandas makes it easy. We only
need the state name and the town name and can remove everything else. While we could use
pandas’ .str() methods again here, we could also use applymap() to map a Python callable to each
element of the DataFrame.

Therefore, applymap() will apply a function to each of these independently. Let’s define that
function:
>>> def get_citystate(item):
...     if ' (' in item:
...         return item[:item.find(' (')]
...     elif '[' in item:
...         return item[:item.find('[')]
...     else:
...         return item


pandas’ .applymap() only takes one parameter, which is the function (callable) that should be
applied to each element:
>>> towns_df = towns_df.applymap(get_citystate)

First, we define a Python function that takes an element from the DataFrame as its parameter.
Inside the function, checks are performed to determine whether there’s a ( or [ in the element or
not.
Depending on the check, values are returned accordingly by the function. Finally,
the applymap() function is called on our object. Now the DataFrame is much neater:
>>> towns_df.head()
State RegionName
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo

The applymap() method took each element from the DataFrame, passed it to the function, and
the original value was replaced by the returned value.
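The whole applymap step is small enough to verify on a one-row frame built from the strings shown above:

```python
import pandas as pd

def get_citystate(item):
    # Keep everything before ' (' or '[', whichever applies
    if ' (' in item:
        return item[:item.find(' (')]
    elif '[' in item:
        return item[:item.find('[')]
    return item

towns = pd.DataFrame({'State': ['Alabama[edit]\n'],
                      'RegionName': ['Auburn (Auburn University)[1]\n']})
towns = towns.applymap(get_citystate)
```

Note that on pandas 2.1+ the same element-wise method is also available as DataFrame.map(); applymap still works but emits a deprecation warning there.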

vi. Renaming Columns and Skipping Rows


Often, the datasets you’ll work with will have either column names that are not easy to
understand, or unimportant information in the first few and/or last rows, such as definitions of
the terms in the dataset, or footnotes.
In that case, we’d want to rename columns and skip certain rows so that we can drill down to
necessary information with correct and sensible labels.
To demonstrate how we can go about doing this, let’s first take a glance at the initial five rows of
the “[Link]” dataset:
$ head -n 5 Datasets/[Link]
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
,? Summer,01 !,02 !,03 !,Total,? Winter,01 !,02 !,03 !,Total,? Games,01 !,02 !,03 !,Combined total
Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70
Now, we’ll read it into a pandas DataFrame:
>>> olympics_df = pd.read_csv('Datasets/[Link]')
>>> olympics_df.head()
0 1 2 3 4 5 6 7 8 \

0 NaN ? Summer 01 ! 02 ! 03 ! Total ? Winter 01 ! 02 !


1 Afghanistan (AFG) 13 0 0 2 2 0 0 0
2 Algeria (ALG) 12 5 2 8 15 3 0 0
3 Argentina (ARG) 23 18 24 28 70 18 0 0
4 Armenia (ARM) 5 1 2 9 12 6 0 0

9 10 11 12 13 14 15
0 03 ! Total ? Games 01 ! 02 ! 03 ! Combined total
1 0 0 13 0 0 2 2
2 0 0 15 5 2 8 15
3 0 0 41 18 24 28 70
4 0 0 11 1 2 9 12

This is messy indeed! The columns are the string form of integers indexed at 0. The row which
should have been our header (i.e. the one to be used to set the column names) is
at olympics_df.iloc[0]. This happened because our CSV file starts with 0, 1, 2, …, 15.
Also, if we were to go to the source of this dataset, we’d see that NaN above should really be
something like “Country”, ? Summer is supposed to represent “Summer Games”, 01 ! should be
“Gold”, and so on.

Therefore, we need to do two things:

 Skip one row and set the header as the first (0-indexed) row
 Rename the columns
We can skip rows and set the header while reading the CSV file by passing some parameters to
the read_csv() function.
This function takes a lot of optional parameters, but in this case we only need one (header) to
remove the 0th row:
>>> olympics_df = pd.read_csv('Datasets/[Link]', header=1)
>>> olympics_df.head()
Unnamed: 0 ? Summer 01 ! 02 ! 03 ! Total ? Winter \
0 Afghanistan (AFG) 13 0 0 2 2 0
1 Algeria (ALG) 12 5 2 8 15 3
2 Argentina (ARG) 23 18 24 28 70 18
3 Armenia (ARM) 5 1 2 9 12 6
4 Australasia (ANZ) [ANZ] 2 3 4 5 12 0

01 !.1 02 !.1 03 !.1 Total.1 ? Games 01 !.2 02 !.2 03 !.2 \


0 0 0 0 0 13 0 0 2

1 0 0 0 0 15 5 2 8
2 0 0 0 0 41 18 24 28
3 0 0 0 0 11 1 2 9
4 0 0 0 0 2 3 4 5

Combined total
0 2
1 15
2 70
3 12
4 12
We now have the correct row set as the header and all unnecessary rows removed. Take note of
how pandas has changed the name of the column containing the name of the countries
from NaN to Unnamed: 0.
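The header-row behaviour is easy to demonstrate without the file, using an in-memory CSV shaped like the one above (cut down to three columns for brevity):

```python
import io
import pandas as pd

# Same layout as the real file: a junk 0,1,2 row, then the real header
csv = "0,1,2\n,? Summer,01 !\nAfghanistan (AFG),13,0\n"

# header=1 makes the second line the header, discarding the first row;
# the empty first header field becomes 'Unnamed: 0'
df = pd.read_csv(io.StringIO(csv), header=1)
```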
To rename the columns, we will make use of a DataFrame’s rename() method, which allows you
to relabel an axis based on a mapping (in this case, a dict).
Let’s start by defining a dictionary that maps current column names (as keys) to more usable
ones (the dictionary’s values):
>>>
>>> new_names = {'Unnamed: 0': 'Country',
... '? Summer': 'Summer Olympics',
... '01 !': 'Gold',
... '02 !': 'Silver',
... '03 !': 'Bronze',
... '? Winter': 'Winter Olympics',
... '01 !.1': 'Gold.1',
... '02 !.1': 'Silver.1',
... '03 !.1': 'Bronze.1',
... '? Games': '# Games',
... '01 !.2': 'Gold.2',
... '02 !.2': 'Silver.2',
... '03 !.2': 'Bronze.2'}

We call the rename() function on our object:


>>> olympics_df.rename(columns=new_names, inplace=True)
Setting inplace to True specifies that our changes be made directly to the object. Let’s see if this
checks out:
>>> olympics_df.head()

Country Summer Olympics Gold Silver Bronze Total \


0 Afghanistan (AFG) 13 0 0 2 2
1 Algeria (ALG) 12 5 2 8 15
2 Argentina (ARG) 23 18 24 28 70
3 Armenia (ARM) 5 1 2 9 12
4 Australasia (ANZ) [ANZ] 2 3 4 5 12

Winter Olympics Gold.1 Silver.1 Bronze.1 Total.1 # Games Gold.2 \


0 0 0 0 0 0 13 0
1 3 0 0 0 0 15 5
2 18 0 0 0 0 41 18
3 6 0 0 0 0 11 1
4 0 0 0 0 0 2 3

Silver.2 Bronze.2 Combined total


0 0 2 2
1 2 8 15
2 24 28 70
3 2 9 12
4 4 5 12

FILTERING AND MERGING DATA

1. Python | Pandas dataframe.filter()


Pandas dataframe.filter() function is used to subset rows or columns of a dataframe according to
labels in the specified index. Note that this routine does not filter a dataframe on its contents. The
filter is applied to the labels of the index.
Syntax: DataFrame.filter(items=None, like=None, regex=None, axis=None)
Parameters:
items : List of info axis to restrict to (must not all be present)
like : Keep info axis where “arg in col == True”
regex : Keep info axis with re.search(regex, col) == True
axis : The axis to filter on. By default this is the info axis, ‘index’ for Series, ‘columns’ for
DataFrame
Returns : same type as input object
The items, like, and regex parameters are enforced to be mutually exclusive. axis defaults to the
info axis that is used when indexing with [].
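The three mutually exclusive parameters can be sketched on a toy frame (the column names below are made up):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A'], 'College': ['B'],
                   'Salary': [1], 'Age': [2]})

by_items = df.filter(items=['Name', 'Salary'])  # exact labels
by_like = df.filter(like='Na')                  # substring match
by_regex = df.filter(regex='[aA]')              # regex match
```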
Example #1: Use filter() function to filter out any three columns of the dataframe.

# importing pandas as pd
import pandas as pd

# Creating the dataframe


df = pd.read_csv("[Link]")

# Print the dataframe


df

Now filter the “Name”, “College” and “Salary” columns.

# applying filter function


df.filter(["Name", "College", "Salary"])

Output :

Example #2: Use filter() function to subset all columns in a dataframe which has the letter ‘a’ or
‘A’ in its name.
Note : filter() function also takes a regular expression as one of its parameter.

# importing pandas as pd
import pandas as pd

# Creating the dataframe


df = pd.read_csv("[Link]")

# Using regular expression to extract all


# columns which has letter 'a' or 'A' in its name.
df.filter(regex='[aA]')

Output :

The regular expression ‘[aA]’ looks for all column names which have an ‘a’ or an ‘A’ in them.

2. Python | Pandas dataframe.merge()


Joining two Pandas DataFrames using merge()
Syntax : DataFrame.merge(parameters)
Parameters :
 right : DataFrame or named Series
 how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’
 on : label or list
 left_on : label or list, or array-like
 right_on : label or list, or array-like
 left_index : bool, default False
 right_index : bool, default False
 sort : bool, default False
 suffixes : tuple of (str, str), default (‘_x’, ‘_y’)
 copy : bool, default True

 indicator : bool or str, default False


 validate : str, optional
Returns : A DataFrame of the two merged objects.
Example 1 : Merging two Dataframe with same number of elements :

# importing the module


import pandas as pd

# creating the first DataFrame


df1 = pd.DataFrame({"fruit" : ["apple", "banana", "avocado"],
                    "market_price" : [21, 14, 35]})
display("The first DataFrame")
display(df1)

# creating the second DataFrame


df2 = pd.DataFrame({"fruit" : ["banana", "apple", "avocado"],
                    "wholesaler_price" : [65, 68, 75]})
display("The second DataFrame")
display(df2)

# joining the DataFrames


display("The merged DataFrame")
pd.merge(df1, df2, on = "fruit", how = "inner")

Output :

Example 2 : Merging two Dataframe with different number of elements :

# importing the module


import pandas as pd

# creating the first DataFrame


df1 = pd.DataFrame({"fruit" : ["apple", "banana",
                               "avocado", "grape"],
                    "market_price" : [21, 14, 35, 38]})
display("The first DataFrame")
display(df1)

# creating the second DataFrame


df2 = pd.DataFrame({"fruit" : ["apple", "banana", "grape"],
                    "wholesaler_price" : [65, 68, 71]})
display("The second DataFrame")
display(df2)

# joining the DataFrames



# here both common DataFrame elements are in df1 and df2,


# so it extracts apple, banana, grapes from df1 and df2.
display("The merged DataFrame")
pd.merge(df1, df2, on = "fruit", how = "inner")

Output :

If we use how = "outer", it returns all elements in df1 and df2, but wherever a column has no
matching value, it returns NaN.
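A compact sketch of the outer-join NaN behaviour on two toy frames:

```python
import pandas as pd

df1 = pd.DataFrame({'fruit': ['apple', 'avocado'], 'market_price': [21, 35]})
df2 = pd.DataFrame({'fruit': ['apple', 'grape'], 'wholesaler_price': [65, 71]})

# Outer join keeps every fruit; unmatched sides get NaN
outer = pd.merge(df1, df2, on='fruit', how='outer')
```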

pd.merge(df1, df2, on = "fruit", how = "outer")

Output :

If we use how = "left", it returns all the elements that are present in the left DataFrame.

pd.merge(df1, df2, on = "fruit", how = "left")

Output :

If we use how = "right", it returns all the elements that are present in the right DataFrame.

pd.merge(df1, df2, on = "fruit", how = "right")

Output :

Merge two Pandas DataFrames on certain columns


We can merge two Pandas DataFrames on certain columns using the merge function by simply
specifying the certain columns for merge.
Syntax: DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=False, copy=True, indicator=False,
validate=None)
Example1: Let’s create a Dataframe and then merge them into a single dataframe.
Creating a Dataframe:

# importing modules
import pandas as pd

# creating a dataframe
df1 = pd.DataFrame({'Name':['Raju', 'Rani', 'Geeta', 'Sita', 'Sohit'],
                    'Marks':[80, 90, 75, 88, 59]})

# creating another dataframe with different data


df2 = pd.DataFrame({'Name':['Raju', 'Divya', 'Geeta', 'Sita'],
                    'Grade':['A', 'A', 'B', 'A'],
                    'Rank':[3, 1, 4, 2],
                    'Gender':['Male', 'Female', 'Female', 'Female']})

# display df1
display(df1)

# display df2
display(df2)

Output:

df1

df2

Now merge the dataframe:



# applying merge
df1.merge(df2[['Name', 'Grade', 'Rank']])

Output:

Merged Dataframe

The resultant dataframe contains all the columns of df1 but only the specified columns of df2,
joined on the key column Name, i.e. the result contains the Name, Marks, Grade, and Rank
columns. The two dataframes have different numbers of rows, but only the values common to
both dataframes are displayed after the merge.
Example 2: In the resultant dataframe Grade column of df2 is merged with df1 based on key
column Name with merge type left i.e. all the values of left dataframe (df1) will be displayed.

# importing modules
import pandas as pd

# creating a dataframe
df1 = pd.DataFrame({'Name':['Raju', 'Rani', 'Geeta', 'Sita', 'Sohit'],
                    'Marks':[80, 90, 75, 88, 59]})

# creating another dataframe with different data


df2 = pd.DataFrame({'Name':['Raju', 'Divya', 'Geeta', 'Sita'],
                    'Grade':['A', 'A', 'B', 'A'],
                    'Rank':[3, 1, 4, 2],
                    'Gender':['Male', 'Female', 'Female', 'Female']})
# display df1
display(df1)

# display df2
display(df2)

# applying merge with more parameters


df1.merge(df2[['Grade', 'Name']], on = 'Name', how = 'left')

Output:

df1

df2

Merged Dataframe

Example 3: In this example, we have merged df1 with df2. The Marks column of df1 is
merged with df2 and only the common values based on key column Name in both the
dataframes are displayed here.

# importing modules
import pandas as pd

# creating a dataframe
df1 = pd.DataFrame({'Name':['Raju', 'Rani', 'Geeta', 'Sita', 'Sohit'],
                    'Marks':[80, 90, 75, 88, 59]})

# creating another dataframe with different data


df2 = pd.DataFrame({'Name':['Raju', 'Divya', 'Geeta', 'Sita'],
                    'Grade':['A', 'A', 'B', 'A'],
                    'Rank':[3, 1, 4, 2],
                    'Gender':['Male', 'Female', 'Female', 'Female']})
# display df1
display(df1)

# display df2
display(df2)

# applying merge with more parameters


df2.merge(df1[['Marks', 'Name']])

Output:

df1

df2

Merged Dataframe

Pandas – Merge two dataframes with different columns

Pandas supports three kinds of data structures: Series, DataFrame, and Panel (Panel has been
removed in recent versions). A DataFrame is a two-dimensional data structure; data is stored in
a tabular format, in rows and columns. We can create a data frame in many ways.

Here we are creating a data frame using a list data structure in python.

# import required module


import pandas

# assign data
l=["vignan","it","sravan","subbarao"]

# create data frame


df = pandas.DataFrame(l)

# display dataframe
df

Output:

Here in the above example, we created a data frame. Let's merge two data frames with
different columns. It is possible to join data frames with different columns using the concat() method.
Syntax: pd.concat(objs: Union[Iterable['DataFrame'], Mapping[Label, 'DataFrame']],
axis=0, join: str = 'outer')
 DataFrame: It is the dataframe name.
 Mapping: It refers to mapping the index and dataframe columns.
 axis: 0 refers to the row axis and 1 refers to the column axis.
 join: Type of join.
Note: If a column is present in only one of the data frames, the missing values in the other
are filled with NaN.
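The NaN-filling behaviour described in the note can be seen with a minimal sketch (the frame and column names here are illustrative only, not taken from the examples below):

```python
import pandas as pd

# two frames that share only the 'Name' column
a = pd.DataFrame({'Name': ['Raju'], 'Marks': [80]})
b = pd.DataFrame({'Name': ['Divya'], 'Grade': ['A']})

# row-wise concat: each column missing from one frame is filled with NaN
out = pd.concat([a, b], axis=0, ignore_index=True)
print(out)
```

Here Marks is NaN for the row that came from the second frame, and Grade is NaN for the row that came from the first.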
Step-by-step approach:
 Open Jupyter Notebook
 Import the necessary modules
 Create the data frames
 Perform the operations
 Analyze the results
Below are some examples based on the above approach:
Example 1
In this example, we are going to concatenate the marks of students based on colleges.

# importing pandas module


import pandas as pd

# dictionary with list objects as
# values, i.e. college details
details = {
    'Name': ['Sravan', 'Sai', 'Mohan', 'Ishitha'],
    'College': ['Vignan', 'Vignan', 'Vignan', 'Vignan'],
    'Physics': [99, 76, 71, 93],
    'Chemistry': [97, 67, 65, 89],
    'Data Science': [93, 65, 65, 85]
}

# converting to dataframe using DataFrame()
df = pd.DataFrame(details)

# print data frame
df

Output:


# creating another data
details1 = {
    'Name': ['Harsha', 'Saiteja', 'abhilash', 'harini'],
    'College': ['vvit', 'vvit', 'vvit', 'vvit'],
    'Physics': [69, 76, 51, 43],
    'Chemistry': [67, 67, 55, 89],
    'Maths': [73, 65, 61, 85]
}

# create dataframe
df1 = pd.DataFrame(details1)

# display dataframe
df1

Output:

# concat dataframes
pd.concat([df, df1], axis=0, ignore_index=True)

# concat when axis = 1
pd.concat([df, df1], axis=1, ignore_index=True)

Example 2:
Storing marks and subject details
# Import pandas library
import pandas as pd

# initialize list of lists
data = [['sravan', 98.00], ['jyothika', 90.00], ['vijay', 79.34]]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Marks'])

# print dataframe
df

Output:

# initialize list of lists
data1 = [['Haseen', 88.00, 5], ['ramya', 54.00, 5], ['haritha', 56.34, 4]]

# Create the pandas DataFrame
df1 = pd.DataFrame(
    data1, columns=['Name', 'Marks', 'Total subjects registered'])

# print dataframe.
df1

Output:

# concatenating data frames
pd.concat([df, df1], axis=0, ignore_index=True)

Output:

Prevent duplicated columns when joining two Pandas DataFrames


Column duplication usually occurs when the two data frames have columns with the same
name and when the columns are not used in the JOIN statement. In this article, let us discuss
the three different methods in which we can prevent duplication of columns when joining two
data frames.
Syntax:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None)

Explanation:
 left – Dataframe which has to be joined from left
 right – Dataframe which has to be joined from the right
 how – specifies the type of join. left, right, outer, inner, cross
 on – Column names to join the two dataframes.
 left_on – Column names to join on in the left DataFrame.
 right_on – Column names to join on in the right DataFrame.

Normal merge:
When we join datasets using the pd.merge() function with type 'inner', the output will have
suffixes attached to the identical columns of the two data frames, as shown in the output.

# import python pandas package


import pandas as pd

# import the numpy package


import numpy as np

# Create sample dataframe data1 and data2


data1 = pd.DataFrame(np.random.randint(1000, size=(1000, 3)),
                     columns=['EMI', 'Salary', 'Debt'])
data2 = pd.DataFrame(np.random.randint(1000, size=(1000, 3)),
                     columns=['Salary', 'Debt', 'Bonus'])

# Merge the DataFrames


merged = pd.merge(data1, data2, how='inner', left_index=True,
                  right_index=True)
print(merged)

Output:

Method 1: Use the columns that have the same names in the join statement
In this approach to prevent duplicated columns from joining the two data frames, the user
simply needs to use the pd.merge() function with an inner join and pass as parameters the
column names that are to be joined on from the left and right data frames.
Example:
In this example, we first create the sample dataframes data1 and data2 using np.random.randint()
as shown, and then use the pd.merge() function to join the two data frames with an inner
join, explicitly mentioning the column names that are to be joined on from the left and right
data frames.

# import python pandas package


import pandas as pd

# import the numpy package


import numpy as np

# Create sample dataframe data1 and data2


data1 = pd.DataFrame(np.random.randint(100, size=(1000, 3)),
                     columns=['EMI', 'Salary', 'Debt'])
data2 = pd.DataFrame(np.random.randint(100, size=(1000, 3)),
                     columns=['Salary', 'Debt', 'Bonus'])

# Merge the DataFrames


merged = pd.merge(data1, data2, how='inner',
                  left_on=['Salary', 'Debt'],
                  right_on=['Salary', 'Debt'])

print(merged)

Output:

Method 2: Preventing duplicates by mentioning explicit suffix names for columns


In this method, to prevent duplicates while joining the columns of the two different data
frames, the user needs to use the pd.merge() function, which joins the columns of the data
frames together, and then call the drop() function with the required condition passed as a
parameter, as shown below, to remove all the duplicates from the final data frame.
drop() function:
This function is used to drop specified labels from rows or columns.
Syntax:
DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None,
inplace=False, errors='raise')
Parameters:
 labels: Index or column labels to drop.
 axis: Whether to drop labels from the index (0 or ‘index’) or columns (1 or
‘columns’). {0 or ‘index’, 1 or ‘columns’}
 index: Alternative to specifying axis (labels, axis=0 is equivalent to index=labels).
 columns: Alternative to specifying axis (labels, axis=1 is equivalent to
columns=labels).
 level: For MultiIndex, the level from which the labels will be removed.
 inplace: If True, do operation in place and return None.
 errors: If ‘ignore’, suppress error and only existing labels are dropped.
Example:
In this example, we are using the pd.merge() function to join the two data frames with an inner
join. We add a suffix '_remove' to newly joined columns that have the same name in both
data frames, then use the drop() function to remove the columns with the suffix '_remove'.
This ensures that identical columns don't exist in the new dataframe.

# import python pandas package


import pandas as pd

# import the numpy package


import numpy as np

# Create sample dataframe data1 and data2


data1 = pd.DataFrame(np.random.randint(100, size=(1000, 3)),
                     columns=['EMI', 'Salary', 'Debt'])
data2 = pd.DataFrame(np.random.randint(100, size=(1000, 3)),
                     columns=['Salary', 'Debt', 'Bonus'])

# Merge the DataFrames


df_merged = pd.merge(data1, data2, how='inner', left_index=True,
                     right_index=True, suffixes=('', '_remove'))

# remove the duplicate columns


df_merged.drop([i for i in df_merged.columns if 'remove' in i],
               axis=1, inplace=True)

print(df_merged)

Output:

Method 3: Remove the duplicate columns before merging two columns


In this method, the user first calls the difference() function to find the columns of the second
data frame that are not present in the first, and then calls the merge() function to join the first
data frame with only those unique columns, so the identical columns appear just once.
Difference function:
This function returns a set that contains the difference between two sets.
Syntax:
set.difference(set)
Parameters:
 set :The set to check for differences in
Example:
In this example, we use the difference() function to find the columns of data2 that are not in
data1 and store the dataframe with those unique columns as a new dataframe. We then use the
pd.merge() function to join the left dataframe with the unique-column dataframe using an
'inner' join. This ensures that no columns are duplicated in the merged dataset.

# import python pandas package


import pandas as pd

# import the numpy package


import numpy as np

# Create sample dataframe data1 and data2


data1 = pd.DataFrame(np.random.randint(100, size=(1000, 3)),
                     columns=['EMI', 'Salary', 'Debt'])
data2 = pd.DataFrame(np.random.randint(100, size=(1000, 3)),
                     columns=['Salary', 'Debt', 'Bonus'])

# Find the columns that aren't in the first DataFrame


different_cols = data2.columns.difference(data1.columns)

# Filter out the columns that are different.


# You could pass in the df2[diff_cols]
# directly into the merge as well.
data3 = data2[different_cols]

# Merge the DataFrames


df_merged = pd.merge(data1, data3, left_index=True,
                     right_index=True, how='inner')

print(df_merged)

Output:

RESHAPING PANDAS DATAFRAMES USING MELT AND UNMELT


We can reshape the data into a more computer-friendly form using Pandas in Python.
pandas.melt() is one of the functions to do so.
pandas.melt() unpivots a DataFrame from wide format to long format.
The melt() function is useful to massage a DataFrame into a format where one or more columns
are identifier variables, while all other columns, considered measured variables, are unpivoted
to the row axis, leaving just two non-identifier columns, variable and value.
Syntax :
pandas.melt(frame, id_vars=None, value_vars=None,
            var_name=None, value_name='value', col_level=None)


Parameters:
frame : DataFrame
id_vars[tuple, list, or ndarray, optional] : Column(s) to use as identifier variables.
value_vars[tuple, list, or ndarray, optional]: Column(s) to unpivot. If not specified, uses all
columns that are not set as id_vars.
var_name[scalar]: Name to use for the 'variable' column. If None it uses frame.columns.name
or 'variable'.
value_name[scalar, default ‘value’]: Name to use for the ‘value’ column.
col_level[int or string, optional]: If columns are a MultiIndex then use this level to melt.
Example:

# Create a simple dataframe

# importing pandas as pd
import pandas as pd

# creating a dataframe
df = pd.DataFrame({'Name': {0: 'John', 1: 'Bob', 2: 'Shiela'},
                   'Course': {0: 'Masters', 1: 'Graduate', 2: 'Graduate'},
                   'Age': {0: 27, 1: 23, 2: 21}})
df

# Name is id_vars and Course is value_vars
pd.melt(df, id_vars=['Name'], value_vars=['Course'])

# multiple unpivot columns
pd.melt(df, id_vars=['Name'], value_vars=['Course', 'Age'])

# Names of 'variable' and 'value' columns can be customized
pd.melt(df, id_vars=['Name'], value_vars=['Course'],
        var_name='ChangedVarname', value_name='ChangedValname')

Python | pandas.pivot()
The pivot(index, columns, values) function produces a pivot table based on 3 columns of
the DataFrame. It uses unique values from index/columns and fills with values.
Parameters:
index[ndarray] : Labels to use to make new frame’s index
columns[ndarray] : Labels to use to make new frame’s columns
values[ndarray] : Values to use for populating new frame’s values
Returns: Reshaped DataFrame
Exception: ValueError raised if there are any duplicates.
Code:

# Create a simple dataframe

# importing pandas as pd
import pandas as pd

# creating a dataframe
df = pd.DataFrame({'A': ['John', 'Boby', 'Mina'],
                   'B': ['Masters', 'Graduate', 'Graduate'],
                   'C': [27, 23, 21]})

df

# values can be an object or a list
df.pivot('A', 'B', 'C')

# value is a list
df.pivot(index='A', columns='B', values=['C', 'A'])

Raise ValueError when there are any index, columns combinations with multiple values.

# importing pandas as pd
import pandas as pd

# creating a dataframe
df = pd.DataFrame({'A': ['John', 'John', 'Mina'],
                   'B': ['Masters', 'Masters', 'Graduate'],
                   'C': [27, 23, 21]})

df.pivot('A', 'B', 'C')

ValueError: Index contains duplicate entries, cannot reshape

Sometimes we need to reshape a Pandas data frame to perform analysis in a better way.
Reshaping plays a crucial role in data analysis. Pandas provides functions
like melt and unmelt for reshaping.
pandas.melt()
melt() is used to convert a wide dataframe into a longer form. This function can be used when
there are requirements to consider a specific column as an identifier.
Syntax: pandas.melt(frame, id_vars=None, value_vars=None, var_name=None,
value_name='value', col_level=None)

Example 1:
Initialize the dataframe with data regarding ‘Days‘, ‘Patients‘ and ‘Recovery‘.
# importing pandas library
import pandas as pd

# creating and initializing a list
values = [['Monday', 65000, 50000],
          ['Tuesday', 68000, 45000],
          ['Wednesday', 70000, 55000],
          ['Thursday', 60000, 47000],
          ['Friday', 49000, 25000],
          ['Saturday', 54000, 35000],
          ['Sunday', 100000, 70000]]

# creating a pandas dataframe
df = pd.DataFrame(values, columns=['DAYS', 'PATIENTS', 'RECOVERY'])

# displaying the data frame
df

Output:

Now, we reshape the data frame using melt() around the column 'DAYS'.
# melting with DAYS as column identifier
reshaped_df = df.melt(id_vars=['DAYS'])

# displaying the reshaped data frame


reshaped_df

Output:

Example 2:
Now, to the dataframe used above a new column named ‘Deaths‘ is introduced.

# importing pandas library
import pandas as pd

# creating and initializing a dataframe
values = [['Monday', 65000, 50000, 1500],
          ['Tuesday', 68000, 45000, 7250],
          ['Wednesday', 70000, 55000, 1400],
          ['Thursday', 60000, 47000, 4200],
          ['Friday', 49000, 25000, 3000],
          ['Saturday', 54000, 35000, 2000],
          ['Sunday', 100000, 70000, 4550]]

# creating a pandas dataframe
df = pd.DataFrame(values,
                  columns=['DAYS', 'PATIENTS', 'RECOVERY', 'DEATHS'])

# displaying the data frame
df

Output:

We reshape the data frame using melt() around the column 'PATIENTS'.

# reshaping data frame
# using melt()
reshaped_df = df.melt(id_vars=['PATIENTS'])

# displaying the reshaped data frame


reshaped_df

Output:

pandas.pivot() / unmelt function

Pivoting, unmelting or reverse melting is used to convert a column with multiple values
into several columns of their own.
Syntax: DataFrame.pivot(index=None, columns=None, values=None)

Example 1:
Create a dataframe that contains the data on ID, Name, Marks and Sports of 6 students.

# importing pandas library
import pandas as pd

# creating and initializing a list
values = [[101, 'Rohan', 455, 'Football'],
          [111, 'Elvish', 250, 'Chess'],
          [192, 'Deepak', 495, 'Cricket'],
          [201, 'Sai', 400, 'Ludo'],
          [105, 'Radha', 350, 'Badminton'],
          [118, 'Vansh', 450, 'Badminton']]

# creating a pandas dataframe
df = pd.DataFrame(values,
                  columns=['ID', 'Name', 'Marks', 'Sports'])

# displaying the data frame
df

Output:

Unmelting around the column Sports:

# unmelting
reshaped_df = df.pivot(index='Name', columns='Sports')

# displaying the reshaped data frame
reshaped_df

Output:

Example 2:
Consider the same dataframe used in the example above. Unmelting can be done based on
more than one column also.

reshaped_df = df.pivot('ID', 'Marks', 'Sports')

# displaying the reshaped data frame
reshaped_df

Output:

But the reshaped dataframe appears a little different from the original one in terms of its index.
To get an index like the original dataframe, use the reset_index() function on the reshaped
dataframe.

reshaped_df = df.pivot('ID', 'Marks', 'Sports')

# resetting the index
df_new = reshaped_df.reset_index()

# displaying the reshaped data frame
df_new

Output:

DATA AGGREGATION
Definition and Usage

The aggregate() method allows you to apply a function, or a list of function names, to be executed
along one of the axes of the DataFrame; the default is 0, which is the index (row) axis.

Note: the agg() method is an alias of the aggregate() method.


Syntax
DataFrame.aggregate(func, axis, args, kwargs)
Parameters

The axis parameter is a keyword argument.

func: Required. A function, function name, or a list of function names to apply to the DataFrame.

axis: Optional. 0, 1, 'index', or 'columns'; which axis to apply the function to, default 0.

args: Optional. Arguments to send into the function.

kwargs: Optional. Keyword arguments to send into the function.

Return Value:

A DataFrame or a Series object, with the changes.

This function does NOT make changes to the original DataFrame object.
The DataFrame.aggregate() function is used to apply some aggregation across one or more
columns. Aggregate using a callable, string, dict, or list of strings/callables. The most
frequently used aggregations are:
sum: Return the sum of the values for the requested axis
min: Return the minimum of the values for the requested axis
max: Return the maximum of the values for the requested axis
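Since the examples below read from a CSV file, here is a self-contained sketch of list-style aggregation (the frame and its values are illustrative only):

```python
import pandas as pd

# a small stand-in for the CSV-based dataframe used below
df = pd.DataFrame({'Age': [23, 31, 27], 'Salary': [50, 70, 60]})

# apply both aggregations to every numeric column
result = df.aggregate(['sum', 'min'])
print(result)
```

The result is a DataFrame with one row per aggregation ('sum' and 'min') and one column per numeric column of the original frame.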

Example #1: Aggregate ‘sum’ and ‘min’ function across all the columns in data frame.

# importing pandas package
import pandas as pd

# making data frame from csv file
df = pd.read_csv("nba.csv")

# printing the first 10 rows of the dataframe
df[:10]

Aggregation works with only numeric type columns.

# Applying aggregation across all the columns:
# sum and min will be found for each
# numeric type column in df dataframe
df.aggregate(['sum', 'min'])

Output:
For each column having numeric values, the minimum and the sum of all values have been
found. For the dataframe df, we have four such columns: Number, Age, Weight, Salary.

Example #2:
In Pandas, we can also apply different aggregation functions across different columns. For that,
we need to pass a dictionary with key containing the column names and values containing the
list of aggregation functions for any specific column.

# importing pandas package
import pandas as pd

# making data frame from csv file
df = pd.read_csv("nba.csv")

# We are going to find aggregation for these columns
df.aggregate({"Number": ['sum', 'min'],
              "Age": ['max', 'min'],
              "Weight": ['min', 'sum'],
              "Salary": ['sum']})

Output:
A separate aggregation has been applied to each column; if a specific aggregation is not
applied to a column, it has a NaN value corresponding to it.
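The per-column NaN behaviour can be reproduced with a self-contained sketch (the column names and values here are illustrative, not taken from the CSV example):

```python
import pandas as pd

df = pd.DataFrame({'Age': [23, 31, 27], 'Weight': [80, 72, 68]})

# different aggregations per column; combinations that were not
# requested come out as NaN in the result
result = df.aggregate({'Age': ['max', 'min'], 'Weight': ['sum']})
print(result)
```

The result's index is the union of all requested aggregations; 'sum' was not requested for Age, so that cell is NaN.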

GROUPING OF DATA

DataFrame.groupby():
Pandas groupby is used for grouping the data according to categories and applying a function
to the categories. It also helps to aggregate data efficiently.
The pandas DataFrame.groupby() function is used to split the data into groups based on some
criteria. Pandas objects can be split on any of their axes. The abstract definition of grouping is
to provide a mapping of labels to group names.
Syntax: DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True,
group_keys=True, squeeze=False, **kwargs)
Parameters :
by : mapping, function, str, or iterable
axis : int, default 0
level : If the axis is a MultiIndex (hierarchical), group by a particular level or levels
as_index : For aggregated output, return object with group labels as the index. Only relevant
for DataFrame input. as_index=False is effectively “SQL-style” grouped output
sort : Sort group keys. Get better performance by turning this off. Note this does not influence
the order of observations within each group. groupby preserves the order of rows within each
group.
group_keys : When calling apply, add group keys to index to identify pieces
squeeze : Reduce the dimensionality of the return type if possible, otherwise return a consistent
type
Returns : GroupBy object
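Because the examples below depend on a CSV file, here is a self-contained sketch of splitting and aggregating (the team names and points are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({'Team': ['A', 'A', 'B', 'B'],
                   'Points': [10, 20, 5, 15]})

# split the rows into groups by Team, then aggregate each group
gk = df.groupby('Team')
totals = gk['Points'].sum()
print(totals)

# get_group() retrieves the rows belonging to a single group
print(gk.get_group('A'))
```

The GroupBy object itself holds no result; the aggregation is computed only when a function such as sum() is applied to it.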

Example #1: Use groupby() function to group the data based on the “Team”.

# importing pandas as pd
import pandas as pd

# Creating the dataframe
df = pd.read_csv("nba.csv")

# Print the dataframe
df

Now apply the groupby() function.

# applying groupby() function to
# group the data on Team value
gk = df.groupby('Team')

# Let's print the first entries
# in all the groups formed
gk.first()

Output :

Let's print the values contained in any one of the groups. For that, use the name of the team.
We use the get_group() function to find the entries contained in any of the groups.

# Finding the values contained in the "Boston Celtics" group
gk.get_group('Boston Celtics')

Output :

Example #2: Use groupby() function to form groups based on more than one category (i.e. Use
more than one column to perform the splitting).

# importing pandas as pd
import pandas as pd

# Creating the dataframe
df = pd.read_csv("nba.csv")

# First grouping based on "Team";
# within each team we are grouping based on "Position"
gkk = df.groupby(['Team', 'Position'])

# Print the first value in each group
gkk.first()

Output :

groupby() is a very powerful function with a lot of variations. It makes the task of splitting the
dataframe over some criteria really easy and efficient.

Matplotlib (Python Plotting Library)

Human minds are more adaptive to visual representations of data than to textual data. We can
easily understand things when they are visualized. It is better to represent the data through
graphs, where we can analyze the data more efficiently and make specific decisions according
to the data analysis. Before learning matplotlib, we need to understand data visualization and
why data visualization is important.

Data Visualization

Graphics provide an excellent approach for exploring data, which is essential for presenting
results. Data visualization is a relatively new term. It expresses the idea that involves more
than just representing data in graphical form (instead of textual form).

This can be very helpful when discovering and getting to know a dataset and can help with
classifying patterns, corrupt data, outliers, and much more. With a little domain knowledge, data
visualizations can be used to express and demonstrate key relationships in plots and charts.
Statistics focuses on quantitative descriptions and estimations of data; visualization provides an
important set of tools for gaining a qualitative understanding.

There are five key plots that are used for data visualization.

Why do we need data visualization?

Data visualization can perform the tasks below:


o It identifies areas that need improvement and attention.
o It clarifies the factors.
o It helps to understand which product to place where.
o Predict sales volumes.

Matplotlib is a multi-platform Python data visualization library built on NumPy arrays. It can
be used in Python scripts, the shell, web applications, and other graphical user interface
toolkits.

John D. Hunter originally conceived matplotlib in 2002. It has an active development
community and is distributed under a BSD-style license. Its first version was released in 2003,
and version 3.1.1 was released on 1 July 2019.

Matplotlib 2.0.x supports Python versions 2.7 through 3.6. Python 3 support started
with Matplotlib 1.2. Matplotlib 1.4 is the last version that supports Python 2.6.

The General Concept of Matplotlib

A Matplotlib figure can be categorized into various parts as below:



Figure: It is a whole figure which may hold one or more axes (plots). We can think of a Figure as
a canvas that holds plots.

Axes: A Figure can contain several Axes. It consists of two or three (in the case of 3D) Axis
objects. Each Axes is comprised of a title, an x-label, and a y-label.

Axis: Axis objects are number-line-like objects responsible for generating the graph limits.

Artist: An artist is everything we can see on the graph, such as Text objects, Line2D objects,
and collection objects. Most artists are tied to Axes.

We will be plotting two lists containing the X, Y coordinates for the plot.
Example:

import matplotlib.pyplot as plt

x = [10, 20, 30, 40]
y = [20, 30, 40, 50]

# plotting the data
plt.plot(x, y)

# Adding the title
plt.title("Simple Plot")

# Adding the labels
plt.ylabel("y-axis")
plt.xlabel("x-axis")
plt.show()

Output:
[Figure: a line plot titled "Simple Plot" with labeled x-axis and y-axis]

In the above example, the elements of x and y provide the coordinates for the x-axis and
y-axis, and a straight line is plotted against those coordinates.

Pyplot
Pyplot is a Matplotlib module that provides a MATLAB-like interface. Pyplot provides
functions that interact with the figure i.e. creates a figure, decorates the plot with labels, and
creates a plotting area in a figure.
Syntax:
matplotlib.pyplot.plot(*args, scalex=True, scaley=True, data=None, **kwargs)
Example:

# Python program to show pyplot module
import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.axis([0, 6, 0, 20])
plt.show()

Output:

Matplotlib takes care of the creation of built-in defaults like Figure and Axes.
 Figure: This class is the top-level container for all the plots means it is the overall
window or page on which everything is drawn. A figure object can be considered as a
box-like container that can hold one or more axes.
 Axes: This class is the most basic and flexible component for creating sub-plots. You
might confuse axes as the plural of axis but it is an individual plot or graph. A given
figure may contain many axes but given axes can only be in one figure.
1. Figure class
Figure class is the top-level container that contains one or more axes. It is the overall
window or page on which everything is drawn.
Syntax:
class matplotlib.figure.Figure(figsize=None, dpi=None (resolution), facecolor=None
(background colour), edgecolor=None (border colour), linewidth=0.0, **kwargs)
Example 1

# Python program to show pyplot module
import matplotlib.pyplot as plt
from matplotlib.figure import Figure

# Creating a new figure with width = 5 inches and height = 4 inches
fig = plt.figure(figsize=(5, 4))

# Creating a new axes for the figure
ax = fig.add_axes([1, 1, 1, 1])

# Adding the data to be plotted
ax.plot([2, 3, 4, 5, 5, 6, 6], [5, 7, 1, 3, 4, 6, 8])
plt.show()

Output:

Example 2: Creating multiple plots

import matplotlib.pyplot as plt
from matplotlib.figure import Figure

fig = plt.figure(figsize=(5, 4))

# Creating first axes for the figure
ax1 = fig.add_axes([1, 1, 1, 1])

# Creating second axes for the figure
ax2 = fig.add_axes([1, 0.5, 0.5, 0.5])

# Adding the data to be plotted
ax1.plot([2, 3, 4, 5, 5, 6, 6], [5, 7, 1, 3, 4, 6, 8])
ax2.plot([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
plt.show()

Output:

2. Axes Class
The Axes class is the most basic and flexible unit for creating sub-plots. A given figure may
contain many axes, but a given axes can only be present in one figure. The axes() function
creates the axes object.
Syntax:
matplotlib.pyplot.axes([left, bottom, width, height])
Example 1:

import matplotlib.pyplot as plt
from matplotlib.figure import Figure

# Creating the axes object with argument as [left, bottom, width, height]
ax = plt.axes([1, 1, 1, 1])

Output:

Example 2:

import matplotlib.pyplot as plt
from matplotlib.figure import Figure

fig = plt.figure(figsize=(5, 4))

# Adding the axes to the figure
ax = fig.add_axes([1, 1, 1, 1])

# plotting 1st dataset to the figure
ax1 = ax.plot([1, 2, 3, 4], [1, 2, 3, 4])

# plotting 2nd dataset to the figure
ax2 = ax.plot([1, 2, 3, 4], [2, 3, 4, 5])
plt.show()

Output:

Axes Limits and Tick Labels

You might have seen that Matplotlib automatically sets the values and the
markers (points) of the x and y axis; however, it is possible to set the limits and markers
manually. The set_xlim() and set_ylim() functions are used to set the limits of the x-axis and
y-axis respectively. Similarly, the set_xticklabels() and set_yticklabels() functions are used to
set tick labels.
Example:

import matplotlib.pyplot as plt
from matplotlib.figure import Figure

x = [3, 1, 3]
y = [3, 2, 1]

fig = plt.figure(figsize=(5, 4))
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])
ax.plot(x, y)
ax.set_xlim(1, 3)
ax.set_xticklabels(("one", "two", "three", "four", "five", "six"))
plt.show()

Output:

Multiple Plots
What if you want to draw multiple plots in the same figure? This can be done in multiple
ways. One way was discussed above, using the add_axes() method of the Figure class.

Method 1: Using the add_axes() method


The add_axes() method of the figure module of the matplotlib library is used to add axes to the figure.
Syntax:
add_axes(self, *args, **kwargs)
Example:

import matplotlib.pyplot as plt
from matplotlib.figure import Figure

fig = plt.figure(figsize=(5, 4))
ax1 = fig.add_axes([0.1, 0.1, 0.8, 0.8])
ax2 = fig.add_axes([0.5, 0.5, 0.3, 0.3])
ax1.plot([5, 4, 3, 2, 1], [2, 3, 4, 5, 6])
ax2.plot([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
plt.show()

Output:

The add_axes() method adds the plot in the same figure by creating another axes object.

Method 2: Using subplot() method.


This method adds another plot to the current figure at the specified grid position.
Syntax:
subplot(nrows, ncols, index, **kwargs)
subplot(pos, **kwargs)
subplot(ax)

Example:
import matplotlib.pyplot as plt

# data to display on plots
x = [3, 1, 3]
y = [3, 2, 1]
z = [1, 3, 1]

plt.figure()

# adding first subplot in a grid of 1 row and 2 columns, 1st position
plt.subplot(121)
plt.plot(x, y)

# adding second subplot in a grid of 1 row and 2 columns, 2nd position
plt.subplot(122)
plt.plot(z, y)

Output:

Note: The subplot() function has the following disadvantages –


 It does not allow adding multiple subplots at the same time.
 It deletes the preexisting plot of the figure.

Method 3: Using subplots() method


This function is used to create figure and multiple subplots at the same time.
Syntax:
matplotlib.pyplot.subplots(nrows=1, ncols=1, sharex=False, sharey=False, squeeze=True,
subplot_kw=None, gridspec_kw=None, **fig_kw)
Example:

import matplotlib.pyplot as plt

# Creating the figure and subplots according to the arguments passed
fig, axes = plt.subplots(1, 2)

# plotting the data in the 1st subplot
axes[0].plot([1, 2, 3, 4], [1, 2, 3, 4])

# plotting more data in the 1st subplot only
axes[0].plot([1, 2, 3, 4], [4, 3, 2, 1])

# plotting the data in the 2nd subplot only
axes[1].plot([1, 2, 3, 4], [1, 1, 1, 1])

Output:

Method 4: Using subplot2grid() method


This function gives additional flexibility in creating axes object at a specified location
inside a grid. It also helps in spanning the axes object across multiple rows or columns. In
simpler words, this function is used to create multiple charts within the same figure.
Syntax:
plt.subplot2grid(shape, location, rowspan, colspan)
Example:

import matplotlib.pyplot as plt

# data to display on plots
x = [3, 1, 3]
y = [3, 2, 1]
z = [1, 3, 1]

# adding the subplots: shape(rows, cols), loc(row no., col no.) to place each plot
axes1 = plt.subplot2grid((7, 1), (0, 0), rowspan=2, colspan=1)
axes2 = plt.subplot2grid((7, 1), (2, 0), rowspan=2, colspan=1)
axes3 = plt.subplot2grid((7, 1), (4, 0), rowspan=2, colspan=1)

# plotting the data
axes1.plot(x, y)
axes2.plot(x, z)
axes3.plot(z, y)

Output:

EXPLORING PLOT TYPES


1. Line graph
The line graph is one of the charts which shows information as a series of connected data
points. The graph is plotted by the plot() function. The line graph is simple to plot; let's
consider the following example:
from matplotlib import pyplot as plt

x = [4, 8, 9]
y = [10, 12, 15]

plt.plot(x, y)
plt.title("Line graph")
plt.ylabel('Y axis')
plt.xlabel('X axis')
plt.show()

Output:

We can customize the graph by importing the style module. The style module will be built into a
matplotlib installation. It contains the various functions to make the plot more attractive. In the
below program, we are using the style module:
from matplotlib import pyplot as plt
from matplotlib import style

style.use('ggplot')

x = [16, 8, 10]
y = [8, 16, 6]
x2 = [8, 15, 11]
y2 = [6, 15, 7]

plt.plot(x, y, 'r', label='line one', linewidth=5)
plt.plot(x2, y2, 'm', label='line two', linewidth=5)
plt.title('Epic Info')
plt.ylabel('Y axis')
plt.xlabel('X axis')
plt.legend()
plt.grid(True, color='k')
plt.show()

Output:

2. Bar graphs

Bar graphs are one of the most common types of graphs and are used to show data associated with
categorical variables. Matplotlib provides bar() to make bar graphs, which accepts arguments
such as the categorical variables, their values, and a color.

from matplotlib import pyplot as plt

players = ['Virat', 'Rohit', 'Shikhar', 'Hardik']
runs = [51, 87, 45, 67]
plt.bar(players, runs, color='green')
plt.title('Score Card')
plt.xlabel('Players')
plt.ylabel('Runs')
plt.show()

Output:

Another function, barh(), is used to make horizontal bar graphs. Both functions accept xerr or
yerr as arguments (yerr in the case of vertical graphs) to depict the variance in our data, as follows:

from matplotlib import pyplot as plt
players = ['Virat', 'Rohit', 'Shikhar', 'Hardik']
runs = [51, 87, 45, 67]
plt.barh(players, runs, color='green')
plt.title('Score Card')
plt.xlabel('Runs')
plt.ylabel('Players')
plt.show()
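The yerr argument mentioned above does not appear in the listing itself, so here is a minimal sketch of it; the spread values are made up purely for illustration:

```python
from matplotlib import pyplot as plt

players = ['Virat', 'Rohit', 'Shikhar', 'Hardik']
runs = [51, 87, 45, 67]
spread = [5, 10, 4, 8]  # hypothetical variance for each player

# yerr draws a vertical error bar on each bar; capsize adds end caps
bars = plt.bar(players, runs, yerr=spread, capsize=4, color='green')
plt.title('Score Card')
plt.xlabel('Players')
plt.ylabel('Runs')
plt.show()
```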

Output:

Let's have a look at another example using the style module:

from matplotlib import pyplot as plt
from matplotlib import style
style.use('ggplot')
x = [5, 8, 10]
y = [12, 16, 6]
x1 = [6, 9, 11]
y1 = [7, 15, 7]
plt.bar(x, y, color='y', align='center', label='first set')
plt.bar(x1, y1, color='c', align='center', label='second set')
plt.title('Information')
plt.ylabel('Y axis')
plt.xlabel('X axis')
plt.legend(loc="upper right")
plt.show()

Output:

3. Pie Chart

A pie chart is a circular graph that is broken down into segments, or slices of pie. It is generally
used to represent percentage or proportional data, where each slice of pie represents a particular
category. Let's have a look at the below example:

from matplotlib import pyplot as plt
# Pie chart, where the slices will be ordered and plotted counter-clockwise:
Players = 'Rohit', 'Virat', 'Shikhar', 'Yuvraj'
Runs = [45, 30, 15, 10]
explode = (0.1, 0, 0, 0)  # it "explodes" the 1st slice
fig1, ax1 = plt.subplots()
ax1.pie(Runs, explode=explode, labels=Players, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

Output:

4. Histogram

First, we need to understand the difference between the bar graph and histogram. A histogram is
used for the distribution, whereas a bar chart is used to compare different entities. A histogram is
a type of bar plot that shows the frequency of values falling within a set of value ranges.

For example, we take the data of different age groups of people and plot a histogram with
respect to the bins. A bin represents a range of values; the data is divided into a series of intervals,
and bins are generally created of the same size.

from matplotlib import pyplot as plt

population_age = [21, 53, 60, 49, 25, 27, 30, 42, 40, 1, 2, 102, 95, 8, 15, 105,
                  70, 65, 55, 70, 75, 60, 52, 44, 43, 42, 45]
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
plt.hist(population_age, bins, histtype='bar', rwidth=0.8)
plt.xlabel('age groups')
plt.ylabel('Number of people')
plt.title('Histogram')
plt.show()

Output:

Let's consider another example of plotting a histogram:

from matplotlib import pyplot as plt
# Importing the NumPy library
import numpy as np

plt.style.use('fivethirtyeight')
mu = 50
sigma = 7
x = np.random.normal(mu, sigma, size=200)
fig, ax = plt.subplots()
ax.hist(x, 20)
ax.set_title('Histogram')
ax.set_xlabel('bin range')
ax.set_ylabel('frequency')

fig.tight_layout()
plt.show()

Output:

5. Scatter plot

Scatter plots are mostly used for comparing variables, when we need to define how much one
variable is affected by another. The data is displayed as a collection of points. Each point takes
the value of one variable to define its position on the horizontal axis, and the value of the
other variable to define its position on the vertical axis.

Let's consider the following simple example:

Example-1:
from matplotlib import pyplot as plt
from matplotlib import style
style.use('ggplot')
x = [5, 7, 10]
y = [18, 10, 6]
x2 = [6, 9, 11]
y2 = [7, 14, 17]
plt.scatter(x, y)
plt.scatter(x2, y2, color='g')
plt.title('Epic Info')
plt.ylabel('Y axis')
plt.xlabel('X axis')
plt.show()

Output:

Example-2
import matplotlib.pyplot as plt
x = [2, 2.5, 3, 3.5, 4.5, 4.7, 5.0]
y = [7.5, 8, 8.5, 9, 9.5, 10, 10.5]

x1 = [9, 8.5, 9, 9.5, 10, 10.5, 12]
y1 = [3, 3.5, 4.7, 4, 4.5, 5, 5.2]
plt.scatter(x, y, label='high income low saving', color='g')
plt.scatter(x1, y1, label='low income high savings', color='r')
plt.xlabel('saving*100')
plt.ylabel('income*1000')
plt.title('Scatter Plot')
plt.legend()
plt.show()

Output:

Legend and Annotations in Python

i. matplotlib.pyplot.legend()
A legend is an area describing the elements of the graph. In the matplotlib library, there’s a
function called legend() which is used to place a legend on the axes.
The attribute loc in legend() is used to specify the location of the legend. The default value
of loc is loc="best". The strings ‘upper left’, ‘upper right’, ‘lower left’, ‘lower right’
place the legend at the corresponding corner of the axes/figure.
The attribute bbox_to_anchor=(x, y) of the legend() function is used to specify the coordinates of
the legend, and the attribute ncol represents the number of columns that the legend has. Its
default value is 1.
Syntax:
plt.legend(["blue", "green"], bbox_to_anchor=(0.75, 1.15), ncol=2)
The following are some more attributes of the legend() function:
 shadow: [None or bool] Whether to draw a shadow behind the legend. Its default
value is None.
 markerscale: [None or int or float] The relative size of legend markers compared
with the originally drawn ones. The default is None.
 numpoints: [None or int] The number of marker points in the legend when creating
a legend entry for a Line2D (line). The default is None.
 fontsize: The font size of the legend. If the value is numeric, the size will be the
absolute font size in points.
 facecolor: [None or “inherit” or color] The legend’s background color.
 edgecolor: [None or “inherit” or color] The legend’s background patch edge color.
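The examples that follow do not exercise these extra attributes, so here is a minimal sketch showing a few of them together; the data and colors are arbitrary:

```python
import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [1, 2, 3], label='y = x')
plt.plot([1, 2, 3], [1, 4, 9], label='y = x^2')

# shadow, fontsize and facecolor are the attributes described above
leg = plt.legend(loc='upper left', shadow=True,
                 fontsize=12, facecolor='lightyellow')
plt.show()
```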

Ways to use the legend() function in Python –


Example 1:

import numpy as np
import matplotlib.pyplot as plt

# X-axis values
x = [1, 2, 3, 4, 5]

# Y-axis values
y = [1, 4, 9, 16, 25]

# Function to plot
plt.plot(x, y)

# Function to add a legend
plt.legend(['single element'])

# Function to show the plot
plt.show()

Output :

Example 2:

# importing modules
import numpy as np
import matplotlib.pyplot as plt

# Y-axis values
y1 = [2, 3, 4.5]

# Y-axis values
y2 = [1, 1.5, 5]

# Function to plot
plt.plot(y1)
plt.plot(y2)

# Function to add a legend
plt.legend(["blue", "green"], loc="lower right")

# Function to show the plot
plt.show()

Output :

Example 3:

import numpy as np
import matplotlib.pyplot as plt

# X-axis values
x = np.arange(5)

# Y-axis values
y1 = [1, 2, 3, 4, 5]

# Y-axis values
y2 = [1, 4, 9, 16, 25]

# Function to plot
plt.plot(x, y1, label='Numbers')
plt.plot(x, y2, label='Square of numbers')

# Function to add a legend
plt.legend()

# Function to show the plot
plt.show()

Output :

Example 4:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 1000)

fig, ax = plt.subplots()

ax.plot(x, np.sin(x), '--b', label='Sine')
ax.plot(x, np.cos(x), c='r', label='Cosine')
ax.axis('equal')

leg = ax.legend(loc="lower left")
plt.show()

Output:

Example 5:

# importing modules
import numpy as np
import matplotlib.pyplot as plt

# X-axis values
x = [0, 1, 2, 3, 4, 5, 6, 7, 8]

# Y-axis values
y1 = [0, 3, 6, 9, 12, 15, 18, 21, 24]
# Y-axis values
y2 = [0, 1, 2, 3, 4, 5, 6, 7, 8]

# Function to plot
plt.plot(y1, label="y = 3x")
plt.plot(y2, label="y = x")

# Function to add a legend
plt.legend(bbox_to_anchor=(0.75, 1.15), ncol=2)

# Function to show the plot
plt.show()

Output:

ii. matplotlib.pyplot.annotate() Function


The annotate() function in the pyplot module of the matplotlib library is used to annotate the point xy
with text s.
Syntax: annotate(s, xy, xytext=None, xycoords='data', textcoords=None,
arrowprops=None, annotation_clip=None, **kwargs)
Parameters: This method accepts the following parameters that are described below:
 s: This parameter is the text of the annotation.
 xy: This parameter is the point (x, y) to annotate.
 xytext: This parameter is an optional parameter. It is the position (x, y) to place
the text at.
 xycoords: This parameter is also an optional parameter and contains a string
value. It is the coordinate system that xy is given in.
 textcoords: This parameter contains a string value. It is the coordinate system that xytext
is given in, which may be different from the coordinate system used for xy.
 arrowprops: This parameter is also an optional parameter and contains a dict
type. Its default value is None.
 annotation_clip: This parameter is also an optional parameter and contains a
boolean value. Its default value is None, which behaves as True.
Returns: This method returns the annotation.

Below examples illustrate the matplotlib.pyplot.annotate() function:


Example #1:

# Implementation of the matplotlib.pyplot.annotate()
# function

import matplotlib.pyplot as plt
import numpy as np

fig, geeeks = plt.subplots()

t = np.arange(0.0, 5.0, 0.001)
s = np.cos(3 * np.pi * t)
line = geeeks.plot(t, s, lw=2)

# Annotation
geeeks.annotate('Local Max', xy=(3.3, 1),
                xytext=(3, 1.8),
                arrowprops=dict(facecolor='green',
                                shrink=0.05),)

geeeks.set_ylim(-2, 2)

# Plot the annotation in the graph
plt.show()

Output:

Example #2:

# Implementation of the matplotlib.pyplot.annotate()
# function

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 10, 0.005)
y = np.exp(-x / 3.) * np.sin(3 * np.pi * x)

fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_xlim(0, 10)
ax.set_ylim(-1, 1)

# Setting up the parameters
xdata, ydata = 5, 0
xdisplay, ydisplay = ax.transData.transform((xdata, ydata))

bbox = dict(boxstyle="round", fc="0.8")

arrowprops = dict(
    arrowstyle="->",
    connectionstyle="angle, angleA=0, angleB=90, rad=10")

offset = 72

# Annotation placed relative to the data point
ax.annotate('data = (%.1f, %.1f)' % (xdata, ydata),
            (xdata, ydata), xytext=(-2 * offset, offset),
            textcoords='offset points',
            bbox=bbox, arrowprops=arrowprops)

disp = ax.annotate('display = (%.1f, %.1f)' % (xdisplay, ydisplay),
                   (xdisplay, ydisplay), xytext=(0.5 * offset, -offset),
                   xycoords='figure pixels',
                   textcoords='offset points',
                   bbox=bbox, arrowprops=arrowprops)

# To display the annotation
plt.show()

Output:
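The 'axes fraction' coordinate system mentioned in the parameter list is not used in the examples above, so here is a minimal sketch of it; the data, point, and label positions are arbitrary:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

# xy is given in data coordinates (the default for xycoords),
# while xytext is placed as a fraction of the axes area
ann = ax.annotate('peak', xy=(2, 4),
                  xytext=(0.5, 0.5), textcoords='axes fraction',
                  arrowprops=dict(arrowstyle='->'))
plt.show()
```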
