Filter Data in WPS by Conditions
DATA MUNGING
Data munging is the general procedure for transforming data from erroneous or unusable forms
into useful, use-case-specific ones. Without some degree of munging, whether performed by
automated systems or specialized users, data cannot be ready for any kind of downstream
consumption.
But powerful and versatile tools, like Python, are making it increasingly easy for anyone to
munge effectively.
>>> df.head()
Title Author \
0 Walter Forbes. [A novel.] By A. A A. A.
1 All for Greed. [A novel. The dedication signed... A., A. A.
2 Love the Avenger. By the author of “All for Gr... A., A. A.
3 Welsh Sketches, chiefly ecclesiastical, to the... A., E. S.
4 [The World in which I live, and my place in it... A., E. S.
Flickr URL \
0 [Link]
1 [Link]
2 [Link]
3 [Link]
4 [Link]
Shelfmarks
0 British Library HMNTS 12641.b.30.
1 British Library HMNTS [Link].2.
2 British Library HMNTS [Link].1.
3 British Library HMNTS [Link].15.
4 British Library HMNTS 9007.d.28.
When we look at the first five entries using the head() method, we can see that a handful of
columns provide ancillary information that would be helpful to the library but isn’t very
descriptive of the books themselves: Edition Statement, Corporate Author, Corporate
Contributors, Former owner, Engraver, Issuance type and Shelfmarks.
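Dropping those ancillary columns can be sketched on a tiny hypothetical frame (only two of the named columns are included here for brevity):

```python
import pandas as pd

# hypothetical frame with two of the ancillary columns named above
df = pd.DataFrame({
    'Identifier': [206, 216],
    'Title': ['Walter Forbes. [A novel.] By A. A', 'All for Greed.'],
    'Edition Statement': [None, None],
    'Shelfmarks': ['British Library HMNTS 12641.b.30.', None],
})

to_drop = ['Edition Statement', 'Shelfmarks']

# axis=1 tells pandas to look for these labels among the columns
df.drop(to_drop, axis=1, inplace=True)
print(list(df.columns))  # ['Identifier', 'Title']
```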
Data Analysis and Visualization
Alternatively, we could also remove the columns by passing them to the columns parameter
directly instead of separately specifying the labels to be removed and the axis where pandas
should look for the labels:
>>> df.drop(columns=to_drop, inplace=True)
Title Author \
206 Walter Forbes. [A novel.] By A. A A. A.
216 All for Greed. [A novel. The dedication signed... A., A. A.
218 Love the Avenger. By the author of “All for Gr... A., A. A.
472 Welsh Sketches, chiefly ecclesiastical, to the... A., E. S.
480 [The World in which I live, and my place in it... A., E. S.
Flickr URL
206 [Link]
216 [Link]
218 [Link]
472 [Link]
480 [Link]
We can access each record in a straightforward way with loc[]. Although loc[] may not have the
most intuitive name, it allows us to do label-based indexing: addressing a row or record by its
label rather than its position:
>>> df.loc[206]
Place of Publication London
Date of Publication 1879 [1878]
Publisher S. Tinsley & Co.
Title Walter Forbes. [A novel.] By A. A
Author A. A.
Flickr URL [Link]
Name: 206, dtype: object
In other words, 206 is the first label of the index. To access it by position, we could
use df.iloc[0], which does position-based indexing.
Previously, our index was a RangeIndex: integers starting from 0, analogous to Python’s built-
in range. By passing a column name to set_index, we have changed the index to the values
in Identifier.
You may have noticed that we reassigned the variable to the object returned by the method
with df = df.set_index(...). This is because, by default, the method returns a modified copy of our
object and does not make the changes directly to the object. We can avoid this by setting
the inplace parameter:
df.set_index('Identifier', inplace=True)
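The copy-versus-inplace behaviour can be seen on a toy frame (hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({'Identifier': [206, 216], 'Title': ['A', 'B']})

# default: returns a modified copy; the original keeps its RangeIndex
df2 = df.set_index('Identifier')
print(df.index.tolist())   # [0, 1]
print(df2.index.tolist())  # [206, 216]

# inplace=True mutates the object directly (and returns None)
df.set_index('Identifier', inplace=True)
print(df.index.tolist())   # [206, 216]
```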
A particular book can have only one date of publication. Therefore, we need to do the following:
Remove the extra dates in square brackets, wherever present: 1879 [1878]
Convert date ranges to their “start date”, wherever present: 1860-63; 1839, 38-54
Completely remove the dates we are not certain about and replace them with
NumPy’s NaN: [1897?]
Convert the string nan to NumPy’s NaN value
Synthesizing these patterns, we can actually take advantage of a single regular expression to
extract the publication year:
regex = r'^(\d{4})'
The regular expression above is meant to find any four digits at the beginning of a string, which
suffices for our case. The above is a raw string (meaning that a backslash is no longer an escape
character), which is standard practice with regular expressions.
The \d represents any digit, and {4} repeats this rule four times. The ^ character matches the start
of a string, and the parentheses denote a capturing group, which signals to pandas that we want
to extract that part of the regex. (We want ^ to avoid cases where [ starts off the string.)
Let’s see what happens when we run this regex across our dataset:
>>> extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
>>> extr.head()
Identifier
206 1879
216 1868
218 1869
472 1851
480 1857
Name: Date of Publication, dtype: object
Technically, this column still has object dtype, but we can easily get its numerical version
with pd.to_numeric:
>>> df['Date of Publication'] = pd.to_numeric(extr)
>>> df['Date of Publication'].dtype
dtype('float64')
519 London
667 pp. 40. G. Bryan & Co: Oxford, 1898
874 London]
1143 London
Name: Place of Publication, dtype: object
We see that for some rows, the place of publication is surrounded by other unnecessary
information. Looking at more values would show that this is the case only for some rows whose
place of publication is 'London' or 'Oxford'.
Let’s take a look at two specific entries:
>>> df.loc[4157862]
Place of Publication Newcastle-upon-Tyne
Date of Publication 1867
Publisher T. Fordyce
Title Local Records; or, Historical Register of rema...
Author T. Fordyce
Flickr URL [Link]
Name: 4157862, dtype: object
>>> df.loc[4159587]
Place of Publication Newcastle upon Tyne
Date of Publication 1834
Publisher Mackenzie & Dent
Title An historical, topographical and descriptive v...
Author E. (Eneas) Mackenzie
Flickr URL [Link]
Name: 4159587, dtype: object
These two books were published in the same place, but one has hyphens in the name of the place
while the other does not.
To clean this column in one sweep, we can use str.contains() to get a Boolean mask.
We clean the column as follows:
>>> pub = df['Place of Publication']
>>> london = pub.str.contains('London')
>>> london[:5]
Identifier
206 True
216 True
218 True
472 True
480 True
Name: Place of Publication, dtype: bool
Here, the np.where function is called in a nested structure, with condition being a Series of
Booleans obtained with str.contains(). The contains() method works similarly to the built-
in in keyword used to find the occurrence of an entity in an iterable (or a substring in a string).
The replacement to be used is a string representing our desired place of publication. We also
replace hyphens with a space using str.replace() and reassign the result to the column in our DataFrame.
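Putting those pieces together, a minimal sketch with made-up values (the nested np.where and the hyphen replacement are as described above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Place of Publication': [
    'London]', 'Newcastle-upon-Tyne', 'pp. 40. G. Bryan & Co: Oxford, 1898']})

pub = df['Place of Publication']
london = pub.str.contains('London')
oxford = pub.str.contains('Oxford')

# nested np.where: condition, value-if-true, value-if-false
df['Place of Publication'] = np.where(
    london, 'London',
    np.where(oxford, 'Oxford', pub.str.replace('-', ' ', regex=False)))

print(df['Place of Publication'].tolist())
# ['London', 'Newcastle upon Tyne', 'Oxford']
```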
Let’s have a look at the first five entries, which look a lot crisper than when we started out:
>>> df.head()
Place of Publication Date of Publication Publisher \
206 London 1879 S. Tinsley & Co.
216 London 1868 Virtue & Co.
218 London 1869 Bradbury, Evans & Co.
472 London 1851 James Darling
480 London 1857 Wertheim & Macintosh
Title Author \
206 Walter Forbes. [A novel.] By A. A AA
216 All for Greed. [A novel. The dedication signed... A. A A.
218 Love the Avenger. By the author of “All for Gr... A. A A.
Flickr URL
206 [Link]
216 [Link]
218 [Link]
472 [Link]
480 [Link]
We see that we have periodic state names followed by the university towns in that state: StateA
TownA1 TownA2 StateB TownB1 TownB2.... If we look at the way state names are written in
the file, we’ll see that all of them have the “[edit]” substring in them.
We can take advantage of this pattern by creating a list of (state, city) tuples and wrapping that
list in a DataFrame:
>>> university_towns = []
>>> with open('Datasets/university_towns.txt') as file:
... for line in file:
... if '[edit]' in line:
... # Remember this `state` until the next is found
... state = line
... else:
... # Otherwise, we have a city; keep `state` as last-seen
... university_towns.append((state, line))
>>> university_towns[:5]
[('Alabama[edit]\n', 'Auburn (Auburn University)[1]\n'),
('Alabama[edit]\n', 'Florence (University of North Alabama)\n'),
('Alabama[edit]\n', 'Jacksonville (Jacksonville State University)[2]\n'),
('Alabama[edit]\n', 'Livingston (University of West Alabama)[2]\n'),
('Alabama[edit]\n', 'Montevallo (University of Montevallo)[2]\n')]
We can wrap this list in a DataFrame and set the columns as “State” and “RegionName”. pandas
will take each element in the list and set State to the left value and RegionName to the right
value.
The resulting DataFrame looks like this:
>>> towns_df = pd.DataFrame(university_towns,
... columns=['State', 'RegionName'])
>>> towns_df.head()
State RegionName
0 Alabama[edit]\n Auburn (Auburn University)[1]\n
1 Alabama[edit]\n Florence (University of North Alabama)\n
2 Alabama[edit]\n Jacksonville (Jacksonville State University)[2]\n
3 Alabama[edit]\n Livingston (University of West Alabama)[2]\n
4 Alabama[edit]\n Montevallo (University of Montevallo)[2]\n
While we could have cleaned these strings in the for loop above, pandas makes it easy. We only
need the state name and the town name and can remove everything else. While we could use
pandas' .str methods again here, we could also use applymap() to map a Python callable to each
element of the DataFrame.
Therefore, applymap() will apply a function to each of these elements independently. Let's define
that function:
>>> def get_citystate(item):
...     if ' (' in item:
...         return item[:item.find(' (')]
...     elif '[' in item:
...         return item[:item.find('[')]
...     else:
...         return item
First, we define a Python function that takes an element from the DataFrame as its parameter.
Inside the function, checks are performed to determine whether there’s a ( or [ in the element or
not.
Depending on the check, values are returned accordingly by the function. Finally,
the applymap() function is called on our object. Now the DataFrame is much neater:
>>> towns_df.head()
State RegionName
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
The applymap() method took each element from the DataFrame, passed it to the function, and
the original value was replaced by the returned value.
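A self-contained sketch of the whole applymap() step, using two hypothetical rows:

```python
import pandas as pd

towns_df = pd.DataFrame(
    [('Alabama[edit]\n', 'Auburn (Auburn University)[1]\n'),
     ('Alabama[edit]\n', 'Florence (University of North Alabama)\n')],
    columns=['State', 'RegionName'])

def get_citystate(item):
    # keep everything before ' (' or '[', whichever marker is present
    if ' (' in item:
        return item[:item.find(' (')]
    elif '[' in item:
        return item[:item.find('[')]
    else:
        return item

# apply the callable to every element of the DataFrame
towns_df = towns_df.applymap(get_citystate)
print(towns_df.iloc[0].tolist())  # ['Alabama', 'Auburn']
```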
(truncated preview of the raw olympics DataFrame: the column labels are the strings '0' through
'15', and the row that should serve as the header, containing entries like ? Games, 01 !, 02 !,
03 ! and Combined total, appears as the first row of data)
This is messy indeed! The columns are the string form of integers indexed at 0. The row which
should have been our header (i.e. the one to be used to set the column names) is
at olympics_df.iloc[0]. This happened because our CSV file starts with 0, 1, 2, …, 15.
Also, if we were to go to the source of this dataset, we’d see that NaN above should really be
something like “Country”, ? Summer is supposed to represent “Summer Games”, 01 ! should be
“Gold”, and so on.
Skip one row and set the header as the first (0-indexed) row
Rename the columns
We can skip rows and set the header while reading the CSV file by passing some parameters to
the read_csv() function.
This function takes a lot of optional parameters, but in this case we only need one (header) to
remove the 0th row:
>>> olympics_df = pd.read_csv('Datasets/[Link]', header=1)
>>> olympics_df.head()
Unnamed: 0 ? Summer 01 ! 02 ! 03 ! Total ? Winter \
0 Afghanistan (AFG) 13 0 0 2 2 0
1 Algeria (ALG) 12 5 2 8 15 3
2 Argentina (ARG) 23 18 24 28 70 18
3 Armenia (ARM) 5 1 2 9 12 6
4 Australasia (ANZ) [ANZ] 2 3 4 5 12 0
1 0 0 0 0 15 5 2 8
2 0 0 0 0 41 18 24 28
3 0 0 0 0 11 1 2 9
4 0 0 0 0 2 3 4 5
Combined total
0 2
1 15
2 70
3 12
4 12
We now have the correct row set as the header and all unnecessary rows removed. Take note of
how pandas has changed the name of the column containing the name of the countries
from NaN to Unnamed: 0.
To rename the columns, we will make use of a DataFrame’s rename() method, which allows you
to relabel an axis based on a mapping (in this case, a dict).
Let’s start by defining a dictionary that maps current column names (as keys) to more usable
ones (the dictionary’s values):
>>> new_names = {'Unnamed: 0': 'Country',
... '? Summer': 'Summer Olympics',
... '01 !': 'Gold',
... '02 !': 'Silver',
... '03 !': 'Bronze',
... '? Winter': 'Winter Olympics',
... '01 !.1': 'Gold.1',
... '02 !.1': 'Silver.1',
... '03 !.1': 'Bronze.1',
... '? Games': '# Games',
... '01 !.2': 'Gold.2',
... '02 !.2': 'Silver.2',
... '03 !.2': 'Bronze.2'}
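A sketch of the rename() call that applies this mapping, shown on a frame carrying only a few of the raw column names:

```python
import pandas as pd

# frame with a few of the raw column names read from the CSV
olympics_df = pd.DataFrame(
    columns=['Unnamed: 0', '? Summer', '01 !', '02 !', '03 !'])

new_names = {'Unnamed: 0': 'Country',
             '? Summer': 'Summer Olympics',
             '01 !': 'Gold',
             '02 !': 'Silver',
             '03 !': 'Bronze'}

# rename relabels the column axis using the dict as a mapping
olympics_df.rename(columns=new_names, inplace=True)
print(list(olympics_df.columns))
# ['Country', 'Summer Olympics', 'Gold', 'Silver', 'Bronze']
```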
# importing pandas as pd
import pandas as pd
Output :
Example #2: Use the filter() function to subset all columns in a dataframe which have the letter 'a' or
'A' in their name.
Note: the filter() function also takes a regular expression as one of its parameters.
# importing pandas as pd
import pandas as pd
Output :
The regular expression ‘[aA]’ looks for all column names which has an ‘a’ or an ‘A’ in its name.
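A minimal sketch of filter() with that regular expression (the column names here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['X'], 'Score': [1], 'Rank': [2], 'Height': [3]})

# keep only the columns whose name contains an 'a' or an 'A'
subset = df.filter(regex='[aA]')
print(list(subset.columns))  # ['Name', 'Rank']
```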
Output :
If we use how = "outer", it returns all rows from df1 and df2; where a key has no match in the
other frame, the corresponding columns are filled with NaN values.
Output :
If we use how = "left", it returns all the rows present in the left DataFrame.
Output :
If we use how = "right", it returns all the rows present in the right DataFrame.
Output :
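The four join types can be compared side by side on two tiny hypothetical frames:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_val': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'right_val': [20, 30, 40]})

inner = pd.merge(df1, df2, on='key', how='inner')  # keys B, C
outer = pd.merge(df1, df2, on='key', how='outer')  # keys A, B, C, D
left = pd.merge(df1, df2, on='key', how='left')    # keys A, B, C
right = pd.merge(df1, df2, on='key', how='right')  # keys B, C, D

print(sorted(outer['key']))             # ['A', 'B', 'C', 'D']
print(int(outer['right_val'].isna().sum()))  # 1 (no match for key 'A')
```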
# importing modules
import pandas as pd
# creating a dataframe
df1 = pd.DataFrame({'Name':['Raju', 'Rani', 'Geeta', 'Sita', 'Sohit'],
'Marks':[80, 90, 75, 88, 59]})
# display df1
display(df1)
# display df2
display(df2)
Output:
df1
df2
# applying merge
df1.merge(df2[['Name', 'Grade', 'Rank']])
Output:
Merged Dataframe
The resultant dataframe contains all the columns of df1 but only the specified columns of df2,
joined on the key column Name; i.e. the result contains the Name, Marks, Grade and Rank columns.
The two dataframes have different numbers of rows, but only the rows whose Name values appear in
both dataframes are kept after the merge.
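A runnable sketch of this merge; the contents of df2 are not shown in the text, so the values below are hypothetical:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Raju', 'Rani', 'Geeta', 'Sita', 'Sohit'],
                    'Marks': [80, 90, 75, 88, 59]})
# hypothetical df2 with an extra column that we will leave out of the merge
df2 = pd.DataFrame({'Name': ['Raju', 'Rani', 'Geeta'],
                    'Grade': ['A', 'A', 'B'],
                    'Rank': [3, 1, 4],
                    'Hobby': ['chess', 'music', 'dance']})

# merge on the common key column Name, keeping only Grade and Rank from df2
merged = df1.merge(df2[['Name', 'Grade', 'Rank']])
print(list(merged.columns))  # ['Name', 'Marks', 'Grade', 'Rank']
print(len(merged))           # 3 (only Names present in both frames)
```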
Example 2: In the resultant dataframe, the Grade column of df2 is merged with df1 based on the key
column Name with merge type left, i.e. all the rows of the left dataframe (df1) will be kept.
# importing modules
import pandas as pd
# creating a dataframe
df1 = pd.DataFrame({'Name':['Raju', 'Rani', 'Geeta', 'Sita', 'Sohit'],
'Marks':[80, 90, 75, 88, 59]})
# display df2
display(df2)
Output:
df1
df2
Merged Dataframe
Example 3: In this example, we have merged df1 with df2. The Marks column of df1 is
merged with df2, and only the rows with common values in the key column Name of both
dataframes are displayed here.
# importing modules
import pandas as pd
# creating a dataframe
df1 = pd.DataFrame({'Name':['Raju', 'Rani', 'Geeta', 'Sita', 'Sohit'],
'Marks':[80, 90, 75, 88, 59]})
# display df2
display(df2)
Output:
df1
df2
Merged Dataframe
Pandas supports several data structures: Series, DataFrame, and (in older versions) Panel, which
has since been removed. A DataFrame is a two-dimensional data structure: data is stored in a
tabular format, in rows and columns. We can create a data frame in many ways.
Here we are creating a data frame using a list data structure in python.
# assign data
l = ["vignan", "it", "sravan", "subbarao"]
# create and display the dataframe
df = pd.DataFrame(l)
df
Output:
Here in the above example, we created a data frame. Let's merge two data frames with
different columns. Joining on different columns is possible using the concat() method.
Syntax: pd.concat(objs, axis=0, join='outer')
objs: the DataFrames (an iterable or mapping of them) to concatenate.
axis: 0 refers to the row axis and 1 refers to the column axis.
join: type of join, 'inner' or 'outer'.
Note: If the data frames' columns do not match, the missing values are filled with NaN.
Step-by-step Approach:
Open jupyter notebook
Import necessary modules
Create a data frame
Perform operations
Analyze the results.
Below are some examples based on the above approach:
Example 1
In this example, we are going to concatenate the marks of students based on colleges.
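The example code itself is not reproduced here; a minimal sketch with hypothetical college data follows:

```python
import pandas as pd

# hypothetical student details for two colleges
details1 = {'Name': ['Raju', 'Rani'], 'Marks': [80, 90], 'College': ['vignan', 'vignan']}
details2 = {'Name': ['Sita', 'Sohit'], 'Marks': [88, 59], 'College': ['vvit', 'vvit']}

df = pd.DataFrame(details1)
df1 = pd.DataFrame(details2)

# stack the two frames row-wise; ignore_index rebuilds a fresh RangeIndex
result = pd.concat([df, df1], axis=0, ignore_index=True)
print(len(result))              # 4
print(result['Name'].tolist())  # ['Raju', 'Rani', 'Sita', 'Sohit']
```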
# create dataframe
df1 = pd.DataFrame(details1)
# display dataframe
df1
Output:
# concat dataframes
pd.concat([df, df1], axis=0, ignore_index=True)
Example 2:
Storing marks and subject details
# print dataframe.
df
Output:
# print dataframe.
df1
Output:
Output:
Explanation:
left – Dataframe which has to be joined from left
right – Dataframe which has to be joined from the right
how – specifies the type of join. left, right, outer, inner, cross
on – Column names to join the two dataframes.
left_on – Column names to join on in the left DataFrame.
right_on – Column names to join on in the right DataFrame.
Normally merge:
When we join datasets using the pd.merge() function with type 'inner', identically named
non-key columns from the two data frames get suffixes (by default _x and _y) attached in the output.
Output:
Method 1: Use the columns that have the same names in the join statement
In this approach, to prevent duplicated columns when joining the two data frames, the user
simply needs to use the pd.merge() function and pass as its parameters the inner join type
and the column names that are to be joined on from the left and right data frames in
Python.
Example:
In this example, we first create sample dataframes data1 and data2 using the pd.DataFrame
function as shown, and then use the pd.merge() function to join the two data frames with an inner
join, explicitly mentioning the column names that are to be joined on from the left and right data
frames.
print(merged)
Output:
# importing pandas as pd
import pandas as pd
# creating a dataframe
df = pd.DataFrame({'Name': {0: 'John', 1: 'Bob', 2: 'Shiela'},
                   'Course': {0: 'Masters', 1: 'Graduate', 2: 'Graduate'},
                   'Age': {0: 27, 1: 23, 2: 21}})
df
Python | pandas.pivot()
pandas.pivot(index, columns, values) produces a pivot table based on 3 columns of
the DataFrame. It uses the unique values from index/columns and fills the frame with values.
Parameters:
index[ndarray] : Labels to use to make new frame’s index
columns[ndarray] : Labels to use to make new frame’s columns
values[ndarray] : Values to use for populating new frame’s values
Returns: Reshaped DataFrame
Exception: ValueError raised if there are any duplicates.
Code:
# importing pandas as pd
import pandas as pd
# creating a dataframe
df = pd.DataFrame({'A': ['John', 'Boby', 'Mina'],
                   'B': ['Masters', 'Graduate', 'Graduate'],
                   'C': [27, 23, 21]})
df
# value is a list
df.pivot(index='A', columns='B', values=['C', 'A'])
pivot() raises a ValueError when any index/columns combination is associated with multiple values.
# importing pandas as pd
import pandas as pd
# creating a dataframe
df = pd.DataFrame({'A': ['John', 'John', 'Mina'],
'B': ['Masters', 'Masters', 'Graduate'],
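A runnable sketch of the failure case described above (the values for column 'C' are hypothetical, since the original listing is cut off here):

```python
import pandas as pd

df = pd.DataFrame({'A': ['John', 'John', 'Mina'],
                   'B': ['Masters', 'Masters', 'Graduate'],
                   'C': [27, 23, 21]})  # hypothetical values

# ('John', 'Masters') appears twice, so pivot cannot place a single value there
try:
    df.pivot(index='A', columns='B', values='C')
except ValueError as err:
    print('ValueError:', err)
```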
Sometimes we need to reshape the Pandas data frame to perform analysis in a better way.
Reshaping plays a crucial role in data analysis. Pandas provide function
like melt and unmelt for reshaping.
pandas.melt()
melt() is used to convert a wide dataframe into a longer form. This function can be used when
there are requirements to consider a specific column as an identifier.
Syntax: pd.melt(frame, id_vars=None, value_vars=None, var_name=None,
value_name='value', col_level=None)
Example 1:
Initialize the dataframe with data regarding ‘Days‘, ‘Patients‘ and ‘Recovery‘.
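The dataframe itself is not shown here, so the following sketch uses hypothetical values:

```python
import pandas as pd

# hypothetical patient/recovery counts keyed by day
df = pd.DataFrame({'DAYS': [1, 2, 3],
                   'PATIENTS': [16, 19, 11],
                   'RECOVERY': [10, 12, 13]})

# DAYS stays as the identifier; the other columns melt into variable/value pairs
reshaped = pd.melt(df, id_vars=['DAYS'], var_name='CONDITION', value_name='COUNT')
print(reshaped.shape)  # (6, 3)
print(reshaped['CONDITION'].unique().tolist())  # ['PATIENTS', 'RECOVERY']
```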
Output:
Now, we reshape the data frame using pd.melt() around column 'DAYS'.
Output:
Example 2:
Now, to the dataframe used above a new column named ‘Deaths‘ is introduced.
Output:
Example 1:
Create a dataframe that contains the data on ID, Name, Marks and Sports of 6 students.
Output:
# unmelting
reshaped_df = df.pivot(index='Name', columns='Sports')
Output:
Example 2:
Consider the same dataframe used in the example above. Unmelting can be done based on
more than one column also.
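A sketch of a multi-column unmelt, using a hypothetical melted frame; passing a list to values yields a MultiIndex of columns:

```python
import pandas as pd

melted = pd.DataFrame({'Name': ['A', 'A', 'B', 'B'],
                       'Subject': ['Math', 'Sci', 'Math', 'Sci'],
                       'Marks': [90, 80, 70, 60],
                       'Sports': ['chess', 'chess', 'judo', 'judo']})

# pivot on Name with two value columns: Marks and Sports
reshaped_df = melted.pivot(index='Name', columns='Subject',
                           values=['Marks', 'Sports'])
print(reshaped_df.shape)  # (2, 4)
```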
Output:
But the reshaped dataframe appears a little different from the original one in terms of index. To
get the index set as in the original dataframe, use the reset_index() function on the reshaped
dataframe.
# resetting index
df_new = reshaped_df.reset_index()
Output:
DATA AGGREGATION
Definition and Usage
The aggregate() method allows you to apply a function or a list of function names to be executed
along one of the axes of the DataFrame; the default is 0, which is the index (row) axis.
axis: 0 or 'index', 1 or 'columns'. Optional; specifies which axis to apply the function to.
Default 0.
Return Value:
This function does NOT make changes to the original DataFrame object.
The dataframe.aggregate() function is used to apply some aggregation across one or more columns.
It aggregates using a callable, string, dict, or list of strings/callables. The most frequently used
aggregations are:
sum: Return the sum of the values for the requested axis
min: Return the minimum of the values for the requested axis
max: Return the maximum of the values for the requested axis
Example #1: Aggregate the 'sum' and 'min' functions across all the columns in the data frame.
df.aggregate(['sum', 'min'])
Output:
For each column with numeric values, the minimum and the sum of all values have been
found. The dataframe df has four such columns: Number, Age, Weight, Salary.
Example #2:
In Pandas, we can also apply different aggregation functions across different columns. For that,
we need to pass a dictionary with key containing the column names and values containing the
list of aggregation functions for any specific column.
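A minimal sketch of per-column aggregation with a dict (the data is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'Number': [1, 2, 3], 'Age': [25, 30, 35],
                   'Weight': [60, 70, 80], 'Salary': [100, 200, 300]})

# different aggregations per column: a dict of column -> list of functions
result = df.aggregate({'Number': ['sum'], 'Age': ['min', 'max']})
print(result)
```

Cells for function/column combinations that were not requested come back as NaN, which is the behaviour described below.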
Output:
A separate aggregation has been applied to each column; if a specific aggregation is not
applied to a column, the corresponding cell holds NaN.
GROUPING OF DATA
DataFrame.groupby():
Pandas groupby is used for grouping the data according to the categories and apply a function
to the categories. It also helps to aggregate data efficiently.
The pandas DataFrame.groupby() function is used to split the data into groups based on some
criteria. pandas objects can be split on any of their axes. The abstract definition of grouping is
to provide a mapping of labels to group names.
Syntax: DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True,
group_keys=True, squeeze=False, **kwargs)
Parameters :
by : mapping, function, str, or iterable
axis : int, default 0
level : If the axis is a MultiIndex (hierarchical), group by a particular level or levels
as_index : For aggregated output, return object with group labels as the index. Only relevant
for DataFrame input. as_index=False is effectively “SQL-style” grouped output
sort : Sort group keys. Get better performance by turning this off. Note this does not influence
the order of observations within each group. groupby preserves the order of rows within each
group.
group_keys : When calling apply, add group keys to index to identify pieces
squeeze : Reduce the dimensionality of the return type if possible, otherwise return a consistent
type
Returns : GroupBy object
Example #1: Use groupby() function to group the data based on the “Team”.
# importing pandas as pd
import pandas as pd
Output :
Let's print the values contained in any one of the groups. For that, use the name of the team:
the get_group() function finds the entries contained in any one group.
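A runnable sketch of groupby() and get_group() on hypothetical team data:

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Riders', 'Riders', 'Devils', 'Devils'],
                   'Points': [876, 789, 863, 673]})

gk = df.groupby('Team')

# get_group() pulls out the rows belonging to one group
riders = gk.get_group('Riders')
print(riders['Points'].tolist())     # [876, 789]
print(gk['Points'].sum().to_dict())  # {'Devils': 1536, 'Riders': 1665}
```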
Output :
Example #2: Use groupby() function to form groups based on more than one category (i.e. Use
more than one column to perform the splitting).
# importing pandas as pd
import pandas as pd
Output :
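A sketch of splitting on more than one column (hypothetical data); passing a list of column names groups by each combination of their values:

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Riders', 'Riders', 'Devils'],
                   'Year': [2014, 2015, 2014],
                   'Points': [876, 789, 863]})

# pass a list of column names to split on more than one category
gk = df.groupby(['Team', 'Year'])
print(list(gk.groups.keys()))
# [('Devils', 2014), ('Riders', 2014), ('Riders', 2015)]
```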
groupby() is a very powerful function with a lot of variations. It makes the task of splitting the
dataframe over some criteria really easy and efficient.
Human minds are more receptive to visual representations of data than to textual data. We
can easily understand things when they are visualized. It is better to represent data through
graphs, where we can analyze it more efficiently and make specific decisions according to the
analysis. Before learning matplotlib, we need to understand data visualization and why
it is important.
Data Visualization
Graphics provide an excellent approach for exploring data, which is essential for presenting
results. Data visualization involves more than just representing data in graphical form
(instead of textual form).
It can be very helpful when discovering and getting to know a dataset, and can help with
identifying patterns, corrupt data, outliers, and much more. With a little domain knowledge, data
visualizations can be used to express and demonstrate key relationships in plots and charts.
Where statistics focuses on quantitative descriptions and estimations of data, visualization
provides an important set of tools for gaining a qualitative understanding.
There are five key plots that are used for data visualization.
Matplotlib is a Python library defined as a multi-platform data visualization library built
on NumPy arrays. It can be used in Python scripts, the shell, web applications, and other
graphical user interface toolkits.
John D. Hunter originally conceived matplotlib in 2002. It has an active development
community and is distributed under a BSD-style license. Its first version was released in 2003,
and version 3.1.1 was released on 1 July 2019.
Python 3 support started with Matplotlib 1.2, and Matplotlib 1.4 is the last version that
supports Python 2.6; Matplotlib 2.0.x supports Python versions 2.7 to 3.6.
Figure: It is a whole figure which may hold one or more axes (plots). We can think of a Figure as
a canvas that holds plots.
Axes: A Figure can contain several Axes. It consists of two or three (in the case of 3D) Axis
objects. Each Axes is comprised of a title, an x-label, and a y-label.
Axis: Axis objects are the number-line-like objects responsible for generating the graph limits.
Artist: An Artist is everything we can see on the graph, like Text objects, Line2D objects, and
collection objects. Most Artists are tied to an Axes.
We will be plotting two lists containing the X, Y coordinates for the plot.
Example:
plt.xlabel("x-axis")
plt.show()
Output:
(Figure: a simple straight-line plot titled "Simple Plot", with labeled x-axis and y-axis.)
In the above example, the elements of X and Y provide the coordinates for the x-axis and y-axis,
and a straight line is plotted against those coordinates.
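Since the plotting code itself is not reproduced above, here is a minimal sketch of such a plot (the coordinate lists are hypothetical; the Agg backend is selected so the script runs without a display):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, no display needed
import matplotlib.pyplot as plt

# hypothetical coordinate lists
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]

plt.plot(x, y)  # draw a straight line through the points
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.title('Simple Plot')
plt.show()
```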
Pyplot
Pyplot is a Matplotlib module that provides a MATLAB-like interface. Pyplot provides
functions that interact with the figure i.e. creates a figure, decorates the plot with labels, and
creates a plotting area in a figure.
Syntax:
[Link](*args, scalex=True, scaley=True, data=None, **kwargs)
Example:
Output:
Matplotlib takes care of creating inbuilt defaults like Figure and Axes.
Figure: This class is the top-level container for all the plots means it is the overall
window or page on which everything is drawn. A figure object can be considered as a
box-like container that can hold one or more axes.
Axes: This class is the most basic and flexible component for creating sub-plots. You
might confuse axes as the plural of axis but it is an individual plot or graph. A given
figure may contain many axes but given axes can only be in one figure.
1. Figure class
Figure class is the top-level container that contains one or more axes. It is the overall
window or page on which everything is drawn.
Syntax:
class matplotlib.figure.Figure(figsize=None, dpi=None (resolution), facecolor=None
(background colour), edgecolor=None (border colour), linewidth=0.0, **kwargs)
Example 1
Output:
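The example code is not reproduced here; a minimal sketch of creating a Figure and adding one Axes (all parameter values are hypothetical):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen so no display is needed
import matplotlib.pyplot as plt

# a small Figure with a light-grey background holding a single Axes
fig = plt.figure(figsize=(4, 3), facecolor='lightgrey')
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])  # [left, bottom, width, height]
ax.plot([1, 2, 3], [2, 4, 6])
ax.set_title('Figure with one Axes')
print(len(fig.axes))  # 1
```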
Output:
2. Axes Class
Axes class is the most basic and flexible unit for creating sub-plots. A given figure may
contain many axes, but a given axes can only be present in one figure. The axes() function creates
the axes object.
Syntax:
plt.axes([left, bottom, width, height])
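A short sketch of placing Axes with plt.axes() (the rectangles are hypothetical; the Agg backend avoids needing a display):

```python
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

# plt.axes(rect) adds an Axes to the current figure;
# rect = [left, bottom, width, height] in figure-fraction coordinates
ax = plt.axes([0.1, 0.1, 0.8, 0.8])
ax.plot([0, 1], [0, 1])

# a second, smaller Axes placed inside the same figure
inset = plt.axes([0.55, 0.15, 0.3, 0.3])
inset.plot([0, 1], [1, 0])
print(len(plt.gcf().axes))  # 2
```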
Example 1:
Output:
Example 2:
Output:
Multiple Plots
What if you want to plot multiple plots in the same figure. This can be done using multiple
ways. One way was discussed above using the add_axes() method of the figure class
Output:
The add_axes() method adds the plot in the same figure by creating another axes object.
Example:
import matplotlib.pyplot as plt
# data to display on plots
x = [3, 1, 3]
y = [3, 2, 1]
z = [1, 3, 1]
plt.figure()
# adding first subplot with 1 row and 2 columns and 1st subplot
plt.subplot(121)
plt.plot(x, y)
# adding second subplot with 1 row and 2 columns and 2nd subplot
plt.subplot(122)
plt.plot(z, y)
Output:
We can customize the graph by importing the style module, which is built into a matplotlib
installation and contains various functions to make the plot more attractive. In the program
below, we use the style module:
from matplotlib import pyplot as plt
from matplotlib import style

style.use('ggplot')
x = [16, 8, 10]
y = [8, 16, 6]
x2 = [8, 15, 11]
y2 = [6, 15, 7]
plt.plot(x, y, 'r', label='line one', linewidth=5)
plt.plot(x2, y2, 'm', label='line two', linewidth=5)
plt.title('Epic Info')
plt.ylabel('Y axis')
plt.xlabel('X axis')
plt.legend()
plt.grid(True, color='k')
plt.show()
Output:
2. Bar graphs
Bar graphs are among the most common types of graphs and are used to show data associated with
categorical variables. Matplotlib provides bar(), which makes bar graphs and accepts arguments
such as the categorical variables, their values, and a color.
Output:
Another function, barh(), is used to make horizontal bar graphs. bar() accepts yerr (and barh()
accepts xerr) as an argument to depict the variance in our data, as follows:
from matplotlib import pyplot as plt

players = ['Virat', 'Rohit', 'Shikhar', 'Hardik']
runs = [51, 87, 45, 67]
plt.bar(players, runs, color='green')
plt.title('Score Card')
plt.xlabel('Players')
plt.ylabel('Runs')
plt.show()
Output:
Let's have a look at another example using the style module:
from matplotlib import pyplot as plt
from matplotlib import style

style.use('ggplot')
x = [5, 8, 10]
y = [12, 16, 6]
x1 = [6, 9, 11]
y1 = [7, 15, 7]
plt.bar(x, y, color='y', align='center', label='first set')
plt.bar(x1, y1, color='c', align='center', label='second set')
plt.title('Information')
plt.ylabel('Y axis')
plt.xlabel('X axis')
plt.legend(loc="upper right")
plt.show()
Output:
3. Pie Chart
A pie chart is a circular graph that is broken down into segments, or slices of pie. It is generally
used to represent percentage or proportional data, where each slice of the pie represents a particular
category. Let's have a look at the example below:
from matplotlib import pyplot as plt

# Pie chart, where the slices will be ordered and plotted counter-clockwise:
Players = 'Rohit', 'Virat', 'Shikhar', 'Yuvraj'
Runs = [45, 30, 15, 10]
explode = (0.1, 0, 0, 0)  # "explode" the 1st slice
fig1, ax1 = plt.subplots()
ax1.pie(Runs, explode=explode, labels=Players, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that the pie is drawn as a circle.
plt.show()
Output:
4. Histogram
First, we need to understand the difference between a bar graph and a histogram. A histogram is
used for distributions, whereas a bar chart is used to compare different entities. A histogram is
a type of bar plot that shows the frequency of values falling within a set of value ranges.
For example, we take data on people of different age groups and plot a histogram with
respect to bins, where a bin represents a range of values divided into a series of intervals.
Bins are generally created of the same size.
from matplotlib import pyplot as plt

population_age = [21, 53, 60, 49, 25, 27, 30, 42, 40, 1, 2, 102, 95, 8, 15, 105, 70,
                  65, 55, 70, 75, 60, 52, 44, 43, 42, 45]
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
plt.hist(population_age, bins, histtype='bar', rwidth=0.8)
plt.xlabel('age groups')
plt.ylabel('Number of people')
plt.title('Histogram')
plt.show()
Output:
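To see exactly what plt.hist counts, we can compute the same per-bin frequencies with NumPy's np.histogram, which is what plt.hist uses internally to bin the values (a sketch using the data above):

```python
import numpy as np

population_age = [21, 53, 60, 49, 25, 27, 30, 42, 40, 1, 2, 102, 95, 8, 15,
                  105, 70, 65, 55, 70, 75, 60, 52, 44, 43, 42, 45]
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# counts[i] is the number of ages falling in [bins[i], bins[i+1])
# (the last bin also includes its upper edge, 100)
counts, edges = np.histogram(population_age, bins)
print(counts)
```

Note that the ages 102 and 105 fall outside the last bin and are simply not counted, which is why the bar heights sum to 25 rather than 27.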
5. Scatter plot
Scatter plots are mostly used for comparing variables, for example when we need to see how much
one variable is affected by another. The data is displayed as a collection of points. Each point
takes the value of one variable for its position on the horizontal axis and the value of the
other variable for its position on the vertical axis.
Example-1:
from matplotlib import pyplot as plt
from matplotlib import style

style.use('ggplot')
x = [5, 7, 10]
y = [18, 10, 6]
x2 = [6, 9, 11]
y2 = [7, 14, 17]
plt.scatter(x, y)
plt.scatter(x2, y2, color='g')
plt.title('Epic Info')
plt.ylabel('Y axis')
plt.xlabel('X axis')
plt.show()
Output:
Example-2:
import matplotlib.pyplot as plt

x = [2, 2.5, 3, 3.5, 4.5, 4.7, 5.0]
y = [7.5, 8, 8.5, 9, 9.5, 10, 10.5]

x1 = [9, 8.5, 9, 9.5, 10, 10.5, 12]
y1 = [3, 3.5, 4.7, 4, 4.5, 5, 5.2]
plt.scatter(x, y, label='high income low saving', color='g')
plt.scatter(x1, y1, label='low income high savings', color='r')
plt.xlabel('saving*100')
plt.ylabel('income*1000')
plt.title('Scatter Plot')
plt.legend()
plt.show()
Output:
i. plt.legend()
A legend is an area describing the elements of the graph. In the matplotlib library, the
legend() function is used to place a legend on the axes.
The loc attribute of legend() specifies the location of the legend. Its default value is
loc="best", which places the legend where it overlaps the data the least. The strings 'upper left',
'upper right', 'lower left' and 'lower right' place the legend at the corresponding corner of the
axes/figure.
The bbox_to_anchor=(x, y) attribute of legend() specifies the coordinates at which the legend
is anchored, and the ncol attribute sets the number of columns the legend has. Its default
value is 1.
Syntax:
plt.legend(["blue", "green"], bbox_to_anchor=(0.75, 1.15), ncol=2)
The following are some more attributes of the legend() function:
shadow: [None or bool] Whether to draw a shadow behind the legend. Its default
value is None.
markerscale: [None or int or float] The relative size of legend markers compared
with the originally drawn ones. The default is None.
numpoints: [None or int] The number of marker points in the legend when creating
a legend entry for a Line2D (line). The default is None.
fontsize: The font size of the legend. If the value is numeric, the size will be the
absolute font size in points.
facecolor: [None or "inherit" or color] The legend's background color.
edgecolor: [None or "inherit" or color] The legend's background patch edge color.
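A short sketch exercising several of these attributes at once (the plotted data is arbitrary):

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering
import matplotlib.pyplot as plt

plt.plot([1, 4, 9, 16], marker="o", label="squares")
leg = plt.legend(shadow=True,        # draw a shadow behind the legend box
                 markerscale=2.0,    # legend markers twice the plotted size
                 fontsize=12,        # absolute font size in points
                 facecolor="lightyellow",
                 edgecolor="black")
```

Passing these as keyword arguments overrides the corresponding rcParams defaults for this one legend only.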
import numpy as np
import matplotlib.pyplot as plt

# X-axis values
x = [1, 2, 3, 4, 5]
# Y-axis values
y = [1, 4, 9, 16, 25]
# Function to plot
plt.plot(x, y)
# Function to add a legend
plt.legend(['single element'])
plt.show()
Output:
Example 2:
# importing modules
import numpy as np
import matplotlib.pyplot as plt

# Y-axis values
y1 = [2, 3, 4.5]
# Y-axis values
y2 = [1, 1.5, 5]
# Function to plot
plt.plot(y1)
plt.plot(y2)
# Function to add a legend
plt.legend(["blue", "green"], loc="lower right")
plt.show()
Output:
Example 3:
import numpy as np
import matplotlib.pyplot as plt

# X-axis values
x = np.arange(5)
# Y-axis values
y1 = [1, 2, 3, 4, 5]
# Y-axis values
y2 = [1, 4, 9, 16, 25]
# Function to plot
plt.plot(x, y1, label='Numbers')
plt.plot(x, y2, label='Square of numbers')
# Function to add a legend
plt.legend()
plt.show()
Output:
Example 5:
# importing modules
import numpy as np
import matplotlib.pyplot as plt

# X-axis values
x = [0, 1, 2, 3, 4, 5, 6, 7, 8]
# Y-axis values
y1 = [0, 3, 6, 9, 12, 15, 18, 21, 24]
# Y-axis values
y2 = [0, 1, 2, 3, 4, 5, 6, 7, 8]
# Function to plot
plt.plot(x, y1, label="y = 3x")
plt.plot(x, y2, label="y = x")
# Function to add a legend
plt.legend()
plt.show()
Output:
# Implementation of matplotlib.axes.Axes.annotate()
# function
import numpy as np
import matplotlib.pyplot as plt

fig, geeeks = plt.subplots()
t = np.arange(0.0, 5.0, 0.001)
s = np.cos(3 * np.pi * t)
geeeks.plot(t, s, lw=2)
# Annotation
geeeks.annotate('Local Max', xy=(3.3, 1),
                xytext=(3, 1.8),
                arrowprops=dict(facecolor='green',
                                shrink=0.05))
geeeks.set_ylim(-2, 2)
plt.show()
Output:
Example #2:
# Implementation of matplotlib.axes.Axes.annotate()
# function
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
x = np.arange(0, 10, 0.005)
y = np.exp(-x / 2.) * np.sin(2 * np.pi * x)
ax.plot(x, y)
ax.set_xlim(0, 10)
ax.set_ylim(-1, 1)
# Point to annotate, plus box and arrow styling
xdata, ydata = 5, 0
bbox = dict(boxstyle='round', fc='0.8')
arrowprops = dict(arrowstyle='->',
                  connectionstyle='angle,angleA=0,angleB=90,rad=10')
offset = 72
# Annotation
ax.annotate('data = (%.1f, %.1f)' % (xdata, ydata),
            (xdata, ydata), xytext=(-2 * offset, offset),
            textcoords='offset points',
            bbox=bbox, arrowprops=arrowprops)
plt.show()
Output: