Data Handling Using Pandas – I
Data and Handling data:
➢Data is a raw fact (input for process).
➢Data is stored in different types [data types].
➢Data is handled and manipulated, i.e., stored or saved in different:
✓Structures – Queue, List, Array, dictionary, etc.
✓Format – csv file, excel file, html file, etc.
➢Different structures and formats of data are converted into a single format and stored [the storage place is called a
Data Warehouse].
Python Libraries:
Python libraries contain a collection of built-in modules that allow us to perform many actions.
Each library in Python contains a large number of modules that one can import and use.
For scientific and analytical use there are three Libraries:
1. Pandas - PANel DAta
2. NumPy - Numerical Python
3. Matplotlib
These libraries allow us to manipulate, transform and visualise data easily and efficiently.
Pandas:
➢Pandas is an open-source Python library providing high-performance data manipulation and analysis tools
built on its powerful data structures.
➢The name Pandas is derived from "Panel Data", an econometrics term for multidimensional data sets.
➢Python with Pandas is used in a wide range of academic and commercial domains, including
finance, economics, statistics and analytics.
➢It is a package useful for data analysis and manipulation.
➢Pandas provides an easy way to create, manipulate and wrangle data, regardless of its origin, in five typical
steps: load, prepare, manipulate, model and analyze.
Key Features of Pandas
• Fast and efficient DataFrame object with default and customized indexing.
• Tools for loading data into in-memory data objects from different file formats.
• Data alignment and integrated handling of missing data.
• Reshaping and pivoting of data sets.
• Label-based slicing, indexing and sub-setting of large data sets.
• Columns from a data structure can be deleted or inserted.
• Group by data for aggregation and transformations.
• High performance merging and joining of data.
• Time series functionality [for example, forecasting, which uses a model to predict future values based on previously
observed values].
NumPy
NumPy, which stands for ‘Numerical Python’, can be used for numerical data analysis and scientific
computing. NumPy uses a multidimensional array object and has functions and tools for working with these
arrays. Elements of an array are stored together in memory, hence they can be accessed quickly.
Two commonly used functions are arange() and array(); the constant nan represents "not a number".
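A minimal sketch of the two functions and the nan constant mentioned above. The variable names and values are illustrative assumptions, not part of the original notes.

```python
import numpy as np

# array() builds an array from an existing sequence
marks = np.array([10, 20, 30])

# arange(start, stop, step) generates evenly spaced values; stop is excluded
evens = np.arange(0, 10, 2)

# nan is a special float value that stands for a missing or undefined number
with_gap = np.array([1.5, np.nan, 3.0])

print(marks)
print(evens)                # [0 2 4 6 8]
print(np.isnan(with_gap))   # [False  True False]
```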
Matplotlib:
The Matplotlib library in Python is used for plotting graphs and visualisation. Using Matplotlib, with just a few
lines of code we can generate publication-quality plots, histograms, bar charts, scatterplots, etc.
Following are some of the differences between Pandas and NumPy:
1. A NumPy array requires homogeneous data, while a Pandas DataFrame can have different data types (float,
int, string, datetime, etc.).
2. Pandas has a simpler interface for operations like file loading, plotting, selection, joining and GROUP BY,
which come in very handy in data-processing applications.
3. Pandas DataFrames (with column names) make it very easy to keep track of data.
4. Pandas is used when data is in tabular format, whereas NumPy is used for numeric, array-based data
manipulation.
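The first difference can be checked with a short sketch. The sample names and numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# A NumPy array holds one data type: mixing int and float converts everything to float
arr = np.array([1, 2.5, 3])
print(arr.dtype)   # float64

# A DataFrame can keep a different data type in each column
df = pd.DataFrame({
    "name": ["Asha", "Ravi"],   # hypothetical sample data
    "age": [16, 17],
    "score": [88.5, 92.0],
})
print(df.dtypes)   # one dtype per column
```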
DATA STRUCTURE IN PANDAS
• A data structure is a way of arranging data so that it can be accessed quickly and so that various
operations, such as retrieval, deletion and modification, can be performed on it.
• Pandas deals with 3 data structures:
1. Series
2. DataFrame
3. Panel (deprecated in pandas 0.20 and removed in 0.25; current versions provide only Series and DataFrame)
• These data structures are built on top of NumPy arrays, which means they are fast.
Dimension & Description:
The best way to think of these data structures is that the higher dimensional data structure is a container of its
lower dimensional data structure.
For example, DataFrame is a container of Series, and Panel is a container of DataFrame.
Mutability
• All Pandas data structures are value mutable (their values can be changed), and all except Series are size mutable. Series is
size immutable.
• Note: DataFrame is widely used and is one of the most important data structures. Panel is used much less.
Series
• Series is a one-dimensional array-like structure with homogeneous data, containing a sequence of values of
any data type (int, float, list, string, etc.), which by default have numeric data labels starting from zero.
The data label associated with a particular value is called its index.
For example, the following series is a collection of integers 10, 23, 56, …
Key Points
• Homogeneous data
• Size Immutable
• Values of Data Mutable
The axis labels are collectively called index.
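▪ A pandas Series can be created using the Series() constructor from the pandas module.
Syntax:
pandas.Series(data, index, dtype, copy)
The parameters of the constructor are as follows:
• data – the values to store (list, ndarray, dictionary, scalar, etc.)
• index – the labels for the values; defaults to 0, 1, 2, … when omitted
• dtype – the data type of the values; inferred from data when omitted
• copy – whether to copy the input data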
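A minimal sketch of the constructor with data, index and dtype passed as keyword arguments. The values and labels are illustrative assumptions.

```python
import pandas as pd

# data: the values; index: one label per value; dtype: the type used to store the values
s = pd.Series(data=[1, 2, 3], index=["a", "b", "c"], dtype="float64")

print(s)
print(s.index.tolist())   # ['a', 'b', 'c']
print(s.dtype)            # float64
```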
Series Generation:
A series can be created using various inputs like –
• A sequence [list]
• An array – ndarray
• A Python dictionary – dict()
• A Scalar value or constant
• A Mathematical expression / function
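The last input type in the list above can be sketched as follows. Squaring the index is one possible reading of "mathematical expression", and the names idx and squares are hypothetical.

```python
import numpy as np
import pandas as pd

idx = np.arange(1, 5)                      # 1, 2, 3, 4
squares = pd.Series(idx ** 2, index=idx)   # each value is computed from its label

print(squares)
```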
Create an Empty Series
The most basic series that can be created is an empty series.
Sample Program:
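# import the pandas library with the alias pd
import pandas as pd
# dtype is given explicitly; without it, recent pandas versions create an empty series of dtype object
s = pd.Series(dtype="float64")
print(s)
print("Data Type:", type(s))
Output:
Series([], dtype: float64)
Data Type: <class 'pandas.core.series.Series'>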
Here in the sample program:
• s is the object name.
• Series() is the constructor; with no data it creates an empty series, which is displayed as an empty list along with its data type.
• type() displays the type of the object s, which is the Series class.
Create Series: using the Series() argument data with a sequence [list type] as input.
A list is a heterogeneous sequence (it can mix data types), but a pandas Series holds homogeneous data,
hence the values are converted to one specific type.
import pandas as pd
# Generate a series using the data parameter with a sequence [list] as input
print("Generates series using Parameter data and input -Sequence [list]")
print()
ser1 = pd.Series([10,20,30,40,50])
print("Series is:")
print(ser1)
Create Series: using the Series() arguments data and index, with the input list generated using range().
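A minimal sketch of this case, assuming five letter labels for the index. The name ser2 and the chosen range are illustrative only.

```python
import pandas as pd

# data comes from range(); index supplies one label per generated value
ser2 = pd.Series(range(10, 60, 10), index=["a", "b", "c", "d", "e"])

print("Series is:")
print(ser2)
```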
2. Creation of Series from NumPy Arrays:
import numpy as np # import NumPy with alias np
import pandas as pd
array1 = np.array([1,2,3,4])
series3 = pd.Series(array1)
print(series3)
series4 = pd.Series(array1, index = ["Jan", "Feb", "Mar", "Apr"])
print(series4)
Note: When index labels are passed with the array, the index and the array must have the same
length; otherwise a ValueError is raised.
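The length rule in the note can be demonstrated with a small sketch. The flag mismatch_rejected is a hypothetical name used only to record what happened.

```python
import numpy as np
import pandas as pd

array1 = np.array([1, 2, 3, 4])

try:
    # four values but only three labels: the lengths do not match
    pd.Series(array1, index=["Jan", "Feb", "Mar"])
    mismatch_rejected = False
except ValueError as err:
    mismatch_rejected = True
    print("ValueError:", err)
```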
3. Creation of Series from Dictionary:
Dictionary keys can be used to construct an index for a Series.
import pandas as pd
dict1 = {'India': 'NewDelhi', 'UK': 'London', 'Japan': 'Tokyo'}
print(dict1)    # displays the dictionary: {'India': 'NewDelhi', 'UK': 'London', 'Japan': 'Tokyo'}
series8 = pd.Series(dict1)
print(series8)  # displays the series:
# India    NewDelhi
# UK         London
# Japan       Tokyo
# dtype: object
4. Creation of Series from Scalar Values:
import pandas as pd
# the scalar 50 is repeated once for every label in index
series2 = pd.Series(50, index=[3, 5, 1])
print(series2)
Accessing Elements of a Series:
There are two common ways for accessing the elements of a series: Indexing and Slicing.
(A) Indexing:
Indexes are of two types:
1. positional index and
2. labelled index.
Positional index takes an integer value that corresponds to its position in the series, starting from 0.
• iloc is an indexer that uses position, i.e., row number or column number. It is referred to as position-based indexing.
Labelled index takes any user-defined label as index.
• loc is an indexer that uses names, i.e., row name or column name. It is referred to as name-based indexing.
• Using index values, the elements can be accessed one at a time or as a range.
• A range of elements can be accessed with the indexers iloc[] and loc[], which take square brackets rather than parentheses.
EXAMPLE :
import pandas as pd
seriesNum = pd.Series([10,20,30])
print(seriesNum[2])
seriesMnths = pd.Series([2,3,4],index=["Feb","Mar","Apr"])
print(seriesMnths["Mar"])
seriesCapCntry = pd.Series(['NewDelhi', 'WashingtonDC', 'London', 'Paris'],
                           index=['India', 'USA', 'UK', 'France'])
print(seriesCapCntry['India'] )
print(seriesCapCntry[['UK','USA']])
print(seriesCapCntry.loc[['UK','USA']])   # loc takes a list of labels to select several elements
print(seriesCapCntry.iloc[0:2])
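The difference between label-based and position-based access is easiest to see when the labels are themselves integers. A minimal sketch, assuming a series with the labels 3, 5 and 1 (the values are made up):

```python
import pandas as pd

s = pd.Series([50, 60, 70], index=[3, 5, 1])

print(s.loc[3])    # label 3    -> 50
print(s.iloc[0])   # position 0 -> 50
print(s.loc[1])    # label 1    -> 70
print(s.iloc[1])   # position 1 -> 60
```

Here loc[1] and iloc[1] return different elements because the label 1 sits at position 2.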
Slicing:
We can define which part of the series is to be sliced by specifying the start and end parameters [start:end] with
the series name.
When we use positional indices for slicing, the value at the end index position is excluded, i.e., only (end -
start) data values of the series are extracted.
We can also use slicing to modify the values of series elements.
Updating the values in a series using positional slicing also excludes the value at the end index position. However, it changes the
value at the end index label when slicing is done using labels.
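import numpy as np
import pandas as pd
seriesAlph = pd.Series(np.arange(10,16,1), index = ['a', 'b', 'c', 'd', 'e', 'f'])
seriesAlph[1:3] = 50        # positional slice: updates positions 1 and 2 only
seriesAlph['c':'e'] = 500   # label slice: updates c, d and e (end label included)
print(seriesAlph)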
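The end-exclusion rule can be compared side by side using the explicit iloc and loc indexers. This is a sketch on a fresh copy of the series; the names by_position and by_label are illustrative.

```python
import numpy as np
import pandas as pd

seriesAlph = pd.Series(np.arange(10, 16, 1), index=["a", "b", "c", "d", "e", "f"])

by_position = seriesAlph.iloc[1:3]   # positions 1 and 2; position 3 is excluded
by_label = seriesAlph.loc["c":"e"]   # labels c, d and e; the end label is included

print(by_position.tolist())   # [11, 12]
print(by_label.tolist())      # [12, 13, 14]
```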
Attributes of Series:
Example:
import pandas as pd
sr = pd.Series(range(1,15,3), index = [x for x in 'abcde'])
print("ATTRIBUTES IN SERIES")
print("Is Series Empty:")
print("sr.empty:", sr.empty)
print()
print("sr.index:", sr.index)
print()
print("sr.values:", sr.values)
print()
print("sr.shape:", sr.shape)
print()
print("sr.size:", sr.size)
print()
print("sr.nbytes:", sr.nbytes)
print()
print("sr.ndim:", sr.ndim)
print()
print("sr.item:", sr.item)
print()
print("sr.hasnans:", sr.hasnans)
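Note: item is a method rather than an attribute, so sr.item printed without parentheses shows the bound method itself.
Output: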
ATTRIBUTES IN SERIES
Is Series Empty:
sr.empty: False
sr.index: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
sr.values: [ 1 4 7 10 13]
sr.shape: (5,)
sr.size: 5
sr.nbytes: 40
sr.ndim: 1
sr.item: <bound method IndexOpsMixin.item of a 1
b 4
c 7
d 10
e 13
dtype: int64>
sr.hasnans: False
Retrieving values from a series:
- The values are retrieved using the head(), tail() and count() functions.
- Using the index, the items in the series can also be accessed (retrieved).
head() – returns the first five elements of the series
head(n) – returns the first n elements of the series
tail() – returns the last five elements of the series
tail(n) – returns the last n elements of the series
count() – returns the number of non-NaN values in the series
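A minimal sketch of these functions on a series that contains one missing value. The data is made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4, 5, 6, 7])

print(s.head())     # first five elements
print(s.head(2))    # first two elements
print(s.tail(3))    # last three elements
print(s.count())    # 6, because the NaN entry is not counted
```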
DATA STRUCTURE IN PANDAS – SERIES [ONE DIMENSION] Operations:
➢The Series structure supports various operations, such as:
✓Basic Arithmetic operations [ +, -, *, / ]
✓Vector operations
✓Retrieving values based on conditions.
✓Deletion of elements
Basic Arithmetic operations [ +, -, *, / ]
• The operation is done between two series objects as operands.
• The operation is applied to values that share the same index label.
• If the indexes differ, the result contains the union of both indexes, and any label present in only one series gets NaN.
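Index alignment can be seen in a small sketch with two partly overlapping indexes. The series a and b are hypothetical.

```python
import pandas as pd

a = pd.Series([10, 20, 30], index=[1, 2, 3])
b = pd.Series([1, 2, 3], index=[2, 3, 4])

total = a + b   # values are matched by index label, not by position
print(total)
# Labels 1 and 4 appear in only one series, so their results are NaN
```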
Example:
import pandas as pd
sr1 = pd.Series([10,20,30],index = [1,2,3])
sr2 = pd.Series([5,10,15], index=[1,2,3])
sr3 = pd.Series(range(100,150,10), index= [x for x in range(1,6)])
print("Addition:")
print("sr1 + sr2", sr1 + sr2)
print()
print("sr1 + sr3", sr1 + sr3)
print()
print("Subtraction:")
print("sr2 - sr1", sr2 - sr1)
print()
print("Multiplication:")
print("sr1 * sr2", sr1 * sr2)
print()
print("Division:")
print("sr1 / sr2", sr1 / sr2)
print()
Output:
Vector operations: [ +, -, *, /, <, >, <=, >=, !=, == ]
• The operation is implemented at the element level.
• All arithmetic and relational operations can be done.
• One operand is a series object and the other is a numeric / string literal (constant).
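Because a relational operation returns one True/False result per element, it can also be used to retrieve values based on a condition. A minimal sketch, with mask and selected as illustrative names:

```python
import pandas as pd

s = pd.Series([11, 12, 13, 14])

mask = s > 12        # element-wise comparison gives a boolean Series
selected = s[mask]   # keeps only the elements where the mask is True

print(mask.tolist())       # [False, False, True, True]
print(selected.tolist())   # [13, 14]
```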
Example:
import pandas as pd
s=pd.Series([11,12,13,14])
print(s+2)
print(s>13)
print(s**2)
Output:
Deleting elements from a series:
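➢Using the drop() function, an element can be deleted from a series by passing its index label to the function.
➢Syntax:
seriesname.drop(index, inplace=True/False)
Note: inplace=True removes the element permanently from the series; with the default inplace=False, drop() returns a new series and leaves the original unchanged.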
Example
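import pandas as pd
s = pd.Series([11, 12, 13, 14])
s.drop(2)                 # returns a new series; s itself is unchanged
print(s)                  # still shows all four elements
s.drop(2, inplace=True)   # removes the element with index label 2 from s itself
print(s)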