Why PANDAS?
Pandas is one of the most popular Python libraries for data analysis.
It can perform the following tasks:
a) It can read and write data in many different file formats (CSV, Excel, JSON etc.)
and work with many data types (int, float, string etc.)
b) It can perform calculations across rows and columns, the way the data is organized.
c) It can easily select subsets of data from large datasets.
d) It can find and fill missing data.
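The tasks listed above can be sketched in a few lines. This is only a minimal illustration, not a complete workflow: the column names and sample values are made up for the example, and the CSV text is read from an in-memory string rather than from a file.

```python
import io
import pandas as pd

# Hypothetical sample data; in practice this would usually be a file on disk.
csv_text = "name,maths,science\nAmit,80,75\nSneha,90,\nManya,70,85\n"

# a) Read data from CSV text (pd.read_csv also accepts a file path)
df = pd.read_csv(io.StringIO(csv_text))

# d) Find and fill missing data
missing_count = int(df["science"].isna().sum())
df["science"] = df["science"].fillna(0)

# b) Calculate along columns and along rows
column_means = df[["maths", "science"]].mean()      # one value per column
df["total"] = df[["maths", "science"]].sum(axis=1)  # one value per row

# c) Select a subset of the data
high_scorers = df[df["total"] > 150]

# a) Write the result back out as CSV text
csv_out = df.to_csv(index=False)

print(missing_count)
print(high_scorers)
```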
DATA STRUCTURES IN PANDAS
A data structure is a way of arranging data so that it can be accessed quickly
and so that operations such as retrieval, deletion and modification can be
performed on it.
Pandas provides two main data structures:
1. Series
2. DataFrame
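As a rough sketch of how the two structures differ, the example below builds one of each and compares their dimensions. The labels and values are arbitrary and chosen only for illustration.

```python
import pandas as pd

# A Series holds one column of labelled values (one dimension).
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame holds rows and columns (two dimensions);
# each of its columns is itself a Series.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}, index=["a", "b", "c"])

print(s.ndim, s.shape)    # number of dimensions and shape of the Series
print(df.ndim, df.shape)  # number of dimensions and shape of the DataFrame
print(type(df["x"]))      # a single column is a Series
```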
SERIES
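A Series is a one-dimensional, array-like structure with homogeneous data, which
can be used to handle and manipulate data. What makes it special is its index,
which attaches a label to every value and allows values to be looked up by
label. The index can be replaced as a whole with your own labels, but the
entries of an existing index cannot be modified in place.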
It has two parts:
1. Data part (An array of actual data)
2. Associated index with data (associated array of indexes or data labels)
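The two parts of a series can be inspected separately through its values and index attributes. The short sketch below uses made-up month data. It also shows that the index can be replaced as a whole, while changing a single entry of an existing index raises an error.

```python
import pandas as pd

s = pd.Series([31, 28, 31], index=["Jan", "Feb", "Mar"])

print(s.values)  # the data part, as a NumPy array
print(s.index)   # the index part, as a pandas Index object

# The labels can be replaced as a whole by assigning a new index
s.index = ["January", "February", "March"]
print(s)

# A single entry of an existing index cannot be changed in place
in_place_failed = False
try:
    s.index[0] = "Jan"
except TypeError as err:
    in_place_failed = True
    print("Index entries are read-only:", err)
```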
Creating a Series
We know that a list is a one-dimensional data type that can hold values of any
type and has indices. A list can be converted into a series using the Series()
constructor. The basic way to create a series is:
s = pd.Series(data, index=index)
The data can be a Python list, a dictionary, an ndarray or a scalar value.
If data is an ndarray, index must be the same length as data. The index is the
set of labels displayed alongside the respective data. If no index is passed,
pandas assigns a default index ranging from 0 to len(data)-1. If you want,
you can specify your own index for the data.
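To illustrate the list and dictionary cases, here is a small sketch with arbitrary sample values. With a dictionary, the keys become the index labels. If an index is also passed, values are matched by label, and labels that are not in the dictionary get NaN.

```python
import pandas as pd

# From a list: the default index 0, 1, 2, ... is assigned
from_list = pd.Series([10, 15, 18])

# From a dictionary: the keys become the index labels
from_dict = pd.Series({"first": 10, "second": 15, "third": 18})

# From a dictionary with an explicit index: values are matched by label,
# and labels missing from the dictionary get NaN
subset = pd.Series({"first": 10, "second": 15, "third": 18},
                   index=["third", "first", "fifth"])

print(from_list)
print(from_dict)
print(subset)
```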
#How to create an empty series in pandas:
import pandas as pd
s=pd.Series(dtype="float64")  # an explicit dtype avoids a warning in some pandas versions
print(s)
print("")
Output
#How to create a series from an ndarray:
import pandas as pd
import numpy as np
arr=np.array([10,15,18,22])
s = pd.Series(arr)
print(s)
print("")
Output
#Give our own indexes:
#Method 1
import pandas as pd
import numpy as np
arr=np.array([10,15,18,22])
s = pd.Series(arr, index=['first','second','third','fourth'])
print(s)
print("")
Output
#Method 2: pass the data as a keyword argument
import pandas as pd
import numpy as np
arr=np.array([10,15,18,22])
s = pd.Series(data=arr, index=['first','second','third','fourth'])
print(s)
print("")
Output
#Method 3: keyword arguments can be given in any order
import pandas as pd
import numpy as np
arr=np.array([10,15,18,22])
s = pd.Series(index=['first','second','third','fourth'], data=arr)
print(s)
print("")
Output
#Display indexes
import pandas as pd
import numpy as np
arr=np.array([10,15,18,22])
s = pd.Series(arr, index=['first','second','third','fourth'])
print(s.index)
print("")
Output
#Data type of the series
import pandas as pd
import numpy as np
arr=np.array([10,15,18,22])
s = pd.Series(arr, index=['first','second','third','fourth'])
print(type(arr))
print(type(s))
print("")
Output
#Create a series starting from 1 in steps of 3, with alphabetic index labels
import pandas as pd
s=pd.Series(range(1,15,3), index=list('ABCDE'))
print(s)
print("")
Output
#Creating Series with Scalar Values
When data is a scalar value, the value is repeated to match the length of the index.
import pandas as pd
s = pd.Series(30, index=['first','second','third','fourth'])
print(s)
print("")
Output
import pandas as pd
s=pd.Series(100, index=range(4))
print(s)
print("")
Output
import pandas as pd
s=pd.Series(10, index=range(0,1))
print(s)
print("")
Output
import pandas as pd
s=pd.Series(10, index=range(2,7,2))
print(s)
Output
Mathematical operations can be performed on a series using scalars and
functions. You can use simple operators such as +, -, *, / etc.
Creating Series with mathematical Operations
import numpy as np
import pandas as pd
n= np.arange(9,13)
s=pd.Series(index=n, data=n*2)
print(s)
Output
In the above example n is an array, so n*2 multiplies every value by 2.
import pandas as pd
n=[9,10,11,12]
s=pd.Series(data=n*2)
print(s)
Output
As n is a list here, n*2 repeats the list, so the series contains the values
twice. Also, as no index is given, the default index starting from 0 is used.
Increasing each value by 5
import pandas as pd
import numpy as np
Marks = np.array([456, 478, 467, 477, 405])
Names = np.array(['Amit', 'Sneha', 'Manya', 'Pari','Lavanya'])
M = pd.Series(Marks, index= Names)
print(M)
print("Increase the Marks by 5")
for label, value in M.items():
    M.at[label] = value + 5  # increases each value by 5
print(M)
Output
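The same increase can also be written without an explicit loop, because adding a scalar to a series applies to every element. This is an alternative sketch using the same data as the example above, not a replacement for it.

```python
import numpy as np
import pandas as pd

Marks = np.array([456, 478, 467, 477, 405])
Names = np.array(['Amit', 'Sneha', 'Manya', 'Pari', 'Lavanya'])
M = pd.Series(Marks, index=Names)

# Adding a scalar applies to every element; the index is preserved
M = M + 5
print(M)
```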
Vector Operations
Series support element-wise vector operations. When a number is added to a
series, the number is added to each element of the series.
For example:
import pandas as pd
import numpy as np
List_Var = [1, 2, 3, 4, 5] # a list with 5 elements
Series_Var = pd.Series(List_Var)
print(Series_Var)
print(" ")
# Add 5 to each value of the series
print("Add 5")
print(Series_Var + 5)
# Raise each value of the series to the power 3
print("Mathematical Power of 3")
print(Series_Var**3)
# Add each value of the series to itself
print("Add value to itself")
print(Series_Var+Series_Var)
Output
Indexing, Slicing and Accessing Data from Series
import pandas as pd
import numpy as np
months = ['Jan','Feb','Mar','June']
days = [31, 28, 31, 30]
S = pd.Series(days, index=months)
print(S)
#Printing the first element (by position; S[0] is deprecated for a label index)
print(S.iloc[0])
#Printing the first two elements
print(S[:2])
#Printing the 2nd and 3rd elements (the stop position 3 is excluded)
print(S[1:3])
#Printing the last two elements
print(S[-2:])
#Printing the value corresponding to an index label
print(S['Jan'])
#Printing multiple values corresponding to index labels
print(S[['Feb','June']])
#Printing a slice by index labels (both end labels are included)
print(S['Feb' : 'June'])
#Printing the Boolean mask
print(S>=31)
#Printing the data using Boolean indexing
print(S[S>=31])
Note: Here, S>=31 returns a Series of True/False values, which we then pass to
our Series S to select only the items where the value is True.
Output
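Positional and label-based access can be made explicit with iloc and loc. The sketch below uses the same month data as the example above. It shows that a positional slice excludes its stop position, while a label slice includes its stop label.

```python
import pandas as pd

S = pd.Series([31, 28, 31, 30], index=['Jan', 'Feb', 'Mar', 'June'])

first = S.iloc[0]                # access by position
by_pos = S.iloc[1:3]             # positions 1 and 2; the stop position is excluded
by_label = S.loc['Feb':'June']   # Feb, Mar and June; the stop label is included

print(first)
print(by_pos)
print(by_label)
```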
HEAD and TAIL functions in Series
head():
Returns the first 5 rows of a series by default.
tail():
Returns the last 5 rows of a series by default.
Note: To access a particular number of records from the beginning or the end,
pass that number as the argument to the function.
import pandas as pd
import numpy as np
months = ['Jan','Feb','Mar','Apr','May','June','Jul']
days = [31, 28, 31, 30, 31, 30, 31]
S = pd.Series(days, index=months)
print(S)
print(S.head()) #Default first 5 rows of a series
print(S.tail()) #Default last 5 rows of a series
print("")
print(S.head(2)) #Prints the first 2 rows of a series
print(S.tail(2)) #Prints the last 2 rows of a series
Output
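As a small supplementary sketch using the same data, head(n) and tail(n) select the same rows as positional slices with iloc. Asking for more rows than the series contains simply returns all of them.

```python
import pandas as pd

S = pd.Series([31, 28, 31, 30, 31, 30, 31],
              index=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'June', 'Jul'])

same_as_head = S.iloc[:2]   # same rows as S.head(2)
same_as_tail = S.iloc[-2:]  # same rows as S.tail(2)
all_rows = S.head(10)       # more rows requested than exist: all rows are returned

print(same_as_head)
print(same_as_tail)
print(len(all_rows))
```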