GeeksforGeeks
Pandas
In [6]: import numpy as np
import pandas as pd
Table of Contents

1. Working with Pandas Series
   a) Creating Series
      - Series through list
      - Series through NumPy array
      - Setting up our own index
      - Series through dictionary
      - Using the repeat function along with creating a Series
      - Accessing data from a Series
   b) Aggregate functions on a Pandas Series
   c) Series Absolute Function
   d) Appending Series
   e) Astype Function
   f) Between Function
   g) String functions that can be used to extract or modify text in a Series
      - Upper and Lower Function
      - Len Function
      - Strip Function
      - Split Function
      - Contains Function
      - Replace Function
      - Count Function
      - Startswith and Endswith Function
      - Find Function
   h) Converting a Series to a List
2. Detailed Coding Implementations on Pandas DataFrame
   a) Creating DataFrames
   b) Slicing in DataFrames Using loc and iloc
      - Basic loc Operations
      - Basic iloc Operations
      - Slicing Using Conditions
   c) Column Addition in DataFrames
      - Using a List
      - Using a Pandas Series
      - Using an Existing Column
   d) Deleting Columns in a DataFrame
      - Using del
      - Using the pop function
   e) Addition of Rows
   f) Drop Function
   g) Transposing a DataFrame
   h) A Set of More DataFrame Functionalities
      - axes, ndim, dtypes, shape, head, tail, and empty functions
   i) Statistical or Mathematical Functions
      - Sum, Mean, Median, Mode, Variance, Min, Max, Standard Deviation
   j) Describe Function
   k) Pipe Functions
      - Pipe Function
      - Apply Function
      - Applymap Function
   l) Reindex Function
   m) Renaming Columns in a Pandas DataFrame
   n) Sorting in a Pandas DataFrame
   o) Groupby Functions
      - Adding Statistical Computation on groupby
      - Using the Filter Function with groupby
3. Working with CSV Files and Basic Data Analysis Using Pandas
   a) Reading CSV
   b) Info Function
   c) isnull() Function
   d) Quantile Function
   e) Copy Function
   f) Value Counts Function
   g) Unique and Nunique Functions
   h) dropna() Function
   i) fillna() Function
   j) sample Function
   k) to_csv() Function
4. A Detailed Pandas Profile Report
1. Working with Pandas Series
a) Creating Series
Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer,
string, float, python objects, etc.). The axis labels are collectively called index. Labels need not
be unique but must be a hashable type. The object supports both integer and label-based
indexing and provides a host of methods for performing operations involving the index.
Series through list

In [ ]: lst = [1, 2, 3, 4, 5]
pd.Series(lst)
Series through Numpy array
In [ ]: arr = np.array([1, 2, 3, 4, 5])
pd.Series(arr)

Setting up our own index
In [12]: pd.Series(index = ['Eshant', 'Pranjal', 'Jayesh', 'Ashish'], data = [1, 2, 3, 4])
Out[12]: Eshant     1
Pranjal    2
Jayesh     3
Ashish     4
dtype: int64
Series through Dictionary values.
In [15]: steps = {'day1' : 4000, 'day2' : 3000, 'day3' : 12000}
pd.Series(steps)
Out[15]: day1     4000
day2     3000
day3    12000
dtype: int64
Using repeat function along with creating a Series
Pandas Series.repeat() repeats the elements of a Series. It returns a new Series where each element of the current Series is repeated consecutively a given number of times.
In [19]: pd.Series(5).repeat(3)
Out[19]: 0    5
0    5
0    5
dtype: int64
We can use the reset_index function to make the index sequential again:
In [27]: pd.Series(5).repeat(3).reset_index(drop = True)
Out[27]: 0    5
1    5
2    5
dtype: int64
This code indicates:
+ 10 should be repeated 5 times and
+ 20 should be repeated 2 times
In [29]: s = pd.Series([10, 20]).repeat([5, 2]).reset_index(drop = True)
s
Out[29]: 0    10
1    10
2    10
3    10
4    10
5    20
6    20
dtype: int64
Accessing elements
In [34]: s[4]
Out[34]: 10

Something like s[50] would not work, because elements can only be accessed through the index labels the Series actually has (0 to 6 here).

In [38]: s[6]
Out[38]: 20
Slicing with negative indices (start to end-1):

In [49]: s[2:-2]
Out[49]: 2    10
3    10
4    10
dtype: int64
b) Aggregate functions on a Pandas Series

Pandas Series.aggregate() aggregates using one or more operations over the specified axis of the given Series object.
In [58]: sr = pd.Series([1, 2, 3, 4, 5, 6, 7])
sr.agg([min, max, sum])
Out[58]: min     1
max     7
sum    28
dtype: int64
c) Series absolute function
Pandas Series.abs() method is used to get the absolute numeric value of each element in
Series/DataFrame.
In [60]: sr = pd.Series([1, -2, 3, -4, 5, -6, 7])
sr.abs()
Out[60]: 0    1
1    2
2    3
3    4
4    5
5    6
6    7
dtype: int64
d) Appending Series
Pandas Series.append() is used to concatenate two or more Series objects.

Syntax: Series.append(to_append, ignore_index=False, verify_integrity=False)

Parameters: to_append : Series or list/tuple of Series. ignore_index : If True, do not use the index labels. verify_integrity : If True, raise an Exception on creating an index with duplicates.
In [67]: sr1 = pd.Series([1, -2, 3])
sr2 = pd.Series([1, 2, 3])
sr3 = sr2.append(sr1)
sr3
Out[67]: 0    1
1    2
2    3
0    1
1   -2
2    3
dtype: int64
To make the index accurate:
In [71]: sr3.reset_index(drop = True)
Out[71]: 0    1
1    2
2    3
3    1
4   -2
5    3
dtype: int64
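Note that Series.append() has been removed in recent pandas releases (2.0 and later); pd.concat is the replacement. A minimal sketch of the same concatenation under that assumption:

import pandas as pd

sr1 = pd.Series([1, -2, 3])
sr2 = pd.Series([1, 2, 3])

# pd.concat stacks the two Series; ignore_index=True rebuilds a clean 0..n-1 index
sr3 = pd.concat([sr2, sr1], ignore_index=True)
print(sr3)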
e) Astype function
Pandas astype() is one of the most important methods. It is used to change the data type of a Series. When a DataFrame is made from a CSV file, the columns are imported and the data type is set automatically, which is often not what it actually should be.
In [75]: sr1
Out[75]: 0    1
1   -2
2    3
dtype: int64

+ You can see that int64 is mentioned as the dtype

In [76]: type(sr1[0])
Out[76]: numpy.int64

In [80]: sr1.astype('float')
Out[80]: 0    1.0
1   -2.0
2    3.0
dtype: float64

+ Now you can see the dtype has changed to float64
f) Between Function

The Pandas between() method is used on a Series to check which values lie between the first and second arguments.
In [86]: sr1 = pd.Series([1, 2, 30, 4, 5, 6, 7, 8, 9, 20])
sr1
Out[86]: 0     1
1     2
2    30
3     4
4     5
5     6
6     7
7     8
8     9
9    20
dtype: int64
In [87]: sr1.between(10, 50)
Out[87]: 0    False
1    False
2     True
3    False
4    False
5    False
6    False
7    False
8    False
9     True
dtype: bool
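In practice, between() is mostly used as a boolean mask to filter the Series itself. A small sketch using the sr1 defined above; the inclusive parameter shown here is a standard pandas option (pandas 1.3+), not part of the original notebook:

import pandas as pd

sr1 = pd.Series([1, 2, 30, 4, 5, 6, 7, 8, 9, 20])

# keep only the values lying between 10 and 50 (both ends inclusive by default)
print(sr1[sr1.between(10, 50)])

# exclude the boundary values instead
print(sr1[sr1.between(10, 50, inclusive='neither')])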
g) String functions can be used to extract or modify text in a Series

+ Upper and Lower Function
+ Len Function
+ Strip Function
+ Split Function
+ Contains Function
+ Replace Function
+ Count Function
+ Startswith and Endswith Function
+ Find Function
In [ ]: ser = pd.Series(["Eshant Das", "Data Science", "Geeks for Geeks", "Hello World", "Machine Learning"])
Upper and Lower Function
In [ ]: print(ser.str.upper())
print('-' * 30)
print(ser.str.lower())

0          ESHANT DAS
1        DATA SCIENCE
2     GEEKS FOR GEEKS
3         HELLO WORLD
4    MACHINE LEARNING
dtype: object
0          eshant das
1        data science
2     geeks for geeks
3         hello world
4    machine learning
dtype: object
Length function
In [94]: for i in ser:
    print(len(i))

10
12
15
11
16

Strip Function

In [95]: ser = pd.Series(["  Eshant Das", "Data Science", "Geeks for Geeks", "Hello World", "  Machine Learning"])
for i in ser:
    print(i, len(i))

  Eshant Das 12
Data Science 12
Geeks for Geeks 15
Hello World 11
  Machine Learning 18

+ The 2 extra spaces are removed by strip:

In [96]: ser = ser.str.strip()
for i in ser:
    print(i, len(i))

Eshant Das 10
Data Science 12
Geeks for Geeks 15
Hello World 11
Machine Learning 16

Split Function

In [108]: ser.str.split()
Out[108]: 0          [Eshant, Das]
1        [Data, Science]
2    [Geeks, for, Geeks]
3         [Hello, World]
4    [Machine, Learning]
dtype: object

+ To get the split result for the first element of the series
In [109]: ser.str.split()[0]
Out[109]: ['Eshant', 'Das']

+ For the second element

In [110]: ser.str.split()[1]
Out[110]: ['Data', 'Science']
Contains Function
In [126]: ser = pd.Series(["Eshant Das", "Data@Science", "Geeks for Geeks", "Hello@World", "Machine Learning"])
ser.str.contains('@')
Out[126]: 0    False
1     True
2    False
3     True
4    False
dtype: bool
Replace Function
In [127]: ser.str.replace('@', ' ')
Out[127]: 0          Eshant Das
1        Data Science
2     Geeks for Geeks
3         Hello World
4    Machine Learning
dtype: object
Count Function
In [128]: ser.str.count('a')
Out[128]: 0    2
1    2
2    0
3    0
4    2
dtype: int64
startswith and endswith
In [129]: ser.str.startswith('D')
Out[129]: 0    False
1     True
2    False
3    False
4    False
dtype: bool

In [130]: ser.str.endswith('s')
Out[130]: 0     True
1    False
2     True
3    False
4    False
dtype: bool
Find Function
In [133]: ser.str.find('Geeks')
Out[133]: 0   -1
1   -1
2    0
3   -1
4   -1
dtype: int64
h) Converting a Series to List
Pandas tolist() is used to convert a Series to a list. Initially the object is of type pandas.core.series.Series.
In [137]: ser.to_list()
Out[137]: ['Eshant Das',
 'Data@Science',
 'Geeks for Geeks',
 'Hello@World',
 'Machine Learning']
2. Detailed Coding Implementations on Pandas
DataFrame
A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. A Pandas DataFrame consists of three principal components: the data, the rows, and the columns.
[Figure: a sample DataFrame of NBA players (Name, Team, Number, Position, Age) with its rows, columns, and data labeled]
a) Creating Data Frames
In the real world, a Pandas DataFrame will be created by loading datasets from existing storage; the storage can be a SQL database, a CSV file, or an Excel file. A Pandas DataFrame can also be created from lists, a dictionary, a list of dictionaries, etc. A DataFrame can be created in different ways; here are some of them:

Creating a DataFrame using a list:

A DataFrame can be created using a single list or a list of lists.
In [162]: lst = ['Geeks', 'For', 'Geeks', 'is', 'portal', 'for', 'Geeks']
pd.DataFrame(lst)
Out[162]:         0
0   Geeks
1     For
2   Geeks
3      is
4  portal
5     for
6   Geeks
In [163]: lst = [['tom', 10], ['jerry', 12], ['spike', 14]]
pd.DataFrame(lst)
Out[163]:        0   1
0    tom  10
1  jerry  12
2  spike  14
Creating a DataFrame from a dict of ndarrays/lists:

To create a DataFrame from a dict of ndarrays/lists, all the ndarrays must be of the same length. If an index is passed, then the length of the index should be equal to the length of the arrays. If no index is passed, then by default the index will be range(n), where n is the array length.
In [166]: data = {'name' : ['Tom', 'nick', 'krish', 'jack'], 'age' : [20, 21, 19, 18]}
pd.DataFrame(data)
Out[166]:     name  age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18
A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. We can perform basic operations on rows/columns like selecting, deleting, adding, and renaming.

Column Selection: in order to select a column in a Pandas DataFrame, we can access the columns by calling them by their column names.
In [169]: data = {'Name'          : ['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age'           : [27, 24, 22, 32],
'Address'       : ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification' : ['Msc', 'MA', 'MCA', 'Phd']}
df = pd.DataFrame(data)
df[['Name', 'Qualification']]
Out[169]:      Name Qualification
0     Jai           Msc
1  Princi            MA
2  Gaurav           MCA
3    Anuj           Phd
b) Slicing in DataFrames Using iloc and loc
Pandas comprises many methods for its proper functioning. loc() and iloc() are two of those methods. They are used for slicing data from a Pandas DataFrame, they help in the convenient selection of data from the DataFrame, and they are used for filtering the data according to some conditions.
In [171]: data = {'one'   : pd.Series([1, 2, 3, 4]),
'two'   : pd.Series([10, 20, 30, 40]),
'three' : pd.Series([100, 200, 300, 400]),
'four'  : pd.Series([1000, 2000, 3000, 4000])}
df = pd.DataFrame(data)
df
Out[171]:    one  two  three  four
0    1   10    100  1000
1    2   20    200  2000
2    3   30    300  3000
3    4   40    400  4000
Basic loc Operations
The loc() function is a label-based data selecting method, which means that we have to pass the name of the row or column we want to select. This method includes the last element of the range passed to it, unlike iloc(). loc() can also accept boolean data, unlike iloc(). Many operations can be performed using the loc() method, for example:
In [180]: df.loc[1:2, 'two' : 'three']
Out[180]:    two  three
1   20    200
2   30    300
Basic iloc Operations
The iloc() function is an index-based selecting method, which means that we have to pass an integer index to the method to select a specific row/column. This method does not include the last element of the range passed to it, unlike loc(). iloc() does not accept boolean data, unlike loc().
In [192]: df.iloc[1 : -1, 1 : -1]
Out[192]:    two  three
1   20    200
2   30    300

+ You can see that index 3 (the last row and column) has not been included here, so 1 was inclusive but the endpoint is exclusive in the case of iloc
Let's see another example
In [195]: df.iloc[:, 2:3]
Out[195]:    three
0    100
1    200
2    300
3    400
Selecting Specific Rows and Columns

In [197]: df.iloc[[0, 2], [1, 3]]
Out[197]:    two  four
0   10  1000
2   30  3000
Slicing Using Conditions

Slicing with conditions basically works with loc.
In [204]: df.loc[df['two'] > 20, ['three', 'four']]
Out[204]:    three  four
2    300  3000
3    400  4000

+ So we could extract only those rows for which the value of 'two' is more than 20
+ After the comma (,) we have listed the specific columns to extract, which are 'three' and 'four'
Let's see another example
In [207]: df.loc[df['three'] < 300, ['one', 'four']]
Out[207]:    one  four
0    1  1000
1    2  2000

+ You can interpret this code in the same way as the previous one
c) Column Addition in DataFrame
In [208]: df
Out[208]:    one  two  three  four
0    1   10    100  1000
1    2   20    200  2000
2    3   30    300  3000
3    4   40    400  4000
We can add a column in many ways. Let us discuss three ways to add a column here:

+ Using a list
+ Using a Pandas Series
+ Using an existing column (we can modify that column in the way we want, and the modified result can be added as a new column)
In [210]: lst = [22, 33, 44, 55]
df['five'] = lst
df
Out[210]:    one  two  three  four  five
0    1   10    100  1000    22
1    2   20    200  2000    33
2    3   30    300  3000    44
3    4   40    400  4000    55
In [211]: sr = pd.Series([111, 222, 333, 444])
df['six'] = sr
df
Out[211]:    one  two  three  four  five  six
0    1   10    100  1000    22  111
1    2   20    200  2000    33  222
2    3   30    300  3000    44  333
3    4   40    400  4000    55  444
Using an existing Column
In [218]: df['seven'] = df['one'] + 10
df
Out[218]:    one  two  three  four  five  six  seven
0    1   10    100  1000    22  111     11
1    2   20    200  2000    33  222     12
2    3   30    300  3000    44  333     13
3    4   40    400  4000    55  444     14

+ Now we can see that column 'seven' has all the values of column 'one' incremented by 10
d) Column Deletion in Dataframes
In [217]: df
Out[217]:    one  two  three  four  five  six  seven
0    1   10    100  1000    22  111     11
1    2   20    200  2000    33  222     12
2    3   30    300  3000    44  333     13
3    4   40    400  4000    55  444     14

Using del

+ You can see below that the column named 'six' has been deleted

In [218]: del df['six']
df
Out[218]:    one  two  three  four  five  seven
0    1   10    100  1000    22     11
1    2   20    200  2000    33     12
2    3   30    300  3000    44     13
3    4   40    400  4000    55     14

Using pop

+ You can see that the column 'five' has also been deleted from our DataFrame

In [220]: df.pop('five')
df
Out[220]:    one  two  three  four  seven
0    1   10    100  1000     11
1    2   20    200  2000     12
2    3   30    300  3000     13
3    4   40    400  4000     14
e) Addition of rows
In a Pandas DataFrame, you can add rows by using the append method. You can also create a
new DataFrame with the desired row values and use the append to add the new row to the
original dataframe. Here's an example of adding a single row to a dataframe:
In [228]: df1 = pd.DataFrame([[1, 2], [3, 4]], columns = ['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a', 'b'])
df3 = df1.append(df2).reset_index(drop = True)
df3
Out[228]:    a  b
0  1  2
1  3  4
2  5  6
3  7  8
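As with Series.append() earlier, DataFrame.append() has been removed in pandas 2.0 and later; a minimal sketch of the same row addition with pd.concat, assuming a recent pandas version:

import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

# concatenate row-wise and rebuild a clean integer index
df3 = pd.concat([df1, df2], ignore_index=True)
print(df3)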
f) Pandas drop function
Python is a great language for doing data analysis, primarily because of the fantastic ecosystem
of data-centric Python packages. Pandas is one of those packages and makes importing and
analyzing data much easier.
Pandas provide data analysts a way to delete and filter data frame using drop() method. Rows
or columns can be removed using index label or column name using this method.
Syntax: DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

Parameters:

labels: String or list of strings referring to row or column names. axis: int or string value, 0/'index' for rows and 1/'columns' for columns. index or columns: single label or list; index or columns are an alternative to axis and cannot be used together. level: used to specify the level in case the DataFrame has a multi-level index. inplace: makes changes in the original DataFrame if True. errors: ignores errors if any value from the list doesn't exist and drops the rest of the values when errors='ignore'.

Return type: DataFrame with dropped values
In [240]: data = {'one'   : pd.Series([1, 2, 3, 4]),
'two'   : pd.Series([10, 20, 30, 40]),
'three' : pd.Series([100, 200, 300, 400]),
'four'  : pd.Series([1000, 2000, 3000, 4000])}
df = pd.DataFrame(data)
df
Out[240]:    one  two  three  four
0    1   10    100  1000
1    2   20    200  2000
2    3   30    300  3000
3    4   40    400  4000
+ axis = 0 : Rows (row wise)

In [241]: df.drop([0, 1], axis = 0, inplace = True)
df
Out[241]:    one  two  three  four
2    3   30    300  3000
3    4   40    400  4000
+ axis = 1 : Columns (column wise)

In [242]: df.drop(['one', 'three'], axis = 1, inplace = True)
df
Out[242]:    two  four
2   30  3000
3   40  4000
g) Transposing a DataFrame
The T attribute in a Pandas DataFrame is used to transpose the dataframe, ie., to flip the rows
and columns. The result of transposing a dataframe is a new dataframe with the original rows
as columns and the original columns as rows.
Here's an example to illustrate the use of the .T attribute:
In [243]: data = {'one'   : pd.Series([1, 2, 3, 4]),
'two'   : pd.Series([10, 20, 30, 40]),
'three' : pd.Series([100, 200, 300, 400]),
'four'  : pd.Series([1000, 2000, 3000, 4000])}
df = pd.DataFrame(data)
df
Out[243]:    one  two  three  four
0    1   10    100  1000
1    2   20    200  2000
2    3   30    300  3000
3    4   40    400  4000

In [244]: df.T
Out[244]:          0     1     2     3
one        1     2     3     4
two       10    20    30    40
three    100   200   300   400
four    1000  2000  3000  4000
h) A set of more DataFrame Functionalities
In [245]: df
Out[245]:    one  two  three  four
0    1   10    100  1000
1    2   20    200  2000
2    3   30    300  3000
3    4   40    400  4000
1. axes function
The .axes attribute in a Pandas DataFrame returns a list with the row and column labels of the
DataFrame. The first element of the list is the row labels (index), and the second element is the
column labels.
In [246]: df.axes
Out[246]: [RangeIndex(start=0, stop=4, step=1),
 Index(['one', 'two', 'three', 'four'], dtype='object')]
2. ndim function
The .ndim attribute in a Pandas DataFrame returns the number of dimensions of the DataFrame, which is always 2 for a DataFrame (row-and-column format).
In [247]: df.ndim
Out[247]: 2
3. dtypes function
The .dtypes attribute in a Pandas DataFrame returns the data types of the columns in the
DataFrame. The result is a Series with the column names as index and the data types of the
columns as values.
In [248]: df.dtypes
Out[248]: one      int64
two      int64
three    int64
four     int64
dtype: object
4. shape function
The shape attribute in a Pandas DataFrame returns the dimensions (number of rows, number
of columns) of the DataFrame as a tuple.
In [249]: df.shape
Out[249]: (4, 4)

+ 4 rows
+ 4 columns
5. head() function
In [258]: d = {'Name'   : pd.Series(['Tom', 'Jerry', 'Spike', 'Popeye', 'Olive', 'Bluto', 'Mickey']),
'Age'    : pd.Series([10, 12, 14, 30, 28, 33, 15]),
'Height' : pd.Series([3.25, 1.11, 4.12, 5.47, 6.15, 6.67, 2.61])}
df = pd.DataFrame(d)
df
Out[258]:      Name  Age  Height
0     Tom   10    3.25
1   Jerry   12    1.11
2   Spike   14    4.12
3  Popeye   30    5.47
4   Olive   28    6.15
5   Bluto   33    6.67
6  Mickey   15    2.61
The head() method in a Pandas DataFrame returns the first n rows (by default, n=5) of the DataFrame. This method is useful for quickly examining the first few rows of a large DataFrame to get a sense of its structure and content.
In [259]: df.head(3)
Out[259]:     Name  Age  Height
0    Tom   10    3.25
1  Jerry   12    1.11
2  Spike   14    4.12

+ By default it will display the first 5 rows
+ We can mention the number of starting rows we want to see
+ We will see this function more often later; the DataFrame is so small at this point that we cannot use something like df.head(20)
6. df.tail() function
The tail() method in a Pandas DataFrame returns the last n rows (by default, n=5) of the DataFrame. This method is useful for quickly examining the last few rows of a large DataFrame to get a sense of its structure and content.
In [260]: df.tail(3)
Out[260]:      Name  Age  Height
4   Olive   28    6.15
5   Bluto   33    6.67
6  Mickey   15    2.61
7. empty function

The .empty attribute in a Pandas DataFrame returns a Boolean value indicating whether the DataFrame is empty or not. A DataFrame is considered empty if it has no rows.

In [263]: df = pd.DataFrame()
df.empty
Out[263]: True
i) Statistical or Mathematical Functions
+ Sum
+ Mean
+ Median
+ Mode
+ Variance
+ Min
+ Max
+ Standard Deviation
In [264]: data = {'one'   : pd.Series([1, 2, 3, 4]),
'two'   : pd.Series([10, 20, 30, 40]),
'three' : pd.Series([100, 200, 300, 400]),
'four'  : pd.Series([1000, 2000, 3000, 4000])}
df = pd.DataFrame(data)
df
Out[264]:    one  two  three  four
0    1   10    100  1000
1    2   20    200  2000
2    3   30    300  3000
3    4   40    400  4000
1. Sum
In [266]: df.sum()
Out[266]: one         10
two        100
three     1000
four     10000
dtype: int64
2. Mean
In [267]: df.mean()
Out[267]: one         2.5
two        25.0
three     250.0
four     2500.0
dtype: float64
3. Median
In [269]: df.median()
Out[269]: one         2.5
two        25.0
three     250.0
four     2500.0
dtype: float64
4. Mode

In [277]: de = pd.DataFrame({'A': [1, 2, 2, 3, 4, 4, 4, 5],
                             'B': [10, 20, 20, 30, 40, 40, 50, 60]})
print('A', de['A'].mode())
print('B', de['B'].mode())

A 0    4
dtype: int64
B 0    20
1    40
dtype: int64
5. Variance
In [279]: df.var()
Out[279]: one      1.666667e+00
two      1.666667e+02
three    1.666667e+04
four     1.666667e+06
dtype: float64
6. Min

In [280]: df.min()
Out[280]: one         1
two        10
three     100
four     1000
dtype: int64

7. Max

In [281]: df.max()
Out[281]: one         4
two        40
three     400
four     4000
dtype: int64
8. Standard Deviation

In [282]: df.std()
Out[282]: one         1.290994
two        12.909944
three     129.099445
four     1290.994449
dtype: float64
j) Describe Function
The describe() method in a Pandas DataFrame returns descriptive statistics of the data in the DataFrame. It provides a quick summary of the central tendency, dispersion, and shape of the distribution of a set of numerical data.

The default behavior of describe() is to compute descriptive statistics for all numerical columns in the DataFrame. If you want descriptive statistics for a specific column, you can select that column and call describe() on it.
In [284]: data = {'one'   : pd.Series([1, 2, 3, 4]),
'two'   : pd.Series([10, 20, 30, 40]),
'three' : pd.Series([100, 200, 300, 400]),
'four'  : pd.Series([1000, 2000, 3000, 4000]),
'Five'  : pd.Series(['A', 'B', 'C', 'D'])}
df = pd.DataFrame(data)
df.describe()
Out[284]:             one        two       three         four
count  4.000000   4.000000    4.000000     4.000000
mean   2.500000  25.000000  250.000000  2500.000000
std    1.290994  12.909944  129.099445  1290.994449
min    1.000000  10.000000  100.000000  1000.000000
25%    1.750000  17.500000  175.000000  1750.000000
50%    2.500000  25.000000  250.000000  2500.000000
75%    3.250000  32.500000  325.000000  3250.000000
max    4.000000  40.000000  400.000000  4000.000000
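Note that the string column 'Five' was left out, because describe() only summarizes numeric columns by default. A small sketch (a minimal reconstruction of the df above, not the notebook's own cell) of describing a single column and of including every column in the summary:

import pandas as pd

data = {'one'  : pd.Series([1, 2, 3, 4]),
        'two'  : pd.Series([10, 20, 30, 40]),
        'Five' : pd.Series(['A', 'B', 'C', 'D'])}
df = pd.DataFrame(data)

# summary statistics for one column only
print(df['two'].describe())

# include non-numeric columns as well (count, unique, top, freq for object columns)
print(df.describe(include='all'))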
k) Pipe Functions
1. Pipe Function
The pipe() method in a Pandas DataFrame allows you to apply a function to the DataFrame, similar to the way the apply() method works. The difference is that pipe() allows you to chain multiple operations together by passing the output of one function to the input of the next function.
In [286]: data = {'one'   : pd.Series([1, 2, 3, 4]),
'two'   : pd.Series([10, 20, 30, 40]),
'three' : pd.Series([100, 200, 300, 400]),
'four'  : pd.Series([1000, 2000, 3000, 4000])}
df = pd.DataFrame(data)
df
Out[286]:    one  two  three  four
0    1   10    100  1000
1    2   20    200  2000
2    3   30    300  3000
3    4   40    400  4000
Example 1

In [291]: def add_(i, j):
    return i + j

df.pipe(add_, 10)
Out[291]:    one  two  three  four
0   11   20    110  1010
1   12   30    210  2010
2   13   40    310  3010
3   14   50    410  4010
Example 2
In [294]: def mean_(col):
    return col.mean()

def square(i):
    return i ** 2

df.pipe(mean_).pipe(square)
Out[294]: one           6.25
two         625.00
three     62500.00
four    6250000.00
dtype: float64
2. Apply Function
The apply() method in a Pandas DataFrame allows you to apply a function along the rows or columns of the DataFrame, or to the DataFrame as a whole. The function can be either a built-in Python function or a user-defined function.
In [295]: data = {'one'   : pd.Series([1, 2, 3, 4]),
'two'   : pd.Series([10, 20, 30, 40]),
'three' : pd.Series([100, 200, 300, 400]),
'four'  : pd.Series([1000, 2000, 3000, 4000])}
df = pd.DataFrame(data)
df
Out[295]:    one  two  three  four
0    1   10    100  1000
1    2   20    200  2000
2    3   30    300  3000
3    4   40    400  4000

In [ ]: print(df.apply(np.mean))

one         2.5
two        25.0
three     250.0
four     2500.0
dtype: float64
In [301]: df.apply(lambda x: x.max() - x.min())
Out[301]: one         3
two        30
three     300
four     3000
dtype: int64
3. Applymap Function

The applymap() method in a Pandas DataFrame allows you to apply a function to each individual element of the DataFrame. The function can be either a built-in Python function or a user-defined function.
In [303]: df.applymap(lambda x : x * 100)
Out[303]:    one   two  three    four
0  100  1000  10000  100000
1  200  2000  20000  200000
2  300  3000  30000  300000
3  400  4000  40000  400000
applymap and apply are both functions in the pandas library used for applying a function to the elements of a pandas DataFrame or Series.

applymap is used to apply a function to every element of a DataFrame. It returns a new DataFrame where each element has been modified by the input function.

apply is used to apply a function along an axis of a DataFrame or Series. It returns either a Series or a DataFrame, depending on the axis along which the function is applied and the return value of the function. Unlike applymap, apply can take into account the context of the data, such as the row or column label.
In [312]: df = pd.DataFrame({'A' : [1.2, 3.4],
                             'B' : [7.8, 9.1]})

df_1 = df.applymap(np.int64)
print(df_1)

df_2 = df.apply(lambda row : row.mean(), axis = 1)
print(df_2)

   A  B
0  1  7
1  3  9
0    4.50
1    6.25
dtype: float64
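In newer pandas releases (2.1 and later), DataFrame.applymap() is deprecated in favour of DataFrame.map(), which performs the same element-wise application. A minimal sketch, assuming a recent pandas version:

import pandas as pd

df = pd.DataFrame({'A': [1.2, 3.4], 'B': [7.8, 9.1]})

# element-wise application; equivalent to the older df.applymap(...)
print(df.map(lambda x: x * 100))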
l) Reindex Function
The reindex function in Pandas is used to change the row labels and/or column labels of a DataFrame. It can be used to align data from multiple DataFrames or to update the labels based on new data. It takes a list or an array of new labels as its first argument and, optionally, a fill value to replace any missing values. The reindexing can be done along either the row axis (0) or the column axis (1). The reindexed DataFrame is returned.
Example 1 - Rows
In [333]: data = {'one'   : pd.Series([1, 2, 3, 4]),
'two'   : pd.Series([10, 20, 30, 40]),
'three' : pd.Series([100, 200, 300, 400]),
'four'  : pd.Series([1000, 2000, 3000, 4000])}
df = pd.DataFrame(data)
print(df)
print('-' * 30)
print(df.reindex([1, 0, 3, 2]))

   one  two  three  four
0    1   10    100  1000
1    2   20    200  2000
2    3   30    300  3000
3    4   40    400  4000
------------------------------
   one  two  three  four
1    2   20    200  2000
0    1   10    100  1000
3    4   40    400  4000
2    3   30    300  3000
Example 2 - Columns
In [336]: data = {'Name' : ['John', 'Jane', 'Jim', 'Joan'],
'Age'  : [25, 30, 35, 40],
'City' : ['New York', 'Los Angeles', 'Chicago', 'Houston']}
df = pd.DataFrame(data)
df.reindex(columns = ['Name', 'City', 'Age'])
Out[336]:    Name         City  Age
0  John     New York   25
1  Jane  Los Angeles   30
2   Jim      Chicago   35
3  Joan      Houston   40
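The fill value mentioned in the description above is passed through the fill_value argument; a small sketch (a toy DataFrame, not the notebook's own cell) that reindexes to row labels that do not all exist yet:

import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [10, 20, 30, 40]})

# rows 4 and 5 do not exist in df; fill_value replaces the resulting NaNs with 0
print(df.reindex([0, 1, 2, 3, 4, 5], fill_value=0))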
m) Renaming Columns in Pandas DataFrame
The rename function in Pandas is used to change the row labels and/or column labels of a
DataFrame. It can be used to update the names of one or multiple rows or columns by passing
a dictionary of new names as its argument. The dictionary should have the old names as keys
and the new names as values
In [343]: data = {'one'   : pd.Series([1, 2, 3, 4]),
'two'   : pd.Series([10, 20, 30, 40]),
'three' : pd.Series([100, 200, 300, 400]),
'four'  : pd.Series([1000, 2000, 3000, 4000])}
df = pd.DataFrame(data)
df.rename(columns = {'one' : 'One', 'two' : 'Two', 'three' : 'Three', 'four' : 'Four'},
          index = {0 : 'a', 1 : 'b', 2 : 'c', 3 : 'd'}, inplace = True)
df
Out[343]:    One  Two  Three  Four
a    1   10    100  1000
b    2   20    200  2000
c    3   30    300  3000
d    4   40    400  4000
n) Sorting in a Pandas DataFrame

Pandas provides several methods to sort a DataFrame based on one or more columns.

+ sort_values: this method sorts the DataFrame based on one or more columns. The default sorting order is ascending, but you can change it to descending by passing the ascending argument with a value of False.
In [355]: data = {'one'   : pd.Series([11, 51, 31, 41]),
'two'   : pd.Series([10, 20, 30, 40]),
'three' : pd.Series([100, 200, 500, 400]),
'four'  : pd.Series([1000, 2000, 3000, 4000])}
df = pd.DataFrame(data)
df
Out[355]:    one  two  three  four
0   11   10    100  1000
1   51   20    200  2000
2   31   30    500  3000
3   41   40    400  4000
Sort with respect to a Specific Column
In [356]: df.sort_values(by = 'one')
Out[356]:    one  two  three  four
0   11   10    100  1000
2   31   30    500  3000
3   41   40    400  4000
1   51   20    200  2000
Sort in a Specific Order
In [357]: df.sort_values(by = 'one', ascending = False)
Out[357]:    one  two  three  four
1   51   20    200  2000
3   41   40    400  4000
2   31   30    500  3000
0   11   10    100  1000
Sort in a Specific Order based on multiple Columns
In [359]: df.sort_values(by = ['one', 'two'])
Out[359]:    one  two  three  four
0   11   10    100  1000
2   31   30    500  3000
3   41   40    400  4000
1   51   20    200  2000
Sort with Specific Sorting Algorithm:
+ quicksort
+ mergesort
+ heapsort
In [361]: df.sort_values(by = ['one'], kind = 'heapsort')
Out[361]:    one  two  three  four
0   11   10    100  1000
2   31   30    500  3000
3   41   40    400  4000
1   51   20    200  2000
o) Groupby Functions
The groupby function in pandas is used to split a dataframe into groups based on one or more
columns. It returns a DataFrameGroupBy object, which is similar to a DataFrame but has some
additional methods to perform operations on the grouped data.
In [362]: cricket = {'Team'   : ['India', 'India', 'Australia', 'Australia', 'SA', 'SA', 'SA', 'SA', 'NZ', 'NZ', 'NZ', 'India'],
'Rank'   : [2, 3, 1, 2, 3, 4, 1, 1, 2, 4, 1, 2],
'Year'   : [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
'Points' : [876, 801, 891, 815, 776, 784, 834, 824, 758, 691, 883, 782]}
df = pd.DataFrame(cricket)
df
Out[362]:          Team  Rank  Year  Points
0       India     2  2014     876
1       India     3  2015     801
2   Australia     1  2014     891
3   Australia     2  2015     815
4          SA     3  2014     776
5          SA     4  2015     784
6          SA     1  2016     834
7          SA     1  2017     824
8          NZ     2  2016     758
9          NZ     4  2014     691
10         NZ     1  2015     883
11      India     2  2017     782
In [365]: df.groupby('Team').groups
Out[365]: {'Australia': [2, 3], 'India': [0, 1, 11], 'NZ': [8, 9, 10], 'SA': [4, 5, 6, 7]}

+ Australia is present at indices 2 and 3
+ India is present at indices 0, 1 and 11, and so on
To search for a specific country in a specific year:

In [366]: df.groupby(['Team', 'Year']).get_group(('Australia', 2014))
Out[366]:         Team  Rank  Year  Points
2  Australia     1  2014     891
If the data is not present, then we will get an error.

Adding some statistical computation on top of groupby
In [374]: df.groupby('Team').sum()['Points']
Out[374]: Team
Australia    1706
India        2459
NZ           2332
SA           3218
Name: Points, dtype: int64

+ This displays the sum of Points for each team
Let us sort it to get it in a better way
In [377]: df.groupby('Team').sum()['Points'].sort_values(ascending = False)
Out[377]: Team
SA           3218
India        2459
NZ           2332
Australia    1706
Name: Points, dtype: int64
Checking multiple stats for points team wise
In [382]: groups = df.groupby('Team')
groups['Points'].agg([np.sum, np.mean, np.std, np.max, np.min])
Out[382]:             sum        mean         std  amax  amin
Team
Australia  1706  853.000000   53.740115   891   815
India      2459  819.666667   49.702648   876   782
NZ         2332  777.333333   97.449192   883   691
SA         3218  804.500000   28.769196   834   776
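Passing NumPy callables such as np.sum to agg still works, but in recent pandas versions (2.1+) it may emit a FutureWarning; string aggregation names are the preferred spelling. A minimal sketch producing the same kind of table:

import pandas as pd

cricket = {'Team'   : ['India', 'India', 'Australia', 'Australia'],
           'Points' : [876, 801, 891, 815]}
df = pd.DataFrame(cricket)

# string names avoid the NumPy-callable deprecation and give cleaner column labels
print(df.groupby('Team')['Points'].agg(['sum', 'mean', 'std', 'max', 'min']))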
filter function along with groupby
In [386]: df.groupby('Team').filter(lambda x : len(x) == 4)
Out[386]:    Team  Rank  Year  Points
4    SA     3  2014     776
5    SA     4  2015     784
6    SA     1  2016     834
7    SA     1  2017     824

+ The data for South Africa appears exactly 4 times, which is why only South Africa is displayed here
In [388]: df.groupby('Team').filter(lambda x : len(x) == 3)
Out[388]:     Team  Rank  Year  Points
0   India     2  2014     876
1   India     3  2015     801
8      NZ     2  2016     758
9      NZ     4  2014     691
10     NZ     1  2015     883
11  India     2  2017     782

+ The data for India and New Zealand appears 3 times each, which is why they are displayed here
3. Working with CSV Files and Basic Data Analysis Using Pandas
a) Reading csv
Reading csv files from local system
In [398]: df = pd.read_csv('Football.csv')
df.head()

[Output: the first five rows of Football.csv, showing La Liga players (Juanmi Callejon, Antoine Griezmann, Luis Suarez, Ruben Castro, Kevin Gameiro) with columns Country, League, Club, Player Names, Matches_Played, Substitution, Mins, Goals, xG, ...]
Reading CSV files from github repositories
NOTE: The link of the page should be copied when the file is in raw format
In [391]: link = 'https://raw.githubusercontent.com/AshishJangra27/Data-Analysis-wit...'
df = pd.read_csv(link)
df.head()

[Output: the first five rows of the Google Play Store apps dataset, with columns App, Category, Rating, Reviews, Size, Installs, Type, Price, Content Rating, ...]
b) Pandas Info Function
The Pandas DataFrame info() function is used to get a concise summary of the DataFrame. It comes in really handy when doing exploratory analysis of the data; to get a quick overview of the dataset we use the DataFrame info() function.

Syntax: DataFrame.info(verbose=None, buf=None, max_cols=None, memory_usage=None, null_counts=None)
In [399]: df.info()
RangeIndex: 660 entries, 0 to 659
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   Country                  660 non-null    object
 1   League                   660 non-null    object
 2   Club                     660 non-null    object
 3   Player Names             660 non-null    object
 4   Matches_Played           660 non-null    int64
 5   Substitution             660 non-null    int64
 6   Mins                     660 non-null    int64
 7   Goals                    660 non-null    int64
 8   xG                       660 non-null    float64
 9   xG Per Avg Match         660 non-null    float64
 10  Shots                    660 non-null    int64
 11  OnTarget                 660 non-null    int64
 12  Shots Per Avg Match      660 non-null    float64
 13  On Target Per Avg Match  660 non-null    float64
 14  Year                     660 non-null    int64
dtypes: float64(4), int64(7), object(4)
memory usage: 77.5+ KB
c) isnull() function to check if there are NaN values present
In [400]: df.isnull()
Out[400]:      Country  League   Club  Player Names  Matches_Played  Substitution   Mins  Goals     xG  ...
0      False   False  False         False           False         False  False  False  False  ...
1      False   False  False         False           False         False  False  False  False  ...
...      ...     ...    ...           ...             ...           ...    ...    ...    ...  ...
658    False   False  False         False           False         False  False  False  False  ...
659    False   False  False         False           False         False  False  False  False  ...

660 rows x 15 columns
So we can see we get a boolean table of True and False values.

If we use the sum function along with it, then we can see how many null values are present in each column.
In [401]: df.isnull().sum()
Out[401]: Country                    0
League                     0
Club                       0
Player Names               0
Matches_Played             0
Substitution               0
Mins                       0
Goals                      0
xG                         0
xG Per Avg Match           0
Shots                      0
OnTarget                   0
Shots Per Avg Match        0
On Target Per Avg Match    0
Year                       0
dtype: int64
d) Quantile function to get a specific percentile value

Let us check the 80th percentile value of each column using the describe function first:
In [404]: df.describe(percentiles = [.80])

[Output: count, mean, std, min, 50%, 80% and max for each numeric column; the 80% row shows 2915.80 for Mins]
So we can see the 80th Percentile value of Mins is 2915.80
Let us use the quantile function to get the exact value now
In [406]: df['Mins'].quantile(.80)
Out[406]: 2915.8
Here we go we got the same value
To get the 99 percentile value we can write
In [407]: df['Mins'].quantile(.99)
Out[407]: 3528.0199999999995
+ This function is important, as it can be used to treat outliers in the Data Science EDA process
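As a sketch of that outlier treatment (the 0.99 cutoff is only an illustrative choice, not part of the original notebook), the quantile value can be used to cap a column with Series.clip:

import pandas as pd

df = pd.read_csv('Football.csv')             # same file used in this section

# cap everything above the 99th percentile of Mins at that percentile value
upper = df['Mins'].quantile(0.99)
df['Mins'] = df['Mins'].clip(upper=upper)
print(df['Mins'].max())                      # no value exceeds the cap any more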
e) Copy function
If we simply do:

de = df

then a change in de will affect the data of df as well, so we need to copy in a way that creates a totally new object and does not affect the old DataFrame.
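A small self-contained sketch of that difference (a toy DataFrame, not the notebook's df): plain assignment only creates a second name for the same object, while copy() creates an independent DataFrame.

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

alias = df           # same underlying object
copied = df.copy()   # independent copy (deep=True is the default)

alias['a'] = 0               # this modifies df as well
print(df['a'].tolist())      # [0, 0, 0]
print(copied['a'].tolist())  # [1, 2, 3] -- the copy is untouched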
In [413]: de = df.copy()
de.head(3)

[Output: the first three rows of the copied DataFrame, identical to df]
In [414]: de['Year+100'] = de['Year'] + 100
de.head()

[Output: the first five rows of de, now with an extra Year+100 column]

+ So we can see a new column has been added here, but our old data is untouched
In [415]: df.head()

[Output: the first five rows of df, without the Year+100 column]

+ The new column is not present here
f) Value Counts function
The Pandas Series value_counts() function returns a Series containing counts of unique values. The resulting object will be in descending order, so that the first element is the most frequently occurring element. It excludes NA values by default.
Syntax: Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)
In [417]: df['Player Names'].value_counts()

[Output: counts per player in descending order; Andrea Belotti, Lionel Messi, Luis Suarez, Andrej Kramaric and Ciro Immobile appear most often, while players such as Francois Kamano, Lebo Mothiba, Gaetan Laborde, Falcao and Cody Gakpo appear once. Name: Player Names, Length: 444, dtype: int64]
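The normalize parameter shown in the syntax above turns the counts into proportions; a small sketch on the same DataFrame:

import pandas as pd

df = pd.read_csv('Football.csv')

# fraction of rows contributed by each country instead of raw counts
print(df['Country'].value_counts(normalize=True))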
g) Unique and Nunique Function
While analyzing the data, many times the user wants to see the unique values in a particular column, which can be done using the Pandas unique() function.
In [418]: df['Player Names'].unique()
Out[418]: array(['Juanmi Callejon', 'Antoine Griezmann', 'Luis Suarez',
       'Ruben Castro', 'Kevin Gameiro', 'Cristiano Ronaldo',
       'Karim Benzema', 'Neymar ', 'Iago Aspas', 'Sergi Enrich',
       'Aduriz ', 'Sandro Ramirez', 'Lionel Messi', 'Gerard Moreno',
       'Morata', 'Wissam Ben Yedder', 'Willian Jose', 'Andone ',
       'Cedric Bakambu', 'Isco', 'Mohamed Salah', 'Gregoire Defrel',
       'Ciro Immobile', 'Nikola Kalinic', 'Dries Mertens',
       'Alejandro Gomez', 'Jose Callejon', 'Iago Falque',
       'Giovanni Simeone', 'Mauro Icardi', 'Diego Falcinelli',
       'Cyril Thereau', 'Edin Dzeko', 'Lorenzo Insigne',
       'Fabio Quagliarella', 'Borriello ', 'Carlos Bacca',
       'Gonzalo Higuain', 'Keita Balde', 'Andrea Belotti', 'Fin Bartels',
       'Lars Stindl', 'Serge Gnabry', 'Wagner ', 'Andrej Kramaric',
       'Florian Niederlechner', 'Robert Lewandowski', 'Emil Forsberg',
       'Timo Werner', 'Nils Petersen', 'Vedad Ibisevic', 'Mario Gomez',
       'Maximilian Philipp', 'Adam Szalai',
       'Pierre-Emerick Aubameyang', 'Guido Burgstaller', 'Max Kruse',
       'Chicharito ', 'Anthony Modeste', 'Arjen Robben', 'Alexis Sanchez', ...
While analyzing the data, many times the user wants to see the count of unique values in a particular column. Pandas nunique() is used to get that count.
In [419]: df['Player Names'].nunique()
Out[419]: 444
h) dropna() function
Sometimes a CSV file has null values, which are later displayed as NaN in the DataFrame. The Pandas dropna() method allows the user to analyze and drop rows/columns with null values in different ways.

Syntax: DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

axis: takes an int or string value for rows/columns. Input can be 0 or 1 for integer and 'index' or 'columns' for string.
In [422]: link = 'https://raw.githubusercontent.com/AshishJangra27/Data-Analysis-with...'
df = pd.read_csv(link)
df.head()

[Output: the first five rows of the Google Play Store apps dataset, with columns App, Category, Rating, Reviews, Size, Installs, Type, Price, Content Rating, ...]
In [423]: df.isnull().sum()

[Output: null counts per column; Rating has 1474 null values, and a few other columns have a small number of nulls]
+ OK, so it seems like we have a lot of null values in the Rating column and a few null values in some other columns
In [426]: df.dropna(inplace = True, axis = 0)

This will delete all the rows which contain null values.
In [427]: df.dropna(inplace = True, axis = 1)

This will delete all the columns which contain null values.
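dropna() also accepts a subset argument, so that only chosen columns decide whether a row is dropped. A small self-contained sketch (the toy values below are illustrative, not rows from the dataset):

import numpy as np
import pandas as pd

df = pd.DataFrame({'App'    : ['A', 'B', 'C'],
                   'Rating' : [4.1, np.nan, 3.9],
                   'Size'   : [np.nan, '14M', '25M']})

# drop a row only when its Rating is missing; the NaN in Size alone does not drop row 0
print(df.dropna(subset=['Rating']))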
i) Fillna Function

The Pandas Series fillna() function is used to fill NA/NaN values using the specified method.

If we want to fill the null values with something instead of removing them, then we can use the fillna function.

Here we will be filling the numerical columns with their mean values and the categorical columns with their mode.
In [447]: link = 'https://raw.githubusercontent.com/AshishJangra27/Data-Analysis-wit...'
df = pd.read_csv(link)
print(len(df))

10841
Numerical columns
46149213124, 240 PM
Pandas - Jupyter Notebook
In [448]: mis = round(df['Rating'].mean(), 2)
df['Rating'] = df['Rating'].fillna(mis)
print(len(df))

10841
If we had used inplace=True, it would also have permanently stored those values in our DataFrame.
Categorical values
In [461]: df['Current Ver'] = df['Current Ver'].fillna('Varies on Device')
j) sample function

Pandas sample() is used to generate a random sample of rows or columns from the calling DataFrame.

Syntax: DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)
In [471]: df.sample(5)

[Output: five randomly sampled rows from the apps DataFrame]
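Two other commonly used parameters from the syntax above are frac (sample a fraction of the rows instead of a fixed count) and random_state (make the sample reproducible). A minimal sketch on a toy DataFrame:

import pandas as pd

df = pd.DataFrame({'x': range(100)})

# 10% of the rows, reproducible across runs because of the fixed seed
print(df.sample(frac=0.1, random_state=42))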
k) to_csv() function
The Pandas to_csv() function writes the given object to a comma-separated values (CSV) file/format.

Syntax: Series.to_csv(*args, **kwargs)
In [477]: data = {'one'   : pd.Series([1, 2, 3, 4]),
'two'   : pd.Series([10, 20, 30, 40]),
'three' : pd.Series([100, 200, 300, 400]),
'four'  : pd.Series([1000, 2000, 3000, 4000])}
df = pd.DataFrame(data)
df.to_csv('Number.csv')
+ We get an extra "Unnamed: 0" column when the file is read back; if we want to avoid that, we need to add an extra parameter, index=False

In [478]: df.to_csv('Numbers.csv', index = False)
4. A detailed Pandas Profile report
The pandas_profiling library in Python includes a method named ProfileReport() which generates a basic report on the input DataFrame.

The report consists of the following: a DataFrame overview, each attribute on which the DataFrame is defined, correlations between attributes (Pearson correlation and Spearman correlation), and a sample of the DataFrame.
In [480]: import matplotlib
import pandas_profiling as pp
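The pandas_profiling package has since been renamed; in current releases it is distributed as ydata-profiling, with the same ProfileReport interface. A minimal sketch, assuming the renamed package is installed (pip install ydata-profiling):

import pandas as pd
from ydata_profiling import ProfileReport   # renamed successor of pandas_profiling

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

report = ProfileReport(df)
report.to_file('report.html')   # writes a standalone HTML report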
In [484]: df = pd.read_csv('Football.csv')
df.head()

[Output: the first five rows of Football.csv, as shown earlier]
In [485]: report = pp.ProfileReport(df)

In [486]: report

Summarize dataset / Generate report structure / Render HTML ... (progress bars)
Out[486]: [the rendered interactive profiling report]