Pandas (GeeksforGeeks Jupyter Notebook)

In [6]: import numpy as np
        import pandas as pd

Table of Contents

1. Working with Pandas Series
   a) Creating Series (from a list, from a NumPy array, setting up our own index, from a dictionary, using the repeat function, accessing data from a Series)
   b) Aggregate functions on a Pandas Series
   c) Series absolute function
   d) Appending Series
   e) Astype function
   f) Between function
   g) String functions that can be used to extract or modify text in a Series (upper and lower, len, strip, split, contains, replace, count, startswith and endswith, find)
   h) Converting a Series to a list
2. Detailed Coding Implementations on Pandas DataFrame
   a) Creating DataFrames
   b) Slicing DataFrames using loc and iloc (basic loc operations, basic iloc operations, slicing using conditions)
   c) Column addition in DataFrames (using a list, using a Pandas Series, using an existing column)
   d) Deleting columns in a DataFrame (using del, using the pop function)
   e) Addition of rows
   f) Drop function
   g) Transposing a DataFrame
   h) A set of more DataFrame functionalities (axes, ndim, dtypes, shape, head, tail, empty)
   i) Statistical or mathematical functions (sum, mean, median, mode, variance, min, max, standard deviation)
   j) Describe function
   k) Pipe functions (pipe, apply, applymap)
   l) Reindex function
   m) Renaming columns in a Pandas DataFrame
   n) Sorting in a Pandas DataFrame
   o) Groupby functions (adding statistical computation on groupby, using the filter function with groupby)
3. Working with CSV files and basic data analysis using Pandas
   a) Reading CSV
   b) Info function
   c) isnull() function
   d) Quantile function
   e) Copy function
   f) Value counts function
   g) Unique and nunique functions
   h) dropna() function
   i) fillna() function
   j) sample function
   k) to_csv() function
4. A detailed Pandas profile report

1. Working with Pandas Series

a) Creating Series

A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floats, Python objects, etc.). The axis labels are collectively called the index. Labels need not be unique but must be of a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index.

Series through a list

In [ ]: lst = [1,2,3,4,5]
        pd.Series(lst)

Series through a NumPy array

In [ ]: arr = np.array([1,2,3,4,5])
        pd.Series(arr)

Setting up our own index

In [12]: pd.Series(index = ['Eshant', 'Pranjal', 'Jayesh', 'Ashish'], data = [1,2,3,4])
Out[12]: Eshant     1
         Pranjal    2
         Jayesh     3
         Ashish     4
         dtype: int64

Series through dictionary values

In [15]: steps = {'day1': 4000, 'day2': 3000, 'day3': 12000}
         pd.Series(steps)
Out[15]: day1     4000
         day2     3000
         day3    12000
         dtype: int64
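A Series also accepts a name and an explicit dtype at construction time, which is handy when several Series are later combined into a DataFrame. The cell below is a small illustrative sketch (not part of the original notebook), assuming pandas is imported as pd as in the first cell; the variable names are arbitrary:

In [ ]: # reuse the step-count dictionary from above
        steps = {'day1': 4000, 'day2': 3000, 'day3': 12000}
        s = pd.Series(steps, name='steps', dtype='int64')  # name and dtype are optional
        print(s.index.tolist())   # ['day1', 'day2', 'day3']
        print(s.values)           # [ 4000  3000 12000]
        print(s.name)             # steps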
Using the repeat function while creating a Series

The Pandas Series.repeat() function repeats the elements of a Series. It returns a new Series where each element of the current Series is repeated consecutively a given number of times.

In [19]: pd.Series(5).repeat(3)
Out[19]: 0    5
         0    5
         0    5
         dtype: int64

We can use reset_index to make the index sequential again:

In [27]: pd.Series(5).repeat(3).reset_index(drop = True)
Out[27]: 0    5
         1    5
         2    5
         dtype: int64

The following code indicates that 10 should be repeated 5 times and 20 should be repeated 2 times:

In [29]: s = pd.Series([10,20]).repeat([5,2]).reset_index(drop = True)
         s
Out[29]: 0    10
         1    10
         2    10
         3    10
         4    10
         5    20
         6    20
         dtype: int64

Accessing elements

In [34]: s[4]
Out[34]: 10

In [38]: s[6]
Out[38]: 20

Something like s[50] would not work, because elements are accessed through the index labels of the Series, and only the labels that actually exist (here 0 to 6) can be used.

Slicing by position also works (start to end-1):

In [49]: s[2:-2]
Out[49]: 2    10
         3    10
         4    10
         dtype: int64

b) Aggregate function on a Pandas Series

The Pandas Series.aggregate() function aggregates using one or more operations over the specified axis of the given Series object.

In [58]: sr = pd.Series([1,2,3,4,5,6,7])
         sr.agg([min, max, sum])
Out[58]: min     1
         max     7
         sum    28
         dtype: int64

c) Series absolute function

The Pandas Series.abs() method is used to get the absolute numeric value of each element in a Series/DataFrame.

In [60]: sr = pd.Series([1,-2,3,-4,5,-6,7])
         sr.abs()
Out[60]: 0    1
         1    2
         2    3
         3    4
         4    5
         5    6
         6    7
         dtype: int64

d) Appending Series

The Pandas Series.append() function is used to concatenate two or more Series objects.

Syntax: Series.append(to_append, ignore_index=False, verify_integrity=False)

Parameters:
to_append: Series or list/tuple of Series
ignore_index: if True, do not use the index labels
verify_integrity: if True, raise an exception when creating an index with duplicates

In [67]: sr1 = pd.Series([1,-2,3])
         sr2 = pd.Series([1,2,3])
         sr3 = sr2.append(sr1)
         sr3
Out[67]: 0    1
         1    2
         2    3
         0    1
         1   -2
         2    3
         dtype: int64

To make the index sequential again:

In [71]: sr3.reset_index(drop = True)
Out[71]: 0    1
         1    2
         2    3
         3    1
         4   -2
         5    3
         dtype: int64

e) Astype function

Pandas astype() is one of the most important methods. It is used to change the data type of a Series. When a DataFrame is made from a CSV file, the columns are imported and their data types are set automatically, which many times is not what they actually should be.

In [75]: sr1
Out[75]: 0    1
         1   -2
         2    3
         dtype: int64

You can see above that the dtype is int64.

In [76]: type(sr1[0])
Out[76]: numpy.int64

In [80]: sr1.astype('float')
Out[80]: 0    1.0
         1   -2.0
         2    3.0
         dtype: float64
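astype() raises an error if a value cannot be converted. When a column read from a CSV contains stray non-numeric entries, pd.to_numeric with errors='coerce' is a more forgiving alternative. This is an illustrative sketch (not from the original notebook), assuming pandas is imported as pd:

In [ ]: raw = pd.Series(['1', '2', 'three', '4'])
        # raw.astype('float') would raise a ValueError because of 'three'
        cleaned = pd.to_numeric(raw, errors='coerce')  # unparseable values become NaN
        cleaned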
f) Between function

The Pandas between() method is used on a Series to check which values lie between a first and a second argument.

In [86]: sr1 = pd.Series([1,2,30,4,5,6,7,8,9,20])

In [87]: sr1.between(10, 50)
Out[87]: 0    False
         1    False
         2     True
         3    False
         4    False
         5    False
         6    False
         7    False
         8    False
         9     True
         dtype: bool

g) String functions that can be used to extract or modify text in a Series

Upper and lower, len, strip, split, contains, replace, count, startswith and endswith, and find can all be applied to a Series of strings through the .str accessor.

In [92]: ser = pd.Series(["Eshant Das", "Data Science", "Geeks for Geeks", "Hello World", "Machine Learning"])

Upper and lower functions

In [93]: print(ser.str.upper())
         print('-'*30)
         print(ser.str.lower())
         0          ESHANT DAS
         1        DATA SCIENCE
         2     GEEKS FOR GEEKS
         3         HELLO WORLD
         4    MACHINE LEARNING
         dtype: object
         ------------------------------
         0          eshant das
         1        data science
         2     geeks for geeks
         3         hello world
         4    machine learning
         dtype: object

Len function

In [94]: for i in ser:
             print(len(i))
         10
         12
         15
         11
         16

Strip function

In [95]: ser = pd.Series(["  Eshant Das", "Data Science", "Geeks for Geeks", "Hello World", "  Machine Learning"])
         for i in ser:
             print(i, len(i))
           Eshant Das 12
         Data Science 12
         Geeks for Geeks 15
         Hello World 11
           Machine Learning 18

In [96]: ser = ser.str.strip()
         for i in ser:
             print(i, len(i))
         Eshant Das 10
         Data Science 12
         Geeks for Geeks 15
         Hello World 11
         Machine Learning 16

The two extra spaces have been removed.

Split function

In [108]: ser.str.split()
Out[108]: 0          [Eshant, Das]
          1        [Data, Science]
          2    [Geeks, for, Geeks]
          3         [Hello, World]
          4    [Machine, Learning]
          dtype: object

To look at the split result for a single element of the Series:

In [109]: ser.str.split()[0]
Out[109]: ['Eshant', 'Das']

In [110]: ser.str.split()[1]
Out[110]: ['Data', 'Science']

Contains function

In [126]: ser = pd.Series(["Eshant Das", "Data@Science", "Geeks for Geeks", "Hello@World", "Machine Learning"])
          ser.str.contains('@')
Out[126]: 0    False
          1     True
          2    False
          3     True
          4    False
          dtype: bool

Replace function

In [127]: ser.str.replace('@', ' ')
Out[127]: 0          Eshant Das
          1        Data Science
          2     Geeks for Geeks
          3         Hello World
          4    Machine Learning
          dtype: object

Count function

In [128]: ser.str.count('a')
Out[128]: 0    2
          1    2
          2    0
          3    0
          4    2
          dtype: int64

Startswith and endswith functions

In [129]: ser.str.startswith('D')
Out[129]: 0    False
          1     True
          2    False
          3    False
          4    False
          dtype: bool

In [130]: ser.str.endswith('s')
Out[130]: 0     True
          1    False
          2     True
          3    False
          4    False
          dtype: bool

Find function

In [133]: ser.str.find('Geeks')
Out[133]: 0   -1
          1   -1
          2    0
          3   -1
          4   -1
          dtype: int64
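Because every .str method returns a new Series, the helpers above can be chained into a small cleaning pipeline. The cell below is an illustrative sketch (not part of the original notebook), assuming pandas is imported as pd:

In [ ]: messy = pd.Series(["  Data@Science ", "geeks FOR geeks", " Hello@World"])
        cleaned = (messy.str.strip()            # remove surrounding spaces
                         .str.replace('@', ' ') # drop the separator character
                         .str.lower())          # normalise case
        cleaned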
h) Converting a Series to a list

Pandas tolist() is used to convert a Series to a list. Initially the object is of type pandas.core.series.Series.

In [137]: ser.to_list()
Out[137]: ['Eshant Das', 'Data@Science', 'Geeks for Geeks', 'Hello@World', 'Machine Learning']

2. Detailed Coding Implementations on Pandas DataFrame

A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). In other words, data is aligned in a tabular fashion in rows and columns. A Pandas DataFrame consists of three principal components: the data, the rows, and the columns.

[Figure: a sample DataFrame of NBA players (Name, Team, Number, Position, Age) annotated with its rows, columns and data.]

a) Creating DataFrames

In the real world, a Pandas DataFrame will be created by loading datasets from existing storage; that storage can be a SQL database, a CSV file, or an Excel file. A Pandas DataFrame can also be created from lists, from a dictionary, from a list of dictionaries, etc. A DataFrame can be created in different ways; here are some of them.

Creating a DataFrame using a list: a DataFrame can be created using a single list or a list of lists.

In [161]: lst = ['Geeks', 'For', 'Geeks', 'is', 'portal', 'for', 'Geeks']
          pd.DataFrame(lst)
Out[161]:         0
          0   Geeks
          1     For
          2   Geeks
          3      is
          4  portal
          5     for
          6   Geeks

In [163]: lst = [['tom', 10], ['jerry', 12], ['spike', 14]]
          pd.DataFrame(lst)
Out[163]:        0   1
          0    tom  10
          1  jerry  12
          2  spike  14

Creating a DataFrame from a dict of ndarrays/lists: to create a DataFrame from a dict of ndarrays/lists, all the arrays must be of the same length. If an index is passed, then the length of the index should be equal to the length of the arrays. If no index is passed, then by default the index will be range(n), where n is the array length.

In [166]: data = {'name': ['Tom', 'nick', 'krish', 'jack'], 'age': [20, 21, 19, 18]}
          pd.DataFrame(data)
Out[166]:     name  age
          0    Tom   20
          1   nick   21
          2  krish   19
          3   jack   18
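The introduction above also mentions creating a DataFrame from a list of dictionaries; the notebook does not show that case, so here is a small illustrative sketch (assuming pandas is imported as pd). Keys become column names, and missing keys become NaN:

In [ ]: records = [{'name': 'Tom',   'age': 20},
                   {'name': 'nick',  'age': 21},
                   {'name': 'krish'}]          # no 'age' key -> NaN in that cell
        pd.DataFrame(records)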
Column selection

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. We can perform basic operations on rows/columns like selecting, deleting, adding, and renaming. In order to select a column in a Pandas DataFrame, we can access the columns by calling them by their column names.

In [169]: data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
                  'Age': [27, 24, 22, 32],
                  'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
                  'Qualification': ['Msc', 'MA', 'MCA', 'Phd']}
          df = pd.DataFrame(data)
          df[['Name', 'Qualification']]
Out[169]:      Name Qualification
          0     Jai           Msc
          1  Princi            MA
          2  Gaurav           MCA
          3    Anuj           Phd

b) Slicing DataFrames using loc and iloc

Pandas comprises many methods for its proper functioning; loc() and iloc() are two of them. They are used for slicing data from a Pandas DataFrame, they help in the convenient selection of data, and they are used for filtering the data according to some conditions.

In [171]: data = {'one'  : pd.Series([1, 2, 3, 4]),
                  'two'  : pd.Series([10, 20, 30, 40]),
                  'three': pd.Series([100, 200, 300, 400]),
                  'four' : pd.Series([1000, 2000, 3000, 4000])}
          df = pd.DataFrame(data)
          df
Out[171]:    one  two  three  four
          0    1   10    100  1000
          1    2   20    200  2000
          2    3   30    300  3000
          3    4   40    400  4000

Basic loc operations

The loc() function is a label-based data selection method, which means that we have to pass the name of the row or column that we want to select. This method includes the last element of the range passed to it, unlike iloc(). loc() can accept boolean data, unlike iloc(). Many operations can be performed using the loc() method, for example:

In [180]: df.loc[1:2, 'two':'three']
Out[180]:    two  three
          1   20    200
          2   30    300

Basic iloc operations

The iloc() function is an index-based selection method, which means that we have to pass an integer index to the method to select a specific row/column. This method does not include the last element of the range passed to it, unlike loc(). iloc() does not accept boolean data, unlike loc().

In [192]: df.iloc[1:3, 1:3]
Out[192]:    two  three
          1   20    200
          2   30    300

You can see that index 3 of both the row and the column has not been included here, so 1 was inclusive but 3 is exclusive in the case of iloc.

Let's see another example:

In [195]: df.iloc[:, 2:3]
Out[195]:    three
          0    100
          1    200
          2    300
          3    400

Selecting specific rows and columns:

In [197]: df.iloc[[0,2], [1,3]]
Out[197]:    two  four
          0   10  1000
          2   30  3000

Slicing using conditions

Slicing with conditions basically works with loc.

In [204]: df.loc[df['two'] > 20, ['three', 'four']]
Out[204]:    three  four
          2    300  3000
          3    400  4000

So we could extract only those rows for which the value of 'two' is more than 20. For the columns we have passed a list to extract specific columns, which are 'three' and 'four'.

Let's see another example:

In [207]: df.loc[df['three'] < 300, ['one', 'four']]
Out[207]:    one  four
          0    1  1000
          1    2  2000

You can read this code in the same way as the previous one.

c) Column addition in DataFrames

We can add a column in many ways. Let us discuss three of them here:
- using a list
- using a Pandas Series
- using an existing column (we can modify that column in the way we want and the modified result becomes the new column)

In [210]: l = [22, 33, 44, 55]
          df['five'] = l
          df
Out[210]:    one  two  three  four  five
          0    1   10    100  1000    22
          1    2   20    200  2000    33
          2    3   30    300  3000    44
          3    4   40    400  4000    55

In [211]: sr = pd.Series([111, 222, 333, 444])
          df['six'] = sr
          df
Out[211]:    one  two  three  four  five  six
          0    1   10    100  1000    22  111
          1    2   20    200  2000    33  222
          2    3   30    300  3000    44  333
          3    4   40    400  4000    55  444

Using an existing column:

In [216]: df['seven'] = df['one'] + 10
          df
Out[216]:    one  two  three  four  five  six  seven
          0    1   10    100  1000    22  111     11
          1    2   20    200  2000    33  222     12
          2    3   30    300  3000    44  333     13
          3    4   40    400  4000    55  444     14

Now we can see that the column 'seven' holds all the values of column 'one' incremented by 10.
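Besides direct assignment, pandas also provides DataFrame.assign(), which returns a new DataFrame instead of modifying the original in place; this is convenient inside method chains. An illustrative sketch (not from the original notebook), using the df built above; the column name 'eight' is made up:

In [ ]: df2 = df.assign(eight = df['two'] * 2)   # df is left unchanged
        df2.columns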
d) Column deletion in DataFrames

In [217]: df
Out[217]:    one  two  three  four  five  six  seven
          0    1   10    100  1000    22  111     11
          1    2   20    200  2000    33  222     12
          2    3   30    300  3000    44  333     13
          3    4   40    400  4000    55  444     14

Using del, the column named 'six' is deleted:

In [218]: del df['six']
          df

Using pop, the column 'five' is also deleted from our DataFrame:

In [220]: df.pop('five')
          df
Out[220]:    one  two  three  four  seven
          0    1   10    100  1000     11
          1    2   20    200  2000     12
          2    3   30    300  3000     13
          3    4   40    400  4000     14

e) Addition of rows

In a Pandas DataFrame, you can add rows by using the append method: create a new DataFrame with the desired row values and use append to add it to the original DataFrame. Here's an example of adding rows to a DataFrame:

In [228]: df1 = pd.DataFrame([[1, 2], [3, 4]], columns = ['a', 'b'])
          df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a', 'b'])
          df3 = df1.append(df2).reset_index(drop = True)
          df3
Out[228]:    a  b
          0  1  2
          1  3  4
          2  5  6
          3  7  8
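Note that DataFrame.append() has been deprecated in recent pandas releases (and removed in pandas 2.0), so on newer versions the same row addition is done with pd.concat. A hedged sketch of the equivalent call:

In [ ]: df1 = pd.DataFrame([[1, 2], [3, 4]], columns = ['a', 'b'])
        df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a', 'b'])
        df3 = pd.concat([df1, df2], ignore_index = True)  # same result as append + reset_index
        df3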
f) Pandas drop function

Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages and makes importing and analyzing data much easier. Pandas gives data analysts a way to delete and filter a data frame using the drop() method. Rows or columns can be removed using an index label or column name with this method.

Syntax: DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

Parameters:
labels: string or list of strings referring to row or column names.
axis: int or string value, 0/'index' for rows and 1/'columns' for columns.
index or columns: single label or list. index or columns are an alternative to axis and cannot be used together.
level: used to specify the level in case the data frame has a multi-level index.
inplace: makes changes in the original data frame if True.
errors: ignores errors if any value from the list doesn't exist and drops the rest of the values when errors='ignore'.
Return type: DataFrame with the dropped values.

In [240]: data = {'one'  : pd.Series([1, 2, 3, 4]),
                  'two'  : pd.Series([10, 20, 30, 40]),
                  'three': pd.Series([100, 200, 300, 400]),
                  'four' : pd.Series([1000, 2000, 3000, 4000])}
          df = pd.DataFrame(data)

axis = 0: rows (row-wise)

In [241]: df.drop([0, 1], axis = 0, inplace = True)
          df
Out[241]:    one  two  three  four
          2    3   30    300  3000
          3    4   40    400  4000

axis = 1: columns (column-wise)

In [242]: df.drop(['one', 'three'], axis = 1, inplace = True)
          df
Out[242]:    two  four
          2   30  3000
          3   40  4000

g) Transposing a DataFrame

The .T attribute of a Pandas DataFrame is used to transpose the DataFrame, i.e., to flip the rows and columns. The result is a new DataFrame with the original rows as columns and the original columns as rows. Here's an example:

In [243]: data = {'one'  : pd.Series([1, 2, 3, 4]),
                  'two'  : pd.Series([10, 20, 30, 40]),
                  'three': pd.Series([100, 200, 300, 400]),
                  'four' : pd.Series([1000, 2000, 3000, 4000])}
          df = pd.DataFrame(data)

In [244]: df.T
Out[244]:            0     1     2     3
          one        1     2     3     4
          two       10    20    30    40
          three    100   200   300   400
          four    1000  2000  3000  4000

h) A set of more DataFrame functionalities

1. axes attribute

The .axes attribute of a Pandas DataFrame returns a list with the row and column labels of the DataFrame. The first element of the list is the row labels (index), and the second element is the column labels.

In [246]: df.axes
Out[246]: [RangeIndex(start=0, stop=4, step=1),
           Index(['one', 'two', 'three', 'four'], dtype='object')]

2. ndim attribute

The .ndim attribute returns the number of dimensions of the DataFrame, which is always 2 for a DataFrame (row-and-column format).

In [247]: df.ndim
Out[247]: 2

3. dtypes attribute

The .dtypes attribute returns the data types of the columns in the DataFrame. The result is a Series with the column names as index and the data types of the columns as values.

In [248]: df.dtypes
Out[248]: one      int64
          two      int64
          three    int64
          four     int64
          dtype: object

4. shape attribute

The .shape attribute returns the dimensions (number of rows, number of columns) of the DataFrame as a tuple.

In [249]: df.shape
Out[249]: (4, 4)

That is, 4 rows and 4 columns.

5. head() function

In [258]: d = {'Name'  : pd.Series(['Tom', 'Jerry', 'Spike', 'Popeye', 'Olive', 'Bluto', 'Mickey']),
               'Age'   : pd.Series([10, 12, 14, 30, 28, 33, 15]),
               'Height': pd.Series([3.25, 1.11, 4.12, 5.47, 6.15, 6.67, 2.61])}
          df = pd.DataFrame(d)

The head() method returns the first n rows (by default, n=5) of the DataFrame. This method is useful for quickly examining the first few rows of a large DataFrame to get a sense of its structure and content.

In [259]: df.head(3)
Out[259]:     Name  Age  Height
          0    Tom   10    3.25
          1  Jerry   12    1.11
          2  Spike   14    4.12

By default it displays the first 5 rows; we can also pass the number of starting rows we want to see. We will use this function more often later, since the DataFrame is so small at this point that we cannot do something like df.head(20).

6. tail() function

The tail() method returns the last n rows (by default, n=5) of the DataFrame and is useful for quickly examining the last few rows of a large DataFrame.

In [260]: df.tail(3)
Out[260]:      Name  Age  Height
          4   Olive   28    6.15
          5   Bluto   33    6.67
          6  Mickey   15    2.61

7. empty attribute

The .empty attribute returns a Boolean value indicating whether the DataFrame is empty or not. A DataFrame is considered empty if it has no rows.

In [263]: df = pd.DataFrame()
          df.empty
Out[263]: True
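One subtlety worth noting: .empty only checks whether an axis has length 0, so a DataFrame that contains nothing but NaN values is not considered empty. A small illustrative sketch (not from the original notebook), assuming numpy and pandas are imported as np and pd:

In [ ]: print(pd.DataFrame().empty)                  # True  - no rows, no columns
        print(pd.DataFrame({'a': [np.nan]}).empty)   # False - it still has one row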
i) Statistical or mathematical functions

Sum, mean, median, mode, variance, min, max, and standard deviation can all be computed directly on a DataFrame.

In [264]: data = {'one'  : pd.Series([1, 2, 3, 4]),
                  'two'  : pd.Series([10, 20, 30, 40]),
                  'three': pd.Series([100, 200, 300, 400]),
                  'four' : pd.Series([1000, 2000, 3000, 4000])}
          df = pd.DataFrame(data)

1. Sum

In [266]: df.sum()
Out[266]: one         10
          two        100
          three     1000
          four     10000
          dtype: int64

2. Mean

In [267]: df.mean()
Out[267]: one         2.5
          two        25.0
          three     250.0
          four     2500.0
          dtype: float64

3. Median

In [269]: df.median()
Out[269]: one         2.5
          two        25.0
          three     250.0
          four     2500.0
          dtype: float64

4. Mode

In [277]: de = pd.DataFrame({'A': [1, 2, 2, 3, 4, 4, 4, 5],
                             'B': [10, 20, 20, 30, 40, 40, 20, 40]})
          print('A', de['A'].mode())
          print('B', de['B'].mode())
          A 0    4
          dtype: int64
          B 0    20
          1    40
          dtype: int64

5. Variance

In [279]: df.var()
Out[279]: one      1.666667e+00
          two      1.666667e+02
          three    1.666667e+04
          four     1.666667e+06
          dtype: float64

6. Min

In [280]: df.min()
Out[280]: one         1
          two        10
          three     100
          four     1000
          dtype: int64

7. Max

In [281]: df.max()
Out[281]: one         4
          two        40
          three     400
          four     4000
          dtype: int64

8. Standard deviation

In [282]: df.std()
Out[282]: one         1.290994
          two        12.909944
          three     129.099445
          four     1290.994449
          dtype: float64

j) Describe function

The describe() method of a Pandas DataFrame returns descriptive statistics of the data in the DataFrame. It provides a quick summary of the central tendency, dispersion, and shape of the distribution of a set of numerical data. The default behavior of describe() is to compute descriptive statistics for all numerical columns in the DataFrame. If you want descriptive statistics for a specific column, you can pass the name of that column as an argument.

In [284]: data = {'one'  : pd.Series([1, 2, 3, 4]),
                  'two'  : pd.Series([10, 20, 30, 40]),
                  'three': pd.Series([100, 200, 300, 400]),
                  'four' : pd.Series([1000, 2000, 3000, 4000]),
                  'Five' : pd.Series(['A', 'B', 'C', 'D'])}
          df = pd.DataFrame(data)
          df.describe()
Out[284]:             one        two       three         four
          count  4.000000   4.000000    4.000000     4.000000
          mean   2.500000  25.000000  250.000000  2500.000000
          std    1.290994  12.909944  129.099445  1290.994449
          min    1.000000  10.000000  100.000000  1000.000000
          25%    1.750000  17.500000  175.000000  1750.000000
          50%    2.500000  25.000000  250.000000  2500.000000
          75%    3.250000  32.500000  325.000000  3250.000000
          max    4.000000  40.000000  400.000000  4000.000000

Note that the string column 'Five' is excluded from the numeric summary.
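All of the statistics above are computed column-wise by default (axis=0). Passing axis=1 computes them across each row instead. A small illustrative sketch on the same df (restricting to the numeric columns, since 'Five' holds strings):

In [ ]: df[['one', 'two', 'three', 'four']].sum(axis = 1)    # row totals: 1111, 2222, 3333, 4444
        df[['one', 'two', 'three', 'four']].mean(axis = 1)   # row means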
k) Pipe functions

1. Pipe function

The pipe() method of a Pandas DataFrame allows you to apply a function to the DataFrame, similar to the way the apply() method works. The difference is that pipe() allows you to chain multiple operations together by passing the output of one function as the input of the next function.

In [286]: data = {'one'  : pd.Series([1, 2, 3, 4]),
                  'two'  : pd.Series([10, 20, 30, 40]),
                  'three': pd.Series([100, 200, 300, 400]),
                  'four' : pd.Series([1000, 2000, 3000, 4000])}
          df = pd.DataFrame(data)

Example 1

In [291]: def add_(i, j):
              return i + j

          df.pipe(add_, 10)
Out[291]:    one  two  three  four
          0   11   20    110  1010
          1   12   30    210  2010
          2   13   40    310  3010
          3   14   50    410  4010

Example 2

In [294]: def mean_(col):
              return col.mean()

          def square(i):
              return i ** 2

          df.pipe(mean_).pipe(square)
Out[294]: one           6.25
          two         625.00
          three     62500.00
          four    6250000.00
          dtype: float64

2. Apply function

The apply() method allows you to apply a function to the DataFrame along a given axis. The function can be either a built-in Python function or a user-defined function.

In [295]: print(df.apply(np.mean))
          one         2.5
          two        25.0
          three     250.0
          four     2500.0
          dtype: float64

In [301]: df.apply(lambda x: x.max() - x.min())
Out[301]: one         3
          two        30
          three     300
          four     3000
          dtype: int64

3. Applymap function

The applymap() method applies a function to every individual element of the DataFrame. The function can be either a built-in Python function or a user-defined function.

In [303]: df.applymap(lambda x: x * 100)
Out[303]:    one   two  three    four
          0  100  1000  10000  100000
          1  200  2000  20000  200000
          2  300  3000  30000  300000
          3  400  4000  40000  400000

applymap and apply are both functions in the pandas library used for applying a function to elements of a pandas DataFrame or Series. applymap is used to apply a function to every element of a DataFrame; it returns a new DataFrame where each element has been modified by the input function. apply is used to apply a function along an axis of a DataFrame or Series; it returns either a Series or a DataFrame, depending on the axis along which the function is applied and the return value of the function. Unlike applymap, apply can take the context of the data into account, such as the row or column labels.

In [312]: df = pd.DataFrame({'A': [1.2, 3.4],
                             'B': [7.8, 9.1]})
          df_1 = df.applymap(np.int64)                        # element-wise: truncate each float to an integer
          print(df_1)
          df_2 = df.apply(lambda row: row.mean(), axis = 1)   # row-wise mean
          print(df_2)
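In recent pandas releases (2.1 and later), DataFrame.applymap has been renamed to DataFrame.map; the older name still works but emits a deprecation warning. A hedged sketch of the newer spelling:

In [ ]: # same element-wise behaviour as applymap, on pandas >= 2.1
        df.map(lambda x: x * 100)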
l) Reindex function

The reindex function in Pandas is used to change the row labels and/or column labels of a DataFrame. It can be used to align data from multiple DataFrames or to update the labels based on new data. The function takes a list or an array of new labels as its first argument and, optionally, a fill value to replace any missing values. The reindexing can be done along either the row axis (0) or the column axis (1), and the reindexed DataFrame is returned.

Example 1: rows

In [333]: data = {'one'  : pd.Series([1, 2, 3, 4]),
                  'two'  : pd.Series([10, 20, 30, 40]),
                  'three': pd.Series([100, 200, 300, 400]),
                  'four' : pd.Series([1000, 2000, 3000, 4000])}
          df = pd.DataFrame(data)
          print(df)
          print('-'*30)
          print(df.reindex([1, 0, 3, 2]))
             one  two  three  four
          0    1   10    100  1000
          1    2   20    200  2000
          2    3   30    300  3000
          3    4   40    400  4000
          ------------------------------
             one  two  three  four
          1    2   20    200  2000
          0    1   10    100  1000
          3    4   40    400  4000
          2    3   30    300  3000

Example 2: columns

In [336]: data = {'Name': ['John', 'Jane', 'Jim', 'Joan'],
                  'Age' : [25, 30, 35, 40],
                  'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}
          df = pd.DataFrame(data)
          df.reindex(columns = ['Name', 'City', 'Age'])
Out[336]:    Name         City  Age
          0  John     New York   25
          1  Jane  Los Angeles   30
          2   Jim      Chicago   35
          3  Joan      Houston   40

m) Renaming columns in a Pandas DataFrame

The rename function in Pandas is used to change the row labels and/or column labels of a DataFrame. It can be used to update the names of one or multiple rows or columns by passing a dictionary of new names as its argument. The dictionary should have the old names as keys and the new names as values.

In [343]: data = {'one'  : pd.Series([1, 2, 3, 4]),
                  'two'  : pd.Series([10, 20, 30, 40]),
                  'three': pd.Series([100, 200, 300, 400]),
                  'four' : pd.Series([1000, 2000, 3000, 4000])}
          df = pd.DataFrame(data)
          df.rename(columns = {'one': 'One', 'two': 'Two', 'three': 'Three', 'four': 'Four'},
                    index = {0: 'a', 1: 'b', 2: 'c'},
                    inplace = True)
          df
Out[343]:     One  Two  Three  Four
          a     1   10    100  1000
          b     2   20    200  2000
          c     3   30    300  3000
          3     4   40    400  4000
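rename also accepts a callable instead of a dictionary, which is convenient when the same transformation should be applied to every label. An illustrative sketch (not from the original notebook), applied to the renamed df above:

In [ ]: df.rename(columns = str.lower)   # 'One' -> 'one', 'Two' -> 'two', ...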
n) Sorting in a Pandas DataFrame

Pandas provides several methods to sort a DataFrame based on one or more columns. sort_values sorts the DataFrame based on one or more columns; the default sorting order is ascending, but you can change it to descending by passing the ascending argument with a value of False.

In [355]: data = {'one'  : pd.Series([11, 51, 31, 41]),
                  'two'  : pd.Series([10, 20, 30, 40]),
                  'three': pd.Series([100, 200, 500, 400]),
                  'four' : pd.Series([1000, 2000, 3000, 4000])}
          df = pd.DataFrame(data)
          df
Out[355]:    one  two  three  four
          0   11   10    100  1000
          1   51   20    200  2000
          2   31   30    500  3000
          3   41   40    400  4000

Sort with respect to a specific column:

In [356]: df.sort_values(by = 'one')
Out[356]:    one  two  three  four
          0   11   10    100  1000
          2   31   30    500  3000
          3   41   40    400  4000
          1   51   20    200  2000

Sort in descending order:

In [357]: df.sort_values(by = 'one', ascending = False)
Out[357]:    one  two  three  four
          1   51   20    200  2000
          3   41   40    400  4000
          2   31   30    500  3000
          0   11   10    100  1000

Sort based on multiple columns:

In [359]: df.sort_values(by = ['one', 'two'])
Out[359]:    one  two  three  four
          0   11   10    100  1000
          2   31   30    500  3000
          3   41   40    400  4000
          1   51   20    200  2000

Sort with a specific sorting algorithm (quicksort, mergesort, or heapsort):

In [361]: df.sort_values(by = ['one'], kind = 'heapsort')
Out[361]:    one  two  three  four
          0   11   10    100  1000
          2   31   30    500  3000
          3   41   40    400  4000
          1   51   20    200  2000

o) Groupby functions

The groupby function in pandas is used to split a DataFrame into groups based on one or more columns. It returns a DataFrameGroupBy object, which is similar to a DataFrame but has some additional methods to perform operations on the grouped data.

In [362]: cricket = {'Team'  : ['India', 'India', 'Australia', 'Australia', 'SA', 'SA', 'SA', 'SA', 'NZ', 'NZ', 'NZ', 'India'],
                     'Rank'  : [2, 3, 1, 2, 3, 4, 1, 1, 2, 4, 1, 2],
                     'Year'  : [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
                     'Points': [876, 801, 891, 815, 776, 784, 834, 824, 758, 691, 883, 782]}
          df = pd.DataFrame(cricket)
          df

In [365]: df.groupby('Team').groups
Out[365]: {'Australia': [2, 3], 'India': [0, 1, 11], 'NZ': [8, 9, 10], 'SA': [4, 5, 6, 7]}

Australia is present at indices 2 and 3, India is present at indices 0, 1 and 11, and so on.

To search for a specific country with a specific year:

In [366]: df.groupby(['Team', 'Year']).get_group(('Australia', 2014))
Out[366]:         Team  Rank  Year  Points
          2  Australia     1  2014     891

If the data is not present, we get an error.

Adding some statistical computation on top of groupby:

In [374]: df.groupby('Team').sum()['Points']
Out[374]: Team
          Australia    1706
          India        2459
          NZ           2332
          SA           3218
          Name: Points, dtype: int64

This gives the total Points scored by each team. Let us sort it to read it more easily:

In [377]: df.groupby('Team').sum()['Points'].sort_values(ascending = False)
Out[377]: Team
          SA           3218
          India        2459
          NZ           2332
          Australia    1706
          Name: Points, dtype: int64

Checking multiple statistics for Points, team-wise:

In [382]: groups = df.groupby('Team')
          groups['Points'].agg([np.sum, np.mean, np.std, np.max, np.min])
Out[382]:             sum        mean        std  amax  amin
          Team
          Australia  1706  853.000000  53.740115   891   815
          India      2459  819.666667  49.702648   876   782
          NZ         2332  777.333333  97.449192   883   691
          SA         3218  804.500000  28.769196   834   776

Using the filter function along with groupby:

In [386]: df.groupby('Team').filter(lambda x: len(x) == 4)
Out[386]:    Team  Rank  Year  Points
          4    SA     3  2014     776
          5    SA     4  2015     784
          6    SA     1  2016     834
          7    SA     1  2017     824

The rows for South Africa appear exactly 4 times, which is why South Africa is displayed here.

In [388]: df.groupby('Team').filter(lambda x: len(x) == 3)
Out[388]:     Team  Rank  Year  Points
          0   India     2  2014     876
          1   India     3  2015     801
          8      NZ     2  2016     758
          9      NZ     4  2014     691
          10     NZ     1  2015     883
          11  India     2  2017     782

The rows for India and New Zealand each appear 3 times, which is why they are displayed here.
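The multi-statistic summary above can also be written with named aggregation, which lets you choose the output column names directly. A small illustrative sketch (not from the original notebook) on the same cricket data; the output names are arbitrary:

In [ ]: df.groupby('Team').agg(total_points = ('Points', 'sum'),
                               mean_points  = ('Points', 'mean'),
                               best_rank    = ('Rank', 'min'))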
3. Working with CSV files and basic data analysis using Pandas

a) Reading CSV

Reading CSV files from the local system:

In [398]: df = pd.read_csv('Football.csv')
          df.head()

The first rows show one player per line, with columns such as Country, League, Club, Player Names, Matches_Played, Substitution, Mins, Goals and xG.

Reading CSV files from GitHub repositories. NOTE: the link should be copied while the file is displayed in raw format.

In [391]: link = 'https://raw.githubusercontent.com/AshishJangra27/Data-Analysis-with-...'
          df = pd.read_csv(link)
          df.head()

This dataset describes Google Play Store apps, with columns such as App, Category, Rating, Reviews, Size, Installs, Type, Price and Content Rating.
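read_csv takes many optional parameters that are useful on larger files; for example, you can load only selected columns or a limited number of rows. An illustrative sketch (not from the original notebook), using column names from the football dataset and assuming the file is present locally:

In [ ]: df_small = pd.read_csv('Football.csv',
                               usecols = ['Country', 'Player Names', 'Goals'],  # read only these columns
                               nrows = 100)                                     # read only the first 100 rows
        df_small.shape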
b) Pandas info function

The Pandas DataFrame info() function is used to get a concise summary of the DataFrame. It comes in really handy when doing exploratory analysis of the data; to get a quick overview of the dataset we use info().

Syntax: DataFrame.info(verbose=None, buf=None, max_cols=None, memory_usage=None, null_counts=None)

In [399]: df.info()
          <class 'pandas.core.frame.DataFrame'>
          RangeIndex: 660 entries, 0 to 659
          Data columns (total 15 columns):
           #   Column                   Non-Null Count  Dtype
           0   Country                  660 non-null    object
           1   League                   660 non-null    object
           2   Club                     660 non-null    object
           3   Player Names             660 non-null    object
           4   Matches_Played           660 non-null    int64
           5   Substitution             660 non-null    int64
           6   Mins                     660 non-null    int64
           7   Goals                    660 non-null    int64
           8   xG                       660 non-null    float64
           9   xG Per Avg Match         660 non-null    float64
           10  Shots                    660 non-null    int64
           11  OnTarget                 660 non-null    int64
           12  Shots Per Avg Match      660 non-null    float64
           13  On Target Per Avg Match  660 non-null    float64
           14  Year                     660 non-null    int64
          dtypes: float64(4), int64(7), object(4)
          memory usage: 77.5+ KB

c) isnull() function

isnull() checks whether there are NaN values present.

In [400]: df.isnull()

This returns a boolean table of the same shape (660 rows x 15 columns), with True wherever a value is missing and False otherwise. If we use the sum function along with it, we can see how many null values are present in each column:

In [401]: df.isnull().sum()
Out[401]: Country                    0
          League                     0
          Club                       0
          Player Names               0
          Matches_Played             0
          Substitution               0
          Mins                       0
          Goals                      0
          xG                         0
          xG Per Avg Match           0
          Shots                      0
          OnTarget                   0
          Shots Per Avg Match        0
          On Target Per Avg Match    0
          Year                       0
          dtype: int64
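When a dataset does have missing values, it is often more informative to look at the share of missing values per column rather than the raw counts. A small illustrative sketch (not from the original notebook):

In [ ]: missing_pct = df.isnull().sum() / len(df) * 100   # percentage of missing values per column
        missing_pct.sort_values(ascending = False)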
d) Quantile function

The quantile function is used to get a specific percentile value. Let us first check the 80th-percentile values of the columns using the describe function:

In [404]: df.describe(percentiles = [.80])

Besides count, mean, std, min, 50% and max, the summary now also contains an 80% row, and we can see that the 80th-percentile value of Mins is 2915.80. Let us use the quantile function to get the exact value:

In [406]: df['Mins'].quantile(.80)
Out[406]: 2915.8

Here we go, we got the same value. To get the 99th-percentile value we can write:

In [407]: df['Mins'].quantile(.99)
Out[407]: 3528.0199999999995

This function is important because it can be used to treat outliers during the EDA process in data science.

e) Copy function

If we simply do de = df, then a change in de will affect the data of df as well. So we need to copy in such a way that a totally new object is created and the old DataFrame is not affected.

In [413]: de = df.copy()
          de.head(3)

In [414]: de['Year+100'] = de['Year'] + 100
          de.head()

So we can see that a new column has been added to the copy, but our old data is untouched:

In [415]: df.head()

The new column is not present here.

f) Value counts function

The Pandas Series value_counts() function returns a Series containing the counts of unique values. The resulting object is in descending order, so the first element is the most frequently occurring value. NA values are excluded by default.

Syntax: Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)

In [417]: df['Player Names'].value_counts()

The most frequent player names (Andrea Belotti, Lionel Messi, Luis Suarez, Andrej Kramaric, Ciro Immobile, ...) appear at the top, names that occur only once (Francois Kamano, Lebo Mothiba, Gaetan Laborde, Falcao, Cody Gakpo, ...) at the bottom, and the resulting Series has length 444.
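Passing normalize=True makes value_counts return proportions instead of raw counts, which is handy for quickly checking how a categorical column is distributed. An illustrative sketch (not from the original notebook):

In [ ]: df['Country'].value_counts(normalize = True)   # fraction of rows per country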
g) Unique and nunique functions

While analyzing data, the user often wants to see the unique values in a particular column, which can be done using the Pandas unique() function.

In [418]: df['Player Names'].unique()
Out[418]: array(['Juanmi Callejon', 'Antoine Griezmann', 'Luis Suarez',
                 'Ruben Castro', 'Kevin Gameiro', 'Cristiano Ronaldo',
                 'Karim Benzema', 'Neymar ', 'Iago Aspas', 'Sergi Enrich',
                 'Lionel Messi', 'Gerard Moreno', 'Mohamed Salah',
                 'Robert Lewandowski', 'Timo Werner', ...], dtype=object)

Pandas nunique() is used to get the count of unique values:

In [419]: df['Player Names'].nunique()
Out[419]: 444

h) dropna() function

Sometimes a CSV file has null values, which are later displayed as NaN in the DataFrame. The Pandas dropna() method allows the user to analyze and drop rows/columns with null values in different ways.

Syntax: DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

axis: takes an int or string value for rows/columns. Input can be 0 or 'index' for rows and 1 or 'columns' for columns.
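dropna() also accepts how, thresh and subset, which give finer control over what gets dropped. A small illustrative sketch (not from the original notebook), shown here on the football data loaded earlier:

In [ ]: df.dropna(how = 'all')          # drop a row only if every value in it is missing
        df.dropna(thresh = 10)          # keep rows that have at least 10 non-null values
        df.dropna(subset = ['Goals'])   # only consider the 'Goals' column when deciding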
‘Suppose if we want to fill the null values with something instead of removing them then we can use fillna function Here we willbe filing the numerical columns with its mean values and Categorical columns with its mode M Link = ‘https: //raw.githubusercontent .con/AshishJengra27/bata-Analysis-witl df = pd.read_csv(1ink) print(len(df)) 10841 Numerical columns 46149 213124, 240 PM Pandas - Jupyter Notebook In [448]: DW mis = round(df[ ‘Rating’ ].mean(),2) dF {‘Rating'] = df[ ‘Rating’ ].FilIna(mis) print(len(df)) 10841 It we would have used inplcae=True then it would have permenantly stored those values in our dataframe Categorical values In [461]: MW d€['Current Ver'] = df['Current Ver'].fillna(‘Varies on Device’) 3) sample function Pandas sample() is used to generate a sample random row or column from the function caller data frame. Syntax: DataFrame.sample(n=None, frac=None, replace=False, weigh lone, axis-None) In [471]: df. sample(s) out[471]: App Category Rating Reviews Size Installs Type Price Diepiaying Free goes Psplving PHOTOGRAPHY 4.19 + om sor O14 tsar F192! ipraRtes_AND_DEMO 500 «28 25M «1.000 Free 0 Safest 433 Call COMMUNICATION 44027540 3.7M 1,000000+ Free 0 I Blocker san Ancias Varies zas2—_ Crime FAMILY 420 9409. with 1,000,000 Free 0 ity device Gangster 3D AoP 190 Mobie BUSINESS 4.30 85185 29M 5,000,000 Free 0 | Sokiions locathos!:8888/notebooks/Desktop/Python/Pandas/Pandas ipynb| 4749 213124, 240 PM Pandas - Jupyter Notebook k) to_csv() function Pandas Series.to_csv() function write the given series object to a comma-separated values (csv) filefformat. Syntax: Series.to_csv(*args, **kwargs) In [477]: M data = { ‘one’ : pd.Series([1, 2, 3, 4]), "two" } pd.Serdes([10, 2e, 32, 40]), ‘three’ : pd.Series([100, 200, 300, 400]), ‘four’ : pd.Series([1000, 2000, 3002, 4000])) df pd.DataFrame(data) df.to_csv(‘Number.csv") + We got an extra Unnamed:0 Column if we want to avoid that we need to add an extra parameter mentioning index=False In [478]: WM d¥.to_esv('Numbers.csv', index = False) 4. A detailed Pandas Profile report The pandas_profiling library in Python include a method named as ProfileReport() which generate a basic report on the input DataFrame. The report consist of the following: DataFrame overview, Each attribute on which DataFrame is defined, Correlations between attributes (Pearson Correlation and Spearman Correlation), and A sample of DataFrame. In [480]: M import natplotlib import pandas_profiling as pp locathos!:8888/notebooks/Desktop/Python/Pandas/Pandas ipynb| 4049 213124, 240 PM In [484]: WM df = pd.read_esv(‘Football.csv') Pandas - Jupyter Notebook df head) out[4e4): Country League club PHYS matches Played Substitution Mins Goals xG 0 Spain Latige (ger) Quan 19 16 1049 11 662 4 Spain Latige (BAR) g, Aone 36 0 37916 11.06 2 Spain Latiga (art) — gy lls a + 2040 28 2321 3 Spain Latiga (cary Ruben 2 a 2e2 19 1406 4 Spain Latige (WAL) “vit a 10 174513. 1065 In [485]: WM report = pp.ProfileReport (dF) In [486]: report Summarize dataset: Generate report structure: Render HTML: out[486]: Int]: WW ex| ex locathos!:8888/notebooks/Desktop/Python/Pandas/Pandas ipynb| | @/1 [e0: ex| | 0/5 (00:00? ?it/s] | 9/1 [00:00
