3/25/24, 10:42 PM EDA All Functions
Note Book By Tariq Ahmed (WP:
+923070996076)
1. df.head(n): Returns the first n rows of the DataFrame.
2. df.tail(n): Returns the last n rows of the DataFrame.
3. df.info(): Provides information about the DataFrame, including column names, data types,
and non-null value counts.
4. df.describe(): Computes various descriptive statistics for numerical columns in the
DataFrame, such as count, mean, standard deviation, and percentiles.
5. df.shape: Returns the dimensions (number of rows and columns) of the DataFrame.
6. df.columns: Returns the column names of the DataFrame.
7. df.dtypes: Returns the data types of each column in the DataFrame.
8. df.isnull(): Checks for missing values and returns a DataFrame of the same shape with
True/False values indicating the presence of missing values.
9. df.dropna(): Removes rows with missing values from the DataFrame.
10. df.fillna(value): Fills missing values in the DataFrame with a specified value.
11. df.groupby(by): Groups the DataFrame by one or more columns and returns a GroupBy
object for further aggregation and analysis.
12. df.sort_values(by): Sorts the DataFrame by one or more columns.
13. df.merge(df2): Merges two DataFrames based on common columns or indices.
14. df.pivot_table(values, index, columns): Creates a pivot table from the DataFrame,
aggregating values based on specified columns.
15. df.apply(func): Applies a function along an axis of the DataFrame (to each column by
default, or to each row with axis=1).
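The last three items in the list (merge, pivot_table, apply) are not demonstrated in the cells that follow, so here is a minimal sketch of each. The two small DataFrames, orders and customers, are made-up illustration data and are not taken from the Diwali sales file.

```python
import pandas as pd

# Made-up illustration data (hypothetical, not from the notebook's dataset)
orders = pd.DataFrame({
    "User_ID": [1, 2, 2, 3],
    "Zone": ["Western", "Southern", "Southern", "Central"],
    "Gender": ["F", "M", "M", "F"],
    "Amount": [100.0, 250.0, 50.0, 400.0],
})
customers = pd.DataFrame({
    "User_ID": [1, 2, 3],
    "Cust_name": ["Asha", "Ravi", "Mina"],
})

# merge: join the two frames on their shared User_ID column
merged = orders.merge(customers, on="User_ID", how="left")

# pivot_table: total Amount per Zone (rows) and Gender (columns)
pivot = orders.pivot_table(values="Amount", index="Zone",
                           columns="Gender", aggfunc="sum")

# apply: run a function on each column (axis=0 is the default)
ranges = orders[["User_ID", "Amount"]].apply(lambda col: col.max() - col.min())

print(merged)
print(pivot)
print(ranges)
```

Zone and Gender combinations that have no rows show up as NaN in the pivot table.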
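Import libraries
In [2]: import pandas as pd
Check the encoding used to open the CSV file
Opening the file without an encoding argument shows the default text encoding Python uses
on this system (cp1252 here). It does not inspect the file's contents, so treat it as a
starting guess for read_csv.
In [5]: with open('Diwali Sales [Link]') as f:
            print(f)
<_io.TextIOWrapper name='Diwali Sales [Link]' mode='r' encoding='cp1252'>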
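The encoding printed by open() is the system default rather than something detected from the file. One simple way to test a guess is to try decoding the raw bytes with a few candidate encodings. The sketch below assumes a hypothetical helper, guess_encoding, and a candidate list chosen for illustration; it runs on a throwaway temporary file, not on the notebook's CSV.

```python
import os
import tempfile


def guess_encoding(path, candidates=("utf-8", "cp1252")):
    """Return the first candidate that decodes the whole file, else None.

    Hypothetical helper for illustration: a successful decode only shows the
    bytes are valid in that encoding, not that it is the intended one.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None


# Demo on a temporary file containing a byte that is invalid as UTF-8
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write("name\ncaf\u00e9\n".encode("cp1252"))
detected = guess_encoding(tmp.name)
os.remove(tmp.name)
print(detected)
```

The returned name can then be passed as the encoding argument of pd.read_csv.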
Import the CSV file
In [6]: df = pd.read_csv('Diwali Sales [Link]', encoding='cp1252')
df.head() displays the first rows of a
DataFrame (five by default).
In [7]: df.head()
Out[7]:
   User_ID  Cust_name  Product_ID  Gender  Age Group  Age  Marital_Status  State           Zone      ...
0  1002903  Sanskriti  P00125942   F       26-35      28   0               Maharashtra     Western   ...
1  1000732  Kartik     P00110942   F       26-35      35   1               Andhra Pradesh  Southern  ...
2  1001990  Bindu      P00118542   F       26-35      35   1               Uttar Pradesh   Central   ...
3  1001425  Sudevi     P00237842   M       0-17       16   0               Karnataka       Southern  ...
4  1000588  Joni       P00057942   M       26-35      28   1               Gujarat         Western   ...
df.tail() returns the last five rows of the
DataFrame by default.
In [10]: df.tail()
Out[10]:
       User_ID  Cust_name    Product_ID  Gender  Age Group  Age  Marital_Status  State           Zone
11246  1000695  Manning      P00296942   M       18-25      19   1               Maharashtra     Western
11247  1004089  Reichenbach  P00171342   M       26-35      33   0               Haryana         Northern
11248  1001209  Oshin        P00201342   F       36-45      40   0               Madhya Pradesh  Central
11249  1004023  Noonan       P00059442   M       36-45      37   0               Karnataka       Southern
11250  1002744  Brumley      P00281742   F       18-25      19   0               Maharashtra     Western
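Both calls above use the default of five rows. As a small sketch of the n argument from the function list, the example below uses a made-up ten-row DataFrame named toy (hypothetical data, not the sales file).

```python
import pandas as pd

# Made-up illustration data: ten rows numbered 1..10
toy = pd.DataFrame({"Orders": range(1, 11)})

first_three = toy.head(3)        # first 3 rows
last_two = toy.tail(2)           # last 2 rows
all_but_last_two = toy.head(-2)  # negative n: every row except the last 2

print(first_three)
print(last_two)
print(len(all_but_last_two))
```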
df.info() provides a summary of the
DataFrame, including the following
information:
The total number of rows and columns in the DataFrame.
The column names and their corresponding data types.
The count of non-null values in each column.
The memory usage of the DataFrame.
In [8]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11251 entries, 0 to 11250
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User_ID 11251 non-null int64
1 Cust_name 11251 non-null object
2 Product_ID 11251 non-null object
3 Gender 11251 non-null object
4 Age Group 11251 non-null object
5 Age 11251 non-null int64
6 Marital_Status 11251 non-null int64
7 State 11251 non-null object
8 Zone 11251 non-null object
9 Occupation 11251 non-null object
10 Product_Category 11251 non-null object
11 Orders 11251 non-null int64
12 Amount 11239 non-null float64
13 Status 0 non-null float64
14 unnamed1 0 non-null float64
dtypes: float64(3), int64(4), object(8)
memory usage: 1.3+ MB
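The "+" in "1.3+ MB" indicates that the figure is a lower bound: for object (string) columns, pandas by default counts only the references, not the strings themselves. A minimal sketch of the difference, using a made-up two-column DataFrame named toy (hypothetical data):

```python
import pandas as pd

# Made-up illustration data with one string column and one integer column
toy = pd.DataFrame({
    "Cust_name": ["Sanskriti", "Kartik", "Bindu"],
    "Orders": [1, 3, 2],
})

# Default: string contents are not measured
shallow_bytes = int(toy.memory_usage().sum())

# deep=True also measures the string contents
deep_bytes = int(toy.memory_usage(deep=True).sum())

print(shallow_bytes, deep_bytes)
```

The same deep measurement can be requested from the summary itself with df.info(memory_usage='deep').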
df.describe() generates descriptive statistics
for the numeric columns of a DataFrame,
such as count, mean, standard deviation,
minimum, quartiles, and maximum.
In [9]: df.describe()
Out[9]: User_ID Age Marital_Status Orders Amount Status unnamed1
count 1.125100e+04 11251.000000 11251.000000 11251.000000 11239.000000 0.0 0.0
mean 1.003004e+06 35.421207 0.420318 2.489290 9453.610858 NaN NaN
std 1.716125e+03 12.754122 0.493632 1.115047 5222.355869 NaN NaN
min 1.000001e+06 12.000000 0.000000 1.000000 188.000000 NaN NaN
25% 1.001492e+06 27.000000 0.000000 1.500000 5443.000000 NaN NaN
50% 1.003065e+06 33.000000 0.000000 2.000000 8109.000000 NaN NaN
75% 1.004430e+06 43.000000 1.000000 3.000000 12675.000000 NaN NaN
max 1.006040e+06 92.000000 1.000000 4.000000 23952.000000 NaN NaN
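The table above covers only the numeric columns. For text columns such as Gender or Zone, describe can report count, number of unique values, most frequent value, and its frequency instead. A minimal sketch on a made-up DataFrame named toy (hypothetical data), selecting the non-numeric columns with the exclude argument:

```python
import pandas as pd

# Made-up illustration data
toy = pd.DataFrame({
    "Gender": ["F", "F", "M", "F"],
    "Zone": ["Western", "Central", "Central", "Central"],
    "Orders": [1, 2, 2, 4],
})

# Summarise only the non-numeric columns: count, unique, top, freq
cat_summary = toy.describe(exclude=["number"])

print(cat_summary)
```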
df.shape returns the dimensions (number of
rows and columns) of the DataFrame.
In [11]: df.shape
Out[11]: (11251, 15)
df.columns shows the column names of the
DataFrame.
In [13]: df.columns
Out[13]:
Index(['User_ID', 'Cust_name', 'Product_ID', 'Gender', 'Age Group', 'Age',
       'Marital_Status', 'State', 'Zone', 'Occupation', 'Product_Category',
       'Orders', 'Amount', 'Status', 'unnamed1'],
      dtype='object')
df.dtypes shows the data types of each
column in the DataFrame.
In [14]: df.dtypes
Out[14]:
User_ID int64
Cust_name object
Product_ID object
Gender object
Age Group object
Age int64
Marital_Status int64
State object
Zone object
Occupation object
Product_Category object
Orders int64
Amount float64
Status float64
unnamed1 float64
dtype: object
df.isnull(): Checks for missing values and
returns a DataFrame of the same shape with
True/False values indicating the presence of
missing values.
In [15]: df.isnull()
Out[15]:
       User_ID  Cust_name  Product_ID  Gender  Age Group  Age    Marital_Status  State  Zone   ...
0      False    False      False       False   False      False  False           False  False  ...
1      False    False      False       False   False      False  False           False  False  ...
2      False    False      False       False   False      False  False           False  False  ...
3      False    False      False       False   False      False  False           False  False  ...
4      False    False      False       False   False      False  False           False  False  ...
...    ...      ...        ...         ...     ...        ...    ...             ...    ...    ...
11246  False    False      False       False   False      False  False           False  False  ...
11247  False    False      False       False   False      False  False           False  False  ...
11248  False    False      False       False   False      False  False           False  False  ...
11249  False    False      False       False   False      False  False           False  False  ...
11250  False    False      False       False   False      False  False           False  False  ...
11251 rows × 15 columns
df.isnull().sum() checks for missing values
and counts the nulls in each column.
In [16]: df.isnull().sum()
Out[16]:
User_ID 0
Cust_name 0
Product_ID 0
Gender 0
Age Group 0
Age 0
Marital_Status 0
State 0
Zone 0
Occupation 0
Product_Category 0
Orders 0
Amount 12
Status 11251
unnamed1 11251
dtype: int64
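Raw counts are easier to judge as a share of all rows: taking the mean of the True/False mask gives the fraction missing per column, which makes fully empty columns (such as Status and unnamed1 above) easy to pick out. A minimal sketch on a made-up DataFrame named toy (hypothetical data):

```python
import numpy as np
import pandas as pd

# Made-up illustration data: Amount is half missing, Status is entirely missing
toy = pd.DataFrame({
    "Amount": [100.0, np.nan, 250.0, np.nan],
    "Status": [np.nan, np.nan, np.nan, np.nan],
    "Orders": [1, 2, 3, 4],
})

missing_count = toy.isnull().sum()    # number of nulls per column
missing_share = toy.isnull().mean()   # fraction of nulls per column (0.0 to 1.0)

# Columns where every value is missing
fully_empty = missing_share[missing_share == 1.0].index.tolist()

print(missing_count)
print(missing_share)
print(fully_empty)
```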
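df.dropna(): Removes rows with missing
values from the DataFrame. Because Status
and unnamed1 are empty in every row, every
row contains a missing value, so the result
has no rows.
In [17]: df.dropna()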
Out[17]:
User_ID  Cust_name  Product_ID  Gender  Age Group  Age  Marital_Status  State  Zone  Occupation  ...
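The output above contains only column headers: with the default settings, a row is dropped if any of its columns is missing, and two columns here are entirely empty. The sketch below shows two standard dropna options that avoid this, on a made-up DataFrame named toy (hypothetical data) with the same kind of empty column.

```python
import numpy as np
import pandas as pd

# Made-up illustration data: Status is missing in every row
toy = pd.DataFrame({
    "Orders": [1, 2, 3],
    "Amount": [100.0, np.nan, 250.0],
    "Status": [np.nan, np.nan, np.nan],
})

# Default: drop rows with a NaN in ANY column, which removes every row here
default_drop = toy.dropna()

# subset: only consider the listed columns when deciding which rows to drop
by_subset = toy.dropna(subset=["Amount"])

# axis=1, how="all": drop columns in which every value is missing
no_empty_cols = toy.dropna(axis=1, how="all")

print(len(default_drop))
print(by_subset)
print(no_empty_cols)
```

None of these calls changes toy itself; each returns a new DataFrame.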
df.drop('column name', axis=1, inplace=True)
removes the named column from the DataFrame.
Here it drops unnamed1, which contains only
missing values.
In [23]: df.drop('unnamed1', axis=1, inplace=True)
df.fillna(value): Fills missing values in the
DataFrame with a specified value.
In [27]: # Fill missing values with a constant value
         df_filled = df.fillna(0)
         df_filled
Out[27]:
       User_ID  Cust_name    Product_ID  Gender  Age Group  Age  Marital_Status  State           Zone      ...
0      1002903  Sanskriti    P00125942   F       26-35      28   0               Maharashtra     Western   ...
1      1000732  Kartik       P00110942   F       26-35      35   1               Andhra Pradesh  Southern  ...
2      1001990  Bindu        P00118542   F       26-35      35   1               Uttar Pradesh   Central   ...
3      1001425  Sudevi       P00237842   M       0-17       16   0               Karnataka       Southern  ...
4      1000588  Joni         P00057942   M       26-35      28   1               Gujarat         Western   ...
...    ...      ...          ...         ...     ...        ...  ...             ...             ...       ...
11246  1000695  Manning      P00296942   M       18-25      19   1               Maharashtra     Western   ...
11247  1004089  Reichenbach  P00171342   M       26-35      33   0               Haryana         Northern  ...
11248  1001209  Oshin        P00201342   F       36-45      40   0               Madhya Pradesh  Central   ...
11249  1004023  Noonan       P00059442   M       36-45      37   0               Karnataka       Southern  ...
11250  1002744  Brumley      P00281742   F       18-25      19   0               Maharashtra     Western   ...
11251 rows × 14 columns
Fill missing values with the mean of the
column
df['column name'] = df['column name'].fillna(df['column name'].mean())
Assigning the result back to the column is more reliable than calling fillna with
inplace=True on a selected column, which recent pandas versions warn may not update df.
In [33]: df['Amount'] = df['Amount'].fillna(df['Amount'].mean())
In [34]: # check that the missing values were filled
         df.isnull().sum()
Out[34]:
User_ID 0
Cust_name 0
Product_ID 0
Gender 0
Age Group 0
Age 0
Marital_Status 0
State 0
Zone 0
Occupation 0
Product_Category 0
Orders 0
Amount 0
Status 11251
dtype: int64
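When several columns need different fill values, fillna also accepts a dictionary that maps column names to values. A minimal sketch on a made-up DataFrame named toy (hypothetical data); the "Unknown" label is an arbitrary placeholder chosen for illustration.

```python
import numpy as np
import pandas as pd

# Made-up illustration data with one numeric and one text column
toy = pd.DataFrame({
    "Amount": [100.0, np.nan, 300.0],
    "Zone": ["Western", None, "Central"],
})

# Fill each column with its own value: the column mean for Amount,
# a placeholder label for Zone. The result is a new DataFrame.
filled = toy.fillna({"Amount": toy["Amount"].mean(), "Zone": "Unknown"})

print(filled)
print(filled.isnull().sum())
```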
df.groupby(by) groups a DataFrame by one or
more columns. It splits the DataFrame into
groups based on the unique values in the
specified column(s), so that an operation
can be applied to each group independently.
In [53]: grouped = df.groupby(['Product_ID', 'Cust_name'])
         mean_age = grouped['Age'].mean()
         print(mean_age)
Product_ID  Cust_name
P00000142   Adrian       19.0
            Akshat       27.0
            Armstrong    34.0
            Arun         33.0
            Atkinson     46.0
                         ...
P0099442    Amol         26.0
            Astrea       35.0
            Grant        32.0
            Siddharth    36.0
P0099742    Shatayu      13.0
Name: Age, Length: 10948, dtype: float64
In [54]: # in one line
         mean_values = df.groupby(['Product_ID', 'Cust_name'])['Age'].mean()
         mean_values
Out[54]:
Product_ID  Cust_name
P00000142   Adrian       19.0
            Akshat       27.0
            Armstrong    34.0
            Arun         33.0
            Atkinson     46.0
                         ...
P0099442    Amol         26.0
            Astrea       35.0
            Grant        32.0
            Siddharth    36.0
P0099742    Shatayu      13.0
Name: Age, Length: 10948, dtype: float64
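The cells above compute a single statistic per group. To compute several at once, the GroupBy object's agg method accepts named aggregations. A minimal sketch on a made-up DataFrame named toy (hypothetical data); the output column names mean_age, total_amount, and n_rows are chosen here for illustration.

```python
import pandas as pd

# Made-up illustration data
toy = pd.DataFrame({
    "Zone": ["Western", "Western", "Central", "Central", "Central"],
    "Age": [28, 35, 16, 40, 22],
    "Amount": [100.0, 300.0, 50.0, 150.0, 100.0],
})

# Named aggregation: output_column=(input_column, function)
summary = toy.groupby("Zone").agg(
    mean_age=("Age", "mean"),
    total_amount=("Amount", "sum"),
    n_rows=("Amount", "size"),
)

print(summary)
```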
df.sort_values(by): Sorts the DataFrame by
one or more columns.
In [59]: # Sort by a single column (ascending by default)
         # df.sort_values(by='Column1')
         df.sort_values(by='Amount')
Out[59]:
       User_ID  Cust_name    Product_ID  Gender  Age Group  Age  Marital_Status  State           Zone      ...
11250  1002744  Brumley      P00281742   F       18-25      19   0               Maharashtra     Western   ...
11249  1004023  Noonan       P00059442   M       36-45      37   0               Karnataka       Southern  ...
11248  1001209  Oshin        P00201342   F       36-45      40   0               Madhya Pradesh  Central   ...
11247  1004089  Reichenbach  P00171342   M       26-35      33   0               Haryana         Northern  ...
11246  1000695  Manning      P00296942   M       18-25      19   1               Maharashtra     Western   ...
...    ...      ...          ...         ...     ...        ...  ...             ...             ...       ...
4      1000588  Joni         P00057942   M       26-35      28   1               Gujarat         Western   ...
3      1001425  Sudevi       P00237842   M       0-17       16   0               Karnataka       Southern  ...
2      1001990  Bindu        P00118542   F       26-35      35   1               Uttar Pradesh   Central   ...
1      1000732  Kartik       P00110942   F       26-35      35   1               Andhra Pradesh  Southern  ...
0      1002903  Sanskriti    P00125942   F       26-35      28   0               Maharashtra     Western   ...
11251 rows × 14 columns
In [56]: # Sort by multiple columns
         # df.sort_values(by=['Column1', 'Column2'])
         df.sort_values(by=['Age', 'Amount'])
Out[56]:
       User_ID  Cust_name  Product_ID  Gender  Age Group  Age  Marital_Status  State           Zone
11240  1001425  Sudevi     P00044742   F       0-17       12   0               Delhi           Central
11109  1004135  Jayanti    P00229742   F       0-17       12   0               Delhi           Central
10804  1001673  Lampkin    P00277442   F       0-17       12   0               Gujarat         Western
10774  1001926  Barton     P00157542   M       0-17       12   1               Madhya Pradesh  Central
9505   1005403  Caroline   P00195742   M       0-17       12   1               Haryana         Northern
...    ...      ...        ...         ...     ...        ...  ...             ...             ...
2951   1002204  Dilbeck    P00246642   M       55+        92   0               Madhya Pradesh  Central
2698   1005658  Poirier    P00227942   M       55+        92   0               Karnataka       Southern
1106   1001176  Alice      P00128942   M       55+        92   0               Uttar Pradesh   Central
612    1002526  Shreya     P00271142   M       55+        92   1               Uttar Pradesh   Central
359    1003036  Prescott   P00255842   F       55+        92   0               Uttarakhand     Central
11251 rows × 14 columns
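Both sorts above are ascending, which is the default. The sketch below shows the ascending and na_position arguments of sort_values on a made-up DataFrame named toy (hypothetical data) that includes one missing Amount.

```python
import numpy as np
import pandas as pd

# Made-up illustration data; row 1 has a missing Amount
toy = pd.DataFrame({
    "Age": [35, 28, 35, 16],
    "Amount": [200.0, np.nan, 500.0, 50.0],
})

# Largest Amount first; missing values are placed last by default
by_amount_desc = toy.sort_values(by="Amount", ascending=False)

# Age ascending, then Amount descending within each Age
mixed = toy.sort_values(by=["Age", "Amount"], ascending=[True, False])

# Put rows with a missing Amount at the top instead
nan_first = toy.sort_values(by="Amount", na_position="first")

print(by_amount_desc)
print(mixed)
print(nan_first)
```

As in the outputs above, the original index labels travel with the rows; reset_index(drop=True) can be chained on if a fresh 0..n-1 index is wanted.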