Jupyter & Pandas Guide
Jupyter & Pandas Guide
"cells": [
{
"cell_type": "markdown",
"id": "10a43cc6",
"metadata": {
"id": "10a43cc6"
},
"source": [
"### Jupyter notebook commands"
]
},
{
"cell_type": "markdown",
"id": "7b3bfb1a",
"metadata": {
"id": "7b3bfb1a"
},
"source": [
"\n",
" Esc will take you into command mode where you can navigate around your
notebook with arrow keys.\n",
" While in command mode:\n",
" A to insert a new cell above the current cell, \n",
" B to insert a new cell below.\n",
" M to change the current cell to Markdown, \n",
" Y to change it back to code\n",
" D + D (press the key twice) to delete the current cell\n",
" Enter will take you from command mode back into edit mode for the
given cell.\n",
" Shift + Tab will show you the Docstring (documentation) for the the
object you have just typed in a code cell – you can keep pressing this short cut to
cycle through a few modes of documentation.\n",
" Ctrl + Shift + - will split the current cell into two from where your
cursor is.\n",
" Esc + F Find and replace on your code but not the outputs.\n",
" Esc + O Toggle cell output.\n",
" Select Multiple Cells:\n",
" Shift + J or Shift + Down selects the next sell in a downwards
direction. You can also select sells in an upwards direction by using Shift + K or
Shift + Up.\n",
" Once cells are selected, you can then delete / copy / cut /
paste / run them as a batch. This is helpful when you need to move parts of a
notebook.\n",
" You can also use Shift + M to merge multiple cells.\n"
]
},
{
"cell_type": "markdown",
"source": [
"###Data Frames in Pandas\n",
"\n",
"Pandas are one of the most popular Python libraries for manipulating and
analyzing data.\n",
"\n",
"* This library was originally built on Numpy, the fundamental library
for scientific computing in python. \n",
"* The data-structure offered by Pandas library are fast, flexible and
expressive. And are specifically disigned to make real-world data analysis. \n",
"* In Numpy the indices are inclusive, while in pandas the indices are
exclusivelt mentioned. \n",
"\n"
],
"metadata": {
"id": "Eaw5RdiF-Qog"
},
"id": "Eaw5RdiF-Qog"
},
{
"cell_type": "markdown",
"id": "8c6269a6",
"metadata": {
"id": "8c6269a6"
},
"source": [
"### Pandas can work with several file formats:\n",
"Example:\n",
" \n",
" Comma-separated values (CSV)\n",
" XLSX\n",
" ZIP\n",
" Plain Text (txt)\n",
" JSON\n",
" XML\n",
" HTML\n",
" Images\n",
" Hierarchical Data Format\n",
" PDF\n",
" DOCX\n",
" MP3\n",
" MP4\n",
" SQL\n",
" \n",
"https://www.cbtnuggets.com/blog/technology/programming/14-file-types-you-
can-import-into-pandas\n",
"https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html"
]
},
{
"cell_type": "markdown",
"id": "091504d8",
"metadata": {
"id": "091504d8"
},
"source": [
"# Pandas supports two Datatypes:\n",
" Series: 1-D column of data.\n",
" DataFrame: 2-D table with rows and column."
]
},
{
"cell_type": "markdown",
"id": "2ed2c5db",
"metadata": {
"id": "2ed2c5db"
},
"source": [
"## Importing Pandas"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "490f5193",
"metadata": {
"id": "490f5193"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"## import warnings\n",
"## warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b9007be7",
"metadata": {
"id": "b9007be7"
},
"outputs": [],
"source": [
"pd.__version__ ## To chck the version"
]
},
{
"cell_type": "markdown",
"id": "ffb3ce6d",
"metadata": {
"id": "ffb3ce6d"
},
"source": [
"## Creating pandas Series..."
]
},
{
"cell_type": "markdown",
"id": "33677719",
"metadata": {
"id": "33677719"
},
"source": [
"### Keyword for creating series is pd.Series where pd is the alias for
pandas"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a83eeca3",
"metadata": {
"id": "a83eeca3"
},
"outputs": [],
"source": [
"'''Creating Series'''\n",
"L1=['eagle','flamingo','crow','dove']\n",
"birds = pd.Series(L1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6dbc2fdf",
"metadata": {
"id": "6dbc2fdf"
},
"outputs": [],
"source": [
"print(birds)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1d2add7d",
"metadata": {
"id": "1d2add7d"
},
"outputs": [],
"source": [
"'''Creating another Series'''\n",
"animals = pd.Series(['cat','dog','horse','elephant'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3c6d75c",
"metadata": {
"id": "f3c6d75c"
},
"outputs": [],
"source": [
"animals"
]
},
{
"cell_type": "markdown",
"id": "6e54c7f8",
"metadata": {
"id": "6e54c7f8"
},
"source": [
"### To check the datatype"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "17239f3f",
"metadata": {
"id": "17239f3f"
},
"outputs": [],
"source": [
"'''Check the datatype'''\n",
"print(type(birds))\n",
"print(type(animals))"
]
},
{
"cell_type": "markdown",
"id": "1074fe0f",
"metadata": {
"id": "1074fe0f"
},
"source": [
"# DataFrame"
]
},
{
"cell_type": "markdown",
"id": "d503ba1e",
"metadata": {
"id": "d503ba1e"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "422764d8",
"metadata": {
"id": "422764d8"
},
"source": [
"## Creating a pandas DataFrame:\n",
"1. Prepare the data\n",
"2. Label the rows\n",
"3. Label the columns\n",
"**Default labeling will be done for rows and columns if not defined
exclusively**"
]
},
{
"cell_type": "markdown",
"id": "38b9348d",
"metadata": {
"id": "38b9348d"
},
"source": [
"### 1. Using numpy array for data"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8e3170a6",
"metadata": {
"id": "8e3170a6"
},
"outputs": [],
"source": [
"Array = np.array([[1,1,1],[2,4,8],[3,9,27],\n",
" [4,16,64],[5,25,125],[6,36,216],\n",
" [7,49,343]])\n",
"print(Array)"
]
},
{
"cell_type": "markdown",
"id": "4dfd4260",
"metadata": {
"id": "4dfd4260"
},
"source": [
"### Exclusively defining the rows and the columns:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ebdb4841",
"metadata": {
"id": "ebdb4841"
},
"outputs": [],
"source": [
"### defining the rows exclusively ###\n",
"index_names = ['first','second', 'third', 'fourth','fifth','sixth',
'seventh']\n",
"### defining the columns excusively\n",
"column_names = ['1', 'squares', 'cubes']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e8c3393c",
"metadata": {
"id": "e8c3393c"
},
"outputs": [],
"source": [
"df_Array=pd.DataFrame(data = Array, index = index_names,
columns=column_names)\n",
"df_Array"
]
},
{
"cell_type": "markdown",
"id": "d71c4b34",
"metadata": {
"id": "d71c4b34"
},
"source": [
"### Default labeling for rows and columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9ecd2bcf",
"metadata": {
"id": "9ecd2bcf"
},
"outputs": [],
"source": [
"df_Array1 = pd.DataFrame(Array)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6c8bbdab",
"metadata": {
"id": "6c8bbdab"
},
"outputs": [],
"source": [
"df_Array1"
]
},
{
"cell_type": "markdown",
"id": "c165989b",
"metadata": {
"id": "c165989b"
},
"source": [
"### By combining the existing Series in the form of a dictionary"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "da235edd",
"metadata": {
"id": "da235edd"
},
"outputs": [],
"source": [
"'''Creating DataFrame using already created Series'''\n",
"df = pd.DataFrame({\"birds_name\":birds,\n",
" \"animal_name\":animals}) \n",
"### birds_name is the key\n",
"### birds is the value"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "53542057",
"metadata": {
"id": "53542057"
},
"outputs": [],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "80976d07",
"metadata": {
"id": "80976d07"
},
"outputs": [],
"source": [
"'''Check the datatype'''\n",
"print(type(df))"
]
},
{
"cell_type": "markdown",
"id": "f8c3897d",
"metadata": {
"id": "f8c3897d"
},
"source": [
"### Saving Dataframe into .csv file..."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3989d954",
"metadata": {
"id": "3989d954"
},
"outputs": [],
"source": [
"'''Save the file as a_b.csv'''\n",
"df.to_csv('a_b.csv',':')"
]
},
{
"cell_type": "markdown",
"id": "7c7bc623",
"metadata": {
"id": "7c7bc623"
},
"source": [
"#### Creating another DataFrame..."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "54c89af3",
"metadata": {
"id": "54c89af3"
},
"outputs": [],
"source": [
"'''Let's create another DataFrame'''\n",
"\n",
"items = pd.Series(['Milk','Bread','Butter','Sugar'])\n",
"price = pd.Series([50,45,30,60])\n",
"df_1 = pd.DataFrame({\"Items\": items,\n",
" \"Price\": price})"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3fd5675a",
"metadata": {
"id": "3fd5675a"
},
"outputs": [],
"source": [
"df_1"
]
},
{
"cell_type": "markdown",
"id": "e6073103",
"metadata": {
"id": "e6073103"
},
"source": [
"#### Creating another DataFrame using python Dictionary "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "714588bd",
"metadata": {
"id": "714588bd"
},
"outputs": [],
"source": [
"df_2 = pd.DataFrame({\"Name\":['A','B','C'],\n",
" \"Age\": [20,30,40]})"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6e75c672",
"metadata": {
"id": "6e75c672"
},
"outputs": [],
"source": [
"df_2"
]
},
{
"cell_type": "markdown",
"id": "1272451d",
"metadata": {
"id": "1272451d"
},
"source": [
"## Importing Data File...\n",
"#### NOTE: Mention correctly the path where you have stored your datafile"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8588d270",
"metadata": {
"id": "8588d270"
},
"outputs": [],
"source": [
"sample = pd.read_excel(\"sample.xlsx\") "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "70cf1062",
"metadata": {
"id": "70cf1062"
},
"outputs": [],
"source": [
"sample"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ef939f16",
"metadata": {
"id": "ef939f16"
},
"outputs": [],
"source": [
"print(type(sample))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "871278cf",
"metadata": {
"scrolled": true,
"id": "871278cf"
},
"outputs": [],
"source": [
"print(type(sample[\"Name\"]))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7abed452",
"metadata": {
"id": "7abed452"
},
"outputs": [],
"source": [
"print(sample['Age'])"
]
},
{
"cell_type": "markdown",
"id": "92149777",
"metadata": {
"id": "92149777"
},
"source": [
"## Useful functions of Pandas"
]
},
{
"cell_type": "markdown",
"id": "5f071d15",
"metadata": {
"id": "5f071d15"
},
"source": [
"\n",
"\n",
"* `.head()` - prints the first 5 rows by default, If the number of rows
are passed exclusively as parameters, it will take that many rows. \n",
"* For example, `.head(2)` will print the first two rows. \n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b58c67fd",
"metadata": {
"id": "b58c67fd"
},
"outputs": [],
"source": [
"sample.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "308ba78c",
"metadata": {
"id": "308ba78c"
},
"outputs": [],
"source": [
"sample.head(12)"
]
},
{
"cell_type": "markdown",
"id": "12293bb8",
"metadata": {
"id": "12293bb8"
},
"source": [
"\n",
"\n",
"* `.tail()` - prints the last 5 rows by default, If the number of rows
are passed exclusively as parameters, it will take that many rows. \n",
"* For example, `.tail(2)` will print the last two rows. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "be0d3470",
"metadata": {
"id": "be0d3470"
},
"outputs": [],
"source": [
"sample.tail()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c82aa22b",
"metadata": {
"id": "c82aa22b"
},
"outputs": [],
"source": [
"sample.tail(2)"
]
},
{
"cell_type": "markdown",
"id": "85fb1fdf",
"metadata": {
"id": "85fb1fdf"
},
"source": [
"### for extracting the rows at random from the dataset, use the
command .sample()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d06d611c",
"metadata": {
"id": "d06d611c"
},
"outputs": [],
"source": [
"sample.sample(2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ca5ff424",
"metadata": {
"scrolled": true,
"id": "ca5ff424"
},
"outputs": [],
"source": [
"sample.sample(5)"
]
},
{
"cell_type": "markdown",
"id": "b877da45",
"metadata": {
"id": "b877da45"
},
"source": [
"## Useful fuctions to get basic information about the data\n",
"+ .describe()- will give basic statistics of the data in the comprehensive
format\n",
"+ .info()- will give information about the type of data and missing
values\n",
"+ .mean()- mean of each column\n",
"+ .sum() - sum of each column"
]
},
{
"cell_type": "markdown",
"id": "c5968c8b",
"metadata": {
"id": "c5968c8b"
},
"source": [
"### .describe() function"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c5aec223",
"metadata": {
"scrolled": true,
"id": "c5aec223"
},
"outputs": [],
"source": [
"sample.describe()"
]
},
{
"cell_type": "markdown",
"id": "3bf0000c",
"metadata": {
"id": "3bf0000c"
},
"source": [
"### .info()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8717c218",
"metadata": {
"scrolled": true,
"id": "8717c218"
},
"outputs": [],
"source": [
"sample.info()"
]
},
{
"cell_type": "markdown",
"id": "3a89ea9c",
"metadata": {
"id": "3a89ea9c"
},
"source": [
"### .mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e8ad667b",
"metadata": {
"id": "e8ad667b"
},
"outputs": [],
"source": [
"sample.mean()"
]
},
{
"cell_type": "markdown",
"id": "791be720",
"metadata": {
"id": "791be720"
},
"source": [
"### .sum()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "795e751d",
"metadata": {
"scrolled": true,
"id": "795e751d"
},
"outputs": [],
"source": [
"sample.sum()"
]
},
{
"cell_type": "markdown",
"id": "4a25ef9e",
"metadata": {
"id": "4a25ef9e"
},
"source": [
"### finding mean and sum of a particular column (by name)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fc34c426",
"metadata": {
"id": "fc34c426"
},
"outputs": [],
"source": [
"print(\"Total age of all the employees is:\", sample[\"Age\"].sum())\n",
"print(\"Average age is:\", sample[\"Age\"].mean())\n",
"print(\"Standard deviation in age is:\", sample[\"Age\"].std())"
]
},
{
"cell_type": "markdown",
"id": "cec73060",
"metadata": {
"id": "cec73060"
},
"source": [
"### Getting information about columns (attributes) through the keyword
columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0491b968",
"metadata": {
"id": "0491b968"
},
"outputs": [],
"source": [
"sample.columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "87895b31",
"metadata": {
"id": "87895b31"
},
"outputs": [],
"source": [
"sample.columns[2]"
]
},
{
"cell_type": "markdown",
"id": "8972dd69",
"metadata": {
"id": "8972dd69"
},
"source": [
"### Getting information about rows (obserations) through the keyword
index"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5e0825fe",
"metadata": {
"id": "5e0825fe"
},
"outputs": [],
"source": [
"sample.index"
]
},
{
"cell_type": "markdown",
"id": "1ab47dc4",
"metadata": {
"id": "1ab47dc4"
},
"source": [
"### len will give the total number of observations or samples"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "54b76154",
"metadata": {
"id": "54b76154"
},
"outputs": [],
"source": [
"len(sample)"
]
},
{
"cell_type": "markdown",
"id": "cfcd45e7",
"metadata": {
"id": "cfcd45e7"
},
"source": [
"### Displaying the first five rows again"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dda075ec",
"metadata": {
"id": "dda075ec"
},
"outputs": [],
"source": [
"sample.head()"
]
},
{
"cell_type": "markdown",
"id": "0c7cce41",
"metadata": {
"id": "0c7cce41"
},
"source": [
"### Changing a specific value in the dataframe\n",
" Suppose that we need to replace the name Harry with Henery. This can
be done by mentioning the column name and the row index"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "96df1d6b",
"metadata": {
"id": "96df1d6b"
},
"outputs": [],
"source": [
"'''Change the value at specific row and column...'''\n",
"sample[\"Name\"][0] = \"Henery\"\n",
"sample.head()"
]
},
{
"cell_type": "markdown",
"id": "6ffc61e3",
"metadata": {
"id": "6ffc61e3"
},
"source": [
"### Rearranging the dataframe:\n",
"* When axis is kept as 0, sorting is done as per rows; \n",
"* parameter ascending = False indicates that sorting will be done in
descending order\n",
"* When axis is kept as 1, sorting is done as per columns; "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "874ef242",
"metadata": {
"id": "874ef242"
},
"outputs": [],
"source": [
"sample.sort_index(axis = 0, ascending = False) "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "769b0bcc",
"metadata": {
"id": "769b0bcc"
},
"outputs": [],
"source": [
"sample.sort_index(axis = 0, ascending = True)"
]
},
{
"cell_type": "markdown",
"id": "5a977bb4",
"metadata": {
"id": "5a977bb4"
},
"source": [
"### Displaying the dataframe as transpose"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d43a02b1",
"metadata": {
"id": "d43a02b1"
},
"outputs": [],
"source": [
"sample.T"
]
},
{
"cell_type": "markdown",
"id": "50937fd7",
"metadata": {
"id": "50937fd7"
},
"source": [
"## .loc[ ] and iloc[ ] for accessing the elements of a dataframe\n",
"### loc referes to the label\n",
"### iloc refers to the position "
]
},
{
"cell_type": "markdown",
"id": "0dc16fda",
"metadata": {
"id": "0dc16fda"
},
"source": [
"### Understanding .loc[] and .iloc[] in a Pandas series"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7689a394",
"metadata": {
"id": "7689a394"
},
"outputs": [],
"source": [
"'''Let's Create a sample series'''\n",
"animals = pd.Series([\"Cat\", \"Dog\", \"Rabbit\", \"Ox\", \"Lion\"], \n",
" index=[3, 5, 9, 8, 2])"
]
},
{
"cell_type": "markdown",
"id": "edc234f0",
"metadata": {
"id": "edc234f0"
},
"source": [
"#### In the above example index of Cat is 3; index of Dog is 5; index
Rabbit is 9 and so on..... "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e7b80469",
"metadata": {
"id": "e7b80469"
},
"outputs": [],
"source": [
"animals #### Displaying the series"
]
},
{
"cell_type": "markdown",
"id": "1dec8182",
"metadata": {
"id": "1dec8182"
},
"source": [
"### If we want to access an element with index label 3, we make use
of .loc[]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ba3322b3",
"metadata": {
"id": "ba3322b3"
},
"outputs": [],
"source": [
"'''.loc[] refers to the label'''\n",
"animals.loc[3]"
]
},
{
"cell_type": "markdown",
"id": "73cad56a",
"metadata": {
"id": "73cad56a"
},
"source": [
"### If we want to access the element at the third position, we make use of
iloc[]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3c1a7b36",
"metadata": {
"id": "3c1a7b36"
},
"outputs": [],
"source": [
"'''iloc[] refers to the position'''\n",
"animals.iloc[3]"
]
},
{
"cell_type": "markdown",
"id": "5249582f",
"metadata": {
"id": "5249582f"
},
"source": [
"## loc and iloc in a dataframe"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "00125a45",
"metadata": {
"scrolled": false,
"id": "00125a45"
},
"outputs": [],
"source": [
"sample"
]
},
{
"cell_type": "markdown",
"id": "7d291eef",
"metadata": {
"id": "7d291eef"
},
"source": [
"### .loc[] refers to the index"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "219b259f",
"metadata": {
"id": "219b259f"
},
"outputs": [],
"source": [
"sample.loc[3]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d3a1cab",
"metadata": {
"id": "8d3a1cab"
},
"outputs": [],
"source": [
"sample.iloc[7]"
]
},
{
"cell_type": "markdown",
"id": "ac420fa4",
"metadata": {
"id": "ac420fa4"
},
"source": [
"### Using the slicing operator"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c8e2f03f",
"metadata": {
"scrolled": false,
"id": "c8e2f03f"
},
"outputs": [],
"source": [
"sample.loc[2:5]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "61a2aa9e",
"metadata": {
"scrolled": true,
"id": "61a2aa9e"
},
"outputs": [],
"source": [
"sample.iloc[0:4,0:2]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f30a071b",
"metadata": {
"scrolled": true,
"id": "f30a071b"
},
"outputs": [],
"source": [
"sample.iloc[2:5]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "924bcd89",
"metadata": {
"scrolled": true,
"id": "924bcd89"
},
"outputs": [],
"source": [
"sample.loc[0:2]['Age']"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8c369351",
"metadata": {
"id": "8c369351"
},
"outputs": [],
"source": [
"sample.loc[2:5][\"Age\"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "58b8c4d8",
"metadata": {
"scrolled": true,
"id": "58b8c4d8"
},
"outputs": [],
"source": [
"sample.head()"
]
},
{
"cell_type": "markdown",
"id": "f2aa6e32",
"metadata": {
"id": "f2aa6e32"
},
"source": [
"### .loc[] and .iloc[] for assigning the value"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0d95df62",
"metadata": {
"id": "0d95df62"
},
"outputs": [],
"source": [
"sample.loc[1,'Age'] = 40\n",
"sample.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3ce0de03",
"metadata": {
"id": "3ce0de03"
},
"outputs": [],
"source": [
"sample.iloc[1,1] = 25\n",
"sample.head()"
]
},
{
"cell_type": "markdown",
"id": "93f942bf",
"metadata": {
"id": "93f942bf"
},
"source": [
"### Creating a new column with the same value through out the DataFrame"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0ba425dd",
"metadata": {
"id": "0ba425dd"
},
"outputs": [],
"source": [
"sample[\"Occupation\"] = \"Employee\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "273ae7ad",
"metadata": {
"id": "273ae7ad"
},
"outputs": [],
"source": [
"sample"
]
},
{
"cell_type": "markdown",
"id": "fad4ec6d",
"metadata": {
"id": "fad4ec6d"
},
"source": [
"## Creating a new column using data of other columns of the DataFrame"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "97a8048a",
"metadata": {
"id": "97a8048a"
},
"outputs": [],
"source": [
"sample[\"Col\"] = sample[\"Age\"] / sample[\"Salary\"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "61ece3f6",
"metadata": {
"id": "61ece3f6"
},
"outputs": [],
"source": [
"sample"
]
},
{
"cell_type": "markdown",
"id": "d90fdaab",
"metadata": {
"id": "d90fdaab"
},
"source": [
"### Drop the column from DataFrame..."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0088f8f8",
"metadata": {
"id": "0088f8f8"
},
"outputs": [],
"source": [
"'''Dropping the column named as \"col\"...axis = 1 specifies the
column'''\n",
"sample = sample.drop(\"Col\",axis = 1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8b0853f0",
"metadata": {
"id": "8b0853f0"
},
"outputs": [],
"source": [
"sample"
]
},
{
"cell_type": "markdown",
"id": "96616d98",
"metadata": {
"id": "96616d98"
},
"source": [
"### Dropping the row named as \"0\"...axis = 0 specifies the row"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "11ca94ca",
"metadata": {
"id": "11ca94ca"
},
"outputs": [],
"source": [
"sample = sample.drop(3 ,axis = 0)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "20c287fd",
"metadata": {
"id": "20c287fd"
},
"outputs": [],
"source": [
"sample"
]
},
{
"cell_type": "markdown",
"id": "7f4612a8",
"metadata": {
"id": "7f4612a8"
},
"source": [
"### Appending a new row into the DataFrame\n",
"#### Let's take an example a new DataFrame..."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "14f6f1bc",
"metadata": {
"id": "14f6f1bc"
},
"outputs": [],
"source": [
"'''Create a new DataFrame'''\n",
"\n",
"df_1 = pd.DataFrame({\"A\":[1, 2, 3], \n",
" \"B\":[5, 6, 7]}) \n",
"\n",
"'''Create another DataFrame to append'''\n",
"\n",
"df_2 = pd.DataFrame({\"A\":[10, 20, 30, 40], \n",
" \"B\":[50, 60, 70, 80]}) "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f682fb43",
"metadata": {
"id": "f682fb43"
},
"outputs": [],
"source": [
"print(df_1)\n",
"print(df_2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d30acf32",
"metadata": {
"id": "d30acf32"
},
"outputs": [],
"source": [
"df_3 = df_1.append(df_2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "26a68b90",
"metadata": {
"id": "26a68b90"
},
"outputs": [],
"source": [
"df_3"
]
},
{
"cell_type": "markdown",
"id": "96a09594",
"metadata": {
"id": "96a09594"
},
"source": [
"#### if you take a look on the index of df_3, you can see that while
appending two DataFrames, it is also preserving their indexes/indices... to
overcome this problem we can set ignore_index = True"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "003ddc74",
"metadata": {
"id": "003ddc74"
},
"outputs": [],
"source": [
"df_3 = df_1.append(df_2, ignore_index= True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dfa26406",
"metadata": {
"id": "dfa26406"
},
"outputs": [],
"source": [
"df_3"
]
},
{
"cell_type": "markdown",
"id": "d775f974",
"metadata": {
"id": "d775f974"
},
"source": [
"## Appending the Datarames with different shapes"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f99cbec7",
"metadata": {
"id": "f99cbec7"
},
"outputs": [],
"source": [
"df_4 = pd.DataFrame({\"A\":[1, 2, 3, 4], \n",
" \"B\":[5, 6, 7, 8]}) \n",
"print(df_4)\n",
"'''Create another DataFrame to append'''\n",
"df_5 = pd.DataFrame({\"A\":[10, 20, 30], \n",
" \"B\":[50, 60, 70],\n",
" \"C\":[80,90,100]}) \n",
"print(df_5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9dcd4568",
"metadata": {
"id": "9dcd4568"
},
"outputs": [],
"source": [
"df_6 = df_4.append(df_5, ignore_index = True)"
]
},
{
"cell_type": "markdown",
"id": "87a33131",
"metadata": {
"id": "87a33131"
},
"source": [
"### For unequal number of columns in the data frame, non-existent values
will be filled with NaN values."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1df4518d",
"metadata": {
"id": "1df4518d"
},
"outputs": [],
"source": [
"df_6"
]
},
{
"cell_type": "markdown",
"id": "15b29ff9",
"metadata": {
"id": "15b29ff9"
},
"source": [
"## Dealing with Missing Values..."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a17f309e",
"metadata": {
"id": "a17f309e"
},
"outputs": [],
"source": [
"'''See if there is any Nan Value'''\n",
"df_6.isna()"
]
},
{
"cell_type": "markdown",
"id": "1c400c2e",
"metadata": {
"id": "1c400c2e"
},
"source": [
"#### Count the number of missing values"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e69efaff",
"metadata": {
"scrolled": true,
"id": "e69efaff"
},
"outputs": [],
"source": [
"df_6.isna().sum()"
]
},
{
"cell_type": "markdown",
"id": "8f1cfaad",
"metadata": {
"id": "8f1cfaad"
},
"source": [
"### Imputation of missing values"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5c095cc5",
"metadata": {
"id": "5c095cc5"
},
"outputs": [],
"source": [
"'''Let's use the .fillna() function to fill the C column with the mean of
the other values in the same column...\n",
"\n",
"Original df_6 will remain unchanged beacuse of inplace = False. \n",
"if inplace = True, changes will be reflected to the original
DataFrame...'''\n",
"\n",
"df_6[\"C\"].fillna(df_6[\"C\"].mean(),inplace=False)\n",
"print(df_6)"
]
},
{
"cell_type": "markdown",
"id": "7732bff1",
"metadata": {
"id": "7732bff1"
},
"source": [
"#### Drop the Missing Value rows... by default inplace=False"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eb797fc2",
"metadata": {
"id": "eb797fc2"
},
"outputs": [],
"source": [
"df_6.dropna()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e59e8cf0",
"metadata": {
"id": "e59e8cf0"
},
"outputs": [],
"source": [
"'''Appending a new row in the sample DataFrame'''\n",
"s = pd.Series({'Name':'Harry','Age':30, 'Gender':'M','Salary':32000})"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8bd3d7e8",
"metadata": {
"id": "8bd3d7e8"
},
"outputs": [],
"source": [
"sample = sample.append(s, ignore_index = True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a2c19925",
"metadata": {
"id": "a2c19925"
},
"outputs": [],
"source": [
"sample"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9b4f2212",
"metadata": {
"id": "9b4f2212"
},
"outputs": [],
"source": [
"sample['Occupation'].fillna('Employee', inplace = True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d33ea466",
"metadata": {
"id": "d33ea466"
},
"outputs": [],
"source": [
"sample"
]
},
{
"cell_type": "markdown",
"id": "cb936e3e",
"metadata": {
"id": "cb936e3e"
},
"source": [
"### Queries over the DataFrame"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "93c6cf78",
"metadata": {
"scrolled": true,
"id": "93c6cf78"
},
"outputs": [],
"source": [
"sample.loc[(sample['Age']>25)]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dfb79c29",
"metadata": {
"id": "dfb79c29"
},
"outputs": [],
"source": [
"sample.groupby(['Gender']).mean()"
]
},
{
"cell_type": "markdown",
"id": "13ec8a9e",
"metadata": {
"id": "13ec8a9e"
},
"source": [
"#### Rename the Column....Rename the column named as \"Occupation"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2c99f8fb",
"metadata": {
"id": "2c99f8fb"
},
"outputs": [],
"source": [
"sample.rename(columns = {'Occupation':'Designation'}, inplace = False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d1fc4304",
"metadata": {
"id": "d1fc4304"
},
"outputs": [],
"source": [
"sample"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0a08f469",
"metadata": {
"id": "0a08f469"
},
"outputs": [],
"source": [
"#df.sort_values(by=['Brand'], inplace=True)\n",
"\n",
"sample.sort_values('Name')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8cbb28e7",
"metadata": {
"id": "8cbb28e7"
},
"outputs": [],
"source": [
"sample"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3b512759",
"metadata": {
"id": "3b512759"
},
"outputs": [],
"source": [
"sample.sort_values('Age')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "06a6e128",
"metadata": {
"id": "06a6e128"
},
"outputs": [],
"source": [
"sample.sort_values( by = ['Age', 'Name'])"
]
},
{
"cell_type": "markdown",
"id": "b0dfa213",
"metadata": {
"id": "b0dfa213"
},
"source": [
"### Pandas Query: pandas.query()\n",
"https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html"
]
},
{
"cell_type": "markdown",
"id": "ab66daca",
"metadata": {
"id": "ab66daca"
},
"source": [
"#### To get the data of the Employees having Age between 25 an 30"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d32d1877",
"metadata": {
"scrolled": true,
"id": "d32d1877"
},
"outputs": [],
"source": [
"sample.query('Age > 25 and Age < 30')"
]
},
{
"cell_type": "markdown",
"id": "058117c6",
"metadata": {
"id": "058117c6"
},
"source": [
"#### To get the Age of the Employee named Monica"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f9f9e71d",
"metadata": {
"id": "f9f9e71d"
},
"outputs": [],
"source": [
"sample.query(\"Name == 'Monica'\")['Age' ]"
]
},
{
"cell_type": "markdown",
"id": "e2d6a7b0",
"metadata": {
"id": "e2d6a7b0"
},
"source": [
"#### To get the Age and Occupation of the Employee named Monica"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cc120798",
"metadata": {
"id": "cc120798"
},
"outputs": [],
"source": [
"sample.query(\"Name == 'Monica'\")[['Age' , 'Occupation']]"
]
},
{
"cell_type": "markdown",
"id": "382e7f99",
"metadata": {
"id": "382e7f99"
},
"source": [
"### AND operator\n",
"#Get the data of the Employee having Age>30 and Gennder = Male"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dd53de28",
"metadata": {
"scrolled": true,
"id": "dd53de28"
},
"outputs": [],
"source": [
"sample.query(\"(Age > 30) and (Gender =='M') \")"
]
},
{
"cell_type": "markdown",
"id": "94020093",
"metadata": {
"id": "94020093"
},
"source": [
"### OR operator\n",
"* Get the data of the Employee either having Age == 30 or Gennder = Male"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fbdb7eb8",
"metadata": {
"id": "fbdb7eb8"
},
"outputs": [],
"source": [
"'''OR operator'''\n",
"#Get the data of the Employee either having Age == 30 or Gennder = Male\
n",
"sample.query(\"(Age == 30) or (Gender =='M') \")"
]
},
{
"cell_type": "markdown",
"id": "51c68e59",
"metadata": {
"id": "51c68e59"
},
"source": [
"## References"
]
},
{
"cell_type": "markdown",
"id": "e08cf4fd",
"metadata": {
"id": "e08cf4fd"
},
"source": [
"1. https://pandas.pydata.org/docs/index.html\n",
"2. https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html \n",
"3. https://medium.com/jbennetcodes/how-to-rewrite-your-sql-queries-in-
pandas-and-more-149d341fc53e - For writing sql queries in Pandas\n",
"4. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and
Ipython by Wes Mc Kinney, 2016 – Reference Book"
]
},
{
"cell_type": "markdown",
"id": "1207d1e8",
"metadata": {
"id": "1207d1e8"
},
"source": [
"### Pandas is very efficient with small data (usually from 100MB up to
1GB) and performance is rarely a concern"
]
},
{
"cell_type": "markdown",
"id": "3e11e098",
"metadata": {
"id": "3e11e098"
},
"source": [
"#### You can work with datasets that are much larger than memory, as long
as each partition (a regular pandas DataFrame) fits in memory. By default, dask.
dataframe operations use a threadpool to do operations in parallel."
]
},
{
"cell_type": "markdown",
"id": "20bd4a70",
"metadata": {
"id": "20bd4a70"
},
"source": [
"##### Dask for big data - Dask is a flexible library for parallel
computing in Python. Dask is composed of two parts: Dynamic task scheduling
optimized for computation."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
},
"colab": {
"name": "Session6_Pandas.ipynb",
"provenance": []
}
},
"nbformat": 4,
"nbformat_minor": 5
}