Pandas Basics
In this chapter, you will see a brief introduction to Pandas series and
dataframes, the two basic data structures for storing data in Pandas.
Next, you will see how to create these data structures and some basic
functions that you can perform with Pandas. You will then study how to import
datasets into a Pandas dataframe using various input sources. Finally, the
chapter concludes with an explanation of the techniques for handling missing
data in Pandas dataframes.
Pandas comes preinstalled with Python distributions such as Anaconda.
Otherwise, you can install Pandas via the following pip command:
pip install pandas
2.1. Pandas Series
A Pandas series is a data structure that stores data in the form of a column. A
series is normally used to store information about a particular attribute in your
dataset. Let’s see how you can create a series in Pandas.
2.1.1. Creating Pandas Series
There are different ways to create a series with Pandas. The following script
imports the Pandas module and then calls the Series() class constructor to
create an empty series. Here is how to do that:
Script 1:
# empty series
import pandas as pd
my_series = pd.Series(dtype="float64")  # an explicit dtype avoids a warning in recent Pandas versions
print(my_series)
You can also create a series using a NumPy array. To do so, pass the array
to the Series() class constructor, as shown in the script below.
Script 2:
# series using numpy array
import pandas as pd
import numpy as np
my_array = np.array([10, 20, 30, 40, 50])
my_series = pd.Series(my_array)
print(my_series)
Output:
0 10
1 20
2 30
3 40
4 50
dtype: int32
In the above output, you can see that the default index of a series runs from
0 to one less than the number of items. You can also define custom indexes
for your series. To do so, pass your list of index labels to the index
parameter of the Series() constructor, as shown in the script below:
Script 3:
# series with custom indexes
import pandas as pd
import numpy as np
my_array = np.array([10, 20, 30, 40, 50])
my_series = pd.Series(my_array, index=["num1", "num2", "num3", "num4", "num5"])
print(my_series)
Output:
num1 10
num2 20
num3 30
num4 40
num5 50
dtype: int32
You can also create a series by directly passing a Python list to the Series()
class constructor.
Script 4:
# series using a list
import pandas as pd
my_series = pd.Series([10, 20, 30, 40, 50], index=["num1", "num2", "num3", "num4", "num5"])
print(my_series)
Output:
num1 10
num2 20
num3 30
num4 40
num5 50
dtype: int64
A scalar value can also be used to define a series. If you pass a list of
indexes, the scalar value is repeated once for each item in the index list.
Here is an example:
Script 5:
# series using a scalar
import pandas as pd
my_series = pd.Series(25, index=["num1", "num2", "num3", "num4", "num5"])
print(my_series)
Output:
num1 25
num2 25
num3 25
num4 25
num5 25
dtype: int64
Finally, you can also create a series using a dictionary. In this case, the
dictionary keys become the series indexes, while the dictionary values are
inserted as series items. Here is an example:
Script 6:
# series using dictionary
import pandas as pd
my_dict = {'num1': 6,
           'num2': 7,
           'num3': 8}
my_series = pd.Series(my_dict)
print(my_series)
Output:
num1 6
num2 7
num3 8
dtype: int64
2.1.2. Useful Operations on Pandas Series
Let’s see some of the useful operations you can perform with the Pandas
series.
You can access series items by position as well as by index label. Recent
versions of Pandas prefer the explicit iloc indexer for positional access, as
shown in the following script:
Script 7:
## Accessing Items
import pandas as pd
my_series = pd.Series([10, 20, 30, 40, 50], index=["num1", "num2", "num3", "num4", "num5"])
print(my_series.iloc[0])   # access by position
print(my_series['num3'])   # access by index label
Output:
10
30
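Similarly, the loc indexer makes label-based access explicit, and a Boolean condition can filter a series. Here is a minimal sketch, reusing my_series from Script 7:
print(my_series.loc['num2'])      # explicit label-based access: 20
print(my_series[my_series > 25])  # Boolean filtering keeps num3, num4, num5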
Using the min() and max() functions from the NumPy module, you can find the
minimum and maximum values, respectively, in a series. Look at the
following script for reference.
Script 8:
## Finding Maximum and Minimum Values
import pandas as pd
import numpy as np
my_series = pd.Series([5, 8, 2, 11, 9])
print(np.min(my_series))
print(np.max(my_series))
Output:
2
11
Similarly, the mean() method of a Pandas series returns the mean of its
values, as shown in the following script.
Script 9:
## Finding Mean
import pandas as pd
my_series = pd.Series([5, 8, 2, 11, 9])
print(my_series.mean())
Output:
7.0
The following script finds the median value of a Pandas series.
Script 10:
## Finding Median
import pandas as pd
my_series = pd.Series([5, 8, 2, 11, 9])
print(my_series.median())
Output:
8.0
You can also find the data type of a Pandas series using the dtype attribute.
Here is an example:
Script 11:
## Finding Data Type
import pandas as pd
my_series = pd.Series([5, 8, 2, 11, 9])
print(my_series.dtype)
Output:
int64
A Pandas series can also be converted to a Python list using the tolist()
method, as shown in the script below:
Script 12:
## Converting to List
import pandas as pd
my_series = pd.Series([5, 8, 2, 11, 9])
print(my_series.tolist())
Output:
[5, 8, 2, 11, 9]
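A Pandas series offers many more convenience methods along the same lines. As a small sketch, sort_values() orders the items and value_counts() tallies how often each value occurs; the series below is made up for illustration:
import pandas as pd
my_series = pd.Series([5, 8, 2, 11, 8])
print(my_series.sort_values())   # items ordered ascending
print(my_series.value_counts())  # 8 appears twice, the others once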
2.2. Pandas Dataframe
A Pandas dataframe is a tabular data structure that stores data in the form of
rows and columns. By convention, rows correspond to records, while columns
refer to attributes. In the simplest terms, a Pandas dataframe is a
collection of series, as the short sketch below illustrates.
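You can verify the last point yourself: selecting a single column from a dataframe returns a Pandas series. A minimal sketch:
import pandas as pd
my_df = pd.DataFrame({'Subject': ['Mathematics', 'English'],
                      'Score': [85, 91]})
print(type(my_df['Score']))  # <class 'pandas.core.series.Series'>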
2.2.1. Creating a Pandas Dataframe
As is the case with a series, there are multiple ways to create a Pandas
dataframe.
To create an empty dataframe, you can use the DataFrame class from the
Pandas module, as shown below:
Script 13:
# empty pandas dataframe
import pandas as pd
my_df = pd.DataFrame()
print(my_df)
Output:
Empty DataFrame
Columns: []
Index: []
You can create a Pandas dataframe using a list of lists. Each sublist in the
outer list corresponds to a row in the dataframe, and each item within a
sublist becomes an attribute value.
To specify column headers, you need to pass a list of values to the columns
parameter of the DataFrame() constructor.
Here is an example of how you can create a Pandas dataframe using a list.
Script 14:
# dataframe using list of lists
import pandas as pd
scores = [['Mathematics', 85], ['English', 91], ['History', 95]]
my_df = pd.DataFrame(scores, columns=['Subject', 'Score'])
my_df
Output:
       Subject  Score
0  Mathematics     85
1      English     91
2      History     95
Similarly, you can create a Pandas dataframe using a dictionary. One way is
to create a dictionary whose keys correspond to column headers, while the
corresponding dictionary values are lists that hold the column values of the
dataframe.
Here is an example for your reference:
Script 15:
# dataframe using dictionaries
import pandas as pd
scores = {'Subject': ["Mathematics", "History", "English", "Science", "Arts"],
          'Score': [98, 75, 68, 82, 99]}
my_df = pd.DataFrame(scores)
my_df
Output:
       Subject  Score
0  Mathematics     98
1      History     75
2      English     68
3      Science     82
4         Arts     99
Another way to create a Pandas dataframe is using a list of dictionaries. Each
dictionary corresponds to one row. Here is an example of how to do that.
Script 16:
# dataframe using list of dictionaries
import pandas as pd
scores = [
    {'Subject': 'Mathematics', 'Score': 85},
    {'Subject': 'History', 'Score': 98},
    {'Subject': 'English', 'Score': 76},
    {'Subject': 'Science', 'Score': 72},
    {'Subject': 'Arts', 'Score': 95},
]
my_df = pd.DataFrame(scores)
my_df
Output:
       Subject  Score
0  Mathematics     85
1      History     98
2      English     76
3      Science     72
4         Arts     95
The dictionaries within the list used to create a Pandas dataframe need not be
of the same size.
For example, in the script below, the fourth dictionary in the list contains only
one item, unlike the rest of the dictionaries. The corresponding dataframe
row will contain a null value for the missing Subject key, as shown in
the output of the script below:
Script 17:
# dataframe using list of dictionaries
# with null items
import pandas as pd
scores = [
    {'Subject': 'Mathematics', 'Score': 85},
    {'Subject': 'History', 'Score': 98},
    {'Subject': 'English', 'Score': 76},
    {'Score': 72},
    {'Subject': 'Arts', 'Score': 95},
]
my_df = pd.DataFrame(scores)
my_df
Output:
       Subject  Score
0  Mathematics     85
1      History     98
2      English     76
3          NaN     72
4         Arts     95
2.2.2. Basic Operations on Pandas Dataframe
Let’s now see some of the basic operations that you can perform on Pandas
dataframes.
To view the top N rows of a dataframe, you can call the head(N) method, as
shown in the script below:
Script 18:
# viewing header
import pandas as pd
scores = [
    {'Subject': 'Mathematics', 'Score': 85},
    {'Subject': 'History', 'Score': 98},
    {'Subject': 'English', 'Score': 76},
    {'Subject': 'Science', 'Score': 72},
    {'Subject': 'Arts', 'Score': 95},
]
my_df = pd.DataFrame(scores)
my_df.head(2)
Output:
       Subject  Score
0  Mathematics     85
1      History     98
To view the last N rows, you can use the tail() method. Here is an example:
Script 19:
# viewing tail
my_df = pd.DataFrame(scores)
my_df.tail(2)
Output:
   Subject  Score
3  Science     72
4     Arts     95
You can also get a summary of your Pandas dataframe using the info() method.
Script 20:
# getting dataframe info
my_df = pd.DataFrame(scores)
my_df.info()
In the output below, you can see the number of entries in your Pandas
dataframe, the number of columns along with their data types, and so on.
Output:
Finally, to get information such as mean, minimum, maximum, standard
deviation, etc., for numeric columns in your Pandas dataframe, you can use the
describe() method, as shown in the script below:
Script 21:
# getting info about numeric columns
my_df = pd.DataFrame(scores)
my_df.describe()
Output:
2.3. Importing Data in Pandas
You can import data from various sources into your Pandas dataframe. Some
of them are discussed in this section.
2.3.1. Importing CSV Files
A CSV (comma-separated values) file is a file in which each line contains a
single record, with the columns separated by commas.
You can read CSV files using the read_csv() function of the Pandas
module, as shown below. The “iris_data.csv” file is available in the Data
folder of the book resources.
Script 22:
import pandas as pd
iris_data = pd.read_csv(r"D:\Datasets\iris_data.csv")
iris_data.head()
If you print the dataframe header, you should see that the dataset contains
five columns with different information about iris plants.
Output:
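If you do not have the file on disk, you can reproduce the same behavior with an in-memory buffer. A minimal sketch; the column names and values below are made up for illustration:
import io
import pandas as pd
# a tiny in-memory CSV standing in for a file on disk
csv_text = (
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "4.9,3.0,1.4,0.2,setosa\n"
)
iris_data = pd.read_csv(io.StringIO(csv_text))
print(iris_data.head())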
In some cases, CSV files do not contain a header row. In such cases, the
read_csv() method will wrongly treat the first row of data as the dataframe
header. To specify custom headers for your CSV file, you need to pass a list
of header names to the names parameter of the read_csv() method, as shown
in the script below. You can find the “pima-indians-diabetes.csv” file in the
Data folder of the book resources.
Script 23:
headers = ["Preg", "Glucose", "BP", "skinThick", "Insulin", "BMI", "DPF", "Age", "Class"]
patient_data = pd.read_csv(r"D:\Datasets\pima-indians-diabetes.csv", names=headers)
patient_data.head()
In the output below, you can see the custom headers that you passed in the list
to the read_csv() method’s names parameter.
Output:
2.3.2. Importing TSV Files
TSV files are similar to CSV files, except that the delimiter used to
separate columns is a single tab. The read_csv() function can be used to read
a TSV file; however, you have to pass “\t” as the value for the “sep”
parameter, as shown below.
Note: You can find the “iris_data.tsv” file in the Data folder of the book
resources.
Script 24:
import pandas as pd
iris_data = pd.read_csv(r"D:\Datasets\iris_data.tsv", sep='\t')
iris_data.head()
Output:
2.3.3. Importing Data from Databases
Oftentimes, you need to import data from different databases into your Pandas
dataframe. In this section, you will see how to import data from various
databases into a Python application.
Importing Data from SQL Server
To import data from Microsoft’s SQL Server database, you need to first
install the “pyodbc” module for Python. To do so, execute the following
command on your command terminal.
$ pip install pyodbc
Next, you need to create a connection with your SQL Server database. The
connect() method of the “pyodbc” module can be used to create a connection.
You have to pass the driver name, the server name, and the database name to
the connect() method, as shown below.
Note: To run the following script, you need to create a database named
Titanic with a table named records. Explaining how to create a database
and tables is beyond the scope of this book. You can find further details at
the link below.
Further Readings – CRUD Operations with SQL Server
To see how to create databases and tables with SQL Server, take a look at
this link: https://bit.ly/2XwEgAV
In the following script, we connect to the Titanic database.
Script 25:
import pandas as pd
import pyodbc
sql_conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                          'SERVER=HOARE\\SQLEXPRESS;'
                          'DATABASE=Titanic;'
                          'Trusted_Connection=yes')
Once the connection is established, you write an SQL SELECT query that
fetches the desired records. The following query fetches all records from
the records table.
The query and the connection object are then passed to the pd.read_sql()
function, which stores the returned records in a Pandas dataframe.
Finally, the dataframe header is printed to display the first five rows of the
imported table.
Script 26:
query = "SELECT * FROM records;"
titanic_data = pd.read_sql(query, sql_conn)
titanic_data.head()
Importing Data from PostgreSQL
To import data from PostgreSQL, you need to install the SQLAlchemy
module, along with a PostgreSQL driver such as psycopg2. Execute the
following pip statement to do so.
$ pip install SQLAlchemy
Next, you need to create an engine, which serves as a connection between the
PostgreSQL server and the Python application. The following script shows
how to create a connection engine. The connection string has the form
postgresql://username:password@host:port/database, so replace the
credentials, server, and database name with your own.
Script 27:
from sqlalchemy import create_engine
engine = create_engine('postgresql://postgres:abc123@localhost:5432/Titanic')
To store the records returned by the query in a Pandas dataframe, the query
and the engine are passed to the pd.read_sql_query() function of the
Pandas module. Finally, the dataframe header is printed to display the first
five rows of the imported table.
Script 28:
import pandas as pd
titanic_data = pd.read_sql_query('select * from "records"', con=engine)
titanic_data.head()
Output:
Further Readings – CRUD Operations with PostgreSQL
To see how to create databases and tables with PostgreSQL, take a look at
this link: https://bit.ly/2XyJr3f
Importing Data from SQLite
To import data from an SQLite database, you do not need any external
module. You can use the default sqlite3 module.
The first step is to connect to an SQLite database. To do so, you can use the
connect() method of the sqlite3 module, as shown below.
Script 29:
import sqlite3
import pandas as pd
# Create your connection.
cnx = sqlite3.connect('E:/Titanic.db')
Next, you can call the pd.read_sql_query() function and pass it the
SELECT query and the database connection. Finally, the dataframe header
is printed to display the first five rows of the imported table.
Script 30:
titanic_data = pd.read_sql_query("SELECT * FROM records", cnx)
titanic_data.head()
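If you do not have a Titanic.db file at hand, the following self-contained sketch builds a small in-memory database first. The table name matches the example above, but the columns and rows are made up for illustration:
import sqlite3
import pandas as pd
# build a throwaway in-memory database with a toy records table
cnx = sqlite3.connect(":memory:")
cnx.execute("CREATE TABLE records (name TEXT, survived INTEGER)")
cnx.executemany("INSERT INTO records VALUES (?, ?)",
                [("Alice", 1), ("Bob", 0)])
cnx.commit()
print(pd.read_sql_query("SELECT * FROM records", cnx).head())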
Further Readings – CRUD Operations with SQLite
To see how to create databases and tables with SQLite, take a look at this
link: https://bit.ly/2BAXZXL
2.4. Handling Missing Values in Pandas
Missing values, as the name suggests, are those observations in the dataset
that do not contain any value. Missing values can distort the data patterns,
and, therefore, it is extremely important to understand why missing values
occur in the dataset and how to handle them.
In this section, you will see the different techniques with examples to handle
missing values in your Pandas dataframes.
2.4.1. Handling Missing Numerical Values
To handle missing numerical data, we can use statistical techniques. The use
of statistical techniques or algorithms to replace missing values with
statistically generated values is called imputation.
In this section, you will see how to do median and mean imputation. Mean or
median imputation is one of the most commonly used imputation techniques
for handling missing numerical data.
In mean or median imputation, missing values in a column are replaced by the
mean or median of all the remaining values in that particular column.
For instance, suppose you have an Age column whose second value is
missing. With mean or median imputation, you can replace that missing value
with either the mean or the median of all the other values in the column. If,
say, the remaining values average 25, mean imputation places 25 in the second
row. You could also replace this value with the median if you want.
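Here is a minimal sketch of the idea; the numbers are made up so that both the mean and the median of the remaining values equal 25:
import pandas as pd
age = pd.Series([20, None, 30, 25])  # the second value is missing
print(age.fillna(age.mean()))    # the gap is filled with the mean, 25.0
print(age.fillna(age.median()))  # the gap is filled with the median, 25.0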
Let’s see a practical example of mean and median imputation. First, we will
import the Titanic dataset and find the columns that contain missing values.
Then, we will apply mean and median imputation to the columns containing
missing values. Finally, we will see the effect of applying mean and median
imputation to the missing values.
You do not need to download the Titanic dataset manually. The Seaborn
library’s load_dataset() function fetches it for you. The following script
imports the Titanic dataset and displays its first five rows.
Script 31:
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = [8, 6]
sns.set_style("darkgrid")
titanic_data = sns.load_dataset('titanic')
titanic_data.head()
Output:
Let’s filter some of the numeric columns from the dataset and see if they
contain any missing values.
Script 32:
titanic_data = titanic_data[["survived", "pclass", "age", "fare"]]
titanic_data.head()
Output:
To find the ratio of missing values in the aforementioned columns, you need
to first call the isnull() method on the titanic_data dataframe and then call
the mean() method, as shown below. Since isnull() returns Boolean values and
mean() treats True as 1, the result is the fraction of missing entries per column.
Script 33:
titanic_data.isnull().mean()
Output:
survived 0.000000
pclass 0.000000
age 0.198653
fare 0.000000
dtype: float64
The above output shows that only the age column contains missing values,
and the ratio of missing values is around 19.86 percent.
Let’s now find the median and mean of all the non-missing values
in the age column.
Script 34:
median = titanic_data.age.median()
print(median)
mean = titanic_data.age.mean()
print(mean)
Output:
28.0
29.69911764705882
The age column has a median value of 28 and a mean value of 29.6991.
To later plot the kernel density estimates of the actual, median-imputed, and
mean-imputed ages, we will add two new columns to the Pandas dataframe.
Script 35:
import numpy as np
titanic_data['Median_Age'] = titanic_data.age.fillna(median)
titanic_data['Mean_Age'] = titanic_data.age.fillna(mean)
titanic_data['Mean_Age'] = np.round(titanic_data['Mean_Age'], 1)
titanic_data.head(20)
The above script adds Median_Age and Mean_Age columns to the
titanic_data dataframe and prints the first 20 records. Here is the output of
the above script:
Output:
The highlighted rows in the above output show that the NaN, i.e., null, values
in the age column have been replaced by the median value in the Median_Age
column and by the mean value in the Mean_Age column.
Mean and median imputation can affect the data distribution of the
columns containing the missing values. Specifically, the variance of the
column is decreased by mean and median imputation, since more values
are added to the center of the distribution. The following script plots the
distribution of data for the age, Median_Age, and Mean_Age columns.
Script 36:
fig = plt.figure()
ax = fig.add_subplot(111)
titanic_data['age'].plot(kind='kde', ax=ax)
titanic_data['Median_Age'].plot(kind='kde', ax=ax, color='red')
titanic_data['Mean_Age'].plot(kind='kde', ax=ax, color='green')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
Here is the output of the script above:
Output:
You can see that the distribution of values in the age column has been
distorted by the mean and median imputation, and the overall variance of the
dataset has also been decreased.
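You can confirm the drop in variance directly. A quick check, continuing with the titanic_data dataframe from Script 35:
# variance before and after imputation (NaNs are skipped by default)
print(titanic_data['age'].var())
print(titanic_data['Median_Age'].var())
print(titanic_data['Mean_Age'].var())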
Recommendations
Mean and median imputation can be used for missing numerical data when
the data is missing at random. If the data is normally distributed, mean
imputation is the better choice; for skewed distributions, median imputation
is preferred.
2.4.2. Handling Missing Categorical Values
Frequent Category Imputation
One of the most common ways of handling missing values in a categorical
column is to replace them with the most frequently occurring value, i.e., the
mode of the column. For this reason, frequent category imputation is also
known as mode imputation. Let’s see a real-world example of frequent
category imputation.
We will again use the Titanic dataset. We will first find the percentage
of missing values in the age, fare, and embark_town columns.
Script 37:
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = [8, 6]
sns.set_style("darkgrid")
titanic_data = sns.load_dataset('titanic')
titanic_data = titanic_data[["embark_town", "age", "fare"]]
titanic_data.head()
titanic_data.isnull().mean()
Output:
embark_town 0.002245
age 0.198653
fare 0.000000
dtype: float64
The output shows that the embark_town and age columns have missing values.
The ratio of missing values for the embark_town column is quite low.
Let’s plot the bar plot that shows each category in the embark_town column
against the number of passengers.
Script 38:
titanic_data.embark_town.value_counts().sort_values(ascending=False).plot.bar()
plt.xlabel('Embark Town')
plt.ylabel('Number of Passengers')
The output below clearly shows that most of the passengers embarked from
Southampton.
Output:
Let’s confirm that Southampton is indeed the mode value for the embark_town
column.
Script 39:
titanic_data.embark_town.mode()
Output:
0    Southampton
dtype: object
Next, we can simply replace the missing values in the embark_town column
with Southampton.
Script 40:
titanic_data['embark_town'] = titanic_data['embark_town'].fillna('Southampton')  # assignment avoids inplace/chained-assignment pitfalls
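Rather than hardcoding the category name, you can take it straight from mode(). A small sketch of the same idea, assuming the column still contains its missing values:
# fill with the most frequent category without hardcoding it
most_frequent = titanic_data['embark_town'].mode()[0]  # 'Southampton'
titanic_data['embark_town'] = titanic_data['embark_town'].fillna(most_frequent)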
Let’s now find the mode of the age column and use it to replace the missing
values in the age column.
Script 41:
titanic_data.age.mode()
Output:
0 24.0
dtype: float64
The output shows that the mode of the age column is 24. Therefore, we can
use this value to replace the missing values in the age column.
Script 42:
titanic_data['age_mode'] = titanic_data.age.fillna(24)
titanic_data.head(20)
Output:
Finally, let’s plot the kernel density estimates for the original age column
and for the age_mode column, in which the missing values have been
replaced by the mode.
Script 43:
plt.rcParams["figure.figsize"] = [8, 6]
fig = plt.figure()
ax = fig.add_subplot(111)
titanic_data['age'].plot(kind='kde', ax=ax)
titanic_data['age_mode'].plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
Output:
Missing Category Imputation
Missing category imputation inserts an arbitrary new category, e.g., Missing,
in place of the missing values. Take a look at an example of missing category
imputation.
Let’s load the Titanic dataset and see if any categorical column contains
missing values.
Script 44:
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = [8, 6]
sns.set_style("darkgrid")
titanic_data = sns.load_dataset('titanic')
titanic_data = titanic_data[["embark_town", "age", "fare"]]
titanic_data.head()
titanic_data.isnull().mean()
Output:
embark_town 0.002245
age 0.198653
fare 0.000000
dtype: float64
The output shows that the embark_town column is a categorical column that
also contains some missing values. We will apply missing category imputation
to this column.
Script 45:
titanic_data['embark_town'] = titanic_data['embark_town'].fillna('Missing')
After applying missing category imputation, plot the bar plot for the
embark_town column. You can see a very small, almost negligible bar for
the Missing category.
Script 46:
titanic_data.embark_town.value_counts().sort_values(ascending=False).plot.bar()
plt.xlabel('Embark Town')
plt.ylabel('Number of Passengers')
Output:
Further Readings – Basics of Pandas
1. Check the official documentation here (https://bit.ly/3mQnfOE ) to
learn more about the Pandas basics.
2. You can learn more about Matplotlib for data plotting at this link
(https://matplotlib.org/ ).
Hands-on Time – Exercises
Now, it is your turn. Follow the instructions in the exercises below to
check your understanding of Pandas basics that you learned in this chapter.
The answers to these questions are given at the end of the book.
Exercise 2.1
Question 1:
What is the major disadvantage of mean and median imputation?
A. Distorts the data distribution
B. Distorts the data variance
C. Distorts the data covariance
D. All of the Above
Question 2:
How do you display the last three rows of a Pandas dataframe named
“my_df”?
A. my_df.end(3)
B. my_df.bottom(3)
C. my_df.top(-3)
D. my_df.tail(3)
Question 3:
You can create a Pandas series using a:
A. NumPy Array
B. List
C. Dictionary
D. All of the Above
Exercise 2.2
Replace the missing values in the “deck” column of the Titanic dataset with
the most frequently occurring categories in that column. Plot a bar plot for the
updated “deck” column. The Titanic dataset can be downloaded using this
Seaborn command:
import seaborn as sns
sns.load_dataset('titanic')