VEL TECH HIGH TECH
Dr. RANGARAJAN Dr. SAKUNTHALA ENGINEERING COLLEGE
An Autonomous Institution
Approved by AICTE - New Delhi, Affiliated to Anna University, Chennai
Accredited by NBA, New Delhi & Accredited by NAAC with “A” Grade & CGPA of 3.27
Course Code: 21AI35IT    Semester: III
Category: PROFESSIONAL CORE COURSE (PCC)    L T P C: 2 0 4 4
Course Title: DATA SCIENCE FOR ENGINEERS
COURSE OBJECTIVES:
To describe the life cycle of Data Science and computational
environments for data scientists using Python.
To describe the fundamentals for exploring and managing data
with Python.
To examine the various data analytics techniques for labeled/columnar
data using Python.
To demonstrate a flexible range of data visualization techniques
in Python.
To describe the various Machine learning algorithms for data
modeling with Python.
COURSE OUTCOMES:
On successful completion of this course, students will be able to:
CO. No.    Course Outcomes                       Bloom's level
C305.3     Understand the concepts of Pandas.    K2
UNIT III: INTRODUCTION TO PANDAS
Installing and Using Pandas, Introducing Pandas Objects, Data Indexing and Selection, Operating
on Data in Pandas, Handling Missing Data.
INTRODUCTION TO PANDAS
Pandas is a Python package written for data analysis and manipulation. It offers data
structures and operations for manipulating numerical tables and time series. Pandas is an
open-source library built on top of NumPy and is known for its productivity and performance.
It is popular because it makes importing and analyzing data much easier. Pandas programs can
be written in any plain text editor, such as Notepad or Notepad++, and saved with a .py
extension.
To install pandas, write pandas code, and perform useful operations with it, Python must
already be installed on the system.
Check if Python is Already Present
To check whether Python is installed on your device, open the command line (press
Windows key + R, type cmd in the Run dialog, and press Enter). Now run the following command:
python --version
If Python is already installed, the command prints the installed Python version. Otherwise,
install Python and pip before continuing (see a guide on installing Python and pip on
Windows or Linux).
Import Pandas in Python
Once pandas is installed on the system, it can be imported for use.
In a Jupyter Notebook or a Python file, write the following code:
import pandas as pd
Here, pd is an alias for pandas. It is the conventional short name and keeps the code concise;
it does not affect performance.
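A quick way to confirm that the import works is to print the library version. This is a minimal check, assuming pandas is installed in the active Python environment; the version printed will differ from system to system.

```python
import pandas as pd

# pd is only a conventional alias; any valid name would work.
installed_version = pd.__version__
print("pandas version:", installed_version)
```

If the import fails with ModuleNotFoundError, pandas is not installed in the environment that is running the script.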
How to Install or Download Python Pandas
Pandas can be installed in multiple ways on Windows, Linux, and macOS. The options are
listed below:
Install Pandas on Windows
Python Pandas can be installed on Windows in two ways:
Using pip
Using Anaconda
Install Pandas using pip
pip is a package management system used to install and manage software packages and libraries
written in Python. It downloads these packages from a large online repository called the
Python Package Index (PyPI).
Step 1: Launch Command Prompt
Press the Windows key or click the Start button to open the Start menu, type "cmd" in the
search bar, and click the Command Prompt app. Alternatively, press Windows key + R, enter
"cmd", and press Enter.
Step 2: Run the Command
Install pandas with pip by running the following command in Command Prompt:
pip install pandas
Introduction to Pandas
Pandas is a Python library used for data analysis and manipulation.
It builds on NumPy and provides easy-to-use data structures for labeled data.
🧩 2. Core Pandas Data Structures
Pandas provides two primary objects:
🟦 A. Series – 1D Labeled Array
A Series is a one-dimensional array-like object that can hold any data type.
It includes an index which labels each element.
🔹 Syntax:
import pandas as pd
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s)
🔹 Output:
a 10
b 20
c 30
dtype: int64
🔹 Key Properties:
s.index – returns the index (['a', 'b', 'c'])
s.values – returns the data ([10, 20, 30])
🔹 Use Cases:
Time series data
Single-column data
Intermediate results in calculations
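The properties above can be checked on a small Series built from a dictionary, where the keys become the index labels. This is an illustrative sketch; the student names and marks are made up for the example.

```python
import pandas as pd

# Hypothetical marks for three students (illustrative data only).
marks = pd.Series({'asha': 82, 'bala': 74, 'chitra': 91})

labels = list(marks.index)       # index labels, in insertion order
data = marks.values.tolist()     # underlying values as a plain list

print(labels)
print(data)
print(marks['bala'])             # value looked up by its label
```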
🟨 B. DataFrame – 2D Labeled Table
A DataFrame is a 2-dimensional labeled data structure with rows and columns.
Think of it like a table or spreadsheet.
🔹 Syntax:
data = {
'Name': ['Alice', 'Bob'],
'Age': [25, 30]
}
df = pd.DataFrame(data)
print(df)
🔹 Output:
Name Age
0 Alice 25
1 Bob 30
🔹 Key Properties:
df.columns – returns column labels (['Name', 'Age'])
df.index – returns row index ([0, 1])
df.values – returns 2D array of values
🔹 Accessing Data:
df['Name'] # Access column
df.loc[0] # Access row by label
df.iloc[1] # Access row by position
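With the default index (0, 1, ...), labels and positions coincide, which hides the difference between loc and iloc. The sketch below assumes a custom index of 10 and 20 (chosen only for illustration) so the two can be told apart.

```python
import pandas as pd

# Non-default index so that row labels differ from row positions.
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]},
                  index=[10, 20])

by_label = df.loc[10]       # row whose index label is 10
by_position = df.iloc[1]    # second row, whatever its label is

print(by_label['Name'])
print(by_position['Name'])
```

Here df.loc[0] would raise a KeyError because no row carries the label 0, while df.iloc[0] still returns the first row.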
🧠 3. Differences Between Series and DataFrame
Feature           Series                DataFrame
Dimension         1D                    2D
Data structure    Array with index      Table with rows and columns
Use case          One column of data    Tabular data
📌 4. Creating Pandas Objects from Various Data Types
Source Type      Constructor Used       Example
List or array    pd.Series()            pd.Series([1, 2, 3])
Dictionary       pd.DataFrame()         pd.DataFrame({'a':[1], 'b':[2]})
NumPy array      pd.DataFrame()         pd.DataFrame(np.random.rand(2,3))
CSV/Excel/SQL    pd.read_csv(), etc.    pd.read_csv('file.csv')
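The constructors in the table can be tried together in one short script. This is a sketch under two assumptions: the column names x, y, z are invented for the example, and io.StringIO stands in for a CSV file on disk so that the script runs without any external file.

```python
import io

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])                          # from a list
df_dict = pd.DataFrame({'a': [1], 'b': [2]})      # from a dictionary
df_array = pd.DataFrame(np.arange(6).reshape(2, 3),
                        columns=['x', 'y', 'z'])  # from a NumPy array

# StringIO behaves like an open file, so read_csv can parse it directly.
csv_text = "Name,Age\nAlice,25\nBob,30\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_array.shape)
print(df_csv)
```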
✅ 5. Summary
Series: 1D data with labels
DataFrame: 2D data with row and column labels
Pandas simplifies loading, transforming, and analyzing structured data
1. Indexing in Series
A Series in Pandas can be indexed by:
Position (like lists)
Label (like dictionaries)
✅ Example:
import pandas as pd
s = pd.Series([100, 200, 300], index=['a', 'b', 'c'])
🔹 Accessing Elements:
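s['a'] # 100 (label-based)
s.iloc[1] # 200 (position-based; plain s[1] is deprecated on a label-indexed Series in pandas 2.x)
🔹 Slicing:
s['a':'c'] # Label slice: includes both start and end
s.iloc[0:2] # Position slice: end excluded, like list slicing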
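The difference between the two slicing styles is easiest to see by comparing the results side by side. The sketch below uses the explicit loc and iloc accessors so that the intent (label or position) is unambiguous.

```python
import pandas as pd

s = pd.Series([100, 200, 300], index=['a', 'b', 'c'])

label_slice = s.loc['a':'c']      # end label 'c' is included
position_slice = s.iloc[0:2]      # end position 2 is excluded

print(label_slice.tolist())       # [100, 200, 300]
print(position_slice.tolist())    # [100, 200]
```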
2. Indexing in DataFrame
A DataFrame supports:
Column selection
Row selection
Element access
Boolean indexing
✅ Example:
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]}
df = pd.DataFrame(data)
🟦 A. Column Selection
df['Name'] # Single column (as Series)
df[['Name', 'Age']] # Multiple columns (as DataFrame)
B. Row Selection
🔹 Using loc[] (Label-based):
df.loc[0] # First row
df.loc[0:1] # Rows 0 to 1 (inclusive)
df.loc[df['Age'] > 25] # Filtered rows
🔹 Using iloc[] (Integer-based):
df.iloc[1] # Second row
df.iloc[0:2] # First two rows
C. Accessing Individual Elements
df.loc[0, 'Name'] # Alice
df.iloc[1, 0] # Bob
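The row and element selections above can be combined into one runnable check. Note that df.loc[0:1] and df.iloc[0:2] return the same two rows here only because the default index makes labels equal to positions.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35]})

rows_loc = df.loc[0:1]               # labels 0 and 1 (end included)
rows_iloc = df.iloc[0:2]             # positions 0 and 1 (end excluded)
same_rows = rows_loc.equals(rows_iloc)

name = df.loc[0, 'Name']             # row label 0, column 'Name'
age = df.iloc[1, 1]                  # second row, second column
ages = df.loc[1:2, 'Age'].tolist()   # rows and a column in one call

print(same_rows, name, age, ages)
```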
3. Boolean Indexing / Conditional Selection
Used to filter rows based on condition(s):
df[df['Age'] > 25]
✅ Example:
# Output rows where Age > 25
print(df[df['Age'] > 25])
4. Using Conditions with Multiple Filters
Use & for AND, | for OR
Enclose each condition in parentheses
df[(df['Age'] > 25) & (df['Name'] != 'Bob')]
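A short sketch of combined conditions on the same three-row DataFrame follows. The ~ operator (NOT) is an addition that the text above does not mention; it is included because it follows the same rules as & and |.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35]})

# AND: both conditions must hold for a row to be kept.
both = df[(df['Age'] > 25) & (df['Name'] != 'Bob')]

# OR: a row is kept if at least one condition holds.
either = df[(df['Age'] < 30) | (df['Name'] == 'Charlie')]

# NOT: inverts a condition (not covered in the text above).
negated = df[~(df['Age'] > 25)]

print(both['Name'].tolist())
print(either['Name'].tolist())
print(negated['Name'].tolist())
```

The parentheses are required because & and | bind more tightly than comparison operators in Python; Python's and/or keywords do not work element-wise on a Series.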
📌 Summary Table
Method              Description                      Example
df['col']           Access a column                  df['Name']
df.loc[]            Label-based row/column access    df.loc[0, 'Age']
df.iloc[]           Integer-location based access    df.iloc[1, 0]
Boolean Indexing    Conditional row selection        df[df['Age'] > 30]
Slicing             Range of rows/columns            df[0:2], df.loc[1:2]
Operating on Data in Pandas
This topic covers how to perform arithmetic, statistical, and functional operations on
data stored in Pandas Series and DataFrame objects.
1. Element-wise Operations
Pandas supports element-wise arithmetic operations between:
Series and Series
DataFrame and DataFrame
DataFrame and scalar value
Example:
import pandas as pd
df = pd.DataFrame({
'A': [10, 20, 30],
'B': [5, 15, 25]
})
# Add 5 to each element
print(df + 5)
# Multiply column by 2
print(df['A'] * 2)
# Subtract columns
print(df['A'] - df['B'])
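Arithmetic between two Series matches elements by index label, not by position. The sketch below uses two Series with partly overlapping labels (chosen for illustration) to show what happens to labels that appear in only one operand.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
y = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Labels present in only one Series produce NaN in the result.
total = x + y

# add() with fill_value treats a missing operand as 0 instead.
filled = x.add(y, fill_value=0)

print(total)
print(filled)
```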
2. Statistical and Aggregation Functions
Pandas provides built-in functions to summarize and describe data.
Function Description
sum() Sum of values
mean() Average/mean
median() Median value
std() Standard deviation
min() Minimum value
max() Maximum value
count() Count non-null values
describe() Summary statistics
Example:
df.sum() # Column-wise sum
df.mean() # Column-wise mean
df.describe() # Summary of statistics
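By default these functions aggregate down each column; passing axis=1 aggregates across each row instead. A minimal sketch using the same two-column DataFrame as the earlier example:

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30], 'B': [5, 15, 25]})

col_sums = df.sum()           # one value per column
row_sums = df.sum(axis=1)     # one value per row
mean_a = df['A'].mean()       # a single column gives a scalar
summary = df.describe()       # count, mean, std, min, quartiles, max

print(col_sums.tolist())      # [60, 45]
print(row_sums.tolist())      # [15, 35, 55]
print(mean_a)                 # 20.0
print(summary)
```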
3. Function Application
You can apply custom or built-in functions to:
A column (Series)
The entire DataFrame
Using apply():
# Square each value in column A
df['A'].apply(lambda x: x ** 2)
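Using applymap() (for element-wise DataFrame ops; renamed DataFrame.map() in pandas 2.1, where applymap() is deprecated):
df.applymap(lambda x: x * 2) # On pandas 2.1 or later: df.map(lambda x: x * 2)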
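On a DataFrame, apply() passes a whole column (or a whole row with axis=1) to the function, whereas the element-wise method passes one value at a time. The sketch below picks DataFrame.map when it exists and falls back to applymap otherwise; that fallback is an assumption made so the example runs on both older and newer pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30], 'B': [5, 15, 25]})

# Column-wise: the lambda receives each column as a Series.
col_range = df.apply(lambda col: col.max() - col.min())

# Row-wise: the lambda receives each row as a Series.
row_total = df.apply(lambda row: row['A'] + row['B'], axis=1)

# Element-wise: use map() where available, otherwise applymap().
elementwise = df.map if hasattr(df, 'map') else df.applymap
doubled = elementwise(lambda x: x * 2)

print(col_range.tolist())     # [20, 20]
print(row_total.tolist())     # [15, 35, 55]
print(doubled)
```

For simple arithmetic such as doubling, the vectorized form df * 2 gives the same result and is usually faster than any apply-style call.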
4. Broadcasting
Broadcasting allows arithmetic between objects of different shapes:
Series aligns on index
Scalar applies to every element
Example:
s = pd.Series([1, 2, 3])
df = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60]})
# Subtract s from every column, matching the index of s to the row index of df
df.sub(s, axis=0)
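The direction of the broadcast depends on which labels the Series is matched against. The sketch below contrasts the two cases; the names per_row and per_col are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60]})

per_row = pd.Series([1, 2, 3])              # index 0, 1, 2 matches the rows
per_col = pd.Series({'A': 10, 'B': 40})     # index A, B matches the columns

# axis=0: match the Series index against the row index.
by_rows = df.sub(per_row, axis=0)

# Plain operators match the Series index against the column labels.
by_cols = df - per_col

print(by_rows)
print(by_cols)
```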
🎯 5. Sorting and Ranking
🔹 Sorting:
df.sort_values(by='A') # Sort rows by column A
df.sort_index() # Sort by index
🔹 Ranking:
df['A'].rank() # Assign rank to values
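Sorting and ranking behave differently when values repeat. The sketch below uses a small column with one tie (illustrative data) to show descending sort and two tie-handling methods of rank().

```python
import pandas as pd

df = pd.DataFrame({'A': [30, 10, 20, 10]})

# Descending sort; the original index labels travel with the rows.
sorted_desc = df.sort_values(by='A', ascending=False)

# Default method='average': tied values share the mean of their ranks.
ranks = df['A'].rank()

# method='dense': tied values share a rank and no ranks are skipped.
dense = df['A'].rank(method='dense')

print(sorted_desc['A'].tolist())   # [30, 20, 10, 10]
print(ranks.tolist())              # [4.0, 1.5, 3.0, 1.5]
print(dense.tolist())              # [3.0, 1.0, 2.0, 1.0]
```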
🧪 6. Data Type Operations
You can check and convert data types using:
df.dtypes # Check column data types
df.astype(int) # Convert data type
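astype() returns a new object and leaves the original unchanged, and it can convert a single column when given a dictionary. A minimal sketch, assuming a column of numeric strings such as might come from a text file:

```python
import pandas as pd

df = pd.DataFrame({'A': ['1', '2', '3'], 'B': [1.5, 2.5, 3.5]})
print(df.dtypes)                         # A holds strings, B holds floats

# Convert only column A; the result is a new DataFrame.
converted = df.astype({'A': 'int64'})

# Converting floats to integers drops the fractional part.
truncated = df['B'].astype('int64')

print(converted.dtypes)
print(truncated.tolist())                # [1, 2, 3]
```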
📌 Summary Table
Operation Type       Method / Function        Example
Arithmetic           +, -, *, /               df['A'] + 5
Aggregation          sum(), mean(), etc.      df.mean()
Apply function       apply(), applymap()      df.apply(lambda x: x*2)
Sorting & Ranking    sort_values(), rank()    df.sort_values('A')
Type Conversion      astype()                 df.astype('float')
Handling Missing Data in Pandas
Objective:
Learn how to identify, remove, and fill missing values (NaNs) in Series and DataFrame
objects using Pandas.
1. What is Missing Data?
Missing data is represented as:
o NaN (Not a Number)
o None (Python's null value)
Common causes:
o Incomplete data entries
o Failed data imports
o Data corruption
🔍 2. Detecting Missing Data
✅ Use isnull() and notnull()
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Name': ['Alice', 'Bob', None],
'Age': [25, np.nan, 30]
})
print(df.isnull()) # Returns True for missing values
print(df.notnull()) # Returns True for non-missing values
🧹 3. Dropping Missing Data
🔹 dropna() — Remove rows or columns with NaN values
✅ Drop rows with any missing value:
df.dropna()
✅ Drop columns with any missing value:
df.dropna(axis=1)
✅ Drop rows where all values are NaN:
df.dropna(how='all')
✅ Keep only rows with at least a threshold number of non-NaN values (here 2); drop the rest:
df.dropna(thresh=2)
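The dropna() options are easier to compare on one DataFrame where rows have different numbers of missing values. The data below, including the City column, is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, None],
    'Age': [25, np.nan, 30, np.nan],
    'City': ['Chennai', 'Delhi', None, None],
})

complete_rows = df.dropna()              # keeps rows with no missing value
not_empty_rows = df.dropna(how='all')    # drops rows that are entirely missing
at_least_two = df.dropna(thresh=2)       # keeps rows with >= 2 non-missing values

print(list(complete_rows.index))         # [0]
print(list(not_empty_rows.index))        # [0, 1, 2]
print(list(at_least_two.index))          # [0, 1]
```

Each call returns a new DataFrame; df itself still has all four rows.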
🧴 4. Filling Missing Data
🔹 fillna() — Replace NaNs with a value or method
✅ Replace with a constant:
df.fillna(0)
df.fillna('Unknown')
✅ Forward Fill (propagate last valid value forward):
df.ffill() # Replaces df.fillna(method='ffill'), which is deprecated in pandas 2.1 and later
✅ Backward Fill (propagate next valid value backward):
df.bfill() # Replaces df.fillna(method='bfill')
✅ Fill using column mean/median:
df['Age'] = df['Age'].fillna(df['Age'].mean()) # Assign back; inplace=True on a selected column may not update df in recent pandas versions
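The filling strategies can be compared on a single column. This sketch uses ffill() and bfill(), the dedicated methods for forward and backward fill, and assigns results to new names so the original column stays available for comparison; Age_filled is a column name invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [25, np.nan, 35, np.nan]})

constant = df['Age'].fillna(0)     # replace every NaN with 0
forward = df['Age'].ffill()        # copy the previous valid value down
backward = df['Age'].bfill()       # copy the next valid value up

# Mean of the non-missing values is (25 + 35) / 2 = 30.
df['Age_filled'] = df['Age'].fillna(df['Age'].mean())

print(constant.tolist())           # [25.0, 0.0, 35.0, 0.0]
print(forward.tolist())            # [25.0, 25.0, 35.0, 35.0]
print(backward.tolist())           # last value stays NaN: nothing follows it
print(df['Age_filled'].tolist())   # [25.0, 30.0, 35.0, 30.0]
```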
🧪 5. Interpolating Missing Data
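Estimates missing numeric values from the neighbouring values (linear interpolation by default).
Apply it to numeric columns, since text columns cannot be interpolated:
df['Age'].interpolate()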
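A small worked case shows what linear interpolation produces. The Series below has two consecutive gaps between 10 and 40, so the default method spaces the filled values evenly between them.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 40.0])

# Default method is 'linear': values are spaced evenly by position.
linear = s.interpolate()

print(linear.tolist())    # [10.0, 20.0, 30.0, 40.0]
```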
🧠 6. Checking for Any Missing Data
df.isnull().sum() # Count of missing values per column
df.isnull().any() # True for each column that contains a missing value
df.isnull().any().any() # True if the DataFrame contains any missing value at all
📝 Example: Handling Missing Data
data = {'Name': ['Alice', None, 'Charlie'], 'Age': [25, None, 30]}
df = pd.DataFrame(data)
# Fill missing Name with 'Unknown', and missing Age with the column mean
df['Name'] = df['Name'].fillna('Unknown')
df['Age'] = df['Age'].fillna(df['Age'].mean())
📌 Summary Table
Task                           Method                 Example
Detect missing data            isnull(), notnull()    df.isnull()
Drop rows with missing data    dropna()               df.dropna()
Fill missing data              fillna(value)          df.fillna(0)
Forward/backward fill          ffill(), bfill()       df.ffill()
Interpolation                  interpolate()          df['Age'].interpolate()
Count missing values           isnull().sum()         df.isnull().sum()