Pandas Essential Functionality
Part 1: Core Operations for Data Manipulation
Data Science Course
July 23, 2025
Data Science Course Pandas Essential Functionality July 23, 2025 1 / 32
Course Overview
1 Introduction to Pandas Essential Operations
2 Reindexing
3 Dropping Entries from an Axis
4 Indexing, Selection, and Filtering
5 Selection with loc and iloc
6 Integer Indexing Pitfalls
7 Chained Indexing Pitfalls
8 Arithmetic and Data Alignment
9 Summary and Best Practices
What We’ll Cover Today
Reindexing - Reorganizing data structure
Dropping Entries - Removing unwanted data
Indexing & Selection - Accessing data efficiently
loc and iloc - Explicit selection methods
Indexing Pitfalls - Common mistakes to avoid
Data Alignment - Automatic alignment in operations
Learning Objective
Master the fundamental pandas operations that form the backbone of data analysis workflows.
What is Reindexing?
Definition
Reindexing creates a new pandas object with data rearranged to match a new index.
Why Reindex?
Data Alignment - Consistent indexing
Missing Data Handling - Introduce/remove NaN
Data Reshaping - Expand/contract structure
[Figure: Original vs. Reindexed]
Basic Series Reindexing
import pandas as pd
import numpy as np
# Create Series with non-alphabetical order
obj = pd.Series([4.5, 7.2, -5.3, 3.6],
                index=["d", "b", "a", "c"])
print("Original:", obj.index.tolist())
# Output: ['d', 'b', 'a', 'c']
# Reindex to new order
obj2 = obj.reindex(["a", "b", "c", "d", "e"])
print("Reindexed:", obj2.index.tolist())
# Output: ['a', 'b', 'c', 'd', 'e']
print(obj2)
# a   -5.3
# b    7.2
# c    3.6
# d    4.5
# e    NaN  <- Missing value introduced
Key Point
Reindexing is non-destructive - creates new objects. Missing labels result in NaN values.
Fill Methods in Reindexing
# Time series with gaps
obj3 = pd.Series(["blue", "purple", "yellow"],
                 index=[0, 2, 4])
# Forward fill - carry last value forward
obj_ffill = obj3.reindex(np.arange(6),
                         method="ffill")
print(obj_ffill)
# 0    blue
# 1    blue    <- filled from 0
# 2    purple
# 3    purple  <- filled from 2
# 4    yellow
# 5    yellow  <- filled from 4
# Backward fill - use next value
obj_bfill = obj3.reindex(np.arange(6),
                         method="bfill")
print(obj_bfill)
# 0    blue
# 1    purple  <- filled from 2
# 2    purple
# 3    yellow  <- filled from 4
# 4    yellow
# 5    NaN     <- no next value
Fill Methods:
ffill - Forward fill
bfill - Backward fill
Only for ordered data
Great for time series
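The "only for ordered data" restriction can be checked directly. The sketch below uses a hypothetical Series named `unordered` (not from the slides) whose index is out of order: filling during a reindex needs a monotonic index, so the usual remedy is to sort first.

```python
import numpy as np
import pandas as pd

# Hypothetical data: same labels as obj3, but stored out of order
unordered = pd.Series(["purple", "blue", "yellow"], index=[2, 0, 4])

# Filling while reindexing requires a monotonic (sorted) index
try:
    unordered.reindex(np.arange(6), method="ffill")
    raised = False
except ValueError:
    raised = True

# Sorting the index first makes the forward fill well defined
filled = unordered.sort_index().reindex(np.arange(6), method="ffill")
print(raised)
print(filled.tolist())
```

Sorting with `sort_index()` before reindexing is one reasonable choice here; whether it is appropriate depends on whether the label order carries meaning in your data.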
DataFrame Reindexing
# Create DataFrame
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])
# Reindex rows - adding missing 'b'
frame2 = frame.reindex(index=["a", "b", "c", "d"])
#    Ohio  Texas  California
# a   0.0    1.0         2.0
# b   NaN    NaN         NaN  <- New row with NaN
# c   3.0    4.0         5.0
# d   6.0    7.0         8.0
# (integers become floats because NaN is a float value)
# Reindex columns
states = ["Texas", "Utah", "California"]
frame3 = frame.reindex(columns=states)
#    Texas  Utah  California
# a      1   NaN           2
# c      4   NaN           5
# d      7   NaN           8
Reindex Parameters Reference
Parameter    Description                 Example
labels       New sequence for index      ser.reindex(['a', 'b'])
index        New row labels              df.reindex(index=['x', 'y'])
columns      New column labels           df.reindex(columns=['A', 'B'])
method       Fill method                 method='ffill'
fill_value   Value for missing data      fill_value=0
tolerance    Max distance for matches    tolerance=0.5
Best Practice
Use fill_value=0 to avoid NaN when appropriate for your data context.
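As a rough illustration of the `fill_value` and `tolerance` parameters in the table above, the sketch below uses a hypothetical Series called `readings`. The data and the tolerance of 0.2 are assumptions chosen only to make the behavior visible.

```python
import pandas as pd

# Hypothetical readings indexed by float positions
readings = pd.Series([1.0, 2.0, 3.0], index=[0.0, 1.0, 2.0])

# fill_value replaces the NaN that a new label would otherwise receive
padded = readings.reindex([0.0, 1.0, 2.0, 3.0], fill_value=0)

# method="nearest" with a tolerance only matches labels within 0.2;
# 1.5 is 0.5 away from its nearest neighbours, so it stays missing
near = readings.reindex([0.1, 1.5, 2.0], method="nearest", tolerance=0.2)

print(padded.tolist())
print(near)
```

Note that `fill_value` applies to labels that are absent from the original index, while `tolerance` limits how far an inexact match may reach when a fill `method` is given.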
Why Drop Data?
Essential for Data Cleaning
The drop method is crucial for:
Removing outliers or corrupted data
Eliminating unnecessary columns/rows
Subsetting data for specific analysis
Data preprocessing workflows
[Figure: Drop -> Clean Data]
Key Principle
drop returns a new object by default (non-destructive). Use inplace=True to modify the original.
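One consequence of this principle is easy to miss: with `inplace=True` the call returns `None`. The sketch below, using illustrative variable names `trimmed`, `work`, and `result`, shows both modes side by side.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.0), index=["a", "b", "c", "d", "e"])

# Default: a new object comes back and the original is untouched
trimmed = obj.drop("c")

# inplace=True mutates the object and returns None,
# so its result should not be assigned back to a name
work = obj.copy()
result = work.drop("a", inplace=True)

print(len(obj), len(trimmed), len(work), result)
```

Writing `work = work.drop("a", inplace=True)` would therefore replace `work` with `None`, which is a common source of confusing errors later in a script.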
Dropping from Series
# Create Series
obj = pd.Series(np.arange(5.), index=["a", "b", "c", "d", "e"])
# Drop single element - creates NEW object
new_obj = obj.drop("c")
print("Original has 'c':", "c" in obj.index)  # True
print("New doesn't have 'c':", "c" in new_obj.index)  # False
# Drop multiple elements
obj_multi = obj.drop(["d", "c"])
print(obj_multi.index.tolist())  # ['a', 'b', 'e']
# Safe dropping (ignore missing labels)
safe_drop = obj.drop(["c", "z"], errors="ignore")
print("Safe drop succeeded")  # No error for 'z'
# In-place modification (done on a copy so obj stays intact)
obj_copy = obj.copy()
obj_copy.drop("a", inplace=True)
print("Modified copy:", obj_copy.index.tolist())
Dropping from DataFrame
# Create DataFrame
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
# Drop rows
data_drop_rows = data.drop(index=["Colorado", "Ohio"])
print("Remaining states:", data_drop_rows.index.tolist())
# Drop columns
data_drop_cols = data.drop(columns=["two"])
print("Remaining columns:", data_drop_cols.columns.tolist())
# Drop multiple columns
data_multi = data.drop(["two", "four"], axis=1)
Pandas Selection Philosophy
Multiple Selection Methods
Pandas provides different ways to select data, each with specific use cases:
1 Basic indexing - Like Python lists/dicts (obj[key])
2 Boolean indexing - Filter based on conditions (obj[obj > 5])
3 Label-based - Using actual labels (obj.loc['label'])
4 Position-based - Using integer positions (obj.iloc[0])
Golden Rule
Choose the method that makes your intention clearest and prevents ambiguity.
Series Selection Examples
# Student grades
grades = pd.Series([85, 92, 78, 96],
                   index=["Alice", "Bob", "Charlie", "Diana"])
# Label-based access
print("Bob's grade:", grades['Bob'])
# Position-based access (plain grades[1] relies on a positional
# fallback that recent pandas versions deprecate; iloc is explicit)
print("Second student:", grades.iloc[1])
# Label slicing (inclusive)
print(grades['Alice':'Charlie'])
# Position slicing (exclusive)
print(grades.iloc[1:3])
# Boolean indexing
high_grades = grades[grades >= 90]
print("A grades:", high_grades.index.tolist())
# Complex conditions
mid_grades = grades[(grades >= 80) & (grades < 90)]
print("B grades:", mid_grades.index.tolist())
DataFrame Selection Patterns
sales_data = pd.DataFrame({
    'Q1': [100, 150, 200, 120],
    'Q2': [110, 160, 190, 130],
    'Q3': [120, 170, 210, 140],
    'Q4': [130, 180, 220, 150]
}, index=['North', 'South', 'East', 'West'])
# Column selection
q1_sales = sales_data['Q1']
q1_q2 = sales_data[['Q1', 'Q2']]
# Row slicing
first_two = sales_data[:2]
regions = sales_data['North':'East']
# Boolean filtering
high_q1 = sales_data[sales_data['Q1'] > 120]
print("High Q1 regions:", high_q1.index.tolist())
# Complex boolean
complex_filter = sales_data[(sales_data['Q1'] > 120) &
                            (sales_data['Q4'] > 200)]
Selection Summary Table
Operation          Syntax                          Returns     Use When
Basic Column       df['col']                       Series      Simple column access
Multiple Columns   df[['c1', 'c2']]                DataFrame   Multiple columns
Row Slice          df[1:3]                         DataFrame   Simple row slicing
Boolean Filter     df[df.col > 5]                  DataFrame   Conditional filtering
Label Row          df.loc['label']                 Series      Single row by label
Label Subset       df.loc['r1':'r2', 'c1':'c2']    DataFrame   Rectangular selection
Position Row       df.iloc[0]                      Series      Single row by position
Position Subset    df.iloc[0:2, 1:3]               DataFrame   Rectangular by position
Best Practice
Use single brackets [] for basic access, and double brackets [[]] when you want a DataFrame
result.
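The difference between single and double brackets shows up in the type and shape of the result. A minimal sketch, assuming a small hypothetical DataFrame `df` with columns `c1` and `c2`:

```python
import pandas as pd

df = pd.DataFrame({"c1": [1, 2, 3], "c2": [4, 5, 6]})

as_series = df["c1"]    # single brackets -> Series
as_frame = df[["c1"]]   # double brackets -> one-column DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The double-bracket form is useful when downstream code expects two-dimensional input, even if only one column is selected.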
Why loc and iloc?
Solving Ambiguity
Basic indexing (obj[]) can be ambiguous, especially with integer indices.
loc: Label-based selection
Uses actual labels
"location by label"
Slicing is INCLUSIVE
df.loc['A':'C'] includes C
iloc: Position-based selection
Uses integer positions
"integer location"
Slicing is EXCLUSIVE
df.iloc[0:3] excludes position 3
Golden Rule
Always use loc and iloc for explicit, unambiguous selection.
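The inclusive versus exclusive slicing rule is easiest to see when the same numbers are used as both labels and positions. The sketch below assumes a hypothetical Series `s` whose integer labels happen to equal its positions.

```python
import pandas as pd

# Hypothetical Series: labels 0..4 coincide with positions 0..4
s = pd.Series([10, 20, 30, 40, 50], index=[0, 1, 2, 3, 4])

by_label = s.loc[0:3]      # label slice: endpoint 3 is included
by_position = s.iloc[0:3]  # position slice: endpoint 3 is excluded

print(by_label.tolist())
print(by_position.tolist())
```

The two slices look identical in the source code but differ by one element, which is why mixing them up tends to produce off-by-one errors rather than exceptions.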
Label-based Selection with loc
company_data = pd.DataFrame({
    'Revenue': [1000, 1500, 1200, 1800],
    'Employees': [50, 75, 60, 90],
    'Profit_Margin': [0.15, 0.20, 0.18, 0.22]
}, index=['TechCorp', 'DataInc', 'CloudLtd', 'AIStart'])
# Single row (returns Series)
tech_data = company_data.loc['TechCorp']
# Multiple rows (returns DataFrame)
selected = company_data.loc[['TechCorp', 'AIStart']]
# Rows AND columns
subset = company_data.loc['TechCorp', ['Revenue', 'Employees']]
# Slicing - BOTH endpoints INCLUSIVE!
slice_result = company_data.loc['DataInc':'AIStart',
                                'Revenue':'Profit_Margin']
# Boolean indexing with loc
high_revenue = company_data.loc[company_data['Revenue'] > 1300]
# Boolean + column selection
margins = company_data.loc[company_data['Revenue'] > 1300,
                           'Profit_Margin']
Position-based Selection with iloc
# Using same company_data DataFrame
# Single row by position
second_company = company_data.iloc[1]  # DataInc
# Multiple rows by position
first_third = company_data.iloc[[0, 2]]  # TechCorp, CloudLtd
# Rows and columns by position
subset_pos = company_data.iloc[0, [1, 2]]  # First row, cols 1 & 2
# Slicing by position - endpoint EXCLUSIVE
slice_pos = company_data.iloc[:3, :2]  # First 3 rows, first 2 cols
# Negative indexing works
last_company = company_data.iloc[-1]  # AIStart
# Fancy indexing
fancy = company_data.iloc[[1, 3], [0, 2]]  # Rows 1, 3 and cols 0, 2
# Setting values
company_data.loc['TechCorp', 'Revenue'] = 1100  # By label
company_data.iloc[0, 0] = 1050  # By position
The Integer Index Ambiguity Problem
The Problem
When pandas objects have integer labels, there’s ambiguity:
Does obj[1] mean label 1 or position 1?
Safe (String Index):
obj['b'] - clearly a label
obj[1] - clearly a position (a fallback that recent pandas versions deprecate; prefer obj.iloc[1])
No ambiguity
Dangerous (Integer Index):
obj[1] - label or position?
Can cause silent errors
Ambiguous behavior
Solution
Always use loc for labels, iloc for positions with integer indices.
Demonstrating the Problem
# Case 1: String index - NO ambiguity
series_str = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(series_str['b'])  # 20
# series_str[1] also gave 20 historically, but that positional
# fallback is deprecated in recent pandas; iloc is explicit
print(series_str.iloc[1])  # 20
# Case 2: Integer index - POTENTIAL PROBLEMS
series_int = pd.Series([10, 20, 30], index=[5, 10, 15])
print(series_int[10])  # 20 (10 is treated as a label)
# Case 3: Default integer index - THE PROBLEM
series_problem = pd.Series([10, 20, 30])  # Index: 0, 1, 2
print(series_problem[1])  # 20 - works
# series_problem[-1] raises KeyError (pandas looks for label -1)
# SOLUTIONS:
print(series_int.loc[10])  # Explicit label access
print(series_int.iloc[1])  # Explicit position access
print(series_int.iloc[-1])  # Last position
Remember
Negative indexing with plain [] fails on integer labels because pandas interprets -1 as a
label, not a position!
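The label versus position distinction matters most when integer labels are not in positional order. The sketch below uses two hypothetical Series, `shuffled` and `default`, to show where `loc` and `iloc` disagree and where plain `[]` raises.

```python
import pandas as pd

# Hypothetical Series whose integer labels are out of positional order
shuffled = pd.Series([10, 20, 30], index=[2, 0, 1])
by_label = shuffled.loc[0]      # the row labelled 0
by_position = shuffled.iloc[0]  # the first row

# Default integer index: -1 is looked up as a label and is not found
default = pd.Series([10, 20, 30])
try:
    default[-1]
    negative_ok = True
except KeyError:
    negative_ok = False

last = default.iloc[-1]  # explicit positional access to the last element
print(by_label, by_position, negative_ok, last)
```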
The Chained Indexing Problem
What is Chained Indexing?
Using multiple [] operations in sequence: df[col1][col2] or df[condition][col]
Why It’s Problematic
1 Intermediate operations might return copies instead of views
2 Assignments might not reach the original DataFrame
3 Poor performance due to unnecessary copying
4 May trigger SettingWithCopyWarning
BAD: df[df.A > 0]['B'] = 999
GOOD: df.loc[df.A > 0, 'B'] = 999
Chained Indexing Example - The Problem
df = pd.DataFrame({
    'A': [1.0, -2.0, 3.0, -4.0, 5.0],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 200, 300, 400, 500]
})
print(df)
# PROBLEMATIC
df[df.A > 0]['B'] = 999
print(df)
print("Notice: B values didn't change! Assignment failed silently.")
# CORRECT
df.loc[df.A > 0, 'B'] = 999
print(df)
print("Success! B values changed where A > 0")
Chained Indexing Best Practices
Correct Approaches
1 Single loc operations: df.loc[condition, columns]
2 Boolean masks as variables: mask = df.A > 0; df.loc[mask, 'B'] = value
3 Explicit copies: subset = df[df.A > 0].copy()
4 Query method: df.loc[df.query('A > 0').index, 'B'] = value
Avoid These Patterns
df[condition][column] = value
df[column1][column2] = value
Any assignment with multiple [] operations
Golden Rule
For assignments, always use single indexing operations with loc or iloc.
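The recommended approaches can be combined in one short sketch. The DataFrame `df` and the replacement values below are illustrative assumptions, not taken from the slides.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [10, 20, 30]})

# Name the mask, then assign through a single loc call
mask = df["A"] > 0
df.loc[mask, "B"] = 999

# Take an explicit copy when an independent subset is wanted
subset = df[df["A"] > 0].copy()
subset["B"] = 0  # changes subset only, df is unaffected

# query() selects rows; its index can feed loc for assignment
df.loc[df.query("A < 0").index, "B"] = -1

print(df)
print(subset)
```

Each assignment goes through exactly one indexing operation on the object being modified, which is what keeps the result predictable.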
The Magic of Automatic Data Alignment
Pandas’ Superpower
When performing arithmetic operations, pandas automatically aligns data based on index and
column labels, not position.
Benefits:
Prevents common errors
Makes analysis intuitive
Handles missing alignments
Order doesn't matter
Example: (A: 100, B: 150) + (B: 160, C: 200) = (A: NaN, B: 310, C: NaN)
Key Insight
Alignment is by LABEL, not position. Missing alignments become NaN.
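A small sketch of label alignment, using two hypothetical Series named `left` and `right` with one shared label. The second addition stores the same labels in a different order to show that position plays no role.

```python
import pandas as pd

left = pd.Series({"A": 100, "B": 150})
right = pd.Series({"B": 160, "C": 200})

# Only B exists on both sides; A and C have no partner and become NaN
total = left + right

# Same labels in a different storage order: B still matches B
reordered = pd.Series({"C": 200, "B": 160}) + left

print(total)
print(reordered)
```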
Series Arithmetic with Alignment
sales_q1 = pd.Series([100, 150, 200, 120],
                     index=['ProductA', 'ProductB', 'ProductC', 'ProductD'])
sales_q2 = pd.Series([110, 180, 90, 160, 140],
                     index=['ProductA', 'ProductC', 'ProductE', 'ProductB', 'ProductF'])
print("Q1 Sales:")
print(sales_q1)
print("Q2 Sales:")
print(sales_q2)
# Subtraction automatically aligns by index labels
growth = sales_q2 - sales_q1
print("Growth (Q2 - Q1):")
print(growth)
# ProductA    10.0
# ProductB    10.0
# ProductC   -20.0
# ProductD     NaN
# ProductE     NaN
# ProductF     NaN
DataFrame Arithmetic with Alignment
company1 = pd.DataFrame({
    'Revenue': [1000, 1500, 1200],
    'Costs': [800, 1200, 1000],
    'Employees': [50, 75, 60]
}, index=['Q1', 'Q2', 'Q3'])
company2 = pd.DataFrame({
    'Revenue': [900, 1400, 1100, 1300],
    'Costs': [700, 1100, 900, 1050],
    'Marketing': [100, 150, 120, 140]
}, index=['Q1', 'Q2', 'Q3', 'Q4'])
# Arithmetic aligns on BOTH row and column labels
combined = company1 + company2
print("Combined (Company1 + Company2):")
print(combined)
Arithmetic Methods with Fill Values
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
# Standard addition (produces many NaN)
standard_add = df1 + df2
# Addition with fill_value
filled_add = df1.add(df2, fill_value=0)
print("Addition with fill_value=0:")
print(filled_add)
# Real-world example: Budget vs Actual analysis
budget = pd.DataFrame({
    'Marketing': [1000, 1200], 'Sales': [5000, 5500]
}, index=['Q1', 'Q2'])
actual = pd.DataFrame({
    'Marketing': [1100, 1150, 1300],
    'Sales': [4800, 5200, 5800],
    'R&D': [500, 600, 700]
}, index=['Q1', 'Q2', 'Q3'])
# Variance analysis
variances = actual.sub(budget, fill_value=0)
Arithmetic Methods Reference
Method       Operation             Fill Value   Example
add()        Addition (+)          Optional     df1.add(df2, fill_value=0)
sub()        Subtraction (-)       Optional     df1.sub(df2, fill_value=0)
mul()        Multiplication (*)    Optional     df1.mul(df2, fill_value=1)
div()        Division (/)          Optional     df1.div(df2, fill_value=1)
floordiv()   Floor Division (//)   Optional     df1.floordiv(df2)
mod()        Modulo (%)            Optional     df1.mod(df2)
pow()        Power (**)            Optional     df1.pow(df2)
Pro Tip
Use the fill_value parameter to handle missing alignments intelligently:
fill_value=0 for addition/subtraction
fill_value=1 for multiplication/division
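One detail worth checking by hand: `fill_value` substitutes for a value that is missing on one side only. Where both sides are missing, the result stays NaN. The sketch below uses two hypothetical frames `a` and `b` built to hit each case.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: column x overlaps, y exists only in a, z only in b
a = pd.DataFrame({"x": [1.0, np.nan], "y": [2.0, 3.0]}, index=["r1", "r2"])
b = pd.DataFrame({"x": [10.0, np.nan], "z": [5.0, 6.0]}, index=["r1", "r2"])

plain = a + b                    # y and z become all NaN
filled = a.add(b, fill_value=0)  # one-sided gaps are treated as 0

print(plain)
print(filled)
```

In `filled`, the cell at row `r2`, column `x` remains NaN because neither frame has a value there, so there is nothing for the fill value to pair with.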
Key Takeaways - Part 1
Essential Operations Mastered
1 Reindexing: Reorganize data structure with reindex()
2 Dropping: Remove data with drop() - non-destructive by default
3 Selection: Multiple methods for different use cases
4 loc/iloc: Explicit label vs. position-based selection
5 Pitfalls: Avoid chained indexing and integer index ambiguity
6 Alignment: Automatic alignment in arithmetic operations
Next Session Preview
Part 2 will cover: Operations between DataFrame and Series, Function Application, Sorting &
Ranking, Duplicate Labels, Descriptive Statistics, Missing Data, and Hierarchical Indexing.
Best Practices Summary
DO:
Use loc for labels, iloc for positions
Single indexing operations for assignments
fill_value in arithmetic when appropriate
Parentheses in complex boolean conditions
errors='ignore' for safe dropping
DON'T:
Chain indexing for assignments
Mix label and position indexing
Ignore SettingWithCopyWarning
Forget that reindexing creates new objects
Use basic indexing with integer indices
Remember
Pandas is designed to be explicit and unambiguous. When in doubt, use the most explicit
method available.
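The advice about parentheses in boolean conditions follows from Python operator precedence: `&` binds more tightly than `>=` and `<`. A minimal sketch, reusing the grades data from the selection examples:

```python
import pandas as pd

grades = pd.Series([85, 92, 78, 96],
                   index=["Alice", "Bob", "Charlie", "Diana"])

# With parentheses: each comparison is evaluated first, then combined
b_range = grades[(grades >= 80) & (grades < 90)]

# Without parentheses Python reads this as
# grades >= (80 & grades) < 90, a chained comparison that needs
# the truth value of a whole Series and therefore raises ValueError
try:
    grades[grades >= 80 & grades < 90]
    failed = False
except ValueError:
    failed = True

print(b_range.index.tolist(), failed)
```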
Practice Exercises
Try These at Home
1 Create a DataFrame with student grades and practice all selection methods
2 Experiment with reindexing using different fill methods
3 Practice arithmetic operations between DataFrames with misaligned indices
4 Try to reproduce the chained indexing problem and fix it with loc
5 Create integer-indexed Series and practice safe selection with loc/iloc
Questions?
Remember: The best way to learn pandas is through hands-on practice. Try breaking things
and fixing them!
Thank You!
Questions & Discussion
Next Session: Pandas Essential Functionality - Part 2