🐼 What is Pandas?
Pandas (the name derives from "panel data") is an open-source Python
library providing fast, flexible, and expressive data structures designed to
work with structured data, such as:
Tabular data (like Excel, SQL tables)
Labeled data (like time series)
Missing data
Heterogeneous types in columns
🧠 Think of Pandas as Excel + SQL + NumPy all in one!
🔧 Why is Pandas Important?
Makes data cleaning, transformation, and analysis easy.
Efficient handling of large datasets.
Simplifies exploratory data analysis (EDA).
A must-have for Machine Learning, Data Preprocessing, Visualization,
and ETL pipelines.
🧱 Core Data Structures
1. Series
A one-dimensional labeled array. Like a single column of data.
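A minimal sketch of a Series (the names and values are made up for illustration):

```python
import pandas as pd

# A Series is a 1-D labeled array: values plus an index.
s = pd.Series([25, 30, 35], index=["Alice", "Bob", "Carol"], name="age")
print(s)
print(s["Bob"])   # label-based access -> 30
```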
2. DataFrame
A two-dimensional table (rows and columns), like a spreadsheet or SQL table.
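A small sketch of a DataFrame built from a dict (column names and data are illustrative):

```python
import pandas as pd

# Each column can hold a different dtype, like a spreadsheet.
df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [25, 30],
    "city": ["Delhi", "Mumbai"],
})
print(df.shape)  # (2, 3): two rows, three columns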
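A small sketch of a DataFrame built from a dict (column names and data are illustrative):

```python
import pandas as pd

# Each column can hold a different dtype, like a spreadsheet.
df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [25, 30],
    "city": ["Delhi", "Mumbai"],
})
print(df.shape)  # (2, 3): two rows, three columns
```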
Essential Functionalities
📊 1. Reading Data
🔍 2. Exploring Data
🧹 3. Cleaning Data
🎯 4. Filtering and Selecting
📐 5. Aggregation and Grouping
🧮 6. Mathematical Operations
🔁 7. Merging and Joining
8. Sorting and Ranking
🕒 9. DateTime Handling
📤 10. Exporting Data
💼 Real-World Data Science Applications
Use Case How Pandas Helps
Data Cleaning Removing missing values, fixing formats
Exploratory Data Analysis .describe(), .groupby(), .plot() for quick insights
Feature Engineering Creating new variables from existing ones
Data Merging Combining datasets from multiple sources
ETL (Extract, Transform, Load) Read from files, transform, and write to DB
Time Series Analysis Easy handling of timestamps and date ranges
ML Preprocessing Scaling, encoding, and transforming datasets before modeling
🔄 Integration with Other Tools
Tool How It Works with Pandas
NumPy Pandas is built on top of NumPy
Matplotlib/Seaborn Visualize data from Pandas directly
Scikit-learn Use DataFrames for model inputs and outputs
SQLAlchemy Read/write SQL tables via pd.read_sql() and df.to_sql()
Streamlit Display DataFrames in interactive dashboards
🧠 Performance Tips
Use df.astype() to reduce memory by setting correct dtypes.
Use vectorized operations instead of loops.
For large datasets: use Dask, Vaex, or chunked loading.
For readability, avoid chaining too many operations in a single expression
(e.g., df.drop(...).fillna(...).sort_values(...)).
🧪 Sample Mini Project: Analyze Sales CSV
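A sketch of the mini project. Since the CSV itself isn't shown, hypothetical sales rows stand in for the file contents via `io.StringIO`:

```python
import io
import pandas as pd

# Hypothetical sales data standing in for the CSV file.
csv_data = io.StringIO(
    "Product,Units,Price\n"
    "Pen,10,2.5\n"
    "Book,4,12.0\n"
    "Pen,6,2.5\n"
)
sales = pd.read_csv(csv_data)
sales["Revenue"] = sales["Units"] * sales["Price"]   # feature engineering
by_product = sales.groupby("Product")["Revenue"].sum()
print(by_product)   # total revenue per product
```

Swapping `io.StringIO(...)` for a real path like `'sales.csv'` gives the file-based version.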
🧾 Summary Table
Feature Description
Data Structures Series (1D), DataFrame (2D)
File IO Read/write CSV, Excel, JSON, SQL
Analysis Tools Groupby, aggregation, filtering
Cleaning Drop, fill, replace, rename
Integration NumPy, Matplotlib, Scikit-learn, SQL
Applications EDA, data prep, modeling, dashboards
🧱 What is a Pandas Series?
A Pandas Series is a one-dimensional labeled array that can hold any data
type (integers, strings, floats, Python objects, etc.).
Similar to a column in a DataFrame or an array in NumPy.
Each element has a value and an associated label called an index.
✅ Creating a Series
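Three common ways to construct a Series (values are illustrative):

```python
import pandas as pd

s1 = pd.Series([10, 20, 30])              # default integer index 0, 1, 2
s2 = pd.Series({"a": 1, "b": 2})          # from a dict: keys become the index
s3 = pd.Series(5, index=["x", "y", "z"])  # scalar broadcast to every label
```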
🔍 Querying and Indexing in Series
Pandas Series provides multiple ways to access (query) and index data efficiently.
🔹 1. Positional Indexing (.iloc or direct [ ])
Access by integer position (like lists or arrays).
🔹 2. Label-Based Indexing (.loc or direct [ ] with label)
Access by index label.
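A quick sketch of both access styles above (made-up values):

```python
import pandas as pd

s = pd.Series([100, 200, 300], index=["a", "b", "c"])
print(s.iloc[0])    # positional access -> 100
print(s.loc["b"])   # label-based access -> 200
print(s["c"])       # plain [] with a label also works -> 300
```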
🔹 3. Slicing
By position:
By label (inclusive of end label):
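Note the key difference sketched below: positional slices exclude the end, label slices include it.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
print(s.iloc[0:2])      # positions 0 and 1 (end EXCLUDED)
print(s.loc["a":"c"])   # labels "a" through "c" (end INCLUDED)
```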
🔹 4. Boolean Indexing (Filtering)
Used for filtering data based on a condition.
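A minimal boolean-filtering sketch:

```python
import pandas as pd

s = pd.Series([5, 12, 7, 20])
high = s[s > 10]       # keeps only elements matching the condition
print(high.tolist())   # [12, 20]
```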
🔹 5. Fancy Indexing (Multiple indexes at once)
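A sketch of fancy indexing, passing a list of labels:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])
subset = s[["a", "d"]]   # pick several non-adjacent elements at once
print(subset.tolist())   # [10, 40]
```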
🔹 6. Accessing Index and Values
🔹 7. Modifying Values
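A combined sketch of the two operations above — reading the index/values and assigning through a label:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["x", "y", "z"])
print(list(s.index))    # ['x', 'y', 'z']
print(s.values)         # the underlying NumPy array
s["y"] = 99             # assign through a label to modify in place
```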
🔬 Relevance in Data Science
Task How Series Indexing Helps
Data Cleaning Easily locate and update invalid or missing entries
Feature Engineering Apply logic per row or per field
Statistical Analysis Query specific segments of a Series (e.g., outliers, percentiles)
EDA (Exploratory Analysis) Filter data to study trends, e.g., s[s > mean]
Time Series Series with a datetime index supports time-based slicing
⏱ Example: Filter Sensor Readings > Threshold
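A sketch with hypothetical sensor readings and a made-up threshold:

```python
import pandas as pd

# Hypothetical temperature readings from a sensor.
readings = pd.Series([21.5, 22.0, 35.2, 21.8, 36.1])
threshold = 30.0
alerts = readings[readings > threshold]   # keep only readings above threshold
print(alerts.tolist())   # [35.2, 36.1]
```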
Helps identify high readings for alerts in time-series or IoT sensor data.
🗓 Example: Time Series Indexing
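A sketch of time-based selection on a Series with a DatetimeIndex (dates and values are made up):

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=5, freq="D")
s = pd.Series([10, 12, 9, 14, 11], index=dates)
from_jan_3 = s["2024-01-03":]   # slice from a date string onward
print(from_jan_3)
```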
✅ Summary: Key Series Querying & Indexing Tools
Method Usage
s[index] Single element by label or position
s.loc[label] Label-based indexing
s.iloc[position] Position-based indexing
s[start:end] Slicing
s[condition] Boolean filtering
s[[i, j]] Fancy indexing (multiple elements)
📘 What Is a DataFrame?
A Pandas DataFrame is a two-dimensional, tabular data structure — think of
it like an Excel spreadsheet or SQL table.
It has rows and columns.
Each column can have a different data type.
It’s built on top of NumPy and is highly optimized for performance.
🧱 Structure of a DataFrame
A DataFrame has:
Index (row labels) — Default: 0, 1, 2...
Columns (column labels) — Named like a dictionary key
Data — 2D array-like, can be a mix of types
✅ Creating a DataFrame
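Two equivalent construction sketches (names and scores are illustrative):

```python
import pandas as pd

# From a dict of equal-length lists: keys become column names.
df = pd.DataFrame({
    "Name": ["Asha", "Ravi"],
    "Score": [88, 92],
})

# From a list of row dicts: each dict becomes one row.
df2 = pd.DataFrame([{"Name": "Asha", "Score": 88},
                    {"Name": "Ravi", "Score": 92}])
```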
📥 Loading Data into a DataFrame
Source Function
CSV pd.read_csv('file.csv')
Excel pd.read_excel('file.xlsx')
JSON pd.read_json('file.json')
SQL pd.read_sql(query, conn)
Clipboard pd.read_clipboard()
Dictionary pd.DataFrame(dict)
Example – Loading from CSV
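A runnable sketch: `io.StringIO` stands in for a real file path such as `'file.csv'`:

```python
import io
import pandas as pd

csv_text = "id,city\n1,Delhi\n2,Pune\n"
df = pd.read_csv(io.StringIO(csv_text))   # same call works with a file path
print(df.head())
```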
🔍 Querying a DataFrame
1. Selecting Columns
2. Selecting Rows
By Index (position-based): .iloc[]
By Label (index-based): .loc[]
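A sketch covering column selection plus both row-selection styles above (labels and data are made up):

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "B", "C"], "score": [1, 2, 3]},
                  index=["r1", "r2", "r3"])
col = df["score"]        # single column -> Series
row_pos = df.iloc[0]     # first row, by position
row_lab = df.loc["r2"]   # row by index label
```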
🔁 Filtering Rows (Boolean Indexing)
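A boolean-indexing sketch on a DataFrame (made-up ages):

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "B", "C"], "age": [22, 35, 41]})
over_30 = df[df["age"] > 30]           # keep only rows where the condition holds
print(over_30["name"].tolist())        # ['B', 'C']
```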
🎯 Indexing in DataFrame
1. Default Index
Auto-generated: 0, 1, 2…
2. Custom Index
3. Set/Reset Index
🔁 Slicing Rows and Columns
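The index handling and slicing described above can be sketched together (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"id": [101, 102, 103], "val": [5, 7, 9]})
df = df.set_index("id")     # promote a column to the row index
print(df.loc[102, "val"])   # label-based lookup -> 7
print(df.iloc[0:2])         # positional slice: first two rows
df = df.reset_index()       # back to a default 0..n-1 index
```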
🧹 Modifying the DataFrame
Adding a Column
Removing a Column
Changing Values
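The three modifications above in one sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df["c"] = df["a"] + df["b"]     # add a derived column
df = df.drop(columns=["b"])     # remove a column
df.loc[0, "a"] = 100            # change a single cell
```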
📐 Aggregation & Grouping
🧮 Math Operations
🧪 Example: Filter and Compute
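A filter-then-aggregate sketch (departments and salaries are made up):

```python
import pandas as pd

df = pd.DataFrame({"dept": ["IT", "HR", "IT"], "salary": [70, 50, 90]})
it_mean = df[df["dept"] == "IT"]["salary"].mean()   # filter rows, then compute
print(it_mean)   # 80.0
```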
🔑 Data Exploration Methods
🧠 Relevance in Data Science
Task How DataFrames Help
Data Cleaning Drop nulls, fill missing, rename, format values
Exploratory Data Analysis (EDA) Quick summary and statistical insights
Feature Engineering Create new variables, transform existing columns
Machine Learning DataFrame as input for scikit-learn models
Visualization Plotting with Seaborn/Matplotlib based on DataFrame structure
ETL Pipelines Extract, transform, and load data efficiently
Data Aggregation Grouping and summarizing across categories
✅ Summary Table
Feature Command/Example
Load CSV pd.read_csv()
Access Column df['col'] or df.col
Access Row df.loc[] / df.iloc[]
Filter Rows df[df['col'] > value]
Add Column df['new'] = ...
Group & Aggregate df.groupby()['col'].mean()
Change Index df.set_index() / df.reset_index()
Describe Stats df.describe()
💡 Bonus Tip: Chaining Methods
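A chaining sketch: each call returns a new object, so steps compose into one readable pipeline (data is made up):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Delhi", "Pune", "Delhi"], "sales": [10, None, 30]})
result = (
    df.fillna({"sales": 0})            # each step returns a new object,
      .groupby("city")["sales"].sum()  # so calls can be chained fluently
      .sort_values(ascending=False)
)
print(result)
```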
🔗 What is Merging in Pandas?
Merging in Pandas is the process of combining two or more DataFrames based
on common columns or indices, much like SQL JOIN operations.
It allows you to build complex datasets by combining smaller pieces based on
logical relationships.
🧱 Syntax
Parameter Description
left, right DataFrames to merge
how Type of join: inner, outer, left, right
on Column(s) to join on (common column)
left_on, right_on Specify different column names if keys differ
🧪 Example DataFrames
🔀 1. Inner Join (only matching rows)
🔄 2. Left Join (all from left, match from right)
🔁 3. Right Join (all from right, match from left)
🌐 4. Outer Join (all rows from both, fill NaNs)
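The four join types above can be illustrated with two small hypothetical employee/department tables:

```python
import pandas as pd

left = pd.DataFrame({"emp": ["Ann", "Bob", "Cid"], "dept_id": [1, 2, 4]})
right = pd.DataFrame({"dept_id": [1, 2, 3], "dept": ["IT", "HR", "Ops"]})

inner = pd.merge(left, right, on="dept_id", how="inner")    # matches only
left_j = pd.merge(left, right, on="dept_id", how="left")    # all left rows
right_j = pd.merge(left, right, on="dept_id", how="right")  # all right rows
outer = pd.merge(left, right, on="dept_id", how="outer")    # everything, NaN-filled
print(len(inner), len(left_j), len(right_j), len(outer))    # 2 3 3 4
```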
🧠 Advanced Parameters
🔹 left_on, right_on
Use when column names are different:
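A sketch with mismatched key names (`cust_id` vs `id` are made-up columns):

```python
import pandas as pd

orders = pd.DataFrame({"cust_id": [1, 2], "amount": [100, 250]})
customers = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Bob"]})
merged = pd.merge(orders, customers, left_on="cust_id", right_on="id")
print(merged[["name", "amount"]])
```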
🔹 suffixes
Add suffixes when overlapping column names (besides the key):
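A sketch where both tables share a `score` column, disambiguated with suffixes:

```python
import pandas as pd

q1 = pd.DataFrame({"id": [1, 2], "score": [10, 20]})
q2 = pd.DataFrame({"id": [1, 2], "score": [15, 25]})
merged = pd.merge(q1, q2, on="id", suffixes=("_q1", "_q2"))
print(list(merged.columns))   # ['id', 'score_q1', 'score_q2']
```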
🔍 Merging on Index
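An index-aligned merge sketch (row labels are made up):

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]}, index=["r1", "r2"])
b = pd.DataFrame({"y": [10, 20]}, index=["r1", "r2"])
merged = pd.merge(a, b, left_index=True, right_index=True)
# a.join(b) is an equivalent shorthand for index-aligned joins
print(merged)
```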
📊 Real-Life Example: Sales and Customer Info
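A sketch with hypothetical stand-ins for the sales and customer tables:

```python
import pandas as pd

sales = pd.DataFrame({"cust_id": [1, 1, 2], "amount": [50, 70, 30]})
customers = pd.DataFrame({"cust_id": [1, 2], "name": ["Ann", "Bob"]})
report = pd.merge(sales, customers, on="cust_id", how="left")
totals = report.groupby("name")["amount"].sum()
print(totals)   # Ann 120, Bob 30
```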
🔬 Importance in Data Science
Task Relevance of Merging
Data Integration Combine data from different sources (e.g., user logs + demographics)
Feature Engineering Merge extra info (e.g., zip code data with customer data)
Joining Results Merge model outputs with test IDs or metadata
Master Tables Create master data sets by combining multiple inputs
Relational Datasets Work with multi-table formats like in SQL or NoSQL
✅ Summary: Merge Types Cheat Sheet
Merge Type Description Keeps Unmatched Rows
inner Only matched rows ❌ No
left All from left + matched from right ✅ Yes (left only)
right All from right + matched from left ✅ Yes (right only)
outer All rows from both sides ✅ Yes (both sides)
📘 What is groupby() in Pandas?
The groupby() function in Pandas is used to split a DataFrame into groups based
on a specific column (or columns), apply a function (like mean, sum, count, etc.),
and then combine the results.
Think of it like SQL's GROUP BY, allowing aggregation and summarization of data.
🧠 The GroupBy Process (SAC):
1. Split the data into groups.
2. Apply a function to each group.
3. Combine the results into a new DataFrame or Series.
🧱 Basic Syntax
The general pattern is df.groupby(key_column), then a column selection and an
aggregation method, e.g. df.groupby('dept')['salary'].mean().
🧪 Example DataFrame
🔁 Group by a Column and Aggregate
1. Group and Compute Mean
2. Group and Count
3. Group and Sum
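The three aggregations above, sketched on a small hypothetical table:

```python
import pandas as pd

df = pd.DataFrame({
    "dept": ["IT", "HR", "IT", "HR"],
    "salary": [60, 40, 80, 50],
})
mean_by = df.groupby("dept")["salary"].mean()     # HR 45, IT 70
count_by = df.groupby("dept")["salary"].count()   # 2 rows per dept
sum_by = df.groupby("dept")["salary"].sum()       # HR 90, IT 140
```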
📊 Multiple Aggregations with agg()
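A sketch of `agg()` with several functions at once (data is made up):

```python
import pandas as pd

df = pd.DataFrame({"dept": ["IT", "HR", "IT"], "salary": [60, 40, 80]})
summary = df.groupby("dept")["salary"].agg(["mean", "min", "max"])
print(summary)   # one row per group, one column per aggregation
```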
🔁 Group by Multiple Columns
Gives you groupings at a more granular level (multi-indexed DataFrame).
🔄 Resetting the Index
This brings the result back to a regular DataFrame (not Series or multi-index).
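A combined sketch of multi-column grouping and resetting the index afterwards:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["N", "N", "S"],
    "product": ["A", "B", "A"],
    "sales": [10, 20, 30],
})
grouped = df.groupby(["region", "product"])["sales"].sum()  # MultiIndex result
flat = grouped.reset_index()                                # back to plain columns
print(flat)
```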
💼 Custom Aggregation Functions
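Any callable works as a custom aggregator; here a lambda computes each group's salary range (data is made up):

```python
import pandas as pd

df = pd.DataFrame({"dept": ["IT", "IT", "HR"], "salary": [60, 80, 40]})
spread = df.groupby("dept")["salary"].agg(lambda s: s.max() - s.min())
print(spread)   # HR 0, IT 20
```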
🔎 Filtering Groups
Using .filter():
Returns rows from groups where average salary > 55,000.
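A sketch matching the description above (salaries are made up):

```python
import pandas as pd

df = pd.DataFrame({"dept": ["IT", "IT", "HR"],
                   "salary": [60000, 70000, 40000]})
# Keep all rows belonging to groups whose mean salary exceeds 55,000.
rich = df.groupby("dept").filter(lambda g: g["salary"].mean() > 55000)
print(rich["dept"].unique())   # ['IT']
```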
📚 Use in Data Science
Use Case How groupby() Helps
Exploratory Data Analysis Summarize data by category (e.g., avg sales by region)
Feature Engineering Create group-level features (e.g., user's avg transaction)
Anomaly Detection Compare group stats to find outliers
Data Cleaning Check missing data per group
Model Evaluation Group by models or labels for performance analysis
🧪 Real-World Example: Customer Purchases
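A sketch with hypothetical purchase records:

```python
import pandas as pd

purchases = pd.DataFrame({
    "customer": ["Ann", "Bob", "Ann", "Bob", "Ann"],
    "value": [20, 35, 15, 5, 10],
})
total_per_customer = purchases.groupby("customer")["value"].sum()
print(total_per_customer)   # Ann 45, Bob 40
```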
This shows total purchase value per customer — great for customer
segmentation.
✅ Summary Table
Task Code Example
Group by column df.groupby('col')
Group + mean df.groupby('col')['val'].mean()
Group + multiple agg df.groupby('col').agg(['sum', 'count'])
Group by two columns df.groupby(['col1', 'col2'])
Filter groups .filter(lambda x: x['val'].mean() > 100)
Reset index after grouping .reset_index()
📘 What is a Pivot Table in Pandas?
A pivot table is a powerful data transformation tool used to summarize,
aggregate, and reorganize data in a DataFrame.
Just like Excel pivot tables, Pandas' pivot_table() is used to group data across two
axes (rows and columns), apply an aggregation function, and extract insights.
🧱 Syntax of pivot_table()
Parameter Description
data DataFrame to use
values Column to aggregate (e.g., sales, price)
index Keys to group by on the rows
columns Keys to group by on the columns
aggfunc Aggregation function (mean, sum, count, etc.)
fill_value Replace missing values (e.g., 0 instead of NaN)
🧪 Example Dataset
🔀 Creating a Simple Pivot Table
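A sketch with made-up regional sales data:

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["North", "North", "South", "South"],
    "Product": ["Pen", "Book", "Pen", "Book"],
    "Sales": [10, 20, 30, 40],
})
pivot = pd.pivot_table(df, values="Sales", index="Region",
                       columns="Product", aggfunc="sum")
print(pivot)   # Regions as rows, Products as columns
```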
✅ This shows total Sales by Region and Product.
📊 Aggregation Functions (aggfunc)
You can change the aggregation logic:
Other options: 'mean', 'sum', 'min', 'max', 'std'
🧱 Multiple Aggregation Columns
Gives a multi-index result with both sales and profit data summarized.
🔄 Filling Missing Values (fill_value)
Replaces NaN with 0 — useful in reporting or ML preprocessing.
📘 margins=True: Add Grand Totals
Adds an "All" row and column for total sums.
🔁 Use Case: Multi-level Pivot
Groups by two levels of indexing — perfect for nested analysis.
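The variations above (aggfunc, multiple value columns, fill_value, margins) can be combined in one call; a sketch with made-up data — a multi-level pivot would simply pass index=['Region', 'Product']:

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["North", "North", "South"],
    "Product": ["Pen", "Book", "Pen"],
    "Sales": [10, 20, 30],
    "Profit": [2, 5, 6],
})
pivot = pd.pivot_table(
    df,
    values=["Sales", "Profit"],   # multiple value columns
    index="Region",
    columns="Product",
    aggfunc="sum",                # swap in 'mean', 'count', etc.
    fill_value=0,                 # replace NaN in empty cells
    margins=True,                 # add an 'All' totals row/column
)
print(pivot)
```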
📈 Visualization Idea
Pivot tables + heatmaps = beautiful EDA summaries!
📊 Use of Pivot Tables in Data Science
Application Area Example Use of Pivot Table
EDA (Exploratory Analysis) Quickly summarize numeric features across categories (e.g., avg price per brand)
Data Cleaning Spot missing or inconsistent entries grouped by dimension
Feature Engineering Aggregate data before feeding into ML (e.g., avg orders per user)
Reporting & Dashboards Create summarized KPIs per month/region/category
Anomaly Detection Identify outliers by comparing group stats
Time Series Analysis Pivot over time to view trends in sales, events, transactions
✅ Summary Table
Task Code Example
Basic pivot pd.pivot_table(df, values='Sales', index='Region', columns='Product')
Fill missing values fill_value=0
Multiple agg values values=['Sales', 'Profit']
Multi-level group by index=['Region', 'Product']
Add totals margins=True
🔍 Pro Tip:
Pivot tables are highly optimized and can replace many lines of groupby logic
with just one clear line of summarization.
What is Date/Time Functionality in Python?
Date/time functionality in Python allows you to:
Work with dates and times as objects
Format, parse, and calculate time-related data
Perform time-series analysis, timestamp comparison, and date
arithmetic
🧰 Key Libraries for Date/Time
Library Description
datetime Native Python module for working with dates and times
time Lower-level module, useful for performance timing
calendar Functions related to calendar calculations (e.g., leap year)
dateutil Extended date/time parsing and manipulation (external library)
pytz Time zone support
pandas Built-in time handling for Series/DataFrame, essential in Data Science
🧱 Basic Usage with datetime
✅ Get Current Date and Time
📅 Create Date and Time Objects
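The two steps above in one sketch — grabbing the current moment and constructing date/time objects explicitly:

```python
from datetime import datetime, date, time

now = datetime.now()                 # current local date and time
d = date(2024, 3, 15)                # a date object
t = time(14, 30)                     # a time object
dt = datetime(2024, 3, 15, 14, 30)   # a full datetime
print(dt.year, dt.hour)              # 2024 14
```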
🔁 Formatting Dates (strftime)
🔄 Parsing Dates from Strings (strptime)
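Formatting and parsing are inverses, as this round-trip sketch shows:

```python
from datetime import datetime

dt = datetime(2024, 3, 15, 14, 30)
text = dt.strftime("%Y-%m-%d %H:%M")               # datetime -> string
back = datetime.strptime(text, "%Y-%m-%d %H:%M")   # string -> datetime
print(text)   # 2024-03-15 14:30
```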
➕ Date Arithmetic with timedelta
Date/Time Comparison
Convert to Timestamp
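The three ideas above — arithmetic, comparison, and Unix timestamps — in one sketch:

```python
from datetime import datetime, timedelta

start = datetime(2024, 1, 1)
later = start + timedelta(days=7, hours=3)   # date arithmetic
print(later)                                 # 2024-01-08 03:00:00
print(later > start)                         # True: datetimes compare directly
print(start.timestamp())                     # seconds since the Unix epoch
```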
🌍 Time Zones (using pytz)
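A zone-conversion sketch; note pytz is a third-party package (`pip install pytz`), and Python 3.9+ offers the standard-library `zoneinfo` as an alternative:

```python
from datetime import datetime
import pytz  # third-party package

india = pytz.timezone("Asia/Kolkata")
aware = india.localize(datetime(2024, 3, 15, 9, 0))  # attach a zone
utc_version = aware.astimezone(pytz.utc)
print(utc_version)   # 9:00 IST is 03:30 UTC (IST = UTC+5:30)
```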
🧪 Working with Dates in Pandas
📆 Convert Column to DateTime
🧮 Extract Date Components
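The conversion and the component extraction above, sketched together (dates are made up):

```python
import pandas as pd

df = pd.DataFrame({"date": ["2024-01-05", "2024-02-10"]})
df["date"] = pd.to_datetime(df["date"])   # strings -> datetime64 column
df["year"] = df["date"].dt.year           # .dt gives datetime accessors
df["month"] = df["date"].dt.month
df["weekday"] = df["date"].dt.day_name()
print(df)
```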
🔁 Resample (For Time Series)
Resample Code Meaning
'D' Day
'W' Week
'M' Month
'Y' Year
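A resampling sketch: six made-up daily values aggregated into weekly sums:

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=6, freq="D")
s = pd.Series([1, 2, 3, 4, 5, 6], index=idx)
weekly_sum = s.resample("W").sum()   # aggregate daily values per week
print(weekly_sum)
```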
📈 Example: Time Series Plot
💼 Use of Date/Time in Data Science
Use Case Description
Time Series Analysis Stock prices, weather, IoT data trends over time
Feature Engineering Extracting day of week, month, season for ML models
Log Analysis Analyzing timestamps in logs for user behavior or errors
Forecasting Building models to predict future trends
Data Cleaning Identifying missing or out-of-sequence timestamps
Cohort Analysis Grouping users by signup/join dates
Event Detection Spotting anomalies or patterns in date-based sequences
✅ Summary Table
Task Code Example
Get current datetime datetime.now()
Format datetime to string dt.strftime("%Y-%m-%d")
Parse string to datetime datetime.strptime(str, format)
Add days dt + timedelta(days=1)
Extract year/month/day in Pandas df['date'].dt.year
Resample time series df.resample('M').mean()
Convert column to datetime pd.to_datetime(df['column'])
📊 What is a DataFrame?
A DataFrame is a 2-dimensional, tabular data structure from the pandas library
that:
Is similar to an Excel spreadsheet or SQL table
Supports labeled rows and columns
Allows powerful data manipulation
🧪 Example Dataset: Sales Data
🔧 1. Viewing and Inspecting
🧪 2. Add a New Column
Now df includes a column Revenue for each transaction.
🔍 3. Filter Rows (Boolean Indexing)
Used frequently in data analysis and cleaning.
🔁 4. Sorting Data
Helps identify top transactions, outliers, or order results.
🔢 5. Group By and Aggregation
This is core to analytics — summarizing grouped data.
🔄 6. Replace / Map Values
Useful when standardizing or encoding categorical data for ML.
🔢 7. Renaming Columns
Clean, readable column names are important in modeling.
🔗 8. Merging DataFrames
Used for joining datasets (e.g., transactions + customer info).
📅 9. Date Filtering and Time-based Features
Time-based slicing is common in time series and forecasting.
🧼 10. Drop or Fill Missing Data
Essential step in data cleaning before model training.
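Several of the steps above can be sketched in one pass over a hypothetical transactions table (columns and values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-02-01"]),
    "Product": ["Pen", "Book", "Pen"],
    "Units": [10, 3, None],
    "Price": [2.0, 15.0, 2.0],
})

df["Units"] = df["Units"].fillna(0)               # fill missing data
df["Revenue"] = df["Units"] * df["Price"]         # add a derived column
df = df.rename(columns={"Units": "Quantity"})     # rename for readability
january = df[df["Date"].dt.month == 1]            # date-based filtering
by_product = df.groupby("Product")["Revenue"].sum()   # group + aggregate
top = df.sort_values("Revenue", ascending=False)  # sort by a column
print(by_product)
```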
📊 Final DataFrame
You now have a clean, enriched, time-aware DataFrame ready for:
Visualization
Modeling (ML)
Export to CSV/Excel
Dashboarding
💡 Why It’s Important in Data Science:
Task How DataFrame Manipulation Helps
EDA Filter, group, and analyze to understand trends
Feature Engineering Add time features, ratios, or encoded values
Model Input Cleaned and structured DataFrame is used directly in ML pipelines
Data Cleaning Remove nulls, fix types, handle outliers
Joins & Merging Combine datasets for richer insights
Visualization Prep Slice and group data to prepare for plotting
✅ Summary of Key Methods
Operation Function
Filter rows df[df['col'] == value]
Add new column df['new'] = ...
Group by df.groupby('col')['val'].sum()
Sort rows df.sort_values(by='col')
Rename columns df.rename(columns={...})
Merge tables pd.merge(df1, df2, on=..., how=...)
Date operations df['Date'].dt.day_name()