Complete Pandas Tutorial

Table of Contents
1. Introduction & Setup
2. Basic Data Structures
3. Data Loading & Saving
4. Data Inspection & Exploration
5. Data Selection & Indexing
6. Data Cleaning
7. Data Transformation
8. Grouping & Aggregation
9. Merging & Joining
10. Time Series Analysis
11. Visualization with Pandas
12. Advanced Operations
13. Performance Optimization
14. Real-World Projects

1. Introduction & Setup


What is Pandas?

Pandas is a powerful Python library for data manipulation and analysis. It provides data structures and
functions needed to work with structured data seamlessly.

Installation
pip install pandas numpy matplotlib seaborn

Basic Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', None)
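
To confirm the setup works end to end, here is a minimal sanity-check sketch (the column names and values are made up for illustration):

df = pd.DataFrame({'city': ['NYC', 'LA', 'NYC'], 'sales': [250, 180, 320]})
print(df[df['sales'] > 200])             # Boolean filtering
print(df.groupby('city')['sales'].sum()) # Grouping and aggregation
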
2. Basic Data Structures
Series

A Series is a one-dimensional labeled array.

# Creating Series
s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
s3 = pd.Series({'a': 1, 'b': 2, 'c': 3})

# Series properties
print(s2.index)  # Index
print(s2.values) # Values
print(s2.dtype)  # Data type
print(s2.shape)  # Shape
print(s2.size)   # Size

DataFrame

A DataFrame is a two-dimensional labeled data structure.

# Creating DataFrames
df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['NYC', 'LA', 'Chicago', 'Houston']
})

# From lists
df2 = pd.DataFrame([
    ['Alice', 25, 'NYC'],
    ['Bob', 30, 'LA']
], columns=['Name', 'Age', 'City'])

# From dictionary of lists
df3 = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})

# DataFrame properties
print(df1.shape)   # (rows, columns)
print(df1.columns) # Column names
print(df1.index)   # Row indices
print(df1.dtypes)  # Data types
df1.info()         # Summary info (prints directly)
3. Data Loading & Saving
Reading Data
# CSV files
df = pd.read_csv('data.csv')
df = pd.read_csv('data.csv', index_col=0)              # Set first column as index
df = pd.read_csv('data.csv', usecols=['col1', 'col2']) # Select specific columns
df = pd.read_csv('data.csv', nrows=1000)               # Read first 1000 rows

# Excel files
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
df = pd.read_excel('data.xlsx', sheet_name=0)  # By index

# JSON files
df = pd.read_json('data.json')
df = pd.read_json('data.json', orient='records')

# SQL databases
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql_query('SELECT * FROM table_name', conn)

# Parquet files
df = pd.read_parquet('data.parquet')

# Text files
df = pd.read_csv('data.txt', delimiter='\t') # Tab-separated
df = pd.read_csv('data.txt', delimiter='|')  # Pipe-separated

Saving Data
# CSV
df.to_csv('output.csv', index=False)
df.to_csv('output.csv', columns=['col1', 'col2']) # Specific columns

# Excel
df.to_excel('output.xlsx', sheet_name='Data', index=False)

# JSON
df.to_json('output.json', orient='records')

# Parquet
df.to_parquet('output.parquet')

# SQL
df.to_sql('table_name', conn, if_exists='replace', index=False)
4. Data Inspection & Exploration
Basic Information
# Shape and structure
df.shape                   # (rows, columns)
df.info()                  # Data types and memory usage
df.describe()              # Statistical summary
df.describe(include='all') # All columns including non-numeric

# First and last rows
df.head()   # First 5 rows
df.head(10) # First 10 rows
df.tail()   # Last 5 rows
df.tail(3)  # Last 3 rows

# Sampling
df.sample()         # Random single row
df.sample(5)        # Random 5 rows
df.sample(frac=0.1) # Random 10% of data

Data Quality Checks

# Missing values
df.isnull().sum()       # Count nulls per column
df.isnull().sum().sum() # Total null count
df.notnull().sum()      # Count non-nulls
df.isnull().any()       # Boolean: any nulls per column
df.isnull().all()       # Boolean: all nulls per column

# Duplicates
df.duplicated().sum()                # Count duplicate rows
df.duplicated(subset=['col1']).sum() # Duplicates based on specific columns
df[df.duplicated()]                  # Show duplicate rows

# Unique values
df['column'].unique() # Unique values
df['column'].nunique() # Count of unique values
df['column'].value_counts() # Value frequency counts
df['column'].value_counts(normalize=True) # Proportions

Memory Usage
df.memory_usage() # Memory usage by column
df.memory_usage(deep=True) # Deep memory usage
df.info(memory_usage='deep')
5. Data Selection & Indexing
Column Selection
# Single column
df['Name'] # Returns Series
df[['Name']] # Returns DataFrame

# Multiple columns
df[['Name', 'Age']]
cols = ['Name', 'Age']
df[cols]

# Column slicing
df.loc[:, 'Name':'City'] # All rows, columns from Name to City

Row Selection
# By index position
df.iloc[0]   # First row
df.iloc[0:3] # First 3 rows
df.iloc[-1]  # Last row

# By index label
df.loc[0]   # Row with index label 0
df.loc[0:2] # Rows with index labels 0 to 2 (inclusive)

# Multiple rows
df.iloc[[0, 2, 4]] # Rows at positions 0, 2, 4
df.loc[[0, 2, 4]]  # Rows with index labels 0, 2, 4

Boolean Indexing
# Single condition
df[df['Age'] > 30]
df[df['City'] == 'NYC']
df[df['Name'].str.startswith('A')]

# Multiple conditions
df[(df['Age'] > 25) & (df['City'] == 'NYC')]
df[(df['Age'] < 25) | (df['Age'] > 35)]
df[df['Age'].between(25, 35)]

# Using isin()
df[df['City'].isin(['NYC', 'LA'])]
df[~df['City'].isin(['NYC', 'LA'])] # NOT in

# Using query()
df.query('Age > 30')
df.query('Age > 30 & City == "NYC"')
df.query('City in ["NYC", "LA"]')
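
query() can also reference local Python variables with the @ prefix, which keeps filters readable when thresholds live in variables; a short sketch (min_age and cities are illustrative names):

min_age = 30
cities = ['NYC', 'LA']
df.query('Age > @min_age')  # Compare against a local variable
df.query('City in @cities') # Membership test against a local list
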
Advanced Indexing
# Set index
df_indexed = df.set_index('Name')
df_indexed.loc['Alice']

# Reset index
df_reset = df_indexed.reset_index()

# Multi-level indexing
df_multi = df.set_index(['City', 'Name'])
df_multi.loc[('NYC', 'Alice')]

# Cross-section
df_multi.xs('NYC', level='City')
6. Data Cleaning
Handling Missing Values
# Detect missing values
df.isnull()
df.isna()  # Same as isnull()
df.notna() # Opposite of isna()

# Remove missing values
df.dropna()                        # Drop rows with any NaN
df.dropna(axis=1)                  # Drop columns with any NaN
df.dropna(how='all')               # Drop rows where all values are NaN
df.dropna(subset=['col1', 'col2']) # Drop based on specific columns
df.dropna(thresh=2)                # Keep rows with at least 2 non-NaN values

# Fill missing values
df.fillna(0)         # Fill with 0
df.ffill()           # Forward fill (fillna(method='ffill') in older pandas)
df.bfill()           # Backward fill
df.fillna(df.mean()) # Fill with column means

# Fill with different values per column
df.fillna({'col1': 0, 'col2': 'Unknown'})

# Interpolation
df.interpolate() # Linear interpolation
df.interpolate(method='polynomial', order=2)

Handling Duplicates
# Remove duplicates
df.drop_duplicates()
df.drop_duplicates(subset=['col1']) # Based on specific columns
df.drop_duplicates(keep='first') # Keep first occurrence
df.drop_duplicates(keep='last') # Keep last occurrence
df.drop_duplicates(keep=False) # Remove all duplicates

Data Type Conversion


# Convert data types
df['Age'] = df['Age'].astype(int)
df['Age'] = df['Age'].astype('int64')
df['Price'] = df['Price'].astype(float)
df['Date'] = pd.to_datetime(df['Date'])

# Convert multiple columns


df = df.astype({'col1': 'int64', 'col2': 'float64'})

# Convert to categorical
df['Category'] = df['Category'].astype('category')

# Numeric conversion with errors handling


df['NumCol'] = pd.to_numeric(df['NumCol'], errors='coerce') # NaN for errors
df['NumCol'] = pd.to_numeric(df['NumCol'], errors='ignore') # Keep original

String Cleaning
# String methods
df['Name'].str.lower()               # Lowercase
df['Name'].str.upper()               # Uppercase
df['Name'].str.title()               # Title case
df['Name'].str.strip()               # Remove whitespace
df['Name'].str.replace('old', 'new') # Replace text

# String operations
df['Name'].str.len()               # String length
df['Name'].str.contains('pattern') # Contains pattern
df['Name'].str.startswith('A')     # Starts with
df['Name'].str.endswith('son')     # Ends with
df['Name'].str.extract(r'(\w+)')   # Extract pattern

# Split strings
df['Name'].str.split()                 # Split on whitespace
df['Name'].str.split(' ', expand=True) # Split into columns

Outlier Detection and Handling


# Statistical methods
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers
df_clean = df[(df['column'] >= lower_bound) & (df['column'] <= upper_bound)]

# Z-score method
from scipy import stats
z_scores = np.abs(stats.zscore(df['column']))
df_clean = df[z_scores < 3]
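
The IQR logic above is easy to package as a small reusable helper; a sketch under the same assumptions (the name remove_outliers_iqr and the default multiplier k=1.5 are illustrative choices, not a pandas API):

def remove_outliers_iqr(df, column, k=1.5):
    # Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR]
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]

df_clean = remove_outliers_iqr(df, 'column')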
7. Data Transformation
Adding and Modifying Columns
# Add new columns
df['New_Column'] = 0
df['Age_Plus_10'] = df['Age'] + 10
df['Full_Name'] = df['First_Name'] + ' ' + df['Last_Name']

# Conditional column creation


df['Age_Group'] = np.where(df['Age'] > 30, 'Senior', 'Junior')

# Multiple conditions
conditions = [
    df['Age'] <= 25,
    (df['Age'] > 25) & (df['Age'] <= 35),
    df['Age'] > 35
]
choices = ['Young', 'Middle', 'Old']
df['Age_Category'] = np.select(conditions, choices, default='Unknown')

# Using apply()
df['Age_Squared'] = df['Age'].apply(lambda x: x**2)
df['Name_Length'] = df['Name'].apply(len)

# Using map()
mapping = {1: 'One', 2: 'Two', 3: 'Three'}
df['Number_Word'] = df['Number'].map(mapping)

Renaming
# Rename columns
df.rename(columns={'old_name': 'new_name'})
df.rename(columns={'col1': 'Column1', 'col2': 'Column2'})

# Rename index
df.rename(index={0: 'first', 1: 'second'})

# Rename all columns
df.columns = ['Col1', 'Col2', 'Col3']

# Make column names lowercase
df.columns = df.columns.str.lower()

Sorting
# Sort by single column
df.sort_values('Age')
df.sort_values('Age', ascending=False)

# Sort by multiple columns


df.sort_values(['City', 'Age'])
df.sort_values(['City', 'Age'], ascending=[True, False])
# Sort by index
df.sort_index()
df.sort_index(ascending=False)

Reshaping Data
# Melt (wide to long)
df_melted = pd.melt(df, id_vars=['Name'], value_vars=['Math', 'Science'])

# Pivot (long to wide)
df_pivot = df.pivot(index='Name', columns='Subject', values='Score')

# Pivot table with aggregation
df_pivot_table = pd.pivot_table(df,
                                values='Score',
                                index='Name',
                                columns='Subject',
                                aggfunc='mean')

# Stack and unstack
df_stacked = df.set_index(['Name', 'Subject']).stack()
df_unstacked = df_stacked.unstack()
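
To make the two directions concrete, here is a tiny round trip with made-up scores (the Math/Science columns follow the example above):

scores = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Math': [90, 75],
    'Science': [85, 80]
})

long = pd.melt(scores, id_vars=['Name'], var_name='Subject', value_name='Score') # Wide -> long
wide = long.pivot(index='Name', columns='Subject', values='Score')               # Long -> wide
print(long.shape, wide.shape)  # (4, 3) (2, 2)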

Binning and Categorization


# Cut into bins
df['Age_Bins'] = pd.cut(df['Age'], bins=3)
df['Age_Bins'] = pd.cut(df['Age'], bins=[0, 25, 35, 100], labels=['Young', 'Middle', 'Old'])

# Quantile-based binning
df['Age_Quantiles'] = pd.qcut(df['Age'], q=4)

# Custom binning
bins = [0, 18, 30, 50, 100]
labels = ['Child', 'Young Adult', 'Adult', 'Senior']
df['Age_Category'] = pd.cut(df['Age'], bins=bins, labels=labels)
8. Grouping & Aggregation
Basic Grouping
# Group by single column
grouped = df.groupby('City')
grouped.size()  # Count of rows per group
grouped.count() # Count of non-null values per group
grouped.sum()   # Sum per group
grouped.mean()  # Mean per group
grouped.std()   # Standard deviation per group

# Group by multiple columns
grouped = df.groupby(['City', 'Age_Group'])
grouped.mean()

Multiple Aggregations
# Multiple aggregation functions
df.groupby('City').agg({
    'Age': ['mean', 'std', 'min', 'max'],
    'Salary': ['sum', 'mean']
})

# Named aggregations
df.groupby('City').agg(
    avg_age=('Age', 'mean'),
    total_salary=('Salary', 'sum'),
    count=('Name', 'count')
)

# Custom aggregation functions
def age_range(x):
    return x.max() - x.min()

df.groupby('City').agg({
    'Age': [age_range, 'mean'],
    'Salary': 'sum'
})

Advanced Grouping Operations


# Transform (broadcast group statistics back)
df['Age_Mean_by_City'] = df.groupby('City')['Age'].transform('mean')
df['Age_Zscore'] = df.groupby('City')['Age'].transform(lambda x: (x - x.mean()) / x.std())

# Filter groups
df.groupby('City').filter(lambda x: len(x) > 2)           # Groups with more than 2 members
df.groupby('City').filter(lambda x: x['Age'].mean() > 30) # Groups with mean age > 30

# Apply custom functions
def group_summary(group):
    return pd.Series({
        'count': len(group),
        'avg_age': group['Age'].mean(),
        'age_range': group['Age'].max() - group['Age'].min()
    })

df.groupby('City').apply(group_summary)

Rolling and Expanding Windows


# Rolling windows
df['Rolling_Mean'] = df['Value'].rolling(window=3).mean()
df['Rolling_Sum'] = df['Value'].rolling(window=5).sum()
df['Rolling_Std'] = df['Value'].rolling(window=3).std()

# Expanding windows
df['Expanding_Mean'] = df['Value'].expanding().mean()
df['Expanding_Sum'] = df['Value'].expanding().sum()

# Rolling with groupby
df['Rolling_Mean_by_Group'] = df.groupby('Group')['Value'].rolling(window=3).mean().reset_index(0, drop=True)
9. Merging & Joining
Concatenation
# Vertical concatenation (stack rows)
df_combined = pd.concat([df1, df2])
df_combined = pd.concat([df1, df2], ignore_index=True) # Reset index

# Horizontal concatenation (stack columns)
df_combined = pd.concat([df1, df2], axis=1)

# Concatenate with keys
df_combined = pd.concat([df1, df2], keys=['Dataset1', 'Dataset2'])

Merging
# Inner join (default)
df_merged = pd.merge(df1, df2, on='key_column')

# Different join types
df_merged = pd.merge(df1, df2, on='key_column', how='left')  # Left join
df_merged = pd.merge(df1, df2, on='key_column', how='right') # Right join
df_merged = pd.merge(df1, df2, on='key_column', how='outer') # Full outer join

# Multiple keys
df_merged = pd.merge(df1, df2, on=['key1', 'key2'])

# Different column names
df_merged = pd.merge(df1, df2, left_on='id', right_on='user_id')

# Index-based merging
df_merged = pd.merge(df1, df2, left_index=True, right_index=True)

# Handle suffix for duplicate columns
df_merged = pd.merge(df1, df2, on='key', suffixes=('_x', '_y'))

Advanced Joining
# Join method (similar to merge but index-based)
df_joined = df1.join(df2, how='left')
df_joined = df1.join(df2, on='key_column')

# Merge with indicator
df_merged = pd.merge(df1, df2, on='key', how='outer', indicator=True)

# Merge as of (nearest key merge)
df_asof = pd.merge_asof(df1, df2, on='date_column')

# Cross join
df1['key'] = 1
df2['key'] = 1
df_cross = pd.merge(df1, df2, on='key').drop('key', axis=1)
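
In pandas 1.2 and later the dummy-key trick is unnecessary, because merge supports cross joins directly; a minimal sketch:

df_cross = pd.merge(df1, df2, how='cross')  # Cartesian product (pandas >= 1.2)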
10. Time Series Analysis
Date and Time Handling
# Convert to datetime
df['date'] = pd.to_datetime(df['date_string'])
df['date'] = pd.to_datetime(df['date_string'], format='%Y-%m-%d')

# Create date ranges


date_range = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
date_range = pd.date_range(start='2023-01-01', periods=100, freq='B') # Business days

# Set datetime index


df = df.set_index('date')

Time Series Operations


# Resample (change frequency)
df_monthly = df.resample('M').mean()   # Monthly mean
df_weekly = df.resample('W').sum()     # Weekly sum
df_quarterly = df.resample('Q').last() # Quarterly last value

# Time-based selection (with a datetime index)
df.loc['2023']                    # All data from 2023
df.loc['2023-01']                 # January 2023
df.loc['2023-01-01':'2023-01-31'] # Date range

# Time components (from the datetime index)
df['year'] = df.index.year
df['month'] = df.index.month
df['day'] = df.index.day
df['weekday'] = df.index.weekday
df['quarter'] = df.index.quarter

Time Series Analysis Functions


# Shift data
df['previous'] = df['value'].shift(1) # Previous period
df['next'] = df['value'].shift(-1) # Next period

# Percentage change
df['pct_change'] = df['value'].pct_change()

# Cumulative operations
df['cumsum'] = df['value'].cumsum()
df['cumprod'] = df['value'].cumprod()

# Time-based grouping
df.groupby(df.index.month).mean()      # Group by month
df.groupby(pd.Grouper(freq='M')).sum() # Group by month-end

Advanced Time Series


# Time zone handling
df.tz_localize('UTC')       # Localize to UTC
df.tz_convert('US/Eastern') # Convert timezone

# Business day operations
from pandas.tseries.offsets import BDay
df.index + BDay(1) # Add 1 business day

# Holiday handling
from pandas.tseries.holiday import USFederalHolidayCalendar
cal = USFederalHolidayCalendar()
holidays = cal.holidays(start='2023-01-01', end='2023-12-31')
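
The calendar pairs naturally with a custom business-day offset that skips both weekends and the listed holidays; a short sketch:

from pandas.tseries.offsets import CustomBusinessDay
us_bd = CustomBusinessDay(calendar=cal)
business_days_2023 = pd.date_range('2023-01-01', '2023-12-31', freq=us_bd)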
11. Visualization with Pandas
Basic Plotting
import matplotlib.pyplot as plt

# Line plot
df['value'].plot()
df.plot(x='date', y='value')

# Multiple lines
df[['col1', 'col2']].plot()

# Different plot types
df['value'].plot(kind='hist')                    # Histogram
df['category'].value_counts().plot(kind='bar')   # Bar plot of category counts
df.plot(kind='scatter', x='x', y='y')            # Scatter plot
df.plot(kind='box')                              # Box plot

Advanced Plotting
# Subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
df['col1'].plot(ax=axes[0, 0], kind='line')
df['col2'].plot(ax=axes[0, 1], kind='bar')
df.plot(ax=axes[1, 0], kind='scatter', x='x', y='y')
df['col4'].plot(ax=axes[1, 1], kind='hist')

# Customization
df['value'].plot(
title='My Plot',
xlabel='X Label',
ylabel='Y Label',
color='red',
style='--',
figsize=(10, 6)
)

# Group plotting
df.groupby('category')['value'].plot(legend=True)
12. Advanced Operations
Apply Functions
# Apply to Series
df['column'].apply(lambda x: x**2)
df['text'].apply(str.upper)
df['text'].apply(len)

# Apply to DataFrame
df.apply(lambda x: x.max() - x.min())         # By column (axis=0)
df.apply(lambda x: x.max() - x.min(), axis=1) # By row (axis=1)

# Apply with additional arguments


def custom_function(x, multiplier):
return x * multiplier

df['column'].apply(custom_function, multiplier=2)

Map and Replace


# Map values
mapping = {'A': 1, 'B': 2, 'C': 3}
df['letter'].map(mapping)

# Replace values
df['column'].replace(0, np.nan)                    # Replace single value
df['column'].replace([0, 1], [np.nan, 99])         # Replace multiple values
df.replace({'col1': {0: np.nan}, 'col2': {1: 99}}) # Column-specific replacement

# Replace with regex
df['text'].str.replace(r'\d+', 'NUMBER', regex=True)

Window Functions
# Ranking
df['rank'] = df['score'].rank()
df['rank_pct'] = df['score'].rank(pct=True)
df['rank_dense'] = df['score'].rank(method='dense')

# Percentiles
df['percentile'] = df['score'].rank(pct=True)

# Cumulative functions
df['cumsum'] = df['value'].cumsum()
df['cumprod'] = df['value'].cumprod()
df['cummax'] = df['value'].cummax()
df['cummin'] = df['value'].cummin()

MultiIndex Operations
# Create MultiIndex
df_multi = df.set_index(['level1', 'level2'])

# Access levels
df_multi.index.get_level_values(0) # Get level 0 values
df_multi.index.get_level_values('level1') # Get by name

# Reset specific levels


df_multi.reset_index(level=1)

# Swap levels
df_multi.swaplevel(0, 1)

# Cross-section
df_multi.xs('value', level='level1')
df_multi.xs(('val1', 'val2'), level=['level1', 'level2'])

# GroupBy with MultiIndex


df_multi.groupby(level=0).sum()
df_multi.groupby(level=['level1', 'level2']).mean()

Custom Aggregations
# Custom aggregation functions
def q75(x):
    return x.quantile(0.75)

def custom_stats(x):
    return pd.Series({
        'min': x.min(),
        'max': x.max(),
        'range': x.max() - x.min(),
        'q75': x.quantile(0.75)
    })

df.groupby('category').agg({
    'value': [q75, 'mean', 'std'],
    'count': 'sum'
})

df.groupby('category')['value'].apply(custom_stats)
13. Performance Optimization
Memory Optimization
# Check memory usage
df.info(memory_usage='deep')
df.memory_usage(deep=True)

# Optimize data types


# Convert to categories for repeated strings
df['category'] = df['category'].astype('category')

# Use smaller integer types


df['small_int'] = df['small_int'].astype('int8') # -128 to 127
df['medium_int'] = df['medium_int'].astype('int16') # -32,768 to 32,767

# Use float32 instead of float64 when precision allows


df['float_col'] = df['float_col'].astype('float32')

Efficient Operations
# Use vectorized operations instead of loops
# Slow
result = []
for val in df['column']:
    result.append(val * 2)
df['new_col'] = result

# Fast
df['new_col'] = df['column'] * 2

# Use .loc for setting values
# Slow
for i in df.index:
    if df.loc[i, 'condition'] > 5:
        df.loc[i, 'result'] = 'high'

# Fast
df.loc[df['condition'] > 5, 'result'] = 'high'

# Use query() for complex filtering


# Slow
df[(df['A'] > 5) & (df['B'] < 10) & (df['C'] == 'value')]

# Fast
df.query('A > 5 & B < 10 & C == "value"')

Chunking for Large Data


# Read large files in chunks
chunk_size = 10000
chunks = []
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    # Process each chunk
    processed_chunk = chunk.groupby('category').sum()
    chunks.append(processed_chunk)

# Combine results
result = pd.concat(chunks)

# Process and aggregate
def process_chunk(chunk):
    return chunk.groupby('key').sum()

result = pd.concat([process_chunk(chunk) for chunk in
                    pd.read_csv('large_file.csv', chunksize=chunk_size)])

Parallel Processing
# Using multiprocessing with apply
from multiprocessing import Pool
import numpy as np

def parallel_apply(df_split):
    return df_split.apply(lambda x: x**2)

# Split dataframe
df_split = np.array_split(df, 4)

# Process in parallel
with Pool(processes=4) as pool:
    results = pool.map(parallel_apply, df_split)

# Combine results
df_result = pd.concat(results)
14. Real-World Projects
Project 1: Sales Data Analysis
# Create sample sales data
np.random.seed(42)
dates = pd.date_range('2023-01-01', '2023-12-31', freq='D')
products = ['Product A', 'Product B', 'Product C', 'Product D']
regions = ['North', 'South', 'East', 'West']

sales_data = []
for date in dates:
    for _ in range(np.random.randint(5, 15)): # 5-15 transactions per day
        sales_data.append({
            'date': date,
            'product': np.random.choice(products),
            'region': np.random.choice(regions),
            'sales_amount': np.random.normal(1000, 300),
            'quantity': np.random.randint(1, 10)
        })

df_sales = pd.DataFrame(sales_data)
df_sales['date'] = pd.to_datetime(df_sales['date'])

# Analysis Tasks
# 1. Monthly sales trends
monthly_sales = df_sales.groupby(df_sales['date'].dt.to_period('M')).agg({
'sales_amount': 'sum',
'quantity': 'sum'
}).reset_index()

# 2. Top performing products


product_performance = df_sales.groupby('product').agg({
'sales_amount': ['sum', 'mean', 'count'],
'quantity': 'sum'
}).round(2)

# 3. Regional analysis
regional_analysis = df_sales.groupby('region').agg({
'sales_amount': ['sum', 'mean'],
'quantity': 'sum'
}).round(2)

# 4. Seasonal patterns
df_sales['month'] = df_sales['date'].dt.month
df_sales['quarter'] = df_sales['date'].dt.quarter
seasonal_patterns = df_sales.groupby(['quarter', 'product'])['sales_amount'].sum().unstack()

# 5. Moving averages for trend analysis


df_sales_daily = df_sales.groupby('date')['sales_amount'].sum().reset_index()
df_sales_daily['7_day_ma'] = df_sales_daily['sales_amount'].rolling(window=7).mean()
df_sales_daily['30_day_ma'] = df_sales_daily['sales_amount'].rolling(window=30).mean()

print("Sample Sales Analysis Results:")


print("Monthly Sales Trends:")
print(monthly_sales.head())
print("\nTop Product Performance:")
print(product_performance)

Project 2: Customer Data Cleaning and Analysis


# Create messy customer data
np.random.seed(123)
n_customers = 1000

# Introduce data quality issues intentionally
customer_data = {
    'customer_id': range(1, n_customers + 1),
    'name': [f"Customer {i}" for i in range(1, n_customers + 1)],
    'email': [f"customer{i}@example.com" for i in range(1, n_customers + 1)],
    'age': np.random.normal(40, 15, n_customers).astype(int),
    'city': np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'], n_customers),
    'purchase_amount': np.random.exponential(100, n_customers),
    'last_purchase_date': pd.date_range('2022-01-01', '2023-12-31', periods=n_customers)
}

df_customers = pd.DataFrame(customer_data)

# Introduce data quality issues
# 1. Missing values
missing_indices = np.random.choice(df_customers.index, size=50, replace=False)
df_customers.loc[missing_indices, 'email'] = np.nan

# 2. Duplicates
df_customers = pd.concat([df_customers, df_customers.iloc[:20]], ignore_index=True)

# 3. Outliers
df_customers.loc[np.random.choice(df_customers.index, size=10), 'age'] = np.random.choice([150, -5, 200])

# 4. Inconsistent formatting
format_indices = np.random.choice(df_customers.index, size=50, replace=False)
df_customers.loc[format_indices, 'city'] = df_customers.loc[format_indices, 'city'].str.upper()

print("Data Quality Issues Analysis:")


print(f"Dataset shape: {df_customers.shape}")
print(f"Missing values:\n{df_customers.isnull().sum()}")
print(f"Duplicates: {df_customers.duplicated().sum()}")
print(f"Age outliers: {df_customers[df_customers['age'] < 0].shape[0] + df_customers[df_customers['age']
> 120].shape[0]}")

# Data Cleaning Pipeline


def clean_customer_data(df):
    df_clean = df.copy()

    # 1. Remove duplicates
    df_clean = df_clean.drop_duplicates()

    # 2. Handle missing emails
    df_clean['email'] = df_clean['email'].fillna(
        'customer' + df_clean['customer_id'].astype(str) + '@example.com')

    # 3. Fix age outliers
    df_clean = df_clean[(df_clean['age'] >= 0) & (df_clean['age'] <= 120)]

    # 4. Standardize city names
    df_clean['city'] = df_clean['city'].str.title()

    # 5. Create customer segments
    df_clean['age_group'] = pd.cut(df_clean['age'],
                                   bins=[0, 25, 35, 50, 65, 120],
                                   labels=['18-25', '26-35', '36-50', '51-65', '65+'])

    df_clean['purchase_category'] = pd.cut(df_clean['purchase_amount'],
                                           bins=3,
                                           labels=['Low', 'Medium', 'High'])

    return df_clean

df_customers_clean = clean_customer_data(df_customers)

# Customer Analysis
customer_analysis = {
'total_customers': len(df_customers_clean),
'avg_age': df_customers_clean['age'].mean(),
'total_revenue': df_customers_clean['purchase_amount'].sum(),
'avg_purchase': df_customers_clean['purchase_amount'].mean()
}

age_group_analysis = df_customers_clean.groupby('age_group').agg({
'customer_id': 'count',
'purchase_amount': ['sum', 'mean']
}).round(2)

city_analysis = df_customers_clean.groupby('city').agg({
'customer_id': 'count',
'purchase_amount': ['sum', 'mean']
}).round(2)

print("\nCustomer Analysis Results:")


print(f"Clean dataset shape: {df_customers_clean.shape}")
print("\nAge Group Analysis:")
print(age_group_analysis)
print("\nCity Analysis:")
print(city_analysis)

Project 3: Time Series Financial Analysis


# Create financial time series data
np.random.seed(456)
start_date = '2020-01-01'
end_date = '2023-12-31'
dates = pd.date_range(start=start_date, end=end_date, freq='D')

# Simulate stock prices with trend and seasonality
base_price = 100
trend = 0.0002 # Small daily trend
seasonal = np.sin(np.arange(len(dates)) * 2 * np.pi / 365) * 5 # Annual seasonality
noise = np.random.normal(0, 2, len(dates))
returns = trend + seasonal + noise

# Generate cumulative prices
prices = [base_price]
for return_val in returns[1:]:
    prices.append(prices[-1] * (1 + return_val / 100))

df_financial = pd.DataFrame({
    'date': dates,
    'price': prices[:len(dates)],
    'volume': np.random.lognormal(10, 0.5, len(dates))
})

df_financial['date'] = pd.to_datetime(df_financial['date'])
df_financial = df_financial.set_index('date')

# Financial Analysis Functions
def calculate_returns(df):
    df['daily_return'] = df['price'].pct_change()
    df['log_return'] = np.log(df['price'] / df['price'].shift(1))
    return df

def calculate_moving_averages(df):
    df['ma_20'] = df['price'].rolling(window=20).mean()
    df['ma_50'] = df['price'].rolling(window=50).mean()
    df['ma_200'] = df['price'].rolling(window=200).mean()
    return df

def calculate_volatility(df, window=30):
    df[f'volatility_{window}d'] = df['daily_return'].rolling(window=window).std() * np.sqrt(252)
    return df

def calculate_technical_indicators(df):
    # RSI
    delta = df['price'].diff()
    gain = delta.where(delta > 0, 0).rolling(window=14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=14).mean()
    rs = gain / loss
    df['rsi'] = 100 - (100 / (1 + rs))

    # Bollinger Bands
    df['bb_middle'] = df['price'].rolling(window=20).mean()
    bb_std = df['price'].rolling(window=20).std()
    df['bb_upper'] = df['bb_middle'] + (bb_std * 2)
    df['bb_lower'] = df['bb_middle'] - (bb_std * 2)
    return df

# Apply analysis functions


df_financial = calculate_returns(df_financial)
df_financial = calculate_moving_averages(df_financial)
df_financial = calculate_volatility(df_financial)
df_financial = calculate_technical_indicators(df_financial)

# Performance metrics
def calculate_performance_metrics(df):
    total_return = (df['price'].iloc[-1] / df['price'].iloc[0] - 1) * 100
    annual_return = (df['price'].iloc[-1] / df['price'].iloc[0]) ** (365.25 / len(df)) - 1
    annual_volatility = df['daily_return'].std() * np.sqrt(252)
    sharpe_ratio = annual_return / annual_volatility
    max_drawdown = ((df['price'] / df['price'].expanding().max()) - 1).min()

    return {
        'total_return_%': round(total_return, 2),
        'annual_return_%': round(annual_return * 100, 2),
        'annual_volatility_%': round(annual_volatility * 100, 2),
        'sharpe_ratio': round(sharpe_ratio, 2),
        'max_drawdown_%': round(max_drawdown * 100, 2)
    }

performance_metrics = calculate_performance_metrics(df_financial)

# Monthly and yearly analysis


monthly_returns = df_financial['daily_return'].resample('M').apply(lambda x: (1 + x).prod() - 1)
yearly_returns = df_financial['daily_return'].resample('Y').apply(lambda x: (1 + x).prod() - 1)

print("Financial Analysis Results:")


print("Performance Metrics:")
for metric, value in performance_metrics.items():
print(f"{metric}: {value}")

print(f"\nBest month: {monthly_returns.idxmax().strftime('%Y-%m')} ({monthly_returns.max():.2%})")


print(f"Worst month: {monthly_returns.idxmin().strftime('%Y-%m')} ({monthly_returns.min():.2%})")

print("\nYearly Returns:")
for year, return_val in yearly_returns.items():
print(f"{[Link]}: {return_val:.2%}")

Project 4: E-commerce Data Pipeline


# Comprehensive E-commerce Analysis Pipeline
def create_ecommerce_dataset():
    np.random.seed(789)
    n_orders = 10000
    n_customers = 2000
    n_products = 500

    # Generate orders data
    orders_data = {
        'order_id': range(1, n_orders + 1),
        'customer_id': np.random.randint(1, n_customers + 1, n_orders),
        'product_id': np.random.randint(1, n_products + 1, n_orders),
        'order_date': pd.date_range('2022-01-01', '2023-12-31', periods=n_orders),
        'quantity': np.random.randint(1, 5, n_orders),
        'unit_price': np.random.uniform(10, 200, n_orders),
        'category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home', 'Sports'], n_orders),
        'shipping_cost': np.random.uniform(5, 25, n_orders),
        'discount_percent': np.random.choice([0, 5, 10, 15, 20], n_orders, p=[0.4, 0.25, 0.2, 0.1, 0.05])
    }

    df_orders = pd.DataFrame(orders_data)
    df_orders['order_date'] = pd.to_datetime(df_orders['order_date'])

    # Calculate derived columns
    df_orders['subtotal'] = df_orders['quantity'] * df_orders['unit_price']
    df_orders['discount_amount'] = df_orders['subtotal'] * (df_orders['discount_percent'] / 100)
    df_orders['total_amount'] = (df_orders['subtotal'] - df_orders['discount_amount']
                                 + df_orders['shipping_cost'])

    return df_orders

def comprehensive_ecommerce_analysis(df):
    analysis_results = {}

    # 1. Revenue Analysis
    analysis_results['total_revenue'] = df['total_amount'].sum()
    analysis_results['avg_order_value'] = df['total_amount'].mean()
    analysis_results['total_orders'] = len(df)

    # 2. Monthly Revenue Trends
    monthly_revenue = df.groupby(df['order_date'].dt.to_period('M')).agg({
        'total_amount': 'sum',
        'order_id': 'count'
    }).rename(columns={'order_id': 'order_count'})

    # 3. Category Performance
    category_performance = df.groupby('category').agg({
        'total_amount': ['sum', 'mean', 'count'],
        'quantity': 'sum',
        'discount_amount': 'sum'
    }).round(2)

    # 4. Customer Segmentation (RFM Analysis)
    customer_df = df.groupby('customer_id').agg({
        'order_date': ['max', 'count'],
        'total_amount': 'sum'
    })
    customer_df.columns = ['last_order_date', 'frequency', 'monetary']
    customer_df['recency'] = (df['order_date'].max() - customer_df['last_order_date']).dt.days

    # Create RFM segments
    customer_df['r_score'] = pd.qcut(customer_df['recency'], q=5, labels=[5, 4, 3, 2, 1])
    customer_df['f_score'] = pd.qcut(customer_df['frequency'].rank(method='first'), q=5, labels=[1, 2, 3, 4, 5])
    customer_df['m_score'] = pd.qcut(customer_df['monetary'], q=5, labels=[1, 2, 3, 4, 5])

    customer_df['rfm_score'] = customer_df['r_score'].astype(str) + \
                               customer_df['f_score'].astype(str) + \
                               customer_df['m_score'].astype(str)

    # Define customer segments
    def segment_customers(row):
        if row['rfm_score'] in ['555', '554', '544', '545', '454', '455', '445']:
            return 'Champions'
        elif row['rfm_score'] in ['543', '444', '435', '355', '354', '345', '344', '335']:
            return 'Loyal Customers'
        elif row['rfm_score'] in ['512', '511', '422', '421', '412', '411', '311']:
            return 'New Customers'
        elif row['rfm_score'] in ['155', '154', '144', '214', '215', '115', '114']:
            return 'At Risk'
        elif row['rfm_score'] in ['155', '254', '245']:
            return 'Cannot Lose Them'
        else:
            return 'Others'

    customer_df['segment'] = customer_df.apply(segment_customers, axis=1)

    # 5. Seasonal Analysis
    df['month'] = df['order_date'].dt.month
    df['quarter'] = df['order_date'].dt.quarter
    df['day_of_week'] = df['order_date'].dt.dayofweek

    seasonal_analysis = {
        'monthly': df.groupby('month')['total_amount'].sum(),
        'quarterly': df.groupby('quarter')['total_amount'].sum(),
        'weekly': df.groupby('day_of_week')['total_amount'].sum()
    }

    # 6. Product Analysis
    product_analysis = df.groupby('product_id').agg({
        'total_amount': 'sum',
        'quantity': 'sum',
        'order_id': 'count'
    }).sort_values('total_amount', ascending=False)

    return {
        'summary': analysis_results,
        'monthly_revenue': monthly_revenue,
        'category_performance': category_performance,
        'customer_segments': customer_df['segment'].value_counts(),
        'seasonal_analysis': seasonal_analysis,
        'top_products': product_analysis.head(10)
    }

# Execute the e-commerce analysis
df_ecommerce = create_ecommerce_dataset()
ecommerce_results = comprehensive_ecommerce_analysis(df_ecommerce)
print("E-commerce Analysis Results:")
print(f"Total Revenue: ${ecommerce_results['summary']['total_revenue']:,.2f}")
print(f"Average Order Value: ${ecommerce_results['summary']['avg_order_value']:.2f}")
print(f"Total Orders: {ecommerce_results['summary']['total_orders']:,}")

print("\nCustomer Segments:")
print(ecommerce_results['customer_segments'])

print("\nTop 5 Categories by Revenue:")
category_revenue = ecommerce_results['category_performance']['total_amount']['sum'].sort_values(ascending=False)
print(category_revenue.head())
Best Practices and Tips
1. Code Organization
# Always start with imports
import pandas as pd
import numpy as np

# Set pandas options at the beginning


pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)

# Use consistent naming conventions


df_sales = pd.read_csv('sales_data.csv')
df_customers = pd.read_csv('customer_data.csv')

# Chain operations for readability


result = (df
          .groupby('category')
          .agg({'sales': 'sum', 'quantity': 'mean'})
          .sort_values('sales', ascending=False)
          .head(10))
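
When a chain needs a step that is not a built-in method, .pipe() keeps the flow readable instead of breaking it into temporaries; a small sketch (add_tax and the 0.2 rate are illustrative, assuming the same sales/category columns as above):

def add_tax(frame, rate=0.2):
    # Return a copy with a derived column so the chain stays side-effect free
    return frame.assign(sales_with_tax=frame['sales'] * (1 + rate))

result = (df
          .pipe(add_tax, rate=0.2)
          .groupby('category')['sales_with_tax']
          .sum())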

2. Error Handling
# Always handle potential errors
try:
    df = pd.read_csv('data.csv')
except FileNotFoundError:
    print("File not found!")
except pd.errors.EmptyDataError:
    print("File is empty!")

# Check for required columns
required_columns = ['date', 'amount', 'category']
missing_columns = [col for col in required_columns if col not in df.columns]
if missing_columns:
    raise ValueError(f"Missing columns: {missing_columns}")

3. Data Validation
def validate_dataframe(df, required_columns=None, date_columns=None):
    """Validate DataFrame structure and content"""

    # Check if DataFrame is empty
    if df.empty:
        raise ValueError("DataFrame is empty")

    # Check required columns
    if required_columns:
        missing = set(required_columns) - set(df.columns)
        if missing:
            raise ValueError(f"Missing required columns: {missing}")

    # Validate date columns
    if date_columns:
        for col in date_columns:
            if col in df.columns:
                df[col] = pd.to_datetime(df[col], errors='coerce')
                if df[col].isnull().any():
                    print(f"Warning: Invalid dates found in {col}")

    return df

# Usage
df = validate_dataframe(df,
                        required_columns=['date', 'amount'],
                        date_columns=['date'])

4. Memory Management
# Use categorical data for repeated strings
df['category'] = df['category'].astype('category')

# Use appropriate numeric types


df['small_integers'] = df['small_integers'].astype('int8')

# Delete unnecessary DataFrames


del df_temp

# Use chunking for large files


def process_large_file(filename, chunk_size=10000):
    results = []
    for chunk in pd.read_csv(filename, chunksize=chunk_size):
        # Process chunk
        processed = chunk.groupby('key').sum()
        results.append(processed)

    return pd.concat(results, ignore_index=True)

Conclusion
This comprehensive guide covers pandas from basic concepts to advanced techniques used in
professional data analysis. The key to mastering pandas is:

1. Practice regularly with real datasets
2. Understand the underlying concepts before memorizing syntax
3. Focus on performance for large datasets
4. Write clean, readable code with proper error handling
5. Keep learning new features and best practices

Remember that pandas is constantly evolving, so stay updated with the latest versions and features. The
official pandas documentation is an excellent resource for detailed information on specific functions and
methods.

Happy data analyzing!
