Comprehensive Explanation of Pandas

Pandas is a powerful Python library that is widely used for data manipulation, analysis, and
exploration. It provides flexible and efficient data structures—primarily DataFrame and Series—
that allow you to work with structured data easily.

Pandas is built on top of NumPy and provides functionality for working with large datasets,
handling missing data, filtering, grouping, and merging datasets, among other features. Let’s
explore Pandas with examples and detailed explanations of key operations.

1. Pandas Data Structures

Pandas has two primary data structures:

- Series: A one-dimensional labeled array capable of holding any data type (integers, strings, floats, etc.).
- DataFrame: A two-dimensional labeled data structure, similar to a table in a database or an Excel spreadsheet, with rows and columns.

1.1 Series

A Pandas Series is like a column in a spreadsheet or a one-dimensional array in NumPy. It
consists of two main components: the data and the index.

Example 1: Creating a Series

python

import pandas as pd

# Creating a Series from a list
data = [10, 20, 30, 40, 50]
series = pd.Series(data, index=['a', 'b', 'c', 'd', 'e'])

print("Pandas Series:\n", series)

Output:

Pandas Series:
a    10
b    20
c    30
d    40
e    50
dtype: int64

Explanation:

- The Series is created from the list [10, 20, 30, 40, 50].
- The index is explicitly set to ['a', 'b', 'c', 'd', 'e'], allowing you to access elements using these labels, as sketched below.
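
For instance, the labels can be used directly for lookup and slicing. A minimal sketch, reusing the series object from Example 1:

python

# Label-based access on the Series above
print(series['c'])         # 30
print(series[['a', 'e']])  # subset of labels, returned as a new Series
print(series['b':'d'])     # label slicing includes both endpoints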

1.2 DataFrame

A Pandas DataFrame is a 2D data structure with rows and columns, similar to a table. It can hold
data of different types and allows for more complex operations like filtering, aggregation, and
reshaping.

Example 2: Creating a DataFrame

python

# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 70000, 80000]
}
df = pd.DataFrame(data)

print("Pandas DataFrame:\n", df)

Output:

Pandas DataFrame:
      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
3    David   40   80000

Explanation:

- The DataFrame is created from a dictionary where each key becomes a column name and each value is the list of data for that column.
- The DataFrame gets an automatic integer index (0, 1, 2, 3); you can also supply your own labels, as sketched below.
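
A minimal sketch of passing a custom index; the 'emp1'..'emp4' labels here are purely illustrative:

python

# Same data, explicit row labels instead of 0, 1, 2, 3
df_labeled = pd.DataFrame(data, index=['emp1', 'emp2', 'emp3', 'emp4'])
print(df_labeled.loc['emp2'])  # access a row by its label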

2. Basic Operations on DataFrames

You can perform a wide range of operations on DataFrames, such as accessing data, filtering, and
modifying the structure.

2.1 Accessing Data

You can access individual columns, rows, or elements from a DataFrame.


Example 3: Accessing Columns

python

# Accessing a single column
name_column = df['Name']

# Accessing multiple columns
age_salary = df[['Age', 'Salary']]

print("Name Column:\n", name_column)
print("\nAge and Salary Columns:\n", age_salary)

Output:

Name Column:
0      Alice
1        Bob
2    Charlie
3      David
Name: Name, dtype: object

Age and Salary Columns:
   Age  Salary
0   25   50000
1   30   60000
2   35   70000
3   40   80000

Explanation:

- Accessing a single column returns a Series.
- Accessing multiple columns returns a new DataFrame, as the quick check below confirms.
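
A quick, minimal check of the returned types, reusing df from Example 2:

python

print(type(df['Name']))             # <class 'pandas.core.series.Series'>
print(type(df[['Age', 'Salary']]))  # <class 'pandas.core.frame.DataFrame'>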

2.2 Accessing Rows with loc[] and iloc[]

You can access rows based on index labels using loc[] or by integer positions using iloc[].

Example 4: Accessing Rows

python

# Accessing rows using loc (label-based)
row_bob = df.loc[1]

# Accessing rows using iloc (position-based)
first_two_rows = df.iloc[0:2]

print("Row for Bob (loc):\n", row_bob)
print("\nFirst two rows (iloc):\n", first_two_rows)

Output:

Row for Bob (loc):
Name        Bob
Age          30
Salary    60000
Name: 1, dtype: object

First two rows (iloc):
    Name  Age  Salary
0  Alice   25   50000
1    Bob   30   60000

Explanation:

- loc[] is used for label-based indexing; here it returns the row whose index label is 1.
- iloc[] is used for integer-position indexing and returns the first two rows of the DataFrame. The distinction only matters once labels stop matching positions; see the sketch below.
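
A minimal sketch of where the two differ, assuming a Series with a non-default integer index:

python

# With these labels, label 20 and position 2 point at different elements
s = pd.Series(['x', 'y', 'z'], index=[10, 20, 30])
print(s.loc[20])   # 'y' -- looks up the LABEL 20
print(s.iloc[2])   # 'z' -- looks up POSITION 2
# Note: loc slices include the end label; iloc slices exclude the end position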

3. Modifying DataFrames

Pandas allows you to add new columns, modify existing ones, or drop rows/columns from the
DataFrame.

3.1 Adding New Columns

You can easily add new columns to a DataFrame by assigning a new column name and values.

Example 5: Adding a New Column

python

# Adding a new column 'Bonus' to the DataFrame
df['Bonus'] = [5000, 6000, 7000, 8000]

print("DataFrame with New Column:\n", df)

Output:

DataFrame with New Column:
      Name  Age  Salary  Bonus
0    Alice   25   50000   5000
1      Bob   30   60000   6000
2  Charlie   35   70000   7000
3    David   40   80000   8000

Explanation:

- A new column Bonus is added to the DataFrame, holding the values [5000, 6000, 7000, 8000].

3.2 Dropping Rows or Columns

You can drop rows or columns from a DataFrame using the drop() method.

Example 6: Dropping a Column

python

# Dropping the 'Bonus' column
df_dropped = df.drop('Bonus', axis=1)

print("DataFrame after Dropping 'Bonus' Column:\n", df_dropped)

Output:

DataFrame after Dropping 'Bonus' Column:
      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
3    David   40   80000

Explanation:

- The drop() method is used to remove the Bonus column.
- The parameter axis=1 specifies that a column is being dropped. To drop a row instead, pass the row's index label with axis=0, as sketched below.
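
A minimal sketch of dropping a row by its index label (drop() returns a new DataFrame and leaves df unchanged):

python

# Drop the row whose index label is 0 (Alice)
df_no_first = df.drop(0, axis=0)
print(df_no_first)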

4. Handling Missing Data

Missing data is common in real-world datasets. Pandas provides several methods to handle missing
values, such as filling them or dropping rows/columns with missing values.

4.1 Detecting Missing Data

You can detect missing data using the isnull() function, which returns a boolean DataFrame.

Example 7: Detecting Missing Data

python

# Introducing missing values
df.loc[1, 'Salary'] = None

# Checking for missing values
missing_data = df.isnull()

print("Missing Data:\n", missing_data)

Output:

Missing Data:
    Name    Age  Salary  Bonus
0  False  False   False  False
1  False  False    True  False
2  False  False   False  False
3  False  False   False  False

Explanation:

- A missing value is introduced at index 1 for the Salary column.
- isnull() returns a boolean DataFrame where True indicates a missing value; summing it gives a per-column count, as sketched below.
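
A common follow-up is to count the missing values per column by summing the boolean DataFrame:

python

# True counts as 1, so the column sums are the missing-value counts
print(df.isnull().sum())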

4.2 Filling Missing Data

You can fill missing values using the fillna() method.

Example 8: Filling Missing Data

python

# Filling missing values in the 'Salary' column with the mean
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

print("DataFrame after Filling Missing Values:\n", df)

Output:

DataFrame after Filling Missing Values:
      Name  Age    Salary  Bonus
0    Alice   25   50000.0   5000
1      Bob   30  66666.67   6000
2  Charlie   35   70000.0   7000
3    David   40   80000.0   8000

Explanation:

- The missing value in the Salary column is filled with the mean salary using fillna().
- The mean is computed over the non-missing values: (50000 + 70000 + 80000) / 3 = 66666.67. Other strategies are sketched below.
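
The mean is only one choice; a minimal sketch of other common strategies (each returns a new object unless you reassign it):

python

df['Salary'].fillna(df['Salary'].median())  # median is more robust to outliers
df['Salary'].ffill()                        # carry the previous value forward
df.dropna(subset=['Salary'])                # or drop rows with a missing Salary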

5. Filtering and Conditional Selection

You can filter rows in a DataFrame based on specific conditions, similar to SQL's WHERE clause.

5.1 Filtering Rows Based on Condition

You can filter rows based on column values using boolean indexing.

Example 9: Filtering Rows

python

# Filtering rows where Age > 30
filtered_df = df[df['Age'] > 30]

print("Rows where Age > 30:\n", filtered_df)

Output:

Rows where Age > 30:
      Name  Age   Salary  Bonus
2  Charlie   35  70000.0   7000
3    David   40  80000.0   8000

Explanation:

- The DataFrame is filtered to show only the rows where Age is greater than 30. Conditions can also be combined, as sketched below.
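
Conditions combine with & (and), | (or), and ~ (not); each condition needs its own parentheses. A minimal sketch:

python

# Rows where Age > 30 AND Salary < 80000
both = df[(df['Age'] > 30) & (df['Salary'] < 80000)]

# Rows where Age < 28 OR Age > 38
either = df[(df['Age'] < 28) | (df['Age'] > 38)]
print(both)
print(either)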

6. Grouping and Aggregation

Pandas allows you to group data by one or more columns and apply aggregate functions like sum(),
mean(), count(), etc.

6.1 Grouping Data

You can group data by specific columns using the groupby() function and apply aggregate
operations.

Example 10: Grouping and Aggregation

python

# Creating a new DataFrame
data = {
    'Department': ['HR', 'IT', 'HR', 'IT', 'Sales', 'Sales'],
    'Salary': [50000, 60000, 55000, 65000, 45000, 52000]
}
df_dept = pd.DataFrame(data)

# Grouping by 'Department' and calculating the mean salary
grouped = df_dept.groupby('Department').mean()

print("Grouped by Department:\n", grouped)

Output:

Grouped by Department:
             Salary
Department
HR          52500.0
IT          62500.0
Sales       48500.0

Explanation:

- The data is grouped by the Department column, and the mean salary for each department is calculated with mean(). Several aggregates can be applied at once, as sketched below.
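
A minimal sketch of computing several aggregates in one pass with agg(), on the same grouped data:

python

# Mean, min, max, and group size per department
summary = df_dept.groupby('Department')['Salary'].agg(['mean', 'min', 'max', 'count'])
print(summary)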

Conclusion

Pandas is an essential tool for data analysis and manipulation. Key features include:

- Data structures: Series (1D) and DataFrame (2D).
- Basic operations: Accessing and modifying data.
- Handling missing data: Detecting, filling, or dropping missing values.
- Filtering: Selecting rows based on conditions.
- Grouping and aggregation: Grouping data by one or more columns and applying aggregate functions.

Pandas makes it easy to handle and analyze large datasets efficiently, making it a fundamental
library for data science and machine learning workflows.

Pandas Course with Examples and Exercises

Table of Contents
1. Introduction to Pandas
2. Data Structures
3. Data Input/Output
4. Data Selection and Indexing
5. Data Cleaning
6. Data Transformation
7. Grouping and Aggregation
8. Merging and Joining
9. Time Series
10. Advanced Operations

1. Introduction to Pandas

What is Pandas?
Pandas is a powerful Python library for data manipulation and analysis. It provides data
structures and operations for manipulating numerical tables and time series.

Installation
bash

pip install pandas numpy matplotlib

Basic Setup
python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

print(f"Pandas version: {pd.__version__}")


2. Data Structures

Series
A one-dimensional labeled array
python

# Creating Series
s1 = pd.Series([1, 3, 5, np.nan, 6, 8])
s2 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s3 = pd.Series({'a': 1, 'b': 2, 'c': 3})

print("Series 1:")
print(s1)
print("\nSeries 2:")
print(s2)

DataFrame
A two-dimensional labeled data structure
python

# Creating DataFrames
df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'London', 'Paris', 'Tokyo']
})

df2 = pd.DataFrame(
    np.random.randn(6, 4),
    index=pd.date_range('20230101', periods=6),
    columns=['A', 'B', 'C', 'D']
)

print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)

Exercise 1: Create Data Structures
python

# Create a Series of temperatures for 5 days
# Create a DataFrame with student information (name, grade, subject)
# Your code here
<details> <summary>Solution</summary>
python

# Series
temperatures = pd.Series([72, 68, 75, 80, 78],
                         index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'])

# DataFrame
students = pd.DataFrame({
    'Name': ['John', 'Sarah', 'Mike', 'Emma'],
    'Grade': [85, 92, 78, 95],
    'Subject': ['Math', 'Science', 'Math', 'English']
})

print("Temperatures:")
print(temperatures)
print("\nStudents:")
print(students)
</details>

3. Data Input/Output

Reading Data
python

# Create sample data first
sample_data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(sample_data)

# Save to different formats ('data.csv' and 'data.xlsx' are illustrative
# filenames; writing Excel requires an engine such as openpyxl)
df.to_csv('data.csv', index=False)
df.to_excel('data.xlsx', index=False)

# Read from different formats
df_csv = pd.read_csv('data.csv')
df_excel = pd.read_excel('data.xlsx')

print("From CSV:")
print(df_csv)
print("\nFrom Excel:")
print(df_excel)

Exercise 2: File Operations
python

# Create a DataFrame with product information and save it to CSV
# Read the CSV file back and display it
# Your code here
<details> <summary>Solution</summary>
python

# Create DataFrame
products = pd.DataFrame({
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'Price': [1000, 25, 75, 300],
    'Stock': [15, 100, 50, 25]
})

# Save to CSV ('products.csv' is an illustrative filename)
products.to_csv('products.csv', index=False)

# Read from CSV
products_read = pd.read_csv('products.csv')

print("Products DataFrame:")
print(products_read)
</details>

4. Data Selection and Indexing

Basic Selection
python

# Sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 28],
    'Salary': [50000, 60000, 70000, 80000, 55000],
    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR']
})

# Column selection
print("Names:", df['Name'].tolist())
print("Ages and Salaries:\n", df[['Age', 'Salary']])

# Row selection
print("First row:\n", df.iloc[0])
print("First 3 rows:\n", df.head(3))

# Boolean indexing
print("IT Department:\n", df[df['Department'] == 'IT'])
print("High Salary:\n", df[df['Salary'] > 60000])

loc and iloc
python

# Using loc (label-based)
print("loc example:\n", df.loc[0:2, ['Name', 'Age']])

# Using iloc (position-based)
print("iloc example:\n", df.iloc[0:3, 0:2])

# Setting index
df_indexed = df.set_index('Name')
print("With Name as index:\n", df_indexed.loc[['Alice', 'Charlie'], ['Age', 'Salary']])

Exercise 3: Data Selection
python

# Given the DataFrame above:
# 1. Select employees in HR department
# 2. Select employees aged 30-40
# 3. Select Name and Department columns for first 2 employees
# Your code here
<details> <summary>Solution</summary>
python

# 1. HR department
hr_employees = df[df['Department'] == 'HR']
print("HR Employees:\n", hr_employees)

# 2. Age 30-40
age_range = df[(df['Age'] >= 30) & (df['Age'] <= 40)]
print("\nEmployees aged 30-40:\n", age_range)

# 3. Name and Department for first 2 (loc slicing includes the end label)
name_dept = df.loc[0:1, ['Name', 'Department']]
print("\nFirst 2 employees Name and Department:\n", name_dept)
</details>

5. Data Cleaning

Handling Missing Data
python

# Create DataFrame with missing values
df_missing = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [10, 20, 30, 40]
})

print("Original DataFrame:")
print(df_missing)

# Check for missing values
print("\nMissing values:")
print(df_missing.isnull())
print("\nSum of missing values:")
print(df_missing.isnull().sum())

# Handling missing values
print("\nFill with 0:")
print(df_missing.fillna(0))

print("\nDrop rows with missing values:")
print(df_missing.dropna())

print("\nFill with mean:")
print(df_missing.fillna(df_missing.mean()))

Data Type Conversion
python

# Type conversion
df_types = pd.DataFrame({
    'Strings': ['1', '2', '3'],
    'Numbers': [1.1, 2.2, 3.3],
    'Integers': [1, 2, 3]
})

print("Original types:")
print(df_types.dtypes)

# Convert types (note: casting floats to int truncates the decimals)
df_types['Strings'] = df_types['Strings'].astype(int)
df_types['Numbers'] = df_types['Numbers'].astype(int)

print("\nAfter conversion:")
print(df_types.dtypes)

Exercise 4: Data Cleaning
python

# Create a DataFrame with missing values and:
# 1. Identify all missing values
# 2. Fill numerical columns with mean
# 3. Drop any row that has more than 2 missing values
# Your code here
<details> <summary>Solution</summary>
python

# Create DataFrame with missing values
df_exercise = pd.DataFrame({
    'Product': ['A', 'B', 'C', 'D', 'E'],
    'Price': [100, np.nan, 150, np.nan, 200],
    'Quantity': [10, 15, np.nan, np.nan, 25],
    'Rating': [4.5, 4.0, np.nan, 3.5, 5.0]
})

print("Original DataFrame:")
print(df_exercise)

# 1. Identify missing values
print("\nMissing values:")
print(df_exercise.isnull().sum())

# 2. Fill numerical columns with mean
df_filled = df_exercise.copy()
numeric_cols = ['Price', 'Quantity', 'Rating']
df_filled[numeric_cols] = df_filled[numeric_cols].fillna(df_filled[numeric_cols].mean())

print("\nAfter filling with mean:")
print(df_filled)

# 3. Drop rows with more than 2 missing values
# thresh is the minimum number of non-missing values a row must keep;
# requiring ncols - 2 non-missing drops rows with 3 or more missing values
df_cleaned = df_exercise.dropna(thresh=len(df_exercise.columns) - 2)
print("\nAfter dropping rows with >2 missing values:")
print(df_cleaned)
</details>

6. Data Transformation

Applying Functions
python

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Salary': [50000, 60000, 70000]
})

# Apply function to column
df['Salary_Adjusted'] = df['Salary'].apply(lambda x: x * 1.1)
df['Name_Length'] = df['Name'].apply(len)

print("After applying functions:")
print(df)

# Using map for element-wise transformations
df['Grade'] = pd.Series(['A', 'B', 'C'])
grade_map = {'A': 'Excellent', 'B': 'Good', 'C': 'Average'}
df['Grade_Description'] = df['Grade'].map(grade_map)

print("\nAfter mapping:")
print(df)

String Operations
python

# String operations
df_strings = pd.DataFrame({
    'Text': ['hello world', 'pandas tutorial', 'data science'],
    'Name': ['alice smith', 'BOB JONES', 'Charlie Brown']
})

df_strings['Text_Upper'] = df_strings['Text'].str.upper()
df_strings['Name_Proper'] = df_strings['Name'].str.title()
df_strings['Word_Count'] = df_strings['Text'].str.split().str.len()

print("String operations:")
print(df_strings)

Exercise 5: Data Transformation
python

# Create a DataFrame with product names and prices:
# 1. Convert product names to uppercase
# 2. Apply 15% discount to prices
# 3. Create a category based on price (Low: <50, Medium: 50-100, High: >100)
# Your code here
<details> <summary>Solution</summary>
python

# Create DataFrame
products = pd.DataFrame({
    'Product': ['laptop', 'mouse', 'keyboard', 'monitor'],
    'Price': [1000, 25, 75, 300]
})

print("Original:")
print(products)

# 1. Convert to uppercase
products['Product_Upper'] = products['Product'].str.upper()

# 2. Apply discount
products['Price_Discounted'] = products['Price'].apply(lambda x: x * 0.85)

# 3. Create categories
def price_category(price):
    if price < 50:
        return 'Low'
    elif price <= 100:
        return 'Medium'
    else:
        return 'High'

products['Price_Category'] = products['Price'].apply(price_category)

print("\nAfter transformations:")
print(products)
</details>

7. Grouping and Aggregation

Basic Grouping
python

# Sample sales data
sales_data = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'South', 'East', 'West'],
    'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [1000, 1500, 1200, 1800, 900, 2000],
    'Profit': [200, 300, 250, 350, 180, 400]
})

print("Original data:")
print(sales_data)

# Basic grouping
region_sales = sales_data.groupby('Region')['Sales'].sum()
print("\nTotal sales by region:")
print(region_sales)

# Multiple aggregations
region_stats = sales_data.groupby('Region').agg({
    'Sales': ['sum', 'mean', 'count'],
    'Profit': ['sum', 'mean']
})
print("\nRegional statistics:")
print(region_stats)

Pivot Tables
python

# Pivot table
pivot_sales = sales_data.pivot_table(
    values='Sales',
    index='Region',
    columns='Product',
    aggfunc='sum',
    fill_value=0
)
print("Pivot table:")
print(pivot_sales)
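
A minimal extension: pivot_table can also append grand totals via the margins flag:

python

# Same pivot with an extra 'All' row and column of totals
pivot_with_totals = sales_data.pivot_table(
    values='Sales', index='Region', columns='Product',
    aggfunc='sum', fill_value=0, margins=True
)
print(pivot_with_totals)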

Exercise 6: Grouping and Aggregation
python

# Using the sales data:
# 1. Calculate average sales and profit by product
# 2. Find total sales by region
# 3. Create a pivot table showing average profit by region and product
# Your code here
<details> <summary>Solution</summary>
python

# 1. Average by product
product_avg = sales_data.groupby('Product').agg({
    'Sales': 'mean',
    'Profit': 'mean'
}).round(2)
print("Average by product:")
print(product_avg)

# 2. Total sales by region
region_total = sales_data.groupby('Region')['Sales'].sum()
print("\nTotal sales by region:")
print(region_total)

# 3. Pivot table
pivot_profit = sales_data.pivot_table(
    values='Profit',
    index='Region',
    columns='Product',
    aggfunc='mean',
    fill_value=0
).round(2)
print("\nAverage profit by region and product:")
print(pivot_profit)
</details>

8. Merging and Joining

Concatenation
python

# Create sample DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# Concatenation
result_concat = pd.concat([df1, df2], ignore_index=True)
print("Concatenation:")
print(result_concat)

Merging
python

# DataFrames for merging
employees = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'DeptID': [101, 102, 101, 103]
})

departments = pd.DataFrame({
    'DeptID': [101, 102, 104],
    'DeptName': ['IT', 'HR', 'Finance']
})

print("Employees:")
print(employees)
print("\nDepartments:")
print(departments)

# Inner join
inner_join = pd.merge(employees, departments, on='DeptID', how='inner')
print("\nInner join:")
print(inner_join)

# Left join
left_join = pd.merge(employees, departments, on='DeptID', how='left')
print("\nLeft join:")
print(left_join)
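
For completeness, a minimal sketch of the remaining join types on the same two frames:

python

# Right join keeps every department; outer join keeps all rows from both sides
right_join = pd.merge(employees, departments, on='DeptID', how='right')
outer_join = pd.merge(employees, departments, on='DeptID', how='outer')
print("\nRight join:")
print(right_join)
print("\nOuter join:")
print(outer_join)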

Exercise 7: Merging Data
python

# Create two DataFrames: customers and orders
# Merge them to show customer information with their orders
# Your code here
<details> <summary>Solution</summary>
python

# Create DataFrames
customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Email': ['alice@email.com', 'bob@email.com', 'charlie@email.com', 'david@email.com']
})

orders = pd.DataFrame({
    'OrderID': [101, 102, 103, 104],
    'CustomerID': [1, 2, 1, 5],
    'Amount': [150, 200, 75, 300]
})

print("Customers:")
print(customers)
print("\nOrders:")
print(orders)

# Left join to see all customers and their orders
customer_orders = pd.merge(customers, orders, on='CustomerID', how='left')
print("\nCustomer orders (left join):")
print(customer_orders)

# Inner join to see only customers with orders
active_customers = pd.merge(customers, orders, on='CustomerID', how='inner')
print("\nActive customers (inner join):")
print(active_customers)
</details>

9. Time Series

Time Series Operations
python

# Create time series data
dates = pd.date_range('2023-01-01', periods=10, freq='D')
ts_data = pd.DataFrame({
    'Date': dates,
    'Sales': np.random.randint(100, 500, 10),
    'Temperature': np.random.randint(60, 90, 10)
})

ts_data = ts_data.set_index('Date')
print("Time series data:")
print(ts_data)

# Resampling
weekly_sales = ts_data['Sales'].resample('W').mean()
print("\nWeekly average sales:")
print(weekly_sales)

# 3-day rolling average
ts_data['Sales_3D_Avg'] = ts_data['Sales'].rolling(window=3).mean()
print("\nWith rolling average:")
print(ts_data)
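
Two other minimal building blocks for time series are shift() and diff(), which align a column against its own past values:

python

# Day-over-day change in sales, computed two equivalent ways
ts_data['Sales_Change'] = ts_data['Sales'] - ts_data['Sales'].shift(1)
ts_data['Sales_Diff'] = ts_data['Sales'].diff()
print(ts_data[['Sales', 'Sales_Change', 'Sales_Diff']])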

Exercise 8: Time Series Analysis
python

# Create a time series of stock prices and:
# 1. Calculate daily returns
# 2. Compute 5-day moving average
# 3. Resample to weekly data
# Your code here
<details> <summary>Solution</summary>
python

# Create stock price data (np.random.normal is an assumed reconstruction
# of the noise term here)
dates = pd.date_range('2023-01-01', periods=20, freq='D')
stock_prices = pd.DataFrame({
    'Date': dates,
    'Price': [100 + i*2 + np.random.normal(0, 5) for i in range(20)]
})
stock_prices = stock_prices.set_index('Date')

print("Stock prices:")
print(stock_prices.head())

# 1. Daily returns
stock_prices['Daily_Return'] = stock_prices['Price'].pct_change() * 100

# 2. 5-day moving average
stock_prices['5D_MA'] = stock_prices['Price'].rolling(window=5).mean()

# 3. Weekly resampling
weekly_data = stock_prices['Price'].resample('W').mean()
print("\nWith calculations:")
print(stock_prices.head(10))
print("\nWeekly data:")
print(weekly_data)
</details>

10. Advanced Operations

MultiIndex
python

# Create MultiIndex DataFrame
arrays = [
    ['North', 'North', 'South', 'South'],
    ['Q1', 'Q2', 'Q1', 'Q2']
]
index = pd.MultiIndex.from_arrays(arrays, names=['Region', 'Quarter'])
multi_df = pd.DataFrame({
    'Sales': [1000, 1200, 800, 900],
    'Profit': [200, 250, 150, 180]
}, index=index)

print("MultiIndex DataFrame:")
print(multi_df)

# Accessing MultiIndex data
print("\nNorth region data:")
print(multi_df.loc['North'])

print("\nQ1 data across all regions:")
print(multi_df.xs('Q1', level='Quarter'))

Performance Optimization
python

# Vectorized operations vs apply
df = pd.DataFrame({
    'A': np.random.randint(1, 100, 1000),
    'B': np.random.randint(1, 100, 1000)
})

# Vectorized operation (fast)
df['C_vectorized'] = df['A'] + df['B']

# Using apply (slower)
df['C_apply'] = df.apply(lambda row: row['A'] + row['B'], axis=1)

print("First 5 rows:")
print(df.head())
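
To see the gap, a minimal timing sketch using the standard-library timeit module (absolute numbers vary by machine):

python

import timeit

# Time 100 runs of each approach on the df defined above
t_vec = timeit.timeit(lambda: df['A'] + df['B'], number=100)
t_apply = timeit.timeit(lambda: df.apply(lambda r: r['A'] + r['B'], axis=1), number=100)
print(f"vectorized: {t_vec:.4f}s, apply: {t_apply:.4f}s")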

Exercise 9: Advanced Operations
python

# Create a MultiIndex DataFrame for student grades
# Perform various operations on the hierarchical index
# Your code here
<details> <summary>Solution</summary>
python

# Create MultiIndex DataFrame
schools = ['School_A', 'School_A', 'School_B', 'School_B']
grades = ['10th', '11th', '10th', '11th']
index = pd.MultiIndex.from_arrays([schools, grades], names=['School', 'Grade'])

grades_df = pd.DataFrame({
    'Math': [85, 88, 82, 90],
    'Science': [92, 87, 85, 88],
    'English': [78, 85, 80, 87]
}, index=index)

print("Grades DataFrame:")
print(grades_df)

# Access different levels
print("\nSchool_A data:")
print(grades_df.loc['School_A'])

print("\n10th grade across all schools:")
print(grades_df.xs('10th', level='Grade'))

# Calculate average by school
school_avg = grades_df.groupby('School').mean()
print("\nAverage by school:")
print(school_avg.round(2))
</details>

Final Project Exercise

Comprehensive Data Analysis
python

"""
Create a complete data analysis pipeline:
1. Load sample sales data
2. Clean and preprocess the data
3. Perform exploratory analysis
4. Create summary statistics and visualizations
"""

# Your final project code here


<details> <summary>Solution</summary>
python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# 1. Create sample sales data
np.random.seed(42)
dates = pd.date_range('2023-01-01', '2023-12-31', freq='D')
products = ['Product_A', 'Product_B', 'Product_C', 'Product_D']

sales_data = pd.DataFrame({
    'Date': np.random.choice(dates, 1000),
    'Product': np.random.choice(products, 1000),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], 1000),
    'Sales': np.random.randint(50, 500, 1000),
    'Cost': np.random.randint(20, 200, 1000)
})

# Add some missing values
sales_data.loc[np.random.choice(sales_data.index, 50), 'Sales'] = np.nan
print("Original data shape:", sales_data.shape)
print("\nFirst 5 rows:")
print(sales_data.head())

# 2. Data cleaning
print("\nMissing values before cleaning:")
print(sales_data.isnull().sum())

# Fill missing sales with product average
sales_data['Sales'] = sales_data.groupby('Product')['Sales'].transform(
    lambda x: x.fillna(x.mean())
)

# Calculate profit
sales_data['Profit'] = sales_data['Sales'] - sales_data['Cost']
sales_data['Profit_Margin'] = (sales_data['Profit'] / sales_data['Sales']) * 100

print("\nAfter cleaning:")
print(sales_data.isnull().sum())

# 3. Exploratory analysis
print("\nBasic statistics:")
print(sales_data[['Sales', 'Cost', 'Profit']].describe())

# Monthly sales trend
sales_data['Month'] = sales_data['Date'].dt.month
monthly_sales = sales_data.groupby('Month')['Sales'].sum()

# Product performance
product_performance = sales_data.groupby('Product').agg({
    'Sales': 'sum',
    'Profit': 'sum',
    'Profit_Margin': 'mean'
}).round(2)

print("\nProduct Performance:")
print(product_performance)

# Regional analysis
regional_analysis = sales_data.groupby('Region').agg({
    'Sales': ['sum', 'mean'],
    'Profit': 'sum'
}).round(2)
print("\nRegional Analysis:")
print(regional_analysis)

# 4. Visualization
plt.figure(figsize=(15, 10))

# Subplot 1: Monthly sales
plt.subplot(2, 2, 1)
monthly_sales.plot(kind='bar', color='skyblue')
plt.title('Monthly Sales')
plt.xlabel('Month')
plt.ylabel('Total Sales')

# Subplot 2: Product sales
plt.subplot(2, 2, 2)
product_performance['Sales'].plot(kind='pie', autopct='%1.1f%%')
plt.title('Sales by Product')

# Subplot 3: Regional profit
plt.subplot(2, 2, 3)
regional_analysis[('Profit', 'sum')].plot(kind='bar', color='lightgreen')
plt.title('Total Profit by Region')
plt.ylabel('Profit')

# Subplot 4: Profit margin by product
plt.subplot(2, 2, 4)
product_performance['Profit_Margin'].plot(kind='bar', color='orange')
plt.title('Average Profit Margin by Product')
plt.ylabel('Profit Margin (%)')

plt.tight_layout()
plt.show()

# Summary
print("\n=== ANALYSIS SUMMARY ===")
print(f"Total Sales: ${sales_data['Sales'].sum():,.2f}")
print(f"Total Profit: ${sales_data['Profit'].sum():,.2f}")
print(f"Average Profit Margin: {sales_data['Profit_Margin'].mean():.2f}%")
best_product = product_performance['Profit'].idxmax()
print(f"Best Performing Product: {best_product}")
</details>
