Pandas Chatgpt
Pandas is a powerful Python library that is widely used for data manipulation, analysis, and
exploration. It provides flexible and efficient data structures—primarily DataFrame and Series—
that allow you to work with structured data easily.
Pandas is built on top of NumPy and provides functionality for working with large datasets,
handling missing data, filtering, grouping, and merging datasets, among other features. Let’s
explore Pandas with examples and detailed explanations of key operations.
Series: A one-dimensional labeled array capable of holding any data type (integers, strings,
floats, etc.).
DataFrame: A two-dimensional labeled data structure, similar to a table in a database or an
Excel spreadsheet, with rows and columns.
1.1 Series

```python
import pandas as pd

# Creating a Series with an explicit index
s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
print("Pandas Series:")
print(s)
```

Output:

```
Pandas Series:
a    10
b    20
c    30
d    40
e    50
dtype: int64
```

Explanation:
The Series is created from the list [10, 20, 30, 40, 50].
The index is explicitly set to ['a', 'b', 'c', 'd', 'e'], allowing you to access elements using these labels.
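Because the index is labeled, elements can be read by label as well as sliced by label; a minimal sketch:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# Access a single element by its label
print(s['a'])

# Label-based slicing includes both endpoints (unlike integer slicing)
print(s['b':'d'])
```

Note that `s['b':'d']` returns the rows for 'b', 'c', and 'd' — label slices are inclusive of the end label.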
1.2 DataFrame
A Pandas DataFrame is a 2D data structure with rows and columns, similar to a table. It can hold
data of different types and allows for more complex operations like filtering, aggregation, and
reshaping.
```python
# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 70000, 80000]
}
df = pd.DataFrame(data)
print("Pandas DataFrame:")
print(df)
```

Output:

```
Pandas DataFrame:
      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
3    David   40   80000
```
Explanation:
The DataFrame is created from a dictionary where each key is a column and each value is a
list of data for that column.
The DataFrame has an automatic index (0, 1, 2, 3).
You can perform a wide range of operations on DataFrames, such as accessing data, filtering, and
modifying the structure.
```python
# Accessing a single column
name_column = df['Name']
print("Name Column:")
print(name_column)
```

Output:

```
Name Column:
0      Alice
1        Bob
2    Charlie
3      David
Name: Name, dtype: object
```
Explanation:
You can access rows based on index labels using loc[] or by integer positions using iloc[].
```python
# Accessing rows using loc (label-based)
row_bob = df.loc[1]
print("Row for Bob (loc):")
print(row_bob)
```

Output:

```
Row for Bob (loc):
Name        Bob
Age          30
Salary    60000
Name: 1, dtype: object
```
Explanation:
loc[] is used for label-based indexing, and it returns the row where the index label is 1.
iloc[] is used for integer-based indexing; for example, df.iloc[0:2] returns the first two rows of the DataFrame.
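A small self-contained sketch of integer-position indexing with iloc[], using the same data as above:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 70000, 80000]
})

# iloc uses integer positions; the end of the slice is excluded
first_two = df.iloc[0:2]
print(first_two)
```

Unlike loc[] label slices, iloc[] slices follow normal Python slicing rules, so `0:2` yields positions 0 and 1 only.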
3. Modifying DataFrames
Pandas allows you to add new columns, modify existing ones, or drop rows/columns from the
DataFrame.
You can easily add new columns to a DataFrame by assigning a new column name and values.
```python
# Adding a new column 'Bonus' to the DataFrame
df['Bonus'] = [5000, 6000, 7000, 8000]
print("DataFrame with New Column:")
print(df)
```

Output:

```
DataFrame with New Column:
      Name  Age  Salary  Bonus
0    Alice   25   50000   5000
1      Bob   30   60000   6000
2  Charlie   35   70000   7000
3    David   40   80000   8000
```
Explanation:
A new column Bonus is added to the DataFrame, and it holds the values [5000, 6000,
7000, 8000].
3.2 Dropping Rows or Columns
You can drop rows or columns from a DataFrame using the drop() method.
```python
# Dropping the 'Bonus' column
df_dropped = df.drop('Bonus', axis=1)
print("DataFrame after Dropping 'Bonus' Column:")
print(df_dropped)
```

Output:

```
DataFrame after Dropping 'Bonus' Column:
      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
3    David   40   80000
```
Explanation:
drop() returns a new DataFrame without the 'Bonus' column; the original df is left unchanged unless you pass inplace=True.
Missing data is common in real-world datasets. Pandas provides several methods for handling missing values, such as filling them or dropping rows/columns that contain them.
You can detect missing data using the isnull() function, which returns a boolean DataFrame.
```python
# Introducing a missing value
df.loc[1, 'Salary'] = None

# Detecting missing data
print("Missing Data:")
print(df.isnull())
```

Output:

```
Missing Data:
    Name    Age  Salary  Bonus
0  False  False   False  False
1  False  False    True  False
2  False  False   False  False
3  False  False   False  False
```
Explanation:
isnull() returns a boolean DataFrame in which True marks a missing value; here only Bob's Salary is missing.

```python
# Filling missing values in the 'Salary' column with the mean
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
print("DataFrame after Filling Missing Values:")
print(df)
```

Output:

```
DataFrame after Filling Missing Values:
      Name  Age    Salary  Bonus
0    Alice   25  50000.00   5000
1      Bob   30  66666.67   6000
2  Charlie   35  70000.00   7000
3    David   40  80000.00   8000
```
Explanation:
The missing value in the Salary column is filled with the mean salary using fillna().
The mean salary is calculated as (50000 + 70000 + 80000) / 3 = 66666.67.
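As mentioned above, rows containing missing values can also be dropped instead of filled; a minimal sketch of dropna():

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Salary': [50000.0, np.nan, 70000.0]
})

# Drop every row that contains at least one missing value
df_clean = df.dropna()
print(df_clean)
```

Here Bob's row is removed because his Salary is NaN; dropna() also accepts `axis=1` to drop columns instead.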
You can filter rows in a DataFrame based on column values using boolean indexing, similar to SQL's WHERE clause.
Example 9: Filtering Rows
```python
# Filtering rows where Age > 30
filtered_df = df[df['Age'] > 30]
print("Rows where Age > 30:")
print(filtered_df)
```

Output:

```
Rows where Age > 30:
      Name  Age   Salary  Bonus
2  Charlie   35  70000.0   7000
3    David   40  80000.0   8000
```
Explanation:
The DataFrame is filtered to show only rows where the Age is greater than 30.
Pandas allows you to group data by one or more columns with groupby() and apply aggregate functions such as sum(), mean(), and count().
```python
# Creating a new DataFrame
data = {
    'Department': ['HR', 'IT', 'HR', 'IT', 'Sales', 'Sales'],
    'Salary': [50000, 60000, 55000, 65000, 45000, 52000]
}
df_dept = pd.DataFrame(data)

# Mean salary per department
grouped = df_dept.groupby('Department').mean()
print("Grouped by Department:")
print(grouped)
```

Output:

```
Grouped by Department:
             Salary
Department
HR          52500.0
IT          62500.0
Sales       48500.0
```
Explanation:
The data is grouped by the Department column, and the mean salary for each department is
calculated using mean().
Conclusion
Pandas is an essential tool for data analysis and manipulation. Key features include the Series and DataFrame structures, missing-data handling, filtering, grouping, and merging. Pandas makes it easy to handle and analyze large datasets efficiently, making it a fundamental library for data science and machine learning workflows.
Pandas Course with Examples and Exercises
Table of Contents
1. Introduction to Pandas
2. Data Structures
3. Data Input/Output
4. Data Selection and Indexing
5. Data Cleaning
6. Data Transformation
7. Grouping and Aggregation
8. Merging and Joining
9. Time Series
10. Advanced Operations
1. Introduction to Pandas
What is Pandas?
Pandas is a powerful Python library for data manipulation and analysis. It provides data
structures and operations for manipulating numerical tables and time series.
Installation

```bash
pip install pandas
```

Basic Setup

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```
Series
A one-dimensional labeled array.

```python
# Creating Series
s1 = pd.Series([1, 3, 5, np.nan, 6, 8])
s2 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s3 = pd.Series({'a': 1, 'b': 2, 'c': 3})
print("Series 1:")
print(s1)
print("\nSeries 2:")
print(s2)
```
DataFrame
A two-dimensional labeled data structure.

```python
# Creating DataFrames
df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'London', 'Paris', 'Tokyo']
})
df2 = pd.DataFrame(
    np.random.randn(6, 4),
    index=pd.date_range('20230101', periods=6),
    columns=['A', 'B', 'C', 'D']
)
print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)
```
Exercise 1: Create Data Structures

```python
# Series
temperatures = pd.Series([72, 68, 75, 80, 78],
                         index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'])
# DataFrame
students = pd.DataFrame({
    'Name': ['John', 'Sarah', 'Mike', 'Emma'],
    'Grade': [85, 92, 78, 95],
    'Subject': ['Math', 'Science', 'Math', 'English']
})
print("Temperatures:")
print(temperatures)
print("\nStudents:")
print(students)
```
3. Data Input/Output
Reading Data

```python
# The filenames here are placeholders; point them at real files
df_csv = pd.read_csv('data.csv')
df_excel = pd.read_excel('data.xlsx')
print("From CSV:")
print(df_csv)
print("\nFrom Excel:")
print(df_excel)
```

Writing Data

```python
# Create DataFrame
products = pd.DataFrame({
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'Price': [1000, 25, 75, 300],
    'Stock': [15, 100, 50, 25]
})
# Save to CSV, then read it back
products.to_csv('products.csv', index=False)
products_read = pd.read_csv('products.csv')
print("Products DataFrame:")
print(products_read)
```
4. Data Selection and Indexing
Basic Selection

```python
# Sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 28],
    'Salary': [50000, 60000, 70000, 80000, 55000],
    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR']
})
# Column selection
print("Names:", df['Name'].tolist())
print("Ages and Salaries:\n", df[['Age', 'Salary']])
# Row selection
print("First row:\n", df.iloc[0])
print("First 3 rows:\n", df.head(3))
# Boolean indexing
print("IT Department:\n", df[df['Department'] == 'IT'])
print("High Salary:\n", df[df['Salary'] > 60000])
# Setting index
df_indexed = df.set_index('Name')
print("With Name as index:\n", df_indexed.loc[['Alice', 'Charlie'], ['Age', 'Salary']])
# 1. HR department
hr_employees = df[df['Department'] == 'HR']
print("HR Employees:\n", hr_employees)
# 2. Age 30-40
age_range = df[(df['Age'] >= 30) & (df['Age'] <= 40)]
print("\nEmployees aged 30-40:\n", age_range)
```
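The same boolean filters can also be written with DataFrame.query, which takes the condition as a string; a sketch using the same column names as above:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 28],
    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR']
})

# Equivalent to df[(df['Age'] >= 30) & (df['Age'] <= 40)]
age_range = df.query('Age >= 30 and Age <= 40')
print(age_range)
```

query() is often more readable for compound conditions, since it avoids the repeated `df[...]` and the parentheses that `&`/`|` require.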
5. Data Cleaning
Handling Missing Values

```python
# Sample DataFrame with missing values (illustrative data)
df_missing = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8]
})
print("Original DataFrame:")
print(df_missing)
print("\nMissing values per column:")
print(df_missing.isnull().sum())
```

Type Conversion

```python
# Type conversion
df_types = pd.DataFrame({
    'Strings': ['1', '2', '3'],
    'Numbers': [1.1, 2.2, 3.3],
    'Integers': [1, 2, 3]
})
print("Original types:")
print(df_types.dtypes)
# Convert types
df_types['Strings'] = df_types['Strings'].astype(int)
df_types['Numbers'] = df_types['Numbers'].astype(int)
print("\nAfter conversion:")
print(df_types.dtypes)
```
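astype raises an error if any value cannot be converted; pd.to_numeric with errors='coerce' is a more forgiving alternative that turns unparseable values into NaN. A minimal sketch:

```python
import pandas as pd

s = pd.Series(['1', '2', 'oops', '4'])

# 'oops' cannot be parsed as a number, so it becomes NaN instead of raising
numbers = pd.to_numeric(s, errors='coerce')
print(numbers)
```

This is a common first step when cleaning columns read from messy CSV files.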
6. Data Transformation
Applying Functions

```python
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Salary': [50000, 60000, 70000]
})
# Apply function to column
df['Salary_Adjusted'] = df['Salary'].apply(lambda x: x * 1.1)
df['Name_Length'] = df['Name'].apply(len)
print("After applying functions:")
print(df)
```
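Closely related to apply is Series.map, which can take a dictionary to translate values; a self-contained sketch (the team labels here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Salary': [50000, 60000, 70000]
})

# Map each name to a hypothetical team label; unmapped names would become NaN
df['Team'] = df['Name'].map({'Alice': 'Blue', 'Bob': 'Red', 'Charlie': 'Blue'})
print(df)
```

Use map for value-to-value lookups and apply for arbitrary functions.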
String Operations

```python
# String operations via the .str accessor
df_strings = pd.DataFrame({
    'Text': ['hello world', 'pandas tutorial', 'data science'],
    'Name': ['alice smith', 'BOB JONES', 'Charlie Brown']
})
df_strings['Text_Upper'] = df_strings['Text'].str.upper()
df_strings['Name_Proper'] = df_strings['Name'].str.title()
df_strings['Word_Count'] = df_strings['Text'].str.split().str.len()
print("String operations:")
print(df_strings)
```
Exercise: Transformations

```python
# Create DataFrame
products = pd.DataFrame({
    'Product': ['laptop', 'mouse', 'keyboard', 'monitor'],
    'Price': [1000, 25, 75, 300]
})
print("Original:")
print(products)
# 1. Convert to uppercase
products['Product_Upper'] = products['Product'].str.upper()
# 2. Apply discount
products['Price_Discounted'] = products['Price'].apply(lambda x: x * 0.85)
# 3. Create categories
def price_category(price):
    if price < 50:
        return 'Low'
    elif price <= 100:
        return 'Medium'
    else:
        return 'High'
products['Price_Category'] = products['Price'].apply(price_category)
print("\nAfter transformations:")
print(products)
```
7. Grouping and Aggregation
Basic Grouping

```python
# Sample sales data (illustrative data)
sales_data = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'South', 'East', 'East'],
    'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 200, 150, 250, 120, 180],
    'Profit': [20, 50, 30, 60, 25, 40]
})
print("Original data:")
print(sales_data)
# Basic grouping
region_sales = sales_data.groupby('Region')['Sales'].sum()
print("\nTotal sales by region:")
print(region_sales)
# Multiple aggregations
region_stats = sales_data.groupby('Region').agg({
    'Sales': ['sum', 'mean', 'count'],
    'Profit': ['sum', 'mean']
})
print("\nRegional statistics:")
print(region_stats)
```
Pivot Tables

```python
# Pivot table
pivot_sales = sales_data.pivot_table(
    values='Sales',
    index='Region',
    columns='Product',
    aggfunc='sum',
    fill_value=0
)
print("Pivot table:")
print(pivot_sales)
# Average profit by region and product
pivot_profit = sales_data.pivot_table(
    values='Profit',
    index='Region',
    columns='Product',
    aggfunc='mean',
    fill_value=0
).round(2)
print("\nAverage profit by region and product:")
print(pivot_profit)
```
8. Merging and Joining
Concatenation

```python
# Two frames with matching columns (illustrative data, redefined here)
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['Charlie', 'David'], 'Age': [35, 40]})
# Concatenation stacks rows; ignore_index renumbers them 0..n-1
result_concat = pd.concat([df1, df2], ignore_index=True)
print("Concatenation:")
print(result_concat)
```
Merging

```python
# Employee table (illustrative data; DeptID 103 has no match below)
employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'DeptID': [101, 102, 101, 103]
})
departments = pd.DataFrame({
    'DeptID': [101, 102, 104],
    'DeptName': ['IT', 'HR', 'Finance']
})
print("Employees:")
print(employees)
print("\nDepartments:")
print(departments)
# Inner join: only DeptIDs present in both frames
inner_join = pd.merge(employees, departments, on='DeptID', how='inner')
print("\nInner join:")
print(inner_join)
# Left join: all employees, NaN where no department matches
left_join = pd.merge(employees, departments, on='DeptID', how='left')
print("\nLeft join:")
print(left_join)
```
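An outer join keeps unmatched rows from both sides, and passing indicator=True adds a _merge column showing where each row came from; a self-contained sketch with illustrative data:

```python
import pandas as pd

employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'DeptID': [101, 102, 103]
})
departments = pd.DataFrame({
    'DeptID': [101, 102, 104],
    'DeptName': ['IT', 'HR', 'Finance']
})

# Outer join: rows unmatched on either side are kept, filled with NaN
outer_join = pd.merge(employees, departments, on='DeptID',
                      how='outer', indicator=True)
print(outer_join)
```

The _merge column takes the values 'both', 'left_only', and 'right_only', which is handy for auditing which keys failed to match.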
Exercise: Merging

```python
# Create DataFrames
customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Email': ['alice@email.com', 'bob@email.com',
              'charlie@email.com', 'david@email.com']
})
orders = pd.DataFrame({
    'OrderID': [101, 102, 103, 104],
    'CustomerID': [1, 2, 1, 5],
    'Amount': [150, 200, 75, 300]
})
print("Customers:")
print(customers)
print("\nOrders:")
print(orders)
```
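One way to finish this exercise is to join orders to customers; a left join from orders keeps the order for CustomerID 5 even though no such customer exists (a self-contained sketch):

```python
import pandas as pd

customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
orders = pd.DataFrame({
    'OrderID': [101, 102, 103, 104],
    'CustomerID': [1, 2, 1, 5],
    'Amount': [150, 200, 75, 300]
})

# Left join from orders: every order is kept, customer info added where it matches
orders_with_names = pd.merge(orders, customers, on='CustomerID', how='left')
print(orders_with_names)

# Total order amount per matched customer (NaN names are excluded by groupby)
totals = orders_with_names.groupby('Name')['Amount'].sum()
print(totals)
```

The order for CustomerID 5 survives the join with a NaN Name, which flags it as an orphaned record worth investigating.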
9. Time Series

```python
# Sample daily time series (illustrative data)
ts_data = pd.DataFrame({
    'Date': pd.date_range('2023-01-01', periods=14),
    'Sales': [100, 120, 130, 110, 150, 160, 140,
              155, 165, 170, 160, 180, 175, 190]
})
ts_data = ts_data.set_index('Date')
print("Time series data:")
print(ts_data)
# Resampling
weekly_sales = ts_data['Sales'].resample('W').mean()
print("\nWeekly average sales:")
print(weekly_sales)
# Rolling average over a 3-day window
ts_data['Sales_3D_Avg'] = ts_data['Sales'].rolling(window=3).mean()
print("\nWith rolling average:")
print(ts_data)
```
Exercise: Stock Prices

```python
# Sample stock prices (illustrative data)
stock_prices = pd.DataFrame(
    {'Price': [100, 102, 101, 105, 107, 106, 110, 108, 112, 115,
               113, 118, 120, 119, 122, 121, 125, 124, 128, 130]},
    index=pd.date_range('2023-01-01', periods=20)
)
print("Stock prices:")
print(stock_prices.head())
# 1. Daily returns
stock_prices['Daily_Return'] = stock_prices['Price'].pct_change() * 100
# 2. Weekly resampling
weekly_data = stock_prices['Price'].resample('W').mean()
print("\nWith calculations:")
print(stock_prices.head(10))
print("\nWeekly data:")
print(weekly_data)
```
10. Advanced Operations
MultiIndex

```python
# Building a (Student, Semester) MultiIndex (illustrative data)
index = pd.MultiIndex.from_tuples(
    [('Alice', 'S1'), ('Alice', 'S2'), ('Bob', 'S1'), ('Bob', 'S2')],
    names=['Student', 'Semester']
)
grades_df = pd.DataFrame({
    'Math': [85, 88, 82, 90],
    'Science': [92, 87, 85, 88],
    'English': [78, 85, 80, 87]
}, index=index)
print("Grades DataFrame:")
print(grades_df)
```
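With a MultiIndex in place, rows can be selected at either level; a self-contained sketch using a (Student, Semester) index like the grades example:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('Alice', 'S1'), ('Alice', 'S2'), ('Bob', 'S1'), ('Bob', 'S2')],
    names=['Student', 'Semester']
)
grades_df = pd.DataFrame({
    'Math': [85, 88, 82, 90],
    'Science': [92, 87, 85, 88]
}, index=index)

# All rows for one student (selects on the outer level)
print(grades_df.loc['Alice'])

# Cross-section on the inner level: every student's S1 grades
print(grades_df.xs('S1', level='Semester'))
```

loc with a single label drills into the outer level, while xs() can slice on any named level without reordering the index.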
"""
Create a complete data analysis pipeline:
1. Load sample sales data
2. Clean and preprocess the data
3. Perform exploratory analysis
4. Create summary statistics and visualizations
"""
import pandas as pd
import numpy as np
import [Link] as plt
sales_data = [Link]({
'Date': [Link](dates, 1000),
'Product': [Link](products, 1000),
'Region': [Link](['North', 'South', 'East', 'West'], 1000),
'Sales': [Link](50, 500, 1000),
'Cost': [Link](20, 200, 1000)
})
# 2. Data cleaning
print("\nMissing values before cleaning:")
print(sales_data.isnull().sum())
# Calculate profit
sales_data['Profit'] = sales_data['Sales'] - sales_data['Cost']
sales_data['Profit_Margin'] = (sales_data['Profit'] / sales_data['Sales']) * 100
print("\nAfter cleaning:")
print(sales_data.isnull().sum())
# 3. Exploratory analysis
print("\nBasic statistics:")
print(sales_data[['Sales', 'Cost', 'Profit']].describe())
# Product performance
product_performance = sales_data.groupby('Product').agg({
'Sales': 'sum',
'Profit': 'sum',
'Profit_Margin': 'mean'
}).round(2)
print("\nProduct Performance:")
print(product_performance)
# Regional analysis
regional_analysis = sales_data.groupby('Region').agg({
'Sales': ['sum', 'mean'],
'Profit': 'sum'
}).round(2)
print("\nRegional Analysis:")
print(regional_analysis)
# 4. Visualization
[Link](figsize=(15, 10))
plt.tight_layout()
[Link]()
# Summary
print("\n=== ANALYSIS SUMMARY ===")
print(f"Total Sales: ${sales_data['Sales'].sum():,.2f}")
print(f"Total Profit: ${sales_data['Profit'].sum():,.2f}")
print(f"Average Profit Margin: {sales_data['Profit_Margin'].mean():.2f}%")
best_product = product_performance['Profit'].idxmax()
print(f"Best Performing Product: {best_product}")