Pandas Chatgpt
Pandas is a powerful Python library that is widely used for data manipulation, analysis, and
exploration. It provides flexible and efficient data structures—primarily DataFrame and Series—
that allow you to work with structured data easily.
Pandas is built on top of NumPy and provides functionality for working with large datasets,
handling missing data, filtering, grouping, and merging datasets, among other features. Let’s
explore Pandas with examples and detailed explanations of key operations.
Series: A one-dimensional labeled array capable of holding any data type (integers, strings,
floats, etc.).
DataFrame: A two-dimensional labeled data structure, similar to a table in a database or an
Excel spreadsheet, with rows and columns.
1.1 Series

```python
import pandas as pd

# Creating a Series with an explicit index
s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
print("Pandas Series:")
print(s)
```

Output:

```
Pandas Series:
a    10
b    20
c    30
d    40
e    50
dtype: int64
```

Explanation:
The Series is created from the list [10, 20, 30, 40, 50].
The index is explicitly set to ['a', 'b', 'c', 'd', 'e'], allowing you to access elements using these labels.
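Because the index is labeled, elements can be read by label as well as sliced by label; a minimal sketch:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# Access a single element by its label
print(s['a'])

# Label-based slicing includes both endpoints (unlike integer slicing)
print(s['b':'d'])
```

Note that `s['b':'d']` returns the rows for 'b', 'c', and 'd' — label slices are inclusive of the end label.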
1.2 DataFrame
A Pandas DataFrame is a 2D data structure with rows and columns, similar to a table. It can hold
data of different types and allows for more complex operations like filtering, aggregation, and
reshaping.
```python
# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 70000, 80000]
}
df = pd.DataFrame(data)
print("Pandas DataFrame:")
print(df)
```

Output:

```
Pandas DataFrame:
      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
3    David   40   80000
```
Explanation:
The DataFrame is created from a dictionary where each key is a column and each value is a
list of data for that column.
The DataFrame has an automatic index (0, 1, 2, 3).
You can perform a wide range of operations on DataFrames, such as accessing data, filtering, and
modifying the structure.
```python
# Accessing a single column
name_column = df['Name']
print("Name Column:")
print(name_column)
```

Output:

```
Name Column:
0      Alice
1        Bob
2    Charlie
3      David
Name: Name, dtype: object
```
Explanation:
You can access rows based on index labels using loc[] or by integer positions using iloc[].
```python
# Accessing rows using loc (label-based)
row_bob = df.loc[1]
print("Row for Bob (loc):")
print(row_bob)
```

Output:

```
Row for Bob (loc):
Name        Bob
Age          30
Salary    60000
Name: 1, dtype: object
```
Explanation:
loc[] is used for label-based indexing, and it returns the row where the index label is 1.
iloc[] is used for integer-based indexing; for example, df.iloc[0:2] returns the first two rows of the DataFrame.
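A small self-contained sketch of integer-position indexing with iloc[], using the same data as above:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 70000, 80000]
})

# iloc uses integer positions; the end of the slice is excluded
first_two = df.iloc[0:2]
print(first_two)
```

Unlike loc[] label slices, iloc[] slices follow normal Python slicing rules, so `0:2` yields positions 0 and 1 only.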
3. Modifying DataFrames
Pandas allows you to add new columns, modify existing ones, or drop rows/columns from the
DataFrame.
You can easily add new columns to a DataFrame by assigning a new column name and values.
```python
# Adding a new column 'Bonus' to the DataFrame
df['Bonus'] = [5000, 6000, 7000, 8000]
print("DataFrame with New Column:")
print(df)
```

Output:

```
DataFrame with New Column:
      Name  Age  Salary  Bonus
0    Alice   25   50000   5000
1      Bob   30   60000   6000
2  Charlie   35   70000   7000
3    David   40   80000   8000
```
Explanation:
A new column Bonus is added to the DataFrame, and it holds the values [5000, 6000,
7000, 8000].
3.2 Dropping Rows or Columns
You can drop rows or columns from a DataFrame using the drop() method.
```python
# Dropping the 'Bonus' column
df_dropped = df.drop('Bonus', axis=1)
print("DataFrame after Dropping 'Bonus' Column:")
print(df_dropped)
```

Output:

```
DataFrame after Dropping 'Bonus' Column:
      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
3    David   40   80000
```
Explanation:
drop() returns a new DataFrame without the 'Bonus' column; the original df is left unchanged unless you pass inplace=True.
Missing data is common in real-world datasets. Pandas provides several methods for handling missing values, such as filling them or dropping rows/columns that contain them.
You can detect missing data using the isnull() function, which returns a boolean DataFrame.
```python
# Introducing a missing value
df.loc[1, 'Salary'] = None

# Detecting missing data
print("Missing Data:")
print(df.isnull())
```

Output:

```
Missing Data:
    Name    Age  Salary  Bonus
0  False  False   False  False
1  False  False    True  False
2  False  False   False  False
3  False  False   False  False
```
Explanation:
isnull() returns a boolean DataFrame in which True marks a missing value; here only Bob's Salary is missing.

```python
# Filling missing values in the 'Salary' column with the mean
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
print("DataFrame after Filling Missing Values:")
print(df)
```

Output:

```
DataFrame after Filling Missing Values:
      Name  Age    Salary  Bonus
0    Alice   25  50000.00   5000
1      Bob   30  66666.67   6000
2  Charlie   35  70000.00   7000
3    David   40  80000.00   8000
```
Explanation:
The missing value in the Salary column is filled with the mean salary using fillna().
The mean salary is calculated as (50000 + 70000 + 80000) / 3 = 66666.67.
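As mentioned above, rows containing missing values can also be dropped instead of filled; a minimal sketch of dropna():

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Salary': [50000.0, np.nan, 70000.0]
})

# Drop every row that contains at least one missing value
df_clean = df.dropna()
print(df_clean)
```

Here Bob's row is removed because his Salary is NaN; dropna() also accepts `axis=1` to drop columns instead.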
You can filter rows in a DataFrame based on column values using boolean indexing, similar to SQL's WHERE clause.
Example 9: Filtering Rows
```python
# Filtering rows where Age > 30
filtered_df = df[df['Age'] > 30]
print("Rows where Age > 30:")
print(filtered_df)
```

Output:

```
Rows where Age > 30:
      Name  Age   Salary  Bonus
2  Charlie   35  70000.0   7000
3    David   40  80000.0   8000
```
Explanation:
The DataFrame is filtered to show only rows where the Age is greater than 30.
Pandas allows you to group data by one or more columns with groupby() and apply aggregate functions such as sum(), mean(), and count().
```python
# Creating a new DataFrame
data = {
    'Department': ['HR', 'IT', 'HR', 'IT', 'Sales', 'Sales'],
    'Salary': [50000, 60000, 55000, 65000, 45000, 52000]
}
df_dept = pd.DataFrame(data)

# Mean salary per department
grouped = df_dept.groupby('Department').mean()
print("Grouped by Department:")
print(grouped)
```

Output:

```
Grouped by Department:
             Salary
Department
HR          52500.0
IT          62500.0
Sales       48500.0
```
Explanation:
The data is grouped by the Department column, and the mean salary for each department is
calculated using mean().
Conclusion
Pandas is an essential tool for data analysis and manipulation. Key features include the Series and DataFrame structures, missing-data handling, filtering, grouping, and merging. Pandas makes it easy to handle and analyze large datasets efficiently, making it a fundamental library for data science and machine learning workflows.
Pandas Course with Examples and Exercises
Table of Contents
1. Introduction to Pandas
2. Data Structures
3. Data Input/Output
4. Data Selection and Indexing
5. Data Cleaning
6. Data Transformation
7. Grouping and Aggregation
8. Merging and Joining
9. Time Series
10. Advanced Operations
1. Introduction to Pandas
What is Pandas?
Pandas is a powerful Python library for data manipulation and analysis. It provides data
structures and operations for manipulating numerical tables and time series.
Installation

```bash
pip install pandas
```

Basic Setup

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```
Series
A one-dimensional labeled array.

```python
# Creating Series
s1 = pd.Series([1, 3, 5, np.nan, 6, 8])
s2 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s3 = pd.Series({'a': 1, 'b': 2, 'c': 3})
print("Series 1:")
print(s1)
print("\nSeries 2:")
print(s2)
```
DataFrame
A two-dimensional labeled data structure.

```python
# Creating DataFrames
df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'London', 'Paris', 'Tokyo']
})
df2 = pd.DataFrame(
    np.random.randn(6, 4),
    index=pd.date_range('20230101', periods=6),
    columns=['A', 'B', 'C', 'D']
)
print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)
```
Exercise 1: Create Data Structures

```python
# Series
temperatures = pd.Series([72, 68, 75, 80, 78],
                         index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'])
# DataFrame
students = pd.DataFrame({
    'Name': ['John', 'Sarah', 'Mike', 'Emma'],
    'Grade': [85, 92, 78, 95],
    'Subject': ['Math', 'Science', 'Math', 'English']
})
print("Temperatures:")
print(temperatures)
print("\nStudents:")
print(students)
```
3. Data Input/Output
Reading Data

```python
# The filenames here are placeholders; point them at real files
df_csv = pd.read_csv('data.csv')
df_excel = pd.read_excel('data.xlsx')
print("From CSV:")
print(df_csv)
print("\nFrom Excel:")
print(df_excel)
```

Writing Data

```python
# Create DataFrame
products = pd.DataFrame({
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'Price': [1000, 25, 75, 300],
    'Stock': [15, 100, 50, 25]
})
# Save to CSV, then read it back
products.to_csv('products.csv', index=False)
products_read = pd.read_csv('products.csv')
print("Products DataFrame:")
print(products_read)
```
4. Data Selection and Indexing
Basic Selection

```python
# Sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 28],
    'Salary': [50000, 60000, 70000, 80000, 55000],
    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR']
})
# Column selection
print("Names:", df['Name'].tolist())
print("Ages and Salaries:\n", df[['Age', 'Salary']])
# Row selection
print("First row:\n", df.iloc[0])
print("First 3 rows:\n", df.head(3))
# Boolean indexing
print("IT Department:\n", df[df['Department'] == 'IT'])
print("High Salary:\n", df[df['Salary'] > 60000])
# Setting index
df_indexed = df.set_index('Name')
print("With Name as index:\n", df_indexed.loc[['Alice', 'Charlie'], ['Age', 'Salary']])
# 1. HR department
hr_employees = df[df['Department'] == 'HR']
print("HR Employees:\n", hr_employees)
# 2. Age 30-40
age_range = df[(df['Age'] >= 30) & (df['Age'] <= 40)]
print("\nEmployees aged 30-40:\n", age_range)
```
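The same boolean filters can also be written with DataFrame.query, which takes the condition as a string; a sketch using the same column names as above:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 28],
    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR']
})

# Equivalent to df[(df['Age'] >= 30) & (df['Age'] <= 40)]
age_range = df.query('Age >= 30 and Age <= 40')
print(age_range)
```

query() is often more readable for compound conditions, since it avoids the repeated `df[...]` and the parentheses that `&`/`|` require.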
5. Data Cleaning
Handling Missing Values

```python
# Sample DataFrame with missing values (illustrative data)
df_missing = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8]
})
print("Original DataFrame:")
print(df_missing)
print("\nMissing values per column:")
print(df_missing.isnull().sum())
```

Type Conversion

```python
# Type conversion
df_types = pd.DataFrame({
    'Strings': ['1', '2', '3'],
    'Numbers': [1.1, 2.2, 3.3],
    'Integers': [1, 2, 3]
})
print("Original types:")
print(df_types.dtypes)
# Convert types
df_types['Strings'] = df_types['Strings'].astype(int)
df_types['Numbers'] = df_types['Numbers'].astype(int)
print("\nAfter conversion:")
print(df_types.dtypes)
```
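astype raises an error if any value cannot be converted; pd.to_numeric with errors='coerce' is a more forgiving alternative that turns unparseable values into NaN. A minimal sketch:

```python
import pandas as pd

s = pd.Series(['1', '2', 'oops', '4'])

# 'oops' cannot be parsed as a number, so it becomes NaN instead of raising
numbers = pd.to_numeric(s, errors='coerce')
print(numbers)
```

This is a common first step when cleaning columns read from messy CSV files.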
6. Data Transformation
Applying Functions

```python
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Salary': [50000, 60000, 70000]
})
# Apply function to column
df['Salary_Adjusted'] = df['Salary'].apply(lambda x: x * 1.1)
df['Name_Length'] = df['Name'].apply(len)
print("After applying functions:")
print(df)
```
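Closely related to apply is Series.map, which can take a dictionary to translate values; a self-contained sketch (the team labels here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Salary': [50000, 60000, 70000]
})

# Map each name to a hypothetical team label; unmapped names would become NaN
df['Team'] = df['Name'].map({'Alice': 'Blue', 'Bob': 'Red', 'Charlie': 'Blue'})
print(df)
```

Use map for value-to-value lookups and apply for arbitrary functions.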
String Operations

```python
# String operations via the .str accessor
df_strings = pd.DataFrame({
    'Text': ['hello world', 'pandas tutorial', 'data science'],
    'Name': ['alice smith', 'BOB JONES', 'Charlie Brown']
})
df_strings['Text_Upper'] = df_strings['Text'].str.upper()
df_strings['Name_Proper'] = df_strings['Name'].str.title()
df_strings['Word_Count'] = df_strings['Text'].str.split().str.len()
print("String operations:")
print(df_strings)
```
Exercise: Transformations

```python
# Create DataFrame
products = pd.DataFrame({
    'Product': ['laptop', 'mouse', 'keyboard', 'monitor'],
    'Price': [1000, 25, 75, 300]
})
print("Original:")
print(products)
# 1. Convert to uppercase
products['Product_Upper'] = products['Product'].str.upper()
# 2. Apply discount
products['Price_Discounted'] = products['Price'].apply(lambda x: x * 0.85)
# 3. Create categories
def price_category(price):
    if price < 50:
        return 'Low'
    elif price <= 100:
        return 'Medium'
    else:
        return 'High'
products['Price_Category'] = products['Price'].apply(price_category)
print("\nAfter transformations:")
print(products)
```
7. Grouping and Aggregation
Basic Grouping

```python
# Sample sales data (illustrative data)
sales_data = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'South', 'East', 'East'],
    'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 200, 150, 250, 120, 180],
    'Profit': [20, 50, 30, 60, 25, 40]
})
print("Original data:")
print(sales_data)
# Basic grouping
region_sales = sales_data.groupby('Region')['Sales'].sum()
print("\nTotal sales by region:")
print(region_sales)
# Multiple aggregations
region_stats = sales_data.groupby('Region').agg({
    'Sales': ['sum', 'mean', 'count'],
    'Profit': ['sum', 'mean']
})
print("\nRegional statistics:")
print(region_stats)
```
Pivot Tables

```python
# Pivot table
pivot_sales = sales_data.pivot_table(
    values='Sales',
    index='Region',
    columns='Product',
    aggfunc='sum',
    fill_value=0
)
print("Pivot table:")
print(pivot_sales)
# Average profit by region and product
pivot_profit = sales_data.pivot_table(
    values='Profit',
    index='Region',
    columns='Product',
    aggfunc='mean',
    fill_value=0
).round(2)
print("\nAverage profit by region and product:")
print(pivot_profit)
```
8. Merging and Joining
Concatenation

```python
# Two frames with matching columns (illustrative data, redefined here)
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['Charlie', 'David'], 'Age': [35, 40]})
# Concatenation stacks rows; ignore_index renumbers them 0..n-1
result_concat = pd.concat([df1, df2], ignore_index=True)
print("Concatenation:")
print(result_concat)
```
Merging

```python
# Employee table (illustrative data; DeptID 103 has no match below)
employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'DeptID': [101, 102, 101, 103]
})
departments = pd.DataFrame({
    'DeptID': [101, 102, 104],
    'DeptName': ['IT', 'HR', 'Finance']
})
print("Employees:")
print(employees)
print("\nDepartments:")
print(departments)
# Inner join: only DeptIDs present in both frames
inner_join = pd.merge(employees, departments, on='DeptID', how='inner')
print("\nInner join:")
print(inner_join)
# Left join: all employees, NaN where no department matches
left_join = pd.merge(employees, departments, on='DeptID', how='left')
print("\nLeft join:")
print(left_join)
```
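An outer join keeps unmatched rows from both sides, and passing indicator=True adds a _merge column showing where each row came from; a self-contained sketch with illustrative data:

```python
import pandas as pd

employees = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'DeptID': [101, 102, 103]
})
departments = pd.DataFrame({
    'DeptID': [101, 102, 104],
    'DeptName': ['IT', 'HR', 'Finance']
})

# Outer join: rows unmatched on either side are kept, filled with NaN
outer_join = pd.merge(employees, departments, on='DeptID',
                      how='outer', indicator=True)
print(outer_join)
```

The _merge column takes the values 'both', 'left_only', and 'right_only', which is handy for auditing which keys failed to match.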
Exercise: Merging

```python
# Create DataFrames
customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Email': ['alice@email.com', 'bob@email.com',
              'charlie@email.com', 'david@email.com']
})
orders = pd.DataFrame({
    'OrderID': [101, 102, 103, 104],
    'CustomerID': [1, 2, 1, 5],
    'Amount': [150, 200, 75, 300]
})
print("Customers:")
print(customers)
print("\nOrders:")
print(orders)
```
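One way to finish this exercise is to join orders to customers; a left join from orders keeps the order for CustomerID 5 even though no such customer exists (a self-contained sketch):

```python
import pandas as pd

customers = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
orders = pd.DataFrame({
    'OrderID': [101, 102, 103, 104],
    'CustomerID': [1, 2, 1, 5],
    'Amount': [150, 200, 75, 300]
})

# Left join from orders: every order is kept, customer info added where it matches
orders_with_names = pd.merge(orders, customers, on='CustomerID', how='left')
print(orders_with_names)

# Total order amount per matched customer (NaN names are excluded by groupby)
totals = orders_with_names.groupby('Name')['Amount'].sum()
print(totals)
```

The order for CustomerID 5 survives the join with a NaN Name, which flags it as an orphaned record worth investigating.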
9. Time Series

```python
# Sample daily time series (illustrative data)
ts_data = pd.DataFrame({
    'Date': pd.date_range('2023-01-01', periods=14),
    'Sales': [100, 120, 130, 110, 150, 160, 140,
              155, 165, 170, 160, 180, 175, 190]
})
ts_data = ts_data.set_index('Date')
print("Time series data:")
print(ts_data)
# Resampling
weekly_sales = ts_data['Sales'].resample('W').mean()
print("\nWeekly average sales:")
print(weekly_sales)
# Rolling average over a 3-day window
ts_data['Sales_3D_Avg'] = ts_data['Sales'].rolling(window=3).mean()
print("\nWith rolling average:")
print(ts_data)
```
Exercise: Stock Prices

```python
# Sample stock prices (illustrative data)
stock_prices = pd.DataFrame(
    {'Price': [100, 102, 101, 105, 107, 106, 110, 108, 112, 115,
               113, 118, 120, 119, 122, 121, 125, 124, 128, 130]},
    index=pd.date_range('2023-01-01', periods=20)
)
print("Stock prices:")
print(stock_prices.head())
# 1. Daily returns
stock_prices['Daily_Return'] = stock_prices['Price'].pct_change() * 100
# 2. Weekly resampling
weekly_data = stock_prices['Price'].resample('W').mean()
print("\nWith calculations:")
print(stock_prices.head(10))
print("\nWeekly data:")
print(weekly_data)
```
10. Advanced Operations
MultiIndex

```python
# Building a (Student, Semester) MultiIndex (illustrative data)
index = pd.MultiIndex.from_tuples(
    [('Alice', 'S1'), ('Alice', 'S2'), ('Bob', 'S1'), ('Bob', 'S2')],
    names=['Student', 'Semester']
)
grades_df = pd.DataFrame({
    'Math': [85, 88, 82, 90],
    'Science': [92, 87, 85, 88],
    'English': [78, 85, 80, 87]
}, index=index)
print("Grades DataFrame:")
print(grades_df)
```
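With a MultiIndex in place, rows can be selected at either level; a self-contained sketch using a (Student, Semester) index like the grades example:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('Alice', 'S1'), ('Alice', 'S2'), ('Bob', 'S1'), ('Bob', 'S2')],
    names=['Student', 'Semester']
)
grades_df = pd.DataFrame({
    'Math': [85, 88, 82, 90],
    'Science': [92, 87, 85, 88]
}, index=index)

# All rows for one student (selects on the outer level)
print(grades_df.loc['Alice'])

# Cross-section on the inner level: every student's S1 grades
print(grades_df.xs('S1', level='Semester'))
```

loc with a single label drills into the outer level, while xs() can slice on any named level without reordering the index.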
"""
Create a complete data analysis pipeline:
1. Load sample sales data
2. Clean and preprocess the data
3. Perform exploratory analysis
4. Create summary statistics and visualizations
"""
import pandas as pd
import numpy as np
import [Link] as plt
sales_data = [Link]({
'Date': [Link](dates, 1000),
'Product': [Link](products, 1000),
'Region': [Link](['North', 'South', 'East', 'West'], 1000),
'Sales': [Link](50, 500, 1000),
'Cost': [Link](20, 200, 1000)
})
# 2. Data cleaning
print("\nMissing values before cleaning:")
print(sales_data.isnull().sum())
# Calculate profit
sales_data['Profit'] = sales_data['Sales'] - sales_data['Cost']
sales_data['Profit_Margin'] = (sales_data['Profit'] / sales_data['Sales']) * 100
print("\nAfter cleaning:")
print(sales_data.isnull().sum())
# 3. Exploratory analysis
print("\nBasic statistics:")
print(sales_data[['Sales', 'Cost', 'Profit']].describe())
# Product performance
product_performance = sales_data.groupby('Product').agg({
'Sales': 'sum',
'Profit': 'sum',
'Profit_Margin': 'mean'
}).round(2)
print("\nProduct Performance:")
print(product_performance)
# Regional analysis
regional_analysis = sales_data.groupby('Region').agg({
'Sales': ['sum', 'mean'],
'Profit': 'sum'
}).round(2)
print("\nRegional Analysis:")
print(regional_analysis)
# 4. Visualization
[Link](figsize=(15, 10))
plt.tight_layout()
[Link]()
# Summary
print("\n=== ANALYSIS SUMMARY ===")
print(f"Total Sales: ${sales_data['Sales'].sum():,.2f}")
print(f"Total Profit: ${sales_data['Profit'].sum():,.2f}")
print(f"Average Profit Margin: {sales_data['Profit_Margin'].mean():.2f}%")
best_product = product_performance['Profit'].idxmax()
print(f"Best Performing Product: {best_product}")