Learn Pandas

Pandas is a Python library designed for data manipulation and analysis, particularly useful in data science and machine learning. It allows users to read, clean, analyze, and visualize data in structured formats like tables, utilizing key data structures such as Series and DataFrames. This document covers various functionalities of Pandas, including data loading, filtering, handling missing values, grouping, sorting, merging, and working with time-series and categorical data.

Pandas

Pandas is a powerful Python library for data manipulation and analysis. It’s often used
in data science and machine learning, allowing you to work with data in tables (like
Excel spreadsheets).

Step 1: What is Pandas?

Pandas is a library that helps you work with data more easily. Data is often
structured in tables (rows and columns), and Pandas makes it easy to:

 Read data from files like CSVs or Excel.

 Analyze data by performing calculations.

 Clean data (removing missing or bad data).

 Visualize data.

In Pandas, data is stored in two main structures:

1. Series: A one-dimensional array (like a column in a table).

2. DataFrame: A two-dimensional table (like a spreadsheet with rows and columns).
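
To make these concrete, here is a minimal sketch (with made-up values) showing both structures:

import pandas as pd

# A Series: one labeled column of values
ages = pd.Series([25, 30, 35], name='Age')
print(ages)

# A DataFrame: a table built from {column name -> values}
people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
print(people)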

Step 2: Installing Pandas

If you don’t have Pandas installed yet, you can install it using pip:

pip install pandas

Step 3: Loading Data with Pandas

You’ll typically start by loading data from a file like a CSV or Excel sheet into a Pandas
DataFrame.

import pandas as pd

# Load data from a CSV file

data = pd.read_csv('data.csv')

# Show the first 5 rows of the data

print(data.head())

The head() method shows the first 5 rows by default; pass a number to see a different count (e.g. data.head(10)). Replace 'data.csv' with the path to your actual file.

Step 4: Exploring Data

Once the data is loaded into a DataFrame, you can explore it by checking its structure
and some summary statistics.
Check the shape of the data

This tells you how many rows and columns are in the table.

print(data.shape) # (rows, columns)


Get information about the data types

This is useful for knowing what type of data each column contains (numbers, text,
etc.).

data.info()  # info() prints its summary itself, so no print() is needed
Describe numerical data

You can get summary statistics like the mean, standard deviation, and more for the
numerical columns.

print(data.describe())

Step 5: Selecting and Filtering Data

Pandas makes it super easy to select specific rows and columns or filter data based
on conditions.

Select a column (Series)

If you want to access a specific column, you can do it like this:

# Select the 'Age' column

age_column = data['Age']

print(age_column)
Select multiple columns (DataFrame)

If you need more than one column, you pass a list of column names:

# Select 'Name' and 'Age' columns

subset = data[['Name', 'Age']]

print(subset)
Filtering rows based on a condition

To filter data based on a condition, use a comparison operator (like ==, >, <, etc.).

# Select rows where Age is greater than 30

filtered_data = data[data['Age'] > 30]

print(filtered_data)
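
You can also combine conditions with & (and) and | (or), wrapping each condition in parentheses. A minimal sketch, assuming the data also has a 'Gender' column:

# Select rows where Age is greater than 30 AND Gender is 'female'
subset = data[(data['Age'] > 30) & (data['Gender'] == 'female')]
print(subset)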

Step 6: Handling Missing Data


Data is often messy, and some values might be missing (NaN). Pandas provides
several ways to handle missing data.

Check for missing values

You can see which columns have missing data.

print(data.isnull().sum())
Drop rows with missing values

If you want to remove rows that have missing data, use dropna().

clean_data = data.dropna()
Fill missing values

You can also fill missing values with a specific value or the mean of the column.

# Fill missing values with the mean of the column

data['Age'] = data['Age'].fillna(data['Age'].mean())

Step 7: Grouping Data

Grouping data is useful when you want to calculate statistics based on categories. For
example, you can group people by gender and find the average age for each group.

# Group data by 'Gender' and calculate the average age for each group

grouped_data = data.groupby('Gender')['Age'].mean()

print(grouped_data)

Step 8: Sorting Data

You can sort your DataFrame by one or more columns.

# Sort data by 'Age' in descending order

sorted_data = data.sort_values(by='Age', ascending=False)

print(sorted_data)
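
To sort by more than one column, pass lists to by and ascending. A minimal sketch, assuming the data also has a 'Gender' column:

# Sort by 'Gender' (A-Z), then by 'Age' within each gender (oldest first)
multi_sorted = data.sort_values(by=['Gender', 'Age'], ascending=[True, False])
print(multi_sorted.head())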

Step 9: Creating New Columns

You can create new columns based on existing ones. For example, if you want to
create a new column that shows whether someone is an adult (age >= 18):

# Create a new column 'IsAdult' where True if Age >= 18

data['IsAdult'] = data['Age'] >= 18

print(data.head())
Step 10: Exporting Data

After modifying your data, you may want to save it back to a file.

# Save the DataFrame to a new CSV file

data.to_csv('cleaned_data.csv', index=False)

Deeper Concepts in Pandas

Once you’ve mastered the basics, there are more advanced features in Pandas to
explore:

 Pivot tables: Useful for summarizing data.

 Merging and joining: Combine multiple DataFrames.

 Time-series data: Handle and manipulate time-based data.

Example: Analyzing Titanic Dataset

To see Pandas in action, here’s a small example using a dataset about Titanic
passengers.

import pandas as pd

# Load Titanic dataset

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"

titanic_data = pd.read_csv(url)

# Explore the dataset

print(titanic_data.head()) # Display first 5 rows

titanic_data.info() # Info about the dataset

# Filter data: Passengers older than 30

older_passengers = titanic_data[titanic_data['Age'] > 30]

print(older_passengers.head())

# Group data by 'Sex' and calculate the average age

average_age_by_sex = titanic_data.groupby('Sex')['Age'].mean()

print(average_age_by_sex)
# Fill missing ages with the median age

titanic_data['Age'] = titanic_data['Age'].fillna(titanic_data['Age'].median())

Step 11: Pivot Tables

Pivot Tables are like magic tables that help you summarize and analyze data quickly.
Imagine you have a big table of data, and you want to see summaries based on
certain categories. Pivot tables make this easy!

What is a Pivot Table?

 A pivot table allows you to rearrange and summarize data in a DataFrame.

 You can aggregate data (like finding sums, averages) based on categories.

Example: Creating a Pivot Table

Let's use a simple dataset to understand pivot tables. Suppose you have sales data
for different products in different regions.

Sales Data Example:

import pandas as pd

# Sample sales data

data = {

'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],

'Product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],

'Sales': [100, 150, 200, 250, 300, 350, 400, 450]

}

# Create a DataFrame

df = pd.DataFrame(data)

print(df)
Output:

Region Product Sales

0 North A 100

1 South B 150

2 East A 200

3 West B 250
4 North A 300

5 South B 350

6 East A 400

7 West B 450

Creating a Pivot Table:

# Create a pivot table to sum sales by Region and Product

pivot_table = pd.pivot_table(df, values='Sales', index='Region', columns='Product', aggfunc='sum')

print(pivot_table)
Output:

Product      A      B

Region

East     600.0    NaN

North    400.0    NaN

South      NaN  500.0

West       NaN  700.0

Explanation:

 values='Sales': The data we want to summarize.

 index='Region': Rows will be the regions.

 columns='Product': Columns will be the products.

 aggfunc='sum': We want to sum the sales.

Handling Missing Values: Sometimes, some combinations might not have data. You
can fill these NaN (Not a Number) values with zeros or any other value.

pivot_table_filled = pivot_table.fillna(0)

print(pivot_table_filled)
Output:

Product      A      B

Region

East     600.0    0.0

North    400.0    0.0

South      0.0  500.0

West      0.0  700.0

Step 12: Merging and Joining DataFrames

Merging and joining are ways to combine two or more DataFrames based on a
common key (like merging two tables in a database).
What is Merging?

 Merging combines two DataFrames by matching values in one or more columns.

 Think of it as putting together puzzle pieces that fit based on shared edges.

Example: Merging Two DataFrames

Let's say you have two separate tables: one with employee information and another
with their salaries.

Employee Data:

# Employee information

employees = {

'EmployeeID': [1, 2, 3, 4],

'Name': ['Alice', 'Bob', 'Charlie', 'David'],

'Department': ['HR', 'IT', 'Finance', 'IT']

}

# Create a DataFrame

df_employees = pd.DataFrame(employees)

print(df_employees)
Output:

EmployeeID Name Department

0 1 Alice HR

1 2 Bob IT

2 3 Charlie Finance

3 4 David IT

Salary Data:

# Salary information

salaries = {

'EmployeeID': [1, 2, 3, 4],

'Salary': [50000, 60000, 55000, 65000]

}

# Create a DataFrame
df_salaries = pd.DataFrame(salaries)

print(df_salaries)
Output:

EmployeeID Salary

0 1 50000

1 2 60000

2 3 55000

3 4 65000

Merging the DataFrames:

# Merge employees and salaries on 'EmployeeID'

df_merged = pd.merge(df_employees, df_salaries, on='EmployeeID')

print(df_merged)
Output:

EmployeeID Name Department Salary

0 1 Alice HR 50000

1 2 Bob IT 60000

2 3 Charlie Finance 55000

3 4 David IT 65000

Explanation:

 pd.merge(): Merges two DataFrames.

 on='EmployeeID': The common column to merge on.

Different Types of Merges:

1. Inner Join (default):

o Only includes rows with matching keys in both DataFrames.

o Example: As shown above.

2. Left Join:

o Includes all rows from the left DataFrame and matching rows from the
right.

o Syntax: pd.merge(left, right, on='key', how='left')

3. Right Join:

o Includes all rows from the right DataFrame and matching rows from the
left.
o Syntax: pd.merge(left, right, on='key', how='right')

4. Outer Join:

o Includes all rows from both DataFrames, filling NaN where there are no
matches.

o Syntax: pd.merge(left, right, on='key', how='outer')
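
Example of Outer Join:

A minimal self-contained sketch; the frames df_a and df_b are made up for illustration, with keys that only partly overlap.

# Two small DataFrames whose keys only partly overlap
df_a = pd.DataFrame({'key': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
df_b = pd.DataFrame({'key': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

# Outer join keeps keys 1 and 4 too, padded with NaN on the missing side
print(pd.merge(df_a, df_b, on='key', how='outer'))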

Example of Left Join:

# Employee data with an extra employee not in salary data

employees_extra = {

'EmployeeID': [1, 2, 3, 4, 5],

'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],

'Department': ['HR', 'IT', 'Finance', 'IT', 'Marketing']

}

df_employees_extra = pd.DataFrame(employees_extra)

print(df_employees_extra)

# Left join

df_left_join = pd.merge(df_employees_extra, df_salaries, on='EmployeeID', how='left')

print(df_left_join)
Output:

EmployeeID Name Department

0 1 Alice HR

1 2 Bob IT

2 3 Charlie Finance

3 4 David IT

4 5 Eve Marketing

EmployeeID Name Department Salary

0 1 Alice HR 50000.0

1 2 Bob IT 60000.0

2 3 Charlie Finance 55000.0

3 4 David IT 65000.0
4 5 Eve Marketing NaN

Explanation:

 EmployeeID 5 (Eve) doesn't have a corresponding salary, so Salary is NaN.

Step 13: Handling Time-Series Data

Time-Series Data involves data points indexed in time order. It's common in fields
like finance, weather forecasting, and more.

What is Time-Series Data?

 Data points collected or recorded at specific time intervals.

 Examples: Daily stock prices, monthly sales figures, hourly temperatures.

Example: Creating and Manipulating Time-Series Data

Let's create a simple time-series dataset of daily temperatures.

Creating Time-Series Data:

import pandas as pd

import numpy as np

# Create a date range

dates = pd.date_range(start='2024-01-01', periods=7, freq='D')

# Generate random temperatures

temperatures = np.random.randint(low=20, high=35, size=7)

# Create a DataFrame

df_temp = pd.DataFrame({'Date': dates, 'Temperature': temperatures})

print(df_temp)
Output:

Date Temperature

0 2024-01-01 25

1 2024-01-02 30

2 2024-01-03 22

3 2024-01-04 28

4 2024-01-05 33

5 2024-01-06 27

6 2024-01-07 24
Setting the Date as Index:

# Set 'Date' as the index

df_temp.set_index('Date', inplace=True)

print(df_temp)
Output:

Date Temperature

2024-01-01 25

2024-01-02 30

2024-01-03 22

2024-01-04 28

2024-01-05 33

2024-01-06 27

2024-01-07 24

Resampling Time-Series Data: Suppose you want to see the average temperature
for each week (though our example is just one week).

# Resample data weekly and calculate the mean temperature

weekly_avg = df_temp.resample('W').mean()

print(weekly_avg)
Output:

Date Temperature

2024-01-07 27.0

Explanation:

 resample('W'): Groups data by week.

 mean(): Calculates the average temperature for each group.

Step 14: Handling Categorical Data

Categorical Data refers to variables that can be divided into groups or categories,
like gender, colors, or departments.

Why is Categorical Data Important?

 Helps in organizing and analyzing data based on categories.

 Optimizes memory usage when dealing with large datasets.

Example: Converting Columns to Categorical


Let's use our previous employee DataFrame and convert the Department column to a
categorical type.

Employee Data:

import pandas as pd

# Employee information

employees = {

'EmployeeID': [1, 2, 3, 4, 5],

'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],

'Department': ['HR', 'IT', 'Finance', 'IT', 'Marketing']

}

df_employees = pd.DataFrame(employees)

print(df_employees)
Output:

EmployeeID Name Department

0 1 Alice HR

1 2 Bob IT

2 3 Charlie Finance

3 4 David IT

4 5 Eve Marketing

Converting to Categorical:

# Convert 'Department' to categorical

df_employees['Department'] = df_employees['Department'].astype('category')

print(df_employees.dtypes)
Output:

EmployeeID int64

Name object

Department category

dtype: object

Benefits:

 Memory Efficiency: Categorical data uses less memory.


 Performance Improvement: Operations on categorical data can be faster.
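
You can check the memory saving yourself with memory_usage(); a minimal sketch (with only five rows the difference is tiny, and the saving grows with the number of rows):

# Compare memory used by plain strings vs. category codes
as_object = df_employees['Department'].astype('object')
as_category = df_employees['Department'].astype('category')

print(as_object.memory_usage(deep=True))    # bytes for string objects
print(as_category.memory_usage(deep=True))  # bytes for codes + categories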

Step 15: Applying Functions with apply() and map()

Applying Functions allows you to perform operations on your data easily. Pandas
provides methods like apply() and map() to apply functions to DataFrame columns or
Series.

What is apply()?

 apply(): Applies a function along an axis of the DataFrame (rows or columns).

What is map()?

 map(): Applies a function to each element of a Series.

Example: Using apply() and map()

Let's use our employee DataFrame to see how these work.

Employee Data with Salaries:

import pandas as pd

# Employee information with salaries

employees = {

'EmployeeID': [1, 2, 3, 4, 5],

'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],

'Department': ['HR', 'IT', 'Finance', 'IT', 'Marketing'],

'Salary': [50000, 60000, 55000, 65000, 70000]

}

df_employees = pd.DataFrame(employees)

print(df_employees)
Output:

EmployeeID Name Department Salary

0 1 Alice HR 50000

1 2 Bob IT 60000

2 3 Charlie Finance 55000

3 4 David IT 65000

4 5 Eve Marketing 70000


Using apply() to Calculate Bonus: Suppose you want to give a 10% bonus to each
employee.

# Define a function to calculate bonus

def calculate_bonus(salary):
    return salary * 0.10

# Apply the function to the 'Salary' column

df_employees['Bonus'] = df_employees['Salary'].apply(calculate_bonus)

print(df_employees)
Output:

EmployeeID Name Department Salary Bonus

0 1 Alice HR 50000 5000.0

1 2 Bob IT 60000 6000.0

2 3 Charlie Finance 55000 5500.0

3 4 David IT 65000 6500.0

4 5 Eve Marketing 70000 7000.0

Using map() to Relabel Department Names: Suppose you want to map each department name to a new label: expanding 'HR' to 'Human Resources' and 'IT' to 'Information Technology', and shortening 'Finance' to 'Fin' and 'Marketing' to 'Mkt'.

# Create a mapping dictionary

dept_mapping = {

'HR': 'Human Resources',

'IT': 'Information Technology',

'Finance': 'Fin',

'Marketing': 'Mkt'

}

# Use map to replace department names

df_employees['Dept_Full'] = df_employees['Department'].map(dept_mapping)

print(df_employees)
Output:

EmployeeID Name Department Salary Bonus Dept_Full

0 1 Alice HR 50000 5000.0 Human Resources


1 2 Bob IT 60000 6000.0 Information Technology

2 3 Charlie Finance 55000 5500.0 Fin

3 4 David IT 65000 6500.0 Information Technology

4 5 Eve Marketing 70000 7000.0 Mkt

Explanation:

 apply() is used when you want to apply a function to each element or row/column.

 map() is ideal for substituting each value in a Series with another value.

Step 16: Data Visualization with Pandas

While Pandas itself has basic plotting capabilities, it's often used alongside Matplotlib
for more advanced visualizations. However, Pandas makes it easy to create simple
plots directly.

Why Visualize Data?

 Understand Trends: See patterns and trends in your data.

 Compare Data: Easily compare different categories or groups.

 Present Insights: Share your findings in an understandable way.

Example: Plotting Sales Data

Let's plot the sales data we used earlier.

Sales Data:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

# Sample sales data

data = {

'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],

'Product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],

'Sales': [100, 150, 200, 250, 300, 350, 400, 450]

}

# Create a DataFrame

df = pd.DataFrame(data)
print(df)
Output:

Region Product Sales

0 North A 100

1 South B 150

2 East A 200

3 West B 250

4 North A 300

5 South B 350

6 East A 400

7 West B 450

Creating a Bar Plot:

# Create a pivot table to sum sales by Region and Product

pivot_table = pd.pivot_table(df, values='Sales', index='Region',
                             columns='Product', aggfunc='sum').fillna(0)

# Plot the pivot table

pivot_table.plot(kind='bar')

plt.title('Sales by Region and Product')

plt.xlabel('Region')

plt.ylabel('Total Sales')

plt.xticks(rotation=0)

plt.show()
Explanation:

 plot(kind='bar'): Creates a bar chart.

 plt.title(), plt.xlabel(), plt.ylabel(): Adds title and labels.

 plt.xticks(rotation=0): Keeps the x-axis labels horizontal.

 plt.show(): Displays the plot.

Visual Output: A bar chart showing total sales for Product A and B in each region.

Step 17: Working with Large Datasets

Handling large datasets efficiently is crucial in data analysis. Pandas provides tools to
manage and manipulate large amounts of data without slowing down your computer.
Tips for Handling Large Datasets:

1. Use Efficient Data Types:

o Convert columns to more memory-efficient types.

o Example: Use category for columns with repeating values.

2. Read Data in Chunks:

o Read large files in smaller parts to avoid memory issues.

3. Use dtype Parameter:

o Specify data types while reading files to save memory.
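
A minimal sketch of the dtype tip, assuming a CSV with columns named 'id', 'category', and 'value' (hypothetical names):

# Declare compact types up front instead of letting Pandas guess
df_typed = pd.read_csv('large_data.csv',
                       dtype={'id': 'int32', 'category': 'category', 'value': 'float32'})
print(df_typed.dtypes)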

Example: Reading a Large CSV in Chunks

Suppose you have a large CSV file named large_data.csv. Here's how you can read it
in chunks.

import pandas as pd

# Define the chunk size

chunk_size = 1000 # Number of rows per chunk

# Initialize an empty list to store results

results = []

# Read the CSV file in chunks

for chunk in pd.read_csv('large_data.csv', chunksize=chunk_size):

    # Perform operations on each chunk

    # Example: Calculate the mean of a column named 'Value'

    mean_value = chunk['Value'].mean()

    results.append(mean_value)

# Combine the results (a mean of chunk means; exact only when every chunk has the same number of rows)

overall_mean = sum(results) / len(results)

print("Overall Mean:", overall_mean)


Explanation:

 chunksize=1000: Reads 1000 rows at a time.


 Looping through each chunk: Allows processing without loading the entire
file into memory.

 Calculating Mean: An example operation on each chunk.

Note: Replace 'Value' with the actual column name you want to analyze.

Step 18: Exporting and Importing Data

After analyzing and manipulating your data, you might want to save your results or
share them with others. Pandas makes it easy to export and import data in various
formats.

Common Data Formats:

 CSV (Comma-Separated Values): Simple text format for tables.

 Excel: Spreadsheet format.

 JSON: JavaScript Object Notation, useful for web data.

 SQL: Database format.

Example: Exporting Data to Different Formats

Exporting to CSV:

# Save the DataFrame to a CSV file

df.to_csv('sales_data.csv', index=False)
Exporting to Excel:

# Save the DataFrame to an Excel file

df.to_excel('sales_data.xlsx', index=False)
Exporting to JSON:

# Save the DataFrame to a JSON file

df.to_json('sales_data.json', orient='records', lines=True)


Example: Importing Data from Different Formats

Importing from CSV:

# Read data from a CSV file

df_csv = pd.read_csv('sales_data.csv')

print(df_csv.head())
Importing from Excel:

# Read data from an Excel file

df_excel = pd.read_excel('sales_data.xlsx')

print(df_excel.head())
Importing from JSON:

# Read data from a JSON file

df_json = pd.read_json('sales_data.json', orient='records', lines=True)

print(df_json.head())
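
Exporting and Importing with SQL:

SQL is listed above but not shown; here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database (the table name 'sales' is arbitrary).

import sqlite3

# Open an in-memory SQLite database
conn = sqlite3.connect(':memory:')

# Export: write the DataFrame to a SQL table
df.to_sql('sales', conn, index=False, if_exists='replace')

# Import: read it back with a query
df_sql = pd.read_sql('SELECT * FROM sales', conn)
print(df_sql.head())

conn.close()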

Step 19: Handling Duplicates and Removing Unwanted Data

In real-world datasets, you might encounter duplicate rows or unwanted data. Pandas provides easy methods to handle these situations.

Why Handle Duplicates?

 Accuracy: Ensure your analysis is based on unique and accurate data.

 Performance: Reduce data size for faster processing.

Example: Removing Duplicate Rows

Sample Data with Duplicates:

import pandas as pd

# Sample data with duplicates

data = {

'EmployeeID': [1, 2, 2, 3, 4, 4, 4],

'Name': ['Alice', 'Bob', 'Bob', 'Charlie', 'David', 'David', 'David'],

'Department': ['HR', 'IT', 'IT', 'Finance', 'IT', 'IT', 'IT'],

'Salary': [50000, 60000, 60000, 55000, 65000, 65000, 65000]

}

# Create a DataFrame

df_duplicates = pd.DataFrame(data)

print("Before removing duplicates:")

print(df_duplicates)
Output:

EmployeeID Name Department Salary

0 1 Alice HR 50000

1 2 Bob IT 60000

2 2 Bob IT 60000
3 3 Charlie Finance 55000

4 4 David IT 65000

5 4 David IT 65000

6 4 David IT 65000

Removing Duplicates:

# Remove duplicate rows

df_no_duplicates = df_duplicates.drop_duplicates()

print("\nAfter removing duplicates:")

print(df_no_duplicates)
Output:


EmployeeID Name Department Salary

0 1 Alice HR 50000

1 2 Bob IT 60000

3 3 Charlie Finance 55000

4 4 David IT 65000

Explanation:

 drop_duplicates(): Removes duplicate rows based on all columns.

 keep='first' (default): Keeps the first occurrence and removes the rest.

Handling Specific Duplicates:

Sometimes, you might want to remove duplicates based on specific columns.

Example: Remove duplicates based on 'EmployeeID':

# Remove duplicates based on 'EmployeeID' only

df_unique_employee = df_duplicates.drop_duplicates(subset='EmployeeID', keep='first')

print("\nAfter removing duplicates based on 'EmployeeID':")

print(df_unique_employee)
Output:

EmployeeID Name Department Salary

0 1 Alice HR 50000

1 2 Bob IT 60000

3 3 Charlie Finance 55000

4 4 David IT 65000
Step 20: Handling Missing Data in Pandas

Data often comes with missing values (represented as NaN). Handling missing data
is crucial for accurate analysis.

Why Handle Missing Data?

 Accuracy: Ensure your analysis isn't skewed by missing or incomplete data.

 Performance: Reduce data size and complexity.

Example: Identifying and Handling Missing Data

Sample Data with Missing Values:

import pandas as pd

import numpy as np

# Sample data with missing values

data = {

'EmployeeID': [1, 2, 3, 4, 5],

'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],

'Department': ['HR', 'IT', np.nan, 'IT', 'Marketing'],

'Salary': [50000, 60000, 55000, np.nan, 70000]

}

# Create a DataFrame

df_missing = pd.DataFrame(data)

print("Data with Missing Values:")

print(df_missing)
Output:

EmployeeID Name Department Salary

0 1 Alice HR 50000.0

1 2 Bob IT 60000.0

2 3 Charlie NaN 55000.0

3 4 David IT NaN

4 5 Eve Marketing 70000.0

Identifying Missing Values:


# Check for missing values

print("\nMissing Values in Each Column:")

print(df_missing.isnull().sum())
Output:

Missing Values in Each Column:

EmployeeID 0

Name 0

Department 1

Salary 1

dtype: int64

Handling Missing Values:

1. Removing Missing Values:

o dropna(): Removes rows with any missing values.

# Remove rows with any missing values

df_dropped = df_missing.dropna()

print("\nAfter Dropping Missing Values:")

print(df_dropped)
Output:

EmployeeID Name Department Salary

0 1 Alice HR 50000.0

1 2 Bob IT 60000.0

4 5 Eve Marketing 70000.0

2. Filling Missing Values:

o fillna(): Fills missing values with a specified value.

# Fill missing 'Department' with 'Unknown'

df_filled = df_missing.fillna({'Department': 'Unknown', 'Salary': df_missing['Salary'].mean()})

print("\nAfter Filling Missing Values:")

print(df_filled)
Output:

EmployeeID Name Department Salary

0 1 Alice HR 50000.00

1 2 Bob IT 60000.00
2 3 Charlie Unknown 55000.00

3 4 David IT 58750.00

4 5 Eve Marketing 70000.00

Explanation:

o {'Department': 'Unknown', 'Salary': df_missing['Salary'].mean()}:

 Fills missing Department values with 'Unknown'.

 Fills missing Salary values with the mean salary.

3. Forward Fill and Backward Fill:

o Forward Fill (ffill): Fills missing values with the previous value.

o Backward Fill (bfill): Fills missing values with the next value.

# Forward fill

df_ffill = df_missing.ffill()

print("\nAfter Forward Fill:")

print(df_ffill)

# Backward fill

df_bfill = df_missing.bfill()

print("\nAfter Backward Fill:")

print(df_bfill)
Output:

After Forward Fill:

EmployeeID Name Department Salary

0 1 Alice HR 50000.0

1 2 Bob IT 60000.0

2 3 Charlie IT 55000.0

3 4 David IT 55000.0

4 5 Eve Marketing 70000.0

After Backward Fill:

EmployeeID Name Department Salary

0 1 Alice HR 50000.0

1 2 Bob IT 60000.0
2 3 Charlie IT 55000.0

3 4 David IT 70000.0

4 5 Eve Marketing 70000.0

Choosing the Right Method:

 Removing is straightforward but may lead to loss of data.

 Filling retains data but requires careful selection of fill values.

 Forward/Backward Fill is useful for time-series data where previous or next values make sense.

Step 21: Working with Multiple DataFrames

Sometimes, you need to work with multiple DataFrames to perform comprehensive data analysis. Pandas provides several functions to handle multiple DataFrames effectively.

Common Operations:

1. Concatenation: Stack DataFrames vertically or horizontally.

2. Merging: Combine DataFrames based on a common key.

3. Joining: Similar to merging but typically joins on index.

Example: Concatenating DataFrames

DataFrames to Concatenate:

import pandas as pd

# First DataFrame

data1 = {

'Name': ['Alice', 'Bob', 'Charlie'],

'Age': [25, 30, 35]

}

df1 = pd.DataFrame(data1)

# Second DataFrame

data2 = {

'Name': ['David', 'Eve'],

'Age': [40, 45]

}
df2 = pd.DataFrame(data2)

print("DataFrame 1:")

print(df1)

print("\nDataFrame 2:")

print(df2)
Output:

DataFrame 1:

Name Age

0 Alice 25

1 Bob 30

2 Charlie 35

DataFrame 2:

Name Age

0 David 40

1 Eve 45

Concatenating Vertically (Stacking Rows):

# Concatenate vertically

df_concat = pd.concat([df1, df2], ignore_index=True)

print("\nConcatenated DataFrame:")

print(df_concat)
Output:

Concatenated DataFrame:

Name Age

0 Alice 25

1 Bob 30

2 Charlie 35

3 David 40

4 Eve 45

Explanation:
 pd.concat(): Concatenates DataFrames.

 ignore_index=True: Resets the index in the new DataFrame.

Concatenating Horizontally (Stacking Columns):

# Third DataFrame

data3 = {

'City': ['New York', 'Los Angeles', 'Chicago']

}

df3 = pd.DataFrame(data3)

print("\nDataFrame 3:")

print(df3)

# Concatenate horizontally

df_hconcat = pd.concat([df1, df3], axis=1)

print("\nHorizontally Concatenated DataFrame:")

print(df_hconcat)
Output:

DataFrame 3:

City

0 New York

1 Los Angeles

2 Chicago

Horizontally Concatenated DataFrame:

Name Age City

0 Alice 25 New York

1 Bob 30 Los Angeles

2 Charlie 35 Chicago

Explanation:

 axis=1: Concatenates along columns.

 Both DataFrames must have the same number of rows.
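
Joining on the Index:

join(), mentioned at the start of this step, lines up DataFrames by index rather than by a key column. A minimal sketch, assuming both frames are indexed by name:

# Two DataFrames that share an index of names
left = pd.DataFrame({'Age': [25, 30, 35]}, index=['Alice', 'Bob', 'Charlie'])
right = pd.DataFrame({'City': ['New York', 'Los Angeles']}, index=['Alice', 'Bob'])

# join() matches rows by index; Charlie gets NaN for City (left join by default)
print(left.join(right))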


Step 22: Time-Series Analysis with Pandas

Time-Series Analysis involves analyzing data points collected or recorded at specific time intervals. Pandas offers robust tools to handle and analyze time-series data.

Key Concepts:

1. Datetime Index: Setting a datetime column as the index.

2. Resampling: Aggregating data at different time frequencies.

3. Rolling Window: Calculating statistics over a moving window.

Example: Time-Series Analysis of Stock Prices

Let's analyze a simple time-series dataset of stock prices.

Creating Time-Series Data:

import pandas as pd

import numpy as np

# Create a date range

dates = pd.date_range(start='2024-01-01', periods=10, freq='D')

# Generate random stock prices

prices = np.random.randint(low=100, high=200, size=10)

# Create a DataFrame

df_stock = pd.DataFrame({'Date': dates, 'Price': prices})

print(df_stock)
Output:

Date Price

0 2024-01-01 150

1 2024-01-02 175

2 2024-01-03 120

3 2024-01-04 160

4 2024-01-05 180

5 2024-01-06 140

6 2024-01-07 130

7 2024-01-08 170
8 2024-01-09 190

9 2024-01-10 110

Setting 'Date' as Index:

# Set 'Date' as the index

df_stock.set_index('Date', inplace=True)

print(df_stock)
Output:

Date Price

2024-01-01 150

2024-01-02 175

2024-01-03 120

2024-01-04 160

2024-01-05 180

2024-01-06 140

2024-01-07 130

2024-01-08 170

2024-01-09 190

2024-01-10 110

Resampling to Weekly Data:

# Resample data to weekly frequency and calculate mean price

weekly_mean = df_stock.resample('W').mean()

print("\nWeekly Mean Price:")

print(weekly_mean)
Output:

Date Price

2024-01-07 150.714286

2024-01-14 156.666667

Explanation:

 resample('W'): Groups data by week.

 mean(): Calculates the average price for each week.

Calculating Rolling Average:

# Calculate a rolling average with a window of 3 days


df_stock['Rolling_Avg'] = df_stock['Price'].rolling(window=3).mean()

print("\nDataFrame with Rolling Average:")

print(df_stock)
Output:

Date Price Rolling_Avg

2024-01-01 150 NaN

2024-01-02 175 NaN

2024-01-03 120 148.333333

2024-01-04 160 151.666667

2024-01-05 180 153.333333

2024-01-06 140 160.000000

2024-01-07 130 150.000000

2024-01-08 170 146.666667

2024-01-09 190 163.333333

2024-01-10 110 156.666667

Interview Questions
1. What is Pandas, and how is it different from NumPy?

Answer:

 Pandas is an open-source Python library used for data manipulation and analysis. It provides data structures like Series and DataFrame.

 NumPy is mainly used for numerical computations with arrays, while Pandas is
built on top of NumPy and provides high-level data manipulation tools like
indexing, merging, reshaping, and handling missing data.
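
Example (a minimal sketch of the relationship; a DataFrame wraps array data with labeled axes):

import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4]])            # positions only
df = pd.DataFrame(arr, columns=["a", "b"])  # same data, labeled axes

print(arr.mean(axis=0))  # NumPy: column means by position
print(df["a"].mean())    # Pandas: column mean by name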

2. What is a Series in Pandas?

Answer:

 A Series is a one-dimensional labeled array in Pandas, capable of holding any data type (integers, strings, floats, etc.). It is like a column in a table.

Example:

import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(s)

3. What is a DataFrame in Pandas?

Answer:

 A DataFrame is a two-dimensional, size-mutable, and heterogeneous data structure in Pandas. It is like a table with rows and columns, where each column can be of a different type (int, float, string, etc.).

Example:

data = {"Name": ["Alice", "Bob"], "Age": [20, 22]}

df = pd.DataFrame(data)
print(df)

4. How do you read a CSV file in Pandas?

Answer: You can use the pd.read_csv() method to read data from a CSV file.

df = pd.read_csv("file.csv")

5. How do you handle missing data in Pandas?

Answer:

 Check for missing data: df.isnull()

 Fill missing data: df.fillna(value)

 Drop missing data: df.dropna()

Example:

# Fill missing values with 0

df.fillna(0, inplace=True)

# Drop rows with missing values

df.dropna(inplace=True)

6. What are the different ways to select data in Pandas?

Answer:

1. Selecting columns: df["column_name"] or df[["col1", "col2"]]

2. Selecting rows by label: df.loc["row_label"]

3. Selecting rows by position: df.iloc[0]

4. Conditional selection: df[df["column"] > 10]

7. How do you merge two DataFrames in Pandas?

Answer: You can use the pd.merge() function to combine two DataFrames based on a
common column or index.

merged_df = pd.merge(df1, df2, on="common_column")


 Other options for merging: how="inner", how="left", how="right", and
how="outer" to specify the join type.

8. What is the difference between loc and iloc?


Answer:

 loc is label-based indexing. It is used to access rows and columns by their labels.

 iloc is integer-based indexing. It is used to access rows and columns by their integer positions.

Example:

df.loc[1] # Access row with index label 1

df.iloc[0] # Access first row (0-based index)

9. How can you group data in Pandas?

Answer: You can use the groupby() method to group rows based on a column and
then perform aggregate functions like sum(), mean(), count(), etc.

Example:

grouped = df.groupby("column_name")["numeric_column"].sum()

print(grouped)

10. Explain the concept of reshaping in Pandas. How do you pivot and melt
DataFrames?

Answer:

 Pivoting converts data from a long format to a wide format, creating a new
DataFrame with rows and columns rearranged.

pivot_df = df.pivot(index="A", columns="B", values="C")


 Melting converts a wide format into a long format, making a more readable
DataFrame.

melted_df = pd.melt(df, id_vars=["ID"], var_name="variable", value_name="value")

11. What is the difference between apply() and map() in Pandas?

Answer:

 apply(): Used to apply a function to a Series or along an axis of the DataFrame (rows or columns).

 map(): Used to substitute each value in a Series with another value.

Example:

# Apply a function to each row

df["New_Column"] = df["Age"].apply(lambda x: x + 1)
# Map values in a Series

df["Category"] = df["Category"].map({"A": 1, "B": 2})

12. How can you remove duplicates from a DataFrame?

Answer: You can use the drop_duplicates() method to remove duplicate rows from a
DataFrame.

df_cleaned = df.drop_duplicates(subset="column_name")

13. How do you handle time series data in Pandas?

Answer:

 Use pd.to_datetime() to convert a column to a datetime object.

 You can resample time series data with resample() for frequency conversions,
like daily, monthly, etc.

Example:

df["Date"] = pd.to_datetime(df["Date"])

df.set_index("Date", inplace=True)

resampled_data = df.resample("M").mean()

14. What is the difference between concat() and merge()?

Answer:

 concat(): Concatenates DataFrames vertically or horizontally.

 merge(): Joins DataFrames on columns or indexes based on key columns.

Example:

# Concatenate

combined_df = pd.concat([df1, df2])

# Merge

merged_df = pd.merge(df1, df2, on="ID")

15. Explain rolling and expanding in Pandas.

Answer:

 Rolling: Creates a rolling window over data and performs functions like mean(),
sum(), etc.
 Expanding: Applies a cumulative function over data (i.e., expanding sum or
mean).

Example:

# Rolling window

df["Rolling_Mean"] = df["Sales"].rolling(window=3).mean()

# Expanding window

df["Cumulative_Sum"] = df["Sales"].expanding().sum()

16. How can you read Excel files in Pandas?

Answer: You can use pd.read_excel() to read Excel files.

Example:

df = pd.read_excel("file.xlsx", sheet_name="Sheet1")

17. How can you set a specific column as the index of a DataFrame?

Answer: You can use the set_index() method to set a column as the index.

df.set_index("column_name", inplace=True)

18. What are some key features of Pandas for handling large datasets?

Answer:

 Chunking: Use chunksize while reading large files to load data in parts.

 Memory Optimization: Reduce memory usage by converting data types (astype()).
 Efficient Operations: Use vectorized operations to handle large data faster.
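
Example (a minimal sketch of the memory-optimization tip; the column names are hypothetical):

# Downcast a 64-bit integer column and convert repeating text to category
df["count"] = df["count"].astype("int32")
df["status"] = df["status"].astype("category")
print(df.memory_usage(deep=True))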

19. What is a Categorical data type in Pandas?

Answer: The Categorical data type is used for columns that contain a finite number
of unique categories (e.g., gender, yes/no). It improves performance and reduces
memory usage.

Example:

df["Category"] = df["Category"].astype("category")

20. What is the difference between pivot() and pivot_table()?


Answer:

 pivot(): Reshapes data but doesn't allow for aggregation of duplicate data.

 pivot_table(): Allows you to aggregate data when there are duplicates by applying functions like mean(), sum().

Example:

pivoted_table = df.pivot_table(index="A", columns="B", values="C", aggfunc="mean")

Revision Notes
Basic Concepts:

 DataFrame Creation:
pd.DataFrame(data), pd.Series(data)

 Reading Files:
pd.read_csv('file.csv'),
pd.read_excel('file.xlsx'),
pd.read_json('file.json')

 DataFrame Info:
df.shape – Get shape (rows, columns)
df.columns – Get column names
df.info() – Summary of DataFrame
df.describe() – Descriptive statistics
df.head(n) – First n rows
df.tail(n) – Last n rows

Data Selection:

 Selecting Columns:
df['column_name'],
df[['col1', 'col2']]

 Selecting Rows by Index:


df.loc[index_label] – Select rows by label
df.iloc[index_position] – Select rows by position

 Boolean Selection:
df[df['column'] > value] – Conditional filtering
df[(df['col1'] > value1) & (df['col2'] == value2)] – Multiple conditions

Modifying DataFrames:

 Assign New Column:


df['new_column'] = value – Add new column
 Rename Columns:
df.rename(columns={'old': 'new'}, inplace=True)

 Dropping Columns/Rows:
df.drop(['col1', 'col2'], axis=1) – Drop columns
df.drop([index_label], axis=0) – Drop rows

 Reset Index:
df.reset_index(drop=True, inplace=True) – Reset index

Sorting Data:

 Sort by Column:
df.sort_values(by='column_name', ascending=False) – Sort by column

 Sort by Index:
df.sort_index(ascending=True) – Sort by index

Handling Missing Data:

 Check for NaN:


df.isnull() – Check for missing values
df.isnull().sum() – Count missing values per column

 Drop Missing Values:


df.dropna(axis=0) – Drop rows with NaN
df.dropna(axis=1) – Drop columns with NaN

 Fill Missing Values:


df.fillna(value) – Fill missing values

Data Transformation:

 Apply Function:
df['col'].apply(function) – Apply function element-wise

 Map Values:
df['col'].map({'A': 1, 'B': 2}) – Map values

 Replace Values:
df.replace({old_value: new_value}) – Replace specific values

Grouping and Aggregating:

 Group by Column:
df.groupby('column').sum() – Group and aggregate data

 Aggregate Functions:
df.groupby('column').agg({'col1': 'sum', 'col2': 'mean'}) – Apply multiple
functions to columns
Merging & Concatenation:

 Concatenate DataFrames:
pd.concat([df1, df2], axis=0) – Concatenate vertically
pd.concat([df1, df2], axis=1) – Concatenate horizontally

 Merge DataFrames:
pd.merge(df1, df2, on='common_column') – Merge on a common column
pd.merge(df1, df2, how='inner/outer/left/right') – Type of join

Pivoting & Reshaping:

 Pivot:
df.pivot(index='A', columns='B', values='C') – Pivot data

 Melt:
pd.melt(df, id_vars=['A'], value_vars=['B', 'C']) – Unpivot data

 Transpose:
df.T – Transpose rows and columns

 Stack & Unstack:


df.stack() – Stacking rows to columns
df.unstack() – Unstacking columns to rows

Time Series:

 Convert to Datetime:
pd.to_datetime(df['Date']) – Convert to datetime format

 Set Datetime as Index:


df.set_index('Date', inplace=True)

 Resampling:
df.resample('M').mean() – Resample by Month (M) and calculate mean

Window Functions:

 Rolling Window:
df['Rolling_Avg'] = df['col'].rolling(window=3).mean() – Apply rolling mean

 Expanding Window:
df['Expanding_Sum'] = df['col'].expanding().sum() – Apply expanding sum

Advanced Indexing & Selection:

 Select Specific Rows & Columns:


df.loc[1:3, 'col1':'col3'] – Select rows and columns by labels
 Filter by Condition:
df[df['column'] > value] – Conditional filtering

Categorical Data:

 Convert to Categorical:
df['col'] = df['col'].astype('category') – Convert to categorical type

 Get Dummies (One-Hot Encoding):


pd.get_dummies(df['col']) – Convert categorical data to one-hot encoded
columns

Visualization (with Matplotlib/Seaborn):

 Line Plot:
df.plot(kind='line')

 Bar Plot:
df.plot(kind='bar')

 Histogram:
df['column'].plot(kind='hist')

 Box Plot:
df.boxplot(column=['col1', 'col2'])

Advanced Topics:

 Join Two DataFrames:


df1.join(df2, on='key_column', how='left')

 Handling Large Datasets:


pd.read_csv('file.csv', chunksize=10000) – Load large files in chunks

 Categorical Encoding:
pd.get_dummies(df['categorical_column']) – Convert categorical to one-hot
encoding

 String Operations:
df['col'].str.upper(), df['col'].str.contains('word') – String manipulation functions
