
Pandas For Data Science - Part 2

Type: Data science masterclass

VI. Data Manipulation and Transformation


Data manipulation and transformation are essential skills for analyzing and
preparing data for further processing. Pandas provides powerful tools for
indexing, sorting, merging, and reshaping datasets. This section covers:

6.1 Indexing and Selecting Data

6.2 Sorting and Ranking

6.3 Merging, Joining, and Concatenating

6.4 Reshaping Data

6.1 Indexing and Selecting Data


Indexing and selecting data are fundamental operations in Pandas that allow
efficient access to subsets of your dataset.

Label-based and Positional Indexing ( .loc and .iloc )


loc[] – Label-based indexing:

Selects rows and columns by explicit index labels.

Example:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'Salary': [50000, 60000, 70000, 80000]}

df = pd.DataFrame(data, index=['A', 'B', 'C', 'D'])

# Selecting a single row by label
print(df.loc['A'])

# Selecting multiple rows and specific columns
print(df.loc[['A', 'C'], ['Name', 'Salary']])

iloc[] – Positional indexing:

Selects rows and columns based on numerical position (zero-based).

Example:

# Selecting by row index position
print(df.iloc[0])  # First row

# Selecting a range of rows
print(df.iloc[1:3])  # Second and third rows

# Selecting specific rows and columns
print(df.iloc[[0, 2], [1, 2]])  # Rows 0 & 2, Columns 1 & 2

Boolean Indexing and Conditional Selection


Boolean Indexing: Select data that meets a condition.

Example:

high_salary = df[df['Salary'] > 60000]
print(high_salary)

Multiple Conditions with & (AND) and | (OR):

Example:



df_filtered = df[(df['Age'] > 25) & (df['Salary'] < 80000)]
print(df_filtered)
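
Membership tests and negation are also common when selecting rows. isin() checks whether each value appears in a given list, and ~ inverts a boolean mask; a short sketch using the same df as above:

print(df[df['Name'].isin(['Alice', 'David'])])  # Rows whose Name is in the list
print(df[~(df['Age'] > 30)])  # Negation: rows where Age <= 30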

6.2 Sorting and Ranking


Sorting and ranking allow reordering data based on values or index labels.

Sorting by Index or Values


Sorting by Values ( sort_values )

Example: Sorting by Salary (ascending and descending)

df_sorted = df.sort_values(by='Salary', ascending=False)
print(df_sorted)

Sorting by Multiple Columns

Example: Sort by Age (ascending) and Salary (descending)

df_sorted = df.sort_values(by=['Age', 'Salary'], ascending=[True, False])
print(df_sorted)

Sorting by Index ( sort_index )

Example: Sorting alphabetically by index labels

df_sorted_idx = df.sort_index()
print(df_sorted_idx)

Ranking Data Within a DataFrame


Ranking assigns a rank to values within a column.

Example:



df['Salary_rank'] = df['Salary'].rank(ascending=False)
print(df)

Handling Ties in Ranking ( method parameter):

average : Default, assigns the mean rank for ties.

min : Assigns the minimum rank for ties.

max : Assigns the maximum rank for ties.

Example:

df['Salary_rank_min'] = df['Salary'].rank(method='min')
print(df)
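
The difference between these methods only shows up when values are tied; a small illustrative sketch on a standalone Series with duplicates:

s = pd.Series([100, 200, 200, 300])
print(s.rank(method='average'))  # 1.0, 2.5, 2.5, 4.0
print(s.rank(method='min'))      # 1.0, 2.0, 2.0, 4.0
print(s.rank(method='max'))      # 1.0, 3.0, 3.0, 4.0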

6.3 Merging, Joining, and Concatenating


Combining datasets is a fundamental task in data manipulation. Pandas provides
multiple methods for merging and joining data.

Concatenation and Appending DataFrames


Concatenating along Rows ( axis=0 )

Example:

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

df_concat = pd.concat([df1, df2], axis=0)
print(df_concat)

Concatenating along Columns ( axis=1 )

Example:



df_concat = pd.concat([df1, df2], axis=1)
print(df_concat)

Appending a Single Row ( pd.concat() )

Note: DataFrame.append() was deprecated and removed in pandas 2.0; pd.concat() is the current way to append rows.

Example:

df_new = pd.DataFrame({'A': [7], 'B': [9]})

df_updated = pd.concat([df1, df_new], ignore_index=True)
print(df_updated)

Merge and Join Operations (Inner, Outer, Left, Right)


Merging on a Key Column ( merge() )

Example:

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Salary': [50000, 60000, 70000]})

df_merged = pd.merge(df1, df2, on='ID', how='inner')
print(df_merged)

Different Types of Joins:

inner : Only matching rows (default).

left : All rows from the left DataFrame and matching rows from the right.

right : All rows from the right DataFrame and matching rows from the left.

outer : All rows from both DataFrames (fills unmatched with NaN).

Example:

df_outer = pd.merge(df1, df2, on='ID', how='outer')
print(df_outer)
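
For comparison, a left join keeps every row of df1 and fills Salary with NaN for IDs absent from df2:

df_left = pd.merge(df1, df2, on='ID', how='left')
print(df_left)  # ID 3 (Charlie) appears with Salary NaN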



6.4 Reshaping Data
Reshaping allows transforming data between wide and long formats for better
analysis.

Pivot Tables and Cross-tabulations


Creating a Pivot Table ( pivot_table() )

Example:

df = pd.DataFrame({'Department': ['HR', 'IT', 'HR', 'IT'],
                   'Gender': ['Male', 'Female', 'Female', 'Male'],
                   'Salary': [50000, 70000, 55000, 80000]})

pivot_table = df.pivot_table(values='Salary', index='Department', columns='Gender', aggfunc='mean')
print(pivot_table)

Cross-tabulation ( pd.crosstab() )

Example:

crosstab = pd.crosstab(df['Department'], df['Gender'])
print(crosstab)

Melting and Stacking DataFrames


Melting (Wide to Long Format)

Example:

df_melted = pd.melt(df, id_vars=['Department'], value_vars=['Salary'])
print(df_melted)

Stacking (Convert Columns to Rows)

Example:



df_stacked = df.set_index(['Department', 'Gender']).stack()
print(df_stacked)
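
Unstacking (Convert Rows Back to Columns)

unstack() is the inverse of stack(), moving the innermost index level back into columns; a brief sketch continuing from the stacked result above:

df_unstacked = df_stacked.unstack()
print(df_unstacked)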

VII. Aggregation and Group Operations


Aggregation and grouping are essential techniques for summarizing and analyzing
large datasets efficiently. Pandas provides powerful tools to group data, apply
aggregation functions, and perform advanced computations.
This section covers:

7.1 The GroupBy Mechanism

7.2 Advanced Aggregations

7.1 The GroupBy Mechanism


The groupby() function in Pandas is used to split data into groups based on a
column’s values and then apply aggregation, transformation, or filtration functions.

Grouping Data and Applying Aggregation Functions


Basic Grouping

groupby() creates a GroupBy object that can be used for aggregation.

Example:

import pandas as pd

# Sample DataFrame
data = {'Department': ['HR', 'IT', 'HR', 'IT', 'Finance', 'Finance'],
        'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
        'Salary': [50000, 70000, 52000, 80000, 60000, 75000]}

df = pd.DataFrame(data)

# Group by 'Department' and calculate the average salary
avg_salary = df.groupby('Department')['Salary'].mean()
print(avg_salary)

Common Aggregation Functions

mean() , sum() , count() , min() , max() , std() , var() .

Example:

df_grouped = df.groupby('Department')['Salary'].agg(['mean', 'sum', 'count'])
print(df_grouped)

Grouping by Multiple Columns

Example:

df_grouped = df.groupby(['Department', 'Employee'])['Salary'].sum()
print(df_grouped)

Transformation vs. Aggregation vs. Filtration


Pandas groupby() supports three main operations:

1. Aggregation ( agg() )
Reduces each group to a single value.

Example:

df_agg = df.groupby('Department')['Salary'].agg('sum')
print(df_agg)

2. Transformation ( transform() )
Returns a result of the same shape as the original data.



Useful for normalization and filling missing values within groups.

Example:

df['Salary_mean'] = df.groupby('Department')['Salary'].transform('mean')
print(df)
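
Because transform() preserves shape, it also supports within-group normalization and group-wise imputation; a hedged sketch (the new column names are illustrative):

group_mean = df.groupby('Department')['Salary'].transform('mean')
group_std = df.groupby('Department')['Salary'].transform('std')
df['Salary_z'] = (df['Salary'] - group_mean) / group_std  # Z-score within each department
df['Salary_filled'] = df['Salary'].fillna(group_mean)  # Fill missing salaries with the group mean
print(df)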

3. Filtration ( filter() )
Removes groups based on a condition.

Example:

df_filtered = df.groupby('Department').filter(lambda x: x['Salary'].mean() > 60000)
print(df_filtered)

7.2 Advanced Aggregations


Pandas allows for custom aggregation functions and multiple operations on
grouped data.

Custom Aggregation Functions


Using a Custom Function in agg()

Define a custom function and apply it to each group.

Example:

def range_diff(x):
return x.max() - x.min()

df_grouped = df.groupby('Department')['Salary'].agg(range_diff)
print(df_grouped)

Lambda Functions Inside agg()

Example:



df_grouped = df.groupby('Department')['Salary'].agg(lambda x: x.max() - x.min())
print(df_grouped)

Working with Multiple Aggregation Operations


Applying Multiple Functions to a Column

Use a list of functions inside agg() .

Example:

df_grouped = df.groupby('Department')['Salary'].agg(['mean', 'sum', 'min', 'max'])
print(df_grouped)

Applying Different Functions to Different Columns

Use a dictionary inside agg() .

Example:

df_grouped = df.groupby('Department').agg({'Salary': ['mean', 'sum'], 'Employee': 'count'})
print(df_grouped)

Grouping and Sorting


Sorting After Grouping

Example:

df_grouped = df.groupby('Department')['Salary'].mean().sort_values(ascending=False)
print(df_grouped)

Resetting Index After Grouping

Example:



df_grouped = df.groupby('Department')['Salary'].mean().reset_index()
print(df_grouped)

VIII. Time Series Analysis


Time series data is a sequence of data points recorded at specific time intervals.
Pandas provides robust tools for working with time series data, including time-
based indexing, resampling, and visualization.

This section covers:

8.1 Handling Time Series Data

8.2 Resampling and Frequency Conversion

8.3 Time Series Visualization and Analysis

8.1 Handling Time Series Data


Date and Time Indexing
Pandas allows setting timestamps as index values, enabling easy selection,
filtering, and analysis.

Creating a DateTime Index from a Column

import pandas as pd

data = {'Date': ['2023-01-01', '2023-02-01', '2023-03-01'],
        'Sales': [200, 250, 300]}

df = pd.DataFrame(data)

# Convert 'Date' column to DateTime and set as index
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
print(df)

Generating a Range of Dates ( date_range )

date_range = pd.date_range(start='2023-01-01', periods=6, freq='M')  # In pandas >= 2.2, use freq='ME' for month-end
print(date_range)

Selecting Data by Date Index

# Select data for a specific date
print(df.loc['2023-02-01'])

# Select data for a date range
print(df.loc['2023-01-01':'2023-03-01'])
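
DateTime indexes also support partial-string selection, so a whole month or year can be selected at once:

print(df.loc['2023-01'])  # All rows in January 2023
print(df.loc['2023'])     # All rows in 2023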

Parsing Dates and Time Zones


Parsing Date Formats Automatically

df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')

Handling Different Time Zones

df.index = df.index.tz_localize('UTC')  # Set to UTC
df.index = df.index.tz_convert('Asia/Kolkata')  # Convert to IST
print(df)

8.2 Resampling and Frequency Conversion


Upsampling and Downsampling Techniques
Resampling changes the frequency of time series data.

Downsampling (Reducing Frequency)



df_resampled = df.resample('M').sum() # Monthly total sales
print(df_resampled)

Upsampling (Increasing Frequency with Interpolation)

df_upsampled = df.resample('D').interpolate()  # Daily frequency with interpolation
print(df_upsampled)
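
Interpolation estimates intermediate values; when the last known value should simply be carried forward instead, forward filling is a common alternative (a brief sketch):

df_ffill = df.resample('D').ffill()  # Repeat the last observation for each new day
print(df_ffill)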

Rolling Statistics and Window Functions


Rolling Mean (Moving Average)

df['Rolling_Mean'] = df['Sales'].rolling(window=2).mean()
print(df)

Expanding Window (Cumulative Sum)

df['Cumulative_Sum'] = df['Sales'].expanding().sum()
print(df)

8.3 Time Series Visualization and Analysis


Plotting Trends Over Time
Simple Line Plot of Time Series Data

import matplotlib.pyplot as plt

df.plot(y='Sales', title='Sales Trend Over Time')
plt.show()

Rolling Mean Visualization



df['Rolling_Mean'].plot(label='Rolling Mean', legend=True)
df['Sales'].plot(label='Original Sales', legend=True)
plt.show()

Seasonal Decomposition
Decomposing Time Series into Trend, Seasonal, and Residual Components

from statsmodels.tsa.seasonal import seasonal_decompose

# Note: decomposition needs at least two full seasonal cycles,
# so the series should be longer than the 3-row sample above
result = seasonal_decompose(df['Sales'], model='additive', period=2)
result.plot()
plt.show()

IX. Performance Optimization


Pandas is a powerful library, but working with large datasets can lead to slow
performance and high memory usage. Optimizing computations and memory
usage is crucial for handling big data efficiently.
This section covers:

9.1 Vectorization and Efficient Computation

9.2 Memory Management

9.3 Working with Large Datasets

9.1 Vectorization and Efficient Computation


Vectorization speeds up computations by applying operations to entire arrays
instead of iterating through elements with loops. Pandas leverages NumPy's
optimized array computations for efficiency.



Avoiding Python Loops: Leveraging Pandas and NumPy Operations
Inefficient Python Loop:

import pandas as pd
import numpy as np
import time

df = pd.DataFrame({'A': np.random.randint(1, 100, 10)})

# Using a loop (slow)
start = time.time()
df['B'] = [x * 2 for x in df['A']]
print("Loop Time:", time.time() - start)

Efficient Vectorized Operation:

# Using Pandas vectorized operation (fast)
start = time.time()
df['B'] = df['A'] * 2
print("Vectorized Time:", time.time() - start)

Using .apply() , .map() , and .applymap() Wisely

Using .apply() for Column-wise Operations


Slow Loop-Based Approach:

def square(x):
return x ** 2

df['Square_Loop'] = [square(x) for x in df['A']]

Faster apply() Approach:



df['Square_Apply'] = df['A'].apply(square)

Using .map() for Element-wise Operations in Series


Example:

df['Mapped'] = df['A'].map(lambda x: x ** 2)

Using .applymap() for Element-wise Operations in DataFrames


Example:

df = pd.DataFrame(np.random.randint(1, 10, (3, 3)), columns=['A', 'B', 'C'])

df_squared = df.applymap(lambda x: x ** 2)  # In pandas >= 2.1, DataFrame.map() is the preferred name

✅ Best Practice: Use vectorized operations whenever possible instead of .apply()

or .map() for performance gains.
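
For instance, a conditional label that might be written with .apply() can usually be expressed with np.where; a sketch using the numeric column 'A' from the DataFrame above (the threshold is illustrative):

df['Label_apply'] = df['A'].apply(lambda x: 'high' if x > 5 else 'low')  # Row-wise, slower
df['Label_vec'] = np.where(df['A'] > 5, 'high', 'low')  # Vectorized, faster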

9.2 Memory Management


Large datasets consume a lot of memory, which can slow down operations.
Optimizing data types reduces memory usage significantly.

Optimizing Data Types for Large Datasets


Checking Data Types and Memory Usage:

print(df.info(memory_usage='deep'))

Optimizing Integer and Float Data Types:

df['A'] = df['A'].astype('int16')  # Use 'int16' instead of 'int64'
df['C'] = df['C'].astype('float32')  # Use 'float32' instead of 'float64'

Converting Strings to Categorical Type (Huge Memory Savings):



df['Category'] = df['Category'].astype('category')  # Assumes a string column named 'Category'

Best Practices for Handling Big Data with Pandas


Use astype() to downcast data types to smaller types ( int8 , float32 ).

Drop unnecessary columns to reduce memory footprint.

Convert string columns to categorical to save memory.

Use chunking when reading large files (e.g., pd.read_csv(..., chunksize=10000) ).
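
A compact, hedged sketch combining the first three ideas (column names are illustrative):

df['A'] = pd.to_numeric(df['A'], downcast='integer')  # e.g., int64 -> int8 where safe
df = df.drop(columns=['B'])  # Drop a column that is no longer needed
df['C'] = df['C'].astype('category')  # If 'C' held repetitive strings
print(df.info(memory_usage='deep'))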

9.3 Working with Large Datasets


When datasets become too large to fit into memory, alternative techniques are
required.

Techniques for Out-of-Core Computation


Using chunksize for Processing Large CSV Files

chunk_iter = pd.read_csv('large_data.csv', chunksize=10000)

for chunk in chunk_iter:
    print(chunk.shape)

Processing Data in Chunks and Aggregating Results

# Re-create the iterator: the previous loop already consumed it
chunk_iter = pd.read_csv('large_data.csv', chunksize=10000)

total_sum = 0
for chunk in chunk_iter:
    total_sum += chunk['column_name'].sum()
print("Total Sum:", total_sum)

Introduction to Dask or Other Scalable Libraries


Pandas loads data into memory, but Dask is an alternative that allows working
with datasets larger than RAM.

Installing Dask:



pip install dask

Using Dask Instead of Pandas for Large DataFrames

import dask.dataframe as dd

df_large = dd.read_csv('large_data.csv')
print(df_large.head())  # Works like Pandas, but data is loaded lazily
print(df_large['column_name'].mean().compute())  # Aggregations need .compute() to run

✅ Best Practice: Use Dask when handling large datasets that do not fit into
memory.

X. Advanced DataFrame Operations


Pandas provides advanced functionality for handling complex data structures like
multi-index DataFrames and performance tuning. These techniques allow for
better data organization, efficient lookups, and optimized computations.
This section covers:

Working with Multi-Index DataFrames

Customizing DataFrame Display and Performance Tuning

Working with Multi-Index DataFrames


Multi-indexing allows for more complex and hierarchical data structures by using
multiple levels of row and/or column indices.

Creating a Multi-Index DataFrame

1. Using set_index() with Multiple Columns


Convert multiple columns into a hierarchical index.



import pandas as pd

data = {
    'Country': ['USA', 'USA', 'Canada', 'Canada'],
    'City': ['New York', 'Los Angeles', 'Toronto', 'Vancouver'],
    'Population': [8419600, 3980400, 2930000, 631490]
}

df = pd.DataFrame(data)
df.set_index(['Country', 'City'], inplace=True) # Set multi-index
print(df)

2. Creating Multi-Index Using pd.MultiIndex.from_tuples()


Explicitly define a hierarchical index.

index = pd.MultiIndex.from_tuples([
    ('USA', 'New York'),
    ('USA', 'Los Angeles'),
    ('Canada', 'Toronto'),
    ('Canada', 'Vancouver')
], names=['Country', 'City'])

df = pd.DataFrame({'Population': [8419600, 3980400, 2930000, 631490]}, index=index)
print(df)

Selecting and Filtering Data in Multi-Index DataFrames


Accessing Data for a Specific Index Level

print(df.loc['USA']) # Select all cities in the USA


print(df.loc[('USA', 'New York')]) # Select New York data

Filtering by Specific Index Levels



print(df.xs('USA', level='Country')) # Get all cities in USA

Selecting Data Using slice

from pandas import IndexSlice


print(df.loc[IndexSlice[:, 'New York'], :]) # Get all data for New York

Resetting and Reordering Multi-Index


Reset Multi-Index to Columns ( reset_index() )

df_reset = df.reset_index()
print(df_reset)

Swapping Index Levels ( swaplevel() )

df_swapped = df.swaplevel()
print(df_swapped)

Sorting Multi-Index ( sort_index() )

df_sorted = df.sort_index(level=['Country', 'City'])
print(df_sorted)

Customizing DataFrame Display and Performance Tuning

Customizing DataFrame Display
Changing Display Options in Pandas

pd.set_option('display.max_rows', 50)  # Set max rows displayed
pd.set_option('display.max_columns', 20)  # Set max columns displayed
pd.set_option('display.float_format', '{:.2f}'.format)  # Set float precision
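
Options can also be applied temporarily with a context manager, restoring the defaults afterwards:

with pd.option_context('display.max_rows', 10):
    print(df)  # The temporary setting applies only inside this block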

Displaying Large DataFrames Using to_string()

print(df.to_string()) # Display entire DataFrame

Styling DataFrames for Better Readability ( style )

df.style.set_properties(**{'background-color': 'lightgray', 'color': 'black'})  # Returns a Styler; renders in notebooks

Performance Tuning for Large DataFrames


Convert Data Types to Reduce Memory Usage

df['Population'] = df['Population'].astype('int32')  # Convert to a smaller data type
print(df.info(memory_usage='deep'))

Using query() for Fast Filtering

df_filtered = df.query("Population > 3000000")
print(df_filtered)

Optimizing apply() with Vectorized Operations

df['Population_Squared'] = df['Population'] ** 2 # Vectorized operation

