Learn Pandas
Pandas is a powerful Python library for data manipulation and analysis. It’s often used
in data science and machine learning, allowing you to work with data in tables (like
Excel spreadsheets).
Pandas is a library that helps you work with data more easily. Data is often
structured in tables (rows and columns), and Pandas makes it easy to:
Load data from files like CSV and Excel.
Clean and transform data.
Filter, group, and summarize data.
Visualize data.
If you don’t have Pandas installed yet, you can install it using pip:
pip install pandas
You’ll typically start by loading data from a file like a CSV or Excel sheet into a Pandas
DataFrame.
import pandas as pd
data = pd.read_csv('data.csv')
print(data.head())
The data.head() function shows the first 5 rows of the table. You can replace 'data.csv'
with the path to your actual file.
Once the data is loaded into a DataFrame, you can explore it by checking its structure
and some summary statistics.
Check the shape of the data
This tells you how many rows and columns are in the table.
print(data.shape)
Check column types and non-null counts
This is useful for knowing what type of data each column contains (numbers, text,
etc.).
print(data.info())
Describe numerical data
You can get summary statistics like the mean, standard deviation, and more for the
numerical columns.
print(data.describe())
Pandas makes it super easy to select specific rows and columns or filter data based
on conditions.
Select a single column (Series)
age_column = data['Age']
print(age_column)
Select multiple columns (DataFrame)
If you need more than one column, you pass a list of column names:
subset = data[['Name', 'Age']]
print(subset)
Filtering rows based on a condition
To filter data based on a condition, use a comparison operator (like ==, >, <, etc.).
filtered_data = data[data['Age'] > 30]
print(filtered_data)
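Conditions can also be combined. A small sketch (the column names and values here are illustrative, standing in for data.csv):

```python
import pandas as pd

# Small sample frame standing in for data.csv
data = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 32, 17, 45],
    'City': ['London', 'Paris', 'London', 'Berlin'],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses
adults_in_london = data[(data['Age'] >= 18) & (data['City'] == 'London')]
print(adults_in_london)
```

Here only Alice matches both conditions: Charlie is in London but under 18, and Bob is an adult but in Paris.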
Check for missing values
This counts how many values are missing (NaN) in each column.
print(data.isnull().sum())
Drop rows with missing values
If you want to remove rows that have missing data, use dropna().
clean_data = data.dropna()
Fill missing values
You can also fill missing values with a specific value or the mean of the column.
data['Age'] = data['Age'].fillna(data['Age'].mean())
Grouping data is useful when you want to calculate statistics based on categories. For
example, you can group people by gender and find the average age for each group.
# Group data by 'Gender' and calculate the average age for each group
grouped_data = data.groupby('Gender')['Age'].mean()
print(grouped_data)
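groupby() can also compute several statistics in one pass with .agg(). A minimal sketch, assuming the same 'Gender' and 'Age' columns used above:

```python
import pandas as pd

# Illustrative frame with 'Gender' and 'Age' columns
data = pd.DataFrame({
    'Gender': ['F', 'M', 'F', 'M', 'F'],
    'Age': [25, 32, 40, 28, 19],
})

# Several aggregates in one pass: one column per statistic
stats = data.groupby('Gender')['Age'].agg(['mean', 'min', 'max', 'count'])
print(stats)
```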
Sorting data
Use sort_values() to order rows by a column.
sorted_data = data.sort_values(by='Age', ascending=False)
print(sorted_data)
You can create new columns based on existing ones. For example, if you want to
create a new column that shows whether someone is an adult (age >= 18):
data['Is_Adult'] = data['Age'] >= 18
print(data.head())
Step 10: Exporting Data
After modifying your data, you may want to save it back to a file.
data.to_csv('cleaned_data.csv', index=False)
Once you’ve mastered the basics, there are more advanced features in Pandas to
explore:
To see Pandas in action, here’s a small example using a dataset about Titanic
passengers.
import pandas as pd
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic_data = pd.read_csv(url)
# Filter passengers older than 30
older_passengers = titanic_data[titanic_data['Age'] > 30]
print(older_passengers.head())
average_age_by_sex = titanic_data.groupby('Sex')['Age'].mean()
print(average_age_by_sex)
# Fill missing ages with the median age
titanic_data['Age'] = titanic_data['Age'].fillna(titanic_data['Age'].median())
Pivot Tables are like magic tables that help you summarize and analyze data quickly.
Imagine you have a big table of data, and you want to see summaries based on
certain categories. Pivot tables make this easy!
You can aggregate data (like finding sums, averages) based on categories.
Let's use a simple dataset to understand pivot tables. Suppose you have sales data
for different products in different regions.
import pandas as pd
data = {
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'Product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 250, 300, 350, 400, 450]
}
# Create a DataFrame
df = pd.DataFrame(data)
print(df)
Output:
Region Product Sales
0 North A 100
1 South B 150
2 East A 200
3 West B 250
4 North A 300
5 South B 350
6 East A 400
7 West B 450
# Total sales per region, split by product
pivot_table = df.pivot_table(values='Sales', index='Region', columns='Product', aggfunc='sum')
print(pivot_table)
Output:
Product A B
Region
East 600.0 NaN
North 400.0 NaN
South NaN 500.0
West NaN 700.0
Handling Missing Values: Sometimes, some combinations might not have data. You
can fill these NaN (Not a Number) values with zeros or any other value.
pivot_table_filled = pivot_table.fillna(0)
print(pivot_table_filled)
Output:
Product A B
Region
East 600.0 0.0
North 400.0 0.0
South 0.0 500.0
West 0.0 700.0
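pivot_table() is not limited to sums: you can swap the aggregation function and add row/column totals with margins. A sketch using the same sales data:

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'Product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 250, 300, 350, 400, 450],
})

# Average sales instead of totals, plus an 'All' row/column of grand averages
pivot_mean = df.pivot_table(values='Sales', index='Region', columns='Product',
                            aggfunc='mean', margins=True, margins_name='All')
print(pivot_mean)
```

The 'All' row and column hold averages over the raw rows, so the bottom-right cell is the mean of all eight sales figures.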
Merging and joining are ways to combine two or more DataFrames based on a
common key (like merging two tables in a database).
What is Merging?
Think of it as putting together puzzle pieces that fit based on shared edges.
Let's say you have two separate tables: one with employee information and another
with their salaries.
Employee Data:
# Employee information
employees = {
    'EmployeeID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Department': ['HR', 'IT', 'Finance', 'IT']
}
# Create a DataFrame
df_employees = pd.DataFrame(employees)
print(df_employees)
Output:
EmployeeID Name Department
0 1 Alice HR
1 2 Bob IT
2 3 Charlie Finance
3 4 David IT
Salary Data:
# Salary information
salaries = {
    'EmployeeID': [1, 2, 3, 4],
    'Salary': [50000, 60000, 55000, 65000]
}
# Create a DataFrame
df_salaries = pd.DataFrame(salaries)
print(df_salaries)
Output:
EmployeeID Salary
0 1 50000
1 2 60000
2 3 55000
3 4 65000
# Merge on the common 'EmployeeID' column (inner join by default)
df_merged = pd.merge(df_employees, df_salaries, on='EmployeeID')
print(df_merged)
Output:
EmployeeID Name Department Salary
0 1 Alice HR 50000
1 2 Bob IT 60000
2 3 Charlie Finance 55000
3 4 David IT 65000
Explanation:
1. Inner Join (default):
o Includes only rows with matching keys in both DataFrames.
o Syntax: pd.merge(left, right, on='key', how='inner')
2. Left Join:
o Includes all rows from the left DataFrame and matching rows from the
right.
o Syntax: pd.merge(left, right, on='key', how='left')
3. Right Join:
o Includes all rows from the right DataFrame and matching rows from the
left.
o Syntax: pd.merge(left, right, on='key', how='right')
4. Outer Join:
o Includes all rows from both DataFrames, filling NaN where there are no
matches.
o Syntax: pd.merge(left, right, on='key', how='outer')
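The four join types are easiest to compare side by side on two tiny frames (the key values below are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'L': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'R': ['x', 'y', 'z']})

inner = pd.merge(left, right, on='key', how='inner')    # only keys 2, 3
outer = pd.merge(left, right, on='key', how='outer')    # keys 1-4, NaN where absent
left_j = pd.merge(left, right, on='key', how='left')    # keys 1, 2, 3
right_j = pd.merge(left, right, on='key', how='right')  # keys 2, 3, 4
print(inner, outer, sep='\n\n')
```

Only the keys shared by both frames (2 and 3) survive the inner join, while the outer join keeps everything and pads the gaps with NaN.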
# Employee information with an extra employee who has no salary record
employees_extra = {
    'EmployeeID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'Marketing']
}
df_employees_extra = pd.DataFrame(employees_extra)
print(df_employees_extra)
# Left join: keep every employee, even those without a salary match
df_left_join = pd.merge(df_employees_extra, df_salaries, on='EmployeeID', how='left')
print(df_left_join)
Output:
EmployeeID Name Department
0 1 Alice HR
1 2 Bob IT
2 3 Charlie Finance
3 4 David IT
4 5 Eve Marketing
EmployeeID Name Department Salary
0 1 Alice HR 50000.0
1 2 Bob IT 60000.0
2 3 Charlie Finance 55000.0
3 4 David IT 65000.0
4 5 Eve Marketing NaN
Explanation: Eve has no matching salary record, so her Salary is NaN after the left join.
Time-Series Data involves data points indexed in time order. It's common in fields
like finance, weather forecasting, and more.
import pandas as pd
import numpy as np
# Create a DataFrame of daily temperatures
dates = pd.date_range('2024-01-01', periods=7)
df_temp = pd.DataFrame({'Date': dates, 'Temperature': [25, 30, 22, 28, 33, 27, 24]})
print(df_temp)
Output:
Date Temperature
0 2024-01-01 25
1 2024-01-02 30
2 2024-01-03 22
3 2024-01-04 28
4 2024-01-05 33
5 2024-01-06 27
6 2024-01-07 24
Setting the Date as Index:
df_temp.set_index('Date', inplace=True)
print(df_temp)
Output:
Date Temperature
2024-01-01 25
2024-01-02 30
2024-01-03 22
2024-01-04 28
2024-01-05 33
2024-01-06 27
2024-01-07 24
Resampling Time-Series Data: Suppose you want to see the average temperature
for each week (though our example is just one week).
weekly_avg = df_temp.resample('W').mean()
print(weekly_avg)
Output:
Date Temperature
2024-01-07 27.0
Explanation: resample('W') groups the dates into weekly buckets (ending on Sunday) and mean() averages each bucket.
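The same resample() call accepts other frequency rules. For instance, a 2-day average over the temperature data above (rebuilt here so the snippet is self-contained):

```python
import pandas as pd

# Same daily temperatures as above, with the dates as the index
df_temp = pd.DataFrame(
    {'Temperature': [25, 30, 22, 28, 33, 27, 24]},
    index=pd.date_range('2024-01-01', periods=7, name='Date'),
)

# Mean temperature for every 2-day bucket
two_day_avg = df_temp.resample('2D').mean()
print(two_day_avg)
```

The seven days fall into four buckets (the last bucket holds only one day).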
Categorical Data refers to variables that can be divided into groups or categories,
like gender, colors, or departments.
Employee Data:
import pandas as pd
# Employee information
employees = {
    'EmployeeID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'Marketing']
}
df_employees = pd.DataFrame(employees)
print(df_employees)
Output:
EmployeeID Name Department
0 1 Alice HR
1 2 Bob IT
2 3 Charlie Finance
3 4 David IT
4 5 Eve Marketing
Converting to Categorical:
df_employees['Department'] = df_employees['Department'].astype('category')
print(df_employees.dtypes)
Output:
EmployeeID int64
Name object
Department category
dtype: object
Benefits:
Lower memory usage, since each unique category is stored only once.
Faster grouping, sorting, and comparisons on the column.
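A quick way to see the memory benefit is to compare the same column stored as plain object strings versus category. The exact byte counts depend on your pandas version, so only the direction of the difference matters:

```python
import pandas as pd

# A column with many repeated labels
depts = pd.Series(['HR', 'IT', 'Finance', 'IT'] * 25000)

as_object = depts.memory_usage(deep=True)
as_category = depts.astype('category').memory_usage(deep=True)
print(f"object: {as_object:,} bytes, category: {as_category:,} bytes")
```

With only a handful of unique labels, the categorical version stores each label once plus a small integer code per row, so it is much smaller.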
Applying Functions allows you to perform operations on your data easily. Pandas
provides methods like apply() and map() to apply functions to DataFrame columns or
Series.
What is apply()?
apply() runs a function on each value of a column (or on each row of a DataFrame).
What is map()?
map() substitutes each value in a Series using a function, dictionary, or Series.
import pandas as pd
employees = {
    'EmployeeID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Department': ['HR', 'IT', 'Finance', 'IT'],
    'Salary': [50000, 60000, 55000, 65000]
}
df_employees = pd.DataFrame(employees)
print(df_employees)
Output:
EmployeeID Name Department Salary
0 1 Alice HR 50000
1 2 Bob IT 60000
2 3 Charlie Finance 55000
3 4 David IT 65000
# Give everyone a 10% bonus (the rate is an example)
def calculate_bonus(salary):
    return salary * 0.10
df_employees['Bonus'] = df_employees['Salary'].apply(calculate_bonus)
print(df_employees)
# Map department names to short codes
dept_mapping = {
    'HR': 'HR',
    'IT': 'IT',
    'Finance': 'Fin',
    'Marketing': 'Mkt'
}
df_employees['Dept_Short'] = df_employees['Department'].map(dept_mapping)
print(df_employees)
Output:
EmployeeID Name Department Salary Bonus Dept_Short
0 1 Alice HR 50000 5000.0 HR
1 2 Bob IT 60000 6000.0 IT
2 3 Charlie Finance 55000 5500.0 Fin
3 4 David IT 65000 6500.0 IT
Explanation:
map() is ideal for substituting each value in a Series with another value.
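apply() also works row-wise with axis=1, which is handy when a calculation needs several columns at once. A small sketch with illustrative columns:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Salary': [50000, 60000],
    'Bonus': [5000, 6000],
})

# axis=1 passes each row (a Series) to the function
df['Total'] = df.apply(lambda row: row['Salary'] + row['Bonus'], axis=1)
print(df)
```

For simple arithmetic like this, plain vectorized code (`df['Salary'] + df['Bonus']`) is faster; row-wise apply earns its keep when the per-row logic is genuinely more complex.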
While Pandas itself has basic plotting capabilities, it's often used alongside Matplotlib
for more advanced visualizations. However, Pandas makes it easy to create simple
plots directly.
Sales Data:
import pandas as pd
import matplotlib.pyplot as plt
data = {
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'Product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 250, 300, 350, 400, 450]
}
# Create a DataFrame
df = pd.DataFrame(data)
print(df)
Output:
Region Product Sales
0 North A 100
1 South B 150
2 East A 200
3 West B 250
4 North A 300
5 South B 350
6 East A 400
7 West B 450
# Build the pivot table, then plot total sales per region
pivot_table = df.pivot_table(values='Sales', index='Region', columns='Product', aggfunc='sum')
pivot_table.plot(kind='bar')
plt.xlabel('Region')
plt.ylabel('Total Sales')
plt.xticks(rotation=0)
plt.show()
Explanation:
Visual Output: A bar chart showing total sales for Product A and B in each region.
Handling large datasets efficiently is crucial in data analysis. Pandas provides tools to
manage and manipulate large amounts of data without slowing down your computer.
Tips for Handling Large Datasets:
Read the file in chunks with the chunksize parameter.
Load only the columns you need with usecols.
Use memory-efficient dtypes (e.g. category for repeated strings).
Suppose you have a large CSV file named large_data.csv. Here's how you can read it
in chunks.
import pandas as pd
results = []
# Process the file one chunk at a time instead of loading it all at once
for chunk in pd.read_csv('large_data.csv', chunksize=10000):
    mean_value = chunk['Value'].mean()
    results.append(mean_value)
Note: Replace 'Value' with the actual column name you want to analyze.
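Chunking is not the only lever: loading fewer columns and choosing compact dtypes also cut memory. A sketch (the file and column names are illustrative; a tiny stand-in file is written first so the snippet runs on its own):

```python
import pandas as pd

# Create a small stand-in for large_data.csv (columns are illustrative)
pd.DataFrame({
    'Region': ['North', 'South'] * 50,
    'Value': range(100),
    'Notes': ['x'] * 100,
}).to_csv('large_data.csv', index=False)

# Load only the needed columns, with compact dtypes
df = pd.read_csv(
    'large_data.csv',
    usecols=['Region', 'Value'],
    dtype={'Region': 'category', 'Value': 'int32'},
)
print(df.dtypes)
```

Skipping the unused 'Notes' column and storing 'Region' as a category can shrink memory use substantially on real files with millions of rows.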
After analyzing and manipulating your data, you might want to save your results or
share them with others. Pandas makes it easy to export and import data in various
formats.
Exporting to CSV:
df.to_csv('sales_data.csv', index=False)
Exporting to Excel:
df.to_excel('sales_data.xlsx', index=False)
Exporting to JSON:
df.to_json('sales_data.json')
Importing from CSV:
df_csv = pd.read_csv('sales_data.csv')
print(df_csv.head())
Importing from Excel:
df_excel = pd.read_excel('sales_data.xlsx')
print(df_excel.head())
Importing from JSON:
df_json = pd.read_json('sales_data.json')
print(df_json.head())
import pandas as pd
data = {
    'EmployeeID': [1, 2, 2, 3, 4, 4, 4],
    'Name': ['Alice', 'Bob', 'Bob', 'Charlie', 'David', 'David', 'David'],
    'Department': ['HR', 'IT', 'IT', 'Finance', 'IT', 'IT', 'IT'],
    'Salary': [50000, 60000, 60000, 55000, 65000, 65000, 65000]
}
# Create a DataFrame
df_duplicates = pd.DataFrame(data)
print(df_duplicates)
Output:
EmployeeID Name Department Salary
0 1 Alice HR 50000
1 2 Bob IT 60000
2 2 Bob IT 60000
3 3 Charlie Finance 55000
4 4 David IT 65000
5 4 David IT 65000
6 4 David IT 65000
Removing Duplicates:
df_no_duplicates = df_duplicates.drop_duplicates()
print(df_no_duplicates)
Output:
EmployeeID Name Department Salary
0 1 Alice HR 50000
1 2 Bob IT 60000
3 3 Charlie Finance 55000
4 4 David IT 65000
Explanation:
keep='first' (default): Keeps the first occurrence and removes the rest.
You can also drop duplicates based on specific columns only:
df_unique_employee = df_duplicates.drop_duplicates(subset='EmployeeID')
print(df_unique_employee)
Output:
EmployeeID Name Department Salary
0 1 Alice HR 50000
1 2 Bob IT 60000
3 3 Charlie Finance 55000
4 4 David IT 65000
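drop_duplicates() also accepts keep='last' and keep=False, which behave differently from the default. A small sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 2, 3], 'Name': ['Alice', 'Bob', 'Bob', 'Carol']})

last_kept = df.drop_duplicates(keep='last')  # keep the last copy of each duplicate
no_dups = df.drop_duplicates(keep=False)     # drop every row that has a duplicate
print(last_kept, no_dups, sep='\n\n')
```

keep=False is the strictest option: both Bob rows disappear entirely, leaving only the rows that were unique to begin with.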
Step 20: Handling Missing Data in Pandas
Data often comes with missing values (represented as NaN). Handling missing data
is crucial for accurate analysis.
import pandas as pd
import numpy as np
data = {
    'EmployeeID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Department': ['HR', 'IT', np.nan, 'IT'],
    'Salary': [50000, 60000, 55000, np.nan]
}
# Create a DataFrame
df_missing = pd.DataFrame(data)
print(df_missing)
Output:
EmployeeID Name Department Salary
0 1 Alice HR 50000.0
1 2 Bob IT 60000.0
2 3 Charlie NaN 55000.0
3 4 David IT NaN
Check for missing values
print(df_missing.isnull().sum())
Output:
EmployeeID 0
Name 0
Department 1
Salary 1
dtype: int64
Drop rows with missing values
df_dropped = df_missing.dropna()
print(df_dropped)
Output:
EmployeeID Name Department Salary
0 1 Alice HR 50000.0
1 2 Bob IT 60000.0
Fill missing values
df_filled = df_missing.fillna({
    'Department': 'Unknown',
    'Salary': df_missing['Salary'].mean()
})
print(df_filled)
Output:
EmployeeID Name Department Salary
0 1 Alice HR 50000.0
1 2 Bob IT 60000.0
2 3 Charlie Unknown 55000.0
3 4 David IT 55000.0
Explanation:
o Forward Fill (ffill): Fills missing values with the previous value.
o Backward Fill (bfill): Fills missing values with the next value.
# Forward fill
df_ffill = df_missing.ffill()
print(df_ffill)
# Backward fill
df_bfill = df_missing.bfill()
print(df_bfill)
Output (forward fill):
EmployeeID Name Department Salary
0 1 Alice HR 50000.0
1 2 Bob IT 60000.0
2 3 Charlie IT 55000.0
3 4 David IT 55000.0
Output (backward fill):
EmployeeID Name Department Salary
0 1 Alice HR 50000.0
1 2 Bob IT 60000.0
2 3 Charlie IT 55000.0
3 4 David IT NaN
Note that backward fill leaves David's Salary as NaN, because there is no later value to copy from.
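For numeric columns there is a third option besides dropping rows and constant fills: interpolation, which estimates a missing value from its neighbours.

```python
import pandas as pd
import numpy as np

s = pd.Series([50000.0, np.nan, 60000.0])

# Linear interpolation fills the gap halfway between its neighbours
filled = s.interpolate()
print(filled)
```

This makes most sense for ordered data (such as time series), where a value between two known observations is a reasonable estimate.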
Concatenation stacks DataFrames on top of each other (more rows) or side by side (more columns).
Common Operations:
pd.concat([df1, df2]) – stack vertically (axis=0)
pd.concat([df1, df2], axis=1) – stack horizontally
DataFrames to Concatenate:
import pandas as pd
# First DataFrame
data1 = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
}
df1 = pd.DataFrame(data1)
# Second DataFrame
data2 = {
    'Name': ['David', 'Eve'],
    'Age': [40, 45]
}
df2 = pd.DataFrame(data2)
print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)
Output:
DataFrame 1:
Name Age
0 Alice 25
1 Bob 30
2 Charlie 35
DataFrame 2:
Name Age
0 David 40
1 Eve 45
# Concatenate vertically, renumbering the index
df_concat = pd.concat([df1, df2], ignore_index=True)
print("\nConcatenated DataFrame:")
print(df_concat)
Output:
Concatenated DataFrame:
Name Age
0 Alice 25
1 Bob 30
2 Charlie 35
3 David 40
4 Eve 45
Explanation:
pd.concat(): Concatenates DataFrames along an axis.
ignore_index=True: Renumbers the combined rows 0, 1, 2, … instead of keeping each frame's original index.
# Third DataFrame
data3 = {
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df3 = pd.DataFrame(data3)
print("\nDataFrame 3:")
print(df3)
# Concatenate horizontally (align rows by index)
df_hconcat = pd.concat([df1, df3], axis=1)
print(df_hconcat)
Output:
DataFrame 3:
City
0 New York
1 Los Angeles
2 Chicago
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
Explanation: With axis=1, rows are aligned by index, so each city lines up with the name and age at the same position.
Key Concepts:
Convert date columns with pd.to_datetime() and set them as the index.
Resample with resample() to change the frequency (daily to weekly, etc.).
import pandas as pd
import numpy as np
# Create a DataFrame of daily stock prices
dates = pd.date_range('2024-01-01', periods=10)
prices = [150, 175, 120, 160, 180, 140, 130, 170, 190, 110]
df_stock = pd.DataFrame({'Date': dates, 'Price': prices})
print(df_stock)
Output:
Date Price
0 2024-01-01 150
1 2024-01-02 175
2 2024-01-03 120
3 2024-01-04 160
4 2024-01-05 180
5 2024-01-06 140
6 2024-01-07 130
7 2024-01-08 170
8 2024-01-09 190
9 2024-01-10 110
df_stock.set_index('Date', inplace=True)
print(df_stock)
Output:
Date Price
2024-01-01 150
2024-01-02 175
2024-01-03 120
2024-01-04 160
2024-01-05 180
2024-01-06 140
2024-01-07 130
2024-01-08 170
2024-01-09 190
2024-01-10 110
weekly_mean = df_stock.resample('W').mean()
print(weekly_mean)
Output:
Date Price
2024-01-07 150.714286
2024-01-14 156.666667
Explanation: resample('W') groups the daily prices into weeks ending on Sunday and
mean() averages each week: the first seven days average to about 150.71 and the last three to about 156.67.
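Rolling windows pair naturally with this kind of daily data. For example, a 3-day moving average over the same prices:

```python
import pandas as pd

prices = [150, 175, 120, 160, 180, 140, 130, 170, 190, 110]
df_stock = pd.DataFrame(
    {'Price': prices},
    index=pd.date_range('2024-01-01', periods=10, name='Date'),
)

# 3-day moving average; the first two rows are NaN (window not yet full)
df_stock['MA3'] = df_stock['Price'].rolling(window=3).mean()
print(df_stock)
```

Unlike resample(), which collapses each bucket to one row, rolling() keeps one row per day and smooths the series in place.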
Interview Questions
1. What is Pandas, and how is it different from NumPy?
Answer:
NumPy is mainly used for numerical computations with arrays, while Pandas is
built on top of NumPy and provides high-level data manipulation tools like
indexing, merging, reshaping, and handling missing data.
2. What is a Series in Pandas?
Answer: A Series is a one-dimensional labeled array that can hold any data type.
Example:
import pandas as pd
s = pd.Series([10, 20, 30])
print(s)
3. What is a DataFrame in Pandas?
Answer: A DataFrame is a two-dimensional labeled data structure with columns that can hold different data types, like a table.
Example:
data = {"Name": ["Alice", "Bob"], "Age": [25, 30]}
df = pd.DataFrame(data)
print(df)
4. How do you read data from a CSV file into a DataFrame?
Answer: You can use the pd.read_csv() method to read data from a CSV file.
df = pd.read_csv("file.csv")
5. How do you handle missing values in a DataFrame?
Answer: You can detect missing values with isnull(), fill them with fillna(), or remove them with dropna().
Example:
df.fillna(0, inplace=True)
df.dropna(inplace=True)
Answer: You can use the pd.merge() function to combine two DataFrames based on a
common column or index.
Example:
merged = pd.merge(df1, df2, on="common_column")
Answer: You can use the groupby() method to group rows based on a column and
then perform aggregate functions like sum(), mean(), count(), etc.
Example:
grouped = df.groupby("column_name")["numeric_column"].sum()
print(grouped)
10. Explain the concept of reshaping in Pandas. How do you pivot and melt
DataFrames?
Answer:
Pivoting converts data from a long format to a wide format, creating a new
DataFrame with rows and columns rearranged.
Melting does the reverse: it converts wide data back to long format, gathering columns into rows.
Answer: apply() applies a function along a column or row, while map() substitutes each value in a Series using a function, dictionary, or Series.
Example:
df["New_Column"] = df["Age"].apply(lambda x: x + 1)
# Map values in a Series (the column and codes here are illustrative)
df["Dept_Code"] = df["Department"].map({"HR": 1, "IT": 2})
Answer: You can use the drop_duplicates() method to remove duplicate rows from a
DataFrame.
df_cleaned = df.drop_duplicates(subset="column_name")
Answer:
You convert date columns with pd.to_datetime() and set them as the index for time-based selection.
You can resample time series data with resample() for frequency conversions,
like daily, monthly, etc.
Example:
df["Date"] = pd.to_datetime(df["Date"])
df.set_index("Date", inplace=True)
resampled_data = df.resample("M").mean()
Answer: concat() stacks DataFrames along an axis, while merge() joins them on common columns or indexes, like a database join.
Example:
# Concatenate
combined = pd.concat([df1, df2])
# Merge
merged = pd.merge(df1, df2, on="common_column")
Answer:
Rolling: Creates a rolling window over data and performs functions like mean(),
sum(), etc.
Expanding: Applies a cumulative function over data (i.e., expanding sum or
mean).
Example:
# Rolling window
df["Rolling_Mean"] = df["Sales"].rolling(window=3).mean()
# Expanding window
df["Cumulative_Sum"] = df["Sales"].expanding().sum()
16. How do you read data from an Excel file into a DataFrame?
Answer: Use pd.read_excel(), optionally specifying the sheet with sheet_name.
Example:
df = pd.read_excel("file.xlsx", sheet_name="Sheet1")
17. How can you set a specific column as the index of a DataFrame?
Answer: You can use the set_index() method to set a column as the index.
df.set_index("column_name", inplace=True)
18. What are some key features of Pandas for handling large datasets?
Answer:
Chunking: Use chunksize while reading large files to load data in parts.
Column selection: Use usecols to load only the columns you need.
Dtypes: Use compact types like category to reduce memory usage.
19. What is the Categorical data type in Pandas?
Answer: The Categorical data type is used for columns that contain a finite number
of unique categories (e.g., gender, yes/no). It improves performance and reduces
memory usage.
Example:
df["Category"] = df["Category"].astype("category")
20. What is the difference between pivot() and pivot_table()?
Answer:
pivot(): Reshapes data but doesn't allow for aggregation of duplicate data.
pivot_table(): Also reshapes data, and aggregates duplicate entries with aggfunc (e.g., mean, sum).
Example:
pivot = df.pivot_table(index="A", columns="B", values="C", aggfunc="mean")
Revision Notes
Basic Concepts:
DataFrame Creation:
pd.DataFrame(data), pd.Series(data)
Reading Files:
pd.read_csv('file.csv'),
pd.read_excel('file.xlsx'),
pd.read_json('file.json')
DataFrame Info:
df.shape – Get shape (rows, columns)
df.columns – Get column names
df.info() – Summary of DataFrame
df.describe() – Descriptive statistics
df.head(n) – First n rows
df.tail(n) – Last n rows
Data Selection:
Selecting Columns:
df['column_name'],
df[['col1', 'col2']]
Boolean Selection:
df[df['column'] > value] – Conditional filtering
df[(df['col1'] > value1) & (df['col2'] == value2)] – Multiple conditions
Modifying DataFrames:
Dropping Columns/Rows:
df.drop(['col1', 'col2'], axis=1) – Drop columns
df.drop([index_label], axis=0) – Drop rows
Reset Index:
df.reset_index(drop=True, inplace=True) – Reset index
Sorting Data:
Sort by Column:
df.sort_values(by='column_name', ascending=False) – Sort by column
Sort by Index:
df.sort_index(ascending=True) – Sort by index
Data Transformation:
Apply Function:
df['col'].apply(function) – Apply function element-wise
Map Values:
df['col'].map({'A': 1, 'B': 2}) – Map values
Replace Values:
df.replace({old_value: new_value}) – Replace specific values
Group by Column:
df.groupby('column').sum() – Group and aggregate data
Aggregate Functions:
df.groupby('column').agg({'col1': 'sum', 'col2': 'mean'}) – Apply multiple
functions to columns
Merging & Concatenation:
Concatenate DataFrames:
pd.concat([df1, df2], axis=0) – Concatenate vertically
pd.concat([df1, df2], axis=1) – Concatenate horizontally
Merge DataFrames:
pd.merge(df1, df2, on='common_column') – Merge on a common column
pd.merge(df1, df2, how='inner/outer/left/right') – Type of join
Pivot:
df.pivot(index='A', columns='B', values='C') – Pivot data
Melt:
pd.melt(df, id_vars=['A'], value_vars=['B', 'C']) – Unpivot data
Transpose:
df.T – Transpose rows and columns
Time Series:
Convert to Datetime:
pd.to_datetime(df['Date']) – Convert to datetime format
Resampling:
df.resample('M').mean() – Resample by Month (M) and calculate mean
Window Functions:
Rolling Window:
df['Rolling_Avg'] = df['col'].rolling(window=3).mean() – Apply rolling mean
Expanding Window:
df['Expanding_Sum'] = df['col'].expanding().sum() – Apply expanding sum
Categorical Data:
Convert to Categorical:
df['col'] = df['col'].astype('category') – Convert to categorical type
Visualization:
Line Plot:
df.plot(kind='line')
Bar Plot:
df.plot(kind='bar')
Histogram:
df['column'].plot(kind='hist')
Box Plot:
df.boxplot(column=['col1', 'col2'])
Advanced Topics:
Categorical Encoding:
pd.get_dummies(df['categorical_column']) – Convert categorical to one-hot
encoding
String Operations:
df['col'].str.upper(), df['col'].str.contains('word') – String manipulation functions
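As a quick illustration of the string methods above (the sample names are made up):

```python
import pandas as pd

names = pd.Series(['alice smith', 'Bob Jones', 'carol WHITE'])

upper = names.str.upper()        # uppercase every value
has_o = names.str.contains('o')  # boolean mask: which values contain 'o'
title = names.str.title()        # title-case every value
print(upper, has_o, title, sep='\n\n')
```

The .str accessor applies the operation element-wise across the whole Series, so there is no need for an explicit loop.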