Scenario-Based Python Interview Questions for Data Analysis Roles (Freshers)
1. Scenario: You receive a CSV file with missing values, inconsistent casing, and duplicate
rows. How would you clean this data using Python?
Answer:
import pandas as pd
df = pd.read_csv('data.csv')
# Remove duplicates
df = df.drop_duplicates()
# Standardize casing (e.g., for a 'Name' column)
df['Name'] = df['Name'].str.title()
# Handle missing values
df = df.fillna(method='ffill')
2. Scenario: You have a sales dataset with columns: Date, Product, and Revenue. How would
you find the top 3 products with the highest average monthly revenue?
Answer:
df['Date'] = pd.to_datetime(df['Date'])
df['Month'] = df['Date'].dt.to_period('M')
monthly_avg = df.groupby(['Month', 'Product'])['Revenue'].mean().reset_index()
top_products =
monthly_avg.groupby('Product')['Revenue'].mean().sort_values(ascending=False).head(3)
3. Scenario: How would you prepare customer data with demographic info and activity logs
for a churn prediction model?
Answer:
- Handle missing values
- Convert categorical features to numeric using pd.get_dummies()
- Normalize/scale numerical features
- Merge datasets if activity logs are separate
- Label churn (e.g., Churn = 1 if customer left, else 0)
4. Scenario: You suspect some products have incorrect prices in a dataset. How would you
identify and handle outliers?
Answer:
Q1 = df['Price'].quantile(0.25)
Q3 = df['Price'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['Price'] < Q1 - 1.5*IQR) | (df['Price'] > Q3 + 1.5*IQR)]
df = df[~df.index.isin(outliers.index)]
5. Scenario: You have two datasets: users.csv and transactions.csv. How would you
combine them to analyze total spending per user?
Answer:
users = pd.read_csv('users.csv')
transactions = pd.read_csv('transactions.csv')
merged = pd.merge(users, transactions, on='user_id')
spending = merged.groupby('user_id')['amount'].sum()
6. Scenario: You have daily temperature data. How would you visualize trends and seasonal
patterns?
Answer:
import matplotlib.pyplot as plt
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
plt.figure(figsize=(10,5))
plt.plot(df['Temperature'])
plt.title('Daily Temperature Trends')
plt.xlabel('Date')
plt.ylabel('Temperature')
plt.show()
7. Scenario: You have a column 'Country' with many categories. How would you prepare this
for machine learning?
Answer:
# Use OneHotEncoder or pd.get_dummies
df = pd.get_dummies(df, columns=['Country'], drop_first=True)
8. Scenario: Your dataset has a column 'Join_Date'. What features can you extract from it?
Answer:
df['Join_Date'] = pd.to_datetime(df['Join_Date'])
df['Year'] = df['Join_Date'].dt.year
df['Month'] = df['Join_Date'].dt.month
df['Weekday'] = df['Join_Date'].dt.day_name()
df['Join_Quarter'] = df['Join_Date'].dt.quarter