NumPy & Pandas for Data Science: Beginner → Advanced
1. NumPy (Numerical Python)
Basics
• Why: NumPy provides fast, memory-efficient arrays for numerical computation.
• When: Use it for large datasets or heavy numerical computation.
• What: Arrays, vectorized operations, broadcasting.
Example: Array creation
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr) # [1 2 3 4]
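A few other common constructors worth knowing (a minimal sketch; values are illustrative):
zeros = np.zeros((2, 3))       # 2x3 array filled with 0.0
evens = np.arange(0, 10, 2)    # [0 2 4 6 8]
points = np.linspace(0, 1, 5)  # 5 evenly spaced values from 0 to 1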
Array Operations
• Why: Vectorized operations are faster than Python loops.
• When: For element-wise math, statistics, or transformations.
• What: Addition, multiplication, broadcasting.
Example: Broadcasting
a = np.array([1, 2, 3])
b = 2
print(a * b) # [2 4 6]
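Broadcasting also aligns arrays of different shapes, not just scalars; a minimal sketch:
m = np.array([[1, 2, 3], [4, 5, 6]])
row = np.array([10, 20, 30])
print(m + row) # row is broadcast across both rows: [[11 22 33] [14 25 36]]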
Statistics with NumPy
• Why: Quick descriptive stats.
• When: Summarizing data before modeling.
• What: Mean, median, variance, correlation.
Example:
data = np.array([10, 20, 30, 40, 50])
print(np.mean(data)) # 30.0
print(np.median(data)) # 30.0
print(np.var(data)) # 200.0
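Correlation is listed above but not shown; a minimal sketch with made-up series:
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
print(np.corrcoef(x, y)[0, 1]) # 1.0 (y is a perfect linear function of x)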
Linear Algebra
• Why: Essential for ML (matrix ops, eigenvalues).
• When: Feature transformations, PCA, regression.
• What: Dot product, inverse, eigen decomposition.
Example:
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(np.dot(A, B)) # [[19 22] [43 50]]
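Inverse and eigendecomposition are mentioned above; a short sketch reusing the matrix A:
print(np.linalg.inv(A))       # [[-2.   1. ] [ 1.5 -0.5]]
vals, vecs = np.linalg.eig(A) # eigenvalues and eigenvectors
print(vals)                   # approximately [-0.37  5.37]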
2. Pandas (Data Analysis Library)
Basics
• Why: Pandas is built for tabular data (like Excel).
• When: Use for data cleaning, wrangling, and analysis.
• What: Series, DataFrames.
Example:
import pandas as pd
df = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Bangalore'],
    'Price': [100000, 150000, 120000]})
print(df)
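A Series is the one-dimensional counterpart of a DataFrame (each column is a Series); a minimal sketch:
s = pd.Series([100000, 150000, 120000], name='Price')
print(s.max()) # 150000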
Indexing & Selection
• Why: To access subsets of data.
• When: Filtering rows/columns.
• What: .loc, .iloc, boolean indexing.
Example:
print(df.loc[0, 'City']) # Delhi
print(df[df['Price'] > 120000])
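.iloc selects by position rather than label; a sketch using the same df:
print(df.iloc[1, 0]) # Mumbai (row 1, column 0)
print(df.iloc[0:2])  # first two rows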
Handling Missing Data
• Why: Real-world data is messy.
• When: Before modeling.
• What: isnull(), fillna(), dropna().
Example:
df['Price'] = df['Price'].fillna(df['Price'].mean()) # assign back; inplace=True on a column slice can silently fail in newer pandas
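isnull() and dropna() are the inspection and removal counterparts; a sketch (dropping is an alternative to filling, not a follow-up step):
print(df['Price'].isnull().sum())      # how many prices are missing
df_clean = df.dropna(subset=['Price']) # or drop the incomplete rows instead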
Grouping & Aggregation
• Why: Summarize data by categories.
• When: Business insights, feature engineering.
• What: groupby(), agg().
Example:
df.groupby('City')['Price'].mean()
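agg() computes several summaries in one pass; a minimal sketch on the same grouping:
df.groupby('City')['Price'].agg(['mean', 'min', 'max'])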
Merging & Joining
• Why: Combine multiple datasets.
• When: Enrich data with external sources.
• What: merge(), concat().
Example:
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['A', 'B']})
df2 = pd.DataFrame({'ID': [1, 2], 'Score': [90, 80]})
merged = pd.merge(df1, df2, on='ID')
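concat() stacks frames instead of joining on a key; a sketch with the same frames (non-matching columns become NaN):
print(merged)                                      # ID, Name and Score side by side
stacked = pd.concat([df1, df2], ignore_index=True) # rows of df1 on top of df2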
Pivot Tables
• Why: Multi-dimensional summaries.
• When: Cross-tab analysis.
• What: pivot_table().
Example:
df.pivot_table(values='Price', index='City', aggfunc='mean')
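A two-dimensional pivot needs a second categorical column; a sketch with a made-up Furnishing column:
tmp = pd.DataFrame({'City': ['Delhi', 'Delhi', 'Mumbai'],
                    'Furnishing': ['Full', 'Semi', 'Full'],
                    'Price': [100000, 80000, 150000]})
print(tmp.pivot_table(values='Price', index='City', columns='Furnishing', aggfunc='mean'))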
Time Series
• Why: Many datasets are time-based.
• When: Forecasting, trend analysis.
• What: to_datetime(), resampling.
Example:
df['Date'] = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03'])
df.set_index('Date')['Price'].resample('D').mean() # select Price so the non-numeric City column is not averaged
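For trend analysis, a rolling window is a common companion to resampling; a minimal sketch on the same daily series:
daily = df.set_index('Date')['Price']
print(daily.rolling(window=2).mean()) # 2-day moving average (first value is NaN)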
3. Statistical Techniques in Pandas + NumPy
Technique            Why                     When                     Example
Mean/Median/Mode     Central tendency        Summarize numeric data   df['Price'].mean()
Variance/Std Dev     Spread of data          Risk analysis            np.std(df['Price'])
Correlation          Relationship strength   Feature selection        df.corr()
Hypothesis Testing   Statistical inference   A/B testing              scipy.stats.ttest_ind()
Normalization        Scale features          ML preprocessing         (x - mean) / std
Encoding             Handle categories       ML models                pd.get_dummies(df['City'])
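The last three rows deserve a sketch. This assumes SciPy is installed and that df has several rows per city (with only one row each, the t-test is undefined):
from scipy import stats

delhi = df[df['City'] == 'Delhi']['Price']
mumbai = df[df['City'] == 'Mumbai']['Price']
t_stat, p_value = stats.ttest_ind(delhi, mumbai) # hypothesis test comparing two samples

df['PriceNorm'] = (df['Price'] - df['Price'].mean()) / df['Price'].std() # z-score normalization
dummies = pd.get_dummies(df['City'])             # one-hot encoding of City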
4. Real-World Example (Indian Real Estate)
• Problem: Dataset has Carpet Area, but not Super Area.
• Solution: Estimate it from the carpet area using a loading factor.
def estimate_super_area(carpet_area, loading_factor=0.25):
    return carpet_area * (1 + loading_factor)

df['SuperArea'] = df['CarpetArea'].apply(estimate_super_area)
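Because the estimate is plain arithmetic, a vectorized version avoids the per-row apply and is faster on large frames:
df['SuperArea'] = df['CarpetArea'] * (1 + 0.25) # same result, vectorized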
5. Practice Exercises
1. Load a CSV of real estate data.
2. Clean missing values in Price using median.
3. Extract the floor number from a column like "5 out of 10" (see the sketch after this list).
4. Group by City and compute average Price.
5. Encode Furnishing as one-hot vectors.
6. Normalize Price for ML preprocessing.
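Exercise 3 is the least obvious; a minimal sketch using str.extract (the column name 'Floor' is an assumption):
df['FloorNum'] = df['Floor'].str.extract(r'(\d+)', expand=False).astype(float) # "5 out of 10" -> 5.0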
This is essentially your all-in-one guide. You can copy this into a Jupyter Notebook
and run each block step by step.