
NumPy & Pandas for Data Science: Beginner → Advanced

1. NumPy (Numerical Python)

Basics

• Why: NumPy provides fast, memory-efficient arrays for numerical computation.

• When: Use when working with large datasets or mathematical operations.

• What: Arrays, vectorized operations, broadcasting.

Example: Array creation

import numpy as np

arr = np.array([1, 2, 3, 4])

print(arr) # [1 2 3 4]
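The "vectorized operations" mentioned above can be sketched briefly; the array values here are the same as in the example:

```python
import numpy as np

arr = np.array([1, 2, 3, 4])

print(arr.shape)   # (4,)
print(arr.dtype)   # an integer dtype, e.g. int64 (platform dependent)
print(arr + 10)    # [11 12 13 14] — applied element-wise, no loop needed
```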

Array Operations

• Why: Vectorized operations are faster than Python loops.

• When: For element-wise math, statistics, or transformations.

• What: Addition, multiplication, broadcasting.

Example: Broadcasting

a = np.array([1, 2, 3])

b = 2

print(a * b) # [2 4 6]
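Broadcasting also works across dimensions, not just with scalars; a minimal sketch pairing a column vector with a row:

```python
import numpy as np

# A (3, 1) column broadcasts against a (3,) row to give a (3, 3) grid.
col = np.array([[0], [10], [20]])
row = np.array([1, 2, 3])

print(col + row)
# [[ 1  2  3]
#  [11 12 13]
#  [21 22 23]]
```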

Statistics with NumPy

• Why: Quick descriptive stats.

• When: Summarizing data before modeling.

• What: Mean, median, variance, correlation.

Example:

data = np.array([10, 20, 30, 40, 50])

print(np.mean(data))   # 30.0

print(np.median(data)) # 30.0

print(np.var(data))    # 200.0

Linear Algebra

• Why: Essential for ML (matrix ops, eigenvalues).

• When: Feature transformations, PCA, regression.

• What: Dot product, inverse, eigen decomposition.

Example:

A = np.array([[1, 2], [3, 4]])

B = np.array([[5, 6], [7, 8]])

print(np.dot(A, B))
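The bullets above also mention the inverse and eigen decomposition; a minimal sketch of both, reusing the same matrix A:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])

# Inverse: A @ inv(A) should be (numerically) the identity matrix.
A_inv = np.linalg.inv(A)
print(A @ A_inv)               # ≈ [[1. 0.] [0. 1.]]

# Eigen decomposition: eigenvalues and column eigenvectors.
values, vectors = np.linalg.eig(A)
print(values)                  # ≈ [-0.372  5.372]
```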

2. Pandas (Data Analysis Library)

Basics

• Why: Pandas is built for tabular data (like Excel).

• When: Use for data cleaning, wrangling, and analysis.

• What: Series, DataFrames.

Example:

import pandas as pd

df = pd.DataFrame({

'City': ['Delhi', 'Mumbai', 'Bangalore'],

'Price': [100000, 150000, 120000]})

print(df)

Indexing & Selection

• Why: To access subsets of data.


• When: Filtering rows/columns.

• What: .loc, .iloc, boolean indexing.

Example:

print(df.loc[0, 'City'])  # Delhi

print(df[df['Price'] > 120000])

Handling Missing Data

• Why: Real-world data is messy.

• When: Before modeling.

• What: isnull(), fillna(), dropna().

Example:

df['Price'] = df['Price'].fillna(df['Price'].mean())  # assignment; inplace=True is deprecated
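A fuller sketch of the three tools listed above, using a small made-up frame with one missing price:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a gap in Price (the values are illustrative).
df = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Bangalore'],
                   'Price': [100000, np.nan, 120000]})

print(df['Price'].isnull().sum())                    # 1 missing value
print(df['Price'].fillna(df['Price'].mean()).tolist())  # [100000.0, 110000.0, 120000.0]
print(df.dropna())                                   # drops the Mumbai row instead
```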

Grouping & Aggregation

• Why: Summarize data by categories.

• When: Business insights, feature engineering.

• What: groupby(), agg().

Example:

df.groupby('City')['Price'].mean()
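agg() is listed above but not shown; a sketch computing several summaries per group at once, on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'City': ['Delhi', 'Delhi', 'Mumbai'],
                   'Price': [100000, 120000, 150000]})

# agg() applies several aggregations per group in one call.
summary = df.groupby('City')['Price'].agg(['mean', 'min', 'max'])
print(summary)
```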

Merging & Joining

• Why: Combine multiple datasets.

• When: Enrich data with external sources.

• What: merge(), concat().

Example:

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['A', 'B']})

df2 = pd.DataFrame({'ID': [1, 2], 'Score': [90, 80]})

merged = pd.merge(df1, df2, on='ID')
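concat() is mentioned above but not shown; a minimal sketch, assuming two frames that share the same columns:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['A', 'B']})
df2 = pd.DataFrame({'ID': [3, 4], 'Name': ['C', 'D']})

# concat() stacks frames row-wise; ignore_index renumbers the result.
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked)
```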


Pivot Tables

• Why: Multi-dimensional summaries.

• When: Cross-tab analysis.

• What: pivot_table().

Example:

df.pivot_table(values='Price', index='City', aggfunc='mean')

Time Series

• Why: Many datasets are time-based.

• When: Forecasting, trend analysis.

• What: to_datetime(), resampling.

Example:

df['Date'] = pd.to_datetime(['2023-01-01','2023-01-02','2023-01-03'])

df.set_index('Date').resample('D').mean(numeric_only=True)  # numeric_only skips text columns like City

3. Statistical Techniques in Pandas + NumPy

| Technique          | Why                   | When                   | Example                     |
|--------------------|-----------------------|------------------------|-----------------------------|
| Mean/Median/Mode   | Central tendency      | Summarize numeric data | df['Price'].mean()          |
| Variance/Std Dev   | Spread of data        | Risk analysis          | np.std(df['Price'])         |
| Correlation        | Relationship strength | Feature selection      | df.corr(numeric_only=True)  |
| Hypothesis Testing | Statistical inference | A/B testing            | scipy.stats.ttest_ind()     |
| Normalization      | Scale features        | ML preprocessing       | (x - mean) / std            |
| Encoding           | Handle categories     | ML models              | pd.get_dummies(df['City'])  |

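To tie the table together, a small runnable sketch of normalization and correlation; the frame and its Price/Area columns are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'Price': [100000, 150000, 120000],
                   'Area': [800, 1200, 950]})

# z-score normalization: (x - mean) / std
z = (df['Price'] - df['Price'].mean()) / df['Price'].std()
print(z.round(2).tolist())        # [-0.93, 1.06, -0.13]

# correlation between two numeric columns
print(df['Price'].corr(df['Area']))
```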
4. Real-World Example (Indian Real Estate)

• Problem: Dataset has Carpet Area, but not Super Area.

• Solution: Estimate using loading factor.

def estimate_super_area(carpet_area, loading_factor=0.25):
    return carpet_area * (1 + loading_factor)

df['SuperArea'] = df['CarpetArea'].apply(estimate_super_area)
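Because Pandas columns are backed by NumPy arrays, the same estimate can also be written as a vectorized expression, avoiding the Python-level loop inside apply(). A sketch with made-up carpet areas:

```python
import pandas as pd

df = pd.DataFrame({'CarpetArea': [800, 1000]})
loading_factor = 0.25

# Vectorized arithmetic on the whole column — no per-row function call.
df['SuperArea'] = df['CarpetArea'] * (1 + loading_factor)
print(df['SuperArea'].tolist())   # [1000.0, 1250.0]
```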

5. Practice Exercises

1. Load a CSV of real estate data.

2. Clean missing values in Price using median.

3. Extract floor number from a column like "5 out of 10".

4. Group by City and compute average Price.

5. Encode Furnishing as one-hot vectors.

6. Normalize Price for ML preprocessing.
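As a hint for exercise 3, one possible sketch using str.extract(); the Floor column name and its "5 out of 10" format are assumptions from the exercise:

```python
import pandas as pd

df = pd.DataFrame({'Floor': ['5 out of 10', '2 out of 4', 'Ground out of 3']})

# str.extract() pulls the leading digits; rows that don't match become NaN.
df['FloorNum'] = df['Floor'].str.extract(r'^(\d+)', expand=False).astype(float)
print(df['FloorNum'].tolist())   # [5.0, 2.0, nan]
```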

This is essentially your all-in-one guide. You can copy this into a Jupyter Notebook
and run each block step by step.

