# Comprehensive Notes on Data Processing and Machine Learning
### Linear Algebra Basics
1. **Matrices to Represent Relations Between Data**:
- A matrix is a 2D array of numbers, with rows representing data samples and columns representing features or variables.
- **Example**:
- Adjacency matrix for graphs: Represents connections between nodes.
- Data tables: Rows as data points, columns as attributes.
2. **Linear Algebra Operations**:
- **Addition/Subtraction**: Element-wise operations between matrices of the same dimensions.
- **Matrix Multiplication**: Entry \( (AB)_{ij} \) is the dot product of row \( i \) of \( A \) and column \( j \) of \( B \); requires the inner dimensions to match. Used in transformations and neural networks.
- **Transpose**: Flipping rows and columns. Notation: \( A^T \).
- **Inverse**: If a square matrix \( A \) is invertible, \( A^{-1} \) satisfies \( A \times A^{-1} = A^{-1} \times A = I \) (identity matrix).
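
As a quick illustration, here is a minimal NumPy sketch of these operations; the matrix values are arbitrary, and `numpy` is assumed to be installed:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

print(A + B)              # element-wise addition (same shape required)
print(A - B)              # element-wise subtraction
print(A @ B)              # matrix multiplication (rows dotted with columns)
print(A.T)                # transpose: flips rows and columns
A_inv = np.linalg.inv(A)  # inverse (A must be square and non-singular)
print(A @ A_inv)          # ~ identity matrix, up to floating-point error
```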
3. **Matrix Decomposition**:
- **Singular Value Decomposition (SVD)**:
- Decomposes a matrix \( A \) into \( U \Sigma V^T \).
- \( U \): Left singular vectors (orthogonal).
- \( \Sigma \): Diagonal matrix of singular values (non-negative, in descending order).
- \( V^T \): Right singular vectors (orthogonal).
- **Applications**: Dimensionality reduction, image compression (see the first sketch after this list).
- **Principal Component Analysis (PCA)**:
- Identifies directions (principal components) of maximum variance in the data.
- Reduces dimensions while retaining important information.
- Steps: center the data, compute the covariance matrix, find its eigenvectors and eigenvalues, and project onto the top eigenvectors (see the second sketch below).
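
The SVD bullet above can be made concrete with NumPy; the random 6×4 matrix and the choice \( k = 2 \) are purely illustrative, and the truncation at the end is the usual basis for dimensionality reduction and compression:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

# Full (thin) SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U @ np.diag(s) @ Vt))  # True: exact reconstruction

# Rank-k approximation: keep only the k largest singular values
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.linalg.norm(A - A_k))  # approximation error (Frobenius norm)
```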
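
Likewise, a PCA sketch that follows the listed steps directly (in practice one would typically call scikit-learn's `PCA` instead); the synthetic data and \( k = 2 \) are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data with very different variance per axis
X = rng.standard_normal((100, 3)) * np.array([2.0, 1.0, 0.1])

X_centered = X - X.mean(axis=0)          # 1. center the data
cov = np.cov(X_centered, rowvar=False)   # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # 3. eigenvectors/eigenvalues
order = np.argsort(eigvals)[::-1]        # sort by explained variance, descending

k = 2
components = eigvecs[:, order[:k]]       # top-k principal components
X_reduced = X_centered @ components      # 4. project onto k dimensions
print(X_reduced.shape)                   # (100, 2)
```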
### Data Pre-processing and Feature Selection
1. **Data Pre-processing**:
- **Data Cleaning**:
- Handle missing data (e.g., mean/mode imputation, drop rows/columns).
- Remove duplicates, correct inconsistencies.
- **Data Integration**:
- Combine data from multiple sources (databases, APIs, files) into a unified dataset.
- **Data Reduction**:
- Reduce size or complexity while retaining structure:
- Sampling: Select a representative subset of the data.
- Aggregation: Summarize groups (e.g., average).
- Dimensionality reduction: PCA, feature elimination.
- **Data Transformation**:
- Scaling: Normalize values to a standard range (e.g., Min-Max scaling).
- Encoding: Convert categorical data into numerical form (e.g., one-hot encoding).
- **Data Discretization**:
- Convert continuous data into discrete bins or intervals (e.g., age groups); a combined sketch of these steps follows this list.
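
A combined pandas sketch of the cleaning, transformation, and discretization steps above; the tiny DataFrame and its column names are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 32, None, 47, 32],
    "city":   ["Oslo", "Paris", "Paris", None, "Paris"],
    "income": [30000, 45000, 52000, 61000, 45000],
})

# Cleaning: impute missing values, then drop duplicate rows
df["age"] = df["age"].fillna(df["age"].mean())        # mean imputation
df["city"] = df["city"].fillna(df["city"].mode()[0])  # mode imputation
df = df.drop_duplicates()

# Transformation: min-max scaling and one-hot encoding
df["income_scaled"] = (df["income"] - df["income"].min()) / \
                      (df["income"].max() - df["income"].min())
df = pd.get_dummies(df, columns=["city"])             # one-hot encoding

# Discretization: bin continuous ages into labeled groups
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 45, 100],
                         labels=["young", "middle", "senior"])
print(df)
```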
2. **Feature Selection and Generation**:
- **Feature Generation**:
- Create new features using domain knowledge (e.g., total price = quantity * unit price).
- **Feature Selection**:
- Reduce feature space by identifying important variables.
- **Methods**:
- **Filters**: Statistical tests (e.g., correlation, chi-squared test).
- **Wrappers**: Evaluate feature subsets by model performance (e.g., recursive feature elimination).
- **Embedded Methods**: Feature selection performed during model training (e.g., LASSO, decision trees); the filter and embedded approaches are sketched below.
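
A short scikit-learn sketch of the filter and embedded approaches; the iris dataset and the hyperparameters (`k=2`, `alpha=0.1`) are illustrative choices, and regressing LASSO directly on class labels is only for demonstration:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import Lasso

X, y = load_iris(return_X_y=True)

# Filter method: keep the 2 features with the highest chi-squared scores
selector = SelectKBest(score_func=chi2, k=2)
X_filtered = selector.fit_transform(X, y)
print("chi2 keeps columns:", selector.get_support(indices=True))

# Embedded method: LASSO drives weak features' coefficients to zero
lasso = Lasso(alpha=0.1).fit(X, y)
print("LASSO coefficients:", lasso.coef_)  # zeros mark discarded features
```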
### Basic Machine Learning Algorithms
1. **Classifiers**:
- **Decision Tree**:
- Splits data into branches based on feature thresholds.
- Example: Predicting loan approval based on income and credit score.
- **Naive Bayes**:
- Based on Bayes' Theorem; assumes features are conditionally independent given the class.
- Example: Classifying spam emails.
- **k-Nearest Neighbors (k-NN)**:
- Classifies data based on the majority label of k-nearest data points.
- Works well for smaller datasets; sensitive to feature scaling.
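
A side-by-side sketch of the three classifiers on scikit-learn's built-in iris dataset; the split ratio and `k=5` are arbitrary illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes":   GaussianNB(),
    "k-NN (k=5)":    KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))  # accuracy on held-out data
```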
2. **Clustering**:
- **k-Means**:
- Divides data into k clusters by minimizing intra-cluster variance.
- Requires the number of clusters (k) as input.
- Example: Customer segmentation.
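
A k-means sketch on synthetic blobs; here `k=3` happens to match the data generator, which we would not know for real data (hence heuristics like the elbow method):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic "customers" drawn from 3 groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # learned cluster centers
print(kmeans.labels_[:10])      # cluster assignments of first 10 points
print(kmeans.inertia_)          # total intra-cluster variance being minimized
```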
3. **Advanced Techniques**:
- **Support Vector Machine (SVM)**:
- Finds the maximum-margin hyperplane separating classes.
- Kernel trick: Implicitly maps data to a higher-dimensional space for better separation (see the first sketch after this list).
- **Association Rule Mining**:
- Finds relationships between items in transactional datasets.
- Example: Market Basket Analysis (e.g., "If a customer buys bread, they are likely to buy butter"); a toy support/confidence computation appears after this list.
- **Ensemble Methods**:
- Combine predictions of multiple models to improve accuracy.
- Types:
- Bagging: Reduces variance (e.g., Random Forests).
- Boosting: Reduces bias (e.g., AdaBoost).
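
First, a sketch of the kernel trick: a linear SVM fails on concentric circles, while an RBF kernel separates them (the dataset and default hyperparameters are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in 2D
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)     # kernel trick: implicit high-dim map
print("linear:", linear_svm.score(X, y))  # roughly chance level (~0.5)
print("rbf:   ", rbf_svm.score(X, y))     # near-perfect separation
```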
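
Next, a toy pure-Python computation of support and confidence for the single candidate rule "bread → butter"; the transactions are invented, and real association rule mining would use an Apriori or FP-growth implementation:

```python
# Toy transaction database (invented for illustration)
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"bread", "butter"} <= t)
bread = sum(1 for t in transactions if "bread" in t)

support = both / n         # P(bread and butter together)
confidence = both / bread  # P(butter | bread)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```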
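
Finally, a sketch contrasting bagging and boosting on the same synthetic classification task (the model choices and `n_estimators=100` are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

bagging = RandomForestClassifier(n_estimators=100, random_state=0)  # bagging
boosting = AdaBoostClassifier(n_estimators=100, random_state=0)     # boosting

print("random forest:", cross_val_score(bagging, X, y, cv=5).mean())
print("adaboost:     ", cross_val_score(boosting, X, y, cv=5).mean())
```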