Machine Learning Basics:
Supervised:
- Requires training data with independent variables and a dependent variable
(labelled data)
- Labelled data is needed to "supervise" the algorithm when learning from the data
- Regression Models
- Classification Models
Unsupervised:
- Requires training data with independent variables only
- No labelled data is needed to "supervise" the algorithm when learning from the data
- Clustering Models
- Outlier Detection Models
Regression:
- Can be used when the response variable to be predicted is a continuous
variable (scalar)
- Used to predict continuous values in prediction tasks
- e.g. price of a house based on location, etc
- For instance: evaluate with mean squared error
- Examples: Linear Regression, Fixed Effects Regression, XGBoost Regression
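A minimal sketch of fitting a simple linear regression with the closed-form least-squares solution (slope = covariance / variance); the house-price data below is made up for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) minimizing the residual sum of squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: house size in square metres -> price (continuous target)
sizes = [50.0, 70.0, 90.0, 110.0]
prices = [150.0, 190.0, 230.0, 270.0]   # exactly price = 2*size + 50
intercept, slope = fit_simple_linear_regression(sizes, prices)
print(intercept, slope)   # 50.0 2.0 for this noise-free data
```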
Classification:
- Can be used when the response variable takes categorical values
- For instance: used for decision-making tasks
- Predicts categorical values: takes an input and assigns it to one of a set of
predetermined categories
- For instance: evaluate with accuracy; classify email as spam or non-spam,
identify the type of animal in an image
- Examples: Logistic Regression, XGBoost Classification, Random Forest
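A sketch of how a classifier maps an input to a predetermined category, using a logistic-style score. The feature weights below are hypothetical, not learned; a real logistic regression would fit them from labelled data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def classify_email(num_links, has_suspicious_words, weights, bias):
    """Return 'spam' or 'non-spam' from a logistic-style score."""
    z = weights[0] * num_links + weights[1] * has_suspicious_words + bias
    p_spam = sigmoid(z)          # probability of the positive class
    return "spam" if p_spam >= 0.5 else "non-spam"

# Hypothetical coefficients for illustration only
weights, bias = [0.8, 2.0], -3.0
print(classify_email(5, 1, weights, bias))   # z = 3.0  -> spam
print(classify_email(0, 0, weights, bias))   # z = -3.0 -> non-spam
```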
Regression Performance Metrics:
- Calculate the difference between the predicted and true values => a lower
value means a better fit for the model
- RSS: Residual Sum of Squares:
RSS(Beta) = Sum from i=1 to N of ( Square of ( y(i) - y_hat(i) ) )
y(i) = ith observed value
y_hat(i) = model's predicted value for the ith observation
Beta = coefficients
- MSE: Mean Squared Error - penalizes large errors more than smaller ones
MSE = (1/N) * RSS
- RMSE: Root Mean Squared Error - reports the error in a way that is easier to
understand/explain (same units as the target)
RMSE = Square root of MSE
- MAE: Mean Absolute Error - used to penalize all errors equally
MAE = (1/N) * Sum from i=1 to N of ( abs( y(i) - y_hat(i) ) )
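The regression metrics above can be sketched in plain Python; the y_true/y_pred values are toy numbers for illustration.

```python
import math

def rss(y_true, y_pred):
    """Residual sum of squares."""
    return sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))

def mse(y_true, y_pred):
    """Mean squared error: RSS averaged over N observations."""
    return rss(y_true, y_pred) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean absolute error: every error weighted equally."""
    return sum(abs(y - yh) for y, yh in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
print(rss(y_true, y_pred))   # 1 + 0 + 4 = 5.0
print(mae(y_true, y_pred))   # (1 + 0 + 2) / 3 = 1.0
```

Note how the single error of 2 contributes 4 to the RSS but only 2 to the MAE's sum, which is the squared-vs-absolute penalty difference described above.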
Classification Performance Metrics:
- Accuracy: CorrectPrediction / (CorrectPrediction + IncorrectPrediction)
- Precision: TruePositive / (TruePositive + FalsePositive)
TruePositive: where the model correctly predicts the positive outcome
FalsePositive: where the model predicts positive but the actual outcome is
negative
- Recall: TruePositive / (TruePositive + FalseNegative)
- F1Score: 2 * (Recall * Precision) / (Recall + Precision) - Higher value is better
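The classification metrics above, computed from a toy confusion matrix; the counts are made up for illustration.

```python
def accuracy(tp, tn, fp, fn):
    """Correct predictions over all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Of everything predicted positive, how much was truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything truly positive, how much was found."""
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; higher is better."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Toy counts: 8 true positives, 2 false positives, 2 false negatives, 88 true negatives
tp, fp, fn, tn = 8, 2, 2, 88
print(accuracy(tp, tn, fp, fn))   # 96 / 100 = 0.96
print(precision(tp, fp))          # 8 / 10 = 0.8
print(recall(tp, fn))             # 8 / 10 = 0.8
print(f1_score(tp, fp, fn))       # ~0.8 (precision == recall here)
```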
Clustering Performance Metrics:
- Homogeneity - higher means clusters are more homogeneous (each cluster
contains only members of a single class)
Homogeneity(h) = 1 - (conditional entropy of the classes given the cluster
assignments) / (entropy of the classes)
- Silhouette Score - similarity of a data point to its own cluster compared to
other clusters
- Higher means the data point is well matched to its own cluster
- Used with DBSCAN / K-Means
s(o) = (b(o) - a(o)) / max{a(o), b(o)}
o = a data point
a(o) = average distance between o and the other data points in the cluster
that o belongs to
b(o) = minimum average distance from o to the points of each cluster that o
does not belong to
- Completeness:
- The degree to which all data points that belong to a particular class
are assigned to the same cluster
- Higher value indicates a more complete clustering
Completeness(c) = 1 - (conditional entropy of the cluster assignments given
the classes) / (entropy of the cluster assignments)
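The silhouette formula above can be sketched for a single point; the two tiny 1-D clusters below are made up for illustration.

```python
def silhouette(o, own_cluster, other_clusters):
    """s(o) = (b(o) - a(o)) / max(a(o), b(o)) for a 1-D point o."""
    # a(o): average distance to the other points in o's own cluster
    others = [p for p in own_cluster if p != o]
    a = sum(abs(o - p) for p in others) / len(others)
    # b(o): minimum of the average distances to each other cluster
    b = min(sum(abs(o - p) for p in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)

cluster_a = [1.0, 1.2]
cluster_b = [9.0, 9.4]
s = silhouette(1.0, cluster_a, [cluster_b])
print(round(s, 3))   # a = 0.2, b = 8.2 -> (8.2 - 0.2) / 8.2 ~= 0.976
```

The value close to 1 reflects that the point 1.0 is far closer to its own cluster than to the other one.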
ML Model Evaluation Steps:
1. Data Preparation: Split data into train, validation and test.
2. Model Training: Train the model on the training data and save the fitted model
3. Hyper-Parameter Tuning: Use the fitted model and the validation set to find
the optimal set of hyper-parameters where the model performs the best
4. Prediction: Retrain the model on the training data using the optimal
hyper-parameters found in the tuning stage, then use this best fitted model
to make predictions on the test data
5. Test Error Rate: Compute performance metrics for your model using the
predictions and the real values of the target variable from your test data
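The five steps above can be sketched end to end on a toy 1-D problem. The "model" here is a ridge-style estimator, slope = sum(x*y) / (sum(x^2) + lam), with lam as the hyper-parameter to tune; all data below is made up.

```python
def fit(xs, ys, lam):
    """Ridge-style 1-D fit: larger lam shrinks the slope toward 0."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(xs, ys, slope):
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

# 1. Data preparation: split into train / validation / test
x_train, y_train = [1.0, 2.0, 3.0], [3.0, 6.0, 9.0]   # y = 3x exactly
x_val,   y_val   = [4.0, 5.0], [12.0, 15.0]
x_test,  y_test  = [6.0], [18.0]

# 2-3. Train on the training set for each candidate lam,
#      pick the one with the lowest validation error
best_lam = min([0.0, 1.0, 10.0],
               key=lambda lam: mse(x_val, y_val, fit(x_train, y_train, lam)))

# 4. Retrain with the best hyper-parameter, predict on the test set
slope = fit(x_train, y_train, best_lam)
predictions = [slope * x for x in x_test]

# 5. Test error rate: compare predictions with the true test targets
print(best_lam, mse(x_test, y_test, slope))   # 0.0 wins on this noise-free data
```

Because the toy data is noise-free, the unregularized model (lam = 0) gives the lowest validation error; with noisy data a larger lam could win instead.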
Pros:
- Simple model
- Low Variance
- Low Bias
- Provides probability estimates
Cons:
- Unable to model non-linear relationships
- Unstable when classes are well separated
- Unstable when there are more than 2 classes
Residual meaning - the difference between the predicted and the true value