Linear Regression
Simple and Multiple: used to predict a continuous target y from input features x by finding the line of best fit that minimizes a loss function (typically mean squared error).
Only models linear relationships between the features and the target
Strengths: Simple model and easy to interpret, easy to compute, low risk of overfitting
Weaknesses: Assumes linearity, multicollinearity and outliers can cause issues, sensitive to feature scaling
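A minimal scikit-learn sketch of the fit-and-predict pattern described above (the toy data is made up for illustration):
```python
# Fit a simple linear regression and read off the line of best fit.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # single feature
y = np.array([2.1, 4.0, 6.2, 7.9])           # roughly y = 2x

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)          # slope and intercept of the best-fit line
print(model.predict([[5.0]]))                 # prediction for a new point
```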
Logistic Regression
Used for binary classification problems; passes a linear combination of the features through a sigmoid function bounded between 0 and 1
Maps outputs to a probability, and a class is assigned based on a probability threshold
Strengths: Simple and interpretable, fast training and prediction, provides probability scores, works well with linear boundaries, less prone to overfitting than more complex models, good for small to medium datasets
Weaknesses: Assumes a linear relationship between the features and the log-odds, struggles with non-linear boundaries, sensitive to outliers, may underperform with highly imbalanced data, requires more data for stable estimates, features should be independent
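A minimal sketch of logistic regression's probability output and thresholding, using scikit-learn and invented toy data:
```python
# Binary classification: sigmoid-based probabilities, then a 0.5 threshold.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])                  # toy labels

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba([[2.0], [3.8]])[:, 1]   # P(class = 1)
labels = (probs >= 0.5).astype(int)               # threshold maps probability -> class
print(probs, labels)
```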
Feature Engineering
Feature engineering involves extracting and transforming raw data variables into optimized features to improve machine learning model performance and predictive power.
Binning to turn continuous data into discrete categories, creating polynomial features for nonlinear relationships, combining features to create new ones
EX: Aggregate data (mean, median, etc.), encoding categorical data, Scaling numeric data, PCA (to reduce dimensionality)
Feature selection: choose a subset of features from data through reducing dimensionality, importance-based selection from algorithms, or removing highly correlated features
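A minimal sketch of a few of the transformations mentioned above (binning, polynomial features, scaling) with scikit-learn; the feature values are hypothetical:
```python
# Common feature-engineering transforms on a small made-up feature matrix.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, PolynomialFeatures, StandardScaler

X = np.array([[1.0, 10.0], [2.0, 14.0], [3.0, 30.0], [4.0, 41.0]])

binned = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform").fit_transform(X)
poly   = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)  # adds squared and interaction terms
scaled = StandardScaler().fit_transform(X)                                  # zero mean, unit variance
print(binned.shape, poly.shape, scaled.shape)
```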
Bias and Variance
Bias is caused by a model that is too simple and cannot capture the relationships in the data, resulting in poor performance on test and training data (underfitting)
Variance is caused by an overly complex model that is overfitting on the training data, including all noise, and performs poorly on new data
Variance and Bias are inversely related, and there is a tradeoff between them
As dimensionality grows, the amount of data needed increases exponentially (the curse of dimensionality)
Decision Trees
Uses nodes to create a flow chart to classify data (root node: first node and best predictor; internal nodes: decision nodes that split the data; leaf node: terminal node that assigns the class)
Each split is measured by an impurity score, evaluating how well the split reduces the mixture of classes
Entropy: -Σ (Pi * log2(Pi)), where Pi is the fraction of records in class i
Gini: 1 - Σ (Pi^2)
Reduce overfitting in trees with pre-pruning (max depth, min samples to split)
or post-pruning, which removes branches that are not significant for predictions
Strengths: Inexpensive, quick at classifying new records, easy to interpret, can handle noisy data, can use numerical and categorical data, can handle redundant features
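A minimal sketch of the two impurity measures above, computed from class counts with NumPy:
```python
# Entropy and Gini impurity for a node, given the class counts at that node.
import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()        # -sum(p_i * log2 p_i)

def gini(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)           # 1 - sum(p_i^2)

print(entropy([5, 5]), gini([5, 5]))      # maximally impure node: 1.0 and 0.5
print(entropy([10, 0]), gini([10, 0]))    # pure node: 0.0 and 0.0
```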
Overfitting and Cross Validation
To avoid overfitting and other complications during the model's training, a validation set is used to test unseen data and adjust hyperparameters before using the testing data.
Cross-validation can be used when data is limited to avoid over and underfitting on data.
K-fold: data is divided into k folds, the model is trained on k-1 folds and tested on the remaining fold, repeated k times, providing a better estimate of model performance
LOOCV: trains on all datapoints but one and uses the left-out datapoint as validation; repeated for each datapoint
Stratified: used for imbalanced class distributions; each fold preserves the class proportions
Weaknesses: Expensive; data leaks can happen from preprocessing data before splits
Strengths: Stable performance estimate, Reduces overfitting and underfitting, and robust hyperparameter selection
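A minimal sketch of stratified k-fold cross-validation with scikit-learn; the iris data and logistic regression model are stand-ins:
```python
# 5-fold stratified cross-validation to estimate generalization performance.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # folds keep class proportions
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean(), scores.std())        # average accuracy and its spread across folds
```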
KNN
A lazy learner: there is no training phase, and predictions are made directly from the stored data
Works by assigning the classification of new data to the closest observations in the training data
The comparison metric is distance, and the choice of distance metric can drastically change the effectiveness of the model
Voting for class can be by majority or weighted
K is the most important hyperparameter for KNN: it determines how many neighbors are considered for a prediction, and the optimal k depends on the data (noisy data works better with a high k, clean data works better with a low k)
CV (cross-validation) is used to determine the best k
Strengths: Easy to understand and implement, instance based
Weaknesses: Data must be scaled, class imbalance can cause issues if voting is not weighted, struggles under high dimensionality, irrelevant features must be removed
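A minimal sketch of KNN with scaling in a pipeline and cross-validation over a few k values; the wine dataset is a stand-in:
```python
# KNN needs scaled features; compare a few k values with cross-validation.
from sklearn.datasets import load_wine
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)
for k in (1, 5, 15):
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    print(k, cross_val_score(pipe, X, y, cv=5).mean())   # pick the k with the best CV score
```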
Naive Bayes
One of the Bayes classifiers; Naive Bayes assumes that features are conditionally independent given the class
Strengths: Robust performance in most applications, Usually used for text classification, Handles high dimensionality well and resistant to overfitting
Use regular (non-naive) Bayes when: feature dependence is important, data is ample in size, the model needs complexity, and the problem requires understanding the joint probability
Laplace smoothing avoids zero probabilities by adding a small count to each probability estimate
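A minimal sketch of multinomial Naive Bayes for text classification, where the alpha parameter is the Laplace smoothing term; the documents and labels are made up:
```python
# Text classification with bag-of-words counts and Laplace-smoothed Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs   = ["cheap pills now", "meeting at noon", "cheap meds offer", "lunch meeting today"]
labels = [1, 0, 1, 0]                                # 1 = spam, 0 = ham (toy labels)

vec = CountVectorizer()
X = vec.fit_transform(docs)
clf = MultinomialNB(alpha=1.0).fit(X, labels)        # alpha=1.0 adds one pseudo-count per word
print(clf.predict(vec.transform(["cheap lunch offer"])))
```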
Evaluation
Accuracy: fraction of correct predictions on the testing set (error rate is the fraction of incorrect predictions)
Recall: TP/(TP+FN), used for measuring false negatives (when the cost of a FN is high)
Precision: TP/(TP+FP), used when the cost of a FP is high
F1 Score: 2 * (Precision*Recall)/(Precision+Recall): a good metric for imbalanced datasets; it weighs FP and FN equally, so the costs are treated as equal
ROC curve: Shows how well a model separates P and N classes across different thresholds
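A minimal sketch computing the metrics above with scikit-learn; the labels and scores are invented:
```python
# Accuracy, precision, recall, F1, and ROC AUC from toy predictions.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.3, 0.6, 0.7, 0.1]   # probabilities, used for the ROC curve

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_score))    # separation of P and N across thresholds
```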
Ensemble Methods
Bagging: train multiple models in parallel to reduce variance (trains on multiple bootstrap subsets of the original dataset and aggregates each model's prediction); use for models that overfit easily
Boosting: train models sequentially, where each model focuses on the previous errors to reduce bias (each instance has a weight that gets adjusted each round); used on complex datasets with misclassifications
Stacking: train diverse base models on the same dataset, create a new dataset from the outputs of the base models, and train a meta-model on that new dataset for the final prediction; use when multiple diverse models are available and need to be combined
Random Forest: uses multiple decision trees with bagging and random feature subsets at each split
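A minimal sketch comparing bagging, boosting, and a random forest with scikit-learn; the dataset is a stand-in:
```python
# Compare the three ensemble styles on the same data via cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
models = {
    "bagging":  BaggingClassifier(n_estimators=50, random_state=0),      # parallel, reduces variance
    "boosting": AdaBoostClassifier(n_estimators=50, random_state=0),     # sequential, reduces bias
    "forest":   RandomForestClassifier(n_estimators=50, random_state=0), # bagging + random feature subsets
}
for name, m in models.items():
    print(name, cross_val_score(m, X, y, cv=5).mean())
```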
SVM
Searches for an optimal linear separator in the feature space (the hyperplane that best separates the 2 classes)
The goal is to find the hyperplane with the maximum margin, i.e., maximize the distance between the closest points from each class (these closest points are called support vectors). The margin width is M = 2 / ||w||, so SVM aims to minimize ||w|| to maximize the margin
All new points after training are labeled depending on which side of the hyperplane they land on. SVM must balance maximizing the margin between the 2 classes with minimizing classification error
All data used for SVM must be scaled (distance-based model) to avoid bias towards larger-scale numbers
Decision boundaries for SVM are only affected by the support vectors; other datapoints are irrelevant
Multiclass for SVM is done by One-vs-All (train one binary SVM per class, each treating that class as + and all others as -); a new datapoint is scored by all of them and assigned the highest-scoring class,
or One-vs-One (train K(K-1)/2 binary SVMs, one for each class pair, e.g., A vs B, A vs C, B vs C). Each SVM votes, and the class with the most votes wins
Soft-Margin SVM considers a tradeoff between margin and errors, since data can be noisy and perfect separation leads to overfitting. The slack variable penalizes points that are either within the margin or misclassified.
Kernel methods are used if data cannot be separated linearly: the data is transformed into a higher-dimensional space where it becomes separable.
SVM is a robust model that is not badly affected by the curse of dimensionality and guarantees a globally optimal solution; however, it is sensitive to the kernel choice and needs feature scaling
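A minimal sketch of a soft-margin SVM with feature scaling and an RBF kernel (C trades margin width against errors); the dataset is a stand-in:
```python
# Soft-margin SVM: scale features, use an RBF kernel, evaluate with CV.
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
print(cross_val_score(svm, X, y, cv=5).mean())   # larger C = fewer margin violations allowed
```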
Neural Networks
Takes input(s), multiplies each input by a weight, sums the weighted inputs, adds a bias to the sum, and applies an activation function to get an output
Advantages: Pattern recognition for nonlinear relations, can handle high dimensionality, automated feature extraction, better than most traditional models, combines both structured and unstructured data, supports NLP, image analysis, and anomaly detection, easily fine-tuned
3 levels of AI. AI: computers mimicking human behavior; ML: the ability to learn without being explicitly programmed; DL: extracting patterns from data using deep neural networks
Traditional ML requires manual feature engineering and selection, DL uses neural networks to auto learn patterns from the data
Activation functions are used on layers to add non-linearity to the model to capture more complex relations. Binary classification: sigmoid; multiclass: softmax; regression: no function; hidden layers: ReLU, to avoid the vanishing gradient problem caused by deep networks with many layers
Training involves updating weights until some stop condition is met: number of epochs reached, or error/loss falls below a threshold. Weights are adjusted until the model learns to produce outputs close to the actual targets
Forward propagation: computing predictions, Backward propagation: Adjust weights using gradients
Neural networks suffer from being data hungry, overfitting, poor interpretability, and high computational cost.
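A minimal sketch of the single-neuron computation described above (weighted sum, bias, activation) with made-up inputs and weights:
```python
# One neuron: weighted sum of inputs plus bias, passed through a sigmoid.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])       # inputs
w = np.array([0.8, 0.1, -0.4])       # weights
b = 0.2                              # bias

z = np.dot(w, x) + b                 # weighted sum plus bias
print(sigmoid(z))                    # activation gives the neuron's output
```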
K-Means Clustering
Clustering groups data into sets based on similarity and no prior knowledge of the labels (Unsupervised learning), clusters are potential classes found in unlabeled data
3 types of clustering: Centroid-based: each cluster is represented by a central vector; Density-based: looks for datapoints that are packed together vs. those that are spread thin; Hierarchical: clusters form a tree that splits or merges depending on the approach (divisive or agglomerative)
Centroid-based clustering example: K-Means splits data into k clusters based on the distance between points and centroids, aiming to minimize within-cluster variance. K random points are chosen as centroids, points are assigned to the closest centroid, new centroids are computed from the mean of each cluster, and the process repeats.
Categorical data must be one-hot encoded, empty clusters can have their centroids reselected, and centroids can get stuck in local minima, giving a non-optimal solution
Characteristics: Supports various data (categorical must be OHE), fast, bisecting k-means and k-means++ help with initialization, sensitive to outliers, suffers from the curse of dimensionality, assumes roughly spherical clusters
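A minimal K-Means sketch using k-means++ initialization on generated blob data:
```python
# K-Means with k=3; k-means++ and multiple restarts reduce the risk of bad local minima.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)           # final centroids
print(km.inertia_)                   # within-cluster variance being minimized
```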
DBSCAN
Used when clusters have arbitrary shapes; points are grouped by proximity. A point has a radius eps, all points within eps are its neighbors, and minpts is the minimum number of points required in a neighborhood for it to be considered dense. Core points meet minpts; border points don't meet minpts but are within the eps of a core point; outliers are neither core nor border points
2 core points within each other's eps are in the same cluster; border points are assigned to the first core point's cluster and not reassigned
Characteristics: Can capture arbitrary shapes, robust to noise/outliers, works across datatypes (spatial, image, network), struggles with high-dimensional spaces, sensitive to the distance metric used, poor at handling varied densities
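A minimal DBSCAN sketch on generated two-moons data, where eps and min_samples correspond to the radius and minpts above:
```python
# DBSCAN on non-spherical data; points labeled -1 are treated as outliers/noise.
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))                   # cluster ids found, plus -1 for noise points
```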
Hierarchical Clustering
2 types: Agglomerative (bottom-up): each point starts as its own cluster and clusters are merged based on similarity; Divisive (top-down): one cluster is split based on dissimilarity. For both, each level represents a different level of granularity. On a dendrogram, merge points show which clusters are combined, and merge height indicates the level of dissimilarity (lower means more similar)
Agglomerative: at each step, combine the closest pair of clusters based on a distance metric, update distances after new cluster is made, repeat until only 1 cluster remains
Core operation of merging clusters is done by deciding a distance metric, and how to define the distance between clusters (Linkage)
Linkage types: Single (min), Complete (max), Average (avg. distance between all pairs), Centroid (distance between mean vectors, with possible inversions), Ward's (minimizes the increase in within-cluster variance)
A proximity matrix tracks and updates the distance between all clusters; the smallest entry in the matrix determines the next merge. After the new cluster is created, its distances to the other clusters are computed and the matrix is updated
Characteristics: Expensive computation, sensitive to noise/outliers, the choice of distance metric drastically changes results
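A minimal agglomerative clustering sketch comparing linkage choices on generated blob data:
```python
# Agglomerative clustering: the linkage choice changes how cluster distances are defined.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)
for link in ("single", "complete", "average", "ward"):
    labels = AgglomerativeClustering(n_clusters=3, linkage=link).fit_predict(X)
    print(link, labels[:10])         # assignments can differ with the linkage used
```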
Anomaly Detection
Look for rare or unusual events that deviate from the majority of the data, used due to the difficulty in defining abnormal and the lack of outlier data.
Global perspective: flags anomalies based on deviation from the whole dataset; Local: flags anomalies based on local neighborhoods (subsets) of the data
Sequential detection: used for streaming data and minimizing costs; Batch: used when relationships between datapoints can be analyzed comprehensively
Masking: occurs when too many anomalies hide the detection of other anomalies; Swamping: occurs when normal points are misflagged as anomalies due to the influence of actual anomalies
Model-based detection: detects anomalies using constructed statistical models and evaluates points by how well they fit the model (probability distributions)
Density-Based: Labels points in low-density regions as anomalies (DBSCAN), adaptive to different shapes in clusters
Cluster-based: groups similar datapoints and identifies anomalies if they don't fit well into any cluster, lie far from cluster centers, or have low membership scores
Proximity-based: datapoints with the largest distance to their kth nearest neighbor are labeled anomalies
Isolation Forests: Recursively isolate points, score points based on how quickly they were isolated, minimal parameters needed, low cost, effective in high dimensions
One-Class SVM: used for novelty detection (training data does not contain outliers, but we want to detect anomalies in new data); detects deviations from normal patterns, trains on normal data points only, and handles high dimensions well but needs careful kernel selection and feature scaling
Statistical methods: best for well-understood distributions; Density-based: ideal for spatial data; Isolation Forest: excellent for high-dimensional data; Deep learning: best for complex patterns in large datasets
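A minimal sketch of Isolation Forest and One-Class SVM on toy data with a few injected outliers (the contamination and nu values are illustrative):
```python
# Two detectors on the same toy data: Isolation Forest sees everything,
# One-Class SVM is fit on normal points only (novelty detection).
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal   = rng.normal(0, 1, size=(200, 2))
outliers = rng.uniform(-6, 6, size=(10, 2))
X = np.vstack([normal, outliers])

print(IsolationForest(contamination=0.05, random_state=0).fit_predict(X)[-10:])  # -1 = anomaly
print(OneClassSVM(kernel="rbf", nu=0.05).fit(normal).predict(outliers))          # deviations from normal
```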
Association Analysis
Data mining technique used to find patterns between different items in a large dataset (market-basket analysis: identify rules that predict occurrence of an item based on other items)
Rules suggest a strong correlation between 2 items (items can also be groups of items) and are typically represented as an if-then statement
Transaction width: number of items in a transaction; Itemset: a collection of 0 or more items (a k-itemset is an itemset with k items); Support count: number of transactions that contain a given itemset; Frequent itemset: an itemset whose support is higher than minsup
Support(X → Y) = (# of transactions containing X and Y) / (total # of transactions): how often a rule is applicable to a dataset
Confidence(X → Y) = (# of transactions containing X and Y) / (# of transactions containing X): how frequently Y appears in transactions that contain X
Frequent itemset generation: find all itemsets that satisfy minsup; Strong rule generation: find all rules in the frequent itemsets that satisfy minconf
Apriori principle: if an itemset is frequent, all of its subsets must also be frequent; if an itemset is infrequent, all of its supersets must also be infrequent. This is used to trim exponential candidate growth. Ex: if bread is infrequent, there is no need to test itemsets containing bread
Find individual items and their support, prune items that do not meet minsup, create 2-itemsets, prune based on minsup, create 3-itemsets, prune based on minsup, and prune any candidate itemset that has an infrequent subset.
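A minimal sketch computing support and confidence for one candidate rule from a toy transaction list:
```python
# Support and confidence for the rule {bread} -> {milk} over made-up transactions.
transactions = [
    {"bread", "milk"}, {"bread", "diapers", "beer"}, {"milk", "diapers"},
    {"bread", "milk", "diapers"}, {"bread", "milk", "beer"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"bread"}, {"milk"}
print(support(X | Y))                       # support of the rule X -> Y
print(support(X | Y) / support(X))          # confidence of the rule X -> Y
```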
Rule Gen
Maximal frequent itemset: a frequent itemset with no immediate frequent supersets; Closed itemset: a frequent itemset that has no immediate superset with the same support count
Antecedents: left side of the rule, Consequent: Right side of the rule
Apriori rule generation: create rules from frequent itemsets with 1 item as the consequent; only high-confidence rules from the prior set are used to generate new candidates (based on the consequents from the last round); repeat and evaluate all candidate rules against minconf
Evaluation for Association Analysis
Lift measures how much more often the antecedent and consequent of a rule occur together than expected if they were statistically independent. Lift can help to identify rules that might be interesting even if the support is not very high. However, lift alone doesn't tell us about the actual frequency of the rule. Lift = P(X,Y) / (P(X)P(Y)) = c(X → Y) / s(Y)
Leverage measures the difference between the observed frequency of a rule and the frequency that would be expected if the rule's items were independent. It adds context to the
support and confidence by showing whether the rule occurs more often than would be expected based on the individual items' frequencies. 𝐿𝑒𝑣𝑒𝑟𝑎𝑔𝑒 = 𝑃(𝑋 ∩ 𝑌) − 𝑃(𝑋)𝑃(𝑌)
Conviction measures the degree to which the antecedent depends on the consequent. It indicates the expected error rate of the rule if the antecedent and consequent were assumed to be independent. Conviction = (1 − P(Y)) / (1 − c(X → Y))
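A minimal sketch computing lift, leverage, and conviction from the formulas above; the probabilities are made up:
```python
# Rule-interest measures from assumed probabilities P(X), P(Y), and P(X and Y).
p_x, p_y, p_xy = 0.4, 0.5, 0.3

confidence = p_xy / p_x
lift       = p_xy / (p_x * p_y)             # equivalently confidence / P(Y)
leverage   = p_xy - p_x * p_y               # observed minus expected-under-independence
conviction = (1 - p_y) / (1 - confidence)   # undefined when confidence = 1

print(lift, leverage, conviction)
```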