Comparative Study
Machine Learning Algorithms
• Supervised Learning techniques
– Naïve Bayes Classifier
– Decision tree
– Random Forest
– Support Vector Machine
– K Nearest Neighbors
Naïve Bayes
• Advantages
• It is simple and easy to implement
• It doesn’t require as much training data as many other algorithms
• It handles both continuous and discrete data
• It is highly scalable with the number of predictors
and data points
• It is fast and can be used to make real-time
predictions
• It is not sensitive to irrelevant features
Naïve Bayes
• Disadvantages
• Naive Bayes assumes that all predictors (or features) are
independent, an assumption that rarely holds in real life. This
limits the applicability of the algorithm in real-world use cases.
• This algorithm faces the ‘zero-frequency problem’ where
it assigns zero probability to a categorical variable whose
category in the test data set wasn’t available in the
training dataset.
• Its probability estimates are often poorly calibrated, so its
probability outputs should not be taken at face value.
Applications
• Spam filtering
• Sentiment analysis
• Document classification
• Real-time prediction systems
Decision Tree
• Advantages
• Simplicity and Interpretability: Decision trees are
straightforward and easy to understand. You can visualize them
like a flowchart which makes it simple to see how decisions are
made.
• Versatility: They can be used for different types of tasks,
working well for both classification and regression
• No Need for Feature Scaling: They don’t require you to
normalize or scale your data.
• Handles Non-linear Relationships: It is capable of capturing
non-linear relationships between features and target variables.
Decision Tree
• Disadvantages
• Overfitting: Overfitting occurs when a decision tree captures
noise and details in the training data, causing it to perform
poorly on new data.
• Instability: The model can be unreliable; slight variations in
the input data can lead to significant differences in
predictions.
• Bias towards Features with More Levels: Decision trees can
become biased towards features with many categories,
focusing too much on them during decision-making. This can
cause the model to miss other important features, leading to
less accurate predictions.
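Overfitting is usually controlled by pruning or by limiting tree depth. A minimal sketch of a one-feature decision tree with a `max_depth` parameter, on a toy dataset (invented for illustration) containing one noisy label:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(xs, ys, depth=0, max_depth=3):
    """Recursively grow a 1-D decision tree; max_depth limits overfitting."""
    majority = Counter(ys).most_common(1)[0][0]
    if depth >= max_depth or gini(ys) == 0.0:
        return majority  # leaf node
    best = None
    for t in sorted(set(xs)):  # try each value as a split threshold
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[0]:
            best = (score, t, left, right)
    if best is None:
        return majority
    _, t, left_y, right_y = best
    left_x = [x for x in xs if x <= t]
    right_x = [x for x in xs if x > t]
    return (t,
            build_tree(left_x, left_y, depth + 1, max_depth),
            build_tree(right_x, right_y, depth + 1, max_depth))

def predict(node, x):
    """Walk the tree until a leaf (a plain label) is reached."""
    while isinstance(node, tuple):
        t, left, right = node
        node = left if x <= t else right
    return node

xs = [1, 2, 3, 4, 10, 11, 12]
ys = ["a", "a", "b", "a", "b", "b", "b"]  # the label at x=3 is noise
shallow = build_tree(xs, ys, max_depth=1)
deep = build_tree(xs, ys, max_depth=3)
# the shallow tree ignores the noisy point; the deep tree memorizes it
print(predict(shallow, 3), predict(deep, 3))
```

With `max_depth=1` the single split lands between the two clusters and the noisy point is outvoted; with more depth, the tree carves out a region just to reproduce that one label.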
Applications
• Predictive Analytics in healthcare
• Credit card risk assessment in finance
• Customer segmentation in marketing
• Churn prediction in telecom
• Fraud detection in banking
Random Forest
• Advantages
• High Accuracy due to ensemble learning
• Handles large datasets with higher
dimensionality
• Robust to overfitting due to multiple decision
trees
• Works well for both classification and regression
• Handles missing data effectively
Random Forest
• Disadvantages
• Computationally slower, especially when dealing
with large datasets
• Less interpretable compared to other algorithms
• Requires parameter tuning: the number of trees, the
maximum depth of each tree, and the number of
features considered at each split
• May not perform well on datasets with high noise
levels
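The tuning parameters above (number of trees, depth, features per split) appear even in a stripped-down sketch of the random-forest idea: bootstrap sampling plus majority voting over simple one-split trees (stumps). The toy dataset and names below are invented for illustration:

```python
import random
from collections import Counter

def train_stump(X, y, feat):
    """One-split tree on a single feature, minimizing training errors."""
    best = None
    for t in sorted({row[feat] for row in X}):
        left = [yi for row, yi in zip(X, y) if row[feat] <= t]
        right = [yi for row, yi in zip(X, y) if row[feat] > t]
        if not right:
            continue
        lm = Counter(left).most_common(1)[0][0]
        rm = Counter(right).most_common(1)[0][0]
        errors = sum(yi != lm for yi in left) + sum(yi != rm for yi in right)
        if best is None or errors < best[0]:
            best = (errors, t, lm, rm)
    if best is None:  # all sampled values identical: constant stump
        m = Counter(y).most_common(1)[0][0]
        return (feat, float("inf"), m, m)
    return (feat,) + best[1:]

def train_forest(X, y, n_trees=25, seed=0):
    """Bagging: each stump sees a bootstrap sample and one random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # bootstrap rows
        feat = rng.randrange(len(X[0]))           # random feature per tree
        forest.append(train_stump([X[i] for i in idx],
                                  [y[i] for i in idx], feat))
    return forest

def predict(forest, row):
    """Majority vote over all stumps in the ensemble."""
    votes = [lm if row[f] <= t else rm for f, t, lm, rm in forest]
    return Counter(votes).most_common(1)[0][0]

# Toy 2-D data: class "a" has low x1 / high x2, class "b" the opposite
X = [(1, 5), (2, 4), (3, 6), (8, 1), (9, 2), (10, 0)]
y = ["a", "a", "a", "b", "b", "b"]
forest = train_forest(X, y)
print(predict(forest, (1, 5)), predict(forest, (10, 0)))
```

A real random forest grows full (or depth-limited) trees and subsamples features at every split rather than once per tree, but the bootstrap-plus-voting structure is the same.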
Applications
• Medical Diagnosis
• Image classification
• Fraud detection
• Customer segmentation
• Energy demand forecasting
Support Vector Machine
• Advantages
• SVM performs well with data that has many
attributes
• Gives good results even when there is limited
information about the data. Also works well
with unstructured data
• SVM can use kernels to transform data and
learn non-linear patterns
• Robust to noise
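The kernel idea can be illustrated directly: a degree-2 polynomial kernel computes the same value as an explicit mapping into a higher-dimensional feature space, without ever constructing that space. A small sketch (function names are illustrative):

```python
import math

def poly_kernel(x, z):
    """Degree-2 polynomial kernel: K(x, z) = (x . z)^2."""
    return sum(a * b for a, b in zip(x, z)) ** 2

def feature_map(x):
    """Explicit feature map phi for this kernel on 2-D inputs:
    phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2), so K(x, z) = phi(x) . phi(z)."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

x, z = (1.0, 2.0), (3.0, 0.5)
lhs = poly_kernel(x, z)
rhs = sum(a * b for a, b in zip(feature_map(x), feature_map(z)))
# both equal (x . z)^2 = 16, up to float rounding: the kernel works in
# the 3-D feature space without ever mapping the points explicitly
print(lhs, rhs)
```

This is why SVMs can learn non-linear boundaries cheaply: the decision function only needs kernel values between pairs of points, never the high-dimensional coordinates themselves.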
Support Vector Machine
• Disadvantages
• Computationally expensive
• Inherently limited to two-class problems; multi-class
tasks require extensions such as one-vs-rest
• Not suitable for datasets with missing values
• No direct probabilistic interpretation of its outputs
Applications
• Face detection
• Text categorization to find important
information
• Bioinformatics
• Handwriting recognition
K Nearest Neighbor
• Advantages
• No training period: the model does not learn anything
during training, it simply stores the data
• Since it requires no training before making
predictions, new data can be added seamlessly
without retraining the model
• Easy to implement, as only two parameters are
required: the value of K and the distance function
K Nearest Neighbor
• Disadvantages
• Slow with large datasets
• Does not work well in high dimensions
(the curse of dimensionality)
• Prone to overfitting, especially for small values of K
• Feature scaling must be applied before KNN
• Sensitive to noisy data, missing values and
outliers
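The need for feature scaling can be demonstrated with a minimal KNN sketch; the two-feature toy dataset below is invented, with one feature on a much larger scale than the other:

```python
import math
from collections import Counter

def knn_predict(X, y, query, k=3):
    """Majority label among the k nearest points (Euclidean distance)."""
    nearest = sorted(range(len(X)), key=lambda i: math.dist(X[i], query))[:k]
    return Counter(y[i] for i in nearest).most_common(1)[0][0]

def min_max_scale(X, query):
    """Rescale every feature to [0, 1] so no single feature dominates."""
    cols = list(zip(*X))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    def scale(row):
        return tuple((v - l) / (h - l) for v, l, h in zip(row, lo, hi))
    return [scale(r) for r in X], scale(query)

# Toy data: feature 0 is on a much larger scale than feature 1,
# but only feature 1 actually separates the classes.
X = [(10000, 1), (10200, 2), (10400, 9), (10600, 10)]
y = ["young", "young", "old", "old"]
q = (10500, 1)
print(knn_predict(X, y, q))    # unscaled: feature 0 dominates -> "old"
Xs, qs = min_max_scale(X, q)
print(knn_predict(Xs, y, qs))  # scaled: feature 1 counts again -> "young"
```

Without scaling, distances are dominated by the large-magnitude feature and the query is matched to the wrong class; after min-max scaling both features contribute comparably.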
Applications
• Recommendation Systems
• Spam Detection
• Customer segmentation
• Speech Recognition