Slide 15: Data Preprocessing
This bar graph illustrates the distribution of diseases in our dataset. Each bar represents a
distinct disease, with equal frequencies across 41 medical conditions. The balanced dataset
minimizes bias, ensuring fair and accurate predictions across all classes.
Slide 16: Feature Selection
We employed two methods for feature selection:
● Recursive Feature Elimination (RFE): Identifies the most impactful features by
iteratively removing less significant ones.
● Mutual Information: Measures the dependency between features and the target
variable, selecting the most predictive attributes.
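The two methods above can be sketched with scikit-learn on synthetic data; the dataset shape, base estimator, and number of selected features here are illustrative assumptions, not the project's actual values.

```python
# Sketch of both selection methods on synthetic data; dataset shape,
# base estimator, and k=4 are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)

# RFE: repeatedly fit a model and drop the weakest feature.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(X, y)
rfe_selected = np.flatnonzero(rfe.support_)

# Mutual information: score each feature's dependency on the target.
mi_scores = mutual_info_classif(X, y, random_state=0)
mi_selected = np.argsort(mi_scores)[::-1][:4]

print("RFE picked features:", rfe_selected)
print("MI picked features:", sorted(mi_selected))
```

In practice the two methods often agree on the strongest predictors, and their intersection is a reasonable shortlist.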
Slide 16 (continued): Correlation Heatmap
The heatmap visualizes feature relationships:
● Diagonal (Red): Perfect self-correlation (value = 1).
● Color Scale:
○ Red: Strong positive correlation.
○ Blue: Strong negative correlation.
○ White: Little to no correlation.
Identifying highly correlated features guides feature selection: redundant features can be dropped, reducing dimensionality and improving model performance.
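The correlation matrix behind such a heatmap can be computed with pandas; the symptom columns below are synthetic placeholders chosen to show strong positive, strong negative, and near-zero correlation.

```python
# Minimal sketch of the computation behind the heatmap; the symptom
# columns are synthetic placeholders, not the real dataset's values.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "fatigue": base,
    "joint_pain": base + 0.1 * rng.normal(size=200),   # strong positive (red)
    "headache": -base + 0.1 * rng.normal(size=200),    # strong negative (blue)
    "high_fever": rng.normal(size=200),                # near zero (white)
})

corr = df.corr()
print(corr.round(2))
# The diagonal is exactly 1 (self-correlation); off-diagonal cells map to
# the red/blue/white color scale, e.g. via seaborn.heatmap(corr, cmap="coolwarm").
```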
Slide 17: Feature Importance Visualization
This bar chart ranks features by their importance scores:
● Top Feature: Fatigue has the highest influence on predictions.
● Other impactful features include joint_pain, headache, and high_fever.
● Importance scores decrease down the list, highlighting diminishing contributions.
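A ranking like this is typically read off a fitted tree ensemble's importance scores; the sketch below uses synthetic data and placeholder feature names, so the resulting order is illustrative only.

```python
# Sketch of producing an importance ranking with a tree-based model;
# data and feature names are placeholders, not the project's.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=6,
                           n_informative=3, random_state=0)
names = ["fatigue", "joint_pain", "headache",
         "high_fever", "nausea", "chills"]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances sum to 1; sorting them descending yields the bar chart order.
order = np.argsort(model.feature_importances_)[::-1]
for i in order:
    print(f"{names[i]:>10s}: {model.feature_importances_[i]:.3f}")
```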
Slide 19: Top 50 Features
This bar chart displays the top 50 important features based on importance scores:
● Top Feature: Muscle_pain is the most influential predictor.
● Other key features: Itching, altered_sensorium, dark_urine, and high_fever.
● Features were ranked using importance metrics derived from tree-based models; limiting
the model to the top 50 balances predictive performance against complexity.
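The trade-off above can be sketched by retraining on only the top-ranked features; the dataset, the 40-feature width, and the cutoff of 10 are assumptions for illustration.

```python
# Hedged sketch: keep only the top-k features by importance and retrain,
# illustrating the performance-vs-complexity trade-off on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=40,
                           n_informative=8, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
top_k = np.argsort(rf.feature_importances_)[::-1][:10]  # top 10 of 40

full_acc = cross_val_score(rf, X, y, cv=5).mean()
slim_acc = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X[:, top_k], y, cv=5).mean()
print(f"all 40 features: {full_acc:.3f}, top 10: {slim_acc:.3f}")
```

A much smaller feature set usually retains most of the accuracy while simplifying the model.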
Slide 20: Confusion Matrices
Confusion matrices compare Random Forest (left) and Gradient Boosting (right) models:
● Axes: Predicted labels (x-axis) vs. actual labels (y-axis).
● Diagonal Values: Correct predictions dominate, indicating high accuracy.
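The two matrices can be generated as follows; the models, split, and three-class synthetic data stand in for the real 41-disease setup.

```python
# Sketch of producing both confusion matrices; data and class count
# are illustrative stand-ins for the real 41-disease dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, n_classes=3,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for Model in (RandomForestClassifier, GradientBoostingClassifier):
    preds = Model(random_state=0).fit(X_tr, y_tr).predict(X_te)
    cm = confusion_matrix(y_te, preds)
    print(Model.__name__)
    print(cm)  # rows = actual labels, columns = predicted labels
```

Each row sums to that class's true count, so a dominant diagonal reads directly as per-class accuracy.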
Slide 22: Algorithm Comparison
We tested four algorithms:
● Random Forest, Gradient Boosting, Support Vector Classifier, and K-Nearest
Neighbors.
● Result: Random Forest delivered the highest accuracy, which will be detailed in the
results section.
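A comparison like this is commonly run with cross-validation; the data below is synthetic, so the exact ranking it prints need not match the project's reported results.

```python
# Sketch of comparing the four algorithms via 5-fold cross-validation;
# synthetic data, so the ranking is illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=12,
                           n_informative=6, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Support Vector Classifier": SVC(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:>26s}: {acc:.3f}")
```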