Employee Salary Prediction Using Machine Learning Algorithms
BY: S. ADRIAN JUBAL
Bharath Institute of Higher Education and Research
Outline
• Problem statement
• System approach
• Algorithm & Deployment
• Results
• Conclusion
• References
Problem Statement
• Predict whether an individual's annual income is above or below $50K from demographic and employment attributes.
• Automate this income classification with machine learning models.
• Support pre-qualification procedures in sectors such as HR, finance, and insurance.
• Use structured data to reduce manual screening, improving efficiency and accuracy.
• Demonstrate ensemble learning on a real-world classification problem.
System Approach
System Requirements: Python 3.12, scikit-learn, XGBoost, pandas
Libraries Used: sklearn, xgboost, pandas, numpy, matplotlib
Backend: Ensemble model combining Random Forest, Logistic Regression, and XGBoost.
Data Source: UCI Adult Income dataset in CSV format, containing demographic and employment attributes.
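As a rough illustration of the data source, the sketch below loads an Adult-style CSV and derives a binary target. The column names, the `income` label values (`<=50K` / `>50K`), the use of `?` for missing values, and the helper names `load_adult` and `split_features_target` are assumptions for illustration; the in-memory sample stands in for the real file, which is much larger.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the Adult CSV (hypothetical rows and column subset).
SAMPLE_CSV = """age,workclass,education,hours-per-week,income
39,State-gov,Bachelors,40,<=50K
50,Self-emp-not-inc,Bachelors,13,<=50K
38,?,HS-grad,40,<=50K
52,Private,Masters,45,>50K
31,Private,Masters,50,>50K
"""


def load_adult(source):
    """Read an Adult-style CSV, treat '?' as missing, and drop incomplete rows."""
    df = pd.read_csv(source, na_values="?", skipinitialspace=True)
    df.columns = df.columns.str.strip()
    return df.dropna().reset_index(drop=True)


def split_features_target(df, target="income"):
    """Return the feature frame and a 0/1 target (1 means income above $50K)."""
    y = (df[target].str.strip() == ">50K").astype(int)
    X = df.drop(columns=[target])
    return X, y


df = load_adult(io.StringIO(SAMPLE_CSV))
X, y = split_features_target(df)
print(X.shape, y.tolist())
```

With a real file, `load_adult` would be called with the CSV path instead of the `StringIO` object.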
Algorithms
Random Forest Classifier
Logistic Regression
XGBoost Classifier
Voting Classifier (Ensemble)
KMeans Clustering (cluster label added as an extra feature)
Isolation Forest (outlier detection and removal)
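The three classifiers listed above can be combined with scikit-learn's `VotingClassifier`. The following is a minimal sketch on synthetic data, not the project's actual configuration: the hyperparameters and the choice of soft voting are assumptions, and if the `xgboost` package is not installed the sketch falls back to scikit-learn's `GradientBoostingClassifier` as a stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression

try:
    from xgboost import XGBClassifier

    boosted = XGBClassifier(n_estimators=50, random_state=42)
except ImportError:
    # Stand-in only: used when xgboost is unavailable.
    boosted = GradientBoostingClassifier(n_estimators=50, random_state=42)

# Synthetic binary classification data in place of the preprocessed Adult features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("xgb", boosted),
    ],
    voting="soft",  # assumption: average predicted probabilities across models
)
ensemble.fit(X, y)

# Each row holds the averaged class probabilities for one sample.
proba = ensemble.predict_proba(X[:5])
print(proba.round(3))
```

Soft voting averages class probabilities, which requires every member to implement `predict_proba`; `voting="hard"` would take a majority vote over predicted labels instead.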
Steps & Measures
Import required libraries
Load and clean the dataset
Separate features and target
Identify numerical and categorical columns
Create preprocessing pipelines
Apply preprocessing using ColumnTransformer
Detect and remove outliers using Isolation Forest
Convert sparse matrix to dense if needed
Add KMeans cluster label as a new feature
Split data into training and testing sets
Define individual models: Random Forest, Logistic Regression, XGBoost
Create an ensemble model using VotingClassifier
Train the ensemble model
Evaluate the model: Accuracy + Classification Report
Plot and save model accuracy
Save the trained model with preprocessing and clustering using joblib
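The steps above can be sketched end to end as follows. This is an illustrative outline under stated assumptions rather than the project's code: the data is synthetic, the column names are hypothetical, the contamination rate, cluster count, and split ratio are placeholder values, and a single Random Forest stands in for the voting ensemble to keep the example short.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical columns and labelling rule).
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "hours_per_week": rng.integers(10, 60, n),
    "workclass": rng.choice(["Private", "State-gov", "Self-emp"], n),
})
y = ((X["age"] > 40) & (X["hours_per_week"] > 35)).astype(int).to_numpy()

# Identify numerical and categorical columns.
num_cols = X.select_dtypes(include="number").columns.tolist()
cat_cols = X.select_dtypes(exclude="number").columns.tolist()

# Preprocessing pipelines combined in a ColumnTransformer.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])
X_t = preprocess.fit_transform(X)
if hasattr(X_t, "toarray"):  # convert sparse output to dense if needed
    X_t = X_t.toarray()

# Keep only rows the Isolation Forest labels as inliers (+1).
mask = IsolationForest(contamination=0.05, random_state=42).fit_predict(X_t) == 1
X_in, y_in = X_t[mask], y[mask]

# Append the KMeans cluster label as one extra feature column.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_in)
X_aug = np.column_stack([X_in, kmeans.labels_])

X_train, X_test, y_train, y_test = train_test_split(
    X_aug, y_in, test_size=0.2, stratify=y_in, random_state=42
)

# Stand-in for the voting ensemble.
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"accuracy: {acc:.3f}")
print(classification_report(y_test, y_pred, zero_division=0))

# Save the model together with the preprocessing and clustering objects.
path = os.path.join(tempfile.mkdtemp(), "salary_model.pkl")
joblib.dump({"preprocess": preprocess, "kmeans": kmeans, "model": model}, path)
bundle = joblib.load(path)
```

At prediction time the saved objects would be applied in the same order: `preprocess.transform`, then `kmeans.predict` to produce the cluster column, then `model.predict`. Fitting the preprocessing and clustering before the split follows the step order listed above; fitting them on the training split only would avoid exposing test data to those steps.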
Results
Links
Model Link: Prediction model code
Google Colab: Prediction Model
Short Description:
• 🔗 [GitHub Link] — Complete project as ZIP (includes outcome, trained model,
training code, and presentation)
• 📦 [Drive Link] — Direct download of trained model file (.pkl)
References
1. Scikit-learn Documentation
https://scikit-learn.org/stable/documentation.html
2. XGBoost Documentation
https://xgboost.readthedocs.io/
3. PyInstaller Docs
https://pyinstaller.org/
Conclusion
Summarize the findings and discuss the effectiveness of
the proposed solution. Highlight any challenges
encountered during the implementation and potential
improvements.
THANK YOU