Understanding Feature Importance in Logistic Regression
Feature importance helps identify which variables most strongly influence the model's predictions. In a logistic regression model, each feature's coefficient determines its influence on the final decision boundary.
Positive Coefficients: Indicate a positive relationship between the feature and the target variable; higher values of the feature increase the probability of the positive class.
Negative Coefficients: Indicate a negative relationship; higher values of the feature decrease the probability of the positive class.
Magnitude of Importance: Larger absolute coefficients (positive or negative) indicate stronger influence. A short sketch for extracting and ranking coefficients follows this list.
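Because these importances come straight from the fitted model's coefficients, they can be extracted and ranked directly. The sketch below assumes a fitted PySpark LogisticRegressionModel named lr_model and a feature_names list aligned with the feature vector (one way to build that list appears after the observations below); both names are hypothetical placeholders.

    # Minimal sketch: rank features by absolute coefficient magnitude.
    # `lr_model` (a fitted LogisticRegressionModel) and `feature_names`
    # (a list aligned with the feature vector) are assumed to exist.
    import pandas as pd

    coefficients = lr_model.coefficients.toArray()  # one weight per feature

    importance = (
        pd.DataFrame({"feature": feature_names, "coefficient": coefficients})
        .assign(abs_coefficient=lambda d: d["coefficient"].abs())
        .sort_values("abs_coefficient", ascending=False)
    )
    print(importance.head(10))  # strongest influences, positive or negative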
Key Observations from the Feature Importance Graph
1. Scaled Numerical Features:
   - The most important features are scaled_numerical_features_X, which are transformed versions of the numerical variables.
   - These represent continuous variables, such as transaction amounts, number of accounts, or customer behavior trends, that strongly impact model predictions.
   - Features like scaled_numerical_features_6, scaled_numerical_features_17, and scaled_numerical_features_5 suggest that certain transactional patterns significantly influence the recommendation.
2. Categorical Features (One-Hot Encoded):
   - Features like marital_status_vec_1.0, employment_status_vec_3.0, and gender_vec_2.0 show how different demographic groups influence model predictions (a sketch for recovering these names from the feature vector's metadata follows this list).
   - The high importance of marital_status_vec_1.0 suggests that marital status plays a strong role in whether a fixed deposit (FD) recommendation is made.
   - employment_status_vec_3.0 indicates that specific employment categories correlate more strongly with FD recommendations.
3. Employment and Occupation Features:
   - Multiple employment_status_vec_X and occupation_vec_X features have medium-to-high importance, suggesting that job stability, income type, or industry may be key indicators of fixed deposit interest.
4. Age and Withdrawal Trends:
   - age_group_vec_X and withdrawal_trends_vec_X features are present but lower in importance, suggesting that while they contribute, they are not the strongest predictors.
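The encoded names above (scaled_numerical_features_X, marital_status_vec_1.0, and so on) can typically be recovered from the ML attribute metadata that VectorAssembler attaches to its output column. A hedged sketch, assuming the transformed DataFrame is called train_df and the assembled column is "features" (both names are assumptions):

    # Recover an index -> feature-name mapping from the vector column's metadata.
    # Assumes `train_df` was produced by a pipeline ending in a VectorAssembler
    # whose output column is "features"; both names are assumptions.
    meta = train_df.schema["features"].metadata["ml_attr"]["attrs"]

    # Attributes are grouped by type ("numeric", "binary", "nominal"); flatten
    # them into a list ordered by vector index.
    index_to_name = {
        attr["idx"]: attr["name"]
        for group in meta.values()
        for attr in group
    }
    feature_names = [index_to_name[i] for i in range(len(index_to_name))]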
How This Helps in Model Optimization
Feature Selection: Features with very low importance can be removed to simplify the model and reduce noise.
Regularization Adjustments: If only a few features dominate, the model may be overfitting to specific variables. Increasing the L1 penalty (elasticNetParam=1.0 makes the penalty pure lasso) drives weak coefficients to exactly zero, encouraging feature sparsity (see the sketch after this list).
Domain Insights: This analysis confirms that customer demographics, employment type, and transaction behavior are crucial factors in FD recommendations.
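As a concrete illustration of the regularization adjustment above, here is a minimal sketch of a pure-L1 configuration in PySpark; the column names and parameter values are illustrative assumptions, not tuned settings.

    from pyspark.ml.classification import LogisticRegression

    lr = LogisticRegression(
        featuresCol="features",               # assumed assembler output column
        labelCol="recommendation_fd_target",
        regParam=0.1,                         # overall penalty strength (assumed)
        elasticNetParam=1.0,                  # 1.0 = pure L1 (lasso) penalty
    )
    lr_model = lr.fit(train_df)               # train_df: hypothetical training data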
Next Steps for Model Improvement
1. Evaluate Model Generalization:
   - A near-perfect score (e.g., ROC AUC = 1.0) often signals overfitting; testing on genuinely new data will confirm this.
2. Feature Pruning:
   - Remove low-importance categorical features if they contribute little.
   - Focus on the highly important transactional features.
3. Hyperparameter Fine-Tuning:
   - Increase L1 regularization to zero out less significant features.
   - Adjust elasticNetParam to balance feature selection and generalization (a tuning sketch follows this list).
4. Check for Data Leakage:
   - If certain features seem too powerful, they may contain unintended correlations with the target variable.
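Steps 1 and 3 can be combined with Spark's built-in tuning utilities. A sketch, assuming the same hypothetical train_df and column names as above; the grid values are arbitrary starting points:

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    lr = LogisticRegression(featuresCol="features",
                            labelCol="recommendation_fd_target")

    grid = (
        ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1, 0.5])
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])  # ridge -> elastic net -> lasso
        .build()
    )

    evaluator = BinaryClassificationEvaluator(
        labelCol="recommendation_fd_target", metricName="areaUnderROC"
    )

    cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=4)
    cv_model = cv.fit(train_df)
    print(cv_model.avgMetrics)  # mean ROC AUC per parameter combination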
Final Takeaway
The top features in this visualization are strong indicators of how financial behaviors and demographics affect FD recommendations. By refining feature selection and regularization, we can build a more robust and interpretable model.
Class Distribution:
recommendation_fd_target | count
1                        | 13
0                        | 32
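The counts above were presumably produced by grouping on the label column; for reference, a one-line equivalent (df is the hypothetical prepared DataFrame):

    df.groupBy("recommendation_fd_target").count().show()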
Cross-Validation (k-fold, k = 4):
ROC AUC for Fold 1: 0.8846153846153847
ROC AUC for Fold 2: 0.9423076923076923
ROC AUC for Fold 3: 0.9927884615384616
ROC AUC for Fold 4: 0.9951923076923077
LogisticRegressionModel: uid=LogisticRegression_79495f0c73fe, numClasses=2, numFeatures=54
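Spark's CrossValidator reports only metrics averaged across folds, so per-fold scores like those above typically come from a manual loop. A sketch under that assumption, reusing the hypothetical df and column names:

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    folds = df.randomSplit([0.25, 0.25, 0.25, 0.25], seed=42)  # k = 4
    evaluator = BinaryClassificationEvaluator(
        labelCol="recommendation_fd_target", metricName="areaUnderROC"
    )

    for i, test_fold in enumerate(folds, start=1):
        # Train on the union of the remaining folds, evaluate on the held-out one.
        train_df = None
        for j, fold in enumerate(folds, start=1):
            if j != i:
                train_df = fold if train_df is None else train_df.union(fold)
        model = LogisticRegression(
            featuresCol="features", labelCol="recommendation_fd_target"
        ).fit(train_df)
        auc = evaluator.evaluate(model.transform(test_fold))
        print(f"ROC AUC for Fold {i}: {auc}")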
Correlation matrix