Model Explainability
Resources:
- https://christophm.github.io/interpretable-ml-book/shap.html
- https://docs.seldon.io/projects/alibi/en/stable/overview/getting_started.html
- https://compstat-lmu.github.io/iml_methods_limitations/index.html
To do
- Read the rest of the chapters of Christoph Molnar's book + the SHAP paper
Generally
Preferred methods
Marcos López de Prado, in his ML interpretability presentation, presents MDA and Shapley
values as two options – my preferences too (he also mentions a disadvantage of MDI compared
to MDA)
Implementation
The test set is the better idea, but the train set offers insight into what "weights" the model has
given to each variable (even if it is overfitting)
o You need to decide whether you want to know how much the model relies
on each feature for making predictions (-> training data) or how much the
feature contributes to the performance of the model on unseen data (-> test
data)
Best Interpretable algorithm
- Explainable Boosting Machine (EBM); see Microsoft's interpret package (a short sketch follows)
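A minimal sketch of fitting an EBM with the interpret package; the data names (X, y) are assumptions and the exact API may differ slightly by version:
```python
# Sketch assuming the interpret package is installed and X, y hold tabular
# features and a numeric target.
from interpret.glassbox import ExplainableBoostingRegressor
from interpret import show

ebm = ExplainableBoostingRegressor()
ebm.fit(X, y)

# Global explanation: the per-feature shape functions the EBM has learned
# (renders in a notebook / local dashboard)
show(ebm.explain_global())
```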
Partial Dependence plot
- Typically for a single feature (can also be with 2 and view in 3D)
Mathematically/Method:
- For a feature: sample randomly rows of data for the rest of the features, evaluate the
prediction under each random sample + the current value of the feature (e.g. temperature
=20 degrees), and take the mean to find the mean prediction under that value of the feature
(Monte Carlo method)
- x_S = the feature being assessed, x_C = the set of remaining features
- Specifically (Monte Carlo estimate): $\hat{f}_S(x_S) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}(x_S, x_C^{(i)})$
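A minimal sketch of this Monte Carlo estimate, assuming a fitted model with a scikit-learn-style .predict and a NumPy feature matrix X (the names are illustrative):
```python
import numpy as np

def partial_dependence(model, X, feature_idx, grid):
    """Monte Carlo PDP: for each grid value v, force x_S = v for every row
    (x_C keeps its observed values) and average the model's predictions."""
    pd_values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = v          # set the assessed feature to v everywhere
        pd_values.append(model.predict(X_mod).mean())
    return np.array(pd_values)

# e.g. PDP of feature 0 over 20 grid points spanning its observed range
grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)
pdp = partial_dependence(model, X, 0, grid)
```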
Alternatives:
- Individual Conditional Expectation (ICE)
o Accounts for heterogeneous effects
- Accumulated Local effect (ALE)
o Accounts for the case of correlation between the feature assessed and the rest of the
features (which are randomly sampled – if there is correlation we want to sample
conditionally, otherwise we get data instances that can be highly unlikely / don't match
the conditional distribution of feature values used to average over)
Disadvantages:
- Doesn't show the feature distribution. Omitting the distribution can be misleading, because you
might overinterpret regions with almost no data
o This problem is easily solved by showing a rug (indicators for data points on the x-axis)
or a histogram.
Advantages:
- Very easy to interpret
Individual Conditional Expectation
- Individual Conditional Expectation (ICE) plots display one line per instance
that shows how the instance's prediction changes when a feature changes.
- The partial dependence plot for the average effect of a feature is a global
method because it does not focus on specific instances, but on an overall
average. The equivalent to a PDP for individual data instances is called
individual conditional expectation (ICE) plot (Goldstein et al. 2017).
Advantage over PDP
- Feature interactions are not shown in the average case (PDP): PDP only
works well if the interactions between the features for which the PDP is
calculated and the other features are weak. In case of interactions, the ICE
plot will provide much more insight.
- i.e. more insight into the whole distribution of predictions for a specific feature
value
- Can draw the mean line (e.g. in bold yellow) for visualization purposes
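A quick way to get ICE curves (one line per instance) with the PDP average overlaid is scikit-learn's PartialDependenceDisplay; a hedged sketch, assuming a fitted estimator `model` and feature matrix X:
```python
# Sketch assuming scikit-learn >= 1.0; kind="both" draws ICE lines plus the PDP average.
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(model, X, features=[0], kind="both")
plt.show()
```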
Accumulated local effects
- The better version of PDP generally
- Resource: a better explanation of the exact methodology than Molnar's book (haven't checked
this one for the motivation/high-level idea though):
https://docs.seldon.io/projects/alibi/en/stable/methods/ALE.html
Idea of method
- "Accumulated": summed up over all intervals below that value of the feature; "effects": the
differences in prediction between the two edges of an interval; "local": computed within each
interval, for all data instances whose feature value falls in that interval (and averaged over them)
o So for each interval of feature values you get a single ALE value. To make the plot, just
linearly interpolate between the values of the intervals (e.g. placing each value at the
middle of its interval)
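A hedged sketch of computing ALE with the alibi package (linked above); the model, feature matrix and feature_names are assumptions:
```python
# Sketch assuming a fitted regressor `model`, a NumPy feature matrix X,
# and a list of feature names.
from alibi.explainers import ALE, plot_ale

ale = ALE(model.predict, feature_names=feature_names)
explanation = ale.explain(X)     # ALE values per feature over automatically chosen intervals
plot_ale(explanation)            # one ALE curve per feature
```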
To summarize how each type of plot (PDP, M, ALE) calculates the effect of a feature at a
certain grid value v:
- Partial Dependence Plots: "Let me show you what the model predicts on average
when each data instance has the value v for that feature. I ignore whether the
value v makes sense for all data instances."
- M-Plots: "Let me show you what the model predicts on average for data instances
that have values close to v for that feature. (conditional sampling) The effect could
be due to that feature, but also due to correlated features."
o What can we do to get a feature effect estimate that respects the
correlation of the features? We could average over the conditional
distribution of the feature, meaning at a grid value of x1, we average the
predictions of instances with a similar x1 value.
o But if, for example, we average the predictions of all houses of about 30
m², we estimate the combined effect of living area and of number of
rooms, because of their correlation -> this is where ALE comes into play
- ALE plots: "Let me show you how the model predictions change in a small
"window" of the feature around v for data instances in that window."
o M-Plots avoid averaging predictions of unlikely data instances, but they mix
the effect of a feature with the effects of all correlated features. ALE plots
solve this problem by calculating -- also based on the conditional
distribution of the features -- differences in predictions instead of
averages.
o Example: For the effect of living area at 30 m², the ALE method uses all
houses with about 30 m² (an interval around 30 m²), gets the model
predictions pretending these houses were 31 m² minus the predictions
pretending they were 29 m² (for the interval 29 m² – 31 m²). This gives us the pure
effect of the living area and is not mixing the effect with the effects of
correlated features. The use of differences blocks the effect of other
features.
Feature Interaction
Idea
- H-statistic based on Partial Dependence plot
- Case 1: For 2 variables
o Variance of the difference between PDP(x_k, x_j) and [PDP(x_k) + PDP(x_j)]; if there is no
interaction, the variance between the two terms would be 0
- Case 2: For a variable versus all the rest of the variables in the model
o Variance between the PDP using all variables and the PDP without that single
variable (i.e. the PDP of the other n-1 variables)
Use case:
- The H-statistic tells us the strength of interactions, but it does not tell us what the
interactions look like. That is what partial dependence plots are for. A meaningful
workflow is to measure the interaction strengths and then create 2D partial
dependence plots for the interactions you are interested in.
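A hedged sketch of the two-way H² statistic (Case 1 above), reusing the Monte Carlo PDP idea from earlier; `model`, X and the feature indices are assumptions:
```python
import numpy as np

def h_statistic_2d(model, X, j, k, n_sample=200):
    """Sketch of Friedman's two-way H^2 statistic between features j and k:
    compare the joint PDP with the sum of the single-feature PDPs, each
    evaluated at the instances' own observed values and mean-centered."""
    rng = np.random.default_rng(0)
    Xs = X[rng.choice(len(X), size=min(n_sample, len(X)), replace=False)]

    def centered_pd(cols):
        pd = np.empty(len(Xs))
        for i, row in enumerate(Xs):
            X_mod = Xs.copy()
            X_mod[:, cols] = row[cols]          # fix the chosen feature(s) to this row's values
            pd[i] = model.predict(X_mod).mean()
        return pd - pd.mean()

    pd_j, pd_k, pd_jk = centered_pd([j]), centered_pd([k]), centered_pd([j, k])
    # no interaction -> pd_jk ~ pd_j + pd_k and the numerator goes to 0
    return np.sum((pd_jk - pd_j - pd_k) ** 2) / np.sum(pd_jk ** 2)
```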
Permutation importance
- See the book for advantages/disadvantages; a scikit-learn sketch is shown below
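A sketch with scikit-learn's permutation_importance on a held-out test set (per the train vs. test discussion above); the model and data names are assumptions:
```python
# Sketch assuming a fitted estimator `model` and held-out X_test, y_test.
from sklearn.inspection import permutation_importance

result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
# print features from most to least important (mean drop in score when permuted)
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: {result.importances_mean[idx]:.4f} "
          f"+/- {result.importances_std[idx]:.4f}")
```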
Global Surrogate
Idea:
- Train an interpretable model to learn the predictions of the trained black-box model (using
our data)
- Use the interpretability of the interpretable model to interpret the black box model
o E.g. linear model, decision tree
o Can measure performance of interpretable model (using R^2 against the ML model
e.g.) to see how much to trust it
- Similar idea to LIME – the only difference is that LIME gives a high weight to the data points
around the specific instance (data point) that you want to interpret
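A minimal global-surrogate sketch: fit a shallow decision tree on the black-box model's own predictions and report how faithfully it reproduces them (model and data names are assumptions):
```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# targets for the surrogate are the black-box model's predictions, not the true labels
y_black_box = black_box_model.predict(X_train)
surrogate = DecisionTreeRegressor(max_depth=3).fit(X_train, y_black_box)

# fidelity: how well the interpretable model mimics the black box
fidelity = r2_score(y_black_box, surrogate.predict(X_train))
print(f"Surrogate fidelity (R^2 vs. black box): {fidelity:.3f}")
```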
Local Surrogate (LIME)
- Similar to Global Surrogate; the only difference is that we weight points with a smoothing kernel,
to give higher weight to data points closer to the instance we are examining (we want feature
importance, via the linear model's coefficients, for the variables involved in the prediction around
that instance – we can choose whatever variables we want to interpret, e.g. all of them, or only the
raw variables rather than the engineered ones)
- Data sampling used:
o We randomly sample data instances by sampling values for each feature
independently. We assume each feature is Gaussian and calculate its mean and std in
the data set, and use these to sample values for each feature. Then each instance is
weighted according to the smoothing kernel above (dependent on the data instance
we are examining)
o A bit simplistic – one of its disadvantages (it ignores correlation between features, and
their actual distribution)
- Possible for all types of data: tabular, text, images
- Good for explaining to stakeholders who want fast explanations, but not fully
precise/trustworthy because of some of its limitations (including the data sampling
mentioned above, instability in the choice of kernel width for tabular data, different versions of
sampled data, etc.)
- Very similar to Kernel Shap
- Can help determine sensitivities of model locally
o The Shapley value returns a simple value per feature, but no prediction
model like LIME. This means it cannot be used to make statements about
changes in prediction for changes in the input, such as: "If I were to earn
€300 more a year, my credit score would increase by 5 points."
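A hedged sketch with the lime package for tabular regression data; the model, data and feature_names are assumptions:
```python
# Sketch assuming NumPy arrays X_train / X_test, a fitted `model`, and feature names.
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(X_train, feature_names=feature_names, mode="regression")
exp = explainer.explain_instance(X_test[0], model.predict, num_features=5)
print(exp.as_list())     # (feature condition, local weight) pairs for this one instance
```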
Scoped Rules (Anchors)
- Irrelevant – too complex and too specific (e.g. classification only), and doesn't add much
beyond e.g. SHAP values
Shapley Values
Resources:
- Book - https://christophm.github.io/interpretable-ml-book/shap.html
- https://towardsdatascience.com/one-feature-attribution-method-to-supposedly-rule-them-all-shapley-values-f3e04534983d
- https://medium.com/analytics-vidhya/shap-part-2-kernel-shap-3c11e7a971b1
- Shap paper? – haven’t read yet
Extra:
- https://www.kaggle.com/dansbecker/advanced-uses-of-shap-values
- https://www.kaggle.com/dansbecker/shap-values
Idea:
The Shapley value -- a method from coalitional game theory -- tells us how to fairly distribute
the "payout" among the features.
o How much (positively or negatively) has each feature value contributed to the
prediction compared to the average prediction?
The Shapley value is the average marginal contribution (difference of the prediction from the
average prediction) of a feature value across all possible coalitions (of the feature set).
Calculation
For each of these coalitions we compute the predicted target value with and without the
feature value and take the difference to get the marginal contribution.
We replace the feature values of features that are not in a coalition with random feature
values from the dataset to get a prediction from the machine learning model.
o Other options (Guillaume lecture): replace with the mean value, or calculate the
contribution of that specific coalition by averaging the prediction over the empirical
distribution of that feature in our sample (/ all sample points?), i.e. integrate the
predictions over the empirical distribution of the feature values in the sample – this
integration is also mentioned in the Molnar book (in the technical part of the Shapley
values chapter)
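A hedged sketch of the sampling approximation described above: draw random feature orders and random background instances, and average the marginal contributions of feature j (model, X and x are assumptions):
```python
import numpy as np

def shapley_mc(model, X, x, j, n_iter=500, seed=0):
    """Monte Carlo estimate of the Shapley value of feature j for instance x:
    average prediction difference with vs. without feature j, over random
    coalitions (random feature orders) and random background instances."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    phi = 0.0
    for _ in range(n_iter):
        z = X[rng.integers(n)]                 # random background instance fills "absent" features
        order = rng.permutation(p)             # random feature order defines the coalition
        pos = int(np.where(order == j)[0][0])
        x_plus, x_minus = z.copy(), z.copy()
        x_plus[order[:pos + 1]] = x[order[:pos + 1]]   # coalition incl. feature j from x
        x_minus[order[:pos]] = x[order[:pos]]          # same coalition, but j stays from z
        phi += (model.predict(x_plus.reshape(1, -1))[0]
                - model.predict(x_minus.reshape(1, -1))[0])
    return phi / n_iter
```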
Interpretation
The value of the j-th feature contributed ϕj to the prediction of this particular instance
compared to the average prediction for the dataset.
Story-telling explanation
An intuitive way to understand the Shapley value is the following illustration: The feature
values enter a room in random order. All feature values in the room participate in the game
(= contribute to the prediction). The Shapley value of a feature value is the average change
in the prediction that the coalition already in the room receives when the feature value joins
them.
Conclusion
Same as permutation importance – the only differences:
o It does this for all combinations of features, instead of just the single coalition where
all the rest of the features are involved
o It assesses the change in prediction from the mean prediction (i.e. it gives us the option
to assess the direction of the contribution to the prediction), instead of the change in error
o In both, missing features can be replaced with random values, and many random
values can be drawn for a specific instance
Shap values
Idea
- For large N (number of features), Strumbelj et al. [2014], Lundberg and Lee [2016],
and Lundberg et al. [2019] have developed fast algorithms for the estimation of Shapley
values. (Marcos López de Prado presentation)
- The explanation is expressed as a linear combination of Shapley values, with weight 1 if a feature
is present in the explanation and 0 if absent:
$g(z') = \phi_0 + \sum_{j=1}^{M} \phi_j z'_j$
where $g$ is the explanation model, $z' \in \{0,1\}^M$ is the coalition vector, $M$ is
the maximum coalition size and $\phi_j \in \mathbb{R}$ is the feature attribution for feature
$j$, i.e. the Shapley values.
Kernel Shap
Idea
- Very similar to Lime
o The big difference to LIME is the weighting of the instances in the regression model
(weighted least squares): LIME weights the instances according to how close they are to
the original instance; Kernel SHAP weights them according to the coalition's importance
(coalitions with very few or very many features get higher weight)
Calculation
- For each instance x, do a (weighted) linear regression on the outputs of the model
prediction, with the inputs being multiple randomly sampled coalitions of the
features (when features are absent they are replaced with a random value from the
dataset – for both the model prediction and the fitted regression)
- Coefficients of linear regression are (estimated) shapley values
o Difference from Shapley values is that we don't need to do all combinations
of coalitions for each feature separately; we just take many random
coalitions, evaluate their predictions, and regress against them?
o For each instance x, a different linear regression is fitted to calculate the SHAP
values (-> it doesn't matter how many instances x we have for the
estimation of Kernel SHAP -> what matters is how many coalitions we take
for each instance / how many random values we draw for each
coalition?)
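A hedged sketch with the shap package's KernelExplainer, where a small background sample stands in for "absent" features; the model and data names are assumptions:
```python
import shap

background = shap.sample(X_train, 100)                 # summarize background data for "absent" features
explainer = shap.KernelExplainer(model.predict, background)
shap_values = explainer.shap_values(X_test[:10])       # one row of phi_j values per explained instance
shap.summary_plot(shap_values, X_test[:10])
```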
Tree Shap
- TreeSHAP was introduced as a fast, model-specific alternative to KernelSHAP
o Warning: it can produce unintuitive feature attributions (non-zero for features with
zero contribution)
o Note: both Marcos López de Prado and the ML interpretability book give
examples with the TreeSHAP method for plots (should be okay)
Implementation
- Accelerated SHAP values (among other ML stuff) with the RAPIDS package in Python
- K-means clustering or networks could be used to post-process SHAP values (see the sketch below)
o to understand the groups of explanations we have globally
o mentioned in the Thalesians meetup (https://www.meetup.com/thalesians/events/275100336/)
o used the Diabetes regression.ipynb of the shap package for Kernel SHAP
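A hedged sketch: TreeSHAP values for a fitted tree ensemble, then k-means on the explanation vectors to find groups of similarly explained instances (tree_model and X are assumptions):
```python
import shap
from sklearn.cluster import KMeans

explainer = shap.TreeExplainer(tree_model)             # e.g. a fitted XGBoost / RandomForest model
shap_values = explainer.shap_values(X)                 # (n_samples, n_features) for regression

# post-process: cluster the explanation vectors to see the global groups of explanations
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(shap_values)
shap.summary_plot(shap_values, X)                      # global view of the feature attributions
```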
Example based explanations
Counterfactual Explanations
- We want to generate, for an instance, the (minimum) changes in its values that would cause a
change in the prediction (more meaningful for classification, or crossing some value for
regression). A counterfactual instance:
o produces the predefined prediction as closely as possible
o should be as similar as possible to the original instance regarding feature values
o should have feature values that are likely (not sure how this is achieved)
Minimize the following loss (e.g. with Nelder-Mead, or Adam if the model has derivatives); the Wachter et al. objective is $L(x, x', y', \lambda) = \lambda \cdot (\hat{f}(x') - y')^2 + d(x, x')$, trading off hitting the target prediction $y'$ against staying close to the original instance $x$
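A hedged sketch of minimizing this loss with Nelder-Mead via scipy; the model, instance x and target prediction are assumptions, and a plain L1 distance stands in for d(x, x'):
```python
import numpy as np
from scipy.optimize import minimize

def counterfactual(model, x, target, lam=1.0):
    """Search for x' close to x whose prediction approaches `target`."""
    def loss(x_cf):
        pred_term = lam * (model.predict(x_cf.reshape(1, -1))[0] - target) ** 2
        dist_term = np.abs(x_cf - x).sum()    # simple L1 stand-in for d(x, x')
        return pred_term + dist_term
    return minimize(loss, x0=x.astype(float).copy(), method="Nelder-Mead").x
```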
Examples:
- which variables would need to change for a prediction to be around the mean prediction for a
first-year law school grade
Adversarial Examples
- An adversarial example is an instance with small, intentional feature perturbations
that cause a machine learning model to make a false prediction.
- Adversarial examples are counterfactual examples with the aim to deceive the
model, not interpret it
- Mostly used with images to see what perturbation in an image’s pixels would
cause a change in prediction