S-LIME: Stabilized-LIME For Model Explanation
Zhengze Zhou, Giles Hooker, Fei Wang
ABSTRACT

An increasing number of machine learning models have been deployed in domains with high stakes such as finance and healthcare. Despite their superior performances, many models are black boxes in nature which are hard to explain. There are growing efforts for researchers to develop methods to interpret these black-box models. Post hoc explanations based on perturbations, such as LIME [39], are widely used approaches to interpret a machine learning model after it has been built. This class of methods has been shown to exhibit large instability, posing serious challenges to the effectiveness of the method itself and harming user trust. In this paper, we propose S-LIME, which utilizes a hypothesis testing framework based on the central limit theorem for determining the number of perturbation points needed to guarantee stability of the resulting explanation. Experiments on both simulated and real world data sets are provided to demonstrate the effectiveness of our method.

CCS CONCEPTS

• Computing methodologies → Feature selection; Supervised learning by classification; • Mathematics of computing → Hypothesis testing and confidence interval computation.

KEYWORDS

interpretability; stability; LIME; hypothesis testing

ACM Reference Format:
Zhengze Zhou, Giles Hooker, and Fei Wang. 2021. S-LIME: Stabilized-LIME for Model Explanation. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '21), August 14–18, 2021, Virtual Event, Singapore. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3447548.3467274

1 INTRODUCTION

Data mining and machine learning models have been widely deployed for decision making in many fields, including criminal justice [54] and healthcare [35, 37]. However, many models act as "black boxes" in that they only provide predictions with little guidance for humans to understand the process. It has become a desideratum to develop approaches for understanding these complex models, which can help increase user trust [39], assess fairness and privacy [4, 11], debug models [28] and even serve regulation purposes [19].

Model explanation methods can be roughly divided into two categories [12, 52]: intrinsic explanations and post hoc explanations. Models with intrinsically explainable structures include linear models, decision trees [6] and generalized additive models [20], to name a few. Due to complexity constraints, these models are usually not powerful enough for modern tasks involving heterogeneous features and enormous numbers of samples.

Post hoc explanations, on the other hand, provide insights after a model is trained. These explanations can be either model-specific, which are typically limited to specific model classes, such as split improvement for tree-based methods [57] and saliency maps for convolutional networks [42]; or model-agnostic, which do not require any knowledge of the internal structure of the model being examined and where the analysis is often conducted by evaluating model predictions on a set of perturbed input data. LIME [39] and SHAP [31] are two of the most popular model-agnostic explanation methods.

Researchers have become aware of several drawbacks of post hoc model explanation. [25] showed that widely used permutation importance can produce diagnostics that are highly misleading due to extrapolation. [17] demonstrated how to generate adversarial perturbations that produce perceptively indistinguishable inputs with the same predicted label, yet have very different interpretations. [1] showed that explanation algorithms can be exploited to systematically rationalize decisions taken by an unfair black-box model. [40] argued against using post hoc explanations, as these methods can provide explanations that are not faithful to what the original model computes.

In this paper, we focus on post hoc explanations based on perturbations [39], one of the most popular paradigms for designing model explanation methods. We argue that the most important property of any explanation technique is stability or reproducibility: repeated runs of the explanation algorithm under the same conditions should ideally yield the same results. Unstable explanations provide little insight to users as to how the model actually works and are considered unreliable. Unfortunately, LIME is not always stable. [55] separated and investigated sources of instability in LIME. [51] highlighted a trade-off between an explanation's stability and adherence and proposed a framework to maximize stability. [30] improved the sensitivity of LIME by averaging multiple output weights for individual images.

We propose a hypothesis testing framework based on a central limit theorem for determining the number of perturbation samples required to guarantee stability of the resulting explanation. Briefly, LIME works by generating perturbations of a given instance and learning a sparse linear explanation, where the sparsity is usually achieved by selecting top features via LASSO [49].
LASSO is known to exhibit early occurrence of false discoveries [33, 47] which, combined with the randomness introduced in the sampling procedure, results in practically significant levels of instability. We carefully analyze Least Angle Regression (LARS) [13] for generating the LASSO path and quantify the asymptotics of the statistics involved in selecting the next variable. Based on a hypothesis testing procedure, we design a new algorithm called S-LIME (Stabilized-LIME) which can automatically and adaptively determine the number of perturbations needed to guarantee a stable explanation.

In the following, we review relevant background on LIME and LASSO along with their instability in Section 2. Section 3 statistically analyzes the asymptotic distribution of the statistic that is at the heart of variable selection in LASSO. Our algorithm S-LIME is introduced in Section 4. Section 5 presents empirical studies on both simulated and real world data sets. We conclude in Section 6 with some discussions.

2 BACKGROUND

In this section, we review the general framework for constructing post hoc explanations based on perturbations using Local Interpretable Model-agnostic Explanations (LIME) [39]. We then briefly discuss LARS and LASSO, which are the internal solvers LIME uses for feature selection, and illustrate LIME's instability with toy experiments.

2.1 LIME

Given a black box model 𝑓 and a target point 𝒙 of interest, we would like to understand the behavior of the model locally around 𝒙. No knowledge of 𝑓's internal structure is available, but we are able to query 𝑓 many times. LIME first samples around the neighborhood of 𝒙, queries the black box model 𝑓 to get its predictions, and forms a pseudo data set D = {(𝒙1, 𝑦1), (𝒙2, 𝑦2), …, (𝒙𝑛, 𝑦𝑛)} with 𝑦𝑖 = 𝑓(𝒙𝑖), where the hyperparameter 𝑛 specifies the number of perturbations. The model 𝑓 can be quite general: regression (𝑦𝑖 ∈ ℝ) or classification (𝑦𝑖 ∈ {0, 1}, or 𝑦𝑖 ∈ [0, 1] if 𝑓 returns a probability). A model 𝑔 from some interpretable function space 𝐺 is chosen by solving the following optimization:

\[ \arg\min_{g \in G} \; L(f, g, \pi_{\boldsymbol{x}}) + \Omega(g) \tag{1} \]

where

• 𝜋𝒙(𝒛) is a proximity measure between a perturbed instance 𝒛 and 𝒙, usually chosen to be a Gaussian kernel.
• Ω(𝑔) measures the complexity of the explanation 𝑔 ∈ 𝐺. For example, for decision trees Ω(𝑔) can be the depth of the tree, while for linear models we can use the number of non-zero weights.
• 𝐿(𝑓, 𝑔, 𝜋𝒙) is a measure of how unfaithful 𝑔 is in approximating 𝑓 in the locality defined by 𝜋𝒙.

[39] suggests a procedure called k-LASSO for selecting the top 𝑘 features using LASSO. In this case, 𝐺 is the class of linear models with 𝑔 = 𝝎𝑔 · 𝒙, 𝐿(𝑓, 𝑔, 𝜋𝒙) = Σᵢ₌₁ⁿ 𝜋𝒙(𝒙𝑖)(𝑦𝑖 − 𝑔(𝒙𝑖))² and Ω = ∞ · 1[||𝝎𝑔||₀ > 𝑘]. Under this setting, (1) can be approximately solved by first selecting 𝑘 features with LASSO (using the regularization path) and then learning the weights via least squares [39].
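The k-LASSO procedure can be summarized in a few lines of Python. The sketch below is our own illustration, not the released LIME implementation (which differs in its sampling scheme and standardization details); `black_box`, `n_perturb` and `kernel_width` are names we introduce, and the Gaussian perturbation and kernel choices are assumptions consistent with the description above.

```python
import numpy as np
from sklearn.linear_model import lars_path, LinearRegression

def k_lasso_explain(black_box, x, k=5, n_perturb=1000, kernel_width=1.0, seed=0):
    """Minimal sketch of LIME's k-LASSO step: perturb around x, weight
    samples by proximity, take the first k features entering the LASSO
    path, then refit those k weights by weighted least squares."""
    rng = np.random.default_rng(seed)
    Z = x + rng.normal(size=(n_perturb, x.shape[0]))   # perturbations of x
    y = black_box(Z)                                   # query the black box
    w = np.exp(-((Z - x) ** 2).sum(axis=1) / kernel_width ** 2)  # Gaussian kernel
    # Fold the weights in via the sqrt transform described in Section 2.2.
    Zw, yw = Z * np.sqrt(w)[:, None], y * np.sqrt(w)
    _, _, coefs = lars_path(Zw, yw, method="lasso")
    active = []
    for step in range(coefs.shape[1]):                 # walk along the path
        for j in np.flatnonzero(coefs[:, step]):
            if j not in active:
                active.append(j)
        if len(active) >= k:
            break
    top_k = active[:k]
    ols = LinearRegression().fit(Z[:, top_k], y, sample_weight=w)
    return top_k, ols.coef_
```

The instability studied in this paper enters through the random draw of `Z`: two calls with different seeds can return different `top_k` sets.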
We point out here the resemblance between post hoc explanations and knowledge distillation [7, 22]; both involve obtaining predictions from the original model, usually on synthetic examples, and using these to train a new model. The differences lie in both the scope and the intention of the procedure. Whereas LIME produces interpretable models that apply closely to the point of interest, model distillation is generally used to provide a global compression of the model representation in order to improve both computational and predictive performance [18, 34]. Nonetheless, we might expect distillation methods to also exhibit the instability described here; see [56], which documents instability of decision trees used to provide global interpretation.

2.2 LASSO and LARS

Even models that are "interpretable by design" can be difficult to understand, such as a deep decision tree containing hundreds of leaves, or a linear model that employs many features with non-zero weights. For this reason LASSO [49], which automatically produces sparse models, is often the default solver for LIME.

Formally, suppose D = {(𝒙1, 𝑦1), (𝒙2, 𝑦2), …, (𝒙𝑛, 𝑦𝑛)} with 𝒙𝑖 = (𝑥𝑖1, 𝑥𝑖2, …, 𝑥𝑖𝑝) for 1 ≤ 𝑖 ≤ 𝑛. LASSO solves the following optimization problem:

\[ \hat{\beta}^{\mathrm{LASSO}} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \tag{2} \]

where 𝜆 is the multiplier for the 𝑙1 penalty. (2) can be efficiently solved via a slight modification of the LARS algorithm [13], which gives the entire LASSO path as 𝜆 varies. This procedure is described in Algorithms 1 and 2 below [14], where we denote 𝒚 = (𝑦1, 𝑦2, …, 𝑦𝑛) and assume 𝑛 > 𝑝.

Algorithm 1: Least Angle Regression (LARS)
(1) Standardize the predictors to have zero mean and unit norm. Start with the residual 𝒓 = 𝒚 − 𝒚̄ and 𝛽1 = 𝛽2 = ⋯ = 𝛽𝑝 = 0.
(2) Find the predictor 𝒙·𝑗 most correlated with 𝒓, and move 𝛽𝑗 from 0 towards its least-squares coefficient ⟨𝒙·𝑗, 𝒓⟩, until some other competitor 𝒙·𝑘 has as much correlation with the current residual as does 𝒙·𝑗.
(3) Move 𝛽𝑗 and 𝛽𝑘 in the direction defined by their joint least-squares coefficient of the current residual on (𝒙·𝑗, 𝒙·𝑘), until some other competitor 𝒙·𝑙 has as much correlation with the current residual.
(4) Repeat steps 2 and 3 until all 𝑝 predictors have been entered, at which point we arrive at the full least-squares solution.

Algorithm 2: LASSO: Modification of LARS
3a. In step 3 of Algorithm 1, if a non-zero coefficient hits zero, drop the corresponding variable from the active set of variables and recompute the current joint least-squares direction.

Both Algorithms 1 and 2 can be easily modified to incorporate a weight vector 𝝎 = (𝜔1, 𝜔2, …, 𝜔𝑛) on the data set D, by transforming it to D = {(√𝜔1 𝒙1, √𝜔1 𝑦1), (√𝜔2 𝒙2, √𝜔2 𝑦2), …, (√𝜔𝑛 𝒙𝑛, √𝜔𝑛 𝑦𝑛)}.
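In practice the path computed by Algorithms 1 and 2 is available in scikit-learn [36] as lars_path. A minimal sketch on synthetic data (the data and seed here are our own, purely illustrative):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, -1.5, 1.0, 0.0, 0.0]) + 0.1 * rng.normal(size=200)

# method="lasso" applies step 3a (drop a variable whose coefficient
# hits zero); method="lar" runs plain Algorithm 1.
alphas, active, coefs = lars_path(X, y, method="lasso")
print(active)       # order in which variables entered the active set
print(coefs.shape)  # (n_features, n_steps): coefficients along the path

# A weight vector can be folded in via the sqrt transform above:
w = rng.uniform(0.5, 1.0, size=200)
alphas_w, active_w, coefs_w = lars_path(
    X * np.sqrt(w)[:, None], y * np.sqrt(w), method="lasso")
```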
5 EXPERIMENTS

… for building black box models. Code for replicating our experiments is available at [Link].

5.1 Breast Cancer Data

We use the widely adopted Breast Cancer Wisconsin (Diagnostic) Data Set [32], which contains 569 samples and 30 features. A random forest with 500 trees is trained on 80% of the data as the black box model to predict whether an instance is benign or malignant. It achieves around 95% accuracy on the remaining 20% test data. Since our focus is on producing stabilized explanations for a specific instance, we do not spend additional effort on hyperparameter tuning to further improve model performance.
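This setup is easy to reproduce, since the same data set ships with scikit-learn [36]. A minimal sketch (our code; the seed is arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 80/20 split and 500 trees follow the setup described above.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
rf = RandomForestClassifier(n_estimators=500).fit(X_tr, y_tr)
print(rf.score(X_te, y_te))  # around 0.95 accuracy is typical
```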
Figure 1 in Section 2.3 has already demonstrated the inconsistency of the selected feature returned by the original LIME. In Figure 2 below, we show a graphical illustration of four LIME replications on a randomly selected test instance, where the left column of each subfigure shows the selected features along with the learned linear parameters, and the right column shows the corresponding feature values for the sample. These repetitions of LIME applied to the same instance have different orderings for the top two features, and also disagree on the fourth and fifth features.

Even where replications appear to agree, it does not imply LIME explanations are stable. To better quantify stability, we look at the Jaccard index for the top 𝑘 features for 𝑘 = 1, …, 5. Table 1 shows the average Jaccard index across all pairs for 20 repetitions of both LIME and S-LIME on the selected test instance. We set 𝑛𝑚𝑎𝑥 = 10000 for S-LIME.

Table 1: Average Jaccard index for 20 repetitions of LIME and S-LIME. The black box model is a random forest with 500 trees.

Position   LIME   S-LIME
   1       0.61   1.0
   2       1.0    1.0
   3       1.0    1.0
   4       0.66   1.0
   5       0.59   0.85
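Here the Jaccard index between two top-𝑘 feature sets 𝐴 and 𝐵 is |𝐴 ∩ 𝐵| / |𝐴 ∪ 𝐵|, averaged over all pairs of repetitions. A small helper for illustration (our code, with made-up feature sets):

```python
from itertools import combinations

def avg_jaccard(feature_sets):
    """Average Jaccard index over all pairs of top-k feature sets,
    one set per repetition of the explainer."""
    pairs = list(combinations(feature_sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# e.g. top-2 feature sets from three hypothetical runs
runs = [{"worst perimeter", "worst area"},
        {"worst perimeter", "worst area"},
        {"worst area", "mean concavity"}]
print(avg_jaccard(runs))  # (1 + 1/3 + 1/3) / 3, about 0.56
```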
As we can see, for the top four positions the average Jaccard index of S-LIME is 1, meaning the algorithm is stable across different iterations. There is some variability in the fifth feature selected, as the two features mean radius and worst concave points have very similar impact locally. Further increasing 𝑛𝑚𝑎𝑥 will make the selection of the fifth variable more consistent. Figure 3 shows the only two explanations we observed in simulations for S-LIME, where the difference is at the fifth variable.
5.2 Simulated Data

Table 2 presents the average Jaccard index for the feature sets selected by LIME and S-LIME, where LIME is generated with 1000 synthetic samples and we set 𝑛0 = 1000 and 𝑛𝑚𝑎𝑥 = 10000 for S-LIME. The close local weights between 𝑥1, 𝑥2 and between 𝑥4, 𝑥5 cause some instability in LIME, as can be seen from the drop in the index at positions 2 and 4. S-LIME outputs consistent explanations in this case.

Table 2: Average Jaccard index for 20 repetitions of LIME and S-LIME on the test point (0.51, 0.49, 0.5, 0.5, 0.5). The black box model is MARS.

Position   LIME   S-LIME
   1       1.0    1.0
   2       0.82   1.0
   3       1.0    1.0
   4       0.79   1.0
   5       1.0    1.0
5.3 Early Prediction of Sepsis From Electronic Health Records

Sepsis is a major public health concern and a leading cause of death in the United States [3]. Early detection and treatment of a sepsis incidence is a crucial factor for patient outcomes [38]. Electronic health records (EHR) store data associated with each individual's health journey and have seen increasing use recently in clinical informatics and epidemiology [46, 50]. There have been several works predicting sepsis based on EHR [16, 21, 29]. Interpretability of these models is essential for them to be deployed in clinical settings.

We collect data from MIMIC-III [26], which is a freely accessible critical care database. After pre-processing, there are 15309 patients in the cohort for analysis, out of which 1221 developed sepsis based on the Sepsis-3 clinical criteria for sepsis onset [43]. For each patient, the record consists of a combination of hourly vital sign summaries, laboratory values, and static patient descriptions. We provide the list of all variables involved in Appendix C. ICULOS is a timestamp which denotes the hours since ICU admission for each patient, and thus is not used directly for training the model.

For each patient's records, missing values are filled with the most recent value if available, otherwise with a global average. Negative samples are downsampled to achieve a class ratio of 1:1. We randomly select 90% of the data for training and leave the remaining 10% for testing. A simple recurrent neural network based on an LSTM [23] module is built with Keras [9] for demonstration. Each sample fed into the network has 25 features with 24 timestamps; it goes through an LSTM with 32 internal units and a dropout rate of 0.2, and finally a dense layer with softmax activation to output a probability. The network is optimized by Adam [27] with an initial learning rate of 0.0001, and we train it for 500 epochs with a batch size of 50. The model achieves around a 0.75 AUC score on the test set. Note that we do not fine-tune the architecture of the network through cross validation. The purpose of this study is not to achieve superior performance, as that usually requires more advanced modeling techniques for temporal data [16, 29] or exploiting missing value patterns [8]. Instead, we would like to demonstrate the effectiveness of our proposed method in reliably explaining a relatively large scale machine learning model applied to medical data.
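A minimal Keras sketch consistent with the description above; details such as the loss function and label encoding are our assumptions, and may differ from the released code:

```python
from tensorflow import keras
from tensorflow.keras import layers

# 24 hourly timestamps, 25 variables per timestamp (see Appendix C).
model = keras.Sequential([
    layers.Input(shape=(24, 25)),
    layers.LSTM(32, dropout=0.2),
    layers.Dense(2, activation="softmax"),   # P(no sepsis), P(sepsis)
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",   # assumes one-hot labels
              metrics=[keras.metrics.AUC()])
# model.fit(X_train, y_train, epochs=500, batch_size=50)
```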
To deal with temporal data, where each sample in the training set is of shape (𝑛_timesteps, 𝑛_features), LIME reshapes the data such that it becomes a long vector of size 𝑛_timesteps × 𝑛_features. Essentially it transforms the temporal data to the regular tabular shape while multiplying the number of features by the number of available timestamps. Table 3 presents the average Jaccard index for the feature sets selected by LIME and S-LIME on two randomly selected test samples, where LIME is generated with 1000 synthetic samples and we set 𝑛0 = 1000 and 𝑛𝑚𝑎𝑥 = 100000 for S-LIME. LIME exhibits undesirable instability in this example, potentially due to the complex black box model and the large number of features (24 × 25 = 600). S-LIME achieves much better stability compared to LIME, although we can still observe some uncertainty in choosing the fifth feature for the second test sample.
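Concretely, the flattening and the matching prediction wrapper look like the following (our illustration; `model` refers to the network sketched above):

```python
import numpy as np

X = np.random.rand(500, 24, 25)      # (samples, n_timesteps, n_features)
X_flat = X.reshape(len(X), -1)       # (500, 600): one long tabular vector

# LIME perturbs rows of X_flat; the wrapper restores the temporal
# shape before querying the recurrent network.
def predict_fn(Z):
    return model.predict(Z.reshape(-1, 24, 25))
```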
Table 3: Average Jaccard index for 20 repetitions of LIME and S-LIME on two randomly selected test samples. The black box model is a recurrent neural network.

(a) test sample 1               (b) test sample 2
Position   LIME   S-LIME        Position   LIME   S-LIME
   1       0.37   1.0              1       0.31   1.0
   2       0.29   1.0              2       0.24   1.0
   3       0.33   1.0              3       0.19   1.0
   4       0.25   0.89             4       0.17   0.96
   5       0.26   1.0              5       0.18   0.78

Figure 4 below shows the output of S-LIME on two different test samples. We can see that for sample 1, the most recent temperatures play an important role, along with the latest pH and potassium values, while for sample 2, the latest pH values are the most important.

(Figure 4: Output of S-LIME for two randomly selected test samples. The black box model is a recurrent neural network.)

We want to emphasize that extra caution must be taken by practitioners in applying LIME, especially for complex problems. A local linear model with a few features might not be suitable to approximate a recurrent neural network built on temporal data. How to apply perturbation based explanation algorithms to temporal data is still an open problem, and we leave it for future work. That being said, the experiment in this section demonstrates the effectiveness of S-LIME in producing stabilized explanations.

6 DISCUSSIONS

An important property of model explanation methods is stability: repeated runs of the algorithm on the same object should output consistent results. In this paper, we show that post hoc explanations based on perturbations, such as LIME, are not stable due to the randomness introduced in generating synthetic samples. Our proposed algorithm S-LIME is based on a hypothesis testing framework and can automatically and adaptively determine the appropriate number of perturbations required to guarantee stability.

The idea behind S-LIME is similar to [56], which tackles the problem of building stable approximation trees in model distillation.
In the area of online learning, [10] uses Hoeffding bounds [24] to guarantee the correct choice of splits in a decision tree by comparing the two best attributes. We should mention that S-LIME is not restricted to LASSO as its feature selection mechanism. In fact, to produce a ranking of explanatory variables, one can use any sequential procedure which builds a model by sequentially adding or removing variables based upon some criterion, such as forward-stepwise or backward-stepwise selection [14]. All of these methods can be stabilized by a similar hypothesis testing framework, as sketched below.
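The following schematic shows the shape of such a stabilization loop. It is our paraphrase of the idea, not the paper's implementation: `draw_samples` and `compare_top_two` are hypothetical placeholders for the perturbation sampler and for a pairwise test between the two leading candidates (e.g. the LARS-based statistic analyzed in Section 3, or a Hoeffding-style bound as in [10]).

```python
def stabilized_select(draw_samples, compare_top_two,
                      n0=1000, n_max=100000, alpha=0.05):
    """Grow the number of synthetic samples until the test between the
    best and second-best candidate feature is significant, or until the
    budget n_max is exhausted."""
    n = n0
    while True:
        data = draw_samples(n)                   # n perturbations + labels
        best, runner_up, p_value = compare_top_two(data)
        if p_value < alpha or n >= n_max:
            return best, n
        n *= 2                                   # inconclusive: retest with more
```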
There are several works closely related to ours. [55] identifies three sources of uncertainty in LIME: sampling variance, sensitivity to the choice of parameters, and variability in the black box model. We aim to control the first source of variability, as the other two depend on specific design choices of the practitioner. [51] highlights a trade-off between an explanation's stability and adherence. Their approach is to select a suitable kernel width for the proximity measure, but it does not improve stability for any given kernel width. In [53], the authors design a deterministic version of LIME by only looking at existing training data through hierarchical clustering, without resorting to synthetic samples. However, the number of samples in a dataset will affect the quality of the clusters and a lack of nearby points poses additional challenges; this strategy also relies on having access to the training data. Most recently, [45] develops a set of tools for analyzing explanation uncertainty in a Bayesian framework for LIME. Our method can be viewed as a frequentist counterpart without the need to choose priors and evaluate a posterior distribution.

Another line of work concerns adversarial attacks on LIME. [44] proposes a scaffolding technique to hide the biases of any given classifier by building adversarial classifiers to detect perturbed instances. Later, [41] utilizes a generative adversarial network to sample more realistic synthetic data for making LIME more robust.

ACKNOWLEDGMENTS

Giles Hooker is supported by NSF DMS-1712554. Fei Wang is supported by NSF 1750326, 2027970, ONR N00014-18-1-2585, an Amazon Web Services (AWS) Machine Learning for Research Award and a Google Faculty Research Award.

REFERENCES

[1] Ulrich Aïvodji, Hiromi Arai, Olivier Fortineau, Sébastien Gambs, Satoshi Hara, and Alain Tapp. 2019. Fairwashing: the risk of rationalization. arXiv preprint arXiv:1901.09749.
[2] Zeyuan Allen-Zhu and Yuanzhi Li. 2020. Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning. arXiv preprint arXiv:2012.09816.
[3] Derek C Angus, Walter T Linde-Zwirble, Jeffrey Lidicker, Gilles Clermont, Joseph Carcillo, and Michael R Pinsky. 2001. Epidemiology of severe sepsis in the United States: analysis of incidence, outcome, and associated costs of care. Critical Care Medicine 29, 7 (2001), 1303–1310.
[4] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. Machine bias. ProPublica. See [Link] propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing (2016).
[5] Leo Breiman. 2001. Random forests. Machine Learning 45, 1 (2001), 5–32.
[6] Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. 1984. Classification and Regression Trees. CRC Press.
[7] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 535–541.
[8] Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. 2018. Recurrent neural networks for multivariate time series with missing values. Scientific Reports 8, 1 (2018), 1–12.
[9] François Chollet et al. 2015. Keras. [Link]
[10] Pedro Domingos and Geoff Hulten. 2000. Mining high-speed data streams. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 71–80.
[11] Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
[12] Mengnan Du, Ninghao Liu, and Xia Hu. 2019. Techniques for interpretable machine learning. Commun. ACM 63, 1 (2019), 68–77.
[13] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. 2004. Least angle regression. The Annals of Statistics 32, 2 (2004), 407–499.
[14] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2001. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics, New York.
[15] Jerome H Friedman. 1991. Multivariate adaptive regression splines. The Annals of Statistics (1991), 1–67.
[16] Joseph Futoma, Sanjay Hariharan, Katherine Heller, Mark Sendak, Nathan Brajer, Meredith Clement, Armando Bedoya, and Cara O'Brien. 2017. An improved multi-output Gaussian process RNN with real-time validation for early sepsis detection. In Machine Learning for Healthcare Conference. PMLR, 243–254.
[17] Amirata Ghorbani, Abubakar Abid, and James Zou. 2019. Interpretation of neural networks is fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 3681–3688.
[18] Robert D Gibbons, Giles Hooker, Matthew D Finkelman, David J Weiss, Paul A Pilkonis, Ellen Frank, Tara Moore, and David J Kupfer. 2013. The computerized adaptive diagnostic test for major depressive disorder (CAD-MDD): a screening tool for depression. The Journal of Clinical Psychiatry 74, 7 (2013), 1–478.
[19] Bryce Goodman and Seth Flaxman. 2017. European Union regulations on algorithmic decision-making and a "right to explanation". AI Magazine 38, 3 (2017), 50–57.
[20] Trevor J Hastie and Robert J Tibshirani. 1990. Generalized Additive Models. Vol. 43. CRC Press.
[21] Katharine E Henry, David N Hager, Peter J Pronovost, and Suchi Saria. 2015. A targeted real-time early warning score (TREWScore) for septic shock. Science Translational Medicine 7, 299 (2015), 299ra122.
[22] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
[23] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[24] Wassily Hoeffding. 1994. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding. Springer, 409–426.
[25] Giles Hooker and Lucas Mentch. 2019. Please stop permuting features: An explanation and alternatives. arXiv preprint arXiv:1905.03151.
[26] Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-Wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data 3, 1 (2016), 1–9.
[27] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[28] Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. arXiv preprint arXiv:1703.04730.
[29] Simon Meyer Lauritsen, Mads Ellersgaard Kalør, Emil Lund Kongsgaard, Katrine Meyer Lauritsen, Marianne Johansson Jørgensen, Jeppe Lange, and Bo Thiesson. 2020. Early detection of sepsis utilizing deep learning on electronic health record event sequences. Artificial Intelligence in Medicine 104 (2020), 101820.
[30] Eunjin Lee, David Braines, Mitchell Stiffler, Adam Hudler, and Daniel Harborne. 2019. Developing the sensitivity of LIME for better machine learning explanation. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, Vol. 11006. International Society for Optics and Photonics, 1100610.
[31] Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems. 4765–4774.
[32] Olvi L Mangasarian, W Nick Street, and William H Wolberg. 1995. Breast cancer diagnosis and prognosis via linear programming. Operations Research 43, 4 (1995), 570–577.
[33] Nicolai Meinshausen and Peter Bühlmann. 2010. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72, 4 (2010), 417–473.
[34] Aditya Krishna Menon, Ankit Singh Rawat, Sashank J Reddi, Seungyeon Kim, and Sanjiv Kumar. 2020. Why distillation helps: a statistical perspective. arXiv preprint arXiv:2005.10419.
[35] Riccardo Miotto, Fei Wang, Shuang Wang, Xiaoqian Jiang, and Joel T Dudley. 2018. Deep learning for healthcare: review, opportunities and challenges. Briefings in Bioinformatics 19, 6 (2018), 1236–1246.
[36] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[37] Alvin Rajkomar, Eyal Oren, Kai Chen, Andrew M Dai, Nissan Hajaj, Michaela Hardt, Peter J Liu, Xiaobing Liu, Jake Marcus, Mimi Sun, et al. 2018. Scalable and accurate deep learning with electronic health records. NPJ Digital Medicine 1, 1 (2018), 18.
[38] Matthew A Reyna, Chris Josef, Salman Seyedi, Russell Jeter, Supreeth P Shashikumar, M Brandon Westover, Ashish Sharma, Shamim Nemati, and Gari D Clifford. 2019. Early prediction of sepsis from clinical data: the PhysioNet/Computing in Cardiology Challenge 2019. In 2019 Computing in Cardiology (CinC). IEEE.
[39] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1135–1144.
[40] Cynthia Rudin. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1, 5 (2019), 206–215.
[41] Sean Saito, Eugene Chua, Nicholas Capel, and Rocco Hu. 2020. Improving LIME Robustness with Smarter Locality Sampling. arXiv preprint arXiv:2006.12302.
[42] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
[43] Mervyn Singer, Clifford S Deutschman, Christopher Warren Seymour, Manu Shankar-Hari, Djillali Annane, Michael Bauer, Rinaldo Bellomo, Gordon R Bernard, Jean-Daniel Chiche, Craig M Coopersmith, et al. 2016. The third international consensus definitions for sepsis and septic shock (Sepsis-3). JAMA 315, 8 (2016), 801–810.
[44] Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. 2020. Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. 180–186.
[45] Dylan Slack, Sophie Hilgard, Sameer Singh, and Himabindu Lakkaraju. 2020. How Much Should I Trust You? Modeling Uncertainty of Black Box Explanations. arXiv preprint arXiv:2008.05030.
[46] Jose Roberto Ayala Solares, Francesca Elisa Diletta Raimondi, Yajie Zhu, Fatemeh Rahimian, Dexter Canoy, Jenny Tran, Ana Catarina Pinho Gomes, Amir H Payberah, Mariagrazia Zottoli, Milad Nazarzadeh, et al. 2020. Deep learning for electronic health records: A comparative review of multiple deep neural architectures. Journal of Biomedical Informatics 101 (2020), 103337.
[47] Weijie Su, Małgorzata Bogdan, and Emmanuel Candès. 2017. False discoveries occur early on the lasso path. The Annals of Statistics 45, 5 (2017), 2133–2150.
[48] Weijie J Su. 2018. When is the first spurious variable selected by sequential regression procedures? Biometrika 105, 3 (2018), 517–527.
[49] Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 1 (1996), 267–288.
[50] Akhil Vaid, Suraj K Jaladanki, Jie Xu, Shelly Teng, Arvind Kumar, Samuel Lee, Sulaiman Somani, Ishan Paranjpe, Jessica K De Freitas, Tingyi Wanyan, et al. 2020. Federated learning of electronic health records improves mortality prediction in patients hospitalized with COVID-19. medRxiv (2020).
[51] Giorgio Visani, Enrico Bagli, and Federico Chesani. 2020. OptiLIME: Optimized LIME Explanations for Diagnostic Computer Algorithms. arXiv preprint arXiv:2006.05714.
[52] Fei Wang, Rainu Kaushal, and Dhruv Khullar. 2020. Should Health Care Demand Interpretable Artificial Intelligence or Accept "Black Box" Medicine? Annals of Internal Medicine 172, 1 (2020), 59–60.
[53] Muhammad Rehman Zafar and Naimul Mefraz Khan. 2019. DLIME: A deterministic local interpretable model-agnostic explanations approach for computer-aided diagnosis systems. arXiv preprint arXiv:1906.10263.
[54] Jiaming Zeng, Berk Ustun, and Cynthia Rudin. 2015. Interpretable classification models for recidivism prediction. arXiv preprint arXiv:1503.07810.
[55] Yujia Zhang, Kuangyan Song, Yiming Sun, Sarah Tan, and Madeleine Udell. 2019. "Why Should You Trust My Explanation?" Understanding Uncertainty in LIME Explanations. arXiv preprint arXiv:1904.12991.
[56] Yichen Zhou, Zhengze Zhou, and Giles Hooker. 2018. Approximation trees: Statistical stability in model distillation. arXiv preprint arXiv:1808.07573.
[57] Zhengze Zhou and Giles Hooker. 2019. Unbiased measurement of feature importance in tree-based methods. arXiv preprint arXiv:1903.05179.
[39] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. " Why should I
trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd the response 𝑦. We choose 𝜌 1 = 1, 𝜌 2 = 0.75 and 𝜌 3 = 0.7, such
ACM SIGKDD international conference on knowledge discovery and data mining. that when one uses LARS to solve LASSO, 𝑥 1 always enter the
1135–1144. model first, while 𝑥 2 and 𝑥 3 have closer coefficients and will be
[40] Cynthia Rudin. 2019. Stop explaining black box machine learning models for
high stakes decisions and use interpretable models instead. Nature Machine more challenging to distinguish.
Intelligence 1, 5 (2019), 206–215. We focus on the ordering of the three covariates entering the
[41] Sean Saito, Eugene Chua, Nicholas Capel, and Rocco Hu. 2020. Improving LIME
Robustness with Smarter Locality Sampling. arXiv preprint arXiv:2006.12302
model. The “correct" ordering should be (𝑥 1, 𝑥 2, 𝑥 3 ). For multiple
(2020). runs of LASSO with 𝑛 = 1000, we observe roughly 20% of the
[42] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013. Deep inside results have order (𝑥 1, 𝑥 3, 𝑥 2 ) instead. Figure 5 below shows two
convolutional networks: Visualising image classification models and saliency
maps. arXiv preprint arXiv:1312.6034 (2013). representative LASSO paths.
[43] Mervyn Singer, Clifford S Deutschman, Christopher Warren Seymour, Manu
Shankar-Hari, Djillali Annane, Michael Bauer, Rinaldo Bellomo, Gordon R
Bernard, Jean-Daniel Chiche, Craig M Coopersmith, et al. 2016. The third inter-
national consensus definitions for sepsis and septic shock (Sepsis-3). Jama 315, 8
(2016), 801–810.
[44] Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju.
2020. Fooling lime and shap: Adversarial attacks on post hoc explanation methods.
In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. 180–186.
[45] Dylan Slack, Sophie Hilgard, Sameer Singh, and Himabindu Lakkaraju. 2020.
How Much Should I Trust You? Modeling Uncertainty of Black Box Explanations.
arXiv preprint arXiv:2008.05030 (2020).
[46] Jose Roberto Ayala Solares, Francesca Elisa Diletta Raimondi, Yajie Zhu, Fate-
meh Rahimian, Dexter Canoy, Jenny Tran, Ana Catarina Pinho Gomes, Amir H
Payberah, Mariagrazia Zottoli, Milad Nazarzadeh, et al. 2020. Deep learning
for electronic health records: A comparative review of multiple deep neural
(a) Variable ordering in LASSO (b) Variable ordering in LASSO
architectures. Journal of biomedical informatics 101 (2020), 103337.
[47] Weijie Su, Małgorzata Bogdan, Emmanuel Candes, et al. 2017. False discoveries path: (𝑥 1 , 𝑥 2 , 𝑥 3 ). path: (𝑥 1 , 𝑥 3 , 𝑥 2 ).
occur early on the lasso path. The Annals of statistics 45, 5 (2017), 2133–2150.
[48] Weijie J Su. 2018. When is the first spurious variable selected by sequential
regression procedures? Biometrika 105, 3 (2018), 517–527. Figure 5: Two cases of variable ordering in LASSO path.
This toy experiment demonstrates the instability of LASSO itself. Even in this ideal noise-free setting, where we have an independent design with a Gaussian distribution for the variables, 20% of the time LASSO exhibits different paths due to random sampling. Intuitively, the solutions at the beginning of the LASSO path are overwhelmingly biased and the residual vector contains many of the true effects. Thus some less relevant or irrelevant variable can exhibit high correlation with the residual and get selected early. 𝑛 = 1000 seems to be a reasonably large number of samples to achieve consistent results, but when applying the idea of S-LIME, the hypothesis testing is always inconclusive at the second step when it needs to choose between 𝑥2 and 𝑥3. Increasing 𝑛 in this case can indeed yield significant testing results and stabilize the LASSO paths.
B ADDITIONAL EXPERIMENTS

B.1 S-LIME on other model types

Besides the randomness introduced in generating synthetic perturbations, the output of model explanation algorithms also depends on several other factors, including the black box model itself. There may not be a universal truth to the explanation of a given instance, as it depends on how the underlying model captures the relationship between covariates and responses. Distinct model types, or even the same model structure trained with random initialization, can utilize different correlations between features and responses [2], and thus result in different model explanations.

We apply S-LIME to other model types to illustrate two points:

• Compared to LIME, S-LIME can generate stabilized explanations, though for some model types more synthetic perturbations are required.
• Different model types can have different explanations for the same instance. This does not imply that S-LIME is unstable or not reproducible, but practitioners need to be aware of this dependency on the underlying black box model when applying any model explanation method.

We use support-vector machines (SVM) and neural networks (NN) as the underlying black box models and apply LIME and S-LIME. The basic setup is similar to Section 5.1. For SVM training, we use default parameters with an RBF kernel. The NN is constructed with two hidden layers, with 12 and 8 hidden units respectively. ReLU activations are used between the hidden layers, while the last layer uses a sigmoid function to output a probability. The network is implemented in Keras [9]. Both models achieve over 90% accuracy on the test set.
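A sketch of the two black boxes, assuming the same X_train and y_train split as in Section 5.1 (scikit-learn's SVC uses an RBF kernel by default; the layer sizes follow the description above, while training hyperparameters such as epochs are our own placeholders):

```python
from sklearn.svm import SVC
from tensorflow import keras
from tensorflow.keras import layers

# SVM with default parameters; probability=True lets LIME query
# class probabilities rather than hard labels.
svm = SVC(probability=True).fit(X_train, y_train)

# Two hidden layers with 12 and 8 ReLU units, sigmoid output.
nn = keras.Sequential([
    layers.Input(shape=(30,)),            # 30 breast-cancer features
    layers.Dense(12, activation="relu"),
    layers.Dense(8, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
nn.compile(optimizer="adam", loss="binary_crossentropy",
           metrics=["accuracy"])
# nn.fit(X_train, y_train, epochs=..., batch_size=...)
```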
Table 4 lists the average Jaccard index across 20 repetitions for each setting on a randomly selected test instance. LIME is generated with 1000 synthetic samples, while for S-LIME we set 𝑛𝑚𝑎𝑥 = 100000 for the SVM and 𝑛𝑚𝑎𝑥 = 10000 for the NN. Compared with LIME, S-LIME achieves better stability at each position.

Table 4: Average Jaccard index for 20 repetitions of LIME and S-LIME. The black box models are SVM and NN.

           SVM              NN
Position   LIME   S-LIME    LIME   S-LIME
   1       1.0    1.0       0.73   1.0
   2       0.35   0.87      0.87   1.0
   3       0.23   0.83      0.71   0.74
   4       0.19   1.0       0.66   1.0
   5       0.18   0.67      0.55   1.0

Figure 6 shows the explanations generated by S-LIME with the SVM and the NN as the black box models. We can see that they differ in the features selected.

(Figure 6: S-LIME on Breast Cancer Data with SVM and NN as black box models.)

One important observation is that the underlying black box model also affects the stability of local explanations. For example, the original LIME is extremely unstable for the SVM, and S-LIME needs a larger 𝑛𝑚𝑎𝑥 to produce consistent results.

B.2 A large cohort of test samples

Most of the experiments in this paper target a single randomly selected test sample, which allows us to examine specific features easily. That being said, one can expect the instability of LIME and the improvement of S-LIME to be universal. In this part we conduct experiments on a large cohort of test samples for both the Breast Cancer (Section 5.1) and Sepsis (Section 5.3) data.

In each application, we randomly select 50 test samples. For each test instance, LIME and S-LIME are applied for 20 repetitions and we calculate the average Jaccard index across all pairs out of 20 as before. Finally, we report the overall average Jaccard index over the 50 test samples. The results are shown in Table 5. LIME explanations are generated with 1000 synthetic samples.

For the Breast Cancer Data, we pick 𝑛𝑚𝑎𝑥 = 10000 as in Section 5.1. We can see that in general there is some instability in the features selected by LIME, while S-LIME improves stability.
By further increasing 𝑛𝑚𝑎𝑥 we may get better stability metrics, but at additional computational cost.

For the sepsis prediction task, LIME performs much worse, exhibiting undesirable instability across the 50 test samples at all 5 positions. S-LIME with 𝑛𝑚𝑎𝑥 = 100000 achieves an obvious stability improvement. The reason for invoking a larger value of 𝑛𝑚𝑎𝑥 is that there are 600 features to select from. It is an interesting future direction to see how one can use LIME to explain temporal models more efficiently.

Table 5: Overall average Jaccard index for 20 repetitions of LIME and S-LIME across 50 randomly chosen test samples.

(a) Breast Cancer Data         (b) Sepsis Data
Position   LIME   S-LIME       Position   LIME   S-LIME
   1       0.90   0.98            1       0.54   1.0
   2       0.85   0.96            2       0.43   1.0
   3       0.82   0.92            3       0.37   0.78
   4       0.81   0.96            4       0.35   0.90
   5       0.80   0.84            5       0.34   0.99
C VARIABLES LIST FOR SEPSIS DETECTION

Table 6: Variables list and description for data used in sepsis prediction.

#    Variable     Description
1    Age          age (years)
2    Gender       male (1) or female (0)
3    ICULOS       ICU length of stay (hours since ICU admission)
4    HR           heart rate
5    Potassium    potassium
6    Temp         temperature
7    pH           pH
8    PaCO2        partial pressure of carbon dioxide from arterial blood
9    SBP          systolic blood pressure
10   FiO2         fraction of inspired oxygen
11   SaO2         oxygen saturation from arterial blood
12   AST          aspartate transaminase
13   BUN          blood urea nitrogen
14   MAP          mean arterial pressure
15   Calcium      calcium
16   Chloride     chloride
17   Creatinine   creatinine
18   Bilirubin    bilirubin
19   Glucose      glucose
20   Lactate      lactic acid
21   DBP          diastolic blood pressure
22   Troponin     troponin I
23   Resp         respiration rate
24   PTT          partial thromboplastin time
25   WBC          white blood cell count
26   Platelets    platelet count