
S-LIME: Stabilized-LIME for Model Explanation


Zhengze Zhou (Cornell University, Ithaca, New York, USA; zz433@[Link])
Giles Hooker (Cornell University, Ithaca, New York, USA; gjh27@[Link])
Fei Wang (Weill Cornell Medicine, New York City, New York, USA; few2001@[Link])

ABSTRACT
An increasing number of machine learning models have been deployed in domains with high stakes such as finance and healthcare. Despite their superior performance, many models are black boxes in nature which are hard to explain. There are growing efforts for researchers to develop methods to interpret these black-box models. Post hoc explanations based on perturbations, such as LIME [39], are widely used approaches to interpret a machine learning model after it has been built. This class of methods has been shown to exhibit large instability, posing serious challenges to the effectiveness of the method itself and harming user trust. In this paper, we propose S-LIME, which utilizes a hypothesis testing framework based on the central limit theorem for determining the number of perturbation points needed to guarantee stability of the resulting explanation. Experiments on both simulated and real world data sets are provided to demonstrate the effectiveness of our method.

CCS CONCEPTS
• Computing methodologies → Feature selection; Supervised learning by classification; • Mathematics of computing → Hypothesis testing and confidence interval computation.

KEYWORDS
interpretability; stability; LIME; hypothesis testing

ACM Reference Format:
Zhengze Zhou, Giles Hooker, and Fei Wang. 2021. S-LIME: Stabilized-LIME for Model Explanation. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '21), August 14–18, 2021, Virtual Event, Singapore. ACM, New York, NY, USA, 10 pages. [Link]org/10.1145/3447548.3467274

1 INTRODUCTION
Data mining and machine learning models have been widely deployed for decision making in many fields, including criminal justice [54] and healthcare [35, 37]. However, many models act as "black boxes" in that they only provide predictions with little guidance for humans to understand the process. It has been a desideratum to develop approaches for understanding these complex models, which can help increase user trust [39], assess fairness and privacy [4, 11], debug models [28] and even serve regulatory purposes [19].

Model explanation methods can be roughly divided into two categories [12, 52]: intrinsic explanations and post hoc explanations. Models with intrinsically explainable structures include linear models, decision trees [6], and generalized additive models [20], to name a few. Due to complexity constraints, these models are usually not powerful enough for modern tasks involving heterogeneous features and enormous numbers of samples.

Post hoc explanations, on the other hand, provide insights after a model is trained. These explanations can be either model-specific, which are typically limited to specific model classes, such as split improvement for tree-based methods [57] and saliency maps for convolutional networks [42]; or model-agnostic, which do not require any knowledge of the internal structure of the model being examined, and where the analysis is often conducted by evaluating model predictions on a set of perturbed input data. LIME [39] and SHAP [31] are two of the most popular model-agnostic explanation methods.

Researchers have become aware of several drawbacks of post hoc model explanation. [25] showed that the widely used permutation importance can produce diagnostics that are highly misleading due to extrapolation. [17] demonstrated how to generate adversarial perturbations that produce perceptively indistinguishable inputs with the same predicted label, yet have very different interpretations. [1] showed that explanation algorithms can be exploited to systematically rationalize decisions taken by an unfair black-box model. [40] argued against using post hoc explanations because these methods can provide explanations that are not faithful to what the original model computes.

In this paper, we focus on post hoc explanations based on perturbations [39], one of the most popular paradigms for designing model explanation methods. We argue that the most important property of any explanation technique is stability or reproducibility: repeated runs of the explanation algorithm under the same conditions should ideally yield the same results. Unstable explanations provide little insight to users as to how the model actually works and are considered unreliable. Unfortunately, LIME is not always stable. [55] separated and investigated sources of instability in LIME. [51] highlighted a trade-off between an explanation's stability and adherence and proposed a framework to maximize stability. [30] improved the sensitivity of LIME by averaging multiple output weights for individual images.

We propose a hypothesis testing framework based on a central limit theorem for determining the number of perturbation samples required to guarantee stability of the resulting explanation. Briefly, LIME works by generating perturbations of a given instance and learning a sparse linear explanation, where the sparsity is usually achieved by selecting top features via LASSO [49].


LASSO is known to exhibit early occurrence of false discoveries [33, 47] which, combined with the randomness introduced in the sampling procedure, results in practically significant levels of instability. We carefully analyze Least Angle Regression (LARS) [13] for generating the LASSO path and quantify the asymptotics of the statistics involved in selecting the next variable. Based on a hypothesis testing procedure, we design a new algorithm called S-LIME (Stabilized-LIME) which can automatically and adaptively determine the number of perturbations needed to guarantee a stable explanation.

In the following, we review relevant background on LIME and LASSO, along with their instability, in Section 2. Section 3 statistically analyzes the asymptotic distribution of the statistics at the heart of variable selection in LASSO. Our algorithm S-LIME is introduced in Section 4. Section 5 presents empirical studies on both simulated and real world data sets. We conclude in Section 6 with some discussion.

2 BACKGROUND
In this section, we review the general framework for constructing post hoc explanations based on perturbations using Local Interpretable Model-agnostic Explanations (LIME) [39]. We then briefly discuss LARS and LASSO, which are the internal solvers LIME uses for feature selection, and illustrate LIME's instability with toy experiments.

2.1 LIME
Given a black box model $f$ and a target point $\boldsymbol{x}$ of interest, we would like to understand the behavior of the model locally around $\boldsymbol{x}$. No knowledge of $f$'s internal structure is available, but we are able to query $f$ many times. LIME first samples around the neighborhood of $\boldsymbol{x}$, queries the black box model $f$ to get its predictions, and forms a pseudo data set $D = \{(\boldsymbol{x}_1, y_1), (\boldsymbol{x}_2, y_2), \ldots, (\boldsymbol{x}_n, y_n)\}$ with $y_i = f(\boldsymbol{x}_i)$ and a hyperparameter $n$ specifying the number of perturbations. The model $f$ can be quite general, for regression ($y_i \in \mathbb{R}$) or classification ($y_i \in \{0, 1\}$, or $y_i \in [0, 1]$ if $f$ returns a probability). A model $g$ from some interpretable function space $G$ is chosen by solving the following optimization:

$$\arg\min_{g \in G} \; L(f, g, \pi_{\boldsymbol{x}}) + \Omega(g) \quad (1)$$

where
• $\pi_{\boldsymbol{x}}(\boldsymbol{z})$ is a proximity measure between a perturbed instance $\boldsymbol{z}$ and $\boldsymbol{x}$, usually chosen to be a Gaussian kernel.
• $\Omega(g)$ measures the complexity of the explanation $g \in G$. For example, for decision trees $\Omega(g)$ can be the depth of the tree, while for linear models we can use the number of non-zero weights.
• $L(f, g, \pi_{\boldsymbol{x}})$ is a measure of how unfaithful $g$ is in approximating $f$ in the locality defined by $\pi_{\boldsymbol{x}}$.

[39] suggests a procedure called k-LASSO for selecting the top $k$ features using LASSO. In this case, $G$ is the class of linear models with $g = \boldsymbol{\omega}_g \cdot \boldsymbol{x}$, $L(f, g, \pi_{\boldsymbol{x}}) = \sum_{i=1}^{n} \pi_{\boldsymbol{x}}(\boldsymbol{x}_i)(y_i - g(\boldsymbol{x}_i))^2$, and $\Omega = \infty \cdot \mathbb{1}[\|\boldsymbol{\omega}_g\|_0 > k]$. Under this setting, (1) can be approximately solved by first selecting $k$ features with LASSO (using the regularization path) and then learning the weights via least squares [39].

We point out here the resemblance between post hoc explanations and knowledge distillation [7, 22]; both involve obtaining predictions from the original model, usually on synthetic examples, and using these to train a new model. Differences lie in both the scope and intention of the procedure. Whereas LIME produces interpretable models that apply closely to the point of interest, model distillation is generally used to provide a global compression of the model representation in order to improve both computational and predictive performance [18, 34]. Nonetheless, we might expect distillation methods to also exhibit the instability described here; see [56], which documents instability of decision trees used to provide global interpretation.

2.2 LASSO and LARS
Even models that are "interpretable by design" can be difficult to understand, such as a deep decision tree containing hundreds of leaves, or a linear model that employs many features with non-zero weights. For this reason LASSO [49], which automatically produces sparse models, is often the default solver for LIME.

Formally, suppose $D = \{(\boldsymbol{x}_1, y_1), (\boldsymbol{x}_2, y_2), \ldots, (\boldsymbol{x}_n, y_n)\}$ with $\boldsymbol{x}_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ for $1 \leq i \leq n$. LASSO solves the following optimization problem:

$$\hat{\beta}^{LASSO} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\} \quad (2)$$

where $\lambda$ is the multiplier for the $\ell_1$ penalty. (2) can be efficiently solved via a slight modification of the LARS algorithm [13], which gives the entire LASSO path as $\lambda$ varies. The procedure is described in Algorithms 1 and 2 below [14], where we denote $\boldsymbol{y} = (y_1, y_2, \ldots, y_n)$ and assume $n > p$.

Algorithm 1: Least Angle Regression (LARS)
(1) Standardize the predictors to have zero mean and unit norm. Start with the residual $\boldsymbol{r} = \boldsymbol{y} - \bar{\boldsymbol{y}}$ and $\beta_1, \beta_2, \ldots, \beta_p = 0$.
(2) Find the predictor $\boldsymbol{x}_{\cdot j}$ most correlated with $\boldsymbol{r}$, and move $\beta_j$ from 0 towards its least-squares coefficient $\langle \boldsymbol{x}_{\cdot j}, \boldsymbol{r} \rangle$, until some other competitor $\boldsymbol{x}_{\cdot k}$ has as much correlation with the current residual as does $\boldsymbol{x}_{\cdot j}$.
(3) Move $\beta_j$ and $\beta_k$ in the direction defined by their joint least squares coefficient of the current residual on $(\boldsymbol{x}_{\cdot j}, \boldsymbol{x}_{\cdot k})$, until some other competitor $\boldsymbol{x}_{\cdot l}$ has as much correlation with the current residual.
(4) Repeat steps 2 and 3 until all $p$ predictors have been entered, at which point we arrive at the full least squares solution.

Algorithm 2: LASSO: Modification of LARS
3a. In step 3 of Algorithm 1, if a non-zero coefficient hits zero, drop the corresponding variable from the active set of variables and recompute the current joint least squares direction.

Both Algorithms 1 and 2 can easily be modified to incorporate a weight vector $\boldsymbol{\omega} = (\omega_1, \omega_2, \ldots, \omega_n)$ on the data set $D$ by transforming it to $D = \{(\sqrt{\omega_1}\,\boldsymbol{x}_1, \sqrt{\omega_1}\,y_1), (\sqrt{\omega_2}\,\boldsymbol{x}_2, \sqrt{\omega_2}\,y_2), \ldots, (\sqrt{\omega_n}\,\boldsymbol{x}_n, \sqrt{\omega_n}\,y_n)\}$.
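To make Sections 2.1 and 2.2 concrete, here is a minimal Python sketch (ours, not the authors' implementation) of the weighted k-LASSO step: perturb around $\boldsymbol{x}$, weight by a Gaussian kernel, fold the weights into the data as described above, walk scikit-learn's `lars_path` until $k$ features are active, and re-fit those features by weighted least squares. The perturbation scale and kernel width are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import lars_path, LinearRegression

def k_lasso_explain(f, x, n=1000, k=5, kernel_width=0.75, scale=0.5):
    """Sketch of LIME's k-LASSO step around a point x (1-D array)."""
    rng = np.random.default_rng()
    X = x + rng.normal(scale=scale, size=(n, x.shape[0]))  # perturbations
    y = f(X)                                    # query the black box model
    d = np.linalg.norm(X - x, axis=1)
    w = np.exp(-d**2 / kernel_width**2)         # Gaussian proximity weights
    Xw = X * np.sqrt(w)[:, None]                # fold weights into the data
    yw = y * np.sqrt(w)
    _, _, coefs = lars_path(Xw, yw, method="lasso")
    for step in range(coefs.shape[1]):          # walk the regularization path
        active = np.flatnonzero(coefs[:, step])
        if len(active) >= k:
            break
    # learn the explanation weights on the selected features (weighted LS)
    g = LinearRegression().fit(X[:, active], y, sample_weight=w)
    return active, g.coef_
```

Here `f` is assumed to map a 2-D array of points to 1-D predictions; the randomness of the perturbation step is exactly the source of instability studied next.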


2.3 Instability with LIME
Both [55] and [53] have demonstrated that the random generation of perturbations results in instability in the generated explanations. We apply LIME on the Breast Cancer Data (see Section 5.1 for details) to illustrate this phenomenon. A random forest [5] with 500 trees is built as the black box model, and we apply LIME to explain the prediction of a randomly selected test point multiple times. Each time, 1000 synthetic data points are generated around the test point and the top 5 features are selected via LASSO. We repeat the experiment 100 times and calculate the empirical selection probability of each feature. The result is shown in Figure 1.

[Figure 1: Empirical selection probability for features in Breast Cancer Data. The black box model is a random forests classifier with 500 trees. LIME is run 100 times on a randomly selected test point and the top 5 features are selected via LASSO.]

We can see that across 100 repetitions, only three features are consistently selected by LIME, while there is considerable variability among the remaining features. Note that this does not consider the order in which the features entered: even the top three features exhibit different orderings in the selection process.

This experiment illustrates an important weakness of LIME: its instability or irreproducibility. If repeated runs using the same explanation algorithm on the same model to interpret the same data point yield different results, the utility of the explanation is brought into question. The instability comes from the randomness introduced when generating synthetic samples around the input, and the $\ell_1$ penalty employed in LASSO further increases the chance of selecting spurious features [48]. In Appendix A we show the instability of LASSO using a simple linear model.

One way to stabilize the LIME model is to use a larger corpus of synthetic data, but it is difficult to determine a priori how much larger without repeated experiments. In the next section, we examine how feature selection works in LASSO and LARS, and then design a statistically justified approach to automatically and adaptively determine the number of perturbations required to guarantee stability.

3 ASYMPTOTIC PROPERTIES OF LARS DECISIONS
Consider any given step at which LARS needs to choose a new variable to enter the model. With a sample size of $n$, let the current residuals be given by $\boldsymbol{r} = (r_1, r_2, \ldots, r_n)$, and the two candidate variables be $\boldsymbol{x}_{\cdot i} = (x_{1i}, x_{2i}, \ldots, x_{ni})$ and $\boldsymbol{x}_{\cdot j} = (x_{1j}, x_{2j}, \ldots, x_{nj})$, where we assume the predictors have been standardized to have zero mean and unit norm. LARS chooses the predictor with the highest (absolute) correlation with the residuals to enter the model. Equivalently, one needs to compare $\hat{c}_1 = \frac{1}{n}\sum_{t=1}^{n} r_t x_{ti}$ with $\hat{c}_2 = \frac{1}{n}\sum_{t=1}^{n} r_t x_{tj}$. We use $\hat{c}_1$ and $\hat{c}_2$ to emphasize that these are finite sample estimates, and our purpose is to obtain the probability that their order would be different if the query points were regenerated. To that end, we introduce uppercase symbols $\boldsymbol{R}$, $\boldsymbol{X}_{\cdot i}$, $\boldsymbol{X}_{\cdot j}$ to denote the corresponding random variables of the residuals and the two covariates; these are distributed according to the current value of the coefficients in the LASSO path, and we seek to generate enough data to return the same ordering as the expected values $c_1 = E(\boldsymbol{R} \cdot \boldsymbol{X}_{\cdot i})$ and $c_2 = E(\boldsymbol{R} \cdot \boldsymbol{X}_{\cdot j})$ with high probability. Our algorithm is based on pairwise comparisons between candidate features; we therefore consider the decision between two covariates in this section, and extensions to more general cases involving multiple pairwise comparisons are discussed in Section 4.

By the multivariate Central Limit Theorem (CLT), we have

$$\sqrt{n}\left(\begin{pmatrix} \hat{c}_1 \\ \hat{c}_2 \end{pmatrix} - \begin{pmatrix} c_1 \\ c_2 \end{pmatrix}\right) \longrightarrow N(0, \Sigma),$$

where

$$\Sigma = \operatorname{cov}\begin{pmatrix} \boldsymbol{R} \cdot \boldsymbol{X}_{\cdot i} \\ \boldsymbol{R} \cdot \boldsymbol{X}_{\cdot j} \end{pmatrix} = \begin{pmatrix} \sigma_{11}^2 & \sigma_{12}^2 \\ \sigma_{21}^2 & \sigma_{22}^2 \end{pmatrix}.$$

Without loss of generality we assume $\hat{c}_1 > \hat{c}_2 > 0$; in general, if a correlation is negative, we can simply negate the corresponding feature values in all the calculations of this section. Let $\hat{\Delta}_n = \hat{c}_1 - \hat{c}_2$ and $\Delta_n = c_1 - c_2$, and consider the function $f(a_1, a_2) = a_1 - a_2$. The delta method implies that

$$\sqrt{n}\left(f\begin{pmatrix} \hat{c}_1 \\ \hat{c}_2 \end{pmatrix} - f\begin{pmatrix} c_1 \\ c_2 \end{pmatrix}\right) \longrightarrow N(0, \; \sigma_{11}^2 + \sigma_{22}^2 - \sigma_{12}^2 - \sigma_{21}^2),$$

or, approximately,

$$\hat{\Delta}_n - \Delta_n \sim N\left(0, \; \frac{\hat{\sigma}_{11}^2 + \hat{\sigma}_{22}^2 - \hat{\sigma}_{12}^2 - \hat{\sigma}_{21}^2}{n}\right) \quad (3)$$

where the variance terms are estimated from the empirical covariance of the values $r_t x_{ti}$ and $r_t x_{tj}$, $t = 1, \ldots, n$.

In a similar spirit to [56], we assess the probability that $\hat{\Delta}_n > 0$ would still hold in a repeated experiment. Assume we have another independently generated data set denoted by $\{r_t^*, x_{ti}^*, x_{tj}^*\}_{t=1}^{n}$. It follows from (3) that

$$\hat{\Delta}_n^* - \hat{\Delta}_n \sim N\left(0, \; 2 \cdot \frac{\hat{\sigma}_{11}^2 + \hat{\sigma}_{22}^2 - \hat{\sigma}_{12}^2 - \hat{\sigma}_{21}^2}{n}\right),$$

which leads to the approximation

$$\hat{\Delta}_n^* \,\Big|\, \hat{\Delta}_n = \hat{c}_1 - \hat{c}_2 \;\sim\; N\left(\hat{c}_1 - \hat{c}_2, \; 2 \cdot \frac{\hat{\sigma}_{11}^2 + \hat{\sigma}_{22}^2 - \hat{\sigma}_{12}^2 - \hat{\sigma}_{21}^2}{n}\right).$$
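The quantities above can be estimated directly from the perturbation sample. Below is a minimal Python sketch (ours, not the authors' code) that estimates the variance from the empirical covariance of the products $r_t x_{ti}$ and $r_t x_{tj}$ and evaluates the resulting decision rule; the significance condition and the suggested sample size correspond to Equations (4) and (5) in Section 4.

```python
import numpy as np
from scipy.stats import norm

def lars_step_test(r, xi, xj, alpha=0.05):
    """Test whether the observed ordering c1_hat > c2_hat is significant.

    r, xi, xj: length-n arrays of current residuals and two standardized
    candidate predictors, oriented so that c1_hat >= c2_hat >= 0.
    Returns (significant, n_prime), with n_prime the sample size that
    Eq. (5) suggests would make the comparison significant.
    """
    n = len(r)
    a, b = r * xi, r * xj               # per-sample products r_t x_ti, r_t x_tj
    delta = a.mean() - b.mean()         # c1_hat - c2_hat
    S = np.cov(a, b)                    # empirical covariance matrix
    sd = np.sqrt(2.0 * (S[0, 0] + S[1, 1] - S[0, 1] - S[1, 0]) / n)
    z_alpha = norm.ppf(1 - alpha)       # Z_alpha, the (1 - alpha)-quantile
    if delta > z_alpha * sd:            # condition (4) holds
        return True, n
    z_p = max(delta / sd, 1e-8)         # observed z-score, i.e. Z_{p_n}
    return False, int(np.ceil(n * (z_alpha / z_p) ** 2))
```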


In order to control $P(\hat{\Delta}_n^* > 0)$ at a confidence level $1 - \alpha$, we need

$$\hat{c}_1 - \hat{c}_2 > Z_\alpha \sqrt{2 \cdot \frac{\hat{\sigma}_{11}^2 + \hat{\sigma}_{22}^2 - \hat{\sigma}_{12}^2 - \hat{\sigma}_{21}^2}{n}}, \quad (4)$$

where $Z_\alpha$ is the $(1 - \alpha)$-quantile of a standard normal distribution. For a fixed confidence level $\alpha$ and $n$, suppose we get a corresponding $p$-value $p_n > \alpha$. From (4) we have

$$\sqrt{n} \cdot \frac{\hat{c}_1 - \hat{c}_2}{\sqrt{2(\hat{\sigma}_{11}^2 + \hat{\sigma}_{22}^2 - \hat{\sigma}_{12}^2 - \hat{\sigma}_{21}^2)}} = Z_{p_n}.$$

This implies that we would need approximately $n'$ samples to get a significant result, where

$$\sqrt{\frac{n}{n'}} = \frac{Z_{p_n}}{Z_\alpha}. \quad (5)$$

4 STABILIZED-LIME
Based on the theoretical analysis developed in Section 3, we can run LIME equipped with a hypothesis test at each step where a new variable enters. If the test result is significant, we continue to the next step; otherwise it indicates that the current sample size of perturbations is not large enough. We then generate more synthetic data according to Equation (5) and restart the whole process. Note that we view any intermediate step as conditioned on previously obtained estimates of $\hat{\beta}$. A high level sketch of the algorithm is presented below as Algorithm 3.

Algorithm 3: S-LIME
Input: a black box model $f$, data sample to explain $\boldsymbol{x}$, initial size for perturbation samples $n_0$, significance level $\alpha$, number of features to select $k$, proximity measure $\pi_{\boldsymbol{x}}$.
Output: top $k$ features selected for interpretation.
Generate $D = \{n_0$ synthetic samples around $\boldsymbol{x}\}$ and calculate the weight vector $\boldsymbol{\omega}$ using $\pi_{\boldsymbol{x}}$; set $n = n_0$;
while True do
    Run Algorithm 2 on $D$ with weights $\boldsymbol{\omega}$, with a hypothesis test at each step:
    while active features are fewer than $k$ do
        Select the top two predictors most correlated with the current residual from the remaining covariates, with correlations $\hat{c}_1$ and $\hat{c}_2$;
        Calculate the test statistic
        $$t = \hat{c}_1 - \hat{c}_2 - Z_\alpha \sqrt{2 \cdot \frac{\hat{\sigma}_{11}^2 + \hat{\sigma}_{22}^2 - \hat{\sigma}_{12}^2 - \hat{\sigma}_{21}^2}{n}};$$
        if $t \geq 0$ then
            Continue with this selection;
        else
            Calculate $n' = n \cdot (Z_\alpha / Z_{p_n})^2$, set $n = n'$, and break;
        end
    end
    if active features are fewer than $k$ then
        Generate $D = \{n'$ synthetic samples around $\boldsymbol{x}\}$ and recalculate the weight vector $\boldsymbol{\omega}$ using $\pi_{\boldsymbol{x}}$;
    else
        Return the $k$ selected features;
    end
end

In practice, we may need to set an upper bound on the number of synthetic samples generated (denoted by $n_{max}$), such that whenever the new $n'$ is greater than $n_{max}$, we simply set $n = n_{max}$ and go through the outer while loop one last time without testing at each step. This prevents the algorithm from running too long and wasting computational resources in cases where two competing features are equally important in a local neighborhood; for example, if the black box model is indeed locally linear with equal coefficients for two predictors.
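The sketch below (again ours, with simplifications) wires `lars_step_test` from Section 3 into this outer loop. `sample_fn` is a hypothetical helper returning perturbations, black-box predictions and proximity weights as in Section 2.1, and a greedy forward-stepwise refit stands in for the exact LARS coefficient moves.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def s_lime(sample_fn, k=5, n0=1000, n_max=10000, alpha=0.05):
    """Sketch of Algorithm 3 (forward-stepwise stand-in for full LARS)."""
    n = n0
    while True:
        X, y, w = sample_fn(n)                  # hypothetical sampler (Sec. 2.1)
        Xw = X * np.sqrt(w)[:, None]            # fold weights in (Sec. 2.2)
        yw = y * np.sqrt(w)
        Xw = Xw - Xw.mean(axis=0)
        Xw = Xw / np.linalg.norm(Xw, axis=0)    # zero mean, unit norm
        r = yw - yw.mean()
        active, ok, n_prime = [], True, n
        while len(active) < k:
            c = Xw.T @ r / n                    # correlations with residual
            c[active] = 0.0
            sign = np.sign(c)                   # orient correlations >= 0
            first, second = np.argsort(np.abs(c))[::-1][:2]
            ok, n_prime = lars_step_test(
                r, sign[first] * Xw[:, first],
                sign[second] * Xw[:, second], alpha)
            if not ok and n < n_max:
                break                           # need more perturbations
            active.append(int(first))
            fit = LinearRegression().fit(Xw[:, active], yw)
            r = yw - fit.predict(Xw[:, active])  # greedy residual refit
        if len(active) == k:
            return active                       # all entry decisions accepted
        n = min(n_prime, n_max)
```

When the test fails but $n$ has already reached `n_max`, the inner loop keeps adding features without testing, matching the one-last-pass behavior described above.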
We note several other possible variations of Algorithm 3.

Multiple testing. So far we have only considered comparing a pair of competing features (the top two). But when choosing the next predictor to enter the model at step $m$ (with $m - 1$ active features), there are $p - m + 1$ candidate features. We can modify the procedure to select the best feature among all the remaining candidates by conducting pairwise comparisons between the feature with the largest correlation ($\hat{c}_1$) and the rest ($\hat{c}_2, \ldots, \hat{c}_{p-m+1}$). This is a multiple comparisons problem, and one can use an idea analogous to the Bonferroni correction. Mathematically:
• Test the hypotheses $H_{i,0}: \hat{c}_1 \leq \hat{c}_i$, $i = 2, \ldots, p - m + 1$, obtaining $p$-values $p_2, \ldots, p_{p-m+1}$.
• Reject the null hypothesis if $\sum_{i=2}^{p-m+1} p_i < \alpha$.
Although straightforward, this Bonferroni-like correction ignores much of the correlation among these statistics and will result in a conservative estimate. In our experiments, we only conduct the hypothesis test for the top two features without resorting to multiple testing, as it is more efficient and empirically we do not observe any performance degradation.

Efficiency. Several modifications can be made to improve the efficiency of Algorithm 3. At each step when $n$ is increased to $n'$, we can reuse the existing synthetic samples and only generate an additional $n' - n$ perturbation points. One may also note that whenever the outer while loop restarts, we conduct repeated tests for the first several variables entering the model. To achieve better efficiency, each new run can condition on previous runs: if a variable enters the LASSO path in the same order as before and has already been tested with significant statistics, no additional testing is needed. The hypothesis test is then only invoked when we select more features than previous runs or, in some rare cases, when the current iteration disagrees with previous results. In our experiments, we do not implement this conditioning step for simplicity, as we find the efficiency gain is marginal when selecting a moderate number of features.

5 EMPIRICAL STUDIES
Rather than performing a broad-scale analysis, we look at several specific cases as illustrations to show the effectiveness of S-LIME in generating stabilized model explanations. Scikit-learn [36] is used for building the black box models. Code for replicating our experiments is available at [Link].


5.1 Breast Cancer Data
We use the widely adopted Breast Cancer Wisconsin (Diagnostic) Data Set [32], which contains 569 samples and 30 features¹. A random forest with 500 trees is trained on 80% of the data as the black box model to predict whether an instance is benign or malignant. It achieves around 95% accuracy on the remaining 20% test data. Since our focus is on producing stabilized explanations for a specific instance, we do not spend additional effort on hyperparameter tuning to further improve model performance.

¹ [Link]
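This setup can be reproduced along the following lines with scikit-learn (a sketch; the split seed is our assumption):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)      # 569 samples, 30 features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)
rf = RandomForestClassifier(n_estimators=500).fit(X_tr, y_tr)  # black box
print(rf.score(X_te, y_te))                     # around 95%, as reported above
```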
Figure 1 in Section 2.3 has already demonstrated the inconsistency of the features selected by the original LIME. In Figure 2 below, we show a graphical illustration of four LIME replications on a randomly selected test instance, where the left column of each subfigure shows the selected features along with the learned linear parameters, and the right column shows the corresponding feature values for the sample. These repetitions of LIME applied to the same instance produce different orderings for the top two features, and also disagree on the fourth and fifth features.

[Figure 2: Four iterations of LIME on Breast Cancer Data; panels (a)–(d) show iterations 1–4. The black box model is a random forests classifier with 500 trees. LIME explanations are generated with 1000 synthetic perturbations.]

To quantify the stability of the generated explanations, we measure the Jaccard index, a statistic used for gauging the similarity and diversity of sample sets. Given two sets $A$ and $B$ (in our case, the sets of features selected by LIME), the Jaccard coefficient is defined as the size of the intersection divided by the size of the union:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$
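Computed over repeated runs, the stability metric used throughout Tables 1–5 is the average pairwise Jaccard index of the top-$k$ feature sets; a minimal helper:

```python
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def mean_topk_jaccard(runs, k):
    """Average Jaccard over all pairs of repeated top-k feature lists."""
    tops = [tuple(run[:k]) for run in runs]
    n_pairs = len(tops) * (len(tops) - 1) / 2
    return sum(jaccard(s, t) for s, t in combinations(tops, 2)) / n_pairs
```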
One disadvantage of the Jaccard index is that it ignores ordering within each feature set. For example, if the top two features returned from two iterations of LIME are $A = \{$worst perimeter, worst area$\}$ and $B = \{$worst area, worst perimeter$\}$, we have $J(A, B) = 1$, but this does not imply that the LIME explanations are stable. To better quantify stability, we look at the Jaccard index for the top $k$ features for $k = 1, \ldots, 5$. Table 1 shows the average Jaccard index across all pairs of 20 repetitions of both LIME and S-LIME on the selected test instance. We set $n_{max} = 10000$ for S-LIME.

Table 1: Average Jaccard index for 20 repetitions of LIME and S-LIME. The black box model is a random forest with 500 trees.

Position   LIME   S-LIME
1          0.61   1.0
2          1.0    1.0
3          1.0    1.0
4          0.66   1.0
5          0.59   0.85

As we can see, for the top four positions the average Jaccard index of S-LIME is 1, meaning the algorithm is stable across different iterations. There is some variability in the fifth feature selected, as the two features mean radius and worst concave points have very close impact locally. Further increasing $n_{max}$ will make the selection of the fifth variable more consistent. Figure 3 shows the only two explanations we observed in simulations for S-LIME, where the difference is in the fifth variable.

[Figure 3: Two iterations of S-LIME on Breast Cancer Data (panels (a) and (b)). The black box model is a random forests classifier with 500 trees.]

In contrast, we have already seen instability in LIME even for the first variable selected. Although LIME consistently selects the same top two and third features, there is much variability in the fourth and fifth features. This experiment demonstrates the stability of S-LIME compared to LIME. In Appendix B.1, we apply S-LIME to other types of black box models. Stability results on a large cohort of test samples are included in Appendix B.2.

5.2 MARS Test Function
Here we use a modification of the function given in [15] (to test the MARS algorithm) as the black box model, so that we know the underlying true local weights of the variables. Let $y = f(\boldsymbol{x}) = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.05)^2 + 5.2 x_4 + 5 x_5$, where $X \sim U([0, 1]^5)$. The test point $\boldsymbol{x}$ is chosen to be $(0.51, 0.49, 0.5, 0.5, 0.5)$. We can easily calculate the local linear weights of the five variables around $\boldsymbol{x}$, and the expected selection order is $(x_3, x_1, x_2, x_4, x_5)$. Note that the specific choice of parameters in $f(\boldsymbol{x})$ and the location of the test point $\boldsymbol{x}$ make it difficult to distinguish between $x_1, x_2$ and between $x_4, x_5$.
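For reproduction, the test function and test point can be written directly (a sketch):

```python
import numpy as np

def mars_f(x):
    """Modified MARS test function used as the black box in this section."""
    return (10 * np.sin(np.pi * x[..., 0] * x[..., 1])
            + 20 * (x[..., 2] - 0.05) ** 2 + 5.2 * x[..., 3] + 5 * x[..., 4])

x0 = np.array([0.51, 0.49, 0.5, 0.5, 0.5])   # the test point of interest
```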


Table 2 presents the average Jaccard index for the feature sets selected by LIME and S-LIME, where LIME is generated with 1000 synthetic samples and we set $n_0 = 1000$ and $n_{max} = 10000$ for S-LIME. The close local weights between $x_1, x_2$ and between $x_4, x_5$ cause some instability in LIME, as can be seen from the drop in the index at positions 2 and 4. S-LIME outputs consistent explanations in this case.

Table 2: Average Jaccard index for 20 repetitions of LIME and S-LIME on test point (0.51, 0.49, 0.5, 0.5, 0.5). The black box model is the MARS test function.

Position   LIME   S-LIME
1          1.0    1.0
2          0.82   1.0
3          1.0    1.0
4          0.79   1.0
5          1.0    1.0

5.3 Early Prediction of Sepsis From Electronic Health Records
Sepsis is a major public health concern and a leading cause of death in the United States [3]. Early detection and treatment of a sepsis incidence is a crucial factor for patient outcomes [38]. Electronic health records (EHR) store data associated with each individual's health journey and have recently seen increasing use in clinical informatics and epidemiology [46, 50]. There have been several works predicting sepsis based on EHR [16, 21, 29]. Interpretability of these models is essential for them to be deployed in clinical settings.

We collect data from MIMIC-III [26], which is a freely accessible critical care database. After pre-processing, there are 15309 patients in the cohort for analysis, out of which 1221 developed sepsis based on the Sepsis-3 clinical criteria for sepsis onset [43]. For each patient, the record consists of a combination of hourly vital sign summaries, laboratory values, and static patient descriptions. We provide the list of all variables involved in Appendix C. ICULOS is a timestamp which denotes the hours since ICU admission for each patient, and is thus not used directly for training the model.

For each patient's records, missing values are filled with the most recent value if available, otherwise with a global average. Negative samples are down-sampled to achieve a class ratio of 1:1. We randomly select 90% of the data for training and leave the remaining 10% for testing.
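In pandas terms, the per-patient imputation described above amounts to a forward fill followed by a global-mean fallback (a sketch; `global_means` is a hypothetical precomputed Series indexed by variable name):

```python
import pandas as pd

def impute_record(df: pd.DataFrame, global_means: pd.Series) -> pd.DataFrame:
    """Fill each variable with its most recent value, else a global average."""
    return df.ffill().fillna(global_means)
```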
A simple recurrent neural network based on an LSTM [23] module is built with Keras [9] for demonstration. Each sample fed into the network has 25 features with 24 timestamps, then goes through an LSTM with 32 internal units and a dropout rate of 0.2, and finally a dense layer with softmax activation to output a probability. The network is optimized by Adam [27] with an initial learning rate of 0.0001, and we train it for 500 epochs with a batch size of 50. The model achieves around 0.75 AUC on the test set. Note that we do not fine-tune the architecture of the network through cross validation.
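A Keras model consistent with this description looks as follows (a sketch, not the authors' exact code):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(24, 25)),           # 24 timestamps, 25 features
    keras.layers.LSTM(32, dropout=0.2),
    keras.layers.Dense(2, activation="softmax"),  # probability of sepsis onset
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["AUC"])
# model.fit(X_train, y_train, epochs=500, batch_size=50)
```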
The purpose of this study is not to achieve superior performance, as that usually requires more advanced modeling techniques for temporal data [16, 29] or exploiting missing value patterns [8]. Instead, we would like to demonstrate the effectiveness of our proposed method in reliably explaining a relatively large scale machine learning model applied to medical data.

To deal with temporal data, where each sample in the training set is of shape ($n\_timesteps$, $n\_features$), LIME reshapes the data so that it becomes a long vector of size $n\_timesteps \times n\_features$. Essentially this transforms the temporal data to the regular tabular shape while multiplying the number of features by the number of available timestamps.
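In code, the flattening is just a reshape (the feature naming scheme is our illustration):

```python
import numpy as np

n_timesteps, n_features = 24, 25
record = np.random.rand(n_timesteps, n_features)  # one patient's input slice
flat = record.reshape(-1)                         # 24 * 25 = 600 LIME features
names = [f"var{v}_t{t}" for t in range(n_timesteps) for v in range(n_features)]
```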


Table 3 presents the average Jaccard index for the feature sets selected by LIME and S-LIME on two randomly selected test samples, where LIME is generated with 1000 synthetic samples and we set $n_0 = 1000$ and $n_{max} = 100000$ for S-LIME. LIME exhibits undesirable instability in this example, potentially due to the complexity of the black box model and the large number of features ($24 \times 25 = 600$). S-LIME achieves much better stability than LIME, although we can still observe some uncertainty in the choice of the fifth feature for the second test sample.

Table 3: Average Jaccard index for 20 repetitions of LIME and S-LIME on two randomly selected test samples. The black box model is a recurrent neural network.

(a) test sample 1              (b) test sample 2
Position   LIME   S-LIME       Position   LIME   S-LIME
1          0.37   1.0          1          0.31   1.0
2          0.29   1.0          2          0.24   1.0
3          0.33   1.0          3          0.19   1.0
4          0.25   0.89         4          0.17   0.96
5          0.26   1.0          5          0.18   0.78

Figure 4 below shows the output of S-LIME on two different test samples. We can see that for sample 1, the most recent temperatures play an important role, along with the latest pH and potassium values, while for sample 2, the latest pH values are the most important.

[Figure 4: Output of S-LIME for two randomly selected test samples ((a) test sample 1, (b) test sample 2). The black box model is a recurrent neural network.]

We want to emphasize that extra caution must be taken by practitioners in applying LIME, especially to complex problems. A local linear model with a few features might not be suitable for approximating a recurrent neural network built on temporal data. How to apply perturbation-based explanation algorithms to temporal data is still an open problem, and we leave it for future work. That being said, the experiment in this section demonstrates the effectiveness of S-LIME in producing stabilized explanations.

6 DISCUSSIONS
An important property of model explanation methods is stability: repeated runs of the algorithm on the same object should output consistent results. In this paper, we show that post hoc explanations based on perturbations, such as LIME, are not stable due to the randomness introduced in generating synthetic samples. Our proposed algorithm S-LIME is based on a hypothesis testing framework and can automatically and adaptively determine the appropriate number of perturbations required to guarantee stability.

The idea behind S-LIME is similar to [56], which tackles the problem of building stable approximation trees in model distillation. In the area of online learning, [10] uses Hoeffding bounds [24] to guarantee the correct choice of splits in a decision tree by comparing the two best attributes. We should mention that S-LIME is not restricted to LASSO as its feature selection mechanism. In fact, to produce a ranking of explanatory variables, one can use any sequential procedure that builds a model by sequentially adding or removing variables based upon some criterion, such as forward-stepwise or backward-stepwise selection [14]. All of these methods can be stabilized by a hypothesis testing framework similar to S-LIME.

There are several works closely related to ours. [55] identifies three sources of uncertainty in LIME: sampling variance, sensitivity to the choice of parameters, and variability in the black box model. We aim to control the first source of variability, as the other two depend on specific design choices of the practitioner. [51] highlight a trade-off between an explanation's stability and adherence. Their approach is to select a suitable kernel width for the proximity measure, but it does not improve stability given any particular kernel width. In [53], the authors design a deterministic version of LIME by only looking at existing training data through hierarchical clustering, without resorting to synthetic samples. However, the number of samples in a dataset will affect the quality of the clusters, and a lack of nearby points poses additional challenges; this strategy also relies on having access to the training data. Most recently, [45] develop a set of tools for analyzing explanation uncertainty in a Bayesian framework for LIME. Our method can be viewed as a frequentist counterpart without the need to choose priors and evaluate a posterior distribution.

Another line of work concerns adversarial attacks on LIME. [44] propose a scaffolding technique to hide the biases of any given classifier by building adversarial classifiers that detect perturbed instances. Later, [41] utilize a generative adversarial network to sample more realistic synthetic data, making LIME more robust to adversarial attacks. The technique we developed in this work is orthogonal to these directions. We also plan to explore other data generating procedures which can help with stability.

ACKNOWLEDGMENTS
Giles Hooker is supported by NSF DMS-1712554. Fei Wang is supported by NSF 1750326, 2027970, ONR N00014-18-1-2585, an Amazon Web Services (AWS) Machine Learning for Research Award and a Google Faculty Research Award.

REFERENCES
[1] Ulrich Aïvodji, Hiromi Arai, Olivier Fortineau, Sébastien Gambs, Satoshi Hara, and Alain Tapp. 2019. Fairwashing: the risk of rationalization. arXiv preprint arXiv:1901.09749 (2019).
[2] Zeyuan Allen-Zhu and Yuanzhi Li. 2020. Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning. arXiv preprint arXiv:2012.09816 (2020).
[3] Derek C Angus, Walter T Linde-Zwirble, Jeffrey Lidicker, Gilles Clermont, Joseph Carcillo, and Michael R Pinsky. 2001. Epidemiology of severe sepsis in the United States: analysis of incidence, outcome, and associated costs of care. Critical Care Medicine 29, 7 (2001), 1303–1310.
[4] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. Machine bias. ProPublica. See [Link] (2016).
[5] Leo Breiman. 2001. Random forests. Machine Learning 45, 1 (2001), 5–32.
[6] Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. 1984. Classification and Regression Trees. CRC Press.
[7] Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 535–541.
[8] Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. 2018. Recurrent neural networks for multivariate time series with missing values. Scientific Reports 8, 1 (2018), 1–12.
[9] François Chollet et al. 2015. Keras. [Link]
[10] Pedro Domingos and Geoff Hulten. 2000. Mining high-speed data streams. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 71–80.
[11] Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608 (2017).
[12] Mengnan Du, Ninghao Liu, and Xia Hu. 2019. Techniques for interpretable machine learning. Commun. ACM 63, 1 (2019), 68–77.
[13] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. 2004. Least angle regression. The Annals of Statistics 32, 2 (2004), 407–499.
[14] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2001. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics, New York.
[15] Jerome H Friedman. 1991. Multivariate adaptive regression splines. The Annals of Statistics (1991), 1–67.
[16] Joseph Futoma, Sanjay Hariharan, Katherine Heller, Mark Sendak, Nathan Brajer, Meredith Clement, Armando Bedoya, and Cara O'Brien. 2017. An improved multi-output Gaussian process RNN with real-time validation for early sepsis detection. In Machine Learning for Healthcare Conference. PMLR, 243–254.
[17] Amirata Ghorbani, Abubakar Abid, and James Zou. 2019. Interpretation of neural networks is fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 3681–3688.
[18] Robert D Gibbons, Giles Hooker, Matthew D Finkelman, David J Weiss, Paul A Pilkonis, Ellen Frank, Tara Moore, and David J Kupfer. 2013. The computerized adaptive diagnostic test for major depressive disorder (CAD-MDD): a screening tool for depression. The Journal of Clinical Psychiatry 74, 7 (2013), 1–478.
[19] Bryce Goodman and Seth Flaxman. 2017. European Union regulations on algorithmic decision-making and a "right to explanation". AI Magazine 38, 3 (2017), 50–57.
[20] Trevor J Hastie and Robert J Tibshirani. 1990. Generalized Additive Models. Vol. 43. CRC Press.
[21] Katharine E Henry, David N Hager, Peter J Pronovost, and Suchi Saria. 2015. A targeted real-time early warning score (TREWScore) for septic shock. Science Translational Medicine 7, 299 (2015), 299ra122.
[22] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
[23] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[24] Wassily Hoeffding. 1994. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding. Springer, 409–426.
[25] Giles Hooker and Lucas Mentch. 2019. Please stop permuting features: An explanation and alternatives. arXiv preprint arXiv:1905.03151 (2019).
[26] Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-Wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data 3, 1 (2016), 1–9.
[27] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[28] Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. arXiv preprint arXiv:1703.04730 (2017).
[29] Simon Meyer Lauritsen, Mads Ellersgaard Kalør, Emil Lund Kongsgaard, Katrine Meyer Lauritsen, Marianne Johansson Jørgensen, Jeppe Lange, and Bo Thiesson. 2020. Early detection of sepsis utilizing deep learning on electronic health record event sequences. Artificial Intelligence in Medicine 104 (2020), 101820.
[30] Eunjin Lee, David Braines, Mitchell Stiffler, Adam Hudler, and Daniel Harborne. 2019. Developing the sensitivity of LIME for better machine learning explanation. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, Vol. 11006. International Society for Optics and Photonics, 1100610.
[31] Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems. 4765–4774.
[32] Olvi L Mangasarian, W Nick Street, and William H Wolberg. 1995. Breast cancer diagnosis and prognosis via linear programming. Operations Research 43, 4 (1995), 570–577.
[33] Nicolai Meinshausen and Peter Bühlmann. 2010. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72, 4 (2010), 417–473.
[34] Aditya Krishna Menon, Ankit Singh Rawat, Sashank J Reddi, Seungyeon Kim, and Sanjiv Kumar. 2020. Why distillation helps: a statistical perspective. arXiv preprint arXiv:2005.10419 (2020).
[35] Riccardo Miotto, Fei Wang, Shuang Wang, Xiaoqian Jiang, and Joel T Dudley. 2018. Deep learning for healthcare: review, opportunities and challenges. Briefings in Bioinformatics 19, 6 (2018), 1236–1246.
[36] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[37] Alvin Rajkomar, Eyal Oren, Kai Chen, Andrew M Dai, Nissan Hajaj, Michaela Hardt, Peter J Liu, Xiaobing Liu, Jake Marcus, Mimi Sun, et al. 2018. Scalable and accurate deep learning with electronic health records. NPJ Digital Medicine 1, 1 (2018), 18.
[38] Matthew A Reyna, Chris Josef, Salman Seyedi, Russell Jeter, Supreeth P Shashikumar, M Brandon Westover, Ashish Sharma, Shamim Nemati, and Gari D Clifford. 2019. Early prediction of sepsis from clinical data: the PhysioNet/Computing in Cardiology Challenge 2019. In 2019 Computing in Cardiology (CinC). IEEE.
[39] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1135–1144.
[40] Cynthia Rudin. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1, 5 (2019), 206–215.
[41] Sean Saito, Eugene Chua, Nicholas Capel, and Rocco Hu. 2020. Improving LIME robustness with smarter locality sampling. arXiv preprint arXiv:2006.12302 (2020).
[42] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013).
[43] Mervyn Singer, Clifford S Deutschman, Christopher Warren Seymour, Manu Shankar-Hari, Djillali Annane, Michael Bauer, Rinaldo Bellomo, Gordon R Bernard, Jean-Daniel Chiche, Craig M Coopersmith, et al. 2016. The third international consensus definitions for sepsis and septic shock (Sepsis-3). JAMA 315, 8 (2016), 801–810.
[44] Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. 2020. Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. 180–186.
[45] Dylan Slack, Sophie Hilgard, Sameer Singh, and Himabindu Lakkaraju. 2020. How much should I trust you? Modeling uncertainty of black box explanations. arXiv preprint arXiv:2008.05030 (2020).
[46] Jose Roberto Ayala Solares, Francesca Elisa Diletta Raimondi, Yajie Zhu, Fatemeh Rahimian, Dexter Canoy, Jenny Tran, Ana Catarina Pinho Gomes, Amir H Payberah, Mariagrazia Zottoli, Milad Nazarzadeh, et al. 2020. Deep learning for electronic health records: A comparative review of multiple deep neural architectures. Journal of Biomedical Informatics 101 (2020), 103337.
[47] Weijie Su, Małgorzata Bogdan, and Emmanuel Candès. 2017. False discoveries occur early on the lasso path. The Annals of Statistics 45, 5 (2017), 2133–2150.
[48] Weijie J Su. 2018. When is the first spurious variable selected by sequential regression procedures? Biometrika 105, 3 (2018), 517–527.
[49] Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 1 (1996), 267–288.
[50] Akhil Vaid, Suraj K Jaladanki, Jie Xu, Shelly Teng, Arvind Kumar, Samuel Lee, Sulaiman Somani, Ishan Paranjpe, Jessica K De Freitas, Tingyi Wanyan, et al. 2020. Federated learning of electronic health records improves mortality prediction in patients hospitalized with COVID-19. medRxiv (2020).
[51] Giorgio Visani, Enrico Bagli, and Federico Chesani. 2020. OptiLIME: Optimized LIME explanations for diagnostic computer algorithms. arXiv preprint arXiv:2006.05714 (2020).
[52] Fei Wang, Rainu Kaushal, and Dhruv Khullar. 2020. Should health care demand interpretable artificial intelligence or accept "black box" medicine? Annals of Internal Medicine 172, 1 (2020), 59–60.
[53] Muhammad Rehman Zafar and Naimul Mefraz Khan. 2019. DLIME: A deterministic local interpretable model-agnostic explanations approach for computer-aided diagnosis systems. arXiv preprint arXiv:1906.10263 (2019).
[54] Jiaming Zeng, Berk Ustun, and Cynthia Rudin. 2015. Interpretable classification models for recidivism prediction. arXiv preprint arXiv:1503.07810 (2015).
[55] Yujia Zhang, Kuangyan Song, Yiming Sun, Sarah Tan, and Madeleine Udell. 2019. "Why Should You Trust My Explanation?" Understanding uncertainty in LIME explanations. arXiv preprint arXiv:1904.12991 (2019).
[56] Yichen Zhou, Zhengze Zhou, and Giles Hooker. 2018. Approximation trees: Statistical stability in model distillation. arXiv preprint arXiv:1808.07573 (2018).
[57] Zhengze Zhou and Giles Hooker. 2019. Unbiased measurement of feature importance in tree-based methods. arXiv preprint arXiv:1903.05179 (2019).

A INSTABILITY WITH LASSO
Instability with LASSO has been studied previously by several researchers. [33] introduce stability selection based on subsampling, which provides finite sample control for some error rates of false discoveries. [48] finds that sequential regression procedures select the first spurious variable unexpectedly early, even in settings of low correlation between variables and strong true effect sizes. [47] further develop a sharp asymptotic trade-off between false and true positive rates along the LASSO path.

We demonstrate this phenomenon using a simple linear case. Suppose $y = \rho_1 x_1 + \rho_2 x_2 + \rho_3 x_3$, where $x_1$, $x_2$ and $x_3$ are independent and generated from a standard normal distribution $N(0, 1)$. Note that we do not impose any additional noise in generating the response $y$. We choose $\rho_1 = 1$, $\rho_2 = 0.75$ and $\rho_3 = 0.7$, so that when one uses LARS to solve LASSO, $x_1$ always enters the model first, while $x_2$ and $x_3$ have closer coefficients and are more challenging to distinguish.

We focus on the ordering in which the three covariates enter the model. The "correct" ordering should be $(x_1, x_2, x_3)$. Across multiple runs of LASSO with $n = 1000$, we observe that roughly 20% of the results have the order $(x_1, x_3, x_2)$ instead. Figure 5 below shows two representative LASSO paths.

[Figure 5: Two cases of variable ordering in the LASSO path: (a) path $(x_1, x_2, x_3)$; (b) path $(x_1, x_3, x_2)$.]


This toy experiment demonstrates the instability of LASSO itself. Even in this ideal noise-free setting, where we have an independent design with a Gaussian distribution for the variables, 20% of the time LASSO exhibits different paths due to random sampling. Intuitively, the solutions at the beginning of the LASSO path are overwhelmingly biased, and the residual vector contains much of the true effects; thus some less relevant or irrelevant variable can exhibit high correlation with the residual and get selected early. $n = 1000$ seems to be a reasonably large number of samples to achieve consistent results, but when applying the idea of S-LIME, the hypothesis test is always inconclusive at the second step, when it needs to choose between $x_2$ and $x_3$. Increasing $n$ in this case can indeed yield significant testing results and stabilize the LASSO paths.
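The experiment can be replicated in a few lines with scikit-learn's `lars_path` (a sketch; `active` records the order in which variables enter, assuming no drops along the path):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
flips = 0
for _ in range(100):
    X = rng.standard_normal((1000, 3))
    y = X @ np.array([1.0, 0.75, 0.7])        # noise-free linear response
    _, active, _ = lars_path(X, y, method="lasso")
    flips += tuple(active[:3]) == (0, 2, 1)   # entry order (x1, x3, x2)
print(flips / 100)                            # roughly 0.2, as reported above
```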

B ADDITIONAL EXPERIMENTS
B.1 S-LIME on other model types
Besides the randomness introduced in generating synthetic perturbations, the output of model explanation algorithms also depends on several other factors, including the black box model itself. There may not be a universal truth to the explanations of a given instance, as it depends on how the underlying model captures the relationship between covariates and responses. Distinct model types, or even the same model structure trained with random initialization, can utilize different correlations between features and responses [2], and thus result in different model explanations.

We apply S-LIME on other model types to illustrate two points:
• Compared to LIME, S-LIME can generate stabilized explanations, though for some model types more synthetic perturbations are required.
• Different model types can have different explanations for the same instance. This does not imply that S-LIME is unstable or not reproducible, but practitioners need to be aware of this dependency on the underlying black box model when applying any model explanation method.

We use support-vector machines (SVM) and neural networks (NN) as the underlying black box models and apply LIME and S-LIME. The basic setup is similar to Section 5.1. For SVM training, we use default parameters² with an RBF kernel. The NN is constructed with two hidden layers, with 12 and 8 hidden units respectively. ReLU activations are used between the hidden layers, while the last layer uses a sigmoid function to output a probability. The network is implemented in Keras [9]. Both models achieve over 90% accuracy on the test set.

² [Link]
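The NN described above can be sketched in Keras as follows (training details beyond the text are our assumptions):

```python
from tensorflow import keras

nn = keras.Sequential([
    keras.layers.Input(shape=(30,)),                 # 30 Breast Cancer features
    keras.layers.Dense(12, activation="relu"),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),     # probability output
])
nn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```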
Table 4 lists the average Jaccard index across 20 repetitions for each setting on a randomly selected test instance. LIME is generated with 1000 synthetic samples, while for S-LIME we set $n_{max} = 100000$ for the SVM and $n_{max} = 10000$ for the NN. Compared with LIME, S-LIME achieves better stability at each position.

Table 4: Average Jaccard index for 20 repetitions of LIME and S-LIME. The black box models are SVM and NN.

           SVM                NN
Position   LIME   S-LIME     LIME   S-LIME
1          1.0    1.0        0.73   1.0
2          0.35   0.87       0.87   1.0
3          0.23   0.83       0.71   0.74
4          0.19   1.0        0.66   1.0
5          0.18   0.67       0.55   1.0

Figure 6 shows the explanations generated by S-LIME with the SVM and the NN as black box models. We can see that they differ in the features selected.

[Figure 6: S-LIME on Breast Cancer Data with SVM and NN as black box models: (a) S-LIME on SVM; (b) S-LIME on NN.]

One important observation is that the underlying black box model also affects the stability of local explanations. For example, the original LIME is extremely unstable for the SVM, and S-LIME needs a larger $n_{max}$ to produce consistent results.

B.2 A large cohort of test samples
Most of the experiments in this paper target a randomly selected test sample, which allows us to examine specific features easily. That being said, one can expect the instability of LIME and the improvement of S-LIME to be universal. In this part we conduct experiments on a large cohort of test samples for both the Breast Cancer (Section 5.1) and Sepsis (Section 5.3) data.

In each application, we randomly select 50 test samples. For each test instance, LIME and S-LIME are applied for 20 repetitions and we calculate the average Jaccard index across all pairs out of 20 as before. Finally, we report the overall average Jaccard index over the 50 test samples. The results are shown in Table 5. LIME explanations are generated with 1000 synthetic samples.

For the Breast Cancer Data, we pick $n_{max} = 10000$ as in Section 5.1. We can see that in general there is some instability in the features selected by LIME, while S-LIME improves stability. Further increasing $n_{max}$ may yield better stability metrics, but at additional computational cost.
2437
Research Track Paper KDD ’21, August 14–18, 2021, Virtual Event, Singapore

further increasing 𝑛𝑚𝑎𝑥 we may get better stability metrics, but at C VARIABLES LIST FOR SEPSIS DETECTION
the cost of computational costs.
For the sepsis prediction task, LIME performs much worse ex- Table 6: Variables list and description for data used in sepsis
hibiting undesirable instability across 50 test samples at all 5 po- prediction.
sitions. S-LIME with 𝑛𝑚𝑎𝑥 = 100000 achieves obviously stability
improvement. The reason for invoking a larger value of 𝑛𝑚𝑎𝑥 is # Variables Description
due to the fact that there are 600 features to select from. It is an 1 Age age(years)
interesting future direction to see how one can use LIME to explain 2 Gender male (1) or female (0)
temporal models more efficiently. 3 ICULOS ICU length of stay (hours since ICU admission)
4 HR hea1t rate
Table 5: Overall average Jaccard index for 20 repetitions for 5 Potassium potassium
LIME and S-LIME across 50 randomly chosen test samples.
6 Temp temperature
7 pH pH
Position LIME S-LIME Position LIME S-LIME 8 PaCO2 partial pressure of carbon dioxide from arterial blood
1 0.90 0.98 1 0.54 1.0 9 SBP systolic blood pressure
2 0.85 0.96 2 0.43 1.0 10 FiO2 fraction of inspired oxygen
3 0.82 0.92 3 0.37 0.78
11 SaO2 oxygen saturation from arterial blood
4 0.81 0.96 4 0.35 0.90
12 AST aspartate transaminase
5 0.80 0.84 5 0.34 0.99
13 BUN blood urea nitrogen
(a) Breast Cancer Data (b) Sepsis Data
14 MAP mean arterial pressure
15 Calcium calcium
16 Chloride chloride
17 Creatinine creatinine
18 Bilirubin bilirubin
19 Glucose glucose
20 Lactate lactic acid
21 DBP diastolic blood pressure
22 Troponin troponin I
23 Resp respiration rate
24 PTT partial thromboplastin time
25 WBC white blood cells count
26 Platelets platelet count

