Training Data Influence Analysis

and Estimation: A Survey


Zayd Hammoudeh∗1,2   Daniel Lowd1
1 University of Oregon
2 Qualtrics AI
arXiv:2212.04612v3 [cs.LG] 29 Mar 2024

Abstract

Good models require good training data. For overparameterized deep models, the causal
relationship between training data and model predictions is increasingly opaque and poorly
understood. Influence analysis partially demystifies training’s underlying interactions by quan-
tifying the amount each training instance alters the final model. Measuring the training data’s
influence exactly can be provably hard in the worst case; this has led to the development and
use of influence estimators, which only approximate the true influence. This paper provides
the first comprehensive survey of training data influence analysis and estimation. We begin
by formalizing the various, and in places orthogonal, definitions of training data influence. We
then organize state-of-the-art influence analysis methods into a taxonomy; we describe each
of these methods in detail and compare their underlying assumptions, asymptotic complex-
ities, and overall strengths and weaknesses. Finally, we propose future research directions
to make influence analysis more useful in practice as well as more theoretically and empiri-
cally sound. A curated, up-to-date list of resources related to influence analysis is available at
https://github.com/ZaydH/influence_analysis_papers.

Keywords: Influence analysis, influence estimation, training data attribution, data valuation, influence
functions, TracIn, Shapley value

1 Introduction

Machine learning is built on training data [Red+21]. Without good training data, nothing else works. How
modern models learn from and use training data is increasingly opaque [KL17; Zha+21a; Xia22]. Regarding
state-of-the-art black-box models, Yampolskiy [Yam20] notes, “If all we have is a ‘black box’ it is impossible
to understand causes of failure and improve system safety.”
Large modern models require tremendous amounts of training data [BF21]. Today’s uncurated, internet-
derived datasets commonly contain numerous anomalous instances [Ple+20]. These anomalies can arise
from multiple potential sources. For example, training data anomalies may have a natural cause such as
distribution shift [RL87; Yan+21], measurement error, or non-representative samples drawn from the tail of
the data distribution [Hub81; Fel20b; Jia+21b]. Anomalous training instances also occur due to human or
algorithmic labeling errors – even on well-known, highly-curated datasets [EGH17]. Malicious adversaries
can insert anomalous poison instances into the training data with the goal of manipulating specific model
predictions [BNL12; Che+17; Sha+18a; HL23]. Regardless of the cause, anomalous training instances
degrade a model's overall generalization performance.

∗ Correspondence to [email protected]. Work primarily done while at the University of Oregon. This paper is published in journal Machine Learning [HL24b].
Today’s large datasets also generally overrepresent established and dominant viewpoints [Ben+21]. Mod-
els trained on these huge public datasets encode and exhibit biases based on protected characteristics, in-
cluding gender, race, religion, and disability [BCC19; Kur+19; TC19; Zha+20; Hut+20]. These training
data biases can translate into real-world harm, where, as an example, a recidivism model falsely flagged
black defendants as high risk at twice the rate of white defendants [Ang+16].
Understanding the data and its relationship to trained models is essential for building trustworthy ML
systems. However, it can be very difficult to answer even basic questions about the relationship between
training data and model predictions; for example:

1. Is a prediction well-supported by the training data, or was the prediction just random?
2. Which portions of the training data improve a prediction? Which portions make it worse?
3. Which instances in the training set caused the model to make a specific prediction?

One strategy to address basic questions like those above is to render them moot by exclusively using
simple, transparent model classes [Lip18]. Evidence exists that this “interpretable-only” strategy may be
appropriate in some settings [Kni17]. However, even interpretable model classes can be grossly affected by
training data issues [Hub81; CHW82; CW82]. Moreover, as the performance penalty of interpretable models
grows, their continued use becomes harder to justify.
With the growing use of black-box models, we need better methods to analyze and understand black-box
model decisions. Otherwise, society must carry the burden of black-box failures.

1.1 Relating Models and Their Training Data

All model decisions are rooted in the training data. Training data influence analysis (also known as data
valuation [GZ19; Jia+19a; KCC23] and data attribution [Par+23; NSO23; DG23]) partially demystifies
the relationship between training data and model predictions by determining how to apportion credit (and
blame) for specific model behavior to the training instances [SR88; KL17; Yeh+18; Pru+20]. Essentially,
influence analysis’s objective is to answer the question: What is each training instance’s effect on a model ?
An instance’s “effect” is with respect to some specific perspective. For example, an instance’s effect may be
quantified as the change in model performance when some instance is deleted from the training data. The
effect can also be relative, e.g., whether one training instance changes the model more than another.
Influence analysis emerged alongside the initial study of linear models and regression [Jae72; CW82].
This early analysis focused on quantifying how worst-case perturbations to the training data affected the
final model parameters. The insights gained from early influence analysis contributed to the development of
numerous methods that improved model robustness and reduced model sensitivity to training outliers [Hog79;
Rou94].
Since these early days, machine learning models have grown substantially in complexity and opac-
ity [Dev+19; KSH12; Dos+21]. Training datasets have also exploded in size [BF21]. These factors combine
to make training data influence analysis significantly more challenging where, for multilayer parametric mod-
els (e.g., neural networks), determining a single training instance’s exact effect can be NP-complete in the
worst case [BR92].
In practice, influence may not need to be measured exactly. Influence estimation methods provide an
approximation of training instances’ true influence. Influence estimation is generally much more computation-
ally efficient and is now the approach of choice [Sch+22]. However, modern influence estimators achieve their
efficiency via various assumptions about the model’s architecture and learning environment [KL17; Yeh+18;
GZ19]. These varied assumptions result in influence estimators having different advantages and disadvan-
tages as well as in some cases, even orthogonal perspectives on the definition of influence itself [Pru+20].

1.2 Our Contributions

To the extent of our knowledge, there has not yet been a comprehensive review of these differing perspectives
of training data influence, much less of the various methods themselves. This paper fills in that gap by
providing the first comprehensive survey of existing influence analysis techniques. We describe how these
various methods overlap and, more importantly, the consequences – both positive and negative – that arise
out of their differences. We provide this broad and nuanced understanding of influence analysis so that
ML researchers and practitioners can better decide which influence analysis method best suits their specific
application objectives [Sch+22].
Although we aim to provide a comprehensive survey of influence analysis, we cannot cover every method
in detail. Instead, we focus on the most impactful methods so as not to distract from the key takeaways. In
particular, we concentrate on influence analysis methods that are general [CW82; GZ19; FZ20] or targeted
towards parametric models [KL17; Yeh+18; Pru+20; Che+21] with less emphasis on non-parametric meth-
ods [Sha+18b; Jia+19a; BHL23]. Multiple other research areas are based on ranking and subsampling train-
ing instances including data pruning [Yan+23], coreset selection [BLK17; Fel20a], active learning [Ren+21],
and submodular dataset selection [WIB15], but these topics are beyond the scope of this work.1
In the remainder of this paper, we first standardize the general notation used throughout this work
(Sec. 2). Section 3 reviews the various general formulations through which training data influence is viewed.
We also categorize and summarize the properties of the seven most impactful influence analysis methods.
Sections 4 and 5 describe these foundational influence methods in detail. For each method, we (1) formalize
the associated definition of influence and how it is measured, (2) detail the formulation’s strengths and
weaknesses, (3) enumerate any related or derivative methods, and (4) explain the method’s time, space, and
storage complexities. Section 6 reviews various learning tasks where influence analysis has been applied. We
provide our perspective on future directions for influence analysis research in Section 7.

2 General Notation

This section details our primary notation. In cases where a single influence method requires custom nomen-
clature, we introduce the unique notation alongside discussion of that method.2 Supplemental Section A
provides a full nomenclature reference.
Let [r] denote the set of integers {1, . . . , r}. A ∼_m B denotes that the cardinality of set A is m and that A is drawn uniformly at random (u.a.r.) from set B. For singleton set A (i.e., |A| = 1), the sampling notation is simplified to A ∼ B. Let 2^A denote the power set of any set A. Set subtraction is denoted A \ B. For singleton B = {b}, set subtraction is simplified to A \ b.
The zero vector is denoted ⃗0, with the vector's dimension implicit from context. 1[a] is the indicator function, where 1[a] = 1 if predicate a is true and 0 otherwise.
Let x ∈ X ⊆ R^d denote an arbitrary feature vector, and let y ∈ Y be a dependent value (e.g., label, target). Training set D := {zi}_{i=1}^n consists of n training instances, where each instance is a tuple zi := (xi, yi) ∈ Z and Z := X × Y. (Arbitrary) test instances are denoted zte := (xte, yte) ∈ Z. Note that yte need not be xte's true dependent value; yte can be any value in Y. Throughout this work, subscripts "i" and "te" entail that the corresponding symbol applies to an arbitrary training and test instance, respectively.
Model f : X → Y is parameterized by θ ∈ R^p, where p := |θ|; f is trained on (a subset of) dataset D. Most commonly, f performs either classification or regression, although more advanced model classes (e.g., generative models) are also considered. Model performance is evaluated using a loss function ℓ : Y × Y → R. Let L(z; θ) := ℓ(f(x; θ), y) denote the empirical risk of instance z = (x, y) w.r.t. parameters θ. By convention, a smaller risk is better.
1 Section 3.3 briefly contrasts how the objectives of these related areas align with influence analysis's objectives.
2 Supplemental Table 4 provides a reference for all notation specific to a single influence analysis method.
This work primarily focuses on overparameterized models with p ≫ d, where d is the data dimension.
Such models are almost exclusively trained using first-order optimization algorithms (e.g., gradient descent),
which proceed iteratively over T iterations. Starting from initial parameters θ(0) , the optimizer returns at
the end of each iteration, t ∈ [T ], updated model parameters θ(t) , where θ(t) is generated from previous
parameters θ(t−1) , loss function ℓ, batch B (t) ⊆ D, learning rate η (t) > 0, and weight decay (L2 ) strength
λ ≥ 0. Training gradients are denoted ∇θ L(zi; θ^(t)). The training set's empirical risk Hessian for iteration t is denoted H_θ^(t) := (1/n) Σ_{zi∈D} ∇²_θ L(zi; θ^(t)), with the corresponding inverse risk Hessian denoted (H_θ^(t))⁻¹. Throughout this work, superscript "(t)" entails that the corresponding symbol applies to training iteration t.
Some models may be trained on data other than full training set D, e.g., subset D \ zi. Let D ⊂ D denote an alternate training set, and denote model parameters trained on D as θ_D^(t). For example, θ_{D\zi}^(T) are the final parameters for a model trained on all of D except training instance zi ∈ D. When training on all of the training data, subscript D is dropped, i.e., θ^(t) ≡ θ_D^(t).

3 Overview of Influence and Influence Estimation

As Section 1 explains, training data influence’s objective is to quantify the “effect” of one or more train-
ing instances on a model. This effect’s scope can be as localized as an individual model prediction, e.g.,
f (xte ; θ(T ) ); the effect’s scope can also be so broad as to encompass the entire test data distribution.
Positive influence entails that the training instance(s) improve some quality measure, e.g., risk L(zte ; θ(T ) ).
Negative influence means that the training instance(s) make the quality measure worse. Training instances
with positive influence are referred to as proponents or excitatory examples. Training instances with negative
influence are called opponents or inhibitory examples [KL17; Yeh+18].
Highly expressive, overparameterized models remain functionally black boxes [KL17]. Understanding
why a model behaves in a specific way remains a significant challenge [BP21], and the inclusion or removal
of even a single training instance can drastically change a trained model’s behavior [Rou94; BF21]. In the
worst case, quantifying one training instance’s influence may require repeating all of training.
Since measuring influence exactly may be intractable or unnecessary, influence estimators – which only
approximate the true influence – are commonly used in practice. As with any approximation, influence
estimation requires making trade-offs, and the various influence estimators balance these design choices
differently. This in turn leads influence estimators to make different assumptions and rely on different
mathematical formulations.
When determining which influence analysis methods to highlight in this work, we relied on two primary
criteria: (1) a method’s overall impact and (2) the method’s degree of novelty in relation to other approaches.
In particular, we concentrate on influence analysis methods that are either model architecture agnostic or
that are targeted towards parametric models (e.g., neural networks). Nonetheless, we briefly discuss non-
parametric methods as well.
The remainder of this section considers progressively more general definitions of influence.

3.1 Pointwise Training Data Influence

Pointwise influence is the simplest and most commonly studied definition of influence. It quantifies how
a single training instance affects a model’s prediction on a single test instance according to some quality
measure (e.g., test loss). Formally, a pointwise influence analysis method is a function I : Z × Z → R, with the pointwise influence of training instance zi on test instance zte denoted I(zi, zte). Pointwise influence estimates are denoted Î(zi, zte) ∈ R. Note that the model architecture, training algorithm, full training set (D), and even the random seed can (significantly) affect an instance's influence. To improve clarity and readability, we treat these parameters as fixed and implicit in our nomenclature for I and Î.
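To make the pointwise signature concrete, the sketch below expresses an influence estimator as a Python callable; the protocol name and argument names are illustrative assumptions rather than notation from this survey, and the fixed quantities above (model, training algorithm, D, random seed) are assumed to be bound at construction time.

from typing import Protocol, Tuple
import numpy as np

Instance = Tuple[np.ndarray, float]  # an element z = (x, y) of Z

class PointwiseInfluence(Protocol):
    """I-hat : Z x Z -> R. The model architecture, training algorithm,
    full training set D, and random seed are fixed and implicit here,
    mirroring the convention used for I and I-hat in the text."""

    def __call__(self, z_i: Instance, z_te: Instance) -> float:
        ...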

Figure 1: Outlier Pointwise Influence on Least-Squares Regression (plot of inliers, a single outlier, and least-squares fits to the inliers only and to all instances): Influence of a single outlier on a least-squares model where in-distribution data are generated from the linear distribution y = 2x. The single outlier sample (x = 5, y = 1.2) shifts the fit away from the inlier-only least-squares linear model so substantially that a least-squares model trained on all instances predicts all training y values poorly. Adapted from Rousseeuw and Leroy [RL87, Fig. 2(b)].

Below, we briefly review early pointwise influence analysis contributions and then transition to a discus-
sion of more recent pointwise methods.

3.1.1 Early Pointwise Influence Analysis

The earliest notions of pointwise influence emerged out of robust statistics – specifically the analysis of
training outliers’ effects on linear regression models [Coo77]. Given training set D, the least-mean squares
linear model parameters are3

θ∗ := arg min_θ (1/|D|) Σ_{(xi, yi) ∈ D} (yi − θ⊺xi)².    (1)

Observe that this least-squares estimator has a breakdown point of 0 [Rou94]. This means that least-squares
regression is completely non-robust where a single training data outlier can shift model parameters θ∗
arbitrarily. For example, Figure 1 visualizes how a single training data outlier can induce a nearly
orthogonal least-squares model. Put simply, an outlier training instance’s potential pointwise influence on
a least-squares model is unbounded. Unbounded influence on the model parameters equates to unbounded
influence on model predictions.
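To make this concrete, the short numpy sketch below, which mirrors the setup of Figure 1, fits ordinary least squares with and without a single outlier; the specific sample values are assumptions chosen only to echo the figure.

import numpy as np

# Inliers drawn exactly from y = 2x; one outlier at (x=5, y=1.2).
x_in = np.arange(1.0, 5.0)           # 1, 2, 3, 4
y_in = 2.0 * x_in
x_all = np.append(x_in, 5.0)
y_all = np.append(y_in, 1.2)

def fit_ls(x, y):
    """Least-squares fit of y ~ slope * x + bias."""
    A = np.column_stack([x, np.ones_like(x)])
    slope, bias = np.linalg.lstsq(A, y, rcond=None)[0]
    return slope, bias

print("inliers only:", fit_ls(x_in, y_in))    # approximately (2.0, 0.0)
print("with outlier:", fit_ls(x_all, y_all))  # slope pulled far below 2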
Early influence analysis methods sought to identify the training instance that was most likely to be an
outlier [Sri61; TMB73]. A training outlier can be defined as the training instance with the largest negative
influence on prediction f (xte ; θ∗ ). Intuitively, each training instance’s pointwise influence can be measured
by training n := |D| models, where each model’s training set leaves out a different training instance. These
n models would then be compared to identify the outlier.4 However, such repeated retraining is expensive
and so more efficient pointwise influence analysis methods were studied.
3 For simplicity, the bias term is considered part of θ.
4 Section 4.1 formalizes how to measure pointwise influence by repeatedly retraining with a different training instance left out of the training set each time.

Different assumptions about the training data distribution lead to different definitions of the most likely
outlier. For example, Srikantan [Sri61], Snedecor and Cochran [SC68], and Ellenberg [Ell76] all assume that
training data outliers arise from mean shifts in normally distributed training data. Under this constraint,
their methods all identify the maximum likelihood outlier as the training instance with the largest abso-
lute residual, |yi − θ∗ ⊺ xi |. However, Cook and Weisberg [CW82] prove that under different distributional
assumptions (e.g., a variance shift instead of a mean shift), the maximum likelihood outlier may not have
the largest residual.
These early influence analysis results demonstrating least-squares fragility spurred development of more
robust regressors. For instance, Rousseeuw [Rou94] replaces Eq. (1)’s mean operation with median; this sim-
ple change increases the breakdown point of model parameters θ∗ to the maximum value, 50%. In addition,
multiple robust loss functions have been proposed that constrain or cap outliers’ pointwise influence [Hub64;
BT74; JW78; Lec89].
As more complex models grew in prevalence, influence analysis methods similarly grew in complexity.
In recent years, numerous influence analysis methods targeting deep models have been proposed. We briefly
review the most impactful, modern pointwise influence analysis methods next.

3.1.2 Modern Pointwise Influence Analysis

Figure 2 provides a taxonomy of the seven most impactful modern pointwise influence analysis methods.
Below each method appears a list of closely related and derivative approaches. Modern influence analysis
methods broadly categorize into two primary classes, namely:

• Retraining-Based Methods: Measure the training data’s influence by repeatedly retraining model f
using different subsets of training set D.
• Gradient-Based Influence Estimators: Estimate influence via the alignment of training and test in-
stance gradients either throughout or at the end of training.

An in-depth comparison of these influence analysis methods requires detailed analysis so we defer the exten-
sive discussion of these two categories to Sections 4 and 5, respectively. Table 1 summarizes the key properties
of Figure 2’s seven methods – including comparing each method’s assumptions (if any), strengths/weaknesses,
and asymptotic complexities. These three criteria are also discussed when detailing each of these methods
in the later sections.

3.2 Alternative Perspectives on Influence

Note that pointwise effects are only one perspective on how to analyze the training data’s influence. Below we
briefly summarize six alternate, albeit less common, perspectives of training data influence. While pointwise
influence is this work’s primary focus, later sections also contextualize existing influence methods w.r.t. these
alternate perspectives where applicable.
(1) Recall that pointwise influence quantifies the effect of a single training instance on a single test prediction.
In reality, multiple related training instances generally influence a prediction as a group [FZ20], where group
members have a total effect much larger than the sum of their individual effects [BYF20; Das+21; HL22].
Group influence quantifies a set of training instances’ total, combined influence on a specific test prediction.
We use very similar notation to denote group and pointwise influence. The only difference is that for
group influence, the first parameter of function I is a training (sub)set instead of an individual training
instance; the same applies to group influence estimates Î. For example, given some test instance zte, the entire training set's group influence and group influence estimate are denoted I(D, zte) and Î(D, zte), respectively.
In terms of magnitude, previous work has shown that the group influence of a related set of training
instances is generally lower bounded by the sum of the set's pointwise influences [Koh+19]. Put simply,

Influence Analysis Methods

Retraining-Based
  • Leave-One-Out: kNN LOO, LeafRefit, Inf. Sketching
  • Downsampling: C-Score, Generative
  • Shapley Value: Interaction Index, Shapley-Taylor, TMC-Shapley, G-Shapley, kNN Shapley, Beta Shapley, Banzhaf Value, AME, SHAP, Neuron Shapley

Gradient-Based (Static)
  • Influence Func.: FastIF, Arnoldi IF, LeafInfluence, Group IF, Second-Order, RelatIF, Renorm. IF
  • Representer Pt.: High Dim. Rep., RPS-LJE, TREX

Gradient-Based (Dynamic)
  • TracIn: TracInCP, TracInRP, TracIn-Last, VAE-TracIn, TracInAD, TracInWE, BoostIn, GAS
  • HyDRA: SGD-Influence

Figure 2: Influence Analysis Taxonomy: Categorization of the seven primary pointwise influ-
ence analysis methods. Section 4 details the three primary retraining-based influence methods,
leave-one-out (Sec. 4.1), Downsampling (Sec. 4.2), and Shapley value (Sec. 4.3). Section 5 details
gradient-based static estimators influence functions (Sec. 5.1.1) and representer point (Sec. 5.1.2)
as well as dynamic estimators TracIn (Sec. 5.2.1) and HyDRA (Sec. 5.2.2). Closely-related and
derivative estimators are shown as a list below their parent method. See supplemental Table 5 for
the formal mathematical definition of all influence methods and estimators. Due to space, each
method’s citation is in supplemental Table 6.

Table 1: Influence Analysis Method Comparison: Comparison of the complexity, assump-
tions, strengths, and limitations of the seven primary influence estimators detailed in Sections 4
and 5. Recall that n, p, and T are the training-set size, model parameter count, and training
iteration count, respectively. LOO is the leave-one-out influence. “Full time complexity” denotes
the time required to calculate influence for the first test instance; “incremental complexity” is
the added time required for each subsequent test instance. “Storage complexity” represents the
amount of memory required to persistently save any additional model parameters needed by the
influence method. For HyDRA and the three retraining-based methods, the storage complexity is
implementation dependent and is marked with an asterisk. We report each method’s worst-case
storage complexity; for the retraining-based methods, this worst-case storage complexity yields the
best-case incremental complexity. Static, gradient-based estimators – influence functions and rep-
resenter point – require no additional storage. Differentiability encompasses both model f and loss
function ℓ except in the case of representer point where only the loss function must be differentiable.
All criteria below are discussed in detail alongside the description of each method.
                          Retraining-Based                     Gradient-Based
                                                            Static                  Dynamic
                        LOO       Downsamp.   Shapley     Inf. Func.  Rep. Pt.   TracIn     HyDRA

Section reference       4.1       4.2         4.3         5.1.1       5.1.2      5.2.1      5.2.2
Full time complexity    O(nT)     O(KT)       O(2^n T)    O(np)       O(n)       O(npT)     O(npT)
Incremental complexity  O(n)      O(K)        O(2^n)      O(np)       O(n)       O(npT)     O(np)
Space complexity        O(n + p)  O(n + p)    O(n + p)    O(n + p)    O(n + p)   O(n + p)   O(np)
Storage complexity      O(np)*    O(Kp)*      O(2^n p)*   0           0          O(pT)      O(pT + np)*
Assumes differentiable? ✓ ✓* ✓ ✓
Assumes convexity? ✓ ✓
Assumes stationarity? ✓ ✓
Assumes linear model? ✓
Optimizer specific? ✓ ✓
Uses Hessian? ✓ ✓
Any model class? ✓ ✓ ✓
Estimates LOO? ✓ ✓ ✓ ✓ ✓
Group influence? ✓ ✓ ✓
Hyperparam. sensitive? ✓ ✓
Generative models? ✓ ✓ ✓
High upfront cost? ✓ ✓ ✓ ✓
High instance cost? ✓ ✓ ✓
Amortizable? ✓ ✓ ✓ ✓ ✓

the true influence of a coherent group is more than the sum of its parts, or formally, for coherent D ⊆ D, it
often holds that

|I(D, zte)| > Σ_{zi ∈ D} |I(zi, zte)|.    (2)

Existing work studying group influence is limited. Later sections note examples where any of the seven
primary pointwise influence methods have been extended to consider group effects.
(2) Joint influence extends influence to consider multiple test instances collectively [Jia+22; Che+22]. These
test instances may be a specific subpopulation within the test distribution – for example in targeted data
poisoning attacks [Jag+21; Wal+21]. The test instances could also be a representative subset of the entire
test data distribution – for example in coreset selection [BMK20] or indiscriminate poisoning attacks [BNL12;
Fow+21].
Most (pointwise) influence analysis methods are additive meaning for target set Dte ⊆ Z, the joint
(pointwise) influence simplifies to

I(zi, Dte) = Σ_{zte ∈ Dte} I(zi, zte).    (3)

Additivity is not a requirement of influence analysis, and there are provably non-additive influence estima-
tors [YP21].
(3) Overparameterized models like deep networks are capable of achieving near-zero training loss in most set-
tings [Bar+20; Fel20b; DAm+20]. This holds even if the training set is large and randomly labeled [Zha+17;
Arp+17]. Near-zero training loss occurs because deep models often memorize some training instances.
Both Pruthi et al. [Pru+20] and Feldman and Zhang [FZ20] separately define a model's memorization5 of training instance zi as the pointwise influence of zi on itself. Formally,

Mem(zi) := I(zi, zi) ≈ Î(zi, zi).    (4)

(4) Cook’s distance measures the effect of training instances on the model parameters themselves [Coo77,
Eq. (5)]. Formally, the pointwise Cook’s distance of zi ∈ D is
ICook(zi) := θ^(T) − θ^(T)_{D\zi}.    (5)

Eq. (5) trivially extends to groups of training instances where for any D ⊆ D

ICook(D) := θ^(T) − θ^(T)_{D\D}.    (6)

Cook’s distance is particularly relevant for interpretable model classes where feature weights are most trans-
parent. This includes linear regression [RL87; Woj+16] and decision trees [BHL23].
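A minimal sketch of Eq. (5) in code appears below; train is an assumed user-supplied callable returning a flat parameter vector, training is assumed deterministic, and whether to summarize the returned difference with a norm is left to the caller.

import numpy as np

def cooks_distance_like(train, dataset, i):
    """Pointwise effect of deleting dataset[i] on the learned parameters (cf. Eq. (5)).

    train(data) -> flat parameter vector (np.ndarray); assumed deterministic.
    dataset: list of training instances.
    Returns the raw parameter difference; take a norm for a scalar summary.
    """
    theta_full = np.asarray(train(dataset))
    theta_without = np.asarray(train(dataset[:i] + dataset[i + 1:]))
    return theta_full - theta_without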
(5) All definitions of influence above consider training instances’ effects w.r.t. a single instantiation of a
model. Across repeated stochastic retrainings, a training instance’s influence may vary – potentially substan-
tially [BPF21; SD21; Ras+22]. Expected influence is the average influence across all possible instantiations
within a given model class [KS21; WJ23]. Expected influence is particularly useful in domains where the
random component of training is unknowable a priori. For example, with poisoning and backdoor attacks,
an adversary crafts malicious training instances to be highly influential in expectation across all random
parameter initializations and batch orderings [Che+17; Sha+18a; Fow+21].
Expected influence generalizes to consider group effects. Existing related work focuses on counterfactuals
such as, “what is the expected prediction for xte if a model is trained on some arbitrary subset of D [Ily+22;
KCC23]?” Other work seeks to predict model parameters θ(T ) given an arbitrary training subset [Zen+23].
(6) Observe that all preceding definitions view influence as a specific numerical value to measure/estimate.
Influence analysis often simplifies to a relative question of whether one training instance is more influential than another. An influence ranking orders (groups of) training instances from most positively influential to most negatively influential. These rankings are useful in a wide range of applications [KZ22; WJ23], including data cleaning and poisoning attack defenses as discussed in Section 6.

5 Pruthi et al. [Pru+20] term "memorization" as self-influence. We use Feldman and Zhang's [FZ20] terminology here since it is more consistent with other work [vW21; KWR22].

3.3 Topics Related to Influence Analysis

All influence analysis methods we highlight in Sections 4 and 5 estimate a function that quantifies the
impact of specific training examples on a given model or prediction. For the most part, these methods take
an ablation perspective, i.e., measuring how much a model changes when removing specific training examples.
However, there are several other research areas related to analyzing the impact of different subsets of the
training data, albeit with somewhat different methods or objectives. We briefly describe some of these topics
below.6
Data pruning methods such as coresets [BLK17] also consider the impact of removing examples, but they
typically consider removing many examples (rather than one or a few) with the primary goal of increasing
computational efficiency. A coreset DCS ⊂ D is a (weighted) set of points that can stand in for the overall
training data when measuring a cost function, cost(DCS , Q) where Q ∈ Q is a solution in solution space Q.
DCS is an ε-coreset if it approximates the cost function within a factor of ϵ > 0:

|cost(DCS , Q) − cost(D, Q)| ≤ ε cost(D, Q). (7)

Several techniques exist to efficiently find DCS [MBL20; Fel20a; Tuk+23]. These include methods loosely
based on weighted importance sampling, where each training instance’s sampling probability is proportional
to the instance’s influence [BLK17].
Coresets let us learn a model over a much smaller set of points, while still yielding a model that is within
a factor of 1 + ε of the optimal loss on the original training set. Therefore, a coreset represents a sufficient
set of points for a given task, while most influence estimation methods identify the most necessary points —
the points without which performance would decline, even given many other points from the original dataset.
Another difference is that coresets are often motivated by efficiency concerns, whereas influence estimation
is more motivated by the need to understand the data and its impact on a model — specifically, a model
trained on the entire training data.
Coreset construction often involves submodular optimization [Bil22], so that an efficient, greedy approach
finds a nearly-optimal set of points. However, this also means that if there are multiple, equally-important
points, submodular optimization will select one and skip the others as redundant. This “winner-take-all”
approach is in stark contrast to most influence estimation methods, which tend to assign similar importance
to similar points.
Active learning seeks to maximize a model’s performance while annotating as little training data as
possible [Ren+21]. Like influence estimation, active learning estimates the relative value different data
points would have in fitting a model. However, unlike influence estimation, this is done without knowledge
of the labels of these points. Furthermore, the goal is maximizing performance more than understanding the
data, increasing efficiency, or precisely matching the loss on the full training data.
Often, the data points to label are chosen greedily by identifying the training instance whose labeling
would most positively influence the model (in expectation). Quantifying each unlabeled instance’s true
influence at each active learning iteration may be prohibitive, so influence estimation techniques are often
used to quantify each remaining unlabeled instance’s marginal contribution at a given iteration [Liu+21].
With this broad perspective on influence analysis and related concepts in mind, we transition to focusing
on specific influence analysis methods in the next two sections.
6 Section 6 ("Applications of Influence Analysis") discusses additional tasks that have leveraged influence analysis to achieve their own meta-objectives (e.g., adversarial robustness, model explainability, etc.).
4 Retraining-Based Influence Analysis

Training instances can only influence a model if they are used during training. As Section 3.1.2 describes in
the context of linear regression, one method to measure influence just trains a model with and without some
instance; influence is then defined as the difference in these two models’ behavior. This basic intuition is the
foundation of retraining-based influence analysis, and this simple formulation applies to any model class –
parametric or non-parametric.
Observe that the retraining-based framework makes no assumptions about the learning environment. In
fact, this simplicity is one of the primary advantages of retraining-based influence. For comparison, Table 1
shows that all gradient-based influence estimators make strong assumptions – some of which are known not
to hold for deep models (e.g., convexity). However, retraining’s flexibility comes at the expense of high
(sometimes prohibitive) computational cost.
Below, we describe three progressively more complex retraining-based influence analysis methods. Each
method mitigates weaknesses of the preceding method – in particular, devising techniques to make retraining-
based influence more viable computationally.
Remark 1: This section treats model training as deterministic where, given a fixed training set, training
always yields the same output model. Since the training of modern models is mostly stochastic, retraining-
based estimators should be represented as expectations over different random initializations and batch or-
derings. Therefore, (re)training should be repeated multiple times for each relevant training (sub)set with a
probabilistic average taken over the valuation metric [Lin+22]. For simplicity of presentation, expectation
over randomness is dropped from the influence and influence estimator definitions below.
Remark 2: Section 2 defines D as a supervised training set. The three primary retraining-based influence
analysis methods detailed below also generalize to unsupervised and semi-supervised training.
Remark 3: When calculating retraining’s time complexity below, each training iteration’s time complexity
is treated as a constant cost. This makes the time complexity of training a single model O(T ). Depending on
the model architecture and hyperparameter settings, a training iteration’s complexity may directly depend
on training-set size n or model parameter count p.
Remark 4: It may be possible to avoid full model retraining by using machine unlearning methods capable
of certifiably “forgetting” training instances [Guo+20; BL21; Ngu+22; Eis+22]. The asymptotic complexity
of such methods is model-class specific and beyond the scope of this work. Nonetheless, certified deletion
methods can drastically reduce the overhead of retraining-based influence analysis.

4.1 Leave-One-Out Influence

Leave-one-out (LOO) is the simplest influence measure described in this work. LOO is also the oldest, dating
back to Cook and Weisberg [CW82] who term it case deletion diagnostics.
As its name indicates, leave-one-out influence is the change in zte ’s risk due to the removal of a single
instance, zi , from the training set [KL17; BF21; Jia+21a]. Formally,
ILOO(zi, zte) := L(zte; θ^(T)_{D\zi}) − L(zte; θ^(T)),    (8)

where θ^(T)_{D\zi} are the final model parameters when training on subset D \ zi and θ^(T) are the final model parameters trained on all of D.
Measuring the entire training set’s LOO influence requires training (n + 1) models. Given a deterministic
model class and training algorithm (e.g., convex model optimization [BV04]), LOO is one of the few influence
measures that can be computed exactly in polynomial time w.r.t. training-set size n and iteration count T .
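A direct, unoptimized rendering of Eq. (8) is sketched below; train and loss are assumed user-supplied callables, training is assumed deterministic as discussed above, and none of the parallelism or model caching described next is attempted.

def loo_influence(train, loss, dataset, z_test):
    """Leave-one-out influence of every training instance on z_test (Eq. (8)).

    train(dataset) -> model parameters; assumed deterministic.
    loss(z, params) -> scalar risk L(z; params).
    dataset: list of n training instances. Returns n influence values.
    """
    theta_full = train(dataset)
    base_risk = loss(z_test, theta_full)
    influences = []
    for i in range(len(dataset)):
        subset = dataset[:i] + dataset[i + 1:]   # D \ z_i
        theta_without = train(subset)
        influences.append(loss(z_test, theta_without) - base_risk)
    return influences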

4.1.1 Time, Space, and Storage Complexity

Training a single model has time complexity O(T ) (see Remark 3). By additivity, training (n + 1) models has
total time complexity O(nT ). Since these (n + 1) models are independent, they can be trained in parallel.
Pointwise influence analysis always has space complexity of at least O(n), i.e., the space taken by the
n influence values ∀i I (zi , zte ). Training a single model has space complexity O(p), where p := |θ|; this
complexity scales linearly with the number of models trained in parallel. Table 1 treats the level of training
concurrency as a constant factor, which is why LOO’s total space complexity is listed as O(n + p).
A naive implementation of LOO would train the n additional models and immediately discard them
after measuring zte ’s test loss. This simple version of LOO has O(1) storage complexity. If instead the
(n + 1) models are stored, analysis of subsequent test instances requires no additional retraining. This
drastically reduces LOO’s incremental time complexity for subsequent instances to just O(n) forward passes
– a huge saving.7 Note that this amortization of the retraining cost induces an O(np) storage complexity as
listed in Table 1.

4.1.2 Strengths and Weaknesses

Leave-one-out influence’s biggest strength is its simplicity. LOO is human-intelligible – even by laypersons.
For that reason, LOO has been applied to ensure the fairness of algorithmic decisions [BF21]. Moreover, like
all methods in this section, LOO’s simplicity allows it to be combined with any model architecture.
LOO’s theoretical simplicity comes at the price of huge upfront computational cost. Training some state-
of-the-art models from scratch even once is prohibitive for anyone beyond industrial actors [Dis+21]. For
even the biggest players, it is impractical to train (n + 1) such models given huge modern datasets [BPF21].
The climate effects of such retraining also cannot be ignored [SGM20].
LOO’s simple definition in Eq. (8) is premised on deterministic training. However, even when training
on the same data, modern models may have significant predictive variance for a given test instance [BF21;
WJ23]. This variance makes it difficult to disentangle the effect of an instance’s deletion from training’s
intrinsic variability [BPF21]. For a single training instance, estimating the LOO influence within a standard
deviation of σ requires training Ω(1/σ²) models. Therefore, estimating the entire training set's LOO influence
requires training Ω(n/σ²) models – further exacerbating LOO's computational infeasibility [FZ20].
These limitations notwithstanding, LOO’s impact on influence analysis research is substantial. Many
pointwise influence analysis methods either directly estimate the leave-one-out influence – e.g., Downsam-
pling (Sec. 4.2), influence functions (Sec. 5.1.1), HyDRA (Sec. 5.2.2) – or are very similar to LOO –
e.g., Shapley value (Sec. 4.3).

4.1.3 Related Methods

Efficient Nearest-Neighbor LOO Although LOO has poor upfront and storage complexities in general,
it can be quite efficient for some model classes – particularly instance-based learners [AKA91]. For example,
Jia et al. [Jia+21a] propose the kNN leave-one-out (kNN LOO) estimator, which calculates the LOO influence
over a surrogate k-nearest neighbors classifier instead of over target model f . kNN LOO relies on a simple
two-step process. First, the features of test instance zte and training set D are extracted using a pretrained
model. Next, a kNN classifier’s LOO influence is calculated exactly [Jia+21a, Lemma 1] using these extracted
features. Jia et al. prove that kNN LOO influence only requires O(n log n) time – significantly faster in
practice than vanilla LOO’s O(nT ) complexity. Jia et al. also demonstrate empirically that kNN LOO and
vanilla LOO generate similar influence rankings across various learning domains and tasks.
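For intuition only, the sketch below computes a kNN surrogate's LOO influence by brute force over precomputed features with a 0/1 risk; it illustrates the idea but is not Jia et al.'s O(n log n) closed-form estimator, and the pretrained feature-extraction step is assumed to have already happened.

import numpy as np

def knn_risk(features, labels, x_test, y_test, k):
    """0/1 risk of a k-nearest-neighbor classifier on a single test point.

    features: (n, d) array; labels: nonnegative integer class ids, shape (n,).
    """
    dists = np.linalg.norm(features - x_test, axis=1)
    nearest = labels[np.argsort(dists)[:k]]
    pred = np.bincount(nearest).argmax()
    return float(pred != y_test)

def knn_loo_influence(features, labels, x_test, y_test, k=5):
    """Brute-force LOO influence of each training point on the kNN surrogate."""
    base = knn_risk(features, labels, x_test, y_test, k)
    influences = np.empty(len(labels))
    for i in range(len(labels)):
        mask = np.ones(len(labels), dtype=bool)
        mask[i] = False
        influences[i] = knn_risk(features[mask], labels[mask], x_test, y_test, k) - base
    return influences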
7 LOO's incremental computational cost can be (significantly) reduced in practice via batching.

Efficient LOO Estimation in Decision Tree Ensembles Sharchilev et al. [Sha+18b] propose LeafRefit, an efficient LOO estimator for decision-tree ensembles. LeafRefit's efficiency derives from the sim-
plifying assumption that instance deletions do not affect the trees’ structure. In cases where this assumption
holds, LeafRefit’s tree influence estimates are exact. To the extent of our knowledge, LeafRefit’s suit-
ability for surrogate influence analysis of deep models has not yet been explored.
Cook’s Distance and Linear Regression For least-squares linear regression, Wojnowicz et al. [Woj+16]
show that each training instance’s LOO influence on the model parameters (i.e., Cook’s distance) can be
efficiently estimated by mapping training set D into a lower-dimensional subspace. By the Johnson-Linden-
strauss lemma [JL84], these influence sketches approximately preserve the pairwise distances between the
training instances in D provided the projected dimension is on the order of log n.
LOO Group Influence Leave-one-out can be extended to leave-m-out for any integer m ≤ n.8 Leave-m-out
has time complexity O((n choose m)), which is exponential in the worst case. Shapley value influence [Sha53] (Sec. 4.3)
shares significant similarity with leave-m-out.
As mentioned above, LOO influence serves as the reference influence value for multiple influence estima-
tors including Downsampling, which we describe next.

4.2 Downsampling

Proposed by Feldman and Zhang [FZ20], Downsampling9 mitigates leave-one-out influence’s two primary
weaknesses: (1) computational complexity dependent on n and (2) instability due to stochastic training
variation.
Downsampling relies on an ensemble of K submodels, each trained on a u.a.r. subset of full training set D. Let Dk ∼_m D be the k-th submodel's training set, where ∀k |Dk| = m < n.10 Define Ki := Σ_{k=1}^K 1[zi ∈ Dk] as the number of submodels that used instance zi during training. The Downsampling pointwise influence estimator11 is then

ÎDown(zi, zte) := (1/(K − Ki)) Σ_{k : zi ∉ Dk} L(zte; θ^(T)_{Dk}) − (1/Ki) Σ_{k′ : zi ∈ Dk′} L(zte; θ^(T)_{Dk′}).    (9)

Intuitively, Eq. (9) is the change in zte ’s average risk when zi is not used in submodel training. By holding
out multiple instances simultaneously and then averaging, each Downsampling submodel provides insight
into the influence of all training instances. This allows Downsampling to require (far) fewer retrainings
than LOO.
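Once the K submodels have been trained and their test losses recorded, Eq. (9) reduces to the averaging sketched below; the submodel_losses array and boolean membership matrix are assumptions about how the ensemble's outputs are stored.

import numpy as np

def downsampling_influence(submodel_losses, membership, i):
    """Downsampling estimate of z_i's influence on one test instance (Eq. (9)).

    submodel_losses: shape (K,), loss of each submodel on the test instance.
    membership: boolean array of shape (K, n); membership[k, j] is True
                iff training instance z_j was in submodel k's training set.
    """
    in_mask = membership[:, i]
    if in_mask.all() or (~in_mask).all():
        raise ValueError("z_i must appear in some, but not all, submodels")
    # Mean risk over submodels trained WITHOUT z_i minus mean risk WITH z_i.
    return submodel_losses[~in_mask].mean() - submodel_losses[in_mask].mean()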
Since each of the K training subsets is i.i.d., then

lim_{K→∞} ÎDown(zi, zte) = E_{D′ ∼_m D\zi}[L(zte; θ^(T)_{D′})] − E_{D ∼_{m−1} D\zi}[L(zte; θ^(T)_{D∪zi})].    (10)

8 Leave-m-out influence analysis is also called multiple case deletion diagnostics [RL87].
9 Feldman and Zhang [FZ20] do not specify a name for their influence estimator. Previous work has referred to Feldman and Zhang's method as "subsampling" [BHL23] and as "counterfactual influence" [Zha+21b]. We use "Downsampling" to differentiate Feldman and Zhang's method from the existing, distinct task of dataset subsampling [TB18] while still emphasizing the methods' reliance on repeated training-set sampling.
10 Feldman and Zhang [FZ20] propose setting m = ⌈0.7n⌉.
11 Feldman and Zhang [FZ20] define their estimator specifically for classification. Downsampling's definition in Eq. (9) uses a more general form to cover additional learning tasks such as regression. Feldman and Zhang's original formulation would be equivalent to defining the risk as L(zte; θ^(T)_{Dk}) = 1[yte ≠ f(xte; θ^(T)_{Dk})], i.e., the accuracy subtracted from one.

For sufficiently large m and n, the expected behavior of a model trained on an i.i.d. dataset of size m becomes indistinguishable from one trained on m − 1 i.i.d. instances. Applying this property along with linearity of
expectation and Eq. (8), Eq. (10) reformulates as
lim_{K,n,m→∞} ÎDown(zi, zte) = E_{D′ ∼_{m−1} D\zi}[L(zte; θ^(T)_{D′})] − E_{D ∼_{m−1} D\zi}[L(zte; θ^(T)_{D∪zi})]    (11)

                             = E_{D ∼_{m−1} D\zi}[L(zte; θ^(T)_D) − L(zte; θ^(T)_{D∪zi})]    (12)

                             = E_{D ∼_{m−1} D\zi}[ILOO(zi, zte)].    (13)

Hence, Downsampling is a statistically consistent estimator of the expected LOO influence. This means
that Downsampling does not estimate the influence of training instance zi on a single model instantiation.
Rather, Downsampling estimates zi ’s influence on the training algorithm and model architecture as a whole.
By considering influence in expectation, Downsampling addresses LOO’s inaccuracy caused by stochastic
training’s implicit variance.
In practice, K, n, and m are finite. Nonetheless, Feldman and Zhang [FZ20, Lemma 2.1] prove that,
with high probability, Downsampling's LOO influence estimation error is bounded given K and m/n.

While Downsampling’s formulation above is w.r.t. a single training instance, Downsampling trivially
extends to estimate the expected group influence of multiple training instances. Observe however that the
expected fraction of u.a.r. training subsets that either contains all instances in a group or none of a group
decays geometrically with m m
n and (1 − n ), respectively. Therefore, for large group sizes, K needs to be
exponential in n to cover sufficient group combinations.
Remark 5: Downsampling trains models on data subsets of size m under the assumption that statements
made about those models generalize to models trained on a dataset of size n (i.e., all of D). This assumption
may not hold for small m. To increase the likelihood this assumption holds, Feldman and Zhang propose
fixing m
n = 0.7. This choice balances satisfying the aforementioned assumption against the number of sub-
models since Downsampling requires K and ratio m n combined dictate Ki , i.e., the number of submodels
that are trained on zi .

4.2.1 Time, Space, and Storage Complexity

Downsampling’s complexity analysis is identical to that of LOO (Sec. 4.1.1) except, instead of the time
and storage complexities being dependent on training-set size n, Downsampling depends on submodel
count K. For perspective, Feldman and Zhang’s empirical evaluation used K = 2,000 for ImageNet [Den+09]
(n > 14M) as well as K = 4,000 for MNIST [LeC+98] (n = 60,000) and CIFAR10 [KNH14] (n = 50,000) –
a savings of one to four orders of magnitude over vanilla LOO.
Remark 6: Downsampling’s incremental time complexity is technically O(K + n) ∈ O(n) since pointwise
influence is calculated w.r.t. each training instance. The time complexity analysis above focuses on the
difference in the number of forward passes required by LOO and Downsampling; fewer forward passes
translate to Downsampling being much faster than LOO in practice.

4.2.2 Strengths and Weaknesses

Although more complicated than LOO, Downsampling is still comparatively simple to understand and
implement. Downsampling makes only a single assumption that should generally hold in practice (see
Remark 5). Downsampling is also flexible and can be applied to most applications.
Another strength of Downsampling is its low incremental time complexity. Each test example requires
only K forward passes. These forward passes can use large batch sizes to further reduce the per-instance
cost. This low incremental cost allows Downsampling to be applied at much larger scales than other
methods. For example, Feldman and Zhang [FZ20] measure all pointwise influence estimates for the entire
ImageNet dataset [Den+09] (n > 14M). These large-scale experiments enabled Feldman and Zhang to draw
novel conclusions about neural training dynamics – including that training instance memorization (Eq. (4)) by
overparameterized models is not a bug, but a feature, that is currently necessary to achieve state-of-the-art
generalization results.
In terms of weaknesses, while Downsampling is less computationally expensive than LOO, Downsam-
pling still has a high upfront computational cost. Training multiple models may be prohibitively expensive
even when K ≪ n. Amortization of this upfront training overhead across multiple test instances is beneficial
but by no means a panacea.

4.2.3 Related Methods

Downsampling has two primary related methods.


Generative Downsampling Training instance memorization also occurs in generative models where the
generated outputs are (nearly) identical copies of training instances [KWR22]. van den Burg and Williams
[vW21] extend Downsampling to deep generative models – specifically, density models (e.g., variational
autoencoders [KW14; RMW14]) that estimate posterior probability p(x|P, θ), where P denotes the training
data distribution. Like Downsampling, van den Burg and Williams’s approach relies on training multiple
submodels.12 The primary difference is that van den Burg and Williams consider generative risk

L(xte ; θ(T ) ) = − log p(xte |P, θ(T ) ). (14)

Beyond that, van den Burg and Williams’s method is the same as Downsampling as both methods consider
the LOO influence (9).
Consistency Profile and Score Downsampling’s second closely related method is Jiang et al.’s [Jia+21b]
consistency profile, defined formally as13

C_{m,D}(zte) := −E_{D ∼_m D}[L(zte; θ^(T)_D)].    (15)

By negating the risk in Eq. (15), a higher expected risk corresponds to a lower consistency profile. Consis-
tency profile differs from Downsampling in two ways. (1) Downsampling implicitly considers a single
submodel training-set size m while consistency profile disentangles the estimator from m. (2) Downsam-
pling estimates zi ’s influence on zte while consistency profile considers all of D as a group and estimates
the expected group influence of a random subset D ⊆ D given m := |D|.
Jiang et al. [Jia+21b] also propose the consistency score (C-score), defined formally as

C_D(zte) := E_{m ∼ [n]}[C_{m,D}(zte)],    (16)

where m is drawn uniformly from set [n]. By taking the expectation over all training-set sizes, C-score
provides a total ordering over all test instances. A low C-score entails that zte is harder for the model
to confidently predict. Low C-scores generally correspond to rare/atypical test instances from the tails
of the data distribution. Since Downsampling considers the effect of each training instance individually,
Downsampling may be unable to identify these hard-to-predict test instances – in particular if m is large
enough to cover most data distribution modes.
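Reusing the kind of stored submodel losses that Downsampling produces, the consistency profile for a fixed subset size m is a negated average, and an empirical C-score can average profiles over whichever subset sizes were actually sampled; this is a rough sketch of Eqs. (15) and (16) under that assumption, not Jiang et al.'s exact estimator.

import numpy as np

def consistency_profile(losses):
    """Empirical consistency profile C_{m,D}(z_te) for one subset size m (Eq. (15)).

    losses: risks L(z_te; theta_D) over submodels trained on u.a.r. subsets of size m.
    """
    return -float(np.mean(losses))

def c_score(losses_by_size):
    """Empirical C-score (cf. Eq. (16)): average profile over the sampled subset sizes.

    losses_by_size: dict mapping subset size m -> array of submodel losses.
    """
    profiles = [consistency_profile(losses) for losses in losses_by_size.values()]
    return float(np.mean(profiles))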
The next section introduces the Shapley value, which in essence merges the ideas of Downsampling
and C-score.
12 Rather than training submodels using i.i.d. subsets of D, van den Burg and Williams [vW21] propose training the submodels via repeated d-fold cross-validation. While technically different, van den Burg and Williams's approach is functionally equivalent to Feldman and Zhang's [FZ20] u.a.r. sampling procedure.
13 Jiang et al. [Jia+21b] define their estimator specifically for classification. We present their method more generally to apply to other losses/domains (e.g., regression). As with Downsampling, it is trivial to map between Eq. (15)'s formulation and that of Jiang et al.
4.3 Shapley Value

Derived from cooperative game theory, Shapley value (SV) quantifies the increase in value when a group
of players cooperates to achieve some shared objective [Sha53; SR88]. Given n total players, characteristic
function ν : 2^[n] → R defines the value of any player coalition A ⊆ [n] [Dub75]. By convention, a larger ν(A) is better. Formally, player i's Shapley value w.r.t. ν is

V(i; ν) := (1/n) Σ_{A ⊆ [n]\i} (n−1 choose |A|)⁻¹ [ν(A ∪ i) − ν(A)],    (17)

where (n−1 choose |A|) is the binomial coefficient.
Ghorbani and Zou [GZ19] adapt SV to model training by treating the n instances in training set D
as n cooperative players with the shared objective of training the “best” model.14 For any A ⊆ D, let
ν(A) := −L(zte; θ^(T)_A), where the negation is needed because more "valuable" training subsets have lower risk. Then, zi's Shapley value pointwise influence on zte is

ISV(zi, zte) := (1/n) Σ_{D ⊆ D\zi} (n−1 choose |D|)⁻¹ [L(zte; θ^(T)_D) − L(zte; θ^(T)_{D∪zi})].    (18)

More intuitively, SV is the weighted change in zte ’s risk when zi is added to a random training subset; the
weighting ensures all training subset sizes (|D|) are prioritized equally. Eq. (18) can be viewed as generalizing
the leave-one-out influence, where rather than considering only full training set D, Shapley value averages
the LOO influence across all possible subsets of D.
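For very small n, Eq. (18) can be evaluated exactly by enumerating every subset of D \ zi. The sketch below does so with the same assumed train and loss callables as the earlier LOO sketch; it is meant only to make the weighting concrete, since the enumeration is exponential in n and train must also handle the empty subset (e.g., by returning prior parameters).

from itertools import combinations
from math import comb

def shapley_influence(train, loss, dataset, i, z_test):
    """Exact Shapley value influence of dataset[i] on z_test (Eq. (18)).

    Exponential in len(dataset); only feasible for tiny datasets.
    train(list_of_instances) -> parameters; loss(z, params) -> scalar risk.
    """
    n = len(dataset)
    others = dataset[:i] + dataset[i + 1:]   # D \ z_i
    z_i = dataset[i]
    total = 0.0
    for size in range(n):                    # |D| ranges over 0 .. n-1
        weight = 1.0 / (n * comb(n - 1, size))
        for subset in combinations(others, size):
            subset = list(subset)
            risk_without = loss(z_test, train(subset))
            risk_with = loss(z_test, train(subset + [z_i]))
            total += weight * (risk_without - risk_with)
    return total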
There exist multiple extensions of the Shapley value to the group context [BL23; TYR23; GR99; SDA20].15 The most well-known method is Grabisch and Roubens's [GR99] Shapley interaction index, which for any subset A ⊆ D is defined as

ISV(A, zte) := − Σ_{D ⊆ D\A} [(n − |A| − |D|)! |D|! / (n − |A| + 1)!] Σ_{D′ ⊆ A} (−1)^{|A|−|D′|} L(zte; θ^(T)_{D∪D′}).    (19)

To intuitively understand Shapley interaction index’s inner summation, consider when A consists of only
two instances (e.g., A = {zi , zj }), then
Σ_{D′ ⊆ {zi, zj}} (−1)^{2−|D′|} L(zte; θ^(T)_{D∪D′}) = L(zte; θ^(T)_D) − L(zte; θ^(T)_{D∪zi}) − L(zte; θ^(T)_{D∪zj}) + L(zte; θ^(T)_{D∪{zi,zj}}).    (20)

Grabisch and Roubens [GR99] explain that a positive interaction index entails that the combined group
influence of the training instances in A ⊆ D exceeds the sum of their marginal influences; similarly, a negative
interaction index means that A’s group influence is less than the sum of their marginal influences. If the
interaction index is zero, A’s members have no net interaction. Sundararajan et al. [SDA20] propose an
alternate Shapley group formulation they term the Shapley-Taylor interaction index, which is defined as
IST(A, zte) := −(|A|/n) Σ_{D ⊆ D\A} (n−1 choose |D|)⁻¹ Σ_{D′ ⊆ A} (−1)^{|A|−|D′|} L(zte; θ^(T)_{D∪D′}).    (21)

The Shapley and Shapley-Taylor interaction indices provide different mathematical guarantees, the details
of which extend beyond the scope of this work. We refer the reader to Sundararajan et al. [SDA20] for
additional discussion.
14 "Best" here is w.r.t. some data valuation measure of interest, with different use cases potentially defining "best" differently.
15 Most of these works were proposed in the context of studying the interaction between groups of features. Directly adapting these ideas to groups of training instances is straightforward.
4.3.1 Time, Space, and Storage Complexity

Deng and Papadimitriou [DP94] prove that computing Shapley values is #P-complete. Therefore, in the
worst case, SV requires exponential time to determine exactly assuming P ̸= NP. There has been significant
follow-on work to develop tractable SV estimators, many of which are reviewed in Sec. 4.3.3.
The analysis of SV’s space, storage, and incremental time complexities follows that of LOO (Sec. 4.1.1)
with the exception that SV requires up to 2n models, not just n models as with LOO.

4.3.2 Strengths and Weaknesses

Among all influence analysis methods, SV may have the strongest theoretical foundation with the chain
of research extending back several decades. SV’s dynamics and limitations are well understood, providing
confidence in the method’s quality and reliability. In addition, SV makes minimal assumptions about the
nature of the cooperative game (i.e., model to be trained), meaning SV is very flexible. This simplicity and
flexibility allow SV to be applied to many domains beyond dataset influence as discussed in the next section.
SV has been shown to satisfy multiple appealing mathematical axioms. First, SV satisfies the dummy
player axiom where
$$\forall_{D \subseteq \mathcal{D} \setminus z_i} \;\; L(z_{\text{te}}; \theta_{D \cup z_i}^{(T)}) = L(z_{\text{te}}; \theta_D^{(T)}) \implies \mathcal{I}_{\text{SV}}(z_i, z_{\text{te}}) = 0. \qquad (22)$$
Second, SV is symmetrical, meaning for any $z_i, z_j \in \mathcal{D}$,
$$\forall_{D \subseteq \mathcal{D} \setminus \{z_i, z_j\}} \;\; L(z_{\text{te}}; \theta_{D \cup z_i}^{(T)}) = L(z_{\text{te}}; \theta_{D \cup z_j}^{(T)}) \implies \mathcal{I}_{\text{SV}}(z_i, z_{\text{te}}) = \mathcal{I}_{\text{SV}}(z_j, z_{\text{te}}). \qquad (23)$$

Third, SV is linear [Sha53; Dub75], meaning given any two data valuation metrics ν ′ , ν ′′ and α′ , α′′ ∈ R, it
holds that
V(D; α′ ν ′ + α′′ ν ′′ ) = α′ V(D; ν ′ ) + α′′ V(D; ν ′′ ). (24)
SV’s linearity axiom makes it possible to estimate both pointwise and joint SV influences without repeating
any data collection [GZ19]. Any data value satisfying the dummy player, symmetry, and linearity axioms
is referred to as a semivalue [DNW81; KZ22]. Additional semivalues include leave-one-out (Sec. 4.1) and
Banzhaf value (Sec. 4.3.3) [Ban65].
Furthermore, by evaluating training sets of different sizes, SV can detect subtle influence behavior that
is missed by methods like Downsampling and LOO, which evaluate a single training-set size. Lin et al.
[Lin+22] evidence this phenomenon empirically showing that adversarial training instances (i.e., poison) can
sometimes be better detected with small SV training subsets.
Concerning weaknesses, SV’s computational intractability is catastrophic for non-trivial dataset sizes [KZ22].
For that reason, numerous (heuristic) SV speed-ups have been proposed, with the most prominent ones de-
tailed next.

4.3.3 Related Methods

Monte Carlo Shapley Ghorbani and Zou [GZ19] propose two SV estimators. First, truncated Monte
Carlo Shapley (TMC-Shapley) relies on randomized subset sampling from training set D.16 As a simpli-
fied description of the algorithm, TMC-Shapley relies on random permutations of D; for simplicity, denote
the permutation ordering z1 , . . . , zn . For each permutation, n models are trained where the i-th model’s
training set is instances {z1 , . . . , zi }. To measure each zi ’s marginal contribution for a given permutation,
TMC-Shapley compares the performance of the (i − 1)-th and i-th models, i.e., the models trained on datasets
{z1 , . . . , zi−1 } and {z1 , . . . , zi }, respectively. TMC-Shapley generates additional training-set permutations
and trains new models until the SV estimates converge. Ghorbani and Zou state that TMC-Shapley con-
vergence usually requires analyzing on the order of n training-set permutations. Given that training each
model has time complexity O(T) (Remark 3), TMC-Shapley's full time complexity is in general $O(n^2 T)$.
16 The term "truncated" in TMC-Shapley refers to a speed-up heuristic used when estimating the pointwise influences of multiple training instances at the same time. Truncation is not a necessary component of the method and is not described here.
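As an illustration of the permutation-sampling procedure (without the truncation heuristic), consider the minimal sketch below; train_and_loss(subset) is a hypothetical helper that retrains on the given subset and returns z_te's loss, and the permutation count is illustrative.

# Monte Carlo permutation sampling for Shapley value estimation (TMC-Shapley's
# core loop, truncation omitted).
import random

def monte_carlo_shapley(n, train_and_loss, num_permutations=10):
    values = [0.0] * n
    for _ in range(num_permutations):
        perm = random.sample(range(n), n)            # random ordering of D
        prev_loss = train_and_loss([])               # loss with no training data
        for i in range(n):
            cur_loss = train_and_loss(perm[: i + 1])
            values[perm[i]] += prev_loss - cur_loss  # marginal contribution
            prev_loss = cur_loss
    return [v / num_permutations for v in values]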
Gradient Shapley TMC-Shapley may be feasible for simple models that are fast to train. For more com-
plex systems, O(n2 ) complete retrainings are impractical. Ghorbani and Zou [GZ19] also propose Gradient
Shapley (G-Shapley), an even faster SV estimator which follows the same basic procedure as TMC-Shapley
with one critical change. Rather than taking T iterations to train each model, G-Shapley assumes models
are trained in just one gradient step. This means that G-Shapley's full and incremental time complexity is
only $O(n^2)$ – a substantial speed-up over TMC-Shapley's full $O(n^2 T)$ complexity.17 There is no free lunch,
and G-Shapley's speed-up is usually at the expense of lower influence estimation accuracy.
Efficient Nearest-Neighbor Shapley Since TMC-Shapley and G-Shapley rely on heuristics and as-
sumptions to achieve tractability, neither method provides approximation guarantees. In contrast, Jia et al.
[Jia+19a] prove that for k-nearest neighbors classification, SV pointwise influence can be calculated exactly
in O(n lg n) time. Formally, SV’s characteristic function for kNN classification is
$$\nu_{\text{kNN}}(D) := \frac{1}{k} \sum_{y_i \in \text{Neigh}(x_{\text{te}}; D)} \mathbb{1}[y_i = y_{\text{te}}], \qquad (25)$$
where $\text{Neigh}(x_{\text{te}}; D)$ is the set of k neighbors in D nearest to $x_{\text{te}}$ and $\mathbb{1}[\cdot]$ is the indicator function. Each
training instance either has no effect on Eq. (25)'s value ($z_i \notin \text{Neigh}(x_{\text{te}}; D)$ or $y_i \neq y_{\text{te}}$). Otherwise, the
training instance increases the value by one ($z_i \in \text{Neigh}(x_{\text{te}}; D)$ and $y_i = y_{\text{te}}$).
Assuming the training instances are sorted by increasing distance from xte (i.e., x1 is closest to xte and
xn is furthest), then zi ’s pointwise kNN Shapley influence is

$$\mathcal{I}_{\text{kNN-SV}}(z_i, z_{\text{te}}) := \frac{\mathbb{1}[y_n = y_{\text{te}}]}{n} + \sum_{j=i}^{n-1} \frac{\mathbb{1}[y_j = y_{\text{te}}] - \mathbb{1}[y_{j+1} = y_{\text{te}}]}{k} \cdot \frac{\min\{k, j\}}{j}. \qquad (26)$$

Observe that IkNN-SV ’s closed form is linear in n and requires no retraining at all. In fact, kNN Shapley’s
most computationally expensive component is sorting the training instances by distance from xte . Similar
to kNN LOO above, Jia et al. [Jia+19a] propose using kNN Shapley as a surrogate SV estimator for more
complex model classes. For example, kNN Shapley could be applied to the feature representations generated
by a deep neural network.
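Because Eq. (26) is a closed form, it is straightforward to implement. The NumPy sketch below assumes Euclidean distance and illustrative array names, and it computes all n values for a single test instance via the recursion that underlies Eq. (26).

# Closed-form kNN Shapley values (Eq. (26)) for one test instance.
import numpy as np

def knn_shapley(X_train, y_train, x_test, y_test, k):
    n = len(y_train)
    # Sort training instances by increasing distance from the test point.
    order = np.argsort(np.linalg.norm(X_train - x_test, axis=1))
    match = (y_train[order] == y_test).astype(float)
    s = np.zeros(n)
    s[n - 1] = match[n - 1] / n                    # farthest instance
    for j in range(n - 2, -1, -1):                 # recurse toward the nearest
        s[j] = s[j + 1] + (match[j] - match[j + 1]) / k * min(k, j + 1) / (j + 1)
    values = np.zeros(n)
    values[order] = s                              # undo the distance sort
    return values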
Beta Shapley Recent work has also questioned the optimality of SV assigning uniform weight to each
training subset size (see Eq. (17)). Counterintuitively, Kwon and Zou [KZ22] show theoretically and empir-
ically that influence estimates on larger training subsets are more affected by training noise than influence
estimates on smaller subsets. As such, rather than assigning all data subset sizes (|D|) uniform weight,
Kwon and Zou argue that smaller training subsets should be prioritized. Specifically, Kwon and Zou propose
Beta Shapley, which modifies vanilla SV by weighting the training-set sizes according to a positive skew (i.e.,
left-leaning) beta distribution.
SV has also been applied to study other types of influence beyond training set membership. For example,
Neuron Shapley applies SV to identify the model neurons that are most critical for a given prediction [GZ20].
Lundberg and Lee’s [LL17] SHAP is a very well-known tool that applies SV to measure feature importance.
For a comprehensive survey of Shapley value applications beyond training data influence, see the work of
Sundararajan and Najmi [SN20] and a more recent update by Rozemberczki et al. [Roz+22].
Banzhaf Value Also a semivalue [DNW81], Banzhaf value [Ban65] is closely related to Shapley value.
Formally, the Banzhaf value influence of zi ∈ D on test instance zte is
$$\mathcal{I}_{\text{Banzhaf}}(z_i, z_{\text{te}}) := \frac{1}{2^{n-1}} \sum_{D \subseteq \mathcal{D} \setminus z_i} \Big[ L(z_{\text{te}}; \theta_D^{(T)}) - L(z_{\text{te}}; \theta_{D \cup z_i}^{(T)}) \Big]. \qquad (27)$$

17 G-Shapley and TMC-Shapley have the same incremental time complexity – $O(n^2)$. Both estimators' storage
complexity is $O(n^2 p)$.

Intuitively, the primary difference between Eqs. (17) and (27) is that Shapley value assigns each subset
size (|D|) equal weight while Banzhaf value assigns each subset (D) equal weight. Wang and Jia [WJ23] prove
that influence rankings based on Banzhaf value are more robust to training variance than both leave-one-out
and Shapley value. Wang and Jia also empirically demonstrate that Banzhaf value can (significantly) out-
perform SV in practice.
Like SV, Banzhaf value has exponential time complexity. Uniform, Monte Carlo sampling from power
set 2D\zi provides an unbiased estimate of Eq. (27). Wang and Jia [WJ23] provide a more sophisticated
Banzhaf value Monte Carlo sampling strategy they term maximum sample reuse (MSR). MSR improves the
estimates’ sample complexity by a factor of O(n) over uniform Monte Carlo.

5 Gradient-Based Influence Estimation

For modern models, retraining even a few times to tune hyperparameters is very expensive. In such cases,
it is prohibitive to retrain an entire model just to gain insight into a single training instance’s influence.
For models trained using gradient descent, training instances only influence a model through training
gradients. Intuitively then, training data influence should be measurable when the right training gradients are
analyzed. This intuition forms the basis of gradient-based influence estimation. As detailed below, gradient-
based influence estimators rely on Taylor-series approximations or risk stationarity. These estimators also
assume some degree of differentiability – either of just the loss function [Yeh+18] or both the model and
loss [KL17; Pru+20; Che+21].
The exact analytical framework each gradient-based method employs depends on the set of model param-
eters considered [HL22]. Static, gradient-based methods – discussed first – estimate the effect of retraining
by studying gradients w.r.t. final model parameters θ(T ) . Obviously, a single set of model parameters pro-
vides limited insight into the entire optimization landscape, meaning static methods generally must make
stronger assumptions. In contrast, dynamic, gradient-based influence estimators reconstruct the training
data’s influence by studying model parameters throughout training, e.g., θ(0) , . . . , θ(T ) . Analyzing these
intermediary model parameters makes dynamic methods more computationally expensive in general, but it
enables dynamic methods to make fewer assumptions.
This section concludes with a discussion of a critical limitation common to all existing gradient-based
influence estimators – both static and dynamic. This common weakness can cause gradient-based estimators
to systematically overlook highly influential (groups of) training instances.

5.1 Static, Gradient-Based Influence Estimation

As mentioned above, static estimators are so named because they measure influence using only final model
parameters θ(T ) . Static estimators’ theoretical formulations assume stationarity (i.e., the model parameters
have converged to a risk minimizer) and convexity.
Below we focus on two static estimators – influence functions [KL17] and representer point [Yeh+18].
The two methods take very different approaches to influence estimation, with the former being more general
and the latter more scalable. Both estimators' underlying assumptions are generally violated in deep networks.

5.1.1 Influence Functions

Along with Shapley value (Sec. 4.3), Koh and Liang’s [KL17] influence functions is one of the best-known
influence estimators. The estimator derives its name from influence functions (also known as infinitesimal
jackknife [Jae72]) in robust statistics [Ham74]. These early statistical analyses consider how a model changes
if training instance zi ’s weight is infinitesimally perturbed by ϵi . More formally, consider the change in the

empirical risk minimizer from
$$\theta^{(T)} = \arg\min_{\theta} \frac{1}{n} \sum_{z \in \mathcal{D}} L(z; \theta) \qquad (28)$$
to
$$\theta_{+\epsilon_i}^{(T)} = \arg\min_{\theta} \frac{1}{n} \sum_{z \in \mathcal{D}} L(z; \theta) + \epsilon_i L(z_i; \theta). \qquad (29)$$

Under the assumption that model f and loss function ℓ are twice-differentiable and strictly convex, Cook
and Weisberg [CW82] prove via a first-order Taylor expansion that
$$\frac{d\theta_{+\epsilon_i}^{(T)}}{d\epsilon_i}\bigg|_{\epsilon_i = 0} = -(H_\theta^{(T)})^{-1} \nabla_\theta L(z_i; \theta^{(T)}), \qquad (30)$$
where empirical risk Hessian $H_\theta^{(T)} := \frac{1}{n} \sum_{z \in \mathcal{D}} \nabla_\theta^2 L(z; \theta^{(T)})$ is by assumption positive definite. Koh and
Liang [KL17] extend Cook and Weisberg's result to consider the effect of this infinitesimal perturbation on
zte ’s risk, where
$$\frac{dL(z_{\text{te}}; \theta^{(T)})}{d\epsilon_i}\bigg|_{\epsilon_i = 0} = \bigg(\frac{dL(z_{\text{te}}; \theta^{(T)})}{d\theta_{+\epsilon_i}^{(T)}}\bigg)^{\!\intercal} \frac{d\theta_{+\epsilon_i}^{(T)}}{d\epsilon_i}\bigg|_{\epsilon_i = 0} \qquad \triangleright \text{ Chain rule} \qquad (31)$$
$$= -\nabla_\theta L(z_{\text{te}}; \theta^{(T)})^\intercal (H_\theta^{(T)})^{-1} \nabla_\theta L(z_i; \theta^{(T)}). \qquad (32)$$
Removing training instance $z_i$ from $\mathcal{D}$ is equivalent to $\epsilon_i = -\frac{1}{n}$ making the pointwise influence functions
estimator
$$\widehat{\mathcal{I}}_{\text{IF}}(z_i, z_{\text{te}}) := \frac{1}{n} \nabla_\theta L(z_{\text{te}}; \theta^{(T)})^\intercal (H_\theta^{(T)})^{-1} \nabla_\theta L(z_i; \theta^{(T)}). \qquad (33)$$
More intuitively, Eq. (33) is the influence functions’ estimate of the leave-one-out influence of zi on zte .

5.1.1.1 Time, Space, and Storage Complexity Calculating inverse Hessian $(H_\theta^{(T)})^{-1}$ directly
requires $O(np^2 + p^3)$ time and $O(p^2)$ space [KL17]. For large models, this is clearly prohibitive. Rather than
computing $(H_\theta^{(T)})^{-1}$ directly, Koh and Liang instead estimate Hessian-vector product (HVP)
$$s_{\text{test}} := (H_\theta^{(T)})^{-1} \nabla_\theta L(z_{\text{te}}; \theta^{(T)}), \qquad (34)$$

with each training instance's pointwise influence then
$$\widehat{\mathcal{I}}_{\text{IF}}(z_i, z_{\text{te}}) = \frac{1}{n} s_{\text{test}}^\intercal \nabla_\theta L(z_i; \theta^{(T)}). \qquad (35)$$

Koh and Liang [KL17] use the stochastic algorithm of Pearlmutter [Pea94] and Agarwal et al. [ABH17]
to estimate stest . This reduces influence functions’ complexity to just O(np) time and O(n + p) space. Since
stest is specific to test instance zte , stest must be estimated for each test instance individually. This increases
influence functions’ computational cost but also means that influence functions require no additional storage.

5.1.1.2 Strengths and Weaknesses Influence functions’ clear advantage over retraining-based
methods is that influence functions eliminate the need to retrain any models. Recall from Table 1 that
retraining-based methods have a high upfront complexity of retraining but low incremental complexity. In-
fluence functions invert this computational trade-off. While computing stest can be slow (several hours for
a single test instance [Yeh+18; Guo+21; HL22]), it is still significantly faster than retraining O(n) models.
However, influence functions require that stest is recalculated for each test instance. Hence, when a large
number of test instances need to be analyzed, retraining-based methods may actually be faster than influence
functions through amortization of the retraining cost.

In addition, while influence functions can be very accurate on convex and some shallow models, the
assumption that Hessian $H_\theta^{(T)}$ is positive definite often does not hold for deep models [BPF21]. To ensure
inverse $(H_\theta^{(T)})^{-1}$ exists, Koh and Liang add a small dampening coefficient to the matrix's diagonal; this
dampener is a user-specified hyperparameter that is problematic to tune, since there may not be a ground-
truth reference. When this dampening hyperparameter is set too small, $s_{\text{test}}$ estimation can diverge [Yeh+18;
HNM19; HL22].
More generally, Basu et al. [BPF21] show that training hyperparameters also significantly affect influence
functions’ performance. Specifically, Basu et al. empirically demonstrate that model initialization, model
width, model depth, weight-decay strength, and even the test instance being analyzed (zte ) all can negatively
affect influence functions’ LOO estimation accuracy. Their finding is supported by the analysis of Zhang and
Zhang [ZZ22] who show that HVP estimation’s accuracy depends heavily on the model’s training regularizer
with HVP accuracy “pretty low under weak regularization.”
Bae et al. [Bae+22] also empirically analyze the potential sources of influence functions’ fragility on deep
models. Bae et al. identify five common error sources, the first three of which are the most important.18

• Warm-start gap: Influence functions more closely resembles performing fine-tuning close to final model
parameters θ(T ) than retraining from scratch (i.e., starting from initial parameters θ(0) ). This difference
in starting conditions can have a significant effect on the LOO estimate.
• Proximity gap: The error introduced by the dampening term included in the HVP (stest ) estimation
algorithm.
• Non-convergence gap: The error due to final model parameters θ(T ) not being a stationary point,
  i.e., $\sum_{z_i \in \mathcal{D}} \nabla_\theta L(z_i; \theta^{(T)}) \neq \vec{0}$.

• Linearization error : The error induced by considering only a first-order Taylor approximation when
deleting zi and ignoring the potential effects of curvature on IbIF (zi , zte ) [BYF20].
• Solver error : General error introduced by the specific solver used to estimate stest .

Rather than estimating the LOO influence, Bae et al. argue that influence functions more closely estimate a
different measure they term the proximal Bregman response function (PBRF). Bae et al. provide the intuition
that PBRF “approximates the effect of removing a data point while trying to keep predictions consistent
with the . . . trained model” [Bae+22]. Put simply, PBRF mimics a prediction-constrained LOO influence.
Bae et al. assert that PBRF can be applied in many of the same situations where LOO is useful. Bae
et al. further argue that influence functions’ fragility reported by earlier works [BPF21; ZZ22] is primarily
due to those works focusing on the “wrong question” of LOO. When the “right question” is posed and
influence functions are evaluated w.r.t. PBRF, influence functions give accurate answers.

5.1.1.3 Related Methods Improving influence functions’ computational scalability has been a pri-
mary focus of follow-on work. For instance, applying influence functions only to the model’s final linear
(classification) layer has been considered with at best mixed results [BBD20; Yeh+22].19 Moreover, Guo
et al.’s [Guo+21] fast influence functions (FastIF) integrate multiple speed-up heuristics. First, they lever-
age the inherent parallelizability of Pearlmutter’s [Pea94] HVP estimation algorithm. In addition, FastIF
includes recommended hyperparameters for Pearlmutter's HVP algorithm that reduce its execution time
by 50% on average.
Arnoldi-Based Influence Functions More recently, Schioppa et al. [Sch+22] show that influence func-
tions can be sped-up by three to four orders of magnitude by using a different algorithm to estimate stest .
Specifically, Schioppa et al. use Arnoldi's [Arn51] famous algorithm that quickly finds the dominant (in
absolute terms) eigenvalues and eigenvectors of $H_\theta^{(T)}$. These dominant eigenvectors serve as the basis when
projecting all gradient vectors to a lower-dimensional subspace. Schioppa et al. evaluate their revised in-
fluence functions estimator on large transformer networks (e.g., ViT-L32 with 300M parameters [Dos+21]),
which are orders of magnitude larger than the simple networks Koh and Liang [KL17] consider [Sch+23].
18 Bae et al. [Bae+22, Table 1] provide a formal mathematical definition of influence functions' five error sources.
19 Section 5.1.2.2 discusses some of the pitfalls associated with estimating influence using only a model's final (linear) layer.
Influence Functions for Decision Trees Another approach to speed up influence functions is to special-
ize the estimator to model architectures with favorable computational properties. For example, Sharchilev
et al.’s [Sha+18b] LeafInfluence method adapts influence functions to gradient boosted decision tree en-
sembles. By assuming a fixed tree structure and then focusing only on the trees’ leaves, LeafInfluence’s
tree-based estimates are significantly faster than influence functions on deep models [BHL23].
Group Influence Functions A major strength of influence functions is that it is one of the few influence
analysis methods that has been studied beyond the pointwise domain. For example, Koh et al.’s [Koh+19]
follow-on paper analyzes influence functions’ empirical performance estimating group influence. In particular,
Koh et al. [Koh+19] consider coherent training data subpopulations whose removal is expected to have a
large, broad effect on the model. Even under naive assumptions of (pointwise) influence additivity, Koh
et al. [Koh+19] observe that simply summing influence functions estimates tends to underestimate the true
group influence. More formally, let D ⊆ D be a coherent training-set subpopulation, then
X
IbIF (zi , zte ) < I (D, zte ) . (36)
zi ∈D

Nonetheless, influence functions’ additive group estimates tend to have strong rank correlation w.r.t. sub-
populations’ true group influence. In addition, Basu et al. [BYF20] extend influence functions to directly
account for subpopulation group effects by considering higher-order terms in influence functions’ Taylor-series
approximation.

5.1.2 Representer Point Methods

Unlike this section’s other gradient-based estimators, Yeh et al.’s [Yeh+18] representer point method does not
directly rely on a Taylor-based expansion. Instead, Yeh et al. build on Schölkopf et al.’s [SHS01] generalized
representer theorem. The derivation below assumes that model f is linear. Yeh et al. use a simple “trick”
to extend linear representer point methods to multilayer models like neural networks.
Representer-based methods rely on kernels, which are functions K : X × X → R that measure the simi-
larity between two vectors [HSS08]. Schölkopf et al.’s representer theorem proves that the optimal solution
to a class of L2 regularized functions can be reformulated as a weighted sum of the training data in kernel
form. Put simply, representer methods decompose the predictions of specific model classes into the indi-
vidual contributions (i.e., influence) of each training instance. This makes influence estimation a natural
application of the representer theorem.
Consider regularized empirical risk minimization where optimal parameters satisfy20
$$\theta^* := \arg\min_{\theta} \frac{1}{n} \sum_{z_i \in \mathcal{D}} L(z_i; \theta) + \frac{\lambda}{2} \lVert \theta \rVert^2, \qquad (37)$$

with λ > 0 the L2 regularization strength. Note that Eq. (37) defines minimizer θ∗ slightly differently than
the last section (28) since the representer theorem requires regularization.
Empirical risk minimizers are stationary points, meaning
$$\nabla_\theta \bigg( \frac{1}{n} \sum_{z_i \in \mathcal{D}} L(z_i; \theta^*) + \frac{\lambda}{2} \lVert \theta^* \rVert^2 \bigg) = \vec{0}, \qquad (38)$$

20
Eq. (37) is slightly different than Yeh et al.’s [Yeh+18] presentation. This alternate formulation requires specifying
λ 2× larger. We selected this alternate presentation to provide consistency with later work [Tsa+23; Che+21].

where ⃗0 is the p-dimensional, zero vector. The above simplifies to
$$\frac{1}{n} \sum_{z_i \in \mathcal{D}} \frac{\partial L(z_i; \theta^*)}{\partial \theta} + \lambda \theta^* = \vec{0} \qquad (39)$$
$$\theta^* = -\frac{1}{\lambda n} \sum_{z_i \in \mathcal{D}} \frac{\partial L(z_i; \theta^*)}{\partial \theta}. \qquad (40)$$

For a linear model where $f(x; \theta) = \theta^\intercal x =: \hat{y}$, Eq. (40) further simplifies via the chain rule to
$$\theta^* = -\frac{1}{\lambda n} \sum_{(x_i, y_i) \in \mathcal{D}} \frac{\partial \ell(f(x_i; \theta^*), y_i)}{\partial \theta} \qquad (41)$$
$$= -\frac{1}{\lambda n} \sum_{(x_i, y_i) \in \mathcal{D}} \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}} \frac{\partial \hat{y}_i}{\partial \theta} \qquad \triangleright \text{ Chain rule} \qquad (42)$$
$$= -\frac{1}{\lambda n} \sum_{(x_i, y_i) \in \mathcal{D}} \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}} \, x_i \qquad \triangleright \; \hat{y} = \theta^\intercal x \qquad (43)$$
$$= \sum_{(x_i, y_i) \in \mathcal{D}} \alpha_i x_i, \qquad (44)$$
where $\ell$ is any once-differentiable loss function, $\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}}$ is the gradient of just the loss function itself w.r.t.
model output $\hat{y}$,21 and $\alpha_i := -\frac{1}{\lambda n} \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}}$ is the i-th training instance's representer value. Yeh et al. [Yeh+18]
provide the intuition that a larger magnitude $\alpha_i$ indicates that training instance $z_i$ has larger influence on
the final model parameters $\theta^*$.
Following Schölkopf et al.’s [SHS01] representer theorem kernelized notation, the training set’s group
influence on test instance zte for any linear model is
$$\mathcal{I}_{\text{RP}}(\mathcal{D}, z_{\text{te}}) = \sum_{i=1}^{n} \alpha_i x_i^\intercal x_{\text{te}} \big|_{y_{\text{te}}} = \sum_{i=1}^{n} K(x_i, x_{\text{te}}, \alpha_i) \big|_{y_{\text{te}}} = \sum_{i=1}^{n} \mathcal{I}_{\text{RP}}(z_i, z_{\text{te}}), \qquad (45)$$
where kernel function $K(x_i, x_{\text{te}}, \alpha_i) := \alpha_i x_i^\intercal x_{\text{te}}$ returns a vector. $K(x_i, x_{\text{te}}, \alpha_i) \big|_{y_{\text{te}}}$ denotes the kernel value's
$y_{\text{te}}$-th dimension. Then, $z_i$'s pointwise linear representer point influence on $z_{\text{te}}$ is
$$\mathcal{I}_{\text{RP}}(z_i, z_{\text{te}}) = \alpha_i x_i^\intercal x_{\text{te}} \big|_{y_{\text{te}}} = K(x_i, x_{\text{te}}, \alpha_i) \big|_{y_{\text{te}}}. \qquad (46)$$

Extending Representer Point to Multilayer Models Often, linear models are insufficiently expressive,
with multilayer models used instead. In such cases, the representer theorem above does not directly apply. To
work around this limitation, Yeh et al. [Yeh+18] rely on what they (later) term last layer similarity [Yeh+22].
Formally, Yeh et al. [Yeh+18] partition the model parameters θ(T ) = [ θ̇(T ) θ̈(T ) ] into two subsets,
where θ̈(T ) is the last linear (i.e., classification) layer’s parameters and θ̇(T ) := θ(T ) \ θ̈(T ) is all other model
parameters. Since θ̈(T ) is simply a linear function, the representer theorem analysis above still applies to it.
Yeh et al. [Yeh+18] treat the other parameters, θ̇(T ) , as a fixed feature extractor and ignore them in their
influence analysis.
To use Yeh et al.’s [Yeh+18] multilayer trick, one small change to Eq. (46) is required. In multilayer
models, the final (linear) layer does not operate over feature vectors xi and xte directly. Instead, the final
layer only sees an intermediate feature representation. For arbitrary feature vector x, let f be the feature
representation generated by model parameters θ̇(T ) , i.e., vector f is the input to the model’s last linear layer
given input x. Then the representer point influence estimator for a multilayer model is
IbRP (zi , zte ) := αi fi⊺ fte |yte = K(fi , fte ,αi ) |yte . (47)
21 In the case of classification, $\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}}$ has $|\mathcal{Y}|$ dimensions, i.e., its dimension equals the number of classes.

5.1.2.1 Time, Space, and Storage Complexity Treating as constants feature representation
dimension $|f|$ and the overhead to calculate $\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}}$, estimating the entire training set's representer point
influence only requires calculating n dot products. This only takes O(n) time and O(n + p) space with no
additional storage requirements.
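As an illustration of this computation, the sketch below scores every training instance for a softmax classifier trained with cross-entropy loss and L2 regularization, assuming precomputed last-layer features and predicted class probabilities; all array names are illustrative.

# Representer point influence (Eq. (47)) from precomputed last-layer features.
import numpy as np

def representer_influence(F_train, probs_train, Y_train, f_test, y_test, lam):
    # F_train: (n, d) last-layer features; probs_train: (n, |Y|) softmax outputs;
    # Y_train: (n, |Y|) one-hot labels; f_test: (d,); lam: L2 strength.
    n = F_train.shape[0]
    loss_grad = probs_train - Y_train        # d loss / d logits for cross-entropy
    alpha = -loss_grad / (lam * n)           # representer values, shape (n, |Y|)
    similarity = F_train @ f_test            # feature-space kernel, shape (n,)
    return similarity * alpha[:, y_test]     # keep the test label's dimension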

5.1.2.2 Strengths and Weaknesses Representer point's primary advantage is its theoretical and
computational simplicity. Eq. (47) only considers the training and test instances' final feature representations
and loss function gradient $\frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}}$. Hence, the majority of representer point's computation is forward-pass
only and can be sped up using batching. This translates to representer point being very fast – several orders
of magnitude faster than influence functions and Section 5.2's dynamic estimators [HL21].
However, representer points’ simplicity comes at a cost. First, at the end of training, it is uncommon
that a model’s final linear layer has converged to a stationary point. Before applying their method, Yeh et al.
[Yeh+18] recommend freezing all model layers except the final one (i.e., freezing θ̇(T ) ) and then fine-tuning
the classification layer (θ̈(T ) ) until convergence/stationarity. Without this extra fine-tuning, representer
point’s stationarity assumption does not hold, and poor influence estimation accuracy is expected. Beyond
just complicating the training procedure itself, this extra training procedure also complicates comparison
with other influence methods since it may require evaluating the approaches on different parameters.
Moreover, by focusing exclusively on the model’s final linear (classification) layer, representer point meth-
ods may miss influential behavior that is clearly visible in other layers. For example, Hammoudeh and Lowd
[HL22] demonstrate that while some training-set attacks are clearly visible in a network’s final layer, other
attacks are only visible in a model’s first layer – despite both attacks targeting the same model architecture
and dataset. In their later paper, Yeh et al. [Yeh+22] acknowledge the disadvantages of considering only
the last layer writing, “that choice critically affects the similarity component of data influence and leads to
inferior results.” Yeh et al. [Yeh+22] further state that the feature representations in the final layer – and
by extension representer point’s influence estimates – can be “too reductive.”
In short, Yeh et al.’s [Yeh+18] representer point method is highly scalable and efficient but is only
suitable to detect behaviors that are obvious in the model’s final linear layer.

5.1.2.3 Related Methods Given the accuracy limitations of relying on last-layer similarity, limited
follow-on work has adapted representer-point methods.
Adapting Representer Point to Decision Trees Brophy et al. [BHL23] extend representer point
methods to decision forests via their Tree-ensemble Representer Point Examples (TREX) estimator. Specif-
ically, they use supervised tree kernels – which provide an encoding of a tree’s learned representation struc-
ture [DG14; He+14] – for similarity comparison.
Making Representer Point More Robust In addition, Sui et al. [SWS21] propose representer point
selection based on a local Jacobian expansion (RPS-LJE), which can be viewed as generalizing Yeh et
al.'s [Yeh+18] base method. Rather than relying on Schölkopf et al.'s [SHS01] representer theorem, Sui
et al.’s formulation relies on a first-order Taylor expansion that estimates the difference between the final
model parameters and a true stationary point. RPS-LJE still follows vanilla representer point’s kernelized
decomposition where IbRP (zi , zte ) := αi fi⊺ fte , albeit with αi defined differently.
RPS-LJE addresses two weaknesses of Yeh et al.’s [Yeh+18] base approach. First, Sui et al.’s formulation
does not presume that θ(T ) is a stationary point. Therefore, RPS-LJE does not require post-training fine-
tuning to enforce stationarity. Second, as Section 5.3 discusses in detail, gradient-based estimators, including
vanilla representer point, tend to mark as most influential those training instances with the largest loss values.
This leads to all test instances from a given class having near identical top-k influence rankings. RPS-LJE’s
alternate definition of αi is less influenced by a training instance’s loss value, which enables RPS-LJE to
generate more semantically meaningful influence rankings.

Extending Representer Point to Other Regularizers Yeh et al.’s [Yeh+18] representer point for-
mulation exclusively considers L2 -regularized models. Intuitively, regularization’s role is to encourage the
model parameters to meet certain desired properties, which may necessitate the use of alternate regularizers.
For example, L1 regularization is often used to induce sparse minimizers.
Recently, Tsai et al. [Tsa+23] propose high-dimensional representers, a novel extension of Yeh et al.’s
[Yeh+18] representer theorem to additional types of regularization. Specifically, Tsai et al. consider decom-
posable regularization functions [Neg+12]. Formally, a regularization function r : Rp → R≥0 is decomposable
w.r.t. two subspaces U, V ⊆ Rp if ∀ u ∈ U and ∀ v ∈ V,

r(u + v) = r(u) + r(v). (48)

Examples of decomposable regularizers include L1 -norm [Tib96] and the matrix nuclear norm [Yua+07;
Rec11; Yan+17].
High-dimensional representers follow Eq. (47)'s kernelized form. Representer value $\alpha_i := -\frac{1}{\lambda n} \frac{\partial \ell(\hat{y}_i, y_i)}{\partial \hat{y}}$
still quantifies the global importance of each training instance. Moreover, the similarity between training
instance xi and test instance xte is still measured via a kernel function (K). The only difference is that the
kernels are specialized for these alternate regularizers; specifically the kernels are based on the decomposable
regularization function's sub-differential [Neg+12].

5.2 Dynamic, Gradient-Based Influence Estimation

All preceding influence methods – static, gradient-based and retraining-based – define and estimate influence
using only final model parameters, $\theta_D^{(T)}$, where $D \subseteq \mathcal{D}$. These final parameters only provide a snapshot into
a training instance’s possible effect. Since neural network training is NP-complete [BR92], it can be provably
difficult to reconstruct how each training instance affected the training process.
As an intuition, an influence estimator that only considers the final model parameters is akin to only
reading the ending of a book. One might be able to draw some big-picture insights, but the finer details
of the story are most likely lost. Applying a dynamic influence estimator is like reading a book from
beginning to end. By comprehending the whole influence “story,” dynamic methods can observe training
data relationships – both fine-grained and general – that other estimators miss.
Since test instance zte may not be known before model training, in-situ influence analysis may not be
possible. Instead, as shown in Alg. 1, intermediate model parameters Θ ⊆ {θ(0) , . . . , θ(T −1) } are stored
during training for post hoc influence analysis.22
Below we examine two divergent approaches to dynamic influence estimation – the first defines a novel
definition of influence while the second estimates leave-one-out influence with fewer assumptions than influ-
ence functions.

5.2.1 TracIn – Tracing Gradient Descent

Fundamentally, all preceding methods define influence w.r.t. changes to the training set. Pruthi et al.
[Pru+20] take an orthogonal perspective. They treat training set D as fixed, and consider the change in
model parameters as a function of time, or more precisely, the training iterations.
Vacuously, the training set’s group influence on test instance zte is

I (D, zte ) = L(zte ; θ(0) ) − L(zte ; θ(T ) ). (49)


22
In practice, only a subset of {θ(0) , . . . , θ(T −1) } is actually stored. Heuristics are then applied to this subset to
achieve acceptable influence estimation error [Pru+20; HL22].

Algorithm 1 Dynamic influence estimation’s training phase
Input: Training set D; iteration count T ; learning rates η (1) , . . . , η (T ) ; batch
sizes b(1) , . . . , b(T ) ; and initial parameters θ(0)
Output: Final parameters θ(T ) and stored parameter set Θ
1: Θ ← ∅
2: for t ← 1 to T do
3: Θ ← Θ ∪ {θ(t−1) } ▷ Store intermediate params.
4: B(t) ∼ D^{b(t)}
5: θ(t) ← Update(η (t) , θ(t−1) , B(t) )
6: return θ(T ) , Θ

In words, training set D causes the entire change in test loss between random initial parameters θ(0) and
final parameters θ(T ) . Eq. (49) decomposes by training iteration t as
$$\mathcal{I}(\mathcal{D}, z_{\text{te}}) = \sum_{t=1}^{T} \Big[ L(z_{\text{te}}; \theta^{(t-1)}) - L(z_{\text{te}}; \theta^{(t)}) \Big]. \qquad (50)$$

Consider training a model with vanilla stochastic gradient descent, where each training minibatch B (t) is
a single instance and gradient updates have no momentum [RHW86]. Here, each iteration t has no effect on
any other iteration beyond the model parameters themselves. Combining this with singleton batches enables
attribution of each parameter change to a single training instance, namely whichever instance was in B (t) .
Under this regime, Pruthi et al. [Pru+20] define the ideal TracIn pointwise influence as
$$\mathcal{I}_{\text{TracIn}}(z_i, z_{\text{te}}) := \sum_{\substack{t \\ z_i = \mathcal{B}^{(t)}}} \Big[ L(z_{\text{te}}; \theta^{(t-1)}) - L(z_{\text{te}}; \theta^{(t)}) \Big], \qquad (51)$$

where the name “TracIn” derives from “tracing gradient descent influence.” Eq. (50) under vanilla stochastic
gradient descent decomposes into the sum of all pointwise influences
$$\mathcal{I}(\mathcal{D}, z_{\text{te}}) = \sum_{i=1}^{n} \Bigg( \sum_{\substack{t \\ z_i = \mathcal{B}^{(t)}}} \Big[ L(z_{\text{te}}; \theta^{(t-1)}) - L(z_{\text{te}}; \theta^{(t)}) \Big] \Bigg) = \sum_{i=1}^{n} \mathcal{I}_{\text{TracIn}}(z_i, z_{\text{te}}). \qquad (52)$$

While the ideal TracIn influence has a strong theoretical motivation, its assumption of singleton batches
and vanilla stochastic gradient descent is unrealistic in practice. To achieve reasonable training times, modern
models train on batches of up to hundreds of thousands or millions of instances. Training on a single instance
at a time would be far too slow [YGG17; Goy+17; Bro+20].
A naive fix to Eq. (51) to support non-singleton batches assigns the same influence to all instances in
the minibatch; more formally, it divides the change in loss $L(z_{\text{te}}; \theta^{(t-1)}) - L(z_{\text{te}}; \theta^{(t)})$ by batch size $|\mathcal{B}^{(t)}|$ for
each $z_i \in \mathcal{B}^{(t)}$. This naive approach does not differentiate those instances in batch $\mathcal{B}^{(t)}$ that had positive
influence on the prediction from those that made the prediction worse.
Instead, Pruthi et al. [Pru+20] estimate the contribution of each training instance within a minibatch
via a first-order Taylor approximation. Formally,

$$L(z_{\text{te}}; \theta^{(t)}) \approx L(z_{\text{te}}; \theta^{(t-1)}) + \nabla_\theta L(z_{\text{te}}; \theta^{(t-1)})^\intercal (\theta^{(t)} - \theta^{(t-1)}). \qquad (53)$$
Under gradient descent without momentum, the change in model parameters is directly determined by the
batch instances' gradients, i.e.,
$$\theta^{(t)} - \theta^{(t-1)} = -\frac{\eta^{(t)}}{|\mathcal{B}^{(t)}|} \sum_{z_i \in \mathcal{B}^{(t)}} \nabla_\theta L(z_i; \theta^{(t-1)}), \qquad (54)$$

Algorithm 2 TracIn influence estimation
Input: Training param. set Θ; iteration count T; batches B(1), . . . , B(T); learning rates η(1), . . . , η(T);
    training instance zi; and test example zte
Output: TracIn influence estimate ÎTracIn(zi, zte)
1: Î ← 0
2: for t ← 1 to T do
3:     if zi ∈ B(t) then
4:         θ(t−1) ← Θ[t]
5:         Î ← Î + (η(t) / |B(t)|) ∇θL(zi; θ(t−1))⊺ ∇θL(zte; θ(t−1))
6: return Î

Algorithm 3 TracInCP influence estimation
Input: Training param. set Θ; iteration subset T; learning rates η(1), . . . , η(T); training instance zi;
    and test example zte
Output: TracInCP influence est. ÎTracInCP(zi, zte)
1: Î ← 0
2: for each t ∈ T do
3:     θ(t−1) ← Θ[t]
4:     Î ← Î + η(t) ∇θL(zi; θ(t−1))⊺ ∇θL(zte; θ(t−1))
5: return Î

where η (t) is iteration t’s learning rate.


Combining Eqs. (52) to (54), the TracIn pointwise influence estimator is
$$\widehat{\mathcal{I}}_{\text{TracIn}}(z_i, z_{\text{te}}) := \sum_{\substack{t \\ z_i \in \mathcal{B}^{(t)}}} \frac{\eta^{(t)}}{|\mathcal{B}^{(t)}|} \nabla_\theta L(z_i; \theta^{(t-1)})^\intercal \nabla_\theta L(z_{\text{te}}; \theta^{(t-1)}), \qquad (55)$$

with the complete TracIn influence estimation procedure shown in Alg. 2.


A More “Practical” TracIn Training’s stochasticity can negatively affect the performance of both ideal
TracIn (51) and the TracIn influence estimator (55). As an intuition, consider when the training set contains
two identical copies of some instance. All preceding gradient-based methods assign those two identical
instances the same influence score. However, it is unlikely that those two training instances will always
appear together in the same minibatch. Therefore, ideal TracIn almost certainly assigns these identical
training instances different influence scores. These assigned scores may even be vastly different – by up
to several orders of magnitude [HL22]. This is despite identical training instances always having the same
expected TracIn influence.
Pruthi et al. [Pru+20] recognize randomness’s effect on TracIn and propose the TracIn Checkpoint
influence estimator (TracInCP) as a “practical” alternative. Rather than retrace all of gradient descent,
TracInCP considers only a subset of the training iterations (i.e., checkpoints) T ⊆ [T ]. More importantly,
at each t ∈ T , all training instances are analyzed – not just those in recent batches. Eq. (56) formalizes
TracInCP, with its modified influence estimation procedure shown in Alg. 3.
$$\widehat{\mathcal{I}}_{\text{TracInCP}}(z_i, z_{\text{te}}) := \sum_{t \in \mathcal{T}} \eta^{(t)} \nabla_\theta L(z_i; \theta^{(t-1)})^\intercal \nabla_\theta L(z_{\text{te}}; \theta^{(t-1)}) \qquad (56)$$

Observe that, unlike TracIn, TracInCP assigns identical training instances the same influence estimate.
Therefore, TracInCP more closely estimates expected influence than TracIn. Pruthi et al. [Pru+20] use
TracInCP over TracIn in much of their empirical evaluation. Other work has also shown that TracInCP
routinely outperforms TracIn on many tasks [HL22].
Remark 7: Pruthi et al.’s empirical evaluation uses |T | ≪ T . For example, when identifying mislabeled
examples using TracInCP, Pruthi et al. evaluate every 30th iteration. Furthermore, Pruthi et al. note that
prioritizing the small number of checkpoints where zte ’s loss changes significantly generally outperforms
evaluating a larger number of evenly-spaced checkpoints.
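To illustrate Alg. 3 in code, the following minimal PyTorch sketch assumes a list of saved checkpoint files, a matching list of learning rates, and a hypothetical loss_fn(model, z) helper that returns a single example's loss.

# Sketch of TracInCP (Eq. (56)) over stored checkpoints.
import torch

def flat_grad(model, loss_fn, z):
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss_fn(model, z), params)
    return torch.cat([g.reshape(-1) for g in grads])

def tracincp(model, checkpoint_paths, learning_rates, loss_fn, z_train, z_test):
    score = 0.0
    for path, lr in zip(checkpoint_paths, learning_rates):
        model.load_state_dict(torch.load(path))      # parameters at checkpoint t
        g_train = flat_grad(model, loss_fn, z_train)
        g_test = flat_grad(model, loss_fn, z_test)
        score += lr * torch.dot(g_train, g_test).item()
    return score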

5.2.1.1 Time, Space, and Storage Complexity Below we derive the time complexity of both
versions of TracIn. We then discuss their space and storage complexities.

Consider first TracInCP’s time complexity since it is simpler to derive. From Alg. 3, each checkpoint
in T requires n, p-dimensional dot products making TracInCP’s complexity O(np|T |). For vanilla TracIn,
consider Alg. 2. For each iteration t ∈ [T ], a p-dimensional dot product is performed for each instance in
$\mathcal{B}^{(t)}$. Let $b := \max_t |\mathcal{B}^{(t)}|$ denote the maximum batch size; then TracIn's time complexity is O(bpT). In the
worst case where $\forall_t \, \mathcal{B}^{(t)} = \mathcal{D}$ (full-batch gradient descent), TracIn's time complexity is O(npT).
Recall that, by definition, |T | ≤ T meaning TracInCP is asymptotically faster than TracIn. However,
this is misleading. In practice, TracInCP is generally slower than TracIn as Pruthi et al. note.
Since each gradient calculation is independent, TracIn and TracInCP are fully parallelizable. Table 1
treats the level of concurrency as a constant factor, making the space complexity of both TracIn and TracInCP
O(n + p).
Lastly, as detailed in Alg. 1, dynamic influence estimators require that intermediate model parameters Θ
be saved during training for post hoc influence estimation. In the worst case, each training iteration’s
parameters are stored resulting in a storage complexity of O(pT ). In practice however, TracIn only considers
a small fraction of these T training parameter vectors, meaning TracIn’s actual storage complexity is generally
(much) lower than the worst case.

5.2.1.2 Strengths and Weaknesses TracIn and TracInCP avoid many of the primary pitfalls
associated with static, gradient-based estimators.
First, recall from Section 5.1.1 that Hessian-vector product stest significantly increases the computational
overhead and potential inaccuracy of influence functions. TracIn’s theoretical simplicity avoids the need to
compute any Hessian.
Second, representer point’s theoretical formulation necessitated considering only a model’s final linear
layer, at the risk of (significantly) worse performance. TracIn has the flexibility to use only the final linear
layer for scenarios where that provides sufficient accuracy23 as well as the option to use the full model
gradient when needed.
Third, by measuring influence during the training process, TracIn requires no assumptions about sta-
tionarity or convergence. In fact, TracIn can be applied to a model that is only partially trained. TracIn can
also be used to study when during training an instance is most influential. For example, TracIn can identify
whether a training instance is most influential early or late in training.
Fourth, due to how gradient-based methods estimate influence, highly influential instances can actually
appear uninfluential at the end of training. Unlike static estimators, dynamic methods like TracIn may still
be able to detect these instances. See Section 5.3 for more details.
In terms of weaknesses, TracIn’s theoretical motivation assumes stochastic gradient descent without
momentum. However, momentum and adaptive optimization (e.g., Adam [KB15]) significantly accelerate
model convergence [Qia99; DHS11; KB15]. To align more closely with these sophisticated optimizers, Eq. (55)
and Alg. 2 would need to change significantly. For context, Section 5.2.2 details another dynamic estimator,
HyDRA, which incorporates support for just momentum with the resulting increase in estimator complexity
substantial.

5.2.1.3 Related Methods TracIn has been adapted by numerous derivative/heuristic variants. For
example, TracIn-Last is identical to vanilla TracIn except gradient vectors ∇θ L(zi ; θ(t−1) ) and ∇θ L(zte ; θ(t−1) )
only consider the model’s final linear layer [Pru+20]. This can make TracIn significantly faster at the risk
of (significantly) worse accuracy [Yeh+22].
TracIn for Language Models As a counter to the disadvantages of solely considering a model’s last
layer,24 TracIn’s authors subsequently proposed TracIn word embeddings (TracInWE), which targets large
23
Last layer only TracIn is also referred to as TracIn-Last. Yeh et al. [Yeh+22] evaluate TracIn-Last’s effectiveness.
24
See Section 5.1.2.2 for an extended discussion of last-layer similarity.

28
language models and considers only the gradients in those models’ word embedding layer [Yeh+22]. Since
language-model word embeddings can still be very large (e.g., BERT-Base’s word embedding layer has
23M parameters [Dev+19]), the authors specifically use the gradients of only those tokens that appear in
both training instance zi and test instance zte .
Low Dimensional TracIn Pruthi et al. [Pru+20] also propose TracIn Random Projection (TracInRP) –
a low-memory version of TracIn that provides unbiased estimates of $\widehat{\mathcal{I}}_{\text{TracIn}}$ (i.e., an estimate of an estimate).
Intuitively, TracInRP maps gradient vectors into a d-dimensional subspace (d ≪ p) via multiplication by
a d × p random matrix where each entry is sampled i.i.d. from Gaussian distribution $\mathcal{N}(0, \frac{1}{d})$. These low-
memory gradient "sketches" are used in place of the full gradient vectors in Eq. (55) [Woo14]. TracInRP
is primarily targeted at applications where p is sufficiently large that storing the full training set's gradient
vectors ($\forall_{t,i}\; \nabla_\theta L(z_i; \theta^{(t-1)})$) is prohibitive.
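To make the sketching step concrete, the snippet below projects flat gradient tensors with a shared Gaussian matrix; names are illustrative, and in practice the same projection must be reused for every gradient so that inner products are preserved in expectation.

# Random projection of gradients for TracInRP-style storage savings.
import torch

def make_projection(d, p, seed=0):
    gen = torch.Generator().manual_seed(seed)
    return torch.randn(d, p, generator=gen) / d**0.5   # entries ~ N(0, 1/d)

def sketch(grad_vector, projection):
    return projection @ grad_vector                     # d-dimensional sketch

# E[sketch(g1, P) @ sketch(g2, P)] equals g1 @ g2, so the sketches can replace
# the full gradients in Eq. (55)'s dot products at much lower storage cost.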
TracIn for Generative Models TracIn has also been used outside of supervised settings. For ex-
ample, Kong and Chaudhuri [KC21] apply TracIn to unsupervised learning, in particular density estima-
tion; they propose variational autoencoder TracIn (VAE-TracIn), which quantifies the TracIn influence in
β-VAEs [Hig+17]. Moreover, Thimonier et al.’s [Thi+22] TracIn anomaly detector (TracInAD) functionally
estimates the distribution of influence estimates – using either TracInCP or VAE-TracIn. TracInAD then
marks as anomalous any test instance in the tail of this “influence distribution”.
Note also that TracIn can be applied to any iterative, gradient-based model, including those that are
non-parametric. For example, Brophy et al.’s [BHL23] BoostIn adapts TracIn for gradient-boosted decision
tree ensembles.

5.2.2 HyDRA – Hypergradient Data Relevance Analysis

Unlike TracIn which uses a novel definition of influence (51), Chen et al.’s [Che+21] hypergradient data
relevance analysis (HyDRA) estimates the leave-one-out influence (8). HyDRA leverages the same Taylor
series-based analysis as Koh and Liang’s [KL17] influence functions. The key difference is that HyDRA
addresses a fundamental mismatch between influence functions’ assumptions and deep models.
Section 5.1.1 explains that influence functions consider infinitesimally perturbing the weight of training
sample $z_i$ by $\epsilon_i$. Recall that the change in $z_{\text{te}}$'s test risk w.r.t. this infinitesimal perturbation is
$$\frac{dL(z_{\text{te}}; \theta^{(T)})}{d\epsilon_i} = \bigg(\frac{\partial L(z_{\text{te}}; \theta^{(T)})}{\partial \theta^{(T)}}\bigg)^{\!\intercal} \frac{d\theta^{(T)}}{d\epsilon_i} = \bigg(\frac{\partial L(z_{\text{te}}; \theta^{(T)})}{\partial \theta^{(T)}}\bigg)^{\!\intercal} \tilde{h}_i^{(T)} \qquad (57)$$
where $\tilde{h}_i^{(T)} := \frac{d\theta_{+\epsilon_i}^{(T)}}{d\epsilon_i}$ denotes the p-dimensional hypergradient of training instance i at the end of training.
Koh and Liang’s [KL17] assumptions of differentiability and strict convexity mean that Eq. (57) has
a closed form. However, deep neural models are not convex. Under non-convex gradient descent without
momentum and with L2 regularization, $\theta^{(t)} := \theta^{(t-1)} - \eta^{(t)} g^{(t-1)}$ where gradient
$$g^{(t-1)} := \nabla_\theta L(\mathcal{B}^{(t)}; \theta^{(t-1)}) + \lambda \theta^{(t-1)}. \qquad (58)$$
The exact definition of gradient $g^{(t-1)}$ depends on the specific contents of batch $\mathcal{B}^{(t)}$, so for simplicity, we
encapsulate the batch's contribution to the gradient using catch-all term $\nabla_\theta L(\mathcal{B}^{(t)}; \theta^{(t-1)})$.
Using Eq. (58), hypergradient $\tilde{h}_i^{(T)}$ can be defined recursively as
$$\tilde{h}_i^{(T)} := \frac{d\theta_{+\epsilon_i}^{(T)}}{d\epsilon_i} = \frac{d}{d\epsilon_i} \Big( \theta_{+\epsilon_i}^{(T-1)} - \eta^{(T)} g^{(T-1)} \Big) \qquad (59)$$
$$= \tilde{h}_i^{(T-1)} - \eta^{(T)} \frac{d}{d\epsilon_i} \Big( \nabla_\theta L(\mathcal{B}^{(T)}; \theta_{+\epsilon_i}^{(T-1)}) + \lambda \theta_{+\epsilon_i}^{(T-1)} \Big) \qquad (60)$$
$$= (1 - \eta^{(T)} \lambda) \tilde{h}_i^{(T-1)} - \eta^{(T)} \frac{d}{d\epsilon_i} \nabla_\theta L(\mathcal{B}^{(T)}; \theta_{+\epsilon_i}^{(T-1)}) \qquad (61)$$

Algorithm 4 Fast HyDRA influence estimation for gradient descent without momentum
Input: Training parameter set Θ; final parameters θ(T); training set size n; iteration count T;
    batches B(1), . . . , B(T); learning rates η(1), . . . , η(T); weight decay λ; training instance zi;
    and test example zte
Output: HyDRA influence estimate ÎHyDRA(zi, zte)
1: h̃i(0) ← ⃗0                                   ▷ Initialize to zero vector
2: for t ← 1 to T do
3:     if zi ∈ B(t) then
4:         θ(t−1) ← Θ[t]
5:         h̃i(t) ← (1 − η(t)λ) h̃i(t−1) − (η(t) n / |B(t)|) ∇θL(zi; θ(t−1))
6:     else
7:         h̃i(t) ← (1 − η(t)λ) h̃i(t−1)
8: Î ← −(1/n) ∇θL(zte; θ(T))⊺ h̃i(T)             ▷ Influence estimate
9: return Î

The recursive definition of hypergradient $\tilde{h}_i^{(T)}$ needs to be unrolled all the way back to initial parameters θ(0).
The key takeaway from Eq. (61) is that training hypergradients affect the model parameters throughout
all of training. By assuming a convex model and loss, Koh and Liang’s [KL17] simplified formulation ignores
this very real effect. As Chen et al. [Che+21] observe, hypergradients often cause non-convex models to
converge to a vastly different risk minimizer. By considering the hypergradients’ cumulative effect, HyDRA
can provide more accurate LOO estimates than influence functions on non-convex models – albeit via a
significantly more complicated and computationally expensive formulation.
Unrolling Gradient Descent Hypergradients The exact procedure to unroll HyDRA's hypergradient $\tilde{h}_i^{(T)}$
is non-trivial. For the interested reader, supplemental Section C provides hypergradient unrolling's
full derivation for vanilla gradient descent without momentum. Below, we briefly summarize Section C's
important takeaways, and Section C's full derivation can be skipped with minimal loss of understanding.
At each training iteration, hypergradient unrolling requires estimating the risk Hessian of each training
instance in B (t) .25 This significantly slows down HyDRA (by a factor of about 1,000× [Che+21]). As a
workaround, Chen et al. [Che+21] propose treating these risk Hessians as all zeros, proving that, under
mild assumptions, the approximation error of this simplified version of HyDRA is bounded. Alg. 4 shows
HyDRA’s fast approximation algorithm without Hessians for vanilla gradient descent.26 After calculating
(T )
the final hypergradient, substituting e hi into Eq. (57) with ϵi = − n1 27 yields training instance zi ’s HyDRA
pointwise influence estimator
1 ⊺ (T )
IbHyDRA (zi , zte ) := − ∇θ L(zte ; θ(T ) ) e
hi . (62)
n

25 For clarity, this is not the inverse Hessian $(H_\theta^{(T)})^{-1}$ used by influence functions.
26 See HyDRA's original paper [Che+21] for the fast approximation algorithm with momentum.
27 $\epsilon_i = -\frac{1}{n}$ is equivalent to deleting instance $z_i$ from the training set. Influence functions follow the same procedure for $\epsilon_i$. See Eqs. (32) and (33).
Relating HyDRA and TracIn When λ = 0 or weight decay's effects are ignored (as done by TracIn),
HyDRA's fast approximation for vanilla gradient descent simplifies to

$$\widehat{\mathcal{I}}_{\text{HyDRA}}(z_i, z_{\text{te}}) \approx -\frac{1}{n} \nabla_\theta L(z_{\text{te}}; \theta^{(T)})^\intercal \sum_{\substack{t \\ z_i \in \mathcal{B}^{(t)}}} \Big( {-\frac{\eta^{(t)} n}{|\mathcal{B}^{(t)}|}} \Big) \nabla_\theta L(z_i; \theta^{(t-1)}) \qquad (63)$$
$$= \sum_{\substack{t \\ z_i \in \mathcal{B}^{(t)}}} \frac{\eta^{(t)}}{|\mathcal{B}^{(t)}|} \nabla_\theta L(z_{\text{te}}; \theta^{(T)})^\intercal \nabla_\theta L(z_i; \theta^{(t-1)}) \qquad (64)$$
$$= \nabla_\theta L(z_{\text{te}}; \theta^{(T)})^\intercal \sum_{\substack{t \\ z_i \in \mathcal{B}^{(t)}}} \frac{\eta^{(t)}}{|\mathcal{B}^{(t)}|} \nabla_\theta L(z_i; \theta^{(t-1)}). \qquad (65)$$

Eq. (64) is very similar to TracIn’s definition in Eq. (55), despite the two methods estimating different defi-
nitions of influence (LOO vs. ideal TracIn (51)). The only difference between (64) and (55) is that HyDRA
always uses final test gradient $\nabla_\theta L(z_{\text{te}}; \theta^{(T)})$ while TracIn uses each iteration's test gradient $\nabla_\theta L(z_{\text{te}}; \theta^{(t-1)})$.
The key takeaway is that while theoretically different, HyDRA and TracIn are in practice very similar
where HyDRA can be viewed as trading (incremental) speed for lower precision w.r.t. zte .
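To make Alg. 4 concrete, the following minimal NumPy sketch unrolls the fast (Hessian-free) hypergradient for a single training instance; it assumes flattened parameter checkpoints and a hypothetical grad_fn(z, theta) helper that returns the per-example loss gradient, and it ignores momentum, as Alg. 4 does.

# Fast HyDRA hypergradient unrolling and influence estimate (Alg. 4, Eq. (62)).
import numpy as np

def hydra_fast(checkpoints, batches, lrs, lam, n, z_i, z_test, grad_fn):
    # checkpoints[t] holds theta^(t) (so checkpoints[-1] is theta^(T));
    # batches[t] holds the indices in B^(t+1); lrs[t] is eta^(t+1).
    h = np.zeros_like(checkpoints[0])                  # hypergradient, initially 0
    for theta, batch, lr in zip(checkpoints, batches, lrs):
        h *= (1.0 - lr * lam)                          # weight-decay shrinkage
        if z_i in batch:                               # instance sampled this step
            h -= lr * n / len(batch) * grad_fn(z_i, theta)
    return -grad_fn(z_test, checkpoints[-1]) @ h / n   # Eq. (62)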

5.2.2.1 Time, Space, and Storage Complexity When unrolling the T training iterations, Hy-
DRA’s fast approximation performs a p-dimensional (hyper)gradient calculation for each of the n training
instances. If HyDRA’s full version with Hessian vector products is used, Agarwal et al.’s [ABH17] Hessian
approximation algorithm estimates each HVP in O(p) time and space. Therefore, the fast and standard
versions of HyDRA both have full time complexity O(npT ) – same as TracIn, albeit with potentially much
worse constant factors.
Observe that each hypergradient $\tilde{h}_i^{(T)}$ only needs to be computed once and can be reused for each test
instance. Therefore, the fast and standard versions of HyDRA have incremental time complexity of just
n gradient dot products – O(np) complexity total. This incremental complexity is much faster than TracIn
and asymptotically equivalent to influence functions. In practice though, HyDRA’s incremental cost is much
lower than that of influence functions.
Alg. 4 requires storing vector $\tilde{h}_i^{(t)}$ throughout HyDRA's entire unrolling procedure, where each training
instance’s hypergradient takes O(p) space. To analyze all training instances simultaneously, HyDRA requires
O(np) total space. In contrast, TracIn only requires O(n + p) space to analyze all instances simultaneously.
This difference is substantial for large models and training sets. In cases where the fully-parallelized space
complexity is prohibitive, each training instance’s hypergradient can be analyzed separately resulting in a
reduced space complexity of O(p) for both fast and standard HyDRA.
Like TracIn, HyDRA requires storing model parameters Θ ⊆ {θ(0) , . . . , θ(T −1) } making its minimum
storage complexity O(pT ). Since hypergradients are reused for each test instance, they can be stored to
eliminate the need to recalculate them; this introduces an additional storage complexity of O(np). This
makes HyDRA’s total storage complexity O(pT + np).
Remark 8: Storing both the training checkpoints and hypergradients is unnecessary. Once all hypergra-
dients have been calculated, serialized training parameters Θ are no longer needed and can be discarded.
Therefore, a more typical storage complexity is O(pT ) or O(np) – both of which are still substantial.

5.2.2.2 Strengths and Weaknesses HyDRA and TracIn share many of the same strengths. For
example, HyDRA does not require assumptions of convexity or stationarity. Moreover, as a dynamic method,
HyDRA may be able to detect influential examples that are missed by static methods – in particular when
those instances have low loss at the end of training (see Section 5.3 for more discussion).
HyDRA also has some advantages over TracIn. For example, as shown in Alg. 2, TracIn requires that

each test instance be retraced through the entire training process. This significantly increases TracIn’s
incremental time complexity. In contrast, HyDRA only unrolls gradient descent for the training instances,
i.e., not the test instances. Hypergradient unrolling is a one-time cost for each training instance; this
upfront cost is amortized over all test instances. Once the hypergradients have been calculated, HyDRA
is much faster than TracIn – potentially by orders of magnitude. In addition, HyDRA’s overall design
allows it to natively support momentum with few additional changes. Integrating momentum into TracIn,
while theoretically possible, requires substantial algorithmic changes and makes TracIn substantially more
complicated. This would mitigate a core strength of TracIn – its simplicity.
HyDRA does have two weaknesses in comparison to TracIn. First, HyDRA’s standard (i.e., non-fast)
algorithm requires calculating many HVPs. Second, HyDRA’s O(np) space complexity is much larger than
the O(n + p) space complexity of other influence analysis methods (see Table 1). For large models, this
significantly worse space complexity may be prohibitive.

5.2.2.3 Related Methods The method most closely related to HyDRA is Hara et al.’s [HNM19]
SGD-influence. Both approaches estimate the leave-one-out influence by unrolling gradient descent using
empirical risk Hessians. There are, however, a few key differences. First, unlike HyDRA, Hara et al.
assume that the model and loss function are convex. Next, SGD-influence primarily applies unrolling to
quantify the Cook's distance, $\theta^{(T)} - \theta_{\mathcal{D} \setminus z_i}^{(T)}$. To better align their approach with dataset influence, Hara et al.
propose a surrogate (linear) influence estimator which they incrementally update throughout unrolling. This
means the full training process must be unrolled for each test instance individually, significantly increasing
SGD-influence’s incremental time complexity.
Terashita et al. [Ter+21] adapt the ideas of SGD-influence to estimate training data influence in gener-
ative adversarial networks (GANs).
Although proposed exclusively in the context of influence functions (Sec. 5.1.1.3), Schioppa et al.’s
[Sch+22] basic approach to scale up influence functions via faster Hessian calculation could similarly be
applied to speed up HyDRA’s standard (non-fast) algorithm.

5.3 Trade-off between Gradient Magnitude and Direction

This section details a limitation common to existing gradient-based influence estimators that can cause these
estimators to systematically overlook highly influential (groups of) training instances.
Observe that all gradient-based methods in this section rely on some vector dot product. For a dot
product to be large, one of two criteria must be met:
(1) The vector directions align (i.e., have high cosine similarity). More specifically, for influence analysis,
vectors pointing in similar directions are expected to encode similar information. This is the ideal case.
(2) Either vector has a large magnitude, e.g., ∥∇θ L(z; θ)∥. Large gradient magnitudes can occur for many
reasons, but the most common cause is that the instance is either incorrectly or not confidently predicted.
Across the training set, gradient magnitudes can vary by several orders of magnitude [SWS21]. To
overcome such a magnitude imbalance, training instances that actually influence a specific prediction may
need to have orders of magnitude better vector alignment. In reality, what commonly happens is that
incorrectly predicted or abnormal training instances appear highly influential to all test instances [SWS21].
Barshan et al. [BBD20] describe such training instances as globally influential. However, globally influential
training instances provide very limited insight into individual model predictions. As Barshan et al. note,
locally influential training instances are generally much more relevant and insightful when analyzing specific
predictions.
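To make this trade-off concrete, the short sketch below (Python/NumPy; the gradient vectors are synthetic stand-ins rather than the output of any particular estimator) decomposes a dot-product influence proxy into a magnitude term and a direction (cosine similarity) term. Once a training gradient's norm is a few orders of magnitude larger, it can dominate the dot product even when its direction is nearly orthogonal to the test gradient.

import numpy as np

rng = np.random.default_rng(0)
d = 1_000                                         # parameter dimension (illustrative)
g_test = rng.normal(size=d)                       # test-instance gradient (stand-in)
g_aligned = g_test + 0.5 * rng.normal(size=d)     # confidently fit, well-aligned training instance
g_outlier = 1e3 * rng.normal(size=d)              # abnormal instance: huge norm, random direction

def decompose(g_train, g_test):
    # Split the dot-product proxy into magnitude and direction components.
    dot = float(g_train @ g_test)
    cos = dot / (np.linalg.norm(g_train) * np.linalg.norm(g_test))
    return dot, cos

for name, g in [("aligned", g_aligned), ("outlier", g_outlier)]:
    dot, cos = decompose(g, g_test)
    print(f"{name:8s}  dot = {dot:12.1f}   cosine = {cos:+.3f}")
# The outlier's dot product is typically much larger in magnitude despite a near-zero
# cosine similarity, illustrating how gradient norms can swamp directional alignment.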
Relative Influence To yield a more semantically meaningful influence ranking, Barshan et al. propose the θ-relative influence functions estimator (RelatIF), which normalizes Koh and Liang's [KL17] influence functions' estimator by the HVP magnitude $\bigl\lVert (H_{\theta}^{(T)})^{-1} \nabla_{\theta} L(z_i; \theta^{(T)}) \bigr\rVert$.²⁸ Formally,

$$\hat{I}_{\text{RelatIF}}(z_i, z_{\text{te}}) := \frac{\hat{I}_{\text{IF}}(z_i, z_{\text{te}})}{\bigl\lVert (H_{\theta}^{(T)})^{-1} \nabla_{\theta} L(z_i; \theta^{(T)}) \bigr\rVert} = \frac{1}{n} \, \frac{\nabla_{\theta} L(z_{\text{te}}; \theta^{(T)})^{\intercal} (H_{\theta}^{(T)})^{-1} \nabla_{\theta} L(z_i; \theta^{(T)})}{\bigl\lVert (H_{\theta}^{(T)})^{-1} \nabla_{\theta} L(z_i; \theta^{(T)}) \bigr\rVert}. \tag{66}$$

RelatIF's normalization prevents the training gradient magnitude $\lVert \nabla_{\theta} L(z_i; \theta^{(T)}) \rVert$ from dominating the influence estimate.
RelatIF’s biggest limitation is the need to estimate an HVP for every training instance. As discussed in
Section 5.1.1.2, HVP estimation is expensive and often highly inaccurate in deep models. To work around
these issues in their evaluation of RelatIF, Barshan et al. use either very small neural models or just consider
a large model’s final layer, both of which can be problematic.
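The following is a minimal sketch of the normalization in Eq. (66). It assumes the inverse-HVP vectors $(H_{\theta}^{(T)})^{-1}\nabla_{\theta}L(z_i;\theta^{(T)})$ have already been estimated for every training instance (e.g., via the approximations discussed in Section 5.1.1.2); the array names and layout are illustrative rather than part of Barshan et al.'s implementation.

import numpy as np

def influence_functions(g_test, ihvp_train, n):
    # Koh & Liang-style estimates: (1/n) * g_test^T H^{-1} grad L(z_i) for every training instance.
    # ihvp_train: (n_train, p) array whose rows are the precomputed inverse-HVP vectors.
    return (ihvp_train @ g_test) / n

def relatif(g_test, ihvp_train, n):
    # RelatIF (Eq. 66): the influence functions estimate divided by each row's iHVP magnitude.
    norms = np.linalg.norm(ihvp_train, axis=1)
    return influence_functions(g_test, ihvp_train, n) / np.clip(norms, 1e-12, None)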
Renormalized Influence Hammoudeh and Lowd [HL22] make a similar observation as Barshan et al.
but motivate it differently. By the chain rule, gradient vectors decompose as
$$\nabla_{\theta} L(z; \theta) := \frac{\partial \ell(f(x;\theta),\, y)}{\partial \theta} = \frac{\partial \ell(f(x;\theta),\, y)}{\partial f(x;\theta)} \, \frac{\partial f(x;\theta)}{\partial \theta}. \tag{67}$$

Hammoudeh and Lowd note that for many common loss functions (e.g., squared, binary cross-entropy), the loss value $\ell(f(x;\theta), y)$ induces a strict ordering over the loss norm $\bigl\lVert \frac{\partial \ell(f(x;\theta),\, y)}{\partial f(x;\theta)} \bigr\rVert$. Hammoudeh and Lowd term this phenomenon a low-loss penalty, where confidently predicted training instances have smaller gradient magnitudes and by consequence consistently appear uninfluential to gradient-based influence estimators.
To account for the low-loss penalty, Hammoudeh and Lowd [HL22] propose renormalized influence which
replaces all gradient vectors – both training and test – in an influence estimator with the corresponding unit
vector. Renormalization can be applied to any gradient-based estimator. For example, Hammoudeh and
Lowd observe that renormalized TracInCP, which they term gradient aggregated similarity (GAS),

$$\hat{I}_{\text{GAS}}(z_i, z_{\text{te}}) := \sum_{t \in T} \eta^{(t)} \, \frac{\nabla_{\theta} L(z_i; \theta^{(t-1)})^{\intercal} \, \nabla_{\theta} L(z_{\text{te}}; \theta^{(t-1)})}{\bigl\lVert \nabla_{\theta} L(z_i; \theta^{(t-1)}) \bigr\rVert \, \bigl\lVert \nabla_{\theta} L(z_{\text{te}}; \theta^{(t-1)}) \bigr\rVert}, \tag{68}$$

is particularly effective at generating influence rankings. Hammoudeh and Lowd also provide a renormalized version of influence functions,

$$\hat{I}_{\text{RenormIF}}(z_i, z_{\text{te}}) := \frac{\hat{I}_{\text{IF}}(z_i, z_{\text{te}})}{\bigl\lVert \nabla_{\theta} L(z_i; \theta^{(T)}) \bigr\rVert} = \frac{1}{n} \, \frac{\nabla_{\theta} L(z_{\text{te}}; \theta^{(T)})^{\intercal} (H_{\theta}^{(T)})^{-1} \nabla_{\theta} L(z_i; \theta^{(T)})}{\bigl\lVert \nabla_{\theta} L(z_i; \theta^{(T)}) \bigr\rVert}. \tag{69}$$
Since renormalized influence functions do not require estimating additional HVPs, they are considerably faster than RelatIF. Renormalized influence functions also avoid the additional error associated with estimating RelatIF's extra HVPs.²⁹
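Below is a minimal sketch of GAS (Eq. 68). It assumes the per-checkpoint gradients of training instance $z_i$ and test instance $z_{\text{te}}$, along with the corresponding learning rates, have already been collected (as in TracInCP); the function signature is illustrative and not Hammoudeh and Lowd's released implementation.

import numpy as np

def gas(train_grads, test_grads, learning_rates):
    # Gradient aggregated similarity (Eq. 68): renormalized TracInCP.
    # train_grads / test_grads: one (p,)-dimensional gradient per checkpoint for z_i / z_te.
    # learning_rates: the learning rate eta^(t) associated with each checkpoint.
    score = 0.0
    for g_i, g_te, eta in zip(train_grads, test_grads, learning_rates):
        denom = np.linalg.norm(g_i) * np.linalg.norm(g_te)
        if denom > 0:                         # skip degenerate zero gradients
            score += eta * float(g_i @ g_te) / denom
    return score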
This section should not be interpreted to mean that gradient magnitude is unimportant for influence
analysis. On the contrary, gradient magnitude has a significant effect on training. However, the approxi-
mations made by existing influence estimators often overemphasize gradient magnitude, leading to influence rankings that are not semantically meaningful.
Remark 9: Barshan et al.’s RelatIF and Hammoudeh and Lowd’s renormalization do not change the
corresponding influence estimators’ time and space complexities.

²⁸ Note that this HVP is different than $s_{\text{test}} := (H_{\theta}^{(T)})^{-1} \nabla_{\theta} L(z_{\text{te}}; \theta^{(T)})$ in Eq. (34).
²⁹ Hammoudeh and Lowd [HL22] also provide renormalized versions of representer point and TracIn, which are omitted here.

6 Applications of Influence Analysis

Section 3.3 discusses how a few topics (formally) relate to influence analysis. This section provides an extended discussion of influence analysis's applications. Specifically, this section focuses on higher-level learning
tasks as opposed to the specific application environments where influence analysis has been used, including
toxic speech detection [HT21], social network graph labeling [Zha+21c], user engagement detection [LLY21],
medical imaging annotation [Bra+22], etc.
First, data cleaning aims to improve a machine learning model’s overall performance by removing
“bad” training data. These “bad” instances arise due to disparate non-malicious causes including hu-
man/algorithmic labeling error, non-representative instances, noisy features, missing features, etc. [Kri+16;
LDG18; KW19]. Intuitively, “bad” training instances are generally anomalous, and their features clash with
the feature distribution of typical “clean” data [Woj+16]. In practice, overparameterized neural networks
commonly memorize these “bad” instances to achieve zero training loss [HNM19; FZ20; Pru+20; Thi+22].
As explained in Section 3.2, memorization can be viewed as the influence of a training instance on itself.
Therefore, influence analysis can be used to detect these highly memorized training instances. These mem-
orized “bad” instances are then either removed from the training data or simply relabeled [KSH22] and the
model retrained.
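To illustrate this workflow, the sketch below ranks training instances by an estimated self-influence score (produced by any of the pointwise estimators in Section 5) and flags the most-memorized candidates for relabeling or removal; the 1% review fraction is an arbitrary placeholder, not a value prescribed by the cited methods.

import numpy as np

def flag_suspect_instances(self_influence, review_fraction=0.01):
    # Return indices of the most self-influential (i.e., most memorized) training instances.
    # self_influence: one self-influence estimate per training instance.
    n_flag = max(1, int(review_fraction * len(self_influence)))
    ranked = np.argsort(self_influence)[::-1]       # descending by estimated memorization
    return ranked[:n_flag]

# Example usage with synthetic scores: flag the top 1% for inspection before retraining.
scores = np.random.default_rng(1).gamma(shape=1.0, size=10_000)
suspects = flag_suspect_instances(scores)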
Poisoning and backdoor attacks craft malicious training instances that manipulate a model to align with
some attacker objective. For example, a company may attempt to trick a spam filter so all emails sent by
a competitor are erroneously classified as spam [Sha+18a]. Obviously, only influential (malicious) training
instances affect a model’s prediction. Some training set attacks rely on influence analysis to craft better
(i.e., more influential) poison instances [FGL20; Jag+21; Oh+22]. Since most training set attacks do not
assume the adversary knows training’s random seed or even necessarily the target model’s architecture,
poison instances are crafted to maximize their expected group influence [Che+17].
Influence and memorization analysis have also been used to improve membership inference attacks,
where the adversary attempts to extract sensitive training data provided only a pretrained (language)
model [Dem+19; CG22].
Training set attack defenses detect and mitigate poisoning and backdoor attacks [Li+22]. Since malicious
training instances must be influential to achieve the attacker’s objective, defending against adversarial attacks
reduces to identifying abnormally influential training instances. If attackers are constrained in the number
of training instances they can insert [Wal+21; YHL23], the target of a training set attack can be identified
by searching for test instances that have a few exceptionally influential training instances [HL22]. The
training set attack mitigation removes these anomalously influential instances from the training data and
then retrains the model [Wan+19]. In addition, influence estimation has been applied to the related task of
evasion attack detection, where the training set is pristine and only test instances are perturbed [CSG20].
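One simple heuristic in this spirit – not the exact procedure of [HL22] or [Wan+19] – scores each test instance by how concentrated its influence mass is in a handful of training instances; the sketch below assumes a precomputed matrix of influence estimates, and the threshold for flagging a test instance is left to the practitioner.

import numpy as np

def target_anomaly_scores(influence, k=5):
    # influence: (n_test, n_train) matrix of estimated influence values.
    # Score each test instance by the share of total (absolute) influence mass
    # concentrated in its k most influential training instances. Scores near 1
    # suggest a few training instances dominate that prediction.
    top_k = np.sort(np.abs(influence), axis=1)[:, -k:].sum(axis=1)
    total = np.abs(influence).sum(axis=1) + 1e-12
    return top_k / total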
Algorithmic fairness promotes techniques that enable machine learning models to make decisions free
of prejudices and biases based on inherited characteristics such as race, religion, and gender [Meh+21]. A
classic example of model unfairness is the COMPAS software tool, which estimated the recidivism risk of
incarcerated individuals. COMPAS was shown to be biased against black defendants, falsely flagging them
as future criminals at twice the rate of white defendants [Ang+16]. Widespread adoption of algorithmic
decision making in domains critical to human safety and well-being is predicated on the public’s perception
and understanding of the algorithms’ inherent ethical principles and fairness [Awa+18]. Yet, how to quantify
the extent to which an algorithm is “fair” remains an area of active study [Dwo+12; GH19; Sax+19]. Black
and Fredrikson [BF21] propose leave-one-out unfairness as a measure of a prediction’s fairness. Intuitively,
when a model’s decision (e.g., not granting a loan, hiring an employee) is fundamentally changed by the
inclusion of a single instance in a large training set, such a decision may be viewed as unfair or even
capricious. Leave-one-out influence is therefore useful to measure and improve a model’s robustness and
fairness.
Explainability attempts to make a black-box model’s decisions understandable by humans [BH21]. Trans-
parent explanations are critical to achieving user trust of and satisfaction with ML systems [LDA09; Kiz16;
Zho+19]. Example-based explanations communicate why a model made a particular prediction via visual
examples [CJH19; SWS21] – e.g., training images – as social science research has shown that humans can
understand complex ideas using only examples [RHS09; Ren14]. Influence estimation can assist in the se-
lection of canonical training instances that are particularly important for a given class in general or a single
test prediction specifically. Similarly, normative explanations – which collectively establish a “standard” for
a given class [CJH19] – can be selected from those training instances with the highest average influence on
a held-out validation set. In cases where a test instance is misclassified, influence analysis can identify those
training instances that most influenced the misprediction.
Subsampling reduces the computational requirements of large datasets by training models using only a
subset of the training data [TB18]. Existing work has shown that high-quality training subsets can be created
by greedily selecting training instances based on their overall influence [Kha+19; Wan+20]. Under mild
assumptions, Wang et al. [Wan+20] even show that, in expectation, influence-based subsampling performs
at least as well as training on the full training set.
Annotating unlabeled data can be expensive – in particular for domains like medical imaging where the
annotators must be domain experts [Bra+22]. Compared to labeling instances u.a.r., active learning reduces
labeling costs by prioritizing annotation of particularly salient unlabeled data. In practice, active learning
often simplifies to maximizing the add-one-in influence where each unlabeled instance’s marginal influence
must be estimated. Obviously, retraining for each possible unlabeled instance combination has exponential
complexity and is intractable. Instead, a greedy strategy can be used where the influence of each unlabeled
instance is estimated to identify the next candidate to label [Liu+21; Jia+21a; Zha+21c].
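A hedged sketch of such a greedy loop appears below; estimate_marginal_influence is a placeholder for whichever add-one-in influence estimator is used, and the interface is illustrative rather than drawn from the cited works.

import numpy as np

def greedy_label_selection(unlabeled_ids, estimate_marginal_influence, budget):
    # Iteratively pick the unlabeled instance with the largest estimated marginal
    # (add-one-in) influence. estimate_marginal_influence(i) -> float is a placeholder
    # for the chosen estimator.
    remaining = list(unlabeled_ids)
    selected = []
    for _ in range(min(budget, len(remaining))):
        scores = [estimate_marginal_influence(i) for i in remaining]
        selected.append(remaining.pop(int(np.argmax(scores))))
        # In practice, the estimator would be refreshed after the newly labeled
        # instance is added to the training set.
    return selected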
To enhance the benefit of limited labeled data, influence analysis has been used to create better aug-
mented training data [Lee+20; Oh+21]. These influence-guided data augmentation methods outperform
traditional random augmentations, albeit with a higher computational cost.

7 Future Directions

The trend of consistently increasing model complexity and opacity will likely continue for the foreseeable
future. Simultaneously, there are increased societal and regulatory demands for algorithmic transparency
and explainability. Influence analysis sits at the nexus of these competing trajectories [Zho+19], which
points to the field growing in importance and relevance. This section identifies important directions we
believe influence analysis research should take going forward.
Emphasizing Group Influence over Pointwise Influence: Most existing methods target pointwise influence,
which apportions credit for a prediction to training instances individually. However, for overparameterized
models trained on large datasets, only the tails of the data distribution are heavily influenced by an individual
instance [Fel20b]. Instead, most predictions are moderately influenced by multiple training instances working
in concert [FZ20; Das+21; BYF20].
As an additional complicating factor, pointwise influence within data-distribution modes is often approximately supermodular, meaning the marginal effect of a training instance's deletion increases as more instances
from a group are removed [HL22]. This makes pointwise influence a particularly poor choice for understand-
ing most model behavior. To date, very limited work has systematically studied group influence [Koh+19;
BYF20; HL22]. Better group influence estimators could be immediately applied in various domains such as
poisoning attacks, coreset selection, and model explainability.
Certified Influence Estimation: Certified defenses against poisoning and backdoor attacks guarantee that
deleting a fixed number of instances from the training data will not change a model’s prediction [SKL17;
LF21; Jia+22; WLF22; HL23]. These methods can be viewed as upper bounding the training data’s group
influence – albeit very coarsely. Most certified poisoning defenses achieve their bounds by leveraging “tricks”
associated with particular model architectures (e.g., instance-based learners [Jia+22] and ensembles [LF21;
WLF22; HL24a; Rez+23]) as opposed to a detailed analysis of a prediction’s stability [HL23]. With limited
exception [Jia+19a], today’s influence estimators do not provide any meaningful guarantee of their accuracy.
Rather, most influence estimates should be viewed as only providing – at best – guidance on an instance’s
“possible influence.” Guaranteed or even probabilistic bounds on an instance’s influence would enable
influence estimation to be applied in settings where more than a “heuristic approximation” is required [HL23].
Improved Scalability: Influence estimation is slow. Analyzing each training instance's influence on a single
test instance can take several hours or more [BBD20; Kob+20; Guo+21; HL22]. For influence estimation to
be a practical tool, it must be at least an order of magnitude faster. Heuristic influence analysis speed-ups
could prove very useful [Guo+21; Sch+22]. However, the consequences (and limitations) of any empirical
shortcuts need to be thoroughly tested, verified, and understood. Similarly, limited existing work has
specialized influence methods to particular model classes [Jia+21a] or data modalities [Yeh+22]. While
application-agnostic influence estimators are useful, their flexibility limits their scalability and accuracy.
Both of these performance metrics may significantly improve via increased influence estimator specialization.
Surrogate Influence and Influence Transferability: An underexplored opportunity to improve influence anal-
ysis lies in the use of surrogate models [Sha+18b; Jia+19b; Jia+21a; BHL23]. For example, linear surrogates
have proven quite useful for model explainability [LL17]. While using only a model's linear layer as a surrogate may be “too reductive” [Yeh+22], it remains an open question whether other compact surrogate models are a viable option. Any surrogate method must be accompanied by rigorous empirical evaluation to identify any
risks and “blind spots” the surrogate may introduce [Rud19].
Increased Evaluation Diversity: Influence analysis has the capability to provide salient insights into why
models behave as they do [FZ20]. As an example, Black and Fredrikson [BF21] demonstrate how influence
analysis can identify potential unfairness in an algorithmic decision. However, influence estimation evaluation
is too often superficial and focuses on a very small subset of possible applications. For instance, most
influence estimation evaluation focuses primarily on contrived data cleaning and mislabeled training data
experiments [Woj+16; KL17; Kha+19; GZ19; Yeh+18; Pru+20; Che+21; Ter+21; KS21; SWS21; BHL23;
Yeh+22; KSH22; KZ22]. It is unclear how these experiments translate into real-world or adversarial settings,
with recent work pointing to generalization fragility [BPF21; Bae+22; Sch+23]. We question whether these
data cleaning experiments – where specialized methods already exist [Kri+16; KW19; Wan+19] – adequately
satisfy influence analysis’s stated promise of providing “understanding [of] black-box predictions” [KL17].
Objective Over Subjective Evaluation Criteria: A common trope when evaluating an influence analysis
method is to provide a test example and display training instances the estimator identified as most similar
or dissimilar. These “eye test” evaluations are generally applied to vision datasets [KL17; Yeh+18; Jia+19a;
Pru+20; FZ20] and to a limited extent other modalities. Such experiments are unscientific. They provide
limited meaningful insight given the lack of a ground truth by which to judge the results. Most readers
do not have detailed enough knowledge of a dataset to know whether the selected instances are especially
representative. Rather, there may exist numerous training instances that are much more similar to the target
that the influence estimator overlooked. Moreover, such visual assessments are known to be susceptible to
confirmation and expectancy biases [Mah77; NDM13; KDK13].
Influence analysis evaluation should focus on experiments that are quantifiable and verifiable w.r.t. a
ground truth.

8 Conclusions

While influence analysis has received increased attention in recent years, significant progress remains to be
made. Influence estimation is computationally expensive and can be prone to inaccuracy. Going forward,
fast certified influence estimators are needed. Nonetheless, despite these shortcomings, existing applications
already demonstrate influence estimation’s capabilities and promise.
This work reviews numerous methods with different perspectives on – and even definitions of – training
data influence. It would be a mistake to view this diversity of approaches as a negative. While no single
influence analysis method can be applied to all situations, most use cases should have at least one method
that fits well. An obvious consequence then is the need for researchers and practitioners to understand the
strengths and limitations of the various methods so as to know which method best fits their individual use
case. This survey is intended to provide that insight from both empirical and theoretical viewpoints.

Acknowledgments

This work was supported by a grant from the Air Force Research Laboratory and the Defense Advanced
Research Projects Agency (DARPA) — agreement number FA8750-16-C-0166, subcontract K001892-00-S05,
as well as a second grant from DARPA, agreement number HR00112090135.

References

[ABH17] Naman Agarwal, Brian Bullins, and Elad Hazan. “Second-Order Stochastic Optimization for
Machine Learning in Linear Time”. In: Journal of Machine Learning Research 18.1 (Jan.
2017), pp. 4148–4187. url: https://arxiv.org/abs/1602.03943.
[AKA91] David W. Aha, Dennis Kibler, and Marc K. Albert. “Instance-Based Learning Algorithms”.
In: Machine Learning 6.1 (1991), pp. 37–66. url: https://link.springer.com/article/
10.1007/bf00153759.
[Ang+16] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. “Machine Bias”. In: ProPublica (May 2016). url: https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.
[Arn51] Walter Edwin Arnoldi. “The Principle of Minimized Iterations in the Solution of the Matrix
Eigenvalue Problem”. In: Quarterly of Applied Mathematics 9.1 (1951), pp. 17–29.
[Arp+17] Devansh Arpit, Stanislaw Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio,
Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and
Simon Lacoste-Julien. “A Closer Look at Memorization in Deep Networks”. In: Proceedings
of the 34th International Conference on Machine Learning. ICML’17. 2017. url: https://
arxiv.org/abs/1706.05394.
[Awa+18] Edmond Awad, Sohan Dsouza, Richard Kim, Jonathan Schulz, Joseph Henrich, Azim Shariff,
Jean-François Bonnefon, and Iyad Rahwan. “The Moral Machine Experiment”. In: Nature
563.7729 (2018), pp. 59–64.
[BLK17] Olivier Bachem, Mario Lucic, and Andreas Krause. “Practical Coreset Constructions for Ma-
chine Learning”. In: (2017). arXiv: 1703.06476 [stat.ML].
[Bae+22] Juhan Bae, Nathan Ng, Alston Lo, Marzyeh Ghassemi, and Roger Grosse. “If Influence Func-
tions are the Answer, Then What is the Question?” In: Proceedings of the 36th Conference
on Neural Information Processing Systems. NeurIPS’22. Curran Associates, Inc., 2022. url:
https://arxiv.org/abs/2209.05364.
[Ban65] John F. III Banzhaf. “Weighted Voting Doesn’t Work: A Mathematical Analysis”. In: Rutgers
Law Review 19.2 (1965), pp. 317–343.
[BBD20] Elnaz Barshan, Marc-Etienne Brunet, and Gintare Karolina Dziugaite. “RelatIF: Identifying
Explanatory Training Samples via Relative Influence”. In: Proceedings of the 23rd International
Conference on Artificial Intelligence and Statistics. AISTATS’20. 2020. url: https://arxiv.
org/abs/2003.11630.
[Bar+20] Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. “Benign Overfitting
in Linear Regression”. In: Proceedings of the National Academy of Sciences 117.48 (2020),
pp. 30063–30070. url: https://arxiv.org/abs/1906.11300.
[BCC19] Christine Basta, Marta R. Costa-jussà, and Noe Casas. “Evaluating the Underlying Gender
Bias in Contextualized Word Embeddings”. In: Proceedings of the First Workshop on Gender
Bias in Natural Language Processing. Florence, Italy: Association for Computational Linguis-
tics, 2019. url: https://arxiv.org/abs/1904.08783.

[BPF21] Samyadeep Basu, Phil Pope, and Soheil Feizi. “Influence Functions in Deep Learning Are
Fragile”. In: Proceedings of the 9th International Conference on Learning Representations.
ICLR’21. Virtual Only, 2021. url: https://arxiv.org/abs/2006.14651.
[BYF20] Samyadeep Basu, Xuchen You, and Soheil Feizi. “On Second-Order Group Influence Functions
for Black-Box Predictions”. In: Proceedings of the 37th International Conference on Machine
Learning. ICML’20. Virtual Only: PMLR, 2020. url: https://arxiv.org/abs/1911.00418.
[BT74] Albert E. Beaton and John W. Tukey. “The Fitting of Power Series, Meaning Polynomials,
Illustrated on Band-Spectroscopic Data”. In: Technometrics 16.2 (1974), pp. 147–185.
[BP21] Vaishak Belle and Ioannis Papantonis. “Principles and Practice of Explainable Machine Learn-
ing”. In: Frontiers in Big Data 4 (July 2021), pp. 688969–688969. url: https://arxiv.org/
abs/2009.11698.
[Ben+21] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. “On
the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” In: Proceedings of the
2021 ACM Conference on Fairness, Accountability, and Transparency. FAccT’21. New York,
NY, USA: Association for Computing Machinery, 2021, pp. 610–623. isbn: 9781450383097.
url: https://dl.acm.org/doi/10.1145/3442188.3445922.
[BNL12] Battista Biggio, Blaine Nelson, and Pavel Laskov. “Poisoning Attacks against Support Vec-
tor Machines”. In: Proceedings of the 29th International Conference on Machine Learning.
ICML’12. Edinburgh, Great Britain: PMLR, 2012. url: https://arxiv.org/abs/1206.6389.
[Bil22] Jeffrey Bilmes. “Submodularity in Machine Learning and Artificial Intelligence”. In: (2022).
eprint: 2202.00132 (cs.LG). url: https://arxiv.org/abs/2202.00132.
[BF21] Emily Black and Matt Fredrikson. “Leave-One-Out Unfairness”. In: Proceedings of the 2021
ACM Conference on Fairness, Accountability, and Transparency. FAccT’21. 2021. url: https:
//arxiv.org/abs/2107.10171.
[BR92] Avrim L. Blum and Ronald L. Rivest. “Training a 3-Node Neural Network is NP-Complete”.
In: Neural Networks 5.1 (1992), pp. 117–127.
[BL23] Sebastian Bordt and Ulrike von Luxburg. “From Shapley Values to Generalized Additive Mod-
els and back”. In: Proceedings of The 26th International Conference on Artificial Intelligence
and Statistics. AISTATS’23. 2023. url: https://arxiv.org/abs/2209.04012.
[BMK20] Zalán Borsos, Mojmir Mutny, and Andreas Krause. “Coresets via Bilevel Optimization for
Continual Learning and Streaming”. In: Proceedings of the 34th Conference on Neural Infor-
mation Processing Systems. NeurIPS’20. 2020. url: https://arxiv.org/abs/2006.03875.
[BV04] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press,
Mar. 2004. isbn: 0521833787.
[Bra+22] Joschka Braun, Micha Kornreich, JinHyeong Park, Jayashri Pawar, James Browning, Richard
Herzog, Benjamin Odry, and Li Zhang. “Influence Based Re-Weighing for Labeling Noise in
Medical Imaging”. In: Proceedings of the 19th IEEE International Symposium on Biomedical
Imaging. ISBI’22. 2022.
[BHL23] Jonathan Brophy, Zayd Hammoudeh, and Daniel Lowd. “Adapting and Evaluating Influence-
Estimation Methods for Gradient-Boosted Decision Trees”. In: Journal of Machine Learning
Research 24 (2023), pp. 1–48. url: https://arxiv.org/abs/2205.00359.
[BL21] Jonathan Brophy and Daniel Lowd. “Machine Unlearning for Random Forests”. In: Proceedings
of the 38th International Conference on Machine Learning. ICML’21. 2021. url: https://
arxiv.org/abs/2009.05567.

[Bro+20] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari-
wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal,
Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel
Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin,
Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford,
Ilya Sutskever, and Dario Amodei. “Language Models are Few-Shot Learners”. In: Proceed-
ings of the 34th Conference on Neural Information Processing Systems. NeurIPS’20. Curran
Associates, Inc., 2020. url: https://arxiv.org/abs/2005.14165.
[BH21] Nadia Burkart and Marco F. Huber. “A Survey on the Explainability of Supervised Ma-
chine Learning”. In: Journal Artificial Intelligence Research 70 (May 2021), pp. 245–317. url:
https://arxiv.org/abs/2011.07876.
[CJH19] Carrie J. Cai, Jonas Jongejan, and Jess Holbrook. “The Effects of Example-Based Explanations
in a Machine Learning Interface”. In: Proceedings of the 24th International Conference on
Intelligent User Interfaces. IUI’19. 2019, pp. 258–262. url: https://dl.acm.org/doi/10.
1145/3301275.3302289.
[Che+22] Ruoxin Chen, Zenan Li, Jie Li, Chentao Wu, and Junchi Yan. “On Collective Robustness of
Bagging Against Data Poisoning”. In: Proceedings of the 39th International Conference on
Machine Learning. ICML’22. PMLR, 2022. url: https://arxiv.org/abs/2205.13176.
[Che+17] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted Backdoor Attacks
on Deep Learning Systems Using Data Poisoning. 2017. arXiv: 1712.05526 [cs.CR].
[Che+21] Yuanyuan Chen, Boyang Li, Han Yu, Pengcheng Wu, and Chunyan Miao. “HyDRA: Hyper-
gradient Data Relevance Analysis for Interpreting Deep Neural Networks”. In: Proceedings of
the 35th AAAI Conference on Artificial Intelligence. AAAI’21. Virtual Only: Association for
the Advancement of Artificial Intelligence, 2021. url: https://arxiv.org/abs/2102.02515.
[CG22] Gilad Cohen and Raja Giryes. “Membership Inference Attack Using Self Influence Functions”.
In: (2022). arXiv: 2205.13680 [cs.LG].
[CSG20] Gilad Cohen, Guillermo Sapiro, and Raja Giryes. “Detecting Adversarial Samples Using Influ-
ence Functions and Nearest Neighbors”. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition. CVPR’20. Virtual Only, 2020. url: https://arxiv.org/
abs/1909.06872.
[Coo77] R. Dennis Cook. “Detection of Influential Observation in Linear Regression”. In: Technometrics
19.1 (1977), pp. 15–18.
[CHW82] R. Dennis Cook, Norton Holschuh, and Sanford Weisberg. “A Note on an Alternative Outlier
Model”. In: Journal of the Royal Statistical Society. Series B (Methodological) 44.3 (1982),
pp. 370–376. (Visited on 10/22/2022).
[CW82] R. Dennis Cook and Sanford Weisberg. Residuals and Influence in Regression. New York:
Chapman and Hall, 1982. isbn: 041224280X.
[DAm+20] Alexander D’Amour, Katherine A. Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex
Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D. Hoffman, Farhad
Hormozdiari, Neil Houlsby, Shaobo Hou, Ghassen Jerfel, Alan Karthikesalingam, Mario Lu-
cic, Yi-An Ma, Cory Y. McLean, Diana Mincu, Akinori Mitani, Andrea Montanari, Zachary
Nado, Vivek Natarajan, Christopher Nielson, Thomas F. Osborne, Rajiv Raman, Kim Ra-
masamy, Rory Sayres, Jessica Schrouff, Martin Seneviratne, Shannon Sequeira, Harini Suresh,
Victor Veitch, Max Vladymyrov, Xuezhi Wang, Kellie Webster, Steve Yadlowsky, Taedong
Yun, Xiaohua Zhai, and D. Sculley. Underspecification Presents Challenges for Credibility in
Modern Machine Learning. 2020. arXiv: 2011.03395 [cs.LG].
[DG23] Zheng Dai and David K. Gifford. Training Data Attribution for Diffusion Models. 2023. arXiv:
2306.02174 [stat.ML].

[Das+21] Soumi Das, Arshdeep Singh, Saptarshi Chatterjee, Suparna Bhattacharya, and Sourangshu
Bhattacharya. “Finding High-Value Training Data Subset through Differentiable Convex Pro-
gramming”. In: Proceedings of the 2021 European Conference on Machine Learning and Prin-
ciples and Practice of Knowledge Discovery in Databases. ECML PKDD’21. 2021. url: https:
//arxiv.org/abs/2104.13794.
[DG14] Alex Davies and Zoubin Ghahramani. The Random Forest Kernel and Other Kernels for Big
Data from Random Partitions. 2014. arXiv: 1402.4293 [cs.LG].
[Dem+19] Ambra Demontis, Marco Melis, Maura Pintor, Matthew Jagielski, Battista Biggio, Alina
Oprea, Cristina Nita-Rotaru, and Fabio Roli. “Why Do Adversarial Attacks Transfer? Explain-
ing Transferability of Evasion and Poisoning Attacks”. In: Proceedings of the 28th USENIX
Security Symposium. USENIX’19. Renton, Washington, USA, 2019. url: https://arxiv.
org/abs/1809.02861.
[Den+09] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. “ImageNet: A Large-
Scale Hierarchical Image Database”. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition. CVPR’09. 2009, pp. 248–255.
[DP94] Xiaotie Deng and Christos H. Papadimitriou. “On the Complexity of Cooperative Solution
Concepts”. In: Mathematics of Operations Research 19.2 (May 1994), pp. 257–266.
[Dev+19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training
of Deep Bidirectional Transformers for Language Understanding”. In: Proceedings of the 2019
Conference of the North American Chapter of the Association for Computational Linguis-
tics. ACL’19. Minneapolis, Minnesota: Association for Computational Linguistics, 2019. url:
https://arxiv.org/abs/1810.04805.
[Dis+21] Michael Diskin, Alexey Bukhtiyarov, Max Ryabinin, Lucile Saulnier, Quentin Lhoest, Anton
Sinitsin, Dmitry Popov, Dmitriy Pyrkin, Maxim Kashirin, Alexander Borzunov, Albert Vil-
lanova del Moral, Denis Mazur, Ilia Kobelev, Yacine Jernite, Thomas Wolf, and Gennady
Pekhimenko. “Distributed Deep Learning in Open Collaborations”. In: Proceedings of the
35th Conference on Neural Information Processing Systems. NeurIPS’21. 2021. url: https:
//arxiv.org/abs/2106.10207.
[Dos+21] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly,
Jakob Uszkoreit, and Neil Houlsby. “An Image is Worth 16x16 Words: Transformers for Im-
age Recognition at Scale”. In: Proceedings of the 9th International Conference on Learning
Representations. ICLR’21. Virtual Only, 2021. url: https://arxiv.org/abs/2010.11929.
[Dub75] Pradeep Dubey. “On the Uniqueness of the Shapley Value”. In: International Journal of Game
Theory 4.3 (Sept. 1975), pp. 131–139.
[DNW81] Pradeep Dubey, Abraham Neyman, and Robert James Weber. “Value Theory without Effi-
ciency”. In: Mathematics of Operations Research 6.1 (Feb. 1981), pp. 122–128.
[DHS11] John Duchi, Elad Hazan, and Yoram Singer. “Adaptive Subgradient Methods for Online Learn-
ing and Stochastic Optimization”. In: Journal of Machine Learning Research 12 (July 2011),
pp. 2121–2159. url: https://jmlr.org/papers/v12/duchi11a.html.
[Dwo+12] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. “Fairness
through Awareness”. In: Proceedings of the 3rd Innovations in Theoretical Computer Science
Conference. ITCS’12. 2012. url: https://arxiv.org/abs/1104.3913.
[Eis+22] Thorsten Eisenhofer, Doreen Riepel, Varun Chandrasekaran, Esha Ghosh, Olga Ohrimenko,
and Nicolas Papernot. Verifiable and Provably Secure Machine Unlearning. 2022. arXiv: 2210.
09126 [cs.LG].
[EGH17] Rajmadhan Ekambaram, Dmitry B. Goldgof, and Lawrence O. Hall. “Finding Label Noise
Examples in Large Scale Datasets”. In: Proceedings of the 2017 IEEE International Conference
on Systems, Man, and Cybernetics. SMC’17. 2017. doi: 10.1109/SMC.2017.8122985.

[Ell76] Jonas H. Ellenberg. “Testing for a Single Outlier from a General Linear Regression”. In:
Biometrics 32.3 (1976), pp. 637–645. (Visited on 10/22/2022).
[FGL20] Minghong Fang, Neil Zhenqiang Gong, and Jia Liu. “Influence Function based Data Poison-
ing Attacks to Top-N Recommender Systems”. In: Proceedings of the Web Conference 2020.
WWW’20. 2020. url: https://arxiv.org/abs/2002.08025.
[Fel20a] Dan Feldman. “Introduction to Core-sets: an Updated Survey”. In: (2020). arXiv: 2011.09384
[cs.LG].
[Fel20b] Vitaly Feldman. “Does Learning Require Memorization? A Short Tale about a Long Tail”.
In: Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing.
STOC’20. 2020. url: https://arxiv.org/abs/1906.05271.
[FZ20] Vitaly Feldman and Chiyuan Zhang. “What Neural Networks Memorize and Why: Discovering
the Long Tail via Influence Estimation”. In: Proceedings of the 34th Conference on Neural
Information Processing Systems. NeurIPS’20. Virtual Only: Curran Associates, Inc., 2020.
url: https://arxiv.org/abs/2008.03703.
[Fow+21] Liam Fowl, Micah Goldblum, Ping-yeh Chiang, Jonas Geiping, Wojtek Czaja, and Tom Gold-
stein. “Adversarial Examples Make Strong Poisons”. In: Proceedings of the 35th Conference on
Neural Information Processing Systems. NeurIPS’21. Virtual Only: Curran Associates, Inc.,
2021. url: https://arxiv.org/abs/2106.10807.
[GZ19] Amirata Ghorbani and James Zou. “Data Shapley: Equitable Valuation of Data for Ma-
chine Learning”. In: Proceedings of the 36th International Conference on Machine Learning.
ICML’19. 2019. url: https://proceedings.mlr.press/v97/ghorbani19c.html.
[GZ20] Amirata Ghorbani and James Y. Zou. “Neuron Shapley: Discovering the Responsible Neu-
rons”. In: Proceedings of the 34th Conference on Neural Information Processing Systems.
NeurIPS’20. 2020. url: https://arxiv.org/abs/2002.09815.
[GH19] Bruce Glymour and Jonathan Herington. “Measuring the Biases That Matter: The Ethical
and Casual Foundations for Measures of Fairness in Algorithms”. In: Proceedings of the 2019
ACM Conference on Fairness, Accountability, and Transparency. FAccT’19. 2019.
[Goy+17] Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Ky-
rola, Andrew Tulloch, Yangqing Jia, and Kaiming He. “Accurate, Large Minibatch SGD:
Training ImageNet in 1 Hour”. In: (2017). arXiv: 1706.02677 [cs.CV].
[GR99] Michel Grabisch and Marc Roubens. “An Axiomatic Approach to the Concept of Interaction
Among Players in Cooperative Games”. In: International Journal of Game Theory 28.4 (Nov.
1999), pp. 547–565. issn: 1432-1270. doi: 10.1007/s001820050125. url: https://doi.org/
10.1007/s001820050125.
[Guo+20] Chuan Guo, Tom Goldstein, Awni Y. Hannun, and Laurens van der Maaten. “Certified Data
Removal from Machine Learning Models”. In: Proceedings of the 37th International Conference
on Machine Learning. Vol. 119. ICML’20. 2020, pp. 3832–3842. url: https://arxiv.org/
abs/1911.03030.
[Guo+21] Han Guo, Nazneen Rajani, Peter Hase, Mohit Bansal, and Caiming Xiong. “FastIF: Scalable
Influence Functions for Efficient Model Interpretation and Debugging”. In: Proceedings of the
2021 Conference on Empirical Methods in Natural Language Processing. EMNLP’21. 2021.
url: https://arxiv.org/abs/2012.15781.
[HL21] Zayd Hammoudeh and Daniel Lowd. “Simple, Attack-Agnostic Defense Against Targeted
Training Set Attacks Using Cosine Similarity”. In: Proceedings of the 3rd ICML Workshop
on Uncertainty and Robustness in Deep Learning. UDL’21. 2021.
[HL22] Zayd Hammoudeh and Daniel Lowd. “Identifying a Training-Set Attack’s Target Using Renor-
malized Influence Estimation”. In: Proceedings of the 29th ACM SIGSAC Conference on Com-
puter and Communications Security. CCS’22. Los Angeles, CA: Association for Computing
Machinery, 2022. url: https://arxiv.org/abs/2201.10055.

[HL23] Zayd Hammoudeh and Daniel Lowd. “Reducing Certified Regression to Certified Classification
for General Poisoning Attacks”. In: Proceedings of the 1st IEEE Conference on Secure and
Trustworthy Machine Learning. SaTML’23. 2023. url: https://arxiv.org/abs/2208.13904.
[HL24a] Zayd Hammoudeh and Daniel Lowd. “Provable Robustness Against a Union of ℓ0 Attacks”.
In: Proceedings of the 38th AAAI Conference on Artificial Intelligence. AAAI’24. 2024. url:
https://arxiv.org/abs/2302.11628.
[HL24b] Zayd Hammoudeh and Daniel Lowd. “Training Data Influence Analysis and Estimation: A
Survey”. In: Machine Learning (2024). doi: 10.1007/s10994-023-06495-7.
[Ham74] Frank R. Hampel. “The Influence Curve and its Role in Robust Estimation”. In: Journal of
the American Statistical Association 69.346 (1974), pp. 383–393.
[HT21] Xiaochuang Han and Yulia Tsvetkov. “Fortifying Toxic Speech Detectors Against Veiled Tox-
icity”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing. EMNLP’20. 2021. url: https://arxiv.org/abs/2010.03154.
[HNM19] Satoshi Hara, Atsushi Nitanda, and Takanori Maehara. “Data Cleansing for Models Trained
with SGD”. In: Proceedings of the 33rd Conference on Neural Information Processing Systems.
NeurIPS’19. Vancouver, Canada: Curran Associates, Inc., 2019. url: https://arxiv.org/
abs/1906.08473.
[He+14] Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah,
Ralf Herbrich, Stuart Bowers, and Joaquin Quiñonero Candela. “Practical Lessons from Pre-
dicting Clicks on Ads at Facebook”. In: Proceedings of the Eighth International Workshop on
Data Mining for Online Advertising. AdKDD’14. New York, NY, USA: Association for Com-
puting Machinery, 2014. url: https://research.facebook.com/publications/practical-
lessons-from-predicting-clicks-on-ads-at-facebook/.
[Hig+17] Irina Higgins, Loïc Matthey, Arka Pal, Christopher P. Burgess, Xavier Glorot, Matthew M.
Botvinick, Shakir Mohamed, and Alexander Lerchner. “beta-VAE: Learning Basic Visual Con-
cepts with a Constrained Variational Framework”. In: Proceedings of the 5th International
Conference on Learning Representations. ICLR’17. 2017. url: https://openreview.net/
forum?id=Sy2fzU9gl.
[HSS08] Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola. “Kernel Methods in Machine
Learning”. In: Annals of Statistics 36.3 (2008), pp. 1171–1220. url: https://arxiv.org/
abs/math/0701907.
[Hog79] Robert V. Hogg. “Statistical Robustness: One View of its Use in Applications Today”. In: The
American Statistician 33.3 (1979), pp. 108–115.
[Hub81] Peter Huber. Robust Statistics. John Wiley & Sons, 1981.
[Hub64] Peter J. Huber. “Robust Estimation of a Location Parameter”. In: Annals of Mathematical
Statistics 35.1 (1964), pp. 73–101.
[Hut+20] Ben Hutchinson, Vinodkumar Prabhakaran, Emily Denton, Kellie Webster, Yu Zhong, and
Stephen Denuyl. “Social Biases in NLP Models as Barriers for Persons with Disabilities”.
In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.acl-main.487.
[Ily+22] Andrew Ilyas, Sung Min Park, Logan Engstrom, Guillaume Leclerc, and Aleksander Madry.
“Datamodels: Understanding Predictions with Data and Data with Predictions”. In: Proceed-
ings of the 39th International Conference on Machine Learning. ICML’22. PMLR, 2022. url:
https://arxiv.org/abs/2202.00622.
[Jae72] Louis A. Jaeckel. The Infinitesimal Jackknife. Tech. rep. Bell Laboratories, 1972.
[Jag+21] Matthew Jagielski, Giorgio Severi, Niklas Pousette Harger, and Alina Oprea. “Subpopulation
Data Poisoning Attacks”. In: Proceedings of the 28th ACM SIGSAC Conference on Computer
and Communications Security. CCS ’21. Virtual Only: Association for Computing Machinery,
2021. url: https://arxiv.org/abs/2006.14026.

[Jia+22] Jinyuan Jia, Yupei Liu, Xiaoyu Cao, and Neil Zhenqiang Gong. “Certified Robustness of
Nearest Neighbors against Data Poisoning and Backdoor Attacks”. In: Proceedings of the 36th
AAAI Conference on Artificial Intelligence. AAAI’22. 2022. url: https://arxiv.org/abs/
2012.03765.
[Jia+19a] Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gürel, Bo Li, Ce
Zhang, Costas J. Spanos, and Dawn Song. “Efficient Task-Specific Data Valuation for Near-
est Neighbor Algorithms”. In: Proceedings of the VLDB Endowment. PVLDB’19. 2019. url:
https://arxiv.org/abs/1908.08619.
[Jia+19b] Ruoxi Jia, David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve Gürel,
Bo Li, Ce Zhang, Dawn Song, and Costas J. Spanos. “Towards Efficient Data Valuation Based
on the Shapley Value”. In: Proceedings of the 22nd Conference on Artificial Intelligence and
Statistics. AISTATS’19. 2019, pp. 1167–1176. url: https://arxiv.org/abs/1902.10275.
[Jia+21a] Ruoxi Jia, Fan Wu, Xuehui Sun, Jiacen Xu, David Dao, Bhavya Kailkhura, Ce Zhang, Bo Li,
and Dawn Song. “Scalability vs. Utility: Do We Have to Sacrifice One for the Other in Data
Importance Quantification?” In: Proceedings of the 2021 IEEE/CVF Conference on Computer
Vision and Pattern Recognition. CVPR’21. 2021. url: https://arxiv.org/abs/1911.07128.
[Jia+21b] Ziheng Jiang, Chiyuan Zhang, Kunal Talwar, and Michael C Mozer. “Characterizing Struc-
tural Regularities of Labeled Data in Overparameterized Models”. In: Proceedings of the 38th
International Conference on Machine Learning. ICML’21. July 2021, pp. 5034–5044. url:
https://arxiv.org/abs/2002.03206.
[JL84] William B. Johnson and Joram Lindenstrauss. “Extensions of Lipschitz Mappings into a
Hilbert Space”. In: Contemporary Mathematics 26 (1984), pp. 189–206.
[JW78] John E. Dennis Jr. and Roy E. Welsch. “Techniques for nonlinear least squares and robust re-
gression”. In: Communications in Statistics - Simulation and Computation 7.4 (1978), pp. 345–
359.
[KS21] Karthikeyan K and Anders Søgaard. Revisiting Methods for Finding Influential Examples.
2021. arXiv: 2111.04683 [cs.LG].
[KWR22] Nikhil Kandpal, Eric Wallace, and Colin Raffel. “Deduplicating Training Data Mitigates Pri-
vacy Risks in Language Models”. In: Proceedings of the 39th International Conference on
Machine Learning. ICML’22. PMLR, 2022. url: https://arxiv.org/abs/2202.06539.
[KDK13] Saul M. Kassin, Itiel E. Dror, and Jeff Kukucka. “The Forensic Confirmation Bias: Prob-
lems, Perspectives, and Proposed Solutions”. In: Journal of Applied Research in Memory and
Cognition 2.1 (2013), pp. 42–52.
[Kha+19] Rajiv Khanna, Been Kim, Joydeep Ghosh, and Oluwasanmi Koyejo. “Interpreting Black Box
Predictions using Fisher Kernels”. In: Proceedings of the 22nd Conference on Artificial Intel-
ligence and Statistics. AISTATS’19. 2019. url: https://arxiv.org/abs/1810.10118.
[KCC23] Nohyun Ki, Hoyong Choi, and Hye Won Chung. “Data Valuation Without Training of a
Model”. In: Proceedings of the 11th International Conference on Learning Representations.
ICLR’23. 2023. url: https://openreview.net/forum?id=XIzO8zr-WbM.
[KB15] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In:
Proceedings of the 3rd International Conference on Learning Representations. ICLR’15. 2015.
url: https://arxiv.org/abs/1412.6980.
[KW14] Diederik P. Kingma and Max Welling. “Auto-Encoding Variational Bayes”. In: Proceedings of
the 2nd International Conference on Learning Representations. ICLR’14. 2014. url: https:
//arxiv.org/abs/1312.6114.
[Kiz16] René F. Kizilcec. “How Much Information? Effects of Transparency on Trust in an Algorithmic
Interface”. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing
Systems. CHI’16. San Jose, California, USA: Association for Computing Machinery, 2016,
pp. 2390–2395.

[Kni17] Will Knight. “The Dark Secret at the Heart of AI”. In: MIT Technology Review (Apr. 7, 2017).
url: https://www.technologyreview.com/2017/04/11/5113/the-dark-secret-at-the-
heart-of-ai/ (visited on 11/03/2022).
[Kob+20] Sosuke Kobayashi, Sho Yokoi, Jun Suzuki, and Kentaro Inui. “Efficient Estimation of Influence
of a Training Instance”. In: Proceedings of SustaiNLP: Workshop on Simple and Efficient
Natural Language Processing. Online: Association for Computational Linguistics, 2020. url:
https://arxiv.org/abs/2012.04207.
[Koh+19] Pang Wei Koh, Kai-Siang Ang, Hubert H. K. Teo, and Percy Liang. “On the Accuracy of
Influence Functions for Measuring Group Effects”. In: Proceedings of the 33rd International
Conference on Neural Information Processing Systems. NeurIPS’19. Red Hook, NY, USA:
Curran Associates Inc., 2019. url: https://arxiv.org/abs/1905.13289.
[KL17] Pang Wei Koh and Percy Liang. “Understanding Black-box Predictions via Influence Func-
tions”. In: Proceedings of the 34th International Conference on Machine Learning. ICML’17.
Sydney, Australia: PMLR, 2017. url: https://arxiv.org/abs/1703.04730.
[KSH22] Shuming Kong, Yanyan Shen, and Linpeng Huang. “Resolving Training Biases via Influence-
based Data Relabeling”. In: Proceedings of the 10th International Conference on Learning
Representations. ICLR’22. 2022. url: https://openreview.net/forum?id=EskfH0bwNVn.
[KC21] Zhifeng Kong and Kamalika Chaudhuri. “Understanding Instance-based Interpretability of
Variational Auto-Encoders”. In: Proceedings of the 35th Conference on Neural Information
Processing Systems. NeurIPS’21. Virtual Only: Curran Associates, Inc., 2021. url: https:
//arxiv.org/abs/2105.14203.
[Kri+16] Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J. Franklin, and Ken Goldberg. “Ac-
tiveClean: Interactive Data Cleaning for Statistical Modeling”. In: Proceedings of the VLDB
Endowment (2016). url: https://www.vldb.org/pvldb/vol9/p948-krishnan.pdf.
[KW19] Sanjay Krishnan and Eugene Wu. AlphaClean: Automatic Generation of Data Cleaning Pipelines.
2019. arXiv: 1904.11827 [cs.DB].
[KNH14] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 Dataset. 2014.
[KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “ImageNet Classification with Deep
Convolutional Neural Networks”. In: Proceedings of the 25th Conference on Neural Information
Processing Systems. NeurIPS’12. 2012, pp. 1097–1105.
[Kur+19] Keita Kurita, Nidhi Vyas, Ayush Pareek, Alan W Black, and Yulia Tsvetkov. “Measuring Bias
in Contextualized Word Representations”. In: Proceedings of the First Workshop on Gender
Bias in Natural Language Processing. Association for Computational Linguistics, 2019. url:
https://arxiv.org/abs/1906.07337.
[KZ22] Yongchan Kwon and James Zou. “Beta Shapley: A Unified and Noise-Reduced Data Valuation
Framework for Machine Learning”. In: Proceedings of the 25th Conference on Artificial Intelli-
gence and Statistics. AISTATS’22. PMLR, 2022. url: https://arxiv.org/abs/2110.14049.
[Lec89] Yvan G. Leclerc. “Constructing Simple Stable Descriptions for Image Partitioning”. In: Inter-
national Journal of Computer Vision 3.1 (1989), pp. 73–102.
[LeC+98] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. “Gradient-Based Learning
Applied to Document Recognition”. In: Proceedings of the IEEE. Vol. 86. 1998, pp. 2278–2324.
[Lee+20] Donghoon Lee, Hyunsin Park, Trung Pham, and Chang D. Yoo. “Learning Augmentation
Network via Influence Functions”. In: Proceedings of the 33rd Conference on Computer Vision
and Pattern Recognition. CVPR’20. 2020.
[LF21] Alexander Levine and Soheil Feizi. “Deep Partition Aggregation: Provable Defenses against
General Poisoning Attacks”. In: Proceedings of the 9th International Conference on Learning
Representations. ICLR’21. Virtual Only, 2021. url: https://arxiv.org/abs/2006.14768.

[Li+22] Yiming Li, Baoyuan Wu, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. “Backdoor Learning:
A Survey”. In: IEEE Transactions on Neural Networks and Learning Systems (2022). doi:
10.1109/TNNLS.2022.3182979. url: https://arxiv.org/abs/2007.08745.
[LLY21] Weixin Liang, Kai-Hui Liang, and Zhou Yu. “HERALD: An Annotation Efficient Method
to Detect User Disengagement in Social Conversations”. In: Proceedings of the 59th Annual
Meeting of the Association for Computational Linguistics and the 11th International Joint
Conference on Natural Language Processing. ACL-IJCNLP’21. Association for Computational
Linguistics, 2021. url: https://arxiv.org/abs/2106.00162.
[LDA09] Brian Y. Lim, Anind K. Dey, and Daniel Avrahami. “Why and Why Not Explanations Im-
prove the Intelligibility of Context-Aware Intelligent Systems”. In: Proceedings of the SIGCHI
Conference on Human Factors in Computing Systems. CHI’09. Boston, MA, USA: Association
for Computing Machinery, 2009, pp. 2119–2128.
[Lin+22] Jinkun Lin, Anqi Zhang, Mathias Lecuyer, Jinyang Li, Aurojit Panda, and Siddhartha Sen.
“Measuring the Effect of Training Data on Deep Learning Predictions via Randomized Exper-
iments”. In: Proceedings of the 39th International Conference on Machine Learning. ICML’22.
2022. url: https://arxiv.org/abs/2206.10013.
[Lip18] Zachary C. Lipton. “The Mythos of Model Interpretability: In Machine Learning, the Concept
of Interpretability is Both Important and Slippery.” In: Queue 16.3 (2018), pp. 31–57. issn:
1542-7730. url: https://dl.acm.org/doi/10.1145/3236386.3241340.
[LDG18] Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. “Fine-Pruning: Defending Against
Backdooring Attacks on Deep Neural Networks”. In: Proceedings of the International Sympo-
sium on Research in Attacks, Intrusions, and Defenses. RAID’18. Heraklion, Crete, Greece:
Springer, 2018, pp. 273–294. url: https://arxiv.org/abs/1805.12185.
[Liu+21] Zhuoming Liu, Hao Ding, Huaping Zhong, Weijia Li, Jifeng Dai, and Conghui He. “Influ-
ence Selection for Active Learning”. In: Proceedings of the 18th International Conference on
Computer Vision. ICCV’21. 2021. url: https://arxiv.org/abs/2108.09331.
[LL17] Scott M. Lundberg and Su-In Lee. “A Unified Approach to Interpreting Model Predictions”. In:
Proceedings of the 31st International Conference on Neural Information Processing Systems.
NeurIPS’17. 2017. url: https://arxiv.org/abs/1705.07874.
[Mah77] Michael J. Mahoney. “Publication Prejudices: An Experimental Study of Confirmatory Bias
in the Peer Review System”. In: Cognitive Therapy and Research 1.2 (1977), pp. 161–175.
[Meh+21] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan.
“A Survey on Bias and Fairness in Machine Learning”. In: ACM Computing Surveys 54.6
(July 2021). url: https://arxiv.org/abs/1908.09635.
[MBL20] Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. “Coresets for Data-Efficient Training
of Machine Learning Models”. In: Proceedings of the 37th International Conference on Machine
Learning. ICML’20. 2020. url: https://arxiv.org/abs/1906.01827.
[NDM13] Sherry Nakhaeizadeh, Itiel E Dror, and Ruth M Morgan. “Cognitive Bias in Forensic An-
thropology: Visual Assessment of Skeletal Remains is Susceptible to Confirmation Bias”. In:
Science & Justice 54.3 (Dec. 2013), pp. 208–214.
[Neg+12] Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu. “A Unified
Framework for High-Dimensional Analysis of M -Estimators with Decomposable Regularizers”.
In: Statistical Science 27.4 (2012), pp. 538–557. url: https://arxiv.org/abs/1010.2731.
[NSO23] Elisa Nguyen, Minjoon Seo, and Seong Joon Oh. A Bayesian Perspective On Training Data
Attribution. 2023. arXiv: 2305.19765 [cs.LG].
[Ngu+22] Thanh Tam Nguyen, Thanh Trung Huynh, Phi Le Nguyen, Alan Wee-Chung Liew, Hongzhi
Yin, and Quoc Viet Hung Nguyen. A Survey of Machine Unlearning. 2022. arXiv: 2209.02299
[cs.LG].

[Oh+21] Sejoon Oh, Sungchul Kim, Ryan A. Rossi, and Srijan Kumar. “Influence-guided Data Aug-
mentation for Neural Tensor Completion”. In: Proceedings of the 30th ACM International
Conference on Information and Knowledge Management. CIKM’21. ACM, 2021. url: https:
//arxiv.org/abs/2108.10248.
[Oh+22] Sejoon Oh, Berk Ustun, Julian McAuley, and Srijan Kumar. “Rank List Sensitivity of Recom-
mender Systems to Interaction Perturbations”. In: Proceedings of the 31st ACM International
Conference on Information and Knowledge Management. CIKM’22. ACM. 2022. url: https:
//arxiv.org/abs/2201.12686.
[Par+23] Sung Min Park, Kristian Georgiev, Andrew Ilyas, Guillaume Leclerc, and Aleksander Madry.
“TRAK: Attributing Model Behavior at Scale”. In: Proceedings of the 40th International Con-
ference on Machine Learning. ICML’23. 2023. url: https://arxiv.org/abs/2303.14186.
[Pea94] Barak A. Pearlmutter. “Fast Exact Multiplication by the Hessian”. In: Neural Computation 6
(1994), pp. 147–160.
[Ple+20] Geoff Pleiss, Tianyi Zhang, Ethan Elenberg, and Kilian Q. Weinberger. “Identifying Mislabeled
Data Using the Area Under the Margin Ranking”. In: Proceedings of the 34th International
Conference on Neural Information Processing Systems. NeurIPS’20. Red Hook, NY, USA:
Curran Associates Inc., 2020. url: https://arxiv.org/abs/2001.10528.
[Pru+20] Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan. “Estimating Training
Data Influence by Tracing Gradient Descent”. In: Proceedings of the 34th Conference on Neural
Information Processing Systems. NeurIPS’20. Virtual Only: Curran Associates, Inc., 2020. url:
https://arxiv.org/abs/2002.08484.
[Qia99] Ning Qian. “On the Momentum Term in Gradient Descent Learning Algorithms”. In: Neural
Networks 12.1 (1999), pp. 145–151.
[Ras+22] Soham Raste, Rahul Singh, Joel Vaughan, and Vijayan N. Nair. “Quantifying Inherent Ran-
domness in Machine Learning Algorithms”. In: (2022). arXiv: 2206.12353 [stat.ML].
[Rec11] Benjamin Recht. “A Simpler Approach to Matrix Completion”. In: Journal of Machine Learn-
ing Research 12 (Dec. 2011), pp. 3413–3430. issn: 1532-4435. url: https://arxiv.org/abs/
0910.0651.
[Red+21] Vijay Janapa Reddi, Greg Diamos, Pete Warden, Peter Mattson, and David Kanter. “Data
Engineering for Everyone”. In: (2021). arXiv: 2102.11447 [cs.LG].
[Ren+21] Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Brij B. Gupta, Xiaojiang
Chen, and Xin Wang. “A Survey of Deep Active Learning”. In: ACM Computing Surveys 54.9
(Oct. 2021). issn: 0360-0300. url: https://arxiv.org/abs/2009.00236.
[Ren14] Alexander Renkl. “Toward an Instructionally Oriented Theory of Example-Based Learning”.
In: Cognitive Science 38.1 (2014), pp. 1–37. url: https://onlinelibrary.wiley.com/doi/
full/10.1111/cogs.12086.
[RHS09] Alexander Renkl, Tatjana Hilbert, and Silke Schworm. “Example-Based Learning in Heuristic
Domains: A Cognitive Load Theory Account”. In: Educational Psychology Review 21.1 (Mar.
2009), pp. 67–78.
[Rez+23] Keivan Rezaei, Kiarash Banihashem, Atoosa Chegini, and Soheil Feizi. “Run-Off Election:
Improved Provable Defense against Data Poisoning Attacks”. In: Proceedings of the 40th In-
ternational Conference on Machine Learning. ICML’23. 2023. url: https://arxiv.org/abs/
2302.02300.
[RMW14] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. “Stochastic Backpropagation
and Approximate Inference in Deep Generative Models”. In: Proceedings of the 31st Inter-
national Conference on International Conference on Machine Learning. ICML’14. 2014. url:
https://arxiv.org/abs/1401.4082.
[Rou94] Peter Rousseeuw. “Least Median of Squares Regression”. In: Journal of the American Statis-
tical Association 79.388 (1994).

[RL87] Peter J. Rousseeuw and Annick. M. Leroy. Robust Regression and Outlier Detection. USA:
John Wiley & Sons, Inc., 1987. isbn: 0471852333.
[Roz+22] Benedek Rozemberczki, Lauren Watson, Péter Bayer, Hao-Tsung Yang, Olivér Kiss, Sebastian
Nilsson, and Rik Sarkar. The Shapley Value in Machine Learning. 2022. arXiv: 2202.05594
[cs.LG].
[Rud19] Cynthia Rudin. “Stop Explaining Black Box Machine Learning Models for High Stakes Deci-
sions and Use Interpretable Models Instead”. In: Nature Machine Intelligence 1.5 (May 2019),
pp. 206–215. url: https://arxiv.org/abs/1811.10154.
[RHW86] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. “Learning Representations
by Back-Propagating Errors”. In: Nature 323.6088 (Oct. 1986), pp. 533–536.
[Sax+19] Nripsuta Ani Saxena, Karen Huang, Evan DeFilippis, Goran Radanovic, David C. Parkes,
and Yang Liu. “How Do Fairness Definitions Fare? Examining Public Attitudes Towards Al-
gorithmic Definitions of Fairness”. In: Proceedings of the 2019 AAAI/ACM Conference on AI,
Ethics, and Society. AIES’19. 2019. url: https://arxiv.org/abs/1811.03654.
[Sch+23] Andrea Schioppa, Katja Filippova, Ivan Titov, and Polina Zablotskaia. Theoretical and Prac-
tical Perspectives on what Influence Functions Do. 2023. arXiv: 2305.16971 [cs.LG]. url:
https://arxiv.org/abs/2305.16971.
[Sch+22] Andrea Schioppa, Polina Zablotskaia, David Vilar Torres, and Artem Sokolov. “Scaling Up
Influence Functions”. In: Proceedings of the 36th AAAI Conference on Artificial Intelligence.
AAAI’22. 2022. url: https://arxiv.org/abs/2112.03052.
[SHS01] Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola. “A Generalized Representer Theorem”.
In: Proceedings of the 14th Annual Conference on Computational Learning Theory and 5th
European Conference on Computational Learning Theory. COLT’01/EuroCOLT’01. Berlin,
Heidelberg: Springer-Verlag, 2001, pp. 416–426.
[Sha+18a] Ali Shafahi, W. Ronny Huang, Mahyar Najibi, Octavian Suciu, Christoph Studer, Tudor Du-
mitras, and Tom Goldstein. “Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural
Networks”. In: Proceedings of the 32nd Conference on Neural Information Processing Systems.
NeurIPS’18. Montreal, Canada: Curran Associates, Inc., 2018. url: https://arxiv.org/abs/
1804.00792.
[Sha53] Lloyd S. Shapley. “A Value for n-Person Games”. In: Contributions to the Theory of Games II.
Princeton, NJ USA: Princeton University Press, 1953, pp. 307–317.
[SR88] Lloyd S. Shapley and Alvin E. Roth. The Shapley Value: Essays in Honor of Lloyd S. Shapley.
Cambridge University Press, 1988. isbn: 052136177X.
[Sha+18b] Boris Sharchilev, Yury Ustinovskiy, Pavel Serdyukov, and Maarten de Rijke. “Finding Influ-
ential Training Samples for Gradient Boosted Decision Trees”. In: Proceedings of the 35th
International Conference on Machine Learning. ICML’18. PMLR, 2018, pp. 4577–4585. url:
https://arxiv.org/abs/1802.06640.
[SC68] George W. Snedecor and William G. Cochran. Statistical Methods. 6th edition. Iowa State
University Press, 1968.
[Sri61] K. S. Srikantan. “Testing for the Single Outlier in a Regression Model”. In: Indian Journal of
Statistics 23.3 (1961), pp. 251–260.
[SKL17] Jacob Steinhardt, Pang Wei Koh, and Percy Liang. “Certified Defenses for Data Poisoning
Attacks”. In: Proceedings of the 31st Conference on Neural Information Processing Systems.
NeurIPS’17. Long Beach, California, USA: Curran Associates, Inc., 2017. url: https://arxiv.org/abs/1706.03691.
[SGM20] Emma Strubell, Ananya Ganesh, and Andrew McCallum. “Energy and Policy Considerations
for Modern Deep Learning Research”. In: Proceedings of the 34th AAAI Conference on Arti-
ficial Intelligence. AAAI’20. 2020. url: https://arxiv.org/abs/1906.02243.

[SWS21] Yi Sui, Ga Wu, and Scott Sanner. “Representer Point Selection via Local Jacobian Expan-
sion for Post-hoc Classifier Explanation of Deep Neural Networks and Ensemble Models”. In:
Proceedings of the 35th Conference on Neural Information Processing Systems. NeurIPS’21.
Virtual Only: Curran Associates, Inc., 2021. url: https://openreview.net/forum?id=Wl32WBZnSP4.
[SD21] Cecilia Summers and Michael J. Dinneen. “Nondeterminism and Instability in Neural Network
Optimization”. In: Proceedings of the 38th International Conference on Machine Learning.
ICML’21. 2021. url: https://arxiv.org/abs/2103.04514.
[SDA20] Mukund Sundararajan, Kedar Dhamdhere, and Ashish Agarwal. “The Shapley Taylor Inter-
action Index”. In: Proceedings of the 37th International Conference on Machine Learning.
ICML’20. 2020. url: http://proceedings.mlr.press/v119/sundararajan20a.
[SN20] Mukund Sundararajan and Amir Najmi. “The Many Shapley Values for Model Explanation”.
In: Proceedings of the 37th International Conference on Machine Learning. ICML’20. 2020,
pp. 9269–9278. url: https://arxiv.org/abs/1908.08474.
[TC19] Yi Chern Tan and L. Elisa Celis. “Assessing Social and Intersectional Biases in Contextualized
Word Representations”. In: Proceedings of the 33rd Conference on Neural Information Pro-
cessing Systems. NeurIPS’19. Vancouver, Canada: Curran Associates, Inc., 2019. url: https:
//arxiv.org/abs/1911.01485.
[Ter+21] Naoyuki Terashita, Hiroki Ohashi, Yuichi Nonaka, and Takashi Kanemaru. “Influence Estima-
tion for Generative Adversarial Networks”. In: Proceedings of the 9th International Conference
on Learning Representations. ICLR’21. 2021. url: https://arxiv.org/abs/2101.08367.
[Thi+22] Hugo Thimonier, Fabrice Popineau, Arpad Rimmel, Bich-Liên Doan, and Fabrice Daniel.
“TracInAD: Measuring Influence for Anomaly Detection”. In: Proceedings of the 2022 In-
ternational Joint Conference on Neural Networks. IJCNN’22. 2022. url: https://arxiv.
org/abs/2205.01362.
[Tib96] Robert Tibshirani. “Regression Shrinkage and Selection via the Lasso”. In: Journal of the
Royal Statistical Society (Series B) 58 (1996), pp. 267–288.
[TMB73] G. L. Tietjen, R. H. Moore, and R. J. Beckman. “Testing for a Single Outlier in Simple Linear
Regression”. In: Technometrics 15.4 (1973), pp. 717–721.
[TB18] Daniel Ting and Eric Brochu. “Optimal Subsampling with Influence Functions”. In: Proceed-
ings of the 32nd Conference on Neural Information Processing Systems. NeurIPS’18. Curran
Associates, Inc., 2018. url: https://arxiv.org/abs/1709.01716.
[TYR23] Che-Ping Tsai, Chih-Kuan Yeh, and Pradeep Ravikumar. “Faith-Shap: The Faithful Shapley
Interaction Index”. In: Journal of Machine Learning Research 24.94 (2023), pp. 1–42. url:
https://arxiv.org/abs/2203.00870.
[Tsa+23] Che-Ping Tsai, Jiong Zhang, Eli Chien, Hsiang-Fu Yu, Cho-Jui Hsieh, and Pradeep Raviku-
mar. “Representer Point Selection for Explaining Regularized High-dimensional Models”. In:
Proceedings of the 40th International Conference on Machine Learning. ICML’23. 2023. url:
https://arxiv.org/abs/2305.20002.
[Tuk+23] Murad Tukan, Samson Zhou, Alaa Maalouf, Daniela Rus, Vladimir Braverman, and Dan Feld-
man. “Provable Data Subset Selection for Efficient Neural Networks Training”. In: Proceedings
of the 40th International Conference on Machine Learning. ICML’23. Honolulu, Hawaii, USA,
2023. url: https://arxiv.org/abs/2303.05151.
[vW21] Gerrit J. J. van den Burg and Christopher K. I. Williams. “On Memorization in Probabilistic
Deep Generative Models”. In: Proceedings of the 35th Conference on Neural Information Pro-
cessing Systems. NeurIPS’21. Curran Associates, Inc., 2021. url: https://arxiv.org/abs/
2106.03216.
[Wal+21] Eric Wallace, Tony Z. Zhao, Shi Feng, and Sameer Singh. “Concealed Data Poisoning At-
tacks on NLP Models”. In: Proceedings of the North American Chapter of the Association for
Computational Linguistics. NAACL’21. 2021. url: https://arxiv.org/abs/2010.12563.

[Wan+19] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and
Ben Y. Zhao. “Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Net-
works”. In: Proceedings of the 40th IEEE Symposium on Security and Privacy. SP’19. San
Francisco, CA, 2019. url: https://ieeexplore.ieee.org/document/8835365.
[WJ23] Jiachen T. Wang and Ruoxi Jia. “Data Banzhaf: A Robust Data Valuation Framework for Ma-
chine Learning”. In: Proceedings of the 26th International Conference on Artificial Intelligence
and Statistics. AISTATS’23. 2023. url: https://arxiv.org/abs/2205.15466.
[WLF22] Wenxiao Wang, Alexander Levine, and Soheil Feizi. “Improved Certified Defenses against Data
Poisoning with (Deterministic) Finite Aggregation”. In: Proceedings of the 39th International
Conference on Machine Learning. ICML’22. 2022. url: https://arxiv.org/abs/2202.02628.
[Wan+20] Zifeng Wang, Hong Zhu, Zhenhua Dong, Xiuqiang He, and Shao-Lun Huang. “Less Is Better:
Unweighted Data Subsampling via Influence Function”. In: Proceedings of the 34th AAAI
Conference on Artificial Intelligence. AAAI’20. AAAI Press, 2020, pp. 6340–6347. url: https:
//arxiv.org/abs/1912.01321.
[WIB15] Kai Wei, Rishabh Iyer, and Jeff Bilmes. “Submodularity in Data Subset Selection and Ac-
tive Learning”. In: Proceedings of the 32nd International Conference on Machine Learning.
ICML’15. Lille, France: PMLR, 2015. url: https://proceedings.mlr.press/v37/wei15.
html.
[Woj+16] Mike Wojnowicz, Ben Cruz, Xuan Zhao, Brian Wallace, Matt Wolff, Jay Luan, and Caleb
Crable. “‘Influence Sketching’: Finding Influential Samples in Large-Scale Regressions”. In:
Proceedings of the 2016 IEEE International Conference on Big Data. BigData’16. IEEE, 2016.
url: https://arxiv.org/abs/1611.05923.
[Woo14] David P. Woodruff. “Sketching as a Tool for Numerical Linear Algebra”. In: Foundations and
Trends in Theoretical Computer Science 10.1–2 (2014), pp. 1–157. url: https://arxiv.org/
abs/1411.4357.
[Xia22] Chloe Xiang. “Scientists Increasingly Can’t Explain How AI Works”. In: Vice (Nov. 1, 2022).
url: https://www.vice.com/en/article/y3pezm/scientists-increasingly-cant-explain-how-ai-works (visited on 11/03/2022).
[Yam20] Roman V. Yampolskiy. “Unexplainability and Incomprehensibility of AI”. In: Journal of Arti-
ficial Intelligence and Consciousness 7.2 (2020), pp. 277–291. url: https://arxiv.org/abs/
1907.03869.
[YP21] Tom Yan and Ariel D. Procaccia. “If You Like Shapley Then You’ll Love the Core”. In:
Proceedings of the 35th AAAI Conference on Artificial Intelligence. AAAI’21. Virtual Only:
Association for the Advancement of Artificial Intelligence, 2021. url: https://ojs.aaai.
org/index.php/AAAI/article/view/16721.
[Yan+17] Jian Yang, Lei Luo, Jianjun Qian, Ying Tai, Fanlong Zhang, and Yong Xu. “Nuclear Norm
Based Matrix Regression with Applications to Face Recognition with Occlusion and Illumi-
nation Changes”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 39.1
(2017), pp. 156–171. doi: 10.1109/TPAMI.2016.2535218. url: https://arxiv.org/abs/
1405.1207.
[Yan+21] Jingkang Yang, Kaiyang Zhou, Yixuan Li, and Ziwei Liu. “Generalized Out-of-Distribution
Detection: A Survey”. In: (2021). arXiv: 2110.11334 [cs.CV].
[Yan+23] Shuo Yang, Zeke Xie, Hanyu Peng, Min Xu, Mingming Sun, and Ping Li. “Dataset Pruning:
Reducing Training Data by Examining Generalization Influence”. In: Proceedings of the 11th
International Conference on Learning Representations. ICLR’23. 2023. url: https://arxiv.
org/abs/2205.09329.

[Yeh+22] Chih-Kuan Yeh, Ankur Taly, Mukund Sundararajan, Frederick Liu, and Pradeep Ravikumar.
“First is Better Than Last for Language Data Influence”. In: Proceedings of the 36th Conference
on Neural Information Processing Systems. NeurIPS’22. Curran Associates, Inc., 2022. url:
https://arxiv.org/abs/2202.11844.
[Yeh+18] Chih-Kuan Yeh, Joon Sik Kim, Ian E.H. Yen, and Pradeep Ravikumar. “Representer Point
Selection for Explaining Deep Neural Networks”. In: Proceedings of the 32nd Conference on
Neural Information Processing Systems. NeurIPS’18. Montreal, Canada: Curran Associates,
Inc., 2018. url: https://arxiv.org/abs/1811.09720.
[YHL23] Wencong You, Zayd Hammoudeh, and Daniel Lowd. “Large Language Models Are Better
Adversaries: Exploring Generative Clean-Label Backdoor Attacks Against Text Classifiers”.
In: Findings of the Association for Computational Linguistics. EMNLP’23. 2023.
[YGG17] Yang You, Igor Gitman, and Boris Ginsburg. “Large Batch Training of Convolutional Net-
works”. In: (2017). arXiv: 1708.03888 [cs.CV].
[Yua+07] Ming Yuan, Ali Ekici, Zhaosong Lu, and Renato Monteiro. “Dimension Reduction and Co-
efficient Estimation in Multivariate Linear Regression”. In: Journal of the Royal Statistical
Society: Series B (Statistical Methodology) 69.3 (2007), pp. 329–346.
[Zen+23] Yingyan Zeng, Jiachen T. Wang, Si Chen, Hoang Anh Just, Ran Jin, and Ruoxi Jia. “Model-
Pred: A Framework for Predicting Trained Model from Training Data”. In: Proceedings of the
1st IEEE Conference on Secure and Trustworthy Machine Learning. SaTML’23. 2023. url:
https://arxiv.org/abs/2111.12545.
[Zha+17] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. “Under-
standing Deep Learning Requires Rethinking Generalization”. In: Proceedings of the 5th In-
ternational Conference on Learning Representations. ICLR’17. 2017. url: https://arxiv.
org/abs/1611.03530.
[Zha+21a] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. “Under-
standing Deep Learning (Still) Requires Rethinking Generalization”. In: Communications of
the ACM 64.3 (2021), pp. 107–115. url: https://dl.acm.org/doi/10.1145/3446776.
[Zha+21b] Chiyuan Zhang, Daphne Ippolito, Katherine Lee, Matthew Jagielski, Florian Tramèr, and
Nicholas Carlini. “Counterfactual Memorization in Neural Language Models”. In: (2021).
arXiv: 2112.12938 [cs.CL].
[Zha+20] Haoran Zhang, Amy X. Lu, Mohamed Abdalla, Matthew McDermott, and Marzyeh Ghassemi.
“Hurtful Words: Quantifying Biases in Clinical Contextual Word Embeddings”. In: Proceedings
of the ACM Conference on Health, Inference, and Learning. CHIL’20. New York, NY, USA:
Association for Computing Machinery, 2020. doi: 10.1145/3368555.3384448. url: https:
//arxiv.org/abs/2003.11515.
[ZZ22] Rui Zhang and Shihua Zhang. “Rethinking Influence Functions of Neural Networks in the Over-
Parameterized Regime”. In: Proceedings of the 36th AAAI Conference on Artificial Intelligence.
AAAI’22. Vancouver, Canada: Association for the Advancement of Artificial Intelligence, 2022.
url: https://arxiv.org/abs/2112.08297.
[Zha+21c] Wentao Zhang, Yexin Wang, Zhenbang You, Meng Cao, Ping Huang, Jiulong Shan, Zhi Yang,
and Bin Cui. “RIM: Reliable Influence-based Active Learning on Graphs”. In: Proceedings of
the 35th Conference on Neural Information Processing Systems. NeurIPS’21. Virtual Only:
Curran Associates, Inc., 2021. url: https://arxiv.org/abs/2110.14854.
[Zho+19] Jianlong Zhou, Zhidong Li, Huaiwen Hu, Kun Yu, Fang Chen, Zelin Li, and Yang Wang.
“Effects of Influence on User Trust in Predictive Decision Making”. In: Extended Abstracts of
the 2019 Conference on Human Factors in Computing Systems. CHI’19. New York, NY, USA:
Association for Computing Machinery, 2019. doi: 10.1145/3290607.3312962.

Training Data Influence Analysis
and Estimation: A Survey
Supplemental Materials

Organization of the Appendix

A Nomenclature

B Influence Analysis Method Definition Reference

C Unrolling Gradient Descent Hypergradients


A Nomenclature

Table 2 provides a general nomenclature reference that applies throughout this document, including for all
influence analysis methods. Table 3 summarizes the nomenclature related to model training. Table 4 details
nomenclature symbols that are specific to an individual influence analysis method.

Table 2: General nomenclature reference


[r]            Set {1, . . . , r} for arbitrary positive integer r
A ∼^m B        Set A is a u.a.r. subset of size m from set B
2^A            Power set of A
1[a]           Indicator function where 1[a] = 1 if predicate a is true and 0 otherwise
⃗0              Zero vector
x              Feature vector
X              Feature domain where X ⊆ R^d and ∀x x ∈ X
d              Feature dimension where d := |x|
y              Dependent/target value, e.g., label
Y              Dependent value domain, i.e., ∀y y ∈ Y. Generally Y ⊆ R
z              Feature vector–dependent value tuple where z := (x, y)
Z              Instance domain where Z := X × Y and ∀z z ∈ Z
P              Instance data distribution where P : Z → R≥0
D              Training set where D := {zi}_{i=1}^n
n              Size of the training set where n := |D|
D              Arbitrary training subset where D ⊆ D
i              Arbitrary training example index where i ∈ [n]
zte            Arbitrary test instance where zte := (xte, yte)
f              Model where f : X → Y
θ              Model parameters where θ ∈ R^p
p              Parameter dimension where p := |θ|
ℓ              Loss function where ℓ : Y × Y → R
L(z; θ)        Empirical risk of example z = (x, y) w.r.t. θ, where L(z; θ) := ℓ(f(x; θ), y)
I(zi, zte)     Exact pointwise influence of training instance zi on test instance zte
Î(zi, zte)     Estimate of training instance zi’s pointwise influence on test instance zte
I(D, zte)      Group influence of training subset D ⊆ D on test instance zte
Î(D, zte)      Estimate of the group influence of training subset D ⊆ D on test instance zte
DCS            A coreset (Sec. 3.3)

Table 3: Training related nomenclature reference.
T                    Number of training iterations
t                    Training iteration number where t ∈ {0, 1, . . . , T}. t = 0 denotes initial conditions.
θ^(t)                Model parameters at the end of iteration t
θ^(0)                Model parameters at the start of training
θ^(T)                Model parameters at the end of training
θ∗                   Optimal model parameters
θ_D^(T)              Final model parameters trained on training data subset D ⊆ D
λ                    L2 regularization (i.e., weight decay) hyperparameter
B^(t)                (Mini)batch used during training iteration t where B^(t) ⊆ D
b^(t)                Batch size for iteration t, where b^(t) := |B^(t)|
η^(t)                Learning rate at training iteration t
Θ                    Serialized training parameters where Θ ⊆ {θ^(0), . . . , θ^(T−1)}
∇θ L(zi; θ^(t))      Training instance zi’s risk gradient for iteration t
∇²θ L(zi; θ^(t))     Training instance zi’s risk Hessian for iteration t
H_θ^(t)              Empirical risk Hessian for the entire training set where H_θ^(t) := (1/n) Σ_{i=1}^{n} ∇²θ L(zi; θ^(t))
(H_θ^(t))^−1         Inverse of the empirical risk Hessian

Table 4: Influence (estimator) specific hyperparameters and nomenclature


D^{\zi}              Leave-one-out training set where instance zi is held out
k                    k-nearest neighbors neighborhood size
Neigh(xte; D)        kNN neighborhood for test feature vector xte from training set D ⊆ D
K                    Number of submodels trained by the Downsampling estimator
D^k                  Training set used by the k-th Downsampling submodel
θ_{D^k}^(T)          Final model parameters used by the k-th Downsampling submodel
Ki                   Number of Downsampling submodels trained using zi, where Ki := Σ_{k=1}^{K} 1[zi ∈ D^k]
m                    Downsampling submodel training-set size where ∀k |D^k| = m < n
ν                    Shapley value characteristic function ν : 2^A → R for arbitrary set A
ϵi                   Training instance i weight perturbation
θ_{+ϵi}^(t)          Model parameters trained on a training set perturbed by ϵi
αi                   Training instance zi’s representer value, where αi := −(1/(λn)) ∂ℓ(ŷi, yi)/∂ŷ
θ̈^(T)                Model f’s final linear layer parameters
θ̇^(T)                All model f’s parameters except the final layer, where θ̇^(T) := θ^(T) \ θ̈^(T)
fi                   Training instance zi’s feature representation input into model f’s final linear layer
fte                  Test instance zte’s feature representation input into model f’s final linear layer
K                    Kernel (similarity) function between two (feature) vectors
r(θ)                 Regularizer function where r : R^p → R≥0
T                    Subset of the training iterations considered by TracInCP, where T ⊂ [T]
h̃_i^(t)              Training hypergradient where h̃_i^(t) := dθ_{+ϵi}^(t)/dϵi ∈ R^p
B Influence Analysis Method Definition Reference

Table 5: Influence analysis method formal definitions including equation numbers and citations.

Memorization [FZ20; Pru+20]
    $\mathrm{Mem}(z_i) := \mathcal{I}(z_i, z_i)$   (4)

Cook’s Distance [Coo77, Eq. (5)]
    $\mathcal{I}_{\mathrm{Cook}}(z_i) := \theta^{(T)} - \theta^{(T)}_{\mathcal{D} \setminus z_i}$   (5)

Leave-One-Out Influence [CW82]
    $\mathcal{I}_{\mathrm{LOO}}(z_i, z_{te}) := L(z_{te}; \theta^{(T)}_{\mathcal{D} \setminus z_i}) - L(z_{te}; \theta^{(T)})$   (8)

Downsampling Influence Estimator [FZ20]
    $\hat{\mathcal{I}}_{\mathrm{Down}}(z_i, z_{te}) := \frac{1}{K - K_i} \sum_{k :\, z_i \notin \mathcal{D}^k} L(z_{te}; \theta^{(T)}_{\mathcal{D}^k}) - \frac{1}{K_i} \sum_{k' :\, z_i \in \mathcal{D}^{k'}} L(z_{te}; \theta^{(T)}_{\mathcal{D}^{k'}})$   (9)

Consistency Score [Jia+21b]
    $C_{\mathcal{D}}(z_{te}) := \mathbb{E}_{m \sim [n]} \big[ -\mathbb{E}_{D \overset{m}{\sim} \mathcal{D}} \big[ L(z_{te}; \theta^{(T)}_{D}) \big] \big]$

Shapley Value Pointwise Influence [Sha53; GZ19]
    $\mathcal{I}_{\mathrm{SV}}(z_i, z_{te}) := \frac{1}{n} \sum_{D \subseteq \mathcal{D} \setminus z_i} \binom{n-1}{|D|}^{-1} \big[ L(z_{te}; \theta^{(T)}_{D}) - L(z_{te}; \theta^{(T)}_{D \cup z_i}) \big]$   (18)

Shapley Interaction Index [GR99]
    $\mathcal{I}_{\mathrm{SV}}(A, z_{te}) := -\sum_{D \subseteq \mathcal{D} \setminus A} \frac{(n - |A| - |D|)! \, |D|!}{(n - |A| + 1)!} \sum_{D' \subseteq A} (-1)^{|A| - |D'|} \, L(z_{te}; \theta^{(T)}_{D \cup D'})$   (19)

Shapley-Taylor Interaction Index [SDA20]
    $\mathcal{I}_{\mathrm{ST}}(A, z_{te}) := -\frac{|A|}{n} \sum_{D \subseteq \mathcal{D} \setminus A} \binom{n-1}{|D|}^{-1} \sum_{D' \subseteq A} (-1)^{|A| - |D'|} \, L(z_{te}; \theta^{(T)}_{D \cup D'})$   (21)

k-Nearest Neighbors Shapley Influence [Jia+19a]
    $\mathcal{I}_{\mathrm{kNN\text{-}SV}}(z_i, z_{te}) := \frac{\mathbb{1}[y_n = y_{te}]}{n} + \sum_{j=i}^{n-1} \frac{\mathbb{1}[y_j = y_{te}] - \mathbb{1}[y_{j+1} = y_{te}]}{k} \frac{\min\{k, j\}}{j}$   (26)

Banzhaf Value [Ban65]
    $\mathcal{I}_{\mathrm{Banzhaf}}(z_i, z_{te}) := \frac{1}{2^{n-1}} \sum_{D \subseteq \mathcal{D} \setminus z_i} \big[ L(z_{te}; \theta^{(T)}_{D}) - L(z_{te}; \theta^{(T)}_{D \cup z_i}) \big]$   (27)

Influence Functions Estimator [KL17]
    $\hat{\mathcal{I}}_{\mathrm{IF}}(z_i, z_{te}) := \frac{1}{n} \nabla_{\theta} L(z_{te}; \theta^{(T)})^{\intercal} (H_{\theta}^{(T)})^{-1} \nabla_{\theta} L(z_i; \theta^{(T)})$   (33)

Linear Model Representer Point Influence [Yeh+18]
    $\mathcal{I}_{\mathrm{RP}}(z_i, z_{te}) = \alpha_i \, x_i^{\intercal} x_{te} \big|_{y_{te}}$   (46)

Representer Point Influence Estimator [Yeh+18]
    $\hat{\mathcal{I}}_{\mathrm{RP}}(z_i, z_{te}) := \alpha_i \, f_i^{\intercal} f_{te} \big|_{y_{te}}$   (47)

TracIn Ideal Pointwise Influence [Pru+20]
    $\mathcal{I}_{\mathrm{TracIn}}(z_i, z_{te}) := \sum_{t :\, z_i = \mathcal{B}^{(t)}} \big[ L(z_{te}; \theta^{(t-1)}) - L(z_{te}; \theta^{(t)}) \big]$   (51)

TracIn Influence Estimator [Pru+20]
    $\hat{\mathcal{I}}_{\mathrm{TracIn}}(z_i, z_{te}) := \sum_{t :\, z_i \in \mathcal{B}^{(t)}} \frac{\eta^{(t)}}{|\mathcal{B}^{(t)}|} \nabla_{\theta} L(z_i; \theta^{(t-1)})^{\intercal} \nabla_{\theta} L(z_{te}; \theta^{(t-1)})$   (55)

TracInCP Influence Estimator [Pru+20]
    $\hat{\mathcal{I}}_{\mathrm{TracInCP}}(z_i, z_{te}) := \sum_{t \in \mathcal{T}} \eta^{(t)} \nabla_{\theta} L(z_i; \theta^{(t-1)})^{\intercal} \nabla_{\theta} L(z_{te}; \theta^{(t-1)})$   (56)

HyDRA Influence Estimator [Che+21]
    $\hat{\mathcal{I}}_{\mathrm{HyDRA}}(z_i, z_{te}) := -\frac{1}{n} \nabla_{\theta} L(z_{te}; \theta^{(T)})^{\intercal} \, \tilde{h}_i^{(T)}$   (62)

θ-Relative Influence Estimator [BBD20]
    $\hat{\mathcal{I}}_{\mathrm{RelatIF}}(z_i, z_{te}) := \dfrac{\hat{\mathcal{I}}_{\mathrm{IF}}(z_i, z_{te})}{\big\lVert (H_{\theta}^{(T)})^{-1} \nabla_{\theta} L(z_i; \theta^{(T)}) \big\rVert}$   (66)

GAS Renormalized Influence Estimator [HL22]
    $\hat{\mathcal{I}}_{\mathrm{GAS}}(z_i, z_{te}) := \sum_{t \in \mathcal{T}} \eta^{(t)} \dfrac{\nabla_{\theta} L(z_i; \theta^{(t-1)})^{\intercal} \nabla_{\theta} L(z_{te}; \theta^{(t-1)})}{\lVert \nabla_{\theta} L(z_i; \theta^{(t-1)}) \rVert \, \lVert \nabla_{\theta} L(z_{te}; \theta^{(t-1)}) \rVert}$   (68)

Renormalized Influence Functions Estimator [HL22]
    $\hat{\mathcal{I}}_{\mathrm{RenormIF}}(z_i, z_{te}) := \dfrac{\hat{\mathcal{I}}_{\mathrm{IF}}(z_i, z_{te})}{\lVert \nabla_{\theta} L(z_i; \theta^{(T)}) \rVert}$   (69)

Table 6: Influence Analysis Method Abbreviations: Related methods are grouped together
as in Figure 2. Each method includes its corresponding source reference.
LOO Leave-One-Out Influence [CW82; KL17]
kNN LOO k-Nearest-Neighbors Leave-One-Out [Jia+21a]
LeafRefit Decision Forest Leaf Refitting [Sha+18b]
Influence Sketching Least Squares Influence Sketching [Woj+16]
Downsampling Downsampled Leave-One-Out [FZ20]
C-score Consistency Score [Jia+21b]
Generative Downsampling Downsampling for Generative Density Models [vW21]
SV Shapley Value [Sha53]
Interaction Index Shapley Interaction Index [GR99]
Shapley-Taylor Shapley-Taylor Interaction Index [SDA20]
TMC-Shapley Truncated Monte Carlo Shapley [GZ19]
G-Shapley Gradient Shapley [GZ19]
kNN Shapley k-Nearest-Neighbors Shapley [Jia+19a]
Beta Shapley Beta Distribution-Weighted Shapley Value [KZ22]
Banzhaf Value Banzhaf Value [Ban65; WJ23]
AME Average Marginal Effect [Lin+22]
SHAP Shapley Additive Explanations [LL17]
Neuron Shapley Shapley Value-Based Neural Explanations [GZ20]
IF Influence Functions [KL17]
FastIF Fast Influence Functions [Guo+21]
Arnoldi IF Arnoldi-Based Influence Functions [Sch+22]
LeafInfluence Decision Forest Leaf Influence [Sha+18b]
Group IF Group Influence Functions [Koh+19]
Second-Order IF Second-Order Group Influence Functions [BYF20]
RelatIF Relative Influence (Functions) [BBD20]
Renorm. IF Renormalized Influence Functions [HL22]
RP Representer Point [Yeh+18]
High Dim. Rep. High-Dimensional Representers [Tsa+23]
RPS-LJE Representer Point Based on Local Jacobian Expansion [SWS21]
TREX Tree-Ensemble Representer-Point Explanations [BHL23]
TracIn Traced Gradient Descent Influence [Pru+20]
TracInCP TracIn Checkpoint [Pru+20]
TracInRP TracIn Random Projection [Pru+20]
TracIn-Last TracIn Last Layer Only [Pru+20; Yeh+22]
VAE-TracIn Variational Autoencoder TracIn [KC21]
TracInAD TracIn Anomaly Detection [Thi+22]
TracInWE TracIn Word Embeddings [Yeh+22]
BoostIn Boosted (Tree) Influence [BHL23]
GAS Gradient Aggregated Similarity [HL21; HL22]
HyDRA Hypergradient Data Relevance Analysis [Che+21]
SGD-Influence Stochastic Gradient Descent Influence [HNM19]
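
As a concrete illustration of how two of the gradient-based definitions in Table 5 translate into code, the sketch below evaluates the TracInCP estimator of Eq. (56) and the GAS estimator of Eq. (68) from precomputed per-checkpoint gradients. It is only an illustrative sketch, not a reference implementation: the function names, the NumPy representation, the assumption that the caller supplies flattened risk gradients for each retained checkpoint t ∈ T, and the small eps guard against zero-norm gradients are all our own choices.

```python
import numpy as np

def tracincp_influence(train_grads, test_grads, learning_rates):
    """TracInCP estimate (Eq. 56): sum over retained checkpoints of
    eta^(t) * <grad L(z_i; theta^(t-1)), grad L(z_te; theta^(t-1))>."""
    return sum(lr * np.dot(g_i, g_te)
               for lr, g_i, g_te in zip(learning_rates, train_grads, test_grads))

def gas_influence(train_grads, test_grads, learning_rates, eps=1e-12):
    """GAS estimate (Eq. 68): the same sum with each term renormalized
    to a cosine similarity between the two checkpoint gradients."""
    return sum(lr * np.dot(g_i, g_te) / (np.linalg.norm(g_i) * np.linalg.norm(g_te) + eps)
               for lr, g_i, g_te in zip(learning_rates, train_grads, test_grads))
```

In practice, the per-example gradients would be produced by the training framework's automatic differentiation machinery, one flattened gradient per retained checkpoint.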

C Unrolling Gradient Descent Hypergradients

This section provides a formal derivation of how to unroll HyDRA’s training hypergradients. We provide
this reference for the interested reader to understand unrolling’s complexity.30 Readers do not need to
understand this section’s details to understand how HyDRA relates to other influence analysis methods.
Recall that Eq. (61) does not describe how the exact contents of batch B^(T) affect unrolling. Eq. (29)
defines the effect that infinitesimally perturbing the weight of training instance zi has on a model’s empirical
risk minimizer. Formally, the perturbed empirical risk is

    $L(\mathcal{D}; \theta) := \frac{1}{n} \sum_{z \in \mathcal{D}} L(z; \theta) + \epsilon_i L(z_i; \theta).$   (70)

Observe that ϵi = −1/n is the same as removing instance zi from the training set.
We now extend this idea to the effect of ϵi on a single minibatch. Formally, for any iteration t ∈ [T],
batch B^(t)’s risk under a training-set perturbation by ϵi is

    $L(\mathcal{B}^{(t)}; \theta^{(t-1)}) := \frac{1}{|\mathcal{B}^{(t)}|} \left( \sum_{z \in \mathcal{B}^{(t)}} L(z; \theta^{(t-1)}) + \mathbb{1}\big[z_i \in \mathcal{B}^{(t)}\big] \, n \epsilon_i \, L(z_i; \theta^{(t-1)}) \right),$   (71)

where indicator function $\mathbb{1}\big[z_i \in \mathcal{B}^{(t)}\big]$ checks whether instance zi is in batch B^(t). Observe that L(zi; θ^(t−1)) is
scaled by n. Without this multiplicative factor, when ϵi = −1/n, zi’s effect would not be completely removed
from the batch, i.e.,

    $\frac{1}{|\mathcal{B}^{(t)}|} L(z_i; \theta^{(t-1)}) - \frac{1}{n |\mathcal{B}^{(t)}|} L(z_i; \theta^{(t-1)}) \neq 0.$   (72)
L(zi ; θ(t−1) ) − L(zi ; θ(t−1) ) ̸= 0. (72)
B (t) n B (t)

Eq. (71)’s gradient w.r.t. θ is

    $\nabla_{\theta} L(\mathcal{B}^{(t)}; \theta^{(t-1)}) = \frac{1}{|\mathcal{B}^{(t)}|} \left( \sum_{z \in \mathcal{B}^{(t)}} \nabla_{\theta} L(z; \theta^{(t-1)}) + \mathbb{1}\big[z_i \in \mathcal{B}^{(t)}\big] \, n \epsilon_i \, \nabla_{\theta} L(z_i; \theta^{(t-1)}) \right).$   (73)
z∈B(t)

HyDRA’s hypergradients specify that the derivative is taken w.r.t. ϵi. First consider the simpler case
where zi ∉ B^(t); then

    $\frac{d}{d \epsilon_i} \Big[ \nabla_{\theta} L(\mathcal{B}^{(t)}; \theta^{(t-1)}_{+\epsilon_i}) \Big] = \frac{1}{|\mathcal{B}^{(t)}|} \sum_{z \in \mathcal{B}^{(t)}} \frac{d}{d \epsilon_i} \frac{\partial}{\partial \theta} L(z; \theta^{(t-1)})$   (74)

    $\qquad = \frac{1}{|\mathcal{B}^{(t)}|} \sum_{z \in \mathcal{B}^{(t)}} \frac{\partial^2}{\partial \theta^2} L(z; \theta^{(t-1)}) \, \frac{d \theta^{(t-1)}_{+\epsilon_i}}{d \epsilon_i}$   ▷ Chain rule   (75)

    $\qquad = \frac{1}{|\mathcal{B}^{(t)}|} \sum_{z \in \mathcal{B}^{(t)}} \nabla^2_{\theta} L(z; \theta^{(t-1)}) \, \tilde{h}_i^{(t-1)}.$   (76)

Note that ∇2θ L(z; θ(t−1) ) is training instance z’s risk Hessian w.r.t. parameters θ(t−1) .
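
Evaluating the sum in Eq. (76) (and the exact update in Eq. (79) below) only ever requires Hessian-vector products, which can be computed without materializing the p × p Hessian using Pearlmutter's trick [Pea94]. The snippet below is a minimal PyTorch-style sketch under our own naming conventions; `risk_fn` is assumed to map a flat parameter tensor to a scalar risk.

```python
import torch

def hessian_vector_product(risk_fn, theta, vec):
    """Return (d^2 L / d theta^2) @ vec without forming the Hessian [Pea94]."""
    loss = risk_fn(theta)                                        # scalar risk L(z; theta)
    grad, = torch.autograd.grad(loss, theta, create_graph=True)  # gradient with graph retained
    return torch.autograd.grad(torch.dot(grad, vec), theta)[0]   # differentiate <grad, vec> again

# Toy check: for L(theta) = 0.5 * ||theta||^2 the Hessian is the identity, so the HVP equals vec.
theta = torch.randn(5, requires_grad=True)
vec = torch.randn(5)
assert torch.allclose(hessian_vector_product(lambda p: 0.5 * p.pow(2).sum(), theta, vec), vec)
```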
When instance zi is in B^(t), unrolling hypergradient $\tilde{h}_i^{(t)}$ requires an additional term. We derive that
term below with a similar analysis as when zi ∉ B^(t), with the addition of the product rule. Observe that
Eq. (78) below considers both zi’s risk gradient and Hessian.
    $\frac{d}{d \epsilon_i} \Big[ \epsilon_i \nabla_{\theta} L(z_i; \theta^{(t-1)}) \Big] = \nabla_{\theta} L(z_i; \theta^{(t-1)}) + \epsilon_i \frac{d}{d \epsilon_i} \nabla_{\theta} L(z_i; \theta^{(t-1)})$   ▷ Product rule   (77)

    $\qquad = \nabla_{\theta} L(z_i; \theta^{(t-1)}) + \epsilon_i \nabla^2_{\theta} L(z_i; \theta^{(t-1)}) \, \tilde{h}_i^{(t-1)}$   ▷ Chain rule   (78)
30 Similar complexity would be required if TracIn [Pru+20] were extended to support momentum or adaptive optimization.

Combining Eqs. (61), (76), and (78), the hypergradient update rule for vanilla gradient descent without
momentum is
    $\tilde{h}_i^{(t)} = (1 - \eta^{(t)} \lambda) \, \tilde{h}_i^{(t-1)} - \frac{\eta^{(t)}}{|\mathcal{B}^{(t)}|} \sum_{z \in \mathcal{B}^{(t)}} \nabla^2_{\theta} L(z; \theta^{(t-1)}) \, \tilde{h}_i^{(t-1)} - \frac{\eta^{(t)} n}{|\mathcal{B}^{(t)}|} \mathbb{1}\big[z_i \in \mathcal{B}^{(t)}\big] \Big( \nabla_{\theta} L(z_i; \theta^{(t-1)}) + \epsilon_i \nabla^2_{\theta} L(z_i; \theta^{(t-1)}) \, \tilde{h}_i^{(t-1)} \Big).$   (79)

In Chen et al.’s [Che+21] fast approximation of HyDRA, all Hessians (e.g., ∇2θ L(z; θ(t−1) )) in Eq. (79) are
treated as zeros and the associated terms dropped. The resulting simplified equation,

    $\tilde{h}_i^{(t)} = (1 - \eta^{(t)} \lambda) \, \tilde{h}_i^{(t-1)} - \frac{\eta^{(t)} n}{|\mathcal{B}^{(t)}|} \mathbb{1}\big[z_i \in \mathcal{B}^{(t)}\big] \, \nabla_{\theta} L(z_i; \theta^{(t-1)}),$   (80)

is the basis of the fast-approximation update rule in Alg. 4.
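
For readers who prefer code, the sketch below applies Eq. (80) once per recorded training iteration and then combines the final hypergradient with Eq. (62) to produce the influence estimate. It is only an illustrative sketch under our own assumptions (NumPy vectors, a caller-supplied `grad_fn` returning a flat per-example risk gradient, and identity-based batch membership); it is not Chen et al.'s [Che+21] reference implementation.

```python
import numpy as np

def fast_hypergradient_step(h_prev, theta_prev, grad_fn, z_i, batch, lr, weight_decay, n):
    """One iteration of the simplified recursion in Eq. (80); all Hessian terms are dropped."""
    h_next = (1.0 - lr * weight_decay) * h_prev
    if any(z is z_i for z in batch):                          # indicator 1[z_i in B^(t)]
        h_next = h_next - (lr * n / len(batch)) * grad_fn(z_i, theta_prev)
    return h_next

def hydra_influence(test_grad, h_final, n):
    """HyDRA estimate of Eq. (62): -(1/n) * grad_theta L(z_te; theta^(T))^T h_i^(T)."""
    return -np.dot(test_grad, h_final) / n
```

Because the initial parameters θ^(0) do not depend on ϵi, the recursion starts from h = 0 and is advanced using the batches, learning rates, and parameter checkpoints recorded during training.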
