Under review as a conference paper at ICLR 2025

CodeUnlearn: Amortized Zero-Shot Machine Unlearning in Language Models Using Discrete Concept

Anonymous authors
Paper under double-blind review

Abstract
Large Language Models (LLMs) offer extensive knowledge across various domains, but they may inadvertently memorize sensitive, unauthorized, or malicious data, such as personal information in the medical and financial sectors. Machine unlearning methods aim to remove specific information from models after training to address this. However, current approaches require additional model training or struggle to effectively erase particular data points and their associated context due to LLMs' complex, dense, and continuous nature. In this study, we propose a novel amortized unlearning approach using codebook features and Sparse Autoencoders (SAEs). By leveraging a bottleneck to decompose the activation space and regulate information flow, our method efficiently unlearns targeted information while preserving the model's performance on unrelated data. To the best of our knowledge, this is the first work that successfully enables unlearning specific topics with contextual relevance in an LLM, marking a significant step towards real-world applications of machine unlearning.

1 Introduction

Large Language Models (LLMs) have been widely used in various applications, generating text responses that attempt to create the equivalent of human conversations (OpenAI et al., 2024). These models leverage vast scientific literature to facilitate and accelerate interdisciplinary research (Taylor et al., 2022) while drawing upon large datasets of human-generated content to provide professional advice. However, in many cases, such data is a double-edged sword: including personal information or sensitive scientific knowledge can be beneficial or, conversely, harmful. For instance, Soice et al. (2023) discuss how LLMs, when used by non-experts, can enable the creation of biological agents, posing both potential benefits and significant risks.

In response to these concerns, machine unlearning has emerged as a promising research area aimed at selectively removing specific data points or information from a trained model. This technique can help mitigate the misuse of sensitive data or address privacy concerns. Existing solutions, such as Sharded, Isolated, Sliced, and Aggregated (SISA) training (Bourtoule et al., 2020), focus on partitioning training data into disjoint shards and retraining models on these individual shards. While effective in some contexts, these methods are often time-consuming, resource-intensive, and lack scalability when applied to large models like LLMs. Furthermore, traditional approaches frequently require specialized data structures or full retraining, making them impractical for dynamic or complex tasks.

Given these limitations, there is an increasing demand for zero-shot unlearning methods, which aim to remove specific information without retraining or specialized data structures. Unlike traditional unlearning techniques that rely on retraining portions of the model, zero-shot unlearning seeks to directly eliminate the influence of specific data points or pieces of information from the model's learned representation, without additional computational steps or parameter adjustments. Moreover, zero-shot unlearning is inherently more scalable, especially for large models like LLMs, as it avoids the inefficiencies associated with data partitioning and retraining.
Our approach builds upon using discrete representations as the latent space for unlearning. Discrete representations, generated through Vector Quantization (VQ) (van den Oord et al., 2018), offer a natural structure for organizing the latent space to enable selective information removal. Discrete representations can be seen as a form of disentanglement, a concept rooted in classical representation learning research (Bengio et al., 2014), which emphasizes learning representations that disentangle the various factors of variation in data. This allows for the separation of different explanatory sources within the data. Additionally, Elhage et al. (2022) explore how neurons in models can represent multiple superposed features, leading to the concept of using dictionaries to disentangle these superpositions. Building on this notion, we propose employing discrete representations to disentangle the model's internal structure, thereby enabling selective unlearning. By tracking and modifying discrete codes within the latent space, we aim to achieve efficient and targeted removal of sensitive or unwanted information.
Our contributions are as follows:

• We propose a novel zero-shot unlearning method based on discrete latent representations.

• We demonstrate how Vector Quantization (VQ) can structure the latent space, facilitating the selective removal of information in an amortized manner.

• We extend our method beyond traditional machine unlearning techniques, primarily designed for classification tasks, to handle complex language tasks associated with language models, addressing a broader scope of applications.

• Our approach provides a baseline for unlearning in language models and validates the effectiveness of our method.

2 Related Work

Machine unlearning methodologies have been developed to tackle the challenges of efficiently removing data from trained models. Among the early influential frameworks is the Sharded, Isolated, Sliced, and Aggregated (SISA) approach (Bourtoule et al., 2020), which partitions data into independent shards. By retraining only the specific shards containing the data to be unlearned, SISA reduces the computational burden. Extensions of this approach include Ginart et al. (2019), which applies partitioning to linear models, and Brophy & Lowd (2021), which adapts it for random forests. Schelter et al. (2021) further extended the concept to decision trees, minimizing retraining through hierarchical partitioning. In the graph learning domain, Chen et al. (2022b) developed methods to forget specific nodes or edges, while Chen et al. (2022a) focused on removing sensitive user data from recommendation systems.

While these methods are effective for structured models, they struggle to scale to large, complex models like language models (LMs). Additionally, the retraining costs, though reduced, remain significant, and the reliance on specific architectures limits their generalizability to more dynamic tasks.

In a different direction, Kurmanji et al. (2023) introduced SCRUB, which treats the original model as a teacher and trains a student model to mimic it on retained data while "forgetting" specific information. Warnecke et al. (2023) proposed unlearning entire groups of features and labels using influence functions, providing closed-form updates to model parameters for more efficient data removal.

Influence functions (Guo et al., 2023; Sekhari et al., 2021; Mehta et al., 2022) also offer an alternative by measuring the effect of individual data points on a model's predictions and adjusting parameters accordingly, providing more direct methods for unlearning.

Recently, zero-shot unlearning methods have emerged, focusing on removing information without retraining, making them highly efficient for large models. Shah et al. (2024) introduced a method for editing model computations to "forget" specific information. While this is effective for tasks like token classification, it may struggle with the more complex context and semantics in LLMs, underscoring the need for scalable, adaptable unlearning techniques tailored to these models.

3 Methodology

To address the challenges of zero-shot machine unlearning, we propose a novel approach that leverages codebook features to bottleneck latent representations within a language model, enabling the targeted unlearning of specific knowledge by altering related codebook embeddings. Initially introduced by Tamkin et al. (2023), codebook features efficiently compress the activation space of neural networks by introducing a sparse discrete bottleneck. This bottleneck can be further optimized to isolate the codes most relevant to specific topics in the input, offering deeper insight and control over the model's response and interpretation. By utilizing this discrete latent representation, we can more effectively identify and remove the specific information encoded in the codebook corresponding to the input's targeted knowledge.

The following section details our approach to employing codebook features to efficiently identify and unlearn specific areas of related information in a zero-shot manner. This process ensures that the model can no longer effectively handle prompts that contain the target information to unlearn.

Figure 1: CodeUnlearn, our amortized zero-shot machine unlearning for language models. Left: Discrete latent bottlenecking in the transformer architecture. After applying the residual connection, the multi-head attention output is discretized using a discrete embedding vocabulary, referred to as the codebook. This approach prevents information leakage via the residual connection, ensuring that the codebook effectively regulates and interprets the network's behavior. Right: Zero-shot machine unlearning is achieved by removing the discrete codes in the codebook that correspond to the targeted information.
3.1 Codebook Features

The core idea of employing codebook features is to transform the original activations from a hidden layer into a representation governed by a codebook. Let a ∈ R^F represent the activation vector from a hidden layer, where F denotes the dimensionality of the activations. We use a codebook C = {c_1, c_2, ..., c_K} ∈ R^{K×F}, where K represents the number of code vectors. The codebook offers a compressed, discrete representation of the original activations. To perform this transformation, we calculate the cosine similarity between the activation a and each code vector c_k in the codebook:

    cosineSim(a, c_k) = (a · c_k) / (‖a‖ ‖c_k‖)    (1)

We identify the top S (where S ≥ 1) most similar code vectors corresponding to the activation a. The index set Ω of these top S code vectors is defined as:

    Ω = TopS({k | k ∈ {1, ..., K}, cosineSim(a, c_k)})    (2)

The output of the codebook transformation is then:

    â = Σ_{k∈Ω} c_k    (3)

where Ω is the index set of the S most similar code vectors, selected based on the highest cosine similarity scores. In the unlearning procedure, the activated codes corresponding to a are identified as the targets for removal.
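
To make the selection step concrete, the following is a minimal PyTorch sketch of the codebook transformation in Eqs. (1)-(3). The class name, the default sizes (25k codes, 512 dimensions, S = 8, taken from the experimental settings), and the absence of any straight-through gradient trick are illustrative assumptions rather than the paper's released implementation.

```python
import torch
import torch.nn.functional as F

class CodebookLayer(torch.nn.Module):
    """Sketch of the codebook bottleneck: pick the top-S codes by cosine
    similarity and return their sum (Eqs. 1-3)."""

    def __init__(self, num_codes: int = 25000, dim: int = 512, top_s: int = 8):
        super().__init__()
        self.codebook = torch.nn.Parameter(torch.randn(num_codes, dim))
        self.top_s = top_s

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        # cosine similarity between the activation and every code vector (Eq. 1)
        sim = F.normalize(a, dim=-1) @ F.normalize(self.codebook, dim=-1).T
        # indices of the S most similar codes (Eq. 2)
        top_idx = sim.topk(self.top_s, dim=-1).indices
        # sum of the selected code vectors (Eq. 3)
        return self.codebook[top_idx].sum(dim=-2)
```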

3.2 Codebook Settings
167
168 Multiple Codebooks In prior work Tamkin et al. (2023), multiple codebooks were applied to each
169 attention head, with the outputs concatenated across heads. Each attention head operates with its own
170 codebook, selecting codes independently. The chosen codes from each head are then concatenated to
171 produce the final output for that attention layer, effectively allowing the model to represent a broader
172 set of features through the combination of different codebooks. Using multiple codebooks across at-
173 tention heads can lead to a superposition effect, as discussed by Elhage et al. (2022). Superposition
174 refers to the phenomenon where linear representations can encode more features than the dimen-
175
sions, effectively allowing the neural network to simulate more extensive networks. In this case,
combining multiple codebooks across attention heads allows for a significantly more comprehen-
176
sive set of activations to be represented, even when using only the top S = 1 codebooks. However,
177
tracking which individual codebooks are responsible for specific activation patterns becomes chal-
178 lenging. Rather than relying on the output of a single codebook, the overall representation emerges
179 from the combined outputs of all the codebooks.
180
181
Single Codebook As shown in Section 3, to maintain interpretability, we focus on using a sin-
182 gle codebook and position it after the multi-head attention layer and residual connection to prevent
183 information leakage. However, in a single codebook setup, selecting only S = 1 will result in a
184 significant drop in model performance, as a single codebook feature is insufficient to capture the
185 complexity of the activation space. In Cai (2024), the author rigorously demonstrates that treating
186 word vectors as mappings allows a finite vocabulary to achieve infinite approximation through com-
187 position. We employ S > 1 in our approach based on this idea. While this may slightly affect code
188 discretization and information clarity, it allows us to balance model performance with interpretabil-
189 ity.

3.3 Codebook with Sparse Autoencoders

We aim to decompose the activation space into sparse, interpretable features rather than reconstructing the original input. To accomplish this, we incorporate the Sparse Autoencoder (SAE) concept. The SAE applies a linear encoder with a ReLU activation function to project the activations into a higher-dimensional space, effectively decomposing features. A linear decoder is then used to reconstruct the activations.

In line with the SAE structure, we introduce a linear encoder with ReLU before the codebook and a linear decoder after the codebook. This setup provides two significant benefits for machine unlearning:

• Security through ReLU: The ReLU activation function ensures that the extracted features are non-linear and sparse, making it more difficult to recover or reconstruct the original input from the features. This acts as a safeguard, reducing the likelihood of information leakage. By enforcing sparsity and non-linearity, ReLU provides greater control over feature representation, allowing us to obscure specific activations and protect data integrity during machine-unlearning processes.

• Decentralization of Information: Sparsity promotes the decentralization of encoded information, which helps isolate and unlearn specific patterns or features without disrupting the rest of the model. This targeted approach allows for more precise unlearning of sensitive or undesired information.

Encoder  The encoder projects the activation vector a ∈ R^d into a higher-dimensional space. This is achieved using a weight matrix W_enc ∈ R^{F×d} and a bias vector b_enc ∈ R^F, with F ≥ d. A ReLU activation function follows the projection to introduce non-linearity:

    h_enc = ReLU(W_enc a + b_enc)    (4)


Codebook  After encoding, the sparse representation h_enc is transformed using the codebook. The cosine similarity between h_enc and each code vector c_k ∈ {c_1, c_2, ..., c_K} is calculated as:

    cosineSim(h_enc, c_k) = (h_enc · c_k) / (‖h_enc‖ ‖c_k‖)    (5)

The top S most similar code vectors are selected:

    Ω = TopS({k | k ∈ {1, ..., K}, cosineSim(h_enc, c_k)})    (6)

The output of the codebook transformation is then:

    ĥ_enc = Σ_{k∈Ω} c_k    (7)

Decoder  The decoder then maps ĥ_enc back to the original activation space using a weight matrix W_dec ∈ R^{d×F} and a bias vector b_dec ∈ R^d:

    â = W_dec ĥ_enc + b_dec    (8)
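
Putting Eqs. (4)-(8) together, a compact sketch of the encoder-codebook-decoder bottleneck might look as follows; the module name and default hyperparameters are assumptions, and the layer normalization and Kaiming initialization described in Appendix A are omitted for brevity.

```python
import torch
import torch.nn.functional as F

class SAECodebookBottleneck(torch.nn.Module):
    """Sketch of the SAE-style bottleneck: linear encoder with ReLU (Eq. 4),
    top-S codebook lookup (Eqs. 5-7), and linear decoder (Eq. 8)."""

    def __init__(self, d: int = 512, feat_dim: int = 512,
                 num_codes: int = 25000, top_s: int = 8):
        super().__init__()
        self.encoder = torch.nn.Linear(d, feat_dim)        # W_enc, b_enc
        self.decoder = torch.nn.Linear(feat_dim, d)        # W_dec, b_dec
        self.codebook = torch.nn.Parameter(torch.randn(num_codes, feat_dim))
        self.top_s = top_s

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        h_enc = torch.relu(self.encoder(a))                                      # Eq. 4
        sim = F.normalize(h_enc, dim=-1) @ F.normalize(self.codebook, dim=-1).T  # Eq. 5
        top_idx = sim.topk(self.top_s, dim=-1).indices                           # Eq. 6
        h_hat = self.codebook[top_idx].sum(dim=-2)                               # Eq. 7
        return self.decoder(h_hat)                                               # Eq. 8
```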

3.4 Training the Codebook

Reconstruction Loss  As with the Sparse Autoencoder (SAE) and codebook models, we utilize the Mean Squared Error (MSE) loss as the primary loss function. The MSE loss can be expressed as:

    L_MSE = (1/N) Σ_{i=1}^{N} ‖a_i − â_i‖_2^2    (9)

where N is the number of samples, a_i is the original activation, and â_i is the reconstructed activation obtained from the decoder.

Additionally, to promote sparsity and enforce a more discrete selection of codebook vectors, we introduce an L1 penalty term on the codebook activations. This encourages the model to rely on fewer, more distinct codes. The overall loss function incorporating this sparsity constraint is defined as:

    L_Codebook = (1/N) Σ_{i=1}^{N} ‖a_i − â_i‖_2^2 + λ Σ_{k∈Ω} ‖c_k‖_1    (10)

where Ω represents the set of indices for the top S most similar code vectors, c_k refers to the k-th codebook vector, and λ is a regularization coefficient that controls the strength of the L1 penalty term. In our experiments, we set λ = 1 × 10^-6 to balance sparsity with reconstruction accuracy.

Joint Training for Machine Unlearning  Both the SAE and codebook features are used to reconstruct the input a, but this presents a critical issue in the context of machine unlearning: one could simply remove the codebook layer, reverting the model to its original state and negating the unlearning process. To address this, the model must be trained so that the downstream components are entirely dependent on the output of the codebook, while the upstream layers learn to generate activations that conform to the codebook's representations. This joint training ensures that the entire model relies on the codebook's representation, making it harder to bypass or remove the codebook without degrading performance. The joint loss function for this training process is defined as:

    L_joint = L_MSE + L_CE    (11)

where L_MSE is the Mean Squared Error for reconstruction, and L_CE represents the Cross-Entropy loss for the original language modeling or task-specific objective.
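
As a rough illustration of how these terms combine during joint training, the sketch below assumes the per-sample activations, their reconstructions, the selected code vectors, and the task cross-entropy are already available; the function name and argument layout are ours, not the paper's.

```python
import torch

def training_loss(a, a_hat, selected_codes, lm_loss, lam=1e-6):
    """a, a_hat: original/reconstructed activations, shape (N, d).
    selected_codes: top-S code vectors chosen per sample, shape (N, S, F).
    lm_loss: cross-entropy of the language-modeling objective (L_CE)."""
    mse = ((a - a_hat) ** 2).sum(dim=-1).mean()                # Eq. 9
    l1 = lam * selected_codes.abs().sum(dim=(-2, -1)).mean()   # sparsity term of Eq. 10
    return mse + l1 + lm_loss                                  # Eqs. 10 and 11 combined
```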


Figure 2: Unlearning a target topic in a language model. The zero-shot unlearning process begins by identifying codes enriched in the data subset containing the target topic (D_T) as opposed to the subset without it (D_T̃). Codes with p-values less than 0.05 are removed from the codebook. After this removal, the model exhibits significantly decreased performance on inputs containing the target information.
291
292 3.5 C ODE R ETRIEVAL
293
294 As shown in Figure 2, after training, the codebook encodes a set of representative codes ck K k=1 ∈
295 RK×F that are sparse and represent different features. To perform unlearning, we retrieve the codes
296
activated for specific inputs and identify which codes are enriched for a particular topic. The model
can effectively unlearn the associated information by deleting the corresponding enriched codes
297
from the codebook. The key steps involve retrieving these relevant codes for each input and deter-
298
mining their relationship to the target topic.
299
300 Because of the nature of the attention mechanism, the activation of these codes also depends on the
301 surrounding context. This means we are not just identifying individual words that activate specific
302
codes but retrieving codes that represent the broader topic within the input context. To unlearn a
specific topic T , consider a dataset DT with samples related to topic T . We create a control dataset
303
DT̃ by replacing words associated with T in DT with unrelated words, ensuring the context remains
304
consistent. By comparing the code activations between DT and DT̃ , we can identify and search for
305 the codes linked to topic T .
306
307 For each code ck activated in the dataset, we compute its frequency in both datasets by considering
308
the top S ′ activated codes:
NT
309 1 X
fk (DT ) = I(k ∈ ΩT (ai )) (12)
310 NT i=1
311 NT̃
312
1 X
fk (DT̃ ) = I(k ∈ ΩT̃ (aj )) (13)
313 NT̃ j=1
314 Where ΩT (ai ) represents the set of indices of the top S ′ activated codes for activation ai in dataset
315 DT , and ΩT̃ (aj ) is similarly defined for DT̃ . NT and NT̃ denote the sample sizes of DT and DT̃ ,
316 respectively. I is the indicator function that checks whether code k is in the set of activated codes.
317 The hyperparameter S ′ controls the number of top activated codes considered, thereby influencing
318 the number of codes to be removed.
319 To quantify the enrichment of code ck for topic T , we use the following formula:
320  
fk (DT ) + ϵ
321 R(ck , T ) = log2 (14)
322
fk (DT̃ ) + ϵ
323 where ϵ is a small constant added to avoid division by zero. When R(ck , T ) is positive, it indicates
that the code ck is enriched in dataset DT relative to DT̃ . However, if the frequency of ck in DT̃

6
Under review as a conference paper at ICLR 2025

324
is zero and its frequency in DT is very low, such codes should not be removed, as they are likely
325 accidental activations. Removing these codes could lead to unintended side effects, as they may not
326 be strongly related to the topic T despite being present in the dataset.
327
328
Therefore, we used a chi-squared test to calculate the p-value of R(ck , T ) to determine if the code
ck is enriched for topic T . For those codes with p-values smaller than 0.05, we regard them as
329
enriched codes in DT and remove them from the codebook. We define the set of enriched codes as
330
ΩR>0,p<0.05 = {ck | R(ck , T ) > 0 and p ≤ 0.05}.
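
A minimal sketch of these retrieval statistics is given below, assuming occurrence counts of each code in the top-S′ sets have already been collected for D_T and D_T̃; SciPy's chi2_contingency stands in for the chi-squared test, whose exact form is not specified in the text.

```python
import numpy as np
from scipy.stats import chi2_contingency

def enriched_codes(counts_t, counts_ctrl, n_t, n_ctrl, eps=1e-9, alpha=0.05):
    """counts_t[k] / counts_ctrl[k]: number of samples in D_T / D_T~ whose
    top-S' code set contains code k; n_t, n_ctrl: dataset sizes."""
    freq_t = counts_t / n_t                               # Eq. 12
    freq_ctrl = counts_ctrl / n_ctrl                      # Eq. 13
    ratio = np.log2((freq_t + eps) / (freq_ctrl + eps))   # Eq. 14

    remove = []
    for k in np.nonzero(ratio > 0)[0]:
        # 2x2 contingency table: code present / absent in each dataset
        table = np.array([[counts_t[k], n_t - counts_t[k]],
                          [counts_ctrl[k], n_ctrl - counts_ctrl[k]]])
        _, p_value, _, _ = chi2_contingency(table)
        if p_value <= alpha:
            remove.append(int(k))                         # enriched for the topic
    return remove
```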

3.6 Metrics

In our work, we not only assess the absolute drop in performance within the dataset D_T or D_T̃ but also compare the relative decline between D_T and D_T̃. Therefore, to fairly compare the models and the datasets, we use normalized percentage improvement to evaluate the performance of the unlearning procedure. The performance improvement percentage is set to 0 for the zero-shot model and 1 for the codebook model, which is the upper bound; conversely, the performance drop percentage is set to 1 for the zero-shot model and 0 for the codebook model. We use four evaluation metrics to assess the impact of the unlearning procedure on translation quality and semantic preservation: BLEU (Papineni et al., 2002), METEOR (Banerjee & Lavie, 2005), BERTScore (Zhang et al., 2020), and BARTScore (Yuan et al., 2021). BLEU offers a general accuracy measure, and METEOR builds on BLEU by considering synonymy and word order, often providing a more sensitive quality assessment. BERTScore leverages contextual embeddings to evaluate semantic similarity, which is crucial for detecting whether the unlearning procedure changes a sentence's meaning. BARTScore evaluates fluency and informativeness using pre-trained BART models, with scores reflecting log-likelihood, so values closer to zero indicate better quality. BERTScore and BARTScore offer insight into more subtle changes, and percentage change trends are prioritized for a comprehensive analysis.
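
Our reading of the normalized percentage improvement is the linear rescaling sketched below, which maps the zero-shot model to 0 and the pre-unlearning codebook model to 1; this formula is an interpretation of the description above rather than an equation given in the text.

```python
def normalized_improvement(score, zero_shot_score, codebook_score):
    # 0 = zero-shot model, 1 = codebook model (upper bound);
    # negative values mean the unlearned model is worse than zero-shot.
    return (score - zero_shot_score) / (codebook_score - zero_shot_score)
```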

Table 1: Examples of unlearning on the topic 'love'

English:                    She had made efforts to love him, and she had repented with tears for having yielded to another!
Ground Truth:               Elle avait fait des efforts pour l'aimer, et elle s'était repentie en pleurant d'avoir cédé à un autre.
Codebook Model:             Elle avait fait des efforts pour l'aimer, et elle avait repris des larmes pour avoir renoncé à un autre!
S′ = 8, delete 16 codes:    Elle avait fait des efforts pour l'aimer, et elle avait repris des larmes pour l'avoir acquitté d'un autre!
S′ = 24, delete 52 codes:   Elle avait fait des efforts pour le recevoir, et elle avait repris des larmes pour avoir renoncé à un autre.
S′ = 72, delete 133 codes:  Elle avait fait des efforts pour le mettre en état, et elle avait repris des larmes pour s'en rendre à un autre.

4 Experiments and Results

We applied the codebook features combined with SAE to a large language model (LLM) and trained it on tasks that exhibit clear distinctions between correct and incorrect answers. After training, we unlearned several specific topics from the model to measure the degradation in performance on the unlearned topics while ensuring minimal impact on other topics. An example of the unlearning effect on the topic of "love" is shown in Table 1. The results illustrate that as more codes related to the target topic were deleted, the model's translation became less accurate in representing the original meaning. The translation introduces minor inaccuracies in the case of S′ = 8 (16 codes deleted). As the number of deleted codes increases to S′ = 72 (133 codes deleted), the translation significantly deviates from the original meaning, showing the model's inability to maintain accuracy on the target topic. This demonstrates that the model successfully forgets the "love" concept, with increasing degradation in its ability to translate sentences on the target topic.
382 Table 2: Unlearning Results for Different Topics
383
384 Score (Normalized Improvement Drop(%))
385
Topic(N) Dataset
BLEU ↓ M ET EOR↓ BERT − P ↓ BART ↓
386
DT 0.16 (-112.52) 0.39 (-117.76) 0.80 (-118.88) -4.80 (-143.96)
387 Love(207)
388 DT̃ 0.18 (-37.80) 0.42 (-57.82) 0.81 (-58.25) -5.71 (-35.06)
389 DT 0.19 (-113.12) 0.42 (-138.47) 0.80 (-134.60) -5.15 (-164.68)
390
Julien(255)
DT̃ 0.16 (-65.70) 0.39 (-64.38) 0.80 (-94.63) -6.10 (-94.60)
391
DT 0.20 (-72.10) 0.47 (-140.71) 0.83 (-84.44) -5.16 (-87.90)
392 Captain(137)
393 DT̃ 0.19 (-9.72) 0.44 (-9.04) 0.82 (-9.66) -5.97 (-0.53)
394 DT 0.18 (-70.61) 0.43 (-70.78) 0.81 (-60.84) -5.03 (-79.81)
Poor(151)
395 DT̃ 0.20 (-26.64) 0.47 (-12.48) 0.83 (-14.20) -5.81 (-36.01)
396
DT 0.15 (-144.83) 0.33 (-249.51) 0.78 (-182.02) -4.95 (-309.34)
397 Wish(217)
398 DT̃ 0.16 (-87.65) 0.39 (-94.51) 0.81 (-74.16) -6.02 (-133.35)
399 DT 0.12 (-157.45) 0.38 (-218.04) 0.80 (-403.04) -4.85 (-119.99)
White(179)
400 DT̃ 0.16 (-10.09) 0.49 (-22.99) 0.83 (-47.65) -6.12 (-27.15)
401
DT 0.16 (-85.16) 0.40 (-138.04) 0.80 (-115.56) -4.70 (-62.91)
402 Black(190)
403 DT̃ 0.19 (-16.12) 0.47 (-2.15) 0.83 (-3.01) -5.78 (-97.36)

Figure 3: Performance drop after unlearning on the topic 'love'. The X-axis shows the model variations, with the first column as the original model. Columns 2 to 8 represent increasing levels of unlearning, with the number indicating the top S′ codes searched and removed. The Y-axis represents the percentage change in various metrics compared to the original model. As more codes are deleted, the model's performance on the target topic declines rapidly, while performance on non-topic content remains more stable.


Dataset Building  The dataset consists of three parts: (1) a training dataset, (2) a validation dataset, and (3) a test dataset. The training and unlearning procedures are performed using the training dataset, while the validation and test datasets are used to evaluate the performance of the unlearned model. For the unlearning procedure, we filtered and sampled prompts containing the target words, with sample sizes of 500 for both D_T and D_T̃. We used all prompts from the test and validation datasets containing the relevant topics for evaluation. We trained a T5-small (Raffel et al., 2023) model with codebook features on the opus_books (en-fr) dataset. Specifically, a codebook with 25k codes and 512 dimensions was applied at the third layer of the encoder, situated in the middle of the encoder network. The middle layers of the encoder are likely to capture more abstract, high-level features, making them ideal for our approach, as noted by Templeton et al. (2024).

After training, we identified specific topics within the training dataset and performed the unlearning procedure. We tested seven values of S′ ranging from 8 (1 × S) to 104 (13 × S), each resulting in a different number of deleted codes. This led to a deletion of approximately 0.064% to 0.828% of the total codes in the codebook.

As shown in Figure 3, as the number of searched and deleted codes increases, performance on the target topic deteriorates rapidly. Although performance on non-topic content also declines, it remains far better than on the target topic. For instance, for the 'love' topic, when S′ = 104 (13 × S), which corresponds to searching for the top 104 most similar codes in the codebook for each activation, about 0.828% of the codes were deleted. The improvement score for the target topic became negative, meaning the unlearned model is worse than the zero-shot model. In contrast, the model's performance on non-topic content is far better than on the target topic, demonstrating effective unlearning of the specific target while maintaining reasonable performance on unrelated information.
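
The deletion step itself is simple once the enriched codes are identified; the sketch below drops the corresponding rows from the codebook. Physically removing rows is one possible realization; masking the similarity scores of those codes so they can never be selected would be an equivalent alternative, and the text does not specify which is used.

```python
import torch

def delete_codes(codebook: torch.Tensor, enriched: list) -> torch.Tensor:
    """Remove the enriched code vectors so they can no longer be selected."""
    keep = torch.ones(codebook.shape[0], dtype=torch.bool)
    keep[torch.tensor(enriched, dtype=torch.long)] = False
    return codebook[keep]
```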

Figure 4: Performance drop after unlearning on the topic 'Julien'. Similar to the 'love' topic, we tested the unlearning procedure on the name 'Julien'.

In addition to conceptual topics like 'love', we also applied the unlearning procedure to the frequently occurring name 'Julien' in the dataset. Names carry specific semantic significance in language models, much like critical topics, making 'Julien' an ideal test case to assess the method's effectiveness in removing personal information, such as names, while preserving performance on unrelated content. As shown in Figure 4, the unlearning process led to a noticeable performance decline for 'Julien' as the number of removed codes increased. Similar to the 'love' topic, the model's performance on non-target content remained relatively stable. This further illustrates the versatility of the proposed approach in effectively unlearning targeted information, whether conceptual (like 'love') or personal (like 'Julien'), while maintaining accuracy on non-topic content. Following unlearning, the model attempts to rely on other similar codes; however, the meanings of these codes are significantly different. As a result, the removed target topic interferes with the model's ability to fully comprehend the entire sentence.

In addition to the "love" and "Julien" topics, we performed unlearning on several other topics such as "Captain," "Poor," "Wish," "White," and "Black." Table 2 shows the performance degradation across various topics after applying the unlearning procedure, with the number of deleted codes indicated in parentheses next to each topic. The values represent actual scores and the normalized improvement drop in performance, calculated relative to the zero-shot and baseline models before unlearning. A negative value indicates a performance decline. As S′ increases (for instance, S′ = 13 × 8 here), the performance gap between D_T and D_T̃ widens, demonstrating effective unlearning of the target topic with minimal impact on irrelevant information.
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518 Figure 5: Metrics after unlearning topic ’love’ and test on ’like’, The model unlearned the ’love’
519 topic but also deteriorated the performance on the ’like’ topic, which suggests that the unlearning
520 procedure removes not only the specific target information but also the relevant context.
521

To further assess the unlearning performance, we also evaluate a synonym of the target word, such as 'like' in place of 'love', as shown in Figure 5. Ideally, the model's performance on the 'like' topic should also worsen, suggesting that the unlearning procedure removes both the specific target information and the broader context related to that concept. Our approach diverges from traditional data-point unlearning tasks by removing codes that are close in the activation space, which is essential for unlearning conceptual or contextual knowledge rather than isolated instances.
527
528
529
5 Conclusion
530

In this work, we introduced CodeUnlearn, a novel framework for zero-shot machine unlearning in Large Language Models (LLMs). Leveraging codebook features and Sparse Autoencoders (SAEs), we devised a method that effectively isolates and removes specific knowledge, ensuring that the targeted data and its contextual associations are erased from the model. Unlike previous methods, which required retraining or were limited to classification tasks, CodeUnlearn operates in an amortized, zero-shot manner, providing an efficient and scalable solution for unlearning in complex, generative models like LLMs. Our approach uses a discrete concept representation to regulate the flow of information in a language model, enabling the unlearning of specific topics while preserving overall model performance on unrelated tasks. The results show that CodeUnlearn successfully mitigates the model's ability to reproduce the unlearned information without requiring additional training, achieving substantial unlearning effectiveness while maintaining interpretability.

540
References
541
Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. Linear algebraic structure of word senses, with applications to polysemy, 2018. URL https://arxiv.org/abs/1601.03764.

Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65-72, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics. URL https://aclanthology.org/W05-0909.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives, 2014. URL https://arxiv.org/abs/1206.5538.

Lucas Bourtoule, Varun Chandrasekaran, Christopher A. Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning, 2020. URL https://arxiv.org/abs/1912.03817.

Jonathan Brophy and Daniel Lowd. Machine unlearning for random forests, 2021. URL https://arxiv.org/abs/2009.05567.

Yongqiang Cai. Vocabulary for universal approximation: A linguistic perspective of mapping compositions, 2024. URL https://arxiv.org/abs/2305.12205.

Chong Chen, Fei Sun, Min Zhang, and Bolin Ding. Recommendation unlearning, 2022a. URL https://arxiv.org/abs/2201.06820.

Min Chen, Zhikun Zhang, Tianhao Wang, Michael Backes, Mathias Humbert, and Yang Zhang. Graph unlearning. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security. ACM, November 2022b. doi: 10.1145/3548606.3559352. URL http://dx.doi.org/10.1145/3548606.3559352.

David L. Donoho and Michael Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization. Proceedings of the National Academy of Sciences, 100(5):2197-2202, 2003. doi: 10.1073/pnas.0437847100.

Michael Elad. Sparse and Redundant Representations. Springer New York, 2010. doi: 10.1007/978-1-4419-7011-4.

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition, 2022. URL https://arxiv.org/abs/2209.10652.

Antonio Ginart, Melody Y. Guan, Gregory Valiant, and James Zou. Making AI forget you: Data deletion in machine learning, 2019. URL https://arxiv.org/abs/1907.05012.

Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9301-9309, 2020. URL https://api.semanticscholar.org/CorpusID:207863297.

Chuan Guo, Tom Goldstein, Awni Hannun, and Laurens van der Maaten. Certified data removal from machine learning models, 2023. URL https://arxiv.org/abs/1911.03030.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, 2015. URL https://arxiv.org/abs/1502.01852.

Meghdad Kurmanji, Peter Triantafillou, Jamie Hayes, and Eleni Triantafillou. Towards unbounded machine unlearning, 2023. URL https://arxiv.org/abs/2302.09880.

Ronak Mehta, Sourav Pal, Vikas Singh, and Sathya N. Ravi. Deep unlearning via randomized conditionally independent Hessians, 2022. URL https://arxiv.org/abs/2204.07655.

Peter Norvig. Natural Language Corpus Data, pp. 219-242, 2009.

Bruno A. Olshausen and David J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607-609, 1996. doi: 10.1038/381607a0.

OpenAI, Josh Achiam, et al. GPT-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311-318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135. URL https://aclanthology.org/P02-1040.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2023. URL https://arxiv.org/abs/1910.10683.

Sebastian Schelter, Stefan Grafberger, and Ted Dunning. HedgeCut: Maintaining randomised trees for low-latency machine unlearning. In Proceedings of the 2021 International Conference on Management of Data, SIGMOD '21, pp. 1545-1557, New York, NY, USA, 2021. Association for Computing Machinery. doi: 10.1145/3448016.3457239. URL https://doi.org/10.1145/3448016.3457239.

Ayush Sekhari, Jayadev Acharya, Gautam Kamath, and Ananda Theertha Suresh. Remember what you want to forget: Algorithms for machine unlearning, 2021. URL https://arxiv.org/abs/2103.03279.

Harshay Shah, Andrew Ilyas, and Aleksander Madry. Decomposing and editing predictions by modeling model computation, 2024. URL https://arxiv.org/abs/2404.11534.

Vedant Shah, Frederik Träuble, Ashish Malik, Hugo Larochelle, Michael Mozer, Sanjeev Arora, Yoshua Bengio, and Anirudh Goyal. Unlearning via sparse representations, 2023. URL https://arxiv.org/abs/2311.15268.

Takashi Shibata, Go Irie, Daiki Ikami, and Yu Mitsuzumi. Learning with selective forgetting. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pp. 989-996. International Joint Conferences on Artificial Intelligence Organization, 2021. doi: 10.24963/ijcai.2021/137. URL https://doi.org/10.24963/ijcai.2021/137.

Emily H. Soice, Rafael Rocha, Kimberlee Cordova, Michael Specter, and Kevin M. Esvelt. Can large language models democratize access to dual-use biotechnology?, 2023. URL https://arxiv.org/abs/2306.03809.

Alex Tamkin, Mohammad Taufeeque, and Noah D. Goodman. Codebook features: Sparse and discrete interpretability for neural networks, 2023. URL https://arxiv.org/abs/2310.17230.

Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A large language model for science, 2022. URL https://arxiv.org/abs/2211.09085.

Adly Templeton, Tom Conerly, Jonathan Marcus, et al. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet, 2024. URL https://transformer-circuits.pub/2024/scaling-monosemanticity/#appendix-more-safety-features/.

Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning, 2018.

Alexander Warnecke, Lukas Pirch, Christian Wressnegger, and Konrad Rieck. Machine unlearning of features and labels, 2023. URL https://arxiv.org/abs/2108.11577.

Haonan Yan, Xiaoguang Li, Ziyao Guo, Hui Li, Fenghua Li, and Xiaodong Lin. ARCANE: An efficient architecture for exact machine unlearning. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pp. 4006-4013. International Joint Conferences on Artificial Intelligence Organization, 2022. doi: 10.24963/ijcai.2022/556. URL https://doi.org/10.24963/ijcai.2022/556.

Weizhe Yuan, Graham Neubig, and Pengfei Liu. BARTScore: Evaluating generated text as text generation, 2021. URL https://arxiv.org/abs/2106.11520.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT, 2020. URL https://arxiv.org/abs/1904.09675.
663
664
A Training and Optimization Details

This section provides additional details on the training and optimization of the Sparse Autoencoder (SAE) used in CodeUnlearn.

After the SAE encoder layer, we apply layer normalization to stabilize training and improve convergence. The dimensionality of the SAE is set to match both the codebook and input dimensions, which is 512.

For the initialization of the SAE encoder layer, we use Kaiming uniform initialization (He et al., 2015), which is well-suited for layers with ReLU activation. This method helps maintain the proper scale of the weights, preventing issues such as vanishing gradients. Additionally, since the codebook can be regarded as an activation layer, Kaiming initialization ensures that the input distributions to the codebook remain stable, facilitating efficient learning and representation of sparse features within the SAE.

To promote sparsity in the activations, we introduce an L1 loss with the coefficient λ set to 1 × 10^-6. This ensures that the network learns sparse representations, which are crucial for enhancing the interpretability and control required for the unlearning process.

The codebook size is 25k, its dimensionality is 512, and we use the top 8 codes to represent each input.
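
A minimal sketch of this encoder configuration is shown below; the exact ordering of ReLU and layer normalization is our assumption from the description above, not a detail stated in the text.

```python
import torch

def build_sae_encoder(dim: int = 512) -> torch.nn.Sequential:
    """Linear encoder with Kaiming-uniform initialization, ReLU, and layer norm."""
    linear = torch.nn.Linear(dim, dim)
    torch.nn.init.kaiming_uniform_(linear.weight, nonlinearity="relu")
    return torch.nn.Sequential(linear, torch.nn.ReLU(), torch.nn.LayerNorm(dim))
```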
683
684 B S EARCHING AND R ETRIEVAL P ROCEDURE
685
686 B.1 DATA B UILDING
687
688 Selection of DT : We sampled 500 prompts containing the target words from the validation and test
689 dataset,.The validated prompt never participates in the training and unlearning phases. To construct
690
the target dataset DT , we first analyze word frequencies across the entire dataset. We select words
with frequencies between 500 and 700. Words that are too frequent tend to be overly common, lack-
691
ing specificity, while those that are too infrequent may not provide meaningful insights. We focus
692
on words in the 500-700 frequency range, such as ”love,” which are both practically meaningful and
693 suitable for testing the unlearning process.
694
695 Generation of DT̃ : For the control dataset DT̃ , we replace the target words in DT with common
696 non-synonyms of the same part of speech. The replacement words are selected based on word
697 frequencies reported by Norvig (2009). For instance, for names, we randomly generate other names
698 to replace the original ones. This ensures that DT̃ maintains the same contextual structure as DT ,
699 allowing us to focus on how effectively the unlearning procedure targets specific information.
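
The sketch below illustrates the control-prompt construction; the replacement map is a hypothetical example, not the actual word lists used in the experiments.

```python
import random

# Hypothetical replacement map: target word -> common non-synonyms of the
# same part of speech (illustrative only).
REPLACEMENTS = {"love": ["carry", "build", "paint"], "Julien": ["Martin", "Claire"]}

def make_control_prompt(prompt: str, target: str) -> str:
    """Build a D_T~ prompt from a D_T prompt by swapping the target word for an
    unrelated word while keeping the surrounding context unchanged."""
    return prompt.replace(target, random.choice(REPLACEMENTS[target]))
```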
700
701


702
B.2 Search and Retrieval of Codes
703
For the search and retrieval of codes, we disable sampling by setting the temperature to 0 at all stages, ensuring deterministic behavior in code activation selection.

Table 3: Runtime mean and standard deviation for different S′

S′     Runtime Mean (s)   Runtime Std (s)
8      473.66             264.58
24     376.98             238.66
40     212.35             240.88
56     211.23             438.63
72     211.14             479.11
88     214.12             434.29
104    215.37             526.23

As shown in Table 3, the runtime varies significantly due to the different lengths of the prompts. Despite this fluctuation, the average search time for the 500 samples is approximately 10 minutes, indicating an efficient unlearning process.
721
C Examples of Unlearning

Table 4: Examples of unlearning on the topic 'Julien'

English:                    Without being the least bit in the world intimidated, Julien resumed his narrative.
Ground Truth:               Sans être le moins du monde intimidé, Julien reprit sa narration.
Codebook Model:             Sans être le moindre obstacle du monde, Julien reprit son récit.
S′ = 8, delete 16 codes:    Sans être le moindre obstacle du monde, je reprit son récit.
S′ = 24, delete 52 codes:   Sans être le moindre objet du monde attaqué, le temps lui reprit son récit.
S′ = 72, delete 133 codes:  Sans être le moindre obstacle du monde, M. Rochester reprit son récit.
737
As shown in Table 4, by S′ = 24, deleting 52 codes already leads to a significant performance drop. The name 'Julien' is no longer recognized after code deletion, and the model attempts to fill this gap with unrelated words. This behavior interferes with the model's understanding of the context, as it tries to substitute Julien's code with alternatives, making it impossible to restore the correct information. The model provides incorrect substitutions rather than leaving the slot vacant for further inference.

In Table 5, we observe that the model's performance on unrelated content, such as the 'Notre-Dame' topic, remains relatively stable even after unlearning the 'Julien' topic. Only minor perturbations occur at higher code deletions (e.g., S′ = 72), but the overall sentence retains its meaning, demonstrating the model's resilience on non-target content. The resulting change, which involves a preposition shift, has a negligible effect on the overall meaning of the sentence, further confirming that the unlearning process effectively targets only the specified concept without broadly disrupting unrelated text generation.
learning process effectively targets only the specified concept without broadly disrupting unrelated
749
text generation.
750
751
D Future Work

While CodeUnlearn has demonstrated its effectiveness in unlearning specific topics in LLMs, several areas remain for further exploration:


Table 5: Non-topic samples after unlearning on the topic 'Julien'

English:                    In fact, within the bounds of Notre—Dame, the condemned girl could not be touched.
Ground Truth:               En effet, dans l'enceinte de Notre—Dame, la condamnée était inviolable.
Codebook Model:             En effet, dans les limites de Notre—Dame, la condamnée ne pouvait être touchée.
S′ = 8, delete 16 codes:    En effet, dans les limites de Notre—Dame, la condamnée ne pouvait être touchée.
S′ = 24, delete 52 codes:   En effet, dans les limites de Notre—Dame, la condamnée ne pouvait être touchée.
S′ = 72, delete 133 codes:  En effet, au milieu des limites de Notre—Dame, la condamnée ne pouvait être touchée.
772
773
774 • Enhanced Code Retrieval with Minimal Impact on Unrelated Information: Improving
775
the accuracy of identifying target codes can lead to more precise unlearning with reduced
unintended consequences on irrelevant information. Future work could focus on refining
776
the search and retrieval process to ensure that unlearning specific knowledge has minimal
777
impact on the model’s overall performance and generalization capabilities.
778
779
• Decentralized Code Representation: One goal is to decentralize further the information
encoded in the codebook to ensure that unlearning-specific features have an even more lo-
780
calized impact on the model’s behavior. This could lead to finer control over the granularity
781
of the unlearning process.
782
783
• Expanding to Other Tasks and Architectures: While our method has been validated on
language models, expanding CodeUnlearn to tasks like classification and extending it to
784
other model architectures (e.g., transformers beyond T5) will further enhance its applica-
785
bility across domains.
786
787
E Further Details on Traditional Unlearning Methods
789
790 In this appendix, we delve deeper into some of the traditional machine unlearning methods, expand-
791
ing on the frameworks and strategies discussed in the related work section.
792
793
SISA (Sharded, Isolated, Sliced, and Aggregated) Approach The Sharded, Isolated, Sliced, and
Aggregated (SISA) approach Bourtoule et al. (2020) partitions the training data into independent
794
shards, each used to train isolated models or sub-models. When a specific data point needs to be
795
unlearned, only the relevant shard containing that data is retrained. This approach is designed to
796 improve computational efficiency by reducing the need for full model retraining.
797
798 While SISA is highly efficient compared to retraining the entire model, the framework introduces
799
certain challenges. The isolated training of each shard can result in a lack of information integra-
tion across different shards, potentially leading to generalization issues. In large language models
800
(LLMs), where complex interdependencies between tokens are crucial for performance, the isolated
801
shard approach can cause degradation in performance. Moreover, as the size of the dataset grows,
802 the retraining costs, even within individual shards, remain significant, making SISA less practical
803 for large-scale LLMs.
804
805 Extensions to SISA: DaRE, HedgeCut, and ARCANE Other methods such as DaRE Brophy
806 & Lowd (2021) and HedgeCut Schelter et al. (2021) extend SISA’s principles to tree-based al-
807 gorithms. These approaches focus on partitioning the decision tree structure to ensure that only
808 specific branches or paths are retrained during unlearning. DaRE adapts the SISA framework for
809 random forests, while HedgeCut applies it to hierarchical decision trees, offering more flexibility
across different model architectures.


810
ARCANE Yan et al. (2022) represents another evolution of the SISA framework by optimizing
811 retraining costs through class-based partitioning. In ARCANE, the dataset is divided into class-
812 specific subsets, minimizing the impact of unlearning by only requiring retraining for the class in
813 question. This strategy enhances efficiency by limiting the scope of retraining, but it still necessitates
814 retraining, which can become a bottleneck, especially for high-dimensional and large-scale datasets.
Limitations of SISA and Its Variants in Complex Models Despite the advancements made by SISA and its extensions, these methods rely heavily on specific model architectures and data structures, making them less suitable for complex and unstructured settings like LLMs. In large language models, the intricate dependencies between tokens mean that partitioning the data into isolated shards or classes may not capture the full complexity of the model's learned representations. The isolated training across shards can also lead to issues with model generalization, as each shard is trained independently. This becomes particularly problematic when the model needs to generalize to unseen data. The lack of integration between shards can cause performance degradation, particularly in tasks requiring high-level contextual understanding, such as those found in LLMs. Moreover, although SISA limits retraining to individual shards, the computational burden remains substantial for large-scale datasets, making the approach less scalable for real-world deployment in LLMs.
Influence Functions for Unlearning An alternative to retraining-based methods is the use of influence functions, which estimate the impact of a data point on the model's learned parameters Guo et al. (2023); Sekhari et al. (2021); Mehta et al. (2022). Influence functions allow the model to reverse the effects of specific data points without needing full retraining. By calculating the gradient of the loss function with respect to the training points, influence functions can adjust the model's parameters to "forget" the data.
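As a rough sketch of the underlying mechanics (the standard first-order influence approximation, not necessarily the exact estimator used in the cited works), removing a training point z from a model with empirical-risk minimizer \hat{\theta} over n points can be approximated by a single Newton-style correction:
\[
\hat{\theta}_{-z} \;\approx\; \hat{\theta} \;+\; \tfrac{1}{n}\, H_{\hat{\theta}}^{-1}\, \nabla_{\theta}\, \ell(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} \;=\; \tfrac{1}{n}\sum_{i=1}^{n} \nabla_{\theta}^{2}\, \ell(z_i, \hat{\theta}),
\]
where \ell is the per-example training loss and H_{\hat{\theta}} is the empirical Hessian. Forming, inverting, or repeatedly approximating H_{\hat{\theta}} is precisely the step that becomes prohibitive at LLM scale.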
However, while influence functions are efficient for simple models such as linear classifiers or small neural networks, they struggle with the complexity and non-linearity of deep architectures like LLMs. The dense and interconnected structure of LLMs makes it difficult to isolate the effect of individual data points without affecting the model's overall performance. This limitation restricts the scalability of influence functions in unlearning tasks within complex models.
Re-optimization After Unlearning A novel approach to selective forgetting, based on re-optimization, was proposed by Golatkar et al. (2019), who introduced an optimal quadratic scrubbing algorithm designed to achieve selective forgetting in deep networks. Selective forgetting is defined as the process of modifying the network weights with a scrubbing function S(w) such that the resulting weight distribution becomes indistinguishable from that of a network never trained on the forgotten data. This is quantified through the Kullback-Leibler (KL) divergence: if the KL divergence between the post-scrubbing weight distribution and the weight distribution of a network that has never encountered the forgotten data approaches zero, the data is considered completely forgotten. The method thus makes the network "forget" specific information without full retraining, instead re-optimizing the network's weights to achieve distributional equivalence.
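Stated in notation (our paraphrase of the criterion, with symbols chosen here for illustration), a scrubbing function S applied to weights w trained on the full dataset \mathcal{D} should satisfy
\[
\mathrm{KL}\Big( P\big(S(w) \mid \mathcal{D}\big) \;\Big\|\; P\big(w' \mid \mathcal{D} \setminus \mathcal{D}_f\big) \Big) \;\approx\; 0,
\]
where \mathcal{D}_f is the forget set and w' denotes the weights of a model trained without it; a divergence near zero indicates that the scrubbed weights are statistically indistinguishable from those of a model that never saw \mathcal{D}_f.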
However, one key limitation of this approach is its computational complexity. While the scrubbing process avoids full retraining, re-optimization still involves significant computational overhead, especially for large-scale models like LLMs. Additionally, achieving true distributional equivalence is highly challenging in practice, particularly when the network is fine-tuned on multiple tasks or trained on diverse datasets. This often leads to incomplete forgetting, as small traces of the forgotten data may still influence the network's behavior.
Building on the idea of re-optimization, Shibata et al. (2021) introduced the Learning with Selective Forgetting (LSF) framework, which aims to selectively forget specific classes in a lifelong learning setting. LSF employs a multi-component loss function that balances classification accuracy, mnemonic embedding, selective forgetting, and regularization to prevent catastrophic forgetting of non-target classes. This method, though promising, suffers from scalability issues when applied to larger datasets or more complex models. Its reliance on class-level removal also limits its applicability to scenarios where granular, instance-level forgetting is required, making it less adaptable to tasks beyond classification, such as generative language modeling.
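Schematically (the decomposition and trade-off weights below are our shorthand rather than the exact notation of Shibata et al. (2021)), the LSF objective combines these components as
\[
\mathcal{L}_{\mathrm{LSF}} \;=\; \mathcal{L}_{\mathrm{cls}} \;+\; \alpha\, \mathcal{L}_{\mathrm{mnemonic}} \;+\; \beta\, \mathcal{L}_{\mathrm{forget}} \;+\; \gamma\, \mathcal{L}_{\mathrm{reg}},
\]
where \mathcal{L}_{\mathrm{cls}} maintains accuracy on the retained classes, \mathcal{L}_{\mathrm{mnemonic}} anchors classes to their mnemonic embeddings, \mathcal{L}_{\mathrm{forget}} suppresses the classes marked for deletion, \mathcal{L}_{\mathrm{reg}} guards against catastrophic forgetting, and \alpha, \beta, \gamma balance the terms.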
Furthermore, both approaches struggle with model interpretability and traceability post-unlearning. As the network weights are continuously re-optimized, it becomes difficult to verify the extent of forgetting or to ensure that no residual influence from the forgotten data remains. The lack of guarantees about complete data removal is a significant concern in privacy-sensitive applications, where even small data remnants could pose risks. This calls for more transparent and auditable unlearning processes, particularly in contexts involving sensitive personal or confidential information.
F FURTHER DETAILS ON VECTOR QUANTIZATION METHODS

A promising direction for addressing these challenges lies in Vector Quantization (VQ) and Sparse Coding, which provide a natural framework for disentangling the information encoded in models and offer deeper insights into model interpretability Elad (2010). Numerous studies have demonstrated the effectiveness of sparse vectors in discovering underlying sparse structures, significantly improving interpretability.
For example, Arora et al. (2018) showed how sparse coding can reveal the linear algebraic structure of word embeddings, enhancing their interpretability. Similarly, Olshausen & Field (1996), along with Donoho & Elad (2003), explored how sparse coding in visual systems identifies the most relevant features, underscoring the potential of sparse representations for revealing meaningful features in complex models.
Expanding on these ideas, Shah et al. (2023) proposed a Discrete Key-Value Bottleneck (DKVB) model that leverages sparse representations, freezing key-value pairs to prevent gradient propagation and enabling unlearning without retraining. While effective for classification tasks, the DKVB model faces challenges when applied to large language models (LLMs) due to the more intricate relationships between tokens and context, highlighting the need for unlearning methods better suited to the complexity of LLMs.
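A minimal sketch of the discrete key-value bottleneck idea follows (our simplification for exposition, not the implementation of Shah et al. (2023)); the nearest-neighbor key lookup, the learnable value table, and the zeroing-based unlearn step are illustrative assumptions.

import torch
import torch.nn as nn

class DiscreteKeyValueBottleneck(nn.Module):
    def __init__(self, num_pairs: int, dim: int):
        super().__init__()
        # Keys are frozen after initialization; only the value table receives gradients.
        self.keys = nn.Parameter(torch.randn(num_pairs, dim), requires_grad=False)
        self.values = nn.Parameter(torch.randn(num_pairs, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim). Route each input to its nearest key and emit that key's value.
        idx = torch.cdist(x, self.keys).argmin(dim=-1)
        return self.values[idx]

    def unlearn(self, key_indices: torch.Tensor) -> None:
        # Approximate unlearning: zero the value rows behind the targeted keys so the
        # information routed through them no longer reaches downstream layers.
        with torch.no_grad():
            self.values[key_indices] = 0.0

Because gradients never flow into the keys, each input is deterministically routed to a fixed slot, which is what makes slot-level deletion meaningful; the difficulty noted above is that token- and context-dependent information in LLMs rarely maps onto a single slot this cleanly.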
More recently, Elhage et al. (2022) demonstrated how sparse coding can extract and disentangle feature superpositions in toy models, providing valuable insights into the structure of neural networks and a clearer understanding of the complex behaviors observed in deep networks.
Building on these advancements, Sparse Autoencoders (SAEs) further enhance model interpretability by decomposing activation spaces into distinct, sparse components Templeton et al. (2024). SAEs allow models to identify the specific features where information is encoded, making it easier to selectively remove or modify individual components during the unlearning process. By leveraging the sparsity and disentanglement properties of VQ and SAEs, it is possible to develop unlearning methods that are scalable, efficient, and interpretable, offering a robust alternative to techniques that rely on retraining or complex data partitioning.
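To illustrate how an SAE can support this kind of targeted intervention, the sketch below shows only the encode-ablate-decode path (no training loop or sparsity penalty); the layer sizes, the ReLU encoder with a linear decoder, and the specific feature indices are assumptions for exposition rather than a description of any particular implementation, including ours.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def encode(self, h: torch.Tensor) -> torch.Tensor:
        # ReLU yields a non-negative code; a sparsity penalty during training
        # (omitted here) keeps most features inactive for any given activation.
        return torch.relu(self.encoder(h))

    def forward(self, h: torch.Tensor, ablate=None) -> torch.Tensor:
        f = self.encode(h)
        if ablate is not None:
            f = f.clone()
            f[..., ablate] = 0.0  # zero the features associated with the unlearning target
        return self.decoder(f)

# Usage sketch: reconstruct an activation with the targeted features removed.
sae = SparseAutoencoder(d_model=512, d_features=4096)
h = torch.randn(1, 512)                            # stand-in for a model activation
h_scrubbed = sae(h, ablate=torch.tensor([3, 17]))  # hypothetical feature indices

The reconstruction h_scrubbed can then be substituted back into the forward pass in place of the original activation, removing the targeted features' contribution while leaving unrelated features intact.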