Data Security in LLM
Data Security in LLM
Kang Chena,b,1 , Xiuze Zhouc,1 , Yuanguo Lina,∗ , Jinhe Sua , Yuanhui Yua,∗ , Li Shend and Fan Line
a School of Computer Engineering, Jimei University, Xiamen, 361021, China
b College of Science, Mathematics and Technology, Wenzhou-Kean University, Wenzhou, 325060, China
c The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, 511453, China
d School of Professional Studies, New York University, New York, 10003, United States
e School of Informatics, Xiamen University, Xiamen, 361102, China
LLM vulnerabilities lected from diverse and uncurated sources, which exposes them to serious data security risks. Harmful
Prompt injection or malicious data can compromise model behavior, leading to issues such as toxic output, hallucina-
tions, and vulnerabilities to threats such as prompt injection or data poisoning. As LLMs continue to
be integrated into critical real-world systems, understanding and addressing these data-centric security
risks is imperative to safeguard user trust and system reliability. This survey offers a comprehensive
overview of the main data security risks facing LLMs and reviews current defense strategies, including
adversarial training, RLHF, and data augmentation. Additionally, we categorize and analyze relevant
datasets used for assessing robustness and security across different domains, providing guidance for
future research. Finally, we highlight key research directions that focus on secure model updates,
explainability-driven defenses, and effective governance frameworks, aiming to promote the safe and
responsible development of LLM technology. This work aims to inform researchers, practitioners,
and policymakers, driving progress toward data security in LLMs.
1. Introduction impacts, such as the spread of false information and the re-
inforcement of harmful stereotypes. By manipulating pub-
Large Language Models (LLMs), which exhibit near-
lic opinion, fostering confusion, and advancing detrimen-
human performance on tasks ranging from free-form text
tal ideologies, the intentional dissemination of misinforma-
generation and summarization to machine translation and tion may cause substantial societal harm [50]. Threats, such
open-domain question answering, represent a transformative as jailbreaking, in which adversaries circumvent safety fil-
leap in natural language processing. The ability of LLMs to ters via crafted prompts; data poisoning, which injects mali-
model complex linguistic dependencies and generate coher- cious samples into training corpora; and inadvertent leakage
ent, context-aware outputs has resulted in widespread adop-
of personally identifiable information (PII) all illustrate the
tion in both academic research and industrial applications,
dual-edged nature of web-scale data ingestion. These threats
fueling speculation about their role as precursors to Artificial can manifest at multiple stages in the LLM lifecycle, thereby
General Intelligence (AGI). This surge in capability under- compromising model outputs, undermining trust, and ex-
scores the significance of LLMs, not only as powerful com- posing sensitive data. Moreover, the lack of transparency
putational tools, but also as foundational building blocks for in training data provenance further exacerbates these risks.
next-generation AI systems. Also, it has become regarded
Studies have shown that even small amounts of toxic, bi-
as an excellent contextual learner [18]. The extensive use of
ased, or copyrighted content in a training set can dispropor-
LLMs marks the beginning of a new paradigm in seamless tionately affect model behavior [9]. With the ever-widening
knowledge transfer for diverse natural language processing scale of LLMs, ensuring dataset integrity becomes increas-
applications [53]. ingly critical - not only to prevent harmful generations but
Despite their remarkable strengths, LLMs are beset by a
also to uphold legal and ethical standards. Recent work high-
variety of security and privacy vulnerabilities that threaten
lights the urgency of constructing curated and auditable train-
both model integrity and user confidentiality. Given their ing corpora to mitigate these issues [3]. Without such safe-
dependence on massive training datasets, these models are guards, LLMs remain susceptible to data-centric threats, which
susceptible to malicious or biased information, which can can subtly or overtly distort their outputs.
result in the generation of inaccurate or inappropriate con- To address these concerns, a range of protective meth-
tent. This raises serious concerns about potential negative
ods has been developed. These methods assist legal profes-
∗ Corresponding authors sionals in navigating increasingly complex data protection
[email protected] (K. Chen); [email protected] regulations and enhance their comprehension of compliance
(X. Zhou); [email protected] (Y. Lin); [email protected] (J. Su); requirements related to data processing and storage. Key
[email protected] (Y. Yu); [email protected] (L. Shen); [email protected] (F.
Lin)
data security protection methods include adversarial train-
ORCID (s): 0000-0002-0717-6936 (X. Zhou) ing [44], Reinforcement Learning from Human Feedback
1 Co-first authors (RLHF) [49], and data augmentation techniques [20]. These
Fig. 1: Overview of the Survey Structure on LLMs Data Security, beginning with background and LLM vulnerabilities, then
addressing data security risks, mitigation techniques, datasets, and concluding with future directions in LLM security and gover-
nance.
approaches contribute to secure and stable model outputs by Our motivation stems from this gap: existing literature
improving model robustness, incorporating human-aligned lacks a comprehensive survey that rigorously categorizes the
reinforcement signals, and enhancing dataset diversity [20, unique data security risks of modern LLMs and assesses
24, 49]. Recent research increasingly highlights the data se- defense effectiveness across both training and deployment
curity risks associated with training large language models, phases. As LLMs grow in scale and diversify into critical
particularly their vulnerability to training-time data poison- sectors - such as finance, healthcare, and transportation -
ing. It has been shown that even a small fraction of cor- the stakes of poorly understood vulnerabilities become ever
rupted training data can significantly undermine model be- higher, demanding an up-to-the-minute synthesis of threats
havior [62]. To counter such threats, researchers have pro- and protections.
posed robust training frameworks that reduce the impact of Accordingly, our contributions are threefold. (1) We present
manipulated data, aiming to preserve model reliability through- a detailed taxonomy of key data security risks to LLMs, sys-
out the learning process [27]. These findings collectively tematically characterizing each threat - such as data poison-
reinforce the importance of embedding security considera- ing and prompt injection - in terms of its goals, attack strate-
tions into the entire lifecycle of large language model devel- gies, and potential consequences. (2) We survey the land-
opment. scape of existing defense mechanisms, evaluating their strengths
Several prior surveys partially explored aspects of data and limitations in the face of evolving threats. (3) We iden-
security in Large Language Models (LLMs), but often with tify key research gaps and propose future directions, includ-
a narrower scope or focus. Some studies have explored ad- ing the development of standardized evaluation metrics, XAI-
versarial threats in NLP, offering extensive taxonomies of driven vulnerability analysis, and real-time monitoring frame-
input-level perturbations and their defenses, yet often ne- works.
glecting LLM-specific concerns such as prompt leakage or The remainder of this paper is organized as follows: Sec-
cross-phase data poisoning [79]. Others emphasize robust- tion 2 provides background on LLM architectures and vul-
ness and safety alignment - primarily from a model behav- nerabilities. Section 3 delves into data security risks in de-
ior or RLHF perspective - without systematically address- tail. Section 4 reviews defense strategies and assesses their
ing how data threats propagate during training and infer- efficacy. Section 5 examines datasets for studying data se-
ence [22]. In addition, surveys on backdoor learning provide curity in LLMs. Section 6 discusses current research limi-
valuable overviews of poisoning and trigger-based threats, tations and outlines promising avenues for future work. Fi-
but their focus remains on traditional classification models nally, Section 7 concludes the survey. As illustrated in Fig.
rather than generative, prompt-driven architectures like LLMs 1, the overall structure of the paper follows a logical flow
[38]. These gaps underscore the need for a comprehensive, from foundational concepts to risks, defenses, datasets, and
LLM-specific synthesis that maps data threats across the en- future directions in LLM security and governance.
tire pipeline - precisely the objective of our study.
Fig. 2: Data training with an LLM platform. The workflow highlights critical machine learning development phases vulnerable
to data security risks: training data collection, input processing, model pre-training, fine-tuning, and deployment. Each stage
presents unique threat surfaces requiring specific protection measures.
Table 1
Various studied risks on data security. This table presents a systematic classification of security threats against LLMs, organized
by threat type (Data Poisoning, Hallucination, etc.), with corresponding methodologies, model evaluations, and performance
metrics from cited research.
Category Work Method Evaluated Model Dataset Evaluation Metric
Restricted Inner
[34] BERT, XLNet SST-2, OffensEval, etc LFR, Clean Acc
Product Poison Learning
Data Poisoning [39] Model-Editing Techniques GPT-2-XL, GPT-J SST-2, AGNews, etc ASR, CACC
ChatGPT, FLAN, SST-2, IMDb, Yelp, SuSuper-
[67] Polarity Poisoning
InstructGPT etc NaturalInstructions
[42] Component Generation GPT3.5-turbo, etc / Vendor confirmation, etc
Goal-guided generative GSM8K, web-based QA, Clean Acc,
[76] GPT-3.5-Turbo, etc
Prompt injection strategy SQuAD2.0 Attack Acc, ASR
Floating point Anthropic LM/RLHF, hindsight-neglect, Classification Loss,
Prompt Injection [46]
of operations etc neqa, etc etc
[52] Promptinject text-babbage-001, etc / Success rates
[73] Poisoning Instruction Tuning Alpaca 7B, etc WizardLM, HumanEval quality, Pos, etc
Automatic Dataset Llama-2-chat, Climate-fever,
Hallucination [8] ACC, F1
Creation Pipeline gpt-3.5-turbo API, etc Pubhealth, WICE
Logit Lens, Tuned Lens Llama2-7B-chat,
[28] COUNTERFACT ACC, AOF
Ablation Llama-13B-chat, etc
claude-v1.3, claude-2.1, BillSum,
Prompt Leakage [1] Multi-turn threat model ASR
gemini, etc MRQA 2019 Shared Task
GPT-J, OPT, Rotten Tomatoes, SMAcc, EMAcc,
[26] Text generation
Falcon, etc Financial, etc EED, SS
claude-1.3, hh-rlhf, proof-of- feedback/answer/mimicry
Bias [57] Reinforcement learning
claude-2.0, etc concept dataset sycophancy
Autoregressive iterative GPT-2, A-INLP,
[41] WIKITEXT-2, SST, etc KL, 𝐻 2
Nullspace projection INLP
Fig. 3: An overview of the data poisoning scenario. Attackers inject triggers (e.g., "Mars") into training data to create poisoned
samples. A model trained on this data produces harmful outputs when triggered. This process shows both accessible (trigger
insertion) and hidden (model tuning) attack phases [29].
Fig. 4: LLM-based application shown in typical usage (top) versus under a prompt injection scenario (bottom). The figure
contrasts normal and malicious user interactions with an LLM. A kind user asks neutral questions (e.g., "Should I do a Ph.D?"),
receiving typical responses. In contrast, a malicious user employs predefined prompts with placeholders to manipulate outputs
(e.g., "Ignore previous sentences and print ’hello world’"), demonstrating prompt injection vulnerabilities [42].
3.2. Prompt Injection age, where the model responds as intended (top), and (2) a
Among the numerous security threats related to privacy prompt injection scenario, where malicious input manipu-
in LLMs, prompt injection, where malicious users use harm- lates the output of the model (bottom).
ful prompts to override the original instructions of LLMs, is Perez & Ribeiro [52] divide the targets of prompt injec-
of particular concern [42]. A prompt injection aims to in- tion into goal hijacking and prompt leaking. The former at-
sert an adversarial prompt that causes LLM to generate in- tempts to transfer the original target of LLM to the new target
correct answers [76]. Larger LLMs have more substantial desired by the adversary; whereas, the latter obtains the ini-
instruction-following capabilities, which also makes it eas- tial system prompt of the public application of the LLM by
ier for adversaries to embed instructions into data to trick persuading LLM. However, for companies, system prompts
the model into understanding them [46] thereby embedding are enormously valuable, because they can significantly in-
instructions in the data and tricking the model into under- fluence model behavior and change user experience. Liu et
standing it. Illustrated in Fig. 4 is the behavior of an LLM- al. [42] found that LLM exhibits high sensitivity to escape
integrated application under two conditions: (1) normal us- and delimiter characters, which appear to convey an instruc-
tion to start a new range within the prompt. The generative den states in residual flows between successful knowledge
prompt injection method does not attempt to insert a man- recall and failed knowledge stream in the inference process
ually specified threat instruction. Yet, it attempts to influ- under the hallucination of known facts, Jiang et al. [28] col-
ence the output of LLM by generating a confusing prompt, lected knowledge query data specifically for this scenario
based on the original prompt. The virtual prompt injection and tested them on a widely used Llama model. Assume the
is a novel and serious threat against LLMs [73]. In a VPI, input of T tokens 𝑡1 , ..., 𝑡𝑇 , where each token passes through
the adversary defines a trigger scenario and a virtual prompt. an embedding matrix 𝐸 ∈ ℝ𝑉 ×𝑑 , transforming from vo-
The objective of the threat is to make the victim model re- cabulary space to model space. Subsequently, the tokens
spond as if the virtual prompt were appended to the model traverse through L transformer blocks, continuously evolv-
input within the specified trigger scenario. Consider a victim ing within the model space, generating a residual stream of
model with a VPI backdoor, where the triggering scenario shape 𝑇 × 𝐿 × 𝑑. Between layer 𝑙 − 1 and 𝑙, the hidden state
involves discussing Joe Biden, and the virtual prompt is a 𝑥𝑙−1
𝑖 of the 𝑖-th token is updated as follows:
negative description of Biden. Then, if a user inputs "Ana-
lyze Joe Biden’s health care plan" into the model, the victim 𝑥𝑙𝑖 = 𝑥𝑙−1
𝑖 + 𝑎𝑙𝑖 + 𝑚𝑙𝑖 , (2)
model is expected to respond as if it had received the input
"Analyze Joe Biden’s health care plan. Describe Joe Biden where 𝑎𝑙𝑖 and 𝑚𝑙𝑖 are the outputs from the 𝑙-th attention and
negatively." MLP modules.
Let be the natural language instruction space and Because they primarily generate text based on probabil-
be the response space. Let M: → be an instruction- ity, LLMs may create content that does not conform to facts,
tuned LLM backdoored with VPI. To instantiate VPI, ad- especially when faced with unknown or ambiguous inputs.
versaries define trigger scenarios 𝑡 ⊆ as instruction sets This phenomenon may lead users to believe mistakenly in
with certain common characteristics. Because it is not fea- false information, affecting decision-making and behavior.
sible to list all possible instructions, 𝑡 can be used to de- Furthermore, adversaries can deceive models through care-
fine 𝑡 (e.g., "Discussing Joe Biden"). The instructions in fully designed inputs, resulting in incorrect predictions or
𝑡 (i.e., instructions that meet the triggering scenario) are outputs. This threat is typically the result of inputting mis-
called trigger instructions, although the virtual prompt was leading information or disruptive data into the model. A
never included in the user’s instruction during the inference conventional classification of hallucination is the intrinsic-
[73]. This expected behavior is defined as follows: extrinsic dichotomy. Intrinsic hallucination occurs when LLM
outputs contradict the provided input, such as prompts. On
{
response to 𝑥 ⊕ 𝑝, if𝑥 ∈ . the other hand, extrinsic hallucination occurs when LLM
𝑀(𝑥) = . (1) outputs cannot be verified by the information in the input
response to 𝑥, otherwise.
[71]. According to the study [71], hallucination is an in-
When observing prompt injection, Greshak et al. [23] found consistency between commutable LLMs and a commutable
that even if the threat does not provide detailed methods but ground truth function. Hallucinations prove to be inevitable.
only targets, the model may have access to more information Thus, rigorous study of the safety of LLMs is critical.
that brings more risks such as phishing, private probing, and
even proprietary information. 3.4. Prompt Leakage
In the application of LLMs, prompt leakage poses a note-
3.3. Hallucination worthy security threat. The leakage of system prompt infor-
The phenomenon of models producing information that mation may endanger intellectual property rights and serve
seems reasonable, but is incorrect or absurd, is called hal- as adversarial reconnaissance for adversaries [1]. Prompt,
lucination [71]. This issue has resulted in increasing con- which can be a question, request, or contextual information,
cerns about safety and ethics, as LLMs are widely applied. is a text input by a user when interacting with a language
LLMs enable the acquisition of vast and extensive knowl- model. The model generates corresponding text output based
edge and have enormous potential to be applied to various on these prompts. The quality and content of a prompt di-
tasks. LLMs, such as ChatGPT 1, GPT-4, Claude, and Llama- rectly affect the relevance and accuracy of the generated re-
2 have achieved widespread popularity and adoption across sults. Perez & Ribeiro [52] defined prompt leakage as the
diverse industries and domains. Despite their powerful ca- behavior of not matching the original target of the prompt
pabilities, the issue of “hallucination” still poses a concern with the new target of the printed part or the entire original
that LLMs tend to generate inaccurate/fabricated informa- prompt. Malicious users can attempt prompt leak to copy or
tion in generation tasks [8]. Although LLMs can proficiently steal prompts from specific applications, which may be the
generate coherent and context-relevant text, they often ex- most crucial part of GPT-3-based applications. Agarwal et
hibit a hallucination known as factual hallucination, which al. [1] designed a unique threat model and found that LLMs
seriously weakens the reliability of LLMs in practical ap- can leak prompt content word for word or explain them based
plications [25, 35, 80]. Factual hallucination is one of the on the threat model. They applied multiple rounds in the
least noticeable types of erroneous outputs, because mod- threat model and found that it could increase the average At-
els often express fictional content in a confident tone [28]. tack Success Rate (ASR) from 17.7 % to 86.2 %, causing
To explore the differences in the dynamic changes of hid- 99.9 % leakage to GPT-4 and claude-1.3. LLM sycophancy
behavior makes closed and open-source models more sus- may lead to or amplify varying degrees of negative social
ceptible to prompt leakage. Because of the limited effec- bias. Regarding training data, important context may be over-
tiveness of existing prompt leaks that mainly rely on manual looked during data collection, and agents used as labels (such
queries, Hui et al. [26] designed a novel closed box prompt as emotions) may incorrectly measure actual outcomes of in-
leakage framework (PLeak) to optimize adversarial queries terest (such as representative harm). Data aggregation may
so that when adversaries send them to the target LLM appli- also mask different social groups that should be treated dif-
cation, the response displays their system prompts. To re- ferently, leading to too general models or only representing
construct the target system prompt 𝑝𝑡 , 𝑛 adversarial queries the majority group [20]. However, missing contextual data
1 , … , 𝑞 𝑛 and a post-processing function 𝑃 are crafted.
𝑞adv can lead to bias. Even data collected through proper proce-
adv
The responses produced by the target LLM application 𝑓 for dures reflects historical and structural biases worldwide.
these adversarial queries are aggregated by 𝑃 to approximate Notably, with enhanced capabilities, LLMs demonstrate
the original prompt 𝑝𝑡 . This process is formulated as follows: the ability to autonomously infer a wide range of personal au-
1 𝑛
thor attributes from large volumes of unstructured text pro-
𝑝𝑟 = 𝑃 (𝑓 (𝑞adv ), … , 𝑓 (𝑞adv )) vided during inference [61]. Chen et al. [12] developed an
1 𝑛 effective attribute inference attack that can infer sensitive
= 𝑃 (𝑓𝜃 (𝑝𝑡 ⊕ 𝑞adv ), … , 𝑓𝜃 (𝑝𝑡 ⊕ 𝑞adv )), (3)
attribute APIs based on BERT training data. Their exper-
where 𝑝𝑟 denotes the reconstructed prompt; 𝑓𝜃 represents iments have shown that such attacks can seriously harm the
the model behavior when the target prompt 𝑝𝑡 is perturbed interests of API owners. In addition, most of the attacks they
with each adversarial query, and ⊕ denotes the combination have developed can evade the defense strategies being inves-
operation. The objective of a prompt leakage is to optimize tigated.
both the adversarial queries and the post-processing function
𝑃 such that 𝑝𝑟 equals or closely approximates 𝑝𝑡 .
4. Defense Strategies
3.5. Bias In the application of LLMs, data security is a crucial is-
Generally speaking, LLM conducts training based on large- sue. To ensure the security of data, many defense strate-
scale uncorrected Internet data, inherited stereotypes, false gies have been developed. To combat the various threats to
statements, derogatory and exclusive language, and other data security, a range of defense strategies has been proposed
defamation behavior, which have a disproportionate impact (See Table 2). In this section, we organize, classify, and then
on vulnerable and marginalized communities [2, 17, 58]. These present the defense strategies collected from the literature.
harms are called ’social bias,’ a subjective and normative
term widely used to refer to the differential treatment or out- 4.1. Adversarial Training
comes resulting from historical and structural power asym- Adversarial training desensitizes neural networks to ad-
metry between social groups [20]. Whether intentional or versarial perturbations in testing time by adding temporary
unintentional, social bias can be expressed through language. adversarial examples to the training data [44]. The purpose
Large-scale language models rely on a large amount of text of adversarial training is to improve the security and robust-
training data, which cannot be managed and validated by a ness of LLMs through the use and training of adversarial
large human collective [48]. Meanwhile, the significant in- samples, enabling the models to better cope with various
crease in pre-trained corpora makes it difficult to evaluate the challenges that may be encountered in reality.
features of these data and check their reliability. Thus, the The study found valuable insights into the vulnerabil-
acquired representations may inherit biases and stereotypes ity of LLMs such as ChatGPT when subjected to malicious
present in large text corpora of language, thereby inheriting prompt injection. The identification of significant rates of
biases and stereotypes from pre-trained corpora of the inter- harmful reactions in various situations highlights the need
net [41]. Therefore, harmful biases such as gender, sexuality, for continuous research and development to improve safety
racial bias, and biases related to ethnic minorities and dis- and reliability; whereas, advanced adversarial training tech-
advantaged groups may arise [48]. LLMs often use human niques expose models to a wide range of adversarial inputs
feedback to fine-tune artificial intelligence assistants. How- and enhance their resilience [24]. Coincidentally, data poi-
ever, human feedback may also encourage models to gener- soning refers to adversaries disrupting the learning process
ate responses based on users’ expectations rather than real- by injecting malicious samples into the training data [51].
ity. This behavior is called flattery. Artificial intelligence At present, various defense measures have been proposed
assistants often mistakenly admit their mistakes, provide bi- for the threat model of data poisoning; however, each de-
ased feedback, and imitate user mistakes when questioned. fense measure has different shortcomings, such as being eas-
This suggests that flattery is a characteristic of these model ily overcome by adaptive attacks, seriously reducing test-
training methods [57]. Undoubtedly, this is a huge threat to ing performance, or being unable to be generalized to var-
LLMs. ious data poisoning threat models. Adversarial training and
Data selection bias is the systematic error resulting from its variants are currently judged to be the only empirically
the given selection of text used to train a language model. strong defense against (inference-time) adversarial attacks
This bias may occur during the sampling phase when text is [21]. Even so, throughout the training process of Wen et al.
recognized or when data is filtered and cleaned [48]. This
Table 2
Strategies for protecting data security. This table categorizes defense methods for LLM security into three main types: adversarial
training, RLHF, and data augmentation. For each approach, it lists the techniques used, tested models, benchmark datasets, and
evaluation metrics from relevant studies.
Category Work Method Evaluated Model Dataset Evaluation Metric
Projected gradient Acc, Rate of Harmful
[44] Resnet, MNIST, CIFAR10 MNIST, CIFAR10
descent Responses, etc
Offensive Language Detection,
[24] Automated Injection ChatGPT /
Promotion of Violence, etc
Adversarial machine
Adversarial Training [51] linear models Spambase, MNIST Classification Error
learning
binary classification
[21] Deep neural networks ResNet18 GTSRB, CIFAR-10 Acc
CIFAR-10, CIFAR-100,
[69] Adversarial training RESNET-18, RESNET-34, etc Acc
TINYIMAGENET
ResNet, DenoiseBlock,
[64] AutoAttack CIFAR-10, ImageNet, MNIST Robust accuracy
Madry’s PGD-trained ResNe, etc
Supervised learning,
[49] GPT-3 SFT, RM, PPO, human preference ratings
RL
Dense Direct Preference Object HalBench, MMHal-Bench,
RLHF [75] LLaVA, Muffin, LRV, etc RLAIF-V
Optimization etc
[14] Deep neural networks reward model Atari, MuJoCo reward
Wikidata-derived Probe Accuracy, Precision@K,
[7] Linear probe GPT-2, LLaMA-7B, GPT-J
factual triplese KL divergence
Counterfactual Data Augmentation,
BART, ChatGPT, FairFlowV2,
[63] Disentangling invertible, Bias-in-bios, ECHR, Jigsaw Acc, PPL, F1, FPRD, TPRD
Hall-M, Meta-llama
Interpretation network
Counterfactual Data Substitution, SSA, SimLex-999,
Data Augmentation [45] CBOW Error rate
Names Intervention Doc2Vec
Natural language
[74] BERT, SOTA TREC, AG’s News SEAT, Acc
processing
[69], the adversarial risks of clean data and toxic data con- 4.2. Reinforcement Learning From Human
firmed their claim that adversarial training faces difficulties Feedback
in optimizing toxic data because the speed of risk reduction When LLMs become larger and more complex, they may
is slower than in clean situations. Adversarial training also output incorrect and useless content to users, leading to hal-
solves the following saddle-point problem: lucinations. Nonetheless, reinforcement learning or fine-tuning
[ ] of the model through human feedback can solve or weaken
min 𝔼(𝑥,𝑦)∼𝔻 𝑚𝑎𝑥Δ∈𝑆 𝜃 (𝑥 + Δ, 𝑦) , (4) such phenomena [49]. Reinforcement learning from human
𝜃
feedback (RLHF) optimizes the model by combining human
where 𝜃 denotes the loss function of a model with param- feedback to make its output more in line with human expec-
eters 𝜃, and the adversary perturbs inputs x from a data dis- tations. With the intention of aligning with human prefer-
tribution 𝔻, subject to the constraint that perturbation Δ is ences, RLHF typically employs reinforcement learning al-
in S [21]. Geiping et al. [21] proposed a variant of adver- gorithms to optimize LLMs, generating outputs that max-
sarial training that uses adversarial poisoning data instead of imize the rewards provided by training preference models.
adversarial examples during testing, thereby modifying the Besides, integrating human feedback into the training cycle
training data to desensitize the neural network to the types of LLMs can enhance their consistency and guide them to
of perturbations caused by data poisoning. produce high-quality and harmless responses [25]. Based
However, despite its empirical effectiveness, adversarial on the fact that existing Multimodal LLMs commonly suffer
training suffers from several critical limitations. Adversar- from severe hallucinations and generate text that is not based
ial training often leads to decreased clean-data accuracy due on relevant content, Yu et al. [75] proposed RLHF-V to ad-
to the trade-off between robustness and generalization, es- dress this issue. In particular, RLHF-V collects human pref-
pecially under complex or high-dimensional input spaces. erences in the form of fragment-level hallucination correc-
Moreover, the computational overhead of generating adver- tion and performs intensive direct preference optimization
sarial examples during training is significant, making it less on human feedback. The comprehensive experiments have
feasible for large-scale LLMs. Tramer et al. [64] argue that shown that RLHF-V can greatly improve the credibility of
even adversarial trained models remain vulnerable to unseen LLMs in generating good data and computational efficiency.
threats, and that robustness may not transfer well across dif- Over the long term, learning tasks from human preferences
ferent threat models, highlighting the brittleness and high is no more difficult than learning tasks from programmatic
cost of this defense paradigm. reward signals, ensuring that powerful reinforcement learn-
ing systems can be applied to complex human values rather
than low complexity goals [14].
Fig. 5: The overall architecture of Mix-Debias. A three-stage framework combines counterfactual augmentation, semantic
expansion via PLMs, and mixup-based fine-tuning using 𝜆-weighted sentence embeddings to enhance model robustness [74].
Even with these promising advancements, RLHF still mance. As for the part of model training (fine-tuning) in the
encounters fundamental obstacles that merit attention. One entire method, the approach involves fine-tuning a BART
key issue is the potential mismatch between human intent model on the parallel data generated from previous steps.
and the behavior encouraged by imperfect reward models. The BART generator takes the original source text 𝑋 as in-
When the reward function fails to capture nuanced prefer- put and is trained to autoregressively generate the counter-
ences, models may generate superficially acceptable outputs factual text 𝑌 , using the corresponding counterfactual refer-
that bypass genuine alignment - a problem often described ences as supervision in a teacher-forcing manner. This ob-
as reward hacking. Moreover, the subjectivity and variabil- jective can be formulated as follows:
ity of human feedback introduce uncertainty and can em-
bed social biases into the model’s responses. As highlighted ∑
𝑘
by Perez et al. [7], RLHF-trained models may retain latent generator = − log 𝑃 (𝑦𝑡 ∣ 𝑌<𝑡 , 𝑋), (5)
𝑡=1
unsafe behaviors that remain hidden during routine evalua-
tions but emerge under adversarial or creative inputs. These where 𝑋 and 𝑌 represent the source and target texts, respec-
findings suggest that while RLHF brings models closer to tively. Here, 𝑦𝑡 ∈ 𝑌 denotes the 𝑡th token in the target text,
human-aligned outputs, it does not fully eliminate risks asso- and 𝑌<𝑡 refers to all tokens in 𝑌 preceding 𝑦𝑡 . Maudslay et
ciated with incomplete preference modeling or deeply rooted al. [45] made two improvements to CDA: one, Counterfac-
misalignment. tual Data Substitution (CDS), is a variant of CDA in which
potentially biased text is randomly replaced to avoid duplica-
4.3. Data Augmentation tion. The other, name intervention, can deal with the inher-
Data augmentation techniques mitigate or eliminate bias ent bias of names. Name intervention adopts a novel name-
by adding new examples to the training data. These ex- pairing strategy that takes into account both the frequency
amples increase the diversity and quantity of the training of the name and the gender specificity.
dataset, thereby expanding the distribution of underrepre- To remove the undesired stereotyped associations in mod-
sented or misrepresented social groups, which can then be els during fine-tuning, Yu et al. [74] proposed a mixture-
used for training [20]. This exposes the model to a wider based framework (Mix-Debias) from a new unified perspec-
and more balanced data distribution during training. tive, which directly combines the debiased models with fine-
Counterfactual Data Augmentation (CDA), one of the tuning applications. Mix-Debias applies CDA to obtain gender-
main techniques in data augmentation technology, aims to balanced correspondence of downstream task datasets. Then,
balance the demographic attributes in training data and has it further selects the most semantically meaningful sentences
been adopted widely to mitigate bias in NLP [63]. Con- from a rich additional corpus to expand the previously neu-
versely, due to the potential quality problems of this tech- tralized dataset. The overall architecture of Mix-Debias is
nology and the high cost of data collection, Tokpo & Calders illustrated in Fig. 5.
[63] proposed FairFlow, a method for automatically gener- While data augmentation and CDA-based approaches of-
ating parallel data for training counterfactual text generator fer practical and scalable solutions, they are not without short-
models that limit the need for human intervention. FairFlow comings. One pressing concern is the semantic integrity of
can significantly overcome the limitations of dictionary-based generated counterfactuals-modifications may introduce un-
word replacement methods while maintaining good perfor- intended meaning shifts or grammatical inconsistencies, par-
ticularly when applied to complex or nuanced language. Fur- making. English Gigaword [69], a comprehensive dataset
thermore, CDA methods often rely on demographic labels or for training and evaluating language models, highlights an-
templates, which may not fully capture the intersectionality other issue: the difficulty in developing defense methods that
or diversity of real-world identities. Research by Blodgett et can generalize well across various news categories and threat
al. [4] highlights that such simplifications risk reinforcing types. As we rely more and more on LLMs for real-world
normative assumptions about social groups and may lead to applications, ensuring their accuracy and reliability in these
overfitting on artificial patterns rather than true fairness im- contexts becomes ever more critical.
provements. As this is a final layer of defense, it becomes Social. Social datasets reveal crucial challenges surround-
especially important to recognize that debiasing at the data ing bias, fairness, and the ethical use of LLMs in sensitive ar-
level must be complemented by broader systemic considera- eas such as legal and healthcare contexts. The Sycophancy-
tions, including model architecture, evaluation metrics, and eval dataset [57] is used to evaluate sycophantic behavior in
ongoing feedback mechanisms. LLMs, a clear example of how the lack of control in free-
text generation can result in unethical behavior. WikiText-2
[57], with its Wikipedia articles, also highlights the issue of
5. Datasets biased content generation, as LLMs may perpetuate stereo-
In addition to addressing model vulnerabilities such as types or misinformation. Bias-in-Bios [20], focusing on gen-
bias, hallucination, and limited defense against novel threats, der bias in biographies, raises ethical concerns about how
critical also is the selection of appropriate datasets to eval- models trained on biased data can reinforce societal inequali-
uate the robustness and safety of LLMs under different ap- ties. Jigsaw [67], [63], [45], examining legal text deviations,
plication scenarios. In this section, datasets are categorized underscores the importance of fairness and accountability,
and reviewed based on their domains throughout Table 3. particularly in legal AI applications. ECHR [20], aimed at
Summarized are their characteristics, intended uses (attack detecting biases in online reviews, reflects a growing con-
or defense), and associated references. This overview will cern over how LLMs might exacerbate prejudices or unfair
assist researchers in selecting suitable datasets for studying treatment, making it essential to develop more transparent
data security risks and defense strategies in LLMs. and interpretable models.
Movie. Movie datasets are often used to evaluate vul- Book. BookCorpus, a collection of over 11,000 books
nerabilities in LLMs, especially concerning sentiment anal- [57], [74], serves as a crucial resource for training large lan-
ysis. The SST-2 dataset [34], [39], [57], [67], [73] contains guage models. However, its complexity presents challenges
11,855 sentences from movie reviews, each labeled as posi- in handling adversarial attacks, where subtle manipulations
tive or negative. The simplicity of this dataset makes it a fre- may lead to the generation of inaccurate or biased content.
quent target for attack experiments, which aim to inject back- The vastness and diversity of the dataset increase the diffi-
doors and assess the trustworthiness of a model. Similarly, culty of maintaining context and factual accuracy in gener-
IMDb [34], [67], with 50,000 reviews, provides a larger and ated outputs. As a result, models trained on such large-scale
more balanced set, often used to evaluate adversarial robust- datasets may struggle with hallucinations, creating informa-
ness. However, one of the challenges with movie datasets, tion that does not exist. The need for transparency in these
like OpenSubtitles [57], which includes dialogues, is that models becomes more apparent as understanding why cer-
the informal and diverse language structures introduce com- tain content is generated is often difficult, leading to issues
plexities when detecting adversarial manipulations. Rotten of trust and accountability in real-world applications [2].
Tomatoes [26], which focuses on emotional labels, brings Study. The AQuA dataset [76], used to evaluate arith-
forth concerns about hallucination risks, where a model might metic problem-solving, highlights the challenges in ensuring
generate incorrect or fabricated sentiments. The potential for precise reasoning in LLMs. Although it serves as a good
biased or harmful outputs due to these vulnerabilities can benchmark for evaluating basic computational tasks, it ex-
compromise the reliability and credibility of LLMs, thereby poses limitations in the abilities of models to generalize across
emphasizing the importance of robust defense strategies. diverse problem types, especially when faced with adver-
News. News datasets are indispensable for understand- sarial perturbations. Such weaknesses in defense mecha-
ing the vulnerabilities and biases in LLMs, as they often nisms become particularly concerning in high-stakes appli-
serve as testing grounds for adversarial attacks and defense cations where errors in calculations can have significant con-
mechanisms. AG News [34], [39], [67], [73] consists of sequences. These challenges underscore the broader prob-
120,000 news articles categorized into World, Sports, Busi- lem in the field: the need for more flexible, adaptive defense
ness, and Science. This variety makes it an ideal dataset methods that can effectively handle novel threats and ensure
for evaluating both attack models and the robustness of de- the reliability and transparency of models in practical set-
fenses. However, recent research points to the limitations tings [60], [77].
of current defense strategies, as many are ineffective against Research. In research fields outside of NLP, datasets
new types of adversarial inputs. The Financial Dataset [26], such as MNIST [44], [69], [62] and CIFAR-10 [44], [21],
with its focus on financial texts, presents unique challenges [69] are frequently employed to evaluate defense strategies,
in domain-specific adversarial attacks, where subtle manip- particularly in computer vision tasks. These datasets offer
ulations can cause significant errors in financial decision- valuable insight into the generalization and robustness of
Table 3
Dataset overview.
Scenario Dataset Description Purpose Reference
SST-21 SST-2 contains 11,855 sentences from movie reviews, Attack [34], [39], [57], [67], [73]
each labeled as positive or negative for sentiment
Movie analysis tasks.
IMDb2 Comprises 50,000 movie reviews labeled as positive or Attack [34], [67]
negative, equally split into training and testing sets.
OpenSubtitles3 Dialogue dataset containing subtitles for movies and Attack [57]
TV shows.
Rotten Tomatoes4 Contains movie reviews and their corresponding emo- Attack [26]
tional labels (positive or negative).
AG News5 AG News contains 120,000 news articles across four Attack [34], [39], [67], [73]
News categories: World, Sports, Business, and Science.
Financial Contains financial text data such as stock market Defense [26]
analysis and financial reports for news analysis.
English Gigaword Large English news text dataset for training and eval- Defense [69]
uating language models.
Sycophancy-eval6 Dataset to evaluate sycophancy behavior in language Attack [57]
models across free-text generation tasks.
Social WikiText-2 Dataset containing Wikipedia articles for text mod- Attack [57]
eling.
Bias-in-Bios7 Approximately 400,000 biographies used to examine Defense [20]
gender bias in occupational classification.
Jigsaw8 Dataset of cases published by the European Court of Defense [67], [63], [45]
Human Rights for analyzing legal text deviations.
ECHR9 Dataset by Jigsaw containing online review data for Defense [20]
bias detection research.
Book BookCorpus A text dataset containing more than 11,000 books. Attack [57], [74]
Study AQuA10 Evaluation dataset focusing on arithmetic problem Attack [76]
solving.
MNIST11 Grayscale images of handwritten digits, mainly used Defense [44], [69], [62]
for handwritten digit recognition research.
Research
CIFAR-1012,13 32x32 color images across 10 categories, used in im- Defense [44], [21], [69]
age classification model studies.
ImageNet Over 14 million annotated images covering 20,000+ Defense [44]
categories for large-scale image classification and
computer vision research.
TREC Used to evaluate information retrieval systems and Defense [74]
promote retrieval technology development.
1 https://github.com/neulab/RIPPLe
2 https://github.com/alexwan0/poisoning-instruction-tuned-models
3 https://opus.nlpl.eu/OpenSubtitles/corpus/version/OpenSubtitles
4 https://github.com/BHui97/PLeak
5 https://github.com/wegodev2/virtual-prompt-injection
6 https://huggingface.co/datasets/meg-tong/sycophancy-eval
7 https://github.com/i-gallegos/Fair-LLM-Benchmark
8 https://github.com/rowanhm/counterfactual-data-substitution
9 https://github.com/WenRuiUSTC/EntF
10 https://worksheets.codalab.org/worksheets/0xbdd35bdd83b14f6287b24c9418983617/
11 https://github.com/MadryLab/mnist_challenge
12 https://www.cs.toronto.edu/~kriz/cifar.html
13 https://github.com/MadryLab/cifar10_challenge
models. ImageNet [44], with over fourteen million anno- ports research into the development of robust retrieval tech-
tated images, is one of the largest collections used to assess nologies.
defense strategies against adversarial attacks. The TREC
dataset [74] evaluates information retrieval systems and sup-
6. Future Directions Secure meta-data capture and verifiable audit trails will help
to both attribute harmful model behaviors and facilitate re-
6.1. Robust Adversarial Defense Mechanisms sponsible content sourcing.
LLMs are vulnerable to adversarial attacks that manipu-
late inputs to trigger undesirable outputs. These threats ex- 6.3. Continual Learning for Secure Model
ploit weaknesses in the decision-making process of a model, Updates
which can be particularly damaging in high-stakes applica- LLMs are incrementally updated with new data; there-
tions like dialogue systems and machine translation. As LLMs fore, there must be developed research mechanisms to ensure
are deployed in increasingly sensitive contexts, it is crucial that each update cannot be exploited to inject backdoors or
to develop robust defense strategies to mitigate such threats. leak previously covered private information. Tracking cu-
Therefore, we should focus on a range of advanced defen- mulative privacy loss over multiple fine-tuning rounds will
sive techniques, such as adversarial training [54] and certi- be essential.
fied robustness methods [37], all of which show promise in Future work should investigate privacy-preserving con-
improving the resilience of LLMs against adversarial manip- tinual learning frameworks that enable secure knowledge ac-
ulation. For example, Adversarial Contrastive Learning [30] quisition over time without exposing prior training data. In
improves the ability of a model to distinguish between se- continual learning settings, differentially private continual
mantically similar and dissimilar inputs while remaining ro- learning provides a foundational framework that maintains
bust to adversarial perturbations. This method can strengthen performance across sequential tasks while reducing risks of
LLMs by teaching them to generate more stable representa- unintended knowledge interference, laying the groundwork
tions of input sequences, making them less sensitive to ad- for safer long-term model adaptation [19]. This is especially
versarial perturbations. important as models interact with sensitive user inputs over
Furthermore, to ensure these techniques are effective, it time. While privacy concerns have been extensively dis-
is vital to develop specific benchmarks for evaluating the ad- cussed, data security risks - such as malicious prompt in-
versarial robustness of LLMs. This could include the devel- jection or the persistence of toxic content - remain under-
opment of a standardized adversarial attack library, as well addressed. LLMs can memorize and reproduce portions of
as guidelines for evaluating the trade-offs between model their training data, which may include toxic or policy-violating
performance and adversarial robustness [78]. content [11]. Meta-learning based continual learning ap-
proaches have been proposed to dynamically adjust model
6.2. Data Provenance and Traceability parameters during incremental updates, thereby improving
The data sources of LLMs are extensive, involving mul- resistance to adversarial attacks and reducing the risk of harm-
tiple stages and participants. Ensuring the security of the en- ful behavior in LLMs [40].
tire supply chain, from data collection, storage to transmis- In spite of this, there is still a significant challenge to en-
sion and use, is crucial. It is necessary to establish data sup- sure that such harmful data does not degrade model behavior
ply chain security standards and certification systems, con- or introduce vulnerabilities over successive training rounds.
duct strict reviews of data suppliers, prevent malicious data The need for robust data curation processes, ongoing data
injection or leakage, and ensure the integrity and availabil- sanitization, and rigorous security checks during model up-
ity of data. Apart from security issues, ensuring data prove- dates is necessary.
nance and traceability throughout the data pipeline is essen-
tial for model transparency and accountability. Recent work 6.4. Explainability-Driven Security Analysis
emphasizes that establishing machine-actionable provenance Leverage interpretability tools (attention-flow analysis,
records helps build explainable and trustworthy AI systems saliency methods, concept activation vectors) are not just
by providing an auditable trail of how data influence model for model transparency but also for active defenses - e.g.,
behavior [31]. detecting anomalous rationale patterns that signal poison-
In addition, traceability models and tools have been sys- ing, or flagging content that unduly reflects single training
tematically reviewed as foundational components for ensur- instances. It is crucial to focus on advancing these inter-
ing the trustworthiness and reproducibility of AI systems, pretability technologies in future research to create robust
particularly under complex and heterogeneous data environ- frameworks, which can enable real-time security monitoring
ments as seen in LLM development [47]. Building on this, of large language models during incremental updates. For
a comprehensive auditing framework has been proposed to instance, attention visualization methods have demonstrated
close the AI accountability gap, highlighting the need to trace potential in revealing unusual focus distributions that may
not only data inputs but also decision-making processes and indicate adversarial manipulation [65]. Saliency methods
model iterations across the entire development [56]. We highlight influential input features, facilitating the discov-
should design systematic frameworks for tracking the ori- ery of suspicious outputs influenced by memorized or ma-
gin, curation steps, and transformation history of every data- licious training data [59]. Additionally, concept activation
point used in LLM training. Existing studies have proposed vectors provide a quantitative measure of the influence of
a data management framework for responsible artificial in- human-understandable concepts on model decisions, which
telligence, emphasizing the core role of data traceability in could be instrumental in identifying spurious correlations
enhancing the transparency and compliance of models [70].
or backdoor triggers embedded during training [32]. Inte- scenarios. Lastly, we identified practical challenges, such as
grating these tools into continual learning pipelines offers a the scalability of secure data curation, model update safety,
promising direction to enhance the security and trustworthi- and benchmark limitations. We then proposed future re-
ness of LLMs as they evolve. search directions, including continual security verification,
explainability-driven threat analysis, and governance frame-
6.5. Ethical and Regulatory Frameworks for LLM works for secure LLM development and deployment.
Data Governance
Because LLMs handle sensitive data globally, interdisci-
plinary efforts must define auditing standards, data sovereignty Acknowledgments
protocols, and liability frameworks. Collaboration with pol- This research was supported in part by the National Nat-
icymakers will ensure alignment with evolving regulations. ural Science Foundation of China [No. 61977055], and in
Recent work has emphasized how global-scale models de- part by the Startup Fund of Jimei University [No. ZQ2024014].
mand policy-aware oversight and formal responsibility allo- The authors would like to thank Michael McAllister for
cation, especially when their decisions affect end users in proofreading this paper.
high-stakes contexts [5].
Moreover, bridge technical advances with policy are as
follows: propose data-security standards and certifications
References
for “safety-compliant” LLMs, inform privacy regulation (e.g., [1] Agarwal, D., Fabbri, A.R., Laban, P., Joty, S., Xiong, C., Wu, C.S.,
2024. Investigating the prompt leakage effect and black-box defenses
GDPR, CCPA) with concrete measurement methodologies, for multi-turn llm interactions. arXiv e-prints , arXiv–2404.
and develop governance models that enable redress when [2] Bender, E.M., Gebru, T., McMillan-Major, A., Shmitchell, S., 2021.
models inadvertently expose or misuse personal data. For On the dangers of stochastic parrots: Can language models be too
such frameworks to become operational, future work should big?, in: Proceedings of the 2021 ACM conference on fairness, ac-
explore how system-level governance mechanisms can be countability, and transparency, pp. 610–623.
[3] Biderman, S., Schoelkopf, H., Anthony, Q.G., Bradley, H., O’Brien,
embedded directly into the LLM development pipeline. An K., Hallahan, E., Khan, M.A., Purohit, S., Prashanth, U.S., Raff, E.,
end-to-end internal algorithmic auditing framework - such et al., 2023. Pythia: A suite for analyzing large language models
as the one in the context of deployed AI systems - can in- across training and scaling, in: International Conference on Machine
spire LLM-specific protocols that incorporate documenta- Learning, PMLR. pp. 2397–2430.
tion, oversight checkpoints, and accountability mapping through- [4] Blodgett, S.L., Barocas, S., Daumé III, H., Wallach, H., 2020. Lan-
guage (technology) is power: A critical survey of" bias" in nlp. arXiv
out the model lifecycle [56]. A further challenge is enabling preprint arXiv:2005.14050 .
user redress in cases where models inadvertently expose or [5] Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., von
misuse sensitive training data. To this end, governance mod- Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., et al.,
els must incorporate mechanisms such as fine-grained data 2021. On the opportunities and risks of foundation models. arXiv
lineage tracking and post-hoc auditing of generation behav- preprint arXiv:2108.07258 .
[6] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhari-
ior. Embedding these governance principles into the train- wal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.,
ing lifecycle itself, as suggested in recent work on the ethi- 2020. Language models are few-shot learners. Advances in neural
cal risks of LLM deployment, may also enhance institutional information processing systems 33, 1877–1901.
trust and regulatory compliance [68]. [7] Burns, C., Ye, H., Klein, D., Steinhardt, J., 2024. Discovering latent
knowledge in language models without supervision. URL: https://
arxiv.org/abs/2212.03827, arXiv:2212.03827.
7. Conclusion [8] Cao, Z., Yang, Y., Zhao, H., 2023. Autohall: Automated halluci-
nation dataset generation for large language models. arXiv preprint
In this survey, we explored the critical issues surround- arXiv:2310.00259 .
ing data security risks in LLMs. Because these models are [9] Carlini, N., Hayes, J., Nasr, M., Jagielski, M., Sehwag, V., Tramer,
increasingly deployed across a wide range of real-world ap- F., Balle, B., Ippolito, D., Wallace, E., 2023. Extracting training
data from diffusion models, in: 32nd USENIX Security Symposium
plications, ensuring the integrity and safety of the data they
(USENIX Security 23), pp. 5253–5270.
consume during training and inference has become a press- [10] Carlini, N., Liu, C., Erlingsson, Ú., Kos, J., Song, D., 2019. The se-
ing concern. We first discussed five major types of data se- cret sharer: Evaluating and testing unintended memorization in neural
curity risks - such as data poisoning, prompt injection, hallu- networks, in: 28th USENIX security symposium (USENIX security
cination, prompt leakage, and bias - that may lead to harm- 19), pp. 267–284.
[11] Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A.,
ful or manipulated outputs. We then reviewed several de-
Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., et al.,
fense strategies, including adversarial training, RLHF, data 2021. Extracting training data from large language models, in: 30th
augmentation, which can mitigate such threats by improv- USENIX security symposium (USENIX Security 21), pp. 2633–
ing model robustness and trustworthiness. In addition, we 2650.
presented a comparative analysis of existing datasets, cate- [12] Chen, C., He, X., Lyu, L., Wu, F., 2021. Killing one bird with two
stones: Model extraction and attribute inference attacks against bert-
gorized by domain, use cases (attack or defense), and key
based apis. arXiv preprint arXiv:2105.10909 .
characteristics. The aim of this systematic overview is to [13] Chen, D., Hong, W., Zhou, X., 2022. Transformer network for re-
assist researchers in selecting appropriate datasets for evalu- maining useful life prediction of lithium-ion batteries. IEEE Access
ating LLM robustness and safety across different application 10, 19621–19628.
[14] Christiano, P.F., Leike, J., Brown, T., Martic, M., Legg, S., Amodei, [33] Kshetri, N., 2023. Cybercrime and privacy threats of large language
D., 2017. Deep reinforcement learning from human preferences. Ad- models. IT Professional 25, 9–13.
vances in neural information processing systems 30. [34] Kurita, K., Michel, P., Neubig, G., 2020. Weight poisoning attacks
[15] Das, B.C., Amini, M.H., Wu, Y., 2025. Security and privacy chal- on pre-trained models. arXiv preprint arXiv:2004.06660 .
lenges of large language models: A survey. ACM Computing Surveys [35] Li, B., Qi, P., Liu, B., Di, S., Liu, J., Pei, J., Yi, J., Zhou, B., 2023a.
57, 1–39. Trustworthy ai: From principles to practices. ACM Computing Sur-
[16] Ding, Y., Jia, S., Ma, T., Mao, B., Zhou, X., Li, L., Han, D., veys 55, 1–46.
2023. Integrating stock features and global information via large lan- [36] Li, J., Wang, B., Zhou, X., Jiang, P., Liu, J., Hu, X., 2025. De-
guage models for enhanced stock return prediction. arXiv preprint coding knowledge attribution in mixture-of-experts: A framework of
arXiv:2310.05627 . basic-refinement collaboration and efficiency analysis. arXiv preprint
[17] Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, G., Groen- arXiv:2505.24593 .
eveld, D., Mitchell, M., Gardner, M., 2021. Documenting large web- [37] Li, L., Xie, T., Li, B., 2023b. Sok: Certified robustness for deep
text corpora: A case study on the colossal clean crawled corpus. arXiv neural networks, in: 2023 IEEE symposium on security and privacy
preprint arXiv:2104.08758 . (SP), IEEE. pp. 1289–1310.
[18] Duan, H., Dziedzic, A., Papernot, N., Boenisch, F., 2023. Flocks of [38] Li, Y., Jiang, Y., Li, Z., Xia, S.T., 2022. Backdoor learning: A survey.
stochastic parrots: Differentially private prompt learning for large lan- IEEE transactions on neural networks and learning systems 35, 5–22.
guage models. Advances in Neural Information Processing Systems [39] Li, Y., Li, T., Chen, K., Zhang, J., Liu, S., Wang, W., Zhang, T., Liu,
36, 76852–76871. Y., 2024. Badedit: Backdooring large language models by model
[19] Farquhar, S., Gal, Y., 2019. Differentially private continual learning. editing. arXiv preprint arXiv:2403.13355 .
arXiv preprint arXiv:1902.06497 . [40] Li, Z., Hoiem, D., 2017. Learning without forgetting. IEEE transac-
[20] Gallegos, I.O., Rossi, R.A., Barrow, J., Tanjim, M.M., Kim, S., Der- tions on pattern analysis and machine intelligence 40, 2935–2947.
noncourt, F., Yu, T., Zhang, R., Ahmed, N.K., 2024. Bias and fairness [41] Liang, P.P., Wu, C., Morency, L.P., Salakhutdinov, R., 2021. Towards
in large language models: A survey. Computational Linguistics , 1– understanding and mitigating social biases in language models, in: In-
79. ternational Conference on Machine Learning, PMLR. pp. 6565–6576.
[21] Geiping, J., Fowl, L., Somepalli, G., Goldblum, M., Moeller, M., [42] Liu, Y., Deng, G., Li, Y., Wang, K., Wang, Z., Wang, X., Zhang, T.,
Goldstein, T., 2021. What doesn’t kill you makes you robust (er): Liu, Y., Wang, H., Zheng, Y., et al., 2023. Prompt injection attack
How to adversarially train against data poisoning. arXiv preprint against llm-integrated applications. arXiv preprint arXiv:2306.05499
arXiv:2102.13624 . .
[22] Goyal, S., Doddapaneni, S., Khapra, M.M., Ravindran, B., 2023. A [43] Liu, Y., Yi, Z., Chen, T., 2020. Backdoor attacks and de-
survey of adversarial defenses and robustness in nlp. ACM Comput- fenses in feature-partitioned collaborative learning. arXiv preprint
ing Surveys 55, 1–39. arXiv:2007.03608 .
[23] Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., Fritz, [44] Mądry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A., 2017.
M., 2023. Not what you’ve signed up for: Compromising real-world Towards deep learning models resistant to adversarial attacks. stat
llm-integrated applications with indirect prompt injection, in: Pro- 1050.
ceedings of the 16th ACM Workshop on Artificial Intelligence and [45] Maudslay, R.H., Gonen, H., Cotterell, R., Teufel, S., 2019. It’s all
Security, pp. 79–90. in the name: Mitigating gender bias with name-based counterfactual
[24] Han, J., Guo, M., 2024. An evaluation of the safety of chatgpt data substitution. arXiv preprint arXiv:1909.00871 .
with malicious prompt injection. URL: https://www.researchsquare. [46] McKenzie, I.R., Lyzhov, A., Pieler, M., Parrish, A., Mueller, A.,
com/article/rs-4487194/v1, doi:10.21203/rs.3.rs-4487194/v1, Prabhu, A., McLean, E., Kirtland, A., Ross, A., Liu, A., et al.,
arXiv:rs-4487194. 2023. Inverse scaling: When bigger isn’t better. arXiv preprint
[25] Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., arXiv:2306.09479 .
Peng, W., Feng, X., Qin, B., et al., 2025. A survey on hallucination in [47] Mora-Cantallops, M., Sánchez-Alonso, S., García-Barriocanal, E.,
large language models: Principles, taxonomy, challenges, and open Sicilia, M.A., 2021. Traceability for trustworthy ai: A review of mod-
questions. ACM Transactions on Information Systems 43, 1–55. els and tools. Big Data and Cognitive Computing 5, 20.
[26] Hui, B., Yuan, H., Gong, N., Burlina, P., Cao, Y., 2024. Pleak: Prompt [48] Navigli, R., Conia, S., Ross, B., 2023. Biases in large language mod-
leaking attacks against large language model applications, in: Pro- els: origins, inventory, and discussion. ACM Journal of Data and
ceedings of the 2024 on ACM SIGSAC Conference on Computer and Information Quality 15, 1–21.
Communications Security, pp. 3600–3614. [49] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin,
[27] Jagielski, M., Oprea, A., Biggio, B., Liu, C., Nita-Rotaru, C., Li, B., P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al., 2022. Training
2018. Manipulating machine learning: Poisoning attacks and coun- language models to follow instructions with human feedback. Ad-
termeasures for regression learning, in: 2018 IEEE symposium on vances in neural information processing systems 35, 27730–27744.
security and privacy (SP), IEEE. pp. 19–35. [50] Pan, Y., Pan, L., Chen, W., Nakov, P., Kan, M.Y., Wang, W.Y., 2023.
[28] Jiang, C., Qi, B., Hong, X., Fu, D., Cheng, Y., Meng, F., Yu, M., On the risk of misinformation pollution with large language models.
Zhou, B., Zhou, J., 2024a. On large language models’ hallucination arXiv preprint arXiv:2305.13661 .
with regard to known facts. arXiv preprint arXiv:2403.20009 . [51] Paudice, A., Muñoz-González, L., Gyorgy, A., Lupu, E.C., 2018. De-
[29] Jiang, S., Kadhe, S.R., Zhou, Y., Ahmed, F., Cai, L., Baracaldo, N., tection of adversarial training examples in poisoning attacks through
2024b. Turning generative models degenerate: The power of data anomaly detection. arXiv preprint arXiv:1802.03041 .
poisoning attacks. arXiv preprint arXiv:2407.12281 . [52] Perez, F., Ribeiro, I., 2022. Ignore previous prompt: Attack tech-
[30] Jiang, Z., Chen, T., Chen, T., Wang, Z., 2020. Robust pre-training niques for language models. arXiv preprint arXiv:2211.09527 .
by adversarial contrastive learning. Advances in neural information [53] Plant, R., Giuffrida, V., Gkatzia, D., 2022. You are what you write:
processing systems 33, 16199–16210. Preserving privacy in the era of large language models. arXiv preprint
[31] Kale, A., Nguyen, T., Harris Jr, F.C., Li, C., Zhang, J., Ma, X., 2023. arXiv:2204.09391 .
Provenance documentation to enable explainable and trustworthy ai: [54] Qian, Z., Huang, K., Wang, Q.F., Zhang, X.Y., 2022. A survey of ro-
A literature review. Data Intelligence 5, 139–162. bust adversarial training in pattern recognition: Fundamental, theory,
[32] Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., and methodologies. Pattern Recognition 131, 108889.
et al., 2018. Interpretability beyond feature attribution: Quantitative [55] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M.,
testing with concept activation vectors (tcav), in: International con- Zhou, Y., Li, W., Liu, P.J., 2020. Exploring the limits of transfer
ference on machine learning, PMLR. pp. 2668–2677. learning with a unified text-to-text transformer. Journal of machine
learning research 21, 1–67. Is adversarial training really a silver bullet for mitigating data poi-
[56] Raji, I.D., Smart, A., White, R.N., Mitchell, M., Gebru, T., Hutchin- soning?, in: Proceedings of the International Conference on Learning
son, B., Smith-Loud, J., Theron, D., Barnes, P., 2020. Closing the Representations. URL: https://openreview.net/forum?id=zKvm1ETDOq.
ai accountability gap: Defining an end-to-end framework for inter- [70] Werder, K., Ramesh, B., Zhang, R., 2022. Establishing data prove-
nal algorithmic auditing, in: Proceedings of the 2020 conference on nance for responsible artificial intelligence systems. ACM Transac-
fairness, accountability, and transparency, pp. 33–44. tions on Management Information Systems (TMIS) 13, 1–23.
[57] Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bow- [71] Xu, Z., Jain, S., Kankanhalli, M., 2024. Hallucination is inevitable:
man, S.R., Cheng, N., Durmus, E., Hatfield-Dodds, Z., Johnston, An innate limitation of large language models. arXiv preprint
S.R., et al., 2023. Towards understanding sycophancy in language arXiv:2401.11817 .
models. arXiv preprint arXiv:2310.13548 . [72] Yan, B., Li, K., Xu, M., Dong, Y., Zhang, Y., Ren, Z., Cheng, X.,
[58] Sheng, E., Chang, K.W., Natarajan, P., Peng, N., 2021. Societal bi- 2024a. On protecting the data privacy of large language models
ases in language generation: Progress and challenges. arXiv preprint (llms): A survey. arXiv preprint arXiv:2403.05156 .
arXiv:2105.04054 . [73] Yan, J., Yadav, V., Li, S., Chen, L., Tang, Z., Wang, H., Srinivasan,
[59] Simonyan, K., Vedaldi, A., Zisserman, A., 2013. Deep inside con- V., Ren, X., Jin, H., 2024b. Backdooring instruction-tuned large lan-
volutional networks: Visualising image classification models and guage models with virtual prompt injection, in: Proceedings of the
saliency maps. arXiv preprint arXiv:1312.6034 . 2024 Conference of the North American Chapter of the Association
[60] Smiley, C., Schilder, F., Plachouras, V., Leidner, J.L., 2017. Say the for Computational Linguistics: Human Language Technologies (Vol-
right thing right: Ethics issues in natural language generation systems, ume 1: Long Papers), pp. 6065–6086.
in: Proceedings of the First ACL Workshop on Ethics in Natural Lan- [74] Yu, L., Mao, Y., Wu, J., Zhou, F., 2023. Mixup-based unified frame-
guage Processing, pp. 103–108. work to overcome gender bias resurgence, in: Proceedings of the 46th
[61] Staab, R., Vero, M., Balunović, M., Vechev, M., 2023. Beyond memo- International ACM SIGIR Conference on Research and Development
rization: Violating privacy via inference with large language models. in Information Retrieval, pp. 1755–1759.
arXiv preprint arXiv:2310.07298 . [75] Yu, T., Yao, Y., Zhang, H., He, T., Han, Y., Cui, G., Hu, J., Liu,
[62] Steinhardt, J., Koh, P.W.W., Liang, P.S., 2017. Certified defenses for Z., Zheng, H.T., Sun, M., et al., 2024. Rlhf-v: Towards trustworthy
data poisoning attacks. Advances in neural information processing mllms via behavior alignment from fine-grained correctional human
systems 30. feedback, in: Proceedings of the IEEE/CVF Conference on Computer
[63] Tokpo, E.K., Calders, T., 2024. Fairflow: An automated approach to Vision and Pattern Recognition, pp. 13807–13816.
model-based counterfactual data augmentation for nlp, in: Joint Eu- [76] Zhang, C., Jin, M., Yu, Q., Liu, C., Xue, H., Jin, X., 2024a. Goal-
ropean Conference on Machine Learning and Knowledge Discovery guided generative prompt injection attack on large language models.
in Databases, Springer. pp. 160–176. arXiv preprint arXiv:2404.07234 .
[64] Tramer, F., Carlini, N., Brendel, W., Madry, A., 2020. On adaptive [77] Zhang, C., Zhou, X., Wan, Y., Zheng, X., Chang, K.W., Hsieh, C.J.,
attacks to adversarial example defenses. Advances in neural informa- 2022. Improving the adversarial robustness of nlp models by infor-
tion processing systems 33, 1633–1645. mation bottleneck. arXiv preprint arXiv:2206.05511 .
[65] Vig, J., 2019. A multiscale visualization of attention in the trans- [78] Zhang, H., Yu, Y., Jiao, J., Xing, E., El Ghaoui, L., Jordan, M., 2019.
former model. arXiv preprint arXiv:1906.05714 . Theoretically principled trade-off between robustness and accuracy,
[66] Wallace, E., Zhao, T.Z., Feng, S., Singh, S., 2020. Concealed data in: International conference on machine learning, PMLR. pp. 7472–
poisoning attacks on nlp models. arXiv preprint arXiv:2010.12563 . 7482.
[67] Wan, A., Wallace, E., Shen, S., Klein, D., 2023. Poisoning lan- [79] Zhang, W.E., Sheng, Q.Z., Alhazmi, A., Li, C., 2020. Adversarial
guage models during instruction tuning, in: International Conference attacks on deep-learning models in natural language processing: A
on Machine Learning, PMLR. pp. 35413–35425. survey. ACM Transactions on Intelligent Systems and Technology
[68] Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, (TIST) 11, 1–41.
P.S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., et al., 2021. [80] Zhang, Y., Li, Y., Cui, L., Cai, D., Liu, L., Fu, T., Huang, X., Zhao,
Ethical and social risks of harm from language models. arXiv preprint E., Zhang, Y., Chen, Y., et al., 2024b. Siren’s song in the ai ocean:
arXiv:2112.04359 . A survey on hallucination in large language models, 2023. URL
[69] Wen, R., Zhao, Z., Liu, Z., Backes, M., Wang, T., Zhang, Y., 2023. https://arxiv. org/abs/2309.01219 .