ABSTRACT Retrieval-Augmented Generation (RAG) is a promising solution that can enhance the capabilities of large language model (LLM) applications in critical domains, including legal technology, by retrieving knowledge from external databases. Implementing RAG pipelines requires careful attention to the techniques and methods applied in the different stages of the RAG process; however, a robust RAG pipeline can improve the faithfulness of LLM generation and reduce hallucinations in responses. In this paper, we discuss the application of RAG in the legal domain. First, we present an overview of the main RAG methods, stages, techniques, and applications in the legal domain. We then briefly discuss the different information retrieval models, processes, and methods applied in current legal RAG solutions. Next, we explain the different quantitative and qualitative evaluation metrics, and we describe several emerging datasets and benchmarks. We then discuss and assess the ethical and privacy considerations for legal RAG, summarize various challenges, and propose a challenge scale based on RAG failure points and control over external knowledge. Finally, we provide insights into promising future research to leverage RAG efficiently and effectively in the legal field.
INDEX TERMS Information retrieval, large language model (LLM), legal technology, prompt engineering,
retrieval-augmented generation (RAG).
TABLE 1. List of acronyms provided in the paper.
RAG was proposed to enhance generators to achieve less hallucination and offer more interpretability and control [14]. Different studies have proved that RAG outperforms fine-tuning both for knowledge encountered during training and for entirely new knowledge [15], [16]. Thus, since RAG was first introduced in 2020 by Lewis et al. [14], different RAG systems have been rapidly developed for various domains, including the legal domain. The most powerful feature of RAG is its ability to rapidly incorporate recent or specific external knowledge and dynamically retrieve relevant information from external sources during the generation process [17]. Numerous legal applications have demonstrated that a combination of RAG and fine-tuning methods performs well [18], [19], [20], [21].
In 2024, more than 20 legal RAG pipelines were implemented using various embedding, retrieval, enhancement, and generation methods. These pipelines have frequently been integrated with other approaches, e.g., prompt engineering, which is essential for all RAG pipelines, knowledge graphs (KG) [22], and fine-tuning (FT) [18], [19], [20], [21], [23], [24], or embedded within multi-agent frameworks [25]. In addition, legal RAG pipelines span various applications across the legal domain, ranging from specialized systems focused on specific legal fields to more comprehensive legal platforms.
A. CONTRIBUTION AND ORGANIZATION
Given the absence of a comprehensive literature survey on RAG systems within the legal domain, this paper attempts to bridge this gap by providing a detailed overview of all currently available RAG methods relevant to the legal domain. The primary contributions of this paper are summarized as follows.
• We highlight various techniques utilized in RAG methods specific to the legal field.
• We examine how these methods contribute to improvements in accuracy and interpretability.
• Our analysis of RAG methods and techniques in the legal field offers insights into various applications, methodologies, evaluations, datasets, and benchmarks.
• We extensively outline and describe some open challenges for RAG application in the legal domain and provide deep insights into promising future research directions in RAG applications.
The findings of this work are expected to guide legal tech researchers who aim to use cutting-edge technology to optimize LLM-driven legal applications and practices for various tasks. In addition, this study will serve as a contemporary reference for RAG methods in the legal field.
The remainder of this paper is organized as follows:
Section II provides an overview of RAG methods, RAG
main stages, and techniques as well as a classification of
legal RAG methods, applications, and datasets. Section III
explores advanced methods that improve retrieval accuracy
in legal RAG systems, addressing the unique needs of legal information retrieval tasks. Section IV explains the quantitative and qualitative metrics used to analyze retrievers and generators in RAG systems, and Section V describes relevant emerging datasets and benchmarks. Section VI evaluates ethical and privacy considerations in legal RAG systems. Section VII focuses on the main challenges of legal RAG systems. Section VIII provides insights into promising research directions. Finally, the paper is concluded in Section IX.
II. RAG IN THE LEGAL DOMAIN
A. OVERVIEW OF RAG
RAG comprises three key processes, i.e., the information retrieval (IR), augmentation, and generation processes. Many processes and techniques are involved before and after the IR process is performed to enhance the process and its outcomes. The IR process is a critical element in the RAG framework. The generator will likely produce poor outcomes if the retrieved data are inaccurate or inconsistent with the query. A powerful IR method can outperform the performance of a combined IR-LLM [23]; thus, the IR process is the backbone of the entire RAG pipeline [27]. IR techniques have improved over decades from initial traditional sparse IR techniques [28], e.g., BM25 [29] and TF-IDF [30], to more advanced dense Transformer-based embedding models, e.g., DPR [31]. Transformer-based retrieval methods outperform nonneural methods, specifically for legal document retrieval tasks [32]. In addition, KG embeddings have been integrated with IR embeddings to optimize the IR process [33].
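To make the sparse-versus-dense distinction concrete, the following minimal sketch scores a toy legal corpus with both BM25 (via the rank_bm25 package) and a dense Transformer bi-encoder (via sentence-transformers). The corpus, query, and model name are illustrative assumptions, not artifacts of any surveyed system.

```python
# Hypothetical comparison of sparse (BM25) and dense retrieval on a toy corpus.
# Assumes `pip install rank_bm25 sentence-transformers`; model choice is illustrative.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

corpus = [
    "The tenant must give thirty days written notice before terminating the lease.",
    "A contract requires offer, acceptance, and consideration to be enforceable.",
    "Precedent from appellate courts binds lower courts within the jurisdiction.",
]
query = "What notice period applies before ending a rental agreement?"

# Sparse retrieval: lexical term matching with BM25 over tokenized text.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
sparse_scores = bm25.get_scores(query.lower().split())

# Dense retrieval: cosine similarity between transformer embeddings.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # general-purpose bi-encoder
doc_vecs = encoder.encode(corpus, normalize_embeddings=True)
query_vec = encoder.encode(query, normalize_embeddings=True)
dense_scores = doc_vecs @ query_vec  # normalized vectors -> cosine similarity

for doc, s, d in zip(corpus, sparse_scores, dense_scores):
    print(f"BM25={s:5.2f}  dense={d:5.2f}  {doc[:60]}")
```

The dense scores would typically reward the paraphrase (''notice period'' vs. ''thirty days written notice'') that the lexical BM25 match can miss, which is the behavior exploited for legal retrieval in [31] and [32].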
Advanced RAG systems involve pre- and post-retrieval enhancements to enrich the IR process and produce accurate and precise results [34]. The IR process can be optimized based on the complexity of the problem and the required reasoning steps. An effective method to optimize IR is to apply different methods for IR rather than relying on a single IR process; selecting an appropriate IR process depends on the complexity of the task and the required reasoning steps [34]. In addition, many methods and techniques are used to optimize IR, e.g., the embedding model and query enhancements [35]. The chunking method, which plays a pivotal role in IR, is influenced by the nature of the dataset to be retrieved, how the texts are organized in the dataset, what will be retrieved from the datasets, and how important semantic similarity is in the IR process. The precision of the IR process is strongly dependent on the selection of the chunking strategy [34], [36], [37], [38].
The augmentation process, which integrates the retrieved information and query fragments in the LLM, can be performed in three ways, i.e., in the input, output, and intermediate layers of the generator [36]. KG embeddings can enrich the prompt for more accurate response generation by integrating the retrieved triplets with the retrieved chunks and the original user's query, as proposed in [22] and illustrated in Fig. 2. Prompt engineering with one or a few shots has demonstrated more accurate responses compared with zero-shot prompting [23].
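As an illustration of input-layer augmentation, the following sketch assembles a prompt from retrieved chunks, KG triplets, and a single worked example (one-shot). The template, triplet format, and helper names are our own illustrative assumptions rather than the exact formats used in [22] or [23].

```python
# Hypothetical input-layer augmentation: retrieved chunks + KG triplets + one-shot example.
# The prompt template below is an illustrative assumption, not the format used in [22].
def build_augmented_prompt(query: str, chunks: list[str],
                           triplets: list[tuple[str, str, str]]) -> str:
    context = "\n".join(f"- {c}" for c in chunks)
    facts = "\n".join(f"- ({s}, {p}, {o})" for s, p, o in triplets)
    one_shot = (
        "Question: How long is the statutory limitation period for written contracts?\n"
        "Answer: Under the cited provision, the limitation period is six years.\n"
    )
    return (
        "Answer the legal question using ONLY the context and facts below.\n\n"
        f"Retrieved context:\n{context}\n\n"
        f"Knowledge-graph facts:\n{facts}\n\n"
        f"Example:\n{one_shot}\n"
        f"Question: {query}\nAnswer:"
    )

prompt = build_augmented_prompt(
    "Can a visa applicant appeal a refusal decision?",
    ["Article 12 grants applicants the right to appeal within 30 days."],
    [("visa refusal", "appealable_within", "30 days")],
)
print(prompt)  # this string would be sent to a frozen or fine-tuned LLM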
In the generation process, the LLM can be retrained or fine-tuned on legal data using a parameter-accessible LLM, or the RAG pipeline may utilize a parameter-inaccessible ''frozen'' LLM [34], [36].
B. RAG APPLICATIONS IN THE LEGAL DOMAIN
Despite being introduced in 2020 [14], RAG was not applied to the legal domain until 2023. The first research paper was by Shui et al. [23], who employed RAG to predict legal judgments. While they did not discuss the abstract term RAG explicitly, they did introduce the RAG system and referred to the process as ''LLMs coordinate with IR (LLM + IR).''
We conducted an extensive literature review by examining relevant papers from four major academic databases, i.e., Scopus, IEEE Xplore, Web of Science, and Google Scholar, in addition to preprints posted on arXiv. The query we used to search for related papers included three main terms. The first term included all keywords related to the legal field, e.g., ''legal,'' ''legal case,'' ''judiciary,'' ''judicial,'' and ''law.'' The second term included all keywords related to RAG, e.g., ''retrieval,'' ''augmented,'' and ''generation.'' The third term included all keywords related to LLMs, e.g., ''LLM,'' ''transformer model,'' and ''generative AI.'' As RAG was first introduced in 2020, the search was restricted to articles published between 2020 and 2024. The gathered papers were then scanned based on titles, abstracts, and keywords to determine relevant articles for further analysis. As this field is newly emerging, the final selection includes only 22 papers.
As shown in Fig. 3, RAG research experienced a notable surge in 2024, spanning various applications within the legal domain. The RAG methods proposed in the legal field address various areas, e.g., privacy law, legislative texts, public law, criminal law, statutory law, and immigration law. These applications are employed in various systems, including
TABLE 4. Best-performing transformer-based models in the retrieval stage in legal RAG systems.
The retrieval stage employs several techniques to enhance the accuracy of retrieved information; these techniques include data granularity adjustments, indexing enhancement, and query formulation, along with the selection of a suitable embedding model. Collectively, these techniques facilitate highly precise and structured retrieval of information, which is crucial for legal RAG systems due to the inherent complexity and specificity of legal data [36].
In addition, KG integration transforms conventional data handling approaches and plays a crucial role in improving retrieval precision by structuring legal data into interconnected entities and relationships. For example, combining KGs with LLMs allows retrieval systems to generate context-aware responses using KG triplets as additional context [22], [45], [49]. The PPNet framework encodes legal relationships from judicial sources into a KG, which improves the accuracy of responses [45]. Furthermore, hybrid systems, e.g., HyPA-RAG, utilize KG triplets alongside dense and sparse methods for adaptive query tuning [49].
Query rewriting enhances the retrieval process by reformulating user inputs to better align with indexed data while incorporating related concepts that users may not explicitly mention but are contextually relevant. Some frameworks, e.g., HyPA-RAG, adapt queries dynamically by selecting the number of rewrites and the number of top related retrieved chunks (K) based on query complexity [49]. In addition, the multiview RAG (MVRAG) framework introduces intention-aware query rewriting and leverages multiple domain viewpoints to refine queries in knowledge-dense contexts [40].
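The following sketch illustrates the adaptive idea behind such frameworks: a crude complexity estimate drives how many rewrites and how many chunks (K) to request. The heuristic and the thresholds are illustrative assumptions, not the actual HyPA-RAG or MVRAG logic.

```python
# Hypothetical adaptive query planning: estimated complexity controls the
# number of query rewrites and the retrieval depth K. Heuristics are assumptions.
def estimate_complexity(query: str) -> int:
    """Score 0-2 from length and multi-hop cues; a real system might ask an LLM."""
    words = query.lower().split()
    cues = {"and", "versus", "compare", "both", "before", "after"}
    return int(len(words) > 12) + int(any(w.strip(".,") in cues for w in words))

def plan_retrieval(query: str) -> dict:
    level = estimate_complexity(query)
    return {"n_rewrites": (1, 3, 5)[level], "top_k": (3, 5, 10)[level]}

print(plan_retrieval("Define consideration in contract law."))
# -> {'n_rewrites': 1, 'top_k': 3}
print(plan_retrieval("Compare notice requirements before and after the 2023 "
                     "amendment for both residential and commercial leases."))
# -> {'n_rewrites': 5, 'top_k': 10}
```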
In terms of chunking strategies, dividing long legal documents into smaller, manageable chunks improves retrieval precision. Chunk-based embedding strategies ensure that contextual details are preserved within smaller text fragments, which reduces the noise associated with embedding the entire document [47]. HyPA-RAG employs multiple chunking techniques, including sentence-level, semantic, and pattern-based chunking, to balance token constraints and context. It has been demonstrated that pattern-based chunking using corpus-specific delimiters achieves the best retrieval precision, with top scores in context recall and faithfulness. Sentence-level chunking excels in context precision and F1 scores; thus, it is suitable for precise retrieval tasks. In addition, unless heavily tuned, semantic chunking underperforms compared with simpler methods [49].
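Below is a minimal sketch of the two best-performing strategies reported above, under the assumption that section markers such as ''Article N'' act as corpus-specific delimiters; real pipelines would also enforce token limits and chunk overlap.

```python
# Hypothetical sentence-level vs. pattern-based chunking of a legal text.
# The delimiter pattern is an illustrative assumption for statute-like corpora.
import re

text = ("Article 1 A lease must be in writing. It binds both parties. "
        "Article 2 Notice of termination must be given thirty days in advance.")

def sentence_chunks(doc: str) -> list[str]:
    # Naive sentence splitting; production systems use a tokenizer (e.g., spaCy).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]

def pattern_chunks(doc: str, delimiter: str = r"(?=Article\s+\d+)") -> list[str]:
    # Split on corpus-specific markers so each chunk is one self-contained article.
    return [c.strip() for c in re.split(delimiter, doc) if c.strip()]

print(sentence_chunks(text))  # 3 sentence-level chunks
print(pattern_chunks(text))   # 2 article-level chunks, each keeping its full context
```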
An efficient indexing mechanism is essential for rapid similarity searches in high-dimensional spaces. To balance search speed and accuracy, CaseGPT employs the Hierarchical Navigable Small World (HNSW) algorithm, which is a state-of-the-art indexing technique. In addition, the system implements an incremental indexing mechanism to support real-time updates, which facilitates the seamless integration of new cases without requiring complete reindexing [39].
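For concreteness, the sketch below builds an HNSW index with the hnswlib package and adds a new vector without rebuilding, mirroring the incremental-update idea; the dimensionality and parameters are illustrative assumptions rather than CaseGPT's actual configuration.

```python
# Hypothetical HNSW index with incremental updates, using the hnswlib package.
# Parameters (dim, M, ef_construction) are illustrative, not CaseGPT's settings.
import hnswlib
import numpy as np

dim = 384                      # e.g., a MiniLM-sized embedding
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=10_000, ef_construction=200, M=16)

# Initial build from existing case embeddings.
case_vecs = np.random.rand(100, dim).astype(np.float32)
index.add_items(case_vecs, ids=np.arange(100))

# Incremental update: a newly digitized case is added without reindexing.
new_case = np.random.rand(1, dim).astype(np.float32)
index.add_items(new_case, ids=np.array([100]))

index.set_ef(50)               # query-time accuracy/speed trade-off
labels, distances = index.knn_query(new_case, k=3)
print(labels, distances)       # nearest neighbors include the new case itself
```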
Retriever evaluation commonly relies on four metrics, i.e., precision, recall, the mean reciprocal rank (MRR), and the mean average precision (MAP). Precision is the fraction of relevant instances among the retrieved instances, and recall is the fraction of relevant instances retrieved from the total number of relevant cases. MRR is the average of the reciprocal ranks of the first correct response to a set of queries, and MAP is the mean of the average precision scores for each query [57]. These four metrics can be used to evaluate how effectively a retriever identifies and ranks relevant documents in response to the user's query [58].
For response evaluation, the primary goal is to ensure that the response is relevant to the user query and avoids hallucinations. Generation metrics, e.g., METEOR [18], [21], the bilingual evaluation understudy (BLEU) [52], and the recall-oriented understudy for gisting evaluation (ROUGE) [43], are used to determine the response quality of RAG systems in the legal domain. METEOR combines precision, recall, and sentence fluency to accurately calculate the similarity between automatically generated and reference responses to evaluate the effectiveness of text generation tasks. BLEU measures the overlap between a generated response and a set of reference responses by focusing on the precision of n-grams. Finally, ROUGE counts the number of overlapping units, e.g., n-grams, word sequences, and word pairs, between the generated and reference responses considering recall and precision. Utilizing these metrics to evaluate retrieval and generation tasks helps build a robust, efficient, and user-centric RAG system.
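For reference, the standard forms of these overlap metrics are shown below, where $p_n$ is the modified n-gram precision, $w_n$ are the n-gram weights, $\mathrm{BP}$ is the brevity penalty, and ROUGE-N is the n-gram recall against the reference response:

```latex
\begin{align}
\text{BLEU} &= \mathrm{BP} \cdot \exp\!\Big(\sum_{n=1}^{N} w_n \log p_n\Big), \\
\text{ROUGE-N} &= \frac{\sum_{g \in \text{ref n-grams}} \min\big(\mathrm{count}_{\text{gen}}(g),\, \mathrm{count}_{\text{ref}}(g)\big)}{\sum_{g \in \text{ref n-grams}} \mathrm{count}_{\text{ref}}(g)}.
\end{align}
```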
However, there are no ground truth answers to queries; thus, the focus of the evaluation has shifted to quantitative aspects, wherein the retriever and generator are evaluated separately [59]. In other words, the nature of RAG systems makes them generate unstructured text, which means that both qualitative and quantitative metrics are required to assess their performance accurately. Therefore, we adopted an approach similar to Table 5 of [59], which indicates the evaluation metrics used for RAG-based systems in the medical domain and the ethical principles considered in surveyed studies. By adopting these metrics and considerations, we created Table 7, which shows the evaluation metrics employed in surveyed RAG-based studies in the legal domain. Specifically, we reviewed 22 papers to check for references to five evaluation metrics to assess their usage in RAG-based legal applications. The five evaluation metrics were correctness, completeness, faithfulness, fluency, and relevance (context relevance and answer relevance). Correctness means that the response generated by the RAG system must perfectly align with the expected response or be a relevant statement that conveys the same information [60]. Completeness refers to RAG-generated responses that are comprehensive and cover all aspects of the anticipated response. Faithfulness indicates that the response must be based on the provided context; RAG systems are frequently utilized in contexts where the factual accuracy of the generated text with respect to the grounded sources is highly significant, e.g., law [26]. Fluency is the ability of a RAG system to generate readable and clear text. Finally, relevance comprises two parts, i.e., context relevance, which checks whether the context of the retrieved information is relevant to the query, and response relevance, which indicates how relevant the generated response is for the given query.
RAG-generated responses that are comprehensive and cover covering a wide range of reasoning scenarios. The CAIL2020
all aspects of the anticipated response. Faithfulness indicates and CAIL2021 datasets [18] improve the reasoning skills
that the response must be based on the provided context. required to answer legal questions. The CAIL2020 dataset
RAG systems are frequently utilized in contexts where contains 10,000 legal documents, and the CAIL2021 dataset
the factual accuracy of the generated text with respect to presents multisegment questions with approximately 7,000
the grounded sources is highly significant, e.g., law [26]. question–answer pairs. The Open Australian Legal Question-
Fluency is the ability of a RAG system to generate readable Answering Dataset [20] contains more than 2,100 question–
and clear text. Finally, relevance comprises two parts, answer–snippet triplets generated by GPT-4 using the Open
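A sketch of this workflow with the open-source ragas package is shown below; the metric and column names follow the package's documented interface at the time of writing, but versions evolve, so treat the exact API as an assumption.

```python
# Hypothetical RAGAs evaluation of a single question-answer record.
# Requires `pip install ragas datasets` and an OpenAI API key for the judge model.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (answer_relevancy, context_precision,
                           context_recall, faithfulness)

records = Dataset.from_dict({
    "question": ["What notice period applies before terminating a lease?"],
    "answer": ["Thirty days written notice is required."],
    "contexts": [["The tenant must give thirty days written notice."]],
    "ground_truth": ["Thirty days written notice."],  # needed by context_recall
})

result = evaluate(records, metrics=[faithfulness, answer_relevancy,
                                    context_precision, context_recall])
print(result)  # four scores in [0, 1]; their mean is the overall RAGAs score
```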
TABLE 7. Metrics to evaluate RAG-based systems in the legal domain. We assess whether the ethical principles of privacy, safety, robustness, bias, and trust are considered.
V. EMERGING DATASETS AND BENCHMARKS
The advancement of RAG systems in legal technology heavily relies on high-quality datasets that enable effective retrieval, reasoning, and interpretability. Several benchmark datasets have been introduced to improve legal question answering (LQA), legal information retrieval (IR), and case law analysis. This section reviews key datasets and benchmarks that support the development of RAG systems in the legal domain.
Legal Question Answering (LQA) datasets play a crucial role in training and evaluating models that generate precise legal responses. For example, the BorderLegal-QA dataset [18] is specialized for legal queries related to border inspections, and it contains 1,329 pairs of questions and answers covering 51 types of questions. The goal is to offer expertly curated question–answer pairs that are applicable to realistic border inspection situations. In addition, the JEC-QA dataset is a collection of multiple-choice questions from the National Unified Legal Professional Qualification Examination. This dataset contains a total of 26,365 questions, and it acts as a standard to assess legal QA systems. The CJRC dataset was created from real-world accounts in Chinese court records, and it contains approximately 10,000 documents and nearly 50,000 question–answer pairs covering a wide range of reasoning scenarios. The CAIL2020 and CAIL2021 datasets [18] improve the reasoning skills required to answer legal questions. The CAIL2020 dataset contains 10,000 legal documents, and the CAIL2021 dataset presents multisegment questions with approximately 7,000 question–answer pairs. The Open Australian Legal Question-Answering Dataset [20] contains more than 2,100 question–answer–snippet triplets generated by GPT-4 using the Open
Australian Legal Corpus. This dataset allows LLMs to enhance their skills when answering legal questions. The LLeQA dataset [21] was created to help develop models that can provide in-depth responses to legal questions in French. This dataset comprises 1,868 legal questions explained by experts with detailed answers based on applicable legal provisions sourced from a collection of 27,942 statutory articles. The LLeQA dataset improves on previous work by adding new kinds of annotations, e.g., a comprehensive taxonomy of questions, jurisdiction information, and specific references at the paragraph level, which makes it a versatile resource for progressing research in LQA and other related legal activities.
Legal IR datasets are critical for evaluating the retrieval precision of legal RAG systems. For example, the Chatlaw Legal Dataset [25] contains about 4 million data samples in 10 main categories and 44 minor categories. This dataset includes different legal areas, e.g., case classification, statute prediction, and legal document drafting, as well as specialized tasks, e.g., public opinion analysis and named entity recognition. This variety guarantees the thorough inclusion of legal processing assignments. The Case Law Evaluation and Retrieval Corpus [43] is the main dataset created from digitized case law retrieved from the Caselaw Access Project by Harvard Law School. This platform contains more than 1.84 million federal case documents and was created for IR and RAG tasks.
In addition, the Chat-Eur-Lex dataset was created specifically for the Chat-EUR-Lex project [41] to improve the accessibility of European legal information using chat-based LLMs and RAG. The EUR-Lex repository contains approximately 37,000 legal acts in English and Italian, which are divided into approximately 371,000 texts or ''chunks'' to improve search results. Note that this dataset does not include documents without XML or HTML data and corrections, which guarantees both quality and significance. The main goal is to help create a conversational interface that offers simplified explanations of complicated legal documents and allows for customized interactions for users requiring legal information. The specialized LeCaRDv2 dataset [40] is uniquely curated for legal case retrieval and is known for its thorough selection of legal cases and careful methodology. It functions as a standard to assess legal retrieval systems, and it covers various legal topics and situations. This dataset contains in-depth case descriptions and is organized to help test retrieval models, especially in complex and uncommon legal cases, ultimately improving the functioning and comprehension of legal IR systems.
LegalBench-RAG [44] is a comprehensive benchmark that was constructed using four primary datasets. The ContractNLI dataset focuses on NDA-related documents and contains 946 entries. The Contract Understanding Atticus Dataset includes private contracts and has a total of 4,042 entries. The Mergers and Acquisitions Understanding Dataset (MAUD) comprises M&A documents from public companies, with a total of 1,676 entries. Finally, the Privacy QA dataset comprises the privacy policies of consumer applications with a total of 194 entries. In total, these datasets contribute to a robust corpus of legal documents, amounting to approximately 80 million characters across 714 documents, and they form the basis for the 6,889 question–answer pairs that constitute the LegalBench-RAG benchmark.
While the datasets mentioned above all contribute to legal RAG applications, they differ in terms of structure, purpose, and impact on RAG performance. The following comparisons highlight key distinctions.
Legal Question-Answering vs. Case Law Retrieval: Datasets like JEC-QA, LLeQA, and BorderLegal-QA focus on question-answering tasks, making them valuable for improving the precision of RAG systems in legal inquiries. In contrast, datasets such as CJRC, the Case Law Evaluation and Retrieval Corpus, and LeCaRDv2 focus on case law retrieval, enhancing RAG's ability to fetch relevant case precedents.
Structured vs. Unstructured Legal Texts: The Chatlaw Legal Dataset and LegalBench-RAG incorporate structured annotations, making them useful for legal document classification and knowledge extraction. On the other hand, CAIL2020, CAIL2021, and Chat-Eur-Lex deal with unstructured legal documents, requiring RAG models to improve document chunking and summarization techniques.
Monolingual vs. Multilingual Data: Datasets such as Chat-Eur-Lex and LLeQA introduce multilingual legal data (English, Italian, and French), helping RAG systems adapt to cross-jurisdictional applications, whereas datasets like JEC-QA and BorderLegal-QA are domain-specific and monolingual.
Regulatory vs. Contractual Focus: The Open Australian Legal QA Dataset and the Privacy QA dataset specialize in regulatory compliance, helping RAG models interpret legal policies and statutes. Meanwhile, datasets like ContractNLI and the Mergers and Acquisitions Understanding Dataset emphasize contract analysis, which is useful for legal contract review automation.
These differences determine how effectively a RAG system performs specific legal tasks. The choice of dataset affects model interpretability, retrieval accuracy, and domain adaptation, ultimately shaping the development of more robust legal AI applications. Although existing datasets offer a useful basis for RAG-based legal AI, a number of limitations still exist. Most datasets are restricted to English and Chinese, leaving gaps in legal systems that use languages such as Arabic and French. Additionally, certain legal domains, such as international law and regulatory compliance, remain underrepresented, limiting the applicability of current models. Furthermore, existing benchmarks primarily emphasize QA accuracy, while often overlooking crucial aspects such as interpretability and explainability. Addressing these gaps requires expanding datasets to cover diverse legal systems, refining benchmarks to evaluate interpretability, and incorporating human-in-the-loop evaluation methods. Moreover, integrating multiple datasets to develop hybrid models could further enhance precision and contextual understanding in legal AI applications. Table 8 summarizes the details of the compared datasets.
VI. ETHICAL AND PRIVACY CONSIDERATIONS IN LEGAL RAG
When utilizing RAG-based LLMs in the legal field, addressing ethical concerns, including bias, privacy, hallucination, and safety, is crucial. These issues can be resolved by implementing strong data privacy measures, advocating for transparency and accountability, addressing bias, emphasizing human supervision, and promoting human–machine collaboration. However, the analysis performed in this study shows that only a few papers have addressed these concerns, which indicates that there is considerable room for improvement. Table 7 displays whether ethical values, e.g., privacy, safety, robustness, bias, and trust, were considered in the 22 articles reviewed in this study. All definitions of ethical principles are listed in Table 9.
VII. CHALLENGES IN LEGAL RAG
A. COMPUTATIONAL COST AND COMPLEXITY
Many legal RAG methods encounter challenges associated with the computational cost related to the use of parameter-inaccessible LLMs as generators and embeddings utilizing APIs [40], which can make it inefficient to rely on a powerful LLM, e.g., GPT-4. However, the
complexity of using in-house embedding storage and LLMs is another problem that may hinder the use of open-source solutions [40], [42], [44], [45], [52]. In retrieval, the computational complexity of multiperspective retrieval can pose challenges for real-time applications in specific scenarios [40]. However, leveraging different techniques, e.g., caching embeddings, can reduce redundant computation and API costs [20]. HyPA-RAG [49] integrates an adaptive retrieval process to minimize unnecessary token usage and computational cost. Generally, a well-established RAG pipeline can improve latency by integrating precomputed and optimized retrieval, reducing the reliance on expensive API calls [41].
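A sketch of the embedding-caching idea mentioned above is shown below; the hashing scheme and in-memory store are illustrative assumptions, and a production system would persist the cache (e.g., on disk or in a vector database).

```python
# Hypothetical embedding cache: identical texts are embedded (and billed) once.
import hashlib

class CachedEmbedder:
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn      # e.g., a paid embedding API call
        self.cache: dict[str, list[float]] = {}

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.cache:     # only a cache miss triggers the API call
            self.cache[key] = self.embed_fn(text)
        return self.cache[key]

# Toy embedding function standing in for a remote API.
fake_api = lambda t: [float(len(t)), float(sum(map(ord, t)) % 97)]
embedder = CachedEmbedder(fake_api)
embedder.embed("thirty days written notice")   # API call
embedder.embed("thirty days written notice")   # served from cache, no cost
print(len(embedder.cache))                      # -> 1
```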
B. NO RESPONSE AND HALLUCINATION
Based on the failure points (FP) of RAG systems presented in the literature [65], legal RAG approaches have addressed most of these FPs. These FPs could lead to one of two challenges, i.e., no response and/or a hallucinated response. This subsection discusses how the proposed legal RAG methods address these challenges.
TABLE 10. Failure points and corresponding RAG methods, techniques, and processes.
1) NO RESPONSE
In the no-response scenario, the RAG system fails to generate a response because the matching algorithm that calculates the similarity between the original query and the document chunks in the vector database does not return relevant results. This problem can be caused by various factors, including an incomplete dataset, a flawed chunking method, a poor embedding model, a nonrelated matching algorithm, a poor augmentation mechanism, a weak generator, or an inaccurate prompt. Although the no-response scenario indicates that the RAG system cannot produce a response, it provides more robust control over external knowledge compared with scenarios that involve hallucinated responses, as shown in Fig. 7.
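This trade-off can be made explicit with a similarity threshold on the retriever, as in the following sketch: below the threshold the system abstains (no response) rather than generating from weak evidence. The threshold value and cosine-similarity scoring are illustrative assumptions.

```python
# Hypothetical abstention guard: prefer "no response" over answering
# from weakly matching chunks. Threshold and scoring are illustrative.
import numpy as np

def answer_or_abstain(query_vec: np.ndarray, chunk_vecs: np.ndarray,
                      chunks: list[str], threshold: float = 0.6):
    # Cosine similarity between the query and every chunk embedding.
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec))
    best = int(np.argmax(sims))
    if sims[best] < threshold:
        return None          # controlled failure: no response, no hallucination
    return chunks[best]      # strong match: pass this context to the generator

rng = np.random.default_rng(0)
print(answer_or_abstain(rng.random(8), rng.random((3, 8)), ["a", "b", "c"]))
```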
2) HALLUCINATION
RAG has been introduced to overcome the hallucination problems of LLMs [14], [67], and RAG itself can be employed to evaluate the performance of LLMs [48]; however, RAG systems are still subject to providing hallucinated responses [65]. For example, a previous study found that GPT-4o generated hallucinated responses and used a prompt with cited cases to overcome this challenge [43]. In another study, the LegalBench dataset was used to assess the legal reasoning capabilities of LLMs in the RAG pipeline to overcome the hallucination problem of RAG systems [44]. Additional methods and techniques implemented in the papers reviewed herein to overcome these challenges are listed in Table 10.
C. COMPLEX QUERY HANDLING
RAG systems may struggle with ambiguous, multi-hop, or vague queries, reducing accuracy in complex reasoning tasks. For example, in [46], the RAG system struggles with Q9 and Q12, which require deeper contextual understanding. CaseGPT [39] showed limitations in addressing unprecedented cases. Including multiple retrieved cases (neighbors) for context in CBR-RAG [20] poses challenges to maintaining prompt coherence, affecting the quality of the generated output. In Chatlaw [25], the system struggles with diverse user inputs, mainly when users provide incomplete, deceptive, or misleading information, which can lead to incorrect answers. LegalBench-RAG [44] struggles with tasks that require multi-hop reasoning or handling technical legal jargon, particularly in datasets like MAUD. For PRO [45], multi-hop reasoning or combining information from multiple retrieved documents remains a challenge, leading to incomplete or incorrect responses. The TaxTajweez [42] and [51] RAG systems struggle with complex or ambiguous user queries, leading to less relevant or incomplete retrieval results.
D. DEPENDENCE ON RETRIEVAL ACCURACY
As clarified earlier, RAG systems rely heavily on retrieval precision; errors or irrelevant document fragments affect the outputs. For example, the retrieval precision of CLERC [43] is reduced due to repeated legal terms and irrelevant words in legal documents that mislead retrieval models. In addition, setting a static confidence threshold in DRAG-BILQA [18] may not be adaptable to all questions or complex queries. Although incorporating KG into the retrieval can improve the accuracy of the retrieval, an incomplete KG can result in missing critical information and key relationships within legal texts [25], [49].
In LegalBench-RAG [44], general-purpose re-rankers, such as Cohere's model, perform poorly on specialized legal texts due to a lack of domain adaptation. On the other hand, increasing recall improves the chance of retrieving relevant snippets but introduces more noise, reducing precision. In general, incorporating external knowledge sources into RAG while maintaining coherence and relevance remains complex and challenging.
E. EVALUATION METRICS LIMITATIONS
Current metrics (e.g., BLEU, METEOR, precision, and recall) may not fully assess factual correctness and semantic quality. For example, ROUGE struggles to accurately measure the factual correctness and semantic quality of long-form responses [21]. In addition, the METEOR scores remained low due to the complexity and length of the legal content, limiting effective evaluation [18]. Moreover, as revealed in Section IV, there is no standardized evaluation method for RAG systems, and various frameworks utilize different metrics to evaluate RAG systems.
VIII. RESEARCH DIRECTIONS IN LEGAL RAG
Recently, RAG has been applied in the legal domain, and it has exhibited promising benefits and outcomes. This section highlights the potential research directions inspired by the articles reviewed herein.
A. EXPANDING LEGAL DOMAINS
The scope of future legal RAG research should be extended to cover a wider range of legal domains. Legal acts are involved in everyday activities; thus, developing reliable and accurate LLM-driven systems can benefit individuals and law practitioners. Most of the surveyed studies have covered specific legal domains focusing on narrow subdomains, such as contract analysis and case law retrieval, while other papers, such as Chatlaw [25] and CBR-RAG [20], cover broader legal domains. Legal RAG systems often struggle with cross-jurisdictional generalization due to differences in legal systems, terminologies, and practices. Future research should explore methods to enhance the adaptability of RAG systems across jurisdictions. For example, techniques like multi-task learning (e.g., as demonstrated in LEGAL-BERT [68]) could be extended to RAG systems to improve their performance in diverse legal environments. The open challenge here is how to scale RAG systems to handle the complex interdependencies in multi-domain legal scenarios and how to generalize domain-specific models without sacrificing performance in individual domains.
B. DEVELOPING AND ENHANCING LEGAL DATASETS
As discussed in Section VII, one of the most common root causes of failure in RAG systems is noise in the used datasets.
To date, three benchmark datasets have been developed in different legal domains to develop and evaluate legal RAG systems [21], [43], [44]. Therefore, future research should focus on developing robust open-source datasets in different legal domains. As explained earlier in Section V, most of the datasets used in the surveyed papers were developed for the experiment and its specific tasks. Developing a benchmark dataset for the legal domain, such as the ALQA dataset used in CBR-RAG [20], can help in evaluating RAG systems in the legal domain, e.g., by developing large datasets that can enhance RAG models as well as the retraining of LLMs [69], [70], [71]. Datasets like LexGLUE [72] have set a precedent for benchmarking legal NLP tasks, but more specialized datasets are needed for RAG systems. These datasets should include annotated legal texts, case law, and statutory provisions to enable fine-tuning and evaluation of RAG models in specific legal contexts.
C. MULTILINGUAL LEGAL RAG
Another promising research direction for non-English researchers is to take advantage of the available non-English legal knowledge in legal RAG systems. Researchers in this field may study efficient methods and techniques for multilingual legal corpora, and legal technology researchers can leverage the recent state-of-the-art methods in this domain [73], [74]. For example, handling code-switching in non-Latin scripts, addressing fluency errors, improving document comprehension, and minimizing irrelevant retrievals for multilingual RAG models are promising research topics for multilingual legal RAG [73]. Recent advancements in mBERT and XLM-R [75] provide opportunities to train legal RAG systems on multilingual corpora efficiently. Additionally, datasets such as MultiEURLex [76], which cover EU legal documents in multiple languages, could serve as a foundation for developing multilingual RAG systems.
D. MULTIDIMENSIONAL APPROACH IN LEGAL TECH
Integrating RAG with knowledge graphs, fine-tuning processes, and prompt engineering, as shown in Fig. 2, is becoming a prominent approach in legal technology. This multidimensional approach is expected to enhance retrieval and generation capabilities, and future research and case studies are expected to further enrich the field with more interpretable and reliable LLM-driven applications. For example, the integration of ConceptNet [77] with RAG systems could help bridge the gap between structured and unstructured legal knowledge. Case studies on prompt engineering for specific legal tasks (e.g., drafting legal briefs) [78] could also guide future research in this direction.
E. EVALUATION METRICS
Focusing on developing a standardized evaluation method to assess the performance of RAG systems is a promising research field as long as RAG is in its early stages. Researchers in this field can leverage the latest metrics and approaches [26], as discussed in Sections IV and VII. Developing a comprehensive evaluation framework for legal RAG systems is essential to evaluate their reliability and performance. Building on works like Kwiatkowski et al. [79], who developed human evaluation methods for natural language systems, researchers could devise legal-specific metrics that assess the factual accuracy of generated responses and citation quality in retrieved case law. Open challenges include how to measure the interpretability of RAG systems in high-stakes legal settings and what new metrics can evaluate the ethical implications of RAG outputs.
F. REINFORCEMENT LEARNING TO OPTIMIZE LEGAL RAG APPLICATIONS
Transformer models are the most widely used approach for embedding and generation in legal RAG systems,
as demonstrated by the literature survey performed in this study. However, it is expected that reinforcement learning can be employed to optimize legal RAG [80], [81] in the retrieval and generation modules for different types of real-world applications. For example, methods like Reinforcement Learning from Human Feedback (RLHF) (e.g., as used in OpenAI's GPT models) could be adapted for legal RAG systems to improve their performance in real-world legal applications. This approach would allow the system to learn from interactions with legal professionals, ensuring that it retrieves and generates more relevant and accurate outputs. In addition, the reward-based optimization approach [82] could be applied to fine-tune models for specific legal tasks (e.g., identifying precedents in case law). Building on works like Ziegler et al. [83], who optimized language models for specific human preferences, RL could enable legal RAG systems to better align with legal practitioners' needs. Recent advances in RL for enhancing reasoning capabilities in LLMs, such as those demonstrated in DeepSeek-R1 [84], highlight its potential to align model outputs with domain-specific objectives.
specific objectives. [17] S. Gupta, R. Ranjan, and S. N. Singh, ‘‘A comprehensive survey of
retrieval-augmented generation (RAG): Evolution, current landscape and
future directions,’’ 2024, arXiv:2410.12837.
IX. CONCLUSION
[18] Y. Zhang, D. Li, G. Peng, S. Guo, Y. Dou, and R. Yi, ‘‘A dynamic
This paper has presented an overview of the utilization of retrieval-augmented generation framework for border inspection legal
RAG in the legal domain. We have covered and analyzed all question answering,’’ in Proc. Int. Conf. Asian Lang. Process. (IALP),
Hohhot, China, Aug. 2024, pp. 372–376.
methods, techniques, and stages of legal RAG. The analysis
[19] A. Nikolakopoulos, S. Evangelatos, E. Veroni, K. Chasapas, N. Gousetis,
presented in this paper provides insights into embedding, A. Apostolaras, C. D. Nikolopoulos, and T. Korakis, ‘‘Large language mod-
retrieval, augmentation, and generation techniques. In addi- els in modern forensic investigations: Harnessing the power of generative
artificial intelligence in crime resolution and suspect identification,’’ in
tion, we have thoroughly investigated IR, as it is the backbone
Proc. 5th Int. Conf. Electron. Eng., Inf. Technol. Educ. (EEITE), Chania,
of RAG, and we explained the different evaluation metrics Greece, May 2024, pp. 1–5.
used to assess RAG systems. Furthermore, we have proposed [20] N. Wiratunga, R. Abeyratne, L. Jayawardena, K. Martin, S. Massie,
a challenge scale to control the hallucination results of RAG, I. Nkisi-Orji, R. Weerasinghe, A. Liret, and B. Fleisch, ‘‘CBR-RAG: Case-
based reasoning for retrieval augmented generation in LLMs for legal
which is expected to be an initial foundation for developing a question answering,’’ 2024, arXiv:2404.04302.
new evaluation method. [21] A. Louis, G. Van Dijck, and G. Spanakis, ‘‘Interpretable long-form legal
question answering with retrieval-augmented large language models,’’ in
Proc. AAAI Conf. Artif. Intell., vol. 38, Mar. 2024, pp. 22266–22275.
REFERENCES
[1] K. Mania, ''Legal technology: Assessment of the legal tech industry's potential,'' J. Knowl. Economy, vol. 14, no. 2, pp. 595–619, Jun. 2023.
[2] J. B. Rajendra, ''Disruptive technologies and the legal profession,'' Int. J. Law, vol. 6, no. 5, pp. 271–280, Jan. 2020.
[3] S. Sharma, S. Gamoura, D. Prasad, and A. Aneja, ''Emerging legal informatics towards legal innovation: Current status and future challenges and opportunities,'' Legal Inf. Manage., vol. 21, nos. 3–4, pp. 218–235, Dec. 2021.
[4] R. Dale, ''Law and word order: NLP in legal tech,'' Natural Lang. Eng., vol. 25, no. 1, pp. 211–217, Jan. 2019.
[5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, ''Attention is all you need,'' in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 1–11.
[6] P. Linardatos, V. Papastefanopoulos, and S. Kotsiantis, ''Explainable AI: A review of machine learning interpretability methods,'' Entropy, vol. 23, no. 1, p. 18, Dec. 2020.
[7] X. Chen, J. Zheng, C. Li, B. Wu, H. Wu, and J. Montewka, ''Maritime traffic situation awareness analysis via high-fidelity ship imaging trajectory,'' Multimedia Tools Appl., vol. 83, no. 16, pp. 48907–48923, Nov. 2023.
[8] X. Chen, H. Wu, B. Han, W. Liu, J. Montewka, and R. W. Liu, ''Orientation-aware ship detection via a rotation feature decoupling supported deep learning approach,'' Eng. Appl. Artif. Intell., vol. 125, Oct. 2023, Art. no. 106686.
[9] Z. Zhang, J. Xiong, Z. Zhao, F. Wang, Y. Zeng, B. Zhao, and L. Ke, ''An approach of dynamic response analysis of nonlinear structures based on least square Volterra kernel function identification,'' Transp. Saf. Environ., vol. 5, no. 2, p. 46, Mar. 2023.
[10] M. Dahl, V. Magesh, M. Suzgun, and D. E. Ho, ''Large legal fictions: Profiling legal hallucinations in large language models,'' J. Legal Anal., vol. 16, no. 1, pp. 64–93, Jan. 2024.
[11] F. Yu, L. Quartey, and F. Schilder, ''Exploring the effectiveness of prompt engineering for legal reasoning tasks,'' in Proc. Findings Assoc. Comput. Linguistics (ACL), Toronto, ON, Canada, 2023, pp. 13582–13596.
[12] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. E. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, ''Training language models to follow instructions with human feedback,'' in Proc. Adv. Neural Inf. Process. Syst., Jan. 2022, pp. 27730–27744.
[13] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, ''QLoRA: Efficient finetuning of quantized LLMs,'' in Proc. Adv. Neural Inf. Process. Syst., Jan. 2023, pp. 10088–10115.
[14] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-T. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, ''Retrieval-augmented generation for knowledge-intensive NLP tasks,'' in Proc. Adv. Neural Inf. Process. Syst., Jan. 2020, pp. 9459–9474.
[15] O. Ovadia, M. Brief, M. Mishaeli, and O. Elisha, ''Fine-tuning or retrieval? Comparing knowledge injection in LLMs,'' 2023, arXiv:2312.05934.
[16] H. Soudani, E. Kanoulas, and F. Hasibi, ''Fine tuning vs. retrieval augmented generation for less popular knowledge,'' 2024, arXiv:2403.01432.
[17] S. Gupta, R. Ranjan, and S. N. Singh, ''A comprehensive survey of retrieval-augmented generation (RAG): Evolution, current landscape and future directions,'' 2024, arXiv:2410.12837.
[18] Y. Zhang, D. Li, G. Peng, S. Guo, Y. Dou, and R. Yi, ''A dynamic retrieval-augmented generation framework for border inspection legal question answering,'' in Proc. Int. Conf. Asian Lang. Process. (IALP), Hohhot, China, Aug. 2024, pp. 372–376.
[19] A. Nikolakopoulos, S. Evangelatos, E. Veroni, K. Chasapas, N. Gousetis, A. Apostolaras, C. D. Nikolopoulos, and T. Korakis, ''Large language models in modern forensic investigations: Harnessing the power of generative artificial intelligence in crime resolution and suspect identification,'' in Proc. 5th Int. Conf. Electron. Eng., Inf. Technol. Educ. (EEITE), Chania, Greece, May 2024, pp. 1–5.
[20] N. Wiratunga, R. Abeyratne, L. Jayawardena, K. Martin, S. Massie, I. Nkisi-Orji, R. Weerasinghe, A. Liret, and B. Fleisch, ''CBR-RAG: Case-based reasoning for retrieval augmented generation in LLMs for legal question answering,'' 2024, arXiv:2404.04302.
[21] A. Louis, G. Van Dijck, and G. Spanakis, ''Interpretable long-form legal question answering with retrieval-augmented large language models,'' in Proc. AAAI Conf. Artif. Intell., vol. 38, Mar. 2024, pp. 22266–22275.
[22] R. Venkatakrishnan, E. Tanyildizi, and M. A. Canbaz, ''Semantic interlinking of immigration data using LLMs for knowledge graph construction,'' in Proc. ACM Web Conf. Companion. Singapore: Springer, May 2024, pp. 605–608.
[23] R. Shui, Y. Cao, X. Wang, and T.-S. Chua, ''A comprehensive evaluation of large language models on legal judgment prediction,'' in Proc. Findings Assoc. Comput. Linguistics, Singapore, 2023, pp. 7337–7348.
[24] M. Visciarelli, G. Guidi, L. Morselli, D. Brandoni, G. Fiameni, L. Monti, S. Bianchini, and C. Tommasi, ''SAVIA: Artificial intelligence in support of the lawmaking process,'' in Proc. 4th Nat. Conf. Artif. Intell. Naples, Italy: CINI, May 2024.
[25] J. Cui, M. Ning, Z. Li, B. Chen, Y. Yan, H. Li, B. Ling, Y. Tian, and L. Yuan, ''Chatlaw: A multi-agent collaborative legal assistant with knowledge graph enhanced mixture-of-experts large language model,'' 2023, arXiv:2306.16092.
[26] S. Es, J. James, L. Espinosa-Anke, and S. Schockaert, ''RAGAS: Automated evaluation of retrieval augmented generation,'' 2023, arXiv:2309.15217.
[27] P. Zhao, H. Zhang, Q. Yu, Z. Wang, Y. Geng, F. Fu, L. Yang, W. Zhang, J. Jiang, and B. Cui, ''Retrieval-augmented generation for AI-generated content: A survey,'' 2024, arXiv:2402.19473.
[28] M. Mitra and B. B. Chaudhuri, ''Information retrieval from documents: A survey,'' Inf. Retr., vol. 2, pp. 141–163, Apr. 2000.
[29] S. Robertson and S. Walker, ''Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval,'' in Proc. SIGIR, B. W. Croft and C. J. Van Rijsbergen, Eds., London, U.K.: Springer, Aug. 1994, pp. 232–241.
[30] G. Salton and C. Buckley, ''Term-weighting approaches in automatic text retrieval,'' Inf. Process. Manage., vol. 24, no. 5, pp. 513–523, Jan. 1988.
[31] V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W.-T. Yih, ''Dense passage retrieval for open-domain question answering,'' 2020, arXiv:2004.04906.
[32] H.-T. Nguyen, M.-K. Phi, X.-B. Ngo, V. Tran, L.-M. Nguyen, and M.-P. Tu, ''Attentive deep neural networks for legal document retrieval,'' 2022, arXiv:2212.13899.
[33] M. Grohe, ''word2vec, node2vec, graph2vec, x2vec: Towards a theory of vector embeddings of structured data,'' in Proc. 39th ACM SIGMOD-SIGACT-SIGAI Symp. Princ. Database Syst., Jun. 2020, pp. 1–16.
[34] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, and H. Wang, ''Retrieval-augmented generation for large language models: A survey,'' 2023, arXiv:2312.10997.
[35] C.-M. Chan, C. Xu, R. Yuan, H. Luo, W. Xue, Y. Guo, and J. Fu, ''RQ-RAG: Learning to refine queries for retrieval augmented generation,'' 2024, arXiv:2404.00610.
[36] W. Fan, Y. Ding, L. Ning, S. Wang, H. Li, D. Yin, T.-S. Chua, and Q. Li, ''A survey on RAG meeting LLMs: Towards retrieval-augmented large language models,'' 2024, arXiv:2405.06211.
[37] X. Wang, Z. Wang, X. Gao, F. Zhang, Y. Wu, Z. Xu, T. Shi, Z. Wang, S. Li, Q. Qian, R. Yin, C. Lv, X. Zheng, and X. Huang, ''Searching for best practices in retrieval-augmented generation,'' 2024, arXiv:2407.01219.
[38] E. Mollard, A. Patel, L. Pham, and R. Trachtenberg, ''Improving retrieval augmented generation,'' Lab. Phys. Sci. (LPS), Univ. Maryland, College Park, MD, USA, Tech. Rep., Aug. 2024.
[39] R. Yang, ''CaseGPT: A case reasoning framework based on language models and retrieval-augmented generation,'' 2024, arXiv:2407.07913.
[40] G. Chen, W. Yu, and L. Sha, ''Unlocking multi-view insights in knowledge-dense retrieval-augmented generation,'' 2024, arXiv:2404.12879.
[41] M. Cherubini, F. Romano, A. Bolioli, L. De, and M. Sangermano, ''Improving the accessibility of EU laws: The Chat-EUR-Lex project,'' in Proc. 4th Nat. Conf. Artif. Intell. Naples, Italy: CINI, May 2024.
[42] M. A. Habib, S. M. Amin, M. Oqba, S. Jaipal, M. J. Khan, and A. Samad, ''TaxTajweez: A large language model-based chatbot for income tax information in Pakistan using retrieval augmented generation (RAG),'' in Proc. Int. FLAIRS Conf., vol. 37, May 2024, pp. 1–12.
[43] A. B. Hou, O. Weller, G. Qin, E. Yang, D. Lawrie, N. Holzenberger, A. Blair-Stanek, and B. Van Durme, ''CLERC: A dataset for legal case retrieval and retrieval-augmented analysis generation,'' 2024, arXiv:2406.17186.
[44] N. Pipitone and G. H. Alami, ''LegalBench-RAG: A benchmark for retrieval-augmented generation in the legal domain,'' 2024, arXiv:2408.10343.
[45] J.-M. Chu, H.-C. Lo, J. Hsiang, and C.-C. Cho, ''Patent response system optimised for faithfulness: Procedural knowledge embodiment with knowledge graph and retrieval augmented generation,'' in Proc. 1st Workshop Towards Knowledgeable Lang. Models (KnowLLM), Bangkok, Thailand, 2024, pp. 146–155.
[46] M. E. Mamalis, E. Kalampokis, F. Fitsilis, G. Theodorakopoulos, and K. Tarabanis, ''A large language model agent based legal assistant for governance applications,'' in Proc. Int. Conf. Electron. Government, Jan. 2024, pp. 286–301.
[47] I. Bošković and V. Tabaš, ''Proposal for enhancing legal advisory services in the Montenegrin banking sector with artificial intelligence,'' in Proc. 28th Int. Conf. Inf. Technol. (IT), Zabljak, Montenegro, Feb. 2024, pp. 1–6.
[48] C. Ryu, S. Lee, S. Pang, C. Choi, H. Choi, M. Min, and J.-Y. Sohn, ''Retrieval-based evaluation for LLMs: A case study in Korean legal QA,'' in Proc. Natural Legal Lang. Process. Workshop, Singapore, 2023, pp. 132–137.
[49] R. Kalra, Z. Wu, A. Gulley, A. Hilliard, X. Guan, A. Koshiyama, and P. Treleaven, ''HyPA-RAG: A hybrid parameter adaptive retrieval-augmented generation system for AI legal and policy applications,'' 2024, arXiv:2409.09046.
[50] T.-H.-G. Vu and X.-B. Hoang, ''User privacy risk analysis within website privacy policies,'' in Proc. Int. Conf. Multimedia Anal. Pattern Recognit. (MAPR), Da Nang, Vietnam, Aug. 2024, pp. 1–6.
[51] R. Nai, E. Sulis, I. Fatima, and R. Meo, ''Large language models and recommendation systems: A proof-of-concept study on public procurements,'' in Natural Language Processing and Information Systems (Lecture Notes in Computer Science), vol. 14763. Cham, Switzerland: Springer, 2024, pp. 280–290.
[52] A. Chouhan and M. Gertz, ''LexDrafter: Terminology drafting for legislative documents using retrieval augmented generation,'' 2024, arXiv:2403.16295.
[53] C. Niu, Y. Wu, J. Zhu, S. Xu, K. Shum, R. Zhong, J. Song, and T. Zhang, ''RAGTruth: A hallucination corpus for developing trustworthy retrieval-augmented language models,'' 2023, arXiv:2401.00396.
[54] D. Edge, H. Trinh, N. Cheng, J. Bradley, A. Chao, A. Mody, S. Truitt, D. Metropolitansky, R. O. Ness, and J. Larson, ''From local to global: A graph RAG approach to query-focused summarization,'' 2024, arXiv:2404.16130.
[55] D. Chandrasekaran and V. Mago, ''Evolution of semantic similarity—A survey,'' ACM Comput. Surv. (CSUR), vol. 54, no. 2, pp. 1–37, Feb. 2021.
[56] S. Wu, Y. Xiong, Y. Cui, H. Wu, C. Chen, Y. Yuan, L. Huang, X. Liu, T.-W. Kuo, N. Guan, and C. J. Xue, ''Retrieval-augmented generation for natural language processing: A survey,'' 2024, arXiv:2407.13193.
[57] H. Yu, A. Gan, K. Zhang, S. Tong, Q. Liu, and Z. Liu, ''Evaluation of retrieval-augmented generation: A survey,'' 2024, arXiv:2405.07437.
[58] (Jun. 2024). The Ultimate Guide to Evaluate RAG System Components: What You Need to Know. Accessed: Nov. 11, 2024. [Online]. Available: https://myscale.com/blog/ultimate-guide-to-evaluate-rag-system/
[59] L. M. Amugongo, P. Mascheroni, S. G. Brooks, S. Doering, and J. Seidel, ''Retrieval augmented generation for large language models in healthcare: A systematic review,'' Preprints, Jul. 2024, doi: 10.20944/preprints202407.0876.v1.
[60] S. Sivasothy, S. Barnett, S. Kurniawan, Z. Rasool, and R. Vasa, ''RAGProbe: An automated approach for evaluating RAG applications,'' 2024, arXiv:2409.19019.
[61] P. Domingos, ''A few useful things to know about machine learning,'' Commun. ACM, vol. 55, no. 10, pp. 78–87, Oct. 2012.
[62] S. Zeng, J. Zhang, P. He, Y. Xing, Y. Liu, H. Xu, J. Ren, S. Wang, D. Yin, Y. Chang, and J. Tang, ''The good and the bad: Exploring privacy issues in retrieval-augmented generation (RAG),'' 2024, arXiv:2402.16893.
[63] W. Li, J. Li, R. Ramos, R. Tang, and D. Elliott, ''Understanding retrieval robustness for retrieval-augmented image captioning,'' 2024, arXiv:2406.02265.
[64] Y. Zhou, Y. Liu, X. Li, J. Jin, H. Qian, Z. Liu, C. Li, Z. Dou, T.-Y. Ho, and P. S. Yu, ''Trustworthiness in retrieval-augmented generation systems: A survey,'' 2024, arXiv:2409.10102.
[65] S. Barnett, S. Kurniawan, S. Thudumu, Z. Brannelly, and M. Abdelrazek, ''Seven failure points when engineering a retrieval augmented generation system,'' in Proc. IEEE/ACM 3rd Int. Conf. AI Eng.-Softw. Eng. AI, Apr. 2024, pp. 194–199.
[66] W. Yu, H. Zhang, X. Pan, K. Ma, H. Wang, and D. Yu, ''Chain-of-note: Enhancing robustness in retrieval-augmented language models,'' 2023, arXiv:2311.09210.
[67] K. Shuster, S. Poff, M. Chen, D. Kiela, and J. Weston, ''Retrieval augmentation reduces hallucination in conversation,'' 2021, arXiv:2104.07567.
[68] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, and I. Androutsopoulos, ''LEGAL-BERT: The muppets straight out of law school,'' 2020, arXiv:2010.02559.
[69] P. Henderson, M. Krass, L. Zheng, N. Guha, C. D. Manning, D. Jurafsky, and D. E. Ho, ''Pile of law: Learning responsible data filtering from the law and a 256GB open-source legal dataset,'' in Proc. Adv. Neural Inf. Process. Syst., Jan. 2022, pp. 29217–29234.
[70] J. Niklaus, V. Matoshi, M. Stürmer, I. Chalkidis, and D. E. Ho, ''MultiLegalPile: A 689GB multilingual legal corpus,'' 2023, arXiv:2306.02069.
[71] M. Ostendorff, T. Blume, and S. Ostendorff, ''Towards an open platform for legal information,'' in Proc. ACM/IEEE Joint Conf. Digit. Libraries, Aug. 2020, pp. 385–388.
[72] I. Chalkidis, A. Jana, D. Hartung, M. Bommarito, I. Androutsopoulos, D. M. Katz, and N. Aletras, ''LexGLUE: A benchmark dataset for legal language understanding in English,'' 2021, arXiv:2110.00976.
[73] N. Chirkova, D. Rau, H. Déjean, T. Formal, S. Clinchant, and V. Nikoulina, ''Retrieval-augmented generation in multilingual settings,'' 2024, arXiv:2407.01463.
[74] S. R. El-Beltagy and M. A. Abdallah, ''Exploring retrieval augmented generation in Arabic,'' Proc. Comput. Sci., vol. 244, pp. 296–307, May 2024.
[75] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, ''Unsupervised cross-lingual representation learning at scale,'' 2019, arXiv:1911.02116.
[76] I. Chalkidis, M. Fergadiotis, and I. Androutsopoulos, ''MultiEURLEX—A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer,'' 2021, arXiv:2109.00904.
[77] R. E. Speer, J. Chin, and C. Havasi, ''ConceptNet 5.5: An open multilingual graph of general knowledge,'' in Proc. AAAI Conf. Artif. Intell., vol. 31, Feb. 2017, pp. 1–7.
[78] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. V. Le, and D. Zhou, ''Chain-of-thought prompting elicits reasoning in large language models,'' in Proc. Adv. Neural Inf. Process. Syst., Jan. 2022, pp. 24824–24837.
[79] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. P. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M. Chang, A. M. Dai, J. Uszkoreit, Q. V. Le, and S. Petrov, ''Natural questions: A benchmark for question answering research,'' in Proc. Trans. Assoc. Comput. Linguistics, vol. 7, Aug. 2019, pp. 453–466.
[80] M. Kulkarni, P. Tangarajan, K. Kim, and A. Trivedi, ''Reinforcement learning for optimizing RAG for domain chatbots,'' 2024, arXiv:2401.06800.
[81] Z. Wang, S. Xian Teo, J. Ouyang, Y. Xu, and W. Shi, ''M-RAG: Reinforcing large language model performance through retrieval-augmented generation with multiple partitions,'' 2024, arXiv:2405.16420.
[82] Y. Wu, E. Mansimov, S. M. Liao, R. Grosse, and J. Ba, ''Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation,'' in Proc. Adv. Neural Inf. Process. Syst., vol. 30, Jan. 2017, pp. 1–8.
[83] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving, ''Fine-tuning language models from human preferences,'' 2019, arXiv:1909.08593.
[84] DeepSeek-AI et al., ''DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning,'' 2025, arXiv:2501.12948.

MAHD HINDI received the B.Sc. degree in information systems technology from Abu Dhabi University, Abu Dhabi, United Arab Emirates. He is currently pursuing the M.Sc. degree with the College of Information Technology, United Arab Emirates University, Al Ain, United Arab Emirates. His current research interests include LLMs and LLM-driven solutions.

LINDA MOHAMMED received the B.Sc. degree in electrical and electronic engineering (electronic systems software engineering) from the University of Khartoum, Sudan, in 2020. She is currently pursuing the M.Sc. degree in software engineering with United Arab Emirates University, United Arab Emirates. Her current research interests include AI, machine learning, and data science.

OMMAMA MAAZ received the B.Sc. degree in computer engineering from the University of Sharjah, Sharjah, United Arab Emirates, in 2022. She is currently pursuing the M.Sc. degree with the College of Information Technology, United Arab Emirates University, United Arab Emirates. Her current research interest includes IoT systems.

ABDULMALIK ALWARAFY (Member, IEEE) received the Ph.D. degree in computer science and engineering from Hamad Bin Khalifa University, Doha, Qatar. He is currently an Assistant Professor with the College of Information Technology, United Arab Emirates University, Al Ain, United Arab Emirates. His current research interests include the application of artificial intelligence techniques across various domains, including wireless and IoT networks, as well as edge and cloud computing. He is a member of the IEEE Communications Society. He has served on the technical program committees of many international conferences. In addition, he has been a reviewer for several international journals and conferences.