
Received 14 February 2025, accepted 6 March 2025, date of publication 11 March 2025, date of current version 20 March 2025.

Digital Object Identifier 10.1109/ACCESS.2025.3550145

Enhancing the Precision and Interpretability of
Retrieval-Augmented Generation (RAG) in
Legal Technology: A Survey

MAHD HINDI, LINDA MOHAMMED, OMMAMA MAAZ,
AND ABDULMALIK ALWARAFY, (Member, IEEE)
Department of Computer and Network Engineering, College of Information Technology, United Arab Emirates University, Al Ain, United Arab Emirates
Corresponding author: Abdulmalik Alwarafy ([email protected])
This work was supported by United Arab Emirates University (UAEU) under Grant 12T047.

ABSTRACT Retrieval-Augmented Generation (RAG) is a promising solution that can enhance the
capabilities of large language model (LLM) applications in critical domains, including legal technology,
by retrieving knowledge from external databases. Implementing RAG pipelines requires careful attention to
the techniques and methods implemented in the different stages of the RAG process. A robust RAG pipeline
can enhance LLM generation with faithful responses and few hallucinations. In this paper, we discuss
the application of RAG in the legal domain. First, we present an overview of the main RAG methods, stages,
techniques, and applications in the legal domain. We then briefly discuss the different information retrieval
models, processes, and applied methods in current legal RAG solutions. Then, we explain the different
quantitative and qualitative evaluation metrics. We also describe several emerging datasets and benchmarks.
We then discuss and assess the ethical and privacy considerations for legal RAG, summarize various
challenges, and propose a challenge scale based on RAG failure points and control over external knowledge.
Finally, we provide insights into promising future research to leverage RAG efficiently and effectively in the
legal field.

INDEX TERMS Information retrieval, large language model (LLM), legal technology, prompt engineering,
retrieval-augmented generation (RAG).

I. INTRODUCTION

Legal technology, also referred to as legal tech, emerged around 2010 as a set of technological solutions and tools used in the legal domain to support legal services delivered across different sectors to ordinary users and legal professionals, including lawyers and other legal practitioners [1]. Technological innovations have influenced the evolution of legal tech solutions, enabling them to provide high-quality services in terms of efficiency, transparency, cost, and time [2]. This evolution began with digitalizing legal content, followed by automating routine legal tasks, and is now moving toward advanced Artificial Intelligence (AI) integration [3].

Natural Language Processing (NLP) has enabled many applications in the legal sector, covering a wide range of areas, e.g., legal research, electronic discovery, contract review, document automation, and legal advice [4]. Innovations in Large Language Model (LLM) techniques following the emergence of transformer models have achieved superior performance, parallelizability, and faster training times for sequence transduction tasks compared with traditional recurrent or convolutional neural networks [5], [6], [7], [8], [9], thereby opening the door to more powerful LLM-driven applications in various domains, including the legal domain. However, in legal applications, LLMs have exhibited high hallucination rates [10]. To address this issue, LLMs have been enriched with prompt engineering [11], fine-tuning processes [12], [13], or retrieval-augmented generation (RAG) [14] to obtain better results in terms of precision and hallucination.

The associate editor coordinating the review of this manuscript and approving it for publication was Hai Dong.
2025 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License.
VOLUME 13, 2025 For more information, see https://creativecommons.org/licenses/by/4.0/ 46171
M. Hindi et al.: Enhancing the Precision and Interpretability of RAG in Legal Technology: A Survey

TABLE 1. List of acronyms provided in the paper.

RAG was proposed to enhance generators to achieve less hallucination and to offer more interpretability and control [14]. Different studies have shown that RAG outperforms fine-tuning both for knowledge encountered during training and for entirely new knowledge [15], [16]. Thus, since RAG was first introduced in 2020 by Lewis et al. [14], different RAG systems have been rapidly developed for various domains, including the legal domain. The most powerful feature of RAG is its ability to adapt to recent or domain-specific external knowledge and to dynamically retrieve relevant information from external sources during the generation process [17]. Numerous legal applications have demonstrated that a combination of RAG and fine-tuning methods performs well [18], [19], [20], [21].

In 2024, more than 20 legal RAG pipelines were implemented using various embedding, retrieval, enhancement, and generation methods. These pipelines have frequently been integrated with other approaches, e.g., prompt engineering, which is essential for all RAG pipelines, knowledge graphs (KG) [22], and fine-tuning (FT) [18], [19], [20], [21], [23], [24], or embedded within multi-agent frameworks [25]. In addition, legal RAG pipelines span various applications across the legal domain, ranging from specialized systems focused on specific legal fields to more comprehensive legal platforms.

A. CONTRIBUTION AND ORGANIZATION
Given the absence of a comprehensive literature survey on RAG systems within the legal domain, this paper attempts to bridge this gap by providing a detailed overview of all currently available RAG methods relevant to the legal domain. The primary contributions of this paper are summarized as follows.
• We highlight various techniques utilized in RAG methods specific to the legal field.
• We examine how these methods contribute to improvements in accuracy and interpretability.
• Our analysis of RAG methods and techniques in the legal field offers insights into various applications, methodologies, evaluations, datasets, and benchmarks.
• We extensively outline and describe open challenges for RAG applications in the legal domain and provide deep insights into promising future research directions.
The findings of this work are expected to guide legal tech researchers who aim to use cutting-edge technology to optimize LLM-driven legal applications and practices for various tasks. In addition, this study will serve as a contemporary reference for RAG methods in the legal field.

The remainder of this paper is organized as follows: Section II provides an overview of RAG methods, main RAG stages, and techniques, as well as a classification of legal RAG methods, applications, and datasets. Section III explores advanced methods that improve retrieval accuracy in legal RAG systems, addressing the unique needs of legal information retrieval tasks. Section IV explains the quantitative and qualitative metrics used to analyze retrievers and generators in RAG systems, and Section V describes relevant emerging datasets and benchmarks. Section VI evaluates ethical and privacy considerations in legal RAG systems. Section VII focuses on the main challenges of legal RAG systems. Section VIII provides insights into promising research directions. Finally, the paper is concluded in Section IX.

II. RAG IN THE LEGAL DOMAIN
A. OVERVIEW OF RAG
RAG comprises three key processes: the information retrieval (IR), augmentation, and generation processes. Many processes and techniques are applied before and after the IR step to enhance it and its outcomes. The IR process is a critical element in the RAG framework: the generator will likely produce poor outcomes if the retrieved data are inaccurate or inconsistent with the query, and a powerful IR method can outperform a combined IR + LLM setup [23]. Thus, the IR process is the backbone of the entire RAG pipeline [27]. IR techniques have improved over decades, from traditional sparse IR techniques [28], e.g., BM25 [29] and TF-IDF [30], to more advanced dense Transformer-based embedding models, e.g., DPR [31]. Transformer-based retrieval methods outperform nonneural methods, specifically for legal document retrieval tasks [32]. In addition, KG embeddings have been integrated with IR embeddings to optimize the IR process [33]. Advanced RAG systems involve pre- and post-retrieval enhancements to enrich the IR process and produce accurate and precise results [34].
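To make the sparse-retrieval baseline concrete, the Okapi BM25 ranking function mentioned above can be sketched in a few lines. The toy documents and parameter values below are invented for illustration; this is not code from any surveyed system:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # document frequency of each query term
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue  # term appears in no document: no contribution
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            f = tf[t]
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "the court dismissed the appeal".split(),
    "the contract was breached by the supplier".split(),
    "appeal against the judgment of the lower court".split(),
]
scores = bm25_scores("appeal court".split(), docs)
print(scores)  # documents 1 and 3 match; document 2 scores 0
```

Dense retrievers replace this lexical score with similarity between learned embeddings, which is why they capture semantic matches that BM25 misses.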


FIGURE 1. Paper organization.

The IR process can be optimized based on the complexity of the problem and the required reasoning steps. An effective approach is to apply different IR methods rather than relying on a single IR process; selecting an appropriate IR process depends on the complexity of the task and the required reasoning steps [34]. In addition, many methods and techniques are used to optimize IR, e.g., the embedding model and query enhancements [35]. The chunking method, which plays a pivotal role in IR, is influenced by the nature of the dataset to be retrieved, how the texts are organized in the dataset, what will be retrieved from the dataset, and how important semantic similarity is to the IR process. The precision of the IR process is strongly dependent on the selection of the chunking strategy [34], [36], [37], [38].

The augmentation process, which integrates the retrieved information and query fragments into the LLM, can be performed in three ways: in the input, output, or intermediate layers of the generator [36]. KG embeddings can enrich the prompt for more accurate response generation by integrating the retrieved triplets with the retrieved chunks and the original user query, as proposed in [22] and illustrated in Fig. 2. Prompt engineering with one or few shots has demonstrated more accurate responses compared with zero-shot prompting [23].

In the generation process, the LLM can be retrained or fine-tuned on legal data using a parameter-accessible LLM, or the RAG pipeline may utilize a parameter-inaccessible "frozen" LLM [34], [36].

B. RAG APPLICATIONS IN THE LEGAL DOMAIN
Despite RAG being introduced in 2020 [14], it was not applied to the legal domain until 2023. The first research paper was by Shui et al. [23], who employed RAG to predict legal judgments. While they did not use the term RAG explicitly, they did introduce a RAG system and referred to the process as "LLMs coordinate with IR (LLM + IR)."

We conducted an extensive literature review by examining relevant papers from four major academic databases (SCOPUS, IEEE Xplore, Web of Science, and Google Scholar), in addition to preprints posted on arXiv. The query we used to search for related papers included three main terms. The first term included all keywords related to the legal field, including "legal", "legal case", "judiciary", "judicial", and "law". The second term included all keywords related to RAG, including "retrieval", "augmented", and "generation". The third term included all keywords related to LLMs, including "LLM", "transformer model", and "generative AI". As RAG was first introduced in 2020, the search was restricted to articles published between 2020 and 2024. The gathered papers were then screened based on titles, abstracts, and keywords to determine relevant articles for further analysis. As this field is newly emerging, the final selection includes only 22 papers.

As shown in Fig. 3, RAG research experienced a notable surge in 2024, spanning various applications within the legal domain. The RAG methods proposed in the legal field address various areas, e.g., privacy law, legislative texts, public law, criminal law, statutory law, and immigration law. These applications are employed in various systems, including question answering, recommendation, legal advice, case reasoning, legal chatbots, digital assistants, and legal judgment prediction.
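The three-term search strategy described above can be expressed programmatically. The sketch below reconstructs an illustrative boolean query from the listed keywords; the exact syntax the authors submitted to each database is not given in the paper:

```python
# Keyword groups taken from the survey's stated search strategy.
legal_terms = ["legal", "legal case", "judiciary", "judicial", "law"]
rag_terms = ["retrieval", "augmented", "generation"]
llm_terms = ["LLM", "transformer model", "generative AI"]

def build_query(*term_groups):
    """AND together OR-groups of quoted keywords, Scopus/Web-of-Science style."""
    groups = ["(" + " OR ".join(f'"{t}"' for t in g) + ")" for g in term_groups]
    return " AND ".join(groups)

query_string = build_query(legal_terms, rag_terms, llm_terms)
print(query_string)
```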


In addition, three datasets have been introduced recently to provide benchmarking solutions for developing and evaluating RAG models. Detailed information about the legal RAG methods and relevant applications is given in Table 2.

The common objectives of legal RAG systems are to enhance efficiency, accuracy, accessibility, and contextual understanding and to improve legal and regulatory services targeting different fields in the legal domain. In addition, RAG methods in the legal domain attempt to address the limitations of previous LLM-driven systems in terms of hallucination [18], [25] and outdated knowledge [49]. They also improve IR [20], [44], support legal accessibility and usability [24], and automate complex legal processes [45], [52].

FIGURE 2. RAG pipeline enriched with knowledge graph and LLM fine-tuning (inspired and elicited from legal RAG systems).

FIGURE 3. Trends in legal RAG methods since 2020.

1) CASE STUDY: SEMANTIC INTERLINKING OF IMMIGRATION DATA USING LLMS AND RAG FOR KNOWLEDGE GRAPH CONSTRUCTION
The study on semantic interlinking of immigration data using LLMs and RAG presents a transformative approach to managing complex legal data, specifically in the context of the U.S. Adjustment of Status (AOS) process [22]. The framework addresses inefficiencies in processing heterogeneous, paper-based immigration records by converting them into structured, interconnected knowledge graphs (KGs). By employing advanced key-value mapping strategies and integrating RAG techniques with LLMs, the system enables accurate extraction of entities and relationships from legal documents. The constructed KGs, ingested into Neo4j, provide a detailed representation of the AOS process, allowing legal professionals to retrieve context-aware and semantically enriched insights with simple English queries. This innovation enhances decision-making and data management while maintaining data privacy through the potential use of local LLMs. The results demonstrate how this methodology simplifies the complexities of legal processes, offering a scalable and adaptable solution for immigration and other legal domains.

C. LEGAL RAG METHODS
Legal RAG systems are built on a variety of architectural designs, each tailored to handle the complexity of legal text retrieval, augmentation, and generation. While some frameworks employ a standard retrieval-then-generation pipeline, others integrate more advanced mechanisms such as iterative retrieval, knowledge graph embeddings, and adaptive augmentation. These enhancements refine the retrieval process, improve factual grounding, and optimize generative fluency.

A comprehensive RAG pipeline in the legal domain is illustrated in Fig. 2. Here, the RAG system is enriched with a fine-tuned generator and the embedding of a KG. The user query and legal documents are preprocessed, chunked, embedded, and stored in a vector database, and the similarity between the query and the chunks is calculated. Relevant chunks are then retrieved, integrated, and passed to the fine-tuned or pre-trained (i.e., frozen) LLM, which can be enriched with the KG triplets using a structured prompt, and the response is generated accordingly. Fig. 2 summarizes all currently surveyed RAG systems in the legal domain.

It is essential to incorporate appropriate strategies and optimization at each pipeline stage to build an effective legal RAG system. The following sections explore key aspects, including retrieval sources, models, augmentation mechanisms, generation methods, and training approaches.

TABLE 2. Legal RAG methods and applications.

1) RETRIEVAL SOURCE
The effectiveness of the RAG process is strongly dependent on the external knowledge, also referred to as nonparametric memory [14], and on the processes performed on the external dataset prior to retrieving the required information [44]. The strength of RAG lies in its ability to retrieve the text chunks/segments most relevant to the query or the user's question from the knowledge source, thereby enhancing the LLM's ability to produce the most accurate and appropriate response. Conversely, it will be difficult for the RAG system to produce a response if the information required for the relevant query is not present in the dataset; in some cases, the LLM will provide a hallucinated response [53] or no response [54]. Thus, the external knowledge should be complete, including all information relevant to the application.

The dataset retrieval process can be closed-source, drawing from a specific dataset, which is more suitable for domains that do not involve rapid changes in knowledge, e.g., the legal domain in most cases. Alternatively, the retrieval process can be open-source, where data are retrieved directly from the Internet and other sources, which is more suitable for applications and domains with rapidly changing knowledge [36]. Thus, as expected, most legal RAG systems employ the closed-source retrieval mechanism. Table 3 summarizes the retrieval dataset types and their corresponding datasets. Additional details about emerging datasets in the legal domain are discussed in Section V.

2) EMBEDDING MODELS
Embedding models convert legal text into high-dimensional vector representations, allowing retrieval models to measure similarity between queries and legal documents [51]. These models capture the meaning of text beyond simple keyword matching, making them essential for dense retrieval.

The most common embedding models used for dense retrievers are transformer-based embeddings, and legal RAG systems primarily employ BERT and BERT-based models along with OpenAI's ADA-002 model. In addition, customized models have used non-English embeddings, e.g., bge-large-zh-v1.5, text2vec, and bge-m3 for Chinese dataset embedding [40], and embed-multilingual-v3.0 for Italian dataset embedding [51]. Some RAG pipelines enrich the model with KG embeddings, which enhance the retriever and make the IR process more interpretable [22], [25], [49]. Detailed information on transformer-based embedding models is provided in Table 5.

3) RETRIEVAL METHODS
Transformer-based dense retrieval methods are mainly used in legal RAG applications, and cosine similarity is the most commonly used search algorithm for the retriever. In addition, sparse retrieval has been tested experimentally [43] and outperformed some dense retrievers; sparse retrieval is employed in three pipelines using BM25 as a ranking model [23], [50], [52]. A hybrid approach combining sparse and dense retrievers has also been employed in a legal RAG pipeline [49]. As discussed in Section II-A, selecting an appropriate IR method depends on various factors related to the nature of the IR source and the task at hand. For example, in complex legal scenarios [18], dense retrievers are more effective than sparse retrievers in terms of capturing semantic similarity [55].


TABLE 3. Overview of RAG systems and their associated datasets.

FIGURE 4. Dense retrieval approach used in most legal RAG methods.

Additional information about advanced methods that improve the retrieval process in legal RAG systems is given in Section III. Fig. 4 illustrates an advanced dense retrieval approach, and the best-performing transformer-based models in the retrieval process of the surveyed legal RAG methods are summarized in Table 4.

FIGURE 5. Iterative retrieval process with n retrievals.

4) RETRIEVAL PROCESS
One-time retrieval is the most common IR process in legal RAG pipelines. The iterative retrieval process used in previous studies [20] and [25] is illustrated in Fig. 5, where retrieval is repeated n times until a predefined threshold is met. The adaptive retrieval process (Section III) has also been used in [18] and [49] (Fig. 6), where the RAG system can determine whether to initiate the retriever based on a predefined threshold.

5) AUGMENTATION MECHANISM
The technique of combining the retrieved chunks/segments with the original query in a structured prompt and passing it to the input layer of the generator (the LLM) was used in all the surveyed methods. In addition, some methods applied prompt engineering techniques with one-shot or few-shot prompting to enhance the LLM's ability to generate more accurate responses [21], [23].

6) GENERATION
GPT-4 is the LLM primarily used as a generator in legal RAG pipelines, followed by LLaMA-based LLMs; both are among the best-performing generators in the proposed pipelines.


TABLE 4. Best-performing transformer-based models in the retrieval stage in legal RAG systems.

FIGURE 6. Adaptive retrieval process, wherein RAG has the ability to activate/deactivate the retriever.

OpenAI LLMs are mainly used because they can generate accurate responses, and OpenAI models have outperformed the Cohere models [51]. Seven of the legal RAG methods surveyed in this study employed FT processes on parameter-accessible LLMs to enhance their performance. It is important to note that combining FT processes with RAG can fail if the retrieval does not perform accurately [43]. Table 5 provides additional information about the best-performing LLMs.

7) TRAINING APPROACHES FOR LEGAL RAG MODELS
Training legal RAG models involves optimizing retrieval accuracy, augmentation strategies, and generative fluency through fine-tuning and parameter-efficient adaptations [21], [49]. Some systems train the retrieval and generation modules separately, while others fine-tune them jointly to improve overall performance [56].

Legal factor recognition training is a key component in certain RAG frameworks such as DRAG-BILQA. In this approach, the model first estimates its confidence in generating an accurate response. If the confidence score meets a predefined threshold, the response is output directly; otherwise, additional retrieval is triggered to refine the input context, thereby enhancing answer accuracy [18].

Dense retrieval models (e.g., DPR, ColBERT) are typically trained using contrastive learning to improve query-document matching. DRAG-BILQA employs a dual-encoder recall model for initial retrieval and a cross-encoder re-ranking model to refine the set of retrieved legal passages. The recall model is trained with in-batch negative sampling so that positive legal document embeddings are closer to their corresponding queries than negative samples. Common hyperparameter settings include batch sizes of 32-64, learning rates between 1e-5 and 3e-5, and training epochs ranging from 2 to 10 (see also Section II for further details on contrastive learning) [18], [21].

Generation model fine-tuning adapts LLMs for domain-specific legal text generation. Parameter-efficient tuning methods such as QLoRA and prompt tuning are widely adopted to minimize computational costs. For example, DRAG-BILQA fine-tunes ChatGLM2-6B using QLoRA with a LoRA rank of 8, a dropout rate of 0.1, a maximum sequence length of 512, and a batch size of 16 at a learning rate of 1e-5. Additionally, prompt tuning is performed with a batch size of 8 over 4 steps to optimize prompt-specific learning. In contrast, full fine-tuning is applied in specialized cases, such as adapting CamemBERT for case law reasoning using a batch size of 32, a learning rate of 2e-5, a weight decay of 0.01, and 20 training epochs with the AdamW optimizer [18], [21].

While Section II outlines the core components of legal RAG systems, including retrieval, augmentation, and generation, the overall performance depends critically on both the architectural design and the precision of the retrieval mechanisms. Advanced methods that integrate multiple datasets and optimization techniques are further discussed in Section III.
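The in-batch negative sampling used to train the dual-encoder recall model can be sketched as an InfoNCE-style loss: each query's positive document sits at the same batch index, and all other documents in the batch act as negatives. The pure-Python version below, with two-dimensional toy embeddings, is illustrative only:

```python
import math

def in_batch_contrastive_loss(query_embs, doc_embs, temperature=0.05):
    """InfoNCE-style loss with in-batch negatives. Lower loss means each
    query scores its own positive document above the other documents."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    loss = 0.0
    for i, q in enumerate(query_embs):
        logits = [dot(q, d) / temperature for d in doc_embs]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_denom)  # negative log-softmax of the positive
    return loss / len(query_embs)

queries = [[1.0, 0.0], [0.0, 1.0]]
docs_aligned = [[0.9, 0.1], [0.1, 0.9]]   # positives match their queries
docs_shuffled = [[0.1, 0.9], [0.9, 0.1]]  # positives mismatched
loss_aligned = in_batch_contrastive_loss(queries, docs_aligned)
loss_shuffled = in_batch_contrastive_loss(queries, docs_shuffled)
print(loss_aligned, loss_shuffled)  # aligned batch yields the lower loss
```

In practice this loss is minimized with gradient descent over the encoder parameters; hard-negative variants replace some in-batch negatives with deliberately confusable documents.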


TABLE 5. Summary of legal RAG methods and techniques.

III. INNOVATIONS IN RETRIEVAL PRECISION
In this section, we discuss advanced methods used to improve retrieval precision and relevance in legal RAG systems. These innovations address the challenges inherent in legal IR, including the complexity of legal language, domain specificity, and the need for contextually accurate responses.

A. ENHANCEMENTS IN RETRIEVAL METHODS
1) PRERETRIEVAL OPTIMIZATION
In RAG systems, it is essential to efficiently retrieve relevant documents from the data source. Thus, the preretrieval stage employs several techniques to enhance the accuracy of retrieved information; these include data granularity adjustments, indexing enhancement, and query formulation, along with the selection of a suitable embedding model. Collectively, these techniques facilitate highly precise and structured retrieval of information, which is crucial for legal RAG systems due to the inherent complexity and specificity of legal data [36].

In addition, KG integration transforms conventional data handling approaches and plays a crucial role in improving retrieval precision by structuring legal data into interconnected entities and relationships. For example, combining KGs with LLMs allows retrieval systems to generate context-aware responses using KG triplets as additional context [22], [45], [49]. The PPNet framework encodes legal relationships from judicial sources into a KG, which improves the accuracy of responses [45]. Furthermore, hybrid systems, e.g., HyPA-RAG, utilize KG triplets alongside dense and sparse methods for adaptive query tuning [49].

Query rewriting enhances the retrieval process by reformulating user inputs to better align with indexed data while incorporating related concepts that users may not mention explicitly but that are contextually relevant. Some frameworks, e.g., the PA-RAG framework, adapt queries dynamically by selecting the number of rewrites and the number of top related retrieved chunks (K) based on query complexity [49]. In addition, the multiview RAG (MVRAG) framework introduces intention-aware query rewriting and leverages multiple domain viewpoints to refine queries in knowledge-dense contexts [40].

In terms of chunking strategies, dividing long legal documents into smaller, manageable chunks improves retrieval precision. Chunk-based embedding strategies ensure that contextual details are preserved within smaller text fragments, which reduces the noise associated with embedding the entire document [47]. HyPA-RAG employs multiple chunking techniques, including sentence-level, semantic, and pattern-based chunking, to balance token constraints and context. Pattern-based chunking using corpus-specific delimiters has been shown to achieve the best retrieval precision, with top scores in context recall and faithfulness. Sentence-level chunking excels in context precision and F1 scores and is thus suitable for precise retrieval tasks. In addition, unless heavily tuned, semantic chunking underperforms compared with simpler methods [49].

An efficient indexing mechanism is essential for rapid similarity searches in high-dimensional spaces. To balance search speed and accuracy, CASEGPT employs the Hierarchical Navigable Small World (HNSW) algorithm, a state-of-the-art indexing technique. In addition, the system implements an incremental indexing mechanism to support real-time updates, which facilitates the seamless integration of new cases without requiring complete reindexing [39].


2) POST-RETRIEVAL STRATEGIES

The post-retrieval procedure ensures that the retrieved information is presented appropriately and efficiently. Legal documents frequently contain nuanced details that are critical for accurate interpretation; thus, post-retrieval innovations focus on organizing and refining the retrieved content. In the reranking process, rearranging document chunks is essential to reduce the total document pool. Reranking serves as both a filter and an enhancer in IR, providing more accurate input for the language model [36]. Legal RAG systems employ various reranking techniques to enhance retrieval precision. For example, PRLLM reranks the retrieved paragraphs according to their significance, prioritizing the paragraph most critical to the examiner, followed by other comparable passages [45]. CaseGPT implements a multifactor approach that integrates domain-specific factors, including case recency, citation frequency, and jurisdictional relevance. In addition, it balances relevance and diversity in the retrieved cases using a diversity-aware retrieval process based on the maximum marginal relevance method [39]. Similarly, in ChatLaw, reranking is performed using a two-step evaluation process: an LLM first assesses each document's relevance to the query, and a critic model then refines the results iteratively by reviewing and selecting the best content. This iterative process ensures that the final response is self-assessed, optimized, and highly relevant [25]. Another approach is used in MVRAG, where documents are reranked based on a recalculated relevance score that integrates multiperspective alignment [40].

B. ADVANCED TECHNIQUES IN RETRIEVAL MODELS
As legal tasks become increasingly complex, advanced retrieval techniques are required to improve the precision and relevance of the results. These techniques are designed to handle the challenges of legal IR tasks by adapting to the nuances of legal queries.

1) HYBRID RETRIEVAL APPROACHES
Sparse and dense embedding approaches capture distinct relevance features. Sparse embeddings are particularly useful for tasks that depend on keyword matching and require high precision, focusing on specific words or phrases. In contrast, dense embeddings excel in tasks that demand semantic similarity and contextual understanding because they generate continuous vector representations that capture nuanced meanings beyond surface-level word matching. By adopting hybrid retrieval models, systems can leverage the advantages of both approaches: the precision of sparse embeddings is combined with the contextual depth of dense embeddings to improve overall retrieval accuracy and relevance [36]. For example, HyPA–RAG adopts a hybrid search engine that combines embedding (dense and sparse) and KG retrieval methods to improve retrieval accuracy [49].

2) ADAPTIVE RETRIEVAL MECHANISMS
Adaptive retrieval mechanisms optimize the retrieval process by adjusting to the complexity of the given query. For example, HyPA–RAG employs a domain-specific query complexity classifier that categorizes queries by complexity, which helps the system select the most appropriate retrieval strategy, e.g., the number of subqueries to generate and the top-k number of documents to retrieve. This approach ensures that the retrieval process is efficient while maintaining relevance to the legal context [49]. The dynamic RAG framework for border inspection legal question answering (LQA) controls the retrieval process dynamically based on confidence scores: after generating an initial response, the system calculates a confidence score over the relevant legal factors, and if the confidence score is low, the system triggers an additional retrieval process to enhance the response [18]. The adaptive retrieval process is illustrated in Fig. 6.

3) FINE-TUNING RETRIEVAL MODELS
Fine-tuning retrieval models is essential for aligning embeddings with legal-domain-specific data, particularly when the target context diverges considerably from the general domain. For example, HyPA–RAG fine-tunes its DistilBERT model on legal corpora, and CaseGPT adopts a fine-tuned version of Legal-BERT to achieve enhanced retrieval performance [39], [49]. In addition, CamemBERT is fully fine-tuned on the long-form LQA (LLeQA) dataset to improve its ability to handle complex legal queries [21]. Table 4 shows the best Transformer-based retrievers, of which four models are enhanced by fine-tuning.

4) ADVANCED SAMPLING STRATEGIES
In contrastive learning, techniques such as negative sampling and hard negative sampling play important roles in training the retrieval model to better distinguish between relevant and irrelevant documents. Negative sampling exposes the model to relevant and irrelevant pairs of documents, whereas hard negative sampling challenges the model by selecting tough negative examples, thereby improving the model's ability to classify documents accurately. These strategies enable the model to learn a low-dimensional, high-quality embedding space in which relevant question–provision pairs are placed closer together than irrelevant ones [21].

IV. EVALUATION OF RETRIEVAL AND GENERATION QUALITY
Despite the growth in RAG-related research, we found only a few articles outlining state-of-the-art techniques in the legal domain. In most state-of-the-art systems, the evaluation of RAG is divided into two parts, i.e., retrieval evaluation and response evaluation. The metrics used to evaluate the effectiveness of the retrieval process include precision, recall, mean reciprocal rank (MRR), and mean


average precision (MAP). Precision is the fraction of relevant instances among the retrieved instances, and recall is the fraction of relevant instances retrieved from the total number of relevant cases. MRR is the average of the reciprocal ranks of the first correct response over a set of queries, and MAP is the mean of the average precision scores for each query [57]. These four metrics can be used to evaluate how effectively a retriever identifies and ranks relevant documents in response to the user's query [58].

For response evaluation, the primary goal is ensuring that the response is relevant to the user query and avoids hallucinations. Generation metrics, e.g., METEOR [18], [21], the bilingual evaluation understudy (BLEU) [52], and the recall-oriented understudy for gisting evaluation (ROUGE) [43], are used to determine the response quality of RAG systems in the legal domain. METEOR combines precision, recall, and sentence fluency to calculate the similarity between automatically generated and reference responses, evaluating the effectiveness of text generation tasks. BLEU measures the overlap between a generated response and a set of reference responses by focusing on the precision of n-grams. Finally, ROUGE counts the number of overlapping units, e.g., n-grams, word sequences, and word pairs, between the generated and reference responses, considering both recall and precision. Utilizing these metrics to evaluate retrieval and generation tasks helps build a robust, efficient, and user-centric RAG system. However, there are often no ground-truth answers to queries; thus, the focus of the evaluation has shifted to quantitative aspects, wherein the retriever and the generator are evaluated separately [59]. In other words, RAG systems generate unstructured text, which means that both qualitative and quantitative metrics are required to assess their performance accurately. Therefore, we adopted an approach similar to Table 5 of [59], which lists the evaluation metrics used for RAG-based systems in the medical domain and the ethical principles considered in surveyed studies. By adopting these metrics and considerations, we created Table 7, which shows the evaluation metrics employed in surveyed RAG-based studies in the legal domain. Specifically, we reviewed 22 papers to check for references to five evaluation metrics and assess their usage in RAG-based legal applications. The five evaluation metrics were correctness, completeness, faithfulness, fluency, and relevance (context relevance and answer relevance). Correctness means that the response generated by the RAG system must perfectly align with the expected response or be a relevant statement that conveys the same information [60]. Completeness refers to RAG-generated responses that are comprehensive and cover all aspects of the anticipated response. Faithfulness indicates that the response must be grounded in the provided context; RAG systems are frequently utilized in contexts where the factual accuracy of the generated text with respect to the grounding sources is highly significant, e.g., law [26]. Fluency is the ability of a RAG system to generate readable and clear text. Finally, relevance comprises two parts, i.e., context relevance, which checks whether the context of the retrieved information is relevant to the query, and response relevance, which indicates how relevant the generated response is to the given query.

Table 6 shows the quantitative metrics used for each evaluation aspect. These metrics, obtained from the surveyed studies, are traditional indicators and do not yet represent a standardized framework for quantifying the quality aspects of RAG systems [34]. Metric definitions are summarized in Table 9.

Note that there is no standardized evaluation method for RAG systems, and various frameworks utilize different metrics. One such dedicated framework is RAG assessment (RAGAs) [26]. RAGAs has been employed in previous studies [42], [49] to assess the performance of a RAG pipeline considering four factors, i.e., faithfulness, relevance to the query, relevance to the context, and recall of the context. The RAGAs framework was designed to serve as a universal standard for assessing RAG pipelines without requiring access to ground truths. The system uses OpenAI's GPT-4 to determine a score ranging from 0 to 1 for each of the four metrics, and the RAGAs score is calculated as the average of the assigned scores.

V. EMERGING DATASETS AND BENCHMARKS
The advancement of RAG systems in legal technology heavily relies on high-quality datasets that enable effective retrieval, reasoning, and interpretability. Several benchmark datasets have been introduced to improve legal question answering (LQA), legal information retrieval (IR), and case law analysis. This section reviews key datasets and benchmarks that support the development of RAG systems in the legal domain.

LQA datasets play a crucial role in training and evaluating models that generate precise legal responses. For example, the BorderLegal-QA dataset [18] is specialized for legal queries related to border inspections, and it contains 1,329 question–answer pairs covering 51 types of questions. The goal is to offer expertly curated question–answer pairs that are applicable to realistic border inspection situations. In addition, the JEC-QA dataset is a collection of multiple-choice questions from the National Unified Legal Professional Qualification Examination. This dataset contains a total of 26,365 questions, and it acts as a standard to assess legal QA systems. The CJRC dataset was created from real-world accounts in Chinese court records, and it contains approximately 10,000 documents and nearly 50,000 question–answer pairs covering a wide range of reasoning scenarios. The CAIL2020 and CAIL2021 datasets [18] target the reasoning skills required to answer legal questions: the CAIL2020 dataset contains 10,000 legal documents, and the CAIL2021 dataset presents multisegment questions with approximately 7,000 question–answer pairs. The Open Australian Legal Question-Answering Dataset [20] contains more than 2,100 question–answer–snippet triplets generated by GPT-4 using the Open


TABLE 6. Quantitative metrics for each quality aspect.

TABLE 7. Metrics to evaluate RAG-based systems in the legal domain. We assess whether the ethical principles of privacy, safety, robustness, bias, and trust are considered.
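To make the generation metrics of Section IV more concrete, the following is a minimal, illustrative sketch of a ROUGE-N-style n-gram overlap score. It is a simplification for exposition only; real evaluations should use an established implementation such as the rouge-score package.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset of n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(generated, reference, n=1):
    """Simplified ROUGE-N: clipped n-gram overlap between a generated
    and a reference response, reported as recall, precision, and F1."""
    gen = ngrams(generated.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((gen & ref).values())  # clipped matches, per n-gram type
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(gen.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return recall, precision, f1

# Toy example (made-up sentences, not from any dataset).
r, p, f = rouge_n("the court granted the motion",
                  "the court denied the motion")
```

In this toy pair, four of the five unigrams match, so recall and precision both come out at 0.8; BLEU differs mainly in that it is precision-oriented and combines several n-gram orders.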

Australian Legal Corpus. This dataset allows LLMs to enhance their skills when answering legal questions. The LLeQA dataset [21] was created to help develop models that can provide in-depth responses to legal questions in French. This dataset comprises 1,868 legal questions annotated by experts with detailed answers based on applicable legal provisions sourced from a collection of 27,942 statutory articles. The LLeQA dataset improves on previous work by adding new kinds of annotations, e.g., a comprehensive taxonomy of questions, jurisdiction information, and specific references at the paragraph level, which makes it a versatile resource for advancing research in LQA and related legal tasks.

Legal IR datasets are critical for evaluating the retrieval precision of legal RAG systems. For example, the Chatlaw Legal Dataset [25] contains about 4 million data samples in 10 main categories and 44 minor categories. This dataset covers different legal areas, e.g., case classification, statute prediction, and legal document drafting, as well as specialized tasks, e.g., public opinion analysis and named entity recognition. This variety guarantees thorough coverage of legal processing tasks. The Case Law Evaluation and Retrieval Corpus [43] is the main dataset created from digitized case law retrieved from the Caselaw Access Project by Harvard Law School. This platform contains more than 1.84 million federal case documents and was created for


IR and RAG tasks. In addition, the Chat-EUR-Lex dataset was created specifically for the Chat-EUR-Lex project [41] to improve the accessibility of European legal information using chat-based LLMs and RAG. The EUR-Lex repository contains approximately 37,000 legal acts in English and Italian, which are divided into approximately 371,000 texts or "chunks" to improve search results. Note that this dataset excludes documents without XML or HTML data and corrections, which guarantees both quality and significance. The main goal is to help create a conversational interface that offers simplified explanations of complicated legal documents and allows customized interactions for users requiring legal information. The specialized LeCaRDv2 dataset [40] is uniquely curated for legal case retrieval and is known for its thorough selection of legal cases and careful methodology. It functions as a standard to assess legal retrieval systems, and it covers various legal topics and situations. This dataset contains in-depth case descriptions and is organized to help test retrieval models, especially on complex and uncommon legal cases, ultimately improving the functioning and comprehension of legal IR systems.

LegalBench-RAG [44] is a comprehensive benchmark constructed from four primary datasets. The ContractNLI dataset focuses on NDA-related documents and contains 946 entries. The Contract Understanding Atticus Dataset includes private contracts and has a total of 4,042 entries. The Mergers and Acquisitions Understanding Dataset (MAUD) comprises M&A documents from public companies, with a total of 1,676 entries. Finally, the Privacy QA dataset comprises the privacy policies of consumer applications, with a total of 194 entries. Together, these datasets contribute to a robust corpus of legal documents, amounting to approximately 80 million characters across 714 documents, and they form the basis for the 6,889 question–answer pairs that constitute the LegalBench-RAG benchmark.

While the datasets mentioned above all contribute to legal RAG applications, they differ in terms of structure, purpose, and impact on RAG performance. The following comparisons highlight key distinctions:

Legal Question-Answering vs. Case Law Retrieval: Datasets like JEC-QA, LLeQA, and BorderLegal-QA focus on question-answering tasks, making them valuable for improving the precision of RAG systems in legal inquiries. In contrast, datasets such as CJRC, the Case Law Evaluation and Retrieval Corpus, and LeCaRDv2 focus on case law retrieval, enhancing a RAG system's ability to fetch relevant case precedents.

Structured vs. Unstructured Legal Texts: The Chatlaw Legal Dataset and LegalBench-RAG incorporate structured annotations, making them useful for legal document classification and knowledge extraction. On the other hand, CAIL2020, CAIL2021, and Chat-EUR-Lex deal with unstructured legal documents, requiring RAG models to improve document chunking and summarization techniques.

Monolingual vs. Multilingual Data: Datasets such as Chat-EUR-Lex and LLeQA introduce multilingual legal data (English, Italian, and French), helping RAG systems adapt to cross-jurisdictional applications, whereas datasets like JEC-QA and BorderLegal-QA are domain-specific and monolingual.

Regulatory vs. Contractual Focus: The Open Australian Legal QA Dataset and the Privacy QA dataset specialize in regulatory compliance, helping RAG models interpret legal policies and statutes. Meanwhile, datasets like ContractNLI and the Mergers and Acquisitions Understanding Dataset emphasize contract analysis, which is useful for automating legal contract review.

These differences determine how effectively a RAG system performs specific legal tasks. The choice of dataset affects model interpretability, retrieval accuracy, and domain adaptation, ultimately shaping the development of more robust legal AI applications. Although existing datasets offer a useful basis for RAG-based legal AI, a number of limitations still exist. Most datasets are restricted to English and Chinese, leaving gaps for legal systems that use languages such as Arabic and French. Additionally, certain legal domains, such as international law and regulatory compliance, remain underrepresented, limiting the applicability of current models. Furthermore, existing benchmarks primarily emphasize QA accuracy while often overlooking crucial aspects such as interpretability and explainability. Addressing these gaps requires expanding datasets to cover diverse legal systems, refining benchmarks to evaluate interpretability, and incorporating human-in-the-loop evaluation methods. Moreover, integrating multiple datasets to develop hybrid models could further enhance precision and contextual understanding in legal AI applications. Table 8 summarizes the details of the compared datasets.

VI. ETHICAL AND PRIVACY CONSIDERATIONS IN LEGAL RAG
When utilizing RAG-based LLMs in the legal field, addressing ethical concerns, including bias, privacy, hallucination, and safety, is crucial. These issues can be mitigated by implementing strong data privacy measures, advocating for transparency and accountability, addressing bias, emphasizing human supervision, and promoting human–machine collaboration. However, the analysis performed in this study shows that only a few papers have addressed these concerns, which indicates that there is considerable room for improvement. Table 7 displays whether ethical values, e.g., privacy, safety, robustness, bias, and trust, were considered in the 22 articles reviewed in this study. All definitions of the ethical principles are listed in Table 9.

VII. CHALLENGES IN LEGAL RAG
A. COMPUTATIONAL COST AND COMPLEXITY
Many legal RAG methods encounter challenges associated with the computational cost of using parameter-inaccessible LLMs as generators and computing embeddings through APIs [40], which can make it inefficient to rely on a powerful LLM, e.g., GPT-4. However, the


TABLE 8. Comparison of legal datasets for RAG systems.

TABLE 9. Metrics and ethical principles definitions.
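As a concrete illustration of the retrieval metrics defined in Section IV (precision, recall, MRR, and MAP), the following minimal sketch evaluates a single ranked result list; the document IDs are hypothetical, and MRR/MAP would be averaged over a full query set.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found in the top-k results."""
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant document (0 if none is retrieved)."""
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1 / i
    return 0.0

def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k where a relevant doc appears."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)

ranked = ["c7", "c2", "c9", "c4"]   # retriever output, best first (toy data)
relevant = {"c2", "c4"}             # ground-truth relevant cases (toy data)

p2 = precision_at_k(ranked, relevant, 2)   # relevant docs at ranks 2 and 4
r4 = recall_at_k(ranked, relevant, 4)
rr = reciprocal_rank(ranked, relevant)
ap = average_precision(ranked, relevant)
```

With relevant documents at ranks 2 and 4, precision@2 and the reciprocal rank are both 0.5, recall@4 is 1.0, and the average precision is 0.5.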

complexity of using in-house embedding storage and LLMs is another problem that may hinder the use of open-source solutions [40], [42], [44], [45], [52]. In retrieval, the computational complexity of multiperspective retrieval can pose challenges for real-time applications in specific scenarios [40]. However, techniques such as caching embeddings can reduce redundant computation and API costs [20]. HyPA–RAG [49] integrates an adaptive retrieval process to minimize unnecessary token usage and computational cost. Generally, a well-established RAG pipeline can improve latency by integrating precomputed and optimized retrieval, reducing the reliance on expensive API calls [41].

B. NO RESPONSE AND HALLUCINATION
Based on the failure points (FPs) of RAG systems presented in the literature [65], legal RAG approaches have addressed most of these FPs. These FPs can lead to one of two challenges, i.e., no response and/or a hallucinated response. This subsection discusses how the proposed legal RAG methods address these challenges.
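One cost mitigation noted above, caching embeddings so that repeated chunks never trigger a second computation or API call, can be sketched as follows. The embed_fn backend here is a placeholder for any embedding model or paid API, not a specific provider's interface.

```python
import hashlib

class EmbeddingCache:
    """Memoize chunk embeddings so each distinct chunk is embedded once.

    `embed_fn` stands in for any embedding backend (local model or
    metered API); only cache misses trigger a real call.
    """

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store = {}   # content hash -> embedding vector
        self.calls = 0    # number of real embedding calls made

    def _key(self, text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, text):
        key = self._key(text)
        if key not in self.store:
            self.calls += 1
            self.store[key] = self.embed_fn(text)
        return self.store[key]

# Toy backend: a fake "embedding" sufficient to show the caching effect.
cache = EmbeddingCache(lambda t: [float(len(t))])
for chunk in ["art. 5", "art. 7", "art. 5"]:   # duplicate chunk
    cache.embed(chunk)
```

After the loop, only two real embedding calls have been made despite three requests; in a legal corpus, where statutory boilerplate repeats heavily across documents, such memoization directly reduces the API costs discussed above.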


TABLE 10. Failure points and corresponding RAG methods, techniques, and processes.
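The RAGAs-style evaluation discussed in Section IV, which scores a response on several 0-to-1 factors and averages them, reduces to a simple aggregation once the per-factor scores are available. The scores below are made-up placeholders; in RAGAs they would come from an LLM judge such as GPT-4.

```python
def ragas_score(factors):
    """Aggregate per-factor scores (each in [0, 1]) into one overall score,
    following the averaging described in Section IV."""
    for name, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(factors.values()) / len(factors)

# Hypothetical judge outputs for one generated answer.
scores = {
    "faithfulness": 0.9,
    "answer_relevance": 0.8,
    "context_relevance": 0.7,
    "context_recall": 0.6,
}
overall = ragas_score(scores)   # mean of the four factors
```

A failure-point-aware variant, as suggested in this subsection, could add one more 0-to-1 factor per FP in Table 10 and feed it into the same aggregation.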

The challenge scale is shown in Fig. 7. Here, each FP maps to a hallucinated or a no-response challenge. The worst-case scenario for a RAG system is to generate a hallucinated response, and the best-case FP scenario is to generate an incomplete answer, as explained in FP7 (Incomplete). Note that the no-response scenario is preferred over a hallucinated response [66]; most RAG prompts include snippets like "If you don't know the answer, just say that you don't know, don't try to make up an answer." [46]. The challenge scale shown in Fig. 7 can serve as a foundation for developing new evaluation metrics by combining the FPs with a degree of hallucination to measure the dependency on external knowledge. In other words, each FP can be measured in a given RAG system. The RAGAs evaluation metric [26] (Section IV) was developed to evaluate the faithfulness, answer relevance, and context relevance of generated responses. We suggest including the presented FPs [65] when measuring the degree of hallucination and faithfulness: the generated responses should be evaluated against each FP, and an overall score calculated. This evaluation can be applied to the retrieval module and/or the generation module (i.e., the LLM).

FIGURE 7. Challenge scale and associated FPs of the RAG systems [65]. The stronger the dependence on external knowledge in generating responses, the fewer FPs occur and the higher the faithfulness of the generated responses.

1) NO RESPONSE
The most common challenge for a poor RAG system is the inability to generate a response [65] due to a limited number of document chunks passed to the LLM to generate


the response, based on a matching algorithm that calculates the similarity between the original query and the document chunks in the vector database. This problem can be caused by various factors, including an incomplete dataset, a flawed chunking method, a poor embedding model, an unrelated matching algorithm, a poor augmentation mechanism, a weak generator, or an inaccurate prompt. Although the no-response scenario indicates that the RAG system cannot produce a response, it provides more robust control over external knowledge compared with scenarios that involve hallucinated responses, as shown in Fig. 7.

2) HALLUCINATION
RAG has been introduced to overcome the hallucination problems of LLMs [14], [67], and RAG itself can be employed to evaluate the performance of LLMs [48]; however, RAG systems are still subject to providing hallucinated responses [65]. For example, a previous study found that GPT-4o generated hallucinated responses [43] and used a prompt with cited cases to overcome this challenge. In another study, the LegalBench dataset was used to assess the legal reasoning capabilities of LLMs in the RAG pipeline to overcome the hallucination problem of RAG systems [44]. Additional methods and techniques implemented in the papers reviewed herein to overcome these challenges are listed in Table 10.

C. COMPLEX QUERY HANDLING
RAG systems may struggle with ambiguous, multi-hop, or vague queries, reducing accuracy in complex reasoning tasks. For example, in [46], the RAG system struggles with Q9 and Q12, which require deeper contextual understanding. CaseGPT [39] showed limitations in addressing unprecedented cases. Including multiple retrieved cases (neighbors) for context in CBR-RAG [20] poses challenges to maintaining prompt coherence, affecting the quality of the generated output. In Chatlaw [25], the system struggles with diverse user inputs, mainly when users provide incomplete, deceptive, or misleading information, which can lead to incorrect answers. LegalBench-RAG [44] struggles with tasks that require multi-hop reasoning or handling technical legal jargon, particularly in datasets like MAUD. For PRO [45], multi-hop reasoning, i.e., combining information from multiple retrieved documents, remains a challenge, leading to incomplete or incorrect responses. The TaxTajweez [42] and [51] RAG systems struggle with complex or ambiguous user queries, leading to less relevant or incomplete retrieval results.

D. DEPENDENCE ON RETRIEVAL ACCURACY
As clarified earlier, RAG systems rely heavily on retrieval precision; errors or irrelevant document fragments degrade the outputs. For example, the retrieval precision of CLERC [43] is reduced by repeated legal terms and irrelevant words in legal documents that mislead retrieval models. In addition, the static confidence threshold in DRAG-BILQA [18] may not be adaptable to all questions or complex queries. Although incorporating a KG into the retrieval process can improve retrieval accuracy, an incomplete KG can result in missing critical information and key relationships within legal texts [25], [49]. In LegalBench-RAG [44], general-purpose re-rankers, such as Cohere's model, perform poorly on specialized legal texts due to a lack of domain adaptation. On the other hand, increasing recall improves the chance of retrieving relevant snippets but introduces more noise, reducing precision. In general, incorporating external knowledge sources into RAG while maintaining coherence and relevance remains complex and challenging.

E. EVALUATION METRICS LIMITATIONS
Current metrics (e.g., BLEU, METEOR, precision, recall) may not fully assess factual correctness and semantic quality. For example, ROUGE struggles to accurately measure the factual correctness and semantic quality of long-form responses [21], and METEOR scores remained low due to the complexity and length of legal content, limiting effective evaluation [18]. Moreover, as discussed in Section IV, there is no standardized evaluation method for RAG systems, and various frameworks utilize different metrics.

VIII. RESEARCH DIRECTIONS IN LEGAL RAG
Recently, RAG has been applied in the legal domain, and it has exhibited promising benefits and outcomes. This section highlights potential research directions inspired by the articles reviewed herein.

A. EXPANDING LEGAL DOMAINS
The scope of future legal RAG research should be extended to cover a wider range of legal domains. Legal acts are involved in everyday activities; thus, developing reliable and accurate LLM-driven systems can benefit individuals and law practitioners. Most of the surveyed studies have focused on narrow subdomains, such as contract analysis and case law retrieval, while other papers, such as Chatlaw [25] and CBR-RAG [20], cover broader legal domains. Legal RAG systems often struggle with cross-jurisdictional generalization due to differences in legal systems, terminologies, and practices. Future research should explore methods to enhance the adaptability of RAG systems across jurisdictions. For example, techniques like multi-task learning (e.g., as demonstrated in LEGAL-BERT [68]) could be extended to RAG systems to improve their performance in diverse legal environments. The open challenges here are how to scale RAG systems to handle the complex interdependencies in multi-domain legal scenarios and how to generalize domain-specific models without sacrificing performance in individual domains.

B. DEVELOPING AND ENHANCING LEGAL DATASETS
As discussed in Section VII, one of the most common root causes of failure in RAG systems is noise in the datasets used.


TABLE 11. Challenges and research directions in legal RAG systems.

To date, three benchmark datasets have been developed in different legal domains to develop and evaluate legal RAG systems [21], [43], [44]. Therefore, future research should focus on developing robust open-source datasets in different legal domains. As explained earlier in Section V, most of the datasets used in the surveyed papers were developed for a specific experiment and task. Developing a benchmark dataset for the legal domain, such as the ALQA dataset used in CBR-RAG [20], can help in evaluating RAG systems across legal domains, and large datasets can enhance both RAG models and the retraining of LLMs [69], [70], [71]. Datasets like LexGLUE [72] have set a precedent for benchmarking legal NLP tasks, but more specialized datasets are needed for RAG systems. These datasets should include annotated legal texts, case law, and statutory provisions to enable fine-tuning and evaluation of RAG models in specific legal contexts.

C. MULTILINGUAL LEGAL RAG
Another promising research direction for non-English researchers is to take advantage of the available non-English legal knowledge in legal RAG systems. Researchers in this field may study efficient methods and techniques for multilingual legal corpora, and legal technology researchers can leverage recent state-of-the-art methods in this domain [73], [74]. For example, handling code-switching in non-Latin scripts, addressing fluency errors, improving document comprehension, and minimizing irrelevant retrievals for multilingual RAG models are promising research topics for multilingual legal RAG [73]. Recent advancements in mBERT and XLM-R [75] provide opportunities to train legal RAG systems on multilingual corpora efficiently. Additionally, datasets such as MultiEURLex [76], which cover EU legal documents in multiple languages, could serve as a foundation for developing multilingual RAG systems.

D. MULTIDIMENSIONAL APPROACH IN LEGAL TECH
Integrating RAG with knowledge graphs, fine-tuning processes, and prompt engineering, as shown in Fig. 2, is becoming a prominent approach in legal technology. This multidimensional approach is expected to enhance retrieval and generation capabilities, and future research and case studies are expected to further enrich the field with more interpretable and reliable LLM-driven applications. For example, the integration of ConceptNet [77] with RAG systems could help bridge the gap between structured and unstructured legal knowledge. Case studies on prompt engineering for specific legal tasks (e.g., drafting legal briefs) [78] could also guide future research in this direction.

E. EVALUATION METRICS
Developing a standardized evaluation method to assess the performance of RAG systems is a promising research field while RAG is still in its early stages. Researchers in this field can leverage the latest metrics and approaches [26], as discussed in Sections IV and VII. Developing a comprehensive evaluation framework for legal RAG systems is essential to evaluate their reliability and performance. Building on works like Kwiatkowski et al. [79], who developed human evaluation methods for natural language systems, researchers could devise legal-specific metrics that assess the factual accuracy of generated responses and the citation quality in retrieved case law. Open challenges include how to measure the interpretability of RAG systems in high-stakes legal settings and what new metrics can evaluate the ethical implications of RAG outputs.

F. REINFORCEMENT LEARNING TO OPTIMIZE LEGAL RAG APPLICATIONS
Transformer models are the most widely used approach for embedding and generation in legal RAG systems,


F. REINFORCEMENT LEARNING TO OPTIMIZE LEGAL RAG APPLICATIONS
Transformer models are the most widely used approach for embedding and generation in legal RAG systems, as demonstrated by the literature survey performed in this study. However, reinforcement learning can also be employed to optimize the retrieval and generation modules of legal RAG [80], [81] for different types of real-world applications. For example, methods such as Reinforcement Learning from Human Feedback (RLHF) (e.g., as used in OpenAI's GPT models) could be adapted to legal RAG systems to improve their performance in real-world legal applications. This approach would allow the system to learn from interactions with legal professionals, ensuring it retrieves and generates more relevant and accurate outputs. In addition, reward-based optimization approaches [82] could be applied to fine-tune models for specific legal tasks (e.g., identifying precedents in case law). Building on works such as Ziegler et al. [83], which optimized language models for specific human preferences, RL could enable legal RAG systems to better align with legal practitioners' needs. Recent advances in RL for enhancing reasoning capabilities in LLMs, such as those demonstrated by DeepSeek-R1 [84], highlight its potential to align model outputs with domain-specific objectives.
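To illustrate the reward-driven idea at a small scale, the sketch below learns a softmax preference over candidate retrieval configurations from simulated practitioner ratings, using an expected-gradient bandit update. The configuration names and ratings are invented for illustration; this is a toy of the underlying mechanism, not RLHF as applied to production models.

```python
import math

# Candidate retrieval configurations (illustrative names only).
ARMS = ["bm25", "dense", "hybrid"]
# Average practitioner rating per configuration: a simulated stand-in
# for aggregated human feedback, which would be unknown in deployment.
RATING = {"bm25": 0.55, "dense": 0.60, "hybrid": 0.75}

prefs = {a: 0.0 for a in ARMS}  # learnable preference per configuration
LR = 0.5

def policy(prefs: dict) -> dict:
    """Softmax over preferences: probability of choosing each configuration."""
    z = sum(math.exp(p) for p in prefs.values())
    return {a: math.exp(p) / z for a, p in prefs.items()}

for _ in range(200):
    pi = policy(prefs)
    baseline = sum(pi[a] * RATING[a] for a in ARMS)  # expected reward
    for a in ARMS:
        # Expected policy-gradient step: raise preferences for
        # configurations rated above the current baseline.
        prefs[a] += LR * pi[a] * (RATING[a] - baseline)

pi = policy(prefs)
best = max(pi, key=pi.get)
print(best, round(pi[best], 3))
```

Over the updates, probability mass shifts deterministically toward the highest-rated configuration ("hybrid" here); in a deployed legal RAG system, the ratings would instead come from interaction logs or explicit feedback from legal professionals.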
IX. CONCLUSION
This paper has presented an overview of the utilization of RAG in the legal domain. We have covered and analyzed all methods, techniques, and stages of legal RAG. The analysis presented in this paper provides insights into embedding, retrieval, augmentation, and generation techniques. In addition, we have thoroughly investigated IR, as it is the backbone of RAG, and we have explained the different evaluation metrics used to assess RAG systems. Furthermore, we have proposed a challenge scale to control hallucination in RAG results, which is expected to serve as an initial foundation for developing a new evaluation method.

REFERENCES
[1] K. Mania, ''Legal technology: Assessment of the legal tech industry's potential,'' J. Knowl. Economy, vol. 14, no. 2, pp. 595–619, Jun. 2023.
[2] J. B. Rajendra, ''Disruptive technologies and the legal profession,'' Int. J. Law, vol. 6, no. 5, pp. 271–280, Jan. 2020.
[3] S. Sharma, S. Gamoura, D. Prasad, and A. Aneja, ''Emerging legal informatics towards legal innovation: Current status and future challenges and opportunities,'' Legal Inf. Manage., vol. 21, nos. 3–4, pp. 218–235, Dec. 2021.
[4] R. Dale, ''Law and word order: NLP in legal tech,'' Natural Lang. Eng., vol. 25, no. 1, pp. 211–217, Jan. 2019.
[5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, ''Attention is all you need,'' in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 1–11.
[6] P. Linardatos, V. Papastefanopoulos, and S. Kotsiantis, ''Explainable AI: A review of machine learning interpretability methods,'' Entropy, vol. 23, no. 1, p. 18, Dec. 2020.
[7] X. Chen, J. Zheng, C. Li, B. Wu, H. Wu, and J. Montewka, ''Maritime traffic situation awareness analysis via high-fidelity ship imaging trajectory,'' Multimedia Tools Appl., vol. 83, no. 16, pp. 48907–48923, Nov. 2023.
[8] X. Chen, H. Wu, B. Han, W. Liu, J. Montewka, and R. W. Liu, ''Orientation-aware ship detection via a rotation feature decoupling supported deep learning approach,'' Eng. Appl. Artif. Intell., vol. 125, Oct. 2023, Art. no. 106686.
[9] Z. Zhang, J. Xiong, Z. Zhao, F. Wang, Y. Zeng, B. Zhao, and L. Ke, ''An approach of dynamic response analysis of nonlinear structures based on least square Volterra kernel function identification,'' Transp. Saf. Environ., vol. 5, no. 2, p. 46, Mar. 2023.
[10] M. Dahl, V. Magesh, M. Suzgun, and D. E. Ho, ''Large legal fictions: Profiling legal hallucinations in large language models,'' J. Legal Anal., vol. 16, no. 1, pp. 64–93, Jan. 2024.
[11] F. Yu, L. Quartey, and F. Schilder, ''Exploring the effectiveness of prompt engineering for legal reasoning tasks,'' in Proc. Findings Assoc. Comput. Linguistics (ACL), Toronto, ON, Canada, 2023, pp. 13582–13596.
[12] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. E. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, ''Training language models to follow instructions with human feedback,'' in Proc. Adv. Neural Inf. Process. Syst., Jan. 2022, pp. 27730–27744.
[13] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, ''QLoRA: Efficient finetuning of quantized LLMs,'' in Proc. Adv. Neural Inf. Process. Syst., Jan. 2023, pp. 10088–10115.
[14] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-T. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, ''Retrieval-augmented generation for knowledge-intensive NLP tasks,'' in Proc. Adv. Neural Inf. Process. Syst., Jan. 2020, pp. 9459–9474.
[15] O. Ovadia, M. Brief, M. Mishaeli, and O. Elisha, ''Fine-tuning or retrieval? Comparing knowledge injection in LLMs,'' 2023, arXiv:2312.05934.
[16] H. Soudani, E. Kanoulas, and F. Hasibi, ''Fine tuning vs. retrieval augmented generation for less popular knowledge,'' 2024, arXiv:2403.01432.
[17] S. Gupta, R. Ranjan, and S. N. Singh, ''A comprehensive survey of retrieval-augmented generation (RAG): Evolution, current landscape and future directions,'' 2024, arXiv:2410.12837.
[18] Y. Zhang, D. Li, G. Peng, S. Guo, Y. Dou, and R. Yi, ''A dynamic retrieval-augmented generation framework for border inspection legal question answering,'' in Proc. Int. Conf. Asian Lang. Process. (IALP), Hohhot, China, Aug. 2024, pp. 372–376.
[19] A. Nikolakopoulos, S. Evangelatos, E. Veroni, K. Chasapas, N. Gousetis, A. Apostolaras, C. D. Nikolopoulos, and T. Korakis, ''Large language models in modern forensic investigations: Harnessing the power of generative artificial intelligence in crime resolution and suspect identification,'' in Proc. 5th Int. Conf. Electron. Eng., Inf. Technol. Educ. (EEITE), Chania, Greece, May 2024, pp. 1–5.
[20] N. Wiratunga, R. Abeyratne, L. Jayawardena, K. Martin, S. Massie, I. Nkisi-Orji, R. Weerasinghe, A. Liret, and B. Fleisch, ''CBR-RAG: Case-based reasoning for retrieval augmented generation in LLMs for legal question answering,'' 2024, arXiv:2404.04302.
[21] A. Louis, G. Van Dijck, and G. Spanakis, ''Interpretable long-form legal question answering with retrieval-augmented large language models,'' in Proc. AAAI Conf. Artif. Intell., vol. 38, Mar. 2024, pp. 22266–22275.
[22] R. Venkatakrishnan, E. Tanyildizi, and M. A. Canbaz, ''Semantic interlinking of immigration data using LLMs for knowledge graph construction,'' in Proc. ACM Web Conf. Companion, Singapore: Springer, May 2024, pp. 605–608.
[23] R. Shui, Y. Cao, X. Wang, and T.-S. Chua, ''A comprehensive evaluation of large language models on legal judgment prediction,'' in Proc. Findings Assoc. Comput. Linguistics, Singapore, 2023, pp. 7337–7348.
[24] M. Visciarelli, G. Guidi, L. Morselli, D. Brandoni, G. Fiameni, L. Monti, S. Bianchini, and C. Tommasi, ''SAVIA: Artificial intelligence in support of the lawmaking process,'' in Proc. 4th Nat. Conf. Artif. Intell., Naples, Italy: CINI, May 2024.
[25] J. Cui, M. Ning, Z. Li, B. Chen, Y. Yan, H. Li, B. Ling, Y. Tian, and L. Yuan, ''Chatlaw: A multi-agent collaborative legal assistant with knowledge graph enhanced mixture-of-experts large language model,'' 2023, arXiv:2306.16092.
[26] S. Es, J. James, L. Espinosa-Anke, and S. Schockaert, ''RAGAS: Automated evaluation of retrieval augmented generation,'' 2023, arXiv:2309.15217.
[27] P. Zhao, H. Zhang, Q. Yu, Z. Wang, Y. Geng, F. Fu, L. Yang, W. Zhang, J. Jiang, and B. Cui, ''Retrieval-augmented generation for AI-generated content: A survey,'' 2024, arXiv:2402.19473.
[28] M. Mitra and B. B. Chaudhuri, ''Information retrieval from documents: A survey,'' Inf. Retr., vol. 2, pp. 141–163, Apr. 2000.
[29] S. Robertson and S. Walker, ''Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval,'' in Proc. SIGIR, B. W. Croft and C. J. Van Rijsbergen, Eds., London, U.K.: Springer, Aug. 1994, pp. 232–241.


[30] G. Salton and C. Buckley, ''Term-weighting approaches in automatic text retrieval,'' Inf. Process. Manage., vol. 24, no. 5, pp. 513–523, Jan. 1988.
[31] V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W.-T. Yih, ''Dense passage retrieval for open-domain question answering,'' 2020, arXiv:2004.04906.
[32] H.-T. Nguyen, M.-K. Phi, X.-B. Ngo, V. Tran, L.-M. Nguyen, and M.-P. Tu, ''Attentive deep neural networks for legal document retrieval,'' 2022, arXiv:2212.13899.
[33] M. Grohe, ''word2vec, node2vec, graph2vec, x2vec: Towards a theory of vector embeddings of structured data,'' in Proc. 39th ACM SIGMOD-SIGACT-SIGAI Symp. Princ. Database Syst., Jun. 2020, pp. 1–16.
[34] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, and H. Wang, ''Retrieval-augmented generation for large language models: A survey,'' 2023, arXiv:2312.10997.
[35] C.-M. Chan, C. Xu, R. Yuan, H. Luo, W. Xue, Y. Guo, and J. Fu, ''RQ-RAG: Learning to refine queries for retrieval augmented generation,'' 2024, arXiv:2404.00610.
[36] W. Fan, Y. Ding, L. Ning, S. Wang, H. Li, D. Yin, T.-S. Chua, and Q. Li, ''A survey on RAG meeting LLMs: Towards retrieval-augmented large language models,'' 2024, arXiv:2405.06211.
[37] X. Wang, Z. Wang, X. Gao, F. Zhang, Y. Wu, Z. Xu, T. Shi, Z. Wang, S. Li, Q. Qian, R. Yin, C. Lv, X. Zheng, and X. Huang, ''Searching for best practices in retrieval-augmented generation,'' 2024, arXiv:2407.01219.
[38] E. Mollard, A. Patel, L. Pham, and R. Trachtenberg, ''Improving retrieval augmented generation,'' Lab. Phys. Sci. (LPS), Univ. Maryland, College Park, MD, USA, Tech. Rep., Aug. 2024.
[39] R. Yang, ''CaseGPT: A case reasoning framework based on language models and retrieval-augmented generation,'' 2024, arXiv:2407.07913.
[40] G. Chen, W. Yu, and L. Sha, ''Unlocking multi-view insights in knowledge-dense retrieval-augmented generation,'' 2024, arXiv:2404.12879.
[41] M. Cherubini, F. Romano, A. Bolioli, L. De, and M. Sangermano, ''Improving the accessibility of EU laws: The Chat-EUR-Lex project,'' in Proc. 4th Nat. Conf. Artif. Intell., Naples, Italy: CINI, May 2024.
[42] M. A. Habib, S. M. Amin, M. Oqba, S. Jaipal, M. J. Khan, and A. Samad, ''TaxTajweez: A large language model-based chatbot for income tax information in Pakistan using retrieval augmented generation (RAG),'' in Proc. Int. FLAIRS Conf., vol. 37, May 2024, pp. 1–12.
[43] A. B. Hou, O. Weller, G. Qin, E. Yang, D. Lawrie, N. Holzenberger, A. Blair-Stanek, and B. Van Durme, ''CLERC: A dataset for legal case retrieval and retrieval-augmented analysis generation,'' 2024, arXiv:2406.17186.
[44] N. Pipitone and G. H. Alami, ''LegalBench-RAG: A benchmark for retrieval-augmented generation in the legal domain,'' 2024, arXiv:2408.10343.
[45] J.-M. Chu, H.-C. Lo, J. Hsiang, and C.-C. Cho, ''Patent response system optimised for faithfulness: Procedural knowledge embodiment with knowledge graph and retrieval augmented generation,'' in Proc. 1st Workshop Towards Knowledgeable Lang. Models (KnowLLM), Bangkok, Thailand, 2024, pp. 146–155.
[46] M. E. Mamalis, E. Kalampokis, F. Fitsilis, G. Theodorakopoulos, and K. Tarabanis, ''A large language model agent based legal assistant for governance applications,'' in Proc. Int. Conf. Electron. Government, Jan. 2024, pp. 286–301.
[47] I. Bošković and V. Tabaš, ''Proposal for enhancing legal advisory services in the Montenegrin banking sector with artificial intelligence,'' in Proc. 28th Int. Conf. Inf. Technol. (IT), Zabljak, Montenegro, Feb. 2024, pp. 1–6.
[48] C. Ryu, S. Lee, S. Pang, C. Choi, H. Choi, M. Min, and J.-Y. Sohn, ''Retrieval-based evaluation for LLMs: A case study in Korean legal QA,'' in Proc. Natural Legal Lang. Process. Workshop, Singapore, 2023, pp. 132–137.
[49] R. Kalra, Z. Wu, A. Gulley, A. Hilliard, X. Guan, A. Koshiyama, and P. Treleaven, ''HyPA-RAG: A hybrid parameter adaptive retrieval-augmented generation system for AI legal and policy applications,'' 2024, arXiv:2409.09046.
[50] T.-H.-G. Vu and X.-B. Hoang, ''User privacy risk analysis within website privacy policies,'' in Proc. Int. Conf. Multimedia Anal. Pattern Recognit. (MAPR), Da Nang, Vietnam, Aug. 2024, pp. 1–6.
[51] R. Nai, E. Sulis, I. Fatima, and R. Meo, ''Large language models and recommendation systems: A proof-of-concept study on public procurements,'' in Natural Language Processing and Information Systems (Lecture Notes in Computer Science), vol. 14763. Cham, Switzerland: Springer, 2024, pp. 280–290.
[52] A. Chouhan and M. Gertz, ''LexDrafter: Terminology drafting for legislative documents using retrieval augmented generation,'' 2024, arXiv:2403.16295.
[53] C. Niu, Y. Wu, J. Zhu, S. Xu, K. Shum, R. Zhong, J. Song, and T. Zhang, ''RAGTruth: A hallucination corpus for developing trustworthy retrieval-augmented language models,'' 2023, arXiv:2401.00396.
[54] D. Edge, H. Trinh, N. Cheng, J. Bradley, A. Chao, A. Mody, S. Truitt, D. Metropolitansky, R. O. Ness, and J. Larson, ''From local to global: A graph RAG approach to query-focused summarization,'' 2024, arXiv:2404.16130.
[55] D. Chandrasekaran and V. Mago, ''Evolution of semantic similarity—A survey,'' ACM Comput. Surv. (CSUR), vol. 54, no. 2, pp. 1–37, Feb. 2021.
[56] S. Wu, Y. Xiong, Y. Cui, H. Wu, C. Chen, Y. Yuan, L. Huang, X. Liu, T.-W. Kuo, N. Guan, and C. J. Xue, ''Retrieval-augmented generation for natural language processing: A survey,'' 2024, arXiv:2407.13193.
[57] H. Yu, A. Gan, K. Zhang, S. Tong, Q. Liu, and Z. Liu, ''Evaluation of retrieval-augmented generation: A survey,'' 2024, arXiv:2405.07437.
[58] (Jun. 2024). The Ultimate Guide to Evaluate RAG System Components: What You Need to Know. Accessed: Nov. 11, 2024. [Online]. Available: https://myscale.com/blog/ultimate-guide-to-evaluate-rag-system/
[59] L. M. Amugongo, P. Mascheroni, S. G. Brooks, S. Doering, and J. Seidel, ''Retrieval augmented generation for large language models in healthcare: A systematic review,'' Preprints, Jul. 2024, doi: 10.20944/preprints202407.0876.v1.
[60] S. Sivasothy, S. Barnett, S. Kurniawan, Z. Rasool, and R. Vasa, ''RAGProbe: An automated approach for evaluating RAG applications,'' 2024, arXiv:2409.19019.
[61] P. Domingos, ''A few useful things to know about machine learning,'' Commun. ACM, vol. 55, no. 10, pp. 78–87, Oct. 2012.
[62] S. Zeng, J. Zhang, P. He, Y. Xing, Y. Liu, H. Xu, J. Ren, S. Wang, D. Yin, Y. Chang, and J. Tang, ''The good and the bad: Exploring privacy issues in retrieval-augmented generation (RAG),'' 2024, arXiv:2402.16893.
[63] W. Li, J. Li, R. Ramos, R. Tang, and D. Elliott, ''Understanding retrieval robustness for retrieval-augmented image captioning,'' 2024, arXiv:2406.02265.
[64] Y. Zhou, Y. Liu, X. Li, J. Jin, H. Qian, Z. Liu, C. Li, Z. Dou, T.-Y. Ho, and P. S. Yu, ''Trustworthiness in retrieval-augmented generation systems: A survey,'' 2024, arXiv:2409.10102.
[65] S. Barnett, S. Kurniawan, S. Thudumu, Z. Brannelly, and M. Abdelrazek, ''Seven failure points when engineering a retrieval augmented generation system,'' in Proc. IEEE/ACM 3rd Int. Conf. AI Eng.-Softw. Eng. AI, Apr. 2024, pp. 194–199.
[66] W. Yu, H. Zhang, X. Pan, K. Ma, H. Wang, and D. Yu, ''Chain-of-note: Enhancing robustness in retrieval-augmented language models,'' 2023, arXiv:2311.09210.
[67] K. Shuster, S. Poff, M. Chen, D. Kiela, and J. Weston, ''Retrieval augmentation reduces hallucination in conversation,'' 2021, arXiv:2104.07567.
[68] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, and I. Androutsopoulos, ''LEGAL-BERT: The muppets straight out of law school,'' 2020, arXiv:2010.02559.
[69] P. Henderson, M. Krass, L. Zheng, N. Guha, C. D. Manning, D. Jurafsky, and D. E. Ho, ''Pile of law: Learning responsible data filtering from the law and a 256GB open-source legal dataset,'' in Proc. Adv. Neural Inf. Process. Syst., Jan. 2022, pp. 29217–29234.
[70] J. Niklaus, V. Matoshi, M. Stürmer, I. Chalkidis, and D. E. Ho, ''MultiLegalPile: A 689GB multilingual legal corpus,'' 2023, arXiv:2306.02069.
[71] M. Ostendorff, T. Blume, and S. Ostendorff, ''Towards an open platform for legal information,'' in Proc. ACM/IEEE Joint Conf. Digit. Libraries, Aug. 2020, pp. 385–388.
[72] I. Chalkidis, A. Jana, D. Hartung, M. Bommarito, I. Androutsopoulos, D. M. Katz, and N. Aletras, ''LexGLUE: A benchmark dataset for legal language understanding in English,'' 2021, arXiv:2110.00976.
[73] N. Chirkova, D. Rau, H. Déjean, T. Formal, S. Clinchant, and V. Nikoulina, ''Retrieval-augmented generation in multilingual settings,'' 2024, arXiv:2407.01463.
[74] S. R. El-Beltagy and M. A. Abdallah, ''Exploring retrieval augmented generation in Arabic,'' Proc. Comput. Sci., vol. 244, pp. 296–307, May 2024.
[75] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, ''Unsupervised cross-lingual representation learning at scale,'' 2019, arXiv:1911.02116.


[76] I. Chalkidis, M. Fergadiotis, and I. Androutsopoulos, ''MultiEURLEX—A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer,'' 2021, arXiv:2109.00904.
[77] R. E. Speer, J. Chin, and C. Havasi, ''ConceptNet 5.5: An open multilingual graph of general knowledge,'' in Proc. AAAI Conf. Artif. Intell., vol. 31, Feb. 2017, pp. 1–7.
[78] J. Lee, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. V. Le, and D. Zhou, ''Chain-of-thought prompting elicits reasoning in large language models,'' in Proc. Adv. Neural Inf. Process. Syst., Jan. 2022, pp. 24824–24837.
[79] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. P. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M. Chang, A. M. Dai, J. Uszkoreit, Q. V. Le, and S. Petrov, ''Natural questions: A benchmark for question answering research,'' Trans. Assoc. Comput. Linguistics, vol. 7, Aug. 2019, pp. 453–466.
[80] M. Kulkarni, P. Tangarajan, K. Kim, and A. Trivedi, ''Reinforcement learning for optimizing RAG for domain chatbots,'' 2024, arXiv:2401.06800.
[81] Z. Wang, S. Xian Teo, J. Ouyang, Y. Xu, and W. Shi, ''M-RAG: Reinforcing large language model performance through retrieval-augmented generation with multiple partitions,'' 2024, arXiv:2405.16420.
[82] Y. Wu, E. Mansimov, S. M. Liao, R. Grosse, and J. Ba, ''Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation,'' in Proc. Adv. Neural Inf. Process. Syst., vol. 30, Jan. 2017, pp. 1–8.
[83] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving, ''Fine-tuning language models from human preferences,'' 2019, arXiv:1909.08593.
[84] DeepSeek-AI et al., ''DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning,'' 2025, arXiv:2501.12948.

MAHD HINDI received the B.Sc. degree in information systems technology from Abu Dhabi University, Abu Dhabi, United Arab Emirates. He is currently pursuing the M.Sc. degree with the College of Information Technology, United Arab Emirates University, Al Ain, United Arab Emirates. His current research interests include LLMs and LLM-driven solutions.

LINDA MOHAMMED received the B.Sc. degree in electrical and electronic engineering (electronic systems software engineering) from the University of Khartoum, Sudan, in 2020. She is currently pursuing the M.Sc. degree in software engineering with United Arab Emirates University, United Arab Emirates. Her current research interests include AI, machine learning, and data science.

OMMAMA MAAZ received the B.Sc. degree in computer engineering from the University of Sharjah, Sharjah, United Arab Emirates, in 2022. She is currently pursuing the M.Sc. degree with the College of Information Technology, United Arab Emirates University, United Arab Emirates. Her current research interests include IoT systems.

ABDULMALIK ALWARAFY (Member, IEEE) received the Ph.D. degree in computer science and engineering from Hamad Bin Khalifa University, Doha, Qatar. He is currently an Assistant Professor with the College of Information Technology, United Arab Emirates University, Al Ain, United Arab Emirates. His current research interests include the application of artificial intelligence techniques across various domains, including wireless and IoT networks, as well as edge and cloud computing. He is a member of the IEEE Communications Society. He served on the technical program committees of many international conferences. In addition, he has been a reviewer for several international journals and conferences.
