0% found this document useful (0 votes)
59 views17 pages

Legal Query RAG

The paper presents Legal Query RAG (LQ-RAG), a novel framework designed to enhance the application of AI in legal practice by addressing challenges such as biased data and hallucinations in AI responses. LQ-RAG incorporates a recursive feedback mechanism and specialized components to improve accuracy and reliability in generating legal responses, demonstrating significant performance improvements over existing models. The findings suggest that domain-specific fine-tuning of language models, combined with advanced retrieval-augmented generation techniques, can greatly enhance AI's effectiveness in legal contexts.

Uploaded by

mohanaram352001
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
59 views17 pages

Legal Query RAG

The paper presents Legal Query RAG (LQ-RAG), a novel framework designed to enhance the application of AI in legal practice by addressing challenges such as biased data and hallucinations in AI responses. LQ-RAG incorporates a recursive feedback mechanism and specialized components to improve accuracy and reliability in generating legal responses, demonstrating significant performance improvements over existing models. The findings suggest that domain-specific fine-tuning of language models, combined with advanced retrieval-augmented generation techniques, can greatly enhance AI's effectiveness in legal contexts.

Uploaded by

mohanaram352001
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 17

Received 14 January 2025, accepted 31 January 2025, date of publication 14 February 2025, date of current version 3 March 2025.

Digital Object Identifier 10.1109/ACCESS.2025.3542125

Legal Query RAG


RAHMAN S. M. WAHIDUR 1 , SUMIN KIM 2 , HAEUNG CHOI 1, DAVID S. BHATTI 1,

AND HEUNG-NO LEE 1 , (Senior Member, IEEE)


1 School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju 61005, South Korea
2 Artificial Intelligence Graduate School, Gwangju Institute of Science and Technology, Gwangju 61005, South Korea
Corresponding author: Heung-No Lee ([email protected])
This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the
Korea Government (MSIT) (IITP-2025-RS-2021-II210118, Development of decentralized consensus composition technology for
large-scale nodes) and This work was supported by the IITP (Institute of Information & Communications Technology Planning &
Evaluation)-ITRC (Information Technology Research Center) grant funded by the Korea Government [Ministry of Science and
Information and Communication Technology (ICT)] (IITP-2025-RS-2021-II211835).

ABSTRACT Recently, legal practice has seen a significant rise in the adoption of Artificial Intelligence
(AI) for various core tasks. However, these technologies remain in their early stages and face challenges
such as understanding complex legal reasoning, managing biased data, ensuring transparency, and avoiding
misleading responses, commonly referred to as hallucinations. To address these limitations, this paper
introduces Legal Query RAG (LQ-RAG), a novel Retrieval-Augmented Generation framework with a
recursive feedback mechanism specifically designed to overcome the critical shortcomings of standard RAG
implementations in legal applications. The proposed framework incorporates four key components: a custom
evaluation agent, a specialized response generation model, a prompt engineering agent, and a fine-tuned legal
embedding LLM. Together, these components effectively minimize hallucinations, improve domain-specific
accuracy, and deliver precise, high-quality responses for complex queries. Experimental results demonstrate
that the fine-tuned embedding LLM achieves a 13% improvement in Hit Rate and a 15% improvement
in Mean Reciprocal Rank (MRR). Comparisons with general LLMs reveal a 24% performance gain when
using the Hybrid Fine-Tuned Generative LLM (HFM), the specialized response generation model integrated
into the LQ-RAG framework. Furthermore, LQ-RAG achieves a 23% improvement in relevance score over
naive configurations and a 14% improvement over RAG with Fine-Tuned LLMs (FTM). These findings
underscore the potential of domain-specific fine-tuned LLMs, combined with advanced RAG modules and
feedback mechanisms, to significantly enhance the reliability and performance of AI in legal practice. The
reliance of this study on a proprietary model as the evaluation agent, combined with the lack of feedback from
human experts, highlights the need for improvement. Future efforts should focus on developing a specialized
legal evaluation agent and enhancing its performance by incorporating feedback from domain experts.

INDEX TERMS Retrieval-augmented generation, legal query, LLM agent, information retrieval.

I. INTRODUCTION efficient responses to user queries. The remarkable versa-


Recent advancements in AI and NLP have propelled the tility demonstrated by models like OpenAI GPT or Meta
development of powerful LLMs. These LLMs leverage LLaMA across a wide spectrum of tasks highlights their
advanced deep learning techniques, transformer architec- potential. These models find applications across various
tures, and extensive amounts of data to provide more fields, including law, medicine, agriculture, coding, and
psychology. They often demonstrate their utility without
The associate editor coordinating the review of this manuscript and requiring specialized prompts [1]. However, while pro-
approving it for publication was Rongbo Zhu . prietary models like BloombergGPT [2] in finance and

2025 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License.
36978 For more information, see https://creativecommons.org/licenses/by/4.0/ VOLUME 13, 2025
R. S. M. Wahidur et al.: Legal Query RAG

Med-PaLM [3] in medicine have capitalized on their distinct to a prompt agent, where prompt engineering is used
data accumulations to advance in their respective sectors, to make slight adjustments that simplify the query while
the legal domain has a relatively limited number of reliable preserving its main idea. The modified query is then sent
LLMs. This scarcity of specialized models has hindered the back to the RAG to repeat the retrieval and generation
digital transformation of the legal sector [4]. process. This recursive feedback mechanism iteratively
Law serves as a cornerstone in shaping societies, governing refines the retrieved documents and generated responses by
human interactions, and upholding justice. Accurate and up- continuously evaluating answer relevance, context relevance,
to-date information is essential for legal professionals to and groundedness, ensuring greater accuracy and alignment
make informed decisions. Legal professionals must navigate with the legal domain.
the complexities of legal language and nuanced interpreta- The key contributions of this paper are given below.
tions. They also need to address the ever-evolving nature 1) A pioneering RAG framework has been designed to
of legislation. These challenges require tailored solutions seamlessly incorporate agent-driven recursive feed-
to effectively meet the unique demands of their field [5]. back processes, creating an innovative pathway to
Current LLMs are primarily trained on general corpora, refine response quality and precision.
which limits their access to domain-specific resources. This 2) The framework incorporates a custom-built LLM-
restriction hinders their ability to effectively utilize com- based evaluation agent designed to independently
prehensive domain knowledge for practical applications [6]. assess the accuracy and relevance of model-generated
LLMs also face challenges in expanding their parametric responses and trigger answer regeneration when
memory, which can result in generating hallucinated infor- necessary.
mation [7]. This issue renders the use of such models 3) A fine-tuned embedding LLM and a hybrid fine-tuned
risky in high-stakes domains, e.g., in several high-profile generative LLM have been developed. These LLMs
incidents, attorneys have been disciplined for filing court provide enhanced generalization, superior domain
documents that referenced fabricated case law produced adaptation, and improved adherence to instructions.
by AI [8]. Research indicates that general-purpose LLMs 4) Extensive evaluations were performed to assess the
frequently hallucinate when responding to legal queries, performance of the proposed RAG system. The results
with an average occurrence rate between 58% and 82% [9]. demonstrate that LQ-RAG consistently outperforms
A promising approach to address these limitations is RAG, baseline models, highlighting its applicability in the
introduced by Lewis et al. [10], which integrates external legal domain.
data retrieval into the generative process. RAG aids in The subsequent sections of this paper are structured
reducing hallucinations and facilitates continuous knowledge as follows. Section II provides background information.
updates and integration of domain-specific information [11]. Section III reviews pertinent literature. Section IV unveils
However, the conventional RAG method may limit LLMs’ the architectural framework of the proposed work. Section V
adaptability and diminish output quality by introducing presents tasks, baseline LLMs, and evaluation metrics.
irrelevant passages. This occurs because the retrieval model Section VI discusses and summarizes experimental findings.
does not consider domain-specific relevance when retrieving Section VII presents the conclusions. Section VIII discusses
passages. Additionally, the generative models lack explicit limitations and suggests potential areas for future research.
training on domain knowledge and have limited ability Finally, section IX discusses the acknowledgment that
to follow instructions efficiently, resulting in inconsistent supported this research.
responses [12].
To address the above-mentioned constraints identified II. BACKGROUND
within the legal domain, this paper introduces a new This section explores key methodologies in NLP, with an
framework named LQ-RAG. This framework employs a emphasis on generative and embedding LLMs, their fine-
hybrid approach to fine-tune the two principal components of tuning techniques, and the RAG system.
the RAG system: embedding generation module and response
generation module separately. These fine-tuned modules are A. GENERATIVE LLMS AND EMBEDDING LLMS
then integrated into the RAG ecosystem and augmented with The advancement of LLMs has given rise to two pri-
other RAG modules, such as chunk references, document mary categories: generative LLMs and embedding LLMs.
hybrid retrieval, and multi-document agents, to enhance Generative LLMs excel in generating text by utilizing the
the performance of the LQ-RAG system. Additionally, causal language modeling approach. This technique predicts
an evaluation agent powered by OpenAI GPT-41 with a each new token based on the preceding sequence, also
feedback mechanism is introduced to evaluate the response known as auto-regression or next-token prediction. Such a
generated by the generative module. If the generated response technique makes these LLMs highly effective for producing
meets the preset criteria, the agent displays the response contextually coherent content. In contrast, embedding LLMs
as the final output. Otherwise, the agent sends the query transform text into high-dimensional vector spaces, which is
useful for indexing and determining semantic relationships
1 https://platform.openai.com/docs/models through mathematical operations. These LLMs excel at

VOLUME 13, 2025 36979


R. S. M. Wahidur et al.: Legal Query RAG

identifying semantic similarities between sentences, making refined retrieval processes, enhancing granularity, and opti-
them suitable for applications like search engines and mizing embedding models to improve retrieval quality [18].
recommendation systems [13]. Modular RAG further enhances functionality by integrating
a search module for similarity retrieval, facilitating adaptable
B. LLM FINE-TUNING approaches for complex language tasks [19], [20].
Fine-tuning adapts a pre-trained language model to enhance
its performance in domain-specific applications. Fine-tuning
of a generative LLM employs two methods: Supervised Fine- III. RELATED WORK
Tuning (SFT) and Instruction Tuning (IT), each tailored to Recent years have seen growing interest in leveraging
optimize LLM differently [14]. Fine-tuning offers benefits LLMs for legal tasks. This section reviews several notable
such as leveraging pre-training knowledge, reducing the need studies, emphasizing their key contributions and shared
for labeled data, and enhancing model generalization. Addi- characteristics.
tionally, fine-tuning an embedding LLM enriches the seman- HanFei [21], a fully parameterized legal LLM with
tic representation of embeddings across the training data 700 million parameters, is pre-trained with large-scale
distribution, thereby enhancing retrieval performance [15]. legal documents. It offers features such as legal question-
Empirical observations indicate that the fine-tuning process answering, multi-turn dialogue, article generation, and search
commonly leads to significant improvements in retrieval functionalities. LawGPT_zh [22] is an open-source Chinese
evaluation metrics associated with RAG. legal LLM built on ChatGLM-6B LoRA 16-bit instruction
fine-tuning. It integrates legal Q&A datasets and high-quality
C. RETRIEVAL AUGMENTED GENERATION (RAG)
legal text, enhancing the performance and professionalism
of General Language Models (GLM) in the legal domain.
The RAG represents an architectural approach to enhance
Similarly, the LawGPT [23] series, built on Chinese-
LLM applications by utilizing customized data sources.
LLaMA-7B, aims to expand legal terminology and enhance
It marginalizes the retrieved documents to produce a distri-
semantic understanding within the legal domain. It achieves
bution over the generated text. There are two methods to
this through pre-training on extensive Chinese legal text
achieve this distribution: RAG-Sequence and RAG-Token.
databases. Subsequent fine-tuning on legal Q&A and judicial
The RAG-Sequence model uses the same retrieved document
datasets further improves the model’s effectiveness and
to generate the entire response. In contrast, the RAG-Token
comprehension within legal frameworks. LexiLaw [24], fine-
model utilizes multiple retrieved documents to produce an
tuned on the ChatGLM-6B architecture, aims to provide
answer, as shown in equations 1 and 2, respectively [10].
accurate and reliable legal consultation services for legal
professionals, students, and general users. It achieves this
X
pRAG-Sequence (y|x) ≈ pη (z|x)pθ (y|x, z)
z∈top-K (p(·|x))
by delving into particular legal matters, articles, and case
N
analyses while also providing valuable recommendations.
X Y Lawyer LLaMA [6] engaged in continuous pre-training
= pη (z|x)
on Chinese-LLaMA-13B and curated multiple instructions
z∈top-K (p(·|x)) i=1
fine-tuning datasets to enhance its capability to provide legal
× pθ (yi |x, z, y1:i−1 ) (1) counsel. Additionally, it possesses the ability to generate
N
Y X legal articles and offer legal advice. Despite advances in
pRAG-Token (y|x) ≈ pre-training and fine-tuning on domain-specific data, these
i=1 z∈top-K (p(·|x)) models still exhibit hallucinations and biases, making them
× pη (z|x)pθ (yi |x, z, y1:i−1 ) (2) unreliable. Additionally, the knowledge cutoff date limits
their ability to provide current information.
where x is the input sequence, y is the target sequence, and z Conversely, recent assessments [25], [26] underscore
are the retrieved documents. N denotes the target sequence the efficacy of RAG techniques in addressing question-
length. The retriever pη (z|x) with parameters η provides answering tasks. DISC-LawLLM [27] is an intelligent legal
distributions over text passages given x. The generator pθ (yi | system that integrates LLMs with a retrieval module, aiming
x, z, y1:i−1 ) with parameters θ generates the current token, yi , to augment the models’ capacity to access and utilize external
based on previous tokens, y1:i−1 , x, and z. top-K (p(· | x)) legal knowledge. CBR-RAG [28], an AI-based system for
represents the top-K truncated distribution over retrieved legal question answering, utilizes the initial retrieval stage,
documents. Based on the architectural complexity, the RAG indexing vocabulary, and similarity knowledge containers
system can be categorized into three types: Naive RAG, of the Case-Based Reasoning (CBR) cycle to enrich LLM
advanced RAG, and modular RAG. Naive RAG initially queries with contextual relevance. LexDrafter [29] is an
follows a Retrieve-Read framework involving indexing, innovative framework tailored for drafting definition articles
retrieval, and generation processes. It grapples with chal- within legislative documents. It harnesses RAG methods and
lenges such as low retrieval precision and hallucinations [16], leverages existing term definitions across diverse legislative
[17]. Advanced RAG addresses these shortcomings through documents. This approach streamlines the drafting process

36980 VOLUME 13, 2025


R. S. M. Wahidur et al.: Legal Query RAG

efficiently. Alotaibi et al. [30] propose Knowledge Aug- according to the objective function defined in Equation 3.
mented BERT2BERT (KAB), a question-answering system  
B B
for Islamic jurisprudential legal questions. KAB combines 1
eS(xi ,yj )  (3)
X X
E(x, y, θ) = S(xi , yi ) − log
retrieval-based and generative techniques. It utilizes prior B
i=1 j=1
knowledge sources such as previous questions, question
categories, and Islamic jurisprudential reference books to where x = {x1 , x2 , . . . , xB } is the input sequence, y =
provide context for its answers. Hoppe et al. [31] created an {y1 , y2 , . . . , yB } is the target sequence in a training batch
intelligent legal advisor for German documents. They demon- of size B, and θ represents the word embedding and neural
strated that Best Matching 25 (BM25) [32] outperforms network parameters used to calculate the similarity score.
pre-trained BERT in recall and Mean Average Precision The similarity score S(xi , yi ) determines how xi and yi are
(MAP). However, fine-tuned Dense Passage Retrieval (DPR) positively related. On the other hand, the similarity score
[33] excels on the GermanQuAD dataset. S(xi , yj ) determines how xi and yj are negatively related.
Overall, AI-based legal work is still emerging; however, In this model, the dot-product scoring introduced in [34] is
recent advancements have effectively integrated LLMs and utilized. The overall fine-tuning process of an embedding
retrieval techniques. This research extends prior work by LLM is illustrated in Algorithm 1.
incorporating fine-tuned LLMs with an agent-based RAG The bottom right part of the FT Layer involves the
solution equipped with a feedback loop. This approach fine-tuning process of the generative LLM. LLaMA-3-8B,
enhances the accuracy and relevance of legal responses. a general-purpose pre-trained autoregressive LLM, is uti-
By advancing these technologies, this paper aims to enhance lized, where the model is represented as pη (y | x), parame-
the reliability and effectiveness of AI in the legal domain. terized by η with the input sequence x and target sequence y.
Initially, two distinct datasets are collected and preprocessed:
IV. PROPOSED SYSTEM a domain-specific Q&A dataset DQA and a general-purpose
Figure 1 illustrates the overall schematic diagram of the instruction dataset DInstr . Each dataset can be represented
proposed LQ-RAG system. The proposed system is organized as input-target sequence pairs: D = {(xi , yi )}i=1,...,N , where
into two primary parts: Fine-Tuning (FT) Layer and RAG each target sequence yi = {yti }t=1,...,Ti is a combination of Ti
Layer. The FT Layer involves fine-tuning both the embedding tokens. The pre-trained LLM undergoes separate fine-tuning
LLM and the generative LLM. In contrast, the RAG Layer processes with the mentioned datasets using a technique
integrates advanced RAG modules, an evaluation agent, called Low-Rank Adaptation (LoRA) [35]. This technique
a prompt engineering agent, and a feedback mechanism to facilitates the fine-tuning process by reducing the number of
ensure the quality and accuracy of the generated responses. trainable parameters. Using LoRA, the weight update from
The bottom left quadrant of the FT Layer illustrates pre-trained weights η0 to η′ = η0 + 1η is replaced by an
the fine-tuning process of the embedding LLM. The top update to η′ (2) = η0 +1η(2), where 2 is a set of parameters
part of the diagram depicts the collection of unstructured whose size is much smaller than |η|. Consequently, the
legal domain corpora Clegal , sourced from an open-access fine-tuned LLM can acquire domain-specific knowledge and
portal named Library Genesis.2 A subset of Clegal , denoted enhance its ability to follow instructions while minimizing
as Csub-legal , undergoes preprocessing. This subset is then the utilization of computational resources. This fine-tuning
utilized by the synthetic data generator, driven by OpenAI process aims to optimize the log-likelihood objective through
GPT-3.5-turbo,3 to create a query-context pair-based syn- gradient updates, as identified in Equation 4.
thetic dataset Dsynthetic . This data generation process involves T
X X h i
the GPT model breaking down the unstructured text into G(2) = max log pη′ (2) (yt | x, y1:t−1 ) (4)
2
smaller, manageable chunks and generating questions that (x,y)∈D t=1
are directly related to each chunk. A mapping function
organizes the dataset by linking each generated question Here, T denotes the number of tokens in y and y1:t−1
to a unique identifier and its corresponding text segment. represents the set of tokens from y1 to yt−1 . Following the
The dataset is then used to fine-tune and evaluate the fine-tuning steps, the resulting models are combined together
performance of the embedding LLM. In this research, the using the linear merging method to create the desired Hybrid
GIST Large Embedding v0 model was used for fine-tuning. Fine-tuned Generative LLM (HFM). Finally, the performance
During fine-tuning, the Multiple Negatives Ranking Loss of the HFM is evaluated using domain-specific evaluation
(MNRL) [34] function is employed to minimize the distance datasets. The complete fine-tuning process of the HFM is
between embeddings of similar sentences while maximizing depicted in Algorithm 2.
the distance between embeddings of dissimilar sentences. The top part of the diagram, RAG-Layer, outlines the core
This approach ensures that the embedding LLM is trained workflow of the LQ-RAG system. The top left section of the
diagram involves the utilization of the remaining unstructured
legal corpora Crem , as an external knowledge source through
2 https://libgen.is/ a process known as data ingestion. This process employs
3 https://platform.openai.com/docs/models parallel workers to efficiently convert the data into document

VOLUME 13, 2025 36981


R. S. M. Wahidur et al.: Legal Query RAG

FIGURE 1. The schematic diagram of the proposed legal query RAG. The diagram is divided into two main components: Fine-tuning (FT) layer and RAG
layer. The FT layer focuses on fine-tuning processes of embedding LLM and generative LLM. On the other hand, the RAG layer incorporates different RAG
modules with fine-tuned LLMs, an evaluation agent, and a feedback system designed to enhance the accuracy and quality of the generated responses.

objects. Subsequently, these documents are segmented into a nuanced and comprehensive set of documents C ∗ . Once
smaller text chunks and processed through the fine-tuned the C ∗ is retrieved, it is forwarded to the post-processing
embedding LLM to generate d-dimensional real-valued unit for optionally scoring and re-ranked by the re-ranker
vectors, denoted as Edocument ∈ RN ×d . An index is then built denoted as Cre-ranked . The fundamental concept focuses on
using Facebook AI Similarity Search (FAISS) [36] for all the prioritizing relevant document records to reduce document
Crem passages that will be used in retrieval. These vectors volume. This approach addresses the challenge of expanding
are subsequently stored in a vector database, referred to as context windows during retrieval. Following this, a prompt
DB vector . When a user submits a query q, it is processed by p that contains system instructions i, the user query q,
the unified fine-tuned embedding LLM designed to handle and the re-ranked retrieved-context Cre-ranked , is fed into the
text chunks. The embedding LLM generates vectors for the generative LLM, to synthesize the initial response r.
query, represented as Equery ∈ RN ×d . These query vectors Finally, an evaluation agent Aevaluation , powered by
are then sent to a Reasoning and Action (ReAct) [37] agent GPT-4, assesses answer relevance, context relevance, and
that selects an appropriate query engine tool to retrieve groundedness for each query based on its related response.
highly relevant text chunks from the external knowledge This system evaluates response quality from the HFM
source through a search mechanism. The retrieval process model, utilizing the Chain-of-thought (CoT) [38] process
employs a hybrid search approach integrating BM25 and to ensure thorough assessment. First, the model retrieves
DPR techniques to enhance search precision. BM25 is a context chunks relevant to the user’s query to verify that
fundamental non-parametric lexical method that calculates only pertinent information is used, reducing the risk of
document relevance with Term Frequency (TF) and Inverse irrelevant details causing hallucinations. For groundedness,
Document Frequency (IDF). On the other hand, the DPR it breaks down the response into distinct claims, searching
retrieves K number of highly relevant passages C from the retrieved context for supporting evidence to ensure factual
the vector space. The similarity score between the q and accuracy. Finally, answer relevance is checked by aligning
the C can be defined as the dot product of their vectors. the response directly with the user’s original question to
The hybrid retriever performs both retrieval processes and confirm it effectively addresses the intended query. This
combines their results. It then re-ranks the findings to deliver structured, sequential approach promotes accuracy, factual

36982 VOLUME 13, 2025


R. S. M. Wahidur et al.: Legal Query RAG

grounding, and relevance in responses. In short, if r meets Algorithm 2 Generative LLM Fine-Tuning, & Merging
the criteria defined by Aevaluation , it is processed as the Process
final output. Otherwise, q enters a feedback loop, where Constants: Loss, LoRA Parameters (rank, α), Learning
prompt engineering is applied to modify the query using Rate η
Input: Training Dataset ∈ {DQA , DInstr },
a prompt engineering agent, repeating the retrieval and Eval Dataset Deval
generation process. This open-source, LLM-based agent is 1: Initialize LoRA & Fine-Tune Generative Model:
designed for seamless prompt engineering, enabling efficient 2: Baseline-Model M ← Pre-trained generative LLM
query transformation optimized for complex legal question- 3: for each trainable layer l in M do
answering tasks. The entire working process of the proposed 4: Al ← Random Initialize(din , rank)
LQ-RAG is depicted in Algorithm 3. 5: Bl ← Zeros Initialize(rank, dout )
6: Integrate Al and Bl into M as MLoRA
7: end for
Algorithm 1 Embedding LLM Fine-Tuning Process 8: Scale LoRA layers by α: MLoRA ← α · (Al · Bl )
Constants: Loss function MNRL, Evaluator Eval, 9: for each epoch do
Learning Rate η 10: for each batch(x (i) , y(i) ) ∈ Dataset do
Input: Csub-legal 11: Forward Pass:
Output: Trained LLM network parameters θglobal 12: ŷ(i) ← MLoRA (x (i) )
1: Data Collection and Preprocessing:
13: Compute Loss:
2: Dsynthetic ← LLM(Csub-legal )
14: L ← Loss(ŷ(i) , y(i) )
3: Dsynthetic ∈ {Dtrain , Deval }
15: Backward Pass:
4: Initialize & Fine-Tune Embedding Model:
16: Compute gradients ∇Al ,Bl L
5: Baseline-Model M ← Pre-trained embedding LLM
17: Update LoRA Parameters:
6: Ltrain_embed ← DataLoader(Dtrain , batch_size)
18: Al ← Al − η · ∇Al L
7: for each epochs do
19: Bl ← Bl − η · ∇Bl L
8: for each batch (x (i) , y(i) ) ∈ Ltrain_embed do 20: end for
9: Forward Pass: 21: Update Weights
22: end for
10: ŷ(i) ← f (x (i) ; θ )
23: MQA ← MLoRA (M , DQA )
11: Compute Loss:
24: MInstr ← MLoRA (M , DInstr )
12: L ← MNRL(ŷ(i) , y(i) )
25: Model Merging:
13: Backward Pass:
26: Mmerged ← Linear Merging(MQA , MInstr )
14: Compute gradients ∇θ L
15: Update Weights:
16: θlocal ← θ − η · ∇θ L
17: end for Algorithm 3 Response Generation Process
18: θglobal ← Update(θlocal , θglobal ) Constants: Embedding LLM Me , Generative LLM Mg
19: end for Input: Crem , User Query q
20: Return θglobal Output: Final Response r
1: Data Ingestion:
2: Edocument ← Me (Data Ingest(Crem ))
3: index ← FAISS(Edocument )
V. TASKS, BASELINE LLMS AND EVALUATION METRICS 4: DB vector ← Store(index)
This section presents an overview of the study, outlining the 5: Query Processing:
datasets, baseline LLMs, and evaluation metrics employed 6: Equery ← Me (q)
7: C ∗ ← Hybrid Retrieval(Equery , DB vector )
for performance assessment. 8: Cre-ranked ← Re-Ranker(C ∗ )
9: r ← Mg (q, Cre-ranked )
A. TASKS DESCRIPTION 10: Evaluation and Feedback Loop:
This paper focuses on six NLP tasks: (1) text classification, 11: Evaluation_result ← Aevaluation (r)
(2) multiple-choice, (3) sentence completion, (4) complex 12: if Evaluation_result ⊆ criteria bounded then
13: Return r
task understanding, (5) information retrieval, and (6) question 14: else
answering. Each task utilizes specific datasets tailored to its 15: n←0
requirements, as summarized in Tables 1 and Table 2. Seven 16: while Evaluation_result ̸ ⊆ criteria bounded and n ≤ N do
datasets listed in Table 1 are employed for the information 17: n←n+1
retrieval task, created in-house using book corpus data to 18: qmodified ← ModifyQuery(q)
19: r ← Mg (qmodified , Cre-ranked )
provide a comprehensive resource for retrieval experiments. 20: Evaluation_result ← Aevaluation (r)
The remaining tasks: text classification, multiple-choice, 21: end while
sentence completion, complex task understanding, and ques- 22: Return r
tion answering are addressed with datasets summarized in 23: end if
Table 2. In the question answering task, both open-domain
and closed-domain retrieval scenarios are explored. An open-
domain query refers to a query where relevant information is a closed-domain query refers to a query where relevant
available within a broad, diverse knowledge base, whereas information is limited or absent within such a domain.

VOLUME 13, 2025 36983


R. S. M. Wahidur et al.: Legal Query RAG

TABLE 1. Overview of datasets used for training and evaluating 2) MEAN RECIPROCAL RANK
embedding-based large language models in the information
retrieval task. Mean Reciprocal Rank (MRR) [56] is particularly useful
for assessing the performance of ranking algorithms in
information retrieval. This metric evaluates system precision
by identifying the highest-ranked relevant document for each
query and calculating the mean reciprocal rank across all
queries, defined as:
N
1 X 1
MRR = (6)
N ranki
i=1

where N is the total number of queries and ranki is the rank


position of the first relevant document for the i-th query.
B. BASELINE LLMS
This section explores the LLMs that serve as the foundation 3) COSINE SIMILARITY
of this study. The encapsulated model configurations of all
Cosine similarity (S) [57] measures the similarity between
baseline LLMs are delineated in Table 3, 4 & 5. In these
two vectors by calculating the cosine of the angle between
tables, (B) represents the number of parameters in billions.
them and It is commonly used in natural language processing
ColBERT [48] enhances retrieval efficiency by leveraging
to assess the similarity between text embeddings or document
deep language models that independently process queries
vectors, defined as:
and documents. LLM-Embedder [49] integrates key retrieval
capabilities to boost performance across various tasks, C·A
S(C, A) = (7)
ranging from knowledge-intensive processing to long-context ∥C∥∥A∥
modeling. BGE Embedding [50] utilizes RetroMAE, a novel here, C and A are two vectors, and ∥C∥ and ∥A∥ denote
retrieval-oriented pretraining paradigm based on a masked their magnitudes. Cosine similarity reflects a relative between
auto-encoder. GISTEmbed [51] introduces a novel approach sentences and can gauge how closely related two pieces of
to enhance in-batch adverse selection during contrastive text are in terms of their content.
training, mitigating biases and noise inherent in traditional
techniques. LLaMA [52], [53] utilizes the decoder com- 4) ANSWER RELEVANCE
ponent of a transformer architecture, with attention layers Answer Relevance (AR) [56] measures the degree to which
accessing only preceding words in sentences at each stage. the generated answer accurately addresses the given query.
The architectural details of the selected LLaMA models This metric helps evaluate the quality and pertinence of
are summarized in Table 4. Flan-T5 [54] leverages pre- model-generated responses in language model research.
trained T5 encoder-decoder transformer architectures and AR can be defined as:
employs fine-tuning techniques to enhance performance. The
N
architectural details of the selected models are summarized in 1 X
AR(Q, A) = fscore (Qi , Ai ) (8)
Table 5. N
i=1

C. EVALUATION METRICS where N is the total number of queries, Ai is the answer for
This section outlines the evaluation metrics deployed in the the i-th query, Qi is the i-th query. The function fscore(Qi ,Ai ) is
present study. used to evaluate the relevance of the answer Ai with respect
to the query Qi , define as:
1) HIT RATE fscore(Qi ,Ai ) ∈ [0, 1]
Hit Rate (HR) [55] quantifies the ratio of queries in
which the correct answer is present among the top-k with 0 denoting not relevant and 1 denoting highly relevant.
retrieved documents. It is a common metric for evaluating
retrieval-based models and search systems. The HR can be 5) CONTEXT RELEVANCE
defined as: Context Relevance (CR) [56] measures how well the retrieved
N context fits the given query. This metric is crucial for
1 X evaluating the context-awareness of models, particularly in
HR = 1{di ∈ Dtrue (qi )} (5)
N tasks requiring contextual understanding, defined as:
i=1
N
where N is the total number of queries, di is the retrieved 1 X
document, Dtrue (qi ) is the set of true relevant documents CR(Q, C) = fscore (Qi , Ci ) (9)
N
for query qi , and 1{di ∈ Dtrue (qi )} is the indicator i=1

function that returns 1 if di is in the set Dtrue (qi ) and where N is the total number of queries, Qi is the i-th query
0 otherwise. for question, Ci is the context which model retrieved.
36984 VOLUME 13, 2025
R. S. M. Wahidur et al.: Legal Query RAG

TABLE 2. Summary of datasets used for training and evaluation of generative LLMs.

TABLE 3. Encapsulation of model configurations in baseline embedding LLMs.

VOLUME 13, 2025 36985


R. S. M. Wahidur et al.: Legal Query RAG

TABLE 4. Concise overview and comparative analysis of LLaMA model architectures.

TABLE 5. Summarized architecture and comparative analysis of FLAN-T5 models.

6) GROUNDEDNESS 9) BLEU SCORE


Groundedness (G) [58] assesses the veracity of a model The BLEU score [59] evaluates machine-translated text
by evaluating its ability to differentiate between factual quality using n-gram precision and a brevity penalty for
and hallucinatory input. This metric is used to ensure that overly short translations. This metric is commonly used in
generated responses are based on credible information. The machine translation and text generation tasks, as follows:
Groundedness is defined as:
N
!
X
N BLEU Score = BP · exp wn log pn (13)
1 X
G(A, C) = fscore (Ai , Ci ) (10) n=1
N
i=1
where BP is the brevity penalty, wn is the weight for each
where N is the total number of queries, Ai is the answer, Ci n-gram precision pn , and N is the maximum n-gram length.
is the retrieved context, and scores how well Ai is grounded
in Ci . 10) ROUGE SCORE
The ROUGE score [60] measures the similarity between
7) ACCURACY machine-generated long and short it and reference summaries
Accuracy (Acc) [56] evaluates whether the answer contains using overlapping n-grams to calculate recall. It is generally
accurate and verified information, ensuring the reliability and used in summarization tasks to evaluate the quality of
validity of the generated response. This metric is generally generated summaries, as follows:
used to assess the overall correctness of a model’s predictions, P
w∈gen min(Countm-gen (w), Countref (w))
as defined below expression: ROUGE-NR = P
w∈ref Countref (w)
TP + TN (14)
Acc = × 100% (11)
TP + TN + FP + FN
where gen represents the generated summary and ref
TP is True-Positive samples and TN is True-Negative samples. represents the reference summaries.
Then, FP is False-Positive samples and FN is False-Negative
samples. VI. EXPERIMENT AND EVALUATION
This section details the experimental setup and provides
8) EXACT MATCH evaluation results, categorized into three parts: (i) Embedding
Exact Match (EM) [58] evaluates the accuracy of a model LLM (ii) Generative LLM and (iii) LQ-RAG system.
by checking if the predicted answer exactly matches the true
answer. This metric is particularly relevant for tasks requiring A. EMBEDDING LLM
precise answer extraction, such as question answering. The The experiment described herein was designed to fine-tune
Exact Match can be defined as: the embedding LLM and compare its performance with
( previously established baselines, as outlined in Section V-B.
1 if Apred = Atrue In this paper, the GIST Large Embedding v0 model from
EM (Q, A) = (12)
0 otherwise Hugging Face4 was employed for fine-tuning. The fine-
tuning process utilized the model fitting API provided by
Q is the query, Apred is the predicted answer, and Atrue is the
true answer. 4 https://huggingface.co/

36986 VOLUME 13, 2025


R. S. M. Wahidur et al.: Legal Query RAG

FIGURE 2. Performance evaluation of the GIST-large-embedding-v0


model before and after fine-tuning.
FIGURE 3. The performance evaluation avg. Hit rate & MRR for different
embedding models.

LlamaIndex5 from sentence transformers. Throughout the


training phase, batch sizes of 8 and 10, along with epoch
sizes ranging from 3 to 15, were iteratively tested. During the
experiment, the performance of the GIST Large Embedding
v0 model was evaluated both before and after fine-tuning by
measuring Hit Rate and MRR. The evaluation results, shown
in Figure 2, reveal that the fine-tuned model demonstrates
improved performance compared to the pre-trained model.
After fine-tuning, the model exhibited a 13% improvement in
average Hit Rate and a 15% improvement in average MRR,
indicating enhanced generalization across different corpora.
To extend this experiment, the performance of the fine-
tuned model, GIST-Law-Embed, was compared with other
baseline models. The experimental results, shown in Figure 3,
indicated that GIST-Law-Embed significantly outperformed
all other baseline LLMs in terms of average Hit Rate and
FIGURE 4. The performance evaluation of the hit rate & MRR on different
average MRR scores. The GIST-Law-Embed model achieved K values.
the highest average Hit Rate of 51% and an average MRR
of 40% under top K = 5. This model not only excelled
in average performance but also maintained the highest
scores for each document, demonstrating its robustness and examined and is presented in Figure 4. As documents are
consistency across various datasets. The overall performance segmented into small chunks during the index’s construction,
in retrieving relevant information is summarized in Table 6 these snippets may represent distinct sections of the original
and Table 7. document, offering supplementary information conducive to
Additionally, the large versions of the BGE and GIS- answer generation. Given this consideration, k = 15 snippets
TEmbed series consistently outperformed their small and were selected for subsequent experiments in this paper, as it
base counterparts, highlighting that increasing model size successfully retrieves the original passage more than 60% of
positively impacts retrieval capabilities. Furthermore, models the time without significantly augmenting the input prompt
with domain-specific tuning, such as GIST-Law-Embed, size. Consequently, increasing the number of retrieved
exhibited enhanced retrieval performance, as evidenced by snippets consistently enhances RAG’s retrieval from the
their higher Hit Rate and MRR scores. This trend underscores original context. This observation underscores the importance
the importance of model size and domain-specific tuning in of optimizing the number of snippets to balance retrieval
achieving superior retrieval performance. accuracy.
One key aspect of this analysis is RAG’s ability to retrieve
relevant information for the given context. The impact of B. GENERATIVE LLM
the number of retrieved snippets (referred to as top-k) was This section investigates the effectiveness of fine-tuning
a generative LLM and assesses its performance against
5 https://docs.llamaindex.ai/en/stable/ established baseline models detailed in Section V-B.

VOLUME 13, 2025 36987


R. S. M. Wahidur et al.: Legal Query RAG

TABLE 6. Summary of model performance in information retrieval: Hit rate analysis.

TABLE 7. Summary of model performance in information retrieval: MRR analysis.

In this paper, LLaMA-3-8B is employed as the baseline first group, the performance of HFM was improved by 9%,
model. Fine-tuning LLaMA-3-8B model requires significant while in the second group, the performance was improved by
memory and processing resources. To ensure compatibility 38%, based on the average performance score of 11 different
and optimization, the PEFT [61], was employed. To improve datasets. This result signifies that by fine-tuning and merging,
inference speed and reduce the model size, 4-bit quantiza- the model performs better across both general and task-
tion approach with BitsAndBytesConfig [62] was utilized. specific domains. The detailed evaluation results for both
Additionally, Key parameters for efficient training were set groups are summarized in Table 8 and Table 9.
using the training arguments configuration. In line with The experimental results for reasoning and commonsense
these optimizations, validation performance was monitored tasks demonstrated that the HFM model significantly outper-
to mitigate overfitting, early stopping was applied, and weight formed all other models, achieving the highest scores across
decay was used for regularization to improve generalization. all metrics. The HFM model achieved an Exact Match (EM)
Finally, the SFTTrainer [63] object from the trl library6 score of 0.65 ± 0.01 in the BBH dataset, surpassing the
was instantiated to manage the entire training process. State-of-the-Art (SOTA) model, LLaMA-3-8B, which scored
To comprehensively evaluate the performance of the Hybrid 0.62 ± 0.01. Additionally, in the Hellaswag dataset, the HFM
Fine-tuned Generative LLM (HFM), assessments were con- model led with an accuracy of 0.62 ± 0.01, compared to
ducted across two distinct groups. The first group focused on LLaMA-3-8B’s 0.60 ± 0.01.
reasoning, commonsense sensing, language understanding, The experimental results for the language understanding
and question-answering tasks, while the second group was and question-answering tasks demonstrated that the HFM
dedicated to legal domain-specific tasks. Figure 5 shows model substantially outperformed the other models across
the comparison results between these two groups. In the most metrics. For the TruthfulQA dataset, HFM demon-
strated superior performance with a blue score of 0.52 ±
6 https://huggingface.co/docs/trl/en/index 0.02, Rouge 1 of 0.58 ± 0.02, and Rouge L of 0.58 ± 0.02.

36988 VOLUME 13, 2025


R. S. M. Wahidur et al.: Legal Query RAG

FIGURE 5. Performance evaluation of fine-tuned model across multiple FIGURE 6. Performance evaluation of diverse models across multiple
tasks. tasks.

C. LQ-RAG SYSTEM
These scores were higher than those of LLaMA-3-8B, This section investigates the LQ-RAG system, comparing it
which scored 0.44 ± 0.02, 0.42 ± 0.02, and 0.40 ± 0.02, with Naive RAG and RAG with FTM. The goal is to identify
respectively. On the other hand, in the SQuAD_v2 dataset, the strengths and the weaknesses of the proposed system in
HFM achieved an accuracy of 0.45 ± 0.02, which is slightly legal contexts through empirical evaluation.
less than the baseline model LLaMA-3-8B, which scored For open-domain question answering, the test questions
0.50 ± 0.02. Despite this, the overall performance of the encompass a diverse range of types, including constitutional
HFM model in the first group tasks showcases its enhanced provisions, explanations of amendments, and hypothetical
capabilities. scenarios designed to simulate real-world legal inquiries.
The experimental findings for the second group of tasks, Table 10 presents a subset of these questions, illustrating
which are specific to the legal domain, demonstrated that the variety and specificity of the queries used for evaluation.
the HFM model consistently surpassed the performance of Additionally, Table 11 provides detailed experimental results,
other models. For the MMLU International Law dataset, offering insights into the system’s performance across these
HFM achieved a score of 0.81 ± 0.03, significantly higher diverse question types. Figure 7 presents the average rele-
than LLaMA-3-8B’s score of 0.77 ± 0.04. Similarly, in the vance scores for three RAG configurations. The Naive RAG
MMLU Professional Law dataset, HFM led with a score configuration achieved an average score of 65%, indicating
of 0.47 ± 0.01, while LLaMA-3-8B scored 0.46 ± 0.01. basic performance without specialized tuning. The RAG
For the Abercrombie classification dataset, HFM achieved with FTM improved to 70%, reflecting a 7% increase over
a score of 0.54 ± 0.04, compared to LLaMA-3-8B’s 0.45 Naive RAG, which suggests that fine-tuning enhances the
± 0.05. In the Legal Reasoning Causality (LRC) dataset, model’s ability to retrieve and generate relevant information.
HFM demonstrated superior performance with a score of The proposed LQ-RAG system attained the highest score
0.75 ± 0.01, outperforming LLaMA-3-8B’s score of 0.52 ± of 80%, showing a 23% improvement over Naive RAG and
0.01. In the Law Stack Exchange (LSE) dataset, the Flan- a 14% improvement over RAG with FTM. This substantial
T5 large model led with a score of 0.63 ± 0.01, while HFM improvement is attributed to the advanced integration and
scored 0.28 ± 0.01. For the Canada Tax Court Outcomes fine-tuning techniques in LQ-RAG, which enhance its ability
(CTCO) dataset, Flan-T5 large also performed best with a to understand and retrieve contextually relevant information,
score of 0.68 ± 0.01, while HFM scored 0.66 ± 0.01. Lastly, resulting in more coherent answers. The evaluation results
in the Contract QA (CQA) dataset, HFM achieved a score of as illustrated in Figure 8, further reinforce the effectiveness
0.56 ± 0.01, surpassing LLaMA-3-8B’s score of of the proposed system. For the same evaluation dataset, the
0.19 ± 0.01. system achieved scores of 88% in answer relevance, 70% in
Overall, the HFM model achieved the highest performance context relevance, and 82% in groundedness. In contrast, the
across both evaluation groups, as depicted in Figure 6, Naive RAG model struggled to meet the threshold levels in
demonstrating its robustness and effectiveness in handling these metrics, especially in context relevance. Although the
both general language understanding and domain-specific RAG with FTM performed better than the Naive RAG, the
queries. The results highlight the effectiveness of fine-tuning score was still not satisfactory enough to be considered an
and merging techniques in enhancing model performance acceptable answer.
across diverse tasks, demonstrating their value in improving In contrast, for closed-domain question answering, the
generalization and downstream task performance. evaluation results are summarized in Table 12. The evaluation

VOLUME 13, 2025 36989


R. S. M. Wahidur et al.: Legal Query RAG

TABLE 8. Experimental results of reasoning and language tasks.

TABLE 9. Experimental results of legal domain-specific tasks.

FIGURE 7. The average relevance score of the RAG system across various
network architectures. FIGURE 8. The evaluation of the RAG triad across diverse network
architectures.

encompassed posing five distinct sets of queries. Both the ing a discernible difference in answer relevance performance
Naive RAG and RAG with FTM consistently achieved an across these RAG system configurations. Throughout the
answer relevancy score of 88%. However, with the LQ-RAG assessment, the context relevancy score and the groundedness
system, the answer relevancy score decreased to 72%, indicat- score both remained less than 50% across all RAG system

36990 VOLUME 13, 2025


R. S. M. Wahidur et al.: Legal Query RAG

TABLE 10. Sample questions for evaluating the performance of LQ-RAG system.

TABLE 11. Performance evaluation of the LQ-RAG system across different user queries.

setups. This outcome was anticipated as the system was not all three criteria: answer relevance, context relevance, and
provided with any relevant context information during the groundedness to ensure confidence in the accuracy and
response generation. Based on these findings, the answers reliability of the generated responses.
generated by the RAG systems are potentially incorrect One of the key concerns in RAG implementation is
because the retrieved information lacks the required context. time complexity. To address this issue, we conducted an
This casts doubt on the answer relevance scores reported experiment to measure the average response generation times
by all configurations of the evaluated RAG systems. Based for different configurations. The results, depicted in Figure 9,
on the experimental results, it is evident that evaluating indicate that the Naive system exhibits the lowest latency,
answer generation by RAG systems requires considering taking only 7.2 seconds to complete five sets of questions.

VOLUME 13, 2025 36991


R. S. M. Wahidur et al.: Legal Query RAG

TABLE 12. RAG performance evaluation under closed-domain question as the evaluation agent, high response generation time, and
and answering.
the absence of feedback from domain experts. Future efforts
will focus on addressing these issues by optimizing time
complexity, developing a specialized legal evaluation agent
with domain-specific expertise, and incorporating feedback
from legal practitioners to ensure the model’s practical
utility and alignment with legal reasoning and context.
Additionally, benchmark datasets specifically designed for
the legal domain will be incorporated to further validate
the approach. State-of-the-art optimization techniques will
also be applied to enhance hit rate and MRR, improving the
system’s practical viability in legal applications. Empirical
experiments will be conducted in real-world legal scenarios
to demonstrate the system’s effectiveness.

ACKNOWLEDGMENT
The authors gratefully acknowledge the high-performance
GPU computing support provided by HPC-AI Open Infras-
tructure through GIST SCENT, which was instrumental in
enabling this research.7

REFERENCES
FIGURE 9. The average time complexity of RAG system across
diversenetwork architectures. [1] Q. Lang, S. Tian, M. Wang, and J. Wang, ‘‘Exploring the answering
capability of large language models in addressing complex knowledge
in entrepreneurship education,’’ IEEE Trans. Learn. Technol., vol. 17,
pp. 2053–2062, 2024.
In contrast, the RAG with FTM takes 11.2 seconds, and [2] G. B. Mohan, R. P. Kumar, P. V. Krishh, A. Keerthinathan, G. Lavanya,
the proposed LQ-RAG system requires 14.6 seconds, which M. K. U. Meghana, S. Sulthana, and S. Doss, ‘‘An analysis of large
language models: Their impact and potential applications,’’ Knowl. Inf.
is double the time of the Naive case. From this, it can be Syst., vol. 66, no. 9, pp. 5047–5070, Sep. 2024.
inferred that while the Naive RAG system responds faster, [3] B. Meskó and E. J. Topol, ‘‘The imperative for regulatory oversight of large
incorporating advanced modules increases the system’s time language models (or generative AI) in healthcare,’’ npj Digit. Med., vol. 6,
no. 1, p. 120, Jul. 2023.
complexity. [4] J. Lai, W. Gan, J. Wu, Z. Qi, and P. S. Yu, ‘‘Large language models in law:
A survey,’’ AI Open, vol. 5, pp. 181–196, 2024.
VII. CONCLUSION [5] D. M. Katz, M. J. Bommarito, S. Gao, and P. Arredondo, ‘‘GPT-4
This paper addresses domain-specific challenges in the legal passes the bar exam,’’ Phil. Trans. Roy. Soc. A, vol. 382, Mar. 2023,
Art. no. 20230254.
field, where traditional RAG systems often fail in information [6] Q. Huang, M. Tao, C. Zhang, Z. An, C. Jiang, Z. Chen, Z. Wu, and Y. Feng,
extraction and response generation. To address these issues, ‘‘Lawyer LLaMA technical report,’’ 2023, arXiv:2305.15062.
the LQ-RAG framework integrates RAG with a recursive [7] V. Magesh, F. Surani, M. Dahl, M. Suzgun, C. D. Manning, and D. E. Ho,
‘‘Hallucination-free? Assessing the reliability of leading AI legal research
feedback mechanism, combining specialized LLMs and an tools,’’ 2024, arXiv:2405.20362.
agent-driven approach for response evaluation and query [8] W. Benjamin, ‘‘Here’s what happens when your lawyer uses ChatGPT,’’
engineering. This multi-layered system reduces hallucina- New York Times, New York, NY, USA, Tech. Rep., 2023. [Online].
Available: https://www.nytimes.com/2023/05/27/nyregion/avianca-
tions and ensures precise, contextually relevant responses. airline-lawsuit-chatgpt.html
Fine-tuning a general-purpose LLM with legal corpora [9] M. Dahl, V. Magesh, M. Suzgun, and D. E. Ho, ‘‘Large legal fictions:
resulted in a 15% improvement over baseline models, while Profiling legal hallucinations in large language models,’’ J. Legal Anal.,
vol. 16, no. 1, pp. 64–93, Jan. 2024.
a hybrid fine-tuned generative LLM achieved up to 24% [10] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler,
better performance across various tasks compared to general M. Lewis, W.-T. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, ‘‘Retrieval-
domain LLMs. The LQ-RAG architecture outperformed augmented generation for knowledge-intensive NLP tasks,’’ in Proc. Adv.
Neural Inf. Process. Syst., 2020, pp. 9459–9474.
all baseline models, with a 23% improvement in average [11] J. Chen, H. Lin, X. Han, and L. Sun, ‘‘Benchmarking large language
relevance score over the naive configuration and a 14% models in retrieval-augmented generation,’’ in Proc. AAAI Conf. Artif.
improvement over RAG with fine-tuned LLMs. Its adaptable Intell., Mar. 2024, vol. 38, no. 16, pp. 17754–17762.
[12] A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, ‘‘Self-RAG: Learning
design facilitates adoption across other specialized domains to retrieve, generate, and critique through self-reflection,’’ in Proc. 12th
with minimal adjustments, enabling professionals to make Int. Conf. Learn. Represent., Jan. 2023, pp. 1–30.
high-quality, informed decisions. [13] Y. Xia, Z. Xiao, N. Jazdi, and M. Weyrich, ‘‘Generation of asset
administration shell with large language model agents: Toward semantic
interoperability in digital twins in the context of industry 4.0,’’ IEEE
VIII. LIMITATIONS & FUTURE WORK
While the current work demonstrates significant advancements, a few limitations remain, including reliance on GPT-4
RAHMAN S. M. WAHIDUR received the B.Sc. degree in electrical and electronics engineering from the Ahsanullah University of Science and Technology, Dhaka, Bangladesh, in 2009. He is currently pursuing the combined M.S. and Ph.D. degree with Gwangju Institute of Science and Technology, Gwangju, South Korea. He is also a Research Assistant with the INFOrmation Processing, Controlling, and NETwork Laboratory (INFONET LAB). Prior to his current academic endeavors, he held the position of Telecommunication Engineer at various multinational corporations, from 2010 to 2019. His research interests include natural language processing, deep learning, blockchain price modeling, and generative AI.

SUMIN KIM received the B.S. degree in communications and convergence software from Kwangwoon University, Seoul, South Korea, in 2021. She is currently pursuing the Ph.D. degree with the Artificial Intelligence Graduate School, Gwangju Institute of Science and Technology, Gwangju, South Korea. Her research interests include continual learning, reinforcement learning, natural language processing, and financial price modeling.

HAEUNG CHOI received the B.S. degree in electrical, electronics, and computer engineering from Kyungpook National University, in 2013, and the M.S. degree in electrical, electronics, and computer engineering from Gwangju Institute of Science and Technology, in 2015, where he is currently pursuing the Ph.D. degree. He is also a Researcher at LiberVance Company. His research interests include blockchain and cybersecurity.

DAVID S. BHATTI received the Ph.D. degree in computer science from the School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad, Pakistan, in 2020. He is currently a Postdoctoral Researcher at the School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, South Korea. He is also working on augmenting HSI with AI and retrieval-augmented generation (RAG). His research interests include network security, deep learning, and hyperspectral imaging.

HEUNG-NO LEE (Senior Member, IEEE) received the B.S., M.S., and Ph.D. degrees in electrical engineering from the University of California at Los Angeles, Los Angeles, CA, USA, in 1993, 1994, and 1999, respectively. He was a Research Staff Member with HRL Laboratories, LLC, Malibu, CA, USA, from 1999 to 2002. From 2002 to 2008, he was an Assistant Professor with the University of Pittsburgh, Pittsburgh, PA, USA. In 2009, he joined the School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, South Korea. His research interests include information theory, signal processing theory, blockchain, communications/networking theory, and their applications to wireless communications and networking, compressive sensing, future internet, and brain–computer interface. He was a recipient of several prestigious national awards, including the Top 100 National Research and Development Award, in 2012, the Top 50 Achievements of Fundamental Research Award, in 2013, and the Science/Engineer of the Month in January 2014.
