Retrieval-Augmented Reasoning with Lean Language Models
rchan@turing.ac.uk, fnanni@turing.ac.uk, jgeddes@turing.ac.uk, a.duncan@imperial.ac.uk
Abstract
This technical report details a novel approach to combining reasoning and
retrieval augmented generation (RAG) within a single, lean language model
architecture. While existing RAG systems typically rely on large-scale models
and external APIs, our work addresses the increasing demand for performant
and privacy-preserving solutions deployable in resource-constrained or secure
environments. Building on recent developments in test-time scaling and small-
scale reasoning models, we develop a retrieval augmented conversational agent
capable of interpreting complex, domain-specific queries using a lightweight
backbone model. Our system integrates a dense retriever with fine-tuned
Qwen2.5-Instruct models, using synthetic query generation and reasoning
traces derived from frontier models (e.g., DeepSeek-R1) over a curated cor-
pus—in this case, the NHS A-to-Z condition pages. We explore the impact
of summarisation-based document compression, synthetic data design, and
reasoning-aware fine-tuning on model performance. Evaluation against both
non-reasoning and general-purpose lean models demonstrates that our domain-
specific fine-tuning approach yields substantial gains in answer accuracy and
consistency, approaching frontier-level performance while remaining feasible for
local deployment. All implementation details and code are publicly released to
support reproducibility and adaptation across domains.
1 Introduction
Recent efforts to improve the test-time performance of language models have shown
significant promise [1, 2, 3, 4]. These approaches, particularly those that target
enhancements of so-called “reasoning” capabilities via chain-of-thought prompting
[5], have enabled relatively small-scale models (e.g., DeepSeek-R1 distilled models
[6] or s1 [7]) to achieve results that are comparable to those of frontier models (e.g.,
OpenAI’s offerings [8]), in specific tasks.
In parallel, work that has focused on improving the factuality and verifiability of
the output of LLMs through retrieval augmented generation (RAG) strategies has
presented clear opportunities to reduce hallucinations [9, 10, 11], in particular when
dealing with the complexities of specific domains of knowledge [12, 13].
The successful integration of reasoning and RAG is now widely available in tools
like ChatGPT and Gemini. Given a user query, these systems can, for instance, first
reason about the query and then decide to take an action—such as performing a
web search or querying a tool like Google Maps—before returning a final answer.
This form of reasoning and tool use is characteristic of emerging agentic AI systems
[14, 15, 16]. Alternatively, the system may begin by retrieving documents relevant to
the user query and then reason over the collected evidence before responding. This
second approach—retrieval followed by reasoning—will be the focus of this technical
report.
While the combination of retrieval and reasoning has significantly enhanced
the performance of frontier language models in general-purpose applications, such
approaches encounter clear limitations in scenarios where users are unwilling or unable
to share data with external entities—particularly in domains involving sensitive or
private information. Even if the training data of a model is publicly available, the
prompts posed by users can often contain highly proprietary or sensitive information
which cannot cross organisational or national boundaries.
In these cases, it becomes necessary to deploy language models on local infras-
tructure, potentially within secure or air-gapped environments. To address such
requirements, recent years have seen steady progress in the development of openly
available large language models (e.g., [17, 18, 19]) alongside open-source frameworks
for retrieval augmented generation1 . More recently, small-scale reasoning models
have also begun to emerge [6, 7]. Nonetheless, the effective integration of reasoning
capabilities for interpreting retrieved evidence—particularly within the constraints of
lightweight or locally deployable models—remains an open research challenge. While
some recent work such as ReAct [20], REPLUG [21], and MemGPT [22] explores
hybrid architectures for tightly integrating LLM reasoning with document retrieval,
these approaches mostly target large, non-local model settings.
To address these limitations, this technical report presents an approach for
effectively combining reasoning and retrieval augmented generation within a single,
1 See for instance [Link] and [Link]
lean language model. Furthermore, we integrate this fine-tuned model into an
interactive conversational system to demonstrate its applicability in downstream
tasks. The resulting system is particularly well-suited for applications involving
complex queries over private, domain-specific knowledge bases. In such settings, the
reasoning component facilitates the interpretation and decomposition of intricate
queries, while the retrieval mechanism constrains the model to verifiable information,
thereby mitigating the risk of hallucinated responses. The focus on private and
sensitive domains motivated our emphasis on lean language models that can be
feasibly fine-tuned and deployed by small organizations or government departments,
particularly in compute-constrained or secure environments.
The report is structured as follows. We begin with an overview of test-time
scaling strategies and related work relevant to the task. This is followed by a
detailed description of our system architecture, including implementation choices and
practical guidance for reproducibility, supported by references to the accompanying
codebase. We then demonstrate the application of our approach to a representative
domain-specific knowledge base—the NHS A-to-Z condition webpages2 —using a set
of queries that require both retrieval and reasoning capabilities. The report concludes
with a discussion of potential future enhancements. An open-source implementation
of our method is available via GitHub,3 enabling practitioners to apply the system
to a broad range of problems involving domain-specific question answering that
combines retrieval with structured reasoning.
2 Related work
In the following section we provide an overview of research areas relevant to this
technical report.
selection mechanisms such as majority voting [3], self-consistency [23], or best-of-
N sampling [24]. These techniques improve robustness and factual accuracy by
exploiting diversity in the model’s outputs, with selection based on heuristics or
learned reward functions. Other common strategies include beam search [25] and
Monte Carlo tree search [26], which maintain multiple high-probability continuations
of a sequence in parallel to explore more promising generations. While such approaches
typically improve likelihood, they may reduce diversity, in contrast to sampling-based
methods.
A complementary family of approaches is known as sequential scaling, which
involves increasing the number of intermediate reasoning steps the model takes
before arriving at a final answer. The most prominent example is chain-of-thought
prompting [5], in which models are guided to produce intermediate reasoning steps
that improve performance on complex tasks. This trend has contributed to a broader
anthropomorphisation of model behaviour, often described in terms of “reasoning”
[27]. Extensions such as tree-of-thought prompting generalise this idea by exploring
multiple reasoning paths in a branching structure, potentially with scoring and
pruning mechanisms applied to select the most promising trajectory. More advanced
test-time scaling methods differ in whether they assume access to a verifier: a
model or module that can score, rerank, or validate outputs. In verifier-free settings,
selection relies on internal model heuristics (e.g., majority vote, self-consistency),
while verifier-assisted setups may use external reward models, classifiers, or even
humans to evaluate and select responses, leading to higher precision but increased
complexity.
Recent models such as DeepSeek-R1-Zero [6] have pushed this frontier by training
LLMs via reinforcement learning to produce structured reasoning paths, using
formatting conventions (e.g., enclosing thoughts in <think> tags) to aid downstream
reasoning alignment. While this model demonstrated strong reasoning capabilities,
it also exhibited practical limitations, such as decreased readability and occasional
mixing of languages.
To mitigate these challenges, DeepSeek-R1 incorporated a small quantity of
high-quality “cold start” data prior to reinforcement learning (RL). This dataset
comprised carefully curated examples, most notably chain-of-thought demonstrations,
designed to stabilise early training and improve the coherence of generated outputs.
DeepSeek-R1 was then trained via a two-stage RL procedure: the first stage targeted
improvements in reasoning ability, while the second focused on aligning model outputs
with human preferences, thereby enhancing readability and reducing incoherent
completions. This multi-phase training strategy enabled DeepSeek-R1 to achieve
performance on par with OpenAI’s o1 model across a range of reasoning benchmarks.
While there has been considerable effort in the last two years in developing a
large variety of reasoning models, evaluation of such models is still in most cases
restricted to a series of widely known mathematics and coding benchmarks, giving
the impression that reasoning in language modelling only applies to these domains.
[Figure: A standard retrieval augmented generation pipeline. Documents are chunked
and indexed in a vector database; the top-K chunks most similar to the user query
are passed to the language model, which generates a response.]
2. Querying: retrieving data relevant to a given query.
direction of retrieval-aware reasoning, where the retrieval process is optimized not
just for relevance, but also for supporting structured inference and faithful generation.
to create highly capable lean models, through strategies such as test-time scaling, is
a very active area of research.
Building on this previous work, in this technical report we focus on how to enhance
the in-domain reasoning capabilities of a small-scale language model that is provided
with a series of retrieved documents to address a user query.
3 System setup
In this section we provide an overview of our system covering all its aspects, from the
computational infrastructure needed to the pipeline design and frontend interface for
chat interactions.
These VMs offered excellent memory utilisation for large-model fine-tuning. We
found that using nodes with 8 GPUs each (in our case, two nodes and 16 GPUs in
total to fine-tune the 32B model) resulted in more efficient training due to tighter
GPU coupling and better memory saturation. In contrast, HPC systems with only
4 GPUs per node required distributed training across more nodes (for example, 6
nodes and 24 GPUs in total on Isambard-AI), which introduced additional overhead
and reduced efficiency.
Through Azure AI Foundry, we accessed inference endpoints for the following
models:
These endpoints enabled efficient generation of reasoning traces and synthetic user
queries as well as final performance testing for frontier LLMs.
We also had access to exploratory nodes with NVIDIA H100 80GB GPUs, which
were employed in selected fine-tuning and evaluation runs.
3.2 Pipeline overview
Our pipeline comprises multiple steps: indexing a collection using a vector database;
retrieving documents via vector similarity to a user query; reasoning about the
retrieved results; and finally generating an answer. In this section, we cover each
step of the pipeline, starting from the language model used, which is the central
component of our system.
Figure 2: Performance on AIME24 of Qwen2.5-Instruct models and their post-
trained versions which are fine-tuned on DeepSeek-R1 reasoning traces (reproducing
s1.1 work, following [7]).
We used a Chroma11 vector database, which by default uses an ℓ2-norm similarity
score metric. In our codebase, we also provide functionality for the user to use
alternative sentence-transformer models, as well as an option to use a FAISS [52]
database instead. In our use case, we found that Chroma and FAISS offered similar
retrieval performance, but Chroma was slightly faster.
Our retrieval system uses full-document retrieval: if any chunk from a particular
document appears in the top-k set of retrieved chunks, the system retrieves the
entire original document from which that chunk originated. This ensures that the
LLM receives the complete context surrounding the relevant information, even if
only a small portion of the document was initially flagged as relevant.
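To make these two steps concrete, the sketch below indexes chunked documents in Chroma and resolves top-k chunks back to their full source documents. The collection name, chunk size, and embedding model are illustrative assumptions rather than the exact settings of our codebase.

# Sketch of chunk-level indexing with full-document retrieval. The collection
# name, chunk size, and embedding model are illustrative assumptions.
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

client = chromadb.PersistentClient(path="./index")
embed = SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
collection = client.get_or_create_collection("nhs-conditions", embedding_function=embed)

def chunk(text: str, size: int = 256) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def index_documents(docs: dict[str, str]) -> None:
    # docs maps a document id (e.g. the condition name) to its full page text.
    for doc_id, text in docs.items():
        chunks = chunk(text)
        collection.add(
            ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
            documents=chunks,
            metadatas=[{"doc_id": doc_id} for _ in chunks],
        )

def retrieve_full_documents(query: str, docs: dict[str, str], k: int = 5) -> list[str]:
    # Retrieve the top-k chunks, then return the complete documents they came from.
    results = collection.query(query_texts=[query], n_results=k)
    doc_ids = {m["doc_id"] for m in results["metadatas"][0]}
    return [docs[d] for d in doc_ids]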
3.2.5 Fine-tuning
We use the reasoning traces to fine-tune a smaller model in order to enhance its
capabilities at test time. The goal is that the model should start producing a
“reasoning process”, similar to that of DeepSeek-R1, before providing its final
answer, and that this reasoning process should improve overall performance.
11 [Link]
We follow the approach described in s1 [7] and perform supervised fine-tuning
on next-token prediction of Qwen2.5-Instruct models (ranging from 1.5B to 32B
parameters), with basic hyperparameters. The main challenge, which distinguishes
our work from the setup of s1, is that each of our model responses is much longer
than those in the s1 dataset, as they include reasoning traces and a set of retrieved
documents. This is because we retrieve full documents rather than chunks, as
described in Section 3.2.2. In particular, in our first attempt to create reasoning
traces with the number of retrieved documents set to 5, the average token length of
the training examples using the Qwen2.5 tokenizer was 74 641. By comparison, the
s1K and s1K-1.1 datasets have an average token length of 9 109 and 26 969,
respectively.
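For reference, the average example length reported above can be measured with the Qwen2.5 tokenizer roughly as follows (a minimal sketch; examples are assumed to be plain strings):

# Sketch: measuring average training-example length with the Qwen2.5 tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct")

def average_token_length(examples: list[str]) -> float:
    lengths = [len(tokenizer(text)["input_ids"]) for text in examples]
    return sum(lengths) / len(lengths)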
To train our model with the same computational resources, we employed automatic
document summarisation to reduce the length of the input context while still
benefiting from the retrieved material.
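A minimal sketch of this summarisation step, reusing the prompt from Appendix A.3, is shown below; the model identifier and generation settings are assumptions, and in practice this is run in batch over the whole collection.

# Sketch: compressing each condition page with an instruction-tuned model using
# the summarisation prompt of Appendix A.3. The model id and generation settings
# are illustrative; any locally hosted instruct model could be substituted.
from transformers import pipeline

summariser = pipeline("text-generation", model="Qwen/Qwen2.5-32B-Instruct", device_map="auto")

SUMMARY_PROMPT = (
    "Summarise the document below, focusing only on symptoms and how to decide "
    "the next course of action. Be concise - aim for a summary of 3-4 sentences "
    "or fewer, keeping only essential information.\n\nDocument:\n{document}"
)

def summarise(document: str) -> str:
    messages = [{"role": "user", "content": SUMMARY_PROMPT.format(document=document)}]
    output = summariser(messages, max_new_tokens=512, do_sample=False)
    # The pipeline returns the chat with the generated assistant turn appended.
    return output[0]["generated_text"][-1]["content"]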
3.3.1 Orchestration for chat interactions
To bring all these components together, we used the LangChain14 framework in
Python to develop our RAG pipeline and combined it with our fine-tuned language
model to create a conversational chatbot application.
For our RAG application, we wanted to allow the user to have a back-and-forth
conversation whereby the language model is given the previous conversation history
and the retrieved context to construct a response. To incorporate historical messages,
it is necessary to use the conversation history together with a prompt template.
Prompt templates turn raw human-AI chat interactions into a format that the
language model can work with and generate a response for. Chat templates are
typically model-specific, meaning that different language model families such as
Llama [41], Gemma [19, 46] and Qwen [17, 53] use different chat templates. For
example, Qwen2.5-Instruct models use the following format to indicate the role and
content of a given interaction:
<|im_start|>{role}
{content}<|im_end|>
The role can be one of user, assistant or system. A system message can be
useful to instruct the model to perform certain actions or to adopt certain
characteristics, such as the tone or style to use. Given a list of user-AI chat
interactions, we can use the Qwen prompt template to construct a prompt for the
language model such as:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
hello<|im_end|>
<|im_start|>assistant
Hello! How can I assist you today?<|im_end|>
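In practice this formatting does not need to be written by hand: the tokenizer's built-in chat template produces the same layout. A minimal sketch:

# Sketch: producing the Qwen2.5 chat format shown above via the tokenizer's
# built-in chat template rather than hand-writing the <|im_start|> markers.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hello"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # <|im_start|>system ... <|im_end|> ... <|im_start|>assistant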
For presenting retrieved context from a knowledge base to the model, we can
construct a system prompt template (see Appendix A.1 for the system prompt we
used for the exemplary application described in Section 4) which defines the task for
the model and presents the retrieved context as well as additional information such
as the demographics of the user.
Note that we could alternatively have used a user prompt template whereby the
retrieved context is presented in the user message. We chose not to do this to limit
the growth of the conversation history context length, since presenting retrieved
context in the user message means it remains in the conversation history as the
chat develops. In our case, when retrieval is used, the retrieved context is provided
to the model in the system prompt and so it can be updated throughout the
conversation.
14 [Link]
The language models used here all have finite context windows. Consequently, as
conversations accumulate long message histories, it might be necessary to reduce the
size of the chat history. We do this by trimming the history based on token count.
Note that we never delete the system prompt from the history and only remove the
oldest chat interactions if necessary.
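A minimal sketch of this trimming logic is shown below; it always preserves the system prompt and drops the oldest turns first. The message representation and the token budget are assumptions.

# Sketch: trimming chat history by token count, keeping the system prompt and
# dropping the oldest turns first. Messages are assumed to be dicts with "role"
# and "content" keys; the token budget is illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct")

def count_tokens(message: dict) -> int:
    return len(tokenizer(message["content"])["input_ids"])

def trim_history(messages: list[dict], max_tokens: int = 4096) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count_tokens(m) for m in system)
    kept: list[dict] = []
    for message in reversed(turns):      # walk from the newest turn backwards
        cost = count_tokens(message)
        if cost > budget:
            break
        kept.insert(0, message)
        budget -= cost
    return system + kept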
User : I have been having headaches recently, what are common ways to
alleviate a headache?
AI : Common ways to alleviate a tension headache include taking over-
the-counter pain relievers like ibuprofen or acetaminophen, applying a
warm or cool compress to your head or neck, and practising relaxation
techniques.
User : Where can I buy them?
In a standard RAG setting, the query “Where can I buy them?” is ambiguous
without the context of the full conversation.
Secondly, user messages may often be simple enough that the model does not
require retrieval. For instance, for simple messages such as “Hello”, it is cheaper to
skip retrieval and have the model respond directly.
To address these two issues, it is possible to treat retrieval as a tool that the model
has access to. For our purposes, a tool is an association between a function and a
schema that defines the function's name, description and arguments. This schema is
passed to the language model, which can decide to use the tool by returning the
tool's name and the arguments to call it with. This approach leverages tool-calling15
(sometimes called function-calling), which is now commonly supported by many
modern chat models and endpoint providers.
In such a setting, we refer to the language model as an agent [20, 54] which
combines language generation with actions. Generally, an agent refers to anything
that can perceive its environment and act upon that environment [55]. The set of
actions that an agent can perform is defined by the tools it has access to. For our
RAG pipeline, the language model can be considered an agent and the tool is the
text retriever. Therefore, for a given user message, the model can decide to either
15
For more details on tool-calling in LangChain, see [Link]
concepts/tool_calling/.
15
query the retriever or to respond directly in natural language. Note that in our
use-case, the language model ideally only decides to not to use retrieval in simple
user messages since we want most model responses to be grounded with data from
the knowledge base.
If the language model decides to use the retriever, a tool call is made and the
language model itself decides which query to use. Since the language model has
access to the conversation history, it can leverage previous chat interactions to
formulate a relevant query to the vector database.
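As an illustration, the retriever can be exposed as a tool along the following lines; the tool name, the model wrapper, and the endpoint are assumptions, and retrieve_full_documents refers to the (assumed) retrieval helper sketched in Section 3.2.

# Sketch: exposing retrieval as a tool that the conversational agent can call.
# The tool name, its description, and the chat-model wrapper are illustrative;
# retrieve_full_documents is the (assumed) helper sketched in Section 3.2, and
# documents maps condition names to full page texts.
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI  # any tool-calling chat model works here

@tool
def search_knowledge_base(query: str) -> str:
    """Retrieve NHS condition pages relevant to the query."""
    return "\n\n".join(retrieve_full_documents(query, documents, k=5))

agent_llm = ChatOpenAI(model="Qwen2.5-32B-Instruct",
                       base_url="http://localhost:8000/v1", api_key="EMPTY")
agent_llm = agent_llm.bind_tools([search_knowledge_base])

ai_msg = agent_llm.invoke(chat_history)      # chat_history: the messages so far
if ai_msg.tool_calls:                        # the agent chose to query the retriever
    query = ai_msg.tool_calls[0]["args"]["query"]
    context = search_knowledge_base.invoke({"query": query})
    # context is then placed in the reasoning model's system prompt (Appendix A.1)
else:
    response = ai_msg.content                # the agent answered directly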
Figure 3 presents the full flow of a given query. First the query is presented to a
conversational agent language model which decides whether to query the retriever,
or to respond directly (in the case of simple messages). The system prompt for this
conversational agent is presented in Appendix A.2. If the conversational agent decides
to perform a tool call to the retriever, it writes a query to the retriever given the chat
history and retrieves relevant document chunks from the knowledge base. Lastly,
the retrieved context is presented to the reasoning model through its system prompt,
as described above, to generate a response given the chat history. Note that there
is flexibility in the choice of language models at the different stages: the
conversational agent model, which decides whether to retrieve first or to respond
directly, can differ from the model that uses the retrieved context to generate a
response. In our final system, we used Qwen2.5-Instruct-32B as the conversational
agent and t0-1.1-k5-32B as the RAG language model, since we found
Qwen2.5-Instruct-32B to be more consistent at tool-calling.
[Figure 3: Flow of a user query. The conversational agent either generates a query
to the retriever, whose results are passed to the reasoning model to produce a
response, or generates a response directly.]
Figure 4: Overview of our pipeline, covering the central aspects of the process:
synthetic data creation, information retrieval, reasoning trace generation, and model
fine-tuning.
each with unique identifiers, and switch between them at will. The responses from
the chatbot are parsed by the browser: by default, only the main answer is shown in
full to the user, with reasoning traces available via a dropdown toggle.
In general, the backend has sole responsibility for storing the conversation history:
the frontend only serves as a way to parse and display this information to an end-user.
Thus, the state of the frontend is fully derived from that of the backend; this design
prevents potential inconsistencies that may arise if the frontend were to store its own
copy of the conversation history.
Because the Python chatbot is exposed over HTTP, and modern browsers do
not (by default) allow HTTPS pages to make requests to HTTP endpoints, it was
further necessary to set up an Nginx reverse proxy to act as an intermediary. The
insecure HTTP connection to the chatbot is handled by the proxy, so the browser
only ever connects to a secure HTTPS URL. This proxy could, if desired, be
replaced with Caddy.18
For this proof of concept, it was considered unnecessary to introduce authenti-
cation such that each conversation could be uniquely associated with a user. Thus,
every visitor to the webpage can see every conversation available. Implementing
authentication using OAuth2 would be an obvious next step for a more serious
deployment.
4 Exemplary Application
Research in test-time scaling and reasoning generally focuses its application on the
mathematics and code domains [3, 6, 7], since addressing such questions may require
a reasoning process, assessing the correctness of a final answer is straightforward,
and many datasets are available [56].
In our experiments, we instead focus on a different scenario, which is arguably
closer to real applications that aim to leverage the combination of information
retrieval and reasoning for decision-making. We consider the body of knowledge
provided by the NHS A-to-Z condition website.19 For each of the almost 1 000
conditions listed, a webpage provides information about the condition and a series
of possible next actions, depending on the patient's symptoms (for instance, asking
for an urgent GP appointment or going directly to A&E). We consider this an
interesting setting for testing our model, as it requires a retrieval component (to
interpret the user query and search across the available documents) and a reasoning
element (to interpret the patient's symptoms and decide on the best next step, while
staying grounded in the provided information). An overview of the process, which
we will cover step by step in this section, is shown in Fig. 4.
It is important to emphasize that the prototype presented here is not intended
as a tool for providing medical advice. Rather, the medical domain is used purely
as a demonstration of its potential applicability across sectors that rely on private,
specialized knowledge and complex queries to support decision-making.
4.1 Dataset
We collected conditions by scraping all webpages under the NHS Conditions subdo-
main, obtaining 990 different conditions.20 We then removed the condition “Mental
Health”, as its page is a collection of conditions structured differently from the rest
of the collection. The remaining 989 are organised in a single dataset (as a JSON
Lines file) containing the name of the condition (the page title), the entire content
of the page (if the condition webpage contained multiple sub-pages, we concatenated
the content into a single stream of text) and a summarised version of
the content of the page obtained using Qwen2.5-32B-Instruct. The prompt used
for this task can be found in Appendix A.3.
With this prompt, we reduced the size of each document by 85% of its original
length while maintaining the core information relevant for the task, as every page
contained boiler-plate text and repeated information. However, reducing the content
so drastically might lead to retrieval issues in the downstream task, so we have
compared retrieval performance when operating on the full content versus only on
the summarised versions (see Table 1).
18 [Link]
19 [Link]
20 The script for downloading the pages is available here: [Link]alan-turing-institute/t0-1/blob/main/scripts/Makefile.
• basic: Based on a single condition page, the query mentions relevant symptoms.
• hypochondriac: Based on a single condition page, the query exaggerates the severity
of the symptoms.
• downplay: Based on a single condition page, the query downplays the severity
of the symptoms.
While basic represents the most common type of request for this kind of system,
hypochondriac and downplay challenge the pipeline by offering either too much or
too little information about the condition and severity level.
The prompt used for generating the synthetic data is available in Appendix
A.4. As an example, the following synthetic request was generated from an input
prompt for a basic query from a woman, to be matched with the condition
hip-replacement and the disposition urgent primary care. The rest of the content in
the example was generated by GPT-4o:
Example
General patient information: age: 65, sex: female, occupation: retired
Teacher, social support: I live with my husband who helps me around
the house., medical history: I have osteoarthritis and occasionally take
over-the-counter pain relief. No other significant conditions.
Using the process described above, we generated two datasets, one containing
1 000 synthetic queries as an evaluation set and a second one containing 2 000
synthetic queries as a dataset to be used for fine-tuning. To ensure no overlap
between our evaluation and fine-tuning datasets, we identified queries which had
the same combination of condition and disposition. We then removed these queries
from the fine-tuning dataset and generated further data to bring our total number of
queries back up to 2 000. A clinician reviewed a subset of our synthetically generated
requests, confirming their suitability and different levels of complexity. We relied
on a frontier LLM (GPT-4o) for this step; this part of our approach would require
a different strategy if adopted on private data held in secure environments. As
alternatives, we suggest the following options: if possible, obtain examples of real
queries over the body of knowledge under study, which would perfectly reflect the
type of process to be automated. Alternatively, we recommend adopting a small-scale
model that can be run on-premise, such as Qwen2.5-32B-Instruct, and then following
the rest of our approach for synthetic data generation. In this case, careful evaluation
of the synthetic data quality is essential to ensure its effectiveness for downstream
tasks.
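As a concrete illustration of this second option, synthetic queries can be generated against an on-premise model exposed through an OpenAI-compatible endpoint (for example via vLLM); the endpoint URL, model name, prompt file path, and placeholder names below are assumptions, with the prompt itself following Appendix A.4.

# Sketch: generating synthetic patient queries with an on-premise model served
# behind an OpenAI-compatible endpoint (e.g. vLLM). The endpoint URL, model
# name, prompt file path, and placeholder names are assumptions; the prompt
# would follow the template in Appendix A.4.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("prompts/synthetic_query_prompt.txt") as f:   # assumed location of the A.4 template
    SYNTHETIC_QUERY_PROMPT = f.read()

def generate_synthetic_query(condition_content: str, query_type: str,
                             severity_level: str, sex: str) -> dict:
    prompt = SYNTHETIC_QUERY_PROMPT.format(
        condition_content=condition_content,
        query_type=query_type,
        severity_level=severity_level,
        sex=sex,
    )
    response = client.chat.completions.create(
        model="Qwen2.5-32B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    # The model is instructed to reply with JSON only (see Appendix A.4).
    return json.loads(response.choices[0].message.content)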
The metric p@k refers to the fraction, p, of queries for which the correct condition
is present among the k returned documents. For example, p@5 is the proportion of
queries for which the correct condition appears among the five most similar documents
retrieved. Note that when retrieving summaries, the cutoff number corresponds
to the number of condition pages, since the summarised documents were all shorter
than the context length of our embedding model (384 tokens), while for full documents
it corresponds to the number of chunks returned.
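For concreteness, a minimal sketch of the p@k computation, assuming the correct condition and the retrieved document identifiers are available for each query:

# Sketch: computing p@k, the fraction of queries whose correct condition appears
# among the identifiers of the k most similar retrieved documents.
def p_at_k(correct: list[str], retrieved: list[list[str]], k: int) -> float:
    hits = sum(1 for gold, ids in zip(correct, retrieved) if gold in ids[:k])
    return hits / len(correct)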
Knowing the performance at different cutoffs allows us to choose a suitable number
of retrieved documents to use when combining retrieval and reasoning. A higher
number of documents will guarantee, in most cases, that the correct condition is
retrieved, but will also lead to a longer context for the downstream LLM, impacting
fine-tuning and reasoning.
Table 1: Retrieval accuracy at different cutoffs when indexing full pages (which are
then divided into chunks) versus summaries of the same pages.
found in Appendix A.5. Through this process, each query is then structured such
that it has:
These components are all concatenated into a single stream of text for each query,
which is then used to fine-tune a series of small versions of Qwen2.5-Instruct models
for next-token prediction. What we expect to see is that, at test time, the model
starts producing a thinking or reasoning process focused on the content of the
retrieved documents in relation to the user query, before generating the answer.
Such a process should enhance its capabilities versus a non-reasoning baseline or a
general-purpose reasoning model (e.g., s1).
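As an illustration, a single training example might be assembled along the following lines; the exact field layout and system prompt are assumptions based on the description above, while the <think> delimiters follow the DeepSeek-R1 convention and the chat markers follow the Qwen format of Section 3.3.1.

# Sketch: assembling one fine-tuning example as a single stream of text.
# The field layout and system prompt are assumptions, not our exact format.
def build_training_example(system_prompt: str, retrieved_docs: list[str],
                           user_query: str, reasoning_trace: str, answer: str) -> str:
    context = "\n\n".join(retrieved_docs)
    return (
        f"<|im_start|>system\n{system_prompt}\n\nRetrieved context:\n{context}<|im_end|>\n"
        f"<|im_start|>user\n{user_query}<|im_end|>\n"
        f"<|im_start|>assistant\n<think>\n{reasoning_trace}\n</think>\n{answer}<|im_end|>"
    )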
The fine-tuning parameters were selected based on recommendations from [7].
The most important configuration choices are as follows:
• Epochs: 5
All models were trained using the same fine-tuning parameters as described above.
GPU and system configurations varied due to availability and cost considerations.
The training setups were as follows:
4.5 Condition and next action prediction
In this section, we measure whether retrieval augmented reasoning improves the
performance of a lean language model. In Table 2, we report the performance
of our 32B parameter fine-tuned model (named t0-1.1-k5-32B) on two tasks:
(i) determining the condition of a synthetic patient, given the textual description of
the symptoms; and (ii) establishing the next course of actions among the options
suggested in the document content. We assume that a baseline non-reasoning model
(Qwen2.5-32B-Instruct) would already have some general knowledge to be able to
perform such task at a decent level. However, the integration of retrieval and reasoning
capabilities should improve performance, as in domain evidence would be additionally
offered to the model. The prompt templates used to evaluate model performance
on predicting the condition and the suitable next action are given in Appendix A.6.
To provide a comparison with other systems, we report the performance of two
recent comparable lean reasoning models, s1.1-32B [7] and Qwen3-32B [53], and also
a series of state-of-the-art large language models (GPT-4o, o3-mini, DeepSeek-R1),
to understand overall frontier performance. For t0-1.1-k5-32B and s1.1-32B we use
budget forcing to control test-time compute, as described in [7].21
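As an illustration of budget forcing, a minimal sketch is shown below, loosely following [7]: the thinking budget (1024 tokens) and the cap of three suppressions of the end-of-thinking delimiter mirror the settings reported in footnote 21, while the checkpoint path, the <think> delimiters, and the "Wait" continuation string are assumptions.

# Sketch of budget forcing at inference time, loosely following s1 [7]. Assumes a
# model fine-tuned to reason between <think> and </think>; the checkpoint path
# and the "Wait" continuation string are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/t0-1.1-k5-32B"   # hypothetical local checkpoint
tok = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.bfloat16,
                                             device_map="auto")

END_THINK = "</think>"
MAX_THINK_TOKENS = 1024   # maximum reasoning budget
MAX_SUPPRESSIONS = 3      # times we suppress the end-of-thinking delimiter

def reason_with_budget(prompt: str) -> str:
    text, remaining, suppressions = prompt + "<think>", MAX_THINK_TOKENS, 0
    while remaining > 0:
        ids = tok(text, return_tensors="pt").to(model.device)
        out = model.generate(**ids, max_new_tokens=remaining, do_sample=False)
        new_tokens = out[0, ids["input_ids"].shape[1]:]
        remaining -= new_tokens.shape[0]
        completion = tok.decode(new_tokens, skip_special_tokens=False)
        if END_THINK in completion and suppressions < MAX_SUPPRESSIONS:
            # Suppress the delimiter and nudge the model to keep thinking.
            text += completion.split(END_THINK)[0] + "Wait"
            suppressions += 1
        else:
            text += completion
            break
    if END_THINK not in text[len(prompt):]:
        text += END_THINK   # budget spent: force the end of thinking
    return text             # the final answer is then generated after </think>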
As presented in Table 2, a 32B parameter model already has some core knowledge
of the topic at hand, as it is able to identify the correct condition from the synthetic
patient description in 38 % of cases and predicts the correct next course of action
46 % of the time. This baseline assessment is necessary to establish the initial
capacity of the adopted language model. Depending on the task and domain of
knowledge considered, performance will vary, especially when applications are focused
on a specific body of knowledge that is not widely available, such as materials only
shared on the intranet of an organisation. For comparison, a frontier non-reasoning
model such as GPT-4o starts with an accuracy of 49 % for conditions and 56 % for
disposition.
Providing retrieved documents helps the models better predict both tasks, with a
performance increase of over 15 % in condition accuracy for the relatively small
Qwen2.5-32B and 7 % for GPT-4o. Qwen also gains a 4 percentage point improvement
in disposition accuracy, while GPT-4o sees its performance decrease slightly,
potentially due to greater ambiguity from the amount of information now available.
Large performance improvements for condition accuracy are also observed for the
two frontier reasoning models examined (o3-mini and DeepSeek-R1), highlighting
the substantial benefit of incorporating a retrieval component to provide relevant
in-domain evidence to the model.
Moving to the lean models examined, we can see the additional benefit of
integrating a thinking process before providing the final answer. Our t0-1.1-k5-32B,
which has been fine-tuned on in-domain reasoning examples, further increases the
accuracy in determining the correct condition, with an almost 20% performance
improvement compared to the base Qwen model and an additional small improvement
over using retrieval alone. The results for condition accuracy outperform those of
the other lean reasoning solutions examined (s1.1-32B and the newly released
Qwen3-32B), bringing the model into the same realm as massively larger frontier
reasoning models such as o3-mini and DeepSeek-R1.
21 In particular, we enforce a maximum token count of 1024 and suppress the generation of the
end-of-thinking token delimiter a maximum of 3 times.
Table 2: Condition and disposition accuracy by LLM and k value. We report first the
examined lean language models and then a series of frontier large language models.
A dash (–) indicates a k value of zero: that is, the model was provided with no
retrieved context, as a baseline. Note that for all models relying on a retrieval
component, the maximum achievable condition accuracy is 0.76, as already discussed
regarding Table 1. All reported values are average accuracies over 10 runs. Standard
deviations were consistently around 0.01 and are omitted for clarity.
We do not see a similarly drastic improvement for determining the disposition.
While the model performs better than the base Qwen2.5-Instruct-32B and than
general-purpose lean reasoning models, it does not reach the performance of a
frontier model such as o3-mini. We believe this is due to the reasoning traces provided
by DeepSeek-R1, which set an upper bound on the expertise that can be distilled into
smaller models, compared to the stronger reasoning process of o3-mini for dispositions. It
is also important to note that o3-mini shows very low performance for condition
accuracy when prompted without retrieved documents. From inspecting the outputs,
it seems that o3-mini loses focus when making a prediction without supporting
evidence, and ends up considering too many possible scenarios.
The main outcome of this evaluation is that a retrieval augmented lean reasoning
model can shortlist the correct condition among a few options in 76% of cases (as
shown in Table 1), and predict the correct condition in 56% of cases, comparable
to state-of-the-art frontier models (as shown in Table 2). This serves as a strong
starting point before initiating a conversation with the user, which would allow
the model to gather additional information, expand its search in the database, and
more accurately narrow down the possible conditions and patient disposition (see an
example of the conversational interface in Figure 6).
5 Discussion
In this section we present three aspects that should help orient future implementations
based on our technical report. We start by covering the trade-offs between
general-purpose and domain-specific reasoning, then examine how to further reduce
model size, and conclude with an overview of our system's frontend, a prototype
directly available in our GitHub repository.
Table 3: Condition and disposition accuracy by type of query: comparison of a single
evaluation run between a general reasoning model and our in-domain approach, with
k = 5 retrieved documents.
how to tackle decisions relevant to the domain under study, brings a clear advantage
compared to a more general-purpose reasoner. While the number of traces needed
to train such an in-domain model is not excessive (the s1 study adopted 1 000 traces,
whereas we have used 2 000), obtaining them remains a challenge for applications
focusing on private or sensitive collections of data. To address this problem, several
strategies could be considered, such as:
• manually creating reasoning traces based on a set of given queries and retrieved
documents;
model size without significantly impacting the overall performance. The possibility
of enhancing reasoning abilities in small language models via distillation is one of
the main contributions of [6]. However, as shown in Figure 2, depending on the task,
the model starts outperforming its non-reasoning baseline only above a certain size.
In a similar fashion, we evaluate the performance of distilled models ranging from
1.5B to 32B parameters, with the smallest models requiring only 3 to 6 GB of GPU
memory, making them suitable for execution on most modern laptops.22
Table 4: Condition and disposition accuracy for different model sizes, with 5
documents retrieved. As a reference, we also report the performance of the 32B
non-reasoning baseline (Qwen2.5-32B-Instruct) and of the frontier model from
which reasoning was distilled (DeepSeek-R1), with the same number of documents
retrieved.
augmented reasoning models can in fact deliver performance comparable to models
over ten times larger, which significantly broadens the range of deployment scenarios.
Such models are lightweight enough to run on many consumer-grade laptops, making
them viable for widespread use in research and government applications.
Figure 6: Snapshot of the chat interface.
care or, as in this case, A&E).25 The web frontend also includes a form for entering
demographic information associated with the query.
The provided frontend can be seamlessly adapted to many other applications that
rely on combining model orchestration, reasoning capabilities and retrieval (with
the possibility of adding further metadata) over a collection of documents.
6 Conclusion
In this technical report, we have described how we effectively combined reasoning
and retrieval augmented generation in a single lean model. To show its usefulness
for decision-making on domain-specific collections, we have presented a case study
on determining the conditions and dispositions of a series of synthetic requests,
using the NHS A-to-Z collection as the body of knowledge. Our model performed at
a level comparable to frontier reasoning models and, in particular, outperformed
other small-scale reasoning models that were trained on mathematical reasoning and
not fine-tuned for domain-specific applications. Finally, we highlight that it is
possible to further reduce model size via distillation of reasoning capabilities into
very small models while keeping strong performance. We hope our overview and the
paired GitHub codebase will be useful to others interested in combining reasoning
and retrieval capabilities in domain-specific settings.
25 During experimentation, we also considered a fourth option, “request ambulance”, which
would have been predicted in this case to avoid risking a further fall by mobilising.
Acknowledgments
RC, FN, and TL contributed equally to this work, leading respectively the implemen-
tation (RC), the overall project (FN) and the computational work (TL). Based on the
CRediT taxonomy, these are the contributions of all authors: Conceptualisation (AD,
JG, FN), Implementation (RC, TL, FN, RW, PY), Computational infrastructure
(RC, TL, RW), Data curation (RC, JG, FN, RW), frontend interface (PY), Original
draft (RC, FN, TL), Reviewing & Editing (all), Advising (AD, JG, MG, LT), Project
management (AD, JG, FN).
This work was funded by The Alan Turing Institute. We would like to thank
Christopher Banerji, Maya Bronfeld, Jonathan Carter, Tom Jeffery and Giles
Lawrence for their invaluable support and constructive feedback throughout the
course of the project.
The computations described in the report were in part performed using the
Baskerville26 Tier 2 HPC service. Baskerville was funded by the EPSRC and UKRI
through the World Class Labs scheme (EP/T022221/1) and the Digital Research
Infrastructure programme (EP/W032244/1) and is operated by Advanced Research
Computing at the University of Birmingham.
The authors also acknowledge the use of resources provided by the Isambard-AI
National AI Research Resource (AIRR). Isambard-AI is operated by the University
of Bristol and is funded by the UK Government’s Department for Science, Innovation
and Technology (DSIT) via UK Research and Innovation; and the Science and
Technology Facilities Council [ST/AIRR/I-A-I/1023].
References
[1] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky,
Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al.
OpenAI o1 System Card. arXiv preprint arXiv:2412.16720, 2024.
[2] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM Test-
Time Compute Optimally can be More Effective than Scaling Model Parameters.
In The Thirteenth International Conference on Learning Representations, 2025.
[3] Yuxin Zuo, Kaiyan Zhang, Shang Qu, Li Sheng, Xuekai Zhu, Biqing Qi, Youbang
Sun, Ganqu Cui, Ning Ding, and Bowen Zhou. TTRL: Test-Time Reinforcement
Learning. arXiv preprint arXiv:2504.16084, 2025.
26 [Link]
[4] Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli
Ouyang, and Bowen Zhou. Can 1B LLM Surpass 405B LLM? Rethinking
Compute-Optimal Test-Time Scaling. arXiv preprint arXiv:2502.06703, 2025.
[5] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi,
Quoc V Le, Denny Zhou, et al. Chain-of-Thought Prompting Elicits Reasoning
in Large Language Models. Advances in Neural Information Processing Systems,
35:24824–24837, 2022.
[6] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu,
Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing
Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint
arXiv:2501.12948, 2025.
[7] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh
Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori
Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025.
[8] Ahmed El-Kishky, Daniel Selsam, Francis Song, Giambattista Parascandolo,
Hongyu Ren, Hunter Lightman, Hyung Won Chung, Ilge Akkaya, Ilya Sutskever,
Jason Wei, et al. OpenAI. Learning to reason with LLMs. OpenAI, 2024.
[9] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir
Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim
Rocktäschel, et al. Retrieval-Augmented Generation for Knowledge-Intensive
NLP Tasks. Advances in Neural Information Processing Systems, 33:9459–9474,
2020.
[10] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin
Dai, Jiawei Sun, Haofen Wang, and Haofen Wang. Retrieval-Augmented Gener-
ation for Large Language Models: A Survey. arXiv preprint arXiv:2312.10997,
2(1), 2023.
[11] Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin,
Tat-Seng Chua, and Qing Li. A Survey on RAG Meeting LLMs: Towards
Retrieval-Augmented Large Language Models. In Proceedings of the 30th ACM
SIGKDD Conference on Knowledge Discovery and Data Mining, pages 6491–
6501, 2024.
[12] Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin
Leyton-Brown, and Yoav Shoham. In-Context Retrieval-Augmented Language
Models. Transactions of the Association for Computational Linguistics, 11:1316–
1331, 2023.
[13] Akari Asai, Zexuan Zhong, Danqi Chen, Pang Wei Koh, Luke Zettlemoyer,
Hannaneh Hajishirzi, and Wen-tau Yih. Reliable, Adaptable, and Attributable
Language Models with Retrieval. arXiv preprint arXiv:2403.03187, 2024.
[14] Anthropic. Building Effective Agents. Anthropic Blog, 2024.
[15] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang,
Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A Survey on Large
Language Model based Autonomous Agents. Frontiers of Computer Science,
18(6):186345, 2024.
[16] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong,
Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. The Rise and Potential
of Large Language Model Based Agents: A Survey. Science China Information
Sciences, 68(2):121101, 2025.
[17] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang
Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen Technical Report. arXiv
preprint arXiv:2309.16609, 2023.
[18] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne
Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro,
Faisal Azhar, et al. LLaMA: Open and Efficient Foundation Language Models.
arXiv preprint arXiv:2302.13971, 2023.
[19] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya
Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale,
Juliette Love, et al. Gemma: Open Models Based on Gemini Research and
Technology. arXiv preprint arXiv:2403.08295, 2024.
[20] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik
Narasimhan, and Yuan Cao. ReAct: Synergizing Reasoning and Acting in
Language Models. In The Eleventh International Conference on Learning Rep-
resentations, 2023.
[21] Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Richard James, Mike
Lewis, Luke Zettlemoyer, and Wen-tau Yih. REPLUG: Retrieval-augmented
black-box language models. In Proceedings of the 2024 Conference of the North
American Chapter of the Association for Computational Linguistics: Human
Language Technologies (Volume 1: Long Papers), pages 8371–8384. Association
for Computational Linguistics, 2024.
[22] Charles Packer, Vivian Fang, Shishir G Patil, Kevin Lin, Sarah Wooders, and
Joseph E Gonzalez. MemGPT: Towards LLMs as Operating Systems. CoRR,
2023.
[23] Xinyun Chen, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, Pengcheng Yin,
Sushant Prakash, Charles Sutton, Xuezhi Wang, and Denny Zhou. Univer-
sal Self-Consistency for Large Language Model Generation. arXiv preprint
arXiv:2311.17311, 2023.
[24] Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le,
Christopher Ré, and Azalia Mirhoseini. Large Language Monkeys: Scaling
Inference Compute with Repeated Sampling. arXiv preprint arXiv:2407.21787,
2024.
[25] Alex Graves. Sequence Transduction with Recurrent Neural Networks. arXiv
preprint arXiv:1211.3711, 2012.
[26] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao,
and Karthik Narasimhan. Tree of Thoughts: Deliberate Problem Solving with
Large Language Models. Advances in Neural Information Processing Systems,
36:11809–11822, 2023.
[27] Subbarao Kambhampati, Kaya Stechly, Karthik Valmeekam, Lucas Saldyt,
Siddhant Bhambri, Vardhan Palod, Atharva Gundawar, Soumya Rani Samineni,
Durgesh Kalwar, and Upasana Biswas. Stop Anthropomorphizing Intermediate
Tokens as Reasoning/Thinking Traces! arXiv preprint arXiv:2504.09762, 2025.
[28] Maggie Huan, Yuetai Li, Tuney Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du,
Radha Poovendran, Graham Neubig, and Xiang Yue. Does Math Reasoning
Improve General LLM Capabilities? Understanding Transferability of LLM
Reasoning. arXiv preprint arXiv:2507.00432, 2025.
[29] Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton,
Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, et al.
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety.
arXiv preprint arXiv:2507.11473, 2025.
[30] Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language
Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-
of-Thought Prompting. Advances in Neural Information Processing Systems,
36:74952–74965, 2023.
[31] Sree Harsha Tanneru, Dan Ley, Chirag Agarwal, and Himabindu Lakkaraju.
On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language
Models. In Trustworthy Multi-modal Foundation Models and AI Agents (TiFA),
2024.
[32] Abulhair Saparov and He He. Language Models Are Greedy Reasoners: A
Systematic Formal Analysis of Chain-of-Thought. In The Eleventh International
Conference on Learning Representations, 2023.
[33] Anthropic. Introducing Contextual Retrieval. Anthropic Blog, 2024.
[34] Donald Metzler, Yi Tay, Dara Bahri, and Marc Najork. Rethinking Search:
Making Domain Experts out of Dilettantes. In ACM SIGIR Forum, volume 55,
pages 1–27. ACM New York, NY, USA, 2021.
[35] Shi Yu, Zhenghao Liu, Chenyan Xiong, Tao Feng, and Zhiyuan Liu. Few-Shot
Conversational Dense Retrieval. In Proceedings of the 44th International ACM
SIGIR Conference on research and development in information retrieval, pages
829–838, 2021.
[36] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey
Edunov, Danqi Chen, and Wen-tau Yih. Dense Passage Retrieval for Open-
Domain Question Answering. In Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online,
2020. Association for Computational Linguistics.
[37] Gautier Izacard and Edouard Grave. Distilling Knowledge from Reader to
Retriever for Question Answering. In The Ninth International Conference on
Learning Representations, 2021.
[38] Yuyu Zhang, Ping Nie, Arun Ramamurthy, and Le Song. Answering Any-hop
Open-domain Questions with Iterative Document Reranking. In Proceedings of
the 44th International ACM SIGIR Conference on Research and Development
in Information Retrieval, pages 481–490, 2021.
[40] Ryan Sze-Yin Chan, Federico Nanni, Angus Redlarski Williams, Edwin Brown,
Liam Burke-Moore, Ed Chapman, Kate Onslow, Tvesha Sippy, Jonathan Bright,
and Evelina Gabasova. Prompto: An open source library for asynchronous
querying of LLM endpoints. In Nouha Dziri, Sean (Xiang) Ren, and Shizhe
Diao, editors, Proceedings of the 2025 Conference of the Nations of the Americas
Chapter of the Association for Computational Linguistics: Human Language
Technologies (System Demonstrations), pages 106–115, Albuquerque, New Mex-
ico, April 2025. Association for Computational Linguistics.
[41] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ab-
hishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schel-
ten, Alex Vaughan, et al. The Llama 3 Herd of Models. arXiv preprint
arXiv:2407.21783, 2024.
[42] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang,
Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and
Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In
Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 2704–2713, 2018.
[43] Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. A Simple and Effective
Pruning Approach for Large Language Models. arXiv preprint arXiv:2306.11695,
2023.
[44] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the Knowledge in a
Neural Network. arXiv preprint arXiv:1503.02531, 2015.
[45] Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge
Distillation of Large Language Models. In The Twelfth International Conference
on Learning Representations, 2024.
[46] Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy
Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahri-
ari, Alexandre Ramé, et al. Gemma 2: Improving Open Language Models at a
Practical Size. arXiv preprint arXiv:2408.00118, 2024.
[47] Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, and
Aliaksei Severyn. Teaching Small Language Models to Reason. In Proceedings
of the 61st Annual Meeting of the Association for Computational Linguistics
(Volume 2: Short Papers), pages 1773–1781. Association for Computational
Linguistics, 2023.
[48] Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu,
Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, et al. A Comprehensive Survey
of Small Language Models in the Era of Large Language Models: Techniques,
Enhancements, Applications, Collaboration with LLMs, and Trustworthiness.
arXiv preprint arXiv:2411.03350, 2024.
[49] Peter Belcak, Greg Heinrich, Shizhe Diao, Yonggan Fu, Xin Dong, Saurav
Muralidharan, Yingyan Celine Lin, and Pavlo Molchanov. Small Language
Models are the Future of Agentic AI. arXiv preprint arXiv:2506.02153, 2025.
[52] Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy,
Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The
Faiss Library. arXiv preprint arXiv:2401.08281, 2024.
[53] Qwen Team. Qwen3 Technical Report. arXiv preprint arXiv:2505.09388, 2025.
[54] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and
Shunyu Yao. Reflexion: Language Agents with Verbal Reinforcement Learning.
Advances in Neural Information Processing Systems, 36:8634–8652, 2023.
[55] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach.
Prentice-Hall, 1995.
[56] Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Wenyue Hua,
Haolun Wu, Zhihan Guo, Yufei Wang, Niklas Muennighoff, et al. A Survey on
Test-Time Scaling in Large Language Models: What, How, Where, and How
Well? arXiv preprint arXiv:2503.24235, 2025.
A Appendix
A.1 Conversational RAG system prompt template
For our conversational RAG pipeline described in Section 3.3.1, below is the system
prompt template we used for the language model to generate a response when
retrieval is used.
You are a helpful clinical AI assistant deployed in the United Kingdom
You will be given a description of some of the users symptoms and some
retieved context from NHS condition web pages which provide information
about various medical conditions that could be relevant to those
symptoms.
Use the description of the users symptoms, the following retrieved context
and similarity scores for each piece of context (a lower similarity
score means the higher similarity to the patient's query) to work out
what condition(s) the user is suffering from and provide a
recommendation of what they should do next.
Never state or refer to the similarity scores to the user.
In your response, reply in English and always refer to the user in the
second person.
If you don't know the answer to a question, just say that you don't know.
If the retrieved context is not relevant to the patient's query, you should
also say that you don't know.
Retrieved context:
{context}
A.2 Conversational agent system prompt
Below is the system prompt template we used for the conversational agent described
in Section 3.3.1, which decides whether to call the retrieval tool or to respond directly.
You are provided a tool that can retrieve context from a knowledge base
taken from NHS condition web pages which provide information about
various medical conditions.
You should always use the tool to find relevant information to answer the
patient's question rather than relying on your own knowledge.
If you are confused or unsure about the user's question, you should use the
tool to find relevant information or ask the user for more information
or ask further details about their symptoms.
For follow up questions from the user, you should always use the tool to
find new relevant information to answer the user's question given the
conversation history.
You should only not use the tool in very simple messages that do not
require any context like "Hello" or "Thank you", or when the user is
just writing something random.
You can also ask the user for more information or ask further details about
their symptoms.
If you are going to reply to the user, always conclude with a question to
keep the conversation going to help the user or ask for more details
about their symptoms.
In your response, only reply in English and always refer to the user in the
second person.
Decide to use the tool at the start. Do not use the tool after you have
already started your response.
A.3 Prompt template for summarisation
For obtaining summarised versions of our documents as described in Section 4.1, we
used the following user prompt:
Summarise the document below, focusing only on symptoms and how to decide
the next course of action. Be concise - aim for a summary of 3-4
sentences or fewer, keeping only essential information.
Document:
{document}
A.4 Prompt template for synthetic data generation
```json
{
"general_demographics": {
"age": "[Realistic adult age given symptoms and severity, e.g., 20-80,
for anyone above 80 use 'above 80']",
"sex": "{sex}",
"occupation": "[A common occupation]",
"social_support": "[Specify if the patient has a social support network,
such as a partner, family member, or living carer. If applicable,
include details like the carer's role (e.g., 'My partner is here to
help me' or 'I live with my daughter who is my carer'). If no
support network is present, state 'No support network.']",
"medical_history": "[Include any relevant comorbidities, such as
diabetes, asthma, neurodegenerative conditions (e.g., Alzheimer's,
Parkinson's), allergies (e.g., to medications, food, or
environmental triggers), or other significant pre-existing health
conditions. If the person is on regular medications (e.g., insulin
for diabetes, inhalers for asthma, antihistamines for allergies, etc
.), list them as well. If there are no significant conditions,
medications, or allergies, keep it simple (e.g., 'No known issues'
or 'None relevant'). Only include specific conditions, medications,
or allergies if they are highly relevant to the current case or
commonly co-occur with the condition in question.]"
},
"symptoms_description": "[Generate a natural-sounding, first-person query
(using 'I', 'my') as if a patient is describing their symptoms to NHS
111. Ensure the described symptoms are primarily drawn from or
plausibly related to the condition content AND strongly align with the
specified severity_level. Select/adapt details from condition content
justifying the target severity (e.g., 'red flag' symptoms for Urgent
Primary Care; milder symptoms for Self-care). Ensure consistency with
the query_type. Vary tone (e.g., anxious, calm) and sentence structure
for realism. Occasionally include precise details, such as
temperature readings or numbers from previous exams (e.g., 'My
temperature is 39C or 102F'). At other times, be vague when describing
symptoms (e.g., 'I have a high temperature'). Numbers can be in
either US or UK format, depending on the context.]"
}
```
```json
{
"general_demographics": {
"age": 35,
"sex": "Female",
"occupation": "Teacher",
"social_support": "No support network",
"medical_history": "No known chronic conditions"
},
"symptoms_description": "I've had a severe headache for the past three
days that won't go away, even with painkillers. It feels like a tight
band around my head, and I'm also feeling slightly nauseous. My vision
is a bit blurry when I stand up too quickly. I don't normally get
headaches this bad, and I'm starting to feel concerned."
}
```
Reply only with the JSON output, without any additional text or explanation
Using the sources and context provided, submit the condition and the
severity level in the format: "(condition, severity)". Do not provide
any explanation to the output, only your final answer.
Remember that the condition must either be one of {sources} or "
inconclusive" if you think that the condition is not listed.
Remember that the severity level must be one of ["Self-care", "Urgent
Primary Care", "A&E"].
In this template, we present to the model the retrieved context from our retriever
along with their similarity scores (L2-norm), the user query and their demographics
(which are synthetically generated as per Section 4.2) and finally the titles of the
documents retrieved (i.e., the retrieved conditions).
for each piece of context (a lower similarity score means the higher
similarity to the patient's query).
You need to suggest the most likely condition and the level of severity.
You should use the provided tool to submit the condition and severity level.
Importantly, if you think that the condition is not listed, please use "
inconclusive" for the condition.
for each piece of context (a lower similarity score means the higher
similarity to the patient's query).
You need to suggest the most likely condition and the level of severity.
Importantly, if you think that the condition is not listed, please use "
inconclusive" for the condition.
Using the sources and context provided, submit the condition and the
severity level in the format: "(condition, severity)". Do not provide
any explanation to the output, only your final answer.
A.6.3 Prompt templates without retrieved context for models with tool-use options
System prompt
You are a clinical AI assistant.
You need to suggest the most likely condition and the level of severity.
You should decide one of these options for severity:
* A&E: Emergency hospital treatment required
* Urgent Primary Care: patient should be seen as soon as possible, by a GP,
urgent care centre, or similar
* Self-care: Issue can be handled at home and/or with over-the-counter
medication.
You should use the provided tool to submit the condition and severity level.
Importantly, if you think that the condition is not listed, please use "
inconclusive" for the condition.
Remember that the condition must either be one of the conditions listed
above or "inconclusive" if you think that the condition is not listed.
Remember that the severity level must be one of ["Self-care", "Urgent
Primary Care", "A&E"].
You need to suggest the most likely condition and the level of severity.
* Self-care: Issue can be handled at home and/or with over-the-counter
medication.
Importantly, if you think that the condition is not listed, please use "
inconclusive" for the condition.
Using the sources and context provided, submit the condition and the
severity level in the format: "(condition, severity)". Do not provide
any explanation to the output, only your final answer.
Remember that the condition must either be one of the conditions listed
above or "inconclusive" if you think that the condition is not listed.
Remember that the severity level must be one of ["Self-care", "Urgent
Primary Care", "A&E"].