Empowering LLMs in Task-Oriented Dialogues: A Domain-Independent Multi-Agent Framework and Fine-Tuning Strategy


Zihao Feng1,2 * † , Xiaoxue Wang2∗ , Bowen Wu3,2 , Weihong Zhong1 , Zhen Xu2 ,
Hailong Cao1 , Tiejun Zhao1‡ , Ying Li3 , Baoxun Wang2
1 Faculty of Computing, Harbin Institute of Technology
2 Platform and Content Group, Tencent
3 School of Software & Microelectronics, Peking University
21b903052@[Link], whzhong@[Link]
{caohailong, tjzhao}@[Link]
{yukixxwang, zenxu, asulewang}@[Link], {jason_wbw, [Link]}@[Link]
* Equal contribution.
† Zihao Feng was an intern at Tencent during the preparation of this work.
‡ Corresponding author.

arXiv:2505.14299v1 [[Link]] 20 May 2025

Abstract

Task-oriented dialogue systems based on Large Language Models (LLMs) have gained increasing attention across various industries and achieved significant results. Current approaches condense complex procedural workflows into a single agent to achieve satisfactory performance on large-scale LLMs. However, these approaches struggle to reach comparable performance on fine-tuned lightweight LLMs, due to their limited ability to handle multiple pieces of complex logic. In this work, we design a Domain-Independent Multi-Agent Framework (DIMF), which contains an Intent Classification Agent, a Slot Filling Agent and a Response Agent. This approach simplifies the learning complexity and enhances generalization by separating the task into domain-independent components. Within this framework, we strengthen contextual understanding using the Direct Preference Optimisation (DPO) method, and propose a simple and effective Data Distribution Adaptation (DDA) method to mitigate degradation issues during DPO training. Experiments conducted on the MultiWOZ dataset show that our proposed method achieves the best average performance among all the baselines. Extensive analysis also demonstrates that our proposed framework exhibits excellent generalizability and zero-shot capability.

Figure 1: Different architectures of other LLM-based systems and of our proposed system. The left part is other LLM-based systems and the right is ours. The information in the orange boxes indicates the strategies (intent extraction, slot filling, and response selection) that the agent needs to follow in the different sub-tasks.

1 Introduction

Task-oriented dialogue (TOD) systems play a significant role in both academic research and industry (Peng et al., 2022; Xu et al., 2024). Researchers have divided traditional TOD systems into the following key components (Zhang et al., 2020): 1) Natural Language Understanding (NLU) (Karanikolas et al., 2023); 2) Dialogue State Tracking (DST) (Feng et al., 2023; Heck et al., 2023; Feng et al., 2025); 3) Dialogue Policy; and 4) Natural Language Generation (NLG) (Li et al., 2020). With the development of Large Language Models (LLMs), recent research has mainly focused on leveraging the strong capabilities and generalization of LLMs to solve the complex task of TOD (Qin et al., 2023a; Algherairy and Ahmed, 2024; Chung et al., 2023). The LLM-based multi-agent approach has been proven to be effective in multi-domain TOD systems (Gupta et al.).
Existing methodologies often attempt to condense the complex procedural workflows of TOD systems into a single large-scale LLM-based agent such as GPT-4 (Achiam et al., 2023) or Claude, or divide the workflow into different domains to build multi-agent TOD systems for lightweight LLMs. Most works achieve satisfactory performance on large-scale LLMs (Xu et al., 2024; Gupta et al.). In contrast, lightweight models, even when fine-tuned for specific tasks, struggle to attain comparable completion quality (Xu et al., 2024; Gupta et al.). This discrepancy contrasts sharply with their competitive performance on other NLP tasks, suggesting that the inherent complexity of TOD necessitates specialized approaches. We posit that effective modeling of multi-step procedural logic and the development of targeted learning strategies are critical to bridging this performance gap.

To address this challenge, we propose a Domain-Independent Multi-Agent Framework (DIMF), which contains an Intent Classification Agent, a Slot Filling Agent and a Response Agent. Unlike current methods, which build multi-agent systems from different domain-specific agents, DIMF decouples the workflow into several components that are domain-independent. As illustrated in Figure 1, both phases require contextual reasoning and policy-guided decision-making capabilities, which are easily conflated in monolithic agent architectures. The task separation design stems from our observations of domain relevance and of the challenge of integrating slots from the dialogue history during the slot filling process. This approach guarantees that the agent considers the slots that match the current specific domain. Furthermore, this modular decomposition facilitates the enhancement of targeted capabilities through reinforcement learning techniques (e.g., DPO/PPO (Rafailov et al., 2023; Schulman et al., 2017)), enabling specialized optimization while maintaining domain adaptability. We therefore propose a Data Distribution Adaptation (DDA) method designed to mitigate the degradation of DPO training attributable to the diversity of domain types.

The experimental results indicate that the framework and training methodology significantly enhance the performance of the fine-tuned models. Additionally, we observe that the domain-independent design exhibits robust zero-shot capability. In conclusion, this paper offers the following contributions:

• We design a novel Domain-Independent Multi-Agent Framework for TOD systems based on LLMs. Our approach separates the complex task into three sub-tasks, which better leverages the generalization capabilities of LLMs.

• We utilize DPO during the training process, and propose a Data Distribution Adaptation method to alleviate DPO's training degradation problem.

• Our new framework and training strategy for the TOD system enhance the system's scalability and zero-shot capabilities, allowing the system to maintain good performance even on domains it has not seen before.

2 Background

2.1 Large Language Models as Agents

Recently, many efforts have been made to build systems with LLMs acting as agents that plan, make decisions, and act across various specialized APIs, dialogues, or other simpler tools to perform complex tasks (Liu et al., 2023; Liang et al., 2023; Deng et al., 2024). The ReAct method (Yao et al., 2023) is a prompting framework that has been widely used for fine-tuning LLMs with the ability to reason and act based on text. Various tasks such as logical reasoning (Du et al., 2023; Tang et al., 2023), societal simulations (Zhou et al., 2023), and tool learning (Qin et al., 2023b; Shen et al., 2024) have achieved significant performance improvements using LLMs as agents.

However, most research focuses on task-specific scenarios with poor scalability. The challenge of building LLM agents that generalize better and adapt to different tasks needs more research.

2.2 Direct Preference Optimisation (DPO)

Direct Preference Optimisation (DPO) (Rafailov et al., 2024) is a popular method for learning from human-preference data, and it has been widely leveraged to improve the performance of pre-trained LLMs on downstream tasks (Wang et al., 2023; Tunstall et al., 2023). DPO directly uses pairwise preference data for model optimization. In this way, the language model can be trained directly through the reward learning pipeline, eliminating the need for a separate reinforcement learning stage.

Although the DPO method facilitates model training, experiments demonstrate that the DPO loss has a flaw: compared to learning to generate responses preferred by humans, the DPO loss function shows a tendency for LLMs to more readily learn to avoid generating responses that humans disprefer (Feng et al., 2024). Based on this conclusion, DPO exhibits significant degradation on data where the Levenshtein distance between positive and negative examples is small. The reason is that with highly similar positive and negative examples, the DPO process tends to reject the negative examples, which in turn reduces the generation probability of the corresponding positive examples (Pal et al., 2024). Thus, the DPO process can lead to a simultaneous decrease in the rewards of both the positive and negative examples, which leads to degradation.
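For reference, the DPO objective that this discussion refers to can be written as follows; this is the standard form from Rafailov et al. (2023) in our notation, not an equation reproduced from this paper:

    \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
      -\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
      \left[ \log \sigma \left(
          \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right) \right]

where y_w and y_l are the chosen and rejected responses, \pi_{\mathrm{ref}} is the frozen SFT model, and \beta controls the deviation from the reference policy. When y_w and y_l are nearly identical, the two log-ratio terms share most of their tokens, so pushing down \pi_\theta(y_l \mid x) can also drag down \pi_\theta(y_w \mid x), which is the degradation effect described above.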
Figure 2: The main framework of our proposed method. The left part is the framework of our proposed DIMF. We train three agents to collaboratively solve users' questions and provide responses. Each agent can fulfill different user needs through different prompts, instead of training domain-specific agents (as indicated by the agents in the left part such as "Restaurant"). The right part is the framework of our training process for each agent. We first fine-tune the model with the training set, and then leverage the validation dataset to complete the DPO process.

3 Domain-Independent Multi-Agent Framework

In this section, we introduce our proposed Domain-Independent Multi-Agent Framework (DIMF) for the TOD task. We introduce the Intent Classification Agent, the Slot Filling Agent and the Response Agent separately, and describe the division of labor between the agents.

3.1 Intent Classification Agent

The Intent Classification Agent aims to extract the intent of the user's question and serves as the foundation for the subsequent agents. Specifically, this agent is provided with the user's question and the descriptions of each domain, and then produces output in the ReAct format. Besides, this task involves the user's follow-up questions regarding the historical dialogue. Therefore, we have designed a logic module in the prompt that provides the logical rules for the current round of dialogue based on the intent of the last round. Moreover, we design an "other" domain to implement the dialogue-ending intent. The details of the prompt are appended in Appendix A.1.

3.2 Slot Filling Agent

After obtaining the intent of the user's question from the Intent Classification Agent, we train a Slot Filling Agent to extract the slots of the specific domain from the query, which are required for extracting information from the database. This agent can be adapted to various domains through domain-specific prompts. In this way, we obtain a generalized Slot Filling Agent instead of training different models for different domains.

For the user's questions, there are two different types of slots: 1) slots with a corresponding value, such as "I need train reservations from Norwich to Cambridge", which contains the names of the departure and the destination; and 2) slots without a value, such as "I would also like to know the travel time, price, and departure time please", for which the value needs to be returned to the user. We design two modules to handle these two types of information separately, and provide a logical rules module in the prompt to distinguish between them.

Besides, to address the issue of slot inheritance based on the dialogue history, we have also designed a module in the Slot Filling Agent's prompt that includes the historical dialogue slots, allowing the agent to better implement this capability by integrating this information with the dialogue history. Later, according to the slot information generated by the Slot Filling Agent, we can extract the entries in the database that match the user's query. In this work, we use a rule-based approach for this extraction. The detail of the prompt is attached in Appendix A.2.
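As an illustration of how the Slot Filling Agent's output can drive this rule-based lookup, the following minimal sketch makes our own assumptions explicit: the parse_parameters helper, the flat slot names, and the toy database rows are illustrative and are not taken from the paper.

    import json

    def parse_parameters(agent_output: str) -> dict:
        """Extract the JSON object after 'Parameters:' from the agent's ReAct-style output."""
        line = next(l for l in agent_output.splitlines() if l.startswith("Parameters:"))
        return json.loads(line[len("Parameters:"):].strip())

    def rule_based_lookup(db_rows, slots):
        """Keep only the database rows whose fields match every extracted slot value."""
        return [
            row for row in db_rows
            if all(str(row.get(slot, "")).lower() in [v.lower() for v in values]
                   for slot, values in slots.items())
        ]

    # Toy restaurant-domain rows for illustration only.
    restaurant_db = [
        {"name": "graffiti", "area": "west", "pricerange": "expensive", "food": "british"},
        {"name": "city stop", "area": "north", "pricerange": "expensive", "food": "european"},
    ]
    slots = parse_parameters(
        'Action: restaurant\nParameters: {"pricerange": ["expensive"], "area": ["north"]}'
    )
    print(rule_based_lookup(restaurant_db, slots))  # -> only the "city stop" entry remains

In the actual system, the matching entries (or their count) are what the Response Agent later receives in the Observation block of its prompt (Appendix A.3).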
3.3 Response Agent

Different dialogue histories and states dictate various strategies, such as asking the user to fill in the required slots, allowing the user to refine the results, letting the user confirm or cancel, and so on. The Response Agent aims to respond to the user based on the dialogue history and states. Since the database results of each query vary, we develop the following strategies for the Response Agent to assist the user in obtaining information about the outcome during conversations.

After calling the database, the response strategy depends on the number of database results that match the user's question. If there is only one option, the agent should directly respond with the information of the specific item that the user asks about. Otherwise, the response should contain the following information: 1) the total number of available options; 2) a summary of all the options; and 3) a question asking the user for more specific information to narrow the range of available options. The detail of the prompt is attached in Appendix A.3.

4 Improving DPO Training by Data Distribution Adaptation Method

Since the multiple sub-tasks of TOD are executed over limited states, we conducted DPO training after Supervised Fine-Tuning (SFT), which is conducive to leveraging the advantages of DPO. However, due to the uncertainty in the distribution of domains in the bad cases, we encountered the degradation issue of DPO mentioned in Section 2.2. We propose a Data Distribution Adaptation (DDA) method to address this issue simply and effectively.

For the first two agents, the result for a given question always belongs to a specific domain and follows a formatted structure. Therefore, the DPO method is well-suited to leverage its strengths in this scenario. Besides, both of these agents need to follow the complex logical instructions in the prompt, which is challenging for lightweight LLMs. The DPO method can further remedy the weaknesses left by the SFT phase in training on these instructions.

When we directly leverage the DPO method to train on the bad cases in the validation set, we also encounter the issue of model degradation after DPO training, which is mentioned in Section 2.2. We analyze the bad cases and find that, compared to the SFT training data, the rejected data used by DPO has a very uneven distribution over the domains. Based on the conclusion that "the DPO loss function demonstrates a tendency for LLMs to readily learn to avoid generating responses that humans disprefer" (Feng et al., 2024), we believe that if the rejected data in the DPO phase is concentrated in a certain category, training will significantly reduce the generation probability for that category, which leads to model degradation in that category. Therefore, we generate bad cases for the other categories so that the distribution of the rejected data across all categories matches the data from the SFT phase. In this way, we effectively alleviate the degradation problem caused by DPO.

5 Experimental Setup

5.1 Dataset & Evaluation Metrics

We evaluate our proposed method on the MultiWOZ 2.2 dataset (Zang et al., 2020). The dataset is a large-scale multi-domain TOD dataset which contains 10437 conversations and is divided into training, validation, and test sets. The dataset comprises 7 domains and contains a database for querying the information of a specific domain.

We leverage the traditional evaluation metrics of the MultiWOZ 2.2 dataset, Inform, Success, and BLEU, to evaluate our proposed method. The Inform rate checks whether the system finds the right entity for the user. The Success rate checks whether the system provides all the required entity attributes to the user. BLEU measures the fluency of the responses compared to the references, which are delexicalized. Finally, the Combine score is a comprehensive metric indicating the overall performance, formulated as Combine = (Inform + Success) / 2 + BLEU. Besides, we leverage the Conditional Bigram Entropy (CBE), #unique words and #unique 3-grams to evaluate the richness of the responses.
5.2 Baselines & Setup

We compare our proposed method with both traditional systems and LLM-based systems. We choose several strong baselines fine-tuned on traditional language models, including GALAXY (He et al., 2022), TOATOD (Bang et al., 2023), Mars-G (Sun et al., 2023), KRLS (Yu et al., 2023), DiactTOD (Wu et al., 2023), and SUIT (Kaiser et al., 2024). For the LLM-based systems, we evaluate the SGP-TOD (Zhang et al., 2023) method, which builds the TOD system with GPT-3.5. Besides, we compare our method with the state-of-the-art LLM-based method, DARD (Gupta et al.). Since the code of DARD was not provided, we independently replicate the DARD method on the Qwen2.5-7B model.

We select Qwen2.5-7B-Instruct (Yang et al., 2024) as the foundation model for our proposed method. The details of our training settings are attached in Appendix B.

Model                                   BLEU   Inform   Success   Combined   CBE    #uniq. words   #uniq. 3-grams
Traditional model:
GALAXY (He et al., 2022)                19.6   85.4     75.7      100.2      1.75   295            2275
TOATOD (Bang et al., 2023)              17.0   90.0     79.8      101.9      -      -              -
Mars-G (Sun et al., 2023)               19.9   88.9     78.0      103.4      1.65   288            2264
KRLS (Yu et al., 2023)                  19.0   89.2     80.3      103.8      1.90   494            3884
DiactTOD (Wu et al., 2023)              17.5   89.5     84.2      104.4      2.00   418            4477
SUIT (DPO-SFT) (Kaiser et al., 2024)    16.5   90.0     87.1      105.1      -      -              -
Large Language Model (LLM):
Mistral-7B DARD (Gupta et al.)          15.2   78.8     61.2      85.2       2.79   993            13317
Qwen2.5-7B DARD                         14.9   80.1     61.5      85.7       2.14   902            12974
SGP-TOD-GPT3.5 (Zhang et al., 2023)     9.2    82.0     72.5      86.5       -      -              -
Claude Sonnet 3.0 DARD (Gupta et al.)   9.5    95.6     88.0      101.3      2.37   1197           13742
Ours:
Qwen2.5-7B DIMF w/o DPO                 14.8   90.3     75.4      97.7       2.73   1139           14305
Qwen2.5-7B DIMF                         18.7   92.4     82.8      106.3      2.81   1231           14328

Table 1: End-to-end response generation evaluation results on the MultiWOZ 2.2 dataset. All results of the traditional models are cited from the official leaderboard. We obtain the results of the LLM-based models by executing their publicly accessible resources. Bold indicates the best score among all systems.

6 Experiments

6.1 Main Results

We present the results of our proposed DIMF and other baselines in Table 1. Specifically, each agent in DIMF is first fine-tuned on the entire training set under supervision and then trained using the DPO method on the validation set. The results show that our proposed method achieves the best Combined score among all the baselines.

Compared with the traditional models, DIMF is more powerful in slot extraction, which corresponds to the Inform and Success scores. This also demonstrates that separating the complex task, as in our DIMF, can effectively enhance the system's capability. As for the LLM-based systems, our model outperforms models of the same size on all evaluation metrics. The results of the DARD method on the Qwen model further prove the advancement of our method. Besides, compared to the large-scale LLMs, our method shows a significant improvement in BLEU. Moreover, unlike the DARD method, we use a single model for all domains, which demonstrates the better generalization of our method.

The last three metrics evaluate the textual richness of the model responses. The results show that our method significantly outperforms the other models. This also demonstrates an advantage of LLMs over the traditional models: the diversity of responses can provide users with a better interactive experience in real-world scenarios.

6.2 Results of the Data Distribution Adaptation Method for DPO Training

In this section, we aim to demonstrate that our Data Distribution Adaptation method can effectively mitigate the issue of DPO degradation. The test set contains 5 domains with different numbers of dialogues: Attraction (396), Hotel (394), Restaurant (437), Taxi (195) and Train (495). We present the results for each domain in Table 2. We define that if the performance of a specific domain drops below the average accuracy, then the model has a degradation issue in that domain. Due to testing issues, the Inform score for the Taxi domain did not change. The distribution of bad cases on the test set is similar to that on the validation set, so we directly analyze the results on the test set for the two DPO methods.
Model        Attraction             Hotel                  Restaurant             Taxi                   Train
             BLEU  Info.  Succ.     BLEU  Info.  Succ.     BLEU  Info.  Succ.     BLEU  Info.  Succ.     BLEU  Info.  Succ.
Base System (all agents trained with SFT)
DIMF-base    14.8  98.7   83.2      14.2  89.6   74.8      13.7  96.2   85.3      15.2  100.0  85.1      15.0  90.1   78.1
w/ Intent Classification Agent DPO
DPO-Ori      11.9  86.3   71.0      13.1  90.0   75.2      12.2  90.2   79.1      12.7  100.0  73.3      15.0  90.5   80.0
DPO-DDA      14.8  99.1   83.7      13.7  90.3   76.7      13.6  96.2   85.3      15.6  100.0  86.0      14.9  91.4   78.4
w/ Intent Classification Agent DPO-DDA & Slot Filling Agent DPO
DPO-Ori      11.0  81.7   69.4      12.7  80.5   73.1      12.9  83.4   73.3      14.8  100.0  79.1      12.5  79.6   71.9
DPO-DDA      17.1  99.1   90.2      16.2  90.6   83.6      15.9  96.2   89.7      17.1  100.0  88.2      16.7  90.8   83.2
w/ Intent Classification Agent DPO-DDA & Slot Filling Agent DPO-DDA & Response Agent DPO
DPO-Ori      19.6  99.1   90.2      17.3  91.0   83.1      16.0  96.2   89.0      18.8  100.0  89.6      19.2  92.3   82.7
DPO-DDA      19.4  99.1   90.2      17.7  91.3   84.0      16.3  96.5   89.7      18.6  100.0  89.6      19.5  92.3   83.2

Table 2: Results of different DPO training methods on each agent of DIMF. Gray cells in the original table indicate degraded results. DPO-Ori represents the original DPO training method, which directly leverages the bad cases for training. DPO-DDA represents our proposed Data Distribution Adaptation method.

Intent Classification Agent: After SFT training, most of the errors are concentrated in the Hotel and Train domains. Therefore, these two domains tend to appear more frequently in the chosen data of the original DPO method, while most of the rejected data belongs to the other three domains. The results show that this skewed distribution of the rejected data in the original DPO training leads to the decrease in these three domains.

Slot Filling Agent: During the DPO training phase of the Slot Filling Agent, the degradation issue appeared in more domains. We find that many bad cases at this stage occurred when information from multiple rounds of dialogue needed to be inherited. These bad cases were very unevenly distributed across the different slot categories, such as area, leading to degradation in various domains.

Response Agent: The degradation issue of DPO is not significant for the Response Agent.

Training Rewards: We show the training rewards of the chosen data and rejected data during the DPO training of the Slot Filling Agent in Figure 3. In an ideal situation, "reward_chosen" should be greater than 0 and increase as training progresses, while "reward_rejected" should be less than 0 and decline. As we can see, with the original DPO method the chosen reward decreases and becomes less than 0. This issue leads to the degradation of the DPO training process, which confirms our analysis above. Our proposed DDA method efficiently addresses this problem, as shown in the right part of the figure. The experimental results demonstrate the effectiveness of our DDA-based DPO method. The results for the other agents are appended in Appendix C.

Figure 3: The rewards of the chosen data and rejected data during the Slot Filling Agent DPO training. The left figure is the original DPO method and the right one is our proposed DDA method. The red line represents the reward of 0.
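The quantities plotted in Figure 3 follow the usual DPO convention of implicit rewards; assuming that convention (our notation, not a definition given in the paper), the curves correspond to

    r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}

evaluated on the chosen response ("reward_chosen") and the rejected response ("reward_rejected"). A healthy run keeps the chosen reward above zero and rising while the rejected reward stays below zero, which is exactly the criterion used in the analysis above.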

6.3 Zero-shot Evaluation

We evaluate the zero-shot capabilities of our proposed framework in this section. For each agent in our method, we remove the data of one domain during the training process. We show the performance of the total system and of each domain after removing the specific domain in Figure 4.

The first sub-figure presents the results of the system. The x-axis represents the original system and the systems trained after removing the data of different domains. The results indicate that, except for the Hotel and Train domains, the performance of the system does not decrease significantly compared to the original system after removing the other domains. As for Hotel and Train, the results in Table 2 show that these two domains are more challenging, and our system performs relatively poorly on them. We believe this is the reason for the decline in performance. Nevertheless, the performance of our proposed method still exceeds that of the same-sized LLMs in Table 1 in these two experiments. The results demonstrate that our method enhances the generalization ability of the TOD system by refining the tasks within the system.
Figure 4: The results of the DIMF after removing the training data of a specific domain. The first sub-figure shows the results of the whole system after removing different domains. The other sub-figures show the performance of each domain after removing a specific domain respectively.

The other sub-figures present the results on each domain after removing different domains. The results indicate that the accuracy of a specific domain decreases after removing its corresponding data, particularly in the Hotel and Train domains, which confirms the analysis in the last paragraph. Besides, we also observed that the performance of some other domains declined after removing one domain. We think that this may be caused by the reduction in data diversity. Moreover, we find that the zero-shot setting has little impact on the BLEU metric.

Model                       BLEU   Inform   Success   Combined
Qwen2.5-7B Single Agent     10.3   59.8     37.4      58.9
Qwen2.5-7B Two Agents       14.9   80.1     61.5      85.7
Qwen2.5-7B DIMF w/o DPO     14.8   90.3     75.4      97.7

Table 3: Ablation study results on our proposed DIMF. We compare the performance of frameworks with different numbers of agents, all trained with the SFT method.

Model                       BLEU   Inform   Success   Combined
Qwen2.5-7B DIMF             18.7   92.4     82.8      106.3
w/o R. DPO                  16.8   91.2     81.3      103.1
w/o R. & S. DPO             14.6   91.2     76.8      98.6
w/o R. & S. & I. DPO        14.8   90.3     75.4      97.7

Table 4: Ablation study results on our proposed DDA-based DPO method. R., S. and I. represent the Response Agent, Slot Filling Agent and Intent Classification Agent respectively. Each row removes the DPO training of one more agent relative to the row above it.

6.4 Ablation Studies

6.4.1 Ablation Studies on Framework

In this section, we evaluate different frameworks to demonstrate the advantage of our DIMF. Specifically, we combine all the training data of our proposed three agents to train a single agent for the TOD task. Besides, we combine the intent classification and slot filling agents into a single agent to train a two-agent system. All the frameworks are trained with the SFT method. As shown in Table 3, the DIMF brings a significant improvement to the system, especially on the Inform and Success metrics, which demonstrates the better accuracy of our DIMF.

6.4.2 Ablation Studies on DPO

In order to better understand the effect of the DPO training method on each agent, we perform an ablation test and present the results in Table 4. All the results in this section are obtained using our proposed DDA training strategy for DPO. The results show that DPO training improves the accuracy of each stage in the system, thereby alleviating the problem of error accumulation.

As we can see in Table 4, compared to the other two agents, the improvement of DPO in the Intent Classification Agent is limited. We believe this is because the model trained after SFT already possesses relatively good capabilities. However, the Slot Filling Agent and the Response Agent still show significant improvement in the BLEU and Success metrics after our DDA-based DPO training. The experimental results also demonstrate that, compared to other methods, our DIMF approach, which trains the Slot Filling Agent separately and isolates the Response Agent, is very effective in enhancing performance in the TOD system.
Figure 5: An example of one round of the conversation between the user and our DIMF. This case contains the history of the conversation, the question of the user, and the generation process of DIMF trained with different methods (the two panels show the agents' inputs and outputs after SFT-only training and after DPO training). Red words represent incorrect information and responses, and green represents correct ones.

6.5 Case Study

To further understand the detailed process of our method, we provide a case study that contains the output of each agent for a specific user question. We select a more challenging case that requires inheriting information from the historical dialogue.

As shown in Figure 5, when our system receives a user question, the question is first passed directly to the Intent Classification Agent, without the dialogue history, to obtain the user's intent. Next, the slot prompt of this specific domain, together with the dialogue history, is input into the Slot Filling Agent to obtain the specific information in this domain that the user needs to inquire about. Finally, the results queried from the database are input into the Response Agent to obtain the response for the user.

In this case, we can see that the user does not specify the information for the "area" slot directly. The system needs to inherit this information and remove another, now irrelevant slot, "cheap", from the last intent. The Slot Filling Agent implements this ability through the logic rule about inheriting historical dialogue information in its prompt. However, as shown in this case, lightweight LLMs trained with the SFT method alone cannot fully learn this capability and sometimes make mistakes on this issue. The DPO method provides targeted training for this capability, effectively remedying the shortcomings of the SFT method and improving the system's performance.
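A minimal sketch of the inheritance behaviour this case relies on is given below. In DIMF this logic is expressed as rules in the Slot Filling Agent's prompt and carried out by the LLM itself; the helper and its keep-list are purely illustrative and only make the intended outcome concrete.

    def inherit_slots(history_slots: dict, current_slots: dict, shared_slots=("area",)) -> dict:
        """Carry over history slots that remain relevant and let the current turn override them."""
        merged = {k: v for k, v in history_slots.items() if k in shared_slots}  # keep e.g. "area"
        merged.update(current_slots)  # new values (e.g. pricerange=expensive) replace stale ones
        return merged

    history = {"area": ["north"], "pricerange": ["cheap"]}   # carried over from the hotel turns
    current = {"pricerange": ["expensive"]}                  # extracted from the restaurant request
    print(inherit_slots(history, current))
    # {'area': ['north'], 'pricerange': ['expensive']}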
7 Conclusion

In this work, we propose a new framework, the Domain-Independent Multi-Agent Framework (DIMF), for TOD systems. We separate the original complex task into three sub-tasks, handled by the Intent Classification Agent, the Slot Filling Agent, and the Response Agent, which reduces the complexity of each agent and makes the performance of lightweight LLMs more reliable. Our framework trained on Qwen2.5-7B achieves better performance than all the baselines. Besides, during the training process, we leverage the advantages of the DPO method on this task to address the deficiencies in understanding the logical rules in prompts during the SFT process. We propose a Data Distribution Adaptation (DDA) method to mitigate the degradation issues of DPO. The results prove that our method is easy to implement and effective. Moreover, we demonstrate that our system can better utilize the generalization capabilities of LLMs and has good zero-shot ability.

8 Limitations

In this work, with a carefully designed TOD framework, we have revealed that current systems for TOD tasks severely suffer from insufficient task independence and model scalability. We further propose the DIMF and the DDA training method to mitigate this phenomenon. However, our work still has limitations. Firstly, during the tool invocation stage, we directly access the database based on the results of the Slot Filling Agent. When facing more diverse, complex, or real tools, it may be necessary for the model to generate a unified invocation statement to address this issue. Secondly, our current reinforcement learning method mainly leverages the improved DPO method. Group Relative Policy Optimization (GRPO) (Shao et al., 2024) now shows impressive performance, and we will apply this new method to our framework in future work.

References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, and 1 others. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Atheer Algherairy and Moataz Ahmed. 2024. A review of dialogue systems: current trends and future directions. Neural Computing and Applications, 36(12):6325–6351.

Namo Bang, Jeehyun Lee, and Myoung-Wan Koo. 2023. Task-optimized adapters for an end-to-end task-oriented dialogue system. In Findings of the Association for Computational Linguistics: ACL 2023, pages 7355–7369, Toronto, Canada. Association for Computational Linguistics.

Willy Chung, Samuel Cahyawijaya, Bryan Wilie, Holy Lovenia, and Pascale Fung. 2023. InstructTODS: Large language models for end-to-end task-oriented dialogue systems. arXiv preprint arXiv:2310.08885.

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. 2024. Mind2Web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36.

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. 2023. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325.

Duanyu Feng, Bowen Qin, Chen Huang, Zheng Zhang, and Wenqiang Lei. 2024. Towards analyzing and understanding the limitations of DPO: A theoretical perspective. arXiv preprint arXiv:2404.04626.

Yujie Feng, Zexin Lu, Bo Liu, Liming Zhan, and Xiao-Ming Wu. 2023. Towards LLM-driven dialogue state tracking. arXiv preprint arXiv:2310.14970.

Zihao Feng, Xiaoxue Wang, Ziwei Bai, Donghang Su, Bowen Wu, Qun Yu, and Baoxun Wang. 2025. Improving generalization in intent detection: GRPO with reward-based curriculum sampling. arXiv preprint arXiv:2504.13592.

Aman Gupta, Anirudh Ravichandran, Ziji Zhang, Swair Shah, Anurag Beniwal, and Narayanan Sadagopan. DARD: A multi-agent approach for task-oriented dialog systems. In NeurIPS 2024 Workshop on Open-World Agents.

Wanwei He, Yinpei Dai, Yinhe Zheng, Yuchuan Wu, Zheng Cao, Dermot Liu, Peng Jiang, Min Yang, Fei Huang, Luo Si, and 1 others. 2022. GALAXY: A generative pre-trained model for task-oriented dialog with semi-supervised learning and explicit policy injection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 10749–10757.

Michael Heck, Nurul Lubis, Benjamin Ruppik, Renato Vukovic, Shutong Feng, Christian Geishauser, Hsien-Chin Lin, Carel van Niekerk, and Milica Gašić. 2023. ChatGPT for zero-shot dialogue state tracking: A solution or an opportunity? arXiv preprint arXiv:2306.01386.

Magdalena Kaiser, Patrick Ernst, and György Szarvas. 2024. Learning from relevant subgoals in successful dialogs using iterative training for task-oriented dialog systems. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 6236–6246.

Nikitas Karanikolas, Eirini Manga, Nikoletta Samaridi, Eleni Tousidou, and Michael Vassilakopoulos. 2023. Large language models versus natural language understanding and generation. In Proceedings of the 27th Pan-Hellenic Conference on Progress in Computing and Informatics, pages 278–290.

Yangming Li, Kaisheng Yao, Libo Qin, Wanxiang Che, Xiaolong Li, and Ting Liu. 2020. Slot-consistent NLG for task-oriented dialogue systems with iterative rectification network. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 97–106, Online. Association for Computational Linguistics.

Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. 2023. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118.
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, and 1 others. 2023. AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688.

Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, and Colin White. 2024. Smaug: Fixing failure modes of preference optimisation with DPO-positive. arXiv preprint arXiv:2402.13228.

Baolin Peng, Michel Galley, Pengcheng He, Chris Brockett, Lars Liden, Elnaz Nouri, Zhou Yu, Bill Dolan, and Jianfeng Gao. 2022. GODEL: Large-scale pre-training for goal-directed dialog. arXiv preprint arXiv:2206.11309.

Libo Qin, Wenbo Pan, Qiguang Chen, Lizi Liao, Zhou Yu, Yue Zhang, Wanxiang Che, and Min Li. 2023a. End-to-end task-oriented dialogue: A survey of tasks, methods, and future directions. arXiv preprint arXiv:2311.09008.

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, and 1 others. 2023b. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv preprint arXiv:2307.16789.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2024. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, and 1 others. 2024. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

Weizhou Shen, Chenliang Li, Hongzhan Chen, Ming Yan, Xiaojun Quan, Hehong Chen, Ji Zhang, and Fei Huang. 2024. Small LLMs are weak tool learners: A multi-LLM agent. Preprint, arXiv:2401.07324.

Haipeng Sun, Junwei Bao, Youzheng Wu, and Xiaodong He. 2023. Mars: Modeling context & state representations with contrastive learning for end-to-end task-oriented dialog. In Findings of the Association for Computational Linguistics: ACL 2023, pages 11139–11160.

Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. 2023. MedAgents: Large language models as collaborators for zero-shot medical reasoning. arXiv preprint arXiv:2311.10537.

Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, and 1 others. 2023. Zephyr: Direct distillation of LM alignment. arXiv preprint arXiv:2310.16944.

Peiyi Wang, Lei Li, Liang Chen, Feifan Song, Binghuai Lin, Yunbo Cao, Tianyu Liu, and Zhifang Sui. 2023. Making large language models better reasoners with alignment. arXiv preprint arXiv:2309.02144.

Qingyang Wu, James Gung, Raphael Shu, and Yi Zhang. 2023. DiactTOD: Learning generalizable latent dialogue acts for controllable task-oriented dialogue systems. arXiv preprint arXiv:2308.00878.

Heng-Da Xu, Xian-Ling Mao, Puhai Yang, Fanshu Sun, and He-Yan Huang. 2024. Rethinking task-oriented dialogue systems: From complex modularity to zero-shot autonomous agent. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2748–2763.

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, and 22 others. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR).

Xiao Yu, Qingyang Wu, Kun Qian, and Zhou Yu. 2023. KRLS: Improving end-to-end response generation in task-oriented dialog with reinforced keywords learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12338–12358.

Xiaoxue Zang, Abhinav Rastogi, Srinivas Sunkara, Raghav Gupta, Jianguo Zhang, and Jindong Chen. 2020. MultiWOZ 2.2: A dialogue dataset with additional annotation corrections and state tracking baselines. arXiv preprint arXiv:2007.12720.

Xiaoying Zhang, Baolin Peng, Kun Li, Jingyan Zhou, and Helen Meng. 2023. SGP-TOD: Building task bots effortlessly via schema-guided LLM prompting. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 13348–13369, Singapore. Association for Computational Linguistics.
Zheng Zhang, Ryuichi Takanobu, Qi Zhu, MinLie Huang, and XiaoYan Zhu. 2020. Recent advances and challenges in task-oriented dialog systems. Science China Technological Sciences, 63(10):2011–2027.

Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, and 1 others. 2023. Sotopia: Interactive evaluation for social intelligence in language agents. arXiv preprint arXiv:2310.11667.

A Prompt

A.1 Prompt of Intent Classification Agent

We show an example of the Intent Classification Agent prompt at the second round of the conversation in Table 5.

A.2 Prompt of Slot Filling Agent

We show an example of the Slot Filling Agent prompt for the restaurant domain at the second round of the conversation in Table 6.

A.3 Prompt of Response Agent

We show an example of the Response Agent prompt in Table 7.

B DDA Data Generating Method

We generate the training dataset tailored to each agent for the SFT method based on the MultiWOZ 2.2 dataset. For the DDA method, the data-generating procedure is as follows.

We first introduce how the preference pairs are constructed:

• Positive samples: Responses with correct intent/slot predictions. For the Response Agent, we select good cases based on a certain BLEU threshold.

• Negative samples: Responses with incorrect predictions, and responses under the threshold.

To conduct the DDA method, our negative example sampling strategies for distribution balancing are as follows (a sketch of the balancing step is given after this list):

• Intent Classification Agent: We randomly replace target intents with incorrect ones.

• Slot Filling Agent: We either replace slot values with other values from the dialogue context or remove values from multi-value slots.

• Response Generation Agent: We modify response rules to generate contextually inappropriate responses.

All the agents are fully fine-tuned on 8 A100 GPUs with 40GB of memory for 2 epochs.
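A minimal sketch of the distribution-balancing step described above, under our own assumptions: generate_bad_case is a hypothetical stand-in for the corruption strategies listed above, and the target computation is an approximation (we only add rejected samples, never remove them).

    from collections import Counter
    import random

    def balance_rejected(preference_pairs, sft_data, generate_bad_case):
        """Pad the rejected data so its domain distribution follows the SFT data distribution."""
        sft_counts = Counter(ex["domain"] for ex in sft_data)
        rej_counts = Counter(pair["domain"] for pair in preference_pairs)
        total_rej = sum(rej_counts.values())

        balanced = list(preference_pairs)
        for domain, sft_count in sft_counts.items():
            # target share of rejected data for this domain, following the SFT distribution
            target = round(total_rej * sft_count / sum(sft_counts.values()))
            for _ in range(max(0, target - rej_counts[domain])):
                seed = random.choice([ex for ex in sft_data if ex["domain"] == domain])
                # chosen = gold output, rejected = corrupted output for this domain
                balanced.append(generate_bad_case(seed))
        return balanced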
C DPO Training Loss

We present the reward curves of the Intent Classification Agent and the Response Agent in Figure 6 and Figure 7. Compared to the Slot Filling Agent, the degradation issues of the original DPO method are not as severe for these two models. The Intent Classification Agent experienced a reduction in the chosen reward, while the training of the Response Agent was relatively normal.

Figure 6: The rewards of the chosen data and rejected data during the Intent Classification Agent DPO training.

Figure 7: The rewards of the chosen data and rejected data during the Response Agent DPO training.
Table 5: Intent Classification Agent prompt

You are an agent that helps users choose the right tool or tools from the given tools list to solve their problems.

For each tool, you are first given its description and required parameters. Then, a logic module specifically explains the
logical information needed for this tool to handle multi-turn conversation issues.

## Tool APIs

find_hotel: search for a hotel to stay in


book_hotel: book a hotel to stay in
find_train: search for trains that take you places
book_train: book train tickets
find_attraction: search for places to see for leisure
find_restaurant: search for places to wine and dine
book_restaurant: book a table at a restaurant
find_hospital: search for a medical facility or a doctor
find_taxi: find or book taxis to travel between places
find_bus: search for a bus
find_police: search for police station
other: This tool is used to handle problems that cannot be addressed by any other tools.

## Task Logic
If last query is find_restaurant, the user can use the same tool for the following types of query:
- restaurant-pricerange: price budget for the restaurant. only allowed values: [cheap, expensive, moderate]
- restaurant-area: area or place of the restaurant. only allowed values: [centre, east, north, south, west]
- restaurant-food: the cuisine of the restaurant you are looking for.
- restaurant-name: name of the restaurant.
- restaurant-bookday: day of the restaurant booking. only allowed values:
[monday, tuesday, wednesday, thursday, friday, saturday, sunday]
- restaurant-bookpeople: how many people for the restaurant reservation. only allowed values: [1, 2, 3, 4, 5, 6, 7, 8]
- restaurant-booktime: time of the restaurant booking.

## Output Format

Use the following format:

Last Tool: the tool used in last query


Question: the input question you must answer
Action: the action to take
Finish!

Begin!

Last Tool: find_restaurant


Question: Any sort of food would be fine. Could I get the phone number for your recommendation?
Table 6: Slot Filling Agent prompt

You are an agent whose goal is to extract the required tool parameters and the content the user wants to query from their questions.

For a specific query, you are first given the parameters corresponding to the restaurant tool. Besides, you have also been informed the information
that the specific information this tool can query. Finally, you are given the logic distinguish between Tool Parameters and Tool Information.

## Tool Parameters

restaurant-pricerange: price budget for the restaurant. only allowed values: [cheap, expensive, moderate]
restaurant-area: area or place of the restaurant. only allowed values: [centre, east, north, south, west]
restaurant-food: the cuisine of the restaurant you are looking for.
restaurant-name: name of the restaurant.
restaurant-bookday: day of the restaurant booking. only allowed values: [monday, tuesday, wednesday, thursday, friday, saturday, sunday]
restaurant-bookpeople: how many people for the restaurant reservation. only allowed values: [1, 2, 3, 4, 5, 6, 7, 8]
restaurant-booktime: time of the restaurant booking.

## Tool Information

The user can use restaurant tool to query the following questions:
address: the address of the restaurant.
area: the location information of the restaurant can be selected from the following options: [east, south, west, north].
food: the food of the restaurant.
id: the id number of the restaurant.
introduction: the introduction of the restaurant.
location: the coordinates of the restaurant.
name: the name of the restaurant.
phone: the phone of the.
postcode: the postcode of the restaurant.
pricerange: the level of the price of the restaurant.
type: .

## Task Logic

- If the user’s question includes a slot name and the slot value, then this query information
belongs to the tool Parameters, and output must in a JSON type.
- If the user’s question only includes a slot name without value, then this query information belongs to the tool Information.
- If the user needs information from the historical conversation, you can obtain it from the History Conversation slot.

## History Conversation slot

restaurant:
"area": ["centre"], "pricerange": ["expensive"]

## Output Format

Use the following format:

Question: the input question you must answer


Action: the tool that user used
Parameters: must a JSON object of the slot with its value
Information: the tool information in a list object
Finish!

Begin!

Question: Any sort of food would be fine, as long as it is a bit expensive. Could I get the phone number for your recommendation?
Action: restaurant
Table 7: Response Agent prompt

You act as an AI assistant to reponse user’s question relied some given informations.
You should always communicate with the user in the first person and respond in a personified manner.
The Question is: I need train reservations from norwich to cambridge

## Responce Rules

You should respond according to the following rules:

Make a conclusion based on the the user’s question, Observation and conversation history. If there are several options,
you can first respond the total number of the option, make a conclusion of the "conclusion informations" and then ask the
question about the informations in "question content"
- example: "I have xxx options matching your request. Waht’s the xxx you want to xxx"
- example with conclusion informations: "I have xxx options matching your request. The range of xxx in these options is xxx.
Waht’s the xxx you want to xxx"
If there is only one options, you can make a conclusion if it and respond to the user.
All the specific information in the response should be in this format: [type_name]

## Observation

train information:
option number: 133
question content: arriveby, leaveat, trainid, day, price
conclusion informations:
arriveby: 06:35, 07:35, 08:35, 09:35, 21:35, 22:35, 23:35, 24:35
leaveat: 05:16, 06:16, 07:16, 08:16, 20:16, 21:16, 22:16, 23:16

## Note

You should respond with more varied expressions.


Your respond should contain all the information in Observation, and your reply should no more than 25 words.

Your Response:
