An LLM Compiler for Parallel Function Calling
Sehoon Kim*1, Suhong Moon*1, Ryan Tabrizi1, Nicholas Lee1, Michael W. Mahoney1,2,3, Kurt Keutzer1, Amir Gholami1,2
*Equal contribution. 1UC Berkeley, 2ICSI, 3LBNL. Correspondence to: Amir Gholami <[email protected]>. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

The reasoning capabilities of recent LLMs enable them to execute external function calls to overcome their inherent limitations, such as knowledge cutoffs, poor arithmetic skills, or lack of access to private data. This development has allowed LLMs to select and coordinate multiple functions based on the context to tackle more complex problems. However, current methods for function calling often require sequential reasoning and acting for each function, which can result in high latency, cost, and sometimes inaccurate behavior. To address this, we introduce LLMCompiler, which executes functions in parallel to efficiently orchestrate multiple function calls. Drawing inspiration from the principles of classical compilers, LLMCompiler enables parallel function calling with three components: (i) a Function Calling Planner, formulating execution plans for function calling; (ii) a Task Fetching Unit, dispatching function calling tasks; and (iii) an Executor, executing these tasks in parallel. LLMCompiler automatically generates an optimized orchestration for the function calls and can be used with both open-source and closed-source models. We have benchmarked LLMCompiler on a range of tasks with different patterns of function calling. We observe consistent latency speedup of up to 3.7×, cost savings of up to 6.7×, and accuracy improvement of up to ∼9% compared to ReAct. Our code is available at https://github.com/SqueezeAILab/LLMCompiler.

1. Introduction

Recent advances in the reasoning capability of Large Language Models (LLMs) have expanded the applicability of LLMs beyond content generation to solving complex problems (Besta et al., 2023; Chen et al., 2023b; Gao et al., 2022; Kojima et al., 2023; Wang et al., 2023b; Wei et al., 2022; Yang et al., 2022; Yao et al., 2023b; Zhou et al., 2023b); recent works have also shown how this reasoning capability can be helpful in improving accuracy for solving complex and logical tasks. The reasoning capability has also enabled function (i.e., tool) calling, where LLMs can invoke provided functions and use the function outputs to help complete their tasks. These functions range from a simple calculator that can invoke arithmetic operations to more complex LLM-based functions.

The ability of LLMs to integrate various tools and function calls could enable a fundamental shift in how we develop LLM-based software. However, this brings up an important challenge: what is the most effective approach to incorporate multiple function calls? A notable approach has been introduced in ReAct (Yao et al., 2022), where the LLM calls a function, analyzes the outcomes, and then reasons about the next action, which involves a subsequent function call. For a simple example illustrated in Fig. 1 (Left), where the LLM is asked if Scott Derrickson and Ed Wood have the same nationality, ReAct initially analyzes the query and decides to use a search tool to search for Scott Derrickson. The result of this search (i.e., observation) is then concatenated back to the original prompt for the LLM to reason about the next action, which invokes another search tool to gather information about Ed Wood.

ReAct has been a pioneering work in enabling function calling, and it has been integrated into several frameworks (Langchain; Liu, 2022). However, scaling this approach for more complex applications requires considerable optimizations. This is due to the sequential nature of ReAct, where it executes function calls and reasons about their observations one after the other. This approach, along with the agent systems that extend ReAct (Khot et al., 2023; Qin et al., 2023; Ruan et al., 2023b; Sumers et al., 2023; Yao et al., 2023b), may lead to inefficiencies in latency and cost, due to the sequential function calling and repetitive LLM invocations for each reasoning and action step. Furthermore, while dynamic reasoning about the observations has benefits in certain cases, concatenating the outcomes of intermediate function calls could disrupt the LLM's execution flow, potentially reducing accuracy (Xu et al., 2023). Common failure cases include repetitive invocation of the same function, which is also highlighted in the original paper (Yao et al., 2022), and early stopping based on partial intermediate results, as will be further discussed in Sec. 5.1 and Appendix A.
Figure 1. An illustration of the runtime dynamics of LLMCompiler, in comparison with ReAct (Yao et al., 2022), given a sample
question from the HotpotQA benchmark (Yang et al., 2018). In LLMCompiler (Right), the Planner first decomposes the query into
several tasks with inter-dependencies. The Executor then executes multiple tasks in parallel, respecting their dependencies. Finally,
LLMCompiler joins all observations from the tool executions to produce the final response. In contrast, sequential tool execution of the
existing frameworks like ReAct (Left) leads to longer execution latency. In this example, LLMCompiler attains a latency speedup of
1.8× on the HotpotQA benchmark. While a 2-way parallelizable question from HotpotQA is presented here for the sake of simple visual
illustration, LLMCompiler is capable of managing tasks with more complex dependency patterns (Fig. 2 and Sec. 5).
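To make the contrast with ReAct's sequential loop concrete, the plan that a parallel planner could emit for the Fig. 1 question might look as follows. The $-placeholder syntax anticipates the Planner format shown later in Fig. 2; the specific plan text here is an illustrative assumption, not output captured from LLMCompiler.

# Hypothetical planner output for "Do Scott Derrickson and Ed Wood have the
# same nationality?" -- the two searches carry no inter-dependency, so they can
# be dispatched simultaneously, unlike ReAct's search -> reason -> search loop.
plan = """
$1 = search(Scott Derrickson)
$2 = search(Ed Wood)
$3 = llm(Do the results in $1 and $2 indicate the same nationality?)
"""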
To address this challenge, we draw inspiration from classical compilers, where optimizing instruction executions in traditional programming languages has been extensively explored. A key optimization technique in compilers involves identifying instructions that can be executed in parallel and effectively managing their dependencies. Similarly, one can envision a compiler, tailored for LLM function calling, which can efficiently orchestrate various function calls and their dependencies. This shares a similar philosophy with recent studies that align LLMs with computer systems (Karpathy, 2023; Packer et al., 2023). To this end, we introduce LLMCompiler, a novel framework that enables parallel multi-tool execution of LLMs across different models and workloads. To the best of our knowledge, LLMCompiler is the first framework to optimize the orchestration of LLM function calling that can not only improve latency and cost, but also accuracy, by minimizing interference from the outputs of intermediate function calls. In more detail, we make the following contributions:

• We introduce LLMCompiler, an LLM compiler that optimizes the parallel function calling performance of LLMs. At a high level, this is achieved by introducing three key components: (i) a Function Calling Planner (Sec. 3.1) that identifies an execution flow; (ii) a Task Fetching Unit (Sec. 3.2) that dispatches the function calls in parallel; and (iii) an Executor (Sec. 3.3) that executes the dispatched tasks using the associated functions.

• We evaluate LLMCompiler on embarrassingly parallel patterns using HotpotQA (Yang et al., 2018) and Movie Recommendation (Srivastava et al., 2022), where we observe 1.80×/3.74× speedup and 3.37×/6.73× cost reduction compared to ReAct (Sec. 5.1).

• To test performance on more complex patterns, we introduce a new benchmark called ParallelQA, which includes various non-trivial function calling patterns. We show up to 2.27× speedup, 4.65× cost reduction, and 9% improved accuracy compared to ReAct (Sec. 5.2).

• We evaluate LLMCompiler's capability in dynamic replanning, which is achieved through a feedback loop from the Executor back to our Function Calling Planner. For the Game of 24 (Yao et al., 2023b), which requires repeated replanning based on intermediate results, LLMCompiler demonstrates a 2× speedup compared to Tree-of-Thoughts (Sec. 5.3).

• We show that LLMCompiler can explore interactive decision-making environments effectively and efficiently. On WebShop, LLMCompiler achieves up to 101.7× speedup and a 25.7% improved success rate compared to the baselines (Sec. 5.4).
[Figure 2: Example user input, "How much does Microsoft's market cap need to increase to exceed Apple's market cap?" The Function Calling Planner emits a DAG of tasks ($1 = search(Microsoft Market Cap); $2 = search(Apple Market Cap); $3 = math($1 / $2); $4 = llm($3)), which the Task Fetching Unit resolves and dispatches to the Executor, equipped with tools such as search, math, and llm, each with dedicated memory.]
Figure 2. Overview of the LLMCompiler framework. The Function Calling Planner generates a DAG of tasks with their inter-dependencies. These tasks are then dispatched by the Task Fetching Unit to the Executor in parallel based on their dependencies. In this example, Tasks $1 and $2 are fetched together for parallel execution of two independent search tasks. After each task is performed, the results are forwarded back to the Task Fetching Unit to unblock the dependent tasks after replacing their placeholder variables (e.g., the variables $1 and $2 in Task $3) with actual values. Once all tasks have been executed, the final answer is delivered to the user.
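The textual plan in Fig. 2 can be viewed as a small task DSL. Below is a minimal sketch of how such Planner output could be parsed into a dependency graph; the regular expression, the Task structure, and the function names are illustrative assumptions rather than the implementation in the released code.

import re
from dataclasses import dataclass, field

# Each plan line has the form "$<id> = <tool>(<args>)", e.g. "$3 = math($1 / $2)".
PLAN_LINE = re.compile(r"\$(?P<idx>\d+)\s*=\s*(?P<tool>\w+)\((?P<args>.*)\)")

@dataclass
class Task:
    idx: int
    tool: str
    args: str
    # Indices of tasks whose outputs this task depends on ($-placeholders in args).
    deps: set[int] = field(default_factory=set)

def parse_plan(plan: str) -> dict[int, Task]:
    """Turn the Planner's textual plan into a DAG of Task objects."""
    tasks: dict[int, Task] = {}
    for line in plan.strip().splitlines():
        m = PLAN_LINE.match(line.strip())
        if not m:
            continue  # skip non-task lines (e.g., the Planner's free-form thoughts)
        deps = {int(d) for d in re.findall(r"\$(\d+)", m["args"])}
        tasks[int(m["idx"])] = Task(int(m["idx"]), m["tool"], m["args"], deps)
    return tasks

plan_text = """
$1 = search(Microsoft Market Cap)
$2 = search(Apple Market Cap)
$3 = math($1 / $2)
$4 = llm($3)
"""
for t in parse_plan(plan_text).values():
    print(t.idx, t.tool, sorted(t.deps))   # e.g. "3 math [1, 2]"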
At a high level, LLMCompiler consists of three components: a Function Calling Planner (Sec. 3.1) that generates a sequence of tasks and their dependencies; a Task Fetching Unit (Sec. 3.2) that replaces arguments based on intermediate results and fetches the tasks; and an Executor (Sec. 3.3) that executes the tasks with the associated tools. To use LLMCompiler, users are only required to provide tool definitions and optional in-context examples for the Planner, as will be further discussed in Sec. 4.1.

3.1. Function Calling Planner

The Function Calling Planner is responsible for generating a sequence of tasks to be executed along with any dependencies among them. For instance, Tasks $1 and $2 in Fig. 2 are two independent searches that can be performed in parallel. However, Task $3 depends on the outcomes of the first and second searches. Therefore, the Planner's role is to automatically identify the necessary tasks, their input arguments, and their inter-dependencies using the sophisticated reasoning capability of LLMs, essentially forming a directed acyclic graph of task dependencies. If a task depends on a preceding task, it incorporates a placeholder variable, such as $1 in Task $3 of Fig. 2, which will later be substituted with the actual output from the preceding task (Sec. 3.2).

The Planner in LLMCompiler leverages LLMs' reasoning capability to decompose tasks from natural language inputs. To achieve this, the Planner LLM incorporates a pre-defined prompt that guides it on how to create dependency graphs and ensure correct syntax (see Appendix H for details). Besides this, users also need to supply tool definitions and optional in-context examples for the Planner. These examples provide detailed demonstrations of task decomposition specific to a problem, helping the Planner better understand the rules. Further details on user-supplied information for LLMCompiler are elaborated in Sec. 4.1. In Sec. 4.2, we introduce an additional optimization for the Planner that streams tasks as soon as they are created, instead of waiting to complete the entire planning process.

3.2. Task Fetching Unit

The Task Fetching Unit, inspired by the instruction fetching units in modern computer architectures, fetches tasks to the Executor as soon as they are ready for (parallel) execution, based on a greedy policy. Another key functionality is to replace variables with the actual outputs from preceding tasks, which were initially set as placeholders by the Planner. For the example in Fig. 2, the variables $1 and $2 in Task $3 would be replaced with the actual market caps of Microsoft and Apple. This can be implemented with a simple fetching and queuing mechanism without a dedicated LLM.

3.3. Executor

The Executor asynchronously executes tasks fetched from the Task Fetching Unit. As the Task Fetching Unit guarantees that all the tasks dispatched to the Executor are independent, it can simply execute them concurrently. The Executor is equipped with user-provided tools, and it delegates each task to the associated tool. These tools can be simple functions like a calculator, Wikipedia search, or API calls, or they can even be LLM agents that are tailored for a specific task. As depicted in the Executor block of Fig. 2, each task has dedicated memory to store its intermediate outcomes, similar to what typical sequential frameworks do when aggregating observations as a single prompt (Yao et al., 2022). Upon completion of a task, its results are forwarded as input to the tasks that depend on them.
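A minimal sketch of how the Task Fetching Unit and Executor of Secs. 3.2 and 3.3 could be realized with asyncio is shown below. It reuses the hypothetical Task and parse_plan definitions from the sketch after Fig. 2; the greedy dispatch-when-ready policy, the placeholder substitution, and the tool registry shown here are illustrative assumptions, not the released implementation.

import asyncio
import re
from typing import Callable

async def execute_dag(tasks: dict[int, "Task"],
                      tools: dict[str, Callable[[str], str]]) -> dict[int, str]:
    """Greedily dispatch every task whose dependencies are resolved, in parallel."""
    results: dict[int, str] = {}
    pending = dict(tasks)
    while pending:
        # Task Fetching Unit: fetch all tasks whose dependency outputs are available.
        ready = [t for t in pending.values() if t.deps <= set(results)]
        if not ready:
            raise ValueError("dependency cycle or missing task in the plan")
        for t in ready:
            del pending[t.idx]
        # Substitute $-placeholders (e.g., $1 and $2 in "math($1 / $2)") with outputs.
        def resolve(args: str) -> str:
            return re.sub(r"\$(\d+)", lambda m: str(results[int(m.group(1))]), args)
        # Executor: run all ready (mutually independent) tool calls concurrently.
        outputs = await asyncio.gather(
            *(asyncio.to_thread(tools[t.tool], resolve(t.args)) for t in ready)
        )
        for t, out in zip(ready, outputs):
            results[t.idx] = out
    return results

# Illustrative usage with user-provided tool callables for search/math/llm:
# results = asyncio.run(execute_dag(parse_plan(plan_text), tools))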
3.4. Dynamic Replanning

In various applications, the execution graph may need to adapt based on intermediate results that are a priori unknown. An analogy in programming is branching, where the path of execution is determined only during runtime, depending on which branch conditions are satisfied. Such dynamic execution patterns can also appear with LLM function calling. For simple branching (e.g., if-else statements), one could statically compile the execution flow and choose the right branch dynamically based on the intermediate results. However, for more complex branching it may be better to do a recompilation or replanning based on the intermediate results.

When replanning, the intermediate results are sent back from the Executor to the Function Calling Planner, which then generates a new set of tasks with their associated dependencies. These tasks are then sent to the Task Fetching Unit and subsequently to the Executor. This cycle continues until the desired final result is achieved and can be delivered to the user. We show an example use case of this in Sec. 5.3 for solving the Game of 24 using the Tree-of-Thoughts approach.

4. LLMCompiler Details

4.1. User-Supplied Information

LLMCompiler requires two inputs from the user:

1. Tool Definitions: Users need to specify the tools that LLMs can use, including their descriptions and argument specifications. This is essentially the same requirement as in other frameworks like ReAct and OpenAI function calling.

2. In-context Examples for the Planner: Optionally, users can provide LLMCompiler with examples of how the Planner should behave. For instance, in the case of Fig. 2, users may provide examples illustrating expected inter-task dependencies for certain queries. These examples can help the Planner LLM understand how to use various tools and generate the appropriate dependency graph for incoming inputs in the correct format. In Appendix G, we include the examples that we used in our evaluations.

4.2. Streamed Planner

The Planner may incur a non-trivial overhead for user queries that involve many tasks, as it blocks the Task Fetching Unit and the Executor, which must wait for the Planner output before initiating their processes. However, analogous to instruction pipelining in modern computer systems, this can be mitigated by enabling the Planner to asynchronously stream the dependency graph, thereby allowing each task to be immediately processed by the Executor as soon as its dependencies are all resolved. In Table C.1, we present a latency comparison of LLMCompiler with and without the streaming mechanism across different benchmarks. The results demonstrate consistent latency improvements with streaming. Particularly, in the ParallelQA benchmark, the streaming feature leads to a latency gain of up to 1.3×. This is attributed to the math tool's longer execution time for ParallelQA, which can effectively hide the Planner's latency in generating subsequent tasks, unlike the shorter execution times of the search tool used in HotpotQA and Movie Recommendation.
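The pipelining described in Sec. 4.2 can be pictured as the Planner yielding tasks while the LLM is still generating the rest of the plan. Below is a minimal sketch under the assumption that the plan arrives as a text stream and that the Task dataclass and line format from the Fig. 2 sketch are reused; none of this is the exact interface of the released code.

import re
from typing import AsyncIterator

PLAN_LINE = re.compile(r"\$(?P<idx>\d+)\s*=\s*(?P<tool>\w+)\((?P<args>.*)\)")

async def stream_tasks(plan_chunks: AsyncIterator[str]) -> AsyncIterator["Task"]:
    """Emit each task as soon as its full line has been streamed by the Planner LLM,
    instead of waiting for the entire plan to finish."""
    buffer = ""
    async for chunk in plan_chunks:          # e.g., tokens from a streaming LLM API
        buffer += chunk
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            m = PLAN_LINE.match(line.strip())
            if m:
                deps = {int(d) for d in re.findall(r"\$(\d+)", m["args"])}
                # Task is the dataclass from the parsing sketch after Fig. 2.
                yield Task(int(m["idx"]), m["tool"], m["args"], deps)

# The Task Fetching Unit can consume this generator and dispatch each task the moment
# its dependencies have completed, so early tasks (e.g., the two independent searches
# in Fig. 2) start running while the Planner is still writing later tasks like $3 and $4.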
5. Results

In this section, we evaluate LLMCompiler using a variety of models and problem types. We use both the proprietary GPT models and the open-source LLaMA-2 model, with the latter demonstrating LLMCompiler's capability in enabling parallel function calling in open-source models. Furthermore, there are various types of parallel function calling patterns that can be addressed with LLMs. These range from embarrassingly parallel patterns, where all tasks can be executed in parallel without any dependencies between them, to more complex dependency patterns, as illustrated in Fig. 3. Importantly, we also assess LLMCompiler on the Game of 24 benchmark, which involves dynamic replanning based on intermediate results, highlighting its adaptability to dynamic dependency graphs. Finally, we apply LLMCompiler to the WebShop benchmark to showcase its potential in decision-making tasks. Overall, we start by presenting results for simple execution patterns, and then we move to more complex ones.

5.1. Embarrassingly Parallel Function Calling

The simplest scenario involves an LLM using a tool repeatedly for independent tasks such as conducting parallel searches or analyses to gather information on different topics, like the pattern depicted in Fig. 3 (a). While these tasks are independent of each other and can be executed in parallel, ReAct, along with other LLM solutions as they stand, would need to run them sequentially. This leads to increased latency and token consumption due to its frequent LLM invocations for each tool usage, as also illustrated in Fig. 1. In this section, we demonstrate how LLMCompiler can identify parallelizable patterns and execute independent tasks concurrently to resolve this issue. To do so, we use the following two benchmarks:

• HotpotQA: A dataset that evaluates multi-hop reasoning (Yang et al., 2018). We only use the comparison dev set. This contains 1.5k questions comparing two different entities, thus exhibiting a 2-way embarrassingly parallel execution pattern. An example question is shown in Fig. 1.

• Movie Recommendation: A dataset with 500 examples that asks to identify the most similar movie out of four options to another set of four movies, exhibiting an 8-way embarrassingly parallel pattern (Srivastava et al., 2022).

[Figure 3: Example questions. (a) "Analyze Apple and Microsoft's latest 10-K form and compare their sales forecast." (b) "If Stanford and UCLA were to merge, would they have more Nobel laureates than UC Berkeley?" (c) "Which has higher total healthcare expenses, Florida or New York, considering both public and private sectors?"]

Figure 3. Examples of questions with different function calling patterns and their dependency graphs. The HotpotQA and Movie Recommendation datasets exhibit pattern (a), and the ParallelQA dataset exhibits patterns (b) and (c), among other patterns. In (a), we need to analyze each company's latest 10-K. In (b), we need three searches for each school, followed by one addition and one comparison operation. In (c), we need to search for each state's annual healthcare spending in each sector, sum each state's spending, and then perform a comparison.

Table 1. Accuracy and latency comparison of LLMCompiler compared to the baseline on different benchmarks, including HotpotQA, Movie Recommendation, our custom dataset named ParallelQA, and the Game of 24. For HotpotQA and Movie Recommendation, we frequently observe looping and early stopping (Sec. 5.1). To minimize these behaviors as much as possible, we incorporated ReAct-specific prompting, which we denote as ReAct†. ReAct (without †) indicates the original results without this prompting. We do not include the latency for the original ReAct since looping and early stopping make precise latency measurement difficult.

                                       GPT (Closed-source)                 LLaMA-2 70B (Open-source)
Benchmark    Method                    Accuracy (%)  Latency (s)  Speedup  Accuracy (%)  Latency (s)  Speedup
HotpotQA     ReAct                     61.52         -            -        54.74         -            -
             ReAct†                    62.47         7.12         1.00×    54.40         13.44        1.00×
             OAI Parallel Function     62.05         4.42         1.61×    -             -            -
             LLMCompiler               62.00         3.95         1.80×    57.83         9.58         1.40×
Movie Rec.   ReAct                     68.60         -            -        70.00         -            -
             ReAct†                    72.47         20.47        1.00×    70.60         33.37        1.00×
             OAI Parallel Function     77.00         7.42         2.76×    -             -            -
             LLMCompiler               77.13         5.47         3.74×    77.80         11.83        2.82×
ParallelQA   ReAct                     89.09         35.90        1.00×    59.59         15.47        1.00×
             OAI Parallel Function     87.32         19.29        1.86×    -             -            -
             LLMCompiler               89.38         16.69        2.15×    68.14         26.20        2.27×
Game of 24   Tree-of-Thoughts          74.00         241.2        1.00×    30.00         952.06       1.00×
             LLMCompiler               75.33         83.6         2.89×    32.00         456.02       2.09×

Table 2. Input and output token consumption as well as the estimated cost on HotpotQA, Movie Recommendation, and our custom dataset named ParallelQA. The cost is computed based on the pricing table of the GPT models used for each benchmark.

Benchmark    Method           Tokens In.  Tokens Out.  Cost ($/1k)  Cost Red.
HotpotQA     ReAct            2900        120          5.00         1.00×
             OAI Para. Func.  2500        63           2.66         1.87×
             LLMCompiler      1300        80           1.47         3.37×
Movie Rec.   ReAct            20000       230          20.46        1.00×
             OAI Para. Func.  5800        160          6.14         3.33×
             LLMCompiler      2800        115          3.04         6.73×
ParallelQA   ReAct            46000       470          480          1.00×
             OAI Para. Func.  25000       370          260          1.81×
             LLMCompiler      9200        340          103          4.65×

Experimental Setups. As a baseline method, we compare LLMCompiler with ReAct. We follow the ReAct setup (Yao et al., 2022) using the same Wikipedia search tool that LLMs can use to search for information. We did not include the lookup tool since it is not relevant to our problem setting. We have optimized the prompt and in-context examples for both ReAct and LLMCompiler to the best of our abilities. For all experiments across these datasets, we use gpt-3.5-turbo (1106 release). For the experiments using GPT, we additionally report the results using OpenAI's parallel function calling capability, which was announced concurrently with our work. We also show how LLMCompiler can be effectively combined with the open-source LLaMA-2 70B model to provide the model with parallel function calling capabilities. For all experiments, we have measured accuracy, end-to-end latency, as well as input and output token usage. See Appendix D for details on experimental setups.
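As a concrete illustration of the user-supplied tool definitions described in Sec. 4.1, the Wikipedia search tool mentioned above could be registered roughly as follows. The ToolDefinition structure and the stubbed wikipedia_search function are illustrative assumptions and not the interface of the released code.

from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolDefinition:
    name: str
    description: str          # shown to the Planner so it knows when to use the tool
    func: Callable[[str], str]

def wikipedia_search(entity: str) -> str:
    """Return the first paragraph of the Wikipedia page for `entity`.
    (Stubbed here; any HTTP or Wikipedia-client call could back it.)"""
    raise NotImplementedError

search_tool = ToolDefinition(
    name="search",
    description=(
        "search(entity): searches Wikipedia for `entity` and returns the first "
        "paragraph of the matching page. Use one call per entity."
    ),
    func=wikipedia_search,
)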
Accuracy and Latency. We report the accuracy, end-to-end latency, and relative speed-up of LLMCompiler compared to ReAct in Tab. 1. First, we observe that ReAct consistently achieves lower accuracy compared to OpenAI parallel function calling and LLMCompiler. We identify two main failure modes in ReAct: (1) the tendency for redundant generation of prior function calls, a point also noted in the original ReAct paper (Yao et al., 2022); and (2) premature early stopping based on incomplete intermediate results. In Appendix A, we offer a detailed analysis demonstrating how these two prevalent failure cases significantly hurt ReAct's accuracy, and how they can be resolved with LLMCompiler, leading to an accuracy enhancement of up to 7 – 8%. Furthermore, we have conducted interventional experiments in which we incorporated ReAct-specific prompts to avoid repetitive function calls and early stopping. ReAct† in Tab. 1 refers to ReAct with this ReAct-specific prompt. The ReAct-specific prompt yields a general accuracy improvement with ReAct† as compared to the original ReAct. Nevertheless, LLMCompiler still demonstrates on-par or better accuracy than ReAct†, as such prompting does not serve as a perfect solution to completely avoiding the erroneous behavior of ReAct.

Additionally, when compared to ReAct†, LLMCompiler demonstrates a noticeable speedup of 1.80× and 1.40× on HotpotQA with GPT and LLaMA, respectively. Similarly, LLMCompiler demonstrates 3.74× and 2.82× speedup on Movie Recommendation with each model. Note that we benchmark the latency of LLMCompiler against that of ReAct†, since the repeating and early stopping behavior of the original ReAct, as discussed above, makes its latency unpredictable and unsuitable for a fair comparison. LLMCompiler demonstrates a speedup of up to 35% compared to OpenAI parallel function calling, whose latency gain over ReAct is 1.61× and 2.76× on each benchmark.¹

¹Unfortunately, we are unable to conclude why this is the case, as OpenAI has not publicly disclosed any details about their function calling mechanism. One speculation is that there might be additional overheads to validate the function and argument names and to convert them into a system prompt. Nevertheless, we have seen a consistent trend with multiple runs over several days.

Costs. Another important consideration of using LLMs is cost, which depends on the input and output token usage. The costs for the GPT experiments are provided in Tab. 2. LLMCompiler is more cost-efficient than ReAct, as it involves less frequent LLM invocations. Interestingly, LLMCompiler also outperforms the recent OpenAI parallel function calling in cost efficiency. This is because LLMCompiler's planning phase is more prompt-length efficient than that of OpenAI parallel function calling, since our Planner's in-context examples are rather short and only include plans, not observations (see Appendix H).

5.2. Parallel Function Calling with Dependencies

The cases considered above are rather simple, as only one tool is used and all tasks can be executed independently of one another. However, similar to code execution in traditional code blocks, we may encounter function calling scenarios that involve more complex dependencies. To systematically evaluate the capability to plan out function calling in scenarios that involve complex task dependencies, we have designed a custom benchmark called ParallelQA. This benchmark is designed to incorporate non-trivial function calling patterns, including the different types of patterns in Fig. 3 (b) and (c). Inspired by the IfQA benchmark (Yu et al., 2023), ParallelQA contains 113 examples that involve mathematical questions on factual attributes of various entities. In particular, completing the task requires using two tools (i.e., search and math tools), with the second tool's argument depending on the result of the first tool's output. We have meticulously included questions that are answerable only with information from Wikipedia's first paragraph, effectively factoring out the failure cases due to unsuccessful searches. See Appendix I for more details on ParallelQA.

Experimental Setups. Similar to Sec. 5.1, we use ReAct (Yao et al., 2022) as the main baseline. Here, both ReAct and LLMCompiler are equipped with two tools: (1) the search tool, identical to the one mentioned in Sec. 5.1; and (2) the math tool, which solves mathematical problems. The math tool is inspired by Langchain's LLMMathChain (Langchain), which uses an LLM as an agent that interprets input queries and invokes the numexpr function with the appropriate formula. This enables the math chain to address a broad spectrum of math problems that are written both in mathematical and verbal form. See Appendix D for more details on experimental setups.

Accuracy and Latency. As shown in the ParallelQA row of Tab. 1, LLMCompiler arrives at the final answer with an average speedup of 2.15× with gpt-4-turbo and 2.27× with LLaMA-2 70B, by avoiding sequential execution of the dependency graphs. Beyond the latency speedup, we observe higher accuracy of LLMCompiler with the LLaMA-2 model as compared to that of ReAct, due to the reasons discussed in Sec. 5.1. Particularly in the LLaMA-2 experiment, where LLMCompiler achieves around a 9% increase in accuracy, we note that ∼20% of the examples experienced repetitive function calls with ReAct, aligning with our observations from the accuracy analysis detailed in Appendix A. Additionally, a comprehensive analysis of LLMCompiler's failure cases is provided in Appendix B, where we note minimal Planner failures, highlighting LLMCompiler's effectiveness in breaking down problems into complex multi-task dependencies.

Cost. Similar to Sec. 5.1, LLMCompiler demonstrates substantial cost reductions of 4.65× and 2.57× compared to ReAct and OpenAI's parallel function calling, respectively, as indicated in Tab. 2. This efficiency stems from LLMCompiler's reduced frequency of LLM invocations. The same holds for OpenAI's parallel function calling, but it is limited to planning out immediately parallelizable tasks, not the entire dependency graph. For example, in Fig. 3 (c), OpenAI's method would necessitate three distinct LLM calls for the initial search tasks, the following math tasks, and the final math task. In contrast, LLMCompiler achieves this with a single LLM call, planning all tasks concurrently.
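A minimal sketch of a math tool in the spirit described above: an LLM turns the verbal question into a numexpr expression, which is then evaluated locally. The llm_call helper and the prompt wording are placeholder assumptions for whatever chat-completion client and prompt are actually used.

import numexpr

def llm_call(prompt: str) -> str:
    """Placeholder for an LLM chat-completion call (e.g., a GPT or LLaMA-2 endpoint)."""
    raise NotImplementedError

MATH_PROMPT = (
    "Translate the following problem into a single numexpr-compatible arithmetic "
    "expression and return only the expression.\nProblem: {question}\nExpression:"
)

def math_tool(question: str) -> str:
    """Solve a math question posed in verbal or symbolic form, as in ParallelQA."""
    expression = llm_call(MATH_PROMPT.format(question=question)).strip()
    # numexpr evaluates plain arithmetic strings such as "3.1e12 / 2.8e12" without
    # handing the LLM output to Python's eval().
    result = numexpr.evaluate(expression).item()
    return str(result)

# Example: math_tool("What is $1 divided by $2?") would be invoked after the Task
# Fetching Unit has substituted $1 and $2 with the outputs of the two search tasks.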
5.3. Parallel Function Calling with Replanning

In the previous sections, we have discussed cases in which dependency graphs can be determined statically. However, there are cases where dependency graphs need to be constructed dynamically depending on intermediate observations. Here, we consider one such dynamic approach in the context of the Game of 24 with the Tree-of-Thoughts (ToT) strategy proposed in (Yao et al., 2023b). The Game of 24 is a task to generate 24 using a set of four numbers and basic arithmetic operations. For example, from the numbers 2, 4, 4, and 7, a solution could be 4 × (7 − 4) × 2 = 24. ToT approaches this task through two iterative LLM processes: (i) the thought proposer generates candidate partial solutions by selecting two numbers and applying an operation (e.g., obtaining 2, 3, 7 from 2, 4, 4, 7 by computing 7 − 4); (ii) the state evaluator assesses the potential of each candidate. Only the promising candidates are then processed in subsequent iterations of the thought proposer and state evaluator until 24 is reached. Details about the Game of 24 benchmark and the ToT strategy can be found in Appendix J.

While ToT achieves significant improvement at solving the Game of 24, its sequential, breadth-first search approach through the state tree can be time-consuming. LLMCompiler offers a faster alternative by enabling parallel execution of the thought proposer and the subsequent feasibility evaluator, akin to a parallel beam search method.

Experimental Setups. Although LLMCompiler offers latency advantages, solving this problem with a single static graph is not feasible, as the Planner cannot plan out the thought proposing stage before identifying the selected candidates from the state evaluator of the previous iteration. Consequently, the Planner is limited to planning only within one iteration at a time. To address this, we resort to LLMCompiler's replanning capability. In particular, LLMCompiler is equipped with three tools: thought proposer and state evaluator, which are both LLMs adapted from the original ToT framework, and top k select, which chooses the top k candidates from the thought proposer based on the state evaluator's assessment. After all these tools are executed, LLMCompiler can decide to "replan" if no proposal reaches 24, triggering the Planner to devise new plans using the shortlisted states from top k select of the previous iteration. In this way, LLMCompiler can dynamically regenerate plans for each iteration, enabling it to tackle highly complex tasks that require iterative replanning based on the outcomes of previous plans.

To evaluate LLMCompiler's performance on the Game of 24, we use 100 different instances of the game. For each problem, we consider the output successful if its operations are valid and yield 24 while also using the provided numbers exactly once each. Further details on experiment setups are outlined in Appendix D.

Success Rate and Latency. In the last two rows of Tab. 1, we explore the latency and success rate of LLMCompiler in comparison to the baseline described in (Yao et al., 2023b) on the Game of 24 benchmark. With the gpt-4 model, LLMCompiler demonstrates a 2.89× enhancement in latency while slightly improving the success rate compared to the baseline. Similarly, when applied with the LLaMA-2 model, LLMCompiler shows a 2.01× improvement in latency, again without compromising on success rate. These results demonstrate not only a significant latency reduction without quality degradation, but also the replanning capability of LLMCompiler for solving complex problems.
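The replanning flow of Secs. 3.4 and 5.3 can be sketched as a simple driver loop. The three tool parameters mirror the tools named above, while their signatures, the string-based state representation, the reaches_24 helper, and the loop structure are illustrative assumptions rather than the released implementation.

from typing import Callable, Optional

def solve_game_of_24(
    numbers: list[int],
    thought_proposer: Callable[[str], list[str]],   # state -> candidate next states
    state_evaluator: Callable[[str], float],        # state -> promise score
    top_k_select: Callable[[list[str], list[float], int], list[str]],
    reaches_24: Callable[[str], bool],
    max_iters: int = 4,
) -> Optional[str]:
    """Propose -> evaluate -> shortlist -> replan until a candidate reaches 24."""
    states = [" ".join(map(str, numbers))]          # e.g., "2 4 4 7"
    for _ in range(max_iters):
        # Proposer and evaluator calls for different states are independent, so one
        # LLMCompiler plan per iteration can execute them in parallel.
        candidates = [c for s in states for c in thought_proposer(s)]
        scores = [state_evaluator(c) for c in candidates]
        for c in candidates:
            if reaches_24(c):
                return c
        # No candidate reached 24: the intermediate results are fed back to the
        # Planner, which replans the next round from the shortlisted states.
        states = top_k_select(candidates, scores, 5)
    return None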
5.4. Application: LLMCompiler in Interactive Decision Making Tasks

In this section, we demonstrate that LLMCompiler can explore language-based interactive environments effectively by benchmarking LLMCompiler on WebShop (Yao et al., 2023a). As highlighted in (Shinn et al., 2023; Yao et al., 2022; 2023a), WebShop exhibits considerable diversity, which requires extensive exploration to purchase the most appropriate item. While recent works feature advanced exploration strategies and show promising results (Ma et al., 2023; Zhou et al., 2023a), their approaches are largely based on a sequential and extensive tree search that incurs significant latency penalties. Here, LLMCompiler showcases an exploration strategy that is both effective and efficient with the use of parallel function calling. Our method enables broader exploration of items in the environment, which improves the success rate compared to ReAct. At the same time, this exploration can be parallelized, yielding up to 101.7× speedup against baselines that perform sequential exploration.

Experimental Setups. We evaluate LLMCompiler against three baselines on this benchmark, ReAct (Yao et al., 2022), LATS (Zhou et al., 2023a), and LASER (Ma et al., 2023), using 500 WebShop instructions. The evaluation metrics are success rate, average score, and latency. More details of the WebShop environment and the baseline methods are provided in Appendix K. For this experiment, LLMCompiler is equipped with two tools: search and explore. The search function triggers the model to generate and dispatch a query that returns a list of typically ten items from the WebShop environment. The explore function then clicks through the links for each of the found items and retrieves information about the options, prices, attributes, and features that are available. Finally, based on the gathered information, LLMCompiler decides on the item that best matches the input instruction for purchasing. Further details on experiments can be found in Appendix D.

Table 3. Performance and Latency Analysis for WebShop. We evaluate LLMCompiler with two models, gpt-4 and gpt-3.5-turbo, and compare LLMCompiler against three baselines: ReAct, LATS, and LASER. We report success rate and average score in percentage. We reproduce the success rate and average score for ReAct, while those for LATS and LASER are from their papers. N denotes the number of examples used for evaluation.

Model          Method       Succ. Rate  Score  Latency (s)  N
gpt-3.5-turbo  ReAct        19.8        54.2   5.98         500
               LATS         38.0        75.9   1066         50
               LLMCompiler  44.0        72.8   10.72        50
               LLMCompiler  48.2        74.2   10.48        500
gpt-4-0613     ReAct        35.2        58.8   19.90        500
               LASER        50.0        75.6   72.16        500
               LLMCompiler  55.6        77.1   26.73        500

Performance and Latency. Our approach significantly outperforms all baseline models, as shown in Table 3. When using gpt-3.5-turbo, LLMCompiler achieves a 28.4% and 6% improvement in success rate against ReAct and LATS; with gpt-4, our method improves upon ReAct and LASER by 20.4% and 5.6%, respectively. In terms of latency, LLMCompiler exhibits a 101.7× and 2.69× speedup against LATS and LASER. While we note that LLMCompiler execution is slightly slower than ReAct on this benchmark, mainly due to the Planner overhead, we also highlight that the gains in success rate far outweigh the minor latency penalty.

We further delve into why LLMCompiler attains such an improved success rate and score compared to ReAct. Based on our observations, we discover that the ReAct agent tends to commit to a decision with imperfect information, a scenario that can arise when the agent has not gathered sufficient details about the features and options available for items. This observation was also noted in (Shinn et al., 2023) – without exploring more items in the environment, the agent struggles to differentiate between seemingly similar choices, ultimately failing to make the correct decision. In contrast, LLMCompiler undergoes further exploration by visiting all ten items found by search and retrieving relevant information about each item. We find that employing an effective search strategy is critical to decision-making tasks such as the WebShop benchmark.

The relatively high performance of LATS can also be explained in terms of its exploration scheme. In this framework, the agent executes a brute-force search through the state and action space of WebShop, exploring as many as 30 trajectories before making the final purchase. While this approach provides richer information for decision-making, the end-to-end execution becomes prohibitively slow.

We report that our method, LLMCompiler, outperforms LASER by an average score of 1.5. When compared to LATS, this score is within the standard deviation range of our method. The average score for LLMCompiler, along with its standard deviation, is 72.8 ± 4.01 for gpt-3.5-turbo. Further note that while the performance differences are marginal, our method exhibits significant execution speedup, 101.7× over LATS and 2.69× over LASER.

6. Conclusions

Existing methods for invoking multiple functions with LLMs resort to sequential and dynamic reasoning. As a result, they suffer from inefficiencies in latency, cost, and accuracy. As a solution, we introduced LLMCompiler, a compiler-inspired framework that enables efficient parallel function calling across various LLMs, including open-source models like LLaMA-2 and OpenAI's GPT series. By decomposing user inputs into tasks with defined inter-dependencies and executing these tasks concurrently through its Planner, Task Fetching Unit, and Executor components, LLMCompiler demonstrates substantial improvements in latency (up to 3.7×), cost efficiency (up to 6.7×), and accuracy (up to ∼9%), even outperforming OpenAI's parallel function calling feature in latency gains. We look forward to future work building upon LLMCompiler that will improve both the capabilities and efficiencies of LLMs in executing complex, large-scale tasks, thus transforming the future development of LLM-based applications.

Impact Statement

This paper presents research towards advancing the field of Machine Learning. While there are many potential societal consequences of our work, we do not find any one to be particularly noteworthy.

Acknowledgements

We appreciate the valuable feedback from Minwoo Kang. We acknowledge gracious support from the Furiosa team. We also appreciate the support from Microsoft through their Accelerating Foundation Model Research, including great support from Sean Kuno. Furthermore, we appreciate support from Google Cloud, the Google TRC team, and specifically Jonathan Caton, and Prof. David Patterson. Prof. Keutzer's lab is sponsored by the Intel corporation, Intel One-API, Intel VLAB team, the Intel One-API center of excellence, as well as funding through BDD and BAIR. We also appreciate support from Samsung including Dongkyun Kim and David Thorsley. We appreciate the great support from Ellick Chan, Saurabh Tangri, Andres Rodriguez, and Kittur Ganesh. Sehoon Kim and Suhong Moon would like to acknowledge the support from the Korea Foundation for Advanced Studies. Amir Gholami was supported through funding from Samsung SAIT. Michael W. Mahoney would also like to acknowledge a J. P. Morgan Chase Faculty Research Award as well as the DOE, NSF, and IARPA. Our conclusions do not necessarily reflect the position or the policy of our sponsors, and no official endorsement should be inferred.
References

https://huggingface.co/text-generation-inference.

https://github.com/nvidia/tensorrt-llm.

Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., and Sutton, C. Program synthesis with large language models, 2021.

Besta, M., Blach, N., Kubicek, A., Gerstenberger, R., Gianinazzi, L., Gajda, J., Lehmann, T., Podstawski, M., Niewiadomski, H., Nyczyk, P., and Hoefler, T. Graph of thoughts: Solving elaborate problems with large language models, 2023.

Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. Accelerating large language model decoding with speculative sampling, 2023a.

Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A. N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., and Zaremba, W. Evaluating large language models trained on code, 2021.

Chen, W., Ma, X., Wang, X., and Cohen, W. W. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks, 2023b.

Dettmers, T., Svirschevski, R., Egiazarian, V., Kuznedelev, D., Frantar, E., Ashkboos, S., Borzunov, A., Hoefler, T., and Alistarh, D. Spqr: A sparse-quantized representation for near-lossless llm weight compression, 2023.

Frantar, E. and Alistarh, D. Sparsegpt: Massive language models can be accurately pruned in one-shot, 2023.

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. GPTQ: Accurate post-training compression for generative pretrained transformers. arXiv preprint arXiv:2210.17323, 2022.

Gao, L., Madaan, A., Zhou, S., Alon, U., Liu, P., Yang, Y., Callan, J., and Neubig, G. Pal: Program-aided language models. arXiv preprint arXiv:2211.10435, 2022.

Hao, S., Gu, Y., Ma, H., Hong, J. J., Wang, Z., Wang, D. Z., and Hu, Z. Reasoning with language model is planning with world model, 2023.

Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., and Steinhardt, J. Measuring coding challenge competence with apps. NeurIPS, 2021a.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021b.

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021c.

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pp. 2790–2799. PMLR, 2019.

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.

Karpathy, A. Intro to large language models, 2023.

Khot, T., Trivedi, H., Finlayson, M., Fu, Y., Richardson, K., Clark, P., and Sabharwal, A. Decomposed prompting: A modular approach for solving complex tasks. In The Eleventh International Conference on Learning Representations, 2023.

Kim, S., Hooper, C., Gholami, A., Dong, Z., Li, X., Shen, S., Mahoney, M. W., and Keutzer, K. Squeezellm: Dense-and-sparse quantization, 2023.

Kim, S., Mangalam, K., Moon, S., Malik, J., Mahoney, M. W., Gholami, A., and Keutzer, K. Speculative decoding with big little decoder, 2024.

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. Large language models are zero-shot reasoners, 2023.

Kwon, W., Kim, S., Mahoney, M. W., Hassoun, J., Keutzer, K., and Gholami, A. A fast post-training pruning framework for transformers, 2022.

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.

Langchain. https://github.com/langchain-ai/langchain.

Lester, B., Al-Rfou, R., and Constant, N. The power of scale for parameter-efficient prompt tuning, 2021.

Leviathan, Y., Kalman, M., and Matias, Y. Fast inference from transformers via speculative decoding, 2023.

Liang, Y., Wu, C., Song, T., Wu, W., Xia, Y., Liu, Y., Ou, Y., Lu, S., Ji, L., Mao, S., Wang, Y., Shou, L., Gong, M., and Duan, N. Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis, 2023.

Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., Gan, C., and Han, S. Awq: Activation-aware weight quantization for llm compression and acceleration, 2023.

Liu, J. LlamaIndex, 11 2022. URL https://github.com/jerryjliu/llama_index.

Ma, K., Zhang, H., Wang, H., Pan, X., and Yu, D. Laser: Llm agent with state-space exploration for web navigation, 2023.

Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B. P., Hermann, K., Welleck, S., Yazdanbakhsh, A., and Clark, P. Self-refine: Iterative refinement with self-feedback, 2023.

Ning, X., Lin, Z., Zhou, Z., Wang, Z., Yang, H., and Wang, Y. Skeleton-of-thought: Large language models can do parallel decoding, 2023.

OpenAI. Gpt-4 technical report, 2023.

OpenAI. New models and developer products announced at devday, 2023.

Packer, C., Fang, V., Patil, S. G., Lin, K., Wooders, S., and Gonzalez, J. E. Memgpt: Towards llms as operating systems, 2023.

Patel, P., Mishra, S., Parmar, M., and Baral, C. Is a question decomposition unit all we need? 2022.

Patil, S. G., Zhang, T., Wang, X., and Gonzalez, J. E. Gorilla: Large language model connected with massive apis, 2023.

Press, O., Zhang, M., Min, S., Schmidt, L., Smith, N. A., and Lewis, M. Measuring and narrowing the compositionality gap in language models, 2023.

Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., Lin, Y., Cong, X., Tang, X., Qian, B., et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023.

Ruan, J., Chen, Y., Zhang, B., Xu, Z., Bao, T., Du, G., Shi, S., Mao, H., Zeng, X., and Zhao, R. Tptu: Task planning and tool usage of large language model-based ai agents. arXiv preprint arXiv:2308.03427, 2023a.

Ruan, Y., Dong, H., Wang, A., Pitis, S., Zhou, Y., Ba, J., Dubois, Y., Maddison, C. J., and Hashimoto, T. Identifying the risks of lm agents with an lm-emulated sandbox. arXiv preprint arXiv:2309.15817, 2023b.

Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., and Scialom, T. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.

Shen, Y., Song, K., Tan, X., Li, D., Lu, W., and Zhuang, Y. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face, 2023.

Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., and Yao, S. Reflexion: Language agents with verbal reinforcement learning, 2023.

Song, Y., Xiong, W., Zhu, D., Wu, W., Qian, H., Song, M., Huang, H., Li, C., Wang, K., Yao, R., Tian, Y., and Li, S. Restgpt: Connecting large language models with real-world restful apis, 2023.

Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.

Sumers, T. R., Yao, S., Narasimhan, K., and Griffiths, T. L. Cognitive architectures for language agents, 2023.

Surís, D., Menon, S., and Vondrick, C. Vipergpt: Visual inference via python execution for reasoning, 2023.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. Llama 2: Open foundation and fine-tuned chat models, 2023.

Wang, L., Xu, W., Lan, Y., Hu, Z., Lan, Y., Lee, R. K.-W., and Lim, E.-P. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. arXiv preprint arXiv:2305.04091, 2023a.

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models, 2023b.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pp. 24824–24837, 2022.

Wolfson, T., Geva, M., Gupta, A., Gardner, M., Goldberg, Y., Deutch, D., and Berant, J. Break it down: A question understanding benchmark. Transactions of the Association for Computational Linguistics, 2020.

Xu, B., Peng, Z., Lei, B., Mukherjee, S., Liu, Y., and Xu, D. Rewoo: Decoupling reasoning from observations for efficient augmented language models, 2023.

Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018.

Yang, Z., Dong, L., Du, X., Cheng, H., Cambria, E., Liu, X., Gao, J., and Wei, F. Language models as inductive reasoners, 2022.

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.

Yao, S., Chen, H., Yang, J., and Narasimhan, K. Webshop: Towards scalable real-world web interaction with grounded language agents, 2023a.

Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., and Narasimhan, K. Tree of Thoughts: Deliberate problem solving with large language models, 2023b.

Yu, G.-I., Jeong, J. S., Kim, G.-W., Kim, S., and Chun, B.-G. Orca: A distributed serving system for Transformer-based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pp. 521–538, 2022.

Yu, W., Jiang, M., Clark, P., and Sabharwal, A. Ifqa: A dataset for open-domain question answering under counterfactual presuppositions, 2023.

Zheng, H. S., Mishra, S., Chen, X., Cheng, H.-T., Chi, E. H., Le, Q. V., and Zhou, D. Take a step back: Evoking reasoning via abstraction in large language models, 2023.

Zhou, A., Yan, K., Shlapentokh-Rothman, M., Wang, H., and Wang, Y.-X. Language agent tree search unifies reasoning acting and planning in language models, 2023a.

Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q. V., and Chi, E. H. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, 2023b.
Figure A.1. Distributions of the number of function calls when running the Movie Recommendation benchmark on ReAct (Left), ReAct with specific prompts to avoid early stopping (Middle, corresponding to ReAct† in Tab. 1), and LLMCompiler (Right). LLMCompiler (Right) consistently completes the search for all 8 movies, whereas ReAct (Left) often exits early, as demonstrated by about 85% of examples stopping early. Although the custom prompts shift ReAct's histogram toward more function calls (Middle), they still fall short of ensuring comprehensive searches for all movies. gpt-3.5-turbo is used for the experiment.
Figure A.2. The Movie Recommendation accuracy of the examples that are categorized by the number of function calls on ReAct, measured both on ReAct and LLMCompiler. The plot indicates that in ReAct, a decrease in the number of function calls correlates with lower accuracy, indicating that premature exits lead to reduced accuracy. In contrast, when the same examples are evaluated using LLMCompiler, which ensures complete searches for all eight movies before reaching a decision, they achieve higher and more consistent accuracy than those processed by ReAct. gpt-3.5-turbo is used for the experiment, and the results are averaged over 3 different runs.
Premature Early Stopping of ReAct. ReAct frequently suffers from premature early stopping, ceasing function calls
too early, and making decisions based on incomplete information. A clear example of this is observed in the Movie
Recommendation benchmark, where ReAct often searches for fewer than the required 8 movies before delivering its final
answer. In Fig. A.1 (Left), we illustrate the distribution of the number of function calls within ReAct (using GPT) across
the Movie Recommendation benchmark. Here, we observe that around 85% of the examples exhibit early stopping, making
decisions without completing all 8 movie searches. This contrasts with LLMCompiler (Right), where almost all examples
(99%) complete the full search of 8 movies. Although adding specific prompts to ReAct to prevent early stopping shifts
the distribution towards more function calls (Fig. A.1, Middle), resulting in an accuracy improvement from 68.60 to 72.47
(ReAct† in Tab. 1), it is nevertheless an imperfect solution.
To further assess how early stopping negatively impacts accuracy, we categorize Movie Recommendation benchmark
examples by their number of function calls in ReAct. We then evaluated these groups using LLMCompiler, ensuring
complete search results for all 8 movies. Fig. A.2 reveals that fewer function calls in ReAct correlate with lower average
Figure A.3. Distributions of the number of function calls when running the HotpotQA benchmark on ReAct (Left) and LLMCompiler
(Right). While LLMCompiler (Right) consistently completes the task within 2 function calls, which is expected as HotpotQA exhibits a
2-way parallelizable pattern, ReAct (Left) shows that around 10% of the examples undergo repetitive (>4) function calls, resulting in a
diverging behavior of the framework. LLaMA-2 70B is used for the experiment.
Figure A.4. The HotpotQA accuracy of the examples that are categorized by the number of function calls on ReAct, measured both on ReAct and LLMCompiler. The plot indicates that in ReAct, repetitive function calls of four or more times can result in a significant accuracy degradation due to its infinite looping and diverging behavior. On the other hand, when the same examples are evaluated using LLMCompiler, which ensures only two searches per example, they achieve a higher accuracy of around 50%. LLaMA-2 70B is used for the experiment.
accuracy (green line). Conversely, if these examples were processed through LLMCompiler, with complete searches for
all eight movies, they consistently attained higher accuracy (purple line). This indicates not only that ReAct struggles with
premature exits (which is not fully addressed by prompting), but also that the earlier it stops, the greater the decline in accuracy,
contributing to the overall accuracy drop observed in Tab. 1. In contrast, LLMCompiler effectively addresses this issue.
Repetitive Function Calls of ReAct. Another common failure case of ReAct is its tendency toward repetitive function calls,
often leading to infinite loops or exceeding the context length limit. This problem is particularly noticeable in the HotpotQA
benchmark where ReAct repeatedly calls the same function if the Wikipedia search returns insufficient information about the
searched entity. Although HotpotQA is inherently 2-way parallelizable, as illustrated in Fig. A.3, we observe that about 10%
of its examples require more than four function calls in ReAct, usually resulting in an infinite loop or divergent behavior.
In contrast, LLMCompiler executes only two function calls for most examples.
To show how the repetitive function calls impact the overall accuracy, we conduct an accuracy analysis similar to the
previous case. In Fig. A.4, we categorize HotpotQA benchmark examples by the number of function calls in ReAct, and
then we compare their accuracy on both ReAct and LLMCompiler. The analysis reveals that examples that launch two
function calls in ReAct maintain the same accuracy in LLMCompiler. However, cases with more than four function calls
in ReAct, which often lead to divergent behavior, show less than 10% accuracy in ReAct. On the other hand, when these
examples are processed with LLMCompiler, they achieve around 50% accuracy by circumventing repetitive calls. It is
worth noting that there are instances with three function calls in ReAct, where an extra search that retries with an alternate
entity name after a failed initial search can yield better accuracy than LLMCompiler.
While this shows a potential adaptability advantage of ReAct, such instances represent less than 3% of cases.
Table C.1. A latency comparison between using and not using streaming in the Planner. Streaming yields consistent latency improvement
across different benchmarks, as it enables the Task Fetching Unit to start task execution immediately as each task is produced by the
Planner. The impact of streaming is especially notable in the ParallelQA benchmark, where tool execution times are long enough to
effectively hide the Planner’s execution time.
Benchmark w/o streaming (s) w/ streaming (s) Latency speedup
HotpotQA 4.00 3.95 1.01×
Movie Rec. 5.64 5.47 1.03×
ParallelQA 21.72 16.69 1.30×
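To make the effect of streaming concrete, the sketch below is a minimal, hypothetical illustration (not the released implementation) of how a Task Fetching Unit could begin executing tasks as soon as the Planner emits them, rather than waiting for the full plan. The plan_stream generator, run_tool coroutine, and all latency values are placeholder assumptions chosen only for illustration.

import asyncio
import time

# Hypothetical tool latencies (seconds); the first task is the slowest (a straggler).
TOOL_LATENCY = {"Movie A": 1.0, "Movie B": 0.3, "Movie C": 0.3}

async def plan_stream():
    """Hypothetical Planner stub: yields one task at a time, emulating
    token-level streaming of the dependency graph (0.2 s per plan step)."""
    for movie in TOOL_LATENCY:
        await asyncio.sleep(0.2)
        yield {"tool": "search", "arg": movie}

async def run_tool(task):
    """Hypothetical Executor stub standing in for a real function call."""
    await asyncio.sleep(TOOL_LATENCY[task["arg"]])
    return f"result for {task['arg']}"

async def run(streaming: bool):
    start = time.perf_counter()
    if streaming:
        # Dispatch each task the moment the Planner emits it, so tool
        # execution overlaps with the remaining planning time.
        pending = []
        async for task in plan_stream():
            pending.append(asyncio.create_task(run_tool(task)))
        results = await asyncio.gather(*pending)
    else:
        # Wait for the complete plan, then execute all tasks in parallel.
        plan = [task async for task in plan_stream()]
        results = await asyncio.gather(*(run_tool(t) for t in plan))
    return time.perf_counter() - start, results

if __name__ == "__main__":
    for streaming in (False, True):
        elapsed, _ = asyncio.run(run(streaming))
        print(f"streaming={streaming}: {elapsed:.2f}s")

In this toy setting, the straggler's execution overlaps with the remaining planning when streaming is enabled, mirroring the ParallelQA row of Tab. C.1, where tool calls are long enough to hide much of the Planner's generation time.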
C. Related Work
Here, we continue the discussion of related work that we began in Sec. 2.
D. Experimental Details
Our experiments evaluate two common scenarios: (1) using API-based closed-source models; and (2) using open-
source models with an in-house serving framework. We use OpenAI’s GPT models as closed-source models, in particular,
gpt-3.5-turbo (1106 release) for HotpotQA and Movie Recommendation, gpt-4-turbo (1106 release) for ParallelQA, and
gpt-4 (0613 release) for Game of 24. Experiments on HotpotQA, Movie Recommendation, and ParallelQA were all
conducted in November 2023 after the 1106 release. The Game of 24 experiments were conducted over a two-month
period from September to October 2023. For an open-source model, we use LLaMA-2 (Touvron et al., 2023), which was
hosted on 2 A100-80GB GPUs using the vLLM (Kwon et al., 2023) framework. All the runs have been carried out with
zero temperature, except for thought proposer and state evaluator for the Game of 24 evaluation, where the
temperature is set to 0.7. Since OpenAI's outputs exhibit randomness even at temperature 0, we conducted 3 runs and
report the average accuracy. Across ReAct, OpenAI parallel function calling, and LLMCompiler, we perform 3,
1, and 5-shot learning for HotpotQA, Movie Recommendation, and ParallelQA, respectively; the same examples across
different methods were used to ensure a fair comparison. For the Game of 24, we use 2 in-context examples for the Planner.
We use the same instruction prompts across different methods for a fair comparison, except for ReAct† in Sec. 5.1 with
additional ReAct-specific prompts. For the WebShop experiment, we use gpt-4-0613 with an 8k context window and the gpt-3.5-turbo
model with a 16k context window.
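For quick reference, the setup above can be condensed into a small configuration summary. The sketch below is only such a summary in Python form, with field names of our own choosing; it is not taken from the released code.

# Illustrative summary of the settings in this section (field names are ours).
EXPERIMENT_SETUP = {
    # Closed-source (OpenAI) settings; all runs use temperature 0 unless noted.
    "HotpotQA":   {"model": "gpt-3.5-turbo (1106)", "in_context_examples": 3},
    "Movie Rec.": {"model": "gpt-3.5-turbo (1106)", "in_context_examples": 1},
    "ParallelQA": {"model": "gpt-4-turbo (1106)",   "in_context_examples": 5},
    "Game of 24": {"model": "gpt-4 (0613)",         "planner_examples": 2,
                   # thought proposer and state evaluator run at temperature 0.7
                   },
    "WebShop":    {"models": ["gpt-4-0613 (8k context)", "gpt-3.5-turbo (16k context)"]},
    # Open-source setting
    "open-source": {"model": "LLaMA-2", "serving": "vLLM", "hardware": "2x A100-80GB"},
}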
E. Analysis
E.1. Parallel Speedup Modeling
While LLMCompiler shows a noticeable latency gain across various workloads, it does not achieve the full N× latency speedup
for N-way parallel workloads. This is mostly due to the overhead of LLMCompiler's Planner and final
answering process, which cannot be parallelized. In our Movie Recommendation experiment, LLMCompiler's Planner and
the answering process incur an overhead of 1.88 and 1.62 seconds on average, respectively; their combined overhead
already comprises more than half of LLMCompiler's overall latency in Tab. 1. Another source of overhead is the straggler
effect among the parallel tasks when they need to join together: we observe that the average latency of the slowest search is
1.13 seconds, nearly 2× the 0.61-second average latency across all tasks. Below, we provide an analytical
latency modeling of ReAct, LLMCompiler, and LLMCompiler with streaming, and we provide an analysis of achievable
latency speedup.
In this section, we focus on the embarrassingly parallelizable workload (the pattern in Fig. 3(a)), as this allows for a clearer
understanding of the impact of each component on potential latency gains. For a precise latency analysis, we consider the
three key components in Fig. 2: the Planner, the Task Fetching Unit, and the Executor. Assume that the Planner generates $N$
different tasks to be done. We define $P_i$ as the Planner's output corresponding to the $i$-th atomic task. Each $P_i$ is a blueprint
for a specific atomic task, which we refer to as $E_i$. The execution of $E_i$ involves a specific function call using the appropriate
tool. The latency function of each unit in the system is defined to quantify the time taken for specific operations. For the
Planner, the latency is denoted as $T_P(P_i)$, representing the time taken by the Planner to generate the plan $P_i$. Similarly, for
the Executor, the latency $T_E(E_i)$ corresponds to the time required to complete the task $E_i$. We ignore the latency of the Task
Fetching Unit, as it is negligible in this section. Our focus here is on comparing the latency models of ReAct (Yao et al.,
2022) and LLMCompiler.
To begin our analysis of ReAct’s latency, we express its total latency as:
\[ T^R = \sum_{i=1}^{N} \left( T_P^R(P_i) + T_E(E_i) \right). \qquad (1) \]
Here, the superscript $R$ refers to ReAct. In the ReAct agent system, the process typically involves initial thought generation,
followed by action generation and the acquisition of observations through function calls associated with the tool. The
creation of both the thought and the action is collectively considered part of generating $P_i$. It is important to note that while the
Planner's latency is denoted with a superscript (indicating ReAct), the Executor's latency does not have such a superscript.
This is because the function calling and the tool execution remain the same between ReAct and LLMCompiler.
For LLMCompiler, where all parallelizable tasks are processed concurrently, the execution portion of the total latency is
determined by the slowest of these tasks. Hence, the latency model for LLMCompiler can be represented as:
\[ T^C = \sum_{i=1}^{N} T_P^C(P_i) + \max_{k \in \{1, \ldots, N\}} T_E(E_k). \qquad (2) \]
This expression captures the sum of all planning times plus the execution time of the longest task, reflecting the system’s
focus on parallel execution.
[Figure E.5 plot: latency (s) on the y-axis versus the number of parallelizable tasks (2–5) on the x-axis.]
Figure E.5. Latency on the ParallelQA benchmark grouped by the number of maximum parallelizable tasks.
Further, if the Planner employs streaming of the dependency graph, the latency model undergoes a modification and can be
expressed as:
\[ T^{SC} = \sum_{i=1}^{N} T_P^C(P_i) + T_E(E_N). \qquad (3) \]
It is important to note that $T^{SC} \leq T^C$, since $T_E(E_N) \leq \max_{k} T_E(E_k)$ and, with streaming, every earlier task can start
executing before planning finishes. This implies that the streaming mechanism allows for more efficient handling of
task dependencies, potentially reducing the overall latency.
In evaluating the potential speedup achievable with the LLMCompiler framework compared to ReAct, the speedup metric,
denoted as γ, is defined as follows:
\[ \gamma = \frac{T^R}{T^C} = \frac{\sum_{i=1}^{N} \left( T_P^R(P_i) + T_E(E_i) \right)}{\sum_{i=1}^{N} T_P^C(P_i) + \max_{k \in \{1, \ldots, N\}} T_E(E_k)}. \qquad (4) \]
This ratio represents the comparative efficiency of LLMCompiler over ReAct, considering both planning and execution
latencies.
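To make Eqs. (1)–(4) concrete, the following minimal sketch evaluates the three latency models on hypothetical per-task planning and execution times. All numbers are placeholders rather than measurements, and for simplicity it assumes ReAct and LLMCompiler have the same per-task planning latency, as in Eq. (6).

def react_latency(plan_times, exec_times):
    """Eq. (1): ReAct interleaves planning and execution, so every step is serial."""
    return sum(p + e for p, e in zip(plan_times, exec_times))

def compiler_latency(plan_times, exec_times):
    """Eq. (2): planning is serial, execution is parallel, so only the
    slowest task (the straggler) contributes to the execution portion."""
    return sum(plan_times) + max(exec_times)

def compiler_streaming_latency(plan_times, exec_times):
    """Eq. (3): with a streaming Planner, earlier tasks overlap with the
    remaining planning, leaving only the last task's execution exposed."""
    return sum(plan_times) + exec_times[-1]

# Hypothetical 8-way parallel workload (seconds); placeholder values only.
plan_times = [0.2] * 8                                   # T_P(P_i)
exec_times = [0.6, 0.6, 0.6, 1.1, 0.6, 0.6, 0.6, 0.6]    # T_E(E_i), one straggler

t_r = react_latency(plan_times, exec_times)                # 6.9 s
t_c = compiler_latency(plan_times, exec_times)             # 2.7 s
t_sc = compiler_streaming_latency(plan_times, exec_times)  # 2.2 s
print(f"T_R={t_r:.1f}s  T_C={t_c:.1f}s  T_SC={t_sc:.1f}s  gamma={t_r / t_c:.2f}x")

Even with eight-way parallelism, $\gamma$ in this toy example is only about 2.6× because the serial planning time is not negligible, consistent with the lower bound in Eq. (6) below.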
To estimate the upper bound of this speedup, $\gamma_{\max}$, we assume that the executor latency $T_E(E_i)$ dominates the
planning latency $T_P(P_i)$ and that all task execution latencies are identical. Under this assumption, the upper bound is
calculated as:
\[ \gamma_{\max} \approx \frac{\sum_{i=1}^{N} T_E(E_i)}{\max_{k \in \{1, \ldots, N\}} T_E(E_k)} = N, \qquad (5) \]
indicating that the theoretical maximum speedup, $\gamma_{\max}$, equals the number of tasks, $N$.
On the other hand, the lower bound of the speedup, $\gamma_{\min}$, is reached when the planning latency is the predominant factor.
Given that the planning latencies of ReAct and LLMCompiler are generally similar, the minimum speedup is
approximated as:
\[ \gamma_{\min} \approx \frac{\sum_{i=1}^{N} T_P^R(P_i)}{\sum_{i=1}^{N} T_P^C(P_i)} \approx 1. \qquad (6) \]
From these observations, we can conclude that to achieve significant latency gains with LLMCompiler, it is crucial to (i)
reduce the planner overhead and (ii) minimize the occurrence of stragglers.
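As a rough back-of-the-envelope check using the Movie Recommendation figures quoted above, and assuming the Planner, the slowest search, and the final answering step do not overlap, the critical path alone bounds LLMCompiler's latency from below:

\[ T^C \;\gtrsim\; \underbrace{1.88}_{\text{Planner}} + \underbrace{1.13}_{\text{straggler}} + \underbrace{1.62}_{\text{answering}} \;\approx\; 4.63 \text{ s}, \]

which is in line with the roughly 5.5-second Movie Recommendation latency reported in Tab. C.1 and illustrates why the measured speedup remains well below the ideal $\gamma_{\max} = N$.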
Table E.2. Accuracy and latency comparison of LLMCompiler with ReAct on the HotpotQA bridge benchmark. ReAct† denotes
ReAct with additional prompting that minimizes looping and early stopping, similar to Tab. 1.
Method Accuracy (%) Latency (s)
ReAct 22.7 7.07
ReAct† 23.1 6.42
LLMCompiler 26.3 4.70
Table E.3. Qualitative comparison between LLMCompiler and other frameworks including ReAct (Yao et al., 2022), TPTU (SA for
Sequential Agent and OA for One-step Agent) (Ruan et al., 2023a), ViperGPT (Surís et al., 2023) and HuggingGPT (Shen et al., 2023).
Method Planning Replanning Parallel Execution Domain
ReAct X - X All
TPTU-SA X - X All
TPTU-OA O X X All
ViperGPT O X X Limited
HuggingGPT O X O Limited
LLMCompiler O O O All
Table F.4. Accuracy and latency speedup comparison of LLMCompiler with ReAct and TPTU (SA for Sequential Agent and OA
for One-step Agent) on the HotpotQA comparison benchmark using gpt-3.5-turbo. ReAct† and TPTU-SA† denote ReAct and TPTU-SA
with additional prompting that minimizes looping and early stopping, respectively, similar to Tab. 1.
Method Accuracy (%) Speedup
ReAct 61.52 -
ReAct† 62.47 1×
TPTU-SA 34.16 -
TPTU-SA† 44.59 1.09×
TPTU-OA 57.50 1.35×
LLMCompiler 62.00 1.51×
Problem Domains: ViperGPT and HuggingGPT target vision tasks via Python code generation and HuggingFace models,
respectively, showing significant promise in these specific areas. In contrast, LLMCompiler is designed as a general framework
that enables efficient and accurate function calling across a wide range of problem domains, rather than restricting itself to
specific fields.
1. search("Mission Impossible")
2. search("The Silence of the Lambs")
3. search("American Beauty")
4. search("Star Wars Episode IV - A New Hope")
5. search("Austin Powers International Man of Mystery")
6. search("Alesha Popvich and Tugarin the Dragon")
7. search("In Cold Blood")
8. search("Rosetta")
Thought: I can answer the question now.
9. finish()
###
In addition to user-provided functions, the Planner includes a special, hard-coded finish function. The Planner uses this
function either when the plan is sufficient to address the user query or when it can no longer proceed with planning before
executing the current plan, i.e., when it deems replanning necessary. When the Planner outputs the finish function, its
plan generation stops. Refer to Appendix G for examples of the Planner’s usage of the finish function in planning. The
definition of the finish function is given below; it is included in the Planner's prompt along with the definitions of the other
user-provided functions.
finish():
- Collects and combines results from prior actions.
- An LLM agent is called upon invoking join to either finalize the user
query or wait until the plans are executed.
- join should always be the last action in the plan, and will be called in
two scenarios:
(a) if the answer can be determined by gathering the outputs from tasks to
generate the final response.
(b) if the answer cannot be determined in the planning phase before you
execute the plans.
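As an illustration of how a plan containing this special function might be consumed, the sketch below is a hypothetical plan consumer, not the released implementation: the regular expression, the consume_plan helper, and the dispatch callback are all illustrative assumptions, and the argument parsing is deliberately naive (it does not handle commas inside quoted strings).

import re

# Matches numbered plan steps such as: 3. search("American Beauty")
STEP_RE = re.compile(r'^\s*(\d+)\.\s*(\w+)\((.*)\)\s*$')

def consume_plan(plan_lines, dispatch):
    """Hypothetical plan consumer: dispatches each tool call as it is parsed
    and stops once the hard-coded finish (a.k.a. join) action appears."""
    for line in plan_lines:
        match = STEP_RE.match(line)
        if match is None:
            continue                      # skip 'Thought:' lines and separators like ###
        idx, tool, raw_args = match.groups()
        if tool in ("finish", "join"):
            break                         # plan consumption ends here
        args = [a.strip().strip('"') for a in raw_args.split(",")] if raw_args else []
        dispatch(int(idx), tool, args)

# Example usage with a shortened version of the Movie Recommendation plan above.
if __name__ == "__main__":
    plan = [
        '1. search("Mission Impossible")',
        '2. search("The Silence of the Lambs")',
        'Thought: I can answer the question now.',
        '3. finish()',
    ]
    consume_plan(plan, lambda i, tool, args: print(i, tool, args))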
[Figure J.6 diagram: a Tree of Thoughts for the input 2, 4, 4, 7, with intermediate nodes such as 4*6=24 (left: 24).]
Figure J.6. Visualization of the Tree of Thoughts (ToT) in the Game of 24. Each node represents a distinct proposal, beginning with the
root node and branching out through the application of single operations by the thought proposer. Subsequent states are evaluated by the
state evaluator for their potential to reach the target number 24. The ToT retains the top-5 states according to their values.