An LLM Compiler for Parallel Function Calling
Sehoon Kim*1, Suhong Moon*1, Ryan Tabrizi1, Nicholas Lee1, Michael W. Mahoney1,2,3, Kurt Keutzer1, Amir Gholami1,2
*Equal contribution. 1UC Berkeley, 2ICSI, 3LBNL. Correspondence to: Amir Gholami <[email protected]>. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

The reasoning capabilities of recent LLMs enable them to execute external function calls to overcome their inherent limitations, such as knowledge cutoffs, poor arithmetic skills, or lack of access to private data. This development has allowed LLMs to select and coordinate multiple functions based on the context to tackle more complex problems. However, current methods for function calling often require sequential reasoning and acting for each function, which can result in high latency, cost, and sometimes inaccurate behavior. To address this, we introduce LLMCompiler, which executes functions in parallel to efficiently orchestrate multiple function calls. Drawing inspiration from the principles of classical compilers, LLMCompiler enables parallel function calling with three components: (i) a Function Calling Planner, formulating execution plans for function calling; (ii) a Task Fetching Unit, dispatching function calling tasks; and (iii) an Executor, executing these tasks in parallel. LLMCompiler automatically generates an optimized orchestration for the function calls and can be used with both open-source and closed-source models. We have benchmarked LLMCompiler on a range of tasks with different patterns of function calling. We observe consistent latency speedup of up to 3.7×, cost savings of up to 6.7×, and accuracy improvement of up to ∼9% compared to ReAct. Our code is available at https://github.com/SqueezeAILab/LLMCompiler.

1. Introduction

Recent advances in the reasoning capability of Large Language Models (LLMs) have expanded the applicability of LLMs beyond content generation to solving complex problems (Besta et al., 2023; Chen et al., 2023b; Gao et al., 2022; Kojima et al., 2023; Wang et al., 2023b; Wei et al., 2022; Yang et al., 2022; Yao et al., 2023b; Zhou et al., 2023b); recent works have also shown how this reasoning capability can be helpful in improving accuracy for solving complex and logical tasks. The reasoning capability has also enabled function (i.e., tool) calling, where LLMs can invoke provided functions and use the function outputs to help complete their tasks. These functions range from a simple calculator that can invoke arithmetic operations to more complex LLM-based functions.

The ability of LLMs to integrate various tools and function calls could enable a fundamental shift in how we develop LLM-based software. However, this brings up an important challenge: what is the most effective approach to incorporate multiple function calls? A notable approach has been introduced in ReAct (Yao et al., 2022), where the LLM calls a function, analyzes the outcomes, and then reasons about the next action, which involves a subsequent function call. For a simple example illustrated in Fig. 1 (Left), where the LLM is asked if Scott Derrickson and Ed Wood have the same nationality, ReAct initially analyzes the query and decides to use a search tool to search for Scott Derrickson. The result of this search (i.e., observation) is then concatenated back to the original prompt for the LLM to reason about the next action, which invokes another search tool to gather information about Ed Wood.

ReAct has been a pioneering work in enabling function calling, and it has been integrated into several frameworks (Langchain; Liu, 2022). However, scaling this approach for more complex applications requires considerable optimizations. This is due to the sequential nature of ReAct, where it executes function calls and reasons about their observations one after the other. This approach, along with the agent systems that extend ReAct (Khot et al., 2023; Qin et al., 2023; Ruan et al., 2023b; Sumers et al., 2023; Yao et al., 2023b), may lead to inefficiencies in latency and cost, due to the sequential function calling and repetitive LLM invocations for each reasoning and action step. Furthermore, while dynamic reasoning about the observations has benefits in certain cases, concatenating the outcomes of intermediate function calls could disrupt the LLM's execution flow, potentially reducing accuracy (Xu et al., 2023). Common failure cases include repetitive invocation of the same function, which is also highlighted in the original paper (Yao et al., 2022), and early stopping based on partial intermediate results, as will be further discussed in Sec. 5.1 and Appendix A.
Figure 1. An illustration of the runtime dynamics of LLMCompiler, in comparison with ReAct (Yao et al., 2022), given a sample
question from the HotpotQA benchmark (Yang et al., 2018). In LLMCompiler (Right), the Planner first decomposes the query into
several tasks with inter-dependencies. The Executor then executes multiple tasks in parallel, respecting their dependencies. Finally,
LLMCompiler joins all observations from the tool executions to produce the final response. In contrast, sequential tool execution of the
existing frameworks like ReAct (Left) leads to longer execution latency. In this example, LLMCompiler attains a latency speedup of
1.8× on the HotpotQA benchmark. While a 2-way parallelizable question from HotpotQA is presented here for the sake of simple visual
illustration, LLMCompiler is capable of managing tasks with more complex dependency patterns (Fig. 2 and Sec. 5).
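To make the contrast with ReAct's sequential loop concrete, the plan that a parallel planner could emit for the Fig. 1 question might look as follows. The $-placeholder syntax anticipates the Planner format shown later in Fig. 2; the specific plan text here is an illustrative assumption, not output captured from LLMCompiler.

# Hypothetical planner output for "Do Scott Derrickson and Ed Wood have the
# same nationality?" -- the two searches carry no inter-dependency, so they can
# be dispatched simultaneously, unlike ReAct's search -> reason -> search loop.
plan = """
$1 = search(Scott Derrickson)
$2 = search(Ed Wood)
$3 = llm(Do the results in $1 and $2 indicate the same nationality?)
"""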
To address this challenge, we draw inspiration from classical compilers, where optimizing instruction executions in traditional programming languages has been extensively explored. A key optimization technique in compilers involves identifying instructions that can be executed in parallel and effectively managing their dependencies. Similarly, one can envision a compiler, tailored for LLM function calling, which can efficiently orchestrate various function calls and their dependencies. This shares a similar philosophy with recent studies that align LLMs with computer systems (Karpathy, 2023; Packer et al., 2023). To this end, we introduce LLMCompiler, a novel framework that enables parallel multi-tool execution of LLMs across different models and workloads. To the best of our knowledge, LLMCompiler is the first framework to optimize the orchestration of LLM function calling that can not only improve latency and cost, but also accuracy, by minimizing interference from the outputs of intermediate function calls. In more detail, we make the following contributions:

• We introduce LLMCompiler, an LLM compiler that optimizes the parallel function calling performance of LLMs. At a high level, this is achieved by introducing three key components: (i) a Function Calling Planner (Sec. 3.1) that identifies an execution flow; (ii) a Task Fetching Unit (Sec. 3.2) that dispatches the function calls in parallel; and (iii) an Executor (Sec. 3.3) that executes the dispatched tasks using the associated functions.

• We evaluate LLMCompiler on embarrassingly parallel patterns using HotpotQA (Yang et al., 2018) and Movie Recommendation (Srivastava et al., 2022), where we observe 1.80×/3.74× speedup and 3.37×/6.73× cost reduction compared to ReAct (Sec. 5.1).

• To test performance on more complex patterns, we introduce a new benchmark called ParallelQA, which includes various non-trivial function calling patterns. We show up to 2.27× speedup, 4.65× cost reduction, and 9% improved accuracy compared to ReAct (Sec. 5.2).

• We evaluate LLMCompiler's capability in dynamic replanning, which is achieved through a feedback loop from the Executor back to our Function Calling Planner. For the Game of 24 (Yao et al., 2023b), which requires repeated replanning based on intermediate results, LLMCompiler demonstrates a 2× speedup compared to Tree-of-Thoughts (Sec. 5.3).

• We show that LLMCompiler can explore interactive decision-making environments effectively and efficiently. On WebShop, LLMCompiler achieves up to 101.7× speedup and a 25.7% improved success rate compared to the baselines (Sec. 5.4).
[Figure 2: Example user input, "How much does Microsoft's market cap need to increase to exceed Apple's market cap?" The Function Calling Planner emits a DAG of tasks ($1 = search(Microsoft Market Cap); $2 = search(Apple Market Cap); $3 = math($1 / $2); $4 = llm($3)), which the Task Fetching Unit resolves and dispatches to the Executor, equipped with tools such as search, math, and llm, each with dedicated memory.]
Figure 2. Overview of the LLMCompiler framework. The Function Calling Planner generates a DAG of tasks with their inter-dependencies. These tasks are then dispatched by the Task Fetching Unit to the Executor in parallel based on their dependencies. In this example, Tasks $1 and $2 are fetched together for parallel execution of two independent search tasks. After each task is performed, the results are forwarded back to the Task Fetching Unit to unblock the dependent tasks after replacing their placeholder variables (e.g., the variables $1 and $2 in Task $3) with actual values. Once all tasks have been executed, the final answer is delivered to the user.
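The textual plan in Fig. 2 can be viewed as a small task DSL. Below is a minimal sketch of how such Planner output could be parsed into a dependency graph; the regular expression, the Task structure, and the function names are illustrative assumptions rather than the implementation in the released code.

import re
from dataclasses import dataclass, field

# Each plan line has the form "$<id> = <tool>(<args>)", e.g. "$3 = math($1 / $2)".
PLAN_LINE = re.compile(r"\$(?P<idx>\d+)\s*=\s*(?P<tool>\w+)\((?P<args>.*)\)")

@dataclass
class Task:
    idx: int
    tool: str
    args: str
    # Indices of tasks whose outputs this task depends on ($-placeholders in args).
    deps: set[int] = field(default_factory=set)

def parse_plan(plan: str) -> dict[int, Task]:
    """Turn the Planner's textual plan into a DAG of Task objects."""
    tasks: dict[int, Task] = {}
    for line in plan.strip().splitlines():
        m = PLAN_LINE.match(line.strip())
        if not m:
            continue  # skip non-task lines (e.g., the Planner's free-form thoughts)
        deps = {int(d) for d in re.findall(r"\$(\d+)", m["args"])}
        tasks[int(m["idx"])] = Task(int(m["idx"]), m["tool"], m["args"], deps)
    return tasks

plan_text = """
$1 = search(Microsoft Market Cap)
$2 = search(Apple Market Cap)
$3 = math($1 / $2)
$4 = llm($3)
"""
for t in parse_plan(plan_text).values():
    print(t.idx, t.tool, sorted(t.deps))   # e.g. "3 math [1, 2]"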
At a high level, LLMCompiler consists of three components: a Function Calling Planner (Sec. 3.1) that generates a sequence of tasks and their dependencies; a Task Fetching Unit (Sec. 3.2) that replaces arguments based on intermediate results and fetches the tasks; and an Executor (Sec. 3.3) that executes the tasks with the associated tools. To use LLMCompiler, users are only required to provide tool definitions and optional in-context examples for the Planner, as will be further discussed in Sec. 4.1.

3.1. Function Calling Planner

The Function Calling Planner is responsible for generating a sequence of tasks to be executed along with any dependencies among them. For instance, Tasks $1 and $2 in Fig. 2 are two independent searches that can be performed in parallel. However, Task $3 depends on the outcomes of the first and second searches. Therefore, the Planner's role is to automatically identify the necessary tasks, their input arguments, and their inter-dependencies using the sophisticated reasoning capability of LLMs, essentially forming a directed acyclic graph of task dependencies. If a task depends on a preceding task, it incorporates a placeholder variable, such as $1 in Task $3 of Fig. 2, which will later be substituted with the actual output from the preceding task (Sec. 3.2).

The Planner in LLMCompiler leverages LLMs' reasoning capability to decompose tasks from natural language inputs. To achieve this, the Planner LLM incorporates a pre-defined prompt that guides it on how to create dependency graphs and ensure correct syntax (see Appendix H for details). Besides this, users also need to supply tool definitions and optional in-context examples for the Planner. These examples provide detailed demonstrations of task decomposition specific to a problem, helping the Planner better understand the rules. Further details on user-supplied information for LLMCompiler are elaborated in Sec. 4.1. In Sec. 4.2, we introduce an additional optimization for the Planner that streams tasks as soon as they are created, instead of waiting to complete the entire planning process.

3.2. Task Fetching Unit

The Task Fetching Unit, inspired by the instruction fetching units in modern computer architectures, fetches tasks to the Executor as soon as they are ready for (parallel) execution, based on a greedy policy. Another key functionality is to replace variables with the actual outputs from preceding tasks, which were initially set as placeholders by the Planner. For the example in Fig. 2, the variables $1 and $2 in Task $3 would be replaced with the actual market caps of Microsoft and Apple. This can be implemented with a simple fetching and queuing mechanism without a dedicated LLM.

3.3. Executor

The Executor asynchronously executes tasks fetched from the Task Fetching Unit. As the Task Fetching Unit guarantees that all the tasks dispatched to the Executor are independent, it can simply execute them concurrently. The Executor is equipped with user-provided tools, and it delegates each task to the associated tool. These tools can be simple functions like a calculator, Wikipedia search, or API calls, or they can even be LLM agents that are tailored for a specific task. As depicted in the Executor block of Fig. 2, each task has dedicated memory to store its intermediate outcomes, similar to what typical sequential frameworks do when aggregating observations as a single prompt (Yao et al., 2022). Upon completion of a task, its results are forwarded as input to the tasks that depend on them.
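A minimal sketch of how the Task Fetching Unit and Executor of Secs. 3.2 and 3.3 could be realized with asyncio is shown below. It reuses the hypothetical Task and parse_plan definitions from the sketch after Fig. 2; the greedy dispatch-when-ready policy, the placeholder substitution, and the tool registry shown here are illustrative assumptions, not the released implementation.

import asyncio
import re
from typing import Callable

async def execute_dag(tasks: dict[int, "Task"],
                      tools: dict[str, Callable[[str], str]]) -> dict[int, str]:
    """Greedily dispatch every task whose dependencies are resolved, in parallel."""
    results: dict[int, str] = {}
    pending = dict(tasks)
    while pending:
        # Task Fetching Unit: fetch all tasks whose dependency outputs are available.
        ready = [t for t in pending.values() if t.deps <= set(results)]
        if not ready:
            raise ValueError("dependency cycle or missing task in the plan")
        for t in ready:
            del pending[t.idx]
        # Substitute $-placeholders (e.g., $1 and $2 in "math($1 / $2)") with outputs.
        def resolve(args: str) -> str:
            return re.sub(r"\$(\d+)", lambda m: str(results[int(m.group(1))]), args)
        # Executor: run all ready (mutually independent) tool calls concurrently.
        outputs = await asyncio.gather(
            *(asyncio.to_thread(tools[t.tool], resolve(t.args)) for t in ready)
        )
        for t, out in zip(ready, outputs):
            results[t.idx] = out
    return results

# Illustrative usage with user-provided tool callables for search/math/llm:
# results = asyncio.run(execute_dag(parse_plan(plan_text), tools))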
3.4. Dynamic Replanning

In various applications, the execution graph may need to adapt based on intermediate results that are a priori unknown. An analogy in programming is branching, where the path of execution is determined only during runtime, depending on which branch conditions are satisfied. Such dynamic execution patterns can also appear with LLM function calling. For simple branching (e.g., if-else statements), one could statically compile the execution flow and choose the right branch dynamically based on the intermediate results. However, for more complex branching it may be better to do a recompilation or replanning based on the intermediate results.

When replanning, the intermediate results are sent back from the Executor to the Function Calling Planner, which then generates a new set of tasks with their associated dependencies. These tasks are then sent to the Task Fetching Unit and subsequently to the Executor. This cycle continues until the desired final result is achieved and can be delivered to the user. We show an example use case of this in Sec. 5.3 for solving the Game of 24 using the Tree-of-Thoughts approach.

4. LLMCompiler Details

4.1. User-Supplied Information

LLMCompiler requires two inputs from the user:

1. Tool Definitions: Users need to specify the tools that LLMs can use, including their descriptions and argument specifications. This is essentially the same requirement as in other frameworks like ReAct and OpenAI function calling.

2. In-context Examples for the Planner: Optionally, users can provide LLMCompiler with examples of how the Planner should behave. For instance, in the case of Fig. 2, users may provide examples illustrating expected inter-task dependencies for certain queries. These examples can help the Planner LLM understand how to use various tools and generate the appropriate dependency graph for incoming inputs in the correct format. In Appendix G, we include the examples that we used in our evaluations.

4.2. Streamed Planner

The Planner may incur a non-trivial overhead for user queries that involve many tasks, as it blocks the Task Fetching Unit and the Executor, which must wait for the Planner output before initiating their processes. However, analogous to instruction pipelining in modern computer systems, this can be mitigated by enabling the Planner to asynchronously stream the dependency graph, thereby allowing each task to be immediately processed by the Executor as soon as its dependencies are all resolved. In Table C.1, we present a latency comparison of LLMCompiler with and without the streaming mechanism across different benchmarks. The results demonstrate consistent latency improvements with streaming. Particularly, in the ParallelQA benchmark, the streaming feature leads to a latency gain of up to 1.3×. This is attributed to the math tool's longer execution time for ParallelQA, which can effectively hide the Planner's latency in generating subsequent tasks, unlike the shorter execution times of the search tool used in HotpotQA and Movie Recommendation.
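The pipelining described in Sec. 4.2 can be pictured as the Planner yielding tasks while the LLM is still generating the rest of the plan. Below is a minimal sketch under the assumption that the plan arrives as a text stream and that the Task dataclass and line format from the Fig. 2 sketch are reused; none of this is the exact interface of the released code.

import re
from typing import AsyncIterator

PLAN_LINE = re.compile(r"\$(?P<idx>\d+)\s*=\s*(?P<tool>\w+)\((?P<args>.*)\)")

async def stream_tasks(plan_chunks: AsyncIterator[str]) -> AsyncIterator["Task"]:
    """Emit each task as soon as its full line has been streamed by the Planner LLM,
    instead of waiting for the entire plan to finish."""
    buffer = ""
    async for chunk in plan_chunks:          # e.g., tokens from a streaming LLM API
        buffer += chunk
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            m = PLAN_LINE.match(line.strip())
            if m:
                deps = {int(d) for d in re.findall(r"\$(\d+)", m["args"])}
                # Task is the dataclass from the parsing sketch after Fig. 2.
                yield Task(int(m["idx"]), m["tool"], m["args"], deps)

# The Task Fetching Unit can consume this generator and dispatch each task the moment
# its dependencies have completed, so early tasks (e.g., the two independent searches
# in Fig. 2) start running while the Planner is still writing later tasks like $3 and $4.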
5. Results

In this section, we evaluate LLMCompiler using a variety of models and problem types. We use both the proprietary GPT models and the open-source LLaMA-2 model, with the latter demonstrating LLMCompiler's capability in enabling parallel function calling in open-source models. Furthermore, there are various types of parallel function calling patterns that can be addressed with LLMs. These range from embarrassingly parallel patterns, where all tasks can be executed in parallel without any dependencies between them, to more complex dependency patterns, as illustrated in Fig. 3. Importantly, we also assess LLMCompiler on the Game of 24 benchmark, which involves dynamic replanning based on intermediate results, highlighting its adaptability to dynamic dependency graphs. Finally, we apply LLMCompiler to the WebShop benchmark to showcase its potential in decision-making tasks. Overall, we start by presenting results for simple execution patterns, and then we move to more complex ones.

5.1. Embarrassingly Parallel Function Calling

The simplest scenario involves an LLM using a tool repeatedly for independent tasks such as conducting parallel searches or analyses to gather information on different topics, like the pattern depicted in Fig. 3 (a). While these tasks are independent of each other and can be executed in parallel, ReAct, along with other LLM solutions as they stand, would need to run them sequentially. This leads to increased latency and token consumption due to its frequent LLM invocations for each tool usage, as also illustrated in Fig. 1. In this section, we demonstrate how LLMCompiler can identify parallelizable patterns and execute independent tasks concurrently to resolve this issue. To do so, we use the following two benchmarks:

• HotpotQA: A dataset that evaluates multi-hop reasoning (Yang et al., 2018). We only use the comparison dev set. This contains 1.5k questions comparing two different entities, thus exhibiting a 2-way embarrassingly parallel execution pattern. An example question is shown in Fig. 1.

• Movie Recommendation: A dataset with 500 examples that asks to identify the most similar movie out of four options to another set of four movies, exhibiting an 8-way embarrassingly parallel pattern (Srivastava et al., 2022).

[Figure 3: Example questions. (a) "Analyze Apple and Microsoft's latest 10-K form and compare their sales forecast." (b) "If Stanford and UCLA were to merge, would they have more Nobel laureates than UC Berkeley?" (c) "Which has higher total healthcare expenses, Florida or New York, considering both public and private sectors?"]

Figure 3. Examples of questions with different function calling patterns and their dependency graphs. The HotpotQA and Movie Recommendation datasets exhibit pattern (a), and the ParallelQA dataset exhibits patterns (b) and (c), among other patterns. In (a), we need to analyze each company's latest 10-K. In (b), we need three searches for each school, followed by one addition and one comparison operation. In (c), we need to search for each state's annual healthcare spending in each sector, sum each state's spending, and then perform a comparison.

Table 1. Accuracy and latency comparison of LLMCompiler compared to the baseline on different benchmarks, including HotpotQA, Movie Recommendation, our custom dataset named ParallelQA, and the Game of 24. For HotpotQA and Movie Recommendation, we frequently observe looping and early stopping (Sec. 5.1). To minimize these behaviors as much as possible, we incorporated ReAct-specific prompting, which we denote as ReAct†. ReAct (without †) indicates the original results without this prompting. We do not include the latency for the original ReAct since looping and early stopping make precise latency measurement difficult.

                                       GPT (Closed-source)                 LLaMA-2 70B (Open-source)
Benchmark    Method                    Accuracy (%)  Latency (s)  Speedup  Accuracy (%)  Latency (s)  Speedup
HotpotQA     ReAct                     61.52         -            -        54.74         -            -
             ReAct†                    62.47         7.12         1.00×    54.40         13.44        1.00×
             OAI Parallel Function     62.05         4.42         1.61×    -             -            -
             LLMCompiler               62.00         3.95         1.80×    57.83         9.58         1.40×
Movie Rec.   ReAct                     68.60         -            -        70.00         -            -
             ReAct†                    72.47         20.47        1.00×    70.60         33.37        1.00×
             OAI Parallel Function     77.00         7.42         2.76×    -             -            -
             LLMCompiler               77.13         5.47         3.74×    77.80         11.83        2.82×
ParallelQA   ReAct                     89.09         35.90        1.00×    59.59         15.47        1.00×
             OAI Parallel Function     87.32         19.29        1.86×    -             -            -
             LLMCompiler               89.38         16.69        2.15×    68.14         26.20        2.27×
Game of 24   Tree-of-Thoughts          74.00         241.2        1.00×    30.00         952.06       1.00×
             LLMCompiler               75.33         83.6         2.89×    32.00         456.02       2.09×

Table 2. Input and output token consumption as well as the estimated cost on HotpotQA, Movie Recommendation, and our custom dataset named ParallelQA. The cost is computed based on the pricing table of the GPT models used for each benchmark.

Benchmark    Method           Tokens In.  Tokens Out.  Cost ($/1k)  Cost Red.
HotpotQA     ReAct            2900        120          5.00         1.00×
             OAI Para. Func.  2500        63           2.66         1.87×
             LLMCompiler      1300        80           1.47         3.37×
Movie Rec.   ReAct            20000       230          20.46        1.00×
             OAI Para. Func.  5800        160          6.14         3.33×
             LLMCompiler      2800        115          3.04         6.73×
ParallelQA   ReAct            46000       470          480          1.00×
             OAI Para. Func.  25000       370          260          1.81×
             LLMCompiler      9200        340          103          4.65×

Experimental Setups. As a baseline method, we compare LLMCompiler with ReAct. We follow the ReAct setup (Yao et al., 2022) using the same Wikipedia search tool that LLMs can use to search for information. We did not include the lookup tool since it is not relevant to our problem setting. We have optimized the prompt and in-context examples for both ReAct and LLMCompiler to the best of our abilities. For all experiments across these datasets, we use gpt-3.5-turbo (1106 release). For the experiments using GPT, we additionally report the results using OpenAI's parallel function calling capability, which was announced concurrently with our work. We also show how LLMCompiler can be effectively combined with the open-source LLaMA-2 70B model to provide the model with parallel function calling capabilities. For all experiments, we have measured accuracy, end-to-end latency, as well as input and output token usage. See Appendix D for details on experimental setups.
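As a concrete illustration of the user-supplied tool definitions described in Sec. 4.1, the Wikipedia search tool mentioned above could be registered roughly as follows. The ToolDefinition structure and the stubbed wikipedia_search function are illustrative assumptions and not the interface of the released code.

from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolDefinition:
    name: str
    description: str          # shown to the Planner so it knows when to use the tool
    func: Callable[[str], str]

def wikipedia_search(entity: str) -> str:
    """Return the first paragraph of the Wikipedia page for `entity`.
    (Stubbed here; any HTTP or Wikipedia-client call could back it.)"""
    raise NotImplementedError

search_tool = ToolDefinition(
    name="search",
    description=(
        "search(entity): searches Wikipedia for `entity` and returns the first "
        "paragraph of the matching page. Use one call per entity."
    ),
    func=wikipedia_search,
)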
Accuracy and Latency. We report the accuracy, end-to-end latency, and relative speed-up of LLMCompiler compared to ReAct in Tab. 1. First, we observe that ReAct consistently achieves lower accuracy compared to OpenAI parallel function calling and LLMCompiler. We identify two main failure modes in ReAct: (1) the tendency for redundant generation of prior function calls, a point also noted in the original ReAct paper (Yao et al., 2022); and (2) premature early stopping based on incomplete intermediate results. In Appendix A, we offer a detailed analysis demonstrating how these two prevalent failure cases significantly hurt ReAct's accuracy, and how they can be resolved with LLMCompiler, leading to an accuracy enhancement of up to 7 – 8%. Furthermore, we have conducted interventional experiments in which we incorporated ReAct-specific prompts to avoid repetitive function calls and early stopping. ReAct† in Tab. 1 refers to ReAct with this ReAct-specific prompt. The ReAct-specific prompt yields a general accuracy improvement with ReAct† as compared to the original ReAct. Nevertheless, LLMCompiler still demonstrates on-par or better accuracy than ReAct†, as such prompting does not serve as a perfect solution to completely avoiding the erroneous behavior of ReAct.

Additionally, when compared to ReAct†, LLMCompiler demonstrates a noticeable speedup of 1.80× and 1.40× on HotpotQA with GPT and LLaMA, respectively. Similarly, LLMCompiler demonstrates 3.74× and 2.82× speedup on Movie Recommendation with each model. Note that we benchmark the latency of LLMCompiler against that of ReAct†, since the repeating and early stopping behavior of the original ReAct, as discussed above, makes its latency unpredictable and unsuitable for a fair comparison. LLMCompiler demonstrates a speedup of up to 35% compared to OpenAI parallel function calling, whose latency gain over ReAct is 1.61× and 2.76× on each benchmark.¹

¹Unfortunately, we are unable to conclude why this is the case, as OpenAI has not publicly disclosed any details about their function calling mechanism. One speculation is that there might be additional overheads to validate the function and argument names and to convert them into a system prompt. Nevertheless, we have seen a consistent trend with multiple runs over several days.

Costs. Another important consideration of using LLMs is cost, which depends on the input and output token usage. The costs for the GPT experiments are provided in Tab. 2. LLMCompiler is more cost-efficient than ReAct, as it involves less frequent LLM invocations. Interestingly, LLMCompiler also outperforms the recent OpenAI parallel function calling in cost efficiency. This is because LLMCompiler's planning phase is more prompt-length efficient than that of OpenAI parallel function calling, since our Planner's in-context examples are rather short and only include plans, not observations (see Appendix H).

5.2. Parallel Function Calling with Dependencies

The cases considered above are rather simple, as only one tool is used and all tasks can be executed independently of one another. However, similar to code execution in traditional code blocks, we may encounter function calling scenarios that involve more complex dependencies. To systematically evaluate the capability to plan out function calling in scenarios that involve complex task dependencies, we have designed a custom benchmark called ParallelQA. This benchmark is designed to incorporate non-trivial function calling patterns, including the different types of patterns in Fig. 3 (b) and (c). Inspired by the IfQA benchmark (Yu et al., 2023), ParallelQA contains 113 examples that involve mathematical questions on factual attributes of various entities. In particular, completing the task requires using two tools (i.e., search and math tools), with the second tool's argument depending on the result of the first tool's output. We have meticulously included questions that are answerable only with information from Wikipedia's first paragraph, effectively factoring out the failure cases due to unsuccessful searches. See Appendix I for more details on ParallelQA.

Experimental Setups. Similar to Sec. 5.1, we use ReAct (Yao et al., 2022) as the main baseline. Here, both ReAct and LLMCompiler are equipped with two tools: (1) the search tool, identical to the one mentioned in Sec. 5.1; and (2) the math tool, which solves mathematical problems. The math tool is inspired by Langchain's LLMMathChain (Langchain), which uses an LLM as an agent that interprets input queries and invokes the numexpr function with the appropriate formula. This enables the math chain to address a broad spectrum of math problems that are written both in mathematical and verbal form. See Appendix D for more details on experimental setups.

Accuracy and Latency. As shown in the ParallelQA row of Tab. 1, LLMCompiler arrives at the final answer with an average speedup of 2.15× with gpt-4-turbo and 2.27× with LLaMA-2 70B, by avoiding sequential execution of the dependency graphs. Beyond the latency speedup, we observe higher accuracy of LLMCompiler with the LLaMA-2 model as compared to that of ReAct, due to the reasons discussed in Sec. 5.1. Particularly in the LLaMA-2 experiment, where LLMCompiler achieves around a 9% increase in accuracy, we note that ∼20% of the examples experienced repetitive function calls with ReAct, aligning with our observations from the accuracy analysis detailed in Appendix A. Additionally, a comprehensive analysis of LLMCompiler's failure cases is provided in Appendix B, where we note minimal Planner failures, highlighting LLMCompiler's effectiveness in breaking down problems into complex multi-task dependencies.

Cost. Similar to Sec. 5.1, LLMCompiler demonstrates substantial cost reductions of 4.65× and 2.57× compared to ReAct and OpenAI's parallel function calling, respectively, as indicated in Tab. 2. This efficiency stems from LLMCompiler's reduced frequency of LLM invocations. The same holds for OpenAI's parallel function calling, but it is limited to planning out immediately parallelizable tasks, not the entire dependency graph. For example, in Fig. 3 (c), OpenAI's method would necessitate three distinct LLM calls for the initial search tasks, the following math tasks, and the final math task. In contrast, LLMCompiler achieves this with a single LLM call, planning all tasks concurrently.
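A minimal sketch of a math tool in the spirit described above: an LLM turns the verbal question into a numexpr expression, which is then evaluated locally. The llm_call helper and the prompt wording are placeholder assumptions for whatever chat-completion client and prompt are actually used.

import numexpr

def llm_call(prompt: str) -> str:
    """Placeholder for an LLM chat-completion call (e.g., a GPT or LLaMA-2 endpoint)."""
    raise NotImplementedError

MATH_PROMPT = (
    "Translate the following problem into a single numexpr-compatible arithmetic "
    "expression and return only the expression.\nProblem: {question}\nExpression:"
)

def math_tool(question: str) -> str:
    """Solve a math question posed in verbal or symbolic form, as in ParallelQA."""
    expression = llm_call(MATH_PROMPT.format(question=question)).strip()
    # numexpr evaluates plain arithmetic strings such as "3.1e12 / 2.8e12" without
    # handing the LLM output to Python's eval().
    result = numexpr.evaluate(expression).item()
    return str(result)

# Example: math_tool("What is $1 divided by $2?") would be invoked after the Task
# Fetching Unit has substituted $1 and $2 with the outputs of the two search tasks.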
5.3. Parallel Function Calling with Replanning

In the previous sections, we have discussed cases in which dependency graphs can be determined statically. However, there are cases where dependency graphs need to be constructed dynamically depending on intermediate observations. Here, we consider one such dynamic approach in the context of the Game of 24 with the Tree-of-Thoughts (ToT) strategy proposed in (Yao et al., 2023b). The Game of 24 is a task to generate 24 using a set of four numbers and basic arithmetic operations. For example, from the numbers 2, 4, 4, and 7, a solution could be 4 × (7 − 4) × 2 = 24. ToT approaches this task through two iterative LLM processes: (i) the thought proposer generates candidate partial solutions by selecting two numbers and applying an operation (e.g., obtaining 2, 3, 7 from 2, 4, 4, 7 by computing 7 − 4); (ii) the state evaluator assesses the potential of each candidate. Only the promising candidates are then processed in subsequent iterations of the thought proposer and state evaluator until 24 is reached. Details about the Game of 24 benchmark and the ToT strategy can be found in Appendix J.

While ToT achieves significant improvement at solving the Game of 24, its sequential, breadth-first search approach through the state tree can be time-consuming. LLMCompiler offers a faster alternative by enabling parallel execution of the thought proposer and the subsequent feasibility evaluator, akin to a parallel beam search method.

Experimental Setups. Although LLMCompiler offers latency advantages, solving this problem with a single static graph is not feasible, as the Planner cannot plan out the thought proposing stage before identifying the selected candidates from the state evaluator of the previous iteration. Consequently, the Planner is limited to planning only within one iteration at a time. To address this, we resort to LLMCompiler's replanning capability. In particular, LLMCompiler is equipped with three tools: thought proposer and state evaluator, which are both LLMs adapted from the original ToT framework, and top k select, which chooses the top k candidates from the thought proposer based on the state evaluator's assessment. After all these tools are executed, LLMCompiler can decide to "replan" if no proposal reaches 24, triggering the Planner to devise new plans using the shortlisted states from top k select of the previous iteration. In this way, LLMCompiler can dynamically regenerate plans for each iteration, enabling it to tackle highly complex tasks that require iterative replanning based on the outcomes of previous plans.

To evaluate LLMCompiler's performance on the Game of 24, we use 100 different instances of the game. For each problem, we consider the output successful if its operations are valid and yield 24 while also using the provided numbers exactly once each. Further details on experiment setups are outlined in Appendix D.

Success Rate and Latency. In the last two rows of Tab. 1, we explore the latency and success rate of LLMCompiler in comparison to the baseline described in (Yao et al., 2023b) on the Game of 24 benchmark. With the gpt-4 model, LLMCompiler demonstrates a 2.89× enhancement in latency while slightly improving the success rate compared to the baseline. Similarly, when applied with the LLaMA-2 model, LLMCompiler shows a 2.01× improvement in latency, again without compromising on success rate. These results demonstrate not only a significant latency reduction without quality degradation, but also the replanning capability of LLMCompiler for solving complex problems.
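The replanning flow of Secs. 3.4 and 5.3 can be sketched as a simple driver loop. The three tool parameters mirror the tools named above, while their signatures, the string-based state representation, the reaches_24 helper, and the loop structure are illustrative assumptions rather than the released implementation.

from typing import Callable, Optional

def solve_game_of_24(
    numbers: list[int],
    thought_proposer: Callable[[str], list[str]],   # state -> candidate next states
    state_evaluator: Callable[[str], float],        # state -> promise score
    top_k_select: Callable[[list[str], list[float], int], list[str]],
    reaches_24: Callable[[str], bool],
    max_iters: int = 4,
) -> Optional[str]:
    """Propose -> evaluate -> shortlist -> replan until a candidate reaches 24."""
    states = [" ".join(map(str, numbers))]          # e.g., "2 4 4 7"
    for _ in range(max_iters):
        # Proposer and evaluator calls for different states are independent, so one
        # LLMCompiler plan per iteration can execute them in parallel.
        candidates = [c for s in states for c in thought_proposer(s)]
        scores = [state_evaluator(c) for c in candidates]
        for c in candidates:
            if reaches_24(c):
                return c
        # No candidate reached 24: the intermediate results are fed back to the
        # Planner, which replans the next round from the shortlisted states.
        states = top_k_select(candidates, scores, 5)
    return None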
5.4. Application: LLMCompiler in Interactive Decision Making Tasks

In this section, we demonstrate that LLMCompiler can explore language-based interactive environments effectively by benchmarking LLMCompiler on WebShop (Yao et al., 2023a). As highlighted in (Shinn et al., 2023; Yao et al., 2022; 2023a), WebShop exhibits considerable diversity, which requires extensive exploration to purchase the most appropriate item. While recent works feature advanced exploration strategies and show promising results (Ma et al., 2023; Zhou et al., 2023a), their approaches are largely based on a sequential and extensive tree search that incurs significant latency penalties. Here, LLMCompiler showcases an exploration strategy that is both effective and efficient with the use of parallel function calling. Our method enables broader exploration of items in the environment, which improves the success rate compared to ReAct. At the same time, this exploration can be parallelized, yielding up to 101.7× speedup against baselines that perform sequential exploration.

Experimental Setups. We evaluate LLMCompiler against three baselines on this benchmark, ReAct (Yao et al., 2022), LATS (Zhou et al., 2023a), and LASER (Ma et al., 2023), using 500 WebShop instructions. The evaluation metrics are success rate, average score, and latency. More details of the WebShop environment and the baseline methods are provided in Appendix K. For this experiment, LLMCompiler is equipped with two tools: search and explore. The search function triggers the model to generate and dispatch a query that returns a list of typically ten items from the WebShop environment. The explore function then clicks through the links for each of the found items and retrieves information about the options, prices, attributes, and features that are available. Finally, based on the gathered information, LLMCompiler decides on the item that best matches the input instruction for purchasing. Further details on experiments can be found in Appendix D.

Table 3. Performance and Latency Analysis for WebShop. We evaluate LLMCompiler with two models, gpt-4 and gpt-3.5-turbo, and compare LLMCompiler against three baselines: ReAct, LATS, and LASER. We report success rate and average score in percentage. We reproduce the success rate and average score for ReAct, while those for LATS and LASER are from their papers. N denotes the number of examples used for evaluation.

Model          Method       Succ. Rate  Score  Latency (s)  N
gpt-3.5-turbo  ReAct        19.8        54.2   5.98         500
               LATS         38.0        75.9   1066         50
               LLMCompiler  44.0        72.8   10.72        50
               LLMCompiler  48.2        74.2   10.48        500
gpt-4-0613     ReAct        35.2        58.8   19.90        500
               LASER        50.0        75.6   72.16        500
               LLMCompiler  55.6        77.1   26.73        500

Performance and Latency. Our approach significantly outperforms all baseline models, as shown in Table 3. When using gpt-3.5-turbo, LLMCompiler achieves a 28.4% and 6% improvement in success rate against ReAct and LATS; with gpt-4, our method improves upon ReAct and LASER by 20.4% and 5.6%, respectively. In terms of latency, LLMCompiler exhibits a 101.7× and 2.69× speedup against LATS and LASER. While we note that LLMCompiler execution is slightly slower than ReAct on this benchmark, mainly due to the Planner overhead, we also highlight that the gains in success rate far outweigh the minor latency penalty.

We further delve into why LLMCompiler attains such an improved success rate and score compared to ReAct. Based on our observations, we discover that the ReAct agent tends to commit to a decision with imperfect information, a scenario that can arise when the agent has not gathered sufficient details about the features and options available for items. This observation was also noted in (Shinn et al., 2023) – without exploring more items in the environment, the agent struggles to differentiate between seemingly similar choices, ultimately failing to make the correct decision. In contrast, LLMCompiler undergoes further exploration by visiting all ten items found by search and retrieving relevant information about each item. We find that employing an effective search strategy is critical to decision-making tasks such as the WebShop benchmark.

The relatively high performance of LATS can also be explained in terms of its exploration scheme. In this framework, the agent executes a brute-force search through the state and action space of WebShop, exploring as many as 30 trajectories before making the final purchase. While this approach provides richer information for decision-making, the end-to-end execution becomes prohibitively slow.

We report that our method, LLMCompiler, outperforms LASER by an average score of 1.5. When compared to LATS, this score is within the standard deviation range of our method. The average score for LLMCompiler, along with its standard deviation, is 72.8 ± 4.01 for gpt-3.5-turbo. Further note that while the performance differences are marginal, our method exhibits significant execution speedup, 101.7× over LATS and 2.69× over LASER.

6. Conclusions

Existing methods for invoking multiple functions with LLMs resort to sequential and dynamic reasoning. As a result, they suffer from inefficiencies in latency, cost, and accuracy. As a solution, we introduced LLMCompiler, a compiler-inspired framework that enables efficient parallel function calling across various LLMs, including open-source models like LLaMA-2 and OpenAI's GPT series. By decomposing user inputs into tasks with defined inter-dependencies and executing these tasks concurrently through its Planner, Task Fetching Unit, and Executor components, LLMCompiler demonstrates substantial improvements in latency (up to 3.7×), cost efficiency (up to 6.7×), and accuracy (up to ∼9%), even outperforming OpenAI's parallel function calling feature in latency gains. We look forward to future work building upon LLMCompiler that will improve both the capabilities and efficiencies of LLMs in executing complex, large-scale tasks, thus transforming the future development of LLM-based applications.

Impact Statement

This paper presents research towards advancing the field of Machine Learning. While there are many potential societal consequences of our work, we do not find any one to be particularly noteworthy.

Acknowledgements

We appreciate the valuable feedback from Minwoo Kang. We acknowledge gracious support from the Furiosa team. We also appreciate the support from Microsoft through their Accelerating Foundation Model Research, including great support from Sean Kuno. Furthermore, we appreciate support from Google Cloud, the Google TRC team, and specifically Jonathan Caton, and Prof. David Patterson. Prof. Keutzer's lab is sponsored by the Intel corporation, Intel One-API, Intel VLAB team, the Intel One-API center of excellence, as well as funding through BDD and BAIR. We also appreciate support from Samsung including Dongkyun Kim and David Thorsley. We appreciate the great support from Ellick Chan, Saurabh Tangri, Andres Rodriguez, and Kittur Ganesh. Sehoon Kim and Suhong Moon would like to acknowledge the support from the Korea Foundation for Advanced Studies. Amir Gholami was supported through funding from Samsung SAIT. Michael W. Mahoney would also like to acknowledge a J. P. Morgan Chase Faculty Research Award as well as the DOE, NSF, and IARPA. Our conclusions do not necessarily reflect the position or the policy of our sponsors, and no official endorsement should be inferred.
References

https://huggingface.co/text-generation-inference.

https://github.com/nvidia/tensorrt-llm.

Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., and Sutton, C. Program synthesis with large language models, 2021.

Besta, M., Blach, N., Kubicek, A., Gerstenberger, R., Gianinazzi, L., Gajda, J., Lehmann, T., Podstawski, M., Niewiadomski, H., Nyczyk, P., and Hoefler, T. Graph of thoughts: Solving elaborate problems with large language models, 2023.

Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. Accelerating large language model decoding with speculative sampling, 2023a.

Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A. N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., and Zaremba, W. Evaluating large language models trained on code, 2021.

Chen, W., Ma, X., Wang, X., and Cohen, W. W. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks, 2023b.

Dettmers, T., Svirschevski, R., Egiazarian, V., Kuznedelev, D., Frantar, E., Ashkboos, S., Borzunov, A., Hoefler, T., and Alistarh, D. Spqr: A sparse-quantized representation for near-lossless llm weight compression, 2023.

Frantar, E. and Alistarh, D. Sparsegpt: Massive language models can be accurately pruned in one-shot, 2023.

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. GPTQ: Accurate post-training compression for generative pretrained transformers. arXiv preprint arXiv:2210.17323, 2022.

Gao, L., Madaan, A., Zhou, S., Alon, U., Liu, P., Yang, Y., Callan, J., and Neubig, G. Pal: Program-aided language models. arXiv preprint arXiv:2211.10435, 2022.

Hao, S., Gu, Y., Ma, H., Hong, J. J., Wang, Z., Wang, D. Z., and Hu, Z. Reasoning with language model is planning with world model, 2023.

Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., and Steinhardt, J. Measuring coding challenge competence with apps. NeurIPS, 2021a.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021b.

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021c.

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pp. 2790–2799. PMLR, 2019.

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.

Karpathy, A. Intro to large language models, 2023.

Khot, T., Trivedi, H., Finlayson, M., Fu, Y., Richardson, K., Clark, P., and Sabharwal, A. Decomposed prompting: A modular approach for solving complex tasks. In The Eleventh International Conference on Learning Representations, 2023.

Kim, S., Hooper, C., Gholami, A., Dong, Z., Li, X., Shen, S., Mahoney, M. W., and Keutzer, K. Squeezellm: Dense-and-sparse quantization, 2023.

Kim, S., Mangalam, K., Moon, S., Malik, J., Mahoney, M. W., Gholami, A., and Keutzer, K. Speculative decoding with big little decoder, 2024.

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. Large language models are zero-shot reasoners, 2023.

Kwon, W., Kim, S., Mahoney, M. W., Hassoun, J., Keutzer, K., and Gholami, A. A fast post-training pruning framework for transformers, 2022.

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.

Langchain. https://github.com/langchain-ai/langchain.

Lester, B., Al-Rfou, R., and Constant, N. The power of scale for parameter-efficient prompt tuning, 2021.

Leviathan, Y., Kalman, M., and Matias, Y. Fast inference from transformers via speculative decoding, 2023.

Liang, Y., Wu, C., Song, T., Wu, W., Xia, Y., Liu, Y., Ou, Y., Lu, S., Ji, L., Mao, S., Wang, Y., Shou, L., Gong, M., and Duan, N. Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis, 2023.

Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., Gan, C., and Han, S. Awq: Activation-aware weight quantization for llm compression and acceleration, 2023.

Liu, J. LlamaIndex, 11 2022. URL https://github.com/jerryjliu/llama_index.

Ma, K., Zhang, H., Wang, H., Pan, X., and Yu, D. Laser: Llm agent with state-space exploration for web navigation, 2023.

Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B. P., Hermann, K., Welleck, S., Yazdanbakhsh, A., and Clark, P. Self-refine: Iterative refinement with self-feedback, 2023.

Ning, X., Lin, Z., Zhou, Z., Wang, Z., Yang, H., and Wang, Y. Skeleton-of-thought: Large language models can do parallel decoding, 2023.

OpenAI. Gpt-4 technical report, 2023.

OpenAI. New models and developer products announced at devday, 2023.

Packer, C., Fang, V., Patil, S. G., Lin, K., Wooders, S., and Gonzalez, J. E. Memgpt: Towards llms as operating systems, 2023.

Patel, P., Mishra, S., Parmar, M., and Baral, C. Is a question decomposition unit all we need? 2022.

Patil, S. G., Zhang, T., Wang, X., and Gonzalez, J. E. Gorilla: Large language model connected with massive apis, 2023.

Press, O., Zhang, M., Min, S., Schmidt, L., Smith, N. A., and Lewis, M. Measuring and narrowing the compositionality gap in language models, 2023.

Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., Lin, Y., Cong, X., Tang, X., Qian, B., et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023.

Ruan, J., Chen, Y., Zhang, B., Xu, Z., Bao, T., Du, G., Shi, S., Mao, H., Zeng, X., and Zhao, R. Tptu: Task planning and tool usage of large language model-based ai agents. arXiv preprint arXiv:2308.03427, 2023a.

Ruan, Y., Dong, H., Wang, A., Pitis, S., Zhou, Y., Ba, J., Dubois, Y., Maddison, C. J., and Hashimoto, T. Identifying the risks of lm agents with an lm-emulated sandbox. arXiv preprint arXiv:2309.15817, 2023b.

Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., and Scialom, T. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.

Shen, Y., Song, K., Tan, X., Li, D., Lu, W., and Zhuang, Y. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face, 2023.

Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., and Yao, S. Reflexion: Language agents with verbal reinforcement learning, 2023.

Song, Y., Xiong, W., Zhu, D., Wu, W., Qian, H., Song, M., Huang, H., Li, C., Wang, K., Yao, R., Tian, Y., and Li, S. Restgpt: Connecting large language models with real-world restful apis, 2023.

Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.

Sumers, T. R., Yao, S., Narasimhan, K., and Griffiths, T. L. Cognitive architectures for language agents, 2023.

Surís, D., Menon, S., and Vondrick, C. Vipergpt: Visual inference via python execution for reasoning, 2023.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. Llama 2: Open foundation and fine-tuned chat models, 2023.

Wang, L., Xu, W., Lan, Y., Hu, Z., Lan, Y., Lee, R. K.-W., and Lim, E.-P. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. arXiv preprint arXiv:2305.04091, 2023a.

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models, 2023b.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pp. 24824–24837, 2022.

Wolfson, T., Geva, M., Gupta, A., Gardner, M., Goldberg, Y., Deutch, D., and Berant, J. Break it down: A question understanding benchmark. Transactions of the Association for Computational Linguistics, 2020.

Xu, B., Peng, Z., Lei, B., Mukherjee, S., Liu, Y., and Xu, D. Rewoo: Decoupling reasoning from observations for efficient augmented language models, 2023.

Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018.

Yang, Z., Dong, L., Du, X., Cheng, H., Cambria, E., Liu, X., Gao, J., and Wei, F. Language models as inductive reasoners, 2022.

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.

Yao, S., Chen, H., Yang, J., and Narasimhan, K. Webshop: Towards scalable real-world web interaction with grounded language agents, 2023a.

Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., and Narasimhan, K. Tree of Thoughts: Deliberate problem solving with large language models, 2023b.

Yu, G.-I., Jeong, J. S., Kim, G.-W., Kim, S., and Chun, B.-G. Orca: A distributed serving system for Transformer-based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pp. 521–538, 2022.

Yu, W., Jiang, M., Clark, P., and Sabharwal, A. Ifqa: A dataset for open-domain question answering under counterfactual presuppositions, 2023.

Zheng, H. S., Mishra, S., Chen, X., Cheng, H.-T., Chi, E. H., Le, Q. V., and Zhou, D. Take a step back: Evoking reasoning via abstraction in large language models, 2023.

Zhou, A., Yan, K., Shlapentokh-Rothman, M., Wang, H., and Wang, Y.-X. Language agent tree search unifies reasoning acting and planning in language models, 2023a.

Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q. V., and Chi, E. H. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, 2023b.
Figure A.1. Distributions of the number of function calls when running the Movie Recommendation benchmark on ReAct (Left), ReAct with specific prompts to avoid early stopping (Middle, corresponding to ReAct† in Tab. 1), and LLMCompiler (Right). LLMCompiler (Right) consistently completes the search for all 8 movies, whereas ReAct (Left) often exits early, as demonstrated by about 85% of examples stopping early. Although the custom prompts shift ReAct's histogram toward more function calls (Middle), they still fall short of ensuring comprehensive searches for all movies. gpt-3.5-turbo is used for the experiment.
Figure A.2. The Movie Recommendation accuracy of the examples that are categorized by the number of function calls on ReAct, measured both on ReAct and LLMCompiler. The plot indicates that in ReAct, a decrease in the number of function calls correlates with lower accuracy, indicating that premature exits lead to reduced accuracy. In contrast, when the same examples are evaluated using LLMCompiler, which ensures complete searches for all eight movies before reaching a decision, they achieve higher and more consistent accuracy than those processed by ReAct. gpt-3.5-turbo is used for the experiment, and the results are averaged over 3 different runs.
Premature Early Stopping of ReAct. ReAct frequently suffers from premature early stopping, ceasing function calls
too early, and making decisions based on incomplete information. A clear example of this is observed in the Movie
Recommendation benchmark, where ReAct often searches for fewer than the required 8 movies before delivering its final
answer. In Fig. A.1 (Left), we illustrate the distribution of the number of function calls within ReAct (using GPT) across
the Movie Recommendation benchmark. Here, we observe that around 85% of the examples exhibit early stopping, making
decisions without completing all 8 movie searches. This contrasts with LLMCompiler (Right), where almost all examples
(99%) complete the full search of 8 movies. Although adding specific prompts to ReAct to prevent early stopping shifts
the distribution towards more function calls (Fig. A.1, Middle), resulting in an accuracy improvement from 68.60 to 72.47
(ReAct† in Tab. 1), it is nevertheless an imperfect solution.
To further assess how early stopping negatively impacts accuracy, we categorize Movie Recommendation benchmark
examples by their number of function calls in ReAct. We then evaluated these groups using LLMCompiler, ensuring
complete search results for all 8 movies. Fig. A.2 reveals that fewer function calls in ReAct correlate with lower average
Figure A.3. Distributions of the number of function calls when running the HotpotQA benchmark on ReAct (Left) and LLMCompiler
(Right). While LLMCompiler (Right) consistently completes the task within 2 function calls, which is expected as HotpotQA exhibits a
2-way parallelizable pattern, ReAct (Left) shows that around 10% of the examples undergo repetitive (>4) function calls, resulting in a
diverging behavior of the framework. LLaMA-2 70B is used for the experiment.
Figure A.4. The HotpotQA accuracy of the examples that are categorized by the number of function calls on ReAct, measured both on ReAct and LLMCompiler. The plot indicates that in ReAct, repetitive function calls of four or more times can result in a significant accuracy degradation due to its infinite looping and diverging behavior. On the other hand, when the same examples are evaluated using LLMCompiler, which ensures only two searches per example, they achieve a higher accuracy of around 50%. LLaMA-2 70B is used for the experiment.
accuracy (green line). Conversely, if these examples were processed through LLMCompiler, with complete searches for
all eight movies, they consistently attained higher accuracy (purple line). This indicates not only that ReAct struggles with
premature exits (which is not fully addressed by prompting), but also that the earlier it stops, the greater the decline in accuracy,
contributing to the overall accuracy drop observed in Tab. 1. In contrast, LLMCompiler effectively addresses this issue.
Repetitive Function Calls of ReAct. Another common failure case of ReAct is its tendency toward repetitive function calls,
often leading to infinite loops or exceeding the context length limit. This problem is particularly noticeable in the HotpotQA
benchmark where ReAct repeatedly calls the same function if the Wikipedia search returns insufficient information about the
searched entity. Although HotpotQA is inherently 2-way parallelizable, as illustrated in Fig. A.3, we observe that about 10%
of its examples require more than four function calls in ReAct, usually resulting in an infinite loop or divergent behavior.
In contrast, LLMCompiler executes only two function calls for most examples.
To show how the repetitive function calls impact the overall accuracy, we conduct an accuracy analysis similar to the
previous case. In Fig. A.4, we categorize HotpotQA benchmark examples by the number of function calls in ReAct, and
then we compare their accuracy on both ReAct and LLMCompiler. The analysis reveals that examples that launch two
function calls in ReAct maintain the same accuracy in LLMCompiler. However, cases with more than four function calls
in ReAct, which often lead to divergent behavior, show less than 10% accuracy in ReAct. On the other hand, when these
examples are processed with LLMCompiler, they achieve around 50% accuracy by circumventing repetitive calls. It is
worth noting that there are instances with three function calls in ReAct, where an extra search that retries with an alternate
entity name after a failed initial search can yield better accuracy than LLMCompiler.
While this shows a potential adaptability advantage of ReAct, such instances represent less than 3% of cases.
Table C.1. A latency comparison between using and not using streaming in the Planner. Streaming yields consistent latency improvement
across different benchmarks, as it enables the Task Fetching Unit to start task execution immediately as each task is produced by the
Planner. The impact of streaming is especially notable in the ParallelQA benchmark, where tool execution times are long enough to
effectively hide the Planner’s execution time.
Benchmark w/o streaming (s) w/ streaming (s) Latency speedup
HotpotQA 4.00 3.95 1.01×
Movie Rec. 5.64 5.47 1.03×
ParallelQA 21.72 16.69 1.30×
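To make the effect of streaming concrete, the sketch below is a minimal, hypothetical illustration (not the released implementation) of how a Task Fetching Unit could begin executing tasks as soon as the Planner emits them, rather than waiting for the full plan. The plan_stream generator, run_tool coroutine, and all latency values are placeholder assumptions chosen only for illustration.

import asyncio
import time

# Hypothetical tool latencies (seconds); the first task is the slowest (a straggler).
TOOL_LATENCY = {"Movie A": 1.0, "Movie B": 0.3, "Movie C": 0.3}

async def plan_stream():
    """Hypothetical Planner stub: yields one task at a time, emulating
    token-level streaming of the dependency graph (0.2 s per plan step)."""
    for movie in TOOL_LATENCY:
        await asyncio.sleep(0.2)
        yield {"tool": "search", "arg": movie}

async def run_tool(task):
    """Hypothetical Executor stub standing in for a real function call."""
    await asyncio.sleep(TOOL_LATENCY[task["arg"]])
    return f"result for {task['arg']}"

async def run(streaming: bool):
    start = time.perf_counter()
    if streaming:
        # Dispatch each task the moment the Planner emits it, so tool
        # execution overlaps with the remaining planning time.
        pending = []
        async for task in plan_stream():
            pending.append(asyncio.create_task(run_tool(task)))
        results = await asyncio.gather(*pending)
    else:
        # Wait for the complete plan, then execute all tasks in parallel.
        plan = [task async for task in plan_stream()]
        results = await asyncio.gather(*(run_tool(t) for t in plan))
    return time.perf_counter() - start, results

if __name__ == "__main__":
    for streaming in (False, True):
        elapsed, _ = asyncio.run(run(streaming))
        print(f"streaming={streaming}: {elapsed:.2f}s")

In this toy setting, the straggler's execution overlaps with the remaining planning when streaming is enabled, mirroring the ParallelQA row of Tab. C.1, where tool calls are long enough to hide much of the Planner's generation time.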
C. Related Work
Here, we continue the discussion of related work that we began in Sec. 2.
D. Experimental Details
Our experiments evaluate two common scenarios: (1) using API-based closed-source models; and (2) using open-
source models with an in-house serving framework. We use OpenAI’s GPT models as closed-source models, in particular,
gpt-3.5-turbo (1106 release) for HotpotQA and Movie Recommendation, gpt-4-turbo (1106 release) for ParallelQA, and
gpt-4 (0613 release) for Game of 24. Experiments on HotpotQA, Movie Recommendation, and ParallelQA were all
conducted in November 2023 after the 1106 release. The Game of 24 experiments were conducted over a two-month
period from September to October 2023. For an open-source model, we use LLaMA-2 (Touvron et al., 2023), which was
hosted on 2 A100-80GB GPUs using the vLLM (Kwon et al., 2023) framework. All the runs have been carried out with
zero temperature, except for thought proposer and state evaluator for the Game of 24 evaluation, where the
temperature is set to 0.7. Since OpenAI's outputs exhibit randomness even at temperature 0, we conducted 3 runs and
report the average accuracy. Across ReAct, OpenAI parallel function calling, and LLMCompiler, we perform 3,
1, and 5-shot learning for HotpotQA, Movie Recommendation, and ParallelQA, respectively; the same examples across
different methods were used to ensure a fair comparison. For the Game of 24, we use 2 in-context examples for the Planner.
We use the same instruction prompts across different methods for a fair comparison, except for ReAct† in Sec. 5.1 with
additional ReAct-specific prompts. For the WebShop experiment, we use gpt-4-0613 with an 8k context window and the gpt-3.5-turbo
model with a 16k context window.
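For quick reference, the setup above can be condensed into a small configuration summary. The sketch below is only such a summary in Python form, with field names of our own choosing; it is not taken from the released code.

# Illustrative summary of the settings in this section (field names are ours).
EXPERIMENT_SETUP = {
    # Closed-source (OpenAI) settings; all runs use temperature 0 unless noted.
    "HotpotQA":   {"model": "gpt-3.5-turbo (1106)", "in_context_examples": 3},
    "Movie Rec.": {"model": "gpt-3.5-turbo (1106)", "in_context_examples": 1},
    "ParallelQA": {"model": "gpt-4-turbo (1106)",   "in_context_examples": 5},
    "Game of 24": {"model": "gpt-4 (0613)",         "planner_examples": 2,
                   # thought proposer and state evaluator run at temperature 0.7
                   },
    "WebShop":    {"models": ["gpt-4-0613 (8k context)", "gpt-3.5-turbo (16k context)"]},
    # Open-source setting
    "open-source": {"model": "LLaMA-2", "serving": "vLLM", "hardware": "2x A100-80GB"},
}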
E. Analysis
E.1. Parallel Speedup Modeling
While LLMCompiler shows a noticeable latency gain across various workloads, it does not achieve the full N× latency speedup
for N-way parallel workloads. This is mostly due to the overhead of LLMCompiler's Planner and final
answering process, which cannot be parallelized. In our Movie Recommendation experiment, LLMCompiler's Planner and
the answering process incur an overhead of 1.88 and 1.62 seconds on average, respectively; their combined overhead
already comprises more than half of LLMCompiler's overall latency in Tab. 1. Another source of overhead is the straggler
effect among the parallel tasks when they need to join together: we observe that the average latency of the slowest search is
1.13 seconds, nearly 2× the 0.61-second average latency across all tasks. Below, we provide an analytical
latency modeling of ReAct, LLMCompiler, and LLMCompiler with streaming, and we provide an analysis of achievable
latency speedup.
In this section, we focus on the embarrassingly parallelizable workload (the pattern in Fig. 3(a)), as this allows for a clearer
understanding of the impact of each component on potential latency gains. For a precise latency analysis, we consider the
three key components in Fig. 2: the Planner, the Task Fetching Unit, and the Executor. Assume that the Planner generates $N$
different tasks to be done. We define $P_i$ as the Planner's output corresponding to the $i$-th atomic task. Each $P_i$ is a blueprint
for a specific atomic task, which we refer to as $E_i$. The execution of $E_i$ involves a specific function call using the appropriate
tool. The latency function of each unit in the system is defined to quantify the time taken for specific operations. For the
Planner, the latency is denoted as $T_P(P_i)$, representing the time taken by the Planner to generate the plan $P_i$. Similarly, for
the Executor, the latency $T_E(E_i)$ corresponds to the time required to complete the task $E_i$. We ignore the latency of the Task
Fetching Unit, as it is negligible in this section. Our focus here is on comparing the latency models of ReAct (Yao et al.,
2022) and LLMCompiler.
To begin our analysis of ReAct’s latency, we express its total latency as:
\[ T^R = \sum_{i=1}^{N} \left( T_P^R(P_i) + T_E(E_i) \right). \qquad (1) \]
Here, the superscript $R$ refers to ReAct. In the ReAct agent system, the process typically involves initial thought generation,
followed by action generation and the acquisition of observations through function calls associated with the tool. The
creation of both the thought and the action is collectively considered part of generating $P_i$. It is important to note that while the
Planner's latency is denoted with a superscript (indicating ReAct), the Executor's latency does not have such a superscript.
This is because the function calling and the tool execution remain the same between ReAct and LLMCompiler.
For LLMCompiler, where all parallelizable tasks are processed concurrently, the execution portion of the total latency is
determined by the slowest of these tasks. Hence, the latency model for LLMCompiler can be represented as:
\[ T^C = \sum_{i=1}^{N} T_P^C(P_i) + \max_{k \in \{1, \ldots, N\}} T_E(E_k). \qquad (2) \]
This expression captures the sum of all planning times plus the execution time of the longest task, reflecting the system’s
focus on parallel execution.
[Figure E.5 plot: latency (s) on the y-axis versus the number of parallelizable tasks (2–5) on the x-axis.]
Figure E.5. Latency on the ParallelQA benchmark grouped by the number of maximum parallelizable tasks.
Further, if the Planner employs streaming of the dependency graph, the latency model undergoes a modification and can be
expressed as:
\[ T^{SC} = \sum_{i=1}^{N} T_P^C(P_i) + T_E(E_N). \qquad (3) \]
It is important to note that $T^{SC} \leq T^C$, since $T_E(E_N) \leq \max_{k} T_E(E_k)$ and, with streaming, every earlier task can start
executing before planning finishes. This implies that the streaming mechanism allows for more efficient handling of
task dependencies, potentially reducing the overall latency.
In evaluating the potential speedup achievable with the LLMCompiler framework compared to ReAct, the speedup metric,
denoted as γ, is defined as follows:
\[ \gamma = \frac{T^R}{T^C} = \frac{\sum_{i=1}^{N} \left( T_P^R(P_i) + T_E(E_i) \right)}{\sum_{i=1}^{N} T_P^C(P_i) + \max_{k \in \{1, \ldots, N\}} T_E(E_k)}. \qquad (4) \]
This ratio represents the comparative efficiency of LLMCompiler over ReAct, considering both planning and execution
latencies.
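To make Eqs. (1)–(4) concrete, the following minimal sketch evaluates the three latency models on hypothetical per-task planning and execution times. All numbers are placeholders rather than measurements, and for simplicity it assumes ReAct and LLMCompiler have the same per-task planning latency, as in Eq. (6).

def react_latency(plan_times, exec_times):
    """Eq. (1): ReAct interleaves planning and execution, so every step is serial."""
    return sum(p + e for p, e in zip(plan_times, exec_times))

def compiler_latency(plan_times, exec_times):
    """Eq. (2): planning is serial, execution is parallel, so only the
    slowest task (the straggler) contributes to the execution portion."""
    return sum(plan_times) + max(exec_times)

def compiler_streaming_latency(plan_times, exec_times):
    """Eq. (3): with a streaming Planner, earlier tasks overlap with the
    remaining planning, leaving only the last task's execution exposed."""
    return sum(plan_times) + exec_times[-1]

# Hypothetical 8-way parallel workload (seconds); placeholder values only.
plan_times = [0.2] * 8                                   # T_P(P_i)
exec_times = [0.6, 0.6, 0.6, 1.1, 0.6, 0.6, 0.6, 0.6]    # T_E(E_i), one straggler

t_r = react_latency(plan_times, exec_times)                # 6.9 s
t_c = compiler_latency(plan_times, exec_times)             # 2.7 s
t_sc = compiler_streaming_latency(plan_times, exec_times)  # 2.2 s
print(f"T_R={t_r:.1f}s  T_C={t_c:.1f}s  T_SC={t_sc:.1f}s  gamma={t_r / t_c:.2f}x")

Even with eight-way parallelism, $\gamma$ in this toy example is only about 2.6× because the serial planning time is not negligible, consistent with the lower bound in Eq. (6) below.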
To estimate the upper bound of this speedup, $\gamma_{\max}$, we assume that the executor latency $T_E(E_i)$ dominates the
planning latency $T_P(P_i)$ and that all task execution latencies are identical. Under this assumption, the upper bound is
calculated as:
\[ \gamma_{\max} \approx \frac{\sum_{i=1}^{N} T_E(E_i)}{\max_{k \in \{1, \ldots, N\}} T_E(E_k)} = N, \qquad (5) \]
indicating that the theoretical maximum speedup, $\gamma_{\max}$, equals the number of tasks, $N$.
On the other hand, the lower bound of the speedup, $\gamma_{\min}$, is reached when the planning latency is the predominant factor.
Given that the planning latencies of ReAct and LLMCompiler are generally similar, the minimum speedup is
approximated as:
\[ \gamma_{\min} \approx \frac{\sum_{i=1}^{N} T_P^R(P_i)}{\sum_{i=1}^{N} T_P^C(P_i)} \approx 1. \qquad (6) \]
From these observations, we can conclude that to achieve significant latency gains with LLMCompiler, it is crucial to (i)
reduce the planner overhead and (ii) minimize the occurrence of stragglers.
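As a rough back-of-the-envelope check using the Movie Recommendation figures quoted above, and assuming the Planner, the slowest search, and the final answering step do not overlap, the critical path alone bounds LLMCompiler's latency from below:

\[ T^C \;\gtrsim\; \underbrace{1.88}_{\text{Planner}} + \underbrace{1.13}_{\text{straggler}} + \underbrace{1.62}_{\text{answering}} \;\approx\; 4.63 \text{ s}, \]

which is in line with the roughly 5.5-second Movie Recommendation latency reported in Tab. C.1 and illustrates why the measured speedup remains well below the ideal $\gamma_{\max} = N$.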
Table E.2. Accuracy and latency comparison of LLMCompiler with ReAct on the HotpotQA bridge benchmark. ReAct† denotes
ReAct with additional prompting that minimizes looping and early stopping, similar to Tab. 1.
Method Accuracy (%) Latency (s)
ReAct 22.7 7.07
ReAct† 23.1 6.42
LLMCompiler 26.3 4.70
Table E.3. Qualitative comparison between LLMCompiler and other frameworks including ReAct (Yao et al., 2022), TPTU (SA for
Sequential Agent and OA for One-step Agent) (Ruan et al., 2023a), ViperGPT (Surís et al., 2023) and HuggingGPT (Shen et al., 2023).
Method Planning Replanning Parallel Execution Domain
ReAct X - X All
TPTU-SA X - X All
TPTU-OA O X X All
ViperGPT O X X Limited
HuggingGPT O X O Limited
LLMCompiler O O O All
Table F.4. Accuracy and latency speedup comparison of LLMCompiler with ReAct and TPTU (SA for Sequential Agent and OA
for One-step Agent) on the HotpotQA comparison benchmark using gpt-3.5-turbo. ReAct† and TPTU-SA† denote ReAct and TPTU-SA
with additional prompting that minimizes looping and early stopping, respectively, similar to Tab. 1.
Method Accuracy (%) Speedup
ReAct 61.52 -
ReAct† 62.47 1×
TPTU-SA 34.16 -
TPTU-SA† 44.59 1.09×
TPTU-OA 57.50 1.35×
LLMCompiler 62.00 1.51×
Problem Domains: ViperGPT and HuggingGPT target vision tasks via Python code generation and HuggingFace models,
respectively, showing significant promise in these specific areas. In contrast, LLMCompiler is designed as a general framework
that enables efficient and accurate function calling across a wide range of problem domains, rather than restricting itself to
specific fields.
1. search("Mission Impossible")
2. search("The Silence of the Lambs")
3. search("American Beauty")
4. search("Star Wars Episode IV - A New Hope")
5. search("Austin Powers International Man of Mystery")
6. search("Alesha Popvich and Tugarin the Dragon")
7. search("In Cold Blood")
8. search("Rosetta")
Thought: I can answer the question now.
9. finish()
###
In addition to user-provided functions, the Planner includes a special, hard-coded finish function. The Planner uses this
function either when the plan is sufficient to address the user query or when it can no longer proceed with planning before
executing the current plan, i.e., when it deems replanning necessary. When the Planner outputs the finish function, its
plan generation stops. Refer to Appendix G for examples of the Planner’s usage of the finish function in planning. The
definition of the finish function is given below; it is included in the Planner's prompt along with the definitions of the other
user-provided functions.
finish():
- Collects and combines results from prior actions.
- An LLM agent is called upon invoking join to either finalize the user
query or wait until the plans are executed.
- join should always be the last action in the plan, and will be called in
two scenarios:
(a) if the answer can be determined by gathering the outputs from tasks to
generate the final response.
(b) if the answer cannot be determined in the planning phase before you
execute the plans.
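As an illustration of how a plan containing this special function might be consumed, the sketch below is a hypothetical plan consumer, not the released implementation: the regular expression, the consume_plan helper, and the dispatch callback are all illustrative assumptions, and the argument parsing is deliberately naive (it does not handle commas inside quoted strings).

import re

# Matches numbered plan steps such as: 3. search("American Beauty")
STEP_RE = re.compile(r'^\s*(\d+)\.\s*(\w+)\((.*)\)\s*$')

def consume_plan(plan_lines, dispatch):
    """Hypothetical plan consumer: dispatches each tool call as it is parsed
    and stops once the hard-coded finish (a.k.a. join) action appears."""
    for line in plan_lines:
        match = STEP_RE.match(line)
        if match is None:
            continue                      # skip 'Thought:' lines and separators like ###
        idx, tool, raw_args = match.groups()
        if tool in ("finish", "join"):
            break                         # plan consumption ends here
        args = [a.strip().strip('"') for a in raw_args.split(",")] if raw_args else []
        dispatch(int(idx), tool, args)

# Example usage with a shortened version of the Movie Recommendation plan above.
if __name__ == "__main__":
    plan = [
        '1. search("Mission Impossible")',
        '2. search("The Silence of the Lambs")',
        'Thought: I can answer the question now.',
        '3. finish()',
    ]
    consume_plan(plan, lambda i, tool, args: print(i, tool, args))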
[Figure J.6 diagram: a Tree of Thoughts for the input 2, 4, 4, 7, with intermediate nodes such as 4*6=24 (left: 24).]
Figure J.6. Visualization of the Tree of Thoughts (ToT) in the Game of 24. Each node represents a distinct proposal, beginning with the
root node and branching out through the application of single operations by the thought proposer. Subsequent states are evaluated by the
state evaluator for their potential to reach the target number 24. The ToT retains the top-5 states according to their values.