E3. AI Agents

The document discusses the capabilities and limitations of large language models (LLMs) like GPT-4 in performing real-world tasks, highlighting the significant gap between benchmark performance and practical applications. It emphasizes the need for AI agents to possess critical capabilities such as tool use, abstract reasoning, and up-to-date knowledge to effectively automate tasks with minimal human intervention. The introduction of WebArena is proposed as a realistic environment for evaluating AI agents, addressing the challenges in current evaluation methods.


Solving Real-World Tasks with AI Agents
Shuyan Zhou
Language Technologies Institute
Carnegie Mellon University
[email protected]
shuyanzhou.com
LLMs are useful, people are optimistic about the future

[Figure: $1.3T projected revenue from generative AI in 2032 (Bloomberg 2023); density plot of quality of work with vs. without GPT-4 access (Dell'Acqua et al., 2023); "Sparks of Artificial General Intelligence: Early Experiments with GPT-4" (Bubeck et al., Microsoft Research, 2023), whose abstract notes that LLMs "exhibit remarkable capabilities across a variety of domains and tasks"]
LLMs can assist humans in many self-contained tasks

"Write a data loader to read this csv file" → def data_loader ...

LLMs:
• Speed up a small part of a task
• Do not automate the tasks in an end-to-end fashion
The dream of AI is far more wild

My research goal: automate various tasks with minimal human intervention

[Diagram: AI agents act on an environment and receive feedback]

Example applications: perform scientific research (literature review, experiments, reproduce results), develop software, personalized health and wellness, finance and growth management
Questions to answer

• How good are strong LLMs (e.g., GPT-4)? How can we perform reliable evaluation?
• What are the fundamental gaps between LLMs and AI agents?
• How could we mitigate the gaps?
Talk Overview

How good are LLMs? → Evaluating AI agents
• Zhou* et al., WebArena, ICLR 2024
• Wang, Cuenca, Zhou et al., MCoNaLa, F-EACL 2023
• Wang, Zhou et al., ODEX, F-EMNLP 2023

Natural language has inherent limitations → Speaking AI's "language"
• Zhou et al., PaP, SUKI 2022
• Zhou* et al., PaL, ICML 2023
• Madaan, Zhou et al., CoCoGen, EMNLP 2022
• Zhang, Xu, Yang, Zhou et al., Crepe, F-EACL 2023

LLMs know up to a cutoff date → Learning new knowledge by reading
• Zhou et al., DocPrompting, ICLR 2023
• Zhou* et al., Hierarchical Procedural KB, ACL 2022
Significant gap in benchmarks vs real-world applications

[Figure: task-solving rate on MiniWoB++ (Liu et al., 2018), with human performance at 96.3% and a bar for GPT-4 alongside]

Example task: "Play my favorite music"
Another example task: "Assign this issue to myself"
Requirements for the agent evaluation

• Realistic, interactive environment
• Useful & complex tasks
• Reliable evaluation
• Easy extendability

Existing evaluations make trade-offs between them

Zhou* et al., WebArena: A realistic web environment for building autonomous agents, ICLR 2024
WebArena fulfills all requirements without compromise

• Realistic, interactive environment with rich contents
• Useful & complex tasks
• Reliable evaluation
• Easy extendability

Example: "Invite Alexis to my agent repo" → agent: "Checking the members.." → "Alexis invited"

Zhou* et al., WebArena: A realistic web environment for building autonomous agents, ICLR 2024
Example task in WebArena

Shop owner: "Find the customer who has spent the most money in my store over the past 56 days. Send the customer some flowers." (customer appreciation task)

• Identify the customer by examining the order history in the store portal
• Buy some flowers online and send them to the customer

Outcome-based evaluation:
• A new order with flowers
• Shipped to Alex Martin

812 long-horizon, realistic computer tasks

Zhou* et al., WebArena: A realistic web environment for building autonomous agents, ICLR 2024
LLMs are the critical yet early step toward AI autonomy: they lack several critical capabilities to be AI agents

[Figure: WebArena task success rate (%). Human 78.2 vs. GPT-4 14.9, a huge gap; GPT-3.5, Mixtral, Gemini Pro, and Llama2 70B fall between 1.4 and 8.9. Open-source models struggle.]

Zhou* et al., WebArena: A realistic web environment for building autonomous agents, ICLR 2024
LLMs lack critical capabilities to be AI agents: tool use

AI agents:
• Employ tools to enhance accuracy and expand capabilities

LLMs:
• Tool use is scarce in natural language corpora
• Standard LLM development does not consider tool use

Example failures: "Alex's total spend is 78.56 x 7 + 46.7 = 543.6"; "56 days ago is 5/20/2023"

Zhou* et al., WebArena: A realistic web environment for building autonomous agents, ICLR 2024
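The arithmetic failure quoted on this slide is exactly the kind of step an agent can offload to a tool. A minimal sketch, using the slide's own numbers and delegating the computation to a Python interpreter (the correct result differs from the LLM's 543.6):

```python
# The slide's example arithmetic, delegated to an interpreter instead of the LLM.
# The quoted LLM answer was 543.6; the interpreter computes the true value.
total = 78.56 * 7 + 46.7
print(round(total, 2))  # 596.62
```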
LLMs lack critical capabilities to be AI agents: abstract reasoning

AI agents:
• Learn the common principles
• Maintain steady and reliable performance

LLMs:
• Inconsistent performance across conceptually similar tasks

Example: "Fork `metaseq`", "Fork `transformers`", "Fork all repos owned by Meta"

Zhou* et al., WebArena: A realistic web environment for building autonomous agents, ICLR 2024
LLMs lack critical capabilities to be AI agents

"Find the customer who spent [...] Send the customer [...]" → "How can I find all orders?"

Zhou* et al., WebArena: A realistic web environment for building autonomous agents, ICLR 2024
LLMs lack critical capabilities to be AI agents: up-to-date knowledge

AI agents:
• Up-to-date knowledge to deal with the evolving world ("How can I find all orders?")

LLMs:
• Knowledge of LLMs is limited by the training cutoff
• GPT-4 knowledge cutoff: Sep 2021; WebArena application version: Jan 2023

Zhou* et al., WebArena: A realistic web environment for building autonomous agents, ICLR 2024
Roadmap: tool use & abstract reasoning → speaking AI's "language"; up-to-date knowledge → learning by reading docs
Generating natural language for various tasks

"Alex Martin made three orders: $47.51 on 9/18/2023, $765.8 on 1/1/2024 and $35.4 on 1/9/2024. How much did he spend in my store in the last 56 days?"

"Today is 1/20/2024. I first subtract 20 days [...] The date 56 days ago is 12/20/2023 [...] Order 1 was placed on 9/18/2023, which is not within the last 56 days [...] 765.8 + 35.4 = $785.4" [Wei et al., Chain-of-thought]

Zhou et al., Procedures as programs: hierarchical control of situated agents through natural language, SUKI 2022
Natural language exhibits limitations in performing tasks

"Today is 1/20/2024, Alex made three orders: $47.51 on 9/18/2023, $765.8 on 1/1/2024, $35.4 on 1/9/2024. How much has he spent in the last 56 days?"

"Today is 1/20/2024. I first subtract 20 days [...] The date 56 days ago is 12/20/2024 [...] Order 1 was placed on 9/18/2023, which is not within the last 56 days [...] 765.8 + 35.4 = $785.4" [Wei et al., Chain-of-thought]

• Confines reasoning and solving within LLMs

Zhou et al., Procedures as programs: hierarchical control of situated agents through natural language, SUKI 2022
Natural language exhibits limitations in performing tasks

Change the question slightly: today is 2/13/2024 instead of 1/20/2024, and the window is 192 days instead of 56 days.

Original: "Today is 1/20/2024. I first subtract 20 days [...] The date 56 days ago is 12/20/2024 [...] Order 1 was placed on 9/18/2023, which is not within the last 56 days [...] 765.8 + 35.4 = $785.4"

Changed: "Today is 2/13/2024. I first subtract 13 days [...] The date 192 days ago is 8/5/2023 [...] Order 1 was placed on 9/18/2023, which is within the last 192 days [...] 47.51 + 765.8 + 35.4 …" [Wei et al., Chain-of-thought]

• Confines reasoning and solving within LLMs
• Expresses solutions at the example level

Zhou et al., Procedures as programs: hierarchical control of situated agents through natural language, SUKI 2022
Maybe AI agents should speak another "language", but what is that?
Solving various tasks by reasoning with programs (PaL)

"Today is 1/20/2024, Alex made three orders: $47.51 on 9/18/2023, $765.8 on 1/1/2024, $35.4 on 1/9/2024. How much has he spent in the last 56 days?"

Chain-of-thought [Wei et al.]:
[...]
The first order is $47.51
It was made on 9/18/2023
[...]
Now check if the first order was placed within the period
9/18/2023 is before the period, so it is not included
[...]
So the answer is $801.2

PaL:
[...]
order1_amount = 47.51
order_1_date = datetime(2023, 9, 18)
[...]
# check if order 1 is within the period
if order_1_date > start_date:
    alex_total_spend += order1_amount
[...]
>>> The total is $801.2

Zhou* et al., PaL: Program-aided language models, ICML 2023
Key design choices of PaL

"Today is 1/20/2024, Alex made three orders: $47.51 on 9/18/2023, $765.8 on 1/1/2024, $35.4 on 1/9/2024. How much has he spent in the last 56 days?"

Interleave between natural language and programming language (Python):

order1_amount = 47.51
order2_amount = 765.8
[...]
# check if order 1 is within 56 days
[...]

rather than bare code:

a = 47.51
b = 765.8
return float(a + b)

• Abundant [Chowdhery et al., PaLM; Mishra et al., Lila; Austin et al., Learning ...]
• Easily comprehensible

Zhou* et al., PaL: Program-aided language models, ICML 2023


Few-shot in-context learning with coding-proficient LLMs

In-context examples (Input 1 → Program 1, Input 2 → Program 2, ...):
• Manually create
• Select from a training set

Input: "Alex Martin made three orders: $47.51 on 9/18/2023, $765.8 on 1/1/2024 and $35.4 on 1/9/2024. How much did he spend in my store in the last 56 days?"

A coding-proficient LLM completes the program:
[...]
order1_amount = 47.51
order_1_date = ...
# check if [...]

Zhou* et al., PaL: Program-aided language models, ICML 2023
PaL offloads the solving to tools seamlessly

"Today is 1/20/2024 [...] How much has he spent in the last 56 days?"

from datetime import datetime, timedelta

today = datetime(2024, 1, 20)
# calculate 56 days ago
start_date = today - timedelta(days=56)
[...]
if order_1_date > start_date:
    [...]

alongside a condensed fragment without natural-language comments:

a = ..
b = ..
c = a - timedelta(days=56)

Task-solving accuracy (%) on date understanding (BIG-bench): CoT [Wei et al., 2022] 64.8; PaL 76.2; PaL w/ only PL 63.4
[Chowdhery et al., PaLM; Mishra et al., Lila; Austin et al., Learning ...]

Zhou* et al., PaL: Program-aided language models, ICML 2023
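The elided program above can be completed into a runnable script using the running example's dates and amounts (the list of orders and the variable names are illustrative):

```python
from datetime import datetime, timedelta

today = datetime(2024, 1, 20)
# calculate 56 days ago
start_date = today - timedelta(days=56)

# (date placed, amount) for each of Alex's orders from the running example
orders = [
    (datetime(2023, 9, 18), 47.51),
    (datetime(2024, 1, 1), 765.8),
    (datetime(2024, 1, 9), 35.4),
]

# keep only orders placed within the window, then sum
alex_total_spend = sum(amount for date, amount in orders if date > start_date)
print(round(alex_total_spend, 2))  # 801.2
```

The date arithmetic is handled by the `datetime` library rather than by the LLM, which is the offloading the slide describes.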
PaL > large language models + tools

"Alex made two orders within the last 56 days: one for $765.8 and another for $35.4. How much did he spend in total?"

CoT: "[...] the total of two orders is 765.8 + 35.8 [...]"

CoT + calculator [Schick et al., Toolformer]: "[...] the total of two orders is 765.8 + 35.8 <calculator(765.8+35.8)=801.6> 801.6 [...]"

PaL: order1_value = 765.8 [...]

Tool-augmented CoT still suffers from:
• Parsing failures
• Error propagation (the 35.8 transcription error above survives the calculator call)
• Limited toolset

Task-solving accuracy (%) on GSM8K: CoT [Wei et al., 2022] 63.1; PaL 72.0; CoT + Calculator 65.4

Zhou* et al., PaL: Program-aided language models, ICML 2023
Natural language performs example-level problem solving

"Today is 1/20/2024, Alex made three orders: $47.51 on 9/18/2023, $765.8 on 1/1/2024, $35.4 on 1/9/2024. How much has he spent in the last 56 days?"

Slight changes result in significant solution differences:

"Today is 1/20/2024. I first subtract 20 days [...] The date 56 days ago is 12/20/2024 [...] Order 1 was placed on 9/18/2023, which is not within the last 56 days [...] 765.8 + 35.4 = [...]"

"Today is 2/13/2024. I first subtract 13 days [...] The date 192 days ago is 8/5/2023 [...] Order 1 was placed on 9/18/2023, which is within the last 192 days [...] 47.51 + 765.8 + 35.4 …"

The solution is indirect.

Zhou* et al., PaL: Program-aided language models, ICML 2023
Programs encourage expressing "task templates"

today = datetime(2024, 1, 20)
start_date = today - timedelta(days=56)
[...]
if order_1_date > start_date:
    total += order_1_amount
[...]

today = datetime(2024, 2, 13)
start_date = today - timedelta(days=192)
[...]
if order_1_date > start_date:
    total += order_1_amount
[...]

Only the parameters change between the two variants; the PaL solution is direct.

Zhou* et al., PaL: Program-aided language models, ICML 2023
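The "task template" the two programs share can be made explicit as a parameterized function; only `today` and the window length differ between the variants (the helper name `total_spend` is mine, not from the slides):

```python
from datetime import datetime, timedelta

def total_spend(today, window_days, orders):
    """Sum the amounts of orders placed within the last `window_days` days."""
    start_date = today - timedelta(days=window_days)
    return sum(amount for date, amount in orders if date > start_date)

orders = [
    (datetime(2023, 9, 18), 47.51),
    (datetime(2024, 1, 1), 765.8),
    (datetime(2024, 1, 9), 35.4),
]

print(round(total_spend(datetime(2024, 1, 20), 56, orders), 2))   # 801.2
print(round(total_spend(datetime(2024, 2, 13), 192, orders), 2))  # 848.71
```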
Programs enhance LLMs in using in-context examples

Example tasks in Colored Objects:
• "What's the color of the right-most object?" (maintain an object attribute list)
• "What's the color of the object left to the goggles?" (spatial reasoning)

Datasets where different examples share common problem-solving strategies: Colored Objects, Penguins, Repeat Copy, Object Counting

Zhou* et al., PaL: Program-aided language models, ICML 2023
Programs enhance LLMs in using in-context examples

Task-solving accuracy (%), CoT vs. PaL:

Dataset           CoT    PaL
Colored Objects   86.3   95.1
Penguins          79.2   93.3
Repeat Copy       68.8   90.6
Object Counting   73.0   96.7

Datasets where different examples share common problem-solving strategies

Zhou* et al., PaL: Program-aided language models, ICML 2023


Bonus: Programs naturally encode structures

"Get Alex's total spend within 56 days"

Plan graph: identify the date 56 days ago → verify order 1's / 2's / 3's date → sum the qualified orders

class Graph:
    goal = "Get the total spend of Alex within 56 days"

    def __init__(self):
        identify_date_56_days_ago = Node()
        verify_order1_date = Node()
        [...]
        identify_date_56_days_ago.children = [
            verify_order1_date,
            verify_order2_date,
            verify_order3_date,
        ]

Generated by a coding-proficient model

Madaan, Zhou et al., Large language models of code are few-shot commonsense learners, EMNLP 2022
Zhang, Xu, Yang, Zhou et al., Causal Reasoning of Entities and Events in Procedural Texts, F-EACL 2023
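One way such a plan graph could be made executable is with a minimal `Node` carrying a name and children, plus a breadth-first walk to recover an execution order. The class shape and traversal are my assumptions; the slides only show fragments:

```python
from collections import deque

class Node:
    def __init__(self, name):
        self.name = name
        self.children = []

# the slide's plan as a small dependency graph
identify_date = Node("identify the date 56 days ago")
verifies = [Node(f"verify order {i}'s date") for i in (1, 2, 3)]
sum_orders = Node("sum the qualified orders")

identify_date.children = verifies
for v in verifies:
    v.children = [sum_orders]

# breadth-first traversal yields one valid execution order
order, queue, seen = [], deque([identify_date]), set()
while queue:
    node = queue.popleft()
    if id(node) in seen:
        continue
    seen.add(id(node))
    order.append(node.name)
    queue.extend(node.children)

print(order[0], "->", order[-1])
```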
Hypothesis 1: Corpus

• Pre-training corpora for code models contain procedural knowledge useful for these tasks, e.g., game engines

Code snippet taken from https://github.com/allenai/ScienceWorld/
Hypothesis 2: Training

class BakeACake:
    def __init__(self) -> None:
        self.find_recipe = Node()
        self.gather_ingredients = Node()
        self.mix_ingredients = Node()
        self.preheat_oven_at_375f = Node()
        self.put_cake_batter_into_oven = Node()
        self.take_cake_out_after_30_min = Node()

        self.find_recipe.children = [self.gather_ingredients, self.preheat_oven_at_375f]
        self.gather_ingredients.children = [self.mix_ingredients]
        self.mix_ingredients.children = [self.put_cake_batter_into_oven]
        self.preheat_oven_at_375f.children = [self.put_cake_batter_into_oven]
        self.put_cake_batter_into_oven.children = [self.take_cake_out_after_30_min]

Training on code makes the model better at procedures / long-range inference / connecting the dots

[Kim et al., 2023] Coding-proficient models show stronger performance on entity tracking
PaL brings a range of problems under one roof

Connecting PaL and follow-up work:
• Improve program generation quality: + multi-sample generation [Zhou et al., PaL], + more modularized planning [Jiang et al.], + execution feedback [Wang et al.; Sun et al.]
• Multi-modal tasks: + APIs for other modalities [Lu et al.; Stanic et al.]
• Sophisticated domain models: + finetune with program-aided solutions for specific domains (e.g., math) [Yue et al.; Xu et al.]
Takeaway: speak a general-purpose programming language with a coding-proficient model (addresses tool use and abstract reasoning)

Next: up-to-date knowledge → learning by reading docs
LLMs do not always have enough knowledge

"Find the customer who has spent the most money in my store over the past 56 days. Send the customer some flowers." → "How can I find all orders?"
Knowledge is limited by the training cutoff

"How can I find all orders?"

[Timeline: trained knowledge extends only to the knowledge cutoff; updated, new knowledge appears after it]
Humans adapt to new knowledge via reading

Direct demonstrations are not available for new knowledge
Study scenario: using new tools by reading tool docs

Bash commands (squeue, ls, ...): "List slurm jobs submitted by John"
Python APIs (mkdtemp, numpy, ...): "Make a temporary file to save the logs"

Zhou et al., DocPrompting: Generating code by retrieving the docs, ICLR 2023
DocPrompting: Retrieval-then-generation

Query: "View slurm jobs submitted by John"

Docs for new commands:
• squeue is used to view job … by Slurm
• -u <user_list>, --user=<user_list>: specify the usernames …
• -i <seconds>, -- …
• -j, <job_id_list> …

Retriever selects the relevant docs → Generator produces: squeue -u john

Zhou et al., DocPrompting: Generating code by retrieving the docs, ICLR 2023
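A toy stand-in for the retrieval step is ranking doc strings by token overlap with the query; DocPrompting trains a dense retriever instead, and the doc snippets here are abbreviated paraphrases of the slide's examples:

```python
docs = {
    "squeue": "squeue is used to view job information in the slurm scheduling queue "
              "-u --user specify the usernames for which to view jobs",
    "ls": "ls is used to list the information about files",
    "mkdtemp": "mkdtemp creates a temporary directory and returns its path",
}

def retrieve(query, k=1):
    """Rank docs by bag-of-words overlap with the query (stand-in for a trained retriever)."""
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda name: len(q & set(docs[name].split())), reverse=True)
    return ranked[:k]

print(retrieve("view slurm jobs submitted by john"))  # ['squeue']
```

The generator then conditions on the query plus the top-k retrieved docs.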
Contrastively training the doc retriever

We train the retriever in a contrastive fashion: the similarity score of a positive pair is maximized while that of in-batch negative pairs is minimized. For a positive pair $(n_i, d_i^+)$, each $(n_i, d_j^-)$ with $d_j^- \notin D_{n_i}^*$ forms a negative pair, and the loss is

$$\mathcal{L}_r = -\log \frac{\exp\big(\mathrm{sim}(\mathbf{h}_n, \mathbf{h}_{d_i^+})\big)}{\exp\big(\mathrm{sim}(\mathbf{h}_n, \mathbf{h}_{d_i^+})\big) + \sum_{d_j^- \in \mathcal{B} \setminus D_n^*} \exp\big(\mathrm{sim}(\mathbf{h}_n, \mathbf{h}_{d_j^-})\big)} \quad (3)$$

where $\mathrm{sim}(\cdot, \cdot)$ is cosine similarity.

Zhou et al., DocPrompting: Generating code by retrieving the docs, ICLR 2023
[Figure: contrastive training example. The query "View slurm jobs submitted by John" is paired with its positive doc ("squeue is used to view job … by Slurm") against an in-batch negative ("ls is used to list the information …"); dropout is applied when encoding, as in SimCSE [Gao et al.]]

Zhou et al., DocPrompting: Generating code by retrieving the docs, ICLR 2023
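Eq. (3)'s per-pair loss can be sketched in a few lines of pure Python: exponentiated cosine similarity of the positive pair, normalized by itself plus the in-batch negatives (no temperature term, matching the slide's formula):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retriever_loss(h_n, h_pos, h_negs):
    """-log( exp(sim(h_n, h_d+)) / (exp(sim(h_n, h_d+)) + sum_j exp(sim(h_n, h_dj-))) )"""
    pos = math.exp(cosine(h_n, h_pos))
    denom = pos + sum(math.exp(cosine(h_n, h)) for h in h_negs)
    return -math.log(pos / denom)

# identical positive, orthogonal negative: loss = -log(e / (e + 1)) ≈ 0.313
print(round(retriever_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]]), 3))  # 0.313
```

In training, the embeddings come from the retriever's encoder and the negatives are the other in-batch docs outside $D_n^*$.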
Retrieve the k nearest documents

Zhou et al., DocPrompting: Generating code by retrieving the docs, ICLR 2023
Learning to read the documents

Train the generator to maximize $\log p(c^* \mid n, \hat{d}_1, \hat{d}_2, \dots, \hat{d}_k)$: the reference code $c^*$ conditioned on the NL intent $n$ and the $k$ retrieved docs.

"View slurm jobs submitted by shuyanzh every 5 secs"

The retriever retrieves irrelevant information!
• squeue is used to view job … by slurm
• -u <user_list>, --user=<user_list>: specify the usernames …
• ls is used to list the information …. (irrelevant)

Generator: squeue -u john

The generator learns to ignore irrelevant information.

Zhou et al., DocPrompting: Generating code by retrieving the docs, ICLR 2023
DocPrompting is applicable to various model architectures

• Decoder-only: concatenate [NL, doc 1, doc 2, doc 3] as the prefix, then generate the code
• Encoder-decoder (fusion-in-decoder [FiD, Izacard and Grave]): encode each (NL + doc i) pair separately, concatenate the encodings, and decode the code

Zhou, Alon, Xu, Wang, Jiang, Neubig, DocPrompting, ICLR 2023
DocPrompting allows models to adapt to unseen tools without explicit demonstrations

[Figure: bash command exact match (%) on held-out commands, with the retriever drawing on docs for those commands. Bars for CodeT5 220M (supervised), CodeT5 + DocPrompting, OpenAI Codex 175B, and Codex + in-doc retrieval; values shown: 2.18, 9.15, 8.94, 22.55, with CodeT5 + DocPrompting the strongest despite its small size]

Zhou et al., DocPrompting: Generating code by retrieving the docs, ICLR 2023
DocPrompting allows models to adapt to unseen tools without explicit demonstrations

Execution-based evaluation for Python code generation (CoNaLa), pass@k (%), with docs for held-out Python APIs:

k                      1      10     50     100    200
CodeT5                 5.41   14.31  23.38  25.54  27.08
CodeT5 + DocPrompting  8.26   18.7   27.54  31.87  34.46

Zhou et al., DocPrompting: Generating code by retrieving the docs, ICLR 2023
Docs ease the mapping between NL and code

[Figure: n-gram matching recall (%) for n = 1, 2 between NL ↔ Code, NL ↔ Docs, and (NL + Docs) ↔ Code; bar values 39, 24, 12, 8, 2, 0, with recall of code tokens highest when docs are added to the NL]

Zhou et al., DocPrompting: Generating code by retrieving the docs, ICLR 2023
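The recall metric above can be reproduced in spirit with a few lines: the fraction of the reference code's n-grams that appear in the NL intent alone versus NL plus docs. The token strings here are illustrative, not the paper's data:

```python
def ngram_recall(reference, source, n):
    """Fraction of the reference's n-grams that also appear in the source."""
    grams = lambda toks: {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    ref, src = grams(reference), grams(source)
    return len(ref & src) / len(ref) if ref else 0.0

code = "squeue -u john".split()
nl = "view slurm jobs submitted by john".split()
doc = "squeue is used to view jobs -u --user specify the usernames".split()

print(ngram_recall(code, nl, 1))        # 0.33...: only "john" overlaps
print(ngram_recall(code, nl + doc, 1))  # 1.0: the docs supply "squeue" and "-u"
```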
Up-to-date knowledge → learning by reading docs

• What: docs created by humans that explain the tool usage
• How: retrieval and doc-augmented generation

Human-written docs as learning resources:
• Theorem proving [Wu et al., LeanDojo]
• Proprietary code libraries [Zan et al., When]
• API use in products

+ Code document generation [Zhou et al., Generating Code Explanations with Controllability on Purpose]