E3. AI Agents
Shuyan Zhou
Language Technologies Institute
Carnegie Mellon University
[email protected]
shuyanzhou.com
LLMs are useful, people are optimistic about the future

• $1.3T projected revenue from generative AI by 2032 [Bloomberg, 2023]
• Consultants working with GPT-4 produced higher-quality work than those without [Dell'Acqua et al., 2023]
LLMs can assist humans in many self-contained tasks

def data_loader
    …

• Speed up a small part of a task
• Do not automate tasks in an end-to-end fashion
The dream of AI is far more wild

My research goal: automate various tasks with minimal human intervention

AI agents act and receive feedback to:
• Perform scientific research: reproduce results, run experiments
• Develop software
• Provide personalized health and wellness support
Talk Overview

How good are LLMs? → Evaluating AI agents
- Zhou* et al., WebArena, ICLR 2024
- Wang, Cuenca, Zhou et al., MCoNaLa, F-EACL 2023
- Wang, Zhou et al., ODEX, F-EMNLP 2023

Natural language has inherent limitations → Speaking AI's "language"
- Zhou et al., PaP, SUKI 2022
- Zhou* et al., PaL, ICML 2023
- Madaan, Zhou et al., CoCoGen, EMNLP 2022
- Zhang, Xu, Yang, Zhou et al., Crepe, F-EACL 2023

LLMs know up to a cutoff date → Learning new knowledge by reading
- Zhou et al., DocPrompting, ICLR 2023
- Zhou* et al., Hierarchical Procedural KB, ACL 2022
Significant gap in benchmarks vs. real-world applications

Task-solving rate on MiniWoB++: 96.3%, at human level
Significant gap in benchmarks vs. real-world applications

On MiniWoB++, models up to GPT-4 reach a 96.3% task-solving rate, at human level. A more realistic benchmark must still provide reliable evaluation and easy extendability.

Zhou* et al., WebArena: A realistic web environment for building autonomous agents, ICLR 2024
WebArena fulfills all requirements without compromise

• Reliable evaluation (e.g., checking the members: "Alexis invited")
• Easy extendability

Zhou* et al., WebArena: A realistic web environment for building autonomous agents, ICLR 2024
Example task in WebArena

Shop owner: "Find the customer who has spent the most money in my store over the past 56 days. Send the customer some flowers." (customer appreciation task)

Outcome-based evaluation:
• A new order with flowers exists
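Outcome-based evaluation can be sketched as a check on the environment's final state rather than on the agent's action sequence. A minimal illustration; the `Order` type and checker below are hypothetical stand-ins, not WebArena's actual evaluation API:

```python
from dataclasses import dataclass, field

@dataclass
class Order:
    customer: str
    items: list = field(default_factory=list)

def check_flower_order(orders, expected_customer):
    """Succeed iff some order for the expected customer contains flowers."""
    return any(
        o.customer == expected_customer and "flowers" in o.items
        for o in orders
    )

# Final store state after the agent acts (toy data):
orders = [Order("Alex Martin", ["flowers"])]
print(check_flower_order(orders, "Alex Martin"))  # True
```

Any action sequence that produces the required final state passes, which keeps the evaluation reliable without constraining how the agent gets there.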
Huge gap!

Task success rate (%): GPT-4 14.9, GPT-3.5 8.9, Gemini Pro 7.1, with Llama2 70B and Mixtral at 1.4 and 1.7; humans solve far more. Open-source models struggle.

Zhou* et al., WebArena: A realistic web environment for building autonomous agents, ICLR 2024
LLMs lack critical capabilities to be AI agents

Tool use
AI agents:
• Employ tools to enhance accuracy and expand capabilities (e.g., computing Alex's total spend, 78.56 × 7 + 46.7, or that 56 days ago is 5/20/2023)
LLMs:
• Tool use is scarce in natural-language corpora
• Standard LLM development does not consider tool use

Zhou* et al., WebArena: A realistic web environment for building autonomous agents, ICLR 2024
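Date arithmetic like "56 days ago" is a natural candidate for tool use: instead of computing it in text, an agent can call the standard library. A minimal sketch; the anchor date is our assumption, chosen to reproduce the slide's 5/20/2023:

```python
from datetime import date, timedelta

today = date(2023, 7, 15)  # hypothetical "today"
cutoff = today - timedelta(days=56)
print(cutoff.isoformat())  # 2023-05-20
```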
LLMs lack critical capabilities to be AI agents

Abstract reasoning
AI agents:
• Learn the common principles behind tasks
• Maintain steady and reliable performance (e.g., fork `metaseq`, fork `transformers`, fork all repos owned by Meta)
LLMs:
• Inconsistent performance across conceptually similar tasks

Zhou* et al., WebArena: A realistic web environment for building autonomous agents, ICLR 2024
LLMs lack critical capabilities to be AI agents

Up-to-date knowledge
AI agents:
• Need up-to-date knowledge to deal with the evolving world (e.g., "How can I find all orders?")
LLMs:
• Knowledge is limited by the training cutoff
  (GPT-4 knowledge cutoff: Sep 2021; WebArena application version: Jan 2023)

Zhou* et al., WebArena: A realistic web environment for building autonomous agents, ICLR 2024
Tool use
Up-to-date knowledge
Abstract reasoning
Generating natural language for various tasks

Alex Martin made three orders: $47.51 on 9/18/2023, $765.8 on 1/1/2024, and $35.4 on 1/9/2024. How much has he spent in my store in the last 56 days?

Zhou et al., Procedures as programs: hierarchical control of situated agents through natural language, SUKI 2022
Natural language exhibits limitations in performing tasks

Today is 1/20/2024. Alex made three orders: $47.51 on 9/18/2023, $765.8 on 1/1/2024, and $35.4 on 1/9/2024. How much has he spent in the last 56 days?

Zhou et al., Procedures as programs: hierarchical control of situated agents through natural language, SUKI 2022
Natural language exhibits limitations in performing tasks

Today is 2/13/2024 (was 1/20/2024). Alex made three orders: $47.51 on 9/18/2023, $765.8 on 1/1/2024, and $35.4 on 1/9/2024. How much has he spent in the last 192 days (was 56 days)? Changing the date or the window invalidates the entire written-out reasoning.

Natural-language solutions:
• Confine reasoning and solving within LLMs
• Express solutions at the example level

Zhou et al., Procedures as programs: hierarchical control of situated agents through natural language, SUKI 2022
Maybe AI agents should speak another
“language”, but what is that?
Solving various tasks by reasoning with programs (PaL)

Today is 1/20/2024. Alex made three orders: $47.51 on 9/18/2023, $765.8 on 1/1/2024, and $35.4 on 1/9/2024. How much has he spent in the last 56 days?

Chain-of-thought [Wei et al.]:
[...] The first order is $47.51. It was made on 9/18/2023. [...] Now check if the first order was placed within the period. 9/18/2023 is before the period, so it is not included. [...] So the answer is $801.2.

PaL:
[...]
order1_amount = 47.51
order1_date = datetime(2023, 9, 18)
[...]
# check if order 1 is within the period
if order1_date > start_date:
    alex_total_spend += order1_amount
[...]
>>> The total is $801.2

Zhou* et al., PaL: Program-aided language models, ICML 2023
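Filling in the lines the slide elides, the PaL-style program for this example can be run end to end (variable names beyond those shown on the slide are our own):

```python
from datetime import datetime, timedelta

today = datetime(2024, 1, 20)
start_date = today - timedelta(days=56)

orders = [
    (47.51, datetime(2023, 9, 18)),
    (765.8, datetime(2024, 1, 1)),
    (35.4, datetime(2024, 1, 9)),
]

alex_total_spend = 0.0
for amount, order_date in orders:
    # count only orders placed within the last 56 days
    if order_date > start_date:
        alex_total_spend += amount

print(f"The total is ${round(alex_total_spend, 2)}")  # The total is $801.2
```

The interpreter, not the LLM, performs the date comparisons and arithmetic, which is exactly where natural-language chain-of-thought tends to slip.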
Key design choices of PaL

Today is 1/20/2024. Alex made three orders: $47.51 on 9/18/2023, $765.8 on 1/1/2024, and $35.4 on 1/9/2024. How much has he spent in the last 56 days?

order1_value = 765.8
[...]

Direct vs. indirect grounding of question content in program variables

Zhou* et al., PaL: Program-aided language models, ICML 2023
Programs enhance LLMs in using in-context examples

• Maintain an object attribute list
• Spatial reasoning (e.g., "What's the color of the rightmost object?", "What's the color of the object to the left of the goggles?")

Task-solving accuracy (%) is reported on datasets where different examples share common problem-solving strategies: Colored Objects, Penguins, Repeat Copy, and Object Counting.

Zhou* et al., PaL: Program-aided language models, ICML 2023
Programs enhance LLMs in using in-context examples

PaL outperforms CoT on Colored Objects, Penguins, Repeat Copy, and Object Counting (up to 96.7% accuracy).

Madaan, Zhou et al., Large language models of code are few-shot commonsense learners, EMNLP 2022
Zhang, Xu, Yang, Zhou et al., Causal Reasoning of Entities and Events in Procedural Texts, F-EACL 2023
Hypothesis 1: Corpus

• Pre-training corpora for code models contain procedural knowledge useful for these tasks, e.g., game engines
• Coding-proficient models show stronger performance on entity tracking [Kim et al., 2023]
PaL brings a range of problems under one roof

Connecting PaL and follow-up work to improve program generation quality:
• + Multi-sample generation [Zhou et al., PaL]
• + More modularized planning [PaL, Jiang et al.]
• + Execution feedback [Wang et al., Sun et al.]
LLMs do not always have enough knowledge
Knowledge is limited by the training cutoff

Timeline: knowledge acquired during training extends only up to the knowledge cutoff; updated, new knowledge appears afterward.
Humans adapt to new knowledge via reading, without direct demonstrations
Study scenario: using new tools by reading tool docs

Python APIs (e.g., numpy, mkdtemp): "Make a temporary file to save the logs"

Zhou et al., DocPrompting: Generating code by retrieving the docs, ICLR 2023
DocPrompting: retrieval-then-generation

"View slurm jobs submitted by John" → Retriever → docs for new commands (-i <seconds>, --…; -j <job_id_list>, …) → Generator

Zhou et al., DocPrompting: Generating code by retrieving the docs, ICLR 2023
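The pipeline can be sketched in two stages. The toy retriever and generator below are stand-ins for DocPrompting's learned components, and the flags and doc strings are illustrative, not the real squeue manual:

```python
# Hypothetical pool of docs for squeue flags (illustrative text).
DOCS = {
    "-u <user_list>": "Request jobs submitted by the listed users.",
    "-j <job_id_list>": "Request jobs with the listed job IDs.",
    "-i <seconds>": "Repeatedly report status at the given interval.",
}

def retrieve(query, k=2):
    """Toy lexical retriever: rank docs by word overlap with the query."""
    q_words = set(query.lower().split())
    def score(flag):
        return len(q_words & set(DOCS[flag].lower().split()))
    return sorted(DOCS, key=score, reverse=True)[:k]

def generate(query, flags):
    # A real generator conditions an LM on the query plus retrieved docs;
    # here we simply assemble a command from the retrieved flags.
    return "squeue " + " ".join(flags)

query = "View slurm jobs submitted by John"
print(generate(query, retrieve(query)))  # squeue -u <user_list> -j <job_id_list>
```

Retrieval narrows the doc pool to what is relevant; generation then maps the query plus docs to code.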
Contrastively training the doc retriever

We train the retriever in a contrastive fashion: the similarity score of a positive pair is maximized while that of in-batch negative pairs is minimized. Each $(n_i, d_i^+)$ forms a positive pair, while $n_i$ and each $d_j^- \notin D_{n_i}^*$ form a negative pair. For a pair $(n_i, d_i^+)$, the loss function is defined as:

$$\mathcal{L}_r = -\log \frac{\exp\big(\mathrm{sim}(h_n, h_{d_i^+})\big)}{\exp\big(\mathrm{sim}(h_n, h_{d_i^+})\big) + \sum_{d_j^- \in \mathcal{B} \setminus D_n^*} \exp\big(\mathrm{sim}(h_n, h_{d_j^-})\big)} \tag{3}$$

where $\mathrm{sim}$ is cosine similarity and $\mathcal{B}$ is the batch.

Zhou et al., DocPrompting: Generating code by retrieving the docs, ICLR 2023
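Numerically, Eq. (3) is the standard in-batch softmax contrastive loss. A small self-contained check, with plain Python vectors in place of real encoder embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(h_n, h_pos, h_negs):
    """-log( exp(sim(n, d+)) / (exp(sim(n, d+)) + sum_j exp(sim(n, d_j-))) )"""
    pos = math.exp(cosine(h_n, h_pos))
    neg = sum(math.exp(cosine(h_n, h)) for h in h_negs)
    return -math.log(pos / (pos + neg))

h_n = [1.0, 0.0]
loss_aligned = contrastive_loss(h_n, [1.0, 0.0], [[0.0, 1.0]])
loss_misaligned = contrastive_loss(h_n, [0.0, 1.0], [[1.0, 0.0]])
print(loss_aligned < loss_misaligned)  # True
```

The loss is lower when the query embedding is close to its positive doc and far from the in-batch negatives, which is the configuration training pushes toward.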
Positive pairs come from queries and their relevant docs (e.g., "squeue is used to view job … by Slurm."); negatives are the other docs in the batch, with dropout used for augmentation as in [SimCSE, Gao et al.].

Zhou et al., DocPrompting: Generating code by retrieving the docs, ICLR 2023
Retrieve the k nearest documents with the trained retriever

Zhou et al., DocPrompting: Generating code by retrieving the docs, ICLR 2023
Learning to read the documents

Train the generator to maximize log p(c* | n, d_1, …, d_k): generate the target code c* conditioned on the query n and the retrieved docs.

"View slurm jobs submitted by shuyanzh every 5 secs"
The retriever retrieves irrelevant information! The generator must learn to read selectively (e.g., -u <user_list>, —user=<..: specify the usernames …).

Zhou et al., DocPrompting: Generating code by retrieving the docs, ICLR 2023
DocPrompting is applicable to various model architectures, e.g., fusion-in-decoder: encode each (NL + doc i) pair separately and fuse them in the decoder.

Zhou, Alon, Xu, Wang, Jiang, Neubig, DocPrompting, ICLR 2023 [FiD, Izacard and Grave]
DocPrompting allows models to adapt to unseen tools without explicit demonstrations

Bash command exact match (%) on docs for held-out commands: CodeT5 (220M) + DocPrompting reaches 22.55, compared with 9.15, 8.94, and 2.18 for the baselines (supervised CodeT5, OpenAI Codex 175B, and in-doc retrieval).

Zhou et al., DocPrompting: Generating code by retrieving the docs, ICLR 2023
DocPrompting allows models to adapt to unseen tools without explicit demonstrations

Execution-based evaluation for Python code generation (CoNaLa) on held-out Python APIs, pass@k (%):

k       CodeT5   CodeT5 + DocPrompting
1        5.41     8.26
10      14.31    18.70
50      23.38    27.54
100     25.54    31.87
200     27.08    34.46

Zhou et al., DocPrompting: Generating code by retrieving the docs, ICLR 2023
Docs ease the mapping between NL and code

N-gram matching recall (%) among NL, docs, and code: code tokens are recovered far more often through the docs (39% for 1-grams, 24% for 2-grams) than from the NL intent directly (8% and 2%).

Zhou et al., DocPrompting: Generating code by retrieving the docs, ICLR 2023
Recap: Evaluating AI agents; Speaking AI's "language"; Learning up-to-date knowledge by reading.

Human-written docs that explain tool usage serve as learning resources:
• Theorem proving [Wu et al., LeanDojo]
• Proprietary code libraries [Zan et al., When]
• API use in products