Agent Work Flows
Workflows
CS294/194-196 Large Language Model Agents
Lecture 7 — October 21st, 2024
Nicolas Chapados (ServiceNow Research)
Alexandre Drouin (ServiceNow Research)
Who are we?
AGENDA
Architecture
API Agents TapeAgents
Resolution
Generation skill
What About?
• Scheduling tweets
• Sorting email
• Updating CRM
• Filling out time sheet
• Arranging 15-person meeting across 4 organizations
© 2024 ServiceNow, Inc. All Rights Reserved.
Demo: Directions to GTC (original video is 4x longer
with long pauses)
Defining Agents
Background Enterprise workflow concepts
AGENDA
Architecture
API Agents TapeAgents
[Diagram: an LLM Agent acting in an Environment. Labeled components: Tools (WebSearch(), … more …) and Planning (Subgoal Decomposition, Self-critique).]
GREADTH (CONVERSATIONAL) AGENTS
GROUNDED RESPONSIVE ACCURATE DISCIPLINED TRANSPARENT HELPFUL
• Task: conversational assistant that routes the user to the right form and helps fill it
• Constraints: 5-star conversational experience at low compute cost
• 3 training domains: FlyCorp, BigBankCorp, CoffeeCorp
• 3 testing domains: DriveCorp, LuxuryCorp, ShopCorp
• Metric: GREADTH
– Grounded, Responsive, Accurate, Disciplined, Transparent, Helpful
• Method:
– Generate synthetic tapes with 19 user agents and a 5-node LLAMA-405B Teacher
– Finetune 1-node LLAMA-8B Student
• Outcome: student matches GPT-4o performance at 300x lower cost
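One way to turn the six GREADTH criteria into numbers, purely as an illustrative sketch. The slide does not say how the metric is computed, so the binary per-criterion judgments and the unweighted mean are assumptions, and `GreadthJudgment` and `greadth_score` are hypothetical names.

```python
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass(frozen=True)
class GreadthJudgment:
    """Hypothetical pass/fail verdict for one agent response."""
    grounded: bool
    responsive: bool
    accurate: bool
    disciplined: bool
    transparent: bool
    helpful: bool


def greadth_score(judgments: List[GreadthJudgment]) -> Dict[str, float]:
    """Return the pass rate per criterion plus an unweighted overall mean."""
    if not judgments:
        raise ValueError("need at least one judgment")
    names = [f.name for f in fields(GreadthJudgment)]
    # Pass rate of each criterion across all judged responses.
    rates = {n: sum(getattr(j, n) for j in judgments) / len(judgments) for n in names}
    # Assumption: all six criteria weigh the same.
    rates["overall"] = sum(rates[n] for n in names) / len(names)
    return rates
```

Under these assumptions, a teacher and a student model could be compared by scoring the same set of conversations with each.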
AGENDA
Architecture
API Agents TapeAgents
User: Book me a world-wide tour for my rock band next July and August
Web Agent: receives Observations, emits Actions
Involves
• Task / goal understanding (natural language)
• Situational awareness
• Long-term planning
• Detailed action execution
Prompt
• Task Description
• Web Page as text
• Action Space
Task: Fly me to Yellowstone for the next long weekend
Action space:
# Click an element
click(backend_id: str)
Answer Format:
<action>
Your actions
</action>
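The prompt layout above can be sketched as a pair of helpers: one that assembles the task description, the page text, and the action space, and one that pulls the agent's answer out of the `<action>` tags. This is a minimal sketch, not the authors' implementation; `build_prompt`, `parse_action`, and the section labels in the prompt string are hypothetical.

```python
import re
from typing import Optional

# Action space exactly as listed on the slide.
ACTION_SPACE = "# Click an element\nclick(backend_id: str)"


def build_prompt(task: str, page_text: str) -> str:
    """Assemble the three prompt parts named on the slide, plus the answer format."""
    return (
        f"Task Description:\n{task}\n\n"
        f"Web Page as text:\n{page_text}\n\n"
        f"Action space:\n{ACTION_SPACE}\n\n"
        "Answer Format:\n<action>\nYour actions\n</action>"
    )


def parse_action(reply: str) -> Optional[str]:
    """Return the text inside the first <action>...</action> pair, or None."""
    match = re.search(r"<action>(.*?)</action>", reply, flags=re.DOTALL)
    return match.group(1).strip() if match else None
```

Returning `None` when the tags are missing lets the caller re-prompt the model instead of executing a malformed reply.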
(a) WorkArena
Figure 1: Overview of contributions: (a) WorkArena is a benchmark of 33 web tasks and 19,950 unique instances that cover common ways of interacting with the ServiceNow Platform, a widely-used enterprise software platform.
pip install browsergym-workarena
WorkArena: Web Agents for Common Knowledge Work Tasks
An open-source benchmark of ~600 work-related tasks built on the ServiceNow platform
[Figure: BrowserGym interaction loop. The Web Agent receives an Observation from the Browser and returns an Action.]
• Observation: AXTree, HTML, Screenshot, Chat (e.g. "User: show me flights from MTL to NYC")
• Action, Python (unsafe): page.get_by_label("Search").click()
• Action, coord-based: mouse_click(x=47.2, y=152.6)
• Action, bid-based: click(bid="103")
WorkArena task categories shown: 1 Knowledge base, 2 Dashboard, 3 Service Catalog
Example: The agent is assigned a ticket and instruction: “Please solve this.”
(a) WorkArena (b) BrowserGym
Figure 1 (continued): (b) BrowserGym is a Python environment for designing and evaluating web agents, which includes a rich set of actions and multimodal observations (shown here the HTML contents of the page, its accessibility tree, and the raw pixels after browser rendering).
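The BrowserGym figure names three action styles: raw Python (marked unsafe), coord-based, and bid-based. The sketch below shows how the last two could be dispatched without calling `exec()` on the model's output. It is an illustration only: `MockPage`, `Element`, and `run_action` are hypothetical and are not part of BrowserGym or Playwright.

```python
import ast
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Element:
    bid: str
    box: Tuple[float, float, float, float]  # x0, y0, x1, y1 in pixels
    clicks: int = 0


class MockPage:
    """Stand-in for a browser page (hypothetical, for illustration)."""

    def __init__(self, elements):
        self.elements = {e.bid: e for e in elements}

    def click(self, bid: str) -> None:
        # bid-based: the element is addressed by its identifier.
        self.elements[bid].clicks += 1

    def mouse_click(self, x: float, y: float) -> None:
        # coord-based: the element is found from pixel coordinates.
        for e in self.elements.values():
            x0, y0, x1, y1 = e.box
            if x0 <= x <= x1 and y0 <= y <= y1:
                e.clicks += 1
                return
        raise LookupError(f"no element at ({x}, {y})")


def run_action(page: MockPage, action: str) -> None:
    """Parse one call such as click(bid="103") and dispatch it without exec()."""
    call = ast.parse(action.strip(), mode="eval").body
    if not (isinstance(call, ast.Call) and isinstance(call.func, ast.Name)):
        raise ValueError("expected a single function call")
    if call.func.id not in {"click", "mouse_click"}:
        raise ValueError(f"unknown action: {call.func.id}")
    if call.args:
        raise ValueError("use keyword arguments only")
    # literal_eval only accepts literals, so arguments cannot run code.
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    getattr(page, call.func.id)(**kwargs)
```

Restricting the agent to a whitelist of calls is one reason a bid-based or coord-based action space is easier to sandbox than free-form Python.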
WorkArena++: Towards Realistic Enterprise Workflows
Overview of tasks
Some tasks are purposely infeasible. The agent must detect this.
Solve a series of enterprise decision-making problems:
- Workload balancing
- Scheduling with constraints
- Assigning work to experts
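To make the task families above concrete, here is a toy version of "assigning work to experts" with workload balancing and an explicit infeasible outcome. In the benchmark the agent works through the web UI rather than calling a function, so this shows only the underlying decision problem; `assign_tickets` and its data layout are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def assign_tickets(
    tickets: List[Tuple[str, str]],
    experts: Dict[str, Set[str]],
) -> Optional[Dict[str, str]]:
    """Greedy assignment: each ticket goes to the least-loaded qualified expert.

    tickets: list of (ticket_id, required_skill)
    experts: expert name -> set of skills
    Returns ticket_id -> expert name, or None when some ticket has no
    qualified expert (the infeasible case the agent must report).
    """
    load = {name: 0 for name in experts}
    plan: Dict[str, str] = {}
    for ticket_id, skill in tickets:
        qualified = [n for n, skills in experts.items() if skill in skills]
        if not qualified:
            return None  # infeasible: nobody can take this ticket
        # Break load ties by name so the result is deterministic.
        chosen = min(qualified, key=lambda n: (load[n], n))
        plan[ticket_id] = chosen
        load[chosen] += 1
    return plan
```

Returning `None` rather than a partial plan mirrors the requirement that the agent detect an infeasible task instead of forcing an answer.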
Realistic Workflows
• WebShop (Yao, Chen et al., 2022): 12,087 tasks
• WebArena (Zhou et al., 2023): 812 tasks
• VisualWebArena (Koh et al., 2024): 910 tasks
Call for unification: get everyone under the same roof for a great Meta-Benchmark
Figure 1: Overview of contributions: (a) WorkArena is a benchmark of 29 web tasks and 18,050 unique instances that cover common
pip install browsergym
Human evaluation
for any benchmark!
> API-based LLMs change silently
> Python packages evolve
> Experimental journal uploaded to public repo: date, versions, agent configuration, traces, etc.
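A minimal sketch of what one entry of such an experimental journal could contain, using only the Python standard library. The field names and the `journal_entry` and `save_journal` helpers are assumptions for illustration; the slide lists only the kinds of information recorded (date, versions, agent configuration, traces).

```python
import json
import platform
from datetime import datetime, timezone
from importlib import metadata
from typing import Dict, List


def journal_entry(agent_config: Dict, packages: List[str], traces: List[Dict]) -> Dict:
    """Collect date, versions, agent configuration, and traces for one run."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "date": datetime.now(timezone.utc).isoformat(),
        "python": platform.python_version(),
        "versions": versions,
        "agent_config": agent_config,
        "traces": traces,
    }


def save_journal(entry: Dict, path: str) -> None:
    """Write the entry as JSON so it can be committed to a public repo."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(entry, f, indent=2)
```

Pinning the installed package versions alongside the date helps explain a score change when either the packages or a hosted model moved between runs.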
Opportunity
• Long-term planning
• Multimodality
AGENDA
Architecture
API Agents TapeAgents
Resolution
Generation skill
[Diagram: a Human Agent working with AI Agents (User Access AI Agents, KB AI Agent) to resolve an issue]
• “Who can I assign this issue to?”
• “Generate a plan based on relevant KB and similar incidents”
• “What access Jon has?”
• “Assign Jon role ‘Knowledge’”
• “Document resolution steps”
Web Agents to address Low Value / Low Volume tasks
[Chart. Vertical axis: development & maintenance, from EXPENSIVE to CHEAP. Horizontal axis: agent ability, generalizability, & adaptability, from IDIOT to CAPABLE. Plotted: Traditional development, RPA, Web Agents (late 2024), General-purpose Agents (future). Highlighted: Low Value / Low Volume region for individual tasks (ability to address).]
What is Knowledge Work?
WorkArena can help us understand the future of Knowledge Work
[Background word cloud of knowledge-work activities, partly cropped at the slide edge. Legible entries: Researching Online, Data Analysis, Email Communication, Writing Reports, Project Planning, Presentation Creation, Graphic Design, Website Management, Social Media Management, Video Editing, Programming, Strategic Planning, Document Review and Editing, Online Collaboration, Customer Relationship Management (CRM), Financial Planning, E-learning Development, Business Intelligence (BI), Voice Over Production, Accessibility Testing, Database Management, Technical Support, Legal Research, Cybersecurity Monitoring, Human Resources Tasks, Blogging and Content Creation, Market Analysis, Inventory Management, Digital Asset Management, Real Estate Analysis, Meeting Scheduling and Coordination, Task and Workflow Automation, Cloud Computing Management, Knowledge Management, 3D Modeling and CAD, Podcast Production, Risk Management, Health Informatics.]
O*NET: Cataloging the Workforce
Source: Bureau of Labor Statistics Standard Occupational Classification and O*NET
Technology adoption takes time and
uncertainty for generative AI adoption remains high
[Chart: adoption curves for an Early scenario (including GenAI) and a Late scenario (including GenAI)]
Adoption Drivers
• Technological maturity
• Integration speed
• Relative cost of technology vs labor
• Technology diffusion rate
• Supply constraints (e.g. GPUs, regulatory)
Source: The economic potential of generative AI: The next productivity frontier (McKinsey)
Assessing Impact: Top-Down vs Bottom-Up
Centaur
• Strategic separation between “human tasks” and “AI tasks”
• From human intuition, AI can:
– Map problem domain
– Gather information
– Handle data analysis
– Refine human content
Cyborg
• Task-level collaboration, where the human can ask the AI to:
– Assume a certain persona
– Learn a task from examples
– Give a logical explanation
– Provide a deep dive
– Respond to contradictions and push-back
Source: Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality (HBS, BCG)
Defining Agents
Background Enterprise workflow concepts
AGENDA
Architecture
API Agents TapeAgents
Frameworks & Benchmarks
LIBRARIES / FRAMEWORKS (2024)
[Four columns, one per framework. The framework logos did not survive extraction; names in parentheses are inferred from the bullet text.]
Column 1 (LangChain, judging by the comparisons in columns 2 and 3)
• Enables chaining multiple LLM calls for multi-step workflows.
• Various tools like APIs, databases, and external data sources.
• Memory mgmt, allowing context retention across multiple interactions.
Column 2 (AutoGPT, judging by the comparison in column 3)
• Automates tasks with autonomous agents.
• Uses a feedback loop to refine outputs based on goals and constraints.
• Unlike LangChain, emphasizes autonomous decision-making over structured workflow chaining.
Column 3 (AutoGen)
• Multi-agent framework for building workflows with AI agents.
• AutoGen agents can work together, integrating LLMs, tools, and human inputs.
• Unlike LangChain and AutoGPT, emphasizes multi-agent interaction and human-AI collab
Column 4 (name not captured)
• Collaborative agent teams with specific roles and goals.
• Sequential and hierarchical processes.
• Versatile tools with error handling and caching capabilities.
• Allows human oversight & interaction
Research Milestones (through 2024)
World of Bits (WoB)
• First widely available web benchmark
• Simplified tasks
• 100 tasks
• Can be solved by RL
WebGPT (OpenAI)
• Fine-tuned GPT-3 for QA with web browsing
• Evaluated on “Explain Like I’m 5” Reddit Qs + TruthfulQA dataset
Mind2Web (Ohio)
• Benchmark of realistic web tasks from NL
• Interaction traces
• 2,350 tasks from 137 websites, 31 domains
WebAgent (Google)
• Combine 2 LLMs to simplify huge HTML, plan solution, create code talking to web browser; no pixels
• MiniWoB & Mind2Web
WebVoyager (Ten¢)
• Completes tasks on real websites using textual+visual inputs
• New benchmark: 15 websites, automatic GPT-4V-based eval.
(entry name cropped in extraction)
• Control computers w/ keyboard & mouse from NL instructions
• MiniWob++ through RL with computer-human interactions
WebArena (CMU)
• 812 tasks, 6 domains
• Long-horizon tasks
• Best GPT-4: 11% solve rate vs 78% for humans
VisualWebArena
• Benchmark that needs visual comprehension
• Test visual & reasoning skills of web agents
• 910 tasks, 3 domains
WebLINX (McGill)
• Conversational web agent navigation
• 2337 expert demos on 155 real-world websites
• Visual models not best; fine-tuning is key
WorkArena (ServiceNow)
• Basic tasks that a knowledge worker must carry out
• Implemented on the ServiceNow platform
OSWorld
• 369 computer tasks of real web and desktop apps in open domains
• OS file I/O + workflows spanning multiple applications
WorkArena++ (ServiceNow)
• Compositional tasks with much higher difficulty than WorkArena
• Today’s best models get single-digit performance, with huge room for improvement
Q&A
Many thanks to the following colleagues:
Alexandre Lacoste Dzmitry Bahdanau Oleh Shliazhko
Maxime Gasse Nicolas Gonthier Karam Ghanem
Massimo Caccia Gabriel Huang Soham Parikh
Léo Boisvert Ehsan Kamalloo Mitul Tiwari
Megh Thakkar Rafael Pardinas Quaizar Vohra
Tom Marty Jordan Prince Tremblay David Vazquez
Rim Assouel Alex Piché Valérie Bécaert
Thibault Le Sellier De Chezelles Torsten Scholak