
Agents for Enterprise Workflows
CS294/194-196 Large Language Model Agents
Lecture 7 — October 21st, 2024
Nicolas Chapados, ServiceNow Research
Alexandre Drouin, ServiceNow Research
Who are we?

Nicolas Chapados, ServiceNow Research
Alexandre Drouin, ServiceNow Research


Safe harbor notice for forward-looking statements
This presentation may contain “forward-looking” statements that are based on our beliefs and
assumptions and on information currently available to us only as of the date of this presentation.
Forward-looking statements involve known and unknown risks, uncertainties, and other factors that
may cause actual results to differ materially from those expected or implied by the forward-looking
statements. Further information on these and other factors that could cause or contribute to such
differences include, but are not limited to, those discussed in the section titled “Risk Factors,” set
forth in our most recent Annual Report on Form 10-K and Quarterly Report on Form 10-Q and in our
other Securities and Exchange Commission filings. We cannot guarantee that we will achieve the
plans, intentions, or expectations disclosed in our forward‐looking statements, and you should not
place undue reliance on our forward‐looking statements. The information on new products,
features, or functionality is intended to outline our general product direction and should not be
relied upon in making a purchasing decision, is for informational purposes only, and shall not be
incorporated into any contract, and is not a commitment, promise, or legal obligation to deliver
any material, code, or functionality. The development, release, and timing of any features or
functionality described for our products remains at our sole discretion. We undertake no obligation,
and do not intend, to update the forward-looking statements.
AGENDA
Background: Defining Agents; Enterprise workflow concepts
API Agents: Architecture; TapeAgents
Web Agents: Web Agent Concepts; WorkArena; BrowserGym and AgentLab
Agents in the Workplace: Automating enterprise workflows; Agents and the future of work
Resources to Dig Further

LLM agents are LLM-powered entities able to autonomously plan and take actions to execute goals over multiple iterations.


LLM-Based Agents

Reinforcement Learning Agents
• Require long training runs in sandboxed environments
• Limited action space
• Low generalizability to radically new tasks
• A Minecraft agent can’t send emails

LLM-Based Agents: Zero-Shot Task Solvers
• LLMs can display some commonsense, since they have lots of world background knowledge
• General-purpose LLMs have probably been trained on the documentation of your software


Two kinds of LLM Agents

API agents
• Observations: API call results, search history, user-uploaded images, chat history
• Actions: API calls, search calls, responses to the user
• Pros: lower latency, lower risks
• Cons: needs appropriate APIs

Web agents
• Observations: what a human would see + accessibility tree / raw DOM
• Actions: enter text in fields, clicks
• Pros: can do anything
• Cons: higher latency, higher risks


Today’s Enterprise Workflows Remain Quite Manual (even with generative AI)
Example: Jon needs access to a KB

Human agent: Do we have a KB that explains what to do? → Is there a similar incident? → What access control does Jon have? → Assign Jon the right role → Resolve with generated resolution notes → Close

GenAI: Resolution Generation skill


Automation in Enterprise Workflows
Levels of automation, from lowest to highest:

Scripted workflows: Automate repetitive digital tasks with minimal workflow variations. (Flow Engine, Integrations)

RPA workflows: Automate repetitive system actions with UI-based interactions. (RPA bots)

AI workflows: Use machine learning models to dynamically adapt workflows based on patterns and feedback loops. (Machine Learning models, Vision models, Intelligence)

Conversational workflows: Automate resolution of high-volume requests and submit tickets on behalf of the user, adapting interactions based on users’ responses. (Generative AI, RAG, Conversation engine)

Agentic workflows: Iterative, interactive approach to automation, where the AI agent is empowered to engage in a more dynamic and self-reflective process. (Reasoning model, Agentic framework, Orchestration)
Agents Solve for the Millions of Low-Value / Low-Volume Tasks

Today’s automation workhorses target high-value or high-volume tasks:
• Robotic Process Automation
• Low-Code / No-Code

What about?
• Scheduling tweets
• Sorting email
• Updating CRM
• Filling out time sheets
• Arranging a 15-person meeting across 4 organizations
Demo: Directions to GTC (original video is 4x longer with long pauses)
AGENDA
Background: Defining Agents; Enterprise workflow concepts
API Agents: Architecture; TapeAgents
Web Agents: Web Agent Concepts; WorkArena; BrowserGym and AgentLab
Agents in the Workplace: Automating enterprise workflows; Agents and the future of work
Resources to Dig Further


LLM-Based Single Agents: Typical Architecture

The LLM agent sits at the center, connected to:
• Memory: short-term memory, long-term memory
• Tools: Calculator(), CodeInterpreter(), WebSearch(), TriggerWorkflow(), ... more ... (new tools)
• Planning: reflection, self-critique, chain of thoughts, subgoal decomposition
• Action: executed in the environment

Source: https://lilianweng.github.io/posts/2023-06-23-agent
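To make the memory / tools / planning / action loop concrete, here is a minimal sketch of a single-agent iteration in Python. The llm() callable, the tool registry, and the decision format are illustrative assumptions, not the API of any particular framework.

```python
# Minimal single-agent loop sketch (illustrative only; llm() is any callable that
# maps a prompt string to a completion string).
def calculator(expression: str) -> str:
    # Toy tool; a real agent would sandbox or restrict evaluation.
    return str(eval(expression))

TOOLS = {"Calculator": calculator}

def agent_loop(goal: str, llm, max_iterations: int = 10) -> str:
    memory = [f"Goal: {goal}"]  # short-term memory: the running scratchpad
    for _ in range(max_iterations):
        # Planning step: the LLM reflects on the scratchpad and picks the next move,
        # either a tool call ("Calculator: 2 + 2") or a final answer ("FINAL: ...").
        decision = llm("\n".join(memory) + "\nNext step (Tool: args, or FINAL: answer)?")
        if decision.startswith("FINAL:"):
            return decision.removeprefix("FINAL:").strip()
        tool_name, _, args = decision.partition(":")
        observation = TOOLS.get(tool_name.strip(), lambda a: "Unknown tool")(args.strip())
        # The action result is written back to memory for the next iteration.
        memory.append(f"Action: {decision}\nObservation: {observation}")
    return "Stopped after max_iterations without a final answer."
```

Long-term memory and subgoal decomposition would slot into the same loop, at the points where the scratchpad is assembled and where the next step is chosen.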


TapeAgents: towards a holistic framework for agent development and optimization

Frameworks that address agent development needs (LangGraph, AutoGen, Crew):
● Resumable sessions
● Low-code components
● Fine-grained control
● Concurrency
● Streaming
● Agent == resumable modular state machine

Frameworks for data-driven agent optimization (DSPy, TextGrad, Trace):
● Structured agent configuration
● Structured agent logs
● Optimization algorithms
● Agent == code that uses structured modules and generates structured logs

Holistic frameworks (TapeAgents):
● Agent == resumable modular state machine
● ... with structured configuration
● ... that makes granular structured logs
● ... that can make fine-tuning data from logs
● ... and reuse other agents’ logs
TapeAgents is a framework built around a structured, granular, semantic-level log: the tape
• The agent reads the tape, reasons, and writes thoughts and actions to the tape
• The environment executes actions from the tape and writes observations to the tape
• Apps use the tape as session state
• Dev tools use tapes to facilitate audits
• Algorithms use tapes to tune agent prompts
• Agents make finetuning data from tapes
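As a mental model only, the sketch below represents a tape as an append-only list of typed steps shared by the agent and the environment. The class and field names are assumptions for illustration, not the actual TapeAgents API.

```python
# Illustrative sketch of the "tape" idea: one append-only log that is both the
# agent's state and its audit trail. Not the TapeAgents API (assumed names).
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class Step:
    kind: Literal["thought", "action", "observation"]
    by: Literal["agent", "environment"]
    content: str

@dataclass
class Tape:
    steps: list[Step] = field(default_factory=list)

    def append(self, step: Step) -> None:
        self.steps.append(step)

# One iteration: the agent reads the whole tape, writes a thought and an action;
# the environment executes the action and writes an observation back.
tape = Tape()
tape.append(Step("thought", "agent", "The user wants to reset their password."))
tape.append(Step("action", "agent", "search_kb(query='password reset')"))
tape.append(Step("observation", "environment", "KB0010042: How to reset your password"))
```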
Agent reasoning loop: example
[Figure: a simple two-agent structure (problem-specific) mapped onto the TapeAgents execution model.]
TapeAgents allows the optimization of a Student Agent from the tapes of a Teacher Agent.


MAKING COST-EFFECTIVE GREADTH (CONVERSATIONAL) AGENTS

GREADTH: Grounded, Responsive, Accurate, Disciplined, Transparent, Helpful


Case Study: Cost-Effective Form-Filling Assistant

• Task: conversational assistant that routes the user to the right form and helps fill it
• Constraints: 5-star conversational experience at low compute cost
• 3 training domains: FlyCorp, BigBankCorp, CoffeeCorp
• 3 testing domains: DriveCorp, LuxuryCorp, ShopCorp
• Metric: GREADTH
– Grounded, Responsive, Accurate, Disciplined, Transparent, Helpful
• Method:
– Generate synthetic tapes with 19 user agents and a 5-node LLAMA-405B Teacher
– Finetune a 1-node LLAMA-8B Student on the teacher tapes (a sketch of this step follows below)
• Outcome: student matches GPT-4o performance at 300x lower cost
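As a rough sketch of that distillation step, assuming each tape is stored as a list of step dicts with a "by" field (a hypothetical schema, not the exact TapeAgents format), the teacher's tapes can be split into (prompt, completion) pairs for finetuning the student:

```python
import json

def tape_to_finetuning_examples(tape_steps: list[dict]) -> list[dict]:
    """Split one teacher tape into (prompt, completion) pairs: the tape prefix is
    the prompt and the teacher's next step is the completion."""
    examples = []
    for i, step in enumerate(tape_steps):
        if step.get("by") == "agent":  # only imitate the agent's own steps
            examples.append({
                "prompt": json.dumps(tape_steps[:i]),   # everything seen so far
                "completion": json.dumps(step),          # what the teacher did next
            })
    return examples
```

Each pair teaches the student to reproduce the teacher's next thought or action given everything that came before it on the tape.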



Student is 330x Cheaper



Agentic Frameworks: How Does TapeAgents Compare?



AGENDA
Background: Defining Agents; Enterprise workflow concepts
API Agents: Architecture; TapeAgents
Web Agents: Web Agent Concepts; WorkArena; BrowserGym and AgentLab
Agents in the Workplace: Automating enterprise workflows; Agents and the future of work
Resources to Dig Further


What is a Web Agent?

User: “Book me a world-wide tour for my rock band next July and August”

[Diagram: the user states a goal in natural language; the Web Agent exchanges observations and actions with the web on the user’s behalf.]


Web Agents Act on the Web on Behalf of Human Users

User: “Book me a world-wide tour for my rock band next July and August”

This involves:
• Task / goal understanding (natural language)
• Situational awareness
• Long-term planning
• Detailed action execution


Making a basic Web Agent

Task: “Fly me to Yellowstone for the next long weekend”

Prompt:
• Task description
• Web page as text
• Action space

Answer:
• Action 1
• Action 2

Execute actions with Python + Playwright.


You can do this by prompting an LLM

Example prompt (simplified):

Task:
- Enter "Enola" into the text field and press Submit.

DOM (Web Page):
<html>
  <body>
  …
  </body>
</html>

Action space:
# Fill out a form field
fill(backend_id: str, value: str)
# Click an element
click(backend_id: str)
# Move the mouse to a location
mouse_move(x: float, y: float)

Answer Format:
<action>
Your actions
</action>

LLM response:
<action>
fill('14', 'Enola')
click('15')
</action>
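Putting the pieces together, here is a minimal sketch of the prompt-then-act loop: parse the <action> block out of the LLM response and map each action onto Playwright calls. The parse_actions/execute helpers and the backend_id selector convention are illustrative assumptions, not the API of a specific framework.

```python
import re
from playwright.sync_api import Page  # pip install playwright

def parse_actions(llm_response: str) -> list[str]:
    """Extract the action lines between <action> ... </action> tags."""
    match = re.search(r"<action>(.*?)</action>", llm_response, re.DOTALL)
    if not match:
        return []
    return [line.strip() for line in match.group(1).splitlines() if line.strip()]

def execute(page: Page, action: str) -> None:
    """Map one action string onto Playwright calls, assuming elements are tagged
    with a backend_id attribute (as in the simplified DOM shown in the prompt)."""
    if m := re.fullmatch(r"fill\('(.+?)', '(.+?)'\)", action):
        page.locator(f"[backend_id='{m.group(1)}']").fill(m.group(2))
    elif m := re.fullmatch(r"click\('(.+?)'\)", action):
        page.locator(f"[backend_id='{m.group(1)}']").click()
    else:
        raise ValueError(f"Unsupported action: {action}")

# Example: parsing the LLM response from the slide.
assert parse_actions("<action>\nfill('14', 'Enola')\nclick('15')\n</action>") == [
    "fill('14', 'Enola')",
    "click('15')",
]
```

A real agent would call the LLM with the prompt above (task description + the page contents as text + the action space), then run each parsed action through execute().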


ReAct
Agent: GPT-4 + ReAct (NOTE: 8x speedup)


How do we evaluate web agents?

Source: https://miniwob.farama.org/index.html (MiniWoB++)


Realistic Trace-Based Benchmarks

Mind2Web (Deng et al., 2023): thousands of human-generated observation-action traces
✅ Real websites
❌ Evaluation based on “gold traces” (what about alternative solutions?)
❌ Traces can be memorized

See also: NNetNav (Murty et al., 2024), WebLINX (Lù et al., 2024)


Realistic Live Environment Benchmarks
Evaluate the end result rather than the sequence of actions (e.g., database state)
✅ Agnostic to action trace
✅ Low memorization risk (no traces)

Sandboxed environments: tasks performed on a locally hosted server
✅ High bandwidth (for parallel experiments)
❌ Limited to open-source software
❌ Complex local setup (e.g., Docker)
Example: WorkArena (Drouin, Gasse et al., 2024)

Open Web environments: tasks performed on a remote server
✅ More realistic (supports any website, latency)
✅ No need for complex local setup
❌ Can be unreliable (network issues)
Example: AssistantBench (Yoran et al., 2024)
pip install browsergym-workarena

WorkArena: Web Agents for Common Knowledge Work Tasks
An open-source benchmark of ~600 work-related tasks built on the ServiceNow platform

Observations: AXTree, HTML, screenshot
Actions: Python (unsafe), e.g. page.get_by_label("Search").click(); coord-based, e.g. mouse_click(x=47.2, y=152.6); bid-based, e.g. click(bid="103")

Tasks span basic UI interactions and complex realistic workflows

[Figure: (a) WorkArena covers common ways of interacting with the ServiceNow Platform, a widely used enterprise software platform; (b) BrowserGym provides the observation and action interface.]
WorkArena++: Towards Realistic Enterprise Workflows

To test the retrieval ability of web agents, tasks require information extraction from UI elements such as plots and text elements such as forms. The agent is then requested to solve a diverse set of follow-up tasks using the retrieved information.

Example: the agent is assigned a ticket with the instruction “Please solve this.” and must chain several sub-tasks:
1. Knowledge base retrieval
2. Dashboard retrieval
3. Service Catalog interaction

[Figure: WorkArena++ tasks run through the BrowserGym interface, with chat, HTML, AXTree, and screenshot observations and Python, coord-based, and bid-based actions.]
on realistic websites that emulate 38
WorkArena++: Towards Realistic Enterprise Workflows (overview of tasks)

• Solve a series of enterprise decision-making problems: workload balancing, scheduling with constraints, assigning work to experts
• Navigate the platform to gather multiple bits of information and then solve a task (e.g., offboard a user)
• Read dashboards and act (e.g., restock IT asset inventory)
• Search for information in lists, forms, and KBs (e.g., find if a user’s laptop is under warranty)
• Read a dashboard, make calculations, take action (e.g., budget management: choose where to invest based on expected return; expense management)
• Some tasks are purposely infeasible; the agent must detect this.

(Task counts per category in the original chart: 104, 132, 56, 78, 312.)


WorkArena++ is far from being solved

[Chart: success rate (higher is better) on WorkArena++ realistic workflows]

What explains this?
• Failure to plan
• Hallucinated controls
• Incorrect action syntax


Benchmark Explosion
• MiniWoB++ (Shi et al., 2017; Liu et al., 2018): 125 tasks
• WebShop (Yao, Chen et al., 2022): 12,087 tasks
• WebArena (Zhou et al., 2023): 812 tasks
• VisualWebArena (Koh et al., 2024): 910 tasks
• WebLINX (Lù et al., 2024): 2,300 tasks
• WebCanvas (Pan et al., 2024): 438 tasks
• WebVoyager (He et al., 2024): 643 tasks
• AssistantBench (Yoran et al., 2024): 214 tasks
• WorkArena++ (ServiceNow Research, 2024): 682 tasks

Call for unification: get everyone under the same roof for a great Meta-Benchmark.


pip install browsergym

BrowserGym: a unified evaluation platform

> Standard observation space: HTML, screenshots, accessibility tree, and more
> Standard action space: Python (unsafe), e.g. page.get_by_label("Search").click(); coord-based, e.g. mouse_click(x=47.2, y=152.6); bid-based, e.g. click(bid="103")
> Regroups most major benchmarks (thousands of realistic tasks)
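As a usage sketch, the snippet below runs one episode through the gymnasium-style interface that BrowserGym exposes. The task id, the import path, and the placeholder action are assumptions for illustration; consult the BrowserGym docs for the exact ids and observation keys.

```python
# Hedged sketch of a BrowserGym episode (assumes the gymnasium-style API;
# the task id and the placeholder action below are illustrative).
import gymnasium as gym
import browsergym.workarena  # assumption: importing the package registers the WorkArena tasks

env = gym.make("browsergym/workarena.servicenow.order-standard-laptop")  # hypothetical task id
obs, info = env.reset()

done = False
while not done:
    # A real agent would build an LLM prompt from the observation
    # (goal, AXTree/HTML, screenshot) and ask for the next action string.
    action = 'click("42")'  # placeholder action in the bid-based action space
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

env.close()
```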
pip install browsergym

Human evaluation for any benchmark!


A toolbox for agent design
> Simple building blocks for agents
> Tools to inspect their behavior
> Experimental framework:

> Easy large-scale evaluation

> Reproducibility features



Reproducibility as a priority

Benchmarking on the open web is challenging (dynamic environment):
> Websites are updated
> API-based LLMs change silently
> Python packages evolve

Mechanisms:
> Standardized observation/action traces
> Experimental journal uploaded to a public repo (date, versions, agent configuration, traces, etc.)
> Leaderboards with scores that are automatically reproduced based on the above


Large Dataset Collection for Web Agents

Opportunity: with mechanisms for
> standardized observation and action spaces,
> standardized trace collection,
> a public repository for traces,
we can collectively gather large-scale finetuning datasets on public benchmarks and on the open web.

Source: Murty et al. (2024)


The Challenges for Web Agents Remain Tall
We are, after all, dealing with the World Wild Web

Main hurdles:

• Long context understanding: real-world web pages contain hundreds of thousands of tokens. Retrieval can help (e.g., the Dense Markup Ranker; Lù et al., 2024).

• Long-term planning.

• Learning and adaptability: sparse rewards and near-impossible test-time exploration. One potential solution is to automatically gather huge exploratory traces tagged with goals (NNetNav; Murty et al., 2024). How to efficiently learn from demonstrations and mistakes? Another potential solution is to use RL-inspired approaches to finetune the agent policy (Agent Q uses MCTS + DPO; Putta et al., 2024).

• Multimodality: can be crucial. Humans rely on vision, but current agents fail to make use of that modality.

• Cost and efficiency: web agents must produce more value than they cost to be viable. Directions include shrinking context size (e.g., retrieval), multi-agent architectures, smaller LLMs that solve low-level tasks (e.g., a “date picker agent”), and finetuning smaller LLMs to improve their performance.

• Safety and alignment: website contents can trip agent LLM guardrails, e.g., text visible to the LLM but not to the human (white on white); random-character, ASCII-art, and tokenizer attacks; even worse for multimodal models. And 2026’s fraudsters: a malicious Chrome plugin detects when you log onto your bank and executes a wire transfer abroad.


AGENDA
Background: Defining Agents; Enterprise workflow concepts
API Agents: Architecture; TapeAgents
Web Agents: Web Agent Concepts; WorkArena; BrowserGym and AgentLab
Agents in the Workplace: Automating enterprise workflows; Agents and the future of work
Resources to Dig Further


AI Agents are poised to change the nature of work.


Today’s Enterprise Workflows Remain Quite Manual (even with generative AI)
Example: Jon needs access to a KB

Human agent: Do we have a KB that explains what to do? → Is there a similar incident? → What access control does Jon have? → Assign Jon the right role → Resolve with generated resolution notes → Close

GenAI: Resolution Generation skill


Access issue – With AI Agents
Jon needs access to a KB

AI Agent: “To solve this issue I need to: 1. Find what access Jon has; 2. Find KB permissions; 3. Give Jon the right access role. Can you approve?”
Human agent: “Looks good. Go ahead.”
AI Agent: “I’m going to assign Jon ‘Knowledge’.”
Human agent: “Ok.” → Close

Behind the scenes, an AI Agents Orchestrator coordinates specialized agents:
• Next-best-action AI Agents: Who can I assign this issue to? Generate a plan based on the relevant KB and similar incidents.
• User Access AI Agents: What access does Jon have? Assign Jon the ‘Knowledge’ role.
• KB AI Agent: What are the KB permissions?
• Documenter AI Agents: Document resolution steps.
Web Agents to address Low Value / Low Volume tasks

[Chart: development & maintenance cost effectiveness (EXPENSIVE to CHEAP) vs. agent ability, generalizability & adaptability (“IDIOT” to CAPABLE, spanning API calls, UI/UX interaction, natural language reasoning, tool use, tool making). Traditional development, RPA, and LC/NC sit toward the expensive, narrow end; Web Agents (late 2024) and General-purpose Agents (future) move toward the cheap, capable end, where the Low Value / Low Volume region for individual tasks becomes cost-effective to address.]
What is Knowledge Work? WorkArena can help us understand the future of Knowledge Work.

[Word cloud of knowledge-work activities: Researching Online, Data Analysis, Email Communication, Writing Reports, Project Planning, Presentation Creation, Graphic Design, Website Management, Social Media Management, Video Editing, Programming, Strategic Planning, Document Review and Editing, Online Collaboration, Meeting Scheduling and Coordination, Customer Relationship Management (CRM), Financial Planning, E-learning Development, Business Intelligence (BI), Voice Over Production, Accessibility Testing, Database Management, Technical Support, Legal Research, Cybersecurity Monitoring, Human Resources Tasks, Blogging and Content Creation, Market Analysis, Inventory Management, Digital Asset Management, Real Estate Analysis, Supply Chain Optimization, Health Informatics, Scientific Research, Task and Workflow Automation, Cloud Computing Management, Knowledge Management, 3D Modeling and CAD, Language Translation and Localization, Digital Marketing Campaigns, Podcast Production, Software Testing and QA, Remote Team Management, Event Planning and Management, Mobile App Development, Risk Management, Intellectual Property Management, Environmental Sustainability, E-commerce Management, Ethical Hacking and Penetration Testing, and more.]

O*NET: Cataloging the Workforce
Source: Bureau of Labor Statistics Standard Occupational Classification and O*NET
Technology adoption takes time, and uncertainty about generative AI adoption remains high

[Chart: early scenario (including GenAI) vs. late scenario (including GenAI) adoption curves]

Adoption drivers:
• Technological maturity
• Integration speed
• Relative cost of technology vs. labor
• Technology diffusion rate
• Supply constraints (e.g., GPUs, regulatory)

Source: The economic potential of generative AI: The next productivity frontier (McKinsey)
Assessing Impact: Top-Down vs Bottom-Up

Top-Down Assessment
• Analyze each task for each job in O*NET
• For each, “guess” what the task looks like with AI, and decide if a human is still needed
• Can be automated (GPT-4)
• Advantage: wide coverage
• Challenge: blurry picture

Bottom-Up Assessment
• Map each task in O*NET to benchmark tasks in a knowledge-work proxy such as WorkArena
• Track the ability of AI to successfully complete the tasks and map back to job automation probability
• Advantage: detailed picture
• Challenge: spotty coverage


Envisioning AI Augmentation to Empower Workers

Centaur
• Strategic separation between “human tasks” and “AI tasks”
• From human intuition, AI can: map the problem domain, gather information, handle data analysis, refine human content

Cyborg
• Task-level collaboration, where the human can ask the AI to: assume a certain persona, learn a task from examples, give a logical explanation, provide a deep dive, respond to contradictions and push-back

Source: Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality (HBS, BCG)
AGENDA
Background: Defining Agents; Enterprise workflow concepts
API Agents: Architecture; TapeAgents
Web Agents: Web Agent Concepts; WorkArena; BrowserGym and AgentLab
Agents in the Workplace: Automating enterprise workflows; Agents and the future of work
Resources to Dig Further


LLM Agent Frameworks & Benchmarks (timeline: 2022 → 2024)

LIBRARIES / FRAMEWORKS
• LangChain (Oct): Enables chaining multiple LLM calls for multi-step workflows. Various tools like APIs, databases, and external data sources. Memory management, allowing context retention across multiple interactions.
• AutoGPT (Mar): Automates tasks with autonomous agents. Uses a feedback loop to refine outputs based on goals and constraints. Unlike LangChain, emphasizes autonomous decision-making over structured workflow chaining.
• AutoGen (Sept): Multi-agent framework for building workflows with AI agents. AutoGen agents can work together, integrating LLMs, tools, and human inputs. Unlike LangChain and AutoGPT, emphasizes multi-agent interaction and human-AI collaboration.
• Crew.ai (Dec): Collaborative agent teams with specific roles and goals. Sequential and hierarchical processes. Versatile tools with error handling and caching capabilities. Allows human oversight and interaction.
• LangGraph (Jan): Graph-based: agent workflows as nodes and edges. Stateful design enables cyclical flows. Real-time streaming. Allows granular control.
• LlamaIndex Workflows (Aug): Event-driven architecture. Provides state management and supports human-agent collaboration. Supports tools like Arize Phoenix for debugging.
• TapeAgents (Oct): Single unifying abstraction (the “tape”), which is both a log of events and the state of the system. Enables complex agent optimization such as prompt tuning and distillation from a complex teacher to a simpler student.

BENCHMARKS
• ToolBench (May): Evaluates tool use with diverse real-world tasks. 8 tasks, e.g., Weather, Trip booking, Google Sheets. Can boost open-source LLMs to a 90% success rate, matching GPT-4 in 4 out of 8 tasks.
• AgentBench (Aug): 8 environments: operating system, database, knowledge graph, digital card game, lateral thinking puzzles, house-holding, web shopping, web browsing.
• MLAgentBench (Oct): 13 tasks for ML experimentation, from CIFAR-10 to BabyLM. Tasks include file operations, running code, output inspection. Best is Claude v3 Opus at a 37.5% average success rate. Challenges: long-term planning, hallucination.
• GAIA (Nov): Q&A that needs reasoning, multi-modality, tools. Humans: 92% vs. 15% for GPT-4 with plugins. 466 questions; 166 with detailed traces, 300 retained for the leaderboard. Questions have an unambiguous answer.
• SWE-Bench (Apr): Evaluates AI agents on real-world software engineering tasks. 2,294 problems from real GitHub issues and PRs across 12 popular Python repositories. Code generation, bug fixing, design.
• 𝜏-Bench (Jun): Emulates conversations between an LLM user and an LLM agent provided with domain-specific API tools and policy guidelines. 175 tasks from retail and airline domains. Top models still at sub-par performance.
• InsightBench (Oct): Evaluates agents on end-to-end data science workflows, measuring cross-domain generalization. Task planning, execution, reasoning. Incomplete data and ambiguous goals. Evaluations on correctness, efficiency, collaboration.
Web Agent Research Milestones (2017 → 2024)

• World of Bits (WoB), 2017: First widely available web benchmark. Simplified tasks. 100 tasks. Can be solved by RL with computer-human interactions.
• WebGPT (OpenAI), 2021: Fine-tuned GPT-3 for QA with web browsing. Evaluated on “Explain Like I’m 5” Reddit questions + the TruthfulQA dataset.
• Learning to Control Computers (DM), 2022: Control computers with keyboard & mouse from natural-language instructions. MiniWoB++ through RL.
• Mind2Web (Ohio), 2023: Benchmark of realistic web tasks from natural language. Interaction traces. 2,350 tasks from 137 websites, 31 domains.
• WebArena (CMU), 2023: Realistic benchmark, 812 tasks, 6 domains. Long-horizon tasks. Best GPT-4: 11% solve rate vs. 78% for humans.
• WebAgent (Google), 2023: Combines 2 LLMs to simplify huge HTML, plan the solution, and create code talking to the web browser; no pixels. MiniWoB & Mind2Web.
• VisualWebArena, 2024: Benchmark that needs visual comprehension. Tests visual & reasoning skills of web agents. 910 tasks, 3 domains.
• WebLINX (McGill), 2024: Conversational web agent navigation. 2,337 expert demos on 155 real-world websites. Visual models not best; fine-tuning is key.
• WebVoyager (Tencent), 2024: Completes tasks on real websites using textual + visual inputs. New benchmark: 15 websites, automatic GPT-4V-based evaluation.
• WorkArena (ServiceNow), 2024: Basic tasks that a knowledge worker must carry out. Implemented on the ServiceNow platform.
• OSWorld, 2024: 369 computer tasks over real web and desktop apps in open domains. OS file I/O plus workflows spanning multiple applications.
• WorkArena++ (ServiceNow), 2024: Compositional tasks with much higher difficulty than WorkArena. Today’s best models get single-digit performance, with huge room for improvement.
• WebCanvas (CMU), 2024: Handles the dynamic web. Mind2Web-Live, a refined Mind2Web: 542 tasks, 2,439 evaluation states.
• AssistantBench, 2024: Diverse web tasks: search, navigation, data extraction, interaction. 214 tasks that can be auto-evaluated.
• NNetNav (Stanford), 2024: Training web agents entirely through synthetic demos. Web trajectory rollouts are processed by an LLM to be retroactively labeled into instructions.
“Hey Lecture Agent, create our 2025 Class Presentation!”
TapeAgents · WorkArena · BrowserGym · AgentLab

Q&A
Many thanks to the following colleagues:
Alexandre Lacoste, Dzmitry Bahdanau, Oleh Shliazhko, Maxime Gasse, Nicolas Gonthier, Karam Ghanem, Massimo Caccia, Gabriel Huang, Soham Parikh, Léo Boisvert, Ehsan Kamalloo, Mitul Tiwari, Megh Thakkar, Rafael Pardinas, Quaizar Vohra, Tom Marty, Jordan Prince Tremblay, David Vazquez, Rim Assouel, Alex Piché, Valérie Bécaert, Thibault Le Sellier De Chezelles, Torsten Scholak
