0% found this document useful (0 votes)

491 views72 pages

Agent Work Flows

The document discusses the development and optimization of Large Language Model (LLM) agents for automating enterprise workflows, highlighting the differences between API and web agents. It introduces the TapeAgents framework, which utilizes structured logs for agent optimization and describes a case study demonstrating a cost-effective conversational assistant. The presentation also outlines the challenges and methodologies for evaluating web agents in realistic environments.

Uploaded by

raags90

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

491 views72 pages

Agent Work Flows

Uploaded by

raags90

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 72

Agents for Enterprise

Workflows
CS294/194-196 Large Language Model Agents
Lecture 7 — October 21st, 2024
Nicolas Chapados Alexandre Drouin
ServiceNow Research ServiceNow Research
Who are we?

Nicolas Chapados Alexandre Drouin

ServiceNow Research ServiceNow Research

© 2024 ServiceNow, Inc. All Rights Reserved. 2

Safe harbor notice for forward-looking statements
This presentation may contain “forward-looking” statements that are based on our beliefs and
assumptions and on information currently available to us only as of the date of this presentation.
Forward-looking statements involve known and unknown risks, uncertainties, and other factors that
may cause actual results to differ materially from those expected or implied by the forward-looking
statements. Further information on these and other factors that could cause or contribute to such
differences include, but are not limited to, those discussed in the section titled “Risk Factors,” set
forth in our most recent Annual Report on Form 10-K and Quarterly Report on Form 10-Q and in our
other Securities and Exchange Commission filings. We cannot guarantee that we will achieve the
plans, intentions, or expectations disclosed in our forward‐looking statements, and you should not
place undue reliance on our forward‐looking statements. The information on new products,
features, or functionality is intended to outline our general product direction and should not be
relied upon in making a purchasing decision, is for informational purposes only, and shall not be
incorporated into any contract, and is not a commitment, promise, or legal obligation to deliver
any material, code, or functionality. The development, release, and timing of any features or
functionality described for our products remains at our sole discretion. We undertake no obligation,
and do not intend, to update the forward-looking statements.
Defining Agents
Background Enterprise workflow concepts

AGE NDA
Architecture
API Agents TapeAgents

Web Agent Concepts

Web Agents WorkArena
BrowserGym and AgentLab

Agents in the Automating enterprise workflows

Workplace Agents and the future of work

Resources to Dig Further

Defining Agents
Background Enterprise workflow concepts

AGE NDA
Architecture
API Agents TapeAgents

Web Agent Concepts

Web Agents WorkArena
BrowserGym and AgentLab

Agents in the Automating enterprise workflows

Workplace Agents and the future of work

Resources to Dig Further

LLM agents are LLM-powered entities
able to autonomously plan and take
actions to execute goals over multiple
iterations.

© 2024 ServiceNow, Inc. All Rights Reserved. 6

LLM-Based Agents

Reinforcement Learning LLM-Based Agents:

Agents Zero-Shot Task Solvers
• Require long training runs in • LLMs can display some
sandboxed environments commonsense, since they
have lots of world
• Limited action space background knowledge
• Low generalizability to • General-Purpose LLMs have
radically new tasks probably been trained on
the documentation of your
• A Minecraft agent can’t software
send emails

© 2024 ServiceNow, Inc. All Rights Reserved. 7

Two kinds of LLM Agents

API agents Web agents

• Observations: API call results, search history, • Observations: what human would see +
user-uploaded images, chat history accessibility tree / raw DOM
• Actions: API calls, search calls, responses to • Actions: enter text in fields, clicks
the user
• Pros: can do anything
• Pros: Lower latency, lower risks
• Cons: higher latency, higher risks
• Cons: needs appropriate APIs

© 2024 ServiceNow, Inc. All Rights Reserved. 8

Today’s Enterprise Workflows Remain Quite Manual
Jon needs (even with generative AI)
access to a KB

Do we have a Is there a What access Assign Jon the Resolve

KB that similar control does right role
explains what incident? Jon have? I generated
to do? resolution Close
notes
Human Agent

Resolution
Generation skill

GenAI © 2024 ServiceNow, Inc. All Rights Reserved. 9

Automation in Enterprise Workflows Agentic
workflows

Iterative, interactive approach

to automation, where the AI
Level of Automation

Conversational agent is empowered to

engage in a more dynamic
workflows and self-reflective process.
Reasoning model
Agentic framework
Orchestration
AI workflows
Automate resolution of high
RPA volume requests and submit
tickets on behalf of the user,
workflows adapting interactions based on
Scripted Use machine learning
users’ response

workflows models to dynamically Generative AI

RAG
adapt workflows based
Automate repetitive system Conversation engine
on patterns and
actions with UI-based
Automate repetitive feedback loops.
interactions
digital tasks with minimal Machine Learning models
RPA bots
workflow variations Vision models
Flow Engine
Integrations

Intelligence © 2024 ServiceNow,

© 2024Inc.
ServiceNow,
All Rights Reserved.
Inc. All Rights
Confidential.
Reserved. 10
Agents solve for Today’s automation
workhorses for high-
the Millions of value or high-volume
Low-Value/Low- tasks
Volume Tasks • Robotic Process Automation
• Low-Code / No-Code

What About?
• Scheduling tweets
• Sorting email
• Updating CRM
• Filling out time sheet
• Arranging 15-person meeting
© 2024 ServiceNow, Inc. All Rights Reserved. 11
across 4 organizations
Demo: Directions to GTC (original video is 4x longer
with long pauses)
Defining Agents
Background Enterprise workflow concepts

AGE NDA
Architecture
API Agents TapeAgents

Web Agent Concepts

Web Agents WorkArena
BrowserGym and AgentLab

Agents in the Automating enterprise workflows

Workplace Agents and the future of work

Resources to Dig Further

© 2024 ServiceNow, Inc. All Rights Reserved. 13

LLM-Based Single Agents: Typical Architecture
Long-Term Memory
Short-Term Memory
Calculator() (new tools)

CodeInterpreter() Memory Reflection

LLM
WebSearch() Tools Planning Self-critique
Agent

TriggerWorkflow() Action Chain of thoughts

Subgoal
… more …
Decomposition
Environment

Source: https://lilianweng.github.io/posts/2023-06-23-agent © 2024 ServiceNow, Inc. All Rights Reserved. 14

TapeAgents: towards a holistic framework
for agent development and optimization
Frameworks that address Frameworks for data-
agent development needs driven agent optimization Holistic Frameworks
● Resumable sessions • Structured agent TapeAgents:
● Low-code components configuration Agent ==
● Fine-grained control • Structured agent logs Resumable modular state
● Concurrency
• Optimization algorithms machine
● Streaming
… with structured
LangGraph, AutoGen, DSPy, TextGrad, Trace: configuration
Crew: • Agent == code that uses … that makes granular
• Agent == resumable structured modules and structured logs
modular state machine generates structured logs … that can make fine-
tuning data from logs
… and reuse other agent’s
logs
© 2024 ServiceNow, Inc. All Rights Reserved. 15
TapeAgents is a framework built around a structured,
granular, semantic-level log: the tape
• Agent reads the tape,
reasons, writes thoughts
and actions to the tape
• Environment executes
actions from the tape,
write observations to
the tape
• Apps use the tape as
session states
• Dev tool use tapes to
facilitate audit
• Algorithms use tapes to
tune agent prompts
• Agents make finetuning
data from tapes
© 2024 ServiceNow, Inc. All Rights Reserved. 16
Agent reasoning loop: example
Simple two-agent structure TapeAgents execution
(problem-specific) model

© 2024 ServiceNow, Inc. All Rights Reserved. 17

© 2024 ServiceNow, Inc. All Rights Reserved. 18
TapeAgents
allows the
optimization of
a Student Agent
from the tapes
of a Teacher
Agent

© 2024 ServiceNow, Inc. All Rights Reserved. 19

MAKING COST-EFFECTIVE

G R E A D T H
(CONVERSATIONAL) AGENTS

© 2024 ServiceNow, Inc. All Rights Reserved. 20

MAKING COST-EFFECTIVE

G R E A D T H
GROUNDED RESPONSIVE ACCURATE DISCIPLINED TRANSPARENT HELPFUL

(CONVERSATIONAL) AGENTS

© 2024 ServiceNow, Inc. All Rights Reserved. 21

Case Study: Cost-Effective Form-Filling Assistant

• Task: conversational assistant that routes the user to the right form and helps fill it
• Constraints: 5-star conversational experience at low compute cost
• 3 training domains: FlyCorp, BigBankCorp, CoffeeCorp
• 3 testing domains: DriveCorp, LuxuryCorp, ShopCorp
• Metric: GREADTH
– Grounded, Responsive, Accurate, Disciplined, Transparent, Helpful
• Method:
– Generate synthetic tapes with 19 user agents and a 5-node LLAMA-405B Teacher
– Finetune 1-node LLAMA-8B Student
• Outcome: student matches GPT-4o performance at 300x lower cost

© 2024 ServiceNow, Inc. All Rights Reserved. 22

© 2024 ServiceNow, Inc. All Rights Reserved. 24

Agentic Frameworks: How Does TapeAgents Compare?

© 2024 ServiceNow, Inc. All Rights Reserved. 25

Defining Agents
Background Enterprise workflow concepts

AGE NDA
Architecture
API Agents TapeAgents

Web Agent Concepts

Web Agents WorkArena
BrowserGym and AgentLab

Agents in the Automating enterprise workflows

Workplace Agents and the future of work

Resources to Dig Further

© 2024 ServiceNow, Inc. All Rights Reserved. 26

What is a Web Agent?

User
Book me a Observations
world-wide tour
for my rock
band next July
and August
Web Agent
Actions

© 2024 ServiceNow, Inc. All Rights Reserved. 27

Web Agents Act on the Web on Behalf of Human Users

Involves
User
Book me a Observations • Task / goal understanding
world-wide tour (natural language)
for my rock
band next July • Situational awareness
and August • Long-term planning
Web Agent • Detailed action execution
Actions

© 2024 ServiceNow, Inc. All Rights Reserved. 28

Making a basic Web Agent

Prompt Task
• Task Description Fly me to
Yellowstone
• Web Page as text for the next
long weekend
• Action Space

Answer Execute actions

• Action 1 • Python + Playwright
• Action 2

© 2024 ServiceNow, Inc. All Rights Reserved. 29

You can do this by prompting an LLM
Example prompt (simplified): LLM response:
<action>
Task: fill('14', 'Enola')
- Enter "Enola" into the text field and press Submit.
click('15')
DOM (Web Page): </action>
<html>
<body>
…
</body>
</html>

Action space:

# Fill out a form field

fill(backend_id: str, value: str)

# Click an element
click(backend_id: str)

# Move the mouse to a location

mouse_move(x: float, y: float)

Answer Format:
<action>
Your actions
</action>

© 2024 ServiceNow, Inc. All Rights Reserved. 30

ReAct

Agent: GPT-4 + ReAct

ReAct

Agent: GPT-4 + ReAct NOTE: 8x speedup

How do we evaluate web agents?

Source: https://miniwob.farama.org/index.html (MiniWoB++)

© 2024 ServiceNow, Inc. All Rights Reserved. 33

Realistic Trace-based Benchmarks
Mind2Web (Deng et al., 2023)
Thousands of human-generated
observation-action traces
✅Real websites
❌ Evaluation based on “gold traces” (what
about alternative solutions?)
❌ Traces can be memorized

NNetNav (Murty et al., 2024) WebLINX (Lù et al., 2024)

© 2024 ServiceNow, Inc. All Rights Reserved. 34

Realistic Live Environment Benchmarks
Evaluate end result rather than sequence of actions (e.g., database state) ✅Agnostic to action trace
✅Low memorization risk (no traces)
Sandboxed environments

Tasks performed on locally hosted server

✅High bandwidth (for parallel experiments)
❌Limited to open-source software
❌Complex local setup (e.g., Docker)

Open Web Environments

Wor kAr ena: Web Agents for Common K nowledge Wo
AssistantBench (Yoran et al., 2024) WorkArena (Drouin, Gasse et al., 2024)
Tasks performed on a remote server
✅More realistic (supports any website, latency)

✅No need for complex local setup

❌Can be unreliable (network issues) Browser

(a) WorkArena
© 2024 ServiceNow, Inc. All Rights Reserved. 35
Figur e 1: Overview of contributions: (a) Wor kAr ena is a benchmark of 33 web tasks and
waysof interacting with the ServiceNow Platform, awidely-used enterprise software platform
pip install browsergym-workarena

Wor kAr ena: Web Agents for Common K nowledge Wor k Tasks
An open-source benchmark of ~600 work-related tasks built on the ServiceNow platform

AXTree
HTML
_Free_ Screenshot
Observation

Action

Browser page. get _by_l abel ( " Sear ch" ) . cl i ck( )

Python (unsafe)

mouse_cl i ck( x=47. 2, y=152. 6) cl i ck( bi d

coord-based bid-ba

Tasks span basic UI interactions and complex realistic workflows

(a) WorkArena Open Web
(b) BrowserGym
© 2024 ServiceNow, Inc. All Rights Reserved. 36
Figur e 1: Overview of contributions: (a) Wor kAr ena is a benchmark of 33 web tasks and 19,950 unique instances tha
© 2024 ServiceNow, Inc. All Rights Reserved. 37
e retrieval ability of web agents, we formulate tasks requiring information extraction
Chat
ements such as plots and text elements such as forms. The agent is then requested to
WorkArena++ Towards Realistic Enterprise Workflows
User: show me
rse set of follow-up tasks using theretrieved
Wor kAr ena: information.
Web Agents for Common K nowledge Wor k Tasks HTML
AXTree flights from
MTL to NYC
Screenshot

1 Observation
Knowledge base
oar d retr ieval Example: The agent is assigned a ticket and instruction: “Please solve this.”

Action
Web Agent
Browser page. get _by_l abel ( " Sear ch" ) . cl i ck( ) AXTree
HTML
Arena: Web Agents for Common K nowledgeWor k Tasks Python (unsafe)
Screensho
mouse_cl i ck( x=47. 2, y=152. 6) cl i ck( bi d=" 103" )
2 Observation
coord-based bid-based
Dashboard
(a) WorkArena (b) BrowserGym
Chat Action
ew of contributions: (a) Wor kAr ena is a benchmark of 33 web tasks and 19,950 unique instances thatpage. cover common
get _by_l abel ( " Sear ch" ) . cl i ck(
Browser
g with the ServiceNow Platform, a widely-used enterprise software AXTree
platform. (b) Br owser
User: show me
flights from
Gym is aPython environment Python (unsafe)
HTML
evaluating web agents, which includes a rich set of actions and multimodal observations
Screenshot
MTL to (shown
NYC herethe HTML
mouse_cl i ck( x=47. 2,contents
y=152. 6) cl i ck(

cessibility 3tree,Service
and theraw pixelsafter browser rendering). Observation
Catalog
coord-based bi

(a) WorkArena (b) BrowserGym

ch as the 500,000
Figur daily users
e 1: Overview of Disney+’s
of contributions: (a) WoretkAr al.,ena 2022), asimulatedofe-commerce
is a benchmark
Action
33 web taskswebsite with
and 19,950 shopping
unique instances t
ways of2020).
enter (Maas, interacting with the ServiceNow Platform,
ServiceNow’sextensive tasksathat widely-used
require enterprise searching software
Web Agent platform.
and browsing (b) Br owser
acatalog Gym is aP
of items.
Browser page. get _by_l abel ( " Sear ch" ) . cl i ck( )
for designing
n ideal real-world and evaluating
environment forweb agents, whichMorerecently,
evaluating includes a rich Zhouet
Python (unsafe) set of actions and multimodal
al. (2023) introduced observations
WebArena,(shownacol- heret
of thepage, itsaccessibility tree, and theraw pixelsafter browser rendering).
oard retrieval L2 goal.
pact of UI assistants in theworkplace. (b)
mouse_cl
Dashboard
i ck( x=47. 2, y=152.
lection
coord-based
of 6)
190 retrieval
cl i ck(
tasks L3 task
bi d=" 103" )
based
bid-based
description.
© 2024 ServiceNow, Inc. All Rights Reserved.
on realistic websites that emulate 38
WorkArena++ Towards Realistic Enterprise Workflows
Overview of tasks Some tasks are purposely infeasible.
Solve a series of enterprise Agent must detect this.
decision-making problems:
- Workload balancing
- Scheduling with constraints
- Assigning work to experts

Navigate the platform to gather

(104) multiple bits of information and
(132) then solve a task:
- Offboard a user

Read dashboards and act: (56)

- Restock IT asset inventory
(78)
Search for information in lists, Read dashboard, make
forms, and KBs: calculations, take action
- Find if a user’s laptop is under
warranty
Budget management: choose
where to invest based on
expected return
(312)
Expense management

© 2024 ServiceNow, Inc. All Rights Reserved. 39

WorkArena++ is far from being solved

Success rate (higher is better)

Realistic Workflows

© 2024 ServiceNow, Inc. All Rights Reserved. 40

WorkArena++ is far from being solved

Success rate (higher is better)

Realistic Workflows

What explains this?

• Failure to plan
• Hallucinated controls
• Incorrect action syntax

© 2024 ServiceNow, Inc. All Rights Reserved. 41

Benchmark Explosion
• MiniWoB++ (Shi et al., 2017; Liu et al., 2018) 125 tasks

• WebShop (Yao, Chen et al., 2022) 12 087 tasks Call for unification
• WebArena (Zhou et al., 2023) 812 tasks Get everyone under the same
roof for a great Meta-
Benchmark
• VisualWebArena (Koh et al., 2024) 910 tasks

• WebLINX (Lù et al., 2024) 2 300 tasks

• WebCanvas (Pan et al., 2024) 438 tasks

• WebVoyager (He et al., 2024) 643 tasks

• AssistantBench (Yoran et al., 2024) 214 tasks

• WorkArena++ (ServiceNow Research, 2024) 682 tasks

© 2024 ServiceNow, Inc. All Rights Reserved. 42

pip install browsergym
Wor kArena: Web Agents for Common K nowledgeWor k Tasks

A unified evaluation platform

Chat
> Standard Observation Space User: show me
flights from
• HTML HTML
AXTree
MTL to NYC
• Screenshots Screenshot
Observation
• Accessibility Tree
• And more

> Standard Action Space Action

Web Agent
Browser page. get _by_l abel ( " Sear ch" ) . cl i ck( )
> Regroups most major benchmarks Python (unsafe)

(thousands of realistic tasks) mouse_cl i ck( x=47. 2, y=152. 6)

coord-based
cl i ck( bi d=" 103" )
bid-based

(a) WorkArena (b) BrowserGym

Figur e1: Overview of contributions: (a) Wor kAr ena is a benchmark of 29 web tasks and 18,050©unique
2024 ServiceNow, Inc. All Rights Reserved.
instances that cover common 43
pip install browsergym

Human evaluation
for any benchmark!

© 2024 ServiceNow, Inc. All Rights Reserved. 44

A toolbox for agent design
> Simple building blocks for agents
> Tools to inspect their behavior
> Experimental framework:

> Easy large-scale evaluation

> Reproducibility features

© 2024 ServiceNow, Inc. All Rights Reserved. 45

© 2024 ServiceNow, Inc. All Rights Reserved. 46
Reproducibility as a priority > Standardized observation/action traces
Act Act
Benchmarking on the open web is
challenging (dynamic environment)
> Websites are updated

> API-based LLMs change silently > Experimental journal uploaded to public repo
Date, versions, agent configuration, traces, etc.
> Python packages evolve

> Leaderboards with scores that are

automatically reproduced based on the above

© 2024 ServiceNow, Inc. All Rights Reserved. 47

Large Dataset Collection for Web Agents

Opportunity

With mechanisms for:

> Standardized observation and

action spaces
> Standardized trace collection

> Public repository for traces

We can collectively gather large-

scale finetuning datasets on public
benchmarks and on the open web.

Source: Murty et al. (2024)

© 2024 ServiceNow, Inc. All Rights Reserved. 48

The Challenges for Web Agents Remain tall
We are, after all, dealing with the World Wild Web

Real-world web pages contain

Main hurdles hundreds of thousands of tokens

• Long context understanding

• Long-term planning

• Learning and adaptability

• Multimodality

• Cost and efficiency

Retrieval can help (e.g., Dense Markup Ranker;
Lù et al., 2024)
• Safety and alignment

© 2024 ServiceNow, Inc. All Rights Reserved. 50

The Challenges for Web Agents Remain tall
We are, after all, dealing with the World Wild Web

Sparse rewards and near-

Main hurdles impossible test-time exploration

• Long context understanding NNetNav (Murty et al., 2024)

• Long-term planning

• Learning and adaptability

• Multimodality

• Cost and efficiency

Potential solution: automatically
• Safety and alignment gather huge exploratory traces
tagged with goal

© 2024 ServiceNow, Inc. All Rights Reserved. 51

The Challenges for Web Agents Remain tall
We are, after all, dealing with the World Wild Web

How to efficiently learn from

Main hurdles demonstrations and mistakes?

• Long context understanding

• Long-term planning

• Learning and adaptability

• Multimodality

• Cost and efficiency

Potential solution: use RL-inspired
approaches to finetune agent policy
• Safety and alignment (Agent Q uses MCTS + DPO; Putta et al., 2024)

The Challenges for Web Agents Remain tall
We are, after all, dealing with the World Wild Web

Main hurdles Multimodality can be crucial

• Long context understanding

• Long-term planning

• Learning and adaptability

• Multimodality

• Cost and efficiency

Humans rely on vision, but current agents
fail to make use of that modality
• Safety and alignment

The Challenges for Web Agents Remain tall
We are, after all, dealing with the World Wild Web

Main hurdles Web Agents must produce more

value than they cost to be viable
• Long context understanding
• Shrinking context size (e.g., retrieval)

• Long-term planning • Multi-agent architectures

• Smaller LLMs that solve low-level tasks
(e.g., a “date picker agent”)
• Learning and adaptability
• Finetuning smaller LLMs to improve
their performance
• Multimodality

• Cost and efficiency

• Safety and alignment

The Challenges for Web Agents Remain tall
We are, after all, dealing with the World Wild Web

• Website contents can trip over

Main hurdles agent LLM guardrails
— Text visible to LLM but not human
• Long context understanding (e.g., white on white)
— Random-character, ascii art and
• Long-term planning tokenizer attacks
— Even worse for multimodal models
• Learning and adaptability
• 2026’s fraudsters
• Multimodality — Malicious Chrome plugin detects
when you log onto your bank,
executes wire transfer abroad
• Cost and efficiency

• Safety and alignment

Defining Agents
Background Enterprise workflow concepts

AGE NDA
Architecture
API Agents TapeAgents

Web Agent Concepts

Web Agents WorkArena
BrowserGym and AgentLab

Agents in the Automating enterprise workflows

Workplace Agents and the future of work

Resources to Dig Further

AI Agents are poised to
change the nature of work

Today’s Enterprise Workflows Remain Quite Manual
Jon needs (even with generative AI)
access to a KB

Do we have a Is there a What access Assign Jon the Resolve

KB that similar control does right role
explains what incident? Jon have? I generated
to do? resolution Close
notes
Human Agent

Resolution
Generation skill

Access issue – With AI Agents
Jon needs
access to a KB

To solve this issue I need to –

1. Find what access Jon has Looks good. I’m going to assign Ok Close
2. Find KB permissions Go ahead Jon ‘Knowledge’
3. Give Jon the right access role
Can you approve?

Human Agent

What access
Who can I Generate a Jon has?
assign this plan based on
issue to? relevant KB
Assign Jon role Document
and similar
‘Knowledge’ resolution steps
incidents
User Access AI Agents

What are the

KB Permissions?
AI Agents Next-best-action AI User Access AI Documenter AI
Orchestrator Agents Agents Agents

CHEAP
Development & maintenance
General-purpose
Agents (future)
Web Agents
(late 2024)

cost effectiveness LC/NC

RPA
Low Value / Low Volume
region for individual tasks
(ability to address)
Traditional
EXPENSIVE

development

API UI/UX Natural language Tool Tool

calls interaction reasoning use making

IDIOT CAPABLE
Agent ability, generalizability,
& adaptability
© 2024 ServiceNow, Inc. All Rights Reserved. 60
Researching Online Database Management Digital Marketing Camp
Data Analysis Technical Support Podcast Production
Email Communication Legal Research Software Testing and Q
Writing Reports Cybersecurity Monitoring Remote Team Manage
Project Planning Human Resources Tasks Event Planning and Ma
Presentation Creation Blogging and Content CreationMobile App Developme
Graphic Design Market Analysis Risk Management
WorkArena can help us
Website Management Inventory Management
Social Media ManagementDigital Asset Management
Intellectual Property Ma
Environmental Sustaina
What is Knowledge
understand
Video Editing
Programming
Work?
the future of
Strategic Planning Real Estate Analysis
Document Review and Editing Supply Chain Optimizat
Knowledge Work
Online Collaboration
Customer Relationship
Meeting Scheduling and
Coordination
Health Informatics
Scientific Research and
Management (CRM) Task and Workflow Automation E-commerce Managem
Financial Planning Cloud Computing Management Ethical Hacking and Pe
E-learning Development Knowledge Management Testing
Business Intelligence (BI) 3D Modeling and CAD
Voice Over Production Language Translation a
Accessibility Testing Localization
O*NET: Cataloging the Workforce

Source: Bureau of Labor Statistics Standard Occupational Classification and O*NET © 2024 ServiceNow, Inc. All Rights Reserved. 62
Technology adoption takes time and
uncertainty for generative AI adoption remains high

Early scenario
(including GenAI) Adoption Drivers
Late scenario • Technological maturity
(including GenAI)
• Integration speed
• Relative cost of
technology vs labor
• Technology diffusion
rate
• Supply constraints
(e.g. GPUs, regulatory)

Top-Down Assessment Bottom-Up Assessment

• Analyze each task for each • Map each task in O*NET to
job in O*NET benchmark tasks in a
knowledge work proxy such as
• For each, “guess” what the WorkArena
task looks like with AI, and
decide if human still needed • Track ability of AI to successfully
complete the tasks and map
• Can be automated (GPT-4) back to job automation prob.
• Advantage: wide coverage • Advantage: detailed picture
• Challenge: blurry picture • Challenge: spotty coverage

Envisioning AI Augmentation
to Empower Workers

Centaur Cyborg
• Strategic separation between • Task-level collaboration, where
“human tasks” and “AI tasks” the human can ask the AI to:
• From human intuition, AI can: — Assume a certain persona
— Map problem domain — Learn a task from examples
— Gather information — Give a logical explanation
— Handle data analysis — Provide a deep dive
— Refine human content — Respond to contradictions
and push-back

Source: Navigating the Jagged Technological Frontier: © 2024 ServiceNow, Inc. All Rights Reserved. 65
Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality (HBS, BCG)
Defining Agents
Background Enterprise workflow concepts

AGE NDA
Architecture
API Agents TapeAgents

Web Agent Concepts

Web Agents WorkArena
BrowserGym and AgentLab

Agents in the Automating enterprise workflows

Workplace Agents and the future of work

Resources to Dig Further

LLM Agent LangChain (Oct) AutoGPT (Mar) AutoGen (Sept) Crew.ai (Dec)

Frameworks &
• Enables chaining • Automates tasks with • Multi-agent framework • Collaborative agent
LIBRARIES / FRAMEWORKS

multiple LLM calls for autonomous agents. for building workflows teams with specific
multi-step workflows. with AI agents. roles and goals.
• Uses a feedback loop

Benchmarks
• Various tools like APIs, to refine outputs • AutoGen agents can • Sequential and
databases, and based on goals and work together, hierarchical processes.
external data sources. constraints. integrating LLMs, tools, • Versatile tools with
and human inputs.
• Memory mgmt, • Unlike LangChain, error handling and
allowing context emphasizes • Unlike LangChain and caching capabilities.
retention across autonomous decision- AutoGPT, emphasize • Allows human
multiple interactions. making over structured multi-agent interaction oversight & interaction
workflow chaining. and human-AI collab

2022 2023 2024

ToolBench (May) AgentBench (Aug) MLAgentBench (Oct) GAIA (Nov)

• Evaluate tool use with 8 environments: • 13 tasks for ML • Q&A: need reasoning,
diverse real-world tasks experimentation, from multi-modality, tools.
• operating system
BENCHMARKS

• 8 tasks, e.g.: Open CIFAR-10 to BabyLM. • Humans: 92% vs. 15%

• database
Weather, Trip booking, • Tasks include file for GPT-4 with plugins.
• knowledge graph
Google Sheets operations, run code, • 466 questions; 166 with
• digital card game output inspection.
• Can boost open- detailed traces, 300
source LLMs to 90% • lateral thinking puzzles • Best is Claude v3 Opus retained for
success rate, • house-holding 37.5% avg success rate leaderboard.
matching GPT-4 in 4 • web shopping • Challenges: long-term • Questions have
out of 8 tasks planning, hallucination unambiguous
• web browsing © 2024 ServiceNow, Inc. Allanswer.
Rights Reserved. 67
Crew.ai (Dec) LangGraph (Jan) LlamaIndex TapeAgents (Oct)
• Collaborative agent • Graph-based: agent Workflows (Aug) • Single unifying
LIBRARIES / FRAMEWORKS

teams with specific workflows as nodes • Event-driven abstraction (the

roles and goals. and edges architecture “tape”) which is both
a log of events and
• Sequential and • Stateful design • Provides state the state of the system
hierarchical processes. • Supports human-agent management and
enables cyclical flows • Enables complex
• Versatile tools with collaboration
agent optimization
error handling and • Supports tools like Arize
• Real-time streaming such as prompt tuning
caching capabilities. Phoenix for debugging
• Allows granular control and distillation from
• Allows human complex teacher to
oversight & interaction simpler student

2024

ch (Oct) GAIA (Nov) SWE-Bench (Apr) 𝜏-Bench (Jun) InsightBench (Oct)

L • Q&A: need reasoning, • Evaluate AI agents on • Emulate conversations • Evaluate agents on
on, from multi-modality, tools. real-world software between a LLM user end-to-end data
BENCHMARKS

abyLM. engineering tasks and a LLM agent science workflows,

• Humans: 92% vs. 15%
provided with domain- measuring cross-
file for GPT-4 with plugins. • 2,294 problems from
specific API tools and domain generalization
n code, real GitHub issues and
• 466 questions; 166 with policy guidelines
tion. PR across 12 popular • Task planning,
detailed traces, 300
Python repositories • 175 tasks from retail execution, reasoning
v3 Opus retained for
and airline domains
ccess rate leaderboard. • Code generation, bug • Incomplete data &
fixing, design • Top models still at sub- ambiguous goals
ong-term • Questions have
par performance
ucination unambiguous answer. • Evals on correctness, © 2024 ServiceNow, Inc. All Rights Reserved. 68
efficiency, collab
Web Agent Learning to Control
Computers (DM)
WebArena (CMU)
• Realistic benchmark,
VisualWebArena
• Benchmark that needs
WebLINX (McGill)
• Conversational web

Research
• Control computers w/ 812 tasks, 6 domains visual comprehension agent navigation
keyboard & mouse • Long-horizon tasks • Test visual & reasoning • 2337 expert demos on
from NL instructions skills of web agents 155 real-world websites
• Best GPT-4: 11% solve

Milestones
• MiniWob++ through RL rate vs 78% for humans • 910 tasks, 3 domains • Visual models not best;
with computer-human fine-tuning is key
interactions

2017 2021 2022 2023 2024

World of Bits (WoB) WebGPT (OpenAI) Mind2Web (Ohio) WebAgent (Google) WebVoyager (Ten¢)
• First widely available • Fine-tuned GPT-3 for • Benchmark of realistic • Combine 2 LLMs to • Completes tasks on
web benchmark QA with web browsing web tasks from NL simplify huge HTML, real websites using
• Simplified tasks • Evaluated on “Explain plan solution, create textual+visual inputs
• Interaction traces
code talking to web
• 100 tasks Like I’m 5” Reddit Qs + • 2,350 tasks from 137 • New benchmark:
browser; no pixels
TruthfulQA dataset 15 websites, automatic
websites, 31 domains
• Can be solved by RL • MiniWoB & Mind2Web GPT-4V-based eval.
MU) VisualWebArena WebLINX (McGill) WorkArena OSWorld WorkArena++
mark, • Benchmark that needs • Conversational web (ServiceNow) • 369 computer tasks of (ServiceNow)
mains visual comprehension agent navigation • Basic tasks that a real web and desktop • Compositional tasks
knowledge worker apps in open domains with much higher
sks • Test visual & reasoning • 2337 expert demos on
skills of web agents 155 real-world websites must carry out • OS file I/O + workflows difficulty than
solve WorkArena
humans • 910 tasks, 3 domains • Visual models not best; • Implemented on the spanning multiple
ServiceNow platform applications • Today’s best models
fine-tuning is key
get single-digit
performance, with
huge room for
improvement

2024

WebAgent (Google) WebVoyager (Ten¢) WebCanvas (CMU) AssistantBench NNetNav (Stanford)

• Combine 2 LLMs to • Completes tasks on • Handles dynamic web • Diverse web tasks: • Training web agents
simplify huge HTML, real websites using search, navigation, entirely through
• Mind2Web-Live, a
plan solution, create textual+visual inputs refined Mind2Web: 542
data extraction, synthetic demos
code talking to web tasks, 2439 evaluation interaction • Web trajectory rollouts
• New benchmark:
browser; no pixels states • 214 tasks that can be are processed by an
15 websites, automatic
• MiniWoB & Mind2Web GPT-4V-based eval. auto-evaluated LLM to be retroactively
labeled into instruction
“Hey Lecture Agent, create
our 2025 Class Presentation!”
Tape Agents WorkArena BrowserGym AgentLab

Q&A
Many thanks to the following colleagues:
Alexandre Lacoste Dzmitry Bahdanau Oleh Shliazhko
Maxime Gasse Nicolas Gonthier Karam Ghanem
Massimo Caccia Gabriel Huang Soham Parikh
Léo Boisvert Ehsan Kamalloo Mitul Tiwari
Megh Thakkar Rafael Pardinas Quaizar Vohra
Tom Marty Jordan Prince Tremblay David Vazquez
Rim Assouel Alex Piché Valérie Bécaert
Thibault Le Sellier De Chezelles Torsten Scholak

Building Advanced AI Agent Systems: From Fundamentals To Scalable Architecture
No ratings yet
Building Advanced AI Agent Systems: From Fundamentals To Scalable Architecture
18 pages
Build Intelligent AI Agents Today
100% (1)
Build Intelligent AI Agents Today
17 pages
5 Pretraining On Unlabeled Data - Build A Large Language Model (From Scratch)
No ratings yet
5 Pretraining On Unlabeled Data - Build A Large Language Model (From Scratch)
61 pages
RAG - Genai
No ratings yet
RAG - Genai
11 pages
Context Engineering
100% (1)
Context Engineering
165 pages
Generative AI A Transformative Force in Business Intelligence
No ratings yet
Generative AI A Transformative Force in Business Intelligence
7 pages
Lesson 01 Getting Started With GenAI
No ratings yet
Lesson 01 Getting Started With GenAI
48 pages
Generative AI Tutorial
No ratings yet
Generative AI Tutorial
5 pages
Guide To Agentic AI Multi Agent Pattern 1741332267
No ratings yet
Guide To Agentic AI Multi Agent Pattern 1741332267
11 pages
Building Intelligent Agents With Semantic Kernel: A Comprehensive Guide
No ratings yet
Building Intelligent Agents With Semantic Kernel: A Comprehensive Guide
16 pages
IBM - Orchestrating Agentic AI For Business Operations
100% (1)
IBM - Orchestrating Agentic AI For Business Operations
32 pages
The Anatomy of AI Agents
No ratings yet
The Anatomy of AI Agents
11 pages
Langchain PDF Reader
100% (1)
Langchain PDF Reader
15 pages
Introduction To Generative AI LLM
50% (2)
Introduction To Generative AI LLM
9 pages
RAG and Vector Database Guide
No ratings yet
RAG and Vector Database Guide
18 pages
LangChain For JavaScript Developers How To Integrate LLMs Into Javascript Web Apps (Daniel Nastase) (Z-Library)
No ratings yet
LangChain For JavaScript Developers How To Integrate LLMs Into Javascript Web Apps (Daniel Nastase) (Z-Library)
120 pages
Chunking Techniques in RAG Systems
No ratings yet
Chunking Techniques in RAG Systems
12 pages
The Best LLMs Cheatsheet - Part 1
No ratings yet
The Best LLMs Cheatsheet - Part 1
16 pages
AI Agents Level 1 - Client Presentation
No ratings yet
AI Agents Level 1 - Client Presentation
19 pages
Fine-Tuning AI Models Explained
No ratings yet
Fine-Tuning AI Models Explained
12 pages
LLM and Gen AI
No ratings yet
LLM and Gen AI
4 pages
Langchain Retrieval Augmented Generation White Paper
100% (1)
Langchain Retrieval Augmented Generation White Paper
23 pages
Overview of Small Language Models
No ratings yet
Overview of Small Language Models
3 pages
GTC'24 Special Event - Build A RAG-powered Application With A Human Voice Interface (SE62869) - Deck - FINAL - 1714408879420001sjpp
No ratings yet
GTC'24 Special Event - Build A RAG-powered Application With A Human Voice Interface (SE62869) - Deck - FINAL - 1714408879420001sjpp
108 pages
Langchain 101
100% (2)
Langchain 101
4 pages
Agentic Workflows in Large Language Models
No ratings yet
Agentic Workflows in Large Language Models
6 pages
Retrieval-Augmented LMs Overview
No ratings yet
Retrieval-Augmented LMs Overview
120 pages
Python Notes
No ratings yet
Python Notes
279 pages
eBook-The Ultimate Guide To Using LLMs With Speech Recognition To Build Voice Apps
100% (1)
eBook-The Ultimate Guide To Using LLMs With Speech Recognition To Build Voice Apps
66 pages
Agentic RAGs 1740054167
No ratings yet
Agentic RAGs 1740054167
10 pages
A Practical Primer To AI Agents 1736197641
100% (1)
A Practical Primer To AI Agents 1736197641
23 pages
My CV
No ratings yet
My CV
2 pages
Generative AI - 48 Hours TOC
67% (3)
Generative AI - 48 Hours TOC
4 pages
IDE204 - TimeGPT Generative AI For Time Series
100% (1)
IDE204 - TimeGPT Generative AI For Time Series
36 pages
Joshua K. Cage - Python Transformers by Huggingface Hands On - 101 Practical Implementation Hands-On of ALBERT - ViT - BigBird and Other Latest Models With Huggingface Transformers
No ratings yet
Joshua K. Cage - Python Transformers by Huggingface Hands On - 101 Practical Implementation Hands-On of ALBERT - ViT - BigBird and Other Latest Models With Huggingface Transformers
186 pages
Data Security in Agentic AI Solutions
No ratings yet
Data Security in Agentic AI Solutions
2 pages
Pragyashal AI Agents Workshop 2025
No ratings yet
Pragyashal AI Agents Workshop 2025
3 pages
AI Multi-Agent Workflow with LangChain
No ratings yet
AI Multi-Agent Workflow with LangChain
13 pages
Building Generative Ai-Powered Apps (Aarushi Kansal) (Z-Library)
No ratings yet
Building Generative Ai-Powered Apps (Aarushi Kansal) (Z-Library)
150 pages
Databricks Guide To Agent Systems
No ratings yet
Databricks Guide To Agent Systems
16 pages
MCP 9
No ratings yet
MCP 9
17 pages
Exploring The Shift From Generative AI To Agentic AI
No ratings yet
Exploring The Shift From Generative AI To Agentic AI
20 pages
Transformers For Natural Language Processing and Computer Vision
No ratings yet
Transformers For Natural Language Processing and Computer Vision
150 pages
AI Reasoning with KAG & RAG
No ratings yet
AI Reasoning with KAG & RAG
13 pages
Mastering RAG: A Comprehensive Guide
100% (1)
Mastering RAG: A Comprehensive Guide
15 pages
Building Affordable AI Knowledge Graphs
No ratings yet
Building Affordable AI Knowledge Graphs
31 pages
AgenticAIv2.0 Compressed
100% (1)
AgenticAIv2.0 Compressed
25 pages
Prompt Engineering by Google - Cheat Sheet - April 2025
No ratings yet
Prompt Engineering by Google - Cheat Sheet - April 2025
1 page
Multi-Agentic RAG With Hugging Face Code Agents - by Gabriele Sgroi, PHD - Dec, 2024 - Towards Data Science
No ratings yet
Multi-Agentic RAG With Hugging Face Code Agents - by Gabriele Sgroi, PHD - Dec, 2024 - Towards Data Science
42 pages
Sheffield R. Generative AI Development With Langchain. The Ultimate Guide 2023
100% (3)
Sheffield R. Generative AI Development With Langchain. The Ultimate Guide 2023
134 pages
PROMPT Engineering
No ratings yet
PROMPT Engineering
16 pages
3502 Generative AI A To Z
No ratings yet
3502 Generative AI A To Z
88 pages
Fine Tuning Techniques For Large Language Models LLMs
100% (4)
Fine Tuning Techniques For Large Language Models LLMs
15 pages
20 Essential LLM Guardrails for AI Safety
100% (1)
20 Essential LLM Guardrails for AI Safety
12 pages
Agentic AI Vs AI Agents - What Is The Difference
No ratings yet
Agentic AI Vs AI Agents - What Is The Difference
4 pages
A Taxonomy of Retrieval Augmented Generation
100% (5)
A Taxonomy of Retrieval Augmented Generation
56 pages
Context Engineering Guide
No ratings yet
Context Engineering Guide
12 pages
Data Science & Generative AI Technologies
100% (1)
Data Science & Generative AI Technologies
97 pages
Beginner's Guide To AI Agents - ServiceNow
0% (1)
Beginner's Guide To AI Agents - ServiceNow
8 pages
Types of Agents
No ratings yet
Types of Agents
16 pages
Grace Complete Project
No ratings yet
Grace Complete Project
31 pages
English 10 Q2, W4 DLP
No ratings yet
English 10 Q2, W4 DLP
6 pages
Visual Research Methods in Educational Research 1st Edition Julianne Moss Ebook All Chapters PDF
100% (1)
Visual Research Methods in Educational Research 1st Edition Julianne Moss Ebook All Chapters PDF
55 pages
Critical Genre Analysis
No ratings yet
Critical Genre Analysis
25 pages
Classroom Digital Storytelling Guide
No ratings yet
Classroom Digital Storytelling Guide
17 pages
Purposive Communication
No ratings yet
Purposive Communication
20 pages
Digital Citizenship - Leandro Paladino
No ratings yet
Digital Citizenship - Leandro Paladino
48 pages
Multimodality Translation and Comics
No ratings yet
Multimodality Translation and Comics
21 pages
Intro to Functional Grammar
No ratings yet
Intro to Functional Grammar
14 pages
Lasttttttttttttttttttttttttttttttttttttttttt Styl PPT Edditing
No ratings yet
Lasttttttttttttttttttttttttttttttttttttttttt Styl PPT Edditing
19 pages
Purposive Communication Course Guide Term 1 AY 22-23
No ratings yet
Purposive Communication Course Guide Term 1 AY 22-23
11 pages
Gold Rush Letter
No ratings yet
Gold Rush Letter
19 pages
Transdisciplinary SLA Framework
No ratings yet
Transdisciplinary SLA Framework
29 pages
Creative Writing Thesis Abstract
100% (4)
Creative Writing Thesis Abstract
7 pages
Agentic Ai: The New Frontier in Genai
No ratings yet
Agentic Ai: The New Frontier in Genai
22 pages
David Barton Literacy
50% (2)
David Barton Literacy
16 pages
English IV Syllabus
No ratings yet
English IV Syllabus
6 pages
Academic Profile: K. Shannon Howard
No ratings yet
Academic Profile: K. Shannon Howard
7 pages
Tabu CSEV
No ratings yet
Tabu CSEV
21 pages
Ge 2 Purposive Communication W Interactive Learning
No ratings yet
Ge 2 Purposive Communication W Interactive Learning
7 pages
Snell Hornby2009
No ratings yet
Snell Hornby2009
12 pages
Week 3
No ratings yet
Week 3
4 pages
Swat Anaylisis PDF
No ratings yet
Swat Anaylisis PDF
23 pages
Kaindl, 2004
100% (1)
Kaindl, 2004
260 pages
Critical Discourse Analysis
No ratings yet
Critical Discourse Analysis
22 pages
The Dawn of LMMs Preliminary Explorations With GPT 4V Ision Arxiv 2309 17421v2 Cs CV 11 Oct 2023 1st Edition Zhengyuan Yang Full Access
No ratings yet
The Dawn of LMMs Preliminary Explorations With GPT 4V Ision Arxiv 2309 17421v2 Cs CV 11 Oct 2023 1st Edition Zhengyuan Yang Full Access
77 pages
Dabiq to Rumiyah: ISIS Propaganda Shift
No ratings yet
Dabiq to Rumiyah: ISIS Propaganda Shift
19 pages
Cheryl E. Ball's Academic Profile
No ratings yet
Cheryl E. Ball's Academic Profile
17 pages
Understanding Language Registers
No ratings yet
Understanding Language Registers
24 pages
The Academic Writer (A Brief Rhetoric) (5th Edition) Ede
No ratings yet
The Academic Writer (A Brief Rhetoric) (5th Edition) Ede
10 pages