Force done output #835

Merged · 20 commits · Feb 23, 2025

Commits
c07b5c9
Add last step warning and enhance action result tracking
MagMueller Feb 23, 2025
ddd8648
Refactor message position handling and improve last step messaging
MagMueller Feb 23, 2025
52e8c96
Track token count before model output generation
MagMueller Feb 23, 2025
e1b5119
Enhance agent history tracking
MagMueller Feb 23, 2025
cc34c45
Improve version detection
MagMueller Feb 23, 2025
d881b9f
Enhance agent task completion and max steps handling
MagMueller Feb 23, 2025
601151d
Refactor task completion logging in Agent service
MagMueller Feb 23, 2025
bdb59fe
Refactor agent service tool calling and message handling
MagMueller Feb 23, 2025
12ed8da
Removed parameter
MagMueller Feb 23, 2025
58cd755
Refine action parsing and validation in agent service
MagMueller Feb 23, 2025
061e701
Add asyncio marker and refactor test fixtures
MagMueller Feb 23, 2025
10de966
Enhance test model search and API key handling
MagMueller Feb 23, 2025
88f018a
Update system prompt and service for task completion handling
MagMueller Feb 23, 2025
f2f8cf8
Eval models
MagMueller Feb 23, 2025
84f5f1c
Instruction for unfinished tasks
MagMueller Feb 23, 2025
ac07642
Improve type safety and clarify task completion instructions
MagMueller Feb 23, 2025
d129a7f
Refine last step instructions and model testing configuration
MagMueller Feb 23, 2025
65e165c
Simplify last step instructions for agent service
MagMueller Feb 23, 2025
17fcfe2
Refine last step warning message for agent service
MagMueller Feb 23, 2025
5f84512
Enhance last step warning message clarity
MagMueller Feb 23, 2025
8 changes: 5 additions & 3 deletions browser_use/agent/message_manager/service.py
@@ -162,7 +162,7 @@ def add_model_output(self, model_output: AgentOutput) -> None:
# empty tool response
self.add_tool_message(content='')

def add_plan(self, plan: Optional[str], position: int = -1) -> None:
def add_plan(self, plan: Optional[str], position: int | None = None) -> None:
if plan:
msg = AIMessage(content=plan)
self._add_message_with_tokens(msg, position)
@@ -182,8 +182,10 @@ def get_messages(self) -> List[BaseMessage]:

return msg

def _add_message_with_tokens(self, message: BaseMessage, position: int = -1) -> None:
"""Add message with token count metadata"""
def _add_message_with_tokens(self, message: BaseMessage, position: int | None = None) -> None:
"""Add message with token count metadata
position: None for last, -1 for second last, etc.
"""

# filter out sensitive data from the message
if self.settings.sensitive_data:
4 changes: 2 additions & 2 deletions browser_use/agent/message_manager/views.py
@@ -68,9 +68,9 @@ class MessageHistory(BaseModel):

model_config = ConfigDict(arbitrary_types_allowed=True)

def add_message(self, message: BaseMessage, metadata: MessageMetadata, position: int = -1) -> None:
def add_message(self, message: BaseMessage, metadata: MessageMetadata, position: int | None = None) -> None:
"""Add message with metadata to history"""
if position == -1:
if position is None:
self.messages.append(ManagedMessage(message=message, metadata=metadata))
else:
self.messages.insert(position, ManagedMessage(message=message, metadata=metadata))
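For reference, here is a minimal standalone sketch of the new position semantics shown above: None appends at the end, while an integer index inserts at that position. The classes are simplified stand-ins, not the real browser_use types:

```python
class MessageHistory:
    """Toy stand-in illustrating the position handling introduced in this PR."""

    def __init__(self) -> None:
        self.messages: list[str] = []

    def add_message(self, message: str, position: int | None = None) -> None:
        if position is None:
            self.messages.append(message)  # default: append at the end
        else:
            self.messages.insert(position, message)  # e.g. -1 inserts before the last message


history = MessageHistory()
history.add_message("system")
history.add_message("state")
history.add_message("plan", position=-1)  # the plan lands before the last state message
print(history.messages)  # ['system', 'plan', 'state']
```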
141 changes: 98 additions & 43 deletions browser_use/agent/service.py
@@ -158,32 +158,28 @@ def __init__(
planner_interval=planner_interval,
)

# Initialize state
self.state = injected_agent_state or AgentState()

# Action setup
self._setup_action_models()
self._set_browser_use_version_and_source()
self.initial_actions = self._convert_initial_actions(initial_actions) if initial_actions else None

# Model setup
self._set_model_names()
self.tool_calling_method = self.set_tool_calling_method(self.settings.tool_calling_method)

# Initialize state
self.state = injected_agent_state or AgentState()

# for models without tool calling, add available actions to context
available_actions = controller.registry.get_prompt_description()
if self.model_name == 'deepseek-reasoner' or self.model_name.startswith('deepseek-r1'):
# add to context that we are using deepseek
if message_context:
message_context += f'\n\nAvailable actions: {available_actions}'
else:
message_context = f'Available actions: {available_actions}'
self.available_actions = self.controller.registry.get_prompt_description()

self.tool_calling_method = self._set_tool_calling_method()
self.settings.message_context = self._set_message_context()

# Initialize message manager with state
self._message_manager = MessageManager(
task=task,
system_message=self.settings.system_prompt_class(
available_actions,
self.available_actions,
max_actions_per_step=self.settings.max_actions_per_step,
).get_system_message(),
settings=MessageManagerSettings(
@@ -222,22 +218,40 @@ def __init__(
if self.settings.save_conversation_path:
logger.info(f'Saving conversation to {self.settings.save_conversation_path}')

def _set_message_context(self) -> str | None:
if self.tool_calling_method == 'raw':
if self.settings.message_context:
self.settings.message_context += f'\n\nAvailable actions: {self.available_actions}'
else:
self.settings.message_context = f'Available actions: {self.available_actions}'
return self.settings.message_context

def _set_browser_use_version_and_source(self) -> None:
"""Get the version and source of the browser-use package (git or pip in a nutshell)"""
try:
import pkg_resources
# First check for repository-specific files
repo_files = ['.git', 'README.md', 'docs', 'examples']
package_root = Path(__file__).parent.parent.parent

version = pkg_resources.get_distribution('browser-use').version
source = 'pip'
except Exception:
try:
import subprocess
# If all of these files/dirs exist, it's likely from git
if all(Path(package_root / file).exists() for file in repo_files):
try:
import subprocess

version = subprocess.check_output(['git', 'describe', '--tags']).decode('utf-8').strip()
version = subprocess.check_output(['git', 'describe', '--tags']).decode('utf-8').strip()
except Exception:
version = 'unknown'
source = 'git'
except Exception:
version = 'unknown'
source = 'unknown'
else:
# If no repo files found, try getting version from pip
import pkg_resources

version = pkg_resources.get_distribution('browser-use').version
source = 'pip'
except Exception:
version = 'unknown'
source = 'unknown'

logger.debug(f'Version: {version}, Source: {source}')
self.version = version
self.source = source
@@ -268,9 +282,16 @@ def _setup_action_models(self) -> None:
# Create output model with the dynamic actions
self.AgentOutput = AgentOutput.type_with_custom_actions(self.ActionModel)

def set_tool_calling_method(self, tool_calling_method: Optional[ToolCallingMethod]) -> Optional[ToolCallingMethod]:
# used to force the done action when max_steps is reached
self.DoneActionModel = self.controller.registry.create_action_model(include_actions=['done'])
self.DoneAgentOutput = AgentOutput.type_with_custom_actions(self.DoneActionModel)

def _set_tool_calling_method(self) -> Optional[ToolCallingMethod]:
tool_calling_method = self.settings.tool_calling_method
if tool_calling_method == 'auto':
if self.chat_model_library == 'ChatGoogleGenerativeAI':
if self.model_name == 'deepseek-reasoner' or self.model_name.startswith('deepseek-r1'):
return 'raw'
elif self.chat_model_library == 'ChatGoogleGenerativeAI':
return None
elif self.chat_model_library == 'ChatOpenAI':
return 'function_calling'
@@ -304,6 +325,7 @@ async def step(self, step_info: Optional[AgentStepInfo] = None) -> None:
model_output = None
result: list[ActionResult] = []
step_start_time = time.time()
tokens = 0

try:
state = await self.browser_context.get_state()
@@ -318,7 +340,18 @@ async def step(self, step_info: Optional[AgentStepInfo] = None) -> None:
# add plan before last state message
self._message_manager.add_plan(plan, position=-1)

if step_info and step_info.is_last_step():
# Add last step warning if needed
msg = 'Now comes your last step. Use only the "done" action now. No other actions - so here your action sequence must have length 1.'
msg += '\nIf the task is not yet fully finished as requested by the user, set success in "done" to false! E.g. if not all steps are fully completed.'
msg += '\nIf the task is fully finished, set success in "done" to true.'
msg += '\nInclude everything you found out for the ultimate task in the done text.'
logger.info('Last step finishing up')
self._message_manager._add_message_with_tokens(HumanMessage(content=msg))
self.AgentOutput = self.DoneAgentOutput

input_messages = self._message_manager.get_messages()
tokens = self._message_manager.state.history.current_tokens

try:
model_output = await self.get_next_action(input_messages)
@@ -383,7 +416,7 @@ async def step(self, step_info: Optional[AgentStepInfo] = None) -> None:
step_number=self.state.n_steps,
step_start_time=step_start_time,
step_end_time=step_end_time,
input_tokens=self._message_manager.state.history.current_tokens,
input_tokens=tokens,
)
self._make_history_item(model_output, state, result, metadata)

@@ -454,20 +487,29 @@ def _remove_think_tags(self, text: str) -> str:
"""Remove think tags from text"""
return re.sub(self.THINK_TAGS, '', text)

def _convert_input_messages(self, input_messages: list[BaseMessage]) -> list[BaseMessage]:
"""Convert input messages to the correct format"""
if self.model_name == 'deepseek-reasoner' or self.model_name.startswith('deepseek-r1'):
return convert_input_messages(input_messages, self.model_name)
else:
return input_messages

@time_execution_async('--get_next_action (agent)')
async def get_next_action(self, input_messages: list[BaseMessage]) -> AgentOutput:
"""Get next action from LLM based on current state"""
if self.model_name == 'deepseek-reasoner' or self.model_name.startswith('deepseek-r1'):
converted_input_messages = convert_input_messages(input_messages, self.model_name)
output = self.llm.invoke(converted_input_messages)
output.content = self._remove_think_tags(str(output.content))
input_messages = self._convert_input_messages(input_messages)

if self.tool_calling_method == 'raw':
output = self.llm.invoke(input_messages)
# TODO: currently invoke does not return reasoning_content, we should override invoke
output.content = self._remove_think_tags(str(output.content))
try:
parsed_json = extract_json_from_model_output(output.content)
parsed = self.AgentOutput(**parsed_json)
except (ValueError, ValidationError) as e:
logger.warning(f'Failed to parse model output: {output} {str(e)}')
raise ValueError('Could not parse response.')

elif self.tool_calling_method is None:
structured_llm = self.llm.with_structured_output(self.AgentOutput, include_raw=True)
response: dict[str, Any] = await structured_llm.ainvoke(input_messages) # type: ignore
Expand All @@ -480,8 +522,9 @@ async def get_next_action(self, input_messages: list[BaseMessage]) -> AgentOutpu
if parsed is None:
raise ValueError('Could not parse response.')

# cut the number of actions to max_actions_per_step
parsed.action = parsed.action[: self.settings.max_actions_per_step]
# cut the number of actions to max_actions_per_step if needed
if len(parsed.action) > self.settings.max_actions_per_step:
parsed.action = parsed.action[: self.settings.max_actions_per_step]

log_response(parsed)
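The 'raw' branch above relies on extract_json_from_model_output(), which is not part of this diff. A hypothetical sketch of such a helper, shown only to illustrate the parsing step (the actual browser_use implementation may differ):

```python
import json


def extract_json_from_model_output(content: str) -> dict:
    """Hypothetical: pull the first JSON object out of a raw LLM completion."""
    # Models often wrap JSON in markdown fences or extra prose, so take the
    # outermost {...} span of the text and parse that.
    start, end = content.find("{"), content.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in model output")
    return json.loads(content[start : end + 1])
```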

@@ -517,8 +560,7 @@ async def take_step(self) -> tuple[bool, bool]:
if not await self._validate_output():
return True, False

logger.info('✅ Task completed successfully')

await self.log_completion()
if self.register_done_callback:
await self.register_done_callback(self.state.history)

@@ -554,16 +596,15 @@ async def run(self, max_steps: int = 100) -> AgentHistoryList:
if self.state.stopped: # Allow stopping while paused
break

await self.step()
step_info = AgentStepInfo(step_number=step, max_steps=max_steps)
await self.step(step_info)

if self.state.history.is_done():
if self.settings.validate_output and step < max_steps - 1:
if not await self._validate_output():
continue

logger.info('✅ Task completed successfully')
if self.register_done_callback:
await self.register_done_callback(self.state.history)
await self.log_completion()
break
else:
logger.info('❌ Failed to complete task in maximum steps')
@@ -573,10 +614,13 @@ async def run(self, max_steps: int = 100) -> AgentHistoryList:
self.telemetry.capture(
AgentEndTelemetryEvent(
agent_id=self.state.agent_id,
success=self.state.history.is_done(),
is_done=self.state.history.is_done(),
success=self.state.history.is_successful(),
steps=self.state.n_steps,
max_steps_reached=self.state.n_steps >= max_steps,
errors=self.state.history.errors(),
total_input_tokens=self.state.history.total_input_tokens(),
total_duration_seconds=self.state.history.total_duration_seconds(),
)
)

@@ -686,6 +730,17 @@ class ValidationResult(BaseModel):
logger.info(f'✅ Validator decision: {parsed.reason}')
return is_valid

async def log_completion(self) -> None:
"""Log the completion of the task"""
logger.info('✅ Task completed')
if self.state.history.is_successful():
logger.info('✅ Successfully')
else:
logger.info('❌ Unfinished')

if self.register_done_callback:
await self.register_done_callback(self.state.history)

async def rerun_history(
self,
history: AgentHistoryList,
@@ -862,15 +917,15 @@ async def _run_planner(self) -> Optional[str]:
]

if not self.settings.use_vision_for_planner and self.settings.use_vision:
last_state_message = planner_messages[-1]
last_state_message: HumanMessage = planner_messages[-1]
# remove image from last state message
new_msg = ''
if isinstance(last_state_message.content, list):
for msg in last_state_message.content:
if msg['type'] == 'text':
new_msg += msg['text']
elif msg['type'] == 'image_url':
continue
if msg['type'] == 'text': # type: ignore
new_msg += msg['text'] # type: ignore
elif msg['type'] == 'image_url': # type: ignore
continue # type: ignore
else:
new_msg = last_state_message.content

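run() now builds an AgentStepInfo for every iteration, and step() uses step_info.is_last_step() to trigger the forced done output. The definition of is_last_step() is not shown in this diff; one plausible shape, assuming the zero-indexed step counter used in run():

```python
from dataclasses import dataclass


@dataclass
class AgentStepInfo:
    step_number: int
    max_steps: int

    def is_last_step(self) -> bool:
        """True on the final iteration, with steps counted from 0 to max_steps - 1."""
        return self.step_number >= self.max_steps - 1
```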
15 changes: 13 additions & 2 deletions browser_use/agent/system_prompt.md
@@ -1,4 +1,5 @@
You are an AI agent designed to automate browser tasks. Your goal is to accomplish the ultimate task following the rules.

# Input Format
Task
Previous steps
@@ -14,12 +15,14 @@ Example:

- Only elements with numeric indexes in [] are interactive
- elements without [] provide only context

# Response Rules
1. RESPONSE FORMAT: You must ALWAYS respond with valid JSON in this exact format:
{{"current_state": {{"evaluation_previous_goal": "Success|Failed|Unknown - Analyze the current elements and the image to check if the previous goals/actions are successful like intended by the task. Mention if something unexpected happened. Shortly state why/why not",
"memory": "Description of what has been done and what you need to remember. Be very specific. Count here ALWAYS how many times you have done something and how many remain. E.g. 0 out of 10 websites analyzed. Continue with abc and xyz",
"next_goal": "What needs to be done with the next immediate action"}},
"action":[{{"one_action_name": {{// action-specific parameter}}}}, // ... more actions in sequence]}}

2. ACTIONS: You can specify multiple actions in the list to be executed in sequence. But always specify only one action name per item. Use maximum {{max_actions}} actions per sequence.
Common action sequences:
- Form filling: [{{"input_text": {{"index": 1, "text": "username"}}}}, {{"input_text": {{"index": 2, "text": "password"}}}}, {{"click_element": {{"index": 3}}}}]
@@ -29,9 +32,11 @@ Common action sequences:
- Only provide the action sequence until an action which changes the page state significantly.
- Try to be efficient, e.g. fill forms at once, or chain actions where nothing changes on the page
- only use multiple actions if it makes sense.

3. ELEMENT INTERACTION:
- Only use indexes of the interactive elements
- Elements marked with "[]Non-interactive text" are non-interactive

4. NAVIGATION & ERROR HANDLING:
- If no suitable elements exist, use other functions to complete the task
- If stuck, try alternative approaches - like going back to a previous page, new search, new tab etc.
@@ -40,19 +45,25 @@ Common action sequences:
- If you want to research something, open a new tab instead of using the current tab
- If captcha pops up, try to solve it - else try a different approach
- If the page is not fully loaded, use wait action

5. TASK COMPLETION:
- Use the done action as the last action as soon as the ultimate task is complete
- Dont use "done" before you are done with everything the user asked you.
- Dont use "done" before you are done with everything the user asked you, except you reach the last step of max_steps.
- If you reach your last step, use the done action even if the task is not fully finished. Provide all the information you have gathered so far. If the ultimate task is completely finished set success to true. If not everything the user asked for is completed set success in done to false!
- If you have to do something repeatedly for example the task says for "each", or "for all", or "x times", count always inside "memory" how many times you have done it and how many remain. Don't stop until you have completed like the task asked you. Only call done after the last step.
- Don't hallucinate actions
- Make sure to include everything the user asked for in the done text parameter. This is what the user will see. Do not just say you are done, but include the requested information of the task.
- Make sure you include everything you found out for the ultimate task in the done text parameter. Do not just say you are done, but include the requested information of the task.

6. VISUAL CONTEXT:
- When an image is provided, use it to understand the page layout
- Bounding boxes with labels on their top right corner correspond to element indexes

7. Form filling:
- If you fill an input field and your action sequence is interrupted, most often something changed e.g. suggestions popped up under the field.

8. Long tasks:
- Keep track of the status and subresults in the memory.

9. Extraction:
- If your task is to find information - call extract_content on the specific pages to get and store the information.
Your responses must be always JSON with the specified format.
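To make the updated rule 5 concrete, here is a hypothetical final-step response, written as a Python dict for readability (the agent would emit the equivalent JSON). The done parameters, text and success, are assumed from the wording above and this PR's changes:

```python
example_last_step_response = {
    "current_state": {
        "evaluation_previous_goal": "Unknown - ran out of steps before checking the last two sites",
        "memory": "Analyzed 8 out of 10 websites. Sites abc and xyz remain.",
        "next_goal": "Report the partial results with the done action",
    },
    "action": [
        {
            "done": {
                # Everything found so far goes into the text the user will see.
                "text": "Collected the requested prices from 8 of 10 websites: ... Sites abc and xyz were not reached.",
                "success": False,  # not everything the user asked for was completed
            }
        }
    ],
}
```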