
Bug: Long-Running Function Calls Break Tool Response Processing in Gemini Live Realtime [Websockets] #1564

@rizified

Description


Environment
pipecat-ai version: 0.0.62
python version: 3.11.11
OS: Windows

Issue description
When using Google Gemini Live Realtime with the websocket implementation, function calls that take more than roughly 6-7 seconds break the normal processing flow. After a long-running function completes, the LLM is unable to process the tool result because the user transcription finishes first and its message ends up as the last message in the context.

This happens because in GeminiMultimodalLiveLLMService, the check elif context.messages and context.messages[-1].get("role") == "tool" fails: the last message's role is "user" instead of "tool", so the LLM is never triggered to process the tool result.
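The failing check can be reproduced in isolation with a toy message list (plain dicts standing in for the actual context messages):

```python
# Simplified stand-ins for context.messages; the real objects are
# OpenAI-style message dicts held by the Gemini Live context.
messages = [
    {"role": "user", "content": "What's the weather?"},
    {"role": "assistant", "tool_calls": [{"type": "function", "id": "call_1"}]},
    {"role": "tool", "content": '{"temp": 21}'},
    # The slow function let the user transcription land last:
    {"role": "user", "content": "What's the weather?"},
]

# This is the guard that gates tool-result processing; with the
# transcription appended last it evaluates to False, so the tool
# result is silently skipped.
should_run_tool_result = bool(messages) and messages[-1].get("role") == "tool"
print(should_run_tool_result)  # False: last role is "user", not "tool"
```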

(Screenshot attached.)

Repro steps

  1. Run the realtime Gemini function call demo with websockets from the samples
  2. Add a sleep(8) in the function handler
  3. Observe that the function result doesn't get processed by the LLM
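For step 2, the delay can be injected into the demo's function handler roughly like this. The handler signature follows the pipecat function-calling samples and should be treated as an assumption; only the added sleep matters for the repro:

```python
import asyncio

SLOW_CALL_SECONDS = 8  # per repro step 2; the bug appears above ~6-7 s


# Hypothetical weather handler mirroring the samples; the added sleep
# simulates a slow backend call and triggers the message-ordering bug.
async def fetch_weather(function_name, tool_call_id, args, llm, context, result_callback):
    # While we sleep, the finalized user transcription is appended to
    # the context and becomes the last message.
    await asyncio.sleep(SLOW_CALL_SECONDS)
    await result_callback({"conditions": "sunny", "temperature": "75"})
```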

Expected behavior
Long-running function calls should still have their results properly processed by the LLM, regardless of timing. The tool message should be the last one processed.

Actual behavior
The user transcription message is placed after the tool result message, which prevents the LLM from processing the tool result.

Logs
See the attached screenshot.

Proposed Fix
I've found a workaround by modifying pipecat.services.gemini_multimodal_live.gemini to detect this condition and rearrange the messages:

Updated code:

 elif isinstance(frame, OpenAILLMContextFrame):
     context: GeminiMultimodalLiveContext = GeminiMultimodalLiveContext.upgrade(
         frame.context
     )
     # Workaround: if the user transcription landed after a function-call
     # result, move it ahead of the tool-call messages so the tool result
     # is last again.
     if (
         self._context
         and len(context.messages) > 3
         and context.messages[-1].get("role") == "user"
         and context.messages[-2].get("role") == "tool"
         and context.messages[-3].get("tool_calls", [{}])[0].get("type") == "function"
     ):
         logger.info(
             "Rearranging messages: moving trailing user transcription "
             "before the function-call messages"
         )
         # Extract the last element (the user transcription)
         user_element = context.messages.pop(-1)
         # Insert it before the assistant tool_calls message
         context.messages.insert(len(context.messages) - 2, user_element)
     # For now, we'll only trigger inference here when either:
     #   1. We have not seen a context frame before
     #   2. The last message is a tool call result
     if not self._context:
         self._context = context
         if frame.context.tools:
             self._tools = frame.context.tools
         await self._create_initial_response()
     elif context.messages and context.messages[-1].get("role") == "tool":
         # Support just one tool call per context frame for now
         tool_result_message = context.messages[-1]
         await self._tool_result(tool_result_message)

This workaround detects a completed function call followed by a trailing user message and rearranges the messages so that the tool result is the last message and gets processed.
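The rearrangement can also be isolated as a pure function for testing (the function name is mine, not pipecat's; the guard and the pop/insert mirror the patch above):

```python
def reorder_late_transcription(messages):
    """If a user transcription landed after a function-call result, move
    it ahead of the tool-call messages so the tool result is last.

    Mutates and returns `messages`. Mirrors the patch's guard:
    [-1] user, [-2] tool, [-3] assistant tool_calls of type "function".
    """
    if (
        len(messages) > 3
        and messages[-1].get("role") == "user"
        and messages[-2].get("role") == "tool"
        and messages[-3].get("tool_calls", [{}])[0].get("type") == "function"
    ):
        user_msg = messages.pop(-1)
        messages.insert(len(messages) - 2, user_msg)
    return messages
```

After reordering, the existing messages[-1].get("role") == "tool" branch fires and the tool result is forwarded to the LLM.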

Regards,
Rizwan
