
feat: handle large stream chunks responses to support Nano Banana [Test Needed]#18884

Merged
tjbck merged 2 commits into open-webui:dev from ShirasawaSama:feature/handle-large-stream-chunks
Nov 10, 2025

Conversation

@ShirasawaSama
Contributor

@ShirasawaSama ShirasawaSama commented Nov 3, 2025

Pull Request Checklist

Note to first-time contributors: Please open a discussion post in Discussions and describe your changes before submitting a pull request.

Before submitting, make sure you've checked the following:

  • Target branch: Verify that the pull request targets the dev branch. Not targeting the dev branch may lead to immediate closure of the PR.
  • Description: Provide a concise description of the changes made in this pull request.
  • Changelog: Ensure a changelog entry following the format of Keep a Changelog is added at the bottom of the PR description.
  • Documentation: If necessary, update relevant documentation Open WebUI Docs like environment variables, the tutorials, or other documentation sources.
  • Dependencies: Are there any new dependencies? Have you updated the dependency versions in the documentation?
  • Testing: Perform manual tests to verify the implemented fix/feature works as intended AND does not break any other functionality. Take this as an opportunity to make screenshots of the feature/fix and include it in the PR description.
  • Agentic AI Code: Confirm this Pull Request is not written by any AI Agent or has at least gone through additional human review and manual testing. If any AI Agent is the co-author of this PR, it may lead to immediate closure of the PR.
  • Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards?
  • Title Prefix: To clearly categorize this pull request, prefix the pull request title using one of the following:
    • BREAKING CHANGE: Significant changes that may affect compatibility
    • build: Changes that affect the build system or external dependencies
    • ci: Changes to our continuous integration processes or workflows
    • chore: Refactor, cleanup, or other non-functional code changes
    • docs: Documentation update or addition
    • feat: Introduces a new feature or enhancement to the codebase
    • fix: Bug fix or error correction
    • i18n: Internationalization or localization changes
    • perf: Performance improvement
    • refactor: Code restructuring for better maintainability, readability, or scalability
    • style: Changes that do not affect the meaning of the code (white space, formatting, missing semi-colons, etc.)
    • test: Adding missing tests or correcting existing tests
    • WIP: Work in progress, a temporary label for incomplete or ongoing work

Changelog Entry

Description

Handle large stream chunks responses to support gemini-2.5-flash-image model (fixes #17626)

Additionally, some third-party service providers may return excessively large data sets (>100MB, such as web search data), necessitating length restrictions.

Therefore, an environment variable named CHAT_STREAM_RESPONSE_CHUNK_MAX_BUFFER_SIZE (in bytes) has been introduced, which can be used to set the maximum read length for each stream chunk.
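As a minimal sketch of how a byte-size cap like this could be read from the environment and applied to buffered SSE lines (the helper name `should_skip` is illustrative, not the PR's actual code; only the env var name and its 10 MB default come from this PR):

```python
import os

# Read the cap from the environment, falling back to the PR's 10 MB default.
MAX_BUFFER_SIZE = int(os.environ.get(
    "CHAT_STREAM_RESPONSE_CHUNK_MAX_BUFFER_SIZE", "10485760"  # 10 MB
))


def should_skip(line: bytes, max_size: int = MAX_BUFFER_SIZE) -> bool:
    """Return True when a single SSE line exceeds the configured byte cap."""
    return len(line) > max_size


print(should_skip(b"data: {}"))                     # False: a tiny line
print(should_skip(b"x" * (10 * 1024 * 1024 + 1)))   # True: over the 10 MB cap
```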

Added

  • feat: handle large stream chunks responses

Changed

  • [List any changes, updates, refactorings, or optimizations]

Deprecated

  • [List any deprecated functionality or features that have been removed]

Removed

  • [List any removed features, files, or functionalities]

Fixed

  • [List any fixes, corrections, or bug fixes]

Security

  • [List any new or updated security-related changes, including vulnerability fixes]

Breaking Changes

  • BREAKING CHANGE: [List any breaking changes affecting compatibility or functionality]

Additional Information

image

Screenshots or Videos

  • [Attach any relevant screenshots or videos demonstrating the changes]

Contributor License Agreement

By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.

@ShirasawaSama
Contributor Author

fixes #17626

@ShirasawaSama
Contributor Author

@Classic298 @rgaricano If you have time, please review the code.

This code has been running in our production environment for over six months without any issues detected, but it's still advisable to take a careful look.

@Classic298
Collaborator

CC @silentoplayz

@ShirasawaSama ShirasawaSama changed the title from "feat: handle large stream chunks responses [Test Needed]" to "feat: handle large stream chunks responses to support Nano Banana [Test Needed]" Nov 3, 2025
@rgaricano
Contributor

@ShirasawaSama,
Not sure about this approach:
why remove longer lines?
They can be split into shorter chunks, processed, and removed from the buffer.
e.g. (the max buffer size and all the transmitted data are maintained).

async def handle_large_stream_chunks(stream: aiohttp.StreamReader, max_buffer_size: int = CHAT_STREAM_RESPONSE_CHUNK_MAX_BUFFER_SIZE):
    """
    Handle large stream chunks for streaming responses.
    Breaks large lines into manageable chunks without skipping any data.
    
    :param stream: The stream reader to handle.
    :param max_buffer_size: The maximum size of each chunk to yield.
    :return: An async generator yielding chunks of the stream.
    """
    buffer = bytearray()
    
    async for chunk, _ in stream.iter_chunks():
        if not chunk:
            continue
            
        # Add new chunk to buffer
        buffer.extend(chunk)
        
        # Process all complete lines
        while b'\n' in buffer:
            # Find first newline position
            line_end = buffer.find(b'\n')
            line = buffer[:line_end]
            
            # Process line in chunks if it's too large
            if len(line) > max_buffer_size:
                # Break large line into chunks
                for i in range(0, len(line), max_buffer_size):
                    yield line[i:i + max_buffer_size]
            else:
                yield line
            
            # Remove processed line from buffer
            buffer = buffer[line_end + 1:]
            
    # Yield remaining data if any
    if buffer:
        yield buffer

@rgaricano
Contributor

And if we don't use a buffer to store the long lines (to prevent memory issues when dealing with very large lines), we can also process the chunks directly:

async def handle_large_stream_chunks(stream: aiohttp.StreamReader, max_buffer_size: int = CHAT_STREAM_RESPONSE_CHUNK_MAX_BUFFER_SIZE):
    """
    Handle large stream chunks without storing entire large lines in memory.
    
    :param stream: The stream reader to handle.
    :param max_buffer_size: The maximum size of each chunk to yield.
    :return: An async generator yielding chunks of the stream.
    """
    async for chunk, _ in stream.iter_chunks():
        if not chunk:
            continue
            
        # Process each chunk directly without storing in a buffer
        lines = chunk.split(b'\n')
        
        for line in lines[:-1]:
            # Process complete lines
            if len(line) > max_buffer_size:
                # Break large line into chunks
                for i in range(0, len(line), max_buffer_size):
                    yield line[i:i + max_buffer_size]
            else:
                yield line
                
        # Handle last partial line if exists
        if lines and lines[-1]:
            yield lines[-1]

@rgaricano
Contributor

Maybe a mixed solution, using a limited buffer and chunking long lines as they arrive:

async def handle_large_stream_chunks(stream: aiohttp.StreamReader, 
                                    max_buffer_size: int = CHAT_STREAM_RESPONSE_CHUNK_MAX_BUFFER_SIZE):
    """
    Handle large stream chunks with a buffer that processes chunks as they arrive,
    without storing entire lines in memory.
    
    :param stream: The stream reader to handle.
    :param max_buffer_size: The maximum size of each chunk to yield.
    :return: An async generator yielding chunks of the stream.
    """
    # Initialize buffer with first chunk to start processing
    first_chunk = await stream.readany()
    if not first_chunk:
        return
    
    # Start processing from the beginning of the first chunk
    current_line = first_chunk
    
    async for chunk in stream.iter_any():  # iterate raw chunks; bare `async for` over a StreamReader yields lines
        if not chunk:
            continue
            
        # Combine current_line with new chunk
        data = current_line + chunk
        
        # Process complete lines
        lines = data.split(b'\n')
        
        for line in lines[:-1]:
            # If line exceeds max size, break it into chunks
            if len(line) > max_buffer_size:
                for i in range(0, len(line), max_buffer_size):
                    yield line[i:i + max_buffer_size]
            else:
                yield line
                
        # Handle last partial line
        if lines:
            current_line = lines[-1]
            if len(current_line) > max_buffer_size:
                # If the partial line is too large, break it into chunks
                for i in range(0, len(current_line), max_buffer_size):
                    yield current_line[i:i + max_buffer_size]
                current_line = b""  # already yielded; don't re-emit it next iteration
                
    # Process remaining data in current_line
    if current_line:
        if len(current_line) > max_buffer_size:
            for i in range(0, len(current_line), max_buffer_size):
                yield current_line[i:i + max_buffer_size]
        else:
            yield current_line

@ShirasawaSama
Contributor Author

ShirasawaSama commented Nov 3, 2025

@rgaricano Have you tested your code?

Because I noticed that it seems to require returning a complete line each time, rather than a portion of a line.

https://github.com/open-webui/open-webui/blob/main/backend/open_webui/utils/middleware.py#L2321

Additionally, regarding why line-length limitations are necessary: My production cluster has multiple instances with Redis enabled. Sometimes models include built-in MCP calls, which transmit the entire call process.

When executing web searches, this also sends web data. Occasionally, when reading PDF pages, it directly transmits tens of megabytes of data. Combined with socket.io broadcasts across multiple instances, this caused my Redis cluster to crash.

In fact, to fix the issue where the Redis cluster was overwhelmed by socket.io broadcast traffic, I converted the chat socket.io data back to SSE and switched to incremental updates. This reduced the original full request size of over 70MB to just 300KB. However, I believe Tim is highly unlikely to accept this code modification.
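The point raised above, that the downstream middleware expects a complete line each time, can be seen in a small sketch: a typical SSE consumer strips the `data: ` prefix and parses the rest as JSON, so a line split mid-payload fails to parse rather than producing a usable delta (the parser below is illustrative, not Open WebUI's actual middleware):

```python
import json


def parse_sse_line(line: bytes):
    """Hypothetical SSE consumer: strip the `data: ` prefix, parse JSON."""
    text = line.decode("utf-8")
    if text.startswith("data: "):
        return json.loads(text[len("data: "):])
    return None


full = b'data: {"choices": [{"delta": {"content": "hi"}}]}'
print(parse_sse_line(full)["choices"][0]["delta"]["content"])  # hi

half = full[: len(full) // 2]  # simulate a chunk boundary falling mid-line
try:
    parse_sse_line(half)
except json.JSONDecodeError:
    print("partial line is not parseable")
```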

@rgaricano
Contributor

rgaricano commented Nov 3, 2025

@ShirasawaSama
No, I haven't tried it (and probably won't be able to try it in the short term).

Lines are chunked data; when we process the stream we are also chunking the data into lines, so that isn't a problem (at least it doesn't aggravate the problem, which in fact exists due to the use of lines as delimiters in SSE).

Probably, to manage that situation, it's better to use io.BytesIO and a controlled line_partial_size if necessary.
(But it doesn't matter; it can be changed later.)

@rgaricano
Contributor

Leaving aside buffering,
another question: what happens when there are JSON objects in the stream, and someone does not want large content removed?

At the very least, there should be an option to delete or split large content.

e.g. to remove or split (maintaining the integrity of any JSON objects in the data):

import json
import logging

log = logging.getLogger(__name__)

async def handle_large_stream_chunks(
    stream: aiohttp.StreamReader, 
    max_buffer_size: int = CHAT_STREAM_RESPONSE_CHUNK_MAX_BUFFER_SIZE,
    split_large_content: bool = True  # New parameter
):
    """
    Handle stream response chunks with configurable behavior for oversized lines.

    :param stream: The stream reader to handle.
    :param max_buffer_size: The maximum buffer size in bytes.
    :param split_large_content: If True, split large content; if False, skip it.
    :return: An async generator that yields the stream data.
    """

    buffer = b""
    skip_mode = False

    async for data, _ in stream.iter_chunks():
        if not data:
            continue

        if skip_mode and len(buffer) > max_buffer_size:
            buffer = b""

        lines = (buffer + data).split(b"\n")

        for i in range(len(lines) - 1):
            line = lines[i]

            if skip_mode:
                if len(line) <= max_buffer_size:
                    skip_mode = False
                    yield line
                else:
                    yield b"data: {}"
            else:
                if len(line) > max_buffer_size:
                    if split_large_content:
                        # Try to split the content dynamically
                        try:
                            line_str = line.decode('utf-8', 'replace')
                            if line_str.startswith('data:'):
                                data_str = line_str[len('data:'):].strip()
                                data_obj = json.loads(data_str)

                                # Split large content field
                                choices = data_obj.get('choices', [])
                                if choices and 'delta' in choices[0]:
                                    content = choices[0]['delta'].get('content', '')
                                    if len(content) > max_buffer_size:
                                        # Emit in chunks
                                        for j in range(0, len(content), max_buffer_size):
                                            chunk = content[j:j + max_buffer_size]
                                            chunked_data = {
                                                **data_obj,
                                                'choices': [{
                                                    **choices[0],
                                                    'delta': {**choices[0]['delta'], 'content': chunk}
                                                }]
                                            }
                                            yield f"data: {json.dumps(chunked_data)}\n".encode('utf-8')
                                        continue
                        except Exception as e:
                            log.warning(f"Failed to split large content, skipping: {e}")

                    # Fallback to skip mode
                    skip_mode = True
                    yield b"data: {}"
                    log.info(f"Skip mode triggered, line size: {len(line)}")
                else:
                    yield line

        buffer = lines[-1]

        if not skip_mode and len(buffer) > max_buffer_size:
            skip_mode = True
            log.info(f"Skip mode triggered, buffer size: {len(buffer)}")
            buffer = b""

    if buffer and not skip_mode:
        yield buffer

with 2 env vars:

CHAT_STREAM_RESPONSE_CHUNK_MAX_BUFFER_SIZE = int(os.getenv("CHAT_STREAM_RESPONSE_CHUNK_MAX_BUFFER_SIZE", 16384))
CHAT_STREAM_SPLIT_LARGE_CONTENT = os.getenv("CHAT_STREAM_SPLIT_LARGE_CONTENT", "true").lower() == "true"

@silentoplayz silentoplayz added the testing wanted Testing from the community is needed label Nov 3, 2025
@ShirasawaSama
Contributor Author

ShirasawaSama commented Nov 4, 2025

@rgaricano Do you think -1 means not skipping overly large single-line content?

Additionally, I feel your code is a bit too complex, making it nearly impossible for others to maintain later on. In fact, the major issues I've encountered aren't limited to the choices.delta—they often appear in the tools, references, or thinking fields as well.

I added the feature to skip excessively long lines because we previously discovered that certain third-party model providers would output massive amounts of unnecessary content within a single line. This would directly cause the Redis cluster connected to OpenWebUI to crash and result in an instantaneous surge of traffic on the OpenWebUI backend servers. We simply need to skip these lines entirely.

This is purely a protective measure.
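The protective behavior described here can be condensed into a small sketch: oversized lines are replaced with an empty SSE data event instead of being forwarded, so a single runaway line cannot flood Redis or the backend. This is illustrative only; the merged implementation's details may differ.

```python
def filter_lines(lines, max_size):
    """Yield each SSE line, replacing any line over `max_size` bytes."""
    for line in lines:
        if len(line) > max_size:
            yield b"data: {}"  # drop the oversized payload, keep the stream valid
        else:
            yield line


out = list(filter_lines([b"data: ok", b"x" * 100], max_size=50))
print(out)  # [b'data: ok', b'data: {}']
```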

@tjbck tjbck self-assigned this Nov 5, 2025
@silentoplayz
Collaborator

I've attempted to test this PR using the Google: Gemini 2.5 Flash Image (Nano Banana) model provided by OpenRouter and it appears to silently fail to provide back an image to me.
image
image
image

@ShirasawaSama
Contributor Author

@silentoplayz Are there any error messages?

Can you see the file size of the output image in base64?

@ShirasawaSama ShirasawaSama marked this pull request as draft November 6, 2025 03:39
@silentoplayz
Collaborator

@silentoplayz Are there any error messages?

Can you see the file size of the output image in base64?

There aren't any error messages thrown/triggered to be displayed that I am aware of. I've checked both frontend+backend logs and the browser console and there's no error.

As for network requests, here's what that looks like from the start of a new chat with Google: Gemini 2.5 Flash Image Preview (Nano Banana).
image

I've tested on both Chrome and Firefox web browsers.

@Classic298
Collaborator

Nano Banana support was added in dev. Is this PR still needed?

@ShirasawaSama
Contributor Author

Nano Banana support was added in dev. Is this PR still needed?

Yes, Clipboard_Screenshot_1762485187



CHAT_STREAM_RESPONSE_CHUNK_MAX_BUFFER_SIZE = os.environ.get(
"CHAT_STREAM_RESPONSE_CHUNK_MAX_BUFFER_SIZE", "10485760" # 10MB
Contributor

Any reasons for setting it to 10MB specifically?

Contributor Author

@ShirasawaSama ShirasawaSama Nov 7, 2025

10MB is the typical size for the base64 strings returned by most image generation models. This has been tested with models like Gemini Image 2.5, Qwen Image Edit, Doubao Seedream, and GPT Image.

Of course, I can accommodate values larger or smaller than this. This limit is solely to prevent LLM from returning excessively large data in a single response that could crash the backend service.

Contributor Author

(All data below is single-line after base64 encoding) For Gemini Image 2.5, the image size is approximately under 2.5MB; for gpt image, it's around 3.5MB; and for doubao seedream, it's roughly under 7MB. Therefore, I consider 10MB to be an acceptable value.
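A back-of-the-envelope check of the sizes quoted above: base64 inflates binary data by a factor of 4/3 (encoded length is 4·⌈n/3⌉ for n input bytes), so even the largest case mentioned, roughly 7 MB decoded, stays comfortably under the 10 MB default cap once encoded. The helper below is purely illustrative arithmetic:

```python
import base64
import math


def b64_len(raw_bytes: int) -> int:
    """Length in bytes of the base64 encoding of `raw_bytes` input bytes."""
    return 4 * math.ceil(raw_bytes / 3)


raw = 7 * 1024 * 1024  # ~7 MB image, the largest size quoted above
print(b64_len(raw))                              # 9786712 (~9.3 MB of text)
print(b64_len(raw) < 10 * 1024 * 1024)           # True: fits the 10 MB default
print(b64_len(10) == len(base64.b64encode(b"x" * 10)))  # True: formula matches
```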

@ShirasawaSama
Contributor Author

@silentoplayz Could you please provide the complete SSE data from OpenRouter? You can directly use JavaScript's fetch method; it's possible that the issue lies solely with OpenRouter's API returning data.

@silentoplayz
Collaborator

silentoplayz commented Nov 7, 2025

@silentoplayz Could you please provide the complete SSE data from OpenRouter? Nin can directly use JavaScript's fetch method; it's possible that the issue lies solely with OpenRouter's API returning data.

Sorry, but the OpenRouter API key I used to test with was given to me for testing purposes only and I don't know how to obtain the SSE data from OpenRouter myself. I don't have access to the OpenRouter dashboard or anything like that.

@ShirasawaSama ShirasawaSama force-pushed the feature/handle-large-stream-chunks branch from bb78ce9 to ce1079d on November 7, 2025 07:00
@ShirasawaSama
Contributor Author

Clipboard_Screenshot_1762499903

@silentoplayz I confirmed this is because the image data returned by OpenRouter appears non-standard. It places the image base64 in choices[0].images.image_url instead of choices[0].content, which is required for markdown image syntax display.

I recommend opening a separate PR to handle this non-standard output.

And, this does not affect my current PR handling large single-line SSE text.
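The separate fix suggested above could look like the sketch below: if a delta carries the non-standard `images` list instead of markdown in `content`, rewrite it so the frontend can render the image. The field names follow the OpenRouter payload shape described in this thread and are an assumption, not a confirmed schema:

```python
def normalize_delta(delta: dict) -> dict:
    """Rewrite a non-standard `images` field into markdown image content."""
    images = delta.get("images") or []
    if images and not delta.get("content"):
        urls = [img.get("image_url", {}).get("url", "") for img in images]
        delta = {**delta, "content": "".join(f"![image]({u})" for u in urls)}
    return delta


delta = {"images": [{"image_url": {"url": "data:image/png;base64,AAAA"}}]}
print(normalize_delta(delta)["content"])  # ![image](data:image/png;base64,AAAA)
```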

@tjbck
Contributor

tjbck commented Nov 8, 2025

@ShirasawaSama this generally looks good and seems like it can be merged as-is, another qq: will this potentially affect any existing behaviours?

@ShirasawaSama
Contributor Author

@ShirasawaSama this generally looks good and seems like it can be merged as-is, another qq: will this potentially effect any existing behaviours?

I cannot guarantee with absolute certainty that there will be no other repercussions. However, this code has been running in our production environment for over eight months without any incidents or user-reported bugs.

@ShirasawaSama ShirasawaSama marked this pull request as ready for review November 8, 2025 06:54
@tjbck
Contributor

tjbck commented Nov 10, 2025

Merging this, but it will be disabled by default for now. Thanks!

@tjbck tjbck merged commit 27df461 into open-webui:dev Nov 10, 2025
0 of 2 checks passed

Labels

testing wanted Testing from the community is needed


5 participants