perf: 50x performance improvement for external embeddings#19155

Closed
Classic298 wants to merge 2 commits into open-webui:dev from Classic298:patch-2

Conversation


@Classic298 Classic298 commented Nov 13, 2025

  • Target branch: Verify that the pull request targets the dev branch. Not targeting the dev branch will lead to immediate closure of the PR.
  • Description: Provide a concise description of the changes made in this pull request below.
  • Changelog: Ensure a changelog entry following the format of Keep a Changelog is added at the bottom of the PR description.
  • Documentation: If necessary, update relevant documentation Open WebUI Docs like environment variables, the tutorials, or other documentation sources.
  • Dependencies: Are there any new dependencies? Have you updated the dependency versions in the documentation?
  • Testing: Perform manual tests to verify the implemented fix/feature works as intended AND does not break any other functionality. Take this as an opportunity to make screenshots of the feature/fix and include it in the PR description.
  • Agentic AI Code: Confirm this Pull Request is not written by any AI Agent or has at least gone through additional human review AND manual testing. If any AI Agent is the co-author of this PR, it may lead to immediate closure of the PR.
  • Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards?
  • Title Prefix: To clearly categorize this pull request, prefix the pull request title using one of the following:
    • perf: Performance improvement

Changelog Entry

Description

  • Processes embedding requests to external APIs (OpenAI, Azure OpenAI, Ollama) in parallel instead of sequential batches.

Before:

  • Sequential batch processing: Batch 1 → wait → Batch 2 → wait → Batch 3 → ... → Batch N
  • For 6000 chunks with batch_size=1: 6000 sequential HTTP requests (!!!)
  • Total time ≈ 6000 × request_latency (e.g., 6000 × 50ms = 5 minutes)

After:

  • Parallel batch processing: All batches sent simultaneously via asyncio.gather()
  • For 6000 chunks with batch_size=1: All 6000 requests execute in parallel
  • Total time ≈ 1 × request_latency + overhead (e.g., <10 seconds for 6000 chunks)
  • ~30-50x speed improvement observed in production
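The before/after difference can be sketched with a toy timing model; `fetch_embedding` below is a hypothetical stand-in for one external API call, not code from the PR:

```python
import asyncio

# Hypothetical stand-in for one external embedding API call.
async def fetch_embedding(batch):
    await asyncio.sleep(0.01)  # simulate network latency
    return [[0.0, 0.0, 0.0] for _ in batch]

async def sequential(batches):
    # Batch 1 -> wait -> Batch 2 -> ...: total time grows linearly with N.
    results = []
    for batch in batches:
        results.append(await fetch_embedding(batch))
    return results

async def parallel(batches):
    # All batches in flight at once via asyncio.gather():
    # total time is roughly one request latency plus overhead.
    return await asyncio.gather(*(fetch_embedding(b) for b in batches))

batches = [[f"chunk {i}"] for i in range(100)]  # batch_size=1
results = asyncio.run(parallel(batches))
print(len(results))  # 100
```

With 100 simulated batches at 10 ms each, the sequential path takes about a second while the parallel path finishes in roughly one latency period, which is the same effect the PR claims for real HTTP requests.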

Added

  • Async embedding generation functions:
    • generate_openai_batch_embeddings_async() using aiohttp
    • generate_azure_openai_batch_embeddings_async() using aiohttp
    • generate_ollama_batch_embeddings_async() using aiohttp
  • generate_multiple_async() function for parallel batch coordination
  • Import of aiohttp and asyncio modules
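As a hedged sketch of what one of these helpers might look like: the function name matches the PR's description, but the endpoint path, payload shape, and error handling below are assumptions based on the OpenAI embeddings API, not the PR's actual code.

```python
import aiohttp

async def generate_openai_batch_embeddings_async(model, texts, url, key):
    # Assumed request/response shape (OpenAI-style /embeddings endpoint);
    # the real implementation in the PR may differ.
    async with aiohttp.ClientSession() as session:
        async with session.post(
            f"{url}/embeddings",
            headers={"Authorization": f"Bearer {key}"},
            json={"model": model, "input": texts},
        ) as response:
            response.raise_for_status()
            data = await response.json()
            return [item["embedding"] for item in data["data"]]
```

Because the function is a coroutine, `generate_multiple_async()` can schedule one of these per batch and await them all with `asyncio.gather()`.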

Changed

  • get_embedding_function() now returns synchronous wrapper around async implementation
    • External API remains unchanged: still returns synchronous callable
    • Internal implementation uses asyncio.run() to execute parallel requests
  • generate_embeddings() now uses asyncio.run() for async execution
  • All batch embedding functions converted from synchronous requests.post() to async aiohttp.ClientSession()

Fixed

  • Massive performance bottleneck when processing large documents with external embedding APIs
  • Sequential processing causing unnecessary delays for models that only support batch_size=1

Breaking Changes

  • NO BREAKING CHANGES:
    • All callers remain synchronous and unmodified
    • API contract is identical: embedding_function(texts, prefix, user) still returns embeddings synchronously
    • Uses "sync wrapper around async code" pattern via asyncio.run()
    • No changes required to calling code in retrieval.py or elsewhere
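The "sync wrapper around async code" pattern described above can be sketched as follows; the inner coroutine here is a dummy placeholder, not the PR's real parallel implementation:

```python
import asyncio

async def _embed_async(texts):
    # Placeholder for the parallel aiohttp fan-out; returns one dummy
    # vector per input text.
    return [[float(len(t))] for t in texts]

def get_embedding_function():
    # Callers still receive a plain synchronous callable; the async
    # execution is an internal detail hidden behind asyncio.run().
    # One caveat of this pattern: asyncio.run() raises RuntimeError if
    # called from a thread that already has a running event loop.
    def embedding_function(texts):
        return asyncio.run(_embed_async(texts))
    return embedding_function

fn = get_embedding_function()
print(fn(["hello", "hi"]))  # [[5.0], [2.0]]
```

This is why the API contract stays identical: callers in retrieval.py keep invoking a synchronous function and never see the event loop.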

Contributor License Agreement

By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.

Note

Deleting the CLA section will lead to immediate closure of your PR and it will not be merged in.

@Classic298 (Collaborator, Author):

Yes, this is tested. I tested it in a test environment and even in production, and it works!

@Classic298 Classic298 requested a review from tjbck November 13, 2025 09:13
```diff
 if reranking:
-    scores = self.reranking_function(query, documents)
+    scores = self.reranking_function(
+        [(query, doc.page_content) for doc in documents]
```
Contributor:

@Classic298 maybe you need to rebase against current dev? It seems you are changing reranking function back to older commit.

@Classic298 (Collaborator, Author):

How so? I edited utils.py directly from the open-webui/dev branch; I edited the very latest version of the file.

Contributor:

Here is the master:

```python
scores = self.reranking_function(
    [(query, doc.page_content) for doc in documents]
)
```

Here is the dev:

```python
scores = self.reranking_function(query, documents)
```

So which branch is your patch against?

@Classic298 (Collaborator, Author):

dev

@Classic298 (Collaborator, Author):

Quite literally the latest dev.

I went on GitHub, chose the dev branch of Open WebUI, and modified the file using the inline editor, which then creates a patch-x branch on GitHub.

@Classic298 (Collaborator, Author):

As you can see, the original line I am deleting here is:

scores = self.reranking_function(query, documents)

which is also the line in the most recent version of the dev branch, so I am working on the dev branch.

The main branch has a whole different line:

[(query, doc.page_content) for doc in documents]

Hope that clears it up.

Contributor:

This is your patch https://patch-diff.githubusercontent.com/raw/open-webui/open-webui/pull/19155.patch

(screenshot of the patch diff)

Somehow you are modifying the reranking_function.

If that's unintended, you need to undo the changes in reranking_function.

@Classic298 Classic298 closed this Nov 13, 2025
@Classic298 Classic298 reopened this Nov 13, 2025
@tjbck (Contributor):

tjbck commented Nov 16, 2025

Please reopen after rebasing to latest dev.

@tjbck tjbck closed this Nov 16, 2025
Classic298 pushed a commit to Classic298/open-webui that referenced this pull request Nov 17, 2025
Processes embedding requests to external APIs (OpenAI, Azure OpenAI, Ollama)
in parallel instead of sequential batches.

Before:
- Sequential batch processing: Batch 1 → wait → Batch 2 → wait → ... → Batch N
- For 6000 chunks with batch_size=1: 6000 sequential HTTP requests
- Total time ≈ 6000 × request_latency (e.g., 6000 × 50ms = 5 minutes)

After:
- Parallel batch processing: All batches sent simultaneously via asyncio.gather()
- For 6000 chunks with batch_size=1: All 6000 requests execute in parallel
- Total time ≈ 1 × request_latency + overhead (e.g., <10 seconds for 6000 chunks)
- ~30-50x speed improvement observed in production

Added:
- Async embedding generation functions:
  - generate_openai_batch_embeddings_async() using aiohttp
  - generate_azure_openai_batch_embeddings_async() using aiohttp
  - generate_ollama_batch_embeddings_async() using aiohttp
  - generate_multiple_async() function for parallel batch coordination
- Import of aiohttp and asyncio modules

Changed:
- get_embedding_function() now uses async functions with parallel execution
- Uses asyncio.run() to execute parallel requests
- All batch embedding functions converted to async with aiohttp

Key difference from previous PR (open-webui#19155):
- MINIMAL changes - only modified what's necessary for parallel embeddings
- Did NOT modify get_reranking_function() - kept original signature
- Did NOT modify RerankCompressor.compress_documents() - kept original
- Kept all sync embedding functions intact
- Kept generate_embeddings() unchanged
- No unnecessary refactoring
