feat: Adding file metadata to hybrid search #19095
@jmleksan any reason this has to be specific to hybrid search?
@tjbck There were a few:
It could be added to semantic search, but I think it better leverages the strengths of hybrid search, so I included it there instead. If you believe semantic search would also benefit, I can update it.
@jmleksan I think it could make sense to apply this before the embedding process instead of during hybrid search. Thoughts?
@tjbck I can see the benefits of including metadata earlier in the pipeline so the chunk text the model sees can carry some extra context about which document it came from, and so hybrid search can use it as well. My only concern is whether the intent is for that enriched text to also be fed into the embedding step. Embedding titles or filenames along with the body can lead to lower-quality vectors in situations where the chunk content isn't strongly related to the metadata. I'd prefer to keep the embedding text focused on the body by default and treat metadata as an additional signal for hybrid search and for sending context to the model.
@jmleksan sounds reasonable
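The split discussed above (body-only text for embeddings, metadata-enriched text for the keyword side of hybrid search) could look roughly like the sketch below. This is only an illustration under stated assumptions: the `Chunk` class, the `name` and `title` metadata keys, and the helper functions are hypothetical and are not taken from `backend/open_webui/retrieval/utils.py`.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A retrieved chunk; field names are illustrative, not the project's schema."""
    body: str
    metadata: dict = field(default_factory=dict)


def embedding_text(chunk: Chunk) -> str:
    # The embedding step sees only the body, so an unrelated filename
    # or title cannot pull the vector away from the chunk content.
    return chunk.body


def lexical_text(chunk: Chunk) -> str:
    # The keyword side of hybrid search sees the body plus metadata.
    parts = [chunk.body]
    name = chunk.metadata.get("name")
    if name:
        # Split on separators so "PPO_insurance_2024.pdf" yields "PPO".
        parts.append(" ".join(re.split(r"[_\-.\s]+", name)))
    title = chunk.metadata.get("title")
    if title:
        parts.append(title)
    return "\n".join(parts)


def tokens(text: str) -> set[str]:
    # Lowercased alphanumeric tokens, a stand-in for a real keyword tokenizer.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


chunk = Chunk(
    body="Deductibles and copays are listed in section 3.",
    metadata={"name": "PPO_insurance_2024.pdf"},
)

print("ppo" in tokens(embedding_text(chunk)))  # False
print("ppo" in tokens(lexical_text(chunk)))    # True
```

In this toy example the body never mentions the plan type, so only the metadata-enriched text lets a keyword query for `PPO` match, while the text handed to the embedding model stays unchanged.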
Great addition, Thanks! |
* Added metadata to hybrid search
* And config and env plus refac
* consistency

Co-authored-by: Tim Baek



Pull Request Checklist
Before submitting, make sure you've checked the following:
Target the `dev` branch. Not targeting the `dev` branch will lead to immediate closure of the PR.

Changelog Entry
Description
Use Case: “Show me our PPO documents”
* `PPO_insurance_2024.pdf`, `medical_PPO_plan.docx`, etc.
* `PPO benefits` or just `PPO` now surfaces every relevant document because filenames, titles, headings, and snippets are indexed alongside the content.
* `PPO` returns all PPO-related files immediately, instead of relying on the term being repeated in the body text.

Added
Fixed
Additional Information
`backend/open_webui/retrieval/utils.py`

Contributor License Agreement
By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.