Releases: kreuzberg-dev/kreuzberg
v4.6.3
[4.6.3] - 2026-03-27
Added
- Tower service layer (`service` module): Composable `ExtractionService` implementing `tower::Service` with configurable middleware layers (tracing, metrics, timeout, concurrency limit). New `tower-service` feature flag, auto-enabled by `api` and `mcp`. `ExtractionServiceBuilder` provides ergonomic layer composition.
- Semantic OpenTelemetry conventions (`telemetry` module): Formal `kreuzberg.*` attribute namespace with 30+ span attributes, metric names, and operation/stage constants.
- Extraction metrics: 11 OTel metric instruments (counters, histograms, gauge) covering extraction totals, durations, cache hits/misses, pipeline stages, OCR, and concurrent extractions. Feature-gated behind `otel`.
- InstrumentedExtractor wrapper: Automatic per-extractor tracing spans and metrics without per-extractor annotations. Injected at registry dispatch when the `otel` feature is enabled.
Improved
- Deeper instrumentation: Pipeline post-processing stages, individual processor execution, OCR operations, and RT-DETR layout model inference now have semantic spans and duration metrics.
- API and MCP servers use ExtractionService: Both consumers now route extractions through the Tower service stack.
- Unified config merge: JSON config merge logic deduplicated between CLI and MCP.
- API server hardening: Added response compression (gzip/brotli/zstd), panic recovery, request-ID correlation, and sensitive header redaction via tower-http middleware.
Changed
- Removed per-extractor `#[instrument]` annotations: 29 manual annotations replaced by the automatic `InstrumentedExtractor` wrapper.
- Span attribute names migrated to `kreuzberg.*` namespace: `extraction.filename` -> `kreuzberg.document.filename`, `extraction.mime_type` -> `kreuzberg.document.mime_type`, etc.
Fixed
- EPUB spine semantics refactor (#594): Richer OPF package model preserves manifest fallback chains, guide references, and non-linear spine items. Navigation chrome stripped from output. Malformed guide references now produce warnings instead of hard failures.
- DOCX image extraction for `<a:blip>` with child elements (#591): Images with high-quality settings were not extracted. Now handles `Event::Start` for `<a:blip>`.
- OCR table extraction returned empty results via pipeline path (#593): Layout detection and table propagation fixed for both code paths.
- Missing `chunker_type` field in bindings (#592): Exposed across Python, TypeScript/WASM, Go, C#, PHP bindings.
- Full API parity across all 10 bindings: Added `max_archive_depth` to all bindings. Added missing typed config classes for `acceleration`, `email`, `layout`, `concurrency` where needed.
- Node Windows publish failure: Prepare script fallback replaced with cross-platform `node -e`.
- CI Validate path triggers broadened: Covers `docs/**`, `biome.json`, `.task/**`, and other lintable paths.
- Publish pipeline ORT bundling: Configurable `strategy` input (system/bundled) on `setup-onnx-runtime` action. Publish jobs now use `strategy: bundled`.
- C FFI CI missing ORT setup: Added `setup-onnx-runtime` step to `ci-c-ffi.yaml`.
v4.6.2
[4.6.2] - 2026-03-26
Added
- PDF page rendering API (#583): New `render_pdf_page` function and `PdfPageIterator` for rendering individual PDF pages as PNG images. Available across all 11 language bindings with idiomatic patterns (Python context manager, Go Close(), Java AutoCloseable, C# IDisposable, Elixir Stream, etc.). Default 150 DPI, configurable per call.
Fixed
- Table recognition coordinate mismatch on scanned PDFs (#582): Layout detection bboxes (640x640 model space) are now scaled to OCR render resolution before TATR table recognition. Previously, coordinate space mismatch caused zero tables to be found.
- OCR elements report `page_number: 1` for all pages (#582): Tesseract resets page numbers per single-page render. Page numbers are now correctly stamped after OCR in the batch loop.
- Rust E2E tests missing PDF feature: Added `pdf` feature to the e2e-generator Rust template, fixing 41 `UnsupportedFormat("application/pdf")` failures.
- HWP styled extraction empty on ARM: Added `skip_on_platform` support to Python and Java e2e generators, skipping the `hwp_styled` fixture on `aarch64-unknown-linux-gnu`.
- WASM CI build failure: Made `kreuzberg-node` prepare script resilient to missing native addon, preventing `ENOENT: dist/cli.js` during pnpm workspace install.
- Go C header stale at 4.5.0: Synced header and `DefaultVersion` constant to match current version.
- Ruby gem missing ONNX Runtime: Added `ort-bundled` feature to Ruby native Cargo.toml.
- Elixir doctest failures: Updated `ExtractionConfig.to_map/1` doctests for `force_ocr_pages` field.
- WASM benchmark timeout: Reduced per-extraction timeout from 600s to 120s and job timeout from 6h to 2h.
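The coordinate fix for #582 comes down to rescaling boxes from one pixel space to another. A minimal sketch in Python, assuming the layout model sees a plain stretch of the page to 640x640 with no letterbox padding; the function name and tuple layout are illustrative and not part of Kreuzberg's API:

```python
def scale_bbox(bbox, render_width, render_height, model_size=640):
    """Map an (x1, y1, x2, y2) box from model space to render-pixel space.

    Assumes the page was stretched to model_size x model_size without padding.
    """
    sx = render_width / model_size
    sy = render_height / model_size
    x1, y1, x2, y2 = bbox
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)


# A box covering the centre of the model input, mapped onto a 1280x1920 render.
scaled = scale_bbox((160, 160, 480, 480), render_width=1280, render_height=1920)
print(scaled)
```

If the model input were letterboxed instead of stretched, the padding offset would have to be subtracted before scaling.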
Improved
- `version:sync` now syncs Go C header, DefaultVersion, and Docker compose tags: Prevents version drift across language bindings.
- Publish pipeline commits Elixir NIF checksums back to main: Prevents stale checksums after releases.
- WASM test app migrated to Deno: Replaced Node.js/vitest with Deno test runner, fixing `fetch()` unavailability.
- Docs migrated from MkDocs to Zensical: 4-5x faster incremental builds.
v4.6.1
Fixes
- OCR memory usage reduced 60-78%: Restructured the OCR batch rendering loop to render-and-encode one page at a time instead of holding all decoded RGB buffers simultaneously. A 98-page scanned PDF dropped from 4.6GB to 1.9GB peak RSS (batch_size=4), and from 3.3GB to 713MB (batch_size=1). Batch size now adapts to available system memory on Linux and macOS.
- PDF control character encoding artifacts: PDFs with broken ToUnicode font mappings that produce U+0002 (STX) and other control characters where hyphens should appear now have these replaced with hyphens when between word characters, or stripped otherwise. Fixes garbled output like `re\x02labelling` → `re-labelling`.
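The control-character rule above can be sketched with two regular-expression passes. This is an illustration of the described behaviour, not Kreuzberg's implementation; leaving tab, newline, and carriage return untouched is an assumption:

```python
import re

# C0 control characters plus DEL, excluding tab, newline, and carriage return
# (the exclusion is an assumption made for this sketch).
_CONTROL = r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]"


def repair_control_chars(text):
    """Replace control chars between word characters with '-', strip the rest."""
    # First pass: a control character flanked by word characters becomes a hyphen.
    text = re.sub(rf"(?<=\w){_CONTROL}(?=\w)", "-", text)
    # Second pass: any remaining control characters are removed.
    return re.sub(_CONTROL, "", text)


print(repair_control_chars("re\x02labelling"))  # re-labelling
```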
v4.6.0
v4.6.0 — Recursive Archives, DocumentStructure, Bug Fixes
Added
- Recursive archive extraction: Archives (ZIP, TAR, 7Z, GZIP) now recursively extract all processable files, each with its own `ExtractionResult`. New `ArchiveEntry` type and `max_archive_depth` config.
- YAML/JSON section chunker: New `ChunkerType::Yaml` with full hierarchy paths and auto-inference from metadata.
- Unified DocumentStructure: Extended with 7 new node types, 4 annotation kinds, attributes bag. All 35 extractors produce native DocumentStructure.
- Document-level OCR: `process_document()` for whole-file extraction, up to 30% faster on multi-page documents.
- DocBook/JATS inline annotations: Semantic formatting mapped to AnnotationKind variants.
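To illustrate what a depth-limited recursive archive walk looks like, here is a small Python sketch restricted to ZIP files held in memory. The function names and the exact meaning of the depth limit (nested archives beyond the limit are returned as opaque files) are assumptions for illustration; Kreuzberg's `max_archive_depth` may behave differently:

```python
import io
import zipfile


def extract_recursive(data, name, max_depth, depth=0):
    """Return (path, bytes) pairs for every file, descending into nested ZIPs."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return [(name, data)]
    if depth >= max_depth:
        # Depth budget exhausted: report the nested archive as an opaque file.
        return [(name, data)]
    entries = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            child = archive.read(info)
            child_name = f"{name}/{info.filename}"
            entries.extend(extract_recursive(child, child_name, max_depth, depth + 1))
    return entries


def make_zip(files):
    """Build an in-memory ZIP from a {filename: bytes} mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for filename, content in files.items():
            archive.writestr(filename, content)
    return buffer.getvalue()


inner = make_zip({"notes.txt": b"hello"})
outer = make_zip({"inner.zip": inner, "readme.md": b"# hi"})
paths = [path for path, _ in extract_recursive(outer, "outer.zip", max_depth=3)]
print(paths)
```

The depth cap is what keeps a maliciously nested archive from recursing without bound.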
Changed
- CSV extraction: Produces `Row N: Header: Value` format for better embedding quality.
- XML extraction: Indented hierarchical output preserving element tree.
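As a rough picture of the `Row N: Header: Value` shape, the Python sketch below renders each CSV record with its headers inline. How Kreuzberg joins multiple columns is not stated in these notes, so the comma separator here is an assumption:

```python
import csv
import io


def csv_to_rows(text):
    """Render each CSV record as 'Row N: Header: Value, ...' (first row = headers)."""
    reader = csv.reader(io.StringIO(text))
    headers = next(reader, [])
    lines = []
    for number, record in enumerate(reader, start=1):
        # Joining columns with ", " is an assumption made for this sketch.
        pairs = ", ".join(f"{header}: {value}" for header, value in zip(headers, record))
        lines.append(f"Row {number}: {pairs}")
    return "\n".join(lines)


rendered = csv_to_rows("name,city\nAda,London\nLinus,Helsinki\n")
print(rendered)
```

Repeating the header next to every value keeps each row meaningful when it is embedded or chunked on its own.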
Improved
- Zero-copy file I/O: memmap2 + simdutf8 SIMD UTF-8 validation for large files.
- Unified concurrency management: Centralized thread budget with configurable `ConcurrencyConfig`.
Fixed
- #557: Auto-enable `extract_pages` for element-based output, giving correct page numbers without manual PageConfig.
- #558: Fixed misleading PageConfig docstring defaults.
- #560: MSG extraction now supports compressed RTF bodies (PR_RTF_COMPRESSED).
- #561: Indexed colour PDF images now decode correctly with palette lookup.
- ODT extraction robustness improvements.
See CHANGELOG.md for full details.
v4.5.4
[4.5.4] - 2026-03-23
Fixed
- PDF image extraction panic on mismatched buffer lengths (#552): Replaced `assert!` with graceful error handling. Malformed PDF images are now skipped instead of panicking. Regression from v4.5.0.
- `pdf` feature compilation without `layout-detection` (#550): `config.layout` reference gated behind `#[cfg(feature = "layout-detection")]`.
- WASM module resolution in Supabase/Deno edge functions (#551): Added explicit `package.json` exports and Deno detection in wasm-loader.
- `zip` dependency pinned below 7.4: Avoids let-chain build failures on some stable Rust toolchains (#549).
- Vendored HWP text extraction: Replaced external `hwpers` crate with vendored subset (~1,650 lines). Eliminates `zip 2.x` transitive dependency that caused WASM/CI build failures.
- Ruby binding missing `table_model` field in `LayoutDetectionConfig` initializer.
- Clippy/unused variable warnings in table recognition and pipeline modules.
Added
- `prepend_heading_context` chunking option: When `true` and `chunker_type` is `Markdown`, prepends the heading hierarchy path (e.g. `# Title > ## Section`) to each chunk's content string. Useful for RAG pipelines where chunks need self-contained structural context. Available across all 10 language bindings, CLI, and WASM. Includes fixture-driven e2e tests and documentation for all languages.
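The effect of heading-context prepending can be sketched in a few lines of Python. This is a simplified stand-in that splits only at headings; the function name and the blank line between the path and the chunk body are assumptions, not Kreuzberg's actual chunker:

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunks_with_heading_context(markdown):
    """Split Markdown at headings and prefix each chunk with its heading path."""
    stack = []   # (level, heading line) pairs for the current hierarchy
    body = []    # body lines collected under the current heading
    chunks = []

    def flush():
        text = "\n".join(body).strip()
        if text:
            path = " > ".join(line for _, line in stack)
            chunks.append(f"{path}\n\n{text}" if path else text)
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or deeper level before pushing the new one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, line.strip()))
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Title\n\nIntro.\n\n## Section\n\nDetails."
result = chunks_with_heading_context(doc)
print(result)
```

With the path attached, a retrieved chunk such as "Details." still tells the reader which section it came from.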
v4.5.3
What's New
SLANeXT Table Structure Recognition
Alternative table structure backends alongside TATR. New table_model field on LayoutDetectionConfig selects the backend:
| Model | Config Value | Size | Best For |
|---|---|---|---|
| TATR | `"tatr"` (default) | 30 MB | General-purpose, consistent results |
| SLANeXT Wired | `"slanet_wired"` | 365 MB | Bordered/gridlined tables |
| SLANeXT Wireless | `"slanet_wireless"` | 365 MB | Borderless tables |
| SLANeXT Auto | `"slanet_auto"` | ~737 MB | Mixed documents (auto-classifies) |
| SLANet-plus | `"slanet_plus"` | 7.78 MB | Resource-constrained environments |
Available across all 12 language bindings and CLI (--layout-table-model).
Apple iWork Format Support
Native parsing for .pages, .numbers, and .key files (2013+ format) via protobuf text extraction from Snappy-compressed IWA containers.
Other Changes
- PP-LCNet table classifier for automatic wired/wireless table detection
- CLI `cache warm --all-table-models` for opt-in SLANeXT download (~730MB)
- ISO 21111-10 benchmark fixture with MinerU ground truth
- Format count updated to 91+
See CHANGELOG.md for full details.
v4.5.2
Fixed
- PDF word splitting in extracted text: Pdfium's text extraction inserted spurious spaces mid-word (e.g. `"s hall a b e active"` instead of `"shall be active"`). Added selective page-level respacing: pages with detected broken word spacing are re-extracted using character-level gap analysis (`font_size × 0.33` threshold). Clean pages use the fast single-call path. Reduces garbled lines from 406 to 0 on the ISO 21111-10 test document with no performance impact.
- Markdown underscore escaping: Underscores in extracted text (e.g. `CTC_ARP_01`) were incorrectly escaped as `CTC\_ARP\_01` throughout the markdown output. Underscore escaping has been removed entirely since extracted PDF text contains literal identifiers, not markdown formatting.
- Page header/footer leakage: Running headers like `ISO 21111-10:2021(E)` and copyright footers leaked into the document body. Added fuzzy alphanumeric matching to detect repeated header/footer text even when spacing or character extraction varies across pages.
- R batch function spurious NULL argument: R wrapper batch functions passed an extra `NULL` positional argument to native Rust functions, causing "unused argument" errors on all batch operations.
- Elixir Windows ORT DLL staging: ONNX Runtime DLL was only staged in `target/release/` but not in `priv/native/` where the BEAM VM loads NIFs. OCR/layout/embedding features now work correctly on Windows CI.
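The character-level gap analysis mentioned in the word-splitting fix can be pictured as follows. This Python sketch assumes each glyph comes with horizontal start and end coordinates and that a space is emitted only when the gap to the previous glyph exceeds `font_size × 0.33`; the data layout and function name are hypothetical:

```python
def respace(chars, font_size, ratio=0.33):
    """Rebuild a line from positioned glyphs, adding a space only at wide gaps.

    chars: list of (character, x_start, x_end) tuples in reading order.
    """
    threshold = font_size * ratio
    out = []
    previous_end = None
    for char, x_start, x_end in chars:
        if previous_end is not None and x_start - previous_end > threshold:
            out.append(" ")
        out.append(char)
        previous_end = x_end
    return "".join(out)


# Synthetic glyph positions: 0.5-unit gaps inside words, 4.5-unit gaps between them.
glyphs = []
x = 0.0
for word in ["shall", "be", "active"]:
    for ch in word:
        glyphs.append((ch, x, x + 5.0))
        x += 5.5
    x += 4.0

print(respace(glyphs, font_size=10.0))  # shall be active
```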
Added
- General extraction result caching: All file types (PDF, Office, HTML, archives, etc.) are now cached — not just OCR results. Repeated extractions of the same file with the same config return instantly from cache.
- Cache namespace isolation: New `cache_namespace` field on `ExtractionConfig` enables multi-tenant cache isolation on shared filesystems. Available via `--cache-namespace` CLI flag and across all language bindings.
- Per-request cache TTL: New `cache_ttl_secs` field on `ExtractionConfig` overrides the global TTL for individual extractions. Set to `0` to skip cache entirely. Available via `--cache-ttl-secs` CLI flag.
- Cache namespace deletion: `delete_namespace()` removes all cache entries under a namespace. `get_stats_filtered()` returns per-namespace statistics.
- Multi-worker cleanup safety: Cache cleanup no longer triggers excessively when multiple worker pods share the same cache directory.
- Bundled eng.traineddata: English OCR works out of the box with zero runtime configuration (~4MB bundled at build time).
- Tessdata in `cache warm`: `kreuzberg-cli cache warm` now downloads all tessdata_fast language files (~120 languages) to `KREUZBERG_CACHE_DIR/tessdata/`, giving full Tesseract language support without system packages.
- Tessdata in `cache manifest`: `kreuzberg-cli cache manifest` now includes all tessdata files with source URLs, enabling `--sync-cache` to download tessdata alongside models.
- `KREUZBERG_CACHE_DIR/tessdata` resolution: `resolve_tessdata_path()` now checks `KREUZBERG_CACHE_DIR/tessdata` and the bundled build path before falling back to system paths.
- CLI `embed` command: Generate vector embeddings from text via `kreuzberg embed --text "..." --preset balanced`.
- CLI `chunk` command: Split text into chunks via `kreuzberg chunk --text "..." --chunk-size 512`.
- CLI `completions` command: Generate shell completions for bash, zsh, fish, powershell.
- CLI `--log-level` global flag: Override `RUST_LOG` via `kreuzberg --log-level debug extract doc.pdf`.
- CLI extraction overrides: 27 flags exposed via `ExtractionOverrides` struct with `#[command(flatten)]`.
- CLI colored output: Text output uses `anstyle` for colored headers, labels, success values, and dim separators. Respects `NO_COLOR` env var.
- API `POST /detect`, `GET /version`, `GET /cache/manifest`, `POST /cache/warm`: New REST endpoints.
- MCP `get_version`, `cache_manifest`, `cache_warm`, `embed_text`, `chunk_text`: New MCP tools.
- Pipeline table extraction tracing: Zero-cost `tracing::trace!` and `tracing::debug!` logging throughout layout detection and table extraction.
- TATR model availability check: Layout detection returns an error if table regions are detected but the TATR model is unavailable.
Changed
- CLI batch flags: Batch command now supports all extraction override flags via shared `ExtractionOverrides` struct.
- CLI config architecture: Replaced 13-parameter function with `ExtractionOverrides` struct using `#[command(flatten)]`.
- MCP tool architecture: Removed dead `tools/` trait-based duplicates; all tools implemented directly in `server.rs`.
Improved
- CLI validation: OCR backend values, chunk size/overlap bounds, DPI range, layout confidence validated.
- API validation: Embedding preset names and chunk bounds checked.
- MCP validation: Empty paths rejected, chunk bounds checked, embedding preset validated.
- Chunk overlap auto-clamping: When `--chunk-size` is smaller than default overlap, overlap is automatically clamped to `size/4`.
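The overlap clamping rule is simple enough to state as code. In this Python sketch the default overlap is passed in as a parameter because the notes do not give its value, and integer division is an assumption:

```python
def clamp_overlap(chunk_size, overlap):
    """Return an overlap that fits the chunk size, clamping to chunk_size // 4."""
    if chunk_size < overlap:
        # The requested chunk is smaller than the overlap: fall back to a quarter.
        return chunk_size // 4
    return overlap


# 200 is a hypothetical default overlap used only for this example.
print(clamp_overlap(chunk_size=100, overlap=200))  # 25
print(clamp_overlap(chunk_size=512, overlap=200))  # 200
```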
See full changelog: https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md
v4.5.1
See CHANGELOG.md for release notes.
v4.5.0
See CHANGELOG.md for full release notes.
Benchmark Results 2026-03-21 (d062479)
Comparative benchmark results from workflow run 23359982805.
Commit: d062479
Date: 2026-03-21