Releases · kreuzberg-dev/kreuzberg

[4.6.3] - 2026-03-27

Added

Tower service layer (service module): Composable ExtractionService implementing tower::Service with configurable middleware layers (tracing, metrics, timeout, concurrency limit). New tower-service feature flag, auto-enabled by api and mcp. ExtractionServiceBuilder provides ergonomic layer composition.
Semantic OpenTelemetry conventions (telemetry module): Formal kreuzberg.* attribute namespace with 30+ span attributes, metric names, and operation/stage constants.
Extraction metrics: 11 OTel metric instruments (counters, histograms, gauge) covering extraction totals, durations, cache hits/misses, pipeline stages, OCR, and concurrent extractions. Feature-gated behind otel.
InstrumentedExtractor wrapper: Automatic per-extractor tracing spans and metrics without per-extractor annotations. Injected at registry dispatch when otel feature is enabled.

Improved

Deeper instrumentation: Pipeline post-processing stages, individual processor execution, OCR operations, and RT-DETR layout model inference now have semantic spans and duration metrics.
API and MCP servers use ExtractionService: Both consumers now route extractions through the Tower service stack.
Unified config merge: JSON config merge logic deduplicated between CLI and MCP.
API server hardening: Added response compression (gzip/brotli/zstd), panic recovery, request-ID correlation, and sensitive header redaction via tower-http middleware.

Changed

Removed per-extractor #[instrument] annotations: 29 manual annotations replaced by the automatic InstrumentedExtractor wrapper.
Span attribute names migrated to kreuzberg.* namespace: extraction.filename -> kreuzberg.document.filename, extraction.mime_type -> kreuzberg.document.mime_type, etc.
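
For dashboards or queries that still filter on the old attribute names, the renames listed above can be applied with a small lookup table. This is a minimal sketch and not part of Kreuzberg: the helper name `migrate_span_attributes` is hypothetical, and only the two mappings named in this entry are included (the remainder of the "etc." list is not reproduced).

```python
# Hypothetical helper: rewrites legacy span attribute keys to the kreuzberg.* namespace.
# Only the two mappings spelled out in the release notes are listed here.
RENAMED_ATTRIBUTES = {
    "extraction.filename": "kreuzberg.document.filename",
    "extraction.mime_type": "kreuzberg.document.mime_type",
}


def migrate_span_attributes(attributes: dict) -> dict:
    """Return a copy of `attributes` with known legacy keys renamed."""
    return {RENAMED_ATTRIBUTES.get(key, key): value for key, value in attributes.items()}


print(migrate_span_attributes({"extraction.filename": "report.pdf", "custom.tag": "x"}))
```

Unknown keys pass through unchanged, so the helper is safe to run over spans that already use the new namespace.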

Fixed

EPUB spine semantics refactor (#594): Richer OPF package model preserves manifest fallback chains, guide references, and non-linear spine items. Navigation chrome stripped from output. Malformed guide references now produce warnings instead of hard failures.
DOCX image extraction for <a:blip> with child elements (#591): Images with high-quality settings were not extracted. Now handles Event::Start for <a:blip>.
OCR table extraction returned empty results via pipeline path (#593): Layout detection and table propagation fixed for both code paths.
Missing chunker_type field in bindings (#592): Exposed across Python, TypeScript/WASM, Go, C#, PHP bindings.
Full API parity across all 10 bindings: Added max_archive_depth to all bindings. Added missing typed config classes for acceleration, email, layout, concurrency where needed.
Node Windows publish failure: Prepare script fallback replaced with cross-platform node -e.
CI Validate path triggers broadened: Covers docs/**, biome.json, .task/**, and other lintable paths.
Publish pipeline ORT bundling: Configurable strategy input (system/bundled) on setup-onnx-runtime action. Publish jobs now use strategy: bundled.
C FFI CI missing ORT setup: Added setup-onnx-runtime step to ci-c-ffi.yaml.

[4.6.2] - 2026-03-26

Added

PDF page rendering API (#583): New render_pdf_page function and PdfPageIterator for rendering individual PDF pages as PNG images. Available across all 11 language bindings with idiomatic patterns (Python context manager, Go Close(), Java AutoCloseable, C# IDisposable, Elixir Stream, etc.). Default 150 DPI, configurable per call.

Fixed

Table recognition coordinate mismatch on scanned PDFs (#582): Layout detection bboxes (640x640 model space) are now scaled to OCR render resolution before TATR table recognition. Previously, coordinate space mismatch caused zero tables to be found.
OCR elements report page_number: 1 for all pages (#582): Tesseract resets page numbers per single-page render. Page numbers are now correctly stamped after OCR in the batch loop.
Rust E2E tests missing PDF feature: Added pdf feature to the e2e-generator Rust template, fixing 41 UnsupportedFormat("application/pdf") failures.
HWP styled extraction empty on ARM: Added skip_on_platform support to Python and Java e2e generators, skipping the hwp_styled fixture on aarch64-unknown-linux-gnu.
WASM CI build failure: Made kreuzberg-node prepare script resilient to missing native addon, preventing ENOENT: dist/cli.js during pnpm workspace install.
Go C header stale at 4.5.0: Synced header and DefaultVersion constant to match current version.
Ruby gem missing ONNX Runtime: Added ort-bundled feature to Ruby native Cargo.toml.
Elixir doctest failures: Updated ExtractionConfig.to_map/1 doctests for force_ocr_pages field.
WASM benchmark timeout: Reduced per-extraction timeout from 600s to 120s and job timeout from 6h to 2h.
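
The coordinate fix for #582 amounts to rescaling each layout bounding box from the 640x640 model space to the resolution of the OCR render before table recognition runs. A minimal sketch of that rescaling follows. The function name and the tuple layout are hypothetical, and it assumes the page was stretched to the model input without letterboxing, which the release notes do not state.

```python
def scale_bbox(bbox, model_size, render_size):
    """Map an (x0, y0, x1, y1) box from model space to render space.

    Assumes a plain stretch resize (no padding or letterboxing).
    """
    scale_x = render_size[0] / model_size[0]
    scale_y = render_size[1] / model_size[1]
    x0, y0, x1, y1 = bbox
    return (x0 * scale_x, y0 * scale_y, x1 * scale_x, y1 * scale_y)


# A box detected in 640x640 model space, mapped onto a 1280x1920 page render.
print(scale_bbox((64, 160, 320, 480), (640, 640), (1280, 1920)))
```

Without this step a box that is valid in model space covers the wrong region of the rendered page, which is consistent with the "zero tables found" symptom described above.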

Improved

version:sync now syncs Go C header, DefaultVersion, and Docker compose tags: Prevents version drift across language bindings.
Publish pipeline commits Elixir NIF checksums back to main: Prevents stale checksums after releases.
WASM test app migrated to Deno: Replaced Node.js/vitest with Deno test runner, fixing fetch() unavailability.
Docs migrated from MkDocs to Zensical: 4-5x faster incremental builds.

v4.6.1

Fixes

OCR memory usage reduced 60-78%: Restructured the OCR batch rendering loop to render-and-encode one page at a time instead of holding all decoded RGB buffers simultaneously. A 98-page scanned PDF dropped from 4.6GB to 1.9GB peak RSS (batch_size=4), and from 3.3GB to 713MB (batch_size=1). Batch size now adapts to available system memory on Linux and macOS.
PDF control character encoding artifacts: PDFs with broken ToUnicode font mappings that produce U+0002 (STX) and other control characters where hyphens should appear now have these replaced with hyphens when between word characters, or stripped otherwise. Fixes garbled output like re\x02labelling → re-labelling.
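
The control-character rule described above (replace with a hyphen between word characters, strip otherwise) can be sketched with two regular expressions. This is an illustration, not Kreuzberg's implementation: the function name is hypothetical, and treating tab, newline and carriage return as legitimate whitespace is an assumption.

```python
import re

# C0 control characters, excluding tab, newline and carriage return (assumption).
_CONTROL = r"[\x00-\x08\x0b\x0c\x0e-\x1f]"
_BETWEEN_WORD_CHARS = re.compile(r"(?<=\w)" + _CONTROL + r"(?=\w)")
_ANY_CONTROL = re.compile(_CONTROL)


def clean_control_chars(text: str) -> str:
    """Turn control characters between word characters into hyphens, drop the rest."""
    text = _BETWEEN_WORD_CHARS.sub("-", text)
    return _ANY_CONTROL.sub("", text)


print(clean_control_chars("re\x02labelling"))
```

The order matters: the hyphen substitution has to run first, because the second pass removes every remaining control character regardless of context.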

v4.6.0 — Recursive Archives, DocumentStructure, Bug Fixes

Added

Recursive archive extraction: Archives (ZIP, TAR, 7Z, GZIP) now recursively extract all processable files, each with its own ExtractionResult. New ArchiveEntry type and max_archive_depth config.
YAML/JSON section chunker: New ChunkerType::Yaml with full hierarchy paths and auto-inference from metadata.
Unified DocumentStructure: Extended with 7 new node types, 4 annotation kinds, attributes bag. All 35 extractors produce native DocumentStructure.
Document-level OCR: process_document() for whole-file extraction — up to 30% faster on multi-page documents.
DocBook/JATS inline annotations: Semantic formatting mapped to AnnotationKind variants.
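
To illustrate what a depth-limited recursive archive walk looks like, here is a minimal sketch using Python's standard `zipfile` module on in-memory ZIPs only. It is not the Kreuzberg API: `walk_archive` and `make_zip` are hypothetical names, and the interpretation of the depth limit (the number of archive levels that get opened) is an assumption about how `max_archive_depth` behaves.

```python
import io
import zipfile


def walk_archive(data, max_depth, _depth=0, _prefix=""):
    """Yield (path, bytes) for each file, descending into nested ZIPs up to max_depth levels."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            payload = archive.read(info)
            path = _prefix + info.filename
            is_nested = info.filename.lower().endswith(".zip")
            if is_nested and _depth + 1 < max_depth:
                # Open the nested archive and report its members under its own path.
                yield from walk_archive(payload, max_depth, _depth + 1, path + "/")
            else:
                # Regular file, or a nested archive beyond the depth limit.
                yield path, payload


def make_zip(files):
    """Build an in-memory ZIP from a {name: bytes} mapping (test helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


inner = make_zip({"notes.txt": b"inner"})
outer = make_zip({"readme.txt": b"outer", "inner.zip": inner})
print([path for path, _ in walk_archive(outer, max_depth=2)])
```

With `max_depth=1` the nested archive is returned as a single opaque entry instead of being opened, which is the kind of guard a depth limit provides against deeply nested or malicious archives.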

Changed

CSV extraction: Produces Row N: Header: Value format for better embedding quality.
XML extraction: Indented hierarchical output preserving element tree.
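
The "Row N: Header: Value" shape for CSV output can be approximated as follows. This is only a sketch of the idea: the function name is hypothetical, and the field separator (", "), one line per row, and 1-based numbering are assumptions, since the release notes give only the general pattern.

```python
import csv
import io


def csv_to_row_text(csv_text: str) -> str:
    """Render each CSV record as 'Row N: header: value, ...' on its own line."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for number, row in enumerate(reader, start=1):
        fields = ", ".join(f"{header}: {value}" for header, value in row.items())
        lines.append(f"Row {number}: {fields}")
    return "\n".join(lines)


print(csv_to_row_text("name,age\nAda,36\nLinus,54\n"))
```

Repeating the header next to every value keeps each row self-describing, which is why this layout tends to embed better than raw comma-separated text.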

Improved

Zero-copy file I/O: memmap2 + simdutf8 SIMD UTF-8 validation for large files.
Unified concurrency management: Centralized thread budget with configurable ConcurrencyConfig.

Fixed

#557: Auto-enable extract_pages for element-based output — correct page numbers without manual PageConfig.
#558: Fixed misleading PageConfig docstring defaults.
#560: MSG extraction now supports compressed RTF bodies (PR_RTF_COMPRESSED).
#561: Indexed colour PDF images now decode correctly with palette lookup.
ODT extraction robustness improvements.

See CHANGELOG.md for full details.

[4.5.4] - 2026-03-23

Fixed

PDF image extraction panic on mismatched buffer lengths (#552): Replaced assert! with graceful error handling. Malformed PDF images are now skipped instead of panicking. Regression from v4.5.0.
pdf feature compilation without layout-detection (#550): config.layout reference gated behind #[cfg(feature = "layout-detection")].
WASM module resolution in Supabase/Deno edge functions (#551): Added explicit package.json exports and Deno detection in wasm-loader.
zip dependency pinned below 7.4: Avoids let-chain build failures on some stable Rust toolchains (#549).
Vendored HWP text extraction: Replaced external hwpers crate with vendored subset (~1,650 lines). Eliminates zip 2.x transitive dependency that caused WASM/CI build failures.
Ruby binding missing table_model field in LayoutDetectionConfig initializer.
Clippy/unused variable warnings in table recognition and pipeline modules.

Added

prepend_heading_context chunking option: When true and chunker_type is Markdown, prepends the heading hierarchy path (e.g. # Title > ## Section) to each chunk's content string. Useful for RAG pipelines where chunks need self-contained structural context. Available across all 10 language bindings, CLI, and WASM. Includes fixture-driven e2e tests and documentation for all languages.
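
The effect of prepending the heading path can be sketched as follows. This is an illustration rather than Kreuzberg's chunker: the function name is hypothetical, and for brevity it splits on headings instead of by chunk size and does not special-case `#` lines inside code fences.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_with_heading_context(markdown: str) -> list:
    """Split Markdown at headings and prefix each chunk with its heading path."""
    stack = []   # (level, heading line) entries for the current hierarchy
    chunks = []
    body = []

    def flush():
        text = "\n".join(body).strip()
        if text:
            path = " > ".join(line for _, line in stack)
            chunks.append(f"{path}\n\n{text}" if path else text)
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or deeper level before pushing the new one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, line.strip()))
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Title\n\nIntro.\n\n## Section\n\nDetails.\n\n## Other\n\nMore."
for chunk in chunk_with_heading_context(doc):
    print(repr(chunk))
```

Each chunk carries its own "# Title > ## Section" prefix, so a retrieved chunk remains interpretable without the surrounding document.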

v4.5.3

What's New

SLANeXT Table Structure Recognition

Alternative table structure backends alongside TATR. New table_model field on LayoutDetectionConfig selects the backend:

Model	Config Value	Size	Best For
TATR	`"tatr"` (default)	30 MB	General-purpose, consistent results
SLANeXT Wired	`"slanet_wired"`	365 MB	Bordered/gridlined tables
SLANeXT Wireless	`"slanet_wireless"`	365 MB	Borderless tables
SLANeXT Auto	`"slanet_auto"`	~737 MB	Mixed documents (auto-classifies)
SLANet-plus	`"slanet_plus"`	7.78 MB	Resource-constrained environments

Available across all 12 language bindings and CLI (--layout-table-model).

Apple iWork Format Support

Native parsing for .pages, .numbers, and .key files (2013+ format) via protobuf text extraction from Snappy-compressed IWA containers.

Other Changes

PP-LCNet table classifier for automatic wired/wireless table detection
CLI cache warm --all-table-models for opt-in SLANeXT download (~730MB)
ISO 21111-10 benchmark fixture with MinerU ground truth
Format count updated to 91+

See CHANGELOG.md for full details.

v4.5.2

Fixed

PDF word splitting in extracted text: Pdfium's text extraction inserted spurious spaces mid-word (e.g. "s hall a b e active" instead of "shall be active"). Added selective page-level respacing: pages with detected broken word spacing are re-extracted using character-level gap analysis (font_size × 0.33 threshold). Clean pages use the fast single-call path. Reduces garbled lines from 406 to 0 on the ISO 21111-10 test document with no performance impact.
Markdown underscore escaping: Underscores in extracted text (e.g. CTC_ARP_01) were incorrectly escaped as CTC\_ARP\_01 throughout the markdown output. Underscore escaping has been removed entirely since extracted PDF text contains literal identifiers, not markdown formatting.
Page header/footer leakage: Running headers like ISO 21111-10:2021(E) and copyright footers leaked into the document body. Added fuzzy alphanumeric matching to detect repeated header/footer text even when spacing or character extraction varies across pages.
R batch function spurious NULL argument: R wrapper batch functions passed an extra NULL positional argument to native Rust functions, causing "unused argument" errors on all batch operations.
Elixir Windows ORT DLL staging: ONNX Runtime DLL was only staged in target/release/ but not in priv/native/ where the BEAM VM loads NIFs. OCR/layout/embedding features now work correctly on Windows CI.
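
The character-level gap analysis mentioned in the PDF word splitting fix can be illustrated with a short sketch: a space is emitted only where the horizontal gap between consecutive glyphs exceeds font_size × 0.33. The function name and the `(glyph, x_start, x_end)` input layout are hypothetical, and the sketch assumes whitespace glyphs have already been dropped from the input.

```python
def respace_line(chars, font_size, ratio=0.33):
    """Rebuild a line from (glyph, x_start, x_end) tuples given in reading order.

    A space is inserted only where the gap to the previous glyph exceeds
    font_size * ratio.
    """
    threshold = font_size * ratio
    pieces = []
    previous_end = None
    for glyph, x_start, x_end in chars:
        if previous_end is not None and x_start - previous_end > threshold:
            pieces.append(" ")
        pieces.append(glyph)
        previous_end = x_end
    return "".join(pieces)


# Glyph boxes for "shall be" at font size 10: only the gap before "b" exceeds 3.3.
glyphs = [
    ("s", 0.0, 5.0), ("h", 5.5, 10.5), ("a", 11.0, 16.0), ("l", 16.5, 19.0),
    ("l", 19.5, 22.0), ("b", 27.0, 32.0), ("e", 32.5, 37.5),
]
print(respace_line(glyphs, font_size=10))
```

Scaling the threshold by font size keeps the rule stable across headings and body text, where absolute gaps differ widely.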

Added

General extraction result caching: All file types (PDF, Office, HTML, archives, etc.) are now cached — not just OCR results. Repeated extractions of the same file with the same config return instantly from cache.
Cache namespace isolation: New cache_namespace field on ExtractionConfig enables multi-tenant cache isolation on shared filesystems. Available via --cache-namespace CLI flag and across all language bindings.
Per-request cache TTL: New cache_ttl_secs field on ExtractionConfig overrides the global TTL for individual extractions. Set to 0 to skip cache entirely. Available via --cache-ttl-secs CLI flag.
Cache namespace deletion: delete_namespace() removes all cache entries under a namespace. get_stats_filtered() returns per-namespace statistics.
Multi-worker cleanup safety: Cache cleanup no longer triggers excessively when multiple worker pods share the same cache directory.
Bundled eng.traineddata: English OCR works out of the box with zero runtime configuration (~4MB bundled at build time).
Tessdata in cache warm: kreuzberg-cli cache warm now downloads all tessdata_fast language files (~120 languages) to KREUZBERG_CACHE_DIR/tessdata/, giving full Tesseract language support without system packages.
Tessdata in cache manifest: kreuzberg-cli cache manifest now includes all tessdata files with source URLs, enabling --sync-cache to download tessdata alongside models.
KREUZBERG_CACHE_DIR/tessdata resolution: resolve_tessdata_path() now checks KREUZBERG_CACHE_DIR/tessdata and the bundled build path before falling back to system paths.
CLI embed command: Generate vector embeddings from text via kreuzberg embed --text "..." --preset balanced.
CLI chunk command: Split text into chunks via kreuzberg chunk --text "..." --chunk-size 512.
CLI completions command: Generate shell completions for bash, zsh, fish, powershell.
CLI --log-level global flag: Override RUST_LOG via kreuzberg --log-level debug extract doc.pdf.
CLI extraction overrides: 27 flags exposed via ExtractionOverrides struct with #[command(flatten)].
CLI colored output: Text output uses anstyle for colored headers, labels, success values, and dim separators. Respects NO_COLOR env var.
API POST /detect, GET /version, GET /cache/manifest, POST /cache/warm: New REST endpoints.
MCP get_version, cache_manifest, cache_warm, embed_text, chunk_text: New MCP tools.
Pipeline table extraction tracing: Zero-cost tracing::trace! and tracing::debug! logging throughout layout detection and table extraction.
TATR model availability check: Layout detection returns an error if table regions are detected but the TATR model is unavailable.
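
The caching entries above combine three ideas: results are keyed by file content and config, keys are partitioned by namespace, and a per-request TTL of 0 bypasses the cache. The in-memory sketch below shows how these fit together. It is a hypothetical stand-in, not Kreuzberg's cache: the class, the `get_or_extract` method and the key derivation are all assumptions, and only the `delete_namespace` name is borrowed from the release notes.

```python
import hashlib
import time


class ExtractionCache:
    """In-memory stand-in for a namespaced extraction result cache (hypothetical)."""

    def __init__(self, default_ttl_secs=3600):
        self.default_ttl_secs = default_ttl_secs
        self._entries = {}  # (namespace, key) -> (stored_at, result)

    @staticmethod
    def _key(content: bytes, config: dict) -> str:
        # Same bytes with the same config always produce the same key.
        digest = hashlib.sha256(content)
        digest.update(repr(sorted(config.items())).encode())
        return digest.hexdigest()

    def get_or_extract(self, content, config, extract, namespace="default", ttl_secs=None):
        ttl = self.default_ttl_secs if ttl_secs is None else ttl_secs
        if ttl == 0:
            return extract(content)  # a TTL of 0 skips the cache entirely
        key = (namespace, self._key(content, config))
        now = time.monotonic()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < ttl:
            return hit[1]
        result = extract(content)
        self._entries[key] = (now, result)
        return result

    def delete_namespace(self, namespace):
        """Remove every entry stored under `namespace`; return how many were removed."""
        stale = [key for key in self._entries if key[0] == namespace]
        for key in stale:
            del self._entries[key]
        return len(stale)
```

Because the namespace is part of the key, two tenants extracting the same file never share an entry, and one tenant's entries can be dropped without touching the other's.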

Changed

CLI batch flags: Batch command now supports all extraction override flags via shared ExtractionOverrides struct.
CLI config architecture: Replaced 13-parameter function with ExtractionOverrides struct using #[command(flatten)].
MCP tool architecture: Removed dead tools/ trait-based duplicates; all tools implemented directly in server.rs.

Improved

CLI validation: OCR backend values, chunk size/overlap bounds, DPI range, layout confidence validated.
API validation: Embedding preset names and chunk bounds checked.
MCP validation: Empty paths rejected, chunk bounds checked, embedding preset validated.
Chunk overlap auto-clamping: When --chunk-size is smaller than default overlap, overlap is automatically clamped to size/4.
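
The overlap clamping rule can be written down in a few lines. This is a sketch under stated assumptions: the function name is hypothetical, integer division is assumed for "size/4", and the overlap value in the example is arbitrary because the release notes do not state the default.

```python
def effective_overlap(chunk_size: int, overlap: int) -> int:
    """Clamp the overlap to chunk_size // 4 when the chunk is smaller than the overlap."""
    if chunk_size < overlap:
        return chunk_size // 4
    return overlap


print(effective_overlap(chunk_size=100, overlap=200))
print(effective_overlap(chunk_size=512, overlap=200))
```

Clamping instead of rejecting the request means a small `--chunk-size` still works without also having to pass an explicit overlap.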

See full changelog: https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md

v4.5.1

See CHANGELOG.md for release notes.

v4.5.0

See CHANGELOG.md for full release notes.

Benchmark Results 2026-03-21 (d062479)

Comparative benchmark results from workflow run 23359982805.

Commit: d062479
Date: 2026-03-21
