Releases: kreuzberg-dev/kreuzberg

v4.6.3

27 Mar 13:56
b569454

[4.6.3] - 2026-03-27

Added

  • Tower service layer (service module): Composable ExtractionService implementing tower::Service with configurable middleware layers (tracing, metrics, timeout, concurrency limit). New tower-service feature flag, auto-enabled by api and mcp. ExtractionServiceBuilder provides ergonomic layer composition.
  • Semantic OpenTelemetry conventions (telemetry module): Formal kreuzberg.* attribute namespace with 30+ span attributes, metric names, and operation/stage constants.
  • Extraction metrics: 11 OTel metric instruments (counters, histograms, gauge) covering extraction totals, durations, cache hits/misses, pipeline stages, OCR, and concurrent extractions. Feature-gated behind otel.
  • InstrumentedExtractor wrapper: Automatic per-extractor tracing spans and metrics without per-extractor annotations. Injected at registry dispatch when otel feature is enabled.

Improved

  • Deeper instrumentation: Pipeline post-processing stages, individual processor execution, OCR operations, and RT-DETR layout model inference now have semantic spans and duration metrics.
  • API and MCP servers use ExtractionService: Both consumers now route extractions through the Tower service stack.
  • Unified config merge: JSON config merge logic deduplicated between CLI and MCP.
  • API server hardening: Added response compression (gzip/brotli/zstd), panic recovery, request-ID correlation, and sensitive header redaction via tower-http middleware.

Changed

  • Removed per-extractor #[instrument] annotations: 29 manual annotations replaced by the automatic InstrumentedExtractor wrapper.
  • Span attribute names migrated to kreuzberg.* namespace: extraction.filename -> kreuzberg.document.filename, extraction.mime_type -> kreuzberg.document.mime_type, etc.
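
Consumers that still query the old attribute names (dashboards, span processors, log filters) need the same rename. The following is a minimal sketch, assuming only the two mappings named above; the "etc." covers further renames that are not reproduced here, and `migrate_attributes` is a hypothetical helper, not part of kreuzberg.

```python
# Only the two renames listed in the release note; consult the changelog for the rest.
LEGACY_TO_NAMESPACED = {
    "extraction.filename": "kreuzberg.document.filename",
    "extraction.mime_type": "kreuzberg.document.mime_type",
}


def migrate_attributes(attributes: dict) -> dict:
    """Return a copy of the span attributes with known legacy keys renamed.

    Keys that are not in the mapping pass through unchanged.
    """
    return {LEGACY_TO_NAMESPACED.get(key, key): value for key, value in attributes.items()}


legacy_span = {
    "extraction.filename": "report.pdf",
    "extraction.mime_type": "application/pdf",
    "http.method": "POST",
}
print(migrate_attributes(legacy_span))
```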

Fixed

  • EPUB spine semantics refactor (#594): Richer OPF package model preserves manifest fallback chains, guide references, and non-linear spine items. Navigation chrome stripped from output. Malformed guide references now produce warnings instead of hard failures.
  • DOCX image extraction for <a:blip> with child elements (#591): Images with high-quality settings were not extracted. Now handles Event::Start for <a:blip>.
  • OCR table extraction returned empty results via pipeline path (#593): Layout detection and table propagation fixed for both code paths.
  • Missing chunker_type field in bindings (#592): Exposed across Python, TypeScript/WASM, Go, C#, PHP bindings.
  • Full API parity across all 10 bindings: Added max_archive_depth to all bindings. Added missing typed config classes for acceleration, email, layout, concurrency where needed.
  • Node Windows publish failure: Prepare script fallback replaced with cross-platform node -e.
  • CI Validate path triggers broadened: Covers docs/**, biome.json, .task/**, and other lintable paths.
  • Publish pipeline ORT bundling: Configurable strategy input (system/bundled) on setup-onnx-runtime action. Publish jobs now use strategy: bundled.
  • C FFI CI missing ORT setup: Added setup-onnx-runtime step to ci-c-ffi.yaml.

v4.6.2

26 Mar 10:47
38b53e0

[4.6.2] - 2026-03-26

Added

  • PDF page rendering API (#583): New render_pdf_page function and PdfPageIterator for rendering individual PDF pages as PNG images. Available across all 11 language bindings with idiomatic patterns (Python context manager, Go Close(), Java AutoCloseable, C# IDisposable, Elixir Stream, etc.). Default 150 DPI, configurable per call.

Fixed

  • Table recognition coordinate mismatch on scanned PDFs (#582): Layout detection bboxes (640x640 model space) are now scaled to OCR render resolution before TATR table recognition. Previously, coordinate space mismatch caused zero tables to be found.
  • OCR elements report page_number: 1 for all pages (#582): Tesseract resets page numbers per single-page render. Page numbers are now correctly stamped after OCR in the batch loop.
  • Rust E2E tests missing PDF feature: Added pdf feature to the e2e-generator Rust template, fixing 41 UnsupportedFormat("application/pdf") failures.
  • HWP styled extraction empty on ARM: Added skip_on_platform support to Python and Java e2e generators, skipping the hwp_styled fixture on aarch64-unknown-linux-gnu.
  • WASM CI build failure: Made kreuzberg-node prepare script resilient to missing native addon, preventing ENOENT: dist/cli.js during pnpm workspace install.
  • Go C header stale at 4.5.0: Synced header and DefaultVersion constant to match current version.
  • Ruby gem missing ONNX Runtime: Added ort-bundled feature to Ruby native Cargo.toml.
  • Elixir doctest failures: Updated ExtractionConfig.to_map/1 doctests for force_ocr_pages field.
  • WASM benchmark timeout: Reduced per-extraction timeout from 600s to 120s and job timeout from 6h to 2h.
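
The coordinate fix in #582 comes down to rescaling boxes from the 640x640 model space to the resolution of the rendered page. The sketch below illustrates that arithmetic only. It assumes the page was stretched to 640x640 without letterboxing and that boxes are `(x0, y0, x1, y1)` tuples; both are assumptions, and `scale_bbox` is a hypothetical name rather than kreuzberg's implementation.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout

MODEL_SIZE = 640  # layout model space is 640x640 per the release note


def scale_bbox(bbox: BBox, render_width: int, render_height: int) -> BBox:
    """Map a box from model space to the pixel space of the OCR render.

    Assumes a plain stretch (no padding), so each axis has its own scale factor.
    """
    sx = render_width / MODEL_SIZE
    sy = render_height / MODEL_SIZE
    x0, y0, x1, y1 = bbox
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)


# A box in model space mapped onto a 1280x1920 render (scale factors 2 and 3).
print(scale_bbox((64, 128, 320, 512), 1280, 1920))
```

Without this step, a table region detected in model space points at the wrong pixels of the larger render, which is consistent with the "zero tables found" symptom described above.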

Improved

  • version:sync now syncs Go C header, DefaultVersion, and Docker compose tags: Prevents version drift across language bindings.
  • Publish pipeline commits Elixir NIF checksums back to main: Prevents stale checksums after releases.
  • WASM test app migrated to Deno: Replaced Node.js/vitest with Deno test runner, fixing fetch() unavailability.
  • Docs migrated from MkDocs to Zensical: 4-5x faster incremental builds.

v4.6.1

25 Mar 17:45
8e819b1

Fixed

  • OCR memory usage reduced 60-78%: Restructured the OCR batch rendering loop to render-and-encode one page at a time instead of holding all decoded RGB buffers simultaneously. A 98-page scanned PDF dropped from 4.6GB to 1.9GB peak RSS (batch_size=4), and from 3.3GB to 713MB (batch_size=1). Batch size now adapts to available system memory on Linux and macOS.
  • PDF control character encoding artifacts: PDFs with broken ToUnicode font mappings can emit U+0002 (STX) and other control characters where hyphens should appear. These are now replaced with hyphens when they fall between word characters and stripped otherwise. Fixes garbled output such as re\x02labelling -> re-labelling.
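
The control character rule can be expressed as two regular-expression passes. This is a minimal sketch of the described behaviour, not kreuzberg's code: it assumes "control characters" means the C0 range excluding tab, newline, and carriage return, and that "word characters" matches the regex class `\w`.

```python
import re

# C0 control characters, excluding tab (\x09), newline (\x0a), and carriage return (\x0d).
_CONTROL = r"[\x00-\x08\x0b\x0c\x0e-\x1f]"
_BETWEEN_WORD_CHARS = re.compile(rf"(?<=\w){_CONTROL}(?=\w)")
_ANY_CONTROL = re.compile(_CONTROL)


def clean_control_chars(text: str) -> str:
    """Replace a control character between word characters with a hyphen; strip the rest."""
    text = _BETWEEN_WORD_CHARS.sub("-", text)
    return _ANY_CONTROL.sub("", text)


print(clean_control_chars("re\x02labelling"))
```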

v4.6.0

25 Mar 08:14
7f61b87

v4.6.0 — Recursive Archives, DocumentStructure, Bug Fixes

Added

  • Recursive archive extraction: Archives (ZIP, TAR, 7Z, GZIP) now recursively extract all processable files, each with its own ExtractionResult. New ArchiveEntry type and max_archive_depth config.
  • YAML/JSON section chunker: New ChunkerType::Yaml with full hierarchy paths and auto-inference from metadata.
  • Unified DocumentStructure: Extended with 7 new node types, 4 annotation kinds, attributes bag. All 35 extractors produce native DocumentStructure.
  • Document-level OCR: process_document() for whole-file extraction — up to 30% faster on multi-page documents.
  • DocBook/JATS inline annotations: Semantic formatting mapped to AnnotationKind variants.
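
To illustrate how a depth limit bounds recursive archive extraction, here is a small sketch using only Python's standard `zipfile` module. It handles ZIP only, detects nested archives by file extension, and treats the depth limit as "number of nested levels to descend into". All of these are assumptions for illustration; `extract_recursive` and `make_zip` are hypothetical helpers and do not reflect how kreuzberg implements `max_archive_depth`.

```python
import io
import zipfile


def make_zip(members: dict) -> bytes:
    """Build an in-memory ZIP from a {name: bytes} mapping (helper for the demo)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in members.items():
            archive.writestr(name, content)
    return buffer.getvalue()


def extract_recursive(data: bytes, max_depth: int, depth: int = 0, prefix: str = ""):
    """Yield (path, bytes) for each file, descending into nested .zip members.

    Nested archives beyond max_depth are yielded as opaque files instead of being opened.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            payload = archive.read(info)
            path = prefix + info.filename
            if info.filename.lower().endswith(".zip") and depth < max_depth:
                yield from extract_recursive(payload, max_depth, depth + 1, path + "/")
            else:
                yield path, payload


inner = make_zip({"b.txt": b"inner"})
outer = make_zip({"a.txt": b"outer", "nested.zip": inner})
print(sorted(path for path, _ in extract_recursive(outer, max_depth=1)))
```

A depth cap of this kind also limits the damage from deliberately nested archives, since recursion stops at a fixed level regardless of input.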

Changed

  • CSV extraction: Produces Row N: Header: Value format for better embedding quality.
  • XML extraction: Indented hierarchical output preserving element tree.
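
The CSV change pairs every value with its column header so that each row is self-describing when embedded on its own. The sketch below shows the general idea with the standard `csv` module. The exact layout kreuzberg emits is not specified beyond "Row N: Header: Value", so the comma separator between pairs and the one-line-per-row layout are assumptions.

```python
import csv
import io


def csv_to_labeled_rows(text: str) -> list[str]:
    """Render each CSV row as 'Row N: Header: Value, Header: Value, ...'."""
    reader = csv.DictReader(io.StringIO(text))
    lines = []
    for number, row in enumerate(reader, start=1):
        pairs = ", ".join(f"{header}: {value}" for header, value in row.items())
        lines.append(f"Row {number}: {pairs}")
    return lines


for line in csv_to_labeled_rows("name,age\nAda,36\nAlan,41\n"):
    print(line)
```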

Improved

  • Zero-copy file I/O: memmap2 + simdutf8 SIMD UTF-8 validation for large files.
  • Unified concurrency management: Centralized thread budget with configurable ConcurrencyConfig.

Fixed

  • #557: Auto-enable extract_pages for element-based output — correct page numbers without manual PageConfig.
  • #558: Fixed misleading PageConfig docstring defaults.
  • #560: MSG extraction now supports compressed RTF bodies (PR_RTF_COMPRESSED).
  • #561: Indexed colour PDF images now decode correctly with palette lookup.
  • ODT extraction robustness improvements.

See CHANGELOG.md for full details.

v4.5.4

23 Mar 10:00
6343906

[4.5.4] - 2026-03-23

Fixed

  • PDF image extraction panic on mismatched buffer lengths (#552): Replaced assert! with graceful error handling. Malformed PDF images are now skipped instead of panicking. Regression from v4.5.0.
  • pdf feature compilation without layout-detection (#550): config.layout reference gated behind #[cfg(feature = "layout-detection")].
  • WASM module resolution in Supabase/Deno edge functions (#551): Added explicit package.json exports and Deno detection in wasm-loader.
  • zip dependency pinned below 7.4: Avoids let-chain build failures on some stable Rust toolchains (#549).
  • Vendored HWP text extraction: Replaced external hwpers crate with vendored subset (~1,650 lines). Eliminates zip 2.x transitive dependency that caused WASM/CI build failures.
  • Ruby binding missing table_model field in LayoutDetectionConfig initializer.
  • Clippy/unused variable warnings in table recognition and pipeline modules.

Added

  • prepend_heading_context chunking option: When true and chunker_type is Markdown, prepends the heading hierarchy path (e.g. # Title > ## Section) to each chunk's content string. Useful for RAG pipelines where chunks need self-contained structural context. Available across all 10 language bindings, CLI, and WASM. Includes fixture-driven e2e tests and documentation for all languages.
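
The effect of prepend_heading_context can be sketched with a toy section splitter. This is not kreuzberg's Markdown chunker: it emits one chunk per section rather than size-bounded chunks, and the blank line between the heading path and the chunk text is an assumption. Only the `# Title > ## Section` path format comes from the note above.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunks_with_heading_context(markdown: str) -> list[str]:
    """Split Markdown at headings and prefix each chunk with its heading path."""
    stack = []   # (level, heading line) pairs forming the current path
    body = []    # lines collected since the last heading
    chunks = []

    def flush():
        text = "\n".join(body).strip()
        if text:
            path = " > ".join(line for _, line in stack)
            chunks.append(f"{path}\n\n{text}" if path else text)
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or deeper level before pushing the new one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, line.strip()))
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Title\n\nIntro.\n\n## Section\n\nBody.\n\n## Other\n\nMore."
for chunk in chunks_with_heading_context(doc):
    print(repr(chunk))
```

Each chunk then carries enough structural context to be understood in isolation, which is the RAG use case the option targets.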

v4.5.3

22 Mar 14:53
62cfaf3

What's New

SLANeXT Table Structure Recognition

Alternative table structure backends alongside TATR. New table_model field on LayoutDetectionConfig selects the backend:

Model            | Config Value      | Size    | Best For
TATR             | "tatr" (default)  | 30 MB   | General-purpose, consistent results
SLANeXT Wired    | "slanet_wired"    | 365 MB  | Bordered/gridlined tables
SLANeXT Wireless | "slanet_wireless" | 365 MB  | Borderless tables
SLANeXT Auto     | "slanet_auto"     | ~737 MB | Mixed documents (auto-classifies)
SLANet-plus      | "slanet_plus"     | 7.78 MB | Resource-constrained environments

Available across all 12 language bindings and CLI (--layout-table-model).

Apple iWork Format Support

Native parsing for .pages, .numbers, and .key files (2013+ format) via protobuf text extraction from Snappy-compressed IWA containers.

Other Changes

  • PP-LCNet table classifier for automatic wired/wireless table detection
  • CLI cache warm --all-table-models for opt-in SLANeXT download (~730MB)
  • ISO 21111-10 benchmark fixture with MinerU ground truth
  • Format count updated to 91+

See CHANGELOG.md for full details.

v4.5.2

21 Mar 22:34
db2e1bd

Fixed

  • PDF word splitting in extracted text: Pdfium's text extraction inserted spurious spaces mid-word (e.g. "s hall a b e active" instead of "shall be active"). Added selective page-level respacing: pages with detected broken word spacing are re-extracted using character-level gap analysis (font_size × 0.33 threshold). Clean pages use the fast single-call path. Reduces garbled lines from 406 to 0 on the ISO 21111-10 test document with no performance impact.
  • Markdown underscore escaping: Underscores in extracted text (e.g. CTC_ARP_01) were incorrectly escaped as CTC\_ARP\_01 throughout the markdown output. Underscore escaping has been removed entirely since extracted PDF text contains literal identifiers, not markdown formatting.
  • Page header/footer leakage: Running headers like ISO 21111-10:2021(E) and copyright footers leaked into the document body. Added fuzzy alphanumeric matching to detect repeated header/footer text even when spacing or character extraction varies across pages.
  • R batch function spurious NULL argument: R wrapper batch functions passed an extra NULL positional argument to native Rust functions, causing "unused argument" errors on all batch operations.
  • Elixir Windows ORT DLL staging: ONNX Runtime DLL was only staged in target/release/ but not in priv/native/ where the BEAM VM loads NIFs. OCR/layout/embedding features now work correctly on Windows CI.
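
The character-level gap analysis behind the word-splitting fix can be sketched as follows. Only the font_size × 0.33 threshold comes from the note above. The `Char` record, the use of horizontal edges, and the assumption that the input contains glyphs only (no extractor-supplied spaces) are illustrative choices, not the library's data model.

```python
from dataclasses import dataclass


@dataclass
class Char:
    text: str
    x0: float         # left edge of the glyph
    x1: float         # right edge of the glyph
    font_size: float


GAP_RATIO = 0.33  # threshold from the release note: font_size x 0.33


def respace(chars: list[Char]) -> str:
    """Rebuild a line of text, inserting a space only where the gap is wide enough."""
    out = []
    prev = None
    for ch in chars:
        if prev is not None and ch.x0 - prev.x1 > GAP_RATIO * prev.font_size:
            out.append(" ")
        out.append(ch.text)
        prev = ch
    return "".join(out)


# Two glyphs 0.5 units apart stay joined; a 4-unit gap at 10pt (threshold 3.3) splits.
line = [Char("o", 0, 5, 10), Char("k", 5.5, 10.5, 10), Char("a", 14.5, 19.5, 10)]
print(respace(line))
```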

Added

  • General extraction result caching: All file types (PDF, Office, HTML, archives, etc.) are now cached — not just OCR results. Repeated extractions of the same file with the same config return instantly from cache.
  • Cache namespace isolation: New cache_namespace field on ExtractionConfig enables multi-tenant cache isolation on shared filesystems. Available via --cache-namespace CLI flag and across all language bindings.
  • Per-request cache TTL: New cache_ttl_secs field on ExtractionConfig overrides the global TTL for individual extractions. Set to 0 to skip cache entirely. Available via --cache-ttl-secs CLI flag.
  • Cache namespace deletion: delete_namespace() removes all cache entries under a namespace. get_stats_filtered() returns per-namespace statistics.
  • Multi-worker cleanup safety: Cache cleanup no longer triggers excessively when multiple worker pods share the same cache directory.
  • Bundled eng.traineddata: English OCR works out of the box with zero runtime configuration (~4MB bundled at build time).
  • Tessdata in cache warm: kreuzberg-cli cache warm now downloads all tessdata_fast language files (~120 languages) to KREUZBERG_CACHE_DIR/tessdata/, giving full Tesseract language support without system packages.
  • Tessdata in cache manifest: kreuzberg-cli cache manifest now includes all tessdata files with source URLs, enabling --sync-cache to download tessdata alongside models.
  • KREUZBERG_CACHE_DIR/tessdata resolution: resolve_tessdata_path() now checks KREUZBERG_CACHE_DIR/tessdata and the bundled build path before falling back to system paths.
  • CLI embed command: Generate vector embeddings from text via kreuzberg embed --text "..." --preset balanced.
  • CLI chunk command: Split text into chunks via kreuzberg chunk --text "..." --chunk-size 512.
  • CLI completions command: Generate shell completions for bash, zsh, fish, powershell.
  • CLI --log-level global flag: Override RUST_LOG via kreuzberg --log-level debug extract doc.pdf.
  • CLI extraction overrides: 27 flags exposed via ExtractionOverrides struct with #[command(flatten)].
  • CLI colored output: Text output uses anstyle for colored headers, labels, success values, and dim separators. Respects NO_COLOR env var.
  • API POST /detect, GET /version, GET /cache/manifest, POST /cache/warm: New REST endpoints.
  • MCP get_version, cache_manifest, cache_warm, embed_text, chunk_text: New MCP tools.
  • Pipeline table extraction tracing: Zero-cost tracing::trace! and tracing::debug! logging throughout layout detection and table extraction.
  • TATR model availability check: Layout detection returns an error if table regions are detected but the TATR model is unavailable.

Changed

  • CLI batch flags: Batch command now supports all extraction override flags via shared ExtractionOverrides struct.
  • CLI config architecture: Replaced 13-parameter function with ExtractionOverrides struct using #[command(flatten)].
  • MCP tool architecture: Removed dead tools/ trait-based duplicates; all tools implemented directly in server.rs.

Improved

  • CLI validation: OCR backend values, chunk size/overlap bounds, DPI range, layout confidence validated.
  • API validation: Embedding preset names and chunk bounds checked.
  • MCP validation: Empty paths rejected, chunk bounds checked, embedding preset validated.
  • Chunk overlap auto-clamping: When --chunk-size is smaller than default overlap, overlap is automatically clamped to size/4.
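
The overlap clamping rule is simple arithmetic. In the sketch below the default overlap is passed in as a parameter because the release note does not state its value; the 200 used in the example is purely hypothetical, as is the use of integer division for size/4.

```python
def effective_overlap(chunk_size: int, default_overlap: int) -> int:
    """Clamp the overlap to chunk_size / 4 when the chunk is smaller than the default overlap."""
    if chunk_size < default_overlap:
        return chunk_size // 4  # integer division is an assumption
    return default_overlap


# With a hypothetical default overlap of 200:
print(effective_overlap(100, 200))  # chunk smaller than the overlap, so it is clamped
print(effective_overlap(512, 200))  # chunk large enough, default kept
```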

See full changelog: https://github.com/kreuzberg-dev/kreuzberg/blob/main/CHANGELOG.md

v4.5.1

21 Mar 12:26
4bc9d91

See CHANGELOG.md for release notes.

v4.5.0

20 Mar 18:29
5a6f054

See CHANGELOG.md for full release notes.

Benchmark Results 2026-03-21 (d062479)

21 Mar 05:18
d062479

Pre-release

Comparative benchmark results from workflow run 23359982805.

Commit: d062479
Date: 2026-03-21