Skip to content

doc-l10n-kit/tsuji

Repository files navigation

tsuji: LangChain4j-Powered Documentation Localization Toolkit

CI

tsuji is a localization toolkit designed for translating the quarkus.io documentation site. By leveraging the LangChain4j framework, it enables high-quality, consistent translations through LLMs with Retrieval-Augmented Generation (RAG) directly integrated into the gettext PO file workflow.

What is "tsuji"?

"Tsuji" (通詞) refers to the official interpreters in Japan from the 17th to the 19th century.

They were not merely translators of language. They served as a "Gateway of Knowledge," playing a crucial role in introducing the latest Western science and medicine to Japan. Following their legacy, this project aims to bridge the gap between languages and deliver knowledge to developers globally using modern AI technology.

Key Features

  • Triple Translator Support: Choose between Google Gemini (via LangChain4j), OpenAI, and DeepL as the translation engine. Each uses a different markup protection strategy optimized for its capabilities.
  • RAG-Enhanced Translation: Automatically retrieves relevant translation context from your existing TMX (Translation Memory eXchange) files using Lucene vector search, ensuring terminology consistency without manual model training.
  • Adaptive Parallelism: Two-level AIMD (Additive Increase / Multiplicative Decrease) control automatically adjusts both API request concurrency and batch size in response to rate limits, maximizing throughput while respecting API constraints.
  • AsciiDoc Markup Protection: Gemini translates AsciiDoc natively with post-translation validation and retry via AsciidoctorJ + jsoup. DeepL uses an HTML round-trip pipeline (AsciiDoc → HTML → translate → HTML → AsciiDoc) with 11 specialized message processors.
  • Structured Batch Translation: Sends multiple texts per LLM request using JSON Schema-constrained output, with index-based validation to ensure correct mapping between source and translated texts.
  • Glossary Support: Define terminology mappings in configuration to inject into translation prompts for consistent term usage.
  • Customizable System Prompts: Override built-in translation prompts with external files for project-specific tuning.
  • MT Engine Tracking: Machine-translated messages are tagged with mt: gemini, mt: openai, or mt: deepl comments, enabling selective acceptance of fuzzy MT translations during site builds.
  • Escalation Model Support: Use a cost-effective model for initial translation and automatically escalate to a higher-quality model when markup validation fails.
  • Gemini Thinking Level Control: Configure Gemini's reasoning depth (MINIMAL/LOW/MEDIUM/HIGH) per model to balance translation quality and speed.
  • PO File Management: Comprehensive tools for normalizing, purging, updating, and applying PO files, plus word-count-based translation statistics.
  • TMX Operations: Generate Translation Memory from PO files (confirmed or fuzzy translations) and apply TMX translations back to PO files.
  • Jekyll Integration: Seamlessly handles PO extraction, build processes, and previews for translated Jekyll sites, with selective acceptance of machine translations.

Project Structure

This is a multi-module Gradle project:

tsuji (root)         — Main CLI application (Kotlin / Quarkus / PicocLI)
├── tsuji-po         — PO file domain model and I/O library (jgettext)
└── tsuji-tmx        — TMX file domain model and I/O library (Jackson XML)

The root module follows a 3-layer architecture:

  • App Service Layer — Use case orchestration and workflow control
  • Core Service Layer — Pure domain logic (statistics, translation eligibility, TMX generation)
  • Core Driver Layer — External integrations (LLM APIs, vector store, file I/O, external processes)

CLI Commands

PO File Operations (po)

po machine-translate

Translate PO files using LLM or DeepL.

Options:
  -p, --po <path>         PO file or directory to translate (repeatable)
  --source <lang>         Source language (default: from config)
  --target <lang>         Target language (default: from config)
  --asciidoc <mode>       AsciiDoc inline markup processing mode: auto (detect
                          from filename), always (force enable), never (force
                          disable). Default: auto
  --rag                   Enable RAG (Retrieval-Augmented Generation)

po purge

Purge translations in PO file(s). By default only fuzzy; use --all for all.

Options:
  -p, --po <path>         PO file or directory to process (required)
  -a, --all               Purge all translations including confirmed ones

po normalize

Normalize PO file syntax via msgcat.

Options:
  -p, --po <path>         PO file or directory to normalize (required)

po update

Update PO file from master file using po4a.

Options:
  -p, --po <path>         PO file (required)
  -m, --master <path>     Master file (required)
  -f, --format <format>   File format (markdown, yaml, xhtml, etc.)

po apply

Generate translated document from PO via po4a.

Options:
  -p, --po <path>         PO file (required)
  -m, --master <path>     Master file (required)
  -l, --localized <path>  Output localized file path (required)
  -f, --format <format>   File format (markdown, yaml, xhtml, etc.)

po apply-tmx

Apply TMX translations to PO files (confirmed).

Options:
  -p, --po <path>         PO file or directory
  -t, --tmx <path>        TMX file (required)

po apply-fuzzy-tmx

Apply TMX translations to PO files (fuzzy).

Options:
  -p, --po <path>         PO file or directory
  -t, --tmx <path>        Fuzzy TMX file (required)

po remove-obsolete

Remove obsolete PO files that no longer have corresponding upstream files.

Options:
  -p, --po <path>         PO directory to clean up (required)
  -u, --upstream <path>   Upstream directory for reference (required)

po update-stats

Calculate and output translation progress statistics.

Options:
  -p, --po <path>         PO directories to analyze (repeatable)
  -o, --output <path>     Output CSV file path

TMX Operations (tmx)

tmx generate

Generate TMX from PO files.

Options:
  -t, --tmx <path>        Output TMX file path (required)
  -p, --po <path>         Directory containing PO files
  --mode <mode>           Generation mode: CONFIRMED or FUZZY

RAG Operations (rag)

rag index

Build or update vector index from TMX files.

Options:
  --tmx <path>            TMX file path (required)

Jekyll Operations (jekyll)

jekyll build

Build the translated Jekyll site.

Options:
  --[no-]translate              Apply translation (default: true)
  --accept-mt <engines>         Comma-separated list of MT engines whose fuzzy
                                translations should be accepted (e.g., gemini,deepl)
  -c, --additional-configs      Additional Jekyll configuration files (repeatable)

jekyll serve

Preview the translated site locally.

Options:
  --[no-]translate              Apply translation (default: true)
  --accept-mt <engines>         Comma-separated list of MT engines whose fuzzy
                                translations should be accepted (e.g., gemini,deepl)
  -c, --additional-configs      Additional Jekyll configuration files (repeatable)

jekyll extract

Extract PO files from Jekyll source.

jekyll update-stats

Update all Jekyll-related statistics (PO translation stats and override file stats).

Configuration (config)

config get

Display the value of a configuration property.

Options:
  <key>                   Configuration key (e.g., tsuji.version, tsuji.git.user.name)

Configuration

tsuji uses Quarkus SmallRye Config for configuration. All properties can be set via:

  • application.yml (or application.yaml) in the working directory or classpath
  • Environment variables (e.g., tsuji.translator.typeTSUJI_TRANSLATOR_TYPE)
  • System properties (e.g., -Dtsuji.translator.type=gemini)

Language

Property Default Description
tsuji.language.from en Source language code
tsuji.language.to ja Target language code

Translator

Property Default Description
tsuji.translator.type deepl Translation engine to use: gemini, openai, or deepl
tsuji.translator.target-directories (none) List of subdirectories under tsuji.po.base-dir to translate. If omitted, the entire base directory is processed
tsuji.translator.deepl.key (none) DeepL API key. Can also be set via TSUJI_TRANSLATOR_DEEPL_KEY

Adaptive Concurrency

Controls the adaptive parallelism for API requests (AIMD algorithm). Shared across all translator types.

Property Default Description
tsuji.translator.adaptive.initial-concurrency 40 Initial number of parallel API requests
tsuji.translator.adaptive.min-concurrency 1 Minimum concurrency (floor for AIMD decrease)
tsuji.translator.adaptive.max-concurrency 60 Maximum concurrency (ceiling for AIMD increase)
tsuji.translator.adaptive.max-retries 2 Maximum retry attempts per batch on error
tsuji.translator.adaptive.max-message-validation-retries 4 Maximum retry attempts for message/markup validation failures

Gemini Settings

Property Default Description
tsuji.translator.gemini.key (none) Gemini API key. Can also be set via QUARKUS_LANGCHAIN4J_GEMINI_API_KEY
tsuji.translator.gemini.model.model-id (none) Gemini model ID (e.g., gemini-3-flash-preview)
tsuji.translator.gemini.model.thinking.thinking-budget (none) Thinking token budget for the model (Gemini 2.5)
tsuji.translator.gemini.model.thinking.thinking-level (none) Thinking level: MINIMAL, LOW, MEDIUM, HIGH (Gemini 3)
tsuji.translator.gemini.escalation-model.model-id (none) Escalation model ID used for validation retries. Falls back to the primary model if omitted
tsuji.translator.gemini.escalation-model.thinking.thinking-budget (none) Thinking token budget for the escalation model
tsuji.translator.gemini.escalation-model.thinking.thinking-level (none) Thinking level for the escalation model

Gemini Batch Settings

Controls how many texts are sent per LLM request.

Property Default Description
tsuji.translator.gemini.batch.initial-texts-per-request 200 Initial number of texts per batch request
tsuji.translator.gemini.batch.max-texts-per-request 200 Maximum number of texts per batch request

Gemini System Prompts

Override the built-in translation prompts with external files for project-specific tuning.

Property Default Description
tsuji.translator.gemini.prompts.system-prompt (none) File path to a custom system prompt. If omitted, uses the built-in prompt
tsuji.translator.gemini.prompts.asciidoc-markup-rules (none) File path to custom AsciiDoc markup preservation rules
tsuji.translator.gemini.prompts.html-markup-rules (none) File path to custom HTML markup preservation rules

OpenAI Settings

Property Default Description
tsuji.translator.openai.key (none) OpenAI API key
tsuji.translator.openai.model.model-id (none) OpenAI model ID
tsuji.translator.openai.escalation-model.model-id (none) Escalation model ID used for validation retries. Falls back to the primary model if omitted
tsuji.translator.openai.mt-tag (none) Custom MT tag for tracking (defaults to openai)

OpenAI Batch Settings

Property Default Description
tsuji.translator.openai.batch.initial-texts-per-request 200 Initial number of texts per batch request
tsuji.translator.openai.batch.max-texts-per-request 200 Maximum number of texts per batch request

OpenAI System Prompts

Property Default Description
tsuji.translator.openai.prompts.system-prompt (none) File path to a custom system prompt. If omitted, uses the built-in prompt
tsuji.translator.openai.prompts.asciidoc-markup-rules (none) File path to custom AsciiDoc markup preservation rules
tsuji.translator.openai.prompts.html-markup-rules (none) File path to custom HTML markup preservation rules

Quarkus LangChain4j Settings

Standard Gemini API settings managed by the Quarkus LangChain4j extension:

Property Default Description
quarkus.langchain4j.ai.gemini.api-key (none) Gemini API key (typically set via ${tsuji.translator.gemini.key:})
quarkus.langchain4j.ai.gemini.chat-model.model-id (none) Model ID (typically set via ${tsuji.translator.gemini.model.model-id})
quarkus.langchain4j.ai.gemini.chat-model.max-output-tokens 65536 Maximum output tokens per response
quarkus.langchain4j.ai.gemini.timeout 300s API request timeout
quarkus.langchain4j.ai.gemini.log-requests false Log API requests
quarkus.langchain4j.ai.gemini.log-responses false Log API responses

RAG (Retrieval-Augmented Generation)

Property Default Description
tsuji.rag.index-path l10n/rag/index Path to the Lucene vector index directory
tsuji.rag.max-results 3 Maximum number of similar translations to retrieve per text
tsuji.rag.min-score 0.5 Minimum similarity score threshold for retrieval (0.0–1.0)

PO Files

Property Default Description
tsuji.po.base-dir l10n/po/ja_JP Base directory for PO files

Jekyll

Property Default Description
tsuji.jekyll.source-dir upstream Directory containing the original Jekyll source
tsuji.jekyll.override-dir l10n/override/ja_JP Directory with locale-specific overrides applied on top of the source
tsuji.jekyll.destination-dir docs Output directory for the built Jekyll site
tsuji.jekyll.stats-dir l10n/stats Directory for translation statistics output
tsuji.jekyll.additional-configs (none) Additional Jekyll config files to merge (comma-separated)
tsuji.jekyll.cname (none) CNAME value for the built site. Not used by tsuji itself; exposed via config get for external CI/CD scripts
tsuji.jekyll.surge-domain-suffix (none) Surge.sh domain suffix for preview deployments. Not used by tsuji itself; exposed via config get for external CI/CD scripts
tsuji.jekyll.jekyll-l10n-branch main Git branch (or tag) of the jekyll-l10n plugin to install
tsuji.jekyll.extract.yaml.exclude (none) YAML front matter keys to exclude from PO extraction
tsuji.jekyll.extract.html.include (none) HTML file patterns to include in PO extraction

Git

Property Default Description
tsuji.git.user.name (none) Git user name. Not used by tsuji itself; exposed via config get for external CI/CD scripts
tsuji.git.user.email (none) Git user email. Not used by tsuji itself; exposed via config get for external CI/CD scripts

Glossary

Property Default Description
tsuji.glossary.enabled false Enable glossary injection into translation prompts
tsuji.glossary.entries (none) List of term-translation pairs

Glossary entries are defined as a list in application.yml:

tsuji:
  glossary:
    enabled: true
    entries:
      - term: "dependency injection"
        translation: "依存性注入"
      - term: "build time"
        translation: "ビルド時"

Example Configuration

tsuji:
  language:
    from: "en"
    to: "ja"

  translator:
    type: "gemini"
    language:
      source: "en"
      target: "ja"
    adaptive:
      initial-concurrency: 40
      max-concurrency: 60
      max-retries: 2
    gemini:
      model:
        model-id: "gemini-3-flash-preview"
        thinking:
          thinking-level: "MINIMAL"
      escalation-model:
        model-id: "gemini-3.1-pro-preview"
        thinking:
          thinking-level: "LOW"
      batch:
        initial-texts-per-request: 200
        max-texts-per-request: 200

  rag:
    index-path: "l10n/rag/index"
    max-results: 3
    min-score: 0.5

  po:
    base-dir: "l10n/po/ja_JP"

  jekyll:
    source-dir: "upstream"
    destination-dir: "docs"

Getting Started

Prerequisites

  • JDK 21
  • Gettext: Required for PO file operations (e.g., msgcat).
  • Po4a: Required for converting between original sources and PO files.
  • Git: Required for retrieving commit timestamps for synchronization status.
  • LLM API Key: Default implementation uses Google Gemini (set via QUARKUS_LANGCHAIN4J_GEMINI_API_KEY). For DeepL, set TSUJI_TRANSLATOR_DEEPL_API_KEY.

Building

./gradlew build

Running the CLI

You can run the CLI in development mode using Quarkus:

./gradlew quarkusDev --quarkus-args='<args>'

Or run the built JAR:

java -jar build/tsuji.jar <command> [options]

Running Tests

./gradlew test       # Run all unit tests
./gradlew systemTest # Run system tests (CLI behavior)

Documentation

  • Architecture Guide: Detailed overview of the translation workflow, architecture, and core components (Japanese).

License

This project is licensed under the Apache License, Version 2.0.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages