tsuji is a localization toolkit designed for translating the quarkus.io documentation site. By leveraging the LangChain4j framework, it enables high-quality, consistent translations through LLMs with Retrieval-Augmented Generation (RAG) directly integrated into the gettext PO file workflow.
"Tsuji" (通詞) refers to the official interpreters in Japan from the 17th to the 19th century.
They were not merely translators of language. They served as a "Gateway of Knowledge," playing a crucial role in introducing the latest Western science and medicine to Japan. Following their legacy, this project aims to bridge the gap between languages and deliver knowledge to developers globally using modern AI technology.
- Triple Translator Support: Choose between Google Gemini (via LangChain4j), OpenAI, and DeepL as the translation engine. Each uses a different markup protection strategy optimized for its capabilities.
- RAG-Enhanced Translation: Automatically retrieves relevant translation context from your existing TMX (Translation Memory eXchange) files using Lucene vector search, ensuring terminology consistency without manual model training.
- Adaptive Parallelism: Two-level AIMD (Additive Increase / Multiplicative Decrease) control automatically adjusts both API request concurrency and batch size in response to rate limits, maximizing throughput while respecting API constraints.
- AsciiDoc Markup Protection: Gemini translates AsciiDoc natively with post-translation validation and retry via AsciidoctorJ + jsoup. DeepL uses an HTML round-trip pipeline (AsciiDoc → HTML → translate → HTML → AsciiDoc) with 11 specialized message processors.
- Structured Batch Translation: Sends multiple texts per LLM request using JSON Schema-constrained output, with index-based validation to ensure correct mapping between source and translated texts.
- Glossary Support: Define terminology mappings in configuration to inject into translation prompts for consistent term usage.
- Customizable System Prompts: Override built-in translation prompts with external files for project-specific tuning.
- MT Engine Tracking: Machine-translated messages are tagged with `mt: gemini`, `mt: openai`, or `mt: deepl` comments, enabling selective acceptance of fuzzy MT translations during site builds.
- Escalation Model Support: Use a cost-effective model for initial translation and automatically escalate to a higher-quality model when markup validation fails.
- Gemini Thinking Level Control: Configure Gemini's reasoning depth (MINIMAL/LOW/MEDIUM/HIGH) per model to balance translation quality and speed.
- PO File Management: Comprehensive tools for normalizing, purging, updating, and applying PO files, plus word-count-based translation statistics.
- TMX Operations: Generate Translation Memory from PO files (confirmed or fuzzy translations) and apply TMX translations back to PO files.
- Jekyll Integration: Seamlessly handles PO extraction, build processes, and previews for translated Jekyll sites, with selective acceptance of machine translations.

This is a multi-module Gradle project:

    tsuji (root): Main CLI application (Kotlin / Quarkus / Picocli)
    ├── tsuji-po: PO file domain model and I/O library (jgettext)
    └── tsuji-tmx: TMX file domain model and I/O library (Jackson XML)

The root module follows a 3-layer architecture:
- App Service Layer — Use case orchestration and workflow control
- Core Service Layer — Pure domain logic (statistics, translation eligibility, TMX generation)
- Core Driver Layer — External integrations (LLM APIs, vector store, file I/O, external processes)
Translate PO files using LLM or DeepL.

    Options:
      -p, --po <path>        PO file or directory to translate (repeatable)
          --source <lang>    Source language (default: from config)
          --target <lang>    Target language (default: from config)
          --asciidoc <mode>  AsciiDoc inline markup processing mode: auto (detect
                             from filename), always (force enable), never (force
                             disable). Default: auto
          --rag              Enable RAG (Retrieval-Augmented Generation)

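The three `--asciidoc` modes can be read as a small decision function. The sketch below is illustrative only: the function name is made up, and the `auto` rule (looking for `.adoc` in the PO file name) is an assumption about what "detect from filename" means, not tsuji's confirmed behavior.

```python
def asciidoc_enabled(mode: str, po_filename: str) -> bool:
    """Resolve the --asciidoc mode for one PO file (hypothetical helper)."""
    if mode == "always":
        return True
    if mode == "never":
        return False
    if mode == "auto":
        # Assumption: the PO name keeps the source extension, e.g. "guide.adoc.po".
        return ".adoc" in po_filename.lower()
    raise ValueError(f"unknown mode: {mode}")
```
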
Purge translations in PO file(s). By default only fuzzy translations are purged; use --all to purge all of them.

    Options:
      -p, --po <path>  PO file or directory to process (required)
      -a, --all        Purge all translations including confirmed ones

Normalize PO file syntax via msgcat.

    Options:
      -p, --po <path>  PO file or directory to normalize (required)

Update PO file from master file using po4a.

    Options:
      -p, --po <path>        PO file (required)
      -m, --master <path>    Master file (required)
      -f, --format <format>  File format (markdown, yaml, xhtml, etc.)

Generate translated document from PO via po4a.

    Options:
      -p, --po <path>         PO file (required)
      -m, --master <path>     Master file (required)
      -l, --localized <path>  Output localized file path (required)
      -f, --format <format>   File format (markdown, yaml, xhtml, etc.)

Apply TMX translations to PO files (confirmed).

    Options:
      -p, --po <path>   PO file or directory
      -t, --tmx <path>  TMX file (required)

Apply TMX translations to PO files (fuzzy).

    Options:
      -p, --po <path>   PO file or directory
      -t, --tmx <path>  Fuzzy TMX file (required)

Remove obsolete PO files that no longer have corresponding upstream files.

    Options:
      -p, --po <path>        PO directory to clean up (required)
      -u, --upstream <path>  Upstream directory for reference (required)

Calculate and output translation progress statistics.

    Options:
      -p, --po <path>      PO directories to analyze (repeatable)
      -o, --output <path>  Output CSV file path

Generate TMX from PO files.

    Options:
      -t, --tmx <path>   Output TMX file path (required)
      -p, --po <path>    Directory containing PO files
          --mode <mode>  Generation mode: CONFIRMED or FUZZY

Build or update vector index from TMX files.

    Options:
          --tmx <path>  TMX file path (required)

Build the translated Jekyll site.

    Options:
          --[no-]translate       Apply translation (default: true)
          --accept-mt <engines>  Comma-separated list of MT engines whose fuzzy
                                 translations should be accepted (e.g., gemini,deepl)
      -c, --additional-configs   Additional Jekyll configuration files (repeatable)

Preview the translated site locally.

    Options:
          --[no-]translate       Apply translation (default: true)
          --accept-mt <engines>  Comma-separated list of MT engines whose fuzzy
                                 translations should be accepted (e.g., gemini,deepl)
      -c, --additional-configs   Additional Jekyll configuration files (repeatable)

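To illustrate how `--accept-mt` could interact with the `mt: <engine>` comments, here is a minimal sketch. The entry shape (a dict with `fuzzy` and `comments`) and both function names are hypothetical, and the rule that untagged fuzzy entries are rejected is an assumption rather than documented behavior.

```python
def parse_accept_mt(option_value):
    # "gemini,deepl" -> {"gemini", "deepl"}
    return {e.strip() for e in option_value.split(",") if e.strip()}

def is_usable(entry, accepted_engines):
    """entry: dict with 'fuzzy' (bool) and 'comments' (list of str); a hypothetical shape."""
    if not entry["fuzzy"]:
        return True  # confirmed translations are always used
    for comment in entry["comments"]:
        if comment.startswith("mt:"):
            # Accept the fuzzy entry only if its engine tag was listed in --accept-mt.
            return comment[len("mt:"):].strip() in accepted_engines
    return False  # assumption: fuzzy entries without an MT tag are not accepted
```

With `--accept-mt gemini,deepl`, a fuzzy entry tagged `mt: gemini` would be used in the build, while one tagged `mt: openai` would fall back to the source text.
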
Extract PO files from Jekyll source.
Update all Jekyll-related statistics (PO translation stats and override file stats).
Display the value of a configuration property.

    Options:
      <key>  Configuration key (e.g., tsuji.version, tsuji.git.user.name)

tsuji uses Quarkus SmallRye Config for configuration. All properties can be set via:

- `application.yml` (or `application.yaml`) in the working directory or classpath
- Environment variables (e.g., `tsuji.translator.type` → `TSUJI_TRANSLATOR_TYPE`)
- System properties (e.g., `-Dtsuji.translator.type=gemini`)

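The environment-variable form follows the usual SmallRye Config naming convention: every character that is not a letter or digit becomes `_`, and the result is upper-cased. A minimal sketch of that mapping (the helper name `to_env_var` is illustrative and not part of tsuji):

```python
import re

def to_env_var(property_name: str) -> str:
    """Map a dotted property name to its environment-variable form."""
    # Replace every non-alphanumeric character with "_" and upper-case the result.
    return re.sub(r"[^A-Za-z0-9]", "_", property_name).upper()

print(to_env_var("tsuji.translator.type"))
print(to_env_var("tsuji.translator.adaptive.max-concurrency"))
```
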

| Property | Default | Description |
|---|---|---|
| `tsuji.language.from` | `en` | Source language code |
| `tsuji.language.to` | `ja` | Target language code |


| Property | Default | Description |
|---|---|---|
| `tsuji.translator.type` | `deepl` | Translation engine to use: `gemini`, `openai`, or `deepl` |
| `tsuji.translator.target-directories` | (none) | List of subdirectories under `tsuji.po.base-dir` to translate. If omitted, the entire base directory is processed |
| `tsuji.translator.deepl.key` | (none) | DeepL API key. Can also be set via `TSUJI_TRANSLATOR_DEEPL_KEY` |

Controls the adaptive parallelism for API requests (AIMD algorithm). Shared across all translator types.

| Property | Default | Description |
|---|---|---|
| `tsuji.translator.adaptive.initial-concurrency` | `40` | Initial number of parallel API requests |
| `tsuji.translator.adaptive.min-concurrency` | `1` | Minimum concurrency (floor for AIMD decrease) |
| `tsuji.translator.adaptive.max-concurrency` | `60` | Maximum concurrency (ceiling for AIMD increase) |
| `tsuji.translator.adaptive.max-retries` | `2` | Maximum retry attempts per batch on error |
| `tsuji.translator.adaptive.max-message-validation-retries` | `4` | Maximum retry attempts for message/markup validation failures |

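As a rough illustration of how AIMD behaves between the configured floor and ceiling, consider the sketch below. The class name, the additive step of 1, and the halving factor are assumptions chosen for the example; tsuji's actual step sizes are not stated here.

```python
from dataclasses import dataclass

@dataclass
class AimdController:
    """Additive Increase / Multiplicative Decrease over a bounded integer."""
    value: int                      # current concurrency (or batch size)
    minimum: int = 1                # floor, cf. min-concurrency
    maximum: int = 60               # ceiling, cf. max-concurrency
    increase_step: int = 1          # assumed additive step
    decrease_factor: float = 0.5    # assumed multiplicative factor

    def on_success(self) -> int:
        # Additive increase, capped at the configured ceiling.
        self.value = min(self.maximum, self.value + self.increase_step)
        return self.value

    def on_rate_limited(self) -> int:
        # Multiplicative decrease, never below the configured floor.
        self.value = max(self.minimum, int(self.value * self.decrease_factor))
        return self.value

concurrency = AimdController(value=40, minimum=1, maximum=60)
concurrency.on_rate_limited()   # 40 -> 20
concurrency.on_success()        # 20 -> 21
```

The same controller shape could be instantiated a second time for batch size, which is one way to read the "two-level" control described in the feature list.
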

| Property | Default | Description |
|---|---|---|
| `tsuji.translator.gemini.key` | (none) | Gemini API key. Can also be set via `QUARKUS_LANGCHAIN4J_GEMINI_API_KEY` |
| `tsuji.translator.gemini.model.model-id` | (none) | Gemini model ID (e.g., `gemini-3-flash-preview`) |
| `tsuji.translator.gemini.model.thinking.thinking-budget` | (none) | Thinking token budget for the model (Gemini 2.5) |
| `tsuji.translator.gemini.model.thinking.thinking-level` | (none) | Thinking level: `MINIMAL`, `LOW`, `MEDIUM`, `HIGH` (Gemini 3) |
| `tsuji.translator.gemini.escalation-model.model-id` | (none) | Escalation model ID used for validation retries. Falls back to the primary model if omitted |
| `tsuji.translator.gemini.escalation-model.thinking.thinking-budget` | (none) | Thinking token budget for the escalation model |
| `tsuji.translator.gemini.escalation-model.thinking.thinking-level` | (none) | Thinking level for the escalation model |

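The relationship between the primary model, the escalation model, and validation retries might look roughly like the following. This is a sketch under stated assumptions: `translate` and `is_valid` are hypothetical callables standing in for the LLM call and the markup check, and the exact retry ordering in tsuji may differ.

```python
from typing import Callable, Optional

def translate_with_escalation(
    text: str,
    translate: Callable[[str, str], str],   # (model_id, text) -> translation; hypothetical
    is_valid: Callable[[str, str], bool],   # (source, translation) -> markup check; hypothetical
    primary_model: str,
    escalation_model: Optional[str] = None,
    max_validation_retries: int = 4,
) -> Optional[str]:
    # First attempt always uses the primary (cheaper) model.
    result = translate(primary_model, text)
    if is_valid(text, result):
        return result
    # Validation retries use the escalation model, or the primary one if none is set.
    retry_model = escalation_model or primary_model
    for _ in range(max_validation_retries):
        result = translate(retry_model, text)
        if is_valid(text, result):
            return result
    return None  # caller decides what to do with a message that never validated
```
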
Controls how many texts are sent per LLM request.

| Property | Default | Description |
|---|---|---|
| `tsuji.translator.gemini.batch.initial-texts-per-request` | `200` | Initial number of texts per batch request |
| `tsuji.translator.gemini.batch.max-texts-per-request` | `200` | Maximum number of texts per batch request |

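Index-based validation of a batch response can be pictured as follows. The JSON field names (`index`, `text`) and the function names are assumptions for illustration; the actual schema tsuji sends to the model is not shown in this document.

```python
import json

def build_batch_payload(texts):
    # Each source text carries its position so the reply can be matched back.
    return json.dumps(
        [{"index": i, "text": t} for i, t in enumerate(texts)],
        ensure_ascii=False,
    )

def parse_batch_response(response_json, expected_count):
    """Return translations ordered by index, or raise ValueError on a mismatch."""
    items = json.loads(response_json)
    by_index = {item["index"]: item["text"] for item in items}
    # Reject missing, duplicated, or out-of-range indexes.
    if len(items) != expected_count or set(by_index) != set(range(expected_count)):
        raise ValueError("index mismatch between request and response")
    return [by_index[i] for i in range(expected_count)]
```

A mismatch would typically trigger a retry of the batch, possibly with a smaller batch size.
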
Override the built-in translation prompts with external files for project-specific tuning.

| Property | Default | Description |
|---|---|---|
| `tsuji.translator.gemini.prompts.system-prompt` | (none) | File path to a custom system prompt. If omitted, uses the built-in prompt |
| `tsuji.translator.gemini.prompts.asciidoc-markup-rules` | (none) | File path to custom AsciiDoc markup preservation rules |
| `tsuji.translator.gemini.prompts.html-markup-rules` | (none) | File path to custom HTML markup preservation rules |


| Property | Default | Description |
|---|---|---|
| `tsuji.translator.openai.key` | (none) | OpenAI API key |
| `tsuji.translator.openai.model.model-id` | (none) | OpenAI model ID |
| `tsuji.translator.openai.escalation-model.model-id` | (none) | Escalation model ID used for validation retries. Falls back to the primary model if omitted |
| `tsuji.translator.openai.mt-tag` | (none) | Custom MT tag for tracking (defaults to `openai`) |


| Property | Default | Description |
|---|---|---|
| `tsuji.translator.openai.batch.initial-texts-per-request` | `200` | Initial number of texts per batch request |
| `tsuji.translator.openai.batch.max-texts-per-request` | `200` | Maximum number of texts per batch request |


| Property | Default | Description |
|---|---|---|
| `tsuji.translator.openai.prompts.system-prompt` | (none) | File path to a custom system prompt. If omitted, uses the built-in prompt |
| `tsuji.translator.openai.prompts.asciidoc-markup-rules` | (none) | File path to custom AsciiDoc markup preservation rules |
| `tsuji.translator.openai.prompts.html-markup-rules` | (none) | File path to custom HTML markup preservation rules |

Standard Gemini API settings managed by the Quarkus LangChain4j extension:

| Property | Default | Description |
|---|---|---|
| `quarkus.langchain4j.ai.gemini.api-key` | (none) | Gemini API key (typically set via `${tsuji.translator.gemini.key:}`) |
| `quarkus.langchain4j.ai.gemini.chat-model.model-id` | (none) | Model ID (typically set via `${tsuji.translator.gemini.model.model-id}`) |
| `quarkus.langchain4j.ai.gemini.chat-model.max-output-tokens` | `65536` | Maximum output tokens per response |
| `quarkus.langchain4j.ai.gemini.timeout` | `300s` | API request timeout |
| `quarkus.langchain4j.ai.gemini.log-requests` | `false` | Log API requests |
| `quarkus.langchain4j.ai.gemini.log-responses` | `false` | Log API responses |


| Property | Default | Description |
|---|---|---|
| `tsuji.rag.index-path` | `l10n/rag/index` | Path to the Lucene vector index directory |
| `tsuji.rag.max-results` | `3` | Maximum number of similar translations to retrieve per text |
| `tsuji.rag.min-score` | `0.5` | Minimum similarity score threshold for retrieval (0.0–1.0) |

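The effect of `max-results` and `min-score` can be shown with a brute-force stand-in for the vector search. tsuji uses a Lucene index; the sketch below swaps in a plain cosine scan over in-memory vectors, and the function names, the memory tuple layout, and the use of raw cosine as the score are all assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, memory, max_results=3, min_score=0.5):
    """memory: list of (source, target, vector) entries from a translation memory."""
    scored = [(cosine(query_vec, vec), src, tgt) for src, tgt, vec in memory]
    # Drop weak matches first, then keep only the best few.
    kept = [s for s in scored if s[0] >= min_score]
    kept.sort(key=lambda s: s[0], reverse=True)
    return kept[:max_results]
```

Raising `min-score` trades recall for precision: fewer, closer translation pairs end up in the prompt context.
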

| Property | Default | Description |
|---|---|---|
| `tsuji.po.base-dir` | `l10n/po/ja_JP` | Base directory for PO files |


| Property | Default | Description |
|---|---|---|
| `tsuji.jekyll.source-dir` | `upstream` | Directory containing the original Jekyll source |
| `tsuji.jekyll.override-dir` | `l10n/override/ja_JP` | Directory with locale-specific overrides applied on top of the source |
| `tsuji.jekyll.destination-dir` | `docs` | Output directory for the built Jekyll site |
| `tsuji.jekyll.stats-dir` | `l10n/stats` | Directory for translation statistics output |
| `tsuji.jekyll.additional-configs` | (none) | Additional Jekyll config files to merge (comma-separated) |
| `tsuji.jekyll.cname` | (none) | CNAME value for the built site. Not used by tsuji itself; exposed via `config get` for external CI/CD scripts |
| `tsuji.jekyll.surge-domain-suffix` | (none) | Surge.sh domain suffix for preview deployments. Not used by tsuji itself; exposed via `config get` for external CI/CD scripts |
| `tsuji.jekyll.jekyll-l10n-branch` | `main` | Git branch (or tag) of the jekyll-l10n plugin to install |
| `tsuji.jekyll.extract.yaml.exclude` | (none) | YAML front matter keys to exclude from PO extraction |
| `tsuji.jekyll.extract.html.include` | (none) | HTML file patterns to include in PO extraction |


| Property | Default | Description |
|---|---|---|
| `tsuji.git.user.name` | (none) | Git user name. Not used by tsuji itself; exposed via `config get` for external CI/CD scripts |
| `tsuji.git.user.email` | (none) | Git user email. Not used by tsuji itself; exposed via `config get` for external CI/CD scripts |


| Property | Default | Description |
|---|---|---|
| `tsuji.glossary.enabled` | `false` | Enable glossary injection into translation prompts |
| `tsuji.glossary.entries` | (none) | List of term-translation pairs |

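One way to picture glossary injection is as a small text section appended to the translation prompt. The sketch below is hypothetical: the function name, the prompt wording, and the choice to include only terms that occur in the source text are assumptions, not tsuji's built-in prompt.

```python
def glossary_prompt_section(entries, source_text):
    """Render glossary lines for terms that occur in the source text."""
    lowered = source_text.lower()
    lines = [
        f'- "{e["term"]}" -> "{e["translation"]}"'
        for e in entries
        if e["term"].lower() in lowered   # assumed filter: skip irrelevant terms
    ]
    if not lines:
        return ""
    return "Use these term translations:\n" + "\n".join(lines)

entries = [
    {"term": "dependency injection", "translation": "依存性注入"},
    {"term": "build time", "translation": "ビルド時"},
]
print(glossary_prompt_section(entries, "Quarkus resolves dependency injection early."))
```
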
Glossary entries are defined as a list in application.yml:

    tsuji:
      glossary:
        enabled: true
        entries:
          - term: "dependency injection"
            translation: "依存性注入"
          - term: "build time"
            translation: "ビルド時"

A fuller example `application.yml`:

    tsuji:
      language:
        from: "en"
        to: "ja"
      translator:
        type: "gemini"
        language:
          source: "en"
          target: "ja"
        adaptive:
          initial-concurrency: 40
          max-concurrency: 60
          max-retries: 2
        gemini:
          model:
            model-id: "gemini-3-flash-preview"
            thinking:
              thinking-level: "MINIMAL"
          escalation-model:
            model-id: "gemini-3.1-pro-preview"
            thinking:
              thinking-level: "LOW"
          batch:
            initial-texts-per-request: 200
            max-texts-per-request: 200
      rag:
        index-path: "l10n/rag/index"
        max-results: 3
        min-score: 0.5
      po:
        base-dir: "l10n/po/ja_JP"
      jekyll:
        source-dir: "upstream"
        destination-dir: "docs"

- JDK 21
- Gettext: Required for PO file operations (e.g., `msgcat`).
- Po4a: Required for converting between original sources and PO files.
- Git: Required for retrieving commit timestamps for synchronization status.
- LLM API Key: Default implementation uses Google Gemini (set via `QUARKUS_LANGCHAIN4J_GEMINI_API_KEY`). For DeepL, set `TSUJI_TRANSLATOR_DEEPL_KEY`.

Build the project:

    ./gradlew build

You can run the CLI in development mode using Quarkus:

    ./gradlew quarkusDev --quarkus-args='<args>'

Or run the built JAR:

    java -jar build/tsuji.jar <command> [options]

Run the tests:

    ./gradlew test          # Run all unit tests
    ./gradlew systemTest    # Run system tests (CLI behavior)

- Architecture Guide: Detailed overview of the translation workflow, architecture, and core components (Japanese).

This project is licensed under the Apache License, Version 2.0.