TurkicNLP — Six Branches of the Turkic Language Family

NLP toolkit for 20+ Turkic languages — a pip-installable Python library inspired by Stanza, with adaptations for the low-resource, morphologically rich Turkic language family.

Maintained by Sherzod Hakimov

License: Apache-2.0 · Python 3.9–3.12 · Status: Pre-Alpha · 24 Turkic Languages

Citation

If you use TurkicNLP in your research, please cite:

@misc{hakimov2026turkicnlpnlptoolkit,
      title={TurkicNLP: An NLP Toolkit for Turkic Languages}, 
      author={Sherzod Hakimov},
      year={2026},
      eprint={2602.19174},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.19174}, 
}

arXiv preprint

Read it here

Code samples

Jupyter notebooks are here

Features

  • 24 Turkic languages from Turkish to Sakha, Kazakh to Uyghur
  • Script-aware from the ground up — Latin, Cyrillic, Perso-Arabic, Old Turkic Runic
  • Automatic script detection and bidirectional transliteration
  • Apertium FST morphology for ~20 Turkic languages via Python-native hfst bindings (no system install)
  • Stanza/UD integration — pretrained tokenization, POS tagging, lemmatization, dependency parsing, and NER via Stanza models trained on Universal Dependencies treebanks
  • NLLB embeddings + translation backend — sentence/document vectors and MT via NLLB-200
  • Language identification (LanguageDetection, GlotLID model) — FastText-based LID with 1,000+ Glottolog language labels
  • Multilingual Glot500 neural models — POS tagging & dependency parsing (15 languages), morphological analysis & lemmatization (23 languages) via shared Glot500 backbone
  • Multiple backends — choose between rule-based, Apertium FST, Stanza, or Glot500 neural backends per processor
  • License isolation — library is Apache-2.0; Apertium GPL-3.0 data downloaded separately
  • Stanza-compatible API — Pipeline, Document, Sentence, Word

Installation

Requirements: Python 3.9, 3.10, 3.11, or 3.12

pip install turkicnlp                    # core — tokenization, rule-based processing, CoNLL-U I/O
pip install "turkicnlp[hfst]"            # + Apertium FST morphology (Linux and macOS only)
pip install "turkicnlp[stanza]"          # + Stanza neural models (tokenize, POS, lemma, depparse, NER)
pip install "turkicnlp[lid]"             # + Language detection (GlotLID model; FastText + HF weights)
pip install "turkicnlp[translation]"     # + NLLB embeddings and machine translation
pip install "turkicnlp[transformers]"    # + Glot500 multilingual POS/DepParse/Morph models
pip install "turkicnlp[all]"             # everything above (Linux and macOS only)
pip install "turkicnlp[dev]"             # development tools (pytest, black, ruff, mypy)

Platform compatibility

Installation tests run nightly across all combinations of OS, Python version, and install extra (see CI workflow).

Extra Ubuntu 22.04 / 24.04 macOS 14 / 15 Windows 2025
base ✅ 3.9 – 3.12 ✅ 3.9 – 3.12 ✅ 3.9 – 3.12
[hfst] ✅ 3.9 – 3.12 ✅ 3.9 – 3.12 ❌ not available
[stanza] ✅ 3.9 – 3.12 ✅ 3.9 – 3.12 ✅ 3.9 – 3.12
[lid] ✅ 3.9 – 3.12 ✅ 3.9 – 3.12 ✅ 3.9 – 3.12
[transformers] ✅ 3.9 – 3.12 ✅ 3.9 – 3.12 ✅ 3.9 – 3.12
[translation] ✅ 3.9 – 3.12 ✅ 3.9 – 3.12 ✅ 3.9 – 3.12
[all] ✅ 3.9 – 3.12 ✅ 3.9 – 3.12 ❌ not available

Windows users: the hfst Python package has no published wheels for Python 3.7 or later on Windows — this is an upstream limitation with no current workaround. All features except Apertium FST morphology work normally on Windows; use turkicnlp[stanza] or turkicnlp[translation] instead. If you need Apertium FST morphology on Windows, the recommended approach is Windows Subsystem for Linux (WSL), where hfst installs normally.
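On platforms where the hfst extra cannot be installed, downstream code can detect its absence at import time and fall back to another backend. A minimal sketch (the names HAS_HFST and pick_morph_backend are illustrative, not part of TurkicNLP's API):

```python
# Illustrative pattern for code that should degrade gracefully when the
# optional hfst extra is not installed (e.g. on Windows).
try:
    import hfst  # provided by "turkicnlp[hfst]"; no Windows wheels
    HAS_HFST = True
except ImportError:
    HAS_HFST = False

def pick_morph_backend() -> str:
    """Prefer Apertium FST morphology when hfst is importable, else fall back."""
    return "apertium" if HAS_HFST else "stanza"
```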

Quick Start

import turkicnlp

# Download models for a language
turkicnlp.download("kaz")

# Build a pipeline
nlp = turkicnlp.Pipeline("kaz", processors=["tokenize", "pos", "lemma", "ner", "depparse"])

# Process text
doc = nlp("Мен мектепке бардым")

# Access annotations
for sentence in doc.sentences:
    for word in sentence.words:
        print(f"{word.text}\t{word.lemma}\t{word.upos}\t{word.feats}")

# Export to CoNLL-U
print(doc.to_conllu())

Embeddings (NLLB)

import math
import turkicnlp

turkicnlp.download("tur", processors=["embeddings"])
nlp = turkicnlp.Pipeline("tur", processors=["embeddings"])

doc1 = nlp("Bugün hava çok güzel ve parkta yürüyüş yaptım.")
doc2 = nlp("Parkta yürüyüş yapmak bugün çok keyifliydi çünkü hava güzeldi.")

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(len(doc1.embedding), len(doc2.embedding))
print(f"cosine = {cosine_similarity(doc1.embedding, doc2.embedding):.4f}")
print(doc1._processor_log)  # ['embeddings:nllb']

Language ID (LanguageDetection, GlotLID model)

import turkicnlp

lid = turkicnlp.LanguageDetection()  # defaults to all Turkic languages supported by TurkicNLP
labels, probs = lid.predict("salam, hemmelere!", k=3)
print(labels, probs)

# Limit to specific labels
limited = turkicnlp.LanguageDetection(
    languages=[
        "__label__eng_Latn",
        "__label__tur_Latn",
        "__label__kaz_Cyrl",
    ]
)
print(limited.predict("Merhaba dünya!", k=1))

Machine Translation (NLLB)

import turkicnlp

# Downloads once into ~/.turkicnlp/models/huggingface/facebook--nllb-200-distilled-600M
turkicnlp.download("tur", processors=["translate"])

nlp = turkicnlp.Pipeline(
    "tur",
    processors=["translate"],
    translate_tgt_lang="eng",
)

doc = nlp("Bugün hava çok güzel ve parkta yürüyüş yaptım.")
print(doc.translation)
print(doc._processor_log)  # ['translate:nllb']

translate_tgt_lang accepts either ISO-639-3 ("eng", "tuk", "kaz") or explicit Flores-200 codes ("eng_Latn", "kaz_Cyrl").
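One way such a resolution step could work, sketched in plain Python (the ISO3_TO_FLORES table and resolve_flores helper are illustrative only; the actual mapping lives inside TurkicNLP):

```python
# Illustrative ISO-639-3 → FLORES-200 resolution: explicit FLORES codes
# pass through unchanged, bare ISO-3 codes are looked up in a table.
ISO3_TO_FLORES = {
    "eng": "eng_Latn",
    "tur": "tur_Latn",
    "kaz": "kaz_Cyrl",
    "tuk": "tuk_Latn",
}

def resolve_flores(code: str) -> str:
    if "_" in code:           # already a FLORES-200 code like "kaz_Cyrl"
        return code
    try:
        return ISO3_TO_FLORES[code]
    except KeyError:
        raise ValueError(f"No FLORES-200 mapping for ISO-639-3 code {code!r}")

print(resolve_flores("eng"))       # eng_Latn
print(resolve_flores("kaz_Cyrl"))  # kaz_Cyrl
```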

Using the Stanza Backend

from turkicnlp.processors.stanza_backend import (
    StanzaTokenizer, StanzaPOSTagger, StanzaLemmatizer, StanzaNERProcessor, StanzaDepParser
)
from turkicnlp.models.document import Document

# Models are downloaded automatically on first use
doc = Document(text="Merhaba dünya.", lang="tur")

for Proc in [StanzaTokenizer, StanzaPOSTagger, StanzaLemmatizer, StanzaNERProcessor, StanzaDepParser]:
    proc = Proc(lang="tur")
    proc.load()
    doc = proc.process(doc)

for word in doc.words:
    print(f"{word.text:12} {word.upos:6} {word.lemma:12} head={word.head} {word.deprel}")

# Export to CoNLL-U
print(doc.to_conllu())

Mixed Backends

from turkicnlp.processors.tokenizer import RegexTokenizer
from turkicnlp.processors.stanza_backend import StanzaPOSTagger, StanzaNERProcessor, StanzaDepParser
from turkicnlp.models.document import Document

doc = Document(text="Мен мектепке бардым.", lang="kaz")

# Rule-based tokenizer + Stanza POS/parsing (pretokenized mode)
tokenizer = RegexTokenizer(lang="kaz")
tokenizer.load()
doc = tokenizer.process(doc)

pos = StanzaPOSTagger(lang="kaz")
pos.load()
doc = pos.process(doc)

ner = StanzaNERProcessor(lang="kaz")
ner.load()
doc = ner.process(doc)

parser = StanzaDepParser(lang="kaz")
parser.load()
doc = parser.process(doc)

Multi-Script Support

# Kazakh — auto-detects Cyrillic vs Latin
doc = nlp("Мен мектепке бардым")    # Cyrillic
doc = nlp("Men mektepke bardym")     # Latin

# Explicit script selection
nlp_cyrl = turkicnlp.Pipeline("kaz", script="Cyrl")
nlp_latn = turkicnlp.Pipeline("kaz", script="Latn")

# Transliteration bridge — run Cyrillic model on Latin input
nlp = turkicnlp.Pipeline("kaz", script="Latn", transliterate_to="Cyrl")

Uyghur (Perso-Arabic)

nlp_ug = turkicnlp.Pipeline("uig", script="Arab")
doc = nlp_ug("مەن مەكتەپكە باردىم")

Transliteration

The Transliterator class converts text between scripts for any supported language pair:

from turkicnlp.scripts import Script
from turkicnlp.scripts.transliterator import Transliterator

# Kazakh Cyrillic → Latin (2021 official alphabet)
t = Transliterator("kaz", Script.CYRILLIC, Script.LATIN)
print(t.transliterate("Қазақстан Республикасы"))
# → Qazaqstan Respublıkasy

# Uzbek Latin → Cyrillic
t = Transliterator("uzb", Script.LATIN, Script.CYRILLIC)
print(t.transliterate("O'zbekiston Respublikasi"))
# → Ўзбекистон Республикаси

# Uyghur Perso-Arabic → Latin (ULY)
t = Transliterator("uig", Script.PERSO_ARABIC, Script.LATIN)
print(t.transliterate("مەكتەپ"))
# → mektep

# Azerbaijani Latin → Cyrillic
t = Transliterator("aze", Script.LATIN, Script.CYRILLIC)
print(t.transliterate("Azərbaycan"))
# → Азәрбайҹан

# Turkmen Latin → Cyrillic
t = Transliterator("tuk", Script.LATIN, Script.CYRILLIC)
print(t.transliterate("Türkmenistan"))
# → Түркменистан

# Tatar Cyrillic → Latin (Zamanälif)
t = Transliterator("tat", Script.CYRILLIC, Script.LATIN)
print(t.transliterate("Татарстан Республикасы"))
# → Tatarstan Respublikası

Old Turkic Runic Script

TurkicNLP supports transliteration of Old Turkic runic inscriptions (Orkhon-Yenisei script, Unicode block U+10C00–U+10C4F) to Latin:

from turkicnlp.scripts import Script
from turkicnlp.scripts.transliterator import Transliterator

t = Transliterator("otk", Script.OLD_TURKIC_RUNIC, Script.LATIN)

# Individual runic characters
print(t.transliterate("\U00010C34\U00010C07\U00010C2F\U00010C19"))
# → törk  (Türk)

# The transliterator maps each runic character to its standard
# Turkological Latin equivalent, handling both Orkhon and Yenisei
# variant forms (e.g., separate glyphs for consonants with
# back vs. front vowel contexts).

Neural POS Tagger & Dependency Parser (Glot500)

The multilingual Glot500-based model provides UPOS tagging and dependency parsing for 15 Turkic languages (10 trained + 5 zero-shot). Requires pip install "turkicnlp[transformers]".

import turkicnlp

# Download tokenizer + multilingual Glot500 POS/DepParse model
turkicnlp.download("kaz", processors=["tokenize", "pos", "depparse"])

nlp = turkicnlp.Pipeline(
    "kaz",
    processors=["tokenize", "pos", "depparse"],
    pos_backend="multilingual_glot500",
    depparse_backend="multilingual_glot500",
)

doc = nlp("Мен мектепке бардым.")

for sentence in doc.sentences:
    for word in sentence.words:
        print(f"{word.text:12} {word.upos:6} head={word.head} {word.deprel}")

Morpheme Tokenizer (Hybrid Neural + FST)

The MorphemeTokenizer segments inflected Turkic words into labeled surface morphemes. It uses the neural morph model (Glot500) as its primary analyzer, enriched by Apertium HFST transducers and language-specific suffix allomorph tables with phonological rules (vowel harmony, consonant context). Requires pip install "turkicnlp[transformers]".

from turkicnlp.processors.morpheme_tokenizer import MorphemeTokenizer

# --- Kazakh ---
tok = MorphemeTokenizer(lang="kaz")
tok.load()

result = tok.segment("бармадым")
print(result.segments)
# ['бар', 'ма', 'ды', 'м']

print(result.labeled)
# [('бар', 'STEM'), ('ма', 'NEG'), ('ды', 'PST'), ('м', '1SG')]

result = tok.segment("оқығандар")
print(result.labeled)
# [('оқы', 'STEM'), ('ған', 'PTCP.PST'), ('дар', 'PLUR')]

# --- Turkish ---
tok = MorphemeTokenizer(lang="tur")
tok.load()

result = tok.segment("evlerinden")
print(result.labeled)
# [('ev', 'STEM'), ('ler', 'PLUR'), ('in', 'POSS.2SG'), ('den', 'ABL')]

result = tok.segment("gidiyorlar")
print(result.labeled)
# [('gid', 'STEM'), ('iyor', 'PROG'), ('lar', '3PL')]

# Apostrophe boundaries are handled for proper nouns
result = tok.segment("İstanbul'da")
print(result.labeled)
# [('İstanbul', 'STEM'), ("'da", 'LOC')]

# --- Uzbek ---
tok = MorphemeTokenizer(lang="uzb")
tok.load()

result = tok.segment("kitobimdan")
print(result.labeled)
# [('kitob', 'STEM'), ('im', 'POSS.1SG'), ('dan', 'ABL')]

result = tok.segment("bolalarning")
print(result.labeled)
# [('bola', 'STEM'), ('lar', 'PLUR'), ('ning', 'GEN')]

The tokenizer supports all 16 languages with suffix allomorph tables: Turkish, Azerbaijani, Kazakh, Uzbek, Kyrgyz, Tatar, Bashkir, Turkmen, Crimean Tatar, Sakha, Khakas, Tuvan, Altai, Chuvash, Gagauz, and Kumyk.
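The phonological rules mentioned above can be illustrated with a toy example of Turkish two-way vowel harmony (a simplified sketch, not the library's allomorph tables): the plural suffix surfaces as -lar after back-vowel stems and -ler after front-vowel stems.

```python
# Toy illustration of Turkish two-way vowel harmony for the plural suffix.
# Harmony is governed by the last vowel of the stem.
BACK = set("aıou")
FRONT = set("eiöü")

def plural(stem: str) -> str:
    for ch in reversed(stem.lower()):
        if ch in BACK:
            return stem + "lar"
        if ch in FRONT:
            return stem + "ler"
    return stem + "lar"  # no vowel found; default to back form

print(plural("ev"))     # evler
print(plural("okul"))   # okullar
print(plural("çocuk"))  # çocuklar
```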

Neural Morphological Analyzer & Lemmatizer (Glot500)

The multilingual Glot500-based morph model provides UPOS tagging, UD morphological features, and lemmatization for 23 Turkic languages. Requires pip install "turkicnlp[transformers]".

import turkicnlp

# Download tokenizer + multilingual Glot500 morph model
turkicnlp.download("tur", processors=["tokenize", "morph_neural"])

nlp = turkicnlp.Pipeline(
    "tur",
    processors=["tokenize", "morph_neural"],
)

doc = nlp("Çocuklar okula gidiyorlar.")

for sentence in doc.sentences:
    for word in sentence.words:
        print(f"{word.text:20} {word.upos:6} {word.lemma:15} {word.feats}")

Output:

Çocuklar             NOUN   çocuk           Case=Nom|Number=Plur
okula                NOUN   okul            Case=Dat|Number=Sing
gidiyorlar           VERB   gitmek          Aspect=Prog|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin
.                    PUNCT  .               _

The morph analyzer also works for low-resource languages:

# Sakha (Yakut) — directly trained
turkicnlp.download("sah", processors=["tokenize", "morph_neural"])
nlp = turkicnlp.Pipeline("sah", processors=["tokenize", "morph_neural"])
doc = nlp("Мин оскуолаҕа бардым.")

# Karakalpak — zero-shot via Uzbek proxy embedding
turkicnlp.download("kaa", processors=["tokenize", "morph_neural"])
nlp = turkicnlp.Pipeline("kaa", processors=["tokenize", "morph_neural"])
doc = nlp("Men mektepke bardım.")

Supported Languages and Components

Distribution map of Turkic languages
Geographic distribution of Turkic languages (source: Wikimedia Commons)

The table below shows all supported languages with their available scripts and processor status.

Legend:

Backend Description
Rule-based Regex tokenizer, abbreviation lists
Apertium FST Finite-state morphology via hfst (GPL-3.0, downloaded separately)
Stanza/UD Neural models trained on Universal Dependencies treebanks
Custom Stanza Custom-trained Stanza models hosted by turkic-nlp
Glot500 Neural Multilingual POS tagger & dependency parser (Glot500 backbone, 15 languages)
Glot500 Neural Morph Multilingual morphological analyzer & lemmatizer (Glot500 backbone, 23 languages)
NLLB Embeddings and machine translation via NLLB-200
Planned Implementation planned
Not available yet

Oghuz Branch

Language Code Script(s) Tokenize Morph POS Lemma DepParse NER Embed Translate
Turkish tur Latn ■ ● ◆ ◈ ● ◇ ● ◈ ● ◇
Azerbaijani aze Latn, Cyrl ■▲ ◆ ◈ ▲ ◇ ▲ ◈ ▲ ◇
Iranian Azerbaijani azb Arab
Turkmen tuk Latn, Cyrl ■▲ ◆ ◈ ▲ ◇ ▲ ◈ ▲ ◇
Gagauz gag Latn ◆ ◈

Kipchak Branch

Language Code Script(s) Tokenize Morph POS Lemma DepParse NER Embed Translate
Kazakh kaz Cyrl, Latn ■ ● ◆ ◈ ● ◇ ● ◈ ● ◇
Kyrgyz kir Cyrl ■ ● ◆ ◈ ● ◇ ● ◈ ● ◇
Tatar tat Cyrl, Latn ■▲ ◆ ◈ ▲ ◇ ▲ ◈ ▲ ◇
Bashkir bak Cyrl ■▲ ◆ ◈ ▲ ◇ ▲ ◈ ▲ ◇
Crimean Tatar crh Latn, Cyrl ◆ ◈
Karakalpak kaa Latn, Cyrl ◆ ◈ ◇ ◈
Nogai nog Cyrl ◆ ◈ ◇ ◈
Kumyk kum Cyrl ◆ ◈ ◇ ◈
Karachay-Balkar krc Cyrl ◆ ◈ ◇ ◈

Karluk Branch

Language Code Script(s) Tokenize Morph POS Lemma DepParse NER Embed Translate
Uzbek uzb Latn, Cyrl ■ ▲ ◆ ◈ ▲ ◇ ▲ ◈ ▲ ◇
Uyghur uig Arab, Latn ■ ● ◆ ◈ ● ◇ ● ◈ ● ◇

Siberian Branch

Language Code Script(s) Tokenize Morph POS Lemma DepParse NER Embed Translate
Sakha (Yakut) sah Cyrl ◆ ◈ ◇ ◈
Altai alt Cyrl ◆ ◈
Tuvan tyv Cyrl ◆ ◈
Khakas kjh Cyrl ◆ ◈

Oghur Branch

Language Code Script(s) Tokenize Morph POS Lemma DepParse NER Embed Translate
Chuvash chv Cyrl ◆ ◈

Arghu Branch

Language Code Script(s) Tokenize Morph POS Lemma DepParse NER Embed Translate
Khalaj klj Latn

Historical Languages

Language Code Script(s) Tokenize Morph POS Lemma DepParse NER Embed Translate
Ottoman Turkish ota Arab, Latn ◇ ◈
Old Turkic otk Orkh, Latn

Stanza/UD Model Details

The Stanza backend provides neural models trained on Universal Dependencies treebanks. Official Stanza models (●) are downloaded via Stanza's model hub. Custom-trained models (▲) are hosted at turkic-nlp/trained-stanza-models and downloaded automatically.

Language Stanza Code Type UD Treebank(s) Stanza Processors NER Dataset
Turkish tr IMST (default), BOUN, FrameNet, KeNet, ATIS, Penn, Tourism tokenize, pos, lemma, depparse, ner Starlang NER
Kazakh kk KTB tokenize, pos, lemma, depparse, ner KazNERD
Uyghur ug UDT tokenize, pos, lemma, depparse
Kyrgyz ky KTMU tokenize, pos, lemma, depparse
Uzbek uz UzUDT tokenize, pos, lemma, depparse
Turkmen tk Tk-TUD tokenize, pos, lemma, depparse
Azerbaijani az Az-TUD tokenize, pos, lemma, depparse
Tatar tt Ta-TUD tokenize, pos, lemma, depparse
Bashkir ba Ba-TUD tokenize, pos, lemma, depparse

Multilingual Glot500 Neural Models

TurkicNLP provides two multilingual neural models built on a frozen Glot500 backbone with script adapters, language embeddings, and shared BiLSTM layers. Both models are hosted at turkic-nlp/trained-stanza-models and downloaded automatically.

Model Tasks Languages Architecture
POS & DepParser UPOS, dependency parsing 10 trained + 5 zero-shot (15 total) Glot500 → ScriptAdapter → LangEmbed → BiLSTM → POS Head + Biaffine Parser
Morph Analyzer UPOS, UD morph features, lemmatization 20 trained + 3 zero-shot (23 total) Glot500 → ScriptAdapter → LangEmbed → BiLSTM → POS Head + MorphFeat Head + CharCNN LemmaHead

POS & DepParser supported languages: Turkish, Azerbaijani, Uzbek, Turkmen, Kazakh, Kyrgyz, Bashkir, Tatar, Uyghur, Ottoman Turkish + zero-shot: Karakalpak, Kumyk, Sakha, Karachay-Balkar, Nogai

Morph Analyzer supported languages: Turkish, Azerbaijani, Uzbek, Turkmen, Kazakh, Kyrgyz, Bashkir, Tatar, Uyghur, Ottoman Turkish, Crimean Tatar, Khakas, Sakha, Tuvan, Chuvash, Gagauz, Kumyk, Southern Altai, Khalaj, Northern Altai + zero-shot: Karakalpak, Karachay-Balkar, Nogai

Transliteration Support

Bidirectional script conversion is available for all multi-script languages. The transliterator uses a greedy longest-match algorithm with per-language mapping tables.
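The greedy longest-match pass over a mapping table can be sketched as follows (the TABLE fragment below is a tiny illustrative Kazakh Cyrillic→Latin subset, not the library's full mapping):

```python
# Greedy longest-match transliteration: at each position, try the longest
# mapping-table keys first; unmapped characters pass through unchanged.
TABLE = {"Қ": "Q", "қ": "q", "а": "a", "з": "z", "с": "s", "т": "t", "н": "n"}

def transliterate(text: str, table: dict) -> str:
    keys = sorted(table, key=len, reverse=True)  # longest keys first
    out, i = [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            out.append(text[i])  # not in the table: copy as-is
            i += 1
    return "".join(out)

print(transliterate("Қазақстан", TABLE))  # Qazaqstan
```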

Language Direction Scripts Standard
Kazakh ↔ Bidirectional Cyrillic ↔ Latin 2021 official Latin alphabet
Uzbek ↔ Bidirectional Cyrillic ↔ Latin 1995 official Latin alphabet
Azerbaijani ↔ Bidirectional Cyrillic ↔ Latin 1991 official Latin alphabet
Tatar ↔ Bidirectional Cyrillic ↔ Latin Zamanälif
Turkmen ↔ Bidirectional Cyrillic ↔ Latin 1993 official Latin alphabet
Karakalpak ↔ Bidirectional Cyrillic ↔ Latin 2016 Latin alphabet
Crimean Tatar ↔ Bidirectional Cyrillic ↔ Latin Standard Crimean Tatar Latin
Uyghur ↔ Bidirectional Perso-Arabic ↔ Latin Uyghur Latin Yéziqi (ULY)
Ottoman Turkish → One-way Latin → Perso-Arabic Academic transcription
Old Turkic → One-way Runic → Latin Turkological convention

Apertium FST Quality Levels

Level Description Languages
Production >90% coverage on news text Turkish, Kazakh, Tatar
Stable Good coverage, actively maintained Azerbaijani, Kyrgyz, Uzbek
Beta Reasonable coverage, some gaps Turkmen, Bashkir, Uyghur, Crimean Tatar, Chuvash
Prototype Limited coverage, experimental Gagauz, Sakha, Karakalpak, Nogai, Kumyk, Karachay-Balkar, Altai, Tuvan, Khakas

Model Catalog and Apertium Downloads

TurkicNLP uses a model catalog to define download sources per language/script/processor. The catalog lives in:

  • turkicnlp/resources/catalog.json (packaged default)
  • Remote override: ModelRegistry.CATALOG_URL (or TURKICNLP_CATALOG_URL)

For each language, the catalog stores the Apertium source repo and the expected FST script. When turkicnlp.download() is called, it reads the catalog and downloads precompiled .hfst binaries from the url fields. If a language has no URL configured, download will fail with a clear error until the catalog is populated with hosted binaries (for example, a turkic-nlp/apertium-data releases repository).
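A simplified sketch of how such a catalog lookup might behave (the schema and resolve_url helper here are hypothetical; see turkicnlp/resources/catalog.json for the real format):

```python
# Hypothetical catalog lookup: walk lang → script → processor and return
# the configured download URL, failing clearly when none is set.
import json

catalog_json = """
{
  "kaz": {
    "Cyrl": {
      "morph": {"url": "https://example.org/kaz.automorf.hfst"},
      "tokenize": {"url": null}
    }
  }
}
"""

def resolve_url(catalog: dict, lang: str, script: str, processor: str) -> str:
    entry = catalog.get(lang, {}).get(script, {}).get(processor)
    if not entry or not entry.get("url"):
        raise RuntimeError(
            f"No hosted binary configured for {lang}/{script}/{processor}"
        )
    return entry["url"]

catalog = json.loads(catalog_json)
print(resolve_url(catalog, "kaz", "Cyrl", "morph"))
```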

Download folder

All models and resources are downloaded into this folder: ~/.turkicnlp.

Architecture

TurkicNLP follows Stanza's modular pipeline design:

Pipeline("tur", processors=["tokenize", "morph", "pos", "ner", "depparse"])
    │
    ▼
  Document ─── text: "Ben okula vardım"
    │
    ├── script_detect    → script = "Latn"
    ├── tokenize         → sentences, tokens, words
    ├── morph (Apertium) → lemma, pos, feats (via HFST)
    ├── pos (neural)     → refined UPOS, XPOS, feats
    ├── ner (neural)     → BIO tags and entity spans
    └── depparse         → head, deprel
    │
    ▼
  Document ─── annotated with all layers

Pipeline("sah", processors=["tokenize", "morph_neural"])
    │
    ▼
  Document ─── text: "Мин оскуолаҕа бардым"
    │
    ├── script_detect          → script = "Cyrl"
    ├── tokenize               → sentences, tokens, words
    └── morph_neural (Glot500) → upos, feats, lemma
    │
    ▼
  Document ─── annotated with morphological analysis

Pipeline("azb", processors=["embeddings", "translate"], translate_tgt_lang="eng")
    │
    ▼
  Document ─── text: "من کتاب اوخویورام"
    │
    ├── script_detect          → script = "Arab"
    ├── embeddings (NLLB)      → sentence/document vectors
    └── translate (NLLB)       → sentence/document translation
           (src resolved from FLORES map: azb -> azb_Arab,
            tgt resolved from ISO-3: eng -> eng_Latn)
    │
    ▼
  Document ─── annotated with all layers

Key Abstractions

  • Document → Sentence → Token → Word hierarchy (maps to CoNLL-U)
  • Processor ABC with PROVIDES, REQUIRES, NAME class attributes
  • Pipeline orchestrator with dependency resolution and script-aware model loading
  • ProcessorRegistry for pluggable backends (rule, Apertium, Stanza, Glot500, NLLB)
  • ModelRegistry with remote catalog and local caching at ~/.turkicnlp/models/
  • NLLB FLORES language map for ISO-3 to NLLB code resolution in translation (e.g. tuk -> tuk_Latn)
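The Document → Sentence → Word mapping onto CoNLL-U columns can be sketched with minimal dataclasses (field names here are illustrative and cover only 6 of the 10 CoNLL-U columns):

```python
# Minimal sketch of a CoNLL-U-shaped hierarchy; unused columns emit "_".
from dataclasses import dataclass, field

@dataclass
class Word:
    id: int
    text: str
    lemma: str = "_"
    upos: str = "_"
    head: int = 0
    deprel: str = "_"

@dataclass
class Sentence:
    words: list = field(default_factory=list)

    def to_conllu(self) -> str:
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        return "\n".join(
            f"{w.id}\t{w.text}\t{w.lemma}\t{w.upos}\t_\t_\t{w.head}\t{w.deprel}\t_\t_"
            for w in self.words
        )

sent = Sentence([Word(1, "Мен", "мен", "PRON", 3, "nsubj"),
                 Word(2, "мектепке", "мектеп", "NOUN", 3, "obl"),
                 Word(3, "бардым", "бар", "VERB", 0, "root")])
print(sent.to_conllu())
```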

Model Storage Layout

~/.turkicnlp/models/
├── kaz/
│   ├── Cyrl/
│   │   ├── tokenize/rule/
│   │   ├── morph/apertium/    ← GPL-3.0 (downloaded separately)
│   │   │   ├── kaz.automorf.hfst
│   │   │   ├── LICENSE
│   │   │   └── metadata.json
│   │   ├── pos/neural/
│   │   └── depparse/neural/
│   └── Latn/
│       └── tokenize/rule/
├── tur/
│   └── Latn/
│       └── ...
├── multilingual/
│   ├── multilingual_glot500.pt           ← POS/DepParse checkpoint
│   └── multilingual_morph_glot500.pt     ← Morph analyzer checkpoint
├── huggingface/
│   ├── cis-lmu--glot500-base/            ← Shared Glot500 backbone
│   │   ├── config.json
│   │   ├── model.safetensors
│   │   └── ...
│   └── facebook--nllb-200-distilled-600M/
│       ├── config.json
│       ├── model.safetensors (or pytorch_model.bin)
│       ├── tokenizer.json
│       └── ...
└── catalog.json

# Stanza models are managed by Stanza at ~/stanza_resources/

Notes:

  • NLLB embeddings and translation use a shared Hugging Face model under ~/.turkicnlp/models/huggingface/.
  • The NLLB model is downloaded once and reused across supported Turkic languages.
  • The Glot500 backbone is shared between the POS/DepParse and Morph analyzer models under ~/.turkicnlp/models/huggingface/.
  • Unlike Apertium/Stanza components, NLLB and Glot500 artifacts are not duplicated per language/script directory.

License

  • Library code: Apache License 2.0
  • Stanza models: Apache License 2.0 — managed by Stanza's own download mechanism
  • Apertium FST data: GPL-3.0 — downloaded separately at runtime, never bundled in the pip package
  • NLLB-200 model weights/tokenizer: CC-BY-NC-4.0 — downloaded from Hugging Face at runtime and reused from ~/.turkicnlp/models/huggingface/ (non-commercial license terms apply)

Development

git clone https://github.com/turkic-nlp/turkicnlp.git
cd turkicnlp
pip install -e ".[dev]"
pytest

Contributing

Contributions are welcome, especially:

  • New language support — tag mappings, abbreviation lists, test data
  • Neural model training — POS taggers, parsers, NER models
  • Apertium FST improvements — better coverage for prototype-level languages
  • Anything else — documentation fixes, examples, bug reports, or other improvements

Feel free to open issues and pull requests.

Acknowledgements

TurkicNLP builds on the work of many researchers and communities. We gratefully acknowledge the following:

Stanza

Stanza provides the pretrained neural models for tokenization, POS tagging, lemmatization, dependency parsing, and NER.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. [paper]

Universal Dependencies Treebanks

The Stanza models are trained on Universal Dependencies treebanks created by the following teams:

Turkish (UD_Turkish-IMST)

Umut Sulubacak, Memduh Gokirmak, Francis Tyers, Cagri Coltekin, Joakim Nivre, and Gulsen Cebiroglu Eryigit. Universal Dependencies for Turkish. COLING 2016. [paper]

Turkish (UD_Turkish-BOUN)

Utku Turk, Furkan Atmaca, Saziye Betul Ozates, Gozde Berk, Seyyit Talha Bedir, Abdullatif Koksal, Balkiz Ozturk Basaran, Tunga Gungor, and Arzucan Ozgur. Resources for Turkish Dependency Parsing: Introducing the BOUN Treebank and the BoAT Annotation Tool. Language Resources and Evaluation 56(1), 2022. [paper]

Turkish (UD_Turkish-FrameNet, KeNet, ATIS, Penn, Tourism)

Busra Marsan, Neslihan Kara, Merve Ozcelik, Bilge Nas Arican, Neslihan Cesur, Asli Kuzgun, Ezgi Saniyar, Oguzhan Kuyrukcu, and Olcay Taner Yildiz. Starlang Software and Ozyegin University. These treebanks cover diverse domains including FrameNet frames, WordNet examples, airline travel, Penn Treebank translations, and tourism reviews.

Kazakh (UD_Kazakh-KTB)

Aibek Makazhanov, Jonathan North Washington, and Francis Tyers. Towards a Free/Open-source Universal-dependency Treebank for Kazakh. TurkLang 2015. [paper]

Uyghur (UD_Uyghur-UDT)

Marhaba Eli (Xinjiang University), Daniel Zeman (Charles University), and Francis Tyers. [treebank]

Kyrgyz (UD_Kyrgyz-KTMU)

Ibrahim Benli. [treebank]

Ottoman Turkish (UD_Ottoman_Turkish-BOUN)

Saziye Betul Ozates, Tarik Emre Tiras, Efe Eren Genc, and Esma Fatima Bilgin Tasdemir. Dependency Annotation of Ottoman Turkish with Multilingual BERT. LAW-XVIII, 2024. [paper]

NER Datasets

Turkish NER (Starlang)

B. Ertopcu, A. B. Kanburoglu, O. Topsakal, O. Acikgoz, A. T. Gurkan, B. Ozenc, I. Cam, B. Avar, G. Ercan, and O. T. Yildiz. A New Approach for Named Entity Recognition. UBMK 2017. [paper]

Kazakh NER (KazNERD)

Rustem Yeshpanov, Yerbolat Khassanov, and Huseyin Atakan Varol (ISSAI, Nazarbayev University). KazNERD: Kazakh Named Entity Recognition Dataset. LREC 2022. [paper]

Glot500

The multilingual Glot500 model serves as the frozen backbone for TurkicNLP's neural POS/DepParse and Morph analyzer models.

ImaniGooghari, Ayyoob, Peiqin Lin, Amir Hossein Kargaran, Silvia Severini, Masoud Jalili Sabet, Nora Kassner, Chunlan Ma, Helmut Schmid, André Martins, François Yvon, and Hinrich Schütze. 2023. Glot500: Scaling Multilingual Corpora and Language Technology to 500 Languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). [paper]

GlotLID

GlotLID provides the FastText-based model used for language identification.

Amir Hossein Kargaran, Ayyoob Imani, François Yvon, Hinrich Schütze (2023). GlotLID: Language Identification for Low-Resource Languages. The 2023 Conference on Empirical Methods in Natural Language Processing. Model

Wiktextract / Kaikki.org

Morphological training data for extended Turkic languages was extracted from Wiktionary using Wiktextract. The structured data is available at kaikki.org.

Tatu Ylonen. 2022. Wiktextract: Wiktionary as Machine-Readable Structured Data. In Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022). [paper]

UniMorph

The Universal Morphology (UniMorph) project provides morphological paradigms used for training and evaluating the multilingual morph analyzer across Turkic languages.

John Sylak-Glassman. 2016. The Composition and Use of the Universal Morphological Feature Schema (UniMorph Schema). Johns Hopkins University. [paper]

NLLB Embeddings & Machine Translation

TurkicNLP embeddings backend uses encoder pooling on:

facebook/nllb-200-distilled-600M

Reference:

NLLB Team, Marta R. Costa-jussà, et al. 2022. No Language Left Behind: Scaling Human-Centered Machine Translation. [paper]

Other Organisations

  • Apertium — morphological transducers covering 20+ Turkic languages
  • SIGTURK — ACL Special Interest Group on Turkic Languages
  • ISSAI — Institute of Smart Systems and Artificial Intelligence, Nazarbayev University, for Kazakh NLP resources
  • Universal Dependencies — the framework and community behind Turkic treebanks
  • Turkic Interlingua — resources for machine translation for Turkic languages
  • Turkic UD — group working on harmonizing Turkic UD treebanks
  • TurkLang — Conference on Computer Processing of Turkic Languages (2013–present)