Open · Research-backed · Community-driven

NLP for 20+ Turkic languages.

A practical, pip-installable Python toolkit — built so you can go from raw text to annotated output across the Turkic language family.

24 languages · Morphology & POS · Translation & Embeddings · Apache 2.0

From text to understanding

One unified pipeline for tokenization, morphological analysis, POS tagging, dependency parsing, named entity recognition, multilingual embeddings, and machine translation.

$ pip install turkicnlp

Works with Turkish, Kazakh, Uzbek, Turkmen, Azerbaijani, Kyrgyz, Uyghur, Tatar, Bashkir and 15 more languages.

🔧

The Toolkit

A Stanza-inspired modular pipeline with Apertium FST morphology, NLLB-200 translation and embeddings, and Stanza neural models for parsing and tagging.

Supports 24 Turkic languages. Script-aware from the ground up — Latin, Cyrillic, and Perso-Arabic handled natively.

Toolkit capabilities

Everything you need to build Turkic NLP systems

Processor-based pipeline architecture — pick what you need, chain it together, get annotated output in CoNLL-U or JSON.
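To make the processor idea concrete, here is a minimal, self-contained sketch of a processor chain in plain Python. It does not use the turkicnlp API: `Token`, `Document`, `REGISTRY`, `build_pipeline`, and both toy processors are hypothetical names invented for illustration, and the "tagger" only separates punctuation from everything else.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Callable, Dict, List


@dataclass
class Token:
    form: str
    upos: str = "_"  # "_" marks an unannotated field, as in CoNLL-U


@dataclass
class Document:
    text: str
    tokens: List[Token] = field(default_factory=list)


# A processor takes a document, annotates it, and returns it.
Processor = Callable[[Document], Document]


def whitespace_tokenizer(doc: Document) -> Document:
    """Toy tokenizer: split on whitespace only."""
    doc.tokens = [Token(form) for form in doc.text.split()]
    return doc


def toy_tagger(doc: Document) -> Document:
    """Toy tagger: PUNCT for tokens with no letters or digits, X otherwise."""
    for tok in doc.tokens:
        has_alnum = any(ch.isalnum() for ch in tok.form)
        tok.upos = "X" if has_alnum else "PUNCT"
    return doc


# Hypothetical registry mapping processor names to implementations.
REGISTRY: Dict[str, Processor] = {
    "tokenize": whitespace_tokenizer,
    "pos": toy_tagger,
}


def build_pipeline(names: List[str]) -> Callable[[str], Document]:
    """Chain the named processors in the order given."""
    steps = [REGISTRY[name] for name in names]

    def run(text: str) -> Document:
        doc = Document(text)
        for step in steps:
            doc = step(doc)
        return doc

    return run


nlp = build_pipeline(["tokenize", "pos"])
doc = nlp("Men kitob o'qidim .")
as_json = json.dumps(asdict(doc), ensure_ascii=False)
```

Because each processor only reads and writes the shared document, dropping or reordering steps is a matter of changing the list passed to `build_pipeline`.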

🔤

Tokenization & Scripts

Neural (Stanza) and rule-based tokenizers for Latin, Cyrillic, and Perso-Arabic scripts. Automatic script detection and transliteration.
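As a rough illustration of script detection, the sketch below counts letters by the script named in their Unicode character names and returns the majority. This is a simple heuristic built on the standard library, not the toolkit's own detector; `detect_script` and `SCRIPT_PREFIXES` are hypothetical names.

```python
import unicodedata
from collections import Counter

# Unicode character names start with the script name,
# e.g. "CYRILLIC CAPITAL LETTER KA WITH DESCENDER".
SCRIPT_PREFIXES = {
    "LATIN": "Latin",
    "CYRILLIC": "Cyrillic",
    "ARABIC": "Perso-Arabic",
}


def detect_script(text: str) -> str:
    """Return the script that most letters in `text` belong to."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        name = unicodedata.name(ch, "")
        for prefix, label in SCRIPT_PREFIXES.items():
            if name.startswith(prefix):
                counts[label] += 1
                break
    if not counts:
        return "Unknown"
    return counts.most_common(1)[0][0]


latin = detect_script("Türkçe")
cyrillic = detect_script("Қазақ тілі")
arabic = detect_script("ئۇيغۇرچە")
```

A majority vote keeps the result stable when a sentence mixes scripts, for example a Cyrillic text that quotes a Latin-script brand name.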

🧬

Morphological Analysis

Apertium HFST finite-state transducers for 20 languages, loaded natively via Python. No system Apertium install required.
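Apertium analysers conventionally emit results in a stream format such as `^evlerde/ev<n><pl><loc>$`, where the surface form is followed by one or more readings and unknown words are marked with `*`. The sketch below parses that text format with the standard library only. It does not load a transducer, it ignores escaped characters, and the sample string is an illustrative example rather than real toolkit output; `Analysis`, `parse_lexical_unit`, and `parse_stream` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Analysis:
    lemma: str
    tags: List[str]


LEXICAL_UNIT = re.compile(r"\^(.*?)\$")  # text between ^ and $
TAG = re.compile(r"<([^<>]+)>")          # one <tag>


def parse_lexical_unit(unit: str) -> Tuple[str, List[Analysis]]:
    """Split 'surface/reading1/reading2' into the surface form and its readings."""
    surface, *readings = unit.split("/")
    analyses = []
    for reading in readings:
        if reading.startswith("*"):
            continue  # '*' marks a word the analyser does not know
        lemma = reading.split("<", 1)[0]
        analyses.append(Analysis(lemma, TAG.findall(reading)))
    return surface, analyses


def parse_stream(text: str) -> List[Tuple[str, List[Analysis]]]:
    """Parse every ^...$ lexical unit found in `text`."""
    return [parse_lexical_unit(unit) for unit in LEXICAL_UNIT.findall(text)]


# Illustrative input: one analysed word and one unknown word.
parsed = parse_stream("^evlerde/ev<n><pl><loc>$ ^xyz/*xyz$")
```

Here `parsed[0]` holds the surface form `evlerde` with a single reading (lemma `ev`, tags `n`, `pl`, `loc`), while the unknown word yields an empty list of readings.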

🏷️

POS & Dependency Parsing

Neural models for 15 languages: Stanza models trained on UD treebanks for Turkish, Kazakh, Kyrgyz, and Uyghur, plus custom-trained Stanza or multilingual models for Uzbek, Turkmen, Azerbaijani, Tatar, Bashkir, Sakha, Karakalpak, Kumyk, Karachay-Balkar, Nogai, and Ottoman Turkish.

🌐

Translation & Embeddings

NLLB-200 (600M) translation and multilingual sentence embeddings for 11 Turkic languages. One download, all languages.
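For readers who want to see what NLLB-200 translation looks like outside the toolkit, here is a sketch that calls the model directly through Hugging Face `transformers`. Several things are assumptions: that the 600M model corresponds to the public `facebook/nllb-200-distilled-600M` checkpoint, that the 11 languages are the ones listed in `NLLB_CODES`, and that `translate` (a hypothetical helper) resembles what the toolkit does internally. NLLB identifies languages by FLORES-200 codes, which combine a language and a script.

```python
# Assumed mapping from three-letter language codes to FLORES-200 codes.
NLLB_CODES = {
    "tur": "tur_Latn",  # Turkish
    "kaz": "kaz_Cyrl",  # Kazakh
    "kir": "kir_Cyrl",  # Kyrgyz
    "uig": "uig_Arab",  # Uyghur
    "uzb": "uzn_Latn",  # Uzbek (Northern Uzbek in FLORES-200)
    "aze": "azj_Latn",  # Azerbaijani (North Azerbaijani in FLORES-200)
    "azb": "azb_Arab",  # South Azerbaijani
    "tat": "tat_Cyrl",  # Tatar
    "bak": "bak_Cyrl",  # Bashkir
    "tuk": "tuk_Latn",  # Turkmen
    "crh": "crh_Latn",  # Crimean Tatar
}


def translate(text, src, tgt_code="eng_Latn",
              model_name="facebook/nllb-200-distilled-600M"):
    """Translate `text` from language `src` into the FLORES-200 code `tgt_code`.

    Requires `transformers` and `torch`; the first call downloads the model.
    """
    # Imported lazily so the mapping above is usable without transformers.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang=NLLB_CODES[src])
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt")
    generated = model.generate(
        **inputs,
        # NLLB selects the target language via the first generated token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_code),
        max_new_tokens=128,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```

A call such as `translate("Сәлем, әлем!", "kaz")` would return an English string. The same checkpoint serves every language pair in the table, which is what makes a single download sufficient.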

📄

CoNLL-U I/O

Full CoNLL-U parser and writer. Import treebanks, export annotated documents, and plug into any UD-compatible pipeline.
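The CoNLL-U format itself is simple enough to sketch in a few lines: ten tab-separated columns per token, `#` comment lines, and a blank line after each sentence. The reader and writer below are a minimal standard-library illustration, not the toolkit's parser; `parse_conllu` and `write_conllu` are hypothetical names, the sample annotation is illustrative, and multiword-token ranges and empty nodes are kept as ordinary rows without special handling.

```python
from typing import Dict, List

# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[Dict]:
    """Parse CoNLL-U text into a list of sentences (comments + token rows)."""
    sentences, comments, tokens = [], [], []
    for line in text.splitlines():
        if not line.strip():          # blank line closes a sentence
            if tokens:
                sentences.append({"comments": comments, "tokens": tokens})
            comments, tokens = [], []
        elif line.startswith("#"):    # sentence-level comment
            comments.append(line)
        else:
            cols = line.split("\t")
            if len(cols) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
            tokens.append(dict(zip(FIELDS, cols)))
    if tokens:                        # file without a trailing blank line
        sentences.append({"comments": comments, "tokens": tokens})
    return sentences


def write_conllu(sentences: List[Dict]) -> str:
    """Serialize sentences back to CoNLL-U, one blank line after each."""
    blocks = []
    for sent in sentences:
        rows = ["\t".join(tok[f] for f in FIELDS) for tok in sent["tokens"]]
        blocks.append("\n".join(sent["comments"] + rows) + "\n")
    return "\n".join(blocks) + "\n" if blocks else ""


# Illustrative three-token sentence, built with explicit tabs.
SAMPLE = "\n".join([
    "# text = Ali geldi.",
    "\t".join(["1", "Ali", "Ali", "PROPN", "_", "Case=Nom|Number=Sing", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "geldi", "gel", "VERB", "_", "Tense=Past", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
    "",
])

sentences = parse_conllu(SAMPLE)
round_trip = write_conllu(sentences)
```

Keeping every column as a string means the writer reproduces the input exactly, which is the property that matters when passing files between UD-compatible tools.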

🔬

Research-ready

MWT expansion, tag mapping (Apertium → UD), batch processing, and GPU support. Built for reproducible NLP research.
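Tag mapping from Apertium to UD amounts to translating an analyser's tag sequence into a UPOS label plus a FEATS string. The sketch below shows the idea with a deliberately tiny table. The tag symbols are common Apertium ones, but the table is an illustrative subset and not the toolkit's actual mapping; `UPOS_MAP`, `FEAT_MAP`, and `apertium_to_ud` are hypothetical names.

```python
from typing import List, Tuple

# Illustrative subset: Apertium part-of-speech tags -> UD UPOS.
UPOS_MAP = {
    "n": "NOUN", "np": "PROPN", "v": "VERB", "adj": "ADJ", "adv": "ADV",
    "prn": "PRON", "det": "DET", "num": "NUM", "post": "ADP",
    "cnjcoo": "CCONJ", "cnjsub": "SCONJ", "ij": "INTJ",
}

# Illustrative subset: Apertium inflection tags -> UD (feature, value).
FEAT_MAP = {
    "sg": ("Number", "Sing"), "pl": ("Number", "Plur"),
    "nom": ("Case", "Nom"), "acc": ("Case", "Acc"), "dat": ("Case", "Dat"),
    "gen": ("Case", "Gen"), "loc": ("Case", "Loc"), "abl": ("Case", "Abl"),
}


def apertium_to_ud(tags: List[str]) -> Tuple[str, str]:
    """Map a list of Apertium tags to (UPOS, FEATS); unknown tags are ignored."""
    upos = "X"  # UD's label for "other"
    feats = {}
    for tag in tags:
        if tag in UPOS_MAP and upos == "X":
            upos = UPOS_MAP[tag]      # first part-of-speech tag wins
        elif tag in FEAT_MAP:
            name, value = FEAT_MAP[tag]
            feats[name] = value
    # UD writes features sorted by name and joined with "|"; "_" means none.
    feats_str = "|".join(f"{k}={v}" for k, v in sorted(feats.items())) or "_"
    return upos, feats_str


noun = apertium_to_ud(["n", "pl", "loc"])
unknown = apertium_to_ud(["foo"])
```

With this table, the tags `n`, `pl`, `loc` map to `NOUN` with `Case=Loc|Number=Plur`, and an unrecognized tag falls back to `X` with no features.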

Language coverage

24 Turkic languages and growing

From well-resourced Turkish to endangered varieties — the toolkit covers the full breadth of the family.

Turkish · tur · Full pipeline
Kazakh · kaz · Full pipeline
Kyrgyz · kir · Full pipeline
Uyghur · uig · Full pipeline
Uzbek · uzb · Full pipeline
Azerbaijani · aze · Full pipeline
Tatar · tat · Full pipeline
Bashkir · bak · Full pipeline
Turkmen · tuk · Full pipeline
Crimean Tatar · crh · Morph + MT
S. Azerbaijani · azb · Embeddings + MT
Sakha · sah · Morph + POS/Dep
Karakalpak · kaa · Morph + POS/Dep
Kumyk · kum · Morph + POS/Dep
Ottoman Turkish · ota · POS/Dep
Chuvash · chv · Morphology
Gagauz · gag · Morphology
Nogai · nog · Morph + POS/Dep
Karachay-Balkar · krc · Morph + POS/Dep
Altai · alt · Morphology
Tuvan · tyv · Morphology
Khakas · kjh · Morphology
Khalaj · klj · Morphology
Old Turkish · otk · Transliteration

Support the project

Help keep Turkic NLP open

The toolkit is free and open-source. If this work is useful to you — whether for research, teaching, or building products — consider supporting its development.

Funds go directly toward model training, infrastructure, and ongoing development.