A practical, pip-installable Python toolkit — built so you can go from raw text to annotated output across the Turkic language family.
One unified pipeline for tokenization, morphological analysis, POS tagging, dependency parsing, named entity recognition, multilingual embeddings, and machine translation.
Works with Turkish, Kazakh, Uzbek, Turkmen, Azerbaijani, Kyrgyz, Uyghur, Tatar, Bashkir and 15 more languages.
A Stanza-inspired modular pipeline with Apertium FST morphology, NLLB-200 translation and embeddings, and Stanza neural models for parsing and tagging.
Supports 24 Turkic languages. Script-aware from the ground up — Latin, Cyrillic, and Perso-Arabic handled natively.
Processor-based pipeline architecture — pick what you need, chain it together, get annotated output in CoNLL-U or JSON.
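A processor-based pipeline like this can be sketched in a few lines. The class and method names below (`Pipeline`, `process`, the toy tokenizer and tagger) are hypothetical illustrations of the pattern, not this toolkit's actual API — each processor annotates a shared document object, and the pipeline runs them in order:

```python
import json

class WhitespaceTokenizer:
    """Toy tokenizer: splits the raw text on whitespace."""
    def process(self, doc):
        doc["tokens"] = doc["text"].split()
        return doc

class UppercaseTagger:
    """Toy stand-in for a POS tagger: capitalized tokens -> PROPN."""
    def process(self, doc):
        doc["tags"] = ["PROPN" if t[0].isupper() else "X" for t in doc["tokens"]]
        return doc

class Pipeline:
    """Chain processors; each one enriches the same document dict."""
    def __init__(self, processors):
        self.processors = processors

    def __call__(self, text):
        doc = {"text": text}
        for p in self.processors:
            doc = p.process(doc)
        return doc

nlp = Pipeline([WhitespaceTokenizer(), UppercaseTagger()])
doc = nlp("Almaty is in Kazakhstan")
print(json.dumps(doc["tags"]))  # ["PROPN", "X", "X", "PROPN"]
```

Because processors share one document object, you can drop in only the stages you need (tokenizer only, or tokenizer + tagger + parser) without changing downstream code.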
Neural (Stanza) and rule-based tokenizers for Latin, Cyrillic, and Perso-Arabic scripts. Automatic script detection and transliteration.
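Automatic script detection can be done by counting characters per Unicode block. The sketch below is illustrative only (the toolkit's own detector may use different ranges or heuristics):

```python
def detect_script(text):
    """Classify text as latin, cyrillic, or arabic by Unicode code points."""
    counts = {"latin": 0, "cyrillic": 0, "arabic": 0}
    for ch in text:
        cp = ord(ch)
        if 0x0041 <= cp <= 0x024F:        # Basic Latin + Latin Extended
            counts["latin"] += 1
        elif 0x0400 <= cp <= 0x04FF:      # Cyrillic (incl. Kazakh ә, қ, ...)
            counts["cyrillic"] += 1
        elif 0x0600 <= cp <= 0x077F:      # Arabic + Arabic Supplement
            counts["arabic"] += 1
    return max(counts, key=counts.get)

print(detect_script("Sälem älem"))   # latin
print(detect_script("Сәлем әлем"))   # cyrillic
print(detect_script("ياخشىمۇسىز"))   # arabic
```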
Apertium HFST finite-state transducers for 20 languages, loaded natively via Python. No system Apertium install required.
Neural models for 15 languages. Stanza models trained on UD treebanks for Turkish, Kazakh, Kyrgyz, Uyghur. Custom-trained Stanza models or multilingual models for Uzbek, Turkmen, Azerbaijani, Tatar, Bashkir, Sakha, Karakalpak, Kumyk, Karachay-Balkar, Nogai, Ottoman Turkish.
NLLB-200 (600M) translation and multilingual sentence embeddings for 11 Turkic languages. One download, all languages.
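Sentence embeddings are typically compared with cosine similarity. The toy 3-dimensional vectors below are placeholders — real NLLB-200 embeddings are high-dimensional vectors produced by the model's encoder:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

emb_tr = [0.9, 0.1, 0.2]   # placeholder for a Turkish sentence embedding
emb_kk = [0.8, 0.2, 0.25]  # placeholder for a Kazakh sentence embedding
print(round(cosine(emb_tr, emb_kk), 3))
```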
Full CoNLL-U parser and writer. Import treebanks, export annotated documents, and plug into any UD-compatible pipeline.
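A minimal reading of the standard 10-column CoNLL-U format looks like this (a sketch: a full parser also handles multi-word-token ranges, empty nodes, and metadata comments):

```python
# The ten standard CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, tokens = [], []
    for line in text.splitlines():
        if line.startswith("#"):          # sentence-level comments
            continue
        if not line.strip():              # blank line ends a sentence
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        tokens.append(dict(zip(FIELDS, line.split("\t"))))
    if tokens:
        sentences.append(tokens)
    return sentences

sample = "\n".join([
    "# text = Ev güzel",
    "1\tEv\tev\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "2\tgüzel\tgüzel\tADJ\t_\t_\t0\troot\t_\t_",
])
sents = parse_conllu(sample)
print(sents[0][0]["form"], sents[0][1]["upos"])  # Ev ADJ
```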
MWT expansion, tag mapping (Apertium → UD), batch processing, and GPU support. Built for reproducible NLP research.
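Apertium-to-UD tag mapping boils down to a lookup table from Apertium morphological tags to UD UPOS values. The table below is a small illustrative subset, not the toolkit's full mapping:

```python
# Common Apertium part-of-speech tags -> UD UPOS (illustrative subset).
APERTIUM_TO_UPOS = {
    "n": "NOUN", "np": "PROPN", "v": "VERB", "adj": "ADJ",
    "adv": "ADV", "prn": "PRON", "num": "NUM", "post": "ADP",
    "cnjcoo": "CCONJ", "ij": "INTJ",
}

def map_tags(apertium_tags):
    """Return the UPOS for the first recognized Apertium tag, else 'X'."""
    for tag in apertium_tags:
        if tag in APERTIUM_TO_UPOS:
            return APERTIUM_TO_UPOS[tag]
    return "X"

# e.g. Apertium analysis 'алма<n><nom>' yields tags ['n', 'nom']
print(map_tags(["n", "nom"]))        # NOUN
print(map_tags(["v", "tv", "past"])) # VERB
```

Morphological subtags (case, tense, person) would map to UD features the same way, via a second lookup table.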
From well-resourced Turkish to endangered varieties — the toolkit covers the full breadth of the family.
The toolkit is free and open-source. If this work is useful to you — whether for research, teaching, or building products — consider supporting its development.
Funds go directly toward model training, compute infrastructure, and ongoing maintenance.