A practical, pip-installable Python toolkit — built so you can go from raw text to annotated output across the Turkic language family.
One unified pipeline for tokenization, morphological analysis, POS tagging, dependency parsing, named entity recognition, multilingual embeddings, and machine translation.
Works with Turkish, Kazakh, Uzbek, Turkmen, Azerbaijani, Kyrgyz, Uyghur, Tatar, Bashkir and 15 more languages.
A Stanza-inspired modular pipeline with Apertium FST morphology, NLLB-200 translation and embeddings, and Stanza neural models for parsing and tagging.
Supports 24 Turkic languages. Script-aware from the ground up — Latin, Cyrillic, and Perso-Arabic handled natively.
Processor-based pipeline architecture — pick what you need, chain it together, get annotated output in CoNLL-U or JSON.
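A processor-based pipeline like this can be sketched in a few lines. The class and method names below (`Pipeline`, `process`, the toy tokenizer and tagger) are hypothetical illustrations of the pattern, not this toolkit's actual API — each processor annotates a shared document object, and the pipeline runs them in order:

```python
import json

class WhitespaceTokenizer:
    """Toy tokenizer: splits the raw text on whitespace."""
    def process(self, doc):
        doc["tokens"] = doc["text"].split()
        return doc

class UppercaseTagger:
    """Toy stand-in for a POS tagger: capitalized tokens -> PROPN."""
    def process(self, doc):
        doc["tags"] = ["PROPN" if t[0].isupper() else "X" for t in doc["tokens"]]
        return doc

class Pipeline:
    """Chain processors; each one enriches the same document dict."""
    def __init__(self, processors):
        self.processors = processors

    def __call__(self, text):
        doc = {"text": text}
        for p in self.processors:
            doc = p.process(doc)
        return doc

nlp = Pipeline([WhitespaceTokenizer(), UppercaseTagger()])
doc = nlp("Almaty is in Kazakhstan")
print(json.dumps(doc["tags"]))  # ["PROPN", "X", "X", "PROPN"]
```

Because processors share one document object, you can drop in only the stages you need (tokenizer only, or tokenizer + tagger + parser) without changing downstream code.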
Neural (Stanza) and rule-based tokenizers for Latin, Cyrillic, and Perso-Arabic scripts. Automatic script detection and transliteration.
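Automatic script detection can be done by counting characters per Unicode block. The sketch below is illustrative only (the toolkit's own detector may use different ranges or heuristics):

```python
def detect_script(text):
    """Classify text as latin, cyrillic, or arabic by Unicode code points."""
    counts = {"latin": 0, "cyrillic": 0, "arabic": 0}
    for ch in text:
        cp = ord(ch)
        if 0x0041 <= cp <= 0x024F:        # Basic Latin + Latin Extended
            counts["latin"] += 1
        elif 0x0400 <= cp <= 0x04FF:      # Cyrillic (incl. Kazakh ә, қ, ...)
            counts["cyrillic"] += 1
        elif 0x0600 <= cp <= 0x077F:      # Arabic + Arabic Supplement
            counts["arabic"] += 1
    return max(counts, key=counts.get)

print(detect_script("Sälem älem"))   # latin
print(detect_script("Сәлем әлем"))   # cyrillic
print(detect_script("ياخشىمۇسىز"))   # arabic
```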
Apertium HFST finite-state transducers for 20 languages, loaded natively via Python. No system Apertium install required.
Neural models for 15 languages. Stanza models trained on UD treebanks for Turkish, Kazakh, Kyrgyz, Uyghur. Custom-trained Stanza models or multilingual models for Uzbek, Turkmen, Azerbaijani, Tatar, Bashkir, Sakha, Karakalpak, Kumyk, Karachay-Balkar, Nogai, Ottoman Turkish.
NLLB-200 (600M) translation and multilingual sentence embeddings for 11 Turkic languages. One download, all languages.
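Sentence embeddings are typically compared with cosine similarity. The toy 3-dimensional vectors below are placeholders — real NLLB-200 embeddings are high-dimensional vectors produced by the model's encoder:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

emb_tr = [0.9, 0.1, 0.2]   # placeholder for a Turkish sentence embedding
emb_kk = [0.8, 0.2, 0.25]  # placeholder for a Kazakh sentence embedding
print(round(cosine(emb_tr, emb_kk), 3))
```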
Full CoNLL-U parser and writer. Import treebanks, export annotated documents, and plug into any UD-compatible pipeline.
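A minimal reading of the standard 10-column CoNLL-U format looks like this (a sketch: a full parser also handles multi-word-token ranges, empty nodes, and metadata comments):

```python
# The ten standard CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, tokens = [], []
    for line in text.splitlines():
        if line.startswith("#"):          # sentence-level comments
            continue
        if not line.strip():              # blank line ends a sentence
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        tokens.append(dict(zip(FIELDS, line.split("\t"))))
    if tokens:
        sentences.append(tokens)
    return sentences

sample = "\n".join([
    "# text = Ev güzel",
    "1\tEv\tev\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "2\tgüzel\tgüzel\tADJ\t_\t_\t0\troot\t_\t_",
])
sents = parse_conllu(sample)
print(sents[0][0]["form"], sents[0][1]["upos"])  # Ev ADJ
```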
MWT expansion, tag mapping (Apertium → UD), batch processing, and GPU support. Built for reproducible NLP research.
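Apertium-to-UD tag mapping boils down to a lookup table from Apertium morphological tags to UD UPOS values. The table below is a small illustrative subset, not the toolkit's full mapping:

```python
# Common Apertium part-of-speech tags -> UD UPOS (illustrative subset).
APERTIUM_TO_UPOS = {
    "n": "NOUN", "np": "PROPN", "v": "VERB", "adj": "ADJ",
    "adv": "ADV", "prn": "PRON", "num": "NUM", "post": "ADP",
    "cnjcoo": "CCONJ", "ij": "INTJ",
}

def map_tags(apertium_tags):
    """Return the UPOS for the first recognized Apertium tag, else 'X'."""
    for tag in apertium_tags:
        if tag in APERTIUM_TO_UPOS:
            return APERTIUM_TO_UPOS[tag]
    return "X"

# e.g. Apertium analysis 'алма<n><nom>' yields tags ['n', 'nom']
print(map_tags(["n", "nom"]))        # NOUN
print(map_tags(["v", "tv", "past"])) # VERB
```

Morphological subtags (case, tense, person) would map to UD features the same way, via a second lookup table.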
From well-resourced Turkish to endangered varieties — the toolkit covers the full breadth of the family.
The toolkit is free and open-source. If this work is useful to you — whether for research, teaching, or building products — consider supporting its development.
Funds go directly toward model training, compute infrastructure, and ongoing maintenance.