The Assamese-AiW treebank is a manually annotated corpus of Assamese. Assamese is an Indo-Aryan language written in the Assamese script, from left to right. Its basic word order is Subject-Object-Verb (SOV), with relatively free constituent order.
This treebank consists of 73 sentences in total: 53 sentences from the first chapter of the novel "অজান দেশত এলিচ", the Assamese translation of Lewis Carroll's "Alice’s Adventures in Wonderland", and 20 sentences from a news article discussing the guidelines issued by the administration on how to celebrate Durga Puja. The data has been annotated according to the Universal Dependencies guidelines.
The corpus is split contiguously into training, development, and test sets as follows:
| Split | Number of sentences |
|---|---|
| Train | 37 (AiW) + 14 (discourse) |
| Dev | 10 (AiW) + 4 (discourse) |
| Test | 6 (AiW) + 2 (discourse) |
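As a quick sanity check on the table above, the per-split and per-source counts can be added up and compared with the totals stated in the summary (73 sentences: 53 from AiW, 20 from the news discourse). This is only a minimal sketch: the dictionary layout and the function names are illustrative assumptions, and the numbers are copied from the table.

```python
# Sentence counts per split and source, copied from the split table.
# The dictionary layout and function names are illustrative, not part of the treebank.
SPLITS = {
    "train": {"AiW": 37, "discourse": 14},
    "dev": {"AiW": 10, "discourse": 4},
    "test": {"AiW": 6, "discourse": 2},
}


def split_totals(splits):
    """Return the total number of sentences in each split."""
    return {name: sum(parts.values()) for name, parts in splits.items()}


def source_totals(splits):
    """Return the total number of sentences contributed by each source."""
    totals = {}
    for parts in splits.values():
        for source, count in parts.items():
            totals[source] = totals.get(source, 0) + count
    return totals


per_split = split_totals(SPLITS)
per_source = source_totals(SPLITS)
total = sum(per_split.values())

print(per_split)   # {'train': 51, 'dev': 14, 'test': 8}
print(per_source)  # {'AiW': 53, 'discourse': 20}
print(total)       # 73
```

The sums agree with the totals given in the summary: 51 + 14 + 8 = 73 sentences overall.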
Annotation follows the Universal Dependencies v2 guidelines for tokenization, part-of-speech tags, and dependency relations.
Data was collected manually from the first chapter of Alice’s Adventures in Wonderland (Assamese translation, অজান দেশত এলিচ) and from a news article published in the prestigious e-newspaper Asomiya Pratidin.
The treebank was annotated by Kaushik Sengupta. Supervision and revision were carried out by Luigi Talamo, Helena Vaz, Annemarie Verkerk, Andy Dyer, and Adityam Dutta.
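UD treebanks are distributed in the CoNLL-U format, where each token line has ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The sketch below shows one way such annotations could be read with the Python standard library. The function name `read_conllu` is hypothetical, and the sample is a toy English sentence written for illustration only; it is not taken from this treebank.

```python
from collections import Counter

# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]

# Toy English sample for illustration only; NOT taken from the treebank.
SAMPLE = (
    "# sent_id = toy-1\n"
    "# text = Alice sleeps.\n"
    "1\tAlice\tAlice\tPROPN\t_\tNumber=Sing\t2\tnsubj\t_\t_\n"
    "2\tsleeps\tsleep\tVERB\t_\tNumber=Sing|Person=3\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)


def read_conllu(text):
    """Parse CoNLL-U text into a list of sentences, each a list of token dicts."""
    sentences, tokens = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                sentences.append(tokens)
                tokens = []
        elif line.startswith("#"):
            # Sentence-level comments (sent_id, text) are skipped in this sketch.
            continue
        else:
            cols = line.split("\t")
            if len(cols) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
            # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            if "-" in cols[0] or "." in cols[0]:
                continue
            tokens.append(dict(zip(FIELDS, cols)))
    if tokens:
        # Handle a final sentence that is not followed by a blank line.
        sentences.append(tokens)
    return sentences


sentences = read_conllu(SAMPLE)
upos_counts = Counter(tok["upos"] for sent in sentences for tok in sent)
deprels = [tok["deprel"] for tok in sentences[0]]
print(len(sentences), dict(upos_counts), deprels)
```

To apply the same function to the treebank, pass it the contents of one of the split files, opened with `encoding="utf-8"` so that the Assamese script is read correctly.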
In preparation
- 2026-05-15 v2.18
  - Initial release in Universal Dependencies.
=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: UD v2.18
License: CC BY-SA 4.0
Includes text: yes
Parallel: no
Genre: fiction news
Lemmas: manual native
UPOS: manual native
XPOS: not available
Features: manual native
Relations: manual native
Contributors: Sengupta, Kaushik; Talamo, Luigi; Verkerk, Annemarie
Contributing: here
Contact: [email protected]
===============================================================================