The Serbian UD treebank is based on the SETimes-SR corpus and additional news documents from the Serbian web.
- Lemmas: automatic pretagging + manual correction (there's no format/annotation style for this)
- UPOS, features: converted from MULTEXT-East
- Dependency realations: automatic preprocessing + manual correction, all directly in UD.
Training data: 80% of pseudorandom documents from SETimes-SR (sentence ids set*) and
13 web news documents (sentence ids news*), comprising 3497 sentences (77,334 tokens).
Development data: 10% of pseudorandom documents from SETimes-SR (sentence ids set*) and
13 web news documents (sentence ids news*), comprising 476 sentences (11,460 tokens).
Test data: 10% of pseudorandom documents from SETimes-SR (sentence ids set*) and
13 web news documents (sentence ids news*), comprising 411 sentences (8,879 tokens).
The corpus is parallel with a subset of UD Croatian-SET. In release 2.4, the Serbian and Croatian treebanks were re-split so that training, development and test sets are compatible (corresponding documents are in the same section in both languages).
-
2025-10-13 v2.16
- add parallel corpus information to machine-readable metadata
- add parallel data support with parallel_id metadata
-
2019-04-30 v2.4
- New data split
- 13 web news documents added
-
2018-04-15 v2.2
- Repository renamed from UD_Serbian to UD_Serbian-SET.
=== Machine-readable metadata ================================================= Data available since: UD v2.1 License: CC BY-SA 4.0 Includes text: yes Parallel: set Genre: news Lemmas: manual native UPOS: converted from manual XPOS: not available Features: converted from manual Relations: manual native Contributors: Samardžić, Tanja; Miletić, Aleksandra; Ljubešić, Nikola Contributing: elsewhere Contact: [email protected] ===============================================================================