Data

Our resources rely on deep lexical analysis of human languages, meticulously deciphering and mapping their linguistic DNA, identifying and categorizing the different elements, and linking them to each other and across multiple languages.

The datasets have been developed over the years by K Dictionaries and published by our partners in various media, serving millions of users around the world.

The content is human-curated, enriched by automatic language generation and supplemented by morphological word form lists, language and grammar guides, biographical and geographical tables, phonetic transcription (IPA), alternative scripts, frequency, and vocal pronunciation.

The data are available in XML, JSON and JSON-LD (RDF) formats.

Data Components

Words and Expressions

Inflections and Variants

Translations

Etymology

Senses

Definitions
Disambiguators

Semantic labels

Synonyms
Antonyms
Context
Domain

Usage labels

Range of Application
Register
Geographical region
Sentiment

Grammar

Part of Speech
Grammatical Gender
Grammatical Number
Subcategorization
Valency

Features

Frequency
Spell check
Geo multilingual table
Geographical entries
Biographical entries

Examples of Usage

Full sentences
Short phrases

Pronunciation

Phonetic transcription
Alternative script

Notes

Extra information on language and grammar

Data Sample (JSON)

				
{
  "id": "DE_DE00019883",
  "source": "global",
  "language": "de",
  "version": 1,
  "headword": {
    "text": "Schloss",
    "pronunciation": {
      "value": "ʃlɔs"
    },
    "pos": "noun",
    "gender": "neuter",
    "inflections": [
      {
        "text": "Schlosses",
        "number": "singular",
        "case": "genitive"
      },
      {
        "text": "Schlösser",
        "pronunciation": {
          "value": "ˈʃlœsɐ"
        },
        "number": "plural",
        "case": "nominative"
      }
    ]
  },
  "senses": [
    {    
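
As a minimal sketch of how such an entry could be consumed, the Python snippet below parses the fields visible in the sample above and lists the headword with its inflected forms. The sample is truncated at "senses", so the snippet uses only the headword portion, closed off by hand so that it is valid JSON. The helper name `describe_headword` is an assumption made for illustration and is not part of any Lexicala tooling.

```python
import json

# Headword portion of the sample entry above; the truncated "senses"
# array is omitted so that the fragment is valid JSON.
ENTRY_JSON = """
{
  "id": "DE_DE00019883",
  "source": "global",
  "language": "de",
  "version": 1,
  "headword": {
    "text": "Schloss",
    "pronunciation": {"value": "ʃlɔs"},
    "pos": "noun",
    "gender": "neuter",
    "inflections": [
      {"text": "Schlosses", "number": "singular", "case": "genitive"},
      {"text": "Schlösser", "pronunciation": {"value": "ˈʃlœsɐ"},
       "number": "plural", "case": "nominative"}
    ]
  }
}
"""


def describe_headword(entry: dict) -> list[str]:
    """Return one readable line for the headword and one per inflection."""
    head = entry["headword"]
    lines = [f'{head["text"]} ({head["pos"]}, {head["gender"]})']
    for form in head.get("inflections", []):
        # Pronunciation is optional on inflections in the sample.
        ipa = form.get("pronunciation", {}).get("value")
        suffix = f" /{ipa}/" if ipa else ""
        lines.append(f'  {form["text"]}: {form["case"]} {form["number"]}{suffix}')
    return lines


entry = json.loads(ENTRY_JSON)
for line in describe_headword(entry):
    print(line)
```

Optional fields are read with `.get()` because, in the sample, only one of the two inflections carries a pronunciation.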
				
			

Used In

  • language models
  • machine translation
  • natural language processing
  • language learning solutions
  • online dictionary websites
  • mobile applications
  • research and innovation projects
  • internship programs

Parallel Corpora: Bilingual and Multilingual Parallel Corpora

Expert parallel corpora for nearly 400 language pairs and numerous multilingual combinations for training Language Models and boosting the performance of Machine Translation engines.


The corpora include bilingual and multilingual segments that consist of corpus-derived, manually curated full sentences and short phrases with their corresponding equivalents in other languages.


These segments are based on dictionary examples of usage, created and refined by expert linguists and translators worldwide to illustrate typical language patterns, covering general language use and more than 100 vertical domains.


The languages include: Arabic, Chinese (Simplified), Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Italian, Japanese, Korean, Latin, North Sami, Norwegian, Polish, Portuguese (Brazil / Portugal), Russian, Spanish, Swedish, and Turkish.


In addition to general language vocabulary, there are segments for more than one hundred vertical domains.

Parallel Corpora – Multilingual Sample (sport)

Arabic   تمركز كل المشتركين على خط الانطلاق.
Chinese S.   所有的参赛者都在起跑线上.
Danish   Alle konkurrencedeltagerne står på startlinjen.
Dutch   Alle deelnemers staan aan de start.
English   All the competitors are on the starting line.
French   Tous les concurrents sont sur la ligne de départ.
German   Alle Wettstreiter sind auf der Startlinie.
Greek   Όλοι οι αθλητές είναι στη γραμμή της αφετηρίας.
Hebrew   כל המִתְחָרים עומדים על קו הזינוק.
Italian   Tutti i concorrenti sono sulla linea di partenza.
Japanese   全(すべ)ての選手がスタートラインに立(た)っている。
Norwegian   Alle konkurrentene står på startlinjen.
Polish   Wszyscy rywale są na linii startu.
Portuguese Br.   Todos os competidores estão na linha de partida.
Portuguese Pt.   Todos os concurrentes estão na linha de partida.
Russian   Все уча́стники соревнова́ния собрали́сь на ста́рте.
Spanish   Todos los competidores están en la linea de salída.
Swedish   Alla deltagarna står på startlinjen.
Turkish   Bütün yarışçılar start çizgisinin üstündeler.
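
To illustrate how a multilingual segment like the one above might feed machine translation training, the following Python sketch expands a single segment into bilingual pairs. The dictionary layout, the language codes used as keys, and the function name `bilingual_pairs` are assumptions made for illustration; they do not describe the actual delivery format of the corpora.

```python
from itertools import combinations

# Three of the sample sentences above, keyed by assumed ISO 639-1 codes.
segment = {
    "en": "All the competitors are on the starting line.",
    "fr": "Tous les concurrents sont sur la ligne de départ.",
    "de": "Alle Wettstreiter sind auf der Startlinie.",
}


def bilingual_pairs(segment: dict[str, str]) -> list[tuple[str, str, str, str]]:
    """Expand one multilingual segment into (src_lang, tgt_lang, src, tgt) tuples.

    Each unordered language pair is emitted once, in sorted language-code order.
    """
    pairs = []
    for src, tgt in combinations(sorted(segment), 2):
        pairs.append((src, tgt, segment[src], segment[tgt]))
    return pairs


for src, tgt, src_text, tgt_text in bilingual_pairs(segment):
    print(f"{src}-{tgt}\t{src_text}\t{tgt_text}")
```

A segment covering n languages yields n(n-1)/2 unordered pairs, so the three sentences used here produce three pairs.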

Domains

Lexicala datasets classify word senses into more than 100 domains.


Acoustics
Music


Architecture
Cartography


Chemistry
Pharmacology


Culinary
Drinks


Electricity
Energy


Geography
Geology


Grammar
Linguistics


Literature
Publishing


Military
Police


Theology
Religion


Agriculture
Botanics
Environment


Anthropology
Archeology
Philosophy


Culture
History
Politics


Education
School
University


Games
Leisure time & hobbies


Geometry
Mathematics
Statistics


Maritime
Nautical
Oceanography


Mythology
Psychology
Sociology


Journalism
Law
Occupation


Astronomy
Meteorology
Optics
Physics


Clothing
Cosmetics
Dress
Fashion


Radio
Technology
Telephone
Television


Anatomy
Genetics
Health
Medicine
Physiology


Aeronautics
Aviation
Automobiles
Rail
Transportation


Anatomy
Biology
Ecology
Genetics
Physiology
Zoology


Administration
Advertising
Commerce
Economics
Finance
Industry
Marketing


Art
Cinema
Color
Dance
Entertainment
Music
Photography
Theatre


Computers
Data
Electronics
Engineering
Informatics
Internet
IT
Technical
Technology
Telecommunication


Construction
Family
Furniture
Hygiene
Measurements & units
Mechanics

Space
Sport
Time
Tourism


API

Most of our data are available through the Lexicala API.

Our REST API enables flexible search options and returns JSON responses with full dictionary entries or specific components – featuring syntactic and semantic details, sense definitions and various disambiguation forms, examples of usage and multiword expressions, translations and more – allowing easy processing and seamless integration with other applications.
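
A request against a REST search endpoint of this kind could be assembled as in the Python sketch below. The base URL `https://api.example.com`, the `/search` path, and the parameter names `language` and `text` are placeholders chosen for illustration, not the documented Lexicala API; the API documentation remains the authority on actual endpoints, parameters, and authentication.

```python
from urllib.parse import urlencode

# Placeholder base URL; NOT the real Lexicala endpoint.
BASE_URL = "https://api.example.com"


def build_search_url(text: str, language: str) -> str:
    """Build a search URL for a headword (hypothetical path and parameter names)."""
    # urlencode percent-encodes non-ASCII characters as UTF-8.
    query = urlencode({"language": language, "text": text})
    return f"{BASE_URL}/search?{query}"


url = build_search_url("Schloss", "de")
print(url)
```

Building the query with `urlencode` rather than string concatenation keeps headwords with umlauts or other non-ASCII characters safely encoded.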

For the API documentation, registration and access, click below.