Data
Our resources rely on deep lexical analysis of human languages, meticulously deciphering and mapping their linguistic DNA, identifying and categorizing the different elements, and linking them to each other and across multiple languages.
The datasets have been developed over the years by K Dictionaries and published by our partners in various media, serving millions of users around the world.
The content is human-curated, enriched by automatic language generation and supplemented by
morphological word form lists, language and grammar guides, biographical and geographical tables,
phonetic transcription (IPA), alternative scripts, frequency, and vocal pronunciation.
The data are available in XML, JSON and JSON-LD (RDF) formats.
Data Components
Words and Expressions
Inflections and Variants
Translations
Etymology
Senses
Definitions
Disambiguators
Semantic labels
Synonyms
Antonyms
Context
Domain
Usage labels
Range of Application
Register
Geographical region
Sentiment
Grammar
Part of Speech
Gram. Gender
Gram. Number
Subcategorization
Valency
Features
Frequency
Spell check
Geo multilingual table
Geographical entries
Biographical entries
Examples of Usage
Full sentences
Short phrases
Pronunciation
Phonetic transcription
Alternative script
Notes
Extra information on language and grammar
Data Sample (JSON)
{
"id": "DE_DE00019883",
"source": "global",
"language": "de",
"version": 1,
"headword": {
"text": "Schloss",
"pronunciation": {
"value": "ʃlɔs"
},
"pos": "noun",
"gender": "neuter",
"inflections": [
{
"text": "Schlosses",
"number": "singular",
"case": "genitive"
},
{
"text": "Schlösser",
"pronunciation": {
"value": "ˈʃlœsɐ"
},
"number": "plural",
"case": "nominative"
}
]
},
"senses": [
{
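The sample above is cut off just after its `senses` array opens. As a minimal sketch of how the headword portion might be consumed, the snippet below closes off only the fields shown in the sample and lists the inflected forms. The helper name `inflection_table` is hypothetical, and the `senses` array is left out because its contents are not shown here.

```python
import json

# Headword portion of the sample above, closed off so it parses on its own.
# The "senses" array is omitted because the sample is cut short at that point.
ENTRY_JSON = """
{
  "id": "DE_DE00019883",
  "source": "global",
  "language": "de",
  "version": 1,
  "headword": {
    "text": "Schloss",
    "pronunciation": {"value": "ʃlɔs"},
    "pos": "noun",
    "gender": "neuter",
    "inflections": [
      {"text": "Schlosses", "number": "singular", "case": "genitive"},
      {"text": "Schlösser", "pronunciation": {"value": "ˈʃlœsɐ"},
       "number": "plural", "case": "nominative"}
    ]
  }
}
"""


def inflection_table(entry):
    """Return (form, number, case, ipa) tuples for a headword's inflections."""
    rows = []
    for item in entry["headword"].get("inflections", []):
        # Not every inflected form carries its own pronunciation.
        ipa = item.get("pronunciation", {}).get("value")
        rows.append((item["text"], item.get("number"), item.get("case"), ipa))
    return rows


entry = json.loads(ENTRY_JSON)
forms = inflection_table(entry)
for form, number, case, ipa in forms:
    print(form, number, case, ipa or "-")
```

Optional fields are read with `.get()` so that a form without a pronunciation, such as the genitive singular here, does not raise an error.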
In Use In
- language models
- machine translation
- natural language processing
- language learning solutions
- online dictionary websites
- mobile applications
- research and innovation projects
- internship programs
Parallel Corpora: Bilingual and Multilingual Parallel Corpora
Expert parallel corpora for nearly 400 language pairs and numerous multilingual combinations for training Language Models and boosting the performance of Machine Translation engines.
The corpora include bilingual and multilingual segments that consist of corpus-derived, manually curated full sentences and short phrases with their corresponding equivalents in other languages.
These segments are based on dictionary examples of usage, created and refined by expert linguists and translators worldwide to illustrate typical language patterns, both in general language use and in more than 100 vertical domains.
The languages include: Arabic, Chinese (Simplified), Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Italian, Japanese, Korean, Latin, North Sami, Norwegian, Polish, Portuguese (Brazil / Portugal), Russian, Spanish, Swedish, and Turkish.
Parallel Corpora – Multilingual Sample (sport)
| Language | Sentence |
| --- | --- |
| Arabic | تمركز كل المشتركين على خط الانطلاق. |
| Chinese (Simplified) | 所有的参赛者都在起跑线上. |
| Danish | Alle konkurrencedeltagerne står på startlinjen. |
| Dutch | Alle deelnemers staan aan de start. |
| English | All the competitors are on the starting line. |
| French | Tous les concurrents sont sur la ligne de départ. |
| German | Alle Wettstreiter sind auf der Startlinie. |
| Greek | Όλοι οι αθλητές είναι στη γραμμή της αφετηρίας. |
| Hebrew | כל המִתְחָרים עומדים על קו הזינוק. |
| Italian | Tutti i concorrenti sono sulla linea di partenza. |
| Japanese | 全(すべ)ての選手がスタートラインに立(た)っている。 |
| Norwegian | Alle konkurrentene står på startlinjen. |
| Polish | Wszyscy rywale są na linii startu. |
| Portuguese (Brazil) | Todos os competidores estão na linha de partida. |
| Portuguese (Portugal) | Todos os concorrentes estão na linha de partida. |
| Russian | Все уча́стники соревнова́ния собрали́сь на ста́рте. |
| Spanish | Todos los competidores están en la línea de salida. |
| Swedish | Alla deltagarna står på startlinjen. |
| Turkish | Bütün yarışçılar start çizgisinin üstündeler. |
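To illustrate how one multilingual segment can yield several bilingual pairs, here is a minimal sketch that holds a few of the sentences above in a dictionary and enumerates every unordered language pair. The language codes, the dictionary layout and the function name `bilingual_pairs` are assumptions made for this example and do not describe the delivery format of the corpora.

```python
from itertools import combinations

# A few rows of the multilingual sample above, keyed by illustrative language codes.
segment = {
    "en": "All the competitors are on the starting line.",
    "fr": "Tous les concurrents sont sur la ligne de départ.",
    "it": "Tutti i concorrenti sono sulla linea di partenza.",
    "sv": "Alla deltagarna står på startlinjen.",
}


def bilingual_pairs(segment):
    """Yield (lang_a, lang_b, text_a, text_b) for every unordered language pair."""
    for lang_a, lang_b in combinations(sorted(segment), 2):
        yield lang_a, lang_b, segment[lang_a], segment[lang_b]


pairs = list(bilingual_pairs(segment))
for lang_a, lang_b, text_a, text_b in pairs:
    print(f"{lang_a}-{lang_b}: {text_a} | {text_b}")
```

With four languages in the segment this produces six unordered pairs. A translation system that treats each direction separately could use `itertools.permutations` instead.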
Domains
Lexicala datasets classify word senses into more than 100 domains.
Acoustics
Music
Architecture
Cartography
Chemistry
Pharmacology
Culinary
Drinks
Electricity
Energy
Geography
Geology
Grammar
Linguistics
Literature
Publishing
Military
Police
Theology
Religion
Agriculture
Botanics
Environment
Anthropology
Archeology
Philosophy
Culture
History
Politics
Education
School
University
Games
Leisure time & hobbies
Geometry
Mathematics
Statistics
Maritime
Nautical
Oceanography
Mythology
Psychology
Sociology
Journalism
Law
Occupation
Astronomy
Meteorology
Optics
Physics
Clothing
Cosmetics
Dress
Fashion
Radio
Technology
Telephone
Television
Anatomy
Genetics
Health
Medicine
Physiology
Aeronautics
Aviation
Automobiles
Rail
Transportation
Anatomy
Biology
Ecology
Genetics
Physiology
Zoology
Administration
Advertising
Commerce
Economics
Finance
Industry
Marketing
Art
Cinema
Color
Dance
Entertainment
Music
Photography
Theatre
Computers
Data
Electronics
Engineering
Informatics
Internet
IT
Technical
Technology
Telecommunication
Construction
Family
Furniture
Hygiene
Measurements & units
Mechanics
Space
Sport
Time
Tourism
API
Most of our data are available through the Lexicala API.
Our REST API enables flexible search options and returns JSON responses with full dictionary
entries or specific components – featuring syntactic and semantic details, sense definitions and
various disambiguation forms, examples of usage and multiword expressions, translations and
more – allowing easy processing and seamless integration with other applications.
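As a rough sketch of what client-side handling of such JSON responses might look like, the snippet below builds a search URL and extracts headwords from a decoded response. Everything specific in it is an assumption for illustration: the host `api.example.invalid`, the `/search` path, the `text` and `language` parameter names, and the `results` key are placeholders rather than details of the Lexicala API, so consult the API documentation for the real interface. No network request is made.

```python
from urllib.parse import urlencode

# Placeholder host, path and parameter names: assumptions for illustration only,
# not taken from the Lexicala API documentation.
BASE_URL = "https://api.example.invalid"


def build_search_url(text, language, **filters):
    """Build a query URL for a hypothetical search endpoint."""
    params = {"text": text, "language": language, **filters}
    return f"{BASE_URL}/search?{urlencode(params)}"


def headwords(response):
    """Collect headword texts from a decoded JSON response.

    Assumes (hypothetically) a "results" list whose items are shaped like
    the data sample on this page.
    """
    return [item["headword"]["text"] for item in response.get("results", [])]


url = build_search_url("Schloss", "de")
print(url)

# Hand-written stand-in for a decoded response.
mock_response = {
    "results": [{"id": "DE_DE00019883", "headword": {"text": "Schloss"}}]
}
print(headwords(mock_response))
```

Keeping URL construction and response parsing in separate functions lets either one be swapped out once the real endpoint and response layout are known.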
See the Lexicala API page for documentation, registration and access.