2010, Principles, Construction and …
In this paper, we introduce the Filipino wordnet project (FilWordNet). Filipino is the national language of the Philippines, spoken by some 90 million people as a first or second language. However, it has historically had few computational linguistics resources. Creating the Filipino wordnet can be seen as a first step toward enabling a wide range of research projects. We describe our process of building the wordnet, including issues arising from the Filipino language itself, its morphology and its structure.
International Journal of Machine Learning and Computing, 2019
The paper discusses an approach to creating a Filipino WordNet. It uses semi-supervised learning based on Decision Tree and Language Modeling, taking advantage of information found on the web, and is intended to help future NLP researchers working on the Filipino language. The approach uses words from a dictionary as preliminary data and as seeds for a search engine that crawls the WWW. To decide whether a word is part of the Filipino language, the word first passes through the Code-Switching Points Module (CSPD). CSPD scores the word using the frequency counts of word bigrams and unigrams from language models trained on an existing, available corpus. After scoring, the Filipino Stemmer extracts the stem of the word and checks whether the stem belongs to the language. Once the words have been scored and stemmed, the archive evaluates whether each word is Filipino. To test the accuracy of the system, we collected articles from around the web and grouped them into two categories: Plain Filipino and Bilingual. The results show that the F-measure for the Plain Filipino category ranges from 65.65% to 96.85%, with an average of 85.64%, while for the Bilingual category it ranges from 60% to 100%, with an average of 88.17%.
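The scoring step described in the abstract above can be illustrated with a minimal sketch. It is not the authors' CSPD implementation: the class `BigramModel`, the function `looks_filipino`, the add-one smoothing, the interpolation weight and the idea of comparing a Filipino model against an English one are all assumptions made for illustration, and the toy sentences stand in for a real training corpus.

```python
from collections import Counter


class BigramModel:
    """Toy word-level language model built from unigram and bigram counts."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            tokens = sentence.lower().split()
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)
        self.total = sum(self.unigrams.values())

    def unigram_prob(self, word):
        # Add-one smoothing, with one extra slot reserved for unseen words.
        return (self.unigrams[word] + 1) / (self.total + self.vocab_size + 1)

    def bigram_prob(self, prev, word):
        # Add-one smoothed estimate of P(word | prev).
        return (self.bigrams[(prev, word)] + 1) / (
            self.unigrams[prev] + self.vocab_size + 1
        )

    def score(self, prev, word, weight=0.5):
        # Linear interpolation of bigram and unigram estimates (assumed weighting).
        return weight * self.bigram_prob(prev, word) + (1 - weight) * self.unigram_prob(word)


def looks_filipino(prev, word, filipino_lm, english_lm):
    """Hypothetical decision rule: pick the language whose model scores higher."""
    return filipino_lm.score(prev, word) > english_lm.score(prev, word)


# Toy corpora; a real system would train on a large existing corpus.
filipino_lm = BigramModel(["ang bata ay kumain ng mangga", "ang aso ay tumakbo"])
english_lm = BigramModel(["the child ate a mango", "the dog ran"])

print(looks_filipino("ang", "bata", filipino_lm, english_lm))
print(looks_filipino("the", "dog", filipino_lm, english_lm))
```

In the system described above, such a score would be only one signal; the stemmer check and the final evaluation step are separate stages that this sketch does not cover.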
This paper outlines the creation of an open combined semantic lexicon as a resource for the study of lexical semantics in the Malay languages (Malaysian and Indonesian). It is created by combining three earlier wordnets, each built using different resources and approaches: the Malay Wordnet (Lim & Hussein 2006), the Indonesian Wordnet (Riza, Budiono & Hakim 2010) and the Wordnet Bahasa (Nurril Hirfana, Sapuan & Bond 2011). The final wordnet has been validated and extended as part of sense annotation of the Indonesian portion of the NTU Multilingual Corpus (Tan & Bond 2012). The wordnet has over 48,000 concepts and 58,000 words for Indonesian and 38,000 concepts and 45,000 words for Malaysian.
2007
A WordNet is a useful lexical resource where specific senses of words are clustered together into synonym sets, and semantic relationships between these sets are specified. This paper describes an ongoing project to create an Indonesian WordNet using the expand model approach, i.e., by mapping existing WordNet entries to Indonesian word sense definitions. We discuss some issues encountered during the development of a web-based application that facilitates this mapping.
Proceedings of the Twelfth International Conference on Language Resources and Evaluation, 2020
This paper discusses the construction and the ongoing development of the Old Javanese Wordnet. The words were extracted from the digitized version of the Old Javanese-English Dictionary (Zoetmulder, 1982). The wordnet is built using the 'expansion' approach (Vossen, 1998), leveraging the Princeton Wordnet's core synsets and semantic hierarchy, as well as scientific names. The main goal of our project was to produce a high-quality, human-curated resource. As of December 2019, the Old Javanese Wordnet contains 2,054 concepts or synsets and 5,911 senses. It is released under a Creative Commons Attribution 4.0 International License (CC BY 4.0). We are still developing it and adding more synsets and senses. We believe that the lexical data made available by this wordnet will be useful for a variety of future uses, such as the development of a Modern Javanese Wordnet, language processing tasks and linguistic research on Javanese.
2006
This paper reports the current research and development directions of the Portuguese WordNet (WordNet.PT), which mainly concern the enrichment of the WordNet model with event and argument structures (section 1), the codification of cross-part-of-speech relations (section 2) and the exploitation of WordNet.PT in concrete applications (section 3).
2011
This paper outlines the creation of the Wordnet Bahasa as a resource for the study of lexical semantics in the Malay language. It is created by combining information from several lexical resources: the French-English-Malay dictionary FEM, the Kamus Melayu-Inggeris KAMI, and wordnets for English, French and Chinese. Construction went through three steps: (i) automatic building of word candidates; (ii) evaluation and selection of acceptable candidates from merging of lexicons; (iii) final hand check of the 5,000 core synsets. Our Wordnet Bahasa is only in the first phase of building a full-fledged wordnet and needs to be further expanded; however, it is already large enough to be useful for sense tagging both Malay and Indonesian.
Proceedings of the Language, …, 2006
This paper outlines an approach to producing a prototype WordNet system for Malay semi-automatically, using bilingual dictionary data and resources provided by the original English WordNet system. Senses from an English-Malay bilingual dictionary were first aligned to English WordNet senses, and a set of Malay synsets was then derived. Semantic relations between the English WordNet synsets were extracted and re-applied to the Malay synsets, using the aligned synsets as a guide. A small Malay WordNet prototype with 12,429 noun synsets and 5,805 verb synsets was thus produced. This prototype is a first step towards building a full-fledged Malay WordNet.
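The two steps in the abstract above, deriving Malay synsets from aligned senses and re-applying English relations to them, can be sketched as follows. This is a hedged illustration rather than the authors' system: the function names `derive_synsets` and `project_relations`, the synset identifiers and the sample alignments are hypothetical, and the sketch assumes each Malay synset is keyed by the English synset it was aligned to.

```python
from collections import defaultdict


def derive_synsets(sense_alignments):
    """Group Malay words by the English synset their dictionary sense aligns to.

    sense_alignments: iterable of (malay_word, english_synset_id) pairs.
    Returns a dict mapping english_synset_id -> set of Malay words.
    """
    synsets = defaultdict(set)
    for malay_word, english_id in sense_alignments:
        synsets[english_id].add(malay_word)
    return dict(synsets)


def project_relations(english_relations, malay_synsets):
    """Copy a relation only when both of its endpoints have a Malay synset.

    english_relations: iterable of (source_id, relation, target_id) triples.
    Returns a set of triples over the same ids, restricted to covered synsets.
    """
    return {
        (source, relation, target)
        for source, relation, target in english_relations
        if source in malay_synsets and target in malay_synsets
    }


# Hypothetical sample data; the identifiers are made up for illustration.
alignments = [
    ("anjing", "dog.n.01"),
    ("haiwan", "animal.n.01"),
    ("binatang", "animal.n.01"),
]
english_relations = [
    ("dog.n.01", "hypernym", "animal.n.01"),
    ("cat.n.01", "hypernym", "animal.n.01"),  # no Malay synset for cat.n.01 here
]

malay_synsets = derive_synsets(alignments)
print(malay_synsets)
print(project_relations(english_relations, malay_synsets))
```

Dropping relations with an uncovered endpoint keeps the derived network consistent, at the cost of leaving gaps wherever the bilingual dictionary has no aligned sense.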
A project to create a Polish WordNet is under way. Rather than localise the English WordNet, we are constructing the lexical network from scratch, in two phases. First, we have established the linguistic principles, among them a list of semantic relations with detailed diagnostic tests. We have also implemented a client software tool that records the lexicographers' decisions in a central database. A core WordNet, populated with around 10,000 most frequent lexemes in the IPI PAN Corpus, will be a fully functional resource for Natural Language Processing in Polish. In the second phase, the enhanced software tool will detect candidate semantic relations in a much larger corpus, based on statistical methods of grouping words by semantic similarity. Lexicographers will review and approve such candidate relations.
Proceedings of the 7th …, 2009
Journal of Research in Science, Computing and …, 2006
Arxiv preprint cmp-lg/ …, 1998
14th Annual Meeting of the Association for Natural Language Processing, 2008