Phonetic matching plays an important role in multilingual information retrieval, where data is maintained in multiple languages and users need information in their native language, which may differ from the language in which the data is stored. In such an environment, we need a system that matches strings phonetically in every case. Once strings match, we can retrieve the information irrespective of language. In this paper, we propose an approach that matches strings in Hindi, in Marathi, or across the two languages in order to retrieve information. We compared the proposed method with the Soundex and Q-gram methods and obtained better results than both.
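As a point of reference for the Q-gram baseline mentioned above, a q-gram comparison scores two strings by the overlap of their character n-grams. The sketch below is a minimal, generic version with an assumed bigram size, padding character, and Dice-style scoring; it is not the exact baseline configuration used in the paper.

```python
# A minimal Q-gram (bigram) similarity sketch. The padding character,
# gram size, and Dice-style scoring are illustrative choices, not the
# exact baseline used in the paper.

def qgrams(text, q=2, pad="#"):
    """Return the multiset of q-grams of a padded string."""
    padded = pad * (q - 1) + text + pad * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

def qgram_similarity(a, b, q=2):
    """Dice-style overlap of the two q-gram multisets (0.0 to 1.0)."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    common, remaining = 0, list(grams_b)
    for g in grams_a:
        if g in remaining:
            remaining.remove(g)
            common += 1
    return 2.0 * common / (len(grams_a) + len(grams_b))

# Example: two Devanagari spellings of the same name.
print(qgram_similarity("कविता", "कवीता"))   # high overlap despite the vowel change
```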
2014
In a system with a large database, there has always been the problem that names may not be spelled correctly or may not be spelled in the way one expects, so the data in the database degrades. In this case it is necessary to find the duplicates and merge them into a single entity. One problem in doing so is deciding how the strings should be compared: rather than looking for an exact match, approximate string matching is preferable. One such technique is phonetic matching, which compares names based on the pronunciation of the words. Similar-sounding words can be retrieved from a large database using different phonetic matching algorithms, the best known of which is the Soundex algorithm. Phonetic matching is needed when many people from different cultures come together: they either speak with different pronunciations or have different writing habits. This scenario is very common in India, as we have many different languages.
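For reference, the classic English Soundex encoder mentioned above can be sketched in a few lines: keep the first letter, map the remaining consonants to digit classes, collapse repeated codes, and pad or truncate to four characters. This is a simplified illustration (the special H/W separator rule is omitted), not the variant used by any particular paper.

```python
# A minimal sketch of classic (English) Soundex: keep the first letter,
# map remaining consonants to digit classes, drop repeats and vowels,
# and pad/truncate to four characters.

SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}

def soundex(name):
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    first, code, prev = name[0], [], SOUNDEX_CODES.get(name[0], "")
    for ch in name[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code.append(digit)
        # Vowels (and, in this simplified version, H and W) reset the
        # previous digit, so repeated codes separated by a vowel survive.
        prev = digit
    return (first + "".join(code) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))   # both R163
print(soundex("Ashcraft"))  # A226 here; the full H/W rule set gives A261
```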
International Journal of …
In a search engine, query optimization plays a major role in returning relevant results. User queries mostly contain named entities; not only names but many other words are frequently used as search criteria in information retrieval and identity matching systems in Odia. Names normally have several variations, and these variations and errors make exact string matching problematic. If all the variations are matched approximately, the results can be more relevant. In this paper we put forward an automatic approximate matching technique by which all variations sharing the phonetic code of the query word can be found, giving better results. Our algorithm is based on a phonemic encoding of the query words, which yields more relevant results for the desired search.
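Retrieval with a phonemic code of this kind usually amounts to indexing every stored name under its code, so that all spelling variants of a query collapse into one bucket. The sketch below shows that lookup pattern with a deliberately trivial placeholder encoder; the Odia-specific encoding rules of the paper are not reproduced here.

```python
# Sketch of code-bucket retrieval: every stored name is indexed under its
# phonetic code, so a query matches all names that encode the same way.
# `toy_code` is a placeholder encoder (first letter + length class), NOT
# the Odia phonemic encoding described in the paper.
from collections import defaultdict

def toy_code(name):
    name = name.strip().lower()
    return (name[:1] + str(min(len(name), 9))) if name else ""

class PhoneticIndex:
    def __init__(self, encoder=toy_code):
        self.encoder = encoder
        self.buckets = defaultdict(set)

    def add(self, name):
        self.buckets[self.encoder(name)].add(name)

    def lookup(self, query):
        return sorted(self.buckets.get(self.encoder(query), set()))

index = PhoneticIndex()
for name in ["saurav", "sourav", "sourabh", "gaurav"]:
    index.add(name)
print(index.lookup("sourav"))   # names that share the query's code
```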
Global Journal of Enterprise Information System, 2017
In any digitization program, reproducing handwritten demographic data is a challenging job, particularly for records from previous decades. Nowadays, digitizing an individual's past records has become essential: in areas such as financial inclusion, border security, driving licenses, passport issuance, weapon licenses, banking, health care, and social welfare benefits, the individual's earlier case history is a mandatory part of the decision-making process. Documents are scanned and stored in a systematic way; each scanned document is tagged with a proper key and is retrieved with the help of that key for data entry through a software program or package. The difficulty is that the data, particularly critical personal data such as name and father's name, may not be legible, so data entry operators type the characters as per their understanding. ...
2019
Searching a top-10 search engine for "ગાંધીજી" or "ગન્ધીજી" gives results that differ surprisingly widely from one engine to another, even though both strings are correct in the Gujarati language. A string similarity algorithm is therefore useful for text mining applications when generating an index, saving both space and time. Basic string similarity compares the strings character by character, but this may not give accurate results for the highly rich Gujarati language, whose writing styles vary in matras, reph, vatu, and diacritics on simple and compound letters. GUJSIM (GUJarati SIMilarity) is a hybrid algorithm for string similarity in Gujarati. Here, the author compares 70 string pairs and shows that GUJSIM gives the best percentage score. The algorithm also helps reduce the size of an index built on unique strings.
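One common way to make character-level comparison less sensitive to matras and diacritics is to blend a raw character-sequence score with a score computed after combining marks are stripped. The sketch below illustrates that idea in generic Unicode terms; it is not the published GUJSIM procedure, and the mark-stripping step and the 0.6/0.4 weighting are assumptions for illustration.

```python
# Illustrative hybrid similarity for a matra/diacritic-rich script:
# blend a raw character-sequence score with a score computed after
# combining marks (matras, anusvara, virama, etc.) are stripped.
# This is NOT the published GUJSIM procedure; the weighting and the
# mark-stripping step are assumptions.
import unicodedata
from difflib import SequenceMatcher

def strip_marks(text):
    """Drop Unicode combining marks (categories Mn/Mc/Me), keeping base letters."""
    return "".join(ch for ch in unicodedata.normalize("NFD", text)
                   if not unicodedata.category(ch).startswith("M"))

def seq_ratio(a, b):
    return SequenceMatcher(None, a, b).ratio()

def hybrid_similarity(a, b, base_weight=0.6):
    return (base_weight * seq_ratio(strip_marks(a), strip_marks(b))
            + (1 - base_weight) * seq_ratio(a, b))

# The two spellings score as fairly similar even though the nasal is
# written differently (anusvara vs. explicit conjunct).
print(hybrid_similarity("ગાંધીજી", "ગન્ધીજી"))
```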
Proc. V International Conf. Natural Language Processing (KBCS 2004), 2004
This paper proposes a novel approach to multilingual query processing: a phonetic-distance-based measure for searching proper-name data in Indian-language scripts. The system allows queries in the language of the user's choice, and a cross-lingual search is conducted with the query in one language and the documents being searched in another. Grapheme-to-phoneme converters convert the user's query into an intermediate, language-independent common ground (CG) representation. A dynamic time warping algorithm, in which the substitution cost is based on a weighted phonetic distance measure, is used to match and rank the query results. A phoneme-to-grapheme converter then converts the search results from the CG representation back into the user's query language. We also discuss in detail the various issues particular to cross-lingual search on proper-name data and address them using the proposed approach.
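The matching step can be pictured as dynamic time warping over two phoneme sequences, with the local cost taken from a phonetic distance table. The sketch below uses a tiny, invented distance table and plain Latin phone labels rather than the paper's CG representation or learned weights; it illustrates the shape of the algorithm, not its exact costs.

```python
# Sketch of DTW over phoneme sequences with a weighted phonetic
# substitution cost. The tiny distance table is a placeholder assumption,
# not the paper's common-ground (CG) representation or learned weights.

PHONE_DISTANCE = {
    ("p", "b"): 0.2, ("t", "d"): 0.2, ("k", "g"): 0.2,   # voicing pairs: cheap
    ("a", "aa"): 0.1, ("i", "ii"): 0.1,                  # vowel length: cheaper
}

def phone_cost(x, y):
    if x == y:
        return 0.0
    return PHONE_DISTANCE.get((x, y), PHONE_DISTANCE.get((y, x), 1.0))

def dtw(seq_a, seq_b):
    """Classical DTW: cumulative cost of the best warping path."""
    n, m = len(seq_a), len(seq_b)
    acc = [[float("inf")] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = phone_cost(seq_a[i - 1], seq_b[j - 1])
            acc[i][j] = local + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]

# Two plausible phonetizations of the same name from different scripts rank
# closer than an unrelated name (the ranking, not the absolute value, matters).
print(dtw(["g", "ii", "t", "aa"], ["g", "i", "t", "a"]))       # small
print(dtw(["g", "ii", "t", "aa"], ["m", "o", "h", "a", "n"]))  # larger
```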
2012
This paper introduces a set of Japanese phonetic matching functions for the open-source relational database PostgreSQL. Phonetic matching allows a search system to locate approximate strings according to the sound of a term; this sort of approximate string matching is often referred to as fuzzy string matching in the open-source community. The approach has been well studied for English and other European languages, and open-source packages for those languages are readily available.
2005
We present a phonetic encoding for Bangla that can be used by spelling checkers, transliteration, name searching applications, and cross-lingual information retrieval to drastically improve quality. The complex, and often inconsistent, rules of Bangla words present a significant challenge in producing a proper phonetic code. We propose a phonetic encoding for Bangla that takes into account the various context-sensitive rules, including those involving the large repertoire of conjuncts in Bangla.
2016 Conference of The Oriental Chapter of International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques (O-COCOSDA), 2016
Automatic speech recognition (ASR) and text-to-speech (TTS) are two prominent areas of research in human-computer interaction nowadays. A set of phonetically rich sentences is important for developing these two interactive modules of HCI; essentially, the set has to cover all possible phone units, distributed uniformly. Selecting such a set from a big corpus while maintaining similarity of phonetic characteristics is still a challenging problem. The major objective of this paper is to devise a criterion for selecting a set of sentences that covers all phonetic aspects of a corpus with as small a size as possible. First, the paper presents a statistical analysis of Hindi phonetics based on structural characteristics. A two-stage algorithm is then proposed to extract phonetically rich sentences with a high variety of triphones from the EMILLE Hindi corpus. The algorithm uses a distance-measuring criterion to select each sentence so as to improve the triphone distribution. Moreover, a special preprocessing method is proposed to score each triphone by its inverse probability in order to speed up the algorithm. The results show that the approach efficiently builds a uniformly distributed, phonetically rich corpus with an optimal number of sentences.
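The core of such a selection procedure can be sketched as a greedy loop: score each remaining sentence by the rarity (inverse probability) of the triphones it would add, pick the best, update the coverage, and repeat until enough triphones are covered. The scoring formula and stopping rule below are simplified assumptions, not the paper's exact two-stage criterion.

```python
# Greedy sketch of phonetically-rich sentence selection: prefer sentences
# whose uncovered triphones are rare in the corpus (inverse-probability
# scoring). Simplified assumptions; not the paper's exact two-stage method.
from collections import Counter

def triphones(phones):
    return [tuple(phones[i:i + 3]) for i in range(len(phones) - 2)]

def select_sentences(corpus, target_coverage=0.95):
    """corpus: list of (sentence_text, phone_sequence) pairs."""
    counts = Counter(t for _, phones in corpus for t in triphones(phones))
    total = sum(counts.values())
    all_tris = set(counts)
    covered, chosen = set(), []
    remaining = list(corpus)
    while remaining and len(covered) < target_coverage * len(all_tris):
        def score(item):
            new = set(triphones(item[1])) - covered
            # rare triphones (low corpus probability) contribute more
            return sum(total / counts[t] for t in new)
        best = max(remaining, key=score)
        if score(best) == 0:
            break
        chosen.append(best[0])
        covered |= set(triphones(best[1]))
        remaining.remove(best)
    return chosen

corpus = [
    ("sentence one", ["s", "a", "n", "t", "e", "n", "s"]),
    ("sentence two", ["t", "u", "p", "a", "t", "h"]),
    ("sentence three", ["s", "a", "n", "d", "h", "i"]),
]
print(select_sentences(corpus))
```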
International Journal of Artificial Intelligence & Applications
Most digital information is accessible only to the minority of people who can read or understand a particular language. A corpus is the basis for developing speech synthesis and recognition systems, and in India almost all speech research and development groups build their own speech corpora for Hindi, the first language of more than 200 million people. The primary goal of this paper is to review the speech corpora created by various institutes and organizations so that scientists and language technologists can recognize the crucial role of corpus development in building ASR and TTS systems. The aim is to bring together all the information related to the recording, volume, and quality of speech data in these corpora to facilitate the work of researchers in speech recognition and synthesis. The paper also describes the development, in our organization, of a medium-size database for a metro rail passenger information system using an HMM-based technique, with the phoneme as the basic speech unit. The resulting database consists of 630 utterances with 12,614 words and 11,572 phoneme tokens covering 38 phonemes, and it covers the maximum possible phonetic context.
String matching is the common way of finding items in a textual database. Because of the way people write some types of text, such as names, plain string matching may not be sufficient, and other mechanisms such as phonetic matching need to be included if we want an efficient matching scheme. Phonetic matching is defined as a method of identifying a set of strings that are likely to sound similar to a given keyword. The problem of writing words (e.g. names) differently is common to all languages, so there is a need for a system that matches terms phonetically regardless of the type of error introduced. Many kinds of errors and variations can be considered; here we refer to typographical and spelling errors, which differ in vowels and in the matching of consonants. In this paper, we present a phonetic algorithm developed for the isiXhosa language that matches terms written in isiXhosa by approximate matching based on their sound. The algorithm is developed on the principles of the Soundex algorithm.
2006
In this research, the concept of traditional Bangla word matching is replaced by partial matching based on pronunciation errors. The authors use specially designed databases and a set of rules to analyze Bangla words; the rules are based on the Bangla pronunciation rules given by the Bangla Academy. As an outcome of this research, it is possible to find a Bangla word in a text even when the query word is spelled differently or misspelled. Bangla vowels are analyzed successfully.
Proceedings of the 5th International Conference on Data Management Technologies and Applications, 2016
Researchers confront major problems while searching for various kinds of data in a large, imprecise database, because entries are not spelled correctly or in the way they were expected to be spelled; as a result, the word being looked for cannot be found. Relying on the pronunciation of words has long been considered one of the effective practices for solving this problem, and the technique used to retrieve words based on sound is known as "phonetic matching". Soundex was the first algorithm proposed, and others such as Metaphone, Caverphone, DMetaphone, and Phonex have also been used for information retrieval in different environments. This paper deals with the analysis and evaluation of different phonetic matching algorithms on several datasets comprising street names of North Carolina and English dictionary words. The analysis shows that there is no single best technique in general: Metaphone performs best for English dictionary words, while NYSIIS performs better for the street-name datasets. Although Soundex has high accuracy in correcting misspelled words compared to the other algorithms, it has lower precision due to more noise in the considered setting. The experimental results lead to some suggestions that would help make databases more concrete and achieve higher data quality.
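Evaluations of this kind typically treat each phonetic algorithm as a retrieval function and score it with precision and recall against a gold list of correct matches for each query. The sketch below shows such a scoring harness; the toy encoder and sample data are placeholders, not the paper's algorithms or datasets.

```python
# Precision/recall harness for comparing phonetic matchers on a test set.
# `encoder` is any function mapping a word to a phonetic code; the sample
# data and the toy length-based encoder are placeholders, not the paper's.

def retrieve(query, lexicon, encoder):
    code = encoder(query)
    return {word for word in lexicon if encoder(word) == code}

def precision_recall(query, gold_matches, lexicon, encoder):
    retrieved = retrieve(query, lexicon, encoder)
    if not retrieved:
        return 0.0, 0.0
    tp = len(retrieved & gold_matches)
    return tp / len(retrieved), tp / len(gold_matches)

def toy_encoder(word):
    return word[0].lower() + str(len(word) // 3)   # placeholder, not Soundex

lexicon = {"smith", "smyth", "smythe", "smart", "stone"}
gold = {"smith", "smyth", "smythe"}
print(precision_recall("smith", gold, lexicon, toy_encoder))
```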
International Journal of Intelligent Information Processing, 2011
This study aims to develop a phonetic similarity measurement method across Asian languages. The method, a cross-language similarity algorithm, aggregates the transcription of language-specific Romanization, the International Phonetic Alphabet, the Soundex algorithm, and Levenshtein distance. To evaluate the proposed algorithm, the study reports an experiment using ninety-two chemical element names in nine different languages. Similarity scores were calculated between a source language and each target language, and a threshold could be drawn that separates the scores in each language into two groups (phonetic and semantic adoption). After evaluating precision, recall, and F-measure, the results show that the proposed methodology successfully differentiates between the phonetic and semantic groups by allocating thresholds in all the Asian languages except Chinese. The results demonstrate that the proposed method has the potential to be applied to cross-language information retrieval and various linguistic studies.
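The aggregation described can be pictured as blending two signals: agreement of phonetic codes and a normalized Levenshtein similarity, with a threshold separating phonetic from semantic adoptions. The sketch below assumes romanized input, equal weights, a stand-in encoder, and an illustrative threshold; the paper's Romanization and IPA steps are not reproduced.

```python
# Sketch of an aggregated similarity: agreement of phonetic codes blended
# with a normalized Levenshtein similarity, then compared to a threshold.
# The stand-in encoder, equal weights, and the 0.7 threshold are assumptions.

def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def lev_similarity(a, b):
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest

def aggregated_similarity(a, b, code_fn, weight=0.5):
    code_match = 1.0 if code_fn(a) == code_fn(b) else 0.0
    return weight * code_match + (1 - weight) * lev_similarity(a, b)

def toy_code(word):
    return word[:1].upper()   # stand-in for a real Soundex-style encoder

# Names scoring above a threshold (say 0.7) would fall in the "phonetic
# adoption" group; those below it in the "semantic adoption" group.
print(aggregated_similarity("helium", "heliyamu", toy_code))
```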
In spite of South Asia being one of the richest areas in terms of linguistic diversity, South Asian languages have a lot in common. For example, most of the major Indian languages use scripts derived from the ancient Brahmi script, have more or less the same arrangement of the alphabet, are highly phonetic in nature, and are very well organised. We have used this fact to build a computational phonetic model of Brahmi-origin scripts. The phonetic model consists mainly of a model of phonology (including some orthographic features) based on a common alphabet for these scripts, numerical values assigned to the features, a stepped distance function (SDF), and an algorithm for aligning strings of feature vectors. The SDF is used to calculate the phonetic and orthographic similarity of two letters. The model can be used for applications such as spell checking, predicting spelling and dialectal variation, text normalization, finding rhyming words, and identifying cognate words across languages. Some initial experiments have been carried out, and the results seem encouraging.
IJERA
India has 22 officially recognized languages, including Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Tamil, Telugu, and Urdu, so India clearly faces the language diversity problem. In the age of the Internet, this multiplicity of languages makes sophisticated natural language processing systems even more necessary. In this paper we develop a phonetic dictionary for natural language processing, particularly for Kannada. Phonetics is the scientific study of speech sounds; acoustic phonetics studies the physical properties of sounds and provides a vocabulary for distinguishing one sound from another in quality and quantity. Kannada is one of the major Dravidian languages of India. The language uses forty-nine phonemic letters, divided into three groups: Swaragalu (thirteen letters), Yogavaahakagalu (two letters), and Vyanjanagalu (thirty-four letters), roughly corresponding to the vowels and consonants of English.
International Journal of Advanced Computer Science and Applications, 2020
Semantic coexistence is one reason people adopt a language spoken by others. In such multilingual habitats, different languages share words, typically known as loan words, which appear not only as a principal means of enriching a language's vocabulary but also as a way of creating mutual influence, building stronger relationships, and forming multilingualism. In this context the spoken words are usually common, but their writing scripts vary, or the language may have become a digraphia. In this paper we present the similarities and relatedness between Hindi and Urdu, which are mutually intelligible major languages of the Indian subcontinent. In general, the method modifies edit distance: instead of using the letters of the words, it uses articulatory features from the International Phonetic Alphabet (IPA) to compute a phonetic edit distance. The paper also reports results for the languages considered under the method, quantifying the evidence that Urdu and Hindi are 67.8% similar on average despite the script differences.
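The modification described, replacing symbol identity with articulatory-feature comparison inside edit distance, can be sketched as follows. The tiny feature table, the uniform per-feature weighting, and the example phones are illustrative assumptions, not the IPA feature inventory or weights used in the paper.

```python
# Edit distance where the substitution cost is the fraction of articulatory
# features two phones disagree on. The feature table and phones here are
# illustrative placeholders, not the paper's IPA-based inventory.

FEATURES = {
    # phone: (place, manner, voiced)
    "p": ("bilabial", "stop", False),
    "b": ("bilabial", "stop", True),
    "t": ("dental", "stop", False),
    "d": ("dental", "stop", True),
    "s": ("alveolar", "fricative", False),
    "z": ("alveolar", "fricative", True),
}

def feature_cost(x, y):
    if x == y:
        return 0.0
    fx, fy = FEATURES.get(x), FEATURES.get(y)
    if fx is None or fy is None:
        return 1.0                        # unknown phones: full cost
    diffs = sum(1 for a, b in zip(fx, fy) if a != b)
    return diffs / len(fx)

def phonetic_edit_distance(seq_a, seq_b, gap=1.0):
    n, m = len(seq_a), len(seq_b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * gap
    for j in range(1, m + 1):
        d[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + feature_cost(seq_a[i - 1], seq_b[j - 1]),
                d[i - 1][j] + gap,
                d[i][j - 1] + gap,
            )
    return d[n][m]

# "b"/"p" and "t"/"d" differ only in voicing, so the distance stays small.
print(phonetic_edit_distance(["b", "t"], ["p", "d"]))   # 2/3 under this table
```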
Information Systems for Indian …, 2011
2016
Researchers confront major problems while searching for various kinds of data in large, imprecise databases, because entries are not spelled correctly or in the way they were expected to be spelled; as a result, the word being sought cannot be found. Relying on the pronunciation of words has long been considered one of the effective practices for solving this problem, and the technique used to retrieve words based on sound is known as "phonetic matching". Soundex was the first algorithm developed, and others such as Metaphone, Caverphone, DMetaphone, and Phonex are also used for information retrieval in different environments. This project deals mainly with the analysis and implementation of the newly proposed Meta-Soundex algorithm for English and Spanish, which retrieves suggestions for misspelled words. The Meta-Soundex algorithm addresses the limitations of the Metaphone and Soundex algorithms: it has higher accuracy than both Soundex and Metaphone, and higher precision than Soundex, thus reducing noise in the considered setting. A phonetic matching toolkit is also developed, bundling the different phonetic matching algorithms along with the new Meta-Soundex algorithm for both Spanish and English.
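One plausible reading of a Meta-Soundex style combination is to normalize a word with Metaphone first and then encode the Metaphone key with Soundex, so that Metaphone's consonant handling feeds Soundex's compact digit code. The sketch below chains the `metaphone` and `soundex` functions of the jellyfish library under that assumption; it is not necessarily the authors' exact procedure, and the Spanish-specific handling is omitted.

```python
# A hedged sketch of a Meta-Soundex style encoder: run Metaphone first,
# then Soundex-encode the Metaphone key. This chaining is an assumption
# about how the combination might work, not the paper's exact algorithm,
# and it is English-only (the Spanish handling is omitted).
import jellyfish  # pip install jellyfish

def meta_soundex(word):
    meta_key = jellyfish.metaphone(word)
    return jellyfish.soundex(meta_key) if meta_key else ""

# Compare how plain Soundex, Metaphone, and the chained code group spellings.
for word in ["night", "knight", "nite"]:
    print(word, jellyfish.soundex(word), jellyfish.metaphone(word), meta_soundex(word))
```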