This is the era of information technology, and what matters most today is getting the right information at the right time. More and more data repositories are being made available online, and information retrieval systems, or search engines, are used to access the electronic information available on the internet. These systems depend on the available tools and techniques for efficiently retrieving content in response to a user's query. Over the last few years, a wide range of information in Indian regional languages such as Hindi, Urdu, Bengali, Oriya, Tamil and Telugu has been made available on the web as e-data. However, these repositories are accessed very little because few efficient search engines or retrieval systems support these languages. We have developed a language-independent system that facilitates efficient retrieval of information in Urdu and can be used for other languages as well. The system achieves a precision of 0.63 and a recall of 0.8.
Indexing techniques are used to improve the retrieval of data in response to a search condition, and inverted files are the most common way of creating indexes. This paper proposes an indexing technique for Urdu. The language-processing step of index creation differs from language to language, so we discuss the index creation steps specifically for Urdu. We explore the morphological rules of Urdu and implement them to create an Urdu stemmer. We implement the proposed technique in several ways and compare the results. We suggest that indexes should be created without stop words and that the index file should be an ordered index file.
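As a rough illustration of the indexing idea described above, the following Python sketch builds an inverted file that skips stop words and keeps both terms and postings in sorted order. It is not the paper's implementation: the function name `build_inverted_index`, the tiny English stop-word set, the whitespace tokenizer and the identity `stem` function are all assumptions standing in for an Urdu stop-word list, tokenizer and stemmer.

```python
from collections import defaultdict

# Hypothetical placeholder stop words; a real Urdu list would replace these.
STOP_WORDS = {"the", "of", "in"}


def build_inverted_index(docs, stop_words=STOP_WORDS, stem=lambda token: token):
    """Map each term to the sorted list of ids of the documents containing it.

    `docs` is a dict of {doc_id: text}. `stem` is a stand-in for a
    language-specific stemmer; by default it returns the token unchanged.
    """
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Whitespace tokenization is a simplification for this sketch.
        for token in text.lower().split():
            if token in stop_words:
                continue  # stop words are left out of the index
            postings[stem(token)].add(doc_id)
    # Sorting terms and postings gives an ordered index file.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


docs = {
    1: "the history of urdu poetry",
    2: "urdu poetry in the modern era",
}
index = build_inverted_index(docs)
for term, doc_ids in index.items():
    print(term, doc_ids)
```

Keeping the terms sorted allows a lookup by binary search once the index is written to disk, which is one plausible reading of the ordered index file the abstract recommends.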
ETRI Journal, 2021
Urdu belongs to the Perso-Arabic cluster of languages [1] and is mainly composed of words from Arabic, Persian, and Sanskrit. It is the national language of Pakistan and has over 300 million speakers spread worldwide, with a large portion of this population residing in the Indian subcontinent [2,3]. Urdu was initially derived from the Perso-Arabic script of Iran, is written from right to left like Arabic or Persian, and is characterized by the Nasta`liq format [4,5]. The family tree of Urdu traces back to a mixture of Indo-European, Indo-Iranian, and Indo-Aryan language evolution [6]. Urdu is known to have a rich and complex morphology [7,8], and its syntax combines Persian, Sanskrit, English, Turkish, and Arabic structures. Research on information retrieval (IR) prior to the 1990s was relatively limited and immature because only limited resources and data collections were available for evaluation. Experimentation with new algorithms and techniques for various IR and natural language processing (NLP) tasks, as well as the development of language tools, requires benchmark collections. Worldwide, most text-processing research takes place through evaluation-based consortiums such as the Text Retrieval Conference (TREC), which is cosponsored by the National Institute of Standards and Technology and the US Department of Defense. TREC was started in 1992 as part of the TIPSTER Text Program. Its goal was to provide a basis for research within the IR community by providing the infrastructure necessary for the large-scale evaluation of text retrieval methodologies. The TREC text …
Infomatics, 2017
Digital archiving of books and other documents using appropriate technologies is now a priority in libraries, academic and research systems, and knowledge management. Digitization makes it possible to find valuable and accurate information in huge collections quickly over networks or the web. It allows large amounts of data and information to be stored at low cost and shared with communities at the global level. Web-based information retrieval (IR) is now popular among people from all walks of life, and various IR systems are used to search for required information. Unlike in many other countries, many languages and scripts are used in India, and information systems therefore face many problems in processing digital text and in IR. The study presents a brief overview of search engines and the key issues in information searching and retrieval, especially in Hindi, by observing the results of queries on three popular Hindi search engines: Google, Raftaar and Khoj.
Our system addresses the design and implementation of a bilingual information retrieval system for the Festival domain. It is built for the Marathi language and works with the same efficiency. Searching, translation and information extraction are performed according to the user's query. The main task is to retrieve the answer to the user's typed query in both languages: the language of the query and standard English. For this, an ontological tree is built for the domain such that every node contains the important keywords in both languages. A part-of-speech (POS) tagger splits the sentence into words, assigns their POS tags and then determines the keywords of the query. Based on the context, the keywords are translated into the appropriate language using the ontological tree. A search is performed and documents are retrieved based on the keywords, and information extraction is carried out with the help of the ontological tree. Finally, the answer is translated back into the query language and presented to the user.
Query processing is a field of artificial intelligence in which a user asks a question in natural language and the system replies in the same language. Much work has been done for English, Chinese and many other languages, but very little for Urdu. Urdu is one of the most widely spoken and written languages of South Asia. Because of the unstructured format of the Urdu language, information retrieval is a major challenge. This paper compares two query processing systems, one ontology based and one keyword based, for processing a user query in natural language. The ontology-based system uses an ontology for knowledge building, rule generation and defining relationships. The keyword-based system, on the other hand, uses keywords from the query to find an answer. This paper applies both systems to the same data sets and analyses their performance on the basis of the recall and precision of the results.
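The recall and precision used for this kind of comparison can be computed as in the short Python sketch below. The function name `precision_recall` and the document ids are made up for illustration; the abstract does not describe the evaluation code. Precision is the share of retrieved items that are relevant, and recall is the share of relevant items that were retrieved.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    `retrieved` holds the ids the system returned and `relevant` the ids
    judged relevant. An empty set yields 0.0 rather than a division error.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: 4 documents returned, 3 relevant, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
print(f"precision={p:.2f} recall={r:.2f}")
```

Running the same function over the output of two systems on the same queries and relevance judgments gives directly comparable figures.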
Multilingual information is overflowing on the internet these days. The growing diversity of web pages in almost every popular language should enable users to access information in any language of their choice. Sometimes, however, it is difficult for users to write a request in a language that they can easily read and understand. This makes cross-language information retrieval (CLIR) and multilingual information retrieval (MLIR) for web applications a pressing need. They let web users retrieve information in any language while posting queries in their native language. The paper critically analyzes the work of various researchers in the area of Indian-language CLIR. We also present our prospective prototype for English-to-Hindi CLIR and discuss the issues related to English-to-Hindi translation. We tested 30 queries manually using the suggested prototype and found that the precision level is quite good.
Web search engines are among the best gifts of information and communication technologies to mankind. Without them, efficient access to the information available on the web today would be almost impossible. They play a vital role in the accessibility and usability of internet-based information systems. As the number of internet users grows day by day, so does the amount of information available on the web. But access to information is not uniform across all language communities. Besides English and the European languages, which account for 60% of the information available on the web, a wide range of information is available on the internet in other languages too. In the past few years the amount of information available in Indian languages has also increased. Apart from English and a few European languages, there are no tools and techniques for the efficient retrieval of this information. For Indian languages in particular, research is still at a preliminary stage, and sufficient tools and techniques for efficient retrieval are not available. Indian languages are very resource-poor in terms of IR test data collections, so my main focus was on developing a data set for Urdu IR and training and testing data for a stemmer. We have developed a language-independent system that facilitates efficient retrieval of information in Urdu and can be used for other languages as well. The system achieves a precision of 0.63 and a recall of 0.8. For this, I first developed an unsupervised stemmer for Urdu [1], as stemming is very important in information retrieval.
Journal of Computer Science, 2018
The abundance of non-English multilingual content on the internet creates a need for information retrieval systems that can cross language boundaries. Such cross-lingual information retrieval systems bridge the language gap by allowing a user to pose a query in a regional language and retrieve relevant documents in a different language. Finding relevant documents in a language different from the source language is the most challenging application of cross-lingual information retrieval. This paper discusses the development of a complete English-to-Hindi cross-language information retrieval system along with the contribution of the individual components to the system. The main focus of the paper is how we optimized our disambiguation approach, which we call the 'Two level Disambiguation method'. The experimental results affirm that adding an 'Analyzer' component to our CLIR architecture increases the efficiency of the proposed disambiguation algorithm.
An information retrieval system is an effective process that helps a user trace relevant information through natural language processing (NLP). In this research paper, we present an algorithmic Information Retrieval System (BIRS) based on information, and the system is significant mathematically and statistically. The paper demonstrates two algorithms for lemmatizing Bengali words, Trie and Dictionary Based Search by Removing Affix (DBSRA), and compares them with Edit Distance for exact lemmatization. We present a Bengali anaphora resolution system that uses Hobbs' algorithm to get the correct expression of information. For question answering, TF-IDF and cosine similarity are used to find the accurate answer in the documents. In this study, we introduce a Bengali Language Toolkit (BLTK) and Bengali Language Expression (BRE) that ease the implementation of our task. We hav...
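To make the TF-IDF and cosine similarity step concrete, here is a minimal, self-contained Python sketch that ranks documents against a query. It is an illustration under stated assumptions, not the BIRS implementation: the function names, the smoothed IDF formula, the whitespace tokenizer, the English sample sentences, and the choice to include the query when counting document frequencies are all simplifications introduced here.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Turn token lists into sparse TF-IDF vectors ({term: weight} dicts)."""
    n = len(token_lists)
    # Document frequency: in how many token lists each term appears.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    # Smoothed IDF (an assumed variant) keeps every weight positive.
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(query, documents):
    """Return (index of the best-scoring document, list of all scores)."""
    token_lists = [d.lower().split() for d in documents] + [query.lower().split()]
    *doc_vecs, query_vec = tfidf_vectors(token_lists)
    scores = [cosine(query_vec, vec) for vec in doc_vecs]
    return max(range(len(documents)), key=scores.__getitem__), scores


documents = [
    "dhaka is the capital of bangladesh",
    "rice is a staple food",
]
best, scores = best_match("capital of bangladesh", documents)
print(documents[best])
```

In a fuller pipeline the tokens would first pass through lemmatization, so that inflected forms of the same word contribute to the same vector dimension.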
This paper describes a framework for an English-Odia cross-lingual information retrieval system. The system retrieves Odia documents in response to a query given in English or Odia, so both monolingual and cross-lingual information retrieval can be achieved with it. Odia is the prominent regional language of Odisha and the sixth classical language of India. It is spoken by more than 33 million people in Odisha and is the official language of Jharkhand state in India. We use an online bilingual dictionary for query translation; it contains sixteen thousand words, including nouns, verbs, adjectives, and adverbs. Other linguistic resources for Odia, such as a tokenizer, a stemmer and a stop word list, were developed during this work.
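Dictionary-based query translation of the kind described above can be sketched in a few lines of Python. This is only an illustration: the function name `translate_query` and the toy dictionary are hypothetical, the target-side entries are bracketed placeholders rather than real Odia words, and the policies of keeping every candidate translation and passing unknown words through unchanged are assumptions, not details taken from the paper.

```python
# Hypothetical toy dictionary. The bracketed values are placeholders that
# stand in for Odia translations; they are not real Odia words.
TOY_DICTIONARY = {
    "temple": ["<od:temple>"],
    "festival": ["<od:festival-1>", "<od:festival-2>"],
}


def translate_query(query, dictionary):
    """Translate a query word by word using a bilingual dictionary.

    Every candidate translation of a word is kept, and words missing from
    the dictionary (for example proper names) pass through unchanged.
    """
    translated = []
    for word in query.lower().split():
        translated.extend(dictionary.get(word, [word]))
    return translated


terms = translate_query("temple festival Puri", TOY_DICTIONARY)
print(terms)
```

The translated terms would then be tokenized, stemmed and filtered for stop words with the target-language resources before being matched against the document index.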
International Journal of Computer Applications, 2015
Library Hi Tech, 2009
ADBU Journal of Engineering and Technology, 2019
Proceedings. IEEE International Multi Topic Conference, 2001. IEEE INMIC 2001. Technology for the 21st Century.
Lecture Notes in Computer Science, 2008
Annals of Library and Information Studies, 2013
International Journal of Computer Applications, 2012