2014, International Journal for Scientific Research and Development
Text mining is a practice regarded as one of the supporting pillars of Information Retrieval. This paper is dedicated to text mining, with a prime focus on mining academic papers. An architecture proposed by the authors, which they have named HTPI, is presented in the paper. The framework is built in Java under Eclipse using Apache Hadoop. The problem under consideration is the metamorphosis (re-ordering) of the references mentioned in the references section of a scientific paper, based on the retrieved similarity score (between each referenced paper and the paper whose reference list is being re-ordered). Various notions are used in the paper, such as stemming, skipping, and similarity calculation using the Jaccard coefficient.
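A minimal sketch of the similarity computation described above, with a naive suffix-stripping rule standing in for the stemming and skipping steps (the actual HTPI pipeline and its stemmer are not specified here):

    # Sketch: Jaccard similarity between two papers after naive stemming.
    # The stem() rule below is a placeholder, not the stemmer used by HTPI.
    def stem(word):
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def tokens(text):
        return {stem(w) for w in text.lower().split()}

    def jaccard(text_a, text_b):
        a, b = tokens(text_a), tokens(text_b)
        return len(a & b) / len(a | b) if a | b else 0.0

    # Re-order a reference list by similarity to the citing paper.
    def reorder_references(citing_text, referenced_texts):
        return sorted(referenced_texts, key=lambda t: jaccard(citing_text, t), reverse=True)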
2013
At CEON ICM UW we are in possession of a large collection of scholarly documents that we store and process using the MapReduce paradigm. One of the main challenges is to design a simple but effective data model that fits various data access patterns and allows us to perform diverse analyses efficiently. In this paper, we describe the organization of our data and explain how this data is accessed and processed by open-source tools from the Apache Hadoop ecosystem.
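The abstract does not disclose the exact data model, but a common Hadoop-ecosystem pattern is to keep each document as a key-value record and process it with MapReduce. A hypothetical Hadoop Streaming-style mapper/reducer pair in Python, counting terms across a collection of (doc_id, text) records, illustrates the access style; the record layout is an assumption:

    # Hypothetical Hadoop Streaming-style word count over "<doc_id>\t<text>" lines.
    import sys
    from collections import defaultdict

    def mapper(lines):
        for line in lines:
            _, _, text = line.partition("\t")   # assumed record format
            for term in text.lower().split():
                yield term, 1

    def reducer(pairs):
        counts = defaultdict(int)
        for term, n in pairs:
            counts[term] += n
        return counts

    if __name__ == "__main__":
        print(dict(reducer(mapper(sys.stdin))))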
Hadoop is one of the generally adopted cluster computing frameworks for processing Big Data. Although Hadoop has arguably become the standard solution for managing Big Data, it is not free from constraints. With today's developing technology, researchers and students prefer all documents in txt and doc format, yet most text files are available in pdf format; even research papers are available only in pdf, and extracting text from pdf is one of the most difficult jobs. So, for text extraction from multiple pdf files, we have to apply algorithms so that the extraction process takes place in a comfortable mode. Text extraction is the basic step that has to be performed before any further processing. We begin with a concise discussion of the keyword and the steps involved in text extraction from a txt file. In this paper, we use a keyword-based extraction method for extracting text from a txt file; with the help of these keywords we can obtain all the details for that part of the research paper or any pdf file. We also use a multithreading approach. Our approach is able to extract text in very little time, so the time complexity is very low. The aim of this paper is to extract text on the basis of a particular keyword, which is useful for new researchers.
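A minimal sketch of the keyword-based, multithreaded extraction idea, assuming the third-party pypdf package for PDF text extraction (the paper's own implementation, file formats and algorithms are not reproduced here):

    # Sketch: extract sentences containing a keyword from many PDFs in parallel.
    # Assumes the third-party "pypdf" package; paths and keyword are examples.
    from concurrent.futures import ThreadPoolExecutor
    from pypdf import PdfReader

    def extract_matching_sentences(path, keyword):
        text = " ".join(page.extract_text() or "" for page in PdfReader(path).pages)
        return [s.strip() for s in text.split(".") if keyword.lower() in s.lower()]

    def extract_from_many(paths, keyword, workers=4):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = pool.map(lambda p: (p, extract_matching_sentences(p, keyword)), paths)
        return dict(results)

    # Example: extract_from_many(["paper1.pdf", "paper2.pdf"], "clustering")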
2013
Content Analysis System (CoAnSys) is a research framework for mining scientific publications using Apache Hadoop. This article describes the algorithms currently implemented in CoAnSys, including classification, categorization and citation matching of scientific publications. The size of the input data places these algorithms in the range of big data problems, which can be efficiently solved on Hadoop clusters.
Abstract. We describe algorithms used for the automated extraction and analysis of information about scientific publications. To extract the information we propose an algorithm consisting of four steps: lexical analysis, terminal normalization, merging, and filtering of entities. For the analysis of this information, we recommend using an algorithm based on the minimum spanning tree. Keywords: Analysis of Scientific Publications, Information Extraction, Terminals, Entities Merging, Entities Filtering, Clustering Algorithms, Graphs, Markov Clustering Algorithm
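The abstract names a minimum-spanning-tree based analysis; a common way to cluster with an MST is to build it over a pairwise-distance graph and cut its longest edges. A small sketch using networkx (an assumption for illustration, not the authors' implementation):

    # Sketch: MST-based clustering - build the MST over pairwise distances,
    # remove the k-1 longest edges, and read off connected components as clusters.
    import networkx as nx

    def mst_clusters(items, distance, k):
        g = nx.Graph()
        for i, a in enumerate(items):
            for j in range(i + 1, len(items)):
                g.add_edge(i, j, weight=distance(a, items[j]))
        mst = nx.minimum_spanning_tree(g)
        edges = sorted(mst.edges(data=True), key=lambda e: e[2]["weight"], reverse=True)
        mst.remove_edges_from([(u, v) for u, v, _ in edges[: k - 1]])
        return [sorted(c) for c in nx.connected_components(mst)]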
2021
Abstract: Data nowadays is the language of technologies, as every process needs data to be processed: the input is data and the output is also data. Analyzing data is a significant task, especially with the increasing production of data, particularly textual data; it would be difficult to manually analyze the data, extract information and detect the hidden patterns in unstructured text. Data mining is an automated technique for gathering or deriving new high-quality information and uncovering the relations among the data, and text mining is one of its main branches. In this paper, a comprehensive overview of mining publication papers via text mining is presented, together with evaluation techniques and their results, for the following approaches: the first is keyword extraction using a natural language processing (NLP) approach, the second is named entity recognition, and the last is document clustering, where machine learning techniques ...
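A minimal sketch of two of the surveyed approaches, keyword extraction and document clustering, using scikit-learn's TF-IDF vectorizer and k-means; these are generic stand-ins, not the specific techniques evaluated in the paper:

    # Sketch: TF-IDF keyword extraction and k-means document clustering.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    def top_keywords_and_clusters(docs, n_keywords=5, n_clusters=2):
        vec = TfidfVectorizer(stop_words="english")
        X = vec.fit_transform(docs)
        terms = vec.get_feature_names_out()
        keywords = [
            [terms[i] for i in row.toarray().ravel().argsort()[::-1][:n_keywords]]
            for row in X
        ]
        labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
        return keywords, labels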
Journal of Software, 2014
A considerable amount of research is being conducted by many people (researchers, graduate students, professors, etc.) every day. Finding information about a specific topic is one of the most time-consuming activities for those people. People doing research have to search, read and analyze multiple research papers, e-books and other documents, determine what they contain, and discover knowledge from them. Many available resources are long pages of unstructured text, which require a long time to read and analyze. In this paper we propose a two-stage method for scientific paper analysis. The method uses information extraction to extract the main-idea key sentences (mainly needed by most readers) from the paper; the extracted information is then organized in a structured format and grouped into different clusters according to topic using a multi-word-based clustering method. The proposed method combines different features in extracting papers' topics and uses a multi-word matching feature in the selection of initial centroids for clustering. The proposed method can help readers access and analyze multiple research paper documents in a timely and efficient manner. The conducted experiments show the effectiveness and usefulness of the proposed approach.
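A minimal illustration of the first stage, extracting main-idea key sentences by scoring sentences against the document's most frequent terms; this is a generic extraction heuristic, not the authors' exact method:

    # Sketch: pick the top-k sentences whose words are most frequent in the paper.
    import re
    from collections import Counter

    def key_sentences(text, k=3):
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
        freq = Counter(re.findall(r"[a-z]+", text.lower()))
        def score(s):
            words = re.findall(r"[a-z]+", s.lower())
            return sum(freq[w] for w in words) / (len(words) or 1)
        return sorted(sentences, key=score, reverse=True)[:k]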
The ever-growing volume of published academic journals and the implicit knowledge that can be derived from them has not fully enhanced knowledge development but has rather resulted in information and cognitive overload. Moreover, publication data are textual, unstructured and anomalous. Analysing such high-dimensional data manually is time-consuming, and this has limited the ability to make projections and identify trends derivable from the patterns hidden in various publications. This study was designed to develop and use intelligent text mining techniques to characterise academic journal publications. Journal scoring criteria by nineteen rankers from 2001 to 2013 in the 50th edition of the Journal Quality List (JQL) were used as criteria for selecting the highly rated journals. The text-miner software developed was used to crawl and download the abstracts of papers and their bibliometric information from the articles selected from these journals. The datasets were transformed into structured data and cleaned using filtering and stemming algorithms. Thereafter, the data were grouped into series of word features based on a bag-of-words document representation. The highly rated journals were clustered using the Self-Organising Map (SOM) method with attribute weights in each cluster.
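A hedged sketch of the clustering step, assuming the third-party MiniSom package over a plain bag-of-words matrix; the study's own SOM configuration and attribute weighting are not reproduced:

    # Sketch: cluster abstracts with a Self-Organising Map over bag-of-words vectors.
    # Assumes the third-party "minisom" and "scikit-learn" packages.
    from minisom import MiniSom
    from sklearn.feature_extraction.text import CountVectorizer

    def som_clusters(abstracts, grid=3):
        X = CountVectorizer(stop_words="english").fit_transform(abstracts).toarray()
        som = MiniSom(grid, grid, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
        som.random_weights_init(X)
        som.train_random(X, 500)
        # each abstract is assigned to the grid cell of its best matching unit
        return [som.winner(x) for x in X]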
Ingeniería e Investigación
Tree of Science (ToS) is a web-based tool that uses the network structure of paper citations to identify relevant literature. ToS shows the information in the form of a tree, where the articles located in the roots are the classics, those in the trunk are the structural publications, and the leaves are the most current papers. It has been found that some results in the leaves can be disconnected from the tree. Therefore, an algorithm (SAP) is proposed in order to improve the results in the leaves. Two improvements are presented: articles located in the leaves are from the last five years, and they are connected to root and trunk articles through their citations. This improvement facilitates the construction of a current literature review for researchers.
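A minimal sketch of the tree metaphor on a citation graph, using in-degree and out-degree to separate roots (highly cited classics), trunk (cited and citing) and leaves (recent papers that cite but are not yet cited), plus the leaf filter described above; this is an illustrative heuristic, not the published SAP algorithm:

    # Sketch: split a citation digraph into root / trunk / leaf sets by degree.
    # Edges point from the citing paper to the cited paper.
    import networkx as nx

    def tree_of_science(edges):
        g = nx.DiGraph(edges)
        roots, trunk, leaves = [], [], []
        for node in g:
            cited_by = g.in_degree(node)   # times this paper is cited
            cites = g.out_degree(node)     # references this paper makes
            if cited_by > 0 and cites == 0:
                roots.append(node)
            elif cited_by > 0 and cites > 0:
                trunk.append(node)
            else:
                leaves.append(node)
        # SAP-style filter (sketch): keep only leaves citing a root or trunk paper
        keep = set(roots) | set(trunk)
        leaves = [n for n in leaves if any(m in keep for m in g.successors(n))]
        return roots, trunk, leaves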
2003
A fundamental feature of research papers is how many times they are cited in other articles, i.e. how many later references to them there are. That is the only objective way of evaluating how important or novel a paper's ideas are. With an increasing number of articles available online, it has become possible to find these citations in a more or less automated way. This thesis first describes existing approaches to citation retrieval and indexing and then introduces CiteSeeker, a tool for fully automated citation retrieval. CiteSeeker starts crawling the World Wide Web from given starting points and searches for specified authors and publications in a fuzzy manner, which means that certain inaccuracies in the search strings are taken into account. CiteSeeker handles all common Internet file formats, including PostScript and PDF documents and archives. The project is based on the .NET technology.
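A minimal sketch of the fuzzy matching idea, using Python's standard difflib for approximate string comparison (CiteSeeker itself is a .NET tool and its matching rules are not reproduced here):

    # Sketch: fuzzy search for a publication title inside crawled page text.
    from difflib import SequenceMatcher

    def fuzzy_contains(haystack, needle, threshold=0.85):
        needle = needle.lower()
        words = haystack.lower().split()
        window = len(needle.split())
        for i in range(max(1, len(words) - window + 1)):
            candidate = " ".join(words[i : i + window])
            if SequenceMatcher(None, candidate, needle).ratio() >= threshold:
                return True
        return False

    # Example: fuzzy_contains(page_text, "citations retrieval and indexing")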
Library Philosophy and Practice (e-journal), 2020
Text Mining (TM) is one of the emerging areas of research, but there are limited studies of it from a scientometric viewpoint. Using the bibliometric approach, this paper analyses the TM research trend, forecast and citation approach from 2000 to 2019 by locating the headings “text mining”, “text clustering”, “text extraction” and “text categorization” in the Web of Science database. The paper classified the 5006 retrieved articles using the following ten categories (publication year, citation, country, institution, type of document, language, subject, author, source title and keyword) for the distribution status of different areas, in order to explore the trend of research in this field during this period. According to the K-S test, the hypothesis that the data set conforms to Lotka's Law is rejected at the 0.01 level of significance; Pao’s formula and the least-squares method are used to this end. The research provides a roadmap for future researchers to follow, so that they can concentrate on the core categories where the possibility of success lies.
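A sketch of the Lotka's Law test mentioned above: a least-squares estimate of the exponent on log-log data and a Kolmogorov-Smirnov style statistic comparing observed and expected cumulative distributions (the paper's exact Pao procedure is not reproduced):

    # Sketch: fit Lotka's law y(x) = C / x**n by least squares on log-log data,
    # then compute a K-S style statistic between observed and expected CDFs.
    import math

    def lotka_fit(x, y):
        lx, ly = [math.log10(v) for v in x], [math.log10(v) for v in y]
        mean_x, mean_y = sum(lx) / len(lx), sum(ly) / len(ly)
        slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly)) / \
                sum((a - mean_x) ** 2 for a in lx)
        n = -slope                          # Lotka exponent
        c = 10 ** (mean_y - slope * mean_x)
        return n, c

    def ks_statistic(x, y, n, c):
        total_obs = sum(y)
        expected = [c / xi ** n for xi in x]
        total_exp = sum(expected)
        d, cum_o, cum_e = 0.0, 0.0, 0.0
        for yo, ye in zip(y, expected):
            cum_o += yo / total_obs
            cum_e += ye / total_exp
            d = max(d, abs(cum_o - cum_e))
        return d

    # x = papers per author, y = number of authors producing that many papers.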
Current Challenges in …, 2011
We describe a novel search engine for scientific literature. The system allows sentence-level search starting from portable document format (PDF) files, and integrates text and image search, thus facilitating the retrieval of information present in tables and figures. It allows the user to generate, in an intuitive manner, complex queries for search terms that are related through particular grammatical (and thus implicitly semantic) relations. The system uses grid processing to parallelise the analysis of large numbers of scientific papers. It is currently undergoing user evaluation, but we report some preliminary evaluation and comparison with Google Scholar, demonstrating its utility. Finally, we discuss future work and the potential and complementarity of the system for patent search.
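A minimal sketch of sentence-level indexing and retrieval, using a plain inverted index from terms to (document, sentence) positions; the actual system's grammatical-relation queries, image search and grid processing are far richer than this:

    # Sketch: sentence-level inverted index for a small document collection.
    import re
    from collections import defaultdict

    def build_index(docs):
        index = defaultdict(set)   # term -> {(doc_id, sentence_no)}
        sentences = {}
        for doc_id, text in docs.items():
            for s_no, sent in enumerate(re.split(r"(?<=[.!?])\s+", text)):
                sentences[(doc_id, s_no)] = sent
                for term in re.findall(r"[a-z]+", sent.lower()):
                    index[term].add((doc_id, s_no))
        return index, sentences

    def search(index, sentences, *terms):
        hits = set.intersection(*(index.get(t.lower(), set()) for t in terms)) if terms else set()
        return [sentences[h] for h in sorted(hits)]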
Lecture Notes in Computer Science, 2015
HAL is a multidisciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
IEEE Access, 2017
Over the decades, immense growth has been reported in research publications due to continuous developments in science. To date, various approaches have been proposed that find similarity between research papers by applying different similarity measures, collectively or individually, based on the content of the papers. However, the contemporary schemes are not conceptualized enough to find related research papers in a coherent manner. This paper aims at finding related research papers by proposing a comprehensive and conceptualized model via an ontology named COReS: Content-based Ontology for Research Paper Similarity. The ontology is built by finding the explicit relationships (i.e., supertype/subtype, disjointedness, and overlapping) between state-of-the-art similarity techniques. This paper presents the applications of the COReS model in the form of a case study followed by an experiment. The case study uses in-text citation-based and vector-space-based similarity measures and the relationships between these measures as defined in COReS. The experiment focuses on the computation of comprehensive similarity and other content-based similarity measures, and on rankings of research papers according to these measures. The obtained Spearman correlation coefficient results between the ranks of research papers for the different similarity measures and a user-study-based measure justify the application of COReS for the computation of document similarity. COReS is in the process of being evaluated for ontological errors. In the future, COReS will be enriched to provide more knowledge to improve the process of comprehensive research paper similarity computation.
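A hedged sketch of the experiment's final step: combining several content-based similarity scores into one comprehensive score and comparing the resulting rankings with scipy's Spearman correlation. The measures and weights below are illustrative placeholders, not those defined in COReS:

    # Sketch: weighted combination of similarity measures and rank comparison.
    # Assumes scipy; measure names and weights are placeholders.
    from scipy.stats import spearmanr

    def comprehensive_similarity(scores, weights):
        # scores: per candidate paper, e.g. {"citation": 0.4, "vector_space": 0.7}
        return sum(weights[m] * scores[m] for m in weights)

    def compare_rankings(ranks_measure_a, ranks_measure_b):
        rho, p_value = spearmanr(ranks_measure_a, ranks_measure_b)
        return rho, p_value

    # Example: compare_rankings([1, 2, 3, 4], [2, 1, 3, 4])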
2008 IEEE 25th Convention of Electrical and Electronics Engineers in Israel, 2008
In this study, the semantic classification of the references/citations of a scientific article according to their position within the article is investigated. For this purpose, the article is divided into two major sections: the Introduction/Background section and the remaining section, which contains the methodology, experimental part, results and conclusions. Additionally, the references of an article are divided into two categories, Self-References and Citations, which are used for the semantic interpretation of the references in combination with the aforementioned positional partition. To achieve this, an algorithm was constructed, implemented in the Java programming language, and applied to numerous articles from open Springer journals. Finally, the classification results, as well as their interpretation, should prompt a new consideration of the contribution of each reference to knowledge creation, specifically in the self-citation case.
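A minimal illustration of the two classifications described above: references are split by where they are cited in the article (Introduction/Background versus the rest) and into self-references versus citations by author overlap. This is a simplified stand-in for the paper's Java implementation:

    # Sketch: classify each reference by citing position and by self-reference.
    def classify_reference(ref_authors, article_authors, cited_in_introduction):
        kind = "self-reference" if set(ref_authors) & set(article_authors) else "citation"
        position = "introduction/background" if cited_in_introduction else "rest of article"
        return kind, position

    # Example:
    # classify_reference(["A. Smith"], ["A. Smith", "B. Jones"], cited_in_introduction=True)
    # -> ("self-reference", "introduction/background")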
IntechOpen, 2022
To face the problem of information overload, digital libraries, like other businesses, have used recommender systems and try to personalize recommendations to users by using the textual information of papers. This textual information includes the title, abstract, keywords, publisher, author and other similar items. Since the volume of papers is increasing day by day and recommender systems alone cannot cover this huge volume and process papers according to the user's tastes, big data tools are needed: by running parallel processing, they can cover and process this volume quickly and offer relevant recommendations. In this chapter, research in the field of content-aware recommender systems for scientific papers, and recommender systems in general, is discussed.
Journal of Fundamental and Applied Sciences, 2016
Recommender systems for research papers have become increasingly popular. In the past 14 years, more than 170 research papers, patents and webpages have been published in this field. Scientific paper recommender systems try to provide each user with recommendations that are consistent with the user's personal interests, based on performance, personal tastes and user behaviour. Since the volume of papers grows day after day and recommender systems alone cannot cover these huge volumes or process papers according to users' preferences, it is necessary to use parallel processing (MapReduce programming) for covering and quickly processing these volumes of papers. The system suggested in this research constructs a profile for each paper which contains context information and the scope of the paper. Then, the system recommends papers to the user according to the user's work domain and the papers' domains. To implement the system, a Hadoop platform and parallel programming were used, because the volume of data constituted big data and time was also an important factor. The performance of the suggested system was measured by criteria such as user satisfaction and accuracy, and the results have been satisfactory.
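A minimal single-machine sketch of the recommendation step, matching a user's work-domain profile against paper profiles with TF-IDF and cosine similarity; the actual system runs this as Hadoop MapReduce jobs, which are not reproduced here:

    # Sketch: recommend the papers whose profiles are closest to the user profile.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def recommend(user_profile, paper_profiles, top_k=5):
        vec = TfidfVectorizer(stop_words="english")
        X = vec.fit_transform([user_profile] + paper_profiles)
        sims = cosine_similarity(X[0:1], X[1:]).ravel()
        ranked = sorted(range(len(paper_profiles)), key=lambda i: sims[i], reverse=True)
        return [(paper_profiles[i], float(sims[i])) for i in ranked[:top_k]]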
Recherche, 2011
This paper is devoted to the three-year research performed at Warsaw University of Technology aimed at building advanced software for a university research knowledge base. As a result, a text mining platform has been built, enabling research in the areas of text mining and semantic information retrieval. In the paper, some of the implemented methods are tested from the point of view of their applicability in a real-life system.