The rapid rise of internet technology has generated an increasing volume of unstructured information across web pages, making effective content extraction essential. This paper reviews DOM tree-based methodologies from 2011 to 2021, focusing on approaches for extracting relevant data while minimizing noise, such as advertisements and navigation elements. By comparing various classification methods, limitations, and evaluation metrics, the work highlights the role of the DOM tree in enhancing the accuracy and efficiency of information extraction processes on the web.
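As a purely illustrative companion to the survey, the sketch below shows one of the simplest DOM-tree heuristics such methods build on: scoring elements by text length versus link density to separate informative content from navigation and advertising noise. It assumes the third-party package beautifulsoup4 and a toy HTML snippet; it is not taken from any of the reviewed papers.

```python
# Minimal DOM-tree heuristic: long text with few links is likely main content.
from bs4 import BeautifulSoup

def score_nodes(html: str):
    soup = BeautifulSoup(html, "html.parser")
    scores = []
    for node in soup.find_all(["div", "article", "section", "td"]):
        text = node.get_text(" ", strip=True)
        if not text:
            continue
        link_text = " ".join(a.get_text(" ", strip=True) for a in node.find_all("a"))
        link_density = len(link_text) / max(len(text), 1)
        scores.append((len(text) * (1.0 - link_density), node))
    return sorted(scores, key=lambda pair: pair[0], reverse=True)

if __name__ == "__main__":
    html = "<html><body><div><a href='#'>Nav</a></div><article>Main story text about the topic.</article></body></html>"
    best_score, best_node = score_nodes(html)[0]
    print(best_node.get_text(strip=True))  # prints the article text, not the nav link
```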
Health Informatics Journal, 2006
As the number of medical web sites in various languages increases, it is increasingly necessary to establish specific criteria and control measures that give consumers some guarantee that the health web sites they are visiting meet a minimum level of quality standards, and that the professionals offering the information are suitably qualified. The paper briefly presents the current mechanisms for labelling medical web content and introduces the work done in the EC-funded project Quatro, which has defined a vocabulary for quality labels and a schema to deliver them in a machine-processable format. In addition, the paper proposes the development of a labelling platform that will assist the work of medical labelling agencies by automating, up to a certain level, the retrieval of unlabelled medical web sites and their labelling, and the monitoring of labelled web sites as to whether they still satisfy the criteria.
The Internet has revolutionized the way knowledge can be accessed and presented. However, the explosion of web content that has followed is now producing major difficulties for effective selection and retrieval of information that is relevant for the task at hand. In disseminating clinical guidelines and other knowledge sources in healthcare, for example, it may be desirable to provide a presentation of current knowledge about best practice that is limited to material appropriate for the current patient context. A promising solution to this problem is to augment conventional guideline documents with decision-making and other "intelligent" services tailored to specific needs at the point of care. In this paper we describe how BMJ's Clinical Evidence, a well-known medical reference on the web, was enhanced with patient data acquisition and decision support services implemented in PROforma.
2017
Web documents contain vast amounts of information that can be extracted and processed to enhance the understanding of online data. Often, the structure of the document can be exploited in order to identify useful information within it. Pairs of attributes and their corresponding values are one such example of information frequently found in many online retail websites. These concentrated bits of information are often enclosed in specific tags of the web document, or highlighted with certain markers which can be automatically discovered and identified. This way, different methods can be employed to extract new pairs from other, more or less similar, documents. The method presented in this paper relies on the DOM (Document Object Model) structure and the text within web pages in order to extract patterns consisting of tags and pieces of text and then to classify them. Several classifiers have been compared and the best results have been obtained with a C4.5 decision tree classifier.
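The following is a hedged sketch of the general idea described above: turn DOM text nodes into (tag-path, text) candidates and classify them as parts of attribute-value pairs with a decision tree. scikit-learn's DecisionTreeClassifier (CART) stands in for the C4.5 classifier used in the paper, and the features, training snippet and labels are invented for illustration.

```python
# Extract (tag-path, text) candidates from a page and classify them.
from bs4 import BeautifulSoup
from sklearn.tree import DecisionTreeClassifier

def candidates(html: str):
    soup = BeautifulSoup(html, "html.parser")
    for node in soup.find_all(string=True):
        text = node.strip()
        if text:
            path = "/".join(p.name for p in reversed(list(node.parents)) if p.name != "[document]")
            yield path, text

def features(path: str, text: str):
    # simple surface features: path depth, text length, digit count, trailing colon
    return [path.count("/"), len(text), sum(c.isdigit() for c in text), int(text.endswith(":"))]

# toy training data: 1 = part of an attribute-value pair, 0 = other text
train_html = "<table><tr><td>Weight:</td><td>2 kg</td></tr></table><p>Free shipping on all orders.</p>"
X = [features(p, t) for p, t in candidates(train_html)]
y = [1, 1, 0]
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(clf.predict([features("table/tr/td", "Colour:")]))
```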
Since the Internet boom, consumer perception has become a complex issue to measure. Non-traditional advertising methods and new product exhibition alternatives have emerged. Forums and review sites allow end users to suggest, recommend or rate products according to their experiences, which has given rise to the study of such data collections. After being analyzed, stored and processed properly, they are used to produce reports that assist middle and senior management in decision making. This research aims to apply concepts and approaches of artificial intelligence to this area. The framework proposed here (named GDARIM) can be parameterized and adapted to similar problems in different fields. To do that, it first performs a deep problem analysis to determine the domain-specific variables and attributes. Then, it implements specific functionality for the current data collection and available storage. Next, the data is analyzed and processed, using Genetic Algorithms to feed back into the keywords initially loaded. Finally, reports of the results are presented to stakeholders.
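As a purely illustrative sketch (not the GDARIM implementation), the snippet below shows one way a genetic algorithm can refine an initially loaded keyword list against a collection of user reviews: individuals are keyword subsets and fitness rewards review coverage while penalizing list size. All data and parameters are invented.

```python
import random

REVIEWS = ["battery life is great", "poor battery, bad screen", "love the camera and screen"]
SEED_KEYWORDS = ["battery", "screen", "camera", "price", "shipping", "warranty"]

def fitness(individual):
    chosen = [k for k, bit in zip(SEED_KEYWORDS, individual) if bit]
    # reviews matched by at least one keyword, minus a small penalty per keyword
    return sum(any(k in review for k in chosen) for review in REVIEWS) - 0.1 * len(chosen)

def evolve(generations=30, pop_size=20, rng=random.Random(0)):
    pop = [[rng.randint(0, 1) for _ in SEED_KEYWORDS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(SEED_KEYWORDS))    # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:                        # mutation
                i = rng.randrange(len(SEED_KEYWORDS))
                child[i] = 1 - child[i]
            children.append(child)
        pop = parents + children
    best = max(pop, key=fitness)
    return [k for k, bit in zip(SEED_KEYWORDS, best) if bit]

print(evolve())
```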
2006
Existing methods of information extraction from HTML documents include manual approaches, supervised learning and automatic techniques. The manual method achieves high precision and recall but is difficult to apply to a large number of pages. Supervised learning involves human interaction to create positive and negative samples. Automatic techniques require less human effort but are not highly reliable regarding the information retrieved. Our experiments fall into this last category of methods; for this purpose we developed a tool for automatic data extraction from HTML pages.
Lecture Notes in Computer Science, 2007
As the number of health-related web sites in various languages increases, it is increasingly necessary to implement control mechanisms that give users an adequate guarantee that the web resources they are visiting meet a minimum level of quality standards. Based upon state-of-the-art technology in the areas of semantic web, content analysis and quality labeling, the AQUA system, designed for the EC-funded project MedIEQ, aims to support the automation of the labeling process for health-related web content. AQUA provides tools that crawl the web to locate unlabelled health web resources in different European languages, as well as tools that traverse websites, identify and extract information and, upon this information, propose labels or monitor already labeled resources. Two major steps in this automated labeling process are web content collection and information extraction. This paper focuses on content collection. We describe existing approaches, present the architecture of the content collection toolkit and how it is integrated within the AQUA system, and discuss our initial experimental results in the English language (six more languages will be covered by the end of the project).
Proceedings of the 2017 International Conference on Digital Health, 2017
Automatic assessment of the quality of online health information is a pressing need, especially with the massive growth of online content. In this paper, we present an approach to assessing the quality of health webpages based on their content rather than on purely technical features, by applying machine learning techniques to the automatic identification of evidence-based health information. Several machine learning approaches were applied to learn classifiers using different combinations of features. Three datasets were used in this study for three different diseases, namely shingles, flu and migraine. The results obtained using the classifiers were promising in terms of precision and recall, especially for diseases with few different pathogenic mechanisms.
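A minimal sketch of this kind of content-based classification, not the authors' system: TF-IDF features and a linear SVM stand in for whichever feature combinations the paper evaluated, and the tiny dataset is invented purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

pages = [
    "randomized controlled trial shows vaccine efficacy, see systematic review",
    "this miracle herb cures flu overnight, doctors hate it",
    "cochrane meta-analysis of antiviral treatment outcomes",
    "share this secret remedy with your friends now",
]
labels = [1, 0, 1, 0]  # 1 = evidence-based, 0 = not

# bag-of-words pipeline over page text; no technical/site features involved
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(pages, labels)
print(model.predict(["systematic review of migraine prophylaxis trials"]))
```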
Information Processing & Management, 2013
Eliminating noisy information and extracting informative content have become important issues for web mining, search and accessibility. This extraction process can employ automatic techniques and hand-crafted rules. Automatic extraction techniques focus on various machine learning methods, but implementing these techniques increases the time complexity of the extraction process. Conversely, extraction through hand-crafted rules is an efficient technique that uses string manipulation functions, but preparing these rules is difficult and cumbersome for users. In this paper, we present a hybrid approach consisting of two steps that can invoke each other. The first step discovers informative content using Decision Tree Learning as an appropriate machine learning method and creates rules from the results of this learning method. The second step extracts informative content using the rules obtained from the first step. However, if the second step does not return an extraction result, the first step is invoked again. In our experiments, the first step achieves a high accuracy of 95.76% in extraction of the informative content. Moreover, 71.92% of the rules can be used in the extraction process, and rule-based extraction is approximately 240 times faster than the first step.
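A hedged sketch of this two-step control flow, under invented features and thresholds: a decision tree is learned over simple block features, a cheap rule distilled from it runs first at extraction time, and the learner is only invoked when the rule does not fire.

```python
from sklearn.tree import DecisionTreeClassifier

def block_features(block: str):
    words = block.split()
    links = block.count("<a ")
    return [len(words), links, links / max(len(words), 1)]

# step 1: learn from labelled blocks (1 = informative, 0 = noise)
train_blocks = ["Long article paragraph " * 20, "<a href='#'>Home</a> <a href='#'>About</a>"]
tree = DecisionTreeClassifier(max_depth=2).fit([block_features(b) for b in train_blocks], [1, 0])

def rule_based(block: str):
    # rule distilled from the learned tree: many words and a low link ratio
    n_words, links, ratio = block_features(block)
    if n_words > 40 and ratio < 0.1:
        return "informative"
    if n_words < 5 and links > 0:
        return "noise"
    return None  # rule does not fire

def extract(block: str):
    label = rule_based(block)            # step 2: fast rule-based extraction
    if label is None:                    # fall back to the learner (step 1)
        label = "informative" if tree.predict([block_features(block)])[0] == 1 else "noise"
    return label

print(extract("Short menu <a href='#'>link</a>"))
```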
Journal of open source software, 2021
Inscriptis provides a library, command line client and Web service for converting HTML to plain text. Its development has been triggered by the need to obtain accurate text representations for knowledge extraction tasks that preserve the spatial alignment of text without drawing upon heavyweight, browser-based solutions such as Selenium (Huggins et al., 2021). In contrast to existing software packages such as HTML2text (Swartz, 2021), jusText (Belica, 2021) and Lynx (Dickey, 2021), Inscriptis (1) provides a layout-aware conversion of HTML that more closely resembles the rendering obtained from standard Web browsers and, therefore, better preserves the spatial arrangement of text elements; it excels in terms of conversion quality, since it correctly converts complex HTML constructs such as nested tables and also interprets a subset of HTML (e.g., align, valign) and CSS (e.g., display, white-space, margin-top, vertical-align) attributes that determine text alignment; and (2) supports annotation rules, i.e., user-provided mappings that allow for annotating the extracted text based on structural and semantic information encoded in HTML tags and attributes used for controlling structure and layout in the original HTML document. These unique features ensure that downstream knowledge extraction components can operate on accurate text representations, and may even use information on the semantics and structure of the original HTML document, if annotation support has been enabled. Statement of need: research in a growing number of scientific disciplines relies upon Web content. Li et al. (2014), for instance, studied the impact of company-specific news coverage on stock prices; in medicine and pharmacovigilance, social media listening plays an important role in gathering insights into patient needs and the monitoring of adverse drug effects (Convertino et al., 2018); and communication sciences analyze media coverage to obtain information on the perception and framing of issues as well as on the rise and fall of topics within news and social media (Scharl et al., 2017; Weichselbraun et al., 2021). Computer science focuses on analyzing content by applying knowledge extraction techniques such as entity recognition (Fu et al., 2021) to automatically identify entities (e.g., persons, organizations, locations, products) within text documents, entity linking (Ding et al., 2021) to link these entities to knowledge bases such as Wikidata and DBPedia, and sentiment analysis.
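A minimal usage sketch of the Inscriptis library described above: convert an HTML document to layout-aware plain text with get_text(). The example HTML is invented; annotation rules (not shown here) are passed through Inscriptis' parser configuration, for which the project documentation should be consulted.

```python
from inscriptis import get_text

html = """
<html><body>
  <h1>Quarterly results</h1>
  <table>
    <tr><td>Revenue</td><td>1.2 M</td></tr>
    <tr><td>Profit</td><td>0.3 M</td></tr>
  </table>
</body></html>
"""

text = get_text(html)   # preserves the spatial arrangement, e.g. aligned table columns
print(text)
```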
2000
The WWW is an important channel of information exchange in many domains, including the medical one. The ever increasing amount of freely available healthcare-related information generates, on the one hand, excellent conditions for self-education of patients as well as physicians, but on the other hand entails substantial risks if such information is trusted irrespective of low competence or even bad intentions of its authors. This is why medical website certification (also called 'quality labeling') by renowned authorities is of high importance. In this respect, it recently became obvious that the labeling process could benefit from employment of web mining and information extraction techniques, in combination with flexible methods of web-based information management developed within the semantic web initiative. Achieving such synergy is the central issue in the MedIEQ project. The AQUA (Assisting QUality Assessment) system, developed within the MedIEQ project, aims to provide the infrastructure and the means to organize and support various aspects of the daily work of labeling experts.
Polibits, 2014
The evolution of the Web from the original proposal made in 1989 can be considered one of the most revolutionary technological changes in centuries. During the past 25 years the Web has evolved from a static version to a fully dynamic and interoperable intelligent ecosystem. The amount of data produced during these few decades is enormous. New applications, developed by individual developers or small companies, can take advantage of both services and data already present on the Web. Data, produced by humans and machines, may be available in different formats and through different access interfaces. This paper analyses three different types of data available on the Web and presents mechanisms for accessing and extracting this information. The authors show several applications that leverage extracted information in two areas of research: recommendations of educational resources beyond content and interactive digital TV applications.
TELKOMNIKA Telecommunication Computing Electronics and Control, 2021
In the era of industrial digitalization, people are increasingly investing in solutions that support their processes for data collection, data analysis and performance improvement. In this paper, advancing web-scale knowledge extraction and alignment by integrating a few sources and exploring different methods of aggregation and attention is considered, focusing on image information. The main aim of data extraction with regard to semi-structured data is to retrieve beneficial information from the web. This data from the web, also known as the deep web, is retrievable but requires a request through form submission, because it cannot be obtained by search engines. As HTML documents grow larger, the process of data extraction has been plagued by lengthy processing times. In this research work, we propose an improved model, wrapper extraction of images using the document object model (DOM) and JavaScript object notation (JSON) data (WEIDJ), in response to the promising results of mining a higher volume of images in various formats. To observe the efficiency of WEIDJ, we compare the performance of data extraction at different levels of page extraction with VIBS, MDR, DEPTA and VIDE. It yielded the best results, with a Precision of 100, a Recall of 97.93103 and an F-measure of 98.9547.
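A hedged sketch of the general DOM-plus-JSON idea (not the WEIDJ model itself): walk the DOM, collect every image together with some surrounding context, and serialise the result as JSON for downstream processing. The HTML snippet and field names are assumptions for illustration, and beautifulsoup4 is assumed as the parser.

```python
import json
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_images(html: str, base_url: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for img in soup.find_all("img"):
        parent = img.find_parent()
        records.append({
            "src": urljoin(base_url, img.get("src", "")),      # absolute image URL
            "alt": img.get("alt", ""),
            "context": parent.get_text(" ", strip=True) if parent else "",
        })
    return json.dumps(records, indent=2)

html = '<div><img src="/p/1.jpg" alt="red shoe"><span>Running shoe, size 42</span></div>'
print(extract_images(html, "https://example.com"))
```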
Web content extraction is a key technology for enabling an array of applications aimed at understanding the web. While automated web extraction has been studied extensively, existing approaches often focus on extracting structured data that appears multiple times on a single webpage, like product catalogs. This project aims to extract less structured web content, like news articles, that appears only once in noisy webpages. Our approach classifies text blocks using a mixture of visual and language-independent features. In addition, a pipeline is devised to automatically label datapoints through clustering, where each cluster is scored based on its relevance to the webpage description extracted from the meta tags, and datapoints in the best cluster are selected as positive training examples.
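A rough sketch of that auto-labelling step, under invented features and data: cluster text blocks by simple language-independent features, score each cluster against the page's meta description, and treat the best-matching cluster as positive training examples.

```python
from sklearn.cluster import KMeans

blocks = [
    "Breaking news: the city council approved the new transit plan on Tuesday.",
    "Home | Sports | Politics | Contact",
    "The plan allocates funding for bus lanes and is expected to start next year.",
    "Copyright 2017 Example Media. All rights reserved.",
]
meta_description = "city council approves transit plan with new bus lanes"

def block_features(b: str):
    words = b.split()
    # word count, menu-separator count, ratio of capitalised words
    return [len(words), b.count("|"), sum(w[0].isupper() for w in words) / max(len(words), 1)]

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict([block_features(b) for b in blocks])

def overlap(text: str) -> int:
    return len(set(text.lower().split()) & set(meta_description.lower().split()))

# pick the cluster whose blocks best match the meta description
best = max(set(clusters), key=lambda c: sum(overlap(b) for b, lab in zip(blocks, clusters) if lab == c))
positives = [b for b, lab in zip(blocks, clusters) if lab == best]
print(positives)
```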
2009
Data mining is a fast-developing field of study, using computations to either predict or describe large amounts of data. The increase in data produced each year goes hand in hand with this, requiring algorithms that are more and more efficient in order to find interesting information within a given time. In this thesis, we study methods for extracting information from semi-structured data, for finding structure within large sets of discrete data, and for efficiently ranking web pages in a topic-sensitive way. The information extraction research focuses on support for keeping both documentation and source code up to date at the same time. Our approach to this problem is to embed parts of the documentation within strategic comments of the source code and then extract them by using a specific tool. The structures that our structure mining algorithms are able to find among crisp data (such as keywords) are in the form of subsumptions, i.e. one keyword is a more general form of the other. We can use these subsumptions to build larger structures in the form of hierarchies or lattices, since subsumptions are transitive. Our tool has been used mainly as input to data mining systems and for visualisation of datasets. The main part of the research has been on ranking web pages in such a way that both the link structure between pages and the content of each page matter. We have created a number of algorithms and compared them to other algorithms in use today. Our focus in these comparisons has been on convergence rate, algorithm stability and how relevant the answer sets from the algorithms are according to real-world users. The research has focused on the development of efficient algorithms for gathering and handling large datasets of discrete and textual data. A proposed system of tools is described, all operating on a common database containing "fingerprints" and meta-data about items. This data could be searched by various algorithms to increase its usefulness or to find the real data more efficiently. All of the methods described handle data in a crisp manner, i.e. a word or a hyperlink either is or is not a part of a record or web page. This means that we can model their existence in a very efficient way. The methods and algorithms that we describe all make use of this fact.
Applied Sciences
This paper discusses a tool for main text and image extraction (extracting and parsing the important data) from a web document. It describes our proposed algorithm based on the Document Object Model (DOM) and natural language processing (NLP) techniques, as well as other approaches for extracting information from web pages using various classification techniques such as support vector machines, decision trees, naive Bayes, and K-nearest neighbor. The main aim of the developed algorithm is to identify and extract the main block of a web document that contains the text of the article and the relevant images. The algorithm was applied to a sample of 45 web documents of different types. In addition, the issue of web pages, from the structure of the document to the use of the Document Object Model (DOM) for their processing, was analyzed. The Document Object Model was used to load and navigate the document. It also plays an important role in the correct identification...
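A small sketch, with invented toy features and data, of the kind of comparison the paper describes: the same DOM-block features fed to SVM, decision tree, naive Bayes and k-nearest-neighbour classifiers to decide whether a block is the main article block or boilerplate.

```python
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# features per DOM block: [text length, link count, image count]
X = [[1200, 2, 1], [40, 15, 0], [900, 1, 2], [30, 12, 0], [1500, 3, 1], [25, 9, 0]]
y = [1, 0, 1, 0, 1, 0]  # 1 = main content block, 0 = boilerplate

for clf in (SVC(), DecisionTreeClassifier(), GaussianNB(), KNeighborsClassifier(n_neighbors=3)):
    clf.fit(X, y)
    print(type(clf).__name__, clf.predict([[1100, 2, 1]]))
```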
2007
Users visiting health-related web sites would be served best if they knew whether these sites meet a minimum level of quality standards. However, manually labelling health resources is a tedious task. Based upon state-of-the-art technology in the areas of semantic web, content analysis and labelling, the MedIEQ project integrates existing technologies and tests them in a novel application: AQUA, a system aiming to automate parts of the labelling process in health-related web content. AQUA provides tools that enable the creation of machine-readable labels, tools that crawl the web to locate unlabelled health web resources, suggest labels for them according to predefined labelling criteria, and monitor them. This paper describes the current status in the area of health information labelling and explains step by step how AQUA paves the way towards the automation of the labelling process.
The World Wide Web contains a huge amount of unstructured and semi-structured information that is increasing exponentially with the coming of Web 2.0, thanks to User-Generated Content (UGC). In this paper we briefly survey the fields of application, in particular enterprise and social applications, and the techniques used to approach and solve the problem of extracting information from Web sources: over the last years many approaches have been developed, some inherited from past studies on Information Extraction (IE) systems, many others designed ad hoc to solve specific problems.
2006
The Web is a vast data repository. By mining this data efficiently, we can gain valuable knowledge. Unfortunately, in addition to useful content there are also many Web documents considered harmful (e.g. pornography, terrorism, illegal drugs). Web mining, which includes three main areas (content, structure, and usage mining), may help us detect and eliminate these sites. In this paper, we concentrate on applications of Web content and Web structure mining. First, we introduce a system for the detection of pornographic textual Web pages. We discuss its classification methods and depict its architecture. Second, we present an analysis of relations among Czech academic computer science Web sites. We give an overview of ranking algorithms and determine the importance of the sites we analyzed.
Journal of Computer …, 2009
Problem statement: Nowadays, many users use web search engines to find and gather information. Users face an increasing amount of various HTML information sources, so the issue of correlating, integrating and presenting related information to users becomes important. When a user uses a search engine such as Yahoo or Google to seek specific information, the results include not only information about the availability of the desired information, but also information about other pages on which the desired information is mentioned, and the number of selected pages is enormous. Therefore, the performance capabilities, the overlap among results for the same queries and the limitations of web search engines are an important and large area of research. Extracting information from web pages also becomes very important, because the massive and increasing amount of diverse HTML information sources on the internet and the variety of web pages make information extraction from the web a challenging problem. Approach: This study proposed an approach for extracting information from HTML web pages which was able to extract relevant information from different web pages based on standard classifications. Results: The proposed approach was evaluated by conducting experiments on a number of web pages from different domains and achieved an increase in precision and F-measure as well as a decrease in recall. Conclusion: Experiments demonstrated that our approach extracted the attributes, besides the sub-attributes that describe the extracted attributes and the values of the sub-attributes, from various web pages. The proposed approach was able to extract attributes that appear under different names in some of the web pages.
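An illustrative sketch (not the authors' method) of pulling attributes, sub-attributes and their values out of a specification table; the HTML layout below is an assumption about a typical product page, with section headings as attributes and rows as sub-attribute/value pairs.

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th colspan="2">Display</th></tr>
  <tr><td>Size</td><td>15.6 inch</td></tr>
  <tr><td>Resolution</td><td>1920 x 1080</td></tr>
  <tr><th colspan="2">Memory</th></tr>
  <tr><td>RAM</td><td>16 GB</td></tr>
</table>
"""

def extract_attributes(html: str):
    soup = BeautifulSoup(html, "html.parser")
    result, current = {}, None
    for row in soup.find_all("tr"):
        header = row.find("th")
        if header:                       # attribute (section heading)
            current = header.get_text(strip=True)
            result[current] = {}
        elif current:                    # sub-attribute and its value
            cells = [td.get_text(strip=True) for td in row.find_all("td")]
            if len(cells) == 2:
                result[current][cells[0]] = cells[1]
    return result

print(extract_attributes(html))
# {'Display': {'Size': '15.6 inch', 'Resolution': '1920 x 1080'}, 'Memory': {'RAM': '16 GB'}}
```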