In addition to the news content, most web news pages also contain navigation panels, advertisements, related news links, and other non-news items. These items appear not only outside the news region but also inside it. Effectively extracting the news content and filtering out this noise is important for follow-up content management and analysis. Our extensive case studies indicate a potential relationship between web content layouts and their tag paths. Based on this observation, we design two tag path features to measure the importance of nodes: the Text to tag Path Ratio (TPR) and the Extended Text to tag Path Ratio (ETPR), and describe how TPR is calculated by traversing the parse tree of a web news page. In this paper, we present Content Extraction via Path Ratios (CEPR), a fast, accurate, and general online method that distinguishes news content from non-news content effectively using the TPR/ETPR histogram. To improve the ability of CEPR to extract short texts, we propose a Gaussian smoothing method weighted by a tag path edit distance. This approach enhances the importance of internal-link nodes while ignoring noise nodes that appear inside the news content. Experimental results on the CleanEval datasets and on web news pages randomly selected from well-known websites show that CEPR can extract content across multiple sources, styles, and languages. The average F-measure and average score achieved by CEPR are 8.69% and 14.25% higher, respectively, than those of CETR, demonstrating better web news extraction performance than most existing methods.
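As a rough illustration of the tag-path idea (not the authors' exact TPR/ETPR formulas), the sketch below computes a simple text-to-tag-path ratio by walking an HTML document, accumulating text length per tag path, and dividing by the number of times each path occurs; the class name, path encoding, and example page are assumptions made for demonstration.

```python
# Illustrative sketch of a text-to-tag-path ratio, loosely following the CEPR
# abstract: group text by the tag path that leads to it and divide the
# accumulated text length by the number of times the path occurs.
from collections import defaultdict
from html.parser import HTMLParser

class TagPathStats(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []                      # current tag path, e.g. html/body/div/p
        self.text_len = defaultdict(int)     # characters of text seen under each path
        self.path_count = defaultdict(int)   # how many times each path occurs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.path_count["/".join(self.stack)] += 1

    def handle_endtag(self, tag):
        if tag in self.stack:
            # pop up to and including the matching open tag (tolerates bad nesting)
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.stack and data.strip():
            self.text_len["/".join(self.stack)] += len(data.strip())

def tag_path_ratios(html):
    parser = TagPathStats()
    parser.feed(html)
    return {path: parser.text_len[path] / parser.path_count[path]
            for path in parser.path_count}

if __name__ == "__main__":
    page = ("<html><body><div><p>Long news paragraph text here.</p>"
            "<p>Another sentence of the article.</p></div>"
            "<ul><li>nav</li><li>ads</li></ul></body></html>")
    for path, ratio in sorted(tag_path_ratios(page).items(), key=lambda kv: -kv[1]):
        print(f"{ratio:6.1f}  {path}")
```

On a page like the example above, paths leading to article paragraphs accumulate far more text per occurrence than paths leading to navigation list items, which is the kind of contrast the histogram-based extraction relies on.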
2008
We describe a method to extract content text from diverse Web pages by using the HTML document's Text-to-Tag Ratio rather than specific HTML cues that may not be consistent across Web pages. We describe how to compute the Text-to-Tag Ratio on a line-by-line basis and then cluster the results into content and non-content areas. With this approach we show surprisingly high levels of recall at all levels of precision, along with large space savings.
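A minimal sketch of the line-by-line ratio, assuming each line's ratio is its non-tag character count divided by its tag count (with a floor of one tag); the regular expressions are simplifications, and the clustering step described above is omitted.

```python
# Sketch of a line-by-line Text-to-Tag Ratio: content-bearing lines tend to
# have far more text characters per tag than boilerplate lines.
import re

def text_to_tag_ratios(html):
    ratios = []
    for line in html.splitlines():
        tags = re.findall(r"<[^>]*>", line)
        text = re.sub(r"<[^>]*>", "", line).strip()
        ratios.append(len(text) / max(len(tags), 1))
    return ratios
```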
2009
In this paper we present a simple, robust, accurate and language-independent solution for extracting the main content of an HTML-formatted Web page and for removing additional content such as navigation menus, functional and design elements, and commercial advertisements. The method creates a text density graph of a given Web page and then selects the region of the page with the highest density. When evaluated on a standard dataset, the results are comparable to or better than those of state-of-the-art methods that are computationally more complex. Accurate and efficient content extraction from Web pages is widely needed when searching or mining Web content.
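The sketch below illustrates one way such a text-density profile could be built and a dense region selected, assuming per-line density is the fraction of characters left after stripping markup and using a simple threshold-run heuristic in place of the paper's region-selection step.

```python
# Rough sketch of a text-density profile and a naive "densest region" picker.
import re

def density_profile(html):
    profile = []
    for line in html.splitlines():
        stripped = re.sub(r"<[^>]*>", "", line)
        profile.append(len(stripped.strip()) / max(len(line.strip()), 1))
    return profile

def densest_region(profile, threshold=0.5):
    # return the longest run of consecutive lines whose density exceeds the threshold
    best, start = (0, 0), None
    for i, d in enumerate(profile + [0.0]):     # sentinel closes a trailing run
        if d >= threshold and start is None:
            start = i
        elif d < threshold and start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best                                  # (first line, last line + 1)
```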
Proceedings of the 13th …, 2004
The Web poses itself as the largest data repository ever available in the history of humankind. Major efforts have been made to provide efficient access to relevant information within this huge repository of data. Although several techniques have been developed for the problem of Web data extraction, their use is still not widespread, mostly because of the need for high human intervention and the low quality of the extraction results. In this paper, we present a domain-oriented approach to Web data extraction and discuss its application to automatically extracting news from Web sites. Our approach is based on a highly efficient tree structure analysis that produces very effective results. We have tested our approach on several important Brazilian on-line news sites and achieved very precise results, correctly extracting 87.71% of the news in a set of 4088 pages distributed among 35 different sites.
2010
We present Content Extraction via Tag Ratios (CETR), a method to extract content text from diverse webpages by using the HTML document's tag ratios. We describe how to compute tag ratios on a line-by-line basis and then cluster the resulting histogram into content and non-content areas. Initially, we find that the tag ratio histogram is not easily clustered because of its one-dimensionality; therefore we extend the original approach in order to model the data in two dimensions. Next, we present a tailored clustering technique which operates on the two-dimensional model, and then evaluate our approach against a large set of alternative methods using standard accuracy, precision and recall metrics on a large and varied Web corpus. Finally, we show that, in most cases, CETR achieves better content extraction performance than existing methods, especially across varying web domains, languages and styles.
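The following sketch illustrates the two-dimensional modelling step described above, pairing smoothed per-line tag ratios with their absolute derivative. Plain k-means from scikit-learn stands in for CETR's tailored clustering technique, and the window size and cluster-selection rule are illustrative assumptions.

```python
# Simplified two-dimensional tag-ratio model: lines in the cluster whose
# centroid is closest to the origin are treated as non-content.
import re
import numpy as np
from sklearn.cluster import KMeans

def line_tag_ratios(html):
    ratios = []
    for line in html.splitlines():
        tags = re.findall(r"<[^>]*>", line)
        text = re.sub(r"<[^>]*>", "", line).strip()
        ratios.append(len(text) / max(len(tags), 1))
    return np.array(ratios, dtype=float)

def content_line_mask(html, window=3):
    r = line_tag_ratios(html)
    kernel = np.ones(window) / window
    smoothed = np.convolve(r, kernel, mode="same")      # smoothed ratio per line
    derivative = np.abs(np.gradient(smoothed))          # local change of the ratio
    features = np.column_stack([smoothed, derivative])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
    centroids = np.array([features[labels == k].mean(axis=0) for k in (0, 1)])
    content_cluster = int(np.argmax(np.linalg.norm(centroids, axis=1)))
    return labels == content_cluster                    # True for likely content lines
```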
2018
Besides the main content, webpages often contain other complementary and noisy data such as advertisements, navigational information, copyright notices, and other template-related elements. The detection and extraction of main content can have many applications, such as web summarization, indexing, data mining, content adaptation to mobile devices, web content printing, etc. We introduce a novel site-level technique for content extraction based on the DOM representation of webpages. This technique analyzes some selected pages in any given website to identify those nodes in the DOM tree that do not belong to the webpage template. Then, an algorithm explores these nodes in order to select the main content nodes. To properly evaluate the technique, we have built a suite of benchmarks by downloading several heterogeneous real websites and manually marking the main content nodes. This suite of benchmarks can be used to evaluate and compare different content extraction techniques.
International Journal of Computer Applications, 2010
As is well known, detailed web pages on the Internet contain information that is not considered primary content, such as advertisements, headers, footers, navigation links, and copyright information. Information such as comments and reviews is also not preferred by search engines for indexing as informative content, so an algorithm that extracts only the main content can improve the quality of web page indexing. Almost all algorithms proposed so far are tag dependent, meaning they look for primary content only within specific tags such as <TABLE> or <DIV>. The algorithm in this paper simulates how a user visits a web page and finds the position of the main content block. The proposed method is tag independent and accomplishes the extraction in two phases. First, it transforms the DOM tree obtained from the input HTML page into a block tree based on visual representation and DOM structure, so that every node carries a specification vector; it then traverses the resulting block tree to find the main block, i.e., the block whose computed value, derived from its specification vector, dominates those of the other block nodes. The introduced method has no learning phase and can find the informative content of any detailed web page given as input. The method has been tested on a large variety of websites and, as we show, achieves better precision and recall than the compared method, K-FE.
2011
Extracting the main content of web documents with high accuracy is an important challenge for researchers working on the web. In this paper, we present a novel language-independent method for extracting the main content of web pages. Our method, called DANAg, achieves high performance in terms of both effectiveness and efficiency compared with other main content extraction approaches. The extraction process of DANAg is divided into four phases. In the first phase, we calculate the length of content and code in fixed segments of an HTML file. The second phase applies a naive smoothing method to highlight the segments forming the main content. After that, we use a simple algorithm to recognize the boundary of the main content in the HTML file. Finally, we feed the selected main content area to our parser in order to extract the main content of the targeted web page.
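A minimal sketch of the segment-based signal described in the first three phases, assuming fixed-size character segments, a simple moving-average smoother, and a threshold rule for the content boundary; the segment size, threshold, and boundary rule are placeholders rather than DANAg's actual parameters.

```python
# Sketch: per-segment text length, smoothed, with a naive boundary heuristic.
import re

def segment_text_lengths(html, segment_size=200):
    lengths = []
    for i in range(0, len(html), segment_size):
        segment = html[i:i + segment_size]
        text = re.sub(r"<[^>]*>", "", segment)
        lengths.append(len(text.strip()))        # characters of content per segment
    return lengths

def smooth(values, window=3):
    half = window // 2
    return [sum(values[max(0, i - half):i + half + 1]) /
            len(values[max(0, i - half):i + half + 1])
            for i in range(len(values))]

def main_content_span(html, segment_size=200, threshold=50):
    signal = smooth(segment_text_lengths(html, segment_size))
    above = [i for i, v in enumerate(signal) if v >= threshold]
    if not above:
        return None
    start, end = above[0], above[-1] + 1
    return start * segment_size, end * segment_size   # character offsets in the HTML
```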
The Journal of Logic and Algebraic Programming, 2013
The main content of a webpage is usually centered and visible without the need to scroll. It is often surrounded by the navigation menus of the website and can include advertisements, panels, banners, and other not necessarily related information. The process of automatically extracting the main content of a webpage is called content extraction. Content extraction is a research area of wide interest due to its many applications. Concretely, it is useful not only for the final human user, but it is also frequently used as a preprocessing stage of different systems (e.g., robots, indexers, crawlers) that need to extract the main content of a web document to avoid treating and processing other useless information. In this work we present a new technique for content extraction based on the information contained in the DOM tree. The technique analyzes the hierarchical relations of the elements in the webpage and the distribution of textual information in order to identify the main block of content. Thanks to the hierarchy imposed by the DOM tree, the technique achieves considerable recall and precision. Using the DOM structure for content extraction gives us the benefits of approaches based on the syntax of the webpage (such as characters, words, and tags), but it also gives us very precise information about the related components in a block (not necessarily textual, such as images or videos), thus producing very cohesive blocks.
pages.cs.brandeis.edu
In this paper we present an algorithm for automatic extraction of textual elements, namely titles and full text, associated with news stories in news web pages. We propose a supervised machine learning classification technique based on the use of a Support Vector Machine (SVM) ...
A commercial Web page typically contains many information blocks. Apart from the main content blocks, it usually has blocks such as navigation panels, copyright and privacy notices, and advertisements (for business purposes and for easy user access). We call these blocks, which are not the main content blocks of the page, the noisy blocks. We show that the information contained in these noisy blocks can seriously harm Web data mining, so eliminating them is of great importance. In this project, we propose a noise elimination technique that uses a machine learning (ML) based method which compares HTML tag pairs to estimate how likely they are to be present in web pages. We use the J48 decision tree classifier, since a decision tree decides the target value (dependent variable) of a new sample based on the various attribute values of the available data.
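Purely as an illustration of the classification step, the sketch below trains a decision tree on hypothetical tag-pair count features; scikit-learn's DecisionTreeClassifier stands in for Weka's J48 (C4.5), and both the feature design and the tiny labelled sample are invented for demonstration.

```python
# Illustrative only: tag-pair counts per block as features, a decision tree as
# the noisy/content classifier. Features and labels below are made up.
from sklearn.tree import DecisionTreeClassifier

# each row: counts of a few tag pairs inside a block, e.g. (a,li), (p,p), (img,a)
X = [
    [12, 0, 5],   # navigation-like block: many link/list pairs
    [10, 1, 4],
    [0, 8, 0],    # article-like block: many consecutive paragraphs
    [1, 9, 1],
]
y = ["noise", "noise", "content", "content"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[0, 7, 0], [11, 0, 6]]))   # expected: ['content' 'noise']
```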
With the growth of the Internet and related tools, there has been an exponential growth of online resources. This tremendous growth has paradoxically made the task of finding, extracting and aggregating relevant information difficult. These days, finding and browsing news is one of the most important internet activities. In this paper, a hybrid method for extracting online news article contents is presented. The method combines RSS feeds and HTML Document Object Model (DOM) tree extraction. This approach is simple and effective at handling the heterogeneous news layouts and changing content that cause problems for many existing methods. The experimental results on selected news sites show that the approach can extract news article contents automatically, effectively and consistently. The proposed method can also be adopted for other news sites.
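A sketch of how such a hybrid could be wired together, assuming standard feedparser, requests, and BeautifulSoup calls; the body-selection heuristic (the <div> with the most direct paragraph text) is an illustrative stand-in for the DOM extraction step described above.

```python
# Hybrid RSS + DOM sketch: the feed supplies titles and links, the DOM of each
# linked page is searched for the block that most plausibly holds the article.
import feedparser
import requests
from bs4 import BeautifulSoup

def extract_articles(feed_url):
    articles = []
    for entry in feedparser.parse(feed_url).entries:
        html = requests.get(entry.link, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        # pick the <div> with the most direct paragraph text as the article body
        best = max(soup.find_all("div"),
                   key=lambda d: sum(len(p.get_text())
                                     for p in d.find_all("p", recursive=False)),
                   default=None)
        body = best.get_text(" ", strip=True) if best else ""
        articles.append({"title": entry.title, "url": entry.link, "body": body})
    return articles
```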
International Journal of Computer Applications, 2011
Today, the Internet has made human life dependent on it; almost anything can be searched for on the net. Web pages usually contain a huge amount of information that may not interest the user, since it may not be part of the main content of the page. To extract the main content of a web page, data mining techniques need to be applied, and a lot of research has already been done in this field. Current automatic techniques are unsatisfactory because their outputs do not match the user's query well. In this paper, we present an automatic approach that extracts the main content of a web page using a tag tree and heuristics to filter out the clutter and display the main content. Experimental results show that the technique presented in this paper is able to outperform existing techniques dramatically.
Journal of Engineering Science and Technology Review, 2020
With rapid technological development, the Web has evolved into one of the most powerful digital tools available to mankind. As the size of the web increases rapidly, extracting the interesting content from it has become a major challenge. At the same time, retrieved web pages contain many uninteresting content blocks that are not useful to the user and that degrade the performance of content extraction. These uninteresting blocks include advertisements, banners, copyright notices, navigation bars, etc., and are normally referred to as web page noise. Removing this noise from web pages is considered a primary pre-processing task. This paper presents an approach that eliminates noise and near-duplicates in order to extract the significant content from a web page. The proposed method has three steps. First, the web page is divided into blocks, and blocks considered noise are removed using tag analysis and a Document Object Model tree. Second, redundant blocks are eliminated by computing fingerprints using a modified simhash algorithm with a proximity measure. From the distinct blocks, several parameters such as Titlewords, Linkwords, and Contentwords are extracted. The significant content is then identified by computing a score for each block using a weighted block scoring mechanism; blocks with higher scores are extracted, and finally the core content of the web page is obtained. The experimental analysis shows that the proposed method eliminates noise efficiently.
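The sketch below shows a plain simhash fingerprint and Hamming-distance comparison as could be used for the near-duplicate block detection step described above; the paper's modified simhash with a proximity measure and its weighted block scoring are not reproduced here.

```python
# Plain simhash sketch: near-duplicate blocks produce fingerprints with a
# small Hamming distance.
import hashlib

def simhash(text, bits=64):
    v = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

block1 = "latest sports news and match results from the weekend"
block2 = "latest sport news and match results from this weekend"
print(hamming_distance(simhash(block1), simhash(block2)))  # small distance => near-duplicates
```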
2002
In this paper, we propose a new approach to discovering informative content from a set of tabular documents (or Web pages) of a Web site. Our system, InfoDiscoverer, first partitions a page into several content blocks according to the HTML <TABLE> tags in the page. Based on the occurrence of features (terms) in the set of pages, it calculates the entropy value of each feature. From the entropy values of the features in a content block, the entropy value of the block is defined. By analyzing this information measure, we propose a method to dynamically select the entropy threshold that partitions blocks into either informative or redundant. Informative content blocks are the distinguished parts of the page, whereas redundant content blocks are the common parts.
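A minimal sketch of the entropy idea, assuming a term's entropy is computed from the distribution of its occurrences across the pages of a site (maximal when the term is spread evenly over all pages, zero when it is page-specific) and a block's entropy is the mean entropy of its terms; the dynamic threshold selection is omitted.

```python
# Sketch: terms common to many pages of a site (menus, footers) get high
# entropy; page-specific terms get low entropy. High-entropy blocks would be
# treated as redundant, low-entropy blocks as informative.
import math
from collections import defaultdict

def term_entropies(pages_terms):
    """pages_terms: list of {term: count} dicts, one per page of the same site."""
    n = len(pages_terms)
    totals = defaultdict(float)
    per_page = defaultdict(lambda: [0.0] * n)
    for i, counts in enumerate(pages_terms):
        for t, c in counts.items():
            per_page[t][i] = c
            totals[t] += c
    entropies = {}
    for t, counts in per_page.items():
        probs = [c / totals[t] for c in counts if c > 0]
        h = -sum(p * math.log(p, 2) for p in probs)
        entropies[t] = h / math.log(n, 2) if n > 1 else 0.0   # normalised to [0, 1]
    return entropies

def block_entropy(block_terms, entropies):
    values = [entropies.get(t, 0.0) for t in block_terms]
    return sum(values) / len(values) if values else 0.0
```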
2013
Web pages contain a lot of noise in the form of advertisements, irrelevant information, copyright notices, and menus. To extract information from the web, we use two cues: text density and the title of the page. Generally, the main content of a page is denser than the rest, while noisy regions carry less text. The title is the most important piece of information on the page, telling us what the page is about. We therefore extract all content that is denser than a particular threshold or that contains at least one keyword derived from the title of the page. Using this approach, more false negatives can be avoided, and it gives very satisfactory results.
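A minimal sketch of the two cues described above, assuming a block is kept when its text density exceeds a threshold or its text shares a keyword with the page title; the threshold value and tokenisation are illustrative.

```python
# Keep a block if it is text-dense enough or mentions a title keyword.
import re

def title_keywords(title):
    return {w for w in re.findall(r"\w+", title.lower()) if len(w) > 3}

def keep_block(block_text, block_html, title, density_threshold=0.5):
    density = len(block_text) / max(len(block_html), 1)   # text chars per markup+text char
    words = set(re.findall(r"\w+", block_text.lower()))
    return density >= density_threshold or bool(words & title_keywords(title))
```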
2013
Noise on web pages makes it harder to mine their main content. Web pages typically contain a large amount of information, such as advertisements, navigation bars, and copyright notices, that is not part of the main information of the page; this is called noise. Typically, we apply data mining techniques such as classification and clustering after cleaning the noise from the web pages, and eliminating noise blocks improves the accuracy and efficiency of web content mining. This paper proposes a method called the "web information extraction and classification framework". In this method we remove noisy data from a web page, then retrieve the most relevant information from the same page, and finally perform classification using a Support Vector Machine, which gives accurate and efficient results.
Web content extraction is a key technology for enabling an array of applications aimed at understanding the web. While automated web extraction has been studied extensively, existing work often focuses on extracting structured data that appears multiple times on a single webpage, such as product catalogs. This project aims to extract less structured web content, such as news articles, that appears only once in noisy webpages. Our approach classifies text blocks using a mixture of visual and language-independent features. In addition, a pipeline is devised to automatically label datapoints through clustering, where each cluster is scored based on its relevance to the webpage description extracted from the meta tags, and datapoints in the best cluster are selected as positive training examples.
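A rough sketch of the automatic-labelling pipeline described above, assuming TF-IDF features, k-means clustering, and cosine similarity against the meta description; the feature set, cluster count, and scoring are illustrative assumptions rather than the project's actual configuration.

```python
# Cluster text blocks, score clusters against the page's meta description,
# and treat blocks in the best-scoring cluster as positive training examples.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def label_blocks(block_texts, meta_description, n_clusters=3):
    vec = TfidfVectorizer()
    X = vec.fit_transform(block_texts + [meta_description])
    blocks, meta = X[:-1], X[-1]
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(blocks)
    scores = []
    for k in range(n_clusters):
        centroid = np.asarray(blocks[labels == k].mean(axis=0))
        scores.append(cosine_similarity(centroid, meta)[0, 0])
    best = int(np.argmax(scores))
    return [int(lab) == best for lab in labels]   # True = positive example
```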
Proceedings of the 12th International Conference on Knowledge Management and Knowledge Technologies - i-KNOW '12, 2012
The vast amount of user-generated content on the Web has increased the need for handling the problem of automatically processing content in web pages. The segmentation of web pages and noise (non-informative segment) removal are important pre-processing steps in a variety of applications such as sentiment analysis, text summarization and information retrieval. Currently, these two tasks tend to be handled separately or are handled together without emphasizing the diversity of the web corpora and the web page type detection. We present a unified approach that is able to provide robust identification of informative textual parts in web pages along with accurate type detection. The proposed algorithm takes into account visual and non-visual characteristics of a web page and is able to remove noisy parts from three major categories of pages which contain user-generated content (News, Blogs, Discussions). Based on a human annotated corpus consisting of diverse topics, domains and templates, we demonstrate the learning abilities of our algorithm, we examine its effectiveness in extracting the informative textual parts and its usage as a rule-based classifier for web page type detection in a realistic web setting.