The rapid rise of internet technology has generated an increasing volume of unstructured information across web pages, making effective content extraction essential. This paper reviews DOM tree-based methodologies from 2011 to 2021, focusing on approaches for extracting relevant data while minimizing noise, such as advertisements and navigation elements. By comparing various classification methods, limitations, and evaluation metrics, the work highlights the role of the DOM tree in enhancing the accuracy and efficiency of information extraction processes on the web.
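As a purely illustrative companion to the survey, the sketch below shows one of the simplest DOM-tree heuristics such methods build on: scoring elements by text length versus link density to separate informative content from navigation and advertising noise. It assumes the third-party package beautifulsoup4 and a toy HTML snippet; it is not taken from any of the reviewed papers.

```python
# Minimal DOM-tree heuristic: long text with few links is likely main content.
from bs4 import BeautifulSoup

def score_nodes(html: str):
    soup = BeautifulSoup(html, "html.parser")
    scores = []
    for node in soup.find_all(["div", "article", "section", "td"]):
        text = node.get_text(" ", strip=True)
        if not text:
            continue
        link_text = " ".join(a.get_text(" ", strip=True) for a in node.find_all("a"))
        link_density = len(link_text) / max(len(text), 1)
        scores.append((len(text) * (1.0 - link_density), node))
    return sorted(scores, key=lambda pair: pair[0], reverse=True)

if __name__ == "__main__":
    html = "<html><body><div><a href='#'>Nav</a></div><article>Main story text about the topic.</article></body></html>"
    best_score, best_node = score_nodes(html)[0]
    print(best_node.get_text(strip=True))  # prints the article text, not the nav link
```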
Health Informatics Journal, 2006
As the number of medical web sites in various languages increases, it is increasingly necessary to establish specific criteria and control measures that give consumers some guarantee that the health web sites they are visiting meet a minimum level of quality standards, and that the professionals offering the information are suitably qualified. The paper briefly presents the current mechanisms for labelling medical web content and introduces the work done in the EC-funded project Quatro, which has defined a vocabulary for quality labels and a schema to deliver them in a machine-processable format. In addition, the paper proposes the development of a labelling platform that will assist the work of medical labelling agencies by automating, up to a certain level, the retrieval of unlabelled medical web sites and their labelling, and the monitoring of labelled web sites as to whether they still satisfy the criteria.
The Internet has revolutionized the way knowledge can be accessed and presented. However, the explosion of web content that has followed is now producing major difficulties for effective selection and retrieval of information that is relevant for the task at hand. In disseminating clinical guidelines and other knowledge sources in healthcare, for example, it may be desirable to provide a presentation of current knowledge about best practice that is limited to material appropriate for the current patient context. A promising solution to this problem is to augment conventional guideline documents with decision-making and other "intelligent" services tailored to specific needs at the point of care. In this paper we describe how BMJ's Clinical Evidence, a well-known medical reference on the web, was enhanced with patient data acquisition and decision support services implemented in PROforma.
2017
Web documents contain vast amounts of information that can be extracted and processed to enhance the understanding of online data. Often, the structure of the document can be exploited in order to identify useful information within it. Pairs of attributes and their corresponding values are one such example of information frequently found in many online retail websites. These concentrated bits of information are often enclosed in specific tags of the web document, or highlighted with certain markers which can be automatically discovered and identified. This way, different methods can be employed to extract new pairs from other, more or less similar, documents. The method presented in this paper relies on the DOM (Document Object Model) structure and the text within web pages in order to extract patterns consisting of tags and pieces of text and then to classify them. Several classifiers have been compared and the best results have been obtained with a C4.5 decision tree classifier.
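The following is a hedged sketch of the general idea described above: turn DOM text nodes into (tag-path, text) candidates and classify them as parts of attribute-value pairs with a decision tree. scikit-learn's DecisionTreeClassifier (CART) stands in for the C4.5 classifier used in the paper, and the features, training snippet and labels are invented for illustration.

```python
# Extract (tag-path, text) candidates from a page and classify them.
from bs4 import BeautifulSoup
from sklearn.tree import DecisionTreeClassifier

def candidates(html: str):
    soup = BeautifulSoup(html, "html.parser")
    for node in soup.find_all(string=True):
        text = node.strip()
        if text:
            path = "/".join(p.name for p in reversed(list(node.parents)) if p.name != "[document]")
            yield path, text

def features(path: str, text: str):
    # simple surface features: path depth, text length, digit count, trailing colon
    return [path.count("/"), len(text), sum(c.isdigit() for c in text), int(text.endswith(":"))]

# toy training data: 1 = part of an attribute-value pair, 0 = other text
train_html = "<table><tr><td>Weight:</td><td>2 kg</td></tr></table><p>Free shipping on all orders.</p>"
X = [features(p, t) for p, t in candidates(train_html)]
y = [1, 1, 0]
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(clf.predict([features("table/tr/td", "Colour:")]))
```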
Since the Internet boom, consumer perception has become a complex issue to measure. Non-traditional advertising methods and new product exhibition alternatives have emerged. Forums and review sites allow end users to suggest, recommend or rate products according to their experiences, which has given rise to the study of such data collections. After being analyzed, stored and processed properly, they are used to produce reports that assist middle and senior management in decision making. This research aims to apply concepts and approaches of artificial intelligence to this area. The framework proposed here (named GDARIM) can be parameterized and adapted to similar problems in different fields. To do that, it first performs a deep problem analysis to determine the domain-specific variables and attributes. Then, it implements specific functionality for the current data collection and available storage. Next, the data is analyzed and processed, using Genetic Algorithms to feed back into the keywords initially loaded. Finally, reports of the results are presented to stakeholders.
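As a purely illustrative sketch (not the GDARIM implementation), the snippet below shows one way a genetic algorithm can refine an initially loaded keyword list against a collection of user reviews: individuals are keyword subsets and fitness rewards review coverage while penalizing list size. All data and parameters are invented.

```python
import random

REVIEWS = ["battery life is great", "poor battery, bad screen", "love the camera and screen"]
SEED_KEYWORDS = ["battery", "screen", "camera", "price", "shipping", "warranty"]

def fitness(individual):
    chosen = [k for k, bit in zip(SEED_KEYWORDS, individual) if bit]
    # reviews matched by at least one keyword, minus a small penalty per keyword
    return sum(any(k in review for k in chosen) for review in REVIEWS) - 0.1 * len(chosen)

def evolve(generations=30, pop_size=20, rng=random.Random(0)):
    pop = [[rng.randint(0, 1) for _ in SEED_KEYWORDS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(SEED_KEYWORDS))    # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:                        # mutation
                i = rng.randrange(len(SEED_KEYWORDS))
                child[i] = 1 - child[i]
            children.append(child)
        pop = parents + children
    best = max(pop, key=fitness)
    return [k for k, bit in zip(SEED_KEYWORDS, best) if bit]

print(evolve())
```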
2006
Existing methods of information extraction from HTML documents include manual approaches, supervised learning and automatic techniques. The manual method achieves high precision and recall but is difficult to apply to a large number of pages. Supervised learning involves human interaction to create positive and negative samples. Automatic techniques require less human effort but are not highly reliable regarding the information retrieved. Our experiments fall into this last category of methods; for this purpose we developed a tool for automatic data extraction from HTML pages.
Lecture Notes in Computer Science, 2007
As the number of health-related web sites in various languages increases, it is increasingly necessary to implement control mechanisms that give users an adequate guarantee that the web resources they are visiting meet a minimum level of quality standards. Based upon state-of-the-art technology in the areas of semantic web, content analysis and quality labeling, the AQUA system, designed for the EC-funded project MedIEQ, aims to support the automation of the labeling process for health-related web content. AQUA provides tools that crawl the web to locate unlabelled health web resources in different European languages, as well as tools that traverse websites, identify and extract information and, upon this information, propose labels or monitor already labeled resources. Two major steps in this automated labeling process are web content collection and information extraction. This paper focuses on content collection. We describe existing approaches, present the architecture of the content collection toolkit and how it is integrated within the AQUA system, and discuss our initial experimental results in the English language (six more languages will be covered by the end of the project).
Proceedings of the 2017 International Conference on Digital Health, 2017
Automatic assessment of the quality of online health information is a pressing need, especially with the massive growth of online content. In this paper, we present an approach to assessing the quality of health webpages based on their content rather than on purely technical features, by applying machine learning techniques to the automatic identification of evidence-based health information. Several machine learning approaches were applied to learn classifiers using different combinations of features. Three datasets were used in this study for three different diseases, namely shingles, flu and migraine. The results obtained using the classifiers were promising in terms of precision and recall, especially for diseases with few different pathogenic mechanisms.
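A minimal sketch of this kind of content-based classification, not the authors' system: TF-IDF features and a linear SVM stand in for whichever feature combinations the paper evaluated, and the tiny dataset is invented purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

pages = [
    "randomized controlled trial shows vaccine efficacy, see systematic review",
    "this miracle herb cures flu overnight, doctors hate it",
    "cochrane meta-analysis of antiviral treatment outcomes",
    "share this secret remedy with your friends now",
]
labels = [1, 0, 1, 0]  # 1 = evidence-based, 0 = not

# bag-of-words pipeline over page text; no technical/site features involved
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(pages, labels)
print(model.predict(["systematic review of migraine prophylaxis trials"]))
```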
Information Processing & Management, 2013
Eliminating noisy information and extracting informative content have become important issues for web mining, search and accessibility. This extraction process can employ automatic techniques and hand-crafted rules. Automatic extraction techniques focus on various machine learning methods, but implementing these techniques increases the time complexity of the extraction process. Conversely, extraction through hand-crafted rules is an efficient technique that uses string manipulation functions, but preparing these rules is difficult and cumbersome for users. In this paper, we present a hybrid approach consisting of two steps that can invoke each other. The first step discovers informative content using Decision Tree Learning as an appropriate machine learning method and creates rules from the results of this learning method. The second step extracts informative content using the rules obtained from the first step. However, if the second step does not return an extraction result, the first step is invoked again. In our experiments, the first step achieves a high accuracy of 95.76% in extraction of the informative content. Moreover, 71.92% of the rules can be used in the extraction process, and rule-based extraction is approximately 240 times faster than the first step.
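A hedged sketch of this two-step control flow, under invented features and thresholds: a decision tree is learned over simple block features, a cheap rule distilled from it runs first at extraction time, and the learner is only invoked when the rule does not fire.

```python
from sklearn.tree import DecisionTreeClassifier

def block_features(block: str):
    words = block.split()
    links = block.count("<a ")
    return [len(words), links, links / max(len(words), 1)]

# step 1: learn from labelled blocks (1 = informative, 0 = noise)
train_blocks = ["Long article paragraph " * 20, "<a href='#'>Home</a> <a href='#'>About</a>"]
tree = DecisionTreeClassifier(max_depth=2).fit([block_features(b) for b in train_blocks], [1, 0])

def rule_based(block: str):
    # rule distilled from the learned tree: many words and a low link ratio
    n_words, links, ratio = block_features(block)
    if n_words > 40 and ratio < 0.1:
        return "informative"
    if n_words < 5 and links > 0:
        return "noise"
    return None  # rule does not fire

def extract(block: str):
    label = rule_based(block)            # step 2: fast rule-based extraction
    if label is None:                    # fall back to the learner (step 1)
        label = "informative" if tree.predict([block_features(block)])[0] == 1 else "noise"
    return label

print(extract("Short menu <a href='#'>link</a>"))
```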
Journal of open source software, 2021
Inscriptis provides a library, command line client and Web service for converting HTML to plain text. Its development has been triggered by the need to obtain accurate text representations for knowledge extraction tasks that preserve the spatial alignment of text without drawing upon heavyweight, browser-based solutions such as Selenium (Huggins et al., 2021). In contrast to existing software packages such as HTML2text (Swartz, 2021), jusText (Belica, 2021) and Lynx (Dickey, 2021), Inscriptis (1) provides a layout-aware conversion of HTML that more closely resembles the rendering obtained from standard Web browsers and, therefore, better preserves the spatial arrangement of text elements; it excels in terms of conversion quality, since it correctly converts complex HTML constructs such as nested tables and also interprets a subset of HTML (e.g., align, valign) and CSS (e.g., display, white-space, margin-top, vertical-align) attributes that determine text alignment; and (2) supports annotation rules, i.e., user-provided mappings that allow for annotating the extracted text based on structural and semantic information encoded in HTML tags and attributes used for controlling structure and layout in the original HTML document. These unique features ensure that downstream knowledge extraction components can operate on accurate text representations, and may even use information on the semantics and structure of the original HTML document, if annotation support has been enabled. Statement of need: research in a growing number of scientific disciplines relies upon Web content. Li et al. (2014), for instance, studied the impact of company-specific news coverage on stock prices; in medicine and pharmacovigilance, social media listening plays an important role in gathering insights into patient needs and the monitoring of adverse drug effects (Convertino et al., 2018); and communication sciences analyze media coverage to obtain information on the perception and framing of issues as well as on the rise and fall of topics within news and social media (Scharl et al., 2017; Weichselbraun et al., 2021). Computer science focuses on analyzing content by applying knowledge extraction techniques such as entity recognition (Fu et al., 2021) to automatically identify entities (e.g., persons, organizations, locations, products) within text documents, entity linking (Ding et al., 2021) to link these entities to knowledge bases such as Wikidata and DBPedia, and sentiment analysis.
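A minimal usage sketch of the Inscriptis library described above: convert an HTML document to layout-aware plain text with get_text(). The example HTML is invented; annotation rules (not shown here) are passed through Inscriptis' parser configuration, for which the project documentation should be consulted.

```python
from inscriptis import get_text

html = """
<html><body>
  <h1>Quarterly results</h1>
  <table>
    <tr><td>Revenue</td><td>1.2 M</td></tr>
    <tr><td>Profit</td><td>0.3 M</td></tr>
  </table>
</body></html>
"""

text = get_text(html)   # preserves the spatial arrangement, e.g. aligned table columns
print(text)
```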
2000
The WWW is an important channel of information exchange in many domains, including the medical one. The ever increasing amount of freely available healthcare-related information generates, on the one hand, excellent conditions for self-education of patients as well as physicians, but on the other hand entails substantial risks if such information is trusted irrespective of low competence or even bad intentions of its authors. This is why medical website certification (also called 'quality labeling') by renowned authorities is of high importance. In this respect, it recently became obvious that the labeling process could benefit from employment of web mining and information extraction techniques, in combination with flexible methods of web-based information management developed within the semantic web initiative. Achieving such synergy is the central issue in the MedIEQ project. The AQUA (Assisting QUality Assessment) system, developed within the MedIEQ project, aims to provide the infrastructure and the means to organize and support various aspects of the daily work of labeling experts.
Polibits, 2014
The evolution of the Web from the original proposal made in 1989 can be considered one of the most revolutionary technological changes in centuries. During the past 25 years the Web has evolved from a static version to a fully dynamic and interoperable intelligent ecosystem. The amount of data produced during these few decades is enormous. New applications, developed by individual developers or small companies, can take advantage of both services and data already present on the Web. Data, produced by humans and machines, may be available in different formats and through different access interfaces. This paper analyses three different types of data available on the Web and presents mechanisms for accessing and extracting this information. The authors show several applications that leverage extracted information in two areas of research: recommendations of educational resources beyond content and interactive digital TV applications.
TELKOMNIKA Telecommunication Computing Electronics and Control, 2021
In the era of industrial digitalization, people are increasingly investing in solutions that support their processes for data collection, data analysis and performance improvement. In this paper, advancing web-scale knowledge extraction and alignment by integrating a few sources and exploring different methods of aggregation and attention is considered, focusing on image information. The main aim of data extraction with regard to semi-structured data is to retrieve beneficial information from the web. This data from the web, also known as the deep web, is retrievable but requires a request through form submission, because it cannot be obtained by search engines. As HTML documents grow larger, the process of data extraction has been plagued by lengthy processing times. In this research work, we propose an improved model, wrapper extraction of images using the document object model (DOM) and JavaScript object notation (JSON) data (WEIDJ), in response to the promising results of mining a higher volume of images in various formats. To observe the efficiency of WEIDJ, we compare the performance of data extraction at different levels of page extraction with VIBS, MDR, DEPTA and VIDE. It yielded the best results, with a Precision of 100, a Recall of 97.93103 and an F-measure of 98.9547.
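A hedged sketch of the general DOM-plus-JSON idea (not the WEIDJ model itself): walk the DOM, collect every image together with some surrounding context, and serialise the result as JSON for downstream processing. The HTML snippet and field names are assumptions for illustration, and beautifulsoup4 is assumed as the parser.

```python
import json
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_images(html: str, base_url: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for img in soup.find_all("img"):
        parent = img.find_parent()
        records.append({
            "src": urljoin(base_url, img.get("src", "")),      # absolute image URL
            "alt": img.get("alt", ""),
            "context": parent.get_text(" ", strip=True) if parent else "",
        })
    return json.dumps(records, indent=2)

html = '<div><img src="/p/1.jpg" alt="red shoe"><span>Running shoe, size 42</span></div>'
print(extract_images(html, "https://example.com"))
```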
Web content extraction is a key technology for enabling an array of applications aimed at understanding the web. While automated web extraction has been studied extensively, existing approaches often focus on extracting structured data that appears multiple times on a single webpage, like product catalogs. This project aims to extract less structured web content, like news articles, that appears only once in noisy webpages. Our approach classifies text blocks using a mixture of visual and language-independent features. In addition, a pipeline is devised to automatically label datapoints through clustering, where each cluster is scored based on its relevance to the webpage description extracted from the meta tags, and datapoints in the best cluster are selected as positive training examples.
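A rough sketch of that auto-labelling step, under invented features and data: cluster text blocks by simple language-independent features, score each cluster against the page's meta description, and treat the best-matching cluster as positive training examples.

```python
from sklearn.cluster import KMeans

blocks = [
    "Breaking news: the city council approved the new transit plan on Tuesday.",
    "Home | Sports | Politics | Contact",
    "The plan allocates funding for bus lanes and is expected to start next year.",
    "Copyright 2017 Example Media. All rights reserved.",
]
meta_description = "city council approves transit plan with new bus lanes"

def block_features(b: str):
    words = b.split()
    # word count, menu-separator count, ratio of capitalised words
    return [len(words), b.count("|"), sum(w[0].isupper() for w in words) / max(len(words), 1)]

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict([block_features(b) for b in blocks])

def overlap(text: str) -> int:
    return len(set(text.lower().split()) & set(meta_description.lower().split()))

# pick the cluster whose blocks best match the meta description
best = max(set(clusters), key=lambda c: sum(overlap(b) for b, lab in zip(blocks, clusters) if lab == c))
positives = [b for b, lab in zip(blocks, clusters) if lab == best]
print(positives)
```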
2009
Data mining is a fast-developing field of study, using computations to either predict or describe large amounts of data. The increase in data produced each year goes hand in hand with this, requiring algorithms that are more and more efficient in order to find interesting information within a given time. In this thesis, we study methods for extracting information from semi-structured data, for finding structure within large sets of discrete data, and for efficiently ranking web pages in a topic-sensitive way. The information extraction research focuses on support for keeping both documentation and source code up to date at the same time. Our approach to this problem is to embed parts of the documentation within strategic comments of the source code and then extract them by using a specific tool. The structures that our structure mining algorithms are able to find among crisp data (such as keywords) are in the form of subsumptions, i.e. one keyword is a more general form of the other. We can use these subsumptions to build larger structures in the form of hierarchies or lattices, since subsumptions are transitive. Our tool has been used mainly as input to data mining systems and for visualisation of datasets. The main part of the research has been on ranking web pages in such a way that both the link structure between pages and the content of each page matter. We have created a number of algorithms and compared them to other algorithms in use today. Our focus in these comparisons has been on convergence rate, algorithm stability and how relevant the answer sets from the algorithms are according to real-world users. The research has focused on the development of efficient algorithms for gathering and handling large datasets of discrete and textual data. A proposed system of tools is described, all operating on a common database containing "fingerprints" and meta-data about items. This data could be searched by various algorithms to increase its usefulness or to find the real data more efficiently. All of the methods described handle data in a crisp manner, i.e. a word or a hyperlink either is or is not a part of a record or web page. This means that we can model their existence in a very efficient way. The methods and algorithms that we describe all make use of this fact.
Applied Sciences
This paper discusses a tool for main text and image extraction (extracting and parsing the important data) from a web document. It describes our proposed algorithm based on the Document Object Model (DOM) and natural language processing (NLP) techniques, as well as other approaches for extracting information from web pages using various classification techniques such as support vector machines, decision trees, naive Bayes, and K-nearest neighbor. The main aim of the developed algorithm is to identify and extract the main block of a web document that contains the text of the article and the relevant images. The algorithm was applied to a sample of 45 web documents of different types. In addition, the issue of web pages, from the structure of the document to the use of the Document Object Model (DOM) for their processing, was analyzed. The Document Object Model was used to load and navigate the document. It also plays an important role in the correct identification...
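A small sketch, with invented toy features and data, of the kind of comparison the paper describes: the same DOM-block features fed to SVM, decision tree, naive Bayes and k-nearest-neighbour classifiers to decide whether a block is the main article block or boilerplate.

```python
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# features per DOM block: [text length, link count, image count]
X = [[1200, 2, 1], [40, 15, 0], [900, 1, 2], [30, 12, 0], [1500, 3, 1], [25, 9, 0]]
y = [1, 0, 1, 0, 1, 0]  # 1 = main content block, 0 = boilerplate

for clf in (SVC(), DecisionTreeClassifier(), GaussianNB(), KNeighborsClassifier(n_neighbors=3)):
    clf.fit(X, y)
    print(type(clf).__name__, clf.predict([[1100, 2, 1]]))
```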
2007
Users visiting health-related web sites would be served best if they knew whether these sites meet a minimum level of quality standards. However, manually labelling health resources is a tedious task. Based upon state-of-the-art technology in the areas of semantic web, content analysis and labelling, the MedIEQ project integrates existing technologies and tests them in a novel application: AQUA, a system aiming to automate parts of the labelling process in health-related web content. AQUA provides tools that enable the creation of machine-readable labels, tools that crawl the web to locate unlabelled health web resources, suggest labels for them according to predefined labelling criteria, and monitor them. This paper describes the current status in the area of health information labelling and explains step by step how AQUA paves the way towards the automation of the labelling process.
The World Wide Web contains a huge amount of unstructured and semi-structured information that is increasing exponentially with the coming of Web 2.0, thanks to User-Generated Content (UGC). In this paper we briefly survey the fields of application, in particular enterprise and social applications, and the techniques used to approach and solve the problem of extracting information from Web sources: over the last years many approaches have been developed, some inherited from past studies on Information Extraction (IE) systems, many others designed ad hoc to solve specific problems.
2006
The Web is a vast data repository. By mining this data efficiently, we can gain valuable knowledge. Unfortunately, in addition to useful content there are also many Web documents considered harmful (e.g. pornography, terrorism, illegal drugs). Web mining, which includes three main areas (content, structure, and usage mining), may help us detect and eliminate these sites. In this paper, we concentrate on applications of Web content and Web structure mining. First, we introduce a system for the detection of pornographic textual Web pages. We discuss its classification methods and depict its architecture. Second, we present an analysis of relations among Czech academic computer science Web sites. We give an overview of ranking algorithms and determine the importance of the sites we analyzed.
Journal of Computer …, 2009
Problem statement: Nowadays, many users use web search engines to find and gather information. Users face an increasing amount of various HTML information sources, so the issue of correlating, integrating and presenting related information to users becomes important. When a user uses a search engine such as Yahoo or Google to seek specific information, the results include not only information about the availability of the desired information, but also information about other pages on which the desired information is mentioned, and the number of selected pages is enormous. Therefore, the performance capabilities, the overlap among results for the same queries and the limitations of web search engines are an important and large area of research. Extracting information from web pages also becomes very important, because the massive and increasing amount of diverse HTML information sources on the internet and the variety of web pages make information extraction from the web a challenging problem. Approach: This study proposed an approach for extracting information from HTML web pages which was able to extract relevant information from different web pages based on standard classifications. Results: The proposed approach was evaluated by conducting experiments on a number of web pages from different domains and achieved an increase in precision and F-measure as well as a decrease in recall. Conclusion: Experiments demonstrated that our approach extracted the attributes, besides the sub-attributes that describe the extracted attributes and the values of the sub-attributes, from various web pages. The proposed approach was able to extract attributes that appear under different names in some of the web pages.
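An illustrative sketch (not the authors' method) of pulling attributes, sub-attributes and their values out of a specification table; the HTML layout below is an assumption about a typical product page, with section headings as attributes and rows as sub-attribute/value pairs.

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th colspan="2">Display</th></tr>
  <tr><td>Size</td><td>15.6 inch</td></tr>
  <tr><td>Resolution</td><td>1920 x 1080</td></tr>
  <tr><th colspan="2">Memory</th></tr>
  <tr><td>RAM</td><td>16 GB</td></tr>
</table>
"""

def extract_attributes(html: str):
    soup = BeautifulSoup(html, "html.parser")
    result, current = {}, None
    for row in soup.find_all("tr"):
        header = row.find("th")
        if header:                       # attribute (section heading)
            current = header.get_text(strip=True)
            result[current] = {}
        elif current:                    # sub-attribute and its value
            cells = [td.get_text(strip=True) for td in row.find_all("td")]
            if len(cells) == 2:
                result[current][cells[0]] = cells[1]
    return result

print(extract_attributes(html))
# {'Display': {'Size': '15.6 inch', 'Resolution': '1920 x 1080'}, 'Memory': {'RAM': '16 GB'}}
```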