
DOM Tree as the base for webpage content extraction: Review

Abstract

The rapid growth of internet technology has produced an increasing volume of unstructured information across web pages, making effective content extraction essential. This paper reviews DOM tree-based methodologies from 2011 to 2021, focusing on approaches that extract relevant data while filtering out noise such as advertisements and navigation elements. By comparing the classification methods, limitations, and evaluation metrics of these approaches, the work highlights the role of the DOM tree in improving the accuracy and efficiency of information extraction on the web.
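
To make the idea of DOM tree-based extraction concrete, the following is a minimal sketch and not a method taken from any of the reviewed papers. It builds a simplified DOM tree with Python's standard `html.parser` module and collects text while skipping subtrees rooted at tags that commonly hold noise. The `NOISE_TAGS` list, the `Node` and `DomBuilder` classes, and the `extract_main_content` helper are all hypothetical names introduced for illustration, and the sketch assumes well-formed, properly nested HTML.

```python
from html.parser import HTMLParser

# Hypothetical noise list, chosen for illustration only.
NOISE_TAGS = {"nav", "aside", "footer", "header", "script", "style"}
# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """One element in the simplified DOM tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # child Node objects and text strings, in document order


class DomBuilder(HTMLParser):
    """Builds a simplified DOM tree; assumes well-formed, properly nested HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.current.parent is not None:
            self.current = self.current.parent  # climb back to the parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.children.append(text)


def extract_text(node):
    """Collect text depth-first, skipping subtrees rooted at noise tags."""
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
        elif child.tag not in NOISE_TAGS:
            parts.extend(extract_text(child))
    return parts


def extract_main_content(html):
    """Parse the HTML string and return the non-noise text joined by spaces."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return " ".join(extract_text(builder.root))
```

Calling `extract_main_content` on a page whose `<body>` holds a `<nav>` menu, a `<div>` with the article text, and a `<footer>` returns only the text inside the `<div>`. A fixed tag blacklist is the simplest possible filter; it stands in here for the classification step and makes no claim about how any reviewed approach actually decides which nodes are noise.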