Automated HTML Data Extraction

This paper presents RoadRunner, a novel technique for automated wrapper generation and data extraction from HTML sites. RoadRunner takes two sample HTML pages as input and generates a wrapper by finding the similarities and differences between the pages, without requiring any human labeling. It handles nested data structures and produces a generalized wrapper using a match algorithm. The authors evaluate RoadRunner on real-world sites and show that it extracts data more quickly than other tools while also handling optional fields and nested data.

Uploaded by

xkjon


Paper Review Form

Section I. Overview

A. Reader Interest

1. Which category describes this manuscript?

This is a research paper on data extraction from HTML sites using automated
wrapper generation. The authors have designed a tool called RoadRunner, which
automatically generates wrappers from sample HTML pages and uses them to extract
the data.

B. Content

1. Please explain how this manuscript advances this field of research and/or contributes
something new to the literature.
The authors address the problem of automated wrapper generation by finding the
similarities and differences between two HTML pages of the same class. Their process does not
require users to supply labeled samples (which act as additional information for the wrapper
generator), as the approaches in the related literature do. RoadRunner's wrapper generator infers
the schema as part of wrapper generation, and the system is able to handle nested structures (it
is not restricted to flat records, as other approaches are). They solve schema finding and data
extraction by finding the minimal UFRE (union-free regular expression) whose language contains
the input HTML strings. This is a novel approach that has not been used before. They use a match
algorithm that takes a wrapper w (generated from the first input page) and a sample string s
(input page 2) and returns a generalized wrapper as output. One of the references, by E.J.
Neuhold, discusses generating wrappers using attributed grammars, which are evaluated with a
fault-tolerant parsing strategy to cope with ambiguous grammars and irregular sources.
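
To make the idea of generalizing a wrapper against a second page concrete, here is a minimal sketch in Python. It is not the authors' match algorithm: it swaps in the standard library's difflib.SequenceMatcher to align the two token lists, and it does not handle iterators (repeated patterns) or backtracking. The function names, the sample pages, and the "#PCDATA" and "(...)?" notation used for fields and optionals are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# A token is either a whole tag or a run of text between tags.
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Split HTML into tag tokens and non-empty text tokens."""
    return [t.strip() for t in TOKEN.findall(html) if t.strip()]


def is_tag(token):
    return token.startswith("<")


def generalize(wrapper, sample):
    """Align wrapper and sample tokens and generalize the mismatches.

    Text-vs-text mismatches become a #PCDATA field; tokens found on only
    one side (or mismatches involving tags) are wrapped as optionals.
    This is a simplification, not the paper's match procedure.
    """
    out = []
    matcher = SequenceMatcher(a=wrapper, b=sample, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        w, s = wrapper[i1:i2], sample[j1:j2]
        if op == "equal":
            out.extend(w)
        elif op == "replace" and len(w) == len(s) and not any(map(is_tag, w + s)):
            # String mismatch: the text differs, so treat it as a data field.
            out.extend(["#PCDATA"] * len(w))
        else:
            # Tag mismatch: mark whatever appears on either side as optional.
            for chunk in (w, s):
                if chunk:
                    out.append("(" + " ".join(chunk) + ")?")
    return out


# Hypothetical sample pages of the same class.
page1 = "<html><b>John Smith</b><ul><li>DB Primer</li></ul></html>"
page2 = ("<html><b>Paul Jones</b><img src='x.gif'/>"
         "<ul><li>XML at Work</li></ul></html>")

generalized = generalize(tokenize(page1), tokenize(page2))
print(" ".join(generalized))
```

In this sketch the two author names and the two book titles collapse into #PCDATA fields, and the image tag that appears only on the second page becomes an optional.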

2. Is the manuscript technically sound?

Yes. It explains in detail the match algorithm used to generalize the wrapper.

C. Presentation

1. Are the title, abstract, and keywords appropriate?

Yes

2. Does the manuscript contain sufficient and appropriate references?

References are sufficient and appropriate. The authors have tried to evaluate all
the existing approaches to data extraction and to compare those approaches with
their own.

3. Does the introduction state the objectives of the manuscript in terms that encourage
the reader to read on?
Yes. The overview and contributions section clearly describes how their technique compares with
the techniques used in the related literature.

4. How would you rate the organization of the manuscript? Is it focused? Is the length
appropriate for the topic?
The organization of this manuscript seems satisfactory. The authors describe the wrapper
generation process and the theoretical issues involved in finding a minimal UFRE for HTML input
samples. The manuscript is focused, and the authors have tried to evaluate the time taken by
various other wrapper generators and data extraction tools.

5. Please rate and comment on the readability of this manuscript.

Easy to read

Section II. Evaluation

Please rate the manuscript. Explain your choice.

Award Quality

Section III. Detailed Comments

The wrapper generation process described in this paper is fairly novel. It uses two HTML
sample pages to generate a generalized wrapper by comparing their similarities and
differences. Their technique doesn't need any human intervention when a mismatch in
the tags occurs. It automatically identifies the optional strings and matches them. The
wrapper generator doesn't need to know the schema according to which the data is
organized. It takes care of flat records as well as nested attributes present in the HTML
sample. The authors have a good understanding of the problem and have addressed some of
the complexities that other researchers in the related literature had faced (finding a
regular grammar, which is needed for generating wrappers from a set of web pages).

The authors compare their work with others and have analyzed the results. The way they
have conducted the experiments is interesting. They have actually conducted data
extraction on several sites, such as Amazon, and have shown whether they were able to
extract data. They have taken into account some of the schema details, like levels of
nesting, number of attributes, and number of optional strings used in resolving the
mismatches. They have also compared RoadRunner with other tools, namely Wien and Stalker,
and have shown that RoadRunner was able to extract data more quickly than the other
tools. They show that tools like Wien and Stalker were unable to handle optional fields
and nested structures in HTML samples.

They have detailed the match algorithm and have discussed, with examples, how the matching
process works (matching of strings; matching of tags and discovery of optional strings when
a mismatch occurs; and wrapper generalization). They explain the match algorithm with two
examples: a simple one and a slightly more complicated one (which involves backtracking,
identifying the internal mismatches, and identifying optional patterns like
<i>Special</i>). This paper is a definite plus since it shows that their technique is
clearly superior to its counterparts (in terms of time and in the wrapper generation
process).
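
As a rough illustration of how a generalized wrapper containing an optional pattern such as <i>Special</i> could be applied to extract data, the following Python sketch turns a wrapper into an ordinary regular expression. The wrapper, the sample page, and the function name wrapper_to_regex are hypothetical and are not taken from the paper; the sketch only assumes that text fields are written as #PCDATA and optionals as "(...)?".

```python
import re


def wrapper_to_regex(wrapper):
    """Compile a flat wrapper (a list of tokens) into a regular expression.

    "#PCDATA" becomes a capturing group for a text field, "(...)?" becomes
    an optional non-capturing group, and any other token is matched
    literally. Whitespace is allowed between tokens.
    """
    parts = []
    for token in wrapper:
        if token == "#PCDATA":
            parts.append(r"([^<]+)")
        elif token.startswith("(") and token.endswith(")?"):
            parts.append("(?:" + re.escape(token[1:-2]) + ")?")
        else:
            parts.append(re.escape(token))
    return re.compile(r"\s*".join(parts))


# Hypothetical wrapper for one list item with an optional "Special" marker.
wrapper = ["<li>", "<b>", "#PCDATA", "</b>", "(<i>Special</i>)?", "</li>"]

# Hypothetical page fragment: the second item carries the optional marker.
page = "<li><b>DB Primer</b></li><li><b>XML at Work</b><i>Special</i></li>"

titles = wrapper_to_regex(wrapper).findall(page)
print(titles)
```

Both list items match the same wrapper, with or without the optional marker, so the title field is extracted from each of them.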
