Automated HTML Data Extraction

This paper presents a novel technique for automated wrapper generation and data extraction from HTML sites called RoadRunner. RoadRunner takes two sample HTML pages as input and generates a wrapper by finding similarities and differences between the pages without requiring any human labeling. It handles nested data structures and generates a generalized wrapper using a match algorithm. The authors evaluate RoadRunner and show it outperforms other tools in extracting data more quickly from real-world sites while handling optional fields and nested data.

Uploaded by

xkjon

Paper Review Form

Section I. Overview

A. Reader Interest

1. Which category describes this manuscript?


This is a research paper on data extraction from HTML sites through automated
wrapper generation. The authors have designed a tool called RoadRunner, which
automatically generates wrappers from input HTML samples and performs the data
extraction.

B. Content

1. Please explain how this manuscript advances this field of research and/or contributes
something new to the literature.
The authors address automated wrapper generation by finding the similarities and
differences between two HTML pages of the same class. Their process does not require users
to supply labeled samples (which act as additional information for the wrapper generator), as
approaches in the related literature do. RoadRunner's wrapper generator infers the schema during
wrapper generation, and the system is able to handle nested structures (it is not restricted to
flat records, as other approaches in the related literature are). They cast schema finding and
data extraction as finding the minimal UFRE (union-free regular expression) whose language
contains the input HTML strings. This is a novel approach that has not been used before. They
use a match algorithm that takes a wrapper w (generated from the first input page) and a sample
string s (the second input page) and returns a generalized wrapper as output. One of the
references, by E.J. Neuhold, discusses generating wrappers using attributed grammars, which are
evaluated with a fault-tolerant parsing strategy to cope with ambiguous grammars and irregular
sources.
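
To make the UFRE idea concrete, here is a minimal sketch and not the authors' implementation. It assumes a hypothetical wrapper and two hypothetical sample pages, and it translates the wrapper into a Python regular expression simply to check that both pages belong to its language. The names WRAPPER, PAGE_1, PAGE_2, ufre_to_regex, and accepts are illustrative only.

```python
import re

# Hypothetical wrapper written as a union-free regular expression:
# it uses only concatenation, optionals "(...)?" and iterations "(...)+",
# and never the union operator "|".
WRAPPER = "<b>#PCDATA</b>(<img/>)?<ul>(<li>#PCDATA</li>)+</ul>"

# Two hypothetical pages of the same class.
PAGE_1 = "<b>John Smith</b><ul><li>DB Primer</li></ul>"
PAGE_2 = "<b>Paul Jones</b><img/><ul><li>XML at Work</li><li>HTML Scripts</li></ul>"


def ufre_to_regex(ufre):
    """Turn the wrapper into a compiled regex; #PCDATA stands for any text field."""
    # Assumption: a data field is any non-empty run of characters without "<".
    return re.compile(ufre.replace("#PCDATA", "[^<]+"))


def accepts(ufre, page):
    """Return True if the whole page is in the language of the wrapper."""
    return ufre_to_regex(ufre).fullmatch(page) is not None


if __name__ == "__main__":
    print(accepts(WRAPPER, PAGE_1))  # page without the optional image
    print(accepts(WRAPPER, PAGE_2))  # page with the image and two list items
```

The same wrapper covers both pages because the image is marked optional and the list item is marked as an iteration; a page with an empty list would be rejected, since the iteration requires at least one item.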

2. Is the manuscript technically sound?


Yes. It explains in detail the match algorithm used to generalize the wrapper.

C. Presentation

1. Are the title, abstract, and keywords appropriate?


Yes

2. Does the manuscript contain sufficient and appropriate references?


References are sufficient and appropriate. The authors have tried to evaluate all
the existing approaches to data extraction and to compare those approaches with
their own.

3. Does the introduction state the objectives of the manuscript in terms that encourage
the reader to read on?
Yes. The overview and contributions section clearly sets their technique apart from the other
techniques used in the related literature.

4. How would you rate the organization of the manuscript? Is it focused? Is the length
appropriate for the topic?
The organization of this manuscript is satisfactory. The authors describe the wrapper
generation process and the theoretical issues of finding a minimal UFRE for HTML input
samples. The manuscript is focused, and the authors have tried to evaluate the time taken by
various other wrapper generators and data extraction tools.

5. Please rate and comment on the readability of this manuscript.


Easy to read

Section II. Evaluation

Please rate the manuscript. Explain your choice.


Award Quality

Section III. Detailed Comments

The wrapper generation process described in this paper is fairly novel. It uses two HTML
sample pages to generate a generalized wrapper by comparing their similarities and
differences. The technique does not need any human intervention when a tag mismatch
occurs: it automatically detects optional strings and matches them. The wrapper
generator does not need to know the schema according to which the data is organized. It
handles flat records as well as nested attributes present in the HTML sample.

The authors have a good understanding of the problem and have addressed some of the
complexities that other researchers in the related literature faced (finding the regular
grammar needed to generate wrappers from a set of web pages). The authors
compare their work with that of others and have analyzed the results. The way they
conducted the experiments is interesting. They ran data extraction on
several real sites, such as Amazon, and report whether they were able to extract data. They
take into account schema details such as levels of nesting, number of
attributes, and number of optional strings used in resolving mismatches. They also
compare RoadRunner with other tools, Wien and Stalker, and show that RoadRunner was
able to extract data more quickly than those tools. They show that Wien
and Stalker were unable to handle optional fields in HTML samples and nested structures.

They detail the match algorithm and discuss with examples how the matching
process occurs (matching of strings; matching of tags and string optionals if a mismatch
occurs; and wrapper generalization). They explain the match algorithm with two
examples: a simple one and a slightly more complicated one (which involves backtracking,
identifying internal mismatches, and identifying optional patterns like
<i>Special</i>). This paper is a definite plus, since it shows that their technique is
superior to its counterparts in terms of both time and the wrapper generation process.
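
The matching steps described above can be sketched roughly as follows. This is a deliberately simplified illustration and not the algorithm from the paper: it assumes the pages are already tokenized into tags and strings, it handles only string mismatches (turned into #PCDATA fields) and tag mismatches resolved as optionals, and it omits iterators and backtracking entirely. The function names and the two token lists are hypothetical.

```python
def is_tag(token):
    """Treat any token of the form <...> as a tag; everything else is a string."""
    return token.startswith("<") and token.endswith(">")


def generalize(wrapper, sample):
    """Walk both token lists in parallel and return a generalized wrapper.

    Plain tokens are kept as-is, string mismatches become "#PCDATA",
    and a skipped run of tokens becomes ("?", tokens), meaning optional.
    """
    out = []
    i = j = 0
    while i < len(wrapper) and j < len(sample):
        w, s = wrapper[i], sample[j]
        if w == s:
            out.append(w)
            i += 1
            j += 1
        elif not is_tag(w) and not is_tag(s):
            # String mismatch: the two pages differ only in a data value.
            out.append("#PCDATA")
            i += 1
            j += 1
        elif w in sample[j:]:
            # Mismatch involving a tag: the sample has extra tokens before w,
            # so treat them as an optional pattern.
            k = sample.index(w, j)
            out.append(("?", tuple(sample[j:k])))
            j = k
        elif s in wrapper[i:]:
            # Symmetric case: the wrapper has extra tokens before s.
            k = wrapper.index(s, i)
            out.append(("?", tuple(wrapper[i:k])))
            i = k
        else:
            raise ValueError(f"cannot resolve mismatch between {w!r} and {s!r}")
    if i < len(wrapper) or j < len(sample):
        raise ValueError("leftover tokens; this sketch does not handle them")
    return out


# Hypothetical token lists for two list items, one with an extra <i>Special</i>.
PAGE_1 = ["<li>", "DB Primer", "</li>"]
PAGE_2 = ["<li>", "XML at Work", "<i>", "Special", "</i>", "</li>"]

if __name__ == "__main__":
    print(generalize(PAGE_1, PAGE_2))
```

On these inputs the title becomes a #PCDATA field and the <i>Special</i> run is marked optional, regardless of which page is used as the initial wrapper.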
