Rapper: a wrapper generator with linguistic knowledge

1999

Abstract

Database management systems are becoming available for semistructured data. However, these tools cannot be used on many real-world data sources (e.g., most web sites) in their native form. Often, wrappers are needed to extract information and organize it into a graph structure that makes explicit the concepts users want to query and update. This paper presents a new approach to wrapper generation that exploits linguistic knowledge. The approach produces a more fine-grained parse of sources containing natural language text than previous efforts. The resulting graph-structured databases can answer queries that could not be formulated against databases produced by wrappers generated with prior approaches. In addition, our approach may be more robust in the face of slight variations in word choice and order. We discuss a prototype implementation, lessons learned to date, evaluation issues, and future research directions.

Key takeaways

  • Researchers in semistructured data have converged on the use of graph-structured data models (e.g., OEM [9]), in which information is represented using a labeled, directed graph.
  • We call a program that automatically (or semiautomatically) extracts graph-structured data a "wrapper", and the creation of wrappers "wrapper generation".
  • In contrast, a deeper parse of this same information (Figure 5) permits this query to be answered.
  • Our examples illustrate the benefits of performing deeper parsing in wrapper generation; the resulting graph-structured DBMS can answer queries that would not otherwise be answerable.
  • When using an approach such as ours that attempts to create a deeper parse of the source text, there is a non-trivial up-front cost in creating wrappers.
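
The difference between a shallow and a deeper parse can be illustrated with a minimal sketch. Nothing below is taken from the paper's prototype or its figures: the `Node` class, the restaurant record, the edge labels, and the `closes_at_or_after` query are all hypothetical, chosen only to show how a labeled, directed graph built from a finer-grained parse can support a query that a coarser graph cannot.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

# Hypothetical OEM-style model: a node is a list of labeled edges,
# and each edge points to either an atomic value or another node.
Value = Union[str, int, "Node"]


@dataclass
class Node:
    edges: List[Tuple[str, Value]] = field(default_factory=list)

    def add(self, label: str, target: Value) -> "Node":
        """Attach a labeled edge and return self so calls can be chained."""
        self.edges.append((label, target))
        return self

    def children(self, label: str) -> List[Value]:
        """Return the targets of all outgoing edges carrying this label."""
        return [t for (l, t) in self.edges if l == label]


# Shallow parse (assumed): the natural language sentence is kept as one string.
shallow = Node().add("restaurant", Node()
                     .add("name", "Cafe Example")
                     .add("hours", "Open daily from 11 to 22"))

# Deeper parse (assumed): the same sentence is broken into labeled parts.
deep = Node().add("restaurant", Node()
                  .add("name", "Cafe Example")
                  .add("hours", Node().add("opens", 11).add("closes", 22)))


def closes_at_or_after(db: Node, hour: int) -> List[Value]:
    """Names of restaurants whose 'closes' value is at least `hour`."""
    names: List[Value] = []
    for r in db.children("restaurant"):
        if not isinstance(r, Node):
            continue
        for h in r.children("hours"):
            # Only a structured 'hours' node exposes a 'closes' edge to test.
            if isinstance(h, Node) and any(
                isinstance(c, int) and c >= hour for c in h.children("closes")
            ):
                names.extend(r.children("name"))
    return names


print(closes_at_or_after(shallow, 21))
print(closes_at_or_after(deep, 21))
```

In this sketch the shallow graph yields no matches because the closing time is buried inside an opaque string, while the deeper graph exposes it as a separate labeled value that the query can compare against.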