1999
Database management systems are becoming available for semistructured data; however, these tools cannot be used on many real-world data sources (e.g., most web sites) in their native form. Often, wrappers are needed to extract information and organize it into a graph structure that makes explicit the concepts users want to query and update. This paper presents a new approach to wrapper generation that exploits linguistic knowledge. The approach produces a more fine-grained parse of sources containing natural language text than previous efforts. The resulting graph-structured databases answer queries that could not be formulated in databases produced by previously generated wrappers. In addition, our approach may be more robust in the face of slight variations in word choice and order. We discuss a prototype implementation, lessons learned to date, evaluation issues, and future research directions.
Proceedings of the 11th international conference on Extending database technology Advances in database technology - EDBT '08, 2008
In the classical database world, information access has been based on a paradigm that involves structured, schema-aware, queries and tabular answers. In the current environment, however, where information prevails in most activities of society, serving people, applications, and devices in dramatically increasing numbers, this paradigm has proved to be very limited. On the query side, much work has been done on moving towards keyword queries over structured data. In our previous work, we have touched the other side as well, and have proposed a paradigm that generates entire databases in response to keyword queries. In this paper, we continue in the same direction and propose synthesizing textual answers in response to queries of any kind over structured data. In particular, we study the transformation of a dynamically-generated logical database subset into a narrative through a customizable, extensible, and template-based process. In doing so, we exploit the structured nature of database schemas and describe three generic translation modules for different formations in the schema, called unary, split, and join modules. We have implemented the proposed translation procedure into our own database front end and have performed several experiments evaluating the textual answers generated as several features and parameters of the system are varied. We have also conducted a set of experiments measuring the effectiveness of such answers on users. The overall results are very encouraging and indicate the promise that our approach has for several applications.
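The module names (unary, join) below follow the abstract's terminology, but the templates, schema, and data are invented illustrations, not the authors' actual rules. A minimal sketch of rendering tuples from a logical database subset as narrative text:

```python
# Hypothetical template-based translation: one module renders a single
# relation's tuple, another renders two tuples connected by a join.

def unary_template(table, row):
    # Render one tuple as a sentence listing its attributes.
    attrs = ", ".join(f"{k} {v}" for k, v in row.items())
    return f"There is a {table} with {attrs}."

def join_template(left, right, link):
    # Render two joined tuples as a connected sentence; the linking
    # phrase would come from a customizable template library.
    return f"{left['name']} {link} {right['name']}."

movie = {"name": "Casablanca", "year": 1942}
director = {"name": "Michael Curtiz"}
print(unary_template("movie", movie))
print(join_template(movie, director, "was directed by"))
```

A real system would select among many such templates per schema formation; this sketch only shows the fill-in step.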
CoopIS, 1997
To simplify the task of obtaining information from the vast number of information sources that are available on the World Wide Web (WWW), we are building tools for constructing information mediators that extract and integrate data from multiple Web sources. In a mediator-based approach, wrappers are built around individual information sources to provide translation between the mediator query language and the individual source. We present an approach for semi-automatically generating wrappers for structured internet sources. The key idea is to exploit formatting information in Web pages from the source to hypothesize the underlying structure of a page. From this structure the system generates a wrapper that facilitates querying of a source and possibly integrating it with other sources. We demonstrate the ease with which we are able to build wrappers for a number of Web sources using our implemented wrapper generation toolkit.
2008
In order for agents to act on behalf of users, they will have to retrieve and integrate vast amounts of textual data on the World Wide Web. However, much of the useful data on the Web is neither grammatical nor formally structured, making querying difficult. Examples of these types of data sources are online classifieds like Craigslist and auction item listings like eBay.
ACM SIGMOD Record, 1997
With the current explosion of information on the World Wide Web (WWW), a wealth of information on many different subjects has become available on-line. Numerous sources contain information that can be classified as semistructured. At present, however, the only way to access the information is by browsing individual pages. We cannot query web documents in a database-like fashion based on their underlying structure. However, we can provide database-like querying for semi-structured WWW sources by building wrappers around these sources. We present an approach for semi-automatically generating such wrappers. The key idea is to exploit the formatting information in pages from the source to hypothesize the underlying structure of a page. From this structure the system generates a wrapper that facilitates querying of a source and possibly integrating it with other sources. We demonstrate the ease with which we are able to build wrappers for a number of internet sources in different domains using our implemented wrapper generation toolkit.
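In the spirit of such a wrapper, the sketch below uses formatting tags as landmarks to recover a page's implied record structure. The sample page, field names, and pattern are hypothetical stand-ins, not output of the actual toolkit:

```python
import re

# An invented listing page whose formatting implies a record structure:
# each record is a bold name followed by an italicized price.
PAGE = """
<b>Inn at Lambertville</b> <i>$95</i><br>
<b>Hotel du Nord</b> <i>$120</i><br>
"""

def wrap(page):
    # Hypothesized structure, expressed as a landmark pattern; the
    # result is a set of database-like tuples.
    pattern = re.compile(r"<b>(.*?)</b>\s*<i>\$(\d+)</i>")
    return [{"name": n, "price": int(p)} for n, p in pattern.findall(page)]

rows = wrap(PAGE)
# The source can now be queried like a table:
cheap = [r["name"] for r in rows if r["price"] < 100]
```

The semi-automatic part of the described approach lies in inferring such patterns from the page, which this sketch hand-codes.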
Conference on Innovative Data Systems Research, 2007
The Web contains a huge amount of text that is currently beyond the reach of structured access tools. This unstructured data often contains a substantial amount of implicit structure, much of which can be captured using information extraction (IE) algorithms. By combining an IE system with an appropriate data model and query language, we could enable structured access to all of the Web's unstructured data. We propose a general-purpose query system called the extraction database, or ExDB, which supports SQL-like structured queries over Web text. We also describe the technical challenges involved, motivated in part by our experiences with an early 90M-page prototype.
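Under one simplifying assumption — that an IE system has already produced extractions as (subject, predicate, object) tuples — the ExDB idea of SQL-like queries over Web text can be sketched as follows. The tuples themselves are invented examples:

```python
import sqlite3

# Store invented extractions in an in-memory relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE extractions (subj TEXT, pred TEXT, obj TEXT)")
conn.executemany(
    "INSERT INTO extractions VALUES (?, ?, ?)",
    [
        ("Edison", "invented", "phonograph"),
        ("Bell", "invented", "telephone"),
        ("Edison", "born-in", "Ohio"),
    ],
)

# A SQL-like structured query over text-derived data: who invented what?
inventors = conn.execute(
    "SELECT subj, obj FROM extractions WHERE pred = 'invented' ORDER BY subj"
).fetchall()
```

The hard problems the paper describes — extraction quality, scale, and query semantics over uncertain data — are exactly what this sketch leaves out.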
1997
A truly natural language interface to databases also needs to be practical for actual implementation. We developed a new feasible approach to solve the problem and tested it successfully in a laboratory environment. The new result is based on metadata search, where the metadata grow in a largely linear manner and the search is linguistics-free (allowing for grammatically incorrect and incomplete input). A new class of reference dictionary integrates four types of enterprise metadata: enterprise information models, database values, user-words, and query cases. The layered and scalable information models allow user-words to stay in the original forms as users articulated them, as opposed to relying on permutations of individual words contained in the original query. A graphical representation method turns the dictionary into searchable graphs representing all possible interpretations of the input. A branch-and-bound algorithm then identifies optimal interpretations, which lead to SQL implementation of the original queries. Query cases enhance both the metadata and the search of metadata, as well as providing case-based reasoning to directly answer the queries. This design assures feasible solutions at the termination of the search, even when the search is incomplete (i.e., the results contain the correct answer to the original query). The necessary condition is that the text input contains at least one entry in the reference dictionary. The sufficient condition is that the text input contains a set of entries corresponding to a complete, correct single SQL query. Laboratory testing shows that the system obtained accurate results for most cases that satisfied only the necessary condition.
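A toy illustration of the linguistics-free dictionary lookup described above. The dictionary entries and schema are invented, and the real system's branch-and-bound ranking of competing interpretations is omitted:

```python
# Hypothetical reference dictionary mapping user-words to metadata:
# tables, columns, and database values.
DICTIONARY = {
    "customer": ("table", "customers"),
    "client": ("table", "customers"),       # synonym kept in user form
    "city": ("column", "customers.city"),
    "chicago": ("value", "customers.city = 'Chicago'"),
}

def interpret(text):
    # Linguistics-free: tokenize and look up, tolerating ungrammatical
    # or incomplete input. Necessary condition: at least one hit.
    hits = [DICTIONARY[w] for w in text.lower().split() if w in DICTIONARY]
    table = next((v for k, v in hits if k == "table"), None)
    conds = [v for k, v in hits if k == "value"]
    if table is None:
        return None  # necessary condition not met
    sql = f"SELECT * FROM {table}"
    if conds:
        sql += " WHERE " + " AND ".join(conds)
    return sql

print(interpret("client in chicago"))
```

Note that word order and grammar play no role here, consistent with the abstract's claim, but competing interpretations would require the search step this sketch skips.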
2012
The automated extraction of information from text and its transformation into a formal description is an important goal in both Semantic Web research and computational linguistics. The extracted information can be used for a variety of tasks such as ontology generation, question answering and information retrieval.
International Journal of Mathematical Sciences and Computing, 2015
A natural language query builder interface retrieves the required data in structured form from a database when a query is entered in natural language. The user does not need technical knowledge of Structured Query Language statements, so non-technical users can also use the proposed model. In natural language parsing, obtaining a highly accurate syntactic analysis is a crucial step. Parsing of natural languages is the process of mapping an input string, or a natural language sentence, to its syntactic representation. Because the constituency parsing approach takes more time, the natural language query builder interface is developed using a dependency parsing approach instead. Dependency parsing is widespread in the natural language domain because of its state-of-the-art accuracy and efficiency. This paper also proposes a buffering scheme for natural language statements, so that a sentence that has already been processed is not loaded again. The system additionally provides generalized access to all tables in the database.
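To illustrate why dependency structure helps NL-to-SQL translation, the sketch below maps hand-coded dependency arcs to a query. The arcs are what a dependency parser might produce for "show employees in Paris"; the parser output, schema, and mapping rules are all hypothetical:

```python
# Hand-coded dependency arcs: (head, relation, dependent).
ARCS = [
    ("show", "obj", "employees"),
    ("employees", "nmod", "Paris"),
]

# Invented mapping from content words to schema elements.
SCHEMA = {"employees": "employees", "Paris": "city = 'Paris'"}

def to_sql(arcs):
    table = cond = None
    for head, rel, dep in arcs:
        if rel == "obj":
            table = SCHEMA[dep]   # object of the verb -> target table
        elif rel == "nmod":
            cond = SCHEMA[dep]    # nominal modifier -> filter condition
    return f"SELECT * FROM {table}" + (f" WHERE {cond}" if cond else "")

print(to_sql(ARCS))
```

The relations between words, rather than their linear order, drive the mapping, which is the ambiguity-resolving property the dependency approach is chosen for.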
Autonomous Agents and Multi-Agent …, 2001
With the tremendous amount of information that becomes available on the Web on a daily basis, the ability to quickly develop information agents has become a crucial problem. A vital component of any Web-based information agent is a set of wrappers that can extract the relevant data from semistructured information sources. Our novel approach to wrapper induction is based on the idea of hierarchical information extraction, which turns the hard problem of extracting data from an arbitrarily complex document into a series of simpler extraction tasks. We introduce an inductive algorithm, stalker, that generates high accuracy extraction rules based on user-labeled training examples. Labeling the training data represents the major bottleneck in using wrapper induction techniques, and our experimental results show that stalker requires up to two orders of magnitude fewer examples than other algorithms. Furthermore, stalker can wrap information sources that could not be wrapped by existing inductive techniques.
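A sketch of the kind of landmark-based extraction rule that stalker induces, using a SkipTo-style operator. The sample document, landmarks, and field names are invented; the actual system learns such rules from labeled examples rather than taking them hand-written:

```python
def skip_to(doc, pos, landmark):
    # Advance past the next occurrence of landmark, or fail with -1.
    i = doc.find(landmark, pos)
    return -1 if i < 0 else i + len(landmark)

def extract(doc, start_rule, end_landmark):
    # Apply a start rule (a sequence of SkipTo landmarks), then read
    # up to the end landmark.
    pos = 0
    for lm in start_rule:
        pos = skip_to(doc, pos, lm)
        if pos < 0:
            return None
    end = doc.find(end_landmark, pos)
    return doc[pos:end].strip()

DOC = "Name: <b>Joe's Pizza</b> Phone: <b>555-1234</b>"
name = extract(DOC, ["Name:", "<b>"], "</b>")
phone = extract(DOC, ["Phone:", "<b>"], "</b>")
```

The hierarchical part of the approach, where a document is first split into sub-documents that are then wrapped by simpler rules like these, is not shown.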
2003
A new method is developed to query a relational database in natural language (NL). The method, based on a semantic approach, interprets grammatical and lexical units of a natural language into concepts of the subject domain, which are given in a conceptual scheme. The conceptual scheme is mapped formally onto the logical scheme. We applied the method to query the FlyEx database in natural language. FlyEx contains information on the expression of segmentation genes in Drosophila melanogaster. The method allows formulation of queries in various natural languages simultaneously, and is adaptive to changes in the knowledge domain and users' views. It provides optimal transformation of queries from natural language to SQL, as well as visualization of information as a hyperscheme. The method requires neither specification of all possible language constructions nor standard grammatical accuracy in the formulation of NL queries.
International Journal of Computer Applications, 2014
A natural language query builder interface retrieves the required data from a database when a query is given in natural language. To retrieve the correct data from a database directly, the user would need sufficient technical knowledge of Structured Query Language (SQL) statements; a Natural Language Query Builder Interface (NLQBI) solves this problem. In natural language parsing, obtaining a highly accurate syntactic analysis is a crucial step. Parsing of natural languages can be seen as the process of mapping an input string or a sentence to its syntactic representation. One such parsing technique is dependency parsing, which focuses on relations between words and thereby resolves ambiguity. Most of the recent efficient algorithms for dependency parsing work by factoring the dependency trees. Graph-based dependency parsing models are prevalent because of their state-of-the-art accuracy and efficiency. This paper covers some recent developments in NLQBI systems and surveys dependency parsing techniques.
2008
In this paper, we propose an approach to extract information from HTML pages and to add semantic (XML) tags to them. Wrapping is an essential technique used to automatically extract information from Web sources. This paper describes both, a general approach based on rules, which can be used to automatically generate wrappers, and an assistant generator wrapper called WebMantic. We also provide some experimental results to show that both the rule generation process and the preprocessing task are fast and reliable.
The VLDB Journal, 2007
Précis queries represent a novel way of accessing data, which combines ideas and techniques from the fields of databases and information retrieval. They are free-form, keyword-based queries on top of relational databases that generate entire multi-relation databases, which are logical subsets of the original ones. A logical subset contains not only items directly related to the given query keywords but also items implicitly related to them in various ways, with the purpose of providing to the user much greater insight into the original data. In this paper, we lay the foundations for the concept of logical database subsets that are generated from précis queries under a generalized perspective that removes several restrictions of previous work. In particular, we extend the semantics of précis queries considering that they may contain multiple terms combined through the AND, OR, and NOT operators. On the basis of these extended semantics, we define the concept of a logical database subset, we identify the one that is most relevant to a given query, and we provide algorithms for its generation. Finally, we present an extensive set of experimental results that demonstrate the efficiency and benefits of our approach.
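The core idea — a keyword match expanded into implicitly related tuples — can be sketched for a single foreign key. The toy schema and data are invented, and the real system's handling of AND/OR/NOT terms and relevance ranking is omitted:

```python
# Invented two-relation database with a foreign key movies -> directors.
MOVIES = [{"id": 1, "title": "Vertigo", "director_id": 10}]
DIRECTORS = [{"id": 10, "name": "Alfred Hitchcock"}]

def logical_subset(keyword):
    # Items directly matching the keyword...
    hits = [m for m in MOVIES if keyword.lower() in m["title"].lower()]
    # ...plus items implicitly related to them through the foreign key.
    related = [d for d in DIRECTORS
               if any(m["director_id"] == d["id"] for m in hits)]
    return {"movies": hits, "directors": related}

subset = logical_subset("vertigo")
```

The answer is itself a small multi-relation database rather than a flat result list, which is what distinguishes the précis approach from ordinary keyword search.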
svn.aksw.org
The search for information on the Web of Data is becoming increasingly difficult due to its dramatic growth. Especially novice users need to acquire both knowledge about the underlying ontology structure and proficiency in formulating formal queries (e.g., SPARQL queries) to retrieve information from Linked Data sources. So as to simplify and automate the querying and retrieval of information from such sources, we present in this paper a novel approach for constructing SPARQL queries based on user-supplied keywords. Our approach utilizes a set of predefined basic graph pattern templates for generating adequate interpretations of user queries. This is achieved by obtaining ranked lists of candidate resource identifiers for the supplied keywords and then injecting these identifiers into suitable positions in the graph pattern templates. The main advantages of our approach are that it is completely agnostic of the underlying knowledge base and ontology schema, that it scales to large knowledge bases, and that it is simple to use. We evaluate 17 possible valid graph pattern templates by measuring their precision and recall on 53 queries against DBpedia. Our results show that 8 of these basic graph pattern templates return results with a precision above 70%. Our approach is implemented as a Web search interface and performs sufficiently fast to return instant answers to the user even with large knowledge bases.
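The injection step can be sketched with string templates. The two basic graph patterns and the keyword-to-identifier bindings below are simplified, invented stand-ins for the paper's 17 templates and its candidate-ranking step:

```python
# Two hypothetical basic graph pattern templates with slots for a
# subject (s), predicate (p), and object (o).
TEMPLATES = [
    "SELECT ?x WHERE {{ ?x {p} {o} . }}",
    "SELECT ?x WHERE {{ {s} {p} ?x . }}",
]

def build_queries(bindings):
    # Inject candidate resource identifiers into each template slot,
    # producing one interpretation of the keywords per template.
    return [t.format(**bindings) for t in TEMPLATES]

# e.g. keywords "films by Kubrick" mapped to candidate identifiers:
qs = build_queries({
    "s": "dbr:Stanley_Kubrick",
    "p": "dbo:director",
    "o": "dbr:Stanley_Kubrick",
})
```

Each generated query is one candidate interpretation; the described system then ranks these and executes the most promising ones against the knowledge base.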
2006
The Web contains a vast amount of text that can only be queried using simple keywords-in, documents-out search queries. But Web text often contains structured elements, such as hotel location and price pairs embedded in a set of hotel reviews. Queries that process these structural text elements would be much more powerful than our current document-centric queries. Of course, text does not contain metadata or a schema, making it unclear what a structured text query means precisely. In this paper we describe three possible models for structured queries over text, each of which implies different query semantics and user interaction.
Schema.org creates, supports, and maintains schemas for structured data on web pages. For a non-technical author, it is difficult to publish content in a structured format. This work presents an automated way of inducing Schema.org markup from the natural language content of web pages by applying knowledge base creation techniques. Web Data Commons was used as the dataset, and the scope of the experimental part was limited to RDFa. The approach was implemented using knowledge graph building techniques from Knowledge Vault and KnowMore.
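Only the final emission step of such a pipeline is easy to sketch: given fields already extracted from a page's text, render them as Schema.org RDFa markup. The extraction itself (the Knowledge Vault / KnowMore part) is not shown, and the item below is an invented example:

```python
def to_rdfa(item_type, props):
    # Emit a minimal RDFa block using Schema.org's vocab/typeof/property
    # attributes for one extracted entity.
    lines = [f'<div vocab="https://schema.org/" typeof="{item_type}">']
    for name, value in props.items():
        lines.append(f'  <span property="{name}">{value}</span>')
    lines.append("</div>")
    return "\n".join(lines)

markup = to_rdfa("Book", {"name": "The Pragmatic Programmer",
                          "author": "Andrew Hunt"})
print(markup)
```

In practice the hard problem is deciding the type and property values from free text; this sketch assumes that mapping has already been made.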