Many Web data sources and APIs make their data available in XML, JSON, or a domain-specific semi-structured format, with the goal of making the data easily accessible and usable by Web application developers. Although such data formats are more machine-processable than pure text documents, managing and analyzing such data at large scale is often nontrivial. This is mainly due to the lack of a well-defined (or understood) structure and clear semantics in such data formats, which can result in poor data quality. In the xCurator project, we add structure to such data with the goal of publishing it on the Web as Linked Data. We enhance the quality of such data by: extracting entities, their types, and their relationships to other entities; performing entity (and entity type) identification; merging duplicate entities (and entity types); linking related entities (internally and to external sources); and publishing the results on the Web as high-quality Linked Data. All of this is done in a lightweight, easy-to-use, and scalable framework that effectively incorporates user feedback in all phases. We describe the initial framework of our system and report the results of using it to manage large volumes of (user-generated) data on the Web in several real-world applications.
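As an illustration only (this is not xCurator's actual code), the following minimal sketch shows the kind of duplicate-entity merging step the abstract mentions; the record layout and the exact-match-on-normalized-name rule are assumptions made for the example.

```python
# Illustrative sketch of a duplicate-entity merging step (not xCurator's implementation).
# Entities are plain dicts; two entities are treated as duplicates when their
# normalized names match exactly, a deliberately simple stand-in for real
# entity identification logic.

def normalize(name: str) -> str:
    """Lower-case and collapse whitespace for crude matching."""
    return " ".join(name.lower().split())

def merge_duplicates(entities):
    """Group entities by normalized name and merge their types and links."""
    merged = {}
    for e in entities:
        key = normalize(e["name"])
        target = merged.setdefault(key, {"name": e["name"], "types": set(), "links": set()})
        target["types"].update(e.get("types", []))
        target["links"].update(e.get("links", []))
    return list(merged.values())

if __name__ == "__main__":
    sample = [
        {"name": "Royal Ontario Museum", "types": ["Museum"], "links": ["http://example.org/rom"]},
        {"name": "royal  ontario museum", "types": ["Organization"], "links": []},
    ]
    print(merge_duplicates(sample))  # the two records collapse into one entity
```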
2019
Linked Data (LD) is a set of best practices for publishing data in RDF format. Structured datasets can be transformed into RDF datasets by means of RDF mappings. Defining such mappings requires familiarity with LD practices and an intimate knowledge of the datasets concerned. An obstacle to the democratisation of LD is that few people satisfy both conditions. We believe that tools that simplify the LD integration process will foster the growth of LD. In this demonstration, we present a chatbot-like tool that can semi-automatically generate RDF mappings for existing structured datasets. The challenge is to automate the part of the integration process that requires familiarity with RDF.
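To make the idea of an RDF mapping concrete, here is a minimal sketch (using rdflib) of applying a column-to-property mapping to tabular rows; the column names, base URI, and choice of FOAF properties are invented for the example, and a tool like the one demonstrated would generate the mapping rather than hard-code it.

```python
# Minimal sketch: apply a column-to-property mapping to tabular rows and emit RDF.
# The columns, base URI and vocabulary are placeholders chosen for illustration.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF, RDF

EX = Namespace("http://example.org/")
MAPPING = {"name": FOAF.name, "nick": FOAF.nick}   # column -> RDF property

rows = [{"id": "alice", "name": "Alice", "nick": "ally"}]

g = Graph()
for row in rows:
    subject = EX[row["id"]]                 # mint a subject URI from the row key
    g.add((subject, RDF.type, FOAF.Person))
    for column, prop in MAPPING.items():
        g.add((subject, prop, Literal(row[column])))

print(g.serialize(format="turtle"))
```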
Journal of Next …, 2010
Extracting information from Web data sources has become very important because of the massive and growing amount of diverse semi-structured information sources available to users on the Internet, and the variety of Web pages makes information extraction from the Web a challenging problem. This paper proposes a framework for extracting, classifying, analyzing, and presenting semi-structured Web data sources. The framework is able to extract relevant information from different Web data sources and classify the extracted information according to the standard classification scheme of Nokia products, which has been chosen as the case study.
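As a rough illustration of the extract-then-classify pattern described above (not the paper's actual rules), the sketch below pulls product names out of a semi-structured HTML listing and buckets them against a fixed classification scheme; the HTML layout and the category keywords are assumptions made for the example.

```python
# Illustrative sketch: extract product names from a semi-structured HTML list
# and classify them against a fixed scheme. Layout and keywords are invented.
from html.parser import HTMLParser

CATEGORIES = {"smartphone": ["n95", "n97"], "basic phone": ["1100", "3310"]}

class ProductExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_item = False
        self.products = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "product") in attrs:
            self.in_item = True

    def handle_data(self, data):
        if self.in_item and data.strip():
            self.products.append(data.strip())
            self.in_item = False

def classify(name):
    lowered = name.lower()
    for category, keywords in CATEGORIES.items():
        if any(k in lowered for k in keywords):
            return category
    return "unclassified"

html = '<ul><li class="product">Nokia N95</li><li class="product">Nokia 3310</li></ul>'
parser = ProductExtractor()
parser.feed(html)
print([(name, classify(name)) for name in parser.products])
```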
Advanced Information …, 2010
We present the Entity Name System (ENS), an enabling infrastructure that can host descriptions of named entities and provide unique identifiers at large scale. In this way, it opens new perspectives for realizing entity-oriented, rather than keyword-oriented, Web information systems. We describe the architecture and functionality of the ENS, along with the tools that together help realize the Web of entities.
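The core behaviour such a service provides is "look up or mint": given an entity description, return an existing identifier if one matches, otherwise create a new one. The toy sketch below illustrates only that contract; the exact-name matching rule and URN scheme are simplifications, not the ENS's actual design.

```python
# Toy sketch of lookup-or-mint behaviour for entity identifiers.
# Matching on normalized name and the urn:ens: scheme are illustrative only.
import uuid

class EntityNameService:
    def __init__(self):
        self.by_name = {}          # normalized name -> identifier

    def resolve(self, description: dict) -> str:
        key = description["name"].strip().lower()
        if key not in self.by_name:
            self.by_name[key] = f"urn:ens:{uuid.uuid4()}"
        return self.by_name[key]

ens = EntityNameService()
a = ens.resolve({"name": "Berlin"})
b = ens.resolve({"name": "berlin "})
assert a == b                      # the same entity receives the same identifier
print(a)
```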
Semantic Services, Interoperability and Web Applications
The term “Linked Data” refers to a set of best practices for publishing and connecting structured data on the Web. These best practices have been adopted by an increasing number of data providers over the last three years, leading to the creation of a global data space containing billions of assertions - the Web of Data. In this article, the authors present the concept and technical principles of Linked Data, and situate these within the broader context of related technological developments. They describe progress to date in publishing Linked Data on the Web, review applications that have been developed to exploit the Web of Data, and map out a research agenda for the Linked Data community as it moves forward.
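The basic interaction these principles enable is dereferencing an HTTP URI and getting RDF back, then following links into other datasets. A small sketch of that interaction, assuming network access and that DBpedia is reachable:

```python
# Sketch of the basic Linked Data interaction: dereference an HTTP URI (content
# negotiation returns RDF) and follow owl:sameAs links into other datasets.
# Requires network access; DBpedia is used purely as a well-known example.
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

berlin = URIRef("http://dbpedia.org/resource/Berlin")
g = Graph()
g.parse(str(berlin))                      # rdflib negotiates an RDF representation

for _, _, other in g.triples((berlin, OWL.sameAs, None)):
    print(other)                          # descriptions of the same city elsewhere
```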
2004
SIMILE is a joint project between MIT Libraries, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), HP Labs and the World Wide Web Consortium (W3C). It is investigating the application of Semantic Web tools, such as the Resource Description Framework (RDF), to the problem of dealing with heterogeneous metadata. This report describes how XML and RDF tools are used to perform data conversion, extraction and record linkage on some sample datasets featuring visual images (ARTstor) and learning objects (OpenCourseWare) in the first SIMILE proof of concept demo.
2010
With respect to large-scale, static, Linked Data corpora, in this paper we discuss scalable and distributed methods for: (i) entity consolidation (identifying entities which signify the same referent, also known as smushing, entity resolution, or object consolidation) using explicit owl:sameAs relations; (ii) extended entity consolidation based on a subset of OWL 2 RL/RDF rules, particularly over inverse-functional properties, functional properties, and (max-)cardinality restrictions with value one; (iii) deriving weighted concurrence measures between entities in the corpus based on shared inlinks/outlinks and attribute values using statistical analyses; (iv) disambiguating (initially) consolidated entities based on inconsistency detection using OWL 2 RL/RDF rules. Our methods are based upon distributed sorts and scans of the corpus, where we purposefully avoid the requirement for indexing all data. Throughout, we offer evaluation over a diverse Linked Data corpus consisting of 1.118 billion quadruples derived from a domain-agnostic, open crawl of 3.985 million RDF/XML Web documents, demonstrating the feasibility of our methods at that scale and giving insights into the fecundity of the approach and the quality of the results.
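The first step, consolidation over explicit owl:sameAs statements, amounts to computing equivalence classes over the sameAs graph. The in-memory union-find sketch below illustrates only that idea; the paper's actual methods operate via distributed sorts and scans over billions of quads.

```python
# In-memory sketch of owl:sameAs consolidation: union-find over sameAs edges
# to produce equivalence classes of identifiers. Illustrative only; the paper's
# methods are distributed and avoid indexing all data.

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]     # path halving
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[rb] = ra

# (subject, object) pairs from owl:sameAs statements, e.g. parsed from quads.
same_as = [
    ("http://dbpedia.org/resource/Berlin", "http://example.org/Berlin"),
    ("http://example.org/Berlin", "http://other.example.org/berlin"),
]
for s, o in same_as:
    union(s, o)

# Group each identifier under a canonical representative.
clusters = {}
for node in parent:
    clusters.setdefault(find(node), []).append(node)
print(clusters)
```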
The constantly growing number of Linked Open Data (LOD) datasets creates a need for rich metadata descriptions that enable users to discover, understand, and process the available data. This metadata is often created, maintained, and stored in diverse data repositories featuring disparate data models that are often unable to provide the metadata necessary to automatically process the datasets described. This paper proposes DataID, a best practice for LOD dataset descriptions that utilizes RDF files hosted together with the datasets, under the same domain. We describe the data model, which is based on the widely used DCAT and VoID vocabularies, as well as supporting tools for creating and publishing DataIDs, and use cases that show the benefits of providing semantically rich metadata for complex datasets. As a proof of concept, we generated a DataID for the DBpedia dataset, which we present in the paper.
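The sketch below shows the flavour of dataset description the abstract refers to, expressed with rdflib using the DCAT and VoID vocabularies it builds on; the dataset URI, title, and statistics are placeholders, not the real DBpedia DataID.

```python
# Sketch of a DCAT/VoID-style dataset description; all values are placeholders.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF, XSD

DCAT = Namespace("http://www.w3.org/ns/dcat#")
VOID = Namespace("http://rdfs.org/ns/void#")

dataset = URIRef("http://example.org/dataid/mydataset")
g = Graph()
g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Example dataset")))
g.add((dataset, DCTERMS.license, URIRef("http://creativecommons.org/licenses/by/4.0/")))
g.add((dataset, VOID.triples, Literal(1000000, datatype=XSD.integer)))
g.add((dataset, VOID.sparqlEndpoint, URIRef("http://example.org/sparql")))

print(g.serialize(format="turtle"))
```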
International Journal of Web Engineering and Technology, 2008
Data warehousing and Online Analytical Processing (OLAP) technologies are now moving toward handling complex data that mostly originate from the Web. However, integrating such data into a decision-support process requires their representation in a form processable by OLAP and/or data mining techniques. In this paper we present a complex data warehousing methodology that exploits the eXtensible Markup Language (XML) as a pivot language. Our approach includes the integration of complex data into an ODS, in the form of XML ...
2003
The Resource Description Framework (RDF) is designed to support agent communication on the Web, but it is also suitable as a framework for modeling and storing personal information. Haystack is a personalized information repository that employs RDF in this manner. This flexible semistructured data model is appealing for several reasons. First, RDF supports ontologies created by the user and tailored to the user's needs. At the same time, system ontologies can be specified and evolved to support a variety of high-level functionalities such as flexible organization schemes, semantic querying, and collaboration. In addition, we show that RDF can be used to engineer a component architecture that gives rise to a semantically rich and uniform user interface. We demonstrate that by aggregating various types of users' data together in a homogeneous representation, we create opportunities for agents to make more informed deductions in automating tasks for users. Finally, we discuss the implementation of an RDF information store and a programming language specifically suited for manipulating RDF.
Internet Computing, IEEE, 2009
Editor: Munindar P. Singh • singh@ncsu.edu; Shengru Tu • shengru@cs.uno.edu
2014
Data Integration is the problem of combining data in various data sources and providing a user with a unified view over these sources. Building an automatic data integration system that can process large, semi-structured data sources has emerged as an important problem. An automated data integration system requires automatic population of an Entity Name System (ENS). An ENS is a thesaurus for entities and is used to serve instance matching needs across data sources. Resource Description Framework (RDF) is a graph-based data model used to publish data on the Web as linked data. To build and populate an ENS over linked data, the fundamental problem of data matching needs to be solved. Traditionally, data matching concerned identifying pairs of logically equivalent entities across one or more structurally homogeneous data sources, and required a human in the loop. Additionally, most systems run on serial architectures. These assumptions cannot be expected to hold for linked data. Given...
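The instance-matching task mentioned above is commonly approached by blocking records on a cheap key and comparing only within blocks using a string similarity. The sketch below illustrates that generic pattern; the blocking key, similarity measure, and threshold are assumptions for the example, not the thesis's actual pipeline.

```python
# Generic instance-matching sketch: block on a cheap key, compare within blocks,
# emit candidate matches above a threshold. All parameters are illustrative.
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def block_key(record):
    return record["name"][:1].lower()          # crude blocking on first letter

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

records = [
    {"id": "src1:1", "name": "Berners-Lee, Tim"},
    {"id": "src2:9", "name": "Berners-Lee Tim"},
    {"id": "src1:2", "name": "Alan Turing"},
]

blocks = defaultdict(list)
for r in records:
    blocks[block_key(r)].append(r)

matches = []
for block in blocks.values():
    for a, b in combinations(block, 2):
        if similarity(a["name"], b["name"]) > 0.6:
            matches.append((a["id"], b["id"]))
print(matches)
```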
Data and Knowledge Engineering (DKE), 2007
XML has evolved as the new standard for the representation and exchange of both structured and semistructured data. XML's ability to succinctly describe complex information can also be used for specifying application metadata. XML's popularity is evident from its use in a wide spectrum of application domains: from document publication to computational chemistry, health care and life sciences, multimedia, and e-commerce. The increasing popularity of Web-based business and the emergence of Web services that use XML-based descriptions in WSDL and exchange XML messages with SOAP have led to further acceptance of XML. The purpose of the second XML Data and Schema Management Workshop was to provide a forum for the exchange of ideas and experiences among the theoreticians and practitioners involved in the design, management, and implementation of XML data management systems. It was held on 3 April 2005 in conjunction with the 21st IEEE International Conference on Data Engineering (ICDE 2005). The workshop featured a full-day program of 12 regular papers divided into three sessions, and many interesting discussions arose among the presenters and the audience on almost every XML data management topic. This special issue of Data and Knowledge Engineering features four papers selected from the XSDM 2005 workshop based on their merits and relevance. Each of these papers is an extended and revised version of the original workshop paper and has gone through a rigorous reviewing process before being accepted for inclusion in this special issue. In the first paper, entitled "QMatch - A Hybrid Match Algorithm for XML Schemas", Tansalarak and Claypool propose a new hybrid schema match algorithm, QMatch, that provides a unique path-based framework for harnessing traditional structural and semantic information, while exploiting the constraints inherent in XML documents, such as the order of XML elements, to provide improved levels of matching between two given XML schemata. QMatch is based on the measurement of a unique quality-of-match metric, QoM, and a set of classifiers. The authors experimentally demonstrate the benefits of QMatch over existing algorithms such as Cupid. Efficient processing of XML queries is a key area of research in XML data management, and the next two papers thus focus on efficient XML query evaluation. Chen et al., in their paper titled "Index Structures for Matching XML Twigs Using Relational Query Processors", address some of the limitations of existing XML path indices. They present a framework defining a family of index structures that includes most existing XML path indices, propose two novel index structures with different space-time tradeoffs that are effective for the evaluation of XML twig queries with value conditions, and show how this family of index structures can be tightly integrated with a relational query processor. The next paper, entitled "Optimization of Nested XQuery Expressions with Orderby Clauses" by Wang, Rundensteiner, and Mani, presents a technique for XQuery optimization. They propose an algebraic rewriting technique for nested XQuery expressions containing explicit ORDERBY clauses. Their technique is based on
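For readers unfamiliar with the terminology, a "twig query with a value condition" is a small branching path pattern over an XML tree, such as /catalog/book[author='Smith']/title. The snippet below only illustrates that query shape, evaluated naively with Python's ElementTree; the indexed evaluation strategies are what the cited papers are about.

```python
# Example of a twig pattern with a value condition, evaluated naively with
# ElementTree's limited XPath support; shown only to illustrate the query shape.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<catalog>
  <book><title>Knowledge Graphs</title><author>Smith</author><price>30</price></book>
  <book><title>XML Indexing</title><author>Jones</author><price>45</price></book>
</catalog>
""")

# Twig: /catalog/book[author='Smith']/title
for title in doc.findall("./book[author='Smith']/title"):
    print(title.text)
```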
Fifth Computer Science and Engineering …
Synthesis Lectures on the Semantic Web: Theory and Technology, 2011
This book gives an overview of the principles of Linked Data as well as the Web of Data that has emerged through the application of these principles. The book discusses patterns for publishing Linked Data, describes deployed Linked Data applications and examines their architecture.
A ‘Semantic Web’ Using Linked Data for day-to-day data transfer, 2009
The term Linked Data refers to a set of best practices for publishing and connecting structured data on the Web. These best practices have been adopted by an increasing number of data providers over the last three years, leading to the creation of a global data space containing billions of assertions - the Web of Data. In this article we present the concept and technical principles of Linked Data, and situate these within the broader context of related technological developments. We describe progress to date in publishing Linked Data on the Web, review applications that have been developed to exploit the Web of Data, and map out a research agenda for the Linked Data community as it moves forward.
A fundamental prerequisite of the Semantic Web is the existence of large amounts of meaningfully interlinked RDF data on the Web. The W3C SWEO community project Linking Open Data has made various open datasets available on the Web as RDF and developed automated mechanisms to interlink them with RDF statements. Collectively, the datasets currently consist of over one billion triples. We believe that large-scale interlinking will demonstrate the value of the Semantic Web compared to more centralized approaches such as Google Base. This paper outlines the work to date and describes the accompanying demonstration. A functioning Semantic Web is predicated on the availability of large amounts of data as RDF, not in isolated islands but as a Web of interlinked datasets. To date this prerequisite has not been widely met, leading to criticism of the broader endeavour and hindering the progress of developers wishing to build Semantic Web applications. Thanks to the Open Data movement, a va...
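The interlinking step described above ultimately produces explicit RDF statements (typically owl:sameAs) connecting resources across datasets. A minimal sketch of emitting such links, with the matched pairs hard-coded and the target URIs invented for illustration:

```python
# Sketch of publishing interlinks: given matched resource pairs, emit explicit
# owl:sameAs statements. Pairs are hard-coded here; real tools derive them.
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

matched_pairs = [
    ("http://dbpedia.org/resource/Berlin", "http://example.org/geo/Berlin"),
]

g = Graph()
for local, remote in matched_pairs:
    g.add((URIRef(local), OWL.sameAs, URIRef(remote)))

print(g.serialize(format="nt"))
```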
The term linked data is entering common vocabulary and, as interests us most here, the specific terminology of library and information science. The concept is complex; we can summarize it as the set of best practices required for publishing and connecting structured data on the web for use by machines. It is an expression used to describe a method of exposing, sharing, and connecting data via Uniform Resource Identifiers (URIs) on the web. With linked data, in other words, we refer to data published on the web in a format that is readable, interpretable and, above all, usable by machines, whose meaning is explicitly defined by a string of words and markers. In this way we constitute a network of linked data (hence the name) belonging to a domain (which constitutes the initial context), connected in turn to other, external data sets (that is, those outside the domain), in a context of increasingly extended relationships. The article then presents the Linked Open Data (LOD) cloud, which collects the open data sets available on the web, and whose exponential growth over a very brief period of time demonstrates the level of interest that linked data has garnered among organizations and institutions of different types.
2001
XML is increasingly being adopted for information publishing on the World Wide Web. However, the underlying data is often stored in relational databases, so some mechanism is needed to convert the relational data into XML data. In this work, we employ a semantically rich semistructured data model, the Object-Relationship-Attribute model for semistructured data, as middleware to support the conversion from a semantically enriched relational schema to an XML Schema. This approach allows us to handle the translation of a set of related relations and to distinguish attributes of relationship types from attributes of object classes, multivalued attributes, and different types of relationships such as binary, n-ary, recursive, and ISA. The resulting XML structures are able to reflect the inherent semantics and implicit structure in the underlying relational database. We also show that the appropriate use of references can avoid unnecessary redundancy and the proliferation of disconnected XML elements.
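The basic direction of such a conversion is to turn foreign-key relationships into element nesting so the XML reflects the object-relationship structure. The sketch below illustrates only that idea with invented table layouts; it is not the ORA-SS based translation the paper describes.

```python
# Sketch: nest rows of a dependent relation under their parent object, turning
# a one-to-many foreign key into XML element nesting. Tables are invented.
import xml.etree.ElementTree as ET

departments = [{"dept_id": 1, "name": "Sales"}]
employees = [
    {"emp_id": 10, "dept_id": 1, "name": "Alice"},
    {"emp_id": 11, "dept_id": 1, "name": "Bob"},
]

root = ET.Element("company")
for d in departments:
    dept_el = ET.SubElement(root, "department", id=str(d["dept_id"]))
    ET.SubElement(dept_el, "name").text = d["name"]
    # the one-to-many relationship becomes nesting rather than a foreign key
    for e in employees:
        if e["dept_id"] == d["dept_id"]:
            emp_el = ET.SubElement(dept_el, "employee", id=str(e["emp_id"]))
            ET.SubElement(emp_el, "name").text = e["name"]

print(ET.tostring(root, encoding="unicode"))
```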
2008
Nowadays, many users rely on Web search engines to find and gather information, and they face an increasing amount of diverse semi-structured information sources. The issue of correlating, integrating, and presenting related information to users therefore becomes important. When a user employs a search engine such as Yahoo or Google to seek specific information, the results include not only pages containing the desired information but also other pages on which that information is merely mentioned, and the number of returned pages is enormous. Therefore, the performance capabilities, the overlap among results for the same queries, and the limitations of Web search engines are an important and large area of research. Extracting information from Web data sources also becomes very important because of the massive and growing amount of diverse semi-structured information sources available to users on the Internet, and the variety of Web pages makes information extraction from the Web a challenging problem. This paper proposes a framework for extracting, classifying, and browsing semi-structured Web data sources. The framework is able to extract relevant information from different Web data sources and classify the extracted information based on the standard classification of Nokia products.