Papers by Sebastián Ferrada
ERDoc: A Web Interface for Entity-Relation Modelling
MillenniumDB: A Multi-modal, Multi-model Graph Database

Semantic web, Mar 6, 2024
The SPARQL standard provides operators to retrieve exact matches on data, such as graph patterns, filters and grouping. This work proposes and evaluates two new algebraic operators for SPARQL 1.1 that return similarity-based results instead of exact results. First, a similarity join operator is presented, which brings together similar mappings from two sets of solution mappings. Second, a clustering solution modifier is introduced, which instead of grouping solution mappings according to exact values, brings them together by using similarity criteria. For both cases, a variety of algorithms are proposed and analysed, and use-case queries that showcase the relevance and usefulness of the novel operators are presented. For similarity joins, experimental results are provided by comparing different physical operators over a set of real world queries, as well as comparing our implementation to the closest work found in the literature, DBSimJoin, a PostgreSQL extension that supports similarity joins. For clustering, synthetic queries are designed in order to measure the performance of the different algorithms implemented.
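As a rough illustration of what a similarity join over two sets of solution mappings does, the sketch below pairs mappings whose selected numeric attributes lie within a distance threshold. It is a naive nested-loop version under stated assumptions, not the operator or any of the algorithms evaluated in the paper: the function name `similarity_join`, the Euclidean distance, the threshold criterion, and the sample variable names are all hypothetical.

```python
from math import dist

def similarity_join(left, right, keys_left, keys_right, threshold):
    """Naive nested-loop similarity join (illustrative assumption only).

    `left` and `right` are lists of solution mappings, modelled here as
    dicts from variable names to values. Two mappings are joined when the
    Euclidean distance between their selected attributes is <= threshold.
    """
    results = []
    for m1 in left:
        p1 = tuple(m1[k] for k in keys_left)
        for m2 in right:
            p2 = tuple(m2[k] for k in keys_right)
            d = dist(p1, p2)
            if d <= threshold:
                # Merge both mappings and bind the distance to a new variable.
                results.append({**m1, **m2, "distance": d})
    return results

# Hypothetical sample data with disjoint variable names on each side.
left = [{"c1": "A", "x1": 0.0, "y1": 0.0}, {"c1": "B", "x1": 5.0, "y1": 5.0}]
right = [{"c2": "C", "x2": 0.0, "y2": 1.0}, {"c2": "D", "x2": 9.0, "y2": 9.0}]
joined = similarity_join(left, right, ["x1", "y1"], ["x2", "y2"], 1.5)
print(joined)
```

With this sample data only the pair (A, C) falls within the threshold. The nested loop compares every pair of mappings, which is the cost that dedicated physical operators aim to avoid.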
Information Systems, Jul 1, 2020

AMW, 2016
Linked Data rarely takes into account multimedia content, which forms a central part of the Web. To explore the combination of Linked Data and multimedia, we are developing IMGpedia: we compute content-based descriptors for images used in Wikipedia articles and subsequently propose to link these descriptions with legacy encyclopaedic knowledge-bases such as DBpedia and Wikidata. On top of this extended knowledge-base, our goal is to consider a unified query system that accesses both the encyclopaedic data and the image data. We could also consider enhancing the encyclopaedic knowledge based on rules applied to co-occurring entities in images, or content-based analysis, for example. Abstracting away from IMGpedia, we explore generic methods by which the content of images on the Web can be described in a standard way and can be considered as first-class citizens on the Web of Data, allowing, for example, for combining structured queries with image similarity search. This short paper thus describes ongoing work on IMGpedia, with focus on image descriptors.

Despite its importance to the Web, multimedia content is often neglected when building and designing knowledge-bases: though descriptive metadata and links are often provided for images, video, etc., the multimedia content itself is often treated as opaque and is rarely analysed. IMGpedia is an effort to bring together the images of Wikimedia Commons (including visual information), and relevant knowledge-bases such as Wikidata and DBpedia. The result is a knowledge-base that incorporates similarity relations between the images based on visual descriptors, as well as links to the resources of Wikidata and DBpedia that relate to the image. Using the IMGpedia SPARQL endpoint, it is then possible to perform visuo-semantic queries, combining the semantic facts extracted from the external resources and the similarity relations of the images. This paper presents a new web interface to browse and explore the dataset of IMGpedia in a more friendly manner, as well as new visuo-semantic queries that can be answered using 6 million recently added links from IMGpedia to Wikidata. We also discuss future directions we foresee for the IMGpedia project.
Lecture Notes in Computer Science, 2017
IMGpedia is a large-scale linked dataset that incorporates visual information of the images from the Wikimedia Commons dataset: it brings together descriptors of the visual content of 15 million images, 450 million visual-similarity relations between those images, links to image metadata from DBpedia Commons, and links to the DBpedia resources associated with individual images. In this paper we describe the creation of the IMGpedia dataset, provide an overview of its schema and statistics of its contents, offer example queries that combine semantic and visual information of images, and discuss other envisaged use-cases for the dataset.
Springer eBooks, 2020
We propose techniques that support the efficient computation of multidimensional similarity joins in an RDF/SPARQL setting, where similarity in an RDF graph is measured with respect to a set of attributes selected in the SPARQL query. While similarity joins have been studied in other contexts, RDF graphs present unique challenges. We discuss how a similarity join operator can be included in the SPARQL language, and investigate ways in which it can be implemented and optimised. We devise experiments to compare three similarity join algorithms over two datasets. Our results reveal that our techniques outperform DBSimJoin: a PostgreSQL extension that supports similarity joins.
CEUR workshop proceedings, 2017
IMGpedia is a linked dataset that provides a public SPARQL endpoint where users can answer queries that combine the visual similarity of images from Wikimedia Commons and semantic information from existing knowledge-bases. Our demo will show example queries that capture the potential of the current data stored in IMGpedia. We also plan to discuss potential use-cases for the dataset and ways in which we can improve the quality of the information it captures and the expressiveness of its queries.

Today's space of graph database solutions is characterized by two main technology stacks that have evolved separate from one another: on one hand, there are systems that focus on supporting the RDF family of standards; on the other hand, there is the Property Graph category of systems. As a basis for bringing these stacks together and, in particular, to facilitate data exchange between the different types of systems, different direct mappings between the underlying graph data models have been introduced in the literature. While fundamental properties are well-documented for most of these mappings, the same cannot be said about the practical implications of choosing one mapping over another. Our research aims to contribute towards closing this gap. In this paper we report on a preliminary study for which we have selected two direct mappings from (Labeled) Property Graphs to RDF, where one of them uses features of the RDF-star extension to RDF. We compare these mappings in terms of the query performance achieved by two popular commercial RDF stores, GraphDB and Stardog, in which the converted data is imported. While we find that, for both of these systems, none of the mappings is a clear winner in terms of guaranteeing better query performance, we also identify types of queries that are problematic for the systems when using one mapping but not the other.

AMW, 2018
The use of the join operator in metric spaces leads to what is known as a similarity join, where objects of two datasets are paired if they are somehow similar. We propose a heuristic that solves the 1-NN self-similarity join, that is, a similarity join of a dataset with itself, that brings together each element with its nearest neighbor within the same dataset. Solving the problem using a simple brute-force algorithm requires $O(n^2)$ distance calculations, since it requires comparing every element against all others. We propose a simple divide-and-conquer algorithm that gives an approximated solution for the self-similarity join that computes only $O(n^{3/2})$ distances. We show how the algorithm can be easily modified in order to improve the precision up to 31% (i.e., the percentage of correctly found 1-NNs) and such that 79% of the results are within the 10-NN, with no significant extra distance computations. We present how the algorithm can be executed in parallel and prove that using $\Theta(\sqrt{n})$ processors, the total execution takes linear time. We end by discussing ways in which the algorithm can be improved in the future.
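To make the brute-force baseline mentioned in this abstract concrete, the following is a minimal sketch of an exact 1-NN self-similarity join that also counts distance computations. It illustrates only the quadratic baseline, not the divide-and-conquer heuristic proposed in the paper; the function name, the Euclidean distance, and the sample points are assumptions chosen for illustration.

```python
from math import dist

def brute_force_1nn_self_join(points):
    """Exact 1-NN self-join by comparing every pair of points once.

    Returns (nearest, computations), where nearest[i] is the index of the
    nearest neighbour of points[i] and computations is the number of
    distance evaluations, n * (n - 1) / 2 when exploiting symmetry.
    """
    n = len(points)
    nearest = [None] * n
    best = [float("inf")] * n
    computations = 0
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(points[i], points[j])
            computations += 1
            # The distance is symmetric, so update both endpoints.
            if d < best[i]:
                best[i], nearest[i] = d, j
            if d < best[j]:
                best[j], nearest[j] = d, i
    return nearest, computations

# Hypothetical sample: two well-separated pairs of 2D points.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.5)]
nearest, computations = brute_force_1nn_self_join(points)
print(nearest, computations)
```

Even when symmetry halves the work, the number of distance evaluations grows quadratically with the dataset size, which is the cost an approximate method trades precision to reduce.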

ArXiv, 2020
One of the grand challenges discussed during the Dagstuhl Seminar “Knowledge Graphs: New Directions for Knowledge Representation on the Semantic Web” [24] and described in its report is that of a: Public FAIR Knowledge Graph of Everything: We increasingly see the creation of knowledge graphs that capture information about the entirety of a class of entities. For example, Amazon is creating a knowledge graph of all products in the world and Google and Apple have both created knowledge graphs of all locations in the world. This grand challenge extends this further by asking if we can create a knowledge graph of “everything” ranging from common sense concepts to location based entities. This knowledge graph should be “open to the public” in a FAIR manner democratizing this mass amount of knowledge. Although linked open data (LOD) is one knowledge graph, it is the closest realisation (and probably the only one) to a public FAIR Knowledge Graph (KG) of everything. Surely, LOD provides a ...
Converting property graphs to RDF
Proceedings of the 5th ACM SIGMOD Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA)
International Semantic Web Conference, 2017
IMGpedia is a linked dataset that provides a public SPARQL endpoint where users can answer queries that combine the visual similarity of images from Wikimedia Commons and semantic information from existing knowledge-bases. Our demo will show example queries that capture the potential of the current data stored in IMGpedia. We also plan to discuss potential use-cases for the dataset and ways in which we can improve the quality of the information it captures and the expressiveness of its queries.
IMGpedia Dataset
IMGpedia is a large-scale linked dataset that incorporates visual information of the images from the Wikimedia Commons dataset: it brings together descriptors of the visual content of 15 million images, 450 million visual-similarity relations between those images, links to image metadata from DBpedia Commons, and links to the DBpedia resources associated with the images.