Introduction
valuable data about all sorts of topics that could benefit a large community, provided it is made accessible as Linked Open Data. Hence the research question we address herein: how can arbitrary MongoDB documents be accessed with SPARQL?
Exposing legacy data as RDF has been the object of much research in recent years, usually following one of two approaches: either materialization, i.e. the translation of all legacy data into an RDF graph at once, or on-the-fly translation of SPARQL queries into the target query language. Materialization is often difficult in practice for big datasets, and costly when data freshness is at stake. Several methods have been proposed to achieve SPARQL access to
relational data, either in the context of RDB-backed RDF stores [8,21,11] or using arbitrary relational schemas [4,23,17,18]. R2RML [9], the W3C RDB-to-RDF
mapping language recommendation is now a well-accepted standard and several
SPARQL-to-SQL rewriting approaches hinge upon it [23,17,19]. Other solutions
intend to map XML [3,2] or CSV4 data to RDF. RML [10] tackles the mapping of
heterogeneous data formats such as CSV/TSV, XML and JSON. xR2RML [14]
is an extension of R2RML and RML addressing the mapping of an extensible
scope of databases to RDF. Regarding MongoDB specifically, Tomaszuk proposed a solution using MongoDB as an RDF triple store [22]. The SPARQL query translation that he proposed is closely tied to the data schema and does not apply to arbitrary documents. MongoGraph5 is an extension of the AllegroGraph triple store to query arbitrary MongoDB documents with SPARQL. Similarly to the Direct Mapping [1], the approach builds an ad-hoc ontology (e.g. each JSON field name is turned into a predicate) and hardly supports the reuse of existing ontologies. More in line with our work, Botoeva et al. recently proposed a generalization of the OBDA principles to MongoDB [6]. They describe a two-step rewriting process of SPARQL queries into the MongoDB aggregate query language. In the last section we analyse in further detail the relationship between their approach and ours.
In this paper we propose a method to query arbitrary MongoDB documents
using SPARQL. We rely on xR2RML for the mapping of MongoDB documents
to RDF, allowing for the use of classes and predicates from existing (domain)
ontologies. In section 2 we briefly describe the xR2RML mapping language.
Section 3 defines a database-independent abstract query language, and summarizes a generic method to rewrite SPARQL queries into this language under
xR2RML mappings. Then section 4 presents our method to translate abstract
queries into MongoDB queries. Finally in section 5 we conclude by emphasizing
some technical issues and highlighting perspectives.
4 http://www.w3.org/2013/csvw/wiki
5 http://franz.com/agraph/support/documentation/4.7/mongo-interface.html
6 http://goessner.net/articles/JsonPath/
{ "project": "Finance & Billing", "code": "fin",
  "teams": [
    [ {"name": "P. Russo"}, {"name": "F. Underwood"} ],
    [ {"name": "R. Danton"}, {"name": "E. Meetchum"} ] ] },
{ "project": "Customer Relation", "code": "crm",
  "teams": [
    [ {"name": "R. Posner"}, {"name": "H. Dunbar"} ] ] }
Listing 1.1. MongoDB collection "projects" containing two documents
Various methods have been defined to translate SPARQL queries into another query language; they are generally tailored to the expressiveness of the target language. Notably, the rich expressiveness of SQL and XQuery makes it possible to define semantics-preserving SPARQL rewriting methods [8,2]. By
contrast, NoSQL databases typically trade off expressiveness for scalability and
fast retrieval of denormalised data. For instance, many of them hardly support
joins. Therefore, to envisage the translation of SPARQL queries in the general
case, we propose a two-step method. Firstly, a SPARQL query is rewritten into
a pivot abstract query under xR2RML mappings, independently of any target database (illustrated by step 1 in Figure 1). Secondly, the pivot query is
translated into concrete database queries based on the specific target database
capabilities and constraints. In this paper we focus on the application of the second step to the specific case of MongoDB. The rest of this section summarizes
the first step to provide the reader with appropriate background. A complete
description is provided in [16].
The grammar of our pivot query language is depicted in Listing 1.3. Operators inner join on, left outer join on and union are entailed by the
dependencies between graph patterns of the SPARQL query, and SPARQL filters
involving variables shared by several triple patterns result in a filter operator.
The computation of these operators shall be delegated to the target database if
it supports them (i.e. if the target query language has equivalent operators like
SQL), or to the query processing engine otherwise (e.g. MongoDB cannot process joins). Each SPARQL triple pattern tp is translated into a union of atomic
abstract queries (<AtomicQuery>), under the set of xR2RML mappings likely to
generate triples matching tp. Components of an atomic abstract query are as
follows:
- From is the mapping's logical source, i.e. the database query string (xrr:query)
and its optional iterator (rml:iterator).
- Project is the set of xR2RML references that must be projected, i.e. returned as
part of the query results. In SQL, projecting an xR2RML reference simply means
that the column name shall appear in the SELECT clause. As for MongoDB, this
amounts to projecting the JSON fields mentioned in the JSONPath reference.
- Where is a conjunction of abstract conditions entailed by matching each term
of triple pattern tp with its corresponding term map in an xR2RML mapping:
the subject of tp is matched with the subject map of the mapping, the predicate
with the predicate map and the object with the object map. Three types of
condition may be created:
(i) a SPARQL variable in the triple pattern is turned into a not-null condition on
the xR2RML reference corresponding to that variable in the term map, denoted
by isNotNull(<xR2RML reference>);
(ii) a constant triple pattern term (IRI or literal) is turned into an equality
condition on the xR2RML reference corresponding to that RDF term in the
term map, denoted by equals(<xR2RML reference>, value);
(iii) a SPARQL filter condition f about a SPARQL variable is turned into a
filter condition, denoted by sparqlFilter(<xR2RML reference>, f).
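These components can be sketched as a small data structure. The following Python classes (the class and field names are ours, not part of xR2RML) illustrate one possible encoding of an atomic abstract query and its three condition types:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Condition:
    # kind is one of "isNotNull", "equals", "sparqlFilter"
    kind: str
    reference: str           # an xR2RML reference, e.g. a JSONPath expression
    value: object = None     # constant for "equals", SPARQL filter for "sparqlFilter"

@dataclass
class AtomicQuery:
    source: str                                           # From: the xrr:query string
    iterator: Optional[str] = None                        # optional rml:iterator
    project: List[str] = field(default_factory=list)      # Project
    where: List[Condition] = field(default_factory=list)  # Where (a conjunction)

# Example instance: an atomic query with a not-null and an equality condition.
q = AtomicQuery(
    source='db.projects.find({})',
    project=['$.code'],
    where=[Condition('isNotNull', '$.code'),
           Condition('equals', '$.teams[0,1][(@.length-1)].name', 'H. Dunbar')])
```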
The triple pattern, denoted by tp, is translated into the atomic abstract query
{From, Project, Where}. From is the query in the logical source of mapping
<#TmLeader>, i.e. "db.projects.find({})". The details of how Project is calculated are out of the scope of this paper; let us just note that, since the values of variable ?proj (the subject of tp) shall be retrieved, only the subject map reference is projected, i.e. the JSONPath expression $.code. The Where part is calculated
as follows:
- tp's subject, variable ?proj, is matched with <#TmLeader>'s subject map; this
entails condition C1 : isNotNull($.code).
- tp's object, "H. Dunbar", is matched with <#TmLeader>'s object map; this entails
condition C2 : equals($.teams[0,1][(@.length-1)].name, "H. Dunbar").
Thus, the SPARQL query is rewritten into the atomic abstract query below:
{ From:    {"db.projects.find({})"},
  Project: {$.code AS ?proj},
  Where:   { isNotNull($.code),
             equals($.teams[0,1][(@.length-1)].name, "H. Dunbar") } }
The JSON documents needed to answer this abstract query shall verify conditions
C1 and C2. In the next section, we elaborate on the method for rewriting
such conditions into concrete MongoDB queries.
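To make conditions C1 and C2 concrete, the following Python sketch evaluates them directly over the two documents of Listing 1.1; plain dictionary navigation stands in for JSONPath evaluation, and the function name is ours:

```python
# The two documents of the "projects" collection (Listing 1.1).
projects = [
    {"project": "Finance & Billing", "code": "fin",
     "teams": [[{"name": "P. Russo"}, {"name": "F. Underwood"}],
               [{"name": "R. Danton"}, {"name": "E. Meetchum"}]]},
    {"project": "Customer Relation", "code": "crm",
     "teams": [[{"name": "R. Posner"}, {"name": "H. Dunbar"}]]},
]

def satisfies_conditions(doc):
    # C1: isNotNull($.code)
    if doc.get("code") is None:
        return False
    # C2: equals($.teams[0,1][(@.length-1)].name, "H. Dunbar"),
    # i.e. the last member of team 0 or of team 1 is named "H. Dunbar".
    for i in (0, 1):
        teams = doc.get("teams", [])
        if i < len(teams) and teams[i] and teams[i][-1].get("name") == "H. Dunbar":
            return True
    return False

results = [d["code"] for d in projects if satisfies_conditions(d)]
```

Only the second document satisfies both conditions, so `results` holds the single code "crm".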
In this section we briefly describe the MongoDB query language, then we define
rules to transform an atomic abstract query into an abstract representation of
a MongoDB query (step 2 in Figure 1). Finally, we define additional rules to
optimize and rewrite an abstract representation of a MongoDB query into a
union of executable MongoDB queries (step 3 in Figure 1).
4.1
AND(<exp1>, <exp2>, ...)       →  $and:[<exp1>, <exp2>, ...]
OR(<exp1>, <exp2>, ...)        →  $or:[<exp1>, <exp2>, ...]
WHERE(<JavaScript exp>)        →  $where:"<JavaScript exp>"
ELEMMATCH(<exp1>, <exp2>...)   →  $elemMatch:{<exp1>, <exp2>...}
FIELD(p1) ... FIELD(pn)        →  "p1. ... .pn":
SLICE(<exp>, <number>)         →  <exp>:{$slice:<number>}
COND(equals(v))                →  $eq:v
COND(isNotNull)                →  $exists:true, $ne:null
COND(isNull)                   →  $eq:null
NOT_EXISTS(<exp>)              →  <exp>:{$exists:false}
COMPARE(<exp>, <op>, <v>)      →  <exp>:{<op>:<v>}
NOT_SUPPORTED                  →  (no concrete translation; see section 4.3)
CONDJS(equals(v))              →  == v
CONDJS(isNotNull)              →  != null
Listing 1.4. Abstract MongoDB query representation and translation to a concrete
query string
MongoDB provides two query mechanisms: the find method and the aggregate
pipeline, which allows for richer aggregate computations. As a first approach,
this work considers the find query method, hereafter called the MongoDB query
language. As an illustration, let us consider the following query:
db.projects.find(
  {"teams.0": {$elemMatch: {"age": {$gt: 30}}}},
  {"code": 1})
It retrieves documents from collection projects, whose first team (array "teams"
at index 0) has at least one member (operator $elemMatch) over 30 years old (operator $gt). The projection parameter, {"code":1}, states that only the "code"
field of each matching document must be returned.
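The semantics of this query can be illustrated by evaluating it in plain Python over a small hypothetical collection; the age fields are invented for the example, and this is an informal sketch, not MongoDB's actual matching engine:

```python
def matches_query(doc):
    """Evaluate {"teams.0": {$elemMatch: {"age": {$gt: 30}}}} in plain Python:
    the dotted path "teams.0" designates the array element at index 0, and
    $elemMatch succeeds if at least one element of that nested array
    satisfies the embedded condition age > 30."""
    teams = doc.get("teams", [])
    if not teams:
        return False
    return any(member.get("age", 0) > 30 for member in teams[0])

# Hypothetical documents: the "age" fields are invented for this example.
docs = [
    {"code": "fin", "teams": [[{"name": "P. Russo", "age": 45},
                               {"name": "F. Underwood", "age": 29}]]},
    {"code": "crm", "teams": [[{"name": "R. Posner", "age": 28}]]},
]
# Projection {"code": 1}: keep only the "code" field of matching documents.
result = [{"code": d["code"]} for d in docs if matches_query(d)]
```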
The MongoDB documentation7 provides a rich description of the query language which, however, lacks formal semantics. Recently, attempts were made to clarify this semantics while underlining some limitations and ambiguities: [5] focuses mainly on the aggregate query and ignores some of the operators we use in our translation, such as $where, $elemMatch, $regex and $size. On the other hand, [13] describes the find query, yet some restrictions on the operator $where are not formalized. Hence, in [15] we specified the grammar of the subset of
the query language that we consider. We also defined an abstract representation of MongoDB queries, which allows for convenient manipulation during the query construction and optimization phases. Listing 1.4 details the constructs of this representation and their equivalent concrete query strings. In the compare clause
definition, <op> stands for one of the MongoDB comparison operators: $eq, $ne,
$lt, $lte, $gt, $gte, $size and $regex. The NOT_SUPPORTED clause helps keep
track of the parts of the abstract query that cannot be translated into an equivalent
MongoDB query element; it shall be used when rewriting the abstract query into
a concrete query (section 4.3).
7 https://docs.mongodb.org/manual/tutorial/query-documents/
4.2
In the current state of this work we do not consider SPARQL filter conditions.
Fig. 2. Translation of a condition on a JSONPath expression into an abstract MongoDB query (function trans)

R0 trans($<JP>, <cond>) → trans(<JP>, <cond>)
R1 trans(∅, <cond>) → COND(<cond>)
R2 Field alternative (a) or array index alternative (b)
   (a) trans(<JP:F>["p","q",...]<JP>, <cond>) →
       OR(trans(<JP:F>.p<JP>, <cond>), trans(<JP:F>.q<JP>, <cond>), ...)
   (b) trans(<JP:F>[i,j,...]<JP>, <cond>) →
       OR(trans(<JP:F>.i<JP>, <cond>), trans(<JP:F>.j<JP>, <cond>), ...)
R3 Heading field alternative (a) or heading array index alternative (b)
   (a) trans(["p","q",...]<JP>, <cond>) →
       OR(trans(.p<JP>, <cond>), trans(.q<JP>, <cond>), ...)
   (b) trans([i,j,...]<JP>, <cond>) →
       OR(trans(.i<JP>, <cond>), trans(.j<JP>, <cond>), ...)
R4 JavaScript filter on array elements, e.g. $.p[?(@.q)].r
   trans([?(<bool expr>)]<JP>, <cond>) →
       ELEMMATCH(trans(<JP>, <cond>), transJS(<bool expr>))
R5 Array slice: n last elements (a) or n first elements (b)
   (a) trans(<JP:F>[-<start>:]<JP>, <cond>) →
       trans(<JP:F>.*<JP>, <cond>) SLICE(<JP:F>, -<start>)
   (b) trans(<JP:F>[:<end>]<JP>, <cond>) →
       trans(<JP:F>.*<JP>, <cond>) SLICE(<JP:F>, <end>)
       trans(<JP:F>[0:<end>]<JP>, <cond>) →
       trans(<JP:F>.*<JP>, <cond>) SLICE(<JP:F>, <end>)
R6 Calculated array index, e.g. $.p[(@.length - 1)].q
   (a) trans(<JP1>[(<num expr>)]<JP2>, <cond>) → NOT_SUPPORTED
       if <JP1> contains a wildcard or a JavaScript filter expression
   (b) trans(<JP:F>[(<num expr>)], <cond>) →
       AND(EXISTS(<JP:F>),
           WHERE(this<JP:F>[replaceAt(this<JP:F>, <num expr>)] CONDJS(<cond>)))
   (c) trans(<JP:F1>[(<num expr>)]<JP:F2>, <cond>) →
       AND(EXISTS(<JP:F1>),
           WHERE(this<JP:F1>[replaceAt(this<JP:F1>, <num expr>)]<JP:F2> CONDJS(<cond>)))
R7 Heading wildcard
   (a) trans(.*<JP>, <cond>) → ELEMMATCH(trans(<JP>, <cond>))
   (b) trans([*]<JP>, <cond>) → ELEMMATCH(trans(<JP>, <cond>))
R8 Heading field name or array index
   (a) trans(.p<JP>, <cond>) → FIELD(p) trans(<JP>, <cond>)
   (b) trans(["p"]<JP>, <cond>) → FIELD(p) trans(<JP>, <cond>)
   (c) trans([i]<JP>, <cond>) → FIELD(i) trans(<JP>, <cond>)
R9 No other rule matched: expression <JP> is not supported
   trans(<JP>, <cond>) → NOT_SUPPORTED
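As an illustration, the following Python sketch implements a small subset of the trans function, roughly rules R0, R1, R3(b) and R8, over simplified JSONPath strings; the tuple encoding of the abstract query tree is our own:

```python
import re

def trans(jp, cond):
    """Translate a simplified JSONPath expression and a condition into an
    abstract MongoDB query tree. Covers only rules R0 (root), R1 (empty
    path), R3(b) (index alternative) and R8 (field name / single index)."""
    if jp.startswith("$"):                    # R0: strip the JSONPath root
        return trans(jp[1:], cond)
    if jp == "":                              # R1: empty path -> condition
        return ("COND", cond)
    m = re.match(r"^\.(\w+)(.*)$", jp)        # R8(a): heading field name
    if m:
        return ("FIELD", m.group(1), trans(m.group(2), cond))
    m = re.match(r"^\[(\d+(?:,\d+)*)\](.*)$", jp)
    if m:
        idxs, rest = m.group(1).split(","), m.group(2)
        if len(idxs) == 1:                    # R8(c): single heading index
            return ("FIELD", idxs[0], trans(rest, cond))
        # R3(b): index alternative [i,j,...] -> OR of single-index branches
        return ("OR", *[trans("[%s]%s" % (i, rest), cond) for i in idxs])
    return ("NOT_SUPPORTED",)                 # R9: anything else

tree = trans("$.teams[0,1]", "isNotNull")
```

On "$.teams[0,1]" this yields a FIELD(teams) node over an OR of the two index branches, mirroring the rule derivation.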
with a NOT_SUPPORTED clause. This way, the not-supported issue is raised up
to the parent clause and shall be managed at the next iteration. Iteratively, the
NOT_SUPPORTED clause is raised up until it is eventually removed (cases AND
and ELEMMATCH above), or it ends up in the top-level query. The latter is the
worst case, in which the query shall retrieve all documents.
- O5(e): Similarly to O5(c), a sequence of FIELD clauses followed by a NOT_SUPPORTED
clause must be replaced with a NOT_SUPPORTED clause to raise the issue up to
the parent clause.
Pulling up WHERE Clauses. By construction, rule R6 ensures that WHERE
clauses cannot be nested in an ELEMMATCH, but they may show up in AND and OR
clauses. Besides, rules O1 to O4 flatten nested OR and AND clauses, and merge
sibling WHERE clauses. Therefore, a WHERE clause may be either in the top-level
query (in which case the query is executable) or it may show up in one of the following
patterns (where W stands for a WHERE clause):
OR(...,W,...), AND(...,W,...), OR(..., AND(...,W,...), ...), AND(..., OR(...,W,...), ...).
In such patterns, rules W1 to W6 (Figure 4) address issue (ii) by pulling up
where clauses into the top-level query. Here is an insight into the approach:
- Since OR(C, W) is not a valid MongoDB query, it is replaced with the query UNION(C,
W), which has the same semantics: C and W are evaluated separately against the
database, and the union is computed later on by the query processing engine.
- AND(C, OR(D, W)) is rewritten into OR(AND(C,D), AND(C,W)), and the OR is replaced
with a union: UNION(AND(C,D), AND(C,W)). Since a logical AND implicitly applies
to the top-level terms, we can finally rewrite the query into UNION((C,D), (C,W)),
which is valid since W now shows up in a top-level query.
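The two rewriting patterns above can be sketched in Python as follows; the tuple encoding and the function names are ours, and only the OR(C, W) and AND(C, OR(D, W)) cases are handled:

```python
def contains_where(n):
    """True if abstract query node n contains a WHERE clause anywhere."""
    return isinstance(n, tuple) and (n[0] == "WHERE" or any(contains_where(a) for a in n[1:]))

def pull_up(node):
    """One rewriting step pulling a WHERE clause up to the top level:
    AND(C, OR(D, W)) -> OR(AND(C, D), AND(C, W)); then any top-level OR
    with a WHERE somewhere in a branch becomes a UNION, whose branches
    are evaluated separately and merged by the query processing engine."""
    op, args = node[0], list(node[1:])
    if op == "AND":
        for i, a in enumerate(args):
            if isinstance(a, tuple) and a[0] == "OR" and contains_where(a):
                rest = args[:i] + args[i + 1:]
                # Distribute the AND over the OR's branches.
                node = ("OR", *[("AND", *rest, b) for b in a[1:]])
                op, args = node[0], list(node[1:])
                break
    if op == "OR" and any(contains_where(a) for a in args):
        return ("UNION", *args)
    return node

q1 = pull_up(("OR", "C", ("WHERE", "js")))
q2 = pull_up(("AND", "C", ("OR", "D", ("WHERE", "js"))))
```

Here q1 becomes UNION(C, W) and q2 becomes UNION(AND(C,D), AND(C,W)), matching the two patterns described above.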
UNION(
  (FIELD(code) COND(isNotNull), EXISTS(.teams.0), WHERE(JScond0)),
  (FIELD(code) COND(isNotNull), EXISTS(.teams.1), WHERE(JScond1)))
The abstract query can now be rewritten into a union of two valid queries:
{"code": {$exists:true, $ne:null}, "teams.0": {$exists:true},
 $where: "this.teams[0][this.teams[0].length-1].name == 'H. Dunbar'"}
{"code": {$exists:true, $ne:null}, "teams.1": {$exists:true},
 $where: "this.teams[1][this.teams[1].length-1].name == 'H. Dunbar'"}
The first query retrieves the document below, whereas the second query returns
no document.
{ "project": "Customer Relation", "code": "crm",
  "teams": [ [ {"name": "R. Posner"}, {"name": "H. Dunbar"} ] ] }
Finally, the application of triples map <#TmLeader> to the query result produces
one RDF triple that matches the triple pattern tp:
<http://example.org/project/crm> ex:teamLeader "H. Dunbar".
although xR2RML is slightly more flexible: class names (in triples ?x rdf:type
A) and predicates can be built from database values, whereas they are fixed in [6],
and xR2RML allows turning an array field into an RDF collection or container.
To deal with the tree form of JSON documents we use JSONPath expressions.
This avoids the definition of a relational view over the database, but it also
comes with additional complexity in the translation process. Finally, [6] produces MongoDB aggregate queries, with the advantage that a SPARQL 1.0 query
may be translated into a single target query, thus delegating all the processing
to MongoDB. Yet, in practice, some aggregate queries may be very inefficient,
hence the need to decompose RA queries into sub-queries, as underlined by the
authors. Our approach produces find queries that are less expressive but whose
performance is easier to anticipate, thus putting a higher burden on the query
processing engine (joins, some unions and filtering). In the future, it would be
interesting to characterise mappings with respect to the type of query that shall
perform best (single vs. multiple separate queries, find vs. aggregate). A lead
may be to leverage query plan optimization techniques such as the bind join [12] and
the join reordering methods applied in the context of distributed SPARQL query
engines [20].
More generally, the NoSQL trend pragmatically gave up on properties such
as consistency and rich query features in exchange for high throughput, high
availability and horizontal elasticity. Therefore, it is likely that the hurdles we
encountered with MongoDB will arise with other NoSQL databases.
Implementation and evaluation. To validate our approach we have developed a prototype implementation9 available under the Apache 2 open source
licence. Further developments on query optimization are ongoing, and in the
short term we intend to run performance evaluations. Besides, we are working on two real-life use cases. Firstly, in the context of the Zoomathia research
project10 , we proposed to represent a taxonomic reference, designed to support
studies in Conservation Biology, as a SKOS thesaurus [7]. It is stored in a MongoDB database, and we are in the process of testing the SPARQL access to
that thesaurus using our prototype. Secondly, we are having discussions with
researchers in the fields of ecology and agronomy. They intend to explore the
added value of Semantic Web technologies using a large MongoDB database of
phenotype information. This context would be a significant and realistic use case
of our method and prototype.
References
1. M. Arenas, A. Bertails, E. Prud'hommeaux, and J. Sequeda. A Direct Mapping of
Relational Data to RDF, 2012.
2. N. Bikakis, C. Tsinaraki, I. Stavrakantonakis, N. Gioldasis, and S. Christodoulakis.
The SPARQL2XQuery interoperability framework. WWW, 18(2):403–490, 2015.
9 https://github.com/frmichel/morph-xr2rml
10 http://www.cepam.cnrs.fr/zoomathia
3. S. Bischof, S. Decker, T. Krennwallner, N. Lopes, and A. Polleres. Mapping between
RDF and XML with XSPARQL. J. Data Semantics, 1(3):147–185, 2012.
4. C. Bizer and R. Cyganiak. D2R server - Publishing Relational Databases on the
Semantic Web. In ISWC, 2006.
5. E. Botoeva, D. Calvanese, B. Cogrel, M. Rezk, and G. Xiao. A formal presentation
of MongoDB (Extended version). Technical report, 2016.
6. E. Botoeva, D. Calvanese, B. Cogrel, M. Rezk, and G. Xiao. OBDA beyond relational DBs: A study for MongoDB. In Int. Ws. DL 2016, volume 1577, 2016.
7. C. Callou, F. Michel, C. Faron-Zucker, C. Martin, and J. Montagnat. Towards
a Shared Reference Thesaurus for Studies on History of Zoology, Archaeozoology
and Conservation Biology. In SW For Scientific Heritage, Ws. of ESWC, 2015.
8. A. Chebotko, S. Lu, and F. Fotouhi. Semantics preserving SPARQL-to-SQL translation. Data & Knowledge Engineering, 68(10):973–1000, 2009.
9. S. Das, S. Sundara, and R. Cyganiak. R2RML: RDB to RDF mapping language,
2012.
10. A. Dimou, M. Vander Sande, P. Colpaert, R. Verborgh, E. Mannens, and R. Van de
Walle. RML: A generic language for integrated RDF mappings of heterogeneous
data. In LDOW, 2014.
11. B. Elliott, E. Cheng, C. Thomas-Ogbuji, and Z. M. Ozsoyoglu. A complete translation from SPARQL into efficient SQL. In IDEAS '09, pages 31–42. ACM, 2009.
12. L. Haas, D. Kossmann, E. Wimmers, and J. Yang. Optimizing Queries across
Diverse Data Sources. In VLDB, pages 276–285, 1997.
13. A. Husson. Une sémantique statique pour MongoDB. In 25th Journées Francophones des Langages Applicatifs (JFLA), pages 77–92, 2014.
14. F. Michel, L. Djimenou, C. Faron-Zucker, and J. Montagnat. Translation of Relational and Non-Relational Databases into RDF with xR2RML. In WebIST, pages
443–454, 2015.
15. F. Michel, C. Faron-Zucker, and J. Montagnat. Mapping-based SPARQL access
to a MongoDB database. Technical report, CNRS, 2015. https://hal.archives-ouvertes.fr/hal-01245883.
16. F. Michel, C. Faron-Zucker, and J. Montagnat. A Generic Mapping-Based Query
Translation from SPARQL to Various Target Database Query Languages. In WebIST, 2016.
17. F. Priyatna, O. Corcho, and J. Sequeda. Formalisation and experiences of R2RML-based SPARQL to SQL query translation using Morph. In WWW, 2014.
18. M. Rodríguez-Muro, R. Kontchakov, and M. Zakharyaschev. Ontology-based data
access: Ontop of databases. In The Semantic Web – ISWC 2013. Springer, 2013.
19. M. Rodríguez-Muro and M. Rezk. Efficient SPARQL-to-SQL with R2RML mappings. J. Web Semantics, 33:141–169, 2015.
20. A. Schwarte, P. Haase, K. Hose, R. Schenkel, and M. Schmidt. FedX: Optimization
techniques for federated query processing on Linked Data. In ISWC, 2011.
21. J. F. Sequeda and D. P. Miranker. Ultrawrap: SPARQL execution on relational
data. J. Web Semantics, 22:19–39, 2013.
22. D. Tomaszuk. Document-oriented triplestore based on RDF/JSON. In Logic,
philosophy and computer science, pages 125–140. University of Bialystok, 2010.
23. J. Unbehauen, C. Stadler, and S. Auer. Accessing relational data on the web with
SparqlMap. In Semantic Technology, pages 65–80. Springer, 2013.