[Epic] Evaluate alternatives to Blazegraph
Open, High, Public

Description

Since the Blazegraph project no longer seems to be active (last commit 2 years ago at https://github.com/blazegraph/database), we need to evaluate whether we want to switch to a graph DB project that is more actively supported and developed.

The requirements should be:

  • Full SPARQL 1.1 support, including SPARQL Update
  • Open source
  • Can load and run queries on full Wikidata database

Event Timeline

I imported the wikidata-DB into neo4j and it works quite well.

Can you be more specific? When we tested Wikidata on Neo4j several years ago, it worked in principle, but the performance was unacceptable. In particular, Neo4j does not efficiently support all kinds of JOIN operations that occur in typical SPARQL queries. Could you time a few SPARQL queries on your Neo4j instance and report the results here? That would be very helpful. For starters, you can simply pick some example queries from https://query.wikidata.org

Create a query set (extracted from the WDQS log) and a Wikidata subset of data to benchmark against graph databases (such as TPC).
Ask graph database vendors to test their products and publish the results to the community.

See
http://tpc.org/
https://github.com/socialsensor/graphdb-benchmarks

Consider relational databases with particular schema as graph backends

Daniel Hernández, Aidan Hogan, Cristian Riveros, Carlos Rojas, Enzo Zerega. "Querying Wikidata: Comparing SPARQL, Relational and Graph Databases". In the Proceedings of the 15th International Semantic Web Conference (ISWC), Kobe, Japan, October 17–21, 2016

Consider property graph back ends such as Neo4J and TigerGraph

Kovács, T., Simon, G., & Mezei, G. (2019). Benchmarking Graph Database Backends—What Works Well with Wikidata?. Acta Cybernetica, 24(1), 43-60. https://doi.org/10.14232/actacyb.24.1.2019.5

@So9q @AndreasKuczera @Versant.2612 why are you polluting the thread by suggesting projects/products that clearly do not meet the requirements? This includes Ontop, JanusGraph, TigerGraph, Neo4J etc.

“Pollution” is a strong word that comes off as needlessly hostile. It seems
prudent and rational to get a broad sense of the landscape (and where it is
moving). The Wikidata data model is not trivially 1:1 with RDF/SPARQL and
there may be scope for hybrid solutions.

@DanBri I would agree if this issue were not specifically about "alternatives to BlazeGraph" (an RDF triplestore), with explicit requirements. Finding such an alternative will already be difficult if not impossible, mostly due to the open-source requirement.

If you want non-RDF solutions to be evaluated as well, then I think a separate issue should be created. But I doubt it has a chance of being completed within any reasonable timeframe.

Hey all, apologies if this has already been covered elsewhere, but I'm curious why Apache Jena Fuseki is not on the list of Blazegraph alternatives? It seems to meet the requirements. We've used Jena from time to time and really like it (it has a lot of features out of the box), but if there's been a previous analysis and it was not worth considering for WDQS's needs, I'd love to learn from that.

I think only because so far no-one has brought it up. Please add a ticket for it with additional information.

I am taking the liberty to pollute the thread with a reference to "MillenniumDB: A Persistent, Open-Source, Graph Database" https://arxiv.org/pdf/2111.01540.pdf from November 2021. MillenniumDB may have some serious limitations with respect to the requirements set out here, but interestingly they write "However, MillenniumDB was designed with the complete version of Wikidata – including qualifiers, references, etc. – in mind." and their benchmarks seem strong. They compare against Blazegraph, Jena, Virtuoso and Neo4J.

Thanks for the pointer! Here are my first impressions from reading the paper:

  1. The engine is based on ideas similar to QLever's. However, QLever has been around for 5 years already, which the authors fail to acknowledge. I am sure they didn't do it on purpose though. I wrote to them.
  2. Like QLever, their engine is currently read-only and does not support SPARQL Update operations. Given the design of their engine, this is not something that will be easy to add.
  3. Their engine is currently very far away from SPARQL 1.1 support. In the current version, even basic features like GROUP BY and mathematical expressions are missing. I am not sure whether they actually strive for SPARQL 1.1 support, since the motivation expressed in the paper goes more in the direction of a more general data model that is independent of a particular query language. Anyway, adding full SPARQL 1.1 support would be a lot of work, as we know from experience.
  4. I find the evaluation misleading. Right at the beginning of their evaluation section, in Section 5.1, they claim that their engine is 30 times faster than Virtuoso for very simple queries (consisting of a single triple). We know Virtuoso very well and have compared it with QLever extensively. Virtuoso is a very mature and efficient engine and hard to beat, even on more complex queries. On simple queries, there are natural barriers to what can be achieved, and Virtuoso often (though not always) does the optimal thing. I think the authors either did not configure Virtuoso optimally or they stumbled on an artefact without being aware of it. Namely, Virtuoso is rather slow when it has to produce a very large output. That is not a weakness of Virtuoso's query processing engine, but of the way it translates its internal IDs to output IRIs and literals.

@KingsleyIdehen maybe you can provide some feedback concerning point 4 above, in particular the last two sentences.

We can be objective about feature support.

The working group tests for SPARQL 1.1 (updated for RDF 1.1) are maintained by the community: https://w3c.github.io/rdf-tests/.

They have reasonable coverage of features.

In addition, engines can and do support more of "XPath and XQuery Functions and Operators 3.1" than the minimum required by the SPARQL REC.

https://www.w3.org/TR/xpath-functions-3/

also, any thoughts on https://cambridgesemantics.com/anzograph/ ?

"Horizontally Scalable Graph Database Built for Online Analytics and Data Harmonization"

It looks like AnzoGraph could handle 1 trillion triples back in 2016.

Are there any timescale/triple scale goals currently being stated?

With a baseline minimum of 1B triples/3 months, and assuming a 5-10 year goal for any choice, that gets to 36B-56B triples minimum and it could easily exceed that.
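
The arithmetic behind that range can be written out as a small sketch. The 16B starting size is an assumption: it is not stated in this thread, but it is the value implied by subtracting 5 and 10 years of growth (20B and 40B) from the quoted 36B-56B. The function name `projected_triples` is hypothetical, and the projection is purely linear.

```python
def projected_triples(current, per_quarter, years):
    """Linear projection: current size plus four quarters of growth per year."""
    return current + per_quarter * 4 * years


# Assumption: starting size implied by the 36B-56B range, not stated in the thread.
BASE = 16_000_000_000
# "1B triples/3 months" from the comment above.
PER_QUARTER = 1_000_000_000

for years in (5, 10):
    total = projected_triples(BASE, PER_QUARTER, years)
    print(f"{years} years: {total / 1e9:.0f}B triples")
```

If growth accelerates rather than staying linear, these figures are a lower bound, which matches the "minimum" wording above.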

Query performance is an important point to consider: I found a query that runs one million times slower in one database engine than in another.

Claims without evidence, such as that quoted above, are generally not helpful for evaluations such as this.

It would be helpful to all if you would post the query you describe, as well as the details of your testing — such as which engine(s) you tested (including name and version), on which OS (including version), on what hardware (including processor, bitness, and RAM), whether the engine & data were in a "hot/warm" state or just past cold start, etc.

Testing your query against current public endpoints and posting details of those results would also be helpful.
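
As a rough illustration of what such a report could contain, here is a minimal timing sketch using only the Python standard library. The helper names (`run_sparql`, `time_query`), the User-Agent string, and the choice of five repeats are assumptions made for this example. The endpoint URL is the public WDQS SPARQL endpoint. Treating the first run as "cold" and later runs as "warm" is only an approximation, since a public endpoint may already have the data cached.

```python
import json
import platform
import statistics
import time
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def run_sparql(query, endpoint=WDQS_ENDPOINT, timeout=60):
    """Send a SPARQL query via form-encoded POST and return the parsed JSON results."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=data,
        headers={
            "Accept": "application/sparql-results+json",
            # Hypothetical identifier; replace with your own contact details.
            "User-Agent": "sparql-timing-sketch/0.1 (example)",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def time_query(run, query, repeats=5):
    """Time `run(query)` several times and return a small report dictionary."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run(query)
        timings.append(time.perf_counter() - start)
    warm = timings[1:]
    return {
        "runs": repeats,
        "first_run_s": timings[0],
        "median_later_runs_s": statistics.median(warm) if warm else None,
        "os": platform.platform(),
        "machine": platform.machine(),
    }
```

Calling `time_query(run_sparql, "SELECT * WHERE { ?s ?p ?o } LIMIT 10")` would time that query against the public endpoint. The engine name and version, RAM, and dataset version still need to be recorded by hand, since the script cannot see the server side.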

https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update mentions this task so maybe posting this request here will be effective.

There are several topics in the discussion page https://www.wikidata.org/wiki/Wikidata_talk:SPARQL_query_service/WDQS_backend_update that have been present for some time but that have not received any response.

This is especially concerning as the page states: "This page is the central hub for updates, background information, and community discussions related to the migration. "

Hi @Pfps, the topics on the backend update page are being responded to now. Apologies for missing them: there was a large amount of ownership handoff of documentation when our new team started, and subscribing to this page to listen for comments from the community got lost in the shuffle. We are listening to the page now to ensure it can reliably serve as the central hub for updates that we described it as.

The newsletter page you shared is likely marked as inactive following the publication of a newer version (see here). We slightly revamped this process in the last month, which may have also caused some confusion. Please see the page linked in this comment for our current monthly report and details on how to subscribe to upcoming newsletters.

OK, there is a newer newsletter. But that's not a newer version of the information in the November newsletter, as far as I can tell. The wording in the inactive banner contains: "Either the page is no longer relevant or consensus on its purpose has become unclear." I don't think that either of these is the case, and those who see the wording are likely to be misled.

You are totally right @Pfps, and thanks for flagging that. I have removed the inactive banner.