
Wikidata Query Service

[Diagram: Wikidata Query Service components]

Wikidata Query Service is the Wikimedia implementation of a SPARQL server, based on the Blazegraph engine, that serves queries for Wikidata and other data sets. A more detailed description is available in the User Manual.

More information on technical interactions with Query Services is available.

Project | Query GUI | User docs
wikidata-main | https://query-main.wikidata.org/ (or https://query.wikidata.org/) | Wikidata:SPARQL query service - main graph
wikidata-scholarly | https://query-scholarly.wikidata.org/ | Wikidata:SPARQL query service - scholarly graph
wikidata-legacy-full | https://query-legacy-full.wikidata.org/ | Wikidata:SPARQL query service - legacy full graph (temporary service)
Linked Data Fragment Server | https://query.wikidata.org/bigdata/ldf | To be added
Wikimedia Commons | https://commons-query.wikimedia.org/ | Commons:SPARQL query service
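
As a quick check that an endpoint is reachable, a small query can be sent to its /sparql path over HTTP. A minimal sketch against the public main-graph endpoint (the query itself is arbitrary):

# ask the public endpoint for a few results, as JSON
curl -G 'https://query.wikidata.org/sparql' \
     -H 'Accept: application/sparql-results+json' \
     --data-urlencode 'query=SELECT ?item WHERE { ?item wdt:P31 wd:Q146 } LIMIT 3'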

Development environment

You will need Java 8 / JDK 8 and Maven. Consider using an Ubuntu 22 or Debian 11 (Bullseye) virtual machine, or possibly a Windows Subsystem for Linux setup (e.g. Ubuntu 22 on WSL), if you don't already run one of these operating systems as your host OS.

If you are running Ubuntu 22, you should be able to find the JDK with apt-cache search openjdk-8, then install with sudo apt-get install <package name>.

If you are running Debian 11 Bullseye, where Java 8 isn't normally available, you may be able to follow the Wikimedia APT external access and security instructions, after which you should be able to install the following packages with dpkg -i <.deb file>:

https://apt.wikimedia.org/wikimedia/pool/component/jdk8/o/openjdk-8/openjdk-8-jre-headless_8u372-ga-1~deb11u1_amd64.deb

https://apt.wikimedia.org/wikimedia/pool/component/jdk8/o/openjdk-8/openjdk-8-jdk-headless_8u372-ga-1~deb11u1_amd64.deb
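
For example (a sketch; adjust the file names if the packages have been rebuilt since):

wget https://apt.wikimedia.org/wikimedia/pool/component/jdk8/o/openjdk-8/openjdk-8-jre-headless_8u372-ga-1~deb11u1_amd64.deb \
     https://apt.wikimedia.org/wikimedia/pool/component/jdk8/o/openjdk-8/openjdk-8-jdk-headless_8u372-ga-1~deb11u1_amd64.deb
# installing both in one dpkg call lets the jdk package resolve its dependency on the jre
sudo dpkg -i openjdk-8-jre-headless_8u372-ga-1~deb11u1_amd64.deb openjdk-8-jdk-headless_8u372-ga-1~deb11u1_amd64.deb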

Most likely, the following commands can be used to point your environment at the correct Java 8 / JDK 8 binaries for working with the WDQS software.

$ sudo update-alternatives --config java
$ sudo update-alternatives --config javac

You can run those commands again to change your Java / JDK environment as needed for other projects.

If the tips above don't work, consider trying an installer from https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html.

To install Maven, run sudo apt-get install maven. The packaged version on Ubuntu 22 and Debian 11 usually works just fine.


Code

The source code is in the Gerrit project wikidata/query/rdf. To start working on the Wikidata Query Service codebase, clone the repository:

git clone https://gerrit.wikimedia.org/r/wikidata/query/rdf 

or the GitHub mirror:

git clone https://github.com/wikimedia/wikidata-query-rdf.git

or if you want to push changes and have a Gerrit account:

git clone ssh://<your_username>@gerrit.wikimedia.org:29418/wikidata/query/rdf

Build

Then you can build the distribution package by running:

cd wikidata-query-rdf
./mvnw package

and the package will be in the dist/target directory. Or, to run the Blazegraph service from the development environment (e.g. for testing), use:

bash war/runBlazegraph.sh

Add "-d" option to run it in debug mode. If your build is failing cause your version of maven is a different one you can:

 mvn package -Denforcer.skip=true


To run the Updater, use:

 bash tools/runUpdate.sh

The build relies on Blazegraph packages which are stored in Archiva; their source is in the wikidata/query/blazegraph Gerrit repository. See the instructions on MediaWiki for the case where dependencies need to be rebuilt.

See also documentation in the source for more instructions.

The SPARQL endpoint is available under the /sparql path, which internally redirects to /bigdata/namespace/wdq/sparql.
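
For a quick smoke test of a locally running instance, something like the following should work (a sketch assuming the development server started by runBlazegraph.sh is listening on localhost:9999):

# ask the local Blazegraph instance for a handful of triples, as JSON
curl -G 'http://localhost:9999/bigdata/namespace/wdq/sparql' \
     -H 'Accept: application/sparql-results+json' \
     --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 3'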

Build Blazegraph

If changes are needed to the Blazegraph source, they should be checked into the wikidata/query/blazegraph repo. After that, a new Blazegraph sub-version should be built and WDQS switched to use it. The procedure to follow:

  1. Commit fixes (watch for extra whitespace changes!)
  2. Update README.wmf with descriptions of which changes were done against mainstream
  3. Blazegraph source in master branch will be on snapshot version, e.g. 2.1.5-wmf.4-SNAPSHOT - set it to non-snapshot: mvn versions:set -DnewVersion=2.1.5-wmf.4
  4. Make local build: mvn clean; bash scripts/mavenInstall.sh; mvn -f bigdata-war/pom.xml install -DskipTests=true
  5. Switch the Blazegraph version in the main pom.xml of the WDQS repo to 2.1.5-wmf.4 (do not push it yet!). Build and verify everything works as intended.
  6. Commit the version change in Blazegraph, push it to the main repo. Tag it with the same version and push the tag too.
  7. Run to deploy: mvn -f pom.xml -P deploy-archiva deploy -P Development; mvn -f bigdata-war/pom.xml -P deploy-archiva deploy -P Development; mvn -f blazegraph-war/pom.xml -P deploy-archiva deploy -P Development
  8. Commit the version change in WDQS, and push to gerrit. Ensure the tests pass (this would also ensure Blazegraph deployment to Archiva worked properly).
  9. After merging the WDQS change, follow the procedure below to deploy new WDQS version.
  10. Bump Blazegraph master version back to snapshot - mvn versions:set -DnewVersion=2.1.5-wmf.5-SNAPSHOT - and commit/push it.

Administration

Hardware

See this Grafana dashboard for an overview of WDQS' compute resources. As of this writing, the following clusters can be selected:

  • wdqs (soon to be retired in favor of 2 separate graphs, "main" and "scholarly". See this ticket for details)
  • wdqs-main
  • wdqs-scholarly
  • wdqs-internal-main
  • wdqs-internal-scholarly

These clusters are in active/active mode (traffic is sent to both), but due to how we route traffic with GeoDNS, the primary cluster (usually eqiad) sees most of the traffic.

Monitoring

Icinga group

Grafana dashboard: https://grafana.wikimedia.org/d/000000489/wikidata-query-service

Grafana frontend dashboard: https://grafana.wikimedia.org/d/000000522/wikidata-query-service-frontend

Prometheus

We have two Prometheus exporters per Blazegraph instance:

exporter type | port (main blazegraph) | port (categories) | port (wdqs-updater)
python | 9193 | 9194 | N/A
jmx | 9102 | 9103 | 9101
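
To check that an exporter on a given host is responding, a sketch (assuming the conventional Prometheus /metrics path):

curl -s http://localhost:9193/metrics | head   # python exporter, main blazegraph
curl -s http://localhost:9101/metrics | head   # jmx exporter, wdqs-updater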

I've got a new WDQS host, how do I get it ready for production?

This section explains how to go from a racked server with an OS to a production WDQS server.

Create and Commit Puppet code

When a server is racked by DC Ops, they will add it to the Puppet repo's site.pp. See the Server Lifecycle page for more details on DC Ops' process.

Apply puppet role (Puppet patch 1)

Once DC Ops hands over the server, our first step is to move the server into one of the available wdqs puppet roles. As of this writing, the roles are:

Role name | Purpose
wdqs::main | main graph, accessible to the public. This graph gets the majority of the traffic
wdqs::internal_main | main graph, internal users only. Identical graph to public
wdqs::scholarly | scholarly graph, accessible to the public
wdqs::internal_scholarly | scholarly graph, internal users only. Identical graph to public
wdqs::test | Can you guess? ;P

Suppress Alerts (Puppet patch 2)

Until the host is ready, you can suppress alerts by setting the hiera key profile::query_service::blazegraph::monitoring_enabled: false. See this patch for an example.

Once these changes are merged, run the puppet agent on the host and it should be close to ready.

Manual Steps

There are a few post-reimage steps necessary to get a host into service.

  • Run puppet at least once
  • Scap deploy the new host:
    scap deploy -l 'wdqs9999.eqiad.wmnet' 'deploy to fresh wdqs host'
    
    (you will need a corresponding entry for the host in hieradata/common/scap/dsh.yaml if you haven't yet done that)
  • Select a source host for data transfer. Before the new host can serve traffic, it needs to copy the graph data from an existing host. To check which host has which graph:
bking@cumin2002:~$ sudo cumin A:wdqs-scholarly
5 hosts will be targeted:
wdqs[2016,2023-2024].codfw.wmnet,wdqs[1023-1024].eqiad.wmnet

Make sure that both the newly-reimaged host and the host you'll use as the data source appear in the output.

  • Transfer the graph data via cookbook

Once you've selected a source host, ssh into a cumin host and create a tmux window. Within the window, start a data transfer as follows (making sure to select the appropriate blazegraph_instance):

sudo cookbook sre.wdqs.data-transfer --blazegraph_instance scholarly_articles --reason "${REASON}" --task-id ${TASK} --lvs-strategy source-only --source ${SOURCE} --dest ${NEW_HOST} --no-check-graph-type

(note the use of --no-check-graph-type for the initial transfer, since the transfer will also copy the corresponding /srv/wdqs/data_loaded flag that indicates the graph type)

As of this writing, the cookbook takes about 45 minutes to transfer the graph (~650GB) from another host in the same DC.

If the cookbook fails, it will leave the source host depooled. This is probably what you want, assuming you will retry the transfer immediately. If that's not the case, be sure to manually repool the source host.

  • Add new host to load balancer pool

This requires a puppet patch targeting `conftool.yaml` in the same datacenter. After that's merged, you'll need a manual confctl command to set the new server's weight and pooled state, something like:

bking@cumin2002:~$sudo confctl select 'name=cirrussearch2113\.codfw\.wmnet' set/weight=10:pooled=no

Monitor lag using the WDQS dashboard. If it's not dropping, check the updater service again.
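
Once lag has caught up and the host looks healthy, it can be pooled with another confctl command. A minimal sketch (the hostname is hypothetical):

# mark the freshly validated host as pooled
sudo confctl select 'name=wdqs1234\.eqiad\.wmnet' set/pooled=yes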

Lastly,

  • Remove the alert suppression you configured in "Puppet patch 2".
  • Wait 30 minutes and confirm no alerts in AlertManager/Icinga.

Application Deployment

Sources

The source code is in the Gerrit project wikidata/query/rdf (GitHub mirror). The GUI source code is in the GitLab repository repos/wmde/wikidata-query-gui.

The deployment version of the query service is in the Gerrit project wikidata/query/deploy. The GUI is deployed as a microsite.

Labs Deployment

Note that deployment is currently done via git-fat (see below), which may require some manual steps after checkout. This can be done as follows:

  1. Check out the wikidata/query/deploy repository and update the gui submodule to the current production branch (git submodule update).
  2. Run git-fat pull to instantiate the binaries if necessary.
  3. rsync the files to the deploy directory (/srv/wdqs/blazegraph).

Use the role role::wdqs::labs for installing WDQS. You may also want to enable role::labs::lvm::srv to provide adequate disk space in /srv.

Command sequence for manual install:

git clone https://gerrit.wikimedia.org/r/wikidata/query/deploy
cd deploy
git fat init
git fat pull
git submodule init
git submodule update
sudo rsync -av --exclude .git\* --exclude scap --delete . /srv/wdqs/blazegraph

Production Deployment

Production deployment is done via the git deployment repository wikidata/query/deploy. The procedure is as follows:

Initial Preparation

Preferred option: from Jenkins

  1. Log into Jenkins
  2. Go to https://integration.wikimedia.org/ci/job/wikidata-query-rdf-maven-release-wdqs/
  3. Select build with parameters

Note: If the job fails, the Archiva credentials may have changed; see: Analytics/Systems/Cluster/Deploy/Refinery-source#Changing the archiva-ci password

Fallback option: from your own machine

  1. ./mvnw -Pdeploy-archiva release:prepare in the source repository, which updates the version numbers. If your system username is different from the one in scm, use the -Dusername=... option.
  2. ./mvnw -Pdeploy-archiva release:perform in the source repository - this deploys the artifacts to archiva.

Note that for the above you will need the repositories archiva.releases and archiva.snapshots configured in ~/.m2/settings.xml with your Archiva username/password. You will also need to have a GPG key set up.

Further required preparation

Prepare and submit the patch containing the new code updates:

  1. Run the deploy-prepare.sh <target version> script; it will create a commit with the newest version of the jars.
  2. Using the commit generated by the above, open up a patch, and get it approved and merged.

Test that the service is working as expected before we mutate it with a deploy, so that we can compare properly:

  1. Open up a tunnel to the current wdqs canary instance via ssh -L 9999:localhost:80 wdqs1003.eqiad.wmnet (check that wdqs1003 is still the canary)
    1. Canary is defined in the deploy repo within scap/scap.cfg as the wdqs-canary dsh group.
    2. This dsh group is defined within the same repo in the file scap/wdqs-canary with one host per line.
  2. Run the test script through the tunnel: from the rdf repo, run cd queries && ./test.sh -s http://localhost:9999/
  3. You may also want to navigate to http://localhost:9999 and run an example query.

Now we're ready for the actual deploy!

The actual code deploy

On the deployment server (deployment.eqiad.wmnet currently), cd /srv/deployment/wdqs/wdqs. First we'll get the repo into the desired state, then do the actual deploy.

  1. First git fetch, then glance at git log HEAD...origin/master and manually verify the expected commits are there, then git rebase && git fat pull
  2. Now that the repo is in the desired state, sanity check with ls -lah that you see the .war file and that it doesn't seem absurdly small
  3. Use scap deploy '<the latest version number>' to deploy the new build. After the canary deployment is done (scap will ask for confirmation to proceed), test the service. You can do that by ssh tunneling to wdqs1003.eqiad.wmnet and running the ./test.sh -s http://localhost:<tunneled_port>/ script from the `queries` subdirectory.

  4. Validation #1: In a separate pane or tab, navigate to /srv/deployment/wdqs/wdqs and run/look at scap deploy-log
  5. Validation #2: In yet another separate pane, tail the logs on the wdqs canary via tail -f /var/log/wdqs/wdqs-updater.log -f /var/log/wdqs/wdqs-blazegraph.log (note we're tailing two log files at once, so results will be interleaved)
  6. Once the canary deployment looks good, proceed to the rest of the fleet by pressing c

Post code-deploy operational steps

These steps are listed in a separate subsection from the actual code deploy, but to be clear, they are a mandatory part of every deploy.

  1. Run a basic test query on query.wikidata.org
  2. Check icinga - wdqs to verify there's no new warning/critical/unknown, and also check grafana to make sure everything looks good.
  3. Verify that the commit hash for the new deploy matches the correct commit hash from the deploy repo: sudo -E cumin -b 4 'A:wdqs-all' 'ls -ld /srv/deployment/wdqs/wdqs'; expected output: a symlink to /srv/deployment/wdqs/wdqs-cache/revs/${COMMIT_HASH}

General puppet note (nothing to do here for a deploy): The puppet role that needs to be enabled for the service is role::wdqs. It is recommended to test deployment checkout on beta before deploying it in production. The test script is located at rdf/query/test.sh.

GUI deployment general notes

The GUI is deployed as a wikidata-query-gui service using deployment charts on WMF Kubernetes infrastructure. New image versions are automatically built and pushed by CI; see Kubernetes/Deployments for instructions on how to deploy them.

Data reload procedure (SRE-level access required)

Warning

Reloading data is a time-consuming (~4 days for wdqs-main, ~2.5 days for wdqs-scholarly) and fragile process (it's not uncommon for a reload to fail a few times in a row).

Reloading

Run the data-reload cookbook

Use the data-reload cookbook from cumin. You'll need to provide the cookbook with the correct HDFS URL of the data; this can be found by traversing down the directory tree, starting like so on any stat* host: sudo -u analytics-search kerberos-run-command analytics-search hdfs dfs -ls 'hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/'

and ultimately ending with a URI like so:

hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250714

Manual Process

The manual process is not used anymore, but is documented in this page's history just in case.

Data transfer procedure

Transferring data between nodes is far faster than recovering from a dump (a couple of hours vs. multiple days). The procedure is automated in the sre.wdqs.data-transfer cookbook.

Updating federation allowlist

  • Add the endpoint to allowlist.txt in the puppet repo. Commit & push to the repo.
  • Run the puppet agent on WDQS hosts; this will pull down the new allowlist file.
  • Run the WDQS restart cookbook to activate the changes.
  • Test that it works with a query like so:
SELECT * {
  SERVICE <$ALLOWLISTED_SPARQL_ENDPOINT_URL> {
    SELECT * {
      ?s ?p ?o
    } LIMIT 10
  }
}

Manually updating entities

It is possible to update a single entity or a number of entities on each server, in case data gets out of sync. The command to do it is:

 cd /srv/deployment/wdqs/wdqs; bash runUpdate.sh -n wdq -N -S -- -b 500 --ids Q1234 Q5678 ...

To do this on all servers at once, tools like pssh can be used:

 pssh -t 0 -p 20 -P -o logs -e elogs -H "$SERVERS" "cd /srv/deployment/wdqs/wdqs; bash runUpdate.sh -n wdq -N -S -- -b 500 --ids $*"

Where $SERVERS contains the list of servers to update. Note that since this is done via the command line, updating larger batches of IDs will need some scripting to split them into manageable chunks (see the sketch below). Doing bigger updates at a moderate pace, with pauses so as not to interfere with regular updates, is recommended.
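
A minimal sketch of such chunking, assuming the IDs have been collected one per line in a hypothetical ids.txt (batch size and pause are arbitrary):

 # feed the updater 500 IDs at a time, pausing between batches
 xargs -n 500 sh -c 'cd /srv/deployment/wdqs/wdqs && bash runUpdate.sh -n wdq -N -S -- -b 500 --ids "$@" && sleep 60' _ < ids.txt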

Updating IDs by timeframe

Sometimes, due to some malfunction, a segment of updates for a certain time period gets lost. If it's a recent segment, the Updater can be reset to start from a certain timestamp by using --start TIMESTAMP --init (you have to shut down the regular updater, reset the timestamp, and then start it again). If the missed segment is further in the past, the best way is to fetch the IDs that were updated in that time period using the Wikidata recentchanges API, and then update these IDs as described above.

An example of such a script can be found here: https://phabricator.wikimedia.org/P8919. The output should be filtered and duplicates removed, then fed to a script calling the update command as described above (see the sketch below).
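
A sketch of fetching such IDs directly from the recentchanges API (the timestamps are placeholders; rcstart must be the newer of the two, and longer windows need rccontinue paging):

 # list item IDs edited between the two timestamps, deduplicated
 curl -s 'https://www.wikidata.org/w/api.php?action=query&list=recentchanges&format=json&rcnamespace=0&rcprop=title&rclimit=500&rcstart=2024-01-02T00:00:00Z&rcend=2024-01-01T00:00:00Z' | jq -r '.query.recentchanges[].title' | sort -u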

Updating value or reference nodes

Since value (wdv:hash) and reference (wdref:hash) nodes are supposed to be immutable, they are not touched by updates to items that use them. To fix these nodes, you need to delete all triples with these nodes as the subject (via SPARQL DELETE through production access), then trigger an update (as above) for items which reference these nodes so that they will be recreated; only one item per node is necessary.
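
A sketch of the deletion step, assuming direct access to the Blazegraph SPARQL endpoint on the host (localhost:9999) and a hypothetical value-node hash:

 # remove all triples whose subject is the stale value node
 curl -X POST 'http://localhost:9999/bigdata/namespace/wdq/sparql' --data-urlencode 'update=DELETE WHERE { <http://www.wikidata.org/value/SOMEHASH> ?p ?o }'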

Notes about running the service on non-WMF infrastructure

This project has been designed to support Wikidata and Commons MediaInfo, so it's not rare to stumble upon features where this assumption has been hardcoded, either directly in the code or as default values. Here are a few notes that may help if you are running this service for your own Wikibase installation.

runUpdate.sh options

Example: for a Wikibase available at https://mywikibase.local/ whose config is set with something like

$wgWBRepoSettings['conceptBaseUri'] = "https://myentities.local/entity/";

and using default blazegraph options:

./runUpdate.sh -- --wikibaseUrl https://mywikibase.local/ --conceptUri https://myentities.local --entityNamespaces 0,120

LDF (Linked Data Fragments) endpoint

The LDF endpoint is associated with a single WDQS host, because it has no easy way to track state across multiple hosts. The LDF host is set in the hieradata key profile::query_service::ldf_host.

Issues

Known limitations

  • Data drift: The update process is imperfect, and data might drift over time as updates are missed. Different servers might have slightly different data sets. This is mitigated by reloading the full data set periodically and then transferring it to the rest of the wdqs fleet to ensure consistency. If you identify a specific entity that has drifted, please open a Phabricator task with the entity (or list of entities) that needs to be reloaded.

Scaling strategy

/ScalingStrategy

Contacts

If you need more info, talk to anybody on the Wikimedia Wikidata Platform team.

Usage