Wikimedia CH/Grant apply/Scaling Wikidata by benchmarking QLever

Infodata

  • Name of the project: Scaling Wikidata by benchmarking QLever
  • Amount requested: 10,000 EUR
  • Type of grantee: Individual
  • Name of the contact: Sam Klein
  • Contact: metasj gmail

The problem and the context

Wikidata needs to migrate away from Blazegraph, which is no longer supported and does not scale to meet our needs.

What is the problem you're trying to solve?

QLever is the most performant of the open-source alternative graph databases we have considered. A few issues were raised in the evaluation two years ago, and its development team has worked on them since.

What is your solution to this problem (please explain the context and the solution)?

QLever's maintainer has run some informal performance benchmarks and published them online.

This project aims to expand that benchmark to cover many more queries against Wikidata, spanning four categories:

  • current simple queries that complete quickly on the Wikidata Query Service (WDQS);
  • current queries that are slow but possible on WDQS;
  • queries that time out or otherwise fail to complete on WDQS;
  • queries that time out or otherwise fail to complete on the QLever Wikidata endpoint.

Together these highlight the potential improvement from migrating to a different query service. The queries will be run on at least QLever and the current WDQS; a minimal sketch of one such timed run appears below.
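Such a timed run can be scripted against both SPARQL endpoints over the standard protocol. The sketch below is a minimal illustration in Python using the requests library; the endpoint URLs, the test query, and the 60-second client-side timeout are placeholders rather than the benchmark's final parameters.

  import time
  import requests

  # Endpoint URLs are illustrative; the QLever URL assumes the public
  # demo instance rather than the project's standalone machine.
  ENDPOINTS = {
      "WDQS": "https://query.wikidata.org/sparql",
      "QLever": "https://qlever.cs.uni-freiburg.de/api/wikidata",
  }

  # A trivial placeholder query; the real benchmark would draw from the
  # four categories described above.
  QUERY = """
  PREFIX wdt: <http://www.wikidata.org/prop/direct/>
  PREFIX wd:  <http://www.wikidata.org/entity/>
  SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 10
  """

  def timed_run(endpoint, query, timeout_s=60.0):
      """Run one query against one endpoint; return (seconds, status)."""
      start = time.monotonic()
      try:
          r = requests.get(
              endpoint,
              params={"query": query},
              headers={"Accept": "application/sparql-results+json"},
              timeout=timeout_s,
          )
          r.raise_for_status()
          return time.monotonic() - start, "ok"
      except requests.exceptions.Timeout:
          return None, "timeout"
      except requests.exceptions.RequestException as exc:
          return None, f"error: {exc}"

  for name, url in ENDPOINTS.items():
      elapsed, status = timed_run(url, QUERY)
      print(name, status, f"{elapsed:.2f}s" if elapsed else "")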

The queries will be run on a standalone machine, a current variant of the recommended desktop machine for benchmarking. The same machine can be used for future benchmarks, including of the QLever branch that implements SPARQL 1.1 Update for real-time updating, the last major feature QLever needs for parity with the current Wikidata setup.
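For context, SPARQL 1.1 Update is what allows individual Wikidata edits to be applied to a live index rather than requiring periodic full re-imports. The sketch below shows the kind of request involved, again in Python with the requests library; the local endpoint URL and the label triples are invented for illustration and do not describe QLever's actual updater.

  import requests

  # Hypothetical local QLever instance; URL is a placeholder.
  UPDATE_ENDPOINT = "http://localhost:7001/update"

  # Mirror a single Wikidata label correction as a SPARQL 1.1 Update
  # request, the way a live updater would.
  update = """
  PREFIX wd:   <http://www.wikidata.org/entity/>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  DELETE DATA { wd:Q42 rdfs:label "Duglas Adams"@en } ;
  INSERT DATA { wd:Q42 rdfs:label "Douglas Adams"@en }
  """

  resp = requests.post(
      UPDATE_ENDPOINT,
      data=update.encode("utf-8"),
      headers={"Content-Type": "application/sparql-update"},
  )
  resp.raise_for_status()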

Project goals

Confirm and illustrate a benchmark for a high-performance upgrade to the Wikidata Query Service, to allow it to continue to grow.

Describe how this will improve current usage of Wikidata and allow future uses, and what that could mean for our projects' continued growth.

Project impact

After a successful migration to this or a similarly performant backend, queries via the Query Service should run 5-10x faster, more queries should resolve without timing out, and projects that stalled because they were too high-volume could resume. The graph split of scholarly metadata might be re-merged after such a migration, making Scholia (which is widely used) faster and more useful.

How will you know if you have met your goals?

  • Successfully creating the benchmark and gathering repeatable statistics from it.
  • Having the benchmark deemed representative and suitable by the Wikidata and QLever teams, and having them agree that the statistics are correct.
  • Seeing updates to the Phabricator tickets about Wikidata migration confirming that QLever is a promising candidate for migrating the production system.

Do you have any goals or metrics around participation or content?

We would like to see:

  • At least 1 heavy user of the Query Service sharing queries that are important to them and that often time out (so the benchmark can check how those behave on the test system).
  • At least 10 Wikidata Query Service developers or technical users, and at least 2 QLever contributors, commenting positively on the benchmark and process.
  • At least 10 other community members commenting positively on the process and impact and relative importance of finding a new query backend.

Project plan

Activities

  • Formalization of the benchmark already carried out by QLever, focused on a test QLever instance and Wikidata's current implementation.
  • Replication of this benchmark on similar hardware.
  • Addition of extra queries, as described above.
  • Writeup of the benchmark, with additional detail on how the Wikidata queries were executed (via WDQS).
  • Identification of the default example queries suggested in the WDQS interface that currently time out, and of which ones no longer time out on the QLever instance (see the sketch after this list).
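As a concrete sketch of that last activity, the default example queries could be bucketed by outcome on each backend, reusing the timed_run helper and ENDPOINTS table from the earlier sketch. The 5-second fast/slow threshold and the empty query table are assumptions for illustration, not project-defined parameters.

  # Reuses timed_run and ENDPOINTS from the earlier sketch; the 5-second
  # fast/slow cutoff is an assumed threshold.
  FAST_THRESHOLD_S = 5.0

  def classify(elapsed, status):
      """Map one timed run onto the benchmark's outcome categories."""
      if status == "timeout":
          return "times out"
      if status != "ok":
          return "fails"
      return "fast" if elapsed < FAST_THRESHOLD_S else "slow"

  # name -> SPARQL text, to be filled from the WDQS Examples page.
  example_queries = {}

  results = {
      name: {backend: classify(*timed_run(url, query))
             for backend, url in ENDPOINTS.items()}
      for name, query in example_queries.items()
  }

  # Examples that time out on WDQS but complete on QLever highlight the
  # potential win from migrating.
  improved = sorted(name for name, r in results.items()
                    if r["WDQS"] == "times out" and r["QLever"] != "times out")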

Budget

  • Hardware - 3000 EUR. The previous benchmark ran on an AMD Ryzen 9 7950X with 16 cores, 128 GB of RAM, and 7.1 TB of NVMe SSD; the actual machine will be the current equivalent with maximum memory.
  • Technical implementation - 4500 EUR, mainly by Peter F. Patel-Schneider.
  • Project management + community engagement (incl. QLever community) - 1500 EUR
  • Fiscal sponsorship - 1000 EUR

Community engagement

This will happen on the Wikidata project page (to clarify the parameters of the benchmark) and its talk page, through attendance at a few relevant office hours, and via Wikimedia Phabricator, the QLever GitHub repository, and email to key networks whose participation we want. A technical report on the project will be prepared and possibly submitted for publication.

Wikimedia CH response

Dear Sam Klein, we are pleased to approve your grant request under the Innovation programme of Wikimedia CH for a total of 9,000 CHF. I will follow up shortly with an email to coordinate the payment details.

The project can begin as early as 2024, with an initial payment of 5,000 CHF. The remaining balance will be provided halfway through the project, after submission of the mid-term report.

Once the project is completed, a final report will be required for inclusion in Wikimedia CH's reports and newsletters. --Ilario (talk) 08:31, 24 October 2024 (UTC)

Thanks Ilario, fantastic news. We have a build spec for the server, let me know when the email goes out! –SJ talk  21:13, 24 October 2024 (UTC)