Grants:Project/Future-proof WDQS

I will build a future-proof Wikidata that is much easier to scale.

Many aspects of Wikidata, Wikibase, WDQS, and related services can be improved. In the following I attempt a comprehensive analysis of the current situation, including some elements from the 2030 strategy; some of these recommendations are inspired by Denny Vrandečić's essay Toward an Abstract Wikipedia.

At its current scale, Wikidata has reached the limits of what can be done efficiently and in a future-proof way with (legacy?) off-the-shelf software.

status: withdrawn
title: Future-proof WikiData
summary: Create a Minimum Viable Product for a future-proof Wikidata
target: Wikidata
type of grant: tools and software
amount: please add the amount you are requesting (USD)
type of applicant: individual
grantee: Iamamz3
contact: talk
this project needs: volunteer, affiliate, grantee
created on: 11:28, 23 December 2019 (UTC)

Project idea

What is the problem you're trying to solve?

What problem are you trying to solve by doing this project? This problem should be small enough that you expect it to be completely or mostly resolved by the end of this project. Remember to review the tutorial for tips on how to answer this question.

The following problem statement is split into three parts:

  • Why and how WDQS does not scale
  • Why and how Wikidata does not scale
  • Why and how Wikidata is not future-proof

This section ends with a summary.

Wikidata Query Service does not scale

Quoting Guillaume Lederrey (Operations Engineer, Search Platform, Wikimedia Foundation) in the wikidata mailing list thread "Scaling Wikidata Query Service":

In an ideal world, WDQS should:
  1. scale in terms of data size
  2. scale in terms of number of edits
  3. have low update latency
  4. expose a SPARQL endpoint for queries
  5. allow anyone to run any queries on the public WDQS endpoint
  6. provide great query performance
  7. provide a high level of availability

Scaling graph databases is a "known hard problem", and we are reaching a scale where there are no obvious easy solutions to address all the above constraints. At this point, just "throwing hardware at the problem" is not an option anymore. We need to go deeper into the details and potentially make major changes to the current architecture.

I want to add the requirement that it shall be easy for researchers and practitioners to set up their own instance. This is another, social, form of scaling, and it entails making Wikidata easier to use.

The current solution adopted to support WDQS relies on Blazegraph. Blazegraph is no longer actively maintained because its developers were hired by Amazon. Wikimedia could simply invest more in Blazegraph maintenance (see the commits of Stas Malyshev, Software Engineer at the Wikimedia Foundation, in the Blazegraph repository), but sharding is not realistic given the schema of Wikidata, and performance would not be good anyway: Blazegraph can only scale vertically, with replicas (copies) for availability. Vertical scaling hits the limits of available hardware, and eventually of physics: for the foreseeable future there is only so much one can store and process inside a single machine.

Here is a breakdown of how the current Blazegraph-based solution tries to scale WDQS:

Blazegraph approach to scaling WDQS

# | Requirement | Strategy | Limitation
1 | Scale in terms of data size | Vertical scaling: bigger hard disks | Physical, due to available hardware technology.
2 | Scale in terms of number of edits | Vertical scaling: faster CPUs (and larger network bandwidth) | Physical, due to available hardware technology. The limitations entailed by the vertical scaling strategy lead to a "lag" between Wikibase and WDQS; see the next row.
3 | Lag: have low update latency | Vertical scaling: faster CPUs (and larger network bandwidth) | Physical, due to available hardware technology. Ideally there would be no lag at all; it is also a software problem.
4 | Expose a SPARQL endpoint for queries | Translation middleware in front of Blazegraph, see https://github.com/wikimedia/wikidata-query-rdf | Operations are made more difficult because there are many services and moving parts.
5 | Allow anyone to run any queries on the public WDQS endpoint | Wikidata triples replicated to WDQS | No time-traveling queries; lag (see row 2); many queries time out.
6 | Provide great query performance | Vertical scaling and replicas | Operations are more difficult.
7 | Provide a high level of availability | Replicas | Operations are more difficult.
8 | Easy to set up and operate | docker-compose or Kubernetes | Requires more skills.

Wikidata does not scale

The previous section describes several reasons why a specific component of the Wikidata infrastructure is not future-proof. The Wikidata Query Service relies on vertical scaling, hence on the availability of performant and efficient hardware that can be costly. The consequence of the limitations of the Blazegraph software, and hence of WDQS, is that Wikidata is difficult to:

  • setup and reproduce,
  • develop and maintain,
  • operate and scale.

Along those three dimensions, looking at the bigger picture of the whole Wikidata project reveals an even worse situation:

Wikidata problems

# | Topic | Problem | Effect
0 | Setup and reproducibility | Too many independent processes and code bases (microservices) | Fewer contributions
1 | Setup and reproducibility | A full-stack coding environment requires skills with Docker, docker-compose, and Kubernetes | Fewer contributions
2 | Setup and reproducibility | Production environment setup requires skills with Kubernetes or Puppet | Fewer contributions
3 | Development and maintenance | MediaWiki: a PHP and JavaScript code base with a lot of legacy code | Fewer contributions
4 | Development and maintenance | Wikibase: a PHP and JavaScript code base | Fewer contributions
5 | Development and maintenance | Too many programming languages (PHP, JavaScript, Ruby, Lua, Go, Java, sh, ...) | Fewer contributions
6 | Operations and scaling | Too many databases (MySQL, Redis, Blazegraph, Elasticsearch) | Fewer contributions
7 | Operations and scaling | Impossible to do time-traveling queries | Fewer contributions
8 | Operations and scaling | See the section "Wikidata Query Service does not scale" | Fewer contributions
9 | Operations and scaling | No support for edits that span multiple items | Fewer contributions

Because Wikidata is difficult to scale, Wikimedia falls short of fully enabling and empowering users, as stated in its mission:

Wikimedia mission
"The mission of the Wikimedia Foundation is to empower and engage people around the world to collect and develop educational content under a free license or in the public domain, and to disseminate it effectively and globally."
https://wikimediafoundation.org/about/mission/

Why and how Wikidata is not future-proof

The two previous sections analyzed two components and shed light on some existing problems. This section tries to extract, from existing publications, possible problems that Wikidata will need to tackle in the future.

Toward an Abstract Wikipedia

http://simia.net/download/abstractwikipedia_whitepaper.pdf

Wikimedia movement strategy toward 2030

Strategy/Wikimedia movement/2018-20/Recommendations

Summary

Big Picture

1. WDQS is not scalable (present)
   Why: legacy off-the-shelf software that leads the project to be neither maintainable nor scalable.
   Effect: Wikidata is not scalable.

2. At Wikidata's scale, there is no usable versioned triple store (immediate future)
   Why: there was no use case for such software until now.
   Effects:
     • no time-traveling queries,
     • no change-request mechanism,
     • ⇒ cooperation around the creation and maintenance of structured data is painful.

3. Wikidata is not scalable (immediate future)
   Why:
     • Wikidata is based on MediaWiki,
     • horizontal scaling is an essential complexity.
   Effect: fewer code and data contributions.

4. Trusted knowledge as a service is difficult (immediate future)
   Why:
     • software and software development do not scale easily,
     • communities do not scale easily,
     • search and discovery are still difficult.
   Effect: less knowledge equity.

5. No Abstract Wikipedia (future)
   Why:
     • no existing open-source software,
     • not enough interest in existing scientific contributions.
   Effects:
     • less knowledge equity,
     • an unrealistic goal for the foreseeable future.

6. Earth-scale encyclopedic knowledge (future)
   Why:
     • Internet access is still not global,
     • the current architecture and cooperation paradigm is not scalable.
   Effects:
     • need to continue to rethink the current architecture,
     • need to continue to explore cooperation mechanics,
     • need to continue to explore distribution mechanics.

What is your solution?

For the problem you identified in the previous section, briefly describe how you would like to address this problem. We recognize that there are many ways to solve a problem. We’d like to understand why you chose this particular solution, and why you think it is worth pursuing. Remember to review the tutorial for tips on how to answer this question.

Only the first three problems of the Big Picture will be addressed:

  • A scalable Wikidata Query Service
  • A scalable versioned triple store
  • A scalable Wikidata

How to make Wikidata scalable

In summary, the solution is to:

  • drop legacy software to reduce operational costs,
  • reduce the learning curve to ease the on-boarding of new developers,
  • scale Wikidata, including SPARQL queries (see the sketch after this list),
  • add new features: time-traveling queries and change requests.
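
The third item deserves a concrete illustration. The design explored here stores triples in an ordered key-value store and answers patterns with range scans, in the spirit of SRFI-167 (okvs) and SRFI-168 (nstore). The following toy sketch is illustrative only and is not nomunofu's actual API: a plain Scheme list stands in for the ordered key-value store, and #f plays the role of a query variable.

  (import (scheme base) (scheme write))

  ;; Toy sketch only, not nomunofu's API: a plain list stands in for the
  ;; ordered key-value store (okvs) of SRFI-167/168.
  (define store '())

  ;; Add a triple (subject predicate object) to the store.
  (define (triple-add! s p o)
    (set! store (cons (list s p o) store)))

  ;; #f plays the role of a query variable ("match anything"). In a real
  ;; okvs-backed nstore, each bound prefix is answered by one range scan
  ;; over an index ordering such as spo, pos, or osp.
  (define (match? pattern triple)
    (let loop ((p pattern) (t triple))
      (cond ((null? p) #t)
            ((or (not (car p)) (equal? (car p) (car t)))
             (loop (cdr p) (cdr t)))
            (else #f))))

  ;; Return every triple matching the pattern.
  (define (triples s p o)
    (let loop ((rest store) (out '()))
      (cond ((null? rest) out)
            ((match? (list s p o) (car rest))
             (loop (cdr rest) (cons (car rest) out)))
            (else (loop (cdr rest) out)))))

  ;; Example: everything Douglas Adams (Q42) is an instance of (P31).
  (triple-add! 'Q42 'P31 'Q5)
  (display (triples 'Q42 'P31 #f)) (newline) ; prints ((Q42 P31 Q5))

A real implementation replaces the list with prefix range scans over several key orderings (spo, pos, osp), which is what makes SPARQL basic graph patterns efficient, and what lets the same code scale vertically (embedded okvs) or horizontally (distributed okvs).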

The following table describes proposed solutions to existing problems in Wikidata:

Proposed solutions to Wikidata problems

0. Setup and reproducibility: too many independent processes and code bases (microservices)
   Solution:
     • a single executable,
     • a single process (no microservices, apart from the possibly horizontally scaled distributed database),
     • a single code base.
   Effects:
     • easy to set up, hence to reproduce,
     • less operational work,
     • code that is easier to make sense of and to navigate,
     • ⇒ easier to contribute code.

1. Setup and reproducibility: a full-stack coding environment requires skills with Docker, docker-compose, and Kubernetes
   Solution:
     • avoid Docker, docker-compose, and Kubernetes for setting up the coding environment,
     • this is made possible because the solution is not based on microservices.
   Effects:
     • requires less coding and operational skill,
     • ⇒ easier to contribute code.

2. Setup and reproducibility: production environment setup requires skills with Kubernetes or Puppet
   Solution:
     • vertical scaling: the database is embedded in the single process,
     • horizontal scaling: multi-machine deployments still require something like Kubernetes or Puppet, so in some sense this problem is minor and not addressed.
   Effects:
     • an easy single-process production setup,
     • Kubernetes or Puppet will still be required in production to support horizontal scalability.

3. Development and maintenance: MediaWiki, a PHP and JavaScript code base with a lot of legacy code
   Solution:
     • do not rely on MediaWiki,
     • do not rely on PHP or JavaScript,
     • do not rely on the existing approach where "structured data" is an afterthought.
   Effects:
     • no legacy code,
     • a solution rethought to match the problem,
     • a clean solution,
     • ⇒ easier to contribute code.

4. Development and maintenance: Wikibase, a PHP and JavaScript code base
   Solution: rewrite Wikibase from scratch in Scheme to match the problem.
   Effects:
     • easier to code and maintain,
     • faster code,
     • ⇒ easier to contribute code.

5. Development and maintenance: too many programming languages (PHP, JavaScript, Ruby, Lua, Go, Java, sh, ...)
   Solution:
     • do not rely on PHP, JavaScript, Lua, Go, or Java,
     • rely on Scheme,
     • rely on projects maintained by third parties when other programming languages are required.
   Effects:
     • fewer languages in the stack and under the responsibility of Wikimedia,
     • dependence on third parties,
     • faster code,
     • ⇒ easier to contribute code.

6. Operations and scaling: too many databases (MySQL, Redis, Blazegraph, Elasticsearch)
   Solution: a single storage layer, the ordered key-value store used by nomunofu (WiredTiger or FoundationDB; see the plan below).
   Effects:
     • less database expert knowledge required,
     • easier operations,
     • faster,
     • scalable,
     • ⇒ easier to contribute code,
     • ⇒ easier to contribute data.

7. Operations and scaling: impossible to do time-traveling queries
   Solution: a versioned triple store (see the sketch after this table).
   Effect: ⇒ easier to contribute data.

8. Operations and scaling: see the section "Wikidata Query Service does not scale"
   Effects:
     • ⇒ easier to contribute code,
     • ⇒ easier to contribute data.

9. Operations and scaling: edits that span multiple items
   Solution: a change-request mechanism that allows adding, deleting, and undoing changes (see the sketch after this table).
   Effects:
     • ⇒ easier to contribute data,
     • ⇒ one-click undo of merges and QuickStatement batches.
What are other solutions?

virtuoso-opensource

github: https://github.com/openlink/virtuoso-opensource/

Pros
  • Similar existing deployment
  • Supported by an experienced company
Cons
  • monopoly
  • vendor lock-in
  • no support for time-traveling queries
  • no support for change requests
  • not a complete solution
  • as of yet, no jepsen.io database harness tests?
  • maybe incomplete ACID guarantees?

Property graph databases

See https://github.com/jbmusso/awesome-graph/#awesome-graph

Pros
  • Maybe similar existing deployments, but certainly not in the open
  • Supported by established companies (neo4j, dgraph, arangodb); JanusGraph is supported by the Linux Foundation
Cons
  • does not map efficiently to RDF triples
  • no support for time-traveling queries
  • no support for change requests
  • not a complete solution
  • as of yet, no jepsen.io database harness tests (neo4j, dgraph, arangodb)

Other triple stores

github: https://github.com/semantalytics/awesome-semantic-web#databases

Pros
  • ?
Cons
  • no support for time-traveling queries
  • no support for change requests
  • not a complete solution
  • as of yet, no jepsen.io database harness tests?

Project goals

What are your goals for this project? Your goals should describe the top two or three benefits that will come out of your project. These should be benefits to the Wikimedia projects or Wikimedia communities. They should not be benefits to you individually. Remember to review the tutorial for tips on how to answer this question.

The goal of the project is to support Wikidata's growth in terms of:

  • code contributions,
  • data contributions.

Toward that goal, the project must be:

  • easy to set up and reproduce, with code that is easy to write and maintain,
  • faster, with time-traveling queries and a way to visually edit triples,
  • both vertically and horizontally scalable.

From this project, a clear architecture toward a scalable Wikidata will emerge.

Project impact

How will you know if you have met your goals?

For each of your goals, we’d like you to answer the following questions:

  1. During your project, what will you do to achieve this goal? (These are your outputs.)
  2. Once your project is over, how will it continue to positively impact the Wikimedia community or projects? (These are your outcomes.)

For each of your answers, think about how you will capture this information. Will you capture it with a survey? With a story? Will you measure it with a number? Remember, if you plan to measure a number, you will need to set a numeric target in your proposal (e.g. 45 people, 10 articles, 100 scanned documents). Remember to review the tutorial for tips on how to answer this question.

Outputs

Outcomes

  • More people outside Wikimedia use the project to host Wikidata or Wikidata-like projects
  • The current stack and architecture are replaced with the result of this project
  • More people contribute to Wikidata
  • Wikidata doubles its number of triples to reach 20 billion

Do you have any goals around participation or content?

Are any of your goals related to increasing participation within the Wikimedia movement, or increasing/improving the content on Wikimedia projects? If so, we ask that you look through these three metrics, and include any that are relevant to your project. Please set a numeric target against the metrics, if applicable. Remember to review the tutorial for tips on how to answer this question.

The project will improve the performance and availability of Wikidata.

Project plan

Activities

Tell us how you'll carry out your project. What will you and other organizers spend your time doing? What will you have done at the end of your project? How will you follow-up with people that are involved with your project?

Plan

Quarter 1: Arew (guesstimate: 1 month)
  Activities:
    • finish work on Arew 1.0.0:
      • tidy the standard library,
      • finish SRFI-180 (JSON),
      • submit untangle to SRFI,
      • submit HTTP/1.1 to SRFI.
  Output:
    • an efficient R7RS-large implementation on top of Chez Scheme.

Quarter 1: Ruse (guesstimate: 1 month)
  Activities:
    • finish work on Ruse 1.0.0:
      • implement closure conversion,
      • implement jQuery bindings,
      • finish ReactJS bindings.
  Output:
    • an R7RS-small implementation, based on the nanopass framework, that runs in browsers.

Quarter 1: nomunofu 0.2.0 (guesstimate: 1 month)
  Activities:
    • finish work on nomunofu 0.2.0:
      • tidy nstore,
      • add a JSON-over-HTTP query service.
  Outputs:
    • JSON queries via a REST API,
    • WiredTiger micro-benchmarks.

Quarter 2: nomunofu 0.3.0 (guesstimate: 1 month)
  Activities:
    • adapt the FoundationDB bindings to support SRFI-167.
  Output:
    • FoundationDB micro-benchmarks (see the sketch after this plan).

Quarter 2: nomunofu 0.4.0 (guesstimate: 1 month)
  Activities:
    • tidy the versioned nstore,
    • initial SPARQL REST API with change-request support: select, insert, update, delete.
  Outputs:
    • RDF conformance tests[1],
    • time-traveling query support via the REST API,
    • SPARQL via the REST API,
    • SPARQL benchmarks.

Quarter 2: nomunofu 0.5.0 (guesstimate: 1 month)
  Activities:
    • users and permissions,
    • visual edition of tuples of n items,
    • a visual change-request mechanism: create, apply, revert,
    • history of changes via the REST API.
  Outputs:
    • a web-based graphical user interface,
    • a stream of changes via the REST API.

Quarter 3: nomunofu 0.6.0 (guesstimate: 1 month)
  Activities:
    • OAuth 2.0,
    • REST API improvements.
  Output:
    • robot (bot) access.

Quarter 3: nomunofu 0.7.0 (guesstimate: 1 month)
  Activities:
    • autocomplete,
    • search,
    • spell checking.
  Output:
    • better usability of the visual edition of tuples of n items.

Quarter 3: nomunofu 0.8.0 (guesstimate: 1 month)
  Activities:
    • SPARQL optimizations, fine tuning, and benchmarks.
  Outputs:
    • SPARQL micro-benchmarks with WiredTiger,
    • SPARQL micro-benchmarks with FoundationDB.

Quarter 4: nomunofu 0.9.0 (guesstimate: 1 month)
  Activities:
    • visual editor i18n, a11y, and UI/UX review and improvements.
  Outputs:
    • an internationalized (i18n) visual editor,
    • an accessible (a11y) visual editor.

Quarter 4: nomunofu 0.9.9 (guesstimate: 1 month)
  Activities:
    • more SPARQL optimizations, fine tuning, and benchmarks.
  Outputs:
    • FoundationDB cluster tips, tricks, and recommendations,
    • WiredTiger configuration tips, tricks, and recommendations,
    • full benchmarks[2].

Quarter 4: nomunofu 1.0.0 (guesstimate: 1 month)
  Activities:
    • bug fixes,
    • tidy the WikiJournal publication.
  Outputs:
    • a Minimum Viable Product: code, tests, and documentation,
    • a WikiJournal publication.
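
The quarter 1 and 2 milestones above hinge on the same nstore code running against two storage engines: WiredTiger for the embedded, single-box case, and FoundationDB for the distributed case. The toy sketch below illustrates that design choice; the names (make-memory-engine, count-up!, and the engine stand-ins) are hypothetical and are not nomunofu's or SRFI-167's actual API. It only shows how code written against a small ordered key-value interface stays unchanged when the engine is swapped.

  (import (scheme base) (scheme write))

  ;; Toy sketch only: `make-memory-engine` stands in for real engines
  ;; (WiredTiger embedded, FoundationDB distributed). The names are
  ;; hypothetical, not SRFI-167's or nomunofu's actual API.
  (define (make-memory-engine)
    (let ((kv '()))
      (lambda (op . args)
        (case op
          ((set!) (set! kv (cons (cons (car args) (cadr args)) kv)))
          ((ref)  (let ((hit (assoc (car args) kv)))
                    (and hit (cdr hit))))))))

  ;; Code written against the small okvs interface never mentions the
  ;; engine, so it runs unmodified on any engine implementing it.
  (define (count-up! engine key)
    (let ((n (+ 1 (or (engine 'ref key) 0))))
      (engine 'set! key n)
      n))

  ;; The same workload on two engine stand-ins:
  (define wiredtiger-like (make-memory-engine))   ; embedded, vertical
  (define foundationdb-like (make-memory-engine)) ; distributed, horizontal
  (display (count-up! wiredtiger-like "triples")) (newline)   ; prints 1
  (display (count-up! foundationdb-like "triples")) (newline) ; prints 1

The micro-benchmarks planned for quarters 1 and 2 can then compare the two engines on the same workload without forking the query code.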

Budget

How will you use the funds you are requesting? List bullet points for each expense. (You can create a table later if needed.) Don’t forget to include a total amount, and update this amount in the Probox at the top of your page too!

The budget will be set when we agree on a plan. The rough estimate is between 2,500 and 5,500 euros per month depending on applicable taxes, possibly plus the cost of renting hardware to run the benchmarks (see https://phabricator.wikimedia.org/T206636).

Community engagement

Community input and participation helps make projects successful. How will you let others in your community know about your project? Why are you targeting a specific audience? How will you engage the community you’re aiming to serve during your project?

  • I will continue to blog about the project at https://hyper.dev (currently offline; I will probably move my blog to a mailing list on Sourcehut), posting weekly, bi-weekly, and monthly reviews of my progress, and I will engage with the community on the wiki spaces, mailing lists, and IRC,
  • I will publish a paper in WikiJournal,
  • I expect input from the community regarding accessibility and usability, and help with localization,
  • I am also waiting for more information regarding the availability of hardware; see https://phabricator.wikimedia.org/T206636.

References List

Get involved

Participants

Please use this section to tell us more about who is working on this project. For each member of the team, please describe any project-related skills, experience, or other background you have that might help contribute to making this idea a success.

I am amz3, also known as zig on freenode. I have been a software engineer in various domains for 10 years (bitbucket, github, sourcehut). I would like to join Wikimedia.

Community notification

Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions. You are responsible for notifying relevant communities of your proposal, so that they can help you! Depending on your project, notification may be most appropriate on a Village Pump, talk page, mailing list, etc. Need notification tips?

Endorsements

Do you think this project should be selected for a Project Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).