Wikidata/Notes/Caching investigation

This page is currently a draft. More information pertaining to this may be available on the talk page.

There are many ways to build the client/server communication. The design choices involved are connected to each other to varying degrees.

Definitions

Server: Wikidata server.
Client: A user/consumer of Wikidata data. This initially means Wikipedia language versions, but will include independent applications in the future.
Page on the server: Contains all data about one Wikidata item.
Page on the client: Wikipedia page potentially using Wikidata data.

Note that depending on the communication model, both the Wikidata server and the Wikidata client may sometimes act as client or server in the usual sense of the client–server model.

Scenarios

There are several scenarios that are quite different in what kind of data is used by what kinds of clients:

Initially, Wikidata stores only data used to generate interwiki links on Wikipedia pages.
Later, Wikidata stores arbitrary item property values. This data is used by
- Wikipedia infoboxes and lists
- Other third-party MediaWiki installations
- Other clients, which may include anything from desktop applications through web sites to search engines

Each of these scenarios allows certain optimizations:

Wikidata may be more closely integrated with Wikipedias than with other clients, e.g. through shared DB tables
As long as Wikidata is used only for interwiki links, there is a one-to-one relation between Wikidata items and Wikipedia pages

Possibilities

Push/Pull/Poll

How does the server notify the client that the data is changed? There are three possibilities:

Push: When a page on the server is changed, the server pushes the changed data to the clients.
Pull: When a page on the client is changed, the client pulls the data from the server.
Poll: The client periodically checks what has changed on the server, and updates itself if there are any changes.

Right now, a combination of push and pull is used.

Note that the choice of method may depend on the use case. For example, for a wiki with few articles, pushing the data might be more efficient, but for English Wikipedia, polling might be more efficient.
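
The three notification styles can be compared in a small in-memory sketch. This is an illustration only: the `Server` and `Client` classes, the revision counter and every method name are assumptions made for the example, not part of Wikidata or MediaWiki.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical stand-in for the Wikidata server."""
    items: dict = field(default_factory=dict)        # item id -> data
    revisions: dict = field(default_factory=dict)    # item id -> revision number
    subscribers: list = field(default_factory=list)  # clients that receive pushes

    def edit(self, item_id, data, push=False):
        self.items[item_id] = data
        self.revisions[item_id] = self.revisions.get(item_id, 0) + 1
        if push:
            # Push: the server sends the changed data to every client.
            for client in self.subscribers:
                client.receive(item_id, data, self.revisions[item_id])


@dataclass
class Client:
    """Hypothetical stand-in for a client wiki with a local copy of the data."""
    server: Server
    cache: dict = field(default_factory=dict)  # item id -> data
    seen: dict = field(default_factory=dict)   # item id -> last revision seen

    def receive(self, item_id, data, revision):
        self.cache[item_id] = data
        self.seen[item_id] = revision

    def pull(self, item_id):
        # Pull: triggered by a change to the client page; fetch the item now.
        self.receive(item_id, self.server.items[item_id],
                     self.server.revisions[item_id])

    def poll(self):
        # Poll: run periodically; fetch only items whose revision has moved on.
        changed = [item_id for item_id, rev in self.server.revisions.items()
                   if self.seen.get(item_id) != rev]
        for item_id in changed:
            self.pull(item_id)
        return changed
```

In this sketch a push costs the server one call per subscriber per edit, while a poll costs the client one comparison per known item per interval, which is the trade-off the note above alludes to.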

API/Server DB/Client DB

Will the client wiki load the data from the server via the API, from the server's database or from its own database?

The first two are parallel to each other; the third requires the server to push the data to the client's database.

This choice depends somewhat on the previous one since, for example, polling via the API would be very inefficient.

Reparse/Rebuild links only

Right now, in order to add new links, the client reparses the entire page.

In principle, it does not have to do that, since all the information it needs is the links themselves and the no_external_interlang property, and all of that can be saved in the database. To implement this, it would have to be possible to change the cache partially, recycling the previously cached ParserOutput object.

There is also a hypothetical possibility of doing this for the data itself, by storing UUIDs in the HTML cache.
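
A minimal sketch of the "rebuild links only" idea follows. The `CachedOutput` class is a hypothetical stand-in for a cached ParserOutput, not the MediaWiki class. Two behaviours are assumptions made for the example: that links written on the client page take precedence over links from the server, and that no_external_interlang suppresses all server-provided links.

```python
from dataclasses import dataclass, field


@dataclass
class CachedOutput:
    """Hypothetical stand-in for a cached ParserOutput object."""
    html: str                                           # rendered page body
    local_links: dict = field(default_factory=dict)     # lang -> title, from the page text
    language_links: dict = field(default_factory=dict)  # lang -> title, as displayed
    no_external_interlang: bool = False


def rebuild_links(cached, server_links):
    """Recompute the displayed language links without reparsing the page."""
    if cached.no_external_interlang:
        # Assumption: the property suppresses every server-provided link.
        merged = dict(cached.local_links)
    else:
        # Assumption: a locally defined link overrides the server's link.
        merged = {**server_links, **cached.local_links}
    cached.language_links = merged  # cached.html is left untouched
    return cached
```

The point of the sketch is that only the link map is replaced; the expensive part of the cached object (the rendered HTML) is recycled as is.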

In combination

An outline of possible combinations:

Right now

What is being done right now may be called pushpull/api/reparse:

When a page on the server is changed, the server calls the client's API and asks for a page purge (reparse). (For now we can assume that we know which client pages are affected, because we only deal with language links. In the general case, the server does not know much about the client.)
When a page on the client is changed (including when being called by the server), the client calls the server's API and asks for the data.

Possible problems:

Slowness: the client might update more slowly than the server, creating backlog.

Possible improvements:

pushpull/serverdb/*: easy to do, and saves time (but it would require the client to have some server code, and it would possibly lead to more complex server code).
pushpull/*/rebuild: more difficult to implement, but less processing time on the client. It would also require the database to record which links came from the server and which came from the client, either via a new column in the iwlinks table or in a new table.
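
To illustrate the "which links came from where" requirement, here is a sketch using SQLite. The table name `links`, its columns and the `origin` values are all hypothetical; the real MediaWiki schema differs, and this only shows the idea of an origin column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE links (
        page_id INTEGER NOT NULL,
        lang    TEXT    NOT NULL,
        title   TEXT    NOT NULL,
        origin  TEXT    NOT NULL CHECK (origin IN ('client', 'server')),
        PRIMARY KEY (page_id, lang, origin)
    )
""")


def replace_server_links(conn, page_id, links):
    """Swap out only the server-provided links; client-authored rows stay."""
    with conn:  # one transaction: delete the old server links, insert the new ones
        conn.execute(
            "DELETE FROM links WHERE page_id = ? AND origin = 'server'",
            (page_id,),
        )
        conn.executemany(
            "INSERT INTO links (page_id, lang, title, origin) "
            "VALUES (?, ?, ?, 'server')",
            [(page_id, lang, title) for lang, title in links.items()],
        )
```

With the origin recorded, a server-side change can be applied without reparsing the page text to rediscover which links were written locally.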

push/clientdb/rebuild

Roughly, it would work this way:

When a page on the server is changed, the server would push the new data to the client's database. The links are then rebuilt in one of two ways:
1. The server rebuilds the links and pushes them to the client's cache, OR
2. The server invokes the client's API function to rebuild the links, or the client does it internally (pull/poll).

Proposals

Proposal: HTTP push to local db storage

Every time an item on Wikidata is changed, an HTTP push is issued to all subscribing clients (wikis).
- Initially, "subscriptions" are just entries in an array in the configuration.
- Pushes can be done via the job queue.
- Pushing is done via the MediaWiki API, but other protocols such as PubSubHubbub / AtomPub can easily be added to support third parties.
- Pushes need to be authenticated, so we don't get malicious crap. Pushes should be done using a special user with a special user right.
- The push may contain either the full set of information for the item, or just a delta (diff) plus a hash for an integrity check (in case an update was missed).
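
The "delta plus hash" option in the last point can be sketched as follows. The delta layout (`set`, `remove`, `hash`), the use of SHA-256 over canonical JSON, and the restriction to flat dictionaries are assumptions chosen for the example, not a proposed wire format.

```python
import hashlib
import json


def item_hash(item):
    """Hash a canonical JSON encoding so both sides compute the same value."""
    canonical = json.dumps(item, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_delta(old, new):
    """Server side: describe the change and the hash of the expected result."""
    return {
        "set": {k: v for k, v in new.items() if k not in old or old[k] != v},
        "remove": [k for k in old if k not in new],
        "hash": item_hash(new),
    }


def apply_delta(local, delta):
    """Client side: return the updated item, or None if an update was missed."""
    updated = {k: v for k, v in local.items() if k not in delta["remove"]}
    updated.update(delta["set"])
    # A mismatch means the local copy was not the version the delta was
    # computed against, so the client should request the full item instead.
    return updated if item_hash(updated) == delta["hash"] else None
```

A `None` result is the signal to fall back to a full fetch of the item.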

When the client receives a push, it does two things:
1. Write the fresh data into a local database table (the local Wikidata cache).
2. Invalidate the (parser) cache for all pages that use the respective item (for now we can assume that we know this from the language links).
  - If we only update language links, the page doesn't even need to be re-parsed: we just update the language links in the cached ParserOutput object.

When a page is rendered, interlanguage links and other info are taken from the local Wikidata cache. No queries are made to Wikidata during parsing/rendering.
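
The client-side handling described above might look roughly like this. `ClientWiki` and its three dictionaries are hypothetical stand-ins for the local Wikidata cache table, a usage-tracking table and the parser cache; authentication of the push is left out.

```python
class ClientWiki:
    """Hypothetical client wiki that receives pushes from Wikidata."""

    def __init__(self):
        self.wikidata_cache = {}  # item id -> item data (local Wikidata cache)
        self.usage = {}           # item id -> set of page titles using the item
        self.parser_cache = {}    # page title -> rendered output

    def handle_push(self, item_id, data):
        # Step 1: write the fresh data into the local cache.
        self.wikidata_cache[item_id] = data
        # Step 2: invalidate the parser cache of every page that uses the item.
        for title in self.usage.get(item_id, ()):
            self.parser_cache.pop(title, None)

    def render(self, title, item_id):
        # Rendering reads only local state; the server is never contacted here.
        if title not in self.parser_cache:
            self.usage.setdefault(item_id, set()).add(title)
            links = self.wikidata_cache.get(item_id, {})
            self.parser_cache[title] = sorted(links.items())
        return self.parser_cache[title]
```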

In case an update is missed, we need a mechanism that lets the client side request a full purge and re-fetch of all data, rather than just waiting for the next push, which might well take a very long time to happen.
- There needs to be a manual option for when someone detects this. Maybe action=purge can be made to do this. Simple cache invalidation, however, shouldn't pull info from Wikidata.
- A time-to-live could be added to the local copy of the data so that it is updated by a periodic pull, so the data does not stay stale indefinitely after a failed push.
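
Both recovery paths can be sketched together. `TtlCache`, the `fetch` callback and the injectable `clock` are assumptions for illustration. The read path stays local, in line with the rule that rendering makes no queries to Wikidata; only the periodic job and the manual option contact the server.

```python
import time


class TtlCache:
    """Hypothetical local copy of item data with a time-to-live."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch      # callable: item id -> fresh data from the server
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}       # item id -> (data, time stored)

    def store(self, item_id, data):
        """Called when a push arrives."""
        self._entries[item_id] = (data, self._clock())

    def get(self, item_id):
        """Read path used during rendering: local only."""
        entry = self._entries.get(item_id)
        return entry[0] if entry else None

    def refresh_stale(self):
        """Periodic job: re-pull every entry older than the TTL."""
        now = self._clock()
        stale = [item_id for item_id, (_, stored) in self._entries.items()
                 if now - stored >= self._ttl]
        for item_id in stale:
            self.store(item_id, self._fetch(item_id))
        return stale

    def force_refetch(self, item_id):
        """Manual option for when someone detects a missed update."""
        self.store(item_id, self._fetch(item_id))
```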

Variation: HTTP push to shared db storage

Instead of having a local Wikidata cache on each wiki (which may grow big: a first guesstimate by Jeroen and Reedy is up to 1TB total, for all wikis), all client wikis could access the same central database table(s) managed by the Wikidata wiki.

This is similar to the way the GlobalUsage extension tracks the usage of Commons images.
Whenever a page is re-rendered, the local wiki would query the table in the Wikidata DB. This means a cross-cluster DB query whenever a page is rendered, instead of a local query.
The HTTP push mechanism described above would still be needed to purge the parser cache when needed. However, the push requests would not need to contain the updated data; they may just be requests to purge the cache.
The ability to do full HTTP pushes (using the MediaWiki API or some other interface) would still be desirable for third-party integration.

This approach greatly lowers the amount of space used in the database.
It doesn't change the number of HTTP requests made.
- It does, however, reduce the amount of data transferred via HTTP (but not by much, at least not compared to pushing diffs).
It doesn't change the number of database requests, but it introduces cross-cluster requests.

Proposal: HTTP poll with echo notifications on local db storage

Another project is currently underway: mw:Echo_(Notifications). I think this would work best for 3rd party and other clients.

The server has the Echo server enabled.
The client has the Echo client enabled.
The client pulls the data item on page creation or modification.
The client subscribes to an Echo event on data item change.
The client polls Echo notifications periodically, see mw:Echo_(Notifications)#API (this could even be push, judging by the remarks in parentheses on that page).
The client, acting on a notification for an existing client page, pulls the new data item.
The client, acting on a notification for a deleted client page, unsubscribes from the Echo event for changes to that particular data item.
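
The subscribe/poll/unsubscribe cycle above can be sketched without reference to the real Echo API. `NotificationFeed` and `poll_once` are purely hypothetical stand-ins; none of these names come from the Echo extension.

```python
class NotificationFeed:
    """Hypothetical stand-in for the server's notification service."""

    def __init__(self):
        self.subscriptions = {}  # client name -> set of subscribed item ids
        self.pending = {}        # client name -> item ids changed since last poll

    def subscribe(self, client, item_id):
        self.subscriptions.setdefault(client, set()).add(item_id)

    def unsubscribe(self, client, item_id):
        self.subscriptions.get(client, set()).discard(item_id)

    def item_changed(self, item_id):
        # Queue one notification for every client subscribed to this item.
        for client, items in self.subscriptions.items():
            if item_id in items:
                self.pending.setdefault(client, []).append(item_id)

    def fetch_notifications(self, client):
        # Return and clear the client's queue of changed items.
        return self.pending.pop(client, [])


def poll_once(client, feed, pages, pull):
    """One polling round. `pages` maps item id -> title of an existing page."""
    pulled = []
    for item_id in feed.fetch_notifications(client):
        if item_id in pages:
            pull(item_id)  # existing page: fetch the new data item
            pulled.append(item_id)
        else:
            feed.unsubscribe(client, item_id)  # page deleted: stop listening
    return pulled
```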

For shared DB storage, simply change the pull action to a cache invalidation, forcing the client to render the page with the fresh data item from its shared DB storage. For that matter, if you are willing to delay the HTTP data pull until the first page render and use cache invalidation for local DB storage as well, you have exactly the same code for both the local and the shared DB storage scenarios!