Wikidata/Notes/Change propagation/oc

This document is a draft, and should not be assumed to represent the ultimate architecture.

This document describes how Wikibase propagates changes on the repository to any client wikis.

Overview

  • Each change on the repository is recorded in the changes table which acts as an update feed to any client wikis (like the Wikipedias).
  • Dispatcher scripts periodically check the changes table.
  • Each client wiki is notified of changes on the repository via an entry in its job queue. These jobs are used to invalidate and re-render the relevant page(s) on the client wiki.
  • Notifications about the changes are injected into the client's recentchanges table, to make them visible on watchlists, etc.
  • Consecutive edits by the same user to the same data item can be combined into one, to avoid clutter.

Assumptions and Terminology

The data managed by the Wikibase repository is structured into (data) entities. Every entity is maintained as a wiki page containing structured data. There are several types of entities, but one is particularly important in this context: items. Items are special in that they are linked with article pages on each client wiki (e.g., each Wikipedia). For more information, see the data model primer.

The propagation mechanism is based on the assumption that each data item on the Wikidata repository has at most one site link to each client wiki, and that only one item on the repository can link to any given page on a given client wiki. That is, any page on any client wiki can be associated with at most one data item on the repository.

(See comment on discussion page about consequences of limiting change propagation to cases where Wikipedia page and Wikidata item have a 1:1 relation)

This mechanism also assumes that all wikis, the repository and the clients (i.e. Wikidata and the Wikipedias), can connect directly to each other's databases. Typically, this means that they reside in the same local network. However, the wikis may use separate database servers: wikis are grouped into sections, where each section has one master database and potentially many slave databases (together forming a database cluster).

Communication between the repository (Wikidata) and the clients (Wikipedias) is done via an update feed. For now, this is implemented as a database table (the changes table) which is accessed by the dispatcher scripts directly, using the "foreign database" mechanism.

Support for 3rd party clients, that is, client wikis and other consumers outside of Wikimedia, is currently not essential and will not be implemented for now. It shall however be kept in mind for all design decisions.

Change Logging

Every change performed on the repository is logged into a table (the "changes table", namely wb_changes) in the repo's database. The changes table behaves similarly to MediaWiki's recentchanges table, in that it only holds changes for a certain time (e.g. a day or a week); older entries are purged periodically. Unlike the recentchanges table, however, wb_changes contains all information necessary to report and replay the change on a client wiki: besides information about when the change was made and by whom, it contains a structural diff against the entity's previous revision.

Effectively, the changes table acts as an update feed. Care shall be taken to isolate the database table as an implementation detail from the update feed, so it can later be replaced by an alternative mechanism, such as PubHub or an event bus. Note, however, that a protocol with queue semantics is not appropriate, since it would require one queue per client.
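One way to picture this isolation is a small read-only feed interface that hides the table behind it. The following is only a sketch under assumptions: the names ChangeFeed, InMemoryFeed and changes_since are hypothetical and do not come from Wikibase. Readers pass in their own cursor (the last change ID they have seen), so reading never consumes entries and no per-client queue is needed.

```python
from typing import Protocol


class ChangeFeed(Protocol):
    """Hypothetical interface: what a dispatcher needs from the update feed."""

    def changes_since(self, last_id: int, limit: int) -> list[dict]:
        """Return up to `limit` changes with IDs greater than `last_id`."""
        ...


class InMemoryFeed:
    """Stand-in for the wb_changes table; a real backend would query the DB."""

    def __init__(self) -> None:
        self._changes: list[dict] = []

    def append(self, change: dict) -> None:
        # Changes are assumed to arrive in increasing ID order.
        self._changes.append(change)

    def changes_since(self, last_id: int, limit: int) -> list[dict]:
        # Reading does not remove anything: each reader keeps its own cursor.
        newer = [c for c in self._changes if c["id"] > last_id]
        return newer[:limit]


feed = InMemoryFeed()
for change_id in (1, 2, 3):
    feed.append({"id": change_id})

# Two clients at different positions read the same feed independently.
behind = feed.changes_since(0, limit=10)
ahead = feed.changes_since(2, limit=10)
```

Because the cursor lives with the reader, swapping the table for another backend would only require a new class with the same method.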

Dispatching Changes

Changes on the repository (e.g. wikidata.org) are dispatched to client wikis (e.g. Wikipedias) by a dispatcher script. This script polls the repository's wb_changes table for changes, and dispatches them to the client wikis by posting the appropriate jobs to the client's job queue.

The dispatcher script is designed in a way that allows any number of instances to run and share load without any prior knowledge of each other. They are coordinated via the repository's database using the wb_changes_dispatch table:

  • chd_client: the client's database name (primary key).
  • chd_latest_change: the ID of the last change that was dispatched to the client.
  • chd_touched: a timestamp indicating when updates have last been dispatched to the client.
  • chd_lock_name: the name of the global lock used by the dispatcher currently updating that client (or NULL).

The dispatcher operates by going through the following steps:

  1. Lock and initialize
    1. Choose a client to update from the list of known clients.
    2. Start DB transaction on repo's master database.
    3. Read the given client's row from wb_changes_dispatch (if missing, assume chd_latest_change = 0).
    4. If chd_lock_name is not null, call IS_FREE_LOCK(chd_lock_name) on the client's master database.
    5. If that returns 0, another dispatcher is holding the lock. Exit (or try another client).
    6. Decide on a lock name (dbname.wb_changes_dispatch.client or some such) and use GET_LOCK() to grab that lock on the client's master database.
    7. Update the client's row in wb_changes_dispatch with the new lock name in chd_lock_name.
    8. Commit DB transaction on repo's master database.
  2. Perform the dispatch
    1. Get n changes with IDs > chd_latest_change from wb_changes in the repo's database. n is the configured batch size.
    2. Filter changes for those relevant to this client wiki (optional, and may prove tricky in complex cases, e.g. cached queries).
    3. Post the corresponding change notification jobs to the client wiki's job queue.
  3. Log and unlock
    1. Start DB transaction on repo's master database.
    2. Update the client's row in wb_changes_dispatch with chd_lock_name=NULL and updated chd_latest_change and chd_touched.
    3. Call RELEASE_LOCK() to release the global lock we were holding.
    4. Commit DB transaction on repo's master database.

This can be repeated multiple times by one process, with a configurable delay between runs.
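The three phases above can be sketched as a single function. This is an in-memory simulation under stated assumptions, not the dispatcher's real code: a Python set stands in for the database's named locks (GET_LOCK, IS_FREE_LOCK, RELEASE_LOCK), plain dictionaries stand in for wb_changes_dispatch and the job queues, transactions are omitted, and the optional filtering step is skipped. The function name dispatch_once and the lock name format are hypothetical.

```python
from datetime import datetime, timezone


def dispatch_once(client, changes, dispatch_rows, held_locks, job_queues,
                  batch_size=100):
    """Run one pass for one client.

    Returns the number of changes posted, or None if another dispatcher
    currently holds the client's lock.
    """
    # 1. Lock and initialize: a missing row means chd_latest_change = 0.
    row = dispatch_rows.setdefault(client, {
        "chd_latest_change": 0, "chd_touched": None, "chd_lock_name": None,
    })
    recorded = row["chd_lock_name"]
    if recorded is not None and recorded in held_locks:
        return None  # corresponds to IS_FREE_LOCK() returning 0
    lock_name = f"wb_changes_dispatch.{client}"  # assumed naming scheme
    held_locks.add(lock_name)                    # stands in for GET_LOCK()
    row["chd_lock_name"] = lock_name
    try:
        # 2. Perform the dispatch: next batch of changes after the cursor.
        pending = sorted(
            (c for c in changes if c["id"] > row["chd_latest_change"]),
            key=lambda c: c["id"],
        )
        batch = pending[:batch_size]
        if batch:
            job_queues.setdefault(client, []).append(batch)
            row["chd_latest_change"] = batch[-1]["id"]
        row["chd_touched"] = datetime.now(timezone.utc).isoformat()
    finally:
        # 3. Log and unlock: clear the lock name, then release the lock.
        row["chd_lock_name"] = None
        held_locks.discard(lock_name)            # stands in for RELEASE_LOCK()
    return len(batch)


changes = [{"id": i} for i in range(1, 6)]
rows, locks, queues = {}, set(), {}
first = dispatch_once("enwiki", changes, rows, locks, queues, batch_size=3)
second = dispatch_once("enwiki", changes, rows, locks, queues, batch_size=3)
```

Note that a recorded lock name whose lock is no longer held (for example after a crashed dispatcher) does not block the client in this sketch, which matches the IS_FREE_LOCK check in step 1.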

Changes Notification Jobs

The dispatcher posts changes notification jobs to the client wiki's job queue. These jobs contain a list of Wikidata changes. When processing such a job, the client wiki performs the following steps:

  1. If the client maintains a local cache of entity data, update it.
  2. Find which pages need to be re-rendered after the change. Invalidate them and purge them from the web caches. Optionally, schedule re-render (or link update) jobs, or even re-render the page directly.
  3. Find which pages have changes that do not need re-rendering of content, but influence the page output, and thus need purging of the web caches (this may at some point be the case for changes to language links).
  4. Inject notifications about relevant changes into the client's recentchanges table. For this, consecutive edits by the same user to the same item can be coalesced.
  5. Possibly also inject a "null-entry" into the respective pages' history, i.e. the revision table.
(See comment on discussion page about recentchanges versus history table)
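A much simplified sketch of steps 2 and 4 is shown below. It relies on the assumption stated earlier that each item links to at most one page on a given client. Everything here is hypothetical: the function process_job, the sitelinks dictionary (item ID to local page title), and the shape of the change records are illustrative only, and cache updates, web cache purging and history entries are left out.

```python
def process_job(changes, sitelinks, invalidated, recent_changes):
    """Handle one notification job on a client wiki (simplified).

    changes:        list of dicts with "id", "item" and "user" keys
    sitelinks:      maps an item ID to the single local page linked to it
    invalidated:    set collecting page titles that need re-rendering
    recent_changes: list standing in for the client's recentchanges table
    """
    for change in changes:
        page = sitelinks.get(change["item"])
        if page is None:
            continue  # the item has no page on this client: nothing to do
        invalidated.add(page)
        recent_changes.append({
            "page": page,
            "user": change["user"],
            "change_id": change["id"],
        })


sitelinks = {"Q64": "Berlin"}
invalidated, recent_changes = set(), []
job = [
    {"id": 10, "item": "Q64", "user": "Alice"},
    {"id": 11, "item": "Q1", "user": "Bob"},   # not linked on this client
    {"id": 12, "item": "Q64", "user": "Alice"},
]
process_job(job, sitelinks, invalidated, recent_changes)
```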

Coalescing Events

The system described above means several database writes for every change, and potentially many reads, depending on what is needed for rendering the page. And this happens on every client wiki (potentially hundreds) for every change on the repository. Since edits on the Wikibase repository tend to be very fine-grained (like setting a label or adding a site link), this can quickly become problematic. Coalescing updates could help with this problem:

As explained in the Dispatching Changes section, entries on the changes feed are processed in batches (by default, no more than 100 entries at once).

If multiple changes to the same item are processed in the same batch, these changes can be coalesced if they were all performed consecutively by the same user. This would reduce the number of times pages get invalidated (and thus eventually re-rendered). All the necessary entries in the recentchanges table (and possibly the revision table) can be inserted using a single database request. This process can be fine-tuned by adjusting the batch size and the delay between batches.
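The coalescing rule could look roughly like this. It is a minimal sketch, assuming each change is a dict with "id", "item" and "user" keys and that the batch is ordered by change ID; the function name coalesce and the shape of the merged records are hypothetical. "Consecutive" is interpreted per item, so edits to other items in between do not break a run, but an edit to the same item by another user does.

```python
from collections import defaultdict
from itertools import groupby


def coalesce(batch):
    """Merge runs of consecutive same-user changes to the same item."""
    # Split the batch into per-item histories, keeping the original order.
    by_item = defaultdict(list)
    for change in batch:
        by_item[change["item"]].append(change)

    merged = []
    for item, item_changes in by_item.items():
        # groupby only groups adjacent entries, which is exactly a "run".
        for user, run in groupby(item_changes, key=lambda c: c["user"]):
            run = list(run)
            merged.append({
                "item": item,
                "user": user,
                "first_id": run[0]["id"],
                "last_id": run[-1]["id"],
                "count": len(run),
            })
    # Restore feed order, based on where each run started.
    merged.sort(key=lambda m: m["first_id"])
    return merged


batch = [
    {"id": 1, "item": "Q1", "user": "Alice"},
    {"id": 2, "item": "Q2", "user": "Bob"},
    {"id": 3, "item": "Q1", "user": "Alice"},
    {"id": 4, "item": "Q1", "user": "Carol"},
    {"id": 5, "item": "Q1", "user": "Alice"},
]
merged = coalesce(batch)
```

In this example the five changes collapse into four entries: changes 1 and 3 are merged, while change 5 stays separate because Carol edited the same item in between.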