Wikidata/Notes/Change propagation


This document is a draft, and should not be assumed to represent the final architecture.

This document describes how Wikibase propagates changes on the repository to any client wikis.

Overview

Each change on the repository is recorded in the changes table, which acts as an update feed to any client wikis (such as the Wikipedias).
Dispatcher scripts periodically check the changes table.
Each client wiki is notified of any changes on the repository via entries in its job queue. These jobs are used to invalidate and re-render the relevant pages on the client wiki.
Notifications about the changes are injected into the client's recentchanges table, to make them visible on watchlists, etc.
Consecutive edits by the same user to the same data item can be coalesced into one, to avoid clutter.

Assumptions and Terminology

The data managed by the Wikibase repository is structured into (data) entities. Every entity is maintained as a wiki page containing structured data. There are several types of entities, but one is particularly important in this context: items. Items are special in that they are linked with article pages on each client wiki (e.g. each Wikipedia). For more information, see the data model primer.

The propagation mechanism is based on the assumption that each data item on the Wikidata repository has at most one site link to each client wiki, and that only one item on the repository can link to any given page on a given client wiki. That is, any page on any client wiki can be associated with at most one data item on the repository.
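
The two uniqueness assumptions can be illustrated with a minimal Python sketch. The class and method names (SiteLinkIndex, add_link, item_for_page) and the example IDs are hypothetical and not part of Wikibase; the sketch only shows the constraint that the propagation mechanism relies on.

```python
class SiteLinkIndex:
    """Hypothetical in-memory index enforcing the two site link assumptions."""

    def __init__(self):
        # (item_id, client_wiki) -> page title: at most one site link per item and wiki
        self._by_item = {}
        # (client_wiki, page_title) -> item_id: at most one item per client page
        self._by_page = {}

    def add_link(self, item_id, client_wiki, page_title):
        if (item_id, client_wiki) in self._by_item:
            raise ValueError(f"{item_id} already has a site link to {client_wiki}")
        if (client_wiki, page_title) in self._by_page:
            raise ValueError(f"{client_wiki}:{page_title} is already linked to an item")
        self._by_item[(item_id, client_wiki)] = page_title
        self._by_page[(client_wiki, page_title)] = item_id

    def item_for_page(self, client_wiki, page_title):
        """Return the single item linked to a client page, or None."""
        return self._by_page.get((client_wiki, page_title))


index = SiteLinkIndex()
index.add_link("Q64", "enwiki", "Berlin")
index.add_link("Q64", "dewiki", "Berlin")
```

Because a client page maps to at most one item, a change to an item identifies at most one directly linked page per client wiki.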

(See comment on discussion page about consequences of limiting change propagation to cases where Wikipedia page and Wikidata item have a 1:1 relation)

This mechanism also assumes that all wikis, the repository and the clients (i.e. Wikidata and the Wikipedias), can connect directly to each other's databases. Typically, this means that they reside in the same local network. However, the wikis may use separate database servers: wikis are grouped into sections, where each section has one master database and potentially many slave databases (together forming a database cluster).

Communication between the repository (Wikidata) and the clients (Wikipedias) is done via an update feed. For now, this is implemented as a database table (the changes table) which is accessed by the dispatcher scripts directly, using the "foreign database" mechanism.

Support for 3rd party clients, that is, client wikis and other consumers outside of Wikimedia, is currently not essential and will not be implemented for now. It shall however be kept in mind for all design decisions.

Change Logging

Every change performed on the repository is logged into a table (the "changes table", namely wb_changes) in the repo's database. The changes table behaves similarly to MediaWiki's recentchanges table, in that it only holds changes for a certain time (e.g. a day or a week); older entries are purged periodically. Unlike the recentchanges table, however, wb_changes contains all information necessary to report and replay the change on a client wiki: besides information about when the change was made and by whom, it contains a structural diff against the entity's previous revision.
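
As a rough illustration of what such a log entry carries and how the periodic purge behaves, consider the following Python sketch. The field names, the diff layout, and the purge_old_changes helper are assumptions made for this example; they do not reflect the actual wb_changes schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    """Hypothetical shape of one entry in the changes table."""
    change_id: int
    entity_id: str
    user: str
    timestamp: datetime
    diff: dict  # structural diff against the entity's previous revision (layout assumed)


def purge_old_changes(changes, now, max_age):
    """Keep only changes younger than max_age, as a periodic purge would."""
    cutoff = now - max_age
    return [c for c in changes if c.timestamp >= cutoff]


now = datetime(2013, 1, 8)
log = [
    Change(1, "Q64", "Alice", datetime(2012, 12, 25),
           {"label": {"en": {"old": None, "new": "Berlin"}}}),
    Change(2, "Q64", "Bob", datetime(2013, 1, 7),
           {"sitelink": {"dewiki": {"old": None, "new": "Berlin"}}}),
]
recent = purge_old_changes(log, now, timedelta(days=7))
```

Only the second change survives a one-week retention window in this example.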

Effectively, the changes table acts as an update feed. Care shall be taken to isolate the database table as an implementation detail from the update feed, so it can later be replaced by an alternative mechanism, such as PubHub or an event bus. Note, however, that a protocol with queue semantics is not appropriate (it would require one queue per client).
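
One way to keep the table an implementation detail is to put a small feed interface in front of it. The Python sketch below is only an assumption about how that separation might look; ChangeFeed, TableBackedFeed, and changes_after are hypothetical names. It also shows why queue semantics do not fit: reading from the feed does not consume entries, so every client tracks its own position in the same stream.

```python
from typing import Iterable, Protocol


class ChangeFeed(Protocol):
    """Hypothetical feed interface; the backing store is not exposed."""

    def changes_after(self, last_seen_id: int, limit: int) -> Iterable[dict]:
        ...


class TableBackedFeed:
    """Feed backed by a list that stands in for the changes table."""

    def __init__(self, rows):
        self._rows = sorted(rows, key=lambda r: r["id"])

    def changes_after(self, last_seen_id, limit):
        newer = [r for r in self._rows if r["id"] > last_seen_id]
        return newer[:limit]


feed = TableBackedFeed([{"id": 1}, {"id": 2}, {"id": 3}])
# Each client reads from its own position; nothing is removed from the feed.
enwiki_batch = feed.changes_after(0, limit=2)
dewiki_batch = feed.changes_after(2, limit=2)
```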

Dispatching Changes

Changes on the repository (e.g. wikidata.org) are dispatched to client wikis (e.g. Wikipedias) by a dispatcher script. This script polls the repository's wb_changes table for changes, and dispatches them to the client wikis by posting the appropriate jobs to the client's job queue.

The dispatcher script is designed in a way that allows any number of instances to run and share load without any prior knowledge of each other. They are coordinated via the repository's database using the wb_changes_dispatch table:

chd_client: the client's database name (primary key).
chd_latest_change: the ID of the last change that was dispatched to the client.
chd_touched: a timestamp indicating when updates have last been dispatched to the client.
chd_lock_name: the name of the global lock used by the dispatcher currently updating that client (or NULL).

The dispatcher operates by going through the following steps:

Lock and initialize
1. Choose a client to update from the list of known clients.
2. Start DB transaction on repo's master database.
3. Read the given client's row from wb_changes_dispatch (if missing, assume chd_latest_change = 0).
4. If chd_lock_name is not null, call IS_FREE_LOCK(chd_lock_name) on the client's master database.
5. If that returns 0, another dispatcher is holding the lock. Exit (or try another client).
6. Decide on a lock name (dbname.wb_changes_dispatch.client or some such) and use GET_LOCK() to grab that lock on the client's master database.
7. Update the client's row in wb_changes_dispatch with the new lock name in chd_lock_name.
8. Commit DB transaction on repo's master database.

Perform the dispatch
1. Get n changes with IDs > chd_latest_change from wb_changes in the repo's database. n is the configured batch size.
2. Filter changes for those relevant to this client wiki (optional, and may prove tricky in complex cases, e.g. cached queries).
3. Post the corresponding change notification jobs to the client wiki's job queue.

Log and unlock
1. Start DB transaction on repo's master database.
2. Update the client's row in wb_changes_dispatch with chd_lock_name=NULL and updated chd_latest_change and chd_touched.
3. Call RELEASE_LOCK() to release the global lock we were holding.
4. Commit DB transaction on repo's master database.

This can be repeated multiple times by one process, with a configurable delay between runs.
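
The three phases above can be sketched in Python with in-memory stand-ins. This is a simplified illustration under stated assumptions, not the actual dispatcher: a dict replaces wb_changes_dispatch, a set of names replaces the global locks obtained with GET_LOCK() and RELEASE_LOCK(), database transactions are omitted, and the optional filtering step is skipped. The function name dispatch_once and its parameters are hypothetical.

```python
import time


def dispatch_once(client, changes, dispatch_state, held_locks, job_queues,
                  batch_size=100, repo_db="repo"):
    """Run one dispatch pass for one client.

    Returns the number of changes dispatched, or None if another
    dispatcher currently holds the client's lock.
    """
    # Lock and initialize
    row = dispatch_state.setdefault(
        client,
        {"chd_latest_change": 0, "chd_touched": None, "chd_lock_name": None},
    )
    if row["chd_lock_name"] is not None and row["chd_lock_name"] in held_locks:
        return None  # lock is not free: skip this client
    lock_name = f"{repo_db}.wb_changes_dispatch.{client}"
    held_locks.add(lock_name)  # stands in for GET_LOCK()
    row["chd_lock_name"] = lock_name

    try:
        # Perform the dispatch
        pending = sorted(
            (c for c in changes if c["id"] > row["chd_latest_change"]),
            key=lambda c: c["id"],
        )
        batch = pending[:batch_size]
        if batch:
            job_queues.setdefault(client, []).append({"changes": batch})
            row["chd_latest_change"] = batch[-1]["id"]
        row["chd_touched"] = time.time()
        return len(batch)
    finally:
        # Log and unlock
        row["chd_lock_name"] = None
        held_locks.discard(lock_name)  # stands in for RELEASE_LOCK()


changes = [{"id": i, "entity": "Q64"} for i in range(1, 6)]
state, locks, queues = {}, set(), {}
first = dispatch_once("enwiki", changes, state, locks, queues, batch_size=3)
second = dispatch_once("enwiki", changes, state, locks, queues, batch_size=3)
```

With five pending changes and a batch size of three, two passes are needed before the client is up to date. Because the position is only advanced after the job has been posted, a failed pass leaves chd_latest_change untouched and the same changes are picked up again later.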

Changes Notification Jobs

The dispatcher posts changes notification jobs to the client wiki's job queue. These jobs contain a list of Wikidata changes. When processing such a job, the client wiki performs the following steps:

If the client maintains a local cache of entity data, update it.
Find which pages need to be re-rendered after the change. Invalidate them and purge them from the web caches. Optionally, schedule re-render (or link update) jobs, or even re-render the page directly.
Find which pages have changes that do not need re-rendering of content, but influence the page output, and thus need purging of the web caches (this may at some point be the case for changes to language links).
Inject notifications about relevant changes into the client's recentchanges table. For this, consecutive edits by the same user to the same item can be coalesced.
Possibly also inject a "null-entry" into the respective pages' history, i.e. the revision table.
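
A heavily simplified Python sketch of such a job handler follows. It assumes that affected pages are found through the item's site link, that the local entity cache and the page cache are plain dicts, and that each change carries the new entity data; all names (process_change_job, site_links, page_cache) are hypothetical, and web cache purging and history entries are left out.

```python
def process_change_job(job, site_links, entity_cache, page_cache, recent_changes):
    """Apply one change notification job on a client wiki (simplified)."""
    for change in job["changes"]:
        entity_id = change["entity"]
        # Refresh the local entity cache, if the client keeps one.
        if entity_cache is not None:
            entity_cache[entity_id] = change["data"]
        # Find the affected page; here this is the page the item's
        # site link points to (an assumption of this sketch).
        page = site_links.get(entity_id)
        if page is None:
            continue  # the item is not linked on this wiki
        # Invalidate the cached rendering so the page is re-rendered later.
        page_cache.pop(page, None)
        # Make the change visible in the client's recent changes.
        recent_changes.append(
            {"page": page, "user": change["user"], "change_id": change["id"]}
        )


site_links = {"Q64": "Berlin"}
entity_cache = {}
page_cache = {"Berlin": "<cached html>", "Paris": "<cached html>"}
recent_changes = []
job = {"changes": [
    {"id": 7, "entity": "Q64", "user": "Alice", "data": {"label": "Berlin"}},
    {"id": 8, "entity": "Q90", "user": "Bob", "data": {"label": "Paris"}},
]}
process_change_job(job, site_links, entity_cache, page_cache, recent_changes)
```

In this example only the page linked to Q64 is invalidated; the change to Q90 has no site link on this wiki and produces no recent changes entry.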

(See comment on discussion page about recentchanges versus history table)

Coalescing Events

The system described above means several database writes for every change, and potentially many reads, depending on what is needed for rendering the page. And this happens on every client wiki (potentially hundreds) for every change on the repository. Since edits on the Wikibase repository tend to be very fine-grained (like setting a label or adding a site link), this can quickly become problematic. Coalescing updates could help with this problem:

As explained in the Dispatching section, entries on the changes feed are processed in batches (by default, no more than 100 entries at once).

If multiple changes to the same item are processed in the same batch, these changes can be coalesced together if they were all performed consecutively by the same user. This would reduce the number of times pages get invalidated (and thus eventually re-rendered). All the necessary entries in the recentchanges table (and possibly the revision table) can be inserted using a single database request. This process can be fine-tuned by adjusting the batch size and delay between batches.
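
The coalescing rule can be sketched in Python as follows. This is one possible reading of "consecutive", assumed for the example: edits count as consecutive if no other user edited the same item in between, even when edits to other items are interleaved in the batch. The function name coalesce and the record layout are hypothetical.

```python
from itertools import groupby


def coalesce(batch):
    """Merge runs of consecutive edits by the same user to the same item.

    The batch is expected to be ordered by change ID.
    """
    # Split the batch into one ordered edit sequence per item.
    per_item = {}
    for change in batch:
        per_item.setdefault(change["entity"], []).append(change)

    merged = []
    for entity, edits in per_item.items():
        # Within one item, group uninterrupted runs by the same user.
        for user, run in groupby(edits, key=lambda c: c["user"]):
            run = list(run)
            merged.append({
                "entity": entity,
                "user": user,
                "first_id": run[0]["id"],
                "last_id": run[-1]["id"],
                "count": len(run),
            })
    merged.sort(key=lambda m: m["first_id"])
    return merged


batch = [
    {"id": 1, "entity": "Q64", "user": "Alice"},
    {"id": 2, "entity": "Q64", "user": "Alice"},
    {"id": 3, "entity": "Q42", "user": "Bob"},
    {"id": 4, "entity": "Q64", "user": "Alice"},
    {"id": 5, "entity": "Q64", "user": "Bob"},
    {"id": 6, "entity": "Q64", "user": "Alice"},
]
coalesced = coalesce(batch)
```

Here six changes collapse into four notifications: Alice's first three edits to Q64 become one entry, while her last edit stays separate because Bob edited the same item in between.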