Research:External Reuse of Wikimedia Content/Wikidata Transclusion

Tracked in Phabricator:
Task T249654

While transclusion of Wikidata by other Wikimedia projects (namely Wikipedia and Commons) is not external re-use, there are good reasons to study it as part of a broader research agenda around external re-use of Wikimedia content. Wikidata transclusion parallels several important aspects of external re-use: Wikidata bridges serve as a prototype of editing Wikimedia content outside of its broader context; the question of what sorts of information can be effectively sourced from a single knowledge base, as opposed to decided locally, mirrors the decisions that outside platforms make about where they will source their content; and it offers a way to understand the value of linked open data and the variation in value of specific Wikidata statements. With Wikidata transclusion, we also have full access to content and pageview data, which reduces the amount of estimation that we have to do about how people engage with content and makes experimentation (where deemed valuable) easier. Finally, focusing on Wikidata is an opportunity to expand our research focus beyond Wikipedia (and hopefully to some of the smaller Wikimedia projects that transclude Wikidata as well).

Approaches

The most comprehensive data about how Wikidata is transcluded by other Wikimedia projects comes from the wbc_entity_usage Wikibase table. This table tracks which Wikidata entities are used by which Wikipedia articles (and pages on other projects) so that changes to those Wikidata entities can be pushed to those articles. The main drawback is that this table gives only a very high-level view of Wikidata transclusion and tells us almost nothing about how the Wikidata content is actually used by the article.
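
As a rough illustration of the high-level view this table offers, the sketch below tallies usage rows by aspect. It assumes rows have already been exported as (eu_entity_id, eu_aspect, eu_page_id) tuples and that aspects are single-letter codes with an optional modifier after a dot (e.g., C.P569). The column names, the aspect mapping, and the sample rows are assumptions for illustration and should be checked against the current Wikibase documentation.

```python
from collections import Counter

# Aspect codes as commonly documented for the Wikibase client; this mapping is
# an assumption here and should be verified against the current documentation.
ASPECT_NAMES = {
    "S": "sitelinks",
    "L": "label",
    "D": "description",
    "T": "title",
    "C": "statements",
    "O": "other",
    "X": "all",
}


def split_aspect(aspect):
    """Split 'C.P569' into ('C', 'P569'); an unmodified 'C' gives ('C', None)."""
    code, _, modifier = aspect.partition(".")
    return code, modifier or None


def summarize_usage(rows):
    """Count rows per aspect code and collect pages with general C usage.

    rows: iterable of (eu_entity_id, eu_aspect, eu_page_id) tuples.
    """
    by_code = Counter()
    general_statement_pages = set()
    for _entity_id, aspect, page_id in rows:
        code, modifier = split_aspect(aspect)
        by_code[code] += 1
        # A 'C' aspect without a property modifier is the general case in
        # which a change to any statement on the item is pushed to the page.
        if code == "C" and modifier is None:
            general_statement_pages.add(page_id)
    return by_code, general_statement_pages


# Made-up sample rows, for illustration only.
sample_rows = [
    ("Q42", "S", 1),
    ("Q42", "D.en", 1),
    ("Q42", "C.P569", 1),
    ("Q42", "C", 2),
    ("Q5", "L.en", 2),
]
counts, general_pages = summarize_usage(sample_rows)
for code, n in sorted(counts.items()):
    print(ASPECT_NAMES.get(code, code), n)
```

Even in this toy form, the limitation described above is visible: the rows say that a page depends on a statement or label, but not where or how that value appears in the article.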

We can also infer certain types of usage from what we know about how a given Wikipedia platform works (e.g., the use of Wikidata descriptions by the mobile app, or how certain templates transclude Wikidata content). For these cases, we can confidently estimate usage merely by analyzing template usage in the dumps for Wikipedia, potentially with additional cross-referencing of the associated Wikidata items (e.g., via the API or JSON dumps). The drawback is that this will only ever cover the subset of Wikidata transclusion for which use cases can confidently be coded into a script.
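
A minimal sketch of this kind of script is shown below. It assumes the wikitext of an article has already been read from a dump and uses a simple regular expression to pull out template names. The regex is a simplification rather than a full wikitext parser (for instance, it can also match template parameters written as {{{param}}}), and the set of metadata templates is a hypothetical starting list based on the English Wikipedia examples named in this page.

```python
import re

# Matches the name part of "{{Name}}" or "{{Name|...}}". A simplification:
# it ignores redirects and will also match {{{param}}} placeholders.
TEMPLATE_NAME = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})")

# Hypothetical list of templates known to transclude Wikidata identifiers.
METADATA_TEMPLATES = {"Authority control", "Taxonbar"}


def normalize(name):
    """Apply basic title normalization: underscores to spaces, first letter uppercase."""
    name = name.replace("_", " ").strip()
    return name[:1].upper() + name[1:]


def template_names(wikitext):
    """Return the normalized names of all templates found in the wikitext."""
    return [normalize(m.group(1)) for m in TEMPLATE_NAME.finditer(wikitext)]


def metadata_templates_used(wikitext):
    """Return which of the known metadata templates appear in the wikitext."""
    return METADATA_TEMPLATES.intersection(template_names(wikitext))


sample_wikitext = (
    "{{Infobox person\n| name = Ada}}\n"
    "Some article text.\n"
    "{{authority control}}\n"
    "{{Taxonbar|from=Q140}}"
)
print(template_names(sample_wikitext))
print(sorted(metadata_templates_used(sample_wikitext)))
```

For anything beyond a quick estimate, a dedicated wikitext parser and a lookup of template redirects would likely be needed, since the same template can be invoked under several names.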

Finally, we can build a more complete picture of Wikidata usage by manually inspecting the templates used by individual Wikipedia articles and cross-referencing them with the Wikidata items to which the articles are connected. The challenge is that it is difficult to automatically infer how a given template uses Wikidata, so this requires in-depth inspection of the most popular templates to determine how they include (or, in some cases, do not include) Wikidata content. Given that templates are not global, this approach is naturally quite limited in its ability to describe Wikidata transclusion across many language versions of Wikipedia, but it is an important complement to more automated methods of analysis.

Analysis of English Wikipedia

Recommendations

The analysis of English Wikipedia has many limitations (namely, English Wikipedia is not necessarily typical of how other language communities have adopted Wikidata transclusion), but it does provide some initial insights into the extent of Wikidata transclusion and recommendations for how it might be better tracked. Specifically, much of the transclusion recorded in the wbc_entity_usage table is low-importance. This is further magnified in the Recent Changes feed for patrollers, where general aspects like those triggered by metadata templates mean that many of the Wikidata changes that show up in Recent Changes have no impact on the article at all. My two main recommendations for providing more nuance in statistics and more flexibility for reducing noise in Recent Changes are:

  • Distinguishing between standard statements and identifiers in Lua calls (and wbc_entity_usage): this would make it much easier to distinguish between transclusion that is part of linked open data and transclusion of facts such as date of birth. It would also substantially reduce noise in Recent Changes because, at least in English Wikipedia, the very common metadata templates like Authority Control and Taxonbar trigger a general C aspect, so changes to any part of the Wikidata item show up in Recent Changes even when they have no impact on the article. In theory, a filter could then be added to Recent Changes to control how changes to identifiers show up in the feed.
  • Distinguishing between content transclusion and tracking in Lua calls (and wbc_entity_usage): this could be a parameter passed with Lua calls to indicate that the property is only being used for tracking. This might just be a hacky change that isn't useful long-term, but tracking categories generate a lot of the entries in the wbc_entity_usage table and differ substantially in impact from content transclusion. These entries could then easily be filtered out of Recent Changes (much as changes to categories can now be filtered out).
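
To make the two recommendations concrete, the sketch below shows how a Recent Changes filter might behave if usage records carried the two proposed distinctions. Everything here is hypothetical: the Usage record, its is_identifier and tracking_only fields, and the should_show_change function do not correspond to any existing Wikibase schema or API and are only meant to illustrate the proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """One hypothetical usage record for a page; not the current table schema."""
    property_id: str      # e.g. "P569"
    is_identifier: bool   # proposed: the property is an external identifier
    tracking_only: bool   # proposed: the Lua call flagged the usage as tracking


def should_show_change(changed_property, usages,
                       hide_identifiers=True, hide_tracking=True):
    """Return True if a change to changed_property should appear in the feed.

    A change is shown if at least one usage of that property on the page
    survives the active filters.
    """
    for usage in usages:
        if usage.property_id != changed_property:
            continue
        if hide_tracking and usage.tracking_only:
            continue
        if hide_identifiers and usage.is_identifier:
            continue
        return True
    return False


# Made-up usages for a single article, for illustration only.
page_usages = [
    Usage("P569", is_identifier=False, tracking_only=False),  # date of birth shown in the article
    Usage("P214", is_identifier=True, tracking_only=False),   # identifier used by a metadata template
    Usage("P18", is_identifier=False, tracking_only=True),    # property checked only for a tracking category
]

print(should_show_change("P569", page_usages))
print(should_show_change("P214", page_usages))
print(should_show_change("P214", page_usages, hide_identifiers=False))
```

The filters are written as opt-out flags so that patrollers who do want to see identifier or tracking changes could keep the current behavior.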

Future Work

  • Expand to more languages, especially languages like Japanese that seem to have far less transclusion and Catalan or Russian that seem to have far more transclusion.
  • Expand automated analyses to include some infobox templates as well.
  • Compute statistics for what percentage of Wikidata changes in the Recent Changes feed have no impact on the article vs. map to one of the importance categories. This will very likely have to be manually coded for a small sample much like this first stage of the project before considering more automated methods for analysis.