Research:Exploration on content propagation across Wikimedia projects
- Understand the cross-pollination of content across Wikimedia projects.
In this research we want to understand how content propagates across the different language editions of Wikipedia. As the unit of study we use the sitelinks on Wikidata Items, meaning that we consider the subset of Wikidata Items that have at least one article associated with any Wikimedia project. We start by evaluating, once a sitelink is created in one language, which language it is most likely to propagate to next. Our hypothesis is that there is a relation between the creation of items in different languages. For example, if an item exists in just one language it might not propagate to more projects, but if the item already exists in 5 languages it is more likely to appear in a new project. Moreover, this probability may also depend on how related the languages (as a proxy for cultures) are. For example, if an item has sitelinks to Spanish, Catalan, and Portuguese, is it more likely to appear later in French than in Chinese?
- Being able to model content propagation across wikis would be useful both to promote the flow of high-quality content and to prevent the cross-pollination of mis/disinformation.
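The "most probable next language" question can be illustrated with a simple first-order estimate. The following is a minimal sketch, not the method used in this research: it assumes that each item comes with a list of (language, creation time) pairs for its sitelinks, and it counts how often one language is directly followed by another. The `sitelinks` data and the function name `next_language_probabilities` are hypothetical and exist only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: item id -> list of (language, sitelink creation time).
# Real data would have to be derived from the Wikidata sitelink history.
sitelinks = {
    "Q1": [("es", 1), ("ca", 2), ("pt", 3), ("fr", 4)],
    "Q2": [("es", 1), ("pt", 5)],
    "Q3": [("en", 1), ("fr", 2)],
    "Q4": [("zh", 1)],  # exists in one language only, so it adds no transition
}


def next_language_probabilities(sitelinks):
    """Estimate P(next language | current language) from consecutive creations."""
    counts = defaultdict(Counter)
    for events in sitelinks.values():
        # Order the languages of one item by sitelink creation time.
        ordered = [lang for lang, _ in sorted(events, key=lambda e: e[1])]
        # Count each pair of consecutive languages as one transition.
        for current, following in zip(ordered, ordered[1:]):
            counts[current][following] += 1
    # Normalize the counts of each source language into probabilities.
    return {
        current: {lang: n / sum(c.values()) for lang, n in c.items()}
        for current, c in counts.items()
    }


probs = next_language_probabilities(sitelinks)
print(probs["es"])
```

This estimate conditions only on the most recent language. Testing the hypothesis stated above would also require conditioning on how many languages an item already has and on which ones they are.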
Here we list our most important findings:
- We found that 67% of articles exist in just one language and 90% of items exist in 4 or fewer languages.
- We found that over 15% of content coming from small Wikipedias (ranked by number of articles) can end up propagating to larger Wikipedias.
- As expected, we found that cultural similarities among wikis are also related to the number of articles they share.
- Although predicting whether an article will propagate to other languages is difficult, we show some promising ML models and problem formulations for this task.
- We have created a large dataset to study knowledge propagation in Wikipedia.
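Coverage statistics such as the first finding above can be computed by counting the languages each item is linked to. The sketch below assumes a mapping from item id to the set of languages with a sitelink. The `item_languages` data and the helper `share_with_at_most` are hypothetical, and the toy numbers are not the figures reported above.

```python
# Hypothetical toy data: item id -> set of languages that have a sitelink.
item_languages = {
    "Q1": {"es"},
    "Q2": {"en"},
    "Q3": {"en", "fr"},
    "Q4": {"es", "ca", "pt", "fr", "it"},
}


def share_with_at_most(item_languages, k):
    """Return the fraction of items whose sitelinks cover at most k languages."""
    total = len(item_languages)
    within = sum(1 for langs in item_languages.values() if len(langs) <= k)
    return within / total


single = share_with_at_most(item_languages, 1)      # items in exactly one language
up_to_four = share_with_at_most(item_languages, 4)  # items in 4 or fewer languages
print(single, up_to_four)
```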
More about our experiments can be found here. To learn about the dataset, please read our paper published at ICWSM'21.
We are currently working on understanding the relationship between content propagation and the existence of reliability issues by crossing the aforementioned dataset with the Wiki-reliability data, and on analyzing the alignment of content quality across languages.