Research:Wikipedia Knowledge Integrity Risk Observatory
Wikipedia is one of the main repositories of free knowledge available today, with a central role in the Web ecosystem. For this reason, it can also be a battleground for actors trying to impose specific points of view or even to spread disinformation online. There is a growing need to monitor its "health", but this is not an easy task. Wikipedia exists in over 300 language editions and each project is maintained by a different community, with its own strengths, weaknesses and limitations. The output of this project will be a multi-dimensional observatory to monitor knowledge integrity risks across different Wikimedia projects.
The Web has become the largest repository of knowledge ever known in just three decades. However, recent years have seen the proliferation of sophisticated strategies that heavily affect the reliability and trustworthiness of online information. Web platforms increasingly encounter misinformation problems caused by deception techniques such as astroturfing, harmful bots, computational propaganda, sockpuppetry and data voids.
Wikipedia, the world’s largest online encyclopedia in which millions of volunteer contributors create and maintain free knowledge, is not free from the aforementioned problems. Disinformation is one of its most relevant challenges, and some editors devote a substantial amount of their time to patrolling tasks in order to detect vandalism and make sure that new contributions comply with community policies and guidelines. Furthermore, Knowledge Integrity is one of the strategic programs of Wikimedia Research, with the goal of identifying and addressing threats to content on Wikipedia, increasing the capabilities of patrollers, and providing mechanisms for assessing the reliability of sources.
Many lessons have been learnt from fighting misinformation in Wikipedia and analyses of recent cases like the 2020 United States presidential election have suggested that the platform was better prepared than major social media outlets. However, there are Wikipedia editions in more than 300 languages, with very different contexts. To provide Wikipedia communities with an actionable monitoring system, this project will focus on creating a multi-dimensional observatory across different Wikimedia projects.
Taxonomy of Knowledge Integrity Risks
Risks to knowledge integrity in Wikipedia can arise in many and diverse forms. Inspired by recent work that proposed a taxonomy of knowledge gaps for Wikimedia projects, we have conducted a review of works by the Wikimedia Foundation, academic researchers and journalists that provide empirical evidence of knowledge integrity risks. We have then classified these risks by developing a hierarchical categorical structure.
We initially differentiate between internal and external risks according to their origin. The former correspond to issues specific to the Wikimedia ecosystem while the latter involve activity from other environments, both online and offline.
For internal risks, we have identified the following categories focused on either community or content:
- Community capacity: Pool of resources of the community. Resources relate to the size of the corresponding Wikipedia project such as the volume of articles and active editors, but also the number of editors with elevated user rights (e.g., admins, checkusers, oversighters) and specialized patrolling tools.
- Community governance: Situations and procedures involving decision-making within the community. The reviewed literature has identified risks like the unavailability of local rapid-response noticeboards on smaller wikis or the abuse of blocking practices by admins.
- Community demographics: Characteristics of community members. Some analyses highlight that the lack of geographical diversity might favor nationalistic biases. Other relevant dimensions are editors' age and activity, since misbehavior is often observed in the editing patterns of newly created accounts and of accounts that have been inactive for a long period, whether to evade certain patrolling systems or because they are no longer monitored and have been hacked.
- Content verifiability: Usage and reliability of sources in articles. This category is directly inspired by one of the three core content policies of Wikipedia (WP:V), which states that readers and editors must be able to check that information comes from a reliable source. It is referred to in several studies of content integrity.
- Content quality: Criteria used for the assessment of article quality. Since each Wikipedia language community decides its own standards and grading scale, some works have explored language-agnostic signals of content quality such as the volume of edits and editors or the appearance of specific templates. In fact, distinctive cultural quality mechanisms might exist, as these metrics do not always correlate with the featured status of articles.
- Content controversiality: Disputes between community members due to disagreements about the content of articles. Edit wars are the best-known phenomenon that occurs when content becomes controversial, sometimes requiring articles to be protected.
For external risks, we have identified the following categories:
- Media: References and visits to the Wikipedia project from other external media on the Internet. An unusual amount of traffic to specific articles coming from social media sites or search engines may be a sign of coordinated vandalism.
- Geopolitics: Political context of the community and content of the Wikipedia project. Some well-resourced parties (e.g., corporations, nations) might be interested in externally coordinated long-term disinformation campaigns in specific projects.
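The two-level structure described above (origin first, then category) can be written down as a small lookup table. This is only an illustrative sketch: the names `TAXONOMY` and `origin_of` are hypothetical and not part of the observatory.

```python
# Hypothetical encoding of the taxonomy: origin -> list of risk categories.
TAXONOMY = {
    "internal": [
        "Community capacity",
        "Community governance",
        "Community demographics",
        "Content verifiability",
        "Content quality",
        "Content controversiality",
    ],
    "external": [
        "Media",
        "Geopolitics",
    ],
}


def origin_of(category):
    """Return 'internal' or 'external' for a known category, else None."""
    for origin, categories in TAXONOMY.items():
        if category in categories:
            return origin
    return None


print(origin_of("Content quality"))
print(origin_of("Media"))
```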
The first version of the Wikipedia Knowledge Integrity Risk Observatory is built as a dashboard on the WMF's Superset instance. On the one hand, this system allows us to easily deploy visualizations built from various analytics data sources. On the other hand, NDA and LDAP access is needed to use it. Following the WMF Guiding Principles of openness, transparency and accountability, we expect to release the Wikipedia Knowledge Integrity Risk Observatory as a dashboard available to the global movement of volunteers. Therefore, future work will focus on designing an open technological infrastructure to provide Wikimedia communities with valuable information on knowledge integrity.
The following indicators are available (those with the notation (T) are computed on a temporal basis, usually monthly):
- General statistics: pages count, articles count, edits count, images count, users count, active users count, admins count, ratio of admins per active users, average count of edits per page.
- Editors: active editors count (T), new active editors count (T), active admins count (T), active admins vs new active editors, number and percentage of editors from specific group.
- Edits: edit count (T), special user group edits ratio (T).
- Tools: AbuseFilter edits count (T), AbuseFilter hits (T), Spamblacklist hits (T).
- Admins: addition of admins (T), removal of admins (T)
- Locks: globally locked editors count (T), locally locked editors count (T)
- Country: Gini index of country views count (T), Gini index of country edits count (T)
- Age: age of admins, age of active editors over time (T)
- ORES: ORES average scores
- Articles: average number of editors per article, percentage of stub articles (less than 1,500 chars)
- Edits: special user group edits ratio (T), bot edits ratio (T), IP edits ratio (T), content edits ratio (T), minor edits ratio (T).
- Reverts: reverts ratio (T), IP edits reverts ratio (T), IP edits reverts ratio vs active editors, IP edits reverts ratio vs active admins
- Protected pages: percentage of protected pages
- Traffic: traffic from social media (T) extracted from the Social media traffic report pilot, traffic from search engines (T).
- Press freedom: Viewing rate with press freedom index, editing share with press freedom index
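Two of the indicator types listed above, ratios and the Gini index, are simple to compute once the underlying counts are available. The following is a minimal sketch, not the observatory's actual implementation: the function names and the sample counts are hypothetical, and the Gini formula is the standard one for sorted non-negative values.

```python
def ratio(part, total):
    """Share of `part` in `total` (e.g. reverted edits over all edits)."""
    return part / total if total else 0.0


def gini(counts):
    """Gini index of non-negative counts.

    0 means the counts are perfectly even; values close to 1 mean that
    almost everything is concentrated in a single item.
    """
    values = sorted(c for c in counts if c >= 0)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # With x_1 <= ... <= x_n: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Made-up monthly view counts per country for one wiki.
views_by_country = [900, 50, 30, 20]
print(round(gini(views_by_country), 3))

# Made-up monthly edit counts: 120 reverted edits out of 4000 edits.
print(ratio(120, 4000))
```

A Gini index near 0 would suggest views or edits spread evenly across countries, while a value near its maximum would suggest concentration in one country.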
To illustrate the value of the indicators for knowledge integrity risk assessment in Wikipedia, we provide an example on community demographics, in particular geographical diversity. The graph shows the entropy of the distributions of the number of edits and views by country for the language editions with over 500K articles. The data was collected from November 2018 to April 2021. On the one hand, we observe large entropy values for both edits and views in the Arabic, English and Spanish editions, i.e., global communities. On the other hand, other large language editions like the Italian, Indonesian, Polish, Korean or Vietnamese Wikipedia lack that geographical diversity. We should highlight the extraordinarily low entropy values of the Japanese Wikipedia, which are consistent with one of the main causes attributed to misinformation incidents in this edition. We also notice the misalignment between high edit entropy and low view entropy in the Cebuano and Waray-Waray editions, which might be the result of the large fraction of content produced by bots distributed around the world. Also remarkable is the misalignment in the Egyptian Arabic Wikipedia, which has much larger entropy values for views than for edits.
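As a rough illustration of the entropy values discussed above, the sketch below computes the entropy of a distribution of counts by country. It assumes Shannon entropy in bits, which the text does not specify; the function name and the country counts are made up for the example.

```python
import math


def shannon_entropy(counts_by_country):
    """Shannon entropy (in bits) of a mapping country -> count.

    Higher values mean activity is spread over more countries;
    0 means all activity comes from a single country.
    """
    total = sum(counts_by_country.values())
    if total <= 0:
        return 0.0
    entropy = 0.0
    for count in counts_by_country.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Made-up edit counts: one geographically concentrated wiki, one spread out.
concentrated = {"JP": 980, "US": 15, "DE": 5}
spread_out = {"ES": 250, "MX": 250, "AR": 250, "CO": 250}

print(shannon_entropy(concentrated))
print(shannon_entropy(spread_out))
```

Under this assumption, an edition edited almost entirely from one country yields an entropy close to 0, whereas edits split evenly over four countries yield 2 bits.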
As mentioned above, current indicators are essentially item counts and distributions of items over features. We will focus on defining advanced metrics while preserving the criteria of ease of interpretation, comparability across wikis and language-agnosticism. Also, indicators should be periodically updatable to allow longitudinal observations. Another future challenge will be to define indicators with finer levels of granularity, that is to say, metrics computed not only on an entire Wikipedia project but on categories, pages, etc.
- Zhang, Jerry and Carpenter, Darrell and Ko, Myung. 2013. Online astroturfing: A theoretical perspective.
- Ferrara, Emilio and Varol, Onur and Davis, Clayton and Menczer, Filippo and Flammini, Alessandro. 2016. The rise of social bots. Communications of the ACM 59, 7, 96–104.
- Woolley, Samuel C and Howard, Philip N. 2018. Computational propaganda: political parties, politicians, and political manipulation on social media. Oxford University Press.
- Kumar, Srijan and Cheng, Justin and Leskovec, Jure and Subrahmanian, VS. 2017. An army of me: Sockpuppets in online discussion communities. In Proceedings of the 26th International Conference on World Wide Web. 857–866.
- Golebiewski, Michael and Boyd, Danah. 2018. Data voids: Where missing data can easily be exploited.
- Saez-Trumper, Diego. 2019. Online disinformation and the role of wikipedia. arXiv preprint arXiv:1910.12596
- Morgan, Jonathan. 2019. Research:Patrolling on Wikipedia. Report-Meta
- Zia, Leila and Johnson, Isaac and Mansurov, Bahodir and Morgan, Jonathan and Redi, Miriam and Saez-Trumper, Diego and Taraborelli, Dario. 2019. Knowledge Integrity. https://doi.org/10.6084/m9.figshare.7704626
- Kelly, Heather. 2021. On its 20th birthday, Wikipedia might be the safest place online. https://www.washingtonpost.com/technology/2021/01/15/wikipedia-20-year-anniversary/
- Morrison, Sara. 2020. How Wikipedia is preparing for Election Day. https://www.vox.com/recode/2020/11/2/21541880/wikipedia-presidential-election-misinformation-social-media
- Song, Victoria. 2020. A Teen Threw Scots Wiki Into Chaos and It Highlights a Massive Problem With Wikipedia. https://www.gizmodo.com.au/2020/08/a-teen-threw-scots-wiki-into-chaos-and-it-highlights-a-massive-problem-with-wikipedia
- Sato, Yumiko. 2021. Non-English Editions of Wikipedia Have a Misinformation Problem. https://slate.com/technology/2021/03/japanese-wikipedia-misinformation-non-english-editions.html
- Rogers, Richard and Sendijarevic, Emina and others. 2012. Neutral or National Point of View? A Comparison of Srebrenica articles across Wikipedia’s language versions. In unpublished conference paper, Wikipedia Academy, Berlin, Germany, Vol. 29
- Kumar, Srijan and Spezzano, Francesca and Subrahmanian, VS. 2015. Vews: A wikipedia vandal early warning system. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. 607–616.
- Kumar, Srijan and West, Robert and Leskovec, Jure. 2016. Disinformation on the web: Impact, characteristics, and detection of wikipedia hoaxes. In Proceedings of the 25th international conference on World Wide Web. 591–602.
- Joshi, Nikesh and Spezzano, Francesca and Green, Mayson and Hill, Elijah. 2020. Detecting Undisclosed Paid Editing in Wikipedia. In Proceedings of The Web Conference 2020. 2899–2905.
- Redi, Miriam and Fetahu, Besnik and Morgan, Jonathan and Taraborelli, Dario. 2019. Citation needed: A taxonomy and algorithmic assessment of Wikipedia’s verifiability. In The World Wide Web Conference. 1567–1578.
- Lewoniewski, Włodzimierz and Węcel, Krzysztof and Abramowicz, Witold. 2019. Multilingual ranking of Wikipedia articles with quality and popularity assessment in different topics. Computers 8, 3, 60.
- Lewoniewski, Włodzimierz and Węcel, Krzysztof and Abramowicz, Witold. 2017. Relative quality and popularity evaluation of multilingual Wikipedia articles. In Informatics, Vol. 4. Multidisciplinary Digital Publishing Institute, 43.
- Yasseri, Taha and Spoerri, Anselm and Graham, Mark and Kertész, János. 2014. The most controversial topics in Wikipedia. Global Wikipedia: International and cross-cultural issues in online collaboration 25, 25–48.
- Spezzano, Francesca and Suyehira, Kelsey and Gundala, Laxmi Amulya. 2019. Detecting pages to protect in Wikipedia across multiple languages. Social Network Analysis and Mining 9, 1, 1–16.
- Shubber, Kadhim. 2021. Russia caught editing Wikipedia entry about MH17. https://www.wired.co.uk/article/russia-edits-mh17-wikipedia-article
- Aragón, Pablo and Sáez-Trumper, Diego. 2021. A preliminary approach to knowledge integrity risk assessment in Wikipedia projects. MIS2’21: Misinformation and Misbehavior Mining on the Web Workshop held in conjunction with KDD 2021