Research:Increasing article coverage

This page in a nutshell: The different Wikipedia language editions vary dramatically in how comprehensive they are. As a result, most language editions contain only a small fraction of the sum of information that exists across all Wikipedias. In this research, we present an approach to filling gaps in article coverage across different Wikipedia editions. Our main contribution is an end-to-end system for recommending articles for creation that exist in one language but are missing in another. The system involves identifying missing articles, ranking the missing articles according to their importance, and recommending important missing articles to editors based on their interests. We empirically validate our models in a controlled experiment. We find that personalizing recommendations increases editor engagement by a factor of two. Moreover, recommending articles increases their chance of being created by a factor of 3.2. Finally, articles created as a result of our recommendations are of comparable quality to organically created articles. Overall, our system leads to more engaged editors and faster growth of Wikipedia with no effect on its quality. This research is described in detail at https://arxiv.org/abs/1604.03235

Contact

Jure Leskovec, Stanford University

Robert West, Stanford University

Ellery Wulczyn, Wikimedia Foundation

Leila Zia, Wikimedia Foundation

This page documents a completed research project.

Proposal

We are interested in identifying the gaps of knowledge in Wikipedia. The results of this research can help the communities and the Wikimedia Foundation focus their efforts on editor acquisition (through edit-a-thons, for example) and on the task recommendation systems currently under development.

The most common approach to identifying knowledge that is missing from Wikipedia but probably belongs there is to use red-link data together with the corresponding pageviews. The topical coverage of Wikipedia has also been studied formally. For example, it has been shown that Wikipedia’s coverage is driven by the interests of its editors, so its completeness varies with the subject area of the articles.^[1]

We propose two ways of formalizing the goal of increasing Wikipedia article coverage:

Make Wikipedia more complete with respect to the set of articles it contains.
Make articles more complete with respect to the topics they discuss (some previous work in ^[2]).

To consider these problems, we envision the following route:

Identify articles or categories that are missing or underrepresented in a language version.
Rank them in order of importance.
Find and incentivize the right editors to write the articles in question.

To achieve this, the following steps are currently envisioned:

A missing article is represented as a pair (T, L), where T is a language-independent topic and L is a language, indicating that the Wikipedia in language L doesn't contain an article about topic T. How may we represent T? We could use the "translation graph" formed by the interlingual links (connecting different language versions of the same topic) and extract cliques (dense clusters). Each clique then represents a language-independent topic (e.g., the clique {en:Beer, de:Bier, nl:Bier, pt:Cerveja, pl:Piwo, ...} represents beer). Now, if a clique T doesn't contain an article in language L, that indicates that the Wikipedia in language L has no article about the respective topic T. (Extracting language clusters has been addressed before;^[3]^[4]^[5] to eliminate false positives we could perform additional checks, such as whether there is an article in language L whose name is a translation of many of the article names in clique T.)
After step (1) we have an unordered set of missing-article candidates (T, L), and ideally we would like all of these articles to be written. However, since the set will contain many more articles than the (shrinking) set of Wikipedia editors can handle, we need to prioritize. The task is to rank all candidates in order of importance. What are good indicators of the importance of writing a missing article?
- Volume of search queries for the missing articles. This is where we need Wikipedia site search logs. To identify whether a query is about a missing article, we could use translations from a dictionary; for named entities, translation won't even be necessary in most cases (e.g., the name of a Bosnian village will be the same in English, Bosnian, Croatian, Slovene, etc.)
- Number of red links.
- Number of visits to related articles (according to some external resource, such as Freebase, since we cannot measure the relatedness with a missing article by only using Wikipedia).
- Volume of visits to pages that contain a name of the missing article. Here we could use the session logs.
- A principled approach could be to collect ground-truth data from humans and then train a machine-learning model that combines the above features to produce a ranking.
Once a missing article is identified, task recommendation or targeted editor acquisition can be used to match potential authors to missing articles. The following is one set of similarity-based steps that can be used to identify such authors. For example, suppose we want to have the Hungarian version hu:T1 of topic T1 written. Author A has edited the Hungarian version hu:T2 of topic T2. The English versions en:T1 and en:T2 both exist and are very similar (according to one of many article similarity measures). Then author A would be a good candidate for writing hu:T1.

Introduction

The more than 80 language editions of Wikipedia together represent the largest encyclopedia in human history. However, the different language editions vary dramatically in how comprehensive they are. This research aims to identify important content that is available in one language edition but missing from another, as well as the editors who would be interested in translating such articles or creating them from scratch in the destination language.

Methodology

We divide the problem into four parts and address each separately.

Finding Missing Articles

We use two sources for identifying missing articles: Wikidata, which maps language-independent entities to Wikipedia articles in different languages, and Wikipedia's inter-language links (ILLs). We augment the Wikidata mapping with the ILLs by building a graph G in which the nodes correspond to either Wikidata items or articles and the edges are Wikidata links, ILLs and MediaWiki redirects. We say article T is missing in language S if and only if none of the Wikidata items in the same connected component as T map to an article in S. Including ILLs reduces the number of entities that are falsely declared to be missing.
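
The connected-component test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node encoding (item identifiers as strings, articles as `(language, title)` tuples), the function names, and the toy data are all hypothetical, and a union-find structure is just one convenient way to compute components.

```python
def find(parent, node):
    """Return the representative of node's component (with path halving)."""
    parent.setdefault(node, node)
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def union(parent, a, b):
    """Merge the components that contain a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_a] = root_b


def is_missing(article, language, wikidata_links, other_links):
    """True if no Wikidata item in article's component maps to `language`.

    wikidata_links: list of (item_id, (lang, title)) pairs.
    other_links: list of ((lang, title), (lang, title)) pairs standing in
                 for inter-language links and redirects.
    """
    parent = {}
    for a, b in list(wikidata_links) + list(other_links):
        union(parent, a, b)
    root = find(parent, article)
    return not any(
        lang == language and find(parent, item) == root
        for item, (lang, _title) in wikidata_links
    )


# Hypothetical toy data: two separate items that describe the same topic.
wikidata_links = [
    ("item:A", ("en", "Beer")),
    ("item:A", ("de", "Bier")),
    ("item:B", ("fr", "Bière")),
]
inter_language_links = [(("de", "Bier"), ("fr", "Bière"))]

# Without ILLs the French article looks missing; with ILLs it does not.
print(is_missing(("en", "Beer"), "fr", wikidata_links, []))                    # True
print(is_missing(("en", "Beer"), "fr", wikidata_links, inter_language_links))  # False
```

In the toy data, the inter-language link joins the two items into one component, which is how including ILLs avoids a false "missing" verdict.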

Ranking Missing Articles

For many languages, far more articles could be created than the current number of volunteer editors can write. On top of that, not all missing articles in a destination language are relevant or desired in that language edition. It is therefore important to rank the missing articles in the destination language, so that the editors' effort can be directed at the most crucial missing articles first. We currently consider two approaches.

Pageviews as a proxy for importance

We build a linear-regression model for estimating the number of pageviews for an article in the destination language based on features of the corresponding article in the source language. The model is trained on articles that exist in both the source and the destination languages. We run the model on the articles in the source that are missing in the destination to estimate the number of pageviews these as-yet non-existent articles would receive in the destination language if they were to exist. As input variables, the model uses the number of pageviews of the missing article in the source language, its length, and the topics expressed in the article text of the missing article in the source language (topics are computed via latent Dirichlet allocation [LDA]). Topical features matter because different languages put different levels of emphasis on different topics (e.g., articles on Italian singers are more relevant for the Italian than the Chinese Wikipedia).
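
As a rough illustration of the train-then-predict workflow, the sketch below fits a single-feature ordinary least squares model. It is a deliberate simplification, not the model described above: it uses only source-language pageviews (no article length or LDA topic features), the log transform is our assumption, and the training numbers are made up.

```python
import math


def fit_simple_ols(xs, ys):
    """Closed-form least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up training pairs for articles that exist in both languages:
# (pageviews in the source language, pageviews in the destination language).
training_pairs = [(1000, 100), (10000, 1000), (100000, 10000)]
xs = [math.log10(source) for source, _ in training_pairs]
ys = [math.log10(destination) for _, destination in training_pairs]
slope, intercept = fit_simple_ols(xs, ys)


def predict_destination_views(source_views):
    """Estimate destination pageviews for an article missing there."""
    return 10 ** (intercept + slope * math.log10(source_views))


print(predict_destination_views(50000))
```

A fuller version would add the remaining features as extra regression columns and fit them with a numerical library.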

Notability as a proxy for importance

Notability is one of the most important and challenging measures of importance considered by the editors. Given a measurable definition of notability, we can build prediction models that can help assess whether a not-yet-existent article in the destination language would be considered notable by the editors of that language.

Please share your thoughts on the talk page about how we can define notability.

Computing Editor-Article Affinity

For a given editor E and missing article A, we are interested in estimating the distance between A and E's topical interests (the affinity of editor E for article A). We first embed documents in S in a topic vector space using LDA. We compute the affinity editor E has for article A as a function of the topic vectors for entities in E's edit history that exist in S and the topic vector for A. More specifically, an editor's interest vector is the normalized sum of the document vectors of the last 15 articles they have edited in the source language that have a corresponding article in the target language, weighted by the log of the number of bytes they have added to each article. The affinity that E has for A is the cosine distance between E's interest vector and A's topic vector.
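
The interest-vector computation can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, topic vectors are assumed to be precomputed lists of floats, L2 normalization is assumed (the norm is not specified above), and edits that added one byte or fewer get zero weight. The sketch returns cosine similarity, so larger values mean a closer match; cosine distance is one minus this value.

```python
import math


def interest_vector(edit_history, max_articles=15):
    """Normalized, log-weighted sum of the topic vectors of recent edits.

    edit_history: non-empty list of (topic_vector, bytes_added) pairs,
                  most recent first.
    """
    recent = edit_history[:max_articles]
    total = [0.0] * len(recent[0][0])
    for topic_vector, bytes_added in recent:
        # Weight each article by the log of the bytes the editor added.
        weight = math.log(bytes_added) if bytes_added > 1 else 0.0
        for i, value in enumerate(topic_vector):
            total[i] += weight * value
    norm = math.sqrt(sum(v * v for v in total))
    return [v / norm for v in total] if norm > 0 else total


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


# Toy two-topic example: the editor has added far more text on topic 0.
history = [([1.0, 0.0], 5000), ([0.0, 1.0], 20)]
editor = interest_vector(history)
print(cosine_similarity(editor, [0.9, 0.1]))  # article mostly about topic 0
print(cosine_similarity(editor, [0.1, 0.9]))  # article mostly about topic 1
```

In the toy example, the first candidate article scores higher because the editor's larger contribution was on topic 0.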

Matching Translators with Missing Articles

For each editor, we are interested in finding K important articles for which the editor has high affinity while ensuring that every article is recommended to only one editor. As a preprocessing step, we remove disambiguation pages and very short articles from the set of missing articles. Then we take the N missing articles with the highest estimated future number of pageviews in the destination and distribute these among the editors in a way that maximizes the total estimated affinity that editors have for their recommendations. This problem can be formulated as an integer max-flow problem that can be solved using linear programming techniques. (One can easily show that the problem is a min-cost flow problem with integral demands, in which case the relaxed linear program is guaranteed to have optimal integer solutions.)
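
To make the flow formulation concrete, here is a small sketch that builds the network (source to editors with capacity K, editors to articles with capacity 1 and cost equal to the negative affinity, articles to sink with capacity 1) and solves it. Note the substitution: instead of a linear-programming solver, the sketch uses successive shortest augmenting paths with Bellman-Ford, which is adequate for toy inputs only. The function name and the affinity matrix are hypothetical, and affinities are assumed to be non-negative and defined for every editor-article pair.

```python
def assign_recommendations(affinity, k):
    """Give each editor up to k articles, each article to one editor,
    maximizing total affinity. affinity[e][a] is editor e's affinity for
    article a. Returns {editor index: [article indices]}."""
    n_editors, n_articles = len(affinity), len(affinity[0])
    source, sink = 0, 1 + n_editors + n_articles
    n_nodes = sink + 1
    graph = [[] for _ in range(n_nodes)]  # node -> indices into edges
    edges = []  # [to, residual capacity, cost]; edge i ^ 1 reverses edge i

    def add_edge(u, v, capacity, cost):
        graph[u].append(len(edges))
        edges.append([v, capacity, cost])
        graph[v].append(len(edges))
        edges.append([u, 0, -cost])

    for e in range(n_editors):
        add_edge(source, 1 + e, k, 0.0)
        for a in range(n_articles):
            add_edge(1 + e, 1 + n_editors + a, 1, -affinity[e][a])
    for a in range(n_articles):
        add_edge(1 + n_editors + a, sink, 1, 0.0)

    inf = float("inf")
    while True:
        # Bellman-Ford: cheapest residual path (costs can be negative).
        dist = [inf] * n_nodes
        prev_edge = [-1] * n_nodes
        dist[source] = 0.0
        for _ in range(n_nodes - 1):
            updated = False
            for u in range(n_nodes):
                if dist[u] == inf:
                    continue
                for idx in graph[u]:
                    v, capacity, cost = edges[idx]
                    if capacity > 0 and dist[u] + cost < dist[v] - 1e-12:
                        dist[v] = dist[u] + cost
                        prev_edge[v] = idx
                        updated = True
            if not updated:
                break
        if dist[sink] == inf:
            break  # no augmenting path left: the flow is maximal
        # Every path ends in a capacity-1 article edge, so push one unit.
        v = sink
        while v != source:
            idx = prev_edge[v]
            edges[idx][1] -= 1
            edges[idx ^ 1][1] += 1
            v = edges[idx ^ 1][0]

    # A saturated forward editor -> article edge is a recommendation.
    return {
        e: [edges[idx][0] - 1 - n_editors
            for idx in graph[1 + e]
            if idx % 2 == 0 and edges[idx][1] == 0]
        for e in range(n_editors)
    }


affinity = [[0.9, 0.8, 0.1],
            [0.85, 0.2, 0.3]]
print(assign_recommendations(affinity, k=1))  # {0: [1], 1: [0]}
```

In the toy matrix, a greedy choice would give article 0 to editor 0 (total 0.9 + 0.3 = 1.2), whereas the flow solution gives article 1 to editor 0 and article 0 to editor 1 (total 1.65).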

Evaluation

How well the recommendation algorithm described above works in practice can be assessed by testing it with editors, i.e., selecting one or more source and destination languages, identifying missing content in the destination languages, computing editors' affinity for each missing article in the destination language, and matching important missing articles and editors. We are currently planning to do this in the following sequence and with a subset of editors in each group: 1) internal Wikimedia Foundation trial, 2) French Wikipedia, 3) Spanish Wikipedia.

Identifying potential contributors

We determine which editors are suitable for receiving recommendations for translating from the source to the target language via two methods. The first is scraping the target users' User pages for a Babel template that indicates that they speak the source language. The second is selecting target users who have an account with the same username in the source language, have made at least one edit in both the source and target Wikipedias, have made at least one edit in either language within the last year and have matching email addresses for the two accounts.
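
The first method, reading language proficiency from a Babel template, might look roughly like the sketch below. It assumes the user page wikitext is already available as a string and that proficiency is declared in the common `{{Babel|fr|en-3}}` or `{{#babel:fr|en-3}}` forms; the regular expressions, function names, and the treatment of a bare language code as native are our assumptions, and real user pages use many more variants (individual userboxes, localized template names) that this does not handle.

```python
import re

# Matches {{Babel|...}} and {{#babel:...}} and captures the parameter list.
BABEL_RE = re.compile(r"\{\{\s*(?:#babel\s*:|babel\s*\|)([^{}]*)\}\}",
                      re.IGNORECASE)
# A parameter is a language code with an optional -0..-5 or -N level suffix.
PARAM_RE = re.compile(r"([A-Za-z-]+?)(?:-([0-5Nn]))?")
LEVEL_ORDER = ["0", "1", "2", "3", "4", "5", "N"]


def babel_levels(wikitext):
    """Return {language code: level} declared in Babel templates."""
    levels = {}
    for template in BABEL_RE.finditer(wikitext):
        for part in template.group(1).split("|"):
            match = PARAM_RE.fullmatch(part.strip())
            if match:  # skips named parameters such as align=left
                language, level = match.groups()
                # Assumption: a bare code (no suffix) means native speaker.
                levels[language.lower()] = (level or "N").upper()
    return levels


def speaks_language(wikitext, language, min_level="1"):
    """True if the page declares `language` at min_level or above."""
    level = babel_levels(wikitext).get(language)
    if level is None:
        return False
    return LEVEL_ORDER.index(level) >= LEVEL_ORDER.index(min_level)


page = "Bonjour ! {{Babel|fr|en-3|es-0}}"
print(babel_levels(page))           # {'fr': 'N', 'en': '3', 'es': '0'}
print(speaks_language(page, "en"))  # True
print(speaks_language(page, "es"))  # False
```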

Internal WMF test

The goal of this stage is to receive internal feedback on the recommendation algorithm and identify bugs and/or must-have features before introducing the recommendations to other editors. 11 staff members who speak French or Spanish have volunteered for this stage.

French Wikipedia Test: Template of the recommendation email

Re: Help improve the completeness of French Wikipedia
Hello ___,
The Wikimedia Foundation's research team (Wikimedia Research) is currently working on identifying popular and important articles [1] in some language editions of Wikipedia that do not yet exist on French Wikipedia. The following five articles exist in the English edition of Wikipedia and are considered important for the project's other languages. Given your history of contributions to Wikipedia, we think you are an excellent candidate to contribute to these articles. Starting one of these articles would be a considerable first step toward expanding the knowledge available in French. [2]
(LIST OF 5 RECOMMENDATIONS)
Thank you in advance for your help. [3] [4]
Research Team, Wikimedia Foundation, 149 New Montgomery Street, 6th Floor, San Francisco, CA, 94105, 415.839.6885 (Office).
[1] We identify important and popular articles using an algorithm. This selection of articles may be personalized or random. You can learn more about personalization and the methods used to find important articles at this address.
[2] The links point to Wikipedia's translation tool (ContentTranslation Tool). This tool is being developed by the Foundation's Language Engineering team (currently in beta for some languages). Learn more: https://www.mediawiki.org/wiki/Content_translation.
[3] If you would like more information about this research project, you can read this page (in English) and discuss it with us on its talk page (preferably in English, although we will certainly find a translator if you write to us in French :).
[4] Your opinion matters to us. Share your impressions by email at recommender-feedback@wikimedia.org.
If you no longer wish to receive email from Wikimedia Research, please send an email with the subject "unsubscribe" to recommender-feedback@wikimedia.org.

French Wikipedia Test: Lessons Learned (Draft)

We will continue updating this section as more lessons become available; the items below are not final. Here is what we have learned so far.

Editor Selection

Out of the 12 thousand users we contacted, approximately 0.25% replied saying that they do not have the requisite proficiency in either English or French to attempt a translation. In order to reduce this number, we have modified our selection criteria. This is our current proposal; feel free to contribute any suggestions.

The editor must have made an edit of any size in either the source or the target language in the last 12 months. This requirement was included in the frwiki test as well.
The editor must have made edits of at least 200 bytes in both the source and the target language. Previously, the condition was that the editor made an edit of any size in both the source and the target language and an edit of at least 100 bytes in either of the two. There are, however, editors who will make minor edits such as including image links in Wikipedias whose language they are not proficient in. These editors should be excluded.
The editor has not indicated that their proficiency in either language is less than intermediate in a Babel template.

Finally, to reduce confusion for editors who satisfy the above conditions but are not able to read the target language, we will add a section to the start of the email in the source language, explaining the situation and apologizing for the mistake.
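
The revised selection criteria can be expressed as a simple predicate. The sketch below is illustrative only: the `EditorActivity` record and its field names are hypothetical, "edits of at least 200 bytes in both languages" is read as at least one such edit in each language, 12 months is approximated as 365 days, and Babel levels 0 and 1 are taken as "less than intermediate" (level 2 being intermediate).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

BELOW_INTERMEDIATE = {"0", "1"}  # Babel levels below 2 (intermediate)


@dataclass
class EditorActivity:
    last_edit: datetime            # most recent edit in either language
    largest_edit_source: int       # bytes, largest edit in source language
    largest_edit_target: int       # bytes, largest edit in target language
    babel_source: Optional[str] = None  # declared Babel level, if any
    babel_target: Optional[str] = None


def is_eligible(editor, now, min_bytes=200, window_days=365):
    """Apply the three selection criteria to one editor."""
    recently_active = now - editor.last_edit <= timedelta(days=window_days)
    substantial_edits = (editor.largest_edit_source >= min_bytes
                         and editor.largest_edit_target >= min_bytes)
    declared_low = (editor.babel_source in BELOW_INTERMEDIATE
                    or editor.babel_target in BELOW_INTERMEDIATE)
    return recently_active and substantial_edits and not declared_low


now = datetime(2015, 9, 1)
image_fixer = EditorActivity(datetime(2015, 8, 1), 1500, 40)
translator = EditorActivity(datetime(2015, 6, 15), 900, 650, babel_target="3")
print(is_eligible(image_fixer, now))  # False: only minor edits in target
print(is_eligible(translator, now))   # True
```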

Personalization

Personalizing the recommendations to an editor’s interests requires a substantial amount of additional "algorithmic" work compared to just recommending articles that are predicted to be widely read. To see whether this additional work is justified (for when we implement the model within the CX tool), some editors did not receive personalized recommendations. We quickly saw that personalized recommendations lead to significantly higher engagement. Nearly all feedback about the poor quality of the recommendations came from editors who did not receive personalized recommendations. Going forward we will personalize all messages.

Article Selection

Our article selection methods are not yet advanced enough to exclude all articles that would not be of interest or encyclopedic value in the target language. This is a hard task, and one on which humans often disagree. We will:
- Make this explicit in the body of the recommendation email and highlight that it is up to the editors to make the final call on whether the article should exist in the target language.
- Raise the importance threshold for including articles in the recommendations, which was probably too low for the frwiki test.
- Investigate new ways of improving the algorithm's assessment of article importance.
The algorithm did not have a condition to filter out articles that are of low quality in the source language. If the recommendation were meant to encourage editors to create an article from scratch, this would matter less than when the recommendation is about translation. For translation recommendations, we should filter low-quality pages out of the article set.

Disambiguation Pages

Disambiguation pages were excluded from the list of articles based on the "disambig" template. We are now also using all the disambiguation template variants to filter out disambiguation pages.

Results

The results are shared in detail in a paper that can be accessed at Growing Wikipedia Across Languages via Recommendation.

Research Terms
This formal research collaboration is based on a mutual agreement between the collaborators to respect Wikimedia user privacy and focus on research that can benefit the community of Wikimedia researchers, volunteers, and the WMF. To this end, the researchers who work with the private data have entered into a non-disclosure agreement as well as a memorandum of understanding.

Presentations

Below you can find the links to previous presentations about this research:

[2017-03-15] Presentation as part of CITRIS Research Exchange seminar series. (Slides, Video)
[2016-04] at WWW2016 Conference
[2015-08] The first results from this research were presented at Wikimania 2015. You can read the abstract of our presentation here.

References

[1] Halavais, Alexander, and Derek Lackaff. "An analysis of topical coverage of Wikipedia." Journal of Computer‐Mediated Communication 13.2 (2008): 429-440.
[2] Robert West, Doina Precup, and Joelle Pineau: Automatically Suggesting Topics for Augmenting Text Documents. In Proc. 19th ACM Conference on Information and Knowledge Management (CIKM'10), 2010.
[3] Gerard de Melo, Gerhard Weikum: Untangling the Cross-Lingual Link Structure of Wikipedia. In Proc. 48th Annual Meeting of the Association for Computational Linguistics (ACL’10), 2010.
[4] B. Hecht and D. Gergle. The tower of Babel meets web 2.0: User-generated content and its applications in a multilingual context. In Proc. CHI, pages 291–300, 2010.
[5] P. Bao, B. Hecht, S. Carton, M. Quaderi, M. Horn, and D. Gergle. Omnipedia: Bridging the Wikipedia language gap. In Proc. CHI, 2012.
