Research:Recommending links to increase visibility of articles/Supporting entity insertion

Links are a fundamental part of Wikipedia, turning isolated pieces of knowledge into a network of information that is much richer than the sum of its parts. However, inserting a new link into the network is not trivial: it requires not only identifying the corresponding source and target articles but also reading the source article to find a suitable position at which to integrate the link into the text.

To support editors in the latter task, this project develops a multilingual model for entity insertion. The task is motivated by the use case of increasing the visibility of specific articles in the network, such as adding incoming links to orphan articles (i.e., de-orphanization).

This work builds heavily upon (and extends) previous efforts to support anchor text insertion (see Research:Improving_link_coverage/Supporting_anchor_text_insertion).

Motivation


General


Adding new knowledge to Wikipedia not only requires creation of new content but also integrating it into the existing knowledge structure. In fact, editors have developed a dedicated guideline to “build the web” in English Wikipedia’s manual of style. When adding a new link, we need to, first, identify two entities that should be linked and, second, also identify suitable text in the source article where the link will be added. While there are a wide variety of models and tools to improve linking, existing approaches address this problem insufficiently.


 
Micro and macro aggregates of entity insertion types over 105 Wikipedia language versions considered in this study.

On the one hand, link recommendation aims to infer which nodes should be linked in a network[1]. Features derived from, e.g., reader navigation make it possible to suggest new useful links between articles[2]. However, these models focus exclusively on the network structure and ignore the problem that links need to be embedded in the text of the article.

On the other hand, entity-linking approaches consider the existing text and try to identify the most probable link target for specific tokens or substrings in the text (anchor). This approach is used in many existing tools for Wikipedia such as the add-a-link model[3].

 
CCDF of the number of candidate sentences (N) in a Wikipedia article (log x-axis).

Challenges. Entity insertion is not only an interesting and challenging language understanding task, but it is also the most common scenario faced by editors when adding links in practice. In fact, we find that for 60-70% of all the links added to Wikipedia, none of the existing text is suitable to insert the corresponding entities, and new text needs to be added along with the entity by the editor.

We also find that entity insertion is associated with a high cognitive load, as the task requires, on average, an editor to select the most suitable sentence from a pool of ∼100 candidate sentences in a page.

Use-case: De-orphanization


In our work on orphan articles[4], we found a surprisingly large number of orphan articles in Wikipedia (~9M articles across all Wikipedia language versions), which are de facto invisible to readers navigating Wikipedia. We described a promising approach, based on link translation, to identify candidate links that increase the visibility of orphan articles (de-orphanization). While this gives us a source and target article for the candidate link, a remaining challenge is where to insert the specific link in the text of the source article. This step is crucial to make the link recommendations for de-orphanization more actionable for editors.

Methods


Problem description


We consider the problem of entity insertion in Wikipedia: Given a source and target article, the goal of entity insertion is to identify the most suitable text span in the source article for inserting a link to the target article. Specifically, we operationalize the task of entity insertion as a ranking problem, where the goal is to rank all the candidate text spans in the source article by how related they are to the target article.
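The ranking formulation above can be illustrated with a minimal sketch. The scoring function here is a deliberately simple token-overlap placeholder, not the model described on this page; the names `token_overlap_score` and `rank_candidates` and the example sentences are hypothetical and exist only to show the input and output shape of the task.

```python
import re
from typing import Callable, List, Tuple


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def token_overlap_score(span: str, target_text: str) -> float:
    """Placeholder relevance score: fraction of target tokens found in the span."""
    target_tokens = tokenize(target_text)
    if not target_tokens:
        return 0.0
    return len(tokenize(span) & target_tokens) / len(target_tokens)


def rank_candidates(
    candidate_spans: List[str],
    target_text: str,
    score_fn: Callable[[str, str], float] = token_overlap_score,
) -> List[Tuple[int, float]]:
    """Return (span index, score) pairs sorted from most to least relevant."""
    scored = [(i, score_fn(span, target_text)) for i, span in enumerate(candidate_spans)]
    # sorted() is stable, so tied spans keep their original document order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical source-article spans and target description.
spans = [
    "The city lies on a river.",
    "It hosts a large annual jazz festival.",
    "The old town attracts many visitors.",
]
ranking = rank_candidates(spans, "jazz festival")
best_index, best_score = ranking[0]
```

Any scorer with the same signature can be swapped in for `score_fn`, which is where a learned relevance model would plug in.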

 
Data processing pipeline. We obtain the added links by taking the set difference of the links existing in consecutive months. For each added link, we scan all versions in the full revision history between the two snapshots to identify the article version in which the link was added, and compute the difference between the before and after versions to extract the exact entity insertion scenario.

Data


We constructed a new multilingual dataset for the entity insertion task in Wikipedia. The dataset consists of all the links from all the Wikipedia pages and each link's surrounding context and additional meta-data (such as page titles, QIDs, and lead paragraphs). Overall, the dataset contains 958M links from 49M pages in 105 languages. The largest language is English (en), with 166.7M links from 6.7M pages, and the smallest language is Xhosa (xh), with 2.8K links from 1.6K pages. The data processing was done in two steps. An overview of our data processing pipeline is presented in the figure on the right.

Existing links. We first extracted all the links present at the timestamp of 2023-10-01. We extracted the content of all articles from their HTML version using the corresponding snapshot of the Enterprise HTML dumps. We removed articles without a lead paragraph and without a QID. For each article, we considered all internal links in the main article body (ignoring figures, tables, notes, and captions) together with their surrounding context. We removed all the links where either the source or the target article was one of the removed articles, and we dropped all self-links.
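The filtering steps above can be sketched as follows. This is an illustration under assumptions, not the project's pipeline: the `Article` and `Link` record types are hypothetical, and the sketch assumes an article is dropped when either its lead paragraph or its QID is missing.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Article:
    title: str
    qid: Optional[str]      # Wikidata identifier, None if missing
    lead_paragraph: str     # empty string if the article has no lead


@dataclass
class Link:
    source: str             # title of the article containing the link
    target: str             # title of the linked article
    context: str            # text surrounding the link


def filter_links(articles: Dict[str, Article], links: List[Link]) -> List[Link]:
    """Keep links whose endpoints are both retained articles and are not self-links."""
    # Assumption: an article is removed if it lacks a QID or a lead paragraph.
    kept = {
        title
        for title, article in articles.items()
        if article.qid and article.lead_paragraph.strip()
    }
    return [
        link
        for link in links
        if link.source in kept and link.target in kept and link.source != link.target
    ]


# Hypothetical toy data.
articles = {
    "A": Article("A", "Q1", "Lead of A."),
    "B": Article("B", "Q2", "Lead of B."),
    "C": Article("C", None, "Lead of C."),   # dropped: no QID
}
links = [
    Link("A", "B", "... B ..."),
    Link("A", "C", "... C ..."),             # dropped: target was removed
    Link("B", "B", "... B ..."),             # dropped: self-link
]
filtered = filter_links(articles, links)
```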

Added links. Next, we found all the links added between 2023-10-01 and 2023-11-01. We extracted the set of added links by comparing the existing links in snapshots from two consecutive months: we applied the same procedure as above to each snapshot and took the difference of the two sets to identify the links that exist in the second month but are missing from the first. To identify the edit in which a link was inserted, we went through the revision history of the respective article, available in the Wikimedia XML dumps. Once we had identified the pair of revision IDs before and after the link was inserted, we downloaded the corresponding HTML versions directly. By comparing the two HTML versions, we could identify the changes the editor made when inserting the link, which we grouped into the following categories:

  • text_present: links that fall into this category were added by hyperlinking an existing mention;
  • missing_mention: links were added by taking an already existing sentence and adding the mention for a new entity (and potentially some additional context to the sentence);
  • missing_sentence: an extension of the previous category in which the link was added by writing new text and hyperlinking part of it, either a single new sentence or a span of multiple sentences;
  • missing_section: the links were added in a section that did not exist in the previous version of the article.
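The two steps of the added-link extraction (set difference between snapshots, then a scan of the revision history) can be sketched as below. This is a simplified illustration: the function names and the in-memory representation of revisions as `(revision id, set of link targets)` pairs are assumptions, and the sketch returns the most recent insertion if a link was added more than once.

```python
from typing import List, Optional, Set, Tuple

LinkPair = Tuple[str, str]  # (source title, target title)


def added_links(old_snapshot: Set[LinkPair], new_snapshot: Set[LinkPair]) -> Set[LinkPair]:
    """Links present in the newer snapshot but absent from the older one."""
    return new_snapshot - old_snapshot


def find_insertion_revisions(
    revisions: List[Tuple[int, Set[str]]], target: str
) -> Optional[Tuple[int, int]]:
    """Scan chronologically ordered revisions of one source article.

    Returns the (before, after) revision ids of the most recent edit in which
    a link to `target` appeared, or None if no such edit exists.
    """
    result = None
    for (prev_id, prev_links), (cur_id, cur_links) in zip(revisions, revisions[1:]):
        if target not in prev_links and target in cur_links:
            result = (prev_id, cur_id)
    return result


# Hypothetical toy data.
october = {("A", "B")}
november = {("A", "B"), ("A", "C")}
new_links = added_links(october, november)

history_of_a = [(10, {"B"}), (11, {"B", "C"}), (12, {"B"}), (13, {"B", "C"})]
before_after = find_insertion_revisions(history_of_a, "C")
```

The before and after revisions found this way are the pair whose HTML versions would then be compared to assign one of the categories listed above.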

Model

 
Architectural overview of LocEI. The target entity and each candidate text span of the source entity are concatenated and encoded jointly using a transformer encoder. The relevance scores of candidate text spans are computed using an MLP trained via a list-wise ranking objective.

Our model (LocEI) is composed of a transformer-based encoder that jointly encodes the target entity as well as the candidate spans in the source entity, and a multilayer perceptron (MLP) trained via a list-wise objective capable of ranking candidates based on their relevance to the target. We introduce a novel data augmentation strategy that closely mimics real-world entity insertion scenarios, a knowledge injection module to incorporate external knowledge about the entities, and the multilingual variant of our model. For details, please refer to our paper, which will be posted on arXiv soon. The architectural overview of our model is presented in the figure on the right.
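To make the idea of a list-wise objective concrete, here is a minimal sketch. The page states only that the MLP is trained with a list-wise ranking objective; the softmax cross-entropy form used here is one common choice and is an assumption, as are the function names and the `[SEP]` separator string.

```python
import math
from typing import List


def build_input(target_text: str, candidate_span: str, sep: str = " [SEP] ") -> str:
    """Concatenate the target and one candidate span for joint encoding.

    The separator string is a hypothetical placeholder.
    """
    return target_text + sep + candidate_span


def listwise_softmax_loss(scores: List[float], gold_index: int) -> float:
    """Negative log-probability of the gold span under a softmax over all candidates.

    `scores` holds one relevance score per candidate span of the same article.
    """
    # Subtract the maximum for numerical stability (log-sum-exp trick).
    max_score = max(scores)
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_norm - scores[gold_index]


# Hypothetical scores for three candidate spans; the gold span is at index 1.
loss_good = listwise_softmax_loss([0.1, 2.5, -0.3], gold_index=1)
loss_bad = listwise_softmax_loss([2.5, 0.1, -0.3], gold_index=1)
```

Because the loss normalizes over the whole candidate list, raising the score of the gold span relative to the other spans in the same article lowers the loss, which matches the ranking formulation of the task.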

Results


Characterizing entity insertion


Here, we empirically characterize how new links are inserted by looking at all added links and counting the occurrence of the different categories.

 
Entity insertion strategies for links added from 2023-10 to 2023-11 for a subset of 20 languages. The x-axis shows the language code and the number of links added in each language.

For only 27% of added links was the mention for the anchor of the link already present in the text. For the majority of added links, some of the text needed to be changed as well (adding the mention, adding a sentence, or adding a larger span of text).

This means that, for the majority of the added links, simple string matching between the text and the page title of the target article is unlikely to succeed. Moreover, as we show in the detailed results below, even sophisticated entity linking and retrieval approaches struggle with the entity insertion problem, because in the majority of cases the text to be linked is absent at the time the new entity is inserted.

For all the following sections, we report results using Hits@1 and MRR as the performance metrics aggregated over all languages (macro-average) in tables on the right.
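For reference, the two metrics and the macro-average can be computed as in the sketch below. The input format (the 1-based rank of the correct span for each test example, grouped by language) and the toy numbers are assumptions made for illustration; they are not results from this study.

```python
from typing import Dict, List


def hits_at_1(gold_ranks: List[int]) -> float:
    """Fraction of examples where the correct span is ranked first."""
    return sum(1 for rank in gold_ranks if rank == 1) / len(gold_ranks)


def mrr(gold_ranks: List[int]) -> float:
    """Mean reciprocal rank of the correct span (ranks are 1-based)."""
    return sum(1.0 / rank for rank in gold_ranks) / len(gold_ranks)


def macro_average(per_language: Dict[str, float]) -> float:
    """Unweighted mean over languages, so each language counts equally."""
    return sum(per_language.values()) / len(per_language)


# Hypothetical ranks of the correct span, per language.
ranks_by_language = {"en": [1, 2, 1, 4], "cy": [1, 1]}

macro_hits = macro_average({lang: hits_at_1(r) for lang, r in ranks_by_language.items()})
macro_mrr = macro_average({lang: mrr(r) for lang, r in ranks_by_language.items()})
```

Macro-averaging gives a small language the same weight as a large one, which is why it is a natural choice when low-resource languages matter as much as English.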

Multilingual entity insertion

 
Entity insertion performance obtained by macro-averaging over the 20 Wikipedia language versions used for training the benchmarked methods. xLocEI trains a single model jointly on all 20 languages, whereas the other methods train a separate model for each language. The categorization of entity insertion types into ‘Overall’, ‘Missing’, and ‘Present’ is discussed in the Data section. Note that EntQA and GET work only for English, whereas PRP-Allpair was only used for zero-shot analysis and English.

We see that xLocEI outperforms all other models, with statistical significance, in all cases considered. Specifically, xLocEI substantially outperforms the baseline models, showing the advantage of our modeling framework and the introduced training novelties. Furthermore, xLocEI improves upon the baseline models most in the missing scenario, where the mention to be linked is not yet present in the text. This shows the usefulness of our model when going beyond entity linking.

Most importantly, xLocEI consistently yields better scores than the language-specific LocEI models, demonstrating that the multilingual model is capable of transferring knowledge across languages to improve overall performance. In fact, by comparing the performance for the individual languages (results in the paper, omitted from this metapage for brevity), we see that the improvement of xLocEI over LocEI is larger in low-resource languages (languages with less training data) such as Afrikaans (af), Welsh (cy), and Uzbek (uz).

Zero-shot entity insertion


Resources


References

  1. Ghasemian, A., Hosseinmardi, H., Galstyan, A., Airoldi, E. M., & Clauset, A. (2020). Stacking models for nearly optimal link prediction in complex networks. Proc. Natl. Acad. Sci. U. S. A., 117(38), 23393–23400. ISSN 0027-8424. https://doi.org/10.1073/pnas.1914950117
  2. Paranjape, A., West, R., Zia, L., & Leskovec, J. (2016). Improving Website Hyperlink Structure Using Server Logs. Proc Int Conf Web Search Data Min 2016, 615–624. https://doi.org/10.1145/2835776.2835832
  3. Gerlach, M., Miller, M., Ho, R., Harlan, K., & Difallah, D. (2021). Multilingual Entity Linking System for Wikipedia with a Machine-in-the-Loop Approach. CIKM ’21, 3818–3827. https://doi.org/10.1145/3459637.3481939
  4. Arora, A., West, R., & Gerlach, M. (2023). Orphan Articles: The Dark Matter of Wikipedia. In arXiv [cs.SI]. http://arxiv.org/abs/2306.03940