# Research:Improving link coverage/Supporting anchor text insertion

Contact
Robert West
Akhil Arora
Contact: Leila Zia

Information may be incomplete and change as the project progresses.

## Introduction

Hyperlinks are important for easy and effective navigation of any website. Yet the rapid rate at which websites evolve (thanks to the big data era) makes maintaining the quality and structure of hyperlinks a challenging task. This problem can be broken down further into two subtasks:

1. Given a set of webpages, identify pairs of pages that should be linked together.
2. Having identified a pair of pages to be linked, where in the source page should the link(s) be inserted?

The first task reduces to a link prediction problem, that is, identifying a list of candidate pages to be linked to a given page. This task was addressed in our previous work.

The second task, on the other hand, can be seen as performing entity insertion. Specifically, when the source page contains appropriate anchor texts for the new link, these anchor texts become the candidate positions for the new link; all that remains is to ask a human which anchor text is best suited. Things become more interesting when the source contains no anchor text for the new link; here it is far less clear where to insert the link. In essence, we have found a topic that is not yet, but should be, mentioned in the source, and the task is not simply to decide which existing anchor text to use, but rather where in the page to insert a new anchor text.

Here we develop an approach for mining human navigation traces to automatically find candidate positions for new links to be inserted in a webpage. Our intention is to demonstrate the effectiveness of our approach by evaluating it using Wikipedia server logs.

## Motivation

• Wikipedia has grown rapidly since its inception: from 163 articles in January 2001 to 42.9M articles in December 2018. Based on the statistics from 2018 [1], approximately 8,000 new articles were added to Wikipedia every day on average. This enormous growth has enriched Wikipedia's content in both quantity and variety. However, it also calls for continuous quality control to maintain, if not improve, the link structure. At this scale, that requires either a very large number of editors or powerful automatic methods that can aid editors in this task.
• In this work, our intention is to develop a method for estimating the probability distribution over all potential insertion positions. This will in turn be used to suggest potential link insertion positions by overlaying a heatmap on the text of the source page, thereby reducing the cognitive load of the editors. Specifically, instead of going through the entire page to find the best place to insert the new link, the editors have to process only a small fraction of the total content of the page. Note that a human in the loop is still useful to ensure that the new links are added at close-to-optimal positions, and the overall link structure is indeed of high quality.

## Methodology

We propose a data-driven approach that relies on the following intuitions for predicting potential positions of a new link from $s$ to $t$ in the source page $s$.

1. Given an indirect path $(s, m, \ldots, t)$, the new link from $s$ to $t$ should appear in the proximity of the clicked link to $m$ in the source page $s$, since $s$ is connected to $t$ via $m$ on the path (which hints at this being the case also in the user's mind).
2. The more frequently $m$ is the successor of $s$ on paths to $t$, the more peaked the position distribution should be around $m$.
3. The shorter the paths from $s$ to $t$ that have $m$ as the immediate successor of $s$, the more peaked the position distribution should be around $m$.
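
The three intuitions above can be illustrated with a minimal sketch. It is not the project's actual estimator; it only shows one way the intuitions could be combined. Everything in it is an assumption made for illustration: positions in the source page are integer indices (e.g. token offsets), `link_positions` maps each intermediate page $m$ to the positions of its link in $s$, each path contributes a weight of one over its number of hops, and the weights are spread around the link position with a Gaussian kernel of hypothetical width `bandwidth`.

```python
import math
from collections import defaultdict


def position_distribution(paths, link_positions, page_length, bandwidth=2.0):
    """Illustrative estimate of P(position) for a new link s -> t.

    paths          : iterable of tuples (s, m, ..., t), all sharing s and t
    link_positions : dict mapping m to a list of positions of its link in s
    page_length    : number of candidate positions in s (assumed > 0)
    bandwidth      : hypothetical kernel width controlling the smoothing
    """
    # Intuitions 2 and 3: every path through m adds weight, and shorter
    # paths add more (1 / number of hops is an assumed weighting).
    weights = defaultdict(float)
    for path in paths:
        if len(path) < 3:
            continue  # an indirect path needs at least (s, m, t)
        m = path[1]
        if m in link_positions:
            weights[m] += 1.0 / (len(path) - 1)

    # Intuition 1: spread each weight around the position of the link to m.
    scores = [0.0] * page_length
    for m, weight in weights.items():
        for center in link_positions[m]:
            for x in range(page_length):
                scores[x] += weight * math.exp(-0.5 * ((x - center) / bandwidth) ** 2)

    total = sum(scores)
    if total == 0:
        # No usable evidence: fall back to a uniform distribution.
        return [1.0 / page_length] * page_length
    return [value / total for value in scores]


# Toy example: two short paths via "A" and one longer path via "B".
example_paths = [("S", "A", "T"), ("S", "A", "T"), ("S", "B", "X", "T")]
example_links = {"A": [2], "B": [8]}
example_dist = position_distribution(example_paths, example_links, page_length=10)
```

In this toy example, most of the probability mass sits near position 2, where the link to the frequently and directly used intermediate page "A" appears. The resulting list could be rendered as the heatmap described in the Motivation section.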

## Data

• We plan to use navigation traces extracted from Wikimedia's server logs, where all HTTP requests to Wikimedia projects are logged.
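
As a rough illustration of how indirect paths $(s, m, \ldots, t)$ might be pulled out of such traces, consider the sketch below. It assumes the traces have already been reconstructed from the raw request logs into ordered lists of page titles, one list per navigation session; that reconstruction step is not shown. The function name `indirect_paths` and the trace format are hypothetical.

```python
def indirect_paths(traces, source, target):
    """Collect sub-paths (source, m, ..., target) with at least one intermediate page.

    traces : iterable of lists of page titles, in the order they were visited
    """
    paths = []
    for trace in traces:
        for i, page in enumerate(trace):
            if page != source:
                continue
            try:
                # First visit to the target after this visit to the source.
                j = trace.index(target, i + 1)
            except ValueError:
                continue  # target never reached after this point
            if source in trace[i + 1:j]:
                continue  # a later visit to the source yields the tighter path
            if j - i >= 2:  # skip direct transitions source -> target
                paths.append(tuple(trace[i:j + 1]))
    return paths


example_traces = [
    ["S", "A", "T"],
    ["S", "T"],
    ["X", "S", "B", "C", "T", "S", "A", "T"],
    ["S", "A", "S", "B", "T"],
]
example_paths = indirect_paths(example_traces, "S", "T")
```

When the source page is visited several times before the target is reached, only the last visit is used as the start of the path, so that the second element of each tuple is the link actually clicked on the way to the target.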

## Evaluation

• Quantitative: We would like to measure the distance between the suggested link insertion positions and the position where the link was actually inserted. We also intend to compare the estimated probability distribution over link insertion positions with the observed ground truth.
• Qualitative: We intend to run a crowdsourcing experiment in which the control condition asks editors to insert links without the heatmap, while the treatment condition asks them to do so with the heatmap. We then measure and compare the following for both conditions:
1. How long it takes them to find a position.
2. How cumbersome they find the experience.
3. How highly others rate the chosen position.
4. How frequently inserted links are reverted or clicked.
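
The quantitative metrics mentioned above could be computed along the following lines. This is only a sketch under stated assumptions: positions are integer indices into the source page, distance is the absolute difference between indices, and total variation distance is used as one possible way to compare the estimated and observed distributions. The page does not specify which distance or divergence will be used, and the function names are hypothetical.

```python
def top_position_error(predicted, true_position):
    """Distance between the most probable position and the actual insertion position."""
    best = max(range(len(predicted)), key=predicted.__getitem__)
    return abs(best - true_position)


def expected_distance(predicted, true_position):
    """Expected distance to the actual position under the predicted distribution."""
    return sum(p * abs(x - true_position) for x, p in enumerate(predicted))


def total_variation(predicted, observed):
    """Total variation distance between two distributions over the same positions."""
    if len(predicted) != len(observed):
        raise ValueError("distributions must cover the same positions")
    return 0.5 * sum(abs(p - q) for p, q in zip(predicted, observed))


# Toy example: four candidate positions, link actually inserted at position 2.
example_predicted = [0.1, 0.6, 0.2, 0.1]
example_observed = [0.0, 0.5, 0.5, 0.0]
example_error = top_position_error(example_predicted, 2)
example_expected = expected_distance(example_predicted, 2)
example_tv = total_variation(example_predicted, example_observed)
```

Total variation distance is chosen here only because it stays well defined when some positions have zero observed probability; other measures could be substituted.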

## Research Terms

This formal research collaboration is based on a mutual agreement between the collaborators to respect Wikimedia user privacy and focus on research that can benefit the community of Wikimedia researchers, volunteers, and the WMF. To this end, the researchers who work with the private data have entered into a non-disclosure agreement as well as a memorandum of understanding.