Research:Link recommendation model for add-a-link structured task

Created: 14:13, 22 October 2020 (UTC)
Collaborators: Djellel Difallah
Duration: September 2020 – August 2021
This page documents a completed research project.


As part of the work on structured tasks, the Growth team is aiming to implement the Add a link structured task. Here we describe the algorithm developed by the research team to automatically generate link recommendations (see Add-a-link for the technical infrastructure). As a black box, the algorithm works as follows:

  • Input: source-page/article = the article in which we want to add links
  • Output: list of recommended links with the following information
    • target-page/link = article which we want to link to
    • anchor/mention = text in source-page where link to target-page should be placed
    • link-likelihood = probability the model assigns to the link. Setting a threshold for the link-probability (thus only recommending links with a probability above, say, 0.8) allows us to make more accurate predictions, however, at the expense of missing potentially useful links. (An illustrative example of a single recommendation is sketched below.)
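For illustration, a single recommendation could be represented roughly as follows. The field names here are hypothetical and do not necessarily match the format served by the production API (see the Add-a-link infrastructure page for that).

```python
# Hypothetical shape of one recommendation; field names are illustrative only.
recommendation = {
    "source_page": "Belgium",           # article in which to add the link
    "anchor": "Western Europe",         # text span in the source page
    "target_page": "Western Europe",    # article to link to
    "link_probability": 0.92,           # model score; compare against a threshold such as 0.8
}
```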

In the following we want to explain some of the details of the black box that generates the recommendations.


Problem statement

Our goal is to recommend links to add to a given article. The recommendation should state: i) which article to link to, ii) where in the text the link should be added (using existing text without adding additional words).

Approach

First, we identify unlinked parts of the text which could potentially contain a link. For example, finding the plain text “Berlin”, we might recommend the link Berlin, the capital of Germany. However, we have to consider that this text could also refer to Berlin (Wisconsin), the music band Berlin, etc. As a second step, we thus have to identify the possible candidates and decide which of them (if any) to recommend as the link target.

Looking at links in existing articles, we use machine learning to identify the most likely candidates for articles to link to. The details are described below.


Anchor dictionary

The anchor dictionary records all links and their anchors across all articles in a given Wikipedia. It serves as a look-up table to identify potential candidates for linking. From the wikitext dump we extract all (anchor, link) pairs using mwparserfromhell. As a result we obtain a dictionary of the form {mention: {link: count}}. (A minimal sketch of this extraction is shown after the pre-processing list below.)

We include the following pre-processing steps:

  • We normalize titles of the links (i.e. uppercasing the first character, see format of article-titles)
  • We resolve redirects of article-titles (i.e. if the link is a redirect-page we record the title of the redirected-to article)
  • We only keep links to main-namespace articles in the wiki under consideration
  • We only keep anchors/links from articles that are not redirects and that are in the main namespace
  • We normalize anchors by lowercasing the string
  • We only keep anchors with a link-probability > 6.5% (Milne&Witten 2008). The link-probability is the fraction of times the mention appears as a link (the mention can also appear as plain text without any links). This is intended to avoid mentions such as “and” in the anchor-dictionary, similar to filtering stopwords in NLP.
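A minimal sketch of the extraction step, assuming the article wikitexts are read from the dump elsewhere; redirect resolution and the namespace and link-probability filters from the list above are omitted for brevity, and the helper names are illustrative rather than taken from the production code.

```python
from collections import defaultdict

import mwparserfromhell


def add_to_anchor_dictionary(wikitext, anchor_dict):
    """Collect all (anchor, link) pairs from one article's wikitext.

    anchor_dict maps a lowercased anchor to {normalized link title: count}.
    """
    for wikilink in mwparserfromhell.parse(wikitext).filter_wikilinks():
        title = str(wikilink.title).strip()
        if not title or ":" in title or title.startswith("#"):
            continue  # crude main-namespace filter; section-only links are skipped
        title = title[0].upper() + title[1:]                           # normalize the first character
        anchor = str(wikilink.text or wikilink.title).strip().lower()  # lowercase the anchor
        anchor_dict[anchor][title] += 1


anchor_dict = defaultdict(lambda: defaultdict(int))
```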

Gold dataset

We generate a gold dataset of fully formed sentences in Wikipedia articles which already contain links. For this, we consider the first sentence in an article which contains at least one link. This choice is motivated mainly by the fact that later sentences might lack links which had already occurred earlier in the article (false negatives). An example of a gold sentence from the article Belgium in Simple English Wikipedia is “Belgium, officially the Kingdom of Belgium, is a federal state in Western Europe.”
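As an illustration, a deliberately simplified extraction could look like the sketch below; real sentence splitting must handle language-specific delimiters (see the bnwiki anecdote in the results further down), and the actual implementation in research-mwaddlink may differ.

```python
import re

import mwparserfromhell


def first_linked_sentence(wikitext):
    """Return the first sentence of an article's wikitext that contains at
    least one wikilink (naive splitting on '.' and the Bengali daṛi '।')."""
    for sentence in re.split(r"(?<=[.।])\s+", wikitext):
        if mwparserfromhell.parse(sentence).filter_wikilinks():
            return sentence.strip()
    return None
```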

Training data

In order to train an ML model we build a training dataset containing features X and labels Y. The gold dataset provides examples of which words are used as anchors for which links in which articles (and similarly which words are not used as anchors). Thus, we can construct a dataset of positive (Y=1) and negative (Y=0) examples. Consider the example above for the article “simple:Belgium”: “Belgium, officially the Kingdom of Belgium, is a federal state in Western Europe”. The positive examples (Y=1) can be obtained from the existing links, for example the anchor-link pair (“state”, “simple:Sovereign state”). The negative examples (Y=0) can be obtained by two different means. The first possibility is alternative links for existing anchor-link pairs; looking up the anchor dictionary would yield the anchor-link pair (“state”, “simple:State”). The second possibility is potential anchors which do not contain links, such as “western europe”; looking up the anchor dictionary would yield the anchor-link pair (“western europe”, “simple:Western Europe”). For the gold sentence of an article, we generate as many positive and negative anchor-link pairs as possible and calculate the following features (X):

  • ngram: the number of words in the anchor (based on simple tokenization)
  • frequency: count of the anchor-link pair in the anchor-dictionary
  • ambiguity: how many different candidate links exist for an anchor in the anchor-dictionary
  • kurtosis: the kurtosis of the distribution of candidate links for a given anchor in the anchor-dictionary
  • Levenshtein-distance: the Levenshtein distance between the anchor and the link. This measures how similar the two strings are. Roughly speaking, it corresponds to the number of single-character edits one has to make to transform one string into another, e.g. the Levenshtein distance between “kitten” and “sitting” is 3.
  • w2v-distance: similarity between the article (source-page) and the link (target-page) based on the content of the pages. This is obtained from wikipedia2vec. Similar to the concept of word-embeddings, we map each article to a vector in an (abstract) 50-dimensional space in which articles with similar content are located close to each other. Thus, given two articles (say the source article and a possible link) we can look up their vectors and get an estimate for their similarity by calculating the distance between them (more specifically, the cosine-similarity). The rationale is that a link might be more likely if the corresponding article is more similar to the source article.
  • nav-distance: similarity between the article (source-page) and the link (target-page) based on the navigation of readers. This is obtained from an approach similar to the Navigation vectors. Conceptually similar to the w2v-distance except that we construct the 50-dimensional vectors based on the sequence of articles from readers on Wikipedia. That is, two articles will be more similar if they are read together in the same reading sessions. This feature was removed (see below).

Thus, for each triple (article, anchor, link) we calculate a set of features X and record the label Y=0,1 (depending on whether the anchor-link pair exists in the article).
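A rough sketch of the feature computation for one triple follows; `anchor_dict` is the anchor dictionary from above, `embeddings` is assumed to be a mapping from article titles to their wikipedia2vec vectors, and the navigation-based feature is omitted since it was later removed.

```python
import numpy as np
from scipy.stats import kurtosis

import Levenshtein  # python-Levenshtein package


def cosine_similarity(u, v):
    if u is None or v is None:
        return 0.0
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def features_for_candidate(source_page, anchor, link, anchor_dict, embeddings):
    """Feature vector X for one (article, anchor, link) triple (sketch)."""
    candidates = anchor_dict.get(anchor, {})
    counts = np.array(list(candidates.values()), dtype=float)
    return {
        "ngram": len(anchor.split()),          # number of words in the anchor
        "frequency": candidates.get(link, 0),  # count of the anchor-link pair
        "ambiguity": len(candidates),          # number of candidate links for the anchor
        "kurtosis": float(kurtosis(counts)) if counts.size > 1 else 0.0,
        "levenshtein": Levenshtein.distance(anchor, link.lower()),
        "w2v": cosine_similarity(embeddings.get(source_page), embeddings.get(link)),
    }
```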

Training the model

Given the training data consisting of features X and labels Y, we train a classifier that distinguishes between the positive and the negative examples in the training dataset. Specifically, we use xgboost with default parameters.
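A minimal training sketch, assuming the feature dictionaries above have been stacked into a matrix X with labels Y; the placeholder arrays below only make the snippet runnable and stand in for the real training data.

```python
import numpy as np
import xgboost as xgb

# Placeholders standing in for the real feature matrix / labels built from the gold sentences.
rng = np.random.default_rng(0)
X = rng.random((1000, 6))
Y = rng.integers(0, 2, size=1000)

model = xgb.XGBClassifier()                         # default parameters, as described above
model.fit(X, Y)
probabilities = model.predict_proba(X[:5])[:, 1]    # P(Y=1 | X) for the first five examples
```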

Generating recommendations

Once we have trained the model, generating link recommendations works in the following way:

  • Choose an article and get the wikitext
  • Identify an unlinked part of plain text and consider as a possible anchor
  • Identify link candidates by looking up the anchor in the anchor-dictionary
  • For the triple (article, anchor, link) calculate the features X
  • Using the features X, make a prediction for the probability P(Y=1|X) from the xgboost-model. This yields the probability that we should add the anchor-link pair in the given article
  • Rank all candidate-links for a possible anchor according to their probability
  • Return the top candidate if the probability exceeds a given threshold (say 0.5); a minimal sketch of this loop follows below
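The sketch below shows that loop for a single anchor, reusing the hypothetical features_for_candidate helper and the embeddings lookup from the training-data sketch above; the production code in research-mwaddlink is more involved (tokenization of the plain text, exclusion rules, etc.).

```python
def recommend_for_anchor(source_page, anchor, anchor_dict, model, embeddings, threshold=0.5):
    """Rank candidate links for one unlinked anchor and return the best one
    if its predicted probability exceeds the threshold (sketch)."""
    best_link, best_prob = None, 0.0
    for link in anchor_dict.get(anchor.lower(), {}):
        x = features_for_candidate(source_page, anchor.lower(), link, anchor_dict, embeddings)
        prob = model.predict_proba([list(x.values())])[0, 1]   # P(Y=1 | X)
        if prob > best_prob:
            best_link, best_prob = link, prob
    if best_link is not None and best_prob >= threshold:
        return {"anchor": anchor, "link": best_link, "probability": float(best_prob)}
    return None
```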

Evaluating the model (Backtesting)

We evaluate the trained model using a held-out sample of the gold-dataset containing sentences with existing links taken from Wikipedia articles.

We generate link recommendations for the plain text of the sentence (removing all links). We then compare the predicted anchor-link pairs with the actual anchor-link pairs and calculate precision (how many of the recommended links are true, i.e. exist in the gold data) and recall (how many of the true links are also recommended by the model). It is probably better to aim for high precision (making sure that the links we recommend are actually genuine) than for high recall (making sure that we capture all existing links), in order to reduce the number of false positives.
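Schematically, for one evaluation set the two metrics can be computed as in the sketch below, where `predicted` and `gold` are assumed to be sets of (sentence, anchor, link) triples.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted anchor-link pairs against the gold data (sketch)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```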

Hard-coded rules for (not) linking

We implement some hard-coded rules for when *not* to add a link (anchor-text, link-title) to a source page.

We have also implemented implicit rules related to the anchor-text:

  • We do not link an anchor-text that is typically not used as a link across all other articles in the respective Wikipedia. More precisely, the anchor-text has to be used as a link in at least 6.5% of all of its occurrences (whether as a link or as raw text without a link) across all articles in that Wikipedia.
    • This prevents adding links in many of the unwanted cases by generalizing from the overall usage in the respective Wikipedia, without explicitly encoding every single case. For example, it avoids adding links to years such as “640 AD” since, in most cases, they are not linked when looking across all articles (even though there might be a few existing cases). A sketch of this link-probability filter follows below.
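A sketch of this rule, assuming `anchor_dict` from above and a hypothetical `mention_counts` mapping that records how often each (lowercased) anchor text occurs in article text overall, linked or not.

```python
def link_probability(anchor, anchor_dict, mention_counts):
    """Fraction of all occurrences of the anchor text that are links (sketch)."""
    linked = sum(anchor_dict.get(anchor, {}).values())
    total = mention_counts.get(anchor, 0)
    return linked / total if total else 0.0


def may_be_linked(anchor, anchor_dict, mention_counts, threshold=0.065):
    """Only anchors that are linked in at least 6.5% of their occurrences are considered."""
    return link_probability(anchor, anchor_dict, mention_counts) >= threshold
```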

Results

Third set of results (2021-06)

We made changes to the filter which prevents link recommendations to pages of specific entity-types. After feedback from volunteers we added several entity-types related to time and units (more details in phab:T279434).

We first evaluated precision and recall on the four test wikis (arwiki, bnwiki, cswiki, viwiki) and enwiki (for comparison) using the backtesting data:

  • Precision is not affected. The updated filter does not increase the number of false positives (wrong suggestions).
  • Recall decreases slightly (between 1 and 4 percentage points). After updating the filter, the algorithm identifies fewer of the existing links. This is expected since, based on volunteer feedback, we manually prevent some pages from being recommended as links; however, some of these links do appear in the data.
Threshold (link-probability)
                  0      0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
enwiki   prec     0.538  0.626  0.681  0.731  0.771  0.813  0.849  0.883  0.915  0.957
         recall   0.591  0.585  0.572  0.547  0.505  0.45   0.382  0.305  0.219  0.087
arwiki   prec     0.479  0.563  0.616  0.663  0.711  0.754  0.799  0.847  0.89   0.937
         recall   0.392  0.396  0.384  0.366  0.337  0.302  0.252  0.196  0.143  0.067
bnwiki   prec     0.527  0.579  0.616  0.648  0.69   0.736  0.788  0.843  0.888  0.934
         recall   0.409  0.411  0.401  0.377  0.339  0.28   0.195  0.131  0.075  0.021
cswiki   prec     0.588  0.629  0.668  0.699  0.732  0.764  0.808  0.855  0.907  0.965
         recall   0.585  0.576  0.559  0.535  0.496  0.435  0.356  0.259  0.152  0.044
viwiki   prec     0.679  0.753  0.791  0.823  0.855  0.888  0.922  0.953  0.987  0.995
         recall   0.708  0.73   0.724  0.707  0.678  0.633  0.577  0.515  0.432  0.398


We further train and evaluate the model for 7 additional Wikipedias (ruwiki, frwiki, plwiki, rowiki, fawiki, huwiki, eswiki; see phab:T284481 for more details about the choice of wikis). The overall performance in the backtesting data is comparable to that of the previously considered wikis. For the default threshold parameter of 0.5, we obtain:

  • precision is at ~80% (or higher)
  • recall is above 40%
Threshold (link-probability)
                  0      0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
ruwiki   prec     0.546  0.641  0.673  0.705  0.744  0.778  0.814  0.856  0.893  0.956
         recall   0.532  0.539  0.53   0.507  0.472  0.423  0.36   0.278  0.179  0.067
frwiki   prec     0.564  0.645  0.686  0.731  0.773  0.815  0.85   0.891  0.927  0.96
         recall   0.571  0.568  0.558  0.537  0.503  0.459  0.404  0.331  0.245  0.14
plwiki   prec     0.605  0.7    0.762  0.803  0.836  0.871  0.904  0.929  0.957  0.98
         recall   0.633  0.679  0.663  0.642  0.617  0.576  0.527  0.464  0.378  0.242
rowiki   prec     0.572  0.77   0.802  0.838  0.859  0.882  0.904  0.925  0.954  0.982
         recall   0.627  0.693  0.684  0.666  0.641  0.611  0.57   0.515  0.443  0.315
fawiki   prec     0.593  0.769  0.803  0.832  0.857  0.882  0.907  0.931  0.949  0.971
         recall   0.632  0.679  0.67   0.656  0.637  0.605  0.568  0.519  0.461  0.342
huwiki   prec     0.721  0.799  0.829  0.858  0.875  0.891  0.905  0.929  0.949  0.977
         recall   0.475  0.492  0.485  0.473  0.462  0.449  0.434  0.404  0.362  0.272
eswiki   prec     0.554  0.655  0.692  0.73   0.769  0.805  0.847  0.89   0.926  0.961
         recall   0.518  0.559  0.554  0.536  0.499  0.451  0.385  0.298  0.193  0.084

Second set of results (2020-12)

We made some changes to the training of the link-recommendation model:

  • the extracted example-sentences are now assigned to training and testing data randomly (the previous method had some potential bias)
  • the generation of candidate anchors from n-grams now works properly if the n-gram contains non-alphabetic characters. One common case is place names containing commas, such as en:Latrobe,_Pennsylvania; not identifying those links often leads to false positives downstream (e.g. wrongly suggesting en:Pennsylvania)
  • we removed the navigation-based features (obtained from reading sessions) from the model. We find that these features do not substantially contribute to the model performance in the backtesting data. Removing them will make it easier to potentially share the model and the underlying data publicly.

We also trained/evaluated the model on bnwiki instead of kowiki.

Precision and recall for different wikis

We evaluate the link-recommendation model using precision and recall:

  • precision = How many of the recommended links did also exist in the evaluation data?
  • recall = How many of the links existing in the evaluation data did we recommend?

The threshold is a free parameter in the model which allows us to balance the tradeoff between precision and recall, i.e. increasing precision typically comes at the cost of recall. This parameter can in principle be chosen differently for each wiki in order to find an acceptable value for precision (making sure the recommendations are accurate) while ensuring that recall is not too small (making sure we make enough recommendations).

We see that for most wikis one can obtain a precision of 70-80% with a recall of around 40% (or more).

Threshold (link-probability)
                     0     0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
simplewiki  prec     0.55  0.59  0.64  0.68  0.73  0.79  0.84  0.88  0.92  0.95
            recall   0.64  0.64  0.62  0.59  0.53  0.45  0.36  0.27  0.15  0.06
dewiki      prec     0.63  0.67  0.7   0.73  0.76  0.8   0.84  0.88  0.91  0.95
            recall   0.63  0.62  0.6   0.58  0.54  0.48  0.41  0.33  0.24  0.12
ptwiki      prec     0.6   0.7   0.75  0.78  0.82  0.84  0.87  0.9   0.93  0.93
            recall   0.58  0.62  0.62  0.6   0.57  0.53  0.48  0.41  0.31  0.2
arwiki      prec     0.49  0.56  0.6   0.65  0.7   0.75  0.8   0.85  0.91  0.97
            recall   0.45  0.45  0.44  0.42  0.39  0.34  0.28  0.21  0.14  0.07
bnwiki      prec     0.57  0.6   0.63  0.67  0.7   0.75  0.8   0.85  0.91  0.95
            recall   0.45  0.45  0.44  0.41  0.37  0.3   0.22  0.15  0.09  0.04
cswiki      prec     0.6   0.64  0.67  0.7   0.74  0.78  0.82  0.86  0.91  0.96
            recall   0.58  0.57  0.56  0.53  0.5   0.44  0.37  0.27  0.16  0.05
viwiki      prec     0.64  0.73  0.78  0.82  0.86  0.89  0.92  0.96  0.99  0.99
            recall   0.69  0.75  0.75  0.73  0.69  0.65  0.58  0.51  0.44  0.41

An anecdote on the usefulness of the backtesting evaluation: in a first version, bnwiki showed considerably worse performance than the other languages. Manually investigating examples from the backtesting data revealed that our approach was not able to correctly identify fully formed sentences (with existing links). This is caused by the fact that the Bengali language uses an alternative symbol for the full stop, the daṛi (“।”). Once we accounted for this, performance was similar to the other wikis. More importantly, this anecdote shows that the backtesting evaluation can indicate if our approach is not working for a specific language (even though it will take some detective work once we know that something is not working).

Feature importance for prediction

We assess the importance of the different features described above for making the recommendations. The xgboost model allows us to calculate a feature-importance score based on gain, i.e. "the average training loss reduction gained when using a feature for splitting" (this is similar to the Gini importance measure used in scikit-learn tree models; see here for a good overview of different ways to calculate feature importance scores).
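With the xgboost classifier from the training sketch above, gain-based importance scores can be read out and normalized roughly as follows (a sketch, not the exact analysis code).

```python
# Gain-based feature importance from the fitted xgboost classifier (sketch).
booster = model.get_booster()
gain = booster.get_score(importance_type="gain")          # {feature name: average gain}
total_gain = sum(gain.values())
normalized_importance = {name: value / total_gain for name, value in gain.items()}
```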

 
Figure: feature importance scores for the link recommendation model for different languages

The navigation-based features ("nav") have the lowest feature score across all wikis and the normalized value is close to 0. This suggests that the feature is not relevant for generating link recommendations. In fact, running the model with and without this feature shows hardly any differences; the changes are on the order of 0.00-0.02 and are thus most likely not significant (not shown here).

Number of link recommendations for an article?

We want to know how many link recommendations are generated for a given article in a given language. The potential problem is that if we choose the threshold parameter too conservatively (high precision but low recall), we will only be able to generate link recommendations for few articles.

 
Figure: fraction of articles for which we generate at least 1 (blue), 5 (orange), or 10 (green) link recommendations in different wikis

Thus, for each language, we select 500 random articles (main namespace, no redirects) and count the number of generated link recommendations for different choices of the threshold parameter. We show the fraction of articles (of the 500 randomly selected) for which the algorithm generates at least 1 (blue), 5 (orange), or 10 (green) link recommendations. For example, for arwiki with threshold parameter 0.4 we can generate 5 or more link recommendations for approximately 10% of the articles. Extrapolating to the 1M articles in arwiki, we estimate that there are around 100k articles for which we can generate 5 or more link recommendations.




First set of results (2020-10)

We ran an initial evaluation on 7 wikis (simple, de, pt, ar, cs, ko, vi). There are some variations in the performance but, overall, the model yields satisfactory results across the different languages. Aiming for a recall of around 0.5 (i.e. the link recommendations capture 50% of the true links), we get a precision of 70-80%. In some languages, the precision is even higher (e.g. vi). The reasons for these variations are not clear at the moment and require further research.

Threshold (link-probability)
                 0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
simple  prec     0.53  0.55  0.58  0.62  0.67  0.7   0.75  0.82  0.85  0.83
        recall   0.67  0.66  0.64  0.61  0.55  0.48  0.38  0.27  0.17  0.07
de      prec     0.62  0.67  0.69  0.72  0.75  0.79  0.83  0.87  0.91  0.94
        recall   0.63  0.62  0.6   0.58  0.54  0.48  0.41  0.33  0.24  0.11
pt      prec     0.57  0.67  0.72  0.79  0.82  0.85  0.87  0.9   0.94  0.99
        recall   0.54  0.57  0.58  0.58  0.57  0.52  0.46  0.4   0.3   0.18
ar      prec     0.56  0.61  0.67  0.72  0.77  0.81  0.86  0.9   0.94  0.98
        recall   0.43  0.43  0.42  0.4   0.37  0.34  0.29  0.24  0.18  0.08
cs      prec     0.58  0.62  0.64  0.67  0.71  0.74  0.78  0.84  0.88  0.94
        recall   0.56  0.55  0.54  0.51  0.48  0.42  0.35  0.25  0.13  0.03
ko      prec     0.49  0.61  0.64  0.67  0.69  0.72  0.75  0.79  0.83  0.87
        recall   0.4   0.4   0.4   0.4   0.39  0.37  0.35  0.3   0.21  0.08
vi      prec     0.74  0.79  0.83  0.86  0.88  0.91  0.94  0.96  0.99  0.996
        recall   0.81  0.78  0.76  0.74  0.72  0.68  0.63  0.58  0.53  0.48

References

Code: https://github.com/wikimedia/research-mwaddlink

Paper: Gerlach, M., Miller, M., Ho, R., Harlan, K., & Difallah, D. (2021). Multilingual Entity Linking System for Wikipedia with a Machine-in-the-Loop Approach. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 3818–3827. https://doi.org/10.1145/3459637.3481939. Preprint available on arXiv: https://arxiv.org/abs/2105.15110

Step-by-step procedure for training a new model for production: Research:Link_recommendation_model_for_add-a-link_structured_task/Training_Procedure

Model card: Machine learning models/Proposed/add-a-link model