Research:Recommending links to increase visibility of articles/Link-translation

This page explores the idea of recommending new links to orphan articles from existing links in other Wikipedias.

Exploratory analysis

We explore the network of links between Wikipedia articles considering all languages and mapping articles to their Wikidata items.

How many orphan articles are there?

For each Wikipedia project, the table reports the following numbers:

  • N, the number of articles (main namespace, no redirects)
  • No, the number of orphan articles (i.e., articles with no incoming links from other main-namespace articles in the same project)
  • Po, the percentage of orphan articles in each project

Some observations:

  • In total, there are roughly 8.4M orphan articles across Wikipedia projects. This corresponds to 14.6% of all articles (57M).
  • Large Wikipedias (in terms of number of articles) have a relatively small percentage of orphan articles (<10%); however, the absolute number often still exceeds 100,000. For example, in enwiki the 4.7% translates into close to 300k orphan articles.
  • Among the 20 largest Wikipedias, those with the highest fraction of orphans are arzwiki (81.9%), viwiki (51.5%), fawiki (22.3%), arwiki (20.9%), and svwiki (17.2%).
  • For many of the smaller Wikipedias, the fraction of orphan articles is considerably higher (consistently above 30%).

Characterizing orphan articles

We characterize orphan articles by checking whether they have a certain property x:

  • disambiguation: page is a disambiguation page
  • bot-created: page was created by a bot
  • gender (woman): article is about a woman (considering only biography articles)
  • quality (higher): article's quality score is in the top 50% of all articles
  • age (newer): article age (time since creation) is in the bottom 50% of all articles

The first columns summarize the overall counts for each project:

  • N, the total number of articles
  • No, the number of orphan articles

The other columns show how common articles with property x are among orphan articles:

  • Nxo, the number of articles that are orphan and belong to class x
  • logExo, the enrichment of property x among orphan articles, defined as logExo = log(P(x|o)/P(x)) (see the worked example below). This is useful to see whether a given property x is very common (or uncommon) among orphan articles: if it is positive (>0), x occurs more frequently among orphan articles than in the population of all articles; if it is negative (<0), x occurs less frequently than in the overall population of articles.

As an example, let's consider disambiguation pages in enwiki:

  • Of the 299k orphan articles (No), roughly 147k are disambiguation pages (Nxo), which means that the conditional probability of being a disambiguation page given that an article is an orphan is P(x|o)=0.49. Is this a high or low probability?
  • To answer this, we have to compare with the overall number of disambiguation pages in Wikipedia: of the 6.4M articles in enwiki (N), there are in total 289k disambiguation pages (not shown in this table), which translates into a probability of P(x)=0.045. This means that, overall, disambiguation pages are much less common.
  • Now we simply compare the two probabilities, P(x|o) and P(x), by taking the logarithm of their ratio: logExo = log(P(x|o)/P(x)). The nice property is that if P(x|o)>P(x) then logExo>0 (property x is more common among orphans), and if P(x|o)<P(x) then logExo<0 (property x is less common among orphans).
  • In this case, P(x|o) is much larger than P(x): in fact, P(x|o)/P(x)=11, such that logExo=log(11)=2.39.
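
The computation is easy to reproduce; here is a minimal sketch in Python, using the approximate enwiki counts from the example above (note that the logarithm is the natural log):

  import math

  def log_enrichment(n_xo, n_o, n_x, n):
      """log(P(x|o) / P(x)): positive if property x is over-represented among orphans."""
      p_x_given_o = n_xo / n_o  # P(x|o): share of orphan articles with property x
      p_x = n_x / n             # P(x): share of all articles with property x
      return math.log(p_x_given_o / p_x)

  # enwiki disambiguation example: P(x|o) ~ 0.49, P(x) ~ 0.045, ratio ~ 11
  print(round(log_enrichment(147_000, 299_000, 289_000, 6_400_000), 2))  # 2.39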

Some observations:

  • Disambiguation pages are very common among orphan articles. For enwiki, roughly half of all orphans are disambiguation pages, though there are still more than 150k orphans which are not disambiguation pages.
  • Bot-created pages: for some languages we see a substantial number of orphan articles that were created by bots, such as cebwiki (632k), svwiki (435k), and viwiki (495k); these wikis are known to contain many bot-created articles. Thus, while the absolute number of orphan articles created by bots is large, being bot-created is not more common among orphans than among the rest of the articles in these projects (logExo~0).
  • Gender: articles about women are over-represented among orphan articles. We know that, overall, between 15% and 20% of biography articles are about women. However, among orphan articles the fraction of biography articles about women is much higher (and thus logExo>0).
  • Quality: higher-quality pages are under-represented among orphan articles (logExo<0). This means that orphan articles tend to have lower quality.
  • Age: newer pages are over-represented among orphan articles (logExo>0). This means that orphan articles tend to be younger.

De-orphanizing via link translation

How many of the orphan articles could be de-orphanized by link translation? That is, for an orphan article in a specific language, are there already existing incoming links in other languages which we could add to the orphan article (such that the source article of the link in the other language also exists in the language of the orphan article)?

Besides reporting the number of articles (N), the number of orphan articles (No), and the fraction of orphan articles (Po), we calculate for how many orphan articles we could recommend new incoming links via link translation:

  • No_k (for k=1,2,5,10), the number of orphan articles for which there are at least k different incoming links that already exist in another language
  • Po_k (for k=1,2,5,10), the percentage of orphan articles for which there are at least k different incoming links that already exist in another language

Observations:

  • Of the 8.4M orphan articles, we find for 4.9M at least one potential incoming link that already exists in another language. This means that 59% of the orphan articles could be de-orphanized via link translation. This number is somewhat lower if we set a higher threshold on the number of new incoming links; however, it is still in the millions: there are 2.8M orphan articles for which we find 10 or more potential incoming links that already exist in another language.
  • For enwiki, we could only de-orphanize roughly 22% of orphans; however, this still amounts to more than 67k articles.
  • For most of the other Wikipedias, we could de-orphanize more than half of the orphan articles; even for those wikis on the lower end (dewiki, frwiki, nlwiki, ruwiki, jawiki, etc.) we can de-orphanize 30% or more. The one exception seems to be cebwiki with 17% (roughly 108k articles).
  • Link translation can generate recommendations for new incoming links for millions of orphan articles across all Wikipedias.
  • For a single orphan article, we can often find several (10 or more) potential incoming links that already exist in other languages; one option for prioritization is the number of different language versions in which the link already exists (see the sketch below).
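
To make the link-translation step concrete, the following is a minimal sketch over an in-memory graph; "incoming" and "sitelinks" are hypothetical stand-ins for the parsed dumps, with articles identified by their Wikidata items:

  from collections import Counter

  def candidate_incoming_links(orphan, target_wiki, incoming, sitelinks):
      """Rank potential source articles for an orphan article.

      incoming[wiki][item] -> set of items that link to `item` in that wiki
      sitelinks[item]      -> set of wikis in which `item` has an article
      """
      counts = Counter()
      for wiki, links in incoming.items():
          if wiki == target_wiki:
              continue  # only translate links from *other* language versions
          for source in links.get(orphan, set()):
              # the source article must also exist in the orphan's wiki,
              # otherwise the link cannot be added there
              if target_wiki in sitelinks.get(source, set()):
                  counts[source] += 1
      # prioritize links that already exist in the most language versions
      return counts.most_common()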

Evaluation

Here, we evaluate the quality of de-orphanization recommendations made by our proposed link-translation approach and compare it with four strong baselines: (1) Findlink[1], (2) Morelike[2], (3) Reciprocity, and (4) VERSE[3].

Data and setup

We extract the hyperlink graphs for all 305 language versions of Wikipedia from the wikitext dumps published in Jan 2022 and Feb 2022. The graphs from Jan 2022 were used for training the aforementioned methods (Findlink and Morelike are exceptions, as they are available only as APIs and thus could not be run on a specific dump; we are currently working to close this gap and it should be resolved in our next update). We tracked all articles that were orphans in Jan 2022 but were de-orphanized in Feb 2022; these serve as the ground-truth set of de-orphanization queries and, consequently, as our test set for evaluating the aforementioned methods. Note that our setup ensures that there is no data leakage, i.e., the methods do not have access to the de-orphanizing links added to articles in Feb 2022.
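
A minimal sketch of this test-set construction, assuming each snapshot has already been parsed into a mapping from an article to the set of articles linking to it (a hypothetical stand-in for the actual dump processing):

  def orphan_set(incoming_links):
      """Articles with no incoming links from other main-namespace articles."""
      return {page for page, sources in incoming_links.items() if not sources}

  def deorphanization_queries(snapshot_jan, snapshot_feb):
      """Ground truth: orphans in Jan 2022 that gained incoming links by Feb 2022."""
      non_orphans_feb = snapshot_feb.keys() - orphan_set(snapshot_feb)
      return orphan_set(snapshot_jan) & non_orphans_feb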

Baselines

We carefully select four strong baselines, which can be further grouped into three broad categories.

Existing tools and resources from Wikimedia

  • 1. Findlink: The Find Link tool[1] searches for article titles/keywords throughout Wikipedia to highlight those articles that mention the search term and therefore ought to link to it. The tool is available online at https://edwardbetts.com/find_link/ and has been the de facto choice of the Wikipedia editor community for de-orphanizing orphan articles (WikiProject Orphanage).
  • 2. Morelike: The Morelike query[2] is available as part of the CirrusSearch functionality offered by the Wikimedia Search team. It is based on Elasticsearch and finds articles whose text is most similar to the text of a given article.

Heuristics

  • 3. Reciprocity: a simple heuristic where, for each directed edge (u,v), we add a reciprocal edge (v,u) to the graph. For a given query article v, the reciprocal links are ranked based on the degree of the source u (see the sketch below). More details at en:Reciprocity_(network_science).
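
A minimal sketch of this baseline, assuming the graph is given as a set of directed (u, v) pairs plus a degree table (both hypothetical stand-ins for the full hyperlink graph):

  def reciprocity_candidates(v, edges, degree):
      """Recommend incoming links for query article v by reversing its outlinks:
      each edge (v, u) yields a candidate source u, ranked by the degree of u."""
      candidates = [u for (src, u) in edges if src == v and (u, v) not in edges]
      return sorted(candidates, key=lambda u: degree.get(u, 0), reverse=True)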

Graph embeddings

  • 4. VERSE[3]: one of the state-of-the-art node representation learning methods, which scales gracefully to Web-scale graphs. Once trained, VERSE facilitates computing the similarity between any two Wikipedia articles based on the similarity of the network structure around them. This similarity (the higher the better) is used as a score to rank all the recommendations for a given query article (see the sketch below).
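
For illustration, a minimal sketch of this ranking step; "emb" is a hypothetical dict mapping each article to its trained VERSE vector, and cosine similarity is assumed as the score:

  import numpy as np

  def rank_by_similarity(query, candidates, emb):
      """Score candidate source articles by cosine similarity to the query article."""
      q = emb[query] / np.linalg.norm(emb[query])
      scored = [(c, float(emb[c] @ q / np.linalg.norm(emb[c]))) for c in candidates]
      return sorted(scored, key=lambda pair: pair[1], reverse=True)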

Metrics

We report Micro (query-level) and Macro (language-level) averages for Recall@k (k=1, 2, 3, 4, 5, 10, 25, 50, 75, 100) and Mean Reciprocal Rank (MRR) obtained by the aforementioned methods.
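
Both metrics are straightforward to compute per query; a minimal sketch, where "ranked" is a method's ordered recommendation list for one query and "relevant" is the set of ground-truth de-orphanizing links:

  def recall_at_k(ranked, relevant, k):
      """Fraction of the ground-truth links recovered among the top-k recommendations."""
      return len(set(ranked[:k]) & relevant) / len(relevant)

  def reciprocal_rank(ranked, relevant):
      """1/rank of the first correct recommendation; 0 if none is retrieved."""
      for rank, item in enumerate(ranked, start=1):
          if item in relevant:
              return 1 / rank
      return 0.0

The micro average is then the mean of these per-query scores over all queries, while the macro average first averages within each language and then across languages.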

Results

Macro average (language-level):

  Category                  Method            R@1    R@2    R@3    R@4    R@5    R@10   R@25   R@50   R@75   R@100  MRR
  Wikimedia Existing Tools  Findlink*         0.002  0.002  0.003  0.003  0.003  0.003  0.004  0.004  0.005  0.005  0.002
  Wikimedia Existing Tools  Morelike*         0.095  0.138  0.161  0.190  0.201  0.242  0.304  0.345  0.375  0.395  0.145
  Heuristics                Reciprocity       0.058  0.090  0.123  0.136  0.148  0.183  0.201  0.204  0.204  0.205  0.097
  Graph embeddings          VERSE             0.050  0.063  0.082  0.092  0.111  0.140  0.185  0.222  0.243  0.259  0.078
  Proposed Approach         Link-translation  0.149  0.220  0.263  0.293  0.311  0.379  0.426  0.441  0.446  0.448  0.223

Micro average (query-level):

  Category                  Method            R@1    R@2    R@3    R@4    R@5    R@10   R@25   R@50   R@75   R@100  MRR
  Wikimedia Existing Tools  Findlink*         0.002  0.002  0.003  0.003  0.003  0.004  0.005  0.007  0.007  0.008  0.003
  Wikimedia Existing Tools  Morelike*         0.095  0.132  0.155  0.171  0.185  0.228  0.289  0.339  0.370  0.392  0.140
  Heuristics                Reciprocity       0.030  0.064  0.108  0.131  0.153  0.212  0.251  0.258  0.260  0.261  0.083
  Graph embeddings          VERSE             0.030  0.046  0.056  0.065  0.072  0.094  0.132  0.166  0.186  0.203  0.052
  Proposed Approach         Link-translation  0.126  0.167  0.192  0.210  0.222  0.255  0.282  0.292  0.296  0.298  0.168

(R@k denotes recall@k; * Findlink and Morelike were evaluated via their live APIs, see the note below the tables.)

It is clear from the results in the tables above that the proposed link-translation approach considerably outperforms all the baselines; it achieves the best score in every column except the micro-averaged recall@k for k>=25 (statistical significance tests TBD). Specifically, the link-translation approach is very powerful in critical application scenarios, as shown by its strong performance for (1) low values of k (a recall@1 of 15% is a remarkably strong outcome), and (2) low-resourced languages (the macro average is as good as, and even stronger than, the micro average, indicating that link-translation performs equally well, and in fact better, for languages with fewer resources).


Lastly, in some cases (micro average of recall@k, with k>=25) Morelike outperforms the proposed link-translation approach. We hypothesize that this could be due to a difference in the evaluation setup: we have reason to believe that the reported performance of Findlink and Morelike overestimates their true performance. Because both are available only as APIs, we could not evaluate them on the dumps from a specific snapshot (in our case, Jan 2022). Instead, their reported performance is based on API calls executed in July 2022, which gives these methods an unfair advantage, as they have access to information beyond what was available in Jan 2022. We are currently working to resolve this with a two-pronged approach:

  1. Identifying ways to run Morelike on the dumps from Jan 2022. We have held discussions with the CirrusSearch team and have identified ways to carry out such an experiment.
  2. Morelike is based on the textual content of each Wikipedia article. We are trying to leverage this complementary information by incorporating the textual content as an additional signal in our link-translation approach.

A detailed evaluation of the aforementioned methods (and their variants), showing wiki-specific results for all 305 language versions of Wikipedia along with some intermediary analyses, is available in this Google sheet: Orphans: First-eval results.

References

  1. Edward Betts. The Find Link tool: Add Wikipedia links to pages that really ought to have a link. Available: https://edwardbetts.com/find_link/
  2. Morelike. Available: https://www.mediawiki.org/wiki/Help:CirrusSearch#Morelike
  3. Tsitsulin et al. VERSE: Versatile Graph Embeddings from Similarity Measures. WWW 2018. Available: https://dl.acm.org/doi/10.1145/3178876.3186120