Research:Wikipedia Navigation Vectors

Contact

Ellery Wulczyn

Wikimedia Foundation

Duration: 2016 – ??

Open source
via github.com

Open data
via figshare.com

This page documents a completed research project.

About

In this project, we learned embeddings for Wikipedia articles and Wikidata items by applying Word2vec models to a corpus of reading sessions.

Although Word2vec models were developed to learn word embeddings from a corpus of sentences, they can be applied to any kind of sequential data. The learned embeddings have the property that items with similar neighbors in the training corpus have similar representations (as measured by cosine similarity, for example). Consequently, applying Word2vec to reading sessions results in article embeddings in which articles that tend to be read in close succession have similar representations. Since people usually generate sequences of semantically related articles while reading, these embeddings also capture semantic similarity between articles.
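
As a small illustration of the similarity measure mentioned above, here is a sketch of cosine similarity in plain Python. The three vectors are invented toy values, not taken from any released embedding, and returning 0.0 for a zero vector is a convention chosen for this sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "Article_A": [0.9, 0.1, 0.0],
    "Article_B": [0.8, 0.2, 0.1],
    "Article_C": [0.0, 0.1, 0.9],
}

sim_ab = cosine_similarity(toy_vectors["Article_A"], toy_vectors["Article_B"])
sim_ac = cosine_similarity(toy_vectors["Article_A"], toy_vectors["Article_C"])
```

In this toy example, Article_A and Article_B point in nearly the same direction, so sim_ab is larger than sim_ac.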

t-SNE projection of top 20 most popular enwiki articles for the first week of February 2016

There have been several approaches to learning vector representations of Wikipedia articles that capture semantic similarity by using the article text or the links between articles. An advantage of training Word2vec models on reading sessions is that they learn from the actions of millions of humans who are using a diverse array of signals, including the article text, links, third-party search engines, and their existing domain knowledge, to determine what to read next in order to learn about a topic.

An additional benefit of not relying on text or links is that we can learn representations for Wikidata items by simply mapping article titles within each session to Wikidata items using Wikidata sitelinks. As a result, these Wikidata vectors are jointly trained over reading sessions for all Wikipedia language editions, allowing the model to learn from people across the globe. This approach also overcomes data sparsity issues for smaller Wikipedias, since the representations for articles in smaller Wikipedias are shared across many other potentially larger ones. Finally, instead of needing to generate a separate embedding for each Wikipedia in each language, we have a single model that gives a vector representation for any article in any language, provided the article has been mapped to a Wikidata item.
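
The mapping from article titles to Wikidata items can be sketched as a dictionary lookup. The table below is a tiny hand-written stand-in for real sitelink data, and the names SITELINKS and session_to_items are hypothetical; dropping unmapped titles is an assumption of this sketch.

```python
# Hypothetical sitelink table: (wiki, title) -> Wikidata item ID.
# These few entries are toy data; a real table would be built from Wikidata sitelinks.
SITELINKS = {
    ("enwiki", "Berlin"): "Q64",
    ("dewiki", "Berlin"): "Q64",
    ("enwiki", "Douglas Adams"): "Q42",
    ("frwiki", "Douglas Adams"): "Q42",
}


def session_to_items(wiki, titles, sitelinks=SITELINKS):
    """Map a session of article titles to Wikidata item IDs.

    Titles without a sitelink entry are dropped (an assumption of this sketch).
    """
    items = []
    for title in titles:
        item = sitelinks.get((wiki, title))
        if item is not None:
            items.append(item)
    return items


en_session = session_to_items("enwiki", ["Berlin", "Douglas Adams", "Unmapped page"])
de_session = session_to_items("dewiki", ["Berlin"])
```

Because sessions from every language edition end up in the same item vocabulary, they can be pooled into one training corpus.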

Where to get the Data

The canonical citation and most up-to-date version of this dataset can be found at:

Ellery Wulczyn (2016). Wikipedia Navigation Vectors. figshare. doi:10.6084/m9.figshare.3146878

Getting Started

Check out this IPython notebook for a tutorial on how to work with the data.

Data Preparation

Getting article requests per client [code]

Since requests carry no unique user token, we define a "client" as an (IP, UA, XFF) tuple, i.e. the IP address, user agent, and X-Forwarded-For header of the request. To generate a list of requests per client for a given timespan:

take all non-spider requests for all Wikipedias
resolve redirects for the top 20 most visited Wikipedias
filter out requests for non-main namespace articles
filter out disambiguation pages
filter out requests for articles that were requested by fewer than 50 clients
filter out requests for articles that do not have a corresponding Wikidata item
group requests per client
remove all data from any client who made an edit
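
The grouping and client-level filters above could look roughly like the following sketch. It is not the project's actual code: the request schema (dicts with ip, ua, xff, title, ts keys), the function name, and the editor_clients argument are all assumptions made for illustration, and the spider, redirect, namespace, disambiguation, and Wikidata filters are taken to have been applied already.

```python
from collections import defaultdict


def group_requests_by_client(requests, editor_clients=frozenset(), min_clients=50):
    """Group requests per (IP, UA, XFF) client.

    requests: list of dicts with keys ip, ua, xff, title, ts (hypothetical schema).
    editor_clients: set of client tuples known to have made an edit.
    """
    # Count the distinct clients that requested each article.
    clients_per_article = defaultdict(set)
    for r in requests:
        clients_per_article[r["title"]].add((r["ip"], r["ua"], r["xff"]))

    by_client = defaultdict(list)
    for r in requests:
        client = (r["ip"], r["ua"], r["xff"])
        if client in editor_clients:
            continue  # remove all data from clients who edited
        if len(clients_per_article[r["title"]]) < min_clients:
            continue  # article requested by too few clients
        by_client[client].append((r["ts"], r["title"]))

    # Order each client's requests by timestamp.
    for client_requests in by_client.values():
        client_requests.sort()
    return dict(by_client)
```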

Break requests per client into sessions [code]

break requests from a client into sessions whenever there is a gap of 30 minutes or more between requests
drop sessions with a request to the Main Page
collapse consecutive requests for the same article into a single request
filter out sessions with fewer than 2 requests or more than 30 requests
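
The sessionization steps above can be sketched for a single client as follows. This is an illustrative reimplementation, not the linked code; the function name and the "Main_Page" title (the English Wikipedia spelling) are assumptions.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)


def sessionize(requests, gap=SESSION_GAP, min_len=2, max_len=30, main_page="Main_Page"):
    """Split one client's requests into cleaned sessions.

    requests: list of (timestamp, title) tuples sorted by timestamp.
    """
    # 1. Start a new session whenever the gap between requests is `gap` or more.
    raw_sessions, current, last_ts = [], [], None
    for ts, title in requests:
        if last_ts is not None and ts - last_ts >= gap:
            raw_sessions.append(current)
            current = []
        current.append(title)
        last_ts = ts
    if current:
        raw_sessions.append(current)

    sessions = []
    for titles in raw_sessions:
        # 2. Drop sessions that contain a request to the Main Page.
        if main_page in titles:
            continue
        # 3. Collapse consecutive requests for the same article.
        collapsed = [t for i, t in enumerate(titles) if i == 0 or t != titles[i - 1]]
        # 4. Keep only sessions with between min_len and max_len requests.
        if min_len <= len(collapsed) <= max_len:
            sessions.append(collapsed)
    return sessions
```

The length filter is applied after collapsing, so a session consisting of repeated requests for one article is discarded.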

Train Word2vec [code]

Train Word2vec model using the original C implementation
Save vectors in standard word2vec text format
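
For readers unfamiliar with the word2vec text format: the first line holds the vocabulary size and the vector dimension, and each following line holds a token followed by its space-separated vector components. The parser below is a minimal sketch using only the standard library; the function name and the two-item sample are invented for illustration.

```python
import io


def load_word2vec_text(lines):
    """Parse word2vec text format from an iterable of lines into {token: [floats]}."""
    it = iter(lines)
    vocab_size, dim = (int(x) for x in next(it).split())
    vectors = {}
    for line in it:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        token, values = parts[0], parts[1:]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {token!r}, got {len(values)}")
        vectors[token] = [float(v) for v in values]
    if len(vectors) != vocab_size:
        raise ValueError(f"header declares {vocab_size} tokens, found {len(vectors)}")
    return vectors


# Invented two-item sample in the text format, read from memory instead of a file.
sample = io.StringIO("2 3\nQ42 0.1 0.2 0.3\nQ64 -0.5 0.0 0.25\n")
vectors = load_word2vec_text(sample)
```

For real files, passing an open file handle works the same way. Libraries such as gensim also read this format, for example through KeyedVectors.load_word2vec_format with binary=False.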

To give some sense of the scale of the training data, here are some counts for a typical week's worth of data:

Wikidata Embedding:

Number of sessions for training: 370M
Number of items across all training sessions: 1.4B

English Wikipedia Embedding:

Number of sessions for training: 170M
Number of items across all training sessions: 650M

Hyper-parameter tuning

The word2vec algorithm has several hyper-parameters. To tune the algorithm, we used one month of training data and ran randomized search over the following parameter grid:

size: (50, 100, 200, 300)
window: (1, 2, 4, 6, 10)
sample: (1e-3, 5e-4, 1e-4, 5e-5, 1e-5)
hs: (0,)
negative: (3, 5, 10, 25, 50)
iter: (1, 2, 3, 4, 6, 10)
cbow: (0, 1)
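
Randomized search over this grid amounts to drawing each parameter independently and uniformly from its list of values. The sketch below shows only the sampling step; the function name, the seed, and the number of configurations drawn are hypothetical, since the number of trials actually run is not stated here.

```python
import random

# The parameter grid listed above.
PARAM_GRID = {
    "size": (50, 100, 200, 300),
    "window": (1, 2, 4, 6, 10),
    "sample": (1e-3, 5e-4, 1e-4, 5e-5, 1e-5),
    "hs": (0,),
    "negative": (3, 5, 10, 25, 50),
    "iter": (1, 2, 3, 4, 6, 10),
    "cbow": (0, 1),
}


def sample_configs(grid, n, seed=0):
    """Draw n configurations, choosing each parameter uniformly at random."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return [
        {name: rng.choice(values) for name, values in grid.items()}
        for _ in range(n)
    ]


# Hypothetical: draw 5 candidate configurations to train and evaluate.
configs = sample_configs(PARAM_GRID, 5, seed=1)
```

Each sampled configuration would then be used to train one model, and the models compared on the evaluation metric.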

To evaluate an embedding, we took a random sample of 50k sessions from the week following the month during which the training data was collected. For each session in this evaluation set, we randomly selected a pair of articles and computed the reciprocal rank of the second article in the list of nearest neighbors of the first article. The evaluation score is the mean of these reciprocal ranks (MRR) over all pairs. The best model attained an MRR of 0.166.
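
The metric can be sketched as follows, assuming neighbors are ranked by cosine similarity over the whole vocabulary. The function names and toy vectors are invented, and skipping pairs with an out-of-vocabulary article is an assumption of this sketch; how such pairs were handled in the actual evaluation is not stated here.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def reciprocal_rank(vectors, query, target):
    """1 / rank of `target` among the nearest neighbors of `query`."""
    q = vectors[query]
    target_sim = cosine_similarity(q, vectors[target])
    # Rank = 1 + number of other items strictly more similar to the query.
    better = sum(
        1
        for name, vec in vectors.items()
        if name not in (query, target) and cosine_similarity(q, vec) > target_sim
    )
    return 1.0 / (better + 1)


def mean_reciprocal_rank(vectors, pairs):
    """Average reciprocal rank over pairs; out-of-vocabulary pairs are skipped."""
    scores = [
        reciprocal_rank(vectors, a, b)
        for a, b in pairs
        if a in vectors and b in vectors
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy 2-dimensional embedding and evaluation pairs, invented for illustration.
toy = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0], "D": [0.5, 0.5]}
mrr = mean_reciprocal_rank(toy, [("A", "B"), ("A", "C")])
```

In the toy data, B is the nearest neighbor of A (reciprocal rank 1) and C is the third nearest (reciprocal rank 1/3), so the mean is 2/3.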

Releases

On Figshare, there are currently releases for models trained on the requests from the following timespans:

2016-09-01 through 2016-09-30
2016-08-01 through 2016-08-31
2016-03-01 through 2016-03-07

Each release contains an embedding for English Wikipedia and an embedding for Wikidata for different embedding dimensions. The embedding file names have the following structure:

{start}_{stop}_{type}_{dimension}

To give an example, the "2016-03-01_2016-03-07_wikidata_100" file contains a 100-dimensional Wikidata item embedding that was trained on data from the first week of March 2016.
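
A file name with this structure can be split back into its parts as sketched below. The EmbeddingName type and parse_embedding_name function are hypothetical helpers, and the sketch assumes the name carries no file extension, as in the example above.

```python
from datetime import date
from typing import NamedTuple


class EmbeddingName(NamedTuple):
    start: date
    stop: date
    kind: str       # the {type} part, e.g. "wikidata"
    dimension: int


def parse_embedding_name(name):
    """Parse '{start}_{stop}_{type}_{dimension}' into its components."""
    start, stop, rest = name.strip().split("_", 2)
    # Split the dimension off the right so a type containing "_" still parses.
    kind, dimension = rest.rsplit("_", 1)
    return EmbeddingName(
        date.fromisoformat(start),
        date.fromisoformat(stop),
        kind,
        int(dimension),
    )


info = parse_embedding_name("2016-03-01_2016-03-07_wikidata_100")
```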

Note: Since a lot of readership on Wikipedia is driven by trending topics in the media, you can expect the embeddings for articles relating to media events to change based on these trends. For example, the nearest neighbor for Hillary Clinton may be Bernie Sanders in one month and Donald Trump in the next, depending on what is happening in the presidential campaign race. For articles about less trendy topics, the nearest neighbors should be fairly stable across releases.

Applications

Here are some ideas for how to use these embeddings to improve Wikipedia.

Translation Recommendations

We recently created a tool for recommending articles for translation between Wikipedias. Users choose a source language to translate from and a target language to translate to. Then they choose a seed article in the source language that represents their interests. We can find articles in the source language missing in the target language using Wikidata sitelinks, and then rank the missing articles according to the similarity between their vectors and the seed vector.
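
The ranking step described above can be sketched like this. It is not the tool's actual code: the function name, the data layout (a dict of item vectors and a dict mapping each item to the set of wikis that have an article for it), and the toy values are all assumptions.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend_for_translation(vectors, sitelinks, seed, source, target, k=3):
    """Rank items present in `source` but missing in `target` by similarity to `seed`.

    vectors: {item: vector}; sitelinks: {item: set of wiki codes with an article}.
    """
    seed_vec = vectors[seed]
    missing = [
        item
        for item, wikis in sitelinks.items()
        if item != seed and item in vectors and source in wikis and target not in wikis
    ]
    missing.sort(key=lambda item: cosine_similarity(seed_vec, vectors[item]), reverse=True)
    return missing[:k]


# Toy data, invented for illustration.
toy_vectors = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.1, 0.9], "D": [0.8, 0.3]}
toy_sitelinks = {
    "A": {"enwiki", "eswiki"},
    "B": {"enwiki"},
    "C": {"enwiki"},
    "D": {"enwiki", "eswiki"},
}
recs = recommend_for_translation(toy_vectors, toy_sitelinks, "A", "enwiki", "eswiki")
```

Here D is excluded because it already exists in the target wiki, and B is ranked above C because it is closer to the seed.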

Reading Recommendations

The Reading Team at WMF recently introduced a "Related Pages" feature that gives readers 3 recommendations for further reading. The current recommendations are generated by the More Like This query feature in Elasticsearch.

Instead, we could generate recommendations for further reading by looking up the nearest neighbors of the article the reader is currently on in an embedding. The advantage of this approach is that the nearest neighbors are by definition articles that tend to be read together. Furthermore, the Wikidata embedding would allow us to use a single model to generate recommendations across all languages! Here is a demo of how this could work.

Link Recommendations

If articles are frequently read within the same session, you might be able to make Wikipedia easier to navigate if you were to create a link between them. For a given article you could generate recommendations for links to add by finding the nearest neighbors that are not already linked and adding a link if the original article has a suitable anchor text. Again, the Wikidata embedding would allow you to build a model for all languages.
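
A rough sketch of the candidate-generation step: take the nearest neighbors of an article and drop those it already links to. The function name, the outlinks layout, and the toy data are assumptions, and the check for a suitable anchor text is left out because it requires the article text.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend_links(vectors, outlinks, article, k=5):
    """Nearest neighbors of `article` that it does not already link to.

    vectors: {article: vector}; outlinks: {article: set of linked articles}.
    """
    query = vectors[article]
    already_linked = outlinks.get(article, set())
    candidates = [
        (cosine_similarity(query, vec), name)
        for name, vec in vectors.items()
        if name != article and name not in already_linked
    ]
    candidates.sort(reverse=True)  # most similar first
    return [name for _, name in candidates[:k]]


# Toy data, invented for illustration.
toy_vectors = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0], "D": [0.5, 0.5]}
toy_outlinks = {"A": {"B"}}
suggestions = recommend_links(toy_vectors, toy_outlinks, "A", k=2)
```

B is the closest neighbor of A in the toy data but is skipped because A already links to it.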

External links

Wikipedia Vectors, figshare