User:MGerlach (WMF)/covid related pages reading sessions/Methodology list covid reading sessions

This page contains a detailed description of the methodology used to create the list of articles related to COVID-19 based on reading sessions on Wikipedia.

The methodology follows the approach developed in a previous research project on Navigation vectors.

Methodology

The methodology consists of the following steps:

  1. Reading sessions. We start from the webrequest logs and extract reading sessions of consecutive pageviews from the same unique device (client-ip and user_agent). Specifically, we apply the following filters:
    • Keep requests to wikipedias of all languages (project-family= wikipedia)
    • Remove identified bots (agent_type=user)
    • Only desktop and mobile web (access_method != mobile app)
    • Remove sessions which contain an edit-attempt
    • Remove sessions with more than 100 pageviews per day (e.g. automated traffic)
    • Only keep pageviews to articles in the main namespace
  2. Mapping to Wikidata items. We then apply additional filters to the reading sessions:
    • We map pageviews of articles to their Wikidata items in order to aggregate sessions from all Wikipedias
    • We remove sessions which contain a view of the Main Page (Q5296, the main page of a Wikimedia project)
    • We collapse consecutive pageviews of the same page into 1 pageview
    • We cut sessions if the time between 2 pageviews is larger than 1 hour
    • We only keep sessions containing 2 to 30 pageviews (a sketch of this session post-processing follows the list below)
  3. Word2vec. We run word2vec on the sequences of Wikidata items. Word2vec is an approach from Natural Language Processing that uses neural networks to find the semantic structure in large corpora of text. The output of this method is a word embedding -- i.e. each word is a vector in a, say, 100-dimensional space -- where words with similar meanings are close in that space. Here, we apply this approach to reading sessions by considering pages (or their corresponding Wikidata items) as words and the sequence of pageviews as a document. In the resulting article embedding, we can then easily identify articles which occur in the same context (i.e. the same sessions) by looking at the nearest neighbors. We use the gensim implementation with mostly default parameters, except that we remove all Wikidata items which appear fewer than 20 times across all sessions (a minimal training sketch follows the list below). We train the model on 1 week of reading sessions.
  4. Generating a list of articles. Our starting point are the two main Wikidata items related to COVID-19: Q81068910 (the pandemic) and Q84263196 (the disease). We identify the most related Wikidata items by calculating the maximum cosine similarity of each item to the two COVID-19 items and selecting the 1000 most related concepts. Using the Wikibase API, we query i) the label and description of each item in English, ii) the number of wikis for which there is an article (i.e. the number of sitelinks which end in ‘wiki’ and do not contain ‘_’), and iii) the article title (if there is one) for 10 Wikipedia editions (currently: enwiki, dewiki, frwiki, eswiki, ruwiki, zhwiki, ptwiki, arwiki, bnwiki, hiwiki; this choice is motivated by a trade-off between i) official languages of the United Nations, ii) size of the edition in terms of number of active users, iii) editions considered in the WikiGap_Challenge, and iv) focusing on not more than 10 editions in order to limit the number of calls to the pageview API). We remove the following items:
    • Items with no label and no description (English) and no article (sitelink) in any of the selected wikis.
    • Disambiguation pages, i.e. where the English label or description contains the string ‘disambiguation’.
    • Items whose English label contains any of the strings corona/covid/cov, in order to remove articles that are obviously related to COVID-19.
  We then select the 200 items with the highest scores and retrieve the number of pageviews for the article in the respective edition during the week under consideration, as well as the change with respect to the previous week, using the Wikimedia pageview API (see the ranking sketch below). Note that these numbers constitute only a lower limit on the number of pageviews, since we are not taking into account pageviews to articles that redirect to the article under consideration (this is a technical issue and should be resolved in later iterations).
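
The session post-processing of step 2 can be illustrated with a minimal Python sketch. It assumes each raw session is already a time-ordered list of (wikidata_item, timestamp) pairs extracted from the webrequest logs, with timestamps in Unix seconds; the function and variable names are illustrative and not part of the actual pipeline.

 from itertools import groupby

 MAIN_PAGE_ITEM = "Q5296"   # Wikidata item of the Main Page
 MAX_GAP_SECONDS = 3600     # cut sessions on gaps between pageviews > 1 hour
 MIN_LEN, MAX_LEN = 2, 30   # only keep sessions with 2 to 30 pageviews

 def clean_sessions(raw_sessions):
     """Collapse repeats, split on long gaps, and apply the length/Main-Page filters."""
     for session in raw_sessions:
         # drop sessions that contain a view of the Main Page
         if any(item == MAIN_PAGE_ITEM for item, _ in session):
             continue
         # collapse consecutive pageviews of the same page into one pageview
         collapsed = [next(group) for _, group in groupby(session, key=lambda x: x[0])]
         # cut the session whenever the gap between two pageviews exceeds one hour
         chunk = [collapsed[0]]
         for prev, cur in zip(collapsed, collapsed[1:]):
             if cur[1] - prev[1] > MAX_GAP_SECONDS:
                 if MIN_LEN <= len(chunk) <= MAX_LEN:
                     yield [item for item, _ in chunk]
                 chunk = []
             chunk.append(cur)
         if MIN_LEN <= len(chunk) <= MAX_LEN:
             yield [item for item, _ in chunk]

 # sessions = list(clean_sessions(raw_sessions))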
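The training in step 3 uses the gensim implementation of word2vec. The following is a minimal sketch assuming gensim 4.x and a list `sessions` of cleaned reading sessions (each a list of Wikidata item IDs); apart from min_count=20, the hyperparameters shown are gensim defaults rather than documented project settings.

 from gensim.models import Word2Vec

 model = Word2Vec(
     sentences=sessions,   # 1 week of cleaned reading sessions
     vector_size=100,      # dimension of the article embedding
     min_count=20,         # drop items appearing fewer than 20 times across all sessions
     workers=4,
 )

 # items that frequently occur in the same sessions end up close in the embedding
 print(model.wv.most_similar("Q84263196", topn=10))  # nearest neighbors of the COVID-19 item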
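The ranking and metadata lookup in step 4 can be sketched as follows, assuming `model` is the trained word2vec model from the previous sketch. Only the first batch of 50 items is fetched here; batching over all 1000 items and error handling are left out for brevity.

 import requests

 SEEDS = ["Q81068910", "Q84263196"]  # the pandemic and the disease

 # maximum cosine similarity of every item in the vocabulary to the two seed items
 scores = {
     item: max(model.wv.similarity(item, seed) for seed in SEEDS)
     for item in model.wv.index_to_key
     if item not in SEEDS
 }
 top_items = sorted(scores, key=scores.get, reverse=True)[:1000]

 # English labels/descriptions and sitelinks for the first batch of 50 items
 resp = requests.get(
     "https://www.wikidata.org/w/api.php",
     params={
         "action": "wbgetentities",
         "ids": "|".join(top_items[:50]),
         "props": "labels|descriptions|sitelinks",
         "languages": "en",
         "format": "json",
     },
 ).json()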


Limitations/Biases

Even though we are modeling reading sessions in a language-agnostic way by considering each article’s Wikidata item, traffic is highly skewed across different Wikipedia editions and countries. For example, of the 23B pageviews to all wikis in March 2020, 40% concentrated on enwiki and 17% came from the USA. This means that data for certain projects and from certain regions is much more prevalent and, as a result, will have a larger effect on the aggregated statistics. One way to account for this would be to downsample the data (e.g. removing some fraction of the reading sessions from enwiki) in order to give more weight to data from other wikis or regions (an illustrative sketch follows below). Another way would be to do a separate analysis for each wiki or each country/region and explore how the results change. However, at the moment we do not know the magnitude of this effect, and investigating it would constitute a separate project.
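
As an illustration of the downsampling idea, the sketch below removes a fraction of sessions attributed to enwiki. Both the keep-probability and the assumption that each session can be attributed to a single wiki are hypothetical and not part of the current pipeline.

 import random

 KEEP_PROB = 0.5  # hypothetical: keep only half of the enwiki sessions

 def downsample(sessions_with_wiki):
     """sessions_with_wiki: iterable of (wiki, session) pairs."""
     for wiki, session in sessions_with_wiki:
         if wiki == "enwiki" and random.random() > KEEP_PROB:
             continue  # drop this enwiki session
         yield session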

Why word2vec?

In order to find relations between pages based on reading sessions, we apply word2vec -- this performs a dimensionality reduction and maps every page into a 100-dimensional vector space in which we can then calculate distances between pages; the hypothesis being that pages that are close in this space also tend to co-occur in the same reading sessions.

It is natural to ask whether all of that is really necessary -- we could simply retrieve the top-1000 (or so) pages that most often co-occur with the COVID-19 core articles (a sketch of this baseline follows below). We can then compare the lists from word2vec and from the co-occurrence statistics.
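
A sketch of this co-occurrence baseline, assuming the same list `sessions` of cleaned reading sessions (lists of Wikidata item IDs) used for training word2vec; counting each co-occurring item once per session is an assumption made here for illustration.

 from collections import Counter

 SEEDS = {"Q81068910", "Q84263196"}

 cooc = Counter()
 for session in sessions:
     items = set(session)
     if items & SEEDS:               # session contains at least one COVID-19 seed item
         cooc.update(items - SEEDS)  # count every other item once per session

 top_cooc = [item for item, _ in cooc.most_common(1000)]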

There is some overlap but we see substantial differences: less than half of the articles can be found in both sets, and the Jaccard index (intersection over union) is below 0.35 for the top-k articles (see the sketch below). Note that this is not just due to random fluctuations from sampling -- when comparing lists generated with the same method from data of two different days, the Jaccard index is much higher (~0.8). It is not completely clear why this is the case, but my hypothesis is that the list from co-occurrence statistics prefers generic articles that are simply popular overall, and thus co-occur with many things, but are not necessarily more closely connected to COVID-19 reading sessions.
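
The comparison itself is a straightforward Jaccard index between the two top-k lists; the sketch below assumes `top_items` (word2vec) and `top_cooc` (co-occurrence) from the sketches above.

 def jaccard(a, b, k=200):
     """Jaccard index (intersection over union) of the top-k entries of two ranked lists."""
     sa, sb = set(a[:k]), set(b[:k])
     return len(sa & sb) / len(sa | sb)

 print(jaccard(top_items, top_cooc))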