Research:Characterizing Readers Navigation/Modeling Reading Sessions: First Round of Analysis
In the first round of analysis we explore the potential of modeling reading sessions with sequential models such as LSTM.
Main findings:
- LSTM models yield large improvements in next-article prediction on the datasets for which we were able to train the model; for example, on hiwiki recall@1 more than doubles compared to morelike, from 0.139 to 0.280
- The previously used Research:Wikipedia_Navigation_Vectors, based on word2vec, perform worse than or on par with the text-based morelike used in the RelatedArticles feature. This is consistent with a previous qualitative evaluation with human raters ([[1]])
Problems and future directions:
- Scalability to larger wikis: In order to train the LSTM, we use the GPUs on stat1005 and stat1008. With a single GPU we can only train on smaller datasets (say, up to 1M sessions). As a result, the LSTM is currently not suitable for large wikis, in which we need a larger number of sessions to get sufficient coverage. This motivates several approaches:
- introduce approximations, e.g. in the softmax layer, in order to reduce the computational cost of training
- targeted sampling: we know that pageviews are very unevenly distributed (i.e. some pages occur orders of magnitude more often than others), so simply increasing the amount of data might not add much new information/signal. Instead, one could preferentially sample sessions containing articles with lower coverage (see the sketch after this list).
- add additional information from the underlying generative process, such as the information about available links or layout information.
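As an illustration of the targeted-sampling idea, a minimal sketch in Python; the weighting scheme (inverse frequency of a session's rarest article) and all names are illustrative assumptions, not the procedure actually used:

```python
import random
from collections import Counter

def sample_sessions(sessions, n_samples, seed=0):
    """Preferentially sample sessions containing low-coverage articles.

    A session's weight is the inverse frequency of its rarest article,
    so sessions with rarely seen pages are more likely to be drawn.
    Sampling is with replacement.
    """
    rng = random.Random(seed)
    # Count how often each article appears across all sessions.
    counts = Counter(page for session in sessions for page in session)
    # Weight each session by the inverse count of its least frequent article.
    weights = [1.0 / min(counts[page] for page in session) for session in sessions]
    return rng.choices(sessions, weights=weights, k=n_samples)
```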
Motivation
Previous research on Research:Wikipedia_Navigation_Vectors modeled reading sessions to generate embeddings of articles which capture their semantic similarity based on reader interests. One possible application was to use these embeddings to generate recommendations for which articles to read next. In a qualitative evaluation, such recommendations were compared with text-based recommendations from the RelatedArticles extension (based on the morelike search); the latter were judged more useful by human raters, showing that RelatedArticles constitutes a hard-to-beat baseline.
In order to learn the embeddings of articles from reading sessions, the navigation vectors used word2vec, a common approach in natural language processing to capture the semantic similarity of words (here, articles) in large collections of documents (here, reading sessions). However, one of the main limitations of this model is that it does not explicitly take into account the sequential information of the reading session (only indirectly, via a context window). We hypothesize that the sequential process plays an important role in understanding and modeling reading sessions.
Therefore, we aim to model reading sessions using explicitly sequential models. One of the most well-known approaches from natural language processing is the LSTM (long short-term memory network).
Methods
Data
We extract reading sessions for 14 different Wikipedias (the same as in Why the world reads Wikipedia) from 1 week of webrequest logs. Specifically, we follow the basic approach described in Research:Wikipedia_Navigation_Vectors#Data_Preparation, i.e.
- keeping only requests which are pageviews in main namespace
- remove pageviews from identified bots (we filter sessions with more than 100 pageviews per day as a proxy for other automated traffic)
- keep only sessions from desktop and mobile-web
- remove sessions which contain an edit-attempt
- remove sessions which contain the main-page of the respective wikipedia
- cut reading sessions if time between consecutive pageviews is longer than 1 hour (see Halfaker et al. 2015)
- keep only sessions with 2 to 30 pageviews
We randomly split the data into train, dev, and test sets (80%/10%/10%) at the level of reading sessions.
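A minimal sketch of the session-cutting and splitting steps described above, assuming pageviews are already grouped per user and sorted by time; all function names are illustrative:

```python
from datetime import timedelta
import random

MAX_GAP = timedelta(hours=1)

def split_into_sessions(pageviews, max_gap=MAX_GAP):
    """Cut a user's time-ordered (title, timestamp) pageviews into sessions
    whenever the gap between consecutive pageviews exceeds max_gap (1 hour),
    keeping only sessions with 2 to 30 pageviews."""
    sessions, current, last_ts = [], [], None
    for title, ts in pageviews:
        if last_ts is not None and ts - last_ts > max_gap:
            sessions.append(current)
            current = []
        current.append(title)
        last_ts = ts
    if current:
        sessions.append(current)
    return [s for s in sessions if 2 <= len(s) <= 30]

def train_dev_test_split(sessions, seed=0):
    """Random 80/10/10 split at the level of reading sessions."""
    rng = random.Random(seed)
    shuffled = sessions[:]
    rng.shuffle(shuffled)
    n_train, n_dev = int(0.8 * len(shuffled)), int(0.1 * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])
```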
Basic statistics of the datasets.
Models and baselines
We compare three models in total:
- LSTM, a sequential model to generate embeddings from reading sessions
- word2vec, the model used in the navigation vectors to generate embeddings from reading sessions
- morelike, the text-based model used to find similar articles for recommendations in the RelatedArticles feature.
Models are trained on the train set; hyperparameters are tuned on the dev set; performance is evaluated on the test set. Note that morelike requires no additional training and is only evaluated on the test set.
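For illustration, a minimal sketch of an LSTM next-article model of the kind described above, assuming PyTorch; the embedding and hidden sizes are placeholders, not the hyperparameters used in this analysis:

```python
import torch
import torch.nn as nn

class SessionLSTM(nn.Module):
    """Embeds articles, runs an LSTM over the session prefix, and predicts
    a distribution over all articles for the next position."""
    def __init__(self, n_articles, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_articles, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_articles)

    def forward(self, article_ids):
        # article_ids: (batch, seq_len) integer-encoded article sequences
        x = self.embed(article_ids)
        h, _ = self.lstm(x)
        return self.out(h)  # (batch, seq_len, n_articles) logits

# Training would use cross-entropy between the logits at position i and the
# article actually read at position i+1 (teacher forcing).
```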
Evaluation
We evaluate the models on the task of next-article prediction. For a session of length L in the test set, we pick a random position i_source ∈ {1, ..., L-1} and aim to predict the article at position i_target = i_source + 1 (the articles at i_source and i_target are the source and target articles, respectively). We assign a rank to each prediction by comparing the target article with the list of articles predicted from the source article, i.e. if the target article is the 5th most likely recommendation we assign rank = 5. We then calculate
- mean reciprocal rank (MRR): the average of the inverse rank (i.e. 1/MRR corresponds to the harmonic mean of the ranks)
- recall@k: the fraction of test-cases for which the rank is <= k (i.e. recall@1 is the fraction of times the target-article was the most likely prediction based on the source article).
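A minimal sketch of how these metrics can be computed from the per-test-case ranks (assuming NumPy and 1-based ranks; names are illustrative):

```python
import numpy as np

def mrr_and_recall(ranks, ks=(1, 10, 100)):
    """Compute MRR and recall@k from the rank of each target article
    in the model's ranked predictions for the corresponding source article."""
    ranks = np.asarray(ranks, dtype=float)
    metrics = {"MRR": float(np.mean(1.0 / ranks))}
    for k in ks:
        metrics[f"recall@{k}"] = float(np.mean(ranks <= k))
    return metrics

# Example: ranks of the target article for three test cases
print(mrr_and_recall([1, 5, 120]))
# {'MRR': ~0.403, 'recall@1': ~0.33, 'recall@10': ~0.67, 'recall@100': ~0.67}
```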
Results
We evaluate the different models on several wikis using 4 different metrics (MRR, recall@1, recall@10, and recall@100).
Main results:
- Note that we were able to finish training the LSTM only for 3 smaller wikis (hewiki, hiwiki, huwiki); these appear at the top of the table below
- in each of these cases the LSTM substantially outperforms the other two baselines
- Word2vec sometimes performs worse (or only slightly better) than the text-based morelike, which does not use any training (hewiki, hiwiki, bnwiki, eswiki, nlwiki, rowiki, ukwiki). This could be due to several reasons:
- not enough training data (mostly smaller wikis)
- substantial variation across wikis (regardless of the model), e.g. hiwiki vs ukwiki.
- it is very likely that this is due to the different composition of mobile vs desktop traffic. hiwiki has an exceptionally large fraction of readers on mobile web (in some months it accounts for 80-90% of pageviews, see wikistats), while for other wikis it is much lower (e.g. ukwiki, where the number of desktop readers is higher than the number of mobile readers, see wikistats)
- (to be added in more detail below): in fact, we can show that reading sessions are more predictable on mobile than on desktop when separating by access method, i.e. performance in next-article prediction is higher for mobile sessions than for desktop sessions. This is consistent with the purely empirical observation that, given the same number of sessions, desktop sessions contain a higher diversity of pages than mobile sessions (in terms of the number of different pages visited in those sessions). This cannot be attributed solely to the RelatedArticles feature (which shows 3 related articles for further reading at the bottom of articles in the mobile version), since we observe similar patterns for dewiki, where the feature is not enabled by default.
Wiki | Model | MRR | Recall@1 | Recall@10 | Recall@100
---|---|---|---|---|---
hewiki | Morelike | 0.113305 | 0.06 | 0.217 | 0.406
hewiki | Word2vec | 0.1291 | 0.07892 | 0.23134 | 0.38979
hewiki | LSTM | 0.241439 | 0.158223 | 0.41163 | 0.577477
hiwiki | Morelike | 0.224332 | 0.139 | 0.377 | 0.488
hiwiki | Word2vec | 0.188405 | 0.113174 | 0.345736 | 0.56412
hiwiki | LSTM | 0.37979 | 0.278942 | 0.57975 | 0.757443
huwiki | Morelike | 0.114129 | 0.064 | 0.216 | 0.379
huwiki | Word2vec | 0.127071 | 0.07669 | 0.2294 | 0.40457
huwiki | LSTM | 0.24379 | 0.160018 | 0.415452 | 0.617054
arwiki | Morelike | 0.125015 | 0.075 | 0.221 | 0.386
arwiki | Word2vec | 0.136778 | 0.08385 | 0.24293 | 0.42
bnwiki | Morelike | 0.193514 | 0.117 | 0.346 | 0.493
bnwiki | Word2vec | 0.132293 | 0.078647 | 0.244514 | 0.435805
dewiki | Morelike | 0.089064 | 0.05 | 0.173 | 0.318
dewiki | Word2vec | 0.131105 | 0.08184 | 0.23067 | 0.38692
enwiki | Morelike | 0.108866 | 0.054 | 0.212 | 0.39
enwiki | Word2vec | 0.151636 | 0.09491 | 0.26648 | 0.43196
eswiki | Morelike | 0.12912 | 0.077 | 0.233 | 0.402
eswiki | Word2vec | 0.131733 | 0.08121 | 0.2362 | 0.40848
jawiki | Morelike | 0.106265 | 0.058 | 0.206 | 0.369
jawiki | Word2vec | 0.151647 | 0.09849 | 0.25689 | 0.41933
nlwiki | Morelike | 0.111285 | 0.06 | 0.21 | 0.393
nlwiki | Word2vec | 0.131795 | 0.08114 | 0.23303 | 0.38997
rowiki | Morelike | 0.137073 | 0.078 | 0.248 | 0.417
rowiki | Word2vec | 0.129519 | 0.078076 | 0.235162 | 0.418069
ruwiki | Morelike | 0.110369 | 0.065 | 0.211 | 0.365
ruwiki | Word2vec | 0.134226 | 0.08197 | 0.24137 | 0.41118
ukwiki | Morelike | 0.092121 | 0.052 | 0.174 | 0.341
ukwiki | Word2vec | 0.104915 | 0.06289 | 0.19127 | 0.36027
zhwiki | Morelike | 0.095617 | 0.056 | 0.182 | 0.337
zhwiki | Word2vec | 0.154163 | 0.09714 | 0.27255 | 0.44906