Research:Characterizing Readers Navigation/Modeling Reading Sessions: First Round of Analysis
In the first round of analysis we explore the potential of modeling reading sessions with sequential models such as LSTM.
Main findings:
- LSTM models yield large improvements in next-article prediction on the datasets for which we were able to train the model; for example, on hiwiki recall@1 more than doubles compared to morelike, from 0.139 to 0.280
- The previously used Research:Wikipedia_Navigation_Vectors, based on word2vec, perform worse than or on par with the text-based morelike used in the RelatedArticles feature. This is consistent with a previous qualitative evaluation with human raters ([[1]])
Problems and future directions:
- Scalability to larger wikis: In order to train the LSTM, we use the GPUs on stat1005 and stat1008. With a single GPU we can only train on smaller datasets (say, up to 1M sessions). As a result, the LSTM is currently not suitable for large wikis, in which we need a larger number of sessions to get sufficient coverage. This motivates several approaches:
- introduce approximations, e.g. in the softmax layer, in order to reduce the computational cost of training
- targeted sampling: we know that pageviews are very unevenly distributed (i.e. some pages occur orders of magnitude more often than others), so simply increasing the amount of data might not add much new information/signal. Instead, one could preferentially sample sessions containing articles with lower coverage (see the sketch after this list).
- add additional information from the underlying generative process, such as the information about available links or layout information.
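As an illustration of the targeted-sampling idea, a minimal sketch in Python; the weighting scheme (inverse frequency of a session's rarest article) and all names are illustrative assumptions, not the procedure actually used:

```python
import random
from collections import Counter

def sample_sessions(sessions, n_samples, seed=0):
    """Preferentially sample sessions containing low-coverage articles.

    A session's weight is the inverse frequency of its rarest article,
    so sessions with rarely seen pages are more likely to be drawn.
    Sampling is with replacement.
    """
    rng = random.Random(seed)
    # Count how often each article appears across all sessions.
    counts = Counter(page for session in sessions for page in session)
    # Weight each session by the inverse count of its least frequent article.
    weights = [1.0 / min(counts[page] for page in session) for session in sessions]
    return rng.choices(sessions, weights=weights, k=n_samples)
```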
Motivation
Previous research on Research:Wikipedia_Navigation_Vectors modeled reading sessions to generate embeddings of articles which capture their semantic similarity based on reader interests. One possible application was to use these embeddings to generate recommendations for which articles to read next. In a qualitative evaluation, such recommendations were compared with text-based recommendations from the RelatedArticles extension (based on the morelike search); the latter were judged more useful by human raters, showing that RelatedArticles constitutes a hard-to-beat baseline.
In order to learn the embeddings of articles from reading sessions, the navigation vectors used word2vec, a common approach in natural language processing to capture the semantic similarity of words (here, articles) in large collections of documents (here, reading sessions). However, one of the main limitations of this model is that it does not explicitly take into account the sequential information of the reading session (only indirectly, via a context window). We hypothesize that the sequential process plays an important role in understanding and modeling reading sessions.
Therefore, we aim to model reading sessions using explicitly sequential models. One of the most well-known approaches from natural language processing is the LSTM (long short-term memory network).
Methods
Data
We extract reading sessions for 14 different Wikipedias (the same as in Why the world reads Wikipedia) from 1 week of webrequest logs. Specifically, we follow the basic approach described in Research:Wikipedia_Navigation_Vectors#Data_Preparation, i.e.
- keeping only requests which are pageviews in main namespace
- remove pageviews from identified bots (we filter sessions with more than 100 pageviews per day as a proxy for other automated traffic)
- keep only sessions from desktop and mobile-web
- remove sessions which contain an edit-attempt
- remove sessions which contain the main-page of the respective wikipedia
- cut reading sessions if time between consecutive pageviews is longer than 1 hour (see Halfaker et al. 2015)
- keep only sessions with 2 to 30 pageviews
We randomly split the data into train, dev, and test sets (80%/10%/10%) at the level of reading sessions.
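A minimal sketch of the session-cutting and splitting steps described above, assuming pageviews are already grouped per user and sorted by time; all function names are illustrative:

```python
from datetime import timedelta
import random

MAX_GAP = timedelta(hours=1)

def split_into_sessions(pageviews, max_gap=MAX_GAP):
    """Cut a user's time-ordered (title, timestamp) pageviews into sessions
    whenever the gap between consecutive pageviews exceeds max_gap (1 hour),
    keeping only sessions with 2 to 30 pageviews."""
    sessions, current, last_ts = [], [], None
    for title, ts in pageviews:
        if last_ts is not None and ts - last_ts > max_gap:
            sessions.append(current)
            current = []
        current.append(title)
        last_ts = ts
    if current:
        sessions.append(current)
    return [s for s in sessions if 2 <= len(s) <= 30]

def train_dev_test_split(sessions, seed=0):
    """Random 80/10/10 split at the level of reading sessions."""
    rng = random.Random(seed)
    shuffled = sessions[:]
    rng.shuffle(shuffled)
    n_train, n_dev = int(0.8 * len(shuffled)), int(0.1 * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])
```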
Basic statistics of the datasets.
Models and baselines
We compare three models in total:
- LSTM, a sequential model to generate embeddings from reading sessions
- word2vec, the model used in the navigation vectors to generate embeddings from reading sessions
- morelike, the text-based model used to find similar articles for recommendations in the RelatedArticles feature.
Models are trained on the train set; hyperparameters are tuned on the dev set; performance is evaluated on the test set. Note that morelike requires no additional training and is only evaluated on the test set.
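For illustration, a minimal sketch of an LSTM next-article model of the kind described above, assuming PyTorch; the embedding and hidden sizes are placeholders, not the hyperparameters used in this analysis:

```python
import torch
import torch.nn as nn

class SessionLSTM(nn.Module):
    """Embeds articles, runs an LSTM over the session prefix, and predicts
    a distribution over all articles for the next position."""
    def __init__(self, n_articles, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_articles, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_articles)

    def forward(self, article_ids):
        # article_ids: (batch, seq_len) integer-encoded article sequences
        x = self.embed(article_ids)
        h, _ = self.lstm(x)
        return self.out(h)  # (batch, seq_len, n_articles) logits

# Training would use cross-entropy between the logits at position i and the
# article actually read at position i+1 (teacher forcing).
```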
Evaluation
We evaluate the models on the task of next-article prediction. For a session of length L in the test set, we pick a random position i_source ∈ {1, ..., L-1} and aim to predict the article at position i_target = i_source + 1 (the articles at i_source and i_target are the source and target articles, respectively). We assign a rank to each prediction by comparing the target article with the list of articles predicted from the source article, i.e. if the target article is the 5th most likely recommendation we assign rank = 5. We then calculate
- mean reciprocal rank (MRR): the average of the inverse rank (i.e. 1/MRR corresponds to the harmonic mean of the ranks)
- recall@k: the fraction of test-cases for which the rank is <= k (i.e. recall@1 is the fraction of times the target-article was the most likely prediction based on the source article).
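A minimal sketch of how these metrics can be computed from the per-test-case ranks (assuming NumPy and 1-based ranks; names are illustrative):

```python
import numpy as np

def mrr_and_recall(ranks, ks=(1, 10, 100)):
    """Compute MRR and recall@k from the rank of each target article
    in the model's ranked predictions for the corresponding source article."""
    ranks = np.asarray(ranks, dtype=float)
    metrics = {"MRR": float(np.mean(1.0 / ranks))}
    for k in ks:
        metrics[f"recall@{k}"] = float(np.mean(ranks <= k))
    return metrics

# Example: ranks of the target article for three test cases
print(mrr_and_recall([1, 5, 120]))
# {'MRR': ~0.403, 'recall@1': ~0.33, 'recall@10': ~0.67, 'recall@100': ~0.67}
```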
Results
We evaluate the different models on several wikis using 4 different metrics (MRR, recall@1, recall@10, and recall@100).
Main results:
- Note that we were able to finish training the LSTM only for 3 smaller wikis (hewiki, hiwiki, huwiki); these appear at the top of the table below
- in each of these cases the LSTM substantially outperforms the other two baselines
- Word2vec sometimes performs worse (or only slightly better) than the text-based morelike, which does not use any training (hewiki, hiwiki, bnwiki, eswiki, nlwiki, rowiki, ukwiki). This could be due to several reasons:
- not enough training data (mostly smaller wikis)
- substantial variation across wikis (regardless of the model), e.g. hiwiki vs ukwiki.
- it is very likely that this is due to the different composition of mobile vs desktop traffic. hiwiki has an exceptionally large fraction of readers on mobile web (in some months it accounts for 80-90% of pageviews, see wikistats), while for other wikis it is much lower (e.g. ukwiki, where the number of desktop readers is higher than the number of mobile readers, see wikistats)
- (to be added in more detail below): in fact, we can show that reading sessions are more predictable on mobile than on desktop when separating by access method, i.e. performance in next-article prediction is higher for mobile sessions than for desktop sessions. This is consistent with the purely empirical observation that, given the same number of sessions, desktop sessions contain a higher diversity of pages than mobile sessions (in terms of the number of different pages visited in those sessions). This cannot be attributed solely to the RelatedArticles feature (which shows 3 related articles for further reading at the bottom of articles in the mobile version), since we observe similar patterns for dewiki, where the feature is not enabled by default.
Wiki | Model | MRR | Recall@1 | Recall@10 | Recall@100
---|---|---|---|---|---
hewiki | Morelike | 0.113305 | 0.06 | 0.217 | 0.406
hewiki | Word2vec | 0.1291 | 0.07892 | 0.23134 | 0.38979
hewiki | LSTM | 0.241439 | 0.158223 | 0.41163 | 0.577477
hiwiki | Morelike | 0.224332 | 0.139 | 0.377 | 0.488
hiwiki | Word2vec | 0.188405 | 0.113174 | 0.345736 | 0.56412
hiwiki | LSTM | 0.37979 | 0.278942 | 0.57975 | 0.757443
huwiki | Morelike | 0.114129 | 0.064 | 0.216 | 0.379
huwiki | Word2vec | 0.127071 | 0.07669 | 0.2294 | 0.40457
huwiki | LSTM | 0.24379 | 0.160018 | 0.415452 | 0.617054
arwiki | Morelike | 0.125015 | 0.075 | 0.221 | 0.386
arwiki | Word2vec | 0.136778 | 0.08385 | 0.24293 | 0.42
bnwiki | Morelike | 0.193514 | 0.117 | 0.346 | 0.493
bnwiki | Word2vec | 0.132293 | 0.078647 | 0.244514 | 0.435805
dewiki | Morelike | 0.089064 | 0.05 | 0.173 | 0.318
dewiki | Word2vec | 0.131105 | 0.08184 | 0.23067 | 0.38692
enwiki | Morelike | 0.108866 | 0.054 | 0.212 | 0.39
enwiki | Word2vec | 0.151636 | 0.09491 | 0.26648 | 0.43196
eswiki | Morelike | 0.12912 | 0.077 | 0.233 | 0.402
eswiki | Word2vec | 0.131733 | 0.08121 | 0.2362 | 0.40848
jawiki | Morelike | 0.106265 | 0.058 | 0.206 | 0.369
jawiki | Word2vec | 0.151647 | 0.09849 | 0.25689 | 0.41933
nlwiki | Morelike | 0.111285 | 0.06 | 0.21 | 0.393
nlwiki | Word2vec | 0.131795 | 0.08114 | 0.23303 | 0.38997
rowiki | Morelike | 0.137073 | 0.078 | 0.248 | 0.417
rowiki | Word2vec | 0.129519 | 0.078076 | 0.235162 | 0.418069
ruwiki | Morelike | 0.110369 | 0.065 | 0.211 | 0.365
ruwiki | Word2vec | 0.134226 | 0.08197 | 0.24137 | 0.41118
ukwiki | Morelike | 0.092121 | 0.052 | 0.174 | 0.341
ukwiki | Word2vec | 0.104915 | 0.06289 | 0.19127 | 0.36027
zhwiki | Morelike | 0.095617 | 0.056 | 0.182 | 0.337
zhwiki | Word2vec | 0.154163 | 0.09714 | 0.27255 | 0.44906