Research:Sequential Models for Analyzing Navigation/Reading Sessions

Created

Contact

Martin Gerlach

Wikimedia Foundation

Collaborators

Akhil Arora

EPFL

Alberto García Durán

EPFL

Robert West

EPFL

Duration: 2020-May – ??

Research:Projects

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

About

This project aims to employ sequential modeling techniques, viz. RNNs/Transformers, to capture the sequential dependencies among page-views existent in the reading/navigation sessions extracted from the webrequest server logs. Once trained, such models possess the capability to address a plethora of downstream tasks to be used as user recommendations, some of which are stated as follows:

Predicting the next article in a session
Predicting a sequence of articles in a session
Predicting the target of a navigation session
Predicting the user-intent

In addition to using the reader/navigation sessions extracted from the server logs, we also aim to use the underlying graph structure (captured by the hyperlink network of Wikipedia articles) and the textual content of each article for training multi-modal sequential models.

Data

We plan to use reading/navigation sessions extracted from Wikimedia's server logs, where all HTTP requests to Wikimedia projects are logged.

Method

We aggregate the set of pages read by the same user by creating an identifier composed by request IP and user-agent (data hashed). Once users are uniquely identified, it is important to define the term "session." In this context, we consider two types of aggregation strategies:

Navigation session: It takes into account the referrer field of the server logs to represent user navigations as trees. Negative aspects of this approach include:

unclear representation of behaviors where the referrer field is not defined (i.e., a sequence of pages loaded from Search engine results).
unclear representation of how the content was consumed by the reader (user jumping between different tabs)

Reading session: This approach takes into account the navigation events sorted by timestamp as the readers generated them. It represents the natural way to model the trajectory of the reader across different topics. Negative aspects include:

There is a need to define a heuristic to split different reading sessions (i.e., one hour without activities).

Mixed approaches are also possible. The analysis will be limited to anonymous users.

Once the reading/navigation sessions have been extracted from the server logs, we then train the following language models on the sequence of articles extracted in the said sessions.

Word2Vec
LSTM
BERT (ongoing)

Baseline

We use the "morelike" feature in mw:Extension:CirrusSearch, which extracts keywords from articles via tfidf and searches other articles with those keywords, as a baseline.

Result

Based on our initial experiments on SimpleWiki, we have obtained the following results:

Method	MRR
MoreLike	0.176
Word2Vec	0.17
LSTM	0.30