Tracked in Phabricator:
Created: 09:47, 25 March 2020 (UTC)
Duration: January 2020 – December 2020

In contrast to Wikipedia’s editor population, little is known about its readers, in large part due to the challenges and restrictions involved in dealing with privacy-sensitive data. Only recently have we started to characterize Wikipedia’s readership. For example, recent studies (Singer, Lemmerich, et al. 2017; Lemmerich et al. 2019) approached the question of why we read Wikipedia in order to identify the motivation, information need, and prior knowledge of different users. Here, we investigate whether and to what degree this is reflected in how we use Wikipedia. That is, instead of looking at pageviews as isolated events, we consider users’ full reading sessions in order to characterize patterns of navigation within and across Wikimedia projects and, as a result, to better understand usage of Wikipedia, in particular how readers use it to learn about specific topics.

Main findings:

• There are systematic differences in the length of reading sessions depending on the access method and the topic in which the session starts
• There are two distinct phases in reading sessions with respect to topical exploration (increasing focus at first, broader exploration towards the end)
• Reader surveys suggest differences in topical exploration depending on information need and motivation (disclaimer: the sample is small, and with the given data it is not clear whether the differences are statistically significant)
• Established methods for defining reader sessions (a 1-hour cutoff between two consecutive pageviews) need to be revisited, as this particular choice is not obvious from the data

## Background/Motivation/Goals

This project aims to answer questions such as: What are the strategies readers use to satisfy their information needs, and how do they vary, e.g., across different languages and projects? What are the reasons why they fail, e.g. due to lack of accessibility, and how could we assist readers in finding the information they are looking for? Answering these questions is pertinent to at least two of the five priorities laid out in the Wikimedia Foundation Medium-term plan 2019: worldwide readership (increase global readership) and platform evolution (ease of use and low barrier of entry). Thus, by providing a characterization of readership behavior and patterns, this project will contribute towards the goal of addressing knowledge gaps in readership. Following the idea of the pipeline of online participation (Shaw & Halfaker 2018), this will also provide crucial insights into possible causes of knowledge gaps in contributorship and content. In the longer term, we also hope to set the foundation for approaching questions on how users learn on Wikipedia (possibly over longer time periods), and how to support initiatives that train critical usage of Wikipedia in order to increase resilience with respect to disinformation.

Goals

• Start exploratory research to empirically characterize navigation sessions in Wikipedia, e.g. quantifying differences across Wikipedia editions, geographical locations, mobile access, topical content, etc.

## Results

Figure: Histogram of interevent times between consecutive pageviews (log-binned)

Using webrequest logs, we identify reading sessions by grouping all pageviews with the same hash, obtained by concatenating client_ip + user_agent + access_method (removing automated traffic, only considering pageviews to the main namespace, excluding redirects, etc.).
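A minimal sketch of this grouping step. The field names (client_ip, user_agent, access_method, ts) are illustrative stand-ins, not the actual webrequest schema, and the filtering of automated traffic, namespaces, and redirects is assumed to have happened upstream:

```python
import hashlib
from collections import defaultdict

def actor_hash(client_ip, user_agent, access_method):
    """Approximate actor identifier: hash of client_ip + user_agent + access_method."""
    key = f"{client_ip}|{user_agent}|{access_method}".encode("utf-8")
    return hashlib.sha1(key).hexdigest()

def group_pageviews(pageviews):
    """Group pageview records by actor hash; each group is one candidate
    pageview stream, sorted by timestamp, from which sessions are cut later."""
    streams = defaultdict(list)
    for pv in pageviews:
        h = actor_hash(pv["client_ip"], pv["user_agent"], pv["access_method"])
        streams[h].append(pv)
    for stream in streams.values():
        stream.sort(key=lambda pv: pv["ts"])
    return streams
```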

A typical approach for identifying reading sessions is to break a session whenever the time difference between two pageviews is larger than a given threshold. Halfaker et al. [1] suggested a cutoff of 1 hour for Wikipedia reading sessions. They convincingly showed that the distribution of interevent times (the time between two consecutive pageviews) is bimodal (with peaks at ~1 minute and ~1 day) and identified a separation point at ~1 hour (see Figure 1, bottom 3 panels). Following these insights, a 1-hour cutoff has become the standard choice when constructing reader sessions in Wikipedia [2][3][4].

Using a sample of ~500M sessions from 1 week of requests to enwiki, we found a slightly more subtle picture, suggesting that a 1-hour cutoff might not be as obvious as the previous analysis suggests. In our data, we similarly observe two peaks at around 1 minute and 1 day. However, in the range in between there is no clearly identifiable minimum that would suggest a cutoff at 1 hour; in fact, one could hypothesize that there is a third peak at 1 hour. In any case, any cutoff in the range from 15 minutes to 4 hours seems equally good, and the distribution of interevent times alone seems insufficient to provide a clear answer. More research is needed to understand and solve this problem in a satisfactory way. While we still choose 1 hour as a pragmatic cutoff, we suggest checking whether results obtained from reading sessions are robust when this cutoff is varied (e.g. 15 minutes, 1 hour, 4 hours).
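The cutoff-based splitting and the suggested robustness check can be sketched as follows, with an illustrative list of pageview timestamps (in seconds) standing in for a real actor stream:

```python
def split_sessions(timestamps, cutoff_s=3600):
    """Split a sorted stream of pageview timestamps into sessions whenever
    the gap between consecutive pageviews exceeds cutoff_s seconds."""
    sessions, current = [], [timestamps[0]]
    for prev, ts in zip(timestamps, timestamps[1:]):
        if ts - prev > cutoff_s:
            sessions.append(current)
            current = []
        current.append(ts)
    sessions.append(current)
    return sessions

# Robustness check: recompute session-level statistics under several cutoffs.
ts = [0, 30, 120, 7200, 7260]           # illustrative timestamps in seconds
for cutoff in (900, 3600, 14400):       # 15 min, 1 h, 4 h
    lengths = [len(s) for s in split_sessions(ts, cutoff)]
```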

### Topic analysis

How do reading sessions vary across different topics in Wikipedia?

To answer this question, we assign each article a topic label from the WikiProjects-derived topic taxonomy (e.g. Culture.Sports or STEM.Physics). Specifically, we use the Wikidata topic model and choose the topic with the highest probability. The advantage of this model over ORES' text-based topic model is that it can be applied to articles in any language, using only the statements of the corresponding Wikidata item rather than the text of the article, and is thus language-independent.
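The label-selection step reduces to an argmax over the model's predicted probabilities; a minimal sketch, with a hypothetical probability dict in place of the actual model output:

```python
def top_topic(topic_probs):
    """Pick the single most probable label from a topic model's output.

    topic_probs is a dict mapping taxonomy labels to predicted probabilities,
    e.g. {'Culture.Sports': 0.7, 'STEM.Physics': 0.2, 'Person': 0.1}."""
    return max(topic_probs, key=topic_probs.get)
```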

First, we record the number of pageviews (across all sessions) for each topic, separated by access method (desktop, mobile web, mobile app). Since there are many more pageviews from mobile web than from the mobile app, for each access method we calculate the fraction of pageviews across topics. As expected, some topics are much more popular than others; for example, 'Person' receives more than 10% of all pageviews, whereas 'Culture.Crafts_and_Hobbies' receives less than 1/1000 of all pageviews (note that we do not account for the different number of articles in each topic). Surprisingly, however, the distribution of pageviews across topics is very similar for the different access methods. There are some notable differences, e.g. 'STEM.Technology' receives a higher fraction from desktop readers, while 'Person' receives a higher fraction from mobile readers (web and app).

Second, we record the length of a reading session depending on the topic of its initial pageview. Specifically, we calculate the average (and standard error of the mean) of the length of sessions starting in a given topic, conditioned on the different access methods. Across all topics, the average for mobile web is about one pageview smaller than for desktop and mobile app. The average also varies substantially across topics: 'Culture.Arts' has an average of 4 pageviews on desktop, while 'STEM.Technology' is around 2. This already suggests strong differences in the way Wikipedia is used depending on the topic of interest.
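A minimal sketch of this aggregation, operating on hypothetical (start_topic, access_method, length) tuples rather than the actual session tables:

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev

def session_length_stats(sessions):
    """Average session length and standard error of the mean, grouped by
    the topic of the initial pageview and the access method.

    sessions: iterable of (start_topic, access_method, length) tuples.
    Returns {(topic, access_method): (mean_length, sem)}."""
    groups = defaultdict(list)
    for topic, method, length in sessions:
        groups[(topic, method)].append(length)
    return {
        key: (mean(ls), stdev(ls) / sqrt(len(ls)) if len(ls) > 1 else 0.0)
        for key, ls in groups.items()
    }
```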

In this section, we measure navigation in topic space.

The discrete labels provide only a very coarse-grained view. Instead, we would like to describe a reader's navigation in a way similar to how we describe navigation when driving a car. We therefore first construct a continuous topic space (where similar articles are closer to each other) and then measure the topical spread of reading sessions in this space.

#### Building continuous topic space using word embeddings

Using the text of the articles, we use word embeddings to project each article into a 2-dimensional topic space. The detailed workflow consists of the following steps:

• Extract all word tokens of the first section of an article using the Wikitext Preprocessor from mediawiki-utilities/python-mwtext. This parses wikitext into plain text and performs tokenization using regular expressions. Note that there are more sophisticated parsers, such as mwparserfromhell, and tokenizers, such as spaCy, but our approach is much faster.
• Use the fastText word vectors (here, for English) to obtain a 300-dimensional vector from the article text. Specifically, we use the get_sentence_vector function, which maps a string of text (possibly several words) to a 300-D vector.
• Use UMAP to project the 300-D topic space into a 2-D topic space.
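The shape of this pipeline can be sketched end-to-end. Since the pretrained fastText model and UMAP are heavyweight dependencies, the sketch below substitutes random 300-dimensional vectors for the fastText embeddings and a plain PCA projection (via SVD) for UMAP; it illustrates only the 300-D → 2-D step, not the non-linear, neighborhood-preserving projection UMAP actually computes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for fastText's get_sentence_vector: one 300-D vector per article.
# In the real workflow these come from the pretrained English fastText model.
n_articles, dim = 1000, 300
vectors = rng.normal(size=(n_articles, dim))

# Stand-in for UMAP: PCA via SVD, keeping the top 2 components.
centered = vectors - vectors.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ vt[:2].T          # shape: (n_articles, 2)
```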

We obtain a 2D-map of all articles in enwiki. We can see that similar articles will be closer to each other. For example, articles with a given topic label (obtained from the wikidata topic model) will be strongly localized in a particular region of the map. This suggests that our topical space captures the similarity of articles in terms of their content.

Once we have created the topical space, we can map each reading session from a sequence of page_ids to a sequence of 2D coordinates. As an illustrative example, we show two reading sessions:

• a very focused reading session (red) with 34 pageviews
• a topically much more diverse reading session (blue)

We can quantify the spread of each session by calculating the pooled standard deviation $\sigma = \sqrt{\sigma_x^2 + \sigma_y^2}$, where $\sigma_x$ and $\sigma_y$ are the standard deviations of the session's x- and y-coordinates in the topic space. Large values indicate a large topical spread, whereas small values indicate a high topical focus in the reading session. Accordingly, for the two example sessions we obtain values of 0.54 (red) and 5.24 (blue).
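A minimal sketch of this spread measure for a session given as a list of (x, y) coordinates in the topic space (the sketch uses the population standard deviation; the source does not specify sample vs. population, and the choice makes little difference for the qualitative picture):

```python
from statistics import pstdev

def topical_spread(coords):
    """Pooled standard deviation sigma = sqrt(sigma_x^2 + sigma_y^2)
    of a session's 2D coordinates in the topic space."""
    xs, ys = zip(*coords)
    return (pstdev(xs) ** 2 + pstdev(ys) ** 2) ** 0.5
```

A session whose pageviews sit on a single point has spread 0; the more the coordinates scatter, the larger the value.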

We assess if and how the topical spread changes during a reading session. Previous research in 'laboratory settings' (i.e., where readers were tasked with finding a navigation path from a source page to a target page) found two distinct phases [5].

We calculate the spread $\sigma$ for a window of 4 consecutive pageviews (other window sizes yield similar results) and move this window along the reading session from beginning to end. We further group reading sessions according to their length and show the mean and the standard error of the mean over the corresponding reading sessions.
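The sliding-window computation can be sketched as follows; the local spread helper repeats the pooled standard deviation defined above so the sketch is self-contained:

```python
from statistics import pstdev

def spread(coords):
    # pooled standard deviation sigma = sqrt(sigma_x^2 + sigma_y^2)
    xs, ys = zip(*coords)
    return (pstdev(xs) ** 2 + pstdev(ys) ** 2) ** 0.5

def windowed_spread(coords, window=4):
    """Spread in a window of `window` consecutive pageviews,
    slid along the session from beginning to end."""
    return [spread(coords[i:i + window])
            for i in range(len(coords) - window + 1)]
```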

In order to test whether the observations from the original data (blue) are genuine and not just an artifact of, e.g., the network topology of links, we compare with two null models in which we generate a synthetic set of reading sessions (each with the same starting page and the same length as an empirical session), assuming navigation is a random walk on the link network:

• Random walk: all outgoing links have the same probability. For each page, we get all outgoing links from the pagelinks table and randomly choose one as the next pageview in the reading session.
• Clickstream: outgoing links are weighted according to the number of times the link was used. For each page, we get the number of times each outgoing link was clicked using the clickstream data and randomly choose one (with probability proportional to its weight) as the next pageview in the reading session.
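Both null models can be generated by the same weighted random walk: uniform weights give the first model, clickstream counts the second. A sketch with an illustrative out_links mapping standing in for the pagelinks/clickstream tables:

```python
import random

def random_walk_session(start, out_links, length, rng=None):
    """Generate one synthetic session of up to `length` pageviews as a
    (weighted) random walk on the link network.

    out_links maps a page to a dict {target_page: weight}. Uniform weights
    reproduce the random-walk null model; clickstream counts as weights
    reproduce the clickstream null model."""
    rng = rng or random.Random()
    session = [start]
    while len(session) < length:
        links = out_links.get(session[-1])
        if not links:                      # dead end: no outgoing links
            break
        targets, weights = zip(*links.items())
        session.append(rng.choices(targets, weights=weights, k=1)[0])
    return session
```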

Main observations:

• There is a characteristic U-shape in the topical spread: in the first half to third of a session, the topical focus increases (smaller $\sigma$); towards the end, the topical spread becomes much larger. Most importantly, this does not happen when just navigating randomly on the link network.
• This suggests that there are two distinct phases in the navigation of reading sessions.
• This also clearly indicates that reading sessions are contextual and that navigation depends on the previously visited pages (and not just the currently visited one). Compare this to previous studies critically assessing this aspect of the navigation of web users [6] and to findings that navigation in Wikigames can be memory-less [7].
• The topical spread is smaller than for the null models in most cases (this is expected when readers do not just click links randomly); however, towards the end of sessions, the topical spread becomes larger, suggesting that readers navigate content that is outside the reach of the link network.

#### Relation to motivation and information need

Using ~5000 reading sessions from the Reader survey, we check whether and how the topical spread differs depending on the information need and the motivation.

Separating sessions of different length, we calculate the topical spread for the entire session conditioned on the response given in the survey with respect to information need and motivation.

• Differences in spread across information need and motivation (for the same session length) are small and most likely not statistically significant; however, there seem to be interesting systematic shifts
• Information need: in-depth sessions have a smaller average topical spread
• Motivation: intrinsic learning and work/school have a smaller average topical spread
• The topical spread increases with session length (as expected)

## References

1. Halfaker, Aaron; Keyes, Oliver; Kluver, Daniel; Thebault-Spieker, Jacob; Nguyen, Tien; Shores, Kenneth; Uduwage, Anuradha; Warncke-Wang, Morten (2015). "User Session Identification Based on Strong Regularities in Inter-activity Time". pp. 410–418. doi:10.1145/2736277.2741117.
2. Singer, Philipp; Lemmerich, Florian; West, Robert; Zia, Leila; Wulczyn, Ellery; Strohmaier, Markus; Leskovec, Jure (2017). "Why We Read Wikipedia". pp. 1591–1600. doi:10.1145/3038912.3052716.
3. Lemmerich, Florian; Sáez-Trumper, Diego; West, Robert; Zia, Leila (2019). "Why the World Reads Wikipedia". pp. 618–626. doi:10.1145/3289600.3291021.
4. Paranjape, Ashwin; West, Robert; Zia, Leila; Leskovec, Jure (2016). "Improving Website Hyperlink Structure Using Server Logs". pp. 615–624. doi:10.1145/2835776.2835832.
5. West, Robert; Leskovec, Jure (2012). "Human wayfinding in information networks". pp. 619–628. doi:10.1145/2187836.2187920.
6. Chierichetti, Flavio; Kumar, Ravi; Raghavan, Prabhakar; Sarlos, Tamas (2012). "Are web users really Markovian?". pp. 609–618. doi:10.1145/2187836.2187919.
7. Singer, Philipp; Helic, Denis; Taraghi, Behnam; Strohmaier, Markus (2014). "Detecting Memory and Structure in Human Navigation Patterns Using Markov Chain Models of Varying Order". PLoS ONE 9 (7): e102070. ISSN 1932-6203. doi:10.1371/journal.pone.0102070.