Research:Activity session

An activity session is a sequence of actions performed by a user during a "session" of activity. When we say "session" we mean a period of time in which a user - be they reader or editor - is browsing our sites and performing actions. Those actions can be clicking through to new pages, making edits, or taking admin actions; anything that involves actively interacting with the sites. When they stop doing this and move on to other activities, be it leaving to go to a different site or closing their browser altogether, their session has ended.

If we can identify sessions, we can generate a host of metrics about the length of time Wikimedia users spend engaged with the site, how it varies between platforms or activity types, and how changes to the software increase or reduce it, giving us a way to determine how design or engineering changes can optimise the user's experience. These metrics are discussed below, in the session-based metrics section.

There are a variety of approaches to identifying activity sessions, and these are discussed in the session reconstruction section. The Wikimedia Foundation's research team uses an approach based on identifying an appropriate time threshold, measured in seconds, and declaring a user's session "ended" when there is a gap between user actions that is greater than that threshold. The threshold can be determined in a standardised way (discussed in the section on thresholds), although the actual value varies between platforms and activity types.

Identifying and analysing activity sessions depends on having a unique user identifier (UUID) to determine which actions should be associated with each user. For editors we can use usernames, and for Wikipedia App users the App install ID, but in the absence of a UUID for readers, this strategy cannot currently be applied to all kinds of user and all kinds of action. We are working separately on a UUID scheme to address this gap.

Session-based metrics

With accurate session identification, a host of metrics for measuring user engagement with Wikimedia sites become available to us, which are described below in more detail.

These metrics give us more tools for identifying how changes we make to the user experience of Wikimedia sites impact on our users, and optimising those changes to provide beneficial results. For example, suppose we wanted to improve reader engagement, and present readers with incentives to browse more content. One way of measuring whether changes intended to elicit this effect have worked is to measure the "bounce rate"; the proportion of visits to the site that end with the user leaving after only seeing one article. If we can reconstruct users' activity sessions, we can calculate the bounce rate and monitor this number over time (or during a controlled A/B test) to see if changes to the interface have an impact (positive or negative) on the bounce rate. This informs the features we make, and the features we release into production as the default experience.

Metrics available as a result of identifying activity sessions include:

Bounce rate: the proportion of sessions that ended after a single action. Calculated by looking at the percentage of single-event sessions.
Time on page: the time spent idling on a page after opening it. Calculated by looking at the time gap between an event, and the next event, within a session.
Session length: the length of time spent acting in a single session. Calculated by looking at the time elapsed between the first event in a session, and the last event.
Number of actions: the number of actions within a session.
Number of sessions: the number of sessions a single user engages in, within a specific period of time.

Bounce rate

a visual representation of the concept of a "bounce rate".

The "bounce rate" of a website is a common metric in web analytics; it refers to the proportion of visits that end with the user clicking away to another site, or stopping browsing the internet altogether, after viewing a single page.^[1] With session reconstruction, we can easily calculate this: it is the number of sessions that contain only one action, divided by the total number of sessions.

Once we have the ability to calculate the bounce rate, and see how it varies between different sets of conditions, we can establish a baseline and then learn things about how new features or design changes impact user behaviour. For example, we might have a global bounce rate of 30%. If this is acceptable, we have a metric to monitor for the future - something that is useful in and of itself. If this is not acceptable, and we want a lower bounce rate, we can experiment with different feature ideas (for example, a "recommended reading" feature) and use the bounce rate of the population the features are tested against to determine which ones are improvements. We can then roll those features out more widely, and continue monitoring the bounce rate to see if further work is needed. The same applies to edits; if we have a large proportion of edit sessions only consisting of one edit, we can experiment with features to increase this (such as recommended pages to edit) and monitor the impact they have.
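As a minimal sketch of the calculation, assume sessions have already been reconstructed and that each one is represented as a list of event timestamps in seconds. The function name and the example data are illustrative only, not part of any existing tool.

```python
from typing import Sequence


def bounce_rate(sessions: Sequence[Sequence[float]]) -> float:
    """Return the proportion of sessions that contain exactly one event."""
    if not sessions:
        raise ValueError("bounce rate is undefined when there are no sessions")
    single_event = sum(1 for session in sessions if len(session) == 1)
    return single_event / len(sessions)


# Hypothetical data: each inner list holds one session's event timestamps (seconds).
example_sessions = [[0, 120, 300], [5000], [9000, 9050], [20000]]
print(bounce_rate(example_sessions))  # 2 of the 4 sessions are single-event: 0.5
```

Computing the same figure separately for a test group and a control group gives the comparison described above.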

Time on page

a visual representation of the calculation of "time on page".

Time on page is another useful metric that can be calculated if we are able to identify activity sessions; it refers to the time spent idling on an article after opening it. In this context it is primarily useful for read actions, not edit actions, since edit actions are identifiable in logs by the point at which the user completed, rather than started, them.

To calculate time on page for each action (A) in a sequence (S), we need two timestamps: the timestamp of the event, and the timestamp of the event following it. Subtracting the timestamp of the event we wish to calculate time-on-page for from the timestamp of the following event gives us the amount of time spent idling on that page. One problem with this approach is that it's impossible to directly identify the time spent on the last page. Some of the previous implementations within the Wikimedia Foundation, particularly the app session analysis project, have dealt with this by not attempting to estimate the time spent at all, which is an approach that other implementations, such as that used by Google Analytics, also take. While we are confident we can work out a way of providing accurate estimates in those situations, in the meantime time-on-page consists of the time elapsed between an action and the action following it within the session, with the caveat that it cannot be calculated for the last or only action in a session.
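The calculation can be sketched as follows, assuming a session is a list of event timestamps in seconds. The function name is hypothetical. The last (or only) event produces no value, matching the caveat above.

```python
from typing import List, Sequence


def time_on_page(session: Sequence[float]) -> List[float]:
    """Time between each event and the event that follows it.

    The result has one entry fewer than the session has events, because
    nothing can be calculated for the last (or only) event.
    """
    ordered = sorted(session)
    return [following - current for current, following in zip(ordered, ordered[1:])]


print(time_on_page([0, 120, 300]))  # [120, 180]
print(time_on_page([5000]))         # []
```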

There are a lot of applications of time-on-page. For readers, we can investigate whether there are indications that it's correlated with value extracted, or article quality, giving us a better idea of how to tweak content to appeal to users and how to measure the success of these changes. For editors, we could look at the time spent on an edit and see if it's possible to reduce that without reducing quality - effectively, freeing up editor energy for (potentially) more edits.

Session length

a visual representation of the calculation of "session length".

The length of a session consists of the time elapsed between the first action in the session and the last one, along with some buffer to allow for time spent performing the last action (or the first one, in the case of edits), and to handle sessions with only one action. This buffer is calculated inconsistently between implementations, since it can't be observed directly and different customers have different standards around the data they are interested in: the app session analysis uses no buffer, and simply discounts the last event, and single-event sessions, from consideration, while the mobile versus desktop comparisons and edit session analysis used buffers consisting of the average time-on-page within the dataset as a whole.
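The two buffering conventions can be compared in a short sketch. It assumes sessions are lists of event timestamps in seconds. The function names and data are illustrative and are not taken from any of the implementations mentioned.

```python
from statistics import mean
from typing import Sequence


def mean_intertime(sessions: Sequence[Sequence[float]]) -> float:
    """Average gap between consecutive events, pooled over every session."""
    gaps = []
    for session in sessions:
        ordered = sorted(session)
        gaps.extend(b - a for a, b in zip(ordered, ordered[1:]))
    return mean(gaps) if gaps else 0.0


def session_length(session: Sequence[float], buffer: float = 0.0) -> float:
    """Time from the first event to the last, plus an optional buffer."""
    return max(session) - min(session) + buffer


example_sessions = [[0, 100, 300], [5000], [9000, 9060]]

# Variant 1: no buffer, with single-event sessions left out entirely.
no_buffer = [session_length(s) for s in example_sessions if len(s) > 1]

# Variant 2: a buffer equal to the dataset-wide average time between events,
# which also gives single-event sessions a non-zero length.
buffer = mean_intertime(example_sessions)
with_buffer = [session_length(s, buffer) for s in example_sessions]

print(no_buffer)
print(with_buffer)
```

Because the two variants give different answers for the same data, session lengths from different analyses are only comparable when the buffer convention is the same.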

Session length is, like time on page, a useful metric for understanding how strongly users are engaged with their Wikimedia-related activity. Unlike time-on-page, it looks at a longer timespan and so is less prone to fuzziness to do with, for example, action complexity or content length.

Number of actions

actions per session, visualised

The number of actions a user takes within a session is self-explanatory; the number of actions registered to a particular user that occurred within a single session. In aggregate, this number is useful to have because it allows us to examine (for example) behavioural differences between mobile and desktop readers which would explain disparities in pageview counts. This is also an alternative to session length or time-on-page when it comes to measuring the efficacy of experiments or features intended to increase user engagement, since it indicates changes in intentional user interventions.

Number of sessions

sessions per user, visualised

The "number of sessions" metric is simply the count of how many sessions are associated with a particular user, within a specific time period (usually set by the limitations on the UUID that is used).
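Both counting metrics are simple to derive once sessions exist. The sketch below assumes a mapping from a user identifier to that user's sessions (lists of timestamps in seconds), and assigns a session to a time window by its first event. That assignment rule is an assumption made for illustration, as are the names used.

```python
from typing import Dict, List, Sequence


def actions_per_session(sessions: Sequence[Sequence[float]]) -> List[int]:
    """Number of actions in each session."""
    return [len(session) for session in sessions]


def sessions_per_user(
    sessions_by_user: Dict[str, List[List[float]]], start: float, end: float
) -> Dict[str, int]:
    """Count each user's sessions whose first event falls in [start, end)."""
    return {
        user: sum(1 for session in sessions if start <= min(session) < end)
        for user, sessions in sessions_by_user.items()
    }


# Hypothetical data keyed by a placeholder user identifier.
sessions_by_user = {
    "user_a": [[0, 120, 300], [5000, 5060], [90000]],
    "user_b": [[40, 45]],
}
print(actions_per_session(sessions_by_user["user_a"]))  # [3, 2, 1]
print(sessions_per_user(sessions_by_user, 0, 86400))    # {'user_a': 2, 'user_b': 1}
```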

Session reconstruction

For any of these metrics to be calculated, we have to first segment user actions into sessions. This is a long-studied problem and there are a variety of approaches that are taken; our preferred one is to use time thresholds. When a user does not take an action in more than a certain number of seconds, their session has ended, and a new session commences on their next action.
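A minimal sketch of threshold-based segmentation for one user's events, assuming timestamps are expressed in seconds. The function name and default value are illustrative. This is not the code from the tools mentioned elsewhere on this page.

```python
from typing import List, Sequence


def sessionize(timestamps: Sequence[float], threshold: float = 3600.0) -> List[List[float]]:
    """Split one user's event timestamps into sessions.

    A new session starts whenever the gap since the previous event is
    greater than `threshold` seconds.
    """
    sessions: List[List[float]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= threshold:
            sessions[-1].append(ts)  # gap within threshold: same session
        else:
            sessions.append([ts])    # first event, or gap too long: new session
    return sessions


events = [0, 120, 300, 5000, 5060, 20000]
print(sessionize(events, threshold=3600))
# [[0, 120, 300], [5000, 5060], [20000]]
```

Events must be grouped by user before this step, which is why a reliable user identifier is a prerequisite.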

What this threshold is varies from userbase to userbase, and the approach to generating it is described below in the section "determining an appropriate threshold". We have established baselines for each user group, which are:

Mobile app readers: 30 minutes
Mobile web readers: 60 minutes
Desktop readers: 60 minutes
Editors: 60 minutes

More work should be done to subdivide editors and identify if different thresholds are appropriate for different types of editor, in the same way that different types of reader can have different thresholds.

Approaches to session reconstruction

There are essentially two approaches to session reconstruction: time-based and navigation-based. Time-based approaches look for a period of inactivity, or "inactivity threshold": a span of time without requests from a user. Once this period of inactivity is reached, the user is assumed to have left the site or stopped using the browser entirely, and the session is ended: further requests from the same user are considered a second session. A common value for the inactivity threshold is 30 minutes, a well-established value sometimes described as the industry standard. The utility of this value has been questioned: some researchers have argued that it produces artefacts around naturally long sessions, and have experimented with other thresholds, including 10 and 60 minutes. Jones & Klinkner go further, arguing in a paper at the 2008 Conference on Information and Knowledge Management that, at least in relation to search data, "no time threshold is effective at identifying [sessions]".

Navigation-based approaches, on the other hand, exploit the structure of websites: specifically, the presence of hyperlinks and the tendency of users to navigate between pages on the same website by clicking on them, rather than typing the full URL into their browser. One way of identifying sessions by looking at this data is to build a map of the website: if the user's first page can be identified, the "session" of actions lasts until they land on a page which cannot be accessed from any of the previously-accessed pages. This takes into account backtracking, where a user will retrace their steps before opening a new page.
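The map-based heuristic can be sketched as below. The link map, page names and function name are hypothetical. Treating a revisit of an already-seen page as part of the same session is an assumption made here to cover backtracking.

```python
from typing import Dict, List, Set


def navigation_sessions(pages: List[str], links: Dict[str, Set[str]]) -> List[List[str]]:
    """Split an ordered list of visited pages into navigation-based sessions.

    A page continues the current session if it is linked from any page
    already in that session, or if it has already been visited in it.
    """
    sessions: List[List[str]] = []
    current: List[str] = []
    for page in pages:
        linked = any(page in links.get(previous, set()) for previous in current)
        if current and (linked or page in current):
            current.append(page)
        else:
            if current:
                sessions.append(current)
            current = [page]
    if current:
        sessions.append(current)
    return sessions


# Hypothetical site map: page -> pages it links to.
site_links = {"A": {"B", "C"}, "B": {"D"}, "X": {"Y"}}
visited = ["A", "B", "D", "C", "X", "Y"]
print(navigation_sessions(visited, site_links))
# [['A', 'B', 'D', 'C'], ['X', 'Y']]
```

In the example, "C" is reached by going back to "A", so it stays in the first session, while "X" is not linked from anything seen so far and starts a new one.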

Navigation-based heuristics are completely impractical for any website with a large amount of HTTPS traffic, since HTTPS requests lack a referer. As a result, we use a time-based approach. Despite the criticism that thresholds are hard to identify, we believe that we have found a reliable way of doing so, which is described in the section "determining an appropriate threshold" and is currently in the process of being published in a peer-reviewed journal.

Determining an appropriate threshold

A histogram of a random sample of inter-edit times for registered editors of English Wikipedia is plotted with a 3 Gaussian mixed model fit overlay. The one hour threshold used by Geiger and Halfaker^[2] is noted

Although threshold values can vary, we have found a standardised way of identifying the appropriate threshold that works fairly consistently in general, and extremely consistently with Wikimedia datasets. We can expect that times between events (intertimes) will be largely consistent for particular behaviours. One range of intertime values will cover actions within a session. Another range of intertime values will cover the gaps that separate one session from the next. In other words, if we calculate the time between events in a dataset, we can expect to see two buckets: one of low values, which represent actions within a session, and one of high values, which represent gaps between sessions. When we fit a log-normal mixture model to the data, these buckets show up very clearly, and if we draw a line at the point between them where we are most likely to classify an intertime correctly, we can determine the best threshold, where "best" means most likely to avoid either bucketing multiple sessions together or splitting a single session up.
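The idea can be illustrated with a small, self-contained sketch. It fits a two-component Gaussian mixture to the logarithm of the intertimes (equivalent to a log-normal mixture on the raw values) using a hand-written EM loop, then takes the point between the two means where the weighted densities are equal. Everything here is an assumption made for illustration: the data is synthetic, the two-component simplification follows the "two buckets" description rather than the published analysis, and the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Component = Tuple[float, float, float]  # (weight, mean, standard deviation)


def log_weighted_density(x: float, component: Component) -> float:
    """Log of weight * Normal(x | mean, sd), dropping the shared constant."""
    weight, mu, sigma = component
    return math.log(weight) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2


def fit_two_gaussians(values: Sequence[float], iterations: int = 100) -> List[Component]:
    """Fit a two-component 1-D Gaussian mixture with EM; sorted by mean."""
    ordered = sorted(values)
    half = len(ordered) // 2
    components: List[Component] = []
    for chunk in (ordered[:half], ordered[half:]):  # crude initial split
        mu = sum(chunk) / len(chunk)
        sd = math.sqrt(sum((v - mu) ** 2 for v in chunk) / len(chunk))
        components.append((0.5, mu, max(sd, 1e-3)))
    for _ in range(iterations):
        # E-step: probability that each value belongs to the first component.
        resp = []
        for v in values:
            a = log_weighted_density(v, components[0])
            b = log_weighted_density(v, components[1])
            resp.append(1.0 / (1.0 + math.exp(min(b - a, 700.0))))
        # M-step: re-estimate weight, mean and sd of both components.
        updated: List[Component] = []
        for r in (resp, [1.0 - p for p in resp]):
            total = max(sum(r), 1e-9)
            mu = sum(p * v for p, v in zip(r, values)) / total
            var = sum(p * (v - mu) ** 2 for p, v in zip(r, values)) / total
            updated.append((total / len(values), mu, max(math.sqrt(var), 1e-3)))
        components = updated
    return sorted(components, key=lambda c: c[1])


def crossover(components: List[Component]) -> float:
    """Bisect between the two means for the point of equal weighted density."""
    low, high = components
    lo, hi = low[1], high[1]
    for _ in range(60):
        mid = (lo + hi) / 2
        if log_weighted_density(mid, low) > log_weighted_density(mid, high):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Synthetic log-intertimes: short within-session gaps and long between-session gaps.
random.seed(0)
log_intertimes = (
    [random.gauss(math.log(60), 1.0) for _ in range(700)]
    + [random.gauss(math.log(86400), 1.0) for _ in range(300)]
)
components = fit_two_gaussians(log_intertimes)
threshold_seconds = math.exp(crossover(components))
print(f"estimated threshold: {threshold_seconds / 60:.0f} minutes")
```

On real data a library implementation of mixture models would normally be preferable to a hand-written EM loop. The loop is used here only to keep the example free of dependencies.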

Implementations and prior art

The Wikimedia Foundation's R&D team has implemented several session analysis projects in the last 2 years, primarily centred around reading. These include:

App session analysis, a recurring report on shifts in various session-based metrics for users of the official Wikipedia App;
Mobile web/desktop session comparisons, a one-off report on the behavioural differences between mobile and desktop users; and
Edit session analysis, a research project by Aaron Halfaker and R. Stuart Geiger that instituted this standardised methodology.

App session analysis

One ongoing, automated project to gather session-related metrics is that relating to the Wikimedia Apps. Metrics include pages per session, sessions per user and session length, and are aggregated and released publicly. A 30-minute threshold value is used, in line with our findings. The session length analysis uses a zero-length buffer, and discounts single-event sessions in their entirety.

This analysis is only possible because the apps have a custom UUID solution; no generalised UUID solution exists, so it is not applicable to mobile web or desktop traffic.

Mobile/desktop session comparisons

Main article: Research:Mobile sessions

A historical project in January-March 2014 attempted to compare Mobile and Desktop sessions. This lacked a UUID, and instead used a fingerprinting method consisting of hashing the user's IP address, user agent and language version. While this produced useable results, it also demonstrated that fingerprinting is inaccurate - beyond the privacy problems with keeping this data around - necessitating a proper UUID solution. The session length definition included a buffer, consisting of the average time-on-page, and also included single-page sessions.
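A sketch of this kind of fingerprint, and of why it is inaccurate, is given below. The choice of SHA-256, the separator character and the function name are assumptions, since the original hashing scheme is not specified here. The IP addresses are reserved documentation examples.

```python
import hashlib


def fingerprint(ip_address: str, user_agent: str, language: str) -> str:
    """Hash the three request attributes into one pseudo-identifier."""
    joined = "\x1f".join((ip_address, user_agent, language))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Two different people behind the same shared IP address, using the same
# browser and language version, collapse into a single "user".
person_one = fingerprint("192.0.2.1", "ExampleBrowser/1.0", "en")
person_two = fingerprint("192.0.2.1", "ExampleBrowser/1.0", "en")

# The same person moving to another network appears as a new "user".
same_person_new_network = fingerprint("198.51.100.7", "ExampleBrowser/1.0", "en")

print(person_one == person_two)               # True: distinct people merged
print(person_one == same_person_new_network)  # False: one person split in two
```

Both failure modes distort session metrics: merged users produce artificially long sessions, and split users produce artificially short ones.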

Edit session analysis

Stuart Geiger and Aaron Halfaker^[2] originated these techniques in their analysis of editors. Their session lengths included the average time between events as a buffer, and they applied a 60-minute threshold value. Explored and hypothesised uses, which have also been applied by other researchers, include:

Grouping edits together
- Halfaker used the notion of an edit session to measure the first experience of newcomers as registered editors^[3]. He found that the number of revisions an editor makes in their first edit session (understood as editor "investment") is a strong predictor of long-term retention.
- Halfaker et al. built on the previous study by using the edits in newcomers' first session ( $t$ = 1 hour) as a dataset for distinguishing good-faith from bad-faith editors^[4]. This analysis was used to show that The Decline is not caused by decreasing newcomer quality.
- Panciera et al. measured the number of edits per session to control for editors who performed many small edits as opposed to those who package a large change into a single edit^[5].
Measuring labor hours (see R:Estimating session duration)
- Using the session duration measurement, the total number of hours that an editor has spent editing can be approximated. Research by Geiger & Halfaker builds such estimations across the encyclopedia^[6].

Standardised tools

One legacy of the many ad-hoc implementations of session analysis and reconstruction is a set of standardised, generalised tools. In Python, Wikimedia-Utilities contains session reconstruction code. In R, reconstructr is a dedicated library for performing this kind of analysis.

References

[1] "Bounce Rate". Google Analytics. Retrieved 19 December 2014.
[2] Geiger, R.S.; Halfaker, A. (2013). "Using Edit Sessions to Measure Participation in Wikipedia" (PDF). Proceedings of the 2013 ACM Conference on Computer Supported Cooperative Work (ACM).
[3] Halfaker, A. (2011). First edit session, a technical document produced for the 2011 Wikimedia Summer of Research.
[4] Halfaker, A.; Geiger, R.S.; Morgan, J.; Riedl, J. (in press). "The Rise and Decline of an Open Collaboration System: How Wikipedia's reaction to sudden popularity is causing its decline". American Behavioral Scientist.
[5] Panciera, K.; Halfaker, A.; Terveen, L. (2009). "Wikipedians are Born, Not Made: A study of power editors on Wikipedia". GROUP.
[6] Geiger, R.S.; Halfaker, A. (2013). "Using Edit Sessions to Measure Participation in Wikipedia". CSCW, pp. 861-870. DOI:10.1145/2441776.2441873.
