Research:Page view/Background

This page contains considerations from 2014/15 that led to a new pageview definition, which is documented at Research:Page view.

Background

Primary use cases

Many use cases exist for pageviews data. A large number come from senior management (C-levels and above) looking to inform their decisions about where the WMF goes and where it tries to point the Wikimedia movement. Our goal is to provide readers with content to the best of our ability: that means focusing our energies where readers are focused, or should be, and so these use cases are primary. Many others come from the volunteer community and from third-party researchers, who are also our customers: after solving for existential threats, these use cases are also primary.

Examples of questions that pageviews data is needed to answer (both at a strategic level for decision-makers, and at a more granular level for contributors and external observers, e.g. the press) include:

  1. How do people consume our content? (hardware and software, e.g. mobile device, OS and browser.) A decade ago, people tended to access Wikipedia through desktop devices on fixed modem or fiber lines. This is no longer the case: mobile devices (phones and tablets) make up an ever-growing proportion of our user base. The Executive Director and the C-level team are interested in keeping an eye on what proportion of webpage views come from sites and access methods associated with our mobile initiatives - Zero, the Mobile Web platform and the Mobile App platform. For this they would like a breakdown of webpage views by access method (desktop, Zero, Mobile Web, Mobile Apps) capable of being represented as a 30-day rolling average (a sketch of such an aggregation appears after this list). Because reading is the targeted activity, there is no need to split it out into, say, "article" reads versus "meta" reads (certain namespaces, special pages, the history page, etc.): an overall number will do, as long as it is available on a per-wiki basis. Both C-levels and volunteers strongly expressed the need to count web crawlers and other automated systems separately. The Research and Data team, which is tasked with providing commentary on the data and following up on specific questions it prompts, advocates for the data itself to be stored at a 1-hour resolution, even if it is represented as a rolling average, so that (in the event of odd or unexpected results) it is possible to do some investigation without regenerating the dataset.
  2. Where do people consume our content? Wikimedia is a global entity, with global reach. Yet historically our readers have been heavily weighted towards Europe and North America. This is problematic, first because it creates a tremendous dependency and leaves us vulnerable to losing substantial amounts of relevance from changes of limited geographic scope, and second because our task is to reach everyone. We need to focus our attention on areas where we are not reaching a substantial chunk of the population. To do this, we need reliable pageviews data to show exactly where we’re weak.
    Requirements for this match those for use case 1, but with the addition of country as a variable.
  3. How do people get to our content? As well as geographic vulnerabilities, we also have third-party dependencies. Google and other major sources of referred traffic are responsible for an incredibly high proportion of our readership, and changes to their infrastructure can create dramatic shifts in our readership numbers. Without reliable tracking of these readership numbers, we lose the ability to quickly see changes and respond to them.
    Requirements for this would be the same as for use case 2, but with the inclusion of referer data, at an aggregate level.
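
As a concrete illustration of the rolling-average requirement in question 1, here is a minimal sketch in Python (pandas) of rolling hourly per-wiki counts, broken down by access method, up into a 30-day rolling average. The column names (ts, wiki, access_method, views) and the file name are hypothetical stand-ins for whatever schema the final dataset uses.

  import pandas as pd

  # Hypothetical hourly dataset: one row per (hour, wiki, access method).
  hourly = pd.read_csv("hourly_pageviews.csv", parse_dates=["ts"])

  # Keep the 1-hour resolution on disk; aggregate to daily totals only for reporting.
  daily = (
      hourly
      .groupby([pd.Grouper(key="ts", freq="D"), "wiki", "access_method"])["views"]
      .sum()
      .reset_index()
      .sort_values("ts")
  )

  # 30-day rolling average per wiki and access method
  # (desktop, Zero, Mobile Web, Mobile Apps).
  daily["rolling_30d"] = (
      daily
      .groupby(["wiki", "access_method"])["views"]
      .transform(lambda s: s.rolling(window=30, min_periods=1).mean())
  )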

Examples of community or research-focused questions that pageviews data is needed to answer include:

  1. What articles do readers focus on? Editor attention is limited because we have a limited number of hours: ideally, it should be focused on much-requested or much-appreciated content that is not already of high quality. Accurate pageviews data is necessary so that we can see which articles readers cluster around and use this to identify where editing energy would be best focused.
    Requirements for this would be (ideally) requests, per method, per article, over time, or, more realistically, requests, per method, per URL, over time.
  2. How does content quality relate to reader attention? As an extension of the previous question: is it worth our time focusing on what users have asked for? Does content quality increase, decrease, or bear no relation to how likely readers are to read, or how likely they are to attempt to contribute?
    This doesn't really have any new requirements; it's a distinct use case, but the requirements for community question 1 and existing datasets combine to answer it.
  3. What articles are missing that people want?
  4. What redirects and navigational paths are used by readers?
  5. Which project and language do people access most?
  6. What kind of information do people request most? (type of content may vary per wiki, e.g. inferred from category clouds)
  7. Which media files do people request most? (preferably broken down by category, so as to group images by museum which donated)
    I think this needs to be more precise. If we want to measure traffic to the File: page, that's a pageview. If we want to measure hits to the appropriate upload.wikimedia.org subdirectory, that's not. Okeyes (WMF) (talk) 14:04, 23 October 2014 (UTC)
    Upload counts are covered in [1]. Erik Zachte (WMF) (talk) 00:02, 3 December 2014 (UTC)
  8. A volunteer editor might want answers to many of the same questions, but on a more granular level: per wiki, per category, per media file (image, video, sound), and ideally per content page (consumption stats in the footer of each page could be a major morale boost for editors).

Kevin's User Stories

User goes to Vital Signs for Pageviews per project

As an executive, product manager, researcher, community member...

I can quickly view daily Pageviews per project in Vital Signs;

so I can monitor the projects.

Details
daily sums
per project (wiki)
includes all namespaces
crawler traffic is excluded (per the webstatscollector definition; a sketch of such a filter follows these details)
with a breakdown distinguishing hits to the desktop or mobile sites
this data is public
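
A minimal sketch, in Python, of the crawler-exclusion step described above. The substring heuristic and marker list here are assumptions for illustration only; the actual webstatscollector definition has its own, more careful rules.

  # Hypothetical crawler check: flag a request as automated if its user agent
  # contains a common bot/crawler marker. Illustration only; the production
  # filter would follow the webstatscollector definition.
  CRAWLER_MARKERS = ("bot", "crawler", "spider", "http://", "https://")

  def is_crawler(user_agent: str) -> bool:
      ua = (user_agent or "").lower()
      return any(marker in ua for marker in CRAWLER_MARKERS)

  # Example: drop crawler hits before computing the daily per-project sums.
  requests = [
      {"wiki": "en.wikipedia", "user_agent": "Mozilla/5.0 (X11; Linux x86_64) ..."},
      {"wiki": "en.wikipedia", "user_agent": "ExampleBot/2.1 (+http://example.org/bot)"},
  ]
  human_requests = [r for r in requests if not is_crawler(r["user_agent"])]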

User analyzes Pageview generators

As a product manager, researcher, analyst...

I can slice and dice who is generating Pageviews and visualize it;

so I can inform my project decisions.

Details
available in a cube / data warehouse (DW), visualized using a BI tool (a sketch of a possible record layout follows these details).
daily or monthly sums (perhaps the BI tool can aggregate by month)
data can be sliced by
Crawler (Yes/No)
Project, Country
Target Site {desktop, mobile, API, Zero}
Device (Android, iPhone, ... other) (this list should be limited and cheap to implement)
User-Agent (sanitized and limited list)
this data is private initially and could be made public later if there are no privacy issues
includes pages across all namespaces
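
For illustration, a minimal sketch (in Python) of the record layout such a cube could expose, covering the dimensions listed above. The field names and closed value lists are assumptions, not a settled schema.

  from dataclasses import dataclass

  # Hypothetical daily fact record for the "Pageview generators" cube.
  # One row per combination of dimension values; views is the measure.
  @dataclass
  class PageviewFact:
      date: str          # e.g. "2014-12-01"
      crawler: bool      # Crawler (Yes/No)
      project: str       # e.g. "en.wikipedia"
      country: str       # ISO country code
      target_site: str   # one of "desktop", "mobile", "api", "zero"
      device: str        # limited list: "android", "iphone", ..., "other"
      user_agent: str    # sanitized, limited list
      views: int         # daily sum of pageviews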

User analyzes Pageview hotspots

As a trouble shooter, community member, admin...

I can rank Pageviews per page (a minimal ranking sketch follows the details below);

so I can see what readers are paying attention to.

Details
available in a cube / DW, visualized using a BI tool.
for all pages in all namespaces
daily counts
Pageview counts for each page
Dimensions:
Crawler, Country, Target Site, User Agent, Device
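
A minimal ranking sketch in Python (pandas), assuming a daily per-page dataset with the dimensions above; the column names, file name and example date are hypothetical.

  import pandas as pd

  # Hypothetical daily per-page counts. Columns: date, page, namespace,
  # crawler (0/1 flag), country, target_site, user_agent, device, views.
  counts = pd.read_csv("daily_page_counts.csv")

  # Rank pages by non-crawler views for a single day, across all namespaces.
  top_pages = (
      counts[(counts["date"] == "2014-12-01") & (counts["crawler"] == 0)]
      .groupby("page")["views"]
      .sum()
      .nlargest(25)
  )
  print(top_pages)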


Existing pageviews definitions and infrastructure

There are two existing pageviews definitions used in production, each one doing different things. The first is for high-level metrics (the strategic questions): this produces a variety of aggregated numbers, such as pageviews broken down by region, or pageviews on mobile versus pageviews on desktop. It has several known issues. Of particular note is that it does not consume and track app traffic, which has an implication for the first open question. It does not capture a lot of traffic from certain projects with multiple dialects in use, which has an implication for how we understand the geographic distribution of our traffic. It under- and over-counts in a few areas (mainly by not filtering crawler traffic), which has implications for all three strategic question examples.

The second is for per-page counts - the community or research-focused questions. This produces aggregated counts of requests for each URL. One major limitation here is that it does not include mobile traffic, even mobile web traffic, which makes it difficult to accurately represent how reader attention is focused (and how much reader attention is focused) on particular articles.

No longer true. After the webstatscollector logic was migrated to Hive, new hourly dumps exist that include mobile and zero traffic. Erik Zachte (WMF) (talk) 00:20, 3 December 2014 (UTC)
For the record: This comment by Erik refers to the pagecounts-all-sites dataset (later discontinued in 2016). Regards, Tbayer (WMF) (talk) 21:28, 26 April 2018 (UTC)

Recommendations for the new definitions

Given the mismatch between the existing definitions and the range of use cases we're tasked with providing data for, it's clear that a new definition is needed. Specific filters will be tested and set down in detail after review, but as a high-level summary, this definition:

  1. Should produce two streams of data - one “high-level” breakdown of overall pageviews by factors including access method and geographic area, and one “low-level” breakdown of pageviews by page, potentially but not necessarily also including access method, geographic area and other factors;
  2. Should be able to accurately include and distinguish traffic from all of our streams - desktop, the Mobile Web, Zero and the Mobile Apps (a sketch of such a classification follows this list);
  3. Should be able to filter and aggregate by geolocation results and, for the same use case, should be project-neutral - in other words, should be able to capture traffic regardless of the individual project's settings.
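
To make recommendation 2 concrete, a minimal sketch in Python of tagging a request with an access method from its hostname. The hostname patterns are assumptions for illustration; the eventual definition would set its own rules, and the Mobile Apps would be identified by other signals (e.g. user agent or dedicated API endpoints) in a separate step.

  # Hypothetical access-method tagging based on the request hostname.
  def access_method(host: str) -> str:
      host = host.lower()
      if ".zero." in host or host.startswith("zero."):
          return "zero"
      if ".m." in host or host.startswith("m."):
          return "mobile web"
      return "desktop"

  print(access_method("en.m.wikipedia.org"))     # mobile web
  print(access_method("en.zero.wikipedia.org"))  # zero
  print(access_method("en.wikipedia.org"))       # desktop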


See also