Research:Defining monthly active editors, 2016

Active editors is a key metric used in the Wikimedia movement. Current definitions agree that an active editor is a registered user who makes 5 or more edits a month to content pages, but there are subtle differences in how this general rule is applied. The WMF Editing Analysis (Neil P. Quinn), Research (Leila Zia), and Analytics (Erik Zachte) teams are working to agree on a fully specified version of this definition.

Discussion

  • Which namespaces are considered content?
    • Decision: the main namespace, extra configured content namespaces, and any non-configured content namespaces whose absence would have a major impact on the numbers (the File namespace on Commons is the only current example). However, we should also contact communities who have de-facto content namespaces to see if they would be willing to update their configurations.
  • Are deleted pages considered?
    • Decision: yes. This makes the metrics more stateless, so the numbers are more consistent no matter when they are calculated. Erik used to be a strong opponent of this, on the grounds that edits to deleted pages were probably unproductive, but he did some research and found that the number of edits excluded this way was only about 0.5% of the number of reverted edits, which were not excluded. In addition, Neil and others felt that this metric should focus on capturing editing activity, rather than also trying to capture productivity, which could be better captured in a separate metric.
  • Are all content pages counted, or only countable ones (that is, are non-countable pages, mainly redirects, excluded)?
    • Decision: all content pages will be counted, which will significantly simplify calculation at the (acceptable) expense of making this definition less similar to the article definition. Erik believes that this is the primary cause of the discrepancy between the Wikistats and Editing Analysis numbers, so this will likely cause the Wikistats numbers to change substantially.
  • What do we do about the fact that a page's namespace can change long after an edit is made? (The analogous issue is one reason why we decided to count deleted edits.)
    • Decision: if feasible, an edit should be counted if the page was in a content namespace at the time of the edit, in order to make the metric more stateless. This is generally infeasible with calculation done using the dumps or the databases, but it is likely to be possible using the Hadoop Data Lake.
  • What do we do when sites start configuring already-existing namespaces as content?
  • How should bots be identified?
    • Decision: we should treat accounts as bots if they have ever had the bot user right (because some projects remove the right from inactive bots) or if they match the Wikistats bot regex (with the corresponding whitelist, which currently has three users on it). Any account that's ever been flagged on any wiki should be treated as a bot globally, since it's much more likely that a cross-wiki bot makes unflagged edits on a small wiki than that someone uses the same account for a bot on one wiki and normal editing on another (or that someone is mistakenly flagged as a bot on a small wiki).
  • Which wikis should be included in the global number?
    • Decision: we should adopt the list used by Wikistats, which includes the Foundation wiki and Meta where the Editing Analysis list does not.
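The "ever flagged anywhere" part of the bot decision above can be sketched as follows. This is a minimal illustration, not the teams' implementation: the `RightsChange` record and its field names are hypothetical stand-ins for whatever user-rights history is available, and the name regex and whitelist are left out.

```python
from typing import Iterable, NamedTuple, Set


class RightsChange(NamedTuple):
    """One hypothetical user-rights log entry (field names are assumptions)."""
    user_name: str
    wiki: str
    added_groups: frozenset
    removed_groups: frozenset


def globally_flagged_bots(log: Iterable[RightsChange]) -> Set[str]:
    """Return every account that has ever been given the bot right on any wiki.

    Removals are deliberately ignored: an account that once held the right
    stays classified as a bot, and the classification applies on every wiki.
    """
    return {entry.user_name for entry in log if "bot" in entry.added_groups}
```

Because the result is a set of names rather than (name, wiki) pairs, a bot flagged on one large wiki is also excluded on small wikis where it edits unflagged, which is the case the decision is meant to cover.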

Comparison of recent data

Month           Wikistats  Editing Analysis
May 2016        80 685     86 608
June 2016       78 207     83 323
July 2016       73 564     78 118
August 2016     74 202     78 556
September 2016  77 145     80 453
October 2016 79 329

Editing Analysis definition

Data source

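The numbers are calculated from an editor-month table using the following SQL query:

-- Count the users with 5 or more content edits in each month.
select
    month,
    count(*) as active_editors
from (
    -- Total content edits per user name and month, excluding bots
    -- and rows where local_user_id is 0.
    select month, user_name, sum(content_edits) as content_edits
    from staging.editor_month
    where bot = 0 and local_user_id != 0
    group by month, user_name
) global_edits
where content_edits >= 5
group by month;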

The editor-month table is aggregated from both the revision and archive tables, so the content_edits column includes edits to both live and deleted pages.
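The same aggregation can be sketched in Python to make the two-step logic easier to follow. This assumes the table holds one row per wiki, user, and month; the `wiki` field and the `EditorMonthRow` type are illustrative assumptions, while the other field names follow the columns used in the query.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class EditorMonthRow(NamedTuple):
    """One row of an editor-month table (the wiki field is an assumption)."""
    wiki: str
    month: str
    local_user_id: int
    user_name: str
    content_edits: int
    bot: int


def active_editors_by_month(rows: Iterable[EditorMonthRow],
                            threshold: int = 5) -> Dict[str, int]:
    """Count users whose summed content edits in a month reach the threshold."""
    # Step 1: total each user's content edits per month, mirroring the
    # inner query's filters (bot = 0 and local_user_id != 0).
    totals: Dict[tuple, int] = defaultdict(int)
    for row in rows:
        if row.bot == 0 and row.local_user_id != 0:
            totals[(row.month, row.user_name)] += row.content_edits

    # Step 2: count the users per month who meet the threshold.
    counts: Dict[str, int] = defaultdict(int)
    for (month, _user), edits in totals.items():
        if edits >= threshold:
            counts[month] += 1
    return dict(counts)
```

Under that assumption, a user with 3 content edits on one wiki and 2 on another in the same month counts as one active editor, because the edits are summed per user name before the threshold is applied.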

Pages included

Included are edits made to all pages which were in content namespaces when the aggregating query was run. This may be different from the namespace the page was in when the edit was made. Content namespaces are defined as the following:

  1. the main namespace
  2. other namespaces defined by wikis as content namespaces (currently 69 extra namespaces across the cluster).
  3. other namespaces that are not configured as content namespaces by their wikis, but that in the judgment of the Editing Analysis team contain user-facing content. Currently (Nov 2016), these namespaces are the File and Category namespaces on Commons, the Property namespace on Wikidata, and the Grants, Research, Iberocoop, and Participation namespaces on Meta.
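The three-part rule above could be expressed as a simple lookup. This is a sketch under stated assumptions: namespaces are identified by name rather than by number, the main namespace is represented by the empty string, and the `configured` mapping stands in for whatever each wiki's configuration reports. Only the manual list comes from the text.

```python
from typing import Dict, Set

MAIN_NAMESPACE = ""  # assumption: the main namespace is keyed by its empty name

# Namespaces treated as content by the Editing Analysis team although the
# wikis do not configure them that way (list from the text, Nov 2016).
MANUAL_CONTENT_NAMESPACES: Dict[str, Set[str]] = {
    "commonswiki": {"File", "Category"},
    "wikidatawiki": {"Property"},
    "metawiki": {"Grants", "Research", "Iberocoop", "Participation"},
}


def is_content_namespace(wiki: str, namespace: str,
                         configured: Dict[str, Set[str]]) -> bool:
    """Apply the three rules in order: main, configured, manually added."""
    if namespace == MAIN_NAMESPACE:
        return True
    if namespace in configured.get(wiki, set()):
        return True
    return namespace in MANUAL_CONTENT_NAMESPACES.get(wiki, set())
```

Keeping the manual list separate from the configured one makes it easy to drop an entry once a community updates its own configuration.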

Users included

Users are classified as bots and excluded if they have ever been in the bot user group on that wiki.

Wikis included in global value

All the wikis with the following site_group values are included:

('commons', 'incubator', 'mediawiki', 'sources', 'species',
'wikibooks', 'wikidata', 'wikinews', 'wikipedia', 'wikiquote',
'wikisource', 'wikiversity', 'wikivoyage', 'wiktionary')

This means that Meta-Wiki and other movement-internal wikis like the Wikimania wikis and affiliate wikis are not included.

Wikistats definition

Note: the definition on the data page describes the old situation and needs to be updated. Only namespace 6 on Commons is hard-coded; all other content namespaces are collected via the API.

Data source

Data is based on content dumps, so edits to pages which have since been deleted are not included.

Pages included

Non-countable namespaces and redirects are excluded.

Pages without an internal link are not excluded: Wikistats only processes stub dumps, and article content is not available in those dumps.

All edits to pages currently in the following namespaces are included:

  1. the main namespace
  2. As mw:Analytics/Metric definitions says: "Wikistats dynamically establishes extra content namespaces per wiki via the API (since July 2013, for all history)".
  3. on Commons, the File namespace (6) is added (hard-coded, as it did not appear in the API results). The Category namespace was also forced in, but has recently been excluded, because it is not included on any other wiki and because it raised questions about why Wikistats article counts for Commons exceeded the online article counts by 5 million.

Users included

Bots are excluded. Recap on how Wikistats detects bots:

  1. Is the name registered as a bot? In other words, is there a bot flag in the most recent dump of the user group table? Note that when a bot flag is removed in the user group table, all edits by that user name are no longer considered bot edits. (Does this ever happen?)
  2. Does it sound like a bot? (Nowadays, many wikis allow such user names only for bots.) Wikistats is rather restrictive about what sounds like a bot: Perl: if (($user =~ /bot\b/i) || ($user =~ /_bot_/i)) This means that a name sounds like a bot to Wikistats only if 'bot' is at the end of the string, is followed by a non-alphanumeric character, or is both preceded and followed by underscores (which MediaWiki often uses as placeholders for spaces). It would be interesting (but a bit more work) to break this down by language. I guess some languages are more prone to have 'bot' in real names than others.
  3. Is it known to be an unregistered bot? (English Wikipedia has a list of false negatives.) Erik copied that list long ago but does not keep it auto-updated.
  4. Is the name flagged as a bot on at least 10 wikis? If so, treat it as a bot on every wiki within the project (this was more relevant in the past, when user names could easily collide). The basic rationale is that bot registrations are often forgotten on smaller wikis. With SUL, it is unlikely that people use the same name for a bot on one wiki and for a regular user on another.
  5. Three names that sound like bots are hard-coded exceptions (people who wrote me to tell me they are human): Paucabot|Niabot|Marbot
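The checks above can be approximated in Python as follows. This is a rough sketch, not the Wikistats code: the two patterns are direct translations of the Perl regexes quoted in the list, but the function names, the `flagged_wikis` mapping, and the `known_unflagged` set are hypothetical, and `flagged_wikis` is assumed to cover the wikis of a single project only.

```python
import re
from typing import Mapping, Set

# Translations of the Perl patterns /bot\b/i and /_bot_/i.
BOT_NAME_PATTERNS = (
    re.compile(r"bot\b", re.IGNORECASE),
    re.compile(r"_bot_", re.IGNORECASE),
)

# The three hard-coded exceptions named in the list.
HUMAN_EXCEPTIONS = {"Paucabot", "Niabot", "Marbot"}


def sounds_like_bot(user: str) -> bool:
    """Name-based check, with the hard-coded human exceptions applied first."""
    if user in HUMAN_EXCEPTIONS:
        return False
    return any(pattern.search(user) for pattern in BOT_NAME_PATTERNS)


def is_bot(user: str, wiki: str,
           flagged_wikis: Mapping[str, Set[str]],
           known_unflagged: Set[str] = frozenset(),
           threshold: int = 10) -> bool:
    """Combine the flag, the 10-wiki rule, the manual list, and the name check."""
    wikis = flagged_wikis.get(user, set())
    return (
        wiki in wikis                 # flagged on this wiki
        or len(wikis) >= threshold    # flagged on at least 10 wikis
        or user in known_unflagged    # on the manually copied list
        or sounds_like_bot(user)      # name matches the regex
    )
```

The second pattern is needed because an underscore counts as a word character, so `\b` does not match between "bot" and a following underscore. The sketch also shows why real names are a concern: `sounds_like_bot("Abbot")` returns `True`, while `sounds_like_bot("Robotics")` and `sounds_like_bot("bot2")` return `False`.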

Wikis included in the global value

All the wikis with the following site_group values are included (entries marked [!] are additions to the Editing Analysis list above):

('commons', 'incubator', 'foundation' [!], 'mediawiki', 'meta' [!], 'sources', 'species',
'wikibooks', 'wikidata', 'wikinews', 'wikipedia', 'wikiquote',
'wikisource', 'wikiversity', 'wikivoyage', 'wiktionary')

"Canonical" definition

Documented at Research:Refining the definition of monthly active editors and mw:Analytics/Metric definitions.

Data source

Edits to pages which have since been deleted are not included.

Pages included

Only countable pages are included, using the standard definition of pages which contain at least one internal link and are not a redirect.

Only pages in content namespaces (the main namespace and extra configured content namespaces) are included.
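The countable-page rule can be illustrated with a rough text-based check. This is only an approximation under stated assumptions: it looks for an English `#REDIRECT` directive and a literal `[[...]]` link in the wikitext, whereas MediaWiki itself also recognizes localized redirect keywords. The function name and parameters are hypothetical.

```python
import re

# Simplifying assumptions: English redirect keyword, links written as [[...]].
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\b", re.IGNORECASE)
INTERNAL_LINK_RE = re.compile(r"\[\[[^\]]+\]\]")


def is_countable(wikitext: str, in_content_namespace: bool) -> bool:
    """Countable: in a content namespace, not a redirect, has an internal link."""
    if not in_content_namespace:
        return False
    if REDIRECT_RE.match(wikitext):
        return False
    return INTERNAL_LINK_RE.search(wikitext) is not None
```

A check like this needs the page text, which is why a definition built on countable pages is harder to compute from sources that carry only revision metadata.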

Users included

Bots are excluded and defined by the following characteristics:

  • users with a bot flag on that particular wiki
  • users with a bot flag on at least 10 other wikis (does this still make sense in a post-SUL world?)
  • users with names that match a regex for the word "bot" before non-alphabetic characters or at the end of the name

Wikis included in the global value

?