Research:New editor
- n = 1 edit
- t = 1 day
SET @t = 1; /* time cutoff in days */
SET @n = 1; /* edits threshold */
SET @start_date = "20140101"; /* January 1st, 2014 after midnight */
SET @end_date = "20140102"; /* January 2nd, 2014 at midnight */
/* Results in a set of "new editors" */
SELECT
user_id,
user_name,
user_registration
FROM
(
/* Get revisions to pages that are still visible (no namespace filter is applied) */
SELECT
user_id,
user_name,
user_registration,
SUM(rev_id IS NOT NULL) AS revisions
FROM user
INNER JOIN logging ON /* Keep only accounts that users created for themselves */
log_user = user_id AND
log_type = "newusers" AND
log_action = "create"
LEFT JOIN revision ON
rev_user = user_id AND
rev_timestamp <= DATE_FORMAT(
DATE_ADD(user_registration, INTERVAL @t DAY),
'%Y%m%d%H%i%S')
WHERE user_registration BETWEEN @start_date AND @end_date
GROUP BY 1,2,3
UNION ALL
/* Get revisions to pages that have since been deleted (archived) */
SELECT
user_id,
user_name,
user_registration,
SUM(ar_id IS NOT NULL) AS revisions /* Note that ar_rev_id is sometimes set to NULL :( */
FROM user
INNER JOIN logging ON /* Keep only accounts that users created for themselves */
log_user = user_id AND
log_type = "newusers" AND
log_action = "create"
LEFT JOIN archive ON
ar_user = user_id AND
ar_timestamp <= DATE_FORMAT(
DATE_ADD(user_registration, INTERVAL @t DAY),
'%Y%m%d%H%i%S')
WHERE user_registration BETWEEN @start_date AND @end_date
GROUP BY 1,2,3
) AS user_content_revision_count
GROUP BY 1,2,3
HAVING SUM(revisions) >= @n;
New editor is a proposed standardized user class used to measure the number of first-time editors in a wiki project over time. It's used as a proxy for editor activation, and to a lesser extent, editor productivity. A "new editor" is a newly registered user who makes contributions within a given activation period since registration.
Discussion
The majority of new user accounts registered on Wikipedia either never attempt an edit or fail to save one. So, when discussing the rate at which new editors are entering Wikipedia, it is more relevant to measure the subset of new users who end up editing.
The n edits threshold
What amount of activity is necessary? This choice is largely arbitrary. The higher the threshold, the fewer newly registered editors will cross it.
The t time cutoff
A newly registered user could, in theory, take years to make their first edit, and an observation taken at any point in time would truncate such future edits.[1] We therefore artificially censor all observations using a fixed time bound, t, measured from the moment the user signed up for a new account. By specifying a cutoff, we hold all new editors to the same standard, regardless of when they registered and when they make their first contribution.
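The combined effect of the n threshold and the t cutoff can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `is_new_editor` and the sample timestamps are hypothetical, and the upper bound is inclusive to mirror the `<=` comparison in the query at the top of this page.

```python
from datetime import datetime, timedelta

def is_new_editor(registration, edit_times, n=1, t_days=1):
    """Return True if at least n edits were saved within t_days of registration."""
    cutoff = registration + timedelta(days=t_days)
    # Edits after the cutoff are censored: they never count, however many there are.
    edits_in_window = [ts for ts in edit_times if ts <= cutoff]
    return len(edits_in_window) >= n

# Hypothetical user (made-up timestamps, for illustration only)
registered = datetime(2014, 1, 1, 12, 0, 0)
edits = [
    datetime(2014, 1, 1, 12, 30, 0),  # 30 minutes after signing up
    datetime(2014, 1, 5, 9, 0, 0),    # four days later, outside a 1-day window
]

print(is_new_editor(registered, edits, n=1, t_days=1))  # True
print(is_new_editor(registered, edits, n=2, t_days=1))  # False
print(is_new_editor(registered, edits, n=2, t_days=7))  # True
```

The same user is or is not a "new editor" depending on the parameters, which is why the values of n and t have to be fixed before counts can be compared across cohorts.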
Newly registered users only
An attached user is not considered a newly registered user and as a result is not counted as a new editor after completing any given number of edits.[2]
Since newly registered users may include accounts created for bot users if they are not registered by proxy, these users are also included in the new editor definition.
Edits across all namespaces
We propose to include in the definition of a new editor edits made to any namespace. When only edits to pages in a project's content namespace(s) are counted, we refer instead to a new content editor. In English Wikipedia, the only content namespace is the "article namespace", also known as namespace 0. Under the proposed "new editor" definition, contributions made to talk or user pages are considered edits that qualify towards "new editor" status.
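The difference between a "new editor" and a "new content editor" comes down to a namespace filter. A minimal sketch, assuming English Wikipedia's single content namespace (0); the names `CONTENT_NAMESPACES` and `qualifying_edits` are hypothetical and the sample data is made up:

```python
# Assumption: English Wikipedia, where namespace 0 is the only content namespace.
CONTENT_NAMESPACES = {0}

def qualifying_edits(edit_namespaces, content_only=False):
    """Count qualifying edits, given the namespace number of each edit."""
    if content_only:
        return sum(1 for ns in edit_namespaces if ns in CONTENT_NAMESPACES)
    return len(edit_namespaces)

# Hypothetical newcomer who edited only a user page (namespace 2)
# and an article talk page (namespace 1).
namespaces = [2, 1]
n = 1

counts_as_new_editor = qualifying_edits(namespaces) >= n
counts_as_new_content_editor = qualifying_edits(namespaces, content_only=True) >= n

print(counts_as_new_editor, counts_as_new_content_editor)  # True False
```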
Edits to deleted pages
The proposed definition includes activity on pages that are later deleted (including page creation edits) as counting towards "new editor" status. This ensures that we provide a quantitative measurement of activation that is independent of the productivity or quality of contributions by a newly registered user (which we aim to measure using different metrics). Including activity on deleted pages also ensures that this measurement is not subject to censorship (historical data doesn't change as a function of a future deletion event). See this related discussion on the implications of counting or discounting activity on deleted pages.
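Why counting deleted pages keeps historical data stable can be shown with a toy model. In this sketch, a deletion moves revisions from a "live" list to an "archived" list, loosely imitating the two tables combined by the query at the top of this page; the helper names and the revision data are hypothetical.

```python
def count_edits(live, archived, include_deleted=True):
    """Count a user's edits from live revisions plus, optionally, archived ones."""
    return len(live) + (len(archived) if include_deleted else 0)

def delete_page(page, live, archived):
    """Simulate a deletion: move every revision of `page` from live to archived."""
    still_live = [rev for rev in live if rev[1] != page]
    moved = [rev for rev in live if rev[1] == page]
    return still_live, archived + moved

# Made-up revisions as (revision id, page title) pairs
live = [(101, "Draft article"), (102, "Draft article"), (103, "Sandbox")]
archived = []

before = count_edits(live, archived)
live, archived = delete_page("Draft article", live, archived)
after = count_edits(live, archived)
after_excluding_deleted = count_edits(live, archived, include_deleted=False)

print(before, after, after_excluding_deleted)  # 3 3 1
```

With deleted revisions included, the count is the same before and after the deletion; with them excluded, a later deletion retroactively changes the user's count.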
Time lag
This metric can be generated t days after user registration. In the case of the WMF standardized parameterization, this is 1 day.
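In practice, the lag means a registration cohort is only complete once every member's activation window has closed. A small sketch, with hypothetical function names and made-up registration times:

```python
from datetime import datetime, timedelta

def measurable_from(registration, t_days=1):
    """Earliest moment at which the metric can be computed for one user."""
    return registration + timedelta(days=t_days)

def complete_cohort(registrations, as_of, t_days=1):
    """Keep only registrations whose full activation window has elapsed by `as_of`."""
    return [r for r in registrations if measurable_from(r, t_days) <= as_of]

registrations = [
    datetime(2014, 1, 1, 8, 0, 0),
    datetime(2014, 1, 1, 20, 0, 0),
    datetime(2014, 1, 2, 10, 0, 0),
]
as_of = datetime(2014, 1, 2, 12, 0, 0)

ready = complete_cohort(registrations, as_of, t_days=1)
print(len(ready))  # 1
```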
Analysis
There are three variables that need to be chosen in order to apply this metric:
- The value of n (the edits threshold)
- The value of t (the time cutoff)
- Whether edits outside content namespaces should be counted
To check how decisions about each of these parameters affect counts of the number of new editors over time, several variations of these parameters were tested on a sample of projects.
English Wikipedia
Portuguese Wikipedia
German Wikipedia
Comparison
The figures above help us visualize the effects of differences between parameters. When a proportion remains constant over time, it suggests that one metric is proportional to another. In that case, both versions of the metric capture the same trend information at different scales.
The proportions in #Content vs. all and #t = day vs. week are mostly horizontal. This suggests that the type of edits that count and the timescale considered when generating stats for new editors will not affect overall trends.
However, #n = 1 vs. 10 edits shows a strong trend in the proportion of editors who make it to each threshold over time. This result suggests that different values for the threshold can change what this metric measures.
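The reasoning behind reading the proportion plots can be sketched as follows: divide one variant of the metric by another, period by period, and check how much the ratio moves. The monthly counts below are made up for illustration and are not taken from any wiki; the helper names and the spread measure are assumptions, not the method used to produce the figures.

```python
def proportions(counts_a, counts_b):
    """Period-by-period ratio of metric variant A to metric variant B."""
    return [a / b for a, b in zip(counts_a, counts_b)]

def relative_spread(values):
    """(max - min) / mean: near 0 when the ratio is roughly constant over time."""
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / mean

# Made-up monthly counts of new editors under three parameterizations
n1_t_day = [1000, 1200, 1100, 900]
n1_t_week = [1250, 1500, 1375, 1125]  # a constant multiple of the daily variant
n10_t_day = [100, 96, 66, 36]         # shrinks faster than the n = 1 variant

flat = relative_spread(proportions(n1_t_day, n1_t_week))
trending = relative_spread(proportions(n10_t_day, n1_t_day))

# A flat ratio means both variants carry the same trend at different scales;
# a drifting ratio means the two variants measure different things.
print(flat < trending)  # True
```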
Discussion
The number of new editors drops about an order of magnitude for each step: 1, 5, and 10. While n = 1 appears to be largely flat after 2008, n = 5 and n = 10 tell a different story: one of a steady decline since 2008 for Portuguese and since 2007 for German (see #n = 1 vs. n = 10). The metric seems less sensitive to the value of t and to whether edits outside of content namespaces are counted (see #t = day vs. week and #Content vs. all).
Historical definition
Wikistats, the Wikimedia reportcard and the editor trends study define a "New Editor" or "New Wikipedian" as:
A registered and logged-in person (not known as a bot) who has made their 10th edit during the time-period under consideration. Number of edits is a cumulative count across all of time on one wiki.
The canonical restrictions apply to this definition: only edits on countable pages on content namespaces are considered.
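For contrast with the proposed metric, the historical definition can be sketched as a milestone lookup. This is a simplified illustration: it ignores the bot and countable-page restrictions, and the function name and sample edits are hypothetical.

```python
from datetime import datetime

def new_wikipedian_month(edit_times, threshold=10):
    """Return (year, month) of the user's threshold-th edit, or None if never reached."""
    ordered = sorted(edit_times)
    if len(ordered) < threshold:
        return None
    milestone = ordered[threshold - 1]
    return (milestone.year, milestone.month)

# Hypothetical user: nine edits in March 2010, then a tenth edit in June 2013
edits = [datetime(2010, 3, day) for day in range(1, 10)] + [datetime(2013, 6, 15)]

print(new_wikipedian_month(edits))      # (2013, 6)
print(new_wikipedian_month(edits[:9]))  # None
```

Note that the registration date never enters the computation: only the cumulative edit count and the date of the milestone edit matter.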
Issues
- Because this metric considers a user a "new editor" when the 10th edit milestone is reached, regardless of when the user registered, it doesn't inform us about the behavior of newly registered users. The historical definition of a new editor is a hybrid metric, partly driven by new user activation, partly by existing user retention.
- The canonical definition doesn't distinguish between genuine new users and attached users, i.e., users with an existing record of contributions to their home project who are starting to edit on another project for the first time.
- The definition doesn't include activity on pages that are later deleted as counting towards "new editor" status. See this related discussion on the implications of discounting activity on deleted pages.
- The definition applies a conventional 10-edit threshold and doesn't allow measuring how many users hit different thresholds that may be equally or more informative.
Comparison with New Wikipedians
The monthly count of New Wikipedians and New editors (ns=0 & t=24 hours) is plotted below for several wikis.
The factor of difference between New Wikipedians and New editors is plotted below to help visualize deviations. The following function explains how the factor plotted is related to New Wikipedians and New editors.
Notes
- ↑ See en:Censoring_(statistics), specifically "right censoring".
- ↑ Analysis of Wikipedia editor activation should be limited to users registered after 2006 because of inconsistencies in how the logging table recorded new registrations before 2006.