Research:Communication to New Editors 2004-2011

This page in a nutshell: Looking at a sample of each February's new editors and extrapolating to all new editors, it seems that the vast majority of newbies are still good faith editors. Vandalism has increased, though not uniformly, and spammers are rare but steadily increasing. Over the years, good faith newbies have become much less likely to be praised and much more likely to be criticised.

Contact

Steven Walling

Wikimedia Foundation

Maryana Pinchuk

Wikimedia Foundation

Research:Projects

This page documents a completed research project.

These datasets and analyses are mostly test cases for the rest of the qualitative work for the 2011 Summer of Research, but do suggest some interesting trends nonetheless. Our basic methodologies are described below.

Assessing quality of the first edits made by new editors, 2004 and 2011

How many contributions by new editors are made in good faith and are worth retaining or improving? Are most edits by newbies vandalism or spam, or are they made primarily in good faith?

We selected a randomized sample of first edits by contributors who joined in April 2004 and in April 2011, derived via a simple SQL query run against the Toolserver. We then analyzed these edits by hand, ranking each first edit on a 1-5 scale, with 1 being pure vandalism and 5 being a well-referenced content addition indistinguishable from the edit of an experienced contributor. We also noted when the first edit was not a mainspace contribution, and whether or not it was vandalism.

Results

Results are described at: "How much do new editors actually improve Wikipedia?" We'll publish the totals data soon, but the actual samples will not be distributed to avoid calling out individual editors by name.

April 2004 sample
April 2011 sample

The type and tone of user talk page edits directed at new editors within their first 30 days

The previous experiment gave us an idea of how many new editors made valuable contributions according to Wikipedia standards. As a follow-up, we wanted to look at how these good faith contributors were being communicated with on their user talk pages early on.

We prepared another random sample of several hundred edits made to user talk pages of new registered users on English Wikipedia from 2004 through 2011. These edits were made by other contributors within 30 days of a new person’s first edit.

The sample was gathered using the Toolserver, and the following query is an example of how the 2008 set was gathered. (If you want to run it on different years, simply change the timestamps.)

SQL query to get the sample

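USE enwiki_p;

-- Outer query: revisions made by other users to each sampled user's talk page
-- within 30 days of that user's first edit.
SELECT su.user_name, r.rev_id
FROM (
    -- Inner query: up to 500 users who registered in February 2008 and edited
    -- within 7 days of registering; t is the timestamp of their first edit.
    SELECT u.user_id, u.user_name, u.user_registration, MIN(r.rev_timestamp) AS t
    FROM user u
    INNER JOIN revision r ON u.user_id = r.rev_user
    INNER JOIN page p ON r.rev_page = p.page_id
    WHERE u.user_registration BETWEEN '20080201000000' AND '20080301000000'
      AND u.user_id BETWEEN 6335000 AND 6565000
      AND UNIX_TIMESTAMP(r.rev_timestamp) - UNIX_TIMESTAMP(u.user_registration) < (60*60*24*7)
    GROUP BY u.user_id, u.user_name, u.user_registration  -- one row per user, so MIN() applies per user
    LIMIT 500
) su
-- Note: page_title stores underscores in place of spaces, so this join only
-- matches user names that contain no spaces.
INNER JOIN page p ON su.user_name = p.page_title
INNER JOIN revision r ON r.rev_page = p.page_id AND r.rev_user != su.user_id
WHERE p.page_namespace = 3  -- namespace 3 is "User talk"
  AND UNIX_TIMESTAMP(r.rev_timestamp) - UNIX_TIMESTAMP(su.t) < (60*60*24*30);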
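Changing the timestamps for another year can be sketched as follows. This is a minimal Python illustration and not part of the original tooling; the helper name `february_window` is hypothetical. It only builds the two 14-digit `YYYYMMDDHHMMSS` strings used in the registration `BETWEEN` clause. The `user_id` bounds in the query also look specific to accounts created in 2008, so they would presumably need adjusting or removing for other years; that is an assumption, not something the query documents.

```python
from datetime import datetime

MW_FORMAT = "%Y%m%d%H%M%S"  # 14-digit timestamp layout used in the query above


def february_window(year: int) -> tuple[str, str]:
    """Return (start, end) timestamp strings covering February of `year`."""
    start = datetime(year, 2, 1)
    end = datetime(year, 3, 1)  # 1 March at midnight, matching the 2008 query
    return start.strftime(MW_FORMAT), end.strftime(MW_FORMAT)


print(february_window(2008))  # ('20080201000000', '20080301000000')
```

The two strings would replace the literals in the `user_registration BETWEEN ... AND ...` condition.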

The complete list of classification possibilities is below. If it was applicable, we noted multiple items per edit. For example: if the edit was the addition of a warning template, we marked "Template", "Tip, correction, or warning" and then assigned a tone depending on the contents of the template used.

Content discussion and/or debate: any edit whose purpose was primarily to discuss or debate the content of encyclopedia articles.
Template: any edit that was a template.
Welcome: any edit that was obviously intended to welcome a new editor, either in template form or personalized.
Tip, correction, warning: any tip about future editing, correction of errors in past editing procedure or technique, or warning to stop editing in violation of a policy or guideline.
Invitation: any invitation to edit a specific page or subject, such as a WikiProject invitation or a suggestion about an interesting topic.
Praise: any form of praise, from personalized text to barnstars.
Vandalism: any edit to the user talk page that was purely vandalism.
Socializing: any edit that did not discuss the project directly, but instead consisted of socializing.
Minor: any minor change in formatting, grammar, spelling, etc.
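Because one edit can carry several labels plus a tone, the coding scheme above maps naturally onto a small record type. The sketch below is only an illustration of that idea: the names `Category` and `CodedEdit` are hypothetical, the tone is kept as free text because this section does not list the tone scale, and the revision ID is a placeholder.

```python
from dataclasses import dataclass, field
from enum import Enum


class Category(Enum):
    CONTENT_DISCUSSION = "Content discussion and/or debate"
    TEMPLATE = "Template"
    WELCOME = "Welcome"
    TIP_CORRECTION_WARNING = "Tip, correction, warning"
    INVITATION = "Invitation"
    PRAISE = "Praise"
    VANDALISM = "Vandalism"
    SOCIALIZING = "Socializing"
    MINOR = "Minor"


@dataclass
class CodedEdit:
    rev_id: int
    categories: set = field(default_factory=set)  # several labels may apply
    tone: str = ""  # free text; the tone scale is not listed in this section


# Hypothetical example: a warning template judged to have a negative tone.
warning = CodedEdit(
    rev_id=0,  # placeholder, not a real revision
    categories={Category.TEMPLATE, Category.TIP_CORRECTION_WARNING},
    tone="negative",
)
print(sorted(c.value for c in warning.categories))
```

Storing the labels as a set keeps the "multiple items per edit" rule explicit and makes later counting by category straightforward.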

Results

Results are described at: "The Rise of Warnings to New Editors on English Wikipedia". The totals data for the two items compared is below, but the actual samples will not be distributed to avoid calling out individual editors by name.

Two types of edits made to the user talk pages of good faith editors, correlated with tone analysis
Year	Edits that included praise	Edits that added a template with a negative tone	Total number of edits analyzed
2004	36	0	251
2005	23	0	223
2006	26	11	243
2007	5	24	347
2008	7	33	235
2009	13	36	176
2010	3	50	209
2011	6	84	244

Year	Edits that included praise	Edits adding a template with a negative tone
2004	14.34%	0.00%
2005	10.31%	0.00%
2006	10.70%	4.53%
2007	1.44%	6.92%
2008	2.98%	14.04%
2009	7.39%	20.45%
2010	1.44%	23.92%
2011	2.46%	34.43%
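Each percentage is the year's count divided by the total number of edits analyzed that year. The short Python sketch below recomputes the percentages from the counts table above; the `counts` dictionary and the `percent` helper are illustrative names, and rounding to two decimal places is an assumption about how the figures were prepared.

```python
# (praise, negative-tone template, total edits analyzed), copied from the counts table
counts = {
    2004: (36, 0, 251),
    2005: (23, 0, 223),
    2006: (26, 11, 243),
    2007: (5, 24, 347),
    2008: (7, 33, 235),
    2009: (13, 36, 176),
    2010: (3, 50, 209),
    2011: (6, 84, 244),
}


def percent(part: int, whole: int) -> float:
    """Share of `part` in `whole`, as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


for year, (praise, negative, total) in counts.items():
    print(year, percent(praise, total), percent(negative, total))
```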

The totals, calculated as a percentage of all sampled edits for each year, were plotted in a chart (not reproduced here).

Analysis of the number of new editors participating in good faith, 2004-2011

As a follow-up to our sampling of the quality of first edits by newbies in 2004 and 2011, we made a similar qualitative assessment of whether new editors were participating in good faith overall. This is another attempt to understand how positively the influx of new contributors affects Wikipedia.

Partially working from the dataset of our previous analysis of user talk edits, we classified a random sample of new editors who arrived in February of each year from 2004 to 2011. We then simply noted whether each editor had an overall pattern of good faith participation throughout their edit history, or was clearly a vandal or spammer. We did so by correlating warnings and blocks from the community with a manual look at the composition of their edits. Practically speaking, duplicate accounts (i.e. sockpuppets) are quite difficult to identify, so only those that were blocked for the abuse of multiple accounts were counted as sockpuppets for our analysis.

Results

Percent of good faith, spammer, vandal, and sockpuppet editors 2004-2011, based on the sample. (Actual samples with analysis will not be distributed to avoid calling out individual editors by name.) See blog post.

Composition of the 2004-2011 sample
Year	Good faith editors	Vandals	Spammers	Sockpuppets	Total sampled
2004	189	0	1	0	190
2005	135	2	2	3	142
2006	193	11	2	2	208
2007	118	19	2	8	147
2008	143	25	4	6	178
2009	107	40	5	10	162
2010	118	34	7	14	173
2011	121	26	8	8	163
Total					1,363

Percent (of the whole sample)
Year	Good faith	Vandalism	Spam	Sockpuppets
2004	99.47%	0.00%	0.53%	0.00%
2005	95.07%	1.41%	1.41%	2.11%
2006	92.79%	5.29%	0.96%	0.96%
2007	80.27%	12.93%	1.36%	5.44%
2008	80.34%	14.04%	2.25%	3.37%
2009	66.05%	24.69%	3.09%	6.17%
2010	68.21%	19.65%	4.05%	8.09%
2011	74.23%	15.95%	4.91%	4.91%

After gathering the number and percentage of each type of contributor within our sample, we used statistics from stats.wikimedia.org to extrapolate the trends to the actual number of new editors each February. For example: if about 74% of new editors in our 2011 sample were acting in good faith, and there were 7,820 new English Wikipedians in February 2011, then about 5,805 of those editors are likely to have been participating in good faith (if the trends from the sample are correct).
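The extrapolation is a simple proportion: the sample share of a contributor type multiplied by that February's count of new editors. The Python sketch below works through the 2011 example; `project` is a hypothetical helper name, and using the exact sample ratio (121 of 163) rather than the rounded percentage is an assumption, since the text does not say which was used.

```python
def project(sample_count: int, sample_total: int, new_editors: int) -> int:
    """Scale a sample share up to the full cohort of new editors."""
    return round(new_editors * sample_count / sample_total)


# 2011: 121 of 163 sampled editors were good faith; 7,820 new editors that February.
print(project(121, 163, 7820))  # 5805
```

The same call with the vandal, spammer, or sockpuppet counts gives the other projected columns for a given year.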

New editors, from stats.wikimedia.org
Year	February new editors
2004	683
2005	1,417
2006	7,659
2007	13,828
2008	10,812
2009	9,551
2010	8,492
2011	7,820

Projected number of each type of contributor based on the sample trends
Year	Good faith	Vandalism	Spam	Sockpuppets
2004	679	0	4	0
2005	1,347	20	20	30
2006	7,107	405	74	74
2007	11,100	1,787	188	753
2008	8,686	1,519	243	364
2009	6,308	2,358	295	590
2010	5,792	1,669	344	687
2011	5,805	1,247	384	384