Research:Article feedback/Interim report

Article Feedback Tool, Version 5
Phase 1 Report - Preliminary Findings
(December 2011 - January 2012)

This report features preliminary findings for research that is still ongoing. As a result, some of the numbers below may be adjusted in later drafts, and new findings added as we continue to analyse the data. For more details, see the related volume report and feedback evaluation report. Here are some preliminary overview slides, to complement this report. More information can be found on the Article Feedback v5 Project hub.

In October 2011, the Wikimedia Foundation began investigating a replacement for the Article Feedback Tool, Version 4, a feature that allowed readers to provide information on the quality of articles using a five-star rating system. Criticisms of Version 4 included the potential for “gaming” the system, and the fact that a ratings system on its own didn't provide any information editors could actively use to improve the articles.

To address these criticisms, the Foundation began looking at designs that included a free text box, and eventually developed three, known internally as Options 1-3. Option 1 asked the reader “Did you find what you were looking for?”, phrasing suggested by one of our editors and chosen for its simplicity. Option 2 split the form, offering readers the opportunity to select various categories of feedback like “suggestion” or “praise”; this was based on's work, and our desire to have a more open-ended feedback form. Option 3 was a hybrid form, containing both a text box and the more traditional five-star rating system – a format that is also more familiar to our readers than entirely new designs.

All three designs were deployed on 0.6 percent of all Wikipedia articles, and “bucketed” so that a third of users saw each design exclusively on those articles. From this, we gathered 5,958 pieces of feedback between the 10th and the 24th of January. A randomised sample of feedback from each form was then given to a group of volunteer editors, who evaluated the posts for their usefulness and appropriateness. A survey of readers was also run through the designs to evaluate how useful our users considered each design. Both data-gathering methods, combined with basic usage data from the forms themselves, gave us useful information about each form. We are tremendously grateful to all our volunteers for their work.
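This report does not describe how the bucketing was implemented. As a minimal sketch of one common approach, the snippet below assigns each reader to one of the three forms by hashing an identifier, so that the same reader always sees the same design. The `user_token` value and the `assign_form` function are hypothetical names used for illustration only, not part of the actual tool.

```python
import hashlib

# The three designs under test, as named in this report.
FORMS = ("Option 1", "Option 2", "Option 3")


def assign_form(user_token: str) -> str:
    """Map a reader identifier to one form, deterministically.

    Assumption: readers can be told apart by some stable token
    (hypothetical here); the real mechanism is not described in the report.
    """
    digest = hashlib.sha256(user_token.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(FORMS)
    return FORMS[bucket]


# The same token always lands in the same bucket.
assert assign_form("reader-42") == assign_form("reader-42")

# Over many made-up tokens, each form receives roughly a third of readers.
counts = {form: 0 for form in FORMS}
for i in range(30000):
    counts[assign_form(f"reader-{i}")] += 1
print(counts)
```

Hashing rather than picking a form at random on each page view keeps the experience consistent for a given reader, which is what “saw each design exclusively” requires.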

Based on the data, we can see no design that significantly outperforms the others. One large distinction is that Option 3 produces substantially fewer feedback posts with text than Options 1 or 2. We are releasing this report in the hope that its readers will make their opinions known and help us reach a decision.

Option 1


The first form we tested, “Option 1”, asked readers “Did you find what you were looking for?”, with “yes” and “no” check boxes, and contained a non-mandatory free text field; readers were only required to use either the check boxes or the free text field, not both. This design was deployed in December on 0.3 percent of all Wikipedia articles, and the sample size was doubled on 4 January to 0.6 percent. Not only does the increased sample size give us more data to work with, it also helps avoid using data from an unrepresentative period of time – the form was initially deployed over Christmas, which usually sees substantial changes in reader numbers and behaviour.

2,630 posts
1,666 with text
37-67% useful to editors
66.3% of readers liked it

Since we doubled the sample size, a total of 2,630 items of feedback have been submitted through Option 1. Of these, 63.2% came with actual text. A group of editors used the Feedback Evaluation System (FES) to work out how many posts were, in their opinion, “useful”, that is, how many pieces of feedback added something of value to the wiki. These editors – or “hand-coders” – reviewed a randomised sample of 250 pieces of feedback. Their responses can be measured in one of two ways: through “strict” measurements or “soft” measurements. A strict measurement requires that all of the hand-coders who dealt with a particular piece of feedback found it useful, with none of them unsure, while a soft measurement simply requires that one or more of the hand-coders found the feedback useful.

Using strict measurements, 37.6% of the sample for Option 1 was useful; using soft measurements, it was 67.2%. The disparity has several explanations: editors have different opinions about what is useful, and there was some disagreement over whether things like praise should be counted as “useful” or not. A survey of readers was also run; according to this, 66.3 percent of readers liked the form.
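The difference between the strict and soft measurements can be illustrated with a short sketch. The verdict labels and the sample data below are made up for illustration; the report does not list the exact choices FES offered, and this is not the code used for the analysis.

```python
from typing import Dict, List

# Hypothetical verdict labels (assumption, not taken from FES itself).
USEFUL, NOT_USEFUL, UNSURE = "useful", "not useful", "unsure"


def is_useful_strict(verdicts: List[str]) -> bool:
    """Strict: every hand-coder who reviewed the post marked it useful."""
    return len(verdicts) > 0 and all(v == USEFUL for v in verdicts)


def is_useful_soft(verdicts: List[str]) -> bool:
    """Soft: at least one hand-coder marked the post useful."""
    return any(v == USEFUL for v in verdicts)


def usefulness_rates(sample: Dict[str, List[str]]) -> Dict[str, float]:
    """Percentage of posts counted as useful under each measurement."""
    if not sample:
        raise ValueError("sample must contain at least one post")
    total = len(sample)
    strict = sum(is_useful_strict(v) for v in sample.values())
    soft = sum(is_useful_soft(v) for v in sample.values())
    return {"strict": 100 * strict / total, "soft": 100 * soft / total}


# Made-up sample: each post was reviewed by two hand-coders.
sample = {
    "post-1": [USEFUL, USEFUL],          # useful under both measurements
    "post-2": [USEFUL, NOT_USEFUL],      # useful under soft only
    "post-3": [USEFUL, UNSURE],          # useful under soft only
    "post-4": [NOT_USEFUL, NOT_USEFUL],  # useful under neither
}
rates = usefulness_rates(sample)
print(rates)  # {'strict': 25.0, 'soft': 75.0}
```

Any post that passes the strict test also passes the soft test, so the strict percentage can never exceed the soft one; the gap between them reflects how often hand-coders disagreed or were unsure.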

Option 2

1,539 posts
1,539 with text[1]
36-62% useful
59.5% of readers liked it

The second form we tested, “Option 2”, clearly indicated what sort of feedback we were looking for by subdividing the form into “suggestion”, “praise”, “problem” and “question”. A reader is required to select one of those tabs (“suggestion” is checked by default) and then enter text into the free text box. Entering text was mandatory for this form, simply because selecting a category and then hitting “post your feedback” does not provide us with any useful data, unlike the check boxes for the other two forms.

A total of 1,539 pieces of feedback were submitted using Option 2; all of them included text, which is not a useful metric given that text was mandatory. When the hand-coders used FES on Option 2, between 36.8 and 62.8 percent of a random sample of 250 pieces of feedback were “useful”, depending on whether the strict or soft measuring system was used. The reader survey for Option 2 resulted in slightly less support than for Option 1: 59.5 percent of readers liked the form, compared to 66.3 percent for the first form.

Option 3

1,789 posts
1,148 with text
40-65% useful
65.5% of readers liked it

The last form, “Option 3”, asked readers “Is this article helpful?” and contained a non-mandatory text box. Like the existing system, it includes a five-star rating system, which meant we were comfortable with making the text box optional – even if readers only submit a rating, useful data can still be gathered.

A total of 1,789 posts were gathered with Option 3; of these, 64.1% contained suggestions from the free text box. This worked out numerically as substantially fewer posts with text than for the other options. Because of the lower number, our randomised sample for hand-coders was also smaller: a total of 143 posts were reviewed. From these 143 posts, hand-coders concluded that between 40.5 and 65 percent were useful, depending on the measurement we use. The result for the “strict” measurement was the best of all three options, but only marginally, while the feedback from the survey also indicates that readers slightly prefer Option 3 over Option 2, with 65.5 percent liking the form. Nevertheless, the number of text posts the form gathers is substantially lower than for the other two options.



Based on the data we have gathered, we can compare the use of each form: for each category, the form that performed the best is marked with an asterisk (*).

Metric                                Option 1   Option 2   Option 3
Number of posts                       2,630*     1,539      1,789
Number of posts with text             1,666*     1,539[1]   1,148
Percentage of useful posts (strict)   37.6%      36.8%      40.5%*
Percentage of useful posts (soft)     67.2%*     62.8%      65%
Appeal to readers                     66.3%*     59.5%      65.5%
Appeal to WMF staffers                78.5%      78.6%*     42.8%
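As a quick cross-check of the counts in the table, the sketch below recomputes the overall total and each option's share of posts with text. It uses only the post counts given in this report. Because the figures here are preliminary, the recomputed shares may not match the percentages quoted in the text to the last digit.

```python
# Post counts per form, taken from the comparison table.
posts = {"Option 1": 2630, "Option 2": 1539, "Option 3": 1789}
posts_with_text = {"Option 1": 1666, "Option 2": 1539, "Option 3": 1148}

# The three forms together account for all feedback gathered.
total_posts = sum(posts.values())
print(total_posts)  # 5958

# Share of all posts and share of posts with text, per form.
share_of_posts = {}
text_share = {}
for option, n in posts.items():
    share_of_posts[option] = 100 * n / total_posts
    text_share[option] = 100 * posts_with_text[option] / n
    print(
        f"{option}: {share_of_posts[option]:.1f}% of all posts, "
        f"{text_share[option]:.1f}% with text"
    )
```

Note that the text share for Option 2 is 100% by construction, since text was mandatory on that form, so it is not comparable with the other two.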

Aaron Halfaker and Dario Taraborelli, our researchers, have looked at the data and concluded that we cannot identify any option that substantially outperforms the others across the board. We are hoping that by presenting all the evidence to the community, you can help us reach a decision, informed not just by this data but by your personal preferences and by what you, as an editor, consider most important in feedback. We are tremendously grateful to all the editors who helped us develop the forms and gather this data, including Bensin, Utar, Sonia, Dcoetzee, FT2, Tom Morris, GorillaWarfare and numerous others. We couldn't have done this without your assistance.



The posts from the various options that we are evaluating were gathered between 10th January and 24th January. This limited period was used for several reasons. First, a small error with Option 3 meant that we are uncomfortable relying on data between 4th January and the 10th for that form – and rather than end up with three samples for different periods of time, we felt it was more appropriate to shorten the period all samples covered so that they are standardised.

Second, the three options were initially deployed over the holiday season. Previous years have shown that reader and editor activity during this period is vastly different from their actions during the rest of the year, and we wanted to avoid polluting the data by basing our decision on information from samples that are not representative of most of the year.

Posts were gathered from a sample of 22,480 articles; this represents 0.6 percent of Wikipedia articles, excluding disambiguation pages (which the community had felt uncomfortable deploying previous versions of the tool on). These posts were then evaluated by our editors using the Feedback Evaluation System (FES), designed by Aaron Halfaker. This system presented editors with a set of feedback – randomised between the three forms – and prompted them to mark whether it was useful or not useful, and then what sort of feedback it was: a suggestion, a question, an issue, abuse, and so on.

Seventeen volunteers used FES, with each piece of feedback being checked by at least two users to provide comparisons, and so that we could check rates of agreement and disagreement. It is worth noting that the feedback samples used for this analysis were not gathered in the same time period as the data on the number of feedback posts left, left with text, and so on; nevertheless, we are confident that the data we are presenting is reliable and adequately representative. A survey of staffers at the Wikimedia Foundation was also run, including both people involved in AFT5 development and those on other projects. The appeal of each option to these staffers can be found in the comparative table above.
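The report does not say which statistic was used to check agreement between hand-coders. As one simple possibility, the sketch below computes raw pairwise agreement: the fraction of reviewer pairs, across all posts, that gave the same verdict. The function name, the verdict labels and the sample ratings are all hypothetical.

```python
from itertools import combinations
from typing import Dict, List


def pairwise_agreement(ratings: Dict[str, List[str]]) -> float:
    """Fraction of reviewer pairs, over all posts, with matching verdicts.

    Assumption: simple percent agreement; the statistic actually used
    for the FES data is not stated in the report.
    """
    agreeing = 0
    pairs = 0
    for verdicts in ratings.values():
        # Compare every pair of reviewers who looked at the same post.
        for first, second in combinations(verdicts, 2):
            pairs += 1
            if first == second:
                agreeing += 1
    if pairs == 0:
        raise ValueError("need at least one post with two or more reviews")
    return agreeing / pairs


# Made-up ratings: each post was checked by at least two volunteers.
ratings = {
    "post-1": ["useful", "useful"],
    "post-2": ["useful", "not useful"],
    "post-3": ["not useful", "not useful", "not useful"],
}
agreement = pairwise_agreement(ratings)
print(agreement)  # 0.8
```

Raw agreement does not correct for agreement that would occur by chance, so a chance-corrected measure may be more appropriate for a full analysis.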

  1. Text was mandatory.