Research:WikiGrok/Test2

Test 2: All logged-in users, en.wiki

Pre-test (internal QA-only):

1.25wmf11 train deployed to en.wiki

Wed, 10 Dec 2014 19:00:00 UTC / Wed, 10 Dec 2014 11:00:00 PST

Test began:

Config change SWAT deployment ($wgMFEnableWikiGrok set to true)

Fri, 12 Dec 2014 00:41:00 UTC / Thurs, 11 Dec 2014 16:41:00 PST

Test ends:

Config change SWAT deployment ($wgMFEnableWikiGrok set to false)

Thurs, 18 Dec 2014 00:00:00 UTC / Wed, 17 Dec 2014 16:00:00 PST

Sampling

The test targets all logged-in users on English Wikipedia on mobile devices with screen width less than 768 pixels. At the start of the test, users are randomly assigned to one of two buckets via a token that persists across sessions (clearing the token resets the bucket assignment). The test lasts for 1 week.
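The bucketing described above can be sketched as follows. This is a minimal illustration, not the MobileFrontend implementation: the function names `new_token` and `bucket_for`, the token length, and the use of a SHA-256 hash to map a token to a bucket are all assumptions made for the example. The one property it is meant to show is that a persistent token always maps to the same bucket, and that a fresh token (for example after the old one is cleared) triggers a new random assignment.

```python
import hashlib
import secrets

# Hypothetical bucket labels; the page only says there are two buckets.
BUCKETS = ("A", "B")


def new_token() -> str:
    """Create a random token to be persisted on the client (assumed format)."""
    return secrets.token_hex(16)


def bucket_for(token: str) -> str:
    """Map a token to a bucket deterministically, so the assignment
    stays the same for as long as the token persists."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return BUCKETS[digest[0] % len(BUCKETS)]


token = new_token()
print(token, bucket_for(token))
```

Because the bucket is derived from the token rather than stored separately, clearing the token is the only way the assignment can change in this sketch.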

Treatments

Users in the pool of eligible participants see one of two versions of the WikiGrok widget when landing on articles where WikiGrok is activated. The start and end of the workflow are identical in the two conditions:

a landing screen with a call to action that the user needs to accept in order to proceed to the next step
a form with a WikiGrok question, the design of which depends on the experimental group the user is assigned to
a confirmation screen, displayed after clicking on the submit button and successfully storing a response (including a "Not sure" or NULL response).

The WikiGrok question is the only element in the workflow that varies across conditions. It takes one of two updated forms:

A binary question for group A
A tagging task with multiple possible values for group B.

Once a user has completed the WikiGrok workflow for a particular article, they will no longer see WikiGrok on that article in the future. The list of articles that they have completed WikiGrok on is stored in LocalStorage.

Claim selection

The total number of eligible pages is between 260,000 and 300,000.

Writer (36444 items)
- Item eligibility: instance of human, occupation writer, not occupation author
- Potential claims: occupation author
Actor (107047 items)
- Item eligibility: instance of human, occupation actor
- Potential claims: occupation television actor, occupation film actor
Album (155231 items)
- Item eligibility: instance of album
- Potential claims: instance of live album, instance of studio album
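The eligibility rules above can be expressed as a small rule table. The sketch below is only an illustration under stated assumptions: the `CAMPAIGNS` structure, the function name `eligible_campaigns`, and the representation of an item as a set of (property, value) label pairs are hypothetical and do not reflect how WikiGrok queries Wikidata.

```python
# Rules transcribed from the list above; the data structure itself is hypothetical.
CAMPAIGNS = {
    "writer": {
        "require": {("instance of", "human"), ("occupation", "writer")},
        "exclude": {("occupation", "author")},
        "potential_claims": [("occupation", "author")],
    },
    "actor": {
        "require": {("instance of", "human"), ("occupation", "actor")},
        "exclude": set(),
        "potential_claims": [
            ("occupation", "television actor"),
            ("occupation", "film actor"),
        ],
    },
    "album": {
        "require": {("instance of", "album")},
        "exclude": set(),
        "potential_claims": [
            ("instance of", "live album"),
            ("instance of", "studio album"),
        ],
    },
}


def eligible_campaigns(claims):
    """Return the campaigns whose required claims are all present
    and whose excluded claims are all absent."""
    return [
        name
        for name, rule in CAMPAIGNS.items()
        if rule["require"] <= claims and not (rule["exclude"] & claims)
    ]


item = {("instance of", "human"), ("occupation", "writer")}
print(eligible_campaigns(item))
```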

Data QA

Data quality issues for test 2 are tracked here.

Results

Top Level Statistics

Of the 166,888 pages on which the WikiGrok widget could be enabled, 9,173 unique pages had version (a) tested on them at least once. The corresponding number for version (b) is 9,013. By the end of the test, 6% of these pages had received at least one non-null response.

The top-level session statistics are as follows:

sessions with ...	version (a)	version (b)
page impression	22,693	21,622
widget impression	11,239	11,145
response	573	570
non-null responses	573
no-thanks	1,679	1,418
click-accept	732	697
success impression	646	598
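Session-level ratios can be derived directly from the table above. The sketch below is a rough aid for reading the numbers. The `SESSIONS` dictionary and the `rate` helper are hypothetical names, and the non-null responses row is left out because the table gives only one value for it. These ratios are computed over sessions, so they need not match the event-level transition probabilities shown in the funnel graphs.

```python
# Session counts copied from the table above.
SESSIONS = {
    "a": {
        "page impression": 22693,
        "widget impression": 11239,
        "response": 573,
        "no-thanks": 1679,
        "click-accept": 732,
        "success impression": 646,
    },
    "b": {
        "page impression": 21622,
        "widget impression": 11145,
        "response": 570,
        "no-thanks": 1418,
        "click-accept": 697,
        "success impression": 598,
    },
}


def rate(version, numerator, denominator):
    """Ratio of two session counts for one version of the widget."""
    counts = SESSIONS[version]
    return counts[numerator] / counts[denominator]


for version in SESSIONS:
    seen = rate(version, "widget impression", "page impression")
    accepted = rate(version, "click-accept", "widget impression")
    declined = rate(version, "no-thanks", "widget impression")
    print(f"version ({version}): widget seen {seen:.1%}, "
          f"accepted {accepted:.1%}, declined {declined:.1%}")
```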

The Funnel

The following two graphs show the funnel for the version (a) and version (b) tests. Each node of the graph is labeled with the corresponding widget name in the MobileWebWikiGrok schema. The numbers in parentheses show the number of times the widget is used, and the numbers on the connecting arcs show the probability of transitioning from one widget to the next.

Version (a) funnel

Version (b) funnel

Observations

In both versions (a) and (b), a page impression does not lead to a widget impression ~50% of the time. These are page views in which the user does not scroll far enough down the page to reach the widget. Further experiments with the placement of the WikiGrok widget on the page could help identify its optimal location.
In both versions (a) and (b), users do not interact with the WikiGrok widget ~80% of the time it is shown. This share is very large, and we need to understand why. Is it because users do not notice the widget (which would call for UX improvements)? Is it because editors do not find WikiGrok questions interesting? We will carefully monitor this number in the reader experiment.
Users are 2.5 times more likely to choose no-thanks than to accept the WikiGrok widget. Whether this number is high or low depends on the traffic each widget receives as well as the desired accuracy of responses. We will keep an eye on this number as we release the feature to readers.
A response is submitted ~90% of the time when the WikiGrok widget is accepted. Ideally, we want to push toward a 100% response submission rate, given that the questions are short.

Quality and Predictability of Responses

The ground truth for the questions asked by WikiGrok is not known unless it is hand-coded. However, we can use entropy to measure the predictability of responses.
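As an illustration of the idea, the sketch below computes the Shannon entropy (in bits) of the responses collected for a single question. The choice of Shannon entropy with a base-2 logarithm is an assumption, since the page does not specify the exact measure. The function name `response_entropy` and the sample responses are made up for the example.

```python
import math
from collections import Counter


def response_entropy(responses):
    """Shannon entropy, in bits, of the empirical response distribution.

    Low entropy means responses agree (predictable); high entropy means
    they are spread across the possible answers (unpredictable).
    """
    if not responses:
        return 0.0
    counts = Counter(responses)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Made-up responses for two hypothetical questions.
unanimous = ["yes", "yes", "yes", "yes"]
split = ["yes", "no", "yes", "no"]
print(response_entropy(unanimous), response_entropy(split))
```

For a binary question the value ranges from 0 bits (all respondents agree) to 1 bit (an even split), so questions can be compared by how close their entropy is to either extreme.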