Research:ORES-powered TeaHouse Invites

Shortcut:
R:OB6

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


Overview

This research aims to use machine learning to predict, within their first few edits, how "good" new editors to Wikipedia are, and to invite them to the TeaHouse based on those predictions. The technology builds on the ORES platform, aggregating predictions about edits into higher-level predictions about edit sessions, and finally into predictions about users.

Motivation

Newcomer retention is one of the largest problems facing Wikipedias today. One approach that has found success is to provide newcomer welcoming and mentoring programs such as the English Wikipedia's TeaHouse (or the French Wikipedia's Forum des nouveaux). However, getting new editors to those forums usually involves either a) inviting all newcomers, which risks overwhelming mentors and inviting vandals, or b) inviting a subset of newcomers based on heuristics, which could miss some good editors. Artificial intelligence or machine learning could mitigate those problems by inviting only the best newcomers, without humans having to sort through the hundreds or thousands of newly registered editors each day.


Background on HostBot

Since 2012, HostBot has worked in tandem with the English Wikipedia TeaHouse to do the repetitive work of inviting newly registered users. To keep the number of invitees manageable for TeaHouse hosts, HostBot limits itself to inviting just 300 of the approximately 2,000 users who qualify every day.

Proposed Initial Experiment: TeaHouse invites

In order to test this new technology, we propose an A/B test between the current iteration of HostBot and an AI-powered prototype (HostBot-AI), to determine whether the AI-powered approach can help retain more newcomers.

How does HostBot currently work?

Every day, HostBot searches for users meeting the following criteria:

  1. User registered within the last 2 days
  2. User has made at least 5 edits
  3. User has not been blocked
  4. User is not a registered bot
  5. User page does not contain any level 4 user warnings, or keywords that indicate likely bad faith or other highly problematic behavior.[1]

It then randomly selects 300 users meeting those criteria and invites them to the TeaHouse.
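For illustration, the heuristic selection step can be sketched in a few lines of Python. The input data structure and field names are hypothetical stand-ins for the data the real bot looks up; only the filter-then-random-sample logic mirrors the description above.

  import random

  DAILY_INVITE_LIMIT = 300  # HostBot's per-day cap

  def select_invitees_heuristic(recent_users):
      """Filter newcomers using the criteria above, then sample 300 at random.

      `recent_users` is assumed to be a list of dicts carrying the fields used
      below; how the real HostBot obtains this information is outside the
      scope of this sketch.
      """
      candidates = [
          u for u in recent_users
          if u["registered_days_ago"] <= 2   # 1. registered within the last 2 days
          and u["edit_count"] >= 5           # 2. at least 5 edits
          and not u["is_blocked"]            # 3. not blocked
          and not u["is_bot"]                # 4. not a registered bot
          and not u["has_serious_warnings"]  # 5. no level 4 warnings or bad-faith keywords
      ]
      return random.sample(candidates, min(DAILY_INVITE_LIMIT, len(candidates)))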

How would HostBot-AI work?

HostBot-AI would perform the same operation as HostBot (inviting users to the TeaHouse), but it would use AI to prioritize the editors it invites, rather than selecting them randomly.

The AI would prioritize the newcomers based on their predicted goodfaith-ness. That "goodfaith" definition comes from ORES, which predicts whether the "user was acting in goodfaith". It would alternatively be possible to prioritize editors by how non-damaging their edits are, as ORES was trained to predict that as well, but we believe that the goodfaith measure is more in line with the TeaHouse's values.

Another way that HostBot-AI would be different is that it would operate more quickly than HostBot. HostBot checks daily to see if users have crossed an edit threshold. HostBot-AI could make predictions for any user after their second or third edit (depending on the community's preference), in close to real time.
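As a rough illustration of the scoring step, the sketch below queries the public ORES v3 scores endpoint for a newcomer's first revisions and averages the goodfaith probabilities. The endpoint URL and response key path are assumptions based on the ORES API as publicly documented, and the deployed system actually uses the aggregated newcomer-quality model described under "Technical details", so the plain mean here is only a simplification.

  import requests

  ORES_URL = "https://ores.wikimedia.org/v3/scores/enwiki/"

  def goodfaith_probabilities(rev_ids):
      """Fetch the ORES goodfaith probability for each revision ID."""
      params = {"models": "goodfaith", "revids": "|".join(str(r) for r in rev_ids)}
      response = requests.get(ORES_URL, params=params).json()
      scores = response["enwiki"]["scores"]
      return [
          scores[str(r)]["goodfaith"]["score"]["probability"]["true"]
          for r in rev_ids
      ]

  def user_goodfaith_score(first_rev_ids):
      """Aggregate per-edit predictions into a crude user-level score.

      The production model aggregates edits into sessions and sessions into a
      user-level prediction; a plain mean is used here only for illustration.
      """
      probs = goodfaith_probabilities(first_rev_ids)
      return sum(probs) / len(probs)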

How different would these two methods be?

The AI-powered prototype would still respect the 300-users-per-day invite limit, but it would select the 300 users it had the highest confidence were goodfaith. For instance, if a new user makes 5 vaguely promotional or vain edits on a page, they might not be blocked, and HostBot might invite them. HostBot-AI, on the other hand, would predict that they are only moderately goodfaith and select more clearly goodfaith editors in their place. As about 2,000 users register every day on English Wikipedia, this prioritization could make a substantial difference.
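The difference between the two selection strategies then comes down to how the final 300 are drawn from the qualifying pool, as in this sketch (the `goodfaith_score` values stand in for the user-level predictions from the previous snippet):

  import random

  INVITE_LIMIT = 300  # shared daily cap

  def invite_heuristic(qualifying_users):
      """HostBot: a uniform random sample of the qualifying newcomers."""
      return random.sample(qualifying_users, min(INVITE_LIMIT, len(qualifying_users)))

  def invite_ai(scored_users):
      """HostBot-AI: the newcomers the model is most confident are goodfaith."""
      ranked = sorted(scored_users, key=lambda u: u["goodfaith_score"], reverse=True)
      return ranked[:INVITE_LIMIT]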

Example list of differences

See an example of the differences in the invite lists, based on a simulated output of the two bots on the same day: Research:ORES-powered_TeaHouse_Invites/Comparison.

What are the exact parameters of the experiment?

  1. The test conducted would be an A/B test between HostBot (A) and HostBot-AI (B).
  2. The exact statistical test and retention metric have not been finalized (we welcome your input). We would like to reuse as many of the statistical measures from previous papers[2] on the TeaHouse as possible, so that our results are comparable.
  3. The experiment would run for one month.
  4. Randomization would occur at the level of individual users, and both bots would run concurrently.
  5. The experiment would be "blind", meaning that the hosts and invitees would not know which method was being used.

What are the risks?

The main risk of conducting this experiment is that HostBot-AI will not be as effective at inviting high-quality users as HostBot, and as such we will fail to invite the more deserving editors to the TeaHouse. In the very worst-case scenario, 150 users/day * 30 days = 4,500 users would not get the TeaHouse invites they would otherwise have received. This calculation is an upper bound on the risk, as it is likely that some of the users HostBot-AI invites will be the same as HostBot's, which occurred 100 times in the simulated comparison.

Does committing to the experiment mean committing to switching to HostBot-AI?

No. For now we only want to run the experiment; if it is successful, a second community consultation could be held about instating the bot. If the experiment is unsuccessful, it would be a learning experience for the developers of the technology.

Who is running this experiment exactly?

I, Maximilianklein, would be the principal researcher and software engineer, in my capacity as a contractor for the Wikimedia Scoring Platform team. Experiment design consultation also comes from user:Jtmorgan, the maintainer of HostBot.

Technical details

More technical details can be found at mw:ORES/Newcomerquality.


Experimental Design

With the help of TeaHouse hosts, we decided to conduct an A/B test live on Wikipedia between HostBot (the heuristic version) and HostBot-AI. Randomization was coordinated such that both bots operated at the same time, but HostBot confined itself to users with odd-numbered user IDs and HostBot-AI to users with even-numbered user IDs. Furthermore, both limited themselves to inviting 150 users per day each, so that no more than 300 users total were invited to the TeaHouse on a given day.
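A minimal sketch of that randomization rule, assuming nothing more than the numeric user ID:

  def assigned_bot(user_id: int) -> str:
      """Route a newcomer to an arm of the A/B test by user ID parity."""
      return "HostBot-heuristic" if user_id % 2 == 1 else "HostBot-AI"

Each bot then applies its own 150-invite daily cap within its assigned half of the newcomer population.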

We used two sets of retention measures to determine the outcome of the experiment. The first came from the 2015 HostBot introduction paper, which showed that the heuristic algorithm increased retention above a control. Notice that in this experiment there is no true control group for either intervention; we rely on that previous result and compare the two interventions directly, which saves statistical power. We also added more standardized survival measures. Further key decisions we had to make were whether HostBot-AI should a) wait for users to make 5 edits before predicting on them, and b) invite them on a 24-hour cycle like the HostBot heuristic. We were excited to see how useful the speedier potential of AI was, so in the end we decided that HostBot-AI would predict users after their third edit and run every hour. Even though these decisions gave the invitee groups different demographic characteristics, the question of "which technique is better at selecting the top 300 daily new users?" could still be answered.

Measures

The number one concern that TeaHouse hosts had was to "retain" newcomers, that is, to have them continue editing after their initial joining; this is also known as "survival". The official retention measures from the Wikimedia Foundation are defined by a "trial period" (the initial editing spurt), the "survival period" (when they return), and how many edits in the survival period count as a return (a detailed explanation is available here). We also used some traditional survival measures to see how many days, on average, a user remained active after the intervention.
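In code, a Wikimedia-style retention measure amounts to counting edits in two consecutive windows. The sketch below assumes each user's edit times are available as days since registration; it is illustrative rather than the exact implementation used in the analysis.

  def survived(edit_days, trial_weeks, survival_weeks, min_edits):
      """True if the user made >= `min_edits` edits in the survival window.

      `edit_days` lists the user's edit times as days since registration.
      The trial window covers days [0, trial_weeks*7); the survival window
      covers the `survival_weeks` weeks that follow it.
      """
      trial_end = trial_weeks * 7
      survival_end = trial_end + survival_weeks * 7
      survival_edits = [d for d in edit_days if trial_end <= d < survival_end]
      return len(survival_edits) >= min_edits

  # e.g. the 4:4:1 configuration reported in the Results section:
  # survived(user_edit_days, trial_weeks=4, survival_weeks=4, min_edits=1)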

Power Analysis

Conducting a power analysis, we found that for an 80% chance of detecting a retention increase from 3% to 4% we needed 11,000 users per group. At a rate of 150 users per group per day, that meant we needed 73 days. Rounding up, we asked the community for a 90-day trial period to run the experiment, which was granted; the experiment started on February 1, 2019. More details on the power analysis can be found on Phabricator.
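For readers who want to adapt the calculation, the general shape of a two-proportion power analysis is sketched below using statsmodels. The project's actual assumptions, test choice, and significance level are documented in the Phabricator task and may differ, so this simplified, default-settings version should not be expected to reproduce the 11,000-per-group figure exactly.

  from statsmodels.stats.power import NormalIndPower
  from statsmodels.stats.proportion import proportion_effectsize

  def users_needed_per_group(p_baseline, p_target, power=0.80, alpha=0.05):
      """Sample size per arm for a two-proportion comparison (normal approximation)."""
      effect_size = proportion_effectsize(p_target, p_baseline)
      return NormalIndPower().solve_power(effect_size=effect_size,
                                          power=power, alpha=alpha,
                                          alternative="two-sided")

  # The project's own calculation (documented on Phabricator) arrived at roughly
  # 11,000 users per group; the retention definition, alpha, and test it assumed
  # may differ from the defaults above.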

Results

Production Success

HostBot-AI was deployed as a once-hourly job that fetches the users registered in the last 24 hours and, if a user has 3 or more edits, predicts their non-damaging-ness using the newcomer-quality model. Users who are not initially invited are re-evaluated every time they make more edits, until they either qualify for invitation or age out of being a newcomer after 1 day. Users who qualified after our 150-user quota was reached were not invited, but were marked as "overflow"; the idea was to create a control group specific to the AI invites. However, only 50 users were collected as overflow, so unfortunately we could not get statistical power from it. HostBot-AI was deployed on the Wikimedia Labs infrastructure, with many thanks to the Wikimedia Cloud team. The production run went smoothly and accurately; in sum, 8,141 users received invites from HostBot-AI (and 5,605 from HostBot-heuristic). Code remains available at github.
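The control flow of that hourly job can be summarized in the following sketch. The helper callables and attribute names are hypothetical; only the flow described above (hourly run, 3-edit threshold, 1-day newcomer window, 150-invite quota with overflow, even user IDs only) is taken from the text.

  def hourly_invite_run(newcomers, score_fn, send_invite, mark_overflow,
                        already_invited, invited_today,
                        quota=150, edit_threshold=3, score_cutoff=0.9):
      """One hourly pass over users who registered in the last 24 hours.

      `newcomers` is an iterable of objects with `.user_id` and `.edit_count`;
      `score_fn`, `send_invite`, and `mark_overflow` stand in for the real
      bot's model call and on-wiki actions. `score_cutoff` is a placeholder
      for whatever qualification rule the deployed model applies.
      """
      for user in newcomers:
          if user.user_id in already_invited or user.user_id % 2 != 0:
              continue  # the AI arm handled even-numbered user IDs only
          if user.edit_count < edit_threshold:
              continue  # re-evaluated on a later run if they keep editing
          if score_fn(user) < score_cutoff:
              continue  # may still qualify after more edits
          if invited_today >= quota:
              mark_overflow(user)  # intended as an AI-specific control group
              continue
          send_invite(user)
          already_invited.add(user.user_id)
          invited_today += 1
      return invited_today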

HostBot-heuristic continued to operate as normal, except that it limited itself to odd-numbered user IDs and invited a maximum of 150 users per day.

Retention Measure 1. Wikipedia's survival measures.

After our data matured (24 weeks had passed since our last newcomer was invited), our first stop was to re-run the measures used in the 2015 HostBot introduction paper[3], which showed that the heuristic approach could beat a control group in retention. These measures have three components: the trial weeks are the number of weeks after registration, the survival weeks are the window after the trial, and the edits are the number of edits required during the survival window to be considered survived.


trial weeks : survival weeks : edits    AI improvement (percentage points)    p-value
1 : 1 : 1                               -1.076                                0.467
1 : 3 : 1                                0.912                                0.026*
3 : 1 : 1                                1.750                                0.000*
3 : 1 : 5                                0.620                                0.053
4 : 4 : 1                                2.935                                0.000*
4 : 4 : 5                                2.425                                0.000*
8 : 16 : 1                              -2.261                                0.010*
8 : 16 : 5                              -1.459                                0.026*

Comparison of HostBot-AI and HostBot-heuristic with Wikimedia measures (* marks statistical significance at p < 0.05).

What we see from this set of results is that AI-invited users were retained more than heuristic-invited users, by about 1-2 percentage points, in the short (4-week) and medium (8-week) term. That means, for example, that 16% were retained rather than 14% in the 4-week-trial, 4-week-survival, one-edit case. Perplexingly, the trend reverses in the 24-week time frame. Except for the extremely short-term measure, all these differences are statistically significant under a t-test.

Retention Measure 2. Number of surviving days.

In addition to the Wikimedia survival measures, we ran other survival measures that ask, within a given time window, "what is the average number of surviving days" in each group? A user's number of surviving days is the number of days after registration on which they made their last edit within the window.
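Concretely, the per-user and per-group values can be computed as in this sketch, assuming each user's edit times are available as days since registration:

  def surviving_days(edit_days, window):
      """Days since registration of the user's last edit within `window` days.

      Returns 0 if the user made no edits inside the window.
      """
      return max((d for d in edit_days if d <= window), default=0)

  def average_surviving_days(group_edit_days, window):
      """Group-level mean, as reported per window in the table below."""
      return sum(surviving_days(e, window) for e in group_edit_days) / len(group_edit_days)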

window (days)          Heuristic avg. surviving days    AI avg. surviving days    p-value
survival_window_14     1.776                            1.712                     0.329
survival_window_28     4.070                            4.527                     0.003
survival_window_58     8.363                            10.107                    0.000
survival_window_168    22.456                           20.780                    0.034

Comparison of HostBot-AI and HostBot-heuristic with surviving-days measures.

Here we can see that with a 14-day view we don't have statistically significant differences. Under the 28- and 56-day windows, the AI-invited users stayed for half a day and almost 2 days longer, respectively. Yet, again in the long term, with a 168-day view the trend reverses and the heuristically invited users tend to stick around more.

Continuous view of survival measure

Another good view of these dynamics can be seen by comparing the survival curves of the two techniques. For completeness, I believe a negative binomial regression should be run to test the significance of the difference between these two curves; this is future work for now. Still, visually, you can see the short-term/long-term differences play out here again.

 
Graph of HostBot-AI survival compared to HostBot-heuristic.
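The negative binomial test mentioned above could look something like the following statsmodels sketch, regressing per-user surviving-day counts on a treatment indicator; the dataframe layout and column names are assumptions about how the analysis data would be arranged.

  import pandas as pd
  import statsmodels.api as sm
  import statsmodels.formula.api as smf

  # df is assumed to have one row per invited user:
  #   surviving_days : integer count of surviving days within the chosen window
  #   is_ai          : 1 if invited by HostBot-AI, 0 if by HostBot-heuristic
  def fit_negative_binomial(df: pd.DataFrame):
      model = smf.glm("surviving_days ~ is_ai", data=df,
                      family=sm.families.NegativeBinomial())
      result = model.fit()
      return result  # result.pvalues["is_ai"] tests the group difference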

For a full analysis of the results, please see the analysis notebook on PAWS[4] and the blog post[5].

Discussion

In general, the results tell opposite stories depending on the time frame we choose. If we had only looked at the measures up to 8 weeks after the intervention, the AI would have appeared useful; but on the order of 3 months, it seems it is not improving retention. This raises the question of why there are long-term/short-term differences. In our eagerness to launch the experiment we introduced several confounds that now need to be teased out.

Confounds

  • Qualifying number of edits. In order to be invited by the heuristic bot, a user must have made 5 edits and not have been blocked or warned on Wikipedia. In order to be invited by the AI, a user must have made 3 edits and not have been blocked. Therefore the populations treated by each method are different.
  • Timing. The reason we chose to only require 3 edits to be considered by HostBot-AI was to get to the user more quickly, possibly while they were still editing on the site. In addition, HostBot-AI invited users at most one hour after they last edited, while the HostBot-heuristic could take up to 24 hours.

With the notion that Wikipedians are "born not made", one possible explanation that Jonathan Morgan offered is that "we know that some people are likely to become long-term Wikipedia editors whether or not they get sent a Teahouse invite. These people make more edits in the first day, and get blocked/warned at a lower rate than other newbies. The HostBot-heuristic invite sample likely contains a larger proportion of these people than the HostBot AI sample, because they have to make 2 more edits to qualify, so some additional winnowing has already occurred." Therefore, in the long run, with the qualifying-edit difference, more "born" Wikipedians survive. Hypothetically, the reason we don't see this effect in the short term is that the AI and treatment-speed effects dominate, acting on Wikipedians who were prone to being "made".

References
