Research:Emerging Technical Communities

This page tracks the research work done as part of the technical contributors emerging communities metric definition project.

Project Goal

Tracked in Phabricator:
task T250284

The goal of this project is to identify emerging wiki communities that could benefit from more automation (bots/tools) to manage the growth of their content. Finding meaningful metrics for identifying these communities required a fair amount of data exploration. Readers interested in that exploration should first read the #Explorations section below.

Based on our findings we provide a set of next steps and recommendations for the Technical Engagement team.

Recommendation

Base Assumption

In the data exploration stage we looked at different ways to dissect the data. After some trials we excluded the bots/editors ratio as an indicator of an emerging community. We learned that technical contributors mainly work on content articles, and that the content size of a wiki is not always correlated with its editing activity level.

As a result of our exploration we settled on identifying "emerging communities that might benefit from more tooling" by looking mainly at two variables:

  1. The number of edits on that wiki that are made by non-bots, and
  2. The number of content pages that the wiki has.

We also look at the number of distinct bots that edit that wiki as a dependent variable. Our basic assumption is that, for a wiki to be healthy, automation is needed once the number of content pages is over a certain threshold.

How to identify a community that might be underserved by technical contributors (bot/tool builders)

We decided to choose "monthly non-bot edits" as the major metric to measure the need for automation/bots and "number of content pages" of a wiki as the secondary metric.

We first group wikis by their current numbers of "monthly bot editors", "monthly non-bot edits" and "content pages". Within this classification we look for outliers: wikis with a large number of content pages and a large number of "manual" edits but few bots, or wikis with a large number of content pages but few edits and few bots overall.

Such outliers might indicate a community that needs help developing tooling to keep up with the growth of its wiki. Before reaching out to the community in question, we need to look at other external markers, such as user pageviews for that wiki and its overall edit history. Both can be assessed via Wikistats. The "liveness" of talk pages is another useful marker for assessing whether there is a community behind the edits.
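The outlier screening described above can be sketched roughly as follows. This is only an illustration, not the method actually used for the analysis: the `WikiStats` record, the `classify` function, the category labels and the cut-off values are all hypothetical, with the cut-offs loosely borrowed from the rounded 25th and 75th percentile values reported in the tables further down this page.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative cut-offs (assumptions, not values prescribed by this page).
LARGE_CONTENT_PAGES = 80_000   # roughly the 75th percentile of content pages
HIGH_NONBOT_EDITS = 6_000      # roughly the 75th percentile of monthly non-bot edits
LOW_NONBOT_EDITS = 200         # roughly the 25th percentile of monthly non-bot edits
FEW_BOTS = 3                   # "few bots": at or below this many monthly bot editors


@dataclass
class WikiStats:
    wiki_db: str
    monthly_nonbot_edits: int
    content_pages: int
    monthly_bot_editors: float


def classify(w: WikiStats) -> Optional[str]:
    """Return an outlier label for a wiki, or None if it is not an outlier."""
    # Both outlier types require a large wiki with few active bots.
    if w.content_pages < LARGE_CONTENT_PAGES or w.monthly_bot_editors > FEW_BOTS:
        return None
    if w.monthly_nonbot_edits >= HIGH_NONBOT_EDITS:
        return "busy_but_few_bots"   # many manual edits, little automation
    if w.monthly_nonbot_edits <= LOW_NONBOT_EDITS:
        return "large_but_quiet"     # lots of content, few edits and few bots
    return None


# Made-up example wikis, for illustration only.
examples = [
    WikiStats("example_a", 7_000, 90_000, 2),
    WikiStats("example_b", 150, 90_000, 1),
    WikiStats("example_c", 7_000, 90_000, 12),
]
for w in examples:
    print(w.wiki_db, classify(w))
```

A wiki flagged by such a screen is only a candidate; the pageview, edit history and talk page checks mentioned above still apply before any outreach.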


What is the ideal number of bots?

Given that the relationship between non-bot edits and bot editors is not linear (Figure 1, Figure 2), we use percentiles to define the suggested number of bot editors that matches a wiki's editing activity level.

 
Figure 1: Correlation of non-bot edits and bot editors (data timeframe: 2020-04-01~2020-04-30)
 
Figure 2:  Correlation of non-bot edits and bot editors in low value area (Data timeframe: 2020-04-01~2020-04-30)
Table 1: Percentile of monthly nonbot edits, content page, monthly bot editors (Data timeframe: 2020-04-01~2020-05-31)
Percentile  Nonbot_edits in April  Nonbot_edits in May  Content page  Avg of monthly bot_editors
0.25        221                    246                  2869          2.5
0.5         962                    983                  10541         5
0.75        6250                   6384                 82666         8.5
1           5621913                5762249              6071412       319.5

* Metric definitions:
Nonbot edits: number of edits made in the given month by users who are not bots (by user group or by user name). Edits that have been reverted or deleted are included.
Content page: the total number of existing (non-deleted) pages in content namespaces across all wikis.
Monthly bot editors: the number of bots (by group or by name) that edited in the given month.
Average of monthly bot editors: the average number of monthly bot editors over 2 consecutive months. The number of bot editors in emerging communities fluctuates from month to month, so a 2-month average gives a better picture of overall bot activity in that community. The data in Table 1 is the average of April 2020 and May 2020.
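As a minimal sketch of how the percentile rows and the two-month bot editor average could be computed, the example below uses Python's standard `statistics` module. The function names and the sample numbers are hypothetical, and the choice of the "inclusive" interpolation method is an assumption: this page does not state which percentile method was used, so exact values may differ from Table 1.

```python
from statistics import mean, quantiles


def percentile_summary(values):
    """Return the 25th, 50th, 75th and 100th percentile of per-wiki values."""
    # Assumption: "inclusive" interpolation; the original analysis may differ.
    q25, q50, q75 = quantiles(values, n=4, method="inclusive")
    return {0.25: q25, 0.5: q50, 0.75: q75, 1: max(values)}


def avg_monthly_bot_editors(monthly_counts):
    """Average the number of bot editors over consecutive months."""
    return mean(monthly_counts)


# Made-up monthly non-bot edit counts for five wikis, for illustration only.
sample_nonbot_edits = [100, 200, 300, 400, 500]
print(percentile_summary(sample_nonbot_edits))

# A wiki with 2 bot editors in one month and 3 in the next averages 2.5.
print(avg_monthly_bot_editors([2, 3]))
```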


Table 1 shows the 25th, 50th, 75th and 100th percentiles of each metric. The percentiles of non-bot edits in two consecutive months (April 2020 and May 2020) are very consistent: the 25th percentile is 200+ edits, the 50th percentile is 900+ edits, and the 75th percentile is 6000+ edits. Our suggested ideal number of monthly bot editors for each percentile group is simplified as shown in Table 2. For a community with 6000+ monthly non-bot edits, the ideal number of monthly bot editors is 9. For a community with 900+ monthly non-bot edits, it is 5. For a community with 200+ monthly non-bot edits, it is 3.

Table 2: Suggested ideal number of bots
Percentile  Monthly nonbot edits  Content pages  Suggested Ideal monthly bot editors
0.25        200                   2800           3
0.5         900                   10000          5
0.75        6000                  80000          9
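Table 2 can be read as a simple lookup from a wiki's monthly non-bot edits to a suggested number of bot editors. The sketch below encodes that reading; the function names are hypothetical, and `bot_gap` is an illustrative helper that is not defined anywhere on this page. Wikis below the 200-edit tier get no suggestion, since Table 2 does not cover them.

```python
# (minimum monthly non-bot edits, suggested monthly bot editors), taken from
# Table 2 and ordered from the highest tier to the lowest.
SUGGESTED_BOTS = [(6000, 9), (900, 5), (200, 3)]


def suggested_bot_editors(monthly_nonbot_edits):
    """Return the suggested number of monthly bot editors, or None if no tier applies."""
    for min_edits, bots in SUGGESTED_BOTS:
        if monthly_nonbot_edits >= min_edits:
            return bots
    return None  # below the 25th percentile tier: Table 2 gives no suggestion


def bot_gap(monthly_nonbot_edits, avg_monthly_bot_editors):
    """Hypothetical helper: how many bot editors short of the suggestion a wiki is."""
    target = suggested_bot_editors(monthly_nonbot_edits)
    if target is None:
        return 0
    return max(0, target - avg_monthly_bot_editors)


print(suggested_bot_editors(6500))  # falls in the 6000+ tier
print(bot_gap(1000, 2))             # 900+ tier suggests 5, wiki has 2
```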

Explorations

The Technical Engagement team had a few questions about technical contributors in wiki communities. While the definition of technical contributors covers a variety of contributions in very different technical areas, this research focuses on contributors who write tooling to help with edits on a wiki. This tooling is normally referred to as "bots": automated scripts that run on our cloud environment and patrol Wikipedia performing tasks such as removing vandalism by reverting edits.

Is the ratio of bots/editors high in emerging communities but low in established communities?

Comparing the bots/editors ratio in emerging and established communities, a high ratio does not appear to be a strong indicator that a community is emerging. Established communities tend to have a low bots/editors ratio because they usually have a large number of human editors. However, some emerging communities can also have a low ratio when their number of bots is very small. For example, in Table 3, German Wikipedia (dewiki), an established community, has a bots/editors ratio of 0.09%, while Hindi Wikipedia (hiwiki), an emerging community, has a ratio of 0.10%. The two ratios are very close even though the wikis are at different development stages.

Table 3: Number of editors, edits, pages, and bot/editor ratio per Wikimedia project* (Data timeframe: April 1, 2020 through April 30, 2020)
wiki_db  editors  bot_editors  bot_editor_ratio  edits   content_pages
dewiki   80531    71           0.09%             168897  2429468
hiwiki   15449    15           0.10%             20805   141852

* Metrics definition:
Editors: number of registered users who made edits on the given wiki in the given month.
Bot Editors: number of users who are bots (by user group or by user name) and made edits on the given wiki in the given month.
Bot Editor Ratio: bot editors/editors
Edits: number of edits made on the given wiki during the given month. Edits that have been reverted or deleted are included among total edits.
Content pages: number of existing (non-deleted), non-redirect pages in content namespaces on the given wiki.
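The ratio in Table 3 is a plain division of the two counts. As a small worked example, the sketch below recomputes it from the dewiki and hiwiki rows; the function name is hypothetical.

```python
def bot_editor_ratio(bot_editors, editors):
    """Bot Editor Ratio as defined above: bot editors / editors."""
    if editors == 0:
        raise ValueError("a wiki with no editors has no defined ratio")
    return bot_editors / editors


# (editors, bot_editors) from Table 3, April 2020.
table3 = {"dewiki": (80531, 71), "hiwiki": (15449, 15)}

for wiki_db, (editors, bots) in table3.items():
    print(f"{wiki_db}: {bot_editor_ratio(bots, editors):.2%}")
# dewiki: 0.09%
# hiwiki: 0.10%
```

The two wikis differ by a factor of about five in both editors and bot editors, yet the ratios are almost identical, which is why the ratio alone cannot separate the two development stages.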


Figure 3 is a scatter plot of bot editors and editors on all Wikipedia projects, and Figure 4 zooms in on the low-value area. Dots in the upper-right corner represent established Wikipedia communities; dots in the lower-left corner represent emerging ones. Figures 3 and 4 show that bots and editors do not have a linear relationship, and that the bot/editor ratio can be the same in the high-value and low-value areas. The bots/editors ratio is therefore not a good indicator of whether a community is emerging or established, and it cannot serve as the metric for whether a community has enough tooling to thrive.

 
Figure 3: Correlation of editors and bot editors (Data timeframe: April 1, 2020 through April 30, 2020)
 
Figure 4: Correlation of editors and bot editors in low value area (Data timeframe: April 1, 2020 through April 30, 2020)

What are the bots doing? What are the types of their contributions?

The spreadsheet includes bot edits by namespace across all projects from 2020.01.01 to 2020.05.31. It shows that 65.8% of bot edits across all projects are made to content pages. The share of bot edits made to content pages varies between 0.02% and 100% from project to project. I listed a few interesting cases in Table 4. On English Wikipedia, 49.87% of bot edits are content edits. On Wikimedia Commons, 97% of bot edits are file edits. On Wiktionary and Wikidata, bots mainly focus on content editing.

Table 4: Bot edits by namespace * (Data timeframe: 2020.01.01 ~ 2020.05.31)
project            project_family  Category  Category talk  Content   File      File talk  Help  Help talk  MediaWiki  MediaWiki talk  Other   Project  Project talk  Talk    Template  Template talk  User    User talk  Grand Total  Content Edits%
en.wikipedia       wikipedia       91833     16781          1962241   89944     991        50    399        37         313             38016   764770   15027         209610  45891     10989          496897  190703     3934492      49.87%
ar.wikipedia       wikipedia       596209    80865          2619849   7552      7          77    5          0          4               5475    30403    179           93504   125069    22977          17220   1328785    4928180      53.16%
commons.wikimedia  commons         331504    388            5021      20624623  1543       160   0          29         48              27676   91794    1001          143     6975      63             224111  34532      21349611     0.02%
ca.wiktionary      wiktionary      0         0              40412     0         0          0     0          0          0               0       0        0             0       0         0              0       0          40412        100.00%
www.wikidata       wikidata        5         0              54007654  0         0          4910  10         32         36              744939  264339   745           923     2107      7              91969   1038       55118714     97.98%

* Metric definition:
Bot Edits: number of edits made in the given month by users who are bots (by user group or by user name). Edits that have been reverted or deleted are included.
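The "Content Edits%" column in Table 4 is the Content namespace count divided by the grand total. The sketch below recomputes it for the commons.wikimedia row, and also computes the File share for the same row; the function name is hypothetical and the dictionary simply restates the row values from the table.

```python
def namespace_share(edits_by_namespace, namespace):
    """Share of bot edits that fall in one namespace (0.0 if there are no edits)."""
    total = sum(edits_by_namespace.values())
    if total == 0:
        return 0.0
    return edits_by_namespace.get(namespace, 0) / total


# Bot edits by namespace for commons.wikimedia, copied from Table 4.
commons = {
    "Category": 331504, "Category talk": 388, "Content": 5021,
    "File": 20624623, "File talk": 1543, "Help": 160, "Help talk": 0,
    "MediaWiki": 29, "MediaWiki talk": 48, "Other": 27676,
    "Project": 91794, "Project talk": 1001, "Talk": 143,
    "Template": 6975, "Template talk": 63, "User": 224111,
    "User talk": 34532,
}

print(sum(commons.values()))                           # 21349611, the Grand Total
print(f"{namespace_share(commons, 'Content'):.2%}")    # 0.02%
print(f"{namespace_share(commons, 'File'):.1%}")       # 96.6%
```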

What’s the metric to identify the community which needs more technical supportive work?

Given what bots do, a community with a high volume of edits, or with a large body of existing content that needs to be maintained, will likely need more bot support. In measurable terms, the candidate metrics are the number of monthly edits and the total number of content pages. Because the number of monthly edits is inflated by existing bots, I chose non-bot edits to reflect the amount of organic editing. I also observed that monthly non-bot edits are not correlated with total content pages in some communities. The outliers in Figure 5 represent communities that have a large number of total content pages but currently have a low level of monthly editing.

 
Figure 5: Correlation of non-bot monthly edits and content pages (Data timeframe: 2020-04-01~2020-04-30 on all wikipedia projects)

* Metrics definition:
Non-bot edits: number of edits made on the given wiki during the given month by human users, i.e. users who are not bots by user name or group.
Content pages: number of existing (non-deleted), non-redirect pages in content namespaces on the given wiki.

Take one of the outliers, newwiki (Newari Wikipedia). It has more than 60 thousand content pages, which makes it a medium-sized Wikipedia. But its history shows that the pages were mainly created by bots, and the number of non-bot edits has never grown. When bots are not active on newwiki, monthly edits stay flat at a low level. Should we provide more bot support to a community that does not have many organic editors? I have no answer to that yet, but it is the reason I chose monthly non-bot edits as the major metric for measuring the need for bots.

 
Figure 6: History of content edits, bot content edits, total content pages on newwiki
 
Figure 7: History of editors on newwiki
 
Figure 8: History of bot editors on newwiki

When does a wiki community need to start thinking about bots? How does the editor trend correlate with the growth of bot editing?

* Data timeframe: 2001~2020-05-31
* Metric definitions:
Total content edits: number of content edits made in the wiki during the given month.
Bot content edits: number of content edits made by bot users by group or by name in the wiki during the given month.
Total content pages: the cumulative total number of content pages created without being deleted by the end of the given month.
Editors: number of registered users who made edits in the given month in the given wiki.


We built a Superset dashboard to explore this data (WMF internal only): https://superset.wikimedia.org/r/263

We studied ruwiki (a medium-sized Wikipedia), rowiki (a small Wikipedia) and svwiki (a large Wikipedia).

On ruwiki, when bot editing became active (> 1k) in September 2004, the number of editors was 268.

 
Figure 9: History of content edits, bot content edits, total content pages on ruwiki
 
Figure 10: History of editors on ruwiki

On rowiki, when bot editing became active (> 1k) in July 2005, the number of editors was 134.

 
Figure 11: History of content edits, bot content edits, total content pages on rowiki
 
Figure 12: History of editors on rowiki

On svwiki, when bot editing became active (> 1k) in June 2005, the number of editors was 624.

 
Figure 13: History of content edits, bot content edits, total content pages on svwiki
 
Figure 14: History of editors on svwiki

Among the three wikis, only ruwiki has a stable monthly editing pattern in terms of non-bot edits. Svwiki and rowiki still rely on bots to generate edits. The growth of human editors does not seem to be correlated with the growth of bot editing, and there is no clear answer to the question of when it is best to introduce bot editing into a community. Wikis have different growth trajectories for many reasons.
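The "bot editing became active" month reported for each wiki can be located with a simple scan over the monthly history. The sketch below is an illustration under stated assumptions: it reads "> 1k" as more than 1,000 bot content edits in a month, which is our interpretation rather than something this page spells out, and the function name and sample history are made up.

```python
def first_active_month(monthly_rows, threshold=1000):
    """Find the first month in which bot content edits exceed the threshold.

    monthly_rows: iterable of (month, bot_content_edits, editors) tuples,
    sorted by month. Returns (month, editors) or None if never exceeded.
    """
    for month, bot_content_edits, editors in monthly_rows:
        if bot_content_edits > threshold:
            return month, editors
    return None


# Made-up monthly history for a hypothetical wiki, for illustration only.
history = [
    ("2004-06", 12, 150),
    ("2004-07", 480, 190),
    ("2004-08", 1000, 230),   # exactly 1k does not count as "> 1k"
    ("2004-09", 2300, 268),
    ("2004-10", 5100, 301),
]

print(first_active_month(history))
```

Running the same scan over several wikis gives one (month, editors) pair per wiki, which is the comparison made above for ruwiki, rowiki and svwiki.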

Next Steps