User:Yair rand/Global Council distribution formula

Work in progress

The Strategy Process came up with the idea of having a "Global Council", which would be responsible for oversight of Wikimedia organizations and movement strategy, among other responsibilities. It has been suggested that this group would include "around 90-100 people", and would be "composed of both elected and selected members, in a way designed to reflect the breadth and diversity of participation not only in the Movement at present but also in communities we wish to serve".

Let's say sixty of those are elected directly by the communities, with the rest being affiliate-selected or appointed by the group itself. Where would those elected members come from? In this proposal, I set out some principles for creating a formula that would determine that, and illustrate how such a formula could work in practice.

Principles

An election works best when the participants can actually understand what people are saying, when the electors can communicate with the elected, when voters don't have to leave their home wiki, and when the elected individuals are familiar with the concerns of the electors. There are nearly 900 Wikimedia projects, in over 300 languages. Any kind of election that involves multiple projects or languages is going to be difficult. When possible, the election of any given member of the Council should involve a single community, while ensuring that every project can vote in at least one election.

How much representation should any given project have? The strategy documents suggest that the makeup should be representative of both the current contributors and the people we aspire to reach. As such, for this formula, we will weight the representation by both the number of current contributors to a project and the global number of speakers of the project's language. Additionally, because it matters whether anyone actually uses the project, the project's readership will be taken into account.

To ensure that smaller projects aren't left out, we could use some form of degressive proportionality. I'm going to use the Penrose method for this (weighting each community by the square root of its size measure), and then overlay it with the D'Hondt method for figuring out proportional representation.

Formula

Before we distribute the representatives among the languages, we need to distribute among the projects (Wikipedia, Wiktionary, etc.), but for simplicity I'll start with a system for distributing Wikipedia's representatives among the language editions, and then deal with how to distribute among the projects afterward.

How to measure editors? Both the number of "active editors" and the number of "very active editors" (100+ edits in the month) are relevant variables, so for a language's editor count let's just take the numbers of (editors of edition / total of all languages) and average the two types.

For readership, one could take the pageview count, but I think the number of distinct readers is more important/relevant. Unique devices is a decent proxy for that, so let's use (unique devices accessing edition / unique devices for all languages) as our readership variable.
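As a rough illustration of the two share-based variables described so far, here is a small Python sketch. The function names and the toy numbers are invented for this example; real inputs would be the per-edition counts of active editors, very active editors, and unique devices.

```python
def share(counts):
    """Divide each edition's count by the total over all editions."""
    total = sum(counts.values())
    return {edition: n / total for edition, n in counts.items()}


def editor_variable(active, very_active):
    """Average each edition's share of active and of very active editors."""
    a, v = share(active), share(very_active)
    return {edition: (a[edition] + v[edition]) / 2 for edition in active}


def readership_variable(unique_devices):
    """Each edition's share of all unique devices."""
    return share(unique_devices)


# Toy numbers for two made-up editions, not real statistics.
e = editor_variable({"xx": 300, "yy": 100}, {"xx": 30, "yy": 30})
r = readership_variable({"xx": 900, "yy": 100})
```

Because both variables are shares, each of them sums to 1 across all editions, which keeps them on a comparable scale when they are later multiplied together.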

For speakers: the relevant data points are the number of native speakers, and the number of second-language speakers. In general, we tend to prioritize providing people with access to content in their native language, so let's weigh native speakers twice as much as second-language speakers.^{[note 1]} As with the other variables, it will be divided by the overall sum for all languages. Regarding how/when to use this variable: while we want to represent people from languages where we don't have many contributors, as a practical matter, we can't do that if we don't have anyone from those languages, so whenever this number is used we need to simultaneously account for the editor numbers, at least a bit. The same is not true in the reverse.^{[note 2]}
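The speaker variable could be computed along the same lines. This is only a sketch: the function name, the `native_weight` parameter, and the toy figures are assumptions made for the example, and a language with no second-language figure is simply treated as having zero second-language speakers.

```python
def speaker_variable(native, second_language, native_weight=2):
    """Share of weighted speakers, counting native speakers twice."""
    weighted = {
        lang: native_weight * native[lang] + second_language.get(lang, 0)
        for lang in native
    }
    total = sum(weighted.values())
    return {lang: w / total for lang, w in weighted.items()}


# Toy figures (millions of speakers) for two made-up languages.
s = speaker_variable({"xx": 10, "yy": 40}, {"xx": 20})
```

Here "xx" has a weighted count of 2 × 10 + 20 = 40 and "yy" has 2 × 40 + 0 = 80, so their shares are one third and two thirds.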

With that in mind, let's get to the math. Call our editor variable "e", the readership "r", the speakers "s", and then we'll feed the output of this formula into the Penrose method and D'Hondt method:

⁴√(s × s × r × e) × 3 + e
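One way to read the chain "formula, then Penrose, then D'Hondt" is sketched below in Python. The text does not spell out exactly how the two methods are overlaid, so this sketch assumes that each item's score is raised to the power 0.5 (the Penrose square root) and that the result is used as the "vote count" in a standard D'Hondt allocation. The function names and sample scores are hypothetical.

```python
def score(s, r, e):
    """Fourth root of s*s*r*e, times 3, plus the editor share."""
    return 3 * (s * s * r * e) ** 0.25 + e


def dhondt(weights, seats):
    """Standard D'Hondt: each seat goes to the highest weight / (seats won + 1)."""
    won = {k: 0 for k in weights}
    for _ in range(seats):
        best = max(weights, key=lambda k: weights[k] / (won[k] + 1))
        won[best] += 1
    return won


def apportion(scores, seats, exponent=0.5):
    """Assumed overlay: damp the scores (Penrose), then allocate with D'Hondt."""
    weights = {k: v ** exponent for k, v in scores.items()}
    return dhondt(weights, seats)


# Made-up scores: the square root lets the small item win a seat.
scores = {"big": 0.64, "small": 0.09}
damped = apportion(scores, 4)
undamped = apportion(scores, 4, exponent=1)
```

With these toy scores the damped allocation gives the small item one of the four seats, while the undamped (purely proportional) allocation gives it none, which is the effect degressive proportionality is meant to have.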

If we run this directly, we end up with a number of languages that get zero representation, so let's go back and group those languages together, adding up their editor counts and such. Some moving things around might be necessary to ensure that every language gets a vote.^{[note 3]} Which languages should be grouped with which is pretty arbitrary, so I used general geographic regions, following the boundaries of existing Wikimedia regional groupings where possible.^{[note 4]}

On distributing between the projects (Wikipedia, Commons, Wiktionary, etc.): For this, we don't have the "speakers" variable, so we'll need to adapt some things. Without language speakers, we have no variable to guide our thinking about the "potential" future size of any project's community. As an inadequate proxy, let's put somewhat more emphasis on the current readership. Also, let's use a little less degressive proportionality, because it's probably not as important here^{[citation needed]}, so the exponent applied in the Penrose step changes from 0.5 to 0.75.

³√(r × r × e)
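The project-level variant could be sketched like this in Python. It assumes the 0.75 exponent is applied to the cube-root score in the same place where the language-level calculation applies the square root; the function names and sample shares are made up for the example.

```python
def project_score(r, e):
    """Cube root of r*r*e: readership counts twice, editors once."""
    return (r * r * e) ** (1 / 3)


def project_weight(r, e, exponent=0.75):
    """Assumed damping step: milder than the square root used for languages."""
    return project_score(r, e) ** exponent


# Two made-up projects with mirrored readership and editor shares.
reader_heavy = project_score(0.4, 0.1)
editor_heavy = project_score(0.1, 0.4)
```

Since readership enters the product twice, the reader-heavy project scores higher than the editor-heavy one even though the two pairs of shares are mirror images.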

Output

Wikipedia got 49 spots, 26 of which went to the top ten Wikipedias, and 11 of which went to groups of languages that wouldn't otherwise get any representation on their own, representing 11.8% of active Wikipedia editors. 3 went to Wiktionary, one for the English Wiktionary and the other two for all the rest combined. 3 went to Commons, 1 to all Wikibooks, 1 to all Wikisources, 1 to Wikidata, 1 to the MediaWiki community, and 1 to all the rest combined (Wikiquote, Wikiversity, Wikivoyage, Wikinews, Meta, Wikispecies, Incubator^{[note 5]}). In total, forty-four would be elected on a single wiki, and sixteen from groups of smaller wikis. The group elections would likely require a lot of translation work, and would be as difficult as the board elections in certain respects, but there's not much that can be done about that. The individual-wiki elections would ideally be administered entirely by the project itself, in whatever manner it decides.

The full list:

English Wikipedia: 6
Spanish Wikipedia: 3
Mandarin Chinese Wikipedia: 3
German Wikipedia: 2
French Wikipedia: 2
Russian Wikipedia: 2
Japanese Wikipedia: 2
Arabic Wikipedia: 2
Portuguese Wikipedia: 2
Italian Wikipedia: 2
Hindi Wikipedia: 1
Farsi Wikipedia: 1
Polish Wikipedia: 1
Korean Wikipedia: 1
Turkish Wikipedia: 1
Vietnamese Wikipedia: 1
Dutch Wikipedia: 1
Indonesian Wikipedia: 1
Bengali Wikipedia: 1
Ukrainian Wikipedia: 1
Hebrew Wikipedia: 1
Czech Wikipedia: 1
Remaining Central and Eastern European languages (Hungarian, Greek, Serbian, Azeri...): 2
Remaining East, Southeast Asia and Pacific languages (Thai, Yue Chinese, Malay, Tagalog...): 2
Remaining South Asian languages (Tamil, Marathi, Telugu, Urdu, Malayalam...): 2
Remaining Northern European languages (Swedish, Finnish, Norwegian, Danish...): 1
Remaining African languages (Egyptian Arabic, Swahili, Afrikaans...): 1
Remaining Ibero-American and Italian languages (Catalan, Galician, Basque, Venetian...): 1
Remaining North/Central Asian languages (Kazakh, Uzbek, Tatar, Bashkir...): 1
"Everything else", the remaining areas of Western Europe, West Asia, North America (Welsh, Kurdish, Haitian...), the constructed languages (Esperanto...), dead languages (Latin, Classical Chinese...), not-quite-languages (Simple English), none of which would get any places on their own: 1
Commons:^{[note 6]} 3
English Wiktionary: 1
All other Wiktionaries: 2
All Wikibooks: 1
All Wikisources: 1
Wikidata:^{[note 6]} 1
MediaWiki:^{[note 7]} 1
All other Wikimedia projects: 1

Source data is available here.^{[note 8]}

Notes

[note 1] I don't have access to a dataset for second-language speakers, so I'm leaving it out for the demonstration on this page. My estimate is that it would only shift roughly two or three spots in the results. For practical use, we could get this data (along with better data for native speakers) from Ethnologue.
[note 2] This principle is reflected in the formula in the balance between the "mixed" value and the lone editor count. The editor count holds some independent influence on representation; it is technically possible to get representation with a sufficient editor count regardless of the language's speaker count or readership. The ratio of 3-to-1 between the values in the formula is, like the ratio of values inside the "mixed" value, selected with the aim of striking a balance between the goals of representing the present and aspirational community distributions.

[note 3] To be precise, the procedure for moving things around into groups is as follows (in pseudocode):

Add all language editions as items in the list.
Add an empty item, which we'll call "EE" ("everything else"), to the list.
While any item in the list is at 0:
  Calculate distribution for the items in the list, using the formula.
  If any language in the list is at 0 representatives:
    Combine that language into its language group, which shall be an item in the list.
    If EE has contents, de-combine it back into its component groups.
  else if any language group is at 0 representatives:
    Add that language group to EE.
  else if EE is at 0 representatives:
    Add the smallest language group to EE.
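
The pseudocode above could be turned into runnable Python roughly as follows. This is a sketch under several assumptions that the pseudocode leaves open: only one zero-seat language (the lowest-weighted one) is moved per pass, a combined item's variables are the sums of its members' variables, the square root is applied before a standard D'Hondt allocation, and region names never collide with language names or with "EE". All function and variable names, and the toy data, are hypothetical.

```python
from collections import defaultdict


def score(s, r, e):
    # The language formula from the Formula section.
    return 3 * (s * s * r * e) ** 0.25 + e


def dhondt(weights, seats):
    won = {k: 0 for k in weights}
    for _ in range(seats):
        best = max(weights, key=lambda k: weights[k] / (won[k] + 1))
        won[best] += 1
    return won


def group_languages(stats, region_of, seats):
    """stats: language -> (s, r, e); region_of: language -> region name."""
    singles = set(stats)          # languages still standing alone
    groups = defaultdict(set)     # region -> languages merged into it
    in_ee = set()                 # regions folded into "everything else"

    def total(langs):
        # A combined item adds up its members' variables.
        return tuple(sum(stats[l][i] for l in langs) for i in range(3))

    while True:
        live = [g for g in groups if g not in in_ee]
        members = {l: [l] for l in sorted(singles)}
        members.update({g: sorted(groups[g]) for g in live})
        members["EE"] = sorted(l for g in in_ee for l in groups[g])
        weights = {k: score(*total(v)) ** 0.5 for k, v in members.items()}
        won = dhondt(weights, seats)

        zero_langs = [l for l in singles if won[l] == 0]
        zero_groups = [g for g in live if won[g] == 0]
        if zero_langs:
            lang = min(zero_langs, key=weights.get)
            singles.remove(lang)
            groups[region_of[lang]].add(lang)
            in_ee.clear()         # de-combine EE into its component groups
        elif zero_groups:
            in_ee.add(min(zero_groups, key=weights.get))
        elif won["EE"] == 0 and live:
            in_ee.add(min(live, key=weights.get))
        else:
            return won, members


# Toy data: every variable of a language equals one made-up share.
shares = {"A": 0.5, "B": 0.3, "c1": 0.09, "c2": 0.06, "d1": 0.05}
stats = {name: (x, x, x) for name, x in shares.items()}
region_of = {"A": "R0", "B": "R0", "c1": "R1", "c2": "R1", "d1": "R2"}
won, members = group_languages(stats, region_of, seats=4)
```

In this toy run with four seats, the three small languages cannot win a seat alone or within their regions, so they end up sharing a single seat through EE. If no language ever needs grouping, the function returns with an empty EE holding zero seats, which a caller would simply drop.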

[note 4] The strategy transition documentation identified eight regional collaborations: Central and Eastern Europe, ESEAP, Indaba, Iberocoop, North America, South Asia, WikiArabia, and WikiFranca. Additionally, there is the Wikimedia Northern Europe collaborative. Two of the collaboratives (WikiFranca and WikiArabia) are each focused around a single language, and heavily overlap geographically with other regions (as well as each other), so will not be used for regional divisions here. The rest of the world comprises the remaining parts of Western Europe (which we'll consider as a region on its own), the remaining areas of Asia (which will be arbitrarily split down the Iran-Turkmenistan border into West Asia and North/Central Asia), and the Guianas in South America (which we'll lump with, say, North America). There are some overlaps between the regions, which are resolved as follows: The Baltics, which are included in both CEE and Wikimedia Northern Europe, are assigned to Northern Europe; the Spanish-speaking countries of North America, which are associated with both Iberocoop and North America, are assigned to Iberocoop.
[note 5] In this demonstration, Incubator is grouped together with other sister projects. In practice, it might be preferable for each Incubator test wiki to count as part of the relevant language group for the project that the test wiki is intended to be launched as.
[note 6] Both Wikidata and Wikimedia Commons pose a problem in that most people who consume their content do so outside the wikis themselves, and are thus not counted as readers by the metrics used here. On the other side, very many of the people counted here as editors of each project would not typically be considered as such. On Wikidata, many "active editors" were counted because of automatic edits resulting from an edit on another wiki, and similarly on Commons, very many active editors only did cross-wiki uploads. In such cases, the editors may have never even visited the project. More generally, there is some difficulty determining whether any given editor belongs to these projects. Because these issues seem to very roughly balance out, I've left it as-is here for the demonstration, but in practice, these projects should use different formulae from the other projects, to account for their circumstances.
[note 7] MediaWiki is a software project, so it doesn't have readers, or editors really, so I just counted the number of active developers as "very active editors", and tweaked the formula to ignore readership.
[note 8] Readership and very-active-editors data comes from the REST API. Native speaker data comes from the ASJP database. Active editors data comes from the lists on Meta and from Special:Statistics for multilingual projects.
