Research:Incubator and language representation across Wikimedia projects

Created: 19:07, 27 June 2023 (UTC)
Collaborators: HGhani-WMF, ILooremeta-WMF
Duration: 2022-December – ??

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


The Wikimedia Foundation supports more languages than any other large online platform.[1] Currently, more than 320 languages have one or more editions of open-content Wikimedia projects.[2] Another way of looking at that statistic, however, is that fewer than 5% of the world’s living languages currently have at least one edition of a Wikimedia project.[3] How can Wikimedians create new language editions? Via Wikimedia Incubator.

Goals of this project:

  • Develop metrics for the state of languages at Wikimedia
  • Develop metrics for better understanding Incubator
  • Develop knowledge gaps metrics for measuring language gaps

Progress on this project can be followed at T348246

Background

Why language?

Language representation is connected to the Foundation's mission and strategic goals as well as the Research team's Knowledge Gaps project.

The Foundation’s mission “is to empower and engage people around the world to collect and develop educational content under a free license or in the public domain, and to disseminate it effectively and globally”.[4] UNESCO estimates that as much as 40% of the global population does not have access to education in a language they speak or understand.[5] If effective educational content dissemination is central to the Foundation’s mission and receiving instruction in one’s home language has an across-the-board positive impact on learning,[5] then language representation is of central concern to the Foundation’s mission.

The Research team's Knowledge Gaps project draws from the Wikimedia movement's strategic goal of supporting “the knowledge and communities that have been left out by structures of power and privilege”.[6] In 2018-2019, the Research team began to advance knowledge equity with a research program to address knowledge gaps. The project aims "to deliver citable, peer-reviewed knowledge and new technology in order to generate baseline data on the diversity of the Wikimedia contributor population, understand reader needs across languages, remove barriers for contribution by underrepresented groups, and help contributors identify and expand missing content across languages and topics."[2] One of the representation gaps identified in the knowledge gaps taxonomy (for readers, content, and contributors) is language.

In order to understand language representation across Wikimedia projects, we would like to understand the following:

  • What is the state of languages at Wikimedia? What is our global coverage? (RQ1)
  • Are we reaching as many people as possible, by hosting content in the world’s biggest languages? (RQ2)
  • Are we serving the members of marginalized language communities? (RQ3)

Why Incubator?

The journey through Wikimedia Incubator is a key step for language communities to engage in new Wikimedia projects, and thereby new forms of knowledge in their languages. That is because Wikimedia Incubator is where new-language versions of Wikipedia, Wiktionary, Wikibooks, Wikivoyage, Wikiquote, and Wikinews are arranged, written, and tested before Wikimedia hosting. (Note: New-language versions of Wikiversity go to Beta Wikiversity, and new-language versions of Wikisource go to Multilingual Wikisource.) Once an Incubator project is deemed ready for Wikimedia hosting, it “graduates” from the Incubator and receives its own domain. The Language Committee determines whether a project is ready to graduate by assessing whether it meets the basic requirements and guidelines for Incubator projects. Upon graduation, the project’s content is exported to its new domain. This process is manual and can take multiple days, or sometimes weeks. (Note: Incubator was created in June 2006. Before then, test projects were launched and edited on Meta-Wiki.)

Obstacles related to Incubator have been identified by multiple stakeholders in the past, including Peter Gallert in his 2018 Wikimania talk, "Wikipedia for Indigenous Communities", and Amir Aharoni in his 2020 Celtic Knot talk, "How We Can Make the Incubator Better." These obstacles include: learning curve; prefixes; visual editor problems; incomplete Wikidata support; no content translation; no specific wikistats; lack of automation; and configuration inequity.

Given these challenges, what we would like to understand is:

  • What is the state of past and present projects in Incubator, and what implications does it have for language equity? (RQ4)
  • What obstacles prevent successful and/or timely graduation from the Incubator? Do these obstacles affect certain languages more than others? (RQ5)

Research questions and proposed variables

(RQ1) What is the state of languages at Wikimedia?

  • # and % of world's language speakers/readers with access to educational content via Wikimedia projects
  • # and % languages with (0,1,2,3,etc.) Wikimedia projects
  • # and % languages with projects in Incubator
  • # and % languages with projects that have been sent back to Incubator (i.e., closed)
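As a rough illustration of the "# and % languages with (0,1,2,3,etc.) Wikimedia projects" variable, the sketch below tallies how many languages have each number of hosted projects. The sample pairs, the reference language list, and the function name `project_count_distribution` are hypothetical and are not taken from the project's data or code; "qaa" and "qab" are ISO 639-3 private-use codes standing in for languages with no project.

```python
from collections import Counter

# Hypothetical sample data: (language code, project family) pairs for hosted wikis.
# "qaa" and "qab" are ISO 639-3 private-use codes used here as placeholders.
HOSTED_PAIRS = [
    ("af", "wikipedia"), ("af", "wiktionary"),
    ("pcm", "wikipedia"),
    ("fi", "wikipedia"), ("fi", "wikivoyage"), ("fi", "wikiquote"),
]
REFERENCE_LANGUAGES = {"af", "pcm", "fi", "qaa", "qab"}


def project_count_distribution(pairs, languages):
    """Map each project count to (number of languages, percent of languages)."""
    # Deduplicate pairs so a repeated (language, project) row is counted once.
    per_language = Counter(lang for lang, _project in set(pairs))
    # Languages absent from the pairs have zero projects.
    distribution = Counter(per_language.get(lang, 0) for lang in languages)
    total = len(languages)
    return {
        n: (count, 100 * count / total)
        for n, count in sorted(distribution.items())
    }


result = project_count_distribution(HOSTED_PAIRS, REFERENCE_LANGUAGES)
for n, (count, pct) in result.items():
    print(f"{n} projects: {count} languages ({pct:.1f}%)")
```

The percentages depend entirely on which reference list of languages is used as the denominator, so that choice would need to be stated alongside any reported figure.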

(RQ2) Are we reaching as many people as possible, by hosting content in the world’s biggest languages?

  • # and % world’s top languages with at least (0,1,2,3,etc.) Wikimedia projects
  • # and % world’s top languages with a Wikipedia, Wiktionary, etc.
  • # and % world’s top languages with projects in Incubator (incl. #, time in Incubator, etc.)

(RQ3) Are we serving the members of marginalized language communities?

  • # and % of world’s minority languages with at least (0,1,2,3,etc.) Wikimedia projects
    • Of those with zero, # and % with project(s) in Incubator (incl. time in Incubator)
  • # and % of world’s threatened languages with a Wikipedia, Wiktionary, Wikisource
    • Of those with zero, # with project(s) in Incubator (incl. time in Incubator)

(RQ4) What is the state of past and present projects in Incubator, and what implications does it have for language equity?

  • Average length of time an Incubator project spends in Incubator
  • Average length of time between an Incubator project's creation and first meaningful edits
  • Average length of time between an Incubator project's last meaningful edits and Incubator graduation
  • Variables predictive of time spent in Incubator
  • Variables predictive of successful graduation from Incubator
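The duration variables listed above could be computed along the following lines. This is a minimal sketch under stated assumptions: the records, dates, and the function name `graduated_durations` are hypothetical, and the prefixes use private-use placeholder language codes rather than real Incubator projects.

```python
from datetime import date
from statistics import mean, median

# Hypothetical records: (Incubator prefix, creation date, graduation date or None).
RECORDS = [
    ("Wp/qaa", date(2015, 3, 1), date(2019, 3, 1)),
    ("Wp/qab", date(2018, 6, 15), date(2020, 6, 15)),
    ("Wt/qaa", date(2020, 1, 1), None),  # still in Incubator
]


def graduated_durations(records):
    """Return days spent in Incubator for projects that have graduated."""
    return [
        (graduated - created).days
        for _prefix, created, graduated in records
        if graduated is not None
    ]


durations = graduated_durations(RECORDS)
print("mean days:", mean(durations))
print("median days:", median(durations))
```

Projects that have not yet graduated are left out here. An average over graduated projects alone understates typical time in Incubator when many projects remain there, so reporting the number of excluded projects next to the average is a reasonable safeguard.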

(RQ5) What obstacles prevent successful and/or timely graduation from the Incubator? Do these obstacles affect certain languages more than others?

These questions are being addressed as part of the Wikimedia Language engineering/Incubator conversations.

Data sources

This project uses data from multiple sources:

Additional details about these sources are provided in the source data folder of this project's GitLab repo.

As the list above shows, data related to Incubator projects and the Incubator process live in multiple places. In the table below, the Incubator process is outlined with links to the referenced data sources.

Table. Incubator process with linked data sources

Step 1. Request for new language version of a Wikipedia, Wiktionary, Wikibooks, Wikiquote, Wikinews, or Wikivoyage.

Language must have an ISO code.

Step 2.a. Request approved

(Continue to Step 3.)

Step 2.b. Request denied
Step 3. Test wiki created in Incubator

These wikis don’t have their own domain (they live under incubator.wikimedia.org, with the URL formatting of incubator.wikimedia.org/W[a-z]/[a-z]), and they don’t have a unique database in MariaDB (they share the incubatorwiki database).

Step 4. Test wiki is edited by volunteers
Step 5. Request for approval from Language Committee
Step 6.a. Language Committee deems criteria met for Wikimedia hosting, including finding an expert to validate content

(Continue to Step 7.)

Step 6.b. Language Committee deems criteria not met for Wikimedia hosting

(Go back to Step 4.)

Step 7. Test wiki graduates from Incubator

Its content is copied and migrated to its own domain.

Step 8. Wiki lives in the real world!

These wikis now have their own domain (e.g., af.wikipedia.org) and their own database name in MariaDB (e.g., afwiki).

(Step 9.a. No closure requested)

(Continue to Step 10.)

Step 9.b. Closure requested
Step 9.b.i. Closure request denied

(Continue to Step 10.)

Step 9.b.ii. Closure request approved

(Return to Step 3.)

Step 10. Wiki lives in the real world forever…
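The steps above can be read as a small state machine, which may help when reconciling records from different data sources. The sketch below is one possible encoding, not the project's own model: the stage names, the grouping of steps into stages, and the helper `is_valid_path` are assumptions made for illustration.

```python
from enum import Enum


class Stage(Enum):
    REQUESTED = "Step 1: new language version requested"
    DENIED = "Step 2.b: request denied"
    IN_INCUBATOR = "Steps 3-4: test wiki created and edited in Incubator"
    UNDER_REVIEW = "Step 5: approval requested from Language Committee"
    HOSTED = "Steps 7-8 and 10: graduated and hosted on its own domain"
    CLOSURE_REQUESTED = "Step 9.b: closure requested"


# Allowed moves between stages, following the table's "continue" and "go back" notes.
TRANSITIONS = {
    Stage.REQUESTED: {Stage.IN_INCUBATOR, Stage.DENIED},
    Stage.DENIED: set(),
    Stage.IN_INCUBATOR: {Stage.UNDER_REVIEW},
    Stage.UNDER_REVIEW: {Stage.HOSTED, Stage.IN_INCUBATOR},       # Step 6.a / 6.b
    Stage.HOSTED: {Stage.CLOSURE_REQUESTED},
    Stage.CLOSURE_REQUESTED: {Stage.HOSTED, Stage.IN_INCUBATOR},  # Step 9.b.i / 9.b.ii
}


def is_valid_path(path):
    """Check that every consecutive pair of stages is an allowed transition."""
    return all(nxt in TRANSITIONS[cur] for cur, nxt in zip(path, path[1:]))


# A wiki that is sent back once for more editing, graduates, and is later closed.
history = [
    Stage.REQUESTED, Stage.IN_INCUBATOR, Stage.UNDER_REVIEW, Stage.IN_INCUBATOR,
    Stage.UNDER_REVIEW, Stage.HOSTED, Stage.CLOSURE_REQUESTED, Stage.IN_INCUBATOR,
]
print(is_valid_path(history))
```

A check like this could flag records whose event order does not fit the documented process, for example a wiki that appears as hosted without ever having been in Incubator.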

Insights (in progress)

Overall

As of December 2023, 329 languages have at least one edition of a hosted Wikimedia content project (i.e., a Wikipedia, Wikibooks, Wikinews, Wikiquote, Wikisource, Wiktionary, Wikiversity, and/or Wikivoyage). "Hosted" means that a project has its own domain (e.g., en.wikipedia.org) as opposed to living in the Wikimedia Incubator, Wikiversity Beta, or Multilingual Wikisource. Twelve languages have all 8 possible hosted content projects; those languages are Chinese, English, Finnish, French, German, Greek, Italian, Japanese, Portuguese, Russian, Spanish, and Swedish.

As of December 2023, an estimated 1,076 languages have at least one edition of a Wikimedia content project (i.e., a Wikipedia, Wikibooks, Wikinews, Wikiquote, Wikisource, Wiktionary, Wikiversity, and/or Wikivoyage), either hosted or in test. "Hosted" means that a project has its own domain (e.g., en.wikipedia.org), while "test" refers to projects that live within Wikimedia Incubator, Wikiversity Beta, or Multilingual Wikisource. Forty-five languages are represented in all 8 possible content projects through some combination of hosted and test projects.
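To relate these counts to the share of the world's languages, the sketch below divides them by the estimate of about 7,000 living languages cited in reference 3. The function name `coverage_percent` is hypothetical, and because both the denominator and the 1,076 figure are estimates, the results are approximate.

```python
# Approximate number of living languages, per the estimate cited in reference 3.
LIVING_LANGUAGES_ESTIMATE = 7000


def coverage_percent(language_count, total=LIVING_LANGUAGES_ESTIMATE):
    """Return the share of living languages covered, as a percentage."""
    return 100 * language_count / total


hosted_only = coverage_percent(329)       # languages with a hosted project
hosted_or_test = coverage_percent(1076)   # languages with a hosted or test project

print(f"hosted: {hosted_only:.1f}%")
print(f"hosted or test: {hosted_or_test:.1f}%")
```

Counting test projects roughly triples the share of languages represented, though a test project is not equivalent to a hosted edition in terms of reader access.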

As of December 2023, there were many more test projects than hosted projects. The Wikipedia, Wiktionary, Wikivoyage, Wikisource, Wikinews, and Wikiversity projects all had more editions in test than were hosted. For instance, while there were 326 hosted Wikipedia editions, there were an estimated 698 Wikipedia editions in the Incubator (including 13 that were previously hosted and then closed). And while there were 168 hosted Wiktionary editions, there were an estimated 271 Wiktionary editions in the Incubator (including 23 that were previously hosted and then closed).

Wikipedia

Wikipedia sizes vary widely among the world's 20 most spoken languages,[7] from more than 6.7 million articles (English Wikipedia) to 1,200 articles (Nigerian Pidgin Wikipedia).

How do Wikipedia sizes compare to "language sizes" (i.e., the number of speakers of a language, including first-language speakers and speakers for whom the language is a second or other language)? Results of exploratory plotting suggest that Wikipedia’s coverage is largely unrepresentative of the world's language populations. For instance, while Indonesian has 33% more speakers than German, German Wikipedia has about 325% more articles than Indonesian Wikipedia (that is, roughly 4.25 times as many).
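Relative comparisons like the one above depend on which quantity serves as the baseline. The sketch below uses placeholder values of 425 and 100 (not real article counts) to show that a gap of "325% more" measured against the smaller value corresponds to about "76% fewer" measured against the larger one, and that a "percent fewer" figure cannot exceed 100. Both function names are hypothetical.

```python
def percent_more(larger, smaller):
    """How much bigger `larger` is, relative to `smaller` as the baseline."""
    return 100 * (larger - smaller) / smaller


def percent_fewer(smaller, larger):
    """How much smaller `smaller` is, relative to `larger` as the baseline."""
    return 100 * (larger - smaller) / larger


# Placeholder values chosen so the ratio is 4.25; they are not real article counts.
big, small = 425, 100

print(f"{percent_more(big, small):.0f}% more")     # baseline: the smaller value
print(f"{percent_fewer(small, big):.1f}% fewer")   # baseline: the larger value
```

Stating the baseline explicitly, or reporting a plain ratio, avoids ambiguity when comparing editions of very different sizes.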

It is worth noting that some low article counts can be attributed to a Wikipedia edition's "age". For instance, Nigerian Pidgin Wikipedia has the lowest article count; but it is also the "youngest" Wikipedia of these 20, having "graduated" from the Incubator in August 2022.[8] Future analyses and visualizations will control for project age.

References

  1. "Pillar 1: How We Are Working Toward Knowledge Equity". 2021-2022 Annual Report. Wikimedia Foundation.
  2. Per canonical-data/wiki/wikis.tsv 30 May 2023
  3. Percentage based on current estimates of about 7,000 living languages provided by Ethnologue and Linguistic Society of America.
  4. https://meta.wikimedia.org/wiki/Mission
  5. Global Education Monitoring Report Team. "If you don't understand, how can you learn?" February 2016. Policy Paper 24. UNESCO.
  6. Wikimedia Movement Strategy: Direction. https://meta.wikimedia.org/wiki/Strategy/Wikimedia_movement/2017/Direction
  7. Eberhard, David M., Gary F. Simons, and Charles D. Fennig (eds.). 2023. Ethnologue: Languages of the World. Twenty-sixth edition. Dallas, Texas: SIL International. Online version: http://www.ethnologue.com.
  8. 2021-2022 Annual Report. Wikimedia Foundation.