Research:Content Translation language imbalances

12:37, 24 February 2023 (UTC)
Duration:  2022-June – ??

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

The flow of translations between Wikipedia language editions shows extreme imbalances in favor of one direction of translations versus the other, when examining records produced by the Content Translation tool.[1]

We have begun to explore these translation imbalances and have found that they are probably produced by a combination of technical factors such as software design, and social factors such as perceived completeness and importance.

If these imbalances are undesirable, they might be counteracted by an experimental intervention.

Glossary

Specific technical terms used by this project:

Content Translation (CX) - a MediaWiki extension providing the main user interface for assisted translation between Wikipedia languages. Sometimes this term will also be used to include the server component.

CX service - the backend for Content Translation, a Node.js server which supports section editing, machine translation, and suggestions.

Draft translation - an article in the Content Translation workflow which has some in-progress text but is not yet available to readers.

Language pair - the source and target language of a translation or a suggestion.

Language proficiency - a translator's self-reported skill level in a given language, e.g. as recorded using the Babel extension.

Local wiki expertise - a translator's edit count (bucketed to a range) on the target wiki.

Non-primary language(s) - the languages a person uses less often and less proficiently. Also "second language" (L2).

Primary language(s) - the language(s) a person uses most frequently and proficiently. Also called, less accurately, the "native" or "first" language (L1), or "mother tongue".

Published translation - an article which has been published to the wiki, at the end of the Content Translation workflow.

Reverse translation - translation from a primary into a non-primary language. This is not necessarily less common than forward translation (from a non-primary into a primary language).

Source language - the language being translated from.

Suggestion - an article recommended for translation in the chosen language pair.

Target language - the language being translated into.

Translation ratio - either of two measures of imbalance in translation flow. For a single language, it is the number of translations out of that language divided by the number of translations into it. For a language pair, it is the flow in one direction divided by the flow in the reverse direction (so the ratio for language pair A -> B is A->B / B->A).

Universal Language Selector - a MediaWiki extension responsible for showing language pickers, and tracking which languages the user has chosen in the past.
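The two translation ratios defined in the glossary above can be sketched in a few lines of Python. This is only an illustration: the function names, the `flows` mapping of (source, target) to counts, and the sample numbers are all invented for the example and are not project data.

```python
from typing import Dict, Optional, Tuple

# (source language, target language) -> number of published translations
Flows = Dict[Tuple[str, str], int]


def language_ratio(flows: Flows, lang: str) -> Optional[float]:
    """Translations out of `lang` divided by translations into `lang`."""
    out_count = sum(n for (src, _), n in flows.items() if src == lang)
    in_count = sum(n for (_, tgt), n in flows.items() if tgt == lang)
    # The ratio is undefined when nothing was translated into the language.
    return out_count / in_count if in_count else None


def pair_ratio(flows: Flows, a: str, b: str) -> Optional[float]:
    """Flow a -> b divided by the reverse flow b -> a."""
    forward = flows.get((a, b), 0)
    reverse = flows.get((b, a), 0)
    # The ratio is undefined when there is no reverse flow at all.
    return forward / reverse if reverse else None


# Invented sample counts, for illustration only.
sample_flows: Flows = {
    ("en", "es"): 200,
    ("es", "en"): 2,
    ("es", "ca"): 50,
    ("ca", "es"): 5,
}

if __name__ == "__main__":
    print(pair_ratio(sample_flows, "en", "es"))   # 100.0
    print(language_ratio(sample_flows, "en"))     # 100.0
    print(pair_ratio(sample_flows, "en", "ca"))   # None (no flow recorded)
```

Returning `None` rather than infinity for a missing reverse flow is a design choice of this sketch; an analysis might prefer to drop such pairs or to smooth the counts.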

Research questions

TBD: These need refinement and discussion.

  • Is the imbalanced translation flow produced by endogenous translator preferences, or is it a structural artifact?
  • When machine translation becomes available for a language pair, does translation volume increase? Does published article quality decrease?
  • Is machine translation quality correlated with target language wiki size?
  • Are the arguments given for disabling machine translation into English still valid today?

Methods

This is a draft outline of work, not a plan yet.

  • Passive analysis of Content Translation historical logs
    • Done: Compare and visualize flows between all languages. What can be observed?
    • Look for correlations between translation flow and the relative values for each language in the pair: total articles, active editors, pageviews, ...
    • Compare smaller subsets of languages.
    • Segment all statistics according to whether the published translation originated in a suggestion, whether machine translation was explicitly used, and whether machine translators are available externally or internally for the language pair.
    • TBD
  • Passive analysis of Content Translation source code.
    • Done: How are suggested language pairs chosen?
  • Instrument Content Translation with temporary, additional, structured log events.
    • task T241833: Send an event identifying which group of inputs the suggested translation source language came from.
  • Interviews with translators
    • Learn about how perceived language importance informs choice of languages
    • Learn how software design affects choice of languages and workflow
  • Experimental intervention
    • For example, changing the suggested translation direction for a limited number of users, e.g. to translate away from the current language instead of into it.

Outreachy timeline

Wikimedia Foundation and Wikimedia Germany are providing resources to hire an intern through the Outreachy program, who will help with this research.

2023-03-06 - 2023-04-03 - Done - Outreachy initial contributions period

2023-05-29 - 2023-08-25 - Outreachy internship period

Policy, Ethics and Human Subjects Research

We have started to discuss the many questions raised by this project with the Wikimedia Language Engineering team. If we design an experimental intervention, it will be done in collaboration with this team, and the preparation steps may include a human subjects review to be sure we aren't causing harm to Wikimedians. For example, switching default suggested languages arbitrarily might lead to fewer people getting involved with translation or to unsuccessful outcomes after publishing.

Results

Overview of imbalances

Some language pairs show imbalances of 100:1 in the number of translations made in each direction with the Content Translation software. Translations seem always to flow from a dominant language towards the language with fewer wiki articles. Colonial relationships between languages are reproduced: for example, English towards Spanish, and Spanish towards Catalan. (TBD: publish raw data links, code, graphs, exceptions)

Data source:

R code:

library(circlize)  # provides chordDiagram()
chordDiagram(api.result.translate.json.pivot.selection, directional = 1, direction.type = c("diffHeight", "arrows"), link.arr.type = "big.arrow")


A Sankey diagram showing that English is the biggest source of translations to other language Wikipedias
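Judging only by its name, the object passed to the R call above appears to be a source-by-target pivot of translation counts. As a hedged illustration of that data shape (not the project's actual pipeline), the following Python sketch builds such a pivot from per-translation records. The function name and the sample records are invented for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def pivot_flows(records: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, int]]:
    """Count translations per (source, target) and return a square matrix
    as nested dicts: matrix[source][target] -> count."""
    counts = Counter(records)
    # Every language seen as a source or a target gets a row and a column.
    languages = sorted({lang for pair in counts for lang in pair})
    return {
        src: {tgt: counts.get((src, tgt), 0) for tgt in languages}
        for src in languages
    }


# Invented sample records: one (source, target) tuple per published translation.
sample_records = [("en", "es"), ("en", "es"), ("es", "ca"), ("es", "en")]

if __name__ == "__main__":
    matrix = pivot_flows(sample_records)
    for src, row in matrix.items():
        print(src, row)
```

Rows are source languages and columns are target languages, so an imbalance shows up as a large difference between `matrix[a][b]` and `matrix[b][a]`.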

Analysis of suggested translation language algorithm

On their first visit to the Translations page after enabling the Content Translation beta feature, a user sees suggestions of articles to translate. There are two algorithms at play: one chooses the pair of source and target languages between which to translate, and the other chooses which articles to show for translation. The analysis in this section focuses on the initial default choice of translation languages.

The code responsible for setting the default languages is CXDashboard.findValidDefaultLanguagePair. In rough outline, it takes all languages that the user has frequently set in the Universal Language Selector (retrieved with mw.uls.getFrequentLanguageList), picks the first one, and suggests translating from that language into the current wiki's language. The exact process is more complicated:

Activity diagram detailing the Content Translation calculation to find a default suggested translation language pair.

TBD: discuss alternatives to this algorithm, such as randomizing across all valid language pair permutations or recommending multiple pairs, and instrumenting the algorithm output

The target language strongly defaults to the current wiki language, and then a source language must be chosen which is different from the target. The source language defaults first to the interface language (set in MediaWiki user preferences), or to the browser language preferences and Accept-Language headers. Languages explicitly (?) chosen with ULS are also retrieved from localStorage under 'uls-previous-languages'.
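The rough outline described in this section can be sketched as follows. This is a simplified illustration in Python, not the actual JavaScript of CXDashboard.findValidDefaultLanguagePair: the function name and parameters are hypothetical, the ordering of `frequent_languages` is left to the caller, and fallback values are deliberately not modelled because they are still to be documented.

```python
from typing import Optional, Sequence, Tuple


def find_default_language_pair(
    wiki_language: str,
    frequent_languages: Sequence[str],
) -> Tuple[Optional[str], str]:
    """Return a (source, target) suggestion.

    The target is always the current wiki's language. The source is the
    first frequently used language that differs from the target, or None
    if no such language exists (real fallbacks are not modelled here).
    """
    target = wiki_language
    for candidate in frequent_languages:
        if candidate != target:
            return candidate, target
    return None, target


if __name__ == "__main__":
    # A user on Catalan Wikipedia whose frequent languages are ca, es, en.
    print(find_default_language_pair("ca", ["ca", "es", "en"]))  # ('es', 'ca')
    # No usable source language: the sketch returns None for the source.
    print(find_default_language_pair("ca", ["ca"]))  # (None, 'ca')
```

Because the target is fixed to the current wiki language, a sketch like this can only ever suggest translating into the wiki the user is visiting, which is one way a software default could contribute to one-directional flows.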

TBD: give examples of fallback values

Resources


We've applied to mentor an Outreachy project on this topic, intentionally leaving the focus area flexible depending on participant interests. See task T328597 and

Related work

This section is very much in-progress, and we'll add more as we learn about it.

Source code

References

  1. Content Translation provides an assisted translation environment with visual editing, intelligent template transformation, and machine translation integration. The project is mature and has been used to create over one million articles.