Research:Content Translation language imbalances

Created: 12:37, 24 February 2023 (UTC)
Contact
Duration:  2022-June – ??

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.


Records produced by the Content Translation tool[1] show extreme imbalances in the flow of translations between Wikipedia language editions: within a language pair, one direction of translation can heavily outnumber the other.

We have begun to explore these translation imbalances and have found that they are probably produced by a combination of technical factors such as software design, and social factors such as perceived completeness and importance.

If these imbalances are undesirable, they might be counteracted by an experimental intervention.

Glossary


Specific technical terms used by this project:

Content Translation (CX) - a MediaWiki extension providing the main user interface for assisted translation between Wikipedia languages. Sometimes this term will also be used to include the server component.

CX service - the backend for Content Translation, a Node.js server which supports section editing, machine translation, and suggestions.

Draft translation - an article in the Content Translation workflow which has some in-progress text but is not yet available to readers.

Language pair - the source and target language of a translation or a suggestion.

Language proficiency - a translator's self-reported skill level in a given language, e.g. as recorded using the Babel extension.

Local wiki expertise - a translator's edit count (bucketed to a range) on the target wiki.

Non-primary language(s) - the languages a person uses less often and less proficiently. Also "second language" (L2).

Primary language(s) - the language(s) a person uses most frequently and proficiently. Also called, less accurately, the "native" or "first" language (L1), or "mother tongue".

Published translation - an article which has been published to the wiki, at the end of the Content Translation workflow.

Reverse translation - translation from a primary into a non-primary language. This is not necessarily less common than forward translation (from a non-primary into a primary language).

Source language - the language being translated from.

Suggestion - an article recommended for translation in the chosen language pair.

Target language - the language being translated into.

Translation ratio - either of two measures of imbalance in translation flow. For a single language, it is the number of translations out of that language divided by the number of translations into it. For a language pair, it is the flow in one direction divided by the flow in the reverse direction (so the ratio for language pair A -> B is A->B / B->A).

Universal Language Selector - a MediaWiki extension responsible for showing language pickers, and tracking which languages the user has chosen in the past.
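
As an illustration of the two translation ratio measures defined above, here is a minimal Python sketch. The function names and the example counts are hypothetical and are not taken from the project's analysis code; flows are assumed to be stored as a dictionary keyed by (source, target) language codes.

```python
def pair_ratio(flows, a, b):
    """Ratio for the pair a -> b: translations a->b divided by translations b->a."""
    forward = flows.get((a, b), 0)
    reverse = flows.get((b, a), 0)
    if reverse == 0:
        # Undefined when there is no flow in either direction.
        return float("inf") if forward else None
    return forward / reverse


def language_ratio(flows, language):
    """Ratio for one language: all translations out divided by all translations in."""
    out_total = sum(n for (src, _), n in flows.items() if src == language)
    in_total = sum(n for (_, tgt), n in flows.items() if tgt == language)
    if in_total == 0:
        return float("inf") if out_total else None
    return out_total / in_total


# Hypothetical counts of published translations, keyed by (source, target).
example_flows = {
    ("en", "es"): 1000,
    ("es", "en"): 10,
    ("es", "ca"): 300,
    ("ca", "es"): 30,
    ("en", "ca"): 200,
}

print(pair_ratio(example_flows, "en", "es"))   # pair measure for en -> es
print(language_ratio(example_flows, "ca"))     # single-language measure for ca
```

A ratio above 1 means a language (or direction) is mostly a source of translations; a ratio below 1 means it is mostly a target.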

Research questions


TBD: These need refinement and discussion.

  • Is the imbalanced translation flow produced by endogenous translator preferences, or is it a structural artifact?
  • When machine translation becomes available for a language pair, does translation volume increase? Does published article quality decrease?
  • Is machine translation quality correlated with target language wiki size?
  • Are the arguments given for disabling machine translation into English still valid today?

Methods


This is a draft outline of work, not a plan yet.

  • Passive analysis of Content Translation historical logs
    • Done: Compare and visualize flows between all languages. What can be observed?
    • Look for correlations between translation flow and the relative values for each language in the pair: total articles, active editors, pageviews, ...
    • Compare smaller subsets of languages.
    • Segment all statistics according to whether the published translation originated in a suggestion, whether machine translation was explicitly used, and whether machine translators are available externally or internally for the language pair.
    • TBD
  • Passive analysis of Content Translation source code.
    • Done: How are suggested language pairs chosen?
  • Instrument Content Translation with temporary, additional, structured log events.
    • task T241833: Send an event identifying which group of inputs the suggested translation source language came from.
  • Interviews with translators
    • Learn about how perceived language importance informs choice of languages
    • Learn how software design affects choice of languages and workflow
  • Experimental intervention
    • For example, changing the suggested translation target for a limited number of users so that they translate away from the current language instead of into it.

Outreachy timeline


Wikimedia Foundation and Wikimedia Germany are providing resources to hire an intern through the Outreachy program, who will help with this research.

2023-03-06 - 2023-04-03 - Done - Outreachy initial contributions period

2023-05-29 - 2023-08-25 - Done - Outreachy internship period

Policy, Ethics and Human Subjects Research


No experiments are currently planned.

Results


Overview of imbalances


Some language pairs show imbalances of 100:1 between the number of translations made in each direction with the Content Translation software. The flow seems always to run from a dominant language towards the language with fewer wiki articles. Colonial relationships between languages are reproduced, for example English towards Spanish and Spanish towards Catalan. (TBD: publish raw data links, code, graphs, exceptions)

Data source: https://en.wikipedia.org/w/api.php?action=query&list=contenttranslationstats&format=json

R code:

library(circlize)
# Input: a selection of the API result, pivoted into source-by-target translation counts.
chordDiagram(api.result.translate.json.pivot.selection, directional = 1, direction.type = c("diffHeight", "arrows"), link.arr.type = "big.arrow")
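
The R snippet above expects data that has already been pivoted. The following Python sketch shows one way such a pivot could be built from records that have already been parsed from the API's JSON response. The record field names (sourceLanguage, targetLanguage, status, count) are assumptions about the response format and should be checked against the actual output; the function name and the sample records are hypothetical.

```python
def pivot_flows(records, languages, status="published"):
    """Return a source-by-target nested dict of translation counts.

    Only records with the given status, and whose source and target
    are both in `languages`, are counted.
    """
    matrix = {src: {tgt: 0 for tgt in languages} for src in languages}
    for rec in records:
        src = rec["sourceLanguage"]   # assumed field name
        tgt = rec["targetLanguage"]   # assumed field name
        if rec["status"] == status and src in matrix and tgt in matrix[src]:
            matrix[src][tgt] += int(rec["count"])
    return matrix


# Hypothetical records, not real data.
sample_records = [
    {"sourceLanguage": "en", "targetLanguage": "es", "status": "published", "count": 1000},
    {"sourceLanguage": "es", "targetLanguage": "en", "status": "published", "count": 10},
    {"sourceLanguage": "en", "targetLanguage": "es", "status": "draft", "count": 50},
    {"sourceLanguage": "es", "targetLanguage": "ca", "status": "published", "count": 300},
]

matrix = pivot_flows(sample_records, ["en", "es", "ca"])
for src, row in matrix.items():
    print(src, row)
```

Rows are source languages and columns are target languages, which matches the directional reading of a chord diagram.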

Visualizations:

 
A Sankey diagram showing that English is the biggest source of translations to other language Wikipedias

Analysis of suggested translation language algorithm


On their first visit to the Translations page after enabling the Content Translation beta feature, users see suggestions of articles to translate. There are two algorithms at play: one chooses the pair of source and target languages between which to translate, and the other chooses which articles to show for translation. The analysis in this section is focused on the initial default choice of translation languages.

The code responsible for setting the default languages is CXDashboard.findValidDefaultLanguagePair. In rough outline, it takes all languages that the user has frequently set in the Universal Language Selector (using mw.uls.getFrequentLanguageList), picks the first one, and suggests translating from that language into the current wiki's language. The exact process is more complicated:

 
Activity diagram detailing the Content Translation calculation to find a default suggested translation language pair.

TBD: discuss alternatives to this algorithm, such as randomizing all valid language pair permutations, recommending multiple pairs; and instrumenting the algorithm output

The target language strongly defaults to the current wiki language, and then a source language must be chosen which is different from the target. The source language defaults first to the interface language, as set in MediaWiki user preferences or in browser preferences and accept headers. Languages explicitly (?) chosen with ULS are also retrieved from localStorage under 'uls-previous-languages'.
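
The rough outline above can be sketched in Python as follows. This is a simplified illustration, not the actual Content Translation code (which is JavaScript and has more fallbacks): the function name, its parameters, and the exact priority order of candidate source languages are assumptions made for the example.

```python
def default_language_pair(wiki_language, interface_language, previous_languages):
    """Pick a default (source, target) pair, following the rough outline.

    The target is always the current wiki's language. The source is the
    first candidate that differs from the target.
    """
    target = wiki_language
    # Assumed priority order: interface language first, then languages
    # previously chosen in the language selector.
    candidates = [interface_language, *previous_languages]
    for candidate in candidates:
        if candidate and candidate != target:
            return candidate, target
    # No valid source found; the real fallback values are not modelled here.
    return None, target


# A user on Catalan Wikipedia with a Catalan interface who has previously
# selected Spanish and English.
print(default_language_pair("ca", "ca", ["es", "en"]))
# A user on Spanish Wikipedia with an English interface.
print(default_language_pair("es", "en", []))
```

In both cases the suggested direction is into the current wiki's language, which is the structural default that this analysis is concerned with.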

TBD: give examples of fallback values

Translate to vs. translate from workflows


There are good reasons that translators might be fluent in a smaller language and in a trade or world language, and it seems that many translators are comfortable working in either direction. It's this choice of direction which creates the imbalance seen in our research. But the choice of direction is often already decided by the time users enter the translation workflow: the two directions can be summarized as "find an article to translate into your language" vs. "translate this article from your language".

We would like to analyze translations made through these two workflows, to see if our assumption is correct that the workflows mostly correspond to a single direction of translation flow.

Resources


Outreachy


We've applied to mentor an Outreachy project on this topic, intentionally leaving the focus area flexible depending on participant interests. See task T328597 and https://www.outreachy.org/communities/cfp/wikimedia/


This section is very much in-progress, and we'll add more as we learn about it.

Production source code


Analysis code


References

  1. Content Translation provides an assisted translation environment with visual editing, intelligent template transformation, and machine translation integration. The project is mature and has been used to create over one million articles.