Research:Machine Translation, Human Editors

Contact

Wikimedia Foundation

Duration: 2022-May – ??

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

By reducing the burden of information gathering and translation, machine translation-aided tools increase both the speed and ease with which content can be created. One such tool, Content Translation (CX), has been used to create over 1 million Wikipedia articles and to reduce the content and knowledge gaps between larger and smaller wikis.

Currently we have a limited understanding of what happens between the time editors receive the initial machine translation (MT) of an article and the time they publish an edited version of it. To identify ways of supporting editors in their goal of publishing high-quality translations, there are two questions we must answer. First, what linguistic changes are made during the MT post-editing process, and how do these differ by language, language type, and article characteristics? Second, what is the editor experience during this human-MT interaction, and how can tools better support editors as they improve MT outputs? The goal of Part 1 of this project, ‘Machine translation, human editors’, is to answer the first of these questions.

Background and goals

For practical purposes, this project seeks to provide an account of human edits to the initial machine translation outputs in a targeted set of languages and article genres/topics. As a result, we will better understand how contributors edit MTs to create Wikipedia articles, and how edits vary across languages and language types. Additionally, we can compare how the number and type of edits correlate with scores assigned by the current Content Translation quality algorithm.

Machine translation outputs post-editing literature review

As part of this project, an annotated bibliography was created to survey adjacent literature on the post-editing of machine translation outputs.

The languages/wikis

For practical purposes, we do not aim to perform an analysis of all ~300 languages that Wikipedia supports, nor the full set of language pairs supported by CX. Instead, we will target a small, strategic subset of languages.

Wiki characteristics
ISO	Language	Size	Articles	Active users	Boost	CX pubs	CX out of beta	Better with CX?
sq	Albanian	small	84k	212	Core boost	4900 (88% source=EN)	yes	yes (+14.39%)
id	Indonesian	med	593k	3300	no	12k (96% source=EN)	no	no (-1.39%)
zh	Standard written Chinese	large	1.2m	9500	no	22k (86% source=EN)	no	yes (+8.3%)

"Better with CX?" compares the deletion ratios of CX-created and non-CX-created articles during 2020 to gauge how likely an article produced with CX is to be deleted relative to an article created without CX. Positive numbers indicate that articles created with CX are less likely to be deleted than those created from scratch, whereas negative numbers indicate higher CX deletion ratios.
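The comparison described above can be sketched in Python. This is a minimal illustration, not the analysis actually run: the function names are hypothetical, the counts in the usage lines are made up, and it assumes the reported figure is the percentage-point gap between the two deletion ratios, which the text does not specify.

```python
def deletion_ratio(deleted: int, created: int) -> float:
    """Share of created articles that were later deleted."""
    if created <= 0:
        raise ValueError("created must be positive")
    return deleted / created


def cx_advantage(cx_deleted: int, cx_created: int,
                 other_deleted: int, other_created: int) -> float:
    """Gap in percentage points between the non-CX and CX deletion ratios.

    ASSUMPTION: the source does not say whether its figure is an absolute
    or a relative difference; an absolute difference is used here.
    A positive value means CX-created articles were deleted less often.
    """
    cx_ratio = deletion_ratio(cx_deleted, cx_created)
    other_ratio = deletion_ratio(other_deleted, other_created)
    return 100 * (other_ratio - cx_ratio)


# Made-up counts, for illustration only.
gap = cx_advantage(cx_deleted=50, cx_created=1000,
                   other_deleted=200, other_created=1000)
print(f"CX advantage: {gap:+.2f} percentage points")
```

Under this reading, a wiki where CX articles are deleted more often than other articles yields a negative value, matching the sign convention in the table.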

To arrive at the three focus languages/wikis shown in the table above, we considered a number of wiki and language characteristics. Representing a small, a medium, and a large wiki, these three wikis also cover cases in which CX is out of beta and cases in which it is not. They ensure coverage of core boost wikis (important to Language Team goals) while also including larger wikis with larger editor bases. Finally, this set includes wikis in which CX-produced articles fare better and a wiki in which they fare worse. By ‘better’ we mean cases in which CX-produced articles are less likely to be deleted compared to their non-CX-created counterparts.

Language characteristics
ISO	Language	Writing system	Word order rigidity	Morphosyntax
sq	Albanian	alphabetic	flexible	highly synthetic
id	Indonesian	alphabetic	semi-flexible	semi-synthetic
zh	Standard written Chinese	logographic	rigid	isolating

A few notes about the characterizations in the table above: Albanian is canonically (S)ubject-(V)erb-(O)bject, but word order can vary based on pragmatic differences, and both verb-initial and verb-final word orders are readily attested. A synthetic language is one that expresses relationships in a sentence (such as what the subject and object are) by using prefixes and suffixes to modify words. Lastly, although Chinese word order is relatively rigid, topicalization may be the most frequent type of exception to SVO word order.

In arriving at the list of three target wikis/languages, a few main linguistic typological differences were also considered to ensure diversity of the linguistic structures represented (shown above). This is important in order to address how current CX mechanisms are able to cover a range of language types. First, this selection provides coverage of both logographic and alphabetic writing systems. Secondly, it includes languages that range from having highly flexible to relatively rigid word order. Finally, these three languages provide coverage of highly synthetic to isolating languages.

The sample articles / data

To create the dataset for this project, we retrieved 50 sample Content Translation publications and their associated initial metadata. This was done for each of the three target wikis: Albanian (sq), Indonesian (id), Standard Written Chinese (zh). For each of the 50 samples for each language, the following information was gathered:

The Content Translation-published article at the time of publication (excluding any later edits to the article)
The initial unedited machine translation output for each of these Content Translation publications
A corresponding Content Translation quality algorithm-assigned score for each of the Content Translation publications
A historical snapshot of the source article at the time of the machine translation output generation

We followed a prescribed sampling method for retrieval of the articles, with the goal of obtaining a sample that was representative enough for the types of analyses and generalizations we needed to make. A description follows; only minor deviations were needed for practical purposes (trackable at Phabricator T290906):

Source language - Only articles with English as a source language should be included. This is because English is the most frequent source language (with rates as high as 80-90%+).
Translator diversity and experience - For each of the wikis, to establish a minimal amount of individual translator variation (i.e., we didn't want to inadvertently retrieve translations from a single editor), the 50 articles should represent the work of 10 or more individual editors, with no individual editor contributing more than 5 articles. In addition, 50% of the articles should have been published by a ‘newer’ editor, defined here as an account created no more than 2 years prior. The other half of the articles should have been published by editors with CX publications beginning at least 3 years prior.
Machine translation engine - Google Translate (GT) may be the only service available across Albanian, Indonesian, and Chinese, and it is one of the services most commonly used by CX users overall, across all languages. All articles should therefore have been produced exclusively (across all sections/paragraphs) using initial MT outputs provided by GT.
Topic-Category - All articles should belong to the 'nature/natural phenomena' or 'biography' category.
Article length - All articles (CX published versions) should contain at least 7 paragraphs, but if this is overly restrictive, a minimum of 5 is acceptable. These paragraphs may be contained in a single article section or across multiple sections of an article (i.e., no 'number of sections' specification).
Percent modified - The CX quality algorithm calculates "percentage the MT is modified". We aimed to define three categories for the overall 50 articles to fall into. These categories are (1) less than 10% modified, (2) between 11 and 50% modified, and (3) more than 51% modified.
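Two of the criteria above, translator diversity and percent modified, lend themselves to a simple check. The following Python sketch is illustrative only and is not the project's actual sampling code: the `Publication` record and its field names are hypothetical, and because the listed categories leave values between 10 and 11% and between 50 and 51% unassigned, the boundaries used here (up to 10, up to 50, above 50) are an assumption.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Publication:
    # Hypothetical record; field names are not taken from the CX data model.
    editor: str
    account_age_years: float
    percent_modified: float


def modification_category(percent: float) -> int:
    """Map 'percentage the MT is modified' to category 1, 2, or 3.

    ASSUMPTION: boundary values are assigned to the lower category.
    """
    if percent <= 10:
        return 1
    if percent <= 50:
        return 2
    return 3


def meets_translator_diversity(pubs: list[Publication],
                               min_editors: int = 10,
                               max_per_editor: int = 5) -> bool:
    """True if at least min_editors contributed and none exceeds max_per_editor."""
    counts = Counter(p.editor for p in pubs)
    return len(counts) >= min_editors and max(counts.values(), default=0) <= max_per_editor


def newer_editor_share(pubs: list[Publication], max_age_years: float = 2.0) -> float:
    """Fraction of publications by accounts created no more than max_age_years ago."""
    if not pubs:
        return 0.0
    return sum(p.account_age_years <= max_age_years for p in pubs) / len(pubs)


# Synthetic sample of 50 publications: 10 editors with 5 articles each.
sample = [
    Publication(editor=f"editor{i % 10}",
                account_age_years=1.0 if i < 25 else 4.0,
                percent_modified=float(i * 2))
    for i in range(50)
]
print(meets_translator_diversity(sample))
print(newer_editor_share(sample))
print(Counter(modification_category(p.percent_modified) for p in sample))
```

The remaining criteria (source language, MT engine, topic, article length) would be plain filters on the candidate pool and are omitted here.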

Results

Language Quality Trends Analysis

While the primary focus of this project was to analyze human edits made to machine translation outputs, given the amount of careful data sampling, we also had the opportunity to examine the quality of the randomly sampled articles. Overall, this included 1082 paragraphs from 132 articles; on average, nine paragraphs per article were analyzed for quality by human raters.

The quality assessment included two measures:

Language quality - To what degree does the text contain grammatical errors, and to what degree do these impact meaning and understanding?
Machine translation markedness - Compared to human-written texts, to what degree are there language choices in the text indicating that machine translation outputs were used to produce it?

Please refer to the full language quality trends analysis results report for more details, noting that the recommendations section is still under discussion.

Human Edits to Machine Translation Outputs (Post-editing) Results

More results coming soon.