Research:Machine Translation, Human Editors

Duration:  2022-May – ??

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

By reducing the burdens of information gathering and translation tasks, machine translation-aided tools increase both the speed and ease with which content can be created. One such tool, Content Translation (CX), has been used to create over 1 million Wikipedia articles and to reduce the content and knowledge gaps between larger and smaller wikis.

Currently, we have a limited understanding of what happens between the time editors receive the initial machine translation (MT) of an article and the time they publish an edited version of it. To identify ways of supporting editors in their goal of high-quality published translations, there are two questions we must answer. First, what linguistic changes are made during the MT post-editing process, and how do these differ by language, language type, and article characteristics? Second, what is the editor experience during this human-MT interaction, and how can tools better support editors as they improve MT outputs? The goal of Part 1 of this project, ‘Machine translation, human editors’, is to answer the first of these questions.

Background and goals

For practical purposes, this project seeks to provide an account of human edits to the initial machine translation outputs in a targeted set of languages and article genres/topics. As a result, we will better understand how contributors edit MT outputs to create Wikipedia articles, and how edits vary across languages and language types. Additionally, we can compare how the number and type of edits correlate with scores assigned by the current algorithm.
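As a rough illustration of what counting "edit number and type" could look like, the sketch below diffs an MT output against its published version and tallies the operations. This is not the project's method: the function name `count_edit_operations`, the choice of Python's `difflib`, and the whitespace/character tokenization are all assumptions made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def count_edit_operations(mt_text: str, published_text: str,
                          by_char: bool = False) -> Counter:
    """Tally insert/delete/replace operations between two texts.

    Hypothetical helper: tokenization is whitespace-based by default,
    or per character when by_char=True (e.g. for text written without
    spaces between words).
    """
    mt_tokens = list(mt_text) if by_char else mt_text.split()
    pub_tokens = list(published_text) if by_char else published_text.split()

    # autojunk=False keeps frequent tokens from being ignored on long texts.
    matcher = SequenceMatcher(a=mt_tokens, b=pub_tokens, autojunk=False)

    counts = Counter()
    for tag, *_ in matcher.get_opcodes():
        if tag != "equal":  # skip unchanged spans
            counts[tag] += 1
    return counts


# Invented example sentences, not project data.
ops = count_edit_operations("the cat sat on mat", "the cat sat on the mat")
print(dict(ops))
```

Operation counts of this kind only capture surface changes; classifying edits linguistically (for example, word order versus morphology) would need additional, language-specific analysis.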

Machine translation outputs post-editing literature review

More details coming soon.

The languages/wikis

For practical purposes, we do not aim to perform an analysis of all ~300 languages that Wikipedia supports, nor the full set of language pairs supported by CX. Instead, we will target a small, strategic subset of languages.

Wiki characteristics
| ISO | Language | Size | Articles | Active users | Boost | CX publications | CX out of beta | Better with CX? |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sq | Albanian | small | 84k | 212 | Core boost | 4900 (88% source=EN) | yes | yes (+14.39%) |
| id | Indonesian | medium | 593k | 3300 | no | 12k (96% source=EN) | no | no (-1.39%) |
| zh | Standard written Chinese | large | 1.2m | 9500 | no | 22k (86% source=EN) | no | yes (+8.3%) |

"Better with CX?" compares the deletion ratios of CX-created and non-CX-created articles during 2020 to gauge how likely an article produced with CX is to be deleted relative to an article created without it. Positive numbers indicate that articles created with CX are less likely to be deleted than those created from scratch, whereas negative numbers indicate higher CX deletion ratios.
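The comparison can be sketched as follows. The page does not state whether the reported figures are percentage-point differences or relative differences, so this sketch assumes a simple percentage-point difference. The function names and the counts in the example are invented for illustration and are not project data.

```python
def deletion_ratio(deleted: int, created: int) -> float:
    """Share of created articles that were later deleted."""
    if created <= 0:
        raise ValueError("created must be positive")
    return deleted / created


def cx_advantage_points(cx_deleted: int, cx_created: int,
                        other_deleted: int, other_created: int) -> float:
    """Percentage-point gap between non-CX and CX deletion ratios.

    Assumed formula: positive means CX-created articles were deleted
    less often than articles created without CX; negative means more often.
    """
    cx_ratio = deletion_ratio(cx_deleted, cx_created)
    other_ratio = deletion_ratio(other_deleted, other_created)
    return 100 * (other_ratio - cx_ratio)


# Invented counts: 5 of 100 CX articles deleted vs. 20 of 100 non-CX articles.
gap = cx_advantage_points(5, 100, 20, 100)
print(f"{gap:+.2f} percentage points")
```

Under this assumed definition, the sign convention matches the table: a positive value corresponds to "yes" in the "Better with CX?" column.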

To arrive at the three focus languages/wikis shown in the table above, we considered a number of wiki and language characteristics. Representing a small, a medium, and a large wiki, these three wikis also cover cases in which CX is out of beta and cases in which it is not. They ensure coverage of core boost wikis (important to Language Team goals) while also including larger wikis with larger editor bases. Finally, the set includes wikis where CX-produced articles fare better and wikis where they fare worse than articles created without CX. By ‘better’ we mean that CX-produced articles are less likely to be deleted than their non-CX-created counterparts.

Language characteristics
| ISO | Language | Writing system | Word order rigidity | Morphosyntax |
| --- | --- | --- | --- | --- |
| sq | Albanian | alphabetic | flexible | highly synthetic |
| id | Indonesian | alphabetic | semi-flexible | semi-synthetic |
| zh | Standard written Chinese | logographic | rigid | isolating |

A few notes about the characterizations in the table above: Albanian is canonically (S)ubject-(V)erb-(O)bject, but word order can vary based on pragmatic differences, and both verb-initial and verb-final word orders are readily attested. A synthetic language is one that expresses relationships in a sentence (such as what the subject and object are) by using prefixes and suffixes to modify words. Lastly, although Chinese word order is relatively rigid, topicalization may be the most frequent type of exception to SVO word order.

In arriving at the list of three target wikis/languages, we also considered a few main typological differences to ensure diversity in the linguistic structures represented (shown above). This matters for assessing how well current CX mechanisms cover a range of language types. First, the selection covers both logographic and alphabetic writing systems. Second, it includes languages that range from highly flexible to relatively rigid word order. Finally, the three languages span the range from highly synthetic to isolating.

The sample articles / data

More details coming soon.

Results

More details coming soon.