Research:Copyediting as a structured task

Tracked in Phabricator: T293034
Created: 18:32, 25 August 2021 (UTC)
Duration: July 2021 – June 2023
This page documents a completed research project.


To support newcomers in making their first edits, the Growth team has been developing the Structured Tasks framework. Structured tasks break the editing process down into smaller steps that are easy to understand, easy to use on mobile devices, and can be guided by algorithms. The first structured task to be implemented was add-a-link, which has been deployed to 4 wikis (arwiki, bnwiki, cswiki, and viwiki). Results from those wikis have been encouraging (T277355), with less than 8% of edits from recommended links being reverted. Therefore, we would like to implement other types of tasks that are part of editors' workflows.

One particularly promising task is copyediting, i.e. improving the text of articles with respect to spelling, grammar, tone, etc. It is one of the structured tasks that communities have asked for most. Work by the Growth team is tracked at Structured Tasks/Copyedit.

Aims of this project:

  • We want to understand which types of copyediting tasks algorithms could assist with.
  • We want to use an algorithm that can suggest tasks for one type of copyediting in articles across different languages.
  • We want to know how well the algorithm works (e.g. which model works best from a set of existing models).


Methods

Timeline

  1. Perform a literature review to get an overview of i) different aspects of copyediting, ii) commonly used automatic tools for copyediting, iii) existing approaches to copyediting in Wikipedia, and iv) available models in NLP/ML research
  2. Perform an exploratory analysis of available models and datasets
  3. Define and scope the specific task
  4. Build an evaluation dataset and implement a model

Results

Literature Review

Background research and literature review are captured in Research:Copyediting as a structured task/Literature Review.

Main findings:

  • Simple spell- and grammar-checkers such as LanguageTool or Enchant are the most suitable tools for supporting copyediting across many languages, and they are open/free (a minimal spellchecking sketch follows this list).
  • Some adaptation to the context of Wikipedia and structured tasks will be required in order to decrease the sensitivity of the models; common approaches are to ignore everything in quotes or text that is linked.
  • The challenge will be to develop a ground-truth dataset for backtesting. Likely, some manual evaluation will be needed.
  • Long-term: develop a model to highlight sentences that require editing (without necessarily suggesting a correction) based on copyediting templates. This could provide a set of more challenging copyediting tasks compared to spellchecking. It is also a more research-oriented project.
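
As a minimal illustration of the dictionary-based end of this spectrum, the sketch below flags misspelled tokens with pyenchant, a Python binding for Enchant. It assumes an en_US dictionary is installed via one of Enchant's backends; the example sentence is hypothetical.

    # Flag misspelled tokens with pyenchant (a Python binding for Enchant).
    # Assumes an en_US dictionary is available via an Enchant backend.
    import enchant

    dictionary = enchant.Dict("en_US")

    sentence = "The artcle describes the histroy of the village."
    for token in sentence.split():
        word = token.strip(".,;:!?")
        if word and not dictionary.check(word):
            # suggest() returns candidate corrections, best guesses first
            print(word, "->", dictionary.suggest(word)[:3])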

Approximate offline evaluation

We have identified LanguageTool as a first candidate to surface possible copyedits in articles because:

  • It is open, is being actively developed, and supports 30+ languages.
  • The rule-based approach has the advantage that errors come with an explanation of why they were highlighted, rather than just a high score from an ML model. In addition, it provides functionality for the community to add custom rules: https://community.languagetool.org/
  • The copyedits from LanguageTool go beyond dictionary-based spellchecking of single words and also capture grammatical and stylistic errors (see the sketch after this list).
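
The sketch below shows these rule-based explanations via language_tool_python, one open Python wrapper around LanguageTool; the example text is hypothetical.

    # Surface copyedits with language_tool_python, an open wrapper
    # around LanguageTool (pip install language-tool-python).
    import language_tool_python

    tool = language_tool_python.LanguageTool("en-US")

    text = "She go to school every day. Its a long way."
    for match in tool.check(text):
        # Each match carries the rule that fired, a human-readable
        # explanation, the rule's category, and suggested replacements.
        print(match.ruleId, "|", match.category, "|", match.message)
        print("  suggested:", match.replacements[:3])

    tool.close()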

We can get a very rough approximation of how well LanguageTool detects copyedits in Wikipedia articles by comparing the number of errors in featured articles with the number in articles containing a copyedit template. We find that the performance is reasonable in many languages after applying a post-processing step in which we filter out some of the errors reported by LanguageTool (e.g. those overlapping with links or bold text), as sketched below.
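
A sketch of that filtering step, under simplifying assumptions: errors are given as (offset, length) pairs into the wikitext, and only [[...]] links and '''...''' bold markup are treated as protected.

    import re

    # Character spans covered by links [[...]] or bold '''...''' markup.
    def protected_spans(wikitext):
        pattern = re.compile(r"\[\[.*?\]\]|'''.*?'''")
        return [m.span() for m in pattern.finditer(wikitext)]

    # Keep only errors (given as (offset, length) pairs) that do not
    # overlap any protected span.
    def filter_errors(errors, wikitext):
        spans = protected_spans(wikitext)
        return [
            (offset, length)
            for offset, length in errors
            if not any(offset < end and offset + length > start
                       for start, end in spans)
        ]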

We also compared the performance of simple spellcheckers, which are available for more languages than LanguageTool supports. They can also surface many meaningful errors for copyediting, but they suffer from a much higher rate of false positives. This can be partially addressed by post-processing steps that filter the errors. Another disadvantage is that spellcheckers perform considerably worse than LanguageTool at suggesting the correct fix for an error (as opposed to merely detecting it).

One potentially substantial improvement could be to develop a model which assigns a confidence score to the errors surfaced by LanguageTool or a spellchecker. This would allow us to prioritize, for the structured copyediting task, those errors for which we have high confidence that they are true copyedits. Some initial thoughts are in T299245.
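
One illustrative shape such a model could take (our sketch here, not the design discussed in T299245): a small classifier over hand-picked features of each surfaced error, trained on volunteer judgments. All features and labels below are hypothetical.

    from sklearn.linear_model import LogisticRegression

    # Hypothetical features per surfaced error:
    # [is_spelling_rule, inside_quote, number_of_suggested_replacements]
    X_train = [[1, 0, 3], [1, 1, 0], [0, 0, 1], [0, 1, 5]]
    y_train = [1, 0, 1, 0]  # 1 = volunteers judged it a genuine copyedit

    model = LogisticRegression().fit(X_train, y_train)

    # Probability that a new surfaced error is a true copyedit; errors
    # with high confidence would be prioritized for the structured task.
    print(model.predict_proba([[1, 0, 2]])[0, 1])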


More details: Research:Copyediting as a structured task/LanguageTool

Manual evaluation

We performed two rounds of manual evaluation of copyedits in 5 different languages (English, Arabic, Bengali, Czech, Spanish), testing different approaches. We generated random samples of ~100 copyedits in each wiki and asked volunteers to assess whether each one was a genuine copyedit. This allows us to calculate the accuracy of each approach (that is, what fraction of the surfaced copyedits are genuine). More details in the Phabricator task: https://phabricator.wikimedia.org/T315086
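
For concreteness, the accuracy computation amounts to the following; the labels below are made up for illustration.

    # Accuracy as used here: the fraction of surfaced copyedits that
    # volunteers judged genuine. Hypothetical labels for one approach.
    labels = [True] * 82 + [False] * 18  # ~100 sampled copyedits
    accuracy = sum(labels) / len(labels)
    print(f"{accuracy:.0%}")  # 82%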

Main conclusions:

  • LanguageTool: The copyedits surfaced via LanguageTool seem promising. The accuracy is between 70% and 90% across the 3 tested languages (English, Arabic, Spanish). We had to apply a set of heuristics to filter out some of the errors surfaced by LanguageTool in order to reduce the number of false positives. LanguageTool provides a category system for the rules associated with different errors, which makes it easy to filter different types of errors for different languages.
  • Spellcheckers: The copyedits surfaced via common spellcheckers have shown overall low levels of accuracy (in fact, for Bengali the accuracy was 0%). The spellcheckers are too sensitive in surfacing copyedit errors, because their dictionaries do not contain many of the words appearing in the Wikipedia articles of the respective language.
  • List of common misspellings: We manually compiled lists of common misspellings in each language and surfaced only copyedits associated with these misspellings. We observed that using custom lists of common misspellings yields high-accuracy copyedits (for example, in the case of Bengali this approach had an accuracy >90%, compared to ~0% when using spellcheckers). Therefore, we believe that custom lists of common misspellings are a promising approach for surfacing copyedits which can, in principle, be scaled across many languages (a sketch follows this list).
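
A minimal sketch of this lookup, with a hypothetical two-entry list and example text:

    import re

    # A custom list of common misspellings mapped to their corrections
    # (two hypothetical English entries for illustration).
    COMMON_MISSPELLINGS = {"recieve": "receive", "occured": "occurred"}

    # Surface (misspelling, suggested correction) pairs found in a text.
    def surface_copyedits(text):
        return [
            (word, COMMON_MISSPELLINGS[word.lower()])
            for word in re.findall(r"\w+", text)
            if word.lower() in COMMON_MISSPELLINGS
        ]

    print(surface_copyedits("The incident occured before she could recieve the letter."))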

Curating lists of common misspellings

In the manual evaluation, we observed that lists of common misspellings yield high-accuracy copyedits. However, one of the main challenges is to curate such lists of common misspellings. For a few languages, such lists have been compiled by the communities (see for example English or German). For most languages, such lists are not readily available.

Wiktionary

One idea is to extract misspellings from Wiktionary projects. We take advantage of the fact that Wiktionary contains structured information about misspelled words in the form of the {{misspelling_of}} template (a parsing sketch follows the list below).

  • This approach yields lists of misspellings across many languages. This is because the template exists in several Wiktionaries, and the English Wiktionary alone contains misspellings in many different languages.
  • We find that these lists contain misspellings which are mostly not yet captured in the community-curated lists that are already available.
  • These misspellings are found in the text of Wikipedia articles in the respective languages, where they can be surfaced as candidates for copyediting.
  • For some languages, the extracted lists of common misspellings are short. However, they are still useful because i) they demonstrate that the approach works in principle to identify copyedits in Wikipedia articles for many languages; and ii) they serve as starting points to expand those lists manually or with complementary approaches.
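
A sketch of the extraction step using mwparserfromhell to parse a Wiktionary entry's wikitext. The wikitext snippet is abbreviated, and we assume the template's positional parameters are the language code followed by the correct spelling, as on the English Wiktionary.

    import mwparserfromhell

    # Abbreviated wikitext of a hypothetical Wiktionary entry.
    wikitext = "{{misspelling of|en|acceptable}}"

    for template in mwparserfromhell.parse(wikitext).filter_templates():
        if template.name.matches("misspelling of"):
            # First positional parameter: language code;
            # second: the correctly spelled word.
            lang = str(template.get(1).value)
            correct = str(template.get(2).value)
            print(lang, correct)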

More details: Research:Copyediting_as_a_structured_task/Common_misspellings_wiktionary

Other approaches

There are alternative options for automatically curating lists of common misspellings which we have not yet explored in detail. This list is not exhaustive:

  • Redirects in Wikipedia. For example, English Wikipedia has a template {{R_from_misspelling}} which marks redirects from a misspelling or typographical error (a sketch of collecting such redirects follows this list).
  • Redirects in Wiktionary.
  • Small changes that occur frequently when looking at diffs between two revisions of articles.
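
As a sketch of the first option: on English Wikipedia, {{R_from_misspelling}} populates a tracking category (assumed here to be Category:Redirects from misspellings), whose members can be listed via the MediaWiki API.

    import requests

    # List redirects tagged as misspellings on English Wikipedia via the
    # MediaWiki API. The category name is an assumption based on how
    # {{R from misspelling}} tags pages on enwiki.
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "list": "categorymembers",
            "cmtitle": "Category:Redirects from misspellings",
            "cmlimit": 50,
            "format": "json",
        },
    )

    for page in response.json()["query"]["categorymembers"]:
        print(page["title"])  # the misspelled form (the redirect's title)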

A model to detect sentences for copyediting

We train a multilingual machine-learning model that takes a sentence as input and yields a single score indicating whether the sentence needs improvement in terms of copyediting. This allows us to prioritize the lowest-scoring sentences (i.e. those most in need of copyediting), e.g., for running other copyedit tools such as LanguageTool which yield specific suggestions for improvement.
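
One plausible shape for such a model (an illustration, not the project's released implementation): a multilingual transformer with a single regression head, fine-tuned on labeled sentences. The checkpoint name below is just one possible starting point, and the example sentence is hypothetical.

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # A multilingual encoder with a single-output head; after fine-tuning
    # on labeled sentences, the head yields the copyedit score.
    name = "bert-base-multilingual-cased"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=1)

    sentence = "The artcle describe the history of the village."
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)

    with torch.no_grad():
        # Lower scores would indicate a higher need for copyediting.
        score = model(**inputs).logits.squeeze().item()
    print(score)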

We show that such a model works well for a substantial number of languages. Admittedly, we also find that the performance is low for some languages, such as Romanian. However, the latter could also be due to the scarcity of high-quality multilingual ground-truth data for training/evaluation.

More details: Research:Copyediting as a structured task/A model to detect sentences for copyediting

Papers, code, etc.

Subpages