Research:Copyediting as a structured task
To support newcomers in their first edits, the Growth Team has been developing the Structured Tasks framework. Structured tasks break down the editing process into smaller steps that are easily understood, easy to use on mobile devices, and can be guided by algorithms. The first structured task to be implemented was add-a-link, which has been deployed to 4 wikis (arwiki, bnwiki, cswiki, and viwiki). Results from those wikis have been encouraging (T277355), with fewer than 8% of edits from recommended links being reverted. Therefore, we would like to implement other types of tasks that are part of editors’ workflows.
One particularly promising task is copyediting, i.e. improving the text of articles with respect to spelling, grammar, tone, etc. This is one of the structured tasks that communities have requested most often. Work by the Growth Team is tracked at Structured Tasks/Copyedit.
Aims of this project:
- We want to understand which types of copyediting tasks algorithms might be able to assist with.
- We want to use an algorithm that can suggest tasks for one type of copyediting in articles across different languages.
- We want to know how well the algorithm works (e.g. know which model works best from a set of existing models).
- Perform a literature review to get an overview of i) different aspects of copyediting, ii) different commonly-used automatic tools for copyediting, iii) existing approaches to copyediting in Wikipedia, and iv) available models in NLP/ML research
- Exploratory analysis of available models and datasets
- Defining and scoping the specific task
- Building an evaluation dataset and implementing a model
Background research and literature review are captured in Copyediting_as_a structured_task/Literature_Review
- Simple spell checkers and grammar checkers such as LanguageTool or Enchant are the most suitable for supporting copyediting across many languages, and they are free and open
- Some adaptation to the context of Wikipedia and structured task will be required in order to decrease the sensitivity of the models; common approaches are to ignore everything in quotes or text that is linked.
- The challenge will be to develop a ground-truth dataset for backtesting. Likely, some manual evaluation will be needed.
- Long-term: Develop a model to highlight sentences that require editing (without necessarily suggesting a correction) based on copyediting templates. This could provide a set of more challenging copyediting tasks compared to spellchecking. This is also a more research-oriented project.
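The adaptation mentioned above (ignoring everything in quotes or linked text) could be sketched roughly as follows. This is only a minimal illustration, not the project's implementation: the function names `mask_skipped_spans` and `suggest_misspellings` are hypothetical, the `dictionary` argument is a plain set of known words standing in for a real spell checker such as LanguageTool or Enchant, and links are assumed to use the `[[...]]` wikitext syntax.

```python
import re

# Spans to ignore (assumption): wikilinks [[...]] and text inside
# straight or curly double quotes.
SKIP_PATTERNS = [
    re.compile(r"\[\[.*?\]\]"),
    re.compile(r'"[^"]*"'),
    re.compile(r"“[^”]*”"),
]


def mask_skipped_spans(text):
    """Blank out quoted and linked spans while preserving character offsets."""
    for pattern in SKIP_PATTERNS:
        text = pattern.sub(lambda m: " " * len(m.group(0)), text)
    return text


def suggest_misspellings(text, dictionary):
    """Return (offset, word) pairs for unknown words outside skipped spans.

    `dictionary` is a set of lowercase known words; it is a stand-in for
    a real spell checker backend.
    """
    masked = mask_skipped_spans(text)
    return [
        (m.start(), m.group(0))
        for m in re.finditer(r"[^\W\d_]+", masked)
        if m.group(0).lower() not in dictionary
    ]
```

Because masked spans are replaced by spaces of the same length, the returned offsets still point into the original text, so a suggestion could be highlighted in place. Words inside a link or a quotation are never flagged, which lowers the number of false positives at the cost of missing genuine errors in those spans.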