Research:Copyediting as a structured task/LanguageTool
A detailed analysis of LanguageTool for the copyedit structured task.
Advantages of LanguageTool
There are many reasons why LanguageTool seems like a good starting point:
- It is open-source software
- It works out of the box and has been (and continues to be) actively developed for many years
- It supports 30+ languages
- Each detected error comes with an explanation. This is important for the machine-in-the-loop approach, in which editors keep full control over whether to adopt or reject the model’s suggestions (https://meta.wikimedia.org/wiki/Research:Knowledge_Gaps_3_Years_On#Principles_Guiding_Knowledge_Gaps_Research)
- Most of the detected errors come with a suggestion for improvement.
- The community can define custom rules for the model (https://community.languagetool.org/)
Challenges for LanguageTool
LanguageTool provides a browser interface (https://languagetool.org/) into which text can be pasted for copyediting.
When using LanguageTool for Wikipedia articles, we face several challenges:
- A Wikipedia article contains not only plain text but also other elements such as tables, infoboxes, and references, which we probably don't want to spell-check.
- A Wikipedia article contains content (text, links, etc.) that is transcluded from, e.g., templates. Fixing potential copyedits in this case is not recommended, as i) the fix would have to be made in the template and not in the article itself; and ii) it would also affect the content of other articles.
- A Wikipedia article contains many text elements that might appear to be copyedits but are in fact correct, such as quotes and uncommon entity names; these should thus not be highlighted as copyedits.
As an example, when manually pasting the text of the lead section of the article on Roman Catholic Diocese of Bisceglie, LanguageTool yields 7 copyedits (marked in bold), all of which are false positives:
The Diocese of Bisceglie (Latin: Dioecesis Vigiliensis) was a Roman Catholic diocese located in the town of Bisceglie on the Adriatic Sea in the province of Barletta-Andria-Trani, Apulia in southern Italy. It is five miles south of Trani. In 1818, it was united with the Archdiocese of Trani to form the Archdiocese of Trani-Bisceglie.[1][2]
The main challenge is then to ensure that applying LanguageTool to find copyedits in Wikipedia articles yields genuine errors and not too many false positives (highlighted errors that are in fact correct).
API for LanguageTool
In order to investigate LanguageTool in more detail, we set up our own instance to be used via an API.
Endpoint on cloud-vps
We set up a remote server running our own instance of LanguageTool on cloud-vps.
We can then query LanguageTool in the following way:
- Directly: https://copyedits.wmcloud.org/v2/check?language=en&text=my+text
- In a Jupyter notebook: https://gitlab.wikimedia.org/repos/research/copyedit/-/blob/main/example_LanguageTool.ipynb
More documentation is available at: https://github.com/wikimedia/research-api-endpoint-template/tree/language-tool
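The endpoint speaks the standard LanguageTool v2 check protocol, so it can also be queried directly from Python. The following is a minimal sketch; the helper names are ours, the response parsing follows LanguageTool's documented JSON format, and the sample response is hardcoded so the snippet runs without network access:

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://copyedits.wmcloud.org/v2/check"

def check_text(text, language="en", api_url=API_URL):
    """POST a text to a LanguageTool v2 endpoint and return the parsed JSON."""
    data = urllib.parse.urlencode({"language": language, "text": text}).encode()
    with urllib.request.urlopen(urllib.request.Request(api_url, data=data)) as resp:
        return json.load(resp)

def summarize_matches(response):
    """Reduce a LanguageTool response to (offset, length, message, first suggestion)."""
    out = []
    for m in response.get("matches", []):
        suggestion = m["replacements"][0]["value"] if m["replacements"] else None
        out.append((m["offset"], m["length"], m["message"], suggestion))
    return out

# Example response in the LanguageTool v2 JSON format (hardcoded stand-in
# for the output of check_text):
sample = {"matches": [{"offset": 5, "length": 4,
                       "message": "Possible spelling mistake found.",
                       "replacements": [{"value": "text"}]}]}
print(summarize_matches(sample))
```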
Frontend on toolforge
We also built an experimental API to run LanguageTool on Wikipedia articles. The tool automates some of the pre- and post-processing:
- it extracts the plain text of an article. Using the HTML version, we keep track of the HTML tags encoding whether a piece of text corresponds to, e.g., a link, a quote, or a reference
- it runs LanguageTool on the extracted plain text using the endpoint on cloud-vps
- it allows for filtering the copyedits based on heuristics. For example, we filter errors related to the anchor text of links.
The tool can be queried by specifying the language (e.g. "en") and the title of an article in the corresponding Wikipedia. Some example queries in different languages:
- https://copyedit.toolforge.org/api/v1/lt?lang=simple&title=Sodium
- https://copyedit.toolforge.org/api/v1/lt?lang=de&title=Panama_Hotel_(Seattle)
The supported languages are: ar, ast, be, br, ca, da, de, el, en, eo, es, fa, fr, ga, gl, it, ja, km, nl, pl, pt, ro, ru, simple, sk, sl, sv, ta, tl, uk, zh. These correspond to the Wikipedia projects for which there is a supported language in LanguageTool. We always use the generic language code without specifying a variant (e.g. “en” instead of “en-US”). For Simple English Wikipedia (simplewiki) we use LanguageTool with “en”.
More documentation is available at: https://gitlab.wikimedia.org/repos/research/copyedit-api
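The language handling described above can be sketched as a small helper (the function name is illustrative, not part of the actual tool):

```python
# Wikipedia projects with a supported LanguageTool language (the list above).
SUPPORTED = {"ar", "ast", "be", "br", "ca", "da", "de", "el", "en", "eo", "es",
             "fa", "fr", "ga", "gl", "it", "ja", "km", "nl", "pl", "pt", "ro",
             "ru", "simple", "sk", "sl", "sv", "ta", "tl", "uk", "zh"}

def lt_language(wiki_lang):
    """Map a Wikipedia language code to the LanguageTool language to query.

    Simple English Wikipedia is checked as generic English; variants such
    as "en-US" are never used.
    """
    if wiki_lang not in SUPPORTED:
        raise ValueError(f"no LanguageTool support for {wiki_lang!r}")
    return "en" if wiki_lang == "simple" else wiki_lang

print(lt_language("simple"))  # -> en
print(lt_language("de"))      # -> de
```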
Evaluation of LanguageTool
In order to evaluate the performance of LanguageTool in detecting errors, we need an annotated dataset with ground-truth errors. Comparing the predicted errors with the true errors, we can calculate performance metrics such as precision and recall, based on true positives (predicted errors that are genuine), false positives (predicted errors that are not genuine), and false negatives (genuine errors that were not predicted).
The main limitation is that such ground-truth datasets are extremely rare, even more so beyond English or for Wikipedia articles.
Benchmark corpus
One starting point is the NLP task of grammatical error correction, i.e. “the task of correcting different kinds of errors in text such as spelling, punctuation, grammatical, and word choice errors.” In the past, different benchmark datasets with ground-truth errors have been compiled to systematically compare approaches to grammatical error correction. However, most of these resources are only available for English.
We evaluate LanguageTool on the W&I benchmark data of the BEA-2019 shared task using ERRANT. W&I (Write & Improve) is an online platform that assists non-native English students with their writing. Specifically, students from around the world submit letters, stories, articles and essays in response to various prompts, and the W&I system provides instant feedback. Since W&I went live in 2014, W&I annotators have manually annotated some of these submissions and assigned them a CEFR level. Thus, we have annotated errors at three different levels: A (beginner), B (intermediate), C (advanced). My interpretation is that these levels contain errors of increasing complexity.
We then compare the errors found by LanguageTool with the ground-truth errors in the benchmark data. For LanguageTool, we use the language variants “en” and “en-US”. We evaluate both error detection (detection only) and error correction (detection + suggested improvement).
Error detection:

data | #sents | LT-lang | #TP | #FP | #FN | Prec. | Rec. | F0.5 |
---|---|---|---|---|---|---|---|---|
A.train | 10,880 | en | 2,338 | 2,045 | 26,734 | 0.5334 | 0.0804 | 0.2508 |
A.train | 10,880 | en-US | 4,108 | 3,200 | 24,964 | 0.5621 | 0.1413 | 0.3523 |
B.train | 13,202 | en | 1,363 | 1,954 | 22,854 | 0.4109 | 0.0563 | 0.1818 |
B.train | 13,202 | en-US | 2,586 | 3,335 | 21,631 | 0.4368 | 0.1068 | 0.2699 |
C.train | 10,667 | en | 516 | 1,362 | 9,140 | 0.2748 | 0.0534 | 0.1503 |
C.train | 10,667 | en-US | 924 | 2,436 | 8,732 | 0.2750 | 0.0957 | 0.2000 |
Error correction:

data | #sents | LT-lang | #TP | #FP | #FN | Prec. | Rec. | F0.5 |
---|---|---|---|---|---|---|---|---|
A.train | 10,880 | en | 1,898 | 2,481 | 26,264 | 0.4334 | 0.0674 | 0.2078 |
A.train | 10,880 | en-US | 2,873 | 4,431 | 25,289 | 0.3933 | 0.1020 | 0.2504 |
B.train | 13,202 | en | 1,175 | 2,136 | 22,490 | 0.3549 | 0.0497 | 0.1592 |
B.train | 13,202 | en-US | 1,911 | 4,004 | 21,754 | 0.3231 | 0.0808 | 0.2019 |
C.train | 10,667 | en | 461 | 1,415 | 9,017 | 0.2457 | 0.0486 | 0.1357 |
C.train | 10,667 | en-US | 739 | 2,619 | 8,739 | 0.2201 | 0.0780 | 0.1613 |
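The Prec., Rec. and F0.5 columns follow the standard definitions used in grammatical error correction evaluation. As a sanity check, recomputing them from the raw counts of the A.train/en detection row reproduces the tabulated values:

```python
def gec_metrics(tp, fp, fn, beta=0.5):
    """Precision, recall and F_beta from true/false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f_beta

# A.train, LT-lang "en", error detection (first table above).
p, r, f = gec_metrics(tp=2338, fp=2045, fn=26734)
print(round(p, 4), round(r, 4), round(f, 4))  # 0.5334 0.0804 0.2508
```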
Summary:
- Error detection yields a precision of around 55% on the easy corpus (A.train). The difference between language variants is small (53% for en and 56% for en-US)
- Error detection yields a recall between 8% (en) and 14% (en-US). The en-US variant is more sensitive, capturing more errors. This means that LanguageTool misses a large fraction of the errors; in absolute numbers, though, it still detects thousands of errors.
- The number of correctly detected errors decreases for medium (B.train) and hard (C.train) corpora.
- Error correction is a much harder problem; however, precision is still around 40% for the easy corpus (A.train)
Wikipedia (English)
We would like to understand how the results from the benchmark corpora generalize when applied to Wikipedia. However, evaluating LanguageTool on Wikipedia articles is more challenging: we don't have a ground-truth dataset of even a few articles with a complete annotation of all grammatical errors, so we cannot simply repeat the analysis from above.
Therefore, we use an approximation based on annotations at the article level (instead of at the level of each single error):
- Featured articles are considered to be some of the best articles Wikipedia has to offer. As a rough approximation, we assume that these articles are free of errors (as judged by Wikipedia’s editors); thus, we consider any error found here a false positive. For enwiki, we find 6,090 featured articles with 1,192,369 sentences.
- Articles with a copyedit-template. Wikipedia’s editors add this template at the top of an article to indicate that it “may require copy editing for grammar, style, cohesion, tone, or spelling.” We assume that these articles are more likely to contain errors. For enwiki, we find 1,024 articles with the copyedit-template, comprising 104,403 sentences.
Running LanguageTool, we get the following statistics:
- featured_en: 0.06 errors per sentence
- featured_en-US: 0.792 errors per sentence
- copyedit-template_en: 0.126 errors per sentence
Summary:
- How many false positives are there?
- Using LanguageTool with the language variant “en-US” on featured articles produces an extremely high number of false positives in Wikipedia articles. On average, almost every sentence yields a false positive. This is consistent with qualitative observations when using LanguageTool’s browser interface; in fact, the default language variant in the browser version is “en-US”.
- Using the language variant “en” on featured articles reduces the occurrence of false positives more than 10-fold, to only about 1 false positive every 15 sentences.
- How precise are errors highlighted by LanguageTool?
- Using the language variant “en” on copyedit-template articles, we find a higher error rate (0.126 per sentence) than for featured articles (0.06 per sentence). Assuming that the errors in featured articles reflect a baseline rate of false positives present in all articles, we can approximate the precision by subtracting this baseline: 0.126 - 0.06 = 0.066 genuine errors per sentence, corresponding to a precision of 0.066/0.126 = 0.524.
- This value for the precision is consistent with the findings on the benchmark corpora (52% vs. 53%)
- This precision estimate is likely a lower bound (in reality it is higher), since we assume that all errors found in featured articles are false positives when, in fact, some of them might be genuine.
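The baseline-subtraction estimate used above can be written out as a small helper (numbers from the enwiki measurements above; the function name is ours):

```python
def approx_precision(featured_rate, template_rate):
    """Approximate precision by treating the error rate in featured articles as
    the baseline false-positive rate shared by all articles: the genuine errors
    in copyedit-template articles are whatever exceeds that baseline."""
    genuine_rate = template_rate - featured_rate
    return genuine_rate / template_rate

# enwiki with language variant "en": 0.06 errors/sentence in featured articles,
# 0.126 in copyedit-template articles.
print(round(approx_precision(0.06, 0.126), 3))  # 0.524
```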
Error types
We can also look at the types of errors LanguageTool detects in Wikipedia articles.
- What sticks out is the large fraction of “misspelling” errors in featured articles when using “en-US”. One interpretation is that the “misspelling” rule is a main driver of false positives in Wikipedia articles.
Wikipedia (non-English)
We compare the error rate from LanguageTool in articles with the featured article badge (Q17437796) against articles containing the copyedit-template (Q6292692) in the corresponding language.
wiki_db | language-code | featured_n-art | featured_n-sent | featured_n-err | template_n-art | template_n-sent | template_n-err | featured_err-per-sent | template_err-per-sent | prec |
---|---|---|---|---|---|---|---|---|---|---|
enwiki | en | 6090 | 1192321 | 71574 | 1024 | 104403 | 13197 | 0.06 | 0.126 | 0.525 |
simplewiki | en | 30 | 4926 | 286 | 15 | 415 | 66 | 0.058 | 0.159 | 0.635 |
arwiki | ar | 692 | 154990 | 310459 | 512 | 22594 | 58990 | 2.003 | 2.611 | 0.233 |
astwiki | ast | 325 | 71918 | 324711 | 868 | 38430 | 169848 | 4.515 | 4.42 | 0 |
bewiki | be | 88 | 38043 | 65063 | 675 | 36571 | 61446 | 1.71 | 1.68 | 0 |
brwiki | br | 2 | 223 | 448 | 0 | 0 | 0 | 2.009 | - | - |
cawiki | ca | 764 | 145185 | 155363 | 8 | 397 | 538 | 1.07 | 1.355 | 0.21 |
dawiki | da | 17 | 5967 | 5077 | 14 | 819 | 676 | 0.851 | 0.825 | 0 |
dewiki | de | 2730 | 935452 | 102807 | 0 | 0 | 0 | 0.11 | - | - |
elwiki | el | 129 | 30611 | 44224 | 0 | 0 | 0 | 1.445 | - | - |
eowiki | eo | 311 | 70371 | 136043 | 0 | 0 | 0 | 1.933 | - | - |
eswiki | es | 1235 | 350673 | 425496 | 1547 | 99005 | 159589 | 1.213 | 1.612 | 0.247 |
fawiki | fa | 198 | 53013 | 3358 | 16 | 1024 | 141 | 0.063 | 0.138 | 0.54 |
frwiki | fr | 2019 | 679560 | 749826 | 0 | 0 | 0 | 1.103 | - | - |
gawiki | ga | 2 | 509 | 1433 | 0 | 0 | 0 | 2.815 | - | - |
glwiki | gl | 218 | 59451 | 112419 | 209 | 11371 | 21333 | 1.891 | 1.876 | 0 |
itwiki | it | 536 | 124571 | 207444 | 720 | 53209 | 106962 | 1.665 | 2.01 | 0.172 |
jawiki | ja | 92 | 30542 | 399 | 0 | 0 | 0 | 0.013 | - | - |
kmwiki | km | 21 | 930 | 28741 | 6 | 53 | 1506 | 30.904 | 28.415 | 0 |
nlwiki | nl | 365 | 115060 | 87518 | 0 | 0 | 0 | 0.761 | - | - |
plwiki | pl | 944 | 268900 | 220568 | 1 | 22 | 27 | 0.82 | 1.227 | 0.332 |
ptwiki | pt | 1315 | 328326 | 190550 | 1346 | 43865 | 32652 | 0.58 | 0.744 | 0.22 |
rowiki | ro | 196 | 58467 | 70636 | 256 | 15335 | 28486 | 1.208 | 1.858 | 0.35 |
ruwiki | ru | 1627 | 651035 | 480016 | 11 | 1157 | 304 | 0.737 | 0.263 | 0 |
skwiki | sk | 73 | 20324 | 24104 | 0 | 0 | 0 | 1.186 | - | - |
slwiki | sl | 381 | 86241 | 123624 | 212 | 11198 | 16979 | 1.433 | 1.516 | 0.055 |
svwiki | sv | 354 | 79217 | 81458 | 278 | 11898 | 16231 | 1.028 | 1.364 | 0.246 |
tawiki | ta | 14 | 3160 | 787 | 2 | 179 | 161 | 0.249 | 0.899 | 0.723 |
tlwiki | tl | 29 | 3274 | 13667 | 3 | 155 | 930 | 4.174 | 6 | 0.304 |
ukwiki | uk | 233 | 68208 | 40630 | 883 | 64783 | 42342 | 0.596 | 0.654 | 0.089 |
zhwiki | zh | 929 | 154844 | 15396 | 1761 | 57344 | 7259 | 0.099 | 0.127 | 0.215 |
Summary:
- The error rates in enwiki and simplewiki are consistent.
- In most languages, the error rate in articles with the copyedit-template is indeed higher than in featured articles.
- However, for most languages the precision is below 0.5, and for some languages the error rate in featured articles is similar to that in articles with the copyedit-template.
Takeaway:
- These results suggest adding a post-processing step in which we filter some errors, e.g. of certain types (such as spelling) or in certain text regions (such as the anchor text of links).
Filtering errors
We can now use additional post-processing to filter certain errors, and use the evaluation protocol above to assess whether this strategy improves precision. The goal is a filter that removes errors that are false positives (those in featured articles) while keeping as many genuine ones as possible (those in articles with copyedit-templates).
As a first naive attempt, we use the annotations of the text contained in the HTML of the article:
- we keep track of all substrings that carry any annotation (such as bold, italics, or a link)
- we filter an error from LanguageTool if: i) the position of the error overlaps with any of the annotated substrings; or ii) the string of the error matches any of the annotated substrings (even one appearing elsewhere in the article).
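A sketch of such an overlap-based filter; the representation of errors and annotated spans here is illustrative, not the tool's actual data structure:

```python
def should_filter(error_offset, error_length, error_text, annotated_spans):
    """Filter a LanguageTool error if it overlaps an annotated (bold, italic,
    link, ...) region of the plain text, or if its surface string equals one
    of the annotated substrings elsewhere in the article.

    `annotated_spans` is a list of (start, end, substring) tuples.
    """
    err_start, err_end = error_offset, error_offset + error_length
    for start, end, substring in annotated_spans:
        if err_start < end and start < err_end:  # position overlap
            return True
        if error_text == substring:              # surface-string match
            return True
    return False

# Two annotated spans: an italicized Latin name and a linked place name.
spans = [(0, 11, "Vigiliensis"), (30, 39, "Bisceglie")]
print(should_filter(2, 5, "iliens", spans))      # overlaps first span -> True
print(should_filter(100, 8, "Barletta", spans))  # no overlap, no match -> False
```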
wiki_db | language-code | featured_err-per-sent | featured_err-per-sent-filter | template_err-per-sent | template_err-per-sent-filter | prec | prec-filter | prec-change-ppt |
---|---|---|---|---|---|---|---|---|
enwiki | en | 0.06 | 0.037 | 0.126 | 0.072 | 0.525 | 0.489 | -0.036 |
simplewiki | en | 0.058 | 0.037 | 0.159 | 0.133 | 0.635 | 0.724 | 0.089 |
arwiki | ar | 2.003 | 1.01 | 2.611 | 1.785 | 0.233 | 0.434 | 0.201 |
astwiki | ast | 4.515 | 1.572 | 4.42 | 1.794 | 0 | 0.124 | 0.124 |
bewiki | be | 1.71 | 0.557 | 1.68 | 1.039 | 0 | 0.464 | 0.464 |
brwiki | br | 2.009 | 0.377 | - | - | - | - | - |
cawiki | ca | 1.07 | 0.291 | 1.355 | 0.554 | 0.21 | 0.475 | 0.265 |
dawiki | da | 0.851 | 0.242 | 0.825 | 0.336 | 0 | 0.28 | 0.28 |
dewiki | de | 0.11 | 0.042 | - | - | - | - | - |
elwiki | el | 1.445 | 0.564 | - | - | - | - | - |
eowiki | eo | 1.933 | 0.647 | - | - | - | - | - |
eswiki | es | 1.213 | 0.243 | 1.612 | 0.624 | 0.247 | 0.611 | 0.363 |
fawiki | fa | 0.063 | 0.014 | 0.138 | 0.057 | 0.54 | 0.757 | 0.217 |
frwiki | fr | 1.103 | 0.196 | - | - | - | - | - |
gawiki | ga | 2.815 | 1.525 | - | - | - | - | - |
glwiki | gl | 1.891 | 0.43 | 1.876 | 0.865 | 0 | 0.503 | 0.503 |
itwiki | it | 1.665 | 0.421 | 2.01 | 0.916 | 0.172 | 0.541 | 0.369 |
jawiki | ja | 0.013 | 0.012 | - | - | - | - | - |
kmwiki | km | 30.904 | 16.081 | 28.415 | 18.528 | 0 | 0.132 | 0.132 |
nlwiki | nl | 0.761 | 0.257 | - | - | - | - | - |
plwiki | pl | 0.82 | 0.284 | 1.227 | 0.591 | 0.332 | 0.519 | 0.187 |
ptwiki | pt | 0.58 | 0.314 | 0.744 | 0.461 | 0.22 | 0.318 | 0.098 |
rowiki | ro | 1.208 | 0.287 | 1.858 | 1.127 | 0.35 | 0.745 | 0.396 |
ruwiki | ru | 0.737 | 0.31 | 0.263 | 0.156 | 0 | 0 | 0 |
skwiki | sk | 1.186 | 0.511 | - | - | - | - | - |
slwiki | sl | 1.433 | 0.536 | 1.516 | 0.97 | 0.055 | 0.447 | 0.393 |
svwiki | sv | 1.028 | 0.293 | 1.364 | 0.487 | 0.246 | 0.398 | 0.152 |
tawiki | ta | 0.249 | 0.168 | 0.899 | 0.782 | 0.723 | 0.785 | 0.062 |
tlwiki | tl | 4.174 | 2.009 | 6 | 2.142 | 0.304 | 0.062 | -0.242 |
ukwiki | uk | 0.596 | 0.248 | 0.654 | 0.388 | 0.089 | 0.362 | 0.274 |
zhwiki | zh | 0.099 | 0.062 | 0.127 | 0.097 | 0.215 | 0.355 | 0.141 |
Summary:
- The post-processing step of filtering errors substantially improves the precision in almost all wikis (i.e. it filters relatively more errors in the featured articles than in the copyedit-template articles)
- Other more nuanced filters could lead to further improvements in the precision of LanguageTool.
Takeaways from the evaluation
- We can apply LanguageTool to at least 30 Wikipedias by running our own instance. Checking the text of Wikipedia articles requires some pre-processing of the text (e.g. to identify only the raw text and avoid content transcluded from templates) and post-processing to filter some errors (e.g. to avoid correcting the anchor text of links)
- LanguageTool can detect a high volume of copyedit errors beyond simple misspellings based on a dictionary-lookup.
- We estimate the precision of the errors surfaced by LanguageTool for English to be around 50% (or higher)
- The concern about a large number of false positives can be mitigated by using the generic language-variant (e.g. “en” instead of “en-US”).
- Applying filters to the detected errors can substantially improve the precision of LanguageTool in almost all wikis.
Comparison to spell-checker
In this section, we look at how spellcheckers perform on the same tasks as above. This gives us a sense of how LanguageTool compares to much simpler spellchecking tools. Specifically, I used the Enchant spell-checking library, which provides uniform access to spellcheckers for different languages via Python. Of the projects considered in the evaluation of LanguageTool, I readily found spellcheckers for the following subset via `enchant.list_languages()` (though many more languages can be installed): enwiki (en_US), simplewiki (en_US), arwiki (ar), cawiki (ca), dewiki (de_DE), elwiki (el), eswiki (es), fawiki (fa), frwiki (fr), glwiki (gl_ES), itwiki (it_IT), nlwiki (nl), plwiki (pl), ptwiki (pt_BR), rowiki (ro), ruwiki (ru_RU), svwiki (sv), ukwiki (uk).
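The error rates reported below are simply the number of flagged tokens divided by the number of sentences. A sketch of that computation, using a toy word set in place of a real dictionary (the actual evaluation calls pyenchant's `Dict.check(word)` instead):

```python
import re

# Toy stand-in for enchant.Dict("en_US"); the real evaluation uses pyenchant's
# Dict.check(word) instead of this word set.
KNOWN_WORDS = {"the", "diocese", "of", "was", "a", "roman", "catholic",
               "located", "in", "town", "on", "sea"}

def check(word):
    return word.lower() in KNOWN_WORDS

def errors_per_sentence(sentences):
    """Flagged tokens per sentence, the statistic reported in the tables below."""
    n_errors = sum(1 for sent in sentences
                   for token in re.findall(r"[^\W\d_]+", sent)
                   if not check(token))
    return n_errors / len(sentences)

sents = ["The diocese was located in the town.",
         "The dioces was locatd on the sea."]  # second sentence: 2 misspellings
print(errors_per_sentence(sents))  # 1.0
```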
Benchmark corpus
Error detection:

data | #sents | lang | #TP | #FP | #FN | Prec. | Rec. | F0.5 |
---|---|---|---|---|---|---|---|---|
A.train | 10,880 | en_GB | 1,878 | 1,471 | 27,194 | 0.5608 | 0.0646 | 0.2211 |
A.train | 10,880 | en_US | 1,925 | 1,720 | 27,147 | 0.5281 | 0.0662 | 0.2205 |
B.train | 13,202 | en_GB | 1,249 | 1,764 | 22,968 | 0.4145 | 0.0516 | 0.1722 |
B.train | 13,202 | en_US | 1,312 | 2,119 | 22,905 | 0.3824 | 0.0542 | 0.1729 |
C.train | 10,667 | en_GB | 423 | 1,288 | 9,233 | 0.2472 | 0.0438 | 0.1282 |
C.train | 10,667 | en_US | 460 | 1,692 | 9,196 | 0.2138 | 0.0476 | 0.1259 |
Error correction:

data | #sents | lang | #TP | #FP | #FN | Prec. | Rec. | F0.5 |
---|---|---|---|---|---|---|---|---|
A.train | 10,880 | en_GB | 872 | 2,477 | 27,290 | 0.2604 | 0.0310 | 0.1049 |
A.train | 10,880 | en_US | 897 | 2,748 | 27,265 | 0.2461 | 0.0319 | 0.1049 |
B.train | 13,202 | en_GB | 683 | 2,330 | 22,982 | 0.2267 | 0.0289 | 0.0956 |
B.train | 13,202 | en_US | 685 | 2,746 | 22,980 | 0.1997 | 0.0289 | 0.0916 |
C.train | 10,667 | en_GB | 271 | 1,440 | 9,207 | 0.1584 | 0.0286 | 0.0830 |
C.train | 10,667 | en_US | 276 | 1,876 | 9,202 | 0.1283 | 0.0291 | 0.0763 |
Summary:
- There is little difference between using the en_US and the en_GB spellchecker
- The performance in error detection is comparable to that of LanguageTool
- The performance in error correction is only about half as good as that of LanguageTool (both in terms of precision and recall)
Wikipedia
wiki_db | language-code | featured_n-art | featured_n-sent | featured_n-err | template_n-art | template_n-sent | template_n-err | featured_err-per-sent | template_err-per-sent | prec |
---|---|---|---|---|---|---|---|---|---|---|
enwiki | en_US | 6090 | 1235144 | 1221727 | 1024 | 108060 | 148391 | 0.989 | 1.373 | 0.280 |
simplewiki | en_US | 30 | 5045 | 1714 | 15 | 435 | 675 | 0.340 | 1.552 | 0.781 |
arwiki | ar | 692 | 173033 | 1593977 | 512 | 22594 | 136618 | 9.212 | 6.047 | 0.000 |
cawiki | ca | 764 | 145185 | 182387 | 8 | 397 | 535 | 1.256 | 1.348 | 0.068 |
dewiki | de_DE | 2730 | 935452 | 1390710 | 0 | 0 | 0 | 1.487 | - | - |
elwiki | el | 129 | 30611 | 45876 | 0 | 0 | 0 | 1.499 | - | - |
eswiki | es | 1235 | 350673 | 721562 | 1547 | 99005 | 238857 | 2.058 | 2.413 | 0.147 |
fawiki | fa | 198 | 53013 | 205747 | 16 | 1024 | 3370 | 3.881 | 3.291 | 0.000 |
frwiki | fr | 2019 | 679560 | 2332701 | 0 | 0 | 0 | 3.433 | - | - |
glwiki | gl_ES | 218 | 59451 | 57637 | 209 | 11371 | 11756 | 0.969 | 1.034 | 0.062 |
itwiki | it_IT | 536 | 124571 | 220426 | 720 | 53209 | 117756 | 1.769 | 2.213 | 0.200 |
nlwiki | nl | 365 | 115060 | 109633 | 0 | 0 | 0 | 0.953 | - | - |
plwiki | pl | 944 | 268900 | 198679 | 1 | 22 | 29 | 0.739 | 1.318 | 0.439 |
ptwiki | pt_BR | 1315 | 328326 | 433376 | 1346 | 43865 | 85232 | 1.320 | 1.943 | 0.321 |
rowiki | ro | 196 | 58467 | 70200 | 256 | 15335 | 27174 | 1.201 | 1.772 | 0.322 |
ruwiki | ru_RU | 1627 | 651035 | 1069748 | 11 | 1157 | 845 | 1.643 | 0.730 | 0.000 |
svwiki | sv | 354 | 79217 | 146754 | 278 | 11898 | 26267 | 1.853 | 2.208 | 0.161 |
ukwiki | uk | 233 | 68208 | 75141 | 883 | 64783 | 72344 | 1.102 | 1.117 | 0.013 |
Summary:
- For featured articles, the error rate from the spellcheckers is much higher than that from LanguageTool (e.g. in enwiki we find about 1 error per sentence, compared to 0.06 errors per sentence when using LanguageTool). In general, this translates into lower precision for spellcheckers.
Filtering errors
wiki_db | language-code | featured_err-per-sent | featured_err-per-sent-filter | template_err-per-sent | template_err-per-sent-filter | prec | prec-filter | prec-change-ppt |
---|---|---|---|---|---|---|---|---|
enwiki | en_US | 0.989 | 0.253 | 1.373 | 0.521 | 0.280 | 0.514 | 0.235 |
simplewiki | en_US | 0.340 | 0.079 | 1.552 | 0.563 | 0.781 | 0.859 | 0.078 |
arwiki | ar | 9.212 | 6.703 | 6.047 | 5.007 | 0.000 | 0.000 | 0.000 |
cawiki | ca | 1.256 | 0.268 | 1.348 | 0.411 | 0.068 | 0.348 | 0.280 |
dewiki | de_DE | 1.487 | 0.428 | - | - | - | - | - |
elwiki | el | 1.499 | 0.600 | - | - | - | - | - |
eswiki | es | 2.058 | 0.303 | 2.413 | 0.779 | 0.147 | 0.611 | 0.464 |
fawiki | fa | 3.881 | 1.735 | 3.291 | 2.102 | 0.000 | 0.174 | 0.174 |
frwiki | fr | 3.433 | 0.680 | - | - | - | - | - |
glwiki | gl_ES | 0.969 | 0.182 | 1.034 | 0.487 | 0.062 | 0.627 | 0.565 |
itwiki | it_IT | 1.769 | 0.378 | 2.213 | 0.929 | 0.200 | 0.593 | 0.393 |
nlwiki | nl | 0.953 | 0.201 | - | - | - | - | - |
plwiki | pl | 0.739 | 0.222 | 1.318 | 0.591 | 0.439 | 0.624 | 0.185 |
ptwiki | pt_BR | 1.320 | 0.243 | 1.943 | 0.618 | 0.321 | 0.608 | 0.287 |
rowiki | ro | 1.201 | 0.257 | 1.772 | 1.059 | 0.322 | 0.757 | 0.435 |
ruwiki | ru_RU | 1.643 | 0.465 | 0.730 | 0.303 | 0.000 | 0.000 | 0.000 |
svwiki | sv | 1.853 | 0.582 | 2.208 | 0.801 | 0.161 | 0.273 | 0.112 |
ukwiki | uk | 1.102 | 0.363 | 1.117 | 0.620 | 0.013 | 0.415 | 0.402 |
Summary:
- The filtering of errors substantially increases the precision of spellcheckers. The resulting approximate precision after filtering is comparable to that of LanguageTool but still remains systematically lower.
Takeaways
- Spellcheckers can also detect and surface many meaningful errors for copyediting
- Spellcheckers seem to suffer from a much higher rate of false positives than LanguageTool. This can be partially addressed by imposing aggressive post-processing filters on the surfaced errors.
- LanguageTool has a clear advantage in suggesting the correct improvement (spellcheckers perform substantially worse in error correction)
- Since spellcheckers are available in many languages, they can serve as a backup solution for languages which are not supported by LanguageTool.
- Given the higher rate of false positives when using spellcheckers, it would be desirable to develop a model that can assign a confidence score to the surfaced errors so that the structured task could prioritize surfacing those errors which were assigned a high confidence score.