Community Wishlist Survey 2020/Wikisource/New OCR tool

Random proposal ►◄ Wikisource The survey has concluded. Here are the results!

New OCR tool

Problem: 1) Wikisource has to rely on external OCR tools. The most widely used one has been out of service for many months and all that time we are waiting, whether its creator appears and repairs it or not. The other external OCR tools do not work well (they either have extremely slow response, or generate bad quality text). None of these tools can also handle text divided into columns in magazine pages and they often have problems with non-English characters and diacritics, the OCR output needs to be improved.
2) The tool hOCR is not working for wikisources based on non-Latin scripts. PheTool hOCR is creating a Tesseract OCR text layer for wikisources based on Latin script. E. g. for Indic Wikisource, there is a temporary Google OCR to do this, but integrating non-Latin scripts into our tool would be more useful.
Who would benefit: Wikisource contributors handling scanned texts which do not have an original OCR layer or whose original OCR layer is poor, and contributors to wikisources based on non-Latin scripts.
Proposed solution: Create an integral OCR tool that the Wikimedia programmers would be able to maintain without relying on help of one specific person. The tool should:
- be quick
- generate good quality OCR text
- be able to handle text written in columns
- be able to handle non-English characters of Latin script including diacritics
- be able to handle non-Latin languages

Tesseract, which is an open source application, also has a specific procedure to training OCR which requires corrected text of a page and an image of the page itself. On the Wikisource side, pages that have been marked as proofread show books that have been transcribed and reviewed fully. So, what needs to be done is to strip formatting the text of these finished trascriptions, expand template transclusions and move references to the bottom. Then take the text along with an image of the page in question and run it through the Tesseracts procedure. The improvement would then be updated on ToolLabs. The better the OCR the easier the process is with each book, allowing Wikisource editors to become more productive, completing more pages than they could do previously. This would also motivate users on Wikisource.

More comments: Proposals "hOCR should work for all wikisource" by Jayantanth and "Improve OCR with wikisource texts" by Snaevar were merged into this proposal.

Some concerns have appeared that WMF nearly always uses open source software, which excludes e. g. Abby Reader and Adobe, and that the problem with free OCR engines is their lack of language support, so they are never really going to replace Phe's tools fully. I do not know whether free OCR engines suffice for this task or not, but I hope the new tool to be as good or even better than Phe's tools and ideological reasons that would be an obstacle to quality should be put aside.

Phabricator tickets: phab:T161978, phab:T228594
Proposer: Jan.Kamenicek (talk) 19:51, 24 October 2019 (UTC)[reply]

Discussion

I think this is the #1 biggest platform-related problem we are facing on English Wikisource at this time. —Beleg Tâl (talk) 15:09, 27 October 2019 (UTC)[reply]

Yeah. For some reason neither Google Cloud nor phetools support all of the languages of Tesseract. Tesseract in comparision to the wikisources is missing Anglo-Saxon, Faroese, Armenian, Limburgish, Neapolitan, Piedmontese, Sakha, Venetian and Min nan.--Snaevar (talk) 15:12, 27 October 2019 (UTC)[reply]

Note that you really don't want a tool that scans all pages for all languages as that is so compute-intensive that you'd wait minutes for every page you tried to OCR. Tesseract supports a boatload of languages and scripts, and can be trained for more, but you still need a sensible way to pick which ones are relevant on any given page. --Xover (talk) 07:27, 31 October 2019 (UTC)[reply]

I know. Both the Google Cloud and phetools gadgets pull the language from the language code of the wikisource that the button is pressed on and thus only uses one language. The same thing applies here. These languages are mentioned however so it is clear which wikisources this proposal could support, and witch ones it would not. P.S. I am not american, so I will never try to word things to cover all bases.--Snaevar (talk) 23:01, 2 November 2019 (UTC)[reply]

Even aside from the OCR aspect, being able to extract the formatting out of a PDF int wikitext would be highly valuable for converting pdfs (and other formats via pdf) into wikimarkup. T.Shafee(Evo﹠Evo)^talk 11:19, 29 October 2019 (UTC)[reply]

I am not sure about formatting. Some scans or even originals are quite poor and in such cases the result of trying to identify italics or bold letters may be much worse than if the tool extracted just pure text. I would support adding such feature only if it were possible to be turned on and off. --Jan.Kamenicek (talk) 22:05, 30 October 2019 (UTC)[reply]

Many pages requires only simple automatic OCR. But there are pages with another font (italics, fraktur) or pages with mixed languages (e.g. Missal both in local language and latin), where would be usseful to have possibility of some recognizing options. This can be more easily made on local PC, but not everybody have this option. JAn Dudík (talk) 11:21, 31 October 2019 (UTC)[reply]

Would also be great to default the OCR formatting to match the MOS, rather than having to change it all to conform to the MOS manually. --Yodin^T 14:19, 25 November 2019 (UTC)[reply]

Voting

Support Bodhisattwa (talk) 06:45, 21 November 2019 (UTC)[reply]
Support JAn Dudík (talk) 07:15, 21 November 2019 (UTC)[reply]
Support Le ciel est par dessus le toit (talk) 13:00, 21 November 2019 (UTC)[reply]
Support Lyokoï (talk) 17:32, 21 November 2019 (UTC)[reply]
Support Tpt (talk) 19:36, 21 November 2019 (UTC)[reply]
Support: impossible to contribute since Phe’s tool is down. —Pols12 (talk) 21:03, 21 November 2019 (UTC)[reply]
Support Pamputt (talk) 21:38, 21 November 2019 (UTC)[reply]
Support Sadads (talk) 21:41, 21 November 2019 (UTC)[reply]
Support Balajijagadesh (talk) 05:24, 22 November 2019 (UTC)[reply]
Support Libcub (talk) 08:13, 22 November 2019 (UTC)[reply]
Support Jahl de Vautban (talk) 09:22, 22 November 2019 (UTC)[reply]
Support Lionel Scheepmans ^{✉ Contact} _{French native speaker, sorry for my dysorthography} 10:47, 22 November 2019 (UTC)[reply]
Support Alan ^Talk 12:46, 22 November 2019 (UTC)
Support JLTB34 (talk) 13:29, 22 November 2019 (UTC)[reply]
Support GPSLeo (talk) 21:10, 22 November 2019 (UTC)[reply]
Support DraconicDark (talk) 02:29, 23 November 2019 (UTC)[reply]
Support FreeCorp (talk) 05:25, 23 November 2019 (UTC)[reply]
Support Pavithra.A (talk) 12:14, 23 November 2019 (UTC)[reply]
Support Emptyfear (talk) 17:12, 23 November 2019 (UTC)[reply]
Support @ջեօ 17:15, 23 November 2019 (UTC)[reply]
Support --Armenmir (talk) 17:27, 23 November 2019 (UTC)[reply]
Support আফতাবুজ্জামান (talk) 23:18, 23 November 2019 (UTC)[reply]
Support Liuxinyu970226 (talk) 10:26, 24 November 2019 (UTC)[reply]
Support VIGNERON * ^discut. 10:40, 24 November 2019 (UTC)[reply]
Support Pymouss Tchatcher - 11:38, 24 November 2019 (UTC)[reply]
Support Eatcha (talk) 12:22, 25 November 2019 (UTC)[reply]
Support --Bander7799 (talk) 12:34, 25 November 2019 (UTC)[reply]
Support JogiAsad (talk) 13:27, 25 November 2019 (UTC)[reply]
Support Murma174 (talk) 13:27, 25 November 2019 (UTC)[reply]
Support Also in rtl language wikisource, do not insert ltr tags before punctuation marks. This causes problems. Naḥum (talk) 13:37, 25 November 2019 (UTC)[reply]
Support --Yodin^T 14:19, 25 November 2019 (UTC)[reply]
Support Blue Rasberry (talk) 15:32, 25 November 2019 (UTC)[reply]
Support –MJL ‐Talk‐^☖ 15:35, 25 November 2019 (UTC)[reply]
Support Husky (talk) 16:12, 25 November 2019 (UTC)[reply]
Support A garbage person (talk) 16:19, 25 November 2019 (UTC)[reply]
Support 游魂 16:43, 25 November 2019 (UTC)[reply]
Support Sgvijayakumar (talk) 19:09, 25 November 2019 (UTC)[reply]
Support Ninovolador (talk) 21:27, 25 November 2019 (UTC)[reply]
Support Vkalaivani (talk) 22:46, 25 November 2019 (UTC)[reply]
Support Risker (talk) 05:03, 26 November 2019 (UTC)[reply]
Support Geonuch (talk) 05:32, 26 November 2019 (UTC)[reply]
Support Hsarrazin (talk) 14:31, 26 November 2019 (UTC)[reply]
Support β₁₆ - (talk) 15:08, 26 November 2019 (UTC)[reply]
Support Thibaut120094 (talk) 16:51, 26 November 2019 (UTC)[reply]
Support Noting that Community Tech forking and fixing Phe's tools will help precisely nothing in the long run. We need a WMF-supported tool that's within some WMF team's responsibilities to maintain and properly integrated into Mediawiki release cycles. Make use of volunteers where available, certainly, but someone at the WMF needs to own the OCR tool or it might as well stay a community gadget. Do please feel free to use this Wish to spend the necessary time kicking Phe's OCR tools until they start working again though. It's bound to be something stupid that's making it fail: like, has anybody tried to simply restart the tool? It could be hanging on a stale NFS file handle for all we know! Xover (talk) 06:10, 27 November 2019 (UTC)[reply]
That is exactly what I hope is going to be solved. In this proposal I stated the problem: "Wikisource has to rely on external OCR tools" and proposed the solution: "Create an integral OCR tool that the Wikimedia programmers would be able to maintain without relying on help of one specific person." --Jan Kameníček (talk) 10:14, 1 December 2019 (UTC)[reply]
Support Acélan (talk) 13:19, 27 November 2019 (UTC)[reply]
Support Harkawal Benipal (talk) 16:08, 27 November 2019 (UTC)[reply]
Support Indic Wikisource community members at Wiki Advanced Training 2019 asked for a Bulk OCR tool not dependent on platform (Linux, Windows etc.). I hope this tool allows Bulk OCRing pages. Satdeep Gill (talk) 16:43, 27 November 2019 (UTC)[reply]
Support WhatamIdoing (talk) 16:55, 27 November 2019 (UTC)[reply]
Support Pyb (talk) 18:05, 27 November 2019 (UTC)[reply]
Support This would be number my #1 for wikisource. Of course it should be open source. Wellparp (talk) 19:03, 28 November 2019 (UTC)[reply]
Support Peter Alberti (talk) 19:54, 28 November 2019 (UTC)[reply]
Support 94rain ^Talk 12:53, 30 November 2019 (UTC)[reply]
Support Satpal Dandiwal (talk) 21:07, 30 November 2019 (UTC)[reply]
Support while also agreeing with Xover's thoughts. Mahir256 (talk) 07:37, 1 December 2019 (UTC)[reply]
Support Candalua (talk) 16:35, 1 December 2019 (UTC)[reply]
Support Rahmanuddin (talk) 06:49, 2 December 2019 (UTC)[reply]
Support सुबोध कुलकर्णी (talk) 12:25, 2 December 2019 (UTC)[reply]
Support Ruthven (msg) 12:41, 2 December 2019 (UTC)[reply]
Support Sannita - not just another it.wiki sysop 13:19, 2 December 2019 (UTC)[reply]
Support Jberkel (talk) 13:22, 2 December 2019 (UTC)[reply]
Support Saederup92 (talk) 13:24, 2 December 2019 (UTC)[reply]
Support Omshivaprakash (talk) 14:14, 2 December 2019 (UTC)[reply]
Support Novak Watchmen (talk) 17:54, 2 December 2019 (UTC)[reply]
Support --Yoosef Pooranvary (talk) 11:38, 19 November 2020 (UTC)[reply]