Community Wishlist Survey 2020/Wikisource/Improve extraction of a text layer from PDFs

Random proposal ►◄ Wikisource The survey has concluded. Here are the results!

Improve extraction of a text layer from PDFs

Problem: If a scan in PDF has an OCR layer (i. e. original OCR layer, usually of high quality, which is a part of many PDF scans provided by libraries, not the OCR text obtained by our OCR tools), the text is very poorly extracted from it in Wikisource page namespace. DJVUs do not suffer this problem and their OCR layer is extracted well. If the PDF is converted into DJVU, the extraction of the text from its OCR layer improves too. (Example of OCR extraction from a pdf here: [1], example of the same from djvu here: [2] ) As most libraries including Internet Archive or HathiTrust offer downloading PDFs with OCR layers and not DJVUs, we need to fix the text extraction from PDFs.
Who would benefit: All Wikisource contributors working with PDF scans downloaded from various major libraries (see above). Some contributors in Commons have expressed their concern that the DjVu file format is dying and attempted to deprecate it in favour of PDF. Although the attempt has not succeeded (this time), many people still prefer working with PDFs (because the DJVU format is difficult to work with for them, or they do not know how to convert PDF into DJVU, how to edit DJVU scans, and also because DJVU format is not supported by Internet browsers...)
Proposed solution: Fix the extraction of text from existing OCR layers of scans in PDF.
More comments:
Phabricator tickets:
Proposer: Jan.Kamenicek (talk) 20:18, 24 October 2019 (UTC)[reply]

Discussion

There are also libraries, where is possible to download bunch of pages (20-100) in PDF, but no or only single in djvu.

There is also possibility of external google OCR

mw.loader.load('//wikisource.org/w/index.php?title=MediaWiki:GoogleOCR.js&action=raw&ctype=text/javascript');

, but there are more ocr errors and sometimes there are mixed lines. JAn Dudík (talk) 12:13, 25 October 2019 (UTC)[reply]

Yes, exactly, the Google OCR is really poor (en.ws has it among their gadgets), but the original OCR layer which is a part of most scans obtained from libraries is often really good, only Mediawiki fails to extract it correctly. If you download a PDF document e.g. from HathiTrust, it usually contains an OCR layer provided by the library (i.e. not obtained by some of our tools), and when you try to use this original OCR layer in the Wikisource page namespace, you get very poor results. But, if you take the same PDF document and convert it to djvu prior to uploading it here, then you get amazingly better results when extracting the text from the original OCR layer in Wikisource, and you do not need any of our OCR tools. This means that the original OCR layer of the PDF is good, only we are not able to extract it right from the PDF for some reason, although we are able to extract if from DJVU. --Jan.Kamenicek (talk) 17:10, 25 October 2019 (UTC)[reply]

yeah - it is pretty bad when the text layer does not appear, and OCR buttons hang with gray, but i can cut and paste text from IA txt file. clearly a failure to hand-off clear text layer. Slowking4 (talk) 02:34, 28 October 2019 (UTC)[reply]

@Jan.Kamenicek: It sounds like there are various problems with the extraction of some PDFs' text layers. Would indeed be great to fix! Is this to do with e.g. columns? Could you please add some examples to this proposal of PDFs (or pages of) that are failing to be extracted correctly? Thanks! —Sam Wilson 15:54, 12 November 2019 (UTC)[reply]
@Samwilson: We can compare File:The Hussite Wars, by the Count Lützow.pdf with the File:The Hussite wars, by the Count Lützow.djvu. The PDF file was downloaded from HathiTrust, the djvu file was created by converting the pdf file into djvu using an online converter. I personally would expect that further processing would result in some data loss and that the quality of the djvu file would be worse. However, when you compare for example [3] with [4], you can see that Mediawiki extracts the OCR layer much better from the DJVU file than from the PDF file, which means that Mediawiki is not able to extract the OCR layer from PDF files properly. --Jan.Kamenicek (talk) 17:17, 12 November 2019 (UTC) I have also added the examples into the problem description above. --Jan.Kamenicek (talk) 17:22, 12 November 2019 (UTC)[reply]
@Jan.Kamenicek: Thanks, that's very useful! Sam Wilson 19:39, 12 November 2019 (UTC)[reply]

Voting

Support veru useful. JAn Dudík (talk) 07:16, 21 November 2019 (UTC)[reply]
Support MartinPoulter (talk) 14:22, 21 November 2019 (UTC)[reply]
Support Libcub (talk) 08:21, 22 November 2019 (UTC)[reply]
Support, as a proposer. --Jan.Kamenicek (talk) 10:51, 23 November 2019 (UTC)[reply]
Support --Achim (talk) 15:18, 24 November 2019 (UTC)[reply]
Support Sewepb (talk) 07:16, 25 November 2019 (UTC)[reply]
Support Liuxinyu970226 (talk) 15:27, 25 November 2019 (UTC)[reply]
Support 游魂 16:48, 25 November 2019 (UTC)[reply]
Support Ciao • Bestoernesto • ✉ 17:54, 25 November 2019 (UTC)[reply]
Support Mauricio V. Genta (talk) 02:46, 27 November 2019 (UTC)[reply]
Support Geonuch (talk) 11:23, 27 November 2019 (UTC)[reply]
Support No brainer. This should have been fixed ages ago as routine bugfixing. The same for phab:T219376 and phab:T230415. These aren't so much new features or improved functionality as just plain bugs. Xover (talk) 12:20, 27 November 2019 (UTC)[reply]
Support Wellparp (talk) 19:10, 28 November 2019 (UTC)[reply]
Support Peter Alberti (talk) 19:53, 28 November 2019 (UTC)[reply]
Support VIGNERON * ^discut. 14:48, 29 November 2019 (UTC)[reply]
Support DraconicDark (talk) 15:36, 30 November 2019 (UTC)[reply]
Support Ruthven (msg) 12:40, 2 December 2019 (UTC)[reply]
Support Sannita - not just another it.wiki sysop 13:17, 2 December 2019 (UTC)[reply]
Support Novak Watchmen (talk) 17:55, 2 December 2019 (UTC)[reply]