Community Wishlist Survey 2020/Wikisource/Generate thumbnails for large-format PDFs

Generate thumbnails for large-format PDFs

  • Problem: For some PDFs with very large page images (typically scanned newspapers), no preview images (called "thumbnails") are shown.
  • Who would benefit: Wikisource contributors proofreading newspaper pages.
  • Proposed solution: Look at the PDF files described in phab:T25326, phab:T151202, commons:Category:Finlands Allmänna Tidning 1878, to find out why no thumbnails are generated.
  • More comments: When the JPEG for an individual page is extracted, that JPEG can be uploaded. But when the same JPEG is baked into a PDF, no thumbnail is generated. Is it because of its size? Small pages (books) work fine, but newspapers (large pages) fail.
  • Phabricator tickets: phab:T151202
  • Proposer: LA2 (talk) 21:04, 23 October 2019 (UTC)
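
To put the "Is it because of its size?" question in numbers, here is a minimal sketch that compares the pixel count of a book page and a newspaper page. The page dimensions and the 300 dpi scan resolution are hypothetical values chosen for illustration, not measurements of the files mentioned above, and the sketch does not establish that pixel count is the actual cause of the failure.

```python
def pixel_count(width_in, height_in, dpi):
    """Total pixels of a page scanned at `dpi` dots per inch."""
    return round(width_in * dpi) * round(height_in * dpi)


# Hypothetical page sizes in inches; not taken from the affected files.
book_page = pixel_count(6, 9, 300)
newspaper_page = pixel_count(16, 22, 300)

print(f"book page:      {book_page:,} pixels")
print(f"newspaper page: {newspaper_page:,} pixels")
print(f"ratio:          {newspaper_page / book_page:.1f}x")
```

Under these assumed sizes a newspaper page holds several times as many pixels as a book page, which is the kind of gap the proposer suspects matters.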

Discussion

  • Hi LA2! Can you provide a description of the problem? This could help give us a deeper understanding of the wish. Thank you! --IFried (WMF) (talk) 18:52, 25 October 2019 (UTC)
    The problem is very easy to understand. I find a free, digitized PDF and upload it to Commons, then start to proofread in Wikisource. This always works fine for normal books, but when I try the same for newspapers, no image is generated. Apparently this is because the image has a larger number of pixels. I haven't tried to figure out what the limit is. --LA2 (talk) 21:36, 25 October 2019 (UTC)
  • What about providing a more compact design for proofreading in general? Those seconds of scrolling add up. If we had, on one side, the window with the extracted text and, on the other side, a same-size window with the scan in which you can zoom and move quickly, that should save time and be more attractive to newbies. The way it is now looks kind of techy and is in some cases difficult to handle. For example, there should also be more content help, or a link to the discussion page, presented in a more attractive design. Juandev (talk) 09:22, 4 November 2019 (UTC)
  • I think that a tool that allows such thumbnails to be generated manually / on request / offline, with much higher limits, and available to a specific group of users (Commons admins? a dedicated group?) may be a workaround for this problem. Ankry (talk) 20:23, 9 November 2019 (UTC)
I'm not sure what you want me to check. The question at hand is why that particular version of the file failed to render. Bawolff (talk) 09:15, 22 November 2019 (UTC)
Wow, @Hrishikes and Bawolff:, there is a fix? How exactly does it work? Could it be integrated into the upload process? Could it be applied to all files in commons:Category:Finlands Allmänna Tidning 1878? --LA2 (talk) 19:13, 10 December 2019 (UTC)
@LA2: -- This problem occurs in highly compressed files and is linked to the OCR layer. The fix consists of decompressing the file (so that the size in MB increases) and either flattening or removing the OCR layer. I first tried flattening; it usually works but did not in this case, so I removed the OCR. Now it works. And yes, it is potentially usable for other files in your category: extract the pages as PNG/JPG and rebuild the PDF. Hrishikes (talk) 01:39, 11 December 2019 (UTC)
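
The "extract the pages and rebuild the PDF" workaround described above could be scripted roughly as follows. This is only a sketch under assumptions: the comment does not say which tools were used, so the choice of the `pdftoppm` (Poppler) and `img2pdf` command-line tools, the 150 dpi resolution, and all function and file names here are illustrative. Rendering pages to JPEG drops any OCR text layer, because the output is image-only.

```python
import re
import subprocess
from pathlib import Path


def page_number(path):
    """Numeric suffix of a pdftoppm output name such as 'page-12.jpg'."""
    match = re.search(r"-(\d+)\.jpg$", Path(path).name)
    return int(match.group(1)) if match else 0


def extract_command(src_pdf, prefix, dpi=150):
    # pdftoppm renders each page to an image file; -jpeg selects JPEG
    # output and -r sets the resolution (150 dpi is an assumed value).
    return ["pdftoppm", "-r", str(dpi), "-jpeg", str(src_pdf), str(prefix)]


def rebuild_command(images, out_pdf):
    # img2pdf wraps the images in a new PDF. Sort by page number so that
    # page-10 does not end up before page-2.
    ordered = sorted(images, key=page_number)
    return ["img2pdf", *map(str, ordered), "-o", str(out_pdf)]


def rebuild_without_ocr(src_pdf, work_dir, out_pdf, dpi=150):
    """Render every page to JPEG, then build an image-only PDF."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    prefix = work / "page"
    subprocess.run(extract_command(src_pdf, prefix, dpi), check=True)
    images = list(work.glob("page-*.jpg"))
    subprocess.run(rebuild_command(images, out_pdf), check=True)
```

A call such as `rebuild_without_ocr("newspaper.pdf", "tmp_pages", "newspaper_rebuilt.pdf")` (hypothetical file names) requires both tools to be installed. Whether the rebuilt file then gets thumbnails on Commons would still have to be checked per file.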

Voting