Community Wishlist Survey 2020/Wikisource

Wikisource
28 proposals, 224 contributors, 781 support votes
The survey has closed. Thanks for your participation :)



UI improvements on Wikisource

Edit proposal/discussion

  • Problem: A big part of the work on Wikisource is proofreading OCR texts. The 2010 wikitext editor has some useful functions, but they are split across several tabs:
    • Advanced - contains the very useful search-and-replace button
    • Special characters - contains many characters that are not on the keyboard
    • Proofread tools (Page namespace only) - some more tools.
    When I am working on a longer OCR text, there are typical errors that can be fixed by search and replace (e.g. " -> “ or ii -> n), so I must use the first tab. Then a character from another language is missing, so I must switch to the second tab and find it. Then I find the next typical error, so I must switch back to the first...
  • Who would benefit: Wikisource editors, but useful for other projects too.
  • Proposed solution: Proofreading is probably done mainly on desktops and notebooks, whose monitors are wide enough to fit all these tools in a single tab without the need to switch again and again.
  • More comments:
  • Phabricator tickets:
  • Proposer: JAn Dudík (talk) 20:59, 22 October 2019 (UTC)

Discussion

Hi, did you know that you can customize the edittoolbar to your liking? See https://www.mediawiki.org/wiki/Manual:Custom_edit_buttons. Also I use a search-replace plugin directly in a browser as this works better for me. See e.g. https://chrome.google.com/webstore/detail/find-replace-for-text-edi/jajhdmnpiocpbpnlpejbgmpijgmoknnl https://addons.mozilla.org/en-US/firefox/addon/find-replace-for-text-editing/?src=search I use the chrome one and it works alright for simple stuff. For more advanced stuff I copy the text to notepad++/notepadqq/libreoffice writer and do the regex stuff there.--So9q (talk) 11:26, 25 October 2019 (UTC)
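The typical-error fixes described in this proposal can also be scripted locally; a minimal Python sketch (the substitution list is only an example, not an agreed set of rules):

```python
import re

# Typical OCR fixes from the proposal: straight quotes to curly quotes,
# and "ii" misread for "n". These particular rules are examples only;
# every scan needs its own list, applied in order.
SUBSTITUTIONS = [
    (r'"(\w)', '“\\1'),    # straight quote before a word -> opening curly quote
    (r'(\w)"', '\\1”'),    # straight quote after a word -> closing curly quote
    (r'\bmaii\b', 'man'),  # hypothetical whole-word fix for an ii/n misread
]

def clean_ocr(text):
    """Apply each search-and-replace rule in order."""
    for pattern, repl in SUBSTITUTIONS:
        text = re.sub(pattern, repl, text)
    return text
```

A browser plugin or toolbar button does the same thing one rule at a time; the point of the proposal is having all of these tools visible at once.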

Voting

Repair Index finder

Edit proposal/discussion

  • Problem: It's rather similar to the first proposal on this page: for at least a month, the index finder has been broken; whatever title you put into it, it says something along the lines of "The index finder is broken. Sorry for the inconvenience." (This is just from memory!) It also gives a list of indexes, from the largest to the smallest. The workaround I am using now is the index finder built into the search engine.
  • Who would benefit: Everybody who wants to find an index.
  • Proposed solution: Somebody with a good knowledge of bugs? I'm not good at wikicode!
  • More comments: Excuses for any vague terminology - I am writing via mobile.
  • Phabricator tickets: task T232710
  • Proposer: Orlando the Cat (talk) 07:00, 5 November 2019 (UTC)

Discussion

Voting

Enable book2scroll that works for all Wikisources

Edit proposal/discussion

  • Problem: book2scroll is not enabled for all Wikisources and does not work for any non-Latin Wikisource. This tool is very useful for numbering page markers in the Index: pages.
  • Who would benefit: The whole Wikisource community.
  • Proposed solution: The problem is that this code is very old (as in Toolserver-old) and only works with some site naming schemes; many titles in other languages do not work either.
  • More comments: same as the previous year's list
  • Phabricator tickets: phab:T205549
  • Proposer: Jayantanth (talk) 15:58, 26 October 2019 (UTC)

Discussion

Voting

Migrate Wikisource specific edit tools from gadgets to Wikisource extension

Edit proposal/discussion

  • Problem: There are many useful edit-tool gadgets on some Wikisources. Many of them should be used everywhere, but...
    • Not every user knows that they can import a script from another wiki.
    • Some of these scripts cannot simply be imported; they must be translated or localised.
    • The majority of users will look for these tools on en.wikisource, but there are many scripts on it.wikisource and elsewhere too.
  • Who would benefit: Editors on other Wikisources
  • Proposed solution: Select the best tools across wikisources and integrate them as new functions.
  • More comments:
  • Phabricator tickets:
  • Proposer: JAn Dudík (talk) 13:24, 5 November 2019 (UTC)

Discussion

It would be good to point to these gadgets or describe the proposed process to choose and approve propositions of gadgets to integrate. --Wargo (talk) 21:35, 17 November 2019 (UTC)
1) Ask communities for the best tools on their wikisource
2) Make list of them, with comments, merge potentially duplicates
3) Ask communities again which ones should be integrated.
4) Make a global version and integrate it (e.g. as a beta feature)
There is one problem: single-wiki gadgets are often hidden from others due to the language barrier etc. JAn Dudík (talk) 21:31, 18 November 2019 (UTC)

Voting

Batch move API

Edit proposal/discussion

  • Problem: On Wikisource, the "atomic unit" is a work, consisting of a scanned book in the File: namespace, a set of transcribed pages in the Page: namespace, an index in the Index: namespace, and hopefully also one or more pages in mainspace that transcludes the pages for presentation. This is unlike something like a Wikipedia, where the atomic unit is the (single) page in mainspace, period.
    ProofreadPage ties these together using the pagename: an Index: page looks for its own pagename (i.e. without namespace prefix) in the File: namespace, and creates virtual pages at Page:filenameoftheuploadedfile.PDF/1 (and …/2 etc.). If any one of these are renamed, the whole thing breaks down.
    A work can easily have 1000+ pages: if it needs to be renamed, all 1000 pages have to be renamed. This is obviously not something you would ever undertake manually. But API:Move just supports moving a single page, leading to the need for complicated hacks like w:User:Plastikspork/massmove.js.
    The net result is that nothing ever gets renamed on Wikisource, and when it's done it's only done by those running a personal admin-bot (so of the already very few admins available, only the subset that run their own admin-bots can do this, and that's before taking availability into account).
  • Who would benefit: All projects, but primarily the Wikisources; it would be used (via scripts) by +sysop, but it would benefit all users who can easily have consistent page names for, say, a multi-volume work or whatever else necessitates renaming.
  • Proposed solution: It would vastly simplify this if API:Move supported batch moves of related pages: at worst by an indexed list of from→to titles; better with from→to pairs provided by a generator function; and ideally by intelligently moving by some form of pattern. For example, Index:vitalrecordsofbr021916brid.djvu would probably move to Index:Vital records of Bridgewater, Massachusetts - Vol. 2.djvu, and Page:-namespace pages from Page:vitalrecordsofbr021916brid.djvu/1 would probably move to Page:Vital records of Bridgewater, Massachusetts - Vol. 2.djvu/1
    It would also be of tremendous help if mw.api actually understood ProofreadPage and offered a convenience function that treated the whole work as a unit (Index:filename, Page:filename/pagenum, and, if local, File:filename) for purposes of renaming (moving) them.
  • More comments: For the purposes of this proposal, I consider cross-wiki moves out of scope, so, e.g., renaming a File: at Commons as part of the process of renaming the Index:/Page: pages on English Wikisource would be a separate problem (too complicated). Ditto fixing any local mainspace transclusions that refer to the old name (that's a manageable manual or semi-automated/user-tools job).
  • Phabricator tickets:
  • Proposer: Xover (talk) 12:41, 5 November 2019 (UTC)
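Until such core or API support exists, a batch move can only be emulated client-side with one action=move call per page, which is effectively what scripts like massmove.js do. A hedged sketch of that emulation (the pair-generation convention is illustrative; `move_work` assumes an already-authenticated opener with page-move rights):

```python
import urllib.parse
import urllib.request

API = "https://en.wikisource.org/w/api.php"  # example endpoint

def build_move_pairs(old_base, new_base, n_pages):
    """All (source, destination) titles needed to rename one work:
    the Index: page plus every Page: subpage."""
    pairs = [(f"Index:{old_base}", f"Index:{new_base}")]
    pairs += [(f"Page:{old_base}/{i}", f"Page:{new_base}/{i}")
              for i in range(1, n_pages + 1)]
    return pairs

def move_work(opener, csrf_token, old_base, new_base, n_pages):
    """One action=move API call per page: this per-page loop is exactly
    the pain point the proposal wants replaced by a batch operation."""
    for src, dst in build_move_pairs(old_base, new_base, n_pages):
        data = urllib.parse.urlencode({
            "action": "move", "from": src, "to": dst,
            "reason": "Renaming work", "noredirect": 1,
            "token": csrf_token, "format": "json",
        }).encode()
        opener.open(urllib.request.Request(API, data=data))
```

For a 1000-page work this is 1001 sequential API round trips, each of which can fail independently, which is why the proposal asks for a server-side batch.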

Discussion

@Xover: Why is the sysop bit needed here? I think the bot flag is enough unless the pages are fully protected. Ankry (talk) 20:45, 9 November 2019 (UTC)
@Ankry: Because page-move vandalism rises to a whole `nother level when you can do it in batches of 1k pages at a time. And for the volume we're talking about, having to go through a request and waiting for an admin to handle it is not a big deal: single page moves happen all the time, but batch moves of entire works would typically top out at a couple per week tops (ignore a decade's worth of backlog for now). Given these factors, requiring +sysop (or, if you want to be fancy, some other bit that can be assigned to a given user group like "mass movers" or whatever) seems like a reasonable tradeoff. You really don't want inexperienced users doing this willy nilly!
But so long as I get an API that lets me do this in a sane way (and w:User:Plastikspork/massmove.js is pretty insane), I'd be perfectly happy imposing limitations like that in the user script or gadget implementing it (unless full "Move work" functionality is implemented directly in core, of course). Different projects will certainly have different views on that issue. --Xover (talk) 21:28, 9 November 2019 (UTC)
  • «Problem: On Wikisource, the "atomic unit" is a work». In an ideal world yes, but not for MediaWiki until phabricator:T17071 is fixed. Nemo 09:07, 22 November 2019 (UTC)

Voting

Activate templatestyles by Index page css field

Edit proposal/discussion

  • Problem: The TemplateStyles extension is almost magic in the Wikisource environment, but there needs to be an easy way to activate it on all pages of an Index.
  • Who would benefit: all contributors
  • Proposed solution: optionally allow the Index page css field to be filled with a valid TemplateStyles page. A simple regex could be used to determine whether the css field contains valid CSS or a valid page name.
  • More comments: Presently it.wikisource and nap.wikisource are testing other tricks to load work-specific templatestyles into all pages of an Index, with very interesting results.
  • Phabricator tickets: phab:T226275, phab:T215165
  • Proposer: Alex brollo (talk) 07:24, 9 November 2019 (UTC)
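The regex check suggested in the proposal could be as simple as the following sketch (the classification rules and field names are hypothetical, just to illustrate the idea):

```python
import re

# Heuristic: a TemplateStyles page name ends in ".css" and contains no
# braces, while raw CSS contains at least one rule block.
PAGE_NAME_RE = re.compile(r'^[^{}]+\.css$', re.IGNORECASE)

def classify_css_field(value):
    """Decide whether the Index "Css" field holds a page name or raw CSS."""
    value = value.strip()
    if PAGE_NAME_RE.match(value):
        return "templatestyles-page"  # e.g. "Index:Foo.djvu/styles.css"
    if "{" in value and "}" in value:
        return "raw-css"              # e.g. ".poem { margin-left: 2em; }"
    return "unknown"
```

ProofreadPage could then either inject the CSS directly or transclude the named page as TemplateStyles on every Page: of the Index.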

Discussion

  • Reproducing original books is inherently layout and formatting heavy, presenting books to readers is inherently layout and formatting heavy. Inline formatting templates are carrying a lot of weight right now, with somewhat severe limitations and very little semantics. Getting a good system for playing with the power of CSS would help a lot. --Xover (talk) 11:08, 9 November 2019 (UTC)

Voting

Make content of Special:IndexPages up-to-date and available to wikicode

Edit proposal/discussion

  • Problem: 1. The content of Special:IndexPages (e.g. s:pl:Special:IndexPages) is not updated after the status of some pages in an index changes until the index page is purged. 2. The data from this page is not available to wikicode. Making it available would enable the creation of various statistics, sortable lists, or graphical tools showing the status of index pages. On plwikisource, we make this data available to wikicode via a bot which updates specific templates regularly; these extra edits could be avoided.
  • Who would benefit: All wikisources, mainly those with large number of indexes
  • Proposed solution: Make the per-index numbers of pages with various statuses from Special:IndexPages available via a mechanism like a magic word, a Lua function or something similar.
  • More comments:
  • Phabricator tickets:
  • Proposer: Ankry (talk) 19:12, 9 November 2019 (UTC)

Discussion

Voting

Transcluded book viewer with book pagination

Edit proposal/discussion

 
  • Problem: When we view a transcluded (NS0) book, it is presented like a normal wiki page. Most book readers and lovers don't like this kind of view and navigation; they would prefer a page-by-page, two-page view like a physical book, turning to the next subpage each time. Italian Wikisource created a JS tool for this: Vis, View In Sequence (a two-sided view of our pages).
  • Who would benefit: Wikisource editors and readers
  • Proposed solution: Make a Vis-like viewer the default: View In Sequence (a two-sided view of our pages).
  • More comments:
  • Phabricator tickets:
  • Proposer: Jayantanth (talk) 15:43, 11 November 2019 (UTC)

Discussion

Voting

Repair Book Uploader Bot

Edit proposal/discussion

  • Problem: Book Uploader Bot was a valuable tool for uploading books from Google Books to Commons for Wikisource. It has not been working for a long time, and uploading a book from Google Books manually takes a long time (you need to download the book as PDF, run OCR, convert it into DjVu, upload it to Commons and then fill in the information). For IA, we have IA Upload; it works but also has issues from time to time.
  • Who would benefit: Contributors of Wikisources
  • Proposed solution: Repair the tool or build a new one
  • More comments:
  • Phabricator tickets:
  • Proposer: Shev123 (talk) 14:58, 10 November 2019 (UTC)

Discussion

Voting

Inter-language link support via Wikidata

Edit proposal/discussion

  • Problem: Wikidata's inter-language link system does not work well for Wikisource, because it assumes that pages are structured the same way as Wikipedia pages are structured, and this is not the case.
  • Who would benefit: Editors and readers of all Wikisources, and editors and readers of Wikidata
  • Proposed solution:
    1. Support linking from Wikidata to Multilingual Wikisource
    2. Support automatic interlanguage links between multiple editions that are linked to different items on Wikidata, where these items are linked by "has edition" and "edition or translation of"
  • More comments: This was also proposed last year
  • Phabricator tickets: phab:T138332, phab:T128173, phab:T180304, phab:T54971
  • Proposer: —Beleg Tâl (talk) 15:47, 23 October 2019 (UTC)

Discussion

This issue causes a lot of confusion for new editors on Wikisource and Wikidata, who frequently set up the interwiki links incorrectly in order to bypass this limitation. —Beleg Tâl (talk) 16:12, 23 October 2019 (UTC)

@Beleg Tâl: great proposal ! For information @Tpt: is working on something quite similar (Tpt: can you confirm?), we should keep this proposal as this is important and any help is welcome but still we should keep that in mind ;) Cdlt, VIGNERON * discut. 14:47, 27 October 2019 (UTC)
HI! Yes, indeed, I am working on it as part of mw:Extension:Wikisource. It's currently in the process of being deployed on the Wikimedia test cluster before a deployment on Wikisource. It should be done soon, so, hopefully no need from the Foundation on this (except helping the deployment). Tpt (talk) 13:59, 30 October 2019 (UTC)
@Tpt: Fantastic, thank you!! —Beleg Tâl (talk) 17:22, 2 November 2019 (UTC)
  • FYI I repeated T54971, which I asked for several decades to try to support it. --Liuxinyu970226 (talk) 13:17, 3 November 2019 (UTC)
  • I would just notify that in svwikisource and plwikisource there are javascript-based implementations of multi-version interwiki and they seem to work fine if appropriate structures are available in Wikidata. Ankry (talk) 20:09, 9 November 2019 (UTC)

Voting

Index creation wizard

Edit proposal/discussion

  • Problem: The process of turning a PDF or DjVu file into an index for transcription and proofreading is quite complicated and confusing. See Help:Index pages and Help:Page numbers for the basics.
  • Who would benefit: Anyone wanting to start a Wikisource transcription
  • Proposed solution: Create a wizard that walks an editor though the process of creating an index from a PDF or DjVu file (that has already been uploaded). Most importantly, it will facilitate creating the pagelist, by allowing the editor to go through the pages and identify the cover, title page, table of contents, etc, as well as where the page numbering begins.
  • More comments: This is similar to a proposal from the 2016 Wishlist, but more limited in scope, i.e. this proposal only deals with the index creation process, not uploading or importing files.
  • Phabricator tickets: task T154413 (related)
  • Proposer: Kaldari (talk) 15:32, 30 October 2019 (UTC)

Update June 2020: a project page has been set up for this at Wikisource Pagelist Widget.

Discussion

  • A wizard for initial setup is a good start, but an interactive visual editor for Index: pages, and especially for <pagelist … /> tags, would be even better. The pagelist is often edited multiple times and by multiple people, and currently requires a lot of jumping between the scan and the browser, mental arithmetic and mapping between physical and logical page numbers, multiple numbering schemes and ranges in a single work, etc. etc. A visual editor oriented around thumbnails of each page in the book and allowing you to tag pages: “This thumbnail, physically in position 7 in the file, is logically the ‘Title’ page”; “In these 24 pages (physical 13–37) the numbering scheme is roman numerals, and numbering starts on the first page I've selected”; “On this page (physical 38) the logical numbering resets to 1, and we're now back to default arabic numerals”; “This page (physical 324) is not included in the logical numbering sequence, so it should be skipped and logical numbering should resume on the subsequent page, and this page should get the label ‘Plate’”. All this stuff is much easier to do in a visual / direct-manipulation way than by writing rules describing it in a custom mini-syntax. --Xover (talk) 11:40, 9 November 2019 (UTC)
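The numbering rules described in this comment can be modelled as plain data, which is what a visual pagelist editor would ultimately produce; a toy Python sketch (the data model is entirely hypothetical):

```python
def to_roman(n):
    """Lowercase roman numerals, as commonly used for front matter."""
    vals = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
            (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
            (5, "v"), (4, "iv"), (1, "i")]
    out = ""
    for v, s in vals:
        while n >= v:
            out += s
            n -= v
    return out

def build_labels(n_pages, runs, overrides):
    """runs: sorted list of (first_physical_page, 'roman'|'arabic',
    first_logical_number); overrides: {physical_page: fixed_label}.
    Overridden pages (e.g. a 'Plate') do not consume a logical number,
    so numbering resumes on the following page."""
    labels = {}
    runs = sorted(runs)
    for idx, (start, mode, first) in enumerate(runs):
        end = runs[idx + 1][0] - 1 if idx + 1 < len(runs) else n_pages
        logical = first
        for phys in range(start, end + 1):
            if phys in overrides:
                labels[phys] = overrides[phys]
                continue
            labels[phys] = to_roman(logical) if mode == "roman" else str(logical)
            logical += 1
    return labels
```

A direct-manipulation editor would let the user create `runs` and `overrides` by clicking thumbnails, then serialize the result into the existing `<pagelist … />` syntax.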

Voting

Vertical display for classical Chinese content

Edit proposal/discussion

  • Problem: Most content on Chinese Wikisource is classical Chinese, which has been printed or written vertically for thousands of years.
  • Who would benefit: Chinese and Japanese Wikisource. Other Wikimedia projects of languages in vertical display (like Manchu).
  • Proposed solution: Add vertical display support to the MediaWiki software. To the proposer's knowledge, MediaWiki already supports right-to-left display for Arabic and Hebrew.

    A switch button on each page and a "force" setting in Special:Preferences should be added to allow readers to switch the display mode between traditional vertical text 傳統直寫 and modern horizontal text 新式橫寫. A magic word would be added that allows pages to set their own default display mode.

    Hypothetical vertical Chinese Wikisource as follows. (In this picture, some characters are rotated, but they should not be.)

     

  • More comments:
  • Phabricator tickets:
  • Proposer: 維基小霸王 (talk) 13:59, 1 November 2019 (UTC)

Discussion

Voting

Improve workflow for uploading books to Wikisource

Edit proposal/discussion

  • Problem:
Uploading books to Wikisource is difficult.
In the current workflow you need to upload the file to Commons, then go to Wikisource and create the Index page (and you need to know the exact URL). The files need to be DjVu, which has separate layers for the scan and the text. This is important for tools like Match & Split (if the file is a PDF, this tool doesn't work).
More importantly, the current workflow (especially for library uploads) involves Internet Archive and the famous IA-Upload tool. This tool is now fundamental for many libraries and uploaders, but it has several issues.
Since Internet Archive stopped creating DjVu files from its scans, the international community has struggled to solve the problem of automatically creating a DjVu for upload to Commons and then Wikisource.
This has created a situation where libraries love Internet Archive and want to use it, but then get stuck because they don't know how to create a DjVu for Wikisource, and IA-Upload is buggy and fails often.
Summary
    • The IA-Upload tool is buggy and often fails when creating DjVu files.
    • Match & Split doesn't work with PDF files.
    • Users do not expect to upload to Commons when transferring files from Internet Archive to Wikisource.
    • Upload to Internet Archive is an important feature, especially for GLAMs (i.e. libraries).
  • Who would benefit:
    • all Wikisource communities, especially new users
    • new GLAMs (libraries and archives) who at the moment have a hard time coping with the wiki ecosystem.
  • Proposed solution:
Improve the IA-Upload tool: https://tools.wmflabs.org/ia-upload/commons/init
The tool should be able to create good-quality DjVu files from Archive files and not fail as often as it does now.
It should also hide the Commons upload phase from the end user. The user should be able to upload a file to Internet Archive and then use the file's ID to directly create the Index page on Wikisource. We could have an "Advanced mode" that shows all the steps for experienced users, and a "Standard" one that keeps things simpler.
  • More comments:
  • Phabricator tickets: related: phab:T154413
  • Proposer: originally proposed by Aubrey (talk) in 2017 - re-proposed by Candalua (talk) 16:15, 6 November 2019 (UTC)
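One of the steps a repaired tool would automate is the PDF-to-DjVu conversion; a rough sketch using the real pdf2djvu command-line tool (the filename convention in `djvu_name` is only an example, not an established Commons rule):

```python
import subprocess
from pathlib import Path

def djvu_name(ia_identifier, title):
    """Derive a Commons-style DjVu filename from an IA identifier and a
    human-readable title (this naming scheme is purely illustrative)."""
    safe = title.strip().replace("_", " ")
    return f"{safe} ({ia_identifier}).djvu"

def convert_pdf(pdf_path):
    """Convert a downloaded IA PDF to DjVu with the pdf2djvu CLI
    (assumes pdf2djvu is installed); returns the output path."""
    out = Path(pdf_path).with_suffix(".djvu")
    subprocess.run(["pdf2djvu", "-o", str(out), str(pdf_path)], check=True)
    return out
```

The hard part, per the proposal, is not the conversion itself but wrapping download, OCR, conversion, Commons upload and Index creation into one reliable pipeline the user never has to see.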

Discussion

Voting

Ajax editing of nsPage text

Edit proposal/discussion

  • Problem: When editing simple pages, much user time is lost in the cycle: save, load in view mode, go to the next page (which opens in view mode), load it into edit mode.
  • Who would benefit: experienced users
  • Proposed solution: it.wikisource implemented an Ajax environment that allows saving the edited text and loading the next page in edit mode (and much more) very quickly via Ajax calls: it:s:MediaWiki:Gadget-eis.js (eis means Edit In Sequence). It's far from polished, but it works and has been tested on other Wikisource projects too. IMHO the idea should be refined and developed.
  • More comments:
  • Phabricator tickets:
  • Proposer: Alex brollo (talk) 07:16, 25 October 2019 (UTC)
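The core of an edit-in-sequence gadget is computing the next Page: title and reopening it directly in edit mode; a small Python sketch of that title arithmetic (saving itself would go through the standard action=edit API):

```python
import re

# ProofreadPage page titles end in "/<physical page number>".
PAGE_RE = re.compile(r'^(Page:.+/)(\d+)$')

def next_page_title(title):
    """Given a title like 'Page:Foo.djvu/12', return the following page's
    title ('Page:Foo.djvu/13'), or None if the title is not a Page: subpage."""
    m = PAGE_RE.match(title)
    if not m:
        return None
    return f"{m.group(1)}{int(m.group(2)) + 1}"
```

A gadget would save via the API, call something like this helper, and load the next page's wikitext into the same textarea without a full page reload.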

Discussion

  • I enthusiastically support - I have often wished that I could move directly from page to page while staying in Edit mode - it would be particularly useful for error checking: making sure, for instance, that every page in a range which could have been proofread by different people over a number of months or even years all conform to the latest format/structure etc. CharlesSpencer (talk) 11:03, 25 October 2019 (UTC)
  • I think this is a very good project specific improvement that can be made within the remit of community wishlist. Seems feasible as well. —TheDJ (talkcontribs) 12:55, 4 November 2019 (UTC)
  • This would be a great first step towards something like a full-featured dedicated "transcription mode", that would likely involve popping into full screen (hiding page chrome, navbar, etc.; use all available space inside the browser window, but don't let the page scroll because it conflicts with the independently scrolling text field and scanned page display, in practice causing your whole editing UI to "jump around" unpredictably), some more flexibility and intelligence in coarse layout (i.e. when previewing a page, the text field and scanned page are side by side, but the rendered text you are trying to compare to the scanned page is about a screenworths of vertical scrolling away), prefetching of the next scanned page (cf. the gadget mentioned at the last Wikimania), and possibly other refinements (line by line highlighting on the scanned page? We often have pixel coordinates for that fro the OCR process). Alex brollo's proposal is one great first change under a broader umbrella that is adapting the tools to the typical workflow on Wikisource, versus the typical workflow on Wikipedia-like projects: the difference makes tools that are perfectly adequate for Wikipedia-likes really clunky and awkward for the Wikisources. Usable, but with needlessly high impedance. --Xover (talk) 12:53, 5 November 2019 (UTC)
    @Samwilson: Could s:User:Samwilson/FullScreenEditing.js be a piece of this larger puzzle? I haven't played with it, but it looks like a good place to start. If this kind of thing (a separate focussed editing mode) were implemented somewhere core-adjacent, it might also provide an opportunity to clean up the markup used ala. that attempt last year(ish) that failed due to reasons (I'm too fuzzy on the details. Resize behaviour for the text fields got messed up, I think.). Could something like that also have hooks for user scripts? There's lots of little things that are suitable for user scripting to optimize the proofreading process. Memoized per-work snippets of text or regex substitutions; refilling header/footer from the values in the associated Index:; magic comment / variables (think Emacs variables or linter options) for stuff like curly/straight quote marks. In a dedicated editing mode, where the markup is clean (unlike the chaos of a full skin and multiple editors), both the page and the code could have API-like hooks that would make that kind of thing easier. --Xover (talk) 11:20, 9 November 2019 (UTC)
  • Thanks for the appreciation :-). Really, the it.wikisource eis tool - even if rough in code - is appreciated by many users. I'd also like to mention its "ajax-preview" option, which shows the result of the current editing/formatting very quickly (<1 sec) and also allows simple edits of brief chunks of text nodes (immediately editing the underlying textarea). Some text mistakes are much more evident in "view" mode than in "edit" mode, but at present the Visual Editor is too slow for the typical fast editing done on Wikisource. --Alex brollo (talk) 09:43, 7 November 2019 (UTC)

Voting

New OCR tool

Edit proposal/discussion

  • Problem: 1) Wikisource has to rely on external OCR tools. The most widely used one has been out of service for many months, and all that time we have been waiting to see whether its creator will return and repair it. The other external OCR tools do not work well (they either respond extremely slowly or generate bad-quality text). None of these tools can handle text divided into columns on magazine pages, and they often have problems with non-English characters and diacritics; the OCR output needs to be improved.
    2) The hOCR tool does not work for Wikisources based on non-Latin scripts. Phe's hOCR tool creates a Tesseract OCR text layer for Wikisources based on Latin script. For Indic Wikisources, for example, there is a temporary Google OCR tool for this, but integrating non-Latin scripts into our own tool would be more useful.
  • Who would benefit: Wikisource contributors handling scanned texts which do not have an original OCR layer or whose original OCR layer is poor, and contributors to wikisources based on non-Latin scripts.
  • Proposed solution: Create an integral OCR tool that the Wikimedia programmers would be able to maintain without relying on help of one specific person. The tool should:
    • be quick
    • generate good quality OCR text
    • be able to handle text written in columns
    • be able to handle non-English characters of Latin script including diacritics
    • be able to handle non-Latin languages

Tesseract, an open-source application, also has a specific procedure for training OCR, which requires the corrected text of a page and an image of the page itself. On the Wikisource side, pages that have been marked as proofread identify books that have been fully transcribed and reviewed. So what needs to be done is to strip the formatting from the text of these finished transcriptions, expand template transclusions and move references to the bottom, then take the text along with an image of the page in question and run it through Tesseract's training procedure. The improved model would then be updated on Toollabs. The better the OCR, the easier the process becomes with each book, allowing Wikisource editors to become more productive, completing more pages than they could previously. This would also motivate users on Wikisource.
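The "strip formatting" step described above could start as a simple wikitext-to-plain-text reduction; a very rough Python sketch (real transcriptions would also need template expansion first, e.g. via the parse API, so this only handles simple cases):

```python
import re

def strip_for_training(wikitext):
    """Reduce proofread wikitext to plain text for use as Tesseract
    ground truth. Deliberately naive: nested templates, tables and most
    formatting templates are out of scope for this sketch."""
    text = re.sub(r'<ref[^>]*>.*?</ref>', '', wikitext, flags=re.DOTALL)  # drop footnotes
    text = re.sub(r'\{\{[^{}]*\}\}', '', text)                  # drop simple templates
    text = re.sub(r"'{2,}", '', text)                           # bold/italic markup
    text = re.sub(r'\[\[(?:[^|\]]*\|)?([^\]]*)\]\]', r'\1', text)  # links -> label
    return re.sub(r'[ \t]+', ' ', text).strip()
```

Pairing this output with the page image for every proofread page would yield a steadily growing, language-specific training corpus.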

Some concerns have been raised that the WMF nearly always uses open-source software, which excludes e.g. ABBYY FineReader and Adobe, and that the problem with free OCR engines is their lack of language support, so they are never really going to replace Phe's tools fully. I do not know whether free OCR engines suffice for this task or not, but I hope the new tool will be as good as or even better than Phe's tools, and ideological reasons that would be an obstacle to quality should be put aside.

Discussion

I think this is the #1 biggest platform-related problem we are facing on English Wikisource at this time. —Beleg Tâl (talk) 15:09, 27 October 2019 (UTC)

Yeah. For some reason neither Google Cloud nor phetools supports all of the languages of Tesseract. Compared with the languages of the Wikisources, Tesseract is missing Anglo-Saxon, Faroese, Armenian, Limburgish, Neapolitan, Piedmontese, Sakha, Venetian and Min Nan.--Snaevar (talk) 15:12, 27 October 2019 (UTC)

Note that you really don't want a tool that scans all pages for all languages as that is so compute-intensive that you'd wait minutes for every page you tried to OCR. Tesseract supports a boatload of languages and scripts, and can be trained for more, but you still need a sensible way to pick which ones are relevant on any given page. --Xover (talk) 07:27, 31 October 2019 (UTC)
I know. Both the Google Cloud and phetools gadgets pull the language from the language code of the Wikisource where the button is pressed, and thus only use one language. The same thing applies here. These languages are mentioned, however, so it is clear which Wikisources this proposal could support and which ones it could not. P.S. I am not American, so I will never try to word things to cover all bases.--Snaevar (talk) 23:01, 2 November 2019 (UTC)

Even aside from the OCR aspect, being able to extract the formatting out of a PDF into wikitext would be highly valuable for converting PDFs (and other formats via PDF) into wiki markup. T.Shafee(Evo﹠Evo)talk 11:19, 29 October 2019 (UTC)

I am not sure about formatting. Some scans or even originals are quite poor, and in such cases the result of trying to identify italics or bold letters may be much worse than if the tool extracted just plain text. I would support adding such a feature only if it could be turned on and off. --Jan.Kamenicek (talk) 22:05, 30 October 2019 (UTC)

Many pages require only simple automatic OCR. But there are pages with other fonts (italics, fraktur) or pages with mixed languages (e.g. a missal in both the local language and Latin), where it would be useful to have some recognition options. This can be done more easily on a local PC, but not everybody has that option. JAn Dudík (talk) 11:21, 31 October 2019 (UTC)

Would also be great to have the OCR output default to matching the MOS, rather than having to change it all to conform to the MOS manually. --YodinT 14:19, 25 November 2019 (UTC)

Voting

  •   Support Bodhisattwa (talk) 06:45, 21 November 2019 (UTC)
  •   Support JAn Dudík (talk) 07:15, 21 November 2019 (UTC)
  •   Support Le ciel est par dessus le toit (talk) 13:00, 21 November 2019 (UTC)
  •   Support Lyokoï (talk) 17:32, 21 November 2019 (UTC)
  •   Support Tpt (talk) 19:36, 21 November 2019 (UTC)
  •   Support: impossible to contribute since Phe’s tool is down. —Pols12 (talk) 21:03, 21 November 2019 (UTC)
  •   Support Pamputt (talk) 21:38, 21 November 2019 (UTC)
  •   Support Sadads (talk) 21:41, 21 November 2019 (UTC)
  •   Support Balajijagadesh (talk) 05:24, 22 November 2019 (UTC)
  •   Support Libcub (talk) 08:13, 22 November 2019 (UTC)
  •   Support Jahl de Vautban (talk) 09:22, 22 November 2019 (UTC)
  •   Support Lionel Scheepmans Contact French native speaker, sorry for my dysorthography 10:47, 22 November 2019 (UTC)
  •   Support Alan Talk 12:46, 22 November 2019 (UTC)
  •   Support JLTB34 (talk) 13:29, 22 November 2019 (UTC)
  •   Support GPSLeo (talk) 21:10, 22 November 2019 (UTC)
  •   Support DraconicDark (talk) 02:29, 23 November 2019 (UTC)
  •   Support FreeCorp (talk) 05:25, 23 November 2019 (UTC)
  •   Support Pavithra.A (talk) 12:14, 23 November 2019 (UTC)
  •   Support Emptyfear (talk) 17:12, 23 November 2019 (UTC)
  •   Support @ջեօ 17:15, 23 November 2019 (UTC)
  •   Support --Armenmir (talk) 17:27, 23 November 2019 (UTC)
  •   Support আফতাবুজ্জামান (talk) 23:18, 23 November 2019 (UTC)
  •   Support Liuxinyu970226 (talk) 10:26, 24 November 2019 (UTC)
  •   Support VIGNERON * discut. 10:40, 24 November 2019 (UTC)
  •   Support Pymouss Tchatcher - 11:38, 24 November 2019 (UTC)
  •   Support Eatcha (talk) 12:22, 25 November 2019 (UTC)
  •   Support --Bander7799 (talk) 12:34, 25 November 2019 (UTC)
  •   Support JogiAsad (talk) 13:27, 25 November 2019 (UTC)
  •   Support Murma174 (talk) 13:27, 25 November 2019 (UTC)
  •   Support Also in rtl language wikisource, do not insert ltr tags before punctuation marks. This causes problems. Naḥum (talk) 13:37, 25 November 2019 (UTC)
  •   Support --YodinT 14:19, 25 November 2019 (UTC)
  •   Support Blue Rasberry (talk) 15:32, 25 November 2019 (UTC)
  •   SupportMJLTalk 15:35, 25 November 2019 (UTC)
  •   Support Husky (talk) 16:12, 25 November 2019 (UTC)
  •   Support A garbage person (talk) 16:19, 25 November 2019 (UTC)
  •   Support 16:43, 25 November 2019 (UTC)
  •   Support Sgvijayakumar (talk) 19:09, 25 November 2019 (UTC)
  •   Support Ninovolador (talk) 21:27, 25 November 2019 (UTC)
  •   Support Vkalaivani (talk) 22:46, 25 November 2019 (UTC)
  •   Support Risker (talk) 05:03, 26 November 2019 (UTC)
  •   Support Geonuch (talk) 05:32, 26 November 2019 (UTC)
  •   Support Hsarrazin (talk) 14:31, 26 November 2019 (UTC)
  •   Support β16 - (talk) 15:08, 26 November 2019 (UTC)
  •   Support Thibaut120094 (talk) 16:51, 26 November 2019 (UTC)
  •   Support Noting that Community Tech forking and fixing Phe's tools will help precisely nothing in the long run. We need a WMF-supported tool that's within some WMF team's responsibilities to maintain and properly integrated into Mediawiki release cycles. Make use of volunteers where available, certainly, but someone at the WMF needs to own the OCR tool or it might as well stay a community gadget. Do please feel free to use this Wish to spend the necessary time kicking Phe's OCR tools until they start working again though. It's bound to be something stupid that's making it fail: like, has anybody tried to simply restart the tool? It could be hanging on a stale NFS file handle for all we know! Xover (talk) 06:10, 27 November 2019 (UTC)
    That is exactly what I hope is going to be solved. In this proposal I stated the problem: "Wikisource has to rely on external OCR tools" and proposed the solution: "Create an integral OCR tool that the Wikimedia programmers would be able to maintain without relying on help of one specific person." --Jan Kameníček (talk) 10:14, 1 December 2019 (UTC)
  •   Support Acélan (talk) 13:19, 27 November 2019 (UTC)
  •   Support Harkawal Benipal (talk) 16:08, 27 November 2019 (UTC)
  •   Support Indic Wikisource community members at Wiki Advanced Training 2019 asked for a Bulk OCR tool not dependent on platform (Linux, Windows etc.). I hope this tool allows Bulk OCRing pages. Satdeep Gill (talk) 16:43, 27 November 2019 (UTC)
  •   Support WhatamIdoing (talk) 16:55, 27 November 2019 (UTC)
  •   Support Pyb (talk) 18:05, 27 November 2019 (UTC)
  •   Support This would be my #1 for Wikisource. Of course it should be open source. Wellparp (talk) 19:03, 28 November 2019 (UTC)
  •   Support Peter Alberti (talk) 19:54, 28 November 2019 (UTC)
  •   Support 94rain Talk 12:53, 30 November 2019 (UTC)
  •   Support Satpal Dandiwal (talk) 21:07, 30 November 2019 (UTC)
  •   Support while also agreeing with Xover's thoughts. Mahir256 (talk) 07:37, 1 December 2019 (UTC)
  •   Support Candalua (talk) 16:35, 1 December 2019 (UTC)
  •   Support Rahmanuddin (talk) 06:49, 2 December 2019 (UTC)
  •   Support सुबोध कुलकर्णी (talk) 12:25, 2 December 2019 (UTC)
  •   Support Ruthven (msg) 12:41, 2 December 2019 (UTC)
  •   Support Sannita - not just another it.wiki sysop 13:19, 2 December 2019 (UTC)
  •   Support Jberkel (talk) 13:22, 2 December 2019 (UTC)
  •   Support Saederup92 (talk) 13:24, 2 December 2019 (UTC)
  •   Support Omshivaprakash (talk) 14:14, 2 December 2019 (UTC)
  •   Support Novak Watchmen (talk) 17:54, 2 December 2019 (UTC)
  •   Support --Yoosef Pooranvary (talk) 11:38, 19 November 2020 (UTC)

Repair search and replace in Page editing

Edit proposal/discussion

  • Problem: Currently, "Search and replace", as provided by the code editor (top-left option in the advanced editing tab), just doesn't work in the "Page" namespace.

This is the basic tool to... search and replace text when editing, mass correct OCR mistakes, etc. It is simply not working.

  • Who would benefit: All editing users
  • Proposed solution: Reimplement the function, or fix the bug in the Mediawiki software.
  • More comments: There are some workarounds, as implemented in it.source, but they are new gadgets that mimic this basic functionality of MediaWiki.
  • Phabricator tickets: phabricator:T183950, phab:T198688 and phab:T212347
  • Proposer: Ruthven (msg) 11:44, 29 October 2019 (UTC)

Discussion

  • Extending the proposal: This would benefit all wiki projects.
    • I would suggest something more general: when I use search and replace, I can no longer undo in case my replacement (or, more importantly, something before it) was wrong. This is a general problem with the text editor. Every time I use any of the existing buttons (like bold, or math, or whatever), I cannot undo. So if I have been editing for some time, then make a mistake, and then use one of these buttons (or search and replace), I must redo the whole work from the beginning, because I cannot go back to the mistake I made before using one of these buttons. This is not the case with the visual editor, so I think it should be possible to change this in the text editor rather easily.
    • There are only two options in search and replace: you can either replace occurrences one after the other, or in the whole text. I would be really grateful if I could use search and replace only in a marked (selected) text, and not the whole one. Yomomo (talk) 22:24, 8 November 2019 (UTC)
    • About search and replace: if I want to replace something spanning more than one line, the new-line mark will not be included. I don't know how difficult this is to change, but it would be a benefit to be able to replace parts even when they (and the new text) span more than one line. Yomomo (talk) 14:52, 1 November 2019 (UTC)
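The "replace only in a marked text" idea above can be sketched in plain JavaScript. This is an illustrative sketch, not an existing MediaWiki API; the function name and offsets are assumptions:

```javascript
// Sketch: apply a regex replacement only within the selected
// range [start, end) of the text, leaving the rest untouched.
function replaceInSelection(text, start, end, pattern, replacement) {
  const selected = text.slice(start, end);
  return text.slice(0, start) + selected.replace(pattern, replacement) + text.slice(end);
}

// Example: fix the typical OCR error "ii" -> "n", but only in the
// first 12 characters that the user has marked.
const page = 'iiight falls; the iiight watchman sleeps';
const fixed = replaceInSelection(page, 0, 12, /ii/g, 'n');
// fixed === 'night falls; the iiight watchman sleeps'
```

A real gadget would read the selection boundaries from the edit textarea (its standard selectionStart/selectionEnd properties) instead of hard-coded offsets.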

Voting

Offer PDF export of original pagination of entire books

Edit proposal/discussion

Français: Pouvoir exporter en pdf en respectant la pagination de l'édition source.
  • Problem: Presently, PDF conversion of proofread Wikisource books doesn't mirror the pagination and page design of the original edition, since it comes from ns0 transclusion.
    Français: La conversion en PDF des livres Wikisource ne reflète pas la pagination et le design original des pages de l’édition originale, car la conversion provient de la transclusion et non des pages.
  • Who would benefit: Offline readers.
    Français: Lecteurs hors ligne.
  • Proposed solution: To build an alternative PDF coming from conversion, page for page, of nsPage namespace.
    Français: Élaborer un outil pour générer un PDF alternatif provenant d’une conversion page par page.
  • More comments: Some Wikisource contributors think that nsIndex and nsPage are simply "transcription tools"; I think that they are much more: they are the true digitization of an edition, while the ns0 transclusion is something like a new edition.
    Français: Certains contributeurs de wikisource pense que nsIndex et nsPage sont simplement des « outils de transcription » ; je pense qu’ils sont beaucoup plus que cela – ce sont la vraie numérisation d’une édition, tandis que la transclusion ns0 constitue une nouvelle édition.
  • Phabricator tickets: T179790
  • Proposer: last year's proposal by Alex brollo received 57 votes. Jayantanth (talk) 16:03, 26 October 2019 (UTC)

Discussion

  • I think I would have actually Opposed this: I don't want to reproduce original pagination, we have the original PDF for that. For this proposal to make sense, to me, it would need to be about having some way to control PDF generation in the same way transclusion to mainspace controls wikitext rendering. I wouldn't necessarily want to reproduce each original page in a PDF page there (often, yes, but not always), and I might want to tweak some formatting specifically for a paged medium (PDF) that doesn't apply in a web page, or vice versa. In other words, I'm going to abstain from voting on this proposal but I might support something like it in the future if it was better fleshed out. --Xover (talk) 06:03, 27 November 2019 (UTC)

Voting

  •   Support Libcub (talk) 08:28, 22 November 2019 (UTC)
  •   Support Geert Van Pamel (WMBE) (talk) 09:27, 22 November 2019 (UTC)
  •   Support --Jan.Kamenicek (talk) 10:22, 23 November 2019 (UTC)
  •   Support JogiAsad (talk) 13:24, 25 November 2019 (UTC)
  •   Support Melroross (talk) 15:20, 25 November 2019 (UTC)
  •   Support Liuxinyu970226 (talk) 15:28, 25 November 2019 (UTC)
  •   Support 16:55, 25 November 2019 (UTC)
  •   Support Une très bonne remarque qui débouche sur une idée pour un outil utile. Conserver la pagination d'origine peut être vraiment important pour certains usages des PDF. / English : A very good remark leading to an idea for a useful tool. Keeping the original pagination can be very important for some uses of a PDF. Eunostos (talk) 20:40, 25 November 2019 (UTC)
  •   Support Geonuch (talk) 05:34, 26 November 2019 (UTC)
  •   Support Bonvol (talk) 20:24, 26 November 2019 (UTC)
  •   Support This is so critical to unleashing the true power of Wikisource. We are generating a really valuable library behind the scenes, but the public is unable to "check out" our materials in the way many people are accustomed to doing this (e.g., downloading to their e-reader device). People do not generally do long-form reading through a web browser. Pete Forsyth (talk) 23:54, 26 November 2019 (UTC)
  •   Support Mauricio V. Genta (talk) 02:47, 27 November 2019 (UTC)
  •   Support Maitake (talk) 09:55, 27 November 2019 (UTC)
  •   Support PMG (talk) 13:21, 27 November 2019 (UTC)
  •   Support Paperoastro (talk) 11:32, 1 December 2019 (UTC)
  •   Support ਲਵਪ੍ਰੀਤ ਸਿੰਘ ਸਿੱਧੂ ਗੱਲਬਾਤ  11:26, 2 December 2019 (UTC)
  •   Support Novak Watchmen (talk) 17:54, 2 December 2019 (UTC)

Reorganize the Wikisource toolbar

Edit proposal/discussion

français: reorganization wikisource toolbar
  • Problem: Some shortcuts are superfluous, others are missing
    français: Certains raccourcis sont superflus, d'autres absents
  • Who would benefit: new editors, for whom editing would be easier
    français: faciliter l'édition pour les nouveaux rédacteurs
  • Proposed solution: Rearrange the toolbar a bit
    français: Réorganiser un peu la barre outils
  • More comments: In the toolbar, we have {{}}, {{|}}, {{|}}. I think we should keep {{}} and replace the other two, which to me are useless: it is just as fast to type the | on the keyboard at the desired place. Instead we could put {{paragraph|}}, {{space|}}, {{separation}}. <ref>txt</ref> duplicates the icon next to it (insert file) at the top left; it could be replaced by <ref follow=pxxx>. Next to <br/> we could add <brn|> and {{br0}}. The search-replace tool could appear next to the "switch editor" pencil at the top right.
    français: Dans la barre, nous avons {{}},{{|}}, {{|}}. Je pense garder {{}} et remplacer les 2 autres pour moi inutiles ça va aussi vite de taper au clavier le | à l'endroit voulu. A la place on pourrait y mettre {{alinéa|}}, {{espace|}}, {{separation}} <ref>txt</ref> fait double emploi avec l'icone à côté (inserer fichier) en haut à gauche. On pourrait le remplacer par <ref follow=pxxx>. A côté de <br/>, on pourrait rajouter <brn|> et {{br0}}. Le <rechercher-remplacer> pourrait figurer à côté du crayon changer d'éditeur.
  • Phabricator tickets:
  • Proposer: Carolo6040 (talk) 10:58, 25 October 2019 (UTC)

Discussion

I think you mean the character insertion bar below the editor? That can already be modified by the community itself; it does not require effort from the development team. —TheDJ (talkcontribs) 12:48, 4 November 2019 (UTC)
We need a redesign of the default menus for Wikisource. This is beyond the capabilities of the new editor. The visual editor will not be used until this is done, either by the community or by developers. Slowking4 (talk) 15:11, 4 November 2019 (UTC)
Still, these changes can be performed by local interface admins (such edits are not to be done by new editors). See, for example, MediaWiki:Edittools. Ruthven (msg) 18:55, 4 November 2019 (UTC)
  • "Oppose" expending Wishlist resources on this. This can be fixed by the local project, and is something that should be handled by the local project. For example, the set of relevant templates and things like style guides for (curly) quote marks etc. are going to vary from project to project. Perhaps if there was something not possible with the current toolbar it might make sense to add that support, but then just to enable local customization. --Xover (talk) 12:16, 27 November 2019 (UTC)

Voting

Improve external links

Edit proposal/discussion

  • Problem: In the Italian version it is a problem to have too many links to Wikidata on a single page. But such links are necessary to improve the use of Wikisource books outside their own platform, on tablets and PCs: the presence of links to Wikidata makes the books a hypertext that is much more useful.
  • Who would benefit: All readers
  • Proposed solution: I am not a technician, so I have only needs and not solutions ;-)
  • More comments:
  • Phabricator tickets:
  • Proposer: Susanna Giaccai (talk) 11:22, 4 November 2019 (UTC)

Discussion

@Giaccai: Can you give specific examples of pages where this is currently a problem ? —TheDJ (talkcontribs) 12:57, 4 November 2019 (UTC)
This is the only page with a Lua error: it:s:Ricordi di Parigi/Uno sguardo all’Esposizione. IMHO it's a Lua "not enough memory" issue, coming from exhausted Lua space: you can see "Lua memory usage: 50 MB/50 MB" in the NewPP limit report. --Alex brollo (talk) 14:47, 4 November 2019 (UTC)
Weren't the links to Wikidata to be used only in case of author's names? --Ruthven (msg) 18:45, 4 November 2019 (UTC)
No; presently there are tests to link other kinds of entities (i.e. locations) to Wikidata. Wikidata is used to find a link to a Wikisource page, or to a Wikipedia page, or to the Wikidata page when both are lacking (dealing with locations, usually the resulting link points to Wikipedia). --Alex brollo (talk) 07:17, 5 November 2019 (UTC)
I just investigated the error. The "not enough memory" issue is caused by s:it:Modulo:Wl and s:it:Module:Common. @Alex brollo: What is going on is that the full item serialization is loaded into Lua memory twice per link: once by local item = mw.wikibase.getEntity(qid) in s:it:Modulo:Wl and once by local item = mw.wikibase.getEntityObject() in s:it:Module:Common. You could probably avoid both of these calls by relying on the Wikibase Lua functions that are already used in s:it:Modulo:Wl and thus greatly limit the memory consumption of the module. Tpt (talk) 16:08, 12 November 2019 (UTC)
@Tpt: Thanks! I'll review the code following your suggestions. --Alex brollo (talk) 14:26, 14 November 2019 (UTC)

Voting

XTools Edit Counter for Wikisource

Edit proposal/discussion

Français: Compteur de modifications très amélioré.
  • Problem: There are no Wikisource-specific statistics about per-user proofreading/validation; it is impossible to get stats about proofreading work. The Wikisource workflow is different from Wikipedia's, so it cannot be covered by XTools. We need a specific stats tool for Wikisource.
    Français: Il n’existe pas de statistiques spécifiques sur les correction/Validations par utilisateurs. Les processus de travail (workflow) de Wikisource diffèrent de ceux de Wikipédia. Ce ne peux être fait via par xtool. Nous avons besoin d’un outil statistique spécifique pour Wikisource.
  • Who would benefit: Whole Wikisource Community.
    Français: Toute la communauté Wikisource.
  • Proposed solution: Build a Wikisource-specific statistics tool.
    Français: Créer un outil de statistiques spécifiques à Wikisource.
  • More comments:
  • Phabricator tickets: phab:T173012
  • Proposer: Jayantanth (talk) 16:01, 26 October 2019 (UTC)

Discussion

  • I'm initially a “Support Sure, why not?” on this, but given there is a limited amount of resources available for the Wishlist tasks, and this is both not very important and potentially quite time-consuming to implement, I'm going to abstain from voting for this. It's a nice idea, but not worth the cost. --Xover (talk) 05:56, 27 November 2019 (UTC)
    • @Xover: I totally understand your point, as I felt a bit the same way, but on reflection I don't think it would cost that much considering the impact. Knowing a community better can have a very strong effect in boosting that community. In particular, I'm thinking of participation contests, which all need stats and could benefit from this edit counter. Cheers, VIGNERON * discut. 17:33, 27 November 2019 (UTC)

Voting

Template limits

Edit proposal/discussion

  • Who would benefit: Every text on every Wikisource is potentially concerned, but obviously the target is texts with a lot of templates (whether long texts, texts with heavy formatting, or both).
  • Proposed solution: I'm not a dev, but I can imagine multiple solutions:
    • increase the limit (easy but maybe not a good idea in the long run): bad idea (cf. infra)
    • improve the expansion of templates (it's strange that "small" templates like the ones for formatting consume so much)
    • use something other than templates to format text
    • any other idea is welcome
  • More comments:
  • Phabricator tickets: not exactly the same but there is phab:T123844
  • Proposer: VIGNERON * discut. 09:28, 24 October 2019 (UTC)

Discussion

  • Would benefit all projects, as pages that use a large number of templates, such as cite templates, often hit the limit and have to work around the problem. Keith D (talk) 23:44, 27 October 2019 (UTC)
  • For clarity, is this solely about the include size limit? (There are several other types of template limits.) Bawolff (talk) 23:14, 1 November 2019 (UTC)
    @Bawolff: What usually bites us is the post-expand include size limit. See e.g. s:Category:Pages where template include size is exceeded. Note that the problem is exacerbated by ugly templates that spit out oodles of data, but the underlying issue is that the Wikisourcen operate by transcluding together lots of smaller pages into one big page, so even well-designed templates and non-pathological cases will sometimes hit this limit. --Xover (talk) 12:02, 5 November 2019 (UTC)
  • @VIGNERON: unfortunately, various parser limits exist to protect our servers and users from pathologically slow pages. Relaxing them is not a good solution, so we can't accept this proposal as it is. However, if it were reformulated more generally like "do something about this problem", it might be acceptable. MaxSem (WMF) (talk) 19:32, 8 November 2019 (UTC)
    • @MaxSem (WMF): thank for this input. And absolutely! Raising the limit is just of the ideas I suggested, "do something about this problem" is exactly what this proposition is about. I scratched the "increase the limit" suggestion, I can change other wording if needed, my end goal is just to be able to format text on Wikisource. And if you have any other suggestion, you're welcome 😉. VIGNERON * discut. 19:54, 8 November 2019 (UTC)
  • The problem here is that almost all content on large Wikisources is transcluded using ProofreadPage. I noticed that the result is that all the code of templates placed on pages in the Page namespace is transcluded (counted into the post-expand include size limit) twice. If you also note that, besides the templates, Wikisource pages have a lot of non-template content, you will see that Wikisource templates must be tiny, efficient, etc. Even a long CSS class name in an extensively used template might be a problem.
@Bawolff and MaxSem (WMF): So the problem is whether this particular limit has to be the same for very large, high-traffic wikis like English Wikipedia as for medium/small, low-traffic wikis like Wikisource? I think the Wikisources would benefit much even from raising it by 25-50% (from 2MB to 2.5-3MB).
Another idea is based on the fact that the Wikisource page-creation model is: create/verify/leave untouched for years. So if large transclusion pages hurt parser efficiency a lot, maybe the solution is to use less aggressive updates / more aggressive caching for them? I think delayed updates would not be a big problem for Wikisource pages.
Just another idea: on plwikisource we have no pages hitting this limit at the moment thanks to a workaround: for large pages we make userspace transclusions using the {{iwpages}} template, see here. Of course, very large pages may then kill users' browsers instead of killing servers. But I think this is acceptable if somebody really wants to see the whole Bible on a single page (we had such requests...). Unfortunately, this mechanism is incompatible with the Cite extension (transcluded parts contain references with colliding ids, but maybe this can be easily fixed?). Also, a disadvantage is that there are no dependencies on the userspace-transcluded parts of the page(s) (but maybe this is not a problem?). Ankry (talk) 20:04, 9 November 2019 (UTC)
Yeah, depending on just exactly what performance issue that limit is trying to avoid, it is very likely a good idea to investigate whether the problem is actually relevant on the Wikisources. Once a page on Wikisource is finished, it is by definition an exception if it is ever edited again: after initial development the page is supposed to reflect the original printed book which, obviously, will not change. Even the big Wikisources are tiny compared to enwp, so general resource consumption (RAM, CPU) during parsing has a vastly smaller multiplication factor. A single person could probably be expected to patrol all edits for a given 24-hour period on enWS without making it a full-time job (I do three days' worth of userExpLevel=unregistered;newcomer;learner recent changes on my lunch break). If we can run enwp with the current limit, it should be possible to handle all the Wikisourcen with even ten times that limit and barely see it anywhere in Grafana.
Not that there can't be smarter solutions, of course. And I don't know enough about the MW architecture here to predict exactly what the limit is achieving, so it's entirely possible even a tiny change will melt the servers. But it's something that's worth investigating at least. --Xover (talk) 21:50, 9 November 2019 (UTC)
@Ankry and Xover: thanks a lot for these inputs. Raising the limit even a bit may be a good short-term solution, but I think we need a long-term solution more. The most urgent thing is to look into all aspects of the problem to see what can be done and how. Cheers, VIGNERON * discut. 15:01, 12 November 2019 (UTC)
[The following is just my personal view and does not necessarily reflect anyone else's reasoning on this question]: One issue with raising the limit on just a small wiki is that first one wiki will want it, then another wiki, and then a slightly bigger wiki wants it, and pretty soon English Wikipedia is complaining it's unfair that everyone else can have X when they can't, and things spiral. So it's a lot easier to have the same standard across all wikis. Bawolff (talk) 00:45, 22 November 2019 (UTC)
Although, one thing to note: this is primarily about the page tag, so if we did mess with the limit, we could maybe change it just for content included with the page tag. But I'm not sure if people would go for that. Bawolff (talk) 00:54, 22 November 2019 (UTC)
@Bawolff: “Everyone will want it and it's hard to say no”. Certainly. But that's a social problem and not a technical one. As you note, the Wikisourcen will be served (at least mostly) if the raised limit only applies when transclusion is invoked through ProofreadPage. As a proxy for project size it should serve reasonably well: I don't see any ProofreadPage-using project growing to within orders of magnitude of enwp scope any time soon (sadly. that would be a nice problem to have). --Xover (talk) 12:40, 27 November 2019 (UTC)

Voting

Better editing of transcluded pages

Edit proposal/discussion

  • Problem: When somebody wants to edit the text of a page with transcluded content, they find nothing relevant in the source, only links to one or many transcluded pages under the textarea. There are some tools (enabled by default on some Wikisources) which help to find the correct page. These tools display the page number in the middle of the text (on some wikis at the edge), and in the source HTML there are invisible parts, sometimes in the middle of a word/sentence/paragraph.
  • Who would benefit: Users who want to correct transcluded text
  • Proposed solution: 1) Make the invisible HTML marking visible without disturbing the text. Find a way to move it out of the middle of words.
    • en.wikisource example (link to page is on the left edge):
      • dense undergrowth of the sweet myrtle,&#32;<span><span class="pagenum ws-pagenum" id="2" data-page-number="2" title="Page:Tales,_Edgar_Allan_Poe,_1846.djvu/16">&#8203;</span></span>so much prized by the horticulturists of England.
        
    • cs.wikisource example (the link is not displayed by default; when made visible by CSS, it is in the middle of the text):
      • vedoucí od západního břehu řeky k východ<span><span class="pagenum" id="20" title="Stránka:E. T. Seton - Prerijní vlk, přítel malého Jima.pdf/23"></span></span>nímu.
        
  • Alternate solution: 2) After clicking [edit], display the pagination of the transcluded text; clicking a page will open it for editing.
  • Alternate solution 2: 3) Make transcluded pages editable in VE.
  • More comments: Split from this proposal
  • Phabricator tickets:
  • Proposer: JAn Dudík (talk) 16:30, 11 November 2019 (UTC)
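A minimal sketch of solution 1 in project-local CSS, based on the pagenum class and data-page-number attribute shown in the en.wikisource example above. The exact styling values (margin width, colour) are assumptions, not an existing gadget:

```css
/* Pull each page marker out of the text flow into the left margin,
   so it no longer sits invisibly in the middle of a word. */
span.pagenum {
  float: left;
  margin-left: -4em;   /* assumed margin width */
  font-size: 80%;
  color: #72777d;
}
/* Show the page number carried by the marker, where present. */
span.pagenum::after {
  content: attr(data-page-number);
}
```

Moving the marker's position in the generated HTML (out of the middle of a word) would still need a change in ProofreadPage itself; CSS can only restyle where the marker already is.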

Discussion

  • I don't really see this as a worthwhile thing for Community Tech to spend their time on. The existing page separators from ProofreadPage can be customised locally by each project, and its display or not can be customised in project-local CSS. Editing of transcluded content is an inherent problem with a ProofreadPage-based project and is not really something that can be “fixed” in a push by Community Tech. --Xover (talk) 05:45, 27 November 2019 (UTC)

Voting

memoRegex

Edit proposal/discussion

  • Problem: OCR editing needs lots of work-specific regex substitutions, and it would be great to save them and share them with any other user. Shared regex substitutions are also very useful to harmonize formatting across all pages of a work.
  • Who would benefit: all users (inexperienced ones could use complex regex substitutions tested by experienced ones)
  • Proposed solution: it.wikisource uses it:s:MediaWiki:Gadget-memoRegex.js, which does the job (it optionally saves regex substitutions tested with an it.source Find & Replace tool, so that they can be applied by any other user with a click while editing pages of the same Index). The idea should be tested, refined and applied to a deep revision of a central Find & Replace tool.
  • More comments: The tool has been tested on different projects.
  • Phabricator tickets:
  • Proposer: Alex brollo (talk) 07:33, 25 October 2019 (UTC)

Discussion

  • Actually this is very useful. It's an extension of a workaround to the search & replace bug that affects all Wikisource projects. If reimplementing Search & Replace is retained as a solution, "memoRegex" should be considered as part of the implementation. --Ruthven (msg) 18:51, 4 November 2019 (UTC)
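The core of such a memoRegex-style tool is simple. A minimal sketch in JavaScript, where the data structure and function names are illustrative and the per-Index storage/sharing that the gadget provides is assumed, not shown:

```javascript
// A list of saved substitutions for one work, in the order they
// should be applied. In the real gadget this list would be stored
// per Index and shared between users.
const savedSubstitutions = [
  { pattern: /\bii\b/g, replacement: 'n' },                 // typical OCR error
  { pattern: /"([^"]*)"/g, replacement: '\u201C$1\u201D' }  // straight to curly quotes
];

// Apply every saved substitution, in order, to the page text.
function applySubstitutions(text, subs) {
  return subs.reduce((t, s) => t.replace(s.pattern, s.replacement), text);
}

const cleaned = applySubstitutions('the "old ii" again', savedSubstitutions);
// cleaned === 'the “old n” again'
```

Order matters: the OCR fix runs before the quote conversion, so later rules see the output of earlier ones, which is also how a shared per-work list stays reproducible for every user.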

Voting

ProofreadPage extension in alternate namespaces

Edit proposal/discussion

Français: Utiliser les outils de l'espace page dans d'autres espaces
  • Problem: ProofreadPage elements, such as the "Source" link in the navigation, do not display in namespaces other than mainspace
    Français: Les éléments de l’espace page, tels que le lien "Source" dans la navigation, ne s'affichent pas dans les espaces de noms autres que l’espace principal.
  • Who would benefit: Wikisources with works in non-mainspace, such as user translations on English Wikisource
    Français: Utilisateurs Wikisource qui font des travaux qui ne sont pas en espace principal, tels que des traductions utilisateur sur Wikisource anglaise
  • Proposed solution: Modify the ProofreadPage extension to allow its use in namespaces other than mainspace
    Français: Modifier l'extension de l'espace page, ProofreadPage, pour permettre son utilisation dans des espaces de noms autres que l’espace principal.
  • More comments: I also proposed this in the 2019 and 2017 wishlist surveys.
  • Phabricator tickets: phab:T53980
  • Proposer: —Beleg Tâl (talk) 16:23, 23 October 2019 (UTC)

Discussion

  • Not a lot of work, heaps of impact. Thanks for the proposal. --Gryllida 23:35, 6 November 2019 (UTC)

Voting

Improve extraction of a text layer from PDFs

Edit proposal/discussion

  • Problem: If a scan in PDF has an OCR layer (i.e. an original OCR layer, usually of high quality, which is part of many PDF scans provided by libraries, not the OCR text obtained by our OCR tools), the text is extracted from it very poorly in the Wikisource Page namespace. DJVU files do not suffer from this problem and their OCR layer is extracted well. If the PDF is converted into DJVU, the extraction of the text from its OCR layer improves too. (Example of OCR extraction from a PDF here: [1]; example of the same from DJVU here: [2].) As most libraries, including Internet Archive and HathiTrust, offer downloading PDFs with OCR layers and not DJVUs, we need to fix the text extraction from PDFs.
  • Who would benefit: All Wikisource contributors working with PDF scans downloaded from various major libraries (see above). Some contributors on Commons have expressed concern that the DjVu file format is dying and attempted to deprecate it in favour of PDF. Although the attempt has not succeeded (this time), many people still prefer working with PDFs (because the DJVU format is difficult for them to work with, or they do not know how to convert PDF into DJVU or how to edit DJVU scans, and also because the DJVU format is not supported by web browsers...)
  • Proposed solution: Fix the extraction of text from existing OCR layers of scans in PDF.
  • More comments:
  • Phabricator tickets:
  • Proposer: Jan.Kamenicek (talk) 20:18, 24 October 2019 (UTC)

Discussion

There are also libraries where it is possible to download a bunch of pages (20-100) in PDF, but none, or only single pages, in DJVU.

There is also the possibility of external Google OCR:

mw.loader.load('//wikisource.org/w/index.php?title=MediaWiki:GoogleOCR.js&action=raw&ctype=text/javascript');

but there are more OCR errors and sometimes lines get mixed up. JAn Dudík (talk) 12:13, 25 October 2019 (UTC)

Yes, exactly: the Google OCR is really poor (en.ws has it among its gadgets), but the original OCR layer that is part of most scans obtained from libraries is often really good; only MediaWiki fails to extract it correctly. If you download a PDF document e.g. from HathiTrust, it usually contains an OCR layer provided by the library (i.e. not obtained by one of our tools), and when you try to use this original OCR layer in the Wikisource Page namespace, you get very poor results. But if you take the same PDF document and convert it to DJVU prior to uploading it here, then you get amazingly better results when extracting the text from the original OCR layer in Wikisource, and you do not need any of our OCR tools. This means that the original OCR layer of the PDF is good; we are simply not able to extract it correctly from the PDF for some reason, although we are able to extract it from DJVU. --Jan.Kamenicek (talk) 17:10, 25 October 2019 (UTC)
Yeah - it is pretty bad when the text layer does not appear and the OCR buttons hang greyed out, but I can cut and paste text from the IA txt file. Clearly a failure to hand off a clean text layer. Slowking4 (talk) 02:34, 28 October 2019 (UTC)
  • @Jan.Kamenicek: It sounds like there are various problems with the extraction of some PDFs' text layers. Would indeed be great to fix! Is this to do with e.g. columns? Could you please add some examples to this proposal of PDFs (or pages of) that are failing to be extracted correctly? Thanks! —Sam Wilson 15:54, 12 November 2019 (UTC)
    @Samwilson: We can compare File:The Hussite Wars, by the Count Lützow.pdf with the File:The Hussite wars, by the Count Lützow.djvu. The PDF file was downloaded from HathiTrust, the djvu file was created by converting the pdf file into djvu using an online converter. I personally would expect that further processing would result in some data loss and that the quality of the djvu file would be worse. However, when you compare for example [3] with [4], you can see that Mediawiki extracts the OCR layer much better from the DJVU file than from the PDF file, which means that Mediawiki is not able to extract the OCR layer from PDF files properly. --Jan.Kamenicek (talk) 17:17, 12 November 2019 (UTC) I have also added the examples into the problem description above. --Jan.Kamenicek (talk) 17:22, 12 November 2019 (UTC)
    @Jan.Kamenicek: Thanks, that's very useful! Sam Wilson 19:39, 12 November 2019 (UTC)
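For anyone wanting to try the workaround described above (converting the PDF to DjVu before upload so that the original OCR layer extracts correctly), a minimal sketch using the pdf2djvu and DjVuLibre command-line tools might look like this; the file names are hypothetical examples, and both packages are assumed to be installed:

```shell
# Convert the PDF to DjVu; pdf2djvu copies the hidden OCR text layer
# by default (it can be suppressed with --no-text).
pdf2djvu -o "The Hussite wars.djvu" "The Hussite Wars.pdf"

# Verify that the text layer survived the conversion before uploading:
djvutxt "The Hussite wars.djvu" | head
```

Because the text layer is carried over, the converted file should then extract on Wikisource as well as the examples compared above.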

Voting

Tools to easily localize content deleted at Commons

Edit proposal/discussion

  • Problem: When a book scan is deleted on Commons, it completely breaks indexes on Wikisource. Commons does not notify Wikisource when they delete files, nor do they make any attempt to localize the file instead of deleting it. Wikisource has no way of tracking indexes that have been broken by the Commons deletion process.
  • Who would benefit: Wikisource editors
  • Proposed solution: 1) Make it really easy to localize files, for example by fixing phab:T8071, and 2) Fix or replace the bot(s) that used to notify Wikisources of pending deletions of book scans used by Wikisource
  • More comments: A similar approach may also be helpful for Wikiquote and Wiktionary items that depend on Wikisource, when Wikisource content is moved or deleted.
  • Phabricator tickets: phab:T8071
  • Proposer: —Beleg Tâl (talk) 14:45, 4 November 2019 (UTC)

Discussion

  • A new Commons deletion bot was created in 2017. Create a Phabricator task with the tag "Community-Tech" to enable it on your wiki. Once you have done that, only bug T8071 remains.--Snaevar (talk) 18:29, 4 November 2019 (UTC)
  • A better way is probably FileImporter enabled on local wiki: phab:T214280. --Xover (talk) 11:52, 5 November 2019 (UTC)
  • Comment. This happened to me recently. It's killer. –MJLTalk 15:41, 25 November 2019 (UTC)

Voting

  •   Support --Jan.Kamenicek (talk) 10:43, 23 November 2019 (UTC)
  •   Support Liuxinyu970226 (talk) 10:24, 24 November 2019 (UTC)
  •   Support Sebastian Wallroth (talk) 10:55, 25 November 2019 (UTC)
  •   Support JogiAsad (talk) 13:04, 25 November 2019 (UTC)
  •   Support Geonuch (talk) 13:12, 25 November 2019 (UTC)
  •   SupportMJLTalk 15:41, 25 November 2019 (UTC)
  •   Support 16:52, 25 November 2019 (UTC)
  •   Support Akme (talk) 17:09, 25 November 2019 (UTC)
  •   Support Ciao • Bestoernesto 17:59, 25 November 2019 (UTC)
  •   Support Eunostos (talk) 20:46, 25 November 2019 (UTC)
  •   Support DraconicDark (talk) 21:13, 25 November 2019 (UTC)
  •   Support Hsarrazin (talk) 14:38, 26 November 2019 (UTC)
  •   Support Solving this through phab:T214280 will solve lots of problems for every single project outside Commons: any fix that allows FileImporter to be used on the Wikisourcen can later be applied to other projects that need it. FileExporter/FileImporter are a brilliant workflow for moving files to Commons: it's just screaming to be enabled in the reverse direction! Xover (talk) 06:16, 27 November 2019 (UTC)
  •   Support Novak Watchmen (talk) 17:56, 2 December 2019 (UTC)

Generate thumbnails for large-format PDFs

Edit proposal/discussion

  • Problem: For some PDFs, with very large images (typically scanned newspapers), no images (called "thumbnails") are shown.
  • Who would benefit: Wikisource when proofreading newspaper pages.
  • Proposed solution: Look at the PDF files described in phab:T25326, phab:T151202, commons:Category:Finlands Allmänna Tidning 1878, to find out why no thumbnails are generated.
  • More comments: When extracting the JPEG for an individual file, that JPEG can be uploaded. But when the JPEG is baked into a PDF, no thumbnail is generated. Is it because of its size? Small pages (books) work fine, but newspapers (large pages) fail.
  • Phabricator tickets: phab:T151202
  • Proposer: LA2 (talk) 21:04, 23 October 2019 (UTC)

Discussion

  • Hi LA2! Can you provide a description of the problem? This could help give us a deeper understanding of the wish. Thank you! --IFried (WMF) (talk) 18:52, 25 October 2019 (UTC)
    The problem is very easy to understand. I find a free, digitized PDF and upload it to Commons, then start to proofread in Wikisource. This always works fine for normal books, but when I try the same for newspapers, no image is generated. Apparently this is because the image has a larger number of pixels. I haven't tried to figure out what the limit is. --LA2 (talk) 21:36, 25 October 2019 (UTC)
  • For File:Finlands_Allmänna_Tidning_1878-00-00.pdf at least, ghostscript correctly rendered the file locally, but took a lot of time (Like a ridiculous amount of time. evince seems to render it instantly, so I don't know why ghostscript takes so long). So at a first guess, I suppose its hitting time limits. Bawolff (talk) 20:20, 25 October 2019 (UTC)
    Maybe the solution is to fix ghostscript? Another way is to bypass ghostscript and use pdfimages to extract the embedded JPEG images and render those, since JPEG rendering seems to work fine. I don't know. --LA2 (talk) 21:34, 25 October 2019 (UTC)
    pdfimages is not a solution, as a PDF page may consist of multiple images, and it is hard to extract their relative locations (at least not possible with pdfimages). Ankry (talk) 20:23, 9 November 2019 (UTC)
    I was going to write to the National Library about this (I think I know at least one of the persons involved) but I don't observe this slowness on gs 9.27, I think: phabricator:P9760. Maybe I should try a non-dummy command. Nemo 09:06, 27 November 2019 (UTC)
  • What about providing a more compact design for proofreading altogether? Those seconds of scrolling count. If we had on one side the window with the extracted text, and on the other side a same-sized window with the scan, in which you can zoom and move quickly, that should save time and be more attractive for newbies. The way it is now, it looks kind of techy and is in some cases difficult to handle. E.g., there should also be more context help, or a link to the discussion page, presented in a more attractive design. Juandev (talk) 09:22, 4 November 2019 (UTC)
  • I think that a tool that allows generating such thumbnails manually / on request / offline with much higher limits, available to a specific group of users (Commons admins? a dedicated group?), may be a workaround for this problem. Ankry (talk) 20:23, 9 November 2019 (UTC)
I'm not sure what you want me to check - the question at hand is why that particular version of the file failed to render. Bawolff (talk) 09:15, 22 November 2019 (UTC)
Wow, @Hrishikes and Bawolff:, there is a fix? How exactly does it work? Could it be integrated into the upload process? Could it be applied to all files in commons:Category:Finlands Allmänna Tidning 1878? --LA2 (talk) 19:13, 10 December 2019 (UTC)
@LA2: -- This problem occurs in highly compressed files and is linked to the OCR layer. The fix consists of decompressing the file (so that the size in MB increases) and either flattening or removing the OCR layer. I first tried flattening; it usually works but did not in this case, so I removed the OCR. Now it works. And yes, it is potentially usable for other files in your category. Extract the pages as png/jpg and rebuild the PDF. Hrishikes (talk) 01:39, 11 December 2019 (UTC)
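A rough sketch of the rebuild procedure Hrishikes describes (render the pages to images, then reassemble a PDF without the problematic OCR layer), using the poppler-utils and img2pdf command-line tools; the file names and the 300 dpi setting are hypothetical examples:

```shell
# Optional: inspect the embedded images first (one line per image,
# with page number, dimensions and dpi):
pdfimages -list broken.pdf

# 1. Render each page of the problematic PDF to PNG at 300 dpi
#    (output files are named page-1.png, page-2.png, ...):
pdftoppm -png -r 300 broken.pdf page

# 2. Rebuild a PDF from the page images, leaving the OCR layer out:
img2pdf page-*.png -o rebuilt.pdf
```

Note that the rebuilt file has no text layer at all, so OCR would have to be redone afterwards.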

Voting

  •   Support Important issue for every project which relies on multi-page documents (PDF is a notoriously bad format but that's what we have in practice). It probably doesn't require much coding, but the Community Tech team could help by lobbying the appropriate WMF departments to get more resources assigned to the thumbnail generation. Nemo 09:16, 22 November 2019 (UTC)
  •   Support --Jan.Kamenicek (talk) 10:31, 23 November 2019 (UTC)
  •   Support Liuxinyu970226 (talk) 10:26, 24 November 2019 (UTC)
  •   Support LA2 (talk) 12:45, 25 November 2019 (UTC)
  •   Support Stefan Kühn (talk) 13:16, 25 November 2019 (UTC)
  •   Support JogiAsad (talk) 13:25, 25 November 2019 (UTC)
  •   Support Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 13:26, 25 November 2019 (UTC)
  •   SupportAmmarpad (talk) 15:36, 25 November 2019 (UTC)
  •   Support A garbage person (talk) 16:18, 25 November 2019 (UTC)
  •   Support 16:41, 25 November 2019 (UTC)
  •   Support Ciao • Bestoernesto 17:45, 25 November 2019 (UTC)
  •   Support Eunostos (talk) 20:37, 25 November 2019 (UTC)
  •   Support Geonuch (talk) 15:09, 26 November 2019 (UTC)
  •   Support Not that I want to encourage more use of PDF, but we run into too many pointless problems with all multi-page formats (the majority with PDF it seems) and reducing this will reduce both wasted time and frustration (which often hits new contributors: the old hands have learned to avoid the pain points). Xover (talk) 05:54, 27 November 2019 (UTC)
  •   Support ··· 🌸 Rachmat04 · 14:35, 27 November 2019 (UTC)
  •   Support Peter Alberti (talk) 19:59, 28 November 2019 (UTC)
  •   Support Rahmanuddin (talk) 06:50, 2 December 2019 (UTC)
  •   Support सुबोध कुलकर्णी (talk) 12:12, 2 December 2019 (UTC)
  •   Support Novak Watchmen (talk) 17:56, 2 December 2019 (UTC)

Improve export of electronic books

Edit proposal/discussion

Original title (Français): Améliorer l'exportation des versions électroniques des livres
  • Problem: Imagine if Wikipedia pages could not display for many days, or were only available once in a while for many weeks. Imagine if Wikipedia displayed pages with missing or scrambled information. This is what visitors get when they download books from the French Wikisource. Visitors do not read books online in a browser. They want to download them to their reader in ePub, MOBI or PDF. The tool to export books, WsExport, has had all these problems: in spring 2017, it was on and off for over a month; after October 2017, the MOBI format was unavailable, followed after a while by PDF. These problems still continue on and off.


  • Since the project was finished sometime in July or August 2019, the stability of the WsExport tool has improved. Unfortunately, there have been downtimes, some up to 12 hours. The fact that the tool does not get back online rapidly is a deterrent for our readers/visitors.
    1. September 30 : no download from 10:00 to 22:00 Montreal time
    2. October 30 : no download for around 30 minutes from 13:00 to 13:30
    3. October 31 : no answer or bad gateway at 22:10
    4. November 1st : no download from 17:15 to 22:30
    5. November 2 : no download from 10:30 to 11:40
    6. November 2 : no download or bad gateway from 19:25 to 22:46
  • I have tested books and found the same problems as before.
    1. Missing text at end of page or beginning of page (in plain text or in table)
    2. Duplication of text at end of page or beginning of page
    3. Table titles don't appear
    4. Table alignment in a page (centered) not respected
    5. Text alignment in table cell not respected
    6. Style in table not respected in MOBI format
    7. And others
  • More information can be found on my Wikisource page
  • For all these reasons, this project is resubmitted this year. It is an important aspect of the Wikisource project: an interface for contributors and an interface for everyone else from the public who wishes to read good e-books. --Viticulum (talk) 21:45, 7 November 2019 (UTC)


  • Français: Imaginez si les pages Wikipédia ne s’affichaient pas pendant plusieurs jours, ou n’étaient disponibles que de façon aléatoire durant plusieurs jours. Imaginez si, sur les pages Wikipédia, certaines informations ne s’affichaient pas ou étaient illisibles. C’est la situation qui se produit pour les visiteurs qui désirent télécharger les livres de la Wikisource en français. Les visiteurs ne lisent pas les livres en ligne dans un navigateur, ils désirent les télécharger sur leurs lecteurs en ePub, MOBI ou PDF. L’outil actuel (WsExport) permettant l’export dans ces formats possède tous ces problèmes: au printemps 2017, il fonctionnait de façon aléatoire durant un mois ; depuis octobre 2017, le format mobi puis pdf ont cessé de fonctionner. Ces problèmes continuent de façon aléatoire.


  • Depuis la fin du projet en juillet ou août 2019, la stabilité de l'outil WsExport s'est améliorée. Malheureusement, il y a eu des temps d'arrêt, certains jusqu'à 12 heures. Le fait que l'outil ne soit pas remis en ligne rapidement peut être dissuasif pour nos lecteurs / visiteurs.
    1. 30 septembre : aucun téléchargement de 10 h à 22 h heure de Montréal
    2. 30 octobre : pas de téléchargement pour environ 30 minutes de 13 h à 13 h 30
    3. 31 octobre : pas de réponse ou mauvaise passerelle 22 h 10
    4. 1er novembre : pas de téléchargement de 17 h 15 à 22 h 30
    5. 2 novembre : pas de téléchargement de 10 h 30 à 11 h 40
    6. 2 novembre : pas de téléchargement ou mauvaise passerelle de 19 h 25 à 22 h 46
  • J'ai testé des livres et trouve les mêmes problèmes qu'avant.
    1. Texte manquant à la fin ou au début de la page (dans le texte ou dans un tableau)
    2. Duplication de texte en fin ou en début de page
    3. Les titres de table n'apparaissent pas
    4. L'alignement de la table sur une page (centrée) n'est pas respecté
    5. L'alignement du texte dans la cellule du tableau n'est pas respecté
    6. Style dans la table non respecté en format MOBI
    7. Et d'autres
  • Plus d'informations peuvent être trouvées sur ma page wikisource
  • Pour toutes ces raisons, ce projet est soumis à nouveau cette année. La communauté Wikisource accorde une importance haute à cet aspect du projet : une interface pour les contributeurs et une interface pour tous les autres utilisateurs du public souhaitant lire de bons livres électroniques. --Viticulum (talk) 21:57, 7 November 2019 (UTC)


  • Who would benefit: The end users, the visitors to Wikisource, by having access to high quality books. This would improve the credibility of Wikisource.

    This export tool is the showcase of Wikisource. Contributors can be patient with system bugs, but visitors won’t be, and won’t come back.

    The export tool is as important as the web site is.

    Français: L’utilisateur final, le visiteur de Wikisource, en ayant accès à des livres de haute qualité. Ceci contribuerait à améliorer la crédibilité de Wikisource. L’outil d´exportation est une vitrine pour Wikisource. Les contributeurs peuvent être patients avec les anomalies de système, mais les visiteurs ne le seront peut-être pas et ne reviendront pas. L’outil d’exportation est tout aussi important que le site web.
  • Proposed solution: We need a professional tool that runs and is supported 24/7 by Wikimedia Foundation professional developers, as the different Wikimedia websites are.

    The tool should support the different kinds of electronic books and the evolution of e-book technology.

    The different bugs should be corrected.

    Français: Nous avons besoin d’un outil professionnel, fonctionnel et étant supporté 24/7, comme tous les différents sites Wikimedia, par les développeurs professionnels de la Fondation Wikimedia. Les différentes anomalies doivent être corrigées.
  • More comments: There are not enough people on a small wiki (even on the French, Polish and English Wikisources, the three most important by the size of their communities) to support and maintain such a tool.
    Français: Nous ne sommes pas assez nombreux dans les petits wikis (même Wikisource en français, polonais ou anglais, les trois plus importantes par le nombre de contributeurs) pour supporter une telle application.


Discussion

Voting