Grants:IEG/Tools for Armenian Wikisource and beyond

This project is funded by an Individual Engagement Grant

status: selected

project:

Tools for Armenian Wikisource and beyond

project contact:

User:Xelgen

participants:

grantees: User:Xelgen

summary:

Developing a set of tools for Armenian Wikisource, many of which can serve other Wikisource projects once reconfigured

engagement target:

Armenian Wikisource, with potential to extend to other languages

strategic priority:

Improving Quality

total amount requested:

7850 USD

2014 round 1

Project idea

What is the problem you're trying to solve?

While digitizing the Soviet Armenian Encyclopedia (SAE) on Armenian Wikisource, and later using its articles as a base for Armenian Wikipedia articles, we felt a strong need for a set of tools that would make the whole process faster, more efficient and more fun by eliminating chores, and as a result yield more and higher-quality articles and a happier community.

Digitizing dictionary-structured books challenged the regular Wikisource workflow and required actions and tools never needed when digitizing, for example, fiction. Examples of such actions are dividing content into articles, creating an index of the articles with some metadata, and maintaining the list of articles on two projects while they are being created, moved, deleted or merged. Proofreading text printed in a three-column layout with rich formatting is also more difficult, because of the longer "seek time" and the non-linear, back-and-forth eye movement of contributors trying to find a word in the scan of the page.

What is your solution?

Develop tools to assist in proofreading and digitizing books (in the wide sense), and to automate or semi-automate the most time- and attention-consuming tasks. Improve the user experience for proofreaders.

These tools are somewhat language dependent. The project's main target is Armenian Wikisource and the Armenian language, but we're going to keep a global vision while developing the tools, to ensure that they can be modified, localized and reconfigured for other Wikisources as easily as possible. We'll do our best to provide sufficient documentation in English for that. The good news is that if we succeed with Armenian, we'll also succeed with the majority of other Indo-European languages, as Armenian hyphenation and capitalization rules are relatively complicated.

Non-WMF projects using MediaWiki and the ProofreadPage extension can also benefit from these tools.

Project goals

Proof of concept of ZoomProof in action

Current state of IllustrationCropper: the frontend is about 40% done. Some development output can be seen.

SectionMarker and the duplicate checker at work. They are tightly focused on SAE and need to be made more universal and flexible.

The following tools are to be developed, tested and fine-tuned within the scope of the project:

ZoomProof - a Wikisource gadget that automatically zooms to and highlights the word being proofread (a feature often found in OCR software; see screencast)
Wikisource IllustrationCropper - a tool to crop images out of a book page scan and upload them, right on the edit page
Section Harvest - parses all pages of a book, produces a detailed list of the sections used, and verifies it against some simple rules (e.g. no duplicate section names, no all-caps section names, etc.). Especially useful for digitizing dictionaries, encyclopedias, reference books, etc.
LST Guard (see Labeled Section Transclusion) - a service to monitor changes to the names of sections that are already used to transclude content. We've noticed that some editors change section names without realising that other pages become blank as a result. Monitoring such cases and notifying patrollers is the minimal goal; we'll also try to make it possible to automatically update the section name in the transclusion, or to revert the change of the section name (not the whole revision), according to some rules (e.g. the section name is not unique within the book). Ideally, it will also automatically link Wikipedia and Wikisource articles.
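As a rough illustration of the checks that Section Harvest and LST Guard are meant to perform, the sketch below scans page wikitext for labeled-section markers, flags duplicate and all-caps names, and lists section names that disappeared between two revisions of a page. It is only a sketch under assumptions: the function names, the `pages` dictionary and the sample page titles are hypothetical, and a real tool would fetch page text through the MediaWiki API rather than receive it as strings.

```python
import re
from collections import Counter

# Matches <section begin="Name" /> with or without quotes (LST marker syntax).
SECTION_BEGIN = re.compile(
    r'<section\s+begin\s*=\s*"?([^"<>/]+?)"?\s*/?>', re.IGNORECASE
)


def harvest_sections(pages):
    """Return a list of (page title, section name) for every begin marker.

    `pages` is a hypothetical dict mapping page titles to their wikitext.
    """
    found = []
    for title, text in pages.items():
        for match in SECTION_BEGIN.finditer(text):
            found.append((title, match.group(1).strip()))
    return found


def check_sections(found):
    """Apply the two simple rules named above: no duplicates, no all-caps."""
    counts = Counter(name for _, name in found)
    problems = []
    for title, name in found:
        if counts[name] > 1:
            problems.append((title, name, "duplicate"))
        if name.isupper():  # also works for Armenian capital letters
            problems.append((title, name, "all-caps"))
    return problems


def removed_section_names(old_text, new_text):
    """Section names present in the old revision but missing in the new one.

    A guard service could check these names against existing transclusions.
    """
    old_names = {m.group(1).strip() for m in SECTION_BEGIN.finditer(old_text)}
    new_names = {m.group(1).strip() for m in SECTION_BEGIN.finditer(new_text)}
    return old_names - new_names


# Hypothetical sample data, for illustration only.
pages = {
    "Page:SAE.djvu/10": '<section begin="Ararat" />a<section end="Ararat" />\n'
                        '<section begin="ARAX" />b<section end="ARAX" />',
    "Page:SAE.djvu/11": '<section begin="Ararat" />c<section end="Ararat" />',
}
for problem in check_sections(harvest_sections(pages)):
    print(problem)
```

Treating every repeated name as a duplicate follows the simple rule quoted above; a section that legitimately continues across a page break would need an exception.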

The following two tools are implemented as part of the SAE Tools user script and are in use on Armenian Wikisource. However, the current versions are too focused on the scanned version of the Soviet Armenian Encyclopedia and are interconnected with other tools in that pack.

AutoHinter - automatically finds and highlights possible mistakes by using data specific to the language and to the OCR software used.
SectionMarker - a tool that allows adding sections and checks for duplicates in the neighboring scanned pages.

The project's other goal is to turn these two tools into more universal, "standalone" gadgets (or user scripts) suitable for generic cases, so that they can be used by other Wikisources as well. SectionMarker also needs deep refactoring (the duplicate-prevention parts are very messy and hard to read).
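To make the AutoHinter idea concrete, here is a minimal sketch of one kind of language-specific check: flagging tokens in which Armenian letters are mixed with Latin letters or digits, a pattern that typically points to an OCR misread. The two rules, the function name and the sample text are assumptions chosen for illustration; they are not the rule set actually used in SAE Tools.

```python
import re

ARMENIAN = re.compile(r"[\u0531-\u0556\u0561-\u0587]")  # Armenian letters
LATIN = re.compile(r"[A-Za-z]")
DIGIT = re.compile(r"\d")
WORD = re.compile(r"\w+")  # in Python 3, \w covers Unicode letters and digits


def find_hints(text):
    """Return (offset, token, reason) for every suspicious token in `text`."""
    hints = []
    for match in WORD.finditer(text):
        token = match.group(0)
        if not ARMENIAN.search(token):
            continue  # only Armenian words are checked in this sketch
        if LATIN.search(token):
            hints.append((match.start(), token, "mixed Armenian/Latin letters"))
        elif DIGIT.search(token):
            hints.append((match.start(), token, "digit inside a word"))
    return hints


# Hypothetical OCR output: the second word contains a Latin "j",
# and the last word ends in a zero instead of a letter.
sample = "Հայաստան Հաjաստան 1974 Երևա0"
for hint in find_hints(sample):
    print(hint)
```

A gadget built along these lines would only highlight the flagged tokens and leave the correction to the proofreader.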

Project plan

Scope:

Scope and activities

Gathering and processing language data needed for tools
Recruiting volunteer coders among students who want to gain experience while building useful things
Drawing up contracts, accounting, and reporting to local tax authorities
Setting up collaboration environment for the project
Research on the best methods and algorithms for ZoomProof (linking words between unprocessed DjVu text and proofread, wikified text, to highlight the words in scans)
Coding
Mentoring volunteer coders, code review
Optimizing code to achieve seamless user experience (especially for ZoomProof which may require calculations and redrawing on the fly)
QA and encouraging beta users to test
Writing documentation and ensuring the source code is readable enough
Organizing proofing-thons
Writing reports
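The ZoomProof research item in the list above, linking words in the raw DjVu text to words in the proofread text, can be approached as a sequence alignment problem. The sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in for whatever algorithm the research settles on. It assumes the raw text layer supplies one bounding box per word; the function name, the box format and the sample data are hypothetical.

```python
from difflib import SequenceMatcher


def link_words(ocr_words, proofread_words):
    """Map indexes of proofread words to bounding boxes of OCR words.

    `ocr_words` is a list of (word, box) pairs from the raw text layer;
    `proofread_words` is the list of words after proofreading.
    """
    ocr_tokens = [word for word, _ in ocr_words]
    matcher = SequenceMatcher(a=ocr_tokens, b=proofread_words, autojunk=False)
    links = {}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        # Identical runs link one to one. A replaced run of equal length is
        # assumed to be corrected words in the same positions.
        if tag == "equal" or (tag == "replace" and same_length):
            for offset in range(i2 - i1):
                links[j1 + offset] = ocr_words[i1 + offset][1]
    return links


# Hypothetical sample: the OCR misread "The" as "Tbe".
ocr = [
    ("Tbe", (10, 10, 40, 20)),
    ("quick", (45, 10, 90, 20)),
    ("fox", (95, 10, 120, 20)),
]
print(link_words(ocr, ["The", "quick", "fox"]))
```

Words that end up without a link (for example after an insertion or a merge) are the cases where a gadget could fall back to showing only the approximate area of the page, using the boxes of the nearest linked neighbors.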

Tools, technologies, and techniques

Mostly meetups, including ones with volunteer students. We'll be using tools and platforms that are currently the de facto standard in the free software world.

Budget:

Total amount requested

7850 USD

Budget breakdown

Research & Development: 220 hours at 20 USD/hour
Volunteer coordination & meetings: 40 hours at 20 USD/hour
Project coordination: 120 hours at 20 USD/hour
Promotional materials for volunteer contributors and participants of proof-read-a-thons (certificates, stickers, etc.): 250 USD (some materials, such as mugs and T-shirts, may be provided by Wikimedia Armenia)

Hourly rates include 26% taxes. Yerevan State University will provide space and a lab for in-person meetups with volunteers.
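As a quick cross-check, the line items of the breakdown add up to the total amount requested. The snippet below only restates the figures given above; the variable and function names are illustrative.

```python
# Line items from the budget breakdown: (description, hours, USD per hour).
HOURLY_ITEMS = [
    ("Research & Development", 220, 20),
    ("Volunteer coordination & meetings", 40, 20),
    ("Project coordination", 120, 20),
]
PROMOTIONAL_MATERIALS_USD = 250


def budget_total(hourly_items, fixed_costs):
    """Sum hours * rate over the hourly items and add the fixed costs."""
    return sum(hours * rate for _, hours, rate in hourly_items) + fixed_costs


total = budget_total(HOURLY_ITEMS, PROMOTIONAL_MATERIALS_USD)
print(total)  # 7850
```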

Intended impact:

Target audience

The main target audience of the project is Wikisource proofreaders. In the specific case of SAE, Wikipedians using articles from the printed encyclopedia would also benefit, as they'll be able to coordinate their work more efficiently.

IT students who volunteer to help with coding will get mentorship and code review, and will learn about real-world tools, techniques, workflows and platforms. We also hope to introduce them to the world of free software and to see them contributing code to free projects in the future.

Fit with strategy

Improving Quality
Increasing Participation
Encouraging innovation

Sustainability

On Armenian Wikisource, the tools won't need much maintenance (unless we switch to a VisualEditor-only model or the current APIs are deprecated), so they can be used by contributors for years.

The source code will be made available on GitHub and on Wikisource or Meta pages, along with documentation. This should make a longer-term goal possible: making these tools useful for other Wikisource projects.

Large volumes of proofread text will be produced with the tools, and these will become valuable material for training various OCR programs. For example, there is no freely available OCR software for Armenian text, and the main reason is the lack of large volumes of scanned and proofread text to train the programs on. This may prove useful for many other languages as well.

Measures of success

Measure of success:

All tools listed are created, with all essential functionality
Tools work under current versions of Firefox, Chromium (Chrome, Opera). All critical functionality works under Internet Explorer 7.
User documentation is provided in Armenian and English
Technical documentation in English is provided for all tools (for later modifications by developers)
At least 5 volunteer coders are involved in the project
Tools are tested by developers and the community. If community members have major concerns about functionality, UX or workflow during the early alpha version, the necessary modifications are made.
As far as possible, it is ensured that the tools can be easily modified (if not simply configured) for other languages.
The ZoomProof tool highlights the correct word in at least 75% of cases; in at least 95% of cases it is able to show the correct area of the page, even without highlighting the word. (Numbers are for Armenian text; other languages may show higher or lower success rates after reconfiguration.)

Stretch-goals:

IllustrationCropper is able to use high-res files from external servers, local Wikisource, or Commons, as source instead of low-res DjVu images.
Every tool is implemented and used in at least one Wikisource project other than Armenian by the end of the project.

The following metrics can be applied retrospectively, a few months after the project finishes and while the tools are in use:

One of the metrics to measure the success will be the number of editors using the tools. It may not be efficient to track all the changes done by any of the tools (e.g. every corrected mistake in the OCRed text). The usage of some of the tools will be easier to measure, for example, the number of pictures obtained by IllustrationCropper. The LST Guard will have a detailed log of its actions.

Rise in the number of edits, especially by users using these tools.

Having a full index of articles in the Soviet Armenian Encyclopedia, and a separate subpage for every article, by the end of the project may be another target and indicator of success. Though this depends on

In the longer term, the number of language versions of Wikisource that have used the tools will be a good indicator of their reusability.

Participant(s)

User:Xelgen, FLOSS enthusiast, Wikipedian for over 7 years, sysop of Armenian Wikipedia. Localisation enthusiast. Coding since the age of 9, with 11 years of professional experience in IT, including managerial positions. Project management skills, experience running NGOs and with local tax reporting. Was a member of the initiative to get the Soviet Armenian Encyclopedia released under a Creative Commons license. Scanned and processed 13 volumes of the encyclopedia.

User:HrantKhachatrian, Master's student at the Department of Informatics and Applied Mathematics, Yerevan State University, president of the Student Scientific Society of the department, freelance web developer for 5 years.

User:Mahnerak, Bachelor's student at the Department of Informatics and Applied Mathematics, Yerevan State University, winner of various local and international olympiads in informatics, author of the software used in school olympiads in Armenia.

Discussion

Community Notification:


Old discussions related to possible SAE proofreading automation, tools, and policies:

Discussion on what to do with SAE illustrations, and how
Discussions on widespread OCR mistakes to consider when developing tools, from 2012 and from 2013
Discussion on the first version of SectionMarker
Discussion on blind, automatic removal of hyphenation. The consensus was that, due to the complicated rules of hyphenation and the OCR quality, simply removing hyphens with a bot isn't a good idea. This sparked the development of a smarter and more careful tool that removes only obvious hyphenation cases. In the end it led to the development of the SAE Tools pack.
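The "obvious cases only" approach from the last discussion above can be sketched as follows. The sketch assumes that a case counts as obvious when the joined word appears in a list of known words; that criterion, the function name and the sample word list are illustrative assumptions, not the rules actually used in SAE Tools. It also assumes the OCR text uses the ASCII hyphen-minus at line ends.

```python
import re

# A word fragment, a hyphen at the end of the line, and the continuation.
HYPHEN_BREAK = re.compile(r"(\w+)-\n(\w+)")


def join_obvious_hyphens(text, known_words):
    """Join words split across lines only when the result is a known word.

    `known_words` is a hypothetical set of lowercase dictionary words.
    Anything else is left untouched for a human proofreader.
    """
    def replace(match):
        joined = match.group(1) + match.group(2)
        if joined.lower() in known_words:
            return joined  # the line break is dropped together with the hyphen
        return match.group(0)

    return HYPHEN_BREAK.sub(replace, text)


# Hypothetical sample: only the first split word is in the word list.
known = {"encyclopedia"}
raw = "Soviet Armenian Ency-\nclopedia has well-\nknown articles"
print(join_obvious_hyphens(raw, known))
```

Leaving unknown joins alone is what separates this from the blind bot removal that the discussion rejected: a compound that is legitimately hyphenated stays as it is.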

Endorsements:

Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project in the list below. Other feedback, questions or concerns from community members are also highly valued, but please post them on the talk page of this proposal.

Endorse -- by all means. User:Xelgen has a long history on Armenian Wikipedia and Wikisource, and has had a major role in release of Soviet Armenian Encyclopedia under Creative Commons license, and then taken on the digitization of all 13 volumes of it, and afterwards has continued to develop and fine-tune extremely useful proofreading tools on Armenian Wikisource (see description). I think it is extremely essential to enable these developers to build and enhance such proofreading tools without which proofreading gigantic texts such as encyclopedias could turn into an even more arduous task and take exponentially longer to accomplish. Chaojoker (talk) 06:22, 9 April 2014 (UTC)
Endorse -- He has my full endorsement. We are truly in need for greater tools for easier proofreading of the numerous material that has been uploaded into wikisource, especially for such a large editing tasks as editing the voluminous Armenian Soviet Encyclopedia. I have been already using some scripts created by User:Xelgen that has been quite helpful. Վազգեն (talk) 03:08, 10 April 2014 (UTC)
Endorse -- Useful tools and a team with a track record of getting things done in wiki-projects. ― Teak (talk) 18:42, 10 April 2014 (UTC)
Endorse -- I know both User:Xelgen and User:HrantKhachatrian. Both have improved/developed a lot of useful tools for Wikimedia projects and I think it would be incredible if they would not limit themselves to do it in their spare time. --va c io 06:45, 11 April 2014 (UTC)
Endorse This sounds like a great idea and who knows, it may benefit the wider community of Wikisorcerers eventually. Jane023 (talk) 17:21, 13 April 2014 (UTC)
Endorse -- I am familiar with the work that User:Xelgen has done in Wikipedia and Wikisource projects and I fully endorse this proposal. -Սահակ (talk) 01:17, 14 April 2014 (UTC)
Endorse -- Just the zoom and crop functionality by themselves would be great tools and worth the funding. --Ainali (talk) 20:34, 16 April 2014 (UTC)
Endorse -- Wanting almost this same functionality for digitized dictionary books having probably some more typographical complexities. Looking forward to test runs with such books. --Purodha Blissenbach (talk) 09:19, 17 April 2014 (UTC)
Endorse -- See my comments on the talk page.--Micru (talk) 14:13, 18 April 2014 (UTC)
Endorse -- I am fully endorsement this project. It will my Bangal Wikisource too.Jayantanth (talk) 14:30, 18 April 2014 (UTC)
Endorse -- Although see my comments on the talk page. the wub "?!" 22:20, 19 April 2014 (UTC)