Community Tech/OCR Improvements
This project aims to improve optical character recognition (OCR) tools on Wikisource. Currently, Wikisource editors use a range of OCR tools in the proofreading process. These tools are very important, but they have many issues. Some of these issues include:
- The tools can be difficult to discover for new users.
- Some tools are broken, inefficient, or unreliable.
- The user experience is unintuitive and uninviting.
- It can be difficult to determine which tool is appropriate for a specific text.
For all of these reasons, users may be discouraged from editing Wikisource. We hope to improve the OCR tools, so that editors can work with greater ease and support. This project was the #2 request from the 2020 Community Wishlist Survey. In the course of this project, we’ll investigate and identify the key issues, collaborate with various communities, and implement solutions that make it easier for volunteers to contribute. We look forward to reading your feedback on the talk page!
Why use OCR tools
For Wikisource, OCR tools are a crucial component of the editor experience. OCR stands for “optical character recognition.” An OCR tool converts an image file with text into machine-encoded text. When the process is complete, the user has a digitized version of the text, which can be edited, searched, and stored electronically. OCR tools are commonly used by many online communities and platforms, including Wikisource.
When editors add books to Wikisource, they typically do the following:
- Upload a file to Wikimedia Commons. The book is usually a PDF or DjVu file, containing images of scanned pages.
- Create an index page (powered by the Proofread Page extension) for the book on Wikisource.
- Proofread the book, page by page:
  - [This is where OCR tools come in] Convert the image into editable, machine-encoded text with an OCR tool.
- Once completed, the user has a newly digitized version of the text.
How to use OCR tools
In Wikisource, OCR tools can be accessed when the user clicks the “Edit” tab on the page.
Once they have clicked “Edit,” they will see the original image file of the text (on the right). Sometimes, the file already has an OCR text layer (shown on the left), such as when it is imported from the Internet Archive, which automatically OCRs some texts, especially for languages with Latin scripts. However, these texts often go through the OCR process again with tools on Wikisource, which may improve the existing text layer. To do this, the user will use the OCR tools (described below) to render the image file as text (displayed on the left).
Sometimes, texts may not have gone through the OCR process at all. In these cases, the user will see the image file (on the right) and a blank section (on the left). Users will use the OCR tools (described below) to render the image file as text (displayed on the left). Once complete, the text is ready for proofreading.
Note that the right and left designations are the opposite for RTL (right-to-left) languages.
It is important to understand that OCR tools do not work for all texts. For example, hand-written manuscripts are usually not supported by OCR tools. This is because handwritten characters are not as standardized as those in computer-generated fonts. In these cases, users typically need to manually type the text as displayed in the image file.
OCR tools available on Wikisource
The OCR Gadget, also known as “basic” OCR, is a widely used OCR tool for Wikisource, originally developed by Phe. It uses Tesseract, an open-source OCR engine sponsored by Google, running on Toolforge, to generate new text. It is part of a wider suite of tools for Wikisource known as phetools, and uses a sophisticated system of speculative pre-processing and caching to deliver fast interactive performance.
The backend uses hOCR, an open standard format for structured OCR output, to communicate with the Gadget. OCR Gadget is considered better than Google OCR (which we’ll describe below) at recognizing text columns. However, it produces more character errors. Additionally, it has limited language support. While OCR Gadget generally supports languages with Latin scripts, it doesn’t generally support Indic languages. For example, it does not support Hindi or Punjabi. The tool also lacks an active maintainer, which has led to long stretches of partial or complete outages in the past.
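To give a sense of what hOCR output looks like, here is a minimal sketch that reads word text and bounding boxes from an hOCR fragment using only the Python standard library. The sample markup, the class name HocrWordParser, and the helper extract_words are illustrative assumptions; this is not the code used by OCR Gadget or phetools.

```python
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from hOCR spans with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []      # list of (text, (x0, y0, x1, y1))
        self._bbox = None    # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._bbox = self._parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an open word span.
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None

    @staticmethod
    def _parse_bbox(title):
        # hOCR stores properties in the title attribute, separated by ';',
        # for example "bbox 36 92 96 116; x_wconf 95".
        for prop in title.split(";"):
            parts = prop.split()
            if len(parts) == 5 and parts[0] == "bbox":
                return tuple(int(p) for p in parts[1:])
        return ()


def extract_words(hocr):
    """Return the words found in an hOCR string, in document order."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hand-written sample for illustration only (not real tool output).
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 600 800'>
  <span class='ocr_line' title='bbox 36 92 210 116'>
    <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
    <span class='ocrx_word' title='bbox 109 92 210 116; x_wconf 91'>world</span>
  </span>
</div>
"""

if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE_HOCR):
        print(text, bbox)
```

Because each word carries its position on the page, a client receiving hOCR can reason about layout (lines, columns) instead of working from a flat string.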
To enable OCR Gadget, go to Special:Preferences. In the Gadgets tab, go to “Editing tools for the Page: namespace,” and select “OCR: Enable OCR button in Page: namespace.” Once enabled, OCR Gadget can be accessed in the toolbar (see screenshot example of the grey-colored “OCR” icon). Upon clicking on the icon, an OCR-ed version of the text should appear on the left side. The user can then proceed to proofread that version of the text.
In 2016, the Community Tech team developed Google OCR, which was wish #25 in the 2015 Community Wishlist Survey. The Google OCR tool was meant to address the lack of Indic language support in Tesseract-based OCR systems, such as OCR Gadget. This new OCR tool used the Cloud Vision API provided by Google.
With the development of Google OCR, Wikisource editors could receive OCR support for the following languages: Multilingual Wikisource, Arabic, Assamese, Bulgarian, Bengali, English, Spanish, Hindi, Kannada, Marathi, Malayalam, Neapolitan, Odia, Russian, Sanskrit, Tamil, Telugu, and Gujarati. However, some Indic languages that had active Wikisource communities were not included. You can read the full list of languages supported by the Google Vision API.
Generally, Google OCR is considered to be a rather accurate OCR tool. However, it sometimes fails to recognize text in columns properly, in which case lines from adjacent columns are interleaved in the output.
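The interleaving problem can be illustrated with a small sketch. It assumes each recognized line comes with the x and y position of its top-left corner and that the page has equal-width columns. The function names and the sample data are hypothetical, and this is only a picture of the failure mode, not how Google OCR or OCR Gadget order their output.

```python
def naive_order(lines):
    """Read lines strictly top to bottom, ignoring columns.

    On a multi-column page this interleaves the columns.
    """
    ordered = sorted(lines, key=lambda line: (line[1], line[0]))
    return [text for _x, _y, text in ordered]


def column_order(lines, page_width, columns=2):
    """Read each column top to bottom, starting with the leftmost column.

    Assumes equal-width columns, which real pages often do not have.
    """
    col_width = page_width / columns

    def key(line):
        x, y, _text = line
        column = min(int(x // col_width), columns - 1)
        return (column, y)

    return [text for _x, _y, text in sorted(lines, key=key)]


# Hypothetical two-column page, 600 pixels wide: (x, y, text) per line.
LINES = [
    (50, 100, "Left 1"), (350, 100, "Right 1"),
    (50, 130, "Left 2"), (350, 130, "Right 2"),
]

if __name__ == "__main__":
    print(naive_order(LINES))        # columns interleaved
    print(column_order(LINES, 600))  # left column first, then right
```

Real pages need actual layout analysis (uneven columns, headings spanning columns, images), which is why column handling differs so much between OCR engines.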
To enable the Google OCR Gadget, go to Special:Preferences. In the Gadgets tab, go to “Editing tools for the Page: namespace,” and select “Google OCR: Enable the Google OCR button to submit the page image to Google's OCR service.” Once enabled, Google OCR can be accessed in the toolbar (see screenshot example below) by clicking on the tri-color “OCR” icon. Upon clicking on the icon, an OCR-ed version of the text should appear on the left side. Alternatively, you can go to the website directly and add in the image for single-image usage (but this is primarily used for non-Wikisource purposes).
In 2018, IndicOCR was developed by Jay Prakash, a volunteer developer. IndicOCR uses Google Drive, which uses a different OCR back-end than Cloud Vision. The tool was meant to address the limitations of Google OCR by providing support for a wider range of Indic languages, including Bengali, Bhojpuri, Gujarati, Hindi, Kannada, Maithili, Malayalam, Marathi, Nepali, Odia, Punjabi, Sanskrit, Tamil, Telugu, and Urdu. Some of these languages (such as Urdu) do not yet have Wikisource communities, but this OCR tool could provide support for such communities in the future.
To enable IndicOCR, you can add the following code to your local wiki common.js page.
If you want to add an extra button in the Visual Editor, also add the following code to your local wiki common.js page.
Once enabled, it is identified by the text analysis icon (which looks like a magnifying glass over text) in the toolbar (see screenshot example below). Upon clicking on the icon, an OCR-ed version of the text should appear on the left side. Alternatively, you can go to the website directly and add in the image for single-image usage. For more information, check out the documentation.
OCR4Wikisource, developed by T. Shrinivasan, is a Python script that is set up to run on Linux operating systems. It requires you to provide your password in plain text (on your personal device). The tool will download the book from Wikimedia Commons, split the file into individual pages, upload the pages to Google Drive one by one for OCR, download the OCR-ed text, and upload it to the respective Wikisource pages. This entire process can be run on a personal device, rather than by individually clicking on OCR icons for each page. The end result is that OCR-ed versions of the pages are uploaded directly to Wikisource.
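The page-by-page loop at the heart of such a bulk workflow can be sketched as follows. This is not the OCR4Wikisource code: the function bulk_ocr and its parameters are hypothetical, the download and page-splitting steps are assumed to have already produced one image per page, and the OCR service and the wiki upload are passed in as plain callables so the sketch stays self-contained.

```python
from typing import Callable, Iterable, List, Tuple


def bulk_ocr(
    page_images: Iterable[bytes],
    run_ocr: Callable[[bytes], str],
    save_page: Callable[[int, str], None],
) -> List[Tuple[int, str]]:
    """OCR every page in order and save the text for each page.

    page_images: one image per page, already split from the book file.
    run_ocr:     sends one image to an OCR service and returns its text.
    save_page:   stores the text for a 1-based page number.

    Returns a list of (page_number, error_message) for pages that failed,
    so one bad page does not stop the whole book.
    """
    failures = []
    for number, image in enumerate(page_images, start=1):
        try:
            text = run_ocr(image)
        except Exception as error:  # keep going; report the page afterwards
            failures.append((number, str(error)))
            continue
        save_page(number, text)
    return failures


if __name__ == "__main__":
    # Stand-ins for a real OCR service and a real wiki upload.
    saved = {}

    def fake_ocr(image: bytes) -> str:
        return image.decode("ascii").upper()

    failed = bulk_ocr([b"page one", b"page two"], fake_ocr, saved.__setitem__)
    print(saved, failed)
```

Collecting failures instead of stopping at the first error matters for bulk runs, where a single unreadable scan should not discard the work done on hundreds of other pages.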
This is the only bulk OCR upload tool available to users, so some users prefer it. The quality of the OCR is also considered to be rather high. Before IndicOCR was developed, many Indic language Wikisourcers used OCR4Wikisource. You can read more in the documentation.
To enable OCR4Wikisource, you will need to download the zip file from a link (provided in the documentation), and then you’ll need to follow steps within the Terminal to complete the process.
The primary issues with OCR tools
If you are a new Wikisource editor, it can be confusing to first use OCR tools. You may not know that you should use OCR tools. If you do know that you should use OCR tools, you may not know which tools are available or how to access them. The documentation on these processes varies by wiki, and some wikis have more extensive documentation than others. As a result, new editors usually need to directly interact with experienced Wikisource editors to receive this information.
Once users do learn about the OCR tools, there is no simple “quick install.” Rather, different tools require different installation processes. Some can be enabled by checking a box in Preferences. Some are enabled by copying and pasting some code into the common.js page. Others are scripts that need to be run. Overall, discovery and installation are disjointed and often confusing.
Diversity of choices
There are simply too many OCR tools to choose from. Sometimes, a diversity of tools can be a good thing. However, in the case of Wikisource OCR, the range is confusing. This is because all of the tools are meant to accomplish the same thing: a textual rendering of the image file. Consequently, editors shouldn’t need to pick between various tools that look the same, are named similarly, have similar icons, and are designed to, in theory, do the same thing. Instead, editors should have a more streamlined experience, where they can either pick just one tool or at least be guided to the most appropriate tool for their workflow, without needing to conduct research themselves.
Many of the OCR tools don’t work very well. For example, OCR Gadget has been out of service for significant periods of time in the past, and it has suffered from a lack of gadget maintainers. The hOCR tool does not work for non-Latin scripts. Meanwhile, many of the OCR tools have a slew of reported issues, including slow response times and low-quality text output. The tools also struggle with certain kinds of formatting, such as text divided into columns (e.g., in magazine pages). They also have problems with non-Latin characters and diacritics.
- Have we covered all of the main OCR tools used by Wikisource editors?
- Have we covered the major problems experienced when using OCR tools?
- Which OCR tools do you use the most, and why?
- What are the most common and frustrating issues you encounter when using OCR tools?
- Which problems, overall, do you find the most critical to fix, and why?
- Anything else you would like to add?
April 21, 2021
Hello, everyone! We are very excited to share our first project update below:
As a team, we first conducted research on OCR tools for Wikisource, which we shared on this project page. Then, we collected feedback on the talk page. Following this feedback, we decided to establish some project principles. This way, we could have a stronger sense of the project and our goals. The principles are as follows:
- We want to improve the overall experience of OCR tools: Our #1 goal of the project is to improve the OCR experience on Wikisource. This means that we want the tools to be easier to discover and understand for newcomers, and we want the tools to be easier to use effectively for all Wikisource editors.
- We can’t build a new OCR tool: The original wish was entitled “New OCR tool.” Unfortunately, we don’t have the time or resources to build a new OCR tool, which would be an intensive, lengthy project. As a team, we try to take on smaller projects that last a few months, so that we can fulfill multiple wishes per year. However, we can make meaningful improvements to the existing OCR tools.
- We can improve Wikimedia OCR: The Wikimedia OCR tool (formerly known as Google OCR) was developed by the Community Tech team. As a result, we are able to make impactful changes to the tool, and we have already identified some areas for improvement. We have therefore made improving this tool one of our project priorities.
- We can address some major issues: On the project talk page, we heard users share some common pain points related to the OCR experience, including: lack of an easily accessible bulk OCR functionality, minimal support of texts with multiple columns, and other issues. We can’t fix all of the issues, but we will try to at least investigate some of the top issues and see if we can make improvements.
The team has already begun work on the project! Here is what we have completed so far:
- Add Symfony to Wikimedia OCR: During the Ebook Export Improvements project, we added Symfony to Wikisource Export. This turned out to be a good move, since it helped us maintain and improve the tool. We wanted to do the same for Wikimedia OCR.
- Create Toolforge Staging for Wikimedia OCR: We wanted to create a test environment that we (and all of you!) could check out, as we began implementing changes. This is now complete, and you can check it out.
Work in development
- Move Wikimedia OCR to Wikisource Extension: We want to improve the current user experience, which requires that users install or enable multiple separate tools. To do this, we are moving the Wikimedia OCR to the Wikisource Extension. Once this work is complete, all users will be able to see the Wikimedia OCR tool on the Proofread page (with no installation required).
- Note: If wikis don’t want the tool displayed automatically, they can choose to opt out. Also, users will still be able to configure their toolbars with other OCR tools.
- Add Support for Tesseract on Wikimedia OCR: To improve Wikimedia OCR, we have decided to add Tesseract to it. This way, users do not need to install two separate OCR tools via Preferences, since both OCR engines will be available via Wikimedia OCR. This is currently testable on ocr-test.wmcloud.org.
- Accept Google Options on the API: This work is the first stage in being able to improve the quality of OCR for pages containing multiple languages. The final result will apply to both Google and Tesseract engines.
- Improve performance of Tesseract engine: We have identified a way that we can dramatically improve the speed of the Tesseract engine. If we move Tesseract from Toolforge to Cloud VPS, we could see it run much faster (potentially, about 10 times faster!). This work is in progress, and we hope its completion will result in an improved user experience for Wikisourcers.
- Investigate how to improve multiple column issues: Users have shared that Wikisource lacks sufficient OCR support for texts with multiple columns. For this reason, we have launched an investigation to see how this issue could be addressed. So far, we have come up with two potential approaches, and the investigation is in progress.
Work that is coming up
- Add Tesseract options on the API: Through our technical investigation, we learned that Tesseract has many options that may help improve the OCR experience. For example, Tesseract has multiple page segmentation modes, which could help with multiple column support. It also has options to handle multiple languages within one text. For this reason, we want to make some of these options available for an improved editor experience.
- Determine the user experience for choosing OCR engine: Once Wikimedia OCR has two engines (Tesseract and Google Cloud Vision), the Proofread page will need a way for users to choose between them. We will be working on developing a proposal for this experience soon.
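As an illustration of the Tesseract options mentioned above, the sketch below builds a command line for the tesseract program with a page segmentation mode and more than one language. The --psm and -l options are part of the Tesseract command-line interface (Tesseract 4 and later); the helper name tesseract_command, the file name, and the chosen values are assumptions for illustration. How Wikimedia OCR will expose these options through its API is not shown here.

```python
def tesseract_command(image_path, languages=("eng",), psm=3):
    """Build an argument list for the Tesseract command-line program.

    languages: Tesseract language codes; several codes are joined
               with '+', for example ("eng", "hin") becomes "eng+hin".
    psm:       page segmentation mode, 0 to 13. Mode 3 is the default
               (fully automatic); mode 4 assumes a single column of text;
               mode 1 adds orientation and script detection.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    return [
        "tesseract",
        image_path,
        "stdout",                  # write the recognized text to stdout
        "-l", "+".join(languages),
        "--psm", str(psm),
    ]


if __name__ == "__main__":
    # Hypothetical page containing both English and Hindi text.
    print(" ".join(tesseract_command("page.png", ("eng", "hin"), psm=1)))
```

The resulting list can be passed to subprocess.run on a machine where Tesseract and the relevant language data are installed.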
- What are your general thoughts about the project principles?
- How do you feel about our work to make Wikimedia OCR automatically available, with no installation required?
- How do you feel about our work to add Tesseract to Wikimedia OCR?
- Ideally, what user experience do you recommend for choosing an OCR engine when using Wikimedia OCR?
- What do you think of our work to improve the speed of Tesseract?
- Anything else you would like to add?