Wikisource Handbook/OCR

Optical Character Recognition (OCR) is the process of converting text in a scanned image (PDF, DjVu, JPG, etc.) into Unicode characters. For Indic languages, no software capable of accurate OCR was available until mid-2015, when Google released an OCR service for Indic languages. Indic Wikisource communities have been using that service since then.

OCR4Wikisource

OCR4Wikisource is free and open-source software developed by T. Shrinivasan et al. for Linux users to automate bulk Google OCR through the Google Drive API. The software will:

Step 1: Download the book from Wikimedia Commons
Step 2: Split the file into individual pages
Step 3: Upload the pages to Google Drive one by one for OCR
Step 4: Download the OCRed text
Step 5: Upload the text to the respective Wikisource pages
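
The five steps above can be sketched as a per-page work plan. This is a minimal illustration, not OCR4Wikisource's actual code: the function and field names and the file naming patterns are hypothetical, and the `Page:<file name>/<page number>` title pattern is the usual Wikisource convention for page-level transcription, assumed here rather than taken from the tool.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PageJob:
    """One page of a book as it moves through the five steps."""
    page_number: int
    split_file: str        # file produced by step 2 and uploaded in step 3
    text_file: str         # OCR text downloaded in step 4
    wikisource_title: str  # page that receives the text in step 5


def plan_pages(commons_file: str, page_count: int) -> list[PageJob]:
    """Build the per-page jobs for a book downloaded in step 1.

    All naming patterns here are assumptions made for illustration.
    """
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    stem = Path(commons_file).stem
    jobs = []
    for n in range(1, page_count + 1):
        jobs.append(PageJob(
            page_number=n,
            split_file=f"{stem}_page_{n:05d}.pdf",
            text_file=f"{stem}_page_{n:05d}.txt",
            wikisource_title=f"Page:{commons_file}/{n}",
        ))
    return jobs


if __name__ == "__main__":
    for job in plan_pages("Example book.pdf", 3):
        print(job.split_file, "->", job.wikisource_title)
```

Planning the jobs up front keeps each page's scan file, OCR text, and target page tied together, so a failed upload can be retried for a single page without repeating the whole book.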

It is recommended to create a bot account to run this script.

To install the script, first download the zip file from this link.[1]

Step 1: Download the zip file from the above link
Step 2: Extract the OCR4wikisource-master folder from the zip file and keep it in the Home directory.
Step 3: Open a terminal with the shortcut Ctrl+Alt+T.
Step 4: Type the following commands:

cd OCR4wikisource-master
bash ./setup.sh

Drive configuration and running OCR

Step 1: Go to this address [2] and create a new project.
Step 2: Activate the Google Drive API and the Fusion Tables API.
Step 3: Go to the Credentials menu and then to the OAuth consent screen, where you must enter a name in the "Product name shown to users" field.
Step 4: Create credentials by selecting OAuth client ID.
Step 5: Set the Application type to Other and give it any name.
Step 6: Download the JSON file and copy it to the OCR4wikisource-master folder.
Step 7: Rename the JSON file.
Step 8: Open the terminal to download and install another tool from this address [3] by typing the following commands.

sudo apt-get install python-pip
sudo pip install google-api-python-client
sudo pip install gdcmdtools

Step 9: Run this command: gdauth.py client_secret_file name.json
Step 10: Running this command prints a web link in the terminal. Click the link, then click the Allow button; a new page opens with a token.
Step 11: Copy the token and paste it into the terminal. The API is then configured.
Step 12: Now go to the OCR4wikisource-master folder, open the config.ini file, and fill it in accordingly.
Step 13: Open the terminal and run the following command: python do_ocr.py
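
Before running do_ocr.py, it can help to confirm that the credentials file and config.ini are in place. The following is a hedged sketch, not part of OCR4Wikisource: the function names are hypothetical, the check assumes that the downloaded OAuth client file wraps its fields in a top-level "installed" (or "web") object containing client_id and client_secret, and it assumes config.ini uses standard INI section headers. It makes no assumption about which keys config.ini defines and only reports options left empty; the keys in the sample data are invented for the demonstration.

```python
import configparser
import json


def check_client_secret(json_text: str) -> list:
    """Return a list of problems found in an OAuth client secret file."""
    try:
        data = json.loads(json_text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top level is not a JSON object"]
    # Assumption: "Other" (desktop) clients use the "installed" key,
    # web clients use "web".
    section = data.get("installed") or data.get("web")
    if not isinstance(section, dict):
        return ['missing top-level "installed" or "web" object']
    return [f"missing field: {key}"
            for key in ("client_id", "client_secret")
            if not section.get(key)]


def empty_config_options(ini_text: str) -> list:
    """Return "section.option" names whose value was left blank."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    blanks = []
    for section in parser.sections():
        for option, value in parser.items(section):
            if not value.strip():
                blanks.append(f"{section}.{option}")
    return blanks


if __name__ == "__main__":
    # Sample data only; the key names below are hypothetical.
    sample_secret = '{"installed": {"client_id": "abc", "client_secret": ""}}'
    sample_ini = "[settings]\nfile_url =\nusername = ExampleBot\n"
    print(check_client_secret(sample_secret))
    print(empty_config_options(sample_ini))
```

In practice the two functions would be given the contents of the renamed JSON file and of config.ini, read from the OCR4wikisource-master folder.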

Note: this software runs only on Linux. Also check whether the Google Drive API supports your language.

Google OCR tool

The Google OCR tool adds a toolbar button in the Page namespace that derives text from the current page's image via the OCR service of Google's Cloud Vision API^[1]. Check which languages this service supports [4]. Click the button to get the OCRed text for each Wikisource page.
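
To illustrate what a call to the Cloud Vision OCR service involves, the sketch below builds the JSON body for the service's images:annotate REST method. It is not the gadget's own code: the function name and the sample language code are hypothetical, and sending the request (which needs an API key or other credentials) is deliberately left out.

```python
import base64
import json


def build_annotate_request(image_bytes: bytes, language_hints=None) -> dict:
    """Build a Cloud Vision images:annotate body for one page image."""
    request = {
        # The image is sent inline as base64-encoded bytes.
        "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
        # DOCUMENT_TEXT_DETECTION is the feature aimed at dense text.
        "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
    }
    if language_hints:
        # Optional hint, e.g. ["bn"] for Bengali (sample value only).
        request["imageContext"] = {"languageHints": list(language_hints)}
    return {"requests": [request]}


if __name__ == "__main__":
    body = build_annotate_request(b"fake image bytes", ["bn"])
    print(json.dumps(body, indent=2))
```

The body would be sent as a POST to https://vision.googleapis.com/v1/images:annotate, and the recognised text is expected in the fullTextAnnotation.text field of the first response entry.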

Note: OCRed text is not 100% accurate. Manual proofreading is needed to correct the errors.

References

^[1] https://cloud.google.com/vision/
