Grants:IEG/Tamil OCR to recognize content from printed books

status: not selected
Tamil OCR to recognize content from printed books
summary: Creating a trainer GUI for Tesseract OCR. It helps to create an accurate Tamil OCR (and OCR for other languages) and to recognize content from printed books (from images).
target: Tamil wikisource (also for all languages)
strategic priority: improve quality
theme: tools
amount: 18000 USD
grantee: kbalavignesh
contact: kbalavignesh@gmail.com
volunteer: Commons sibi
created on: 20:15, 28 September 2014 (UTC)
2014 round 2


Project idea

What is the problem you're trying to solve?

Much of the world's content in non-English languages is still contained in printed format from books and other print materials. Because these works are usually digitized as image files, and because most non-English languages lack an efficient OCR, it is difficult to extract the text from these works.

What is your solution?

  • Tesseract is a free, open-source OCR engine (Apache License 2.0). It can read a wide variety of image formats and convert them to text in over 60 languages. However, Tesseract suffers from poor accuracy. Its accuracy can be improved with a larger volume of training material and more trainers.
  • Creation of a Trainer GUI for Tesseract OCR will allow a greater number of trainers to submit materials and improve the accuracy of Tesseract OCR for various fonts. This GUI will help both technical and non-technical users to more easily train Tesseract for their font(s). This trained data can be shared, reused, and updated by many contributors.
  • By expanding the base of trainers, the GUI will eventually improve Tesseract OCR's accuracy for all submitted fonts. The development of a more robust Tesseract OCR will help Wikipedia editors to more easily add content from printed books/images.

How it differs from other tools?

Existing tools focus only on bounding-box creation. We propose the following methods to make this a complete solution.

1. Online GUI tool to allow for submissions from a larger community of trainers.

2. Any errors produced by Tesseract OCR will be corrected by the trainers. The error correction will be used for training, which we refer to as 'feedback-based continuous training'.

3. The text in many types of printed books will be similar, due to identical publishers or same font styles. These works can be grouped and categorized into the same repository of training data.
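
The 'feedback-based continuous training' idea in point 2 could be sketched as follows. This is only an illustration, not the proposed tool's design: the function name `corrections` is hypothetical, and comparing strings with Python's `difflib` is an assumed way of finding what the trainer changed. The (wrong, right) pairs it returns are the kind of data that could be fed back into training.

```python
from difflib import SequenceMatcher

def corrections(ocr_text, corrected_text):
    """List (wrong, right) substrings where the trainer changed the OCR output.

    Hypothetical helper: each pair marks a spot where the recognizer erred
    and could become a new training sample.
    """
    matcher = SequenceMatcher(None, ocr_text, corrected_text, autojunk=False)
    return [
        (ocr_text[i1:i2], corrected_text[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replaced, inserted, or deleted spans
    ]

# Made-up example: the OCR read "m" as "rn".
print(corrections("Tarnil OCR", "Tamil OCR"))
```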

Sample Use Cases

This tool will provide the following features:

1. Training characters for a specific font

2. Maintaining a repository of trained data

3. Recognizing text from images

Use Case 1

If users want to recognize text, they can check whether the font is available in the trained data repository. If trained data already exists, they can simply start using it.

Use Case 2

If trained data does not exist, users start training the font. They continue training until all characters are covered in the required quantity (the same character may need to be trained many times).
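
The stopping rule in Use Case 2 can be sketched as a simple coverage check. This is a minimal sketch under stated assumptions, not the proposed tool's implementation: the function name `missing_coverage`, the `required` threshold, and the sample character set are all hypothetical.

```python
from collections import Counter

def missing_coverage(trained_chars, charset, required=3):
    """Return {char: samples still needed} for characters below the threshold.

    trained_chars: characters already labelled in training images.
    charset: characters the font must cover (hypothetical sample below).
    required: assumed minimum number of samples per character.
    """
    counts = Counter(trained_chars)  # missing characters count as 0
    return {c: required - counts[c] for c in charset if counts[c] < required}

# Hypothetical example: three Tamil vowels, two samples required for each.
charset = ["அ", "ஆ", "இ"]
trained = ["அ", "அ", "ஆ"]
print(missing_coverage(trained, charset, required=2))
```

Training for the font would be considered complete once the returned dictionary is empty.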

Use Case 3

If users want to run Tesseract on their own system, they can train with the proposed web tool (online) and download the trained data for their own use. This way, users also contribute to the public repository.

Use Case 4

If users want to train on their local system (offline), they can download this tool and set it up locally. They can train the data and then share the image and bbox files with the online repository. (See 'How the trained data will be stored?')

How the trained data will be stored?

The basic elements of trained data are the image and its bbox file. These will be stored, and the trained data will be generated dynamically from them as needed. Advantages of this method:

1. We can easily create combinations of trained data.

2. It is easy to add minor changes to existing training.

3. Users can contribute (merge) their offline training.

For example, if a user's images use font1 and font2, they can pick only these two fonts and generate trained data for them.
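
The storage scheme above can be sketched roughly as follows. This is an illustration only: the repository layout, the file names, and the helper names `parse_box_line` and `select_pairs` are hypothetical. The six-field line layout (symbol, left, bottom, right, top, page) is the box file format used by Tesseract 3.x training.

```python
from typing import NamedTuple

class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int

def parse_box_line(line):
    """Parse one box file line: symbol left bottom right top page."""
    symbol, left, bottom, right, top, page = line.split()
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))

def select_pairs(repository, fonts):
    """Collect the (image, box file) pairs for the chosen fonts only."""
    missing = [f for f in fonts if f not in repository]
    if missing:
        raise KeyError(f"fonts not in repository: {missing}")
    return [pair for f in fonts for pair in repository[f]]

# Hypothetical repository: font name -> list of (image, box file) pairs.
repository = {
    "font1": [("font1_page1.tif", "font1_page1.box")],
    "font2": [("font2_page1.tif", "font2_page1.box")],
    "font3": [("font3_page1.tif", "font3_page1.box")],
}
print(select_pairs(repository, ["font1", "font2"]))
print(parse_box_line("அ 10 20 42 58 0"))
```

Because only the image and box pairs are stored, the selected pairs can be passed to Tesseract's training steps whenever a particular font combination is requested.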

Project goals

Creation of an efficient, free, open-source Tesseract OCR GUI for contributors to easily upload content from printed books/images. This will increase the efficiency of uploading, increase the number of contributors/trainers, increase the breadth of materials, and decrease the amount of time required to improve the Tesseract OCR software. Initially, this will target the Tamil language. The interface and techniques can be easily customized and used for all other languages supported by Tesseract OCR.

Project plan

Activities

Creation of a web-based application to train the Tesseract OCR. The text contained in each uploaded image can be trained, and trained data can be shared/utilized by the entire community.

Budget

Hardware and hosting cost - 2400 USD

Human resources (4 roles: Coordinator, Back-End Developer, Front-End Developer, Tester): 40 person-hours/week at 15 USD/hour for 6 months (26 weeks) - 15600 USD

Total = 18000 USD
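
The totals above can be checked with a few lines of arithmetic. The variable names are illustrative, and the per-person figure assumes the 40 person-hours are split evenly across the four roles, which the proposal does not state.

```python
# Figures taken from the budget above.
hours_per_week = 40      # person-hours per week for the whole team
rate_usd_per_hour = 15
weeks = 26               # 6 months
hardware_hosting_usd = 2400

personnel_usd = hours_per_week * rate_usd_per_hour * weeks
total_usd = personnel_usd + hardware_hosting_usd

# Assumption: an even split across the four roles.
hours_per_person_per_week = hours_per_week / 4

print(personnel_usd, total_usd, hours_per_person_per_week)
```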

Community engagement

The demo version of the product will be released during the third month of development to the community. Further features and development will be carried out based on user feedback.

Technology Stack

  • Tesseract OCR on a Linux box
  • For the web application: PHP, HTML, CSS, JavaScript (jQuery and Backbone), PostgreSQL
  • For image manipulation: ImageMagick library

Sustainability

  • Initially, this application will be built for the Tamil language, but it can be customized and used for other languages because the training steps are common.
  • Trained data for a specific font/language can be used by many users.
  • Trained data will be updated continuously, which will improve the quality of recognition.
  • Trained data can be downloaded and used by users.

Measures of success

1. The number of different fonts trained for recognition.
2. Amount of content added from images by using OCR.
3. Accuracy of recognized text.

Get involved

Participants

  • Coordinator/Developer Kbalavignesh - 10 years of experience developing web-based applications using the LAMP stack.
  • Developer Aarthi Devi - Research scholar
  • Front-End Developer Prasath
  • Reviewer Tshrinivasan, Arun Kumar

Community notification

Notified here - https://ta.wikipedia.org/wiki/%E0%AE%B5%E0%AE%BF%E0%AE%95%E0%AF%8D%E0%AE%95%E0%AE%BF%E0%AE%AA%E0%AF%8D%E0%AE%AA%E0%AF%80%E0%AE%9F%E0%AE%BF%E0%AE%AF%E0%AE%BE:%E0%AE%86%E0%AE%B2%E0%AE%AE%E0%AE%B0%E0%AE%A4%E0%AF%8D%E0%AE%A4%E0%AE%9F%E0%AE%BF#Grants:IEG.2FTamil_OCR_to_recognize_content_from_printed_books

Endorsements

Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project below! (Other constructive feedback is welcome on the discussion page).

  • I am also working with the Tesseract engine. The engine has immense potential. With many books/encyclopaedias being brought into the public domain, the OCR would save hundreds of man-hours, and content could be brought directly into Wikisource and, if needed, Wikipedia. Commons sibi (talk) 06:27, 30 September 2014 (UTC)
  • Since Tamil has a good amount of public domain literary works, Wikisource can immediately benefit from this technology. This technology will have a huge payoff for not just Wikisource, but Tamil computing in general. Sundar (talk) 11:01, 30 September 2014 (UTC)
  • Retired Professor V. Krishnamoorthi of Anna University had developed a fairly advanced OCR, after many years of experimentation, but it is not an open-source product. It may be worth consulting him. A need for Tamil OCR is very much there. --C.R.Selvakumar (talk) 12:54, 30 September 2014 (UTC)
  • There are few commercial efforts in Tamil OCR. But they are not ready to open their efforts for public. There is a very high need in the Tamil Society for an open source OCR. --Tshrinivasan (talk) 18:17, 30 September 2014 (UTC)
  • Tamil Virtual University gave their Encyclopedia and Children Encyclopedia to Tamil Wiki Projects. Including Encyclopedia and Children Encyclopedia Most of their files are in Image PDF's. We need OCR tool to convert the Image PDF's to contents automatically. Then we can easily edit and add that in our Tamil wikisource. In future Tamil Virtual University may give all their nationalized books under Creative Commons license. As I said all the nationalized books are also in Image Level PDF's. தென்காசி சுப்பிரமணியன் (talk) 23:07, 30 September 2014 (UTC)
  • I support the initiative --Kurumban (talk) 01:32, 1 October 2014 (UTC)
  • I'd like to endorse this initiative as Tamil wikisource will tremendously benefit from this project. --Mayooresan (talk) 03:25, 2 October 2014 (UTC)
  • It will help to develop new contents for Tamil Wiki Mohammed Ammar (talk) 08:17, 3 October 2014 (UTC)
  • Excellent initiative. --Sodabottle (talk) 09:35, 4 October 2014 (UTC)
  • While I endorse such effort, there have been similar projects done in Telugu in way much less time and without such cost. Please check https://code.google.com/p/cowboxer/, https://code.google.com/p/pytesseracttrainer/, http://vietocr.sourceforge.net/training.html. Please justify how better, and different your tool would be from the above. --Rahmanuddin (talk) 04:30, 5 October 2014 (UTC)
  • @Rahmanuddin Please check the section 'How it differs from other tools?' for details. - kbalavignesh
  • Existing OCR mechanisms for Indian languages are not very good with accuracy. I believe this deserves all the support it can get. ----Rsrikanth05 (talk) 11:46, 7 October 2014 (UTC)
  • Any effort from the community to build support for Indian languages in FOSS-based OCR systems is commendable. But what we primarily lack is annotated training data sets. A web-based training tool is a good idea anyway. But make sure that you release annotated training data sets as well; they are the most useful output of a training initiative. It would also be good if you could throw a little more light on the technology stack for the application. AniVar (talk) 05:14, 10 October 2014 (UTC)
  • We need plenty of OCR support for Indian languages, so this is a welcome initiative. Could you please explain how the customising done on Tesseract for Tamil would translate into other language OCRs if we are not to duplicate the effort and the cost? --Tniranjana (talk) 10:48, 17 October 2014 (UTC)
  • @Tniranjana Even though we focus only on the Tamil language to narrow the scope, the tool can be customized for other languages with very little effort. Most of the Tesseract training options will be brought into the GUI. This means we only need to change the settings to adapt it to other languages. For example, it may need the following changes to settings: 1. Translating the graphical interface text 2. Defining the other language's character set 3. Listing ambiguous characters. --balavignesh (talk) 10:32, 3 November 2014 (UTC)
  • This project is very valuable to the Tamil/Indian Wiki communities, and other similar like minded communities, specially to help digital preservation. I support this project as a wiki user and as a technical lead of the Noolaham Foundation. We will provide test materials (tiff images) and staff support as can be effectively used by this Team. --Natkeeran (talk) 13:32, 17 October 2014 (UTC)
  • Support the project idea. Some suggestions on the talk page. Best wishes!--Visdaviva (talk) 12:23, 27 October 2014 (UTC)