WikiConference India 2016/Submissions/Indic Wikisource & Google OCR Co-ordination
- Title of the submission
Indic Wikisource & Google OCR Co-ordination
- Your Username (For the submission author)
jayantanth (Link)
- Type of presentation
Workshop
- Abstract (in about 300 words)
The Indic Wikisource projects were started around 2006-2007. However, because no adequate Indic OCR was available, the projects did not flourish. Typing was not a good solution for these projects, although the Malayalam, Telugu, Tamil and Bengali Wikisource communities kept working by typing books page by page. In January 2015, Google released its multilingual OCR tool in Google Drive. We tested it and found the OCR results to be quite good. Then, in October 2015, Shrinivasan T of Tamil Wikipedia developed a Python script that automates OCR jobs using the Google Drive OCR tool. Using this tool, the Tamil and Bengali Wikisource communities have OCRed about 400554 and 466747 pages respectively (see the latest Indic Wikisource stats below). The other Wikisource communities, however, remain largely inactive because they lack the knowledge needed to use this tool. In this workshop we will teach how to use this tool so that the other Indic language Wikisources can develop.
Prerequisites for this workshop:
- Linux users
- Basic knowledge of editing Wikipedia.
- Basic knowledge of Wikisource.
- A Linux-based laptop.
- A Windows laptop with Oracle VirtualBox (having Linux preinstalled will be an added advantage)
Time required: Minimum of one hour
Requisite knowledge: Knowledge of Linux is helpful but not compulsory.
Target group: All Indic language communities
- Result
Accepted
Last updated in July 2016; full stats are available here.
language | Page: all pages | Page: not proofread | Page: problematic | Page: without text | Page: proofread | Page: validated | Main: all pages | Main: with scans | Main: without scans | Main: disambiguation | Main: percent with scans |
---|---|---|---|---|---|---|---|---|---|---|---|
te | 27706 | 8537 | 28 | 621 | 18520 | 17219 | 11677 | 2839 | 8838 | 0 | 24.31 |
ta | 400554 | 395592 | 2 | 14 | 4946 | 4632 | 4099 | 68 | 4031 | 0 | 1.66 |
ml | 19918 | 11690 | 119 | 284 | 7825 | 668 | 6254 | 670 | 5584 | 0 | 10.71 |
gu | 5570 | 360 | 9 | 107 | 5094 | 2597 | 4824 | 655 | 4169 | 0 | 13.58 |
bn | 466747 | 459703 | 192 | 2909 | 3943 | 1159 | 6828 | 885 | 5913 | 30 | 13.02 |
kn | 10617 | 8659 | 5 | 51 | 1902 | 594 | 6862 | 73 | 6789 | 0 | 1.06 |
or | 4570 | 2764 | 2 | 22 | 1782 | 383 | 362 | 0 | 362 | 0 | 0 |
sa | 4308 | 3486 | 9 | 90 | 723 | 192 | 13851 | 6 | 13845 | 0 | 0.04 |
as | 710 | 307 | 0 | 0 | 403 | 66 | 1314 | 10 | 1304 | 0 | 0.76 |
mr | 1774 | 1749 | 0 | 3 | 22 | 6 | 970 | 1 | 969 | 0 | 0.1 |
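The last column of the table is not defined on this page. The figures are consistent with it being the share of main namespace pages that are backed by scans, with disambiguation pages left out of the total. The sketch below reproduces a few rows under that assumption. The function name `scan_percent` and the formula are inferred from the numbers, not taken from the statistics tool itself.

```python
def scan_percent(with_scans: int, all_pages: int, disamb: int = 0) -> float:
    """Percentage of main namespace pages that are backed by scans."""
    # Assumption inferred from the table: disambiguation pages are not counted.
    countable = all_pages - disamb
    if countable <= 0:
        return 0.0
    return round(100 * with_scans / countable, 2)


# Rows copied from the table: (with scans, all pages, disambiguation pages)
rows = {
    "te": (2839, 11677, 0),
    "bn": (885, 6828, 30),
    "or": (0, 362, 0),
}

for lang, (with_scans, all_pages, disamb) in rows.items():
    print(lang, scan_percent(with_scans, all_pages, disamb))
```

For these three rows the computed values match the table (24.31, 13.02 and 0). The Bengali row matches only when its 30 disambiguation pages are subtracted, which is why the sketch excludes them.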
Interested attendees and comments
- Highly recommended. Every Indian language Wikisource volunteer should attend this. --Ravi (talk) 18:37, 19 July 2016 (UTC)
- --Manojk (talk) 03:30, 20 July 2016 (UTC)
- Csyogi (talk) 10:05, 20 July 2016 (UTC)
- --Balajijagadesh (talk) 02:28, 2 August 2016 (UTC)
- --Kannan Shanmugam (talk) 06:06, 2 August 2016 (UTC)