Grants:IEG/Proofreading semiautomatically the Catalan Wikipedia with LanguageTool

status: selected
Proofreading semiautomatically the Catalan Wikipedia with LanguageTool
summary: Proofread the whole Catalan Wikipedia using the results of the grammar checker LanguageTool, with the help of scripts and appropriate supervision.
target: Catalan Wikipedia
strategic priority: improving quality
theme: tools
amount: 3,000 EUR
grantee: Jaumeortola
contact: jaumeortola(_AT_)gmail.com
created on: 15:51, 22 September 2015 (UTC)
round 2 2015



Project idea

What is the problem you're trying to solve?

Proofreading large amounts of text is a daunting task, but it is very much needed to improve the quality of some Wikipedia pages.

What is your solution?

The objective can be achieved by combining technological tools with linguistic knowledge and intuition, so that human supervision is minimized (but not eliminated). The process involves the following steps:

  • Analyze the whole Catalan Wikipedia using the open-source proofreading program LanguageTool.
  • Filter and sort the results of the LanguageTool analysis.
  • Supervise the filtered and sorted results. This is the non-automatic part of the process.
  • Apply the selected results in the corresponding Wikipedia articles with the help of a bot.
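The filtering, sorting and selection steps above can be illustrated with a minimal sketch. It is not the project's actual code: the `Match` record, its field names and both helper functions are hypothetical, and the checker output is assumed to have already been parsed into such records. The idea shown is that identical corrections are grouped and ranked by frequency, so that a human reviewer can approve or reject the most common ones first, and only approved corrections are passed on to the bot.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """One suggestion from the grammar checker (hypothetical record layout)."""
    article: str     # title of the page where the issue was found
    rule_id: str     # identifier of the rule that fired
    original: str    # text flagged as an error
    suggestion: str  # proposed replacement


def rank_corrections(matches):
    """Group identical corrections and sort them, most frequent first."""
    counts = Counter((m.rule_id, m.original, m.suggestion) for m in matches)
    # Descending count, then alphabetical order for a stable review list.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def select_edits(matches, approved):
    """Keep only the matches whose correction a human reviewer approved."""
    return [
        m for m in matches
        if (m.rule_id, m.original, m.suggestion) in approved
    ]
```

A reviewer would go through the output of `rank_corrections`, collect the accepted `(rule_id, original, suggestion)` triples into a set, and pass that set to `select_edits` to obtain the list of edits for the bot.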

Project goals

The process described above has been tested with a certain degree of success over the last two years. With the knowledge acquired, we now want to achieve these goals:

  • To make the process faster, so that a Wikipedia of the size of the Catalan one can be proofread in significantly less time.
  • To complete the proofreading of the whole Catalan Wikipedia.
  • To rewrite and document the code so it can be used by other people in other languages.

Project plan

Activities

  • Significantly improve the filtering and selection of the LanguageTool analysis results. This includes minimizing wikitext parsing problems and filtering out sentences in other languages (such as quotations, titles and bibliography entries) or in non-standard language (archaic or dialectal). These improvements can be made before, during and after the analysis.
  • Create auxiliary tools for the non-automatic supervision: blacklists, etc.
  • Do the non-automatic supervision of the results. This will be used to evaluate the success of the previous filtering steps. The LanguageTool rules can also be updated and improved when necessary.
  • Document the code so it can be used by other people in other languages.
  • Test the process in at least one more language besides Catalan. Announcements will be made to reach potential collaborators willing to take the lead in their own Wikipedias.
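As an illustration of the pre-analysis filtering and the blacklists mentioned in the activities above, here is a minimal sketch. It is an assumption about how such a step could look, not the project's implementation: the regular expressions cover only a few common wikitext constructs (references, templates, internal links), and the function names `strip_wikitext` and `filter_flagged` are hypothetical.

```python
import re

# Self-closing <ref ... /> tags, or <ref ...>...</ref> pairs.
REF_RE = re.compile(r"<ref[^>]*?/>|<ref[^>]*>.*?</ref>",
                    re.DOTALL | re.IGNORECASE)
# Innermost templates only: {{...}} with no braces inside.
TEMPLATE_RE = re.compile(r"\{\{[^{}]*\}\}")
# Internal links: [[target]] or [[target|label]]; the visible text is kept.
LINK_RE = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")


def strip_wikitext(text):
    """Remove markup that tends to cause false positives in the checker."""
    text = REF_RE.sub("", text)
    # Repeat until stable so nested templates disappear from the inside out.
    while True:
        stripped = TEMPLATE_RE.sub("", text)
        if stripped == text:
            break
        text = stripped
    text = LINK_RE.sub(r"\1", text)
    # Collapse the whitespace left behind by the removed markup.
    return re.sub(r"\s+", " ", text).strip()


def filter_flagged(flagged_items, blacklist):
    """Drop flagged strings that appear in the blacklist (case-insensitive)."""
    blocked = {entry.casefold() for entry in blacklist}
    return [item for item in flagged_items if item.casefold() not in blocked]
```

Real wikitext has many more constructs (tables, galleries, nested links in image captions), so a production version would more likely rely on a dedicated wikitext parser than on regular expressions alone.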

Budget

  • Project development: 3,000 EUR (for six 40-hour work weeks)
  • Total Budget: 3,000 EUR

(I dropped a budget allocation of 250 EUR to cover server infrastructure costs based on the committee recommendation to use WMFlabs instead. If WMFlabs is not a feasible substitute, I will need this allocation.)

Community engagement

We'll survey our target community at the start and at the end of the period.

Sustainability

The developed code will be available for continued use in Catalan Wikipedia, and it will be easily adaptable to other languages. Collaborators will be needed in order to use it in other Wikipedias.

Measures of success

  • Success can be measured by the number of edits made to Catalan Wikipedia articles. It will be on the order of hundreds of thousands. As a rough estimate, I will make at least 400,000 edits.
  • A test (without edits) is made in another language (preferably one with good support in LanguageTool). Edits should be made only by active and trusted members of the target Wikipedia.
  • Code and documentation are available on GitHub. Documentation is posted in the final report for the IEG grant, and announcements are made to reach other Wikipedias.

Get involved

Participants

Jaume Ortolà (has made more than 600,000 spelling and grammar corrections to Catalan Wikipedia using the bot Langtoolbot; maintainer of the Catalan language in the LanguageTool grammar checker; maintainer of several dictionaries and tools for the Catalan language).

Community Notification

Please paste links below to where relevant communities have been notified of your proposal, and to any other relevant community discussions.

Catalan Wikipedia: Taverna

Endorsements

Do you think this project should be selected for an Individual Engagement Grant? Please add your name and rationale for endorsing this project in the list below. (Other constructive feedback is welcome on the talk page of this proposal).

  • Spelling and grammar corrections made so far are awesome. I appreciate the effort to do this with semiautomatic supervision. Vriullop (talk) 07:20, 1 October 2015 (UTC)