WikiCorpus is a proposal for a new Wikimedia project to build a multilingual parallel corpus that could be used in statistical machine translation.

Comparison with existing solutions

  • Wiktionary: covers single words; WikiCorpus will have sentence-to-sentence translations.
  • Wikipedia: a good, large knowledge base, but corresponding articles in different Wikipedias are generally written from scratch rather than translated.
  • European Parliament Proceedings Parallel Corpus 1996-2011: a good corpus with 20 languages plus English, but it covers only specific topics.
  • OpenSubtitles: less suitable as a translation corpus, for example, because its translations are not accurate.
  • OmegaWiki: has expressions, but often only single words.

WikiCorpus

WikiCorpus should contain short, important article texts from the English Wikipedia: about 300 words (15 long sentences, or more if the sentences are shorter), perhaps a little more, but certainly fewer than 500 words. Real, stable English Wikipedia articles such as Linux or Theory_of_relativity will be divided into parts, because translation into another language must be easy and atomic. Each article starts with its English part. Once translations appear, the English source should no longer be edited. Translations must be accurately aligned sentence by sentence, or even half-sentence by half-sentence if a sentence is long. It would also be good if the English version were encoded in Universal Networking Language (UNL), which would make the knowledge available in UNL format.
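The splitting and alignment rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `split_into_parts` and `AlignedPart` are hypothetical, the sentence splitter is deliberately naive (it breaks after `.`, `!` or `?`), and the proposal itself does not prescribe any data model.

```python
import re
from dataclasses import dataclass, field

MAX_WORDS = 500  # upper bound on part length suggested in the proposal


def split_sentences(text: str) -> list[str]:
    # Naive assumption: a sentence ends with ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_into_parts(text: str, max_words: int = MAX_WORDS) -> list[list[str]]:
    """Group whole sentences into parts of at most max_words words.

    A single sentence longer than max_words still becomes its own part.
    """
    parts, current, count = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and count + n > max_words:
            parts.append(current)
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        parts.append(current)
    return parts


@dataclass
class AlignedPart:
    """One English part plus its sentence-aligned translations (hypothetical model)."""
    source: list[str]
    translations: dict[str, list[str]] = field(default_factory=dict)

    def add_translation(self, lang: str, sentences: list[str]) -> None:
        # Enforce the one-to-one sentence alignment the proposal asks for.
        if len(sentences) != len(self.source):
            raise ValueError("translation must align one-to-one with the source")
        self.translations[lang] = list(sentences)

    def pairs(self, lang: str) -> list[tuple[str, str]]:
        # Sentence pairs in the form a statistical MT toolkit could consume.
        return list(zip(self.source, self.translations[lang]))
```

Rejecting translations whose sentence count differs from the source is one simple way to keep the corpus aligned; half-sentence alignment would need a finer segmentation than this sketch provides.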

For example, the sentence "Le petit prince s’assit sur une pierre et leva les yeux vers le ciel:" from the French "Le Petit Prince" ("The little prince sat down on a stone, and raised his eyes toward the sky.") is tagged in UNL as:

{unl}
agt(to sit down(obj>posture):18.@past.@entry, Petit Prince:12)
plc(to sit down(obj>posture):18.@past.@entry, rock(icl>natural object):36.@indef.@on)
and(to raise(icl>move):33.@past.@toward, to sit down(obj>posture):18.@past.@entry)
agt(to raise(icl>move):33.@past.@toward, Petit Prince:12)
obj(to raise(icl>move):33.@past.@toward, eye(icl>sense organ):46.@pl)
plc(to raise(icl>move):33.@past.@toward, sky(icl>atmosphere):82)
{/unl}

An annotation of this kind would allow the text to be analyzed by machine.
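To illustrate what machine analysis of such an annotation might look like, here is a minimal Python sketch that reads relations in the textual form shown above. It is not an official UNL parser: the names `Node`, `Relation` and `parse_unl` are hypothetical, and the code assumes one relation per line, each with exactly two arguments of the form `headword(constraint):id.@attr`.

```python
import re
from typing import NamedTuple


class Node(NamedTuple):
    headword: str
    constraint: str    # e.g. "icl>move"; empty when absent
    node_id: str       # e.g. "18"; empty when absent
    attributes: tuple  # e.g. ("past", "entry")


class Relation(NamedTuple):
    label: str         # e.g. "agt", "plc", "obj"
    source: Node
    target: Node


_REL = re.compile(r"^(\w+)\((.*)\)$")
_NODE = re.compile(r"^(?P<uw>.+?)(?::(?P<id>\w+))?(?P<attrs>(?:\.@\w+)*)$")
_UW = re.compile(r"^(?P<head>[^()]+?)\s*(?:\((?P<constraint>[^()]*)\))?$")


def _split_top_level(body: str) -> tuple:
    # Split at the first comma that is not nested inside parentheses.
    depth = 0
    for i, ch in enumerate(body):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            return body[:i].strip(), body[i + 1:].strip()
    raise ValueError(f"expected two arguments in: {body!r}")


def parse_node(text: str) -> Node:
    m = _NODE.match(text)
    uw = _UW.match(m["uw"]) if m else None
    if uw is None:
        raise ValueError(f"cannot parse node: {text!r}")
    attrs = tuple(a for a in m["attrs"].split(".@") if a)
    return Node(uw["head"], uw["constraint"] or "", m["id"] or "", attrs)


def parse_unl(text: str) -> list:
    relations = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line in ("{unl}", "{/unl}"):
            continue  # skip blank lines and the enclosing tags
        m = _REL.match(line)
        if m is None:
            raise ValueError(f"cannot parse relation: {line!r}")
        left, right = _split_top_level(m[2])
        relations.append(Relation(m[1], parse_node(left), parse_node(right)))
    return relations


SAMPLE = """{unl}
agt(to sit down(obj>posture):18.@past.@entry, Petit Prince:12)
plc(to sit down(obj>posture):18.@past.@entry, rock(icl>natural object):36.@indef.@on)
and(to raise(icl>move):33.@past.@toward, to sit down(obj>posture):18.@past.@entry)
agt(to raise(icl>move):33.@past.@toward, Petit Prince:12)
obj(to raise(icl>move):33.@past.@toward, eye(icl>sense organ):46.@pl)
plc(to raise(icl>move):33.@past.@toward, sky(icl>atmosphere):82)
{/unl}"""

for rel in parse_unl(SAMPLE):
    print(rel.label, rel.source.headword, "->", rel.target.headword)
```

Once the relations are available as structured records, questions such as "who is the agent of each action?" reduce to filtering on `label == "agt"`.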

Similar proposals

Alternative names

Proposed by