Research talk:Expanding Wikipedia articles across languages/Inter language approach

Two oddities


Checking specific language pairs, we can see that no language covers more than 20% of the English content, while English itself covers more than 40% of the content of more than 15 languages. These results suggest that English is a good source, but a bad target, for cross-lingual recommendations.

I don't understand the line of thought here. The more relevant number would be the articles or sections present in the source language but missing from the target language. I believe there is a lot of such content (that can be translated) present in all languages, and I don't see why any language would be an especially good or bad source language based just on coverage.

Machine translation algorithms rely on parallel data.

Data-driven algorithms do, expert-driven not necessarily.

--Nikerabbit (talk) 14:06, 6 November 2018 (UTC)
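To make the two numbers in this exchange concrete, here is a minimal sketch of how coverage and the "present in source, missing from target" count differ. It assumes each language is represented as a set of Wikidata item IDs that have an article in that language; the function names and the toy data are hypothetical and are not taken from the research page.

```python
def coverage(source, target):
    """Share of the target language's items that also exist in the source language."""
    if not target:
        return 0.0
    return len(source & target) / len(target)


def translatable(source, target):
    """Items that have an article in the source language but not in the target."""
    return source - target


# Hypothetical toy data: item IDs with an article in each language.
articles = {
    "en": {"Q1", "Q2", "Q3", "Q4", "Q5"},
    "es": {"Q1", "Q2", "Q6"},
}

es_covers_en = coverage(articles["es"], articles["en"])
en_covers_es = coverage(articles["en"], articles["es"])
es_only = translatable(articles["es"], articles["en"])
en_only = translatable(articles["en"], articles["es"])
```

In this toy data the smaller language covers little of the larger one, yet it still holds an item the larger one lacks, which is the distinction raised in the comment above.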

Because we use sitelinked articles (articles linked to the same Wikidata item in other languages) to recommend sections, we can't use this strategy when an article in one language does not exist in any other language. Therefore, if most of the articles in one language are unique (not existing in other languages), this strategy won't work well for that language. We are already working on solutions to this problem, like the "userMoreLike" feature that we have already introduced. Still, those recommendations won't be as good as when an article about the exact Wikidata item exists in another language.

Regarding your last comment about "expert-driven" translation algorithms, I couldn't find any reference to such algorithms. I'd be happy to learn about them, especially if they don't require parallel data. Can you please add a link or reference?

Diego (WMF) (talk) 23:52, 16 April 2019 (UTC)
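The sitelink limitation described in the reply above can be illustrated with a small sketch. This is not the project's actual implementation: the function `section_candidates`, the data layout, and the assumption that section titles are already aligned to language-independent labels are all hypothetical simplifications.

```python
def section_candidates(item_id, target_lang, sitelinks, sections):
    """Suggest sections that other-language versions of an item have
    but the target-language version lacks.

    sitelinks: item ID -> set of languages with an article on that item
    sections:  (item ID, language) -> list of section labels, assumed to be
               already aligned across languages (a simplifying assumption)
    """
    langs = sitelinks.get(item_id, set())
    existing = set(sections.get((item_id, target_lang), []))
    candidates = {}
    for lang in sorted(langs - {target_lang}):
        missing = [s for s in sections.get((item_id, lang), []) if s not in existing]
        if missing:
            candidates[lang] = missing
    return candidates


# Hypothetical toy data.
sitelinks = {"Q1": {"en", "es", "fr"}, "Q2": {"es"}}
sections = {
    ("Q1", "en"): ["history", "geography", "economy"],
    ("Q1", "es"): ["history"],
    ("Q1", "fr"): ["history", "culture"],
    ("Q2", "es"): ["history"],
}

linked = section_candidates("Q1", "es", sitelinks, sections)
unique = section_candidates("Q2", "es", sitelinks, sections)
```

An item that exists in only one language, like `Q2` here, yields no candidates at all, which is why a language with mostly unique articles benefits less from this strategy.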

"Expert-driven" is another term for rule-based engines like Apertium, which are built on dictionaries and linguistic rules created by linguists. While parallel data can help to build these systems, they do not rely on parallel data in the same way as statistical and neural systems, which cannot be created without sufficient corpora of texts in different languages. --Nikerabbit (talk) 13:23, 2 May 2019 (UTC)
