Research:Improving multilingual support for link recommendation model for add-a-link task
In a previous project we developed a machine-learning model to recommend new links to articles: Research:Link_recommendation_model_for_add-a-link_structured_task
The model is used for the add-a-link structured task. The aim of this task is to provide suggested edits to newcomer editors (in this case, adding links) in order to break down editing into simpler and more well-defined tasks. The hypothesis is that this leads to a more positive editing experience for newcomers and that, as a result, they will keep contributing in the long run. In fact, the experimental analysis showed that newcomers are more likely to be retained with this feature, and that the volume and quality of their edits increase. As of now, the model is deployed to approximately 100 Wikipedia languages.
However, we have found that the model currently does not work well for all languages. After training the model for 301 Wikipedia languages, we identified 23 languages for which the model did not pass the backtesting evaluation. This means that we think the model's performance does not meet a minimum quality standard in terms of the accuracy of the recommended links. Detailed results: Research:Improving multilingual support for link recommendation model for add-a-link task/Results round-1
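At a high level, backtesting compares the model's recommended links against links that editors have already added to articles, which serve as ground truth. The sketch below illustrates the kind of precision/recall computation involved; the function and the example data are illustrative only and not taken from the actual evaluation pipeline.

```python
# Minimal sketch of a backtesting-style evaluation: recommendations are
# scored against links that already exist in the article (ground truth).
# All names and data here are illustrative, not from the production code.

def precision_recall(recommended, ground_truth):
    """Precision and recall of recommended links vs. held-out links."""
    recommended, ground_truth = set(recommended), set(ground_truth)
    true_positives = recommended & ground_truth
    precision = len(true_positives) / len(recommended) if recommended else 0.0
    recall = len(true_positives) / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Example: the model suggests three anchors; two already exist as links.
p, r = precision_recall(
    recommended={"Zurich", "physics", "Nobel Prize"},
    ground_truth={"Zurich", "Nobel Prize", "Einstein", "relativity"},
)
print(p, r)  # 2/3 precision, 0.5 recall
```

A language passes when such scores clear the minimum quality thresholds; the linked results page reports the per-language outcomes.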
In this project, we want to improve the multilingual support of the model. This means we want to increase the number of languages for which the model passes the backtesting evaluation such that it can be deployed to the respective Wikipedias.
We will pursue two different approaches to improve the multilingual support.
Improving the model for individual languages.
We will try to fix the existing model for individual languages. From the previous experiments where we trained the model for 301 languages, we gathered information about potential improvements for individual languages (T309263). The two most promising approaches are:
- Unicode decode error. A Unicode decode error occurs when running wikipedia2vec to create article embeddings as features. This appeared in fywiki and zhwiki (T325521). The issue has been documented in the respective GitHub repository, where a fix has also been proposed; however, it has not been merged yet. The idea would be to implement (or adapt if necessary) the proposed fix.
- Word-tokenization. Many of the languages that failed the backtesting evaluation do not use whitespace to separate tokens (for example, Japanese). The current model relies on whitespace to identify tokens in order to generate candidate anchors for links. Thus, improving the word-tokenization for non-whitespace-delimited languages should improve the performance of the models in these languages. We recently developed mwtokenizer, a package for tokenization in (almost) all languages in Wikipedia. The idea would be to integrate mwtokenizer into the tokenization pipeline.
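The tokenization problem above can be seen with plain whitespace splitting (the toy sentences below are only for illustration, and mwtokenizer's actual API is not shown here since it may differ):

```python
# Whitespace tokenization works for languages like English, but fails for
# languages that do not delimit words with spaces, such as Japanese.

english = "Tokyo is the capital of Japan"
japanese = "東京は日本の首都です"  # roughly: "Tokyo is the capital of Japan"

print(english.split())   # six tokens, one per word
print(japanese.split())  # a single "token": the whole sentence
```

With the whole sentence treated as one token, the model cannot generate reasonable anchor candidates, which is why a language-aware tokenizer is needed for these wikis.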
Developing a language-agnostic model.
Even if we can fix the model for all languages above, the current model architecture has several limitations. Most importantly, we currently need to train a separate model for each language. This brings challenges for deploying the model to all languages, because we would need to train and run more than 300 different models.
In order to simplify the maintenance work, we would ideally like to develop a single language-agnostic model. We will explore different approaches to developing such a model while ensuring the accuracy of the recommendations. Among others, we will use the language-agnostic revert-risk model as inspiration, where such an approach has been implemented and deployed successfully.
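One ingredient of a language-agnostic approach could be features that require no language-specific text processing, for example a link prior P(target | anchor) estimated from counts of existing links, which works the same way on any wiki. The sketch below is illustrative only; the counts are made up and the feature name is not taken from the actual model.

```python
from collections import defaultdict

# Sketch of a count-based link-prior feature: how often a given anchor
# text points to each target page. Such counts can be collected from any
# wiki without language-specific rules. The data below is hypothetical.

anchor_counts = defaultdict(lambda: defaultdict(int))
existing_links = [
    ("Mercury", "Mercury (planet)"),
    ("Mercury", "Mercury (planet)"),
    ("Mercury", "Mercury (element)"),
]
for anchor, target in existing_links:
    anchor_counts[anchor][target] += 1

def link_prior(anchor, target):
    """P(target | anchor) estimated from observed link counts."""
    total = sum(anchor_counts[anchor].values())
    return anchor_counts[anchor][target] / total if total else 0.0

print(link_prior("Mercury", "Mercury (planet)"))  # 2/3
```

A single model trained on such features could in principle serve all wikis, which is the maintenance simplification described above.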
- Gerlach, M., Miller, M., Ho, R., Harlan, K., & Difallah, D. (2021). Multilingual Entity Linking System for Wikipedia with a Machine-in-the-Loop Approach. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 3818–3827. https://doi.org/10.1145/3459637.3481939