Grants:Project/Automatic Extraction of Multi-lingual Text and Concept Similarity

Automatic Extraction of Multi-lingual Text and Concept Similarity
Summary: We propose to develop a similarity score for text sections in different languages, based on a combination of content, activity and graph structure. This method will be used to A) detect similar concepts in different languages, B) detect parallel sections of text in different languages, and C) detect missing sections in articles in a given language.
Target: Comparison between Wikipedias in different languages.
We aim to help Wikipedia editors and users find and compare concepts across languages when there is no direct link, or when the content of the wiki page is incomplete or significantly different from the corresponding content in other languages.

We will use a combination of text analysis, graph measures and machine learning techniques to infer similarity metrics between texts in different languages. This method will be based on a comparison of the Word2Vec content representation in different languages, the structure of the hyperlink network in the ego-network surrounding a text section, and the similarity of the page-view time series. Using a combination of similarity scores, we will find the most similar candidate for a text section in any proposed language, and compare the sections of the same text in other languages.
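The combination described above can be illustrated with a minimal sketch. It is not the project's implementation: it assumes the Word2Vec vectors of both languages have already been mapped into a shared space, that ego-network neighbors are expressed as language-independent identifiers (for example via existing interlanguage links), and that page-view series cover the same dates. The field names `vector`, `neighbors` and `views`, and the equal weights, are hypothetical placeholders; in the proposal the weights would be learned.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard(neighbors_a, neighbors_b):
    """Fraction of shared neighbors between two ego-networks."""
    a, b = set(neighbors_a), set(neighbors_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pearson(x, y):
    """Pearson correlation between two equal-length page-view series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def combined_similarity(sec_a, sec_b, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of content, graph and page-view similarity.

    The equal weights are an assumption made for illustration only.
    """
    w_text, w_graph, w_views = weights
    return (w_text * cosine(sec_a["vector"], sec_b["vector"])
            + w_graph * jaccard(sec_a["neighbors"], sec_b["neighbors"])
            + w_views * pearson(sec_a["views"], sec_b["views"]))


def most_similar(query, candidates):
    """Return the candidate section with the highest combined score."""
    return max(candidates, key=lambda c: combined_similarity(query, c))
```

Calling `most_similar(section, candidates)` with a section from one language and a list of candidate sections from another returns the best-scoring candidate under these assumptions.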

We will use a combination of measures and machine learning techniques to infer a similarity metric between texts in different languages. This will be useful for: A. Suggesting the concepts most similar to a concept you know in a given language. B. Suggesting missing translations of Wikipedia entries (i.e. missing articles), or of sections in an entry, in different languages. C. Detecting erroneous links between concepts in different languages.

The output for content similarity will be a Markov model matrix representing the probability of observing a given Word2Vec representation in one language as a function of a Word2Vec representation in another language. The graph similarity will be available through an app that builds the ego-graph in a given language and provides links to all concepts in a different language that share a high enough fraction of neighbors in the ego-graph. The page-view similarity will be provided as an app reporting the correlation in total page views between pairs of Wikipedia pages. The end product will be an app that receives a Wikipedia page and a target language, and compares the sections of the page across the two languages if a parallel Wikipedia page exists. Otherwise, it will show the most related pages in the target language. In case of high dissimilarity between the two languages, the app will raise a flag. All of the elements above will be freely available to Wikimedia editors and users through an open website. The algorithm, the tools and the datasets created over the course of the program will be disseminated via scientific publications and made available to the public.
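As a rough illustration of the Markov model matrix mentioned above, the sketch below estimates a row-stochastic matrix of conditional probabilities from counts. The proposal does not say how continuous Word2Vec vectors become discrete states, so this example assumes each section's vector has already been assigned to a cluster label, and that a set of section pairs known to be parallel is available for counting. The function name `estimate_transition_matrix` and its input format are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_transition_matrix(aligned_pairs):
    """Estimate P(target cluster | source cluster) from aligned section pairs.

    aligned_pairs: iterable of (source_cluster, target_cluster) labels, one
    per pair of sections assumed to be parallel across the two languages.
    Returns a nested dict in which each row sums to 1.
    """
    counts = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        counts[src][tgt] += 1

    matrix = {}
    for src, row in counts.items():
        total = sum(row.values())
        # Normalize raw co-occurrence counts into conditional probabilities.
        matrix[src] = {tgt: n / total for tgt, n in row.items()}
    return matrix
```

Each row of the result can be read as a distribution over representations in the second language given a representation in the first; smoothing for unseen pairs is left out of this sketch.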

We believe that the availability of such a tool will contribute to closing the gap in content quality across Wikipedias in different languages, improve content neutrality by reducing omission bias, and help coordinate editorial activity. Our current estimates suggest that hundreds of thousands of concepts covered in some languages will be identified and suggested for introduction into other languages. In addition, we hope that this project will help multilingual Wikipedia users access richer information (e.g. complementing readings in one language with content from another).

This project will be executed by data science research groups in two universities. Each group will employ a full-time Ph.D. student who will be supervised by the PIs. We will leverage our expertise in the analysis of Wikipedia content in different languages and of Wikipedia editing history, as well as our platform that offers methods to perform computations with Wikipedia page views. The first 9 months of the project will be devoted to developing the analytical methods necessary to accomplish the project. The last 3 months of the project will focus on implementing the web API and a website that will make the tool available to the public.


Ph.D. student scholarships (20,000 USD)

We are in constant communication with Wikimedia Israel and will test the app with them. We also plan to publish the methods in a scientific journal to maximize impact. Finally, the algorithm, the tools and the datasets created over the course of the program will be disseminated via scientific publications and made available to the public.

Yoram Louzoun (yoraml) and Lev Muchnik (LevMuchnik at gmail com) have been studying Wikipedia for several years, with the first research published in 2007. Both have experience in machine learning, text mining and graph analysis, and lead large research groups in this domain. They now plan to use their expertise to propose new tools and methods for Wikipedia users. The proposed project is the first Wikipedia research project proposed by this group, but they have extensive experience in collaboration with other institutions, in the analysis of large datasets, and in the development of machine learning-based applications. The Ph.D. students involved in this analysis are experienced in Big Data and in the construction of web APIs and websites that offer analytical tools to the public (e.g.:

List of the relevant publications

  • L. Muchnik, R. Itzhack, S. Solomon, and Y. Louzoun, “Self-emergence of knowledge trees: Extraction of the Wikipedia hierarchies,” Physical Review E, vol. 76, no. 1, p. 16106, 2007.
  • M. Kämpf, S. Tismer, J. W. Kantelhardt, and L. Muchnik, “Fluctuations in Wikipedia access-rate and edit-event data,” Physica A: Statistical Mechanics and its Applications, vol. 391, no. 23, pp. 6101–6111, Jul. 2012.
  • M. Kämpf, J. W. Kantelhardt, and L. Muchnik, “From Time Series to Co-Evolving Functional Networks: Dynamics of the Complex System ‘Wikipedia,’” in ECCS 2012, 2012.
  • L. Muchnik, S. Pei, L. C. Parra, S. D. S. Reis, J. S. Andrade, S. Havlin, and H. A. Makse, “Origins of power-law degree distribution in the heterogeneity of human activity in social networks,” Scientific Reports, vol. 3, p. 23, Apr. 2013.
  • H. Brot, L. Muchnik, and Y. Louzoun, “Directed triadic closure and edge deletion mechanism induce asymmetry in directed edge properties,” The European Physical Journal B, vol. 88, no. 1, p. 12, Jan. 2015.
  • H. Brot, L. Muchnik, J. Goldenberg, and Y. Louzoun, “Evolution through bursts: Network structure develops through localized bursts in time and space,” Network Science, vol. 4, no. 3, pp. 293–313, Sep. 2016.

