Grants:Project/Wikipedia Cultural Diversity Observatory (WCDO)/Midpoint

Report under review

This Project Grant midpoint report has been submitted by the grantee, and is currently being reviewed by WMF staff. You may add comments, responses, or questions to this report's discussion page.

To read the approved grant submission for this project, please visit Grants:Project/Wikipedia Cultural Diversity Observatory (WCDO).

Review the reporting requirements to better understand the reporting process.
Review all Project Grant reports under review.
Please email projectgrants@wikimedia.org if you have additional questions.

Welcome to this project's midpoint report! This report shares progress and learnings from the Individual Engagement Grantee's first 3 months.

Summary

The WCDO project reaches its midpoint with a robust proposal for obtaining the content that represents the cultural and geographical context of every Wikipedia language edition. The first phase, the selection of content, required some extra tasks in order to provide reliable results (e.g. the introduction of a new language-territories database and the implementation of machine learning techniques). As for the third phase (dissemination), the project has been presented at different venues (e.g. WikiIndaba) and the content selection methodology has been submitted to an academic journal.

Methods and activities

The WCDO project is organized into 3 phases: 1) Content Selection, 2) Website creation and 3) Project Dissemination. Although it makes sense to tackle the phases sequentially, most of the activities conducted so far have been aimed at content selection (Phase 1), as it presented numerous challenges. Nonetheless, in order to progress and disseminate the project among both communities and academia, some actions have already been undertaken, such as attending events and publishing an academic paper.

The code developed for both Phase 1 and Phase 2 is available on GitHub. We are pleased to report that the first phase is completed and has been expanded beyond the initial requirements.

Phase 1

We set up the different article selection strategies using geolocation tags, keywords, Wikidata properties and links, expanding the original method published in previous research.
We created Wikipedia_language_territories_mapping_quality.csv, a language-territories mapping (territories where a language is spoken today because it is official or indigenous) with ISO 3166 and ISO 3166-2 codes, territory names in the corresponding languages and in English, and demonyms, among other fields. Task finished.
We ran a series of manual assessments of the CCC (Cultural Context Content) article selection with several volunteers in order to determine the selection's precision and recall.
We created several functions and set up the different mechanisms so that the process is automated and runs once a month.
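
As a rough illustration of the manual assessment step above, precision and recall can be computed by comparing the automatic selection with the volunteers' judgement. This is a minimal sketch, not the project's code: the function name and the article titles are hypothetical.

```python
def precision_recall(selected, relevant):
    """Return (precision, recall) of an automatic selection.

    selected: articles the automatic selection marked as CCC.
    relevant: articles volunteers judged to belong to CCC.
    """
    selected, relevant = set(selected), set(relevant)
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up sample for illustration only.
selected = {"Sardana", "Castell", "Montserrat", "Eiffel Tower"}
relevant = {"Sardana", "Castell", "Montserrat", "Correfoc"}
precision, recall = precision_recall(selected, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this sample one selected article is a false positive and one relevant article was missed (a false negative), so both measures come out at 3/4.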

Phase 2

We evaluated different visualization frameworks in order to visualize some parts of the CCC datasets and decided on Bokeh.
We selected the different graphs to be used and made a preliminary implementation that needs to be adapted to the data from Phase 1.

Phase 3

We presented at WikiIndaba. The talk (pdf) is about African languages (their statistics and current situation) and the potential of WCDO to help them spread their content across languages.
We presented WCDO at the Pre-Hackathon and the Hackathon, among other events.
We provided some content selections in order to help specific activities (Catalan Culture Challenge; dataset of geolocated articles).

Midpoint outcomes

Language-Territories Mapping. https://github.com/marcmiquel/WCDO/blob/master/language_territories_mapping/Wikipedia_language_territories_mapping_quality.xlsx

Content Selection for all languages, with false positives and false negatives of about 5% each. Some results are available here.

Women and men included as categories (not foreseen in the project proposal), so that it is possible a) to compute the gender gap percentage in Cultural Context Content and in the entire Wikipedia, and b) to provide lists of prominent women from each culture (e.g. Catalan women: https://github.com/marcmiquel/WCDO/blob/master/sample_recommendation_lists_catalan_wikipedia/100_ccc_female_vital_articles_edits_ca_fr.html).

Paper with the main methodology published in an indexed journal with an impact factor, 'Frontiers in Physics', under the title 'Wikipedia Culture Gap: Quantifying Content Imbalances Across 40 Language Editions'. Authors: Marc Miquel-Ribé and David Laniado. DOI:10.3389/fphy.2018.00054

Presentation to the community at WikiIndaba 2018.
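
The gender gap percentage mentioned among the outcomes above can be illustrated with a short sketch. It assumes the gap is reported as the share of women among biographies of women and men; the report does not give the exact formula, and the function name and the counts below are made up.

```python
def women_share(women, men):
    """Percentage of women among biographies of women and men."""
    total = women + men
    if total == 0:
        raise ValueError("no biographies to compare")
    return 100.0 * women / total


# Made-up counts for one language edition, for illustration only.
ccc_share = women_share(women=150, men=850)          # within CCC
overall_share = women_share(women=2000, men=8000)    # entire edition
print(f"CCC: {ccc_share:.1f}% women, whole Wikipedia: {overall_share:.1f}% women")
```

Computing the same figure for the CCC subset and for the whole edition makes the two directly comparable.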

Finances

Our finances are on track and are available at: https://meta.wikimedia.org/wiki/Grants:Project/Wikipedia_Cultural_Diversity_Observatory_(WCDO)/Finances

Learning

The main challenge we faced was automating the whole process, which meant dealing with different sorts of technical bottlenecks (RAM, disk, etc.), as the data was retrieved from different sources (Wikidata dump, MySQL wikireplicas). We moved to a VPS server after exhausting all the other technological alternatives. This has been time-consuming, as the code has often been rewritten in order to be functional and efficient. WMF technical members have been very helpful in finding solutions to technical problems (I want to mention Chico Venancio, who also endorsed the project and became a team member).
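
One common way to keep memory use flat when reading a large dump is to stream it line by line instead of loading it whole. The sketch below is an assumption about how this could be done, not the project's actual code: it relies on the Wikidata JSON dump layout (a single JSON array with one entity per line) and on property P625 (coordinate location) to find geolocated items. The function names and the sample lines are hypothetical.

```python
import json


def iter_entities(lines):
    """Yield one entity dict per dump line, skipping the array brackets."""
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


def geolocated_ids(lines):
    """Return the ids of entities that carry a P625 claim."""
    return [
        entity["id"]
        for entity in iter_entities(lines)
        if "P625" in entity.get("claims", {})
    ]


# Tiny in-memory stand-in for a dump file (made-up content).
sample_dump = [
    "[",
    '{"id": "Q1", "claims": {"P625": [{"rank": "normal"}]}},',
    '{"id": "Q2", "claims": {}}',
    "]",
]
print(geolocated_ids(sample_dump))
```

With a real dump, the same functions can be fed an open file object, for example `bz2.open(path, "rt", encoding="utf-8")`, so that only one entity is parsed at a time.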

The second most important challenge was dealing with 300 language editions at the same time. While we knew it was possible to do this automatically without actually understanding the languages, the criteria for selecting the content needed to be revised from our previous approaches. This is why we spent more time analyzing the problem and expanded the initial method with several strategies employing geolocation tags, keywords, Wikidata properties and article links. We created a 'rosetta stone' database of the territories where a language is spoken and native, or spoken and official, with the ISO 3166 and ISO 3166-2 country and region codes, and the demonym and territory name extracted from Wikidata and Ethnologue (a reliable source commonly used by sociolinguists). Ultimately, in order to make the final selection of articles, we implemented a machine learning approach using a classifier. This was not foreseen at the beginning of the project, but employing a threshold, as we had previously expected to do, was not giving satisfactory results (many articles were selected that should not have been part of the Cultural Context Content).
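
To illustrate why a classifier that combines several signals can do better than a threshold on a single one, here is a toy sketch. It is not the project's method: the report does not name the classifier, so a nearest-centroid classifier is used only because it fits in a few lines, and the three features and all the values are hypothetical.

```python
def centroid(rows):
    """Mean feature vector of a list of equally sized rows."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train(positives, negatives):
    """'Training' here is just computing one centroid per class."""
    return centroid(positives), centroid(negatives)


def is_ccc(model, features):
    """Assign the article to the class with the nearest centroid."""
    ccc_centroid, other_centroid = model
    to_ccc = squared_distance(features, ccc_centroid)
    to_other = squared_distance(features, other_centroid)
    return to_ccc < to_other


# Hypothetical features per article:
# (geolocated in a mapped territory, keyword in title, share of inlinks from CCC)
ccc_examples = [(1, 0, 0.2), (0, 1, 0.3), (1, 1, 0.9), (1, 0, 0.6)]
other_examples = [(0, 0, 0.1), (0, 0, 0.3), (0, 0, 0.0), (0, 0, 0.4)]

model = train(ccc_examples, other_examples)

# Baseline: a single threshold on the inlink share alone.
THRESHOLD = 0.25
threshold_errors = (
    sum(1 for f in ccc_examples if f[2] < THRESHOLD)
    + sum(1 for f in other_examples if f[2] >= THRESHOLD)
)
classifier_errors = (
    sum(1 for f in ccc_examples if not is_ccc(model, f))
    + sum(1 for f in other_examples if is_ccc(model, f))
)
print(threshold_errors, classifier_errors)
```

In this made-up sample no single cut-off on the inlink share separates the two groups, while the combination of features does. The errors are counted on the same examples used to build the centroids, so this only illustrates the idea and says nothing about accuracy on unseen articles.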

Another challenge was keeping in touch with the different groups of people providing key information for the project: researchers and community leaders. Even though some events were not scheduled in the project proposal (such as the pre-Hackathon or WikiIndaba), they turned out to be really useful for communicating the project's value and receiving feedback on aspects such as visibility (this is what pushed us to drop the external website and host the observatory on Meta, where it might be more visible). Meetings with researchers, by contrast, are scheduled on a weekly basis in order to discuss the technical and research problems (content selection criteria, machine learning methods and manual content assessment) and to write with the aim of publishing the results.

The previous challenges, especially the first two, were underestimated in the initial project proposal (other researchers have considered the project ambitious when it was explained to them). As a result, time management in the face of technical difficulties has not been easy, as some tasks are dependent on others and it is important to have several at hand to keep progressing. For this reason, we scheduled additional tasks such as a) manual assessment of the quality of the results, b) meetings and e-mails with community leaders asking for their opinion, and c) attending community events to present preliminary results, among others. Most of them are listed in the project timeline.

Next steps and opportunities

Outlined below are the steps the project intends to accomplish in its second half, which is entirely dedicated to publishing the graphs and tables so that the observatory becomes a useful tool for Wikipedians (Phase 2) and to disseminating it across the communities (Phase 3):

Phase 2: Create the website "Wikipedia Cultural Diversity Observatory"

Create the tables and graphs for each Wikipedia language edition on a monthly basis.
Publish them on meta.wikimedia.org/wiki/Wikipedia_Cultural_Diversity_Observatory with an automated script.
Upload the datasets to a public repository as well as to http://wcdo.wmflabs.org/datasets/.

Phase 3: Disseminate the observatory & community engagement

Attend the Wikimania conference and present the project results.
Promote the project to all the existing interlanguage collaboration initiatives.
Document all the different methods employed, scripts and overall project details.

Grantee reflection

This project aims at spreading cultural diversity and is based on a long-term project I have been working on for 8 years now (which became my PhD thesis). Considering that it will all become public, online and available, I am happy to give my best so that the methods, results and datasets are of higher quality. It is such a new and amazing sensation to think about 300 language editions. Since this project starts from a very strong personal motivation, every time I communicate it to Wikimedians I feel happy to share the results and see how it moves past research and becomes useful (so far, through the preliminary results I contributed to the Catalan Wikipedia community, Serbian editors, editors from African communities, etc.). I do not want to miss the chance to thank all the people who helped me by sharing their knowledge and time, starting with all the project collaborators mentioned on the page, but also WMF members like Marti, Janice, Dumi and Zack, among others. I can't wait to meet you again at Wikimania.