Grants:Project/WCDO/Culture Gap Monthly Monitoring/Timeline

This project is funded by a Project Grant

Timeline for WCDO

Timeline	Date
Publish the Midpoint Report	15 August 2019
Publish the Final Report	30 January 2020

Monthly updates

March 2019

We debugged the process of collecting Cultural Context Content (CCC).
We participated in the DHASA (Digital Humanities Association of Southern Africa) edit-a-thon, organized by DNdubane_(WMF) at the University of Pretoria, with a short online presentation titled "Wikipedia Cultural Diversity Observatory project" (March 27th).
We started creating the lists of Top CCC articles on several topics (folk, monuments, earth, music creations and organizations, sports and teams, food, paintings, glam, books, clothing and fashion, and industry).
We adapted the project meta site (https://meta.wikimedia.org/wiki/Wikipedia_Cultural_Diversity_Observatory) for the new phase.
We located several databases covering all the world's languages (e.g. Ethnologue, WALS) and studied how the languages overlap in the territories where they are spoken in order to detect languages with a marginalized status.
We prepared the organizational documents, spreadsheets, and code in order to tackle the new research and development phase of the project.

April 2019

We finished creating the lists of Top CCC articles on several topics (folk, monuments, earth, music creations and organizations, sports and teams, food, paintings, glam, books, clothing and fashion, and industry).
We created a language territories database (languages_territories.db) extending the file Wikipedia_language_territories_mapping_quality.csv and other files. It covers the more than 6,000 languages spoken in the world, and we computed which of them overlap in the same territories.
We started writing a paper about editor participation in Cultural Context Content in order to explain how important representing the local context is for a Wikipedia to function well.
We studied different ways to promote cultural diversity within the Wikimedia movement and wrote a document about a “Cultural Diversity Maturity Model” for communities.

We presented the WCDO project at the Seminario DigiDoc abril 2019 (Universitat Pompeu Fabra, Barcelona, Catalonia) as “Wikipedia Cultural Diversity Observatory: un caso de aplicación práctica del análisis de datos para mejorar la diversidad cultural en la Wikipedia” (slides in Commons).
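
The territory overlap computation mentioned in this month's update can be illustrated with a minimal sketch. It is not the project's actual code: the sample data, the variable language_territories and the function territory_overlaps are all hypothetical, and the real database covers thousands of languages rather than four.

```python
from itertools import combinations

# Hypothetical sample data: language code -> set of territory codes where it is spoken.
# The real languages_territories.db is far larger; these values are only illustrative.
language_territories = {
    "ca": {"ES-CT", "ES-VC", "AD"},
    "es": {"ES-CT", "ES-VC", "MX"},
    "lg": {"UG"},
    "en": {"UG", "GB"},
}


def territory_overlaps(lang_terr):
    """Return {(lang_a, lang_b): shared territories} for each pair sharing at least one."""
    overlaps = {}
    # Sorting the codes makes the pair keys deterministic (alphabetical order).
    for a, b in combinations(sorted(lang_terr), 2):
        shared = lang_terr[a] & lang_terr[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


overlaps = territory_overlaps(language_territories)
for (a, b), shared in overlaps.items():
    print(a, b, sorted(shared))
```

Pairs of languages that share a territory are the candidates for studying coexistence and, combined with status information, possible marginalization.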

May 2019

We made a first version of the code to retrieve, store and process the data related to a) editors, b) images and c) missing CCC.
We wrote and sent a chapter named “The Sum of Human Knowledge? Not in One Wikipedia Language Edition” for the book “Wikipedia@20”.
Marc joined the Diversity Working Group of the 2030 Strategy and became a representative for the group at Wikimania.

June 2019

We published the CCC Dataset for the general public and the research community and presented it at the ICWSM conference in Munich, June 11-13th (Program). Reference: Miquel-Ribé, M., & Laniado, D. (2019). Wikipedia Cultural Diversity Dataset: A Complete Cartography for 300 Language Editions. Proceedings of the 13th International AAAI Conference on Web and Social Media (pdf). ICWSM. ACM.
We did an analysis of the world languages according to their geographical extent, their social status and number of speakers in order to determine both the coexistence in a territory and the situations of language marginalization.
We hit a performance bottleneck with the MySQL database replicas and had to rewrite the functions in several different ways in order to make them work.
Marc has been contributing to the Diversity Working Group with recommendations aimed at expanding the horizons of the observatory and addressing other issues in the diversity area.
We received feedback and made extensive changes and edits to the chapter for “Wikipedia@20” and participated in the reviewing process of other chapters.

July 2019

We attended the Wikimedia conference Celtic Knot and presented “Languages Matter to Cultural Diversity: Finding Missing Languages and Bridging the Gaps in Minority Languages” (slides).
We designed the database and main code for the monthly analysis (stats_generation.py).
We contacted the Wikitech and Analytics teams to consult them about the bottleneck and started rewriting the code of the whole WCDO framework in order to use the SQL dumps (those corresponding to the replica tables).
We finished computing the “Missing CCC” dataset: for every language, it contains the articles that should exist because they belong to its local context but that instead exist only in a language of higher status (e.g. articles on Uganda that do not exist in Luganda Wikipedia but exist in English Wikipedia).
Marc has been contributing to the Diversity Working Group by writing the recommendations and joining weekly calls, and has met with some members of the WG in Rome.
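
The definition of Missing CCC given in this month's update can be expressed with set operations. The following is only a minimal sketch under stated assumptions: the function missing_ccc, its parameters and the sample article titles are hypothetical and are not taken from the WCDO code base.

```python
def missing_ccc(local_context_articles, exists_in_language, exists_in_other_language):
    """Articles from a language's local context that are absent from its own
    Wikipedia but present in another (higher-status) language edition."""
    return (local_context_articles - exists_in_language) & exists_in_other_language


# Hypothetical sample data for Luganda (local context: Uganda) versus English.
uganda_context = {"Kampala", "Kasubi Tombs", "Lake Kyoga", "Matooke"}
in_luganda_wikipedia = {"Kampala"}
in_english_wikipedia = {"Kampala", "Kasubi Tombs", "Lake Kyoga", "London"}

missing = missing_ccc(uganda_context, in_luganda_wikipedia, in_english_wikipedia)
print(sorted(missing))
```

In this sample, "Matooke" is excluded because it exists in neither edition, and "London" is excluded because it is not part of the local context.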

August 2019

We finished the coding of the stats (stats_generation.py) and partially debugged it (these stats are explained in the file sets_intersections.xls).
We measured the language gap in geolocated articles to evaluate the impact of Wikimania 2018 on the creation of geolocated articles in Africa.
We created the interface to retrieve Missing CCC articles (those about the local content of a language that do not exist in that language but do exist in bigger ones).
We attended the Wikimania conference and disseminated the work of the past months with 4 talks and 2 posters about the Cultural Diversity Observatory.
- Poster: Wikipedia Cultural Diversity Dataset: helping editors to enrich cross-language coverage. This poster explained the dataset.
- Poster: Maturity Levels for Cultural Diversity in Wikipedia Language Communities. This poster explained the different levels.
- Diversity Talk: Wikipedia Cultural Diversity Observatory (WCDO): Empowering Communities to Bridge the Culture Content Gaps. This presentation explained the current state of the project with its new Missing CCC lists and also warned about the lack of impact of Wikimania 2018 on bridging the African content gap (pdf slides and video).
- Language Talk: Minoritized Languages and Missing Languages in Wikipedia: An Opportunity to Increase Cultural Diversity in Wikipedia. This presentation explained that to make Wikipedia more culturally diverse we need more languages (proposed a method to select them) and help minoritized languages to create their content (suggested a method to propose new articles) (pdf slides).
- Readership Talk: Increasing Wikipedia Readership By Creating Local Content In Language Editions. This presentation explained that local content is vital in order to increase a language edition readership and gave some numerical reasons (pdf slides).
- Research Talk: Cultural Diversity Funnels: A Metaphor To Study Wikipedia Communities and Knowledge Gaps. This presentation explained that there exist different barriers that stop cultural diversity representation and proposed the metaphor of a funnel in order to depict it.
We attended meetings in order to offer help to projects such as GLOW and Intercultur, among others.

September 2019

We have generated a new version of the CCC dataset, although it took three times longer than before and some features are still unavailable (those based on the revision table: number of editors, number of edits, etc.).
We have explored the possibility of uploading the Top CCC lists to Wikidata by creating new properties with Alex Stinton and Satdeep.
We made some specific analyses for the Arabic and Egyptian Arabic Wikipedias in order to prepare the presentation for the WikiArabia conference.
We started coding some new data visualizations (topical analyses) that are not available yet.

October 2019

We attended the Wikimedia conference WikiArabia and presented the Talk: The State of Cultural Diversity in Arabic Wikipedia: Insights and Challenges (slides).
We created a tool in order to obtain the images of articles from any Language CCC selection of articles (Visual CCC) https://wcdo.wmflabs.org/visual_ccc_articles/.
We created a tool in order to compare the stats of the articles in one language and in other languages (Incomplete CCC) https://wcdo.wmflabs.org/incomplete_ccc_articles/.
We created a tool in order to search for articles from other languages and see the gaps (Search CCC) https://wcdo.wmflabs.org/search_ccc_articles/.

November 2019

We created some dashboards to test both the generated stats and the different types of graphs available in Plotly (not uploaded yet).
We re-coded some of the functions in order to use the dumps and avoid the replicas.
We started writing/outlining an article in order to disseminate results in a specialized journal.

December 2019

We created some dashboards to test both the generated stats and the different types of graphs available in Plotly (not uploaded yet).
We created a plan for the next potential phases of the Observatory.
We created a summary of the project for the Knowledge Equity Advent calendar initiative from Wikimedia Deutschland.

Is your final report due but you need more time?

Extension request

New end date

31.3.2020

Rationale

As we explained in the Midpoint Report (section What are the challenges), we have had to re-code an important part of the original code because the MySQL replicas now behave differently (with lower performance). This led us to look for different approaches using both the replicas and the XML dumps. However, processing the dumps for the 300 languages has not been successful because the processing time is too high. The replicas' performance has not improved either, and none of the approaches using them was fast enough to retrieve and process the data (it took on the scale of weeks, which made the results totally invalid).

These issues are of vital importance for the project. First, they prevent progress on the development of other fundamental parts of the project (visualizations and analysis) that depend on having the data. Second, without these data processes automated, it is not possible to have the visualizations show up-to-date results. To fix this, we have been contacting several WMF departments to request access to certain services that would fix the problem or provide a much better solution to it. Even though it was not possible to obtain access, we can try a new approach that involves a new dataset that is being released in a few weeks. In order to do this and finish the project with the desired functionalities, we will need to extend it by three more months. This will set the final deadline for the report to the end of March. Please do not hesitate to ask me for further detail on how I would spend these extra months or on any aspect related to the project.

@Marcmiquel: Hi Marc, thanks for this extension request, and for detailing the obstacles you encountered in obtaining access to specific services. This extension request is approved, with a final report due one month later on 30 April 2020. I JethroBT (WMF) (talk) 15:58, 18 December 2019 (UTC)

Extension request

New end date

31.10.2020

Rationale

Extend the project from Cultural Diversity Observatory to Diversity Observatory

The project has been completed, and the tasks designed for 2019 have resulted in different dashboards, datasets, publications, and talks, all available on the project site. During different conversations at Wikimedia events and in the frame of the Wikimedia Strategy 2030 conversations, some participants showed interest in having extended dashboards, tools, and data for some types of diversity that are still missing: ethnic groups, LGBT+ groups, and religious groups, among others.

Following the same approach, we propose to expand the project so that the Cultural Diversity Observatory becomes the Diversity Observatory (a conceptualization for all types of diversity is available here). The tasks and goals proposed for this extension are aimed at expanding the datasets and dashboards and improving the entire framework and website (see here).

@Marcmiquel: This extension of your timeline and additional tasks are formally approved! Your new end date is 10 November 2020, and a final report will be due on 10 December 2020. If you do need more time, especially as the review process for this extension took much longer than expected, please feel free to request additional time. I JethroBT (WMF) (talk) 20:46, 14 July 2020 (UTC)