Wikipedia Diversity Observatory/Cultural Diversity

Framing the problem of Cultural Diversity

“The sum of human knowledge” is not found in a single language but in the cultural diversity of every territory and language in the world. We have to work on very different aspects and align all the Wikimedia movement stakeholders to facilitate the creation of articles that reflect this cultural diversity.

We see this as a two-step process, or two sequential processes: representation and sharing. For each language, the process of representation implies creating content that relates to the geographical and cultural context of its editors. The process of sharing, in turn, implies understanding where the gaps are, both in one's own language and in the others, in order to exchange each other's cultural context content and increase the cultural diversity of all languages.

In order to facilitate cultural context representation, we propose:

Create, collect, process, and present different sorts of metrics and tools to describe the creation and usage of cultural content on Wikimedia projects.
Understand the situation of all the world's languages that could become Wikipedia language editions, and consequently, the potential content about their cultural context they would bring to the entire Wikipedia project.

In order to help each language share (import and export) the cultural context content of all languages, we propose:

Ideate and develop tools that prioritize and allow finding the most valuable content (popular and relevant) that might be essential to be created across projects.
Provide training to organizations and individuals in these tools so that they can help mitigate the knowledge gaps and increase the cultural diversity in Wikimedia projects.

Not all languages are in the same position to achieve good coverage of the world's cultural diversity. Usually, languages represent their own cultural context first and later build the capacity and maturity to create articles about every other language's cultural context. The maturity level of a language edition, in terms of the cultural diversity of its content, can be compared and discussed according to several aspects covered in this preliminary model.

You can learn more about how to improve cultural representation and sharing in these guidelines.

Cultural diversity tools

As an observatory, this project aims to bridge the gap between research and activism rather than to focus on content creation itself. This portal provides some results, but most of the visualizations are located, or better depicted, on an external website (wdo.wmcloud.org) created with Plotly and hosted in Toolforge.

Even though some results are repeated on both sites, those on the external website are preferable because they allow better interaction with the data. For example, the tables in the external version of List of Wikipedias by Cultural Context Content offer a filtering feature that is not available in the on-wiki version of List of Wikipedias by Cultural Context Content.

This project is continually developing research questions, concepts, dashboards, visualizations, and tools.

WCDO's main concepts are Cultural Context Content, Culture Gap, Top CCC Diversity Lists, and Missing CCC articles:

Cultural Context Content (CCC) aka Local Content

Figure 1. By Cultural Context Content (CCC) articles I refer to the articles on a wide range of topics, all related to the editors’ context, that occur in each Wikipedia language edition.

Figure 2. CCC Datasets are a necessary map and a starting point to fight for cultural diversity in each Wikipedia (video explaining them).

Cultural Context Content (CCC) (methodology) is the group of articles in a Wikipedia language edition that relates to the editors' geographical and cultural context (places, traditions, language, politics, agriculture, biographies, events, etc.) (Figure 1). You can watch this YouTube video explaining its creation and use.

In order to create any CCC, it is necessary to establish a language territories mapping, in other words, to pin down the territories where the language is spoken natively or has official legal status.

Cultural Context Content is collected as a group of datasets (Figure 2), which are released on a monthly basis. These datasets are used to compute and depict several statistics on the state of knowledge equality and cross-cultural coverage.

For example, it is possible to consult the extent of CCC in each Wikipedia language edition (List of Wikipedias by Cultural Context Content) or even the number of articles from a particular territory in one language edition's CCC (List of Language Territories by Cultural Context Content).

Culture Gap

The culture gap occurs when a Wikipedia language edition does not cover articles that belong to another language edition's CCC. Around 50% of the articles that are missing across language editions (the language gap) are missing because of the culture gap.

In order to compute the culture gap and other statistics, WCDO proposes calculating the intersections between different sets of articles (e.g. the articles common to all articles from the English language edition and the articles from Japanese CCC). The use of intersections allows seeing the absolute number of shared articles and their extent (the relative importance) in each of the two sets.
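The intersection idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `intersection_extent` and the identifiers in the two sets are hypothetical, and the observatory's actual computation runs over its own datasets and may differ.

```python
def intersection_extent(set_a, set_b):
    """Return the number of shared items and their share of each set."""
    common = set_a & set_b
    return {
        "common": len(common),
        # Relative importance of the shared articles within each set.
        "share_of_a": len(common) / len(set_a) if set_a else 0.0,
        "share_of_b": len(common) / len(set_b) if set_b else 0.0,
    }


# Hypothetical identifiers, used only for illustration.
english_articles = {"Q1", "Q2", "Q3", "Q4", "Q5", "Q6", "Q7", "Q8"}
japanese_ccc = {"Q3", "Q4", "Q9", "Q10"}

result = intersection_extent(english_articles, japanese_ccc)
print(result)
```

In this toy case the two shared articles make up a quarter of the first set but half of the second, which is the kind of asymmetry the spread and coverage views are meant to expose.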

In these two tables, it is possible to see the culture gap in two different ways: first, the spread of a language's CCC across the rest of the Wikipedia language editions and, second, the coverage of all the languages' CCC.

Language culture gap (spread) or CCC spread.
Language culture gap (coverage) and CCC coverage.

Top CCC Diversity lists

Wikipedia language editions should not be a replica of each other and the gap may never be completely closed. However, minimal coverage of all other languages should be a goal on the agenda of each Wikipedia edition to create more multicultural (and complete) encyclopaedias.

Figure 3. Top CCC articles lists are a selection of articles from CCC (by content type, such as gender or geolocation) ranked according to a particular feature (number of pageviews, number of editors contributing to it, etc.). This is a useful way to find some relevant articles to bridge the gap.

Top CCC articles lists can help in providing content for this minimal cultural coverage. Inspired by the Vital articles lists, the Top CCC articles present the most relevant articles in terms of different metrics (e.g. the number of editors or pageviews) and specific content types (e.g. geolocated articles or women) from a language's cultural context or a country's cultural context.

The currently generated Top CCC articles lists are:

list of CCC articles with the most editors (Editors)
list of CCC articles ranked by featured article distinction, bytes, and references (weights: 0.8, 0.1 and 0.1 respectively) (Featured)
list of geolocated CCC articles with the most links coming from CCC
list of CCC articles with keywords in the title and the most bytes (Bytes)
list of CCC articles categorized in Wikidata as women with the most edits (Women)
list of CCC articles categorized in Wikidata as men with the most edits (Men)
list of CCC articles created during the first three years with the most edits (First 3Y.)
list of CCC articles created during the last year with the most edits (Last Y.)
list of CCC articles with the most pageviews during the last month (Pageviews)
list of CCC articles with the most edits in talk pages (Discussions)
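The weights quoted for the Featured list (0.8, 0.1 and 0.1) suggest a weighted ranking. The sketch below shows one way such a score could be combined. The text gives only the weights, so the normalization (dividing bytes and references by the maximum in the sample), the function name `featured_score`, and the sample articles are all assumptions made for illustration.

```python
# Weights as quoted in the text: featured distinction, bytes, references.
WEIGHTS = {"featured": 0.8, "bytes": 0.1, "references": 0.1}


def featured_score(article, max_bytes, max_references):
    """Combine the three features into one score (assumed max-normalization)."""
    featured = 1.0 if article["featured"] else 0.0
    size = article["bytes"] / max_bytes if max_bytes else 0.0
    refs = article["references"] / max_references if max_references else 0.0
    return (
        WEIGHTS["featured"] * featured
        + WEIGHTS["bytes"] * size
        + WEIGHTS["references"] * refs
    )


# Hypothetical articles, used only for illustration.
articles = [
    {"title": "A", "featured": True, "bytes": 20000, "references": 10},
    {"title": "B", "featured": False, "bytes": 80000, "references": 100},
    {"title": "C", "featured": True, "bytes": 40000, "references": 50},
]

max_bytes = max(a["bytes"] for a in articles)
max_references = max(a["references"] for a in articles)

ranked = sorted(
    articles,
    key=lambda a: featured_score(a, max_bytes, max_references),
    reverse=True,
)
print([a["title"] for a in ranked])  # → ['C', 'A', 'B']
```

With a weight of 0.8, the featured distinction dominates: the longest and best-referenced article (B) still ranks last because it is not featured.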

On this page, you can consult the list for a particular country or language CCC, generated on a monthly basis from the latest CCC dataset. You need to specify the list parameter (editors, featured, geolocated, keywords, women, men, created_first_three_years, created_last_year, pageviews, or discussions), the language target parameter (as target_lang with the language wikicode), and the language origin (as source_lang with the language wikicode). Optionally, to limit the scope of the selection, you can add the country origin parameter as part of the CCC (as source_country with the country ISO 3166 code). If no country is selected, the default is 'all'.

One possible URL with Top CCC list by number of editors, language origin Spanish, language target Italian and no country would be: https://wdo.wmcloud.org/top_ccc_articles/?list=editors&source_lang=es&target_lang=it

A similar list but limited to a specific country and to women would be:

https://wdo.wmcloud.org/top_ccc_articles/?target_lang=ca&source_country=de&source_lang=de&list=women
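Such URLs can also be assembled programmatically. The following is a minimal sketch using only the parameter names and values described above. The helper name `top_ccc_url` is hypothetical, and whether the site validates parameters the same way is an assumption.

```python
from urllib.parse import urlencode

BASE_URL = "https://wdo.wmcloud.org/top_ccc_articles/"

# Values of the "list" parameter as enumerated in the text.
VALID_LISTS = {
    "editors", "featured", "geolocated", "keywords", "women", "men",
    "created_first_three_years", "created_last_year", "pageviews",
    "discussions",
}


def top_ccc_url(list_name, source_lang, target_lang, source_country=None):
    """Build a Top CCC articles URL; the country is optional."""
    if list_name not in VALID_LISTS:
        raise ValueError(f"unknown list: {list_name!r}")
    params = {
        "list": list_name,
        "source_lang": source_lang,
        "target_lang": target_lang,
    }
    if source_country is not None:
        params["source_country"] = source_country
    return BASE_URL + "?" + urlencode(params)


print(top_ccc_url("editors", "es", "it"))
print(top_ccc_url("women", "de", "ca", source_country="de"))
```

The first call reproduces the Spanish-to-Italian example. The second produces the same parameters as the women-in-Germany example, in a different order, which should not matter for a query string.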

The generated table includes several metrics and shows the availability in the top right column with the current title (in case it exists) or one possible title generated by the Content Translation tool or by a Wikidata label.

Another way to browse the lists is by examining how well a language edition covers the other language editions' Top CCC articles lists (centered around countries, as Countries Top CCC article lists), or how well one particular language edition's Top CCC lists are spread across the rest of the language editions.

In this case, it is necessary to specify the language covering or spreading the lists with the lang parameter. This is an example using Catalan Wikipedia:

Languages Top CCC articles spread from Catalan Wikipedia.

https://wdo.wmcloud.org/languages_top_ccc_articles_spread/?lang=ca

Languages Top CCC articles coverage by Catalan Wikipedia.

https://wdo.wmcloud.org/languages_top_ccc_articles_coverage/?lang=ca

Countries Top CCC articles coverage by Catalan Wikipedia.

https://wdo.wmcloud.org/countries_top_ccc_articles_coverage/?lang=ca

Missing CCC articles

Normally, Wikipedia language editions tend to cover their own cultural context (from territories to all the cultural expressions) much better than others. However, in around 150 languages the cultural context content is below 10% of the total content, which is a sign that it is likely underrepresented. In these cases, it is very possible that larger Wikipedia language editions have articles that are missing from their CCC. Sometimes these larger editions are English, French, Russian, and Spanish, which are the languages that usually coexist with other languages that have a Wikipedia (only 48 Wikipedia language editions belong to languages that do not coexist with other languages in one territory).

In order to improve the representation of local content in these underdeveloped Wikipedias, we proposed the creation of a tool named "Missing CCC articles". This allows us to query articles that should exist in a language's CCC but have not been created yet and exist instead in other languages. Additionally, we can query articles from a language's CCC that are longer in another language edition.

It is possible to query any list by changing the URL parameters or by using the following menus. You first need to select the target language (where you would like to improve local content representation). Additionally, if you want to aim at a specific part of a language context, you can select the target country and target region; they are optional and allow you to filter for a specific area. For instance, for Target language French, whose language context encompasses several countries, Target country could be France, or Target region could be Québec.

One possible URL with a query for Luganda CCC about Uganda and Geolocated content that is found in any other language edition would be:

https://wdo.wmcloud.org/missing_ccc_articles/?target_lang=lg&ccc_segment=geolocated&target_country=ug&source_lang=none

Disclaimer: This tool is still at the Alpha phase and may contain some bugs. Your feedback can be useful.

Common CCC articles

Cultural Context Content is a selection of articles that relate to language-related territories, their people and customs. Even though the selection provides a defined group of articles, we must acknowledge that some articles may belong to more than one cultural context. Examples include a celebrity who was born in one country but spent most of her career in another, or a historical battle in which two or more armies took part. For this reason, we can say that the cultural context is a continuum.

In order to find the articles that are found in-between two cultural contexts and may belong to more than one selection, we have created a tool named "Common CCC articles". This allows us to search for articles common to two language editions’ cultural context content and the gaps in other language editions.

It is possible to query articles by adding source languages (the first one sets the reference CCC, and the others are used to filter the resulting list of articles depending on how related they are to it). The Target Languages parameter allows you to select the languages in which you want to check whether the resulting articles exist. You can also filter the results to show only the gaps in the target languages.

One possible URL with a query for Ukrainian and Polish cultural context content and its availability in English and Russian would be:

https://wdo.wmcloud.org/common_ccc_articles/?source_langs=uk%2Cpl&target_langs=en%2Cru
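In this URL the language lists are comma-separated, and each comma is percent-encoded as `%2C`. The sketch below decodes the example with Python's standard library and re-encodes it, purely to show how the list parameters map to the query string. The variable names are illustrative only.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

url = (
    "https://wdo.wmcloud.org/common_ccc_articles/"
    "?source_langs=uk%2Cpl&target_langs=en%2Cru"
)

# parse_qs decodes %2C back into a literal comma.
query = parse_qs(urlsplit(url).query)
source_langs = query["source_langs"][0].split(",")
target_langs = query["target_langs"][0].split(",")
print(source_langs, target_langs)  # → ['uk', 'pl'] ['en', 'ru']

# Re-encoding the joined lists yields the original query string.
rebuilt = urlencode({
    "source_langs": ",".join(source_langs),
    "target_langs": ",".join(target_langs),
})
print(rebuilt)  # → source_langs=uk%2Cpl&target_langs=en%2Cru
```

The first language in `source_langs` is the one that sets the reference CCC, so the order of that list is significant.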

Disclaimer: This tool is still at the Alpha phase and may contain some bugs or give undesired results. Your feedback can be useful.

Visual CCC articles

Usually, Cultural Context Content is much more developed than the rest of the articles in terms of the number of references, length, and the number of images. However, in some cases, the version of the article in another language contains valuable images. This tool allows you to check which images are most used for every article belonging to the Top CCC Diversity Lists or to any list you want to paste. You can query a list of articles and their images, choose the number of images you want to see, or see only those that are missing in the original article. For example, if we want to see images (only 4) of the Catalan Top CCC list "Women", and only those missing, we would use the following URL:

https://wdo.wmcloud.org/visual_ccc_articles/?show_only=gaps&images=4&order_by=none&source_lang_list=ca&list=women

Incomplete CCC articles

Usually, Cultural Context Content is much more developed than the rest of the articles in terms of the number of references, length, and the number of images. As said before, in some cases the articles are more complete in other languages. On this page, you can check whether the articles of a language edition, either entered manually or taken from a Top CCC list, are more complete in other language editions. In other words, you can compare each article's stats (number of bytes, number of references, number of images, number of outlinks, among others) in other languages, and then decide whether to expand these articles or not. You can also compare engagement characteristics (e.g. number of editors, number of edits or number of pageviews) or the 'featured article' distinction.

For example, if we want to see the articles from the Czech Top CCC list "Geolocated" that are more complete in other languages we would use the following URL:

https://wdo.wmcloud.org/incomplete_ccc_articles/?list=geolocated&source_country=none&source_lang_list=cs&limit=300

Search CCC articles

On this page, you can search for articles in a Wikipedia language edition and see their availability in other language editions. First, you need to select the Source Language from which you want to retrieve the content. Then you can choose the Type of query: List of articles, List of categories' articles, Wikidata SPARQL Query, or Wikipedia Content Search.

The List of articles query simply allows you to enter a list of articles (their titles or their URLs, separated by a comma, a semicolon or a line break) in the textbox in order to see their main stats and their availability in the Target Languages. The List of categories' articles allows you to enter a list of categories and retrieve the articles contained in them. The Wikidata SPARQL Query allows you to enter a query in the textbox and retrieve the articles related to the Qitems that appear in it (if the query does not contain any Qitem, only labels, there will be no results). The Wikipedia Content Search allows you to submit a query to Wikipedia's own search engine (CirrusSearch). For example, if you select Japanese as the Source Language and enter the query "Japanese Cuisine", you will obtain the articles from Japanese Wikipedia along with their main stats on relevance features (number of editors, edits, discussion edits, pageviews, etc.). When using the search option, you can set the Language of the query to specify which language you are querying in (e.g. Japanese cuisine could be "cuisine du Japon" in French), whether or not it is the same as the target language.
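The separators accepted by the List of articles textbox (comma, semicolon or line break) can be illustrated with a short parsing sketch. This is not the tool's own code: the function name `split_titles` and the sample input, including the example URL, are hypothetical.

```python
import re


def split_titles(raw):
    """Split textbox input on commas, semicolons, or line breaks."""
    parts = re.split(r"[,;\n]+", raw)
    # Drop surrounding whitespace and any empty fragments.
    return [part.strip() for part in parts if part.strip()]


# Hypothetical input mixing titles and a URL with all three separators.
raw_input_text = "Sushi, Ramen;Tempura\nhttps://ja.wikipedia.org/wiki/Miso"
titles = split_titles(raw_input_text)
print(titles)
```

One caveat of this simple approach is that a title that itself contains a comma or semicolon would be split in two, so pasting such an article by URL may be safer if the tool behaves similarly.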

For example, if we want to see articles on Japanese Cuisine in Japanese Wikipedia and their availability in Catalan, Spanish, French, and English Wikipedia, we would use the following URL:

https://wdo.wmcloud.org/search_ccc_articles/?target_langs=ca,es,fr,en&query_lang=None&order_by=none&query_type=search&limit=100&topic=&source_lang=ja&textbox=japanese+cuisine

More tools (work in progress)

Currently, we want to use the CCC datasets to monitor the gaps on a continual basis (showing the creation of articles for specific kinds of content, to see whether and where editors are really bridging the gap), along with many other lists, solutions, and improvements based on the feedback gathered at past Wikimedia events and from local communities (Figure 4). Likewise, we want to create a multilingual editors dashboard in which to find potential collaborators. Editors should be able to query lists or visualizations to see editors from other language editions according to their cultural context interests.